* null pointer reference after crash
@ 2017-08-28 17:23 Christian Theune
2017-08-28 17:42 ` Darrick J. Wong
0 siblings, 1 reply; 9+ messages in thread
From: Christian Theune @ 2017-08-28 17:23 UTC (permalink / raw)
To: linux-xfs
Hi,
we stumbled over this today: a host rebooted after an unrelated (IOMMU) kernel crash and then got completely stuck after this:
I’m currently running xfs_repair on all disks and will then see whether that resolves it; still, I guess you want to know about it. Kernel is 4.9.43 vanilla. Let me know if you need more data.
Aug 28 15:27:00 cartman09 kernel: [ 637.746484] IP: [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
Aug 28 15:27:00 cartman09 kernel: [ 637.758513] PGD f325ae067
Aug 28 15:27:00 cartman09 kernel: [ 637.763573] PUD 0
Aug 28 15:27:00 cartman09 kernel: [ 637.767593]
Aug 28 15:27:00 cartman09 kernel: [ 637.770576] Oops: 0000 [#1] SMP
Aug 28 15:27:00 cartman09 kernel: [ 637.776852] Modules linked in: nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_log_ipv6 nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack sch_fq x86_pkg_temp_thermal kvm_intel kvm ixgbe irqbypass nvme crc32c_intel mdio nvme_core acpi_cpufreq nbd nf_conntrack_ftp nf_conntrack dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log
Aug 28 15:27:00 cartman09 kernel: [ 637.868058] CPU: 1 PID: 10011 Comm: ceph-osd Not tainted 4.9.43 #1
Aug 28 15:27:00 cartman09 kernel: [ 637.880398] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
Aug 28 15:27:00 cartman09 kernel: [ 637.895168] task: ffff8805de20b900 task.stack: ffffc9002e470000
Aug 28 15:27:00 cartman09 kernel: [ 637.906989] RIP: 0010:[<ffffffff81312320>] [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
Aug 28 15:27:00 cartman09 kernel: [ 637.923869] RSP: 0018:ffffc9002e473cb8 EFLAGS: 00010282
Aug 28 15:27:00 cartman09 kernel: [ 637.934479] RAX: 0000000000000000 RBX: ffff88083fe0d220 RCX: 0000000000000001
Aug 28 15:27:00 cartman09 kernel: [ 637.948724] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc9002e473c70
Aug 28 15:27:00 cartman09 kernel: [ 637.962972] RBP: ffffc9002e473cd8 R08: 000000003a0d15aa R09: ffffc9002e473b50
Aug 28 15:27:00 cartman09 kernel: [ 637.977219] R10: fffffffffffffffe R11: 0000000000000001 R12: ffffc9002e473d08
Aug 28 15:27:00 cartman09 kernel: [ 637.991469] R13: ffff880658318f00 R14: 0000000000000003 R15: 000000003a0d15aa
Aug 28 15:27:00 cartman09 kernel: [ 638.005715] FS: 00007f37c7dd8700(0000) GS:ffff88085fa40000(0000) knlGS:0000000000000000
Aug 28 15:27:00 cartman09 kernel: [ 638.021871] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 28 15:27:00 cartman09 kernel: [ 638.033345] CR2: 00000000000000a0 CR3: 0000000f32506000 CR4: 00000000001406e0
Aug 28 15:27:00 cartman09 kernel: [ 638.047592] Stack:
Aug 28 15:27:00 cartman09 kernel: [ 638.051615] ffffffff81a44fe0 ffffc9002e473cd8 ffffc9002e473dd0 0000000000000003
Aug 28 15:27:00 cartman09 kernel: [ 638.066521] ffffc9002e473d48 ffffffff81337404 0000000300000008 ffff88076d846040
Aug 28 15:27:00 cartman09 kernel: [ 638.081426] 00000000d2fc5128 ffff880649f47d80 0000000000000000 0000000000000000
Aug 28 15:27:00 cartman09 kernel: [ 638.096331] Call Trace:
Aug 28 15:27:00 cartman09 kernel: [ 638.101226] [<ffffffff81337404>] xfs_attr3_node_inactive+0x174/0x210
Aug 28 15:27:00 cartman09 kernel: [ 638.114083] [<ffffffff8133744a>] xfs_attr3_node_inactive+0x1ba/0x210
Aug 28 15:27:00 cartman09 kernel: [ 638.126944] [<ffffffff813376da>] xfs_attr_inactive+0x23a/0x250
Aug 28 15:27:00 cartman09 kernel: [ 638.138767] [<ffffffff81350a4b>] xfs_inactive+0x7b/0x110
Aug 28 15:27:00 cartman09 kernel: [ 638.149547] [<ffffffff81359344>] xfs_fs_destroy_inode+0xa4/0x210
Aug 28 15:27:00 cartman09 kernel: [ 638.161714] [<ffffffff811c46cb>] destroy_inode+0x3b/0x60
Aug 28 15:27:00 cartman09 kernel: [ 638.172493] [<ffffffff811c4819>] evict+0x129/0x190
Aug 28 15:27:00 cartman09 kernel: [ 638.182238] [<ffffffff811c4c4a>] iput+0x19a/0x200
Aug 28 15:27:00 cartman09 kernel: [ 638.191805] [<ffffffff811b9129>] do_unlinkat+0x129/0x2d0
Aug 28 15:27:00 cartman09 kernel: [ 638.202584] [<ffffffff811b9d26>] SyS_unlink+0x16/0x20
Aug 28 15:27:00 cartman09 kernel: [ 638.212847] [<ffffffff81885260>] entry_SYSCALL_64_fastpath+0x13/0x94
Aug 28 15:27:00 cartman09 kernel: [ 638.225706] Code: 55 48 89 e5 41 54 53 4d 89 c4 48 89 fb 48 83 ec 10 48 c7 04 24 e0 4f a4 81 e8 fd fe ff ff 85 c0 75 46 48 85 db 74 41 49 8b 34 24 <48> 8b 96 a0 00 00 00 0f b7 52 08 66 c1 c2 08 66 81 fa be 3e 74
Aug 28 15:27:00 cartman09 kernel: [ 638.265605] RIP [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
Aug 28 15:27:00 cartman09 kernel: [ 638.277809] RSP <ffffc9002e473cb8>
Aug 28 15:27:00 cartman09 kernel: [ 638.284776] CR2: 00000000000000a0
Aug 28 15:27:00 cartman09 kernel: [ 638.291941] ---[ end trace 4dd737d8c717c6f3 ]---
This also led to more problems in the kernel, specifically:
Aug 28 15:57:36 cartman09 kernel: [ 2464.661772] swapper/0:
Aug 28 15:57:36 cartman09 kernel: [ 2464.662250] swapper/8:
Aug 28 15:57:36 cartman09 kernel: [ 2464.662253] page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
Aug 28 15:57:36 cartman09 kernel: [ 2464.662257] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G D 4.9.43 #1
Aug 28 15:57:36 cartman09 kernel: [ 2464.662258] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
Aug 28 15:57:36 cartman09 kernel: [ 2464.662259] ffff88107fa83ba8
Aug 28 15:57:36 cartman09 kernel: [ 2464.662260] ffffffff813ebcf8 ffffffff81c56c40 0000000000000000 ffff88107fa83c28
Aug 28 15:57:36 cartman09 kernel: [ 2464.662262] ffffffff8114835c 020800207fa83b01 ffffffff81c56c40 ffff88107fa83bd0
Aug 28 15:57:36 cartman09 kernel: [ 2464.662263] ffff881000000010 ffff88107fa83c38 ffff88107fa83be8Call Trace:
Aug 28 15:57:36 cartman09 kernel: [ 2464.662266] <IRQ>
Aug 28 15:57:36 cartman09 kernel: [ 2464.662273] [<ffffffff813ebcf8>] dump_stack+0x4d/0x65
Aug 28 15:57:36 cartman09 kernel: [ 2464.662279] [<ffffffff8114835c>] warn_alloc+0x11c/0x140
Aug 28 15:57:36 cartman09 kernel: [ 2464.662281] [<ffffffff8114862d>] __alloc_pages_slowpath+0x23d/0xb70
Aug 28 15:57:36 cartman09 kernel: [ 2464.662285] [<ffffffff814f0d33>] ? dma_pte_clear_level+0x113/0x190
Aug 28 15:57:36 cartman09 kernel: [ 2464.662288] [<ffffffff81149112>] __alloc_pages_nodemask+0x1b2/0x240
Aug 28 15:57:36 cartman09 kernel: [ 2464.662290] [<ffffffff81149312>] __alloc_page_frag+0x172/0x1a0
Aug 28 15:57:36 cartman09 kernel: [ 2464.662293] [<ffffffff8172da86>] __napi_alloc_skb+0x86/0xd0
Aug 28 15:57:36 cartman09 kernel: [ 2464.662303] [<ffffffffa05b3558>] ixgbe_clean_rx_irq+0xf8/0x950 [ixgbe]
Aug 28 15:57:36 cartman09 kernel: [ 2464.662306] [<ffffffffa05b4a5d>] ixgbe_poll+0x3cd/0x780 [ixgbe]
Aug 28 15:57:36 cartman09 kernel: [ 2464.662309] [<ffffffff8173c963>] net_rx_action+0x203/0x350
Aug 28 15:57:36 cartman09 kernel: [ 2464.662314] [<ffffffff81887957>] __do_softirq+0xe7/0x256
Aug 28 15:57:36 cartman09 kernel: [ 2464.662317] [<ffffffff810637ba>] irq_exit+0x9a/0xa0
Aug 28 15:57:36 cartman09 kernel: [ 2464.662319] [<ffffffff818876c4>] do_IRQ+0x54/0xd0
Aug 28 15:57:36 cartman09 kernel: [ 2464.662321] [<ffffffff81885bbf>] common_interrupt+0x7f/0x7f
Aug 28 15:57:36 cartman09 kernel: [ 2464.662322] <EOI>
Aug 28 15:57:36 cartman09 kernel: [ 2464.662327] [<ffffffff816f060f>] ? cpuidle_enter_state+0x10f/0x250
Aug 28 15:57:36 cartman09 kernel: [ 2464.662328] [<ffffffff816f0787>] cpuidle_enter+0x17/0x20
Aug 28 15:57:36 cartman09 kernel: [ 2464.662332] [<ffffffff8109ad43>] call_cpuidle+0x23/0x40
Aug 28 15:57:36 cartman09 kernel: [ 2464.662334] [<ffffffff8109af61>] cpu_startup_entry+0x101/0x1d0
Aug 28 15:57:36 cartman09 kernel: [ 2464.662338] [<ffffffff8103ef78>] start_secondary+0xe8/0xf0
Aug 28 15:57:36 cartman09 kernel: [ 2464.662339] Mem-Info:
Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] active_anon:2251029 inactive_anon:358089 isolated_anon:0
Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] active_file:551559 inactive_file:11813633 isolated_file:64
Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] unevictable:540 dirty:15786 writeback:968 unstable:0
Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] slab_reclaimable:904481 slab_unreclaimable:150538
Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] mapped:308954 shmem:309 pagetables:13475 bounce:0
Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] free:35590 free_pcp:5360 free_cma:0
The system became mostly unusable (with load going above 1000) after that, and I hard-rebooted with services disabled for further diagnostics. xfs_repair seems to finish without finding any major issues.
Cheers,
Christian
--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
* Re: null pointer reference after crash
2017-08-28 17:23 null pointer reference after crash Christian Theune
@ 2017-08-28 17:42 ` Darrick J. Wong
2017-08-28 19:00 ` Christian Theune
0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2017-08-28 17:42 UTC (permalink / raw)
To: Christian Theune; +Cc: linux-xfs
On Mon, Aug 28, 2017 at 07:23:19PM +0200, Christian Theune wrote:
> Hi,
>
> we stumbled over this today as a host rebooted with an unrelated (iommu)
> kernel crash and got completely stuck after this:
>
> I’m currently running xfs_repair on all disks and will then see whether this
> will resolve, still I guess you want to know about it. Kernel is 4.9.43
> vanilla. Let me know if you need more data.
Does commit cd87d8679201 ("xfs: don't crash on unexpected holes in dir/attr
btrees") fix this problem? It'll be in 4.13, maybe someone can backport it
to 4.9?
(Assuming you can get it to reproduce reliably?)
((Wish I could figure out how we end up with a corrupt-looking xattr tree
in the first place...))
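
For readers following along: in the oops quoted below, RSI is 0 and CR2 is 0xa0, and the marked instruction in the Code: line (48 8b 96 a0 00 00 00) appears to load from offset 0xa0 off RSI, which would be consistent with a NULL buffer pointer being dereferenced. Going by the commit subject, the assumption here is that a read can report success without handing back a buffer when it lands on a hole. The following is only a minimal userspace sketch of that pattern and of the defensive check; it is not the kernel patch, and every name in it (demo_buf, demo_read_block, demo_node_magic, DEMO_EIO, DEMO_ECORRUPT) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error codes for this sketch; not the kernel's values. */
enum { DEMO_EIO = 1, DEMO_ECORRUPT = 2 };

/* Hypothetical stand-in for a metadata buffer; not a kernel type. */
struct demo_buf {
    int magic;
};

/*
 * Hypothetical reader. Returns a negative code for an out-of-range block.
 * For a hole (a NULL slot) it returns 0 but leaves *bpp set to NULL, which
 * is the "success without a buffer" case assumed in the text above.
 */
int demo_read_block(struct demo_buf *const *blocks, size_t nblocks,
                    size_t blkno, struct demo_buf **bpp)
{
    assert(bpp != NULL);
    *bpp = NULL;
    if (blkno >= nblocks)
        return -DEMO_EIO;
    *bpp = blocks[blkno];   /* may be NULL: models a hole */
    return 0;
}

/*
 * Caller that checks for "no error, but no buffer" and reports corruption
 * instead of dereferencing the NULL pointer.
 */
int demo_node_magic(struct demo_buf *const *blocks, size_t nblocks,
                    size_t blkno, int *magic)
{
    struct demo_buf *bp;
    int error = demo_read_block(blocks, nblocks, blkno, &bp);

    if (error)
        return error;
    if (bp == NULL)
        return -DEMO_ECORRUPT;
    *magic = bp->magic;
    return 0;
}
```

Without the bp == NULL check, the second function would read bp->magic through a NULL pointer, which is the kind of access the register dump points at.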
--D
>
> Aug 28 15:27:00 cartman09 kernel: [ 637.746484] IP: [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
> Aug 28 15:27:00 cartman09 kernel: [ 637.758513] PGD f325ae067
> Aug 28 15:27:00 cartman09 kernel: [ 637.763573] PUD 0
> Aug 28 15:27:00 cartman09 kernel: [ 637.767593]
> Aug 28 15:27:00 cartman09 kernel: [ 637.770576] Oops: 0000 [#1] SMP
> Aug 28 15:27:00 cartman09 kernel: [ 637.776852] Modules linked in: nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_log_ipv6 nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack sch_fq x86_pkg_temp_thermal kvm_intel kvm ixgbe irqbypass nvme crc32c_intel mdio nvme_core acpi_cpufreq nbd nf_conntrack_ftp nf_conntrack dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log
> Aug 28 15:27:00 cartman09 kernel: [ 637.868058] CPU: 1 PID: 10011 Comm: ceph-osd Not tainted 4.9.43 #1
> Aug 28 15:27:00 cartman09 kernel: [ 637.880398] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
> Aug 28 15:27:00 cartman09 kernel: [ 637.895168] task: ffff8805de20b900 task.stack: ffffc9002e470000
> Aug 28 15:27:00 cartman09 kernel: [ 637.906989] RIP: 0010:[<ffffffff81312320>] [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
> Aug 28 15:27:00 cartman09 kernel: [ 637.923869] RSP: 0018:ffffc9002e473cb8 EFLAGS: 00010282
> Aug 28 15:27:00 cartman09 kernel: [ 637.934479] RAX: 0000000000000000 RBX: ffff88083fe0d220 RCX: 0000000000000001
> Aug 28 15:27:00 cartman09 kernel: [ 637.948724] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc9002e473c70
> Aug 28 15:27:00 cartman09 kernel: [ 637.962972] RBP: ffffc9002e473cd8 R08: 000000003a0d15aa R09: ffffc9002e473b50
> Aug 28 15:27:00 cartman09 kernel: [ 637.977219] R10: fffffffffffffffe R11: 0000000000000001 R12: ffffc9002e473d08
> Aug 28 15:27:00 cartman09 kernel: [ 637.991469] R13: ffff880658318f00 R14: 0000000000000003 R15: 000000003a0d15aa
> Aug 28 15:27:00 cartman09 kernel: [ 638.005715] FS: 00007f37c7dd8700(0000) GS:ffff88085fa40000(0000) knlGS:0000000000000000
> Aug 28 15:27:00 cartman09 kernel: [ 638.021871] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 28 15:27:00 cartman09 kernel: [ 638.033345] CR2: 00000000000000a0 CR3: 0000000f32506000 CR4: 00000000001406e0
> Aug 28 15:27:00 cartman09 kernel: [ 638.047592] Stack:
> Aug 28 15:27:00 cartman09 kernel: [ 638.051615] ffffffff81a44fe0 ffffc9002e473cd8 ffffc9002e473dd0 0000000000000003
> Aug 28 15:27:00 cartman09 kernel: [ 638.066521] ffffc9002e473d48 ffffffff81337404 0000000300000008 ffff88076d846040
> Aug 28 15:27:00 cartman09 kernel: [ 638.081426] 00000000d2fc5128 ffff880649f47d80 0000000000000000 0000000000000000
> Aug 28 15:27:00 cartman09 kernel: [ 638.096331] Call Trace:
> Aug 28 15:27:00 cartman09 kernel: [ 638.101226] [<ffffffff81337404>] xfs_attr3_node_inactive+0x174/0x210
> Aug 28 15:27:00 cartman09 kernel: [ 638.114083] [<ffffffff8133744a>] xfs_attr3_node_inactive+0x1ba/0x210
> Aug 28 15:27:00 cartman09 kernel: [ 638.126944] [<ffffffff813376da>] xfs_attr_inactive+0x23a/0x250
> Aug 28 15:27:00 cartman09 kernel: [ 638.138767] [<ffffffff81350a4b>] xfs_inactive+0x7b/0x110
> Aug 28 15:27:00 cartman09 kernel: [ 638.149547] [<ffffffff81359344>] xfs_fs_destroy_inode+0xa4/0x210
> Aug 28 15:27:00 cartman09 kernel: [ 638.161714] [<ffffffff811c46cb>] destroy_inode+0x3b/0x60
> Aug 28 15:27:00 cartman09 kernel: [ 638.172493] [<ffffffff811c4819>] evict+0x129/0x190
> Aug 28 15:27:00 cartman09 kernel: [ 638.182238] [<ffffffff811c4c4a>] iput+0x19a/0x200
> Aug 28 15:27:00 cartman09 kernel: [ 638.191805] [<ffffffff811b9129>] do_unlinkat+0x129/0x2d0
> Aug 28 15:27:00 cartman09 kernel: [ 638.202584] [<ffffffff811b9d26>] SyS_unlink+0x16/0x20
> Aug 28 15:27:00 cartman09 kernel: [ 638.212847] [<ffffffff81885260>] entry_SYSCALL_64_fastpath+0x13/0x94
> Aug 28 15:27:00 cartman09 kernel: [ 638.225706] Code: 55 48 89 e5 41 54 53 4d 89 c4 48 89 fb 48 83 ec 10 48 c7 04 24 e0 4f a4 81 e8 fd fe ff ff 85 c0 75 46 48 85 db 74 41 49 8b 34 24 <48> 8b 96 a0 00 00 00 0f b7 52 08 66 c1 c2 08 66 81 fa be 3e 74
> Aug 28 15:27:00 cartman09 kernel: [ 638.265605] RIP [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
> Aug 28 15:27:00 cartman09 kernel: [ 638.277809] RSP <ffffc9002e473cb8>
> Aug 28 15:27:00 cartman09 kernel: [ 638.284776] CR2: 00000000000000a0
> Aug 28 15:27:00 cartman09 kernel: [ 638.291941] ---[ end trace 4dd737d8c717c6f3 ]---
>
> This also led to more problems in the kernel, specifically:
>
> Aug 28 15:57:36 cartman09 kernel: [ 2464.661772] swapper/0:
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662250] swapper/8:
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662253] page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662257] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G D 4.9.43 #1
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662258] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662259] ffff88107fa83ba8
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662260] ffffffff813ebcf8 ffffffff81c56c40 0000000000000000 ffff88107fa83c28
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662262] ffffffff8114835c 020800207fa83b01 ffffffff81c56c40 ffff88107fa83bd0
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662263] ffff881000000010 ffff88107fa83c38 ffff88107fa83be8Call Trace:
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662266] <IRQ>
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662273] [<ffffffff813ebcf8>] dump_stack+0x4d/0x65
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662279] [<ffffffff8114835c>] warn_alloc+0x11c/0x140
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662281] [<ffffffff8114862d>] __alloc_pages_slowpath+0x23d/0xb70
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662285] [<ffffffff814f0d33>] ? dma_pte_clear_level+0x113/0x190
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662288] [<ffffffff81149112>] __alloc_pages_nodemask+0x1b2/0x240
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662290] [<ffffffff81149312>] __alloc_page_frag+0x172/0x1a0
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662293] [<ffffffff8172da86>] __napi_alloc_skb+0x86/0xd0
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662303] [<ffffffffa05b3558>] ixgbe_clean_rx_irq+0xf8/0x950 [ixgbe]
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662306] [<ffffffffa05b4a5d>] ixgbe_poll+0x3cd/0x780 [ixgbe]
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662309] [<ffffffff8173c963>] net_rx_action+0x203/0x350
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662314] [<ffffffff81887957>] __do_softirq+0xe7/0x256
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662317] [<ffffffff810637ba>] irq_exit+0x9a/0xa0
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662319] [<ffffffff818876c4>] do_IRQ+0x54/0xd0
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662321] [<ffffffff81885bbf>] common_interrupt+0x7f/0x7f
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662322] <EOI>
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662327] [<ffffffff816f060f>] ? cpuidle_enter_state+0x10f/0x250
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662328] [<ffffffff816f0787>] cpuidle_enter+0x17/0x20
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662332] [<ffffffff8109ad43>] call_cpuidle+0x23/0x40
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662334] [<ffffffff8109af61>] cpu_startup_entry+0x101/0x1d0
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662338] [<ffffffff8103ef78>] start_secondary+0xe8/0xf0
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662339] Mem-Info:
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] active_anon:2251029 inactive_anon:358089 isolated_anon:0
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] active_file:551559 inactive_file:11813633 isolated_file:64
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] unevictable:540 dirty:15786 writeback:968 unstable:0
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] slab_reclaimable:904481 slab_unreclaimable:150538
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] mapped:308954 shmem:309 pagetables:13475 bounce:0
> Aug 28 15:57:36 cartman09 kernel: [ 2464.662346] free:35590 free_pcp:5360 free_cma:0
>
> The system became mostly unusable (with load going into 1000+) after that and I hard-rebooted with disabled services for further diagnostics. xfs_repair seems to finish without finding any major issues.
>
> Cheers,
> Christian
>
> --
> Christian Theune · ct@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · http://flyingcircus.io
> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
>
* Re: null pointer reference after crash
2017-08-28 17:42 ` Darrick J. Wong
@ 2017-08-28 19:00 ` Christian Theune
2017-08-30 13:56 ` Christian Theune
0 siblings, 1 reply; 9+ messages in thread
From: Christian Theune @ 2017-08-28 19:00 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Hi,
> On Aug 28, 2017, at 7:42 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Mon, Aug 28, 2017 at 07:23:19PM +0200, Christian Theune wrote:
>> Hi,
>>
>> we stumbled over this today as a host rebooted with an unrelated (iommu)
>> kernel crash and got completely stuck after this:
>>
>> I’m currently running xfs_repair on all disks and will then see whether this
>> will resolve, still I guess you want to know about it. Kernel is 4.9.43
>> vanilla. Let me know if you need more data.
>
> Does commit cd87d8679201 ("xfs: don't crash on unexpected holes in dir/attr
> btrees") fix this problem? It'll be in 4.13, maybe someone can backport it
> to 4.9?
Thanks for the suggestion. I’ll keep that in mind in case I see this again.
> (Assuming you can get it to reproduce reliably?)
I have only seen it once today and hopefully won’t see it again. We have had some storage servers that run multiple SSD and HDD disks (for Ceph) crash multiple times a week lately due to the IOMMU issues that resulted in hardware watchdog reboots, so I guess those XFS filesystems did have quite some noise in them.
Not sure I can do anything to reproduce it at all. *fingers crossed*
Christian
* Re: null pointer reference after crash
2017-08-28 19:00 ` Christian Theune
@ 2017-08-30 13:56 ` Christian Theune
2017-08-30 15:58 ` Darrick J. Wong
0 siblings, 1 reply; 9+ messages in thread
From: Christian Theune @ 2017-08-30 13:56 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Hi,
just got it again on a different call path, maybe that helps:
[ 1070.136303] Oops: 0000 [#1] SMP
[ 1070.142577] Modules linked in: nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_log_ipv6 nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack sch_fq x86_pkg_temp_thermal kvm_intel kvm irqbypass nvme crc32c_intel ixgbe nvme_core mdio acpi_cpufreq nbd nf_conntrack_ftp nf_conntrack dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log
[ 1070.233784] CPU: 19 PID: 7460 Comm: ceph-osd Not tainted 4.9.43 #1
[ 1070.246124] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
[ 1070.260895] task: ffff8810517d0000 task.stack: ffffc9002abec000
[ 1070.272710] RIP: 0010:[<ffffffff81312320>] [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
[ 1070.289592] RSP: 0018:ffffc9002abefd28 EFLAGS: 00010286
[ 1070.300199] RAX: 0000000000000000 RBX: ffff88104d859a48 RCX: 0000000000000001
[ 1070.314447] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc9002abefce0
[ 1070.328694] RBP: ffffc9002abefd48 R08: 0000000066656566 R09: ffffc9002abefbc0
[ 1070.342942] R10: fffffffffffffffe R11: 0000000000000001 R12: ffffc9002abefd78
[ 1070.357191] R13: ffff88066b430780 R14: 0000000000000005 R15: 0000000066656566
[ 1070.371436] FS: 00007fe511bfc700(0000) GS:ffff88107fbc0000(0000) knlGS:0000000000000000
[ 1070.387590] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1070.399066] CR2: 00000000000000a0 CR3: 0000000f14d50000 CR4: 00000000001406e0
[ 1070.413311] Stack:
[ 1070.417332] ffffffff81a44fe0 ffffc9002abefd48 ffffc9002abefdd0 0000000000000005
[ 1070.432239] ffffc9002abefdb8 ffffffff81337404 0000000200000008 ffff8809b5cab040
[ 1070.447144] 000000005e94ce38 ffff880c25e1c600 0000000000000000 0000000000000000
[ 1070.462051] Call Trace:
[ 1070.466949] [<ffffffff81337404>] xfs_attr3_node_inactive+0x174/0x210
[ 1070.479802] [<ffffffff813376da>] xfs_attr_inactive+0x23a/0x250
[ 1070.491625] [<ffffffff81350a4b>] xfs_inactive+0x7b/0x110
[ 1070.502403] [<ffffffff81359344>] xfs_fs_destroy_inode+0xa4/0x210
[ 1070.514573] [<ffffffff811c46cb>] destroy_inode+0x3b/0x60
[ 1070.525352] [<ffffffff811c4819>] evict+0x129/0x190
[ 1070.535093] [<ffffffff811c4c4a>] iput+0x19a/0x200
[ 1070.544660] [<ffffffff811b9129>] do_unlinkat+0x129/0x2d0
[ 1070.555445] [<ffffffff811b9d26>] SyS_unlink+0x16/0x20
[ 1070.565706] [<ffffffff81885260>] entry_SYSCALL_64_fastpath+0x13/0x94
[ 1070.578562] Code: 55 48 89 e5 41 54 53 4d 89 c4 48 89 fb 48 83 ec 10 48 c7 04 24 e0 4f a4 81 e8 fd fe ff ff 85 c0 75 46 48 85 db 74 41 49 8b 34 24 <48> 8b 96 a0 00 00 00 0f b7 52 08 66 c1 c2 08 66 81 fa be 3e 74
[ 1070.618459] RIP [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
[ 1070.630663] RSP <ffffc9002abefd28>
[ 1070.637630] CR2: 00000000000000a0
[ 1070.644858] ---[ end trace bc2d3667eef00f69 ]---
So far the system doesn’t show the same follow-on problems, and the other filesystems are still functioning. I’ll run xfs_repair later today on all filesystems for good measure.
Christian
> On Aug 28, 2017, at 9:00 PM, Christian Theune <ct@flyingcircus.io> wrote:
>
> Hi,
>
>> On Aug 28, 2017, at 7:42 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>>
>> On Mon, Aug 28, 2017 at 07:23:19PM +0200, Christian Theune wrote:
>>> Hi,
>>>
>>> we stumbled over this today as a host rebooted with an unrelated (iommu)
>>> kernel crash and got completely stuck after this:
>>>
>>> I’m currently running xfs_repair on all disks and will then see whether this
>>> will resolve, still I guess you want to know about it. Kernel is 4.9.43
>>> vanilla. Let me know if you need more data.
>>
>> Does commit cd87d8679201 ("xfs: don't crash on unexpected holes in dir/attr
>> btrees") fix this problem? It'll be in 4.13, maybe someone can backport it
>> to 4.9?
>
> Thanks for the suggestion. I’ll keep that in mind in case I see this again.
>
>> (Assuming you can get it to reproduce reliably?)
>
> I have only seen it once today and hopefully won’t see it again. We have had some storage servers that run multiple SSD and HDD disks (for Ceph) crash multiple times a week lately due to the IOMMU issues that resulted in hardware watchdog reboots, so I guess those XFS filesystems did have quite some noise in them.
>
> Not sure I can do anything to reproduce it at all. *fingers crossed*
>
> Christian
>
> --
> Christian Theune · ct@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · http://flyingcircus.io
> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
Kind regards,
Christian Theune
* Re: null pointer reference after crash
2017-08-30 13:56 ` Christian Theune
@ 2017-08-30 15:58 ` Darrick J. Wong
2017-08-30 19:03 ` Christian Theune
0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2017-08-30 15:58 UTC (permalink / raw)
To: Christian Theune; +Cc: linux-xfs
On Wed, Aug 30, 2017 at 03:56:05PM +0200, Christian Theune wrote:
> Hi,
>
> just got it again on a different call path, maybe that helps:
>
> [ 1070.136303] Oops: 0000 [#1] SMP
> [ 1070.142577] Modules linked in: nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_log_ipv6 nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack sch_fq x86_pkg_temp_thermal kvm_intel kvm irqbypass nvme crc32c_intel ixgbe nvme_core mdio acpi_cpufreq nbd nf_conntrack_ftp nf_conntrack dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log
> [ 1070.233784] CPU: 19 PID: 7460 Comm: ceph-osd Not tainted 4.9.43 #1
> [ 1070.246124] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
> [ 1070.260895] task: ffff8810517d0000 task.stack: ffffc9002abec000
> [ 1070.272710] RIP: 0010:[<ffffffff81312320>] [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
> [ 1070.289592] RSP: 0018:ffffc9002abefd28 EFLAGS: 00010286
> [ 1070.300199] RAX: 0000000000000000 RBX: ffff88104d859a48 RCX: 0000000000000001
> [ 1070.314447] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc9002abefce0
> [ 1070.328694] RBP: ffffc9002abefd48 R08: 0000000066656566 R09: ffffc9002abefbc0
> [ 1070.342942] R10: fffffffffffffffe R11: 0000000000000001 R12: ffffc9002abefd78
> [ 1070.357191] R13: ffff88066b430780 R14: 0000000000000005 R15: 0000000066656566
> [ 1070.371436] FS: 00007fe511bfc700(0000) GS:ffff88107fbc0000(0000) knlGS:0000000000000000
> [ 1070.387590] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1070.399066] CR2: 00000000000000a0 CR3: 0000000f14d50000 CR4: 00000000001406e0
> [ 1070.413311] Stack:
> [ 1070.417332] ffffffff81a44fe0 ffffc9002abefd48 ffffc9002abefdd0 0000000000000005
> [ 1070.432239] ffffc9002abefdb8 ffffffff81337404 0000000200000008 ffff8809b5cab040
> [ 1070.447144] 000000005e94ce38 ffff880c25e1c600 0000000000000000 0000000000000000
> [ 1070.462051] Call Trace:
> [ 1070.466949] [<ffffffff81337404>] xfs_attr3_node_inactive+0x174/0x210
> [ 1070.479802] [<ffffffff813376da>] xfs_attr_inactive+0x23a/0x250
> [ 1070.491625] [<ffffffff81350a4b>] xfs_inactive+0x7b/0x110
> [ 1070.502403] [<ffffffff81359344>] xfs_fs_destroy_inode+0xa4/0x210
> [ 1070.514573] [<ffffffff811c46cb>] destroy_inode+0x3b/0x60
> [ 1070.525352] [<ffffffff811c4819>] evict+0x129/0x190
> [ 1070.535093] [<ffffffff811c4c4a>] iput+0x19a/0x200
> [ 1070.544660] [<ffffffff811b9129>] do_unlinkat+0x129/0x2d0
> [ 1070.555445] [<ffffffff811b9d26>] SyS_unlink+0x16/0x20
> [ 1070.565706] [<ffffffff81885260>] entry_SYSCALL_64_fastpath+0x13/0x94
This looks like the same call stack as last time.
Is this with a patched 4.9.43 kernel, or just vanilla?
--D
> [ 1070.578562] Code: 55 48 89 e5 41 54 53 4d 89 c4 48 89 fb 48 83 ec 10 48 c7 04 24 e0 4f a4 81 e8 fd fe ff ff 85 c0 75 46 48 85 db 74 41 49 8b 34 24 <48> 8b 96 a0 00 00 00 0f b7 52 08 66 c1 c2 08 66 81 fa be 3e 74
> [ 1070.618459] RIP [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
> [ 1070.630663] RSP <ffffc9002abefd28>
> [ 1070.637630] CR2: 00000000000000a0
> [ 1070.644858] ---[ end trace bc2d3667eef00f69 ]---
>
> As of now the system doesn’t have the same following issues and the other FS’s are still functioning. I’ll run xfs_repair later today on all filesystems for good measure.
>
> Christian
>
> > On Aug 28, 2017, at 9:00 PM, Christian Theune <ct@flyingcircus.io> wrote:
> >
> > Hi,
> >
> >> On Aug 28, 2017, at 7:42 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >>
> >> On Mon, Aug 28, 2017 at 07:23:19PM +0200, Christian Theune wrote:
> >>> Hi,
> >>>
> >>> we stumbled over this today as a host rebooted with an unrelated (iommu)
> >>> kernel crash and got completely stuck after this:
> >>>
> >>> I’m currently running xfs_repair on all disks and will then see whether this
> >>> will resolve, still I guess you want to know about it. Kernel is 4.9.43
> >>> vanilla. Let me know if you need more data.
> >>
> >> Does commit cd87d8679201 ("xfs: don't crash on unexpected holes in dir/attr
> >> btrees") fix this problem? It'll be in 4.13, maybe someone can backport it
> >> to 4.9?
> >
> > Thanks for the suggestion. I’ll keep that in mind in case I see this again.
> >
> >> (Assuming you can get it to reproduce reliably?)
> >
> > I have only seen it once today and hopefully won’t see it again. We have had some storage servers that run multiple SSD and HDD disks (for Ceph) crash multiple times a week lately due to the IOMMU issues that resulted in hardware watchdog reboots, so I guess those XFS filesystems did have quite some noise in them.
> >
> > Not sure I can do anything to reproduce it at all. *fingers crossed*
> >
> > Christian
> >
> > --
> > Christian Theune · ct@flyingcircus.io · +49 345 219401 0
> > Flying Circus Internet Operations GmbH · http://flyingcircus.io
> > Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> > HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
>
> Kind regards,
> Christian Theune
>
> --
> Christian Theune · ct@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · http://flyingcircus.io
> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
>
* Re: null pointer reference after crash
2017-08-30 15:58 ` Darrick J. Wong
@ 2017-08-30 19:03 ` Christian Theune
2017-09-01 20:53 ` Christian Theune
0 siblings, 1 reply; 9+ messages in thread
From: Christian Theune @ 2017-08-30 19:03 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Hi,
> On Aug 30, 2017, at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Wed, Aug 30, 2017 at 03:56:05PM +0200, Christian Theune wrote:
>> Hi,
>>
>> just got it again on a different call path, maybe that helps:
>>
>> [ 1070.136303] Oops: 0000 [#1] SMP
>> [ 1070.142577] Modules linked in: nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_log_ipv6 nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack sch_fq x86_pkg_temp_thermal kvm_intel kvm irqbypass nvme crc32c_intel ixgbe nvme_core mdio acpi_cpufreq nbd nf_conntrack_ftp nf_conntrack dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log
>> [ 1070.233784] CPU: 19 PID: 7460 Comm: ceph-osd Not tainted 4.9.43 #1
>> [ 1070.246124] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
>> [ 1070.260895] task: ffff8810517d0000 task.stack: ffffc9002abec000
>> [ 1070.272710] RIP: 0010:[<ffffffff81312320>] [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
>> [ 1070.289592] RSP: 0018:ffffc9002abefd28 EFLAGS: 00010286
>> [ 1070.300199] RAX: 0000000000000000 RBX: ffff88104d859a48 RCX: 0000000000000001
>> [ 1070.314447] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc9002abefce0
>> [ 1070.328694] RBP: ffffc9002abefd48 R08: 0000000066656566 R09: ffffc9002abefbc0
>> [ 1070.342942] R10: fffffffffffffffe R11: 0000000000000001 R12: ffffc9002abefd78
>> [ 1070.357191] R13: ffff88066b430780 R14: 0000000000000005 R15: 0000000066656566
>> [ 1070.371436] FS: 00007fe511bfc700(0000) GS:ffff88107fbc0000(0000) knlGS:0000000000000000
>> [ 1070.387590] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1070.399066] CR2: 00000000000000a0 CR3: 0000000f14d50000 CR4: 00000000001406e0
>> [ 1070.413311] Stack:
>> [ 1070.417332] ffffffff81a44fe0 ffffc9002abefd48 ffffc9002abefdd0 0000000000000005
>> [ 1070.432239] ffffc9002abefdb8 ffffffff81337404 0000000200000008 ffff8809b5cab040
>> [ 1070.447144] 000000005e94ce38 ffff880c25e1c600 0000000000000000 0000000000000000
>> [ 1070.462051] Call Trace:
>> [ 1070.466949] [<ffffffff81337404>] xfs_attr3_node_inactive+0x174/0x210
>> [ 1070.479802] [<ffffffff813376da>] xfs_attr_inactive+0x23a/0x250
There’s a subtle difference: xfs_inactive is calling xfs_attr_inactive. That wasn’t in there the last time. As I don’t know the internals, it might also be irrelevant, and you may have correctly filtered it out, whereas it looked possibly important to me. :)
>> [ 1070.491625] [<ffffffff81350a4b>] xfs_inactive+0x7b/0x110
>> [ 1070.502403] [<ffffffff81359344>] xfs_fs_destroy_inode+0xa4/0x210
>> [ 1070.514573] [<ffffffff811c46cb>] destroy_inode+0x3b/0x60
>> [ 1070.525352] [<ffffffff811c4819>] evict+0x129/0x190
>> [ 1070.535093] [<ffffffff811c4c4a>] iput+0x19a/0x200
>> [ 1070.544660] [<ffffffff811b9129>] do_unlinkat+0x129/0x2d0
>> [ 1070.555445] [<ffffffff811b9d26>] SyS_unlink+0x16/0x20
>> [ 1070.565706] [<ffffffff81885260>] entry_SYSCALL_64_fastpath+0x13/0x94
>
> This looks like the same call stack as last time.
>
> Is this with a patched 4.9.43 kernel, or just vanilla?
Just vanilla. I didn’t have time to do any patching; also, this is in production, and individual hosts currently take up to 14 days before crashing.
The system has been up for 6 hours now, and aside from the one defective filesystem the other mounted filesystems are performing OK.
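Whether two oopses took the same call path is easiest to judge with addresses and offsets stripped away. A minimal sketch in Python, assuming the `[<address>] symbol+0xoff/0xlen` entry format seen in the trace above; `trace_symbols` and `TRACE_RE` are names made up for this illustration, and the sample is only an excerpt of the quoted trace.

```python
import re

# Matches kernel 4.9-style call trace entries such as
# "[<ffffffff81337404>] xfs_attr3_node_inactive+0x174/0x210".
TRACE_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def trace_symbols(log_text):
    """Return the function names of all call trace entries, in order."""
    return TRACE_RE.findall(log_text)


# Excerpt of the trace quoted above (addresses and offsets as posted).
oops = """
[ 1070.466949] [<ffffffff81337404>] xfs_attr3_node_inactive+0x174/0x210
[ 1070.479802] [<ffffffff813376da>] xfs_attr_inactive+0x23a/0x250
[ 1070.491625] [<ffffffff81350a4b>] xfs_inactive+0x7b/0x110
[ 1070.502403] [<ffffffff81359344>] xfs_fs_destroy_inode+0xa4/0x210
"""

print(trace_symbols(oops))
```

Comparing the lists from two reports with `==` then shows whether the paths match, independent of timestamps and load addresses.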
Christian
PS: Shouldn’t you be offline? :)
--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
* Re: null pointer reference after crash
2017-08-30 19:03 ` Christian Theune
@ 2017-09-01 20:53 ` Christian Theune
2017-09-01 21:03 ` Christian Theune
0 siblings, 1 reply; 9+ messages in thread
From: Christian Theune @ 2017-09-01 20:53 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Hi,
got it again today: this time with a filesystem that just seconds before saw a (clean) xfs_repair. Also, another Ceph user stumbled over this today:
https://www.spinics.net/lists/ceph-users/msg36628.html
Here’s my dump of today - it’s identical to the last one, so maybe this will be the last one I’m posting here until you ask me for more information. :)
[ 2052.528430] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
[ 2052.544143] IP: [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
[ 2052.556170] PGD 0 [ 2052.559844]
[ 2052.562825] Oops: 0000 [#1] SMP
[ 2052.569099] Modules linked in: nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_log_ipv6 nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack sch_fq x86_pkg_temp_thermal kvm_intel kvm irqbypass ixgbe nvme crc32c_intel nvme_core mdio acpi_cpufreq nbd nf_conntrack_ftp nf_conntrack dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log
[ 2052.660301] CPU: 20 PID: 12288 Comm: ceph-osd Not tainted 4.9.43 #1
[ 2052.672811] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
[ 2052.687579] task: ffff880f0d85b900 task.stack: ffffc90009708000
[ 2052.699397] RIP: 0010:[<ffffffff81312320>] [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
[ 2052.716280] RSP: 0018:ffffc9000970bd28 EFLAGS: 00010286
[ 2052.726886] RAX: 0000000000000000 RBX: ffff8810504e7878 RCX: 0000000000000001
[ 2052.741135] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc9000970bce0
[ 2052.755379] RBP: ffffc9000970bd48 R08: 00000000b6c20f50 R09: ffffc9000970bbc0
[ 2052.769627] R10: fffffffffffffffe R11: 0000000000000001 R12: ffffc9000970bd78
[ 2052.783875] R13: ffff880f8e65cec0 R14: 0000000000000003 R15: 00000000b6c20f50
[ 2052.798120] FS: 00007fdd627fa700(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000
[ 2052.814276] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2052.825745] CR2: 00000000000000a0 CR3: 0000000fcf9e8000 CR4: 00000000001406e0
[ 2052.839992] Stack:
[ 2052.844015] ffffffff81a44fe0 ffffc9000970bd48 ffffc9000970bdd0 0000000000000003
[ 2052.858918] ffffc9000970bdb8 ffffffff81337404 0000000200000008 ffff880892da4040
[ 2052.873824] 000000005e94d370 ffff88105a603000 0000000000000000 0000000000000000
[ 2052.888730] Call Trace:
[ 2052.893630] [<ffffffff81337404>] xfs_attr3_node_inactive+0x174/0x210
[ 2052.906496] [<ffffffff813376da>] xfs_attr_inactive+0x23a/0x250
[ 2052.918317] [<ffffffff81350a4b>] xfs_inactive+0x7b/0x110
[ 2052.929096] [<ffffffff81359344>] xfs_fs_destroy_inode+0xa4/0x210
[ 2052.941267] [<ffffffff811c46cb>] destroy_inode+0x3b/0x60
[ 2052.952041] [<ffffffff811c4819>] evict+0x129/0x190
[ 2052.961783] [<ffffffff811c4c4a>] iput+0x19a/0x200
[ 2052.971349] [<ffffffff811b9129>] do_unlinkat+0x129/0x2d0
[ 2052.982134] [<ffffffff811b9d26>] SyS_unlink+0x16/0x20
[ 2052.992394] [<ffffffff81885260>] entry_SYSCALL_64_fastpath+0x13/0x94
[ 2053.005252] Code: 55 48 89 e5 41 54 53 4d 89 c4 48 89 fb 48 83 ec 10 48 c7 04 24 e0 4f a4 81 e8 fd fe ff ff 85 c0 75 46 48 85 db 74 41 49 8b 34 24 <48> 8b 96 a0 00 00 00 0f b7 52 08 66 c1 c2 08 66 81 fa be 3e 74
[ 2053.045148] RIP [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
[ 2053.057348] RSP <ffffc9000970bd28>
[ 2053.064318] CR2: 00000000000000a0
[ 2053.071494] ---[ end trace 9360ec3fb784a9ab ]---
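For reading the dump: the byte in angle brackets on the Code: line is the first byte of the faulting instruction. The following is a rough Python sketch, not a general disassembler. It assumes the instruction is a plain `REX.W 8B /r` load with a 32-bit displacement and no SIB byte (which is what the bytes here look like), and the helper names `faulting_bytes` and `decode_mov_load` are hypothetical.

```python
# Code: line copied from the oops above.
CODE = ("55 48 89 e5 41 54 53 4d 89 c4 48 89 fb 48 83 ec 10 48 c7 04 24 "
        "e0 4f a4 81 e8 fd fe ff ff 85 c0 75 46 48 85 db 74 41 49 8b 34 24 "
        "<48> 8b 96 a0 00 00 00 0f b7 52 08 66 c1 c2 08 66 81 fa be 3e 74")

# 64-bit register names indexed by the 3-bit ModRM fields (no REX.R/REX.B).
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Split a Code: line into (bytes before the fault, bytes from the fault on)."""
    tokens = code_line.split()
    idx = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    before = bytes(int(t, 16) for t in tokens[:idx])
    after = bytes(int(t.strip("<>"), 16) for t in tokens[idx:])
    return before, after


def decode_mov_load(insn):
    """Decode only 'mov disp32(%base), %dst' encoded as 48 8B /r with mod=10.

    Returns (dst, base, displacement); raises ValueError for anything else.
    """
    if len(insn) < 7 or insn[0] != 0x48 or insn[1] != 0x8B:
        raise ValueError("not a REX.W 8B load")
    modrm = insn[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm == 0b100:  # need disp32 form without a SIB byte
        raise ValueError("unsupported addressing form")
    disp = int.from_bytes(insn[3:7], "little", signed=True)
    return REG64[reg], REG64[rm], disp


_, fault = faulting_bytes(CODE)
dst, base, disp = decode_mov_load(fault)
print(f"mov {disp:#x}(%{base}),%{dst}")
```

Under those assumptions the faulting instruction is a load from offset 0xa0 relative to RSI. With RSI at 0 in the register dump, that load targets address 0xa0, which matches the CR2 value reported.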
Cheers,
Christian
--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
* Re: null pointer reference after crash
2017-09-01 20:53 ` Christian Theune
@ 2017-09-01 21:03 ` Christian Theune
2017-09-01 21:38 ` Christian Theune
0 siblings, 1 reply; 9+ messages in thread
From: Christian Theune @ 2017-09-01 21:03 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Hi,
something that might be of value: I haven’t seen those on 4.9.25 (where we started to see very regular host crashes/reboots due to iommu issues) at all - they only started to crop up on the 4.9.43 that we’ve been running for about 2 weeks now.
Christian
> On Sep 1, 2017, at 10:53 PM, Christian Theune <ct@flyingcircus.io> wrote:
>
> Hi,
>
> got it again today: this time with a filesystem that just seconds before saw a (clean) xfs_repair. Also, another Ceph user stumbled over this today:
> https://www.spinics.net/lists/ceph-users/msg36628.html
>
> Here’s my dump of today - it’s identical to the last one, so maybe this will be the last one I’m posting here until you ask me for more information. :)
>
> [ 2052.528430] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
> [ 2052.544143] IP: [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
> [ 2052.556170] PGD 0 [ 2052.559844]
> [ 2052.562825] Oops: 0000 [#1] SMP
> [ 2052.569099] Modules linked in: nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_log_ipv6 nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack sch_fq x86_pkg_temp_thermal kvm_intel kvm irqbypass ixgbe nvme crc32c_intel nvme_core mdio acpi_cpufreq nbd nf_conntrack_ftp nf_conntrack dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log
> [ 2052.660301] CPU: 20 PID: 12288 Comm: ceph-osd Not tainted 4.9.43 #1
> [ 2052.672811] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
> [ 2052.687579] task: ffff880f0d85b900 task.stack: ffffc90009708000
> [ 2052.699397] RIP: 0010:[<ffffffff81312320>] [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
> [ 2052.716280] RSP: 0018:ffffc9000970bd28 EFLAGS: 00010286
> [ 2052.726886] RAX: 0000000000000000 RBX: ffff8810504e7878 RCX: 0000000000000001
> [ 2052.741135] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc9000970bce0
> [ 2052.755379] RBP: ffffc9000970bd48 R08: 00000000b6c20f50 R09: ffffc9000970bbc0
> [ 2052.769627] R10: fffffffffffffffe R11: 0000000000000001 R12: ffffc9000970bd78
> [ 2052.783875] R13: ffff880f8e65cec0 R14: 0000000000000003 R15: 00000000b6c20f50
> [ 2052.798120] FS: 00007fdd627fa700(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000
> [ 2052.814276] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2052.825745] CR2: 00000000000000a0 CR3: 0000000fcf9e8000 CR4: 00000000001406e0
> [ 2052.839992] Stack:
> [ 2052.844015] ffffffff81a44fe0 ffffc9000970bd48 ffffc9000970bdd0 0000000000000003
> [ 2052.858918] ffffc9000970bdb8 ffffffff81337404 0000000200000008 ffff880892da4040
> [ 2052.873824] 000000005e94d370 ffff88105a603000 0000000000000000 0000000000000000
> [ 2052.888730] Call Trace:
> [ 2052.893630] [<ffffffff81337404>] xfs_attr3_node_inactive+0x174/0x210
> [ 2052.906496] [<ffffffff813376da>] xfs_attr_inactive+0x23a/0x250
> [ 2052.918317] [<ffffffff81350a4b>] xfs_inactive+0x7b/0x110
> [ 2052.929096] [<ffffffff81359344>] xfs_fs_destroy_inode+0xa4/0x210
> [ 2052.941267] [<ffffffff811c46cb>] destroy_inode+0x3b/0x60
> [ 2052.952041] [<ffffffff811c4819>] evict+0x129/0x190
> [ 2052.961783] [<ffffffff811c4c4a>] iput+0x19a/0x200
> [ 2052.971349] [<ffffffff811b9129>] do_unlinkat+0x129/0x2d0
> [ 2052.982134] [<ffffffff811b9d26>] SyS_unlink+0x16/0x20
> [ 2052.992394] [<ffffffff81885260>] entry_SYSCALL_64_fastpath+0x13/0x94
> [ 2053.005252] Code: 55 48 89 e5 41 54 53 4d 89 c4 48 89 fb 48 83 ec 10 48 c7 04 24 e0 4f a4 81 e8 fd fe ff ff 85 c0 75 46 48 85 db 74 41 49 8b 34 24 <48> 8b 96 a0 00 00 00 0f b7 52 08 66 c1 c2 08 66 81 fa be 3e 74
> [ 2053.045148] RIP [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
> [ 2053.057348] RSP <ffffc9000970bd28>
> [ 2053.064318] CR2: 00000000000000a0
> [ 2053.071494] ---[ end trace 9360ec3fb784a9ab ]---
>
> Cheers,
> Christian
>
> --
> Christian Theune · ct@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · http://flyingcircus.io
> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
>
Kind regards,
Christian Theune
--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
* Re: null pointer reference after crash
2017-09-01 21:03 ` Christian Theune
@ 2017-09-01 21:38 ` Christian Theune
0 siblings, 0 replies; 9+ messages in thread
From: Christian Theune @ 2017-09-01 21:38 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Hi,
(sorry for the staggered posting, sitting in an off-hours maintenance cycle and got time to think and things are trickling in).
The hopefully last piece I can add for today: I have only ever seen this happen within maybe 10 minutes or less after a reboot and mount (clean or unclean doesn’t matter), followed by an immediate spike of traffic from Ceph recovering things. Also, we started running a ‘find’ and ‘vmtouch’ prewarm script before starting Ceph. So the boot order is:
- mount
- vmtouch selected files
- cache inodes by running ‘find’ on the disk
- start Ceph osd
Once a host has survived for a while I haven’t seen it crash at all, only directly after boots (so far).
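The boot order above could be sketched roughly as follows in Python. This is an illustration under assumptions, not our actual script: the device, mount point, file paths and the systemd unit name are hypothetical placeholders, and starting the OSD via `systemctl` is an assumption about the deployment. Only `vmtouch -t` (touch pages into memory) and a plain `find` walk are taken from the description.

```python
import subprocess


def prewarm_plan(device, mountpoint, warm_paths, osd_unit):
    """Return the boot-order commands as argv lists, in the order described."""
    return [
        ["mount", device, mountpoint],
        ["vmtouch", "-t", *warm_paths],    # -t: touch pages into the page cache
        ["find", mountpoint],              # walk the tree to pull inodes into cache
        ["systemctl", "start", osd_unit],  # assumption: OSD runs as a systemd unit
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in plan:
        print(" ".join(argv))
        if not dry_run:
            # Discard output (find lists every path) and stop on first failure.
            subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)


# Hypothetical values for illustration only.
plan = prewarm_plan(
    device="/dev/sdX1",
    mountpoint="/srv/ceph/osd/0",
    warm_paths=["/srv/ceph/osd/0/current/omap"],
    osd_unit="ceph-osd@0",
)
run_plan(plan)  # dry run: prints the four steps without executing them
```

Keeping the steps as data makes the ordering explicit, so a step such as the prewarm can be dropped to test whether it correlates with the crashes.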
Christian
> On Sep 1, 2017, at 11:03 PM, Christian Theune <ct@flyingcircus.io> wrote:
>
> Hi,
>
> something that might be of value: I haven’t seen those on 4.9.25 (where we started to see very regular host crashes/reboots due to iommu issues) at all - they only started to crop up on the 4.9.43 that we’ve been running for about 2 weeks now.
>
> Christian
>
>> On Sep 1, 2017, at 10:53 PM, Christian Theune <ct@flyingcircus.io> wrote:
>>
>> Hi,
>>
>> got it again today: this time with a filesystem that just seconds before saw a (clean) xfs_repair. Also, another Ceph user stumbled over this today:
>> https://www.spinics.net/lists/ceph-users/msg36628.html
>>
>> Here’s my dump of today - it’s identical to the last one, so maybe this will be the last one I’m posting here until you ask me for more information. :)
>>
>> [ 2052.528430] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
>> [ 2052.544143] IP: [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
>> [ 2052.556170] PGD 0 [ 2052.559844]
>> [ 2052.562825] Oops: 0000 [#1] SMP
>> [ 2052.569099] Modules linked in: nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_log_ipv6 nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack sch_fq x86_pkg_temp_thermal kvm_intel kvm irqbypass ixgbe nvme crc32c_intel nvme_core mdio acpi_cpufreq nbd nf_conntrack_ftp nf_conntrack dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log
>> [ 2052.660301] CPU: 20 PID: 12288 Comm: ceph-osd Not tainted 4.9.43 #1
>> [ 2052.672811] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
>> [ 2052.687579] task: ffff880f0d85b900 task.stack: ffffc90009708000
>> [ 2052.699397] RIP: 0010:[<ffffffff81312320>] [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
>> [ 2052.716280] RSP: 0018:ffffc9000970bd28 EFLAGS: 00010286
>> [ 2052.726886] RAX: 0000000000000000 RBX: ffff8810504e7878 RCX: 0000000000000001
>> [ 2052.741135] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc9000970bce0
>> [ 2052.755379] RBP: ffffc9000970bd48 R08: 00000000b6c20f50 R09: ffffc9000970bbc0
>> [ 2052.769627] R10: fffffffffffffffe R11: 0000000000000001 R12: ffffc9000970bd78
>> [ 2052.783875] R13: ffff880f8e65cec0 R14: 0000000000000003 R15: 00000000b6c20f50
>> [ 2052.798120] FS: 00007fdd627fa700(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000
>> [ 2052.814276] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 2052.825745] CR2: 00000000000000a0 CR3: 0000000fcf9e8000 CR4: 00000000001406e0
>> [ 2052.839992] Stack:
>> [ 2052.844015] ffffffff81a44fe0 ffffc9000970bd48 ffffc9000970bdd0 0000000000000003
>> [ 2052.858918] ffffc9000970bdb8 ffffffff81337404 0000000200000008 ffff880892da4040
>> [ 2052.873824] 000000005e94d370 ffff88105a603000 0000000000000000 0000000000000000
>> [ 2052.888730] Call Trace:
>> [ 2052.893630] [<ffffffff81337404>] xfs_attr3_node_inactive+0x174/0x210
>> [ 2052.906496] [<ffffffff813376da>] xfs_attr_inactive+0x23a/0x250
>> [ 2052.918317] [<ffffffff81350a4b>] xfs_inactive+0x7b/0x110
>> [ 2052.929096] [<ffffffff81359344>] xfs_fs_destroy_inode+0xa4/0x210
>> [ 2052.941267] [<ffffffff811c46cb>] destroy_inode+0x3b/0x60
>> [ 2052.952041] [<ffffffff811c4819>] evict+0x129/0x190
>> [ 2052.961783] [<ffffffff811c4c4a>] iput+0x19a/0x200
>> [ 2052.971349] [<ffffffff811b9129>] do_unlinkat+0x129/0x2d0
>> [ 2052.982134] [<ffffffff811b9d26>] SyS_unlink+0x16/0x20
>> [ 2052.992394] [<ffffffff81885260>] entry_SYSCALL_64_fastpath+0x13/0x94
>> [ 2053.005252] Code: 55 48 89 e5 41 54 53 4d 89 c4 48 89 fb 48 83 ec 10 48 c7 04 24 e0 4f a4 81 e8 fd fe ff ff 85 c0 75 46 48 85 db 74 41 49 8b 34 24 <48> 8b 96 a0 00 00 00 0f b7 52 08 66 c1 c2 08 66 81 fa be 3e 74
>> [ 2053.045148] RIP [<ffffffff81312320>] xfs_da3_node_read+0x30/0xb0
>> [ 2053.057348] RSP <ffffc9000970bd28>
>> [ 2053.064318] CR2: 00000000000000a0
>> [ 2053.071494] ---[ end trace 9360ec3fb784a9ab ]---
>>
>> Cheers,
>> Christian
>>
>> --
>> Christian Theune · ct@flyingcircus.io · +49 345 219401 0
>> Flying Circus Internet Operations GmbH · http://flyingcircus.io
>> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
>>
>
> Kind regards,
> Christian Theune
>
> --
> Christian Theune · ct@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · http://flyingcircus.io
> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
>
Kind regards,
Christian Theune
--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
Thread overview: 9+ messages
2017-08-28 17:23 null pointer reference after crash Christian Theune
2017-08-28 17:42 ` Darrick J. Wong
2017-08-28 19:00 ` Christian Theune
2017-08-30 13:56 ` Christian Theune
2017-08-30 15:58 ` Darrick J. Wong
2017-08-30 19:03 ` Christian Theune
2017-09-01 20:53 ` Christian Theune
2017-09-01 21:03 ` Christian Theune
2017-09-01 21:38 ` Christian Theune