* Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB @ 2016-10-03 10:52 Johannes Bauer 2016-10-04 3:18 ` Theodore Ts'o 2016-10-04 8:41 ` Jan Kara 0 siblings, 2 replies; 19+ messages in thread From: Johannes Bauer @ 2016-10-03 10:52 UTC (permalink / raw) To: linux-ext4 Hi there, I have recently bought an Intel NUC6i3SYB. That's essentially a small form-factor x86_64 PC. That device runs Linux Mint. Unfortunately I see frequent kernel oopses within the ext4 subsystem and consequently loss of data, corrupted files and complete system crashes. Here's a recent call trace: [ 796.429566] systemd[1]: apt-daily.timer: Adding 4h 1min 3.723928s random time. [ 3405.666456] general protection fault: 0000 [#1] SMP [ 3405.666519] Modules linked in: rfcomm bnep binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic arc4 snd_soc_skl snd_soc_skl_ipc snd_hda_ext_core iwlmvm snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_core mac80211 btusb snd_compress btrtl ac97_bus snd_pcm_dmaengine dw_dmac_core snd_hda_intel snd_hda_codec snd_hda_core 8250_dw snd_hwdep intel_rapl snd_pcm x86_pkg_temp_thermal intel_powerclamp coretemp iwlwifi snd_seq_midi snd_seq_midi_event kvm_intel snd_rawmidi hci_uart btbcm btqca snd_seq cfg80211 snd_seq_device kvm btintel bluetooth snd_timer irqbypass ir_sharp_decoder ir_rc5_decoder snd ir_lirc_codec ir_jvc_decoder ir_xmp_decoder lirc_dev ir_mce_kbd_decoder ir_sanyo_decoder ir_rc6_decoder ir_sony_decoder ir_nec_decoder rc_rc6_mce soundcore ite_cir intel_lpss_acpi mei_me rc_core [ 3405.667326] idma64 virt_dma shpchp intel_lpss_pci intel_lpss mei acpi_pad acpi_als kfifo_buf industrialio mac_hid parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq jitterentropy_rng drbg ansi_cprng dm_crypt algif_skcipher af_alg dm_mirror dm_region_hash dm_log crct10dif_pclmul crc32_pclmul i915_bpo intel_ips i2c_algo_bit drm_kms_helper aesni_intel syscopyarea sysfillrect aes_x86_64 lrw gf128mul glue_helper sysimgblt ablk_helper e1000e fb_sys_fops sdhci_pci cryptd 
ahci ptp i2c_hid drm pps_core libahci sdhci pinctrl_sunrisepoint video hid pinctrl_intel fjes [ 3405.667929] CPU: 3 PID: 2261 Comm: hexchat Not tainted 4.4.0-21-generic #37-Ubuntu [ 3405.667998] Hardware name: /NUC6i3SYB, BIOS SYSKLi35.86A.0042.2016.0409.1246 04/09/2016 [ 3405.668082] task: ffff88003565ac40 ti: ffff8804332e8000 task.ti: ffff8804332e8000 [ 3405.668148] RIP: 0010:[<ffffffff811eb027>] [<ffffffff811eb027>] kmem_cache_alloc+0x77/0x1f0 [ 3405.668234] RSP: 0018:ffff8804332eba88 EFLAGS: 00010282 [ 3405.668282] RAX: 0000000000000000 RBX: 0000000002408040 RCX: 00000000000e1547 [ 3405.668345] RDX: 00000000000e1546 RSI: 0000000002408040 RDI: 000000000001a940 [ 3405.668408] RBP: ffff8804332ebab8 R08: ffff88046ed9a940 R09: ffdb88033bb3a3a8 [ 3405.668470] R10: ffff8804591a4ed0 R11: ffffffff81ccc462 R12: 0000000002408040 [ 3405.668533] R13: ffffffff81243351 R14: ffff88045e08bc00 R15: ffff88045e08bc00 [ 3405.668597] FS: 00007f1df9704a40(0000) GS:ffff88046ed80000(0000) knlGS:0000000000000000 [ 3405.668668] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3405.668719] CR2: 00007fd945ecebd6 CR3: 0000000456a48000 CR4: 00000000003406e0 [ 3405.668782] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3405.668844] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3405.668906] Stack: [ 3405.668926] 01ff880438ee2508 0000000000001000 ffff8803344df000 ffffea000cd137c0 [ 3405.669003] 0000000000000000 0000000000000000 ffff8804332ebad0 ffffffff81243351 [ 3405.669080] ffff8800354bd024 ffff8804332ebb18 ffffffff81243829 00000001332ebb70 [ 3405.669156] Call Trace: [ 3405.669186] [<ffffffff81243351>] alloc_buffer_head+0x21/0x60 [ 3405.669240] [<ffffffff81243829>] alloc_page_buffers+0x79/0xe0 [ 3405.669294] [<ffffffff812438ae>] create_empty_buffers+0x1e/0xc0 [ 3405.669351] [<ffffffff812979cc>] ext4_block_write_begin+0x3cc/0x4d0 [ 3405.669410] [<ffffffff812e74db>] ? jbd2__journal_start+0xdb/0x1e0 [ 3405.669469] [<ffffffff81296e10>] ? 
ext4_inode_attach_jinode.part.60+0xb0/0xb0
[ 3405.669536] [<ffffffff812cb83d>] ? __ext4_journal_start_sb+0x6d/0x120
[ 3405.669596] [<ffffffff8129d574>] ext4_da_write_begin+0x154/0x320
[ 3405.669656] [<ffffffff8118d4de>] generic_perform_write+0xce/0x1c0
[ 3405.669713] [<ffffffff8118f382>] __generic_file_write_iter+0x1a2/0x1e0
[ 3405.669773] [<ffffffff81291ffc>] ext4_file_write_iter+0xfc/0x460
[ 3405.669833] [<ffffffff81794d6e>] ? inet_recvmsg+0x7e/0xb0
[ 3405.669885] [<ffffffff816fdb6b>] ? sock_recvmsg+0x3b/0x50
[ 3405.669938] [<ffffffff8120bedb>] new_sync_write+0x9b/0xe0
[ 3405.669990] [<ffffffff8120bf46>] __vfs_write+0x26/0x40
[ 3405.670040] [<ffffffff8120c8c9>] vfs_write+0xa9/0x1a0
[ 3405.672397] [<ffffffff8120c776>] ? vfs_read+0x86/0x130
[ 3405.674693] [<ffffffff8120d585>] SyS_write+0x55/0xc0
[ 3405.676925] [<ffffffff818244f2>] entry_SYSCALL_64_fastpath+0x16/0x71
[ 3405.679111] Code: 08 65 4c 03 05 83 f1 e1 7e 49 83 78 10 00 4d 8b 08 0f 84 29 01 00 00 4d 85 c9 0f 84 20 01 00 00 49 63 47 20 48 8d 4a 01 49 8b 3f <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 84 c0 74 bb 49 63
[ 3405.683725] RIP [<ffffffff811eb027>] kmem_cache_alloc+0x77/0x1f0
[ 3405.685876] RSP <ffff8804332eba88>
[ 3405.696001] ---[ end trace 4968a9119e168c92 ]---

After this occurs, the system becomes extremely unstable, i.e., the filesystem cannot be read properly anymore (e.g., ssh logins usually do not work anymore, most binaries just segfault). After a reboot (which has to be done manually, "shutdown -r now" also segfaults) it works fine again (until the problem comes back).

Since the hardware is fairly new, I cannot exclude a hardware defect as of now. I've thoroughly tested the RAM though and not found any defect there (ran MemTest86+ for 24 hours). One curious thing is that someone else seems to have run into this before.
Searching for the symbols in the stackframe I came upon this: http://pastebin.com/BJbu35H4 Which, quoting in full: Jul 16 14:28:29 nuc kernel: [ 370.642612] general protection fault: 0000 [#1] SMP Jul 16 14:28:29 nuc kernel: [ 370.642657] Modules linked in: arc4 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iwlmvm kvm mac80211 irqbypass crct10dif_pclmul crc32_pclmul iwlwifi aesni_intel aes_x86_64 snd_hda_codec_hdmi btusb btrtl snd_hda_codec_realtek btbcm btintel lrw gf128mul snd_hda_codec_generic ir_xmp_decoder glue_helper ablk_helper ir_lirc_codec cryptd mei_me lirc_dev ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder bluetooth cfg80211 snd_hda_intel ir_sony_decoder mei ir_jvc_decoder ir_rc6_decoder ir_rc5_decoder ir_nec_decoder lpc_ich snd_hda_codec shpchp snd_soc_rt5640 snd_hda_core snd_soc_rl6231 snd_hwdep snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine rc_rc6_mce dw_dmac snd_pcm nuvoton_cir rc_core snd_timer dw_dmac_core snd elan_i2c soundcore snd_soc_sst_acpi spi_pxa2xx_platform i2c_designware_platform i2c_designware_core 8250_dw mac_hid ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 xt_comment nf_log_ipv4 nf_log_common xt_LOG xt_multiport xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat sunrpc nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables autofs4 btrfs xor raid6_pq i915 i2c_algo_bit drm_kms_helper e1000e syscopyarea sysfillrect sysimgblt uas fb_sys_fops ptp ahci sdhci_acpi usb_storage libahci drm pps_core video i2c_hid sdhci hid fjes Jul 16 14:28:29 nuc kernel: [ 370.643778] CPU: 3 PID: 1505 Comm: dd Not tainted 4.4.0-31-generic #50-Ubuntu Jul 16 14:28:29 nuc kernel: [ 370.643822] Hardware name: /D54250WYK, BIOS WYLPT10H.86A.0041.2015.0720.1108 07/20/2015 Jul 16 14:28:29 nuc kernel: [ 370.643878] task: ffff88040aa90000 ti: 
ffff880407b98000 task.ti: ffff880407b98000 Jul 16 14:28:29 nuc kernel: [ 370.643923] RIP: 0010:[<ffffffff811eb987>] [<ffffffff811eb987>] kmem_cache_alloc+0x77/0x1f0 Jul 16 14:28:29 nuc kernel: [ 370.643982] RSP: 0018:ffff880407b9ba80 EFLAGS: 00010282 Jul 16 14:28:29 nuc kernel: [ 370.644015] RAX: 0000000000000000 RBX: 0000000002408040 RCX: 00000000000bb283 Jul 16 14:28:29 nuc kernel: [ 370.644054] RDX: 00000000000bb282 RSI: 0000000002408040 RDI: 000000000001a940 Jul 16 14:28:29 nuc kernel: [ 370.644076] RBP: ffff880407b9bab0 R08: ffff88041fb9a940 R09: ddff88007f5aff08 Jul 16 14:28:29 nuc kernel: [ 370.644097] R10: ffff8800d522d060 R11: ffffffff81ccf1ea R12: 0000000002408040 Jul 16 14:28:29 nuc kernel: [ 370.644119] R13: ffffffff81243ea1 R14: ffff88040f08bc00 R15: ffff88040f08bc00 Jul 16 14:28:29 nuc kernel: [ 370.644141] FS: 00007fa227587700(0000) GS:ffff88041fb80000(0000) knlGS:0000000000000000 Jul 16 14:28:29 nuc kernel: [ 370.644165] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 16 14:28:29 nuc kernel: [ 370.644183] CR2: 0000000000a08000 CR3: 000000040aaac000 CR4: 00000000001406e0 Jul 16 14:28:29 nuc kernel: [ 370.644205] Stack: Jul 16 14:28:29 nuc kernel: [ 370.644213] 01ff8803fef10a00 0000000000001000 ffff88007b89f000 ffffea0001ee27c0 Jul 16 14:28:29 nuc kernel: [ 370.644241] 0000000000000000 0000000000000000 ffff880407b9bac8 ffffffff81243ea1 Jul 16 14:28:29 nuc kernel: [ 370.644270] ffff8804099b8024 ffff880407b9bb10 ffffffff81244379 0000000107b9bb68 Jul 16 14:28:29 nuc kernel: [ 370.644298] Call Trace: Jul 16 14:28:29 nuc kernel: [ 370.644311] [<ffffffff81243ea1>] alloc_buffer_head+0x21/0x60 Jul 16 14:28:29 nuc kernel: [ 370.644329] [<ffffffff81244379>] alloc_page_buffers+0x79/0xe0 Jul 16 14:28:29 nuc kernel: [ 370.644349] [<ffffffff812443fe>] create_empty_buffers+0x1e/0xc0 Jul 16 14:28:29 nuc kernel: [ 370.644369] [<ffffffff812987fc>] ext4_block_write_begin+0x3cc/0x4e0 Jul 16 14:28:29 nuc kernel: [ 370.644390] [<ffffffff812e8afb>] ? 
jbd2__journal_start+0xdb/0x1e0 Jul 16 14:28:29 nuc kernel: [ 370.644410] [<ffffffff81297c40>] ? ext4_inode_attach_jinode.part.60+0xb0/0xb0 Jul 16 14:28:29 nuc kernel: [ 370.644434] [<ffffffff812ccc2d>] ? __ext4_journal_start_sb+0x6d/0x120 Jul 16 14:28:29 nuc kernel: [ 370.644456] [<ffffffff8129e61d>] ext4_da_write_begin+0x15d/0x340 Jul 16 14:28:29 nuc kernel: [ 370.644477] [<ffffffff8118db4e>] generic_perform_write+0xce/0x1c0 Jul 16 14:28:29 nuc kernel: [ 370.644498] [<ffffffff8118f9f2>] __generic_file_write_iter+0x1a2/0x1e0 Jul 16 14:28:29 nuc kernel: [ 370.644518] [<ffffffff81292d72>] ext4_file_write_iter+0x102/0x470 Jul 16 14:28:29 nuc kernel: [ 370.644540] [<ffffffff81403f37>] ? iov_iter_zero+0x67/0x200 Jul 16 14:28:29 nuc kernel: [ 370.644560] [<ffffffff8120c94b>] new_sync_write+0x9b/0xe0 Jul 16 14:28:29 nuc kernel: [ 370.644578] [<ffffffff8120c9b6>] __vfs_write+0x26/0x40 Jul 16 14:28:29 nuc kernel: [ 370.645377] [<ffffffff8120d339>] vfs_write+0xa9/0x1a0 Jul 16 14:28:29 nuc kernel: [ 370.646174] [<ffffffff8120d274>] ? vfs_read+0x114/0x130 Jul 16 14:28:29 nuc kernel: [ 370.646973] [<ffffffff8120dff5>] SyS_write+0x55/0xc0 Jul 16 14:28:29 nuc kernel: [ 370.647766] [<ffffffff8182db32>] entry_SYSCALL_64_fastpath+0x16/0x71 Jul 16 14:28:29 nuc kernel: [ 370.648547] Code: 08 65 4c 03 05 23 e8 e1 7e 49 83 78 10 00 4d 8b 08 0f 84 29 01 00 00 4d 85 c9 0f 84 20 01 00 00 49 63 47 20 48 8d 4a 01 49 8b 3f <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 84 c0 74 bb 49 63 Jul 16 14:28:29 nuc kernel: [ 370.650215] RIP [<ffffffff811eb987>] kmem_cache_alloc+0x77/0x1f0 Jul 16 14:28:29 nuc kernel: [ 370.650985] RSP <ffff880407b9ba80> Jul 16 14:28:29 nuc kernel: [ 370.651755] ---[ end trace 639091250fabe2af ]--- Shows also a stacktrace with the same call path, also running on a (different) Intel NUC, also running a 4.4.0 kernel. This pastebin is nowhere referenced however, so I'm unsure who found it and where exactly it was posted. 
Since the offending process in the unknown guy or girl's pastebin was dd, however, I believe that he or she was deliberately trying to reproduce the problem.

For me, the problem occurs only when the system is under heavy disk load (usually after about an hour of activity). I have a process running that does an sqlite3 commit about every 10 seconds. Letting it run overnight with almost no load produced no oops.

Any and all advice is greatly appreciated.
Cheers,
Johannes

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB 2016-10-03 10:52 Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB Johannes Bauer @ 2016-10-04 3:18 ` Theodore Ts'o 2016-10-04 8:41 ` Jan Kara 1 sibling, 0 replies; 19+ messages in thread From: Theodore Ts'o @ 2016-10-04 3:18 UTC (permalink / raw) To: Johannes Bauer; +Cc: linux-ext4

On Mon, Oct 03, 2016 at 12:52:20PM +0200, Johannes Bauer wrote:
> Shows also a stacktrace with the same call path, also running on a
> (different) Intel NUC, also running a 4.4.0 kernel. This pastebin is
> nowhere referenced however, so I'm unsure who found it and where exactly
> it was posted. Since the offending process in the unknown guy or girl's
> pastebin was dd, however, I believe that he or she tried to deliberately
> reproduce the problem.

Have you tried using a 4.4.23 kernel? There are a large number of bug fixes in the kernel between 4.4.0 and 4.4.23. The last time I did a stable kernel test run was against 4.4.17, and it passed clean:

FSTESTIMG: gce-xfstests/xfstests-201608132226
FSTESTVER: e2fsprogs v1.43.1-22-g25c4a20 (Wed, 8 Jun 2016 18:11:27 -0400)
FSTESTVER: fio fio-2.6-8-ge6989e1 (Thu, 4 Feb 2016 12:09:48 -0700)
FSTESTVER: quota 81aca5c (Tue, 12 Jul 2016 16:15:45 +0200)
FSTESTVER: xfsprogs v4.5.0 (Tue, 15 Mar 2016 15:25:56 +1100)
FSTESTVER: xfstests-bld 75f1eb0 (Sat, 13 Aug 2016 22:18:57 -0400)
FSTESTVER: xfstests linux-v3.8-1149-g4e58a5b (Mon, 8 Aug 2016 10:50:34 -0400)
FSTESTVER: kernel 4.4.17 #4 SMP Mon Aug 15 23:55:25 EDT 2016 x86_64
FSTESTCFG: "all"
FSTESTSET: "-g auto"
FSTESTEXC: "ext4/022"
FSTESTOPT: "aex"
MNTOPTS: ""
CPUS: "2"
MEM: "7477.49"
MEM: 7680 MB (Max capacity)
BEGIN TEST 4k: Ext4 4k block Tue Aug 16 00:05:28 EDT 2016
Passed all 224 tests

- Ted

P.S.
Fixes between 4.4.0 and 4.4.17:

% git log --oneline v4.4..v4.4.17 -- fs/ext4 fs/jbd2
26015f0 ext4: verify extent header depth
8b8de1c ext4: silence UBSAN in ext4_mb_init()
12aa7d9 ext4: address UBSAN warning in mb_find_order_for_block()
b2601bb ext4: fix oops on corrupted filesystem
b2044c3 ext4: clean up error handling when orphan list is corrupted
c5ce389 ext4: fix hang when processing corrupted orphaned inode list
fa5613b ext4: iterate over buffer heads correctly in move_extent_per_page()
2122834 ext4: fix races of writeback with punch hole and zero range
1f7b7e9 ext4: fix races between buffered IO and collapse / insert range
e096ade ext4: move unlocked dio protection from ext4_alloc_file_blocks()
0b680de ext4: fix races between page faults and hole punching
c745297 ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
ee8516a ext4: ignore quota mount options if the quota feature is enabled
321299a ext4: add lockdep annotations for i_data_sem
93272be jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
7c3d142 ext4: fix bh->b_state corruption
bbfe21c ext4: don't read blocks from disk after extents being swapped
600d41f ext4: fix potential integer overflow
33f48f8 ext4: fix scheduling in atomic on group checksum failure
b80b70e ext4 crypto: add missing locking for keyring_key access

Fixes between 4.4.17 and 4.4.23:

% git log --oneline v4.4.17..v4.4.23 -- fs/ext4 fs/jbd2
bf63b9d fscrypto: require write access to mount to set encryption policy
8d693a2 fscrypto: add authorization check for setting encryption policy
d8aafd0 ext4: use __GFP_NOFAIL in ext4_free_blocks()
1d12bad ext4: avoid modifying checksum fields directly during checksum verification
77ae14d ext4: avoid deadlock when expanding inode size
a79f1f7 ext4: properly align shifted xattrs when expanding inodes
e6abdbf ext4: fix xattr shifting when expanding inodes part 2
f2c06c7 ext4: fix xattr shifting when expanding inodes
dfa0a22 ext4: validate that metadata blocks do not overlap superblock
564e0f8 jbd2: make journal y2038 safe
3a22cf0 ext4: fix reference counting bug on block allocation error
db82c74 ext4: short-cut orphan cleanup on error
f8d4d52 ext4: validate s_reserved_gdt_blocks on mount
175f36c ext4: don't call ext4_should_journal_data() on the journal inode
5a7f477 ext4: fix deadlock during page writeback
9e38db2 ext4: check for extents that wrap around

And note that not all fixes get backported. Sometimes a patch is too large or too complex to backport. Or sometimes we forget to tag a patch for a stable kernel backport that really should have been backported. So trying to see if you can replicate the problem using the latest 4.8 kernel would also be a good thing to try.

Finally, the oops was inside the memory allocator, so it's possible the problem was caused by a corrupted freelist, which could have been caused by a wild pointer dereference in any part of the kernel, not necessarily ext4. Which is another reason to go to the latest 4.4.x kernel or to try the 4.8 kernel. The bug in some other part of the subsystem may have since been fixed.

^ permalink raw reply [flat|nested] 19+ messages in thread
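The point about a corrupted freelist can be illustrated with a toy model. The sketch below is not the kernel's slab allocator; it is a deliberately simplified Python model in which every name (ToySlab, alloc, free, wild_write, demo) is invented for illustration. It only assumes the general idea that a free object stores the link to the next free object, so that a stray write into freed memory goes unnoticed until some later, unrelated allocation follows the damaged link.

```python
class ToySlab:
    """Toy slab cache: each free slot stores the index of the next free slot."""

    END = -1  # marks the end of the freelist

    def __init__(self, nslots):
        # While slot i is free, memory[i] holds the "next free" link.
        self.memory = list(range(1, nslots)) + [self.END]
        self.freelist = 0  # head of the freelist

    def alloc(self):
        slot = self.freelist
        if slot == self.END:
            raise MemoryError("cache exhausted")
        if not 0 <= slot < len(self.memory):
            # Toy analogue of dereferencing a garbage pointer.
            raise RuntimeError(f"fault in alloc(): bad freelist pointer {slot}")
        self.freelist = self.memory[slot]  # follow the link kept in the free slot
        self.memory[slot] = None           # the slot now belongs to the caller
        return slot

    def free(self, slot):
        self.memory[slot] = self.freelist
        self.freelist = slot

    def wild_write(self, slot, value):
        """Model a buggy driver (or bad hardware) scribbling on a free slot."""
        self.memory[slot] = value


def demo():
    slab = ToySlab(4)
    a = slab.alloc()            # takes slot 0, freelist head becomes 1
    slab.wild_write(1, 0xDEAD)  # the culprit damages the link inside free slot 1
    b = slab.alloc()            # still succeeds, but the head is now garbage
    try:
        slab.alloc()            # an innocent caller takes the fault
    except RuntimeError as exc:
        return a, b, str(exc)
    return a, b, None
```

In this model the fault is raised two allocations after the stray write and in a caller that did nothing wrong, which is the sense in which an oops inside the allocator says little about which code did the damage.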
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB 2016-10-03 10:52 Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB Johannes Bauer 2016-10-04 3:18 ` Theodore Ts'o @ 2016-10-04 8:41 ` Jan Kara 2016-10-04 16:50 ` Johannes Bauer 1 sibling, 1 reply; 19+ messages in thread From: Jan Kara @ 2016-10-04 8:41 UTC (permalink / raw) To: Johannes Bauer; +Cc: linux-ext4, linux-mm

Hi!

On Mon 03-10-16 12:52:20, Johannes Bauer wrote:
> I have recently bought an Intel NUC6i3SYB. That's essentially a small
> form-factor x86_64 PC. That device runs Linux Mint. Unfortunately I see
> frequent kernel oopses within the ext4 subsystem and consequently loss
> of data, corrupted files and complete system crashes. Here's a recent
> call trace:

The problem looks like memory corruption:

> [ 3405.666456] general protection fault: 0000 [#1] SMP
<snip>
> [ 3405.667929] CPU: 3 PID: 2261 Comm: hexchat Not tainted
> 4.4.0-21-generic #37-Ubuntu
> [ 3405.667998] Hardware name: /NUC6i3SYB, BIOS
> SYSKLi35.86A.0042.2016.0409.1246 04/09/2016
> [ 3405.668082] task: ffff88003565ac40 ti: ffff8804332e8000 task.ti:
> ffff8804332e8000
> [ 3405.668148] RIP: 0010:[<ffffffff811eb027>] [<ffffffff811eb027>]
> kmem_cache_alloc+0x77/0x1f0

So we crash in kmem_cache_alloc(), looking at the disassembly at:

mov (%r9,%rax,1),%rbx

Now look at register contents:

> [ 3405.668234] RSP: 0018:ffff8804332eba88 EFLAGS: 00010282
> [ 3405.668282] RAX: 0000000000000000 RBX: 0000000002408040 RCX:
> 00000000000e1547
> [ 3405.668345] RDX: 00000000000e1546 RSI: 0000000002408040 RDI:
> 000000000001a940
> [ 3405.668408] RBP: ffff8804332ebab8 R08: ffff88046ed9a940 R09:
> ffdb88033bb3a3a8

So %rax is 0, %r9 is ffdb88033bb3a3a8 - that's a problem because this is not a valid kernel pointer. Well, actually it is but it points somewhere to a vmalloc area and that particular place is apparently unmapped. I don't think anything in that path should be doing anything with vmalloc so I'd rather think that something corrupted the pointer.
Hum, and looking into the oops you pasted below, there %r9 is ddff88007f5aff08 - that's definitely a corrupted pointer.

Anyway, adding linux-mm to CC since this does not look like an ext4-related issue but rather an mm-related one.

Bugs like these are always hard to catch; usually it's some flaky device driver, sometimes also flaky HW. You can try running the kernel with various debug options enabled in the hope of catching the code corrupting memory earlier - e.g. CONFIG_DEBUG_PAGEALLOC sometimes catches something, CONFIG_SLUB_DEBUG can be useful as well. Another option is to get a crashdump when the oops happens (although that's going to be a pain to set up on such a small machine) and then look at which places point to the corrupted memory - sometimes you can find old structures pointing to the place and find the use-after-free issue or stuff like that...

Honza

> [ 3405.668470] R10: ffff8804591a4ed0 R11: ffffffff81ccc462 R12:
> 0000000002408040
> [ 3405.668533] R13: ffffffff81243351 R14: ffff88045e08bc00 R15:
> ffff88045e08bc00
> [ 3405.668597] FS: 00007f1df9704a40(0000) GS:ffff88046ed80000(0000)
> knlGS:0000000000000000
> [ 3405.668668] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3405.668719] CR2: 00007fd945ecebd6 CR3: 0000000456a48000 CR4:
> 00000000003406e0
> [ 3405.668782] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 3405.668844] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [ 3405.668906] Stack:
> [ 3405.668926] 01ff880438ee2508 0000000000001000 ffff8803344df000
> ffffea000cd137c0
> [ 3405.669003] 0000000000000000 0000000000000000 ffff8804332ebad0
> ffffffff81243351
> [ 3405.669080] ffff8800354bd024 ffff8804332ebb18 ffffffff81243829
> 00000001332ebb70
> [ 3405.669156] Call Trace:
> [ 3405.669186] [<ffffffff81243351>] alloc_buffer_head+0x21/0x60
> [ 3405.669240] [<ffffffff81243829>] alloc_page_buffers+0x79/0xe0
> [ 3405.669294] [<ffffffff812438ae>] create_empty_buffers+0x1e/0xc0
> [ 3405.669351] [<ffffffff812979cc>]
ext4_block_write_begin+0x3cc/0x4d0 > [ 3405.669410] [<ffffffff812e74db>] ? jbd2__journal_start+0xdb/0x1e0 > [ 3405.669469] [<ffffffff81296e10>] ? > ext4_inode_attach_jinode.part.60+0xb0/0xb0 > [ 3405.669536] [<ffffffff812cb83d>] ? __ext4_journal_start_sb+0x6d/0x120 > [ 3405.669596] [<ffffffff8129d574>] ext4_da_write_begin+0x154/0x320 > [ 3405.669656] [<ffffffff8118d4de>] generic_perform_write+0xce/0x1c0 > [ 3405.669713] [<ffffffff8118f382>] __generic_file_write_iter+0x1a2/0x1e0 > [ 3405.669773] [<ffffffff81291ffc>] ext4_file_write_iter+0xfc/0x460 > [ 3405.669833] [<ffffffff81794d6e>] ? inet_recvmsg+0x7e/0xb0 > [ 3405.669885] [<ffffffff816fdb6b>] ? sock_recvmsg+0x3b/0x50 > [ 3405.669938] [<ffffffff8120bedb>] new_sync_write+0x9b/0xe0 > [ 3405.669990] [<ffffffff8120bf46>] __vfs_write+0x26/0x40 > [ 3405.670040] [<ffffffff8120c8c9>] vfs_write+0xa9/0x1a0 > [ 3405.672397] [<ffffffff8120c776>] ? vfs_read+0x86/0x130 > [ 3405.674693] [<ffffffff8120d585>] SyS_write+0x55/0xc0 > [ 3405.676925] [<ffffffff818244f2>] entry_SYSCALL_64_fastpath+0x16/0x71 > [ 3405.679111] Code: 08 65 4c 03 05 83 f1 e1 7e 49 83 78 10 00 4d 8b 08 > 0f 84 29 01 00 00 4d 85 c9 0f 84 20 01 00 00 49 63 47 20 48 8d 4a 01 49 > 8b 3f <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 84 c0 74 bb 49 63 > [ 3405.683725] RIP [<ffffffff811eb027>] kmem_cache_alloc+0x77/0x1f0 > [ 3405.685876] RSP <ffff8804332eba88> > [ 3405.696001] ---[ end trace 4968a9119e168c92 ]--- > > After this occurs, the system becomes extremely unstable, i.e., the > filesystem cannot be read properly anymore (e.g., ssh logins usually do > not work anymore, most binaries just segfault). After a reboot (which > has to be done manually, "shutdown -r now" also segfaults) it works fine > again (until the problem comes back). > > Since the hardware is fairly new, I cannot exclude a hardware defect as > of now. I've thouroughly tested the RAM though and not found any defect > there (ran MemTest86+ for 24 hours). 
One curious thing is that someone > else seems to have run into this before. Searching for the symbols in > the stackframe I came upon this: > > http://pastebin.com/BJbu35H4 > > Which, quoting in full: > > Jul 16 14:28:29 nuc kernel: [ 370.642612] general protection fault: > 0000 [#1] SMP > Jul 16 14:28:29 nuc kernel: [ 370.642657] Modules linked in: arc4 > intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel > iwlmvm kvm mac80211 irqbypass crct10dif_pclmul crc32_pclmul iwlwifi > aesni_intel aes_x86_64 snd_hda_codec_hdmi btusb btrtl > snd_hda_codec_realtek btbcm btintel lrw gf128mul snd_hda_codec_generic > ir_xmp_decoder glue_helper ablk_helper ir_lirc_codec cryptd mei_me > lirc_dev ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder bluetooth > cfg80211 snd_hda_intel ir_sony_decoder mei ir_jvc_decoder ir_rc6_decoder > ir_rc5_decoder ir_nec_decoder lpc_ich snd_hda_codec shpchp > snd_soc_rt5640 snd_hda_core snd_soc_rl6231 snd_hwdep snd_soc_core > snd_compress ac97_bus snd_pcm_dmaengine rc_rc6_mce dw_dmac snd_pcm > nuvoton_cir rc_core snd_timer dw_dmac_core snd elan_i2c soundcore > snd_soc_sst_acpi spi_pxa2xx_platform i2c_designware_platform > i2c_designware_core 8250_dw mac_hid ip6t_REJECT nf_reject_ipv6 > nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT > nf_reject_ipv4 xt_comment nf_log_ipv4 nf_log_common xt_LOG xt_multiport > xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 > xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns > nf_conntrack_broadcast nf_nat_ftp nf_nat sunrpc nf_conntrack_ftp > nf_conntrack iptable_filter ip_tables x_tables autofs4 btrfs xor > raid6_pq i915 i2c_algo_bit drm_kms_helper e1000e syscopyarea sysfillrect > sysimgblt uas fb_sys_fops ptp ahci sdhci_acpi usb_storage libahci drm > pps_core video i2c_hid sdhci hid fjes > Jul 16 14:28:29 nuc kernel: [ 370.643778] CPU: 3 PID: 1505 Comm: dd Not > tainted 4.4.0-31-generic #50-Ubuntu > Jul 16 14:28:29 nuc kernel: [ 370.643822] 
Hardware name: > /D54250WYK, BIOS WYLPT10H.86A.0041.2015.0720.1108 07/20/2015 > Jul 16 14:28:29 nuc kernel: [ 370.643878] task: ffff88040aa90000 ti: > ffff880407b98000 task.ti: ffff880407b98000 > Jul 16 14:28:29 nuc kernel: [ 370.643923] RIP: > 0010:[<ffffffff811eb987>] [<ffffffff811eb987>] kmem_cache_alloc+0x77/0x1f0 > Jul 16 14:28:29 nuc kernel: [ 370.643982] RSP: 0018:ffff880407b9ba80 > EFLAGS: 00010282 > Jul 16 14:28:29 nuc kernel: [ 370.644015] RAX: 0000000000000000 RBX: > 0000000002408040 RCX: 00000000000bb283 > Jul 16 14:28:29 nuc kernel: [ 370.644054] RDX: 00000000000bb282 RSI: > 0000000002408040 RDI: 000000000001a940 > Jul 16 14:28:29 nuc kernel: [ 370.644076] RBP: ffff880407b9bab0 R08: > ffff88041fb9a940 R09: ddff88007f5aff08 > Jul 16 14:28:29 nuc kernel: [ 370.644097] R10: ffff8800d522d060 R11: > ffffffff81ccf1ea R12: 0000000002408040 > Jul 16 14:28:29 nuc kernel: [ 370.644119] R13: ffffffff81243ea1 R14: > ffff88040f08bc00 R15: ffff88040f08bc00 > Jul 16 14:28:29 nuc kernel: [ 370.644141] FS: 00007fa227587700(0000) > GS:ffff88041fb80000(0000) knlGS:0000000000000000 > Jul 16 14:28:29 nuc kernel: [ 370.644165] CS: 0010 DS: 0000 ES: 0000 > CR0: 0000000080050033 > Jul 16 14:28:29 nuc kernel: [ 370.644183] CR2: 0000000000a08000 CR3: > 000000040aaac000 CR4: 00000000001406e0 > Jul 16 14:28:29 nuc kernel: [ 370.644205] Stack: > Jul 16 14:28:29 nuc kernel: [ 370.644213] 01ff8803fef10a00 > 0000000000001000 ffff88007b89f000 ffffea0001ee27c0 > Jul 16 14:28:29 nuc kernel: [ 370.644241] 0000000000000000 > 0000000000000000 ffff880407b9bac8 ffffffff81243ea1 > Jul 16 14:28:29 nuc kernel: [ 370.644270] ffff8804099b8024 > ffff880407b9bb10 ffffffff81244379 0000000107b9bb68 > Jul 16 14:28:29 nuc kernel: [ 370.644298] Call Trace: > Jul 16 14:28:29 nuc kernel: [ 370.644311] [<ffffffff81243ea1>] > alloc_buffer_head+0x21/0x60 > Jul 16 14:28:29 nuc kernel: [ 370.644329] [<ffffffff81244379>] > alloc_page_buffers+0x79/0xe0 > Jul 16 14:28:29 nuc kernel: [ 370.644349] 
[<ffffffff812443fe>] > create_empty_buffers+0x1e/0xc0 > Jul 16 14:28:29 nuc kernel: [ 370.644369] [<ffffffff812987fc>] > ext4_block_write_begin+0x3cc/0x4e0 > Jul 16 14:28:29 nuc kernel: [ 370.644390] [<ffffffff812e8afb>] ? > jbd2__journal_start+0xdb/0x1e0 > Jul 16 14:28:29 nuc kernel: [ 370.644410] [<ffffffff81297c40>] ? > ext4_inode_attach_jinode.part.60+0xb0/0xb0 > Jul 16 14:28:29 nuc kernel: [ 370.644434] [<ffffffff812ccc2d>] ? > __ext4_journal_start_sb+0x6d/0x120 > Jul 16 14:28:29 nuc kernel: [ 370.644456] [<ffffffff8129e61d>] > ext4_da_write_begin+0x15d/0x340 > Jul 16 14:28:29 nuc kernel: [ 370.644477] [<ffffffff8118db4e>] > generic_perform_write+0xce/0x1c0 > Jul 16 14:28:29 nuc kernel: [ 370.644498] [<ffffffff8118f9f2>] > __generic_file_write_iter+0x1a2/0x1e0 > Jul 16 14:28:29 nuc kernel: [ 370.644518] [<ffffffff81292d72>] > ext4_file_write_iter+0x102/0x470 > Jul 16 14:28:29 nuc kernel: [ 370.644540] [<ffffffff81403f37>] ? > iov_iter_zero+0x67/0x200 > Jul 16 14:28:29 nuc kernel: [ 370.644560] [<ffffffff8120c94b>] > new_sync_write+0x9b/0xe0 > Jul 16 14:28:29 nuc kernel: [ 370.644578] [<ffffffff8120c9b6>] > __vfs_write+0x26/0x40 > Jul 16 14:28:29 nuc kernel: [ 370.645377] [<ffffffff8120d339>] > vfs_write+0xa9/0x1a0 > Jul 16 14:28:29 nuc kernel: [ 370.646174] [<ffffffff8120d274>] ? 
> vfs_read+0x114/0x130 > Jul 16 14:28:29 nuc kernel: [ 370.646973] [<ffffffff8120dff5>] > SyS_write+0x55/0xc0 > Jul 16 14:28:29 nuc kernel: [ 370.647766] [<ffffffff8182db32>] > entry_SYSCALL_64_fastpath+0x16/0x71 > Jul 16 14:28:29 nuc kernel: [ 370.648547] Code: 08 65 4c 03 05 23 e8 e1 > 7e 49 83 78 10 00 4d 8b 08 0f 84 29 01 00 00 4d 85 c9 0f 84 20 01 00 00 > 49 63 47 20 48 8d 4a 01 49 8b 3f <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f > 0f 94 c0 84 c0 74 bb 49 63 > Jul 16 14:28:29 nuc kernel: [ 370.650215] RIP [<ffffffff811eb987>] > kmem_cache_alloc+0x77/0x1f0 > Jul 16 14:28:29 nuc kernel: [ 370.650985] RSP <ffff880407b9ba80> > Jul 16 14:28:29 nuc kernel: [ 370.651755] ---[ end trace > 639091250fabe2af ]--- > > Shows also a stacktrace with the same call path, also running on a > (different) Intel NUC, also running a 4.4.0 kernel. This pastebin is > nowhere referenced however, so I'm unsure who found it and where exactly > it was posted. Since the offending process in the unknown guy or girl's > pastebin was dd, however, I believe that he or she tried to deliberately > reproduce the problem. > > The problem occurs only when the system is under heavy disk load for me > (usually after an hour of activity). I've a process running which > frequently does sqlite3 commits about every 10 seconds. Having it run > overnight with almost no load led to no oooops. > > Any and all advice is greatly appreciated. > Cheers, > Johannes > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara <jack@suse.com> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
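The two %r9 values discussed in the message above can also be checked mechanically. The following is a minimal sketch, assuming 48-bit virtual addresses (4-level paging) and assuming that the intended pointers lived in the kernel's direct mapping, whose ffff88... prefix is visible in the other registers of both dumps (for example R08: ffff88046ed9a940). The helper names is_canonical, with_top16_set and flipped_bits are made up for this illustration.

```python
ADDR_BITS = 48  # assumption: 4-level paging, 48-bit virtual addresses


def is_canonical(addr, bits=ADDR_BITS):
    """True if bits 63..(bits-1) of a 64-bit address are all equal."""
    top = addr >> (bits - 1)               # sign bit plus everything above it
    all_ones = (1 << (64 - bits + 1)) - 1
    return top == 0 or top == all_ones


def with_top16_set(addr):
    """Hypothetical intended value: same low 48 bits, upper 16 bits all ones."""
    return addr | (0xFFFF << 48)


def flipped_bits(observed, expected):
    """Bit positions (0 = least significant) at which the two values differ."""
    diff = observed ^ expected
    return [i for i in range(64) if (diff >> i) & 1]


R9_FIRST = 0xFFDB88033BB3A3A8   # %r9 in the NUC6i3SYB oops
R9_SECOND = 0xDDFF88007F5AFF08  # %r9 in the pastebin oops

if __name__ == "__main__":
    for value in (R9_FIRST, R9_SECOND):
        guess = with_top16_set(value)
        print(f"{value:016x} canonical={is_canonical(value)} "
              f"guess={guess:016x} flipped={flipped_bits(value, guess)}")
```

Under these assumptions neither value passes the canonical-address check, which would fit the fault being reported as a general protection fault, and each differs from a plausible direct-map pointer in only two bits of the upper 16. That observation alone does not say whether software or hardware changed those bits.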
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB 2016-10-04 8:41 ` Jan Kara @ 2016-10-04 16:50 ` Johannes Bauer 2016-10-04 17:32 ` Johannes Bauer 0 siblings, 1 reply; 19+ messages in thread From: Johannes Bauer @ 2016-10-04 16:50 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4, linux-mm

On 04.10.2016 10:41, Jan Kara wrote:

> The problem looks like memory corruption:
[...]

Huh, very interesting -- thanks for the walkthrough!

> Anyway, adding linux-mm to CC since this does not look ext4 related but
> rather mm related issue.
>
> Bugs like these are always hard to catch, usually it's some flaky device
> driver, sometimes also flaky HW. You can try running kernel with various
> debug options enabled in a hope to catch the code corrupting memory
> earlier - e.g. CONFIG_DEBUG_PAGEALLOC sometimes catches something,
> CONFIG_SLUB_DEBUG can be useful as well. Another option is to get a
> crashdump when the oops happens (although that's going to be a pain to
> setup on such a small machine) and then look at which places point to
> the corrupted memory - sometimes you can find old structures pointing to
> the place and find the use-after-free issue or stuff like that...

Uhh, that sounds painful. So I'm following Ted's advice and building myself a 4.8 as we speak.

If the problem is fixed, would it be of any help to trace the source by going back to 4.4.0 and reproducing it with the debug options you mentioned? I don't think a memdump would be difficult on the machine (while it certainly has a small form factor, it's got a 1 TB hdd and 16 GB of RAM, so it's not really that small).

Cheers,
Johannes

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB
  2016-10-04 16:50 ` Johannes Bauer
@ 2016-10-04 17:32 ` Johannes Bauer
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Bauer @ 2016-10-04 17:32 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, linux-mm

On 04.10.2016 18:50, Johannes Bauer wrote:

> Uhh, that sounds painful. So I'm following Ted's advice and building
> myself a 4.8 as we speak.

Damn bad idea to build on the unstable target. Lots of gcc segfaults and
weird stuff, even without a kernel panic. The system appears to be
unstable as hell. Wonder how it can even run and how much of the root fs
is already corrupted :-(

Rebuilding 4.8 on a different host.

Cheers,
Johannes

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB
  2016-10-04 17:32 ` Johannes Bauer
@ 2016-10-04 18:45 ` Andrey Korolyov
  -1 siblings, 0 replies; 19+ messages in thread
From: Andrey Korolyov @ 2016-10-04 18:45 UTC (permalink / raw)
  To: Johannes Bauer; +Cc: Jan Kara, linux-ext4, linux-mm

On Tue, Oct 4, 2016 at 8:32 PM, Johannes Bauer <dfnsonfsduifb@gmx.de> wrote:
> On 04.10.2016 18:50, Johannes Bauer wrote:
>
>> Uhh, that sounds painful. So I'm following Ted's advice and building
>> myself a 4.8 as we speak.
>
> Damn bad idea to build on the unstable target. Lots of gcc segfaults and
> weird stuff, even without a kernel panic. The system appears to be
> unstable as hell. Wonder how it can even run and how much of the root fs
> is already corrupted :-(
>
> Rebuilding 4.8 on a different host.

Looks like the platform itself is somewhat faulty: [1]. Also please bear
in mind that standalone memory testers may not expose certain classes of
memory failures; I'd suggest exercising the allocator with gcc runs on
tmpfs, almost the same as you did before. The frequency of crashes due
to wrong pointer contents in the fs cache is most probably a direct
result of its relative memory footprint.

1. https://communities.intel.com/thread/105640

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB
  2016-10-04 18:45 ` Andrey Korolyov
@ 2016-10-04 19:02 ` Johannes Bauer
  -1 siblings, 0 replies; 19+ messages in thread
From: Johannes Bauer @ 2016-10-04 19:02 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: Jan Kara, linux-ext4, linux-mm

On 04.10.2016 20:45, Andrey Korolyov wrote:
> On Tue, Oct 4, 2016 at 8:32 PM, Johannes Bauer <dfnsonfsduifb@gmx.de> wrote:
>> On 04.10.2016 18:50, Johannes Bauer wrote:
>>
>>> Uhh, that sounds painful. So I'm following Ted's advice and building
>>> myself a 4.8 as we speak.
>>
>> Damn bad idea to build on the unstable target. Lots of gcc segfaults and
>> weird stuff, even without a kernel panic. The system appears to be
>> unstable as hell. Wonder how it can even run and how much of the root fs
>> is already corrupted :-(
>>
>> Rebuilding 4.8 on a different host.
>
> Looks like the platform itself is somewhat faulty: [1].

Thanks for the hint, I'll post there as well. The device is less than 4
weeks old, so it's still under full warranty. Maybe it really is the HW
and I'll return it.

> Also please bear
> in mind that standalone memory testers may not expose certain classes of
> memory failures; I'd suggest exercising the allocator with gcc runs on
> tmpfs, almost the same as you did before. The frequency of crashes due
> to wrong pointer contents in the fs cache is most probably a direct
> result of its relative memory footprint.

I will and did, but strangely some kernel building on /dev/shm worked
really nicely. I Ctrl-Ced, rebooted for good measure and rsynced the
4.8.0 to the device. Then, I tried to update-initramfs:

nuc [/lib/modules]: update-initramfs -u
update-initramfs: Generating /boot/initrd.img-4.4.0-21-generic
modinfo: ERROR: could not get modinfo from 'qla3xxx': Invalid argument
Segmentation fault
Segmentation fault
modinfo: ERROR: could not get modinfo from 'mpt3sas': Invalid argument
modinfo: ERROR: could not get modinfo from 'pktcdvd': No such file or directory
Bus error
Bus error
Bus error
Bus error
Bus error
Bus error
Bus error
Bus error
Bus error
Bus error
Bus error
Bus error
[...]
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
update-initramfs: failed for /boot/initrd.img-4.4.0-21-generic with 139.

update-initramfs causes heavy disk I/O, so really maybe it's something
with the disk driver. As of now I really can't get 4.8.0 to even get to
a point where it'd be bootable.

I'll continue fighting on all fronts and report as soon as I learn more.
Thanks for the help, it is very much appreciated.

Cheers,
Joe

^ permalink raw reply [flat|nested] 19+ messages in thread
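[Editorial note on the exit status] The "failed ... with 139" line above is consistent with the "Segmentation fault" messages, assuming the number is a shell-style exit status: shells conventionally report a process killed by signal N as 128 + N, and SIGSEGV is signal 11 on Linux. A minimal sketch of that arithmetic follows; the helper name `decode_shell_status` is made up for illustration and is not part of any tool mentioned in this thread.

```python
import signal


def decode_shell_status(status: int) -> str:
    """Describe a shell-style exit status.

    Assumption: values above 128 follow the common shell convention
    "128 + signal number" for a process killed by a signal.
    """
    if status > 128:
        signum = status - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return f"exit code {status}"


if __name__ == "__main__":
    # 139 - 128 = 11, which is SIGSEGV on Linux.
    print(decode_shell_status(139))
```

By the same convention a process killed by SIGBUS (signal 7 on x86 Linux) would show up as status 135, which matches the "Bus error" lines in spirit, although the log above only reports the final 139.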
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB
  2016-10-04 18:45 ` Andrey Korolyov
@ 2016-10-04 19:55 ` Johannes Bauer
  -1 siblings, 0 replies; 19+ messages in thread
From: Johannes Bauer @ 2016-10-04 19:55 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: Jan Kara, linux-ext4, linux-mm

On 04.10.2016 20:45, Andrey Korolyov wrote:
>> Damn bad idea to build on the unstable target. Lots of gcc segfaults and
>> weird stuff, even without a kernel panic. The system appears to be
>> unstable as hell. Wonder how it can even run and how much of the root fs
>> is already corrupted :-(
>>
>> Rebuilding 4.8 on a different host.
>
> Looks like the platform itself is somewhat faulty: [1]. Also please bear
> in mind that standalone memory testers may not expose certain classes of
> memory failures; I'd suggest exercising the allocator with gcc runs on
> tmpfs, almost the same as you did before. The frequency of crashes due
> to wrong pointer contents in the fs cache is most probably a direct
> result of its relative memory footprint.

So there are some interesting new data points that I couldn't make sense
of. Maybe you can.

First off, 4.8.0 shows the same symptoms. When I try to build 4.8.0 in
/usr/src/linux using make -j4, I get bus errors and segfaults in gcc
pretty soon.

Doing the same thing in /dev/shm, however, builds like a charm. Three
kernels built, all ran through perfectly. Not one try in /usr/src did
that, all my attempts failed.

What could cause this? Faulty hard drive? It's brand new:

Model Family:     Western Digital Red
Device Model:     WDC WD10JFCX-68N6GN0
Firmware Version: 82.00A82

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   181   021    Pre-fail  Always       -       1858
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       178

Or faulty AHCI controller or driver?

[ 9.746277] ahci 0000:00:17.0: version 3.0
[ 9.746499] ahci 0000:00:17.0: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
[ 9.746501] ahci 0000:00:17.0: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst
[ 9.753844] scsi host0: ahci
[ 9.754648] ata1: SATA max UDMA/133 abar m2048@0xdf14d000 port 0xdf14d100 irq 275

I'm super puzzled right now :-(

Cheers,
Johannes

^ permalink raw reply [flat|nested] 19+ messages in thread
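[Editorial note on comparing disk and tmpfs] The contrast described above (builds fail in /usr/src, succeed in /dev/shm) can be checked with something simpler than a kernel build. The sketch below writes random data into a directory, reads it back and compares SHA-256 digests; running it once against a disk-backed directory and once against a tmpfs directory gives a rough comparison. It is only an illustration, not something anyone in the thread ran. The function name `write_and_verify` and the default sizes are made up, and the `posix_fadvise` call is merely a hint, so the read-back may still be served from the page cache rather than from the disk.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory: str, size_mb: int = 64, rounds: int = 3) -> int:
    """Write random data into `directory`, read it back, and return the
    number of rounds whose SHA-256 digest did not match."""
    mismatches = 0
    for _ in range(rounds):
        data = os.urandom(size_mb * 1024 * 1024)
        expected = hashlib.sha256(data).hexdigest()
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # push the data out to the device
                if hasattr(os, "posix_fadvise"):
                    # Advisory only: ask the kernel to drop cached pages so
                    # the read below is more likely to hit the device.
                    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
            with open(path, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
            if actual != expected:
                mismatches += 1
        finally:
            os.unlink(path)
    return mismatches


if __name__ == "__main__":
    # Small demonstration run in the system temp directory.
    print(write_and_verify(tempfile.gettempdir(), size_mb=1, rounds=2))
```

On the machine discussed here one would call it as `write_and_verify("/usr/src")` and `write_and_verify("/dev/shm")`, ideally with a total size large enough that the data cannot all stay cached. A mismatch count of zero does not prove the hardware is healthy; a non-zero count is strong evidence that something below the filesystem is corrupting data.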
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB
  2016-10-04 19:55 ` Johannes Bauer
@ 2016-10-04 20:17 ` Andrey Korolyov
  -1 siblings, 0 replies; 19+ messages in thread
From: Andrey Korolyov @ 2016-10-04 20:17 UTC (permalink / raw)
  To: Johannes Bauer; +Cc: Jan Kara, linux-ext4, linux-mm

> I'm super puzzled right now :-(
>

Here are three strawman ideas off the top of my head, in order of
increasing naivety:
- the disk controller itself corrupts DMA chunks; this could be tested
  against a USB stick/SD card with the same fs, or by switching the disk
  controller to a legacy mode if possible, but the cascading failure
  shown previously would be rather unusual for this,
- SMP could be partially broken in such a manner that it causes
  overlapping accesses under certain conditions; this may be checked
  with 'nosmp',
- disk accesses and the corresponding power spikes are causing a partial
  undervoltage condition somewhere where bits flip relatively freely on
  paths without parity checking, though this could only be attributed to
  an onboard power distributor, not to the power source itself.

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB
  2016-10-04 20:17 ` Andrey Korolyov
@ 2016-10-04 21:54 ` Johannes Bauer
  -1 siblings, 0 replies; 19+ messages in thread
From: Johannes Bauer @ 2016-10-04 21:54 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: Jan Kara, linux-ext4, linux-mm

On 04.10.2016 22:17, Andrey Korolyov wrote:
>> I'm super puzzled right now :-(
>>
>
> Here are three strawman ideas off the top of my head, in order of
> increasing naivety:
> - the disk controller itself corrupts DMA chunks; this could be tested
>   against a USB stick/SD card with the same fs, or by switching the disk
>   controller to a legacy mode if possible, but the cascading failure
>   shown previously would be rather unusual for this,

I'll check tomorrow whether this is possible somehow.

> - SMP could be partially broken in such a manner that it causes
>   overlapping accesses under certain conditions; this may be checked
>   with 'nosmp',

Unfortunately not:

  CC [M]  drivers/infiniband/core/multicast.o
  CC [M]  drivers/infiniband/core/mad.o
drivers/infiniband/core/mad.c: In function ‘ib_mad_port_close’:
drivers/infiniband/core/mad.c:3252:1: internal compiler error: Bus error
 }
 ^

nuc [~]: cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.8.0 root=UUID=f6a792b3-3027-4293-a118-f0df1de9b25c ro ip=:::::eno1:dhcp nosmp

> - disk accesses and the corresponding power spikes are causing a partial
>   undervoltage condition somewhere where bits flip relatively freely on
>   paths without parity checking, though this could only be attributed to
>   an onboard power distributor, not to the power source itself.

Huh, that sounds like "defective hardware" to me, doesn't it?

Cheers and thank you for your help,
Johannes

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB
  2016-10-04 21:54 ` Johannes Bauer
  (?)
@ 2016-10-05 6:20 ` Jan Kara
  -1 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-10-05 6:20 UTC (permalink / raw)
  To: Johannes Bauer; +Cc: Andrey Korolyov, Jan Kara, linux-ext4, linux-mm

On Tue 04-10-16 23:54:24, Johannes Bauer wrote:
> > - disk accesses and corresponding power spikes are causing partial
> > undervoltage condition somewhere where bits are relatively freely
> > flipping on paths without parity checking, though this could be
> > addressed only to an onboard power distributor, not to power source
> > itself.
>
> Huh that sounds like "defective hardware" to me, wouldn't it?

Yeah, from the frequency and the kind of failures, I actually don't think
it's a kernel bug anymore. So I'd also suspect something like bits on the
memory bus starting to flip when the disk is loaded.

If you say compilation on tmpfs is fine - can you try compiling the kernel
in tmpfs in a loop and then, after it has been running smoothly for a
while, start to load the disk by copying a lot of data there? Do the
errors trigger?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 19+ messages in thread
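[Editorial note on the suggested experiment] The test proposed above (a memory-only workload running cleanly, then adding disk load to see whether errors start) can be approximated without a kernel tree. The sketch below is a simplified stand-in and not what was proposed verbatim: instead of compiling in tmpfs it keeps re-hashing a fixed in-memory buffer, first with the disk idle and then while a background thread writes and fsyncs data in a chosen directory. The function name `memory_check_under_disk_load` and all sizes and durations are illustrative assumptions.

```python
import hashlib
import os
import tempfile
import threading
import time


def memory_check_under_disk_load(disk_dir, quiet_s=1.0, loaded_s=1.0, buf_mb=16):
    """Re-hash a fixed in-memory buffer, first with the disk idle and then
    while a background thread writes to `disk_dir`.

    Returns (mismatches_while_quiet, mismatches_while_loaded).
    """
    buf = os.urandom(buf_mb * 1024 * 1024)
    reference = hashlib.sha256(buf).digest()
    stop = threading.Event()

    def disk_load():
        # Keep the disk busy: write and fsync 4 MiB chunks, capping the
        # scratch file at 256 MiB by rewinding.
        chunk = os.urandom(4 * 1024 * 1024)
        with tempfile.TemporaryFile(dir=disk_dir) as f:
            while not stop.is_set():
                f.write(chunk)
                f.flush()
                os.fsync(f.fileno())
                if f.tell() > 256 * 1024 * 1024:
                    f.seek(0)

    def check_for(seconds):
        # Count how often the buffer's digest differs from the reference.
        bad = 0
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            if hashlib.sha256(buf).digest() != reference:
                bad += 1
        return bad

    quiet = check_for(quiet_s)
    writer = threading.Thread(target=disk_load, daemon=True)
    writer.start()
    try:
        loaded = check_for(loaded_s)
    finally:
        stop.set()
        writer.join()
    return quiet, loaded


if __name__ == "__main__":
    print(memory_check_under_disk_load(tempfile.gettempdir(), 0.5, 0.5, buf_mb=4))
```

A result where the first number stays at zero and the second does not would point in the direction suspected above, namely memory contents being disturbed only while the disk is busy. A real kernel build exercises far more memory and CPU than this loop, so a clean run here is weak evidence at best.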
* Re: Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB
  2016-10-04 18:45 ` Andrey Korolyov
@ 2016-10-04 20:18 ` Johannes Bauer
  -1 siblings, 0 replies; 19+ messages in thread
From: Johannes Bauer @ 2016-10-04 20:18 UTC (permalink / raw)
  To: linux-ext4; +Cc: linux-mm

On 04.10.2016 20:45, Andrey Korolyov wrote:

> 1. https://communities.intel.com/thread/105640

Created a thread in the Intel forum, here it is for cross-reference:

https://communities.intel.com/message/425731

Cheers,
Johannes

^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2016-10-05 6:20 UTC | newest]

Thread overview: 19+ messages
2016-10-03 10:52 Frequent ext4 oopses with 4.4.0 on Intel NUC6i3SYB Johannes Bauer
2016-10-04 3:18 ` Theodore Ts'o
2016-10-04 8:41 ` Jan Kara
2016-10-04 16:50 ` Johannes Bauer
2016-10-04 17:32 ` Johannes Bauer
2016-10-04 17:32 ` Johannes Bauer
2016-10-04 18:45 ` Andrey Korolyov
2016-10-04 18:45 ` Andrey Korolyov
2016-10-04 19:02 ` Johannes Bauer
2016-10-04 19:02 ` Johannes Bauer
2016-10-04 19:55 ` Johannes Bauer
2016-10-04 19:55 ` Johannes Bauer
2016-10-04 20:17 ` Andrey Korolyov
2016-10-04 20:17 ` Andrey Korolyov
2016-10-04 21:54 ` Johannes Bauer
2016-10-04 21:54 ` Johannes Bauer
2016-10-05 6:20 ` Jan Kara
2016-10-04 20:18 ` Johannes Bauer
2016-10-04 20:18 ` Johannes Bauer