From: Justin Piszcz <jpiszcz@lucidpixels.com>
To: linux-kernel@vger.kernel.org
Cc: xfs@oss.sgi.com, Alan Piszcz <ap@solarrain.com>
Subject: 3.1-rc4: spectacular kernel errors / filesystem crash
Date: Sun, 11 Sep 2011 05:40:09 -0400 (EDT)
Message-ID: <alpine.DEB.2.02.1109110511250.8626@p34.internal.lan>

Hi,

Over the past 24-48 hours I was running some CPU-intensive jobs and there
was heavy I/O on the RAID (9750-24i4e + a RAID6). I believe most of the
problem started when I included many kernel options as modules (before, I
only compiled in [*] the drivers I used). Something appears to have gone
awry in the kernel, and afterwards disks started going in and out, XFS shut
down, etcetera.

I'm opening a case with LSI to see what happened with the 3ware card;
however, after a power cycle, everything came back OK (the drives and HW
are physically OK). It is rebuilding onto those two drives with CFG-OP-FAIL,
but other than that everything 'seems' OK; I still need to do an fsck.

Something went wrong in the kernel and caused a cascading effect of errors.
This occurred (I believe) when I started to run a lot of encoding jobs;
however, I was doing a lot of data transfer for the past 24-48 hours on the
RAID array. The system (separate SSD/EXT4) remained unaffected, but other
weird stuff happened as well.

I still see these in the logs after the reboot (not often). For example,
while the RAID controller is rebuilding the two drives with CFG-OP-FAIL
(the physical drives are 100% healthy):

[ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

So, my plan:
1. Report this error to LKML+XFS mailing lists.
2. Open case with LSI support.
3. Recompile the kernel the way I have for many years [only compile in
   the options that you need [*] and do not compile drivers as modules]
4. Reboot Linux systems and see if this recurs under the same workload,
   after the RAID is done rebuilding.

--

These errors are quite long, so I will upload them to HTTP and paste the
relevant bits below.

--

URLs for FULL logs:

1. tw_cli /cX show diag:
http://home.comcast.net/~jpiszcz/20110911/show_diag.txt

2. Full kernel log (and previous morning of kernel crash)
http://home.comcast.net/~jpiszcz/20110911/kern.log.txt

3. tw_cli /cX show all
http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt

--

Summary (what seems to have occurred, have not done a full analysis yet)

1. 3ware card freaked out due to kernel/RCU/APIC(?) errors
2. Then, the time source went unstable (this happens with weird kernel bugs
   on many different hosts, I have seen this over time).
3. Then, on the 3ware card, drives started leaving and being re-inserted by
   themselves, and XFS went off-line to protect the filesystem due to the
   3ware issues

--

3ware/RAID--

Interesting errors: I've never seen this before on a 3ware RAID controller,
at least from what I can remember, and I've been using 3ware cards for many
years.

p2 CFG-OP-FAIL - 2.73 TB SATA 2 - Hitachi HDS723030AL
p3 CFG-OP-FAIL - 2.73 TB SATA 3 - Hitachi HDS723030AL

--

Kernel/ERRORS:

FWIW it all seems to start during an encoding job around 21:00:

Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down
Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here ]------------
Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not tainted 3.1.0-rc4 #1
Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
Sep 10 20:59:39 p34 kernel: [531189.671424]  [<ffffffff810379ba>] warn_slowpath_common+0x7a/0xb0
Sep 10 20:59:39 p34 kernel: [531189.671427]  [<ffffffff81037a91>] warn_slowpath_fmt+0x41/0x50
Sep 10 20:59:39 p34 kernel: [531189.671433]  [<ffffffff815d7874>] ? schedule+0x2e4/0x950
Sep 10 20:59:39 p34 kernel: [531189.671436]  [<ffffffff814e5aff>] dev_watchdog+0x23f/0x250
Sep 10 20:59:39 p34 kernel: [531189.671440]  [<ffffffff81043872>] run_timer_softirq+0xf2/0x220
Sep 10 20:59:39 p34 kernel: [531189.671443]  [<ffffffff814e58c0>] ? qdisc_reset+0x50/0x50
Sep 10 20:59:39 p34 kernel: [531189.671446]  [<ffffffff8103d208>] __do_softirq+0x98/0x120
Sep 10 20:59:39 p34 kernel: [531189.671448]  [<ffffffff8103d345>] run_ksoftirqd+0xb5/0x160
Sep 10 20:59:39 p34 kernel: [531189.671454]  [<ffffffff8103d290>] ? __do_softirq+0x120/0x120
Sep 10 20:59:39 p34 kernel: [531189.671458]  [<ffffffff810523b7>] kthread+0x87/0x90
Sep 10 20:59:39 p34 kernel: [531189.671462]  [<ffffffff815dbdb4>] kernel_thread_helper+0x4/0x10
Sep 10 20:59:39 p34 kernel: [531189.671465]  [<ffffffff81052330>] ? kthread_worker_fn+0x130/0x130
Sep 10 20:59:39 p34 kernel: [531189.671467]  [<ffffffff815dbdb0>] ? gs_change+0xb/0xb
Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba ]---
Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]

--

Currently...

After all of this happened, I stopped all I/O and all processes on the
system, shut down the host, removed the power, and powered it back up. The
drives that showed CFG-OP-FAIL before now show as REBUILDING; I am waiting
for them to rebuild before doing anything else.

Justin.
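
--

The sequence in the excerpt above (NIC watchdog timeout, adapter reset, soft
lockup) can be pulled out of a saved kern.log with a short script. What
follows is only a minimal sketch in Python, assuming the syslog line format
shown in the excerpt. The function name extract_events and the list of
marker strings are illustrative choices, not part of any tool mentioned in
this thread.

```python
import re

# Matches syslog-style kernel lines such as:
#   Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - ...
# The optional whitespace after "[" covers short uptimes like "[ 1062.925904]".
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<msg>.*)$"
)

# Assumption: these markers are simply the ones visible in the excerpt;
# extend the tuple for other messages of interest.
EVENT_MARKERS = (
    "WARNING: at",
    "NETDEV WATCHDOG",
    "BUG: soft lockup",
    "Reset adapter",
    "NIC Link is",
)


def extract_events(lines):
    """Return (uptime_seconds, wall_clock_stamp, message) tuples for lines
    containing one of EVENT_MARKERS, sorted by kernel uptime."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a kernel log line in the expected format
        msg = match.group("msg")
        if any(marker in msg for marker in EVENT_MARKERS):
            events.append((float(match.group("uptime")), match.group("stamp"), msg))
    events.sort(key=lambda event: event[0])
    return events


if __name__ == "__main__":
    sample = [
        "Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]",
        "Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:",
        "Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out",
    ]
    for uptime, stamp, msg in extract_events(sample):
        print(f"{stamp}  [{uptime:.6f}]  {msg}")
```

To run it against the full log, pass an open file instead of the sample
list, e.g. extract_events(open("kern.log.txt")). Sorting by the uptime field
rather than the wall-clock stamp keeps the ordering stable within a single
boot even if the time source goes unstable, as described in the summary.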