* 3.1-rc4: spectacular kernel errors / filesystem crash
@ 2011-09-11  9:40 ` Justin Piszcz
  0 siblings, 0 replies; 18+ messages in thread
From: Justin Piszcz @ 2011-09-11 9:40 UTC
  To: linux-kernel; +Cc: xfs, Alan Piszcz

Hi,

Over the past 24-48 hours I was running some CPU-intensive jobs and there
was heavy I/O on the RAID (9750-24i4e + a RAID6).

I believe most of the problem started when I included many kernel options as
modules (before, I only compiled in [*] the drivers I used). Something appears
to have gone awry in the kernel, and afterwards disks started going in and
out, XFS shut down, et cetera.

I'm opening a case with LSI to see what happened with the 3ware card;
however, after a power cycle, everything came back OK (the drives and HW are
physically OK). It is rebuilding onto the two drives that showed CFG-OP-FAIL,
but other than that everything 'seems' OK. I still need to do an fsck.

Something went wrong in the kernel and caused a cascading effect of errors.
This occurred (I believe) when I started to run a lot of encoding jobs;
however, I had also been doing a lot of data transfer on the RAID array for
the past 24-48 hours. The system disk (separate SSD/EXT4) remained unaffected,
but other weird stuff happened as well.

I still see this in the logs after the reboot (not often; the RAID controller
is currently rebuilding the two drives that showed CFG-OP-FAIL, and the
physical drives are 100% healthy):

[ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

So, my plan:

1. Report this error to the LKML+XFS mailing lists.
2. Open a case with LSI support.
3. Recompile the kernel the way I have for many years [only compile in the
   options that you need [*] and do not compile drivers as modules].
4. Reboot the Linux system and see if this recurs under the same workload,
   after the RAID is done rebuilding.

--

These errors are quite long, so I have uploaded the full logs over HTTP and
pasted the relevant bits below.

--

URLs for FULL logs:

1. tw_cli /cX show diag:
http://home.comcast.net/~jpiszcz/20110911/show_diag.txt

2. Full kernel log (and previous morning of kernel crash)
http://home.comcast.net/~jpiszcz/20110911/kern.log.txt

3. tw_cli /cX show all
http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt

--

Summary (what seems to have occurred; I have not done a full analysis yet):

1. The 3ware card freaked out due to kernel/RCU/APIC(?) errors.

2. Then the time source went unstable (this happens with weird kernel bugs
   on many different hosts; I have seen this over time).

3. Then, on the 3ware card, drives started leaving and being re-inserted
   by themselves, and XFS went off-line to protect the filesystem due to the
   3ware issues.

--

3ware/RAID-- Interesting errors:

I've never seen this before on a 3ware RAID controller, at least from what
I can remember, and I've been using 3ware cards for many years.

p2    CFG-OP-FAIL    -    2.73 TB    SATA    2    -    Hitachi HDS723030AL
p3    CFG-OP-FAIL    -    2.73 TB    SATA    3    -    Hitachi HDS723030AL

--

Kernel/ERRORS:

FWIW it all seems to have started during an encoding job around 21:00:

Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down
Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here ]------------
Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not tainted 3.1.0-rc4 #1
Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
Sep 10 20:59:39 p34 kernel: [531189.671424]  [<ffffffff810379ba>] warn_slowpath_common+0x7a/0xb0
Sep 10 20:59:39 p34 kernel: [531189.671427]  [<ffffffff81037a91>] warn_slowpath_fmt+0x41/0x50
Sep 10 20:59:39 p34 kernel: [531189.671433]  [<ffffffff815d7874>] ? schedule+0x2e4/0x950
Sep 10 20:59:39 p34 kernel: [531189.671436]  [<ffffffff814e5aff>] dev_watchdog+0x23f/0x250
Sep 10 20:59:39 p34 kernel: [531189.671440]  [<ffffffff81043872>] run_timer_softirq+0xf2/0x220
Sep 10 20:59:39 p34 kernel: [531189.671443]  [<ffffffff814e58c0>] ? qdisc_reset+0x50/0x50
Sep 10 20:59:39 p34 kernel: [531189.671446]  [<ffffffff8103d208>] __do_softirq+0x98/0x120
Sep 10 20:59:39 p34 kernel: [531189.671448]  [<ffffffff8103d345>] run_ksoftirqd+0xb5/0x160
Sep 10 20:59:39 p34 kernel: [531189.671454]  [<ffffffff8103d290>] ? __do_softirq+0x120/0x120
Sep 10 20:59:39 p34 kernel: [531189.671458]  [<ffffffff810523b7>] kthread+0x87/0x90
Sep 10 20:59:39 p34 kernel: [531189.671462]  [<ffffffff815dbdb4>] kernel_thread_helper+0x4/0x10
Sep 10 20:59:39 p34 kernel: [531189.671465]  [<ffffffff81052330>] ? kthread_worker_fn+0x130/0x130
Sep 10 20:59:39 p34 kernel: [531189.671467]  [<ffffffff815dbdb0>] ? gs_change+0xb/0xb
Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba ]---
Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]

--

Currently...

After all of this happened, I stopped all I/O and all processes on the
system, shut down the host, removed the power, and powered it back up. The
drives that showed CFG-OP-FAIL before now show as REBUILDING. I am waiting for
them to rebuild before doing anything else.

Justin.
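When reading a long kern.log like the one linked above, it can help to pull out only the events that mark the start of the cascade. The following is a minimal Python sketch, assuming syslog-formatted kernel lines of the shape shown in this report; the marker strings come from the excerpt above, while the labels, the `classify` and `timeline` helpers, and the `SAMPLE` list are hypothetical names chosen for illustration.

```python
import re

# Matches syslog-style kernel lines such as:
#   Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - ...
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<msg>.*)$"
)

# Substrings of interest taken from the excerpt; the labels are illustrative.
MARKERS = {
    "NETDEV WATCHDOG": "tx-timeout",
    "BUG: soft lockup": "soft-lockup",
    "WARNING: at": "warning",
    "NIC Link is": "link-change",
}


def classify(line):
    """Return (stamp, uptime, label, msg) for an interesting line, else None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    msg = match.group("msg")
    for needle, label in MARKERS.items():
        if needle in msg:
            return (match.group("stamp"), float(match.group("uptime")), label, msg)
    return None


def timeline(lines):
    """Keep only interesting events and order them by kernel uptime."""
    events = [event for event in map(classify, lines) if event is not None]
    return sorted(events, key=lambda event: event[1])


# A few lines copied from the report above.
SAMPLE = [
    "Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down",
    "Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.",
    "Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()",
    "Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out",
    "Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX",
    "Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]",
]

if __name__ == "__main__":
    for stamp, uptime, label, msg in timeline(SAMPLE):
        print(f"{stamp}  [{uptime:>14.6f}]  {label:<12} {msg}")
```

Sorting by the bracketed uptime rather than the wall-clock stamp is a deliberate choice here, since the report mentions the time source going unstable.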
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-11  9:40 ` Justin Piszcz
@ 2011-09-13  3:59   ` Jesse Brandeburg
  -1 siblings, 0 replies; 18+ messages in thread
From: Jesse Brandeburg @ 2011-09-13 3:59 UTC
  To: Justin Piszcz; +Cc: linux-kernel, xfs, Alan Piszcz, NetDEV list

Added netdev because this appears to start with an igb tx hang.

On Sun, Sep 11, 2011 at 2:40 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> [...]
> Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
> Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
> Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
> [...]
> Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
> Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]
> [...]
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash 2011-09-13 3:59 ` Jesse Brandeburg @ 2011-09-13 4:05 ` Eric Dumazet -1 siblings, 0 replies; 18+ messages in thread From: Eric Dumazet @ 2011-09-13 4:05 UTC (permalink / raw) To: Jesse Brandeburg Cc: Justin Piszcz, linux-kernel, xfs, Alan Piszcz, NetDEV list Le lundi 12 septembre 2011 à 20:59 -0700, Jesse Brandeburg a écrit : > added netdev because it appears to start with an igb tx hang > > On Sun, Sep 11, 2011 at 2:40 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote: > > Hi, > > > > Over the past 24-48 hours I was running some CPU-intenstive jobs and there > > was heavy I/O on the RAID (9750-24i4e + a RAID6).. > > > > I believe most of the problem started when I included many kernel options as > > modules (before I only compiled in [*] the drivers I used), there appears to > > have something to gone awry in the kernel and then afterwards, disks started > > going in and out, XFS shut down, etcera. > > > > I'm opening a case with LSI to see what happened with the 3ware card; > > however, after a power cycle, everything came back OK (the drives and HW) is > > physically OK, it is rebuilding onto those two drives with CFG-OP-FAIL but > > other than that, everything 'seems' OK, still need to do an fsck. > > > > Something went wrong in the kernel and caused a cascading effect of errors, > > this occurred (I believe) when I started to run a lot of encoding jobs; > > however, I was doing a lot of data transfer for the past 24-48 hours on the > > RAID array, the system (separate SSD/EXT4) remained unaffected but other > > weird stuff happened as well.. > > > > I still see these in the logs as well after the reboot (not often; but e.g., > > the RAID controller is rebuilding from the two drives with CFG-OPT-FAIL (the > > physical drives are 100% healthy): > > > > [ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed. This is likely a > > firmware bug on this device. Contact the card vendor for a firmware update. 
> >
> > So, my plan:
> >
> > 1. Report this error to LKML+XFS mailing lists.
> > 2. Open case with LSI support.
> > 3. Recompile the kernel how I used for many years [only compile in options
> > that you need [*] and do not compile drivers as modules]
> > 4. Reboot Linux systems and see if this recurs again under the same
> > workload, after the RAID is done rebuilding.
> >
> > --
> >
> > So these errors are quite long, will upload to HTTP and paste the relevant
> > bits below.
> >
> > --
> >
> > URLs for FULL logs:
> >
> > 1. tw_cli /cX show diag:
> > http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
> >
> > 2. Full kernel log (and previous morning of kernel crash)
> > http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
> >
> > 3. tw_cli /cX show all
> > http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
> >
> > --
> >
> > Summary (what seems to have occurred, have not done a full analysis yet)
> >
> > 1. 3ware card freaked out due to kernel/RCU/APIC(?) errors
> >
> > 2. Then, the time source went unstable (this happens with weird kernel bugs
> > on many different hosts, I have seen this over time).
> >
> > 3. Then, on the 3ware card, drives started leaving and being re-inserted
> > by themselves, XFS went off-line to protect the filesystem due to the
> > 3ware issues
> >
> > --
> >
> > 3ware/RAID-- Interesting errors:
> >
> > I've never seen this before on a 3ware RAID controller, at least from what
> > I can remember and I've been using 3ware cards for many years..
> >
> > p2 CFG-OP-FAIL - 2.73 TB SATA 2 - Hitachi HDS723030AL
> > p3 CFG-OP-FAIL - 2.73 TB SATA 3 - Hitachi HDS723030AL
> >
> > --
> >
> > Kernel/ERRORS:
> >
> > FWIW it all seems to start during an encoding job around 21:00:
> >
> > Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC
> > Link is Down
> > Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO
> > (0x04:0x002B): Verify completed:unit=0.
> > Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here
> > ]------------
> > Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at
> > net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
> > Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
> > Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb):
> > transmit queue 5 timed out
> > Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod
> > tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio
> > snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib
> > snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event
> > snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev
> > serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi
> > i7core_edac edac_core video
> > Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not
> > tainted 3.1.0-rc4 #1
> > Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
> > Sep 10 20:59:39 p34 kernel: [531189.671424] [<ffffffff810379ba>]
> > warn_slowpath_common+0x7a/0xb0
> > Sep 10 20:59:39 p34 kernel: [531189.671427] [<ffffffff81037a91>]
> > warn_slowpath_fmt+0x41/0x50
> > Sep 10 20:59:39 p34 kernel: [531189.671433] [<ffffffff815d7874>] ?
> > schedule+0x2e4/0x950
> > Sep 10 20:59:39 p34 kernel: [531189.671436] [<ffffffff814e5aff>]
> > dev_watchdog+0x23f/0x250
> > Sep 10 20:59:39 p34 kernel: [531189.671440] [<ffffffff81043872>]
> > run_timer_softirq+0xf2/0x220
> > Sep 10 20:59:39 p34 kernel: [531189.671443] [<ffffffff814e58c0>] ?
> > qdisc_reset+0x50/0x50
> > Sep 10 20:59:39 p34 kernel: [531189.671446] [<ffffffff8103d208>]
> > __do_softirq+0x98/0x120
> > Sep 10 20:59:39 p34 kernel: [531189.671448] [<ffffffff8103d345>]
> > run_ksoftirqd+0xb5/0x160
> > Sep 10 20:59:39 p34 kernel: [531189.671454] [<ffffffff8103d290>] ?
> > __do_softirq+0x120/0x120
> > Sep 10 20:59:39 p34 kernel: [531189.671458] [<ffffffff810523b7>]
> > kthread+0x87/0x90
> > Sep 10 20:59:39 p34 kernel: [531189.671462] [<ffffffff815dbdb4>]
> > kernel_thread_helper+0x4/0x10
> > Sep 10 20:59:39 p34 kernel: [531189.671465] [<ffffffff81052330>] ?
> > kthread_worker_fn+0x130/0x130
> > Sep 10 20:59:39 p34 kernel: [531189.671467] [<ffffffff815dbdb0>] ?
> > gs_change+0xb/0xb
> > Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba
> > ]---
> > Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset
> > adapter
> > Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000
> > Mbps Full Duplex, Flow Control: RX/TX
> > Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck
> > for 22s! [kswapd0:947]
> >
> > --
> >
> > URLs for FULL logs:
> >
> > 1. tw_cli /cX show diag:
> > http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
> >
> > 2. Full kernel log (and previous morning of kernel crash)
> > http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
> >
> > 3. tw_cli /cX show all
> > http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
> >
> > --
> >
> > Currently...
> >
> > After all of this happened, I stopped all I/O on the system/all processes,
> > etc
> > I shut down the host, removed the power, powered it back up, now the drives
> > that showed CFG-OP-FAIL before now show as REBUILDING, I am waiting for them
> > to rebuild before doing anything else.
> >
> > Justin.
> >

Please Justin make sure you pulled commit

commit ed2888e906b56769b4ffabb9c577190438aa68b8
Author: Jon Mason <mason@myri.com>
Date:   Thu Sep 8 16:41:18 2011 -0500

    PCI: Remove MRRS modification from MPS setting code

    Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
    massive negative ramifications on some devices.  Without knowing which
    devices have this issue, do not modify from the default value when
    walking the PCI-E bus in pcie_bus_safe mode.  Also, make pcie_bus_safe
    the default procedure.

    Tested-by: Sven Schnelle <svens@stackframe.org>
    Tested-by: Simon Kirby <sim@hostway.ca>
    Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
    Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
    Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.
    References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
    Signed-off-by: Jon Mason <mason@myri.com>
    Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
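
A reading aid for the commit message above, not part of the original mail: a minimal C sketch of why "MRRS of 0" means 128 bytes. It assumes the standard PCIe Device Control register layout (Max_Payload_Size in bits 7:5, Max_Read_Request_Size in bits 14:12, each encoding 128 << n bytes). The macro and function names are hypothetical and chosen for illustration; they are not the kernel's. The last helper mirrors the "performance" rule described in the patch attached later in this thread (MRRS capped by both the bus MPS and the device's MPSS).

```c
#include <stdint.h>

/* Assumed field layout of the PCIe Device Control register. */
#define DEVCTL_MPS_MASK   0x00e0u  /* bits 7:5,   Max_Payload_Size      */
#define DEVCTL_MPS_SHIFT  5
#define DEVCTL_MRRS_MASK  0x7000u  /* bits 14:12, Max_Read_Request_Size */
#define DEVCTL_MRRS_SHIFT 12

/* Both fields store n, meaning 128 << n bytes; n == 0 is 128 bytes. */
static inline unsigned int devctl_mps_bytes(uint16_t devctl)
{
	return 128u << ((devctl & DEVCTL_MPS_MASK) >> DEVCTL_MPS_SHIFT);
}

static inline unsigned int devctl_mrrs_bytes(uint16_t devctl)
{
	return 128u << ((devctl & DEVCTL_MRRS_MASK) >> DEVCTL_MRRS_SHIFT);
}

/*
 * Hypothetical helper: the MRRS (in bytes) that "performance" mode would
 * aim for, i.e. the smaller of the bus MPS (bytes) and the device's
 * supported payload size, 128 << mpss.
 */
static inline unsigned int mrrs_for_performance(unsigned int mps_bytes,
						unsigned int mpss)
{
	unsigned int dev_mpss_bytes = 128u << mpss;

	return mps_bytes < dev_mpss_bytes ? mps_bytes : dev_mpss_bytes;
}
```

With these helpers, a Device Control value whose bits 14:12 are all zero decodes to a 128-byte read request size, which is the setting the commit message says some devices handle badly.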
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13  4:05 ` Eric Dumazet
@ 2011-09-13 14:54 ` Justin Piszcz
From: Justin Piszcz @ 2011-09-13 14:54 UTC
To: Eric Dumazet
Cc: Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs, linux-kernel

On Tue, 13 Sep 2011, Eric Dumazet wrote:

> Please Justin make sure you pulled commit
>
> commit ed2888e906b56769b4ffabb9c577190438aa68b8
> Author: Jon Mason <mason@myri.com>
> Date: Thu Sep 8 16:41:18 2011 -0500
>
> PCI: Remove MRRS modification from MPS setting code
>
> Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
> massive negative ramifications on some devices. Without knowing which
> devices have this issue, do not modify from the default value when
> walking the PCI-E bus in pcie_bus_safe mode. Also, make pcie_bus_safe
> the default procedure.
>
> Tested-by: Sven Schnelle <svens@stackframe.org>
> Tested-by: Simon Kirby <sim@hostway.ca>
> Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.
> References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
> Signed-off-by: Jon Mason <mason@myri.com>
> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Hello,

I found this commit here:
http://permalink.gmane.org/gmane.linux.kernel.pci/11700

Applied:
# patch -p1 < ../ed2888e906b56769b4ffabb9c577190438aa68b8.txt
patching file drivers/pci/probe.c

I will update this thread if the problem recurs, can someone also please advise
which DEBUG options I should have enabled to catch further SLAB/RCU issues?

So far, I have the following enabled:

CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SLAB_LEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VIRTUAL=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_PER_CPU_MAPS=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_RODATA=y

Thanks,

Justin.
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13 14:54 ` Justin Piszcz
@ 2011-09-13 14:58 ` Eric Dumazet
From: Eric Dumazet @ 2011-09-13 14:58 UTC
To: Justin Piszcz
Cc: Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs, linux-kernel

2011/9/13 Justin Piszcz <jpiszcz@lucidpixels.com>:
>
> I found this commit here:
> http://permalink.gmane.org/gmane.linux.kernel.pci/11700
>
> Applied:
> # patch -p1 < ../ed2888e906b56769b4ffabb9c577190438aa68b8.txt patching file
> drivers/pci/probe.c
>
>

Oh, I should have sent the git anchor you can use instead of searching
the web:

git pull https://github.com/torvalds/linux.git
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13 14:54 ` Justin Piszcz
@ 2011-09-13 15:35 ` Jon Mason
From: Jon Mason @ 2011-09-13 15:35 UTC
To: Justin Piszcz
Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs, linux-kernel

On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>
> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>
>> Please Justin make sure you pulled commit
>> commit ed2888e906b56769b4ffabb9c577190438aa68b8
>> Author: Jon Mason <mason@myri.com>
>> Date: Thu Sep 8 16:41:18 2011 -0500
>>
>> PCI: Remove MRRS modification from MPS setting code
>>
>> Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>> massive negative ramifications on some devices. Without knowing which
>> devices have this issue, do not modify from the default value when
>> walking the PCI-E bus in pcie_bus_safe mode. Also, make pcie_bus_safe
>> the default procedure.
>>
>> Tested-by: Sven Schnelle <svens@stackframe.org>
>> Tested-by: Simon Kirby <sim@hostway.ca>
>> Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Reported-and-tested-by: Niels Ole Salscheider
>> <niels_ole@salscheider-online.
>> References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>> Signed-off-by: Jon Mason <mason@myri.com>
>> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
>> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> Hello,
>
> I found this commit here:
> http://permalink.gmane.org/gmane.linux.kernel.pci/11700

This is an early version of the patch.  This is the patch that you want:
https://github.com/torvalds/linux/commit/ed2888e906b56769b4ffabb9c577190438aa68b8

It appears that this patch didn't make it to lkml or linux-pci list
due to kernel.org DNS being down when it was sent.

Thanks,
Jon

>
> Applied:
> # patch -p1 < ../ed2888e906b56769b4ffabb9c577190438aa68b8.txt patching file
> drivers/pci/probe.c
>
> I will update this thread if the problem recurs, can someone also please
> advise
> which DEBUG options I should have enabled to catch further SLAB/RCU issues?
>
> So far, I have the following enabled:
>
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_HAVE_DMA_API_DEBUG=y
> CONFIG_X86_DEBUGCTLMSR=y
> CONFIG_DEBUG_FS=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_SLAB=y
> CONFIG_DEBUG_SLAB_LEAK=y
> CONFIG_DEBUG_KMEMLEAK=y
> CONFIG_DEBUG_STACK_USAGE=y
> CONFIG_DEBUG_BUGVERBOSE=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_VM=y
> CONFIG_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_MEMORY_INIT=y
> CONFIG_DEBUG_PER_CPU_MAPS=y
> CONFIG_DEBUG_PAGEALLOC=y
> CONFIG_DEBUG_STACKOVERFLOW=y
> CONFIG_DEBUG_RODATA=y
>
> Thanks,
>
> Justin.
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13 15:35 ` Jon Mason
@ 2011-09-13 15:42 ` Justin Piszcz
From: Justin Piszcz @ 2011-09-13 15:42 UTC
To: Jon Mason
Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs, linux-kernel

On Tue, 13 Sep 2011, Jon Mason wrote:

> On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>>
>>
>> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>>
>>> Please Justin make sure you pulled commit
>>> commit ed2888e906b56769b4ffabb9c577190438aa68b8
>>> Author: Jon Mason <mason@myri.com>
>>> Date: Thu Sep 8 16:41:18 2011 -0500
>>>
>>> PCI: Remove MRRS modification from MPS setting code
>>>
>>> Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>>> massive negative ramifications on some devices. Without knowing which
>>> devices have this issue, do not modify from the default value when
>>> walking the PCI-E bus in pcie_bus_safe mode. Also, make pcie_bus_safe
>>> the default procedure.
>>>
>>> Tested-by: Sven Schnelle <svens@stackframe.org>
>>> Tested-by: Simon Kirby <sim@hostway.ca>
>>> Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>>> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>>> Reported-and-tested-by: Niels Ole Salscheider
>>> <niels_ole@salscheider-online.
>>> References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>>> Signed-off-by: Jon Mason <mason@myri.com>
>>> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
>>> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>>
>> Hello,
>>
>> I found this commit here:
>> http://permalink.gmane.org/gmane.linux.kernel.pci/11700
>
> This is an early version of the patch. This is the patch that you want:
> https://github.com/torvalds/linux/commit/ed2888e906b56769b4ffabb9c577190438aa68b8
>
> It appears that this patch didn't make it to lkml or linux-pci list
> due to kernel.org DNS being down when it was sent.
>
> Thanks,
> Jon

I need to learn how to use git at some point, can you please provide plain
text patches so I can apply them and reboot?

Justin.
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13 15:42 ` Justin Piszcz
@ 2011-09-13 15:51 ` Jon Mason
From: Jon Mason @ 2011-09-13 15:51 UTC
To: Justin Piszcz
Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs, linux-kernel

On Tue, Sep 13, 2011 at 10:42 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>
> On Tue, 13 Sep 2011, Jon Mason wrote:
>
>> On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com>
>> wrote:
>>>
>>>
>>> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>>>
>>>> Please Justin make sure you pulled commit
>>>> commit ed2888e906b56769b4ffabb9c577190438aa68b8
>>>> Author: Jon Mason <mason@myri.com>
>>>> Date: Thu Sep 8 16:41:18 2011 -0500
>>>>
>>>> PCI: Remove MRRS modification from MPS setting code
>>>>
>>>> Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>>>> massive negative ramifications on some devices. Without knowing which
>>>> devices have this issue, do not modify from the default value when
>>>> walking the PCI-E bus in pcie_bus_safe mode. Also, make pcie_bus_safe
>>>> the default procedure.
>>>>
>>>> Tested-by: Sven Schnelle <svens@stackframe.org>
>>>> Tested-by: Simon Kirby <sim@hostway.ca>
>>>> Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>>>> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>>>> Reported-and-tested-by: Niels Ole Salscheider
>>>> <niels_ole@salscheider-online.
>>>> References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>>>> Signed-off-by: Jon Mason <mason@myri.com>
>>>> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
>>>> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>>>
>>> Hello,
>>>
>>> I found this commit here:
>>> http://permalink.gmane.org/gmane.linux.kernel.pci/11700
>>
>> This is an early version of the patch. This is the patch that you want:
>>
>> https://github.com/torvalds/linux/commit/ed2888e906b56769b4ffabb9c577190438aa68b8
>>
>> It appears that this patch didn't make it to lkml or linux-pci list
>> due to kernel.org DNS being down when it was sent.
>>
>> Thanks,
>> Jon
>
> I need to learn how to use git at some point, can you please provide plain
> text patches so I can apply them and reboot?
>
> Justin.

I've attached the 2 patches I asked Linus to include into 3.1-rc6.
Let me know if there are any issues.

Thanks,
Jon

[-- Attachment #2: 0001-Fix-pointer-dereference-before-call-to-pcie_bus_conf.patch --]
[-- Type: text/x-patch, Size: 2344 bytes --]

From cf822aed99fd8851d82ae5f2df11c29b79e316c8 Mon Sep 17 00:00:00 2001
From: Shyam Iyer <shyam.iyer.t@gmail.com>
Date: Wed, 31 Aug 2011 12:21:42 -0400
Subject: [PATCH 1/2] Fix pointer dereference before call to
 pcie_bus_configure_settings

There is a potential NULL pointer dereference in calls to
pcie_bus_configure_settings due to attempts to access pci_bus self
variables when the self pointer is NULL.  To correct this, verify that
the self pointer in pci_bus is non-NULL before dereferencing it.

Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Shyam Iyer <shyam_iyer@dell.com>
Signed-off-by: Jon Mason <mason@myri.com>
---
 arch/x86/pci/acpi.c              |    9 +++++++--
 drivers/pci/hotplug/pcihp_slot.c |    4 +++-
 drivers/pci/probe.c              |    3 ---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index c953302..039d913 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -365,8 +365,13 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_pci_root *root)
 	 */
 	if (bus) {
 		struct pci_bus *child;
-		list_for_each_entry(child, &bus->children, node)
-			pcie_bus_configure_settings(child, child->self->pcie_mpss);
+		list_for_each_entry(child, &bus->children, node) {
+			struct pci_dev *self = child->self;
+			if (!self)
+				continue;
+
+			pcie_bus_configure_settings(child, self->pcie_mpss);
+		}
 	}
 
 	if (!bus)
diff --git a/drivers/pci/hotplug/pcihp_slot.c b/drivers/pci/hotplug/pcihp_slot.c
index 753b21a..3ffd9c1 100644
--- a/drivers/pci/hotplug/pcihp_slot.c
+++ b/drivers/pci/hotplug/pcihp_slot.c
@@ -169,7 +169,9 @@ void pci_configure_slot(struct pci_dev *dev)
 			(dev->class >> 8) == PCI_CLASS_BRIDGE_PCI)))
 		return;
 
-	pcie_bus_configure_settings(dev->bus, dev->bus->self->pcie_mpss);
+	if (dev->bus && dev->bus->self)
+		pcie_bus_configure_settings(dev->bus,
+					    dev->bus->self->pcie_mpss);
 
 	memset(&hpp, 0, sizeof(hpp));
 	ret = pci_get_hp_params(dev, &hpp);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8473727..0820fc1 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1456,9 +1456,6 @@ void pcie_bus_configure_settings(struct pci_bus *bus, u8 mpss)
 {
 	u8 smpss = mpss;
 
-	if (!bus->self)
-		return;
-
 	if (!pci_is_pcie(bus->self))
 		return;
 
-- 
1.7.6

[-- Attachment #3: 0002-PCI-Remove-MRRS-modification-from-MPS-setting-code.patch --]
[-- Type: text/x-patch, Size: 4404 bytes --]

From 74d81235f8e4bd60859d539a27e51d3a09d183cf Mon Sep 17 00:00:00 2001
From: Jon Mason <mason@myri.com>
Date: Thu, 8 Sep 2011 12:59:00 -0500
Subject: [PATCH 2/2] PCI: Remove MRRS modification from MPS setting code

Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
massive negative ramifications on some devices.  Without knowing which
devices have this issue, do not modify from the default value when
walking the PCI-E bus in pcie_bus_safe mode.  Also, make pcie_bus_safe
the default procedure.

Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Simon Kirby <sim@hostway.ca>
Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.de>
References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
Signed-off-by: Jon Mason <mason@myri.com>
---
 drivers/pci/pci.c   |    2 +-
 drivers/pci/probe.c |   41 ++++++++++++++++++++++-------------------
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 0ce6742..4e84fd4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -77,7 +77,7 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE;
 unsigned long pci_hotplug_io_size  = DEFAULT_HOTPLUG_IO_SIZE;
 unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE;
 
-enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_PERFORMANCE;
+enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_SAFE;
 
 /*
  * The default CLS is used if arch didn't set CLS explicitly and not
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 0820fc1..b1187ff 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1396,34 +1396,37 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
 
 static void pcie_write_mrrs(struct pci_dev *dev, int mps)
 {
-	int rc, mrrs;
+	int rc, mrrs, dev_mpss;
 
-	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
-		int dev_mpss = 128 << dev->pcie_mpss;
+	/* In the "safe" case, do not configure the MRRS.  There appear to be
+	 * issues with setting MRRS to 0 on a number of devices.
+	 */
 
-		/* For Max performance, the MRRS must be set to the largest
-		 * supported value.  However, it cannot be configured larger
-		 * than the MPS the device or the bus can support.  This assumes
-		 * that the largest MRRS available on the device cannot be
-		 * smaller than the device MPSS.
-		 */
-		mrrs = mps < dev_mpss ? mps : dev_mpss;
-	} else
-		/* In the "safe" case, configure the MRRS for fairness on the
-		 * bus by making all devices have the same size
-		 */
-		mrrs = mps;
+	if (pcie_bus_config != PCIE_BUS_PERFORMANCE)
+		return;
+
+	dev_mpss = 128 << dev->pcie_mpss;
 
+	/* For Max performance, the MRRS must be set to the largest supported
+	 * value.  However, it cannot be configured larger than the MPS the
+	 * device or the bus can support.  This assumes that the largest MRRS
+	 * available on the device cannot be smaller than the device MPSS.
+	 */
+	mrrs = min(mps, dev_mpss);
 
 	/* MRRS is a R/W register.  Invalid values can be written, but a
-	 * subsiquent read will verify if the value is acceptable or not.
+	 * subsequent read will verify if the value is acceptable or not.
 	 * If the MRRS value provided is not acceptable (e.g., too large),
 	 * shrink the value until it is acceptable to the HW.
 	 */
 	while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
+		dev_warn(&dev->dev, "Attempting to modify the PCI-E MRRS value"
+			 " to %d.
If any issues are encountered, please try " + "running with pci=pcie_bus_safe\n", mrrs); rc = pcie_set_readrq(dev, mrrs); if (rc) - dev_err(&dev->dev, "Failed attempting to set the MRRS\n"); + dev_err(&dev->dev, + "Failed attempting to set the MRRS\n"); mrrs /= 2; } @@ -1436,13 +1439,13 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data) if (!pci_is_pcie(dev)) return 0; - dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n", + dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n", pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev)); pcie_write_mps(dev, mps); pcie_write_mrrs(dev, mps); - dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n", + dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n", pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev)); return 0; -- 1.7.6 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13 15:51 ` Jon Mason
@ 2011-09-13 16:32 ` Justin Piszcz

From: Justin Piszcz @ 2011-09-13 16:32 UTC
To: Jon Mason
Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs, linux-kernel

On Tue, 13 Sep 2011, Jon Mason wrote:

> On Tue, Sep 13, 2011 at 10:42 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>>
>>
>> On Tue, 13 Sep 2011, Jon Mason wrote:
>>
>>> On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com>
>>> wrote:
>>>>
>>>>
>>>> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>>>>

Thanks,

# patch -p1 < ../0001-Fix-pointer-dereference-before-call-to-pcie_bus_conf.patch
patching file arch/x86/pci/acpi.c
patching file drivers/pci/hotplug/pcihp_slot.c
patching file drivers/pci/probe.c
# patch -p1 < ../0002-PCI-Remove-MRRS-modification-from-MPS-setting-code.patch
patching file drivers/pci/pci.c
patching file drivers/pci/probe.c
#

Rebooted & running with the new patches on 3.1-rc4. Will let you know if
there are any further issues. I wonder if this will fix the RCU/SLAB issues
too. Thanks.

Justin.
Thread overview: 18+ messages (end of thread, newest: 2011-09-13 16:32 UTC)

2011-09-11  9:40 3.1-rc4: spectacular kernel errors / filesystem crash Justin Piszcz
2011-09-13  3:59 ` Jesse Brandeburg
2011-09-13  4:05 ` Eric Dumazet
2011-09-13 14:54 ` Justin Piszcz
2011-09-13 14:58 ` Eric Dumazet
2011-09-13 15:35 ` Jon Mason
2011-09-13 15:42 ` Justin Piszcz
2011-09-13 15:51 ` Jon Mason
2011-09-13 16:32 ` Justin Piszcz