* 3.1-rc4: spectacular kernel errors / filesystem crash
@ 2011-09-11 9:40 ` Justin Piszcz
0 siblings, 0 replies; 18+ messages in thread
From: Justin Piszcz @ 2011-09-11 9:40 UTC (permalink / raw)
To: linux-kernel; +Cc: xfs, Alan Piszcz
Hi,
Over the past 24-48 hours I was running some CPU-intensive jobs and there
was heavy I/O on the RAID (9750-24i4e + a RAID6).
I believe most of the problem started when I included many kernel options
as modules (before, I only compiled in [*] the drivers I used). Something
appears to have gone awry in the kernel, and afterwards
disks started going in and out, XFS shut down, etc.
I'm opening a case with LSI to see what happened with the 3ware card;
however, after a power cycle, everything came back OK (the drives and HW
are physically OK). It is rebuilding onto the two drives that showed
CFG-OP-FAIL, but other than that everything 'seems' OK; I still need to do an fsck.
Something went wrong in the kernel and caused a cascading effect of
errors. This occurred (I believe) when I started to run a lot of encoding
jobs; however, I had also been doing a lot of data transfer for the past
24-48 hours on the RAID array. The system disk (separate SSD/EXT4) remained
unaffected, but other weird stuff happened as well.
I still see these in the logs after the reboot (not often, but e.g. while
the RAID controller is rebuilding onto the two drives with CFG-OP-FAIL; the
physical drives are 100% healthy):
[ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
So, my plan:
1. Report this error to LKML+XFS mailing lists.
2. Open case with LSI support.
3. Recompile the kernel the way I have for many years [only compile in the
options that I need [*] and do not build drivers as modules]
4. Reboot the Linux system and see if this recurs under the same
workload, after the RAID is done rebuilding.
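
For step 3, a quick way to see which options are still built as modules is to scan the kernel .config for entries set to `m`. The following is a minimal sketch, not part of the original report: the function name `modular_options` and the sample config contents are hypothetical, and only the general `CONFIG_NAME=y|m` layout of a .config file is assumed.

```python
def modular_options(config_text):
    """Return the sorted names of options set to 'm' in kernel .config text."""
    modules = []
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and value == "m":
            modules.append(name)
    return sorted(modules)


# Hypothetical sample; the real values on the affected host are not known.
sample = """\
CONFIG_XFS_FS=y
CONFIG_IGB=m
# CONFIG_IXGBE is not set
CONFIG_SCSI_3W_SAS=m
"""

for option in modular_options(sample):
    print(option)
```

Running this against the real .config before and after the rebuild would show whether the storage and network drivers have actually moved from `m` back to `y`.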
--
These errors are quite long, so I will upload the full logs over HTTP and
paste the relevant bits below.
--
URLs for FULL logs:
1. tw_cli /cX show diag:
http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
2. Full kernel log (and previous morning of kernel crash)
http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
3. tw_cli /cX show all
http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
--
Summary (what seems to have occurred; I have not done a full analysis yet)
1. The 3ware card freaked out due to kernel/RCU/APIC(?) errors.
2. Then the time source went unstable (this happens with weird kernel bugs
on many different hosts; I have seen this over time).
3. Then, on the 3ware card, drives started leaving and being re-inserted
by themselves, and XFS went off-line to protect the filesystem due to the
3ware issues.
--
3ware/RAID-- Interesting errors:
I've never seen this before on a 3ware RAID controller, at least from what
I can remember, and I've been using 3ware cards for many years.
p2 CFG-OP-FAIL - 2.73 TB SATA 2 - Hitachi HDS723030AL
p3 CFG-OP-FAIL - 2.73 TB SATA 3 - Hitachi HDS723030AL
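
To pull the affected ports out of a longer status listing, something like the sketch below could be used. It assumes only the column layout visible in the two lines above (port name first, status second); the function name `failed_ports` and the `p4` line in the sample are hypothetical, and the real `tw_cli` output may contain headers or other columns not handled here.

```python
def failed_ports(show_output, bad_states=("CFG-OP-FAIL",)):
    """Return port names (e.g. 'p2') whose status column is in bad_states."""
    ports = []
    for line in show_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        port, status = fields[0], fields[1]
        # A port line starts with 'p' followed by digits, e.g. 'p2'.
        if port.startswith("p") and port[1:].isdigit() and status in bad_states:
            ports.append(port)
    return ports


# First two lines are from the report; the p4 line is a made-up healthy port.
sample = """\
p2 CFG-OP-FAIL - 2.73 TB SATA 2 - Hitachi HDS723030AL
p3 CFG-OP-FAIL - 2.73 TB SATA 3 - Hitachi HDS723030AL
p4 OK - 2.73 TB SATA 4 - Hitachi HDS723030AL
"""

print(failed_ports(sample))
```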
--
Kernel/ERRORS:
FWIW it all seems to have started during an encoding job around 21:00:
Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down
Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here ]------------
Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not tainted 3.1.0-rc4 #1
Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
Sep 10 20:59:39 p34 kernel: [531189.671424] [<ffffffff810379ba>] warn_slowpath_common+0x7a/0xb0
Sep 10 20:59:39 p34 kernel: [531189.671427] [<ffffffff81037a91>] warn_slowpath_fmt+0x41/0x50
Sep 10 20:59:39 p34 kernel: [531189.671433] [<ffffffff815d7874>] ? schedule+0x2e4/0x950
Sep 10 20:59:39 p34 kernel: [531189.671436] [<ffffffff814e5aff>] dev_watchdog+0x23f/0x250
Sep 10 20:59:39 p34 kernel: [531189.671440] [<ffffffff81043872>] run_timer_softirq+0xf2/0x220
Sep 10 20:59:39 p34 kernel: [531189.671443] [<ffffffff814e58c0>] ? qdisc_reset+0x50/0x50
Sep 10 20:59:39 p34 kernel: [531189.671446] [<ffffffff8103d208>] __do_softirq+0x98/0x120
Sep 10 20:59:39 p34 kernel: [531189.671448] [<ffffffff8103d345>] run_ksoftirqd+0xb5/0x160
Sep 10 20:59:39 p34 kernel: [531189.671454] [<ffffffff8103d290>] ? __do_softirq+0x120/0x120
Sep 10 20:59:39 p34 kernel: [531189.671458] [<ffffffff810523b7>] kthread+0x87/0x90
Sep 10 20:59:39 p34 kernel: [531189.671462] [<ffffffff815dbdb4>] kernel_thread_helper+0x4/0x10
Sep 10 20:59:39 p34 kernel: [531189.671465] [<ffffffff81052330>] ? kthread_worker_fn+0x130/0x130
Sep 10 20:59:39 p34 kernel: [531189.671467] [<ffffffff815dbdb0>] ? gs_change+0xb/0xb
Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba ]---
Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]
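
To reduce an excerpt like the one above to a short timeline, the key events can be filtered out of the kernel log by keyword. This is a rough sketch under stated assumptions: the regular expression matches only the syslog line layout shown above, and the keyword list and the names `LINE_RE`, `KEYWORDS`, and `timeline` are illustrative choices, not anything from the original analysis.

```python
import re

# Matches lines like:
#   "Sep 10 20:59:39 p34 kernel: [531189.671361] message"
LINE_RE = re.compile(
    r"^(?P<wall>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] (?P<msg>.*)$"
)

# Illustrative set of strings that mark an event worth keeping.
KEYWORDS = ("WARNING:", "NETDEV WATCHDOG", "BUG:", "Link is Down")


def timeline(log_text):
    """Return (wall_clock, uptime_seconds, message) for each matching line."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and any(k in m.group("msg") for k in KEYWORDS):
            events.append(
                (m.group("wall"), float(m.group("uptime")), m.group("msg"))
            )
    return events


# Lines copied from the excerpt above.
sample = """\
Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down
Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]
"""

for wall, uptime, msg in timeline(sample):
    print(f"{wall} [{uptime:.6f}] {msg}")
```

The uptime column makes it easy to measure gaps between events; in the excerpt, the soft lockup is reported roughly 20 seconds after the igb transmit timeout.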
--
Currently...
After all of this happened, I stopped all I/O and all processes on the system.
I shut down the host, removed the power, and powered it back up. The drives
that showed CFG-OP-FAIL before now show as REBUILDING; I am waiting for them
to rebuild before doing anything else.
Justin.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
2011-09-11 9:40 ` Justin Piszcz
@ 2011-09-13 3:59 ` Jesse Brandeburg
-1 siblings, 0 replies; 18+ messages in thread
From: Jesse Brandeburg @ 2011-09-13 3:59 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, xfs, Alan Piszcz, NetDEV list
Added netdev because this appears to start with an igb TX hang.
On Sun, Sep 11, 2011 at 2:40 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> Hi,
>
> Over the past 24-48 hours I was running some CPU-intenstive jobs and there
> was heavy I/O on the RAID (9750-24i4e + a RAID6)..
>
> I believe most of the problem started when I included many kernel options as
> modules (before I only compiled in [*] the drivers I used), there appears to
> have something to gone awry in the kernel and then afterwards, disks started
> going in and out, XFS shut down, etcera.
>
> I'm opening a case with LSI to see what happened with the 3ware card;
> however, after a power cycle, everything came back OK (the drives and HW) is
> physically OK, it is rebuilding onto those two drives with CFG-OP-FAIL but
> other than that, everything 'seems' OK, still need to do an fsck.
>
> Something went wrong in the kernel and caused a cascading effect of errors,
> this occurred (I believe) when I started to run a lot of encoding jobs;
> however, I was doing a lot of data transfer for the past 24-48 hours on the
> RAID array, the system (separate SSD/EXT4) remained unaffected but other
> weird stuff happened as well..
>
> I still see these in the logs as well after the reboot (not often; but e.g.,
> the RAID controller is rebuilding from the two drives with CFG-OPT-FAIL (the
> physical drives are 100% healthy):
>
> [ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed. This is likely a
> firmware bug on this device. Contact the card vendor for a firmware update.
>
> So, my plan:
>
> 1. Report this error to LKML+XFS mailing lists.
> 2. Open case with LSI support.
> 3. Recompile the kernel how I used for many years [only compile in options
> that you need [*] and do not compile drivers as modules]
> 4. Reboot Linux systems and see if this recurs again under the same
> workload, after the RAID is done rebuilding.
>
> --
>
> So these errors are quite long, will upload to HTTP and paste the relevant
> bits below.
>
> --
>
> URLs for FULL logs:
>
> 1. tw_cli /cX show diag:
> http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
>
> 2. Full kernel log (and previous morning of kernel crash)
> http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
>
> 3. tw_cli /cX show all
> http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
>
> --
>
> Summary (what seems to have occurred, have not done a full analysis yet)
>
> 1. 3ware card freaked out due to kernel/RCU/APIC(?) errors
>
> 2. Then, the time source went unstable (this happens with weird kernel bugs
> on many different hosts, I have seen this over time).
>
> 3. Then, on the 3ward carde, drives started leaving and being re-inserted
> by themsevles, XFS went off-line to protect the filesystem due to the
> 3ware issues
>
> --
>
> 3ware/RAID-- Interesting errors:
>
> I've never seen this before on a 3ware RAID controller, at least from what
> I can remember and I've been using 3ware cards for many years..
>
> p2 CFG-OP-FAIL - 2.73 TB SATA 2 - Hitachi HDS723030AL
> p3 CFG-OP-FAIL - 2.73 TB SATA 3 - Hitachi HDS723030AL
>
> --
>
> Kernel/ERRORS:
>
> FWIW it all seem to start during an encoding job around 21:00:
>
> Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC
> Link is Down
> Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO
> (0x04:0x002B): Verify completed:unit=0.
> Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here
> ]------------
> Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at
> net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
> Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
> Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb):
> transmit queue 5 timed out
> Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod
> tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio
> snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib
> snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event
> snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev
> serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi
> i7core_edac edac_core video
> Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not
> tainted 3.1.0-rc4 #1
> Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
> Sep 10 20:59:39 p34 kernel: [531189.671424] [<ffffffff810379ba>]
> warn_slowpath_common+0x7a/0xb0
> Sep 10 20:59:39 p34 kernel: [531189.671427] [<ffffffff81037a91>]
> warn_slowpath_fmt+0x41/0x50
> Sep 10 20:59:39 p34 kernel: [531189.671433] [<ffffffff815d7874>] ?
> schedule+0x2e4/0x950
> Sep 10 20:59:39 p34 kernel: [531189.671436] [<ffffffff814e5aff>]
> dev_watchdog+0x23f/0x250
> Sep 10 20:59:39 p34 kernel: [531189.671440] [<ffffffff81043872>]
> run_timer_softirq+0xf2/0x220
> Sep 10 20:59:39 p34 kernel: [531189.671443] [<ffffffff814e58c0>] ?
> qdisc_reset+0x50/0x50
> Sep 10 20:59:39 p34 kernel: [531189.671446] [<ffffffff8103d208>]
> __do_softirq+0x98/0x120
> Sep 10 20:59:39 p34 kernel: [531189.671448] [<ffffffff8103d345>]
> run_ksoftirqd+0xb5/0x160
> Sep 10 20:59:39 p34 kernel: [531189.671454] [<ffffffff8103d290>] ?
> __do_softirq+0x120/0x120
> Sep 10 20:59:39 p34 kernel: [531189.671458] [<ffffffff810523b7>]
> kthread+0x87/0x90
> Sep 10 20:59:39 p34 kernel: [531189.671462] [<ffffffff815dbdb4>]
> kernel_thread_helper+0x4/0x10
> Sep 10 20:59:39 p34 kernel: [531189.671465] [<ffffffff81052330>] ?
> kthread_worker_fn+0x130/0x130
> Sep 10 20:59:39 p34 kernel: [531189.671467] [<ffffffff815dbdb0>] ?
> gs_change+0xb/0xb
> Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba
> ]---
> Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset
> adapter
> Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000
> Mbps Full Duplex, Flow Control: RX/TX
> Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck
> for 22s! [kswapd0:947]
>
> --
>
> URLs for FULL logs:
>
> 1. tw_cli /cX show diag:
> http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
>
> 2. Full kernel log (and previous morning of kernel crash)
> http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
>
> 3. tw_cli /cX show all
> http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
>
> --
>
> Currently...
>
> After all of this happened, I stopped all I/O on the system/all processes,
> etc
> I shutdown the host, removed the power, powered it back up, now the drives
> that showed CFG-OP-FAIL before now show as REBUILDING, I am waiting for them
> to rebuild before doing anything else.
>
> Justin.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
2011-09-13 3:59 ` Jesse Brandeburg
@ 2011-09-13 4:05 ` Eric Dumazet
-1 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2011-09-13 4:05 UTC (permalink / raw)
To: Jesse Brandeburg
Cc: Justin Piszcz, linux-kernel, xfs, Alan Piszcz, NetDEV list
On Monday, 12 September 2011 at 20:59 -0700, Jesse Brandeburg wrote:
> added netdev because it appears to start with an igb tx hang
>
> On Sun, Sep 11, 2011 at 2:40 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> > Hi,
> >
> > Over the past 24-48 hours I was running some CPU-intenstive jobs and there
> > was heavy I/O on the RAID (9750-24i4e + a RAID6)..
> >
> > I believe most of the problem started when I included many kernel options as
> > modules (before I only compiled in [*] the drivers I used), there appears to
> > have something to gone awry in the kernel and then afterwards, disks started
> > going in and out, XFS shut down, etcera.
> >
> > I'm opening a case with LSI to see what happened with the 3ware card;
> > however, after a power cycle, everything came back OK (the drives and HW) is
> > physically OK, it is rebuilding onto those two drives with CFG-OP-FAIL but
> > other than that, everything 'seems' OK, still need to do an fsck.
> >
> > Something went wrong in the kernel and caused a cascading effect of errors,
> > this occurred (I believe) when I started to run a lot of encoding jobs;
> > however, I was doing a lot of data transfer for the past 24-48 hours on the
> > RAID array, the system (separate SSD/EXT4) remained unaffected but other
> > weird stuff happened as well..
> >
> > I still see these in the logs as well after the reboot (not often; but e.g.,
> > the RAID controller is rebuilding from the two drives with CFG-OPT-FAIL (the
> > physical drives are 100% healthy):
> >
> > [ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed. This is likely a
> > firmware bug on this device. Contact the card vendor for a firmware update.
> >
> > So, my plan:
> >
> > 1. Report this error to LKML+XFS mailing lists.
> > 2. Open case with LSI support.
> > 3. Recompile the kernel how I used for many years [only compile in options
> > that you need [*] and do not compile drivers as modules]
> > 4. Reboot Linux systems and see if this recurs again under the same
> > workload, after the RAID is done rebuilding.
> >
> > --
> >
> > So these errors are quite long, will upload to HTTP and paste the relevant
> > bits below.
> >
> > --
> >
> > URLs for FULL logs:
> >
> > 1. tw_cli /cX show diag:
> > http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
> >
> > 2. Full kernel log (and previous morning of kernel crash)
> > http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
> >
> > 3. tw_cli /cX show all
> > http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
> >
> > --
> >
> > Summary (what seems to have occurred, have not done a full analysis yet)
> >
> > 1. 3ware card freaked out due to kernel/RCU/APIC(?) errors
> >
> > 2. Then, the time source went unstable (this happens with weird kernel bugs
> > on many different hosts, I have seen this over time).
> >
> > 3. Then, on the 3ward carde, drives started leaving and being re-inserted
> > by themsevles, XFS went off-line to protect the filesystem due to the
> > 3ware issues
> >
> > --
> >
> > 3ware/RAID-- Interesting errors:
> >
> > I've never seen this before on a 3ware RAID controller, at least from what
> > I can remember and I've been using 3ware cards for many years..
> >
> > p2 CFG-OP-FAIL - 2.73 TB SATA 2 - Hitachi HDS723030AL
> > p3 CFG-OP-FAIL - 2.73 TB SATA 3 - Hitachi HDS723030AL
> >
> > --
> >
> > Kernel/ERRORS:
> >
> > FWIW it all seem to start during an encoding job around 21:00:
> >
> > Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down
> > Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
> > Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here ]------------
> > Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
> > Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
> > Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
> > Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
> > Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not tainted 3.1.0-rc4 #1
> > Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
> > Sep 10 20:59:39 p34 kernel: [531189.671424] [<ffffffff810379ba>] warn_slowpath_common+0x7a/0xb0
> > Sep 10 20:59:39 p34 kernel: [531189.671427] [<ffffffff81037a91>] warn_slowpath_fmt+0x41/0x50
> > Sep 10 20:59:39 p34 kernel: [531189.671433] [<ffffffff815d7874>] ? schedule+0x2e4/0x950
> > Sep 10 20:59:39 p34 kernel: [531189.671436] [<ffffffff814e5aff>] dev_watchdog+0x23f/0x250
> > Sep 10 20:59:39 p34 kernel: [531189.671440] [<ffffffff81043872>] run_timer_softirq+0xf2/0x220
> > Sep 10 20:59:39 p34 kernel: [531189.671443] [<ffffffff814e58c0>] ? qdisc_reset+0x50/0x50
> > Sep 10 20:59:39 p34 kernel: [531189.671446] [<ffffffff8103d208>] __do_softirq+0x98/0x120
> > Sep 10 20:59:39 p34 kernel: [531189.671448] [<ffffffff8103d345>] run_ksoftirqd+0xb5/0x160
> > Sep 10 20:59:39 p34 kernel: [531189.671454] [<ffffffff8103d290>] ? __do_softirq+0x120/0x120
> > Sep 10 20:59:39 p34 kernel: [531189.671458] [<ffffffff810523b7>] kthread+0x87/0x90
> > Sep 10 20:59:39 p34 kernel: [531189.671462] [<ffffffff815dbdb4>] kernel_thread_helper+0x4/0x10
> > Sep 10 20:59:39 p34 kernel: [531189.671465] [<ffffffff81052330>] ? kthread_worker_fn+0x130/0x130
> > Sep 10 20:59:39 p34 kernel: [531189.671467] [<ffffffff815dbdb0>] ? gs_change+0xb/0xb
> > Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba ]---
> > Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
> > Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> > Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]
> >
> > --
> >
> > URLs for FULL logs:
> >
> > 1. tw_cli /cX show diag:
> > http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
> >
> > 2. Full kernel log (and previous morning of kernel crash)
> > http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
> >
> > 3. tw_cli /cX show all
> > http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
> >
> > --
> >
> > Currently...
> >
> > After all of this happened, I stopped all I/O on the system/all processes,
> > etc
> > I shutdown the host, removed the power, powered it back up, now the drives
> > that showed CFG-OP-FAIL before now show as REBUILDING, I am waiting for them
> > to rebuild before doing anything else.
> >
> > Justin.
> >
> >
Justin, please make sure you have pulled this commit:
commit ed2888e906b56769b4ffabb9c577190438aa68b8
Author: Jon Mason <mason@myri.com>
Date: Thu Sep 8 16:41:18 2011 -0500
PCI: Remove MRRS modification from MPS setting code
Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
massive negative ramifications on some devices. Without knowing which
devices have this issue, do not modify from the default value when
walking the PCI-E bus in pcie_bus_safe mode. Also, make pcie_bus_safe
the default procedure.
Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Simon Kirby <sim@hostway.ca>
Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.
References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
Signed-off-by: Jon Mason <mason@myri.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
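For readers unfamiliar with the "0 (value of 128Bytes)" wording in the commit message: MRRS and MPS are 3-bit encoded fields in the PCI Express Device Control register, and an encoded value of 0 means 128 bytes. The sketch below only illustrates that encoding. The bit positions (MPS in bits 7:5, MRRS in bits 14:12) come from the PCIe specification rather than from this thread, and the function names and sample register values are made up for illustration.

```python
# Sketch: decode Max_Payload_Size (MPS) and Max_Read_Request_Size (MRRS)
# from a 16-bit PCIe Device Control register value.
# Assumption: field layout per the PCIe spec (MPS bits 7:5, MRRS bits 14:12).


def decode_size(encoded: int) -> int:
    """Map the 3-bit encoding to bytes: 0 -> 128, 1 -> 256, ... 5 -> 4096."""
    if not 0 <= encoded <= 5:
        raise ValueError(f"reserved encoding: {encoded}")
    return 128 << encoded


def decode_devctl(devctl: int) -> dict:
    """Return MPS and MRRS in bytes for a Device Control register value."""
    mps = (devctl >> 5) & 0x7
    mrrs = (devctl >> 12) & 0x7
    return {"mps_bytes": decode_size(mps), "mrrs_bytes": decode_size(mrrs)}


if __name__ == "__main__":
    # Hypothetical register value: MRRS field = 0 (128 bytes), MPS field = 1.
    print(decode_devctl(0x0020))
```

On a running system the same two fields should be visible in the DevCtl line of `lspci -vv` output as MaxPayload and MaxReadReq, which is probably the quickest way to check whether a device ended up with a 128-byte read request size.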
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
2011-09-13 4:05 ` Eric Dumazet
@ 2011-09-13 14:54 ` Justin Piszcz
-1 siblings, 0 replies; 18+ messages in thread
From: Justin Piszcz @ 2011-09-13 14:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs, linux-kernel
On Tue, 13 Sep 2011, Eric Dumazet wrote:
> Please Justin make sure you pulled commit
>
> commit ed2888e906b56769b4ffabb9c577190438aa68b8
> Author: Jon Mason <mason@myri.com>
> Date: Thu Sep 8 16:41:18 2011 -0500
>
> PCI: Remove MRRS modification from MPS setting code
>
> Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
> massive negative ramifications on some devices. Without knowing which
> devices have this issue, do not modify from the default value when
> walking the PCI-E bus in pcie_bus_safe mode. Also, make pcie_bus_safe
> the default procedure.
>
> Tested-by: Sven Schnelle <svens@stackframe.org>
> Tested-by: Simon Kirby <sim@hostway.ca>
> Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.
> References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
> Signed-off-by: Jon Mason <mason@myri.com>
> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hello,
I found this commit here:
http://permalink.gmane.org/gmane.linux.kernel.pci/11700
Applied:
# patch -p1 < ../ed2888e906b56769b4ffabb9c577190438aa68b8.txt
patching file drivers/pci/probe.c
I will update this thread if the problem recurs. Can someone also please advise
which DEBUG options I should enable to catch further SLAB/RCU issues?
So far, I have the following enabled:
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SLAB_LEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VIRTUAL=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_PER_CPU_MAPS=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_RODATA=y
Thanks,
Justin.
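A list like the one above can be produced straight from a kernel .config rather than by hand. The following is a minimal sketch, assuming the usual .config line format (`CONFIG_FOO=y`, with unset options written as `# CONFIG_FOO is not set`). The function name and the sample text are hypothetical, and the sketch does not answer which extra options would help with SLAB/RCU problems.

```python
# Sketch: list the options containing "DEBUG" that are set in a kernel .config.
import re

# Matches set options only; "# CONFIG_FOO is not set" lines do not match.
CONFIG_LINE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")


def enabled_options(config_text: str, substring: str = "DEBUG") -> dict:
    """Return {option: value} for set options whose name contains `substring`."""
    found = {}
    for line in config_text.splitlines():
        match = CONFIG_LINE.match(line.strip())
        if match and substring in match.group(1):
            found[match.group(1)] = match.group(2)
    return found


if __name__ == "__main__":
    # Hypothetical excerpt; in practice read the text from the build's .config.
    sample = (
        "CONFIG_DEBUG_KERNEL=y\n"
        "# CONFIG_DEBUG_OBJECTS is not set\n"
        "CONFIG_DEBUG_SLAB=y\n"
        "CONFIG_XFS_FS=y\n"
    )
    for name, value in enabled_options(sample).items():
        print(f"{name}={value}")
```

Passing a different `substring` (for example "RCU") gives the same view for another family of options.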
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
2011-09-13 14:54 ` Justin Piszcz
@ 2011-09-13 14:58 ` Eric Dumazet
-1 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2011-09-13 14:58 UTC (permalink / raw)
To: Justin Piszcz
Cc: Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs, linux-kernel
2011/9/13 Justin Piszcz <jpiszcz@lucidpixels.com>:
>
> I found this commit here:
> http://permalink.gmane.org/gmane.linux.kernel.pci/11700
>
> Applied:
> # patch -p1 < ../ed2888e906b56769b4ffabb9c577190438aa68b8.txt
> patching file drivers/pci/probe.c
>
>
Oh, I should have sent the git command you can use instead of searching the web:
git pull https://github.com/torvalds/linux.git
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
2011-09-13 14:54 ` Justin Piszcz
@ 2011-09-13 15:35 ` Jon Mason
-1 siblings, 0 replies; 18+ messages in thread
From: Jon Mason @ 2011-09-13 15:35 UTC (permalink / raw)
To: Justin Piszcz
Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs,
linux-kernel
On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>
> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>
>> Please Justin make sure you pulled commit
>> commit ed2888e906b56769b4ffabb9c577190438aa68b8
>> Author: Jon Mason <mason@myri.com>
>> Date: Thu Sep 8 16:41:18 2011 -0500
>>
>> PCI: Remove MRRS modification from MPS setting code
>>
>> Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>> massive negative ramifications on some devices. Without knowing which
>> devices have this issue, do not modify from the default value when
>> walking the PCI-E bus in pcie_bus_safe mode. Also, make pcie_bus_safe
>> the default procedure.
>>
>> Tested-by: Sven Schnelle <svens@stackframe.org>
>> Tested-by: Simon Kirby <sim@hostway.ca>
>> Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Reported-and-tested-by: Niels Ole Salscheider
>> <niels_ole@salscheider-online.
>> References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>> Signed-off-by: Jon Mason <mason@myri.com>
>> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
>> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> Hello,
>
> I found this commit here:
> http://permalink.gmane.org/gmane.linux.kernel.pci/11700
That is an early version of the patch. This is the one you want:
https://github.com/torvalds/linux/commit/ed2888e906b56769b4ffabb9c577190438aa68b8
It appears that this patch didn't make it to lkml or linux-pci list
due to kernel.org DNS being down when it was sent.
Thanks,
Jon
>
> Applied:
> # patch -p1 < ../ed2888e906b56769b4ffabb9c577190438aa68b8.txt
> patching file drivers/pci/probe.c
>
> I will update this thread if the problem recurs, can someone also please
> advise which DEBUG options I should have enabled to catch further SLAB/RCU
> issues?
>
> So far, I have the following enabled:
>
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_HAVE_DMA_API_DEBUG=y
> CONFIG_X86_DEBUGCTLMSR=y
> CONFIG_DEBUG_FS=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_SLAB=y
> CONFIG_DEBUG_SLAB_LEAK=y
> CONFIG_DEBUG_KMEMLEAK=y
> CONFIG_DEBUG_STACK_USAGE=y
> CONFIG_DEBUG_BUGVERBOSE=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_VM=y
> CONFIG_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_MEMORY_INIT=y
> CONFIG_DEBUG_PER_CPU_MAPS=y
> CONFIG_DEBUG_PAGEALLOC=y
> CONFIG_DEBUG_STACKOVERFLOW=y
> CONFIG_DEBUG_RODATA=y
>
> Thanks,
>
> Justin.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
2011-09-13 15:35 ` Jon Mason
@ 2011-09-13 15:42 ` Justin Piszcz
-1 siblings, 0 replies; 18+ messages in thread
From: Justin Piszcz @ 2011-09-13 15:42 UTC (permalink / raw)
To: Jon Mason
Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs,
linux-kernel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1906 bytes --]
On Tue, 13 Sep 2011, Jon Mason wrote:
> On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>>
>>
>> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>>
>>> Please Justin make sure you pulled commit
>>> commit ed2888e906b56769b4ffabb9c577190438aa68b8
>>> Author: Jon Mason <mason@myri.com>
>>> Date: Thu Sep 8 16:41:18 2011 -0500
>>>
>>> PCI: Remove MRRS modification from MPS setting code
>>>
>>> Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>>> massive negative ramifications on some devices. Without knowing which
>>> devices have this issue, do not modify from the default value when
>>> walking the PCI-E bus in pcie_bus_safe mode. Also, make pcie_bus_safe
>>> the default procedure.
>>>
>>> Tested-by: Sven Schnelle <svens@stackframe.org>
>>> Tested-by: Simon Kirby <sim@hostway.ca>
>>> Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>>> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>>> Reported-and-tested-by: Niels Ole Salscheider
>>> <niels_ole@salscheider-online.
>>> References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>>> Signed-off-by: Jon Mason <mason@myri.com>
>>> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
>>> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>>
>> Hello,
>>
>> I found this commit here:
>> http://permalink.gmane.org/gmane.linux.kernel.pci/11700
>
> This is an early version of the patch. This is the patch that you want:
> https://github.com/torvalds/linux/commit/ed2888e906b56769b4ffabb9c577190438aa68b8
>
> It appears that this patch didn't make it to lkml or linux-pci list
> due to kernel.org DNS being down when it was sent.
>
> Thanks,
> Jon
I need to learn how to use git at some point. In the meantime, can you please
provide plain-text patches so I can apply them and reboot?
Justin.
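One way to get a plain-text patch without a git checkout, assuming GitHub's convention of serving a mailbox-style patch when `.patch` is appended to a commit URL, is sketched below. The helper name is hypothetical; the commit URL is the one Jon posted.

```python
# Sketch: turn a GitHub commit URL into the URL of its plain-text patch.
# Assumption: GitHub serves "<commit URL>.patch" as a patch file.


def patch_url(commit_url: str) -> str:
    """Return the plain-text patch URL for a GitHub commit URL."""
    url = commit_url.rstrip("/")
    if url.endswith((".patch", ".diff")):
        return url  # already points at a plain-text form
    return url + ".patch"


if __name__ == "__main__":
    commit = (
        "https://github.com/torvalds/linux/commit/"
        "ed2888e906b56769b4ffabb9c577190438aa68b8"
    )
    print(patch_url(commit))
```

The downloaded file can then be applied from the top of the kernel tree with `patch -p1`, as was done earlier in the thread.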
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
2011-09-13 15:42 ` Justin Piszcz
@ 2011-09-13 15:51 ` Jon Mason
-1 siblings, 0 replies; 18+ messages in thread
From: Jon Mason @ 2011-09-13 15:51 UTC (permalink / raw)
To: Justin Piszcz
Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2175 bytes --]
On Tue, Sep 13, 2011 at 10:42 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>
> On Tue, 13 Sep 2011, Jon Mason wrote:
>
>> On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com>
>> wrote:
>>>
>>>
>>> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>>>
>>>> Please Justin make sure you pulled commit
>>>> commit ed2888e906b56769b4ffabb9c577190438aa68b8
>>>> Author: Jon Mason <mason@myri.com>
>>>> Date: Thu Sep 8 16:41:18 2011 -0500
>>>>
>>>> PCI: Remove MRRS modification from MPS setting code
>>>>
>>>> Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>>>> massive negative ramifications on some devices. Without knowing which
>>>> devices have this issue, do not modify from the default value when
>>>> walking the PCI-E bus in pcie_bus_safe mode. Also, make pcie_bus_safe
>>>> the default procedure.
>>>>
>>>> Tested-by: Sven Schnelle <svens@stackframe.org>
>>>> Tested-by: Simon Kirby <sim@hostway.ca>
>>>> Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>>>> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>>>> Reported-and-tested-by: Niels Ole Salscheider
>>>> <niels_ole@salscheider-online.
>>>> References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>>>> Signed-off-by: Jon Mason <mason@myri.com>
>>>> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
>>>> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>>>
>>> Hello,
>>>
>>> I found this commit here:
>>> http://permalink.gmane.org/gmane.linux.kernel.pci/11700
>>
>> This is an early version of the patch. This is the patch that you want:
>>
>> https://github.com/torvalds/linux/commit/ed2888e906b56769b4ffabb9c577190438aa68b8
>>
>> It appears that this patch didn't make it to lkml or linux-pci list
>> due to kernel.org DNS being down when it was sent.
>>
>> Thanks,
>> Jon
>
> I need to learn how to use git at some point. Can you please provide
> plain-text patches so I can apply them and reboot?
>
> Justin.
I've attached the two patches I asked Linus to include in 3.1-rc6.
Let me know if there are any issues.
Thanks,
Jon
[-- Attachment #2: 0001-Fix-pointer-dereference-before-call-to-pcie_bus_conf.patch --]
[-- Type: text/x-patch, Size: 2344 bytes --]
From cf822aed99fd8851d82ae5f2df11c29b79e316c8 Mon Sep 17 00:00:00 2001
From: Shyam Iyer <shyam.iyer.t@gmail.com>
Date: Wed, 31 Aug 2011 12:21:42 -0400
Subject: [PATCH 1/2] Fix pointer dereference before call to
pcie_bus_configure_settings
There is a potential NULL pointer dereference in calls to
pcie_bus_configure_settings due to attempts to access pci_bus self
variables when the self pointer is NULL. To correct this, verify that
the self pointer in pci_bus is non-NULL before dereferencing it.
Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Shyam Iyer <shyam_iyer@dell.com>
Signed-off-by: Jon Mason <mason@myri.com>
---
arch/x86/pci/acpi.c | 9 +++++++--
drivers/pci/hotplug/pcihp_slot.c | 4 +++-
drivers/pci/probe.c | 3 ---
3 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index c953302..039d913 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -365,8 +365,13 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_pci_root *root)
*/
if (bus) {
struct pci_bus *child;
- list_for_each_entry(child, &bus->children, node)
- pcie_bus_configure_settings(child, child->self->pcie_mpss);
+ list_for_each_entry(child, &bus->children, node) {
+ struct pci_dev *self = child->self;
+ if (!self)
+ continue;
+
+ pcie_bus_configure_settings(child, self->pcie_mpss);
+ }
}
if (!bus)
diff --git a/drivers/pci/hotplug/pcihp_slot.c b/drivers/pci/hotplug/pcihp_slot.c
index 753b21a..3ffd9c1 100644
--- a/drivers/pci/hotplug/pcihp_slot.c
+++ b/drivers/pci/hotplug/pcihp_slot.c
@@ -169,7 +169,9 @@ void pci_configure_slot(struct pci_dev *dev)
(dev->class >> 8) == PCI_CLASS_BRIDGE_PCI)))
return;
- pcie_bus_configure_settings(dev->bus, dev->bus->self->pcie_mpss);
+ if (dev->bus && dev->bus->self)
+ pcie_bus_configure_settings(dev->bus,
+ dev->bus->self->pcie_mpss);
memset(&hpp, 0, sizeof(hpp));
ret = pci_get_hp_params(dev, &hpp);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8473727..0820fc1 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1456,9 +1456,6 @@ void pcie_bus_configure_settings(struct pci_bus *bus, u8 mpss)
{
u8 smpss = mpss;
- if (!bus->self)
- return;
-
if (!pci_is_pcie(bus->self))
return;
--
1.7.6
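The guard that patch 1/2 moves into the callers can be illustrated outside the kernel. What follows is only a minimal user-space sketch: struct mock_dev, struct mock_bus, mock_configure_settings() and configure_children() are hypothetical stand-ins invented for illustration, not kernel APIs. It assumes a bus whose "self" pointer may be NULL (for example a root bus with no upstream bridge) and shows the caller skipping such a bus instead of dereferencing the pointer.

```c
#include <stddef.h>

/* Hypothetical, pared-down stand-ins for the kernel structures. */
struct mock_dev {
	unsigned char pcie_mpss;	/* encoded: payload bytes = 128 << pcie_mpss */
};

struct mock_bus {
	struct mock_dev *self;		/* bridge leading to this bus; may be NULL */
};

/* Stand-in for pcie_bus_configure_settings(); it only counts invocations. */
static int configure_calls;

static void mock_configure_settings(struct mock_bus *bus, unsigned char mpss)
{
	(void)bus;
	(void)mpss;
	configure_calls++;
}

/*
 * Walk an array of child buses and configure only those that have a
 * bridge device. Returns the number of buses that were skipped.
 */
static size_t configure_children(struct mock_bus *children, size_t n)
{
	size_t skipped = 0;

	for (size_t i = 0; i < n; i++) {
		struct mock_dev *self = children[i].self;

		if (!self) {		/* check before touching self->pcie_mpss */
			skipped++;
			continue;
		}
		mock_configure_settings(&children[i], self->pcie_mpss);
	}
	return skipped;
}
```

Calling configure_children() on an array in which one entry has self set to NULL skips that entry and configures the rest, which is the behaviour the acpi.c hunk above aims for.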
[-- Attachment #3: 0002-PCI-Remove-MRRS-modification-from-MPS-setting-code.patch --]
[-- Type: text/x-patch, Size: 4404 bytes --]
From 74d81235f8e4bd60859d539a27e51d3a09d183cf Mon Sep 17 00:00:00 2001
From: Jon Mason <mason@myri.com>
Date: Thu, 8 Sep 2011 12:59:00 -0500
Subject: [PATCH 2/2] PCI: Remove MRRS modification from MPS setting code
Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
massive negative ramifications on some devices. Without knowing which
devices have this issue, do not modify from the default value when
walking the PCI-E bus in pcie_bus_safe mode. Also, make pcie_bus_safe
the default procedure.
Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Simon Kirby <sim@hostway.ca>
Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.de>
References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
Signed-off-by: Jon Mason <mason@myri.com>
---
drivers/pci/pci.c | 2 +-
drivers/pci/probe.c | 41 ++++++++++++++++++++++-------------------
2 files changed, 23 insertions(+), 20 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 0ce6742..4e84fd4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -77,7 +77,7 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE;
unsigned long pci_hotplug_io_size = DEFAULT_HOTPLUG_IO_SIZE;
unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE;
-enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_PERFORMANCE;
+enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_SAFE;
/*
* The default CLS is used if arch didn't set CLS explicitly and not
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 0820fc1..b1187ff 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1396,34 +1396,37 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
static void pcie_write_mrrs(struct pci_dev *dev, int mps)
{
- int rc, mrrs;
+ int rc, mrrs, dev_mpss;
- if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
- int dev_mpss = 128 << dev->pcie_mpss;
+ /* In the "safe" case, do not configure the MRRS. There appear to be
+ * issues with setting MRRS to 0 on a number of devices.
+ */
- /* For Max performance, the MRRS must be set to the largest
- * supported value. However, it cannot be configured larger
- * than the MPS the device or the bus can support. This assumes
- * that the largest MRRS available on the device cannot be
- * smaller than the device MPSS.
- */
- mrrs = mps < dev_mpss ? mps : dev_mpss;
- } else
- /* In the "safe" case, configure the MRRS for fairness on the
- * bus by making all devices have the same size
- */
- mrrs = mps;
+ if (pcie_bus_config != PCIE_BUS_PERFORMANCE)
+ return;
+
+ dev_mpss = 128 << dev->pcie_mpss;
+ /* For Max performance, the MRRS must be set to the largest supported
+ * value. However, it cannot be configured larger than the MPS the
+ * device or the bus can support. This assumes that the largest MRRS
+ * available on the device cannot be smaller than the device MPSS.
+ */
+ mrrs = min(mps, dev_mpss);
/* MRRS is a R/W register. Invalid values can be written, but a
- * subsiquent read will verify if the value is acceptable or not.
+ * subsequent read will verify if the value is acceptable or not.
* If the MRRS value provided is not acceptable (e.g., too large),
* shrink the value until it is acceptable to the HW.
*/
while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
+ dev_warn(&dev->dev, "Attempting to modify the PCI-E MRRS value"
+ " to %d. If any issues are encountered, please try "
+ "running with pci=pcie_bus_safe\n", mrrs);
rc = pcie_set_readrq(dev, mrrs);
if (rc)
- dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
+ dev_err(&dev->dev,
+ "Failed attempting to set the MRRS\n");
mrrs /= 2;
}
@@ -1436,13 +1439,13 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
if (!pci_is_pcie(dev))
return 0;
- dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
+ dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
pcie_write_mps(dev, mps);
pcie_write_mrrs(dev, mps);
- dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
+ dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
return 0;
--
1.7.6
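The intent described in the comments of patch 2/2 (leave MRRS alone in "safe" mode; in "performance" mode start from min(MPS, device MPSS) and shrink until the hardware accepts the value) can be modelled in user space. This is a rough sketch under stated assumptions, not the kernel code: struct sim_dev, sim_set_readrq(), sim_get_readrq() and sim_write_mrrs() are hypothetical names, the simulated device simply ignores any write above max_mrrs, and the loop here stops as soon as a read-back matches the written value, which is an interpretation of the comment rather than a copy of the loop in the patch.

```c
/* Hypothetical model of a device's Maximum Read Request Size register. */
struct sim_dev {
	int mrrs;	/* current MRRS in bytes */
	int max_mrrs;	/* largest MRRS the simulated hardware accepts */
	int pcie_mpss;	/* encoded Max Payload Size Supported */
};

enum sim_bus_config { SIM_BUS_SAFE, SIM_BUS_PERFORMANCE };

/* Decode the encoded field: 0 -> 128 bytes, 1 -> 256 bytes, and so on. */
static int mpss_to_bytes(int encoded)
{
	return 128 << encoded;
}

static int sim_get_readrq(const struct sim_dev *dev)
{
	return dev->mrrs;
}

/* The simulated register silently keeps its old value on a bad write. */
static void sim_set_readrq(struct sim_dev *dev, int mrrs)
{
	if (mrrs <= dev->max_mrrs)
		dev->mrrs = mrrs;
}

/* Returns the MRRS the device ends up with. */
static int sim_write_mrrs(struct sim_dev *dev, int mps, enum sim_bus_config cfg)
{
	int dev_mpss, mrrs;

	if (cfg != SIM_BUS_PERFORMANCE)
		return sim_get_readrq(dev);	/* safe mode: do not modify MRRS */

	dev_mpss = mpss_to_bytes(dev->pcie_mpss);
	mrrs = mps < dev_mpss ? mps : dev_mpss;

	while (mrrs >= 128) {
		sim_set_readrq(dev, mrrs);
		if (sim_get_readrq(dev) == mrrs)
			break;			/* read-back matches: accepted */
		mrrs /= 2;			/* rejected: try the next size down */
	}
	return sim_get_readrq(dev);
}
```

With a simulated device that currently reads 128, accepts at most 256, and has an encoded MPSS of 2 (512 bytes), a call with mps = 1024 in performance mode first tries 512, sees it rejected, and settles on 256; in safe mode the register is returned unchanged.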
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
2011-09-13 15:51 ` Jon Mason
@ 2011-09-13 16:32 ` Justin Piszcz
-1 siblings, 0 replies; 18+ messages in thread
From: Justin Piszcz @ 2011-09-13 16:32 UTC (permalink / raw)
To: Jon Mason
Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs,
linux-kernel
Thanks,
# patch -p1 < ../0001-Fix-pointer-dereference-before-call-to-pcie_bus_conf.patch
patching file arch/x86/pci/acpi.c
patching file drivers/pci/hotplug/pcihp_slot.c
patching file drivers/pci/probe.c
# patch -p1 < ../0002-PCI-Remove-MRRS-modification-from-MPS-setting-code.patch
patching file drivers/pci/pci.c
patching file drivers/pci/probe.c
#
Rebooted and now running 3.1-rc4 with the new patches applied.
I will let you know if there are any further issues. I wonder whether
this will fix the RCU/SLAB issues too. Thanks.
Justin.
^ permalink raw reply [flat|nested] 18+ messages in thread
Thread overview: 18+ messages
2011-09-11 9:40 3.1-rc4: spectacular kernel errors / filesystem crash Justin Piszcz
2011-09-13 3:59 ` Jesse Brandeburg
2011-09-13 4:05 ` Eric Dumazet
2011-09-13 14:54 ` Justin Piszcz
2011-09-13 14:58 ` Eric Dumazet
2011-09-13 15:35 ` Jon Mason
2011-09-13 15:42 ` Justin Piszcz
2011-09-13 15:51 ` Jon Mason
2011-09-13 16:32 ` Justin Piszcz