* 3.1-rc4: spectacular kernel errors / filesystem crash
@ 2011-09-11  9:40 ` Justin Piszcz
  0 siblings, 0 replies; 18+ messages in thread
From: Justin Piszcz @ 2011-09-11  9:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: xfs, Alan Piszcz

Hi,

Over the past 24-48 hours I was running some CPU-intensive jobs and there
was heavy I/O on the RAID (9750-24i4e + a RAID6).

I believe most of the problem started when I included many kernel options
as modules (before, I only compiled in [*] the drivers I used). Something
appears to have gone awry in the kernel, and afterwards disks started going
in and out, XFS shut down, etcetera.

I'm opening a case with LSI to see what happened with the 3ware card;
however, after a power cycle, everything came back OK. The drives and HW
are physically OK, and the controller is rebuilding onto the two drives
that showed CFG-OP-FAIL. Other than that, everything 'seems' OK, but I
still need to do an fsck.

Something went wrong in the kernel and caused a cascade of errors. This
occurred (I believe) when I started to run a lot of encoding jobs; however,
I had also been doing a lot of data transfer on the RAID array over the
past 24-48 hours. The system disk (separate SSD/EXT4) remained unaffected,
but other weird stuff happened as well.

I still see these in the logs after the reboot (not often), e.g., while
the RAID controller is rebuilding onto the two drives that showed
CFG-OP-FAIL (the physical drives are 100% healthy):

[ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

So, my plan:

1. Report this error to LKML+XFS mailing lists.
2. Open case with LSI support.
3. Recompile the kernel the way I have for many years [only compile in the
    options I need [*] and do not build drivers as modules].
4. Reboot the Linux system and see if this recurs under the same workload,
    after the RAID is done rebuilding.
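
The built-in versus modular split in step 3 can be checked mechanically.
A minimal sketch, assuming the usual .config line format (CONFIG_NAME=y for
built-in, CONFIG_NAME=m for modular). The function name and the sample text
are hypothetical and are not taken from the actual build:

```python
import re

# Matches lines such as "CONFIG_XFS_FS=y" or "CONFIG_DRM_NOUVEAU=m".
CONFIG_LINE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=([ym])$")


def split_builtin_and_modular(config_text):
    """Return (builtin, modular) lists of option names from .config text."""
    builtin, modular = [], []
    for line in config_text.splitlines():
        match = CONFIG_LINE.match(line.strip())
        if not match:
            continue  # comments, "is not set" lines, string/integer options
        name, value = match.groups()
        (builtin if value == "y" else modular).append(name)
    return builtin, modular


# Hypothetical sample; a real .config has thousands of lines.
SAMPLE_CONFIG = """\
# CONFIG_FOO is not set
CONFIG_XFS_FS=y
CONFIG_IGB=y
CONFIG_DRM_NOUVEAU=m
CONFIG_SND_USB_AUDIO=m
CONFIG_HZ=1000
"""

if __name__ == "__main__":
    builtin, modular = split_builtin_and_modular(SAMPLE_CONFIG)
    print("built in:", builtin)
    print("modules: ", modular)
```

On a real tree the text would come from the .config file at the top of the
kernel source directory; an empty "modules" list would correspond to the
old build style described in step 3.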

--

These errors are quite long, so I have uploaded the full logs over HTTP and
pasted the relevant bits below.

--

URLs for FULL logs:

1. tw_cli /cX show diag:
    http://home.comcast.net/~jpiszcz/20110911/show_diag.txt

2. Full kernel log (and previous morning of kernel crash)
    http://home.comcast.net/~jpiszcz/20110911/kern.log.txt

3. tw_cli /cX show all
    http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt

--

Summary (what seems to have occurred; I have not done a full analysis yet):

1. The 3ware card freaked out due to kernel/RCU/APIC(?) errors.

2. Then, the time source went unstable (this happens with weird kernel bugs
    on many different hosts; I have seen it over time).

3. Then, on the 3ware card, drives started leaving and being re-inserted
    by themselves, and XFS went off-line to protect the filesystem due to
    the 3ware issues.

--

3ware/RAID-- Interesting errors:

I've never seen this before on a 3ware RAID controller, at least from what
I can remember, and I've been using 3ware cards for many years.

p2    CFG-OP-FAIL    -    2.73 TB   SATA  2   -            Hitachi HDS723030AL 
p3    CFG-OP-FAIL    -    2.73 TB   SATA  3   -            Hitachi HDS723030AL
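
A quick way to spot ports in this state from a saved listing. This is a
sketch only: it assumes each port row starts with a pN name followed by a
status column, as in the two rows above. The p4 row in the sample is made up
for contrast, and the function name is hypothetical:

```python
def failed_ports(show_output):
    """Return (port, status) pairs for port rows whose status is not OK."""
    bad = []
    for line in show_output.splitlines():
        fields = line.split()
        # Port rows are assumed to look like "p<number> <status> ...".
        if len(fields) < 2:
            continue
        port, status = fields[0], fields[1]
        if port.startswith("p") and port[1:].isdigit() and status != "OK":
            bad.append((port, status))
    return bad


# First two rows are from the report; the p4 row is a made-up healthy port.
SAMPLE_ROWS = """\
p2    CFG-OP-FAIL    -    2.73 TB   SATA  2   -            Hitachi HDS723030AL
p3    CFG-OP-FAIL    -    2.73 TB   SATA  3   -            Hitachi HDS723030AL
p4    OK             u0   2.73 TB   SATA  4   -            Hitachi HDS723030AL
"""

if __name__ == "__main__":
    for port, status in failed_ports(SAMPLE_ROWS):
        print(port, status)
```

Any status other than OK is reported, so a port that is rebuilding would
show up as well.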

--

Kernel/ERRORS:

FWIW it all seems to have started during an encoding job around 21:00:

Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down
Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here ]------------
Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not tainted 3.1.0-rc4 #1
Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
Sep 10 20:59:39 p34 kernel: [531189.671424]  [<ffffffff810379ba>] warn_slowpath_common+0x7a/0xb0
Sep 10 20:59:39 p34 kernel: [531189.671427]  [<ffffffff81037a91>] warn_slowpath_fmt+0x41/0x50
Sep 10 20:59:39 p34 kernel: [531189.671433]  [<ffffffff815d7874>] ? schedule+0x2e4/0x950
Sep 10 20:59:39 p34 kernel: [531189.671436]  [<ffffffff814e5aff>] dev_watchdog+0x23f/0x250
Sep 10 20:59:39 p34 kernel: [531189.671440]  [<ffffffff81043872>] run_timer_softirq+0xf2/0x220
Sep 10 20:59:39 p34 kernel: [531189.671443]  [<ffffffff814e58c0>] ? qdisc_reset+0x50/0x50
Sep 10 20:59:39 p34 kernel: [531189.671446]  [<ffffffff8103d208>] __do_softirq+0x98/0x120
Sep 10 20:59:39 p34 kernel: [531189.671448]  [<ffffffff8103d345>] run_ksoftirqd+0xb5/0x160
Sep 10 20:59:39 p34 kernel: [531189.671454]  [<ffffffff8103d290>] ? __do_softirq+0x120/0x120
Sep 10 20:59:39 p34 kernel: [531189.671458]  [<ffffffff810523b7>] kthread+0x87/0x90
Sep 10 20:59:39 p34 kernel: [531189.671462]  [<ffffffff815dbdb4>] kernel_thread_helper+0x4/0x10
Sep 10 20:59:39 p34 kernel: [531189.671465]  [<ffffffff81052330>] ? kthread_worker_fn+0x130/0x130
Sep 10 20:59:39 p34 kernel: [531189.671467]  [<ffffffff815dbdb0>] ? gs_change+0xb/0xb
Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba ]---
Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]
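
To line these events up by kernel uptime rather than by syslog wall-clock
time, something like the following can help. A minimal sketch, assuming the
syslog prefix format seen above; the marker strings and the function name
are illustrative choices, not an exhaustive list:

```python
import re

# syslog prefix, then the kernel's "[seconds.microseconds]" uptime stamp.
LINE = re.compile(
    r"^(\w{3}\s+\d+ [\d:]{8}) \S+ kernel: \[\s*(\d+\.\d+)\] (.*)$"
)

# Substrings treated as "notable"; an assumption chosen for this excerpt.
MARKERS = ("WARNING:", "NETDEV WATCHDOG", "BUG:", "Reset adapter")


def notable_events(log_text):
    """Return (wall_clock, uptime_seconds, message) for matching lines."""
    events = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m and any(marker in m.group(3) for marker in MARKERS):
            events.append((m.group(1), float(m.group(2)), m.group(3)))
    return events


# Lines copied from the excerpt above.
SAMPLE_LOG = """\
Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]
"""

if __name__ == "__main__":
    events = notable_events(SAMPLE_LOG)
    first = events[0][1]
    for wall, uptime, message in events:
        print(f"{wall}  +{uptime - first:8.3f}s  {message}")
```

On the excerpt above this picks out the watchdog warning, the adapter reset
and the soft lockup, roughly 20 seconds apart by the kernel timestamps.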

--

Currently...

After all of this happened, I stopped all I/O and all processes on the
system, shut down the host, removed the power, and powered it back up. The
drives that showed CFG-OP-FAIL before now show as REBUILDING; I am waiting
for them to rebuild before doing anything else.

Justin.



* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-11  9:40 ` Justin Piszcz
@ 2011-09-13  3:59   ` Jesse Brandeburg
  -1 siblings, 0 replies; 18+ messages in thread
From: Jesse Brandeburg @ 2011-09-13  3:59 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, xfs, Alan Piszcz, NetDEV list

added netdev because it appears to start with an igb tx hang

On Sun, Sep 11, 2011 at 2:40 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> Hi,
>
> Over the past 24-48 hours I was running some CPU-intensive jobs and there
> was heavy I/O on the RAID (9750-24i4e + a RAID6).
>
> I believe most of the problem started when I included many kernel options as
> modules (before, I only compiled in [*] the drivers I used). Something
> appears to have gone awry in the kernel, and afterwards disks started going
> in and out, XFS shut down, etcetera.
>
> I'm opening a case with LSI to see what happened with the 3ware card;
> however, after a power cycle, everything came back OK. The drives and HW are
> physically OK, and the controller is rebuilding onto the two drives that
> showed CFG-OP-FAIL. Other than that, everything 'seems' OK, but I still need
> to do an fsck.
>
> Something went wrong in the kernel and caused a cascade of errors. This
> occurred (I believe) when I started to run a lot of encoding jobs; however,
> I had also been doing a lot of data transfer on the RAID array over the past
> 24-48 hours. The system disk (separate SSD/EXT4) remained unaffected, but
> other weird stuff happened as well.
>
> I still see these in the logs after the reboot (not often), e.g., while the
> RAID controller is rebuilding onto the two drives that showed CFG-OP-FAIL
> (the physical drives are 100% healthy):
>
> [ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.
>
> So, my plan:
>
> 1. Report this error to LKML+XFS mailing lists.
> 2. Open case with LSI support.
> 3. Recompile the kernel the way I have for many years [only compile in the
>   options I need [*] and do not build drivers as modules].
> 4. Reboot the Linux system and see if this recurs under the same workload,
>   after the RAID is done rebuilding.
>
> --
>
> These errors are quite long, so I have uploaded the full logs over HTTP and
> pasted the relevant bits below.
>
> --
>
> URLs for FULL logs:
>
> 1. tw_cli /cX show diag:
>   http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
>
> 2. Full kernel log (and previous morning of kernel crash)
>   http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
>
> 3. tw_cli /cX show all
>   http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
>
> --
>
> Summary (what seems to have occurred; I have not done a full analysis yet):
>
> 1. The 3ware card freaked out due to kernel/RCU/APIC(?) errors.
>
> 2. Then, the time source went unstable (this happens with weird kernel bugs
>   on many different hosts; I have seen it over time).
>
> 3. Then, on the 3ware card, drives started leaving and being re-inserted
>   by themselves, and XFS went off-line to protect the filesystem due to
>   the 3ware issues.
>
> --
>
> 3ware/RAID-- Interesting errors:
>
> I've never seen this before on a 3ware RAID controller, at least from what
> I can remember, and I've been using 3ware cards for many years.
>
> p2    CFG-OP-FAIL    -    2.73 TB   SATA  2   -            Hitachi HDS723030AL
> p3    CFG-OP-FAIL    -    2.73 TB   SATA  3   -            Hitachi HDS723030AL
>
> --
>
> Kernel/ERRORS:
>
> FWIW it all seems to have started during an encoding job around 21:00:
>
> Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down
> Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
> Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here ]------------
> Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
> Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
> Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
> Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
> Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not tainted 3.1.0-rc4 #1
> Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
> Sep 10 20:59:39 p34 kernel: [531189.671424]  [<ffffffff810379ba>] warn_slowpath_common+0x7a/0xb0
> Sep 10 20:59:39 p34 kernel: [531189.671427]  [<ffffffff81037a91>] warn_slowpath_fmt+0x41/0x50
> Sep 10 20:59:39 p34 kernel: [531189.671433]  [<ffffffff815d7874>] ? schedule+0x2e4/0x950
> Sep 10 20:59:39 p34 kernel: [531189.671436]  [<ffffffff814e5aff>] dev_watchdog+0x23f/0x250
> Sep 10 20:59:39 p34 kernel: [531189.671440]  [<ffffffff81043872>] run_timer_softirq+0xf2/0x220
> Sep 10 20:59:39 p34 kernel: [531189.671443]  [<ffffffff814e58c0>] ? qdisc_reset+0x50/0x50
> Sep 10 20:59:39 p34 kernel: [531189.671446]  [<ffffffff8103d208>] __do_softirq+0x98/0x120
> Sep 10 20:59:39 p34 kernel: [531189.671448]  [<ffffffff8103d345>] run_ksoftirqd+0xb5/0x160
> Sep 10 20:59:39 p34 kernel: [531189.671454]  [<ffffffff8103d290>] ? __do_softirq+0x120/0x120
> Sep 10 20:59:39 p34 kernel: [531189.671458]  [<ffffffff810523b7>] kthread+0x87/0x90
> Sep 10 20:59:39 p34 kernel: [531189.671462]  [<ffffffff815dbdb4>] kernel_thread_helper+0x4/0x10
> Sep 10 20:59:39 p34 kernel: [531189.671465]  [<ffffffff81052330>] ? kthread_worker_fn+0x130/0x130
> Sep 10 20:59:39 p34 kernel: [531189.671467]  [<ffffffff815dbdb0>] ? gs_change+0xb/0xb
> Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba ]---
> Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
> Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]
>
> --
>
> Currently...
>
> After all of this happened, I stopped all I/O and all processes on the
> system, shut down the host, removed the power, and powered it back up. The
> drives that showed CFG-OP-FAIL before now show as REBUILDING; I am waiting
> for them to rebuild before doing anything else.
>
> Justin.
>
>

* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13  3:59   ` Jesse Brandeburg
@ 2011-09-13  4:05     ` Eric Dumazet
  -1 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2011-09-13  4:05 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Justin Piszcz, linux-kernel, xfs, Alan Piszcz, NetDEV list

On Monday, 12 September 2011 at 20:59 -0700, Jesse Brandeburg wrote:
> added netdev because it appears to start with an igb tx hang
> 
> On Sun, Sep 11, 2011 at 2:40 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> > Hi,
> >
> > Over the past 24-48 hours I was running some CPU-intensive jobs and there
> > was heavy I/O on the RAID (9750-24i4e + a RAID6).
> >
> > I believe most of the problem started when I included many kernel options as
> > modules (before, I only compiled in [*] the drivers I used). Something
> > appears to have gone awry in the kernel, and afterwards disks started going
> > in and out, XFS shut down, etcetera.
> >
> > I'm opening a case with LSI to see what happened with the 3ware card;
> > however, after a power cycle, everything came back OK. The drives and HW are
> > physically OK, and the controller is rebuilding onto the two drives that
> > showed CFG-OP-FAIL. Other than that, everything 'seems' OK, but I still need
> > to do an fsck.
> >
> > Something went wrong in the kernel and caused a cascade of errors. This
> > occurred (I believe) when I started to run a lot of encoding jobs; however,
> > I had also been doing a lot of data transfer on the RAID array over the past
> > 24-48 hours. The system disk (separate SSD/EXT4) remained unaffected, but
> > other weird stuff happened as well.
> >
> > I still see these in the logs after the reboot (not often), e.g., while the
> > RAID controller is rebuilding onto the two drives that showed CFG-OP-FAIL
> > (the physical drives are 100% healthy):
> >
> > [ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.
> >
> > So, my plan:
> >
> > 1. Report this error to LKML+XFS mailing lists.
> > 2. Open case with LSI support.
> > 3. Recompile the kernel the way I have for many years [only compile in the
> >   options I need [*] and do not build drivers as modules].
> > 4. Reboot the Linux system and see if this recurs under the same workload,
> >   after the RAID is done rebuilding.
> >
> > --
> >
> > These errors are quite long, so I have uploaded the full logs over HTTP and
> > pasted the relevant bits below.
> >
> > --
> >
> > URLs for FULL logs:
> >
> > 1. tw_cli /cX show diag:
> >   http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
> >
> > 2. Full kernel log (and previous morning of kernel crash)
> >   http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
> >
> > 3. tw_cli /cX show all
> >   http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
> >
> > --
> >
> > Summary (what seems to have occurred, have not done a full analysis yet)
> >
> > 1. 3ware card freaked out due to kernel/RCU/APIC(?) errors
> >
> > 2. Then, the time source went unstable (this happens with weird kernel bugs
> >   on many different hosts, I have seen this over time).
> >
> > 3. Then, on the 3ward carde, drives started leaving and being re-inserted
> >   by themsevles, XFS went off-line to protect the filesystem due to the
> >   3ware issues
> >
> > --
> >
> > 3ware/RAID-- Interesting errors:
> >
> > I've never seen this before on a 3ware RAID controller, at least from what
> > I can remember and I've been using 3ware cards for many years..
> >
> > p2    CFG-OP-FAIL    -    2.73 TB   SATA  2   -            Hitachi HDS723030AL
> > p3    CFG-OP-FAIL    -    2.73 TB   SATA  3   -            Hitachi HDS723030AL
> >
> > --
> >
> > Kernel/ERRORS:
> >
> > FWIW it all seems to have started during an encoding job around 21:00:
> >
> > Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down
> > Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
> > Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here ]------------
> > Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
> > Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
> > Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
> > Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
> > Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not tainted 3.1.0-rc4 #1
> > Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
> > Sep 10 20:59:39 p34 kernel: [531189.671424]  [<ffffffff810379ba>] warn_slowpath_common+0x7a/0xb0
> > Sep 10 20:59:39 p34 kernel: [531189.671427]  [<ffffffff81037a91>] warn_slowpath_fmt+0x41/0x50
> > Sep 10 20:59:39 p34 kernel: [531189.671433]  [<ffffffff815d7874>] ? schedule+0x2e4/0x950
> > Sep 10 20:59:39 p34 kernel: [531189.671436]  [<ffffffff814e5aff>] dev_watchdog+0x23f/0x250
> > Sep 10 20:59:39 p34 kernel: [531189.671440]  [<ffffffff81043872>] run_timer_softirq+0xf2/0x220
> > Sep 10 20:59:39 p34 kernel: [531189.671443]  [<ffffffff814e58c0>] ? qdisc_reset+0x50/0x50
> > Sep 10 20:59:39 p34 kernel: [531189.671446]  [<ffffffff8103d208>] __do_softirq+0x98/0x120
> > Sep 10 20:59:39 p34 kernel: [531189.671448]  [<ffffffff8103d345>] run_ksoftirqd+0xb5/0x160
> > Sep 10 20:59:39 p34 kernel: [531189.671454]  [<ffffffff8103d290>] ? __do_softirq+0x120/0x120
> > Sep 10 20:59:39 p34 kernel: [531189.671458]  [<ffffffff810523b7>] kthread+0x87/0x90
> > Sep 10 20:59:39 p34 kernel: [531189.671462]  [<ffffffff815dbdb4>] kernel_thread_helper+0x4/0x10
> > Sep 10 20:59:39 p34 kernel: [531189.671465]  [<ffffffff81052330>] ? kthread_worker_fn+0x130/0x130
> > Sep 10 20:59:39 p34 kernel: [531189.671467]  [<ffffffff815dbdb0>] ? gs_change+0xb/0xb
> > Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba ]---
> > Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
> > Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> > Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]
> >
> > --
> >
> > Currently...
> >
> > After all of this happened, I stopped all I/O and all processes on the
> > system, shut down the host, removed the power, and powered it back up. The
> > drives that showed CFG-OP-FAIL before now show as REBUILDING; I am waiting
> > for them to rebuild before doing anything else.
> >
> > Justin.
> >
> >
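
The order of events in the quoted excerpt (igb transmit timeout first, soft lockup about 20 seconds later) can be checked mechanically against the full kern.log. The following is only a minimal sketch: it assumes each record sits on a single line, as it would in the full kern.log, and the list of "interesting" message prefixes is my own choice, not something taken from the report.

```python
import re

# Matches syslog-style kernel lines such as:
#   Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at ...
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<msg>.*)$"
)

# Message prefixes treated as interesting. This list is an assumption.
MARKERS = ("WARNING:", "BUG:", "NETDEV WATCHDOG:")


def flagged_events(lines):
    """Return (wall-clock stamp, uptime in seconds, message) for flagged lines, in log order."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("msg").startswith(MARKERS):
            events.append((m.group("stamp"), float(m.group("uptime")), m.group("msg")))
    return events


# Records copied from the excerpt above, one per line.
sample = [
    "Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.",
    "Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()",
    "Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out",
    "Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]",
]

for stamp, uptime, msg in flagged_events(sample):
    print(stamp, uptime, msg)
```

Run over the whole log, the first entry returned is the earliest flagged record, which is a quick way to see which subsystem complained first.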

Justin, please make sure you have pulled this commit:

commit ed2888e906b56769b4ffabb9c577190438aa68b8
Author: Jon Mason <mason@myri.com>
Date:   Thu Sep 8 16:41:18 2011 -0500

    PCI: Remove MRRS modification from MPS setting code
    
    Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
    massive negative ramifications on some devices.  Without knowing which
    devices have this issue, do not modify from the default value when
    walking the PCI-E bus in pcie_bus_safe mode.  Also, make pcie_bus_safe
    the default procedure.
    
    Tested-by: Sven Schnelle <svens@stackframe.org>
    Tested-by: Simon Kirby <sim@hostway.ca>
    Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
    Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
    Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.
    References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
    Signed-off-by: Jon Mason <mason@myri.com>
    Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
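
For readers unfamiliar with the encoding the commit message refers to ("Maximum Read Request Size to 0 (value of 128Bytes)"): in the PCI Express Device Control register, Max_Payload_Size occupies bits 7:5 and Max_Read_Request_Size bits 14:12, and a field value n stands for 128 << n bytes. The sketch below only illustrates that arithmetic. The register value 0x2830 is made up, and the helper names are hypothetical, not kernel functions.

```python
def decode_devctl(devctl):
    """Decode MPS and MRRS (in bytes) from a 16-bit PCIe Device Control value."""
    mps_field = (devctl >> 5) & 0x7    # bits 7:5, Max_Payload_Size
    mrrs_field = (devctl >> 12) & 0x7  # bits 14:12, Max_Read_Request_Size
    return {"mps": 128 << mps_field, "mrrs": 128 << mrrs_field}


def set_mrrs_field(devctl, field):
    """Return devctl with the MRRS field replaced by `field` (0..5)."""
    if not 0 <= field <= 5:
        raise ValueError("MRRS field values 6 and 7 are reserved")
    return (devctl & ~0x7000) | (field << 12)


# Hypothetical register value: MPS field 1 (256 bytes), MRRS field 2 (512 bytes).
before = 0x2830
after = set_mrrs_field(before, 0)  # MRRS field forced to 0, i.e. 128 bytes

print(decode_devctl(before))
print(decode_devctl(after))
```

Forcing the field to 0 in this way shrinks every read request the device may issue to 128 bytes, which is the modification the commit stops making.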



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13  4:05     ` Eric Dumazet
@ 2011-09-13 14:54       ` Justin Piszcz
  -1 siblings, 0 replies; 18+ messages in thread
From: Justin Piszcz @ 2011-09-13 14:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs, linux-kernel



On Tue, 13 Sep 2011, Eric Dumazet wrote:

> Justin, please make sure you have pulled this commit:
>
> commit ed2888e906b56769b4ffabb9c577190438aa68b8
> Author: Jon Mason <mason@myri.com>
> Date:   Thu Sep 8 16:41:18 2011 -0500
>
>    PCI: Remove MRRS modification from MPS setting code
>
>    Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>    massive negative ramifications on some devices.  Without knowing which
>    devices have this issue, do not modify from the default value when
>    walking the PCI-E bus in pcie_bus_safe mode.  Also, make pcie_bus_safe
>    the default procedure.
>
>    Tested-by: Sven Schnelle <svens@stackframe.org>
>    Tested-by: Simon Kirby <sim@hostway.ca>
>    Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>    Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>    Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.
>    References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>    Signed-off-by: Jon Mason <mason@myri.com>
>    Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
>    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Hello,

I found this commit here:
http://permalink.gmane.org/gmane.linux.kernel.pci/11700

Applied:
# patch -p1 < ../ed2888e906b56769b4ffabb9c577190438aa68b8.txt 
patching file drivers/pci/probe.c

I will update this thread if the problem recurs. Can someone also please advise
which DEBUG options I should enable to catch further SLAB/RCU issues?

So far, I have the following enabled:

CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SLAB_LEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VIRTUAL=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_PER_CPU_MAPS=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_RODATA=y

Thanks,

Justin.
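
Once a list of recommended options is known, checking a .config against it is easy to script. This is a minimal sketch under stated assumptions: the `wanted` list is purely illustrative (CONFIG_PROVE_RCU is included only as an example of a symbol that is not set; nothing in this thread establishes that it is the right option for this problem), and the function names are hypothetical.

```python
def enabled_options(config_text):
    """Return the set of CONFIG_* symbols set to y or m in kernel .config text."""
    enabled = set()
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skips blank lines and "# CONFIG_FOO is not set"
        name, sep, value = line.partition("=")
        if sep and name.startswith("CONFIG_") and value in ("y", "m"):
            enabled.add(name)
    return enabled


def missing_options(config_text, wanted):
    """Return the wanted symbols that are not enabled, preserving their order."""
    have = enabled_options(config_text)
    return [name for name in wanted if name not in have]


# A made-up .config fragment in the usual format.
sample_config = """\
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
# CONFIG_PROVE_RCU is not set
"""

# Illustrative list only; see the note above.
wanted = ["CONFIG_DEBUG_SLAB", "CONFIG_PROVE_RCU"]

print(missing_options(sample_config, wanted))
```

In practice the text would come from the build tree, for example `open(".config").read()`.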


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13 14:54       ` Justin Piszcz
@ 2011-09-13 14:58         ` Eric Dumazet
  -1 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2011-09-13 14:58 UTC (permalink / raw)
  To: Justin Piszcz
  Cc: Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs, linux-kernel

2011/9/13 Justin Piszcz <jpiszcz@lucidpixels.com>:
>

> I found this commit here:
> http://permalink.gmane.org/gmane.linux.kernel.pci/11700
>
> Applied:
> # patch -p1 < ../ed2888e906b56769b4ffabb9c577190438aa68b8.txt
> patching file drivers/pci/probe.c
>
>

Oh, I should have sent the git command to use instead of leaving you to search the web:

git pull https://github.com/torvalds/linux.git

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13 14:54       ` Justin Piszcz
@ 2011-09-13 15:35         ` Jon Mason
  -1 siblings, 0 replies; 18+ messages in thread
From: Jon Mason @ 2011-09-13 15:35 UTC (permalink / raw)
  To: Justin Piszcz
  Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs,
	linux-kernel

On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>
> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>
>> Justin, please make sure you have pulled this commit:
>> commit ed2888e906b56769b4ffabb9c577190438aa68b8
>> Author: Jon Mason <mason@myri.com>
>> Date:   Thu Sep 8 16:41:18 2011 -0500
>>
>>   PCI: Remove MRRS modification from MPS setting code
>>
>>   Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>>   massive negative ramifications on some devices.  Without knowing which
>>   devices have this issue, do not modify from the default value when
>>   walking the PCI-E bus in pcie_bus_safe mode.  Also, make pcie_bus_safe
>>   the default procedure.
>>
>>   Tested-by: Sven Schnelle <svens@stackframe.org>
>>   Tested-by: Simon Kirby <sim@hostway.ca>
>>   Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>>   Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>>   Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.
>>   References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>>   Signed-off-by: Jon Mason <mason@myri.com>
>>   Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
>>   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> Hello,
>
> I found this commit here:
> http://permalink.gmane.org/gmane.linux.kernel.pci/11700

That link is an early version of the patch.  This is the one you want:
https://github.com/torvalds/linux/commit/ed2888e906b56769b4ffabb9c577190438aa68b8

It appears that this patch didn't make it to the lkml or linux-pci lists
because kernel.org DNS was down when it was sent.

Thanks,
Jon

>
> Applied:
> # patch -p1 < ../ed2888e906b56769b4ffabb9c577190438aa68b8.txt
> patching file drivers/pci/probe.c
>
> I will update this thread if the problem recurs. Can someone also please advise
> which DEBUG options I should enable to catch further SLAB/RCU issues?
>
> So far, I have the following enabled:
>
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_HAVE_DMA_API_DEBUG=y
> CONFIG_X86_DEBUGCTLMSR=y
> CONFIG_DEBUG_FS=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_SLAB=y
> CONFIG_DEBUG_SLAB_LEAK=y
> CONFIG_DEBUG_KMEMLEAK=y
> CONFIG_DEBUG_STACK_USAGE=y
> CONFIG_DEBUG_BUGVERBOSE=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_VM=y
> CONFIG_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_MEMORY_INIT=y
> CONFIG_DEBUG_PER_CPU_MAPS=y
> CONFIG_DEBUG_PAGEALLOC=y
> CONFIG_DEBUG_STACKOVERFLOW=y
> CONFIG_DEBUG_RODATA=y
>
> Thanks,
>
> Justin.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13 15:35         ` Jon Mason
@ 2011-09-13 15:42           ` Justin Piszcz
  -1 siblings, 0 replies; 18+ messages in thread
From: Justin Piszcz @ 2011-09-13 15:42 UTC (permalink / raw)
  To: Jon Mason
  Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs,
	linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1906 bytes --]



On Tue, 13 Sep 2011, Jon Mason wrote:

> On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>>
>>
>> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>>
>>> Justin, please make sure you have pulled this commit:
>>> commit ed2888e906b56769b4ffabb9c577190438aa68b8
>>> Author: Jon Mason <mason@myri.com>
>>> Date:   Thu Sep 8 16:41:18 2011 -0500
>>>
>>>   PCI: Remove MRRS modification from MPS setting code
>>>
>>>   Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>>>   massive negative ramifications on some devices.  Without knowing which
>>>   devices have this issue, do not modify from the default value when
>>>   walking the PCI-E bus in pcie_bus_safe mode.  Also, make pcie_bus_safe
>>>   the default procedure.
>>>
>>>   Tested-by: Sven Schnelle <svens@stackframe.org>
>>>   Tested-by: Simon Kirby <sim@hostway.ca>
>>>   Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>>>   Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>>>   Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.
>>>   References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>>>   Signed-off-by: Jon Mason <mason@myri.com>
>>>   Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
>>>   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>>
>> Hello,
>>
>> I found this commit here:
>> http://permalink.gmane.org/gmane.linux.kernel.pci/11700
>
> This is an early version of the patch.  This is the patch that you want:
> https://github.com/torvalds/linux/commit/ed2888e906b56769b4ffabb9c577190438aa68b8
>
> It appears that this patch didn't make it to lkml or linux-pci list
> due to kernel.org DNS being down when it was sent.
>
> Thanks,
> Jon

I need to learn how to use git at some point. In the meantime, can you please
provide plain-text patches so I can apply them and reboot?

Justin.
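
One way to get a plain-text patch without using git, sketched under the assumption that GitHub serves a commit in `git format-patch` form when `.patch` is appended to its commit URL. The helper names are hypothetical; the commit hash and base URL are the ones given in this thread.

```python
import re

COMMIT = "ed2888e906b56769b4ffabb9c577190438aa68b8"
BASE = "https://github.com/torvalds/linux/commit/"

# First line of git format-patch output: "From <40-hex sha> Mon Sep 17 00:00:00 2001"
FORMAT_PATCH_HEADER = re.compile(r"^From [0-9a-f]{40} Mon Sep 17 00:00:00 2001$")


def patch_url(sha, base=BASE):
    """Build the plain-text patch URL for a full 40-character commit hash."""
    if not re.fullmatch(r"[0-9a-f]{40}", sha):
        raise ValueError("expected a full 40-character lowercase hex commit hash")
    return base + sha + ".patch"


def looks_like_format_patch(text):
    """Cheap sanity check that downloaded text starts like format-patch output."""
    first_line = text.split("\n", 1)[0]
    return bool(FORMAT_PATCH_HEADER.match(first_line))


print(patch_url(COMMIT))
```

The resulting URL can be fetched with any HTTP client and the saved file applied from the top of the kernel tree with `patch -p1`, as done earlier in this thread. Running `patch -p1 --dry-run` first shows whether it applies cleanly without touching any files.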

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13 15:42           ` Justin Piszcz
@ 2011-09-13 15:51             ` Jon Mason
  -1 siblings, 0 replies; 18+ messages in thread
From: Jon Mason @ 2011-09-13 15:51 UTC (permalink / raw)
  To: Justin Piszcz
  Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2175 bytes --]

On Tue, Sep 13, 2011 at 10:42 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>
> On Tue, 13 Sep 2011, Jon Mason wrote:
>
>> On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com>
>> wrote:
>>>
>>>
>>> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>>>
>>>> Justin, please make sure you have pulled this commit:
>>>> commit ed2888e906b56769b4ffabb9c577190438aa68b8
>>>> Author: Jon Mason <mason@myri.com>
>>>> Date:   Thu Sep 8 16:41:18 2011 -0500
>>>>
>>>>   PCI: Remove MRRS modification from MPS setting code
>>>>
>>>>   Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>>>>   massive negative ramifications on some devices.  Without knowing which
>>>>   devices have this issue, do not modify from the default value when
>>>>   walking the PCI-E bus in pcie_bus_safe mode.  Also, make pcie_bus_safe
>>>>   the default procedure.
>>>>
>>>>   Tested-by: Sven Schnelle <svens@stackframe.org>
>>>>   Tested-by: Simon Kirby <sim@hostway.ca>
>>>>   Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>>>>   Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>>>>   Reported-and-tested-by: Niels Ole Salscheider
>>>> <niels_ole@salscheider-online.
>>>>   References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>>>>   Signed-off-by: Jon Mason <mason@myri.com>
>>>>   Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
>>>>   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>>>
>>> Hello,
>>>
>>> I found this commit here:
>>> http://permalink.gmane.org/gmane.linux.kernel.pci/11700
>>
>> This is an early version of the patch.  This is the patch that you want:
>>
>> https://github.com/torvalds/linux/commit/ed2888e906b56769b4ffabb9c577190438aa68b8
>>
>> It appears that this patch didn't make it to lkml or linux-pci list
>> due to kernel.org DNS being down when it was sent.
>>
>> Thanks,
>> Jon
>
> I need to learn how to use git at some point, can you please provide plain
> text patches so I can apply them and reboot?
>
> Justin.

I've attached the two patches I asked Linus to include in 3.1-rc6.
Let me know if there are any issues.

Thanks,
Jon

[-- Attachment #2: 0001-Fix-pointer-dereference-before-call-to-pcie_bus_conf.patch --]
[-- Type: text/x-patch, Size: 2344 bytes --]

From cf822aed99fd8851d82ae5f2df11c29b79e316c8 Mon Sep 17 00:00:00 2001
From: Shyam Iyer <shyam.iyer.t@gmail.com>
Date: Wed, 31 Aug 2011 12:21:42 -0400
Subject: [PATCH 1/2] Fix pointer dereference before call to
 pcie_bus_configure_settings

There is a potential NULL pointer dereference in calls to
pcie_bus_configure_settings due to attempts to access pci_bus self
variables when the self pointer is NULL.  To correct this, verify that
the self pointer in pci_bus is non-NULL before dereferencing it.

Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Shyam Iyer <shyam_iyer@dell.com>
Signed-off-by: Jon Mason <mason@myri.com>
---
 arch/x86/pci/acpi.c              |    9 +++++++--
 drivers/pci/hotplug/pcihp_slot.c |    4 +++-
 drivers/pci/probe.c              |    3 ---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index c953302..039d913 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -365,8 +365,13 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_pci_root *root)
 	 */
 	if (bus) {
 		struct pci_bus *child;
-		list_for_each_entry(child, &bus->children, node)
-			pcie_bus_configure_settings(child, child->self->pcie_mpss);
+		list_for_each_entry(child, &bus->children, node) {
+			struct pci_dev *self = child->self;
+			if (!self)
+				continue;
+
+			pcie_bus_configure_settings(child, self->pcie_mpss);
+		}
 	}
 
 	if (!bus)
diff --git a/drivers/pci/hotplug/pcihp_slot.c b/drivers/pci/hotplug/pcihp_slot.c
index 753b21a..3ffd9c1 100644
--- a/drivers/pci/hotplug/pcihp_slot.c
+++ b/drivers/pci/hotplug/pcihp_slot.c
@@ -169,7 +169,9 @@ void pci_configure_slot(struct pci_dev *dev)
 			(dev->class >> 8) == PCI_CLASS_BRIDGE_PCI)))
 		return;
 
-	pcie_bus_configure_settings(dev->bus, dev->bus->self->pcie_mpss);
+	if (dev->bus && dev->bus->self)
+		pcie_bus_configure_settings(dev->bus,
+					    dev->bus->self->pcie_mpss);
 
 	memset(&hpp, 0, sizeof(hpp));
 	ret = pci_get_hp_params(dev, &hpp);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8473727..0820fc1 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1456,9 +1456,6 @@ void pcie_bus_configure_settings(struct pci_bus *bus, u8 mpss)
 {
 	u8 smpss = mpss;
 
-	if (!bus->self)
-		return;
-
 	if (!pci_is_pcie(bus->self))
 		return;
 
-- 
1.7.6
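
Note on patch 1/2: a minimal sketch of the guard it adds, written as
standalone C rather than kernel code. struct fake_dev, struct fake_bus,
configure_settings() and configure_children() are hypothetical stand-ins
for struct pci_dev, struct pci_bus, pcie_bus_configure_settings() and the
list_for_each_entry() loop in pci_acpi_scan_root(). Only the
check-before-dereference pattern is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev (illustration only). */
struct fake_dev {
	unsigned char pcie_mpss;	/* encoded MPSS: 0 => 128 bytes, 1 => 256, ... */
};

/* Hypothetical stand-in for struct pci_bus (illustration only). */
struct fake_bus {
	struct fake_dev *self;		/* bridge leading to this bus; may be NULL */
	int configured_bytes;		/* 0 until the bus has been configured */
};

/* Stand-in for the real configuration call: just records the decoded size. */
static void configure_settings(struct fake_bus *bus, unsigned char mpss)
{
	bus->configured_bytes = 128 << mpss;
}

/*
 * Walk the child buses and configure only those that have a bridge device.
 * The NULL test happens in the caller, before self->pcie_mpss is read,
 * which is the ordering the patch introduces.
 * Returns the number of buses that were configured.
 */
static int configure_children(struct fake_bus *children, size_t n)
{
	int done = 0;
	size_t i;

	assert(children != NULL || n == 0);

	for (i = 0; i < n; i++) {
		struct fake_dev *self = children[i].self;

		if (!self)
			continue;	/* no bridge: nothing to dereference */

		configure_settings(&children[i], self->pcie_mpss);
		done++;
	}
	return done;
}
```

Given one bus whose self pointer is NULL and one bus behind a bridge with
an encoded MPSS of 1, configure_children() skips the first, configures the
second with 256 bytes, and returns 1.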


[-- Attachment #3: 0002-PCI-Remove-MRRS-modification-from-MPS-setting-code.patch --]
[-- Type: text/x-patch, Size: 4404 bytes --]

From 74d81235f8e4bd60859d539a27e51d3a09d183cf Mon Sep 17 00:00:00 2001
From: Jon Mason <mason@myri.com>
Date: Thu, 8 Sep 2011 12:59:00 -0500
Subject: [PATCH 2/2] PCI: Remove MRRS modification from MPS setting code

Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
massive negative ramifications on some devices.  Without knowing which
devices have this issue, do not modify from the default value when
walking the PCI-E bus in pcie_bus_safe mode.  Also, make pcie_bus_safe
the default procedure.

Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Simon Kirby <sim@hostway.ca>
Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.de>
References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
Signed-off-by: Jon Mason <mason@myri.com>
---
 drivers/pci/pci.c   |    2 +-
 drivers/pci/probe.c |   41 ++++++++++++++++++++++-------------------
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 0ce6742..4e84fd4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -77,7 +77,7 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE;
 unsigned long pci_hotplug_io_size  = DEFAULT_HOTPLUG_IO_SIZE;
 unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE;
 
-enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_PERFORMANCE;
+enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_SAFE;
 
 /*
  * The default CLS is used if arch didn't set CLS explicitly and not
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 0820fc1..b1187ff 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1396,34 +1396,37 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
 
 static void pcie_write_mrrs(struct pci_dev *dev, int mps)
 {
-	int rc, mrrs;
+	int rc, mrrs, dev_mpss;
 
-	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
-		int dev_mpss = 128 << dev->pcie_mpss;
+	/* In the "safe" case, do not configure the MRRS.  There appear to be
+	 * issues with setting MRRS to 0 on a number of devices.
+	 */
 
-		/* For Max performance, the MRRS must be set to the largest
-		 * supported value.  However, it cannot be configured larger
-		 * than the MPS the device or the bus can support.  This assumes
-		 * that the largest MRRS available on the device cannot be
-		 * smaller than the device MPSS.
-		 */
-		mrrs = mps < dev_mpss ? mps : dev_mpss;
-	} else
-		/* In the "safe" case, configure the MRRS for fairness on the
-		 * bus by making all devices have the same size
-		 */
-		mrrs = mps;
+	if (pcie_bus_config != PCIE_BUS_PERFORMANCE)
+		return;
+
+	dev_mpss = 128 << dev->pcie_mpss;
 
+	/* For Max performance, the MRRS must be set to the largest supported
+	 * value.  However, it cannot be configured larger than the MPS the
+	 * device or the bus can support.  This assumes that the largest MRRS
+	 * available on the device cannot be smaller than the device MPSS.
+	 */
+	mrrs = min(mps, dev_mpss);
 
 	/* MRRS is a R/W register.  Invalid values can be written, but a
-	 * subsiquent read will verify if the value is acceptable or not.
+	 * subsequent read will verify if the value is acceptable or not.
 	 * If the MRRS value provided is not acceptable (e.g., too large),
 	 * shrink the value until it is acceptable to the HW.
  	 */
 	while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
+		dev_warn(&dev->dev, "Attempting to modify the PCI-E MRRS value"
+			 " to %d.  If any issues are encountered, please try "
+			 "running with pci=pcie_bus_safe\n", mrrs);
 		rc = pcie_set_readrq(dev, mrrs);
 		if (rc)
-			dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
+			dev_err(&dev->dev,
+				"Failed attempting to set the MRRS\n");
 
 		mrrs /= 2;
 	}
@@ -1436,13 +1439,13 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
 	if (!pci_is_pcie(dev))
 		return 0;
 
-	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
+	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
 
 	pcie_write_mps(dev, mps);
 	pcie_write_mrrs(dev, mps);
 
-	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
+	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
 
 	return 0;
-- 
1.7.6
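
Note on patch 2/2: a rough standalone model of the MRRS logic, assuming a
fake device whose register silently ignores values it cannot support.
struct fake_dev, enum bus_mode, fake_set_readrq() and settle_mrrs() are
hypothetical names, not kernel API. The sketch keeps the patch's two
ideas (leave MRRS alone unless in performance mode; otherwise start from
min(mps, 128 << pcie_mpss) and shrink), but it stops as soon as a
read-back matches, whereas the loop in the patch above halves mrrs on
every pass.

```c
#include <assert.h>

/* Hypothetical mode switch mirroring the safe/performance split. */
enum bus_mode { MODE_SAFE, MODE_PERFORMANCE };

/* Hypothetical device model (not struct pci_dev). */
struct fake_dev {
	unsigned char pcie_mpss;	/* encoded MPSS: 0 => 128 bytes, 1 => 256, ... */
	int max_readrq;			/* largest MRRS this fake hardware accepts */
	int readrq;			/* current MRRS in bytes */
};

/* Decode the MPSS field into bytes, as in "128 << dev->pcie_mpss". */
static int mpss_bytes(unsigned char enc)
{
	return 128 << enc;
}

/* Fake register write: a value that is too large simply does not stick. */
static void fake_set_readrq(struct fake_dev *dev, int rq)
{
	if (rq <= dev->max_readrq)
		dev->readrq = rq;
}

/*
 * Safe mode: do not touch MRRS at all.
 * Performance mode: try min(mps, device MPSS), halving until the value
 * read back equals the value written or we drop below 128 bytes.
 * Returns the MRRS the device ends up with.
 */
static int settle_mrrs(struct fake_dev *dev, int mps, enum bus_mode mode)
{
	int dev_mpss, mrrs;

	assert(mps >= 128);

	if (mode != MODE_PERFORMANCE)
		return dev->readrq;

	dev_mpss = mpss_bytes(dev->pcie_mpss);
	mrrs = mps < dev_mpss ? mps : dev_mpss;

	while (mrrs >= 128) {
		fake_set_readrq(dev, mrrs);
		if (dev->readrq == mrrs)
			break;		/* hardware accepted this value */
		mrrs /= 2;		/* rejected: try the next smaller size */
	}
	return dev->readrq;
}
```

With pcie_mpss = 2 (512 bytes), a hardware limit of 256 and mps = 512 in
performance mode, the first write of 512 is rejected and the device
settles at 256. In safe mode the existing MRRS is returned unchanged.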


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: 3.1-rc4: spectacular kernel errors / filesystem crash
  2011-09-13 15:51             ` Jon Mason
@ 2011-09-13 16:32               ` Justin Piszcz
  -1 siblings, 0 replies; 18+ messages in thread
From: Justin Piszcz @ 2011-09-13 16:32 UTC (permalink / raw)
  To: Jon Mason
  Cc: Eric Dumazet, Jesse Brandeburg, Alan Piszcz, NetDEV list, xfs,
	linux-kernel



On Tue, 13 Sep 2011, Jon Mason wrote:

> On Tue, Sep 13, 2011 at 10:42 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>>
>>
>> On Tue, 13 Sep 2011, Jon Mason wrote:
>>
>>> On Tue, Sep 13, 2011 at 9:54 AM, Justin Piszcz <jpiszcz@lucidpixels.com>
>>> wrote:
>>>>
>>>>
>>>> On Tue, 13 Sep 2011, Eric Dumazet wrote:
>>>>

Thanks,

# patch -p1 < ../0001-Fix-pointer-dereference-before-call-to-pcie_bus_conf.patch
patching file arch/x86/pci/acpi.c
patching file drivers/pci/hotplug/pcihp_slot.c
patching file drivers/pci/probe.c
# patch -p1 < ../0002-PCI-Remove-MRRS-modification-from-MPS-setting-code.patch
patching file drivers/pci/pci.c
patching file drivers/pci/probe.c
#

Rebooted and now running 3.1-rc4 with the new patches.
I will let you know if there are any further issues. I wonder if this
will fix the RCU/SLAB issues too. Thanks.

Justin.

^ permalink raw reply	[flat|nested] 18+ messages in thread
