From: Eric Dumazet <eric.dumazet@gmail.com>
To: Jesse Brandeburg <jesse.brandeburg@gmail.com>
Cc: Justin Piszcz <jpiszcz@lucidpixels.com>,
	linux-kernel@vger.kernel.org, xfs@oss.sgi.com,
	Alan Piszcz <ap@solarrain.com>,
	NetDEV list <netdev@vger.kernel.org>
Subject: Re: 3.1-rc4: spectacular kernel errors / filesystem crash
Date: Tue, 13 Sep 2011 06:05:06 +0200
Message-ID: <1315886706.2556.11.camel@edumazet-laptop>
In-Reply-To: <CAEuXFEzs1f7n5taYzupux3AtKmRcY4P0m7yjkUQA8aLyS8eujw@mail.gmail.com>

On Monday, 12 September 2011 at 20:59 -0700, Jesse Brandeburg wrote:
> added netdev because it appears to start with an igb tx hang
> 
> On Sun, Sep 11, 2011 at 2:40 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> > Hi,
> >
> > Over the past 24-48 hours I was running some CPU-intensive jobs and there
> > was heavy I/O on the RAID (9750-24i4e + a RAID6).
> >
> > I believe most of the problem started when I included many kernel options as
> > modules (before, I only compiled in [*] the drivers I used). Something
> > appears to have gone awry in the kernel, and afterwards disks started
> > going in and out, XFS shut down, etcetera.
> >
> > I'm opening a case with LSI to see what happened with the 3ware card;
> > however, after a power cycle, everything came back OK (the drives and HW are
> > physically OK). It is rebuilding onto those two drives with CFG-OP-FAIL, but
> > other than that everything 'seems' OK. I still need to do an fsck.
> >
> > Something went wrong in the kernel and caused a cascade of errors.
> > This occurred (I believe) when I started to run a lot of encoding jobs;
> > however, I had also been doing a lot of data transfer on the RAID array
> > for the past 24-48 hours. The system disk (separate SSD/EXT4) remained
> > unaffected, but other weird stuff happened as well.
> >
> > I still see these in the logs after the reboot (not often), e.g. while
> > the RAID controller is rebuilding the two drives with CFG-OP-FAIL (the
> > physical drives are 100% healthy):
> >
> > [ 1062.925904] 3w-sas 0000:83:00.0: vpd r/w failed.  This is likely a
> > firmware bug on this device.  Contact the card vendor for a firmware update.
> >
> > So, my plan:
> >
> > 1. Report this error to LKML+XFS mailing lists.
> > 2. Open case with LSI support.
> > 3. Recompile the kernel the way I have for many years [only compile in the
> >   options I need [*] and do not build drivers as modules]
> > 4. Reboot the Linux system and see if this recurs under the same
> >   workload, after the RAID is done rebuilding.
> >
> > --
> >
> > The errors are quite long, so I have uploaded the full logs via HTTP and
> > pasted the relevant bits below.
> >
> > --
> >
> > URLs for FULL logs:
> >
> > 1. tw_cli /cX show diag:
> >   http://home.comcast.net/~jpiszcz/20110911/show_diag.txt
> >
> > 2. Full kernel log (and previous morning of kernel crash)
> >   http://home.comcast.net/~jpiszcz/20110911/kern.log.txt
> >
> > 3. tw_cli /cX show all
> >   http://home.comcast.net/~jpiszcz/20110911/cfg-fail.txt
> >
> > --
> >
> > Summary (what seems to have occurred, have not done a full analysis yet)
> >
> > 1. 3ware card freaked out due to kernel/RCU/APIC(?) errors
> >
> > 2. Then, the time source went unstable (this happens with weird kernel bugs
> >   on many different hosts, I have seen this over time).
> >
> > 3. Then, on the 3ware card, drives started leaving and being re-inserted
> >   by themselves; XFS went off-line to protect the filesystem due to the
> >   3ware issues
> >
> > --
> >
> > 3ware/RAID-- Interesting errors:
> >
> > I've never seen this before on a 3ware RAID controller, at least from what
> > I can remember and I've been using 3ware cards for many years..
> >
> > p2    CFG-OP-FAIL    -    2.73 TB   SATA  2   -   Hitachi HDS723030AL
> > p3    CFG-OP-FAIL    -    2.73 TB   SATA  3   -   Hitachi HDS723030AL
> >
> > --
> >
> > Kernel/ERRORS:
> >
> > FWIW it all seems to have started during an encoding job around 21:00:
> >
> > Sep 10 18:00:00 p34 kernel: [520427.143054] ixgbe 0000:03:00.0: eth6: NIC Link is Down
> > Sep 10 19:20:04 p34 kernel: [525223.256098] 3w-sas: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
> > Sep 10 20:59:39 p34 kernel: [531189.671361] ------------[ cut here ]------------
> > Sep 10 20:59:39 p34 kernel: [531189.671376] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
> > Sep 10 20:59:39 p34 kernel: [531189.671378] Hardware name: X8DTH-i/6/iF/6F
> > Sep 10 20:59:39 p34 kernel: [531189.671380] NETDEV WATCHDOG: eth1 (igb): transmit queue 5 timed out
> > Sep 10 20:59:39 p34 kernel: [531189.671382] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ub cdc_acm usbserial joydev serio_raw nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
> > Sep 10 20:59:39 p34 kernel: [531189.671414] Pid: 83, comm: ksoftirqd/19 Not tainted 3.1.0-rc4 #1
> > Sep 10 20:59:39 p34 kernel: [531189.671415] Call Trace:
> > Sep 10 20:59:39 p34 kernel: [531189.671424]  [<ffffffff810379ba>] warn_slowpath_common+0x7a/0xb0
> > Sep 10 20:59:39 p34 kernel: [531189.671427]  [<ffffffff81037a91>] warn_slowpath_fmt+0x41/0x50
> > Sep 10 20:59:39 p34 kernel: [531189.671433]  [<ffffffff815d7874>] ? schedule+0x2e4/0x950
> > Sep 10 20:59:39 p34 kernel: [531189.671436]  [<ffffffff814e5aff>] dev_watchdog+0x23f/0x250
> > Sep 10 20:59:39 p34 kernel: [531189.671440]  [<ffffffff81043872>] run_timer_softirq+0xf2/0x220
> > Sep 10 20:59:39 p34 kernel: [531189.671443]  [<ffffffff814e58c0>] ? qdisc_reset+0x50/0x50
> > Sep 10 20:59:39 p34 kernel: [531189.671446]  [<ffffffff8103d208>] __do_softirq+0x98/0x120
> > Sep 10 20:59:39 p34 kernel: [531189.671448]  [<ffffffff8103d345>] run_ksoftirqd+0xb5/0x160
> > Sep 10 20:59:39 p34 kernel: [531189.671454]  [<ffffffff8103d290>] ? __do_softirq+0x120/0x120
> > Sep 10 20:59:39 p34 kernel: [531189.671458]  [<ffffffff810523b7>] kthread+0x87/0x90
> > Sep 10 20:59:39 p34 kernel: [531189.671462]  [<ffffffff815dbdb4>] kernel_thread_helper+0x4/0x10
> > Sep 10 20:59:39 p34 kernel: [531189.671465]  [<ffffffff81052330>] ? kthread_worker_fn+0x130/0x130
> > Sep 10 20:59:39 p34 kernel: [531189.671467]  [<ffffffff815dbdb0>] ? gs_change+0xb/0xb
> > Sep 10 20:59:39 p34 kernel: [531189.671468] ---[ end trace 553dfe731fce91ba ]---
> > Sep 10 20:59:39 p34 kernel: [531189.671478] igb 0000:01:00.1: eth1: Reset adapter
> > Sep 10 20:59:42 p34 kernel: [531192.826058] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> > Sep 10 21:00:00 p34 kernel: [531210.034506] BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:947]
> >
> > --
> >
> > Currently...
> >
> > After all of this happened, I stopped all I/O and all processes on the
> > system, shut down the host, removed the power, and powered it back up.
> > The drives that showed CFG-OP-FAIL before now show as REBUILDING; I am
> > waiting for them to rebuild before doing anything else.
> >
> > Justin.
> >
> >

Justin, please make sure you have pulled this commit:

commit ed2888e906b56769b4ffabb9c577190438aa68b8
Author: Jon Mason <mason@myri.com>
Date:   Thu Sep 8 16:41:18 2011 -0500

    PCI: Remove MRRS modification from MPS setting code
    
    Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
    massive negative ramifications on some devices.  Without knowing which
    devices have this issue, do not modify from the default value when
    walking the PCI-E bus in pcie_bus_safe mode.  Also, make pcie_bus_safe
    the default procedure.
    
    Tested-by: Sven Schnelle <svens@stackframe.org>
    Tested-by: Simon Kirby <sim@hostway.ca>
    Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
    Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
    Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.
    References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
    Signed-off-by: Jon Mason <mason@myri.com>
    Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
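
For readers unfamiliar with the "0 (value of 128Bytes)" wording in the commit
message: the Maximum Read Request Size is a 3-bit field in bits 14:12 of the
PCIe Device Control register, and a field value n means 128 << n bytes. The
sketch below only illustrates that encoding; it is not the kernel code, the
helper names are made up for this example, and 0x2810 is just a sample
register value.

```python
# Illustrative sketch of the PCIe MRRS encoding (not kernel code).
# MRRS lives in bits 14:12 of the PCIe Device Control register.
MRRS_MASK = 0x7000
MRRS_SHIFT = 12
VALID_SIZES = (128, 256, 512, 1024, 2048, 4096)


def mrrs_bytes(devctl):
    """Decode the Maximum Read Request Size, in bytes, from a DevCtl value."""
    field = (devctl & MRRS_MASK) >> MRRS_SHIFT
    if field > 5:
        # Encodings 6 and 7 are reserved.
        raise ValueError("reserved MRRS encoding: %d" % field)
    return 128 << field


def with_mrrs(devctl, size_bytes):
    """Return a DevCtl value with only the MRRS field changed (hypothetical helper)."""
    if size_bytes not in VALID_SIZES:
        raise ValueError("MRRS must be one of %r" % (VALID_SIZES,))
    field = size_bytes.bit_length() - 8  # 128 -> 0, 256 -> 1, ..., 4096 -> 5
    return (devctl & ~MRRS_MASK) | (field << MRRS_SHIFT)


# Sample value only: field 2 in bits 14:12 means 512-byte read requests.
sample = 0x2810
print(mrrs_bytes(sample))                    # 512
print(hex(with_mrrs(sample, 128)))           # 0x810, i.e. MRRS field forced to 0
```

Forcing the field to 0, as the code removed by this commit did, caps every
read request from the device at 128 bytes, whatever the device was set to
before.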


