linux-kernel.vger.kernel.org archive mirror
* xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14)
@ 2009-02-17 14:49 Carsten Aulbert
  2009-02-17 17:24 ` Eric Sandeen
  2009-02-18  9:19 ` Dave Chinner
  0 siblings, 2 replies; 8+ messages in thread
From: Carsten Aulbert @ 2009-02-17 14:49 UTC (permalink / raw)
  To: xfs; +Cc: linux-kernel

Hi all,

Within the past few days we have hit many XFS internal errors like these. Are these
errors known (and possibly already fixed)? I checked the commits up to
2.6.27.17 and there does not seem to be anything related to this.

Do you need more information or can I send these nodes into a re-install?

Cheers

Carsten

PS: Please CC me, as I'm currently not on this list.

Feb 16 20:34:49 n0035 kernel: [275873.335916] Filesystem "sda6": XFS internal error xfs_trans_cancel at line 1164 of file fs/xfs/xfs_trans.c.  Caller 0xffffffff8031b704
Feb 16 20:34:49 n0035 kernel: [275873.336262] Pid: 20060, comm: lalapps_Makefak Not tainted 2.6.27.14-nodes #1
Feb 16 20:34:49 n0035 kernel: [275873.336452]
Feb 16 20:34:49 n0035 kernel: [275873.336452] Call Trace:
Feb 16 20:34:49 n0035 kernel: [275873.336814]  [<ffffffff8031b704>] xfs_create+0x3ee/0x47b
Feb 16 20:34:49 n0035 kernel: [275873.337003]  [<ffffffff803167c6>] xfs_trans_cancel+0x55/0xed
Feb 16 20:34:49 n0035 kernel: [275873.337191]  [<ffffffff8031b704>] xfs_create+0x3ee/0x47b
Feb 16 20:34:49 n0035 kernel: [275873.341237]  [<ffffffff802dfc91>] xfs_attr_get+0x90/0xa1
Feb 16 20:34:49 n0035 kernel: [275873.341424]  [<ffffffff803249e4>] xfs_vn_mknod+0x14b/0x214
Feb 16 20:34:49 n0035 kernel: [275873.341612]  [<ffffffff8027c6b1>] vfs_create+0x71/0xb9
Feb 16 20:34:49 n0035 kernel: [275873.341798]  [<ffffffff8027ec64>] do_filp_open+0x1e6/0x733
Feb 16 20:34:49 n0035 kernel: [275873.341990]  [<ffffffff802738c7>] do_sys_open+0x48/0xcc
Feb 16 20:34:49 n0035 kernel: [275873.342177]  [<ffffffff8020bddb>] system_call_fastpath+0x16/0x1b
Feb 16 20:34:49 n0035 kernel: [275873.342366]
Feb 16 20:34:49 n0035 kernel: [275873.342548] xfs_force_shutdown(sda6,0x8) called from line 1165 of file fs/xfs/xfs_trans.c.  Return address = 0xffffffff803167df
Feb 16 20:34:49 n0035 kernel: [275873.343158] xfs_force_shutdown(sda6,0x2) called from line 818 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8030d17b
Feb 16 20:34:49 n0035 kernel: [275873.343508] Filesystem "sda6": Corruption of in-memory data detected.  Shutting down filesystem: sda6
Feb 16 20:34:49 n0035 kernel: [275873.343909] Please umount the filesystem, and rectify the problem(s)
Feb 16 20:34:53 n0035 kernel: [275877.000013] Filesystem "sda6": xfs_log_force: error 5 returned.

************************************

Feb 16 22:01:28 n0260 kernel: [1129250.851451] Filesystem "sda6": xfs_iflush: Bad inode 1176564060 magic number 0x36b5, ptr 0xffff8801a7c06c00
Feb 16 22:01:28 n0260 kernel: [1129250.852132] xfs_force_shutdown(sda6,0x8) called from line 3215 of file fs/xfs/xfs_inode.c.  Return address = 0xffffffff803077b2
Feb 16 22:01:28 n0260 kernel: [1129250.871943] xfs_force_shutdown(sda6,0x2) called from line 818 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8030d17b
Feb 16 22:01:28 n0260 kernel: [1129250.872270] xfs_force_shutdown(sda6,0x2) called from line 818 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8030d17b
Feb 16 22:01:28 n0260 kernel: [1129250.872632] Filesystem "sda6": Corruption of in-memory data detected.  Shutting down filesystem: sda6
Feb 16 22:01:28 n0260 kernel: [1129250.872955] Filesystem "sda6": xfs_inactive: xfs_trans_commit() returned error 5
Feb 16 22:01:28 n0260 kernel: [1129250.873317] Please umount the filesystem, and rectify the problem(s)
Feb 16 22:01:28 n0260 kernel: [1129250.942070] Filesystem "sda6": xfs_log_force: error 5 returned.

*****************************************

Feb 17 07:48:08 n0393 kernel: [1163760.030411] Filesystem "sda6": xfs_iflush: Bad inode 2117711358 magic number 0x8334, ptr 0xffff88006883de00
Feb 17 07:48:08 n0393 kernel: [1163760.030750] xfs_force_shutdown(sda6,0x8) called from line 3215 of file fs/xfs/xfs_inode.c.  Return address = 0xffffffff803077b2
Feb 17 07:48:08 n0393 kernel: [1163760.092840] xfs_force_shutdown(sda6,0x2) called from line 818 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8030d17b
Feb 17 07:48:08 n0393 kernel: [1163760.093211] Filesystem "sda6": Corruption of in-memory data detected.  Shutting down filesystem: sda6
Feb 17 07:48:08 n0393 kernel: [1163760.093645] Please umount the filesystem, and rectify the problem(s)
Feb 17 07:48:09 n0393 kernel: [1163761.100176] Filesystem "sda6": xfs_log_force: error 5 returned.

****************************************

Feb 17 05:57:44 n0463 kernel: [1156816.912129] Filesystem "sda6": XFS internal error xfs_btree_check_sblock at line 307 of file fs/xfs/xfs_btree.c.  Caller 0xffffffff802dd15b
Feb 17 05:57:44 n0463 kernel: [1156816.912566] Pid: 31011, comm: rm Not tainted 2.6.27.14-nodes #1
Feb 17 05:57:44 n0463 kernel: [1156816.912774]
Feb 17 05:57:44 n0463 kernel: [1156816.912775] Call Trace:
Feb 17 05:57:44 n0463 kernel: [1156816.913177]  [<ffffffff802dd15b>] xfs_alloc_lookup+0x135/0x36a
Feb 17 05:57:44 n0463 kernel: [1156816.913385]  [<ffffffff802f15da>] xfs_btree_check_sblock+0xaf/0xbf
Feb 17 05:57:44 n0463 kernel: [1156816.913593]  [<ffffffff802dd15b>] xfs_alloc_lookup+0x135/0x36a
Feb 17 05:57:44 n0463 kernel: [1156816.913802]  [<ffffffff802f110a>] xfs_btree_init_cursor+0x31/0x1a5
Feb 17 05:57:44 n0463 kernel: [1156816.914011]  [<ffffffff802dadeb>] xfs_free_ag_extent+0x351/0x6bb
Feb 17 05:57:44 n0463 kernel: [1156816.914218]  [<ffffffff802dc7df>] xfs_free_extent+0xa9/0xc9
Feb 17 05:57:44 n0463 kernel: [1156816.914426]  [<ffffffff802e52de>] xfs_bmap_finish+0xee/0x15f
Feb 17 05:57:44 n0463 kernel: [1156816.914633]  [<ffffffff80305505>] xfs_itruncate_finish+0x190/0x2ba
Feb 17 05:57:44 n0463 kernel: [1156816.914844]  [<ffffffff8031d49b>] xfs_inactive+0x1de/0x40f
Feb 17 05:57:44 n0463 kernel: [1156816.915050]  [<ffffffff80327957>] xfs_fs_clear_inode+0xb0/0xf3
Feb 17 05:57:44 n0463 kernel: [1156816.915257]  [<ffffffff8028654a>] clear_inode+0x6d/0xc4
Feb 17 05:57:44 n0463 kernel: [1156816.915461]  [<ffffffff8028662a>] generic_delete_inode+0x89/0xe3
Feb 17 05:57:44 n0463 kernel: [1156816.915670]  [<ffffffff8027e278>] do_unlinkat+0xda/0x157
Feb 17 05:57:44 n0463 kernel: [1156816.915878]  [<ffffffff8034a017>] __up_read+0x13/0x8a
Feb 17 05:57:44 n0463 kernel: [1156816.916086]  [<ffffffff804a0c39>] error_exit+0x0/0x51
Feb 17 05:57:44 n0463 kernel: [1156816.916294]  [<ffffffff8020bddb>] system_call_fastpath+0x16/0x1b
Feb 17 05:57:44 n0463 kernel: [1156816.916500]
Feb 17 05:57:44 n0463 kernel: [1156816.916700] xfs_force_shutdown(sda6,0x8) called from line 4269 of file fs/xfs/xfs_bmap.c.  Return address = 0xffffffff802e5313
Feb 17 05:57:44 n0463 kernel: [1156816.920788] xfs_force_shutdown(sda6,0x2) called from line 818 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8030d17b
Feb 17 05:57:44 n0463 kernel: [1156816.921144] xfs_force_shutdown(sda6,0x2) called from line 818 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8030d17b
Feb 17 05:57:44 n0463 kernel: [1156816.921564] Filesystem "sda6": Corruption of in-memory data detected.  Shutting down filesystem: sda6
Feb 17 05:57:44 n0463 kernel: [1156816.921937] Please umount the filesystem, and rectify the problem(s)
Feb 17 05:57:48 n0463 kernel: [1156821.731882] Filesystem "sda6": xfs_log_force: error 5 returned.

**************************************************************************

Feb 17 07:49:47 n0770 kernel: [1162181.596030] Filesystem "sda6": XFS internal error xfs_trans_cancel at line 1164 of file fs/xfs/xfs_trans.c.  Caller 0xffffffff8031b704
Feb 17 07:49:47 n0770 kernel: [1162181.596379] Pid: 15905, comm: lalapps_Makefak Not tainted 2.6.27.14-nodes #1
Feb 17 07:49:47 n0770 kernel: [1162181.596717]
Feb 17 07:49:47 n0770 kernel: [1162181.596717] Call Trace:
Feb 17 07:49:47 n0770 kernel: [1162181.597080]  [<ffffffff8031b704>] xfs_create+0x3ee/0x47b
Feb 17 07:49:47 n0770 kernel: [1162181.597275]  [<ffffffff803167c6>] xfs_trans_cancel+0x55/0xed
Feb 17 07:49:47 n0770 kernel: [1162181.597477]  [<ffffffff8031b704>] xfs_create+0x3ee/0x47b
Feb 17 07:49:47 n0770 kernel: [1162181.597675]  [<ffffffff802dfc91>] xfs_attr_get+0x90/0xa1
Feb 17 07:49:47 n0770 kernel: [1162181.597876]  [<ffffffff803249e4>] xfs_vn_mknod+0x14b/0x214
Feb 17 07:49:47 n0770 kernel: [1162181.598079]  [<ffffffff8027c6b1>] vfs_create+0x71/0xb9
Feb 17 07:49:47 n0770 kernel: [1162181.598278]  [<ffffffff8027ec64>] do_filp_open+0x1e6/0x733
Feb 17 07:49:47 n0770 kernel: [1162181.598477]  [<ffffffff802738c7>] do_sys_open+0x48/0xcc
Feb 17 07:49:47 n0770 kernel: [1162181.598672]  [<ffffffff8020bddb>] system_call_fastpath+0x16/0x1b
Feb 17 07:49:47 n0770 kernel: [1162181.598871]
Feb 17 07:49:47 n0770 kernel: [1162181.599061] xfs_force_shutdown(sda6,0x8) called from line 1165 of file fs/xfs/xfs_trans.c.  Return address = 0xffffffff803167df
Feb 17 07:49:47 n0770 kernel: [1162181.605374] xfs_force_shutdown(sda6,0x2) called from line 818 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8030d17b
Feb 17 07:49:47 n0770 kernel: [1162181.605775] Filesystem "sda6": Corruption of in-memory data detected.  Shutting down filesystem: sda6
Feb 17 07:49:47 n0770 kernel: [1162181.606227] Please umount the filesystem, and rectify the problem(s)
Feb 17 07:49:49 n0770 kernel: [1162182.911372] Filesystem "sda6": xfs_log_force: error 5 returned.

***************************************************************************

Feb 17 00:47:33 n0986 kernel: [1136074.667418] Filesystem "sda6": xfs_iflush: Bad inode 1223835676 magic number 0xe5e5, ptr 0xffff8801763f6c00
Feb 17 00:47:33 n0986 kernel: [1136074.667833] xfs_force_shutdown(sda6,0x8) called from line 3215 of file fs/xfs/xfs_inode.c.  Return address = 0xffffffff803077b2
Feb 17 00:47:33 n0986 kernel: [1136074.678958] xfs_force_shutdown(sda6,0x2) called from line 818 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8030d17b
Feb 17 00:47:33 n0986 kernel: [1136074.679430] Filesystem "sda6": Log I/O Error Detected.  Shutting down filesystem: sda6
Feb 17 00:47:33 n0986 kernel: [1136074.679783] Please umount the filesystem, and rectify the problem(s)
Feb 17 00:47:33 n0986 kernel: [1136075.686618] Filesystem "sda6": xfs_log_force: error 5 returned.

*******************************************************************************

Feb 17 01:18:08 n1003 kernel: [1137811.535542] Filesystem "sda6": XFS internal error xfs_trans_cancel at line 1164 of file fs/xfs/xfs_trans.c.  Caller 0xffffffff8031b704
Feb 17 01:18:08 n1003 kernel: [1137811.535925] Pid: 28435, comm: lalapps_Makefak Not tainted 2.6.27.14-nodes #1
Feb 17 01:18:08 n1003 kernel: [1137811.536287]
Feb 17 01:18:08 n1003 kernel: [1137811.536288] Call Trace:
Feb 17 01:18:08 n1003 kernel: [1137811.536684]  [<ffffffff8031b704>] xfs_create+0x3ee/0x47b
Feb 17 01:18:08 n1003 kernel: [1137811.536890]  [<ffffffff803167c6>] xfs_trans_cancel+0x55/0xed
Feb 17 01:18:08 n1003 kernel: [1137811.537096]  [<ffffffff8031b704>] xfs_create+0x3ee/0x47b
Feb 17 01:18:08 n1003 kernel: [1137811.537301]  [<ffffffff802dfc91>] xfs_attr_get+0x90/0xa1
Feb 17 01:18:08 n1003 kernel: [1137811.537506]  [<ffffffff803249e4>] xfs_vn_mknod+0x14b/0x214
Feb 17 01:18:08 n1003 kernel: [1137811.537711]  [<ffffffff8027c6b1>] vfs_create+0x71/0xb9
Feb 17 01:18:08 n1003 kernel: [1137811.537917]  [<ffffffff8027ec64>] do_filp_open+0x1e6/0x733
Feb 17 01:18:08 n1003 kernel: [1137811.538143]  [<ffffffff802738c7>] do_sys_open+0x48/0xcc
Feb 17 01:18:08 n1003 kernel: [1137811.538361]  [<ffffffff8020bddb>] system_call_fastpath+0x16/0x1b
Feb 17 01:18:08 n1003 kernel: [1137811.538566]
Feb 17 01:18:08 n1003 kernel: [1137811.538764] xfs_force_shutdown(sda6,0x8) called from line 1165 of file fs/xfs/xfs_trans.c.  Return address = 0xffffffff803167df
Feb 17 01:18:08 n1003 kernel: [1137811.622631] Filesystem "sda6": Corruption of in-memory data detected.  Shutting down filesystem: sda6
Feb 17 01:18:08 n1003 kernel: [1137811.622996] Please umount the filesystem, and rectify the problem(s)
Feb 17 01:18:12 n1003 kernel: [1137815.196767] Filesystem "sda6": xfs_log_force: error 5 returned.

*******************************

plus a few more nodes showing the same characteristics 
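
With this many nodes reporting, it can help to tally the xfs_force_shutdown calls by node, calling source file, and shutdown reason. The following is a minimal sketch, not part of the original report. It assumes each kernel message sits on a single unwrapped line, and the flag names are the SHUTDOWN_* constants as recalled from fs/xfs/xfs_mount.h, so treat them as an assumption and check them against your kernel tree. The function name tally_shutdowns is hypothetical.

```python
import re
from collections import Counter

# Assumed mapping of the second xfs_force_shutdown() argument to the
# SHUTDOWN_* names in fs/xfs/xfs_mount.h; verify against your kernel source.
SHUTDOWN_FLAGS = {
    0x1: "META_IO_ERROR",
    0x2: "LOG_IO_ERROR",
    0x4: "FORCE_UMOUNT",
    0x8: "CORRUPT_INCORE",
}

# Matches syslog lines such as:
#   ... n0035 kernel: [275873.342548] xfs_force_shutdown(sda6,0x8)
#       called from line 1165 of file fs/xfs/xfs_trans.c.  Return address = ...
SHUTDOWN_RE = re.compile(
    r"(?P<node>\S+) kernel: \[\s*[\d.]+\] "
    r"xfs_force_shutdown\((?P<dev>\w+),(?P<flags>0x[0-9a-fA-F]+)\) "
    r"called from line (?P<line>\d+) of file (?P<file>\S+?)\.(?:\s|$)"
)


def tally_shutdowns(lines):
    """Count shutdown calls per (node, source file, flag name)."""
    counts = Counter()
    for line in lines:
        m = SHUTDOWN_RE.search(line)
        if m is None:
            continue  # not an xfs_force_shutdown message
        flags = int(m.group("flags"), 16)
        name = SHUTDOWN_FLAGS.get(flags, hex(flags))
        counts[(m.group("node"), m.group("file"), name)] += 1
    return counts


# Three lines taken from the n0035 excerpt above, rejoined onto one line each.
sample = [
    "Feb 16 20:34:49 n0035 kernel: [275873.342548] xfs_force_shutdown(sda6,0x8) "
    "called from line 1165 of file fs/xfs/xfs_trans.c.  Return address = 0xffffffff803167df",
    "Feb 16 20:34:49 n0035 kernel: [275873.343158] xfs_force_shutdown(sda6,0x2) "
    "called from line 818 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8030d17b",
    'Feb 16 20:34:53 n0035 kernel: [275877.000013] Filesystem "sda6": xfs_log_force: error 5 returned.',
]

for key, n in sorted(tally_shutdowns(sample).items()):
    print(key, n)
```

Fed the full syslog of every node, the same tally shows at a glance which code path (xfs_trans.c, xfs_inode.c, xfs_bmap.c) initiated the first shutdown on each machine.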


* Re: xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14)
  2009-02-17 14:49 xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14) Carsten Aulbert
@ 2009-02-17 17:24 ` Eric Sandeen
  2009-02-18  9:19 ` Dave Chinner
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Sandeen @ 2009-02-17 17:24 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: xfs, linux-kernel

Carsten Aulbert wrote:
> Hi all,
> 
> Within the past few days we have hit many XFS internal errors like these. Are these
> errors known (and possibly already fixed)? I checked the commits up to
> 2.6.27.17 and there does not seem to be anything related to this.
> 
> Do you need more information or can I send these nodes into a re-install?
> 
> Cheers
> 
> Carsten
> 
> PS: Please CC me, as I'm currently not on this list.

It'd be worth running xfs_repair on one of these nodes, I think, to see
if you're encountering on-disk corruption, which is what this looks like.

Anything funky about your storage?  Any IO/storage issues before this?
Does going back to .10 make it go away?

-Eric


* Re: xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14)
  2009-02-17 14:49 xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14) Carsten Aulbert
  2009-02-17 17:24 ` Eric Sandeen
@ 2009-02-18  9:19 ` Dave Chinner
  2009-02-18  9:36   ` Carsten Aulbert
  1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2009-02-18  9:19 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: xfs, linux-kernel

On Tue, Feb 17, 2009 at 03:49:16PM +0100, Carsten Aulbert wrote:
> Hi all,
> 
> Within the past few days we have hit many XFS internal errors like these. Are these
> errors known (and possibly already fixed)? I checked the commits up to
> 2.6.27.17 and there does not seem to be anything related to this.

.....

> Feb 16 20:34:49 n0035 kernel: [275873.335916] Filesystem "sda6": XFS internal error xfs_trans_cancel at line 1164 of file fs/xfs/xfs_

A transaction shutdown on create. That implies some kind of ENOSPC
issue.

> Do you need more information or can I send these nodes into a re-install?

More information. Can you get a machine into a state where you can
trigger this condition reproducibly by doing:

	mount filesystem
	touch /mnt/filesystem/some_new_file

If you can get it to that state, and you can provide an xfs_metadump
image of the filesystem when in that state, I can track down the
problem and fix it.
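
A minimal sketch of that sequence, written as a dry run that only prints the commands rather than executing them. The device /dev/sda6 is inferred from the "sda6" in the logs, /mnt/filesystem is the placeholder mount point used above, and the dump path and the helper name repro_commands are hypothetical. xfs_metadump is run against the unmounted device here, which is why the second umount appears.

```python
import shlex


def repro_commands(device, mountpoint, dump_path, test_file="some_new_file"):
    """Return the argv lists for one remount / create / metadump cycle."""
    return [
        ["umount", mountpoint],                      # start from an unmounted fs
        ["mount", "-t", "xfs", device, mountpoint],  # log recovery runs here
        ["touch", f"{mountpoint}/{test_file}"],      # the create that should trip the shutdown
        ["umount", mountpoint],                      # unmount again before dumping
        ["xfs_metadump", device, dump_path],         # metadata-only image of the filesystem
    ]


if __name__ == "__main__":
    # Dry run: print the commands for review instead of executing them.
    for argv in repro_commands("/dev/sda6", "/mnt/filesystem", "/tmp/sda6.metadump"):
        print(shlex.join(argv))
```

The dump is only worth capturing if the touch step actually reproduces the shutdown, so check the kernel log after that step before running the last two commands.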

> Feb 16 22:01:28 n0260 kernel: [1129250.851451] Filesystem "sda6": xfs_iflush: Bad inode 1176564060 magic number 0x36b5, ptr 0xffff8801a7c06c00

However, this implies some kind of memory corruption is occurring.
That is reading the inode out of the buffer before flushing the
in-memory state to disk. This implies someone has scribbled over
page cache pages.
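
To illustrate why those magic numbers point at a scribbled buffer: an on-disk XFS inode starts with the 16-bit big-endian magic 0x494e, the ASCII bytes "IN". The sketch below compares the values reported on the affected nodes against that constant. The helper name describe_magic is hypothetical, and the snippet only demonstrates the comparison; it is not the kernel's code.

```python
import struct

# XFS on-disk inode magic: 0x494e, i.e. the two ASCII bytes "IN".
XFS_DINODE_MAGIC = 0x494E


def describe_magic(value):
    """Show a 16-bit magic as hex and raw bytes, flagged ok or BAD."""
    raw = struct.pack(">H", value)  # big-endian, as stored on disk
    status = "ok" if value == XFS_DINODE_MAGIC else "BAD"
    return f"0x{value:04x} {raw!r} {status}"


# The expected magic, then the values reported on n0260, n0393 and n0986.
for magic in (0x494E, 0x36B5, 0x8334, 0xE5E5):
    print(describe_magic(magic))
```

None of the three reported values resembles "IN", and they differ from node to node, which fits arbitrary data landing in the buffer rather than a consistent on-disk format problem.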


> Feb 17 05:57:44 n0463 kernel: [1156816.912129] Filesystem "sda6": XFS internal error xfs_btree_check_sblock at line 307 of file fs/xfs/xfs_btree.c.  Caller 0xffffffff802dd15b

And that is another buffer that has been scribbled over.
Something is corrupting the page cache, I think. Whether the
original shutdown is caused by the same corruption, I don't
know.

> plus a few more nodes showing the same characteristics 

Hmmmm. Did this show up in 2.6.27.10? Or did it start occurring only
after you upgraded from .10 to .14?

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14)
  2009-02-18  9:19 ` Dave Chinner
@ 2009-02-18  9:36   ` Carsten Aulbert
  2009-02-19  6:19     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Carsten Aulbert @ 2009-02-18  9:36 UTC (permalink / raw)
  To: david; +Cc: xfs, linux-kernel

Hi Dave,

Dave Chinner wrote:
> On Tue, Feb 17, 2009 at 03:49:16PM +0100, Carsten Aulbert wrote:
>> Hi all,
>>
>> Within the past few days we have hit many XFS internal errors like these. Are these
>> errors known (and possibly already fixed)? I checked the commits up to
>> 2.6.27.17 and there does not seem to be anything related to this.
> 
> .....
> 
>> Feb 16 20:34:49 n0035 kernel: [275873.335916] Filesystem "sda6": XFS internal error xfs_trans_cancel at line 1164 of file fs/xfs/xfs_
> 
> A transaction shutdown on create. That implies some kind of ENOSPC
> issue.
> 
>> Do you need more information or can I send these nodes into a re-install?
> 
> More information. Can you get a machine into a state where you can
> trigger this condition reproducibly by doing:
> 
> 	mount filesystem
> 	touch /mnt/filesystem/some_new_file
> 
> If you can get it to that state, and you can provide an xfs_metadump
> image of the filesystem when in that state, I can track down the
> problem and fix it.

I can try doing that on a few machines. Would a metadump help on a
machine where this corruption occurred some time ago and which is still in
this state?

> 
>> Feb 16 22:01:28 n0260 kernel: [1129250.851451] Filesystem "sda6": xfs_iflush: Bad inode 1176564060 magic number 0x36b5, ptr 0xffff8801a7c06c00
> 
> However, this implies some kind of memory corruption is occurring.
> That is reading the inode out of the buffer before flushing the
> in-memory state to disk. This implies someone has scribbled over
> page cache pages.
> 
> 
>> Feb 17 05:57:44 n0463 kernel: [1156816.912129] Filesystem "sda6": XFS internal error xfs_btree_check_sblock at line 307 of file fs/xfs/xfs_btree.c.  Caller 0xffffffff802dd15b
> 
> And that is another buffer that has been scribbled over.
> Something is corrupting the page cache, I think. Whether the
> original shutdown is caused by the same corruption, I don't
> know.
> 

On at least two nodes we ran memtest86+ overnight, and so far it has found no errors.

>> plus a few more nodes showing the same characteristics 
> 
> Hmmmm. Did this show up in 2.6.27.10? Or did it start occurring only
> after you upgraded from .10 to .14?

As far as I can see this only happened after the upgrade about 14 days
ago. What strikes me as odd is that we only had this occurring massively on
Monday and Tuesday this week.

I don't know if a certain access pattern could trigger this somehow.

Cheers

Carsten


* Re: xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14)
  2009-02-18  9:36   ` Carsten Aulbert
@ 2009-02-19  6:19     ` Dave Chinner
  2009-02-19 10:13       ` Carsten Aulbert
  2009-02-19 12:01       ` Nick Piggin
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2009-02-19  6:19 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: xfs, linux-kernel, npiggin

On Wed, Feb 18, 2009 at 10:36:59AM +0100, Carsten Aulbert wrote:
> Dave Chinner wrote:
> > On Tue, Feb 17, 2009 at 03:49:16PM +0100, Carsten Aulbert wrote:
> >> Do you need more information or can I send these nodes into a re-install?
> > 
> > More information. Can you get a machine into a state where you can
> > trigger this condition reproducibly by doing:
> > 
> > 	mount filesystem
> > 	touch /mnt/filesystem/some_new_file
> > 
> > If you can get it to that state, and you can provide an xfs_metadump
> > image of the filesystem when in that state, I can track down the
> > problem and fix it.
> 
> I can try doing that on a few machines. Would a metadump help on a
> machine where this corruption occurred some time ago and which is still in
> this state?

If you unmount the filesystem, mount it again and then touch a new
file and it reports the error again, then yes, a metadump would be
great.

If the error doesn't show up after an unmount/mount, then I
can't use a metadump image to reproduce the problem.

> >> Feb 16 22:01:28 n0260 kernel: [1129250.851451] Filesystem "sda6": xfs_iflush: Bad inode 1176564060 magic number 0x36b5, ptr 0xffff8801a7c06c00
> > 
> > However, this implies some kind of memory corruption is occurring.
> > That is reading the inode out of the buffer before flushing the
> > in-memory state to disk. This implies someone has scribbled over
> > page cache pages.
> > 
> > 
> >> Feb 17 05:57:44 n0463 kernel: [1156816.912129] Filesystem "sda6": XFS internal error xfs_btree_check_sblock at line 307 of file fs/xfs/xfs_btree.c.  Caller 0xffffffff802dd15b
> > 
> > And that is another buffer that has been scribbled over.
> > Something is corrupting the page cache, I think. Whether the
> > original shutdown is caused by the same corruption, I don't
> > know.
> > 
> 
> On at least two nodes we ran memtest86+ overnight, and so far it has found no errors.

I don't think it is hardware related.

> >> plus a few more nodes showing the same characteristics 
> > 
> > Hmmmm. Did this show up in 2.6.27.10? Or did it start occurring only
> > after you upgraded from .10 to .14?
> 
> As far as I can see this only happened after the upgrade about 14 days
> ago. What strikes me as odd is that we only had this occurring massively on
> Monday and Tuesday this week.
> 
> I don't know if a certain access pattern could trigger this somehow.

I suspect so. We've already had XFS trigger one bug in the new
lockless pagecache code, and the fix for that went in 2.6.27.11 -
between the good version and the version that you've been seeing
these memory corruptions on. I'm wondering if that fix exposed or
introduced another bug that you've hit....

Nick?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14)
  2009-02-19  6:19     ` Dave Chinner
@ 2009-02-19 10:13       ` Carsten Aulbert
  2009-02-19 12:01       ` Nick Piggin
  1 sibling, 0 replies; 8+ messages in thread
From: Carsten Aulbert @ 2009-02-19 10:13 UTC (permalink / raw)
  To: xfs, linux-kernel, npiggin

Hi again,

Dave Chinner wrote:
>> I can try doing that on a few machines. Would a metadump help on a
>> machine where this corruption occurred some time ago and which is still in
>> this state?
> 
> If you unmount the filesystem, mount it again and then touch a new
> file and it reports the error again, then yes, a metadump would be
> great.
> 
> If the error doesn't show up after an unmount/mount, then I
> can't use a metadump image to reproduce the problem.
> 

I've done it on two nodes so far, and the result is not good (metadump-wise):
[1344887.778232] Filesystem "sda6": xfs_log_force: error 5 returned.
[1344887.778432] xfs_force_shutdown(sda6,0x1) called from line 420 of file fs/xfs/xfs_rw.c.  Return address = 0xffffffff8031dd7e
[1344889.579836] Filesystem "sda6": xfs_log_force: error 5 returned.
[1344889.580044] Filesystem "sda6": xfs_log_force: error 5 returned.
[1344889.580257] Filesystem "sda6": xfs_log_force: error 5 returned.
[1344889.580450] Filesystem "sda6": xfs_log_force: error 5 returned.
[1344889.624774] Filesystem "sda6": xfs_log_force: error 5 returned.
[1344915.783844] XFS mounting filesystem sda6
[1344915.872333] Starting XFS recovery on filesystem: sda6 (logdev: internal)
[1344917.399834] Ending XFS recovery on filesystem: sda6 (logdev: internal)

After that I can touch/create all files I want on the fs again.

> I suspect so. We've already had XFS trigger one bug in the new
> lockless pagecache code, and the fix for that went in 2.6.27.11 -
> between the good version and the version that you've been seeing
> these memory corruptions on. I'm wondering if that fix exposed or
> introduced another bug that you've hit....
> 
> Nick?

If it was triggered by a user job, it might have been in the kernel for
longer and the user just did not run it for a few weeks.

I'll try to gather more information.

Cheers

Carsten


* Re: xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14)
  2009-02-19  6:19     ` Dave Chinner
  2009-02-19 10:13       ` Carsten Aulbert
@ 2009-02-19 12:01       ` Nick Piggin
  2009-02-19 13:12         ` Carsten Aulbert
  1 sibling, 1 reply; 8+ messages in thread
From: Nick Piggin @ 2009-02-19 12:01 UTC (permalink / raw)
  To: Carsten Aulbert, xfs, linux-kernel

On Thu, Feb 19, 2009 at 05:19:25PM +1100, Dave Chinner wrote:
> On Wed, Feb 18, 2009 at 10:36:59AM +0100, Carsten Aulbert wrote:
> > >> plus a few more nodes showing the same characteristics 
> > > 
> > > Hmmmm. Did this show up in 2.6.27.10? Or did it start occurring only
> > > after you upgraded from .10 to .14?
> > 
> > As far as I can see this only happened after the upgrade about 14 days
> > ago. What strikes me as odd is that we only had this occurring massively on
> > Monday and Tuesday this week.
> > 
> > I don't know if a certain access pattern could trigger this somehow.
> 
> I suspect so. We've already had XFS trigger one bug in the new
> lockless pagecache code, and the fix for that went in 2.6.27.11 -
> between the good version and the version that you've been seeing
> these memory corruptions on. I'm wondering if that fix exposed or
> introduced another bug that you've hit....

Highly unlikely. That fix only introduces constraints on how the
compiler may generate code, so I think it would take a compiler
bug for it to cause a problem.

I wonder how long you've been running with 2.6.27 based kernels
without corruption?


* Re: xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14)
  2009-02-19 12:01       ` Nick Piggin
@ 2009-02-19 13:12         ` Carsten Aulbert
  0 siblings, 0 replies; 8+ messages in thread
From: Carsten Aulbert @ 2009-02-19 13:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: xfs, linux-kernel

Hi Nick,

Nick Piggin wrote:
> I wonder how long you've been running with 2.6.27 based kernels
> without corruption?

Let me see, we upgraded from 2.6.25.9 to 2.6.27.10 in mid January, i.e.
our data base time line is about 6 weeks.

The upgrade to .14 happened about 16 days ago.

HTH

Carsten
