* Advice needed with file system corruption
@ 2016-07-14 12:27 Steve Brooks
  2016-07-14 13:05 ` Carlos Maiolino
  2016-08-08 14:11 ` Emmanuel Florac
  0 siblings, 2 replies; 13+ messages in thread
From: Steve Brooks @ 2016-07-14 12:27 UTC (permalink / raw)
  To: xfs

Hi All,

We have a RAID system with file system issues as follows,

50 TB in RAID 6 hosted on an Adaptec 71605 controller using WD4000FYYZ 
drives.

Centos 6.7  2.6.32-642.el6.x86_64   :   xfsprogs-3.1.1-16.el6

While rebuilding a replaced disk, with the file system online and in 
use, the system logs showed multiple entries of;

XFS (sde): Corruption detected. Unmount and run xfs_repair.

[See also at the end of post for a section of XFS related errors in the log]

I unmounted the filesystem and waited for the controller to finish 
rebuilding the array. I then moved the most important data to another 
RAID array on a different server. The data is generated from HPC 
simulations and is not backed up, but it can be regenerated if needed.

The default el6 "xfs_repair" is in "xfsprogs-3.1.1-16.el6". I notice 
that the "elrepo_testing" repository has a much later version of 
"xfsprogs" namely

  xfsprogs.x86_64 4.3.0-1.el6.elrepo

As far as I understand, the userspace tools are backwards compatible, so 
would it be better to use the "4.3" release of "xfsprogs" instead of the 
default "3.1.1" included in the el6 installation?

I ran an "xfs_repair -nv /dev/sde" for both "3.1.1" and "4.3" and both 
completed successfully showing the repairs that would have taken place. 
I can post these if requested.

The "3.1.1"  version of "xfs_repair -n" ran in 1 minute, 32 seconds

The "4.3"     version of "xfs_repair -n" ran in 50 seconds


So my questions are

[1] Which version of "xfs_repair" should I use to make the repair?

[2] Is there anything I should have done differently?


Many thanks for any advice given; it is much appreciated.

Thanks,  Steve



About 20 blocks similar to the following were repeated in the logs.

Jul  8 18:40:17 sraid1v kernel: ffff880dca95b000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jul  8 18:40:17 sraid1v kernel: XFS (sde): Internal error xfs_da_do_buf(2) at line 2136 of file fs/xfs/xfs_da_btree.c. Caller 0xffffffffa0e6e81a
Jul  8 18:40:17 sraid1v kernel:
Jul  8 18:40:17 sraid1v kernel: Pid: 8844, comm: idl Tainted: P           -- ------------    2.6.32-642.el6.x86_64 #1
Jul  8 18:40:17 sraid1v kernel: Call Trace:
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e7b68f>] ? xfs_error_report+0x3f/0x50 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e6e81a>] ? xfs_da_read_buf+0x2a/0x30 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e7b6fe>] ? xfs_corruption_error+0x5e/0x90 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e6e6fc>] ? xfs_da_do_buf+0x6cc/0x770 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e6e81a>] ? xfs_da_read_buf+0x2a/0x30 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffff810154e3>] ? native_sched_clock+0x13/0x80
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e6e81a>] ? xfs_da_read_buf+0x2a/0x30 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e74a21>] ? xfs_dir2_leaf_lookup_int+0x61/0x2c0 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e74a21>] ? xfs_dir2_leaf_lookup_int+0x61/0x2c0 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e74e05>] ? xfs_dir2_leaf_lookup+0x35/0xf0 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e71306>] ? xfs_dir2_isleaf+0x26/0x60 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e71ce4>] ? xfs_dir_lookup+0x174/0x190 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e9ea47>] ? xfs_lookup+0x87/0x110 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0eabd74>] ? xfs_vn_lookup+0x54/0xa0 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffff811a9ca5>] ? do_lookup+0x1a5/0x230
Jul  8 18:40:17 sraid1v kernel: [<ffffffff811aa823>] ? __link_path_walk+0x763/0x1060
Jul  8 18:40:17 sraid1v kernel: [<ffffffff811ab3da>] ? path_walk+0x6a/0xe0
Jul  8 18:40:17 sraid1v kernel: [<ffffffff811ab5eb>] ? filename_lookup+0x6b/0xc0
Jul  8 18:40:17 sraid1v kernel: [<ffffffff8123ac46>] ? security_file_alloc+0x16/0x20
Jul  8 18:40:17 sraid1v kernel: [<ffffffff811acac4>] ? do_filp_open+0x104/0xd20
Jul  8 18:40:17 sraid1v kernel: [<ffffffffa0e9a4fc>] ? _xfs_trans_commit+0x25c/0x310 [xfs]
Jul  8 18:40:17 sraid1v kernel: [<ffffffff812a749a>] ? strncpy_from_user+0x4a/0x90
Jul  8 18:40:17 sraid1v kernel: [<ffffffff811ba252>] ? alloc_fd+0x92/0x160
Jul  8 18:40:17 sraid1v kernel: [<ffffffff81196bd7>] ? do_sys_open+0x67/0x130
Jul  8 18:40:17 sraid1v kernel: [<ffffffff81196ce0>] ? sys_open+0x20/0x30
Jul  8 18:40:17 sraid1v kernel: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
Jul  8 18:40:17 sraid1v kernel: XFS (sde): Corruption detected. Unmount and run xfs_repair
Jul  8 18:40:17 sraid1v kernel: ffff880dca95b000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jul  8 18:40:17 sraid1v kernel: XFS (sde): Internal error xfs_da_do_buf(2) at line 2136 of file fs/xfs/xfs_da_btree.c. Caller 0xffffffffa0e6e81a
Jul  8 18:40:17 sraid1v kernel:
Jul  8 18:40:17 sraid1v kernel: Pid: 8844, comm: idl Tainted: P           -- ------------    2.6.32-642.el6.x86_64 #1








* Re: Advice needed with file system corruption
  2016-07-14 12:27 Advice needed with file system corruption Steve Brooks
@ 2016-07-14 13:05 ` Carlos Maiolino
  2016-07-14 13:57   ` Steve Brooks
  2016-08-08 14:11 ` Emmanuel Florac
  1 sibling, 1 reply; 13+ messages in thread
From: Carlos Maiolino @ 2016-07-14 13:05 UTC (permalink / raw)
  To: Steve Brooks; +Cc: xfs

Hi Steve,

On Thu, Jul 14, 2016 at 01:27:22PM +0100, Steve Brooks wrote:
> 
> The "3.1.1"  version of "xfs_repair -n" ran in 1 minute, 32 seconds
> 
> The "4.3"     version of "xfs_repair -n" ran in 50 seconds
> 

Yes, the later versions are compatible with the old on-disk format, and
they also bring improvements in memory usage, speed, and so on.

> 
> So my questions are
> 
> [1] Which version of "xfs_repair" should I use to make the repair?
> 
> [2] Is there anything I should have done differently?
>

No, just use the latest stable one with the default options, unless you have a
good reason not to use the defaults; from your e-mail, I don't believe you have
one.

The logs you sent below look like a corrupted btree, but xfs_repair should be
able to fix that for you.
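
Roughly, the sequence would be something like the sketch below (the
mount point is just a placeholder, use your real one):

   umount /dev/sde            # if it is still mounted anywhere
   xfs_repair /dev/sde        # real run, default options
   mount /dev/sde /mnt/data   # remount
   dmesg | tail               # check that no new XFS errors show up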

Cheers.


> 
> Many thanks for any advice given it is much appreciated.
> 
> Thanks,  Steve
> 
> 
> 
> [quoted kernel log snipped]

-- 
Carlos


* Re: Advice needed with file system corruption
  2016-07-14 13:05 ` Carlos Maiolino
@ 2016-07-14 13:57   ` Steve Brooks
  2016-07-14 14:17     ` Carlos Maiolino
  0 siblings, 1 reply; 13+ messages in thread
From: Steve Brooks @ 2016-07-14 13:57 UTC (permalink / raw)
  To: xfs

Hi Carlos,

Many thanks again for your good advice. I ran version 4.3 of 
"xfs_repair" as suggested below and it did its job very quickly, in 50 
seconds, exactly as reported in the "No modify mode". Is the time 
reported at the end of the "No modify mode" always a good approximation 
of running in "modify mode"?

Anyway, all is good now, and it looks like any missing files are now in 
the "lost+found" directory.

Steve

On 14/07/16 14:05, Carlos Maiolino wrote:
> Hi steve.
>
> On Thu, Jul 14, 2016 at 01:27:22PM +0100, Steve Brooks wrote:
>> The "3.1.1"  version of "xfs_repair -n" ran in 1 minute, 32 seconds
>>
>> The "4.3"     version of "xfs_repair -n" ran in 50 seconds
>>
> Yes, the later versions are compatible with old disk-format filesystems,
> and they have improvements in memory usage, speed, etc too
>
>> So my questions are
>>
>> [1] Which version of "xfs_repair" should I use to make the repair?
>>
>> [2] Is there anything I should have done differently?
>>
> No, just use the latest stable one, and the defaults, unless you have a good
> reason to not use default options, which by your e-mail I believe you don't have
> one.
>
> The logs you send below, looks from a corrupted btree, but xfs_repair should be
> able to fix that for you.
>
> Cheers.
>
>
>> Many thanks for any advice given it is much appreciated.
>>
>> Thanks,  Steve
>>
>>
>>
>> [quoted kernel log snipped]

-- 
Dr Stephen Brooks

Solar MHD Theory Group
Tel    ::  01334 463735
Fax    ::  01334 463748
---------------------------------------
Mathematical Institute
North Haugh
University of St. Andrews
St Andrews, Fife KY16 9SS
SCOTLAND
---------------------------------------


* Re: Advice needed with file system corruption
  2016-07-14 13:57   ` Steve Brooks
@ 2016-07-14 14:17     ` Carlos Maiolino
  2016-07-14 23:33       ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Carlos Maiolino @ 2016-07-14 14:17 UTC (permalink / raw)
  To: Steve Brooks; +Cc: xfs

On Thu, Jul 14, 2016 at 02:57:25PM +0100, Steve Brooks wrote:
> Hi Carlos,
> 
> Many thanks again, for your good advice. I ran the version 4.3 of
> "xfs_repair" as suggested below and it did it's job very quickly in 50
> seconds exactly as reported in the "No modify mode". Is the time reported at
> the end of the "No modify mode" always a good approximation of running in
> "modify mode" ?

Good to know. But I'm not sure the no-modify mode can be used as a good
approximation of a real run. I would not rely on it, given that xfs_repair
can't predict the amount of time it will need to write all the modifications
to the filesystem's metadata, and it can certainly take much more time,
depending on how corrupted the filesystem is.

> 
> Anyway all is good now and it looks like any missing files are now in the
> "lost+found" directory.
> 
> Steve
> 
> On 14/07/16 14:05, Carlos Maiolino wrote:
> > Hi steve.
> > 
> > On Thu, Jul 14, 2016 at 01:27:22PM +0100, Steve Brooks wrote:
> > > The "3.1.1"  version of "xfs_repair -n" ran in 1 minute, 32 seconds
> > > 
> > > The "4.3"     version of "xfs_repair -n" ran in 50 seconds
> > > 
> > Yes, the later versions are compatible with old disk-format filesystems,
> > and they have improvements in memory usage, speed, etc too
> > 
> > > So my questions are
> > > 
> > > [1] Which version of "xfs_repair" should I use to make the repair?
> > > 
> > > [2] Is there anything I should have done differently?
> > > 
> > No, just use the latest stable one, and the defaults, unless you have a good
> > reason to not use default options, which by your e-mail I believe you don't have
> > one.
> > 
> > The logs you send below, looks from a corrupted btree, but xfs_repair should be
> > able to fix that for you.
> > 
> > Cheers.
> > 
> > 
> > > Many thanks for any advice given it is much appreciated.
> > > 
> > > Thanks,  Steve
> > > 
> > > 
> > > 
> > > [quoted kernel log and signature snipped]

-- 
Carlos


* Re: Advice needed with file system corruption
  2016-07-14 14:17     ` Carlos Maiolino
@ 2016-07-14 23:33       ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-07-14 23:33 UTC (permalink / raw)
  To: Steve Brooks, xfs

On Thu, Jul 14, 2016 at 04:17:51PM +0200, Carlos Maiolino wrote:
> On Thu, Jul 14, 2016 at 02:57:25PM +0100, Steve Brooks wrote:
> > Hi Carlos,
> > 
> > Many thanks again, for your good advice. I ran the version 4.3 of
> > "xfs_repair" as suggested below and it did it's job very quickly in 50
> > seconds exactly as reported in the "No modify mode". Is the time reported at
> > the end of the "No modify mode" always a good approximation of running in
> > "modify mode" ?
> 
> Good to know. But I'm not quite sure if the no modify mode could be used as a
> good approximation of a real run. I would say to not take it as true giving that
> xfs_repair can't predict the amount of time it will need to write all
> modifications it needs to do on the filesystem's metadata, and it will certainly
> can take much more time, depending on how corrupted the filesystem is.

Yup, the no-modify mode skips a couple of steps in repair - phase 5,
which rebuilds the freespace btrees, and phase 7, which corrects link
counts - and so it can only be considered the minimum runtime of a "fix
it all up" run. FWIW, phase 6 can also blow out massively in
runtime if there's significant directory damage that results in
needing to move lots of inodes to the lost+found directory.
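
If you want to see where the time actually goes on the real run,
something along these lines keeps the per-phase output around for
comparison (the log file name is arbitrary):

   time xfs_repair -v /dev/sde 2>&1 | tee xfs_repair.log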

> > > Hi steve.
> > > 
> > > On Thu, Jul 14, 2016 at 01:27:22PM +0100, Steve Brooks wrote:
> > > > The "3.1.1"  version of "xfs_repair -n" ran in 1 minute, 32 seconds
> > > > 
> > > > The "4.3"     version of "xfs_repair -n" ran in 50 seconds

And it's good to know that recent performance improvements show real
world benefits, not just on the badly broken filesystems I used for
testing.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Advice needed with file system corruption
  2016-07-14 12:27 Advice needed with file system corruption Steve Brooks
  2016-07-14 13:05 ` Carlos Maiolino
@ 2016-08-08 14:11 ` Emmanuel Florac
  2016-08-08 15:38   ` Roger Willcocks
  2016-08-08 16:16   ` Steve Brooks
  1 sibling, 2 replies; 13+ messages in thread
From: Emmanuel Florac @ 2016-08-08 14:11 UTC (permalink / raw)
  To: Steve Brooks; +Cc: xfs

On Thu, 14 Jul 2016 13:27:22 +0100,
Steve Brooks <sjb14@st-andrews.ac.uk> wrote:

> We have a RAID system with file system issues as follows,
> 
> 50 TB in RAID 6 hosted on an Adaptec 71605 controller using
> WD4000FYYZ drives.
> 
> Centos 6.7  2.6.32-642.el6.x86_64   :   xfsprogs-3.1.1-16.el6
> 
> While rebuilding a replaced disk, with the file system online and in 
> use, the system logs showed multiple entries of;
> 
> XFS (sde): Corruption detected. Unmount and run xfs_repair.
> 

Late to the game, I just wanted to remark that I've unfortunately
verified many times that write activity during rebuilds on Adaptec RAID
controllers often creates corruption. I've reported that to Adaptec,
but they don't seem to care much...

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: Advice needed with file system corruption
  2016-08-08 14:11 ` Emmanuel Florac
@ 2016-08-08 15:38   ` Roger Willcocks
  2016-08-08 15:44     ` Emmanuel Florac
  2016-08-08 16:16   ` Steve Brooks
  1 sibling, 1 reply; 13+ messages in thread
From: Roger Willcocks @ 2016-08-08 15:38 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: Steve Brooks, xfs

On Mon, 2016-08-08 at 16:11 +0200, Emmanuel Florac wrote:
> On Thu, 14 Jul 2016 13:27:22 +0100,
> Steve Brooks <sjb14@st-andrews.ac.uk> wrote:
> 
> > We have a RAID system with file system issues as follows,
> > 
> > 50 TB in RAID 6 hosted on an Adaptec 71605 controller using
> > WD4000FYYZ drives.
> > 
> > Centos 6.7  2.6.32-642.el6.x86_64   :   xfsprogs-3.1.1-16.el6
> > 
> > While rebuilding a replaced disk, with the file system online and in 
> > use, the system logs showed multiple entries of;
> > 
> > XFS (sde): Corruption detected. Unmount and run xfs_repair.
> > 
> 
> Late to the game, I just wanted to remark that I've unfortunately
> verified many times that write activity during rebuilds on Adaptec RAID
> controllers often creates corruption. I've reported that to Adaptec,
> but they don't seem to care much...
> 

It rather depends on why the disk was replaced in the first place...

--
Roger



* Re: Advice needed with file system corruption
  2016-08-08 15:38   ` Roger Willcocks
@ 2016-08-08 15:44     ` Emmanuel Florac
  2016-08-09  4:02       ` Gim Leong Chin
  0 siblings, 1 reply; 13+ messages in thread
From: Emmanuel Florac @ 2016-08-08 15:44 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: Steve Brooks, xfs

On Mon, 08 Aug 2016 16:38:11 +0100,
Roger Willcocks <roger@filmlight.ltd.uk> wrote:

> > 
> > Late to the game, I just wanted to remark that I've unfortunately
> > verified many times that write activity during rebuilds on Adaptec
> > RAID controllers often creates corruption. I've reported that to
> > Adaptec, but they don't seem to care much...
> >   
> 
> It rather depends on why the disk was replaced in the first place...

Well, given I always use RAID-6, it shouldn't matter; a failed drive
shouldn't alter the array behaviour significantly, as it simply falls
back to sort-of RAID-5 (any bad block read or write should be corrected
on the fly).

It seems that explicitly disabling the individual disk drives' write-back
cache somewhat mitigates the effect.
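
For directly attached SATA drives that is something like the sketch
below; drives hidden behind the RAID controller usually have to be
changed through the controller's own tool (arcconf in Adaptec's case)
rather than hdparm:

   hdparm -W 0 /dev/sdX     # disable the drive's write-back cache
   hdparm -W /dev/sdX       # query the current setting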

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: Advice needed with file system corruption
  2016-08-08 14:11 ` Emmanuel Florac
  2016-08-08 15:38   ` Roger Willcocks
@ 2016-08-08 16:16   ` Steve Brooks
  1 sibling, 0 replies; 13+ messages in thread
From: Steve Brooks @ 2016-08-08 16:16 UTC (permalink / raw)
  To: Emmanuel Florac, Steve Brooks; +Cc: xfs

Hi,

I chose the words "rebuilding a replaced disk" deliberately as I removed 
a disk that (according to adaptec's software) had some "media errors" 
even though the SMART attributes showed there were no "pending sectors" 
or "reallocated sectors", in fact all the SMART attributes were clean.  
As I was also using "RAID 6" I did not expect any issues leaving the 
filesystem online while rebuilding. Previous to this the RAID had been 
running live 24/7 for 0ver three years.
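
For the record, I was checking the attributes roughly like this (a
suitable "-d" option may be needed for smartctl to reach a physical
drive behind the Adaptec controller):

   smartctl -A /dev/sdX | egrep 'Reallocated_Sector|Current_Pending_Sector'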

Steve

  On 08/08/2016 15:11, Emmanuel Florac wrote:
> On Thu, 14 Jul 2016 13:27:22 +0100,
> Steve Brooks <sjb14@st-andrews.ac.uk> wrote:
>
>> We have a RAID system with file system issues as follows,
>>
>> 50 TB in RAID 6 hosted on an Adaptec 71605 controller using
>> WD4000FYYZ drives.
>>
>> Centos 6.7  2.6.32-642.el6.x86_64   :   xfsprogs-3.1.1-16.el6
>>
>> While rebuilding a replaced disk, with the file system online and in
>> use, the system logs showed multiple entries of;
>>
>> XFS (sde): Corruption detected. Unmount and run xfs_repair.
>>
> Late to the game, I just wanted to remark that I've unfortunately
> verified many times that write activity during rebuilds on Adaptec RAID
> controllers often creates corruption. I've reported that to Adaptec,
> but they don't seem to care much...
>


* Re: Advice needed with file system corruption
  2016-08-08 15:44     ` Emmanuel Florac
@ 2016-08-09  4:02       ` Gim Leong Chin
  2016-08-09 12:40         ` Carlos E. R.
  0 siblings, 1 reply; 13+ messages in thread
From: Gim Leong Chin @ 2016-08-09  4:02 UTC (permalink / raw)
  To: Emmanuel Florac, Roger Willcocks; +Cc: Steve Brooks, xfs





      From: Emmanuel Florac <eflorac@intellique.com>
 To: Roger Willcocks <roger@filmlight.ltd.uk> 
Cc: Steve Brooks <sjb14@st-andrews.ac.uk>; xfs@oss.sgi.com
 Sent: Monday, 8 August 2016, 23:44
 Subject: Re: Advice needed with file system corruption
   
On Mon, 08 Aug 2016 16:38:11 +0100,
Roger Willcocks <roger@filmlight.ltd.uk> wrote:

> > 
> > Late to the game, I just wanted to remark that I've unfortunately
> > verified many times that write activity during rebuilds on Adaptec
> > RAID controllers often creates corruption. I've reported that to
> > Adaptec, but they don't seem to care much...
> >  
> 

> It seems like explicitly disabling individual disk drives write-back
> cache somewhat mitigates the effect. 

Drives connected to RAID controllers with battery-backed cache should 
have their caches "disabled" (they are really set to write-through mode 
instead). By the way, I found out in lab testing that 7200 RPM SATA 
drives suffer a big performance loss when doing sequential writes with 
the cache in write-through mode.

  


* Re: Advice needed with file system corruption
  2016-08-09  4:02       ` Gim Leong Chin
@ 2016-08-09 12:40         ` Carlos E. R.
  2016-08-09 15:43           ` Gim Leong Chin
  2016-08-09 21:26           ` Dave Chinner
  0 siblings, 2 replies; 13+ messages in thread
From: Carlos E. R. @ 2016-08-09 12:40 UTC (permalink / raw)
  To: XFS mail list



On 2016-08-09 06:02, Gim Leong Chin wrote:

> Drives connected to RAID controllers with battery backed cache should
> have their caches "disabled" (they are really set to write through mode
> instead).  By the way, I found out in lab testing that 7200 RPM SATA
> drives suffer a big performance loss when doing sequential writes in
> cache write through mode.

If you disable the disk's internal cache, as a consequence you also
disable the disk's internal write optimizations. It has to be much slower
at writing; that seems obvious to me.

-- 
Cheers / Saludos,

		Carlos E. R.
		(from 13.1 x86_64 "Bottle" at Telcontar)



* Re: Advice needed with file system corruption
  2016-08-09 12:40         ` Carlos E. R.
@ 2016-08-09 15:43           ` Gim Leong Chin
  2016-08-09 21:26           ` Dave Chinner
  1 sibling, 0 replies; 13+ messages in thread
From: Gim Leong Chin @ 2016-08-09 15:43 UTC (permalink / raw)
  To: Carlos E. R., XFS mail list



On 2016-08-09 06:02, Gim Leong Chin wrote:



>> Drives connected to RAID controllers with battery backed cache should
>> have their caches "disabled" (they are really set to write through mode
>> instead).  By the way, I found out in lab testing that 7200 RPM SATA
>> drives suffer a big performance loss when doing sequential writes in
>> cache write through mode.
> If you disable the disk internal cache, as a consequence you also
> disable the disk internal write optimizations. It has to be much slower
> at writing. It seems to me obvious.

> -- 
> Cheers / Saludos,

 >       Carlos E. R.
 >       (from 13.1 x86_64 "Bottle" at Telcontar)

The drop in sequential write data rate for 3.5" 7200 RPM SATA drives 
was around 50% (I cannot remember the exact numbers); that was not 
obvious to me.

As a reminder, the drive cache is really set to write-through mode; it 
is not possible to disable the cache, as an application engineer from 
HGST told me. So the drive's internal write optimizations are still 
there, it is just that the IO command is reported as completed only 
when the data has been written to the drive platter.

10k and 15k RPM SAS drives connected to LSI internal RAID (IR) 
controllers have their drive cache "disabled" automatically. I wonder 
how large the data rate drop is compared to drive cache "enabled", 
considering that LSI IR controllers do not have a cache of their own.
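
A test of that sort can be reproduced with something as simple as the
line below (not necessarily the exact tool I used; the device is a
placeholder and the command destroys data on it, so use a scratch
drive):

   dd if=/dev/zero of=/dev/sdX bs=1M count=10000 oflag=direct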

GL
  


* Re: Advice needed with file system corruption
  2016-08-09 12:40         ` Carlos E. R.
  2016-08-09 15:43           ` Gim Leong Chin
@ 2016-08-09 21:26           ` Dave Chinner
  1 sibling, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-08-09 21:26 UTC (permalink / raw)
  To: Carlos E. R.; +Cc: XFS mail list

On Tue, Aug 09, 2016 at 02:40:26PM +0200, Carlos E. R. wrote:
> On 2016-08-09 06:02, Gim Leong Chin wrote:
> 
> > Drives connected to RAID controllers with battery backed cache should
> > have their caches "disabled" (they are really set to write through mode
> > instead).  By the way, I found out in lab testing that 7200 RPM SATA
> > drives suffer a big performance loss when doing sequential writes in
> > cache write through mode.<http://oss.sgi.com/mailman/listinfo/xfs>
> 
> If you disable the disk internal cache, as a consequence you also
> disable the disk internal write optimizations. It has to be much slower
> at writing. It seems to me obvious.

This is why decent HW RAID controllers have a large non volatile
write cache - the caching is done in the controller where it is safe
from power loss, not in the drive where it is unsafe. Write
optimisations happen at the RAID controller level, not at the
individual drive level.

As for 10/15krpm SAS drive performance, they generally are only
slower in microbenchmark situations (e.g. sequential single sector
writes) when the write cache is disabled. These sorts of loads
aren't typically seen in the real world, so for most people there is
little difference in performance on high end enterprise SAS drives
when changing the cache mode....
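
A minimal sketch of such a microbenchmark, assuming fio is available
(parameters are only illustrative, and it writes to the raw device, so
point it at a scratch disk):

   fio --name=seq-512b-sync --filename=/dev/sdX --rw=write --bs=512 \
       --direct=1 --sync=1 --size=64m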

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread

Thread overview: 13 messages
2016-07-14 12:27 Advice needed with file system corruption Steve Brooks
2016-07-14 13:05 ` Carlos Maiolino
2016-07-14 13:57   ` Steve Brooks
2016-07-14 14:17     ` Carlos Maiolino
2016-07-14 23:33       ` Dave Chinner
2016-08-08 14:11 ` Emmanuel Florac
2016-08-08 15:38   ` Roger Willcocks
2016-08-08 15:44     ` Emmanuel Florac
2016-08-09  4:02       ` Gim Leong Chin
2016-08-09 12:40         ` Carlos E. R.
2016-08-09 15:43           ` Gim Leong Chin
2016-08-09 21:26           ` Dave Chinner
2016-08-08 16:16   ` Steve Brooks
