* ext4 problems with external RAID array via SAS connection
@ 2011-02-07 18:53 bryan.coleman
  2011-02-07 22:54 ` Ted Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: bryan.coleman @ 2011-02-07 18:53 UTC (permalink / raw)
  To: linux-ext4

I am experiencing problems with an ext4 file system.

At first, the drive seemed to work fine.  I was primarily copying things
to the drive, migrating data from another server.  After many GBs of data
had seemingly been transferred successfully, I started seeing ext4 errors
in /var/log/messages.  I then unmounted the drive and ran fsck on it
(which took multiple hours to run).  I then ls'ed around, and one of the
areas caused the system to throw ext4 errors again.

I did run memtest through one complete pass and it found no problems.

I then went looking for help on the Fedora forum, and it was suggested
that I increase my journal size.  So I recreated the ext4 partition (with
a larger journal) and started the migration process again.  After several
days of copying, the errors started again.


Here are some of the errors from /var/log/messages:

Feb 2 04:48:30 mdct-00fs kernel: [672021.519914] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22307: 460 blocks in bitmap, 0 in gd
Feb 2 04:48:30 mdct-00fs kernel: [672021.520429] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22308: 1339 blocks in bitmap, 0 in gd
Feb 2 04:48:30 mdct-00fs kernel: [672021.520927] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22309: 3204 blocks in bitmap, 0 in gd
Feb 2 04:48:30 mdct-00fs kernel: [672021.521409] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22310: 2117 blocks in bitmap, 0 in gd
Feb 4 05:08:29 mdct-00fs kernel: [845547.724807] EXT4-fs error (device dm-2): ext4_dx_find_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156offset=0(9166848), inode=3143403788, rec_len=80864, name_len=168
Feb 4 05:08:29 mdct-00fs kernel: [845547.733034] EXT4-fs error (device dm-2): ext4_add_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156offset=0(0), inode=3143403788, rec_len=80864, name_len=168
Feb 4 05:19:41 mdct-00fs kernel: [846217.922351] EXT4-fs error (device dm-2): ext4_dx_find_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156offset=0(9166848), inode=3143403788, rec_len=80864, name_len=168
Feb 4 05:19:41 mdct-00fs kernel: [846217.928922] EXT4-fs error (device dm-2): ext4_add_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156offset=0(0), inode=3143403788, rec_len=80864, name_len=168
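
Errors like these can be tallied per device and per reporting kernel
function, which makes it easier to see whether one kind of failure
dominates.  The following is a minimal sketch, assuming each kernel message
occupies a single line in the real /var/log/messages (mail clients tend to
wrap excerpts).  The function name summarize_ext4_errors is made up for
illustration.

```shell
# Count "EXT4-fs error" lines per device and reporting kernel function.
# Reads syslog text on stdin and prints rows of: count device function
summarize_ext4_errors() {
    # Keep only the device name and the function name from matching lines.
    sed -n 's/.*EXT4-fs error (device \([^)]*\)): \([A-Za-z0-9_]*\):.*/\1 \2/p' |
        sort | uniq -c | awk '{ print $1, $2, $3 }'
}

# Possible use (path as mentioned in the message above):
#   summarize_ext4_errors < /var/log/messages
```

Lines that do not contain an "EXT4-fs error (device ...)" prefix are
ignored, so the whole log can be fed in unfiltered.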


Here is my setup:

        Promise VTrak RAID array with 12 drives in a RAID 6 configuration
        (over 5TB).
        The Promise array is connected to my server using an external SAS
        connection.
        OS: Fedora 14

        One logical volume on the Promise.
        One logical volume at the external SAS level.
        One logical volume at the OS level.
        So from my OS, I see one logical volume depicting one big drive.

        I then set up the ext4 file system using the following command:
        'mkfs.ext4 -v -m 1 -J size=1024 -E stride=16,stripe-width=160 /dev/vg_storage/lv_storage'
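
The stride and stripe-width values in that command can be cross-checked
against the RAID geometry.  The sketch below assumes a 4 KiB filesystem
block size and a 64 KiB RAID chunk size; neither is stated in the message,
but together with the 10 data-bearing disks of a 12-drive RAID 6 they
reproduce the values used.  The function name ext4_stripe_params is
hypothetical.

```shell
# Derive mkfs.ext4 -E stride/stripe-width values from RAID geometry.
# Arguments: filesystem block size in KiB, RAID chunk size in KiB,
# and the number of data-bearing disks (total disks minus parity disks).
ext4_stripe_params() {
    block_kib=$1
    chunk_kib=$2
    data_disks=$3
    stride=$((chunk_kib / block_kib))        # filesystem blocks per chunk
    stripe_width=$((stride * data_disks))    # filesystem blocks per full stripe
    echo "stride=${stride},stripe-width=${stripe_width}"
}

# A 12-disk RAID 6 leaves 10 data disks.
# The 4 KiB block and 64 KiB chunk sizes are assumptions.
ext4_stripe_params 4 64 $((12 - 2))    # prints stride=16,stripe-width=160
```

If the array's real chunk size differs from the assumed 64 KiB, the same
arithmetic gives the values that would match it.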


Any thoughts/tips on how to track down the problem?

My thought now is to try using ext3; however, my fear is that I will just
run into the same problem with it.  Is ext4 production ready?


Thoughts?


* Re: ext4 problems with external RAID array via SAS connection
  2011-02-07 18:53 ext4 problems with external RAID array via SAS connection bryan.coleman
@ 2011-02-07 22:54 ` Ted Ts'o
  2011-02-08 13:18   ` bryan.coleman
  0 siblings, 1 reply; 10+ messages in thread
From: Ted Ts'o @ 2011-02-07 22:54 UTC (permalink / raw)
  To: bryan.coleman; +Cc: linux-ext4

On Mon, Feb 07, 2011 at 01:53:18PM -0500, bryan.coleman@dart.biz wrote:
> I am experiencing problems with an ext4 file system.
> 
> At first, the drive seemed to work fine.  I was primarily copying things 
> to the drive migrating data from another server.  After many GBs of data, 
> that seemingly successfully were done being transferred, I started seeing 
> ext4 errors in /var/log/messages.  I then unmounted the drive and ran fsck 
> on it (which took multiple hours to run).  I then ls'ed around and one of 
> the areas caused the system to again throw ext4 errors.

Did fsck report any errors?  Do you have a copy of your fsck
transcript?

The errors you've reported do make me suspicious that there's
something unstable with your hardware...

					- Ted


* Re: ext4 problems with external RAID array via SAS connection
  2011-02-07 22:54 ` Ted Ts'o
@ 2011-02-08 13:18   ` bryan.coleman
  2011-02-08 14:50     ` bryan.coleman
  0 siblings, 1 reply; 10+ messages in thread
From: bryan.coleman @ 2011-02-08 13:18 UTC (permalink / raw)
  To: linux-ext4, linux-ext4-owner

When I ran fsck after the first bout of failure, it did report a lot of 
errors.  I do not have a copy of that fsck transcript; however, I have not 
yet run fsck since my second attempt.  Is there a method of capturing the 
transcript that is preferred?

Bryan
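
One common way to keep a transcript is to pipe the checker's output through
tee (the script utility is another option).  The wrapper below is only a
sketch; capture_transcript is a made-up name, and the e2fsck invocation in
the comment assumes the filesystem is unmounted and uses -n so that nothing
is modified.

```shell
# Run a command and keep a copy of everything it prints (stdout and stderr).
# Usage: capture_transcript LOGFILE COMMAND [ARGS...]
# Note: the return status is that of tee, not of the wrapped command.
capture_transcript() {
    logfile=$1
    shift
    "$@" 2>&1 | tee "$logfile"
}

# Hypothetical invocation: forced (-f), read-only (-n) check of the volume
# named earlier in the thread.
#   capture_transcript fsck-transcript.log e2fsck -fn /dev/vg_storage/lv_storage
```

Dropping -n and letting e2fsck repair the filesystem works the same way,
and the log then records every fix that was applied.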




* Re: ext4 problems with external RAID array via SAS connection
  2011-02-08 13:18   ` bryan.coleman
@ 2011-02-08 14:50     ` bryan.coleman
  2011-02-08 15:19       ` Eric Sandeen
  0 siblings, 1 reply; 10+ messages in thread
From: bryan.coleman @ 2011-02-08 14:50 UTC (permalink / raw)
  To: linux-ext4, linux-ext4-owner; +Cc: Ted Ts'o

Well, I attempted to run fsck on the problem drive using the script 
command to capture the transcript; however, it failed to read a block from 
the file system.  The exception was "fsck.ext4: Attempt to read block from 
filesystem resulted in short read while trying to open 
/dev/mapper/vg_storage-lv_storage". 

Other messages that are now in /var/log/messages:

Buffer I/O error on device dm-2, logical block 0
lost page write due to I/O error on dm-2
EXT4-fs (dm-2): previous I/O error to superblock detected
Buffer I/O error on device dm-2, logical block 0
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 0
Buffer I/O error on device dm-2, logical block 1
Buffer I/O error on device dm-2, logical block 2
Buffer I/O error on device dm-2, logical block 3
Buffer I/O error on device dm-2, logical block 0
EXT4-fs (dm-2): unable to read superblock


Since it looks like I need to start the process all over again, is there a 
good way to quickly determine if the problem is hardware related?  Is 
there a preferred method that will stress test the drive and shed more 
light on what might be going wrong?

Thank you,

Bryan
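
One rough way to exercise the storage path is to write files of random
data, record their checksums, and read everything back for comparison.
This is a sketch under stated assumptions, not an established test
procedure: the name write_and_verify is invented, sha256sum from GNU
coreutils is assumed to be available, and the read-back may be served from
the page cache unless the filesystem is remounted (or caches are dropped)
between the write and the verification.

```shell
# Write COUNT files of SIZE_MIB MiB of random data under DIR, record their
# SHA-256 checksums, then re-read every file and compare.
# Returns non-zero if writing fails or any file fails verification.
write_and_verify() {
    dir=$1
    count=$2
    size_mib=$3
    manifest="$dir/manifest.sha256"
    : > "$manifest"
    i=0
    while [ "$i" -lt "$count" ]; do
        f="$dir/pattern.$i"
        dd if=/dev/urandom of="$f" bs=1048576 count="$size_mib" 2>/dev/null || return 1
        # Store relative names so the manifest stays valid after a remount.
        (cd "$dir" && sha256sum "pattern.$i") >> "$manifest" || return 1
        i=$((i + 1))
    done
    sync    # flush dirty pages to the device before re-reading
    (cd "$dir" && sha256sum -c --quiet manifest.sha256)
}

# Hypothetical use on a mounted test filesystem:
#   write_and_verify /mnt/storage/stress 100 1024
```

The manifest can be re-checked later with "sha256sum -c manifest.sha256"
from inside the directory, which also catches corruption that appears only
after the data has sat on the array for a while.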




* Re: ext4 problems with external RAID array via SAS connection
  2011-02-08 14:50     ` bryan.coleman
@ 2011-02-08 15:19       ` Eric Sandeen
  2011-02-08 18:50         ` bryan.coleman
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2011-02-08 15:19 UTC (permalink / raw)
  To: bryan.coleman; +Cc: linux-ext4, Ted Ts'o

On 2/8/11 8:50 AM, bryan.coleman@dart.biz wrote:
> Well, I attempted to run fsck on the problem drive using the script 
> command to capture the transcript; however, it failed to read a block from 
> the file system.  The exception was "fsck.ext4: Attempt to read block from 
> filesystem resulted in short read while trying to open 
> /dev/mapper/vg_storage-lv_storage". 
> 
> Other messages that are now in /var/log/messages:
> 
> Buffer I/O error on device dm-2, logical block 0
> lost page write due to I/O error on dm-2
> EXT4-fs (dm-2): previous I/O error to superblock detected
> Buffer I/O error on device dm-2, logical block 0
> lost page write due to I/O error on dm-2
> Buffer I/O error on device dm-2, logical block 0
> Buffer I/O error on device dm-2, logical block 1
> Buffer I/O error on device dm-2, logical block 2
> Buffer I/O error on device dm-2, logical block 3
> Buffer I/O error on device dm-2, logical block 0
> EXT4-fs (dm-2): unable to read superblock
> 
> 
> Since it looks like I need to start the process all over again, is there a 
> good way to quickly determine if the problem is hardware related?  Is 
> there a preferred method that will stress test the drive and shed more 
> light on what might be going wrong?

You have a hardware problem... "Buffer I/O error on device dm-2, logical block 0"
means that you failed to read the first block on that device; not something
e2fsck can fix, I'm afraid; you'll need to sort out what's wrong with the storage,
first.

-Eric
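
Whether the first block of a device can be read at all is easy to check
with dd, independent of any filesystem tool.  The function below is a
minimal sketch; can_read_first_block is a made-up name, the 4096-byte
default is an assumption, and the read goes through the page cache unless
iflag=direct (GNU dd) is added.

```shell
# Try to read the first BLOCK_SIZE bytes of a device (or file) and report
# whether the read succeeded.
# Usage: can_read_first_block TARGET [BLOCK_SIZE]
can_read_first_block() {
    target=$1
    block_size=${2:-4096}
    if dd if="$target" of=/dev/null bs="$block_size" count=1 2>/dev/null; then
        echo "readable"
    else
        echo "unreadable"
        return 1
    fi
}

# Hypothetical use against the device from the error messages:
#   can_read_first_block /dev/mapper/vg_storage-lv_storage
```

A failure here points below the filesystem layer, since no ext4 code is
involved in a raw read of the block device.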


* Re: ext4 problems with external RAID array via SAS connection
  2011-02-08 15:19       ` Eric Sandeen
@ 2011-02-08 18:50         ` bryan.coleman
  2011-02-08 20:49           ` Eric Sandeen
  0 siblings, 1 reply; 10+ messages in thread
From: bryan.coleman @ 2011-02-08 18:50 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-ext4, linux-ext4-owner, Ted Ts'o

I found that the Promise array had been restarted via watchdog timer.  I
am investigating that avenue via Promise (albeit slowly).  Note: the
watchdog reset the controller days after the initial ext4 messages.  I'm
not saying they are unrelated.  I just want to get all of the facts out
there.

I suspect the connection between the server and the Promise got hosed when
the controller was reset.  After I restarted the server, I could fsck the
drive.

The fsck is currently running (and has been for some time now). 

It is doing a ton of "Inode ######## ref count is 2, should be 1.  Fix? 
yes"  "Unattached inode #########"  "Connect to /lost+found? yes"

I am running fsck in a script session; however, there are currently a ton 
of the messages above (current log size: 106M).

Do you think it is still hardware?  If so, is there a command that would 
stress it enough to break quickly?  What is the best way to isolate 
hardware problems?

Bryan
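
A transcript of that size is easier to read once the inode and block
numbers are collapsed, so that identical message types can be counted.
The following sketch assumes one fsck message per line in the captured
log; summarize_fsck_messages is an invented name.

```shell
# Collapse every run of digits to "N", then count identical message shapes.
# Reads an fsck transcript on stdin; prints "count message", most frequent first.
summarize_fsck_messages() {
    sed 's/[0-9][0-9]*/N/g' | sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Hypothetical use on a saved transcript:
#   summarize_fsck_messages < fsck-transcript.log | head -20
```

Because all numbers are replaced, the ref counts in messages such as
"ref count is 2, should be 1" are folded together as well; narrowing the
sed expression would keep them apart if that distinction matters.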




* Re: ext4 problems with external RAID array via SAS connection
  2011-02-08 18:50         ` bryan.coleman
@ 2011-02-08 20:49           ` Eric Sandeen
  2011-02-09 13:43             ` bryan.coleman
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2011-02-08 20:49 UTC (permalink / raw)
  To: bryan.coleman; +Cc: linux-ext4, Ted Ts'o

On 2/8/11 12:50 PM, bryan.coleman@dart.biz wrote:
> I found that the Promise array had been restarted via watchdog timer.  I
> am investigating that avenue via Promise (albeit slowly).  Note: the
> watchdog reset the controller days after the initial ext4 messages.  I'm
> not saying they are unrelated.  I just want to get all of the facts out
> there.
>
> I suspect the connection between the server and the Promise got hosed when
> the controller was reset.  After I restarted the server, I could fsck the
> drive.
> 
> The fsck is currently running (and has been for some time now). 
> 
> It is doing a ton of "Inode ######## ref count is 2, should be 1.  Fix? 
> yes"  "Unattached inode #########"  "Connect to /lost+found? yes"
> 
> I am running fsck in a script session; however, there are currently a ton 
> of the messages above (current log size: 106M).
> 
> Do you think it is still hardware?  If so, is there a command that would 
> stress it enough to break quickly?  What is the best way to isolate 
> hardware problems?

My assertion of hardware problems was based solely on the I/O error reading
block 0.  If you can't read the superblock, there's not much to be done.

As for what caused the corruption fsck is now finding, that's harder to say;
you're essentially getting reports of errors that happened sometime in the
past.

My first thought is whether a large cache on the array got lost when it was
reset; that could certainly cause filesystem corruption.

-Eric


* Re: ext4 problems with external RAID array via SAS connection
  2011-02-08 20:49           ` Eric Sandeen
@ 2011-02-09 13:43             ` bryan.coleman
  2011-02-09 18:28               ` Ted Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: bryan.coleman @ 2011-02-09 13:43 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-ext4, Ted Ts'o

The disk was not in the middle of copying when the array went down.

I did get an fsck transcript; however, it is 14M tgz'd.  I don't really 
want to send it to the list, but am willing to send it direct if you (or 
Ted) are willing.

The fsck said it completed successfully.  I kicked off fsck again just to 
make sure and it reported clean.  So I mounted the drive and ls'd around 
and it started reporting errors.  "ls: cannot access 40: Input/output 
error"  Note: 40 is a directory.

So I unmounted again and started an fsck.  It reported errors and started
on its merry way; however, now it was dealing with many "Multiple-claimed
block(s) in inode #########: <many numbers following>" messages.

I am willing to reformat the drive again; however, I would like to know
the best way to track down the issue.

Any thoughts? 




* Re: ext4 problems with external RAID array via SAS connection
  2011-02-09 13:43             ` bryan.coleman
@ 2011-02-09 18:28               ` Ted Ts'o
  2011-02-09 19:46                 ` Ric Wheeler
  0 siblings, 1 reply; 10+ messages in thread
From: Ted Ts'o @ 2011-02-09 18:28 UTC (permalink / raw)
  To: bryan.coleman; +Cc: Eric Sandeen, linux-ext4

On Wed, Feb 09, 2011 at 08:43:56AM -0500, bryan.coleman@dart.biz wrote:
> 
> The fsck said it completed successfully.  I kicked off fsck again just to 
> make sure and it reported clean.  So I mounted the drive and ls'd around 
> and it started reporting errors.  "ls: cannot access 40: Input/output 
> error"  Note: 40 is a directory.

Well, we'd need to look at the kernel messages, but the Input/output
error strongly suggests that there are, well, I/O errors talking to
your storage array.  Which again suggests hardware problems, or device
driver bugs, or both.

						- Ted


* Re: ext4 problems with external RAID array via SAS connection
  2011-02-09 18:28               ` Ted Ts'o
@ 2011-02-09 19:46                 ` Ric Wheeler
  0 siblings, 0 replies; 10+ messages in thread
From: Ric Wheeler @ 2011-02-09 19:46 UTC (permalink / raw)
  To: Ted Ts'o, bryan.coleman; +Cc: Eric Sandeen, linux-ext4

On 02/09/2011 01:28 PM, Ted Ts'o wrote:
> On Wed, Feb 09, 2011 at 08:43:56AM -0500, bryan.coleman@dart.biz wrote:
>> The fsck said it completed successfully.  I kicked off fsck again just to
>> make sure and it reported clean.  So I mounted the drive and ls'd around
>> and it started reporting errors.  "ls: cannot access 40: Input/output
>> error"  Note: 40 is a directory.
> Well, we'd need to look at the kernel messages, but the Input/output
> error strongly suggests that there are, well, I/O errors talking to
> your storage array.  Which again suggests hardware problems, or device
> driver bugs, or both.
>
> 						- Ted

I think that you might want to start testing with a simplified storage config.
Try keeping the RAID card in the loop, but use a simpler RAID scheme (a single
drive? RAID0 or RAID1?) and see if the issue persists.

Ric

