* ext4 problems with external RAID array via SAS connection
From: bryan.coleman @ 2011-02-07 18:53 UTC
To: linux-ext4

I am experiencing problems with an ext4 file system.

At first, the drive seemed to work fine. I was primarily copying things to the drive, migrating data from another server. After many GBs of data had seemingly been transferred successfully, I started seeing ext4 errors in /var/log/messages. I then unmounted the drive and ran fsck on it (which took multiple hours to run). I then ls'ed around and one of the areas caused the system to throw ext4 errors again.

I did run memtest through one complete pass and it found no problems.

I then went looking for help on the Fedora forum and it was suggested that I increase my journal size. So I recreated the ext4 partition (with a larger journal) and started the migration process again. After several days of copying, the errors started again.

Here are some of the errors from /var/log/messages:

Feb 2 04:48:30 mdct-00fs kernel: [672021.519914] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22307: 460 blocks in bitmap, 0 in gd
Feb 2 04:48:30 mdct-00fs kernel: [672021.520429] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22308: 1339 blocks in bitmap, 0 in gd
Feb 2 04:48:30 mdct-00fs kernel: [672021.520927] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22309: 3204 blocks in bitmap, 0 in gd
Feb 2 04:48:30 mdct-00fs kernel: [672021.521409] EXT4-fs error (device dm-2): ext4_mb_generate_buddy: EXT4-fs: group 22310: 2117 blocks in bitmap, 0 in gd
Feb 4 05:08:29 mdct-00fs kernel: [845547.724807] EXT4-fs error (device dm-2): ext4_dx_find_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156offset=0(9166848), inode=3143403788, rec_len=80864, name_len=168
Feb 4 05:08:29 mdct-00fs kernel: [845547.733034] EXT4-fs error (device dm-2): ext4_add_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156offset=0(0), inode=3143403788, rec_len=80864, name_len=168
Feb 4 05:19:41 mdct-00fs kernel: [846217.922351] EXT4-fs error (device dm-2): ext4_dx_find_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156offset=0(9166848), inode=3143403788, rec_len=80864, name_len=168
Feb 4 05:19:41 mdct-00fs kernel: [846217.928922] EXT4-fs error (device dm-2): ext4_add_entry: inode #311951364: (comm scp) bad entry in directory: directory entry across blocks - block=1257308156offset=0(0), inode=3143403788, rec_len=80864, name_len=168

Here is my setup:

Promise VTrak RAID array with 12 drives in a RAID 6 configuration (over 5TB).
The Promise array is connected to my server using an external SAS connection.
OS: Fedora 14

One logical volume on the Promise.
One logical volume at the external SAS level.
One logical volume at the OS level.

So from my OS, I see one logical volume depicting one big drive.

I then set up the ext4 file system using the following command:
'mkfs.ext4 -v -m 1 -J size=1024 -E stride=16,stripe-width=160 /dev/vg_storage/lv_storage'

Any thoughts/tips on how to track down the problem?

My thought now is to try using ext3; however, my fear is that I will just run into the same problem with it. Is ext4 production ready?

Thoughts?

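The stride and stripe-width values passed to mkfs.ext4 above can be sanity-checked with a little arithmetic: stride is the RAID chunk size expressed in filesystem blocks, and stripe-width is stride multiplied by the number of data-bearing disks. The sketch below assumes a 64 KiB chunk size and a 4 KiB filesystem block size, neither of which is stated in the thread; the only figure taken from the thread is the 12-drive RAID 6 layout (two drives' worth of parity, so 10 data disks). The function name ext4_raid_geometry is made up for illustration.

```python
def ext4_raid_geometry(chunk_kib, block_kib, total_disks, parity_disks):
    """Return (stride, stripe_width), both measured in filesystem blocks."""
    if chunk_kib % block_kib != 0:
        raise ValueError("chunk size must be a multiple of the block size")
    data_disks = total_disks - parity_disks
    if data_disks <= 0:
        raise ValueError("array must have at least one data disk")
    # stride: filesystem blocks written to one disk before moving to the next
    stride = chunk_kib // block_kib
    # stripe-width: filesystem blocks in one full stripe across all data disks
    stripe_width = stride * data_disks
    return stride, stripe_width


# Assumed values: 64 KiB chunk, 4 KiB block; 12 disks in RAID 6 (2 parity).
stride, stripe_width = ext4_raid_geometry(64, 4, 12, 2)
print(f"-E stride={stride},stripe-width={stripe_width}")
```

Under those assumed sizes the result matches the options used in the command above (stride=16, stripe-width=160). If the array's real chunk size differs, the values would need to be recomputed.
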
* Re: ext4 problems with external RAID array via SAS connection
From: Ted Ts'o @ 2011-02-07 22:54 UTC
To: bryan.coleman; +Cc: linux-ext4

On Mon, Feb 07, 2011 at 01:53:18PM -0500, bryan.coleman@dart.biz wrote:
> I am experiencing problems with an ext4 file system.
>
> At first, the drive seemed to work fine. I was primarily copying things
> to the drive, migrating data from another server. After many GBs of data
> had seemingly been transferred successfully, I started seeing
> ext4 errors in /var/log/messages. I then unmounted the drive and ran fsck
> on it (which took multiple hours to run). I then ls'ed around and one of
> the areas caused the system to throw ext4 errors again.

Did fsck report any errors? Do you have a copy of your fsck transcript?

The errors you've reported do make me suspicious that there's something unstable with your hardware...

- Ted

* Re: ext4 problems with external RAID array via SAS connection
From: bryan.coleman @ 2011-02-08 13:18 UTC
To: linux-ext4, linux-ext4-owner

When I ran fsck after the first bout of failure, it did report a lot of errors. I do not have a copy of that fsck transcript; however, I have not yet run fsck since my second attempt. Is there a method of capturing the transcript that is preferred?

Bryan

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

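On the question of capturing a transcript: one possible approach, sketched below, is to run the command from a small wrapper that copies everything it prints to both the terminal and a log file. This is only an illustration, not something proposed in the thread; the helper name run_with_transcript and the log file name are hypothetical. Because the command's output goes through a pipe, the sketch only suits non-interactive runs.

```python
import subprocess
import sys


def run_with_transcript(cmd, log_path):
    """Run cmd, echoing its combined stdout/stderr to the terminal and log_path.

    Returns the command's exit status.
    """
    with open(log_path, "w", encoding="utf-8", errors="replace") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold error output into the same stream
            text=True,
            errors="replace",  # tolerate undecodable bytes in the output
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()


if __name__ == "__main__":
    # Harmless stand-in command; substitute the real check when ready.
    status = run_with_transcript(
        [sys.executable, "-c", "print('transcript test')"],
        "transcript.log",
    )
    print("exit status:", status)
```

For a real check, the command list could be something like ["fsck.ext4", "-n", "/dev/vg_storage/lv_storage"], where -n opens the filesystem read-only and answers "no" to every question, so nothing is modified while the transcript is collected.
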
* Re: ext4 problems with external RAID array via SAS connection
From: bryan.coleman @ 2011-02-08 14:50 UTC
To: linux-ext4, linux-ext4-owner; +Cc: Ted Ts'o

Well, I attempted to run fsck on the problem drive using the script command to capture the transcript; however, it failed to read a block from the file system. The exception was "fsck.ext4: Attempt to read block from filesystem resulted in short read while trying to open /dev/mapper/vg_storage-lv_storage".

Other messages that are now in /var/log/messages:

Buffer I/O error on device dm-2, logical block 0
lost page write due to I/O error on dm-2
EXT4-fs (dm-2): previous I/O error to superblock detected
Buffer I/O error on device dm-2, logical block 0
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 0
Buffer I/O error on device dm-2, logical block 1
Buffer I/O error on device dm-2, logical block 2
Buffer I/O error on device dm-2, logical block 3
Buffer I/O error on device dm-2, logical block 0
EXT4-fs (dm-2): unable to read superblock

Since it looks like I need to start the process all over again, is there a good way to quickly determine if the problem is hardware related? Is there a preferred method that will stress test the drive and shed more light on what might be going wrong?

Thank you,

Bryan

* Re: ext4 problems with external RAID array via SAS connection
From: Eric Sandeen @ 2011-02-08 15:19 UTC
To: bryan.coleman; +Cc: linux-ext4, Ted Ts'o

On 2/8/11 8:50 AM, bryan.coleman@dart.biz wrote:
> Well, I attempted to run fsck on the problem drive using the script
> command to capture the transcript; however, it failed to read a block from
> the file system. The exception was "fsck.ext4: Attempt to read block from
> filesystem resulted in short read while trying to open
> /dev/mapper/vg_storage-lv_storage".
>
> Other messages that are now in /var/log/messages:
>
> Buffer I/O error on device dm-2, logical block 0
> lost page write due to I/O error on dm-2
> EXT4-fs (dm-2): previous I/O error to superblock detected
> Buffer I/O error on device dm-2, logical block 0
> lost page write due to I/O error on dm-2
> Buffer I/O error on device dm-2, logical block 0
> Buffer I/O error on device dm-2, logical block 1
> Buffer I/O error on device dm-2, logical block 2
> Buffer I/O error on device dm-2, logical block 3
> Buffer I/O error on device dm-2, logical block 0
> EXT4-fs (dm-2): unable to read superblock
>
> Since it looks like I need to start the process all over again, is there a
> good way to quickly determine if the problem is hardware related? Is
> there a preferred method that will stress test the drive and shed more
> light on what might be going wrong?

You have a hardware problem... "Buffer I/O error on device dm-2, logical block 0" means that you failed to read the first block on that device; not something e2fsck can fix, I'm afraid; you'll need to sort out what's wrong with the storage, first.

-Eric

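The point above, that a failed read of block 0 sits below the filesystem and is therefore outside what e2fsck can repair, can be checked directly by trying to read the first block of the device without involving ext4 at all. The following is a minimal sketch of such a probe; the function name probe_first_block is hypothetical, and the 4096-byte block size is an assumption, since the thread does not state the block size.

```python
import os


def probe_first_block(path, block_size=4096):
    """Try to read the first block_size bytes of path.

    Returns (ok, detail), where detail describes what happened.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return False, f"open failed: {exc.strerror}"
    try:
        data = os.read(fd, block_size)
    except OSError as exc:
        # An I/O error from the storage stack surfaces here (typically EIO).
        return False, f"read failed: {exc.strerror} (errno {exc.errno})"
    finally:
        os.close(fd)
    if len(data) < block_size:
        return False, f"short read: got {len(data)} of {block_size} bytes"
    return True, "first block read successfully"


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else sys.argv[0]
    print(target, "->", probe_first_block(target))
```

Pointed at the device from the thread (/dev/mapper/vg_storage-lv_storage, which normally requires root), a failing result would indicate that the problem lies in the storage path rather than in ext4. A passing result proves less, because the data may be served from the page cache rather than from the array.
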
* Re: ext4 problems with external RAID array via SAS connection
From: bryan.coleman @ 2011-02-08 18:50 UTC
To: Eric Sandeen; +Cc: linux-ext4, linux-ext4-owner, Ted Ts'o

I found that the Promise array had been restarted via watchdog timer. I am investigating that avenue via Promise (albeit slow). Note: the watchdog reset the controller days after the initial ext4 messages. I'm not saying they are unrelated. I just want to get all of the facts out there.

I suspect the connection between the server and the Promise got hosed when the controller was reset. When I restarted the server, I could fsck the drive.

The fsck is currently running (and has been for some time now).

It is doing a ton of:
"Inode ######## ref count is 2, should be 1. Fix? yes"
"Unattached inode #########"
"Connect to /lost+found? yes"

I am running fsck in a script session; however, there are currently a ton of the messages above (current log size: 106M).

Do you think it is still hardware? If so, is there a command that would stress it enough to break quickly? What is the best way to isolate hardware problems?

Bryan

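One generic way to approach the stress-test question is a write-and-verify pass: write a known stream of data through the same storage path, force it out to the device, read it back, and compare checksums. The sketch below is only an illustration under stated assumptions, not a method recommended by anyone in the thread; the function name write_and_verify, the default sizes, and the file name are all hypothetical.

```python
import hashlib
import os


def write_and_verify(path, total_mib=64, chunk_mib=4):
    """Write random data to path, read it back, and compare SHA-256 digests.

    Returns True if the data read back matches what was written.
    """
    chunk_size = chunk_mib * 1024 * 1024
    chunks = total_mib // chunk_mib

    written = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(chunks):
            block = os.urandom(chunk_size)
            written.update(block)
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the OS write cache

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            read_back.update(block)

    return written.hexdigest() == read_back.hexdigest()


if __name__ == "__main__":
    test_file = "verify_test.bin"  # place this on the filesystem under test
    print("match" if write_and_verify(test_file) else "MISMATCH")
    os.remove(test_file)
```

A caveat on the design: the read-back may be served from the page cache instead of the array, so a meaningful run would need a data set larger than RAM or dropped caches between the write and read phases. A mismatch, or an I/O error raised during either phase, would point at the storage path rather than at ext4.
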
* Re: ext4 problems with external RAID array via SAS connection
From: Eric Sandeen @ 2011-02-08 20:49 UTC
To: bryan.coleman; +Cc: linux-ext4, Ted Ts'o

On 2/8/11 12:50 PM, bryan.coleman@dart.biz wrote:
> I found that the promise array had been restarted via watchdog timer. I
> am investigating that avenue via promise (albeit slow). Note: the
> watchdog reset the controller days after the initial ext4 messages. I'm
> not saying they are unrelated. I just want to get all of the facts out
> there.
>
> I suspect the connection between the server and the promise got hosed when
> the controller was reset. When I restarted the server, I could fsck the
> drive.
>
> The fsck is currently running (and has been for some time now).
>
> It is doing a ton of "Inode ######## ref count is 2, should be 1. Fix?
> yes" "Unattached inode #########" "Connect to /lost+found? yes"
>
> I am running fsck in a script session; however, there are currently a ton
> of the messages above (current log size: 106M).
>
> Do you think it is still hardware? If so, is there a command that would
> stress it enough to break quickly? What is the best way to isolate
> hardware problems?

My assertion of hardware problems was based solely on the IO error reading block 0. If you can't read the superblock there's not much to be done.

As for what caused the corruption fsck is now finding, that's harder to say; you're essentially getting reports of errors which happened sometime in the past.

My first thought is whether a large cache on the array got lost when it was reset; that could certainly cause filesystem corruption.

-Eric

* Re: ext4 problems with external RAID array via SAS connection
From: bryan.coleman @ 2011-02-09 13:43 UTC
To: Eric Sandeen; +Cc: linux-ext4, Ted Ts'o

The disk was not in the middle of copying when the array went down.

I did get an fsck transcript; however, it is 14M tgz'd. I don't really want to send it to the list, but am willing to send it direct if you (or Ted) are willing.

The fsck said it completed successfully. I kicked off fsck again just to make sure and it reported clean. So I mounted the drive and ls'd around and it started reporting errors. "ls: cannot access 40: Input/output error" Note: 40 is a directory.

So I unmounted again and started an fsck. It reported errors and started on its merry way; however, now it was dealing with many "Multiple-claimed block(s) in inode #########: <many numbers following>"

I am willing to reformat the drive again; however, I would like to know the best way to track down the issue. Any thoughts?

* Re: ext4 problems with external RAID array via SAS connection
From: Ted Ts'o @ 2011-02-09 18:28 UTC
To: bryan.coleman; +Cc: Eric Sandeen, linux-ext4

On Wed, Feb 09, 2011 at 08:43:56AM -0500, bryan.coleman@dart.biz wrote:
>
> The fsck said it completed successfully. I kicked off fsck again just to
> make sure and it reported clean. So I mounted the drive and ls'd around
> and it started reporting errors. "ls: cannot access 40: Input/output
> error" Note: 40 is a directory.

Well, we'd need to look at the kernel messages, but the Input/output error strongly suggests that there are, well, I/O errors talking to your storage array. Which again suggests hardware problems, or device driver bugs, or both.

- Ted

* Re: ext4 problems with external RAID array via SAS connection
From: Ric Wheeler @ 2011-02-09 19:46 UTC
To: Ted Ts'o, bryan.coleman; +Cc: Eric Sandeen, linux-ext4

On 02/09/2011 01:28 PM, Ted Ts'o wrote:
> On Wed, Feb 09, 2011 at 08:43:56AM -0500, bryan.coleman@dart.biz wrote:
>> The fsck said it completed successfully. I kicked off fsck again just to
>> make sure and it reported clean. So I mounted the drive and ls'd around
>> and it started reporting errors. "ls: cannot access 40: Input/output
>> error" Note: 40 is a directory.
> Well, we'd need to look at the kernel messages, but the Input/output
> error strongly suggests that there are, well, I/O errors talking to
> your storage array. Which again suggests hardware problems, or device
> driver bugs, or both.
>
> - Ted

I think that you might want to start testing with a simplified storage config. Try to keep the RAID card in the loop, but use a simpler RAID scheme (single drive? RAID0 or RAID1) and see if the issue persists.

Ric
