* Automatically drop caches after mdadm fails a drive out of an array?

From: Andrew Martin @ 2014-02-11 17:11 UTC
To: linux-raid

Hello,

I am running mdadm 3.2.5 on an Ubuntu 12.04 fileserver with a 10-drive RAID6
array (10x1TB). Recently, /dev/sdb started failing:

  Feb 10 13:49:29 myfileserver kernel: [17162220.838256] sas: command 0xffff88010628f600, task 0xffff8800466241c0, timed out: BLK_EH_NOT_HANDLED

Around this same time, a few users attempted to access a directory on this
RAID array over CIFS which they had accessed earlier in the day. This time,
the directory was empty; a local shell on the fileserver confirmed the same.
At around 13:50, mdadm dropped /dev/sdb from the RAID array:

  Feb 10 13:50:31 myfileserver mdadm[1897]: Fail event detected on md device /dev/md2, component device /dev/sdb

However, it was not until around 14:15 that the files reappeared in the
directory. I am guessing that it took this long for the invalid, cached read
to be flushed from the kernel buffer cache.

The concern with the above behavior is that it leaves a potentially large
window of time during which certain data may not be returned correctly from
the RAID array. Is it possible for mdadm to automatically flush the kernel
buffer cache after it drops a drive from the array:

  sync; echo 3 > /proc/sys/vm/drop_caches

This would have caused the data to be re-read at 13:50, leaving a much
smaller window during which invalid data sat in the cache. Or is there a
better way to handle this situation?

Thanks,

Andrew Martin
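For what the question literally asks (run a command whenever mdadm fails a drive), mdadm's monitor mode already provides a hook: the PROGRAM line in mdadm.conf runs an external program on each event. A minimal handler might look like the sketch below; the script path, messages, and the writability guard are assumptions, and whether flushing would actually have helped is a separate question, as the replies below discuss.

```shell
#!/bin/sh
# Hypothetical event handler for mdadm's PROGRAM directive.
# In /etc/mdadm/mdadm.conf:   PROGRAM /usr/local/sbin/md-event
# mdadm --monitor invokes the program as: <event> <md-device> [<component>]
handle_md_event() {
    event="$1"; md="$2"; component="${3:-unknown}"
    case "$event" in
      Fail|FailSpare)
        echo "md event: $event on $md (component $component): flushing caches"
        sync
        # Best-effort: writing drop_caches needs root and may be restricted.
        if [ -w /proc/sys/vm/drop_caches ]; then
            echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true
        fi
        ;;
      *)
        echo "md event: $event on $md: no action"
        ;;
    esac
}

# Simulated invocation mirroring the failure in the log above:
handle_md_event Fail /dev/md2 /dev/sdb
```

The simulated call prints the "flushing caches" line and performs the sync; other event names (SpareActive, RebuildFinished, etc.) fall through to the no-action branch.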
* Re: Automatically drop caches after mdadm fails a drive out of an array?

From: NeilBrown @ 2014-02-11 19:54 UTC
To: Andrew Martin; +Cc: linux-raid

On Tue, 11 Feb 2014 11:11:04 -0600 (CST) Andrew Martin <amartin@xes-inc.com> wrote:

> I am running mdadm 3.2.5 on an Ubuntu 12.04 fileserver with a 10-drive
> RAID6 array (10x1TB). Recently, /dev/sdb started failing: [...]
>
> Around this same time, a few users attempted to access a directory on this
> RAID array over CIFS, which they had previously accessed earlier in the
> day. When they attempted to access it this time, the directory was empty.
> The emptiness of the folder was confirmed via a local shell on the
> fileserver, which reported the same information. At around 13:50, mdadm
> dropped /dev/sdb from the RAID array:

The directory being empty can have nothing to do with the device failure.
md/raid will never let bad data into the page cache in the manner you
suggest.

I cannot explain to you what happened, but I'm absolutely certain it wasn't
something that could be fixed by md dropping any caches.

NeilBrown
* Re: Automatically drop caches after mdadm fails a drive out of an array?

From: Andrew Martin @ 2014-02-11 23:10 UTC
To: NeilBrown; +Cc: linux-raid

Neil,

> The directory being empty can have nothing to do with the device failure.
> md/raid will never let bad data into the page cache in the manner you
> suggest.

Thank you for the clarification. What other possibilities could have
triggered this behavior? I am also using LVM and DRBD on top of the md
device.

Thanks,

Andrew
* Re: Automatically drop caches after mdadm fails a drive out of an array?

From: Stan Hoeppner @ 2014-02-12 0:11 UTC
To: Andrew Martin, NeilBrown; +Cc: linux-raid

On 2/11/2014 5:10 PM, Andrew Martin wrote:

> Thank you for the clarification. What other possibilities could have
> triggered this behavior? I am also using LVM and DRBD on top of the md
> device.

The filesystem told you the directory was empty. Directories and files are
filesystem structures. Why are you talking about all the layers of the stack
below the filesystem, but not the filesystem itself? What filesystem is
this? Are there any FS-related errors in dmesg?

--
Stan
* Re: Automatically drop caches after mdadm fails a drive out of an array?

From: Andrew Martin @ 2014-02-12 14:44 UTC
To: stan; +Cc: NeilBrown, linux-raid

Stan,

> The filesystem told you the directory was empty. Directories and files
> are filesystem structures. Why are you talking about all the layers of
> the stack below the filesystem, but not the filesystem itself? What
> filesystem is this? Are there any FS related errors in dmesg?

It seemed unlikely that the timing of the drive failing out of the raid
array and these filesystem-level problems was coincidental. Yes, there were
also filesystem errors, immediately after md dropped the device. This is an
ext4 filesystem:

13:50:31 mdadm[1897]: Fail event detected on md device /dev/md2, component device /dev/sdb
13:50:31 smbd[3428]: [2014/02/10 13:50:31.226854, 0] smbd/process.c:2439(keepalive_fn)
13:50:31 smbd[13539]: [2014/02/10 13:50:31.227084, 0] smbd/process.c:2439(keepalive_fn)
13:50:31 kernel: [17162282.624858] EXT4-fs error (device drbd0): htree_dirblock_to_tree:587: inode #148638560: block 1189089581: comm smbd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2004033568, rec_len=29801, name_len=99
13:50:31 kernel: [17162282.823733] EXT4-fs error (device drbd0): htree_dirblock_to_tree:587: inode #148638560: block 1189089581: comm smbd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2004033568, rec_len=29801, name_len=99
13:50:31 kernel: [17162282.832886] /build/buildd/linux-3.2.0/drivers/scsi/mvsas/mv_sas.c 1863:port 2 slot 45 rx_desc 3002D has error info8000000080000000.
13:50:31 kernel: [17162282.832920] /build/buildd/linux-3.2.0/drivers/scsi/mvsas/mv_94xx.c 626:command active 30305FFF, slot [2d].
13:50:31 kernel: [17162282.991884] /build/buildd/linux-3.2.0/drivers/scsi/mvsas/mv_sas.c 1863:port 3 slot 52 rx_desc 30034 has error info8000000080000000.
13:50:31 kernel: [17162282.991892] /build/buildd/linux-3.2.0/drivers/scsi/mvsas/mv_94xx.c 626:command active 302FFFFF, slot [34].
13:50:31 kernel: [17162282.992072] /build/buildd/linux-3.2.0/drivers/scsi/mvsas/mv_sas.c 1863:port 2 slot 53 rx_desc 30035 has error info8000000080000000.
...
13:52:03 kernel: [17162374.423961] EXT4-fs error (device drbd0): htree_dirblock_to_tree:587: inode #148638560: block 1189089581: comm smbd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2004033568, rec_len=29801, name_len=99
13:52:04 kernel: [17162375.839851] EXT4-fs error (device drbd0): htree_dirblock_to_tree:587: inode #148638560: block 1189089581: comm smbd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2004033568, rec_len=29801, name_len=99
13:52:08 kernel: [17162380.135391] EXT4-fs error (device drbd0): htree_dirblock_to_tree:587: inode #148638560: block 1189089581: comm smbd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2004033568, rec_len=29801, name_len=99
13:52:13 kernel: [17162385.108358] EXT4-fs error (device drbd0): htree_dirblock_to_tree:587: inode #148638560: block 1189089581: comm smbd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2004033568, rec_len=29801, name_len=99
13:52:17 kernel: [17162388.166515] EXT4-fs error (device drbd0): htree_dirblock_to_tree:587: inode #148638560: block 1189089581: comm smbd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2004033568, rec_len=29801, name_len=99
...

Thanks,

Andrew
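The numbers in that repeated EXT4 error line can be decoded directly: the alignment rule comes from ext4's on-disk directory-entry format, and the arithmetic below simply re-checks what the kernel checked.

```shell
# Values reported in the repeated EXT4-fs error line above:
rec_len=29801
name_len=99

# ext4 requires every directory entry's rec_len to be a multiple of 4:
echo "rec_len % 4 = $(( rec_len % 4 ))"    # non-zero, so the check fails

# rec_len must also cover the 8-byte entry header plus the name,
# rounded up to a multiple of 4:
min_len=$(( (8 + name_len + 3) / 4 * 4 ))
echo "minimum valid rec_len for name_len=$name_len is $min_len"
```

Nonsense values like inode=2004033568 and rec_len=29801 inside a directory block are characteristic of the whole block reading back as garbage, which is consistent with a bad read arriving from a layer below ext4 rather than with ext4 itself corrupting the directory.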
* Re: Automatically drop caches after mdadm fails a drive out of an array?

From: Stan Hoeppner @ 2014-02-13 8:29 UTC
To: Andrew Martin; +Cc: NeilBrown, linux-raid

On 2/12/2014 8:44 AM, Andrew Martin wrote:

> It seemed unlikely that the timing of the failure of the drive out of
> the raid array and these filesystem-level problems was coincidental.
> Yes, there were also filesystem errors, immediately after md dropped the
> device. This is an ext4 filesystem:

Please show all disk/controller errors in close time proximity before the md
fail event.

> 13:50:31 mdadm[1897]: Fail event detected on md device /dev/md2, component device /dev/sdb
> 13:50:31 kernel: [17162282.624858] EXT4-fs error (device drbd0): htree_dirblock_to_tree:587: inode #148638560: block 1189089581: comm smbd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2004033568, rec_len=29801, name_len=99
> 13:50:31 kernel: [17162282.832886] /build/buildd/linux-3.2.0/drivers/scsi/mvsas/mv_sas.c 1863:port 2 slot 45 rx_desc 3002D has error info8000000080000000.
> 13:50:31 kernel: [17162282.991884] /build/buildd/linux-3.2.0/drivers/scsi/mvsas/mv_sas.c 1863:port 3 slot 52 rx_desc 30034 has error info8000000080000000.
> [...]

Does drbd0 sit atop md2?

Also, the Marvell x8 SAS controllers are fine for Windows. But the Linux
driver sucks, and has historically made the HBAs unusable. The most popular
is probably the SuperMicro AOC-SASLP-MV8. In the log above the driver is
showing errors on two SAS ports simultaneously. If not for the presence of
mvsas I'd normally assume dirty power or a bad backplane due to such errors.
The errors should not propagate up the stack to drbd. But the mere presence
of this driver suggests it is part of the problem.

Swap the Marvell SAS card for something decent and I'd bet most of your
problems will disappear.

--
Stan
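For anyone wanting to check whether their own box carries one of these controllers, a grep over saved lspci output is enough. The sample PCI line below is illustrative rather than taken from this thread, and the exact device string varies by card.

```shell
# Return success if saved `lspci` output mentions a Marvell SAS part.
has_marvell_sas() {
    grep -Eiq 'marvell.*(sas|88se)' "$1"
}

# Illustrative sample of `lspci > lspci.txt` on a box with a Marvell HBA
# (the 88SE9485 device string here is an assumption):
cat > lspci.txt <<'EOF'
03:00.0 SAS controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller
EOF

if has_marvell_sas lspci.txt; then
    echo "Marvell SAS HBA present -- also check: lsmod | grep mvsas"
fi
```

A second confirmation is whether the mvsas module is actually bound to the disks, e.g. via `lsmod` or the driver symlinks under /sys/block/sd*/device.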
* Re: Automatically drop caches after mdadm fails a drive out of an array?

From: Andrew Martin @ 2014-02-13 14:57 UTC
To: stan; +Cc: NeilBrown, linux-raid

Stan,

> Also, the Marvell x8 SAS controllers are fine for Windows. But the Linux
> driver sucks, and has historically made the HBAs unusable. [...]
>
> Swap the Marvell SAS card for something decent and I'd bet most of your
> problems will disappear.

You are correct; this is a SuperMicro AOC-SAS2LP-MV8 card. Here is a
complete copy of the error messages in syslog: http://pastebin.com/DJqHDPvH

Note that I added a new, replacement drive to the array at 17:09. In lieu of
Marvell SAS cards, what would you recommend?

Yes, DRBD sits on top of the md/raid array. The complete stack is:

  HDDs <-- md/raid <-- LVM <-- DRBD (drbd0) <-- ext4

Thanks,

Andrew
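With five layers in play, a useful first triage step is attributing each burst of log lines to a layer by its signature string. A rough sketch, run against a few lines reproduced from the excerpt earlier in the thread (LVM and DRBD were silent here, which itself narrows things down):

```shell
# Count syslog lines per stack layer by each layer's signature string.
cat > syslog.sample <<'EOF'
13:50:31 mdadm[1897]: Fail event detected on md device /dev/md2, component device /dev/sdb
13:50:31 kernel: [17162282.624858] EXT4-fs error (device drbd0): htree_dirblock_to_tree:587: bad entry in directory
13:50:31 kernel: [17162282.832886] /build/buildd/linux-3.2.0/drivers/scsi/mvsas/mv_sas.c 1863:port 2 slot 45 rx_desc 3002D has error info8000000080000000.
EOF

for sig in 'mvsas' 'mdadm' 'EXT4-fs error'; do
    printf '%-15s %s\n' "$sig" "$(grep -c "$sig" syslog.sample)"
done
```

Reading bottom-up (HBA driver, then md, then the filesystem) points at the lowest layer reporting trouble; here that is mvsas, matching Stan's diagnosis.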
* Re: Automatically drop caches after mdadm fails a drive out of an array?

From: Mikael Abrahamsson @ 2014-02-13 17:25 UTC
To: Andrew Martin; +Cc: linux-raid

On Thu, 13 Feb 2014, Andrew Martin wrote:

> Note that I added a new, replacement drive to the array at 17:09. In
> lieu of Marvell SAS cards, what would you recommend?

The LSI 2008 is a very well-tested HBA controller; it appears on many
vendors' cards. It works properly with 4TB drives (and larger, I presume).
For instance, the LSI SAS 9211-8i HBA has this chip.

--
Mikael Abrahamsson    email: swmike@swm.pp.se
* Re: Automatically drop caches after mdadm fails a drive out of an array?

From: Stan Hoeppner @ 2014-02-14 4:53 UTC
To: Mikael Abrahamsson, Andrew Martin; +Cc: linux-raid

On 2/13/2014 11:25 AM, Mikael Abrahamsson wrote:

> LSI 2008 is very well tested HBA controller, it exists in many vendors'
> cards. Works properly with 4TB drives (and larger I presume). For
> instance the LSI SAS 9211-8i HBA has this chip.

Agreed, LSI-based HBAs are all pretty reliable. These chips can be found on
motherboards, LSI-branded HBAs, as well as IBM, Intel, and other branded
HBAs. The multi-lane Adaptecs are good as well.

Honestly, I've only really heard of serious problems with the mvsas driver
and the Marvell SAS 88SE64xx based HBAs, most often the Supermicro, but also
the Highpoint 2600 family. There may be others.

--
Stan
* Re: Automatically drop caches after mdadm fails a drive out of an array?

From: Andrew Martin @ 2014-02-14 22:40 UTC
To: stan; +Cc: Mikael Abrahamsson, linux-raid

> Honestly, I've only really heard of serious problems with the mvsas
> driver and the Marvell SAS 88SE64xx based HBAs, most often the
> Supermicro, but also the Highpoint 2600 family. There may be others.

Thanks for the help!

Andrew