* High mismatch count on root device - how to best handle?
@ 2011-04-25 22:32 Mark Knecht
  2011-04-26  1:30 ` Mark Knecht
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Knecht @ 2011-04-25 22:32 UTC (permalink / raw)
  To: Linux-RAID

I did a drive check today, first time in months, and found I have a
high mismatch count on my RAID1 root device. What's the best way to
handle getting this cleaned up?

1) I'm running some smartctl tests as I write this.

2) Do I just do an

echo repair

to md126 or do I have to boot a rescue CD before I do that?

If you need more info please let me know.

Thanks,
Mark

c2stable ~ # cat /sys/block/md3/md/mismatch_cnt
0
c2stable ~ # cat /sys/block/md6/md/mismatch_cnt
0
c2stable ~ # cat /sys/block/md7/md/mismatch_cnt
0
c2stable ~ # cat /sys/block/md126/md/mismatch_cnt
222336
c2stable ~ # df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md126            51612920  26159408  22831712  54% /
udev                     10240       432      9808   5% /dev
/dev/md7             389183252 144979184 224434676  40% /VirtualMachines
shm                    6151452         0   6151452   0% /dev/shm
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdc6[2] sda6[0] sdb6[1]
      247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdc7[2] sda7[0] sdb7[1] sdd2[3] sde2[4]
      395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdc3[2] sda3[0] sdb3[1] sdd3[3] sde3[4]
      157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sda5[0] sdb5[1]
      52436032 blocks [3/3] [UUU]

unused devices: <none>
c2stable ~ #

* Re: High mismatch count on root device - how to best handle?
  2011-04-25 22:32 High mismatch count on root device - how to best handle? Mark Knecht
@ 2011-04-26  1:30 ` Mark Knecht
  2011-04-26 17:22   ` Mark Knecht
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Knecht @ 2011-04-26  1:30 UTC (permalink / raw)
  To: Linux-RAID

On Mon, Apr 25, 2011 at 3:32 PM, Mark Knecht <markknecht@gmail.com> wrote:
> [trim /]

The smartctl tests that I ran (long) completed without error on all 5
drives in the system:

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2887         -
# 2  Extended offline    Completed without error       00%      2046         -


So, if I understand correctly, the next step I'd do would be something like

echo repair >/sys/block/md126/md/sync_action

but I'm unclear about the need to do this when mdadm seems to think
the RAID is clean:

c2stable ~ # mdadm -D /dev/md126
/dev/md126:
        Version : 0.90
  Creation Time : Tue Apr 13 09:02:34 2010
     Raid Level : raid1
     Array Size : 52436032 (50.01 GiB 53.69 GB)
  Used Dev Size : 52436032 (50.01 GiB 53.69 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 126
    Persistence : Superblock is persistent

    Update Time : Mon Apr 25 18:29:39 2011
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

           UUID : edb0ed65:6e87b20e:dc0d88ba:780ef6a3
         Events : 0.248880

    Number   Major   Minor   RaidDevice State
       0       8        5        0      active sync   /dev/sda5
       1       8       21        1      active sync   /dev/sdb5
       2       8       37        2      active sync   /dev/sdc5
c2stable ~ #

Thanks in advance.

Cheers,
Mark

* Re: High mismatch count on root device - how to best handle?
  2011-04-26  1:30 ` Mark Knecht
@ 2011-04-26 17:22   ` Mark Knecht
  2011-04-26 19:38     ` Phil Turmel
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Knecht @ 2011-04-26 17:22 UTC (permalink / raw)
  To: Linux-RAID

On Mon, Apr 25, 2011 at 6:30 PM, Mark Knecht <markknecht@gmail.com> wrote:
> [trim /]

OK, I don't know exactly what kind of problem I'm looking at here. I ran
the repair, then rebooted. The mismatch count was zero, so it seemed the
repair had worked.

I then used the system for about 4 hours. After 4 hours I did another
check and found the mismatch count had increased.

What I need to get a handle on is:

1) Is this serious? (I assume yes)

2) How do I figure out which drive(s) of the 3 is having trouble?

3) If there is a specific drive, what is the process to swap it out?

Thanks,
Mark


c2stable ~ # cat /sys/block/md126/md/mismatch_cnt
0
c2stable ~ # echo check >/sys/block/md126/md/sync_action
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdb6[1] sdc6[2] sda6[0]
      247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdb7[1] sdc7[2] sde2[4] sda7[0] sdd2[3]
      395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdb3[1] sdc3[2] sda3[0] sdd3[3] sde3[4]
      157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sdb5[1] sda5[0]
      52436032 blocks [3/3] [UUU]
      [>....................]  check =  1.1% (626560/52436032) finish=11.0min speed=78320K/sec

unused devices: <none>
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdb6[1] sdc6[2] sda6[0]
      247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdb7[1] sdc7[2] sde2[4] sda7[0] sdd2[3]
      395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdb3[1] sdc3[2] sda3[0] sdd3[3] sde3[4]
      157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sdb5[1] sda5[0]
      52436032 blocks [3/3] [UUU]
      [===========>.........]  check = 59.6% (31291776/52436032) finish=5.5min speed=63887K/sec

unused devices: <none>
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdb6[1] sdc6[2] sda6[0]
      247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdb7[1] sdc7[2] sde2[4] sda7[0] sdd2[3]
      395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdb3[1] sdc3[2] sda3[0] sdd3[3] sde3[4]
      157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sdb5[1] sda5[0]
      52436032 blocks [3/3] [UUU]

unused devices: <none>
c2stable ~ # cat /sys/block/md126/md/mismatch_cnt
7424
c2stable ~ #

* Re: High mismatch count on root device - how to best handle?
  2011-04-26 17:22   ` Mark Knecht
@ 2011-04-26 19:38     ` Phil Turmel
  2011-04-28  0:38       ` Mark Knecht
  0 siblings, 1 reply; 10+ messages in thread
From: Phil Turmel @ 2011-04-26 19:38 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Linux-RAID

Hi Mark,

On 04/26/2011 01:22 PM, Mark Knecht wrote:
> On Mon, Apr 25, 2011 at 6:30 PM, Mark Knecht <markknecht@gmail.com> wrote:
[trim /]

> OK, I don't know exactly what kind of problem I'm looking at here. I ran
> the repair, then rebooted. The mismatch count was zero, so it seemed the
> repair had worked.
> 
> I then used the system for about 4 hours. After 4 hours I did another
> check and found the mismatch count had increased.
> 
> What I need to get a handle on is:
> 
> 1) Is this serious? (I assume yes)

Maybe.  Are you using a file in this filesystem as swap in lieu of a dedicated swap partition?

I vaguely recall reading that certain code paths in the swap logic can abandon queued writes (due to the data no longer being needed by the VM), such that one or more raid members are left inconsistent.  Supposedly only affecting mirrored raid, and only for swap files/partitions.

I don't know if this was ever fixed, or even if anyone tried to fix it.
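
A quick way to rule that out (just a sanity check):

cat /proc/swaps

Partitions show up there as /dev/... entries; a swap file would show up
as a filesystem path.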

> 2) How do I figure out which drive(s) of the 3 is having trouble?

Don't know.  Failing drives usually give themselves away with warnings in dmesg, and/or ejection from the array.  There's nothing in the kernel or mdadm that'll help here.  You'd have to do a three-way voting comparison of all blocks on the member partitions.
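
A rough sketch of that voting comparison (untested, and only meaningful
with the array idle or stopped, since in-flight writes show up as
differences), using your member names:

cmp -l /dev/sda5 /dev/sdb5 | wc -l
cmp -l /dev/sda5 /dev/sdc5 | wc -l
cmp -l /dev/sdb5 /dev/sdc5 | wc -l

The member that disagrees with the other two in both of its pairings is
the suspect.  Expect a small nonzero count either way, since the v0.90
superblock near the end of each member carries per-device fields.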

> 3) If there is a specific drive, what is the process to swap it out?

mdadm /dev/mdX --fail /dev/sdXY
mdadm /dev/mdX --remove /dev/sdXY

(swap drives)

mdadm /dev/mdX --add /dev/sdZY
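
Then let the recovery run to completion before doing anything else:

cat /proc/mdstat

(mdadm also has a --wait option that blocks until resync/recovery
finishes, if memory serves.)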

HTH,

Phil

* Re: High mismatch count on root device - how to best handle?
  2011-04-26 19:38     ` Phil Turmel
@ 2011-04-28  0:38       ` Mark Knecht
  2011-04-28  1:12         ` Phil Turmel
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Knecht @ 2011-04-28  0:38 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Linux-RAID

On Tue, Apr 26, 2011 at 12:38 PM, Phil Turmel <philip@turmel.org> wrote:
> Hi Mark,
>
> On 04/26/2011 01:22 PM, Mark Knecht wrote:
>> On Mon, Apr 25, 2011 at 6:30 PM, Mark Knecht <markknecht@gmail.com> wrote:
> [trim /]
>
>> OK, I don't know exactly what kind of problem I'm looking at here. I ran
>> the repair, then rebooted. The mismatch count was zero, so it seemed the
>> repair had worked.
>>
>> I then used the system for about 4 hours. After 4 hours I did another
>> check and found the mismatch count had increased.
>>
>> What I need to get a handle on is:
>>
>> 1) Is this serious? (I assume yes)
>
> Maybe.  Are you using a file in this filesystem as swap in lieu of a dedicated swap partition?
>

No, swap is on 3 drives as 3 partitions. The kernel manages swap
directly, and it has nothing to do with the RAID other than sharing a
portion of the same drives.

> I vaguely recall reading that certain code paths in the swap logic can abandon queued writes (due to the data no longer being needed by the VM), such that one or more raid members are left inconsistent.  Supposedly only affecting mirrored raid, and only for swap files/partitions.
>
> I don't know if this was ever fixed, or even if anyone tried to fix it.
>

md126 is the main 3-drive RAID1 root partition of a Gentoo install.
Kernel is 2.6.38-gentoo-r1 and I'm using mdadm-3.1.4.

Nothing I do with echo repair seems to stick very well. For a few
moments mismatch_cnt will read 0, but as far as I can tell, if I do
another echo check I get another high mismatch_cnt again.

One thing I'm wondering about is whether repair even works on a
3-disk RAID1. I've seen threads out there suggesting it doesn't, and
that it possibly just bypasses the actual repair operation.


>> 2) How do I figure out which drive(s) of the 3 is having trouble?
>
> Don't know.  Failing drives usually give themselves away with warnings in dmesg, and/or ejection from the array.  There's nothing in the kernel or mdadm that'll help here.  You'd have to do a three-way voting comparison of all blocks on the member partitions.
>
>> 3) If there is a specific drive, what is the process to swap it out?
>
> mdadm /dev/mdX --fail /dev/sdXY
> mdadm /dev/mdX --remove /dev/sdXY
>
> (swap drives)
>
> mdadm /dev/mdX --add /dev/sdZY
>

I will have some additional things to figure out. There are 5 drives
in this box with a mixture of 3-drive RAID1 & 5-drive RAID6 across
them. If I pull a drive then I need to ensure that all four RAIDs are
going to get rebuilt correctly. I suspect they will, but I'll want to
be careful.

Still, if I haven't a clue which drive is causing the mismatch, then I
cannot know which one to pull.

Thanks for your inputs!

Cheers,
Mark

* Re: High mismatch count on root device - how to best handle?
  2011-04-28  0:38       ` Mark Knecht
@ 2011-04-28  1:12         ` Phil Turmel
  2011-04-28  5:31           ` Wolfgang Denk
  0 siblings, 1 reply; 10+ messages in thread
From: Phil Turmel @ 2011-04-28  1:12 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Linux-RAID

Hi Mark,

On 04/27/2011 08:38 PM, Mark Knecht wrote:
> On Tue, Apr 26, 2011 at 12:38 PM, Phil Turmel <philip@turmel.org> wrote:
>> Hi Mark,
>>
>> On 04/26/2011 01:22 PM, Mark Knecht wrote:
>>> On Mon, Apr 25, 2011 at 6:30 PM, Mark Knecht <markknecht@gmail.com> wrote:
>> [trim /]
>>
>>> OK, I don't know exactly what kind of problem I'm looking at here. I ran
>>> the repair, then rebooted. The mismatch count was zero, so it seemed the
>>> repair had worked.
>>>
>>> I then used the system for about 4 hours. After 4 hours I did another
>>> check and found the mismatch count had increased.
>>>
>>> What I need to get a handle on is:
>>>
>>> 1) Is this serious? (I assume yes)
>>
>> Maybe.  Are you using a file in this filesystem as swap in lieu of a dedicated swap partition?
>>
> 
> No, swap is on 3 drives as 3 partitions. The kernel manages swap
> directly, and it has nothing to do with the RAID other than sharing a
> portion of the same drives.

OK.

>> I vaguely recall reading that certain code paths in the swap logic can abandon queued writes (due to the data no longer being needed by the VM), such that one or more raid members are left inconsistent.  Supposedly only affecting mirrored raid, and only for swap files/partitions.
>>
>> I don't know if this was ever fixed, or even if anyone tried to fix it.
>>
> 
> md126 is the main 3-drive RAID1 root partition of a Gentoo install.
> Kernel is 2.6.38-gentoo-r1 and I'm using mdadm-3.1.4.
> 
> Nothing I do with echo repair seems to stick very well. For a few
> moments mismatch_cnt will read 0, but as far as I can tell, if I do
> another echo check I get another high mismatch_cnt again.

Hmmm.  Since it's not swap, this would make me worry about the hardware.  Have you considered shuffling SATA port assignments to see if a pattern shows up?  Also consider moving some of the drive power load to another power supply.

> One thing I'm wondering about is whether repair even works on a
> 3-disk RAID1. I've seen threads out there suggesting it doesn't, and
> that it possibly just bypasses the actual repair operation.

I've not heard of such.  But repair does *not* mean "pick the matching data and write it to the third"; it means "unconditionally write whatever is in the first mirror to the other two, if there's any mismatch".

One of Neil's links explains why, but it boils down to the lack of knowledge about the order in which writes occurred before the interruption (or bug) that caused the mismatch.

http://neil.brown.name/blog/20100211050355
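
So the practical sequence (a sketch of what I'd try, not a guaranteed
fix) is repair, wait for the sync to finish, then verify with a fresh
check:

echo repair > /sys/block/md126/md/sync_action
# wait for /proc/mdstat to go idle, then:
echo check > /sys/block/md126/md/sync_action
cat /sys/block/md126/md/mismatch_cnt    # 0 here means the repair held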

>>> 2) How do I figure out which drive(s) of the 3 is having trouble?

After messing with the hardware (one change at a time), brute force is next:

Image the drives individually to new drives, or loop-mountable files on other storage, then assemble the copies as degraded arrays, one at a time.  For each, compute file-by-file checksums, and compare to each other and to backups or other external reference (you *do* have backups... ?).
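
Strictly as a sketch (untested; the md device number, mount points and
image paths below are made up, and you need enough spare space for full
images):

# image one member onto other storage
dd if=/dev/sda5 of=/mnt/spare/sda5.img bs=1M conv=noerror

# bring the copy up read-only as a degraded single-member array
losetup /dev/loop0 /mnt/spare/sda5.img
mdadm --assemble --run --readonly /dev/md90 /dev/loop0
mount -o ro /dev/md90 /mnt/copy

# file-by-file checksums, for comparison with the other members' images
( cd /mnt/copy && find . -type f -exec md5sum {} + ) | sort > /root/sda5.md5

Repeat per member, then diff the three checksum lists.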

Others may have better suggestions.  I've never had to do this.

>> Don't know.  Failing drives usually give themselves away with warnings in dmesg, and/or ejection from the array.  There's nothing in the kernel or mdadm that'll help here.  You'd have to do a three-way voting comparison of all blocks on the member partitions.
>>
>>> 3) If there is a specific drive, what is the process to swap it out?
>>
>> mdadm /dev/mdX --fail /dev/sdXY
>> mdadm /dev/mdX --remove /dev/sdXY
>>
>> (swap drives)
>>
>> mdadm /dev/mdX --add /dev/sdZY
>>
> 
> I will have some additional things to figure out. There are 5 drives
> in this box with a mixture of 3-drive RAID1 & 5-drive RAID6 across
> them. If I pull a drive then I need to ensure that all four RAIDs are
> going to get rebuilt correctly. I suspect they will, but I'll want to
> be careful.

Paranoia is good.  Backups are better.

> Still, if I haven't a clue which drive is causing the mismatch, then I
> cannot know which one to pull.

This is really a filesystem problem, and efforts are underway to solve it, in Btrfs in particular, although that is still experimental.  I'm looking forward to that status changing.

> Thanks for your inputs!
> 
> Cheers,
> Mark

Regards,

Phil

* Re: High mismatch count on root device - how to best handle?
  2011-04-28  1:12         ` Phil Turmel
@ 2011-04-28  5:31           ` Wolfgang Denk
  2011-04-30 22:51             ` Mark Knecht
  0 siblings, 1 reply; 10+ messages in thread
From: Wolfgang Denk @ 2011-04-28  5:31 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Mark Knecht, Linux-RAID

Dear Phil Turmel,

In message <4DB8BEFE.3020009@turmel.org> you wrote:
> 
> Hmmm.  Since it's not swap, this would make me worry about the hardware.  Have you considered shuffling SATA port assignments to see if a pattern shows up?  Also consider moving some of the drive power load to another power supply.

I do not think this is hardware related.  I see this behaviour on at
least 5 different machines which show no other problems except for the
mismatch count in the RAID 1 partitions that hold the /boot partition.

> > Still, if I haven't a clue which drive is causing the mismatch, then I
> > cannot know which one to pull.
> 
> This is really a filesystem problem, and efforts are underway to solve it, in Btrfs in particular, although that is still experimental.  I'm looking forward to that status changing.

It will probably take some time until grub can boot from a RAID1 array
with btrfs on it...

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Q:  What's a light-year?
A:  One-third less calories than a regular year.

* Re: High mismatch count on root device - how to best handle?
  2011-04-28  5:31           ` Wolfgang Denk
@ 2011-04-30 22:51             ` Mark Knecht
  2011-05-01 14:50               ` Brad Campbell
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Knecht @ 2011-04-30 22:51 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Phil Turmel, Linux-RAID

On Wed, Apr 27, 2011 at 10:31 PM, Wolfgang Denk <wd@denx.de> wrote:
> Dear Phil Turmel,
>
> In message <4DB8BEFE.3020009@turmel.org> you wrote:
>>
>> Hmmm.  Since it's not swap, this would make me worry about the hardware.  Have you considered shuffling SATA port assignments to see if a pattern shows up?  Also consider moving some of the drive power load to another power supply.
>
> I do not think this is hardware related.  I see this behaviour on at
> least 5 different machines which show no other problems except for the
>> mismatch count in the RAID 1 partitions that hold the /boot partition.
>

That's interesting to me.

In my case /boot is on its own partition and not mounted when I do
the test. There was, however, a RAID6 mounted at the time I was doing
the repair on the RAID1. I tried unmounting it, but that didn't change
anything. I still got the same sort of error count.

>> > Still, if I haven't a clue which drive is causing the mismatch, then I
>> > cannot know which one to pull.
>>
>> This is really a filesystem problem, and efforts are underway to solve it, in Btrfs in particular, although that is still experimental.  I'm looking forward to that status changing.
>
> It will probably take some time until grub can boot from a RAID1 array
> with btrfs on it...
>
> Best regards,
>
> Wolfgang Denk

Thanks for the info.

Cheers,
Mark

* Re: High mismatch count on root device - how to best handle?
  2011-04-30 22:51             ` Mark Knecht
@ 2011-05-01 14:50               ` Brad Campbell
  2011-05-01 17:13                 ` Mark Knecht
  0 siblings, 1 reply; 10+ messages in thread
From: Brad Campbell @ 2011-05-01 14:50 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Wolfgang Denk, Phil Turmel, Linux-RAID

On 01/05/11 06:51, Mark Knecht wrote:
> On Wed, Apr 27, 2011 at 10:31 PM, Wolfgang Denk<wd@denx.de>  wrote:
>> Dear Phil Turmel,
>>
>> In message<4DB8BEFE.3020009@turmel.org>  you wrote:
>>>
>>> Hmmm.  Since it's not swap, this would make me worry about the hardware.  Have you considered shuffling SATA port assignments to see if a pattern shows up?  Also consider moving some of the drive power load to another power supply.
>>
>> I do not think this is hardware related.  I see this behaviour on at
>> least 5 different machines which show no other problems except for the
>>> mismatch count in the RAID 1 partitions that hold the /boot partition.

root@srv:/server# grep . /sys/block/md?/md/mismatch_cnt
/sys/block/md0/md/mismatch_cnt:0
/sys/block/md1/md/mismatch_cnt:128
/sys/block/md2/md/mismatch_cnt:0
/sys/block/md3/md/mismatch_cnt:41728
/sys/block/md4/md/mismatch_cnt:896
/sys/block/md5/md/mismatch_cnt:0
/sys/block/md6/md/mismatch_cnt:4352

root@srv:/server# cat /proc/mdstat | grep md[1346]
md6 : active raid1 sdp6[0] sdo6[1]
md4 : active raid1 sdp3[0] sdo3[1]
md3 : active raid1 sdp2[0] sdo2[1]
md1 : active raid1 sdp1[0] sdo1[1]

root@srv:/server# cat /etc/fstab | grep md[1346]
/dev/md1        /               ext4    errors=remount-ro,commit=30,noatime     0 1
/dev/md6        /raid0          ext4    defaults,commit=30,noatime      0 1
/dev/md4        /home           ext4    defaults,commit=30,noatime      0 1
/dev/md3        none            swap    sw

I see them _all_ the time on RAID1's.

When I configured this system _years_ ago, I did not know any better, so 
it's a bit of a mish-mash.

The machine also has a 10 drive RAID-6 and a 3 drive RAID-5. The only 
time I've seen mismatches on those is when I used a SIL 3132 controller 
and it trashed the RAID-6.

Brad

* Re: High mismatch count on root device - how to best handle?
  2011-05-01 14:50               ` Brad Campbell
@ 2011-05-01 17:13                 ` Mark Knecht
  0 siblings, 0 replies; 10+ messages in thread
From: Mark Knecht @ 2011-05-01 17:13 UTC (permalink / raw)
  To: Brad Campbell; +Cc: Wolfgang Denk, Phil Turmel, Linux-RAID

On Sun, May 1, 2011 at 7:50 AM, Brad Campbell <lists2009@fnarfbargle.com> wrote:
> [trim /]

Brad,
   Thanks very much for sharing the data and your experiences with
this. I'll simply ignore it on the RAID1's from now on.

Cheers,
Mark

end of thread

Thread overview: 10+ messages
2011-04-25 22:32 High mismatch count on root device - how to best handle? Mark Knecht
2011-04-26  1:30 ` Mark Knecht
2011-04-26 17:22   ` Mark Knecht
2011-04-26 19:38     ` Phil Turmel
2011-04-28  0:38       ` Mark Knecht
2011-04-28  1:12         ` Phil Turmel
2011-04-28  5:31           ` Wolfgang Denk
2011-04-30 22:51             ` Mark Knecht
2011-05-01 14:50               ` Brad Campbell
2011-05-01 17:13                 ` Mark Knecht
