* Extremely High mismatch_cnt on RAID1 system
From: Dennis Grant @ 2014-10-04 13:46 UTC (permalink / raw)
  To: linux-raid

Hello all.

I recently updated an Ubuntu 12.04 LTS system to a 14.04 LTS system.
This upgrade did not go particularly smoothly, and I've been dealing
with various types of weirdness ever since.

This system has 4 hard drives in it, comprising 3 RAID 1 arrays. One
pair of drives contains two arrays - one for / (md0), and one for
/home (md2). The other array is a pair of older drives (the original
array from when the machine was first built) which I have mounted as
/home/backups (md3)

Some investigation has me thinking that perhaps the arrays had gotten
out of sync, or perhaps there was even a failing drive. So I've been
checking and repairing arrays to see what happens.

All drives passed their long SMART tests with no issues.

Even after multiple checks, repairs, and rebuilds, the arrays on the
bigger drives (/ and /home) are showing insanely high mismatch_cnt
values. This has me concerned.

Here are the details:

$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5]
[raid4] [raid10]
md3 : active raid1 sdb4[3] sda4[2]
      660213568 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda2[2] sdb2[3]
      307068736 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdc1[0] sdd1[1]
      2930133824 blocks super 1.2 [2/2] [UU]

unused devices: <none>

$ sudo mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sun Jun  9 19:19:32 2013
     Raid Level : raid1
     Array Size : 307068736 (292.84 GiB 314.44 GB)
  Used Dev Size : 307068736 (292.84 GiB 314.44 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sat Oct  4 10:18:58 2014
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : karzai:0  (local to host karzai)
           UUID : ff878502:6d567c41:4dbd32f7:7a1122be
         Events : 256772

    Number   Major   Minor   RaidDevice State
       3       8       18        0      active sync   /dev/sdb2
       2       8        2        1      active sync   /dev/sda2

$ sudo mdadm -D /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Sat Jun  8 04:09:23 2013
     Raid Level : raid1
     Array Size : 2930133824 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 2930133824 (2794.39 GiB 3000.46 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sat Oct  4 10:19:33 2014
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : karzai:2  (local to host karzai)
           UUID : dd5af8bf:02a9500e:0b73f986:72a2ba43
         Events : 459

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1

$ sudo mdadm -D /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Sun Jun  9 20:05:28 2013
     Raid Level : raid1
     Array Size : 660213568 (629.63 GiB 676.06 GB)
  Used Dev Size : 660213568 (629.63 GiB 676.06 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sat Oct  4 09:52:47 2014
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : karzai:3  (local to host karzai)
           UUID : e60332c8:e5568487:df2f3b61:94fd2b2c
         Events : 14971

    Number   Major   Minor   RaidDevice State
       3       8       20        0      active sync   /dev/sdb4
       2       8        4        1      active sync   /dev/sda4

$ grep . /sys/block/md?/md/mismatch_cnt
/sys/block/md0/md/mismatch_cnt:3148032
/sys/block/md2/md/mismatch_cnt:20217856
/sys/block/md3/md/mismatch_cnt:0

An fsck of md2 and md3 came back clean.

Should I be concerned? The fact that the high counts are on arrays that
share the same physical drives has me very nervous, notwithstanding
the passed SMART tests.

Thank you.


* Re: Extremely High mismatch_cnt on RAID1 system
From: Ethan Wilson @ 2014-10-07 13:14 UTC (permalink / raw)
  To: linux-raid

On 04/10/2014 15:46, Dennis Grant wrote:
> Hello all.
>
> ...
>
> Even after multiple checks, repairs, and rebuilds, the arrays on the
> bigger drives (/ and /home) are showing insanely high mismatch_cnt
> values. This has me concerned.
>

Dennis,
since nobody more knowledgeable replied, I will try.

Some mismatches on raid1 have always been around, and nobody has ever
deeply investigated what causes them, nor whether they occur in
unallocated filesystem space or in real live data. It seems that if LVM
sits between raid1 and the filesystem they no longer happen, but again
nobody is really sure why.

Recently, some changes in the raid1 resync algorithm introduced bugs
that could generate additional mismatches, but if you haven't had any
resyncs I am not sure those bugs and their fixes are relevant to you.
In any case, the fixes are listed here:
https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.14.20
(search for "raid").

You might want to upgrade to kernel 3.14.20, which is probably newer
than what your Ubuntu LTS currently ships, then repair the arrays and
see whether the counts grow again.
Note that you need to run a repair, not a check:
echo repair > /sys/block/md0/md/sync_action
After the *next* "check" the mismatch_cnt should be 0 (not immediately
after the "repair", because at that point it counts the number of
mismatches that were repaired).
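Concretely, a minimal sketch of that repair-then-check cycle (the `scrub`
helper name and the 5-second polling interval are my own invention, not
anything standard):

```shell
#!/bin/sh
# Sketch: start an md sync action ("repair" or "check"), wait for it to
# finish, then print mismatch_cnt. md sets sync_action back to "idle"
# when the pass completes.
scrub() {
    mddir="$1"    # sysfs md directory, e.g. /sys/block/md0/md
    action="$2"   # "repair" or "check"
    echo "$action" > "$mddir/sync_action"
    # Poll until the kernel reports the pass has finished.
    while [ "$(cat "$mddir/sync_action")" != "idle" ]; do
        sleep 5
    done
    cat "$mddir/mismatch_cnt"
}

# As root, on a real array:
#   scrub /sys/block/md0/md repair   # count = mismatches repaired
#   scrub /sys/block/md0/md check    # count should now be 0
```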

I'd say that mismatches are in general pretty worrisome: they shouldn't
happen, and they likely indicate corruption. So if the above doesn't
work, i.e. the mismatches grow again, report back to the list and
somebody may be able to help track this problem down further.

Regards
EW



* Re: Extremely High mismatch_cnt on RAID1 system
From: Wilson, Jonathan @ 2014-10-07 13:58 UTC (permalink / raw)
  To: Ethan Wilson; +Cc: linux-raid

On Tue, 2014-10-07 at 15:14 +0200, Ethan Wilson wrote:
> On 04/10/2014 15:46, Dennis Grant wrote:
> > Hello all.
> >
> > ...
> >
> > Even after multiple checks, repairs, and rebuilds, the arrays on the
> > bigger drives (/ and /home) are showing insanely high mismatch_cnt
> > values. This has me concerned.
> >
> 
> Dennis,
> since nobody more knowledgeable replied, I will try.
> 
> Some mismatches on raid1 have been there since always, and nobody ever 
> deeply investigated what they were caused by, nor if they happen on 
> unallocated filesystem space or on real live data. It seems that if LVM 
> is between raid1 and the filesystem then they don't happen anymore, but 
> again nobody is really sure of why.

Would mismatches happen if "assume clean" was used, either for a good
reason (say, to force a dropped disk back in) or in error? While the
data on the secondary disk(s) becomes self-correcting as new
writes/updates are performed to all disks, should the "primary" drive
fail, the second one would contain out-of-sync data wherever it had
never been (re)written. Although which is "primary" and which is
"secondary" is, I guess, not really a good description.

I would have thought that doing a dd to a _FILE_ that fills up the file
system would also reduce the mismatch count, as it would force
"correct(ing)" data onto all the disks, barring reserved file system
blocks/areas.

NOTE: dd to a FILE on the file system, NOT to the raid device; the
latter will DESTROY ALL data!
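For what it's worth, a sketch of that fill test (the function name and
the optional size cap are mine; keep the cap handy when trying this on
a system you care about):

```shell
#!/bin/sh
# Sketch: overwrite the free space of a MOUNTED filesystem with one big
# file, flush it to disk, then delete it. Writing through the filesystem
# rewrites both raid1 mirrors consistently; writing to the raid device
# itself would destroy the filesystem.
fill_free_space() {
    mnt="$1"   # mount point, e.g. /home
    cap="$2"   # optional cap in MiB, for a dry run
    # With no cap, dd stops with "No space left on device" when the
    # filesystem is full; that is the expected end state here.
    dd if=/dev/zero of="$mnt/fillfile" bs=1M ${cap:+count="$cap"} 2>/dev/null || true
    sync
    rm -f "$mnt/fillfile"
}
```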


* Re: Extremely High mismatch_cnt on RAID1 system
From: Ethan Wilson @ 2014-10-07 14:23 UTC (permalink / raw)
  To: linux-raid

On 07/10/2014 15:58, Wilson, Jonathan wrote:
> Would mismatches happen if an "assume clean" was used

--assume-clean during creation, then yes, until the first "repair" and 
then "check".


> , either for a good
> reason (say to forced a dropped disk back in

I don't think it is possible to force the addition of a disk with 
--assume-clean, I think that's an option only for --create

> ) or in error, so that while
> the data on the secondary disk(s) becomes self correcting as new
> writes/updates are performed, to all disks, should the "primary" drive
> fail the second one would contain out of sync data, where it had never
> been (re)written. Although which is "primary" and which is "secondary"
> is I guess not really a good description.
>
> I would have thought that doing a DD to a _FILE_ that fills up the file
> system would also reduce the mismatch count

Yes, except theoretically for raid5, which operates in read-modify-write
(RMW) mode: RMW propagates existing parity errors when partial stripes
are written. But a large file is written sequentially, so full stripes
will probably be written; in that case, yes again.

> , as it would force
> "correct(ing)" data to all the disks, baring reserved file system
> blocks/areas.

Indeed, yours is a good way to determine whether the mismatches map to
existing files or to unused space on the filesystem.
Once all the free space has been overwritten with a file, if
mismatch_cnt is still nonzero, the mismatches are evidently located in
files, which means data corruption.
If Dennis tells us that the mismatch count still grows after the kernel
upgrade and a raid repair (the repair itself will bring it to 0), we
can suggest this test to check for data corruption.

EW


* Re: Extremely High mismatch_cnt on RAID1 system
From: Brassow Jonathan @ 2014-10-09  3:17 UTC (permalink / raw)
  To: Ethan Wilson; +Cc: linux-raid


On Oct 7, 2014, at 8:14 AM, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:

> On 04/10/2014 15:46, Dennis Grant wrote:
>> Hello all.
>> 
>> ...
>> 
>> Even after multiple checks, repairs, and rebuilds, the arrays on the
>> bigger drives (/ and /home) are showing insanely high mismatch_cnt
>> values. This has me concerned.
>> 
> 
> Dennis,
> since nobody more knowledgeable replied, I will try.
> 
> Some mismatches on raid1 have been there since always, and nobody ever deeply investigated what they were caused by, nor if they happen on unallocated filesystem space or on real live data. It seems that if LVM is between raid1 and the filesystem then they don't happen anymore, but again nobody is really sure of why.
> 
> Recently some changes in the raid1 resync algorithm introduced some bugs that could possibly generate additional mismatches, but if you haven't had resyncs then I am not so sure if such bugs and their fixes are relevant. However the fixes are here:
> https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.14.20
> search for "raid".
> 
> You might want to upgrade to kernel 3.14.20, which is probably not what your Ubuntu LTS has currently, then repair the arrays, then see if they grow again.
> Note that you need to do repair and not check:
> echo repair > /sys/block/md0/md/sync_action
> at the next "check" the mismatch_cnt should be 0 (not just after "repair", because that would count the number of mismatches that have been repaired).
> 
> I'd say that mismatches in general are pretty worrisome, they shouldn't happen, they are likely to indicate corruption, so if what I said doesn't work, e.g. mismatches grow again, try to report it again on the list and somebody might be able to help further to track down this problem.

The mismatches count can be incremented during operations other than check and repair.  I believe its behavior also varies between RAID personalities.  However, if you check the ‘last_sync_action’ and see that it was a “check” operation, you are probably safe to assume that the mismatch count has been computed correctly.

Note the following commit:
commit c4a39551451666229b4ea5e8aae8ca0131d00665
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Tue Jun 25 01:23:59 2013 -0500

    MD: Remember the last sync operation that was performed

    This patch adds a field to the mddev structure to track the last
    sync operation that was performed.  This is especially useful when
    it comes to what is recorded in mismatch_cnt in sysfs.  If the
    last operation was "data-check", then it reports the number of
    descrepancies found by the user-initiated check.  If it was a
    "repair" operation, then it is reporting the number of
    descrepancies repaired.  etc.

    Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
    Signed-off-by: NeilBrown <neilb@suse.de>

Relatedly, LVM makes use of the MD RAID personalities to provide its RAID capabilities.  It does this by accessing MD through a thin device-mapper target called "dm-raid" - not to be confused with the similarly named userspace application.  The above-mentioned commit contains a change to the dm-raid module as well, which causes it to report '0' mismatches unless the 'last_sync_action' was a "check".  So, for dm-raid (and by extension LVM) the ambiguity in mismatch_cnt is gone, but the user must be careful when interpreting the number for MD.
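A small sketch of that rule for plain MD (the helper name is mine):
only trust mismatch_cnt when last_sync_action says the previous pass
was a "check":

```shell
#!/bin/sh
# Sketch: print mismatch_cnt only if the last sync pass was a "check";
# otherwise the number is ambiguous for MD, per the commit above.
trusted_mismatch_cnt() {
    mddir="$1"   # e.g. /sys/block/md0/md
    last="$(cat "$mddir/last_sync_action")"
    if [ "$last" = "check" ]; then
        cat "$mddir/mismatch_cnt"
    else
        echo "last_sync_action was '$last', count not meaningful" >&2
        return 1
    fi
}
```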

 brassow--

