* 3.12: raid-1 mismatch_cnt question
@ 2013-11-04 10:25 ` Justin Piszcz
  0 siblings, 0 replies; 23+ messages in thread
From: Justin Piszcz @ 2013-11-04 10:25 UTC (permalink / raw)
  To: linux-kernel, linux-raid

Hi,

I run two SSDs in a RAID-1 configuration and I have a swap partition on a
third SSD.  Over time, the mismatch_cnt between the two devices grows higher
and higher.

Once a week, I run a check and repair against the md devices to help bring
the mismatch_cnt down.  When I run the check and repair, the system is live
so there are various logs/processes writing to disk.  The system also has
ECC memory and there are no errors reported.
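In script form, the weekly pass is roughly the following (a simplified sketch, not my exact cron job; the device names and the 30-second polling interval are assumptions):

```shell
#!/bin/sh
# Sketch of a weekly md check/repair pass. SYSBLK is overridable so the
# helpers can be exercised against a fake sysfs tree.
MDS="md0 md1"
SYSBLK=${SYSBLK:-/sys/block}

mismatch_cnt() {    # print the current mismatch_cnt for one array
    cat "$SYSBLK/$1/md/mismatch_cnt"
}

wait_idle() {       # block until the array's sync action finishes
    while [ "$(cat "$SYSBLK/$1/md/sync_action")" != "idle" ]; do
        sleep 30
    done
}

needs_repair() {    # true if the last check found mismatches
    [ "$1" -gt 0 ]
}

main() {
    for md in $MDS; do
        echo check > "$SYSBLK/$md/md/sync_action"
        wait_idle "$md"
        cnt=$(mismatch_cnt "$md")
        echo "$md mismatch_cnt=$cnt"
        if needs_repair "$cnt"; then
            echo repair > "$SYSBLK/$md/md/sync_action"
            wait_idle "$md"
        fi
    done
}

# Only touch sysfs when invoked as: sh weekly-check.sh run
if [ "${1:-}" = "run" ]; then main; fi
```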

The following graph is the mismatch_cnt from June 2013 to current; each drop
represents a check+repair.  In September, I dropped the kernel/vm caches
before running check/repair and that seemed to help a bit.
http://home.comcast.net/~jpiszcz/20131104/md_raid_mismatch_cnt.png

My question is: is this normal or should the mismatch_cnt always be 0 unless
there is a HW or md/driver issue?

Justin.


* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-04 10:25 ` Justin Piszcz
  (?)
@ 2013-11-07 10:54 ` Justin Piszcz
  2013-11-12  0:39   ` Brad Campbell
  -1 siblings, 1 reply; 23+ messages in thread
From: Justin Piszcz @ 2013-11-07 10:54 UTC (permalink / raw)
  To: open list, linux-raid

On Mon, Nov 4, 2013 at 5:25 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> Hi,
>
> I run two SSDs in a RAID-1 configuration and I have a swap partition on a
> third SSD.  Over time, the mismatch_cnt between the two devices grows higher
> and higher.
>
> Once a week, I run a check and repair against the md devices to help bring
> the mismatch_cnt down.  When I run the check and repair, the system is live
> so there are various logs/processes writing to disk.  The system also has
> ECC memory and there are no errors reported.
>
> The following graph is the mismatch_cnt from June 2013 to current; each drop
> represents a check+repair.  In September, I dropped the kernel/vm caches
> before running check/repair and that seemed to help a bit.
> http://home.comcast.net/~jpiszcz/20131104/md_raid_mismatch_cnt.png
>
> My question is: is this normal or should the mismatch_cnt always be 0 unless
> there is a HW or md/driver issue?
>
> Justin.
>

Hi,

Could anyone please comment if this is normal/expected behavior?

Thanks,

Justin.


* RE: 3.12: raid-1 mismatch_cnt question
       [not found] ` <527E8B74.70301@shiftmail.org>
@ 2013-11-09 22:49   ` Justin Piszcz
  2013-11-10 12:45     ` joystick
  0 siblings, 1 reply; 23+ messages in thread
From: Justin Piszcz @ 2013-11-09 22:49 UTC (permalink / raw)
  To: 'joystick'; +Cc: 'linux-raid'


From: joystick [mailto:joystick@shiftmail.org] 

[ .. ]

Hi,

> 1) It might be Grub writing state data to one device only during boot, IF
> the machine was rebooted at least once prior to check.

The checks (multiple) occurred after the reboot; last uptime was ~40+
days. Also, I am using LILO here, with the checks running once a week.

> 2) Earlier discussions on this list suggested that it might be a write
> buffer becoming invalid during write because a temporary file being written
> has been deleted in the meantime and the buffer reused with different
> content even if the buffer was still in-flight for the write. If this is
> true, the region with mismatches would belong to unallocated space on the
> filesystem so would be harmless. To confirm this, one in your
> situation should write zeroes to a new file so to fill the filesystem,
> then remove the file, just prior to the check or repair
> dd if=/dev/zero of=emptyfile bs=1M ; rm emptyfile ; echo check > .........
> this should result in zero or near-zero (see next point) mismatches. I
> think nobody has tried this before so if you can try this that would be
> great.

Baseline (had run a repair 9+ hours earlier btw):
# echo "Before: " $(cat /sys/block/md{0,1}/md/mismatch_cnt)
Before:  0 7552

# dd if=/dev/zero of=emptyfile bs=1M
dd: error writing 'emptyfile': No space left on device
66180+0 records in
66179+0 records out
69394198528 bytes (69 GB) copied, 127.136 s, 546 MB/s

# rm emptyfile

# echo check > /sys/devices/virtual/block/md0/md/sync_action
# echo check > /sys/devices/virtual/block/md1/md/sync_action
# # .. waiting until check done ..

# echo "After: " $(cat /sys/block/md{0,1}/md/mismatch_cnt)
After:  0 6016

> 3) I'm not sure if a small number of mismatches can arise when check or
> repair reads a sector that is being written to. This cannot account for
> the large number you see but could return not exactly zero when you do the
> test of previous point.

Agree (there are some processes, logging, etc. writing to the RAID-1 on
occasion), but when I used to use HDDs in a similar configuration, I never
saw this level of mismatches, and a repair would usually bring it down to 0
or a very small number.

> 4) Theories above do not explain why you see an improvement dropping
> caches. This is very interesting. How do you exactly drop the caches?

In short:
1.   sync
2.   echo 1 > /proc/sys/vm/drop_caches
3.   sync
4.   echo check > sync_action
[ .. ]
5.  if mismatch_cnt > 0
6.  repeat 1-3 above
7.  echo repair > sync_action
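As a script, the steps above look roughly like this (a sketch; the md1 device name is an assumption, and the paths are made overridable only so the helpers can be tested):

```shell
#!/bin/sh
# Sketch of the drop-caches-then-check/repair sequence described above.
MD=${MD:-md1}
SYS=${SYS:-/sys}
PROC=${PROC:-/proc}

drop_caches() {    # steps 1-3: flush dirty pages, drop pagecache, flush again
    sync
    echo 1 > "$PROC/sys/vm/drop_caches"
    sync
}

sync_action() {    # write an action (check/repair) to the array
    echo "$1" > "$SYS/block/$MD/md/sync_action"
}

wait_idle() {      # block until the action finishes
    while [ "$(cat "$SYS/block/$MD/md/sync_action")" != "idle" ]; do
        sleep 30
    done
}

main() {
    drop_caches
    sync_action check           # step 4
    wait_idle
    cnt=$(cat "$SYS/block/$MD/md/mismatch_cnt")
    if [ "$cnt" -gt 0 ]; then   # steps 5-7
        drop_caches
        sync_action repair
        wait_idle
    fi
}

# Only touch sysfs when invoked as: sh drop-and-check.sh run
if [ "${1:-}" = "run" ]; then main; fi
```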

> 5) I have an additional theory for SSDs: do you have TRIMs enabled in
> mount options, or do you perform periodic TRIMs? If yes, note that the
> SSD might return whatever from the sectors being TRIMmed, and hence the
> mismatch. See this:
> http://serverfault.com/questions/530652/background-discard-on-swap-partitions-on-linux-ssd
> do you have trim option enabled? do your SSDs have deterministic read data
> after trim?

I have TRIM (discard) enabled for / (root) only, and I only use MDRAID-1
for the /boot and / (root) filesystems; a 3rd SSD is dedicated to swap.
(/dev/sdb, /dev/sdc):
/dev/md0        /boot            ext3    defaults                   0  0
/dev/md1        /                ext4    defaults,discard           0  0

(/dev/sdd)
/dev/sdd1       none             swap    sw                          0  0

Justin.




* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-09 22:49   ` Justin Piszcz
@ 2013-11-10 12:45     ` joystick
  2013-11-11  9:26       ` Justin Piszcz
  0 siblings, 1 reply; 23+ messages in thread
From: joystick @ 2013-11-10 12:45 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: 'linux-raid'

On 09/11/2013 23:49, Justin Piszcz wrote:
> From: joystick [mailto:joystick@shiftmail.org]
>
> [ .. ]
>
> Hi,
>
>> 1) It might be Grub writing state data to one device only during boot. IF the machine was rebooted at least once prior to check.
> The checks (multiple) had occurred after the reboot, last uptime (was ~40+ days)-- also using LILO here with the checks running once a week.

You mean that you *repaired* the mismatches, then waited without 
rebooting, then repeated the check and there were again mismatches?


>> 2) Earlier discussions on this list suggested that it might be a write buffer becoming invalid during write because a temporary file being written has been deleted in the meantime and the buffer reused with different content even if the buffer was still in-flight for the write. If this is true, the region with mismatches would belong to unallocated space on the filesystem so would be harmless. To confirm this, one in your situation should write zeroes to a new file so to fill the filesystem, then remove the file, just prior to the check or repair
>>
>> dd if=/dev/zero of=emptyfile bs=1M ; rm emptyfile ; echo check > .........
>>
>> this should result in zero or near-zero (see next point) mismatches. I think nobody has tried this before so if you can try this that would be great.
> Baseline (had run a repair 9+ hours earlier btw):
> # echo "Before: " $(cat /sys/block/md{0,1}/md/mismatch_cnt)
> Before:  0 7552
>
> # dd if=/dev/zero of=emptyfile bs=1M
> dd: error writing 'emptyfile': No space left on device
> 66180+0 records in
> 66179+0 records out
> 69394198528 bytes (69 GB) copied, 127.136 s, 546 MB/s
>
> # rm emptyfile
>
> # echo check > /sys/devices/virtual/block/md0/md/sync_action
> # echo check > /sys/devices/virtual/block/md1/md/sync_action
> # # .. waiting until check done ..
>
> # echo "After: " $(cat /sys/block/md{0,1}/md/mismatch_cnt)
> After:  0 6016


Still mismatches after zero filling the filesystem.
This is important. This partially supports and partially undermines the 
main theory that was previously supported by people in this list, the 
one of empty space which I mentioned in my previous post.
Supports: the count has reduced from 7552 to 6016 so it seems the 
supposed mechanism actually happens sometimes.
Undermines (*): there are still 6016 mismatches, apparently belonging 
(*) to existing files.

(*) unless explanation is due to Trim, i.e. point 5 below

Since you have discard enabled on md1 mount options, I would suggest one 
more test:
Compute space left on md1 filesystem, e.g. 64.6 GiB (69394198528 bytes, 
watch out: not 69 GB) in example above.
Keep a reasonable margin for your activities, e.g. 3 GB
Fill the remainder, e.g. 61*1024 MB  (if I computed correctly)

# dd if=/dev/zero of=emptyfile bs=1M count=62464

now perform the check for mismatches with emptyfile still on the filesystem. Delete only afterwards.
This should keep Trim effects mostly out of the game.

# echo check > /sys/devices/virtual/block/md1/md/sync_action
# rm emptyfile
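Putting the whole test in one script (an untested sketch; the GNU stat invocation for free space and the exact 3 GB margin are my assumptions):

```shell
#!/bin/sh
# Untested sketch of the TRIM-isolation test: fill the free space, run the
# check with the fill file still present, delete it only afterwards.
MD=${MD:-md1}
MNT=${MNT:-/}
MARGIN=${MARGIN:-3221225472}   # ~3 GB safety margin, as suggested above

fill_mib() {    # free_bytes margin_bytes -> dd count in MiB
    echo $(( ($1 - $2) / 1048576 ))
}

main() {
    # free bytes on the filesystem (GNU stat: available blocks * block size)
    free=$(( $(stat -f -c '%a * %S' "$MNT") ))
    count=$(fill_mib "$free" "$MARGIN")
    dd if=/dev/zero of="$MNT/emptyfile" bs=1M count="$count"
    sync
    echo check > "/sys/block/$MD/md/sync_action"
    while [ "$(cat /sys/block/$MD/md/sync_action)" != "idle" ]; do
        sleep 30
    done
    cat "/sys/block/$MD/md/mismatch_cnt"
    rm "$MNT/emptyfile"        # delete only after the check has finished
}

# Only touch the filesystem when invoked as: sh trim-test.sh run
if [ "${1:-}" = "run" ]; then main; fi
```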

>> ...
>> 4) Theories above do not explain why you see an improvement dropping
> caches. This is very interesting. How do you exactly drop the caches?
>
> In short:
> 1.   sync
> 2.   echo 1 > /proc/sys/vm/drop_caches
> 3.   sync
> 4.   echo check > sync_action
> [ .. ]
> 5.  if mismatch_cnt > 0
> 6.  repeat 1-3 above
> 7.  echo repair > sync_action

The only reason I can think of why dropping caches in this way might help
is if TRIMmed areas return nonzero data on read for these SSDs. In that
case the cache and the device return different values on read.

I think the kernel should drop the cache of trimmed areas. Probably this 
is not implemented yet. Can anybody confirm?


>> 5) I have an additional theory for SSDs: do you have TRIMs enabled in mount options, or do you perform periodic TRIMs? If yes, note that the  SSD might return whatever from the sectors being TRIMmed, and hence the mismatch. See this:
>>
>> http://serverfault.com/questions/530652/background-discard-on-swap-partitions-on-linux-ssd
>>
>> do you have trim option enabled? do your SSDs have deterministic read data after trim?
> I have TRIM (discard) enabled for the / (root) only and only use MDRAID-1
> for the /boot and / (root) filesystems, I have a 3rd SSD dedicated to swap.
>
> (/dev/sdb, /dev/sdc):
> /dev/md0        /boot            ext3    defaults                   0  0
> /dev/md1        /                ext4    defaults,discard           0  0
>
> (/dev/sdd)
> /dev/sdd1       none             swap    sw                          0  0

One answer is missing: has it got deterministic read data after trim?

# hdparm -I /dev/sdX | grep TRIM

does it contain something like " * Deterministic read data after TRIM" ?

I would not trust this 100% anyways; the new test I suggested for point 
2 above should be more reliable.

Regards
J.




* RE: 3.12: raid-1 mismatch_cnt question
  2013-11-10 12:45     ` joystick
@ 2013-11-11  9:26       ` Justin Piszcz
  2013-11-11 11:06         ` joystick
  0 siblings, 1 reply; 23+ messages in thread
From: Justin Piszcz @ 2013-11-11  9:26 UTC (permalink / raw)
  To: 'joystick'; +Cc: 'linux-raid'



-----Original Message-----
From: joystick [mailto:joystick@shiftmail.org] 
Sent: Sunday, November 10, 2013 7:46 AM
To: Justin Piszcz
Cc: 'linux-raid'
Subject: Re: 3.12: raid-1 mismatch_cnt question

[ .. ]

> You mean that you *repaired* the mismatches, then waited without 
> rebooting, then repeated the check and there were again mismatches?

Yes.

[ .. ]

( # dd if=/dev/zero of=emptyfile bs=1M count=62464; now perform the check
for mismatches with emptyfile still on the filesystem. Delete only
afterwards; This should keep Trim effects mostly out of the game; # echo
check > /sys/devices/virtual/block/md1/md/sync_action; # rm emptyfile )

Had 103GB free, so:

$ dd if=/dev/zero of=emptyfile bs=1M count=100000 (4.8GB free after)
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 193.335 s, 542 MB/s

echo check > /sys/devices/virtual/block/md1/md/sync_action

cat /sys/devices/virtual/block/md1/md/mismatch_cnt
32640

>> ...
>> 4) Theories above do not explain why you see an improvement dropping
> caches. This is very interesting. How do you exactly drop the caches?
>
> In short:
> 1.   sync
> 2.   echo 1 > /proc/sys/vm/drop_caches
> 3.   sync
> 4.   echo check > sync_action
> [ .. ]
> 5.  if mismatch_cnt > 0
> 6.  repeat 1-3 above
> 7.  echo repair > sync_action

The only reason I can think of, for which dropping in this way might 
help, is if Trim-med areas return nonzero upon read for such SSD. In 
that case the cache and the device return different values upon read.

I think the kernel should drop the cache of trimmed areas. Probably this 
is not implemented yet. Can anybody confirm?

[ .. ]

One answer is missing: has it got deterministic read data after trim?
# hdparm -I /dev/sdX | grep TRIM
does it contain something like " * Deterministic read data after TRIM" ?

[ .. ]

Yes.

# hdparm -I /dev/sdb|grep "TRIM"
           *    Data Set Management TRIM supported (limit 1 block)
           *    Deterministic read data after TRIM
# hdparm -I /dev/sdc|grep "TRIM"
           *    Data Set Management TRIM supported (limit 1 block)
           *    Deterministic read data after TRIM

I would not trust this 100% anyways; the new test I suggested for point 
2 above should be more reliable.

[ .. ]

Ok.

Justin.




* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-11  9:26       ` Justin Piszcz
@ 2013-11-11 11:06         ` joystick
  2013-11-11 18:52           ` Justin Piszcz
  0 siblings, 1 reply; 23+ messages in thread
From: joystick @ 2013-11-11 11:06 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: 'joystick', 'linux-raid'

On 11/11/2013 10:26, Justin Piszcz wrote:
> -----Original Message-----
> .................
>

Very bad news then. Mismatches belong to occupied filesystem space. 
Seems like your data indeed got corrupted somehow and reading from 
different drives probably returns different content for existing files.

Most likely culprits that come to my mind:

1- MD raid1 bug
2- SSD bug (what brand and model?)
3- Loose SATA cable
4- Linux or SSD bug on trim, such as trimming wrong offsets killing live 
data
5- MD does not lock regions during check so returns erroneous mismatches
for areas being written. This would be harmless, but your mismatch count
seems too high to me for this.

I would suggest investigating further. One idea is to find which files are
affected; then, by reading from both disks independently, you should be
able to determine whether all the wrong data is on the same SSD (probable
loose cable or SSD bug if they are different) or evenly distributed
(probable MD raid1 bug, or SSD bug if they are identical).

The easiest, if it works, would be to determine the location of the
mismatches, and then get the filename from there.
Unfortunately I don't think MD tells you the location of mismatches
directly. Do you want to try the following:
/sys/block/mdX/md/sync_{min,max} should allow you to narrow the region of
the next check. Then check, then cat mismatch_cnt.
Narrow progressively so that you identify one block only. Invoke sync and
check the same region again a couple of times, to be sure that it's not
due to point 5 above. Then try debugfs (in read-only mode it can be used
with the fs mounted); there should be an option to get the inode from the
block number... I hope that block numbers are not offset by MD... I think
it's icheck, and then you might need find -inum to find the filename.
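Roughly, the narrowing could be scripted as a bisection (an untested sketch; I'm assuming sync_min/sync_max take 512-byte sectors and that mismatch_cnt after a ranged check reflects only that range):

```shell
#!/bin/sh
# Untested sketch: bisect the array with sync_min/sync_max to localize one
# mismatching region. Assumes mismatch_cnt reflects only the range just
# checked, and that sync_min/sync_max are in 512-byte sectors.
MD=${MD:-md1}

midpoint() {    # start end -> midpoint rounded down to an 8-sector (4K) boundary
    echo $(( (($1 + $2) / 2 / 8) * 8 ))
}

check_range() { # start end -> mismatch count for that range
    echo "$1" > "/sys/block/$MD/md/sync_min"
    echo "$2" > "/sys/block/$MD/md/sync_max"
    echo check > "/sys/block/$MD/md/sync_action"
    while [ "$(cat /sys/block/$MD/md/sync_action)" != "idle" ]; do sleep 5; done
    cat "/sys/block/$MD/md/mismatch_cnt"
}

main() {
    lo=0
    hi=$(blockdev --getsz "/dev/$MD")    # whole device, in sectors
    while [ $(( hi - lo )) -gt 8 ]; do   # stop once one 4K block remains
        mid=$(midpoint "$lo" "$hi")
        if [ "$(check_range "$lo" "$mid")" -gt 0 ]; then
            hi=$mid                      # a mismatch lies in the lower half
        else
            lo=$mid                      # otherwise chase the upper half
        fi
    done
    echo "mismatching sectors: $lo-$hi"
    echo 0   > "/sys/block/$MD/md/sync_min"   # restore the full range
    echo max > "/sys/block/$MD/md/sync_max"
}

# Only touch sysfs when invoked as: sh bisect-mismatch.sh run
if [ "${1:-}" = "run" ]; then main; fi
```

From the resulting sector offset, debugfs icheck would still need it converted to the filesystem's block size (and any md data offset accounted for).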

Now it's better to inspect the file to confirm it has indeed different 
content on the two sides...

1. activate a bitmap for the raid1, preferably with a small chunk size
2. fail one drive so the raid1 degrades
3. drop caches with blockdev --flushbufs on the md device (e.g. /dev/md2),
   on the two underlying partitions (e.g. /dev/sd[ab]2), and maybe even on
   the two disks holding them (e.g. /dev/sd[ab]) -- I'm not really sure
   what the minimum needed is -- and also echo 3 > /proc/sys/vm/drop_caches
4. cp the file to another filesystem
5. re-add the drive, and let it resync the differences using the bitmap
6. fail the other drive
7. drop all caches again
8. cp the file again to another filesystem
9. re-add the drive and let it resync

diff the two copied files... what do you see?
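The per-side copy could look something like this (an untested sketch; the device names, the file path, and the resync-wait grep are placeholders/assumptions on my part):

```shell
#!/bin/sh
# Untested sketch: copy the suspect file from each mirror half in turn,
# then compare the two copies. All device and path names are placeholders.
MD=/dev/md1
A=/dev/sdb2
B=/dev/sdc2
FILE=/path/to/suspect-file     # placeholder for the affected file
OUT=/mnt/other-fs              # a different filesystem

copy_name() {                  # label -> destination path for that side's copy
    echo "$OUT/copy-$1"
}

flush_all() {                  # drop block-device buffers and the pagecache
    blockdev --flushbufs "$MD" "$A" "$B"
    echo 3 > /proc/sys/vm/drop_caches
}

wait_resync() {                # crude: poll mdstat until no resync/recovery
    while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 5; done
}

copy_from_one_side() {         # device label -> copy with that device failed
    mdadm "$MD" --fail "$1" --remove "$1"
    flush_all
    cp "$FILE" "$(copy_name "$2")"
    mdadm "$MD" --re-add "$1"  # the bitmap keeps the resync cheap
    wait_resync
}

main() {
    mdadm --grow "$MD" --bitmap=internal   # step 1: add a write-intent bitmap
    copy_from_one_side "$A" a
    copy_from_one_side "$B" b
    cmp "$(copy_name a)" "$(copy_name b)" && echo identical || echo different
}

# Only touch the array when invoked as: sh per-side-copy.sh run
if [ "${1:-}" = "run" ]; then main; fi
```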

BTW can your system be taken offline or is it a production system? If it 
can be taken offline you can easily dump md5sums for all files from both 
sides of the RAID, that would be quicker.

Regards
J.



* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-11 11:06         ` joystick
@ 2013-11-11 18:52           ` Justin Piszcz
  2013-11-11 21:23             ` John Stoffel
                               ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Justin Piszcz @ 2013-11-11 18:52 UTC (permalink / raw)
  To: joystick; +Cc: linux-raid

[ .. ]

> Very bad news then. Mismatches belong to occupied filesystem space. Seems
> like your data indeed got corrupted somehow and reading from different
> drives probably returns different content for existing files.
>
> Most likely culprits that come to my mind:
>
> 1- MD raid1 bug
> 2- SSD bug (what brand and model?)

# smartctl -a /dev/sdb|grep -i model
Model Family:     Intel 520 Series SSDs
Device Model:     INTEL SSDSC2CW240A3

# smartctl -a /dev/sdc|grep -i model
Model Family:     Intel 520 Series SSDs
Device Model:     INTEL SSDSC2CW240A3


> 3- Loose SATA cable
Confirmed this is not the case.


> 4- Linux or SSD bug on trim, such as trimming wrong offsets killing live
> data
> 5- MD does not lock regions during check so returns erroneous mismatches for
> areas being written. This would be harmless but your mismatches number seems
> too high to me for this.
I wonder if this could be it.

>
> I would suggest to investigate further. One idea is to find which files are
> affected, then reading from both disks independently you should be able to
> determine if all wrong data are on the same SSD (probable loose cable or SSD
> bug if they are different) or evenly distributed (probable MD raid1 bug or
> SSD bug if they are identical).
>
> The easiest, if it works, would be to determine the location of mismatches,
> and then get the filename from there.
> Unfortunately I don't think MD tells you the location of mismatches
> directly. Do you want to try the following:
> /sys/block/mdX/md/sync{_min,_max} should allow you to narrow the region of
> the next check. Then check, then cat mismatch_cnt.
> Narrow progressively so that you identify one block only. Invoke sync and
> check again same region a couple of times so to be sure that it's not due to
> point 5 above. Then try debugfs (in readonly mode can be used with fs
> mounted), there should be an option to get the inode from the block
> number... I hope that block numbers are not offset by MD... I think it's
> icheck and then you might need find -inum to find the filename.
>
> Now it's better to inspect the file to confirm it has indeed different
> content on the two sides...
>
> activate bitmap for raid1, preferably with small chunksize
> fail 1 drive so to degrade raid1
> drop caches with blockdev --flushbufs on the md device such as /dev/md2, on
> the two underlying partitions such as /dev/sd[ab]2, and maybe even on the
> two disk holding then such as /dev/sd[ab] (I'm not really sure what is the
> minimum needed) ; and also echo 3 > /proc/sys/vm/drop_caches
> cp the file to another filesystem
> reattach drive, let it resync the differences using the bitmap
> fail the other drive
> drop all caches again
> cp again file to another filesystem
> reattach drive and let it resync
>
> diff the two copied files... what do you see?
>
> BTW can your system be taken offline or is it a production system? If it can
> be taken offline you can easily dump md5sums for all files from both sides
> of the RAID, that would be quicker.

I took a slightly different approach; hopefully this will provide the
information you are looking for:

Rebooted to a system rescue cd:

Did not mount the filesystem, before a check:

  cat /sys/devices/virtual/block/md1/md/mismatch_cnt
  256

Ran a check > sync_action and re-checked the mismatch_cnt:

  cat /sys/devices/virtual/block/md1/md/mismatch_cnt
  68352

Ran a repair > sync_action:
  68352 (expected; need to re-run check)

Ran a check > sync_action
  0

It appears that when there are files moving around / being written, it can
throw off the mismatch_cnt?  Since the FS above was not mounted, the repair
worked OK?
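In script form, the rescue-CD sequence was essentially the following (a sketch for reproducibility; the device name is an assumption and the filesystem stays unmounted throughout):

```shell
#!/bin/sh
# Sketch of the offline check/repair/check cycle from a rescue environment.
# The filesystem on the array is never mounted. SYS is overridable so the
# helpers can be exercised against a fake sysfs tree.
MD=${MD:-md1}
SYS=${SYS:-/sys}

count() {       # current mismatch_cnt
    cat "$SYS/block/$MD/md/mismatch_cnt"
}

trigger() {     # start a sync action (check or repair)
    echo "$1" > "$SYS/block/$MD/md/sync_action"
}

wait_idle() {   # block until the action finishes
    while [ "$(cat "$SYS/block/$MD/md/sync_action")" != "idle" ]; do
        sleep 30
    done
}

main() {
    echo "before: $(count)"      # was 256 here
    trigger check;  wait_idle
    echo "check:  $(count)"      # 68352 here
    trigger repair; wait_idle
    echo "repair: $(count)"      # unchanged until the next check
    trigger check;  wait_idle
    echo "verify: $(count)"      # 0 here
}

# Only touch sysfs when invoked as: sh offline-cycle.sh run
if [ "${1:-}" = "run" ]; then main; fi
```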

Justin.


* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-11 18:52           ` Justin Piszcz
@ 2013-11-11 21:23             ` John Stoffel
  2013-11-11 21:55               ` NeilBrown
  2013-11-11 21:58             ` NeilBrown
  2013-11-12  9:30             ` joystick
  2 siblings, 1 reply; 23+ messages in thread
From: John Stoffel @ 2013-11-11 21:23 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: joystick, linux-raid


I thought you could also get a mis-match count from open mmap'd files,
which aren't completely written to one disk or another?  

John


* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-11 21:23             ` John Stoffel
@ 2013-11-11 21:55               ` NeilBrown
  2013-11-12  2:49                 ` John Stoffel
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2013-11-11 21:55 UTC (permalink / raw)
  To: John Stoffel; +Cc: Justin Piszcz, joystick, linux-raid


On Mon, 11 Nov 2013 16:23:49 -0500 "John Stoffel" <john@stoffel.org> wrote:

> 
> I thought you could also get a mis-match count from open mmap'd files,
> which aren't completely written to one disk or another?  
> 

The only cause I can imagine for the mismatch count increasing is for a page
of memory to be changed while it is being written out (so each device sees a
different value) and then for the page to be invalidated (so the dirty page
never gets written out again).

The only way to change a page while it is being written out is (I think)
through memory mapping (though this could have changed; "write" might achieve
it).

But normally if you change a memory-mapped page while it is being written, it
will be marked 'dirty' and so will be written out again - the same to both
devices.

You could possibly modify a mem-mapped file and then delete it before the
latest changes were written, but I think that would be unlikely to happen on
purpose.

Swap is the most likely cause.  If some pages in a process were written out
and changed during the writeout, and then the process was killed, you could
easily get a mismatch persisting.

But that doesn't seem to be the case here.

So I don't know what would be causing it.

NeilBrown



* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-11 18:52           ` Justin Piszcz
  2013-11-11 21:23             ` John Stoffel
@ 2013-11-11 21:58             ` NeilBrown
  2013-11-11 22:18               ` Justin Piszcz
  2013-11-12  9:30             ` joystick
  2 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2013-11-11 21:58 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: joystick, linux-raid


On Mon, 11 Nov 2013 13:52:23 -0500 Justin Piszcz <jpiszcz@lucidpixels.com>
wrote:


> > 4- Linux or SSD bug on trim, such as trimming wrong offsets killing live
> > data
> > 5- MD does not lock regions during check so returns erroneous mismatches for
> > areas being written. This would be harmless but your mismatches number seems
> > to high to me for this.
> I wonder if this could be it.

I cannot promise that my code is bug free, but if it wasn't getting this
locking correct, that would be a very serious bug.  I think this possibility
is quite unlikely.

NeilBrown



* RE: 3.12: raid-1 mismatch_cnt question
  2013-11-11 21:58             ` NeilBrown
@ 2013-11-11 22:18               ` Justin Piszcz
  0 siblings, 0 replies; 23+ messages in thread
From: Justin Piszcz @ 2013-11-11 22:18 UTC (permalink / raw)
  To: 'NeilBrown'; +Cc: 'joystick', 'linux-raid'



-----Original Message-----
From: NeilBrown [mailto:neilb@suse.de] 
Sent: Monday, November 11, 2013 4:58 PM
To: Justin Piszcz
Cc: joystick; linux-raid
Subject: Re: 3.12: raid-1 mismatch_cnt question

On Mon, 11 Nov 2013 13:52:23 -0500 Justin Piszcz <jpiszcz@lucidpixels.com>
wrote:

[ .. ]

I cannot promise that my code is bug free, but if it wasn't getting this
locking correct, that would be a very serious bug.  I think this possibility
is quite unlikely.

NeilBrown

--

FWIW:

Mount points/options:
# grep md /etc/fstab
/dev/md0        /boot            ext3    defaults                   0  0
/dev/md1        /                ext4    defaults,discard           0  0

Both SSD's are plugged directly into the motherboard's 2 x 6Gbps ports
(Intel chipset/X9SRL-F mobo)

# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Fri Nov  2 10:10:14 2012
     Raid Level : raid1
     Array Size : 1048512 (1024.11 MiB 1073.68 MB)
  Used Dev Size : 1048512 (1024.11 MiB 1073.68 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Nov 11 13:50:08 2013
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 08624db8:4840de94:c44c77eb:7ee19756
         Events : 0.441

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       17        1      active sync   /dev/sdb1

# mdadm --detail /dev/md1
/dev/md1:
        Version : 0.90
  Creation Time : Fri Nov  2 10:10:27 2012
     Raid Level : raid1
     Array Size : 233381376 (222.57 GiB 238.98 GB)
  Used Dev Size : 233381376 (222.57 GiB 238.98 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Nov 11 17:12:37 2013
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 03a64f68:595e7c61:c44c77eb:7ee19756
         Events : 0.340396

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdc2
       1       8       18        1      active sync   /dev/sdb2

# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sdc2[0] sdb2[1]
      233381376 blocks [2/2] [UU]
      
md0 : active raid1 sdc1[0] sdb1[1]
      1048512 blocks [2/2] [UU]
      
unused devices: <none>

At boot:
[   12.579770] md: raid1 personality registered for level 1
[   13.777189] md: Waiting for all devices to be available before autodetect
[   13.790829] md: If you don't use raid, use raid=noautodetect
[   13.804604] md: Autodetecting RAID arrays.
[   13.820461] md: Scanned 4 and added 4 devices.
[   13.834231] md: autorun ...
[   13.847869] md: considering sdc2 ...
[   13.861458] md:  adding sdc2 ...
[   13.874984] md: sdc1 has different UUID to sdc2
[   13.888487] md:  adding sdb2 ...
[   13.901934] md: sdb1 has different UUID to sdc2
[   13.915469] md: created md1
[   13.928790] md: bind<sdb2>
[   13.942004] md: bind<sdc2>
[   13.955088] md: running: <sdc2><sdb2>
[   13.968212] md/raid1:md1: active with 2 out of 2 mirrors
[   13.981297] md1: detected capacity change from 0 to 238982529024
[   13.994501] md: considering sdc1 ...
[   14.007674] md:  adding sdc1 ...
[   14.020775] md:  adding sdb1 ...
[   14.033718] md: created md0
[   14.046545] md: bind<sdb1>
[   14.059301] md: bind<sdc1>
[   14.071918] md: running: <sdc1><sdb1>
[   14.084644] md/raid1:md0: active with 2 out of 2 mirrors
[   14.097321] md0: detected capacity change from 0 to 1073676288
[   14.110059] md: ... autorun DONE.
[   14.123179]  md1: unknown partition table
[   14.138285] EXT4-fs (md1): mounted filesystem with ordered data mode.
Opts: (null)
[   14.640887]  md0:
[   15.785971] EXT4-fs (md1): re-mounted. Opts: discard
[   15.970284] EXT3-fs (md0): using internal journal
[   15.970285] EXT3-fs (md0): mounted filesystem with writeback data mode

Justin.



* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-07 10:54 ` Justin Piszcz
@ 2013-11-12  0:39   ` Brad Campbell
  2013-11-12  9:14     ` Justin Piszcz
  0 siblings, 1 reply; 23+ messages in thread
From: Brad Campbell @ 2013-11-12  0:39 UTC (permalink / raw)
  To: Justin Piszcz, open list, linux-raid

On 11/07/2013 06:54 PM, Justin Piszcz wrote:
> On Mon, Nov 4, 2013 at 5:25 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>> Hi,
>>
>> I run two SSDs in a RAID-1 configuration and I have a swap partition on a
>> third SSD.  Over time, the mismatch_cnt between the two devices grows higher
>> and higher.
>>

Are both SSDs identical? Do you have discard enabled on the filesystem?

The reason I ask is that I have a RAID10 comprised of 3 Intel and 3 Samsung
SSDs. The Intels return 0 after TRIM while the Samsungs don't, so I
_always_ have a massive mismatch_cnt after I run fstrim. I never use a
repair operation as it would just re-write the already-trimmed sectors.


Just a thought.




* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-11 21:55               ` NeilBrown
@ 2013-11-12  2:49                 ` John Stoffel
  0 siblings, 0 replies; 23+ messages in thread
From: John Stoffel @ 2013-11-12  2:49 UTC (permalink / raw)
  To: NeilBrown; +Cc: John Stoffel, Justin Piszcz, joystick, linux-raid

>>>>> "NeilBrown" == NeilBrown  <neilb@suse.de> writes:

NeilBrown> On Mon, 11 Nov 2013 16:23:49 -0500 "John Stoffel" <john@stoffel.org> wrote:
>> 
>> I thought you could also get a mis-match count from open mmap'd files,
>> which aren't completely written to one disk or another?  
>> 

NeilBrown> The only cause I can imagine for the mismatch count
NeilBrown> increasing is for a page of memory to be changed while it is
NeilBrown> being written out (so each device sees a different value)
NeilBrown> and then for the page to be invalidated (so the dirty page
NeilBrown> never gets written out again).

NeilBrown> The only way to change a page while it is being written out
NeilBrown> is (I think) through memory mapping (though this could have
NeilBrown> changed; "write" might achieve it).

NeilBrown> But normally if you change a memory mapped page while it is
NeilBrown> being written it will be marked 'dirty' and so will be
NeilBrown> written out again - the same to both devices.

NeilBrown> You could possibly modify a mem-mapped file, and then
NeilBrown> delete it before the latest changes were written, but I think
NeilBrown> that would be unlikely to do on purpose.

NeilBrown> Swap is the most likely cause.  If some pages in a process
NeilBrown> were written out and changed during the writeout, and then
NeilBrown> the process was killed, you could easily get a mismatch
NeilBrown> persisting.

I think it is swap in my case; it's a RHEL 5.6 box being used to host
VNC sessions for my users.  They do weekly checks on the mirrored
disks, and I think I get mismatch counts because swap is on MD too.  

John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-12  0:39   ` Brad Campbell
@ 2013-11-12  9:14     ` Justin Piszcz
  0 siblings, 0 replies; 23+ messages in thread
From: Justin Piszcz @ 2013-11-12  9:14 UTC (permalink / raw)
  To: Brad Campbell; +Cc: open list, linux-raid

On Mon, Nov 11, 2013 at 7:39 PM, Brad Campbell
<lists2009@fnarfbargle.com> wrote:
> On 11/07/2013 06:54 PM, Justin Piszcz wrote:
>>
>> On Mon, Nov 4, 2013 at 5:25 AM, Justin Piszcz <jpiszcz@lucidpixels.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I run two SSDs in a RAID-1 configuration and I have a swap partition on a
>>> third SSD.  Over time, the mismatch_cnt between the two devices grows
>>> higher
>>> and higher.
>>>
>
> Are both SSD's identical? Do you have discard enabled on the filesystem?
Yes (2 x Intel SSDSC2CW240A3) & yes (/dev/root on / type ext4
(rw,relatime,discard,data=ordered))

>
> The reason I ask is I have a RAID10 comprised of 3 Intel and 3 Samsung
> SSD's. The Intel return 0 after TRIM while the Samsung don't, so I _always_
> have a massive mismatch_cnt after I run fstrim. I never use a repair
> operation as it's just going to re-write the already trimmed sectors.
Very interesting and good to know!

Justin.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-11 18:52           ` Justin Piszcz
  2013-11-11 21:23             ` John Stoffel
  2013-11-11 21:58             ` NeilBrown
@ 2013-11-12  9:30             ` joystick
  2013-11-12 10:29               ` Bernd Schubert
  2 siblings, 1 reply; 23+ messages in thread
From: joystick @ 2013-11-12  9:30 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid

On 11/11/2013 19:52, Justin Piszcz wrote:

>> 4- Linux or SSD bug on trim, such as trimming wrong offsets killing live
>> data
>> 5- MD does not lock regions during check so returns erroneous mismatches for
>> areas being written. This would be harmless but your mismatches number seems
>> to high to me for this.
> I wonder if this could be it.

It's not; your reboot test confirmed as much when you did this:


> Ran a check > sync_action and re-checked the mismatch_cnt:
>
>   cat /sys/devices/virtual/block/md1/md/mismatch_cnt
>   68352

This should have been zero if that were the case.



>> I would suggest to investigate further. One idea is to find which files are
>> affected....
> I took a slightly different approach, hopefully this will provide the
> information you are looking for:

Actually no, and you "fixed" it, so you cannot do any further testing until 
the number of mismatches grows again.


> Rebooted to a system rescue cd:
>
> Did not mount the filesystem, before a check:
>
>    cat /sys/devices/virtual/block/md1/md/mismatch_cnt
>    256
>
> Ran a check > sync_action and re-checked the mismatch_cnt:
>
>    cat /sys/devices/virtual/block/md1/md/mismatch_cnt
>    68352
>
> Ran a repair > sync_action
>    68352 (expected, need to re-run check):
>
> Ran a check > sync_action
>    0
>
> It appears when there a files moving around / being written to it can
> throw off the mismatch_cnt?

Maybe, and it shouldn't happen. This is a serious bug somewhere; it 
corrupts data, and we need to find it.

> As the FS above was not mounted, it
> repaired ok?

No, you just copied one disk over the other. This does not mean 
"fixed" in the filesystem sense. Data is still corrupted; the two 
legs of the RAID are now just corrupted identically to each other.

Wait until the mismatches grow again to a couple of thousand, then I suggest 
you really do what I wrote in my previous email.
If you can afford to bring the system offline, it's really easy, 
because you can find all mismatching files in one shot:

- wait for mismatch_cnt to reach at least 2000 (the more, the better), then 
reboot the machine with a livecd
- assemble the RAID
- mount the filesystem readonly
- (very important or it will resync) activate a bitmap for the raid1, 
preferably with a small chunksize
- fail 1 drive so as to degrade the raid1
- drop caches with blockdev --flushbufs on the md device such as 
/dev/md2, on the two underlying partitions such as /dev/sd[ab]2, and 
maybe even on the two disks holding them such as /dev/sd[ab] (I'm not 
really sure what the minimum needed is); and also echo 3 > 
/proc/sys/vm/drop_caches
- recursive md5sum of all files in the filesystem (something like find 
-type f -print0 | xargs -0 md5sum (untested)); redirect stdout to a 
file on another filesystem
- reattach the drive with --re-add and let it resync the differences using 
the bitmap (there shouldn't be any; it should complete immediately)
- fail the other drive
- drop all caches again
- again find | md5sum, redirected to another file on another filesystem
- reattach the drive with --re-add
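Consolidated as a script, one per-leg pass of the procedure above might look like this (a sketch under assumptions: the /dev/md1, /dev/sdb2, /dev/sdc2 names and the read-only mount point /mnt/ro are made up from the thread, and a write-intent bitmap is assumed to already be active; by default the script only prints its plan, and the destructive mdadm steps run only with RUN_FOR_REAL=yes):

```shell
# Hedged sketch: checksum the files visible through each leg of a
# RAID-1 in turn, then diff the two listings. Device names, mount
# point, and output paths are assumptions; verify before running.
MD=/dev/md1                         # assumed array
LEGS="/dev/sdb2 /dev/sdc2"          # assumed member devices

checksum_leg() {
    leg=$1
    mdadm "$MD" --fail "$leg" && mdadm "$MD" --remove "$leg"
    blockdev --flushbufs "$MD" "$leg"       # drop block-layer buffers
    echo 3 > /proc/sys/vm/drop_caches       # drop page/dentry/inode caches
    find /mnt/ro -type f -print0 | xargs -0 md5sum | sort \
        > "/tmp/sums.$(basename "$leg")"
    mdadm "$MD" --re-add "$leg"             # bitmap => near-instant resync
}

if [ "${RUN_FOR_REAL:-no}" = yes ]; then
    for leg in $LEGS; do checksum_leg "$leg"; done
    # any differing lines are candidate corrupted files
    diff /tmp/sums.sdb2 /tmp/sums.sdc2
else
    echo "dry run: would checksum each leg of $MD ($LEGS) in turn"
fi
```

The dry-run guard is deliberate: failing and re-adding members on the wrong array is expensive to undo, so the plan is printed until RUN_FOR_REAL=yes is set explicitly.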

Now analyze the differences between the md5sums. Those are the files that 
differ between the two legs of the RAID, and they shouldn't (i.e. 
corruption).
Look preferably for human-readable text files that are written 
sequentially, such as log files. It is harder to understand what's 
wrong in files changed in the middle, such as database files or binary 
files.

Copy those files out, to another filesystem.
You need to, again:
- fail 1 drive so to degrade raid1
- drop caches as described above
- copy all files out, to a directory in another filesystem
- reattach drive with --re-add
- fail the other drive
- drop all caches again
- copy all files out again to another directory of another filesystem
- reattach drive with --re-add

At this point you can restart machine to production.

Inspect the two versions of such files... If you can tell us something 
about which files got corrupted and what exactly you see at the 
corruption point (you can use hexdump to view binary characters), we can 
make some further guesses.
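For that inspection step, a small self-contained sketch (the filenames and demo contents below are made up so the snippet runs standalone; on the real data you would point a and b at the two copies pulled from the two legs):

```shell
# Hedged sketch: find the first differing byte of two copies of the
# same file and hex-dump the bytes around it.
LC_ALL=C; export LC_ALL              # keep cmp's message format stable
d=$(mktemp -d)
a=$d/copyA.log; b=$d/copyB.log
printf 'line1\nline2 X\n' > "$a"     # demo data; replace with real copies
printf 'line1\nline2 Y\n' > "$b"

# cmp prints: "<a> <b> differ: byte N, line M" (1-based byte offset)
off=$(cmp "$a" "$b" | awk '{print $5}' | tr -d ',')
echo "first difference at byte $off"
command -v hexdump >/dev/null && \
    dd if="$a" bs=1 skip=$((off - 1)) count=16 2>/dev/null | hexdump -C
```

For sequentially written text files (logs), the context around that offset usually makes it obvious whether one leg simply holds an older tail of the file or genuinely scrambled data.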

Regards
J.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-12  9:30             ` joystick
@ 2013-11-12 10:29               ` Bernd Schubert
  2013-11-13 22:10                 ` Justin Piszcz
  0 siblings, 1 reply; 23+ messages in thread
From: Bernd Schubert @ 2013-11-12 10:29 UTC (permalink / raw)
  To: joystick, Justin Piszcz; +Cc: linux-raid

On 11/12/2013 10:30 AM, joystick wrote:
> On 11/11/2013 19:52, Justin Piszcz wrote:
> Wait so that mismatches grow again a couple of thousands, then I suggest
> you really do what I wrote in my previous email.
> If you can afford to bring the system offline then it's really easy
> because you can find all mismatching files in one shot
>
> - wait for mismatch_cnt reach 2000 at least (the more, the better), then
> reboot machine with a livecd
> - mount RAID
> - mount the filesystem readonly
> - (very important or it will resync) activate bitmap for raid1,
> preferably with small chunksize
> - fail 1 drive so to degrade raid1
> - drop caches with blockdev --flushbufs on the md device such as
> /dev/md2, on the two underlying partitions such as /dev/sd[ab]2, and
> maybe even on the two disk holding then such as /dev/sd[ab] (I'm not
> really sure what is the minimum needed) ; and also echo 3 >
> /proc/sys/vm/drop_caches
> - recursive md5sum for all files of the filesystem (something like find
> -type f -print0 | xargs -0 md5sum (untested)) > redirect stdout to a
> file on another filesystem
> - reattach drive with --re-add, let it resync the differences using the
> bitmap (there shouldn't be any, should complete immediately)
> - fail the other drive
> - drop all caches again
> - again find | md5sum , redirected to another file on another filesystem
> - reattach drive with --re-add
>
> now analyze differences between md5sums. Those are the files which are
> different in the two legs of the RAID, and they shouldn't be (aka
> corruption).
> Find preferably humanly readable text files which are sequentially
> written, such as log files. It is more difficult to understand what's
> wrong for files changed in the middle such as database files, or binary
> files.
>

If you have disk space available, you might run ql-fstest (possibly in 
combination with the above method):

https://bitbucket.org/aakef/ql-fstest

Right now it does not yet support restarting and verifying existing 
files, but I'm going to add that, either this evening or on Thursday.


Cheers,
Bernd


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: 3.12: raid-1 mismatch_cnt question
  2013-11-12 10:29               ` Bernd Schubert
@ 2013-11-13 22:10                 ` Justin Piszcz
  2013-11-14  8:44                   ` joystick
  0 siblings, 1 reply; 23+ messages in thread
From: Justin Piszcz @ 2013-11-13 22:10 UTC (permalink / raw)
  To: 'Bernd Schubert', 'joystick'; +Cc: 'linux-raid'



-----Original Message-----
From: Bernd Schubert [mailto:bernd.schubert@fastmail.fm] 
Sent: Tuesday, November 12, 2013 5:29 AM
To: joystick; Justin Piszcz
Cc: linux-raid
Subject: Re: 3.12: raid-1 mismatch_cnt question

joystick's recommendations:

>> - wait for mismatch_cnt reach 2000 at least (the more, the better), then .. [ .. ]

$ cat /sys/devices/virtual/block/md1/md/mismatch_cnt 
254336


[ .. ]


If you have available disk space, you might run ql-fstest (possibly in 
combination) with the above method.

https://bitbucket.org/aakef/ql-fstest

Right now it does not support yet to restart it and to verify existing 
files, but I'm going to add this, either this evening or on Thursday.

[ .. ]

I attempted to test both here:
http://home.comcast.net/~jpiszcz/20131113/joystick_cmds.txt

The --re-add did not work btw.

Justin.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-13 22:10                 ` Justin Piszcz
@ 2013-11-14  8:44                   ` joystick
  2013-11-14 10:43                     ` Justin Piszcz
  0 siblings, 1 reply; 23+ messages in thread
From: joystick @ 2013-11-14  8:44 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: 'Bernd Schubert', 'linux-raid'

On 13/11/2013 23:10, Justin Piszcz wrote:
> I attempted to test both here:
> http://home.comcast.net/~jpiszcz/20131113/joystick_cmds.txt
>
> The --re-add did not work btw.
>

Unfortunately the --re-add HAD to work for the test to be of any value.

With just -a (--add), as you did, it resynced completely when you 
added the old device back in, similarly to when you run repair.
After that you had two exactly identical devices in the RAID, so no 
wonder the two md5sums of the two legs turned out identical.
mismatch_cnt also went to zero again, and will need to grow back to 
some significant value before you can repeat the test.

The reason --re-add failed seems to be (I did a test on our 
machines) that you also need to --remove the device after --fail, so it's:

when removing:
     mdadm /dev/md1 --fail /dev/sda2
         mdadm: set /dev/sda2 faulty in /dev/md1
     mdadm /dev/md1 --remove /dev/sda2
         mdadm: hot removed /dev/sda2 from /dev/md1

.... compute md5sums ...

when re-adding:
     mdadm /dev/md1 --re-add /dev/sda2
         mdadm: re-added /dev/sda2

You actually did remove the drive, but after that you did not retry with 
--re-add; you went straight to --add, which fully replicated the content 
of sda2 onto sdb2.

Also: does the array already have a bitmap? It's not clear from the log. 
If it does not have a bitmap, even --re-add will replicate all content, 
so you really need a bitmap for this test.
To add a bitmap you can do:
mdadm /dev/md1 --grow --bitmap=internal
at the beginning of the test.

What kernel version is yours?

Regards
J.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: 3.12: raid-1 mismatch_cnt question
  2013-11-14  8:44                   ` joystick
@ 2013-11-14 10:43                     ` Justin Piszcz
  2013-11-14 16:09                       ` joystick
  0 siblings, 1 reply; 23+ messages in thread
From: Justin Piszcz @ 2013-11-14 10:43 UTC (permalink / raw)
  To: 'joystick'; +Cc: 'Bernd Schubert', 'linux-raid'



-----Original Message-----
From: joystick [mailto:joystick@shiftmail.org] 
Sent: Thursday, November 14, 2013 3:44 AM
To: Justin Piszcz
Cc: 'Bernd Schubert'; 'linux-raid'
Subject: Re: 3.12: raid-1 mismatch_cnt question

On 13/11/2013 23:10, Justin Piszcz wrote:
> I attempted to test both here:
> http://home.comcast.net/~jpiszcz/20131113/joystick_cmds.txt
>
> The --re-add did not work btw.
>

>> Unfortunately the --re-add HAD to work for the test to be of any value.

$ cat /sys/devices/virtual/block/md1/md/mismatch_cnt
303232

Ready to test again, mismatch_cnt very high..

[ .. ]

Please see the following per your new instructions:
http://home.comcast.net/~jpiszcz/20131114/joystick_cmds2.txt

Summary: No diffs found.

>> What kernel version is yours?
Was using System Rescue CD 3.7.0; its kernel appears to be 3.4.47.
Will re-try with the latest System Rescue CD, 3.8.1, whose kernel appears to be 3.4.66.

On a side note, I have not seen any corruption in any of my files; debsums also confirms no issues with any of the system files. So I am wondering whether mismatch_cnt is accurate, given the diff above and the absence of visible corruption?

# debsums|grep -v OK$
debsums: no md5sums for libgme0
#

Justin.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-14 10:43                     ` Justin Piszcz
@ 2013-11-14 16:09                       ` joystick
  2013-11-14 17:22                         ` Justin Piszcz
  0 siblings, 1 reply; 23+ messages in thread
From: joystick @ 2013-11-14 16:09 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: 'Bernd Schubert', 'linux-raid'

On 14/11/2013 11:43, Justin Piszcz wrote:
> $ cat /sys/devices/virtual/block/md1/md/mismatch_cnt
> 303232
>
> Ready to test again, mismatch_cnt very high..
>
> [ .. ]
>
> Please see the following per your new instructions:
> http://home.comcast.net/~jpiszcz/20131114/joystick_cmds2.txt
>
> Summary: No diffs found.

mmh, that's strange...

At the end of the procedure (like now, if you didn't resync or repair in 
the meanwhile), is mismatch_cnt still so high?
I'm wondering whether a resync somehow happened anyway, even though the 
procedure seems correct to me this time.


>>> What kernel version is yours?
> Was using system rescue cd 3.7.0, appears to be 3.4.47.

no, not that one...
it would be helpful to know the kernel version that *creates* the 
mismatches, the one that you have running normally on the live system.
That's the "bugged" one, supposing this is really a bug (until we find 
where the mismatches are, it's difficult to say whether this is data 
loss or not).


> Will re-try with the latest system rescue cd, 3.8.1, appears to be 3.4.66.
no, that's not needed...


> On a side note, I have not seen any corruption on any of my files; debsums also confirms no issues with any of the system files, so I am wondering if mismatch_cnt is accurate based on the diff above and not seeing any corruption?

yep the problem is now in fact to understand WHERE these mismatches are 
hiding...

Ubuntu files are mostly executables and config files, which do not get 
changed often. Mismatches are less likely to be there than in the 
files that do change.

Maybe the mismatches are located in ext4 metadata areas, which are not files 
and so can't be seen with md5sums... That would still be just as 
worrisome, unless some ext4 expert can tell us that it's OK (it can be 
OK if the region with mismatches is an old metadata area, currently 
unused; the mechanism that can create harmless mismatches in this case 
has been described by Neil).

It seems you will need to perform the other test I described previously. 
It is a bit more complex, but it should find something. This can be done 
live, or at least the beginning of it:

- First confirm that mismatch_cnt is still high..

- Then, if this does not disrupt your system operation too much, I would 
suggest filling 95% of the free space with a file of zeroes like you did in 
earlier tests. Otherwise, for a mismatch happening in a non-file area we 
won't be sure what kind of area it is. Maybe recompute mismatch_cnt 
after this.

then, copypasting the procedure with some modifications:
----
... to determine the location of mismatches (...)
Unfortunately I don't think MD tells you the location of mismatches 
directly. Do you want to try the following:
/sys/block/md1/md/sync_min and /sys/block/md1/md/sync_max should allow 
you to narrow the region of the next check.
Set them, then perform check, then cat mismatch_cnt.
Progressively narrow sync_min and sync_max so that you identify the 
densest areas of mismatches, or a few single blocks that mismatch.
When you have identified some regions or isolated blocks, invoke "sync" 
from bash and then check again the same region a couple of times so to 
be sure that it stays mismatched and it's not just a transient situation.
Then try debugfs (in read-only mode it can be used with the fs mounted): 
there should be an option to get the inode number from a block number on 
the device... I hope the block numbers are not offset by MD... I think 
it's icheck, and after that you might need "find -inum <inode_number>" 
launched on the same filesystem to find the corresponding filename for 
the inode number. That should be the file that contains the mismatch.
----
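The narrowing loop above can be sketched as a bisection. The version below is simulated so the logic actually runs: check_range is a stand-in that places mismatches in sectors [100, 108); on a live array it would instead write its arguments into sync_min/sync_max, trigger a check, wait for sync_action to return to idle, and read mismatch_cnt (and real mismatches may of course sit in several disjoint regions, so the loop would be repeated per region):

```shell
# Hedged sketch of narrowing sync_min/sync_max by bisection.
# check_range LO HI prints the mismatch count for sectors [LO, HI);
# simulated here so the control flow can be exercised offline.
check_range() {
    a=$1; b=$2; s=100; e=108            # simulated mismatch window
    [ "$a" -gt "$s" ] && s=$a
    [ "$b" -lt "$e" ] && e=$b
    if [ "$e" -gt "$s" ]; then echo $((e - s)); else echo 0; fi
}

bisect() {                              # narrow [lo, hi) to <= 8 sectors
    lo=$1; hi=$2
    while [ $((hi - lo)) -gt 8 ]; do
        mid=$(( (lo + hi) / 2 ))
        if [ "$(check_range "$lo" "$mid")" -gt 0 ]; then
            hi=$mid                     # mismatches in the lower half
        else
            lo=$mid                     # otherwise search the upper half
        fi
    done
    echo "$lo $hi"
}

bisect 0 1024                           # prints the narrowed window
```

Note that a live check_range must wait for each probe's check to finish before reading mismatch_cnt, and the window cannot shrink below whatever granularity md's check works in, so treat the final range as approximate.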

Try to report here what you find.
If the mismatching regions do not correspond to files (that would agree 
with your previous test), somebody expert in ext4 might be able to tell 
what they correspond to.

Regards
J.




^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: 3.12: raid-1 mismatch_cnt question
  2013-11-14 16:09                       ` joystick
@ 2013-11-14 17:22                         ` Justin Piszcz
  2013-11-15  8:51                           ` joystick
  0 siblings, 1 reply; 23+ messages in thread
From: Justin Piszcz @ 2013-11-14 17:22 UTC (permalink / raw)
  To: 'joystick'; +Cc: 'Bernd Schubert', 'linux-raid'



-----Original Message-----
From: joystick [mailto:joystick@shiftmail.org] 
Sent: Thursday, November 14, 2013 11:09 AM
To: Justin Piszcz
Cc: 'Bernd Schubert'; 'linux-raid'
Subject: Re: 3.12: raid-1 mismatch_cnt question

[ .. ]

>> At the end of the procedure (like now, if you didn't resync or repair in 
>> the meanwhile) is mismatch_cnt still so high?
After a reboot, I ran the check and yes it was still high.

[ .. ]

>> no, not that one...
>> it would be helpful to know the kernel version that *creates* 
>> mismatches, the one that you have running normally on the live system.
Version: 3.12.0 (and I typically always use the latest)
>> That's the "bugged" one, supposing this is really a bug (until we find 
>> where the mismatches are, it's difficult to say whether this is a data 
>> loss or not)

>> Maybe the mismatched are located ext4 metadata areas which are not files 
>> and so can't be seen with md5sums... That would still be as much 
>> worrisome, unless some expert of ext4 can tell that it's ok (it can be 
>> OK if the region with mismatches is an old metadata area, currently 
>> unused; the mechanism that can create harmless mismatches in this case 
>> has been described by Neil)

If that is what is occurring, is it possible to exclude them from mismatch_cnt?

[ .. ]

>> - First confirm that mismatch_cnt is still high..
It was 0 after reboot.

[ .. ]


>> - Then if this does not disrupt your system operation too much, i would 
>> suggest to fill 95% of free space with a zeroes file like you did in 
>> earlier tests. Otherwise for a mismatch happening in non-file area we 
>> won't be sure of what kind of area is that. Maybe recompute mismatch_cnt 
>> after this.

Created a file to bring / up to 95% utilization:
/dev/root       219G  205G   12G  95% /

Re-check:
# echo check > /sys/devices/virtual/block/md1/md/sync_action
# cat /sys/devices/virtual/block/md1/md/mismatch_cnt
27520

>> then, copypasting the procedure with some modifications:
>> ----
>> ... to determine the location of mismatches (...)
>> Unfortunately I don't think MD tells you the location of mismatches 
>> directly. Do you want to try the following:
>> /sys/block/md1/md/sync_min and /sys/block/md1/md/sync_max should allow 
>> you to narrow the region of the next check.
>> Set them, then perform check, then cat mismatch_cnt.
>> Narrow progressively sync_min and sync_max so that you identify the most 
>> dense areas of mismatches, or a few single blocks that mismatch.
>> When you have identified some regions or isolated blocks, invoke "sync" 
>> from bash and then check again the same region a couple of times so to 
>> be sure that it stays mismatched and it's not just a transient situation.
>> Then try with debugfs (in readonly mode can be used with fs mounted): 
>> there should be an option to get the inode number from a block number of 
>> the device... I hope that block numbers are not offset by MD... I think 
>> it's icheck and after that you might need "find -inum <inode_number>" 
>> launched on the same filesystem to find the corresponding filename from 
>> the inode number. That should be the file that contains the mismatch.
[ .. ]
When I do this, the speed of check thereafter is very slow:

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
      233381376 blocks [2/2] [UU]
      [>....................]  check =  0.0% (4500/233381376) finish=80387.9min speed=48K/sec (55 days)

The speed continues to decrease when sync_min is set to 1000 and sync_max to 9000 (at this rate it won't work).

A few minutes later:

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
      233381376 blocks [2/2] [UU]
      [>....................]  check =  0.0% (4500/233381376) finish=200485.5min speed=19K/sec

It would be interesting to hear whether anyone else on this list running ext4 sees similar mismatch_cnt results with their SSDs vs. another FS (XFS, etc.).

Justin.




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3.12: raid-1 mismatch_cnt question
  2013-11-14 17:22                         ` Justin Piszcz
@ 2013-11-15  8:51                           ` joystick
  0 siblings, 0 replies; 23+ messages in thread
From: joystick @ 2013-11-15  8:51 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: 'linux-raid'

On 14/11/2013 18:22, Justin Piszcz wrote:
>
> -----Original Message-----
> From: joystick [mailto:joystick@shiftmail.org]
> Sent: Thursday, November 14, 2013 11:09 AM
> To: Justin Piszcz
> Cc: 'Bernd Schubert'; 'linux-raid'
> Subject: Re: 3.12: raid-1 mismatch_cnt question
>
> [ .. ]
>
>>> At the end of the procedure (like now, if you didn't resync or repair in
>>> the meanwhile) is mismatch_cnt still so high?
> After a reboot, I ran the check and yes it was still high.
>
> [ .. ]
>
>>> no, not that one...
>>> it would be helpful to know the kernel version that *creates*
>>> mismatches, the one that you have running normally on the live system.
> Version: 3.12.0 (and typically always use the latest)

ok

> That's the "bugged" one, supposing this is really a bug (until we find
> where the mismatches are, it's difficult to say wether this is a data
> loss or not)
>
>>> Maybe the mismatched are located ext4 metadata areas which are not files
>>> and so can't be seen with md5sums... That would still be as much
>>> worrisome, unless some expert of ext4 can tell that it's ok (it can be
>>> OK if the region with mismatches is an old metadata area, currently
>>> unused; the mechanism that can create harmless mismatches in this case
>>> has been described by Neil)
> If that is what is occurring, is it possible to exclude them from mismatch_cnt?

not possible unfortunately.

But your mismatch_cnt is exceptionally high, which is unlikely to come 
from the described mechanism. Other people usually have zero even after 
months of operation; I, for example, have zero.


> [ .. ]
>
> - First confirm that mismatch_cnt is still high..
> It was 0 after reboot.

Above you wrote that after the procedure you rebooted, then did a check, 
and it was still high. Can you guess when it got repaired?


> [ .. ]
>
>
> - Then if this does not disrupt your system operation too much, i would
> suggest to fill 95% of free space with a zeroes file like you did in
> earlier tests. Otherwise for a mismatch happening in non-file area we
> won't be sure of what kind of area is that. Maybe recompute mismatch_cnt
> after this.
>
> Create file up to 95% utilization on /root:
> /dev/root       219G  205G   12G  95% /
>
> Re-check:
> # echo check > /sys/devices/virtual/block/md1/md/sync_action
> # cat /sys/devices/virtual/block/md1/md/mismatch_cnt
> 27520

?????
You mean that mismatch_cnt was zero, then you created a big file full of 
zeroes, and after that mismatch_cnt jumped to 27520??
I believe this should not happen, especially not via the harmless 
mechanism explained by Neil, and this narrows the bug down quite a lot.
If you confirm I understood correctly, can you retry this a couple 
of times? Delete the zeroes file, repair the RAID so that mismatch_cnt 
goes to zero, run a check to confirm that mismatch_cnt is zero, create a 
file full of zeroes, check again... does mismatch_cnt jump to a high value?

If reproducing the bug is so easy, you might want to try earlier kernels 
such as 3.0.101 and re-test with those.
If earlier kernels do not have the bug, it becomes relatively easy to 
find when it was introduced, maybe without even knowing where the 
mismatches are located.


> then, copypasting the procedure with some modifications:
> ----
> ... to determine the location of mismatches (...)
> Unfortunately I don't think MD tells you the location of mismatches
> directly. Do you want to try the following:
> /sys/block/md1/md/sync_min and /sys/block/md1/md/sync_max should allow
> you to narrow the region of the next check.
> Set them, then perform check, then cat mismatch_cnt.
> Narrow progressively sync_min and sync_max so that you identify the most
> dense areas of mismatches, or a few single blocks that mismatch.
> When you have identified some regions or isolated blocks, invoke "sync"
> from bash and then check again the same region a couple of times so to
> be sure that it stays mismatched and it's not just a transient situation.
> Then try with debugfs (in readonly mode can be used with fs mounted):
> there should be an option to get the inode number from a block number of
> the device... I hope that block numbers are not offset by MD... I think
> it's icheck and after that you might need "find -inum <inode_number>"
> launched on the same filesystem to find the corresponding filename from
> the inode number. That should be the file that contains the mismatch.
> [ .. ]
> When I do this, the speed of check thereafter is very slow:
>
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
>        233381376 blocks [2/2] [UU]
>        [>....................]  check =  0.0% (4500/233381376) finish=80387.9min speed=48K/sec (55 days)
>
> The speed continues to decrease when the sync_min is set to 1000 and sync_max is 9000 (this won't work).

Are you running the "find" simultaneously with the "check"?
Check priority is rather low, so I understand why it would slow down if 
you are also doing "find". Otherwise... it seems like another bug.



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2013-11-15  8:51 UTC | newest]

Thread overview: 23+ messages
2013-11-04 10:25 3.12: raid-1 mismatch_cnt question Justin Piszcz
2013-11-04 10:25 ` Justin Piszcz
2013-11-07 10:54 ` Justin Piszcz
2013-11-12  0:39   ` Brad Campbell
2013-11-12  9:14     ` Justin Piszcz
     [not found] ` <527E8B74.70301@shiftmail.org>
2013-11-09 22:49   ` Justin Piszcz
2013-11-10 12:45     ` joystick
2013-11-11  9:26       ` Justin Piszcz
2013-11-11 11:06         ` joystick
2013-11-11 18:52           ` Justin Piszcz
2013-11-11 21:23             ` John Stoffel
2013-11-11 21:55               ` NeilBrown
2013-11-12  2:49                 ` John Stoffel
2013-11-11 21:58             ` NeilBrown
2013-11-11 22:18               ` Justin Piszcz
2013-11-12  9:30             ` joystick
2013-11-12 10:29               ` Bernd Schubert
2013-11-13 22:10                 ` Justin Piszcz
2013-11-14  8:44                   ` joystick
2013-11-14 10:43                     ` Justin Piszcz
2013-11-14 16:09                       ` joystick
2013-11-14 17:22                         ` Justin Piszcz
2013-11-15  8:51                           ` joystick
