* WARNING: mismatch_cnt is not 0 on <array device>
@ 2016-09-26  2:43 Benjammin2068
  2016-09-26  3:42 ` Brad Campbell
  0 siblings, 1 reply; 26+ messages in thread
From: Benjammin2068 @ 2016-09-26  2:43 UTC (permalink / raw)
  To: Linux-RAID

Hey all,


 So the RAID5 which I upgraded to RAID6 was humming along all week just fine (I did the change last weekend) and this weekend I got this:

WARNING: mismatch_cnt is not 0 on /dev/md127

The array seems happy and clean:


> /dev/md127:
>         Version : 1.2
>   Creation Time : Tue Aug 23 03:06:46 2011
>      Raid Level : raid6
>      Array Size : 2930276352 (2794.53 GiB 3000.60 GB)
>   Used Dev Size : 976758784 (931.51 GiB 1000.20 GB)
>    Raid Devices : 5
>   Total Devices : 6
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Sun Sep 25 21:42:55 2016
>           State : clean
>  Active Devices : 5
> Working Devices : 6
>  Failed Devices : 0
>   Spare Devices : 1
>
>          Layout : left-symmetric
>      Chunk Size : 512K
>
>            Name : :BigRAID
>            UUID : 97b17840:3eaff079:d8e384d0:bfdbda42
>          Events : 905701
>
>     Number   Major   Minor   RaidDevice State
>        6       8       81        0      active sync   /dev/sdf1
>        1       8       33        1      active sync   /dev/sdc1
>        5       8       49        2      active sync   /dev/sdd1
>        4       8       65        3      active sync   /dev/sde1
>        8       8       97        4      active sync   /dev/sdg1
>
>        7       8      113        -      spare   /dev/sdh1

I don't think I've ever gotten a message about mismatch_cnt before.

 -Ben



* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-26  2:43 WARNING: mismatch_cnt is not 0 on <array device> Benjammin2068
@ 2016-09-26  3:42 ` Brad Campbell
  2016-09-26  7:19   ` Benjammin2068
  2016-09-26 19:47   ` Benjammin2068
  0 siblings, 2 replies; 26+ messages in thread
From: Brad Campbell @ 2016-09-26  3:42 UTC (permalink / raw)
  To: Benjammin2068, Linux-RAID

On 26/09/16 10:43, Benjammin2068 wrote:
> Hey all,
>
>
>  So the RAID5 which I upgraded to RAID6 was humming along all week just fine (I did the change last weekend) and this weekend I got this:
>
> WARNING: mismatch_cnt is not 0 on /dev/md127
>
> The array seems happy and clean:
>

You best find out what it is to start with. Example from mine :

brad@srv:~$ cat /sys/block/md2/md/mismatch_cnt
8171392

This is a bad example because it's a RAID10 of 6 SSDs. 3 support zero 
after TRIM and the other 3 don't, so after a TRIM of the filesystem the 
mismatch count is through the roof.
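
For reference, whether a given drive claims deterministic zeroes after
TRIM is visible in its identify data -- something like this shows it
(device name is just an example, and the exact wording varies a bit):

hdparm -I /dev/sda | grep -i trim
#   typically prints lines such as:
#      *   Data Set Management TRIM supported (limit 8 blocks)
#      *   Deterministic read ZEROs after TRIM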

Unless you are swapping to an array or you have some known issue like 
the one I mention above, a mismatch count of non-zero is not good.

I lost most of a RAID6 due to a faulty SIL SATA controller and it was 
the high mismatch counts that alerted me. Unfortunately I was about 6 
months in and had only checked for the first time. It was silently 
corrupting reads under load, and so the read-modify-write cycles were 
quietly corrupting the array.

Check the mismatch count, run a "check" on the array, and check it
again. If they vary wildly, something odd is going on. If they are the
same, then you might want to figure out what caused it.
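
From the shell that's roughly (using your /dev/md127 -- your distro may
also ship a raid-check cron script that does the same thing):

cat /sys/block/md127/md/mismatch_cnt           # note the current value
echo check > /sys/block/md127/md/sync_action   # start a read-only check
cat /proc/mdstat                               # progress shows up here
cat /sys/block/md127/md/mismatch_cnt           # read it again once the check finishes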

Regards,
Brad.


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-26  3:42 ` Brad Campbell
@ 2016-09-26  7:19   ` Benjammin2068
  2016-09-26  7:40     ` Brad Campbell
  2016-09-26 19:47   ` Benjammin2068
  1 sibling, 1 reply; 26+ messages in thread
From: Benjammin2068 @ 2016-09-26  7:19 UTC (permalink / raw)
  To: Linux-RAID



On 09/25/2016 10:42 PM, Brad Campbell wrote:
>
> You best find out what it is to start with. Example from mine :
>
> brad@srv:~$ cat /sys/block/md2/md/mismatch_cnt
> 8171392
>
> This is a bad example because it's a RAID10 of 6 SSDs. 3 support zero after TRIM and the other 3 don't, so after a TRIM of the filesystem the mismatch count is through the roof.
>
> Unless you are swapping to an array or you have some known issue like the one I mention above, a mismatch count of non-zero is not good.
>
> I lost most of a RAID6 due to a faulty SIL SATA controller and it was the high mismatch counts that alerted me. Unfortunately I was about 6 months in and had only checked for the first time. It was silently corrupting reads under load, and so the read-modify-write cycles were quietly corrupting the array.
>
> Check the mismatch count, run a "check" on the array and check it again. If they vary wildly something odd is going on. If they are the same then you might want to figure out what might have caused it.

Mine is 8.

Besides moving from RAID5 to RAID6, I also recently installed a new HD controller (Marvell 88SE9485) from SuperMicro (onto a SuperMicro motherboard).

Previously, the RAID sat entirely on 4 SATA ports on the motherboard (also SuperMicro) -- it's only with this move to RAID6 that I needed the extra ports, so this is the card I got.

I'll go run a check (and repair)...

Will report back when it's all done.

 -Ben


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-26  7:19   ` Benjammin2068
@ 2016-09-26  7:40     ` Brad Campbell
  0 siblings, 0 replies; 26+ messages in thread
From: Brad Campbell @ 2016-09-26  7:40 UTC (permalink / raw)
  To: Benjammin2068, Linux-RAID

On 26/09/16 15:19, Benjammin2068 wrote:
>
>
>
> I'll go run a check (and repair)...
>

Don't run a repair until you've got it sussed. Check is read-only, 
repair isn't.
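
In sysfs terms (a sketch, again assuming md127):

echo check  > /sys/block/md127/md/sync_action   # read-only scrub: only counts mismatches
echo repair > /sys/block/md127/md/sync_action   # rewrites parity on mismatched stripes -- not yet
echo idle   > /sys/block/md127/md/sync_action   # aborts a running check/repair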

Brad


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-26  3:42 ` Brad Campbell
  2016-09-26  7:19   ` Benjammin2068
@ 2016-09-26 19:47   ` Benjammin2068
  2016-09-26 21:15     ` Phil Turmel
  1 sibling, 1 reply; 26+ messages in thread
From: Benjammin2068 @ 2016-09-26 19:47 UTC (permalink / raw)
  To: Linux-RAID

Well that instills fear and doubt...

the mismatch_cnt was 8.

I did a repair and then a check and now it's 10704....

:(

 -Ben



* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-26 19:47   ` Benjammin2068
@ 2016-09-26 21:15     ` Phil Turmel
  2016-09-27  1:08       ` Benjammin2068
  2016-09-27  6:42       ` Benjammin2068
  0 siblings, 2 replies; 26+ messages in thread
From: Phil Turmel @ 2016-09-26 21:15 UTC (permalink / raw)
  To: Benjammin2068, Linux-RAID

On 09/26/2016 03:47 PM, Benjammin2068 wrote:
> Well that instills fear and doubt...
> 
> the mismatch_cnt was 8.
> 
> I did a repair and then a check and now it's 10704....
> 
> :(

Danger Will Robinson!

Seriously.  You very likely have a hardware problem corrupting your
data.  Do you have ECC RAM, and if not, when was the last time you did
an exhaustive memtest?

Recheck all of your data cables and if using an add-on controller, check
for a secure install in the PCIe slot.
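
If taking the box down for a full memtest86+ run is awkward, a
userspace pass with memtester is at least a quick smoke test in the
meantime -- it can't touch RAM the kernel is already using, and the
size and loop count below are only examples:

memtester 4G 3    # lock and test ~4 GiB of RAM, 3 passes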

Phil



* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-26 21:15     ` Phil Turmel
@ 2016-09-27  1:08       ` Benjammin2068
  2016-09-27  9:16         ` Brad Campbell
  2016-09-27  6:42       ` Benjammin2068
  1 sibling, 1 reply; 26+ messages in thread
From: Benjammin2068 @ 2016-09-27  1:08 UTC (permalink / raw)
  To: Phil Turmel, Linux-RAID



On 09/26/2016 04:15 PM, Phil Turmel wrote:
> On 09/26/2016 03:47 PM, Benjammin2068 wrote:
>> Well that instills fear and doubt...
>>
>> the mismatch_cnt was 8.
>>
>> I did a repair and then a check and now it's 10704....
>>
>> :(
> Danger Will Robinson!
>
> Seriously.  You very likely have a hardware problem corrupting your
> data.  Do you have ECC RAM, and if not, when was the last time you did
> an exhaustive memtest?

ECC RAM: Yes.

MEMtest - not for a while. Will do. Have to take my server down for that.

This problem has only popped up since I put in this new controller *AND* expanded the RAID onto a drive on that controller: I went from RAID5 with 4 members (all on motherboard SATA ports) to RAID6 with 5 members, which includes 2 drives on the new controller, one of which is the hot spare.

(And the controller is a Marvell 88SE9485 -- anyone know of any problems with this controller? It's an x8 controller living in an x8 slot.)

> Recheck all of your data cables and if using an add-on controller, check
> for a secure install in the PCIe slot.
>

Will do.

Is there a way to verify whether that 5th drive is "the problem drive"?

Also, I just did a "repair" and the mismatch is now back to 8... which seems like a suspicious number, considering the new drive (a WD10 series with 4096-byte sectors) also carries a slightly larger filesystem than the Samsung HD103SJ (and Seagate equivalents) in the array.

And I just found this:

https://www.thomas-krenn.com/en/wiki/Mdadm_checkarray

Which says, "It could simply be that the system does not care what is stored on that part of the array - it is unused space."

I have WD10 drives that have this "extra space" thing going on because of their 4096-byte sector size. (See previous posts about that to this list.)

 -Ben



* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-26 21:15     ` Phil Turmel
  2016-09-27  1:08       ` Benjammin2068
@ 2016-09-27  6:42       ` Benjammin2068
  1 sibling, 0 replies; 26+ messages in thread
From: Benjammin2068 @ 2016-09-27  6:42 UTC (permalink / raw)
  To: Linux-RAID


Well, I maybe found the problem.

1: I had an unhappy fan in the system -- and it's the one that cools the card slot area...

2: I think the controller card was getting warm anyway -- so I also changed the venting and put in a temp sensor (a CrystalFontz SCAB, don'tcha know, with Dallas DS1820 1-Wire thermometers)... so I'm gonna keep an eye on it now.

3: I set up Munin to keep track of mismatch_cnt on all volumes. I didn't even know it existed.. :( Wish I had history; I will from now on.

I'll let ya'all know.

As usual... awesome help -- thanks!

 -Ben



* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-27  1:08       ` Benjammin2068
@ 2016-09-27  9:16         ` Brad Campbell
  2016-09-27 16:27           ` Benjammin2068
  0 siblings, 1 reply; 26+ messages in thread
From: Brad Campbell @ 2016-09-27  9:16 UTC (permalink / raw)
  To: Benjammin2068, Phil Turmel, Linux-RAID

On 27/09/16 09:08, Benjammin2068 wrote:
>

> Also, I just did a "repair" and the mismatch is now back to 8... which seems like a suspicious number considering the filesystem on this new drive (because it's a WD10 series with 4096byte sectors) has a slightly larger FS than the Samsung HD103SJ (and Seagate equivalents) in the array too.

See, that is a bad thing to do if you even remotely suspect you have a 
problem. All a "repair" does is check the parity on each stripe and, if 
there is a mismatch, rewrite the parity from the data blocks. You are 
writing to an array that apparently has issues.

I'd be checking the filesystem and file contents very carefully for 
corruption, and running several sequential check actions to keep an eye 
on the mismatch count.




* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-27  9:16         ` Brad Campbell
@ 2016-09-27 16:27           ` Benjammin2068
  2016-09-27 16:36             ` Roman Mamedov
  2016-09-27 16:45             ` Wols Lists
  0 siblings, 2 replies; 26+ messages in thread
From: Benjammin2068 @ 2016-09-27 16:27 UTC (permalink / raw)
  To: Linux-RAID

On 09/27/2016 04:16 AM, Brad Campbell wrote:
> On 27/09/16 09:08, Benjammin2068 wrote:
>>
>
>> Also, I just did a "repair" and the mismatch is now back to 8... which seems like a suspicious number considering the filesystem on this new drive (because it's a WD10 series with 4096byte sectors) has a slightly larger FS than the Samsung HD103SJ (and Seagate equivalents) in the array too.
>
> See that is a bad thing to do if you even remotely suspect you have a problem. All a "repair" does is check the parity on a stripe and if there is a mismatch it re-writes it. You are writing to an array that apparently has issues.
>
> I'd be checking the filesystem and file contents very carefully for corruption, and running several sequential check actions to keep an eye on the mismatch count.
>

Yep.

Once I reconfigured the hardware and checked the cables, the count at boot is now 0 (which makes sense at boot -- but is still creepy). I put a monitor into Munin which I'll be watching closely for when it changes.

BUT... I think I did find the problem. The card was running hot due to poor airflow. That's been remedied (I hope) -- the temp sensor on the heatsink for the PCIe controller now sits around 45°C, which is fine. Before it was >= 60°C. :O

Thanks again everyone,

 -Ben

p.s. The Linux RAID Wiki doesn't cover mismatch_cnt at all.... would be kinda nice considering how critical (or not) this is... and what to do about it.





* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-27 16:27           ` Benjammin2068
@ 2016-09-27 16:36             ` Roman Mamedov
  2016-09-27 17:24               ` Benjammin2068
  2016-09-27 16:45             ` Wols Lists
  1 sibling, 1 reply; 26+ messages in thread
From: Roman Mamedov @ 2016-09-27 16:36 UTC (permalink / raw)
  To: Benjammin2068; +Cc: Linux-RAID


On Tue, 27 Sep 2016 11:27:13 -0500
Benjammin2068 <benjammin2068@gmail.com> wrote:

> I think I did find the problem. The card was running hot due to airflow.
> That's been remedied (I hope) -- the temp sensor on the heat-sink for the
> PCIe controller now sits around 45'C which is fine. Before it was >=
> 60'C . :O

I wouldn't trust such a controller anyway. A 15-degree difference and it
(allegedly) gives you silent data corruption? What if you have a particularly
hot day, and/or the AC is out for a few hours?
There are a lot of better failure modes than this (honestly reported read or
CRC errors for a start, or heck, even a complete lock-up of the controller
would be preferable).

-- 
With respect,
Roman



* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-27 16:27           ` Benjammin2068
  2016-09-27 16:36             ` Roman Mamedov
@ 2016-09-27 16:45             ` Wols Lists
  2016-09-27 17:30               ` Benjammin2068
  1 sibling, 1 reply; 26+ messages in thread
From: Wols Lists @ 2016-09-27 16:45 UTC (permalink / raw)
  To: Benjammin2068, Linux-RAID

On 27/09/16 17:27, Benjammin2068 wrote:
> p.s. The Linux RAID Wiki doesn't cover mismatch_cnt at all.... would be kinda nice considering how critical (or not) this is... and what to do about it.

I'm thinking about all this. The second section is all about recovering
a failing/failed array, and is new. The first section is the original,
which is being updated. It just feels totally wrong to me now, as it's
becoming a jumbled mess of old and new.

What I'm probably going to do, is create a new first section about
setting up a raid system. That means that a section on monitoring will
actually make sense and fit between setting it up, and fixing problems.

(And all the old stuff will end up in the "software archaeology"
section, so people who are still running ancient systems can find it :-)

Cheers,
Wol


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-27 16:36             ` Roman Mamedov
@ 2016-09-27 17:24               ` Benjammin2068
  0 siblings, 0 replies; 26+ messages in thread
From: Benjammin2068 @ 2016-09-27 17:24 UTC (permalink / raw)
  To: Linux-RAID

On 09/27/2016 11:36 AM, Roman Mamedov wrote:
> On Tue, 27 Sep 2016 11:27:13 -0500
> Benjammin2068 <benjammin2068@gmail.com> wrote:
>
>> I think I did find the problem. The card was running hot due to airflow.
>> That's been remedied (I hope) -- the temp sensor on the heat-sink for the
>> PCIe controller now sits around 45'C which is fine. Before it was >=
>> 60'C . :O
> I wouldn't trust such controller anyway. 15 degrees difference and it
> (allegedly) gives you silent data corruption? What if you have a particularly
> hot day, and/or the AC is out for a few hours.
> There is a lot of better failure modes than this (honestly reported read or
> CRC errors for a start, or heck, even complete lock-up of the controller would
> be more preferrable).
>

I think it was running way hotter than that. I could only get to it to measure within a certain window, and by then it had already cooled off (that's what heatsinks do).

By the time I got a temp sensor on it, it was too late.

  -Ben


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-27 16:45             ` Wols Lists
@ 2016-09-27 17:30               ` Benjammin2068
  2016-09-27 19:35                 ` Wols Lists
  2016-09-27 23:09                 ` Adam Goryachev
  0 siblings, 2 replies; 26+ messages in thread
From: Benjammin2068 @ 2016-09-27 17:30 UTC (permalink / raw)
  To: Linux-RAID

On 09/27/2016 11:45 AM, Wols Lists wrote:
>
> I'm thinking about all this. The second section is all about recovering
> a failing/ed array, and is new. The first section is the original,
> that's being updated. It just feels totally wrong to me now, as it's
> becoming a jumbled mess of old and new.
>
> What I'm probably going to do, is create a new first section about
> setting up a raid system. That means that a section on monitoring will
> actually make sense and fit between setting it up, and fixing problems.
>
> (And all the old stuff will end up in the "software archaeology"
> section, so people who are still running ancient systems can find it :-)
>

That would be awesome.

 There was a shell script out there already for Munin, but I modified it a little to add thresholds that throw up flags. I might change it some more to handle different thresholds for different devices, or the ability to monitor only the RAIDs that matter.
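
The guts of it are tiny. A stripped-down sketch of that kind of plugin (not my exact script -- field names and the warning threshold are just illustrative) looks roughly like this:

#!/bin/sh
# Munin plugin sketch: graph mismatch_cnt for every md array,
# warning as soon as any count is non-zero.
if [ "$1" = "config" ]; then
    echo "graph_title md mismatch_cnt"
    echo "graph_category disk"
    echo "graph_vlabel mismatched sectors"
    for f in /sys/block/md*/md/mismatch_cnt; do
        md=$(echo "$f" | cut -d/ -f4)     # e.g. md127
        echo "$md.label $md"
        echo "$md.warning 0:0"            # warn when the value leaves the 0..0 range
    done
    exit 0
fi
for f in /sys/block/md*/md/mismatch_cnt; do
    md=$(echo "$f" | cut -d/ -f4)
    echo "$md.value $(cat "$f")"
done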

I have smartctl running for all my drives -- but that doesn't help me at the mdadm level.

While you're in the docs adding stuff about mismatch_cnt, is there anything that can help someone backtrace which block caused the count to go up? This would help us mere mortals go back and inspect a block or a file or something to make sure it's not corrupted.

 -Ben


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-27 17:30               ` Benjammin2068
@ 2016-09-27 19:35                 ` Wols Lists
  2016-09-27 23:09                 ` Adam Goryachev
  1 sibling, 0 replies; 26+ messages in thread
From: Wols Lists @ 2016-09-27 19:35 UTC (permalink / raw)
  To: Benjammin2068, Linux-RAID

On 27/09/16 18:30, Benjammin2068 wrote:
> I have smartctl running for all my drives -- but that doesn't help me at the mdadm level.
> 
> While you're in the docs adding stuff about mismatch_cnt, is there anything that can help someone backtrace which block cause the count to go up? This would help us mere mortals maybe go back to inspect a block or a file or something to make sure it's not corrupted.

I'm planning to go back through the archives as best I can (I keep a
local archive :-), and I'm starring everything of interest. So if
anybody can chime in, I'll make a note! :-)

All this should eventually get there ... and I'm planning to add a
kernel programming section too - not least because I want to do some!

Cheers,
Wol


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-27 17:30               ` Benjammin2068
  2016-09-27 19:35                 ` Wols Lists
@ 2016-09-27 23:09                 ` Adam Goryachev
  2016-09-28 15:55                   ` Benjammin2068
       [not found]                   ` <0f6bd6f6-20ee-1720-23fc-27d206063bfc@gmail.com>
  1 sibling, 2 replies; 26+ messages in thread
From: Adam Goryachev @ 2016-09-27 23:09 UTC (permalink / raw)
  To: Benjammin2068, Linux-RAID

On 28/09/16 03:30, Benjammin2068 wrote:
> On 09/27/2016 11:45 AM, Wols Lists wrote:
>> I'm thinking about all this. The second section is all about recovering
>> a failing/ed array, and is new. The first section is the original,
>> that's being updated. It just feels totally wrong to me now, as it's
>> becoming a jumbled mess of old and new.
>>
>> What I'm probably going to do, is create a new first section about
>> setting up a raid system. That means that a section on monitoring will
>> actually make sense and fit between setting it up, and fixing problems.
>>
>> (And all the old stuff will end up in the "software archaeology"
>> section, so people who are still running ancient systems can find it :-)
>>
> That would be awesome.
>
>   There was a shell script out there already for MUNIN, but I modified it a little to add thresholds that throw up flags. I might change some more to handle different thresholds for different devices or the ability to monitor only RAIDs that matter.
>
> I have smartctl running for all my drives -- but that doesn't help me at the mdadm level.
>
> While you're in the docs adding stuff about mismatch_cnt, is there anything that can help someone backtrace which block cause the count to go up? This would help us mere mortals maybe go back to inspect a block or a file or something to make sure it's not corrupted.
>
Just out of interest, but I'm not sure how useful your munin monitoring 
will be... AFAIK, the mismatch_cnt value is only updated when you run a 
check, which would probably take some number of hours to complete. I 
would guess that you are unlikely to run more than one check a week or 
month.... and as soon as there is any change (unless you know the 
explanation) then you should be looking to resolve that.

Unless of course I'm wrong about when the count is updated?

Regards,
Adam
-- 
Adam Goryachev Website Managers www.websitemanagers.com.au


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-27 23:09                 ` Adam Goryachev
@ 2016-09-28 15:55                   ` Benjammin2068
  2016-10-02 17:33                     ` Benjammin2068
       [not found]                   ` <0f6bd6f6-20ee-1720-23fc-27d206063bfc@gmail.com>
  1 sibling, 1 reply; 26+ messages in thread
From: Benjammin2068 @ 2016-09-28 15:55 UTC (permalink / raw)
  To: Adam Goryachev, Linux-RAID

On 09/27/2016 06:09 PM, Adam Goryachev wrote:
> Just out of interest, but I'm not sure how useful your munin monitoring will be... AFAIK, the mismatch_cnt value is only updated when you run a check, which would probably take some number of hours to complete. I would guess that you are unlikely to run more than one check a week or month.... and as soon as there is any change (unless you know the explanation) then you should be looking to resolve that.
>
> Unless of course I'm wrong about when the count is updated?

You would be correct -- only during checks -- but it's an easy way for me to keep track of it.

And a check takes about 3 hrs on this array (and it happens every weekend).
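
(The weekend run is just the distro's cron job poking sync_action;
stripped of its wrapping it amounts to something like this -- the
schedule and array name here are only illustrative:)

# /etc/cron.d/raid-check (illustrative sketch)
30 1 * * Sun   root   echo check > /sys/block/md127/md/sync_action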

 -Ben


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-09-28 15:55                   ` Benjammin2068
@ 2016-10-02 17:33                     ` Benjammin2068
  0 siblings, 0 replies; 26+ messages in thread
From: Benjammin2068 @ 2016-10-02 17:33 UTC (permalink / raw)
  To: Linux-RAID


On 09/28/2016 10:55 AM, Benjammin2068 wrote:
> On 09/27/2016 06:09 PM, Adam Goryachev wrote:
>> Just out of interest, but I'm not sure how useful your munin monitoring will be... AFAIK, the mismatch_cnt value is only updated when you run a check, which would probably take some number of hours to complete. I would guess that you are unlikely to run more than one check a week or month.... and as soon as there is any change (unless you know the explanation) then you should be looking to resolve that.
>>
>> Unless of course I'm wrong about when the count is updated?
> You would be correct - only during checks - but it's an easy way for me to just keep track of it..
>
> And a check takes about 3hrs on this array. (and it happens every weekend)
>
>

Ok... so as an update....

I ran 2 checks during the week after fixing the thermal issues I was seeing that I think were part of the problem. Both checks resulted in the mismatch_cnt remaining at 0.

The usual cron-automated check ran this weekend -- and bumped the count to 24 on my RAID6. (There's a RAID1 swap that went to 160, but I've read that for something like swap this can be ignored. I just thought I'd mention it since it feels a bit like Schrodinger's drive array.)

So... I've emailed the Vendor for a replacement since this SAS/SATA add-on card is new. And maybe it's just flaky.

In the meantime, is it possible to shrink the array back to RAID5 (4 members) so I can run it on the MB's controller only -- where I never seemed to have a mismatch-count problem? (I kinda need the array to be up and running since I need to work. :P)

Also -- for academic reasons, is it possible to get a list of the blocks that cause the count to increase, to try to confirm it's the new add-on card that's the problem?
If there's a way to list which blocks and drive were the issue, then I could go check those files....
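
For what it's worth, the only tool I'm aware of that goes in this direction is raid6check, which ships in the mdadm source tree but isn't always packaged. I haven't run it myself, so treat this as a pointer rather than a recipe: it's supposed to scan a RAID6 stripe by stripe and report which stripes are inconsistent and which member looks like the odd one out. Invocation is roughly:

raid6check /dev/md127 0 1000   # device, starting stripe, number of stripes
                               # (syntax from memory -- check the tool's own usage output)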

Otherwise, mdadm reports the array as clean and happy, and fsck reports the partition as clean and happy.

How disconcerting is that?

Thanks,

 -Ben



* Re: WARNING: mismatch_cnt is not 0 on <array device>
       [not found]                   ` <0f6bd6f6-20ee-1720-23fc-27d206063bfc@gmail.com>
@ 2016-11-08 19:52                     ` Benjammin2068
  2016-11-08 19:53                     ` Benjammin2068
  1 sibling, 0 replies; 26+ messages in thread
From: Benjammin2068 @ 2016-11-08 19:52 UTC (permalink / raw)
  To: Linux-RAID

Hey all,

 I'm still trying to work through this..

I've replaced the cables on the new AOC-SAS2LP-MV8 and the card itself (Supermicro sent replacements), but I still occasionally get the mismatch_cnt warning. (Like, a week will go by and the weekend raid-check will turn up nothing, and then it shows up again.)

So -- my question now (as recommended by someone else) is how to see if it's a drive issue somehow. (Not that it would be a "real" drive issue, but a card firmware issue that's aggravated by a drive's firmware somehow.)

I've looked through the logs -- but how do I trace down a mismatch_cnt? I don't see anything in dmesg or messages....

There are a couple 3Gb/s drives attached to this card.. maybe that's it? Who knows? But I don't have debug info to help me chase it down now that I'm trying to work it out with Supermicro tech support.

Where do I look for more info about source or the event of the mismatch?

Thanks,

 -Ben



* Re: WARNING: mismatch_cnt is not 0 on <array device>
       [not found]                   ` <0f6bd6f6-20ee-1720-23fc-27d206063bfc@gmail.com>
  2016-11-08 19:52                     ` Benjammin2068
@ 2016-11-08 19:53                     ` Benjammin2068
  2016-11-08 20:38                       ` Phil Turmel
  1 sibling, 1 reply; 26+ messages in thread
From: Benjammin2068 @ 2016-11-08 19:53 UTC (permalink / raw)
  To: Linux-RAID

On 11/08/2016 12:47 PM, Benjammin2068 wrote:
> Hey all,
>
>  I'm still trying to work through this..
>
> I've replaced the cables on the new AOC-SAS2LP-MV8 and the card itself (supermicro sent replacements)
>
>
> but I still occasionally have the mismatch_cnt error.. (like a week will go by and a weekend raid-check will happen with  nothing.)
>
> So -- my question now (as recommended by someone else) is how to see if it's a drive issue somehow. (not that it's a "real" drive issue, but a card firmware issue that's aggravated by a drive's firmware somehow.)
>
> I've looked through the logs -- but how do I trace down a mismatch_cnt? I don't see anything in dmesg or messages....
>
> There are a couple 3Gb/s drives attached to this card.. maybe that's it? Who knows? But I don't have debug info to help me chase it down now that I'm trying to work it out with Supermicro tech support.
>
> Where do I look for more info about source or the event of the mismatch?
>
>

Now that I think about it -- and have been talking out loud to myself (I don't think I'm crazy)...

A parallel to all this is:

I don't think the mismatch_cnt started showing up until I moved from RAID5 to RAID6.

:O

How painful is it to switch back to RAID5 to test that theory?

 -Ben


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-11-08 19:53                     ` Benjammin2068
@ 2016-11-08 20:38                       ` Phil Turmel
  2016-11-08 21:01                         ` Wols Lists
                                           ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Phil Turmel @ 2016-11-08 20:38 UTC (permalink / raw)
  To: Benjammin2068, Linux-RAID

On 11/08/2016 02:53 PM, Benjammin2068 wrote:
> On 11/08/2016 12:47 PM, Benjammin2068 wrote:

> Now that I think about it -- and have been talking out loud to myself (I don't think I'm crazy)...
> 
> A parallel to all this is:
> 
> I don't think the mismatch_cnt started showing up until I moved from RAID5 to RAID6.
> 
> :O
> 
> How painful is it to switch back to RAID5 to test that theory?

Don't.  Sounds like raid6's stricter calculations are catching a real
problem.  Do you have ECC RAM?  If so, are you getting any machine check
exceptions?  If not, have you done a thorough memtest any time in the
recent past?

If it's not memory, can you exercise the controller channels heavily to
see if they drop from errors?

Have you added up the peak current draws of your drives to make sure
your power supply keeps up when all drives are writing simultaneously
(common with parity raid)?

One more: do you have swap on top of md raid?

Phil


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-11-08 20:38                       ` Phil Turmel
@ 2016-11-08 21:01                         ` Wols Lists
  2016-11-09 19:00                           ` Benjammin2068
  2017-02-28 19:50                           ` WARNING: mismatch_cnt is not 0 on <array device> [SOLVED?] Benjammin2068
  2016-11-09 19:00                         ` WARNING: mismatch_cnt is not 0 on <array device> Benjammin2068
  2016-11-09 19:52                         ` Benjammin2068
  2 siblings, 2 replies; 26+ messages in thread
From: Wols Lists @ 2016-11-08 21:01 UTC (permalink / raw)
  To: Phil Turmel, Benjammin2068, Linux-RAID

On 08/11/16 20:38, Phil Turmel wrote:
> Have you added up the peak current draws of your drives to make sure
> your power supply keeps up when all drives are writing simultaneously
> (common with parity raid)?

On that point, be aware that many power supplies quote the sum of the
power to all rails. It could well be that the supply is nominally plenty
powerful enough, but the load on an individual rail is too high.

Cheers,
Wol


* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-11-08 20:38                       ` Phil Turmel
  2016-11-08 21:01                         ` Wols Lists
@ 2016-11-09 19:00                         ` Benjammin2068
  2016-11-09 19:52                         ` Benjammin2068
  2 siblings, 0 replies; 26+ messages in thread
From: Benjammin2068 @ 2016-11-09 19:00 UTC (permalink / raw)
  To: Linux-RAID

On 11/08/2016 02:38 PM, Phil Turmel wrote:
> On 11/08/2016 02:53 PM, Benjammin2068 wrote:
>> On 11/08/2016 12:47 PM, Benjammin2068 wrote:
>> Now that I think about it -- and have been talking out loud to myself (I don't think I'm crazy)...
>>
>> A parallel to all this is:
>>
>> I don't think the mismatch_cnt started showing up until I moved from RAID5 to RAID6.
>>
>> :O
>>
>> How painful is it to switch back to RAID5 to test that theory?
> Don't.  Sounds like raid6's stricter calculations are catching a real problem.

Ok -- no switching back to RAID5.

> Do you have ECC RAM?

Yes.
>
> If so, are you getting any machine check exceptions?

not getting any machine check problems (I looked)

> If not, have you done a thorough memtest any time in the recent past?

Yes. When I started getting the mismatch counts, I took the system down and ran memtest on it for a couple of passes.

No problems.

> If it's not memory, can you exercise the controller channels heavily to
> see if they drop from errors?

I could but haven't -- any recommendations on tools out there?
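
The crude approach I can think of is to stream parallel reads off the
members behind the new card and watch dmesg/SMART for link resets while
it runs -- something like this, substituting whichever members actually
sit on that controller:

for d in /dev/sdf /dev/sdg /dev/sdh; do           # placeholders -- use the drives behind the new card
    dd if=$d of=/dev/null bs=1M iflag=direct &    # O_DIRECT so the page cache doesn't mask re-reads
done
wait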

Also, I've wondered if the raid-check that happens on Sunday isn't actually part of that kind of problem.

i.e. if I didn't do the weekly check, the drives wouldn't get slammed anywhere near as much the rest of the week.

Does mismatch_cnt only change value during a check -- or does it happen with each operation?

> Have you added up the peak current draws of your drives to make sure
> your power supply keeps up when all drives are writing simultaneously
> (common with parity raid)?

Not exactly, but I can do that. The system has a 650W supply -- I'll go do a power check and work that against the known drives in the system.

This is a "server chassis", though, which came with the 8 slots in the front to power drives -- so it's not exactly a "home chassis" that I put a 300W supply in and then jammed full of drives.

Still -- that's a reasonable question and I'll investigate.

> One more: do you have swap on top of md raid?

No. I've read about swap on RAID1 causing mismatch counts.

However, I am running a VM on this RAID volume (VirtualBox and a reasonably sleepy instance of Win7_64) and have pondered that.


 -Ben




* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-11-08 21:01                         ` Wols Lists
@ 2016-11-09 19:00                           ` Benjammin2068
  2017-02-28 19:50                           ` WARNING: mismatch_cnt is not 0 on <array device> [SOLVED?] Benjammin2068
  1 sibling, 0 replies; 26+ messages in thread
From: Benjammin2068 @ 2016-11-09 19:00 UTC (permalink / raw)
  To: Linux-RAID

On 11/08/2016 03:01 PM, Wols Lists wrote:
> On 08/11/16 20:38, Phil Turmel wrote:
>> Have you added up the peak current draws of your drives to make sure
>> your power supply keeps up when all drives are writing simultaneously
>> (common with parity raid)?
> On that point, be aware that many power supplies quote the sum of the
> power to all rails. It could well be that the supply is nominally plenty
> powerful enough, but the load on an individual rail is too high.
>
>

Right.. I'll check that when I do the math on the supplies.

  -Ben



* Re: WARNING: mismatch_cnt is not 0 on <array device>
  2016-11-08 20:38                       ` Phil Turmel
  2016-11-08 21:01                         ` Wols Lists
  2016-11-09 19:00                         ` WARNING: mismatch_cnt is not 0 on <array device> Benjammin2068
@ 2016-11-09 19:52                         ` Benjammin2068
  2 siblings, 0 replies; 26+ messages in thread
From: Benjammin2068 @ 2016-11-09 19:52 UTC (permalink / raw)
  To: Linux-RAID

On 11/08/2016 02:38 PM, Phil Turmel wrote:
>
> Have you added up the peak current draws of your drives to make sure
> your power supply keeps up when all drives are writing simultaneously
> (common with parity raid)?

Keeping in mind this is a pretty empty box that's pretty sleepy until I kick off a compile doing FPGA/SoC development
(i.e. no power-hungry graphics cards -- it's a home file server more than a desktop).

Here are the power numbers -- all of the drives are listed at peak (as in start-up) draw, where the average was much less, but using those hypothetically as the worst case:

they still don't touch the supply's rails, even after accounting for the fans (which spin at very low PWM duty cycles). (Not sure this table will survive email formatting -- let me know if it doesn't.)

Power supply:

  Model        Wattage   +5V   +12V   +3.3V   5V StdBy
  PWS-652-2H   650W      30A   54A    20A     4A

Drives (peak draw; +3.3V and 5V standby are N/A for all drives):

  Slot      Model          Watts    +5V (A)   +12V (A)
  Slot 1    WD2500AAJS     21.24       -        1.77
  Slot 2    WD10EZEX       30          -        2.5
  Slot 3    ST1000DM005    24          -        2
  Slot 4    HD103SJ        36.4        2        2.2
  Slot 5    WD10EFRX-68F   14.4        -        1.2
  Slot 6    HD103SJ        36.4        2        2.2
  Slot 7    WD10EFRX-68F   14.4        -        1.2
  Slot 8    WD10EFRX-68F   14.4        -        1.2
  Slot 9    Empty            -         -         -
  Slot 10   Empty            -         -         -
  Slot 11   WD10JFCX-68N    5          1        0
  Slot 12   WD10JFCX-68N    5          1        0

  Totals                  201.24       6       14.27



* Re: WARNING: mismatch_cnt is not 0 on <array device> [SOLVED?]
  2016-11-08 21:01                         ` Wols Lists
  2016-11-09 19:00                           ` Benjammin2068
@ 2017-02-28 19:50                           ` Benjammin2068
  1 sibling, 0 replies; 26+ messages in thread
From: Benjammin2068 @ 2017-02-28 19:50 UTC (permalink / raw)
  To: Linux-RAID

As an interesting update for everyone on this thread....

I replaced all my drives with a single type/speed (WD Red), and now that the older Samsung HD103SJ (now Seagate) drives are gone, I haven't had any mismatch_cnt hiccups with that Marvell JBOD MV8 controller.

Just thought I'd offer that as a current data point.

Cheers,

 -Ben

