* Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
@ 2013-01-27 19:26 Wolfgang Denk
  2013-01-27 19:45 ` Chris Murphy
                   ` (4 more replies)
  0 siblings, 5 replies; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-27 19:26 UTC (permalink / raw)
  To: linux-raid

I have seen "mismatch_cnt is not 0" warnings in the past, but that has
always been with RAID 1 arrays, and with relatively small numbers on
/sys/block/md*/md/mismatch_cnt; my understanding was that this was not
actually critical.

However, after updating to Fedora 18, I get this message from all
updated
systems that have RAID 6 arrays, and with _huge_ numbers of
mismatch_cnt, like that:

# mdadm -q --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 243269632 (232.00 GiB 249.11 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Sun Jan 27 02:27:28 2013
          State : clean 
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 16K

           Name : XXX:0  (local to host XXX)
           UUID : da015f96:138b37bf:d5ef71dc:8970ab15
         Events : 8

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       81        2      active sync   /dev/sdf1
       3       8       97        3      active sync   /dev/sdg1
       4       8      113        4      active sync   /dev/sdh1
       5       8      129        5      active sync   /dev/sdi1
       6       8      145        6      active sync   /dev/sdj1
       7       8      161        7      active sync   /dev/sdk1
# cat /sys/block/md0/md/mismatch_cnt
362732152


This is with mdadm v3.2.6 (mdadm-3.2.6-7.fc18.x86_64); except for the
huge values of mismatch_cnt, I see no other indications for errors on
the disk drives, RAID arrays or the file systems on top of these.

Is this some known (and hopefully harmless) issue, or must I worry
about our data?


Thanks in advance.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
On the subject of C program indentation: "In My Egotistical  Opinion,
most  people's  C  programs  should be indented six feet downward and
covered with dirt."                               - Blair P. Houghton

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-27 19:26 Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18 Wolfgang Denk
@ 2013-01-27 19:45 ` Chris Murphy
  2013-01-27 23:10   ` Wolfgang Denk
  2013-01-27 20:05 ` Robin Hill
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 52+ messages in thread
From: Chris Murphy @ 2013-01-27 19:45 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: linux-raid


On Jan 27, 2013, at 12:26 PM, Wolfgang Denk <wd@denx.de> wrote:

> I have seen "mismatch_cnt is not 0" warnings in the past, but that has
> always been with RAID 1 arrays, and with relatively small numbers on
> /sys/block/md*/md/mismatch_cnt; my understanding was that this was not
> actually critical.

Well, you have a block on each drive that's supposed to be identical, and if there's a mismatch that means they're not identical and it's ambiguous which one is correct. So if the data on the drives is unimportant to retrieve correctly, then I guess it's not a critical error.

> 
> However, after updating to Fedora 18, I get this message from all
> updated
> systems that have RAID 6 arrays, and with _huge_ numbers of
> mismatch_cnt, like that:
> 
> fter updating to Fedora 18, I get this message from all updated
> systems that have RAID 6 arrays, and with _huge_ numbers of
> mismatch_cnt, like that:
> 

Bad paste. No messages provided.


> This is with mdadm v3.2.6 (mdadm-3.2.6-7.fc18.x86_64); except for the
> huge values of mismatch_cnt, I see no other indications for errors on
> the disk drives, RAID arrays or the file systems on top of these.
> 
> Is this some known (and hopefully harmless), issue, or must I worry
> about our data?

It means there's a mismatch between parity and data. One of them is wrong, maybe both if there are many errors. So yeah I'd say it sounds like a problem.

What's the smartctl -a look like for all drives? I imagine one or more have bad sectors, ECC errors, or UDMA/CRC errors.


Chris Murphy


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-27 19:26 Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18 Wolfgang Denk
  2013-01-27 19:45 ` Chris Murphy
@ 2013-01-27 20:05 ` Robin Hill
  2013-01-27 23:11   ` Wolfgang Denk
  2013-01-28  1:14 ` Phil Turmel
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 52+ messages in thread
From: Robin Hill @ 2013-01-27 20:05 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: linux-raid


On Sun Jan 27, 2013 at 08:26:56 +0100, Wolfgang Denk wrote:

> I have seen "mismatch_cnt is not 0" warnings in the past, but that has
> always been with RAID 1 arrays, and with relatively small numbers on
> /sys/block/md*/md/mismatch_cnt; my understanding was that this was not
> actually critical.
> 
> However, after updating to Fedora 18, I get this message from all
> updated
> systems that have RAID 6 arrays, and with _huge_ numbers of
> mismatch_cnt, like that:
> 
<-snip->
> # cat /sys/block/md0/md/mismatch_cnt
> 362732152
> 
Is this after having run a check, or just after boot? I've seen some odd
numbers at times, but they've gone away after running a check
(presumably something not being initialised).
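
(For reference, and assuming md0 is the array in question, a check can
be kicked off by hand and the result read back roughly like this - the
device name is just an example:

# echo check > /sys/block/md0/md/sync_action
# cat /sys/block/md0/md/sync_action    # wait until this reads "idle" again
# cat /sys/block/md0/md/mismatch_cnt

The counter only reflects the last completed check/repair pass; right
after boot or assembly it doesn't tell you much. On Fedora the periodic
check is, if I remember correctly, run by the raid-check cron script
shipped with the mdadm package.)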

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-27 19:45 ` Chris Murphy
@ 2013-01-27 23:10   ` Wolfgang Denk
  0 siblings, 0 replies; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-27 23:10 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-raid

Dear Chris Murphy,

In message <CA3C3BC1-A00A-422B-A643-D5323FDC3F6A@colorremedies.com> you wrote:
> 
> Well, you have a block on each drive that's supposed to be identical,
> and if there's a mismatch that means they're not identical and it's
> ambiguous which one is correct. So if the data on the drives is
> unimportant to retrieve correctly, then I guess it's not a critical
> error.

Heh, would I be running a RAID 6 array if the data was unimportant?

> What's the smartctl -a look like for all drives? I imagine one or more
> have bad sectors, ECC errors, or UDMA/CRD errors.

I see some Hardware_ECC_Recovered; very few disks have 1
Reallocated_Sector_Ct; there are no (0) Current_Pending_Sector,
Offline_Uncorrectable, UDMA_CRC_Error_Count, Multi_Zone_Error_Rate, or
Data_Address_Mark_Errs.
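
(For reference, a quick way to pull just those attributes from all the
member drives is something along these lines - the device glob matches
this particular host and is just an example:

# for d in /dev/sd[d-k]; do
>   echo "== $d =="
>   smartctl -A $d | egrep 'Realloc|Pending|Uncorrect|UDMA_CRC|ECC'
> done
)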

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
If God wanted me to touch my toes, he'd have put them on my knees.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-27 20:05 ` Robin Hill
@ 2013-01-27 23:11   ` Wolfgang Denk
  0 siblings, 0 replies; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-27 23:11 UTC (permalink / raw)
  To: Robin Hill; +Cc: linux-raid

Dear Robin Hill,

In message <20130127200557.GA10960@cthulhu.home.robinhill.me.uk> you wrote:
> 
> Is this after having run a check, or just after boot? I've seen some odd
> numbers at times, but they've gone away after running a check
> (presumably something not being initialised).

The non-zero mismatch_cnt was reported by a check, and after that I
looked at the state of the array.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
People are always a lot more complicated than you  think.  It's  very
important to remember that.             - Terry Pratchett, _Truckers_

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-27 19:26 Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18 Wolfgang Denk
  2013-01-27 19:45 ` Chris Murphy
  2013-01-27 20:05 ` Robin Hill
@ 2013-01-28  1:14 ` Phil Turmel
  2013-01-28  1:42   ` Chris Murphy
  2013-01-28  6:27   ` Wolfgang Denk
  2013-01-28  2:07 ` Brad Campbell
  2013-01-28 17:37 ` Piergiorgio Sartor
  4 siblings, 2 replies; 52+ messages in thread
From: Phil Turmel @ 2013-01-28  1:14 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: linux-raid

On 01/27/2013 02:26 PM, Wolfgang Denk wrote:
> I have seen "mismatch_cnt is not 0" warnings in the past, but that has
> always been with RAID 1 arrays, and with relatively small numbers on
> /sys/block/md*/md/mismatch_cnt; my understanding was that this was not
> actually critical.
> 
> However, after updating to Fedora 18, I get this message from all
> updated
> systems that have RAID 6 arrays, and with _huge_ numbers of
> mismatch_cnt, like that:
> 
> fter updating to Fedora 18, I get this message from all updated
> systems that have RAID 6 arrays, and with _huge_ numbers of
> mismatch_cnt, like that:
> 
> # mdadm -q --detail /dev/md0
> /dev/md0:
>         Version : 1.2
>   Creation Time : Mon Jan 14 14:20:34 2013

Something is missing from this story.  Was the original system created
this month, and immediately upgraded?  Or was there a "mdadm --create"
in this story you haven't told us about?

>      Raid Level : raid6
>      Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
>   Used Dev Size : 243269632 (232.00 GiB 249.11 GB)
>    Raid Devices : 8
>   Total Devices : 8
>     Persistence : Superblock is persistent
> 
>     Update Time : Sun Jan 27 02:27:28 2013
>           State : clean 
>  Active Devices : 8
> Working Devices : 8
>  Failed Devices : 0
>   Spare Devices : 0
> 
>          Layout : left-symmetric
>      Chunk Size : 16K
> 
>            Name : XXX:0  (local to host XXX)
>            UUID : da015f96:138b37bf:d5ef71dc:8970ab15
>          Events : 8
> 
>     Number   Major   Minor   RaidDevice State
>        0       8       49        0      active sync   /dev/sdd1
>        1       8       65        1      active sync   /dev/sde1
>        2       8       81        2      active sync   /dev/sdf1
>        3       8       97        3      active sync   /dev/sdg1
>        4       8      113        4      active sync   /dev/sdh1
>        5       8      129        5      active sync   /dev/sdi1
>        6       8      145        6      active sync   /dev/sdj1
>        7       8      161        7      active sync   /dev/sdk1
> # cat /sys/block/md0/md/mismatch_cnt
> 362732152
> 
> 
> This is with mdadm v3.2.6 (mdadm-3.2.6-7.fc18.x86_64); except for the
> huge values of mismatch_cnt, I see no other indications for errors on
> the disk drives, RAID arrays or the file systems on top of these.
> 
> Is this some known (and hopefully harmless), issue, or must I worry
> about our data?

I would be worried, but there's not enough information here to say.

Please share the output of "mdadm -E /dev/sd[defghijk]1".

Please also explain/share what you've done to conclude the filesystem is
error-free.

Phil

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28  1:14 ` Phil Turmel
@ 2013-01-28  1:42   ` Chris Murphy
  2013-01-28  2:16     ` Chris Murphy
  2013-01-28  6:36     ` Wolfgang Denk
  2013-01-28  6:27   ` Wolfgang Denk
  1 sibling, 2 replies; 52+ messages in thread
From: Chris Murphy @ 2013-01-28  1:42 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wolfgang Denk, linux-raid

Are the drives all the same model? What model? And are they AF disks?

The mismatch number is not divisible by 16, yet your chunk size is 16KB. It is divisible by 4 and 8, so I'm going to guess that the physical sector size is 4096 bytes. If correct, I'm coming up with a maximum of 346GiB worth of sectors that may be adversely affected, assuming every sector in the mismatch count is bad (which is probably not true, but could be).

What file system?

Have you done only a check or also a repair, either before or after the upgrade? (I'm not suggesting doing a repair now.)

A check should cause URE's to be fixed, but you don't get a count against them because chunk and parity end up being the same for the check after being fixed.

Are there any messages in dmesg for the time of the check? I would like to know what was recorded in dmesg at the time the mismatches were found.

Chris Murphy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-27 19:26 Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18 Wolfgang Denk
                   ` (2 preceding siblings ...)
  2013-01-28  1:14 ` Phil Turmel
@ 2013-01-28  2:07 ` Brad Campbell
  2013-01-28  6:39   ` Wolfgang Denk
  2013-01-28 17:37 ` Piergiorgio Sartor
  4 siblings, 1 reply; 52+ messages in thread
From: Brad Campbell @ 2013-01-28  2:07 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: linux-raid

On 28/01/13 03:26, Wolfgang Denk wrote:
> I have seen "mismatch_cnt is not 0" warnings in the past, but that has
> always been with RAID 1 arrays, and with relatively small numbers on
> /sys/block/md*/md/mismatch_cnt; my understanding was that this was not
> actually critical.
>
> However, after updating to Fedora 18, I get this message from all
> updated
> systems that have RAID 6 arrays, and with _huge_ numbers of
> mismatch_cnt, like that:

I saw this when I put two drives from a 10 drive RAID6 on a sil3124 
card. Unfortunately by the time I figured out what was going on my array 
was trashed.

Massive mismatch counts are indicative of an insidious problem further 
down the storage stack. Check your drivers, cards, cables and PSU.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28  1:42   ` Chris Murphy
@ 2013-01-28  2:16     ` Chris Murphy
  2013-01-28  6:43       ` Wolfgang Denk
  2013-01-28  6:36     ` Wolfgang Denk
  1 sibling, 1 reply; 52+ messages in thread
From: Chris Murphy @ 2013-01-28  2:16 UTC (permalink / raw)
  To: Phil Turmel, Brad Campbell; +Cc: Wolfgang Denk, linux-raid@vger.kernel.org Raid


On Jan 27, 2013, at 6:42 PM, Chris Murphy <lists@colorremedies.com> wrote:

> The mismatch number is not divisible by 16, yet your chunk size is 16KB. It is divisible by 4 and 8, so I'm going to guess that the physical sector size is 4096 bytes. If correct, I'm coming up with a maximum of 346GiB worth of sectors may be adversely affected, assuming every sector in the mismatch count is bad (which is probably not true, but could be).

Since the upgrade to Fedora 18 for this particular RAID 6, can you estimate how much data has been written to the array? Could it be in the 90GiB to 350GiB range?

On Jan 27, 2013, at 7:07 PM, Brad Campbell <lists2009@fnarfbargle.com> wrote:
> 
> Massive mismatch counts are indicative of an insidious problem further down the storage stack. Check your drivers, cards, cables and PSU.

Yeah, this may be difficult, but I think the array needs to be remounted ro or unmounted. The writes may be killing it. But there's not enough information yet to know; it's smoke but no fire so far.

man 4 md says the same: for RAID 5 and 6, mismatches are not expected to be software problems, but are much more likely hardware. But if it's true that the OP's problems started exactly with the upgrade to Fedora 18, then it could be a device driver. What HBA is being used?


Chris Murphy


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28  1:14 ` Phil Turmel
  2013-01-28  1:42   ` Chris Murphy
@ 2013-01-28  6:27   ` Wolfgang Denk
  1 sibling, 0 replies; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28  6:27 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Dear Phil Turmel,

In message <5105D0EA.7080100@turmel.org> you wrote:
>
> >   Creation Time : Mon Jan 14 14:20:34 2013
> 
> Something is missing from this story.  Was the original system created
> this month, and immediately upgraded?  Or was there a "mdadm --create"
> in this story you haven't told us about?

On one of the systems where I observe these problems the array was
indeed recreated; this one is running a set of old Maxtor MaXLine Plus
II 7Y250M0 drives, and the rebuild was done to throw out some drives
that had started growing numbers of reallocated sectors.

The two other systems have been running unmodified hardware-wise for
many months.

> Please share the output of "mdadm -E /dev/sd[defghijk]1".


/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : da015f96:138b37bf:d5ef71dc:8970ab15
           Name : atlas.denx.de:0  (local to host atlas.denx.de)
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
   Raid Devices : 8

 Avail Dev Size : 489968447 (233.64 GiB 250.86 GB)
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 486539264 (232.00 GiB 249.11 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 9d820d74:cd981ec5:b0d95ad6:37124d98

    Update Time : Sun Jan 27 23:03:15 2013
       Checksum : 9fe9401b - correct
         Events : 8

         Layout : left-symmetric
     Chunk Size : 16K

   Device Role : Active device 0
   Array State : AAAAAAAA ('A' == active, '.' == missing)
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : da015f96:138b37bf:d5ef71dc:8970ab15
           Name : atlas.denx.de:0  (local to host atlas.denx.de)
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
   Raid Devices : 8

 Avail Dev Size : 489970560 (233.64 GiB 250.86 GB)
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 486539264 (232.00 GiB 249.11 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : d0494c53:30feb967:37f035e7:5665af56

    Update Time : Sun Jan 27 23:03:15 2013
       Checksum : f100de9d - correct
         Events : 8

         Layout : left-symmetric
     Chunk Size : 16K

   Device Role : Active device 6
   Array State : AAAAAAAA ('A' == active, '.' == missing)
/dev/sdf1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : da015f96:138b37bf:d5ef71dc:8970ab15
           Name : atlas.denx.de:0  (local to host atlas.denx.de)
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
   Raid Devices : 8

 Avail Dev Size : 489970560 (233.64 GiB 250.86 GB)
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 486539264 (232.00 GiB 249.11 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 6d84b3ee:3789d762:1de41934:986e51c2

    Update Time : Sun Jan 27 23:03:15 2013
       Checksum : 400ba165 - correct
         Events : 8

         Layout : left-symmetric
     Chunk Size : 16K

   Device Role : Active device 1
   Array State : AAAAAAAA ('A' == active, '.' == missing)
/dev/sdg1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : da015f96:138b37bf:d5ef71dc:8970ab15
           Name : atlas.denx.de:0  (local to host atlas.denx.de)
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
   Raid Devices : 8

 Avail Dev Size : 489970560 (233.64 GiB 250.86 GB)
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 486539264 (232.00 GiB 249.11 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 1605798c:a669952b:33d986ec:a25c7c1f

    Update Time : Sun Jan 27 23:03:15 2013
       Checksum : bc26e59d - correct
         Events : 8

         Layout : left-symmetric
     Chunk Size : 16K

   Device Role : Active device 2
   Array State : AAAAAAAA ('A' == active, '.' == missing)
/dev/sdh1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : da015f96:138b37bf:d5ef71dc:8970ab15
           Name : atlas.denx.de:0  (local to host atlas.denx.de)
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
   Raid Devices : 8

 Avail Dev Size : 489970560 (233.64 GiB 250.86 GB)
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 486539264 (232.00 GiB 249.11 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 90bdf4cd:1e05fd37:bd4d27d5:666c9924

    Update Time : Sun Jan 27 23:03:15 2013
       Checksum : f7c7bdde - correct
         Events : 8

         Layout : left-symmetric
     Chunk Size : 16K

   Device Role : Active device 3
   Array State : AAAAAAAA ('A' == active, '.' == missing)
/dev/sdi1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : da015f96:138b37bf:d5ef71dc:8970ab15
           Name : atlas.denx.de:0  (local to host atlas.denx.de)
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
   Raid Devices : 8

 Avail Dev Size : 489970560 (233.64 GiB 250.86 GB)
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 486539264 (232.00 GiB 249.11 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 2143a792:be518d02:b2ad54b6:fbf72cd7

    Update Time : Sun Jan 27 23:03:15 2013
       Checksum : 1acb7b9e - correct
         Events : 8

         Layout : left-symmetric
     Chunk Size : 16K

   Device Role : Active device 7
   Array State : AAAAAAAA ('A' == active, '.' == missing)
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : da015f96:138b37bf:d5ef71dc:8970ab15
           Name : atlas.denx.de:0  (local to host atlas.denx.de)
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
   Raid Devices : 8

 Avail Dev Size : 489970560 (233.64 GiB 250.86 GB)
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 486539264 (232.00 GiB 249.11 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 8d76c4b7:819e678e:c9c66587:9eea674f

    Update Time : Sun Jan 27 23:03:15 2013
       Checksum : 150f0784 - correct
         Events : 8

         Layout : left-symmetric
     Chunk Size : 16K

   Device Role : Active device 4
   Array State : AAAAAAAA ('A' == active, '.' == missing)
/dev/sdk1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : da015f96:138b37bf:d5ef71dc:8970ab15
           Name : atlas.denx.de:0  (local to host atlas.denx.de)
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
   Raid Devices : 8

 Avail Dev Size : 489970560 (233.64 GiB 250.86 GB)
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 486539264 (232.00 GiB 249.11 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 5d9ef991:6162ced4:d50c17e8:882b4e68

    Update Time : Sun Jan 27 23:03:15 2013
       Checksum : af427a2b - correct
         Events : 8

         Layout : left-symmetric
     Chunk Size : 16K

   Device Role : Active device 5
   Array State : AAAAAAAA ('A' == active, '.' == missing)

> Please also explain/share what you've done to conclude the filesystem is
> error-free.

fsck shows no errors, a full backup shows no errors, and a number of
files I checked manually showed no corruption.  I'm aware that this is
not a very reliable test.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
At the source of every error which is blamed on the computer you will
find at least two human errors, including the error of blaming it  on
the computer.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28  1:42   ` Chris Murphy
  2013-01-28  2:16     ` Chris Murphy
@ 2013-01-28  6:36     ` Wolfgang Denk
  2013-01-28  7:00       ` Chris Murphy
  1 sibling, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28  6:36 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Phil Turmel, linux-raid

Dear Chris Murphy,

In message <6BCDA3B1-C137-486D-A45D-55912207AF3C@colorremedies.com> you wrote:
> Are the drives all the same model? What model? And are they AF disks?

Each of the affected arrays is built entirely from identical disks (one
system has 8 x Seagate NL35 ST3250623NS; one 6 x Seagate Barracuda ES.2
ST3250310NS; one 8 x Maxtor MaXLine Plus II 7Y250M0).

To my understanding none of these are AF disks.


> The mismatch number is not divisible by 16, yet your chunk size is 16KB. It is divisible by 4 and 8, so I'm going to guess that the physical sector size is 4096 bytes. If correct, I'm coming up with a maximum of 346GiB worth of sectors may be adversely 
> affected, assuming every sector in the mismatch count is bad (which is probably not true, but could be).

No.  These are all 512 byte sector drives.

> What file system?

ext4 on all of these.

> Have you done only a check or also a repair, either before or after the upgrade? (I'm not suggesting doing a repair now.)

I did only a check.  I rebooted two of the systems, which made the
mismatch_cnt go back to zero.  Of course I don't know if this means
everything is fine now.

> A check should cause URE's to be fixed, but you don't get a count against them because chunk and parity end up being the same for the check after being fixed.
> 
> Are there any messages in dmesg for the time of the check? I would like to know what was recorded in dmesg at the time the mismatches were found.

There is absolutely nothing to be seen in the logs; I also routinely
monitor the drives for reallocated, offline uncorrectable and pending
sectors, and nothing can be seen here either.

I would like to point out that all these systems have been up and
running essentially unchanged for many months, even years (OK, the old
Maxtor-drive-based one needs a disk swap every few weeks, but this is
because the disks have reached EOL). No such problem has been observed
ever before, but now it happens at the first check operation after
switching to Fedora 18, simultaneously on all systems ...

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
What is wanted is not the will to believe,  but the will to find out,
which is the exact opposite.
		        -- Bertrand Russell, "Skeptical Essays", 1928

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28  2:07 ` Brad Campbell
@ 2013-01-28  6:39   ` Wolfgang Denk
  2013-01-28  7:58     ` Dan Williams
  0 siblings, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28  6:39 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-raid

Dear Brad Campbell,

In message <5105DD72.6080909@fnarfbargle.com> you wrote:
>
> Massive mismatch counts are indicative of an insidious problem further 
> down the storage stack. Check your drivers, cards, cables and PSU.

That's why I'm asking. The hardware has not been changed at all, and
has been running reliably for many months, even years. No such problem
has been observed ever before, but now it happens at the first check
operation after switching to Fedora 18, simultaneously on all systems.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Being schizophrenic is better than living alone.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28  2:16     ` Chris Murphy
@ 2013-01-28  6:43       ` Wolfgang Denk
  0 siblings, 0 replies; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28  6:43 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Phil Turmel, Brad Campbell, linux-raid@vger.kernel.org Raid

Dear Chris,

In message <2C80C970-FE4F-4CFE-8F40-48BE14F3140F@colorremedies.com> you wrote:
> 
> > The mismatch number is not divisible by 16, yet your chunk size is
> 16KB. It is divisible by 4 and 8, so I'm going to guess that the
> physical sector size is 4096 bytes. If correct, I'm coming up with a
> maximum of 346GiB worth of sectors may be adversely affected, assuming
> every sector in the mismatch count is bad (which is probably not true,
> but could be).
> 
> Since the upgrade to Fedora 18 for this particular RAID 6, can you
> estimate how much data has been written to the array? Could it be in the
> 90GiB to 350GiB range?

No.  Much less has been written.  But then, your calculation above is
wrong - these are 512 byte sector disks.

> man 4 md says the same, for raid 5 and 6, mismatches are not expected to
> be software problems, but much more likely hardware. But if it's true
> that the OP's problems started exactly with the upgrade to Fedora 18,
> that it could be a device driver. What HBA is being used?

Correct, the hardware has not been changed, and has been running fine
for many months before.

Two of the systems use a Marvell Technology Group Ltd. MV88SX6081
8-port SATA II PCI-X Controller; the third system uses a LSI Logic /
Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS controller.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Human beings were created by water to transport it uphill.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28  6:36     ` Wolfgang Denk
@ 2013-01-28  7:00       ` Chris Murphy
  2013-01-28 10:27         ` Wolfgang Denk
  0 siblings, 1 reply; 52+ messages in thread
From: Chris Murphy @ 2013-01-28  7:00 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Phil Turmel, linux-raid


On Jan 27, 2013, at 11:36 PM, Wolfgang Denk <wd@denx.de> wrote:

> 
> No.  These are all 512 byte sector drives.

362732152 comes out to a max of 173GiB of mismatching sectors for a 512 byte sector size. In any case, the mismatch count is not trivial, even if it's not a lot of sectors relative to the size of the array.

> 
>> What file system?
> 
> ext4 on all of these.

And what does fsck.ext4 -f -n get you?


>> Have you done only a check or also a repair, either before or after the upgrade? (I'm not suggesting doing a repair now.)
> 
> I did only a check.  I rebootetd two of the systems, which made the
> mismatch_cnt go back to zero.  Of course I don;t know if this means
> everything is fine now.

No, it's only fine when you write check to md/sync_action and, upon completion, md/mismatch_cnt is 0.

> 

>> A check should cause URE's to be fixed, but you don't get a count against them because chunk and parity end up being the same for the check after being fixed.
>> 
>> Are there any messages in dmesg for the time of the check? I would like to know what was recorded in dmesg at the time the mismatches were found.
> 
> There is absolutely nothing to be seen in the logs; I also routinely
> monitor the drives for reallocated, offline uncorrectable and pending
> sectors, and nothing can be seen here either.
> 
> I would like to point out that all these systems have been up an
> running essentially unchanged for many months, even years (OK, the old
> Maxtor drive based one needs a disk swap every few weeks, but this is
> because the disks have reached EOL). No such problem has been observed
> ever before, but now it happens at the first check operation after
> switching to Fedora 18, simultaneous on all systems …

Switching to Fedora 18 means many things have changed, so it's harder to find out what's causing it, if it is software. Is it the kernel itself, is it md, is it the HBA driver?


Chris Murphy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28  6:39   ` Wolfgang Denk
@ 2013-01-28  7:58     ` Dan Williams
  0 siblings, 0 replies; 52+ messages in thread
From: Dan Williams @ 2013-01-28  7:58 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Brad Campbell, linux-raid

On Sun, Jan 27, 2013 at 10:39 PM, Wolfgang Denk <wd@denx.de> wrote:
> Dear Brad Campbell,
>
> In message <5105DD72.6080909@fnarfbargle.com> you wrote:
>>
>> Massive mismatch counts are indicative of an insidious problem further
>> down the storage stack. Check your drivers, cards, cables and PSU.
>
> That's why I'm asking. The hardware has not been changed at all, and
> has been running reliably for many months even years. No such problem
> has been observed ever before, but now it happens at the first check
> operation after switching to Fedora 18, simultaneous on all systems.

Can you, without too much pain, downgrade the kernel to the one from
before the upgrade (which kernel was that?) and re-run the check?
That would be an interesting data point.  Although if this reverts the
behavior I expect it would be fairly straightforward to reproduce.

--
Dan

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28  7:00       ` Chris Murphy
@ 2013-01-28 10:27         ` Wolfgang Denk
  0 siblings, 0 replies; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28 10:27 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Phil Turmel, linux-raid

Dear Chris,

In message <7BE0F80F-4730-41C0-80D0-AFD7996FDB4C@colorremedies.com> you wrote:
> 
> And what does fsck.ext4 -f -n get you?

No errors on any of the affected file systems.

> Switching to Fedora 18 means many things have changed so it's
> harder to find out, if software, what's causing it. Is it kernel
> itself, is it md, is it the HBA driver?

Agreed. But as there are at least 2 different HBA brands involved it
appears we can exclude this part.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
The world is no nursery.                              - Sigmund Freud

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-27 19:26 Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18 Wolfgang Denk
                   ` (3 preceding siblings ...)
  2013-01-28  2:07 ` Brad Campbell
@ 2013-01-28 17:37 ` Piergiorgio Sartor
  2013-01-28 18:12   ` Chris Murphy
  2013-01-28 19:00   ` Wolfgang Denk
  4 siblings, 2 replies; 52+ messages in thread
From: Piergiorgio Sartor @ 2013-01-28 17:37 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: linux-raid

On Sun, Jan 27, 2013 at 08:26:56PM +0100, Wolfgang Denk wrote:
[...]
> # cat /sys/block/md0/md/mismatch_cnt
> 362732152
> 
> 
> This is with mdadm v3.2.6 (mdadm-3.2.6-7.fc18.x86_64); except for the
> huge values of mismatch_cnt, I see no other indications for errors on
> the disk drives, RAID arrays or the file systems on top of these.
> 
> Is this some known (and hopefully harmless), issue, or must I worry
> about our data?
[...]


Hi Wolfgang,

I would shamelessly suggest trying "raid6check", in order
to see if some components have problems.

The software is somewhat buried in the "mdadm" source code;
you'll probably need to take it from the repository.
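
For example, something along these lines should get you a binary
(untested from here, so treat the exact make target and the arguments
as approximate):

# git clone git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
# cd mdadm
# make raid6check
# ./raid6check /dev/md0 0 0

where the arguments are, as far as I remember, the md device, the first
stripe and the number of stripes to check; running it without arguments
should print the exact usage.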

Hope it will help,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 17:37 ` Piergiorgio Sartor
@ 2013-01-28 18:12   ` Chris Murphy
  2013-01-28 19:00   ` Wolfgang Denk
  1 sibling, 0 replies; 52+ messages in thread
From: Chris Murphy @ 2013-01-28 18:12 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Wolfgang Denk, linux-raid


On Jan 28, 2013, at 10:37 AM, Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote:

> On Sun, Jan 27, 2013 at 08:26:56PM +0100, Wolfgang Denk wrote:
> [...]
>> # cat /sys/block/md0/md/mismatch_cnt
>> 362732152
>> 
>> 
>> This is with mdadm v3.2.6 (mdadm-3.2.6-7.fc18.x86_64); except for the
>> huge values of mismatch_cnt, I see no other indications for errors on
>> the disk drives, RAID arrays or the file systems on top of these.
>> 
>> Is this some known (and hopefully harmless), issue, or must I worry
>> about our data?
> [...]
> 
> 
> Hi Wolfgang,
> 
> I would shamelessly suggest to try "raid6check", in order
> to see if some components have problems.
> 
> The software is somehow buried into "mdadm" source code,
> probably you'll need to take it from the repository.

I'm curious to see what this is and what it reveals. The thing about the count mismatch is it doesn't tell us which is incorrect: data chunk, or parity chunk. For RAID 4/5 of course this is ambiguous. But for RAID 6 it's useful to know the nature of the mismatch. Of course, there is still ambiguity, but it's not 50/50.

This is also why I'm skeptical of "repair" since it assumes the data chunks are valid in the case of a mismatch. How does this work in the case of raid 6, if the two parity chunks agree, but the data chunk mismatches? Is the data chunk still deferred to in a repair?


Chris Murphy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 17:37 ` Piergiorgio Sartor
  2013-01-28 18:12   ` Chris Murphy
@ 2013-01-28 19:00   ` Wolfgang Denk
  2013-01-28 19:10     ` Wolfgang Denk
  2013-01-28 19:18     ` Piergiorgio Sartor
  1 sibling, 2 replies; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28 19:00 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

Dear Piergiorgio,

In message <20130128173704.GA2329@lazy.lzy> you wrote:
>
> I would shamelessly suggest to try "raid6check", in order
> to see if some components have problems.
> 
> The software is somehow buried into "mdadm" source code,
> probably you'll need to take it from the repository.

Found it.  Thanks for the suggestion.

However, this is extremely verbose:

layout: 2
disks: 8
component size: 249108103168
total stripes: 15204352
chunk size: 16384

disk: 0 - offset: 134217728 - size: 250864926720 - name: /dev/sdk1 -
slot: 5
disk: 1 - offset: 134217728 - size: 250864926720 - name: /dev/sdj1 -
slot: 4
disk: 2 - offset: 134217728 - size: 250864926720 - name: /dev/sdi1 -
slot: 7
disk: 3 - offset: 134217728 - size: 250864926720 - name: /dev/sdh1 -
slot: 3
disk: 4 - offset: 134217728 - size: 250864926720 - name: /dev/sdg1 -
slot: 2
disk: 5 - offset: 134217728 - size: 250864926720 - name: /dev/sdf1 -
slot: 1
disk: 6 - offset: 134217728 - size: 250864926720 - name: /dev/sde1 -
slot: 6
disk: 7 - offset: 134217728 - size: 250863844352 - name: /dev/sdd1 -
slot: 0

pos --> 0
0->1
1->2
2->3
3->4
4->5
5->6
pos --> 1
0->0
1->1
2->2
3->3
4->4
5->5
pos --> 2
0->7
1->0
2->1
3->2
4->3
5->4
pos --> 3
0->6
1->7
2->0
3->1
4->2
5->3
pos --> 4
0->5
1->6
2->7
3->0
4->1
5->2
pos --> 5
...

etc. ad nauseam.  I guess "pos" means stripe here, so it would print
this for all stripes in the array?  Does this mean all of them are
broken?  Or what would I have to look for to see where an error
might be?

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Our OS who art in CPU, UNIX be thy name.
Thy programs run, thy syscalls done,
In kernel as it is in user!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 19:00   ` Wolfgang Denk
@ 2013-01-28 19:10     ` Wolfgang Denk
  2013-01-28 19:22       ` Piergiorgio Sartor
  2013-01-28 19:18     ` Piergiorgio Sartor
  1 sibling, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28 19:10 UTC (permalink / raw)
  To: Piergiorgio Sartor, linux-raid

Dear Piergiorgio,

In message <20130128190035.D943A294BAB@gemini.denx.de> I wrote:
> 
> Found it.  Thanks for the suggestion.
> 
> However, this is extreme verbose:

OK, filtering this through "fgrep -v -e '->'" gives something like
this:

layout: 2
disks: 8
component size: 250058162176
total stripes: 15262339
chunk size: 16384

disk: 0 - offset: 1048576 - size: 250058301440 - name: /dev/sdj - slot: 7
disk: 1 - offset: 1048576 - size: 250058301440 - name: /dev/sdi - slot: 6
disk: 2 - offset: 1048576 - size: 250058301440 - name: /dev/sdh - slot: 5
disk: 3 - offset: 1048576 - size: 250058301440 - name: /dev/sdg - slot: 4
disk: 4 - offset: 1048576 - size: 250058301440 - name: /dev/sdf - slot: 3
disk: 5 - offset: 1048576 - size: 250058301440 - name: /dev/sde - slot: 2
disk: 6 - offset: 1048576 - size: 250058301440 - name: /dev/sdd - slot: 1
disk: 7 - offset: 1048576 - size: 250058301440 - name: /dev/sdc - slot: 0

Q(7) wrong at 1
P(5) wrong at 2
Q(6) wrong at 2
P(4) wrong at 3
Q(5) wrong at 3
P(3) wrong at 4
Q(4) wrong at 4
P(2) wrong at 5
Q(3) wrong at 5
P(1) wrong at 6
Q(2) wrong at 6
P(0) wrong at 7
Q(1) wrong at 7
P(7) wrong at 8
Q(7) wrong at 9
P(2) wrong at 541
Q(3) wrong at 541
P(1) wrong at 542
Q(2) wrong at 542
P(0) wrong at 543
Q(0) wrong at 544
P(6) wrong at 545
Q(7) wrong at 545
P(5) wrong at 546
Q(6) wrong at 546
P(4) wrong at 547
Q(5) wrong at 547
P(3) wrong at 548
Q(4) wrong at 548
P(2) wrong at 549
Q(3) wrong at 549
P(1) wrong at 550
Q(2) wrong at 550
P(0) wrong at 551
Q(0) wrong at 552
P(6) wrong at 553
Q(7) wrong at 553
P(5) wrong at 554
Q(6) wrong at 554
P(4) wrong at 555
Q(5) wrong at 555
P(3) wrong at 556
Q(4) wrong at 556
P(2) wrong at 557
Q(3) wrong at 557
P(1) wrong at 558
Q(2) wrong at 558
P(0) wrong at 559
Q(0) wrong at 560
P(6) wrong at 561
Q(7) wrong at 561
P(5) wrong at 562
Q(6) wrong at 562
P(4) wrong at 563
Q(5) wrong at 563
P(3) wrong at 564
Q(4) wrong at 564
P(2) wrong at 565
Q(3) wrong at 565
P(1) wrong at 566
Q(2) wrong at 566
P(0) wrong at 567
Q(0) wrong at 568
P(6) wrong at 569
Q(7) wrong at 569
P(5) wrong at 570
Q(6) wrong at 570
P(4) wrong at 571
Q(5) wrong at 571
P(3) wrong at 572
Q(4) wrong at 572
P(2) wrong at 573
Q(3) wrong at 573
P(1) wrong at 574
Q(2) wrong at 574
P(0) wrong at 575
Q(0) wrong at 576
P(6) wrong at 577
Q(7) wrong at 577
P(5) wrong at 578
Q(6) wrong at 578
P(4) wrong at 579
Q(5) wrong at 579
P(3) wrong at 580
Q(4) wrong at 580
P(2) wrong at 581
Q(3) wrong at 581
P(1) wrong at 582
...

etc. etc. [The test is still running, but I already have more than
10k such lines...]

What does this tell us?




Note:  on one system (where I can recreate / restore the content of the
array more easily) I decided to run some experiments, so I ran a
"repair" on the array.  The repair completed without any error
indication - but ended with a mismatch_cnt = 362731480.

My simple tests still indicate no data corruption (but then I might
just be looking in the wrong places).


Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
It is a human characteristic to love little  animals,  especially  if
they're attractive in some way.
	-- McCoy, "The Trouble with Tribbles", stardate 4525.6

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 19:00   ` Wolfgang Denk
  2013-01-28 19:10     ` Wolfgang Denk
@ 2013-01-28 19:18     ` Piergiorgio Sartor
  1 sibling, 0 replies; 52+ messages in thread
From: Piergiorgio Sartor @ 2013-01-28 19:18 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid

On Mon, Jan 28, 2013 at 08:00:35PM +0100, Wolfgang Denk wrote:
> Dear Piergiorgio,
> 
> In message <20130128173704.GA2329@lazy.lzy> you wrote:
> >
> > I would shamelessly suggest to try "raid6check", in order
> > to see if some components have problems.
> > 
> > The software is somehow buried into "mdadm" source code,
> > probably you'll need to take it from the repository.
> 
> Found it.  Thanks for the suggestion.
> 
> However, this is extreme verbose:
> 
> layout: 2
> disks: 8
> component size: 249108103168
> total stripes: 15204352
> chunk size: 16384
> 
> disk: 0 - offset: 134217728 - size: 250864926720 - name: /dev/sdk1 -
> slot: 5
> disk: 1 - offset: 134217728 - size: 250864926720 - name: /dev/sdj1 -
> slot: 4
> disk: 2 - offset: 134217728 - size: 250864926720 - name: /dev/sdi1 -
> slot: 7
> disk: 3 - offset: 134217728 - size: 250864926720 - name: /dev/sdh1 -
> slot: 3
> disk: 4 - offset: 134217728 - size: 250864926720 - name: /dev/sdg1 -
> slot: 2
> disk: 5 - offset: 134217728 - size: 250864926720 - name: /dev/sdf1 -
> slot: 1
> disk: 6 - offset: 134217728 - size: 250864926720 - name: /dev/sde1 -
> slot: 6
> disk: 7 - offset: 134217728 - size: 250863844352 - name: /dev/sdd1 -
> slot: 0
> 
> pos --> 0
> 0->1
> 1->2
> 2->3
> 3->4
> 4->5
> 5->6
> pos --> 1
> 0->0
> 1->1
> 2->2
> 3->3
> 4->4
> 5->5
> pos --> 2
> 0->7
> 1->0
> 2->1
> 3->2
> 4->3
> 5->4
> pos --> 3
> 0->6
> 1->7
> 2->0
> 3->1
> 4->2
> 5->3
> pos --> 4
> 0->5
> 1->6
> 2->7
> 3->0
> 4->1
> 5->2
> pos --> 5
> ...
> 
> etc. ad nauseam.  I guess "pos" means stripe here, so it would print
> this for all stripes in the array?  Does this means all of them are
> broken?  Or what would I  have to look for to see where an error
> mightbe?

Hi Wolfgang,

the output is indeed verbose; my suggestion would be
to redirect it to a file (on different storage) and
"grep" it later for "Error".
This should report whether a specific device is detected
as having problems, or whether it cannot detect which device.
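
For example (device, arguments and path are just placeholders):

# ./raid6check /dev/md0 0 0 > /var/tmp/raid6check-md0.log 2>&1
# grep Error /var/tmp/raid6check-md0.log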

The output you see above means everything is correct,
until stripe 4, at least. So you're right, the "pos"
is the stripe position.

In case of error, something like:

Error detected at X: possible failed disk slot: Y

Which means stripe X, disk Y, from the initial print.

Or it could be:

Error detected at X: disk slot unknown

Which should be obvious.

Hope this helps,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 19:10     ` Wolfgang Denk
@ 2013-01-28 19:22       ` Piergiorgio Sartor
  2013-01-28 20:19         ` Wolfgang Denk
  0 siblings, 1 reply; 52+ messages in thread
From: Piergiorgio Sartor @ 2013-01-28 19:22 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid

On Mon, Jan 28, 2013 at 08:10:41PM +0100, Wolfgang Denk wrote:
[...]
> etc. etc. [The test is still running, but I already have more than
> 10k such lines...]

Hi Wolfgang,

please grep for "Error".

> What does this tell us?

There are for sure parity errors, but it would be
more interesting to know (that's why "Error") if
a specific device has/had problems.

> Note:  on one system (where I can recreate / rstore the content of the
> array more easily) I decided to play some experiments, so I ran a
> "repair" on the array.  The repair completed without any error
> indicateion - but ended with a mismatch_cnt = 362731480.

I guess this is expected: the "repair" fixes the parities,
nothing more, while reporting how many fixes were made.
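
For reference, a "repair" is started the same way as a check, just with
a different keyword:

# echo repair > /sys/block/md0/md/sync_action

and, as far as I understand it, the resulting mismatch_cnt reports how
much was found and rewritten, counted in 512-byte sectors, which is why
it stays non-zero even after a repair that completed without errors.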
 
> My simple tests still ndicate no data corrption (but then I might just
> be looking in the wrong places).

Well, this might be possible, even if I would not
count on it...

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 19:22       ` Piergiorgio Sartor
@ 2013-01-28 20:19         ` Wolfgang Denk
  2013-01-28 20:44           ` Piergiorgio Sartor
  0 siblings, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28 20:19 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

Dear Piergiorgio,

In message <20130128192256.GB13803@lazy.lzy> you wrote:
>
> please grep for "Error".
> 
> > What does this tell us?
> 
> There are for sure parity errors, but it would be
> more interesting to know (that's why "Error") if
> a specific device has/had problems.

So I see a large number (> 200,000 in one array, so far - check still
running) of "Error detected at ...: disk slot unknown" messages,
in sequences like this, for example:

...
Q(0) wrong at 1951856
Q(7) wrong at 1951857
P(5) wrong at 1951858
Q(6) wrong at 1951858
Error detected at 1951858: disk slot unknown
P(4) wrong at 1951859
Q(5) wrong at 1951859
Error detected at 1951859: disk slot unknown
P(3) wrong at 1951860
Q(4) wrong at 1951860
Error detected at 1951860: disk slot unknown
P(2) wrong at 1951861
Q(3) wrong at 1951861
Error detected at 1951861: disk slot unknown
P(1) wrong at 1951862
Q(2) wrong at 1951862
Error detected at 1951862: disk slot unknown
P(0) wrong at 1951863
Q(1) wrong at 1951863
Error detected at 1951863: disk slot unknown
Q(0) wrong at 1951864
Q(7) wrong at 1951865
P(5) wrong at 1951866
Q(6) wrong at 1951866
Error detected at 1951866: disk slot unknown
P(4) wrong at 1951867
Q(5) wrong at 1951867
Error detected at 1951867: disk slot unknown
P(3) wrong at 1951868
Q(4) wrong at 1951868
Error detected at 1951868: disk slot unknown
P(2) wrong at 1951869
Q(3) wrong at 1951869
Error detected at 1951869: disk slot unknown
P(1) wrong at 1951870
Q(2) wrong at 1951870
Error detected at 1951870: disk slot unknown
P(0) wrong at 1951871
Q(1) wrong at 1951871
Error detected at 1951871: disk slot unknown
...

Does this make it possible to tell which disk it is?


Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
The first thing we do is kill all the lawyers.
(Shakespeare. II Henry VI, Act IV, scene ii)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 20:19         ` Wolfgang Denk
@ 2013-01-28 20:44           ` Piergiorgio Sartor
  2013-01-28 22:47             ` Chris Murphy
  2013-01-28 23:18             ` Wolfgang Denk
  0 siblings, 2 replies; 52+ messages in thread
From: Piergiorgio Sartor @ 2013-01-28 20:44 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid

On Mon, Jan 28, 2013 at 09:19:47PM +0100, Wolfgang Denk wrote:
[...]
> P(0) wrong at 1951871
> Q(1) wrong at 1951871
> Error detected at 1951871: disk slot unknown
> ...
> 
> Does this allow to tell which disk it is?

Hi Wolfgang,

unfortunately, this means more than one disk
has data which cannot match the parities.
If only one disk had been corrupted, we
would have seen it reported (apart from pathological
cases), but when the slot is "unknown" it means
the detection was not possible.

It could still be that the data is as it was written,
but, for some reason I cannot imagine, the
parities (both) are not correct.
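
(Just as a rough reminder of how the two parities relate: for data
chunks D_0 ... D_(n-1), RAID 6 computes

    P = D_0 xor D_1 xor ... xor D_(n-1)
    Q = g^0*D_0 xor g^1*D_1 xor ... xor g^(n-1)*D_(n-1)    over GF(2^8)

so P and Q are calculated differently and are not copies of each other;
"both wrong" here means neither matches what the data blocks predict.)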

This could be the case with some software bug, which
would be quite a surprise, I must say.

Still, it would be nice to check whether the whole array
is in this state or if, sooner or later, some
known slot (with error) is found somewhere else.
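
For example, something like this (file name from my earlier suggestion,
adjust as needed) would show whether any specific slot ever gets blamed:

# grep -c 'disk slot unknown' /var/tmp/raid6check-md0.log
# grep 'possible failed disk slot' /var/tmp/raid6check-md0.log | sort | uniq -c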

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 20:44           ` Piergiorgio Sartor
@ 2013-01-28 22:47             ` Chris Murphy
  2013-01-28 22:49               ` Chris Murphy
                                 ` (2 more replies)
  2013-01-28 23:18             ` Wolfgang Denk
  1 sibling, 3 replies; 52+ messages in thread
From: Chris Murphy @ 2013-01-28 22:47 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Wolfgang Denk, linux-raid


On Jan 28, 2013, at 1:44 PM, Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote:


> It could still be the data is as it was written,
> but, for some reasons I could not imagine, the
> parities (both) are not correct.

In RAID 6 are the two parity chunks in a stripe identical? If yes, does 'unknown' only tell us that the data chunk recomputed parity does not match either of the parity chunks; or does it tell us anything about whether the parity chunks themselves are still the same or different?

> 
> This could be in case of some software bug, which
> would be quite a surprise, I must say.

Yes, it sounds reproducible on more than one array, more than one HBA. Is it also on more than one computer, Wolfgang?

I think regression testing is going to be needed to find it. Hopefully the problem is restricted to parity computation and data chunks aren't affected; however if a URE occurs in a data chunk, it could be reconstructed incorrectly from bad parity, so it's obviously still a big problem.

Chris Murphy


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 22:47             ` Chris Murphy
@ 2013-01-28 22:49               ` Chris Murphy
  2013-01-28 23:03                 ` Wolfgang Denk
  2013-01-28 22:59               ` Wolfgang Denk
  2013-01-29 17:49               ` Piergiorgio Sartor
  2 siblings, 1 reply; 52+ messages in thread
From: Chris Murphy @ 2013-01-28 22:49 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Wolfgang Denk, linux-raid


On Jan 28, 2013, at 3:47 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> I think regression is going to be needed to find it. Hopefully the problem is restricted to parity computation and data chunks aren't affected; however if a URE occurs in a data chunk, it could be reconstructed incorrectly from bad parity so it's obviously still a big problem.

Actually it also needs to be reproduced. Wolfgang, the arrays were created/running under what version of mdadm and kernel the last time you did a check and there were no problems?

And now you're on Fedora 18, but the problem appeared with what version of mdadm and kernel?

Chris Murphy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 22:47             ` Chris Murphy
  2013-01-28 22:49               ` Chris Murphy
@ 2013-01-28 22:59               ` Wolfgang Denk
  2013-01-28 23:07                 ` Chris Murphy
  2013-01-29 17:49               ` Piergiorgio Sartor
  2 siblings, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28 22:59 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Piergiorgio Sartor, linux-raid

Dear Chris,

In message <6D287BCE-96EB-4F91-AC5A-34CD7AD2C68D@colorremedies.com> you wrote:
> 
> Yes, it sounds reproducible on more than one array, more than one HBA. Is it also more than one computer also, Wolfgang?

Correct, these are 3 different machines.

> I think regression is going to be needed to find it. Hopefully the
> problem is restricted to parity computation and data chunks aren't
> affected; however if a URE occurs in a data chunk, it could be
> reconstructed incorrectly from bad parity so it's obviously still a
> big problem.

My gut feeling is that the data are still OK, but I have to admit that
I inspected only a small fraction of the files, and I would like to
avoid restoring the data from backup tapes to another system as long
as possible.  So it indeed appears to me as if we had a software issue,
computing incorrect parity data.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
"Consistency requires you to be as ignorant today as you were a  year
ago."                                              - Bernard Berenson

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 22:49               ` Chris Murphy
@ 2013-01-28 23:03                 ` Wolfgang Denk
  2013-01-28 23:13                   ` Chris Murphy
  0 siblings, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28 23:03 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Piergiorgio Sartor, linux-raid

Dear Chris,

In message <1CEFA702-4512-40C1-ABE4-0FF7F42D9F44@colorremedies.com> you wrote:
> 
> Actually it also needs to be reproduced. Wolfgang, the
> arrays were created/running under what version of mdadm and kernel
> the last time you did a check and there were no problems?

No. The arrays were created before the update, and were running fine
until then.  The most recent of them was created under Fedora 17 - the
other two are much older; it would take a bit of effort to find out
when I created these, and with which exact versions.

The last time before the update to Fedora 18, all systems were
running Fedora 17 (with almost all updates installed), and there were
no problems.

> And now you're on Fedora 18, but the problem appeared with what
> version of mdadm and kernel?

The problem showed up during the first check after updating to Fedora
18.  This is with:

	kernel-3.7.4-204.fc18.x86_64
	mdadm-3.2.6-12.fc18.x86_64


Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
"One planet is all you get."

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 22:59               ` Wolfgang Denk
@ 2013-01-28 23:07                 ` Chris Murphy
  2013-01-28 23:23                   ` Wolfgang Denk
  0 siblings, 1 reply; 52+ messages in thread
From: Chris Murphy @ 2013-01-28 23:07 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid


On Jan 28, 2013, at 3:59 PM, Wolfgang Denk <wd@denx.de> wrote:

> Dear Chris,
> 
> In message <6D287BCE-96EB-4F91-AC5A-34CD7AD2C68D@colorremedies.com> you wrote:
>> 
>> Yes, it sounds reproducible on more than one array, more than one HBA. Is it also on more than one computer, Wolfgang?
> 
> Correct, these are 3 different machines.

Too bad. Better to test first than commit so many computers and arrays for such a major change.
> 
>> I think regression is going to be needed to find it. Hopefully the
>> problem is restricted to parity computation and data chunks aren't
>> affected; however if a URE occurs in a data chunk, it could be
>> reconstructed incorrectly from bad parity so it's obviously still a
>> big problem.
> 
> My gut feeling is that the data are still OK, but I have to admit that
> I inspected only a small fraction of the files, and I would like to
> avoid restoring the data from backup tapes to another system as long
> as possible.  So it indeed appears to me as if we had a software issue,
> computing incorrect parity data.

Unclear. If both parity chunks are wrong, then you effectively have a partial RAID 0, depending on which parity chunks are correct. I'm not recommending this, but if you set one disk to faulty and ran your file system and file tests again on the degraded array, and the files came back bad, then it is indeed the parity that's affected. If you don't get errors, it only indicates that the test method is insufficient to locate the errors and it could still be data that's affected.

It's a tenuous situation. It might be wise to pick a low priority computer for regression, and hopefully the problem gets better rather than worse. If the assumption is that the parity is bad, it needs to be recalculated with repair. If that goes well with tests and another check scrub, then it's better to get on with additional regressions sooner than later. Again in the meantime if you lost a drive, it could be a real mess if the raid starts to rebuild bad data from parity. Or even starts to write user data incorrectly too.
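
For reference, the scrub cycle I mean is just the standard md sysfs
interface; a minimal sketch, with /dev/md0 as a placeholder device name:

  # echo repair > /sys/block/md0/md/sync_action   # recompute and rewrite parity from data
  # cat /proc/mdstat                              # wait until the repair pass has finished
  # echo check > /sys/block/md0/md/sync_action    # read-only scrub, only counts mismatches
  # cat /sys/block/md0/md/mismatch_cnt            # expected to read 0 after a clean repair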

Chris Murphy--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 23:03                 ` Wolfgang Denk
@ 2013-01-28 23:13                   ` Chris Murphy
  2013-01-28 23:31                     ` Wolfgang Denk
  0 siblings, 1 reply; 52+ messages in thread
From: Chris Murphy @ 2013-01-28 23:13 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid


On Jan 28, 2013, at 4:03 PM, Wolfgang Denk <wd@denx.de> wrote:

> Dear Chris,
> 
> In message <1CEFA702-4512-40C1-ABE4-0FF7F42D9F44@colorremedies.com> you wrote:
>> 
>> Actually it also needs to be reproduced. Wolfgang, the
>> arrays were created/running under what version of mdadm and kernel
>> the last time you did a check and there were no problems?
> 
> No. The arrays were created before the update, and were running fine
> until then.  The most recent of them was created under Fedora 17 - the
> other two are much older; it would take a bit of effort to find out
> when I created these, and with which exact versions.
> 
> The last time before the update to Fedora 18, all systems were
> running Fedora 17 (with almost all updates installed), and there were
> no problems.

Yes, but what kernel version and mdadm? Right now 3.7.3-104 is the stable kernel for Fedora 17, and 3.2.6-8 is the current mdadm for Fedora 17. So just saying Fedora 17 doesn't tell us very much.

> 
>> And now you're on Fedora 18, but the problem appeared with what
>> version of mdadm and kernel?
> 
> The problem showed up during the first check after updating to Fedora
> 18.  This is with:
> 
> 	kernel-3.7.4-204.fc18.x86_64
> 	mdadm-3.2.6-12.fc18.x86_64

OK, and the versions before that, where you know the problem was not occurring?


Chris Murphy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 20:44           ` Piergiorgio Sartor
  2013-01-28 22:47             ` Chris Murphy
@ 2013-01-28 23:18             ` Wolfgang Denk
  2013-01-29 17:57               ` Piergiorgio Sartor
  1 sibling, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28 23:18 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

Dear Piergiorgio Sartor,

In message <20130128204422.GA14115@lazy.lzy> you wrote:
>
> It could still be the data is as it was written,
> but, for some reasons I could not imagine, the
> parities (both) are not correct.

This is what I think at the moment, as all my samples of data so far
looked ok.

> This could be in case of some software bug, which
> would be quite a surprise, I must say.

Indeed.  Guess how I feel...

> Still, would be nice to check if the whole array
> is it this state or if, sooner or later, some
> knwon slot (with error) is found somewhere else.

Checks still running.  I see three things:

- on the array where I was running "repair" before, raid6check reports
  no errors so far - but still there is a mismatch_cnt = 362731480
  raid6check is still running.

- on the second machine, I have 558579 lines of output, 176 of which
  are errors of type "Error detected at : disk slot unknown"; no other
  errors reported so far.  raid6check is still running.

- on the third machine, I have 5512894 lines of output, 1599431 of
  which are errors of type "Error detected at : disk slot unknown"; no
  other errors reported so far.  raid6check is still running.

This smells really bad as if parity computation was broken...

OK, add more hardware details...

A: Supermicro X8SAX mainboard, Core i7 CPU 950 @ 3.07GHz, 24 GB RAM
H: Supermicro X8ST3 mainboard, Xeon CPU W3565  @ 3.20GHz, 24 GB RAM
X: Supermicro X8SAX mainboard, Core i7 CPU 950 @ 3.07GHz, 24 GB RAM


Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
What is research but a blind date with knowledge?      -- Will Harvey

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 23:07                 ` Chris Murphy
@ 2013-01-28 23:23                   ` Wolfgang Denk
  2013-01-28 23:42                     ` Chris Murphy
  2013-01-29 18:02                     ` Roy Sigurd Karlsbakk
  0 siblings, 2 replies; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28 23:23 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Piergiorgio Sartor, linux-raid

Dear Chris,

In message <ADB1B276-21F3-4006-A613-F979F931EDC3@colorremedies.com> you wrote:
> 
> > Correct, these are 3 different machines.
> 
> > Too bad. Better to test first than commit so many computers and arrays
> for such a major change.

In hindsight you are of course correct.  But then, these are still not
really vitally critical systems, and I have to admit that I did not
expect this kind of problem.  I have installed a large number of
Fedora releases before (all of them since FC4 actually, on quite a
number of systems), and while there have always been some problems, I
never ran into something like this before.

> Unclear. If both parity chunks are wrong, then you effectively have a
> partial RAID 0, depending on which parity chunks are correct. I'm not
> recommending this, but if you set one disk to faulty and ran your file
> system and file tests again on the degraded array, and the files came
> back bad, then it is indeed the parity that's affected. If you don't
> get errors, it only indicates that the test method is insufficient to
> locate the errors and it could still be data that's affected.

OK, I will keep this in mind. If needed, I can dedicate one of the
systems to even a destructive test without too much actual loss.

> It's a tenuous situation. It might be wise to pick a low priority
> computer for regression, and hopefully the problem gets better rather
> than worse. If the assumption is that the parity is bad, it needs to be
> recalculated with repair. If that goes well with tests and another check
> scrub, then it's better to get on with additional regressions sooner
> than later. Again in the meantime if you lost a drive, it could be a
> real mess if the raid starts to rebuild bad data from parity. Or even
> starts to write user data incorrectly too.

Well, I did this - the repair worked without errors, but it left again
a huge mismatch_cnt; raid6check on this array has not found any
problems so far - even though I see mismatch_cnt = 362731480

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
O Staat!   Wie tief dir alle Besten fluchen!  Du bist kein Ziel.  Der
Mensch muß weiter suchen.                     - Christian Morgenstern
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 23:13                   ` Chris Murphy
@ 2013-01-28 23:31                     ` Wolfgang Denk
  0 siblings, 0 replies; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-28 23:31 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Piergiorgio Sartor, linux-raid

Dear Chris Murphy,

In message <19DB9CE1-E7A9-4812-B3C8-1E59AD6ADA67@colorremedies.com> you wrote:
> 
> > The last time before the update to Fedora 18, all systems were
> > running Fedora 17 (with almost all updates installed), and there were
> > no problems.
> 
> Yes, but what kernel version and mdadm? Right now 3.7.3-104 is the stable kernel for Fedora 17, and 3.2.6-8 is the current mdadm for Fedora 17. So just saying Fedora 17 doesn't tell us very much.

It was mdadm-3.2.6-1.fc17.x86_64 before on all 3 systems.

Two machines were running kernel version 3.6.10-2.fc17.x86_64, the
third one 3.6.11-1.fc17.x86_64

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
"What if" is a trademark of Hewlett Packard, so stop using it in your
sentences without permission, or risk being sued.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 23:23                   ` Wolfgang Denk
@ 2013-01-28 23:42                     ` Chris Murphy
  2013-01-29 18:02                     ` Roy Sigurd Karlsbakk
  1 sibling, 0 replies; 52+ messages in thread
From: Chris Murphy @ 2013-01-28 23:42 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid@vger.kernel.org Raid


On Jan 28, 2013, at 4:23 PM, Wolfgang Denk <wd@denx.de> wrote:
> 
> Well, I did this - the repair worked without errors, but it left again
> a huge mismatch_cnt; raid6check on this array has not found any
> problems so far - even though I see mismatch_cnt = 362731480

Well, we don't know the nature of the bug. Also the method of testing can be confusing.

If this is a bug, and you used the buggy version combination to do a repair, maybe it's seeing good parity as bad parity, and that's where the high mismatch count comes from. But then it replaces it with new parity, which could actually be bad, while a scrub or raid6check affected by the same bug sees the bad parity as correct, hence no error. And my speculation is likely not helpful.

I think more eyeballs need to be on this to try and get a reproducer, and find out what's going on. So you should probably post something on the Fedora devel list. If it's a bug, better to squash it sooner rather than later. If it's a false alarm, oh well, not much harm done.

Chris Murphy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 22:47             ` Chris Murphy
  2013-01-28 22:49               ` Chris Murphy
  2013-01-28 22:59               ` Wolfgang Denk
@ 2013-01-29 17:49               ` Piergiorgio Sartor
  2013-01-29 19:35                 ` Paul Menzel
  2 siblings, 1 reply; 52+ messages in thread
From: Piergiorgio Sartor @ 2013-01-29 17:49 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Piergiorgio Sartor, Wolfgang Denk, linux-raid

On Mon, Jan 28, 2013 at 03:47:13PM -0700, Chris Murphy wrote:
[...]
> In RAID 6, are the two parity chunks in a stripe identical? If yes, does 'unknown' only tell us that the parity recomputed from the data chunks does not match either of the on-disk parity chunks; or does it tell us anything about whether the parity chunks themselves are still the same or different?

Hi Chris,

the two parities, P and Q, are not the same,
they're different Reed-Solomon polynomials.
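
(In the notation of hpa's raid6 paper linked below, for data chunks
D_0 ... D_{n-1} and the generator g = {02} of GF(2^8), they are

  P = D_0 \oplus D_1 \oplus \cdots \oplus D_{n-1}
  Q = g^0 D_0 \oplus g^1 D_1 \oplus \cdots \oplus g^{n-1} D_{n-1}

i.e. P is the plain XOR parity and Q is a weighted sum over the
Galois field.)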

You'll find more detail here:

http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

At the end of this document, it is explained
what I've implemented in raid6check.
The process is the following.
The data is read (per stripe) and P, Q are
then calculated.
If the read P and Q are the same as the
calculated ones, then it is *assumed* the
data is OK.
If P *or* (xor) Q does not match, then it
is *assumed* the non-matching one is wrong.

If both, P and Q, do not match, then there
is a calculation which returns us the slot
(chunk position, if you want) where the
error could be.
If the slot is not in the admissible range,
for example if we have 5 data disks (0 to 4)
and the calculation returns 10, then it
means the error cannot be detected, which,
in turn, means probably more than one slot
has errors.

The "unknown" above states exactly this,
the two parities are not matching (P wrong,
Q wrong at the same position) and the
calculation cannot find which slot could
be the corrupted one, so there could be
more than one and, with only two parities,
we cannot say anything more.
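
For the curious, a tiny self-contained sketch of that localization
step (just the math from the paper, not the raid6check code; field
polynomial 0x11d and generator g = 2 as used by the kernel, data
values made up, one byte position of a toy 6-disk stripe):

#include <stdio.h>
#include <stdint.h>

/* multiply two elements of GF(2^8), polynomial x^8+x^4+x^3+x^2+1 (0x11d) */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
        uint8_t r = 0;
        while (b) {
                if (b & 1)
                        r ^= a;
                b >>= 1;
                a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        }
        return r;
}

/* discrete log base g = 2, for v != 0 */
static int gf_log(uint8_t v)
{
        uint8_t acc = 1;
        for (int z = 0; z < 255; z++) {
                if (acc == v)
                        return z;
                acc = gf_mul(acc, 2);
        }
        return -1;
}

/*
 * data[0..n-1], p, q: the bytes read from disk for one byte position.
 * Returns the suspected data slot, or -1 when no single data slot can
 * be blamed (only one parity differs, or the result is out of range,
 * which is the "disk slot unknown" case).
 */
static int locate_bad_slot(const uint8_t *data, int n, uint8_t p, uint8_t q)
{
        uint8_t pc = 0, qc = 0;
        for (int i = n - 1; i >= 0; i--) {      /* recompute P and Q (Horner form for Q) */
                qc = (uint8_t)(gf_mul(qc, 2) ^ data[i]);
                pc ^= data[i];
        }
        uint8_t dp = p ^ pc, dq = q ^ qc;
        if (!dp || !dq)
                return -1;
        int z = (gf_log(dq) - gf_log(dp) + 255) % 255;  /* dq = g^z * dp */
        return z < n ? z : -1;
}

int main(void)
{
        uint8_t d[6] = { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66 };
        uint8_t p = 0, q = 0;
        for (int i = 5; i >= 0; i--) {          /* parity as written for the original data */
                q = (uint8_t)(gf_mul(q, 2) ^ d[i]);
                p ^= d[i];
        }
        d[3] ^= 0xA5;                           /* corrupt data slot 3 after the fact */
        printf("suspected slot: %d\n", locate_bad_slot(d, 6, p, q));
        return 0;
}

It prints "suspected slot: 3" for the corruption injected in main();
with more than one corrupted slot on the same stripe the computed z
is no longer meaningful, which is what the "disk slot unknown" report
is about.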

> I think regression is going to be needed to find it. Hopefully the problem is restricted to parity computation and data chunks aren't affected; however if a URE occurs in a data chunk, it could be reconstructed incorrectly from bad parity so it's obviously still a big problem.

Maybe parities are not correctly calculated,
but I wonder how this could be, given the
amount of usage the code had and has.

Unless the new VX optimization has a problem.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 23:18             ` Wolfgang Denk
@ 2013-01-29 17:57               ` Piergiorgio Sartor
  2013-01-29 18:43                 ` Wolfgang Denk
  0 siblings, 1 reply; 52+ messages in thread
From: Piergiorgio Sartor @ 2013-01-29 17:57 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid

On Tue, Jan 29, 2013 at 12:18:39AM +0100, Wolfgang Denk wrote:
[...]
> This is what I think at the moment, as all my samples of data so far
> looked ok.

Hi Wolfgang,

my personal opinion would be to confirm
*all* the data is OK, if you can.
This will point to the parity calculation
as the error source, I guess.

> > This could be in case of some software bug, which
> > would be quite a surprise, I must say.
> 
> Indeed.  Guess how I feel...

The lucky one... :-)
 
> > Still, would be nice to check if the whole array
> > is it this state or if, sooner or later, some
> > knwon slot (with error) is found somewhere else.
> 
> Checks still running.  I see three things:
> 
> - on the array where I was running "repair" before, raid6check reports
>   no errors so far - but still there is a mismatch_cnt = 362731480
>   raid6check is still running.

As mentioned, the "repair" reports the
number of repairs it did, so unless
you ran a check after that, the number
is expected, I guess.
 
> - on the second machine, I have 558579 lines of output, 176 of which
>   are errors of type "Error detected at : disk slot unknown"; no other
>   errors reported so far.  raid6check is still running.

Nah, ja, it is slow, I know...
In any case, as I wrote in another post, "unknown"
means both parities are wrong and a suitable,
guilty, slot cannot be found.
So, either both parities are wrong and only
them (best case scenario), or more than one
disk has corrupted data on the same stripe.

> - on the third machine, I have 5512894 lines of output, 1599431 of
>   which are errors of type "Error detected at : disk slot unknown"; no
>   other errors reported so far.  raid6check is still running.
> 
> This smells really bad as if parity computation was broken...

Uhm, as mentioned, it would be nice to
find a specific error slot...
Well, not so nice, but this would point
to an HW problem.
 
> OK, add more hardware details...
> 
> A: Supermicro X8SAX mainboard, Core i7 CPU 950 @ 3.07GHz, 24 GB RAM
> H: Supermicro X8ST3 mainboard, Xeon CPU W3565  @ 3.20GHz, 24 GB RAM
> X: Supermicro X8SAX mainboard, Core i7 CPU 950 @ 3.07GHz, 24 GB RAM
 
What does the kernel log say about the chosen
RAID6 algorithm?

There should be some information with "dmesg".
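
For example (the exact figures will of course differ per machine):

  # dmesg | grep raid6

should show the boot-time benchmark lines and the "using algorithm ..."
and "using ... recovery algorithm" messages.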

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-28 23:23                   ` Wolfgang Denk
  2013-01-28 23:42                     ` Chris Murphy
@ 2013-01-29 18:02                     ` Roy Sigurd Karlsbakk
  2013-01-29 18:28                       ` Wolfgang Denk
  1 sibling, 1 reply; 52+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-01-29 18:02 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid, Chris Murphy

> > Too bad. Better to test first than commit so many computers and
> > arrays for such a major change.
> 
> In hindsight you are of course correct. But then, these are still not
> really vitally critical systems, and I have to admit that I did not
> expect this kind of problem. I have installed a large number of
> Fedora releases before (all of them since FC4 actually, on quite a
> number of systems), and while there have always been some problems, I
> never ran into something like this before.

Not to be a bitch, but I'd recommend using something more stable on servers than Fedora. CentOS or Ubuntu LTS are my open choices.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med xenotyp etymologi. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-29 18:02                     ` Roy Sigurd Karlsbakk
@ 2013-01-29 18:28                       ` Wolfgang Denk
  2013-01-29 18:43                         ` Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-29 18:28 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: Piergiorgio Sartor, linux-raid, Chris Murphy

Dear Roy Sigurd Karlsbakk,

In message <30125923.16.1359482526397.JavaMail.root@zimbra> you wrote:
>
> Not to be a bitch, but I'd recommend using something more stable on servers
>  than Fedora. CentOS or Ubuntu LTS are my open choices.

And how would you ever get a stable RHEL or CentOS if some brave,
fearless Knights of the Toggled Bit were not courageous enough to test
such bleeding edge software for you? ;-)

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
"...one of the main causes of the fall of the Roman Empire was  that,
lacking  zero,  they had no way to indicate successful termination of
their C programs."                                     - Robert Firth

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-29 17:57               ` Piergiorgio Sartor
@ 2013-01-29 18:43                 ` Wolfgang Denk
  2013-01-29 20:24                   ` Piergiorgio Sartor
  0 siblings, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-29 18:43 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

Dear Piergiorgio,

In message <20130129175720.GB2396@lazy.lzy> you wrote:
>
> my personal opinion would be to confirm
> *all* the data is OK, if you can.

Unfortunately I have no easy way of doing this.  Most of the data are
working trees of software builds, or build results, so I have no
checksums or other information.  But all files I accessed so far,
where I was able to check, were actually correct.

> Uhm, as mentioned, it would be nice to
> find a specific error slot...
> Well, not so nice, but this would point
> to an HW problem.

I am not a friend of quick conclusions in cases like this, but I think
that hardware issues are very unlikely to hit with identical effects
simultaneously on 3 different machines in 2 different locations.


> > OK, add more hardware details...
> > 
> > A: Supermicro X8SAX mainboard, Core i7 CPU 950 @ 3.07GHz, 24 GB RAM
> > H: Supermicro X8ST3 mainboard, Xeon CPU W3565  @ 3.20GHz, 24 GB RAM
> > X: Supermicro X8SAX mainboard, Core i7 CPU 950 @ 3.07GHz, 24 GB RAM
>  
> What does the kernel log say about the chosen
> RAID6 algorithm?

System A:

[   57.121902] raid6: sse2x1    7660 MB/s
[   57.138892] raid6: sse2x2    8687 MB/s
[   57.155890] raid6: sse2x4    9789 MB/s
[   57.155891] raid6: using algorithm sse2x4 (9789 MB/s)
[   57.155892] raid6: using ssse3x2 recovery algorithm

System H:

[   45.360607] raid6: sse2x1    7753 MB/s
[   45.403614] raid6: sse2x2    8777 MB/s
[   45.445612] raid6: sse2x4    9773 MB/s
[   45.472547] raid6: using algorithm sse2x4 (9773 MB/s)
[   45.503347] raid6: using ssse3x2 recovery algorithm

System X:

[   51.471793] raid6: sse2x1    3996 MB/s
[   51.517657] raid6: sse2x2    4851 MB/s
[   51.566579] raid6: sse2x4    4960 MB/s
[   51.598831] raid6: using algorithm sse2x4 (4960 MB/s)
[   51.638697] raid6: using ssse3x2 recovery algorithm


Note: I remembered that I converted yet another machine which has a
(smaller) RAID6 array (4 x SAMSUNG SpinPoint F1 HE502IJ) on a less
powerful PC (Gigabyte P35-DS3R mainboard, Intel Core 2 Duo CPU E6750
@ 2.66GHz, 8 GB RAM).  This shows NO problems (so far).  Here I have:

[   11.253015] raid6: sse2x1    1902 MB/s
[   11.270021] raid6: sse2x2    2296 MB/s
[   11.287274] raid6: sse2x4    3171 MB/s
[   11.287533] raid6: using algorithm sse2x4 (3171 MB/s)
[   11.288127] raid6: using ssse3x2 recovery algorithm


If you can draw any conclusions from that - I can't.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
GUIs  are  virtually  useless.  Learn  tools.  They're  configurable,
scriptable, automatable, cron-able, interoperable, etc. We don't need
no brain-dead winslurping monolithic claptrap.
                               -- Tom Christiansen in 371140df@csnews

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-29 18:28                       ` Wolfgang Denk
@ 2013-01-29 18:43                         ` Roy Sigurd Karlsbakk
  0 siblings, 0 replies; 52+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-01-29 18:43 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid, Chris Murphy

> And how would you ever get a stable RHEL or CentOS if some brave,
> fearless Knights of the Toggled Bit were not courageous enough to test
> such bleeding edge software for you? ;-)

Well, userspace software is one thing, but kernel stuff I prefer to be *stable*, which isn't the case with Fedora or non-LTS Ubuntu releases. Things do happen…

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med xenotyp etymologi. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-29 17:49               ` Piergiorgio Sartor
@ 2013-01-29 19:35                 ` Paul Menzel
  2013-01-29 20:18                   ` Piergiorgio Sartor
  0 siblings, 1 reply; 52+ messages in thread
From: Paul Menzel @ 2013-01-29 19:35 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Chris Murphy, Wolfgang Denk, linux-raid

[-- Attachment #1: Type: text/plain, Size: 184 bytes --]

On Tuesday, 2013-01-29, at 18:49 +0100, Piergiorgio Sartor wrote:

[…]

> Unless the new VX optimization has a problem.

What commits would these be?


Thanks,

Paul

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-29 19:35                 ` Paul Menzel
@ 2013-01-29 20:18                   ` Piergiorgio Sartor
  0 siblings, 0 replies; 52+ messages in thread
From: Piergiorgio Sartor @ 2013-01-29 20:18 UTC (permalink / raw)
  To: Paul Menzel; +Cc: Piergiorgio Sartor, Chris Murphy, Wolfgang Denk, linux-raid

On Tue, Jan 29, 2013 at 08:35:32PM +0100, Paul Menzel wrote:
> On Tuesday, 2013-01-29, at 18:49 +0100, Piergiorgio Sartor wrote:
> 
> […]
> 
> > Unless the new VX optimization has a problem.
> 
> What commits would these be?

Hi Paul,

I do not remember exactly, but I saw some patches
going back and forth with these optimizations.

It should not have been too long ago,

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-29 18:43                 ` Wolfgang Denk
@ 2013-01-29 20:24                   ` Piergiorgio Sartor
  2013-01-31 12:12                     ` Wolfgang Denk
  0 siblings, 1 reply; 52+ messages in thread
From: Piergiorgio Sartor @ 2013-01-29 20:24 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid

On Tue, Jan 29, 2013 at 07:43:09PM +0100, Wolfgang Denk wrote:
[...]
> I am not a friend of quick conclusions in cases like this, but I think
> that hardware issues are very unlikely to hit with identical effects
> simultaneously on 3 different machines in 2 different locations.

Hi Wolfgang,

yep, I think the same.
If all errors reported by raid6check, on the three
systems, are "unknown", then it seems to be a
software problem.

> 
> > > OK, add more hardware details...
> > > 
> > > A: Supermicro X8SAX mainboard, Core i7 CPU 950 @ 3.07GHz, 24 GB RAM
> > > H: Supermicro X8ST3 mainboard, Xeon CPU W3565  @ 3.20GHz, 24 GB RAM
> > > X: Supermicro X8SAX mainboard, Core i7 CPU 950 @ 3.07GHz, 24 GB RAM
> >  
> > What does the kernel log say about the chosen
> > RAID6 algorithm?
> 
> System A:
> 
> [   57.121902] raid6: sse2x1    7660 MB/s
> [   57.138892] raid6: sse2x2    8687 MB/s
> [   57.155890] raid6: sse2x4    9789 MB/s
> [   57.155891] raid6: using algorithm sse2x4 (9789 MB/s)
> [   57.155892] raid6: using ssse3x2 recovery algorithm
> 
> System H:
> 
> [   45.360607] raid6: sse2x1    7753 MB/s
> [   45.403614] raid6: sse2x2    8777 MB/s
> [   45.445612] raid6: sse2x4    9773 MB/s
> [   45.472547] raid6: using algorithm sse2x4 (9773 MB/s)
> [   45.503347] raid6: using ssse3x2 recovery algorithm
> 
> System X:
> 
> [   51.471793] raid6: sse2x1    3996 MB/s
> [   51.517657] raid6: sse2x2    4851 MB/s
> [   51.566579] raid6: sse2x4    4960 MB/s
> [   51.598831] raid6: using algorithm sse2x4 (4960 MB/s)
> [   51.638697] raid6: using ssse3x2 recovery algorithm

This ssse3x2 I do not know, it seems new to me,
but I might be wrong; we would need someone
with more insight into the RAID6 implementation.
 
> Note: I remembered that I converted yet another machine which has a
> (smaller) RAID6 array (4 x SAMSUNG SpinPoint F1 HE502IJ) on a less
> powerful PC (Gigabyte P35-DS3R mainboard, Intel Core 2 Duo CPU E6750
> @ 2.66GHz, 8 GB RAM).  This shows NO problems (so far).  Here I have:
> 
> [   11.253015] raid6: sse2x1    1902 MB/s
> [   11.270021] raid6: sse2x2    2296 MB/s
> [   11.287274] raid6: sse2x4    3171 MB/s
> [   11.287533] raid6: using algorithm sse2x4 (3171 MB/s)
> [   11.288127] raid6: using ssse3x2 recovery algorithm
> 
> 
> If you can draw any conclusions from that - I can't.

It is also the slowest...
Nevertheless, at this point, I cannot draw
any conclusions either... Sorry.
 
bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-29 20:24                   ` Piergiorgio Sartor
@ 2013-01-31 12:12                     ` Wolfgang Denk
  2013-01-31 17:14                       ` Chris Murphy
  2013-01-31 17:47                       ` Piergiorgio Sartor
  0 siblings, 2 replies; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-31 12:12 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

Dear Piergiorgio,

In message <20130129202433.GB7005@lazy.lzy> you wrote:
>
> > If all errors reported by raid6check, on the three
> systems, are "unknown", then it seems to be a
> software problem.

I think we can be pretty sure of this now.  For a test, I installed a
vanilla mainline Linux kernel (v3.8-rc5) on the affected machines.

A "check" operation showed no more problems, but "raid6test"
still reported a large number of errors like these:

...
P(4) wrong at 10291
Q(5) wrong at 10291
Error detected at 10291: disk slot unknown
P(3) wrong at 10292
Q(4) wrong at 10292
Error detected at 10292: disk slot unknown
P(2) wrong at 10293
Q(3) wrong at 10293
Error detected at 10293: disk slot unknown
...

After running a "repair" on the array, both "check" and "raid6test"
would not report any further issues.

I'll continue to watch this for a while, but I think I will not
"update" to a Fedora kernel for some time...

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Chapter 1 -- The story so  far:
In the beginning the Universe was created. This has  made  a  lot  of
people very angry and been widely regarded as a bad move.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-31 12:12                     ` Wolfgang Denk
@ 2013-01-31 17:14                       ` Chris Murphy
  2013-01-31 17:51                         ` Piergiorgio Sartor
  2013-01-31 18:36                         ` Wolfgang Denk
  2013-01-31 17:47                       ` Piergiorgio Sartor
  1 sibling, 2 replies; 52+ messages in thread
From: Chris Murphy @ 2013-01-31 17:14 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid


On Jan 31, 2013, at 5:12 AM, Wolfgang Denk <wd@denx.de> wrote:
> 
> After running a "repair" on the array, both "check" and "raid6test"
> would not report any further issues.

Yes but this would be consistent with a derivative parity, written to disk and then checked against an algorithm that expects derivative parity. What happens if you go back to the old kernel before all the problems were happening and you do a check? What happens if you go back to a Fedora kernel you know exhibited the problem and you do a check?

Question for Piergiorgio is if check and raid6test use the same, or independent, code for checking parity?


> I'll continue to watch this for a while, but I think I will not
> "update" to a Fedora kernel for some time...


I think a bug needs to be filed with the information you have thus far.


Chris Murphy


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-31 12:12                     ` Wolfgang Denk
  2013-01-31 17:14                       ` Chris Murphy
@ 2013-01-31 17:47                       ` Piergiorgio Sartor
  1 sibling, 0 replies; 52+ messages in thread
From: Piergiorgio Sartor @ 2013-01-31 17:47 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid

On Thu, Jan 31, 2013 at 01:12:20PM +0100, Wolfgang Denk wrote:
> Dear Piergiorgio,
> 
> In message <20130129202433.GB7005@lazy.lzy> you wrote:
> >
> > If all errors reported by raid6check, on the three
> > systems, are "unknown", then it seems to be a
> > software problem.
> 
> I think we can be pretty sure of this now.  For a test, I installed a
> vanilla mainline Linux kernel (v3.8-rc5) on the affected machines.
> 
> A "check" operation showed no more problems, but "raid6test"
> still reported a large number of errors like these:

Hi Wolfgang,

this surprises me quite a lot; the two checks should
have similar results. The only algorithmic difference
I know of is that raid6check reports "per stripe",
while the in-kernel check should report "per block".
 
> ...
> P(4) wrong at 10291
> Q(5) wrong at 10291
> Error detected at 10291: disk slot unknown
> P(3) wrong at 10292
> Q(4) wrong at 10292
> Error detected at 10292: disk slot unknown
> P(2) wrong at 10293
> Q(3) wrong at 10293
> Error detected at 10293: disk slot unknown
> ...
> 
> After running a "repair" on the array, both "check" and "raid6test"
> would not report any further issues.

Which is again a surprise, if the repair changed
the parities, then the raid6check should complain,
if before it was not.
 
This confuses me a lot, I think Neil Brown or
H. Peter Anvin should comment on this situation.

bye.

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-31 17:14                       ` Chris Murphy
@ 2013-01-31 17:51                         ` Piergiorgio Sartor
  2013-01-31 18:36                         ` Wolfgang Denk
  1 sibling, 0 replies; 52+ messages in thread
From: Piergiorgio Sartor @ 2013-01-31 17:51 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Wolfgang Denk, Piergiorgio Sartor, linux-raid

On Thu, Jan 31, 2013 at 10:14:35AM -0700, Chris Murphy wrote:
> 
> On Jan 31, 2013, at 5:12 AM, Wolfgang Denk <wd@denx.de> wrote:
> > 
> > After running a "repair" on the array, both "check" and "raid6test"
> > would not report any further issues.
> 
> Yes but this would be consistent with a derivative parity, written to disk and then checked against an algorithm that expects derivative parity. What happens if you go back to the old kernel before all the problems were happening and you do a check? What happens if you go back to a Fedora kernel you know exhibited the problem and you do a check?
> 
> Question for Piergiorgio is if check and raid6test use the same, or independent, code for checking parity?

Hi Chris,

the code base is different, namely raid6check
uses plain C, with no optimizations (that's
why it is so sloooow), while the kernel code has
different paths, depending on which optimization
is available and best.
The algorithm should be the same.

Nevertheless, the tests I ran, intentionally
and unintentionally, on raid6check were always
consistent, so this surprises me.
 
> > I'll continue to watch this for a while, but I think I will not
> > "update" to a Fedora kernel for some time...
> 
> 
> I think a bug needs to be filed with the information you have thus far.

I fully agree with this, what was seen there
is for sure not normal.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-31 17:14                       ` Chris Murphy
  2013-01-31 17:51                         ` Piergiorgio Sartor
@ 2013-01-31 18:36                         ` Wolfgang Denk
  2013-01-31 19:35                           ` Chris Murphy
  1 sibling, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-31 18:36 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Piergiorgio Sartor, linux-raid

Dear Chris Murphy,

In message <3DB28596-B1D6-48A9-9520-4CF9D367E39D@colorremedies.com> you wrote:
> 
> > After running a "repair" on the array, both "check" and "raid6test"
> > would not report any further issues.
> 
> Yes but this would be consistent with a derivative parity, written
> to disk and then checked against an algorithm that expects derivative
> parity. What happens if you go back to the old kernel before all the
> problems were happening and you do a check? What happens if you go
> back to a Fedora kernel you know exhibited the problem and you do a
> check?

I cannot test the exact old kernel I was running before any more;
Fedora has released an update in the meantime, and they do not keep
older updates around, only the very latest one - which is the same
version that causes the problems.  When using the (really old) kernel
from the installation media, I see the same behaviour as with current
mainline: I have to run a "repair", and then the array is, and
remains, clean.

With the current Fedora kernel, the first check will report errors
which do not go away permanently, not even with a "repair".

> Question for Piergiorgio is if check and raid6test use the same, or
> independent, code for checking parity?

My impression is that they must use different code - raid6test takes
much, much longer and causes a much higher CPU load than running
"check".

> I think a bug needs to be filed with the information you have thus far.

I did this, actually in parallel with reporting the issues here:
https://bugzilla.redhat.com/show_bug.cgi?id=904831

I think the relevant Fedora people are on Cc:, but there was zero
response so far; seems potential data loss is of no concern to the
Fedora project :-(

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
You don't have to worry about me. I might have been born yesterday...
but I stayed up all night.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-31 18:36                         ` Wolfgang Denk
@ 2013-01-31 19:35                           ` Chris Murphy
  2013-01-31 19:46                             ` Chris Murphy
  2013-01-31 20:05                             ` Wolfgang Denk
  0 siblings, 2 replies; 52+ messages in thread
From: Chris Murphy @ 2013-01-31 19:35 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid


On Jan 31, 2013, at 11:36 AM, Wolfgang Denk <wd@denx.de> wrote:

> I cannot test the exact old kernel I was running before any more;
> Fedora has released an update in the meantime, and they do not keep
> older updates around

They are in koji.



> When using the (really old) kernel
> from the installation media, I see the same behaviour as with current
> mainline: I have to run a "repair", and then the array is, and
> remains, clean.
> 
> With the current Fedora kernel, the first check will report errors
> which do not go away permanently, not even with a "repair".

This is 3.7.4-204?

Hypothetically this should be reproducible by anyone using that kernel, by creating a new raid6, running repair, and then running check and confirming the mismatch count is non-zero.
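
A throw-away setup for that (sparse files on loop devices instead of
real disks; sizes, file names and the md9 device are only placeholders)
could look like this:

  # for i in 0 1 2 3 4 5; do truncate -s 1G /tmp/d$i.img; losetup /dev/loop$i /tmp/d$i.img; done
  # mdadm --create /dev/md9 --level=6 --raid-devices=6 /dev/loop[0-5]
  (wait for the initial resync to finish)
  # echo repair > /sys/block/md9/md/sync_action
  (wait for the repair to finish)
  # echo check > /sys/block/md9/md/sync_action
  # cat /sys/block/md9/md/mismatch_cnt

A non-zero count after that last step, on a freshly created and
repaired array, would confirm the bug.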

> 
> I did this, actually in parallel with reporting the issues here:
> https://bugzilla.redhat.com/show_bug.cgi?id=904831

I defer to others but I'm not sure if the component is mdadm or if it's kernel, in this case. You've only changed kernels from 3.7.4-204 to 3.8-rc5, not mdadm.


> I think the relevant Fedora people are on Cc:, but there was zero
> response so far; seems potential data loss is of no concern to the
> Fedora project :-(

It's not atypical for there to be delays in responding to such things when it's not widespread, and as yet no one else has reproduced it even on this list.

So we need reproducers, and they need to comment on the bug. Also a step by step to reproduce the bug is important, to help get people to try to reproduce it.


Chris Murphy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-31 19:35                           ` Chris Murphy
@ 2013-01-31 19:46                             ` Chris Murphy
  2013-01-31 20:05                             ` Wolfgang Denk
  1 sibling, 0 replies; 52+ messages in thread
From: Chris Murphy @ 2013-01-31 19:46 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid


On Jan 31, 2013, at 12:35 PM, Chris Murphy <lists@colorremedies.com> wrote:

> 
> On Jan 31, 2013, at 11:36 AM, Wolfgang Denk <wd@denx.de> wrote:
> 
>> I cannot test the exact old kernel I was running before any more;
>> Fedora has released an update in the meantime, and they do not keep
>> older updates around
> 
> They are in koji.

Also there are 3.8-rc kernels in koji as well. So you could try one of those and see if this is a Fedora specific bug. But I doubt it.


Chris Murphy


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-31 19:35                           ` Chris Murphy
  2013-01-31 19:46                             ` Chris Murphy
@ 2013-01-31 20:05                             ` Wolfgang Denk
  2013-01-31 20:41                               ` Chris Murphy
  1 sibling, 1 reply; 52+ messages in thread
From: Wolfgang Denk @ 2013-01-31 20:05 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Piergiorgio Sartor, linux-raid

Dear Chris,

In message <1C562BB9-9FD8-482C-BE3E-E52A027A5C88@colorremedies.com> you wrote:
> 
> > I cannot test the exact old kernel I was running before any more;
> > Fedora has released an update in the meantime, and they do not keep
> > older updates around
> 
> They are in koji.

Ah!  Do you have any pointer for me how to access stuff there?

> > With the current Fedora kernel, the first check will report errors
> > which do not go away permanently, not even with a "repair".
> 
> This is 3.7.4-204?

Correct.

> Hypothetically this should be reproducible by anyone using that kernel, by creating a new raid6, running repair, and then running check and confirming the mismatch count is non-zero.

Actually just running check should be sufficient - that was how I
discovered the issue: I received warning mails from the "raid-check"
cron job.

> > I did this, actually in parallel with reporting the issues here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=904831
> 
> I defer to others but I'm not sure if the component is mdadm or if it's kernel, in this case. You've only changed kernels from 3.7.4-204 to 3.8-rc5, not mdadm.

True. That was not clear to me initially, so I was looking for
"something RAID6 related".  Fixed now - thanks for pointing out.

> > I think the relevant Fedora people are on Cc:, but there was zero
> > response so far; seems potential data loss is of no concern to the
> > Fedora project :-(
> 
> It's not atypical for there to be delays in responding to such things when it's not widespread, and as yet no one else has reproduced it even on this list.

This is what surprises me most.  I would have expected at least some
"me too!" by now...

> So we need reproducers, and they need to comment on the bug. Also a step by step to reproduce the bug is important, to help get people to try to reproduce it.

For me it is sufficient to:

- boot the 3.7.4-204
- bring up the array
- run "check"

It comes up with mismatch_cnt=0 after boot, and some huge number while
/ after the check.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Die Freiheit des Menschen liegt nicht darin, dass er tun kann, was er
will, sondern darin, dass er nicht tun muss, was er nicht will.
                                             -- Jean-Jacques Rousseau

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18
  2013-01-31 20:05                             ` Wolfgang Denk
@ 2013-01-31 20:41                               ` Chris Murphy
  0 siblings, 0 replies; 52+ messages in thread
From: Chris Murphy @ 2013-01-31 20:41 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Piergiorgio Sartor, linux-raid


On Jan 31, 2013, at 1:05 PM, Wolfgang Denk <wd@denx.de> wrote:
> 
> Ah!  Do you have any pointer for me how to access stuff there?
http://koji.fedoraproject.org/koji/
Type in kernel for search.
Click on a result that has a green check for state.
Find the rpm you want, right-click to download.

You can use rpm -ivh --force to install an older kernel.
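
If you prefer the command line, the koji client (package "koji") can
fetch a build directly; the NVR below is just the old Fedora 17 kernel
you mentioned earlier, adjust as needed:

  # yum install koji
  # koji download-build --arch=x86_64 kernel-3.6.11-1.fc17
  # rpm -ivh --force kernel-3.6.11-1.fc17.x86_64.rpm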


> 
>>> With the current Fedora kernel, the first check will report errors
>>> which do not go away permanently, not even with a "repair".
>> 
>> This is 3.7.4-204?
> 
> Correct.
> 
>> Hypothetically this should be reproducible by anyone using that kernel, by creating a new raid6, running repair, and then running check and confirming thee mismatch count is non-zero.
> 
> Actually just running check should be sufficient - that was how I
> discovered the issue: I received warning mails from the "raid-check"
> cron job.

But check alone doesn't reproduce the problem, because the check *might* be finding valid mismatches, however unlikely on a new raid array. The problem is doing a repair, which should fix any mismatches, and then still finding mismatches with a subsequent check; that is a huge bug in the scrub process for sure, and if some parity is actually being computed wrong, or written to disk wrong, then it's a possible data loss bug in addition, not just (maybe bogus) errors.


> This is what surprises me most.  I would have expected at least some
> "me too!" by now…

If there were a study that said 95% of users using md raid never do scrubs, either check or repair, I would not be surprised at all.

> For me it is sufficient to:
> 
> - boot the 3.7.4-204
> - bring up the array
> - run "check"
> 
> It comes up with mismatch_cnt=0 after boot, and some huge number while
> / after the check.

If check is non-zero following a repair, with one kernel version but not another, it's a software bug.

Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2013-01-31 20:41 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-27 19:26 Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18 Wolfgang Denk
2013-01-27 19:45 ` Chris Murphy
2013-01-27 23:10   ` Wolfgang Denk
2013-01-27 20:05 ` Robin Hill
2013-01-27 23:11   ` Wolfgang Denk
2013-01-28  1:14 ` Phil Turmel
2013-01-28  1:42   ` Chris Murphy
2013-01-28  2:16     ` Chris Murphy
2013-01-28  6:43       ` Wolfgang Denk
2013-01-28  6:36     ` Wolfgang Denk
2013-01-28  7:00       ` Chris Murphy
2013-01-28 10:27         ` Wolfgang Denk
2013-01-28  6:27   ` Wolfgang Denk
2013-01-28  2:07 ` Brad Campbell
2013-01-28  6:39   ` Wolfgang Denk
2013-01-28  7:58     ` Dan Williams
2013-01-28 17:37 ` Piergiorgio Sartor
2013-01-28 18:12   ` Chris Murphy
2013-01-28 19:00   ` Wolfgang Denk
2013-01-28 19:10     ` Wolfgang Denk
2013-01-28 19:22       ` Piergiorgio Sartor
2013-01-28 20:19         ` Wolfgang Denk
2013-01-28 20:44           ` Piergiorgio Sartor
2013-01-28 22:47             ` Chris Murphy
2013-01-28 22:49               ` Chris Murphy
2013-01-28 23:03                 ` Wolfgang Denk
2013-01-28 23:13                   ` Chris Murphy
2013-01-28 23:31                     ` Wolfgang Denk
2013-01-28 22:59               ` Wolfgang Denk
2013-01-28 23:07                 ` Chris Murphy
2013-01-28 23:23                   ` Wolfgang Denk
2013-01-28 23:42                     ` Chris Murphy
2013-01-29 18:02                     ` Roy Sigurd Karlsbakk
2013-01-29 18:28                       ` Wolfgang Denk
2013-01-29 18:43                         ` Roy Sigurd Karlsbakk
2013-01-29 17:49               ` Piergiorgio Sartor
2013-01-29 19:35                 ` Paul Menzel
2013-01-29 20:18                   ` Piergiorgio Sartor
2013-01-28 23:18             ` Wolfgang Denk
2013-01-29 17:57               ` Piergiorgio Sartor
2013-01-29 18:43                 ` Wolfgang Denk
2013-01-29 20:24                   ` Piergiorgio Sartor
2013-01-31 12:12                     ` Wolfgang Denk
2013-01-31 17:14                       ` Chris Murphy
2013-01-31 17:51                         ` Piergiorgio Sartor
2013-01-31 18:36                         ` Wolfgang Denk
2013-01-31 19:35                           ` Chris Murphy
2013-01-31 19:46                             ` Chris Murphy
2013-01-31 20:05                             ` Wolfgang Denk
2013-01-31 20:41                               ` Chris Murphy
2013-01-31 17:47                       ` Piergiorgio Sartor
2013-01-28 19:18     ` Piergiorgio Sartor

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.