* mdadm stuck at 0% reshape after grow
@ 2017-12-05  9:41 Jeremy Graham
  2017-12-05 10:56 ` Wols Lists
  2017-12-05 15:55 ` 002
  0 siblings, 2 replies; 26+ messages in thread
From: Jeremy Graham @ 2017-12-05  9:41 UTC (permalink / raw)
  To: linux-raid

Hello,

After working through almost every Google result for "mdadm reshape stuck"
and its variants, I am calling on the experts for help.

History:
I have had a Linux RAID5 array running mostly issue-free for many
years now. It started with 3 x 3TB drives and has grown over the years
to 5 disks. Each new drive was added without a problem using
mdadm --add followed by mdadm --grow.

I recently purchased disk 6. To be safe I ran a long smartctl
self-test first; it reported no errors, so I proceeded with the usual
mdadm add/grow. This time, however, the reshape got stuck almost
immediately. I have rebooted and re-assembled many times, but each
time I get the same result.
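For reference, each expansion followed the usual sequence (sketched below; /dev/sdX is a placeholder for the new drive, not its actual device name):

```shell
smartctl -t long /dev/sdX                # long SMART self-test on the new drive first
smartctl -l selftest /dev/sdX            # confirm the test completed without error
mdadm --add /dev/md0 /dev/sdX1           # add the new partition as a spare
mdadm --grow /dev/md0 --raid-devices=6   # kick off the 5 -> 6 reshape
cat /proc/mdstat                         # reshape progress should appear here
```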

The reshape never progresses, and the speed eventually drops to
0K/sec (I have let it run for days).

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md0 : active raid5 sdf1[8] sdb1[10] sdc1[7] sde1[9] sdd1[5] sdg1[6]
      11721054208 blocks super 1.2 level 5, 512k chunk, algorithm 2
[6/6] [UUUUUU]
      [>....................]  reshape =  0.0% (120220/2930263552)
finish=9147785.2min speed=5K/sec
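To put that finish estimate in perspective, a quick conversion (taking mdadm's minutes figure at face value):

```python
# Convert mdadm's finish estimate from minutes to years.
finish_min = 9147785.2
years = finish_min / 60 / 24 / 365
print(round(years, 1))  # -> 17.4
```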

What I have tried:
* Force assembling the array
* Purchasing another drive, cloning the new drive onto it with
ddrescue, and assembling with the clone instead
* Upgrading mdadm
* Booting SystemRescueCD and attempting an assemble there
All of the above end the same way: a reshape stuck at 0%.
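For concreteness, those attempts amounted to roughly the following (device names and the ddrescue map path are placeholders, not the ones actually used):

```shell
# Force assemble (after stopping the array)
mdadm --stop /dev/md0
mdadm --assemble --force -v /dev/md0 /dev/sd[bcdefg]1

# Clone the suspect new drive onto its replacement with ddrescue,
# then assemble with the clone in its place
ddrescue -f /dev/sdOLD /dev/sdNEW /root/sdOLD.map
```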

Logs and output:
https://pastebin.com/fi9pUQgD (also reproduced at the end of this
message). Please let me know if I can provide anything else.

Hardware:
* i3, 16GB RAM
* System disk: 256GB SSD
* All RAID drives run off an LSI 9211-8i HBA
* All RAID drives are Seagate NAS drives


If anyone has ideas, I would be eternally grateful!

Cheers,
Jeremy







Directly after reboot
---------------------

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md0 : inactive sdc1[7](S) sdb1[10](S) sde1[9](S) sdd1[5](S) sdg1[6](S)
sdf1[8](S)
      17581584384 blocks super 1.2

unused devices: <none>


$ mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
     Raid Level : raid0
  Total Devices : 6
    Persistence : Superblock is persistent

          State : inactive

  Delta Devices : 1, (-1->0)
      New Level : raid5
     New Layout : left-symmetric
  New Chunksize : 512K

           Name : homer2:1
           UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
         Events : 4830576

    Number   Major   Minor   RaidDevice

       -       8       17        -        /dev/sdb1
       -       8       33        -        /dev/sdc1
       -       8       49        -        /dev/sdd1
       -       8       65        -        /dev/sde1
       -       8       81        -        /dev/sdf1
       -       8       97        -        /dev/sdg1


Assemble the array
------------------

$ export MDADM_GROW_ALLOW_OLD=1

** MDADM_GROW_ALLOW_OLD was set because I suspected the new drive was
faulty and had tried assembling the array with it unplugged **


$ mdadm --stop /dev/md0
mdadm: stopped /dev/md0


$ mdadm --assemble -v /dev/md0 /dev/sd[bcdefg]1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/md0 has an active reshape - checking if critical section
needs to be restored
mdadm: accepting backup with timestamp 1511984338 for array with
timestamp 1512378131
mdadm: backup-metadata found on device-5 but is not needed
mdadm: added /dev/sdg1 to /dev/md0 as 1
mdadm: added /dev/sdd1 to /dev/md0 as 2
mdadm: added /dev/sde1 to /dev/md0 as 3
mdadm: added /dev/sdc1 to /dev/md0 as 4
mdadm: added /dev/sdb1 to /dev/md0 as 5
mdadm: added /dev/sdf1 to /dev/md0 as 0
mdadm: /dev/md0 has been started with 6 drives.


$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md0 : active raid5 sdf1[8] sdb1[10] sdc1[7] sde1[9] sdd1[5] sdg1[6]
      11721054208 blocks super 1.2 level 5, 512k chunk, algorithm 2
[6/6] [UUUUUU]
      [>....................]  reshape =  0.0% (120220/2930263552)
finish=187144.1min speed=260K/sec

unused devices: <none>

** Progress never changes; the speed eventually drops to 0K/sec **


$ mdadm --examine /dev/sd[bcdefg]1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
           Name : homer2:1
  Creation Time : Sun Dec  2 12:04:24 2012
     Raid Level : raid5
   Raid Devices : 6

 Avail Dev Size : 5860528128 (2794.52 GiB 3000.59 GB)
     Array Size : 14651317760 (13972.59 GiB 15002.95 GB)
  Used Dev Size : 5860527104 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1960 sectors, after=1024 sectors
          State : active
    Device UUID : d185c556:52b345e2:244ff3c8:9210bf26

  Reshape pos'n : 491520 (480.00 MiB 503.32 MB)
  Delta Devices : 1 (5->6)

    Update Time : Tue Dec  5 18:30:50 2017
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : b8e49d67 - correct
         Events : 4830578

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 5
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
           Name : homer2:1
  Creation Time : Sun Dec  2 12:04:24 2012
     Raid Level : raid5
   Raid Devices : 6

 Avail Dev Size : 5860528128 (2794.52 GiB 3000.59 GB)
     Array Size : 14651317760 (13972.59 GiB 15002.95 GB)
  Used Dev Size : 5860527104 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=1024 sectors
          State : active
    Device UUID : 96958379:2ca77922:2065aa7b:26b9afae

  Reshape pos'n : 491520 (480.00 MiB 503.32 MB)
  Delta Devices : 1 (5->6)

    Update Time : Tue Dec  5 18:30:50 2017
       Checksum : 56765f4b - correct
         Events : 4830578

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 4
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
           Name : homer2:1
  Creation Time : Sun Dec  2 12:04:24 2012
     Raid Level : raid5
   Raid Devices : 6

 Avail Dev Size : 5860528128 (2794.52 GiB 3000.59 GB)
     Array Size : 14651317760 (13972.59 GiB 15002.95 GB)
  Used Dev Size : 5860527104 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=1024 sectors
          State : active
    Device UUID : b9fc27cb:36f08923:64d5979a:45d24263

  Reshape pos'n : 491520 (480.00 MiB 503.32 MB)
  Delta Devices : 1 (5->6)

    Update Time : Tue Dec  5 18:30:50 2017
       Checksum : 7cab98d9 - correct
         Events : 4830578

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0xc
     Array UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
           Name : homer2:1
  Creation Time : Sun Dec  2 12:04:24 2012
     Raid Level : raid5
   Raid Devices : 6

 Avail Dev Size : 5860528128 (2794.52 GiB 3000.59 GB)
     Array Size : 14651317760 (13972.59 GiB 15002.95 GB)
  Used Dev Size : 5860527104 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1960 sectors, after=1024 sectors
          State : active
    Device UUID : a33aacb4:00d283ef:5715be52:a0678279

  Reshape pos'n : 491520 (480.00 MiB 503.32 MB)
  Delta Devices : 1 (5->6)

    Update Time : Tue Dec  5 18:30:50 2017
  Bad Block Log : 512 entries available at offset 72 sectors - bad
blocks present.
       Checksum : 978e30 - correct
         Events : 4830578

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0xc
     Array UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
           Name : homer2:1
  Creation Time : Sun Dec  2 12:04:24 2012
     Raid Level : raid5
   Raid Devices : 6

 Avail Dev Size : 5860528128 (2794.52 GiB 3000.59 GB)
     Array Size : 14651317760 (13972.59 GiB 15002.95 GB)
  Used Dev Size : 5860527104 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1960 sectors, after=1024 sectors
          State : active
    Device UUID : cabeb84a:b5df18c9:cb378062:be0f8998

  Reshape pos'n : 491520 (480.00 MiB 503.32 MB)
  Delta Devices : 1 (5->6)

    Update Time : Tue Dec  5 18:30:50 2017
  Bad Block Log : 512 entries available at offset 72 sectors - bad
blocks present.
       Checksum : 9f01ea9c - correct
         Events : 4830578

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdg1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
           Name : homer2:1
  Creation Time : Sun Dec  2 12:04:24 2012
     Raid Level : raid5
   Raid Devices : 6

 Avail Dev Size : 5860528128 (2794.52 GiB 3000.59 GB)
     Array Size : 14651317760 (13972.59 GiB 15002.95 GB)
  Used Dev Size : 5860527104 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=1024 sectors
          State : active
    Device UUID : 35c83ff2:9c960c86:4982f67a:e4720c3b

  Reshape pos'n : 491520 (480.00 MiB 503.32 MB)
  Delta Devices : 1 (5->6)

    Update Time : Tue Dec  5 18:30:50 2017
       Checksum : be6e5840 - correct
         Events : 4830578

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)

** Reshape pos'n is always: 491520 (480.00 MiB 503.32 MB) **
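A quick sanity check on the sizes above (assuming md's "blocks" are KiB, which the GiB/GB renderings bear out):

```python
used_dev_kib = 2930263552   # Used Dev Size per member (KiB)

# /proc/mdstat still reports the pre-grow capacity
# (5 drives -> 4 data disks)...
assert used_dev_kib * 4 == 11721054208

# ...while --examine already carries the post-grow target
# (6 drives -> 5 data disks).
assert used_dev_kib * 5 == 14651317760

# The frozen reshape position: 491520 KiB = 480 MiB,
# i.e. only ~0.017% of a member reshaped.
assert 491520 / 1024 == 480.0
print(100 * 491520 / used_dev_kib)  # ~0.017
```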


$ mdadm --examine /dev/sd[bcdefg]
/dev/sdb:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
/dev/sdc:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
/dev/sdd:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
/dev/sde:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
/dev/sdf:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
/dev/sdg:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)


$ mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sun Dec  2 12:04:24 2012
     Raid Level : raid5
     Array Size : 11721054208 (11178.07 GiB 12002.36 GB)
  Used Dev Size : 2930263552 (2794.52 GiB 3000.59 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Tue Dec  5 18:30:50 2017
          State : active, reshaping
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

 Reshape Status : 0% complete
  Delta Devices : 1, (5->6)

           Name : homer2:1
           UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
         Events : 4830578

    Number   Major   Minor   RaidDevice State
       8       8       81        0      active sync   /dev/sdf1
       6       8       97        1      active sync   /dev/sdg1
       5       8       49        2      active sync   /dev/sdd1
       9       8       65        3      active sync   /dev/sde1
       7       8       33        4      active sync   /dev/sdc1
      10       8       17        5      active sync   /dev/sdb1


$ dmesg
[69979.860466] md: md0 stopped.
[69979.884579] md: bind<sdg1>
[69979.903373] md: bind<sdd1>
[69979.903750] md: bind<sde1>
[69979.903939] md: bind<sdc1>
[69979.922600] md: bind<sdb1>
[69979.922958] md: bind<sdf1>
[69979.932462] md/raid:md0: not clean -- starting background reconstruction
[69979.932467] md/raid:md0: reshape will continue
[69979.932486] md/raid:md0: device sdf1 operational as raid disk 0
[69979.932488] md/raid:md0: device sdb1 operational as raid disk 5
[69979.932489] md/raid:md0: device sdc1 operational as raid disk 4
[69979.932490] md/raid:md0: device sde1 operational as raid disk 3
[69979.932491] md/raid:md0: device sdd1 operational as raid disk 2
[69979.932493] md/raid:md0: device sdg1 operational as raid disk 1
[69979.932912] md/raid:md0: allocated 6490kB
[69979.932943] md/raid:md0: raid level 5 active with 6 out of 6
devices, algorithm 2
[69979.932945] RAID conf printout:
[69979.932947]  --- level:5 rd:6 wd:6
[69979.932948]  disk 0, o:1, dev:sdf1
[69979.932950]  disk 1, o:1, dev:sdg1
[69979.932951]  disk 2, o:1, dev:sdd1
[69979.932952]  disk 3, o:1, dev:sde1
[69979.932954]  disk 4, o:1, dev:sdc1
[69979.932955]  disk 5, o:1, dev:sdb1
[69979.933007] md0: detected capacity change from 0 to 12002359508992
[69979.933130] md: reshape of RAID array md0
[69979.933132] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[69979.933134] md: using maximum available idle IO bandwidth (but not
more than 200000 KB/sec) for reshape.
[69979.933139] md: using 128k window, over a total of 2930263552k.
[70197.635112] INFO: task md0_reshape:30529 blocked for more than 120 seconds.
[70197.635142]       Not tainted 4.4.0-101-generic #124-Ubuntu
[70197.635161] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[70197.635187] md0_reshape     D ffff88011da37aa8     0 30529      2 0x00000000
[70197.635191]  ffff88011da37aa8 ffff88011da37a78 ffff880214a40e00
ffff880210577000
[70197.635193]  ffff88011da38000 ffff8800d49de424 ffff8800d49de658
ffff8800d49de638
[70197.635194]  ffff8800d49de670 ffff88011da37ac0 ffffffff818406d5
ffff8800d49de400
[70197.635196] Call Trace:
[70197.635202]  [<ffffffff818406d5>] schedule+0x35/0x80
[70197.635206]  [<ffffffffc034045f>]
raid5_get_active_stripe+0x31f/0x700 [raid456]
[70197.635210]  [<ffffffff810c4420>] ? wake_atomic_t_function+0x60/0x60
[70197.635212]  [<ffffffffc0344da4>] reshape_request+0x584/0x950 [raid456]
[70197.635215]  [<ffffffff810a9c6a>] ? finish_task_switch+0x7a/0x220
[70197.635218]  [<ffffffffc034548c>] sync_request+0x31c/0x3a0 [raid456]
[70197.635219]  [<ffffffff81840026>] ? __schedule+0x3b6/0xa30
[70197.635222]  [<ffffffff814102b5>] ? find_next_bit+0x15/0x20
[70197.635225]  [<ffffffff81710bb1>] ? is_mddev_idle+0x9c/0xfa
[70197.635227]  [<ffffffff816adbbc>] md_do_sync+0x89c/0xe60
[70197.635229]  [<ffffffff810c4420>] ? wake_atomic_t_function+0x60/0x60
[70197.635231]  [<ffffffff816aa319>] md_thread+0x139/0x150
[70197.635233]  [<ffffffff810c4420>] ? wake_atomic_t_function+0x60/0x60
[70197.635234]  [<ffffffff816aa1e0>] ? find_pers+0x70/0x70
[70197.635236]  [<ffffffff810a0c75>] kthread+0xe5/0x100
[70197.635237]  [<ffffffff810a0b90>] ? kthread_create_on_node+0x1e0/0x1e0
[70197.635239]  [<ffffffff81844b8f>] ret_from_fork+0x3f/0x70
[70197.635241]  [<ffffffff810a0b90>] ? kthread_create_on_node+0x1e0/0x1e0
[70317.630767] INFO: task md0_reshape:30529 blocked for more than 120 seconds.
[70317.630796]       Not tainted 4.4.0-101-generic #124-Ubuntu
[70317.630815] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[70317.630841] md0_reshape     D ffff88011da37aa8     0 30529      2 0x00000000
[70317.630844]  ffff88011da37aa8 ffff88011da37a78 ffff880214a40e00
ffff880210577000
[70317.630846]  ffff88011da38000 ffff8800d49de424 ffff8800d49de658
ffff8800d49de638
[70317.630848]  ffff8800d49de670 ffff88011da37ac0 ffffffff818406d5
ffff8800d49de400
[70317.630850] Call Trace:
[70317.630855]  [<ffffffff818406d5>] schedule+0x35/0x80
[70317.630860]  [<ffffffffc034045f>]
raid5_get_active_stripe+0x31f/0x700 [raid456]
[70317.630863]  [<ffffffff810c4420>] ? wake_atomic_t_function+0x60/0x60
[70317.630866]  [<ffffffffc0344da4>] reshape_request+0x584/0x950 [raid456]
[70317.630868]  [<ffffffff810a9c6a>] ? finish_task_switch+0x7a/0x220
[70317.630871]  [<ffffffffc034548c>] sync_request+0x31c/0x3a0 [raid456]
[70317.630872]  [<ffffffff81840026>] ? __schedule+0x3b6/0xa30
[70317.630876]  [<ffffffff814102b5>] ? find_next_bit+0x15/0x20
[70317.630878]  [<ffffffff81710bb1>] ? is_mddev_idle+0x9c/0xfa
[70317.630881]  [<ffffffff816adbbc>] md_do_sync+0x89c/0xe60
[70317.630883]  [<ffffffff810c4420>] ? wake_atomic_t_function+0x60/0x60
[70317.630885]  [<ffffffff816aa319>] md_thread+0x139/0x150
[70317.630886]  [<ffffffff810c4420>] ? wake_atomic_t_function+0x60/0x60
[70317.630888]  [<ffffffff816aa1e0>] ? find_pers+0x70/0x70
[70317.630890]  [<ffffffff810a0c75>] kthread+0xe5/0x100
[70317.630892]  [<ffffffff810a0b90>] ? kthread_create_on_node+0x1e0/0x1e0
[70317.630894]  [<ffffffff81844b8f>] ret_from_fork+0x3f/0x70
[70317.630896]  [<ffffffff810a0b90>] ? kthread_create_on_node+0x1e0/0x1e0


$ uname -a
Linux homer 4.4.0-101-generic #124-Ubuntu SMP Fri Nov 10 18:29:59 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux


$ mdadm --version
mdadm - v3.4 - 28th January 2016


$ ps aux | grep md0
root      2086  0.0  0.0  14220   940 pts/0    S+   18:49   0:00 grep
--color=auto md0
root     30528 99.9  0.0      0     0 ?        R    18:30  18:35 [md0_raid5]
root     30529  0.0  0.0      0     0 ?        D    18:30   0:00 [md0_reshape]

** md0_raid5 always at 100%, md0_reshape always at 0% **


$ cat /sys/block/md0/md/stripe_cache_size
513

$ echo 16384 > /sys/block/md0/md/stripe_cache_size

$ cat /sys/block/md0/md/stripe_cache_size
16384
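For completeness, the other md sysfs knobs commonly checked when a resync or reshape stalls (a sketch; the values in the comments are kernel defaults, not readings from this system):

```shell
cat /sys/block/md0/md/sync_max         # normally "max"; a number here clamps the reshape
cat /sys/block/md0/md/sync_speed_min   # per-disk floor, default 1000 (KB/sec)
cat /sys/block/md0/md/sync_speed_max   # per-disk ceiling, default 200000 (KB/sec)
echo max > /sys/block/md0/md/sync_max  # lift any clamp on the reshape position
```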


$ iostat
Linux 4.4.0-101-generic (homer)         12/05/2017      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.71    0.03    1.11    0.37    0.00   95.78

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             122.04       126.50      1068.10    9006825   76047672
sdb               0.00         0.04         0.31       2543      21918
sdc               0.01         0.47         0.31      33113      21918
sdd               0.01         0.47         0.31      33141      21918
sdf               0.01         0.47         0.31      33665      21922
sde               0.01         0.47         0.31      33637      21922
sdg               0.01         0.47         0.31      33125      21918
md0               0.00         0.01         0.00       1036          0


$ lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:0    0 223.6G  0 disk
├─sda1    8:1    0 215.7G  0 part  /
├─sda2    8:2    0     1K  0 part
└─sda5    8:5    0   7.9G  0 part  [SWAP]
sdb       8:16   0   2.7T  0 disk
└─sdb1    8:17   0   2.7T  0 part
  └─md0   9:0    0  10.9T  0 raid5
sdc       8:32   0   2.7T  0 disk
└─sdc1    8:33   0   2.7T  0 part
  └─md0   9:0    0  10.9T  0 raid5
sdd       8:48   0   2.7T  0 disk
└─sdd1    8:49   0   2.7T  0 part
  └─md0   9:0    0  10.9T  0 raid5
sde       8:64   0   2.7T  0 disk
└─sde1    8:65   0   2.7T  0 part
  └─md0   9:0    0  10.9T  0 raid5
sdf       8:80   0   2.7T  0 disk
└─sdf1    8:81   0   2.7T  0 part
  └─md0   9:0    0  10.9T  0 raid5
sdg       8:96   0   2.7T  0 disk
└─sdg1    8:97   0   2.7T  0 part
  └─md0   9:0    0  10.9T  0 raid5


$ for disk in {b,c,d,e,f,g}; do smartctl -a /dev/sd${disk}1; done
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.0-101-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST3000VN007-2E4166
Serial Number:    Z6A0Z4MY
LU WWN Device Id: 5 000c50 0a3c31b28
Firmware Version: SC60
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec  5 19:00:40 2017 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection
on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 376) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   006    Pre-fail
Always       -       209008
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail
Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age
Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail
Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail
Always       -       45978
  9 Power_On_Hours          0x0032   100   100   000    Old_age
Always       -       78
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail
Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age
Always       -       4
184 End-to-End_Error        0x0032   100   100   099    Old_age
Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age
Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age
Always       -       0
189 High_Fly_Writes         0x003a   088   088   000    Old_age
Always       -       12
190 Airflow_Temperature_Cel 0x0022   061   058   045    Old_age
Always       -       39 (Min/Max 38/40)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age
Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age
Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age
Always       -       4
194 Temperature_Celsius     0x0022   039   042   000    Old_age
Always       -       39 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age
Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age
Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age
Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining
LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        78         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.0-101-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate NAS HDD
Device Model:     ST3000VN000-1HJ166
Serial Number:    W6A077L9
LU WWN Device Id: 5 000c50 07d588414
Firmware Version: SC60
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec  5 19:00:40 2017 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection
on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 389) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail
Always       -       163598304
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail
Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age
Always       -       42
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail
Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail
Always       -       108755626
  9 Power_On_Hours          0x0032   074   074   000    Old_age
Always       -       23572
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail
Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age
Always       -       42
184 End-to-End_Error        0x0032   100   100   099    Old_age
Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age
Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age
Always       -       0
189 High_Fly_Writes         0x003a   019   019   000    Old_age
Always       -       81
190 Airflow_Temperature_Cel 0x0022   067   060   045    Old_age
Always       -       33 (Min/Max 32/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age
Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age
Always       -       26
193 Load_Cycle_Count        0x0032   100   100   000    Old_age
Always       -       1011
194 Temperature_Celsius     0x0022   033   040   000    Old_age
Always       -       33 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age
Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age
Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age
Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining
LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     12035         -
# 2  Short offline       Completed without error       00%     11911         -
# 3  Short offline       Completed without error       00%     10759         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.0-101-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate NAS HDD
Device Model:     ST3000VN000-1HJ166
Serial Number:    W6A07K5E
LU WWN Device Id: 5 000c50 07d5cca73
Firmware Version: SC60
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec  5 19:00:41 2017 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   97) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 386) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       85334928
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       60
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       107935868
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       23904
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       60
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       118
190 Airflow_Temperature_Cel 0x0022   066   058   045    Old_age   Always       -       34 (Min/Max 33/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       39
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1043
194 Temperature_Celsius     0x0022   034   042   000    Old_age   Always       -       34 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     12367         -
# 2  Short offline       Completed without error       00%     12243         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.0-101-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate NAS HDD
Device Model:     ST3000VN000-1HJ166
Serial Number:    W6A1GLG8
LU WWN Device Id: 5 000c50 09b618897
Firmware Version: SC60
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec  5 19:00:41 2017 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 379) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       169536040
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   076   060   030    Pre-fail  Always       -       46431252
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12765
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       27
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   083   083   000    Old_age   Always       -       17
190 Airflow_Temperature_Cel 0x0022   065   059   045    Old_age   Always       -       35 (Min/Max 34/36)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       62
194 Temperature_Celsius     0x0022   035   041   000    Old_age   Always       -       35 (0 19 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1229         -
# 2  Short offline       Completed without error       00%      1105         -
# 3  Short offline       Completed without error       00%      1090         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.0-101-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate NAS HDD
Device Model:     ST3000VN000-1HJ166
Serial Number:    W6A0TLA6
LU WWN Device Id: 5 000c50 08b715e4e
Firmware Version: SC60
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec  5 19:00:41 2017 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 387) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       83913088
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       31
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   076   060   030    Pre-fail  Always       -       47812930
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12820
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       31
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   034   034   000    Old_age   Always       -       66
190 Airflow_Temperature_Cel 0x0022   063   057   045    Old_age   Always       -       37 (Min/Max 35/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       23
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       559
194 Temperature_Celsius     0x0022   037   043   000    Old_age   Always       -       37 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1284         -
# 2  Short offline       Completed without error       00%      1160         -
# 3  Short offline       Completed without error       00%      1145         -
# 4  Short offline       Completed without error       00%         8         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.0-101-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate NAS HDD
Device Model:     ST3000VN000-1HJ166
Serial Number:    W6A07KJY
LU WWN Device Id: 5 000c50 07d5cb95b
Firmware Version: SC60
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec  5 19:00:41 2017 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 392) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       109735048
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       48
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       108007271
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       23829
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       48
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       8
189 High_Fly_Writes         0x003a   015   015   000    Old_age   Always       -       85
190 Airflow_Temperature_Cel 0x0022   063   058   045    Old_age   Always       -       37 (Min/Max 35/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1029
194 Temperature_Celsius     0x0022   037   042   000    Old_age   Always       -       37 (0 24 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   196   000    Old_age   Always       -       3577

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     12293         -
# 2  Short offline       Completed without error       00%     12169         -
# 3  Extended offline    Completed without error       00%       475         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

end

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-05  9:41 mdadm stuck at 0% reshape after grow Jeremy Graham
@ 2017-12-05 10:56 ` Wols Lists
  2017-12-05 15:49   ` Nix
  2017-12-05 15:55 ` 002
  1 sibling, 1 reply; 26+ messages in thread
From: Wols Lists @ 2017-12-05 10:56 UTC (permalink / raw)
  To: Jeremy Graham, linux-raid

On 05/12/17 09:41, Jeremy Graham wrote:
> $ mdadm --version
> mdadm - v3.4 - 28th January 2016

Won't do any harm to try the latest version, but this could well be the
problem.

https://raid.wiki.kernel.org/index.php/Linux_Raid

That'll tell you where to download the latest mdadm from. This sounds
like a typical problem that people have had, and IIRC upgrading mdadm
often fixes it.
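For reference, a quick way to record what you're starting from before
building a newer one (the git URL is the usual kernel.org mdadm tree;
treat the build steps as a sketch, not a recipe):

```shell
# Capture the currently installed mdadm version (mdadm prints its
# version banner to stderr, hence the redirect).
if command -v mdadm >/dev/null 2>&1; then
  installed=$(mdadm --version 2>&1 | head -n1)
else
  installed="mdadm not installed"
fi
echo "$installed"

# Typical from-source workflow (verify locally before running):
#   git clone https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
#   make -C mdadm && sudo make -C mdadm install
```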

Let us know what the latest mdadm does, and if it doesn't fix it,
someone else will be along with more advice.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-05 10:56 ` Wols Lists
@ 2017-12-05 15:49   ` Nix
  0 siblings, 0 replies; 26+ messages in thread
From: Nix @ 2017-12-05 15:49 UTC (permalink / raw)
  To: Wols Lists; +Cc: Jeremy Graham, linux-raid

On 5 Dec 2017, Wols Lists told this:

> On 05/12/17 09:41, Jeremy Graham wrote:
>> $ mdadm --version
>> mdadm - v3.4 - 28th January 2016
>
> Won't do any harm to try the latest version, but this could well be the
> problem.
>
> https://raid.wiki.kernel.org/index.php/Linux_Raid
>
> That'll tell you where to download the latest mdadm from. This sounds a
> typical problem that people have had, and iirc upgrading mdadm often
> fixes it.

This suggests otherwise:

[69979.933007] md0: detected capacity change from 0 to 12002359508992
[69979.933130] md: reshape of RAID array md0
[69979.933132] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[69979.933134] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
[69979.933139] md: using 128k window, over a total of 2930263552k.
[70197.635112] INFO: task md0_reshape:30529 blocked for more than 120 seconds.
[70197.635142]       Not tainted 4.4.0-101-generic #124-Ubuntu
[70197.635161] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[70197.635187] md0_reshape     D ffff88011da37aa8     0 30529      2 0x00000000
[70197.635191]  ffff88011da37aa8 ffff88011da37a78 ffff880214a40e00 ffff880210577000
[70197.635193]  ffff88011da38000 ffff8800d49de424 ffff8800d49de658 ffff8800d49de638
[70197.635194]  ffff8800d49de670 ffff88011da37ac0 ffffffff818406d5 ffff8800d49de400
[70197.635196] Call Trace:
[70197.635202]  [<ffffffff818406d5>] schedule+0x35/0x80
[70197.635206]  [<ffffffffc034045f>] raid5_get_active_stripe+0x31f/0x700 [raid456]
[70197.635210]  [<ffffffff810c4420>] ? wake_atomic_t_function+0x60/0x60
[70197.635212]  [<ffffffffc0344da4>] reshape_request+0x584/0x950 [raid456]
[70197.635215]  [<ffffffff810a9c6a>] ? finish_task_switch+0x7a/0x220
[70197.635218]  [<ffffffffc034548c>] sync_request+0x31c/0x3a0 [raid456]
[70197.635219]  [<ffffffff81840026>] ? __schedule+0x3b6/0xa30
[70197.635222]  [<ffffffff814102b5>] ? find_next_bit+0x15/0x20
[70197.635225]  [<ffffffff81710bb1>] ? is_mddev_idle+0x9c/0xfa
[70197.635227]  [<ffffffff816adbbc>] md_do_sync+0x89c/0xe60
[70197.635229]  [<ffffffff810c4420>] ? wake_atomic_t_function+0x60/0x60
[70197.635231]  [<ffffffff816aa319>] md_thread+0x139/0x150
[70197.635233]  [<ffffffff810c4420>] ? wake_atomic_t_function+0x60/0x60
[70197.635234]  [<ffffffff816aa1e0>] ? find_pers+0x70/0x70
[70197.635236]  [<ffffffff810a0c75>] kthread+0xe5/0x100
[70197.635237]  [<ffffffff810a0b90>] ? kthread_create_on_node+0x1e0/0x1e0
[70197.635239]  [<ffffffff81844b8f>] ret_from_fork+0x3f/0x70
[70197.635241]  [<ffffffff810a0b90>] ? kthread_create_on_node+0x1e0/0x1e0
[70317.630767] INFO: task md0_reshape:30529 blocked for more than 120 seconds.
[70317.630796]       Not tainted 4.4.0-101-generic #124-Ubuntu
[70317.630815] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

That's a kernel bug, probably a deadlock. *Definitely* try a newer
kernel, 4.14.3 (the latest) if possible. I bet this is fixed by

6ab2a4b806ae21b6c3e47c5ff1285ec06d505325
RAID5: revert e9e4c377e2f563 to fix a livelock

which fixes a bug that behaves exactly like this: the faulty patch was
present from v4.2 to v4.6, and you're in the middle of that range... it
might be worth checking whether the distro kernel you're running has
applied that fix, too.
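One way to check for the fix (sketch only; KSRC is a placeholder path,
and this only works against a git checkout of your kernel source):

```shell
# Look for the livelock fix in a kernel git tree using ancestry:
# if the fix commit is an ancestor of HEAD, the running source has it.
KSRC=${KSRC:-/usr/src/linux}
fix=6ab2a4b806ae21b6c3e47c5ff1285ec06d505325
if [ -d "$KSRC/.git" ]; then
  if git -C "$KSRC" merge-base --is-ancestor "$fix" HEAD 2>/dev/null; then
    status="fix present"
  else
    status="fix missing (or commit unknown to this tree)"
  fi
else
  status="no git tree at $KSRC; cannot check"
fi
echo "$status"
```

Distro kernels often carry backports, so also grep the distro changelog
before concluding anything from a missing commit.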

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-05  9:41 mdadm stuck at 0% reshape after grow Jeremy Graham
  2017-12-05 10:56 ` Wols Lists
@ 2017-12-05 15:55 ` 002
  2017-12-06  2:51   ` Phil Turmel
  1 sibling, 1 reply; 26+ messages in thread
From: 002 @ 2017-12-05 15:55 UTC (permalink / raw)
  To: Jeremy Graham, linux-raid

A well-known reason for this behavior is bad blocks in a device's BBL, which you happen to have:


> $ mdadm --examine /dev/sd[bcdefg]1

> /dev/sde1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0xc
>      Array UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
>            Name : homer2:1
>   Creation Time : Sun Dec 2 12:04:24 2012
>      Raid Level : raid5
>    Raid Devices : 6
>
>  Avail Dev Size : 5860528128 (2794.52 GiB 3000.59 GB)
>      Array Size : 14651317760 (13972.59 GiB 15002.95 GB)
>   Used Dev Size : 5860527104 (2794.52 GiB 3000.59 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>    Unused Space : before=1960 sectors, after=1024 sectors
>           State : active
>     Device UUID : a33aacb4:00d283ef:5715be52:a0678279
>
>   Reshape pos'n : 491520 (480.00 MiB 503.32 MB)
>   Delta Devices : 1 (5->6)
>
>     Update Time : Tue Dec 5 18:30:50 2017
>   Bad Block Log : 512 entries available at offset 72 sectors - bad
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> blocks present.
^^^^^^^^^^^^^^^^^
>        Checksum : 978e30 - correct
>          Events : 4830578
>
>          Layout : left-symmetric
>      Chunk Size : 512K
>
>    Device Role : Active device 3
>    Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
> /dev/sdf1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0xc
>      Array UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
>            Name : homer2:1
>   Creation Time : Sun Dec 2 12:04:24 2012
>      Raid Level : raid5
>    Raid Devices : 6
>
>  Avail Dev Size : 5860528128 (2794.52 GiB 3000.59 GB)
>      Array Size : 14651317760 (13972.59 GiB 15002.95 GB)
>   Used Dev Size : 5860527104 (2794.52 GiB 3000.59 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>    Unused Space : before=1960 sectors, after=1024 sectors
>           State : active
>     Device UUID : cabeb84a:b5df18c9:cb378062:be0f8998
>
>   Reshape pos'n : 491520 (480.00 MiB 503.32 MB)
>   Delta Devices : 1 (5->6)
>
>     Update Time : Tue Dec 5 18:30:50 2017
>   Bad Block Log : 512 entries available at offset 72 sectors - bad
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> blocks present.
^^^^^^^^^^^^^^^^^
>        Checksum : 9f01ea9c - correct
>          Events : 4830578
>
>          Layout : left-symmetric
>      Chunk Size : 512K
>
>    Device Role : Active device 0
>    Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)

This feature generally shouldn't be used, because its implementation is unfinished. Empty BBLs can be removed from every device by passing the "--update=no-bbl" option to mdadm on assemble, but before that you must manually regenerate the content of each block listed in the BBLs and then manually zero the lists in the superblocks.
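To spot which members are flagged without reading each superblock by
eye, something like this works on saved --examine output (just a
sketch; the sample is the unwrapped form of the lines quoted above, and
in real use you would pipe `mdadm --examine /dev/sd[bcdefg]1` in
instead):

```shell
# Print the device name of every --examine section whose Bad Block Log
# line ends in "bad blocks present".
examine_out='/dev/sde1:
          Magic : a92b4efc
   Bad Block Log : 512 entries available at offset 72 sectors - bad blocks present.
/dev/sdb1:
   Bad Block Log : 512 entries available at offset 16 sectors'

flagged=$(printf '%s\n' "$examine_out" | awk '
  /^\/dev\//           { dev = $1; sub(":$", "", dev) }  # new device section
  /bad blocks present/ { print dev }')
echo "$flagged"
```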

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-05 15:55 ` 002
@ 2017-12-06  2:51   ` Phil Turmel
  2017-12-06  4:33     ` Jeremy Graham
  0 siblings, 1 reply; 26+ messages in thread
From: Phil Turmel @ 2017-12-06  2:51 UTC (permalink / raw)
  To: 002, Jeremy Graham, linux-raid

On 12/05/2017 10:55 AM, 002@tut.by wrote:
> A well known reason for this behavior are bad blocks in device's BBL,
> which you happen to have:

[trim /]

> This feature generally shouldn't be used, because its implementation
> is unfinished. Empty BBLs can be removed from every device by passing
> the "--update=no-bbl" option to mdadm on assemble, but before that
> you must manually regenerate the content of each block listed in the
> BBLs and then manually zero the lists in the superblocks.

I endorse this opinion.  The BBL should never have been merged as-is.
Without a block reallocation system like a real hard drive's, a BBL entry
simply drops redundancy on the block in question without any other
corrective action.

Completely, utterly, brain-dead.

At the very least, it should not be enabled by default on new arrays.

Phil

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06  2:51   ` Phil Turmel
@ 2017-12-06  4:33     ` Jeremy Graham
  2017-12-06  7:36       ` Jeremy Graham
  2017-12-06 10:49       ` Andreas Klauer
  0 siblings, 2 replies; 26+ messages in thread
From: Jeremy Graham @ 2017-12-06  4:33 UTC (permalink / raw)
  To: Phil Turmel; +Cc: 002, linux-raid

Thanks very much for the responses so far all!

@Nix
> That's a kernel bug, probably a deadlock. *Definitely* try a newer
> kernel, 4.14.3 (the latest) if possible. I bet this is fixed by

I tried updating the kernel, bumped to "4.12.5-041205-generic" first
but still no dice.

@002, @Phil
> This feature generally shouldn't be used, because its implementation is
> unfinished. Empty BBL's can be removed from every device by giving
> "--update=no-bbl" option to mdadm on assemble, but before that you
> must manually regenerate content for each block in BBL's and then
> manually zero the lists in superblocks.

I am currently running "badblocks -v /dev/sde1 > sde1.badsectors.txt"
and "badblocks -v /dev/sdf1 > sdf1.badsectors.txt" to see what that
comes back with.

How do I "manually regenerate content for each block in BBL's and then
manually zero the lists in superblocks"? This is uncharted (and
slightly scary) territory for me.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06  4:33     ` Jeremy Graham
@ 2017-12-06  7:36       ` Jeremy Graham
  2017-12-06 13:34         ` Wols Lists
  2017-12-06 14:02         ` 002
  2017-12-06 10:49       ` Andreas Klauer
  1 sibling, 2 replies; 26+ messages in thread
From: Jeremy Graham @ 2017-12-06  7:36 UTC (permalink / raw)
  To: Jeremy Graham; +Cc: Phil Turmel, 002, linux-raid

Minor update, badblocks reported no bad blocks:

$ badblocks -v /dev/sdb1 > sdb1.badsectors.txt
Checking blocks 0 to 2930265087
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)

$ badblocks -v /dev/sdf1 > sdf1.badsectors.txt
Checking blocks 0 to 2930265087
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)

$ badblocks -v /dev/sde1 > sde1.badsectors.txt
Checking blocks 0 to 2930265087
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)


However mdadm says otherwise

$ mdadm --examine-badblocks /dev/sd[bcdefg]1
Bad-blocks list is empty in /dev/sdb1
No bad-blocks list configured on /dev/sdc1
No bad-blocks list configured on /dev/sdd1
Bad-blocks on /dev/sde1:
              243656 for 368 sectors
            56643704 for 512 sectors
            56644216 for 248 sectors
            56874144 for 88 sectors
            93973944 for 288 sectors
           515436792 for 512 sectors
           515437304 for 64 sectors
           576966904 for 456 sectors
          1689261664 for 352 sectors
          1689262424 for 512 sectors
          1689262936 for 104 sectors
          2271200520 for 512 sectors
          2271201032 for 440 sectors
          2271214344 for 440 sectors
          2271221320 for 512 sectors
          2271221832 for 120 sectors
          2933440312 for 512 sectors
          2933440824 for 48 sectors
          2965096488 for 400 sectors
          2972488160 for 512 sectors
          2972488672 for 160 sectors
          4462680184 for 312 sectors
          4799622528 for 224 sectors
          4799623104 for 512 sectors
          4799623616 for 160 sectors
          4799626912 for 512 sectors
          4799627424 for 448 sectors
          4799631240 for 512 sectors
          4799631752 for 216 sectors
          4799633568 for 448 sectors
          4799635752 for 312 sectors
          4799638008 for 104 sectors
          4799655704 for 512 sectors
          4799656216 for 512 sectors
          4799656728 for 512 sectors
          4799657240 for 328 sectors
          4799657608 for 512 sectors
          4799658120 for 472 sectors
          4799658600 for 512 sectors
          4799659112 for 504 sectors
          4799659936 for 512 sectors
          4799660448 for 512 sectors
          4799660960 for 512 sectors
          4799661472 for 192 sectors
          4799662192 for 496 sectors
          4799662720 for 512 sectors
          4799663232 for 480 sectors
          4799664344 for 392 sectors
          4799664744 for 512 sectors
          4799665256 for 504 sectors
          4799666536 for 512 sectors
          4799667048 for 392 sectors
          4799667600 for 208 sectors
          4799668800 for 32 sectors
          4799668856 for 512 sectors
          4799669368 for 488 sectors
          4799670984 for 512 sectors
          4799671496 for 408 sectors
          4799671984 for 512 sectors
          4799672496 for 432 sectors
          4799673152 for 512 sectors
          4799673664 for 288 sectors
          4799673960 for 512 sectors
          4799674472 for 504 sectors
          4799675456 for 512 sectors
          4799675968 for 32 sectors
          4799676104 for 512 sectors
          4799676616 for 408 sectors
          4799677624 for 424 sectors
          4799678152 for 512 sectors
          4799678664 for 408 sectors
          4799679824 for 272 sectors
          4799680128 for 512 sectors
          4799680640 for 480 sectors
          4799682112 for 512 sectors
          4799682624 for 512 sectors
          4799684384 for 512 sectors
          4799684896 for 320 sectors
          4799686728 for 512 sectors
          4799687240 for 24 sectors
          4799687336 for 512 sectors
          4799687848 for 440 sectors
          4799721864 for 216 sectors
          4807365632 for 512 sectors
          4807366144 for 8 sectors
          4807562840 for 512 sectors
          4807563352 for 128 sectors
          4829237792 for 88 sectors
          4979524112 for 512 sectors
          4979524624 for 272 sectors
          4979574088 for 512 sectors
          4979574600 for 472 sectors
          4979630520 for 512 sectors
          4979631032 for 360 sectors
          5555126536 for 376 sectors
          5555143496 for 512 sectors
          5555144008 for 312 sectors
          5829752272 for 512 sectors
          5829752784 for 384 sectors
Bad-blocks on /dev/sdf1:
              243656 for 368 sectors
            56643704 for 512 sectors
            56644216 for 248 sectors
            56874144 for 88 sectors
            93973944 for 288 sectors
           515436792 for 512 sectors
           515437304 for 64 sectors
           576966904 for 456 sectors
          1689261664 for 352 sectors
          1689262424 for 512 sectors
          1689262936 for 104 sectors
          2271200520 for 512 sectors
          2271201032 for 440 sectors
          2271214344 for 440 sectors
          2271221320 for 512 sectors
          2271221832 for 120 sectors
          2933440312 for 512 sectors
          2933440824 for 48 sectors
          2965096488 for 400 sectors
          2972488160 for 512 sectors
          2972488672 for 160 sectors
          4462680184 for 312 sectors
          4799622528 for 224 sectors
          4799623104 for 512 sectors
          4799623616 for 160 sectors
          4799626912 for 512 sectors
          4799627424 for 448 sectors
          4799631240 for 512 sectors
          4799631752 for 216 sectors
          4799633568 for 448 sectors
          4799635752 for 312 sectors
          4799638008 for 104 sectors
          4799655704 for 512 sectors
          4799656216 for 512 sectors
          4799656728 for 512 sectors
          4799657240 for 328 sectors
          4799657608 for 512 sectors
          4799658120 for 472 sectors
          4799658600 for 512 sectors
          4799659112 for 504 sectors
          4799659936 for 512 sectors
          4799660448 for 512 sectors
          4799660960 for 512 sectors
          4799661472 for 192 sectors
          4799662192 for 496 sectors
          4799662720 for 512 sectors
          4799663232 for 480 sectors
          4799664344 for 392 sectors
          4799664744 for 512 sectors
          4799665256 for 504 sectors
          4799666536 for 512 sectors
          4799667048 for 392 sectors
          4799667600 for 208 sectors
          4799668800 for 32 sectors
          4799668856 for 512 sectors
          4799669368 for 488 sectors
          4799670984 for 512 sectors
          4799671496 for 408 sectors
          4799671984 for 512 sectors
          4799672496 for 432 sectors
          4799673152 for 512 sectors
          4799673664 for 288 sectors
          4799673960 for 512 sectors
          4799674472 for 504 sectors
          4799675456 for 512 sectors
          4799675968 for 32 sectors
          4799676104 for 512 sectors
          4799676616 for 408 sectors
          4799677624 for 424 sectors
          4799678152 for 512 sectors
          4799678664 for 408 sectors
          4799679824 for 272 sectors
          4799680128 for 512 sectors
          4799680640 for 480 sectors
          4799682112 for 512 sectors
          4799682624 for 512 sectors
          4799684384 for 512 sectors
          4799684896 for 320 sectors
          4799686728 for 512 sectors
          4799687240 for 24 sectors
          4799687336 for 512 sectors
          4799687848 for 440 sectors
          4799721864 for 216 sectors
          4807365632 for 512 sectors
          4807366144 for 8 sectors
          4807562840 for 512 sectors
          4807563352 for 128 sectors
          4829237792 for 88 sectors
          4979524112 for 512 sectors
          4979524624 for 272 sectors
          4979574088 for 512 sectors
          4979574600 for 472 sectors
          4979630520 for 512 sectors
          4979631032 for 360 sectors
          5555126536 for 376 sectors
          5555143496 for 512 sectors
          5555144008 for 312 sectors
          5829752272 for 512 sectors
          5829752784 for 384 sectors
No bad-blocks list configured on /dev/sdg1
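
For a sense of scale, those ranges can be totalled with a quick awk pass
over the `--examine-badblocks` output (a hypothetical helper; the two
inline entries here are just samples copied from the list above, pipe the
real output in instead):

```shell
# Sum the "N for M sectors" lines; awk field $3 is the sector count.
total=$(printf '%s\n' \
    '      243656 for 368 sectors' \
    '    56643704 for 512 sectors' \
  | awk '/for [0-9]+ sectors$/ { s += $3 } END { print s }')
echo "$total sectors = $(( total * 512 )) bytes marked bad"
```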


I'm pretty sure the bad blocks are the result of a faulty SATA cable
about 2 years ago.

As there are no bad blocks on the devices themselves (they are only
reported by mdadm), would it be safe to try:
mdadm --assemble -v /dev/md0 /dev/sd[bcdefg]1 --update=no-bbl







^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06  4:33     ` Jeremy Graham
  2017-12-06  7:36       ` Jeremy Graham
@ 2017-12-06 10:49       ` Andreas Klauer
  2017-12-06 14:15         ` Phil Turmel
  1 sibling, 1 reply; 26+ messages in thread
From: Andreas Klauer @ 2017-12-06 10:49 UTC (permalink / raw)
  To: Jeremy Graham; +Cc: linux-raid

On Wed, Dec 06, 2017 at 03:33:36PM +1100, Jeremy Graham wrote:
> How do I "manually regenerate content for each block in BBL's and then
> manually zero the lists in superblocks"? This is uncharted (and
> slightly scary) territory for me.

You don't. Just mdadm --assemble it with --update=force-no-bbl here.

If you have a filesystem with bad blocks management on top of it, 
check that too and clear it if necessary.
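
In command form, that suggestion looks roughly like this (a dry-run
sketch: every command is echoed, not executed; the array and member names
are the ones from this thread, and nothing here should be run until the
garbage-data question raised elsewhere in the thread is settled):

```shell
# Dry-run sketch: $run echoes each command instead of executing it.
# Set run= (empty) only when you are sure you want to clear the BBL.
run=echo
$run mdadm --stop /dev/md0
# force-no-bbl clears even a non-empty bad-block log (plain no-bbl
# refuses when entries are present)
$run mdadm --assemble -v /dev/md0 --update=force-no-bbl /dev/sd[bcdefg]1
$run cat /proc/mdstat   # reshape should then move past 0.0%
```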

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06  7:36       ` Jeremy Graham
@ 2017-12-06 13:34         ` Wols Lists
  2017-12-06 14:02         ` 002
  1 sibling, 0 replies; 26+ messages in thread
From: Wols Lists @ 2017-12-06 13:34 UTC (permalink / raw)
  To: Jeremy Graham; +Cc: linux-raid

On 06/12/17 07:36, Jeremy Graham wrote:
> However mdadm says otherwise
> 
> $ mdadm --examine-badblocks /dev/sd[bcdefg]1
> Bad-blocks list is empty in /dev/sdb1
> No bad-blocks list configured on /dev/sdc1
> No bad-blocks list configured on /dev/sdd1
> Bad-blocks on /dev/sde1:

You'll notice the list is the same for all drives ... so most of those
blocks probably aren't bad, anyway ...

Somehow, it seems to copy the list from one drive, to all the others ...

Cheers,
Wol

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06  7:36       ` Jeremy Graham
  2017-12-06 13:34         ` Wols Lists
@ 2017-12-06 14:02         ` 002
  1 sibling, 0 replies; 26+ messages in thread
From: 002 @ 2017-12-06 14:02 UTC (permalink / raw)
  To: Jeremy Graham; +Cc: Phil Turmel, linux-raid

As you can see, both lists are equal, which implies the following scenario.
1. At some point a drive was dropped from the array.
2. After that, and before the rebuild finished, every read error was converted into a BBL record.
3. Every BBL record then produced a matching BBL record on the replacement drive.

So on the replaced raid member, the data sectors listed in the BBL contain garbage. Once you remove the BBL as Andreas suggests, that garbage will no longer be flagged by read errors, so don't do it yet.

First of all, scan for currently unreadable files (tar --ignore-failed-read may be suitable for the task). If you find you have lost something important, then we'll go on with reconstruction and parity regeneration.
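
That scan could be sketched like this (scan_unreadable is a hypothetical
helper name, and /mnt/raid is an assumed mount point for the array,
adjust both to the real setup):

```shell
# Hypothetical helper: stream every file under a mount point to /dev/null,
# logging read failures. Files that show up in unreadable.txt sit on
# sectors the BBL is masking.
scan_unreadable() {
    tar -cf /dev/null --ignore-failed-read -C "$1" . 2> unreadable.txt
    wc -l < unreadable.txt   # number of read complaints logged
}
# usage (assumed mount point): scan_unreadable /mnt/raid
```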



06.12.2017, 10:37, "Jeremy Graham" <jeremy@doghouse.agency>:
> Minor update, badblocks reported no bad blocks:
>
> $ badblocks -v /dev/sdb1 > sdb1.badsectors.txt
> Checking blocks 0 to 2930265087
> Checking for bad blocks (read-only test): done
> Pass completed, 0 bad blocks found. (0/0/0 errors)
>
> $ badblocks -v /dev/sdf1 > sdf1.badsectors.txt
> Checking blocks 0 to 2930265087
> Checking for bad blocks (read-only test): done
> Pass completed, 0 bad blocks found. (0/0/0 errors)
>
> $ badblocks -v /dev/sde1 > sde1.badsectors.txt
> Checking blocks 0 to 2930265087
> Checking for bad blocks (read-only test):
> done
> Pass completed, 0 bad blocks found. (0/0/0 errors)
>
> However mdadm says otherwise
>
> $ mdadm --examine-badblocks /dev/sd[bcdefg]1
> Bad-blocks list is empty in /dev/sdb1
> No bad-blocks list configured on /dev/sdc1
> No bad-blocks list configured on /dev/sdd1
> Bad-blocks on /dev/sde1:
> [trim /]
> Bad-blocks on /dev/sdf1:
> [trim /]
> No bad-blocks list configured on /dev/sdg1
>
> I'm pretty sure the bad blocks are the result of a faulty SATA cable
> about 2 years ago
>
> As there are no badblocks on the devices themselves (only reported by
> mdadm) would it be safe to try:
> mdadm --assemble -v /dev/md0 /dev/sd[bcdefg]1 --update=no-bbl
>
>> [trim /]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06 10:49       ` Andreas Klauer
@ 2017-12-06 14:15         ` Phil Turmel
  2017-12-06 16:03           ` Andreas Klauer
  0 siblings, 1 reply; 26+ messages in thread
From: Phil Turmel @ 2017-12-06 14:15 UTC (permalink / raw)
  To: Andreas Klauer, Jeremy Graham; +Cc: linux-raid

On 12/06/2017 05:49 AM, Andreas Klauer wrote:
> On Wed, Dec 06, 2017 at 03:33:36PM +1100, Jeremy Graham wrote:
>> How do I "manually regenerate content for each block in BBL's and then
>> manually zero the lists in superblocks"? This is uncharted (and
>> slightly scary) territory for me.
> 
> You don't. Just mdadm --assemble it with --update=force-no-bbl here.

The problem with this is that the sectors currently marked don't have
appropriate data.  So garbage will be read in those locations after the
BBL is deleted.

There's only two approaches with any hope of rescuing the content:

1) Backup everything that is known to involve the BBL sectors, then
restore those files after the BBL is deleted.

2) Use hdparm to fake bad sectors on the underlying device in the same
locations as the BBL.  Then run a scrub after the BBL is deleted.  A bit
risky, as MD only allows 10 UREs per hour before kicking a drive.  The
array might need --assemble --force multiple times to complete the scrub.
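
A dry-run sketch of option 2 for the first BBL entry only (everything is
echoed rather than executed; the partition start sector and the question
of whether BBL addresses need the Data Offset added are assumptions to
verify against your own --examine output before doing anything real):

```shell
run=echo           # echo each command; never drop this until the math is verified
part_start=2048    # assumed start sector of sde1 on the whole disk (check fdisk -l)
data_offset=2048   # "Data Offset : 2048 sectors" from --examine
bbl_sector=243656; bbl_len=368   # first entry from --examine-badblocks
lba=$(( part_start + data_offset + bbl_sector ))   # whole-disk LBA
# mark the whole range unreadable so MD's scrub will rewrite it
for s in $(seq "$lba" $(( lba + bbl_len - 1 ))); do
    $run hdparm --make-bad-sector "$s" --yes-i-know-what-i-am-doing /dev/sde
done
# after clearing the BBL, trigger the scrub:
$run sh -c 'echo repair > /sys/block/md0/md/sync_action'
```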

> If you have a filesystem with bad blocks management on top of it, 
> check that too and clear it if necessary.

MD's BBL system doesn't coordinate with the filesystem on top, so this
is meaningless.

The BBL in MD is woefully incomplete and should *never* be used.

Phil

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06 14:15         ` Phil Turmel
@ 2017-12-06 16:03           ` Andreas Klauer
  2017-12-06 16:21             ` Phil Turmel
  0 siblings, 1 reply; 26+ messages in thread
From: Andreas Klauer @ 2017-12-06 16:03 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Jeremy Graham, linux-raid

On Wed, Dec 06, 2017 at 09:15:21AM -0500, Phil Turmel wrote:
> The problem with this is that the sectors currently marked don't have
> appropriate data.

It might have the correct data. Depends what exactly happened.
If it happened years ago and you never noticed until reshape, 
chances are it won't matter one way or another.

Of course, it doesn't hurt to take additional steps, if you have 
backups to compare with or some other way to check file integrity. 

> > If you have a filesystem with bad blocks management on top of it, 
> > check that too and clear it if necessary.
> 
> MD's BBL system doesn't coordinate with the filesystem on top, so this
> is meaningless.

MD with duped BBLs does return read errors, so it's a possibility.
 
> The BBL in MD is woefully incomplete and should *never* be used.

There are ups and downs to everything. Relocations would be awful too: 
they harm performance and make recovery all but impossible. Plenty of 
people on this list have lost metadata; figuring out RAID layout and 
drive order is hard enough, but figuring out random relocations would 
be impossible.

The BBL could be improved a lot if it prevented BBLs from becoming 
identical across drives, and gave bad blocks a second chance. Once the 
cable problem is solved, MD should help you turn those bad blocks back 
into good ones.

And if your drive actually has real bad blocks, the only correct course 
of action is to replace it entirely. The problem with the BBL right now is 
that even if you replace all drives, the BBL stays. Once it's duplicated, 
you are stuck with it forever until you forcibly remove it.

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06 16:03           ` Andreas Klauer
@ 2017-12-06 16:21             ` Phil Turmel
  2017-12-06 18:24               ` 002
  2017-12-06 20:19               ` Edward Kuns
  0 siblings, 2 replies; 26+ messages in thread
From: Phil Turmel @ 2017-12-06 16:21 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Jeremy Graham, linux-raid

On 12/06/2017 11:03 AM, Andreas Klauer wrote:
> On Wed, Dec 06, 2017 at 09:15:21AM -0500, Phil Turmel wrote:
>> The problem with this is that the sectors currently marked don't have
>> appropriate data.
> 
> It might have the correct data. Depends what exactly happened.
> If it happened years ago and you never noticed until reshape, 
> chances are it won't matter one way or another.

No, almost certainly not the correct data.  The write that was in flight
when the BB entry was added never made it to disk, and any later writes
to that block are skipped because it's in the list.

> Of course, it doesn't hurt to take additional steps, if you have 
> backups to compare with or some other way to check file integrity. 

If you check integrity before deleting the BBL, MD reconstructs the
data.  If you check integrity after deleting the BBL, MD is giving you
the garbage (because it doesn't know to reconstruct).

>>> If you have a filesystem with bad blocks management on top of it, 
>>> check that too and clear it if necessary.
>>
>> MD's BBL system doesn't coordinate with the filesystem on top, so this
>> is meaningless.
> 
> MD with duped BBLs does return read errors, so it's a possibility.

No, it doesn't.  The read error is only passed to the filesystem if
there's no redundancy left for the block address.

>> The BBL in MD is woefully incomplete and should *never* be used.
> 
> There's ups and downs to everything. Relocations would be awful too. 
> Harms performance and makes recovery all but impossible. So many people 
> on this list with lost metadata, figuring out RAID layout and drive 
> oder is hard, but figuring out random relocations is impossible.

There's no "up" to the existing BBL.  It isn't doing what people think.
It does NOT cause the upper layer to avoid the block address.  It just
kills redundancy at that address.

> The BBL could be improved a lot if it prevented BBLs from being identical 
> across drives, and gave bad blocks a second chance. Once the cable 
> problem is solved, MD should help you turn those bad blocks back 
> into good ones.

MD does exactly this with all modern hard drives using the drives'
built-in relocation systems.  And the write-intent bitmap/re-add feature
helps efficiently deal with writes that were missed on that device while
it was disconnected.

The only thing a BBL could actually help with on modern drives is an
exhausted on-drive relocation table, and only if the BBL was able to do
relocations itself.  Of course, by the time a drive exhausts its internal
spares, it's too far gone to trust anyway.

> And if your drive actually has real bad blocks, the only correct course 
> of action is to replace it entirely.

No, modern drives will attempt to fix blocks on rewrite, and will
relocate them internally if unfixable.  Precisely what you think MD's
BBL should do.  MD's BBL is creating an unfixable mess, not actually
fixing anything.

This is why I suggested using hdparm to pass the BBL data to the
underlying drive.  Then MD *will* actually fix each block.

> The problem with BBL right now is 
> that even if you replace all drives, the BBL stays. Once it's duplicated 
> you are stuck with it forever until you forcibly remove it.

The problem with the BBL right now is its existence.

Phil

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06 16:21             ` Phil Turmel
@ 2017-12-06 18:24               ` 002
  2017-12-07  8:40                 ` Jeremy Graham
  2017-12-06 20:19               ` Edward Kuns
  1 sibling, 1 reply; 26+ messages in thread
From: 002 @ 2017-12-06 18:24 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Jeremy Graham, linux-raid, Andreas Klauer

> 
> No, almost certainly not the correct data. The data that was attempted
> to be written at the time the BB was added didn't make it to disk, and
> any future updated data writes would be skipped since it's in the list.

According to Neil's design notes, this is expected to happen only on members that introduced write errors after the last RAID assembly.
According to my experience, on kernels at least up to v4.3, rewriting a member's bad blocks that have already replicated to the parity blocks simply fails with a write error at the filesystem level. I believe Jeremy would have noticed if that had happened to him, so, most certainly, the bad blocks haven't been rewritten since. And if they were induced by read errors rather than write errors (far more probable with recent drives), the correct data is still there.

> No, it doesn't. The read error is only passed to the filesystem if
> there's no redundancy left for the block address.
Which is the case for every block here. Look at his BBLs.

> There's no "up" to the existing BBL. It isn't doing what people think.
> It does NOT cause the upper layer to avoid the block address. It just
> kills redundancy at that address.
Well, in a scenario with a completely lost (broken/unrecoverable) drive and expected occasional read errors on the remaining RAID members, even the current state of the BBL does more good than evil. Neil's design goals for the BBL feature looked perfectly valid back in 2010; they only need amendment today. As for the implementation itself, it is stable but unfinished, lacking rewrite support (or is it not?) and a working reshape (probably just a big fat warning in mdadm that reshaping with non-empty BBLs is forbidden, because of the risk of data loss).

> This is why I suggested using hdparm to pass the BBL data to the
> underlying drive. Then MD *will* actually fix each block.
I don't believe soft-bad-sector generation is a stable feature that produces repeatable, identical results on every drive. Also, you can't undo a soft bad sector without rewriting it. Risky advice. For RAID5 it is trivial to XOR the sectors with the same numbers (plus data offset) to produce the correct missing sector and thus regenerate parity without relying on MD to do it.
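As a toy illustration of the XOR relation being described (made-up byte values standing in for same-offset sectors, not data from any real array):

```shell
# RAID5 parity is the XOR of the data chunks at the same stripe offset,
# so any one lost chunk equals the XOR of parity with the surviving chunks.
d1=0xA5; d2=0x3C; d3=0xF0                 # example data bytes
p=$(( d1 ^ d2 ^ d3 ))                     # parity byte as stored on the parity chunk
rebuilt=$(( p ^ d1 ^ d3 ))                # reconstruct d2 from parity + survivors
printf 'parity=0x%02X rebuilt=0x%02X\n' "$p" "$rebuilt"   # prints: parity=0x69 rebuilt=0x3C
```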
> 
> The problem with the BBL right now is its existence.
The tested implementation itself has value. Though I agree, BBLs absolutely shouldn't be turned on by default.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06 16:21             ` Phil Turmel
  2017-12-06 18:24               ` 002
@ 2017-12-06 20:19               ` Edward Kuns
  2017-12-07 10:26                 ` Wols Lists
  2017-12-07 13:58                 ` Andreas Klauer
  1 sibling, 2 replies; 26+ messages in thread
From: Edward Kuns @ 2017-12-06 20:19 UTC (permalink / raw)
  To: Phil Turmel, Wols Lists; +Cc: Andreas Klauer, Jeremy Graham, Linux-RAID

On Wed, Dec 6, 2017 at 10:21 AM, Phil Turmel <philip@turmel.org> wrote:
> The problem with the BBL right now is its existence.

I have a couple questions:

1) If I have bad blocks lists configured, how do I safely remove them?

I checked my three arrays and I have BBL configured on two of my eight
partitions making up my three arrays:

# mdadm --examine-badblocks /dev/sda5 /dev/sdb3
No bad-blocks list configured on /dev/sda5
No bad-blocks list configured on /dev/sdb3
# mdadm --examine-badblocks /dev/sda3 /dev/sdb2
No bad-blocks list configured on /dev/sda3
No bad-blocks list configured on /dev/sdb2
# mdadm --examine-badblocks /dev/sda2 /dev/sdb1 /dev/sdc1 /dev/sdd1
No bad-blocks list configured on /dev/sda2
No bad-blocks list configured on /dev/sdb1
Bad-blocks list is empty in /dev/sdc1
Bad-blocks list is empty in /dev/sdd1

I replaced sdc and sdd a couple years ago when one of the two failed.
(They were the same Seagate model that had a particularly high failure
rate not obvious when I bought them.  So I replaced both.)  Apparently
when I replaced them, I inadvertently enabled the BBL on them.

2) Wol, should there be a section on the Wiki about "Things you should
make sure you have configured" that includes disabling the BBL (unless
you know what you're doing), making sure you're scrubbing regularly,
making sure you have drives that support scterc (or if you don't,
configuring /sys/block/<device>/device/timeout), and so on?  Perhaps a
list of information you should have handy before disaster strikes to
make life a lot easier if it does?  E.g., running lsdrv or dumping
partition tables to text files or listing information about your RAID
configuration and LVM, etc.


I have an unrelated question due to poking around while gathering the
above information.  I just realized that this code that I put in
/etc/rc.d/rc.local doesn't work for me because smartctl is not
returning an error:

# Force drives to play nice with MD
for i in /dev/sd? ; do
    if smartctl -l scterc,70,70 $i > /dev/null ; then
        echo -n $i " is good "
    else
        echo 180 > /sys/block/${i/\/dev\/}/device/timeout
        echo -n $i " is  bad "
    fi;
    smartctl -i $i | egrep "(Device Model|Product:)"
    blockdev --setra 1024 $i
done

If I check this manually, I notice that smartctl returns 0 whether the
command succeeds or fails.

# smartctl -l scterc,70,70 /dev/sdb ; echo $?
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.8.13-100.fc23.x86_64]
(local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Commands not supported

0

This is an old Linux version and I need to upgrade, I know.  Hopefully
over the holidays.  But I got that scriptlet above from this mailing
list and I see it at
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch -- so did the
smartctl behavior change at some point?

# smartctl --version
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.8.13-100.fc23.x86_64]
(local build)

      Thanks,

           Eddie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06 18:24               ` 002
@ 2017-12-07  8:40                 ` Jeremy Graham
  0 siblings, 0 replies; 26+ messages in thread
From: Jeremy Graham @ 2017-12-07  8:40 UTC (permalink / raw)
  To: 002; +Cc: Phil Turmel, linux-raid, Andreas Klauer

A huge thank you to all who gave advice!! The issue was the BBL

I decided to just run with a --update=force-no-bbl for a few reasons
* I was sure those bad blocks were read errors generated a long time
ago due to a faulty SATA cable
* Since then I have noted a couple of corrupt files but nothing of
importance and I just generally delete anything corrupt anyway
* Anything irreplaceable on the raid was backed up elsewhere
* I did not trust myself with finding + fixing the bad blocks
manually, and felt it might just get me into more trouble
* Almost anything that touched the raid while active was hanging the
terminal (even an fdisk -l), making experimentation painful

So cowboy style I just ran
mdadm --assemble --update=force-no-bbl -v /dev/md0 /dev/sd[fgdecb]1

And Boom! it started doing its thing without a hiccup. About 12h on the
reshape, then 6h on a resync; after that I did a "fsck.ext4 -f
/dev/md0" with no issues found, then a "resize2fs /dev/md0" and finally
a mount. I have since tested a heap of data and it hasn't missed a beat.

mdadm --examine isn't reporting bad blocks anymore. Not sure whether they
are silently waiting to cause me problems in the future, but it appears
OK for now.

Thanks again for all the help and advice!

Nice to see my array healthy again :)

$ mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sun Dec  2 12:04:24 2012
     Raid Level : raid5
     Array Size : 14651317760 (13972.59 GiB 15002.95 GB)
  Used Dev Size : 2930263552 (2794.52 GiB 3000.59 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Thu Dec  7 19:26:53 2017
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : homer2:1
           UUID : f39f89d2:1bcd3d55:d173d206:d85b8bbc
         Events : 4835083

    Number   Major   Minor   RaidDevice State
       8       8       81        0      active sync   /dev/sdf1
       6       8       97        1      active sync   /dev/sdg1
       5       8       49        2      active sync   /dev/sdd1
       9       8       65        3      active sync   /dev/sde1
       7       8       33        4      active sync   /dev/sdc1
      10       8       17        5      active sync   /dev/sdb1

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0]
[raid1] [raid10]
md0 : active raid5 sdd1[5] sdc1[7] sdb1[10] sde1[9] sdg1[6] sdf1[8]
      14651317760 blocks super 1.2 level 5, 512k chunk, algorithm 2
[6/6] [UUUUUU]

unused devices: <none>

end

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06 20:19               ` Edward Kuns
@ 2017-12-07 10:26                 ` Wols Lists
  2017-12-07 13:58                 ` Andreas Klauer
  1 sibling, 0 replies; 26+ messages in thread
From: Wols Lists @ 2017-12-07 10:26 UTC (permalink / raw)
  To: Edward Kuns; +Cc: Linux-RAID

On 06/12/17 20:19, Edward Kuns wrote:
> 2) Wol, should there be a section on the Wiki about "Things you should
> make sure you have configured" that includes disabling the BBL (unless
> you know what you're doing), making sure you're scrubbing regularly,
> making sure you have drives that support scterc (or if you don't,
> configuring /sys/block/<device>/device/timeout), and so on?  Perhaps a
> list of information you should have handy before disaster strikes to
> make life a lot easier if it does?  E.g., running lsdrv or dumping
> partition tables to text files or listing information about your RAID
> configuration and LVM, etc.

A lot of that information is there. I'm just very conscious of the need
to make everything read well - too much documentation feels like it's
been thrown together, and is a horrible read.

One piece of documentation is a perfect example of how readers can miss
stuff because it's too obvious ... :-) I had trouble finding out how to
do some operation to do with text entry in a word processor. I couldn't
find any reference to it in the index. I searched high and low. Then
somebody pointed it out to me in the manual - it was repeated on nearly
every other page!

I'm planning to condense a lot of this thread into the "scary things
that are easy to fix" page, but your idea of a checklist page sounds
very good. Expect it to appear some time "soon" :-)

(Note also that a lot of this stuff I don't have personal experience of,
so it only tends to make its way into the wiki when something crops up
on the mailing list.)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-06 20:19               ` Edward Kuns
  2017-12-07 10:26                 ` Wols Lists
@ 2017-12-07 13:58                 ` Andreas Klauer
  2017-12-07 17:06                   ` Wols Lists
  2017-12-07 17:40                   ` Andreas Klauer
  1 sibling, 2 replies; 26+ messages in thread
From: Andreas Klauer @ 2017-12-07 13:58 UTC (permalink / raw)
  To: Edward Kuns; +Cc: Phil Turmel, Wols Lists, Jeremy Graham, Linux-RAID

On Wed, Dec 06, 2017 at 02:19:17PM -0600, Edward Kuns wrote:
> 1) If I have bad blocks lists configured, how do I safely remove them?

--assemble with --update=no-bbl is safe, since it only removes if empty.
If not empty, likely you'll end up doing --update=force-no-bbl anyway.
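A dry-run sketch of that sequence (device names are examples; the run() wrapper only prints each command, since the real ones act on a live array):

```shell
# Dry-run: show the BBL inspection/removal sequence without executing it.
run() { echo "would run: $*"; }

run mdadm --examine-badblocks /dev/sdc1              # inspect the list first
run mdadm --assemble --update=no-bbl /dev/md0        # safe: refuses a non-empty list
run mdadm --assemble --update=force-no-bbl /dev/md0  # drops entries unconditionally
```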
 
> # smartctl -l scterc,70,70 /dev/sdb ; echo $?
> smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.8.13-100.fc23.x86_64]
> (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> SCT Commands not supported
> 
> 0

It'd be hilarious if the timeout FUD on this list came with advice that 
didn't even do anything for most people, and nobody ever noticed...

Unfortunately, it returns 4 here. And there are years-old posts that
explicitly check for it returning 4, so this shouldn't be new at all.

Perhaps it's an intermittent error specific to your smartctl version?

You can just set the timeouts unconditionally, if you really want them.

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-07 13:58                 ` Andreas Klauer
@ 2017-12-07 17:06                   ` Wols Lists
  2017-12-07 17:40                   ` Andreas Klauer
  1 sibling, 0 replies; 26+ messages in thread
From: Wols Lists @ 2017-12-07 17:06 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Linux-RAID

On 07/12/17 13:58, Andreas Klauer wrote:
> On Wed, Dec 06, 2017 at 02:19:17PM -0600, Edward Kuns wrote:
>> 1) If I have bad blocks lists configured, how do I safely remove them?
> 
> --assemble with --update=no-bbl is safe, since it only removes if empty.
> If not empty, likely you'll end up doing --update=force-no-bbl anyway.
>  
>> # smartctl -l scterc,70,70 /dev/sdb ; echo $?
>> smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.8.13-100.fc23.x86_64]
>> (local build)
>> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>>
>> SCT Commands not supported
>>
>> 0
> 
> It'd be hilarious if the timeout FUD on this list came with advice that 
> didn't even do anything for most people, and nobody ever noticed...
> 
> Unfortunately, it returns 4 here. And there are years-old posts that 
> explicitly check for it returning 4, so this shouldn't be new at all.
> 
> Perhaps it's an intermittent error specific to your smartctl version?
> 
> You can just set the timeouts unconditionally, if you really want them.
> 
Except bash is back-to-front. True is 0, anything else is false.

So I'm guessing the above drive you've quoted DOES support erc,
therefore it's returned 0 (true) to say everything's okay.

Does your drive support erc? I guess not? So an error code of 4 is
*correct*, and in the sample script on the wiki it will trigger the code
that sets the *kernel* timeout to 180. My Barracudas return 4 ...

Cheers,
Wol


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-07 13:58                 ` Andreas Klauer
  2017-12-07 17:06                   ` Wols Lists
@ 2017-12-07 17:40                   ` Andreas Klauer
  2017-12-07 20:31                     ` Wols Lists
  2017-12-07 23:40                     ` Wols Lists
  1 sibling, 2 replies; 26+ messages in thread
From: Andreas Klauer @ 2017-12-07 17:40 UTC (permalink / raw)
  To: Edward Kuns; +Cc: Phil Turmel, Wols Lists, Jeremy Graham, Linux-RAID

On Thu, Dec 07, 2017 at 02:58:32PM +0100, Andreas Klauer wrote:
> Perhaps it's an intermittent error specific to your smartctl version?

Looking at the source code, it seems to be a case of:

    sct supported, erc unsupported - exit 4 (fail)
    sct unsupported as a whole     - exit 0 (success)

So some drives will give the wrong exit code for this command.

Should probably be reported as a bug to smartmontools.

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-07 17:40                   ` Andreas Klauer
@ 2017-12-07 20:31                     ` Wols Lists
  2017-12-07 23:40                     ` Wols Lists
  1 sibling, 0 replies; 26+ messages in thread
From: Wols Lists @ 2017-12-07 20:31 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Linux-RAID

On 07/12/17 17:40, Andreas Klauer wrote:
> On Thu, Dec 07, 2017 at 02:58:32PM +0100, Andreas Klauer wrote:
>> Perhaps it's an intermittent error specific to your smartctl version?
> 
> Looking at the source code, it seems to be a case of:
> 
>     sct supported, erc unsupported - exit 4 (fail)

That's fine - fail is detected, kernel timeout is set to 180

>     sct unsupported as a whole     - exit 0 (success)

WHOOPS. Setting erc fails, so the command reports success!
> 
> So some drives will give the wrong exit code for this command.
> 
> Should probably be reported as a bug to smartmontools.

Sounds about right.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-07 17:40                   ` Andreas Klauer
  2017-12-07 20:31                     ` Wols Lists
@ 2017-12-07 23:40                     ` Wols Lists
  2017-12-08  1:25                       ` 002
  2017-12-09  0:20                       ` Edward Kuns
  1 sibling, 2 replies; 26+ messages in thread
From: Wols Lists @ 2017-12-07 23:40 UTC (permalink / raw)
  To: Andreas Klauer, smartmontools-support; +Cc: Linux-RAID

Cross-posting to smartmontools, if any of you could be kind enough to
explain what's happening? Please keep linux-raid and me in the
cross-post as we are not subscribed to smartmontools ...

The command in question is

# smartctl -l scterc,70,70 /dev/sdb ; echo $?

On 07/12/17 17:40, Andreas Klauer wrote:
> On Thu, Dec 07, 2017 at 02:58:32PM +0100, Andreas Klauer wrote:
>> Perhaps it's an intermittent error specific to your smartctl version?
> 
> Looking at the source code, it seems to be a case of:
> 
>     sct supported, erc unsupported - exit 4 (fail)
>     sct unsupported as a whole     - exit 0 (success)
> 
> So some drives will give the wrong exit code for this command.
> 
> Should probably be reported as a bug to smartmontools.
> 
If true, this actually has quite a big impact on hobbyist raid
installations. By default, if a desktop drive hiccups, md-raid will kick
it from the array instead of sorting it out. And there's a script
recommended to fix the problem (changing the linux timeouts) except that
if this is true the script will break.

If we've got an enterprise drive, then the above command works, sets the
timeout to 7 seconds, and returns 0 for success.

For drives like my Barracuda, sct is supported but erc isn't, and I get
4 returned, so the script correctly detects a fail and sets the linux
timeout to 180 seconds.

But if this is true, for a drive that doesn't support sct (I don't have
one to test), the command will fail but return 0!!! So the script
doesn't realise anything is wrong, doesn't fix the defaults, and leaves
the raid array in the dangerous situation where linux will time out long
before the drive does.

So does smartctl return 0 if the drive doesn't support sct? If so why?
And what's the easiest way to detect such a drive if so?
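One way to sidestep the exit-code ambiguity, assuming the message text is stable, is to classify the drive from smartctl's output instead of its exit status (a sketch; the sample strings below stand in for real smartctl runs):

```shell
# Classify a drive from the text of `smartctl -l scterc,70,70` output
# rather than its exit code; the strings matched here are assumptions
# based on the outputs quoted in this thread.
check_scterc() {
    case "$1" in
        *"SCT Commands not supported"*) echo "no-sct"  ;;  # fall back to the 180 s kernel timeout
        *"Error Recovery Control"*)     echo "has-erc" ;;  # the 7 s drive timeout applies
        *)                              echo "unknown" ;;
    esac
}

check_scterc "SCT Commands not supported"                            # prints: no-sct
check_scterc "SCT Error Recovery Control set to: Read 70, Write 70"  # prints: has-erc
```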

Cheers,
Wol

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-07 23:40                     ` Wols Lists
@ 2017-12-08  1:25                       ` 002
  2017-12-09  0:20                       ` Edward Kuns
  1 sibling, 0 replies; 26+ messages in thread
From: 002 @ 2017-12-08  1:25 UTC (permalink / raw)
  To: Wols Lists; +Cc: Linux-RAID, Andreas Klauer, smartmontools-support


> If true, this actually has quite a big impact on hobbyist raid
> installations. 
Not really. SMART Command Transport is a feature introduced a long time ago. The last Seagate desktop drive model not implementing it is the Barracuda 7200.10 (from 2006), while WD and Hitachi had it even earlier. Besides, smartmontools version 5 behaves as you expect it should (just checked that). So the hypothetical problematic drive must be very old but reused in a fresh system, and must also be affected by long timeouts. Not a very probable mix, IMHO.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-07 23:40                     ` Wols Lists
  2017-12-08  1:25                       ` 002
@ 2017-12-09  0:20                       ` Edward Kuns
  2017-12-14 12:43                         ` Brad Campbell
  1 sibling, 1 reply; 26+ messages in thread
From: Edward Kuns @ 2017-12-09  0:20 UTC (permalink / raw)
  To: Wols Lists; +Cc: Andreas Klauer, smartmontools-support, Linux-RAID

On Thu, Dec 7, 2017 at 5:40 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> But if this is true, for a drive that doesn't support sct (I don't have
> one to test), the command will fail but return 0!!! So the script
> doesn't realise anything is wrong, doesn't fix the defaults, and leaves
> the raid array in the dangerous situation where linux will time out long
> before the drive does.
>
> So does smartctl return 0 if the drive doesn't support sct? If so why?
> And what's the easiest way to detect such a drive if so?

I can reproduce this on a Samsung 830 SSD (released in 2011, probably
purchased in 2012).  I also have an 840 Pro SSD which does not have
this problem -- it fully supports SCTERC being set.  I don't know how
representative of SSD models these two are, but that gives a time
frame for this support appearing in SSD drives.

               Eddie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-09  0:20                       ` Edward Kuns
@ 2017-12-14 12:43                         ` Brad Campbell
  2017-12-14 17:32                           ` Edward Kuns
  0 siblings, 1 reply; 26+ messages in thread
From: Brad Campbell @ 2017-12-14 12:43 UTC (permalink / raw)
  To: Edward Kuns, Wols Lists; +Cc: Andreas Klauer, smartmontools-support, Linux-RAID

On 09/12/17 08:20, Edward Kuns wrote:
> On Thu, Dec 7, 2017 at 5:40 PM, Wols Lists <antlists@youngman.org.uk> wrote:
>> But if this is true, for a drive that doesn't support sct (I don't have
>> one to test), the command will fail but return 0!!! So the script
>> doesn't realise anything is wrong, doesn't fix the defaults, and leaves
>> the raid array in the dangerous situation where linux will time out long
>> before the drive does.
>>
>> So does smartctl return 0 if the drive doesn't support sct? If so why?
>> And what's the easiest way to detect such a drive if so?
> 
> I can reproduce this on a Samsung 830 SSD (released in 2011, probably
> purchased in 2012).  I also have an 840 Pro SSD which does not have
> this problem -- it fully supports SCTERC being set.  I don't know how
> representative of SSD models these two are, but that gives a time
> frame for this support appearing in SSD drives.

I can go with that. I wrote that script snippet, and yes on my box with 
3 Samsung 830 drives it reports them as good. I never bothered to check 
whether it was valid as I really only cared about setting the correct 
values for the spinning disks.

root@srv:~# smartctl -x /dev/sdc | egrep '(Device Model|SCT)'
Device Model:     SAMSUNG SSD 830 Series
SCT Commands not supported

I never claimed it was perfect, just "good enough for me(tm)".

I only had 40 or 50 disks from about 3 different manufacturers and 5 
different models to test with, so my sample space was pretty small.

Brad.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: mdadm stuck at 0% reshape after grow
  2017-12-14 12:43                         ` Brad Campbell
@ 2017-12-14 17:32                           ` Edward Kuns
  0 siblings, 0 replies; 26+ messages in thread
From: Edward Kuns @ 2017-12-14 17:32 UTC (permalink / raw)
  To: Brad Campbell
  Cc: Wols Lists, Andreas Klauer, smartmontools-support, Linux-RAID

On Thu, Dec 14, 2017 at 6:43 AM, Brad Campbell
<lists2009@fnarfbargle.com> wrote:
> I never claimed it was perfect, just "good enough for me(tm)".

Fair enough!  It may be worth a comment on the mdraid wiki where the
script is shown that older disks might need a manual addition to set
the timeouts, and how to check, maybe with some examples of models
known to have this issue.

How do SSDs behave on bad sectors?  Are SSDs subject to the same long
delays, trying and retrying to read the sector from the media?  Is this
timeout a strong concern for SSDs?

Would it be worth trying to get a patch to smartctl?  (In other words,
is it likely they'd accept a change in behavior in this area?)

            Eddie

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-12-14 17:32 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-05  9:41 mdadm stuck at 0% reshape after grow Jeremy Graham
2017-12-05 10:56 ` Wols Lists
2017-12-05 15:49   ` Nix
2017-12-05 15:55 ` 002
2017-12-06  2:51   ` Phil Turmel
2017-12-06  4:33     ` Jeremy Graham
2017-12-06  7:36       ` Jeremy Graham
2017-12-06 13:34         ` Wols Lists
2017-12-06 14:02         ` 002
2017-12-06 10:49       ` Andreas Klauer
2017-12-06 14:15         ` Phil Turmel
2017-12-06 16:03           ` Andreas Klauer
2017-12-06 16:21             ` Phil Turmel
2017-12-06 18:24               ` 002
2017-12-07  8:40                 ` Jeremy Graham
2017-12-06 20:19               ` Edward Kuns
2017-12-07 10:26                 ` Wols Lists
2017-12-07 13:58                 ` Andreas Klauer
2017-12-07 17:06                   ` Wols Lists
2017-12-07 17:40                   ` Andreas Klauer
2017-12-07 20:31                     ` Wols Lists
2017-12-07 23:40                     ` Wols Lists
2017-12-08  1:25                       ` 002
2017-12-09  0:20                       ` Edward Kuns
2017-12-14 12:43                         ` Brad Campbell
2017-12-14 17:32                           ` Edward Kuns
