* How to recover after md crash during reshape?
@ 2015-10-20  2:35 andras
  2015-10-20 12:50 ` Anugraha Sinha
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: andras @ 2015-10-20  2:35 UTC (permalink / raw)
  To: linux-raid

Dear all,

I have a serious (to me) problem, and I'm seeking some pro advice in 
recovering a RAID6 volume after a crash at the beginning of a reshape. 
Thank you all in advance for any help!

The details:

I'm running Debian.
     uname -r says:
         kernel 3.2.0-4-amd64
     dmesg says:
         Linux version 3.2.0-4-amd64 (debian-kernel@lists.debian.org) 
(gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.68-1+deb7u3
     mdadm -v says:
         mdadm - v3.2.5 - 18th May 2012

I used to have a RAID6 volume with 7 disks in it. I recently bought 
three new HDDs and was trying to add them to the array.
I put them in the machine (hot-plug), partitioned them, then did:

     mdadm --add /dev/md1 /dev/sdh1 /dev/sdi1 /dev/sdj1

This worked fine, /proc/mdstat showed them as three spares. Then I did:

     mdadm --grow --raid-devices=10 /dev/md1

Yes, I was dumb enough to start the process without a backup option 
(a copy-paste error from https://raid.wiki.kernel.org/index.php/Growing).
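
For reference, a grow with the backup option would have looked 
something like this (the path is only an illustration; the backup file 
must live on a filesystem outside the array):

     mdadm --grow --raid-devices=10 --backup-file=/root/md1-grow.backup /dev/md1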

This immediately (well, after 2 seconds) crashed the MD driver:

     Oct 17 17:30:27 bazsalikom kernel: [7869821.514718] sd 0:0:0:0: 
[sdj] Attached SCSI disk
     Oct 17 18:39:21 bazsalikom kernel: [7873955.418679]  sdh: sdh1
     Oct 17 18:39:37 bazsalikom kernel: [7873972.155084]  sdi: sdi1
     Oct 17 18:39:49 bazsalikom kernel: [7873983.916038]  sdj: sdj1
     Oct 17 18:40:33 bazsalikom kernel: [7874027.963430] md: bind<sdh1>
     Oct 17 18:40:34 bazsalikom kernel: [7874028.263656] md: bind<sdi1>
     Oct 17 18:40:34 bazsalikom kernel: [7874028.361112] md: bind<sdj1>
     Oct 17 18:59:48 bazsalikom kernel: [7875182.667815] md: reshape of 
RAID array md1
     Oct 17 18:59:48 bazsalikom kernel: [7875182.667818] md: minimum 
_guaranteed_  speed: 1000 KB/sec/disk.
     Oct 17 18:59:48 bazsalikom kernel: [7875182.667821] md: using 
maximum available idle IO bandwidth (but not more than 200000 KB/sec) 
for reshape.
     Oct 17 18:59:48 bazsalikom kernel: [7875182.667831] md: using 128k 
window, over a total of 1465135936k.
--> Oct 17 18:59:50 bazsalikom kernel: [7875184.326245] md: md_do_sync() 
got signal ... exiting
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928059] md1_raid6       
D ffff88021fc12780     0   282      2 0x00000000
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928066]  
ffff880213fd9140 0000000000000046 ffff8800aa80c140 ffff880201fe08c0
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928073]  
0000000000012780 ffff880211845fd8 ffff880211845fd8 ffff880213fd9140
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928079]  
ffff8800a77d8a40 ffffffff81071331 0000000000000046 ffff8802135a0c00
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928085] Call Trace:
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928095]  
[<ffffffff81071331>] ? arch_local_irq_save+0x11/0x17
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928111]  
[<ffffffffa0124c6c>] ? check_reshape+0x27b/0x51a [raid456]
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928128]  
[<ffffffffa013ade4>] ? scsi_request_fn+0x443/0x51e [scsi_mod]
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928134]  
[<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928144]  
[<ffffffffa00ef3b8>] ? md_check_recovery+0x2a5/0x514 [md_mod]
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928151]  
[<ffffffffa01286c7>] ? raid5d+0x1c/0x483 [raid456]
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928156]  
[<ffffffff81071331>] ? arch_local_irq_save+0x11/0x17
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928160]  
[<ffffffff81071331>] ? arch_local_irq_save+0x11/0x17
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928169]  
[<ffffffffa00e9256>] ? md_thread+0x114/0x132 [md_mod]
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928174]  
[<ffffffff8105fdf3>] ? add_wait_queue+0x3c/0x3c
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928183]  
[<ffffffffa00e9142>] ? md_rdev_init+0xea/0xea [md_mod]
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928188]  
[<ffffffff8105f7a1>] ? kthread+0x76/0x7e
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928194]  
[<ffffffff81357ff4>] ? kernel_thread_helper+0x4/0x10
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928199]  
[<ffffffff8105f72b>] ? kthread_worker_fn+0x139/0x139
     Oct 17 19:02:46 bazsalikom kernel: [7875360.928204]  
[<ffffffff81357ff0>] ? gs_change+0x13/0x13
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928055] md1_raid6       
D ffff88021fc12780     0   282      2 0x00000000
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928062]  
ffff880213fd9140 0000000000000046 ffff8800aa80c140 ffff880201fe08c0
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928069]  
0000000000012780 ffff880211845fd8 ffff880211845fd8 ffff880213fd9140
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928075]  
ffff8800a77d8a40 ffffffff81071331 0000000000000046 ffff8802135a0c00
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928082] Call Trace:
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928091]  
[<ffffffff81071331>] ? arch_local_irq_save+0x11/0x17
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928108]  
[<ffffffffa0124c6c>] ? check_reshape+0x27b/0x51a [raid456]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928124]  
[<ffffffffa013ade4>] ? scsi_request_fn+0x443/0x51e [scsi_mod]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928130]  
[<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928141]  
[<ffffffffa00ef3b8>] ? md_check_recovery+0x2a5/0x514 [md_mod]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928148]  
[<ffffffffa01286c7>] ? raid5d+0x1c/0x483 [raid456]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928153]  
[<ffffffff81071331>] ? arch_local_irq_save+0x11/0x17
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928157]  
[<ffffffff81071331>] ? arch_local_irq_save+0x11/0x17
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928166]  
[<ffffffffa00e9256>] ? md_thread+0x114/0x132 [md_mod]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928171]  
[<ffffffff8105fdf3>] ? add_wait_queue+0x3c/0x3c
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928180]  
[<ffffffffa00e9142>] ? md_rdev_init+0xea/0xea [md_mod]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928185]  
[<ffffffff8105f7a1>] ? kthread+0x76/0x7e
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928191]  
[<ffffffff81357ff4>] ? kernel_thread_helper+0x4/0x10
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928196]  
[<ffffffff8105f72b>] ? kthread_worker_fn+0x139/0x139
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928200]  
[<ffffffff81357ff0>] ? gs_change+0x13/0x13
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928212] jbd2/md1-8      
D ffff88021fc92780     0  1731      2 0x00000000
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928218]  
ffff880213693180 0000000000000046 ffff880200000000 ffff880216d04180
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928224]  
0000000000012780 ffff880213df3fd8 ffff880213df3fd8 ffff880213693180
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928230]  
0000000000000000 00000001135a0d70 ffff8802135a0d60 ffff8802135a0d70
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928236] Call Trace:
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928243]  
[<ffffffffa0123804>] ? get_active_stripe+0x24c/0x505 [raid456]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928248]  
[<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928255]  
[<ffffffffa01258c8>] ? make_request+0x1b4/0x37a [raid456]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928260]  
[<ffffffff8105fdf3>] ? add_wait_queue+0x3c/0x3c
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928278]  
[<ffffffffa00e8d47>] ? md_make_request+0xee/0x1db [md_mod]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928283]  
[<ffffffff8119a3ec>] ? generic_make_request+0x90/0xcf
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928287]  
[<ffffffff8119a4fe>] ? submit_bio+0xd3/0xf1
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928293]  
[<ffffffff81121b78>] ? bio_alloc_bioset+0x43/0xb6
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928297]  
[<ffffffff8111da68>] ? submit_bh+0xe2/0xff
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928304]  
[<ffffffffa0167674>] ? jbd2_journal_commit_transaction+0x803/0x10bf 
[jbd2]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928309]  
[<ffffffff8100d02f>] ? load_TLS+0x7/0xa
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928313]  
[<ffffffff8100d69e>] ? __switch_to+0x133/0x258
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928318]  
[<ffffffff81350dd1>] ? _raw_spin_lock_irqsave+0x9/0x25
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928323]  
[<ffffffff8105267a>] ? lock_timer_base.isra.29+0x23/0x47
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928330]  
[<ffffffffa016b166>] ? kjournald2+0xc0/0x20a [jbd2]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928334]  
[<ffffffff8105fdf3>] ? add_wait_queue+0x3c/0x3c
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928341]  
[<ffffffffa016b0a6>] ? commit_timeout+0x5/0x5 [jbd2]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928345]  
[<ffffffff8105f7a1>] ? kthread+0x76/0x7e
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928349]  
[<ffffffff81357ff4>] ? kernel_thread_helper+0x4/0x10
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928354]  
[<ffffffff8105f72b>] ? kthread_worker_fn+0x139/0x139
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928358]  
[<ffffffff81357ff0>] ? gs_change+0x13/0x13
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928408] smbd            
D ffff88021fc12780     0  3063  25481 0x00000000
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928413]  
ffff880213e07780 0000000000000082 0000000000000000 ffffffff8160d020
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928418]  
0000000000012780 ffff880003cabfd8 ffff880003cabfd8 ffff880213e07780
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928424]  
0000000000000000 00000001135a0d70 ffff8802135a0d60 ffff8802135a0d70
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928429] Call Trace:
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928435]  
[<ffffffffa0123804>] ? get_active_stripe+0x24c/0x505 [raid456]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928439]  
[<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928445]  
[<ffffffffa01258c8>] ? make_request+0x1b4/0x37a [raid456]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928450]  
[<ffffffff8105fdf3>] ? add_wait_queue+0x3c/0x3c
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928457]  
[<ffffffffa00e8d47>] ? md_make_request+0xee/0x1db [md_mod]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928468]  
[<ffffffffa017d19a>] ? noalloc_get_block_write+0x17/0x17 [ext4]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928473]  
[<ffffffff8119a3ec>] ? generic_make_request+0x90/0xcf
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928477]  
[<ffffffff8119a4fe>] ? submit_bio+0xd3/0xf1
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928482]  
[<ffffffff810bedab>] ? __lru_cache_add+0x2b/0x51
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928486]  
[<ffffffff811259dd>] ? mpage_readpages+0x113/0x134
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928496]  
[<ffffffffa017d19a>] ? noalloc_get_block_write+0x17/0x17 [ext4]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928500]  
[<ffffffff81109033>] ? poll_freewait+0x97/0x97
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928505]  
[<ffffffff81036628>] ? should_resched+0x5/0x23
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928508]  
[<ffffffff8134fa44>] ? _cond_resched+0x7/0x1c
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928513]  
[<ffffffff810bdd31>] ? __do_page_cache_readahead+0x11e/0x1c3
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928517]  
[<ffffffff810be02e>] ? ra_submit+0x19/0x1d
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928522]  
[<ffffffff810b689b>] ? generic_file_aio_read+0x282/0x5cf
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928528]  
[<ffffffff810fadc4>] ? do_sync_read+0xb4/0xec
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928532]  
[<ffffffff810fb4af>] ? vfs_read+0x9f/0xe6
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928536]  
[<ffffffff810fb61f>] ? sys_pread64+0x53/0x6e
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928540]  
[<ffffffff81355e92>] ? system_call_fastpath+0x16/0x1b
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928549] imap            
D ffff88021fc12780     0  3121   4613 0x00000000
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928554]  
ffff880216db1100 0000000000000082 ffffea0000000000 ffffffff8160d020
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928559]  
0000000000012780 ffff8800cf5b1fd8 ffff8800cf5b1fd8 ffff880216db1100
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928564]  
0000000000000000 00000001135a0d70 ffff8802135a0d60 ffff8802135a0d70
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928569] Call Trace:
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928576]  
[<ffffffffa0123804>] ? get_active_stripe+0x24c/0x505 [raid456]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928580]  
[<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928585]  
[<ffffffffa01258c8>] ? make_request+0x1b4/0x37a [raid456]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928590]  
[<ffffffff8105fdf3>] ? add_wait_queue+0x3c/0x3c
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928597]  
[<ffffffffa00e8d47>] ? md_make_request+0xee/0x1db [md_mod]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928607]  
[<ffffffffa017d19a>] ? noalloc_get_block_write+0x17/0x17 [ext4]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928611]  
[<ffffffff8119a3ec>] ? generic_make_request+0x90/0xcf
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928615]  
[<ffffffff8119a4fe>] ? submit_bio+0xd3/0xf1
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928619]  
[<ffffffff810bedab>] ? __lru_cache_add+0x2b/0x51
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928623]  
[<ffffffff811259dd>] ? mpage_readpages+0x113/0x134
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928633]  
[<ffffffffa017d19a>] ? noalloc_get_block_write+0x17/0x17 [ext4]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928637]  
[<ffffffff8110b27f>] ? dput+0x27/0xee
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928641]  
[<ffffffff811110df>] ? mntput_no_expire+0x1e/0xc9
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928646]  
[<ffffffff810bdd31>] ? __do_page_cache_readahead+0x11e/0x1c3
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928650]  
[<ffffffff810bdff1>] ? force_page_cache_readahead+0x5f/0x83
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928654]  
[<ffffffff810b85e5>] ? sys_fadvise64_64+0x141/0x1e2
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928658]  
[<ffffffff81355e92>] ? system_call_fastpath+0x16/0x1b
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928667] smbd            
D ffff88021fc12780     0  3155  25481 0x00000000
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928672]  
ffff8802135d8780 0000000000000086 0000000000000000 ffffffff8160d020
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928677]  
0000000000012780 ffff880005267fd8 ffff880005267fd8 ffff8802135d8780
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928683]  
0000000000000000 00000001135a0d70 ffff8802135a0d60 ffff8802135a0d70
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928688] Call Trace:
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928694]  
[<ffffffffa0123804>] ? get_active_stripe+0x24c/0x505 [raid456]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928698]  
[<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928704]  
[<ffffffffa01258c8>] ? make_request+0x1b4/0x37a [raid456]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928708]  
[<ffffffff8105fdf3>] ? add_wait_queue+0x3c/0x3c
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928715]  
[<ffffffffa00e8d47>] ? md_make_request+0xee/0x1db [md_mod]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928725]  
[<ffffffffa017d19a>] ? noalloc_get_block_write+0x17/0x17 [ext4]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928729]  
[<ffffffff8119a3ec>] ? generic_make_request+0x90/0xcf
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928733]  
[<ffffffff8119a4fe>] ? submit_bio+0xd3/0xf1
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928737]  
[<ffffffff810bedab>] ? __lru_cache_add+0x2b/0x51
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928741]  
[<ffffffff811259dd>] ? mpage_readpages+0x113/0x134
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928751]  
[<ffffffffa017d19a>] ? noalloc_get_block_write+0x17/0x17 [ext4]
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928755]  
[<ffffffff81109033>] ? poll_freewait+0x97/0x97
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928759]  
[<ffffffff81036628>] ? should_resched+0x5/0x23
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928762]  
[<ffffffff8134fa44>] ? _cond_resched+0x7/0x1c
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928767]  
[<ffffffff810bdd31>] ? __do_page_cache_readahead+0x11e/0x1c3
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928771]  
[<ffffffff810be02e>] ? ra_submit+0x19/0x1d
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928775]  
[<ffffffff810b689b>] ? generic_file_aio_read+0x282/0x5cf
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928780]  
[<ffffffff810fadc4>] ? do_sync_read+0xb4/0xec
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928784]  
[<ffffffff810fb4af>] ? vfs_read+0x9f/0xe6
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928788]  
[<ffffffff810fb61f>] ? sys_pread64+0x53/0x6e
     Oct 17 19:04:46 bazsalikom kernel: [7875480.928792]  
[<ffffffff81355e92>] ? system_call_fastpath+0x16/0x1b

From here on, things went downhill pretty damn fast. I was not able to 
unmount the file-system or stop or restart the array (/proc/mdstat went 
away), and any process trying to touch /dev/md1 hung, so eventually I 
ran out of options and hit the reset button on the machine.

Upon reboot, the array wouldn't assemble; it complained that SDA 
and SDA1 had the same superblock info on them:

mdadm: WARNING /dev/sda and /dev/sda1 appear to have very similar 
superblocks.
       If they are really different, please --zero the superblock on one
       If they are the same or overlap, please remove one from the
       DEVICE list in mdadm.conf.

At this point, I looked at the drives and it appeared that the drive 
letters had been re-arranged by the kernel. My three new HDDs (which 
used to be SDH, SDI and SDJ) now appear as SDA, SDB and SDD.

I've read up on this a little, and everyone seemed to suggest that you 
repair this superblock corruption by zeroing out the superblock, so I 
did:

     mdadm --zero-superblock /dev/sda1

At this point mdadm started complaining about the superblock on SDB 
(and later SDD), so I ended up zeroing out the superblock on all three 
of the new hard drives:

     mdadm --zero-superblock /dev/sdb1
     mdadm --zero-superblock /dev/sdd1

After this, the array would assemble, but wouldn't start, stating that 
it didn't have enough disks in it - which is correct for the new array: 
I had just removed 3 drives from a RAID6.

Right now, /proc/mdstat says:

     Personalities : [raid1] [raid6] [raid5] [raid4]
     md1 : inactive sdh2[0](S) sdc2[6](S) sdj1[5](S) sde1[4](S) 
sdg1[3](S) sdi1[2](S) sdf2[1](S)
           10744335040 blocks super 0.91

mdadm -E /dev/sdc2 says:
     /dev/sdc2:
               Magic : a92b4efc
             Version : 0.91.00
                UUID : 5e57a17d:43eb0786:42ea8b6c:723593c7
       Creation Time : Sat Oct  2 07:21:53 2010
          Raid Level : raid6
       Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
          Array Size : 11721087488 (11178.10 GiB 12002.39 GB)
        Raid Devices : 10
       Total Devices : 10
     Preferred Minor : 1


       Reshape pos'n : 4096
       Delta Devices : 3 (7->10)


         Update Time : Sat Oct 17 18:59:50 2015
               State : active
      Active Devices : 10
     Working Devices : 10
      Failed Devices : 0
       Spare Devices : 0
            Checksum : fad60788 - correct
              Events : 2579239


              Layout : left-symmetric
          Chunk Size : 64K


           Number   Major   Minor   RaidDevice State
     this     6       8       98        6      active sync


        0     0       8       50        0      active sync
        1     1       8       18        1      active sync
        2     2       8       65        2      active sync   /dev/sde1
        3     3       8       33        3      active sync   /dev/sdc1
        4     4       8        1        4      active sync   /dev/sda1
        5     5       8       81        5      active sync   /dev/sdf1
        6     6       8       98        6      active sync
        7     7       8      145        7      active sync   /dev/sdj1
        8     8       8      129        8      active sync   /dev/sdi1
        9     9       8      113        9      active sync   /dev/sdh1

So, if I read this right, the superblock states that the array is in 
the middle of a reshape from 7 to 10 devices, but that the reshape had 
only just started (4096 is the position).
What's interesting is that the device names listed here don't match the 
ones reported by /proc/mdstat, and are actually incorrect. The right 
partition names are the ones in /proc/mdstat.

The superblocks on the 6 other original disks match, except of course 
for which one they mark as 'this', and for the checksum.

I've read here (http://ubuntuforums.org/showthread.php?t=2133576), 
among many other places, that it might be possible to recover the data 
on the array by re-creating it in the state it was in before the reshape.

I've also read that if I want to re-create an array in read-only mode, I 
should re-create it degraded.

So, what I thought I would do is this:

     mdadm --create /dev/md1 --level=6 --raid-devices=7 /dev/sdh2 
/dev/sdf2 /dev/sdi1 /dev/sdg1 /dev/sde1 missing missing

Obviously, at this point, I'm trying to be as cautious as possible in 
not causing any further damage, if that's at all possible.
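
One approach I've seen described on the kernel raid wiki for 
experimenting without writing to the real disks is to put a 
copy-on-write overlay over each member and point any --create at the 
overlays. A rough sketch for a single member - device names, sizes and 
paths are purely illustrative:

     # per-member COW overlay; repeat for each of the seven members
     truncate -s 4G /tmp/ov-sde1           # sparse file that absorbs writes
     losetup /dev/loop1 /tmp/ov-sde1
     size=$(blockdev --getsz /dev/sde1)    # member size in 512-byte sectors
     dmsetup create ov-sde1 --table "0 $size snapshot /dev/sde1 /dev/loop1 P 8"

Any experimental --create would then target the /dev/mapper/ov-* 
devices instead of the real partitions.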

It seems that this issue has some similarities to this bug: 
https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/1001019

So, please all mdadm gurus, help me out! How can I recover as much of 
the data on this volume as possible?

Thanks again,
Andras Tantos



* Re: How to recover after md crash during reshape?
  2015-10-20  2:35 How to recover after md crash during reshape? andras
@ 2015-10-20 12:50 ` Anugraha Sinha
  2015-10-20 13:04 ` Wols Lists
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Anugraha Sinha @ 2015-10-20 12:50 UTC (permalink / raw)
  To: andras, linux-raid

Hi Andras,

 > Upon reboot, the array wouldn't assemble, it was complaining that SDA
 > and SDA1 had the same superblock info on it.
 >
 > mdadm: WARNING /dev/sda and /dev/sda1 appear to have very similar
 > superblocks.
 >        If they are really different, please --zero the superblock on one
 >        If they are the same or overlap, please remove one from the
 >        DEVICE list in mdadm.conf.
 >
 > At this point, I looked at the drives and it appeared that the drive
 > letters got re-arranged by the kernel. My three new HDDs (which used to
 > be SDH, SDI, SDJ) now appear as SDA, SDB and SDD.
 >
 > I've read up on this a little and everyone seemed to suggest that you
 > repair this superblock corruption by zeroing out the superblock, so I
 > did:
 >
 >      mdadm --zero-superblock /dev/sda1
 >
 > At this point mdadm started complaining about the super-block on SDB
 > (and later SDD) so I ended up zeroing out the superblock on all three of
 > the new hard-drives:
 >
 >      mdadm --zero-superblock /dev/sdb1
 >      mdadm --zero-superblock /dev/sdd1

Before zeroing the superblocks, you should have removed the drives from 
the array first, and only then zeroed the superblock information. That 
way the array would have known about the removed devices, and it could 
have been reassembled and started again.

Anyway, I suggest you first remove the devices that mdadm expects to 
be present.

In my opinion you should first execute, just as a safeguard:

mdadm --stop /dev/md1

and then:
mdadm /dev/md1 --fail /dev/sda1 --remove /dev/sda1
mdadm /dev/md1 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md1 --fail /dev/sdd1 --remove /dev/sdd1

Then check what /proc/mdstat says, and what mdadm -D /dev/md1 reports.

If things look good and you are lucky, restart the array (mdadm --run /dev/md1).

Thereafter, try to remove the existing partitions on /dev/sda, /dev/sdb 
and /dev/sdd (using GNU Parted), recreate the partitions, and probably 
run mkfs on the newly created partitions as well.
This will solve the issue of /dev/sda and /dev/sda1 having similar 
superblock information.
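
As a rough sketch for one of the disks (an msdos label, to match the 
other drives; this wipes the disk, so double-check the device name first):

parted -s /dev/sda mklabel msdos mkpart primary 1MiB 100%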

Finally take a backup and then add and grow your array again.

I hope things work for you.

Regards
Anugraha

On 10/20/2015 11:35 AM, andras@tantosonline.com wrote:
> [trim /]


* Re: How to recover after md crash during reshape?
  2015-10-20  2:35 How to recover after md crash during reshape? andras
  2015-10-20 12:50 ` Anugraha Sinha
@ 2015-10-20 13:04 ` Wols Lists
  2015-10-20 13:49 ` Phil Turmel
  2015-10-21  1:35 ` How to recover after md crash during reshape? Neil Brown
  3 siblings, 0 replies; 24+ messages in thread
From: Wols Lists @ 2015-10-20 13:04 UTC (permalink / raw)
  To: andras, linux-raid

On 20/10/15 03:35, andras@tantosonline.com wrote:
> From here on, things went downhill pretty damn fast. I was not able to
> unmount the file-system, stop or re-start the array (/proc/mdstat went
> away), any process trying to touch /dev/md1 hung, so eventually, I run
> out of options and hit the reset button on the machine.
> 
> Upon reboot, the array wouldn't assemble, it was complaining that SDA
> and SDA1 had the same superblock info on it.
> 
> mdadm: WARNING /dev/sda and /dev/sda1 appear to have very similar
> superblocks.
>       If they are really different, please --zero the superblock on one
>       If they are the same or overlap, please remove one from the
>       DEVICE list in mdadm.conf.
> 
> At this point, I looked at the drives and it appeared that the drive
> letters got re-arranged by the kernel. My three new HDDs (which used to
> be SDH, SDI, SDJ) now appear as SDA, SDB and SDD.
> 
> I've read up on this a little and everyone seemed to suggest that you
> repair this superblock corruption by zeroing out the superblock, so I
> did:
> 
>     mdadm --zero-superblock /dev/sda1

OUCH !!!

REALLY REALLY REALLY don't do anything now until the experts chime in !!!

It looks to me like you have a 0.9 superblock, and this error message is
both common and erroneous. There's only one superblock, but it looks to
mdadm like it's both a disk superblock and a partition superblock.
You've just wiped those drives, I think ...

The experts should be able to recover it for you (I hope), but your
array is now damaged - don't damage it any further !!!

Cheers,
Wol


* Re: How to recover after md crash during reshape?
  2015-10-20  2:35 How to recover after md crash during reshape? andras
  2015-10-20 12:50 ` Anugraha Sinha
  2015-10-20 13:04 ` Wols Lists
@ 2015-10-20 13:49 ` Phil Turmel
       [not found]   ` <3baf849321d819483c5d20c005a31844@tantosonline.com>
  2015-10-21  1:35 ` How to recover after md crash during reshape? Neil Brown
  3 siblings, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2015-10-20 13:49 UTC (permalink / raw)
  To: andras, linux-raid

Good morning Andras,

On 10/19/2015 10:35 PM, andras@tantosonline.com wrote:
> Dear all,
> 
> I have a serious (to me) problem, and I'm seeking some pro advice in
> recovering a RAID6 volume after a crash at the beginning of a reshape.
> Thank you all in advance for any help!
> 
> The details:
> 
> I'm running Debian.
>     uname -r says:
>         kernel 3.2.0-4-amd64
>     dmesg says:
>         Linux version 3.2.0-4-amd64 (debian-kernel@lists.debian.org)
> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.68-1+deb7u3
>     mdadm -v says:
>         mdadm - v3.2.5 - 18th May 2012
> 
> I used to have a RAID6 volume with 7 disks in it. I recently bought
> three new HDDs and was trying to add them to the array.
> I put them in the machine (hot-plug), partitioned them, then did:
> 
>     mdadm --add /dev/md1 /dev/sdh1 /dev/sdi1 /dev/sdj1
> 
> This worked fine, /proc/mdstat showed them as three spares. Then I did:
> 
>     mdadm --grow --raid-devices=10 /dev/md1
> 
> Yes, I was dumb enough to start the process without a backup option -
> (copy-paste error from https://raid.wiki.kernel.org/index.php/Growing).

The normal way to recover from this mistake is to issue

mdadm --grow --continue /dev/md1 --backup-file .....
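
where the elided part is the path to a backup file on a filesystem 
outside the array - illustratively (not something to run right now):

mdadm --grow --continue /dev/md1 --backup-file=/root/md1-reshape.backup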

> This immediately (well, after 2 seconds) crashed the MD driver:

Crashing is a bug, of course, but you are using an old kernel.  New
kernels *generally* have fewer bugs than old kernels :-)  In newer
kernels it would have just held @ 0% progress while still otherwise running.

Same observation applies to the mdadm utility too.  Consider using a
relatively new rescue CD for further operations.

[trim /]

> Upon reboot, the array wouldn't assemble, it was complaining that SDA
> and SDA1 had the same superblock info on it.
> 
> mdadm: WARNING /dev/sda and /dev/sda1 appear to have very similar
> superblocks.
>       If they are really different, please --zero the superblock on one
>       If they are the same or overlap, please remove one from the
>       DEVICE list in mdadm.conf.

This is a completely separate problem, and the warning is a bit
misleading.  It is a side effect of version 0.90 metadata that could not
be solved in a backward compatible manner.  Which is why v1.x metadata
was created and became the default years ago.  Basically, v0.90
metadata, which is placed at the end of a device, when used on the last
partition of a disk, is ambiguous about whether it belongs to the last
partition or the disk as a whole.

Normally, you can update the metadata in place from v0.90 to v1.0 with
mdadm --assemble --update=metadata  ....
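
Spelled out, that would be something like the following, where the 
member list is just a placeholder:

mdadm --assemble /dev/md1 --update=metadata <all seven member partitions>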

> At this point, I looked at the drives and it appeared that the drive
> letters got re-arranged by the kernel. My three new HDD-s (which used to
> be SDH, SDI, SDJ) now appear as SDA, SDB and SDD.

This is common and often screws people up.  The kernel assigns names
based on discovery order, which varies, especially with hotplugging.
You need a map of your array and its devices versus the underlying drive
serial numbers.  This is so important I created a script years ago to
generate this information.  Please download and run it, and post the
results here so we can precisely tailor the instructions we give.

https://github.com/pturmel/lsdrv
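
Roughly, assuming git and python are available:

git clone https://github.com/pturmel/lsdrv
cd lsdrv && ./lsdrv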

> I've read up on this a little and everyone seemed to suggest that you
> repair this superblock corruption by zeroing out the superblock, so I
> did:
> 
>     mdadm --zero-superblock /dev/sda1

"Everyone" was wrong.  Your drives only had the one superblock.  It was
just misidentified in two contexts.  You destroyed the only superblock
on those devices.

[trim /]

> After this, the array would assemble, but wouldn't start, stating that
> it doesn't have enough disks in it - which is correct for the new array:
> I just removed 3 drives from a RAID6.
> 
> Right now, /proc/mdstat says:
> 
>     Personalities : [raid1] [raid6] [raid5] [raid4]
>     md1 : inactive sdh2[0](S) sdc2[6](S) sdj1[5](S) sde1[4](S)
> sdg1[3](S) sdi1[2](S) sdf2[1](S)
>           10744335040 blocks super 0.91
> 
> mdadm -E /dev/sdc2 says:
>     /dev/sdc2:
>               Magic : a92b4efc
>             Version : 0.91.00
>                UUID : 5e57a17d:43eb0786:42ea8b6c:723593c7
>       Creation Time : Sat Oct  2 07:21:53 2010
>          Raid Level : raid6
>       Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>          Array Size : 11721087488 (11178.10 GiB 12002.39 GB)
>        Raid Devices : 10
>       Total Devices : 10
>     Preferred Minor : 1
> 
> 
>       Reshape pos'n : 4096
>       Delta Devices : 3 (7->10)
> 
> 
>         Update Time : Sat Oct 17 18:59:50 2015
>               State : active
>      Active Devices : 10
>     Working Devices : 10
>      Failed Devices : 0
>       Spare Devices : 0
>            Checksum : fad60788 - correct
>              Events : 2579239
> 
> 
>              Layout : left-symmetric
>          Chunk Size : 64K
> 
> 
>           Number   Major   Minor   RaidDevice State
>     this     6       8       98        6      active sync
> 
> 
>        0     0       8       50        0      active sync
>        1     1       8       18        1      active sync
>        2     2       8       65        2      active sync   /dev/sde1
>        3     3       8       33        3      active sync   /dev/sdc1
>        4     4       8        1        4      active sync   /dev/sda1
>        5     5       8       81        5      active sync   /dev/sdf1
>        6     6       8       98        6      active sync
>        7     7       8      145        7      active sync   /dev/sdj1
>        8     8       8      129        8      active sync   /dev/sdi1
>        9     9       8      113        9      active sync   /dev/sdh1
> 
> So, if I read this right, the superblock here states that the array is
> in the middle of a reshape from 7 to 10 devices, but it just started
> (4096 is the position).

Yup, just a little ways in at the beginning.  Probably where it tried to
write its first critical section to the backup file.

> What's interesting is the device names listed here don't match the ones
> reported by /proc/mdstat, and are actually incorrect. The right
> partition numbers are in /proc/mdstat.

Names in the superblock are recorded per the last successful assembly.
Which is why a map of actual roles vs. drive serial numbers is so important.

> I've read in here (http://ubuntuforums.org/showthread.php?t=2133576)
> among many other places that it might be possible to recover the data on
> the array by trying to re-create it to the state before the re-shape.

Yes, since you have destroyed those superblocks, and the reshape
position is so low.  You might lose a little at the beginning of your
array.  Or might not, if it crashed at the first critical section as I
suspect.

> I've also read that if I want to re-create an array in read-only mode, I
> should re-create it degraded.

Not necessary or recommended in this case.

> So, what I thought I would do is this:
> 
>     mdadm --create /dev/md1 --level=6 --raid-devices=7 /dev/sdh2
> /dev/sdf2 /dev/sdi1 /dev/sdg1 /dev/sde1 missing missing
> 
> Obviously, at this point, I'm trying to be as cautious as possible in
> not causing any further damage, if that's at all possible.

Good, because the above would destroy your array.  You'd get modern
defaults for metadata version, data offset, and chunk size.

Please supply all of your mdadm -E reports for the seven partitions and
the lsdrv output I requested.  Just post the text inline in your reply.

Do *not* do anything else.

Phil

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
       [not found]   ` <3baf849321d819483c5d20c005a31844@tantosonline.com>
@ 2015-10-20 15:42     ` Phil Turmel
  2015-10-20 22:34       ` Anugraha Sinha
                         ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Phil Turmel @ 2015-10-20 15:42 UTC (permalink / raw)
  To: andras, Linux-RAID

Hi Andras,

{ Added linux-raid back -- convention on kernel.org is to reply-to-all,
trim replies, and either interleave or bottom post.  I'm trimming less
than normal this time so the list can see. }

On 10/20/2015 10:48 AM, andras@tantosonline.com wrote:
> On 2015-10-20 08:49, Phil Turmel wrote:

>> Please supply all of your mdadm -E reports for the seven partitions and
>> the lsdrv output I requested.  Just post the text inline in your reply.
>>
>> Do *not* do anything else.
>>
>> Phil

> Thanks for all the help!
> 
> Here's the output of lsdrv:
> 
> PCI [pata_marvell] 04:00.1 IDE interface: Marvell Technology Group Ltd.
> 88SE9128 IDE Controller (rev 11)
> ├scsi 0:x:x:x [Empty]
> └scsi 2:x:x:x [Empty]
> PCI [pata_jmicron] 05:00.1 IDE interface: JMicron Technology Corp.
> JMB363 SATA/IDE Controller (rev 02)
> ├scsi 1:x:x:x [Empty]
> └scsi 3:x:x:x [Empty]
> PCI [ahci] 04:00.0 SATA controller: Marvell Technology Group Ltd.
> 88SE9123 PCIe SATA 6.0 Gb/s controller (rev 11)
> ├scsi 4:0:0:0 ATA      ST2000DM001-1ER1 {Z4Z1JDN8}
> │└sda 1.82t [8:0] Partitioned (dos)
> │ └sda1 1.82t [8:1] Empty/Unknown
> └scsi 5:0:0:0 ATA      ST2000DM001-1ER1 {Z4Z1H84Q}
>  └sdb 1.82t [8:16] Partitioned (dos)
>   └sdb1 1.82t [8:17] ext4 'data' {d1403616-a9c6-4cd9-8d92-1aabc81fe373}
> PCI [ata_piix] 00:1f.2 IDE interface: Intel Corporation 82801JI (ICH10
> Family) 4 port SATA IDE Controller #1
> ├scsi 6:0:0:0 ATA      ST31500541AS     {6XW0BQL0}
> │└sdc 1.36t [8:32] Partitioned (dos)
> │ └sdc1 1.36t [8:33] MD raid6 (10) inactive
> {5e57a17d-43eb-0786-42ea-8b6c723593c7}
> ├scsi 6:0:1:0 ATA      WDC WD20EARS-00M {WD-WMAZA0348342}
> │└sdd 1.82t [8:48] Partitioned (dos)
> │ ├sdd1 525.53m [8:49] ext4 'boot1' {a3a1cedc-3866-4d80-af18-a7a4db99d880}
> │ ├sdd2 1.36t [8:50] MD raid6 (10) inactive
> {5e57a17d-43eb-0786-42ea-8b6c723593c7}
> │ └sdd3 465.24g [8:51] MD raid1 (3) inactive
> {f89cbbf7-66e9-eb44-42ea-8b6c723593c7}
> ├scsi 7:0:0:0 ATA      ST31500541AS     {5XW05FFV}
> │└sde 1.36t [8:64] Partitioned (dos)
> │ └sde1 1.36t [8:65] MD raid6 (10) inactive
> {5e57a17d-43eb-0786-42ea-8b6c723593c7}
> └scsi 7:0:1:0 ATA      WDC WD20EARS-00M {WD-WMAZA0209553}
>  └sdf 1.82t [8:80] Partitioned (dos)
>   ├sdf1 525.53m [8:81] ext4 'boot2' {9b0e1e49-c736-47c0-89a1-4cac07c1d5ef}
>   ├sdf2 1.36t [8:82] MD raid6 (10) inactive
> {5e57a17d-43eb-0786-42ea-8b6c723593c7}
>   └sdf3 465.24g [8:83] MD raid1 (1/3) (w/ sdi3) in_sync
> {f89cbbf7-66e9-eb44-42ea-8b6c723593c7}
>    └md0 465.24g [9:0] MD v0.90 raid1 (3) clean DEGRADED
> {f89cbbf7:66e9eb44:42ea8b6c:723593c7}
>     │                 ext4 'root' {ceb15bfe-e082-484c-9015-1fcc8889b798}
>     └Mounted as /dev/disk/by-uuid/ceb15bfe-e082-484c-9015-1fcc8889b798 @ /
> PCI [ata_piix] 00:1f.5 IDE interface: Intel Corporation 82801JI (ICH10
> Family) 2 port SATA IDE Controller #2
> ├scsi 8:0:0:0 ATA      ST31500341AS     {9VS1EFFD}
> │└sdg 1.36t [8:96] Partitioned (dos)
> │ └sdg1 1.36t [8:97] MD raid6 (10) inactive
> {5e57a17d-43eb-0786-42ea-8b6c723593c7}
> └scsi 10:0:0:0 ATA      Hitachi HDS5C302 {ML2220F30TEBLE}
>  └sdh 1.82t [8:112] Partitioned (dos)
>   └sdh1 1.82t [8:113] MD raid6 (10) inactive
> {5e57a17d-43eb-0786-42ea-8b6c723593c7}
> PCI [ahci] 05:00.0 SATA controller: JMicron Technology Corp. JMB363
> SATA/IDE Controller (rev 02)
> ├scsi 9:0:0:0 ATA      WDC WD2002FAEX-0 {WD-WMAY01975001}
> │└sdi 1.82t [8:128] Partitioned (dos)
> │ ├sdi1 525.53m [8:129] Empty/Unknown
> │ ├sdi2 1.36t [8:130] MD raid6 (10) inactive
> {5e57a17d-43eb-0786-42ea-8b6c723593c7}
> │ └sdi3 465.24g [8:131] MD raid1 (2/3) (w/ sdf3) in_sync
> {f89cbbf7-66e9-eb44-42ea-8b6c723593c7}
> │  └md0 465.24g [9:0] MD v0.90 raid1 (3) clean DEGRADED
> {f89cbbf7:66e9eb44:42ea8b6c:723593c7}
> │                     ext4 'root' {ceb15bfe-e082-484c-9015-1fcc8889b798}
> └scsi 11:0:0:0 ATA      ST2000DM001-1ER1 {Z4Z1JCDE}
>  └sdj 1.82t [8:144] Partitioned (dos)
>   └sdj1 1.82t [8:145] Empty/Unknown
> Other Block Devices
> ├loop0 0.00k [7:0] Empty/Unknown
> ├loop1 0.00k [7:1] Empty/Unknown
> ├loop2 0.00k [7:2] Empty/Unknown
> ├loop3 0.00k [7:3] Empty/Unknown
> ├loop4 0.00k [7:4] Empty/Unknown
> ├loop5 0.00k [7:5] Empty/Unknown
> ├loop6 0.00k [7:6] Empty/Unknown
> └loop7 0.00k [7:7] Empty/Unknown
> 
> 
> mdadm output:
> 
> mdadm -E /dev/sdb1 /dev/sda1 /dev/sdc1 /dev/sdd2 /dev/sde1 /dev/sdh1
> /dev/sdg1 /dev/sdi2 /dev/sdj1 /dev/sdf2

> mdadm: No md superblock detected on /dev/sdb1.

> mdadm: No md superblock detected on /dev/sda1.

> /dev/sdc1:
>           Magic : a92b4efc
>         Version : 0.91.00
>            UUID : 5e57a17d:43eb0786:42ea8b6c:723593c7
>   Creation Time : Sat Oct  2 07:21:53 2010
>      Raid Level : raid6
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 11721087488 (11178.10 GiB 12002.39 GB)
>    Raid Devices : 10
>   Total Devices : 10
> Preferred Minor : 1
> 
>   Reshape pos'n : 4096
>   Delta Devices : 3 (7->10)
> 
>     Update Time : Sat Oct 17 18:59:50 2015
>           State : active
>  Active Devices : 10
> Working Devices : 10
>  Failed Devices : 0
>   Spare Devices : 0
>        Checksum : fad60723 - correct
>          Events : 2579239
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     4       8        1        4      active sync   /dev/sda1
> 
>    0     0       8       50        0      active sync   /dev/sdd2
>    1     1       8       18        1      active sync
>    2     2       8       65        2      active sync   /dev/sde1
>    3     3       8       33        3      active sync   /dev/sdc1
>    4     4       8        1        4      active sync   /dev/sda1
>    5     5       8       81        5      active sync   /dev/sdf1
>    6     6       8       98        6      active sync
>    7     7       8      145        7      active sync   /dev/sdj1
>    8     8       8      129        8      active sync   /dev/sdi1
>    9     9       8      113        9      active sync   /dev/sdh1

> /dev/sdd2:
>           Magic : a92b4efc
>         Version : 0.91.00
>            UUID : 5e57a17d:43eb0786:42ea8b6c:723593c7
>   Creation Time : Sat Oct  2 07:21:53 2010
>      Raid Level : raid6
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 11721087488 (11178.10 GiB 12002.39 GB)
>    Raid Devices : 10
>   Total Devices : 10
> Preferred Minor : 1
> 
>   Reshape pos'n : 4096
>   Delta Devices : 3 (7->10)
> 
>     Update Time : Sat Oct 17 18:59:50 2015
>           State : active
>  Active Devices : 10
> Working Devices : 10
>  Failed Devices : 0
>   Spare Devices : 0
>        Checksum : fad6072e - correct
>          Events : 2579239
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     1       8       18        1      active sync
> 
>    0     0       8       50        0      active sync   /dev/sdd2
>    1     1       8       18        1      active sync
>    2     2       8       65        2      active sync   /dev/sde1
>    3     3       8       33        3      active sync   /dev/sdc1
>    4     4       8        1        4      active sync   /dev/sda1
>    5     5       8       81        5      active sync   /dev/sdf1
>    6     6       8       98        6      active sync
>    7     7       8      145        7      active sync   /dev/sdj1
>    8     8       8      129        8      active sync   /dev/sdi1
>    9     9       8      113        9      active sync   /dev/sdh1

> /dev/sde1:
>           Magic : a92b4efc
>         Version : 0.91.00
>            UUID : 5e57a17d:43eb0786:42ea8b6c:723593c7
>   Creation Time : Sat Oct  2 07:21:53 2010
>      Raid Level : raid6
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 11721087488 (11178.10 GiB 12002.39 GB)
>    Raid Devices : 10
>   Total Devices : 10
> Preferred Minor : 1
> 
>   Reshape pos'n : 4096
>   Delta Devices : 3 (7->10)
> 
>     Update Time : Sat Oct 17 18:59:50 2015
>           State : active
>  Active Devices : 10
> Working Devices : 10
>  Failed Devices : 0
>   Spare Devices : 0
>        Checksum : fad60741 - correct
>          Events : 2579239
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     3       8       33        3      active sync   /dev/sdc1
> 
>    0     0       8       50        0      active sync   /dev/sdd2
>    1     1       8       18        1      active sync
>    2     2       8       65        2      active sync   /dev/sde1
>    3     3       8       33        3      active sync   /dev/sdc1
>    4     4       8        1        4      active sync   /dev/sda1
>    5     5       8       81        5      active sync   /dev/sdf1
>    6     6       8       98        6      active sync
>    7     7       8      145        7      active sync   /dev/sdj1
>    8     8       8      129        8      active sync   /dev/sdi1
>    9     9       8      113        9      active sync   /dev/sdh1

> /dev/sdh1:
>           Magic : a92b4efc
>         Version : 0.91.00
>            UUID : 5e57a17d:43eb0786:42ea8b6c:723593c7
>   Creation Time : Sat Oct  2 07:21:53 2010
>      Raid Level : raid6
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 11721087488 (11178.10 GiB 12002.39 GB)
>    Raid Devices : 10
>   Total Devices : 10
> Preferred Minor : 1
> 
>   Reshape pos'n : 4096
>   Delta Devices : 3 (7->10)
> 
>     Update Time : Sat Oct 17 18:59:50 2015
>           State : active
>  Active Devices : 10
> Working Devices : 10
>  Failed Devices : 0
>   Spare Devices : 0
>        Checksum : fad60775 - correct
>          Events : 2579239
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     5       8       81        5      active sync   /dev/sdf1
> 
>    0     0       8       50        0      active sync   /dev/sdd2
>    1     1       8       18        1      active sync
>    2     2       8       65        2      active sync   /dev/sde1
>    3     3       8       33        3      active sync   /dev/sdc1
>    4     4       8        1        4      active sync   /dev/sda1
>    5     5       8       81        5      active sync   /dev/sdf1
>    6     6       8       98        6      active sync
>    7     7       8      145        7      active sync   /dev/sdj1
>    8     8       8      129        8      active sync   /dev/sdi1
>    9     9       8      113        9      active sync   /dev/sdh1

> /dev/sdg1:
>           Magic : a92b4efc
>         Version : 0.91.00
>            UUID : 5e57a17d:43eb0786:42ea8b6c:723593c7
>   Creation Time : Sat Oct  2 07:21:53 2010
>      Raid Level : raid6
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 11721087488 (11178.10 GiB 12002.39 GB)
>    Raid Devices : 10
>   Total Devices : 10
> Preferred Minor : 1
> 
>   Reshape pos'n : 4096
>   Delta Devices : 3 (7->10)
> 
>     Update Time : Sat Oct 17 18:59:50 2015
>           State : active
>  Active Devices : 10
> Working Devices : 10
>  Failed Devices : 0
>   Spare Devices : 0
>        Checksum : fad6075f - correct
>          Events : 2579239
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     2       8       65        2      active sync   /dev/sde1
> 
>    0     0       8       50        0      active sync   /dev/sdd2
>    1     1       8       18        1      active sync
>    2     2       8       65        2      active sync   /dev/sde1
>    3     3       8       33        3      active sync   /dev/sdc1
>    4     4       8        1        4      active sync   /dev/sda1
>    5     5       8       81        5      active sync   /dev/sdf1
>    6     6       8       98        6      active sync
>    7     7       8      145        7      active sync   /dev/sdj1
>    8     8       8      129        8      active sync   /dev/sdi1
>    9     9       8      113        9      active sync   /dev/sdh1

> /dev/sdi2:
>           Magic : a92b4efc
>         Version : 0.91.00
>            UUID : 5e57a17d:43eb0786:42ea8b6c:723593c7
>   Creation Time : Sat Oct  2 07:21:53 2010
>      Raid Level : raid6
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 11721087488 (11178.10 GiB 12002.39 GB)
>    Raid Devices : 10
>   Total Devices : 10
> Preferred Minor : 1
> 
>   Reshape pos'n : 4096
>   Delta Devices : 3 (7->10)
> 
>     Update Time : Sat Oct 17 18:59:50 2015
>           State : active
>  Active Devices : 10
> Working Devices : 10
>  Failed Devices : 0
>   Spare Devices : 0
>        Checksum : fad60788 - correct
>          Events : 2579239
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     6       8       98        6      active sync
> 
>    0     0       8       50        0      active sync   /dev/sdd2
>    1     1       8       18        1      active sync
>    2     2       8       65        2      active sync   /dev/sde1
>    3     3       8       33        3      active sync   /dev/sdc1
>    4     4       8        1        4      active sync   /dev/sda1
>    5     5       8       81        5      active sync   /dev/sdf1
>    6     6       8       98        6      active sync
>    7     7       8      145        7      active sync   /dev/sdj1
>    8     8       8      129        8      active sync   /dev/sdi1
>    9     9       8      113        9      active sync   /dev/sdh1

> mdadm: No md superblock detected on /dev/sdj1.

> /dev/sdf2:
>           Magic : a92b4efc
>         Version : 0.91.00
>            UUID : 5e57a17d:43eb0786:42ea8b6c:723593c7
>   Creation Time : Sat Oct  2 07:21:53 2010
>      Raid Level : raid6
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 11721087488 (11178.10 GiB 12002.39 GB)
>    Raid Devices : 10
>   Total Devices : 10
> Preferred Minor : 1
> 
>   Reshape pos'n : 4096
>   Delta Devices : 3 (7->10)
> 
>     Update Time : Sat Oct 17 18:59:50 2015
>           State : active
>  Active Devices : 10
> Working Devices : 10
>  Failed Devices : 0
>   Spare Devices : 0
>        Checksum : fad6074c - correct
>          Events : 2579239
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     0       8       50        0      active sync   /dev/sdd2
> 
>    0     0       8       50        0      active sync   /dev/sdd2
>    1     1       8       18        1      active sync
>    2     2       8       65        2      active sync   /dev/sde1
>    3     3       8       33        3      active sync   /dev/sdc1
>    4     4       8        1        4      active sync   /dev/sda1
>    5     5       8       81        5      active sync   /dev/sdf1
>    6     6       8       98        6      active sync
>    7     7       8      145        7      active sync   /dev/sdj1
>    8     8       8      129        8      active sync   /dev/sdi1
>    9     9       8      113        9      active sync   /dev/sdh1

> Apparently my problems don't stop adding up: now SDD started developing
> problems, so my root partition (md0) is now degraded. I will attempt to
> dd out whatever I can from that drive and continue...

Don't.  You have another problem: green & desktop drives in a raid
array.  They aren't built for it and will give you grief of one form or
another.  Anyways, their problem with timeout mismatch can be worked
around with long driver timeouts.  Before you do anything else, you
*MUST* run this command:

for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done

(Arrange for this to happen on every boot, and keep doing it manually
until your boot scripts are fixed.)
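
On a sysvinit-era Debian like yours, one blunt way to make it stick (a
sketch -- adapt to however your boot scripts are organized) is to drop
the same loop into /etc/rc.local above the final "exit 0":

# /etc/rc.local
for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done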

Then you can add your missing mirror and let MD fix it:

mdadm /dev/md0 --add /dev/sdd3

After that's done syncing, you can have MD fix any remaining UREs in
that raid1 with:

echo check >/sys/block/md0/md/sync_action

While that's in progress, take the time to read through the links in the
postscript -- the timeout mismatch problem and its impact on
unrecoverable read errors has been hashed out on this list many times.

Now to your big array.  It is vital that it also be cleaned of UREs
after re-creation before you do anything else.  Which means it must
*not* be created degraded (the redundancy is needed to fix UREs).

According to lsdrv and your "mdadm -E" reports, the creation order you
need is:

raid device 0 /dev/sdf2 {WD-WMAZA0209553}
raid device 1 /dev/sdd2 {WD-WMAZA0348342}
raid device 2 /dev/sdg1 {9VS1EFFD}
raid device 3 /dev/sde1 {5XW05FFV}
raid device 4 /dev/sdc1 {6XW0BQL0}
raid device 5 /dev/sdh1 {ML2220F30TEBLE}
raid device 6 /dev/sdi2 {WD-WMAY01975001}

Chunk size is 64k.

Make sure your partially assembled array is stopped:

mdadm --stop /dev/md1

Re-create your array as follows:

mdadm --create --assume-clean --verbose \
    --metadata=1.0 --raid-devices=7 --chunk=64 --level=6 \
    /dev/md1 /dev/sd{f2,d2,g1,e1,c1,h1,i2}
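
Before touching the filesystem, do a read-only sanity check that the
result matches the list above (level, chunk size, device order):

mdadm --detail /dev/md1
cat /proc/mdstat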

Use "fsck -n" to check your array's filesystem (expect some damage at
the very begining).  If it look reasonable, use fsck to fix any damage.
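
Concretely, assuming the filesystem sits directly on md1:

fsck -n /dev/md1    # report only, changes nothing
fsck /dev/md1       # actual repair, once the report looks sane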

Then clean up any lingering UREs:

echo check > /sys/block/md1/md/sync_action

Now you can mount it and catch any critical backups. (You do know that
raid != backup, I hope.)

Your array now has a new UUID, so you probably want to fix your
mdadm.conf file and your initramfs.
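
On Debian that's typically something like (double-check the generated
ARRAY line before keeping it):

mdadm --detail --scan    # copy the md1 line into /etc/mdadm/mdadm.conf
update-initramfs -u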

Finally, go back and do your --grow, with the --backup-file.
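
For example -- the backup file must live on a filesystem that is *not*
on the array being reshaped:

mdadm --add /dev/md1 ...    # re-add your three spares first (names elided)
mdadm --grow --raid-devices=10 --backup-file=/root/md1-grow.backup /dev/md1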

In the future, buy drives with raid ratings like the WD Red family, and
make sure you have a cron job that regularly kicks off array scrubs.  I
do mine weekly.
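
A minimal weekly example (Debian's mdadm package also ships a
checkarray cron job you can adapt; this hand-rolled entry is just a
sketch):

# /etc/cron.d/md-scrub -- Sundays at 02:30
30 2 * * 0  root  echo check > /sys/block/md1/md/sync_action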

HTH,

Phil

[1] http://marc.info/?l=linux-raid&m=139050322510249&w=2
[2] http://marc.info/?l=linux-raid&m=135863964624202&w=2
[3] http://marc.info/?l=linux-raid&m=135811522817345&w=1
[4] http://marc.info/?l=linux-raid&m=133761065622164&w=2
[5] http://marc.info/?l=linux-raid&m=132477199207506
[6] http://marc.info/?l=linux-raid&m=133665797115876&w=2
[7] https://www.marc.info/?l=linux-raid&m=142487508806844&w=3

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-20 15:42     ` Phil Turmel
@ 2015-10-20 22:34       ` Anugraha Sinha
  2015-10-21  3:52       ` andras
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Anugraha Sinha @ 2015-10-20 22:34 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Andras Tantos, Linux-RAID

Hi Phil,
Thanks for all the information shared by you over this thread.

It is really informative.

Regards
Anugraha

On Wed, Oct 21, 2015 at 12:42 AM, Phil Turmel <philip@turmel.org> wrote:
> [trim /]
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-20  2:35 How to recover after md crash during reshape? andras
                   ` (2 preceding siblings ...)
  2015-10-20 13:49 ` Phil Turmel
@ 2015-10-21  1:35 ` Neil Brown
  2015-10-21  4:03   ` andras
  2015-10-21 12:18   ` Phil Turmel
  3 siblings, 2 replies; 24+ messages in thread
From: Neil Brown @ 2015-10-21  1:35 UTC (permalink / raw)
  To: andras, linux-raid

[-- Attachment #1: Type: text/plain, Size: 5493 bytes --]

andras@tantosonline.com writes:

Phil has provided lots of useful advice; I'll just add a couple of
clarifications:

>
>      mdadm --grow --raid-devices=10 /dev/md1
>
> Yes, I was dumb enough to start the process without a backup option - 
> (copy-paste error from https://raid.wiki.kernel.org/index.php/Growing).

Nothing dumb about that - you don't need a --backup option.
If you did, mdadm would have complained.

You only need --backup when the size of the array is unchanged or
decreasing.
(or when growing to a degraded array.  e.g. you can reshape a 4-drive
 raid5 to a degraded 5-drive raid5 without adding a spare.  This will
 require a --backup.  I'm fairly sure it also requires --force because
 it is a very strange thing to do).

When reshaping to a larger array, mdadm only requires a backup while
reshaping the first few stripes, and it uses some space in one of the
new (previously spare) devices to store that backup.


>
> This immediately (well, after 2 seconds) crashed the MD driver:
>
>      Oct 17 17:30:27 bazsalikom kernel: [7869821.514718] sd 0:0:0:0: 
> [sdj] Attached SCSI disk
>      Oct 17 18:39:21 bazsalikom kernel: [7873955.418679]  sdh: sdh1
>      Oct 17 18:39:37 bazsalikom kernel: [7873972.155084]  sdi: sdi1
>      Oct 17 18:39:49 bazsalikom kernel: [7873983.916038]  sdj: sdj1
>      Oct 17 18:40:33 bazsalikom kernel: [7874027.963430] md: bind<sdh1>
>      Oct 17 18:40:34 bazsalikom kernel: [7874028.263656] md: bind<sdi1>
>      Oct 17 18:40:34 bazsalikom kernel: [7874028.361112] md: bind<sdj1>
>      Oct 17 18:59:48 bazsalikom kernel: [7875182.667815] md: reshape of 
> RAID array md1
>      Oct 17 18:59:48 bazsalikom kernel: [7875182.667818] md: minimum 
> _guaranteed_  speed: 1000 KB/sec/disk.
>      Oct 17 18:59:48 bazsalikom kernel: [7875182.667821] md: using 
> maximum available idle IO bandwidth (but not more than 200000 KB/sec) 
> for reshape.
>      Oct 17 18:59:48 bazsalikom kernel: [7875182.667831] md: using 128k 
> window, over a total of 1465135936k.
> --> Oct 17 18:59:50 bazsalikom kernel: [7875184.326245] md: md_do_sync() 
> got signal ... exiting

This is very strange ... maybe some messages missing?
Probably an IO error while writing to a new device.

>
>  From here on, things went downhill pretty damn fast. I was not able to 
> unmount the file-system, stop or re-start the array (/proc/mdstat went 
> away), any process trying to touch /dev/md1 hung, so eventually, I run 
> out of options and hit the reset button on the machine.
>
> Upon reboot, the array wouldn't assemble, it was complaining that SDA 
> and SDA1 had the same superblock info on it.
>
> mdadm: WARNING /dev/sda and /dev/sda1 appear to have very similar 
> superblocks.
>        If they are really different, please --zero the superblock on one
>        If they are the same or overlap, please remove one from the
>        DEVICE list in mdadm.conf.

It's very hard to make messages like this clear without being incredibly
verbose...

In this case /dev/sda and /dev/sda1 obviously overlap (that is obvious,
isn't it?).
So in that case you need to remove one of them from the DEVICE list.
You probably don't have a DEVICE list so it defaults to everything listed in
/proc/partitions.
The "correct" thing to do at this point would have been to add a DEVICE
list to mdadm.conf which only listed the devices that might be part of
an array. e.g.

  DEVICE /dev/sd[a-z][1-9]

> So, if I read this right, the superblock here states that the array is 
> in the middle of a reshape from 7 to 10 devices, but it just started 
> (4096 is the position).
> What's interesting is the device names listed here don't match the ones 
> reported by /proc/mdstat, and are actually incorrect. The right 
> partition numbers are in /proc/mdstat.
>
> The superblocks on the 6 other original disks match, except for of 
> course which one they mark as 'this' and the checksum.
>
> I've read in here (http://ubuntuforums.org/showthread.php?t=2133576) 
> among many other places that it might be possible to recover the data on 
> the array by trying to re-create it to the state before the re-shape.
>
> I've also read that if I want to re-create an array in read-only mode, I 
> should re-create it degraded.
>
> So, what I thought I would do is this:
>
>      mdadm --create /dev/md1 --level=6 --raid-devices=7 /dev/sdh2 
> /dev/sdf2 /dev/sdi1 /dev/sdg1 /dev/sde1 missing missing

Phil has given good advice on this point which is worth following.
It is quite possible that there will still be corruption.

mdadm reads the first few stripes and stores them somewhere in each of
the spares.  md (in the kernel) then reads those stripes again and
writes them out in the new configuration.  It appears that one of the
writes failed, others might have succeeded.  This may not have corrupted
anything (the first few blocks are in the same position for both the old
and new layout) but it might have done.

So if the filesystem seems corrupt after the array is re-created, that
is likely the reason.
The data still exists in the backup on those new devices (if you haven't
done anything to them) and could be restored.

If you do want to look for the backup, it is around about the middle of
the device and has some metadata which contains the string
"md_backup_data-1".  If you find that, you are close to getting the
backup data back.
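
A brute-force way to hunt for that marker on one of the new drives --
purely a sketch, it reads the whole partition and is slow; "strings -t d"
prints the byte offset of each hit:

  dd if=/dev/sdj1 bs=1M | strings -t d | grep md_backup_data-1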

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-20 15:42     ` Phil Turmel
  2015-10-20 22:34       ` Anugraha Sinha
@ 2015-10-21  3:52       ` andras
  2015-10-21 12:01         ` Phil Turmel
  2015-10-21 16:17       ` Wols Lists
  2015-10-25 14:15       ` andras
  3 siblings, 1 reply; 24+ messages in thread
From: andras @ 2015-10-21  3:52 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Linux-RAID

Phil,

Thank you so much for the detailed explanation and your patience with 
me! Sorry for not being more responsive - I don't have access to this 
mail account from work.

> 
>> Apparently my problems don't stop adding up: now SDD started 
>> developing
>> problems, so my root partition (md0) is now degraded. I will attempt 
>> to
>> dd out whatever I can from that drive and continue...
> 
> Don't.  You have another problem: green & desktop drives in a raid
> array.  They aren't built for it and will give you grief of one form or
> another.  Anyways, their problem with timeout mismatch can be worked
> around with long driver timeouts.  Before you do anything else, you
> *MUST* run this command:
> 
> for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done
> 
> (Arrange for this to happen on every boot, and keep doing it manually
> until your boot scripts are fixed.)

Yes, will do. In your links below it seems that you're half advocating 
for using desktop drives in RAID arrays, half advocating against. From 
what I can tell, it seems the recommendation might depend on the 
use-case. If one doesn't care too much about instant performance in case 
of errors, one might want to use desktop drives (with the above fix).
If one wants reliable performance, one probably wants NAS drives. Did I 
understand the basic trade-off correctly?

It seems that people also think that green drives are a bad idea in 
RAIDs in general - mostly because the frequent parking of heads reduces 
life-time. Is that a correct statement?

> Then you can add your missing mirror and let MD fix it:
> 
> mdadm /dev/md0 --add /dev/sdd3
> 
> After that's done syncing, you can have MD fix any remaining UREs in
> that raid1 with:
> 
> echo check >/sys/block/md0/md/sync_action
> 
> While that's in progress, take the time to read through the links in 
> the
> postscript -- the timeout mismatch problem and its impact on
> unrecoverable read errors has been hashed out on this list many times.
> 
> Now to your big array.  It is vital that it also be cleaned of UREs
> after re-creation before you do anything else.  Which means it must
> *not* be created degraded (the redundancy is needed to fix UREs).
> 
> According to lsdrv and your "mdadm -E" reports, the creation order you
> need is:
> 
> raid device 0 /dev/sdf2 {WD-WMAZA0209553}
> raid device 1 /dev/sdd2 {WD-WMAZA0348342}
> raid device 2 /dev/sdg1 {9VS1EFFD}
> raid device 3 /dev/sde1 {5XW05FFV}
> raid device 4 /dev/sdc1 {6XW0BQL0}
> raid device 5 /dev/sdh1 {ML2220F30TEBLE}
> raid device 6 /dev/sdi2 {WD-WMAY01975001}
> 
> Chunk size is 64k.
> 
> Make sure your partially assembled array is stopped:
> 
> mdadm --stop /dev/md1
> 
> Re-create your array as follows:
> 
> mdadm --create --assume-clean --verbose \
>     --metadata=1.0 --raid-devices=7 --chunk=64 --level=6 \
>     /dev/md1 /dev/sd{f2,d2,g1,e1,c1,h1,i2}
> 
> Use "fsck -n" to check your array's filesystem (expect some damage at
> the very begining).  If it look reasonable, use fsck to fix any damage.
> 
> Then clean up any lingering UREs:
> 
> echo check > /sys/block/md1/md/sync_action
> 
> Now you can mount it and catch any critical backups. (You do know that
> raid != backup, I hope.)
> 
> Your array now has a new UUID, so you probably want to fix your
> mdadm.conf file and your initramfs.

Yes sir! I will go through the steps and report back. One question: the 
reason I shouldn't attempt to re-create the new 10-disk array is that it 
would wipe out the 7->10 grow progress, so MD would think that it's a 
fully grown 10-disk array, right?

> Finally, go back and do your --grow, with the --backup-file.
> 
> In the future, buy drives with raid ratings like the WD Red family, and
> make sure you have a cron job that regularly kicks off array scrubs.  I
> do mine weekly.

Thanks for the info. This is the first time someone mentions scrubbing 
with regards to RAID to me, but it makes total sense. I will set it up.

Thanks again,
Andras


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-21  1:35 ` How to recover after md crash during reshape? Neil Brown
@ 2015-10-21  4:03   ` andras
  2015-10-21 12:18   ` Phil Turmel
  1 sibling, 0 replies; 24+ messages in thread
From: andras @ 2015-10-21  4:03 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil,

Thanks for helping me out!

>>      Oct 17 18:59:48 bazsalikom kernel: [7875182.667831] md: using 
>> 128k
>> window, over a total of 1465135936k.
>> --> Oct 17 18:59:50 bazsalikom kernel: [7875184.326245] md: 
>> md_do_sync()
>> got signal ... exiting
> 
> This is very strange ... maybe some messages missing?
> Probably an IO error while writing to a new device.

I'm not sure what happened either. This is /var/log/messages. Maybe
those things go into a different log?

>> mdadm: WARNING /dev/sda and /dev/sda1 appear to have very similar
>> superblocks.
>>        If they are really different, please --zero the superblock on 
>> one
>>        If they are the same or overlap, please remove one from the
>>        DEVICE list in mdadm.conf.
> 
> It's very hard to make messages like this clear without being 
> incredibly
> verbose...
> 
> In this case /dev/sda and /dev/sda1 obviously overlap (that is obvious,
> isn't it?).
> So in that case you need to remove one of them from the DEVICE list.
> You probably don't have a DEVICE list so it defaults to everything 
> listed in
> /proc/partitions.
> The "correct" thing to do at this point would have been to add a DEVICE
> list to mdadm.conf which only listed the devices that might be part of
> an array. e.g.
> 
>   DEVICE /dev/sd[a-z][1-9]

Understood. My problem was that when I googled for the problem, people 
agreed with the suggested solution of zeroing the superblock. I
guess it tells you how much you should trust 'common wisdom'.

> 
> Phil has given good advice on this point which is worth following.
> It is quite possible that there will still be corruption.
> 
> mdadm reads the first few stripes and stores them somewhere in each of
> the spares.  md (in the kernel) then reads those stripes again and
> writes them out in the new configuration.  It appears that one of the
> writes failed, others might have succeeded.  This may not have 
> corrupted
> anything (the first few blocks are in the same position for both the 
> old
> and new layout) but it might have done.
> 
> So if the filesystem seems corrupt after the array is re-created, that
> is likely the reason.
> The data still exists in the backup on those new devices (if you 
> haven't
> done anything to them) and could be restored.
> 
> If you do want to look for the backup, it is around about the middle of
> the device and has some metadata which contains the string
> "md_backup_data-1".  If you find that, you are close to getting the
> backup data back.
> 
> NeilBrown

Oh, gosh, I hope I don't have to do that deep of a surgery. No, I 
haven't touched the new HDDs other than zeroing the superblock. So
whatever was on them, is still there. I'll see how much damage there is 
to the FS after I reconstruct the array.

Thanks for all the help!
Andras

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-21  3:52       ` andras
@ 2015-10-21 12:01         ` Phil Turmel
  0 siblings, 0 replies; 24+ messages in thread
From: Phil Turmel @ 2015-10-21 12:01 UTC (permalink / raw)
  To: andras; +Cc: Linux-RAID

Good morning Andras,

On 10/20/2015 11:52 PM, andras@tantosonline.com wrote:
> Phil,
> 
> Thank you so much for the detailed explanation and your patience with
> me! Sorry for not being more responsive - I don't have access to this
> mail account from work.

No worries.

>> for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done
>>
>> (Arrange for this to happen on every boot, and keep doing it manually
>> until your boot scripts are fixed.)
> 
> Yes, will do. In your links below it seems that you're half advocating
> for using desktop drives in RAID arrays, half advocating against. From
> what I can tell, it seems the recommendation might depend on the
> use-case. If one doesn't care too much about instant performance in case
> of errors, one might want to use desktop drivers (with the above fix).
> If one wants reliable performance, one probably wants NAS drives. Did I
> understand the basic trade-off correctly?

Times change.  At the time some of those were written, desktop drives
with scterc support were still available, but default off.  Those are ok
in a raid if you have the appropriate smartctl command in your boot scripts.
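
For drives that do support it, the usual incantation (7.0-second limits
for read and write, expressed in tenths of a second; it belongs in the
same boot script as the timeouts) is:

smartctl -l scterc,70,70 /dev/sdX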

Long timeouts with non-scterc drives, in my opinion, create a user
impression that things are broken, even if the drive is fine (UREs are
natural and unavoidable in the life of a drive).  Users are prone to
drastic measures when they think something is broken.  Also,
*applications* might not wait that long for their read, either.  So, I
only recommend the long timeout solution when an array is already in
trouble with such drives.

> It seems that people also think that green drives are a bad idea in
> RAIDs in general - mostly because the frequent parking of heads reduces
> life-time. Is that a correct statement?

I don't have enough experience with green drives to say.  The few that I
have (bought before I discovered the dropped scterc support) became part
of my offsite backup rotation.

> Yes sir! I will go through the steps and report back. One question: the
> reason I shouldn't attempt to re-create the new 10-disk array is that it
> would wipe out the 7->10 grow progress, so MD would think that it's a
> fully grown 10-disk array, right?

Right.  Your three extra drives never really were incorporated into the
array, so the data layout is still a 7-drive pattern.

Phil

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-21  1:35 ` How to recover after md crash during reshape? Neil Brown
  2015-10-21  4:03   ` andras
@ 2015-10-21 12:18   ` Phil Turmel
  2015-10-21 20:26     ` Neil Brown
  1 sibling, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2015-10-21 12:18 UTC (permalink / raw)
  To: Neil Brown, andras, linux-raid

Good morning Neil,

On 10/20/2015 09:35 PM, Neil Brown wrote:

> Nothing dumb about that - you don't need a --backup option.
> If you did, mdadm would have complained.
> 
> You only need --backup when the size of the array is unchanged or
> decreasing.

> mdadm reads the first few stripes and stores them somewhere in each of
> the spares.  md (in the kernel) then reads those stripes again and
> writes them out in the new configuration.  It appears that one of the
> writes failed, others might have succeeded.  This may not have corrupted
> anything (the first few blocks are in the same position for both the old
> and new layout) but it might have done.

> If you do want to look for the backup, it is around about the middle of
> the device and has some metadata which contains the string
> "md_backup_data-1".  If you find that, you are close to getting the
> backup data back.

Hmmm.  This feature has advanced beyond my last look at the code.  I was
under the impression the backup option was only optional when mdadm
could move the data offset.  Does this new algorithm apply to v0.90
metadata, a v3.2 kernel, and v3.2.5 mdadm?

Phil


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-21 16:17       ` Wols Lists
@ 2015-10-21 16:05         ` Phil Turmel
  0 siblings, 0 replies; 24+ messages in thread
From: Phil Turmel @ 2015-10-21 16:05 UTC (permalink / raw)
  To: Wols Lists, andras, Linux-RAID

Hi Wols,

I'm glad you've got the big picture correct, but some details need to be
addressed:

On 10/21/2015 12:17 PM, Wols Lists wrote:

> tl;dr summary ...
> 
> Desktop drives are spec'd as being okay with one soft error per 10TB
> read - that's where a read fails, you try again, and everything's okay.

No, this isn't correct.

That spec is for *unrecoverable* read errors.  For desktop drives,
typically spec'd as one such error every 1e14 bits read, on average.
These are failures where you really have lost the sector contents.  Such
sectors are marked as "Pending Relocations" in drive firmware.  But the
recording surface might still be good, so the drive waits for a write to
that pending sector, which it then verifies, before deciding to relocate
or not.

When MD raid receives a read error, whether in normal operation or a
scrub, it will reconstruct the missing data and write it back, closing
this loop immediately.  Where "normal operation" means "read errors are
reported by the drive before the driver times out".

> A resync will scan the array from start to finish - if you have 10TB's
> worth of disk, you MUST be prepared to handle these errors.
> 
> By default, mdadm will assume a disk is faulty and kick it after about
> 10secs, but a desktop drive will hang for maybe several minutes before
> reporting a problem.

MD raid has no timeout, and does not kick drives out for occasional
read errors.  The timeout is in the per-device drivers (SCSI, SATA,
whatever).  Which defaults to 30 seconds.  Desktop drives typically keep
trying to read a bad sector for 120 seconds or more, ignoring the world
while they do so.  Drives with default SCTERC support typically report a
read error within four to seven seconds.

With a desktop drive, the linux device driver bails after 30 seconds and
resets the link to the drive -- which gets ignored.  And keeps getting
ignored until the original read retry cycle finishes.  During this time,
MD has reconstructed the data and told the driver to write the fixed
sector.  That *write* also fails (because the driver is failing to
reset) and that *write error* kicks the drive out of the array.
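
You can inspect both sides of the mismatch from userspace; a quick
check, with an illustrative device name:

    cat /sys/block/sda/device/timeout   # driver timeout, in seconds
    smartctl -l scterc /dev/sda         # drive's error recovery setting, if any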

Anyways, please consider reading the threads I pointed Andras at :-)

Phil

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-20 15:42     ` Phil Turmel
  2015-10-20 22:34       ` Anugraha Sinha
  2015-10-21  3:52       ` andras
@ 2015-10-21 16:17       ` Wols Lists
  2015-10-21 16:05         ` Phil Turmel
  2015-10-25 14:15       ` andras
  3 siblings, 1 reply; 24+ messages in thread
From: Wols Lists @ 2015-10-21 16:17 UTC (permalink / raw)
  To: andras, Linux-RAID

On 20/10/15 16:42, Phil Turmel wrote:
> Don't.  You have another problem: green & desktop drives in a raid
> array.  They aren't built for it and will give you grief of one form or
> another.  Anyways, their problem with timeout mismatch can be worked
> around with long driver timeouts.  Before you do anything else, you
> *MUST* run this command:
> 
> for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done
> 
> (Arrange for this to happen on every boot, and keep doing it manually
> until your boot scripts are fixed.)

tl;dr summary ...

Desktop drives are spec'd as being okay with one soft error per 10TB
read - that's where a read fails, you try again, and everything's okay.

A resync will scan the array from start to finish - if you have 10TB's
worth of disk, you MUST be prepared to handle these errors.

By default, mdadm will assume a disk is faulty and kick it after about
10secs, but a desktop drive will hang for maybe several minutes before
reporting a problem.

In other words, your drives can meet manufacturer's specs, but, with
default settings, your array will never be able to rebuild after a
problem! (Note that many people will say "I've never had a problem", but
most drives are better than spec. You just don't want to be the unlucky
one ...)


Not that I have any (yet), but I'd second the recommendation for WD
Reds. I've got Seagate Barracudas (not raid-compliant), and the Reds are
not much more expensive, and are also the only drives I've found that
support the raid features - mostly that by default they will fail and
report a problem very quickly. (Plus they're spec'd at reading about
40TB per soft error :-)

Cheers,
Wol



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-21 12:18   ` Phil Turmel
@ 2015-10-21 20:26     ` Neil Brown
  2015-10-21 20:37       ` Phil Turmel
  0 siblings, 1 reply; 24+ messages in thread
From: Neil Brown @ 2015-10-21 20:26 UTC (permalink / raw)
  To: Phil Turmel, andras, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1909 bytes --]

Phil Turmel <philip@turmel.org> writes:

> Good morning Neil,
>
> On 10/20/2015 09:35 PM, Neil Brown wrote:
>
>> Nothing dumb about that - you don't need a --backup option.
>> If you did, mdadm would have complained.
>> 
>> You only need --backup when the size of the array is unchanged or
>> decreasing.
>
>> mdadm reads the first few stripes and stores them somewhere in each of
>> the spares.  md (in the kernel) then reads those stripes again and
>> writes them out in the new configuration.  It appears that one of the
>> writes failed, others might have succeeded.  This may not have corrupted
>> anything (the first few blocks are in the same position for both the old
>> and new layout) but it might have done.
>
>> If you do want to look for the backup, it is around about the middle of
>> the device and has some metadata which contains the string
>> "md_backup_data-1".  If you find that, you are close to getting the
>> backup data back.
>
> Hmmm.  This feature has advanced beyond my last look at the code.  I was
> under the impression the backup option was only optional when mdadm
> could move the data offset.  Does this new algorithm apply to v0.90
> metadata, a v3.2 kernel, and v3.2.5 mdadm?
>

It isn't a new algorithm, it is the original algorithm.

In mdadm-2.4-pre1 (march 2006), you couldn't specify a backup file, but
you could grow a raid5 to more devices.
That was changed by a patch with comment:

    Allow resize to backup to a file.
    
    To support resizing an array without a spare, mdadm now understands
      --backup-file=
    which should point to a file for storing a backup of critical data.
    This can be given to --grow which will create the file, or
    --assemble which will restore from the file if needed.
    
The backup-file was subsequently used to support in-place reshapes and
array shrinking.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-21 20:26     ` Neil Brown
@ 2015-10-21 20:37       ` Phil Turmel
  0 siblings, 0 replies; 24+ messages in thread
From: Phil Turmel @ 2015-10-21 20:37 UTC (permalink / raw)
  To: Neil Brown, andras, linux-raid

On 10/21/2015 04:26 PM, Neil Brown wrote:
> Phil Turmel <philip@turmel.org> writes:

>> Hmmm.  This feature has advanced beyond my last look at the code.  I was
>> under the impression the backup option was only optional when mdadm
>> could move the data offset.  Does this new algorithm apply to v0.90
>> metadata, a v3.2 kernel, and v3.2.5 mdadm?
>>
> 
> It isn't a new algorithm, it is the original algorithm.
> 
> In mdadm-2.4-pre1 (march 2006), you couldn't specify a backup file, but
> you could grow a raid5 to more devices.
> That was changed by a patch with comment:
> 
>     Allow resize to backup to a file.
>     
>     To support resizing an array without a spare, mdadm now understands
>       --backup-file=
>     which should point to a file for storing a backup of critical data.
>     This can be given to --grow which will create the file, or
>     --assemble which will restore from the file if needed.
>     
> The backup-file was subsequently used to support in-place reshapes and
> array shrinking.

Ah, ok.  I wasn't using parity raid that far back, and never noticed
that growing to more devices worked that way.

Thanks for clarifying.

Phil


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-20 15:42     ` Phil Turmel
                         ` (2 preceding siblings ...)
  2015-10-21 16:17       ` Wols Lists
@ 2015-10-25 14:15       ` andras
  2015-10-25 23:02         ` Phil Turmel
  3 siblings, 1 reply; 24+ messages in thread
From: andras @ 2015-10-25 14:15 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Linux-RAID

Phil,

Thanks for all the help. I finally have some progress (and new 
problems).

> Now to your big array.  It is vital that it also be cleaned of UREs
> after re-creation before you do anything else.  Which means it must
> *not* be created degraded (the redundancy is needed to fix UREs).
> 
> According to lsdrv and your "mdadm -E" reports, the creation order you
> need is:
> 
> raid device 0 /dev/sdf2 {WD-WMAZA0209553}
> raid device 1 /dev/sdd2 {WD-WMAZA0348342}
> raid device 2 /dev/sdg1 {9VS1EFFD}
> raid device 3 /dev/sde1 {5XW05FFV}
> raid device 4 /dev/sdc1 {6XW0BQL0}
> raid device 5 /dev/sdh1 {ML2220F30TEBLE}
> raid device 6 /dev/sdi2 {WD-WMAY01975001}
> 
> Chunk size is 64k.
> 
> Make sure your partially assembled array is stopped:
> 
> mdadm --stop /dev/md1
> 
> Re-create your array as follows:
> 
> mdadm --create --assume-clean --verbose \
>     --metadata=1.0 --raid-devices=7 --chunk=64 --level=6 \
>     /dev/md1 /dev/sd{f2,d2,g1,e1,c1,h1,i2}

Being very paranoid at this stage, instead of trying to re-create the 
array on the original drives, I dd-ed their content to a different set 
of (bigger) drives, and issued the command on them.
The array assembled fine:

md1 : active raid6 sdc2[6] sdd1[5] sdg1[4] sdb1[3] sdf1[2] sdh2[1] 
sda2[0]
       7325679040 blocks super 1.0 level 6, 64k chunk, algorithm 2 [7/7] 
[UUUUUUU]
       bitmap: 0/11 pages [0KB], 65536KB chunk

> Use "fsck -n" to check your array's filesystem (expect some damage at
> the very begining).  If it look reasonable, use fsck to fix any damage.

fsck -n ran to completion but reported a ton of errors, mostly stemming 
from the initial (ext4) superblock being damaged.

     e2fsck 1.42.12 (29-Aug-2014)
     ext2fs_check_desc: Corrupt group descriptor: bad block for block 
bitmap
     fsck.ext4: Group descriptors look bad... trying backup blocks...
     Superblock needs_recovery flag is clear, but journal has data.
     Recovery flag not set in backup superblock, so running journal 
anyway.
     Clear journal? no

     The filesystem size (according to the superblock) is 1831419920 
blocks
     The physical size of the device is 1831419760 blocks
     Either the superblock or the partition table is likely to be 
corrupt!
     Abort? no

     data contains a file system with errors, check forced.
     Resize inode not valid.  Recreate? no

     Pass 1: Checking inodes, blocks, and sizes
     Inode 7 has illegal block(s).  Clear? no

     Illegal block #448536 (4285956422) in inode 7.  IGNORED.
     Illegal block #448537 (4292313414) in inode 7.  IGNORED.
     Illegal block #448538 (3675619654) in inode 7.  IGNORED.
     Illegal block #448539 (3686760774) in inode 7.  IGNORED.
     Illegal block #448541 (1880654150) in inode 7.  IGNORED.
     Illegal block #448542 (3636035910) in inode 7.  IGNORED.
     Illegal block #448543 (2516877638) in inode 7.  IGNORED.
     Illegal block #448544 (2920513862) in inode 7.  IGNORED.
     Illegal block #449560 (4285956537) in inode 7.  IGNORED.
     Illegal block #449561 (4292313529) in inode 7.  IGNORED.
     Illegal block #449562 (3675619769) in inode 7.  IGNORED.
     Too many illegal blocks in inode 7.
     Clear inode? no

     Suppress messages? no
     ...
     and so on...

So I issued the real fsck command. Interestingly, it reported a 
completely different set of issues; my guess is that after fixing the 
superblock, the inconsistencies that fsck -n was complaining about went 
away, and the real ones started to show up. At any rate, the file system 
now seems to be clean, except for this message:

     The filesystem size (according to the superblock) is 1831419920 
blocks
     The physical size of the device is 1831419760 blocks
     Either the superblock or the partition table is likely to be 
corrupt!

This problem prevents me from mounting the FS:

     mount -o ro /dev/md1 /mnt -v
     mount: wrong fs type, bad option, bad superblock on /dev/md1,
            missing codepage or helper program, or other error

            In some cases useful info is found in syslog - try
            dmesg | tail or so.

And dmesg reports:

     [ 5859.527778] EXT4-fs (md1): bad geometry: block count 1831419920 
exceeds size of device (1831419760 blocks)

So here I am right now. I can see a few paths forward, but first a 
question:

Why is it that the re-created MD device is (ever so slightly) different 
in size than the ext4 filesystem it used to contain? I doubt it has 
anything to do with the grow operation, as I didn't get far enough to 
actually resize the filesystem...

One side-effect of using different drives (and dd) is that the partition 
table is now misaligned with the new disk geometry. For example:

     fdisk -l /dev/sdb

     Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
     Units: sectors of 1 * 512 = 512 bytes
     Sector size (logical/physical): 512 bytes / 4096 bytes
     I/O size (minimum/optimal): 4096 bytes / 4096 bytes
     Disklabel type: dos
     Disk identifier: 0x3e6b39b9

     Device     Boot Start        End    Sectors  Size Id Type
     /dev/sdb1          63 2930272064 2930272002  1.4T fd Linux raid 
autodetect

     Partition 2 does not start on physical sector boundary.

Could this be the root cause?

Here's the sizes of all the other relevant partitions:

     /dev/sda2   976752064 3907029167 2930277104  1.4T fd Linux raid 
autodetect
     /dev/sdb1          63 2930272064 2930272002  1.4T fd Linux raid 
autodetect
     /dev/sdc2   976752064 3907029167 2930277104  1.4T fd Linux raid 
autodetect
     /dev/sdd1          63 3907024064 3907024002  1.8T fd Linux raid 
autodetect
     /dev/sdf1          63 2930272064 2930272002  1.4T fd Linux raid 
autodetect
     /dev/sdg1          63 2930272064 2930272002  1.4T fd Linux raid 
autodetect
     /dev/sdh2   976752064 3907029167 2930277104  1.4T fd Linux raid 
autodetect

If I look at the sizes reported by fdisk above, on a 7-disk raid6 with 
each partition of that size, I should have about 1831420000 blocks 
available. I'm sure mdadm takes some space for management, but I don't 
know how much.

So, I thought of three ways of fixing it:
1. Re-create the array again, but this time force the array size to the 
one reported by the filesystem, using --size. What is the unit for 
--size? Is that bytes?
2. Re-create the array again, but this time use the original superblock 
version (0.91, I think). Could that make a difference in the size of 
the array?
3. Instead of dd-ing whole drives, dd just the raid6 partitions, so the 
partition table is correct for the drives. Maybe the misalignment trips 
up mdadm and makes it create the array with the incorrect size?

Thanks for all the help again,
Andras





^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-25 14:15       ` andras
@ 2015-10-25 23:02         ` Phil Turmel
  2015-10-28 16:31           ` Andras Tantos
  0 siblings, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2015-10-25 23:02 UTC (permalink / raw)
  To: andras; +Cc: Linux-RAID

On 10/25/2015 10:15 AM, andras@tantosonline.com wrote:
> Phil,
> 
> Thanks for all the help. I finally have some progress (and new problems).

>     [ 5859.527778] EXT4-fs (md1): bad geometry: block count 1831419920
> exceeds size of device (1831419760 blocks)

> So, I thought of three ways of fixing it:
> 1. Re-create the array again, but this time force the array size to the
> one reported by the filesystem, using -size. What is the unit for -size?
> Is that bytes?

Yep. You'll need to use the --size option on a create. Note that it
specifies the amount of each device to use, not the overall array size.
According to "man mdadm", its units is k == 1024 bytes.  Use the exact
size from your original => --size=1465135936

> 2. Re-create the array again, but this time use the original
> super-blocks version (0.91 I think). Could that make a difference in the
> size of the array?

v0.91 really is just a flag that means v0.90 w/ a reshape in progress.
But yes, the size used would be somewhat different.  With the override
above, it won't matter.  v1.x metadata has more features, and modern
mdadm normally reserves enough room to support them.

> 3. Instead of DD-ing whole drives, dd just the raid6 partitions so the
> partition table is correct for the drives. Maybe the misalignment trips
> mdadm off and makes it to create the array in the incorrect size?

Yes, dd just the partition contents, so the final array is aligned.
This is *really* important for drives that have logical 512-byte sectors
but physical 4k-sectors.  When you put your repaired array back in
service, keep this alignment.

Phil


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-25 23:02         ` Phil Turmel
@ 2015-10-28 16:31           ` Andras Tantos
  2015-10-28 16:42             ` Phil Turmel
  0 siblings, 1 reply; 24+ messages in thread
From: Andras Tantos @ 2015-10-28 16:31 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Linux-RAID

Thanks again Phil!

I'm almost there...

 >>     [ 5859.527778] EXT4-fs (md1): bad geometry: block count 1831419920
 >> exceeds size of device (1831419760 blocks)
 >
 >Yep. You'll need to use the --size option on a create. Note that it
 >specifies the amount of each device to use, not the overall array size.
 >According to "man mdadm", its units is k == 1024 bytes.  Use the exact
 >size from your original => --size=1465135936

When I try to do that, I get the following message:

     root@bazsalikom:~# mdadm --create --assume-clean --verbose 
--metadata=1.0 --raid-devices=7 --size=1465135936 --chunk=64 --level=6 
/dev/md1 /dev/sde2 /dev/sdc2 /dev/sdf1 /dev/sdd1 /dev/sdb1 /dev/sdg1 
/dev/sdh2
     mdadm: layout defaults to left-symmetric
     mdadm: /dev/sde2 appears to contain an ext2fs file system
         size=-1216020180K  mtime=Wed Dec  8 11:55:07 1954
     mdadm: /dev/sde2 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: /dev/sdc2 appears to contain an ext2fs file system
         size=-1264254912K  mtime=Sat Jul 18 15:26:57 2015
     mdadm: /dev/sdc2 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: /dev/sdf1 is smaller than given size. 1465135808K < 
1465135936K + metadata
     mdadm: /dev/sdd1 is smaller than given size. 1465135808K < 
1465135936K + metadata
     mdadm: /dev/sdb1 is smaller than given size. 1465135808K < 
1465135936K + metadata
     mdadm: /dev/sdg1 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: /dev/sdh2 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: create aborted

To be able to re-assemble the array, I *have* to specify metadata 
version 0.9:

     root@bazsalikom:~# mdadm --create --assume-clean --verbose 
--metadata=0.9 --raid-devices=7 --size=1465135936 --chunk=64 --level=6 
/dev/md1 /dev/sde2 /dev/sdc2 /dev/sdf1 /dev/sdd1 /dev/sdb1 /dev/sdg1 
/dev/sdh2
     mdadm: layout defaults to left-symmetric
     mdadm: /dev/sde2 appears to contain an ext2fs file system
         size=-1216020180K  mtime=Wed Dec  8 11:55:07 1954
     mdadm: /dev/sde2 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: /dev/sdc2 appears to contain an ext2fs file system
         size=-1264254912K  mtime=Sat Jul 18 15:26:57 2015
     mdadm: /dev/sdc2 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: /dev/sdf1 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: /dev/sdd1 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: /dev/sdb1 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: /dev/sdg1 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: /dev/sdh2 appears to be part of a raid array:
         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
     mdadm: largest drive (/dev/sdg1) exceeds size (1465135936K) by more 
than 1%
     Continue creating array? y
     mdadm: array /dev/md1 started.

Is this a problem? Can I upgrade my array to 1.0 metadata? Should I?

Andras


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-28 16:31           ` Andras Tantos
@ 2015-10-28 16:42             ` Phil Turmel
  2015-10-28 17:10               ` Andras Tantos
  2015-10-29 16:59               ` Andras Tantos
  0 siblings, 2 replies; 24+ messages in thread
From: Phil Turmel @ 2015-10-28 16:42 UTC (permalink / raw)
  To: Andras Tantos; +Cc: Linux-RAID

On 10/28/2015 12:31 PM, Andras Tantos wrote:
> Thanks again Phil!
> 
> I'm almost there...
> 
>>>     [ 5859.527778] EXT4-fs (md1): bad geometry: block count 1831419920
>>> exceeds size of device (1831419760 blocks)
>>
>>Yep. You'll need to use the --size option on a create. Note that it
>>specifies the amount of each device to use, not the overall array size.
>>According to "man mdadm", its units is k == 1024 bytes.  Use the exact
>>size from your original => --size=1465135936
> 
> When I try to do that, I get the following message:
> 
>     root@bazsalikom:~# mdadm --create --assume-clean --verbose
> --metadata=1.0 --raid-devices=7 --size=1465135936 --chunk=64 --level=6
> /dev/md1 /dev/sde2 /dev/sdc2 /dev/sdf1 /dev/sdd1 /dev/sdb1 /dev/sdg1
> /dev/sdh2
>     mdadm: layout defaults to left-symmetric
>     mdadm: /dev/sde2 appears to contain an ext2fs file system
>         size=-1216020180K  mtime=Wed Dec  8 11:55:07 1954
>     mdadm: /dev/sde2 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: /dev/sdc2 appears to contain an ext2fs file system
>         size=-1264254912K  mtime=Sat Jul 18 15:26:57 2015
>     mdadm: /dev/sdc2 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: /dev/sdf1 is smaller than given size. 1465135808K <
> 1465135936K + metadata
>     mdadm: /dev/sdd1 is smaller than given size. 1465135808K <
> 1465135936K + metadata
>     mdadm: /dev/sdb1 is smaller than given size. 1465135808K <
> 1465135936K + metadata
>     mdadm: /dev/sdg1 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: /dev/sdh2 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: create aborted
> 
> To be able to re-assemble the array, I *have* to specify metadata
> version 0.9:
> 
>     root@bazsalikom:~# mdadm --create --assume-clean --verbose
> --metadata=0.9 --raid-devices=7 --size=1465135936 --chunk=64 --level=6
> /dev/md1 /dev/sde2 /dev/sdc2 /dev/sdf1 /dev/sdd1 /dev/sdb1 /dev/sdg1
> /dev/sdh2
>     mdadm: layout defaults to left-symmetric
>     mdadm: /dev/sde2 appears to contain an ext2fs file system
>         size=-1216020180K  mtime=Wed Dec  8 11:55:07 1954
>     mdadm: /dev/sde2 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: /dev/sdc2 appears to contain an ext2fs file system
>         size=-1264254912K  mtime=Sat Jul 18 15:26:57 2015
>     mdadm: /dev/sdc2 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: /dev/sdf1 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: /dev/sdd1 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: /dev/sdb1 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: /dev/sdg1 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: /dev/sdh2 appears to be part of a raid array:
>         level=raid6 devices=7 ctime=Wed Oct 28 09:17:55 2015
>     mdadm: largest drive (/dev/sdg1) exceeds size (1465135936K) by more
> than 1%
>     Continue creating array? y
>     mdadm: array /dev/md1 started.
> 
> Is this a problem? Can I upgrade my array to 1.0 metadata? Should I?

Hmm. Interesting.  Your version of mdadm is insisting on reserving much
more space between end of content and the v1.0 metadata than when using
v0.90 metadata.

I'm curious how much.  Please show the output of "cat /proc/partitions".

If you stop the array cleanly and then manually re-assemble with
--update=metadata, you might get around it.  (Specify all of the devices
explicitly to ensure you don't get burned by v0.90's problems with last
partitions.)
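
Something like this, assuming the same member devices as in your create
command (a sketch, untested here):

    mdadm --stop /dev/md1
    mdadm --assemble --update=metadata /dev/md1 \
        /dev/sde2 /dev/sdc2 /dev/sdf1 /dev/sdd1 /dev/sdb1 /dev/sdg1 /dev/sdh2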

You definitely don't want to stay on v0.90, but you may need to for now
to get out of trouble.

Phil


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-28 16:42             ` Phil Turmel
@ 2015-10-28 17:10               ` Andras Tantos
  2015-10-28 17:38                 ` Phil Turmel
  2015-10-29 16:59               ` Andras Tantos
  1 sibling, 1 reply; 24+ messages in thread
From: Andras Tantos @ 2015-10-28 17:10 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Linux-RAID

Phil,

 >> To be able to re-assemble the array, I *have* to specify metadata
 >> version 0.9:
 >>
 >> Is this a problem? Can I upgrade my array to 1.0 metadata? Should I?
 >
 > Hmm. Interesting.  Your version of mdadm is insisting on reserving much
 > more space between end of content and the v1.0 metadata than when using
 > v0.90 metadata.
 >
 > I'm curious how much.  Please show the output of "cat /proc/partitions".

root@bazsalikom:/home/tantos# cat /proc/partitions
major minor  #blocks  name

    8       16 1465138584 sdb
    8       17 1465136001 sdb1
    8       48 1465138584 sdd
    8       49 1465136001 sdd1
    8       80 1465138584 sdf
    8       81 1465136001 sdf1
    8       96 1953513527 sdg
    8       97 1953512001 sdg1
    8      112 1953514584 sdh
    8      113     538145 sdh1
    8      114 1465138552 sdh2
    8      115  487837854 sdh3
    8       64 1953514584 sde
    8       65     538145 sde1
    8       66 1465138552 sde2
    8       67  487837854 sde3
    8       32 1953514584 sdc
    8       33     538145 sdc1
    8       34 1465138552 sdc2
    8       35  487837854 sdc3
    9        0  487837760 md0
    9        1 7325679680 md1

Andras


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-28 17:10               ` Andras Tantos
@ 2015-10-28 17:38                 ` Phil Turmel
  0 siblings, 0 replies; 24+ messages in thread
From: Phil Turmel @ 2015-10-28 17:38 UTC (permalink / raw)
  To: Andras Tantos; +Cc: Linux-RAID

On 10/28/2015 01:10 PM, Andras Tantos wrote:
> Phil,
> 
>>> To be able to re-assemble the array, I *have* to specify metadata
>>> version 0.9:
>>>
>>> Is this a problem? Can I upgrade my array to 1.0 metadata? Should I?
>>
>> Hmm. Interesting.  Your version of mdadm is insisting on reserving much
>> more space between end of content and the v1.0 metadata than when using
>> v0.90 metadata.
>>
>> I'm curious how much.  Please show the output of "cat /proc/partitions".

Ok.  I think your version of mdadm is trying to put a bitmap on the v1.0
array, which can be suppressed with --bitmap=none.  Or just do the
--assemble --update.
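
That is, roughly (same devices and sizes as before, with the bitmap
suppressed):

    mdadm --create --assume-clean --verbose \
        --metadata=1.0 --bitmap=none --raid-devices=7 \
        --size=1465135936 --chunk=64 --level=6 \
        /dev/md1 /dev/sde2 /dev/sdc2 /dev/sdf1 /dev/sdd1 /dev/sdb1 /dev/sdg1 /dev/sdh2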

Phil


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-28 16:42             ` Phil Turmel
  2015-10-28 17:10               ` Andras Tantos
@ 2015-10-29 16:59               ` Andras Tantos
  2015-10-30 18:12                 ` Phil Turmel
  1 sibling, 1 reply; 24+ messages in thread
From: Andras Tantos @ 2015-10-29 16:59 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Linux-RAID

Phil,

On 10/28/2015 9:42 AM, Phil Turmel wrote:
> If you stop the array cleanly and then manually re-assemble with 
> --update=metadata, you might get around it. (Specify all of the 
> devices explicitly to ensure you don't get burned by v0.90's problems 
> with last partitions.) You definitely don't want to stay on v0.90, but 
> you may need to for now to get out of trouble. Phil 

It seems that my mdadm doesn't have an --update=metadata option, which 
if I understand it right means I have to re-create the array with the 
no-bitmap option. How dangerous is that? Is it possible that things get 
overwritten during the re-create process in the data portion of the array?

I've read that GRUB (which is my bootloader) didn't support v1.0 
superblocks for a while. It seems that 0.99 version of GRUB (which is 
what I have) has it, but how to make certain? I don't want to render my 
system un-bootable...

Can you expand a little bit on the problems of v0.90 superblocks and why 
upgrading is advantageous? What I've read about the differences (lifted 
limit of number of devices/array and 2TB per device limit) don't really 
apply to my case.

Thanks,
Andras


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape?
  2015-10-29 16:59               ` Andras Tantos
@ 2015-10-30 18:12                 ` Phil Turmel
  2015-11-03 23:42                   ` How to recover after md crash during reshape? - SOLVED/SUMMARY Andras Tantos
  0 siblings, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2015-10-30 18:12 UTC (permalink / raw)
  To: Andras Tantos; +Cc: Linux-RAID

On 10/29/2015 12:59 PM, Andras Tantos wrote:
> Phil,
> 
> On 10/28/2015 9:42 AM, Phil Turmel wrote:
>> If you stop the array cleanly and then manually re-assemble with
>> --update=metadata, you might get around it. (Specify all of the
>> devices explicitly to ensure you don't get burned by v0.90's problems
>> with last partitions.) You definitely don't want to stay on v0.90, but
>> you may need to for now to get out of trouble. Phil 
> 
> It seems that my mdadm doesn't have an --update=metadata option, which
> if I understand it right means I have to re-create the array with the
> no-bitmap option. How dangerous is that? Is it possible that things get
> overwritten during the re-create process in the data portion of the array?

Just clone and compile a local copy of the latest mdadm, then run it as
./mdadm for the --update operation.

git clone git://github.com/neilbrown/mdadm
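
Roughly, assuming the usual build dependencies are present:

    cd mdadm && make
    # then run the freshly built binary:
    ./mdadm --assemble --update=metadata /dev/md1 <member devices>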

> I've read that GRUB (which is my bootloader) didn't support v1.0
> superblocks for a while. It seems that 0.99 version of GRUB (which is
> what I have) has it, but how to make certain? I don't want to render my
> system un-bootable...

Old grub doesn't understand MD at all, which is why you needed a mirror
that has the content starting at the beginning of the partition.  To
grub, it doesn't look like a mirror.  This is true for v1.0 as well.

> Can you expand a little bit on the problems of v0.90 superblocks and why
> upgrading is advantageous? What I've read about the differences (lifted
> limit of number of devices/array and 2TB per device limit) don't really
> apply to my case.

v0.90 will screw up if you have it on the last partition of a device,
and that partition runs very close to the end of the device.  v0.90
doesn't include size info in the metadata itself, so it is ambiguous in
that case whether the superblock belongs to the device as a whole or the
partition.  That'll really scramble an array.

Just say no to v0.90.

Phil

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to recover after md crash during reshape? - SOLVED/SUMMARY
  2015-10-30 18:12                 ` Phil Turmel
@ 2015-11-03 23:42                   ` Andras Tantos
  0 siblings, 0 replies; 24+ messages in thread
From: Andras Tantos @ 2015-11-03 23:42 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Linux-RAID

Thank you all who helped me solve my problem, especially Phil Turmel, 
to whom I am in debt for the rest of my life. Right now my family photos 
- and my marriage - are safe.

For people, who might be interested in the future, here's a quick 
summary of the events and the recovery:

Trouble:
==========

Was going to extend RAID6 array from 7 disks to 10. Array reshape 
crashed early in the process. After reboot, the array wouldn't 
re-assemble with error message:

     mdadm: WARNING /dev/sda and /dev/sda1 appear to have very similar
     superblocks.
           If they are really different, please --zero the superblock on one
           If they are the same or overlap, please remove one from the
           DEVICE list in mdadm.conf.

What I SHOULD have done here is to remove SDA from the DEVICE list in 
mdadm.conf followed by mdadm --grow --continue /dev/md1 --backup-file .....
What I did is to zero the superblock of SDA1.

The same message appeared for the other two new HDDs in the array as 
well. By the time I had zeroed the superblocks of all three new disks, 
the array assembled but didn't start because it was missing three drives.

Recovery:
===========
1. Look at the partitions listed in /proc/mdstat for the array.
2. For each of the constituents of the array, do mdadm -E <disk name 
from the array>
3. Note all the parameters, especially these: 'Chunk Size', 'Raid 
Level', 'Version'
4. Make sure all remaining disks show the same event count ('Events'), 
that they have correct checksums, and that all the above parameters 
match (see the snippet after this list).
5. Note the order of the disks in the array. You can find that in this line:

            Number   Major   Minor   RaidDevice State
      this     6       8       98        6      active sync

6. If all matches, stop the array:
     mdadm --stop /dev/md1

7. Re-create your array as follows:
     mdadm --create --assume-clean --verbose \
         --metadata=1.0 --raid-devices=7 --chunk=64 --level=6 \
         /dev/md1 <list of devices in the exact order from note 5 above>

     Replace the number of devices, chunk size and raid level with the 
values from note 3 above. For me, I had to specify metadata version 0.9, 
which was my original metadata version (as reported by the 'Version' 
parameter in point 3 above). YMMV.

8. If all goes well, the array will now re-assemble with the original 7 
disks. The data on the array is corrupted up to the point where the 
reshape stopped, so...
9. fsck -n /dev/md1 to assess the damage. If doesn't look terrible, fix 
the errors: fsck -y /dev/md1.
10. Mount the array and rejoice in the data that's recovered.
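
For step 4, a quick way to compare those fields across members (a 
sketch; adjust the device list to your array):

     for d in /dev/sd[b-h]1 ; do
         echo "== $d" ; mdadm -E $d | grep -E 'Events|Checksum'
     done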

Final notes:
===============
I still don't know the root cause of the crash. What I did notice is 
that this particular (Core2 duo) system seems to become unstable with 
more than 9 HDDs. It doesn't seem to be a power supply issue, as it has 
trouble even when about half of the drives are supplied from a second PSU.

Version 0.9 metadata has some problems, causing the misleading message 
in the first place. Upgrading to version 1.0 metadata is a good idea.

If you use desktop or green drives in your array, fix the short kernel 
timeout on SATA devices (30s). Issue this on every boot:
     for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done
If you don't do that, the first unrecoverable read error will kick a 
drive out and degrade your array, instead of md simply rewriting (and 
thereby relocating) the failing sector.
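
One way to make that happen on every boot is a udev rule (a sketch; the 
rule file name is illustrative, adjust for your init system):

     # /etc/udev/rules.d/60-sata-timeout.rules
     ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
         RUN+="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"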

To find and fix unrecoverable read errors on your array, regularly issue:
     echo check >/sys/block/md0/md/sync_action
This is a looooong operation on a large RAID6 array, but makes sure that 
bad sectors don't accumulate in seldom-accessed corners and destroy your 
array at the worst possible time.
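
To schedule it, a cron entry along these lines works (a sketch; on 
Debian the mdadm package also ships a checkarray script, run monthly 
from /etc/cron.d/mdadm, that does much the same thing):

     # in /etc/crontab: scrub md0 at 02:00 on the 1st of each month
     0 2 1 * * root echo check > /sys/block/md0/md/sync_action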

Andras


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-11-03 23:42 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-20  2:35 How to recover after md crash during reshape? andras
2015-10-20 12:50 ` Anugraha Sinha
2015-10-20 13:04 ` Wols Lists
2015-10-20 13:49 ` Phil Turmel
     [not found]   ` <3baf849321d819483c5d20c005a31844@tantosonline.com>
2015-10-20 15:42     ` Phil Turmel
2015-10-20 22:34       ` Anugraha Sinha
2015-10-21  3:52       ` andras
2015-10-21 12:01         ` Phil Turmel
2015-10-21 16:17       ` Wols Lists
2015-10-21 16:05         ` Phil Turmel
2015-10-25 14:15       ` andras
2015-10-25 23:02         ` Phil Turmel
2015-10-28 16:31           ` Andras Tantos
2015-10-28 16:42             ` Phil Turmel
2015-10-28 17:10               ` Andras Tantos
2015-10-28 17:38                 ` Phil Turmel
2015-10-29 16:59               ` Andras Tantos
2015-10-30 18:12                 ` Phil Turmel
2015-11-03 23:42                   ` How to recover after md crash during reshape? - SOLVED/SUMMARY Andras Tantos
2015-10-21  1:35 ` How to recover after md crash during reshape? Neil Brown
2015-10-21  4:03   ` andras
2015-10-21 12:18   ` Phil Turmel
2015-10-21 20:26     ` Neil Brown
2015-10-21 20:37       ` Phil Turmel
