* feature re-quest for "re-write" @ 2014-02-21 18:09 Mikael Abrahamsson 2014-02-24 1:30 ` Brad Campbell 2014-02-24 2:24 ` Brad Campbell 0 siblings, 2 replies; 22+ messages in thread From: Mikael Abrahamsson @ 2014-02-21 18:09 UTC (permalink / raw) To: linux-raid Hi, we have "check", "repair", "replacement" and other operations on raid volumes. I am not a programmer, but I was wondering how much work it would require to take the current code and implement "rewrite", basically re-writing every block in the md raid level. Since "repair" and "check" don't seem to properly detect a few errors, wouldn't it make sense to take the path-of-least-resistance / easiest-to-implement route and just re-write all data on the entire array? If reads fail, re-calculate from parity; if reads work, just write again. The goal of this new mode would be to eradicate pending sectors by re-writing everything on the drive. If this doesn't seem like a sensible approach, what would be a sensible approach to avoid having pending sectors keep being "pending" even after "check" and "repair"? -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: feature re-quest for "re-write" 2014-02-21 18:09 feature re-quest for "re-write" Mikael Abrahamsson @ 2014-02-24 1:30 ` Brad Campbell 2014-02-24 1:46 ` Eyal Lebedinsky 2014-02-24 2:42 ` Mikael Abrahamsson 2014-02-24 2:24 ` Brad Campbell 1 sibling, 2 replies; 22+ messages in thread From: Brad Campbell @ 2014-02-24 1:30 UTC (permalink / raw) To: Mikael Abrahamsson, linux-raid On 22/02/14 02:09, Mikael Abrahamsson wrote: > > If this doesn't seem like a sensible approach, what would be a sensible > approach to avoid having pending sectors keep being "pending" even after > "check" and "repair"? > The only reason I've ever seen this personally was when the pending sectors were on non-data parts of the drive, like some of the space around the superblock. Have you verified that these issues are really on sectors in the data area? SMART should tell you the LBA of the first error in a read test. Brad ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: feature re-quest for "re-write" 2014-02-24 1:30 ` Brad Campbell @ 2014-02-24 1:46 ` Eyal Lebedinsky 2014-02-24 2:11 ` Brad Campbell 2014-02-24 2:42 ` Mikael Abrahamsson 1 sibling, 1 reply; 22+ messages in thread From: Eyal Lebedinsky @ 2014-02-24 1:46 UTC (permalink / raw) To: list linux-raid In my case (see earlier thread "raid check does not...") the pending sector is early in the device, in sector 261696 of a 4TB component (whole space in one partition of each component). So yes, inside the data area. I still have it reported in my daily logwatch; any idea what to try? Currently unreadable (pending) sectors detected: /dev/sdi [SAT] - 48 Time(s) 1 unreadable sectors detected Eyal On 02/24/14 12:30, Brad Campbell wrote: > On 22/02/14 02:09, Mikael Abrahamsson wrote: >> > >> If this doesn't seem like a sensible approach, what would be a sensible >> approach to avoid having pending sectors keep being "pending" even after >> "check" and "repair"? >> > > The only reason I've ever seen this personally was when the pending sectors were on non-data parts of the drive, like some of the space around the superblock. Have you verified that these issues are really on sectors in the data area? SMART should tell you the LBA of the first error in a read test. > > Brad -- Eyal Lebedinsky (eyal@eyal.emu.id.au) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: feature re-quest for "re-write" 2014-02-24 1:46 ` Eyal Lebedinsky @ 2014-02-24 2:11 ` Brad Campbell 2014-02-24 3:40 ` Eyal Lebedinsky 0 siblings, 1 reply; 22+ messages in thread From: Brad Campbell @ 2014-02-24 2:11 UTC (permalink / raw) To: Eyal Lebedinsky, list linux-raid On 24/02/14 09:46, Eyal Lebedinsky wrote: > In my case (see earlier thread "raid check does not..." the pending > sector is early > in the device, in sector 261696 of a 4TB component (whole space in one > partition of > each component). So yes, inside the data area. > > I still have it reported in my daily logwatch, any idea what to try? > Yes, can you run a dd of the md device from well before to well after the theoretical position of the error? If the dd passes cleanly, it indicates the bad block is a parity block rather than a data block. That hopefully will help narrow down the scope of the search. Brad ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: feature re-quest for "re-write" 2014-02-24 2:11 ` Brad Campbell @ 2014-02-24 3:40 ` Eyal Lebedinsky 2014-02-24 14:14 ` Wilson Jonathan 0 siblings, 1 reply; 22+ messages in thread From: Eyal Lebedinsky @ 2014-02-24 3:40 UTC (permalink / raw) To: list linux-raid I know that the i/o error is in /dev/sdi sector 261696 (consistent kernel and smart reports) - /dev/sdi1 starts 2048 sectors later - /dev/md127 is a 7 devs raid6 so there is 5 times as much data in the array until we hit the bad sector # dd if=/dev/sdi1 of=/dev/null skip=$((1*(261696-2048))) count=1 dd: error reading '/dev/sdi1': Input/output error 0+0 records in 0+0 records out The error is in one sector but 8 sectors will be read (a 4k buffer) so to get a clean read: # dd if=/dev/sdi1 of=/dev/null skip=$((1*(261696-2048)+8)) count=1 1+0 records in 1+0 records out 512 bytes (512 B) copied, 5.5852e-05 s, 9.2 MB/s # echo 3 >'/proc/sys/vm/drop_caches' # dd if=/dev/md127 of=/dev/null skip=$((5*(261696-2048))) count=5 5+0 records in 5+0 records out 2560 bytes (2.6 kB) copied, 0.00380436 s, 673 kB/s Now reading *much* more than necessary (first 10GB of the array): # echo 3 >'/proc/sys/vm/drop_caches' # dd if=/dev/md127 of=/dev/null count=$((20*1024*1024)) 20971520+0 records in 20971520+0 records out 10737418240 bytes (11 GB) copied, 13.5717 s, 791 MB/s Note that I do not expect to get an error because reading the array will not read the P/Q checksums (it assumes good data and avoids the calculations overhead of verifying P/Q). BTW, due to the use of a buffer layer I could have done the whole test using 4k blocks rather than sectors, but it makes no difference in this case. Eyal On 02/24/14 13:11, Brad Campbell wrote: > On 24/02/14 09:46, Eyal Lebedinsky wrote: >> In my case (see earlier thread "raid check does not..." the pending >> sector is early >> in the device, in sector 261696 of a 4TB component (whole space in one >> partition of >> each component). So yes, inside the data area. 
>> >> I still have it reported in my daily logwatch, any idea what to try? >> > > Yes, can you run a dd of the md device from well before to well after the theoretical position of the error? > > If the dd passes cleanly, it indicates the bad block is a parity block rather than a data block. That hopefully will help narrow down the scope of the search. > > Brad -- Eyal Lebedinsky (eyal@eyal.emu.id.au) ^ permalink raw reply [flat|nested] 22+ messages in thread
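The arithmetic behind the dd commands above can be captured in a few shell variables. A sketch using the numbers from this thread; it ignores the component's data offset and exact chunk layout, so the resulting array sector is only approximate:

```shell
BAD_LBA=261696     # absolute LBA reported by the kernel and SMART for /dev/sdi
PART_START=2048    # /dev/sdi1 starts this many sectors into the disk
NDATA=5            # 7-device RAID6 -> 5 data-bearing devices per stripe
COMP_SECT=$((BAD_LBA - PART_START))   # sector relative to the component partition
MD_SECT=$((NDATA * COMP_SECT))        # approximate sector in /dev/md127
echo "component sector: $COMP_SECT, array sector: ~$MD_SECT"
```

Eyal's `skip=$((5*(261696-2048)))` in the dd test above is the same `MD_SECT` value.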
* Re: feature re-quest for "re-write" 2014-02-24 3:40 ` Eyal Lebedinsky @ 2014-02-24 14:14 ` Wilson Jonathan 2014-02-24 20:39 ` Eyal Lebedinsky 0 siblings, 1 reply; 22+ messages in thread From: Wilson Jonathan @ 2014-02-24 14:14 UTC (permalink / raw) To: Eyal Lebedinsky; +Cc: list linux-raid > # echo 3 >'/proc/sys/vm/drop_caches' > # dd if=/dev/md127 of=/dev/null count=$((20*1024*1024)) > 20971520+0 records in > 20971520+0 records out > 10737418240 bytes (11 GB) copied, 13.5717 s, 791 MB/s > > Note that I do not expect to get an error because reading the array will not read the P/Q checksums > (it assumes good data and avoids the calculations overhead of verifying P/Q). > > BTW, due to the use of a buffer layer I could have done the whole test using 4k blocks rather than > sectors, but it makes no difference in this case. > > Eyal > I wonder, could you not use dd to perform a "refresh"? dd if=/dev/md127 of=/dev/md127 count=... bs=.... As that would force a re-calc and write of P & Q and data. That said, it's just a "could you/would it" suggestion, not a "do it", due to the inherent dangers of dd. ^ permalink raw reply [flat|nested] 22+ messages in thread
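Wilson's suggestion could be scripted along these lines. A sketch only: the device and offsets are example values from this thread, and an in-place dd over a live array is dangerous enough (a crash mid-write, or any concurrent writer, can corrupt data) that the script just prints the command for review rather than running it:

```shell
DEV=/dev/md127     # hypothetical target array
START=1298240      # first array sector to refresh (the offset Eyal computed: 5*(261696-2048))
COUNT=40           # sectors to rewrite (5 data devices x an 8-sector span)
# conv=notrunc keeps the device size intact; fsync flushes the rewritten span.
# skip/seek are in 512-byte sectors because bs=512.
echo dd if=$DEV of=$DEV bs=512 skip=$START seek=$START count=$COUNT conv=notrunc,fsync
```

Running it for real means dropping the `echo`; the read-modify-write forces md to recompute and write P and Q for the touched stripes.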
* Re: feature re-quest for "re-write" 2014-02-24 14:14 ` Wilson Jonathan @ 2014-02-24 20:39 ` Eyal Lebedinsky 2014-02-25 3:16 ` NeilBrown 0 siblings, 1 reply; 22+ messages in thread From: Eyal Lebedinsky @ 2014-02-24 20:39 UTC (permalink / raw) To: list linux-raid My main interest is to understand why 'check' does not actually check. I already know how to fix the problem, by writing to the location I can force the pending reallocation to happen, but then I will not have the test case anymore. The OP asks for a specific solution, but I think that the 'check' action should already correctly rewrite failed (i/o error) sectors. It does not always know which sector to rewrite when it finds a raid6 mismatch without an i/o error (with raid5 it never knows). Eyal On 02/25/14 01:14, Wilson Jonathan wrote: > >> # echo 3 >'/proc/sys/vm/drop_caches' >> # dd if=/dev/md127 of=/dev/null count=$((20*1024*1024)) >> 20971520+0 records in >> 20971520+0 records out >> 10737418240 bytes (11 GB) copied, 13.5717 s, 791 MB/s >> >> Note that I do not expect to get an error because reading the array will not read the P/Q checksums >> (it assumes good data and avoids the calculations overhead of verifying P/Q). >> >> BTW, due to the use of a buffer layer I could have done the whole test using 4k blocks rather than >> sectors, but it makes no difference in this case. >> >> Eyal >> > > I wonder, could you not use dd to perform a "refresh"? > > dd if=/dev/md127 of=/dev/md127 count=... bs=.... > > As that would force a re-calc and write of P & Q and data. > > That said, its just a "could you/would it" suggestion not a "do it" due > to the inherent dangers of dd. -- Eyal Lebedinsky (eyal@eyal.emu.id.au) ^ permalink raw reply [flat|nested] 22+ messages in thread
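The "writing to the location" fix Eyal mentions is typically done with hdparm's destructive sector write. A hedged sketch (the LBA is the one from this thread; the command is printed rather than executed because it irreversibly overwrites that sector, which is exactly what lets the drive firmware reallocate it if the medium is bad):

```shell
BAD_LBA=261696   # absolute LBA on /dev/sdi, from the kernel and SMART logs
# --write-sector zero-fills the one sector; the long safety flag is mandatory.
# Printed for review only: this destroys the sector's contents.
echo hdparm --write-sector $BAD_LBA --yes-i-know-what-i-am-doing /dev/sdi
```

As Eyal notes, doing this clears the pending sector but also destroys the test case for why 'check' never repaired it.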
* Re: feature re-quest for "re-write" 2014-02-24 20:39 ` Eyal Lebedinsky @ 2014-02-25 3:16 ` NeilBrown 2014-02-25 5:58 ` Eyal Lebedinsky 2014-02-25 7:58 ` Eyal Lebedinsky 1 sibling, 2 replies; 22+ messages in thread From: NeilBrown @ 2014-02-25 3:16 UTC (permalink / raw) To: Eyal Lebedinsky; +Cc: list linux-raid, Mikael Abrahamsson [-- Attachment #1: Type: text/plain, Size: 2444 bytes --] On Tue, 25 Feb 2014 07:39:14 +1100 Eyal Lebedinsky <eyal@eyal.emu.id.au> wrote: > My main interest is to understand why 'check' does not actually check. > I already know how to fix the problem, by writing to the location I > can force the pending reallocation to happen, but then I will not have > the test case anymore. > > The OP asks for a specific solution, but I think that the 'check' action > should already correctly rewrite failed (i/o error) sectors. It does not > always know which sector to rewrite when it finds a raid6 mismatch > without an i/o error (with raid5 it never knows). > I cannot reproduce the problem. In my testing a read error is fixed by 'check'. For you it clearly isn't. I wonder what is different. During normal 'check' or 'repair' etc the read requests are allowed to be combined by the io scheduler so when we get a read error, it could be one error for a megabyte or more of the address space. So the first thing raid5.c does is arrange to read all the blocks again but to prohibit the merging of requests. This time any read error will be for a single 4K block. Once we have that reliable read error the data is constructed from the other blocks and the new block is written out. This suggests that when there is a read error you should see e.g. [ 714.808494] end_request: I/O error, dev sds, sector 8141872 then shortly after that another similar error, possibly with a slightly different sector number (at most a few thousand sectors later).
Then something like md/raid:md0: read error corrected (8 sectors at 8141872 on sds) However in the log Mikael Abrahamsson posted on 16 Jan 2014 (Subject: Re: read errors not corrected when doing check on RAID6) we only see that first 'end_request' message. No second one and no "read error corrected". This seems to suggest that the second read succeeded, which is odd (to say the least). In your log posted 21 Feb 2014 (Subject: raid 'check' does not provoke expected i/o error) there aren't even any read errors during 'check'. The drive sometimes reports a read error and sometimes doesn't? Does reading the drive with 'dd' already report an error, and with 'check' never report an error? So I'm a bit stumped. It looks like md is doing the right thing, but maybe the drive is getting confused. Are all the people who report this using the same sort of drive?? NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: feature re-quest for "re-write" 2014-02-25 3:16 ` NeilBrown @ 2014-02-25 5:58 ` Eyal Lebedinsky 2014-02-25 7:05 ` Stan Hoeppner 2014-02-25 7:58 ` Eyal Lebedinsky 1 sibling, 1 reply; 22+ messages in thread From: Eyal Lebedinsky @ 2014-02-25 5:58 UTC (permalink / raw) To: list linux-raid My case is consistent. Reading /dev/sdi1 provokes the end_request error. This is 100% reproducible. Reading /dev/md127 runs clean (no error message). Doing a 'check' completes clean (no error message).(*) smartctl shows one pending sector. Some details listed below. Eyal (*) I run a check action by setting sync_min/sync_max/sync_action to cover the bad sector. However, just to be sure, I allowed an overnight full check which also ran clean. The bad sector is still pending. This is how I run the short tests: # parted -l Model: ATA WDC WD4001FAEX-0 (scsi) Disk /dev/sdi: 4001GB Sector size (logical/physical): 512B/512B Partition Table: gpt Disk Flags: Number Start End Size File system Name Flags 1 1049kB 4001GB 4001GB # cat /proc/mdstat Personalities : [raid6] [raid5] [raid4] md127 : active raid6 sdf1[7] sdd1[1] sdc1[0] sdg1[4] sdh1[5] sde1[2] sdi1[6] 19534425600 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/7] [UUUUUUU] bitmap: 0/30 pages [0KB], 65536KB chunk # cat /sys/block/md127/md/chunk_size 524288 # sys="/sys/block/md127/md" # echo '0' >$sys/sync_min # check first # echo '1000384' >$sys/sync_max # 1m sectors, 0.5GB # echo 'check' >$sys/sync_action Examining /proc/mdstat every second: 16:30:46 [>....................] check = 0.0% (131360/3906885120) finish=495.6min speed=131360K/sec 16:30:47 [>....................] check = 0.0% (273696/3906885120) finish=237.8min speed=273696K/sec 16:30:48 [>....................] check = 0.0% (416032/3906885120) finish=312.9min speed=208016K/sec 16:30:49 [>....................] check = 0.0% (561952/3906885120) finish=347.5min speed=187317K/sec 16:30:50 [>....................] 
check = 0.0% (707872/3906885120) finish=367.8min speed=176968K/sec 16:30:51 [>....................] check = 0.0% (844072/3906885120) finish=385.6min speed=168814K/sec 16:30:52 [>....................] check = 0.0% (991016/3906885120) finish=394.1min speed=165169K/sec 16:30:53 [>....................] check = 0.0% (1136936/3906885120) finish=400.7min speed=162419K/sec 16:30:54 [>....................] check = 0.0% (1283368/3906885120) finish=405.7min speed=160421K/sec 16:30:55 [>....................] check = 0.0% (1427752/3906885120) finish=410.3min speed=158639K/sec 16:30:56 [>....................] check = 0.0% (1544492/3906885120) finish=463.5min speed=140408K/sec 16:30:57 [>....................] check = 0.0% (1726304/3906885120) finish=452.4min speed=143858K/sec 16:30:58 [>....................] check = 0.0% (1866592/3906885120) finish=453.2min speed=143584K/sec 16:30:59 [>....................] check = 0.0% (2012000/3906885120) finish=452.8min speed=143714K/sec 16:31:00 [>....................] check = 0.0% (2154336/3906885120) finish=453.1min speed=143622K/sec 16:31:01 [>....................] check = 0.0% (2226336/3906885120) finish=467.6min speed=139146K/sec 16:31:02 [>....................] check = 0.0% (2401636/3906885120) finish=460.6min speed=141272K/sec 16:31:03 [>....................] check = 0.0% (2549592/3906885120) finish=459.4min speed=141644K/sec 16:31:04 [>....................] check = 0.0% (2690864/3906885120) finish=459.4min speed=141625K/sec 16:31:05 [>....................] check = 0.0% (2834776/3906885120) finish=459.0min speed=141738K/sec 16:31:06 [>....................] check = 0.0% (2928880/3906885120) finish=466.5min speed=139470K/sec 16:31:07 [>....................] check = 0.0% (3029760/3906885120) finish=472.4min speed=137716K/sec 16:31:08 [>....................] check = 0.0% (3111680/3906885120) finish=480.9min speed=135290K/sec 16:31:09 [>....................] 
check = 0.0% (3258624/3906885120) finish=479.1min speed=135776K/sec 16:31:10 [>....................] check = 0.0% (3401472/3906885120) finish=478.1min speed=136058K/sec 16:31:11 [>....................] check = 0.0% (3544832/3906885120) finish=477.1min speed=136339K/sec 16:31:12 [>....................] check = 0.0% (3657476/3906885120) finish=480.2min speed=135462K/sec 16:31:13 [>....................] check = 0.0% (3797764/3906885120) finish=479.6min speed=135634K/sec 16:31:14 [>....................] check = 0.1% (3941636/3906885120) finish=478.5min speed=135918K/sec 16:31:15 [>....................] check = 0.1% (4076292/3906885120) finish=478.7min speed=135876K/sec 16:31:16 [>....................] check = 0.1% (4221188/3906885120) finish=477.6min speed=136167K/sec 16:31:17 [>....................] check = 0.1% (4325252/3906885120) finish=481.2min speed=135164K/sec 16:31:18 [>....................] check = 0.1% (4497992/3906885120) finish=477.1min speed=136302K/sec 16:31:19 [>....................] check = 0.1% (4644936/3906885120) finish=477.1min speed=136300K/sec 16:31:20 [>....................] check = 0.1% (4779088/3906885120) finish=477.3min speed=136233K/sec 16:31:21 [>....................] check = 0.1% (4914888/3906885120) finish=477.4min speed=136220K/sec 16:31:22 [>....................] check = 0.1% (4990720/3906885120) finish=487.6min speed=133366K/sec 16:31:23 [>....................] check = 0.1% (4999896/3906885120) finish=502.2min speed=129485K/sec # cat /sys/block/md127/md/mismatch_cnt 0 # echo 'idle' >$sys/sync_action # dmesg|tail [ 4134.750324] md: data-check of RAID array md127 [ 4134.756992] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. [ 4134.764956] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check. [ 4134.776816] md: using 128k window, over a total of 3906885120k. [ 4174.065003] md: md_do_sync() got signal ... 
exiting On 02/25/14 14:16, NeilBrown wrote: > On Tue, 25 Feb 2014 07:39:14 +1100 Eyal Lebedinsky <eyal@eyal.emu.id.au> > wrote: > >> My main interest is to understand why 'check' does not actually check. >> I already know how to fix the problem, by writing to the location I >> can force the pending reallocation to happen, but then I will not have >> the test case anymore. >> >> The OP asks for a specific solution, but I think that the 'check' action >> should already correctly rewrite failed (i/o error) sectors. It does not >> always know which sector to rewrite when it finds a raid6 mismatch >> without an i/o error (with raid5 it never knows). >> > > I cannot reproduce the problem. In my testing a read error is fixed by > 'check'. For you it clearly isn't. I wonder what is different. > > During normal 'check' or 'repair' etc the read requests are allowed to be > combined by the io scheduler so when we get a read error, it could be one > error for a megabyte of more of the address space. > So the first thing raid5.c does is arrange to read all the blocks again but > to prohibit the merging of requests. This time any read error will be for a > single 4K block. > > Once we have that reliable read error the data is constructed from the other > blocks and the new block is written out. > > This suggests that when there is a read error you should see e.g. > > [ 714.808494] end_request: I/O error, dev sds, sector 8141872 > > then shortly after that another similar error, possibly with a slightly > different sector number (at most a few thousand sectors later). > > Then something like > > md/raid:md0: read error corrected (8 sectors at 8141872 on sds) > > > However in the log Mikael Abrahamsson posted on 16 Jan 2014 > (Subject: Re: read errors not corrected when doing check on RAID6) > > we only see that first 'end_request' message. No second one and no "read > error corrected". > > This seems to suggest that the second read succeeded, which is odd (to say > the least). 
> > In your log posted 21 Feb 2014 > (Subject: raid 'check' does not provoke expected i/o error) > there aren't even any read errors during 'check'. > The drive sometimes reports a read error and something doesn't? > Does reading the drive with 'dd' already report an error, and with 'check' > never report an error? > > > > So I'm a bit stumped. It looks like md is doing the right thing, but maybe > the drive is getting confused. > Are all the people who report this using the same sort of drive?? > > NeilBrown > -- Eyal Lebedinsky (eyal@eyal.emu.id.au) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: feature re-quest for "re-write" 2014-02-25 5:58 ` Eyal Lebedinsky @ 2014-02-25 7:05 ` Stan Hoeppner 2014-02-25 7:45 ` Eyal Lebedinsky 0 siblings, 1 reply; 22+ messages in thread From: Stan Hoeppner @ 2014-02-25 7:05 UTC (permalink / raw) To: Eyal Lebedinsky, list linux-raid, NeilBrown On 2/24/2014 11:58 PM, Eyal Lebedinsky wrote: ... > (*) I run a check action by setting sync_min/sync_max/sync_action > to cover the bad sector. However, just to be sure, I allowed an overnight > full check which also ran clean. The bad sector is still pending. What is the expected behavior when the drive's spare sector pool has been exhausted, and thus the sector cannot be remapped by the drive firmware? Unless md now keeps a spare sector pool of its own and remaps bad sectors, the only way to fix this situation is to replace the drive. And if indeed drive spare pool exhaustion is the cause of the sector not being remapped, the drive needs to be replaced. -- Stan ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: feature re-quest for "re-write" 2014-02-25 7:05 ` Stan Hoeppner @ 2014-02-25 7:45 ` Eyal Lebedinsky 0 siblings, 0 replies; 22+ messages in thread From: Eyal Lebedinsky @ 2014-02-25 7:45 UTC (permalink / raw) To: list linux-raid The disk (actually the whole array) is relatively new and so far has no reallocated sectors. Eyal On 02/25/14 18:05, Stan Hoeppner wrote: > On 2/24/2014 11:58 PM, Eyal Lebedinsky wrote: > ... >> (*) I run a check action by setting sync_min/sync_max/sync_action >> to cover the bad sector. However, just to be sure, I allowed an overnight >> full check which also ran clean. The bad sector is still pending. > > What is the expected behavior when the drive's spare sector pool has > been exhausted, and thus the sector cannot be remapped by the drive > firmware? > > Unless md now keeps a spare sector pool of its own and remaps bad > sectors, the only way to fix this situation is to replace the drive. > And if indeed drive spare pool exhaustion is the cause of the sector not > being remapped, the drive needs to be replaced. -- Eyal Lebedinsky (eyal@eyal.emu.id.au) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: feature re-quest for "re-write" 2014-02-25 3:16 ` NeilBrown 2014-02-25 5:58 ` Eyal Lebedinsky @ 2014-02-25 7:58 ` Eyal Lebedinsky 2014-02-25 8:35 ` NeilBrown 1 sibling, 1 reply; 22+ messages in thread From: Eyal Lebedinsky @ 2014-02-25 7:58 UTC (permalink / raw) Cc: list linux-raid BTW, Is there a monitoring tool to trace all i/o to a device? I could then log activity to /dev/sd[c-i]1 during a (short) 'check' and see if all sectors are really read. Or does md have a debug facility for this? Eyal On 02/25/14 14:16, NeilBrown wrote: > On Tue, 25 Feb 2014 07:39:14 +1100 Eyal Lebedinsky <eyal@eyal.emu.id.au> > wrote: > >> My main interest is to understand why 'check' does not actually check. >> I already know how to fix the problem, by writing to the location I >> can force the pending reallocation to happen, but then I will not have >> the test case anymore. >> >> The OP asks for a specific solution, but I think that the 'check' action >> should already correctly rewrite failed (i/o error) sectors. It does not >> always know which sector to rewrite when it finds a raid6 mismatch >> without an i/o error (with raid5 it never knows). >> > > I cannot reproduce the problem. In my testing a read error is fixed by > 'check'. For you it clearly isn't. I wonder what is different. > > During normal 'check' or 'repair' etc the read requests are allowed to be > combined by the io scheduler so when we get a read error, it could be one > error for a megabyte of more of the address space. > So the first thing raid5.c does is arrange to read all the blocks again but > to prohibit the merging of requests. This time any read error will be for a > single 4K block. > > Once we have that reliable read error the data is constructed from the other > blocks and the new block is written out. > > This suggests that when there is a read error you should see e.g. 
> > [ 714.808494] end_request: I/O error, dev sds, sector 8141872 > > then shortly after that another similar error, possibly with a slightly > different sector number (at most a few thousand sectors later). > > Then something like > > md/raid:md0: read error corrected (8 sectors at 8141872 on sds) > > > However in the log Mikael Abrahamsson posted on 16 Jan 2014 > (Subject: Re: read errors not corrected when doing check on RAID6) > > we only see that first 'end_request' message. No second one and no "read > error corrected". > > This seems to suggest that the second read succeeded, which is odd (to say > the least). > > In your log posted 21 Feb 2014 > (Subject: raid 'check' does not provoke expected i/o error) > there aren't even any read errors during 'check'. > The drive sometimes reports a read error and something doesn't? > Does reading the drive with 'dd' already report an error, and with 'check' > never report an error? > > > > So I'm a bit stumped. It looks like md is doing the right thing, but maybe > the drive is getting confused. > Are all the people who report this using the same sort of drive?? > > NeilBrown > -- Eyal Lebedinsky (eyal@eyal.emu.id.au) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: feature re-quest for "re-write" 2014-02-25 7:58 ` Eyal Lebedinsky @ 2014-02-25 8:35 ` NeilBrown 2014-02-25 11:08 ` Eyal Lebedinsky 0 siblings, 1 reply; 22+ messages in thread From: NeilBrown @ 2014-02-25 8:35 UTC (permalink / raw) To: Eyal Lebedinsky; +Cc: list linux-raid [-- Attachment #1: Type: text/plain, Size: 3332 bytes --] On Tue, 25 Feb 2014 18:58:16 +1100 Eyal Lebedinsky <eyal@eyal.emu.id.au> wrote: > BTW, Is there a monitoring tool to trace all i/o to a device? I could then > log activity to /dev/sd[c-i]1 during a (short) 'check' and see if all sectors > are really read. Or does md have a debug facility for this? blktrace will collect a trace, blkparse will print it out for you. You need to trace the 'whole' device. So something like blktrace /dev/sd[c-i] # run the test ctrl-C blkparse sd[c-i]* blktrace creates several files, I think one for each device on each CPU. NeilBrown > > Eyal > > On 02/25/14 14:16, NeilBrown wrote: > > On Tue, 25 Feb 2014 07:39:14 +1100 Eyal Lebedinsky <eyal@eyal.emu.id.au> > > wrote: > > > >> My main interest is to understand why 'check' does not actually check. > >> I already know how to fix the problem, by writing to the location I > >> can force the pending reallocation to happen, but then I will not have > >> the test case anymore. > >> > >> The OP asks for a specific solution, but I think that the 'check' action > >> should already correctly rewrite failed (i/o error) sectors. It does not > >> always know which sector to rewrite when it finds a raid6 mismatch > >> without an i/o error (with raid5 it never knows). > >> > > > > I cannot reproduce the problem. In my testing a read error is fixed by > > 'check'. For you it clearly isn't. I wonder what is different. > > > > During normal 'check' or 'repair' etc the read requests are allowed to be > > combined by the io scheduler so when we get a read error, it could be one > > error for a megabyte of more of the address space. 
> > So the first thing raid5.c does is arrange to read all the blocks again but > > to prohibit the merging of requests. This time any read error will be for a > > single 4K block. > > > > Once we have that reliable read error the data is constructed from the other > > blocks and the new block is written out. > > > > This suggests that when there is a read error you should see e.g. > > > > [ 714.808494] end_request: I/O error, dev sds, sector 8141872 > > > > then shortly after that another similar error, possibly with a slightly > > different sector number (at most a few thousand sectors later). > > > > Then something like > > > > md/raid:md0: read error corrected (8 sectors at 8141872 on sds) > > > > > > However in the log Mikael Abrahamsson posted on 16 Jan 2014 > > (Subject: Re: read errors not corrected when doing check on RAID6) > > > > we only see that first 'end_request' message. No second one and no "read > > error corrected". > > > > This seems to suggest that the second read succeeded, which is odd (to say > > the least). > > > > In your log posted 21 Feb 2014 > > (Subject: raid 'check' does not provoke expected i/o error) > > there aren't even any read errors during 'check'. > > The drive sometimes reports a read error and something doesn't? > > Does reading the drive with 'dd' already report an error, and with 'check' > > never report an error? > > > > > > > > So I'm a bit stumped. It looks like md is doing the right thing, but maybe > > the drive is getting confused. > > Are all the people who report this using the same sort of drive?? > > > > NeilBrown > > > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
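Once a trace is captured, the blkparse output can be filtered for issued reads ('D' action, 'R' rwbs) whose sector range covers the suspect sector. A sketch assuming blkparse's default field order; the first sample record is the one from the trace posted later in this thread, the second is invented to show what a hit would look like:

```shell
# blkparse prints issued reads as: "maj,min cpu seq timestamp pid D R sector + length [process]"
# Keep D/R records where sector <= target < sector+length.
awk -v target=259648 '$6=="D" && $7=="R" && $8<=target && target<$8+$10' <<'EOF'
8,129 6 327 0.992307218 20259 D R 264200 + 504 [md127_resync]
8,129 6 328 1.014210000 20259 D R 259648 + 8 [md127_resync]
EOF
```

If the filter prints nothing for a whole 'check', the resync thread never issued a read covering that sector.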
* Re: feature re-quest for "re-write" 2014-02-25 8:35 ` NeilBrown @ 2014-02-25 11:08 ` Eyal Lebedinsky 2014-02-25 11:28 ` Mikael Abrahamsson 0 siblings, 1 reply; 22+ messages in thread From: Eyal Lebedinsky @ 2014-02-25 11:08 UTC (permalink / raw) Cc: list linux-raid This is helpful, Neil. I am running blktrace/blkparse and trying to understand what it is telling me. If I got it right then I see that doing a check of md127 (from the start) starts reading with this entry 8,129 6 327 0.992307218 20259 D R 264200 + 504 [md127_resync] which means that the real data starts rather further into the stripes. Actually, further than the bad block: sector 259648 of sdi1 is before the first read operation. Though I am not even sure that the blkparse 264200 is sectors and not 1KB blocks or 4KB blocks. Following is some speculation. Does md127 store a header before it starts striping the data? Might this be why it rarely actually needs to read parts of this header? (I thought that superblocks and what not are stored at the far end). If so, then the content of this sector is not part of the redundant data and may not be trivial to recover. Then again, I expect important data is recorded more than once. If this is the case then the calculation to correlate the bad sector to the fs block (which I need to do whenever I find a bad sector in order to investigate my data loss) is more complicated than I assumed. Final thought: if this sector is in an important header, when it *does* need to be read (and fails), how bad a reaction should I expect? Eyal On 02/25/14 19:35, NeilBrown wrote: > On Tue, 25 Feb 2014 18:58:16 +1100 Eyal Lebedinsky <eyal@eyal.emu.id.au> > wrote: > >> BTW, Is there a monitoring tool to trace all i/o to a device? I could then >> log activity to /dev/sd[c-i]1 during a (short) 'check' and see if all sectors >> are really read. Or does md have a debug facility for this? > > blktrace will collect a trace, blkparse will print it out for you.
> You need to trace the 'whole' device. > > So something like > > blktrace /dev/sd[c-i] > # run the test > ctrl-C > blkparse sd[c-i]* > > blktrace creates several files, I think one for each device on each CPU. > > > NeilBrown > >> >> Eyal >> >> On 02/25/14 14:16, NeilBrown wrote: >>> On Tue, 25 Feb 2014 07:39:14 +1100 Eyal Lebedinsky <eyal@eyal.emu.id.au> >>> wrote: >>> >>>> My main interest is to understand why 'check' does not actually check. >>>> I already know how to fix the problem, by writing to the location I >>>> can force the pending reallocation to happen, but then I will not have >>>> the test case anymore. >>>> >>>> The OP asks for a specific solution, but I think that the 'check' action >>>> should already correctly rewrite failed (i/o error) sectors. It does not >>>> always know which sector to rewrite when it finds a raid6 mismatch >>>> without an i/o error (with raid5 it never knows). >>>> >>> >>> I cannot reproduce the problem. In my testing a read error is fixed by >>> 'check'. For you it clearly isn't. I wonder what is different. >>> >>> During normal 'check' or 'repair' etc the read requests are allowed to be >>> combined by the io scheduler so when we get a read error, it could be one >>> error for a megabyte of more of the address space. >>> So the first thing raid5.c does is arrange to read all the blocks again but >>> to prohibit the merging of requests. This time any read error will be for a >>> single 4K block. >>> >>> Once we have that reliable read error the data is constructed from the other >>> blocks and the new block is written out. >>> >>> This suggests that when there is a read error you should see e.g. >>> >>> [ 714.808494] end_request: I/O error, dev sds, sector 8141872 >>> >>> then shortly after that another similar error, possibly with a slightly >>> different sector number (at most a few thousand sectors later). 
>>> >>> Then something like >>> >>> md/raid:md0: read error corrected (8 sectors at 8141872 on sds) >>> >>> >>> However in the log Mikael Abrahamsson posted on 16 Jan 2014 >>> (Subject: Re: read errors not corrected when doing check on RAID6) >>> >>> we only see that first 'end_request' message. No second one and no "read >>> error corrected". >>> >>> This seems to suggest that the second read succeeded, which is odd (to say >>> the least). >>> >>> In your log posted 21 Feb 2014 >>> (Subject: raid 'check' does not provoke expected i/o error) >>> there aren't even any read errors during 'check'. >>> The drive sometimes reports a read error and something doesn't? >>> Does reading the drive with 'dd' already report an error, and with 'check' >>> never report an error? >>> >>> >>> >>> So I'm a bit stumped. It looks like md is doing the right thing, but maybe >>> the drive is getting confused. >>> Are all the people who report this using the same sort of drive?? >>> >>> NeilBrown >>> >> > -- Eyal Lebedinsky (eyal@eyal.emu.id.au) ^ permalink raw reply [flat|nested] 22+ messages in thread
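On the unit question: blkparse conventionally prints offsets as 512-byte sector numbers in "start + count" form. Assuming that convention, the entry quoted above can be decoded like this (an illustrative sketch, not a tool from the thread; the helper name is made up):

```python
def decode_blkparse_io(line):
    """Decode the offset/length of a blkparse event line such as
    '8,129 6 327 0.992307218 20259 D R 264200 + 504 [md127_resync]'.
    Assumes blkparse prints 512-byte sector numbers ("start + count")."""
    toks = line.split()
    plus = toks.index("+")
    start_sector = int(toks[plus - 1])
    nsectors = int(toks[plus + 1])
    return start_sector, nsectors, start_sector * 512, nsectors * 512

line = "8,129 6 327 0.992307218 20259 D R 264200 + 504 [md127_resync]"
start, n, byte_off, nbytes = decode_blkparse_io(line)
print(start, n)        # -> 264200 504
# The suspect sector 259648 of sdi1 is below the first resync read:
print(259648 < start)  # -> True
```

Under that assumption, the first resync read really does start 264200 sectors in, past the pending sector at 259648.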
* Re: feature re-quest for "re-write"
  2014-02-25 11:08 ` Eyal Lebedinsky
@ 2014-02-25 11:28 ` Mikael Abrahamsson
  2014-02-25 12:05 ` Eyal Lebedinsky
  0 siblings, 1 reply; 22+ messages in thread
From: Mikael Abrahamsson @ 2014-02-25 11:28 UTC (permalink / raw)
To: Eyal Lebedinsky; +Cc: list linux-raid

On Tue, 25 Feb 2014, Eyal Lebedinsky wrote:

> Final thought: if this sector is in an important header, when it *does*
> need to be read (and fail), how bad a reaction should I expect?

I have two thoughts here:

Check the data offset when doing mdadm -E. There you will see how much
unused space is allocated between the superblock and the start of the
actual array data. This might be where your pending block is.

Regarding re-write: I have had it happen to me that a drive with bad
blocks that "check" didn't find errors on had read errors on the
superblock when I rebooted. It was not assembled into the array, and
instead md started rebuilding to a spare since the array was degraded.

So my question is, when issuing "check" or "repair", does md actually
check that the superblocks are readable? If not, perhaps it should?
Should it check that the contents of the superblocks are consistent with
the data that the kernel has in its data structures?

--
Mikael Abrahamsson    email: swmike@swm.pp.se
* Re: feature re-quest for "re-write"
  2014-02-25 11:28 ` Mikael Abrahamsson
@ 2014-02-25 12:05 ` Eyal Lebedinsky
  2014-02-25 12:17 ` Mikael Abrahamsson
  0 siblings, 1 reply; 22+ messages in thread
From: Eyal Lebedinsky @ 2014-02-25 12:05 UTC (permalink / raw)
  Cc: list linux-raid

Yes, this matches what I saw. From "mdadm -E /dev/sdi1":

    Avail Dev Size : 7813771264 (3725.90 GiB 4000.65 GB)
        Array Size : 19534425600 (18629.48 GiB 20003.25 GB)
     Used Dev Size : 7813770240 (3725.90 GiB 4000.65 GB)
       Data Offset : 262144 sectors
      Super Offset : 8 sectors

Is the layout of this 128MB area documented? I expect basic superblock
data, log, bitmap; what else? Are there copies elsewhere on the disk that
can be used?

I wonder what the bad sector 259648 (close to the end of the header)
covers. Can it be rebuilt from the other members of the array (when
stopped, one can expect the log and bitmap to be clearable)? Don't know.

Maybe build a new array with the --assume-clean option that will rewrite
the header but leave the data alone? Doco says "not recommended".

Or just give up: fail and remove the disk, clear the superblock, then add
it back and go through a full resync. This way feels safer as I do not
touch the other members.

cheers
	Eyal

On 02/25/14 22:28, Mikael Abrahamsson wrote:
> On Tue, 25 Feb 2014, Eyal Lebedinsky wrote:
>
>> Final thought: if this sector is in an important header, when it *does*
>> need to be read (and fail), how bad a reaction should I expect?
>
> I have two thoughts here:
>
> Check data offset when doing mdadm -E. There you will see how much
> unused data is allocated between the superblock and start of the actual
> array data contents. This might be where your pending block is.
>
> Regarding re-write. I have had happen to me that one drive that had bad
> blocks that "check" didn't find errors on, when I rebooted that drive
> had read errors on the superblock, was not assembled into the array, and
> instead md started rebuilding to a spare since the array was degraded.
> So my wonder is, when issuing "check" or "repair", does md actually
> check if the superblocks are readable? If not, perhaps it should? Should
> it check the contents of the superblocks are consistent with the data
> that the kernel has in its data structures?

--
Eyal Lebedinsky (eyal@eyal.emu.id.au)
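With the Data Offset and Super Offset values quoted above, deciding whether a partition-relative sector falls in the pre-data header area is simple arithmetic. An illustrative sketch (offsets taken from the mdadm -E output above; the helper name is made up):

```python
# Classify a partition-relative sector against the md layout reported by
# mdadm -E for Eyal's sdi1 (v1.x superblock near the start, data further in).
SUPER_OFFSET = 8        # sectors: where the superblock begins
DATA_OFFSET = 262144    # sectors: where the striped array data begins

def classify(sector):
    """Return (region, array-relative sector or None) for a sector of
    the member partition."""
    if sector < DATA_OFFSET:
        # superblock, bitmap, and padding live here; this area is not
        # part of the redundant striped data
        return ("pre-data region", None)
    # sectors past the data offset map into the array's address space
    return ("array data", sector - DATA_OFFSET)

print(classify(259648))  # -> ('pre-data region', None): the bad sector
print(classify(264200))  # -> ('array data', 2056): the first resync read
```

So under this reading, the pending sector at 259648 sits about 2496 sectors before the data starts, which would explain why 'check' never reads it.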
* Re: feature re-quest for "re-write"
  2014-02-25 12:05 ` Eyal Lebedinsky
@ 2014-02-25 12:17 ` Mikael Abrahamsson
  2014-02-25 12:32 ` Eyal Lebedinsky
  0 siblings, 1 reply; 22+ messages in thread
From: Mikael Abrahamsson @ 2014-02-25 12:17 UTC (permalink / raw)
To: Eyal Lebedinsky; +Cc: list linux-raid

On Tue, 25 Feb 2014, Eyal Lebedinsky wrote:

> Or just give up: fail and remove the disk, clear the superblock then add
> it and go through a full resync. This way feels safer as I do not touch
> the other members.

I am not sure this will work either. I would expect that the empty space
in the "data offset" area is never touched, even when doing a rebuild.
Perhaps that is also something that should be done?

--
Mikael Abrahamsson    email: swmike@swm.pp.se
* Re: feature re-quest for "re-write"
  2014-02-25 12:17 ` Mikael Abrahamsson
@ 2014-02-25 12:32 ` Eyal Lebedinsky
  0 siblings, 0 replies; 22+ messages in thread
From: Eyal Lebedinsky @ 2014-02-25 12:32 UTC (permalink / raw)
  Cc: list linux-raid

I expect that zeroing the superblock (and then recreating the header)
will surely write the whole area. Or I can play really safe and write to
the bad sector myself to force the reallocation while the disk is out.

	Eyal

On 02/25/14 23:17, Mikael Abrahamsson wrote:
> On Tue, 25 Feb 2014, Eyal Lebedinsky wrote:
>
>> Or just give up: fail and remove the disk, clear the superblock then
>> add it and go through a full resync. This way feels safer as I do not
>> touch the other members.
>
> I am not sure this will work either. I would expect that the empty data
> in "data offset" is never touched even when doing rebuild. Perhaps also
> something that should be done?

--
Eyal Lebedinsky (eyal@eyal.emu.id.au)
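Overwriting a single pending sector while the disk is out of the array is typically done with dd or hdparm. The sketch below only *builds* the command strings for a given device and sector; the commands are destructive to that sector's contents, so nothing here executes them (device name and sector are the ones from this thread, used purely as an example):

```python
def rewrite_sector_cmds(dev, sector):
    """Return shell commands one might use to force a pending sector to
    reallocate by overwriting it. DESTRUCTIVE to that sector's contents;
    returned as strings only, never run here."""
    return [
        # overwrite one 512-byte sector in place, bypassing the page cache
        f"dd if=/dev/zero of={dev} bs=512 seek={sector} count=1 oflag=direct",
        # or ask the drive firmware to rewrite it (hdparm's repair option,
        # which deliberately requires a confirmation flag)
        f"hdparm --write-sector {sector} --yes-i-know-what-i-am-doing {dev}",
    ]

for cmd in rewrite_sector_cmds("/dev/sdi1", 259648):
    print(cmd)
```

Either approach gives the drive a chance to reallocate the sector, after which the "pending" count should drop; SMART attributes 197/198 are the usual way to confirm.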
* Re: feature re-quest for "re-write"
  2014-02-24  1:30 ` Brad Campbell
  2014-02-24  1:46 ` Eyal Lebedinsky
@ 2014-02-24  2:42 ` Mikael Abrahamsson
  1 sibling, 0 replies; 22+ messages in thread
From: Mikael Abrahamsson @ 2014-02-24 2:42 UTC (permalink / raw)
To: Brad Campbell; +Cc: linux-raid

On Mon, 24 Feb 2014, Brad Campbell wrote:

> The only reason I've ever seen this personally was when the pending
> sectors were on non-data parts of the drive, like some of the space
> around the superblock. Have you verified that these issues are really on
> sectors in the data area? SMART should tell you the LBA of the first
> error in a read test.

I even received UNC errors in the log when doing "repair", but the sector
still wasn't re-written. So yes, they were on the data part.

--
Mikael Abrahamsson    email: swmike@swm.pp.se
* Re: feature re-quest for "re-write"
  2014-02-21 18:09 feature re-quest for "re-write" Mikael Abrahamsson
  2014-02-24  1:30 ` Brad Campbell
@ 2014-02-24  2:24 ` Brad Campbell
  2014-02-25  2:10 ` NeilBrown
  1 sibling, 1 reply; 22+ messages in thread
From: Brad Campbell @ 2014-02-24 2:24 UTC (permalink / raw)
To: Mikael Abrahamsson, linux-raid

On 22/02/14 02:09, Mikael Abrahamsson wrote:
>
> Hi,
>
> we have "check", "repair", "replacement" and other operations on raid
> volumes.
>
> I am not a programmer, but I was wondering how much work it would
> require to take current code and implement "rewrite", basically
> re-writing every block in the md raid level. Since "repair" and "check"
> don't seem to properly detect a few errors, wouldn't it make sense to
> try the least-resistance / easiest-implementation route and just
> re-write all data on the entire array? If reads fail, re-calculate from
> parity; if reads work, just write again.

Now, this is after 3 minutes of looking at raid5.c, so if I've missed
something obvious please feel free to yell at me. I'm not much of a
programmer. Having said that -

Can someone check my understanding of this bit of code?

static void handle_parity_checks6(struct r5conf *conf,
				  struct stripe_head *sh,
				  struct stripe_head_state *s,
				  int disks)
<....>

	switch (sh->check_state) {
	case check_state_idle:
		/* start a new check operation if there are < 2 failures */
		if (s->failed == s->q_failed) {
			/* The only possible failed device holds Q, so it
			 * makes sense to check P (If anything else were
			 * failed, we would have used P to recreate it).
			 */
			sh->check_state = check_state_run;
		}
		if (!s->q_failed && s->failed < 2) {
			/* Q is not failed, and we didn't use it to
			 * generate anything, so it makes sense to check it
			 */
			if (sh->check_state == check_state_run)
				sh->check_state = check_state_run_pq;
			else
				sh->check_state = check_state_run_q;
		}

So we get passed a stripe. If it's not being checked we:

- If Q has failed we initiate check_state_run (which checks only P).

- If we have fewer than 2 failed drives (let's say we have none), and we
  are already checking P (check_state_run), we upgrade that to
  check_state_run_pq (and therefore check both).

However:

- If we were check_state_idle, because we had 0 failed drives, then we
  only mark check_state_run_q and therefore skip checking P??

Regards,
Brad
* Re: feature re-quest for "re-write"
  2014-02-24  2:24 ` Brad Campbell
@ 2014-02-25  2:10 ` NeilBrown
  2014-02-25  2:26 ` Brad Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2014-02-25 2:10 UTC (permalink / raw)
To: Brad Campbell; +Cc: Mikael Abrahamsson, linux-raid

On Mon, 24 Feb 2014 10:24:36 +0800 Brad Campbell
<lists2009@fnarfbargle.com> wrote:

> On 22/02/14 02:09, Mikael Abrahamsson wrote:
> >
> > Hi,
> >
> > we have "check", "repair", "replacement" and other operations on raid
> > volumes.
> >
> > I am not a programmer, but I was wondering how much work it would
> > require to take current code and implement "rewrite", basically
> > re-writing every block in the md raid level. Since "repair" and "check"
> > don't seem to properly detect a few errors, wouldn't it make sense to
> > try the least-resistance / easiest-implementation route and just
> > re-write all data on the entire array? If reads fail, re-calculate from
> > parity; if reads work, just write again.
>
> Now, this is after 3 minutes of looking at raid5.c, so if I've missed
> something obvious please feel free to yell at me. I'm not much of a
> programmer. Having said that -
>
> Can someone check my understanding of this bit of code?
>
> static void handle_parity_checks6(struct r5conf *conf,
> 				  struct stripe_head *sh,
> 				  struct stripe_head_state *s,
> 				  int disks)
> <....>
>
> 	switch (sh->check_state) {
> 	case check_state_idle:
> 		/* start a new check operation if there are < 2 failures */
> 		if (s->failed == s->q_failed) {
> 			/* The only possible failed device holds Q, so it
> 			 * makes sense to check P (If anything else were
> 			 * failed, we would have used P to recreate it).
> 			 */
> 			sh->check_state = check_state_run;
> 		}
> 		if (!s->q_failed && s->failed < 2) {
> 			/* Q is not failed, and we didn't use it to
> 			 * generate anything, so it makes sense to check it
> 			 */
> 			if (sh->check_state == check_state_run)
> 				sh->check_state = check_state_run_pq;
> 			else
> 				sh->check_state = check_state_run_q;
> 		}
>
> So we get passed a stripe. If it's not being checked we:
>
> - If Q has failed we initiate check_state_run (which checks only P).
>
> - If we have fewer than 2 failed drives (let's say we have none), and we
>   are already checking P (check_state_run), we upgrade that to
>   check_state_run_pq (and therefore check both).
>
> However:
>
> - If we were check_state_idle, because we had 0 failed drives, then we
>   only mark check_state_run_q and therefore skip checking P??

This code is obviously too subtle.

If 0 drives have failed, then 's->failed' is 0 (it is the count of failed
drives), and 's->q_failed' is also 0 (it is a boolean flag, and q clearly
hasn't failed as nothing has).

So the first 'if' branch will be followed (as "0 == 0") and check_state
is set to check_state_run. Then, as q_failed is still 0 and failed < 2,
check_state gets set to check_state_run_pq.

So it does check both p and q.

NeilBrown
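Neil's walk-through of the two `if` branches can be condensed into a small truth table over `(failed, q_failed)`. An illustrative model in Python (not kernel code; the function name is made up):

```python
def initial_check_state(failed, q_failed):
    """Mirror the check_state_idle branch of handle_parity_checks6():
    'failed' counts failed devices in the stripe, 'q_failed' is a flag
    saying whether a failed device holds Q."""
    state = "idle"
    if failed == q_failed:
        # the only possible failure is Q itself, so checking P makes sense
        state = "run"                                  # check P
    if not q_failed and failed < 2:
        # Q is intact and was not used to regenerate anything: check it too
        state = "run_pq" if state == "run" else "run_q"
    return state

print(initial_check_state(0, 0))  # healthy stripe: 'run_pq' (check P and Q)
print(initial_check_state(1, 1))  # only Q failed:  'run'    (check P only)
print(initial_check_state(1, 0))  # one data/P failure: 'run_q' (check Q only)
print(initial_check_state(2, 0))  # two failures:   'idle'  (nothing to check)
```

The healthy-stripe case traces exactly the path Neil describes: `0 == 0` selects `run`, and the second branch then upgrades it to `run_pq`, so P is not skipped.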
* Re: feature re-quest for "re-write"
  2014-02-25  2:10 ` NeilBrown
@ 2014-02-25  2:26 ` Brad Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Brad Campbell @ 2014-02-25 2:26 UTC (permalink / raw)
To: NeilBrown; +Cc: Mikael Abrahamsson, linux-raid

On 25/02/14 10:10, NeilBrown wrote:

> This code is obviously too subtle.

Not at all, it's my understanding that is under-developed. I was just
looking for something obvious to explain the behaviour others have been
reporting, where a check won't trigger a re-write of a pending sector if
the sector is a p or q rather than data.

> If 0 drives have failed, then 's->failed' is 0 (it is the count of
> failed drives), and 's->q_failed' is also 0 (it is a boolean flag, and q
> clearly hasn't failed as nothing has).
> So the first 'if' branch will be followed (as "0 == 0") and check_state
> set to check_state_run.
> Then as q_failed is still 0 and failed < 2, check_state gets set to
> check_state_run_pq.

Got it, thanks for taking the time to set me straight.

Regards,
Brad
end of thread, other threads:[~2014-02-25 12:32 UTC | newest]

Thread overview: 22+ messages
2014-02-21 18:09 feature re-quest for "re-write" Mikael Abrahamsson
2014-02-24  1:30 ` Brad Campbell
2014-02-24  1:46 ` Eyal Lebedinsky
2014-02-24  2:11 ` Brad Campbell
2014-02-24  3:40 ` Eyal Lebedinsky
2014-02-24 14:14 ` Wilson Jonathan
2014-02-24 20:39 ` Eyal Lebedinsky
2014-02-25  3:16 ` NeilBrown
2014-02-25  5:58 ` Eyal Lebedinsky
2014-02-25  7:05 ` Stan Hoeppner
2014-02-25  7:45 ` Eyal Lebedinsky
2014-02-25  7:58 ` Eyal Lebedinsky
2014-02-25  8:35 ` NeilBrown
2014-02-25 11:08 ` Eyal Lebedinsky
2014-02-25 11:28 ` Mikael Abrahamsson
2014-02-25 12:05 ` Eyal Lebedinsky
2014-02-25 12:17 ` Mikael Abrahamsson
2014-02-25 12:32 ` Eyal Lebedinsky
2014-02-24  2:42 ` Mikael Abrahamsson
2014-02-24  2:24 ` Brad Campbell
2014-02-25  2:10 ` NeilBrown
2014-02-25  2:26 ` Brad Campbell