* how to handle bad sectors in md control areas?
@ 2014-02-26  8:16 Eyal Lebedinsky
  2014-02-28  1:35 ` Eyal Lebedinsky
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Eyal Lebedinsky @ 2014-02-26  8:16 UTC (permalink / raw)
  To: list linux-raid

In another thread I investigated an issue with a pending sector, which now seems to be
a bad sector inside the md header (the first 256k sectors).

The question now remaining: what is the correct approach to fixing this problem?

The more general issue is what to do when any md control area develops an error. Does
all of that data have redundant copies?

The simple way that I see is to fail the member, remove it, clear it (at least
--zero-superblock, plus a write to the bad sector) and then add it back. However,
this will incur a full resync (about 10 hours).
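
Roughly, I imagine that sequence as follows (an untested sketch; md127 and sdi1 are
the array and the suspect member here, and <bad-sector> stands for the LBA of the
pending sector within that partition):

	# mdadm /dev/md127 --fail /dev/sdi1 --remove /dev/sdi1
	# mdadm --zero-superblock /dev/sdi1
	# dd if=/dev/zero of=/dev/sdi1 bs=512 seek=<bad-sector> count=1
	# mdadm /dev/md127 --add /dev/sdi1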

Is there a faster, yet safe, way? I was thinking that a clean umount and raid stop
should allow a re-create with --assume-clean (which would write to the bad sector and
"fix" it), but the documentation discourages this.

Also, it is quite possible that this specific bad sector (toward the end of the
header) is not actually used today, meaning I could live with it as is, or write
anything to the bad sector since it never gets read. Verifying that is too involved,
though.

A bad sector in the data area should be fixed with a standard raid 'check' action.
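(On my array that would be something like

	# echo check > /sys/block/md127/md/sync_action

which, as I understand it, makes md read every sector and rewrite, from redundancy,
any that fail.)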

TIA

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)


* Re: how to handle bad sectors in md control areas?
  2014-02-26  8:16 how to handle bad sectors in md control areas? Eyal Lebedinsky
@ 2014-02-28  1:35 ` Eyal Lebedinsky
  2014-02-28 10:53   ` Piergiorgio Sartor
  2014-03-02  0:56 ` Eyal Lebedinsky
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Eyal Lebedinsky @ 2014-02-28  1:35 UTC (permalink / raw)
  To: list linux-raid

On 02/26/14 19:16, Eyal Lebedinsky wrote:
> In another thread I investigated an issue with a pending sector, which now seems to be
> a bad sector inside the md header (the first 256k sectors).
>
> The question now remaining: what is the correct approach to fixing this problem?
>
> The more general issue is what to do when any md control area develops an error. does
> all data have redundant copies?
>
> The simple way that I see is to fail the member, remove it, clear it (at least
> --zero-superblock and write to the bad sector) and then add it. However this
> will incur a full resync (about 10 hours).
>
> Is there a faster, yet safe way? I was thinking that a clean umount and raid stop
> should allow a create with --assume-clean (which will write to the bad sector and
> "fix" it), but the doco discourages this.
>
> Also, it is not impossible to think that the specific bad sector (toward the end
> of the header) is not actually used today, meaning I can live with it as is, or
> write anything to the bad sector as it does not get used. Too involved though.
>
> A bad sector in the data area should be fixed with a standard raid 'check' action.
>
> TIA

Adding more details to the above, examining my specific situation.

Dumping the first 128MB of each component (examples below) shows that only the range
0x1000-0x4000 is used; the rest is zeros (at least when the array is at rest).

Can I assume that it really is safe to write zeroes to the offending sector? (Note
how the dd of sdi1 fails at offset 0x7ec8000 [sector 259648], toward the very end of
the 128MB area that finishes at 0x8000000 [sector 262144].)
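
For the record, the failing sector can be confirmed directly with something like

	# dd if=/dev/sdi1 bs=512 skip=259648 count=1 of=/dev/null

which should return an I/O error on this drive, if my offset arithmetic above is right.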

Eyal

sd[c-i]1 are the 7 components with sdi1 having the bad sector.
sd[c-h]1 all look very similar.

# dd if=/dev/sdh1 bs=1M count=128 | od -x -Ax
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 0.20988 s, 639 MB/s
000000 0000 0000 0000 0000 0000 0000 0000 0000
*
001000 4efc a92b 0001 0000 0001 0000 0000 0000
001010 a4c6 6ac1 b5f7 aa51 2976 c0ec 9f10 1e7e
001020 3765 652e 6179 2e6c 6d65 2e75 6469 612e
001030 3a75 0031 0000 0000 0000 0000 0000 0000
001040 fca8 51c2 0000 0000 0006 0000 0002 0000
001050 ac00 d1bc 0001 0000 0400 0000 0007 0000
001060 0008 0000 0000 0000 0000 0000 0000 0000
001070 0000 0000 0000 0000 0000 0000 0000 0000
001080 0000 0004 0000 0000 b000 d1bc 0001 0000
001090 0008 0000 0000 0000 0000 0000 0000 0000
0010a0 0005 0000 0000 0000 2c6a a754 f403 5b99
0010b0 9b05 5407 a33e 41c4 0000 0000 0000 0000
0010c0 df7e 530f 0000 0000 32ad 0017 0000 0000
0010d0 ffff ffff ffff ffff e563 a205 0080 0000
0010e0 0000 0000 0000 0000 0000 0000 0000 0000
*
001100 0000 0001 0002 fffe 0004 0005 0006 0003
001110 fffe fffe fffe fffe fffe fffe fffe fffe
*
001200 0000 0000 0000 0000 0000 0000 0000 0000
*
002000 6962 6d74 0004 0000 a4c6 6ac1 b5f7 aa51
002010 2976 c0ec 9f10 1e7e 32ad 0017 0000 0000
002020 32ad 0017 0000 0000 ac00 d1bc 0001 0000
002030 0000 0000 0000 0400 0005 0000 0000 0000
002040 0000 0000 0000 0000 0000 0000 0000 0000
*
002f80 0000 0000 0000 0000 0000 0000 2000 0000
002f90 0000 0000 0000 0000 0000 0000 0000 0000
*
003690 0000 0000 0000 0000 0000 0000 0000 0200
0036a0 0000 0000 0000 0000 0000 0000 0000 0000
*
003ab0 4000 0000 0000 0000 0000 0000 0000 0000
003ac0 0000 0000 0000 0000 0000 0000 0000 0000
*
003e10 0000 0000 0000 0000 0000 8000 ffff ffff
003e20 ffff ffff ffff ffff ffff ffff ffff ffff
*
004000 0000 0000 0000 0000 0000 0000 0000 0000
*
8000000

# dd if=/dev/sdi1 bs=1M count=128 | od -x -Ax
dd: error reading '/dev/sdi1': Input/output error
126+1 records in
126+1 records out
132939776 bytes (133 MB) copied, 12.3209 s, 10.8 MB/s
000000 0000 0000 0000 0000 0000 0000 0000 0000
*
001000 4efc a92b 0001 0000 0001 0000 0000 0000
001010 a4c6 6ac1 b5f7 aa51 2976 c0ec 9f10 1e7e
001020 3765 652e 6179 2e6c 6d65 2e75 6469 612e
001030 3a75 0031 0000 0000 0000 0000 0000 0000
001040 fca8 51c2 0000 0000 0006 0000 0002 0000
001050 ac00 d1bc 0001 0000 0400 0000 0007 0000
001060 0008 0000 0000 0000 0000 0000 0000 0000
001070 0000 0000 0000 0000 0000 0000 0000 0000
001080 0000 0004 0000 0000 b000 d1bc 0001 0000
001090 0008 0000 0000 0000 0000 0000 0000 0000
0010a0 0006 0000 0000 0000 9d01 1bb7 9ebe 9ff8
0010b0 95d4 53b1 0ddb 9a2d 0000 0000 0000 0000
0010c0 df88 530f 0000 0000 32ae 0017 0000 0000
0010d0 0000 0000 0000 0000 662d b2da 0080 0000
0010e0 0000 0000 0000 0000 0000 0000 0000 0000
*
001100 0000 0001 0002 fffe 0004 0005 0006 0003
001110 fffe fffe fffe fffe fffe fffe fffe fffe
*
001200 0000 0000 0000 0000 0000 0000 0000 0000
*
002000 6962 6d74 0004 0000 a4c6 6ac1 b5f7 aa51
002010 2976 c0ec 9f10 1e7e 32ae 0017 0000 0000
002020 32ad 0017 0000 0000 ac00 d1bc 0001 0000
002030 0000 0000 0000 0400 0005 0000 0000 0000
002040 0000 0000 0000 0000 0000 0000 0000 0000
*
002f80 0000 0000 0000 0000 0000 0000 2000 0000
002f90 0000 0000 0000 0000 0000 0000 0000 0000
*
003ab0 c000 0000 0000 0000 0000 0000 0000 0000
003ac0 0000 0000 0000 0000 0000 0000 0000 0000
*
003e10 0000 0000 0000 0000 0000 8000 ffff ffff
003e20 ffff ffff ffff ffff ffff ffff ffff ffff
*
004000 0000 0000 0000 0000 0000 0000 0000 0000
*
7ec8000



-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)


* Re: how to handle bad sectors in md control areas?
  2014-02-28  1:35 ` Eyal Lebedinsky
@ 2014-02-28 10:53   ` Piergiorgio Sartor
  2014-02-28 13:23     ` Eyal Lebedinsky
  0 siblings, 1 reply; 9+ messages in thread
From: Piergiorgio Sartor @ 2014-02-28 10:53 UTC (permalink / raw)
  To: Eyal Lebedinsky; +Cc: list linux-raid

On Fri, Feb 28, 2014 at 12:35:14PM +1100, Eyal Lebedinsky wrote:
> On 02/26/14 19:16, Eyal Lebedinsky wrote:
> >In another thread I investigated an issue with a pending sector, which now seems to be
> >a bad sector inside the md header (the first 256k sectors).
> >
> >The question now remaining: what is the correct approach to fixing this problem?
> >
> >The more general issue is what to do when any md control area develops an error. does
> >all data have redundant copies?
> >
> >The simple way that I see is to fail the member, remove it, clear it (at least
> >--zero-superblock and write to the bad sector) and then add it. However this
> >will incur a full resync (about 10 hours).
> >
> >Is there a faster, yet safe way? I was thinking that a clean umount and raid stop
> >should allow a create with --assume-clean (which will write to the bad sector and
> >"fix" it), but the doco discourages this.
> >
> >Also, it is not impossible to think that the specific bad sector (toward the end
> >of the header) is not actually used today, meaning I can live with it as is, or
> >write anything to the bad sector as it does not get used. Too involved though.
> >
> >A bad sector in the data area should be fixed with a standard raid 'check' action.
> >
> >TIA
> 
> Adding more details to the above, examining my specific situation.
> 
> Dumping the first 128MB of each component (examples below) shows that only
> 0x1000-0x4000 is used, the rest is zeros (at least when the array is at rest).
> 
> Can I assume that it really is safe to write zeroes to the offending sector (note
> how the dd of sdi1 fails at offset 0x7ec8000 [sector 259648], toward the very end
> at 0x8000000 [sector 262144]).

If you search around (wikipedia, for example),
you'll find a pretty detailed description of
the MD superblock.
This will give you an idea of what is possible
and what is not possible to re-write.
Or what is critical and what is not critical.

Hope this helps,

bye,

pg

> [remainder of quoted message, including the hex dumps shown above, snipped]

-- 

piergiorgio


* Re: how to handle bad sectors in md control areas?
  2014-02-28 10:53   ` Piergiorgio Sartor
@ 2014-02-28 13:23     ` Eyal Lebedinsky
  2014-03-02 21:42       ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Eyal Lebedinsky @ 2014-02-28 13:23 UTC (permalink / raw)
  Cc: list linux-raid

Thanks Piergiorgio,

I did search before, unsuccessfully. I repeated the search now with different keywords and found
this entry in the kernel wiki:
	https://raid.wiki.kernel.org/index.php/RAID_superblock_formats

It documents the initial fixed fields of the superblock. I still do not know how
the intent bitmap is laid out. I can see that it starts 4KB into the superblock:
	Internal Bitmap : 8 sectors from superblock
but I have not yet found its size (which I expect depends on the array size).

I guess I can calculate it from the /sys/block/md127/md items:
	component_size * 1024 / 'bitmap/chunksize' / 8
which comes to 7452 bytes, still a small fraction of the 128MB header size.
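
As a sanity check, the same arithmetic straight from sysfs (assuming component_size
really is reported in KiB and bitmap/chunksize in bytes, which is how I read them):

	# cd /sys/block/md127/md
	# echo $(( $(cat component_size) * 1024 / $(cat bitmap/chunksize) / 8 ))

which gives the 7452 above.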

What else is there? A bad-blocks list? Reading an old blog post (from Neil) suggests that it
is not larger than 32KB (and is only 4KB now), so still "small" in this context.
I don't know where it resides, though.

I need to understand the full layout of the header, and so far I do not see anything that
says what the area past the initial 16KB is used for (it is always zero when I inspect it).
I started a heavy file copy on the raid and watched the header, and never saw any change.
I expected to see at least some activity in the bitmap, but encountered none.
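
(A crude way to watch for changes is simply to checksum the header in a loop, e.g.

	# while sleep 10; do dd if=/dev/sdh1 bs=1M count=128 2>/dev/null | md5sum; done

and look for the sum changing.)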

My simple question is: is it the case that the reserved header space after 16KB is
actually still unused in header version 1.2? My bad sector is practically at the end
of this 128MB area. A trivial question to answer for someone with expert knowledge
of md.

Anyone?

cheers,
	Eyal

n.b. I hate to admit it but I had a peek at the mdadm source to see how it handles the
superblock (super1.c). It seems to confirm what I guessed above,
though I think this is the wrong way to find documentation...

On 02/28/14 21:53, Piergiorgio Sartor wrote:
> On Fri, Feb 28, 2014 at 12:35:14PM +1100, Eyal Lebedinsky wrote:
>> On 02/26/14 19:16, Eyal Lebedinsky wrote:
>>> In another thread I investigated an issue with a pending sector, which now seems to be
>>> a bad sector inside the md header (the first 256k sectors).
>>>
>>> The question now remaining: what is the correct approach to fixing this problem?
>>>
>>> The more general issue is what to do when any md control area develops an error. does
>>> all data have redundant copies?
>>>
>>> The simple way that I see is to fail the member, remove it, clear it (at least
>>> --zero-superblock and write to the bad sector) and then add it. However this
>>> will incur a full resync (about 10 hours).
>>>
>>> Is there a faster, yet safe way? I was thinking that a clean umount and raid stop
>>> should allow a create with --assume-clean (which will write to the bad sector and
>>> "fix" it), but the doco discourages this.
>>>
>>> Also, it is not impossible to think that the specific bad sector (toward the end
>>> of the header) is not actually used today, meaning I can live with it as is, or
>>> write anything to the bad sector as it does not get used. Too involved though.
>>>
>>> A bad sector in the data area should be fixed with a standard raid 'check' action.
>>>
>>> TIA
>>
>> Adding more details to the above, examining my specific situation.
>>
>> Dumping the first 128MB of each component (examples below) shows that only
>> 0x1000-0x4000 is used, the rest is zeros (at least when the array is at rest).
>>
>> Can I assume that it really is safe to write zeroes to the offending sector (note
>> how the dd of sdi1 fails at offset 0x7ec8000 [sector 259648], toward the very end
>> at 0x8000000 [sector 262144]).
>
> If you search around (wikipedia, for example),
> you'll find a pretty detailed description of
> the MD superblock.
> This will give you an idea of what is possible
> and what is not possible to re-write.
> Or what is critical and what is not critical.
>
> Hope this helps,
>
> bye,
>
> pg
>
>> [quoted hex dumps and signature snipped]

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)


* Re: how to handle bad sectors in md control areas?
  2014-02-26  8:16 how to handle bad sectors in md control areas? Eyal Lebedinsky
  2014-02-28  1:35 ` Eyal Lebedinsky
@ 2014-03-02  0:56 ` Eyal Lebedinsky
  2014-03-02 13:25 ` Peter Grandi
  2014-03-02 21:38 ` NeilBrown
  3 siblings, 0 replies; 9+ messages in thread
From: Eyal Lebedinsky @ 2014-03-02  0:56 UTC (permalink / raw)
  To: list linux-raid

I did not get a clear reply, and decided to follow my assumption that
the 128MB header is only lightly used (only the 4KB-16KB range seems to be non-zero).

I just cleared the bad block
	dd of=/dev/sdi1 bs=4K seek=32456 count=1 if=/dev/zero

The write was successful and no bad blocks were recorded in the SMART log.
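
(An easy way to double-check is to read the block back and look at the SMART counters,
e.g.

	# dd if=/dev/sdi1 bs=4K skip=32456 count=1 | od -x
	# smartctl -A /dev/sdi | grep -i -e pending -e reallocated

though the attribute names vary a little between drives.)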

Everything else looks good, and hopefully the next 'check' will stay
clean too.

The general question remains, though: some confirmation of how much space is
actually used in the header would be good. Maybe add this to 'mdadm -E'?

cheers
	Eyal

On 02/26/14 19:16, Eyal Lebedinsky wrote:
> In another thread I investigated an issue with a pending sector, which now seems to be
> a bad sector inside the md header (the first 256k sectors).
>
> The question now remaining: what is the correct approach to fixing this problem?
>
> The more general issue is what to do when any md control area develops an error. does
> all data have redundant copies?
>
> The simple way that I see is to fail the member, remove it, clear it (at least
> --zero-superblock and write to the bad sector) and then add it. However this
> will incur a full resync (about 10 hours).
>
> Is there a faster, yet safe way? I was thinking that a clean umount and raid stop
> should allow a create with --assume-clean (which will write to the bad sector and
> "fix" it), but the doco discourages this.
>
> Also, it is not impossible to think that the specific bad sector (toward the end
> of the header) is not actually used today, meaning I can live with it as is, or
> write anything to the bad sector as it does not get used. Too involved though.
>
> A bad sector in the data area should be fixed with a standard raid 'check' action.
>
> TIA

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)


* Re: how to handle bad sectors in md control areas?
  2014-02-26  8:16 how to handle bad sectors in md control areas? Eyal Lebedinsky
  2014-02-28  1:35 ` Eyal Lebedinsky
  2014-03-02  0:56 ` Eyal Lebedinsky
@ 2014-03-02 13:25 ` Peter Grandi
  2014-03-02 21:38 ` NeilBrown
  3 siblings, 0 replies; 9+ messages in thread
From: Peter Grandi @ 2014-03-02 13:25 UTC (permalink / raw)
  To: Linux RAID

> In another thread I investigated an issue with a pending
> sector, which now seems to be a bad sector [ ... ] The question
> now remaining: what is the correct approach to fixing this
> problem?

The correct approach is something like:

> The simple way that I see is to fail the member, remove it,
> [ ... ] and then add it.

Where the last "it" is a "known good" storage device.

> [ ... ] clear it (at least --zero-superblock and write to the
> bad sector) [ ... ]

Whether "write to the bad sector" effects a repair or not turning
a failed storage device into a "known good" one, and is dangerous
or not, is a matter of judgement, based on a large number of
factors, and the particulars of the situation leading to the error.

> However this will incur a full resync (about 10 hours).

If you have intentionally or not designed a RAID setup that has
very expensive resync, that's what you get, unless you can
guarantee that resync will never happen. Good luck! :-)

> Is there a faster, yet safe way?

Ah the eternal illusion that someone knows a "secret" way to do
things N times better than other people, at no cost of course.

For RAID, in the general case, no. In some specific cases, where
you know what you are doing, including a deep understanding of
RAID, MD RAID, and storage device error causes and handling,
perhaps there is.

> A bad sector in the data area should be fixed with a standard
> raid 'check' action.

That seems to me to be a fruit of your imagination; and that of
others, as I occasionally watch the usual threads, eagerly
"contributed" to by the usual clowns, about MD RAID "detecting"
errors and "repairing" bad sectors.

Let's repeat here for the Nth time: RAID is entirely based on the
assumption that the storage devices (disks, host adapters, buses,
...) below it are either entirely error free, or report every
error that occurs on them; that there are no undetected errors.

RAID is not required to perform any detection of errors
undetected by the underlying storage devices, and in the general
case is not able to do that either, as the RAID "levels" with
redundancy have that redundancy designed for reconstruction not
error detection, and even well design error detection is usually
very, very expensive.

Even more so, RAID cannot "fix" bad sectors, and it is not
designed to do so, because RAID subsystems are mere IO remappers
and multiplexers (IIRC NeilB sometimes reminds people of that),
and the way storage device errors happen and can be fixed is a
difficult subject that cannot be handled in the general case by a
general purpose RAID IO remapper and multiplexer.

MD RAID, as a side effect of its operation, merely does some weak
consistency checks and some weak attempts at making things
not-worse when errors are reported or inconsistencies are
discovered.

This is, strictly speaking, beyond its mission and a layering
nastiness; but while it is somewhat useful, it is very important
that it remain a limited effort, because it is already very hard to
get an IO mapper and multiplexer to work reliably and with good
performance (a tradeoff between speed and other qualities) in the
general case.

Writing and maintaining a *correct* RAID subsystem is difficult
enough, e.g. given the extreme cases of parallelism and timing
dependent issues it involves (and many proprietary RAID products
are nowhere as reliable as MD RAID, perhaps also because they try
to do too many things other than mapping and multiplexing IO).

Reliable, safe error detection is usually quite expensive as to
speed, and reliable, safe error correction is very difficult to
do because the code gets rarely exercised, and there are so many
subtle and tricky cases.

If you want an error detecting, error-correcting block device
abstraction layer, write one quite separate from MD RAID, or buy
one of several expensive proprietary efforts aimed at your
demographic.

Myself, like many users of RAID and MD RAID, I would rather have MD
remain a *reliable*, low-overhead IO remapper and multiplexer,
with code as simple as possible for ease of understanding and
maintenance, without "mission creep". The end-to-end argument
also applies here.


* Re: how to handle bad sectors in md control areas?
  2014-02-26  8:16 how to handle bad sectors in md control areas? Eyal Lebedinsky
                   ` (2 preceding siblings ...)
  2014-03-02 13:25 ` Peter Grandi
@ 2014-03-02 21:38 ` NeilBrown
  2014-03-02 22:21   ` Eyal Lebedinsky
  3 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2014-03-02 21:38 UTC (permalink / raw)
  To: Eyal Lebedinsky; +Cc: list linux-raid

On Wed, 26 Feb 2014 19:16:30 +1100 Eyal Lebedinsky <eyal@eyal.emu.id.au>
wrote:

> In another thread I investigated an issue with a pending sector, which now seems to be
> a bad sector inside the md header (the first 256k sectors).
> 
> The question now remaining: what is the correct approach to fixing this problem?

You could "fix" it by simply redefining it not to be a problem.
If you never get an error then is there a problem?

> 
> The more general issue is what to do when any md control area develops an error. does
> all data have redundant copies?

We don't currently have any redundancy within a device.  Of course most
metadata is replicated across all devices so there is redundancy in that
sense.
I have occasionally thought of creating a v1.3 metadata which duplicates the
superblock at both ends of the device.  Never quite seemed worth the effort
though.
The write-intent-bitmap would be a lot more expensive to duplicate but as it
is identical on all devices, the gain would be small (though there are cases
where it would be useful).

The bad-block log probably should be duplicated.  That wouldn't be too
expensive and might have some real benefits....

> 
> The simple way that I see is to fail the member, remove it, clear it (at least
> --zero-superblock and write to the bad sector) and then add it. However this
> will incur a full resync (about 10 hours).
> 
> Is there a faster, yet safe way? I was thinking that a clean umount and raid stop
> should allow a create with --assume-clean (which will write to the bad sector and
> "fix" it), but the doco discourages this.

Why do you think this will write the bad sector?
When you --create an array it doesn't write to all the space on the array.
It only writes what it needs to.  So the superblock, the write-intent-bitmap
and maybe the bad-block-log.  But nothing else.
And most of that gets written during normal array activity.

So if a block remains unwritten after stop/start/check, you can be fairly sure
it isn't used at all, so you can ignore it.  Or write zeros to it.

> 
> Also, it is not impossible to think that the specific bad sector (toward the end
> of the header) is not actually used today, meaning I can live with it as is, or
> write anything to the bad sector as it does not get used. Too involved though.
> 
> A bad sector in the data area should be fixed with a standard raid 'check' action.
> 
> TIA
> 

NeilBrown


* Re: how to handle bad sectors in md control areas?
  2014-02-28 13:23     ` Eyal Lebedinsky
@ 2014-03-02 21:42       ` NeilBrown
  0 siblings, 0 replies; 9+ messages in thread
From: NeilBrown @ 2014-03-02 21:42 UTC (permalink / raw)
  To: Eyal Lebedinsky; +Cc: list linux-raid

On Sat, 01 Mar 2014 00:23:59 +1100 Eyal Lebedinsky <eyal@eyal.emu.id.au>
wrote:

> Thanks Piergiorgio,
> 
> I did search before, unsuccessfully. I repeated now with different keywords and found
> this entry in the kernel wiki
> 	https://raid.wiki.kernel.org/index.php/RAID_superblock_formats
> 
> It documents the initial fixed fields of the superblock. I still do not know how
> the intent bitmap is laid out. I can see that it starts 4KB into the superblock:
> 	Internal Bitmap : 8 sectors from superblock
> but did not yet find its size (which I expect depends on the array size).
> 
> I guess I can calculate it from the sys/block/md127/md items
> 	component_size * 1024 / 'bitmap/chunksize' / 8
> which comes up to 7452 bytes which is still a small fraction of the 128MB header size.
> 
> What else is there? bad blocks list? Reading an old blog (from Neil) suggests that it
> is not larger than 32KB (but is only 4KB now), so still "small" in this context.
> Don't know where it resides though.
> 
> I need to understand the full layout of the header and so far I do not see anything that
> says what the area past the initial 16KB is used (is always zero when I inspect it).
> I started a heavy file copy on the raid and watched the header and never saw any change.
> I expected to see at least some activity in the bitmap but none encountered.
> 
> My simple question is: is it the case that the reserved header space after 16KB is
> actually still unused in header version 1.2? My bad sector is practically at the end
> of this 128MB area. A trivial question to answer for someone with expert knowledge
> of md.
> 
> Anyone?

The location of the superblock is reported by "mdadm --examine".
Its size is 4K (though most of that is unused).
The location of the bitmap (if present) is reported by "mdadm --examine"
as an offset from the superblock.  Its size can be deduced from the output
of "mdadm --examine-bitmap": take the number of bits, divide by 8, add 256
and round up to a multiple of 512.  This number is in bytes.
The location of the bad-block-log (if present) is reported by "mdadm
--examine" as an offset from the superblock.  Its size is 4K.
Any other space outside of the data region is currently unused by md.
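
(Worked example of the bitmap-size rule, using the ~59600-bit figure implied
earlier in the thread: 59616 bits / 8 = 7452 bytes, plus 256 = 7708, rounded up
to a multiple of 512 gives 8192 bytes, i.e. 16 sectors of bitmap on disk.)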

NeilBrown



* Re: how to handle bad sectors in md control areas?
  2014-03-02 21:38 ` NeilBrown
@ 2014-03-02 22:21   ` Eyal Lebedinsky
  0 siblings, 0 replies; 9+ messages in thread
From: Eyal Lebedinsky @ 2014-03-02 22:21 UTC (permalink / raw)
  To: list linux-raid

On 03/03/14 08:38, NeilBrown wrote:
> On Wed, 26 Feb 2014 19:16:30 +1100 Eyal Lebedinsky <eyal@eyal.emu.id.au>
> wrote:
>
>> In another thread I investigated an issue with a pending sector, which now seems to be
>> a bad sector inside the md header (the first 256k sectors).
>>
>> The question now remaining: what is the correct approach to fixing this problem?
>
> You could "fix" it by simply redefining it not to be a problem.
> If you never get an error then is there a problem?

I did not know whether this block is never accessed or just rarely so. I prefer to handle the
issue when I am in control rather than have md encounter it, maybe at a bad time: when
resyncing another disk, or growing or reshaping the array, or when the array fills up
and the write-intent bitmap gets fully used.

I moved from raid5 to raid6 because the risk of another error while resyncing a
replaced raid5 disk (almost 10 hours of heavy activity) is becoming too high.

I also hoped that raid6 would be able to correctly repair an error (even when there
is no low-level io error, just a mismatch) by assuming it is a "single" error
(using ECC terminology). I understand that this is not yet done?

>> The more general issue is what to do when any md control area develops an error. does
>> all data have redundant copies?
>
> We don't currently have any redundancy with a device.  Of course most
> metadata is replicated across all devices so there is redundancy in that
> sense.
> I have occasionally thought of creating a v1.3 metadata which duplicates the
> superblock at both end of the device.  Never quite seemed worth the effort
> though.
> The write-intent-bitmap would be a lot more expensive to duplicate but as it
> is identical on all devices, the  gain would be small (though there are cases
> where it would be useful).
>
> The bad-block log probably should be duplicated.  That wouldn't be too
> expensive and  might have  some real benefits....
>
>>
>> The simple way that I see is to fail the member, remove it, clear it (at least
>> --zero-superblock and write to the bad sector) and then add it. However this
>> will incur a full resync (about 10 hours).
>>
>> Is there a faster, yet safe way? I was thinking that a clean umount and raid stop
>> should allow a create with --assume-clean (which will write to the bad sector and
>> "fix" it), but the doco discourages this.
>
> Why do you think this will write the bad sector?

I assumed the full header (128MB) is initialised when it is created. Maybe not...

> When you --create and array it doesn't write too all the space on the array.
> It only writes what it needs to.  So the superblock, the write-intent-bitmap
> and maybe the bad-block-log.  But nothing else.

This (the last three words) is the information I was after.

> And most of that gets written during normal array activity.
>
> So if a block remains unwritten after stop/start/check, you can be fairy sure
> it isn't used at all, so you can ignore it.  Or write zeros to it.

This was my understanding too. The "ignore" option was not optimal: apart from the emotional
stress of knowing there is an unreadable sector, there is the constant complaining
of smartd in the log. So I zeroed it.

>> Also, it is not impossible to think that the specific bad sector (toward the end
>> of the header) is not actually used today, meaning I can live with it as is, or
>> write anything to the bad sector as it does not get used. Too involved though.
>>
>> A bad sector in the data area should be fixed with a standard raid 'check' action.
>>
>> TIA
>>
>
> NeilBrown

cheers
	Eyal

--
Eyal Lebedinsky (eyal@eyal.emu.id.au)

