* Buffer I/O error... async page read
@ 2018-02-05 19:10 Liwei
  2018-02-06 13:55 ` Liwei
  0 siblings, 1 reply; 5+ messages in thread
From: Liwei @ 2018-02-05 19:10 UTC (permalink / raw)
  To: linux-raid

Hi list,

tl;dr: The array seems to be remembering bad blocks from the recovered
drive, even though the drive the image is on is fine. Is there a way to
make the array forget those blocks? Is it safe to do so?


    We had a raid6 array that went down because two drives failed and
a third encountered bad sectors.
    We managed to recover the 1 drive with bad sectors (we engaged a
recovery lab), and the remaining drives in the array report neither
pending nor re-allocated sectors (from smartctl).

    After re-integrating the (image of the) recovered drive with bad
sectors and starting the array in degraded mode, we realised we are
still unable to read from some sectors in the md device. I believe
they correspond to where the bad sectors were previously.

    When trying to read from said sectors, this comes up in dmesg:

[Feb 6 02:05] Buffer I/O error on dev dm-26, logical block 5166101891,
async page read
[  +0.000458] Buffer I/O error on dev dm-26, logical block 5166101891,
async page read
[ +13.297834] Buffer I/O error on dev dm-26, logical block 5166101891,
async page read
[  +0.000438] Buffer I/O error on dev dm-26, logical block 5166101891,
async page read
[Feb 6 02:06] Buffer I/O error on dev dm-26, logical block 5166101891,
async page read
[  +0.000390] Buffer I/O error on dev dm-26, logical block 5166101891,
async page read
[ +13.284550] Buffer I/O error on dev dm-26, logical block 5166102915,
async page read
[  +0.000448] Buffer I/O error on dev dm-26, logical block 5166102915,
async page read
[Feb 6 02:17] Buffer I/O error on dev dm-26, logical block 5166101891,
async page read
[  +0.000341] Buffer I/O error on dev dm-26, logical block 5166101891,
async page read
[Feb 6 02:24] Buffer I/O error on dev dm-26, logical block 5166118804,
async page read
[  +0.002417] Buffer I/O error on dev dm-26, logical block 5166118804,
async page read
[  +2.972446] Buffer I/O error on dev dm-26, logical block 5166118804,
async page read
[  +0.002172] Buffer I/O error on dev dm-26, logical block 5166118804,
async page read
[Feb 6 02:25] Buffer I/O error on dev dm-26, logical block 5166118804,
async page read
[  +0.002130] Buffer I/O error on dev dm-26, logical block 5166118804,
async page read

    However, I've checked smartctl and run a pass of (read-only)
badblocks over the drives: all sectors are readable, and there are no
pending or reallocated sectors.
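
    For reference, the checks were roughly of this form (/dev/sdX is a
placeholder; badblocks defaults to a non-destructive, read-only test):

        # SMART attributes; watch Reallocated_Sector_Ct and Current_Pending_Sector
        smartctl -A /dev/sdX

        # read-only surface scan with progress output
        badblocks -sv /dev/sdX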

    So what is generating these buffer I/O errors?

    Also, upon investigating, I'm astonished to find a non-empty list when I do:
        cat /sys/block/md126/md/dev-*/bad_blocks

    Almost every drive in the array has a few entries. That's not
normal, is it? My theory is that since these are consumer-grade SATA
drives, some odd read/write timeout must have occurred at some point,
causing md to think that those sectors are bad. Is there a way to make
md forget about these blocks? Is it safe to do so?
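
    For anyone who wants to check their own array, the per-device lists
can be dumped with something like the following (md126 matches my
array; /dev/sdX1 stands in for a member device), and recent mdadm can
also print the on-disk log directly:

        # kernel's view of each member's bad block list
        for f in /sys/block/md126/md/dev-*/bad_blocks; do
            echo "== $f"; cat "$f"
        done

        # bad block log as recorded in a member's superblock (mdadm 3.3+)
        mdadm --examine-badblocks /dev/sdX1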

Warm regards,
Liwei

* Re: Buffer I/O error... async page read
  2018-02-05 19:10 Buffer I/O error... async page read Liwei
@ 2018-02-06 13:55 ` Liwei
  2018-02-07  6:27   ` Weedy
  0 siblings, 1 reply; 5+ messages in thread
From: Liwei @ 2018-02-06 13:55 UTC (permalink / raw)
  To: linux-raid

On 6 February 2018 at 03:10, Liwei <xieliwei@gmail.com> wrote:
> Hi list,
>
> tl;dr: The array seems to be remembering bad blocks from the recovered
> drive, even though the drive the image is on is fine. Is there a way to
> make the array forget those blocks? Is it safe to do so?
>
>
>     We had a raid6 array that went down because two drives failed and
> a third encountered bad sectors.
>     We managed to recover the 1 drive with bad sectors (we engaged a
> recovery lab), and the remaining drives in the array report neither
> pending nor re-allocated sectors (from smartctl).
>
>     After re-integrating the (image of the) recovered drive with bad
> sectors and starting the array in degraded mode, we realised we are
> still unable to read from some sectors in the md device. I believe
> they correspond to where the bad sectors were previously.
>
>     When trying to read from said sectors, this comes up in dmesg:
>
> [Feb 6 02:05] Buffer I/O error on dev dm-26, logical block 5166101891,
> async page read
> [  +0.000458] Buffer I/O error on dev dm-26, logical block 5166101891,
> async page read
> [ +13.297834] Buffer I/O error on dev dm-26, logical block 5166101891,
> async page read
> [  +0.000438] Buffer I/O error on dev dm-26, logical block 5166101891,
> async page read
> [Feb 6 02:06] Buffer I/O error on dev dm-26, logical block 5166101891,
> async page read
> [  +0.000390] Buffer I/O error on dev dm-26, logical block 5166101891,
> async page read
> [ +13.284550] Buffer I/O error on dev dm-26, logical block 5166102915,
> async page read
> [  +0.000448] Buffer I/O error on dev dm-26, logical block 5166102915,
> async page read
> [Feb 6 02:17] Buffer I/O error on dev dm-26, logical block 5166101891,
> async page read
> [  +0.000341] Buffer I/O error on dev dm-26, logical block 5166101891,
> async page read
> [Feb 6 02:24] Buffer I/O error on dev dm-26, logical block 5166118804,
> async page read
> [  +0.002417] Buffer I/O error on dev dm-26, logical block 5166118804,
> async page read
> [  +2.972446] Buffer I/O error on dev dm-26, logical block 5166118804,
> async page read
> [  +0.002172] Buffer I/O error on dev dm-26, logical block 5166118804,
> async page read
> [Feb 6 02:25] Buffer I/O error on dev dm-26, logical block 5166118804,
> async page read
> [  +0.002130] Buffer I/O error on dev dm-26, logical block 5166118804,
> async page read
>
>     However, I've checked smartctl and run a pass of (read-only)
> badblocks over the drives: all sectors are readable, and there are no
> pending or reallocated sectors.
>
>     So what is generating these buffer I/O errors?
>
>     Also, upon investigating, I'm astonished to find a non-empty list when I do:
>         cat /sys/block/md126/md/dev-*/bad_blocks
>
>     Almost every drive in the array has a few entries. That's not
> normal, is it? My theory is that since these are consumer-grade SATA
> drives, some odd read/write timeout must have occurred at some point,
> causing md to think that those sectors are bad. Is there a way to make
> md forget about these blocks? Is it safe to do so?
>
> Warm regards,
> Liwei

Just answering my own question. It turns out the I/O errors are caused
by the MD bad blocks log. There didn't seem to be an easy way to clear
the log short of writing over the supposedly bad blocks.

But it turns out that since the log lives in the superblock, I was able
to dd it out, edit the log entries to all FF, clear the bad blocks
feature bit in the header, update the checksum, dd the edited
superblock back in, and voila: no more read errors, and I have access
to my data!
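
In skeleton form it was roughly this; it assumes 1.2 metadata (the
superblock sits 4 KiB into the member device), /dev/sdX1 is a
placeholder, and the actual byte-level edits (FF-ing the log entries,
clearing the feature bit, recomputing the checksum) are left out, so
treat it as a sketch rather than a recipe:

    # check where the superblock and bad block log live first
    mdadm --examine /dev/sdX1

    # copy out the metadata area; 64 KiB should cover the superblock and
    # the default bad block log location, but verify against --examine
    dd if=/dev/sdX1 of=sb.bin bs=4096 skip=1 count=16

    # ... hex-edit sb.bin as described above ...

    # write the edited copy back in place
    dd if=sb.bin of=/dev/sdX1 bs=4096 seek=1 count=16 conv=notrunc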

Disclaimer: I had an offline backup of the drive images and a write
overlay in place; anyone trying something like this should make sure
there is a way back first.

* Re: Buffer I/O error... async page read
  2018-02-06 13:55 ` Liwei
@ 2018-02-07  6:27   ` Weedy
  2018-02-07  7:02     ` Liwei
  0 siblings, 1 reply; 5+ messages in thread
From: Weedy @ 2018-02-07  6:27 UTC (permalink / raw)
  To: Liwei, linux-raid

On 2018-02-06 08:55 AM, Liwei wrote:
> Just answering my own question. It turns out the I/O errors are caused
> by the MD bad blocks log. There didn't seem to be an easy way to clear
> the log short of writing over the supposedly bad blocks.
> 

All you needed to do was "--assemble --update=no-bbl".
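
Spelled out, something like this (array and member names are
placeholders; if mdadm refuses because entries are already recorded,
newer versions also document a force-no-bbl variant):

    mdadm --stop /dev/md126
    mdadm --assemble /dev/md126 --update=no-bbl /dev/sd[abcd]1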

Neil just posted to the list a couple of days ago about this for someone else.

* Re: Buffer I/O error... async page read
  2018-02-07  6:27   ` Weedy
@ 2018-02-07  7:02     ` Liwei
  2018-02-07 14:54       ` Wols Lists
  0 siblings, 1 reply; 5+ messages in thread
From: Liwei @ 2018-02-07  7:02 UTC (permalink / raw)
  To: Weedy; +Cc: linux-raid

On 7 February 2018 at 14:27, Weedy <weedy2887@gmail.com> wrote:
> On 2018-02-06 08:55 AM, Liwei wrote:
>> Just answering my own question. It turns out the I/O errors are caused
>> by the MD bad blocks log. There didn't seem to be an easy way to clear
>> the log short of writing over the supposedly bad blocks.
>>
>
> All you needed to do was "--assemble --update=no-bbl".
>
> Neil just posted to the list a couple days ago about this for someone else.

*Facepalms* Clearly my google-fu isn't good enough. Maybe someone with
access to the wiki can add a page on the bad blocks list?

I was referring to the metadata formats
"https://raid.wiki.kernel.org/index.php/RAID_superblock_formats" page
while trying to figure things out and noticed that it too is outdated.
I'm willing to help update the wiki in whatever ways I can if someone
can approve my account.

* Re: Buffer I/O error... async page read
  2018-02-07  7:02     ` Liwei
@ 2018-02-07 14:54       ` Wols Lists
  0 siblings, 0 replies; 5+ messages in thread
From: Wols Lists @ 2018-02-07 14:54 UTC (permalink / raw)
  To: Liwei, Weedy; +Cc: linux-raid

On 07/02/18 07:02, Liwei wrote:
> I was referring to the metadata formats
> "https://raid.wiki.kernel.org/index.php/RAID_superblock_formats" page
> while trying to figure things out and noticed that it too is outdated.
> I'm willing to help update the wiki in whatever ways I can if someone
> can approve my account.

I think your account is probably all set up and working. There's just a
magic formula to getting it activated :-) Ask for a password reset :-)

And please don't edit that page if you think it's badly out of date. I'm
intentionally leaving old pages there for historical reasons - if you
scanned down the wiki you'll have found that page in the "The Valley of
the Kings" section :-) (If you do create a new page, just put a note at
the start of the old page pointing to the new one.) Thing is, I've lost
a lot of references to lilo, grub-1, PATA drives, etc etc, and if
someone has an old system they'll want to find the old stuff.

Find a place in the current setup where you think it fits nicely,
rewrite it from scratch all up to date, and slot it in. Also note the
"editing guidelines" - they're pretty slack but I am trying to keep a
consistent editorial feel to the site - it just makes it a much
pleasanter place to read.

Cheers,
Wol
