linux-btrfs.vger.kernel.org archive mirror
* Unable to fixup (regular) error in RAID1 fs
@ 2014-10-28 15:54 Juan Orti
  2014-10-28 20:17 ` Juan Orti
  2014-10-29  3:02 ` Duncan
  0 siblings, 2 replies; 5+ messages in thread
From: Juan Orti @ 2014-10-28 15:54 UTC (permalink / raw)
  To: linux-btrfs

I'm seeing these errors in a RAID1 fs:

[ 3565.073223] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 30, gen 0
[ 3565.073472] BTRFS: unable to fixup (regular) error at logical 460632743936 on dev /dev/sdb2
[ 3566.605419] BTRFS: checksum error at logical 461883383808 on dev /dev/sdb2, sector 600109712, root 2500, inode 1436631, offset 6134886400, length 4096, links 1 (path: juan/.local/share/gnome-boxes/images/boxes-unknown)
[ 3566.605429] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 31, gen 0
[ 3566.629207] BTRFS: unable to fixup (regular) error at logical 461883383808 on dev /dev/sdb2
[ 3569.459460] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 32, gen 0
[ 3569.478667] BTRFS: unable to fixup (regular) error at logical 462282203136 on dev /dev/sdb2
[ 3569.479163] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 33, gen 0
[ 3569.479531] BTRFS: unable to fixup (regular) error at logical 462282207232 on dev /dev/sdb2
[ 3569.479970] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 34, gen 0
[ 3569.480102] BTRFS: unable to fixup (regular) error at logical 462282211328 on dev /dev/sdb2
[ 3569.494522] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 35, gen 0
[ 3569.494709] BTRFS: unable to fixup (regular) error at logical 462282215424 on dev /dev/sdb2
[ 3569.495148] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 36, gen 0
[ 3713.075962] BTRFS: checksum error at logical 483011874816 on dev /dev/sdb2, sector 628793384, root 2500, inode 1436631, offset 3997003776, length 4096, links 1 (path: juan/.local/share/gnome-boxes/images/boxes-unknown)
[ 3713.075987] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 37, gen 0
[ 3713.086292] BTRFS: unable to fixup (regular) error at logical 483011874816 on dev /dev/sdb2
[ 3713.092577] BTRFS: checksum error at logical 483011948544 on dev /dev/sdb2, sector 628793528, root 2500, inode 1436631, offset 4059963392, length 4096, links 1 (path: juan/.local/share/gnome-boxes/images/boxes-unknown)
[ 3713.092584] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 38, gen 0
[ 3713.093035] BTRFS: unable to fixup (regular) error at logical 483011948544 on dev /dev/sdb2

Why can't it fix the errors? Is it a bad device? smartctl says the disk is OK.
I'm currently running a full scrub to see if it finds more errors. What
should I do?
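
A log like the one above can be tallied with a short script. The following is a minimal Python sketch, assuming only the three message shapes seen in this report (checksum error, device error counters, unable to fixup). The names `summarize` and `SAMPLE_LOG` are made up for illustration, and the sample holds six of the reported messages, one per line.

```python
import re
from collections import Counter

# Six messages copied from the report, one kernel message per line.
SAMPLE_LOG = """\
[ 3566.605419] BTRFS: checksum error at logical 461883383808 on dev /dev/sdb2, sector 600109712, root 2500, inode 1436631, offset 6134886400, length 4096, links 1 (path: juan/.local/share/gnome-boxes/images/boxes-unknown)
[ 3566.605429] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 31, gen 0
[ 3566.629207] BTRFS: unable to fixup (regular) error at logical 461883383808 on dev /dev/sdb2
[ 3713.075962] BTRFS: checksum error at logical 483011874816 on dev /dev/sdb2, sector 628793384, root 2500, inode 1436631, offset 3997003776, length 4096, links 1 (path: juan/.local/share/gnome-boxes/images/boxes-unknown)
[ 3713.075987] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 37, gen 0
[ 3713.086292] BTRFS: unable to fixup (regular) error at logical 483011874816 on dev /dev/sdb2
"""

# Patterns are assumptions derived from the lines in this thread only.
CSUM_RE = re.compile(r"checksum error at logical (\d+) on dev ([^,\s]+), .*\(path: (.+)\)")
FIXUP_RE = re.compile(r"unable to fixup \(regular\) error at logical (\d+) on dev (\S+)")
STATS_RE = re.compile(r"bdev (\S+) errs: wr \d+, rd \d+, flush \d+, corrupt (\d+), gen \d+")


def summarize(log_text):
    """Return affected files, unfixable logical addresses and corrupt counters."""
    files = Counter()   # path -> number of checksum errors reported
    unfixable = []      # logical addresses the kernel could not repair
    corrupt = {}        # device -> highest 'corrupt' counter seen
    for line in log_text.splitlines():
        m = CSUM_RE.search(line)
        if m:
            files[m.group(3)] += 1
            continue
        m = FIXUP_RE.search(line)
        if m:
            unfixable.append(int(m.group(1)))
            continue
        m = STATS_RE.search(line)
        if m:
            dev, count = m.group(1), int(m.group(2))
            corrupt[dev] = max(corrupt.get(dev, 0), count)
    return {"files": files, "unfixable": unfixable, "corrupt": corrupt}


summary = summarize(SAMPLE_LOG)
for path, hits in summary["files"].items():
    print(f"{hits} checksum error(s) in {path}")
```

On the full log, this kind of summary makes it easier to see that every reported path is the same gnome-boxes image and that all counters belong to /dev/sdb2.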

Versions used:
kernel-3.16.6-200.fc20.x86_64
btrfs-progs-3.16.2-1.fc20.x86_64

Full dmesg: http://ur1.ca/ikxxl

# btrfs fi show
Label: 'fedora_xenon'  uuid: f1c013ff-9bd4-48fe-828e-d0b7b9d91af1
         Total devices 1 FS bytes used 13.85GiB
         devid    1 size 103.22GiB used 17.04GiB path /dev/sda4

Label: 'btrfs_raid1'  uuid: 7721c28b-8ae6-432d-bfe1-0f98fb4043e0
         Total devices 3 FS bytes used 1.50TiB
         devid    1 size 1.81TiB used 1.08TiB path /dev/sdb2
         devid    2 size 1.81TiB used 1.08TiB path /dev/sdc2
         devid    3 size 1.81TiB used 1.08TiB path /dev/sdd2

# btrfs fi df /mnt/btrfs_raid1/
Data, RAID1: total=1.60TiB, used=1.49TiB
System, RAID1: total=32.00MiB, used=256.00KiB
Metadata, RAID1: total=10.00GiB, used=5.75GiB
GlobalReserve, single: total=512.00MiB, used=0.00

# btrfs fi df /mnt/btrfs_ssd/
Data, single: total=15.01GiB, used=13.13GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=2.00GiB, used=739.86MiB
GlobalReserve, single: total=256.00MiB, used=0.00

-- 
Juan Orti
https://miceliux.com



* Re: Unable to fixup (regular) error in RAID1 fs
  2014-10-28 15:54 Unable to fixup (regular) error in RAID1 fs Juan Orti
@ 2014-10-28 20:17 ` Juan Orti
  2014-10-29  3:02 ` Duncan
  1 sibling, 0 replies; 5+ messages in thread
From: Juan Orti @ 2014-10-28 20:17 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 28-10-2014 at 16:54 +0100, Juan Orti wrote:
> I'm seeing these errors in a RAID1 fs:
> (...)
> Why can't it fix the errors? a bad device? smartctl says the disk is ok. 
> I'm currently running a full scrub to see if it finds more errors. What 
> should I do?
> 

Well, the scrub has finished without errors. Should I worry or not?


-- 
Juan Orti
https://miceliux.com





* Re: Unable to fixup (regular) error in RAID1 fs
  2014-10-28 15:54 Unable to fixup (regular) error in RAID1 fs Juan Orti
  2014-10-28 20:17 ` Juan Orti
@ 2014-10-29  3:02 ` Duncan
  2014-10-29  8:08   ` Juan Orti
  1 sibling, 1 reply; 5+ messages in thread
From: Duncan @ 2014-10-29  3:02 UTC (permalink / raw)
  To: linux-btrfs

Juan Orti posted on Tue, 28 Oct 2014 16:54:19 +0100 as excerpted:

> [ 3713.086292] BTRFS: unable to fixup (regular) error at logical 483011874816 on dev /dev/sdb2
> [ 3713.092577] BTRFS: checksum error at logical 483011948544 on dev /dev/sdb2, sector 628793528, root 2500, inode 1436631, offset 4059963392, length 4096, links 1 (path: juan/.local/share/gnome-boxes/images/boxes-unknown)
> [ 3713.092584] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 38, gen 0
> [ 3713.093035] BTRFS: unable to fixup (regular) error at logical 483011948544 on dev /dev/sdb2
> 
> Why can't it fix the errors? a bad device? smartctl says the disk is ok. 
> I'm currently running a full scrub to see if it finds more errors. What 
> should I do?

Btrfs raid1, and I see you have it for both data and metadata.

During normal operation, when btrfs comes across a block that doesn't
match its checksum, it will look to see if there's another copy (which
there is with raid1, which has exactly two copies) of that block and will
try to use it instead if so.  If the second copy matches the checksum,
all is fine and btrfs will in fact attempt to rewrite the bad copy using
the good copy, as well as returning the good copy to whatever was
reading it.

Those corruption errors seem to indicate that it can't find a good
copy to update the bad copy with -- both copies ended up bad.  Either
that or it found the good copy and returned it to whatever was reading,
but couldn't rewrite the bad copy, for some reason.
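
The read path described above can be sketched as a toy model. This is only an illustration in Python under stated assumptions, not the kernel's code: `read_with_repair` and the status strings are hypothetical, the "mirrors" are plain byte strings in a list, and `zlib.crc32` stands in for the real checksum (btrfs uses crc32c by default).

```python
import zlib


def checksum(block: bytes) -> int:
    # Stand-in checksum for the sketch; btrfs itself defaults to crc32c.
    return zlib.crc32(block)


def read_with_repair(copies, expected_csum):
    """Toy raid1 read: return (data, status) and repair bad mirrors in place.

    copies is a mutable list holding the two mirrored copies of one block.
    """
    # Look for any copy whose checksum matches the stored one.
    good = next((c for c in copies if checksum(c) == expected_csum), None)
    if good is None:
        # Both copies are bad: nothing to repair from, the reader gets an error.
        return None, "unable to fixup"

    # Rewrite every bad copy from the good one.
    repaired = False
    for i, c in enumerate(copies):
        if checksum(c) != expected_csum:
            copies[i] = good
            repaired = True

    status = "repaired" if repaired else "clean"
    return good, status


block = b"A" * 4096
mirrors = [b"B" * 4096, block]          # first mirror is corrupted
data, status = read_with_repair(mirrors, checksum(block))
print(status, mirrors[0] == block)
```

In this model the reader only sees a failure in the first case (no good copy at all), which matches the reasoning that an unreadable block should surface as an error in whatever was doing the read.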

I'm not sure which of those interpretations is correct, but given
that you didn't see anything else bad happening, no apps returning
errors due to read error, etc, I'd guess the second.  Because
otherwise whatever was doing the read should have returned an
error.

Doing a scrub, as you already did, is the first thing I'd try here,
since normal operation won't catch all the errors.

BUT, you report that the scrub found no errors, which is weird.
You have the log saying there are corruption errors, but scrub
saying there are none.

The easiest explanation for something like that is that the errors
were temporary.  If it happens again or regularly, consider running
a memory tester such as memtest86+, as it could be bad memory.  Do you
have ECC RAM?

Another question.  Do you have skinny metadata on that btrfs?  If you
do, btrfs should mention "skinny extents" when mounting the filesystem.

The reason I'm asking this is that if I'm reading the patch descriptions
correctly, a recently posted patch deals with a specific skinny-metadata
bug where wrong results would occasionally be returned, resulting in
errors.  Not being a dev I don't have the technical ability to know for
sure whether this could be connected to that or not, but it sounds like
the sort of thing I might expect from a bug that intermittently returned
bad data -- odd apparent corruption errors in normal use that scrub
can't see, even though it's designed to catch and fix if possible exactly
that sort of corruption error.

Anyway, if scrub says no corruption, for a potential corruption error
I'd be inclined to trust scrub, so I think the filesystem is fine.
But if so, I'm worried about what might be triggering these
intermittent errors.  Certainly watch for more of them, and if you're
running skinny-metadata, consider finding and applying that patch.
If not or in general, also be on the lookout for more possible hints
of failing memory and/or run a good memory checker for a few hours
and see if it reports all is well.

But as they say about some kinds of potential cancer reports at times,
sometimes watchful waiting is the best you can do, hoping no further
symptoms show up, but being alert in case they do, to try something
more drastic, that isn't warranted /unless/ they do.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Unable to fixup (regular) error in RAID1 fs
  2014-10-29  3:02 ` Duncan
@ 2014-10-29  8:08   ` Juan Orti
  2014-10-29 16:19     ` Chris Murphy
  0 siblings, 1 reply; 5+ messages in thread
From: Juan Orti @ 2014-10-29  8:08 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On 2014-10-29 04:02, Duncan wrote:
> Juan Orti posted on Tue, 28 Oct 2014 16:54:19 +0100 as excerpted:
> 
>> [ 3713.086292] BTRFS: unable to fixup (regular) error at logical 483011874816 on dev /dev/sdb2
>> [ 3713.092577] BTRFS: checksum error at logical 483011948544 on dev /dev/sdb2, sector 628793528, root 2500, inode 1436631, offset 4059963392, length 4096, links 1 (path: juan/.local/share/gnome-boxes/images/boxes-unknown)
>> [ 3713.092584] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 38, gen 0
>> [ 3713.093035] BTRFS: unable to fixup (regular) error at logical 483011948544 on dev /dev/sdb2
>> 
>> Why can't it fix the errors? a bad device? smartctl says the disk is ok.
>> I'm currently running a full scrub to see if it finds more errors. What
>> should I do?
> 
> Btrfs raid1, and I see you have it for both data and metadata.
> 
> During normal operation, when btrfs comes across a block that doesn't
> match its checksum, it will look to see if there's another copy (which
> there is with raid1, which has exactly two copies) of that block and will
> try to use it instead if so.  If the second copy matches the checksum,
> all is fine and btrfs will in fact attempt to rewrite the bad copy using
> the good copy, as well as returning the good copy to whatever was
> reading it.
> 
> Those corruption errors seem to indicate that it can't find a good
> copy to update the bad copy with -- both copies ended up bad.  Either
> that or it found the good copy and returned it to whatever was reading,
> but couldn't rewrite the bad copy, for some reason.
> 
> I'm not sure which of those interpretations is correct, but given
> that you didn't see anything else bad happening, no apps returning
> errors due to read error, etc, I'd guess the second.  Because
> otherwise whatever was doing the read should have returned an
> error.

When this error happened, I was editing some text files with vi, and it
was painfully slow: it took 30 seconds to open a 20-line file, so
something weird was going on. Still, there was no visible user-space
error.


> 
> Doing a scrub, as you already did, is the first thing I'd try here,
> since normal operation won't catch all the errors.
> 
> BUT, you report that the scrub found no errors, which is weird.
> You have the log saying there's corruption errors, but scrub
> saying there's not.
> 
> The easiest explanation for something like that, is that the errors
> were temporary.  If it happens again or regularly, consider running
> memcheck or the like, as it could be bad memory.  Do you have ECC RAM?

I don't have ECC RAM; it's a regular desktop PC. RAM checks in the
past have shown no errors, but I'll check it again.

> 
> Another question.  Do you have skinny metadata on that btrfs?  If you
> do, btrfs should mention "skinny extents" when mounting the filesystem.

No skinny metadata. I made the fs with the standard options, just with 
raid1 for data and metadata.

> 
> The reason I'm asking this is that if I'm reading the patch descriptions
> correctly, a recently posted patch deals with a specific skinny-metadata
> bug where wrong results would occasionally be returned, resulting in
> errors.  Not being a dev I don't have the technical ability to know for
> sure whether this could be connected to that or not, but it sounds like
> the sort of thing I might expect from a bug that intermittently returned
> bad data -- odd apparent corruption errors in normal use that scrub
> can't see, even tho it's designed to catch and fix if possible exactly
> that sort of corruption error.
> 
> Anyway, if scrub says no corruption, for a potential corruption error
> I'd be inclined to trust scrub, so I think the filesystem is fine.
> But if so, I'm worried about what might be triggering these
> intermittent errors.  Certainly watch for more of them, and if you're
> running skinny-metadata, consider finding and applying that patch.
> If not or in general, also be on the lookout for more possible hints
> of failing memory and/or run a good memory checker for a few hours
> and see if it reports all is well.
> 
> But as they say about some kinds of potential cancer reports at times,
> sometimes watchful waiting is the best you can do, hoping no further
> symptoms show up, but being alert in case they do, to try something
> more drastic, that isn't warranted /unless/ they do.

That's what I'll do, I'll wait and see.

Thank you for your explanation.

-- 
Juan Orti
https://miceliux.com



* Re: Unable to fixup (regular) error in RAID1 fs
  2014-10-29  8:08   ` Juan Orti
@ 2014-10-29 16:19     ` Chris Murphy
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Murphy @ 2014-10-29 16:19 UTC (permalink / raw)
  To: Juan Orti; +Cc: Duncan, linux-btrfs


On Oct 29, 2014, at 2:08 AM, Juan Orti <juan.orti@miceliux.com> wrote:

> On 2014-10-29 04:02, Duncan wrote:
>> Juan Orti posted on Tue, 28 Oct 2014 16:54:19 +0100 as excerpted:
>>> [ 3713.086292] BTRFS: unable to fixup (regular) error at logical 483011874816 on dev /dev/sdb2
>>> [ 3713.092577] BTRFS: checksum error at logical 483011948544 on dev /dev/sdb2, sector 628793528, root 2500, inode 1436631, offset 4059963392, length 4096, links 1 (path: juan/.local/share/gnome-boxes/images/boxes-unknown)
>>> [ 3713.092584] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 38, gen 0
>>> [ 3713.093035] BTRFS: unable to fixup (regular) error at logical 483011948544 on dev /dev/sdb2
>>> Why can't it fix the errors? a bad device? smartctl says the disk is ok.
>>> I'm currently running a full scrub to see if it finds more errors. What
>>> should I do?
>> Btrfs raid1, and I see you have it for both data and metadata.
>> During normal operation, when btrfs comes across a block that doesn't
>> match its checksum, it will look to see if there's another copy (which
>> there is with raid1, which has exactly two copies) of that block and will
>> try to use it instead if so.  If the second copy matches the checksum,
>> all is fine and btrfs will in fact attempt to rewrite the bad copy using
>> the good copy, as well as returning the good copy to whatever was
>> reading it.
>> Those corruption errors seem to indicate that it can't find a good
>> copy to update the bad copy with -- both copies ended up bad.  Either
>> that or it found the good copy and returned it to whatever was reading,
>> but couldn't rewrite the bad copy, for some reason.
>> I'm not sure which of those interpretations is correct, but given
>> that you didn't see anything else bad happening, no apps returning
>> errors due to read error, etc, I'd guess the second.  Because
>> otherwise whatever was doing the read should have returned an
>> error.
> 
> When this error happened, I was editing some text files with vi, and it
> was painfully slow: it took 30 seconds to open a 20-line file, so
> something weird was going on. Still, there was no visible user-space
> error.

Anything in dmesg prior to the previously reported errors?

Either with syslog messages or journalctl, filter by btrfs and see what you get for the past couple of days. Then find out which ATA port the two drives are on and filter by those; the ports usually appear in the form ataX.00. You could also search for "exception Emask" and see if anything comes up. That would turn up any controller or drive hardware error messages.
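
The three searches suggested above can be combined into one pass over the log. The following is a minimal Python sketch, assuming the kernel log has already been dumped to text (for example from `dmesg` or `journalctl -k`). The function name `interesting` is made up, the `ata\d+\.\d\d` pattern is only an assumption based on the "ataX.00" form mentioned here, and the sample lines are invented for illustration rather than taken from the reporter's machine.

```python
import re

# One pattern per search: btrfs messages, ATA port messages, ATA exceptions.
PATTERNS = [
    re.compile(r"btrfs", re.IGNORECASE),
    re.compile(r"\bata\d+\.\d\d\b"),
    re.compile(r"exception Emask"),
]


def interesting(lines):
    """Return the lines matching any of the patterns, in original order."""
    return [ln for ln in lines if any(p.search(ln) for p in PATTERNS)]


# Invented sample input, only to show the call shape.
sample = [
    "BTRFS: checksum error (made-up line)",
    "usb 1-1: new device (made-up line)",
    "ata3.00: exception Emask (made-up line)",
]
for line in interesting(sample):
    print(line)
```

To narrow the output to the drives in question, the generic ATA pattern would be replaced with the specific port numbers once they are known.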


Chris Murphy


