* How to find corrupted files?
@ 2015-05-01 20:06 Antoine Sirinelli
2015-05-02 4:28 ` Duncan
From: Antoine Sirinelli @ 2015-05-01 20:06 UTC (permalink / raw)
To: linux-btrfs
Hi,
I had a btrfs system running for a couple of years with an old kernel
(3.14.xx). Recently I tried to back it up to a remote host using the
send/receive functionality, which resulted in a couple of kernel oopses.
I decided to upgrade the kernel to 3.16 (Debian Jessie) and was then able
to use send/receive without too many problems.
Since the kernel upgrade I have noticed a lot of lines like the following
in the kernel log:
[145059.990123] BTRFS info (device sda4): csum failed ino 101147 off 1114112 csum 1810207416 expected csum 3082675757
[145060.500612] BTRFS info (device sda4): csum failed ino 101148 off 110592 csum 1418370968 expected csum 496354029
I understand these indicate corrupted files. By running btrfs scrub, I
have been able to find some of them, but I still have 20 inodes with
failed csums. As I have quite a lot of subvolumes (1075, mainly for
backup), it is not easy to find the paths to the corrupted files. Is
there an easy way to find these files?
A side note, I have also noticed the following line appearing regularly
in the logs:
[58562.612121] btrfs_readpage_end_io_hook: 12 callbacks suppressed
Do you know what it means?
Many thanks for your help,
Antoine
* Re: How to find corrupted files?
From: Duncan @ 2015-05-02 4:28 UTC (permalink / raw)
To: linux-btrfs
Antoine Sirinelli posted on Fri, 01 May 2015 22:06:48 +0200 as excerpted:
> Hi,
>
> I had a btrfs system running for a couple of years with an old kernel
> (3.14.xx). Recently I tried to back it up to a remote host using the
> send/receive functionality, which resulted in a couple of kernel oopses.
> I decided to upgrade the kernel to 3.16 (Debian Jessie) and was then able
> to use send/receive without too many problems.
=:^) The send/receive code has gotten a lot of attention and fixes for
various corner-cases over the last few kernels, and your experience
demonstrates that.
Of course btrfs isn't entirely stable yet, and people on this list
generally consider 3.16 pretty old as well. Put generally, the reasons
one might want to run a long-term-stable kernel are incompatible with
running a not-yet-fully-stable filesystem like btrfs. Either you want
stability, in which case btrfs is still too leading (and possibly
bleeding) edge for you, or you want leading-edge, not-yet-entirely-stable
features like btrfs, in which case you can't expect to run old kernels,
as they tend to have known and long-since-fixed bugs in rapidly moving
code such as btrfs.
What I've been recommending recently for people who want btrfs and
reasonable stability as well is staying one release-kernel series back.
4.0 is the current release, so in this mode you'd be on 3.19 currently,
and would upgrade to 4.0.x about the time 4.1 comes out, provided no
serious btrfs bugs are known for it at that time. That gives the worst
and most common bugs time to surface, so they'll at least be known by
then, and generally either already fixed or with a fix well underway.
That of course assumes you don't want the risk of running newer kernels,
say late rcs, which are usually pretty stable already, although there
have been a couple of major exceptions of late.
(Personally, I've gotten a bit more conservative due to those exceptions,
and haven't been updating until say rc5 at least, if not full release,
while I used to try to update by rc3.)
> Since the kernel upgrade I have noticed a lot of lines like the following
> in the kernel log:
>
> [145059.990123] BTRFS info (device sda4): csum failed ino 101147
> off 1114112 csum 1810207416 expected csum 3082675757
> [145060.500612] BTRFS info (device sda4): csum failed ino 101148
> off 110592 csum 1418370968 expected csum 496354029
>
> I understand these indicate corrupted files. By running btrfs scrub, I
> have been able to find some of them, but I still have 20 inodes with
> failed csums. As I have quite a lot of subvolumes (1075, mainly for
> backup), it is not easy to find the paths to the corrupted files. Is
> there an easy way to find these files?
If the corruption is in a data chunk and thus in a file, with a current
kernel at least, dmesg covering the period of a scrub should contain a
file mapping. If the corruption is metadata, mapping to a file is
obviously not possible, but unlike data, metadata defaults to dup mode on
a single-device-btrfs (except for ssds where it defaults to single), and
raid1 mode for a multi-device-btrfs, so chances are much better there
will be a valid second copy that scrub can use to fix the bad copy.
There's also the btrfs inspect-internal command group (for example its
inode-resolve subcommand), which can resolve various items, including
inodes, for debugging purposes.
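With many subvolumes, one way to use inode resolution is to try the
inode number against each subvolume in turn. The following is a rough
sketch, not a tested recipe: it assumes the btrfs-progs command
"btrfs inspect-internal inode-resolve <ino> <path>" is available and run
as root, and that the caller supplies the list of mounted subvolume
paths. The helper names are hypothetical. Inode numbers are only unique
within a single subvolume, so a hit in a given subvolume is a candidate,
not proof that this particular file is the damaged one.

```python
import subprocess


def inode_resolve_cmd(inode, subvol_path):
    """Build the argv for `btrfs inspect-internal inode-resolve`."""
    return ["btrfs", "inspect-internal", "inode-resolve",
            str(inode), str(subvol_path)]


def resolve_in_subvolumes(inode, subvol_paths, run=subprocess.run):
    """Try the inode in every subvolume; return {subvol_path: [paths]}.

    `run` is injectable so the logic can be exercised without btrfs.
    """
    hits = {}
    for subvol in subvol_paths:
        proc = run(inode_resolve_cmd(inode, subvol),
                   capture_output=True, text=True)
        # A non-zero exit status is treated as "inode not found here".
        if proc.returncode == 0 and proc.stdout.strip():
            hits[subvol] = proc.stdout.splitlines()
    return hits
```

Calling resolve_in_subvolumes(101147, paths) with paths being the
mounted subvolume directories would then return only the subvolumes in
which that inode number resolves to at least one path.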
But I'm not exactly sure how either one works with snapshots,
particularly when the corruption is referenced by multiple snapshots. My
use-case doesn't involve subvolumes or snapshots (I prefer small and
therefore manageable whole filesystems, generally under 100 GiB each,
with multiple backup copies as appropriate), and I've neither seen that
bit documented nor come across it being discussed on the list, so...
Of course here too, you'll be best served by running a current btrfs-progs
userspace. The git/master repo is currently serving v4.0 (which I
grabbed just tonight).
> A side note, I have also noticed the following line appearing regularly
> in the logs:
>
> [58562.612121] btrfs_readpage_end_io_hook: 12 callbacks suppressed
>
> Do you know what it means?
I believe that's a logging noise-reduction mechanism -- 12 lines
substantially similar to a previously printed line were suppressed.
So just above that should be another btrfs_readpage_end_io_hook line,
with an actual message. The suppression line says 12 more of those
occurred but weren't logged.
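To illustrate the idea, here is a small sketch of that kind of log rate
limiting. It is not the kernel's code: the class name, the 5-second
window and the burst of 10 are assumptions chosen for the example. Up to
"burst" messages are let through per window, the rest are only counted,
and the count is reported once the next window starts.

```python
import time


class RateLimiter:
    """Let at most `burst` messages through per `interval` seconds."""

    def __init__(self, name, interval=5.0, burst=10, clock=time.monotonic):
        self.name = name
        self.interval = interval
        self.burst = burst
        self.clock = clock          # injectable for testing
        self.window_start = None
        self.printed = 0
        self.missed = 0

    def log(self, message):
        """Return the list of lines that would actually reach the log."""
        now = self.clock()
        out = []
        if self.window_start is None:
            self.window_start = now
        elif now - self.window_start >= self.interval:
            # New window: report what was dropped in the previous one.
            if self.missed:
                out.append(f"{self.name}: {self.missed} callbacks suppressed")
            self.window_start = now
            self.printed = 0
            self.missed = 0
        if self.printed < self.burst:
            self.printed += 1
            out.append(message)
        else:
            self.missed += 1        # counted, but not logged
        return out
```

With these example values, 22 identical messages arriving within one
window produce 10 logged lines, and the next message after the window
expires is preceded by a "12 callbacks suppressed" summary.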
Beyond that, I don't know what btrfs_readpage_end_io_hook actually does,
as I'm just a btrfs-using admin and list regular, not a dev.
Presumably you'll get a /bit/ more info from the /not/ suppressed line,
and presumably it's at least worth warning about or it wouldn't be
logged repeatedly like that, but I guess a dev would have to see the
unsuppressed line to explain further.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman