* Manual intervention options for csum errors
@ 2022-06-01 21:16 Matthew Warren
  2022-06-02  1:50 ` waxhead
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Matthew Warren @ 2022-06-01 21:16 UTC (permalink / raw)
  To: Btrfs BTRFS

I have a filesystem which is currently not in any sort of RAID configuration,
and occasionally a bit flip will occur somewhere on the disk. It would be
nice to be able to tell BTRFS to recalculate the checksum for that
specific block and assume the data is correct. For instance, I just
had this bit flip in the csum for a non-important file which I have an
external backup of.

Jun 01 15:58:04 planeptune kernel: BTRFS warning (device nvme0n1p2):
csum failed root 258 ino 63674380 off 208896 csum 0xa40b3c39 expected
csum 0xa40b2c39 mirror 1

This is a very clear case of a csum bitflip and I'd like to have the
ability to tell BTRFS that the data is correct.

Matthew Warren


* Re: Manual intervention options for csum errors
  2022-06-01 21:16 Manual intervention options for csum errors Matthew Warren
@ 2022-06-02  1:50 ` waxhead
  2022-06-02  4:39 ` Qu Wenruo
  2022-06-02 19:12 ` Chris Murphy
  2 siblings, 0 replies; 9+ messages in thread
From: waxhead @ 2022-06-02  1:50 UTC (permalink / raw)
  To: Matthew Warren, Btrfs BTRFS

Matthew Warren wrote:
> I have FS which is currently not in any sort of raid configuration and
> occasionally a bit flip will occur somewhere on the disk. It would be
> nice to be able to tell BTRFS to recalculate the checksum for that
> specific block and assume the data is correct. For instance, I just
> had this bit flip in the csum for a non-important file which I have an
> external backup of.
> 
> Jun 01 15:58:04 planeptune kernel: BTRFS warning (device nvme0n1p2):
> csum failed root 258 ino 63674380 off 208896 csum 0xa40b3c39 expected
> csum 0xa40b2c39 mirror 1
> 
> This is a very clear case of a csum bitflip and I'd like to have the
> ability to tell BTRFS that the data is correct.
> 
> Matthew Warren
> 
I am just an average user, but if anyone picks this up I would like
to throw out an idea. Perhaps something like...

btrfs filesystem (or subvolume) list compromised /mnt, which could list
all the files identified with an unrepairable csum error, for example...

1: path/to/file/file1
2: path/to/file/file2
3: path/to/another/filename/somewhere

And then be able to mark damaged files by id as good somehow. Not sure
how to solve that, but perhaps a "damaged tree" / subvolume dirty bit
or something along those lines would need to be added before that
would even be possible.


* Re: Manual intervention options for csum errors
  2022-06-01 21:16 Manual intervention options for csum errors Matthew Warren
  2022-06-02  1:50 ` waxhead
@ 2022-06-02  4:39 ` Qu Wenruo
  2022-06-02 15:30   ` Matthew Warren
  2022-06-02 19:12 ` Chris Murphy
  2 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2022-06-02  4:39 UTC (permalink / raw)
  To: Matthew Warren, Btrfs BTRFS



On 2022/6/2 05:16, Matthew Warren wrote:
> I have FS which is currently not in any sort of raid configuration and
> occasionally a bit flip will occur somewhere on the disk.

This is not a good sign.

Such a bitflip can only happen in memory: if it were a bitflip on disk,
it would cause a metadata csum mismatch instead.

This means your memory is unreliable, and a memtest is strongly
recommended before doing anything.

> It would be
> nice to be able to tell BTRFS to recalculate the checksum for that
> specific block and assume the data is correct. For instance, I just
> had this bit flip in the csum for a non-important file which I have an
> external backup of.
>
> Jun 01 15:58:04 planeptune kernel: BTRFS warning (device nvme0n1p2):
> csum failed root 258 ino 63674380 off 208896 csum 0xa40b3c39 expected
> csum 0xa40b2c39 mirror 1
>
> This is a very clear case of a csum bitflip and I'd like to have the
> ability to tell BTRFS that the data is correct.

We have the ability to ignore csum mismatches and force the read, but it's
for recovery purposes only.

You can use "mount -o ro,rescue=idatacsums", which will completely
ignore data csums and allow you to read the data out.

Unfortunately, since it's a recovery mount option, it has to be used with
a read-only mount.
So you can only read out the data and save it somewhere else, then copy
it back once the filesystem is mounted read-write again.
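
The copy-out / copy-back sequence described above could be sketched as the
dry-run below. It only builds and prints the commands and does not execute
anything. The device name comes from the log line in this thread; the mount
point, file path and staging path are hypothetical placeholders, and the
umount/mount ordering is an assumption about how one would switch between
the rescue mount and a normal read-write mount. If this is the root
filesystem, the steps would have to run from a live environment instead.

```python
import shlex


def rescue_copy_plan(device, mountpoint, damaged_file, staging_copy):
    """Return the shell commands for the copy-out / copy-back workaround.

    staging_copy must live on a different filesystem than `device`.
    Nothing is executed; the caller decides what to do with the plan.
    """
    steps = [
        # The rescue option needs a read-only mount, so unmount first.
        ["umount", mountpoint],
        ["mount", "-o", "ro,rescue=idatacsums", device, mountpoint],
        # Data csums are ignored now, so the read will not fail.
        ["cp", "--", damaged_file, staging_copy],
        # Back to a normal read-write mount without the rescue option.
        ["umount", mountpoint],
        ["mount", device, mountpoint],
        # Rewriting the file makes btrfs compute fresh csums for it.
        ["cp", "--", staging_copy, damaged_file],
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    # All paths below are illustrative placeholders.
    for command in rescue_copy_plan(
        "/dev/nvme0n1p2", "/mnt", "/mnt/data/file.bin", "/tmp/file.bin"
    ):
        print(command)
```

Run this only after the memory has been tested: as discussed elsewhere in
the thread, writing through bad RAM can make things worse.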

Thanks,
Qu

>
> Matthew Warren


* Re: Manual intervention options for csum errors
  2022-06-02  4:39 ` Qu Wenruo
@ 2022-06-02 15:30   ` Matthew Warren
  2022-06-02 22:16     ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Warren @ 2022-06-02 15:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

> This is not a good sign.
>
> Such bitflip can only happen in memory, as if it's a bitflip from disk,
> then it will cause the metadata csum mismatch.
>
> So this means, your memory is unreliable, and a memtest is strongly
> recommended before doing anything.

I don't think that's the case. The files were last modified all the
way back in 2020, and there haven't been any file modifications near
them since the end of April this year. There have also been 2 scrubs
before the last one where there were no issues at all. Does this mean
that at some point in the last half month (since that's the time
between the last successful scrub and the scrub which errored) BTRFS
read and re-wrote the file to disk?

Matthew Warren


* Re: Manual intervention options for csum errors
  2022-06-01 21:16 Manual intervention options for csum errors Matthew Warren
  2022-06-02  1:50 ` waxhead
  2022-06-02  4:39 ` Qu Wenruo
@ 2022-06-02 19:12 ` Chris Murphy
  2022-06-02 19:18   ` Chris Murphy
  2 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2022-06-02 19:12 UTC (permalink / raw)
  To: Matthew Warren; +Cc: Btrfs BTRFS

On Wed, Jun 1, 2022 at 11:35 PM Matthew Warren
<matthewwarren101010@gmail.com> wrote:
>
> I have FS which is currently not in any sort of raid configuration and
> occasionally a bit flip will occur somewhere on the disk. It would be
> nice to be able to tell BTRFS to recalculate the checksum for that
> specific block and assume the data is correct. For instance, I just
> had this bit flip in the csum for a non-important file which I have an
> external backup of.
>
> Jun 01 15:58:04 planeptune kernel: BTRFS warning (device nvme0n1p2):
> csum failed root 258 ino 63674380 off 208896 csum 0xa40b3c39 expected
> csum 0xa40b2c39 mirror 1

The csums are off by 1 bit. That doesn't mean the data on disk changed
at all, because had there been a single bit flip in the data block,
you'd have a completely different csum, it wouldn't be off by one.
Looks like the data on disk did not change but the csum is computed
wrong somewhere - either it was originally computed wrong (bit flip)
and written to the csum tree where it's now persistently wrong. Or
it's transiently computed wrong on read. Either way, it's most likely
a memory bit flip. I suppose it could also be a memory bitflip in the
drive itself.
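
The "off by 1 bit" observation can be checked directly from the kernel
message. This is a small sketch: the log line is the one quoted above
(joined onto one line), while the regular expression and the function name
are only illustrative.

```python
import re

LOG_LINE = (
    "BTRFS warning (device nvme0n1p2): csum failed root 258 ino 63674380 "
    "off 208896 csum 0xa40b3c39 expected csum 0xa40b2c39 mirror 1"
)

# Matches the "csum <found> expected csum <stored>" part of the warning.
CSUM_RE = re.compile(r"csum (0x[0-9a-f]+) expected csum (0x[0-9a-f]+)")


def csum_bit_difference(line):
    """Return (xor, number of differing bits) for the two csums in line."""
    match = CSUM_RE.search(line)
    if match is None:
        raise ValueError("not a btrfs csum failure line")
    found, expected = (int(value, 16) for value in match.groups())
    diff = found ^ expected
    return diff, bin(diff).count("1")


diff, flipped_bits = csum_bit_difference(LOG_LINE)
print(f"xor = {diff:#010x}, differing bits = {flipped_bits}")
# prints: xor = 0x00001000, differing bits = 1
```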


So yeah you really need to do a memory test, and unfortunately the
available memory testers can still allow bad memory to elude testing.
In the best cases, it'll find a problem in a few hours. In rather
common cases it takes days for it to be detected, so you'd want to set
this up for a weekend to maximize the chance of finding it. There's
both memtest86 (not libre but there is a basic no cost version, pretty
sure they are UEFI only now), and memtest86+, which is back in
development as of last month; your distro should hopefully have a
build. These have the advantage that only a tiny portion of RAM is not
tested, the portion the test utility takes. Since there's no OS, test
coverage is maximized. I've heard pretty good things about memtester,
which is a user space test program. The disadvantage is it needs linux
running so a much bigger portion of memory can't be tested, but
chances are if the memory used by linux and the tester are
compromised, you'll get some sort of catastrophic failure. Probably.
Maybe. You can run in single user mode or non-graphical boot to
maximize the RAM being tested.


> This is a very clear case of a csum bitflip and I'd like to have the
> ability to tell BTRFS that the data is correct.

As Qu mentioned, the easiest way is to get the file out using the
rescue mount option to ignore data csums, then remount without that
option and rw, and copy the file back in. But with bad RAM you risk
making the issue worse, and it can hit metadata at any time and then
the problem is much worse. Hopefully the write time tree checker will
catch it, it often does, but it's not guaranteed.


-- 
Chris Murphy


* Re: Manual intervention options for csum errors
  2022-06-02 19:12 ` Chris Murphy
@ 2022-06-02 19:18   ` Chris Murphy
  0 siblings, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2022-06-02 19:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Matthew Warren, Btrfs BTRFS

On Thu, Jun 2, 2022 at 3:12 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> On Wed, Jun 1, 2022 at 11:35 PM Matthew Warren
> <matthewwarren101010@gmail.com> wrote:
> >
> > I have FS which is currently not in any sort of raid configuration and
> > occasionally a bit flip will occur somewhere on the disk. It would be
> > nice to be able to tell BTRFS to recalculate the checksum for that
> > specific block and assume the data is correct. For instance, I just
> > had this bit flip in the csum for a non-important file which I have an
> > external backup of.
> >
> > Jun 01 15:58:04 planeptune kernel: BTRFS warning (device nvme0n1p2):
> > csum failed root 258 ino 63674380 off 208896 csum 0xa40b3c39 expected
> > csum 0xa40b2c39 mirror 1
>
> The csums are off by 1 bit. That doesn't mean the data on disk changed
> at all, because had there been a single bit flip in the data block,
> you'd have a completely different csum, it wouldn't be off by one.
> Looks like the data on disk did not change but the csum is computed
> wrong somewhere - either it was originally computed wrong (bit flip)
> and written to the csum tree where it's now persistently wrong. Or
> it's transiently computed wrong on read. Either way, it's most likely
> a memory bit flip. I suppose it could also be a memory bitflip in the
> drive itself.

Also, the reason why it isn't bit rot flipping one bit of the on-disk
csum is that btrfs would detect that, since the leaf the csum is stored
in is also checksummed. So if you were getting a new one-bit flip on
disk here, the whole leaf would have been rendered invalid, and all
csums in it.

Chances are the bitflip happened after the csum was computed and before
the leaf's own checksum was calculated, resulting in a leaf checksum
that validates the wrong csum.

But again, a one bit flip in a 4KiB data block would result in a much
bigger difference between the computed and expected csums because it's
a hash function.
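
To illustrate the point about data bit flips, here is a rough sketch that
flips one bit in a 4KiB block and counts how many csum bits change. It
assumes crc32c, which is the btrfs default checksum; the filesystem in this
thread may be using a different algorithm, although the 32-bit values in the
log are consistent with crc32c. The bitwise implementation is written for
clarity rather than speed, and the block contents and flipped position are
arbitrary.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# An arbitrary 4KiB data block (contents are illustrative only).
block = bytes(range(256)) * 16

# Flip a single bit somewhere in the data.
damaged = bytearray(block)
damaged[100] ^= 0x01

good = crc32c(block)
bad = crc32c(bytes(damaged))
changed = bin(good ^ bad).count("1")

print(f"good csum {good:#010x}, damaged csum {bad:#010x}")
print(f"csum bits that differ: {changed}")
```

For a block of this size the two csums can never differ by just one bit, so
a mismatch like the one in the log does not fit a flipped bit in the data.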


-- 
Chris Murphy


* Re: Manual intervention options for csum errors
  2022-06-02 15:30   ` Matthew Warren
@ 2022-06-02 22:16     ` Qu Wenruo
       [not found]       ` <CA+H1V9wD0Ndrnt5bV85nJPd7Go3gbyTs0K5pZBCybvwbeB3z3w@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2022-06-02 22:16 UTC (permalink / raw)
  To: Matthew Warren; +Cc: Btrfs BTRFS



On 2022/6/2 23:30, Matthew Warren wrote:
>> This is not a good sign.
>>
>> Such bitflip can only happen in memory, as if it's a bitflip from disk,
>> then it will cause the metadata csum mismatch.
>>
>> So this means, your memory is unreliable, and a memtest is strongly
>> recommended before doing anything.
>
> I don't think that's the case. The files were last modified all the
> way back in 2020, but there hasn't been any file modifications near
> them since the end of April this year.

Since the bitflip is in the csum tree, it doesn't matter whether that
specific file gets modified.

Any other file modification can trigger CoW on that csum tree block.

> There's also been 2 scrubs
> before the last one where there were no issues at all. Does this mean
> that at some point in the last half month (since that's the time
> between the last successful scrub and the scrub which errored) BTRFS
> read and re-wrote the file to disk?

I'd say yes. And it doesn't even need to modify that specific file.

That's why a memory bitflip is so concerning.

Thanks,
Qu
>
> Matthew Warren


* Fwd: Manual intervention options for csum errors
       [not found]       ` <CA+H1V9wD0Ndrnt5bV85nJPd7Go3gbyTs0K5pZBCybvwbeB3z3w@mail.gmail.com>
@ 2022-06-03 17:05         ` Matthew Warren
  2022-06-03 19:30           ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Warren @ 2022-06-03 17:05 UTC (permalink / raw)
  To: Btrfs BTRFS

> >> This is not a good sign.
> >>
> >> Such bitflip can only happen in memory, as if it's a bitflip from disk,
> >> then it will cause the metadata csum mismatch.
> >>
> >> So this means, your memory is unreliable, and a memtest is strongly
> >> recommended before doing anything.
> >
> > I don't think that's the case. The files were last modified all the
> > way back in 2020, but there hasn't been any file modifications near
> > them since the end of April this year.
>
> Since the bitflip is in csum tree, it doesn't matter if that specific
> file get modified.
>
> Any other file modification can trigger CoW on that csum tree block.
>
> > There's also been 2 scrubs
> > before the last one where there were no issues at all. Does this mean
> > that at some point in the last half month (since that's the time
> > between the last successful scrub and the scrub which errored) BTRFS
> > read and re-wrote the file to disk?
>
> I'd say yes. And it doesn't even need to modify that specific file.
>
> That's why memory bitflip is so concerning.
>
> Thanks,
> Qu

Would using BTRFS raid 1 add resiliency to this particular issue?

Matthew Warren


* Re: Manual intervention options for csum errors
  2022-06-03 17:05         ` Fwd: " Matthew Warren
@ 2022-06-03 19:30           ` Chris Murphy
  0 siblings, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2022-06-03 19:30 UTC (permalink / raw)
  To: Matthew Warren; +Cc: Btrfs BTRFS

On Fri, Jun 3, 2022 at 1:05 PM Matthew Warren
<matthewwarren101010@gmail.com> wrote:
>
> > >> This is not a good sign.
> > >>
> > >> Such bitflip can only happen in memory, as if it's a bitflip from disk,
> > >> then it will cause the metadata csum mismatch.
> > >>
> > >> So this means, your memory is unreliable, and a memtest is strongly
> > >> recommended before doing anything.
> > >
> > > I don't think that's the case. The files were last modified all the
> > > way back in 2020, but there hasn't been any file modifications near
> > > them since the end of April this year.
> >
> > Since the bitflip is in csum tree, it doesn't matter if that specific
> > file get modified.
> >
> > Any other file modification can trigger CoW on that csum tree block.
> >
> > > There's also been 2 scrubs
> > > before the last one where there were no issues at all. Does this mean
> > > that at some point in the last half month (since that's the time
> > > between the last successful scrub and the scrub which errored) BTRFS
> > > read and re-wrote the file to disk?
> >
> > I'd say yes. And it doesn't even need to modify that specific file.
> >
> > That's why memory bitflip is so concerning.
> >
> > Thanks,
> > Qu
>
> Would using BTRFS raid 1 add resiliency to this particular issue?

No, the corruption from bad RAM will affect both copies. So you really
need to do a thorough memory test.

-- 
Chris Murphy

