* Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
@ 2021-09-09  0:47 Sam Edwards
  2021-09-09  7:12 ` Nikolay Borisov
  2021-09-10  5:39 ` Sam Edwards
  0 siblings, 2 replies; 17+ messages in thread
From: Sam Edwards @ 2021-09-09  0:47 UTC (permalink / raw)
  To: linux-btrfs

Hello list,

First, I should say that there's no urgency here on my part.
Everything important is very well backed up, and even the
"unimportant" files (various configs) seem readable. I imaged the
partition without even attempting a repair. Normally, my inclination
would be to shrug this off and recreate the filesystem.

However, I'd like to help investigate the root cause, because:
1. This happened suspiciously soon (see my timeline in the link below)
after upgrading to kernel 5.14.1, so may be a serious regression.
2. The filesystem was created less than 5 weeks ago, so the possible
causes are relatively few.
3. My last successful btrfs scrub was just before upgrading to 5.14.1,
hopefully narrowing possible root causes even more.
4. I have imaged the partition and am thus willing to attempt risky
experimental repairs. (Mostly for the sake of reporting if they work.)

Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
OS: Gentoo
Earliest kernel ever used: 5.10.52-gentoo
First kernel version used for "real" usage: 5.13.8
Relevant information: See my Gist,
https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
Misc. notes: I have run "fstrim /" on occasion, but don't have
discards enabled automatically. I doubt TRIM is the culprit, but I
can't rule it out.

My primary hypothesis is that there's some write bug in Linux 5.14.1.
I installed some package updates right before btrfs detected the
problem, and most of the files in the `btrfs check` output look like
they were created as part of those updates.

My secondary hypothesis is that creating and/or using the swapfile
caused some kind of silent corruption that didn't become a detectable
issue until several further writes later.

Let me know if there's anything else I should try/provide!

Regards,
Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-09  0:47 Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old Sam Edwards
@ 2021-09-09  7:12 ` Nikolay Borisov
  2021-09-09  7:30   ` Sam Edwards
  2021-09-10  5:39 ` Sam Edwards
  1 sibling, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2021-09-09  7:12 UTC (permalink / raw)
  To: Sam Edwards, linux-btrfs



On 9.09.21 at 3:47, Sam Edwards wrote:
> My primary hypothesis is that there's some write bug in Linux 5.14.1.
> I installed some package updates right before btrfs detected the
> problem, and most of the files in the `btrfs check` output look like
> they were created as part of those updates.

So what was the last workload that got run on this server? The
btrfs-progs log suggests the last 2 transactions have been lost
(expecting 66655 but finding 66653). Also, looking at the missing file
names, it suggests you were fiddling with some Windows development
software? I can see a mix of .dll.so files and also a bunch of plain
.dll files. So what was the last thing you installed on that machine?
Are you using dual boot on the same disk, by any chance?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-09  7:12 ` Nikolay Borisov
@ 2021-09-09  7:30   ` Sam Edwards
  0 siblings, 0 replies; 17+ messages in thread
From: Sam Edwards @ 2021-09-09  7:30 UTC (permalink / raw)
  Cc: linux-btrfs

Hello!

This is a Gentoo system, which installs most package updates from
source. I had just installed/compiled updates before this issue
manifested -- I'm sure that this easily qualifies as a "workload." :)

The listed .dll and .dll.so files belong to Wine, although I don't
believe Wine was among the updates installed in that batch. I can
double-check if it's of importance.

The only files listed which I don't suspect were from the system
updates are "data" and "backtest" at the top. Those were from the
overnight workload that left the system unresponsive on the 5th.

Kind regards,
Sam


On Thu, Sep 9, 2021, 01:12 Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 9.09.21 at 3:47, Sam Edwards wrote:
> > My primary hypothesis is that there's some write bug in Linux 5.14.1.
> > I installed some package updates right before btrfs detected the
> > problem, and most of the files in the `btrfs check` output look like
> > they were created as part of those updates.
>
> So what was the last workload that got run on this server? The
> btrfs-progs log suggests the last 2 transactions have been lost
> (expecting 66655 but finding 66653). Also, looking at the missing file
> names, it suggests you were fiddling with some Windows development
> software? I can see a mix of .dll.so files and also a bunch of plain
> .dll files. So what was the last thing you installed on that machine?
> Are you using dual boot on the same disk, by any chance?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-09  0:47 Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old Sam Edwards
  2021-09-09  7:12 ` Nikolay Borisov
@ 2021-09-10  5:39 ` Sam Edwards
  2021-09-10  6:16   ` Qu Wenruo
  1 sibling, 1 reply; 17+ messages in thread
From: Sam Edwards @ 2021-09-10  5:39 UTC (permalink / raw)
  To: linux-btrfs

Hello again,

I've checked the hardware. The RAM tests out okay and SSD reads seem
consistent. I'd be very surprised if this was hardware failure,
though: the system isn't nearly old enough for degradation and I
haven't seen any of the tell-tale signs of defective hardware.

Also, if this was due to bit flips in the SSD, the dm-crypt layer
would amplify that across the 128-bit AES blocks (cipher mode is XTS).
I'm not seeing signs of that either.

I'm now pretty familiar with btrfs inspect-internal and have checked
the trees manually. Absolutely nothing looks corrupt, but several
leaf/node blocks are outdated. More interesting is the byte range of
blocks that were "rolled back": 1065332064256-1065565601792, by my
count. That fits pretty nicely in 256 MiB. This range is also not even
close to aligned on a 256 MiB boundary in LBA terms - even taking into
account luks and partition offsets. Thus this seems more like a
software cache that didn't get flushed rather than a SSD bug.
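
As a quick sanity check of those numbers, here is a small Python sketch
using the logical bytenrs quoted above (note these are btrfs logical
addresses, so the physical/LBA alignment additionally depends on the
chunk mapping and the LUKS/partition offsets):

    lo, hi = 1065332064256, 1065565601792   # endpoints of the "rolled back" range
    slice_size = 256 * 1024 * 1024          # 256 MiB

    span = hi - lo
    print(span, round(span / 2**20, 1))     # 233537536 bytes, ~222.7 MiB < 256 MiB
    print(lo % slice_size)                  # nonzero: the start is not 256 MiB aligned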

So, question: does the btrfs module use 256 MiB as a default size for
write-back cache pages anywhere? Or might this problem reside deeper
in the block storage system?

Also, for what it's worth: the last generation to be written before
these inconsistencies seems to be 66303. Files written as part of that
transaction have a birth date of Sep. 7. That's during the same boot
as the failure to read-only, which suggests that the cached buffer
wasn't merely "not saved before a previous shutdown" but rather
"discarded without being written back." Thoughts?

Cheers,
Sam


On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards <cfsworks@gmail.com> wrote:
>
> Hello list,
>
> First, I should say that there's no urgency here on my part.
> Everything important is very well backed up, and even the
> "unimportant" files (various configs) seem readable. I imaged the
> partition without even attempting a repair. Normally, my inclination
> would be to shrug this off and recreate the filesystem.
>
> However, I'd like to help investigate the root cause, because:
> 1. This happened suspiciously soon (see my timeline in the link below)
> after upgrading to kernel 5.14.1, so may be a serious regression.
> 2. The filesystem was created less than 5 weeks ago, so the possible
> causes are relatively few.
> 3. My last successful btrfs scrub was just before upgrading to 5.14.1,
> hopefully narrowing possible root causes even more.
> 4. I have imaged the partition and am thus willing to attempt risky
> experimental repairs. (Mostly for the sake of reporting if they work.)
>
> Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
> OS: Gentoo
> Earliest kernel ever used: 5.10.52-gentoo
> First kernel version used for "real" usage: 5.13.8
> Relevant information: See my Gist,
> https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
> Misc. notes: I have run "fstrim /" on occasion, but don't have
> discards enabled automatically. I doubt TRIM is the culprit, but I
> can't rule it out.
>
> My primary hypothesis is that there's some write bug in Linux 5.14.1.
> I installed some package updates right before btrfs detected the
> problem, and most of the files in the `btrfs check` output look like
> they were created as part of those updates.
>
> My secondary hypothesis is that creating and/or using the swapfile
> caused some kind of silent corruption that didn't become a detectable
> issue until several further writes later.
>
> Let me know if there's anything else I should try/provide!
>
> Regards,
> Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-10  5:39 ` Sam Edwards
@ 2021-09-10  6:16   ` Qu Wenruo
  2021-09-10  7:44     ` Sam Edwards
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2021-09-10  6:16 UTC (permalink / raw)
  To: Sam Edwards, linux-btrfs



On 2021/9/10 at 1:39 PM, Sam Edwards wrote:
> Hello again,
>
> I've checked the hardware. The RAM tests out okay and SSD reads seem
> consistent. I'd be very surprised if this was hardware failure,
> though: the system isn't nearly old enough for degradation and I
> haven't seen any of the tell-tale signs of defective hardware.

Unless there is obvious corruption reported by the btrfs tree-checker,
we won't even bother considering a RAM bitflip.

But for SSD/HDD, we're more cautious about any transid error.

There are already known bad disks which don't follow FLUSH requests,
thus leads to super easy transid mismatch on powerloss:

https://btrfs.wiki.kernel.org/index.php/Hardware_bugs


But in your case, dm-crypt is another layer between btrfs and the hardware,
thus we can't verify whether it's dm-crypt or btrfs doing something wrong.

>
> Also, if this was due to bit flips in the SSD, the dm-crypt layer
> would amplify that across the 128-bit AES blocks (cipher mode is XTS).
> I'm not seeing signs of that either.
>
> I'm now pretty familiar with btrfs inspect-internal and have checked
> the trees manually. Absolutely nothing looks corrupt, but several
> leaf/node blocks are outdated. More interesting is the byte range of
> blocks that were "rolled back": 1065332064256-1065565601792, by my
> count. That fits pretty nicely in 256 MiB. This range is also not even
> close to aligned on a 256 MiB boundary in LBA terms - even taking into
> account luks and partition offsets. Thus this seems more like a
> software cache that didn't get flushed rather than a SSD bug.
>
> So, question: does the btrfs module use 256 MiB as a default size for
> write-back cache pages anywhere?

No.

> Or might this problem reside deeper
> in the block storage system?

So if possible, please test without dm-crypt to make sure it's really
btrfs causing the problem.

>
> Also, for what it's worth: the last generation to be written before
> these inconsistencies seems to be 66303. Files written as part of that
> transaction have a birth date of Sep. 7. That's during the same boot
> as the failure to read-only,

That's interesting.

Btrfs aborts transaction when it hits critical metadata problems to
prevent further damage, but I still remember some possible corruption
caused by aborting transaction in the old days.
But those corruptions should not exist in v5.14.

Furthermore, the read-only flips itself is more important, do you have
the dmesg of that incident?


Recently we also see some corrupted free space cache causing problems.

If you're going to re-test this, mind to use space cache v2 to do the
tests?

> which suggests that the cached buffer
> wasn't merely "not saved before a previous shutdown" but rather
> "discarded without being written back." Thoughts?

This should not be a thing for btrfs.

Btrfs COW protects its metadata: as long as there is no problem in
btrfs itself allocating new tree blocks (rather than overwriting
existing blocks) in the same transaction, we're completely safe.

But considering you're reporting some RO incidents, then I guess the COW
mechanism may be broken at that time already.
(Maybe abort leads to some corrupted free space cache?)

Thanks,
Qu

>
> Cheers,
> Sam
>
>
> On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards <cfsworks@gmail.com> wrote:
>>
>> Hello list,
>>
>> First, I should say that there's no urgency here on my part.
>> Everything important is very well backed up, and even the
>> "unimportant" files (various configs) seem readable. I imaged the
>> partition without even attempting a repair. Normally, my inclination
>> would be to shrug this off and recreate the filesystem.
>>
>> However, I'd like to help investigate the root cause, because:
>> 1. This happened suspiciously soon (see my timeline in the link below)
>> after upgrading to kernel 5.14.1, so may be a serious regression.
>> 2. The filesystem was created less than 5 weeks ago, so the possible
>> causes are relatively few.
>> 3. My last successful btrfs scrub was just before upgrading to 5.14.1,
>> hopefully narrowing possible root causes even more.
>> 4. I have imaged the partition and am thus willing to attempt risky
>> experimental repairs. (Mostly for the sake of reporting if they work.)
>>
>> Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
>> OS: Gentoo
>> Earliest kernel ever used: 5.10.52-gentoo
>> First kernel version used for "real" usage: 5.13.8
>> Relevant information: See my Gist,
>> https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
>> Misc. notes: I have run "fstrim /" on occasion, but don't have
>> discards enabled automatically. I doubt TRIM is the culprit, but I
>> can't rule it out.
>>
>> My primary hypothesis is that there's some write bug in Linux 5.14.1.
>> I installed some package updates right before btrfs detected the
>> problem, and most of the files in the `btrfs check` output look like
>> they were created as part of those updates.
>>
>> My secondary hypothesis is that creating and/or using the swapfile
>> caused some kind of silent corruption that didn't become a detectable
>> issue until several further writes later.
>>
>> Let me know if there's anything else I should try/provide!
>>
>> Regards,
>> Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-10  6:16   ` Qu Wenruo
@ 2021-09-10  7:44     ` Sam Edwards
  2021-09-10  8:31       ` Qu Wenruo
  2021-09-11  4:24       ` Zygo Blaxell
  0 siblings, 2 replies; 17+ messages in thread
From: Sam Edwards @ 2021-09-10  7:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Hi Qu,

Thank you very much for your insight. It is also helpful to know that
I appear to be understanding the btrfs internals properly. :)

On Fri, Sep 10, 2021 at 12:17 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> But for SSD/HDD, we're more cautious about any transid error.
>
> There are already known bad disks which don't follow FLUSH requests,
> thus leads to super easy transid mismatch on powerloss:
>
> https://btrfs.wiki.kernel.org/index.php/Hardware_bugs

I strongly believe now that the SSD hardware/firmware is not the
issue. Not only was the last "power loss"-type event on the 5th (which
appears to have caused no damage), the issue appeared while the system
was running -- a SSD firmware problem of that magnitude should have
been apparent long before I upgraded to 5.14.1.

> > So, question: does the btrfs module use 256 MiB as a default size for
> > write-back cache pages anywhere?
>
> No.
>
> > Or might this problem reside deeper
> > in the block storage system?
>
> So if possible, please test without dm-crypt to make sure it's really
> btrfs causing the problem.

Since you don't find the 256 MiB size evocative of any component of
btrfs, I am also willing to say that this likely isn't a fault in
btrfs. So, I think this is now an exercise in ruling out btrfs
completely (before I move on to investigating devicemapper, the crypto
target, the NVMe driver, and the block storage system as a whole) and
further studying the filesystem to gain insight as to how this
happened.

> > Also, for what it's worth: the last generation to be written before
> > these inconsistencies seems to be 66303. Files written as part of that
> > transaction have a birth date of Sep. 7. That's during the same boot
> > as the failure to read-only,
>
> That's interesting.
>
> Btrfs aborts transaction when it hits critical metadata problems to
> prevent further damage, but I still remember some possible corruption
> caused by aborting transaction in the old days.
> But those corruptions should not exist in v5.14.

I understand "aborts" to mean, "if btrfs is in the middle of a write
transaction, and in the process discovers existing metadata corruption
(e.g. tree node generation mismatch) already on-disk, it will undo any
earlier modifications already written to disk as part of that
transaction" -- is this correct? If so, I don't believe there was any
existing metadata corruption on the disk beforehand, so I doubt the
abort mechanism is at fault here.

> Furthermore, the read-only flips itself is more important, do you have
> the dmesg of that incident?

Sadly, I wasn't able to save the dmesg (for obvious reasons), but I do
recall logs complaining of checksum failures and transid mismatches. I
did make note of the block bytenrs, which were in the interval
[1065332047872, 1065391243264] which itself appears to be part of the
same 256 MiB "slice" as I found before.

> Recently we also see some corrupted free space cache causing problems.

I infer "problems" to mean, "free space cache indicates block X is
free though it is not, so btrfs allocates a new tree node or extent at
that location, overwriting necessary data"?

If so, I don't think this is the case here either. If an older node is
overwritten with a newer one, I would expect the transid mismatches to
be that the found IDs are *greater* than the wanted IDs. I am finding
valid, but *older*, tree nodes in the affected region. This seems like
a "lost writes" kind of problem to me, not overwrites.

However I did note some free space cache errors in my timeline.txt
file over on the Github Gist. If you believe that this is still a
likelihood, I can investigate further.

> If you're going to re-test this, mind to use space cache v2 to do the
> tests?
>
> > which suggests that the cached buffer
> > wasn't merely "not saved before a previous shutdown" but rather
> > "discarded without being written back." Thoughts?
>
> This should not be a thing for btrfs.
>
> Btrfs COW protects its metadata: as long as there is no problem in
> btrfs itself allocating new tree blocks (rather than overwriting
> existing blocks) in the same transaction, we're completely safe.

Again, making sure I understand you correctly: Btrfs, by design,
*never* modifies an active tree node/leaf. It will instead copy the
node/leaf it wishes to modify into memory, perform the modifications
there, save the modified copy at a free location on disk, and then
recursively modify parents (using the same process) all the way back
to the superblock(s).

And so, if power is lost in the middle of this process (i.e. before
the superblock is updated), then the old paths down the trees are
intact. The transaction is lost, but existing metadata is protected.
Correct?

> But considering you're reporting some RO incidents, then I guess the COW
> mechanism may be broken at that time already.
> (Maybe abort leads to some corrupted free space cache?)

Hypothetically, suppose one were to try this experiment:
1. Set up a brand-new machine, which has 2 independent but equally-sized disks.
2. On disk A, create a btrfs and mount it. Use it as a normal rootfs
for a while, and then unmount it.
3. Image disk A to disk B (perhaps obfuscating the image so that it
doesn't confuse btrfs's RAID logic).
4. Remount disk A and begin using it heavily (reads and writes). At a
random moment, instantly copy a 256 MiB slice from disk B onto disk A,
choosing an offset that has recently been written on disk A, while
disk A is still mounted.

Would this cause csum failures, transid mismatches, a RO incident, and
the general output from "btrfs check" that I posted?

If so, I think the problem is that a write-back cache, somewhere
deeper in the block storage stack (i.e. not btrfs) simply...
*discarded* a contiguous 256 MiB worth of data that btrfs thought had
been successfully written, and the RO incident was due to btrfs
suddenly seeing old data resurface. Does this seem like the best
working hypothesis currently?

Also: Per your expert opinion, what would be the best way to repair
disk A (without the benefit of disk B, heh) in the experiment above? I
imagine the affected slice itself is not salvageable, and will have to
be forcibly marked as free space, and in the process throwing out any
references to tree nodes that happened to reside there, meaning I'll
lose a whole bunch of extents, inodes, csum tree entries, and so on...
and in turn I'd lose a random assortment of files. I'm actually okay
with that, because I can diff the surviving files against my last
backup and restore whatever was deleted.

Or is the whole thing FUBAR at this point and I should just zero the
partition and restore from backups?

Thank you for your time,
Sam

> Thanks,
> Qu
>
> >
> > Cheers,
> > Sam
> >
> >
> > On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards <cfsworks@gmail.com> wrote:
> >>
> >> Hello list,
> >>
> >> First, I should say that there's no urgency here on my part.
> >> Everything important is very well backed up, and even the
> >> "unimportant" files (various configs) seem readable. I imaged the
> >> partition without even attempting a repair. Normally, my inclination
> >> would be to shrug this off and recreate the filesystem.
> >>
> >> However, I'd like to help investigate the root cause, because:
> >> 1. This happened suspiciously soon (see my timeline in the link below)
> >> after upgrading to kernel 5.14.1, so may be a serious regression.
> >> 2. The filesystem was created less than 5 weeks ago, so the possible
> >> causes are relatively few.
> >> 3. My last successful btrfs scrub was just before upgrading to 5.14.1,
> >> hopefully narrowing possible root causes even more.
> >> 4. I have imaged the partition and am thus willing to attempt risky
> >> experimental repairs. (Mostly for the sake of reporting if they work.)
> >>
> >> Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
> >> OS: Gentoo
> >> Earliest kernel ever used: 5.10.52-gentoo
> >> First kernel version used for "real" usage: 5.13.8
> >> Relevant information: See my Gist,
> >> https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
> >> Misc. notes: I have run "fstrim /" on occasion, but don't have
> >> discards enabled automatically. I doubt TRIM is the culprit, but I
> >> can't rule it out.
> >>
> >> My primary hypothesis is that there's some write bug in Linux 5.14.1.
> >> I installed some package updates right before btrfs detected the
> >> problem, and most of the files in the `btrfs check` output look like
> >> they were created as part of those updates.
> >>
> >> My secondary hypothesis is that creating and/or using the swapfile
> >> caused some kind of silent corruption that didn't become a detectable
> >> issue until several further writes later.
> >>
> >> Let me know if there's anything else I should try/provide!
> >>
> >> Regards,
> >> Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-10  7:44     ` Sam Edwards
@ 2021-09-10  8:31       ` Qu Wenruo
  2021-09-10 22:34         ` Sam Edwards
  2021-09-11  4:24       ` Zygo Blaxell
  1 sibling, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2021-09-10  8:31 UTC (permalink / raw)
  To: Sam Edwards; +Cc: linux-btrfs



On 2021/9/10 at 3:44 PM, Sam Edwards wrote:
> Hi Qu,
>
> Thank you very much for your insight. It is also helpful to know that
> I appear to be understanding the btrfs internals properly. :)
>
> On Fri, Sep 10, 2021 at 12:17 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> But for SSD/HDD, we're more cautious about any transid error.
>>
>> There are already known bad disks which don't follow FLUSH requests,
>> thus leads to super easy transid mismatch on powerloss:
>>
>> https://btrfs.wiki.kernel.org/index.php/Hardware_bugs
>
> I strongly believe now that the SSD hardware/firmware is not the
> issue. Not only was the last "power loss"-type event on the 5th (which
> appears to have caused no damage), the issue appeared while the system
> was running -- a SSD firmware problem of that magnitude should have
> been apparent long before I upgraded to 5.14.1.
>
>>> So, question: does the btrfs module use 256 MiB as a default size for
>>> write-back cache pages anywhere?
>>
>> No.
>>
>>> Or might this problem reside deeper
>>> in the block storage system?
>>
>> So if possible, please test without dm-crypt to make sure it's really
>> btrfs causing the problem.
>
> Since you don't find the 256 MiB size evocative of any component of
> btrfs, I am also willing to say that this likely isn't a fault in
> btrfs.

If you have hit a read-only problem, then it's definitely btrfs' problem,
and we're very interested to see what the cause is.

> So, I think this is now an exercise in ruling out btrfs
> completely (before I move on to investigating devicemapper, the crypto
> target, the NVMe driver, and the block storage system as a whole) and
> further studying the filesystem to gain insight as to how this
> happened.
>
>>> Also, for what it's worth: the last generation to be written before
>>> these inconsistencies seems to be 66303. Files written as part of that
>>> transaction have a birth date of Sep. 7. That's during the same boot
>>> as the failure to read-only,
>>
>> That's interesting.
>>
>> Btrfs aborts transaction when it hits critical metadata problems to
>> prevent further damage, but I still remember some possible corruption
>> caused by aborting transaction in the old days.
>> But those corruptions should not exist in v5.14.
>
> I understand "aborts" to mean, "if btrfs is in the middle of a write
> transaction, and in the process discovers existing metadata corruption
> (e.g. tree node generation mismatch) already on-disk, it will undo any
> earlier modifications already written to disk as part of that
> transaction"

Btrfs doesn't need to undo anything; it just doesn't write back the super
block.

As metadata is protected by COW, the old metadata from the previous
transaction is not touched at all.

As long as the super block still points to the old metadata, everything is
safe -- it is simply the old metadata.

The metadata already written for the aborted transaction is not referred
to by any old metadata, thus it appears as if nothing happened.
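
A rough model of that commit discipline, just to illustrate the ordering
(this is a toy sketch, not the kernel code; the class and function names
are made up):

    # Toy model: new tree blocks only ever land in free space, and the
    # superblock update is the single commit point, ordered by flush barriers.
    class Device:
        def __init__(self):
            self.blocks = {}                      # bytenr -> data
            self.superblock = {"roots": None, "generation": 0}

        def write(self, bytenr, data):
            self.blocks[bytenr] = data

        def flush(self):
            pass                                  # stands in for a FLUSH/FUA barrier

    def commit_transaction(dev, new_blocks, new_roots, generation, abort=False):
        for bytenr, data in new_blocks:
            dev.write(bytenr, data)               # written only into free space
        dev.flush()                               # barrier: tree blocks durable first
        if abort:
            # Abort = simply skip the superblock update.  The old superblock
            # still points at the old roots, so nothing has to be undone; the
            # freshly written blocks are just left unreferenced.
            return
        dev.superblock = {"roots": new_roots, "generation": generation}
        dev.flush()                               # barrier: the new superblock is durable

    dev = Device()
    commit_transaction(dev, [(1065332064256, b"leaf data")],
                       {"tree_root": 1065332064256}, generation=66552)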


But all of this means btrfs needs a completely correct view of which
parts of space are free.
If that view is somehow incorrect, metadata is no longer protected, and
it can be overwritten and corrupted by a power loss.

But in that case, the found transid should be larger than expected, not
smaller as you reported.

A smaller transid normally means missing writes.

> -- is this correct? If so, I don't believe there was any
> existing metadata corruption on the disk beforehand, so I doubt the
> abort mechanism is at fault here.

Normally an abort is itself an indicator that something is already wrong.

>
>> Furthermore, the read-only flips itself is more important, do you have
>> the dmesg of that incident?
>
> Sadly, I wasn't able to save the dmesg (for obvious reasons), but I do
> recall logs complaining of checksum failures and transid mismatches.

Checksum error doesn't sound correct. Can be another indication of
missing writes.

But this also means, the corruption is even older.

> I
> did make note of the block bytenrs, which were in the interval
> [1065332047872, 1065391243264] which itself appears to be part of the
> same 256 MiB "slice" as I found before.
>
>> Recently we also see some corrupted free space cache causing problems.
>
> I infer "problems" to mean, "free space cache indicates block X is
> free though it is not, so btrfs allocates a new tree node or extent at
> that location, overwriting necessary data"?

Exactly.

>
> If so, I don't think this is the case here either. If an older node is
> overwritten with a newer one, I would expect the transid mismatches to
> be that the found IDs are *greater* than the wanted IDs. I am finding
> valid, but *older*, tree nodes in the affected region. This seems like
> a "lost writes" kind of problem to me, not overwrites.

You're right, I also forgot the transid is smaller than expected...

>
> However I did note some free space cache errors in my timeline.txt
> file over on the Github Gist. If you believe that this is still a
> likelihood, I can investigate further.

Now, the v1 cache is not the direct cause of the missing writes, but it
may still be a factor.

As long as you don't do forced shutdown, nor btrfs check --repair, the
v1 cache should not mismatch.

>
>> If you're going to re-test this, mind to use space cache v2 to do the
>> tests?
>>
>>> which suggests that the cached buffer
>>> wasn't merely "not saved before a previous shutdown" but rather
>>> "discarded without being written back." Thoughts?
>>
>> This should not be a thing for btrfs.
>>
>> Btrfs COW protects its metadata: as long as there is no problem in
>> btrfs itself allocating new tree blocks (rather than overwriting
>> existing blocks) in the same transaction, we're completely safe.
>
> Again, making sure I understand you correctly: Btrfs, by design,
> *never* modifies an active tree node/leaf. It will instead copy the
> node/leaf it wishes to modify into memory, perform the modifications
> there, save the modified copy at a free location on disk, and then
> recursively modify parents (using the same process) all the way back
> to the superblock(s).

Yes.

>
> And so, if power is lost in the middle of this process (i.e. before
> the superblock is updated), then the old paths down the trees are
> intact. The transaction is lost, but existing metadata is protected.
> Correct?

Yes.

>
>> But considering you're reporting some RO incidents, then I guess the COW
>> mechanism may be broken at that time already.
>> (Maybe abort leads to some corrupted free space cache?)
>
> Hypothetically, suppose one were to try this experiment:
> 1. Set up a brand-new machine, which has 2 independent but equally-sized disks.
> 2. On disk A, create a btrfs and mount it. Use it as a normal rootfs
> for a while, and then unmount it.
> 3. Image disk A to disk B (perhaps obfuscating the image so that it
> doesn't confuse btrfs's RAID logic).
> 4. Remount disk A and begin using it heavily (reads and writes). At a
> random moment, instantly copy a 256 MiB slice from disk B onto disk A,
> choosing an offset that has recently been written on disk A, while
> disk A is still mounted.

This is so crazy that I can't even imagine what could survive.

But if you mean disk B contains exactly the data of disk A, and no
balance has been executed on either disk, then the copied metadata is
completely valid to btrfs.

Btrfs has the fsid to distinguish different filesystems; in your thought
experiment it's the same, so there is no way to distinguish them. The
same goes for the checksum.

The final protection is the logical bytenr, where btrfs can map its
chunks at any logical bytenr, thus even at the same physical location,
they can have different logical bytenr.

But if no balance was executed, the chunk mapping is still the same, thus
we can't use the logical bytenr to distinguish the metadata.

But to me, this is really too crazy...

>
> Would this cause csum failures, transid mismatches, a RO incident, and
> the general output from "btrfs check" that I posted?

It won't cause csum failure.
Transid mismatch is possible.
RO incident is almost for sure.

>
> If so, I think the problem is that a write-back cache, somewhere
> deeper in the block storage stack (i.e. not btrfs) simply...
> *discarded* a contiguous 256 MiB worth of data that btrfs thought had
> been successfully written, and the RO incident was due to btrfs
> suddenly seeing old data resurface. Does this seem like the best
> working hypothesis currently?

But isn't there only dm-crypt (LUKS) between btrfs and the hardware?

If it's LUKS, I guess it could possibly cause missing writes.

BTW, if you have never hit a sudden power loss, nor run any btrfs check
--repair in between, then the "wrong amount of free space" warning at
Aug 10 is already an indicator that something went wrong.

>
> Also: Per your expert opinion, what would be the best way to repair
> disk A (without the benefit of disk B, heh) in the experiment above? I
> imagine the affected slice itself is not salvageable, and will have to
> be forcibly marked as free space, and in the process throwing out any
> references to tree nodes that happened to reside there, meaning I'll
> lose a whole bunch of extents, inodes, csum tree entries, and so on...
> and in turn I'd lose a random assortment of files. I'm actually okay
> with that, because I can diff the surviving files against my last
> backup and restore whatever was deleted.

I'd say, there is no way to repair.
Only data salvage is possible for generic transid mismatch.

>
> Or is the whole thing FUBAR at this point and I should just zero the
> partition and restore from backups?

I guess so.

The repair for transid is never ensured to be safe, as core btrfs
mechanism is already broken.

Thanks,
Qu

>
> Thank you for your time,
> Sam
>
>> Thanks,
>> Qu
>>
>>>
>>> Cheers,
>>> Sam
>>>
>>>
>>> On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards <cfsworks@gmail.com> wrote:
>>>>
>>>> Hello list,
>>>>
>>>> First, I should say that there's no urgency here on my part.
>>>> Everything important is very well backed up, and even the
>>>> "unimportant" files (various configs) seem readable. I imaged the
>>>> partition without even attempting a repair. Normally, my inclination
>>>> would be to shrug this off and recreate the filesystem.
>>>>
>>>> However, I'd like to help investigate the root cause, because:
>>>> 1. This happened suspiciously soon (see my timeline in the link below)
>>>> after upgrading to kernel 5.14.1, so may be a serious regression.
>>>> 2. The filesystem was created less than 5 weeks ago, so the possible
>>>> causes are relatively few.
>>>> 3. My last successful btrfs scrub was just before upgrading to 5.14.1,
>>>> hopefully narrowing possible root causes even more.
>>>> 4. I have imaged the partition and am thus willing to attempt risky
>>>> experimental repairs. (Mostly for the sake of reporting if they work.)
>>>>
>>>> Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
>>>> OS: Gentoo
>>>> Earliest kernel ever used: 5.10.52-gentoo
>>>> First kernel version used for "real" usage: 5.13.8
>>>> Relevant information: See my Gist,
>>>> https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
>>>> Misc. notes: I have run "fstrim /" on occasion, but don't have
>>>> discards enabled automatically. I doubt TRIM is the culprit, but I
>>>> can't rule it out.
>>>>
>>>> My primary hypothesis is that there's some write bug in Linux 5.14.1.
>>>> I installed some package updates right before btrfs detected the
>>>> problem, and most of the files in the `btrfs check` output look like
>>>> they were created as part of those updates.
>>>>
>>>> My secondary hypothesis is that creating and/or using the swapfile
>>>> caused some kind of silent corruption that didn't become a detectable
>>>> issue until several further writes later.
>>>>
>>>> Let me know if there's anything else I should try/provide!
>>>>
>>>> Regards,
>>>> Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-10  8:31       ` Qu Wenruo
@ 2021-09-10 22:34         ` Sam Edwards
  2021-09-11  1:05           ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Sam Edwards @ 2021-09-10 22:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Sep 10, 2021 at 2:31 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> If you have hit a read-only problem, then it's definitely btrfs' problem,
> and we're very interested to see what the cause is.

I'm in full agreement that RO is a sign of a malfunction; but I was
saying that the malfunction may be deeper in the stack, meaning the RO
is not necessarily due to a design/implementation error in btrfs, but
is rather the only sensible course of action when faced with certain
circumstances outside of its control.

> Checksum error doesn't sound correct. Can be another indication of
> missing writes.
>
> But this also means, the corruption is even older.

Here is the checksum error (the very first indication of something
amiss on the day in question):
BTRFS warning (device dm-0): checksum verify failed on 1065332064256
wanted 0x04ca393a found 0xd5f0b823 level 0
BTRFS error (device dm-0): parent transid verify failed on
1065332064256 wanted 66552 found 66543

Since it's coupled directly with a transid mismatch, I think this just
means the csum tree is current while the node is not. That is,
0x04ca393a is the correct checksum for generation 66552 of leaf
1065332064256, but that generation has gone missing and instead we
find generation 66543, which has checksum 0xd5f0b823.

> As long as you don't do forced shutdown, nor btrfs check --repair, the
> v1 cache should not mismatch.

I have never even heard of btrfs check --repair until this incident (a
testament to btrfs's general durability).

I checked and both shutdowns immediately before the "has wrong amount
of free space" warnings were clean. On the 10th, there was an unclean
shutdown a little earlier in the day - there may have been some
leftover issues from that.

> This is so crazy that I can't even imagine what could survive.
>
> [...]
>
> But to me, this is really too crazy...

It is a contrived idea, yes. :)

But the subject it explores is relevant: how btrfs would react to a
slice of an active partition spontaneously reverting to the data it
held several minutes prior.

That is, a "missing writes" problem where the writes don't go missing
until a few minutes *after* they succeed.

> The final protection is the logical bytenr, where btrfs can map its
> chunks at any logical bytenr, thus even at the same physical location,
> they can have different logical bytenr.

THIS is an interesting lead. Until this point I had been interpreting
bytenr as a physical partition offset. Now that I've learned about the
chunk tree, I found that all missing writes were to chunk
1065173909504.

That chunk has a physical offset of 999675658240. So, there is exactly
a 61 GiB difference between bytenr and partition offset. (This seems
to be true of the neighboring chunks as well, and as that's a nice
round number, I think this is the correct offset.)
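
(In Python, with the numbers above, the offset is indeed exactly 61 GiB:)

    logical = 1065173909504      # chunk's logical start (bytenr)
    physical = 999675658240      # chunk's physical offset on the device
    diff = logical - physical
    print(diff, diff / 2**30)    # 65498251264 bytes -> exactly 61.0 GiB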

However, if the physical offset were to change momentarily (i.e. for a
few minutes), then writes to that chunk would end up diverted to some
other location on disk. Once the physical offset changes back, the
chunk will appear to revert back to the same data it held a few
minutes prior. In effect, causing the "retroactive missing writes"
phenomenon I'm seeing.

This would also leave behind evidence, in that the missing writes
would have to have gone *somewhere* on disk, and as long as they
weren't overwritten, I can track them down by scanning the whole disk
for tree nodes with the proper bytenr/transid. I think I'll spend some
time today trying to do that, as that would confirm this idea.
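
A minimal sketch of what I have in mind (my own throwaway script, not an
existing btrfs tool; it assumes the default 16 KiB nodesize, that tree
blocks are nodesize-aligned relative to the start of the device I scan,
and the usual btrfs_header layout with the fsid at byte offset 32, the
bytenr at 48, and the generation at 80, all little-endian):

    import struct
    import sys

    NODESIZE = 16 * 1024   # assumed default nodesize

    def scan(path, fsid_hex, gen_lo, gen_hi):
        """Report physical offsets whose btrfs_header matches our fsid and
        whose generation falls within [gen_lo, gen_hi]."""
        fsid = bytes.fromhex(fsid_hex)
        with open(path, "rb") as dev:
            offset = 0
            while True:
                dev.seek(offset)
                hdr = dev.read(88)       # csum, fsid, bytenr, flags, uuid, generation
                if len(hdr) < 88:
                    break
                if hdr[32:48] == fsid:
                    bytenr = struct.unpack_from("<Q", hdr, 48)[0]
                    gen = struct.unpack_from("<Q", hdr, 80)[0]
                    if gen_lo <= gen <= gen_hi:
                        print(f"phys {offset}: claims bytenr {bytenr}, generation {gen}")
                offset += NODESIZE

    if __name__ == "__main__":
        # example: scan.py /dev/mapper/root <fsid hex, no dashes> 66304 66553
        scan(sys.argv[1], sys.argv[2], int(sys.argv[3]), int(sys.argv[4]))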

The only remaining question is why the physical offset would have
changed for only a few minutes. I didn't do a rebalance, although I
think I was running low on available space around the time, so maybe
btrfs was making some last-minute adjustments to the chunk tree to
compensate for that? The transid of the chunk tree node describing
this chunk is 58325, which is well before the problems started
happening. Perhaps the chunk tree was updated in-memory, used for
physical writes, but then reverted? Does this sound like something
btrfs might do?

Or maybe a cosmic ray flipped a bit in the in-memory copy of the
physical offset of the chunk. Unlikely, but possible. :)

> I'd say, there is no way to repair.
> Only data salvage is possible for generic transid mismatch.

Bah. Well, not a problem -- but that will take me a fair amount of
time. I'll want to investigate this and figure out what went wrong
*before* I go through the trouble of recreating my filesystem. I don't
want to spend a day restoring backups only to have the same problem
happen again a week later.

> >
> > Or is the whole thing FUBAR at this point and I should just zero the
> > partition and restore from backups?
>
> I guess so.
>
> The repair for transid is never ensured to be safe, as core btrfs
> mechanism is already broken.
>
> Thanks,
> Qu
>
> >
> > Thank you for your time,
> > Sam
> >
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> Cheers,
> >>> Sam
> >>>
> >>>
> >>> On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards <cfsworks@gmail.com> wrote:
> >>>>
> >>>> Hello list,
> >>>>
> >>>> First, I should say that there's no urgency here on my part.
> >>>> Everything important is very well backed up, and even the
> >>>> "unimportant" files (various configs) seem readable. I imaged the
> >>>> partition without even attempting a repair. Normally, my inclination
> >>>> would be to shrug this off and recreate the filesystem.
> >>>>
> >>>> However, I'd like to help investigate the root cause, because:
> >>>> 1. This happened suspiciously soon (see my timeline in the link below)
> >>>> after upgrading to kernel 5.14.1, so may be a serious regression.
> >>>> 2. The filesystem was created less than 5 weeks ago, so the possible
> >>>> causes are relatively few.
> >>>> 3. My last successful btrfs scrub was just before upgrading to 5.14.1,
> >>>> hopefully narrowing possible root causes even more.
> >>>> 4. I have imaged the partition and am thus willing to attempt risky
> >>>> experimental repairs. (Mostly for the sake of reporting if they work.)
> >>>>
> >>>> Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
> >>>> OS: Gentoo
> >>>> Earliest kernel ever used: 5.10.52-gentoo
> >>>> First kernel version used for "real" usage: 5.13.8
> >>>> Relevant information: See my Gist,
> >>>> https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
> >>>> Misc. notes: I have run "fstrim /" on occasion, but don't have
> >>>> discards enabled automatically. I doubt TRIM is the culprit, but I
> >>>> can't rule it out.
> >>>>
> >>>> My primary hypothesis is that there's some write bug in Linux 5.14.1.
> >>>> I installed some package updates right before btrfs detected the
> >>>> problem, and most of the files in the `btrfs check` output look like
> >>>> they were created as part of those updates.
> >>>>
> >>>> My secondary hypothesis is that creating and/or using the swapfile
> >>>> caused some kind of silent corruption that didn't become a detectable
> >>>> issue until several further writes later.
> >>>>
> >>>> Let me know if there's anything else I should try/provide!
> >>>>
> >>>> Regards,
> >>>> Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-10 22:34         ` Sam Edwards
@ 2021-09-11  1:05           ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2021-09-11  1:05 UTC (permalink / raw)
  To: Sam Edwards, Qu Wenruo; +Cc: linux-btrfs



On 2021/9/11 at 6:34 AM, Sam Edwards wrote:
> On Fri, Sep 10, 2021 at 2:31 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>> If you have hit a read-only problem, then it's definitely btrfs' problem,
>> and we're very interested to see what the cause is.
>
> I'm in full agreement that RO is a sign of a malfunction; but I was
> saying that the malfunction may be deeper in the stack, meaning the RO
> is not necessarily due to a design/implementation error in btrfs, but
> is rather the only sensible course of action when faced with certain
> circumstances outside of its control.
>
>> Checksum error doesn't sound correct. Can be another indication of
>> missing writes.
>>
>> But this also means, the corruption is even older.
>
> Here is the checksum error (the very first indication of something
> amiss on the day in question):
> BTRFS warning (device dm-0): checksum verify failed on 1065332064256
> wanted 0x04ca393a found 0xd5f0b823 level 0
> BTRFS error (device dm-0): parent transid verify failed on
> 1065332064256 wanted 66552 found 66543

This is something even weirder.

Firstly, both messages are for the same block, which means the first
copy is not correct, but it's not complete garbage.

If it were complete garbage, its logical bytenr would not match (btrfs
checks the very basic things like logical bytenr/fsid first, then checks
the csum).

Thus it looks like either the data is not correct, or the bytenr check
just passes by pure coincidence (which I don't believe, as it also
passed the fsid check).

Then the 2nd copy passed all the other checks except transid.

This looks very weird.
If btrfs were causing the problem, both copies should have the same
problem, not one copy with a csum mismatch and the other with a transid
mismatch.

>
> Since it's coupled directly with a transid mismatch, I think this just
> means the csum tree is current while the node is not.

For metadata, the csum is inlined into the header, not stored in the
csum tree. The csum tree is only for data.
Thus it's not possible for the csum to be out of sync with the tree
block it covers.

> That is,
> 0x04ca393a is the correct checksum for generation 66552 of leaf
> 1065332064256, but that generation has gone missing and instead we
> find generation 66543, which has checksum 0xd5f0b823.

Nope, the csum is inside the tree block (along with its bytenr and fsid).
It looks more like the latter part of the tree block mirror 1 got
overwritten or corrupted.

>
>> As long as you don't do forced shutdown, nor btrfs check --repair, the
>> v1 cache should not mismatch.
>
> I have never even heard of btrfs check --repair until this incident (a
> testament to btrfs's general durability).
>
> I checked and both shutdowns immediately before the "has wrong amount
> of free space" warnings were clean. On the 10th, there was an unclean
> shutdown a little earlier in the day - there may have been some
> leftover issues from that.

I guess all the problems happened at that unclean shutdown.

>
>> This is too crazy that I can't even imagine what could survive.
>>
>> [...]
>>
>> But to me, this is really too crazy...
>
> It is a contrived idea, yes. :)
>
> But the subject it explores is relevant: how btrfs would react to a
> slice of an active partition spontaneously reverting to the data it
> held several minutes prior.

That means the disks are not respecting FLUSH.

A FLUSH command means the disk should only return after all the data in
its volatile cache has been written to the media or to non-volatile cache.

If the disks (including the dm layer) just return without really writing
back all the data to non-volatile storage, and a power loss then happens,
that's exactly how this kind of transid mismatch would occur.

>
> That is, a "missing writes" problem where the writes don't go missing
> until a few minutes *after* they succeed.
>
>> The final protection is the logical bytenr, where btrfs can map its
>> chunks at any logical bytenr, thus even at the same physical location,
>> they can have different logical bytenr.
>
> THIS is an interesting lead. Until this point I had been interpreting
> bytenr as a physical partition offset. Now that I've learned about the
> chunk tree, I found that all missing writes were to chunk
> 1065173909504.

That can be caused by the fact that all the newer metadata writes were
just allocated inside chunk 1065173909504.


>
> That chunk has a physical offset of 999675658240. So, there is exactly
> a 61 GiB difference between bytenr and partition offset. (This seems
> to be true of the neighboring chunks as well, and as that's a nice
> round number, I think this is the correct offset.)
>
> However, if the physical offset were to change momentarily (i.e. for a
> few minutes), then writes to that chunk would end up diverted to some
> other location on disk. Once the physical offset changes back, the
> chunk will appear to revert back to the same data it held a few
> minutes prior. In effect, causing the "retroactive missing writes"
> phenomenon I'm seeing.

I don't think this is related to a sudden physical offset change, or the
kernel would report things like "bad tree block start, want %llu have
%llu".

Thus I still think there is something between btrfs and the disks that
causes FLUSH commands to be incorrectly executed.

Thanks,
Qu

>
> This would also leave behind evidence, in that the missing writes
> would have to have gone *somewhere* on disk, and as long as they
> weren't overwritten, I can track them down by scanning the whole disk
> for tree nodes with the proper bytenr/transid. I think I'll spend some
> time today trying to do that, as that would confirm this idea.
>
> The only remaining question is why the physical offset would have
> changed for only a few minutes. I didn't do a rebalance, although I
> think I was running low on available space around the time, so maybe
> btrfs was making some last-minute adjustments to the chunk tree to
> compensate for that? The transid of the chunk tree node describing
> this chunk is 58325, which is well before the problems started
> happening. Perhaps the chunk tree was updated in-memory, used for
> physical writes, but then reverted? Does this sound like something
> btrfs might do?
>
> Or maybe a cosmic ray flipped a bit in the in-memory copy of the
> physical offset of the chunk. Unlikely, but possible. :)
>
>> I'd say, there is no way to repair.
>> Only data salvage is possible for generic transid mismatch.
>
> Bah. Well, not a problem -- but that will take me a fair amount of
> time. I'll want to investigate this and figure out what went wrong
> *before* I go through the trouble of recreating my filesystem. I don't
> want to spend a day restoring backups only to have the same problem
> happen again a week later.
>
>>>
>>> Or is the whole thing FUBAR at this point and I should just zero the
>>> partition and restore from backups?
>>
>> I guess so.
>>
>> The repair for transid is never ensured to be safe, as core btrfs
>> mechanism is already broken.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thank you for your time,
>>> Sam
>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> Cheers,
>>>>> Sam
>>>>>
>>>>>
>>>>> On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards <cfsworks@gmail.com> wrote:
>>>>>>
>>>>>> Hello list,
>>>>>>
>>>>>> First, I should say that there's no urgency here on my part.
>>>>>> Everything important is very well backed up, and even the
>>>>>> "unimportant" files (various configs) seem readable. I imaged the
>>>>>> partition without even attempting a repair. Normally, my inclination
>>>>>> would be to shrug this off and recreate the filesystem.
>>>>>>
>>>>>> However, I'd like to help investigate the root cause, because:
>>>>>> 1. This happened suspiciously soon (see my timeline in the link below)
>>>>>> after upgrading to kernel 5.14.1, so may be a serious regression.
>>>>>> 2. The filesystem was created less than 5 weeks ago, so the possible
>>>>>> causes are relatively few.
>>>>>> 3. My last successful btrfs scrub was just before upgrading to 5.14.1,
>>>>>> hopefully narrowing possible root causes even more.
>>>>>> 4. I have imaged the partition and am thus willing to attempt risky
>>>>>> experimental repairs. (Mostly for the sake of reporting if they work.)
>>>>>>
>>>>>> Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
>>>>>> OS: Gentoo
>>>>>> Earliest kernel ever used: 5.10.52-gentoo
>>>>>> First kernel version used for "real" usage: 5.13.8
>>>>>> Relevant information: See my Gist,
>>>>>> https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
>>>>>> Misc. notes: I have run "fstrim /" on occasion, but don't have
>>>>>> discards enabled automatically. I doubt TRIM is the culprit, but I
>>>>>> can't rule it out.
>>>>>>
>>>>>> My primary hypothesis is that there's some write bug in Linux 5.14.1.
>>>>>> I installed some package updates right before btrfs detected the
>>>>>> problem, and most of the files in the `btrfs check` output look like
>>>>>> they were created as part of those updates.
>>>>>>
>>>>>> My secondary hypothesis is that creating and/or using the swapfile
>>>>>> caused some kind of silent corruption that didn't become a detectable
>>>>>> issue until several further writes later.
>>>>>>
>>>>>> Let me know if there's anything else I should try/provide!
>>>>>>
>>>>>> Regards,
>>>>>> Sam
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-10  7:44     ` Sam Edwards
  2021-09-10  8:31       ` Qu Wenruo
@ 2021-09-11  4:24       ` Zygo Blaxell
  2021-09-11  5:07         ` Sam Edwards
  1 sibling, 1 reply; 17+ messages in thread
From: Zygo Blaxell @ 2021-09-11  4:24 UTC (permalink / raw)
  To: Sam Edwards; +Cc: Qu Wenruo, linux-btrfs

On Fri, Sep 10, 2021 at 01:44:51AM -0600, Sam Edwards wrote:
> Hi Qu,
> 
> Thank you very much for your insight. It is also helpful to know that
> I appear to be understanding the btrfs internals properly. :)
> 
> On Fri, Sep 10, 2021 at 12:17 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > But for SSD/HDD, we're more cautious about any transid error.
> >
> > There are already known bad disks which don't follow FLUSH requests,
> > thus leads to super easy transid mismatch on powerloss:
> >
> > https://btrfs.wiki.kernel.org/index.php/Hardware_bugs
> 
> I strongly believe now that the SSD hardware/firmware is not the
> issue. Not only was the last "power loss"-type event on the 5th (which
> appears to have caused no damage), the issue appeared while the system
> was running -- a SSD firmware problem of that magnitude should have
> been apparent long before I upgraded to 5.14.1.

Have you reported the model and firmware revision of the SSD?
This is critical information for identifying firmware problems.

Write cache firmware failures are not only triggered by power losses.
They can be triggered by underlying media failures and SSD embedded DRAM
failures.  This kind of firmware bug would show zero symptoms before
(and sometimes during) the triggering failure event.  When it happens,
there may be no indication of failure other than a whole lot of "parent
transid verify failed" and related errors that appear out of nowhere
during otherwise normal operation.  Such failures are often accompanied
by bus timeouts and resets, but it can also happen with no external
indication if there is no write in progress at the time of failure.

There are roughly 40 distinct block addresses affected in your check log,
clustered in two separate 256 MB blocks.  That looks fairly normal--the
btrfs allocator packs related updates close together to take advantage
of sequential write performance, so all of a dropped transaction's
writes would have nearby addresses.  The variation in transid is also
normal--the new tree pages refer to old tree pages of various ages and
replace old tree pages of various ages themselves.
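
As a rough sketch of one way to reproduce that count (assuming the check
output is saved in check.log and uses the usual "parent transid verify
failed on <bytenr> ..." wording; this only counts the transid lines),
something like the following groups the failing addresses into 256 MiB
slices:

	grep -oE 'transid verify failed on [0-9]+' check.log |
	  awk '{ addr[$NF]++; slice[int($NF / (256 * 1024 * 1024))]++ }
	       END { for (a in addr) n++; print n, "distinct addresses";
	             for (s in slice) print slice[s], "errors in slice", s }'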

The parent transid verify messages appear in identical pairs, which is
consistent with a dropped cache write on a single drive with dup metadata.
Btrfs will try to read both mirror copies if the first one is wrong.
Both writes occur at the same time, so both writes can be in the drive's
write cache when the contents of the drive's DRAM are lost.
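
For what it's worth, the metadata profile in use can be confirmed with
something like:

	btrfs filesystem df /

which reports a line such as "Metadata, DUP: ..." or "Metadata, single: ..."
for the mounted filesystem.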

The other errors are consistent with consequences of these failures.
Leaf pages of the metadata trees are no longer reachable because a path
to these leaf nodes, or the nodes themselves, do not exist on disk.
Items within the missing leaf pages (like csums, inodes, dirents, and
backrefs) cannot be found when references to these items from other
parts of the filesystem are checked, so you get a lot of missing
directory entries, backrefs, and csums.

Other failure modes can generate this pattern (e.g. host RAM corruption
and kernel bugs) but other failure modes often leave other indications
that are not present in these logs.  There's no information here
that points to some failure mode other than a write cache failure.
btrfs is designed to reliably detect write cache failures, so based on
the information so far I wouldn't suspect anything else.

> > > So, question: does the btrfs module use 256 MiB as a default size for
> > > write-back cache pages anywhere?
> > 
> > No.
> >
> > > Or might this problem reside deeper
> > > in the block storage system?
> >
> > So if possible, please test without dm-crypto to make sure it's really
> > btrfs causing the problem.
> 
> Since you don't find the 256 MiB size evocative of any component of
> btrfs, I am also willing to say that this likely isn't a fault in
> btrfs. So, I think this is now an exercise in ruling out btrfs
> completely (before I move on to investigating devicemapper, the crypto
> target, the NVMe driver, and the block storage system as a whole) and
> further studying the filesystem to gain insight as to how this
> happened.
> 
> > > Also, for what it's worth: the last generation to be written before
> > > these inconsistencies seems to be 66303. Files written as part of that
> > > transaction have a birth date of Sep. 7. That's during the same boot
> > > as the failure to read-only,
> >
> > That's interesting.
> >
> > Btrfs aborts transaction when it hits critical metadata problems to
> > prevent further damage, but I still remember some possible corruption
> > caused by aborting transaction in the old days.
> > But those corruptions should not exist in v5.14.
> 
> I understand "aborts" to mean, "if btrfs is in the middle of a write
> transaction, and in the process discovers existing metadata corruption
> (e.g. tree node generation mismatch) already on-disk, it will undo any
> earlier modifications already written to disk as part of that
> transaction" -- is this correct? If so, I don't believe there was any
> existing metadata corruption on the disk beforehand, so I doubt the
> abort mechanism is at fault here.
> 
> > Furthermore, the read-only flips itself is more important, do you have
> > the dmesg of that incident?
> 
> Sadly, I wasn't able to save the dmesg (for obvious reasons), but I do
> recall logs complaining of checksum failures and transid mismatches. I
> did make note of the block bytenrs, which were in the interval
> [1065332047872, 1065391243264] which itself appears to be part of the
> same 256 MiB "slice" as I found before.
> 
> > Recently we also see some corrupted free space cache causing problems.
> 
> I infer "problems" to mean, "free space cache indicates block X is
> free though it is not, so btrfs allocates a new tree node or extent at
> that location, overwriting necessary data"?
> 
> If so, I don't think this is the case here either. If an older node is
> overwritten with a newer one, I would expect the transid mismatches to
> be that the found IDs are *greater* than the wanted IDs. I am finding
> valid, but *older*, tree nodes in the affected region. This seems like
> a "lost writes" kind of problem to me, not overwrites.
> 
> However I did note some free space cache errors in my timeline.txt
> file over on the Github Gist. If you believe that this is still a
> likelihood, I can investigate further.
> 
> > If you're going to re-test this, mind to use space cache v2 to do the
> > tests?
> >
> > > which suggests that the cached buffer
> > > wasn't merely "not saved before a previous shutdown" but rather
> > > "discarded without being written back." Thoughts?
> >
> > This should not be a thing for btrfs.
> >
> > Btrfs COW protects its metadata, as long as there is no problem in btrfs
> > itself allocating new tree blocks to existing blocks in the same
> > transaction, we're completely safe.
> 
> Again, making sure I understand you correctly: Btrfs, by design,
> *never* modifies an active tree node/leaf. It will instead copy the
> node/leaf it wishes to modify into memory, perform the modifications
> there, save the modified copy at a free location on disk, and then
> recursively modify parents (using the same process) all the way back
> to the superblock(s).
> 
> And so, if power is lost in the middle of this process (i.e. before
> the superblock is updated), then the old paths down the trees are
> intact. The transaction is lost, but existing metadata is protected.
> Correct?

If btrfs stops writing to the filesystem immediately, either because a
power failure stopped the kernel running, or a metadata integrity failure
error was detected, then existing metadata is protected.

If btrfs does _not_ stop writing immediately (e.g. the disk does not
notify btrfs of write failure) and commits a transaction after writes are
dropped, then metadata will be destroyed by later transactions because
they will assume the previously written data is present on disk when it
is not.

> > But considering you're reporting some RO incidents, then I guess the COW
> > mechanism may be broken at that time already.
> > (Maybe abort leads to some corrupted free space cache?)
> 
> Hypothetically, suppose one were to try this experiment:
> 1. Set up a brand-new machine, which has 2 independent but equally-sized disks.
> 2. On disk A, create a btrfs and mount it. Use it as a normal rootfs
> for a while, and then unmount it.
> 3. Image disk A to disk B (perhaps obfuscating the image so that it
> doesn't confuse btrfs's RAID logic).
> 4. Remount disk A and begin using it heavily (reads and writes). At a
> random moment, instantly copy a 256 MiB slice from disk B onto disk A,
> choosing an offset that has recently been written on disk A, while
> disk A is still mounted.
> 
> Would this cause csum failures, transid mismatches, a RO incident, and
> the general output from "btrfs check" that I posted?

It would, but in far larger quantities (thousands of blocks instead of
just a few dozen).

If you have two disks you could run btrfs raid1 on them.  Failures on
one disk will be corrected by reading from the other.  Pairing disks by
different vendors in large numbers is an excellent way to quickly learn
which models are bad.
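
As a sketch, assuming a second disk at /dev/sdb and the filesystem mounted
at /mnt (both placeholders), converting an existing single-device
filesystem to raid1 would look roughly like:

	btrfs device add /dev/sdb /mnt
	btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt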

> If so, I think the problem is that a write-back cache, somewhere
> deeper in the block storage stack (i.e. not btrfs) simply...
> *discarded* a contiguous 256 MiB worth of data that btrfs thought had
> been successfully written, and the RO incident was due to btrfs
> suddenly seeing old data resurface. Does this seem like the best
> working hypothesis currently?

No, we would expect _much_ more damage than your check run reported.

> Also: Per your expert opinion, what would be the best way to repair
> disk A (without the benefit of disk B, heh) in the experiment above? I
> imagine the affected slice itself is not salvageable, and will have to
> be forcibly marked as free space, and in the process throwing out any
> references to tree nodes that happened to reside there, meaning I'll
> lose a whole bunch of extents, inodes, csum tree entries, and so on...
> and in turn I'd lose a random assortment of files. I'm actually okay
> with that, because I can diff the surviving files against my last
> backup and restore whatever was deleted.

You'd have to reinsert all the surviving (and up to date) leaf nodes
into new trees, then remove any dangling references, then verify all
the surviving data.  That's more total IO than mkfs + restore, and
requires significant memory and CPU as well.

> Or is the whole thing FUBAR at this point and I should just zero the
> partition and restore from backups?

If you have backups, it's usually faster to restore them.
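
If you want to pull surviving files out of the image first, btrfs restore
can usually do that read-only; a rough sketch, where /dev/mapper/imgcopy
stands in for the decrypted copy of the filesystem and the destination
directory is an example:

	btrfs restore -D -i -v /dev/mapper/imgcopy /tmp/unused     # dry run, just lists files
	btrfs restore -i -v /dev/mapper/imgcopy /mnt/recovered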

I'd put that drive on probation for a while--maybe use it only in raid1
setups, or turn off its write cache, or run a stress test for a while
to see if it fails again.
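
For an NVMe drive, the volatile write cache can be toggled with nvme-cli,
roughly as follows (controller path is an example; the setting may not
survive a power cycle unless the drive supports saving it):

	nvme get-feature /dev/nvme0 -f 0x6        # feature 0x6 = volatile write cache
	nvme set-feature /dev/nvme0 -f 0x6 -v 0   # 0 = disable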

Certainly if you suspect the 5.14 kernel then it would be worthwhile to
retest the drive with an older kernel, or test different drives with a
newer kernel.  If you can reproduce the issue with multiple drives that
is always an interesting bug report.

We want to know your drive model and firmware revision.  If everyone who
owns one of these drives reports the same problem, we know the problem
is in the drive firmware.  If we see the problem across multiple drives
then we know it's somewhere else.

> Thank you for your time,
> Sam
> 
> > Thanks,
> > Qu
> >
> > >
> > > Cheers,
> > > Sam
> > >
> > >
> > > On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards <cfsworks@gmail.com> wrote:
> > >>
> > >> Hello list,
> > >>
> > >> First, I should say that there's no urgency here on my part.
> > >> Everything important is very well backed up, and even the
> > >> "unimportant" files (various configs) seem readable. I imaged the
> > >> partition without even attempting a repair. Normally, my inclination
> > >> would be to shrug this off and recreate the filesystem.
> > >>
> > >> However, I'd like to help investigate the root cause, because:
> > >> 1. This happened suspiciously soon (see my timeline in the link below)
> > >> after upgrading to kernel 5.14.1, so may be a serious regression.
> > >> 2. The filesystem was created less than 5 weeks ago, so the possible
> > >> causes are relatively few.
> > >> 3. My last successful btrfs scrub was just before upgrading to 5.14.1,
> > >> hopefully narrowing possible root causes even more.
> > >> 4. I have imaged the partition and am thus willing to attempt risky
> > >> experimental repairs. (Mostly for the sake of reporting if they work.)
> > >>
> > >> Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
> > >> OS: Gentoo
> > >> Earliest kernel ever used: 5.10.52-gentoo
> > >> First kernel version used for "real" usage: 5.13.8
> > >> Relevant information: See my Gist,
> > >> https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
> > >> Misc. notes: I have run "fstrim /" on occasion, but don't have
> > >> discards enabled automatically. I doubt TRIM is the culprit, but I
> > >> can't rule it out.
> > >>
> > >> My primary hypothesis is that there's some write bug in Linux 5.14.1.
> > >> I installed some package updates right before btrfs detected the
> > >> problem, and most of the files in the `btrfs check` output look like
> > >> they were created as part of those updates.
> > >>
> > >> My secondary hypothesis is that creating and/or using the swapfile
> > >> caused some kind of silent corruption that didn't become a detectable
> > >> issue until several further writes later.
> > >>
> > >> Let me know if there's anything else I should try/provide!
> > >>
> > >> Regards,
> > >> Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-11  4:24       ` Zygo Blaxell
@ 2021-09-11  5:07         ` Sam Edwards
  2021-09-11  7:38           ` Forza
  2021-09-11 16:56           ` Zygo Blaxell
  0 siblings, 2 replies; 17+ messages in thread
From: Sam Edwards @ 2021-09-11  5:07 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Qu Wenruo, linux-btrfs

Pretty much everything you've said scans with what I'm seeing over here.

Sadly I can't easily replace, remove, or augment the drive. It's
inside a laptop that doesn't look like it was designed for user
serviceability. I don't even have any assurances that it isn't
soldered to the board.

I included the SMART information when I posted my Gist. Take whatever
info you need.

Best,
Sam


On Fri, Sep 10, 2021 at 10:24 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Fri, Sep 10, 2021 at 01:44:51AM -0600, Sam Edwards wrote:
> > Hi Qu,
> >
> > Thank you very much for your insight. It is also helpful to know that
> > I appear to be understanding the btrfs internals properly. :)
> >
> > On Fri, Sep 10, 2021 at 12:17 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > > But for SSD/HDD, we're more cautious about any transid error.
> > >
> > > There are already known bad disks which don't follow FLUSH requests,
> > > thus leads to super easy transid mismatch on powerloss:
> > >
> > > https://btrfs.wiki.kernel.org/index.php/Hardware_bugs
> >
> > I strongly believe now that the SSD hardware/firmware is not the
> > issue. Not only was the last "power loss"-type event on the 5th (which
> > appears to have caused no damage), the issue appeared while the system
> > was running -- a SSD firmware problem of that magnitude should have
> > been apparent long before I upgraded to 5.14.1.
>
> Have you reported the model and firmware revision of the SSD?
> This is critical information for identifying firmware problems.
>
> Write cache firmware failures are not only triggered by power losses.
> They can be triggered by underlying media failures and SSD embedded DRAM
> failures.  This kind of firmware bug would show zero symptoms before
> (and sometimes during) the triggering failure event.  When it happens,
> there may be no indication of failure other than a whole lot of "parent
> transid verify failed" and related errors that appear out of nowhere
> during otherwise normal operation.  Such failures are often accompanied
> by bus timeouts and resets, but it can also happen with no external
> indication if there is no write in progress at the time of failure.
>
> There are roughly 40 distinct block addresses affected in your check log,
> clustered in two separate 256 MB blocks.  That looks fairly normal--the
> btrfs allocator packs related updates close together to take advantage
> of sequential write performance, so all of a dropped transaction's
> writes would have nearby addresses.  The variation in transid is also
> normal--the new tree pages refer to old tree pages of various ages and
> replace old tree pages of various ages themselves.
>
> The parent transid verify messages appear in identical pairs, which is
> consistent with a dropped cache write on a single drive with dup metadata.
> Btrfs will try to read both mirror copies if the first one is wrong.
> Both writes occur at the same time, so both writes can be in the drive's
> write cache when the contents of the drive's DRAM are lost.
>
> The other errors are consistent with consequences of these failures.
> Leaf pages of the metadata trees are no longer reachable because a path
> to these leaf nodes, or the nodes themselves, do not exist on disk.
> Items within the missing leaf pages (like csums, inodes, dirents, and
> backrefs) cannot be found when references to these items from other
> parts of the filesystem are checked, so you get a lot of missing
> directory entries, backrefs, and csums.
>
> Other failure modes can generate this pattern (e.g. host RAM corruption
> and kernel bugs) but other failure modes often leave other indications
> that are not present in these logs.  There's no information here
> that points to some failure mode other than a write cache failure.
> btrfs is designed to reliably detect write cache failures, so based on
> the information so far I wouldn't suspect anything else.
>
> > > > So, question: does the btrfs module use 256 MiB as a default size for
> > > > write-back cache pages anywhere?
> > >
> > > No.
> > >
> > > > Or might this problem reside deeper
> > > > in the block storage system?
> > >
> > > So if possible, please test without dm-crypto to make sure it's really
> > > btrfs causing the problem.
> >
> > Since you don't find the 256 MiB size evocative of any component of
> > btrfs, I am also willing to say that this likely isn't a fault in
> > btrfs. So, I think this is now an exercise in ruling out btrfs
> > completely (before I move on to investigating devicemapper, the crypto
> > target, the NVMe driver, and the block storage system as a whole) and
> > further studying the filesystem to gain insight as to how this
> > happened.
> >
> > > > Also, for what it's worth: the last generation to be written before
> > > > these inconsistencies seems to be 66303. Files written as part of that
> > > > transaction have a birth date of Sep. 7. That's during the same boot
> > > > as the failure to read-only,
> > >
> > > That's interesting.
> > >
> > > Btrfs aborts transaction when it hits critical metadata problems to
> > > prevent further damage, but I still remember some possible corruption
> > > caused by aborting transaction in the old days.
> > > But those corruptions should not exist in v5.14.
> >
> > I understand "aborts" to mean, "if btrfs is in the middle of a write
> > transaction, and in the process discovers existing metadata corruption
> > (e.g. tree node generation mismatch) already on-disk, it will undo any
> > earlier modifications already written to disk as part of that
> > transaction" -- is this correct? If so, I don't believe there was any
> > existing metadata corruption on the disk beforehand, so I doubt the
> > abort mechanism is at fault here.
> >
> > > Furthermore, the read-only flips itself is more important, do you have
> > > the dmesg of that incident?
> >
> > Sadly, I wasn't able to save the dmesg (for obvious reasons), but I do
> > recall logs complaining of checksum failures and transid mismatches. I
> > did make note of the block bytenrs, which were in the interval
> > [1065332047872, 1065391243264] which itself appears to be part of the
> > same 256 MiB "slice" as I found before.
> >
> > > Recently we also see some corrupted free space cache causing problems.
> >
> > I infer "problems" to mean, "free space cache indicates block X is
> > free though it is not, so btrfs allocates a new tree node or extent at
> > that location, overwriting necessary data"?
> >
> > If so, I don't think this is the case here either. If an older node is
> > overwritten with a newer one, I would expect the transid mismatches to
> > be that the found IDs are *greater* than the wanted IDs. I am finding
> > valid, but *older*, tree nodes in the affected region. This seems like
> > a "lost writes" kind of problem to me, not overwrites.
> >
> > However I did note some free space cache errors in my timeline.txt
> > file over on the Github Gist. If you believe that this is still a
> > likelihood, I can investigate further.
> >
> > > If you're going to re-test this, mind to use space cache v2 to do the
> > > tests?
> > >
> > > > which suggests that the cached buffer
> > > > wasn't merely "not saved before a previous shutdown" but rather
> > > > "discarded without being written back." Thoughts?
> > >
> > > This should not be a thing for btrfs.
> > >
> > > Btrfs COW protects its metadata, as long as there is no problem in btrfs
> > > itself allocating new tree blocks to existing blocks in the same
> > > transaction, we're completely safe.
> >
> > Again, making sure I understand you correctly: Btrfs, by design,
> > *never* modifies an active tree node/leaf. It will instead copy the
> > node/leaf it wishes to modify into memory, perform the modifications
> > there, save the modified copy at a free location on disk, and then
> > recursively modify parents (using the same process) all the way back
> > to the superblock(s).
> >
> > And so, if power is lost in the middle of this process (i.e. before
> > the superblock is updated), then the old paths down the trees are
> > intact. The transaction is lost, but existing metadata is protected.
> > Correct?
>
> If btrfs stops writing to the filesystem immediately, either because a
> power failure stopped the kernel running, or a metadata integrity failure
> error was detected, then existing metadata is protected.
>
> If btrfs does _not_ stop writing immediately (e.g. the disk does not
> notify btrfs of write failure) and commits a transaction after writes are
> dropped, then metadata will be destroyed by later transactions because
> they will assume the previously written data is present on disk when it
> is not.
>
> > > But considering you're reporting some RO incidents, then I guess the COW
> > > mechanism may be broken at that time already.
> > > (Maybe abort leads to some corrupted free space cache?)
> >
> > Hypothetically, suppose one were to try this experiment:
> > 1. Set up a brand-new machine, which has 2 independent but equally-sized disks.
> > 2. On disk A, create a btrfs and mount it. Use it as a normal rootfs
> > for a while, and then unmount it.
> > 3. Image disk A to disk B (perhaps obfuscating the image so that it
> > doesn't confuse btrfs's RAID logic).
> > 4. Remount disk A and begin using it heavily (reads and writes). At a
> > random moment, instantly copy a 256 MiB slice from disk B onto disk A,
> > choosing an offset that has recently been written on disk A, while
> > disk A is still mounted.
> >
> > Would this cause csum failures, transid mismatches, a RO incident, and
> > the general output from "btrfs check" that I posted?
>
> It would, but in far larger quantities (thousands of blocks instead of
> just a few dozen).
>
> If you have two disks you could run btrfs raid1 on them.  Failures on
> one disk will be corrected by reading from the other.  Pairing disks by
> different vendors in large numbers is an excellent way to quickly learn
> which models are bad.
>
> > If so, I think the problem is that a write-back cache, somewhere
> > deeper in the block storage stack (i.e. not btrfs) simply...
> > *discarded* a contiguous 256 MiB worth of data that btrfs thought had
> > been successfully written, and the RO incident was due to btrfs
> > suddenly seeing old data resurface. Does this seem like the best
> > working hypothesis currently?
>
> No, we would expect _much_ more damage than your check run reported.
>
> > Also: Per your expert opinion, what would be the best way to repair
> > disk A (without the benefit of disk B, heh) in the experiment above? I
> > imagine the affected slice itself is not salvageable, and will have to
> > be forcibly marked as free space, and in the process throwing out any
> > references to tree nodes that happened to reside there, meaning I'll
> > lose a whole bunch of extents, inodes, csum tree entries, and so on...
> > and in turn I'd lose a random assortment of files. I'm actually okay
> > with that, because I can diff the surviving files against my last
> > backup and restore whatever was deleted.
>
> You'd have to reinsert all the surviving (and up to date) leaf nodes
> into new trees, then remove any dangling references, then verify all
> the surviving data.  That's more total IO than mkfs + restore, and
> requires significant memory and CPU as well.
>
> > Or is the whole thing FUBAR at this point and I should just zero the
> > partition and restore from backups?
>
> If you have backups, it's usually faster to restore them.
>
> I'd put that drive on probation for a while--maybe use it only in raid1
> setups, or turn off its write cache, or run a stress test for a while
> to see if it fails again.
>
> Certainly if you suspect the 5.14 kernel then it would be worthwhile to
> retest the drive with an older kernel, or test different drives with a
> newer kernel.  If you can reproduce the issue with multiple drives that
> is always an interesting bug report.
>
> We want to know your drive model and firmware revision.  If everyone who
> owns one of these drives reports the same problem, we know the problem
> is in the drive firmware.  If we see the problem across multiple drives
> then we know it's somewhere else.
>
> > Thank you for your time,
> > Sam
> >
> > > Thanks,
> > > Qu
> > >
> > > >
> > > > Cheers,
> > > > Sam
> > > >
> > > >
> > > > On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards <cfsworks@gmail.com> wrote:
> > > >>
> > > >> Hello list,
> > > >>
> > > >> First, I should say that there's no urgency here on my part.
> > > >> Everything important is very well backed up, and even the
> > > >> "unimportant" files (various configs) seem readable. I imaged the
> > > >> partition without even attempting a repair. Normally, my inclination
> > > >> would be to shrug this off and recreate the filesystem.
> > > >>
> > > >> However, I'd like to help investigate the root cause, because:
> > > >> 1. This happened suspiciously soon (see my timeline in the link below)
> > > >> after upgrading to kernel 5.14.1, so may be a serious regression.
> > > >> 2. The filesystem was created less than 5 weeks ago, so the possible
> > > >> causes are relatively few.
> > > >> 3. My last successful btrfs scrub was just before upgrading to 5.14.1,
> > > >> hopefully narrowing possible root causes even more.
> > > >> 4. I have imaged the partition and am thus willing to attempt risky
> > > >> experimental repairs. (Mostly for the sake of reporting if they work.)
> > > >>
> > > >> Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
> > > >> OS: Gentoo
> > > >> Earliest kernel ever used: 5.10.52-gentoo
> > > >> First kernel version used for "real" usage: 5.13.8
> > > >> Relevant information: See my Gist,
> > > >> https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
> > > >> Misc. notes: I have run "fstrim /" on occasion, but don't have
> > > >> discards enabled automatically. I doubt TRIM is the culprit, but I
> > > >> can't rule it out.
> > > >>
> > > >> My primary hypothesis is that there's some write bug in Linux 5.14.1.
> > > >> I installed some package updates right before btrfs detected the
> > > >> problem, and most of the files in the `btrfs check` output look like
> > > >> they were created as part of those updates.
> > > >>
> > > >> My secondary hypothesis is that creating and/or using the swapfile
> > > >> caused some kind of silent corruption that didn't become a detectable
> > > >> issue until several further writes later.
> > > >>
> > > >> Let me know if there's anything else I should try/provide!
> > > >>
> > > >> Regards,
> > > >> Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-11  5:07         ` Sam Edwards
@ 2021-09-11  7:38           ` Forza
  2021-09-11 16:56           ` Zygo Blaxell
  1 sibling, 0 replies; 17+ messages in thread
From: Forza @ 2021-09-11  7:38 UTC (permalink / raw)
  To: Sam Edwards, Zygo Blaxell; +Cc: Qu Wenruo, linux-btrfs

You could disable the write cache with "hdparm -W0". This has helped many users prevent parent transid errors. One thing to note is that the write cache setting isn't permanent, so you'd have to set up a cron job or similar to reissue it regularly to be sure it stays active.
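
A minimal sketch for a SATA disk (device name is an example; an NVMe drive
would need nvme-cli's set-feature for the volatile write cache instead):

	hdparm -W0 /dev/sda

and, to reapply it after every boot, a root crontab entry along the lines of:

	@reboot /sbin/hdparm -W0 /dev/sda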

Avoid mounting with the discard option. Use fstrim manually instead if possible. Queued trims have issues with some firmware.

Also, if possible, use swap partitions instead of swap files. 
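
If a swap file has to stay on btrfs, the usual approach (path and size are
examples) is to mark it NOCOW before allocating it:

	truncate -s 0 /swapfile
	chattr +C /swapfile
	fallocate -l 8G /swapfile
	chmod 600 /swapfile
	mkswap /swapfile
	swapon /swapfile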

---- From: Sam Edwards <cfsworks@gmail.com> -- Sent: 2021-09-11 - 07:07 ----

> Pretty much everything you've said scans with what I'm seeing over here.
> 
> Sadly I can't easily replace, remove, or augment the drive. It's
> inside a laptop that doesn't look like it was designed for user
> serviceability. I don't even have any assurances that it isn't
> soldered to the board.
> 
> I included the SMART information when I posted my Gist. Take whatever
> info you need.
> 
> Best,
> Sam
> 
> 
> On Fri, Sep 10, 2021 at 10:24 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
>>
>> On Fri, Sep 10, 2021 at 01:44:51AM -0600, Sam Edwards wrote:
>> > Hi Qu,
>> >
>> > Thank you very much for your insight. It is also helpful to know that
>> > I appear to be understanding the btrfs internals properly. :)
>> >
>> > On Fri, Sep 10, 2021 at 12:17 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> > > But for SSD/HDD, we're more cautious about any transid error.
>> > >
>> > > There are already known bad disks which don't follow FLUSH requests,
>> > > thus leads to super easy transid mismatch on powerloss:
>> > >
>> > > https://btrfs.wiki.kernel.org/index.php/Hardware_bugs
>> >
>> > I strongly believe now that the SSD hardware/firmware is not the
>> > issue. Not only was the last "power loss"-type event on the 5th (which
>> > appears to have caused no damage), the issue appeared while the system
>> > was running -- a SSD firmware problem of that magnitude should have
>> > been apparent long before I upgraded to 5.14.1.
>>
>> Have you reported the model and firmware revision of the SSD?
>> This is critical information for identifying firmware problems.
>>
>> Write cache firmware failures are not only triggered by power losses.
>> They can be triggered by underlying media failures and SSD embedded DRAM
>> failures.  This kind of firmware bug would show zero symptoms before
>> (and sometimes during) the triggering failure event.  When it happens,
>> there may be no indication of failure other than a whole lot of "parent
>> transid verify failed" and related errors that appear out of nowhere
>> during otherwise normal operation.  Such failures are often accompanied
>> by bus timeouts and resets, but it can also happen with no external
>> indication if there is no write in progress at the time of failure.
>>
>> There are roughly 40 distinct block addresses affected in your check log,
>> clustered in two separate 256 MB blocks.  That looks fairly normal--the
>> btrfs allocator packs related updates close together to take advantage
>> of sequential write performance, so all of a dropped transaction's
>> writes would have nearby addresses.  The variation in transid is also
>> normal--the new tree pages refer to old tree pages of various ages and
>> replace old tree pages of various ages themselves.
>>
>> The parent transid verify messages appear in identical pairs, which is
>> consistent with a dropped cache write on a single drive with dup metadata.
>> Btrfs will try to read both mirror copies if the first one is wrong.
>> Both writes occur at the same time, so both writes can be in the drive's
>> write cache when the contents of the drive's DRAM are lost.
>>
>> The other errors are consistent with consequences of these failures.
>> Leaf pages of the metadata trees are no longer reachable because a path
>> to these leaf nodes, or the nodes themselves, do not exist on disk.
>> Items within the missing leaf pages (like csums, inodes, dirents, and
>> backrefs) cannot be found when references to these items from other
>> parts of the filesystem are checked, so you get a lot of missing
>> directory entries, backrefs, and csums.
>>
>> Other failure modes can generate this pattern (e.g. host RAM corruption
>> and kernel bugs) but other failure modes often leave other indications
>> that are not present in these logs.  There's no information here
>> that points to some failure mode other than a write cache failure.
>> btrfs is designed to reliably detect write cache failures, so based on
>> the information so far I wouldn't suspect anything else.
>>
>> > > > So, question: does the btrfs module use 256 MiB as a default size for
>> > > > write-back cache pages anywhere?
>> > >
>> > > No.
>> > >
>> > > > Or might this problem reside deeper
>> > > > in the block storage system?
>> > >
>> > > So if possible, please test without dm-crypto to make sure it's really
>> > > btrfs causing the problem.
>> >
>> > Since you don't find the 256 MiB size evocative of any component of
>> > btrfs, I am also willing to say that this likely isn't a fault in
>> > btrfs. So, I think this is now an exercise in ruling out btrfs
>> > completely (before I move on to investigating devicemapper, the crypto
>> > target, the NVMe driver, and the block storage system as a whole) and
>> > further studying the filesystem to gain insight as to how this
>> > happened.
>> >
>> > > > Also, for what it's worth: the last generation to be written before
>> > > > these inconsistencies seems to be 66303. Files written as part of that
>> > > > transaction have a birth date of Sep. 7. That's during the same boot
>> > > > as the failure to read-only,
>> > >
>> > > That's interesting.
>> > >
>> > > Btrfs aborts transaction when it hits critical metadata problems to
>> > > prevent further damage, but I still remember some possible corruption
>> > > caused by aborting transaction in the old days.
>> > > But those corruptions should not exist in v5.14.
>> >
>> > I understand "aborts" to mean, "if btrfs is in the middle of a write
>> > transaction, and in the process discovers existing metadata corruption
>> > (e.g. tree node generation mismatch) already on-disk, it will undo any
>> > earlier modifications already written to disk as part of that
>> > transaction" -- is this correct? If so, I don't believe there was any
>> > existing metadata corruption on the disk beforehand, so I doubt the
>> > abort mechanism is at fault here.
>> >
>> > > Furthermore, the read-only flips itself is more important, do you have
>> > > the dmesg of that incident?
>> >
>> > Sadly, I wasn't able to save the dmesg (for obvious reasons), but I do
>> > recall logs complaining of checksum failures and transid mismatches. I
>> > did make note of the block bytenrs, which were in the interval
>> > [1065332047872, 1065391243264] which itself appears to be part of the
>> > same 256 MiB "slice" as I found before.
>> >
>> > > Recently we also see some corrupted free space cache causing problems.
>> >
>> > I infer "problems" to mean, "free space cache indicates block X is
>> > free though it is not, so btrfs allocates a new tree node or extent at
>> > that location, overwriting necessary data"?
>> >
>> > If so, I don't think this is the case here either. If an older node is
>> > overwritten with a newer one, I would expect the transid mismatches to
>> > be that the found IDs are *greater* than the wanted IDs. I am finding
>> > valid, but *older*, tree nodes in the affected region. This seems like
>> > a "lost writes" kind of problem to me, not overwrites.
>> >
>> > However I did note some free space cache errors in my timeline.txt
>> > file over on the Github Gist. If you believe that this is still a
>> > likelihood, I can investigate further.
>> >
>> > > If you're going to re-test this, mind to use space cache v2 to do the
>> > > tests?
>> > >
>> > > > which suggests that the cached buffer
>> > > > wasn't merely "not saved before a previous shutdown" but rather
>> > > > "discarded without being written back." Thoughts?
>> > >
>> > > This should not be a thing for btrfs.
>> > >
>> > > Btrfs COW protects its metadata, as long as there is no problem in btrfs
>> > > itself allocating new tree blocks to existing blocks in the same
>> > > transaction, we're completely safe.
>> >
>> > Again, making sure I understand you correctly: Btrfs, by design,
>> > *never* modifies an active tree node/leaf. It will instead copy the
>> > node/leaf it wishes to modify into memory, perform the modifications
>> > there, save the modified copy at a free location on disk, and then
>> > recursively modify parents (using the same process) all the way back
>> > to the superblock(s).
>> >
>> > And so, if power is lost in the middle of this process (i.e. before
>> > the superblock is updated), then the old paths down the trees are
>> > intact. The transaction is lost, but existing metadata is protected.
>> > Correct?
>>
>> If btrfs stops writing to the filesystem immediately, either because a
>> power failure stopped the kernel running, or a metadata integrity failure
>> error was detected, then existing metadata is protected.
>>
>> If btrfs does _not_ stop writing immediately (e.g. the disk does not
>> notify btrfs of write failure) and commits a transaction after writes are
>> dropped, then metadata will be destroyed by later transactions because
>> they will assume the previously written data is present on disk when it
>> is not.
>>
>> > > But considering you're reporting some RO incidents, then I guess the COW
>> > > mechanism may be broken at that time already.
>> > > (Maybe abort leads to some corrupted free space cache?)
>> >
>> > Hypothetically, suppose one were to try this experiment:
>> > 1. Set up a brand-new machine, which has 2 independent but equally-sized disks.
>> > 2. On disk A, create a btrfs and mount it. Use it as a normal rootfs
>> > for a while, and then unmount it.
>> > 3. Image disk A to disk B (perhaps obfuscating the image so that it
>> > doesn't confuse btrfs's RAID logic).
>> > 4. Remount disk A and begin using it heavily (reads and writes). At a
>> > random moment, instantly copy a 256 MiB slice from disk B onto disk A,
>> > choosing an offset that has recently been written on disk A, while
>> > disk A is still mounted.
>> >
>> > Would this cause csum failures, transid mismatches, a RO incident, and
>> > the general output from "btrfs check" that I posted?
>>
>> It would, but in far larger quantities (thousands of blocks instead of
>> just a few dozen).
>>
>> If you have two disks you could run btrfs raid1 on them.  Failures on
>> one disk will be corrected by reading from the other.  Pairing disks by
>> different vendors in large numbers is an excellent way to quickly learn
>> which models are bad.
>>
>> > If so, I think the problem is that a write-back cache, somewhere
>> > deeper in the block storage stack (i.e. not btrfs) simply...
>> > *discarded* a contiguous 256 MiB worth of data that btrfs thought had
>> > been successfully written, and the RO incident was due to btrfs
>> > suddenly seeing old data resurface. Does this seem like the best
>> > working hypothesis currently?
>>
>> No, we would expect _much_ more damage than your check run reported.
>>
>> > Also: Per your expert opinion, what would be the best way to repair
>> > disk A (without the benefit of disk B, heh) in the experiment above? I
>> > imagine the affected slice itself is not salvageable, and will have to
>> > be forcibly marked as free space, and in the process throwing out any
>> > references to tree nodes that happened to reside there, meaning I'll
>> > lose a whole bunch of extents, inodes, csum tree entries, and so on...
>> > and in turn I'd lose a random assortment of files. I'm actually okay
>> > with that, because I can diff the surviving files against my last
>> > backup and restore whatever was deleted.
>>
>> You'd have to reinsert all the surviving (and up to date) leaf nodes
>> into new trees, then remove any dangling references, then verify all
>> the surviving data.  That's more total IO than mkfs + restore, and
>> requires significant memory and CPU as well.
>>
>> > Or is the whole thing FUBAR at this point and I should just zero the
>> > partition and restore from backups?
>>
>> If you have backups, it's usually faster to restore them.
>>
>> I'd put that drive on probation for a while--maybe use it only in raid1
>> setups, or turn off its write cache, or run a stress test for a while
>> to see if it fails again.
>>
>> Certainly if you suspect the 5.14 kernel then it would be worthwhile to
>> retest the drive with an older kernel, or test different drives with a
>> newer kernel.  If you can reproduce the issue with multiple drives that
>> is always an interesting bug report.
>>
>> We want to know your drive model and firmware revision.  If everyone who
>> owns one of these drives reports the same problem, we know the problem
>> is in the drive firmware.  If we see the problem across multiple drives
>> then we know it's somewhere else.
>>
>> > Thank you for your time,
>> > Sam
>> >
>> > > Thanks,
>> > > Qu
>> > >
>> > > >
>> > > > Cheers,
>> > > > Sam
>> > > >
>> > > >
>> > > > On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards <cfsworks@gmail.com> wrote:
>> > > >>
>> > > >> Hello list,
>> > > >>
>> > > >> First, I should say that there's no urgency here on my part.
>> > > >> Everything important is very well backed up, and even the
>> > > >> "unimportant" files (various configs) seem readable. I imaged the
>> > > >> partition without even attempting a repair. Normally, my inclination
>> > > >> would be to shrug this off and recreate the filesystem.
>> > > >>
>> > > >> However, I'd like to help investigate the root cause, because:
>> > > >> 1. This happened suspiciously soon (see my timeline in the link below)
>> > > >> after upgrading to kernel 5.14.1, so may be a serious regression.
>> > > >> 2. The filesystem was created less than 5 weeks ago, so the possible
>> > > >> causes are relatively few.
>> > > >> 3. My last successful btrfs scrub was just before upgrading to 5.14.1,
>> > > >> hopefully narrowing possible root causes even more.
>> > > >> 4. I have imaged the partition and am thus willing to attempt risky
>> > > >> experimental repairs. (Mostly for the sake of reporting if they work.)
>> > > >>
>> > > >> Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
>> > > >> OS: Gentoo
>> > > >> Earliest kernel ever used: 5.10.52-gentoo
>> > > >> First kernel version used for "real" usage: 5.13.8
>> > > >> Relevant information: See my Gist,
>> > > >> https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
>> > > >> Misc. notes: I have run "fstrim /" on occasion, but don't have
>> > > >> discards enabled automatically. I doubt TRIM is the culprit, but I
>> > > >> can't rule it out.
>> > > >>
>> > > >> My primary hypothesis is that there's some write bug in Linux 5.14.1.
>> > > >> I installed some package updates right before btrfs detected the
>> > > >> problem, and most of the files in the `btrfs check` output look like
>> > > >> they were created as part of those updates.
>> > > >>
>> > > >> My secondary hypothesis is that creating and/or using the swapfile
>> > > >> caused some kind of silent corruption that didn't become a detectable
>> > > >> issue until several further writes later.
>> > > >>
>> > > >> Let me know if there's anything else I should try/provide!
>> > > >>
>> > > >> Regards,
>> > > >> Sam



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-11  5:07         ` Sam Edwards
  2021-09-11  7:38           ` Forza
@ 2021-09-11 16:56           ` Zygo Blaxell
  2021-09-12  6:12             ` Sam Edwards
  1 sibling, 1 reply; 17+ messages in thread
From: Zygo Blaxell @ 2021-09-11 16:56 UTC (permalink / raw)
  To: Sam Edwards; +Cc: Qu Wenruo, linux-btrfs

On Fri, Sep 10, 2021 at 11:07:04PM -0600, Sam Edwards wrote:
> Pretty much everything you've said scans with what I'm seeing over here.
>
> Sadly I can't easily replace, remove, or augment the drive. It's
> inside a laptop that doesn't look like it was designed for user
> serviceability. I don't even have any assurances that it isn't
> soldered to the board.
>
> I included the SMART information when I posted my Gist. Take whatever
> info you need.

For the list, that data is:

	Model Number:                       NVMe CA6-8D1024
	Serial Number:                      <redacted>
	Firmware Version:                   ERA0901
	PCI Vendor ID:                      0x1e95
	PCI Vendor Subsystem ID:            0x1dbe
	Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]

It's not one I've seen previously reported, but there's a huge variety
of SSD firmware in the field.

> Best,
> Sam
>
>
> On Fri, Sep 10, 2021 at 10:24 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Fri, Sep 10, 2021 at 01:44:51AM -0600, Sam Edwards wrote:
> > > Hi Qu,
> > >
> > > Thank you very much for your insight. It is also helpful to know that
> > > I appear to be understanding the btrfs internals properly. :)
> > >
> > > On Fri, Sep 10, 2021 at 12:17 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > > > But for SSD/HDD, we're more cautious about any transid error.
> > > >
> > > > There are already known bad disks which don't follow FLUSH requests,
> > > > thus leads to super easy transid mismatch on powerloss:
> > > >
> > > > https://btrfs.wiki.kernel.org/index.php/Hardware_bugs
> > >
> > > I strongly believe now that the SSD hardware/firmware is not the
> > > issue. Not only was the last "power loss"-type event on the 5th (which
> > > appears to have caused no damage), the issue appeared while the system
> > > was running -- a SSD firmware problem of that magnitude should have
> > > been apparent long before I upgraded to 5.14.1.
> >
> > Have you reported the model and firmware revision of the SSD?
> > This is critical information for identifying firmware problems.
> >
> > Write cache firmware failures are not only triggered by power losses.
> > They can be triggered by underlying media failures and SSD embedded DRAM
> > failures.  This kind of firmware bug would show zero symptoms before
> > (and sometimes during) the triggering failure event.  When it happens,
> > there may be no indication of failure other than a whole lot of "parent
> > transid verify failed" and related errors that appear out of nowhere
> > during otherwise normal operation.  Such failures are often accompanied
> > by bus timeouts and resets, but it can also happen with no external
> > indication if there is no write in progress at the time of failure.
> >
> > There are roughly 40 distinct block addresses affected in your check log,
> > clustered in two separate 256 MB blocks.  That looks fairly normal--the
> > btrfs allocator packs related updates close together to take advantage
> > of sequential write performance, so all of a dropped transaction's
> > writes would have nearby addresses.  The variation in transid is also
> > normal--the new tree pages refer to old tree pages of various ages and
> > replace old tree pages of various ages themselves.
> >
> > The parent transid verify messages appear in identical pairs, which is
> > consistent with a dropped cache write on a single drive with dup metadata.
> > Btrfs will try to read both mirror copies if the first one is wrong.
> > Both writes occur at the same time, so both writes can be in the drive's
> > write cache when the contents of the drive's DRAM are lost.
> >
> > The other errors are consistent with consequences of these failures.
> > Leaf pages of the metadata trees are no longer reachable because a path
> > to these leaf nodes, or the nodes themselves, do not exist on disk.
> > Items within the missing leaf pages (like csums, inodes, dirents, and
> > backrefs) cannot be found when references to these items from other
> > parts of the filesystem are checked, so you get a lot of missing
> > directory entries, backrefs, and csums.
> >
> > Other failure modes can generate this pattern (e.g. host RAM corruption
> > and kernel bugs) but other failure modes often leave other indications
> > that are not present in these logs.  There's no information here
> > that points to some failure mode other than a write cache failure.
> > btrfs is designed to reliably detect write cache failures, so based on
> > the information so far I wouldn't suspect anything else.
> >
> > > > > So, question: does the btrfs module use 256 MiB as a default size for
> > > > > write-back cache pages anywhere?
> > > >
> > > > No.
> > > >
> > > > > Or might this problem reside deeper
> > > > > in the block storage system?
> > > >
> > > > So if possible, please test without dm-crypto to make sure it's really
> > > > btrfs causing the problem.
> > >
> > > Since you don't find the 256 MiB size evocative of any component of
> > > btrfs, I am also willing to say that this likely isn't a fault in
> > > btrfs. So, I think this is now an exercise in ruling out btrfs
> > > completely (before I move on to investigating devicemapper, the crypto
> > > target, the NVMe driver, and the block storage system as a whole) and
> > > further studying the filesystem to gain insight as to how this
> > > happened.
> > >
> > > > > Also, for what it's worth: the last generation to be written before
> > > > > these inconsistencies seems to be 66303. Files written as part of that
> > > > > transaction have a birth date of Sep. 7. That's during the same boot
> > > > > as the failure to read-only,
> > > >
> > > > That's interesting.
> > > >
> > > > Btrfs aborts transaction when it hits critical metadata problems to
> > > > prevent further damage, but I still remember some possible corruption
> > > > caused by aborting transaction in the old days.
> > > > But those corruptions should not exist in v5.14.
> > >
> > > I understand "aborts" to mean, "if btrfs is in the middle of a write
> > > transaction, and in the process discovers existing metadata corruption
> > > (e.g. tree node generation mismatch) already on-disk, it will undo any
> > > earlier modifications already written to disk as part of that
> > > transaction" -- is this correct? If so, I don't believe there was any
> > > existing metadata corruption on the disk beforehand, so I doubt the
> > > abort mechanism is at fault here.
> > >
> > > > Furthermore, the read-only flips itself is more important, do you have
> > > > the dmesg of that incident?
> > >
> > > Sadly, I wasn't able to save the dmesg (for obvious reasons), but I do
> > > recall logs complaining of checksum failures and transid mismatches. I
> > > did make note of the block bytenrs, which were in the interval
> > > [1065332047872, 1065391243264] which itself appears to be part of the
> > > same 256 MiB "slice" as I found before.
> > >
> > > > Recently we also see some corrupted free space cache causing problems.
> > >
> > > I infer "problems" to mean, "free space cache indicates block X is
> > > free though it is not, so btrfs allocates a new tree node or extent at
> > > that location, overwriting necessary data"?
> > >
> > > If so, I don't think this is the case here either. If an older node is
> > > overwritten with a newer one, I would expect the transid mismatches to
> > > be that the found IDs are *greater* than the wanted IDs. I am finding
> > > valid, but *older*, tree nodes in the affected region. This seems like
> > > a "lost writes" kind of problem to me, not overwrites.
> > >
> > > However I did note some free space cache errors in my timeline.txt
> > > file over on the Github Gist. If you believe that this is still a
> > > likelihood, I can investigate further.
> > >
> > > > If you're going to re-test this, mind to use space cache v2 to do the
> > > > tests?
> > > >
> > > > > which suggests that the cached buffer
> > > > > wasn't merely "not saved before a previous shutdown" but rather
> > > > > "discarded without being written back." Thoughts?
> > > >
> > > > This should not be a thing for btrfs.
> > > >
> > > > Btrfs COW protects its metadata, as long as there is no problem in btrfs
> > > > itself allocating new tree blocks to existing blocks in the same
> > > > transaction, we're completely safe.
> > >
> > > Again, making sure I understand you correctly: Btrfs, by design,
> > > *never* modifies an active tree node/leaf. It will instead copy the
> > > node/leaf it wishes to modify into memory, perform the modifications
> > > there, save the modified copy at a free location on disk, and then
> > > recursively modify parents (using the same process) all the way back
> > > to the superblock(s).
> > >
> > > And so, if power is lost in the middle of this process (i.e. before
> > > the superblock is updated), then the old paths down the trees are
> > > intact. The transaction is lost, but existing metadata is protected.
> > > Correct?
> >
> > If btrfs stops writing to the filesystem immediately, either because a
> > power failure stopped the kernel running, or a metadata integrity failure
> > error was detected, then existing metadata is protected.
> >
> > If btrfs does _not_ stop writing immediately (e.g. the disk does not
> > notify btrfs of write failure) and commits a transaction after writes are
> > dropped, then metadata will be destroyed by later transactions because
> > they will assume the previously written data is present on disk when it
> > is not.
> >
> > > > But considering you're reporting some RO incidents, then I guess the COW
> > > > mechanism may be broken at that time already.
> > > > (Maybe abort leads to some corrupted free space cache?)
> > >
> > > Hypothetically, suppose one were to try this experiment:
> > > 1. Set up a brand-new machine, which has 2 independent but equally-sized disks.
> > > 2. On disk A, create a btrfs and mount it. Use it as a normal rootfs
> > > for a while, and then unmount it.
> > > 3. Image disk A to disk B (perhaps obfuscating the image so that it
> > > doesn't confuse btrfs's RAID logic).
> > > 4. Remount disk A and begin using it heavily (reads and writes). At a
> > > random moment, instantly copy a 256 MiB slice from disk B onto disk A,
> > > choosing an offset that has recently been written on disk A, while
> > > disk A is still mounted.
> > >
> > > Would this cause csum failures, transid mismatches, a RO incident, and
> > > the general output from "btrfs check" that I posted?
> >
> > It would, but in far larger quantities (thousands of blocks instead of
> > just a few dozen).
> >
> > If you have two disks you could run btrfs raid1 on them.  Failures on
> > one disk will be corrected by reading from the other.  Pairing disks by
> > different vendors in large numbers is an excellent way to quickly learn
> > which models are bad.
> >
> > > If so, I think the problem is that a write-back cache, somewhere
> > > deeper in the block storage stack (i.e. not btrfs) simply...
> > > *discarded* a contiguous 256 MiB worth of data that btrfs thought had
> > > been successfully written, and the RO incident was due to btrfs
> > > suddenly seeing old data resurface. Does this seem like the best
> > > working hypothesis currently?
> >
> > No, we would expect _much_ more damage than your check run reported.
> >
> > > Also: Per your expert opinion, what would be the best way to repair
> > > disk A (without the benefit of disk B, heh) in the experiment above? I
> > > imagine the affected slice itself is not salvageable, and will have to
> > > be forcibly marked as free space, and in the process throwing out any
> > > references to tree nodes that happened to reside there, meaning I'll
> > > lose a whole bunch of extents, inodes, csum tree entries, and so on...
> > > and in turn I'd lose a random assortment of files. I'm actually okay
> > > with that, because I can diff the surviving files against my last
> > > backup and restore whatever was deleted.
> >
> > You'd have to reinsert all the surviving (and up to date) leaf nodes
> > into new trees, then remove any dangling references, then verify all
> > the surviving data.  That's more total IO than mkfs + restore, and
> > requires significant memory and CPU as well.
> >
> > > Or is the whole thing FUBAR at this point and I should just zero the
> > > partition and restore from backups?
> >
> > If you have backups, it's usually faster to restore them.
> >
> > I'd put that drive on probation for a while--maybe use it only in raid1
> > setups, or turn off its write cache, or run a stress test for a while
> > to see if it fails again.
> >
> > Certainly if you suspect the 5.14 kernel then it would be worthwhile to
> > retest the drive with an older kernel, or test different drives with a
> > newer kernel.  If you can reproduce the issue with multiple drives that
> > is always an interesting bug report.
> >
> > We want to know your drive model and firmware revision.  If everyone who
> > owns one of these drives reports the same problem, we know the problem
> > is in the drive firmware.  If we see the problem across multiple drives
> > then we know it's somewhere else.
> >
> > > Thank you for your time,
> > > Sam
> > >
> > > > Thanks,
> > > > Qu
> > > >
> > > > >
> > > > > Cheers,
> > > > > Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-11 16:56           ` Zygo Blaxell
@ 2021-09-12  6:12             ` Sam Edwards
  2021-09-12 23:07               ` Zygo Blaxell
  0 siblings, 1 reply; 17+ messages in thread
From: Sam Edwards @ 2021-09-12  6:12 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Qu Wenruo, linux-btrfs

On Sat, Sep 11, 2021 at 10:56 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> It's not one I've seen previously reported, but there's a huge variety
> of SSD firmware in the field.

It seems to be a very newly released SSD. It's possible that the
reason nobody else has reported issues with it yet is that nobody else
who owns one of these has yet met the conditions for this problem to
occur. All the more reason to figure this out, I say.

I've been working to verify what you've said previously (and to rule
out any contrary hypotheses - like chunks momentarily having the wrong
physical offset). One point I can't corroborate is:

> There are roughly 40 distinct block addresses affected in your check log,
> clustered in two separate 256 MB blocks.

The only missing writes that I see are in a single 256 MiB cluster
(belonging to chunk 1065173909504). What is the other 256 MiB cluster
that you are seeing? What shows that writes to that range went
missing, too? (Or by "affected" do you only mean "involved in the
damaged transactions in some way"?)

I do find it interesting that, of a few dozen missing writes, all of
them are clustered together, while other writes in the same
transactions appear to have had a perfect success rate. My expectation
for drive cache failure would have been that *all* writes (during the
incident) get the same probability of being dropped. All of the
failures being grouped like that can only mean one thing... I just
don't know what it is. :)

So, the prime suspect at this point is the SSD firmware. Once I have a
little more information, I'll (try to) share what I find with the
vendor. Ideally I'd like to narrow down which of 3 components of the
firmware apparently contains the fault:
1. Write back cache: Most likely, although not certain at this point.
If I turn off the write cache and the problem goes away, I'll know.
2. NVMe command queues: Perhaps there is some race condition where 2
writes submitted on different queues will, under some circumstances,
cause one/both of the writes to be ignored.
3. LBA mapper: Given the pattern of torn writes, it's possible that
some LBAs were not updated to the new PBAs after some of the writes. I
find this pretty unlikely for a handful of reasons (trying to write a
non-erased block should result in an internal error, old PBA should be
erased, ...)

However, even if this is a firmware/hardware issue, I remain
unconvinced that it's purely coincidence just how quickly this
happened after the upgrade to 5.14.x. In addition to this corruption,
there are the 2 incidents where the system became unresponsive under
I/O load (and the second was purely reads from trying to image the
SSD). Those problems didn't occur when booting a rescue USB with an
older kernel. So some change which landed in 5.14.x may have changed
the drive command pattern in some important way that triggers the SSD
fault (especially in the case of possibility #2 above). That gives me hope
that, if nothing else, we may be able to add a device quirk to Linux
and minimize future damage that way. :)

Bayes calls out from beyond the grave and demands that, before I try
any experiments, I first establish the base rate of these corruptions
under current conditions. So that means rebuilding my filesystem from
backups and continuing to use it exactly as I have been, prepared for
this problem to happen again. Being prepared means stepping up my
backup frequency, so I'll first set up a btrbk server that can accept
hourly backups.

Wish me luck,
Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-12  6:12             ` Sam Edwards
@ 2021-09-12 23:07               ` Zygo Blaxell
  2021-09-14 23:58                 ` Sam Edwards
  0 siblings, 1 reply; 17+ messages in thread
From: Zygo Blaxell @ 2021-09-12 23:07 UTC (permalink / raw)
  To: Sam Edwards; +Cc: Qu Wenruo, linux-btrfs

On Sun, Sep 12, 2021 at 12:12:13AM -0600, Sam Edwards wrote:
> On Sat, Sep 11, 2021 at 10:56 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> > It's not one I've seen previously reported, but there's a huge variety
> > of SSD firmware in the field.
> 
> It seems to be a very newly released SSD. It's possible that the
> reason nobody else has reported issues with it yet is that nobody else
> who owns one of these has yet met the conditions for this problem to
> occur. All the more reason to figure this out, I say.
> 
> I've been working to verify what you've said previously (and to rule
> out any contrary hypotheses - like chunks momentarily having the wrong
> physical offset). One point I can't corroborate is:
> 
> > There are roughly 40 distinct block addresses affected in your check log,
> > clustered in two separate 256 MB blocks.
> 
> The only missing writes that I see are in a single 256 MiB cluster
> (belonging to chunk 1065173909504). What is the other 256 MiB cluster
> that you are seeing? 

Hex addresses from your parent transid verify failed messages:

	f80ad70000
	f80ad74000
	f80c11c000
	f80c6dc000
	f80c6e8000
	f80c704000
	f80dd04000
	f80e320000
	f80e3a4000
	f80e3cc000

	f8167c8000
	f8167e0000
	f818a68000
	f818a6c000
	f818a70000
	f818a74000
	f818a7c000
	f818a80000
	f818a84000
	f818a8c000

"leaf parent key incorrect" has a similar distribution.

There is less than 256MB distance from the first to the last, but they
occupy two separate 256MB-aligned regions (f800000000 and f810000000).
If there is dup metadata then these blocks occupy four separate 256MB
regions, as there is some space between duplicate regions (in logical
address space).
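
That grouping is easy to re-derive; a quick Python sketch over just the
addresses listed above (nothing here is drive- or btrfs-specific):

    # Group the failed metadata block addresses by 256 MiB-aligned region.
    from collections import Counter

    addrs = [
        0xf80ad70000, 0xf80ad74000, 0xf80c11c000, 0xf80c6dc000,
        0xf80c6e8000, 0xf80c704000, 0xf80dd04000, 0xf80e320000,
        0xf80e3a4000, 0xf80e3cc000, 0xf8167c8000, 0xf8167e0000,
        0xf818a68000, 0xf818a6c000, 0xf818a70000, 0xf818a74000,
        0xf818a7c000, 0xf818a80000, 0xf818a84000, 0xf818a8c000,
    ]

    ALIGN = 256 * 1024 * 1024   # 256 MiB
    regions = Counter(hex(a - a % ALIGN) for a in addrs)
    for region, count in sorted(regions.items()):
        print(region, count)    # 0xf800000000 10, 0xf810000000 10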

256MB is far too large for a plausible erase block size, 1GB is even
less likely.

> What shows that writes to that range went
> missing, too? (Or by "affected" do you only mean "involved in the
> damaged transactions in some way"?)

"parent transid verify failed" is btrfs for "there is a reference to a
block at this location, but the block was not written and a different
(usually older) block was found instead."  "leaf parent key incorrect"
is a synonym.  The named block is almost always a missing write.

I don't see any signs of more than one transaction being affected.
With 20 distinct pages of metadata lost we'd expect up to 60,000 items
with reference problems, but there are only about 9000 unique error
items.  If there was more damage to leaf pages, especially from other
transactions, I'd expect more unique errors.

> I do find it interesting that, of a few dozen missing writes, all of
> them are clustered together, while other writes in the same
> transactions appear to have had a perfect success rate. My expectation
> for drive cache failure would have been that *all* writes (during the
> incident) get the same probability of being dropped. All of the
> failures being grouped like that can only mean one thing... I just
> don't know what it is. :)

That's not how write caches work--they are not giant FIFO queues.

Write caches reorder writes in close temporal proximity (i.e. writes
that occurred close together in time rather than in space).  There is
necessarily reordering in the failure case--if there was no reordering,
in a strictly FIFO cache, there would be no btrfs errors because without
reordering the last persisted transaction would be completely persisted
with all its metadata intact (there may be later transactions that were
lost, but if none of their writes persisted then those transactions
didn't really happen).

The cache does not occupy all of the DRAM in the device--usually it is
only a handful of pages, because the firmware will drain the cache to
flash as quickly as possible.  btrfs quite often has to stop to read data
from disk during a transaction, giving the drive time to catch up flushing
out the cache.  This results in a small number of writes in flight.

So there might be e.g. writes to blocks A, B, C, D, E, F, barrier, G,
and those are reordered in write cache as A, D, E, F, G, B, C, and then
the last 2 writes are lost by a drive problem.  Writes to locations A,
D, E, F, and G are persisted (G might be the superblock update,
completing the transaction and updating root pointers to all the trees)
while writes to B and C are not (these are the "leaf parent key incorrect"
and "parent transid verify failed" errors, where a parent node in a tree
points to a block that does not contain the matching child).

Reordering like the above is allowed, but only if the drive can guarantee
results equivalent to the original ordering with the barrier constraint
(e.g. it has big capacitors or SLC write cache).  The drive can legally
write G earlier if and only if it can guarantee it will finish writing
all of A-F, even if power is lost, host sends a reset, or media fails.
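
To make the constraint concrete, here is a tiny toy model (purely an
illustration of the rule above, not anything taken from btrfs or the
NVMe spec; the write names and the FLUSH token are made up):

    def barrier_consistent(ops, persisted):
        # ops: issued sequence, e.g. ["A", "B", "FLUSH", "G"]
        # persisted: set of write names later found on the media
        before = set()     # writes issued before the last FLUSH seen
        pending = set()    # writes issued since that FLUSH
        for op in ops:
            if op == "FLUSH":
                before |= pending
                pending = set()
            else:
                pending.add(op)
                # a post-flush write may persist only if every
                # pre-flush write persisted as well
                if op in persisted and not before <= persisted:
                    return False
        return True

    ops = ["A", "B", "C", "D", "E", "F", "FLUSH", "G"]
    # G persisted while B and C were lost -> illegal:
    print(barrier_consistent(ops, {"A", "D", "E", "F", "G"}))   # False
    # only pre-flush writes persisted -> legal:
    print(barrier_consistent(ops, {"A", "C", "E"}))             # True

The first case is the failure mode described above; the second is
equivalent to the transaction never having happened, which is why it
produces no btrfs errors.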

> So, the prime suspect at this point is the SSD firmware. Once I have a
> little more information, I'll (try to) share what I find with the
> vendor. Ideally I'd like to narrow down which of 3 components of the
> firmware apparently contains the fault:
> 1. Write back cache: Most likely, although not certain at this point.
> If I turn off the write cache and the problem goes away, I'll know.

Disabling write cache is the most common workaround, and often successful
if the drive is still healthy.

> 2. NVMe command queues: Perhaps there is some race condition where 2
> writes submitted on different queues will, under some circumstances,
> cause one/both of the writes to be ignored.

That often happens to drives as they fail.  Firmware reboots due to a
bug in error handling code or hardware fault, and doesn't remember what
was in its write cache.

Also if there is a transport failure and the host resets the bus, the
drive firmware might have its write cache forcibly erased before being
able to write it.  The spec says that doesn't happen, but some vendors
are demonstrably unable to follow spec.

> 3. LBA mapper: Given the pattern of torn writes, it's possible that
> some LBAs were not updated to the new PBAs after some of the writes. I
> find this pretty unlikely for a handful of reasons (trying to write a
> non-erased block should result in an internal error, old PBA should be
> erased, ...)

That's an SSD-specific restatement of #1 (failure to persist data before
reporting successfully completed write to the host, and returning previous
versions of data on later reads of the same address).

SSDs don't necessarily erase old blocks immediately--a large, empty or
frequently discarded SSD might not erase old blocks for months.

All of the above fit into the general category of "drive drops some
writes, out of order, when some triggering failure occurs."  If you have
access to the drive's firmware on github, you could check out the code,
determine which bug is occurring, and send a pull request with the fix.
If you don't, usually the practical solution is to choose a different
drive vendor, unless you're ordering enough units to cause drive
manufacturer shareholders to panic when you stop.

Also you need to be _really_ sure it's the drive, and this information
casts some doubt on that theory:

> However, even if this is a firmware/hardware issue, I remain
> unconvinced that it's purely coincidence just how quickly this
> happened after the upgrade to 5.14.x. In addition to this corruption,
> there are the 2 incidents where the system became unresponsive under
> I/O load (and the second was purely reads from trying to image the
> SSD). Those problems didn't occur when booting a rescue USB with an
> older kernel. So some change which landed in 5.14.x may have changed
> the drive command pattern in some important way that triggers the SSD
> fault (especially in the case of possibility #2 above). That gives me hope
> that, if nothing else, we may be able to add a device quirk to Linux
> and minimize future damage that way. :)

Yeah, if something horrible happened in the Linux 5.14 baremetal NVME
hardware drivers or PCIe subsystem in general, then it could produce
symptoms like these.  It wouldn't be the first time a regression in
other parts of Linux was detected by a flood of btrfs errors.

Device resets might trigger write cache losses and then all the above
"firmware" symptoms (but the firmware is not at fault, it is getting
disrupted by the host) (unless you are a stickler for the letter of the
spec that says write cache must be immune to host action).

Linux 5.14 btrfs on VMs seems OK.  I run tests continuously on new Linux
kernels to detect btrfs and lvm regressions early, and nothing like this
has happened on my humble fleet.  My test coverage is limited--it
won't detect a baremetal NVME transport issue, as that's handled by the
host kernel not the VM guest.

> Bayes calls out from beyond the grave and demands that, before I try
> any experiments, I first establish the base rate of these corruptions
> under current conditions. So that means rebuilding my filesystem from
> backups and continuing to use it exactly as I have been, prepared for
> this problem to happen again. Being prepared means stepping up my
> backup frequency, so I'll first set up a btrbk server that can accept
> hourly backups.

Sound methodology.

> Wish me luck,

Good luck!

> Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-12 23:07               ` Zygo Blaxell
@ 2021-09-14 23:58                 ` Sam Edwards
  2021-09-15 18:10                   ` Zygo Blaxell
  0 siblings, 1 reply; 17+ messages in thread
From: Sam Edwards @ 2021-09-14 23:58 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Qu Wenruo, linux-btrfs

On Sun, Sep 12, 2021 at 5:07 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> There is less than 256MB distance from the first to the last, but they
> occupy two separate 256MB-aligned regions (f800000000 and f810000000).
> If there is dup metadata then these blocks occupy four separate 256MB
> regions, as there is some space between duplicate regions (in logical
> address space).

Okay, I see. You were looking at alignment, not range. I didn't think
you were trying to determine LBA alignment because you were working
from offsets within the LUKS space, without the benefit of the
LUKS+partition offsets. (Adding in those offsets doesn't change the
alignment situation much, by the way.)

Oh, for what it's worth: I don't think there was dup metadata. I
didn't pass any flags to mkfs.btrfs when I built the filesystem (so it
should have used the SSD default of single), and when I was (manually,
scanning the LUKS volume) looking for duplicate metadata blocks to
test my misdirected writes guess, I didn't find any two blocks with
the same bytenr. I still have the filesystem image in case there's any
doubt. But it also may not be pertinent at this point.

> 256MB is far too large for a plausible erase block size, 1GB is even
> less likely.

Agreed - orders of magnitude too large, if you ask me. But I'm also
not discounting the possibility that the erase block size is big
enough to cover 256MiB worth of (SSD, not btrfs) metadata.

> That's not how write caches work--they are not giant FIFO queues.

Oh no, I think I didn't make myself clear. I wasn't suggesting
anything to do with the order of the write operations, nor trying to
indicate that I was confused about what I was seeing.

I feel bad now; you provided a good explanation of this stuff, and I
appreciate the time taken to type all of that out, but I also didn't
benefit from the explanation as I'm nowhere near as green to memory
consistency as I am to btrfs internals. (Maybe someone reading this
email thread in the future will find this info useful?)

Hopefully I can rephrase what I was saying more clearly:
During the incident, there was some x% chance of writes not taking
effect. (And since there are a few metadata checksum errors indicating
torn writes, I think I can safely say that x<100. Probably by a wide
margin, given there are only 9000 unique error items, which is much
less than your expectation of 60K.)
The part that first caught my eye was that, for writes *not* to chunk
1065173909504, x=0; i.e., the probability of loss was conditional on
LBA. Note that I am only using the chunk as a spatial grouping, not
saying that the chunk itself is to blame.
What then made me find this interesting (a better word is, perhaps,
"distinctive"), was that the pattern of writes lost during
transactions 66552, 66591, 66655, and 66684 all followed these very
same statistics.

That eliminates a whole class of bugs (because certain cache
replacement policies and/or data structures cannot follow this
pattern, so if the SSD uses one of those for its write cache, it means
the culprit can't be the write cache) while making others more likely
(e.g. if the SSD is not respecting flush and the write cache is
keeping pages in a hash table bucketed by the first N bits of LBA,
then this could easily be a failure of the write cache's hash table).

And if nothing else, it gives us a way to recognize this particular
problem, in case someone else shows up with similar errors.
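
To make the hash-table example concrete (and this is purely a toy
model of the kind of structure I mean -- I have no knowledge of what
this firmware actually does internally): if dirty pages are bucketed
by an LBA prefix and one bucket is lost before it reaches flash, the
dropped writes all share that prefix, while every other write
survives.

    # Toy model only: a write cache whose dirty pages are bucketed by
    # the top bits of the LBA.  Losing one bucket drops only writes
    # that share that LBA prefix; every other write survives.
    from collections import defaultdict

    PREFIX_SHIFT = 28              # bucket by 256 MiB (illustrative)

    cache = defaultdict(dict)      # prefix -> {lba: data}

    def cached_write(lba, data):
        cache[lba >> PREFIX_SHIFT][lba] = data

    def lose_bucket(prefix):
        cache.pop(prefix, None)    # the failure: one bucket vanishes

    for lba in (0xf80ad70000, 0xf80c11c000, 0xf8167c8000, 0xf818a68000):
        cached_write(lba, b"new")

    lose_bucket(0xf80ad70000 >> PREFIX_SHIFT)

    survivors = sorted(hex(lba) for bucket in cache.values() for lba in bucket)
    print(survivors)               # only the 0xf81... writes remain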

> That often happens to drives as they fail.  Firmware reboots due to a
> bug in error handling code or hardware fault, and doesn't remember what
> was in its write cache.
>
> Also if there is a transport failure and the host resets the bus, the
> drive firmware might have its write cache forcibly erased before being
> able to write it.  The spec says that doesn't happen, but some vendors
> are demonstrably unable to follow spec.

If the problem is in the command queue block, then the writes are
being lost before the write cache is even involved.

> That's an SSD-specific restatement of #1 (failure to persist data before
> reporting successfully completed write to the host, and returning previous
> versions of data on later reads of the same address).

But then here the writes are being lost *after* the write cache is
involved. In case #1, I could just opt out of the write cache to get
my SSD working reliably again, but I can't opt out of LBA remapping
since it's required for the longevity of the drive.

> SSDs don't necessarily erase old blocks immediately--a large, empty or
> frequently discarded SSD might not erase old blocks for months.

The blocks weren't erased even after a few days of analysis. But case
#3 is really unlikely either way, because LBA mapper bugs should take
a *bunch* of stuff down with them.

> All of the above fit into the general category of "drive drops some
> writes, out of order, when some triggering failure occurs."

Yep -- although I would include the nearby LBAs condition in the
symptoms list. That's statistically significant, especially for SSD!
:)

> If you have
> access to the drive's firmware on github, you could check out the code,
> determine which bug is occurring, and send a pull request with the fix.
> If you don't, usually the practical solution is to choose a different
> drive vendor, unless you're ordering enough units to cause drive
> manufacturer shareholders to panic when you stop.

Oh man, there's sometimes drive firmware on GitHub?! I would be
pleasantly surprised to find my SSD's firmware there. Alas, a cursory
search suggests it isn't.

I didn't choose the drive; it's a whitelabel OEM unit. If I can't get
in touch with the vendor "officially" (which I probably can't), I'll
try getting the attention of the employee/contractor responsible for
maintaining the firmware. I've had moderate success with that in the
past. That's a large part of my motivation for nailing down exactly
*what* in the SSD is misbehaving (if indeed the SSD is misbehaving),
to maximize my chances that such a bug report is deemed "worth their
time." :)

> Yeah, if something horrible happened in the Linux 5.14 baremetal NVME
> hardware drivers or PCIe subsystem in general, then it could produce
> symptoms like these.  It wouldn't be the first time a regression in
> other parts of Linux was detected by a flood of btrfs errors.
>
> Device resets might trigger write cache losses and then all the above
> "firmware" symptoms (but the firmware is not at fault, it is getting
> disrupted by the host) (unless you are a stickler for the letter of the
> spec that says write cache must be immune to host action).

On the other hand, a problem that surfaces only with a new version of
Linux isn't _necessarily_ a regression in Linux. My working hypothesis
currently is that there's a corner-case in the SSD firmware, where
certain (perfectly cromulent) sequences of NVMe commands would trigger
this bug, and the reason it has never been detected in testing is that
the first NVMe driver in the world to hit the corner-case is the
version shipped in Linux 5.14, after the SSD was released. That'd be a
nice outcome since Linux can spare other users a similar fate with the
addition of a device quirk that avoids the corner-case.

But I haven't ruled out what you've said about a full-blown Linux
regression, or even just a problem that happens every few weeks (due
to some internal counter in the SSD overflowing, perhaps -- the
problem did manifest awfully soon after the 2^16th transaction, after
all). All possibilities remain open. :)

> Linux 5.14 btrfs on VMs seems OK.  I run tests continuously on new Linux
> kernels to detect btrfs and lvm regressions early, and nothing like this
> has happened on my humble fleet.  My test coverage is limited--it
> won't detect a baremetal NVME transport issue, as that's handled by the
> host kernel not the VM guest.

I wonder if there'd be some value in setting up PCIe passthrough
straight into the VM for some/all of your NVMe devices? Does your VM
host have an IOMMU to allow that? Any interest in me donating to you
my SSD (which is M.2) if I decide to replace it?

> Sound methodology.
> > Wish me luck,
> Good luck!

Thank you! And: start the clock. I've erased+rebuilt the filesystem
today and am continuing to use it as I was before. I'll be making a
second attempt at installing those package updates later as well.

If there are any other questions, I'll happily answer. Otherwise I
won't be following up unless/until I encounter more instances of this
bug. So, to anyone reading this list archive well in the future: if
this is the last message from me, it means the problem was one-off.
Sorry.

Signing off,
Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
  2021-09-14 23:58                 ` Sam Edwards
@ 2021-09-15 18:10                   ` Zygo Blaxell
  0 siblings, 0 replies; 17+ messages in thread
From: Zygo Blaxell @ 2021-09-15 18:10 UTC (permalink / raw)
  To: Sam Edwards; +Cc: Qu Wenruo, linux-btrfs

On Tue, Sep 14, 2021 at 05:58:58PM -0600, Sam Edwards wrote:
> On Sun, Sep 12, 2021 at 5:07 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > There is less than 256MB distance from the first to the last, but they
> > occupy two separate 256MB-aligned regions (f800000000 and f810000000).
> > If there is dup metadata then these blocks occupy four separate 256MB
> > regions, as there is some space between duplicate regions (in logical
> > address space).
> 
> Okay, I see. You were looking at alignment, not range. I didn't think
> you were trying to determine LBA alignment because you were working
> from offsets within the LUKS space, without the benefit of the
> LUKS+partition offsets. (Adding in those offsets doesn't change the
> alignment situation much, by the way.)

Alignment is mostly meaningless anyway.  I've been looking for some
evidence to support the alignment theory, but nothing fits as well as
the more common and boring write-drop failure modes.

> Oh, for what it's worth: I don't think there was dup metadata. I
> didn't pass any flags to mkfs.btrfs when I built the filesystem (so it
> should have used the SSD default of single), and when I was (manually,
> scanning the LUKS volume) looking for duplicate metadata blocks to
> test my misdirected writes guess, I didn't find any two blocks with
> the same bytenr. I still have the filesystem image in case there's any
> doubt. But it also may not be pertinent at this point.

It appears I guessed wrong here.  In dmesg logs you can count the
number of metadata mirrors by counting repeated failure messages, as
the filesystem will try to read each mirror once before giving up and
emitting a different message.  This trick doesn't work with btrfs check
output, apparently.

> > 256MB is far too large for a plausible erase block size, 1GB is even
> > less likely.
> 
> Agreed - orders of magnitude too large, if you ask me. But I'm also
> not discounting the possibility that the erase block size is big
> enough to cover 256MiB worth of (SSD, not btrfs) metadata.
> 
> > That's not how write caches work--they are not giant FIFO queues.
> 
> Oh no, I think I didn't make myself clear. I wasn't suggesting
> anything to do with the order of the write operations, nor trying to
> indicate that I was confused about what I was seeing.
> 
> I feel bad now; you provided a good explanation of this stuff, and I
> appreciate the time taken to type all of that out, but I also didn't
> benefit from the explanation as I'm nowhere near as green to memory
> consistency as I am to btrfs internals. (Maybe someone reading this
> email thread in the future will find this info useful?)

I figure it's better to send more information than needed rather than
less, in case I'm misunderstanding something.  Whenever two experts talk,
each knows 90% of what the other knows, but not _which_ 90%, so most of
the exchange is figuring out which 10% isn't common knowledge.

Worst case, it might end up in a google search and help out a random
user some day.

> Hopefully I can rephrase what I was saying more clearly:
> During the incident, there was some x% chance of writes not taking
> effect. (And since there are a few metadata checksum errors indicating
> torn writes, I think I can safely say that x<100. Probably by a wide
> margin, given there are only 9000 unique error items, which is much
> less than your expectation of 60K.)
> The part that first caught my eye was that, for writes *not* to chunk
> 1065173909504, x=0; i.e., the probability of loss was conditional on
> LBA. Note that I am only using the chunk as a spatial grouping, not
> saying that the chunk itself is to blame.

I attach more or less zero significance to the LBA pattern.  Previously I
was thinking "is there some way a drive could generate this pattern?", but
a much easier question is "would btrfs ever generate any other pattern?"

btrfs uses a first-fit allocator for metadata.  Only huge transactions
(like balance or snapshot delete) span more than a few hundred MB.
btrfs only spreads out metadata writes if free space fragmentation forces
it to search over long distances to find a free block.  The normal case
is everything is contiguous or nearly contiguous within a transaction.
btrfs can move metadata anywhere on the disk to make metadata writes
contiguous, so it does exactly that to speed up metadata writes (it was
also handy for ZBD device support, though it turned out btrfs wasn't
writing quite 100% sequentially so some fixes had to be done).
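
A toy first-fit allocator (a sketch only, nothing to do with the real
btrfs allocator code) shows the consequence: allocations made close
together in time come out close together in space, so one
transaction's metadata writes land in a narrow, ascending range.

    def first_fit(holes, size):
        # holes: sorted list of [start, end) free ranges, mutated in place
        for hole in holes:
            start, end = hole
            if end - start >= size:
                hole[0] = start + size
                return start
        raise RuntimeError("no free space")

    holes = [[0x10000, 0x40000], [0xf800000000, 0xf900000000]]
    txn = [first_fit(holes, 0x4000) for _ in range(8)]  # one transaction
    print([hex(b) for b in txn])
    # eight contiguous 16 KiB blocks starting at 0x10000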

> What then made me find this interesting (a better word is, perhaps,
> "distinctive"), was that the pattern of writes lost during
> transactions 66552, 66591, 66655, and 66684 all followed these very
> same statistics.

If we look at the block numbers and transactions from parent transid
verify failures, we get:

	block(hex)..end(hex)   transid

	f80ad70000..f80ad74000 66552 
	f80ad74000..f80ad78000 66552 contiguous with previous block
	f80c11c000..f80c120000 66552 
	f80c6dc000..f80c6e0000 66552 
	f80c6e8000..f80c6ec000 66552 
	f80c704000..f80c708000 66552 

	f80dd04000..f80dd08000 66591 
	f80e320000..f80e324000 66591 
	f80e3a4000..f80e3a8000 66591 
	f80e3cc000..f80e3d0000 66591 

	f8167c8000..f8167cc000 66655 
	f8167e0000..f8167e4000 66655 

	f818a68000..f818a6c000 66684 
	f818a6c000..f818a70000 66684 contig
	f818a70000..f818a74000 66684 contig
	f818a74000..f818a78000 66684 contig
	f818a7c000..f818a80000 66684 
	f818a80000..f818a84000 66684 contig
	f818a84000..f818a88000 66684 contig
	f818a8c000..f818a90000 66684 

Note that the write losses detected in each transaction are very close
together, contiguous in some cases, separated by only a few blocks
in others.  No transactions overlap.  Transactions appear in ascending
LBA order, as the allocator would allocate them (eventually it will go
back to reuse smaller values, but some data has to be deleted first).

transid 66684 is almost completely contiguous, except for only two blocks
(and we might only be missing those because the blocks that refer to
them are also lost).  Depending on request length and io merges, that
could be a _single_ dropped write command.  In any event it is not more
than 3 writes with merging, 8 without.
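
(If you want to re-derive the run counting, here is a short Python
sketch over just the addresses and transids in the table above:)

    # Cluster the failed block addresses by transid and count
    # contiguous runs of 16 KiB nodes.
    from collections import defaultdict

    NODESIZE = 0x4000   # 16 KiB metadata nodes

    failures = [        # (block address, transid) pairs from the table
        (0xf80ad70000, 66552), (0xf80ad74000, 66552),
        (0xf80c11c000, 66552), (0xf80c6dc000, 66552),
        (0xf80c6e8000, 66552), (0xf80c704000, 66552),
        (0xf80dd04000, 66591), (0xf80e320000, 66591),
        (0xf80e3a4000, 66591), (0xf80e3cc000, 66591),
        (0xf8167c8000, 66655), (0xf8167e0000, 66655),
        (0xf818a68000, 66684), (0xf818a6c000, 66684),
        (0xf818a70000, 66684), (0xf818a74000, 66684),
        (0xf818a7c000, 66684), (0xf818a80000, 66684),
        (0xf818a84000, 66684), (0xf818a8c000, 66684),
    ]

    by_transid = defaultdict(list)
    for block, transid in failures:
        by_transid[transid].append(block)

    for transid, blocks in sorted(by_transid.items()):
        blocks.sort()
        runs = 1 + sum(b - a != NODESIZE
                       for a, b in zip(blocks, blocks[1:]))
        print(transid, len(blocks), "blocks in", runs, "runs")
    # transid 66684 comes out as 8 blocks in 3 runs, i.e. at most
    # 3 merged writes, matching the estimate above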

This is utterly normal when the lower layers drop writes.
In the power-fail case, there would be only one failed
transaction or a contiguous range of transaction numbers.  In the
drop-writes-but-stay-online cases, there will be dropped writes from
multiple transactions with successfully completed transactions between
them.

I think all this tells us is that the drive lost writes (however caused)
on at least 4 distinct occasions, and btrfs's forced-read-only was not
triggered until after the 4th one (either a 5th failure was detected
but the transaction was aborted so the 5th error was not persisted, or
btrfs reread old data and detected one of the previous 4 after the fact).

> That eliminates a whole class of bugs (because certain cache
> replacement policies and/or data structures cannot follow this
> pattern, so if the SSD uses one of those for its write cache, it means
> the culprit can't be the write cache) while making others more likely
> (e.g. if the SSD is not respecting flush and the write cache is
> keeping pages in a hash table bucketed by the first N bits of LBA,
> then this could easily be a failure of the write cache's hash table).
> 
> And if nothing else, it gives us a way to recognize this particular
> problem, in case someone else shows up with similar errors.

So far it looks like the aftermath of every other storage stack with
a write dropping issue between btrfs and the drive.  It's not an uncommon
problem, and I don't see anything unusual or unexplained in the check log.

The main distinguishing feature is the correlation with kernel version
5.14 and the other hangs and issues you were experiencing at the
same time.  Those facts are significant, and point to problems that
should equally affect any other filesystem with that combination of
kernel and hardware (though most other filesystems will require you to
run applications with robust data error detection built into them to be
able to detect the corruption).  If you run btrfs on a setup like that,
it will complain loudly and quickly.  All the integrity checking in btrfs
is there to detect precisely that type of silent storage stack failure,
and it seems to be working.

It's possible that there are two simultaneously occurring problems here:

	1.  the drive does drop writes from its cache on reset or error
	(which includes power failure, but also PCI bus reset and medium
	errors), and

	2.  Linux 5.14 is issuing a lot more bus resets than it should
	be, i.e. more than the one at boot.

Either one individually wouldn't cause data loss while the system is
otherwise running normally, but the combination causes data loss.

Problem 1 can often be worked around with hdparm -W0 (or "nvme set-feature
-f 6 -v 0") to disable write cache.  I've used it both to avoid power
failure losses and also UNC/bus reset losses on drives that have no
bad power failure behavior.  If that works, it provides data corruption
immunity for problem 2:  the resets cause no undetected write losses
because there are never writes that have been ACKed but not committed.

There are a lot of fault recovery paths that lead to the "disable the
write cache" action.  It's not just for power failures.

> > That often happens to drives as they fail.  Firmware reboots due to a
> > bug in error handling code or hardware fault, and doesn't remember what
> > was in its write cache.
> >
> > Also if there is a transport failure and the host resets the bus, the
> > drive firmware might have its write cache forcibly erased before being
> > able to write it.  The spec says that doesn't happen, but some vendors
> > are demonstrably unable to follow spec.
> 
> If the problem is in the command queue block, then the writes are
> being lost before the write cache is even involved.

If the command is lost while still in the queue, Linux should notice
it by command timeout if nothing else, and if Linux notices it, btrfs
will notice it and (assuming no bugs today) abort the transaction.

If the command is no longer in the queue (i.e. Linux received an ACK),
but its data is not yet persistent on storage media, it's in "write cache"
(or some equivalent nonpersistent storage block at the same point in the
storage stack where a write cache would be).  There might be multiple
components in the implementation, but from outside the disk we don't
see their effects separately.

> > That's an SSD-specific restatement of #1 (failure to persist data before
> > reporting successfully completed write to the host, and returning previous
> > versions of data on later reads of the same address).
> 
> But then here the writes are being lost *after* the write cache is
> involved. In case #1, I could just opt out of the write cache to get
> my SSD working reliably again, but I can't opt out of LBA remapping
> since it's required for the longevity of the drive.

That's true, but the write cache "off" switch might also change the
drive's LBA map update strategy at the same time, so it ends up being a
successful workaround anyway.  i.e. the firmware has multiple components
in play, but also multiple components change behavior in lockstep with
write cache on or off, so they still behave as a unit.

Changing write cache strategy definitely does affect flash wear and endurance.
We can see that in SSD model families where the "enterprise" version has
full power interrupt protection (every ACKed write committed) and the
"consumer" does not (strictly FIFO write cache without reordering,
but last N writes dropped on power failure) but neither model has
hold-up capacitors (so they're not relying on keeping a RAM cache alive
during commit).  The "enterprise" drive wear happens 2x faster than the
"consumer" drive despite the consumer drive having half the rated TBW
at the same capacity.  The "enterprise" drive is hitting the flash 4x
harder to implement its stronger persistence guarantees.

> > SSDs don't necessarily erase old blocks immediately--a large, empty or
> > frequently discarded SSD might not erase old blocks for months.
> 
> The blocks weren't erased even after a few days of analysis. But case
> #3 is really unlikely either way, because LBA mapper bugs should take
> a *bunch* of stuff down with them.

...or the drive silently works around them.  e.g. an empty page with
bits set on it might be interpreted as a media failure, and the drive
does some recovery action like map different flash, or it normally erases
the block at the start of the write operation without checking.

This is of course assuming the vendor bothered to implement basic error
checking procedures at all--some vendors demonstrably do not, and the
first and last indication of hardware failure is the errors from btrfs.

For almost every possible assumption or expectation of SSD firmware
behavior, there exists an implementation of the opposite behavior in
the field.  ;)

> > All of the above fit into the general category of "drive drops some
> > writes, out of order, when some triggering failure occurs."
> 
> Yep -- although I would include the nearby LBAs condition in the
> symptoms list. That's statistically significant, especially for SSD!
> :)
> 
> > If you have
> > access to the drive's firmware on github, you could check out the code,
> > determine which bug is occurring, and send a pull request with the fix.
> > If you don't, usually the practical solution is to choose a different
> > drive vendor, unless you're ordering enough units to cause drive
> > manufacturer shareholders to panic when you stop.
> 
> Oh man, there's sometimes drive firmware on GitHub?! I would be
> pleasantly surprised to find my SSD's firmware there. Alas, a cursory
> search suggests it isn't.

Yeah, no, that never happens in real life.  It's frequently suggested
as a way to solve chronic firmware quality problems in the storage
industry, though...usually by industry outsiders.

> I didn't choose the drive; it's a whitelabel OEM unit. If I can't get
> in touch with the vendor "officially" (which I probably can't), I'll
> try getting the attention of the employee/contractor responsible for
> maintaining the firmware. I've had moderate success with that in the
> past. That's a large part of my motivation for nailing down exactly
> *what* in the SSD is misbehaving (if indeed the SSD is misbehaving),
> to maximize my chances that such a bug report is deemed "worth their
> time." :)
> 
> > Yeah, if something horrible happened in the Linux 5.14 baremetal NVME
> > hardware drivers or PCIe subsystem in general, then it could produce
> > symptoms like these.  It wouldn't be the first time a regression in
> > other parts of Linux was detected by a flood of btrfs errors.
> >
> > Device resets might trigger write cache losses and then all the above
> > "firmware" symptoms (but the firmware is not at fault, it is getting
> > disrupted by the host) (unless you are a stickler for the letter of the
> > spec that says write cache must be immune to host action).
> 
> On the other hand, a problem that surfaces only with a new version of
> Linux isn't _necessarily_ a regression in Linux. My working hypothesis
> currently is that there's a corner-case in the SSD firmware, where
> certain (perfectly cromulent) sequences of NVMe commands would trigger
> this bug, and the reason it has never been detected in testing is that
> the first NVMe driver in the world to hit the corner-case is the
> version shipped in Linux 5.14, after the SSD was released. That'd be a
> nice outcome since Linux can spare other users a similar fate with the
> addition of a device quirk that avoids the corner-case.

That certainly does happen.  Linux steps on more than a few bugs where
the SSD vendor finally has to admit they didn't do more testing on Linux
than "run the Fedora installer with default options," and they didn't
catch bugs that brick the drive.  Vendors are pretty good at testing
compatibility with Windows though, so only Linux users are affected by
the bugs most of the time.

> But I haven't ruled out what you've said about a full-blown Linux
> regression, or even just a problem that happens every few weeks (due
> to some internal counter in the SSD overflowing, perhaps -- the
> problem did manifest awfully soon after the 2^16th transaction, after
> all). All possibilities remain open. :)

That's a _really_ tenuous connection.  How could it work?  The drive
counts flush commands?  Creating partitions and LVs and LUKS formatting
when setting up the drive will use some of those, so if anything, it
should fail long before transaction 65536.  Also there are at least
two flushes per transaction (one for everything that isn't superblock,
one for the superblocks) so 16-bit overflow should be near transid 32768
(unless it's a signed int16...).  Anyone who runs a database on such a
drive would hit the rollover before loading up a schema much less the
data inside it.

That said...I did notice how close the transids were to 65536...
Cool theory!  Alas, not supported by evidence.  ;)

> > Linux 5.14 btrfs on VMs seems OK.  I run tests continuously on new Linux
> > kernels to detect btrfs and lvm regressions early, and nothing like this
> > has happened on my humble fleet.  My test coverage is limited--it
> > won't detect a baremetal NVME transport issue, as that's handled by the
> > host kernel not the VM guest.
> 
> I wonder if there'd be some value in setting up PCIe passthrough
> straight into the VM for some/all of your NVMe devices? Does your VM
> host have an IOMMU to allow that? Any interest in me donating to you
> my SSD (which is M.2) if I decide to replace it?

The test VMs aren't using passthrough so that any storage hardware
failures can be cleanly identified by the host kernel separately from
btrfs issues in the guest kernel (the test workload does kill a disk
from time to time).  They're also obsolete hardware (anything current
is running production workloads instead) so problems with PCIe 4.0 won't
show up even with passthrough.

I test a lot of drives with _unknown_ issues to identify which drives
to buy (or not buy) more of.  95% of drive models work, so the easiest
and most reliable way to work around a bad model is to choose another
model at random.  It works 99.75% of the time, assuming the problem is
the drive model/firmware, and not something else in the system.

It would be better to donate a drive with known issues to someone
maintaining NVME or PCIe or whatever subsystem is actually failing
here--assuming your specific drive is a healthy specimen of that drive
model.  If the drive is simply a failing instance of that model then
it's best to return it to the vendor--all hardware fails eventually,
and a failed drive is only informative to its manufacturer.

> > Sound methodology.
> > > Wish me luck,
> > Good luck!
> 
> Thank you! And: start the clock. I've erased+rebuilt the filesystem
> today and am continuing to use it as I was before. I'll be making a
> second attempt at installing those package updates later as well.

Exactly as before, including single metadata?  Using dup metadata might
provide some useful extra information (we'd know if the problem affects
both copies or just one, and you're using luks so drive-side dedupe is
not possible and therefore not a confounding factor).  Though if you
have a storage stack that can trash 4 transactions before btrfs notices
the problem, then dup metadata won't save it.

> If there are any other questions, I'll happily answer. Otherwise I
> won't be following up unless/until I encounter more instances of this
> bug. So, to anyone reading this list archive well in the future: if
> this is the last message from me, it means the problem was one-off.
> Sorry.
> 
> Signing off,
> Sam
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-09-15 18:10 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-09  0:47 Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old Sam Edwards
2021-09-09  7:12 ` Nikolay Borisov
2021-09-09  7:30   ` Sam Edwards
2021-09-10  5:39 ` Sam Edwards
2021-09-10  6:16   ` Qu Wenruo
2021-09-10  7:44     ` Sam Edwards
2021-09-10  8:31       ` Qu Wenruo
2021-09-10 22:34         ` Sam Edwards
2021-09-11  1:05           ` Qu Wenruo
2021-09-11  4:24       ` Zygo Blaxell
2021-09-11  5:07         ` Sam Edwards
2021-09-11  7:38           ` Forza
2021-09-11 16:56           ` Zygo Blaxell
2021-09-12  6:12             ` Sam Edwards
2021-09-12 23:07               ` Zygo Blaxell
2021-09-14 23:58                 ` Sam Edwards
2021-09-15 18:10                   ` Zygo Blaxell
