From: Sam Edwards <cfsworks@gmail.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Subject: Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
Date: Sun, 12 Sep 2021 00:12:13 -0600
Message-ID: <CAH5Ym4isja5hs73ibcACH5cm00=F43cG+m_sNtFjkJ_oRZJT1g@mail.gmail.com>
In-Reply-To: <20210911165634.GK29026@hungrycats.org>

On Sat, Sep 11, 2021 at 10:56 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> It's not one I've seen previously reported, but there's a huge variety
> of SSD firmware in the field.

This seems to be a very recently released SSD. It's possible that
nobody else has reported issues with it simply because nobody else who
owns one has met the conditions for this problem yet. All the more
reason to figure this out, I say.

I've been working to verify what you've said previously (and to rule
out any contrary hypotheses - like chunks momentarily having the wrong
physical offset). One point I can't corroborate is:

> There are roughly 40 distinct block addresses affected in your check log,
> clustered in two separate 256 MB blocks.

The only missing writes that I see are in a single 256 MiB cluster
(belonging to chunk 1065173909504). What is the other 256 MiB cluster
that you are seeing? What shows that writes to that range went
missing, too? (Or by "affected" do you only mean "involved in the
damaged transactions in some way"?)

I do find it interesting that all of the few dozen missing writes are
clustered together, while other writes in the same transactions appear
to have had a perfect success rate. My expectation for a drive cache
failure would have been that *all* writes during the incident face the
same probability of being dropped. All of the failures being grouped
like that can only mean one thing... I just don't know what it is. :)
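
(To put rough numbers on that: if each dropped write were an
independent event, uniform over a hypothetical 1 TiB device, the chance
of ~40 of them all landing inside the same 256 MiB window would be on
the order of (256 MiB / 1 TiB)^39 = (2^-12)^39, which is effectively
zero. Whatever dropped these writes acted on that region, not on
individual writes.)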

So, the prime suspect at this point is the SSD firmware. Once I have a
little more information, I'll (try to) share what I find with the
vendor. Ideally, I'd like to narrow down which of three components of
the firmware contains the fault:
1. Write-back cache: Most likely, although not certain at this point.
If I turn off the write cache and the problem goes away, I'll know. (A
rough sketch of how I'd do that is below, after this list.)
2. NVMe command queues: Perhaps there is some race condition where two
writes submitted on different queues can, under some circumstances,
cause one or both of the writes to be dropped.
3. LBA mapper: Given the pattern of torn writes, it's possible that
some LBAs were not remapped to the new PBAs after some of the writes. I
find this pretty unlikely for a handful of reasons (trying to program a
non-erased block should result in an internal error, the old PBA should
have been erased, ...)
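
For completeness, here's roughly how I'd flip that switch for test #1.
This is an untested sketch that issues Set Features (opcode 09h,
feature 06h: Volatile Write Cache) through the kernel's admin
passthrough ioctl; "/dev/nvme0" is a placeholder for my controller, and
I believe nvme-cli's "nvme set-feature /dev/nvme0 -f 6 -v 0" amounts to
the same thing:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
        /* The controller character device (not the namespace block
         * device); needs root. */
        int fd = open("/dev/nvme0", O_RDWR);
        if (fd < 0) {
                perror("open /dev/nvme0");
                return 1;
        }

        struct nvme_admin_cmd cmd;
        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x09;   /* Set Features */
        cmd.cdw10  = 0x06;   /* Feature ID 06h: Volatile Write Cache */
        cmd.cdw11  = 0x00;   /* WCE bit clear: cache disabled */

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
                perror("NVME_IOCTL_ADMIN_CMD");
                return 1;
        }

        printf("volatile write cache disabled\n");
        return 0;
}

Since the Save bit isn't set, the change should only last until the
next controller reset or power cycle, which is fine for an experiment.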

However, even if this is a firmware/hardware issue, I remain
unconvinced that it's pure coincidence just how quickly this happened
after the upgrade to 5.14.x. In addition to this corruption, there are
the two incidents where the system became unresponsive under I/O load
(and the second was purely reads, from trying to image the SSD). Those
problems didn't occur when booting a rescue USB with an older kernel.
So some change that landed in 5.14.x may have altered the drive command
pattern in a way that triggers the SSD fault (especially in the case of
possibility #2 above). That gives me hope that, if nothing else, we may
be able to add a device quirk to Linux and minimize future damage that
way. :)
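
To make that concrete, the kind of change I have in mind is the usual
one-line entry in nvme_id_table[] in drivers/nvme/host/pci.c. The PCI
IDs below are made up, and the quirk flag is only a stand-in until we
know which behavior actually needs to be avoided:

static const struct pci_device_id nvme_id_table[] = {
        /* ... existing entries ... */
        { PCI_DEVICE(0xabcd, 0x1234),   /* placeholder IDs for this SSD */
          .driver_data = NVME_QUIRK_NO_DEEPEST_PS, },  /* placeholder flag */
        /* ... */
};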

Bayes calls out from beyond the grave and demands that, before I try
any experiments, I first establish the base rate of these corruptions
under current conditions. So that means rebuilding my filesystem from
backups and continuing to use it exactly as I have been, prepared for
this problem to happen again. Being prepared means stepping up my
backup frequency, so I'll first set up a btrbk server that can accept
hourly backups.
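
The btrbk side of that will look roughly like the config below. This is
only a sketch; the hostname, paths, and retention numbers are
placeholders, not what I'll actually deploy:

# /etc/btrbk/btrbk.conf
timestamp_format        long
snapshot_preserve_min   latest
snapshot_preserve       48h 14d
target_preserve_min     latest
target_preserve         48h 14d 8w

volume /mnt/btr_pool
  snapshot_dir btrbk_snapshots
  subvolume @home
    target send-receive ssh://backup-host/mnt/backup/btrbk

plus an hourly cron entry or systemd timer running "btrbk run" on this
machine.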

Wish me luck,
Sam

