linux-btrfs.vger.kernel.org archive mirror
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: dsterba@suse.cz, Qu Wenruo <quwenruo.btrfs@gmx.com>,
	Jorge Bastos <jorge.mrbastos@gmail.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Maybe we want to maintain a bad driver list? (Was 'Re: "bad tree block start, want 419774464 have 0" after a clean shutdown, could it be a disk firmware issue?')
Date: Sat, 24 Jul 2021 19:15:27 -0400
Message-ID: <20210724231527.GF10170@hungrycats.org>
In-Reply-To: <20210722135455.GU19710@twin.jikos.cz>

On Thu, Jul 22, 2021 at 03:54:55PM +0200, David Sterba wrote:
> On Thu, Jul 22, 2021 at 08:18:21AM +0800, Qu Wenruo wrote:
> > 
> > 
> > On 2021/7/22 1:44 AM, David Sterba wrote:
> > > On Fri, Jul 16, 2021 at 11:44:21PM +0100, Jorge Bastos wrote:
> > >> Hi,
> > >>
> > >> This was a single disk filesystem, DUP metadata, and this week it
> > >> stopped mounting out of the blue. The data is not a concern since I
> > >> have a full fs snapshot in another server; I'm just curious why this
> > >> happened. I remember reading that some WD disks have firmware with
> > >> write cache issues, and I believe this disk is affected:
> > >>
> > >> Model family:Western Digital Green
> > >> Device model:WDC WD20EZRX-00D8PB0
> > >> Firmware version:80.00A80
> > >
> > > For the record, summing up the discussion from IRC with Zygo: this
> > > particular firmware, 80.00A80 on WD Green, is known to be problematic
> > > and would explain the observed errors.
> > >
> > > Recommendation is not to use WD Green or periodically disable the write
> > > cache by 'hdparm -W0'.
> > 
> > Zygo is always the god to expose bad hardware.
> > 
> > Can we maintain a list of known bad hardware inside btrfs-wiki?
> > And maybe escalate it to other fses too?
> 
> Yeah, a list on the wiki would be great, though I'm a bit skeptical
> about keeping it up to date: there are very few active wiki editors,
> and the knowledge is still mostly stored in the IRC logs. But without a
> landing page on the wiki we can't even start, so I'll create it.
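
On the 'hdparm -W0' recommendation quoted above: a minimal sketch of how
a periodic job might build that command. The function name and the
/dev/sdX placeholder are hypothetical, and the idea that the setting has
to be reapplied (e.g. after a power cycle) is my reading of
"periodically", not something verified for every drive. The code only
builds the argument list; it does not run hdparm.

```python
import shlex
from typing import List


def write_cache_off_command(device: str) -> List[str]:
    """Build (but do not run) the hdparm call that disables the write cache.

    Hypothetical helper: only the 'hdparm -W0' invocation itself comes
    from the thread.
    """
    if not device.startswith("/dev/"):
        raise ValueError("expected a block device path such as /dev/sdX")
    return ["hdparm", "-W0", device]


# /dev/sdX is a placeholder, not a real device name.
print(shlex.join(write_cache_off_command("/dev/sdX")))
# prints: hdparm -W0 /dev/sdX
```

Running the resulting command needs root, and would typically be wired
into a boot-time or udev hook rather than run by hand.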

Some points to note:

Most HDD *models* are good (all but 4% of models I've tested, and the
ones that failed were mostly 8?.00A8?), but the very few models that
are bad form a significant portion of drives in use:  they are the cheap
drives that consumers and OEMs buy millions of every year.

80.00A80 keeps popping up in parent-transid-verify-failed reports from
IRC users.  Sometimes also 81.00A81 and 82.00A82 (those two revisions
appear on some NAS vendor blacklists as well).  I've never seen 83.00A83
fail--I have some drives with that firmware, and they seem OK, and I
have not seen any reports about it.
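
A rough sketch of what a lookup against such a list could look like,
using only the revisions named above. The function name and the result
labels are made up for illustration, and treating each "?" in "8?.00A8?"
as a single wildcard character is an assumption. This is not an
authoritative list.

```python
import re

# Revisions named in this message as appearing in failure reports.
REPORTED_BAD = {"80.00A80", "81.00A81", "82.00A82"}
# Revision reported in this message as not having failed so far.
REPORTED_OK = {"83.00A83"}

# "8?.00A8?" read as one arbitrary character per "?" (assumption).
FAMILY_PATTERN = re.compile(r"8.\.00A8.")


def classify_firmware(revision: str) -> str:
    """Classify a firmware revision string against the anecdotal list."""
    rev = revision.strip().upper()
    if rev in REPORTED_BAD:
        return "reported-bad"
    if rev in REPORTED_OK:
        return "reported-ok"
    if FAMILY_PATTERN.fullmatch(rev):
        # Same naming family, but nothing is known about it here.
        return "same-family-unknown"
    return "unknown"


for rev in ("80.00A80", "83.00A83", "84.00A84"):
    print(rev, classify_firmware(rev))
```

Anything outside the two small sets falls through to an "unknown"
result on purpose: absence from the list says nothing about a drive.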

80.00A80 may appear in a lot of low-end WD drive models (here "low end"
is "anything below Gold and Ultrastar"), marketed under other names like
White Label, or starring as the unspecified model inside USB external
drives.
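
Since the marketing name doesn't identify the firmware, any check has to
key on the firmware field rather than the model. A minimal sketch,
assuming identity text in the "Name:value" form of the report quoted
earlier in this thread (real tool output may be spaced or capitalised
differently, hence the normalisation); parse_identity is a hypothetical
helper name.

```python
def parse_identity(text: str) -> dict:
    """Parse 'Name:value' lines into a dict with lower-cased keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Name:value' line
        fields[key.strip().lower()] = value.strip()
    return fields


# Sample copied from the report quoted earlier in the thread.
sample = """Model family:Western Digital Green
Device model:WDC WD20EZRX-00D8PB0
Firmware version:80.00A80"""

info = parse_identity(sample)
print(info["firmware version"])
# prints: 80.00A80
```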

The bad WD firmware has been sold over a period of at least 8 years.
Retail consumers can buy new drives today with this firmware (the most
recent instance we found was a WD Blue 1TB if I'm decoding the model
string correctly).  Even though WD seems to have fixed the bugs years
ago (in 83.00A83), the bad firmware doesn't die out as hardware ages
out of the user population because users keep buying new drives with
the old firmware.

It seems that _any_ HDD might have write cache issues if it is having
some kind of hardware failure at the same time (e.g. UNC sectors or
power supply issues).  A failing drive is a failing drive:  it might blow
up a btrfs with dup profile that would otherwise have survived.  It is
possible that firmware bugs are involved in these cases, but it's hard
to make a test fleet large enough for meaningful and consistent results.

SSDs are a different story:  there are so many models, firmware revisions
are far more diverse, and vendors are still rapidly updating their
designs, so we never see exactly the same firmware in any two incident
reports.  A firmware list would be obsolete in days.  There is nothing
in SSD firmware like the decade-long stability there is in HDD firmware.

IRC users report occasional parent-transid-verify-failure or similar
metadata corruption failures on SSDs, but they don't seem to be repeatable
with other instances of the same model device.  Samsung dominates the
SSD problem reports, but Samsung also dominates the consumer SSD market,
so I think we are just seeing messy-but-normal-for-SSD hardware failures,
not evidence of firmware bugs.

It's also possible that the window for exploiting a powerfail write cache
bug is much, much shorter for SSD than HDD, so even if the bugs do exist,
the probability of hitting one is negligible.
