Re: Extremely slow device removals

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Chris Murphy <lists@colorremedies.com>
Cc: Phil Karn <karn@ka9q.net>, Paul Jones <paul@pauljones.id.au>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Extremely slow device removals
Date: Sun, 3 May 2020 01:26:37 -0400
Message-ID: <20200503052637.GE10796@hungrycats.org>
In-Reply-To: <CAJCQCtTGg+Rmisw9QAj4SMaDcZ5e_2h_83-3Hjd=FDC5krgjCg@mail.gmail.com>

On Sat, May 02, 2020 at 11:48:18AM -0600, Chris Murphy wrote:
> On Sat, May 2, 2020 at 3:09 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On SD/MMC and below-$50 SSDs, silent data corruption is the most common
> > failure mode.  I don't think these disks are capable of detecting or
> > reporting individual sector errors.  I've never seen it happen.  They
> > either fall off the bus or they have a catastrophic failure and give
> > an error on every single access.
> 
> I'm still curious about the allocator to use for this device class. SD
> Cards usually self-report rotational=0. Whereas USB sticks report
> rotational=1. The man page seems to suggest nossd or ssd_spread.

Use dup metadata on all single-disk filesystems, unless you are making
an intentionally temporary filesystem (like a RAM disk, or a cache with
totally expendable contents).  The correct function for maximizing btrfs
lifetime does not have "rotational" as a parameter.
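
To check whether an existing filesystem already follows this advice,
the metadata profile can be read from the output of
`btrfs filesystem df <mountpoint>`.  What follows is only a minimal
sketch: the function names and the sample text are hypothetical, and it
assumes the usual "Type, PROFILE: total=..., used=..." line layout.

```python
import re

# Assumption: the input is the text printed by `btrfs filesystem df <mountpoint>`,
# with lines that look like "Metadata, DUP: total=1.00GiB, used=512.00MiB".
PROFILE_LINE = re.compile(r"^(?P<kind>\w+),\s*(?P<profile>\w+):")


def block_group_profiles(df_output):
    """Map each block group type (Data, Metadata, ...) to the set of its profiles."""
    profiles = {}
    for line in df_output.splitlines():
        match = PROFILE_LINE.match(line.strip())
        if match:
            profiles.setdefault(match["kind"], set()).add(match["profile"].lower())
    return profiles


def metadata_is_dup(df_output):
    """True only if every metadata block group uses the dup profile."""
    return block_group_profiles(df_output).get("Metadata") == {"dup"}


# Illustrative sample text, not captured from a real system.
SAMPLE = """\
Data, single: total=8.00GiB, used=6.50GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, single: total=1.00GiB, used=400.00MiB
GlobalReserve, single: total=16.00MiB, used=0.00B
"""

if not metadata_is_dup(SAMPLE):
    print("metadata is not dup")
```

If the check fails, `mkfs.btrfs -m dup <device>` sets the profile on a
new filesystem, and `btrfs balance start -mconvert=dup <mountpoint>`
converts an existing one.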

> In my very limited sample size from a single vendor, I've only seen SD
> Card fail by becoming read only. i.e. hardware read-only, with the
> kernel spewing sd/mmc related debugging info about the card (or card's
> firmware). Maybe that's a good example? 

Yes, that would be a good example if you can read the card.  Usually
when these devices hit the end of their lives there's nothing left
to read, or big chunks of data are misplaced or missing entirely.

All SSDs eventually end up read-only, completely inaccessible, or
otherwise incapable of accepting further writes, if you run them long
enough.  Since it's no longer possible to test the drive's capability
as a storage device after this happens, you can have at most one such
failure per drive.  All the other failure modes can happen multiple times.

Some cheap SSDs will flip a bit (either in data or in a sector address)
at some point during their testable lifetimes.  The same drive can do
this over and over, so the error counts get quite high, and this is
easily the single most common failure event.  Since the drive itself
seems unaware of the errors, it never hits any kind of internal limit
on the number of failures (contrast with UNC sectors, where eventually
the remapping table fills up).  Typical error rates are one sector every
few weeks once the drive is past 50% of its endurance rating, but some
cheap SSDs don't wait for 50% and start corrupting data right away.
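
Since the drive never reports these errors, the only running count is
the one the filesystem keeps.  On btrfs, checksum failures show up in
the per-device `corruption_errs` counter printed by
`btrfs device stats <mountpoint>`.  A minimal sketch of reading that
counter, assuming the usual "[device].counter value" line layout (the
function names and sample text are hypothetical):

```python
import re

# Assumption: the input is the text printed by `btrfs device stats <mountpoint>`,
# with lines such as "[/dev/sdb1].corruption_errs   3".
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(stats_output):
    """Return {device: {counter_name: value}} for every line that parses."""
    stats = {}
    for line in stats_output.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats.setdefault(match["dev"], {})[match["counter"]] = int(match["value"])
    return stats


def devices_with_corruption(stats_output):
    """List devices whose corruption_errs counter is above zero."""
    return sorted(
        dev
        for dev, counters in parse_device_stats(stats_output).items()
        if counters.get("corruption_errs", 0) > 0
    )


# Illustrative sample text, not captured from a real system.
SAMPLE_STATS = """\
[/dev/sdb1].write_io_errs    0
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  7
[/dev/sdb1].generation_errs  0
"""

for dev in devices_with_corruption(SAMPLE_STATS):
    print(dev, "has logged checksum failures")
```

The counters persist across mounts until they are reset, so comparing
them between scrubs gives a rough error rate for the device.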

Some cheap SSDs fail by dropping off the bus until power-cycled.
Sometimes they corrupt data and drop off the bus at the same time, so
this event can end up being included in the silent data corruption count.
That may produce an elevated silent data corruption count, but silent
data corruption is still the most common event even if all bus drops
are subtracted.

Some cheap SSDs fail by suddenly becoming two orders of magnitude slower.
This is rare, and there's no data loss in these events.

Some SSDs detect and report UNC sector errors, either on read operations
or SMART self-tests, which I presume are due to internal data corruption
combined with error checking by the firmware, though they could be
false positives.  Cheap SSDs never do this; it only occurs on drives
outside of the cheap SSD group.

I believe that the cheap SSDs are not capable of detecting or reporting
data corruption errors on individual sectors, given the large number
of opportunities they've been provided to demonstrate this capability
under my observation, and the exactly zero times they've used one.

Most of the above applies to SD/MMC devices as well, except I've never
seen an SD/MMC device that had the UNC sector error detection capability.
They only seem to have the cheap SSD failure modes.

> I suppose it's better to go
> read-only with data still readable, and insofar as Btrfs was concerned
> the data was correct, rather than start returning transiently bad
> data. However, I only knew this due to data checksums.
> 
> 
> -- 
> Chris Murphy
>