* Buggy disk firmware (fsync/FUA) and power-loss btrfs survivability
@ 2020-06-28 13:33 Pablo Fabian Wagner Boian
  2020-06-28 14:19 ` Hans van Kranenburg
  0 siblings, 1 reply; 4+ messages in thread
From: Pablo Fabian Wagner Boian @ 2020-06-28 13:33 UTC (permalink / raw)
  To: linux-btrfs

Hi.

Recently, it came to my knowledge that btrfs relies on disks honoring
fsync. So, when a transaction is issued, all of the tree structure is
updated (not in-place, bottom-up) and, lastly, the superblock is
updated to point to the new tree generation. If reordering happens (or
buggy firmware just drops its cache contents without updating the
corresponding sectors) then a power-loss could render the filesystem
unmountable.

Upon more reading, ZFS seems to implement a circular buffer in which
new pointers are updated one after another. That means that, if older
generations (in btrfs terminology) of the tree are kept on disk you
could survive such situations by just using another (older) pointer.

I seem to recall having read somewhere that the btrfs superblock
maintains four pointers to such older tree generations.

My question is: is the statement in this last paragraph true? If not,
could something like that be implemented in btrfs so that it does not
depend on correct fsync behaviour? I assume it would require an on-disk
format change. Lastly: are there any downsides to this approach?

I have skimmed the mailing list but couldn't find concise answers.
Bear in mind that I'm just a user, so I would really appreciate a very
brief explanation attached to any technical aspect of the response. If
any of these questions have no merit (or this isn't the appropriate
place to ask) I'm sorry for the noise and, please, ignore this mail.

Thanks.


* Re: Buggy disk firmware (fsync/FUA) and power-loss btrfs survivability
  2020-06-28 13:33 Buggy disk firmware (fsync/FUA) and power-loss btrfs survivability Pablo Fabian Wagner Boian
@ 2020-06-28 14:19 ` Hans van Kranenburg
  2020-06-29  0:15   ` waxhead
  0 siblings, 1 reply; 4+ messages in thread
From: Hans van Kranenburg @ 2020-06-28 14:19 UTC (permalink / raw)
  To: Pablo Fabian Wagner Boian, linux-btrfs

Hi!

On 6/28/20 3:33 PM, Pablo Fabian Wagner Boian wrote:
> Hi.
> 
> Recently, it came to my knowledge that btrfs relies on disks honoring
> fsync. So, when a transaction is issued, all of the tree structure is
> updated (not in-place, bottom-up) and, lastly, the superblock is
> updated to point to the new tree generation. If reordering happens (or
> buggy firmware just drops its cache contents without updating the
> corresponding sectors) then a power-loss could render the filesystem
> unmountable.
> 
> Upon more reading, ZFS seems to implement a circular buffer in which
> new pointers are updated one after another. That means that, if older
> generations (in btrfs terminology) of the tree are kept on disk you
> could survive such situations by just using another (older) pointer.

Btrfs does not keep older generations of trees on disk. *) Immediately
after completing a transaction, the space that was used by the previous
metadata can be overwritten again. IIRC when using the discard mount
option, it's even freed up directly at the device level, by unallocating
the physical space in e.g. the FTL of an SSD. So, even while not
overwritten yet, reading it back gives you zeros.

*) Well, only for fs trees, and only if you explicitly ask for it, when
making subvolume snapshots/clones.

> I seem to recall having read somewhere that the btrfs superblock
> maintains four pointers to such older tree generations.

Yes, and they're absolutely useless and dangerous to use, **) since even
if you manage to mount a filesystem, using one of them, any metadata in
any distant corner of a tree could have been overwritten already. So,
when trying that, you should directly umount again and run a thorough
btrfs check to verify that everything is present. But... ugh.

**) So, except for one case, which is the filesystem or hardware royally
messing up a transaction commit, and then only when using generation N-1
to recover, while there has not been any write to the filesystem in
between... So, if you try to mount it and it fails halfway through, then
it's likely already too late, because it could already have done some
stuff like cleaning up orphan objects, or whatever else causes writes
during mount.
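
(For completeness, and assuming I remember the option names correctly:
the mount option that makes the kernel try those backup tree roots is
'usebackuproot', which replaced the older 'recovery' option. If you ever
experiment with it, at least combine it with 'ro', so the mount attempt
itself cannot write anything:

  mount -o ro,usebackuproot /dev/sdX /mnt

where /dev/sdX is just a placeholder for the device in question.)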

> My question is: is the statement in this last paragraph true? If not,
> could something like that be implemented in btrfs so that it does not
> depend on correct fsync behaviour? I assume it would require an on-disk
> format change. Lastly: are there any downsides to this approach?

Btrfs could be changed to use the same snapshotting techniques in the
background as are already present for fs trees. In the very beginning of
Btrfs, this was actually done for a little while, by adding a new tree
root item to metadata tree 1 and then removing the previous one after
the transaction commit. However, because of the processing overhead,
this was soon replaced by in-memory magic that does not need to actually
make changes to tree 1. (See commit
5d4f98a28c7d334091c1b7744f48a1acdd2a4ae0 "Btrfs: Mixed back reference
(FORWARD ROLLING FORMAT CHANGE)")

The btrfs wiki apparently still lives in 2009, and it has a section
about how it worked before:

https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging

The filesystem trees (subvolumes) are reference counted, which makes it
possible to snapshot them and then properly do long term on-disk
administration of which little parts of metadata are shared or not
between trees, so that when removing subvolumes or (part of) their
contents, the fs knows what metadata pages to free up.

The other trees (like the extent tree) are 'cowonly', which means that
all new writes go to new empty space, so the fs can crash and recover
(yes, if the hardware behaves as it expects, as you already said). But,
instead of using reference counting, there's an in-memory blacklist of
'pinned extents', which lists disk space that should not be overwritten
yet, while there is no 'real' on-disk information about them.

The obvious downside of making all trees fully snapshottable and
reference counted is that it would be a gigantic performance disaster,
probably bringing users' actual use of their filesystems to a screeching
halt while hammering on the disk all day long. But yes, in that case you
could theoretically snapshot the ENTIRE filesystem. Would be fun to do
as an experiment. \:D/

> I have skimmed the mailing list but couldn't find concise answers.
> Bear in mind that I'm just a user, so I would really appreciate a very
> brief explanation attached to any technical aspect of the response. If
> any of these questions have no merit (or this isn't the appropriate
> place to ask) I'm sorry for the noise and, please, ignore this mail.

It's not noise. Instead, it's a very good question.

So, when browsing btrfs source code and history, some of the relevant
words to look for are 'reference counted', 'pinned' and 'cowonly'.

Have fun,
Hans


* Re: Buggy disk firmware (fsync/FUA) and power-loss btrfs survivability
  2020-06-28 14:19 ` Hans van Kranenburg
@ 2020-06-29  0:15   ` waxhead
  2020-06-29 23:05     ` Zygo Blaxell
  0 siblings, 1 reply; 4+ messages in thread
From: waxhead @ 2020-06-29  0:15 UTC (permalink / raw)
  To: Hans van Kranenburg, Pablo Fabian Wagner Boian, linux-btrfs

Hans van Kranenburg wrote:
> Hi!
> 
> On 6/28/20 3:33 PM, Pablo Fabian Wagner Boian wrote:
>> Hi.
>>
>> Recently, it came to my knowledge that btrfs relies on disks honoring
>> fsync. So, when a transaction is issued, all of the tree structure is
>> updated (not in-place, bottom-up) and, lastly, the superblock is
>> updated to point to the new tree generation. If reordering happens (or
>> buggy firmware just drops its cache contents without updating the
>> corresponding sectors) then a power-loss could render the filesystem
>> unmountable.
>>
>> Upon more reading, ZFS seems to implement a circular buffer in which
>> new pointers are updated one after another. That means that, if older
>> generations (in btrfs terminology) of the tree are kept on disk you
>> could survive such situations by just using another (older) pointer.
> 
> Btrfs does not keep older generations of trees on disk. *) Immediately
> after completing a transaction, the space that was used by the previous
> metadata can be overwritten again. IIRC when using the discard mount
> option, it's even freed up directly at the device level, by unallocating
> the physical space in e.g. the FTL of an SSD. So, even while not
> overwritten yet, reading it back gives you zeros.
> 
> *) Well, only for fs trees, and only if you explicitly ask for it, when
> making subvolume snapshots/clones.

So just out of curiosity... if BTRFS internally at every successful
mount did a 'btrfs subvolume snapshot /mountpoint /mountpoint/fsbackup1',
you would always have a good filesystem tree to fall back to?! Would
this be correct?

And if so - this would mean that you would lose everything that
happened since the last mount, but compared to having a catastrophic
failure this sounds much, much better.

And if I, as just a regular BTRFS user with my (possibly distorted)
view, see this correctly: if you would leave the top level subvolume (5)
untouched and avoid updates to it except for creating child subvolumes,
you would reduce the risk of catastrophic failure in case a fsync does
not work out, as only the child subvolumes (which are regularly updated)
would be at risk.

And if BTRFS internally made alternating snapshots of the root
subvolume (5)'s child subvolumes, you would lose at most 30 sec x 2
(or whatever the commit time is set to) of data.

E.g. keep only child subvolumes on the top level (5).
And if we pretend the top level has a child subvolume called rootfs,
then BTRFS could internally auto-snapshot (5)/rootfs, alternating
between (5)/rootfs_autobackup1 and (5)/rootfs_autobackup2.

Do I understand this correctly, or would there be any (significant)
performance drawback to this? Quite frankly I assume there is, or else
I guess it would have been done already, but it never hurts (that much)
to ask...
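
Just to make concrete what I mean, here is the userspace equivalent with
made-up mountpoint and subvolume names (purely illustrative, and I'm not
claiming that doing this from userspace by itself protects anything):

  # Drop the older backup and replace it with a fresh read-only snapshot;
  # the next time around, do the same with rootfs_autobackup2 instead.
  btrfs subvolume delete /mnt/top/rootfs_autobackup1
  btrfs subvolume snapshot -r /mnt/top/rootfs /mnt/top/rootfs_autobackup1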


* Re: Buggy disk firmware (fsync/FUA) and power-loss btrfs survivability
  2020-06-29  0:15   ` waxhead
@ 2020-06-29 23:05     ` Zygo Blaxell
  0 siblings, 0 replies; 4+ messages in thread
From: Zygo Blaxell @ 2020-06-29 23:05 UTC (permalink / raw)
  To: waxhead; +Cc: Hans van Kranenburg, Pablo Fabian Wagner Boian, linux-btrfs

On Mon, Jun 29, 2020 at 02:15:07AM +0200, waxhead wrote:
> Hans van Kranenburg wrote:
> > Hi!
> > 
> > On 6/28/20 3:33 PM, Pablo Fabian Wagner Boian wrote:
> > > Hi.
> > > 
> > > Recently, it came to my knowledge that btrfs relies on disks honoring
> > > fsync. So, when a transaction is issued, all of the tree structure is
> > > updated (not in-place, bottom-up) and, lastly, the superblock is
> > > updated to point to the new tree generation. If reordering happens (or
> > > buggy firmware just drops its cache contents without updating the
> > > corresponding sectors) then a power-loss could render the filesystem
> > > unmountable.

The buggy firmware problem is trivial to work around: simply disable
write caching in the drive (e.g. hdparm -W0 in udev).  Problem goes away.
I've been running btrfs on old WD drives with firmware bugs for years.
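
For reference, with /dev/sdX as a placeholder for the drive in question:

  hdparm -W  /dev/sdX    # show whether the volatile write cache is on
  hdparm -W0 /dev/sdX    # turn it off

Hook the second command into a udev rule or an init script on your
distro to make it persistent across reboots and hotplugs.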

It's fairly easy to test drives for buggy write cache firmware:
mkfs.btrfs, start writing data, start a metadata balance, run sync
in a while loop, disconnect power, repeat about 20 times, see if the
filesystem still mounts.  It only takes a few hours to find bad
firmware this way.  SMART long self-tests take longer.
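
A rough sketch of the host side of that loop (device and paths are
placeholders, the power cut itself has to come from outside, e.g. a
switched PDU, and this will of course destroy everything on the drive):

  mkfs.btrfs -f /dev/sdX
  mount /dev/sdX /mnt/test
  dd if=/dev/urandom of=/mnt/test/fill bs=1M count=8192 &   # data writes
  btrfs balance start -m /mnt/test &                        # metadata churn
  while true; do sync; sleep 1; done                        # flush constantly
  # ...cut power here, power back on, then check whether
  # 'mount /dev/sdX /mnt/test' still works.  Repeat ~20 times.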

Some admins assume all firmware is buggy and preemptively turn off write
cache on all drives.  I don't think that approach is well supported
by the data.  I tested 100+ drive models, and found only 4 with write
cache bugs.  Pick any random pair of drive models for raid1, and there's a
99%+ chance that at least one of them has working firmware.  The good
drive can carry the filesystem even if the other drive has bad firmware.
If the raid1 array goes into degraded mode, temporarily turn off write
cache on the surviving drive until the failed drive is replaced.

There are several firmware issues that can make a drive unfit for purpose.
Not all of these are write caching or even data integrity bugs.  If you
are deploying at scale, assume 10% of the drive models you'll test
are unusable due to assorted firmware issues.  Less than half of these
are write cache bugs--the rest are other data integrity bugs, firmware
crashing bugs, and crippling performance issues triggered by specific
application workloads.  Plan vendor qualification tests and product
QA accordingly.  Plan for common mode failure risks when designing
storage redundancy.

For the following, I'll assume that for some reason you're going to insist
on enabling the write cache on a drive with possibly buggy firmware and
you have not managed the risks in some better way, and then just see
where that absurd premise goes.

> > > Upon more reading, ZFS seems to implement a circular buffer in which
> > > new pointers are updated one after another. That means that, if older
> > > generations (in btrfs terminology) of the tree are kept on disk you
> > > could survive such situations by just using another (older) pointer.
> > 
> > Btrfs does not keep older generations of trees on disk. *) Immediately
> > after completing a transaction, the space that was used by the previous
> > metadata can be overwritten again. IIRC when using the discard mount
> > option, it's even freed up directly at the device level, by unallocating
> > the physical space in e.g. the FTL of an SSD. So, even while not
> > overwritten yet, reading it back gives you zeros.
> > 
> > *) Well, only for fs trees, and only if you explicitly ask for it, when
> > making subvolume snapshots/clones.
>
> So just out of curiosity... if BTRFS internally at every successful
> mount did a 'btrfs subvolume snapshot /mountpoint /mountpoint/fsbackup1',
> you would always have a good filesystem tree to fall back to?! Would
> this be correct?
> 
> And if so - this would mean that you would lose everything that happened
> since the last mount, but compared to having a catastrophic failure this
> sounds much, much better.

This would also imply that you cannot delete _anything_ in the filesystem
between mounts.  Without some kind of working write barrier, you can
only append new data to the filesystem.  No sector can ever move from
an occupied state to an unoccupied state, because you might not have
successfully removed all references to the occupied sector (*).  If there
are no working write barriers, you don't know if any write was complete
or successful, so you have to assume any write could have failed.

(*) OK you could do that with ext2, but ext2 had potential data loss
and had to run full fsck after _every_ crash.  Not a model for btrfs
to follow.

In effect, you've got a single write barrier event during the mount,
in the sense that the filesystem has freshly booted, and there are no
earlier writes that could possibly be waiting to complete in the drive's
write cache.  Since the drive firmware is buggy, we can't get working
write barriers from the drive itself.  The only way we get another
reliable write barrier is to power off the drive, power it back on,
check and mount the filesystem again.  umounting and mounting are not
sufficient because those rely on properly functioning write barriers
to work.  We power off the drive to make sure its firmware isn't giving
us good data from its volatile RAM cache instead of bad data from the
disk--if the drive did that, then our mount test would pass, but the
data on the disk could be broken, and we wouldn't discover this until
it was too late to recover.  If we power off the drive, we might wipe
its volatile write cache before its contents are written to disk (no
working barrier means we can't prevent that), so we have to verify the
filesystem on the following mount to make sure we haven't lost any writes,
or recover if we have.
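
The verification itself would be the usual tools, e.g. (with /dev/sdX
and /mnt as placeholders):

  btrfs check /dev/sdX        # offline check, read-only by default
  btrfs scrub start -B /mnt   # online scrub, verifies all checksums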

If we don't have working write barriers, can we fake them?

Write caching firmware bugs come in two forms:  reordering, and lost
writes.  To btrfs, they both look like ordinary data corruption on the
disk, except that when the corruption is caused by write caching, it
breaks writes that occur at close to the same time (e.g. both mirrored
copies of dup metadata on btrfs).  This makes the data loss unrecoverable
by the normal methods btrfs uses to recover from metadata loss.  Compare
with an ordinary bad sector:  btrfs separates dup metadata by a gigabyte
of disk surface, so (on spinning disks) there is some physical distance
between metadata copies, and a few bad sectors will damage only one copy.

The lost writes case is where the drive silently loses the contents of
its write cache after notifying the host that the data was written to disk
successfully (e.g. due to a hardware fault combined with a firmware bug).
In this case the filesystem has been damaged, and the host does not know
that this damage has occurred until the next time btrfs tries to read the
lost data.  The standard way to recover from lost metadata is to keep
two copies in dup metadata block groups, but write caching bugs will
destroy both copies because they are written at close to the same time.

We can increase the time between writes of duplicate copies of data,
but this dramatically increases kernel memory usage.  You'll need enough
RAM to separate writes by enough time to _maybe_ have the write cache get
flushed, assuming that's possible at all on a drive with buggy firmware.
At 500 MB/s, even a few seconds of delay is a lot of RAM.  If you don't
have enough RAM, you have to throttle the filesystem so that the amount
of RAM you do have provides you with enough buffering time.
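
To put a number on it: at 500 MB/s, a delay of, say, 5 seconds is already
500 MB/s * 5 s = 2.5 GB of writes that have to sit in RAM per delayed
copy, before you even get to the question of whether 5 seconds is enough
for that particular firmware.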

The reordering case is the one that happens when there's a power failure.
The drive is told by the host to write block A, then B, then C, but
instead the drive writes C, then B, then A.  This is OK as long as the
drive does eventually write all 3 blocks, and as long as at some point a
tree--and its shared pages referenced from all earlier versions of the
tree--is completely written on the disk, with no overwrites from later
transactions.  Without write barriers, the drive might constantly start
writes on new transaction trees without finishing old ones.  After a
crash we might have to rewind all the way back to the previous mount to
find an intact tree.

If the firmware is freed of the requirement to respect write ordering,
then it can indefinitely postpone any write, and its behavior is limited
only by physical constraints.  e.g. if the disk has 256MB RAM cache,
the firmware can't fake more than 256MB of data at a time--when the RAM
runs out, the firmware has to prove that it can read data that it has
written to the disk, or its lies will be exposed.  If we try to read any
sector we recently wrote, the drive can simply reply with the contents
of its RAM cache, so we can't verify the contents of the disk unless we
circumvent read caching as well.

We could try to defeat the drive firmware by brute force.  We make a
guess about the firmware's behavior wrt its constraints, e.g. after
50GB of reads and writes, we assume that a 256MB RAM cache running a
LRU-ish algorithm is thoroughly purged many times over, any earlier
writes are all safely on disk, and any future reads will return data
from the disk surface.

We could track how many writes btrfs did, and delete filesystem trees that
happened 50GB of writes ago.  But this guess could be wrong.  Some drives
will postpone writes indefinitely--if they receive a continuous series
of IO requests, they may never write some sectors from their write cache
at all, or they will prioritize linear multi-sector accesses over seeks
to update a single sector (like a root page, or the btrfs superblock)
and start writing several trees at once without completing any of them.
Never underestimate what a vendor will do to win a benchmark.

50GB of written data may not be sufficient--we might have to throttle
writes from the filesystem as well, to give the drive some idle time to
finish flushing its write cache.  Or the drive might drop 256MB somewhere
in the middle of that 50GB, and btrfs won't find out about the damage
until the next scrub.  So whether we are successful depends a lot on
how reliably we can predict how buggy drive firmware will behave, and
just how buggy the firmware is.

If we try to verify the contents of the disk by reading the data back,
we can fail badly if the drive implements an ARC-ish caching algorithm
instead of a LRU-ish one.  The drive might be able to successfully predict
our verification reads and keep their data in cache, so we don't get
accurate data about what is on disk.  We might delete a tree that is
not referenced in the drive's RAM cache but is referenced on disk.
The filesystem fails when the drive is reset because the drive can no
longer maintain the fiction of intact data on disk.

Another more straightforward way to brute-force a write cache purge is to
write hundreds of MB to random free blocks on the filesystem.  This could
be used to provide fsync()-like semantics, but without anything close
to normal fsync() performance, even with write caching disabled.

Another problem is that btrfs has physical constraints too.  If we
don't have 50GB of free space, our fake write barriers that rely on
writing 50GB of new data are no longer possible, but we urgently need
to delete something in part _because_ our fake write barriers are no
longer possible.  In the worst case (like a snapshot delete on a full
filesystem), we might end up doing multiple commits within a 256MB
write window, and at that point we no longer have even the fake write
barriers--all of our writes can be indefinitely postponed or reordered
in the drive's volatile RAM cache.  If there's a crash, boom, the
firmware bug ends the filesystem.  Even if the filesystem doesn't crash,
rolling back to a point in time where you had 50GB of free space more
than you do now--possibly all the way to the previous mount--can be
pretty rough.

> And if I, as just a regular BTRFS user with my (possibly distorted) view,
> see this correctly: if you would leave the top level subvolume (5) untouched
> and avoid updates to it except for creating child subvolumes, you would
> reduce the risk of catastrophic failure in case a fsync does not work out,
> as only the child subvolumes (which are regularly updated) would be at risk.
> 
> And if BTRFS internally made alternating snapshots of the root subvolume
> (5)'s child subvolumes, you would lose at most 30 sec x 2 (or whatever the
> commit time is set to) of data.
> 
> E.g. keep only child subvolumes on the top level (5).
> And if we pretend the top level has a child subvolume called rootfs, then
> BTRFS could internally auto-snapshot (5)/rootfs, alternating between
> (5)/rootfs_autobackup1 and (5)/rootfs_autobackup2.
> 
> Do I understand this correctly, or would there be any (significant)
> performance drawback to this? Quite frankly I assume there is, or else I
> guess it would have been done already, but it never hurts (that much) to
> ask...

Usually when SHTF in the filesystem, it's the non-subvol trees that break
and ruin your day.  The extent tree is where all reference counting in all
of btrfs is done.  Broken extent trees are very hard to fix--you have to
walk all the subvol trees to recreate the extent tree, or YOLO fix them
one inconsistency at a time while the filesystem is running (hope the
reference count on a metadata page never reaches a negative number, or
you lose even more data very quickly!).  Small commits involve thousands
of extent tree updates.  Big ones do millions of updates.

I'm not sure off the top of my head whether the 300x write multiplier for
new snapshots would apply to a hypothetical snapshot of the extent tree,
but if it did, it would mean 300x write multipliers more or less all
of the time.  When a snapshot page is updated, the page is CoWed, but
also every reference to and backreference from the page is also CoWed.
Usually with snapshot subvols the write multiplier drops rapidly to 1x
on average after a few seconds of activity, but with a snapshot on every
commit (every 30 seconds) you'd be lucky to get below a 10x multiple ever.

The huge performance gain that came from not literally creating a
snapshot on every commit was that the backrefs didn't need to be updated
because the copied CoW pages were deleted in the same transaction that
created them.  If you're creating persistent copies on every commit, then
there's hundreds of ref updates on pages in every commit.  Maybe there's
some other way to do that (with a btrfs disk format change and maybe a
circular buffer delete list?), but the existing snapshot mechanisms are
much slower than simply disabling the write cache.


end of thread

Thread overview: 4 messages
2020-06-28 13:33 Buggy disk firmware (fsync/FUA) and power-loss btrfs survivability Pablo Fabian Wagner Boian
2020-06-28 14:19 ` Hans van Kranenburg
2020-06-29  0:15   ` waxhead
2020-06-29 23:05     ` Zygo Blaxell
