How do damaged root trees happen and how to protect against power cut?

* How do damaged root trees happen and how to protect against power cut?
@ 2020-03-19 15:14 Carsten Behling
  2020-03-19 19:45 ` Chris Murphy
  2020-03-20  0:46 ` Qu Wenruo
  0 siblings, 2 replies; 5+ messages in thread
From: Carsten Behling @ 2020-03-19 15:14 UTC (permalink / raw)
  To: linux-btrfs

Hi,

the investigation of damaged root trees has already been discussed in the
thread starting with

https://www.spinics.net/lists/linux-btrfs/msg74019.html

However, one point wasn't discussed at the end:

> I thought so too. Is there a reason why they ended up being colocated?
> I'm surprised with all the redundancies btrfs is capable of, this can
> happen. Was it because the volume was starting to become full? (This
> whole exercise of turning on mirroring was because we're migrating to
> bigger disks)

Because I have the same issue on an embedded system, after a power
cut, where none of the root tree copies are usable anymore, I'd also
like to know :

- How can we end up in that unrecoverable state?
- Why can't we protect the fs against the unrecoverable state?
- Why is that error so hard to recover from?

Furthermore, I'd like to know what would be the best solution for an
embedded system where power cuts are unavoidable (because of a missing
circuit). I'm thinking of using a read-only rootfs with a separate
data partition to ensure at least a booting system. But anyway, the
data partition could end up in the same state.

I'm not sure if it would be also a good option working with snapshots.
My space on the embedded device is limited to 8GB. The OS already
takes about 4GB.

Best regards
Carsten

* Re: How do damaged root trees happen and how to protect against power cut?
  2020-03-19 15:14 How do damaged root trees happen and how to protect against power cut? Carsten Behling
@ 2020-03-19 19:45 ` Chris Murphy
       [not found]   ` <CAPuGWB-XyYya263K2gWriv5sGVLQbbzpKD3R01GkxiwNw-LdTA@mail.gmail.com>
  2020-03-20  0:46 ` Qu Wenruo
  1 sibling, 1 reply; 5+ messages in thread
From: Chris Murphy @ 2020-03-19 19:45 UTC (permalink / raw)
  To: Carsten Behling; +Cc: Btrfs BTRFS

On Thu, Mar 19, 2020 at 9:14 AM Carsten Behling
<carsten.behling@googlemail.com> wrote:
>
> Hi,
>
> the investigation of damaged root trees has already been discussed in the
> thread starting with
>
> https://www.spinics.net/lists/linux-btrfs/msg74019.html
>
> However, one point wasn't discussed at the end:
>
> > I thought so too. Is there a reason why they ended up being colocated?
> > I'm surprised with all the redundancies btrfs is capable of, this can
> > happen. Was it because the volume was starting to become full? (This
> > whole exercise of turning on mirroring was because we're migrating to
> > bigger disks)
>
> Because I have the same issue on an embedded system, after a power
> cut, where none of the root tree copies are usable anymore, I'd also
> like to know :
>
> - How can we end up in that unrecoverable state?
> - Why can't we protect the fs against the unrecoverable state?
> - Why is that error so hard to recover from?

I'm interested in this too. Also I want to know whether and what Btrfs
debug or consistency check flags are applicable in discovering these
problems as near to the time as they occur; whether they're Btrfs,
block layer, or device problems.

> Furthermore, I'd like to know what would be the best solution for an
> embedded system where power cuts are unavoidable (because of a missing
> circuit). I'm thinking of using a read-only rootfs with a separate
> data partition to ensure at least a booting system. But anyway, the
> data partition could end up in the same state.
>
> I'm not sure if it would be also a good option working with snapshots.
> My space on the embedded device is limited to 8GB. The OS already
> takes about 4GB.

Seed device?

Create a Btrfs file system, use space_cache v2,
compress-force=zstd:16, and write the root image. Resize the file
system to minimum. Set the seed flag. That's the base image. Part of
the provisioning will be to 'btrfs device add' a 2nd partition, and
remount read-write. This means two Btrfs file systems exist, each with
their own UUID. You can reference the read-only seed by its UUID; and
you can reference the read-write volume by its own UUID. On-disk
metadata for this read-write volume points to both the read-only seed
devid1, and the writable 2nd device devid2.

Make sure write cache on the physical media is disabled.
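
As an illustration (not part of the original message), here is a small sketch for inspecting how the kernel currently treats a device's write cache. It assumes a kernel that exposes the `queue/write_cache` attribute in sysfs; the function name and the optional `SYSFS_ROOT` argument are hypothetical. Note that this attribute reflects the kernel's view of the cache and does not by itself prove that the media honors flushes.

```shell
#!/bin/sh
# write_cache_mode DEV [SYSFS_ROOT]
# Prints the kernel's view of the device write cache, normally
# "write back" or "write through", or "unknown" if the attribute is absent.
# SYSFS_ROOT is a hypothetical override, useful only for testing.
write_cache_mode() {
    dev=$1
    sysfs=${2:-/sys}
    f="$sysfs/block/$dev/queue/write_cache"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown"
    fi
}

# Example: report the mode for every block device that exposes the attribute.
for d in /sys/block/*; do
    [ -e "$d/queue/write_cache" ] || continue
    printf '%s: %s\n' "${d##*/}" "$(write_cache_mode "${d##*/}")"
done
```

On ATA-style devices, `hdparm -W 0 <device>` is the usual way to ask the drive itself to disable its cache; whether an SD card or USB reader honors such a request is something to verify per device.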

It might be true that 'flushoncommit' and 'notreelog' reduce
complexity for recovery following a crash; at the expense of losing
some data in the latter case. (It's been suggested before in the
archives, but I have no good way to test whether it results in fewer
crash/powerfail recovery problems, because I personally haven't hit any
problems with the default mount options, despite hundreds of
intentional force power offs while writing.)
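
For illustration only, a minimal sketch of how the two options mentioned above could be written into an fstab entry. The helper name and the UUID placeholder are hypothetical, and the remaining fields (`noatime`, `0 0`) are assumptions rather than a recommendation from this thread.

```shell
#!/bin/sh
# fstab_root_line FS_UUID
# Prints an fstab entry for / that enables flushoncommit and notreelog.
# FS_UUID is a placeholder for the UUID of the writable filesystem.
fstab_root_line() {
    printf 'UUID=%s  /  btrfs  noatime,flushoncommit,notreelog  0  0\n' "$1"
}

fstab_root_line "REPLACE-WITH-FS-UUID"

# To try the options on a running system without editing fstab (needs root):
#   mount -o remount,flushoncommit,notreelog /
```

Whether these options actually improve recovery on a given device is exactly the open question raised above, so treat this as something to test rather than a fix.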

For embedded systems, consider using industrial flash. They are slower
but more reliable, especially in the case of a power cut. SD Cards are
notorious for corruption and going permanently read-only when power is
cut; but I've had this problem with USB sticks too.

-- 
Chris Murphy

* Re: How do damaged root trees happen and how to protect against power cut?
  2020-03-19 15:14 How do damaged root trees happen and how to protect against power cut? Carsten Behling
  2020-03-19 19:45 ` Chris Murphy
@ 2020-03-20  0:46 ` Qu Wenruo
  1 sibling, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2020-03-20  0:46 UTC (permalink / raw)
  To: Carsten Behling, linux-btrfs

On 2020/3/19 11:14 PM, Carsten Behling wrote:
> Hi,
> 
> the investigation of damaged root trees has already been discussed in the
> thread starting with
> 
> https://www.spinics.net/lists/linux-btrfs/msg74019.html
> 
> However, one point wasn't discussed at the end:
> 
>> I thought so too. Is there a reason why they ended up being colocated?
>> I'm surprised with all the redundancies btrfs is capable of, this can
>> happen. Was it because the volume was starting to become full? (This
>> whole exercise of turning on mirroring was because we're migrating to
>> bigger disks)
> 
> Because I have the same issue on an embedded system, after a power
> cut, where none of the root tree copies are usable anymore, I'd also
> like to know :
> 
> - How can we end up in that unrecoverable state?

There are two main reasons:
- Btrfs bug
  The most recent one affects v5.2.0 through v5.2.14.
  There may be some more in older kernels.

- Bad storage stack below btrfs
  The critical part is the FLUSH/FUA behavior.
  The spec requires that FLUSH/FUA return only after all data is written
  to storage or non-volatile cache.

  Btrfs heavily depends on metadata COW to keep it corruption free
  against power loss.
  If FLUSH/FUA is not working correctly, then btrfs is completely
  doomed.

> - Why can't we protect the fs against the unrecoverable state?

If it's the hardware, we have no way to protect against it.

> - Why is that error so hard to recover from?

As the only safety net is broken, there is no way to recover from such
deadly corruption.

> 
> Furthermore, I'd like to know what would be the best solution for an
> embedded system where power cuts are unavoidable (because of a missing
> circuit). I'm thinking of using a read-only rootfs with a separate
> data partition to ensure at least a booting system. But anyway, the
> data partition could end up in the same state.

Since it may be hardware related, I recommend doing a power-loss test
using the latest kernel.

If the SD card is the problem, it should be pretty easy to corrupt the
fs under heavy btrfs write load combined with power loss.

Then you can try other SD cards until you find a good one, or prove it's
the kernel's fault so that we can address it.
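
A minimal sketch of what such a write load could look like (an illustration, not a script from this thread). The function name, file sizes, target directory and device name are all assumptions; the idea is simply to keep committing data while power is cut at a random moment, then check the filesystem on the next boot.

```shell
#!/bin/sh
# write_load DIR COUNT
# Writes COUNT files of 64 KiB each into DIR, forcing each one to stable
# storage (conv=fsync) and appending its index to DIR/progress.log.
write_load() {
    dir=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        dd if=/dev/urandom of="$dir/file.$i" bs=4k count=16 conv=fsync 2>/dev/null || return 1
        echo "$i" >> "$dir/progress.log"
        i=$((i + 1))
    done
    sync
}

# On the target, run with a large COUNT and cut power at a random moment:
#   write_load /data/stress 100000
# After the next boot, before mounting read-write (device name is a placeholder):
#   btrfs check --readonly /dev/mmcblk0p3
#   dmesg | grep -i btrfs
```

Repeating this many times on the same card, and then on a different card, gives a rough way to separate a storage problem from a kernel problem.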

Thanks,
Qu

> 
> I'm not sure if it would be also a good option working with snapshots.
> My space on the embedded device is limited to 8GB. The OS already
> takes about 4GB.
> 
> Best regards
> Carsten
> 

* Fwd: How do damaged root trees happen and how to protect against power cut?
       [not found]   ` <CAPuGWB-XyYya263K2gWriv5sGVLQbbzpKD3R01GkxiwNw-LdTA@mail.gmail.com>
@ 2020-03-24  7:54     ` Carsten Behling
  2020-03-30 22:43       ` Chris Murphy
  0 siblings, 1 reply; 5+ messages in thread
From: Carsten Behling @ 2020-03-24  7:54 UTC (permalink / raw)
  To: linux-btrfs, Qu Wenruo

---------- Forwarded message ---------
From: Carsten Behling <carsten.behling@googlemail.com>
Date: Tue, 24 Mar 2020 at 08:51
Subject: Re: How do damaged root trees happen and how to protect
against power cut?
To: Chris Murphy <lists@colorremedies.com>


Carsten Behling <carsten.behling@googlemail.com>

Mon, 23 Mar, 16:58 (15 hours ago)
to Chris
> Seed device?
>
> Create a Btrfs file system, use space_cache v2,
> compress-force=zstd:16, and write the root image. Resize the file
> system to minimum. Set the seed flag. That's the base image. Part of
> the provisioning will be to 'btrfs device add' a 2nd partition, and
> remount read-write. This means two Btrfs file systems exist, each with
> their own UUID. You can reference the read-only seed by its UUID; and
> you can reference the read-write volume by its own UUID. On-disk
> metadata for this read-write volume points to both the read-only seed
> devid1, and the writable 2nd device devid2.
>
> Make sure write cache on the physical media is disabled.

Are these the correct steps in detail?

1. Partition SD card with:
- (write Bootloader ...)
- first partition boot (FAT32 (0x0b), 50MB)
- second partition (Linux Native (0x83), minimum possible size to fit rootfs)
- third partition (Linux Native (0x83), rest)
- (write boot files (kernel ...))

2. Create seed device on development host:

# mkfs.btrfs --rootdir ~/rootfs --shrink /dev/sda2 # sda is my SD card device
# btrfstune -S 1 /dev/sda2
# dd if=/dev/zero of=/dev/sda3 bs=1024
# mount /dev/sda2 /mnt
# btrfs device add /dev/sda3 /mnt
# hdparm -W 0 /dev/sda3 # disable write cache

3. Mount on embedded device

- Kernel command line option: "root=/dev/mmcblk0p2 ro rootwait"
- Later, 'systemd-remount-fs.service' remounts seed device 'rw' by
applying mount options from fstab:
...
# 'defaults' includes 'rw', 'ROOT' is /dev/mmcblk0p2 (seed device)
LABEL=ROOT  /  btrfs  defaults,noatime,nodiratime,space_cache=v2,compress-force=zstd:16  1  1
...

Is this correct?

Regards
Carsten

* Re: How do damaged root trees happen and how to protect against power cut?
  2020-03-24  7:54     ` Fwd: " Carsten Behling
@ 2020-03-30 22:43       ` Chris Murphy
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Murphy @ 2020-03-30 22:43 UTC (permalink / raw)
  To: Carsten Behling; +Cc: Btrfs BTRFS, Qu Wenruo

On Tue, Mar 24, 2020 at 1:54 AM Carsten Behling
<carsten.behling@googlemail.com> wrote:

> Mon, 23 Mar, 16:58 (15 hours ago)
> to Chris
> > Seed device?
> >
> > Create a Btrfs file system, use space_cache v2,
> > compress-force=zstd:16, and write the root image. Resize the file
> > system to minimum. Set the seed flag. That's the base image. Part of
> > the provisioning will be to 'btrfs device add' a 2nd partition, and
> > remount read-write. This means two Btrfs file systems exist, each with
> > their own UUID. You can reference the read-only seed by its UUID; and
> > you can reference the read-write volume by its own UUID. On-disk
> > metadata for this read-write volume points to both the read-only seed
> > devid1, and the writable 2nd device devid2.
> >
> > Make sure write cache on the physical media is disabled.
>
> Are these the correct steps in detail?

I can't sanity check every single step. But I'll comment on what I can.

>
> 1. Partition SD card with:
> - (write Bootloader ...)
> - first partition boot (FAT32 (0x0b), 50MB)
> - second partition (Linux Native (0x83), minimum possible size to fit rootfs)
> - third partition (Linux Native (0x83), rest)
> - (write boot files (kernel ...))

Seems like bootloader happens later, whether BIOS or UEFI.

>
> 2. Create seed device on development host:
>
> # mkfs.btrfs --rootdir ~/rootfs --shrink /dev/sda2 # sda is my SD card device
> # btrfstune -S 1 /dev/sda2
> # dd if=/dev/zero of=/dev/sda3 bs=1024
> # mount /dev/sda2 /mnt
> # btrfs device add /dev/sda3 /mnt
> # hdparm -W 0 /dev/sda3 # disable write cache

I haven't populated a btrfs file system using --rootdir option of
mkfs. I've only ever done it by using kernel code (mounted file
system) and then just shrink the resulting file system to minimum size
and/or fstrim so that it's a sparse file. That way I can also take
advantage of fs compression for the seed.

I'd substitute the dd command above with 'blkdiscard' and relocate it
to step 1 as a preparation step.

Pretty sure you need 'mount -o remount,rw' before it's possible to add
a 2nd device.

The hdparm step is probably only important for production use.
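
Putting these comments together, the provisioning sequence might be sketched as follows. This is an illustration under assumptions, not a tested recipe: the device names, the `run` and `make_seed_and_sprout` helpers, and the mount point are placeholders. The script only prints the commands unless `DRY_RUN=0` is set, because they are destructive. The order of `btrfs device add` followed by `mount -o remount,rw` follows the seeding-device example in the btrfs documentation, which differs from the order guessed above, so verify it on your kernel.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them (destructive; needs root).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# make_seed_and_sprout ROOTFS_DIR SEED_DEV SPROUT_DEV MNT
make_seed_and_sprout() {
    rootfs=$1
    seed=$2
    sprout=$3
    mnt=$4
    run blkdiscard "$sprout"                            # replaces the dd zeroing step
    run mkfs.btrfs --rootdir "$rootfs" --shrink "$seed" # populate the future seed
    run btrfstune -S 1 "$seed"                          # set the seed flag
    run mount "$seed" "$mnt"                            # a seed mounts read-only
    run btrfs device add "$sprout" "$mnt"               # attach the writable device
    run mount -o remount,rw "$mnt"                      # writes now go to the sprout
    run btrfs filesystem show "$mnt"                    # list the member devices
}

make_seed_and_sprout ./rootfs /dev/sdX2 /dev/sdX3 /mnt
```

Populating the seed from a mounted filesystem instead of `--rootdir`, as described above, would replace the `mkfs.btrfs --rootdir` line with a plain `mkfs.btrfs`, a mount with the desired compression option, a copy, and a shrink.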

>
> 3. Mount on embedded device
>
> - Kernel command line option: "root=/dev/mmcblk0p2 ro rootwait"
> - Later, 'systemd-remount-fs.service' remounts seed device 'rw' by
> applying mount options from fstab:
> ...
> # 'defaults' includes 'rw', 'ROOT' is /dev/mmcblk0p2 (seed device)
> LABEL=ROOT  /  btrfs  defaults,noatime,nodiratime,space_cache=v2,compress-force=zstd:16  1  1
> ...

The read-only seed device itself can't be mounted read-write. That's
the point of a seed-device. All changes go to the 2nd device. What you
really want to do during production is mount by the fs UUID of the
"sprout".

At mkfs time, devid 1 (first device, which becomes the read-only seed)
has an fs UUID.

When you 'btrfs dev add' a 2nd device to a seed, that 2nd device is
sometimes called a "sprout" device, let's call it devid 2. A new fs
UUID is generated, which is a Btrfs volume made of two devices, devid1
and devid2.

Therefore, if you use root=UUID=fsUUID"seed" this would mount the
read-only seed, and could be used as a way to "reset" the system. If
you use root=UUID=fsUUID"sprout" then this references both devid1 and
devid2, and will mount read-write by default.
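
To make the two boot modes concrete, here is a small hypothetical helper that prints the kernel command-line fragment for each case. The function name and the UUID placeholders are assumptions. As far as I know, `root=UUID=` is normally resolved by an initramfs rather than by the kernel itself, and a multi-device Btrfs volume needs its devices scanned before mounting, so this presumes an initramfs that handles both.

```shell
#!/bin/sh
# root_arg MODE SEED_UUID SPROUT_UUID
# Prints a kernel command-line fragment: MODE "normal" boots the writable
# sprout volume, MODE "reset" boots the read-only seed.
root_arg() {
    case $1 in
        normal) printf 'root=UUID=%s rw rootwait\n' "$3" ;;
        reset)  printf 'root=UUID=%s ro rootwait\n' "$2" ;;
        *)      echo "usage: root_arg normal|reset SEED_UUID SPROUT_UUID" >&2
                return 1 ;;
    esac
}

# The UUIDs should be readable with, for example (device names are placeholders):
#   blkid -s UUID -o value /dev/mmcblk0p2   # seed
#   blkid -s UUID -o value /dev/mmcblk0p3   # sprout
root_arg normal SEED-UUID SPROUT-UUID
root_arg reset  SEED-UUID SPROUT-UUID
```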

It's superfluous detail for your use case, but for the sake of a
complete answer, a "sprout" isn't always 2 devices, even though it
starts that way. It is possible to delete devid1, which then causes
replication of the seed to the sprout. Once finished, devid1, the
seed, is removed. And now the seed and sprout are each single device
Btrfs volumes and totally independent.

Anyway, for your "reset" option, you probably need one of two things.
a) read-only rootfs support in the initramfs so that you can boot the
read-only seed or b) set up a ramdisk, such as a zram device, and use it
as a volatile "sprout", now you can remount sysroot read-write, and
perform the reset which would be something like doing a blkdiscard on
the "sprout" device you want to get rid of, and then create a new
persistent sprout. This double use of the seed is completely valid.

There is one possible gotcha you can run into, but again I don't think
it applies to your use case:

btrfs multiple devices confusion: automatically unmounted /home,
clobbered ssh session
https://github.com/systemd/systemd/issues/14674

-- 
Chris Murphy
