linux-btrfs.vger.kernel.org archive mirror
* Problem with file system
@ 2017-04-24 15:27 Fred Van Andel
  2017-04-24 17:02 ` Chris Murphy
  2017-04-25  0:26 ` Qu Wenruo
  0 siblings, 2 replies; 33+ messages in thread
From: Fred Van Andel @ 2017-04-24 15:27 UTC (permalink / raw)
  To: linux-btrfs

I have a btrfs file system with a few thousand snapshots.  When I
attempted to delete 20 or so of them the problems started.

The disks are being read but except for the first few minutes there
are no writes.

Memory usage keeps growing until all the memory (24 GB) is used within a
few hours. Eventually the system crashes with out-of-memory errors.

The CPU load is low (<5%), but iowait is around 30 to 50%.

The drives are mounted, but any process that attempts to access them
just hangs, so I cannot access any data on the drives.

Smartctl does not show any issues with the drives.

The problem restarts after a reboot once you mount the drives.

I tried to zero the log, hoping the problem wouldn't restart after a
reboot, but that didn't work.

I am assuming that the attempt to remove the snapshots caused this
problem.  How do I interrupt the process so I can access the
filesystem again?

# uname -a
Linux Backup 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

#   btrfs --version
btrfs-progs v4.9.1

#   btrfs fi show
Label: none  uuid: 79ba7374-bf77-4868-bb64-656ff5736c44
        Total devices 6 FS bytes used 5.65TiB
        devid    1 size 1.82TiB used 1.29TiB path /dev/sdb
        devid    2 size 1.82TiB used 1.29TiB path /dev/sdc
        devid    3 size 1.82TiB used 1.29TiB path /dev/sdd
        devid    4 size 1.82TiB used 1.29TiB path /dev/sde
        devid    5 size 3.64TiB used 3.11TiB path /dev/sdf
        devid    6 size 3.64TiB used 3.11TiB path /dev/sdg

# btrfs fi df /pubroot
Data, RAID1: total=5.58TiB, used=5.58TiB
System, RAID1: total=32.00MiB, used=828.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=104.00GiB, used=70.64GiB
GlobalReserve, single: total=512.00MiB, used=28.51MiB

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-04-24 15:27 Problem with file system Fred Van Andel
@ 2017-04-24 17:02 ` Chris Murphy
  2017-04-25  4:05   ` Duncan
  2017-04-25  0:26 ` Qu Wenruo
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2017-04-24 17:02 UTC (permalink / raw)
  To: Fred Van Andel; +Cc: Btrfs BTRFS

On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel <vanandel@gmail.com> wrote:
> I have a btrfs file system with a few thousand snapshots.  When I
> attempted to delete 20 or so of them the problems started.
>
> The disks are being read but except for the first few minutes there
> are no writes.
>
> Memory usage keeps growing until all the memory (24 Gb) is used in a
> few hours. Eventually the system will crash with out of memory errors.

Boot with these boot parameters
log_buf_len=1M

I find it easier to log in remotely from another computer to capture the
problem, in case of a crash where I can't save things locally. So on the
remote computer use 'journalctl -kf -o short-monotonic'.

Either on the 1st computer, or from an additional ssh connection from the 2nd:

echo 1 >/proc/sys/kernel/sysrq
btrfs fi show   # you need the UUID for the volume you're going to mount;
                # best to have it in advance

Mount the file system normally, and once it starts having the problem
(I guess it happens pretty quickly?):

echo t > /proc/sysrq-trigger
grep . -IR /sys/fs/btrfs/UUID/allocation/

Paste in the UUID from fi show. If the computer is hanging due to
running out of memory, each of these commands can take a while to
complete, so it's best to have them all ready to go before you mount
the file system and the problem starts happening. Best if you can
issue the commands more than once as the problem gets worse, if you
can keep the output organized and labeled.

Then attach them (rather than pasting them into the message).
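
Roughly, the whole capture sequence might look like this (the device,
mount point and UUID below are just placeholders; substitute yours):

  # on the remote machine, keep the kernel log streaming to a file
  journalctl -kf -o short-monotonic | tee btrfs-hang.log

  # on the affected machine (or a second ssh session)
  echo 1 > /proc/sys/kernel/sysrq
  btrfs fi show                               # note the UUID
  mount /dev/sdb /pubroot                     # mount normally, wait for the hang
  echo t > /proc/sysrq-trigger                # task states go to the kernel log
  grep . -IR /sys/fs/btrfs/<UUID>/allocation/ > allocation-1.txt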


> I tried to zero the log hoping it wouldn't restart after a reboot but
> that didn't work

Yeah, don't just start randomly hitting the fs with a hammer like
zeroing the log tree. That's for a specific problem, and this isn't it.
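(For reference, that's the 'btrfs rescue zero-log <device>' command; it's
only meant for the case where mount crashes or hangs during log tree
replay, which isn't what's happening here.)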


> I am assuming that the attempt to remove the snapshots caused this
> problem.  How do I interrupt the process so I can access the
> filesystem again?

Snapshot creation is essentially free. Snapshot removal is expensive.
There's no way to answer your questions because your email doesn't
even include a call trace. So a developer will need at least the call
trace, but there might be some other useful information in a sysrq +
t, as well as the allocation states.



> # btrfs fi df /pubroot
> Data, RAID1: total=5.58TiB, used=5.58TiB
> System, RAID1: total=32.00MiB, used=828.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=104.00GiB, used=70.64GiB
> GlobalReserve, single: total=512.00MiB, used=28.51MiB

Later, after this problem is solved, you'll want to get rid of that
single System chunk that isn't being used but might cause a problem in
case of a device failure.

sudo btrfs balance start -mconvert=raid1,soft <mp>
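
You can confirm the single System chunk is gone afterwards with a plain:

  btrfs fi df <mp>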


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-04-24 15:27 Problem with file system Fred Van Andel
  2017-04-24 17:02 ` Chris Murphy
@ 2017-04-25  0:26 ` Qu Wenruo
  2017-04-25  5:33   ` Marat Khalili
  1 sibling, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2017-04-25  0:26 UTC (permalink / raw)
  To: Fred Van Andel, linux-btrfs



At 04/24/2017 11:27 PM, Fred Van Andel wrote:
> I have a btrfs file system with a few thousand snapshots.  When I
> attempted to delete 20 or so of them the problems started.
> 
> The disks are being read but except for the first few minutes there
> are no writes.
> 
> Memory usage keeps growing until all the memory (24 Gb) is used in a
> few hours. Eventually the system will crash with out of memory errors.

Are you using qgroup/quota?

IIRC, with qgroups enabled, subvolume deletion causes a full subtree
rescan, which can consume tons of memory.
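
You can check whether quota is enabled with something like:

  btrfs qgroup show /pubroot

(it should complain that quotas are not enabled if qgroup is off).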

Thanks,
Qu

> 
> The CPU load is low (<5%) but iowait is around 30 to 50%
> 
> The drives are mounted but any process that attempts to access them
> will just hang so I cannot access any data on the drives.
> 
> Smartctl does not show any issues with the drives.
> 
> The problem restarts after a reboot once you mount the drives.
> 
> I tried to zero the log hoping it wouldn't restart after a reboot but
> that didn't work
> 
> I am assuming that the attempt to remove the snapshots caused this
> problem.  How do I interrupt the process so I can access the
> filesystem again?
> 
> # uname -a
> Linux Backup 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC
> 2017 x86_64 x86_64 x86_64 GNU/Linux
> 
> #   btrfs --version
> btrfs-progs v4.9.1
> 
> #   btrfs fi show
> Label: none  uuid: 79ba7374-bf77-4868-bb64-656ff5736c44
>          Total devices 6 FS bytes used 5.65TiB
>          devid    1 size 1.82TiB used 1.29TiB path /dev/sdb
>          devid    2 size 1.82TiB used 1.29TiB path /dev/sdc
>          devid    3 size 1.82TiB used 1.29TiB path /dev/sdd
>          devid    4 size 1.82TiB used 1.29TiB path /dev/sde
>          devid    5 size 3.64TiB used 3.11TiB path /dev/sdf
>          devid    6 size 3.64TiB used 3.11TiB path /dev/sdg
> 
> # btrfs fi df /pubroot
> Data, RAID1: total=5.58TiB, used=5.58TiB
> System, RAID1: total=32.00MiB, used=828.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=104.00GiB, used=70.64GiB
> GlobalReserve, single: total=512.00MiB, used=28.51MiB
> 
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-04-24 17:02 ` Chris Murphy
@ 2017-04-25  4:05   ` Duncan
  0 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2017-04-25  4:05 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Mon, 24 Apr 2017 11:02:02 -0600 as excerpted:

> On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel <vanandel@gmail.com>
> wrote:
>> I have a btrfs file system with a few thousand snapshots.  When I
>> attempted to delete 20 or so of them the problems started.
>>
>> The disks are being read but except for the first few minutes there are
>> no writes.
>>
>> Memory usage keeps growing until all the memory (24 Gb) is used in a
>> few hours. Eventually the system will crash with out of memory errors.

In addition to what CMurphy and QW suggested (both valid), I have a 
couple other suggestions/pointers.  They won't help you get out of the 
current situation, but they might help you stay out of it in the future.

1) A "few thousand snapshots", but no mention of how many subvolumes 
those snapshots are of, or how many per subvolume.

As CMurphy says, but I'll expand on it here: taking a snapshot is nearly 
free, just a bit of metadata to write, because btrfs is COW-based and all 
a snapshot does is lock down a copy of everything in the subvolume as it 
currently exists, which the filesystem is already tracking anyway.  
Removal, however, is expensive, because btrfs must go thru and check 
everything to see whether each block can actually be deleted (no other 
snapshot referencing it) or not (something else still references it).

Obviously, then, this checking gets much more complicated the more 
snapshots of the same subvolume that exist.  IOW, it's a scaling issue.

The same scaling issue applies to various other btrfs maintenance tasks, 
including btrfs check (aka btrfsck), and btrfs balance (and thus btrfs 
device remove, which does an implicit balance).  Both of these take *far* 
longer if the number of snapshots per subvolume is allowed to get out of 
hand.

Due to this scaling issue, the recommendation is no more than 200-300 
snapshots per subvolume, and keeping it down to 50-100 max is even 
better, if you can do it reasonably.  That helps keep scaling issues and 
thus time for any necessary maintenance manageable.  Otherwise... well, 
we've had reports of device removes (aka balances) that would take 
/months/ to finish at the rate they were going.  Obviously, well before 
it gets to that point it's far faster to simply blow away the filesystem 
and restore from backups.[1]

It follows that if you have an automated system doing the snapshots, it's 
equally important to have an automated system doing snapshot thinning as 
well, keeping the number of snapshots per subvolume within manageable 
scaling limits.
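
A thinning job doesn't have to be fancy.  As a rough sketch (the snapshot 
path and retention count here are made up, and tools like snapper or btrbk 
do this properly with per-hour/day/week rules):

  # keep only the newest 100 date-named snapshots of one subvolume
  ls -1d /pubroot/.snapshots/backup-* | sort | head -n -100 | \
          while read snap; do btrfs subvolume delete "$snap"; done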

So if that's "a few thousand snapshots", I hope you that's of (at least) 
a double-digit number of subvolumes, keeping the number of snapshots per 
subvolume under 300, and under 100 if your snapshot rotation schedule 
will allow it.

2) As Qu suggests, btrfs quotas increase the scaling issues significantly.

Additionally, there have been and continue to be accuracy issues with 
certain quota corner-cases, so they can't be entirely relied upon anyway.

Generally, people using btrfs quotas fall into three categories:

a) Those who know the problems and are working with Qu and the other devs 
to report and trace issues so they will eventually work well, ideally 
with less of a scaling issue as well.

Bless them!  Keep it up! =:^)

b) Those who have a use-case that really depends on quotas.

Because btrfs quotas are buggy and not entirely reliable now, not to 
mention the scaling issues, these users are almost certainly better 
served using more mature filesystems with mature and dependable quotas.

c) Those who don't really care about quotas specifically, and are just 
using them because it's a nice feature.  This likely includes some who 
are simply running distros that enable quotas.

My recommendation for these users is to simply turn btrfs quotas off for 
now, as they're presently in general more trouble than they're worth, due 
to both the accuracy and scaling issues.  Hopefully quotas will be stable 
in a couple years, and with developer and tester hard work perhaps the 
scaling issues will have been reduced as well, and that recommendation 
can change.  But for now, if you don't really need them, leaving quotas 
off will significantly reduce scaling issues.  And if you do need them, 
they're not yet reliable on btrfs anyway, so better off using something 
more mature where they actually work.

3) Similarly (tho unlikely to apply in your case), beware of the scaling 
implications of the various reflink-based copying and dedup utilities, 
which work via the same copy-on-write and reflinking technology that's 
behind snapshotting.

Tho snapshotting is effectively reflinking /everything/ in the subvolume, 
so the scaling issues compound much faster there than they will with a 
more trivial level of reflinking.  Of course, when it comes to dedup, a 
more trivial level of reflinking means less benefit from doing the dedup 
in the first place, so there's a limit to the effectiveness of dedup 
before it starts having the same scaling issues that snapshots do.  But 
if you have exactly two copies of /everything/ in a subvolume, and dedup 
it down to a single copy, that's the same effect as a single snapshot, so 
it does take a lot of reflink-based deduping to get to the same level as 
a couple hundred snapshots.  But it's something to think about if you're 
planning to dedup say 1000 copies of a bunch of stuff by making them all 
reflinks to the same single copy.
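
(For reference, a reflink copy is just something like

  cp --reflink=always bigfile bigfile.copy

which shares all the extents between the two names until one side is 
modified.)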



Bottom line, if those "few thousand snapshots" are all of the same subvol 
or two, /especially/ if you're running btrfs quotas on top of that... 
that's very likely your problem right there.  Keep your number of 
snapshots per subvolume under 300, and turn off btrfs quotas, and you'll 
very likely find the problem disappears.


---
[1] Backups:  Sysadmin's first rule of backups, simple form:  If you 
don't have a backup, you are by lack thereof defining the data at risk as 
worth less than the time/hassle/resources to do that backup.  Because if 
it was worth more than the time/hassle/resources necessary for the 
backup, by definition, it would /be/ backed up.

It's your choice to make, but no redefining after the fact.  If you lost 
the primary copy due to whatever reason and didn't have that backup, you 
simply defined the data as not worth enough to have a backup, and get to 
be happy because you saved what your actions, or lack thereof, defined as 
of most value to you, the time/hassle/resources you would have otherwise 
spent doing that backup.

Sysadmin's second rule of backups:  A backup isn't complete until it has 
been tested restorable.  Until then, it's simply a would-be backup, 
because you don't actually know if it worked or not.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-04-25  0:26 ` Qu Wenruo
@ 2017-04-25  5:33   ` Marat Khalili
  2017-04-25  6:13     ` Qu Wenruo
  0 siblings, 1 reply; 33+ messages in thread
From: Marat Khalili @ 2017-04-25  5:33 UTC (permalink / raw)
  To: linux-btrfs

On 25/04/17 03:26, Qu Wenruo wrote:
> IIRC qgroup for subvolume deletion will cause full subtree rescan 
> which can cause tons of memory. 
Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even 
use this absurd amount of memory for? Is it swappable?

I haven't read about RAM limitations for running qgroups before, only 
about CPU load (which, importantly, only requires patience and does not 
crash servers).

--

With Best Regards,
Marat Khalili

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-04-25  5:33   ` Marat Khalili
@ 2017-04-25  6:13     ` Qu Wenruo
  2017-04-26 16:43       ` Fred Van Andel
  0 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2017-04-25  6:13 UTC (permalink / raw)
  To: Marat Khalili, linux-btrfs



At 04/25/2017 01:33 PM, Marat Khalili wrote:
> On 25/04/17 03:26, Qu Wenruo wrote:
>> IIRC qgroup for subvolume deletion will cause full subtree rescan 
>> which can cause tons of memory. 
> Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even 
> use this absurd amount of memory for? Is it swappable?

The memory is used for two reasons:

1) Recording which extents need to be traced
    Freed at transaction commit.

    We need a better idea for handling these. Maybe create a new tree so
    that we can write them to disk? Or another qgroup rework?

2) Recording the current roots referring to each extent
    Only after v4.10, IIRC.

The memory allocated is not swappable.

How much memory it uses depends on the number of extents in that subvolume.

It's 56 bytes per extent, for both tree blocks and data extents.
To use up 16GiB of RAM takes about 300 million extents.
For a 5.6TiB volume, that works out to an average extent size of about 20KiB.
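
Rough arithmetic:

  16 GiB / 56 bytes      ~= 300 million extents
  5.6 TiB / 300 million  ~= 20 KiB average extent size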

It seems that your volume is highly fragmented though.

If that's the problem, disabling qgroup may be the best workaround.
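
If you go that route, it's a one-liner (using the mount point from your
df output):

  btrfs quota disable /pubroot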

Thanks,
Qu

> 
> Haven't read about RAM limitations for running qgroups before, only 
> about CPU load (which importantly only requires patience, does not crash 
> servers).
> 
> -- 
> 
> With Best Regards,
> Marat Khalili
> 
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-04-25  6:13     ` Qu Wenruo
@ 2017-04-26 16:43       ` Fred Van Andel
  2017-10-30  3:31         ` Dave
  0 siblings, 1 reply; 33+ messages in thread
From: Fred Van Andel @ 2017-04-26 16:43 UTC (permalink / raw)
  To: linux-btrfs

Yes I was running qgroups.
Yes the filesystem is highly fragmented.
Yes I have way too many snapshots.

I think it's clear that the problem is on my end. I simply placed too
many demands on the filesystem without fully understanding the
implications.  Now I have to deal with the consequences.

It was decided today to replace this computer due to its age.  I will
use the recover command to pull the needed data off this system and
onto the new one.


Thank you everyone for your assistance and the education.

Fred

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-04-26 16:43       ` Fred Van Andel
@ 2017-10-30  3:31         ` Dave
  2017-10-30 21:37           ` Chris Murphy
  2017-10-31  1:58           ` Duncan
  0 siblings, 2 replies; 33+ messages in thread
From: Dave @ 2017-10-30  3:31 UTC (permalink / raw)
  To: Linux fs Btrfs; +Cc: Fred Van Andel

This is a very helpful thread. I want to share an interesting related story.

We have a machine with 4 btrfs volumes and 4 Snapper configs. I
recently discovered that Snapper timeline cleanup had been turned off
for 3 of those volumes. In the Snapper configs I found this setting:

TIMELINE_CLEANUP="no"

Normally that would be set to "yes". So I corrected the issue and set
it to "yes" for the 3 volumes where it had not been set correctly.

I suppose it was turned off temporarily and then somebody forgot to
turn it back on.

What I did not know, and what I did not realize was a critical piece
of information, was how long timeline cleanup had been turned off and
how many snapshots had accumulated on each volume in that time.

I naively re-enabled Snapper timeline cleanup. The instant I started
the snapper-cleanup.service, the system was hosed. The ssh session
became unresponsive, no other ssh sessions could be established, and it
was impossible to log into the system at the console.

My subsequent investigation showed that the root filesystem volume had
accumulated more than 3000 btrfs snapshots. The two other affected
volumes also had very large numbers of snapshots.
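
(If anyone wants to check their own count before re-enabling cleanup,
something like this gives a quick number:

  btrfs subvolume list -s / | wc -l

where -s limits the listing to snapshots only.)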

Deleting a single snapshot in that situation would likely require
hours. (I set up a test, but I ran out of patience before I was able
to delete even a single snapshot.) My guess is that if we had been
patient enough to wait for all the snapshots to be deleted, the
process would have finished in some number of months (or maybe a
year).

We did not know most of this at the time, so we did what we usually do
when a system becomes totally unresponsive -- we did a hard reset. Of
course, we could never get the system to boot up again.

Since we had backups, the easiest option became to replace that system
-- not unlike what the OP decided to do. In our case, the hardware was
not old, so we simply reformatted the drives and reinstalled Linux.

That's a drastic consequence of changing TIMELINE_CLEANUP="no" to
TIMELINE_CLEANUP="yes" in the snapper config.

It's all part of the process of gaining critical experience with
BTRFS. Whether or not BTRFS is ready for production use is (it seems
to me) mostly a question of how knowledgeable and experienced the
people administering it are.

In the various online discussions on this topic, all the focus is on
whether or not BTRFS itself is production-ready. At the current
maturity level of BTRFS, I think that's the wrong focus. The right
focus is on how production-ready is the admin person or team (with
respect to their BTRFS knowledge and experience). When a filesystem
has been around for decades, most of the critical admin issues become
fairly common knowledge, fairly widely known and easy to find. When a
filesystem is newer, far fewer people understand the gotchas. Also, in
older or widely used filesystems, when someone hits a gotcha, the
response isn't "that filesystem is not ready for production". Instead
the response is, "you should have known not to do that."

On Wed, Apr 26, 2017 at 12:43 PM, Fred Van Andel <vanandel@gmail.com> wrote:
> Yes I was running qgroups.
> Yes the filesystem is highly fragmented.
> Yes I have way too many snapshots.
>
> I think it's clear that the problem is on my end. I simply placed too
> many demands on the filesystem without fully understanding the
> implications.  Now I have to deal with the consequences.
>
> It was decided today to replace this computer due to its age.  I will
> use the recover command to pull the needed data off this system and
> onto the new one.
>
>
> Thank you everyone for your assistance and the education.
>
> Fred

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-10-30  3:31         ` Dave
@ 2017-10-30 21:37           ` Chris Murphy
  2017-10-31  5:57             ` Marat Khalili
  2017-11-04  7:26             ` Dave
  2017-10-31  1:58           ` Duncan
  1 sibling, 2 replies; 33+ messages in thread
From: Chris Murphy @ 2017-10-30 21:37 UTC (permalink / raw)
  To: Dave; +Cc: Linux fs Btrfs, Fred Van Andel

On Mon, Oct 30, 2017 at 4:31 AM, Dave <davestechshop@gmail.com> wrote:
> This is a very helpful thread. I want to share an interesting related story.
>
> We have a machine with 4 btrfs volumes and 4 Snapper configs. I
> recently discovered that Snapper timeline cleanup been turned off for
> 3 of those volumes. In the Snapper configs I found this setting:
>
> TIMELINE_CLEANUP="no"
>
> Normally that would be set to "yes". So I corrected the issue and set
> it to "yes" for the 3 volumes where it had not been set correctly.
>
> I suppose it was turned off temporarily and then somebody forgot to
> turn it back on.
>
> What I did not know, and what I did not realize was a critical piece
> of information, was how long timeline cleanup had been turned off and
> how many snapshots had accumulated on each volume in that time.
>
> I naively re-enabled Snapper timeline cleanup. The instant I started
> the  snapper-cleanup.service  the system was hosed. The ssh session
> became unresponsive, no other ssh sessions could be established and it
> was impossible to log into the system at the console.
>
> My subsequent investigation showed that the root filesystem volume
> accumulated more than 3000 btrfs snapshots. The two other affected
> volumes also had very large numbers of snapshots.
>
> Deleting a single snapshot in that situation would likely require
> hours. (I set up a test, but I ran out of patience before I was able
> to delete even a single snapshot.) My guess is that if we had been
> patient enough to wait for all the snapshots to be deleted, the
> process would have finished in some number of months (or maybe a
> year).
>
> We did not know most of this at the time, so we did what we usually do
> when a system becomes totally unresponsive -- we did a hard reset. Of
> course, we could never get the system to boot up again.
>
> Since we had backups, the easiest option became to replace that system
> -- not unlike what the OP decided to do. In our case, the hardware was
> not old, so we simply reformatted the drives and reinstalled Linux.
>
> That's a drastic consequence of changing TIMELINE_CLEANUP="no" to
> TIMELINE_CLEANUP="yes" in the snapper config.


Without a complete autopsy on the file system, it's unclear whether it
was fixable with the available tools, why it wouldn't mount normally,
or whether it could, if necessary, have done its own autorecovery with
one of the available backup roots.

But offhand it sounds like the hardware was sabotaging the expected
write ordering. A way to test a given hardware setup for that is, I
think, really overdue. It affects literally every file system and every
layer of Linux storage technology.

It kinda sounds to me like something other than the supers is being
overwritten too soon, and that's why none of the backup roots lead to a
valid root tree: all four possible root trees either haven't actually
been written yet (still) or have already been overwritten, even though
the super is updated. But again, it's speculation; we don't actually
know why your system was no longer mountable.



>
> It's all part of the process of gaining critical experience with
> BTRFS. Whether or not BTRFS is ready for production use is (it seems
> to me) mostly a question of how knowledgeable and experienced are the
> people administering it.

"Btrfs is a copy on write filesystem for Linux aimed at implementing advanced
features while focusing on fault tolerance, repair and easy administration."

That is the current descriptive text at
Documentation/filesystems/btrfs.txt for some time now.


> In the various online discussions on this topic, all the focus is on
> whether or not BTRFS itself is production-ready. At the current
> maturity level of BTRFS, I think that's the wrong focus. The right
> focus is on how production-ready is the admin person or team (with
> respect to their BTRFS knowledge and experience). When a filesystem
> has been around for decades, most of the critical admin issues become
> fairly common knowledge, fairly widely known and easy to find. When a
> filesystem is newer, far fewer people understand the gotchas. Also, in
> older or widely used filesystems, when someone hits a gotcha, the
> response isn't "that filesystem is not ready for production". Instead
> the response is, "you should have known not to do that."



That is not a general purpose file system. It's a file system for
admins who understand where the bodies are buried.




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-10-30  3:31         ` Dave
  2017-10-30 21:37           ` Chris Murphy
@ 2017-10-31  1:58           ` Duncan
  1 sibling, 0 replies; 33+ messages in thread
From: Duncan @ 2017-10-31  1:58 UTC (permalink / raw)
  To: linux-btrfs

Dave posted on Sun, 29 Oct 2017 23:31:57 -0400 as excerpted:

> It's all part of the process of gaining critical experience with BTRFS.
> Whether or not BTRFS is ready for production use is (it seems to me)
> mostly a question of how knowledgeable and experienced are the people
> administering it.
> 
> In the various online discussions on this topic, all the focus is on
> whether or not BTRFS itself is production-ready. At the current maturity
> level of BTRFS, I think that's the wrong focus. The right focus is on
> how production-ready is the admin person or team (with respect to their
> BTRFS knowledge and experience). When a filesystem has been around for
> decades, most of the critical admin issues become fairly common
> knowledge, fairly widely known and easy to find. When a filesystem is
> newer, far fewer people understand the gotchas. Also, in older or widely
> used filesystems, when someone hits a gotcha, the response isn't "that
> filesystem is not ready for production". Instead the response is, "you
> should have known not to do that."

That's a view I hadn't seen before, but it seems reasonable and I like it.

Indeed, there were/are a few reasonably widely known caveats with both 
ext3 and reiserfs, for instance, and certainly some that apply to fat/
vfat/fat32 (the three filesystems other than btrfs I know most about), 
and if anything they're past their prime, /not/ "still maturing" as btrfs 
is typically described.  For example, setting either of the first two to 
writeback journaling and then losing data results in something akin to 
"you should have known not to do that unless you were prepared for the 
risk, as it's definitely a well-known one."

Which of course was my own reaction when Linus and the other powers that 
be decided to set ext3 to writeback journaling by default for a few 
kernel cycles.  Having lived thru that on reiserfs, I /knew/ where /that/ 
was headed, and sure enough...

Similarly, ext3's performance problems with fsync, because it effectively 
forces a full filesystem sync not just a file sync, are well known, as 
are the risks of storing a reiserfs in a loopback file on reiserfs and 
then trying to run a tree restore on the host, since it's known to mix up 
the two filesystems in that case.

It's thus a reasonable viewpoint to consider some of the btrfs quirks to 
be in the same category.  Of course, btrfs being the first COW-based fs 
most will have had experience with, and the first filesystem most will 
have used that handles raid, snapshotting, etc., it's definitely rather 
different and more complex than the filesystems most people are familiar 
with, and thus can only be expected to have rather different and more 
complex caveats as well.

OTOH, there's definitely some known low-hanging fruit in terms of ease of 
use remaining to be implemented, tho I'd argue that we've reached the 
point where general stability has allowed the focus to gradually tilt 
toward implementing some of it over the last year or so, and we're 
beginning to see the loose ends tied up in the documentation, for 
instance.  I'd say we are getting close, and your viewpoint is a definite 
argument in support of that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-10-30 21:37           ` Chris Murphy
@ 2017-10-31  5:57             ` Marat Khalili
  2017-10-31 11:28               ` Austin S. Hemmelgarn
  2017-11-04  7:26             ` Dave
  1 sibling, 1 reply; 33+ messages in thread
From: Marat Khalili @ 2017-10-31  5:57 UTC (permalink / raw)
  To: Chris Murphy, Dave; +Cc: Linux fs Btrfs, Fred Van Andel

On 31/10/17 00:37, Chris Murphy wrote:
> But off hand it sounds like hardware was sabotaging the expected write
> ordering. How to test a given hardware setup for that, I think, is
> really overdue. It affects literally every file system, and Linux
> storage technology.
>
> It kinda sounds like to me something other than supers is being
> overwritten too soon, and that's why it's possible for none of the
> backup roots to find a valid root tree, because all four possible root
> trees either haven't actually been written yet (still) or they've been
> overwritten, even though the super is updated. But again, it's
> speculation, we don't actually know why your system was no longer
> mountable.
Just a detached view: I know hardware should respect ordering/barriers 
and such, but how hard is it really to avoid overwriting at least one 
complete metadata tree for half an hour (even better, yet another one 
for a day)? Just metadata, not data extents.

--

With Best Regards,
Marat Khalili

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-10-31  5:57             ` Marat Khalili
@ 2017-10-31 11:28               ` Austin S. Hemmelgarn
  2017-11-03  7:42                 ` Kai Krakow
  2017-11-03 22:03                 ` Chris Murphy
  0 siblings, 2 replies; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-10-31 11:28 UTC (permalink / raw)
  To: Marat Khalili, Chris Murphy, Dave; +Cc: Linux fs Btrfs, Fred Van Andel

On 2017-10-31 01:57, Marat Khalili wrote:
> On 31/10/17 00:37, Chris Murphy wrote:
>> But off hand it sounds like hardware was sabotaging the expected write
>> ordering. How to test a given hardware setup for that, I think, is
>> really overdue. It affects literally every file system, and Linux
>> storage technology.
>>
>> It kinda sounds like to me something other than supers is being
>> overwritten too soon, and that's why it's possible for none of the
>> backup roots to find a valid root tree, because all four possible root
>> trees either haven't actually been written yet (still) or they've been
>> overwritten, even though the super is updated. But again, it's
>> speculation, we don't actually know why your system was no longer
>> mountable.
> Just a detached view: I know hardware should respect ordering/barriers 
> and such, but how hard is it really to avoid overwriting at least one 
> complete metadata tree for half an hour (even better, yet another one 
> for a day)? Just metadata, not data extents.
If you're running on an SSD (or thinly provisioned storage, or something 
else which supports discards) and have the 'discard' mount option 
enabled, then there is no backup metadata tree (this issue was mentioned 
on the list a while ago, but nobody ever replied), because it's already 
been discarded.  This is ideally something which should be addressed (we 
need some sort of discard queue for handling in-line discards), but it's 
not easy to address.

Otherwise, it becomes a question of space usage on the filesystem, and 
this is just another reason to keep some extra slack space on the FS 
(though that doesn't help _much_, it does help).  This, in theory, could 
be addressed, but it probably can't be applied across mounts of a 
filesystem without an on-disk format change.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-10-31 11:28               ` Austin S. Hemmelgarn
@ 2017-11-03  7:42                 ` Kai Krakow
  2017-11-03 11:33                   ` Austin S. Hemmelgarn
  2017-11-03 22:03                 ` Chris Murphy
  1 sibling, 1 reply; 33+ messages in thread
From: Kai Krakow @ 2017-11-03  7:42 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 31 Oct 2017 07:28:58 -0400
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2017-10-31 01:57, Marat Khalili wrote:
> > On 31/10/17 00:37, Chris Murphy wrote:  
> >> But off hand it sounds like hardware was sabotaging the expected
> >> write ordering. How to test a given hardware setup for that, I
> >> think, is really overdue. It affects literally every file system,
> >> and Linux storage technology.
> >>
> >> It kinda sounds like to me something other than supers is being
> >> overwritten too soon, and that's why it's possible for none of the
> >> backup roots to find a valid root tree, because all four possible
> >> root trees either haven't actually been written yet (still) or
> >> they've been overwritten, even though the super is updated. But
> >> again, it's speculation, we don't actually know why your system
> >> was no longer mountable.  
> > Just a detached view: I know hardware should respect
> > ordering/barriers and such, but how hard is it really to avoid
> > overwriting at least one complete metadata tree for half an hour
> > (even better, yet another one for a day)? Just metadata, not data
> > extents.  
> If you're running on an SSD (or thinly provisioned storage, or
> something else which supports discards) and have the 'discard' mount
> option enabled, then there is no backup metadata tree (this issue was
> mentioned on the list a while ago, but nobody ever replied), because
> it's already been discarded.  This is ideally something which should
> be addressed (we need some sort of discard queue for handling in-line
> discards), but it's not easy to address.
> 
> Otherwise, it becomes a question of space usage on the filesystem,
> and this is just another reason to keep some extra slack space on the
> FS (though that doesn't help _much_, it does help).  This, in theory,
> could be addressed, but it probably can't be applied across mounts of
> a filesystem without an on-disk format change.

Well, maybe inline discard is working at the wrong level. It should
kick in when the reference through any of the backup roots is dropped,
not when the current instance is dropped.

Without knowledge of the internals, I guess discards could be added to
a queue within a new tree in btrfs, and only added to that queue when
dropped from the last backup root referencing it. But this will
probably add some bad performance spikes.

I wonder how a regular fstrim run through cron applies to this problem?


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-03  7:42                 ` Kai Krakow
@ 2017-11-03 11:33                   ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-03 11:33 UTC (permalink / raw)
  To: linux-btrfs

On 2017-11-03 03:42, Kai Krakow wrote:
> Am Tue, 31 Oct 2017 07:28:58 -0400
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
>> On 2017-10-31 01:57, Marat Khalili wrote:
>>> On 31/10/17 00:37, Chris Murphy wrote:
>>>> But off hand it sounds like hardware was sabotaging the expected
>>>> write ordering. How to test a given hardware setup for that, I
>>>> think, is really overdue. It affects literally every file system,
>>>> and Linux storage technology.
>>>>
>>>> It kinda sounds like to me something other than supers is being
>>>> overwritten too soon, and that's why it's possible for none of the
>>>> backup roots to find a valid root tree, because all four possible
>>>> root trees either haven't actually been written yet (still) or
>>>> they've been overwritten, even though the super is updated. But
>>>> again, it's speculation, we don't actually know why your system
>>>> was no longer mountable.
>>> Just a detached view: I know hardware should respect
>>> ordering/barriers and such, but how hard is it really to avoid
>>> overwriting at least one complete metadata tree for half an hour
>>> (even better, yet another one for a day)? Just metadata, not data
>>> extents.
>> If you're running on an SSD (or thinly provisioned storage, or
>> something else which supports discards) and have the 'discard' mount
>> option enabled, then there is no backup metadata tree (this issue was
>> mentioned on the list a while ago, but nobody ever replied), because
>> it's already been discarded.  This is ideally something which should
>> be addressed (we need some sort of discard queue for handling in-line
>> discards), but it's not easy to address.
>>
>> Otherwise, it becomes a question of space usage on the filesystem,
>> and this is just another reason to keep some extra slack space on the
>> FS (though that doesn't help _much_, it does help).  This, in theory,
>> could be addressed, but it probably can't be applied across mounts of
>> a filesystem without an on-disk format change.
> 
> Well, maybe inline discard is working at the wrong level. It should
> kick in when the reference through any of the backup roots is dropped,
> not when the current instance is dropped.
Indeed.
> 
> Without knowledge of the internals, I guess discards could be added to
> a queue within a new tree in btrfs, and only added to that queue when
> dropped from the last backup root referencing it. But this will
> probably add some bad performance spikes.
Inline discards can already cause bad performance spikes.
> 
> I wonder how a regular fstrim run through cron applies to this problem?
You still functionally lose any old (freed) trees; they just get kept 
around until you call fstrim.
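
(In other words, something like a periodic

  fstrim -av

from cron, or the fstrim.timer unit many distros ship, keeps old trees
around between runs while still trimming eventually.)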


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-10-31 11:28               ` Austin S. Hemmelgarn
  2017-11-03  7:42                 ` Kai Krakow
@ 2017-11-03 22:03                 ` Chris Murphy
  2017-11-04  4:46                   ` Adam Borowski
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2017-11-03 22:03 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Marat Khalili, Chris Murphy, Dave, Linux fs Btrfs, Fred Van Andel

On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

> If you're running on an SSD (or thinly provisioned storage, or something
> else which supports discards) and have the 'discard' mount option enabled,
> then there is no backup metadata tree (this issue was mentioned on the list
> a while ago, but nobody ever replied),


This is a really good point. I've been running the discard mount option
for some time now without problems, in a laptop with a Samsung
Electronics Co Ltd NVMe SSD Controller SM951/PM951.

However, when I try btrfs-debug-tree -b on the specific block addresses
of the backup root trees listed in the super, only the current one
returns a valid result.  All the others fail with checksum errors. And
even the good one starts failing with checksum errors within seconds, as
a new tree is created, the super is updated, and Btrfs considers the old
root tree disposable and subject to discard.
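
If anyone wants to reproduce this, the backup root addresses can be
pulled out of the superblock and checked roughly like so (the device
path is just an example):

  btrfs inspect-internal dump-super -f /dev/nvme0n1p3 | grep backup_tree_root
  btrfs-debug-tree -b <bytenr> /dev/nvme0n1p3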

So absolutely if I were to have a problem, probably no rollback for
me. This seems to totally obviate a fundamental part of Btrfs design.


> because it's already been discarded.
> This is ideally something which should be addressed (we need some sort of
> discard queue for handling in-line discards), but it's not easy to address.

Discard data extents, don't discard metadata extents? Or put them on a
substantial delay.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-03 22:03                 ` Chris Murphy
@ 2017-11-04  4:46                   ` Adam Borowski
  2017-11-04 12:00                     ` Marat Khalili
  2017-11-04 17:14                     ` Chris Murphy
  0 siblings, 2 replies; 33+ messages in thread
From: Adam Borowski @ 2017-11-04  4:46 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Austin S. Hemmelgarn, Marat Khalili, Dave, Linux fs Btrfs,
	Fred Van Andel

On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote:
> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> 
> > If you're running on an SSD (or thinly provisioned storage, or something
> > else which supports discards) and have the 'discard' mount option enabled,
> > then there is no backup metadata tree (this issue was mentioned on the list
> > a while ago, but nobody ever replied),
> 
> 
> This is a really good point. I've been running discard mount option
> for some time now without problems, in a laptop with Samsung
> Electronics Co Ltd NVMe SSD Controller SM951/PM951.
> 
> However, just trying btrfs-debug-tree -b on a specific block address
> for any of the backup root trees listed in the super, only the current
> one returns a valid result.  All others fail with checksum errors. And
> even the good one fails with checksum errors within seconds as a new
> tree is created, the super updated, and Btrfs considers the old root
> tree disposable and subject to discard.
> 
> So absolutely if I were to have a problem, probably no rollback for
> me. This seems to totally obviate a fundamental part of Btrfs design.

How is this an issue?  Discard is issued only once we're positive there's no
reference to the freed blocks anywhere.  At that point, they're also open
for reuse, thus they can be arbitrarily scribbled upon.

Unless your hardware is seriously broken (such as lying about barriers,
which is nearly-guaranteed data loss on btrfs anyway), there's no way the
filesystem will ever reference such blocks.  The corpses of old trees that
are left lying around with no discard can at most be used for manual
forensics, but whether a given block will have been overwritten or not is
a matter of pure luck.

For rollbacks, there are snapshots.  Once a transaction has been fully
committed, the old version is considered gone.

>  because it's already been discarded.
> > This is ideally something which should be addressed (we need some sort of
> > discard queue for handling in-line discards), but it's not easy to address.
> 
> Discard data extents, don't discard metadata extents? Or put them on a
> substantial delay.

Why would you special-case metadata?  Metadata that points to overwritten or
discarded blocks is of no use either.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄⠀⠀⠀⠀ relevant to duties], shall be punished by death by shooting.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-10-30 21:37           ` Chris Murphy
  2017-10-31  5:57             ` Marat Khalili
@ 2017-11-04  7:26             ` Dave
  2017-11-04 17:25               ` Chris Murphy
  1 sibling, 1 reply; 33+ messages in thread
From: Dave @ 2017-11-04  7:26 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux fs Btrfs

On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> That is not a general purpose file system. It's a file system for admins who understand where the bodies are buried.

I'm not sure I understand your comment...

Are you saying BTRFS is not a general purpose file system?

If btrfs isn't able to serve as a general purpose file system for
Linux going forward, which file system(s) would you suggest can fill
that role? (I can't think of any that are clearly all-around better
than btrfs now, or that will be in the next few years.)

Or maybe you meant something else?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-04  4:46                   ` Adam Borowski
@ 2017-11-04 12:00                     ` Marat Khalili
  2017-11-04 17:14                     ` Chris Murphy
  1 sibling, 0 replies; 33+ messages in thread
From: Marat Khalili @ 2017-11-04 12:00 UTC (permalink / raw)
  To: Adam Borowski, Chris Murphy
  Cc: Austin S. Hemmelgarn, Dave, Linux fs Btrfs, Fred Van Andel

> How is this an issue?  Discard is issued only once we're positive there's
> no reference to the freed blocks anywhere.  At that point, they're also
> open for reuse, thus they can be arbitrarily scribbled upon.

The point was: how about keeping this reference around for some time period?

> Unless your hardware is seriously broken (such as lying about barriers,
> which is nearly-guaranteed data loss on btrfs anyway), there's no way
> the filesystem will ever reference such blocks.

Buggy hardware happens. So do buggy filesystems ;) Besides, most filesystems
let the user recover most data after losing just one sector; it would be a
pity if BTRFS, with all its COW coolness, didn't.

> Why would you special-case metadata?  Metadata that points to
> overwritten or discarded blocks is of no use either.

It takes significant time to overwrite a noticeable portion of the data on
disk, but a loss of metadata makes it all gone in a moment. Moreover, a user
is usually prepared to lose some recently changed data in a crash, but not
data that wasn't even touched.
-- 

With Best Regards,
Marat Khalili

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-04  4:46                   ` Adam Borowski
  2017-11-04 12:00                     ` Marat Khalili
@ 2017-11-04 17:14                     ` Chris Murphy
  2017-11-06 13:29                       ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2017-11-04 17:14 UTC (permalink / raw)
  To: Adam Borowski
  Cc: Chris Murphy, Austin S. Hemmelgarn, Marat Khalili, Dave,
	Linux fs Btrfs, Fred Van Andel

On Fri, Nov 3, 2017 at 10:46 PM, Adam Borowski <kilobyte@angband.pl> wrote:
> On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote:
>> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>
>> > If you're running on an SSD (or thinly provisioned storage, or something
>> > else which supports discards) and have the 'discard' mount option enabled,
>> > then there is no backup metadata tree (this issue was mentioned on the list
>> > a while ago, but nobody ever replied),
>>
>>
>> This is a really good point. I've been running discard mount option
>> for some time now without problems, in a laptop with Samsung
>> Electronics Co Ltd NVMe SSD Controller SM951/PM951.
>>
>> However, just trying btrfs-debug-tree -b on a specific block address
>> for any of the backup root trees listed in the super, only the current
>> one returns a valid result.  All others fail with checksum errors. And
>> even the good one fails with checksum errors within seconds as a new
>> tree is created, the super updated, and Btrfs considers the old root
>> tree disposable and subject to discard.
>>
>> So absolutely if I were to have a problem, probably no rollback for
>> me. This seems to totally obviate a fundamental part of Btrfs design.
>
> How is this an issue?  Discard is issued only once we're positive there's no
> reference to the freed blocks anywhere.  At that point, they're also open
> for reuse, thus they can be arbitrarily scribbled upon.

If it's not an issue, then no one should ever need those backup slots
in the super and we should just remove them.

But in fact, we know people end up in situations where they're needed,
for either automatic recovery at mount time or explicitly mounting with
-o usebackuproot. And in some cases we're seeing users using discard
who have a borked root tree, and none of the backup roots are present,
so they're fucked. Their file system is fucked.
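
(For reference, that's something like

  mount -o ro,usebackuproot /dev/sdX /mnt

on kernels new enough to have the option; the device path is just an
example.)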

Now again, maybe this means the hardware is misbehaving and honored the
discard out of order, writing the new supers before it had completely
committed all the metadata? I have no idea, but the evidence on the list
is that some people run into this, and when they do the file system is
beyond repair even though it can usually be scraped with btrfs restore.


> Unless your hardware is seriously broken (such as lying about barriers,
> which is nearly-guaranteed data loss on btrfs anyway), there's no way the
> filesystem will ever reference such blocks.  The corpses of old trees that
> are left lying around with no discard can at most be used for manual
> forensics, but whether a given block will have been overwritten or not is
> a matter of pure luck.

File systems that overwrite in place hint their intent in the journal
before it happens. So if there's a partial overwrite of metadata, it's
fine; the journal can help recover. But Btrfs has no journal, so when
the major piece of information required to bootstrap the file system at
mount time is damaged and every backup has been discarded, it's actually
more fragile than other file systems in the same situation.



>
> For rollbacks, there are snapshots.  Once a transaction has been fully
> committed, the old version is considered gone.

Yeah well snapshots do not cause root trees to stick around.


>
>>  because it's already been discarded.
>> > This is ideally something which should be addressed (we need some sort of
>> > discard queue for handling in-line discards), but it's not easy to address.
>>
>> Discard data extents, don't discard metadata extents? Or put them on a
>> substantial delay.
>
> Why would you special-case metadata?  Metadata that points to overwritten or
> discarded blocks is of no use either.

I would rather lose 30 seconds, 1 minute, or even 2 minutes of writes
than lose an entire file system. That's why.

Anyway, right now I consider the discard mount option fundamentally
broken on Btrfs for SSDs. I haven't tested this on LVM thinp; maybe it's
broken there too.

Even fstrim leaves a tiny window open for a few minutes every time it
gets called, where if the root tree is corrupted for any reason you're
fucked, because all the backup roots are already gone.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-04  7:26             ` Dave
@ 2017-11-04 17:25               ` Chris Murphy
  2017-11-07  7:01                 ` Dave
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2017-11-04 17:25 UTC (permalink / raw)
  To: Dave; +Cc: Chris Murphy, Linux fs Btrfs

On Sat, Nov 4, 2017 at 1:26 AM, Dave <davestechshop@gmail.com> wrote:
> On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy <lists@colorremedies.com> wrote:
>>
>> That is not a general purpose file system. It's a file system for admins who understand where the bodies are buried.
>
> I'm not sure I understand your comment...
>
> Are you saying BTRFS is not a general purpose file system?

I'm suggesting that any file system that burdens the user with more
knowledge just to stay out of trouble than the widely accepted general
purpose file systems of the day is not a general purpose file system.

And yes, I'm suggesting that Btrfs is at risk of being neither general
purpose nor able to meet its design goals as stated in the Btrfs
documentation. It is not easy to admin *when things go wrong*. It's
great before then. It's a butt ton easier to resize, replace devices,
take snapshots, and so on. But when it comes to fixing it when it goes
wrong? It is a goddamn Choose Your Own Adventure book. It's way, way
more complicated than any other file system I'm aware of.


> If btrfs isn't able to serve as a general purpose file system for
> Linux going forward, which file system(s) would you suggest can fill
> that role? (I can't think of any that are clearly all-around better
> than btrfs now, or that will be in the next few years.)

ext4 and XFS are clearly the file systems to beat. They almost always
recover from crashes with just a normal journal replay at mount time;
file system repair is not often needed. When it is needed, it usually
works, and there is just the one option: repair and go with it. Btrfs
has piles of repair options: mount-time options, btrfs check options,
btrfs rescue options; it's a bit nutty, honestly. And there's zero
guidance in the available docs on what order to try things in, not
least because some of these repair tools are still considered dangerous,
at least in the man page text, and the right order depends on the
failure. The user is burdened with way too much.

Even with as much as I know about Btrfs, having used it since 2008 plus
my list activity, I routinely have WTF moments when people post problems:
what order do you try things in to get going again? Easy to admin? Yeah,
for the most part. But stability is still a problem, and it's coming up
on its 10 year anniversary soon.

If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd
use ZoL hands down. But I'm not, I'm much more familiar with Btrfs and
where the bodies are buried, so I continue to use Btrfs.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-04 17:14                     ` Chris Murphy
@ 2017-11-06 13:29                       ` Austin S. Hemmelgarn
  2017-11-06 18:45                         ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-06 13:29 UTC (permalink / raw)
  To: Chris Murphy, Adam Borowski
  Cc: Marat Khalili, Dave, Linux fs Btrfs, Fred Van Andel

On 2017-11-04 13:14, Chris Murphy wrote:
> On Fri, Nov 3, 2017 at 10:46 PM, Adam Borowski <kilobyte@angband.pl> wrote:
>> On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote:
>>> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>
>>>> If you're running on an SSD (or thinly provisioned storage, or something
>>>> else which supports discards) and have the 'discard' mount option enabled,
>>>> then there is no backup metadata tree (this issue was mentioned on the list
>>>> a while ago, but nobody ever replied),
>>>
>>>
>>> This is a really good point. I've been running discard mount option
>>> for some time now without problems, in a laptop with Samsung
>>> Electronics Co Ltd NVMe SSD Controller SM951/PM951.
>>>
>>> However, just trying btrfs-debug-tree -b on a specific block address
>>> for any of the backup root trees listed in the super, only the current
>>> one returns a valid result.  All others fail with checksum errors. And
>>> even the good one fails with checksum errors within seconds as a new
>>> tree is created, the super updated, and Btrfs considers the old root
>>> tree disposable and subject to discard.
>>>
>>> So absolutely if I were to have a problem, probably no rollback for
>>> me. This seems to totally obviate a fundamental part of Btrfs design.
>>
>> How is this an issue?  Discard is issued only once we're positive there's no
>> reference to the freed blocks anywhere.  At that point, they're also open
>> for reuse, thus they can be arbitrarily scribbled upon.
> 
> If it's not an issue, then no one should ever need those backup slots
> in the super and we should just remove them.
> 
> But in fact, we know people end up situations where they're needed for
> either automatic recovery at mount time or explicitly calling
> --usebackuproot. And in some cases we're seeing users using discard
> who have a borked root tree, and none of the backup roots are present
> so they're fucked. Their file system is fucked.
> 
> Now again, maybe this means the hardware is misbehaving, and honored
> the discard out of order, and did that and wrote the new supers before
> it had completely committed all the metadata? I have no idea, but the
> evidence is present in the list that some people run into this and
> when they do the file system is beyond repair even though it can
> usually be scraped with btrfs restore.
With ATA devices (including SATA), except on newer SSDs, TRIM commands 
can't be queued, so by definition they can't become unordered (the 
kernel ends up having to flush the device queue prior to the discard and 
then flush the write cache, so it's functionally equivalent to a write 
barrier, just more expensive, which is why inline discard performance 
sucks in most cases).  I'm not sure about SCSI (I'm pretty sure UNMAP 
can be queued and is handled just like any other write in terms of 
ordering), MMC/SD (though I'm also not sure if the block layer and the 
MMC driver properly handle discard BIOs on MMC devices), or NVMe (which 
I think handles things similarly to SCSI).
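
If anyone wants to check what their particular device advertises, both
the block layer view and the drive identify data are queryable
(untested example; /dev/sdX is a placeholder, and the exact wording of
the hdparm output varies by drive and hdparm version):

  lsblk --discard                     # nonzero DISC-GRAN/DISC-MAX means the kernel will issue discards
  hdparm -I /dev/sdX | grep -i trim   # shows the TRIM-related capability lines reported by the drive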
> 
> 
>> Unless your hardware is seriously broken (such as lying about barriers,
>> which is nearly-guaranteed data loss on btrfs anyway), there's no way the
>> filesystem will ever reference such blocks.  The corpses of old trees that
>> are left lying around with no discard can at most be used for manual
>> forensics, but whether a given block will have been overwritten or not is
>> a matter of pure luck.
> 
> File systems that overwrite, are hinting the intent in the journal
> what's about to happen. So if there's a partial overwrite of metadata,
> it's fine. The journal can help recover. But Btrfs without a journal,
> has a major piece of information required to bootstrap the file system
> at mount time, that's damaged, and then every backup has been
> discarded. So it actually makes Btrfs more fragile than other file
> systems in the same situation.
Indeed.

Unless I'm seriously misunderstanding the code, there's a pretty high 
chance that any given old metadata block will get overwritten reasonably 
soon on an active filesystem.  I'm not 100% certain about this, but I'm 
pretty sure that BTRFS will avoid allocating new chunks to write into 
just to preserve old copies of metadata, which in turn means that it 
will overwrite things pretty fast if the metadata chunks are mostly full.
>
>>
>> For rollbacks, there are snapshots.  Once a transaction has been fully
>> committed, the old version is considered gone.
> 
> Yeah well snapshots do not cause root trees to stick around.
> 
> 
>>
>>>   because it's already been discarded.
>>>> This is ideally something which should be addressed (we need some sort of
>>>> discard queue for handling in-line discards), but it's not easy to address.
>>>
>>> Discard data extents, don't discard metadata extents? Or put them on a
>>> substantial delay.
>>
>> Why would you special-case metadata?  Metadata that points to overwritten or
>> discarded blocks is of no use either.
> 
> I would rather lose 30 seconds, 1 minute, or even 2 minutes of writes,
> than lose an entire file system. That's why.
And outside of very specific use cases, this is something you'll hear 
from almost any sysadmin.
> 
> Anyway right now I consider discard mount option fundamentally broken
> on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's
> broken there too.
For LVM thinp, discard there deallocates the blocks, and unallocated 
regions read back as zeroes, just like in a sparse file (in fact, if you 
just think of LVM thinp as a sparse file with reflinking for snapshots, 
you get remarkably close to how it's actually implemented from a 
semantic perspective), so it is broken there.  In fact, it's guaranteed 
broken on any block device that has the discard_zeroes_data flag set, 
and theoretically broken on many things that don't have that flag 
(although block devices that don't have that flag are inherently broken 
from a security perspective anyway, but that's orthogonal to this 
discussion).
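
For anyone wondering whether their device sets that flag, on kernels of
this era it is visible directly in sysfs (sdX is a placeholder):

  cat /sys/block/sdX/queue/discard_zeroes_data   # 1 means discarded ranges read back as zeroes
  lsblk --discard                                # same information in the DISC-ZERO column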
> 
> Even fstrim leaves a tiny window open for a few minutes every time it
> gets called, where if the root tree is corrupted for any reason,
> you're fucked because all the backup roots are already gone.
For this particular case, I'm pretty sure you can minimize this window 
by calling `btrfs filesystem sync` on the filesystem after calling 
fstrim.  It likely won't eliminate the window, but should significantly 
shorten it.
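
Concretely, I mean something along these lines (untested sketch;
/mnt/data stands in for the actual mount point):

  fstrim -v /mnt/data               # trim the free space once
  btrfs filesystem sync /mnt/data   # force a commit right after the trim, per the reasoning above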

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-06 13:29                       ` Austin S. Hemmelgarn
@ 2017-11-06 18:45                         ` Chris Murphy
  2017-11-06 19:12                           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2017-11-06 18:45 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Chris Murphy, Adam Borowski, Marat Khalili, Dave, Linux fs Btrfs,
	Fred Van Andel

On Mon, Nov 6, 2017 at 6:29 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

>
> With ATA devices (including SATA), except on newer SSD's, TRIM commands
> can't be queued,

SATA spec 3.1 includes queued trim. There are SATA spec 3.1 products
on the market claiming to do queued trim. Some of them fuck up, and
have been blacklisted in the kernel for queued trim.

>>
>>
>> Anyway right now I consider discard mount option fundamentally broken
>> on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's
>> broken there too.
>
> For LVM thinp, discard there deallocates the blocks, and unallocated regions
> read back as zeroes, just like in a sparse file (in fact, if you just think
> of LVM thinp as a sparse file with reflinking for snapshots, you get
> remarkably close to how it's actually implemented from a semantic
> perspective), so it is broken there.  In fact, it's guaranteed broken on any
> block device that has the discard_zeroes_data flag set, and theoretically
> broken on many things that don't have that flag (although block devices that
> don't have that flag are inherently broken from a security perspective
> anyway, but that's orthogonal to this discussion).

So this is really only solvable by having Btrfs delay, possibly
substantially, the discarding of metadata blocks. Aside from trim on
physical devices, trim also benefits thin provisioning, and some use
cases will require filesystem-level discard because they can't rely on
periodic fstrim.
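
For the setups that can get away with periodic trim instead, that is
just the util-linux timer or a cron job, e.g. (schedule and path are
only placeholders):

  systemctl enable --now fstrim.timer       # weekly fstrim, where the distro ships the unit
  0 3 * * 0  root  /sbin/fstrim /mnt/data   # or an equivalent /etc/crontab entry

The cases I mean above are exactly the ones where that isn't good enough.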



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-06 18:45                         ` Chris Murphy
@ 2017-11-06 19:12                           ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-06 19:12 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Adam Borowski, Marat Khalili, Dave, Linux fs Btrfs, Fred Van Andel

On 2017-11-06 13:45, Chris Murphy wrote:
> On Mon, Nov 6, 2017 at 6:29 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> 
>>
>> With ATA devices (including SATA), except on newer SSD's, TRIM commands
>> can't be queued,
> 
> SATA spec 3.1 includes queued trim. There are SATA spec 3.1 products
> on the market claiming to do queued trim. Some of them fuck up, and
> have been black listed in the kernel for queued trim.
> 
Yes, but some still work, and they are invariably very new devices by 
most people's definitions.
>>> Anyway right now I consider discard mount option fundamentally broken
>>> on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's
>>> broken there too.
>>
>> For LVM thinp, discard there deallocates the blocks, and unallocated regions
>> read back as zeroes, just like in a sparse file (in fact, if you just think
>> of LVM thinp as a sparse file with reflinking for snapshots, you get
>> remarkably close to how it's actually implemented from a semantic
>> perspective), so it is broken there.  In fact, it's guaranteed broken on any
>> block device that has the discard_zeroes_data flag set, and theoretically
>> broken on many things that don't have that flag (although block devices that
>> don't have that flag are inherently broken from a security perspective
>> anyway, but that's orthogonal to this discussion).
> 
> So this is really only solvable by having Btrfs delay, possibly
> substantially, the discarding of metadata blocks. Aside from physical
> device trim, there are benefits in thin provisioning for trim and some
> use cases will require file system discard, being unable to rely on
> periodic fstrim.
Yes.  However, from a simplicity of implementation perspective, it makes 
more sense to keep some number of old trees instead of keeping old trees 
for some amount of time.  That would remove the need to track timing 
info in the filesystem, provide sufficient protection, and probably be a 
bit easier to explain in the documentation.  Such logic could also be 
applied to regular block devices that don't support discard to provide a 
better guarantee that you won't overwrite old trees that might be useful 
for recovery.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-04 17:25               ` Chris Murphy
@ 2017-11-07  7:01                 ` Dave
  2017-11-07 13:02                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Dave @ 2017-11-07  7:01 UTC (permalink / raw)
  To: Linux fs Btrfs; +Cc: Chris Murphy

On Sat, Nov 4, 2017 at 1:25 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Sat, Nov 4, 2017 at 1:26 AM, Dave <davestechshop@gmail.com> wrote:
> > On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy <lists@colorremedies.com> wrote:
> >>
> >> That is not a general purpose file system. It's a file system for admins who understand where the bodies are buried.
> >
> > I'm not sure I understand your comment...
> >
> > Are you saying BTRFS is not a general purpose file system?
>
> I'm suggesting that any file system that burdens the user with more
> knowledge to stay out of trouble than the widely considered general
> purpose file systems of the day, is not a general purpose file system.
>
> And yes, I'm suggesting that Btrfs is at risk of being neither general
> purpose, and not meeting its design goals as stated in Btrfs
> documentation. It is not easy to admin *when things go wrong*. It's
> great before then. It's a butt ton easier to resize, replace devices,
> take snapshots, and so on. But when it comes to fixing it when it goes
> wrong? It is a goddamn Choose Your Own Adventure book. It's way, way
> more complicated than any other file system I'm aware of.

It sounds like a large part of that could be addressed with better
documentation. I know that documentation such as what you are
suggesting would be really valuable to me!

> > If btrfs isn't able to serve as a general purpose file system for
> > Linux going forward, which file system(s) would you suggest can fill
> > that role? (I can't think of any that are clearly all-around better
> > than btrfs now, or that will be in the next few years.)
>
> ext4 and XFS are clearly the file systems to beat. They almost always
> recover from crashes with just a normal journal replay at mount time,
> file system repair is not often needed. When it is needed, it usually
> works, and there is just the one option to repair and go with it.
> Btrfs has piles of repair options, mount time options, btrfs check has
> options, btrfs rescue has options, it's a bit nutty honestly. And
> there's zero guidance in the available docs what order to try things
> in, not least of which some of these repair tools are still considered
> dangerous at least in the man page text, and the order depends on the
> failure. The user is burdened with way too much.

Neither one of those file systems offers snapshots. (And when I
compared LVM snapshots vs BTRFS snapshots, I got the impression BTRFS
is the clear winner.)

Snapshots and volumes have a lot of value to me and I would not enjoy
going back to a file system without those features.

> Even as much as I know about Btrfs having used it since 2008 and my
> list activity, I routinely have WTF moments when people post problems,
> what order to try to get things going again. Easy to admin? Yeah for
> the most part. But stability is still a problem, and it's coming up on
> a 10 year anniversary soon.
>
> If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd
> use ZoL hands down.

Might it be the case that if you were equally familiar with ZFS, you
would become aware of more of its pitfalls? And that greater knowledge
could always lead to a different decision (such as favoring BTRFS).
In my experience the grass is always greener when I am less familiar
with the field.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-07  7:01                 ` Dave
@ 2017-11-07 13:02                   ` Austin S. Hemmelgarn
  2017-11-08  4:50                     ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-07 13:02 UTC (permalink / raw)
  To: Dave, Linux fs Btrfs; +Cc: Chris Murphy

On 2017-11-07 02:01, Dave wrote:
> On Sat, Nov 4, 2017 at 1:25 PM, Chris Murphy <lists@colorremedies.com> wrote:
>>
>> On Sat, Nov 4, 2017 at 1:26 AM, Dave <davestechshop@gmail.com> wrote:
>>> On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy <lists@colorremedies.com> wrote:
>>>>
>>>> That is not a general purpose file system. It's a file system for admins who understand where the bodies are buried.
>>>
>>> I'm not sure I understand your comment...
>>>
>>> Are you saying BTRFS is not a general purpose file system?
>>
>> I'm suggesting that any file system that burdens the user with more
>> knowledge to stay out of trouble than the widely considered general
>> purpose file systems of the day, is not a general purpose file system.
>>
>> And yes, I'm suggesting that Btrfs is at risk of being neither general
>> purpose, and not meeting its design goals as stated in Btrfs
>> documentation. It is not easy to admin *when things go wrong*. It's
>> great before then. It's a butt ton easier to resize, replace devices,
>> take snapshots, and so on. But when it comes to fixing it when it goes
>> wrong? It is a goddamn Choose Your Own Adventure book. It's way, way
>> more complicated than any other file system I'm aware of.
> 
> It sounds like a large part of that could be addressed with better
> documentation. I know that documentation such as what you are
> suggesting would be really valuable to me!
Documentation would help, but most of it is a lack of automation of 
things that could be automated (and are reasonably expected to be, 
based on how LVM and ZFS work), including but not limited to the 
following (rough example commands for the last two points follow the 
list):
* Handling of device failures.  In particular, BTRFS has absolutely zero 
hot-spare support currently (though there are patches to add this), 
which is considered a mandatory feature in almost all large scale data 
storage situations.
* Handling of chunk-level allocation exhaustion.  Ideally, when we can't 
allocate a chunk, we should try to free up space from the other chunk 
type through repacking of data.  Handling this better would 
significantly improve things around one of the biggest pitfalls with 
BTRFS, namely filling up a filesystem completely (which many end users 
seem to think is perfectly fine, despite that not being the case for 
pretty much any filesystem).
* Optional automatic correction of errors detected during normal usage. 
Right now, you have to run a scrub to correct errors. Such a design 
makes sense with MD and LVM, where you don't know which copy is correct, 
but BTRFS does know which copy is correct (or how to rebuild the correct 
data), and it therefore makes sense to have an option to automatically 
rebuild data that is detected to be incorrect.
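
For the second and third points, the manual steps that currently stand
in for that automation are roughly these (rough sketch; /mnt is a
placeholder):

  btrfs balance start -dusage=10 /mnt   # repack mostly-empty data chunks to get unallocated space back
  btrfs scrub start -Bd /mnt            # re-read everything and rewrite bad copies from the good ones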
> 
>>> If btrfs isn't able to serve as a general purpose file system for
>>> Linux going forward, which file system(s) would you suggest can fill
>>> that role? (I can't think of any that are clearly all-around better
>>> than btrfs now, or that will be in the next few years.)
>>
>> ext4 and XFS are clearly the file systems to beat. They almost always
>> recover from crashes with just a normal journal replay at mount time,
>> file system repair is not often needed. When it is needed, it usually
>> works, and there is just the one option to repair and go with it.
>> Btrfs has piles of repair options, mount time options, btrfs check has
>> options, btrfs rescue has options, it's a bit nutty honestly. And
>> there's zero guidance in the available docs what order to try things
>> in, not least of which some of these repair tools are still considered
>> dangerous at least in the man page text, and the order depends on the
>> failure. The user is burdened with way too much.
> 
> Neither one of those file systems offers snapshots. (And when I
> compared LVM snapshots vs BTRFS snapshots, I got the impression BTRFS
> is the clear winner.)
> 
> Snapshots and volumes have a lot of value to me and I would not enjoy
> going back to a file system without those features.
While that is true, that's not exactly the point Chris was trying to 
make.  The point is that if you install a system with XFS, you don't 
have to do pretty much anything to keep the filesystem running 
correctly, and ext4 is almost as good about not needing user 
intervention (repairs for ext4 are a bit more involved, and you have to 
watch inode usage because it uses static inode tables).  In contrast, 
you have to essentially treat BTRFS like a small child and keep an eye 
on it almost constantly to make sure it works correctly.
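
In practice the babysitting boils down to running something like this
on a schedule (illustrative only; /mnt is a placeholder):

  btrfs device stats /mnt       # watch for error counters creeping up
  btrfs filesystem usage /mnt   # watch for unallocated space running out
  btrfs scrub start -B /mnt     # periodically verify checksums and repair from good copies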
> 
>> Even as much as I know about Btrfs having used it since 2008 and my
>> list activity, I routinely have WTF moments when people post problems,
>> what order to try to get things going again. Easy to admin? Yeah for
>> the most part. But stability is still a problem, and it's coming up on
>> a 10 year anniversary soon.
>>
>> If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd
>> use ZoL hands down.
> 
> Might it be the case that if you were equally familiar with ZFS, you
> would become aware of more of its pitfalls? And that greater knowledge
> could always lead to a different decision (such as favoring BTRFS)..
> In my experience the grass is always greener when I am less familiar
> with the field.
Quick summary of the big differences, with ZFS parts based on my 
experience using it with FreeNAS at work:

BTRFS:
* Natively supported by the mainline kernel, unlike ZFS which can't ever 
be included in the mainline kernel due to licensing issues.  This is 
pretty much the only significant reason I stick with BTRFS over ZFS, as 
it greatly simplifies updates (and means I don't have to wait as long 
for kernel upgrades).
* Subvolumes are implicitly rooted in the filesystem hierarchy, unlike 
ZFS datasets which always have to be explicitly mounted.  This is 
largely cosmetic to be honest.
* Able to group subvolumes for quotas without having to replicate the 
grouping with parent subvolumes, unlike ZFS which requires a common 
parent dataset if you want to group datasets for quotas.  This is very 
useful as it reduces the complexity needed in the subvolume hierarchy.
* Has native support for most forms of fallocate(), while ZFS doesn't. 
This isn't all that significant for most users, but it does provide some 
significant benefit if you use lots of large sparse files (you have to 
do batch deduplication on ZFS to make them 'sparse' again, whereas you 
just call fallocate to punch holes on BTRFS, which takes far less time).
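
The hole punching mentioned there is just the util-linux fallocate
calls, e.g. (big-sparse.img is a stand-in file name):

  fallocate --dig-holes big-sparse.img          # turn all-zero blocks back into holes
  fallocate -p -o 1MiB -l 4MiB big-sparse.img   # punch an explicit 4MiB hole at offset 1MiB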

ZFS:
* Provides native support for exposing virtual block devices (zvols), 
unlike BTRFS which just provides filesystem functionality.  This is 
really big for NAS usage, as it's much more efficient to expose a zvol 
as an iSCSI, ATAoE, or NBD device than it is to expose a regular file as 
one.
* Includes hot-spare and automatic rebuild support, unlike BTRFS which 
does not (but we are working on this).  Really important for enterprise 
usage and high availability.
* Provides the ability to control stripe width for parity RAID modes, 
unlike BTRFS.  This is extremely important when dealing with large 
filesystems, by using reduced stripe width, you improve rebuild times 
for a given stripe, and in theory can sustain more lost disks before 
losing data.
* Has a much friendlier scrub mechanism that doesn't have anywhere near 
as much impact on other things accessing the device as BTRFS does.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-07 13:02                   ` Austin S. Hemmelgarn
@ 2017-11-08  4:50                     ` Chris Murphy
  2017-11-08 12:13                       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2017-11-08  4:50 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Dave, Linux fs Btrfs, Chris Murphy

On Tue, Nov 7, 2017 at 6:02 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

> * Optional automatic correction of errors detected during normal usage.
> Right now, you have to run a scrub to correct errors. Such a design makes
> sense with MD and LVM, where you don't know which copy is correct, but BTRFS
> does know which copy is correct (or how to rebuild the correct data), and it
> therefore makes sense to have an option to automatically rebuild data that
> is detected to be incorrect.

?

It definitely does fix ups during normal operations. During reads, if
there's a UNC or there's corruption detected, Btrfs gets the good
copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
don't just happen with scrubbing. Even raid56 supports these kinds of
passive fixups back to disk.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-08  4:50                     ` Chris Murphy
@ 2017-11-08 12:13                       ` Austin S. Hemmelgarn
  2017-11-08 17:17                         ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-08 12:13 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Dave, Linux fs Btrfs

On 2017-11-07 23:50, Chris Murphy wrote:
> On Tue, Nov 7, 2017 at 6:02 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> 
>> * Optional automatic correction of errors detected during normal usage.
>> Right now, you have to run a scrub to correct errors. Such a design makes
>> sense with MD and LVM, where you don't know which copy is correct, but BTRFS
>> does know which copy is correct (or how to rebuild the correct data), and it
>> therefore makes sense to have an option to automatically rebuild data that
>> is detected to be incorrect.
> 
> ?
> 
> It definitely does fix ups during normal operations. During reads, if
> there's a UNC or there's corruption detected, Btrfs gets the good
> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
> don't just happen with scrubbing. Even raid56 supports these kinds of
> passive fixups back to disk.
I could have sworn it didn't rewrite the data on-disk during normal 
usage.  I mean, I know for certain that it will return the correct data 
to userspace if at all possible, but I was under the impression it will 
just log the error during normal operation.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-08 12:13                       ` Austin S. Hemmelgarn
@ 2017-11-08 17:17                         ` Chris Murphy
  2017-11-08 17:22                           ` Hugo Mills
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2017-11-08 17:17 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Dave, Linux fs Btrfs

On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

>> It definitely does fix ups during normal operations. During reads, if
>> there's a UNC or there's corruption detected, Btrfs gets the good
>> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
>> don't just happen with scrubbing. Even raid56 supports these kinds of
>> passive fixups back to disk.
>
> I could have sworn it didn't rewrite the data on-disk during normal usage.
> I mean, I know for certain that it will return the correct data to userspace
> if at all possible, but I was under the impression it will just log the
> error during normal operation.

No, everything except raid56 has had it for a long time; I can't
even think how far back, maybe even before 3.0. Raid56 got it
in 4.12.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-08 17:17                         ` Chris Murphy
@ 2017-11-08 17:22                           ` Hugo Mills
  2017-11-08 17:54                             ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Hugo Mills @ 2017-11-08 17:22 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Austin S. Hemmelgarn, Dave, Linux fs Btrfs

On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:
> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> 
> >> It definitely does fix ups during normal operations. During reads, if
> >> there's a UNC or there's corruption detected, Btrfs gets the good
> >> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
> >> don't just happen with scrubbing. Even raid56 supports these kinds of
> >> passive fixups back to disk.
> >
> > I could have sworn it didn't rewrite the data on-disk during normal usage.
> > I mean, I know for certain that it will return the correct data to userspace
> > if at all possible, but I was under the impression it will just log the
> > error during normal operation.
> 
> No, everything except raid56 has had it since a long time, I can't
> even think how far back, maybe even before 3.0. Whereas raid56 got it
> in 4.12.

   Yes, I'm pretty sure it's been like that ever since I've been using
btrfs (somewhere around the early neolithic).

   Hugo.

-- 
Hugo Mills             | Turning, pages turning in the widening bath,
hugo@... carfax.org.uk | The spine cannot bear the humidity.
http://carfax.org.uk/  | Books fall apart; the binding cannot hold.
PGP: E2AB1DE4          | Page 129 is loosed upon the world.               Zarf


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-08 17:22                           ` Hugo Mills
@ 2017-11-08 17:54                             ` Chris Murphy
  2017-11-08 18:10                               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2017-11-08 17:54 UTC (permalink / raw)
  To: Hugo Mills, Chris Murphy, Austin S. Hemmelgarn, Dave, Linux fs Btrfs

On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:
>> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>
>> >> It definitely does fix ups during normal operations. During reads, if
>> >> there's a UNC or there's corruption detected, Btrfs gets the good
>> >> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
>> >> don't just happen with scrubbing. Even raid56 supports these kinds of
>> >> passive fixups back to disk.
>> >
>> > I could have sworn it didn't rewrite the data on-disk during normal usage.
>> > I mean, I know for certain that it will return the correct data to userspace
>> > if at all possible, but I was under the impression it will just log the
>> > error during normal operation.
>>
>> No, everything except raid56 has had it since a long time, I can't
>> even think how far back, maybe even before 3.0. Whereas raid56 got it
>> in 4.12.
>
>    Yes, I'm pretty sure it's been like that ever since I've been using
> btrfs (somewhere around the early neolithic).
>

Yeah, around the original code for multiple devices I think. Anyway,
this is what the fixups look like between scrub and normal read on
raid1. Hilariously the error reporting is radically different.

These are the kernel messages from a scrub finding and repairing data
corruption in a file. This was 5120 bytes corrupted, so all of one
block and part of another.


[244964.589522] BTRFS warning (device dm-6): checksum error at logical
1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257,
offset 0, length 4096, links 1 (path: test.bin)
[244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
wr 0, rd 0, flush 0, corrupt 1, gen 0
[244964.650239] BTRFS error (device dm-6): fixed up error at logical
1103626240 on dev /dev/mapper/vg-2
[244964.650612] BTRFS warning (device dm-6): checksum error at logical
1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257,
offset 4096, length 4096, links 1 (path: test.bin)
[244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
wr 0, rd 0, flush 0, corrupt 2, gen 0
[244964.683586] BTRFS error (device dm-6): fixed up error at logical
1103630336 on dev /dev/mapper/vg-2
[root@f26s test]#


Exact same corruption (same device and offset), but normal read of the file.

[245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
[245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
[245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
[245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino
257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
[245721.638901] BTRFS info (device dm-6): read error corrected: ino
257 off 0 (dev /dev/mapper/vg-2 sector 2116608)
[245721.639608] BTRFS info (device dm-6): read error corrected: ino
257 off 4096 (dev /dev/mapper/vg-2 sector 2116616)


scrub considers the fixup an error, normal read considers it info; but
there's more useful information in the scrub output I think. I'd
really like to see the warning make it clear whether this is metadata
or data corruption though. From the above you have to infer it,
because of the inode reference.
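
For what it's worth, the logical address in the scrub message can at
least be mapped back to whatever owns it, which settles the
data-vs-metadata question indirectly (example using the address from
the log above; /mnt/test is a guess at the mount point):

  btrfs inspect-internal logical-resolve 1103626240 /mnt/test   # for data this prints the owning path(s); a metadata address shouldn't resolve to a file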


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-08 17:54                             ` Chris Murphy
@ 2017-11-08 18:10                               ` Austin S. Hemmelgarn
  2017-11-08 18:31                                 ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-08 18:10 UTC (permalink / raw)
  To: Chris Murphy, Hugo Mills, Dave, Linux fs Btrfs

On 2017-11-08 12:54, Chris Murphy wrote:
> On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>> On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:
>>> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>
>>>>> It definitely does fix ups during normal operations. During reads, if
>>>>> there's a UNC or there's corruption detected, Btrfs gets the good
>>>>> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
>>>>> don't just happen with scrubbing. Even raid56 supports these kinds of
>>>>> passive fixups back to disk.
>>>>
>>>> I could have sworn it didn't rewrite the data on-disk during normal usage.
>>>> I mean, I know for certain that it will return the correct data to userspace
>>>> if at all possible, but I was under the impression it will just log the
>>>> error during normal operation.
>>>
>>> No, everything except raid56 has had it since a long time, I can't
>>> even think how far back, maybe even before 3.0. Whereas raid56 got it
>>> in 4.12.
>>
>>     Yes, I'm pretty sure it's been like that ever since I've been using
>> btrfs (somewhere around the early neolithic).
>>
> 
> Yeah, around the original code for multiple devices I think. Anyway,
> this is what the fixups look like between scrub and normal read on
> raid1. Hilariously the error reporting is radically different.
> 
> This is kernel messages of what a scrub finding data file corruption
> detection and repair looks like. This was 5120 bytes corrupted so all
> of one block and partial of anther.
> 
> 
> [244964.589522] BTRFS warning (device dm-6): checksum error at logical
> 1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257,
> offset 0, length 4096, links 1 (path: test.bin)
> [244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
> wr 0, rd 0, flush 0, corrupt 1, gen 0
> [244964.650239] BTRFS error (device dm-6): fixed up error at logical
> 1103626240 on dev /dev/mapper/vg-2
> [244964.650612] BTRFS warning (device dm-6): checksum error at logical
> 1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257,
> offset 4096, length 4096, links 1 (path: test.bin)
> [244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
> wr 0, rd 0, flush 0, corrupt 2, gen 0
> [244964.683586] BTRFS error (device dm-6): fixed up error at logical
> 1103630336 on dev /dev/mapper/vg-2
> [root@f26s test]#
> 
> 
> Exact same corruption (same device and offset), but normal read of the file.
> 
> [245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino
> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
> [245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino
> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
> [245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino
> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
> [245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino
> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
> [245721.638901] BTRFS info (device dm-6): read error corrected: ino
> 257 off 0 (dev /dev/mapper/vg-2 sector 2116608)
> [245721.639608] BTRFS info (device dm-6): read error corrected: ino
> 257 off 4096 (dev /dev/mapper/vg-2 sector 2116616)
> [245747.280718]
> 
> 
> scrub considers the fixup an error, normal read considers it info; but
> there's more useful information in the scrub output I think. I'd
> really like to see the warning make it clear whether this is metadata
> or data corruption though. From the above you have to infer it,
> because of the inode reference.
OK, that actually explains why I had this incorrect assumption.  I've 
not delved all that deep into that code, so I have no reference there, 
but looking at the two messages, the scrub message makes it very clear 
that the error was fixed, whereas the phrasing in the case of a normal 
read is kind of ambiguous (as I see it, 'read error corrected' could 
mean that it was actually repaired (fixed as scrub says), or that the 
error was corrected in BTRFS by falling back to the old copy, and I 
assumed the second case given the context).

As far as the whole warning versus info versus error thing, I actually 
think _that_ makes some sense.  If things got fixed, it's not exactly an 
error, even though it would be nice to have some consistency there.  For 
scrub however, it makes sense to have it all be labeled as an 'error' 
because otherwise the log entries will be incomplete if dmesg is not set 
to report anything less than an error (and the three lines are 
functionally _one_ entry).  I can also kind of understand scrub 
reporting error counts, but regular reads not doing so (scrub is a 
diagnostic and repair tool, regular reads aren't).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-08 18:10                               ` Austin S. Hemmelgarn
@ 2017-11-08 18:31                                 ` Chris Murphy
  2017-11-08 19:29                                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2017-11-08 18:31 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Hugo Mills, Dave, Linux fs Btrfs

On Wed, Nov 8, 2017 at 11:10 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2017-11-08 12:54, Chris Murphy wrote:
>>
>> On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>>
>>> On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:
>>>>
>>>> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
>>>> <ahferroin7@gmail.com> wrote:
>>>>
>>>>>> It definitely does fix ups during normal operations. During reads, if
>>>>>> there's a UNC or there's corruption detected, Btrfs gets the good
>>>>>> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
>>>>>> don't just happen with scrubbing. Even raid56 supports these kinds of
>>>>>> passive fixups back to disk.
>>>>>
>>>>>
>>>>> I could have sworn it didn't rewrite the data on-disk during normal
>>>>> usage.
>>>>> I mean, I know for certain that it will return the correct data to
>>>>> userspace
>>>>> if at all possible, but I was under the impression it will just log the
>>>>> error during normal operation.
>>>>
>>>>
>>>> No, everything except raid56 has had it since a long time, I can't
>>>> even think how far back, maybe even before 3.0. Whereas raid56 got it
>>>> in 4.12.
>>>
>>>
>>>     Yes, I'm pretty sure it's been like that ever since I've been using
>>> btrfs (somewhere around the early neolithic).
>>>
>>
>> Yeah, around the original code for multiple devices I think. Anyway,
>> this is what the fixups look like between scrub and normal read on
>> raid1. Hilariously the error reporting is radically different.
>>
>> This is kernel messages of what a scrub finding data file corruption
>> detection and repair looks like. This was 5120 bytes corrupted so all
>> of one block and partial of anther.
>>
>>
>> [244964.589522] BTRFS warning (device dm-6): checksum error at logical
>> 1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257,
>> offset 0, length 4096, links 1 (path: test.bin)
>> [244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
>> wr 0, rd 0, flush 0, corrupt 1, gen 0
>> [244964.650239] BTRFS error (device dm-6): fixed up error at logical
>> 1103626240 on dev /dev/mapper/vg-2
>> [244964.650612] BTRFS warning (device dm-6): checksum error at logical
>> 1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257,
>> offset 4096, length 4096, links 1 (path: test.bin)
>> [244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
>> wr 0, rd 0, flush 0, corrupt 2, gen 0
>> [244964.683586] BTRFS error (device dm-6): fixed up error at logical
>> 1103630336 on dev /dev/mapper/vg-2
>> [root@f26s test]#
>>
>>
>> Exact same corruption (same device and offset), but normal read of the
>> file.
>>
>> [245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino
>> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
>> [245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino
>> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
>> [245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino
>> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
>> [245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino
>> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
>> [245721.638901] BTRFS info (device dm-6): read error corrected: ino
>> 257 off 0 (dev /dev/mapper/vg-2 sector 2116608)
>> [245721.639608] BTRFS info (device dm-6): read error corrected: ino
>> 257 off 4096 (dev /dev/mapper/vg-2 sector 2116616)
>> [245747.280718]
>>
>>
>> scrub considers the fixup an error, normal read considers it info; but
>> there's more useful information in the scrub output I think. I'd
>> really like to see the warning make it clear whether this is metadata
>> or data corruption though. From the above you have to infer it,
>> because of the inode reference.
>
> OK, that actually explains why I had this incorrect assumption.  I've not
> delved all that deep into that code, so I have no reference there, but
> looking at the two messages, the scrub message makes it very clear that the
> error was fixed, whereas the phrasing in the case of a normal read is kind
> of ambiguous (as I see it, 'read error corrected' could mean that it was
> actually repaired (fixed as scrub says), or that the error was corrected in
> BTRFS by falling back to the old copy, and I assumed the second case given
> the context).
>
> As far as the whole warning versus info versus error thing, I actually think
> _that_ makes some sense.  If things got fixed, it's not exactly an error,
> even though it would be nice to have some consistency there.  For scrub
> however, it makes sense to have it all be labeled as an 'error' because
> otherwise the log entries will be incomplete if dmesg is not set to report
> anything less than an error (and the three lines are functionally _one_
> entry).  I can also kind of understand scrub reporting error counts, but
> regular reads not doing so (scrub is a diagnostic and repair tool, regular
> reads aren't).


I just did those corruptions as a test, and following the normal read
fixup, a subsequent scrub finds no problems. And in both cases
debug-tree shows pretty much identical metadata, at least the same
chunks are intact and the tree the file is located in has the same
logical address for the file in question. So this is not a COW fix up,
it's an overwrite. (Something tells me that raid56 fixes corruptions
differently, they may be cow).

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Problem with file system
  2017-11-08 18:31                                 ` Chris Murphy
@ 2017-11-08 19:29                                   ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-08 19:29 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Hugo Mills, Dave, Linux fs Btrfs

On 2017-11-08 13:31, Chris Murphy wrote:
> On Wed, Nov 8, 2017 at 11:10 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2017-11-08 12:54, Chris Murphy wrote:
>>>
>>> On Wed, Nov 8, 2017 at 10:22 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>>>
>>>> On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:
>>>>>
>>>>> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
>>>>> <ahferroin7@gmail.com> wrote:
>>>>>
>>>>>>> It definitely does fix ups during normal operations. During reads, if
>>>>>>> there's a UNC or there's corruption detected, Btrfs gets the good
>>>>>>> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
>>>>>>> don't just happen with scrubbing. Even raid56 supports these kinds of
>>>>>>> passive fixups back to disk.
>>>>>>
>>>>>>
>>>>>> I could have sworn it didn't rewrite the data on-disk during normal
>>>>>> usage.
>>>>>> I mean, I know for certain that it will return the correct data to
>>>>>> userspace
>>>>>> if at all possible, but I was under the impression it will just log the
>>>>>> error during normal operation.
>>>>>
>>>>>
>>>>> No, everything except raid56 has had it since a long time, I can't
>>>>> even think how far back, maybe even before 3.0. Whereas raid56 got it
>>>>> in 4.12.
>>>>
>>>>
>>>>      Yes, I'm pretty sure it's been like that ever since I've been using
>>>> btrfs (somewhere around the early neolithic).
>>>>
>>>
>>> Yeah, around the original code for multiple devices I think. Anyway,
>>> this is what the fixups look like between scrub and normal read on
>>> raid1. Hilariously the error reporting is radically different.
>>>
>>> This is kernel messages of what a scrub finding data file corruption
>>> detection and repair looks like. This was 5120 bytes corrupted so all
>>> of one block and partial of anther.
>>>
>>>
>>> [244964.589522] BTRFS warning (device dm-6): checksum error at logical
>>> 1103626240 on dev /dev/mapper/vg-2, sector 2116608, root 5, inode 257,
>>> offset 0, length 4096, links 1 (path: test.bin)
>>> [244964.589685] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
>>> wr 0, rd 0, flush 0, corrupt 1, gen 0
>>> [244964.650239] BTRFS error (device dm-6): fixed up error at logical
>>> 1103626240 on dev /dev/mapper/vg-2
>>> [244964.650612] BTRFS warning (device dm-6): checksum error at logical
>>> 1103630336 on dev /dev/mapper/vg-2, sector 2116616, root 5, inode 257,
>>> offset 4096, length 4096, links 1 (path: test.bin)
>>> [244964.650757] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs:
>>> wr 0, rd 0, flush 0, corrupt 2, gen 0
>>> [244964.683586] BTRFS error (device dm-6): fixed up error at logical
>>> 1103630336 on dev /dev/mapper/vg-2
>>> [root@f26s test]#
>>>
>>>
>>> Exact same corruption (same device and offset), but normal read of the
>>> file.
>>>
>>> [245721.613806] BTRFS warning (device dm-6): csum failed root 5 ino
>>> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
>>> [245721.614416] BTRFS warning (device dm-6): csum failed root 5 ino
>>> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
>>> [245721.630131] BTRFS warning (device dm-6): csum failed root 5 ino
>>> 257 off 0 csum 0x98f94189 expected csum 0xd8be3813 mirror 1
>>> [245721.630656] BTRFS warning (device dm-6): csum failed root 5 ino
>>> 257 off 4096 csum 0x05a1017f expected csum 0xef2302b4 mirror 1
>>> [245721.638901] BTRFS info (device dm-6): read error corrected: ino
>>> 257 off 0 (dev /dev/mapper/vg-2 sector 2116608)
>>> [245721.639608] BTRFS info (device dm-6): read error corrected: ino
>>> 257 off 4096 (dev /dev/mapper/vg-2 sector 2116616)
>>> [245747.280718]
>>>
>>>
>>> scrub considers the fixup an error, normal read considers it info; but
>>> there's more useful information in the scrub output I think. I'd
>>> really like to see the warning make it clear whether this is metadata
>>> or data corruption though. From the above you have to infer it,
>>> because of the inode reference.
>>
>> OK, that actually explains why I had this incorrect assumption.  I've not
>> delved all that deep into that code, so I have no reference there, but
>> looking at the two messages, the scrub message makes it very clear that the
>> error was fixed, whereas the phrasing in the case of a normal read is kind
>> of ambiguous (as I see it, 'read error corrected' could mean that it was
>> actually repaired (fixed as scrub says), or that the error was corrected in
>> BTRFS by falling back to the old copy, and I assumed the second case given
>> the context).
>>
>> As far as the whole warning versus info versus error thing, I actually think
>> _that_ makes some sense.  If things got fixed, it's not exactly an error,
>> even though it would be nice to have some consistency there.  For scrub
>> however, it makes sense to have it all be labeled as an 'error' because
>> otherwise the log entries will be incomplete if dmesg is not set to report
>> anything less than an error (and the three lines are functionally _one_
>> entry).  I can also kind of understand scrub reporting error counts, but
>> regular reads not doing so (scrub is a diagnostic and repair tool, regular
>> reads aren't).
> 
> 
> I just did those corruptions as a test, and following the normal read
> fixup, a subsequent scrub finds no problems. And in both cases
> debug-tree shows pretty much identical metadata, at least the same
> chunks are intact and the tree the file is located in has the same
> logical address for the file in question. So this is not a COW fix up,
> it's an overwrite. (Something tells me that raid56 fixes corruptions
> differently, they may be cow).
> 
I would think that this is the only case where it makes sense to 
unconditionally _not_ do a COW update.  In the event that the write gets 
interrupted, we're no worse off than we already were (the checksum will 
still fail), so there's not much point in incurring the overhead of a 
COW operation, except possibly with parity involved (because you might 
run the risk of both bogus parity _and_ a bogus checksum).

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2017-11-08 19:29 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-24 15:27 Problem with file system Fred Van Andel
2017-04-24 17:02 ` Chris Murphy
2017-04-25  4:05   ` Duncan
2017-04-25  0:26 ` Qu Wenruo
2017-04-25  5:33   ` Marat Khalili
2017-04-25  6:13     ` Qu Wenruo
2017-04-26 16:43       ` Fred Van Andel
2017-10-30  3:31         ` Dave
2017-10-30 21:37           ` Chris Murphy
2017-10-31  5:57             ` Marat Khalili
2017-10-31 11:28               ` Austin S. Hemmelgarn
2017-11-03  7:42                 ` Kai Krakow
2017-11-03 11:33                   ` Austin S. Hemmelgarn
2017-11-03 22:03                 ` Chris Murphy
2017-11-04  4:46                   ` Adam Borowski
2017-11-04 12:00                     ` Marat Khalili
2017-11-04 17:14                     ` Chris Murphy
2017-11-06 13:29                       ` Austin S. Hemmelgarn
2017-11-06 18:45                         ` Chris Murphy
2017-11-06 19:12                           ` Austin S. Hemmelgarn
2017-11-04  7:26             ` Dave
2017-11-04 17:25               ` Chris Murphy
2017-11-07  7:01                 ` Dave
2017-11-07 13:02                   ` Austin S. Hemmelgarn
2017-11-08  4:50                     ` Chris Murphy
2017-11-08 12:13                       ` Austin S. Hemmelgarn
2017-11-08 17:17                         ` Chris Murphy
2017-11-08 17:22                           ` Hugo Mills
2017-11-08 17:54                             ` Chris Murphy
2017-11-08 18:10                               ` Austin S. Hemmelgarn
2017-11-08 18:31                                 ` Chris Murphy
2017-11-08 19:29                                   ` Austin S. Hemmelgarn
2017-10-31  1:58           ` Duncan
