* Data and metadata extent allocators [1/2]: Recap: The data story
@ 2017-10-27 18:17 Hans van Kranenburg
  2017-10-27 20:10 ` Martin Steigerwald
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Hans van Kranenburg @ 2017-10-27 18:17 UTC (permalink / raw)
  To: linux-btrfs

Hi,

This is a followup to my previous threads named "About free space
fragmentation, metadata write amplification and (no)ssd" [0] and
"Experiences with metadata balance/convert" [1], exploring how good or
bad btrfs can handle filesystems that are larger than your average
desktop computer and/or which see a pattern of writing and deleting huge
amounts of files of wildly varying sizes all the time.

This message is a summary of the earlier posts. So, for whoever followed
the story, there's only boring old news here. In the next message, as a
reply to this one, I'll add some thoughts about new adventures with
metadata during the last weeks.

My use case is using btrfs as the filesystem for backup servers, which
work with collections of subvolumes/snapshots with related data, and add
new data and expire old snapshots daily.

So far, the following questions have already been answered:

Q: Why does the allocated but unused space for data keep growing all the
time?
A: Because btrfs keeps allocating new raw space for data use, while
there's more and more unused space inside already which isn't reused for
new data.

Q: How do I fight this and prevent getting into a situation where all
raw space is allocated, risking a filesystem crash?
A: Use btrfs balance to fight the symptoms. It reads data and writes it
out again without the free space fragments.

Q: Why would it crash the file system when all raw space is allocated?
Won't it start trying harder to reuse the free space inside?
A: Yes, it will, for data. The big problem here is that allocation of a
new metadata chunk when needed is not possible any more.

Q: Where does btrfs balance get the usage information from (what %
filled a chunk / block group is)? How can I see this myself?
A: It's a field of the "block group" item in metadata. The information
can be read from the filesystem metadata using the tree search ioctl.
Exploring this resulted in the first version of btrfs-heatmap. [2]
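
As an illustration, here's a minimal sketch using the python-btrfs
library [8] to print how full each block group is. The FileSystem,
chunks() and block_group() names and the vaddr/length/used attributes are
assumed from that project's example scripts, so check them against the
version you actually have:

#!/usr/bin/python3
#
# Sketch: walk all chunks and print how full each block group is, using
# the tree search ioctl via python-btrfs. Attribute names (vaddr, length,
# used) are assumed from the python-btrfs examples.
import sys
import btrfs

fs = btrfs.FileSystem(sys.argv[1])  # argument: path to a mounted btrfs
for chunk in fs.chunks():
    bg = fs.block_group(chunk.vaddr, chunk.length)
    print("block group at vaddr {}: {} of {} bytes used ({:.0f}%)".format(
        bg.vaddr, bg.used, bg.length, 100.0 * bg.used / bg.length))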

Q: Ok, but I have many TiBs of data chunks which are ~75% filled and
rewriting all that data is painful, takes a huge amount of time and even
if I would do it full-time aside from doing backups and expiries, I
won't succeed in fighting new fragmented free space that pops up. Help!
A: Yeah, we need something better.

Q: How can I see what the pattern is of free space fragments in a block
group?
A: For this, extent level pictures in btrfs-heatmap were added. [2]

Q: Why do the pictures of my data block groups look like someone fired a
shotgun at them? [3], [4]
A: Because the data extent allocator that is active when using the 'ssd'
mount option both tends to ignore smaller free space fragments all the
time, and also behaves in a way that causes more of them to appear. [5]

Q: Wait, why is there "ssd" in my mount options? Why does btrfs think my
iSCSI attached lun is an SSD?
A: Because it makes wrong assumptions based on the rotational attribute,
which we can also see in sysfs.
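
You can check what the kernel reports yourself; here's a small sketch
that just reads that sysfs attribute for every block device. A value of 0
means the kernel considers the device non-rotational, which is what makes
btrfs silently add 'ssd' to the mount options:

#!/usr/bin/python3
#
# Sketch: print the rotational flag for every block device. 0 means
# non-rotational, which is what makes btrfs enable 'ssd' mode on its own,
# also for things like iSCSI luns that are not actually SSDs.
import glob

for path in sorted(glob.glob('/sys/block/*/queue/rotational')):
    dev = path.split('/')[3]
    with open(path) as f:
        print("{}: rotational={}".format(dev, f.read().strip()))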

Q: Why does this ssd mode ignore free space?
A: Because it makes assumptions about the mapping of the addresses of
the block device we see in linux and the storage in actual flash chips
inside the ssd. Based on that information it decides where to write or
where not to write any more.

Q: Does this make sense in 2017?
A: No. The interesting relevant optimization when writing to an ssd
would be to write all data together that will be deleted or overwritten
together at the same time in the future. Since btrfs does not come with
a time machine included, it can't do this. So, remove this behaviour
instead. [6]

Q: What will happen when I use kernel 4.14 with the previously mentioned
change, or if I already switch to the nossd mount option explicitly?
A: Relatively small free space fragments in existing chunks will
actually be reused for new writes that fit, working from the beginning
of the virtual address space upwards. It's like tetris, trying to
completely fill up the lowest lines first. See the big difference in
behavior when changing extent allocator happening at 16 seconds into
this timelapse movie: [7] (virtual address space)

Q: But what if all my chunks have badly fragmented free space right now?
A: If your situation allows for it, the simplest way is running a full
balance of the data, as some sort of big reset button. If you only want
to clean up chunks with excessive free space fragmentation, then you can
use the helper I used to identify them, which is
show_free_space_fragmentation.py in [8]. Just feed the chunks to balance
starting with the one with the highest score. The script requires the
free space tree to be used, which is a good idea anyway.
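
To make "feed the chunks to balance" a bit more concrete, here's a hedged
sketch that relocates one block group at a time using the vrange balance
filter. The mount point and the vaddr list below are made-up examples; in
practice you'd take the real addresses from the script's output, worst
score first:

#!/usr/bin/python3
#
# Sketch: relocate individual data block groups with btrfs balance and
# the vrange filter (needs root). The path and the vaddr list below are
# hypothetical; take the real addresses from
# show_free_space_fragmentation.py, highest score first.
import subprocess

fs_path = '/srv/backup'            # hypothetical mount point
vaddrs = [20971520, 1104150528]    # hypothetical block group start vaddrs

for vaddr in vaddrs:
    subprocess.check_call(['btrfs', 'balance', 'start',
                           '-dvrange={}..{}'.format(vaddr, vaddr + 1),
                           fs_path])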

[0] https://www.spinics.net/lists/linux-btrfs/msg64446.html
[1] https://www.spinics.net/lists/linux-btrfs/msg64771.html
[2] https://github.com/knorrie/btrfs-heatmap/
[3]
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-10-27-shotgunblast.png
[4]
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-12-18-heatmap-scripting/fsid_ed10a358-c846-4e76-a071-3821d423a99d_startat_320029589504_at_1482095269.png
[5] https://www.spinics.net/lists/linux-btrfs/msg64418.html
[6]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=583b723151794e2ff1691f1510b4e43710293875
[7] https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-ssd-to-nossd.mp4
[8] https://github.com/knorrie/python-btrfs/tree/develop/examples

-- 
Hans van Kranenburg

* Re: Data and metadata extent allocators [1/2]: Recap: The data story
  2017-10-27 18:17 Data and metadata extent allocators [1/2]: Recap: The data story Hans van Kranenburg
@ 2017-10-27 20:10 ` Martin Steigerwald
  2017-10-27 21:40   ` Hans van Kranenburg
  2017-10-27 21:20 ` Data and metadata extent allocators [2/2]: metadata! Hans van Kranenburg
  2017-10-28  0:12 ` Data and metadata extent allocators [1/2]: Recap: The data story Qu Wenruo
  2 siblings, 1 reply; 6+ messages in thread
From: Martin Steigerwald @ 2017-10-27 20:10 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: linux-btrfs

Hello Hans,

Hans van Kranenburg - 27.10.17, 20:17:
> This is a followup to my previous threads named "About free space
> fragmentation, metadata write amplification and (no)ssd" [0] and
> "Experiences with metadata balance/convert" [1], exploring how good or
> bad btrfs can handle filesystems that are larger than your average
> desktop computer and/or which see a pattern of writing and deleting huge
> amounts of files of wildly varying sizes all the time.
[…]
> Q: How do I fight this and prevent getting into a situation where all
> raw space is allocated, risking a filesystem crash?
> A: Use btrfs balance to fight the symptoms. It reads data and writes it
> out again without the free space fragments.

What do you mean by a filesystem crash? Since kernel 4.5 or 4.6 I don't see any
BTRFS related filesystem hangs anymore on the /home BTRFS dual SSD RAID 1 on my
laptop, which one or two copies of Akonadi, Baloo and other desktop related
stuff write *heavily* to, and which has had all free space allocated into
chunks for a pretty long time:

merkaba:~> btrfs fi usage -T /home
Overall:
    Device size:                 340.00GiB
    Device allocated:            340.00GiB
    Device unallocated:            2.00MiB
    Device missing:                  0.00B
    Used:                        290.32GiB
    Free (estimated):             23.09GiB      (min: 23.09GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

                          Data      Metadata System              
Id Path                   RAID1     RAID1    RAID1    Unallocated
-- ---------------------- --------- -------- -------- -----------
 1 /dev/mapper/msata-home 163.94GiB  6.03GiB 32.00MiB     1.00MiB
 2 /dev/mapper/sata-home  163.94GiB  6.03GiB 32.00MiB     1.00MiB
-- ---------------------- --------- -------- -------- -----------
   Total                  163.94GiB  6.03GiB 32.00MiB     2.00MiB
   Used                   140.85GiB  4.31GiB 48.00KiB

I haven't done a balance on this filesystem for a long time (since kernel 4.6).

Granted, my filesystem is smaller than the typical backup BTRFS. I do have two
3 TB and one 1.5 TB SATA disks I back up to, and another 2 TB BTRFS on a backup
server that I use for borgbackup (which doesn't do any snapshots yet and may be
better off running as XFS, as it doesn't really need snapshots since borgbackup
takes care of that. A BTRFS snapshot would only come in handy to be able to go
back to a previous borgbackup repo in case it for whatever reason gets
corrupted or damaged / deleted by an attacker who only has access to a
non-privileged user). – However, all of these filesystems have plenty of free
space currently and are not accessed daily.

> Q: Why would it crash the file system when all raw space is allocated?
> Won't it start trying harder to reuse the free space inside?
> A: Yes, it will, for data. The big problem here is that allocation of a
> new metadata chunk when needed is not possible any more.

And does it hang there, or really crash?

[…]

> Q: Why do the pictures of my data block groups look like someone fired a
> shotgun at it. [3], [4]?
> A: Because the data extent allocator that is active when using the 'ssd'
> mount option both tends to ignore smaller free space fragments all the
> time, and also behaves in a way that causes more of them to appear. [5]
> 
> Q: Wait, why is there "ssd" in my mount options? Why does btrfs think my
> iSCSI attached lun is an SSD?
> A: Because it makes wrong assumptions based on the rotational attribute,
> which we can also see in sysfs.
> 
> Q: Why does this ssd mode ignore free space?
> A: Because it makes assumptions about the mapping of the addresses of
> the block device we see in linux and the storage in actual flash chips
> inside the ssd. Based on that information it decides where to write or
> where not to write any more.
> 
> Q: Does this make sense in 2017?
> A: No. The interesting relevant optimization when writing to an ssd
> would be to write all data together that will be deleted or overwritten
> together at the same time in the future. Since btrfs does not come with
> a time machine included, it can't do this. So, remove this behaviour
> instead. [6]
> 
> Q: What will happen when I use kernel 4.14 with the previously mentioned
> change, or if I change to the nossd mount option explicitely already?
> A: Relatively small free space fragments in existing chunks will
> actually be reused for new writes that fit, working from the beginning
> of the virtual address space upwards. It's like tetris, trying to
> completely fill up the lowest lines first. See the big difference in
> behavior when changing extent allocator happening at 16 seconds into
> this timelapse movie: [7] (virtual address space)

I see a difference in behavior but I do not yet fully understand what I am 
looking at.
 
> Q: But what if all my chunks have badly fragmented free space right now?
> A: If your situation allows for it, the simplest way is running a full
> balance of the data, as some sort of big reset button. If you only want
> to clean up chunks with excessive free space fragmentation, then you can
> use the helper I used to identify them, which is
> show_free_space_fragmentation.py in [8]. Just feed the chunks to balance
> starting with the one with the highest score. The script requires the
> free space tree to be used, which is a good idea anyway.

Okay, if I understand this correctly, I don't need to use "nossd" with kernel
4.14, but it would be good to do a full "btrfs filesystem balance" run on all
the SSD BTRFS filesystems, or all other ones with rotational=0.

What would be the benefit of that? Would the filesystem run faster again? My
subjective impression is that performance got worse over time. *However*, all
my previous full balance attempts made the performance even worse. So… is a
full balance safe for filesystem performance these days?

I still have the issue that fstrim on /home only works with a patch from Lutz
Euler from 2014, which is still not in mainline BTRFS. Maybe it would be a
good idea to recreate /home in order to get rid of that special "anomaly" of
this BTRFS, where fstrim doesn't work without that patch.

Maybe at least a part of this should go into the BTRFS kernel wiki, as it
would be easier for users to find there.

I wonder about a "upgrade notes for users" / "BTRFS maintenance" page that 
gives recommendations in case some step is recommended after a major kernel 
update and general recommendations for maintenance. Ideally most of this would 
be integrated into BTRFS or a userspace daemon for it and be handled 
transparently and automatically. Yet a full balance is an expensive operation 
time-wise and probably should not be started without user consent.

I do wonder about the ton of tools here and there, and I would love some
btrfsd or… maybe an even more generic fsd filesystem maintenance daemon which
would do regular scrubs and whatever else makes sense. It could use some
configuration in the root directory of a filesystem and work for BTRFS and
other filesystems that have beneficial online / background maintenance, like
XFS, which also has online scrubbing by now (at least for metadata).

> [0] https://www.spinics.net/lists/linux-btrfs/msg64446.html
> [1] https://www.spinics.net/lists/linux-btrfs/msg64771.html
> [2] https://github.com/knorrie/btrfs-heatmap/
> [3]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-10-27-shotgunblast.png
> [4]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-12-18-heatmap-scripting/
> fsid_ed10a358-c846-4e76-a071-3821d423a99d_startat_320029589504_at_1482095269
> .png [5] https://www.spinics.net/lists/linux-btrfs/msg64418.html
> [6]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i
> d=583b723151794e2ff1691f1510b4e43710293875 [7]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-ssd-to-nossd.mp4 [8]
> https://github.com/knorrie/python-btrfs/tree/develop/examples

Thanks,
-- 
Martin

* Data and metadata extent allocators [2/2]: metadata!
  2017-10-27 18:17 Data and metadata extent allocators [1/2]: Recap: The data story Hans van Kranenburg
  2017-10-27 20:10 ` Martin Steigerwald
@ 2017-10-27 21:20 ` Hans van Kranenburg
  2017-10-28  0:12 ` Data and metadata extent allocators [1/2]: Recap: The data story Qu Wenruo
  2 siblings, 0 replies; 6+ messages in thread
From: Hans van Kranenburg @ 2017-10-27 21:20 UTC (permalink / raw)
  To: linux-btrfs

Ok, it's time to start looking at the other half of the story... The
behavior of the metadata extent allocator.

    Interesting questions here are:

Q: If I point btrfs balance at 1GiB of data, why does it need to write
40GiB to disk while only relocating this 1GiB amount? What's the other
39GiB of "ghost" data?

Q: If I'm running nightly backups, fetching changes from external
filesystems (rsync, not send/receive), why do I see an average write rate
of ~60MiB/s to disk while the incoming data stream is capped at ~16MiB/s?

Q: If I'm doing expiries (mass removal of subvolumes), why does my
filesystem write ~80MiB/s to disk for hours and hours and hours?

    tl;dr version:

* Excessive rumination in large extent tree
* I want an invalid combination of data / metadata extent allocators to
minimize extent tree writes
* I get the invalid combination thanks to a bug
* Profit
* I want to be able to do the same in a newer kernel

    Long version:

    July 2017 was the last time I did tests on a cloned (on a lower
layer, yay NetApp) btrfs filesystem with about 40TiB of files and 90k
subvolumes (with related data in groups of between 20 and 30 subvolumes
each).

What I did was run a Linux kernel with some modifications (yes, I found
out about the tracepoints a bit later) to count the number of metadata
block cow operations that are done, per tree. By reading the counters and
graphing that data, it became very clear what happens when writing that
39GiB of ghost data I just talked about...

It's metadata, and it's the extent tree. Thousands of cow operations on
the extent tree per second, filling all write IO bandwidth (just 1Gb/s
iSCSI in this case, writing 80-100MiB/s) while the other trees are
relatively dead silent in comparison.

Q: Why does my extent tree cause so many writes to disk?
A: Because the extent tree is tracking the space used to store itself.

(Disclaimer: identifying these symptoms is not some kind of new amazing
discovery, it should be a well known thing for btrfs developers, but I'm
writing for the users like me who are looking at their running
filesystems, wondering what the hell the thing is doing all the
time. Also, it's good to see at what size and complexity the practical
scalability limitations of this filesystem seriously start to get in the
way.)

Let's see what would happen (also a bit simplified, it's about the
general idea) in a worst case scenario, where every update of a metadata
item would cause cow of a metadata block:

  1. Write to a filesystem tree happens
  2. Filesystem metadata block gets cowed
  3. Write to the extent tree happens to add the new fs tree block
  4. Extent tree block gets cowed for the write
  5. Write to the extent tree happens to track the new block's location
  6. Extent tree block gets cowed for the write
  7. Write to the extent tree happens to track the new block's location
  8. Extent tree block gets cowed for the write
  9. Write to the extent tree happens to track the new block's location
  10. Extent tree block gets cowed for the write
  11. Write to the extent tree happens to track the new block's location
  12. Extent tree block gets cowed for the write
  13. Write to the extent tree happens to track the new block's location
  14. Extent tree block gets cowed for the write
  15. Write to the extent tree happens to track the new block's location
  16. Extent tree block gets cowed for the write
  17. Write to the extent tree happens to track the new block's location
  18. Extent tree block gets cowed for the write
  19. Write to the extent tree happens to track the new block's location
  20. Extent tree block gets cowed for the write
  21. Write to the extent tree happens to track the new block's location
  [...]

Yep, it's like a dog running in circles chasing its own tail.

(Side note: The "Snowball effect of wandering trees" still has to be
added on top of this, since cowing a metadata block also needs cow
operations of every block in the path up towards the top of the tree.
But, I'm ignoring that part now, since it's not causing the biggest
problems in my case.)

When would this ever stop? Well...
1. A metadata block gets cowed only once during a transaction. The
reason for the cow is to get a new block on disk later, at a different
location, while the previous one is also still on disk. All changes that
happen in memory during the transaction never reach the disk
individually, so there's no need to keep more copies in memory than the
final one, which goes to disk at the end of the transaction.
2. A single metadata block holds a whole bunch of metadata items, part
of a larger range. So, together with 1, if the changes happen close to
each other, they all go into the same metadata block, and there are
fewer blocks to cow.

So, in reality, the recursive cowing in the extent tree (I'd like to
call it "rumination"...) will stop after a few extra chews.

As for point 2... If we try to keep all new writes of extent tree
metadata as close together as possible, we minimize the explosion of
rumination that's happening.

    Extent allocators...

As mentioned in the commit to change the data extent allocator behaviour
for 'ssd' mode [0]: "Recommendations for future development are to
reconsider the current oversimplified nossd / ssd distinction [...] and
provide experienced users with a more flexible way to choose allocator
behaviour for data and metadata"

Currently, the nossd / ssd / ssd_spread mount options are the only knobs
we can turn to change extent allocator choice in btrfs as a side effect.
When doing so, the behavior for data as well as metadata gets changed.

Here's the situation since 4.14:

        nossd         ssd           ssd_spread
------------------------------------------------
data    tetris        tetris        contiguous
meta    cluster(64k)  cluster(2M)   contiguous

Before 4.14, data+ssd was also cluster(2M).

* tetris means: just fill all space up with writes that fit, from the
beginning of the filesystem to the end.
* cluster(X) means: use the cluster system (of which the code still
mostly looks like black magic to me) and when doing writes, first
collect at least X amount of space together in free space extents that
are near each other, thus "clustering" writes together.
* contiguous means: when doing a write of X bytes, put it into X bytes of
contiguous free space, and don't fragment the write over multiple locations.
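
To make the difference between the first and the last one a bit more
concrete, here's a toy sketch (plain Python, not actual btrfs code, and
it leaves the cluster logic out) of how a single write could be placed
into a list of free space fragments:

# Toy model, not btrfs code: place a write of 'size' bytes into a list of
# free space fragments given as (offset, length) tuples sorted by offset.

def tetris(free, size):
    # Fill fragments from the start of the address space first, splitting
    # the write over several small fragments if needed.
    placed, remaining = [], size
    for off, length in free:
        if remaining == 0:
            break
        take = min(length, remaining)
        placed.append((off, take))
        remaining -= take
    return placed if remaining == 0 else None

def contiguous(free, size):
    # Only accept one fragment that can hold the whole write; small
    # fragments are skipped (roughly the ssd_spread idea).
    for off, length in free:
        if length >= size:
            return [(off, size)]
    return None

free_space = [(0, 4096), (65536, 16384), (1048576, 262144)]
print(tetris(free_space, 20480))      # reuses the two small fragments
print(contiguous(free_space, 20480))  # jumps straight to the big one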

    When switching from the ssd (which was automatically chosen for me
because btrfs thinks an iSCSI lun is an ssd) to nossd because of the
effect on data placement, the immediate new problem which surfaced was
that subvolume removals would take forever, while the filesystem was
just writing, writing and writing metadata to disk full speed all the
time. Expiries would not be finished before the next nightly backups, so
that situation was not acceptable.

When changing back to -o ssd, the situation would immediately improve
again. See [1] for an example... The simple reason for this was not that
there was more actual work to be done, it was that metadata writes would
end up in more different locations because of the smaller cluster size
parameter, and thus caused much longer ongoing rumination.

The pragmatic solution so far for this was to remount -o nossd, then do
the nightly backups, then remount -o ssd, then do the expiries etc... Yay...

    Flash forward to the beginning of October 2017 when I was
thinking... "what would happen when I was able to run data with the
tetris allocator and metadata with the contiguous allocator? That would
probably be better for my metadata..."

Thanks to a bug, solved in [2], it's actually possible to run exactly
this combination, just by mounting with -o ssd_spread,nossd. The nossd
option resets the ssd flag that was just set by ssd_spread, but it
doesn't unset ssd_spread itself. Combine this result with the exact
checks that are done for the flags in the code paths, and voila. So, on
my 4.9 kernel I can still do this.

When, after testing, the change was applied on the production system, the
immediate effect on the behavior was amazing. *poof* Bye bye metadata
writes.

During nightly backups, we now write around 25MiB/s for 16MiB/s of
incoming data plus all the metadata administration that needs to happen
(small changes happening all over the place). With DUP metadata this
means a metadata overhead of about (25-16)/2 = 4.5MiB/s.

For expiries... Removing an avg of 3000 subvolumes would take between 4
and 8 hours, writing 80-100MiB/s to disk all the time (~3500 iops). Now
with the contiguous allocator, it's 1 hour with ~30MiB/s writes (~750
iops), and the progress is suddenly limited by random write behaviour
while walking the trees to do the subvolume removals...

Roughly speaking this means writing 16 times less metadata to disk to do
the same thing (on the order of 80-100MiB/s for 4-8 hours versus ~30MiB/s
for a single hour). (!!)

Using btrfs balance for a filled 1GiB chunk with, say, 2000 data extents
changed from 10 minutes of looking at 80MiB/s metadata writes to doing
the same in just under a minute.

The obvious downside of using the 'contiguous' allocator is that the
exact same effect we just prevented for data will now happen here...
When metadata gets cowed, the old 16kiB blocks are turned into free
space after the transaction finishes. The effect is that the usage of
all existing metadata chunks slowly decreases, while the free space is
not reused, because that is happening all over the place. [3] is an
example of a metadata block group 83% filled, two weeks after the
switch. Allocated space for metadata was exploding, with about 5 to
10GiB extra per day.

So, the tradeoff for getting 16x less metadata writes in this case is
sacrificing more raw disk space to metadata. Right now, after a while,
the excessive new allocations have stopped, since the gaps that have
opened up in existing chunks are becoming large enough to be chosen for
new bulk writes.

It's like a child which never cleans up the toys he plays with, but just
throws them onto a big pile in the hallway instead of choosing an empty
spot in the closet to put every item back. At some point, enough
different toys are used to end up with a mostly empty closet, after
which we can simply take the whole pile of toys from the hallway and put
it inside again. :D

    So, to be continued... I'll try to produce a proposal with some
patches to introduce a different way to (individually) choose data and
metadata extent allocator, decoupling it from the current ssd related
options, since the whole concept of ssd doesn't have anything to do with
everything written above. Different combinations of allocators can be
better in different situations. Bundling writes together and doing 16x
less of them instead of doing random writes all over the place is e.g.
also something that a user of a large btrfs filesystem made from slower
rotating drives might prefer?

P.S. metadata on the big production filesystem is still DUP, since I
can't change that easily [4]. This also causes all metadata writes to
end up in the iSCSI write pipeline twice... Getting this fixed would
reduce the writes by another 50%.

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=583b723151794e2ff1691f1510b4e43710293875
[1]
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-06-04-expire_ssd_nossd.png
[2] https://www.spinics.net/lists/linux-btrfs/msg64203.html
[3]
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-10-27-metadata-ssd_spread.png
[4] https://www.spinics.net/lists/linux-btrfs/msg64771.html

---- >8 ----

Fun thing is, I'm not seeing any problem with cpu usage. It's perfectly
possible to have tens of thousands of subvolumes in a btrfs filesystem
without cpu usage problems. The real cpu trouble starts when there's
data with too many reflinks. For example, when doing deduplication, you
win some space, but if you're too greedy and dedupe the wrong things,
you have to pay the price of added metadata complexity and cpu usage.

With only groups of 20-30 subvolumes that reference each other's data
(the 14 daily, 10 extra weekly and 9 extra monthly snapshots) there are
no cpu usage problems.

Actually... when you have 40TiB of gazillions of files of all sizes, it's
much better to have a large number of subvolumes instead of a small one,
since it keeps the sizes of the subvolume fs trees down. Also,
sacrificing some space to actively prevent more file fragmentation and
reflinks, e.g. by using rsync --whole-file, helps.

-- 
Hans van Kranenburg

* Re: Data and metadata extent allocators [1/2]: Recap: The data story
  2017-10-27 20:10 ` Martin Steigerwald
@ 2017-10-27 21:40   ` Hans van Kranenburg
  0 siblings, 0 replies; 6+ messages in thread
From: Hans van Kranenburg @ 2017-10-27 21:40 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs

Hi Martin,

On 10/27/2017 10:10 PM, Martin Steigerwald wrote:
>> Q: How do I fight this and prevent getting into a situation where all
>> raw space is allocated, risking a filesystem crash?
>> A: Use btrfs balance to fight the symptoms. It reads data and writes it
>> out again without the free space fragments.
> 
> What do you mean by a filesystem crash? Since kernel 4.5 or 4.6 I don´t see any 
> BTRFS related filesystem hangs anymore on the /home BTRFS Dual SSD RAID 1 on my 
> Laptop, which one or two copies of Akonadi, Baloo and other desktop related 
> stuff write *heavily to* and which has all free space allocated into cunks 
> since a pretty long time:
> 
> merkaba:~> btrfs fi usage -T /home
> Overall:
>     Device size:                 340.00GiB
>     Device allocated:            340.00GiB
>     Device unallocated:            2.00MiB
>     Device missing:                  0.00B
>     Used:                        290.32GiB
>     Free (estimated):             23.09GiB      (min: 23.09GiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
>                           Data      Metadata System              
> Id Path                   RAID1     RAID1    RAID1    Unallocated
> -- ---------------------- --------- -------- -------- -----------
>  1 /dev/mapper/msata-home 163.94GiB  6.03GiB 32.00MiB     1.00MiB
>  2 /dev/mapper/sata-home  163.94GiB  6.03GiB 32.00MiB     1.00MiB
> -- ---------------------- --------- -------- -------- -----------
>    Total                  163.94GiB  6.03GiB 32.00MiB     2.00MiB
>    Used                   140.85GiB  4.31GiB 48.00KiB
> 
> I didn´t do a balance on this filesystem since a long time (kernel 4.6).

Yep, but it simply means your filesystem does not need to allocate a new
chunk for either data or metadata, since it has enough room inside to
reuse when you're doing your things.

On a filesystem that sees a large amount of writes and deletes of files,
say, adding 340GiB every day, and expiring 340GiB of data every day,
adding, removing and rewriting tens of GiBs of metadata every day,
taking such a risk is a total no go.

If you run out of the 6.03GiB - 4.31GiB ≈ 1.7GiB of metadata space that
is free now (minus the 512.00MiB global reserve, so ~1.2GiB), the
filesystem stops working. Also, you cannot solve it anymore at that point
(probably also not at this point) by making raw space unallocated with
balance, because every balance action will itself hit the out of space
condition and fail.

> Granted my filesystem is smaller than the typical backup BTRFS. I do have two 3 
> TB and one 1,5 TB SATA disks I backup to and another 2 TB BTRFS on a backup 
> server that I use for borgbackup (and that doesn´t yet do any snapshots and 
> may be better of running as XFS as it doesn´t really need snapshots as 
> borgbackup takes care of that. A BTRFS snapshot would only come handy to be 
> able to go back to a previous borgbackup repo in case it for whatever reason 
> gets corrupted or damaged / deleted by an attacker who only access to non 
> privileged user). – However all of these filesystems have plenty of free space 
> currently and are not accessed daily.
> 
>> Q: Why would it crash the file system when all raw space is allocated?
>> Won't it start trying harder to reuse the free space inside?
>> A: Yes, it will, for data. The big problem here is that allocation of a
>> new metadata chunk when needed is not possible any more.
> 
> And there it hangs or really crashes?

It will probably throw itself into read-only mode and stop doing anything
else from that point on.

> […]
> 
>> Q: Why do the pictures of my data block groups look like someone fired a
>> shotgun at it. [3], [4]?
>> A: Because the data extent allocator that is active when using the 'ssd'
>> mount option both tends to ignore smaller free space fragments all the
>> time, and also behaves in a way that causes more of them to appear. [5]
>>
>> Q: Wait, why is there "ssd" in my mount options? Why does btrfs think my
>> iSCSI attached lun is an SSD?
>> A: Because it makes wrong assumptions based on the rotational attribute,
>> which we can also see in sysfs.
>>
>> Q: Why does this ssd mode ignore free space?
>> A: Because it makes assumptions about the mapping of the addresses of
>> the block device we see in linux and the storage in actual flash chips
>> inside the ssd. Based on that information it decides where to write or
>> where not to write any more.
>>
>> Q: Does this make sense in 2017?
>> A: No. The interesting relevant optimization when writing to an ssd
>> would be to write all data together that will be deleted or overwritten
>> together at the same time in the future. Since btrfs does not come with
>> a time machine included, it can't do this. So, remove this behaviour
>> instead. [6]
>>
>> Q: What will happen when I use kernel 4.14 with the previously mentioned
>> change, or if I change to the nossd mount option explicitely already?
>> A: Relatively small free space fragments in existing chunks will
>> actually be reused for new writes that fit, working from the beginning
>> of the virtual address space upwards. It's like tetris, trying to
>> completely fill up the lowest lines first. See the big difference in
>> behavior when changing extent allocator happening at 16 seconds into
>> this timelapse movie: [7] (virtual address space)
> 
> I see a difference in behavior but I do not yet fully understand what I am 
> looking at.

It's sorted along a Hilbert curve:
https://github.com/knorrie/btrfs-heatmap/blob/develop/doc/curves.md

In the lower left corner, you suddenly see all space becoming bright
white instead, which means it's trying to fill up all chunks to 100%
usage from then on.

>> Q: But what if all my chunks have badly fragmented free space right now?
>> A: If your situation allows for it, the simplest way is running a full
>> balance of the data, as some sort of big reset button. If you only want
>> to clean up chunks with excessive free space fragmentation, then you can
>> use the helper I used to identify them, which is
>> show_free_space_fragmentation.py in [8]. Just feed the chunks to balance
>> starting with the one with the highest score. The script requires the
>> free space tree to be used, which is a good idea anyway.
> 
> Okay, when I understand this correctly I don´t need to use "nossd" with kernel 
> 4.14, but it would be good to do a full "btrfs filesystem balance" run on all 
> the SSD BTRFS filesystems or all other ones with rotational=0.

Well, every use case is different. If you only store files that are a
few hundred MB big, you'll never see a problem with the old ssd mode.

If you have a badly treated filesystem that has seen many small writes
and deletes (e.g. the example post I linked, with the videos of what
happens when you put mailman storage or /var/log on it), then you might
have so many small free space extents all over the place that it's a
good idea to clean them up first, since your new writes won't fit into
them anyway, even now that reusing them is allowed with the nossd /
tetris allocator.

> What would be the benefit of that? Would the filesystem run faster again? My 
> subjective impression is that performance got worse over time. *However* all 
> my previous full balance attempts made the performance even more worse. So… is 
> a full balance safe to the filesystem performance meanwhile?

I can't say anything about that. One of the other things I learned is
that there's a fair share of "butterfly effect" going on in a
filesystem, where everything that happens or has happened in the past
influences everything else, and anything could happen if you try
something.

> I still have the issue that fstrim on /home only works with patch from Lutz 
> Euler from 2014, which is still not in mainline BTRFS. Maybe it would be a 
> good idea to recreate /home in order to get rid of that special "anomaly" of 
> the BTRFS that fstrim don´t work without this patch.

I don't know about that patch, what does it do?

> Maybe a least a part of this should go into BTRFS kernel wiki as it would be 
> more easy to find there for users.
> 
> I wonder about a "upgrade notes for users" / "BTRFS maintenance" page that 
> gives recommendations in case some step is recommended after a major kernel 
> update and general recommendations for maintenance. Ideally most of this would 
> be integrated into BTRFS or a userspace daemon for it and be handled 
> transparently and automatically. Yet a full balance is an expensive operation 
> time-wise and probably should not be started without user consent.

Deciding on what's needed totally depends on what state the filesystem
is in. The inspection and visualization tools help with that.

And, in 2017, btrfs is still not a filesystem to just pick in a linux
installer and then forget about without ever getting in trouble.
IMHO.

At least, one of the awesome things about btrfs is that it provides such
a rich API and metadata search to build those tools to see what's going
on inside. :D

> I do wonder about the ton of tools here and there and I would love some btrfsd 
> or… maybe even more generic fsd filesystem maintenance daemon which would do 
> regular scrubs and whatever else makes sense. It could use some configuration 
> in the root directory of a filesystem and work for BTRFS and other filesystem 
> that do have beneficial online / background upgraded like XFS which also has 
> online scrubbing by now (at least for metadata).

Have fun,
Hans

* Re: Data and metadata extent allocators [1/2]: Recap: The data story
  2017-10-27 18:17 Data and metadata extent allocators [1/2]: Recap: The data story Hans van Kranenburg
  2017-10-27 20:10 ` Martin Steigerwald
  2017-10-27 21:20 ` Data and metadata extent allocators [2/2]: metadata! Hans van Kranenburg
@ 2017-10-28  0:12 ` Qu Wenruo
  2017-11-01  0:32   ` Hans van Kranenburg
  2 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2017-10-28  0:12 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs


On 2017-10-28 02:17, Hans van Kranenburg wrote:
> Hi,
> 
> This is a followup to my previous threads named "About free space
> fragmentation, metadata write amplification and (no)ssd" [0] and
> "Experiences with metadata balance/convert" [1], exploring how good or
> bad btrfs can handle filesystems that are larger than your average
> desktop computer and/or which see a pattern of writing and deleting huge
> amounts of files of wildly varying sizes all the time.
> 
> This message is a summary of the earlier posts. So, for whoever followed
> the story, only boring old news here. In the next message as a reply on
> this one, I'll add some thoughts about new adventures with metadata
> during the last weeks.
> 
> My use case is using btrfs as filesystem for backup servers, which work
> with collections of subvolumes/snapshots with related data, and add new
> data and expire old snapshots daily.
> 
> Until now, the following questions were already answered:
> 
> Q: Why does the allocated but unused space for data keep growing all the
> time?
> A: Because btrfs keeps allocating new raw space for data use, while
> there's more and more unused space inside already which isn't reused for
> new data.

In fact, the btrfs data allocator can split its data allocation requests
and make them fit into smaller free space fragments.

For example, for highly fragmented data space (and of course, no
unallocated space for a new chunk), btrfs will use small free space in
existing chunks.

Just as in fstests, generic/416.

But I think it should be done in a more aggressive manner to reduce
chunk allocation.

And balance in certain cases can be very slow due to the number of
snapshots/reflinks, so personally I don't really like the idea of
balance itself.

If it can be avoided by the extent allocator, we should do it from the
very beginning.

Thanks,
Qu

> 
> Q: How do I fight this and prevent getting into a situation where all
> raw space is allocated, risking a filesystem crash?
> A: Use btrfs balance to fight the symptoms. It reads data and writes it
> out again without the free space fragments.
> 
> Q: Why would it crash the file system when all raw space is allocated?
> Won't it start trying harder to reuse the free space inside?
> A: Yes, it will, for data. The big problem here is that allocation of a
> new metadata chunk when needed is not possible any more.
> 
> Q: Where does btrfs balance get the usage information from (what %
> filled a chunk / block group is)? How can I see this myself?
> A: It's a field of the "block group" item in metadata. The information
> can be read from the filesystem metadata using the tree search ioctl.
> Exploring this resulted in the first version of btrfs-heatmap. [2]
> 
> Q: Ok, but I have many TiBs of data chunks which are ~75% filled and
> rewriting all that data is painful, takes a huge amount of time and even
> if I would do it full-time aside from doing backups and expiries, I
> won't succeed in fighting new fragmented free space that pops up. Help!
> A: Yeah, we need something better.
> 
> Q: How can I see what the pattern is of free space fragments in a block
> group?
> A: For this, extent level pictures in btrfs-heatmap were added. [2]
> 
> Q: Why do the pictures of my data block groups look like someone fired a
> shotgun at it. [3], [4]?
> A: Because the data extent allocator that is active when using the 'ssd'
> mount option both tends to ignore smaller free space fragments all the
> time, and also behaves in a way that causes more of them to appear. [5]
> 
> Q: Wait, why is there "ssd" in my mount options? Why does btrfs think my
> iSCSI attached lun is an SSD?
> A: Because it makes wrong assumptions based on the rotational attribute,
> which we can also see in sysfs.
> 
> Q: Why does this ssd mode ignore free space?
> A: Because it makes assumptions about the mapping of the addresses of
> the block device we see in linux and the storage in actual flash chips
> inside the ssd. Based on that information it decides where to write or
> where not to write any more.
> 
> Q: Does this make sense in 2017?
> A: No. The interesting relevant optimization when writing to an ssd
> would be to write all data together that will be deleted or overwritten
> together at the same time in the future. Since btrfs does not come with
> a time machine included, it can't do this. So, remove this behaviour
> instead. [6]
> 
> Q: What will happen when I use kernel 4.14 with the previously mentioned
> change, or if I change to the nossd mount option explicitely already?
> A: Relatively small free space fragments in existing chunks will
> actually be reused for new writes that fit, working from the beginning
> of the virtual address space upwards. It's like tetris, trying to
> completely fill up the lowest lines first. See the big difference in
> behavior when changing extent allocator happening at 16 seconds into
> this timelapse movie: [7] (virtual address space)
> 
> Q: But what if all my chunks have badly fragmented free space right now?
> A: If your situation allows for it, the simplest way is running a full
> balance of the data, as some sort of big reset button. If you only want
> to clean up chunks with excessive free space fragmentation, then you can
> use the helper I used to identify them, which is
> show_free_space_fragmentation.py in [8]. Just feed the chunks to balance
> starting with the one with the highest score. The script requires the
> free space tree to be used, which is a good idea anyway.
> 
> [0] https://www.spinics.net/lists/linux-btrfs/msg64446.html
> [1] https://www.spinics.net/lists/linux-btrfs/msg64771.html
> [2] https://github.com/knorrie/btrfs-heatmap/
> [3]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-10-27-shotgunblast.png
> [4]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-12-18-heatmap-scripting/fsid_ed10a358-c846-4e76-a071-3821d423a99d_startat_320029589504_at_1482095269.png
> [5] https://www.spinics.net/lists/linux-btrfs/msg64418.html
> [6]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=583b723151794e2ff1691f1510b4e43710293875
> [7] https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-ssd-to-nossd.mp4
> [8] https://github.com/knorrie/python-btrfs/tree/develop/examples
> 



* Re: Data and metadata extent allocators [1/2]: Recap: The data story
  2017-10-28  0:12 ` Data and metadata extent allocators [1/2]: Recap: The data story Qu Wenruo
@ 2017-11-01  0:32   ` Hans van Kranenburg
  0 siblings, 0 replies; 6+ messages in thread
From: Hans van Kranenburg @ 2017-11-01  0:32 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 10/28/2017 02:12 AM, Qu Wenruo wrote:
> 
> On 2017年10月28日 02:17, Hans van Kranenburg wrote:
>> Hi,
>>
>> This is a followup to my previous threads named "About free space
>> fragmentation, metadata write amplification and (no)ssd" [0] and
>> "Experiences with metadata balance/convert" [1], exploring how good or
>> bad btrfs can handle filesystems that are larger than your average
>> desktop computer and/or which see a pattern of writing and deleting huge
>> amounts of files of wildly varying sizes all the time.
>>
>> This message is a summary of the earlier posts. So, for whoever followed
>> the story, only boring old news here. In the next message as a reply on
>> this one, I'll add some thoughts about new adventures with metadata
>> during the last weeks.
>>
>> My use case is using btrfs as filesystem for backup servers, which work
>> with collections of subvolumes/snapshots with related data, and add new
>> data and expire old snapshots daily.
>>
>> Until now, the following questions were already answered:
>>
>> Q: Why does the allocated but unused space for data keep growing all the
>> time?
>> A: Because btrfs keeps allocating new raw space for data use, while
>> there's more and more unused space inside already which isn't reused for
>> new data.
> 
> In fact, btrfs data allocator can split its data allocation request, and
> make them fit into smaller blocks.
> 
> For example, for highly fragmented data space (and of course, no
> unallocated space for new chunk), btrfs will use small space in existing
> chunks.

Yes, it will.

> Just as fstests, generic/416.
> 
> But I think it should be done in a more aggressive manner to reduce
> chunk allocation.
> 
> And balance under certain case can be very slow due to the amount of
> snapshots/reflinks, so personally I don't really like the idea of
> balance itself.
> 
> If it can be avoid by extent allocator, we should do it from the very
> beginning.

Well, the most urgent problem in the end is not how the distribution of
data over the data chunks is organized.

It's about this:

>> Q: Why would it crash the file system when all raw space is allocated?
>> Won't it start trying harder to reuse the free space inside?
>> A: Yes, it will, for data. The big problem here is that allocation of a
>> new metadata chunk when needed is not possible any more.

The real problem is that the separate allocation of raw space for data
and metadata might lead to a situation where you can't write any
metadata any more (because it wants to allocate a new chunk) while you
have a lot of data space available.

And to be honest, for the end user this is a very similar experience to
getting an out of space error on ext4 while there's a lot of data space
available, which makes you find out about the concept of inodes (which
you ran out of) and df -i etc... (The difference is that on btrfs, in
most cases you can actually solve it in place. :D) We have no limit on
inodes, but we do have a limit on the space to store tree blocks: the
unallocated raw space that is still available.

So if we can work around this and prevent it, it will mostly solve the
rest automatically. In theory a filesystem always keeps working as long
as you make sure you have about a GiB unallocated all the time so that
either the next data or metadata chunk can grab it.
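
As a hedged sketch of how you could keep an eye on that with
python-btrfs: the devices() call and the total_bytes/bytes_used attribute
names below are assumptions based on the project's example scripts, so
double-check them against the version you run.

#!/usr/bin/python3
#
# Sketch: warn when the unallocated raw space on any device drops below
# 1GiB. The devices() call and the total_bytes/bytes_used attributes are
# assumptions taken from the python-btrfs examples.
import sys
import btrfs

GIB = 1024 ** 3
fs = btrfs.FileSystem(sys.argv[1])  # argument: path to a mounted btrfs
for device in fs.devices():
    unallocated = device.total_bytes - device.bytes_used
    flag = "  <-- low!" if unallocated < GIB else ""
    print("devid {}: {} bytes unallocated{}".format(
        device.devid, unallocated, flag))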

I actually remember seeing something in the kernel code a while ago that
should make it try harder to push data into existing allocated space
instead of allocating a new data chunk if the unallocated part is < 3% of
the total device size. Something like that might help, only, if that code
is still in there, for some reason it doesn't actually seem to work.

Having the tetris allocator for data by default helps prevent the
situation (running with fully allocated raw space too soon) from
occurring with certain workloads, e.g. by not showing insane behaviour
like [0].

But, I can also still fill my disk with files and then remove half of
them in a way that leaves me with fully allocated raw space and all
chunks 50% filled. And what is the supposed behaviour if I then start to
refill the empty space with metadata-hungry small files with long names
again?

[0]
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

-- 
Hans van Kranenburg
