From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Problem with file system
Date: Tue, 25 Apr 2017 04:05:39 +0000 (UTC)
Message-ID: <pan$abe0f$67f7c415$77be962b$81e0757d@cox.net>
In-Reply-To: <CAJCQCtQ2d1c6OOGnJkKQ269kR526Y=krOcnJgy9cJHGQiw4aQQ@mail.gmail.com>
Chris Murphy posted on Mon, 24 Apr 2017 11:02:02 -0600 as excerpted:
> On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel <vanandel@gmail.com>
> wrote:
>> I have a btrfs file system with a few thousand snapshots. When I
>> attempted to delete 20 or so of them the problems started.
>>
>> The disks are being read but except for the first few minutes there are
>> no writes.
>>
>> Memory usage keeps growing until all the memory (24 Gb) is used in a
>> few hours. Eventually the system will crash with out of memory errors.
In addition to what CMurphy and QW suggested (both valid), I have a
couple other suggestions/pointers. They won't help you get out of the
current situation, but they might help you stay out of it in the future.
1) You mention a "few thousand snapshots", but not how many subvolumes
those snapshots are of, or how many there are per subvolume.
As CMurphy says, but I'll expand on it here: taking a snapshot is nearly
free, just a bit of metadata to write, because btrfs is COW-based and all
a snapshot does is lock down the subvolume's current state, which the
filesystem is already tracking. Removal, however, is expensive, because
btrfs must go through and check everything to see whether each block can
actually be freed (no other snapshot references it) or not (something
else still references it).
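The asymmetry shows up right at the command line. These are the stock
btrfs-progs commands; the paths here are hypothetical, so substitute your
own mount point and layout:

```shell
# Creating a read-only snapshot returns almost instantly -- it's just
# a bit of metadata:
btrfs subvolume snapshot -r /mnt/data /mnt/snapshots/data.2017-04-25

# Deletion also *returns* quickly, but the real work happens later in
# the background cleaner thread, which must walk the trees checking
# which blocks are still shared:
btrfs subvolume delete /mnt/snapshots/data.2017-04-25

# Block until pending subvolume deletions have actually been cleaned up:
btrfs subvolume sync /mnt
```

The `subvolume sync` step makes the hidden cost visible: with many
snapshots of the same subvolume, that wait can be very long.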
Obviously, then, this checking gets much more complicated the more
snapshots of the same subvolume that exist. IOW, it's a scaling issue.
The same scaling issue applies to various other btrfs maintenance tasks,
including btrfs check (aka btrfsck), and btrfs balance (and thus btrfs
device remove, which does an implicit balance). Both of these take *far*
longer if the number of snapshots per subvolume is allowed to get out of
hand.
Due to this scaling issue, the recommendation is no more than 200-300
snapshots per subvolume, and keeping it down to 50-100 max is even
better, if you can do it reasonably. That helps keep scaling issues and
thus time for any necessary maintenance manageable. Otherwise... well,
we've had reports of device removes (aka balances) that would take
/months/ to finish at the rate they were going. Obviously, well before
it gets to that point it's far faster to simply blow away the filesystem
and restore from backups.[1]
It follows that if you have an automated system doing the snapshots, it's
equally important to have an automated system thinning them, keeping the
number of snapshots per subvolume within manageable scaling limits.
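A minimal thinning pass can be sketched in shell. This assumes one
snapshot directory per subvolume with sortable (e.g. ISO-date) names;
both the path and the retention count below are illustrative values, not
a recommendation, so adjust them to your own layout:

```shell
#!/bin/sh
# Keep only the newest $KEEP snapshots of one subvolume; delete the rest.
SNAPDIR=/mnt/snapshots/home   # hypothetical path
KEEP=100                      # stays well under the ~300/subvolume ceiling

# ls | sort gives oldest-first for ISO-dated names; head -n -"$KEEP"
# (GNU coreutils) drops the newest $KEEP, leaving the ones to delete:
ls -1 "$SNAPDIR" | sort | head -n -"$KEEP" | while read -r snap; do
    btrfs subvolume delete "$SNAPDIR/$snap"
done
```

Real thinning schemes usually keep a tapering series (hourly, daily,
weekly) rather than a flat newest-N, but the shape is the same.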
So if that's "a few thousand snapshots", I hope they're spread across (at
least) a double-digit number of subvolumes, keeping the number of
snapshots per subvolume under 300, and under 100 if your snapshot
rotation schedule will allow it.
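One rough way to check where you stand is to group the output of `btrfs
subvolume list -s` by parent directory. The pipeline below assumes
snapshots are named like `snapshots/<subvol>/<timestamp>` (a hypothetical
layout; adjust the field the second awk step keys on to match your own
naming):

```shell
# List snapshots only (-s), take the path (last field), group by the
# directory component naming the source subvolume, and count each group:
btrfs subvolume list -s /mnt |
    awk '{print $NF}' |
    awk -F/ '{print $2}' |
    sort | uniq -c | sort -rn
```

Any subvolume showing a count in the high hundreds or thousands is a
candidate for thinning.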
2) As Qu suggests, btrfs quotas increase the scaling issues significantly.
Additionally, there have been and continue to be accuracy issues with
certain quota corner-cases, so they can't be entirely relied upon anyway.
Generally, people using btrfs quotas fall into three categories:
a) Those who know the problems and are working with Qu and the other devs
to report and trace issues so they will eventually work well, ideally
with less of a scaling issue as well.
Bless them! Keep it up! =:^)
b) Those who have a use-case that really depends on quotas.
Because btrfs quotas are currently buggy and not entirely reliable, not
to mention the scaling issues, these users are almost certainly better
served by a more mature filesystem with mature, dependable quotas.
c) Those who don't really care about quotas specifically, and are just
using them because it's a nice feature. This likely includes some who
are simply running distros that enable quotas.
My recommendation for these users is to simply turn btrfs quotas off for
now, as they're presently more trouble than they're worth, due to both
the accuracy and the scaling issues. Hopefully quotas will be stable in a
couple of years, and with hard work from the developers and testers
perhaps the scaling issues will have been reduced as well, at which point
that recommendation can change. But for now, if you don't really need
them, leaving quotas off will significantly reduce the scaling issues.
And if you do need them, they're not yet reliable on btrfs anyway, so
you're better off using something more mature where they actually work.
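Turning them off takes two commands (root required; `/mnt` here is a
stand-in for your actual mount point):

```shell
# Show the current qgroup state (this errors out if quotas were never
# enabled, in which case there's nothing to do):
btrfs qgroup show /mnt

# Disable quota tracking on the filesystem:
btrfs quota disable /mnt
```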
3) Similarly (though unlikely to apply in your case), beware of the
scaling implications of the various reflink-based copying and dedup
utilities, which work via the same copy-on-write and reflinking
technology that's behind snapshotting.
A snapshot, though, effectively reflinks /everything/ in the subvolume,
so the scaling issues compound much faster there than they will with a
more trivial level of reflinking. Of course, when it comes to dedup, a
more trivial level of reflinking means less benefit from doing the dedup
in the first place, so there's a limit to how effective dedup can be
before it runs into the same scaling issues that snapshots do. If you
have exactly two copies of /everything/ in a subvolume and dedup them
down to a single copy, that has the same effect as a single snapshot, so
it takes a lot of reflink-based deduping to reach the same level as a
couple hundred snapshots. But it's something to think about if you're
planning to dedup, say, 1000 copies of a bunch of stuff by making them
all reflinks to the same single copy.
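For reference, a reflink copy is just an ordinary `cp` with the
`--reflink` option (GNU coreutils); `--reflink=auto` shares extents where
the filesystem supports it and silently falls back to a normal copy
elsewhere. The filenames are placeholders:

```shell
# Clone a file by sharing its extents rather than duplicating the data.
# This is the same COW sharing a snapshot creates, for a single file:
cp --reflink=auto bigfile.img bigfile-clone.img
```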
Bottom line, if those "few thousand snapshots" are all of the same subvol
or two, /especially/ if you're running btrfs quotas on top of that...
that's very likely your problem right there. Keep your number of
snapshots per subvolume under 300, and turn off btrfs quotas, and you'll
very likely find the problem disappears.
---
[1] Backups: Sysadmin's first rule of backups, simple form: If you don't
have a backup, you are, by that lack, defining the data at risk as worth
less than the time/hassle/resources needed to back it up. Because if it
were worth more than the time/hassle/resources necessary for the backup,
by definition, it would /be/ backed up.
It's your choice to make, but there's no redefining after the fact. If
you lost the primary copy for whatever reason and didn't have that
backup, you simply defined the data as not worth enough to back up, and
you get to be happy, because you saved what your actions, or lack
thereof, defined as of most value to you: the time/hassle/resources you
would otherwise have spent doing that backup.
Sysadmin's second rule of backups: A backup isn't complete until it has
been tested restorable. Until then, it's simply a would-be backup,
because you don't actually know if it worked or not.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman