* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-20 16:36 ` freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?) Christian Pernegger
@ 2019-11-20 17:59 ` Oliver Freyermuth
2019-11-20 18:32 ` Chris Murphy
` (2 subsequent siblings)
3 siblings, 0 replies; 31+ messages in thread
From: Oliver Freyermuth @ 2019-11-20 17:59 UTC (permalink / raw)
To: Christian Pernegger, linux-btrfs
Hi,
I'm using a ~4 year old laptop, 4 cores (+4 HT), 32 GB RAM,
Crucial mSATA SSD, and I notice neither the snapshotting nor the deletion of snapshots nor the transferring at all
(been doing this for years now).
I'm running kernel 5.3 now, but have also been on 5.0 some time ago (but I'm on Gentoo, not Ubuntu). So I'd say this is not normal.
The first thing you'd need to check is when exactly it happens - btrbk logs the steps it is doing. Does it happen during the snapshotting, transferring,
or deletion of snapshots? Anything in the kernel log?
Did you run a deduplication tool on the BTRFS volumes, or use quotas? These are the only things which come to my mind which can cause high CPU load here
(but in any case, nothing should "block").
Cheers,
Oliver
Am 20.11.19 um 17:36 schrieb Christian Pernegger:
> Hello,
>
> I've decided to go with a snapshot-based backup solution for our new
> Linux desktops -- thank you for the timely thread --, namely btrbk.
> A couple of subvolumes for different stuff, with hourly snapshots that
> regularly go to another machine. Brilliant in theory, less so in
> practice, because every time btrbk runs, the box'll freeze for a few
> seconds, as in, Firefox and LibreOffice, for instance, become entirely
> unresponsive, games hang and so on. (AFAICT, all it does is snapshot
> each subvolume and delete ones that are out of the retention period.)
>
> I'm aware that having many snapshots can impact performance of some
> operations, but I didn't think that "many" <= 200, "impact" = stop
> dead and "some operations" = light desktop use. These are decently
> specced, after all (Zen 2 8/12 core, 32 GB RAM, Samsung 970 Evo Plus).
> What I'm asking is, is this to be expected, does it just need tuning,
> is the hardware buggy, the kernel version (Ubuntu 18.04.3 HWE, their
> 5.0 series) a stinker, something else awry ...?
>
> Cheers,
> C.
>
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-20 16:36 ` freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?) Christian Pernegger
2019-11-20 17:59 ` Oliver Freyermuth
@ 2019-11-20 18:32 ` Chris Murphy
2019-11-21 1:51 ` Qu Wenruo
2019-11-21 22:22 ` Zygo Blaxell
3 siblings, 0 replies; 31+ messages in thread
From: Chris Murphy @ 2019-11-20 18:32 UTC (permalink / raw)
To: Christian Pernegger; +Cc: linux-btrfs
On Wed, Nov 20, 2019 at 9:36 AM Christian Pernegger <pernegger@gmail.com> wrote:
>
> Hello,
>
> I've decided to go with a snapshot-based backup solution for our new
> Linux desktops -- thank you for the timely thread --, namely btrbk.
> A couple of subvolumes for different stuff, with hourly snapshots that
> regularly go to another machine. Brilliant in theory, less so in
> practice, because every time btrbk runs, the box'll freeze for a few
> seconds, as in, Firefox and LibreOffice, for instance, become entirely
> unresponsive, games hang and so on. (AFAICT, all it does is snapshot
> each subvolume and delete ones that are out of the retention period.)
>
> I'm aware that having many snapshots can impact performance of some
> operations, but I didn't think that "many" <= 200, "impact" = stop
> dead and "some operations" = light desktop use. These are decently
> specced, after all (Zen 2 8/12 core, 32 GB RAM, Samsung 970 Evo Plus).
> What I'm asking is, is this to be expected, does it just need tuning,
> is the hardware buggy, the kernel version (Ubuntu 18.04.3 HWE, their
> 5.0 series) a stinker, something else awry ...?
What are the mount options? And what's the workload immediately prior to
the snapshot? Or does it always happen no matter the workload?
I use Btrfs on a variety of hardware and storage devices, USB flash,
NVMe, hard drives, and a Samsung 940 EVO, and I can't say I experience
anything like a freeze or hang. If I'm doing something like updates
(dnf updates, RPM) and do a snapshot while the update is happening
(bit kooky because that snapshot represents an in-between state of the
update, essentially useless except as intentionally poking things
with a stick just to see what happens) I do see a user space "hang" as
a flush is required as part of the snapshot, and I see this flush
using top. But so far I only see it affect the snapshot command itself
(it's a delay rather than a hang). I don't see it affect GUI
responsiveness.
--
Chris Murphy
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-20 16:36 ` freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?) Christian Pernegger
2019-11-20 17:59 ` Oliver Freyermuth
2019-11-20 18:32 ` Chris Murphy
@ 2019-11-21 1:51 ` Qu Wenruo
2019-11-21 16:44 ` Christian Pernegger
2019-11-21 22:22 ` Zygo Blaxell
3 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2019-11-21 1:51 UTC (permalink / raw)
To: Christian Pernegger, linux-btrfs
On 2019/11/21 上午12:36, Christian Pernegger wrote:
> Hello,
>
> I've decided to go with a snapshot-based backup solution for our new
> Linux desktops -- thank you for the timely thread --, namely btrbk.
> A couple of subvolumes for different stuff, with hourly snapshots that
> regularly go to another machine. Brilliant in theory, less so in
> practice, because every time btrbk runs, the box'll freeze for a few
> seconds, as in, Firefox and LibreOffice, for instance, become entirely
> unresponsive, games hang and so on. (AFAICT, all it does is snapshot
> each subvolume and delete ones that are out of the retention period.)
>
> I'm aware that having many snapshots can impact performance of some
> operations, but I didn't think that "many" <= 200, "impact" = stop
> dead and "some operations" = light desktop use. These are decently
> specced, after all (Zen 2 8/12 core, 32 GB RAM, Samsung 970 Evo Plus).
> What I'm asking is, is this to be expected, does it just need tuning,
> is the hardware buggy, the kernel version (Ubuntu 18.04.3 HWE, their
> 5.0 series) a stinker, something else awry ...?
Are you using qgroup?
With qgroup, snapshot deleting is still a problem though.
(But not for snapshot creation, that shouldn't cause any slow down,
unless you're using multi-level qgroups)
Thanks,
Qu
>
> Cheers,
> C.
>
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-21 1:51 ` Qu Wenruo
@ 2019-11-21 16:44 ` Christian Pernegger
2019-11-21 19:37 ` Oliver Freyermuth
0 siblings, 1 reply; 31+ messages in thread
From: Christian Pernegger @ 2019-11-21 16:44 UTC (permalink / raw)
To: linux-btrfs
Am Mi., 20. Nov. 2019 um 18:59 Uhr schrieb Oliver Freyermuth
<o.freyermuth@googlemail.com>:
> So I'd say this is not normal.
Good to hear, that means it might be fixable. The alternative would be
to switch to Borg or restic, and I just don't feel comfortable with
deduplication relying solely on hashes, I'm a Luddite like that.
> The first thing you'd need to check is when exactly it happens
Currently 17 minutes past the hour, which is when my cron.hourly runs,
and that only runs btrbk. I can't say for certain if it happens every
hour, but I'm reasonably confident.
> btrbk logs the steps it is doing. Does it happen during the snapshotting, transferring, or deletion of snapshots?
It's just configured to snapshot & prune, no transfer. A central
backup server (grand name, for a white-box NAS) pulls the snapshots
each night and does its own pruning. I'm not sure how to tell when
exactly it happens, as I have not much agency while it is happening.
> Anything in the kernel log?
Nothing suspicious in btrbk.log, dmesg or the systemd journal. The
affected things just stop reacting, then continue as if nothing had
happened.
> Did you run a deduplication tool on the BTRFS volumes, or use quotas?
No to deduplication, maybe to quotas. It's possible that Timeshift
enables them, how can I check?
Just had another episode:
2019-11-21T17:17:01+0100 startup v0.26.0 - - - # btrbk command line client, version 0.26.0
2019-11-21T17:17:01+0100 snapshot starting /mnt/timeshift/backup/btrbk-snapshots/@.20191121T171701+0100 /mnt/timeshift/backup/@ - -
2019-11-21T17:17:01+0100 snapshot success /mnt/timeshift/backup/btrbk-snapshots/@.20191121T171701+0100 /mnt/timeshift/backup/@ - -
2019-11-21T17:17:01+0100 snapshot starting /mnt/timeshift/backup/btrbk-snapshots/@home.20191121T171701+0100 /mnt/timeshift/backup/@home - -
2019-11-21T17:17:01+0100 snapshot success /mnt/timeshift/backup/btrbk-snapshots/@home.20191121T171701+0100 /mnt/timeshift/backup/@home - -
2019-11-21T17:17:01+0100 delete_snapshot starting /mnt/timeshift/backup/btrbk-snapshots/@.20191119T161701+0100 - - -
2019-11-21T17:17:01+0100 delete_snapshot success /mnt/timeshift/backup/btrbk-snapshots/@.20191119T161701+0100 - - -
2019-11-21T17:17:01+0100 delete_snapshot starting /mnt/timeshift/backup/btrbk-snapshots/@home.20191119T161701+0100 - - -
2019-11-21T17:17:01+0100 delete_snapshot success /mnt/timeshift/backup/btrbk-snapshots/@home.20191119T161701+0100 - - -
2019-11-21T17:17:01+0100 delete_snapshot starting /mnt/timeshift/backup/btrbk-snapshots/@home-chris-.steam.20191119T161701+0100 - - -
2019-11-21T17:17:01+0100 delete_snapshot success /mnt/timeshift/backup/btrbk-snapshots/@home-chris-.steam.20191119T161701+0100 - - -
2019-11-21T17:17:01+0100 finished success - - - -
I had a tail on the log, these came out in one go, no larger pauses.
At first I thought, just my luck, here I am lying in wait and of
course everything works, then the mini-freeze happened. CPU usage in
one core spiked during the freeze, but I couldn't switch tabs from the
graphs to the process list in gnome-system-monitor. Top it is, next
time.
Am Mi., 20. Nov. 2019 um 19:32 Uhr schrieb Chris Murphy
<lists@colorremedies.com>:
> What are the mount options?
defaults, which comes out as
rw,relatime,ssd,space_cache,subvolid=,subvol=, according to mount.
> And what's the workload immediate prior to the snapshot? Or does it always happen no matter the workload?
Can't guarantee "always", but ... This time I was in the process of
composing this e-Mail. A couple of things open, sure, Firefox, couple
of terminals, Signal, evince, deadbeat [stopped], but not doing
anything much. I'd call the workload "idle", especially fs-wise. Last
time I was typing at a bash prompt via gnome-terminal -- the input
wouldn't show or register until it was over. It's not only
i/o-intensive stuff that blocks.
Am Do., 21. Nov. 2019 um 02:51 Uhr schrieb Qu Wenruo <quwenruo.btrfs@gmx.com>:
> Are you using qgroup?
Not knowingly. If either Timeshift or btrbk enable them, it's possible.
Cheers,
C.
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-21 16:44 ` Christian Pernegger
@ 2019-11-21 19:37 ` Oliver Freyermuth
2019-11-21 20:30 ` Christian Pernegger
0 siblings, 1 reply; 31+ messages in thread
From: Oliver Freyermuth @ 2019-11-21 19:37 UTC (permalink / raw)
To: Christian Pernegger, linux-btrfs
Am 21.11.19 um 17:44 schrieb Christian Pernegger:
> No to deduplication, maybe to quotas. It's possible that Timeshift
> enables them, how can I check?
You can test with:
$ btrfs qgroup show /
ERROR: can't list qgroups: quotas not enabled
but none of the tools you are using should activate qgroups I think
(at least btrbk does not).
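
A minimal sketch of that check as a reusable shell function, in case it helps. The function name quotas_enabled is made up for illustration; it only wraps the `btrfs qgroup show` call above and looks at its exit status, so it cannot tell "quotas off" apart from "not allowed to ask".

```shell
# Report whether btrfs quotas (qgroups) are enabled on a mounted filesystem.
# Relies only on the exit status of `btrfs qgroup show`, which fails with
# "quotas not enabled" when they are off.
quotas_enabled() {
    # $1: mount point, defaults to /
    if btrfs qgroup show "${1:-/}" >/dev/null 2>&1; then
        echo "quotas enabled on ${1:-/}"
    else
        echo "quotas not enabled on ${1:-/} (or not readable; try as root)"
        return 1
    fi
}

# Usage (typically needs root):
#   quotas_enabled /
```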
> Just had another episode:
> 2019-11-21T17:17:01+0100 startup v0.26.0 - - - # btrbk command line
> client, version 0.26.0
> 2019-11-21T17:17:01+0100 snapshot starting
> /mnt/timeshift/backup/btrbk-snapshots/@.20191121T171701+0100
> /mnt/timeshift/backup/@ - -
> 2019-11-21T17:17:01+0100 snapshot success
> /mnt/timeshift/backup/btrbk-snapshots/@.20191121T171701+0100
> /mnt/timeshift/backup/@ - -
> 2019-11-21T17:17:01+0100 snapshot starting
> /mnt/timeshift/backup/btrbk-snapshots/@home.20191121T171701+0100
> /mnt/timeshift/backup/@home - -
> 2019-11-21T17:17:01+0100 snapshot success
> /mnt/timeshift/backup/btrbk-snapshots/@home.20191121T171701+0100
> /mnt/timeshift/backup/@home - -
> 2019-11-21T17:17:01+0100 delete_snapshot starting
> /mnt/timeshift/backup/btrbk-snapshots/@.20191119T161701+0100 - - -
> 2019-11-21T17:17:01+0100 delete_snapshot success
> /mnt/timeshift/backup/btrbk-snapshots/@.20191119T161701+0100 - - -
> 2019-11-21T17:17:01+0100 delete_snapshot starting
> /mnt/timeshift/backup/btrbk-snapshots/@home.20191119T161701+0100 - - -
> 2019-11-21T17:17:01+0100 delete_snapshot success
> /mnt/timeshift/backup/btrbk-snapshots/@home.20191119T161701+0100 - - -
> 2019-11-21T17:17:01+0100 delete_snapshot starting
> /mnt/timeshift/backup/btrbk-snapshots/@home-chris-.steam.20191119T161701+0100
> - - -
> 2019-11-21T17:17:01+0100 delete_snapshot success
> /mnt/timeshift/backup/btrbk-snapshots/@home-chris-.steam.20191119T161701+0100
> - - -
> 2019-11-21T17:17:01+0100 finished success - - - -
>
> I had a tail on the log, these came out in one go, no larger pauses.
> At first I thought, just my luck, here I am lying in wait and of
> course everything works, then the mini-freeze happened. CPU usage in
> one core spiked during the freeze, but I couldn't switch tabs from the
> graphs to the process list in gnome-system-monitor. Top it is, next
> time.
This is an interesting observation. I believe this means it is happening when the snapshot deletes are actually going to the storage,
which usually happens only _after_ btrbk is finished (in case you catch it with top, a kernel thread "btrfs-cleaner" should be doing this job).
Another interesting test could be to adjust btrbk configuration to:
btrfs_commit_delete = each
which will ensure the delete_snapshot operations are flushed to disk one by one, so the freeze should then correlate to the log
(and might be converted from one longer freeze to multiple, contiguous smaller freezes).
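
For reference, a rough sketch of how to see which value is currently set. Assumptions on my side: the config lives at /etc/btrbk/btrbk.conf, and btrbk.conf takes whitespace-separated "option value" lines (so the line in the file would read "btrfs_commit_delete each"). The helper name is hypothetical, and it ignores any per-section nesting.

```shell
# Print the last btrfs_commit_delete value found in a btrbk config file,
# or "unset" if the option does not appear (commented-out lines are skipped
# because their first field is not the bare option name).
commit_delete_setting() {
    # $1: config path, defaults to /etc/btrbk/btrbk.conf (assumed location)
    awk '$1 == "btrfs_commit_delete" { v = $2 }
         END { print (v == "" ? "unset" : v) }' "${1:-/etc/btrbk/btrbk.conf}"
}

# Usage:
#   commit_delete_setting /etc/btrbk/btrbk.conf
```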
Sadly, I have no idea on why this would freeze for you (well, it's the only actual I/O-heavy part when you don't do the transfers at this point in time).
But maybe Qu will have a good idea.
Cheers,
Oliver
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-21 19:37 ` Oliver Freyermuth
@ 2019-11-21 20:30 ` Christian Pernegger
2019-11-21 21:34 ` Christian Pernegger
2019-11-21 23:57 ` Oliver Freyermuth
0 siblings, 2 replies; 31+ messages in thread
From: Christian Pernegger @ 2019-11-21 20:30 UTC (permalink / raw)
To: linux-btrfs
> Am 21.11.19 um 17:44 schrieb Christian Pernegger:
> > maybe to quotas. It's possible that Timeshift enables them, how can I check?
>
> You can test with:
> $ btrfs qgroup show /
Definitely enabled, then. ... ... ... There it is: Timeshift has a
pre-selected checkbox "enable BTRFS qgroups (recommended)" [translated
from German].
1) How can I safely disable qgroups? Is it enough to uncheck the
Timeshift option and then run btrfs quota disable or do I have to
manually remove the qgroups somehow?
2) I'm wondering if this couldn't be improved. Considering qgroups are
only used (in this case) for reporting on allocated space, not
limiting it, and btrfs free space reporting is notoriously lazy [not
meant in a bad way, can't think of a better word right now] anyway,
why does anything need to block at all? Even if I were using quotas, I
might prefer fuzzy quotas [that can be hit too early/late because
accounting is catching up] to a temporary standstill, as an option.
> This is an interesting observation. I believe this means it is happening when the snapshot deletes are actually going to the storage,
> which usually happens only _after_ btrbk is finished (in case you catch it with top, a kernel thread "btrfs-cleaner" should be doing this job).
Ok, so btrbk runs, finishes, soon (but not immediately) after that
btrfs-cleaner indeed tops the CPU charts, pegging one core to 100 %.
The system is still responsive at this point. A couple of seconds into
the btrfs-cleaner run, the system becomes unresponsive (top still
updates throughout, though). btrfs-cleaner drops off, and
btrfs-transacti[obv. cut off] takes its place, taking 100 % CPU.
Still unresponsive. As soon as btrfs-transacti is done, the system
immediately recovers. Then btrfs-cleaner returns, briefly, with no
impact on performance. (Keep in mind that top only updates every
couple seconds, it's possible btrfs-cleaner is blameless and
btrfs-transacti the culprit.)
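
To get finer-grained samples than interactive top gives, something along these lines could log the btrfs kernel threads once per second. This is only a sketch: the helper name btrfs_threads is invented, and it assumes top's default batch-mode column layout (%CPU in column 9, command in column 12), which a customized toprc would break.

```shell
# Filter `top -b` batch output down to a timestamp per sample plus the
# btrfs kernel threads (btrfs-cleaner, btrfs-transacti, ...) and their %CPU.
btrfs_threads() {
    awk '
        /^top - /       { print "--", $3 }      # header line: time of this sample
        $12 ~ /^btrfs-/ { print $9 "%", $12 }   # %CPU and command name
    '
}

# Usage: start shortly before cron.hourly fires and sample for two minutes:
#   top -b -d 1 -n 120 | btrfs_threads > /tmp/btrfs-threads.log
```

The first sample top prints is an average since boot, so only the later ones say anything about the freeze itself.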
> Another interesting test could be to adjust btrbk configuration to:
> btrfs_commit_delete = each
Will do.
Cheers,
C.
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-21 20:30 ` Christian Pernegger
@ 2019-11-21 21:34 ` Christian Pernegger
2019-11-21 22:39 ` Marc Joliet
2019-11-21 23:57 ` Oliver Freyermuth
1 sibling, 1 reply; 31+ messages in thread
From: Christian Pernegger @ 2019-11-21 21:34 UTC (permalink / raw)
To: linux-btrfs
> > Another interesting test could be to adjust btrbk configuration to:
> > btrfs_commit_delete = each
>
> Will do.
Hm. No freeze, this time (with btrbk set to commit after each delete).
In other news,
- I seem to be leaking qgroups. There are currently 191 subvolumes
(most of which are ro snapshots), but 547 "0/*" qgroups. Should
deleting a subvolume take care of removing its (auto-created) qgroup,
or does that always have to be done manually (or by setting the
experimental *_qgroup_destroy options in btrbk.conf)? Any elegant ways
to remove orphaned qgroups?
- Timeshift, at :00, triggers this as well, it's just less severe
(maybe because that's 1 subvolume instead of 3).
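
For what it's worth, the mismatch between subvolumes and level-0 qgroups can be listed with a small sketch like the following. The function name orphaned_qgroups is hypothetical, and it assumes the usual text layout of `btrfs subvolume list` ("ID <n> gen ...") and `btrfs qgroup show` ("0/<n> ..." rows); 0/5 is skipped because the top-level subvolume never shows up in the subvolume list.

```shell
# Print level-0 qgroups whose subvolume ID no longer exists.
orphaned_qgroups() {
    # $1: file holding `btrfs subvolume list <mnt>` output
    # $2: file holding `btrfs qgroup show <mnt>` output
    awk 'mode == "s" && $1 == "ID" { live[$2] = 1; next }
         mode == "q" && $1 ~ /^0\/[0-9]+$/ {
             split($1, q, "/")
             if (q[2] != 5 && !(q[2] in live)) print $1
         }' mode=s "$1" mode=q "$2"
}

# Usage (as root); review the list before destroying anything:
#   btrfs subvolume list / > /tmp/subvols.txt
#   btrfs qgroup show /    > /tmp/qgroups.txt
#   orphaned_qgroups /tmp/subvols.txt /tmp/qgroups.txt |
#       while read -r id; do btrfs qgroup destroy "$id" /; done
```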
Cheers,
C.
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-21 21:34 ` Christian Pernegger
@ 2019-11-21 22:39 ` Marc Joliet
2019-11-22 1:36 ` Chris Murphy
0 siblings, 1 reply; 31+ messages in thread
From: Marc Joliet @ 2019-11-21 22:39 UTC (permalink / raw)
To: linux-btrfs
Am Donnerstag, 21. November 2019, 22:34:41 CET schrieb Christian Pernegger:
> > > Another interesting test could be to adjust btrbk configuration to:
> > > btrfs_commit_delete = each
> >
> > Will do.
>
> Hm. No freeze, this time (with btrbk set to commit after each delete).
>
> In other news,
> - I seem to be leaking qgroups. There are currently 191 subvolumes
> (most of which are ro snapshots), but 547 "0/*" qgroups. Should
> deleting a subvolume take care of removing its (auto-created) qgroup,
> or does that always have to be done manually (or by setting the
> experimental *_qgroup_destroy options in btrbk.conf)? Any elegant ways
> to remove orphaned qgroups?
> - Timeshift, at :00, triggers this as well, it's just less severe
> (maybe because that's 1 subvolume instead of 3).
>
> Cheers,
> C.
As Qu said, the freezes should only happen on snapshot deletion. Depending on
how you have btrbk configured and how regularly it runs, not every btrbk run
will delete snapshots. Therefore, not every run will cause the system to lock
up.
On a side note, I am also really annoyed by the lockups caused by qgroups. On
my Gentoo systems (which use btrbk) I have it disabled for that reason, but I
left it on on my openSUSE laptop (a Dell XPS 13 9360), which locks up for
about 15-30 minutes while cleaning up snapshots a few times a week (usually
after reboots or after "zypper dup"). Of course, that's with snapshots active
for /home, which I do so that the file system doesn't change out from under
borg while it's running. I'm tentatively considering turning it off there,
too, but I'll experiment with the snapper configuration first.
Greetings
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-21 22:39 ` Marc Joliet
@ 2019-11-22 1:36 ` Chris Murphy
2019-11-22 23:21 ` Marc Joliet
0 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2019-11-22 1:36 UTC (permalink / raw)
To: Marc Joliet; +Cc: Btrfs BTRFS
On Thu, Nov 21, 2019 at 3:39 PM Marc Joliet <marcec@gmx.de> wrote:
> On a side note, I am also really annoyed by the lockups caused by qgroups. On
> my Gentoo systems (which use btrbk) I have it disabled for that reason, but I
> left it on on my openSUSE laptop (a Dell XPS 13 9360), which locks up for
> about 15-30 minutes while cleaning up snapshots a few times a week (usually
> after reboots or after "zypper dup").
15 seconds is not at all acceptable on a desktop system, 15 minutes is
atrocious. When a computer appears to hang for 15 seconds, it is
completely reasonable for ordinary users to conclude that it has totally
faceplanted and will not recover, and to force power off. The
distribution really needs to do something about that kind of negative
user experience.
And by the way, I've recently done some unprivileged compilations of
webkitgtk, with default options that cause n cores +2 to be used,
eating all available RAM and swap, and quickly totally hanging the
system while swap thrashing and basically acting like a fork bomb. I'm
using Btrfs for the rootfs as well as user home for this compile, and
have done hundreds of forced power offs during these events and have
seen exactly zero corruptions or Btrfs complaints. So at least there's
that, however unscientific a sample that is.
--
Chris Murphy
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-22 1:36 ` Chris Murphy
@ 2019-11-22 23:21 ` Marc Joliet
2020-03-08 15:11 ` Marc Joliet
0 siblings, 1 reply; 31+ messages in thread
From: Marc Joliet @ 2019-11-22 23:21 UTC (permalink / raw)
To: linux-btrfs
Am Freitag, 22. November 2019, 02:36:56 CET schrieb Chris Murphy:
> On Thu, Nov 21, 2019 at 3:39 PM Marc Joliet <marcec@gmx.de> wrote:
> > On a side note, I am also really annoyed by the lockups caused by qgroups.
> > On my Gentoo systems (which use btrbk) I have it disabled for that
> > reason, but I left it on on my openSUSE laptop (a Dell XPS 13 9360),
> > which locks up for about 15-30 minutes while cleaning up snapshots a few
> > times a week (usually after reboots or after "zypper dup").
>
> 15 seconds is not at all acceptable on a desktop system, 15 minutes is
> atrocious. A computer that appears to hang for 15 seconds, it is
> completely reasonable for ordinary users to consider has totally
> faceplanted, will not recover, and to force power off. The
> distribution really needs to do something about that kind of negative
> user experience.
Sadly, I can't say if it's better without snapshotting /home, because I hadn't
accumulated many / snapshots at that point in time. It might have gotten
worse even with only / being snapshotted. But like I said, I'll experiment
with configuring snapper before blaming SUSE. I believe the installation even
recommends against snapshotting /home, but hey, I wanted to do it anyway :-) .
But to be precise, it's not locked up continuously during snapshot deletion.
Occasionally I'll be able to operate my desktop for a few seconds, and if I
leave top running in a GUI terminal (in my case konsole), I'll see it updating
(almost) the entire time. My guess (emphasis on *guess*) is that the qgroups
update is holding some lock that is preventing other I/O from finishing, thus
locking up any application that wants to write to disk and isn't doing so
concurrently (maybe Plasma is blocking on fsync() at the time?).
> And by the way, I've recently done some unprivileged compilations of
> webkitgtk, with default options that cause n cores +2 to be used,
> eating all available RAM and swap, and quickly totally hanging the
> system while swap thrashing and basically acting like a fork bomb. I'm
> using Btrfs for the rootfs as well as user home for this compile, and
> have done hundreds of forced power offs during these events and have
> seen exactly zero corruptions or Btrfs complaints. So at least there's
> that, however unscientific a sample that is.
My experience has also been that forced reboots don't cause any damage, even
though I usually only have to do them rarely [0]. I mean, with COW it should
be expected to be safe.
[0] I have two main situations where this happens: The first are RCU stalls
that cause my desktop to get hung up (happens during bootup occasionally,
shortly between the boot loader and the login screen), but also recently
started affecting my home server. The second only affects my home server (a
used small business server), namely a wonky e1000e NIC, which I only recently
learned are sometimes buggy and are known for causing servers to crash. The
workaround is apparently to turn off TSO and GSO, and sometimes also GRO, but
I've been able to get away with only the first two without experiencing any
more crashes thus far. Interestingly enough the RCU stalls happened shortly
after I did that.
Greetings
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-22 23:21 ` Marc Joliet
@ 2020-03-08 15:11 ` Marc Joliet
0 siblings, 0 replies; 31+ messages in thread
From: Marc Joliet @ 2020-03-08 15:11 UTC (permalink / raw)
To: linux-btrfs
Am Samstag, 23. November 2019, 00:21:18 CET schrieben Sie:
> Am Freitag, 22. November 2019, 02:36:56 CET schrieb Chris Murphy:
> > On Thu, Nov 21, 2019 at 3:39 PM Marc Joliet <marcec@gmx.de> wrote:
> > > On a side note, I am also really annoyed by the lockups caused by
> > > qgroups.
> > > On my Gentoo systems (which use btrbk) I have it disabled for that
> > > reason, but I left it on on my openSUSE laptop (a Dell XPS 13 9360),
> > > which locks up for about 15-30 minutes while cleaning up snapshots a few
> > > times a week (usually after reboots or after "zypper dup").
> >
> > 15 seconds is not at all acceptable on a desktop system, 15 minutes is
> > atrocious. A computer that appears to hang for 15 seconds, it is
> > completely reasonable for ordinary users to consider has totally
> > faceplanted, will not recover, and to force power off. The
> > distribution really needs to do something about that kind of negative
> > user experience.
>
> Sadly, I can't say if it's better without snapshotting /home, because I
> hadn't accumulated many / snapshots at that point in time. It might have
> gotten worse even with only / being snapshotted. But like I said, I'll
> experiment with configuring snapper before blaming SUSE. I believe the
> installation even recommends against snapshotting /home, but hey, I wanted
> to do it anyway :-) .
>
> But to be precise, it's not locked up continuously during snapshot deletion.
> Occasionally I'll be able to operate my desktop for a few seconds, and if I
> leave top running in a GUI terminal (in my case konsole), I'll see it
> updating (almost) the entire time. My guess (emphasis on *guess*) is that
> the qgroups update is holding some lock that is preventing other I/O from
> finishing, thus locking up any application that wants to write to disk and
> isn't doing so concurrently (maybe Plasma is blocking on fsync() at the
> time?).
So just to follow up on this, reducing the total number of snapshots and
increasing the time between their creation from hourly to once every six hours
did help a *little* bit. However, about a week ago I decided to try an
experiment and added the "autodefrag" mount option (which I don't usually do
on SSDs), and that helped *massively*. Ever since, snapper-cleanup.service
runs without me noticing at all!
[ What made me try it was that booting the laptop and logging in started
getting really slow and top was showing several btrfs-endio threads hogging
the CPU, *before* snapper-cleanup.service or anything else specific to btrfs
was running (their activity usually coincided with KDE Baloo activity), i.e.,
general I/O was performing badly. ]
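
In case anyone wants to try the same experiment, a small sketch for checking whether autodefrag is already active before remounting. The helper name has_mount_opt is made up; it does a plain membership test on the comma-separated option string, so options that carry a value (subvolid=...) only match when spelled out in full.

```shell
# Succeed if option $2 appears in the comma-separated option list $1.
has_mount_opt() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage (the remount needs root; add autodefrag to fstab to make it stick):
#   opts=$(findmnt -no OPTIONS /)
#   has_mount_opt "$opts" autodefrag || mount -o remount,autodefrag /
```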
Greetings
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-21 20:30 ` Christian Pernegger
2019-11-21 21:34 ` Christian Pernegger
@ 2019-11-21 23:57 ` Oliver Freyermuth
2019-11-22 12:30 ` Christian Pernegger
1 sibling, 1 reply; 31+ messages in thread
From: Oliver Freyermuth @ 2019-11-21 23:57 UTC (permalink / raw)
To: Christian Pernegger, linux-btrfs
Am 21.11.19 um 21:30 schrieb Christian Pernegger:
> Definitely enabled, then. ... ... ... There it is: Timeshift has a
> pre-selected checkbox "enable BTRFS qgroups (recommended)" [translated
> from German].
Since I've never used qgroups myself, I'll only comment on the parts where I can.
However, I would say "(recommended)" is a rather strong label for an option in Timeshift
whose only purpose is to provide an estimate of space consumption.
You can check the known issues on qgroups:
https://btrfs.wiki.kernel.org/index.php/Quota_support#Known_issues
This contains, amongst other things, the observed performance issues and also:
"- After deleting a subvolume, you must manually delete the associated qgroup."
which you observe, too. But it does indeed seem btrbk can help out here:
https://github.com/digint/btrbk/issues/49
Manpages of btrfs-quota and btrfs-qgroup contain quite some warnings about the existence
of these known issues, the status page at:
https://btrfs.wiki.kernel.org/index.php/Status
links them, etc. So I believe the recommendation by Timeshift is somewhat hefty.
Other downstreams (see e.g. https://wiki.debian.org/Btrfs or https://wiki.archlinux.org/index.php/Btrfs#Quota )
explicitly recommend not to use qgroup unless really needed.
Apparently this has also been raised to the developer:
https://github.com/teejee2008/timeshift/issues/127
which has at least led to the addition of the checkmark to allow not enabling qgroup.
> 2) I'm wondering if this couldn't be improved. Considering qgroups are
> only used (in this case) for reporting on allocated space, not
> limiting it, and btrfs free space reporting is notoriously lazy [not
> meant in a bad way, can't think of a better word right now] anyway,
> why does anything need to block at all? Even if I were using quotas, I
> might prefer fuzzy quotas [that can be hit too early/late because
> accounting is catching up] to a temporary standstill, as an option.
You can check e.g. the man page btrfs-quota(8) for a short discussion on why doing quota correctly
with btrfs is not as easy as it may seem.
I'll leave more comments (and how to disable them safely) to those who have experience with qgroups ;-).
>> Another interesting test could be to adjust btrbk configuration to:
>> btrfs_commit_delete = each
>
> Will do.
...
> Hm. No freeze, this time (with btrbk set to commit after each delete).
That might be a red herring if there was just less to delete, as Marc Joliet pointed out.
Still, I think this means we have identified the reason for the freezes you get.
Cheers,
Oliver
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-21 23:57 ` Oliver Freyermuth
@ 2019-11-22 12:30 ` Christian Pernegger
2019-11-22 12:34 ` Qu Wenruo
0 siblings, 1 reply; 31+ messages in thread
From: Christian Pernegger @ 2019-11-22 12:30 UTC (permalink / raw)
To: linux-btrfs
On Fri, 22 Nov 2019 at 00:57, Oliver Freyermuth
<o.freyermuth@googlemail.com> wrote:
> > 2) I'm wondering if this couldn't be improved. [...]
>
> You can check e.g. the man page btrfs-quota(8) for a short discussion on why doing quota correctly
> with btrfs is not as easy as it may seem.
I've read that and I appreciate the difficulties in getting accurate
usage information (or even defining what that means) from a COW
filesystem. IMHO, performance, and the trade-off between performance
and up-to-the-minute accuracy are separate issues.
FWIW, running btrfs quota disable, enable, and rescan got rid of the
orphan qgroups. The full rescan ran for all of 3 seconds and didn't
block.
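For reference, the sequence amounts to roughly the following. A sketch only: the mount point is a placeholder, the BTRFS variable and the function name are inventions of this example (so the commands can be previewed with BTRFS=echo), and -w is used so that the rescan command waits until the rescan has finished.

```shell
#!/bin/sh
# Sketch: turn quotas off and on again, then wait for a full rescan.
# BTRFS is a hypothetical override; set BTRFS=echo for a dry run.
BTRFS=${BTRFS:-btrfs}

# reset_qgroups MOUNTPOINT: stops at the first step that fails.
reset_qgroups() {
    mnt=$1
    $BTRFS quota disable "$mnt" &&
    $BTRFS quota enable "$mnt" &&
    $BTRFS quota rescan -w "$mnt"
}
```

Running `BTRFS=echo` followed by `reset_qgroups /mnt/data` prints the three commands instead of executing them; without the override it needs root.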
Cheers,
C.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-22 12:30 ` Christian Pernegger
@ 2019-11-22 12:34 ` Qu Wenruo
2019-11-22 14:43 ` Christian Pernegger
0 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2019-11-22 12:34 UTC (permalink / raw)
To: Christian Pernegger, linux-btrfs
On 2019/11/22 8:30 PM, Christian Pernegger wrote:
> On Fri, 22 Nov 2019 at 00:57, Oliver Freyermuth
> <o.freyermuth@googlemail.com> wrote:
>>> 2) I'm wondering if this couldn't be improved. [...]
>>
>> You can check e.g. the man page btrfs-quota(8) for a short discussion on why doing quota correctly
>> with btrfs is not as easy as it may seem.
>
> I've read that and I appreciate the difficulties in getting accurate
> usage information (or even defining what that means) from a COW
> filesystem. IMHO, performance, and the trade-off between performance
> and up-to-the-minute accuracy are separate issues.
>
> FWIW, running btrfs quota disable, enable, and rescan got rid of the
> orphan qgroups. The full rescan ran for all of 3 seconds and didn't
> block.
BTW, we already have a pending patch that automatically deletes empty
qgroups.
It just hasn't been merged yet.
https://patchwork.kernel.org/patch/11195067/
Still, snapshot deletion itself has a performance impact.
(For a completely independent subvolume, IIRC there is a fast path,
so there is no performance penalty in that case.)
Thanks,
Qu
>
> Cheers,
> C.
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-22 12:34 ` Qu Wenruo
@ 2019-11-22 14:43 ` Christian Pernegger
2019-11-24 0:38 ` Qu Wenruo
0 siblings, 1 reply; 31+ messages in thread
From: Christian Pernegger @ 2019-11-22 14:43 UTC (permalink / raw)
To: linux-btrfs
On Fri, 22 Nov 2019 at 13:34, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> But still, for snapshot deletion part, there is still a performance impact.
Ok. It's just that I'd have expected *slower* write and read
performance until everything's settled, maybe sync writes taking
noticeably longer than usual, not that all user input blocks across
the whole system regardless of fs activity.
Cheers,
C.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-22 14:43 ` Christian Pernegger
@ 2019-11-24 0:38 ` Qu Wenruo
2019-11-24 19:09 ` Christian Pernegger
0 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2019-11-24 0:38 UTC (permalink / raw)
To: Christian Pernegger, linux-btrfs
On 2019/11/22 10:43 PM, Christian Pernegger wrote:
> On Fri, 22 Nov 2019 at 13:34, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> But still, for snapshot deletion part, there is still a performance impact.
>
> Ok. It's just that I'd have expected *slower* write and read
> performance until everything's settled, maybe sync writes taking
> noticeably longer than usual, not that all user input blocks across
> the whole system regardless of fs activity.
The slowdown happens during transaction commit, and while a commit is in
progress, a lot of operations are blocked until the current transaction is
committed.
That's why it blocks everything.
We have tried our best to reduce the impact, but deletion is still a big
problem, as it can cause tons of extents to change their owner, which
causes the slowdown.
In short, unless you really need to know how many bytes each snapshot
really takes, disable qgroups.
And BTW, for "many" subvolumes/snapshots, I guess we mean 20.
200 is already prone to causing problems, not only with qgroups but also with send.
So it's also recommended to reduce the number of snapshots.
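To keep an eye on that number, the snapshot count can be checked before each backup run. The following is a rough sketch, assuming `btrfs subvolume list -s <path>` lists only snapshots, one "ID ..." line each; the function names and the way the limit is passed in are made up for this example.

```shell
#!/bin/sh
# Sketch: count snapshots and warn when there are more than a given limit.
# Input is the output of `btrfs subvolume list -s <mountpoint>` on stdin.

# snapshot_count: print the number of "ID ..." lines read from stdin.
snapshot_count() {
    awk '$1 == "ID" { n++ } END { print n + 0 }'
}

# check_snapshots LIMIT: read the listing on stdin; print a warning and
# return 1 when the count exceeds LIMIT, otherwise print an ok line.
check_snapshots() {
    limit=$1
    n=$(snapshot_count)
    if [ "$n" -gt "$limit" ]; then
        echo "warning: $n snapshots (limit $limit)"
        return 1
    fi
    echo "ok: $n snapshots"
}
```

For example, `btrfs subvolume list -s /mnt/data | check_snapshots 20` uses the figure of 20 mentioned above; pick whatever limit fits your setup.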
Thanks,
Qu
>
> Cheers,
> C.
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-24 0:38 ` Qu Wenruo
@ 2019-11-24 19:09 ` Christian Pernegger
2019-11-25 1:22 ` Qu Wenruo
0 siblings, 1 reply; 31+ messages in thread
From: Christian Pernegger @ 2019-11-24 19:09 UTC (permalink / raw)
To: linux-btrfs
On Sun, 24 Nov 2019 at 01:38, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> In short, unless you really need to know how many bytes each snapshots
> really takes, then disable qgroup.
>
> And BTW, for "many" subvolumes/snapshots, I guess we mean 20.
> 200 is already prone to cause problem, not only qgroups, but also send.
>
> So it's also recommended to reduce the number of snapshots.
I've disabled qgroups for now; we'll see how that goes. These are
personal desktops, so qgroups would have been nice to have, that's all.
Sadly, that probably means qgroups won't work yet on any storage setup
complex enough for them to be really useful, either.
If btrfs scales so badly with the number of subvolumes that having >20
at a time should be avoided, doesn't that kill a lot of interesting
use-cases? My "time machine" desktop setup, certainly, but anything
with a couple of users or VMs would chew through that 20 pretty
quickly, even before snapshots. Which leaves the LVM use-case
(snapshot, backup the snapshot, delete the snapshot).
> The slowdown happens in commit transaction, and with commit transaction,
> a lot of operation is blocked until current transaction is committed.
>
> That's why it blocks everything.
>
> We had tried our best to reduce the impact, but deletion is still a big
> problem, as it can cause tons of extents to change their owner, thus
> cause the problem.
Sure, but why does it *have to* block? Couldn't the intent to delete
the subvolume be committed, the metadata changes / actual deletion
happen at leisure? Yes, if qgroups are on, then the qgroup info will
be behind, but so what? At least I think that lax/lazy qgroups would
be a nice option to have.
Also, I still don't get why disabling qgroups, reenabling them and
doing a full rescan is lightning fast (and non-blocking), while just
leaving them on results in the observed behaviour.
Cheers,
C.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-24 19:09 ` Christian Pernegger
@ 2019-11-25 1:22 ` Qu Wenruo
0 siblings, 0 replies; 31+ messages in thread
From: Qu Wenruo @ 2019-11-25 1:22 UTC (permalink / raw)
To: Christian Pernegger, linux-btrfs
On 2019/11/25 3:09 AM, Christian Pernegger wrote:
> On Sun, 24 Nov 2019 at 01:38, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> In short, unless you really need to know how many bytes each snapshots
>> really takes, then disable qgroup.
>>
>> And BTW, for "many" subvolumes/snapshots, I guess we mean 20.
>> 200 is already prone to cause problem, not only qgroups, but also send.
>>
>> So it's also recommended to reduce the number of snapshots.
>
> I've disabled qgroups for now, we'll see how that goes. These are
> personal desktops, they would have been nice to have, that's all.
> Sadly that means that they probably won't work on any storage setup
> complex enough for them to be really useful, either, yet.
> If btrfs scales so badly with the number of subvolumes that having >20
> at a time should be avoided, doesn't that kill a lot of interesting
> use-cases? My "time machine" desktop setup, certainly, but anything
> with a couple of users or VMs would chew through that 20 pretty
> quickly, even before snapshots. Which leaves the LVM use-case
> (snapshot, backup the snapshot, delete the snapshot).
BTW, that number of 20 means 20 snapshots (which all share some tree
blocks).
If it's 20 subvolumes (with no shared tree/data between them), then it counts
as 1.
The main time-consuming part is the shared tree/data check, as btrfs
records the sharing on disk in an indirect way, forcing us to do a complex
walk-back.
Thankfully, we have a plan to improve it.
>
>> The slowdown happens in commit transaction, and with commit transaction,
>> a lot of operation is blocked until current transaction is committed.
>>
>> That's why it blocks everything.
>>
>> We had tried our best to reduce the impact, but deletion is still a big
>> problem, as it can cause tons of extents to change their owner, thus
>> cause the problem.
>
> Sure, but why does it *have to* block? Couldn't the intent to delete
> the subvolume be committed, the metadata changes / actual deletion
> happen at leisure?
Unfortunately, it's not that easy.
We have already delayed a lot of metadata operations, and transaction
commit is the only time we get a consistent view of the metadata.
That's why it has to happen in that critical section.
> Yes, if qgroups are on, then the qgroup info will
> be behind, but so what?
It's already behind.
> At least I think that lax/lazy qgroups would
> be a nice option to have.
Qgroup accounting is bound to the delayed extent tree updates.
The extent tree update is already delayed until transaction commit time;
if it were delayed further, the consistency of the fs would be broken.
The plan to solve this is to introduce a global cache for backref walks,
which would benefit not only qgroups but also send with reflinks.
There will be some new challenges, though, so we will see whether the
cache is worthwhile.
Thanks,
Qu
> Also, I still don't get why disabling qgroups, reenabling them and
> doing a full rescan is lightning fast (and non-blocking), while just
> leaving them on results in the observed behaviour.
>
> Cheers,
> C.
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-20 16:36 ` freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?) Christian Pernegger
` (2 preceding siblings ...)
2019-11-21 1:51 ` Qu Wenruo
@ 2019-11-21 22:22 ` Zygo Blaxell
2019-11-22 4:59 ` Zygo Blaxell
2019-11-22 14:36 ` Christian Pernegger
3 siblings, 2 replies; 31+ messages in thread
From: Zygo Blaxell @ 2019-11-21 22:22 UTC (permalink / raw)
To: Christian Pernegger; +Cc: linux-btrfs
On Wed, Nov 20, 2019 at 05:36:04PM +0100, Christian Pernegger wrote:
> Hello,
>
> I've decided to go with a snapshot-based backup solution for our new
> Linux desktops -- thank you for the timely thread --, namely btrbk.
> A couple of subvolumes for different stuff, with hourly snapshots that
> regularly go to another machine. Brilliant in theory, less so in
> practice, because every time btrbk runs, the box'll freeze for a few
> seconds, as in, Firefox and LibreOffice, for instance, become entirely
> unresponsive, games hang and so on. (AFAICT, all it does is snapshot
> each subvolume and delete ones that are out of the retention period.)
Snapshot delete is pretty aggressive with IO and can force a lot of
commits if you are modifying a lot of metadata pages between snapshots.
Generally I get a coffee when my 1TB NVME systems decide it's time to
drop a snapshot, as the system can effectively hang for a few minutes
while btrfs-cleaner runs. On performance-critical systems we only ever
have one snapshot active on the filesystem at a time, and we only create
it once a day for backups. I'd love a way to throttle btrfs-cleaner so
it's not so aggressive with IO and CPU.
Snapshot create has unbounded running time on 5.0 kernels. The creation
process has to flush dirty buffers to the filesystem to get a clean
snapshot state. Any process that is writing data while the flush is
running gets its data included in the snapshot flush, so in the worst
possible case, the snapshot flush never ends (unless you run out of disk
space, or whatever was writing new data stops, whichever comes first).
Anything that needs to take a sb_writer lock (which is almost everything
that modifies the filesystem) will hang until the snapshot create is done;
however, processes that are reading the filesystem will not be obstructed.
This can lead to starvation of the writing processes. cgroups and ionice
won't help here--the block layer doesn't detect waits for sb_writers
(there is no associated block device for those, so they're invisible to
the block layer), so it doesn't know that writer processes are waiting
for IO, and all the writers' IO bandwidth gets reallocated to the reader
processes, making for long-lasting priority inversions. The IO pressure
stall subsystem reads _zero_ IO pressure even though writing processes
are continuously blocked for hours.
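For anyone who wants to check this on their own machine, the pressure figures can be read from /proc/pressure/io on kernels built with PSI support. A minimal sketch, assuming the usual "some avg10=... avg60=... avg300=... total=..." line format; the function name is made up for this example.

```shell
#!/bin/sh
# Sketch: print the 10-second "some" IO pressure average.
# The optional FILE argument exists only so that the function can be
# tried against a saved copy; by default it reads /proc/pressure/io.
io_some_avg10() {
    awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }' \
        "${1:-/proc/pressure/io}"
}
```

Calling `io_some_avg10` in a loop while a snapshot create is running would show whether the reported pressure really stays near zero while writers are stuck.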
On small systems, this is all over in a second or less. On bigger
fileservers, I've had single snapshot creates run for many hours. As a
workaround, I have some scripts that freeze processes that write to the
disk while 'btrfs sub snapshot' runs, to force the snapshot create to finish
in a timely manner. I think I saw some patches going into later 5.x
kernels that solve the problem in the kernel, too (writes that occur after
the snapshot creation starts are not included in the snapshot any more).
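The general shape of such a workaround could look like the following. This is not the actual script, only a sketch of the idea using SIGSTOP/SIGCONT; the function name, the way PIDs are passed, and the paths in the usage note are all hypothetical.

```shell
#!/bin/sh
# Sketch: pause a set of writer processes while a command runs, then
# resume them, whether or not the command succeeded.

# with_frozen "PID PID ..." COMMAND [ARGS...]
with_frozen() {
    pids=$1
    shift
    # $pids is deliberately unquoted so that each PID becomes one argument.
    kill -STOP $pids || return 1
    status=0
    "$@" || status=$?
    kill -CONT $pids
    return "$status"
}
```

A call might look like `with_frozen "$(pgrep -d' ' -x rsync)" btrfs subvolume snapshot -r /data /data/.snapshots/today`, where rsync stands in for whatever is writing. Note that a stopped process stays stopped if the script itself is killed in between, so a real version would want a trap that sends SIGCONT.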
> I'm aware that having many snapshots can impact performance of some
> operations, but I didn't think that "many" <= 200, "impact" = stop
> dead and "some operations" = light desktop use. These are decently
> specced, after all (Zen 2 8/12 core, 32 GB RAM, Samsung 970 Evo Plus).
> What I'm asking is, is this to be expected, does it just need tuning,
> is the hardware buggy, the kernel version (Ubuntu 18.04.3 HWE, their
> 5.0 series) a stinker, something else awry ...?
>
> Cheers,
> C.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-21 22:22 ` Zygo Blaxell
@ 2019-11-22 4:59 ` Zygo Blaxell
2019-11-22 14:36 ` Christian Pernegger
1 sibling, 0 replies; 31+ messages in thread
From: Zygo Blaxell @ 2019-11-22 4:59 UTC (permalink / raw)
To: linux-btrfs
On Thu, Nov 21, 2019 at 05:22:28PM -0500, Zygo Blaxell wrote:
> On Wed, Nov 20, 2019 at 05:36:04PM +0100, Christian Pernegger wrote:
> > Hello,
> >
> > I've decided to go with a snapshot-based backup solution for our new
> > Linux desktops -- thank you for the timely thread --, namely btrbk.
> > A couple of subvolumes for different stuff, with hourly snapshots that
> > regularly go to another machine. Brilliant in theory, less so in
> > practice, because every time btrbk runs, the box'll freeze for a few
> > seconds, as in, Firefox and LibreOffice, for instance, become entirely
> > unresponsive, games hang and so on. (AFAICT, all it does is snapshot
> > each subvolume and delete ones that are out of the retention period.)
>
> Snapshot delete is pretty aggressive with IO and can force a lot of
> commits if you are modifying a lot of metadata pages between snapshots.
> Generally I get a coffee when my 1TB NVME systems decide it's time to
> drop a snapshot, as the system can effectively hang for a few minutes
> while btrfs-cleaner runs. On performance-critical systems we only ever
> have one snapshot active on the filesystem at a time, and we only create
> it once a day for backups. I'd love a way to throttle btrfs-cleaner so
> it's not so aggressive with IO and CPU.
>
> Snapshot create has unbounded running time on 5.0 kernels. The creation
> process has to flush dirty buffers to the filesystem to get a clean
> snapshot state. Any process that is writing data while the flush is
> running gets its data included in the snapshot flush, so in the worst
> possible case, the snapshot flush never ends (unless you run out of disk
> space, or whatever was writing new data stops, whichever comes first).
>
> Anything that needs to take a sb_writer lock (which is almost everything
> that modifies the filesystem) will hang until the snapshot create is done;
> however, processes that are reading the filesystem will not be obstructed.
> This can lead to starvation of the writing processes. cgroups and ionice
> won't help here--the block layer doesn't detect waits for sb_writers
> (there is no associated block device for those, so they're invisible to
> the block layer), so it doesn't know that writer processes are waiting
> for IO, and all the writers' IO bandwidth gets reallocated to the reader
> processes, making for long-lasting priority inversions. The IO pressure
> stall subsystem reads _zero_ IO pressure even though writing processes
> are continuously blocked for hours.
>
> On small systems, this is all over in a second or less. On bigger
> fileservers, I've had single snapshot creates run for many hours. As a
> workaround, I have some scripts that freeze processes that write to the
> disk while 'btrfs sub snapshot' runs, to force the snapshot create to finish
> in a timely manner. I think I saw some patches going into later 5.x
> kernels that solve the problem in the kernel, too (writes that occur after
> the snapshot creation starts are not included in the snapshot any more).
Nope, the patch I'm thinking of is dated Nov 1 *2018* and is already in
5.0. So either that fix is ineffective, or the slow snapshots are caused
by something else.
> > I'm aware that having many snapshots can impact performance of some
> > operations, but I didn't think that "many" <= 200, "impact" = stop
> > dead and "some operations" = light desktop use. These are decently
> > specced, after all (Zen 2 8/12 core, 32 GB RAM, Samsung 970 Evo Plus).
> > What I'm asking is, is this to be expected, does it just need tuning,
> > is the hardware buggy, the kernel version (Ubuntu 18.04.3 HWE, their
> > 5.0 series) a stinker, something else awry ...?
> >
> > Cheers,
> > C.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-21 22:22 ` Zygo Blaxell
2019-11-22 4:59 ` Zygo Blaxell
@ 2019-11-22 14:36 ` Christian Pernegger
2019-11-23 3:49 ` Zygo Blaxell
1 sibling, 1 reply; 31+ messages in thread
From: Christian Pernegger @ 2019-11-22 14:36 UTC (permalink / raw)
To: linux-btrfs
On Thu, 21 Nov 2019 at 23:22, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> Snapshot delete is pretty aggressive with IO [...] can effectively hang for a few minutes
> while btrfs-cleaner runs.
It doesn't look like it's btrfs-cleaner that blocks here, though,
more like it's btrfs-transacti.
> Snapshot create has unbounded running time on 5.0 kernels.
It looks to me like delete, not create, is the culprit here.
> Anything that needs to take a sb_writer lock (which is almost everything
> that modifies the filesystem) will hang until the snapshot create is done;
It's not just fs activity, either. Even if I'm just typing in
LibreOffice or at a bash prompt, the input isn't registered during the
freeze (it's buffered, so it comes out all at once in the end).
Cheers,
C.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: freezes during snapshot creation/deletion -- to be expected? (Was: Re: btrfs based backup?)
2019-11-22 14:36 ` Christian Pernegger
@ 2019-11-23 3:49 ` Zygo Blaxell
0 siblings, 0 replies; 31+ messages in thread
From: Zygo Blaxell @ 2019-11-23 3:49 UTC (permalink / raw)
To: Christian Pernegger; +Cc: linux-btrfs
On Fri, Nov 22, 2019 at 03:36:43PM +0100, Christian Pernegger wrote:
> On Thu, 21 Nov 2019 at 23:22, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> > Snapshot delete is pretty aggressive with IO [...] can effectively hang for a few minutes
> > while btrfs-cleaner runs.
>
> It doesn't look like it's btrfs-cleaner that blocks here, though,
> more like it's btrfs-transacti.
It's hard to tell. btrfs-transaction does a lot of work for other threads.
If you have kernel stacks enabled,

    watch -n.1 cat /proc/<pid of btrfs-cleaner>/stack

will show you what btrfs-cleaner is up to. If it's something like
'wait_for_commit' then btrfs-cleaner dumped a bunch of work on
btrfs-transaction, and now btrfs-transaction is trying to catch up.
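If watching by eye is too fiddly, the same check can be scripted. The following is a sketch under assumptions: the function names and the PROC override are made up for this example, reading /proc/<pid>/stack needs root, and the match is simply on the string 'wait_for_commit' mentioned above.

```shell
#!/bin/sh
# Sketch: sample a kernel thread's stack and count how often it is
# sitting in a commit wait.
# PROC is a hypothetical override, useful for trying this on saved files.
PROC=${PROC:-/proc}

# stack_of PID: print the current kernel stack of PID.
stack_of() {
    cat "$PROC/$1/stack"
}

# waiting_on_commit PID: succeed if any frame mentions wait_for_commit.
waiting_on_commit() {
    stack_of "$1" | grep -q 'wait_for_commit'
}

# count_commit_waits PID SAMPLES [INTERVAL]: take SAMPLES stack samples,
# INTERVAL seconds apart (default 1), and print "hits/samples".
count_commit_waits() {
    pid=$1
    samples=$2
    interval=${3:-1}
    hits=0
    i=0
    while [ "$i" -lt "$samples" ]; do
        if waiting_on_commit "$pid"; then
            hits=$((hits + 1))
        fi
        i=$((i + 1))
        [ "$i" -lt "$samples" ] && sleep "$interval"
    done
    echo "$hits/$samples"
}
```

Something like `count_commit_waits "$(pgrep -x btrfs-cleaner)" 50` during a freeze gives a rough idea of how much of the time the cleaner is waiting on a commit; with several btrfs filesystems mounted, pgrep returns more than one PID and you'd have to pick one.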
> > Snapshot create has unbounded running time on 5.0 kernels.
>
> It looks to me like delete, not create, is the culprit here.
>
> > Anything that needs to take a sb_writer lock (which is almost everything
> > that modifies the filesystem) will hang until the snapshot create is done;
>
> It's not just fs activity, either. Even if I'm just typing in
> LibreOffice or at a bash prompt, the input isn't registered during the
> freeze (it's buffered, so it comes out all at once in the end).
IO pressure, especially blocked writes, can delay memory allocations
on Linux. That stops almost everything dead in a modern GUI.
If you can log into the box from another machine you might be able to
watch what it's doing with 'top' etc.
On the other hand, from the other messages in this thread, it sounds like
you're using qgroups, which multiplies everything I said above by 1000.
qgroups is all in-kernel CPU, too, so userspace can't preempt it.
> Cheers,
> C.
^ permalink raw reply [flat|nested] 31+ messages in thread