* Re: Snapshots slowing system
From: pete @ 2016-03-14 23:03 UTC (permalink / raw)
To: linux-btrfs
>pete posted on Sat, 12 Mar 2016 13:01:17 +0000 as excerpted:
>> I hope this message stays within the thread on the list. I had email
>> problems and ended up hacking around with sendmail & grabbing the
>> message id off of the web based group archives.
>Looks like it should have as the reply-to looks right, but at least on
>gmane's news/nntp archive of the list (which is how I read and reply), it
>didn't. But the thread was found easily enough.
Found out what had happened. I think I had a quota-full issue at my hosting
provider; I suspect bounce messages caused majordomo to unsubscribe me, the
very week I asked a question.
Thanks for the huge response, and thanks also to Boris.
>>>>I wondered whether you had eliminated fragmentation, or any other known
>>>>gotchas, as a cause?
>>
>> Subvolumes are mounted with the following options:
>> autodefrag,relatime,compress=lzo,subvol=<subvolume name>
>That relatime (which is the default), could be an issue. See below.
I've now changed that to noatime. I think I read, or misread, that relatime
was a good compromise sometime in the past.
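For reference, the root entry in fstab now looks something like this
(device and subvolume names illustrative, not my exact entry):

/dev/sda3  /  btrfs  autodefrag,noatime,compress=lzo,subvol=rootvol  0 0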
>> Not sure if there is much else to do about fragmentation apart from
>> running a balance, which would probably make the machine very sluggish
>> for a day or so.
>>
>>>>Out of curiosity, what is/was the utilisation of the disk? Were the
>>>>snapshots read-only or read-write?
>>
>> root@phoenix:~# btrfs fi df /
>> Data, single: total=101.03GiB, used=97.91GiB
>> System, single: total=32.00MiB, used=16.00KiB
>> Metadata, single: total=8.00GiB, used=5.29GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> root@phoenix:~# btrfs fi df /home
>> Data, RAID1: total=1.99TiB, used=1.97TiB
>> System, RAID1: total=32.00MiB, used=352.00KiB
>> Metadata, RAID1: total=53.00GiB, used=50.22GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>Normally when posting, either btrfs fi df *and* btrfs fi show are
>needed, /or/ (with a new enough btrfs-progs) btrfs fi usage. And of
>course the kernel (4.0.4 in your case) and btrfs-progs (not posted, that
>I saw) versions.
OK, I have usage. For the SSD with the system:
root@phoenix:~# btrfs fi usage /
Overall:
Device size: 118.05GiB
Device allocated: 110.06GiB
Device unallocated: 7.99GiB
Used: 103.46GiB
Free (estimated): 11.85GiB (min: 11.85GiB)
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 512.00MiB (used: 0.00B)
Data,single: Size:102.03GiB, Used:98.16GiB
/dev/sda3 102.03GiB
Metadata,single: Size:8.00GiB, Used:5.30GiB
/dev/sda3 8.00GiB
System,single: Size:32.00MiB, Used:16.00KiB
/dev/sda3 32.00MiB
Unallocated:
/dev/sda3 7.99GiB
Hmm. A bit tight. I've just ordered a replacement SSD. Slackware
should fit in about 5GB of disk space according to a website I've seen?
Hmm. I don't believe that. I'd allow at least 10GB, and more if I want
to add extra packages such as libreoffice. If I have no snapshots it
seems to get to 45GB with various extra packages installed, and grows to
100ish with snapshotting, probably owing to updates.
Anyway, I took the lazy, less hair-tearing route and ordered a 500GB
drive. Prices have dropped and fortunately a new drive is not a major
issue. Timing is also good with Slack 14.2 imminent. You rarely hear
people complaining about disk-too-empty problems...
For the traditional hard drives with the data:
root@phoenix:~# btrfs fi usage /home
Overall:
Device size: 5.46TiB
Device allocated: 4.09TiB
Device unallocated: 1.37TiB
Used: 4.04TiB
Free (estimated): 720.58GiB (min: 720.58GiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Data,RAID1: Size:1.99TiB, Used:1.97TiB
/dev/sdb 1.99TiB
/dev/sdc 1.99TiB
Metadata,RAID1: Size:53.00GiB, Used:49.65GiB
/dev/sdb 53.00GiB
/dev/sdc 53.00GiB
System,RAID1: Size:32.00MiB, Used:352.00KiB
/dev/sdb 32.00MiB
/dev/sdc 32.00MiB
Unallocated:
/dev/sdb 699.49GiB
/dev/sdc 699.49GiB
root@phoenix:~#
>> Hmm. The system disk is getting a little tight. cfdisk reports the
>> partition I use for btrfs containing root as 127GB approx. Not sure why
>> it grows so much. Suspect that software updates can't help as snapshots
>> will contain the legacy versions. On the other hand they can be useful.
>With the 127 GiB (I _guess_ it's GiB, 1024, not GB, 1000, multiplier,
>btrfs consistently uses the 1024 multiplier and properly specifies it
>using the XiB notation) for /, however, and the btrfs fi df sizes of 101
>GiB plus data and 8 GiB metadata (with system's 32 MiB a rounding error
>and global reserve actually taken from metadata, so it doesn't add to
>chunk reservation on its own) we can see that as you mention, it's
>starting to get tight, a bit under 110 GiB of 127 GiB, but that 17 GiB
>free isn't horrible, just slightly tight, as you said.
>Tho it'll obviously be tighter if that's 127 GB, 1000 multiplier...
Note that the system btrfs does not get 127GB, it gets /dev/sda3, not
far off, but I've a 209MB partition for /boot and a 1G partition for a
very cut down system for maintenance purposes (both ext4). On the
new drive I'll keep the 'maintenance' ext4 install but I could use
/boot from that filesystem using bind mounts, a bit cleaner.
>It's tight enough that particularly with the regular snapshotting, btrfs
>might be having to fragment more than it'd like. Tho kudos for the
>_excellent_ snapshot rotation. We regularly see folks in here with 100K
>or more snapshots per filesystem, and btrfs _does_ have scaling issues in
>that case. But your rotation seems to be keeping it well below the 1-3K
>snapshots per filesystem recommended max, so that's obviously NOT your
>problem, unless of course the snapshot deletion bugged out and they
>aren't being deleted as they should.
Yay, I've done it right at least somewhere... I was assuming that
recommendation was for server hardware, so I thought it best to keep it
tighter on my more modest desktop.
They are deleting. The new ones are also read only now.
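Checked with something along these lines (mount point illustrative,
since the subvolume containing them is only mounted on demand):

mount -o subvolid=5 /dev/sda3 /mnt/top
btrfs subvolume list -s /mnt/top | wc -l    # -s lists only snapshots
umount /mnt/top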
>(Of course, you can check that by listing them, and I would indeed double-
>check, as that _is_ the _usual_ problem we have with snapshots slowing
>things down, simply too many of them, hitting the known scaling issues
>btrfs had with over 10K snapshots per filesystem. But FWIW I don't use
>snapshots here and thus don't deal with snapshots command-level detail.)
Rarely use them except when I either delete the wrong file or do something
very sneaky but dumb, like inadvertently setting umask for root, installing
a package, and breaking _lots_ of file system permissions. Easier to
recover from a good snapshot than to try to fix that mess...
>But as I mentioned above, that relatime mount option isn't your best
>choice, in the presence of heavy snapshotting. Unless you KNOW you need
>atimes for something or other, noatime is _strongly_ recommended with
>snapshotting, because relatime, while /relatively/ better than
>strictatime, still updates atimes once a day for files you're accessing
>at least that frequently.
Now noatime.
>And that interacts badly with snapshots, particularly where few of the
>files themselves have changed, because in that case, a large share of the
>changes from one snapshot to another are going to be those atime updates
>themselves. Ensuring that you're always using noatime avoids the atime
>updates entirely (well, unless the file itself changes and thus mtime
>changes as well), which should, in the normal most files unchanged
>snapshotting context, make for much smaller snapshot-exclusive sizes.
>And you mention below that the snapshots are read-write, but generally
>used as read-only. Does that include actually mounting them read-only?
>Because if not, and if they too are mounted the default relatime,
>accessing them is obviously going to be updating atimes the relatime-
>default once per day there as well... triggering further divergence of
>snapshots from the subvolumes they are snapshots of and from each other...
Actually they are normally not mounted. Only mount them, or rather the
default subvolume that contains them, on an as needed basis. The script
that does the snapshotting mounts and then unmounts.
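The script is essentially just the following sketch (paths and the
retention count are illustrative, not my exact script):

#!/bin/sh
# Mount the top-level subvolume, snapshot, prune, unmount again.
mount -o subvolid=5 /dev/sda3 /mnt/top
btrfs subvolume snapshot -r /mnt/top/rootvol \
    /mnt/top/snaps/root-$(date +%Y%m%d-%H%M)
# Date-stamped names sort chronologically; keep only the newest 30.
ls -d /mnt/top/snaps/root-* | head -n -30 | \
    while read s; do btrfs subvolume delete "$s"; done
umount /mnt/top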
>> Is it likely the SSD? If likely I could get a larger one, now is a good
>> time with a new version of slackware imminent. However, no point in
>> spending money for the sake of it.
>Not directly btrfs related, but when you do buy a new ssd, now or later,
>keep in mind that a lot of authorities recommend that for ssds you buy
>10-33% larger than you plan on actually provisioning, and that you leave
>that extra space entirely unprovisioned -- either leave that extra space
>entirely unpartitioned, or partition it, but don't put filesystems or
>anything else (swap, etc) on it. This leaves those erase-blocks free to
>be used by the FTL for additional wear-leveling block-swap, thus helping
>maintain device speed as it ages, and with good wear-leveling firmware,
>should dramatically increase device usable lifetime, as well.
Well, I went OTT and ordered a 500GB drive. So if I put, say, 20GB as my
'maintenance' partition, then the rest minus 100-150GB as btrfs, and keep
the rest unallocated, that should work well?
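I.e. something like this on the new drive, if I go the GPT route
(sgdisk syntax, device name and sizes illustrative):

sgdisk -n1:0:+1G   -t1:8300 /dev/sdX   # /boot (ext4)
sgdisk -n2:0:+20G  -t2:8300 /dev/sdX   # maintenance install (ext4)
sgdisk -n3:0:+330G -t3:8300 /dev/sdX   # btrfs root
# remaining ~115GB deliberately left unpartitioned for the FTL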
>FWIW, I ended up going rather overboard with that here, as I knew I
<snip>
So have I. The price seems almost linear per gigabyte, perhaps?
I suspected it was better to go larger if I could and delay the time
until the new disk fills up. I could put the old disk in the laptop for
experimentation with distros.
>>>>Apropos Nada: quick shout out to Qu to wish him luck for the 4.6 merge.
>>
>> I'm wondering if it is time for an update from 4.0.4?
>The going list recommendation is to choose either current kernel track or
>LTS kernel track. If you choose current kernel, the recommendation is to
>stick within 1-2 kernel cycles of newest current, which with 4.5 about to
>come out, means you would be on 4.3 at the oldest, and be looking at 4.4
>by now, again, on the current kernel track.
4.5 is out. Maybe I ought to await 4.5.1 or .2 for any initial bugs to
shake out.
>If you choose LTS kernels, until recently, the recommendation was again
>the latest two, but here LTS kernel cycles. That would be 4.4 as the
>newest LTS and 4.1 previous to that. However, 3.18, the LTS kernel
>previous to 4.1, has been holding up reasonably well, so while 4.1 would
>be preferred, 3.18 remains reasonably well supported as well.
Can't see the advantage to me of an LTS kernel. In the past I've gone
for the latest and then updated the kernel with the new latest kernel.
Distro maintainers might want LTS kernels, but I'm not going to go from
say 4.1.10 to 4.1.19 when I can go to 4.5.
OK, I googled for a bit. Upgrading within an LTS branch fixes bugs but
reduces the chances of breakage due to new functionality.
>You're on 4.0, which isn't an LTS kernel series and is thus, along with
>4.2, out of upstream's support window. So it's past time to look at
>updating. =:^) Given that you obviously do _not_ follow the last couple
Whilst everything worked fine and there were no security horrors, there
was no need to update.
Kind regards,
Pete
* Re: Snapshots slowing system
From: Duncan @ 2016-03-15 15:52 UTC (permalink / raw)
To: linux-btrfs
pete posted on Mon, 14 Mar 2016 23:03:52 +0000 as excerpted:
> [Duncan wrote...]
>>pete posted on Sat, 12 Mar 2016 13:01:17 +0000 as excerpted:
>>>
>>> Subvolumes are mounted with the following options:
>>> autodefrag,relatime,compress=lzo,subvol=<subvolume name>
>
>>That relatime (which is the default), could be an issue. See below.
>
> I've now changed that to noatime. I think I read, or misread, that
> relatime was a good compromise sometime in the past.
Well, "good" is relative (ha! much like relatime itself! =:^).
Relatime is certainly better than strictatime as it cuts down on atime
updates quite a bit, and as a default it's a reasonable compromise (at
least for most filesystems), because it /does/ do a pretty good job of
eliminating /most/ atime updates while still doing the minimal amount to
avoid breaking all known apps that still rely on what is mostly a legacy
POSIX feature that very little modern software actually relies on any
more.
For normal filesystems and normal use-cases, relatime really is a
reasonably "good" compromise. But btrfs is definitely not a traditional
filesystem, relying as it does on COW, and snapshotting is even more
definitely not a traditional filesystem feature. Relatime does still
work, but it's just not particularly suitable to frequent snapshotting.
Meanwhile, so little actually depends on atime these days, that unless
you're trying to work out a compromise solution for a kernel with a
standing rule that breaking working userspace is simply not acceptable,
the context in which relatime was developed and for which it really is a
good compromise, chances are pretty high that unless you are running
something like mutt that is /known/ to need atime, you can simply set
noatime and forget about it.
And I'm sure, were the kernel rules on avoiding breaking old but
otherwise still working userspace somewhat less strict, noatime would be
the kernel default now, as well.
Meanwhile, FWIW, some months ago I finally got tired of having to specify
noatime on all my mounts, expanding my fstab width by 8 chars (including
the ,) and the total fstab character count by several multiples of that
as I added it to all entries, and decided to see if I might perchance,
even as a sysadmin not a dev, be able to come up with a patch that
changed the kernel default to noatime. It wasn't actually hard, tho were
I a coder and actually knew what I was doing, I imagine I could create a
much better patch. So now all my filesystems (barring a few of the
memory-only virtual-filesystem mounts) are mounted noatime by default, as
opposed to the unpatched relatime, and I was able to take all the noatimes
out of my fstab. =:^)
>>Normally when posting, either btrfs fi df *and* btrfs fi show are
>>needed, /or/ (with a new enough btrfs-progs) btrfs fi usage. And of
>>course the kernel (4.0.4 in your case) and btrfs-progs (not posted, that
>>I saw) versions.
>
> OK, I have usage. For the SSD with the system:
>
> root@phoenix:~# btrfs fi usage /
> Overall:
> Device size: 118.05GiB
> Device allocated: 110.06GiB
> Device unallocated: 7.99GiB
> Used: 103.46GiB
> Free (estimated): 11.85GiB (min: 11.85GiB)
> Data ratio: 1.00
> Metadata ratio: 1.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,single: Size:102.03GiB, Used:98.16GiB
> /dev/sda3 102.03GiB
>
> Metadata,single: Size:8.00GiB, Used:5.30GiB
> /dev/sda3 8.00GiB
>
> System,single: Size:32.00MiB, Used:16.00KiB
> /dev/sda3 32.00MiB
>
> Unallocated:
> /dev/sda3 7.99GiB
>
>
> Hmm. A bit tight. I've just ordered a replacement SSD.
While ~8 GiB unallocated on a ~118 GiB filesystem is indeed a bit tight,
it's nothing that should be giving btrfs fits yet.
Tho even with autodefrag, given the previous relatime and snapshotting,
it could be that the free-space in existing chunks is fragmented, which
over time and continued usage would force higher file fragmentation
despite the autodefrag, since there simply aren't any large contiguous
free-space areas left in which to write files.
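(If that's what's happening, a filtered balance can consolidate the
partly-used chunks and return contiguous space to the unallocated pool.
Something like the following, usage threshold illustrative:

btrfs balance start -dusage=50 -musage=50 /

Only chunks less than 50% used get rewritten, so it's far less painful
than the full balance you mentioned.)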
> Slackware should fit in about 5GB of disk space according to a website
> I've seen? Hmm. I don't believe that. I'd allow at least 10GB, and
> more if I want to add extra packages such as libreoffice. If I have no
> snapshots it seems to get to 45GB with various extra packages installed,
> and grows to 100ish with snapshotting, probably owing to updates.
FWIW, here on gentoo and actually using separate partitions and btrfs,
/not/ btrfs subvolumes (because I don't want all my data eggs in the same
filesystem basket, should that filesystem go bad)...
My / is 8 GiB (per device, btrfs raid1 both data and metadata on
partitions from two ssds, so same stuff on each device) including all
files installed by packages except some individual subdirs in /var/ which
are symlinked to dirs in /home/var/ where necessary, because I keep /
read-only mounted by default, and some services want a writable /var/
config.
Tho I don't have libreoffice installed, nor multiple desktop environments
as I prefer (a much slimmed down) kde, but I have had multiple versions
of kde (kde 3/4 back when, kde 4/5 more recently) installed at the same
time as I was switching from one to the other. While gentoo allows
pulling in rather fewer deps than many distros if one is conservative
with their USE flag settings, that's probably roughly canceled out by the
fact that it's build-from-source and thus all the developer package
halves not installed on binary distros need to be installed on gentoo,
in order to build packages that depend on them.
Anyway, with compress=lzo, here's my root usage:
$$ sudo btrfs fi usage /
Overall:
Device size: 16.00GiB
Device allocated: 9.06GiB
Device unallocated: 6.94GiB
Device missing: 0.00B
Used: 5.41GiB
Free (estimated): 4.99GiB (min: 4.99GiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 64.00MiB (used: 0.00B)
Data,RAID1: Size:4.00GiB, Used:2.47GiB
/dev/sda5 4.00GiB
/dev/sdb5 4.00GiB
Metadata,RAID1: Size:512.00MiB, Used:237.55MiB
/dev/sda5 512.00MiB
/dev/sdb5 512.00MiB
System,RAID1: Size:32.00MiB, Used:16.00KiB
/dev/sda5 32.00MiB
/dev/sdb5 32.00MiB
Unallocated:
/dev/sda5 3.47GiB
/dev/sdb5 3.47GiB
So of that 8 gig (per device, two device raid1), nearly half, ~3.5 GiB,
remains unallocated. Data is 4 GiB allocated, ~2.5 GiB used. Metadata
is half a GiB allocated, just over half used, and there's 32 MiB of
system allocated as well, with trivial usage. Including both the
allocated but unused data space and the entirely unallocated space, I
should still be able to write nearly 5 GiB (the free estimate already
accounts for the raid1).
Regular df (not btrfs fi df) reports similar numbers, 8192 MiB total,
2836 MiB used, 5114 MiB available, tho with non-btrfs df the numbers are
going to be fuzzy since its understanding of btrfs internals is somewhat
fuzzy.
But either way, given the LZO compression it appears I've used under half
the 8 GiB capacity. Meanwhile, du -xBM / says 4158M, so just over half
in uncompressed data (with --apparent-size added it says 3624M).
So installation-only may well fit in under 5 GiB, and indeed, some years
ago (before btrfs and the ssds, so reiserfs on spinning rust), I was
running 5 GiB /, which on reiserfs was possible due to tail packing even
without compression, but it was indeed a bit tighter than I was
comfortable with, thus the 8 GiB I'm much happier with, today, when I
partitioned up the ssds with btrfs and lzo compression in mind.
My /home is 20 GiB (per device, dual-ssd-partition btrfs raid1), tho
that's with a separate media partition and will obviously vary *GREATLY*
per person/installation. My distro's git tree and overlays, along with
sources tarball cache, built binpkgs cache, ccache build cache, and
mainline kernel git repo, is 24 GiB.
Log is separate to avoid runaway logging filling up more critical
filesystems and is tiny, 640 MiB, which I'll make smaller, possibly half
a GiB, next time I repartition.
Boot is an exception to the usual btrfs raid1, with a separate working
boot partition on one device and its backup on the other, so I can point
the BIOS at and boot either one. It's btrfs mixed-bg mode dup, 256 MiB
for each of working and backup, which because it's dup means 128 MiB
capacity. That's actually a bit small, and why I'll be shrinking the log
partition the next time I repartition. Making it 384 MiB dup, for 192
MiB capacity, would be much better, and since I can shrink the log
partition by that and still keep the main partitions GiB aligned, it all
works out.
Under the GiB boundary in addition to boot and log I also have separate
BIOS and EFI partitions. Yes, both, for compatibility. =:^) The sizes
of all the sub-GiB partitions are calculated so (as I mentioned) the main
partitions are all GiB aligned.
Further, all main partitions have both a working and a backup partition,
the same size, which combined with the dual-SSD btrfs raid1 and a btrfs
dup boot on each device, gives me both working copy and primary backups
on the SSDs (except for log, which is btrfs raid1 but without a backup
copy as I didn't see the point).
As I mentioned elsewhere, with another purpose-dedicated partition or two
and their backups, that's about 130 GiB out of the 256 GB ssds, with the
rest left unpartitioned for use by the ssd FTL.
I also mentioned a media partition. That's on spinning rust, along with
the secondary backups for the main system. It too is bootable on its
own, should I need to resort to that, tho I don't keep the secondary
backups near as current as the primary backups on the SSDs, because I
figure between the raid1 and the primary backups on the ssds, there's a
relatively small chance I'll actually have to resort to the secondary
backups on spinning rust.
> Anyway, I took the lazy, less hair-tearing route and ordered a 500GB
> drive. Prices have dropped and fortunately a new drive is not a major
> issue. Timing is also good with Slack 14.2 imminent. You rarely hear
> people complaining about disk-too-empty problems...
If I had 500 GiB SSDs like the one you're getting, I could put the media
partition on SSDs and be rid of the spinning rust entirely. But I seem
to keep finding higher priorities for the money I'd spend on a pair of
them...
(Tho I'm finding I do online media enough these days that I don't use the
media partition so much these days. I could probably go thru it, delete
some stuff, and shrink what I have stored on it. Given the near 50%
unpartitioned space on the SSDs if I could get it to 64 GiB or under, I'd
still have the recommended 20% unallocated space for the FTL to use, and
wouldn't need to wait to upgrade the SSDs to put media on the SSDs and
could then unplug the then only "secondary backup usage" spinning rust,
except for doing those backups.)
> Note that the system btrfs does not get 127GB, it gets /dev/sda3, not
> far off, but I've a 209MB partition for /boot and a 1G partition for a
> very cut down system for maintenance purposes (both ext4). On the new
> drive I'll keep the 'maintenance' ext4 install but I could use /boot
> from that filesystem using bind mounts, a bit cleaner.
Good point. Similar here except the backup/maintenance isn't a cutdown
system, it's a snapshot (in time, not btrfs snapshot) of exactly what was
on the system when I did the backup. That way, should it be necessary, I
can boot the backup and have a fully functional system exactly as it was
the day I took that backup. That's very nice to have for a maintenance
setup, since it means I have access to full manpages, even a full X,
media players, a full graphical browser to google my problems with, etc.
And of course I have it partitioned up into much smaller pieces, with the
second device in raid1 as well as having the backup partition copies.
> Rarely use them except when I either delete the wrong file or do
> something very sneaky but dumb, like inadvertently setting umask for
> root, installing a package, and breaking _lots_ of file system
> permissions. Easier to recover from a good snapshot than to try to fix
> that mess...
Of course snapshots aren't backups; if the filesystem goes south, it
takes the snapshots with it. But they're still great for fat-fingering
issues, as you mention. Still, I prefer smaller and easier/faster
maintained partitions, with backup partition copies that are totally
independent filesystems from the working copies. Between that and the
btrfs raid1 to cover device failure, AND secondary backups on spinning
rust, I guess I'm /reasonably/ prepared.
(I don't worry much about or bother with offsite backups, however, as I
figure if I'm forced to resort to them, I'll have a whole lot more
important things to worry about, like where I'm going to live if a fire
or whatever took them out, or simply what sort of computer I'll replace
it with and how I'll actually set it up, if it was simply burglarized.
After all, the real important stuff is in my head anyway, and if I lose
/that/ backup I'm not going to be caring much about /anything/, so...)
>>FWIW, I ended up going rather overboard with that here, as I knew I
> <snip>
>
> So have I. The price seems almost linear per gigabyte, perhaps?
> I suspected it was better to go larger if I could and delay the time
> until the new disk fills up. I could put the old disk in the laptop for
> experimentation with distros.
It seems to be more or less linear within a sweet-spot, yes. Back when I
bought mine, the sweet-spot was 32-256 GiB or so; smaller, you paid more
due to overhead, while larger simply wasn't manufactured in high enough
quantities yet.
Now it seems the sweet-spot is 256 GB to 1 TB, at around 3 GB/USD
low end price (pricewatch.com, SATA-600). (128 GB is available at that,
but only for bare laptop OEM models, M2 I'd guess, possibly used.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Snapshots slowing system
From: Peter Chant @ 2016-03-15 22:29 UTC (permalink / raw)
To: linux-btrfs
On 03/15/2016 03:52 PM, Duncan wrote:
<snip>
> Meanwhile, FWIW, some months ago I finally got tired of having to specify
> noatime on all my mounts, expanding my fstab width by 8 chars (including
> the ,) and the total fstab character count by several multiples of that
> as I added it to all entries, and decided to see if I might per chance,
> even as a sysadmin not a dev, be able to come up with a patch that
> changed the kernel default to noatime. It wasn't actually hard, tho were
> I a coder and actually knew what I was doing, I imagine I could create a
> much better patch. So now all my filesystems (barring a few of the
> memory-only virtual-filesystem mounts) are mounted noatime by default, as
> opposed to the unpatched relatime, and I was able to take all the noatimes
> out of my fstab. =:^)
It is a pity you cannot use variables or macros in fstab. It's not too
bad with traditional file systems on my home user machine, but with
multiple subvolumes my fstab is huge and there is a lot of repetition
of the options.
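One workaround, I suppose, is to generate the repetitive entries with a
script instead of maintaining them by hand. A sketch, with illustrative
device and subvolume names:

OPTS=autodefrag,noatime,compress=lzo
for sv in home srv photos music; do
    printf '/dev/sda3 /%s btrfs %s,subvol=%s 0 0\n' "$sv" "$OPTS" "$sv"
done >> /etc/fstab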
>> Hmm. A bit tight. I've just ordered a replacement SSD.
>
> While ~8 GiB unallocated on a ~118 GiB filesystem is indeed a bit tight,
> it's nothing that should be giving btrfs fits yet.
>
Too late, drive ordered. It was only a matter of time anyway.
> Tho even with autodefrag, given the previous relatime and snapshotting,
> it could be that the free-space in existing chunks is fragmented, which
> over time and continued usage would force higher file fragmentation
> despite the autodefrag, since there simply aren't any large contiguous
> free-space areas left in which to write files.
>
Hmm. The following returns instantly, as if it were a null operation:
btrfs fi defrag /
I thought though that btrfs fi defrag <name> would only defrag the one
file or directory?
btrfs fi defrag /srv/photos/
is considerably slower; it is still running. Disk light is on solid.
Processes kworker and btrfs-transacti are pretty busy according to iotop.
<snip>
> But either way, given the LZO compression it appears I've used under half
> the 8 GiB capacity. Meanwhile, du -xBM / says 4158M, so just over half
> in uncompressed data (with --apparent-size added it says 3624M).
>
I seem to install a lot of interesting looking things I barely use. I
am surprised about how full the filesystem gets; it should not be.
However, large disks make life much easier, rather than rooting out
unused packages as a hobby. Unless it gets silly.
<snip>
>
> Boot is an exception to the usual btrfs raid1, with a separate working
> boot partition on one device and its backup on the other, so I can point
> the BIOS at and boot either one. It's btrfs mixed-bg mode dup, 256 MiB
> for each of working and backup, which because it's dup means 128 MiB
> capacity. That's actually a bit small, and why I'll be shrinking the log
> partition the next time I repartition. Making it 384 MiB dup, for 192
> MiB capacity, would be much better, and since I can shrink the log
> partition by that and still keep the main partitions GiB aligned, it all
> works out.
>
Slackware uses lilo so I need a separate /boot with something that is
supported by lilo.
<snip>
> If I had 500 GiB SSDs like the one you're getting, I could put the media
> partition on SSDs and be rid of the spinning rust entirely. But I seem
> to keep finding higher priorities for the money I'd spend on a pair of
> them...
I'm getting one, not two, so the system is raid0. Data is more
important (and backed up).
>
<snip>
> Good point. Similar here except the backup/maintenance isn't a cutdown
> system, it's a snapshot (in time, not btrfs snapshot) of exactly what was
> on the system when I did the backup. That way, should it be necessary, I
> can boot the backup and have a fully functional system exactly as it was
> the day I took that backup. That's very nice to have for a maintenance
> setup, since it means I have access to full manpages, even a full X,
> media players, a full graphical browser to google my problems with, etc.
>
I have that as well. But the non-btrfs maintenance partition is there
in case btrfs is unbootable.
--
Peter Chant
* Re: Snapshots slowing system
From: Austin S. Hemmelgarn @ 2016-03-16 11:39 UTC (permalink / raw)
To: Peter Chant, linux-btrfs
On 2016-03-15 18:29, Peter Chant wrote:
> On 03/15/2016 03:52 PM, Duncan wrote:
>> Tho even with autodefrag, given the previous relatime and snapshotting,
>> it could be that the free-space in existing chunks is fragmented, which
>> over time and continued usage would force higher file fragmentation
>> despite the autodefrag, since there simply aren't any large contiguous
>> free-space areas left in which to write files.
>>
>
> Hmm. The following returns instantly as if it were a null operation.
> btrfs fi defrag /
That should return almost immediately, as defrag isn't recursive by
default, and / should only have at most about 16-20 directory entries.
>
> I thought though that btrfs fi defrag <name> would only defrag the one
> file or directory?
It does, it's just not recursive unless you tell it to be.
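For example, something like this should walk the whole tree (-v is just
verbosity, and -clzo to recompress as it goes is optional):

btrfs filesystem defragment -r -v -clzo /srv/photos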
>
> btrfs fi defrag /srv/photos/
> Is considerably slower, it is still running. Disk light is on solid.
> Processes kworker and btrfs-transacti are pretty busy according to iotop.
If you have a lot of items in /srv/photos/ (be it either lots of
individual files, or lots of directories at the top level), then this is
normal, if not, then you may have found a bug.
>>
>> Boot is an exception to the usual btrfs raid1, with a separate working
>> boot partition on one device and its backup on the other, so I can point
>> the BIOS at and boot either one. It's btrfs mixed-bg mode dup, 256 MiB
>> for each of working and backup, which because it's dup means 128 MiB
>> capacity. That's actually a bit small, and why I'll be shrinking the log
>> partition the next time I repartition. Making it 384 MiB dup, for 192
>> MiB capacity, would be much better, and since I can shrink the log
>> partition by that and still keep the main partitions GiB aligned, it all
>> works out.
>>
>
> Slackware uses lilo so I need a separate /boot with something that is
> supported by lilo.
I would like to point out that just because the distribution prefers one
package doesn't mean you can't use another; it's just not quite as easy.
It's worth noting that I do similarly to Duncan in this respect,
although I provisioned 512MiB when I set things up (and stuck the BIOS
boot partition (because I use GPT on everything these days) in the
unaligned slack space between the partition table and /boot). It also
has the advantage that I can fall back to old versions of the kernel and
initrd if need be, when an upgrade fails to boot for some reason.
>
> <snip>
>
>> If I had 500 GiB SSDs like the one you're getting, I could put the media
>> partition on SSDs and be rid of the spinning rust entirely. But I seem
>> to keep finding higher priorities for the money I'd spend on a pair of
>> them...
>
>
> I'm getting one, not two, so the system is raid0. Data is more
> important (and backed up).
If you don't need the full terabyte of space, I would seriously suggest
using raid1 instead of raid0. If you're using SSDs, then you won't get
much performance gain from BTRFS raid0 (because the I/O dispatching is
not particularly smart), and it also makes it more likely that you will
need to rebuild from scratch.
* Re: Snapshots slowing system
From: Pete @ 2016-03-17 21:08 UTC (permalink / raw)
To: Austin S. Hemmelgarn, linux-btrfs
On 03/16/2016 11:39 AM, Austin S. Hemmelgarn wrote:
>> I thought though that btrfs fi defrag <name> would only defrag the one
>> file or directory?
> It does, it's just not recursive unless you tell it to be.
Hmm. That shows how long ago I last used it. Last time I used it the
'-r' option did not exist. So I set and forgot 'autodefrag'.
>>
>> btrfs fi defrag /srv/photos/
>> Is considerably slower, it is still running. Disk light is on solid.
>> Processes kworker and btrfs-transacti are pretty busy according to iotop.
> If you have a lot of items in /srv/photos/ (be it either lots of
> individual files, or lots of directories at the top level), then this is
> normal, if not, then you may have found a bug.
20 files. 15 directories. A lot of files under this directory but
recursive NOT set.
Hmm. Comments on SSDs set me googling. I don't normally touch smartctl:
root@phoenix:~# smartctl --attributes /dev/sdc
<snip>
184 End-to-End_Error 0x0032 098 098 099 Old_age Always FAILING_NOW 2
<snip>
also:
1 Raw_Read_Error_Rate 0x000f 120 099 006 Pre-fail Always - 241052216
That figure seems to be on the move. On /dev/sdb (the other half of my
hdd raid1 btrfs) it is zero. I presume zero means either 'no errors,
happy days' or 'not supported'.
Hmm. Is this bad and/or possibly the smoking gun for slowness? I will
keep an eye on the number to see if it changes.
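For now I'm just watching it with something like:

watch -n 600 'smartctl -A /dev/sdc | grep -E "Raw_Read|Seek_Error|End-to-End"'

rather than setting up smartd properly.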
OK, full output:
root@phoenix:~# smartctl --attributes /dev/sdc
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.0.4] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  120   099   006    Pre-fail Always  -           241159856
  3 Spin_Up_Time            0x0003  093   093   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always  -           83
  5 Reallocated_Sector_Ct   0x0033  100   100   010    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f  073   060   030    Pre-fail Always  -           56166570022
  9 Power_On_Hours          0x0032  075   075   000    Old_age  Always  -           22098
 10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   020    Old_age  Always  -           83
183 Runtime_Bad_Block       0x0032  100   100   000    Old_age  Always  -           0
184 End-to-End_Error        0x0032  098   098   099    Old_age  Always  FAILING_NOW 2
187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always  -           0
188 Command_Timeout         0x0032  100   099   000    Old_age  Always  -           8590065669
189 High_Fly_Writes         0x003a  095   095   000    Old_age  Always  -           5
190 Airflow_Temperature_Cel 0x0022  066   063   045    Old_age  Always  -           34 (Min/Max 30/34)
191 G-Sense_Error_Rate      0x0032  100   100   000    Old_age  Always  -           0
192 Power-Off_Retract_Count 0x0032  100   100   000    Old_age  Always  -           27
193 Load_Cycle_Count        0x0032  001   001   000    Old_age  Always  -           287836
194 Temperature_Celsius     0x0022  034   040   000    Old_age  Always  -           34 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always  -           0
240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline -           281032595099550
241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline -           75393744072
242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline -           115340399121
OK, head flying hours explains it, drive is over 32 billion years old...
As I am slowly producing this post, raw_read_error_rate is now at
241507192. But I did set smartctl -t long /dev/sdc in motion, if that
is at all relevant.
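I'll check the result later with something like:

smartctl -l selftest /dev/sdc

once it reports as complete.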
>>
>> Slackware uses lilo so I need a separate /boot with something that is
>> supported by lilo.
> I would like to point out that just because the distribution prefers one
> package doesn't mean you can't use another; it's just not quite as easy.
> It's worth noting that I do similarly to Duncan in this respect,
> although I provisioned 512MiB when I set things up (and stuck the BIOS
> boot partition (because I use GPT on everything these days) in the
> unaligned slack space between the partition table and /boot). It also
> has the advantage that I can fall back to old versions of the kernel and
> initrd if need be when an upgrade fails to boot for some reason.
Thanks. I know that. I have dallied with Grub and Grub2, but when it
works well lilo is nice and simple. My 'maintenance' partition plan is
to give me something more powerful than a rescue disk if things go
south. It was a bit frustrating the time I found that btrfs-tools was
well behind on the maintenance partition. At least I could go online
and fix that. Rescue CDs are not helpful there.
>>
>> <snip>
>>
>>> If I had 500 GiB SSDs like the one you're getting, I could put the media
>>> partition on SSDs and be rid of the spinning rust entirely. But I seem
>>> to keep finding higher priorities for the money I'd spend on a pair of
>>> them...
>>
>>
>> I'm getting one, not two, so the system is raid0. Data is more
>> important (and backed up).
> If you don't need the full terabyte of space, I would seriously suggest
> using raid1 instead of raid0. If you're using SSDs, then you won't get
> much performance gain from BTRFS raid0 (because the I/O dispatching is
> not particularly smart), and it also makes it more likely that you will
> need to rebuild from scratch.
Confused. I'm getting one SSD which I intend to use raid0. Seems to me
to make no sense to split it in two and put both sides of raid1 on one
disk, and I reasonably think that you are not suggesting that. Or are
you assuming that I'm getting two disks? Or are you saying that buying
a second SSD is strongly advised? (Bearing in mind that it looks like I
might need another hdd, if the smart field above is worth worrying
about.)
Pete
* Re: Snapshots slowing system
From: Duncan @ 2016-03-18 9:17 UTC (permalink / raw)
To: linux-btrfs
Pete posted on Thu, 17 Mar 2016 21:08:23 +0000 as excerpted:
> Hmm. Comments on SSDs set me googling. I don't normally touch smartctl:
>
> root@phoenix:~# smartctl --attributes /dev/sdc
> <snip>
> 184 End-to-End_Error 0x0032 098 098 099 Old_age Always FAILING_NOW 2
> <snip>
> 1 Raw_Read_Error_Rate 0x000f 120 099 006 Pre-fail Always - 241052216
>
> That figure seems to be on the move. On /dev/sdb (the other half of my
> hdd raid1 btrfs) it is zero. I presume zero means either 'no errors,
> happy days' or 'not supported'.
This is very useful. See below.
> Hmm. Is this bad and/or possibly the smoking gun for slowness? I will
> keep an eye on the number to see if it changes.
>
> OK, full output:
> root@phoenix:~# smartctl --attributes /dev/sdc
> [...]
> === START OF READ SMART DATA SECTION ===
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x000f 120 099 006 Pre-fail Always - 241159856
This one's showing some issues, but is within tolerance as even the worst
value of 99 is still _well_ above the failure threshold of 6.
But the fact that the raw value isn't simply zero means that it is having
mild problems, they're just well within tolerance according to the cooked
value and threshold.
(I've snipped a few of these...)
> 3 Spin_Up_Time 0x0003 093 093 000 Pre-fail Always - 0
On spinning rust this one's a strong indicator of one of the failure
modes, a very long time to spin up. Obviously that's not a problem with
this device. Even raw is zero.
> 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 83
Spinning up a drive is hard on it. Laptops in particular often spin down
their drives to save power, then spin them up again. Wall-powered
machines can and sometimes do, but it's not as common, and when they do,
the spin-down time is often an hour or higher of idle, where on laptops
it's commonly 15 minutes and may be as low as 5.
Obviously you're doing no spindowns except for power-offs, and thus have
a very low raw count of 83, which hasn't dropped the cooked value from
100 yet, so great on this one as well.
> 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
This one is available on ssds and spinning rust, and while it never
actually hit failure mode for me on an ssd I had that went bad, I watched
over some months as the raw reallocated sector count increased a bit at a
time. (The device was one of a pair with multiple btrfs raid1 on
parallel partitions on each, and the other device of the pair remains
perfectly healthy to this day, so I was able to use btrfs checksumming
and scrubs to keep the one that was going bad repaired based on the other
one, and was thus able to run it for quite some time after I would have
otherwise replaced it, simply continuing to use it out of curiosity and
to get some experience with how it and btrfs behaved when failing.)
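(The repair itself is just an ordinary scrub against the mounted
filesystem, e.g.:

btrfs scrub start -Bd /path/to/mountpoint

-B keeps it in the foreground, -d gives per-device stats, and any
checksum failures on the bad device are rewritten from the good copy
automatically.)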
In my case, it started at 253 cooked with 0 raw, then dropped to a
percentage (still 100 at first) as soon as the first sector was
reallocated (raw count of 1). It appears that your manufacturer treats
it as a percentage from a raw count of 0.
What really surprised me was just how many spare sectors that ssd
apparently had. 512 byte sectors, so half a KiB each. But it was into
the thousands of replaced sectors raw count, so Megabytes used, but the
cooked count had only dropped to 85 or so by the time I got tired of
constantly scrubbing to keep it half working as more and more sectors
failed. But threshold was 36, so I wasn't anywhere CLOSE to getting to
reported failure here, despite having thousands of replaced sectors thus
megabytes in size.
But the ssd was simply bad before its time, as it wasn't failing due to
write-cycle wear-out, but due to bad flash, plain and simple. With the
other device (and the one I replaced it with as well, I actually had
three of the same brand and size SSDs), there's still no replaced sectors
at all.
But apparently, when ssds hit normal old-age and start to go bad from
write-cycle failure, THAT is when those 128 MiB or so (as I calculated
based on percentage and raw value failed at one point, or was it 256 MiB,
IDR for sure) of replacement sectors start to be used. And on SSDs,
apparently when that happens, sectors often fail and are replaced faster
than I was seeing, so it's likely people will actually get to failure
mode on this attribute in that case.
I'd guess spinning rust has something less, maybe 64 MiB for multiple TB
of storage, instead of the 128 or 256 MiB I saw on my 256 GiB SSDs. That
would be because spinning rust failure mode is typically different, and
while a few sectors might die and be replaced over the life of the
device, typically it's not that many, and failure is by some other means
like mechanical failure (failure to spin up, or read heads getting out of
tolerated sync with the cylinders on the device).
> 7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 56166570022
Like the raw-read-error-rate attribute above, you're seeing minor issues
as the raw number isn't 0, and in this case, the cooked value is
obviously dropping significantly as well, but it's still within
tolerance, so it's not failing yet. That worst cooked value of 60 is
starting to get close to that threshold of 30, however, so this one's
definitely showing wear, just not failure... yet.
> 9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 22098
Reasonable for a middle-aged drive, considering you obviously don't shut
it down often (a start-stop-count raw of 80-something). That's ~2.5
years of power-on.
> 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
This one goes with spin-up time. Absolutely no problems here.
> 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 83
Matches start-stop-count. Good. =:^) Since you obviously don't spin
down except at power-off, this one isn't going to be a problem for you.
> 184 End-to-End_Error 0x0032 098 098 099 Old_age Always FAILING_NOW 2
I /think/ this one is a power-on self-test head seek from one side of
the device to the other, and back, covering both ways.
Assuming I'm correct on the above guess, the combination of this failing
for you, and the not yet failing but a non-zero raw-value for raw-read-
error-rate and seek-error-rate, with the latter's cooked value being
significantly down if not yet failing, is definitely concerning, as the
three values all have to do with head seeking errors.
I'd definitely get your data onto something else as soon as possible, tho
as much of it is backups, you're not in too bad a shape even if you lose
them, as long as you don't lose the working copy at the same time.
But with all three seek attributes indicating at least some issue and one
failing, at least get anything off it that is NOT backups ASAP.
And that very likely explains the slowdowns as well, as obviously, while
all sectors are still readable, it's having to retry multiple times on
some of them, and that WILL slow things down.
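(When the new device arrives, btrfs replace is probably the least-fuss
way to swap it in while the old one still limps along. Something like
the following, with /dev/sdX as the hypothetical new device:

btrfs replace start -r /dev/sdc /dev/sdX /home
btrfs replace status /home

The -r tells it to read from the other mirror where possible, sparing
the struggling device.)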
> 188 Command_Timeout 0x0032 100 099 000 Old_age Always - 8590065669
Again, a non-zero raw value indicating command timeouts, probably due to
those bad seeks. It'll have to retry those commands, and that'll
definitely mean slowdowns.
Tho there's no threshold, but 99 worst-value cooked isn't horrible.
FWIW, on my spinning rust device this value actually shows a worst of
001, here (100 current cooked value, tho), with a threshold of zero,
however. But as I've experienced no problems with it I'd guess that's an
aberration. I haven't the foggiest why/how/when it got that 001 worst.
> 189 High_Fly_Writes 0x003a 095 095 000 Old_age Always - 5
Again, this demonstrates a bit of disk wobble or head slop. But with a
threshold of zero and a value and worst of 95, it doesn't seem to be too
bad.
> 193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 287836
Interesting. My spinning rust has the exact same value and worst of 1,
threshold 0, and a relatively similar 237181 raw count.
But I don't really know what this counts unless it's actual seeks, and
mine seems in good health still, certainly far better than the cooked
value and worst of 1 might suggest.
> 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 281032595099550
> OK, head flying hours explains it, drive is over 32 billion years old...
>
While my spinning rust has this attribute and the cooked values are
identical 100/253/0, the raw value is reported and formatted entirely
differently, as 21122 (89 19 0). I don't know what those values are, but
presumably your big long value reports the others mine does, as well,
only as a big long combined value.
Which would explain the apparent multi-billion years yours is reporting!
=:^) It's not a single value, it's multiple values somehow combined.
At least with my power-on hours of 23637, a head-flying hours of 21122
seems reasonable. (I only recently configured the BIOS to spin down that
drive after 15 minutes I think, because it's only backups and my media
partition which isn't mounted all the time anyway, so I might as well
leave it off instead of idle-spinning when I might not use it for days at
a time. So a difference of a couple thousand hours between power-on and
head-flying, on a base of 20K+ hours for both, makes sense given that I
only recently configured it to spin down.)
But given your ~22K power-on hours, even simply peeling off the first 5
digits of your raw value would be 28K head-flying, and that doesn't make
sense for only 22K power-on, so obviously they're using a rather more
complex formula than that.
So bottom line regarding that smartctl output, yeah, a new device is
probably a very good idea at this point. Those smart attributes indicate
either head slop or spin wobble, and some errors and command timeouts and
retries, which could well account for your huge slowdowns. Fortunately,
it's mostly backup, so you have your working copy, but if I'm not mixing
up my threads, you have some media files, etc, on a different partition
on it as well, and if you don't have backups elsewhere, getting them onto
something else ASAP is a very good idea, because this drive does look to
be struggling, and tho it could continue working in a low usage scenario
for some time yet, it could also fail rather quickly, as well.
> As I am slowly producing this post, raw_read_error_rate is now at
> 241507192. But I did set smartctl -t long /dev/sdc in motion, if that
> is at all relevant.
>
>>> <snip>
>>>
>>>> If I had 500 GiB SSDs like the one you're getting, I could put the
>>>> media partition on SSDs and be rid of the spinning rust entirely.
>>>> But I seem to keep finding higher priorities for the money I'd spend
>>>> on a pair of them...
>>>
>>>
>>> I'm getting one, not two, so the system is raid0. Data is more
>>> important (and backed up).
>> If you don't need the full terabyte of space, I would seriously suggest
>> using raid1 instead of raid0. If you're using SSDs, then you won't
>> get much performance gain from BTRFS raid0 (because the I/O dispatching
>> is not particularly smart), and it also makes it more likely that you
>> will need to rebuild from scratch.
>
> Confused. I'm getting one SSD which I intend to use raid0. Seems to me
> to make no sense to split it in two and put both sides of raid1 on one
> disk and I reasonably think that you are not suggesting that. Or are
> you assuming that I'm getting two disks? Or are you saying that buying
> a second SSD disk is strongly advised? (bearing in mind that it looks
> like I might need another hdd if the smart field above is worth worrying
> about).
Well, raid0 normally requires two devices. So either you mean single
mode on a single device, or you're combining it with another device (or
more than one more) to do raid0.
And if you're combining it with another device to do raid0, then the
suggestion, unless you really need all the room from the raid0, is to do
raid1, because the usual reason for raid0 is speed, and btrfs raid0 isn't
yet particularly optimized so you don't get so much more speed than on a
single device. And raid0 has a much higher risk of failure because if
any of the devices fail the whole filesystem is gone.
So raid0 really doesn't get you much besides the additional room of the
multiple devices.
Meanwhile, in addition to the traditional device redundancy that you
normally get with raid1, btrfs raid1 has some additional features as
well, namely, data integrity due to checksumming, and the ability to
repair a bad copy from the other one, assuming the other copy passes
checksum verification. While traditional raid1 lets you do a similar
repair, because it doesn't have and verify the checksums like btrfs does,
on traditional raid1, you're just as likely to be replacing the good copy
with the bad one, as the other way around. Btrfs' ability to actually
repair bad data from a verified good second copy like that, is a very
nice feature indeed, and having lived thru a failing ssd as I mentioned
above, btrfs raid1 is not only what saved my data, it's what allowed me
to continue playing with the failing ssd as I continued to use it well
past when I would have otherwise replaced it, so I could watch just how
it behaved as it failed and get more experience with both that and
working with btrfs raid1 recovery under that sort of situation.
So btrfs raid1 has data integrity and repair features that aren't
available on normal raid1, and thus is highly recommended.
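(And converting is simple enough should you add a second device later,
e.g.:

btrfs device add /dev/sdY /mountpoint
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mountpoint

with /dev/sdY the hypothetical second device.)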
But, raid1 /does/ mean two copies of both data and metadata (assuming of
course you make them both raid1, as I did), and if you simply don't have
room to do it that way, you don't have room, highly recommended tho it
may be.
Tho raid1 shouldn't be considered the same as a backup, because it's
not. In particular, while you do have reasonable protection against
device failure, and with btrfs, against the data going bad, raid1, on its
own, doesn't protect against fat-fingering, simply making a mistake and
deleting something you shouldn't have, which as any admin knows, tends to
be the greatest risk to data. You need a real backup (or a snapshot) to
recover from that.
Additionally, raid1 alone isn't going to help if the filesystem itself
goes bad. Neither will a snapshot, there. You need a backup to recover
in that case.
Similarly in the case of an electrical problem, robbery of the machine,
or fire, since both/all devices in a raid1 will be affected together. If
you want to be able to recover your data in that case, better have a real
backup, preferably kept offline except when actually making the backup,
and even better, off-site. For this sort of thing, in fact, the usual
recommendation is at least two offsite backups, alternated such that if
tragedy strikes when you're updating the one, taking it out as well, you
still have the other one safe and sound, and will only lose the
difference since that alternating backup, even when both your working
copy and the other of the backups are both taken out at once.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Snapshots slowing system
From: Austin S. Hemmelgarn @ 2016-03-18 11:38 UTC (permalink / raw)
To: linux-btrfs
On 2016-03-18 05:17, Duncan wrote:
> Pete posted on Thu, 17 Mar 2016 21:08:23 +0000 as excerpted:
>> 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
>
> This one is available on ssds and spinning rust, and while it never
> actually hit failure mode for me on an ssd I had that went bad, I watched
> over some months as the raw reallocated sector count increased a bit at a
> time. (The device was one of a pair with multiple btrfs raid1 on
> parallel partitions on each, and the other device of the pair remains
> perfectly healthy to this day, so I was able to use btrfs checksumming
> and scrubs to keep the one that was going bad repaired based on the other
> one, and was thus able to run it for quite some time after I would have
> otherwise replaced it, simply continuing to use it out of curiosity and
> to get some experience with how it and btrfs behaved when failing.)
>
> In my case, it started at 253 cooked with 0 raw, then dropped to a
> percentage (still 100 at first) as soon as the first sector was
> reallocated (raw count of 1). It appears that your manufacturer treats
> it as a percentage from a raw count of 0.
>
> What really surprised me was just how many spare sectors that ssd
> apparently had. 512 byte sectors, so half a KiB each. But it was into
> the thousands of replaced sectors raw count, so Megabytes used, but the
> cooked count had only dropped to 85 or so by the time I got tired of
> constantly scrubbing to keep it half working as more and more sectors
> failed. But threshold was 36, so I wasn't anywhere CLOSE to getting to
> reported failure here, despite having thousands of replaced sectors thus
> megabytes in size.
This actually makes sense, as SSDs have spare 'sectors' in erase-block
size chunks, and most use a minimum 1MiB erase block size, with 4-8MiB
being normal for most consumer devices.
>
> But the ssd was simply bad before its time, as it wasn't failing due to
> write-cycle wear-out, but due to bad flash, plain and simple. With the
> other device (and the one I replaced it with as well, I actually had
> three of the same brand and size SSDs), there's still no replaced sectors
> at all.
>
> But apparently, when ssds hit normal old-age and start to go bad from
> write-cycle failure, THAT is when those 128 MiB or so (as I calculated
> based on percentage and raw value failed at one point, or was it 256 MiB,
> IDR for sure) of replacement sectors start to be used. And on SSDs,
> apparently when that happens, sectors often fail and are replaced faster
> than I was seeing, so it's likely people will actually get to failure
> mode on this attribute in that case.
>
> I'd guess spinning rust has something less, maybe 64 MiB for multiple TB
> of storage, instead of the 128 or 256 MiB I saw on my 256 GiB SSDs. That
> would be because spinning rust failure mode is typically different, and
> while a few sectors might die and be replaced over the life of the
> device, typically it's not that many, and failure is by some other means
> like mechanical failure (failure to spin up, or read heads getting out of
> tolerated sync with the cylinders on the device).
>
>> 7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always
>> - 56166570022
>
> Like the raw-read-error-rate attribute above, you're seeing minor issues
> as the raw number isn't 0, and in this case, the cooked value is
> obviously dropping significantly as well, but it's still within
> tolerance, so it's not failing yet. That worst cooked value of 60 is
> starting to get close to that threshold of 30, however, so this one's
> definitely showing wear, just not failure... yet.
>
>> 9 Power_On_Hours 0x0032 075 075 000 Old_age Always
>> - 22098
>
> Reasonable for a middle-aged drive, considering you obviously don't shut
> it down often (a start-stop-count raw of 80-something). That's ~2.5
> years of power-on.
>
>> 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always
>> - 0
>
> This one goes with spin-up time. Absolutely no problems here.
>
>> 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always
>> - 83
>
> Matches start-stop-count. Good. =:^) Since you obviously don't spin
> down except at power-off, this one isn't going to be a problem for you.
>
>> 184 End-to-End_Error 0x0032 098 098 099 Old_age Always
>> FAILING_NOW 2
>
> I /think/ this one is a power-on head self-test head seek from one side
> of the device to the other, and back, covering both ways.
I believe you're correct about this, although I've never seen any
definitive answer anywhere.
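If in doubt, the drive's own self-tests are cheap to run and read back
with smartctl; a quick sketch, assuming the disk is /dev/sda:
  smartctl -t short /dev/sda      # queue a short (couple of minutes) self-test
  smartctl -l selftest /dev/sda   # read the self-test log once it finishes
  smartctl -A /dev/sda            # re-dump the attribute table
It won't pin down what attribute 184 actually measures, but it will at
least show whether the drive flags anything itself.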
>
> Assuming I'm correct on the above guess, the combination of this failing
> for you, and the not yet failing but a non-zero raw-value for raw-read-
> error-rate and seek-error-rate, with the latter's cooked value being
> significantly down if not yet failing, is definitely concerning, as the
> three values all have to do with head seeking errors.
>
> I'd definitely get your data onto something else as soon as possible, tho
> as much of it is backups, you're not in too bad a shape even if you lose
> them, as long as you don't lose the working copy at the same time.
>
> But with all three seek attributes indicating at least some issue and one
> failing, at least get anything off it that is NOT backups ASAP.
>
> And that very likely explains the slowdowns as well, as obviously, while
> all sectors are still readable, it's having to retry multiple times on
> some of them, and that WILL slow things down.
>
>> 188 Command_Timeout 0x0032 100 099 000 Old_age Always
>> - 8590065669
>
> Again, a non-zero raw value indicating command timeouts, probably due to
> those bad seeks. It'll have to retry those commands, and that'll
> definitely mean slowdowns.
>
> Tho there's no threshold, but 99 worst-value cooked isn't horrible.
>
> FWIW, on my spinning rust device this value actually shows a worst of
> 001, here (100 current cooked value, tho), with a threshold of zero,
> however. But as I've experienced no problems with it I'd guess that's an
> aberration. I haven't the foggiest why/how/when it got that 001 worst.
Such an occurrence is actually not unusual when you have particularly
bad sectors on a 'desktop' rated HDD, as they will keep retrying for an
insanely long time to read the bad sector before giving up.
>
>> 189 High_Fly_Writes 0x003a 095 095 000 Old_age Always
>> - 5
>
> Again, this demonstrates a bit of disk wobble or head slop. But with a
> threshold of zero and a value and worst of 95, it doesn't seem to be too
> bad.
>
>> 193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always
>> - 287836
>
> Interesting. My spinning rust has the exact same value and worst of 1,
> threshold 0, and a relatively similar 237181 raw count.
>
> But I don't really know what this counts unless it's actual seeks, and
> mine seems in good health still, certainly far better than the cooked
> value and worst of 1 might suggest.
As far as I understand it, this is an indicator of the number of times
the heads have been loaded and unloaded. It's tracked separately because
there are multiple reasons the heads might get parked without spinning
down the disk (most disks will park them after a period of idleness to
reduce the risk of a head crash, and many modern laptops will park them
on detecting free fall, to protect the disk on impact). It's not unusual
to see values like that for similarly aged disks either, though, so it's
not too worrying.
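If the load cycle count is climbing fast on an otherwise idle disk, the
parking behaviour can often (not always, it's firmware dependent) be
tamed via the APM level; a sketch, with /dev/sdb as a stand-in:
  hdparm -B /dev/sdb       # query the current APM level, if supported
  hdparm -B 254 /dev/sdb   # 128-254: no spin-down, decreasingly aggressive parking
255 disables APM entirely where supported, and the setting generally
doesn't survive a power cycle, so it wants re-applying at boot.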
>
>> 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline
>> - 281032595099550
>
>> OK, head flying hours explains it, drive is over 32 billion years old...
>>
>
> While my spinning rust has this attribute and the cooked values are
> identical 100/253/0, the raw value is reported and formatted entirely
> differently, as 21122 (89 19 0). I don't know what those values are, but
> presumably your big long value reports the others mine does, as well,
> only as a big long combined value.
>
> Which would explain the apparent multi-billion years yours is reporting!
> =:^) It's not a single value, it's multiple values somehow combined.
>
> At least with my power-on hours of 23637, a head-flying hours of 21122
> seems reasonable. (I only recently configured the BIOS to spin down that
> drive after 15 minutes I think, because it's only backups and my media
> partition which isn't mounted all the time anyway, so I might as well
> leave it off instead of idle-spinning when I might not use it for days at
> a time. So a difference of a couple thousand hours between power-on and
> head-flying, on a base of 20K+ hours for both, makes sense given that I
> only recently configured it to spin down.)
>
> But given your ~22K power-on hours, even simply peeling off the first 5
> digits of your raw value would be 28K head-flying, and that doesn't make
> sense for only 22K power-on, so obviously they're using a rather more
> complex formula than that.
This one is tricky, as it's not very clearly defined in the SMART spec.
Most manufacturers just count the total time the head has been loaded.
There are some however who count the time the heads have been loaded,
multiplied by the number of heads. This value still appears to be
incorrect though, as combined with the Power_On_Hours, it implies well
over 1024 heads, which is physically impossible on even a 5.25 inch disk
using modern technology, even using multiple spindles. The fact that
this is so blatantly wrong should be a red flag regarding the disk
firmware or on-board electronics, which just reinforces what Duncan
already said about getting a new disk.
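(Easy enough to sanity-check: dividing the raw value by the power-on
hours gives the implied head count under the hours-times-heads theory,
and a quick shell calculation on the numbers posted above
  echo $(( 281032595099550 / 22098 ))
comes out around 12.7 billion "heads", so whatever the firmware is
counting, it isn't simply hours times heads.)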
* Re: Snapshots slowing system
2016-03-18 11:38 ` Austin S. Hemmelgarn
@ 2016-03-18 17:58 ` Pete
2016-03-18 23:58 ` Duncan
1 sibling, 0 replies; 17+ messages in thread
From: Pete @ 2016-03-18 17:58 UTC (permalink / raw)
To: linux-btrfs
On 03/18/2016 11:38 AM, Austin S. Hemmelgarn wrote:
> This one is tricky, as it's not very clearly defined in the SMART spec.
> Most manufacturers just count the total time the head has been loaded.
> There are some however who count the time the heads have been loaded,
> multiplied by the number of heads. This value still appears to be
> incorrect though, as combined with the Power_On_Hours, it implies well
> over 1024 heads, which is physically impossible on even a 5.25 inch disk
> using modern technology, even using multiple spindles. The fact that
> this is so blatantly wrong should be a red flag regarding the disk
> firmware or on-board electronics, which just reinforces what Duncan
> already said about getting a new disk.
Have got a larger SSD on the way as space looked tight. So, annoyingly,
the wallet has to come out again.
* Re: Snapshots slowing system
2016-03-18 9:17 ` Duncan
2016-03-18 11:38 ` Austin S. Hemmelgarn
@ 2016-03-18 18:16 ` Pete
2016-03-18 18:54 ` Austin S. Hemmelgarn
2016-03-19 1:15 ` Duncan
1 sibling, 2 replies; 17+ messages in thread
From: Pete @ 2016-03-18 18:16 UTC (permalink / raw)
To: linux-btrfs
On 03/18/2016 09:17 AM, Duncan wrote:
> So bottom line regarding that smartctl output, yeah, a new device is
> probably a very good idea at this point. Those smart attributes indicate
> either head slop or spin wobble, and some errors and command timeouts and
> retries, which could well account for your huge slowdowns. Fortunately,
> it's mostly backup, so you have your working copy, but if I'm not mixing
> up my threads, you have some media files, etc, on a different partition
> on it as well, and if you don't have backups elsewhere, getting them onto
> something else ASAP is a very good idea, because this drive does look to
> be struggling, and tho it could continue working in a low usage scenario
> for some time yet, it could also fail rather quickly, as well.
>
This disk is one of a pair of raid1 disks which hold the data on my
system. As you surmised, the machine is generally on 24x7, as it can
just get on with backups and some data grabbing and crunching on its own.
This is a set-up of 2 x 3TB disks completely dedicated to btrfs. I'm
wondering whether the failing one is the older one, wrenched out of a
USB enclosure as it was cheaper than a desktop drive, or the desktop
drive itself? Still academic. I have 1.37TB unallocated, 720GB free
estimated. I'm therefore wondering whether I opt for the cheapest
reasonable desktop drive, a NAS drive advertised for 24x7 use, or a
wallet-frightening 'enterprise drive' that might cost twice as much as
the standard desktop but give me less grief in the long term. Probably
one for comp.os.linux.hardware.
>> Confused. I'm getting one SSD which I intend to use raid0. Seems to me
>> to make no sense to split it in two and put both sides of raid1 on one
>> disk and I reasonably think that you are not suggesting that. Or are
>> you assuming that I'm getting two disks? Or are you saying that buying
>> a second SSD disk is strongly advised? (bearing in mind that it looks
>> like I might need another hdd if the smart field above is worth worrying
>> about).
>
> Well, raid0 normally requires two devices. So either you mean single
> mode on a single device, or you're combining it with another device (or
> more than one more) to do raid0.
Sorry, I confused raid0 with single. The _lone_ system disk contains
the root partition, it is btrfs in single mode.
> So btrfs raid1 has data integrity and repair features that aren't
> available on normal raid1, and thus is highly recommended.
>
> But, raid1 /does/ mean two copies of both data and metadata (assuming of
> course you make them both raid1, as I did), and if you simply don't have
> room to do it that way, you don't have room, highly recommended tho it
> may be.
This looks like a strong recommendation to get a second SSD for the root
partition and go raid1. Are SSDs more flaky than hdds, or are you just a
strong believer in the integrity of raid1?
>
> Tho raid1 shouldn't be considered the same as a backup, because it's
> not. In particular, while you do have reasonable protection against
<snip>
Backup nightly to an external usb hdd with ext4 via rsync. Permanently
connected. Also periodically (when I remember) backup via rsync to
another hdd formatted btrfs, single mode, with snapshots.
Given the discussions here maybe a couple of extra copies of the very
important stuff would not go amiss.
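Something as simple as a second, normally-unplugged rsync target would
probably do for those; a minimal sketch, the paths being made-up
examples:
  mount /dev/disk/by-label/backup2 /mnt/backup2
  rsync -aHAX --delete /home/important/ /mnt/backup2/important/
  umount /mnt/backup2
Alternating two such drives would give roughly the rotation Duncan
described.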
* Re: Snapshots slowing system
2016-03-18 18:16 ` Pete
@ 2016-03-18 18:54 ` Austin S. Hemmelgarn
2016-03-19 0:59 ` Duncan
2016-03-19 1:15 ` Duncan
1 sibling, 1 reply; 17+ messages in thread
From: Austin S. Hemmelgarn @ 2016-03-18 18:54 UTC (permalink / raw)
To: Pete, linux-btrfs
On 2016-03-18 14:16, Pete wrote:
> On 03/18/2016 09:17 AM, Duncan wrote:
>
>> So bottom line regarding that smartctl output, yeah, a new device is
>> probably a very good idea at this point. Those smart attributes indicate
>> either head slop or spin wobble, and some errors and command timeouts and
>> retries, which could well account for your huge slowdowns. Fortunately,
>> it's mostly backup, so you have your working copy, but if I'm not mixing
>> up my threads, you have some media files, etc, on a different partition
>> on it as well, and if you don't have backups elsewhere, getting them onto
>> something else ASAP is a very good idea, because this drive does look to
>> be struggling, and tho it could continue working in a low usage scenario
>> for some time yet, it could also fail rather quickly, as well.
>>
>
> This disk is one of a pair of raid1 disks which hold the data on my
> system. As you surmised, the machine is generally on 24x7, as it can
> just get on with backups and some data grabbing and crunching on its own.
>
> This is a set-up of 2 x 3TB disks completely dedicated to btrfs. I'm
> wondering whether the failing one is the older one, wrenched out of a
> USB enclosure as it was cheaper than a desktop drive, or the desktop
> drive itself? Still academic. I have 1.37TB unallocated, 720GB free
> estimated. I'm therefore wondering whether I opt for the cheapest
> reasonable desktop drive, a NAS drive advertised for 24x7 use, or a
> wallet-frightening 'enterprise drive' that might cost twice as much as
> the standard desktop but give me less grief in the long term. Probably
> one for comp.os.linux.hardware.
Personally, I find that desktop drives generally do fine for 24/7 usage
as long as things aren't constantly being written to and read from them.
For a write-once-read-many workload like most backup setups, there's
not usually a huge advantage to getting high end disks unless you can't
be there to replace them relatively soon after they fail (one disk in a
RAID set failing puts more load on the other disk, thus increasing its
chance of also failing). Desktop disks usually provide error rates
similar to higher-end disks; the big difference is in how they
handle errors. Desktop drives will (usually) keep retrying a read on a
bad sector for multiple minutes before giving up, while NAS drives will
return an error almost immediately, and enterprise drives will let you
configure how long it will retry.
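Where the drive supports it, that retry window is exposed via SCT Error
Recovery Control and is settable from smartctl; a sketch, assuming
/dev/sda (most pure desktop drives just report it unsupported):
  smartctl -l scterc /dev/sda         # query the read/write ERC timers
  smartctl -l scterc,70,70 /dev/sda   # set both to 7.0 seconds (units of 0.1s)
The timers usually reset on power cycle, so they need re-applying at
boot if you rely on them.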
>
>
>>> Confused. I'm getting one SSD which I intend to use raid0. Seems to me
>>> to make no sense to split it in two and put both sides of raid1 on one
>>> disk and I reasonably think that you are not suggesting that. Or are
>>> you assuming that I'm getting two disks? Or are you saying that buying
>>> a second SSD disk is strongly advised? (bearing in mind that it looks
>>> like I might need another hdd if the smart field above is worth worrying
>>> about).
>>
>> Well, raid0 normally requires two devices. So either you mean single
>> mode on a single device, or you're combining it with another device (or
>> more than one more) to do raid0.
>
> Sorry, I confused raid0 with single. The _lone_ system disk contains
> the root partition, it is btrfs in single mode.
Don't feel bad, I made this mistake myself a couple of times at first too.
>
>
>
>> So btrfs raid1 has data integrity and repair features that aren't
>> available on normal raid1, and thus is highly recommended.
>>
>> But, raid1 /does/ mean two copies of both data and metadata (assuming of
>> course you make them both raid1, as I did), and if you simply don't have
>> room to do it that way, you don't have room, highly recommended tho it
>> may be.
>
> This looks like a strong recommendation to get a second SSD for the root
> partition and go raid1. Are SSDs more flaky than hdds, or are you just a
> strong believer in the integrity of raid1?
Generally, SSDs have better reliability in harsh conditions than HDDs:
they can safely handle a wider temperature range, and are pretty much
unaffected by vibration. They fail in different ways, however, so advice
for preventing data loss on HDDs doesn't necessarily apply to SSDs.
Overall though, it really depends on what brand you get. As of right
now, the top three brands of SSD as far as quality IMHO are Intel,
Samsung, and Crucial. I usually go with Crucial myself because they are
almost on-par with the other two, give more deterministic performance
(their peak performance is often lower, but I'm willing to sacrifice a
bit of performance to get consistency across operating conditions), and
cost less (sometimes less than half as much as an equivalently sized
Intel or Samsung SSD). Kingston, SanDisk, ADATA, Transcend, and Micron
are generally OK, but sometimes have issues with data loss when they
lose power unexpectedly (this likely won't be an issue for you though if
you have a system that's on 24/7). The only brand I would actively
avoid is OCZ, as they've had numerous issues with reliability and data
integrity over multiple revisions of multiple models of SSD.
* Re: Snapshots slowing system
2016-03-18 11:38 ` Austin S. Hemmelgarn
2016-03-18 17:58 ` Pete
@ 2016-03-18 23:58 ` Duncan
1 sibling, 0 replies; 17+ messages in thread
From: Duncan @ 2016-03-18 23:58 UTC (permalink / raw)
To: linux-btrfs
Austin S. Hemmelgarn posted on Fri, 18 Mar 2016 07:38:29 -0400 as
excerpted:
>>> 188 Command_Timeout 0x0032 100 099 000 Old_age
>>> Always
>>> - 8590065669
>>
>> Again, a non-zero raw value indicating command timeouts, probably due
>> to those bad seeks. It'll have to retry those commands, and that'll
>> definitely mean slowdowns.
>>
>> Tho there's no threshold, but 99 worst-value cooked isn't horrible.
>>
>> FWIW, on my spinning rust device this value actually shows a worst of
>> 001, here (100 current cooked value, tho), with a threshold of zero,
>> however. But as I've experienced no problems with it I'd guess that's
>> an aberration. I haven't the foggiest why/how/when it got that 001
>> worst.
> Such an occurrence is actually not unusual when you have particularly
> bad sectors on a 'desktop' rated HDD, as they will keep retrying for an
> insanely long time to read the bad sector before giving up.
Which is why it's mystifying to me how it could be reporting a worst-
value 1, when the device seems to be working just fine, and I don't
recall even one event of waiting "an insanely long time", or even
anything out of the ordinary, for anything on that device, ever.
Tho I suppose it's within reason that whatever it was froze up the system
bad enough that I rebooted, and I attributed the one-off to something
else. But with no other attributes indicating issues, I remain clueless
as to what might have happened and why that 1-worst, particularly
given the 0 threshold for that attribute and that it's an old-age
indicator rather than a fail indicator. The device is neither that
old nor, as I said, in any other way indicating anything close to what
that 1-worst value for just that single attribute implies.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Snapshots slowing system
2016-03-18 18:54 ` Austin S. Hemmelgarn
@ 2016-03-19 0:59 ` Duncan
0 siblings, 0 replies; 17+ messages in thread
From: Duncan @ 2016-03-19 0:59 UTC (permalink / raw)
To: linux-btrfs
Austin S. Hemmelgarn posted on Fri, 18 Mar 2016 14:54:54 -0400 as
excerpted:
> As of right now, the top three brands of SSD as far as quality IMHO are
> Intel, Samsung, and Crucial. I usually go with Crucial myself because
> they are almost on-par with the other two, give more deterministic
> performance (their peak performance is often lower, but I'm willing to
> sacrifice a bit of performance to get consistency across operating
> conditions), and cost less (sometimes less than half as much as an
> equivalently sized Intel or Samsung SSD). Kingston, SanDisk, ADATA,
> Transcend, and Micron are generally OK, but sometimes have issues with
> data loss when they lose power unexpectedly (this likely won't be an
> issue for you though if you have a system that's on 24/7). The only
> brand I would actively avoid is OCZ, as they've had numerous issues with
> reliability and data integrity over multiple revisions of multiple
> models of SSD.
Thanks. This is useful information for me as well, because while I'm not
in the /immediate/ market for SSDs ATM, I'm relatively likely to be doing
some new machines later this year, and will likely either be getting new
ssds for them or will be getting newer and bigger for my main machine and
will be putting the current 256 GiB main machine ssds in the smaller
machines. So I was recently looking at prices on pricewatch.com, and
wondering again about brands. I got relatively lucky with my first ssds
purchased some years ago when I knew little about them (except that one
went bad prematurely, but it has been replaced with one of the others
now), but know much more about the technology now. Only thing is, I knew
nothing about which brands were in general good or best stayed away
from, in order to narrow down the search a bit, so this was really
helpful, particularly the Crucial bit, as I already know Intels aren't
realistically in my price range.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Snapshots slowing system
2016-03-18 18:16 ` Pete
2016-03-18 18:54 ` Austin S. Hemmelgarn
@ 2016-03-19 1:15 ` Duncan
1 sibling, 0 replies; 17+ messages in thread
From: Duncan @ 2016-03-19 1:15 UTC (permalink / raw)
To: linux-btrfs
Pete posted on Fri, 18 Mar 2016 18:16:50 +0000 as excerpted:
> On 03/18/2016 09:17 AM, Duncan wrote:
>> So btrfs raid1 has data integrity and repair features that aren't
>> available on normal raid1, and thus is highly recommended.
>>
>> But, raid1 /does/ mean two copies of both data and metadata (assuming
>> of course you make them both raid1, as I did), and if you simply don't
>> have room to do it that way, you don't have room, highly recommended
>> tho it may be.
>
> This looks like a strong recommendation to get a second SSD for the root
> partition and go raid1. Are SSDs more flakey that hdd or are you just a
> strong believer in the integrity of raid1?
As Austin says, I'd generally consider ssds /more/ reliable than hdds, at
least as long as you stay away from the OCZs, etc (but then again, there
are spinning rust brands and specific models I stay away from, as well),
but the failure modes are a bit different so it's not always as simple as
that.
But I played with raid1 before btrfs and find the additional data
integrity features that btrfs raid1 brings even more compelling, so yes,
it would indeed be fair to say that I'm a strong booster of btrfs raid1
in particular. =:^)
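(For completeness, getting from single to raid1 once a second device is
in hand is just a device add plus a convert balance; the device and
mountpoint here are examples only:
  btrfs device add /dev/sdb3 /
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /
...after which a periodic "btrfs scrub start /" lets the checksums
repair a bad copy from its good twin.)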
The roadmapped but still to come feature I'm /really/ looking forward to,
however, is N-way-mirroring, because btrfs raid1 is currently very
specifically two copies, regardless of how many devices there are, and I
would really /really/ like the choice of three copies. That's not
just for device-failure protection: with btrfs checksumming and data
integrity, if one copy is found to be bad for whatever reason (a crash
before all copies were written, a failing device, a simple bad block on
an otherwise fine device, gamma-ray damage, or the like), right now the
other copy had BETTER come out checksum-verified, or that data or
metadata is toast. I'd rest far easier knowing that even if one copy
failed, there was still not just one remaining copy but two to fall
back on.
With N-way-mirroring, of course that could be 4 or more copies as well,
but three is closest to my sweet spot balance between cost and extreme
reliability, and I'd very much like to have that choice as an option.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Snapshots slowing system
2016-03-12 13:01 pete
@ 2016-03-13 3:28 ` Duncan
0 siblings, 0 replies; 17+ messages in thread
From: Duncan @ 2016-03-13 3:28 UTC (permalink / raw)
To: linux-btrfs
pete posted on Sat, 12 Mar 2016 13:01:17 +0000 as excerpted:
> I hope this message stays within the thread on the list. I had email
> problems and ended up hacking around with sendmail & grabbing the
> message id off of the web based group archives.
Looks like it should have as the reply-to looks right, but at least on
gmane's news/nntp archive of the list (which is how I read and reply), it
didn't. But the thread was found easily enough.
>>I wondered whether you had eliminated fragmentation, or any other known
>>gotchas, as a cause?
>
> Subvolumes are mounted with the following options:
> autodefrag,relatime,compress=lzo,subvol=<sub vol name>
That relatime (which is the default), could be an issue. See below.
> Not sure if there is much else to do about fragmentation apart from
> running a balance, which would probably make the machine very sluggish for
> a day or so.
>
>>Out of curiosity, what is/was the utilisation of the disk? Were the
>>snapshots read-only or read-write?
>
> root@phoenix:~# btrfs fi df /
> Data, single: total=101.03GiB, used=97.91GiB
> System, single: total=32.00MiB, used=16.00KiB
> Metadata, single: total=8.00GiB, used=5.29GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> root@phoenix:~# btrfs fi df /home
> Data, RAID1: total=1.99TiB, used=1.97TiB
> System, RAID1: total=32.00MiB, used=352.00KiB
> Metadata, RAID1: total=53.00GiB, used=50.22GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
Normally when posting, either btrfs fi df *and* btrfs fi show are
needed, /or/ (with a new enough btrfs-progs) btrfs fi usage. And of
course the kernel (4.0.4 in your case) and btrfs-progs (not posted, that
I saw) versions.
Btrfs fi df shows the chunk allocation and usage within the chunks, but
does not show the size of the filesystem or of individual devices. Btrfs
fi show, shows that, but not the chunk allocation and usage info. Btrfs
fi usage shows both, but it's a newer command that isn't available on old
btrfs-progs, and was buggy for some layouts (raid56 and mixed-mode, where
the bugs would cause the numbers to go negative, which would appear as
EiB free (I wish!!)) until relatively recently.
> Hmm. The system disk is getting a little tight. cfdisk reports the
> partition I use for btrfs containing root as 127GB approx. Not sure why
> it grows so much. Suspect that software updates don't help, as snapshots
> will retain the legacy versions. On the other hand they can be useful.
With the 127 GiB (I _guess_ it's GiB, 1024, not GB, 1000, multiplier,
btrfs consistently uses the 1024 multiplier and properly specifies it
using the XiB notation) for /, however, and the btrfs fi df sizes of 101
GiB plus data and 8 GiB metadata (with system's 32 MiB a rounding error
and global reserve actually taken from metadata, so it doesn't add to
chunk reservation on its own) we can see that as you mention, it's
starting to get tight, a bit under 110 GiB of 127 GiB, but that 17 GiB
free isn't horrible, just slightly tight, as you said.
Tho it'll obviously be tighter if that's 127 GB, 1000 multiplier...
It's tight enough that particularly with the regular snapshotting, btrfs
might be having to fragment more than it'd like. Tho kudos for the
_excellent_ snapshot rotation. We regularly see folks in here with 100K
or more snapshots per filesystem, and btrfs _does_ have scaling issues in
that case. But your rotation seems to be keeping it well below the 1-3K
snapshots per filesystem recommended max, so that's obviously NOT your
problem, unless of course the snapshot deletion bugged out and they
aren't being deleted as they should.
(Of course, you can check that by listing them, and I would indeed double-
check, as that _is_ the _usual_ problem we have with snapshots slowing
things down, simply too many of them, hitting the known scaling issues
btrfs had with over 10K snapshots per filesystem. But FWIW I don't use
snapshots here and thus don't deal with snapshots command-level detail.)
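(From the docs, something like the following should give the per-
filesystem count, tho as I said I don't run snapshots here, so treat it
as an untested sketch:
  btrfs subvolume list -s / | wc -l
with -s limiting the listing to snapshots; repeat for /home.)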
But as I mentioned above, that relatime mount option isn't your best
choice, in the presence of heavy snapshotting. Unless you KNOW you need
atimes for something or other, noatime is _strongly_ recommended with
snapshotting, because relatime, while /relatively/ better than
strictatime, still updates atimes once a day for files you're accessing
at least that frequently.
And that interacts badly with snapshots, particularly where few of the
files themselves have changed, because in that case, a large share of the
changes from one snapshot to another are going to be those atime updates
themselves. Ensuring that you're always using noatime avoids the atime
updates entirely (well, unless the file itself changes and thus mtime
changes as well), which should, in the normal most files unchanged
snapshotting context, make for much smaller snapshot-exclusive sizes.
And you mention below that the snapshots are read-write, but generally
used as read-only. Does that include actually mounting them read-only?
Because if not, and if they too are mounted the default relatime,
accessing them is obviously going to be updating atimes the relatime-
default once per day there as well... triggering further divergence of
snapshots from the subvolumes they are snapshots of and from each other...
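Concretely, that means fstab entries along these lines (device and
subvolume here are placeholders, just your posted options with noatime
swapped in for relatime):
  UUID=<fs uuid>  /home  btrfs  noatime,autodefrag,compress=lzo,subvol=<sub vol name>  0 0
And any snapshot you really do treat as read-only can be flipped
read-only in place, which sidesteps its atime updates entirely:
  btrfs property set -ts /path/to/snapshot ro true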
> Is it likely the SSD? If likely I could get a larger one, now is a good
> time with a new version of slackware imminent. However, no point in
> spending money for the sake of it.
Not directly btrfs related, but when you do buy a new ssd, now or later,
keep in mind that a lot of authorities recommend that for ssds you buy
10-33% larger than you plan on actually provisioning, and that you leave
that extra space entirely unprovisioned -- either leave that extra space
entirely unpartitioned, or partition it, but don't put filesystems or
anything else (swap, etc) on it. This leaves those erase-blocks free to
be used by the FTL for additional wear-leveling block-swap, thus helping
maintain device speed as it ages, and with good wear-leveling firmware,
should dramatically increase device usable lifetime, as well.
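The mechanics of that are simple enough; with parted on a hypothetical
new /dev/sdb, for instance, partitioning only the first ~85% and leaving
the tail untouched:
  parted -s /dev/sdb mklabel gpt
  parted -s /dev/sdb mkpart root 1MiB 85%
Strictly speaking it's never-written blocks that matter to the FTL,
which unpartitioned space guarantees without any further discipline.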
FWIW, I ended up going rather overboard with that here, as I knew I
needed a bit under 128 GiB (1024, I was trying to fit it in 100 GiB, so I
could get 120 or 128 GB (1000) and use the extra as slack, but that was
going to be tighter than I actually wanted) and thus thought I'd get 140
GB (1000) or so devices, but I ended up getting 256 GB (1000), as that's
what was both in-stock and at a reasonable price and performance level.
Of course that meant I spent somewhat more, but I put it on credit and
paid it off in 2-3 months, before the interest ate up _all_ the price
savings I got on it. So I ended up being able to put onto the SSD a
couple more partitions that I had planned to keep on spinning rust, and
was _still_ only at 130 GiB or so, still close to only 50% actually
partitioned and used.
But it has been nice since I basically don't need to worry about trim/
discard at all, tho I do have a cronjob setup to run fstrim every week or
so. And given the price on the 256 GB ssds, I actually didn't spend
_that_ much more on them than I would have on 160 GB or 200 GB devices --
well either that or I'd have had to wait for them to get more in stock,
since all the good-price/performance devices were out of stock in the
120-200 GB range.
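(The cronjob is nothing fancy; as a sketch, with the path possibly
varying by distro, a weekly crontab line such as:
  0 3 * * 0  /usr/sbin/fstrim -av
where -a trims every mounted filesystem that supports discard and -v
reports how much each trim covered.)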
> All snapshots read-write. However, I have mainly treated them as
> read-only. Does that make a difference?
See above. It definitely will if you're not using noatime when mounting
them.
>>Apropos Nada: quick shout out to Qu to wish him luck for the 4.6 merge.
>
> I'm wondering if it is time for an update from 4.0.4?
The going list recommendation is to choose either current kernel track or
LTS kernel track. If you choose current kernel, the recommendation is to
stick within 1-2 kernel cycles of newest current, which with 4.5 about to
come out, means you would be on 4.3 at the oldest, and be looking at 4.4
by now, again, on the current kernel track.
If you choose LTS kernels, until recently, the recommendation was again
the latest two, but here LTS kernel cycles. That would be 4.4 as the
newest LTS and 4.1 previous to that. However, 3.18, the LTS kernel
previous to 4.1, has been holding up reasonably well, so while 4.1 would
be preferred, 3.18 remains reasonably well supported as well.
You're on 4.0, which isn't an LTS kernel series and is thus, along with
4.2, out of upstream's support window. So it's past time to look at
updating. =:^) Given that you obviously do _not_ follow the last couple
current kernels rule, I'd strongly recommend that you consider switching
to an LTS kernel, and given that you're on 4.0 now, the 4.1 or 4.4 LTS
kernels would be your best candidates. 4.1 should be supported for quite
some time yet, both btrfs-wise and in general, and would be the minimal
incremental upgrade, but of course if your object is to upgrade as far as
you reasonably can when you /do/ upgrade, 4.4, the latest LTS, is perhaps
your best candidate.
In normal operation, the btrfs-progs userspace version isn't as critical,
as long as it has support for the features you're using, of course,
because for most normal runtime tasks, all progs does is make the
appropriate calls to the kernel to do the real work anyway. But as soon
as you find yourself trying to fix a filesystem that isn't working
properly and possibly won't mount, btrfs-progs version becomes more
critical, as the newest versions can fix more bugs than older versions,
which didn't know about the bugs discovered since then.
As a result, a reasonable userspace rule of thumb is to use at _least_ a
version corresponding to your kernel. Newer is fine as well, but using
at _least_ a version corresponding to your kernel means you're running a
userspace that was developed with that kernel in mind, and also, as long
as you're following kernel recommendations already, nicely keeps your
userspace from getting /too/ outdated, to the point that the commands and
output are enough different from current userspace to create problems
when you post command output to the list, etc.
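A quick way to eyeball that rule of thumb on any given box:
  uname -r         # running kernel version
  btrfs --version  # btrfs-progs version; ideally >= the kernel's major.minor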
>>[Also, damn you autocorrection on my phone!]
>
> Yep!
I'm one of those folks who still doesn't have a cell phone -- tho I have
a VoIP adaptor hooked up to my internet, and a cordless phone
attached to it (and pay... about $30/year to a VoIP phone service for a
phone number and US-domestic dialing without additional fees... tho
obviously I have to keep an internet connection to keep that working.
But that's also why I don't have a cell: at the pitifully small
full-speed data limits, I can't switch to cell for data, and it's simply
not cost-effective for voice when I can get full US phone coverage at no
additional cost for what amounts to $2.50/mo.).
But FWIW, if you've not already discovered it, punch 'phone autocorrect'
into youtube some day when you have some time, and be prepared to spend
a few hours laughing your *** off!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Snapshots slowing system
@ 2016-03-12 13:01 pete
2016-03-13 3:28 ` Duncan
0 siblings, 1 reply; 17+ messages in thread
From: pete @ 2016-03-12 13:01 UTC (permalink / raw)
To: linux-btrfs
I hope this message stays within the thread on the list. I had email problems
and ended up hacking around with sendmail & grabbing the message id off of
the web based group archives.
>I wondered whether you had eliminated fragmentation, or any other known gotchas,
>as a cause?
Subvolumes are mounted with the following options:
autodefrag,relatime,compress=lzo,subvol=<sub vol name>
Not sure if there is much else to do about fragmentation apart from running a
balance, which would probably make the machine very sluggish for a day or so.
>Out of curiosity, what is/was the utilisation of the disk? Were the snapshots
>read-only or read-write?
root@phoenix:~# btrfs fi df /
Data, single: total=101.03GiB, used=97.91GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=8.00GiB, used=5.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
root@phoenix:~# btrfs fi df /home
Data, RAID1: total=1.99TiB, used=1.97TiB
System, RAID1: total=32.00MiB, used=352.00KiB
Metadata, RAID1: total=53.00GiB, used=50.22GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Hmm. The system disk is getting a little tight. cfdisk reports the partition I
use for btrfs containing root as 127GB approx. Not sure why it grows so much.
Suspect that software updates don't help, as snapshots will retain the legacy
versions. On the other hand they can be useful.
Is it likely the SSD? If likely I could get a larger one, now is a good time with
a new version of slackware imminent. However, no point in spending money for
the sake of it.
All snapshots read-write. However, I have mainly treated them as read-only.
Does that make a difference?
>Apropos Nada: quick shout out to Qu to wish him luck for the 4.6 merge.
I'm wondering if it is time for an update from 4.0.4?
>[Also, damn you autocorrection on my phone!]
Yep!
--
* Re: Snapshots slowing system
2016-03-11 20:03 Pete
@ 2016-03-11 23:38 ` boris
0 siblings, 0 replies; 17+ messages in thread
From: boris @ 2016-03-11 23:38 UTC (permalink / raw)
To: linux-btrfs
I wondered whether you had eliminated fragmentation, or any other known gotchas,
as a cause?
Out of curiosity, what is/was the utilisation of the disk? Were the snapshots
read-only or read-write?
Apropos Nada: quick shout out to Qu to wish him luck for the 4.6 merge.
[Also, damn you autocorrection on my phone!]
* Snapshots slowing system
@ 2016-03-11 20:03 Pete
2016-03-11 23:38 ` boris
0 siblings, 1 reply; 17+ messages in thread
From: Pete @ 2016-03-11 20:03 UTC (permalink / raw)
To: linux-btrfs
I thought I would post this in case it was useful info for the list. No
help needed, as I have a fix (sort of).
I've a PC with a 3-core Phenom 720 CPU and 8GB of RAM. / is on a RAID0
SSD btrfs file system, and the data (home directories, various bits of
data, etc) is on 2x3TB disks at btrfs RAID1. The data file system
spreads the data across about 5 or 6 subvolumes for ease of management.
Kernel 4.0.4.
I wrote a script which performs snapshots for appropriate subvolumes on
each file system. Hourly snapshots were taken and kept for 24 hours
before deletion. Daily ones 30 days and weekly ones about a year. So
each sub-volume had approx 86 snapshots. This worked well with the odd
sluggish response from the file system, but these were infrequent and I
was happy to accept them given the benefits of subvolumes & snapshots.
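For anyone curious, the shape of such a rotation (not my actual script,
just a minimal sketch of the hourly case, with paths simplified):
  #!/bin/sh
  SUBVOL=/home
  SNAPDIR=/home/.snapshots
  NOW=$(date +%s)
  # take a new hourly snapshot, named by epoch seconds
  btrfs subvolume snapshot "$SUBVOL" "$SNAPDIR/hourly-$NOW"
  # drop hourly snapshots older than 24 hours (86400 s)
  for snap in "$SNAPDIR"/hourly-*; do
      [ -d "$snap" ] || continue
      ts=${snap##*hourly-}
      [ $(( NOW - ts )) -gt 86400 ] && btrfs subvolume delete "$snap"
  done
The daily and weekly tiers work the same way with longer retention.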
Over the past few weeks I had noticed a degradation in performance, to
the point where the system would pause with busy disks when trying to do
anything that might involve them, until it got to an unacceptable state.
Not sure that anything had changed, but the slowness came on over a
period of a couple of weeks.
I fixed this by disabling the hourly snapshots and deleting them. The
system is back to normal. Thought I would share in case there is any
value in this info for the devs.
Kind regards,
Pete