* btrfs-transaction blocked for more than 120 seconds
@ 2013-12-31 11:46 Sulla
2014-01-01 12:37 ` Duncan
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Sulla @ 2013-12-31 11:46 UTC (permalink / raw)
To: linux-btrfs
Dear all!
On my Ubuntu Server 13.10 I use a RAID5 block device consisting of 3 WD20EARS
drives. On this I built an LVM, and in this LVM I use quite normal partitions
/, /home, SWAP (/boot resides on a RAID1), and also a custom /data
partition. Everything (except boot and swap) is on btrfs.
Sometimes my system hangs for quite some time (top is showing a high wait
percentage), then runs on normally. I get kernel messages into
/var/log/syslog, see below. I am unable to make any sense of the kernel
messages; there is no reference to the filesystem or drive affected (at
least I cannot find one).
Question: What is happening here?
* Is an HDD failing? (SMART looks good, however.)
* Is something wrong with my btrfs filesystem? With which one?
* How can I find the cause?
thanks, Wolfgang
Dec 31 12:27:49 freedom kernel: [ 4681.264112] INFO: task
btrfs-transacti:529 blocked for more than 120 seconds.
Dec 31 12:27:49 freedom kernel: [ 4681.264239] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 31 12:27:49 freedom kernel: [ 4681.264367] btrfs-transacti D
ffff88013fc14580 0 529 2 0x00000000
Dec 31 12:27:49 freedom kernel: [ 4681.264377] ffff880138345e10
0000000000000046 ffff880138345fd8 0000000000014580
Dec 31 12:27:49 freedom kernel: [ 4681.264386] ffff880138345fd8
0000000000014580 ffff880135615dc0 ffff880132fb6a00
Dec 31 12:27:49 freedom kernel: [ 4681.264393] ffff880133f45800
ffff880138345e30 ffff880137ee2000 ffff880137ee2070
Dec 31 12:27:49 freedom kernel: [ 4681.264402] Call Trace:
Dec 31 12:27:49 freedom kernel: [ 4681.264418] [<ffffffff816eaa79>]
schedule+0x29/0x70
Dec 31 12:27:49 freedom kernel: [ 4681.264477] [<ffffffffa032a57d>]
btrfs_commit_transaction+0x34d/0x980 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.264487] [<ffffffff81085580>] ?
wake_up_atomic_t+0x30/0x30
Dec 31 12:27:49 freedom kernel: [ 4681.264517] [<ffffffffa0321be5>]
transaction_kthread+0x1a5/0x240 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.264548] [<ffffffffa0321a40>] ?
verify_parent_transid+0x150/0x150 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.264557] [<ffffffff810847b0>]
kthread+0xc0/0xd0
Dec 31 12:27:49 freedom kernel: [ 4681.264565] [<ffffffff810846f0>] ?
kthread_create_on_node+0x120/0x120
Dec 31 12:27:49 freedom kernel: [ 4681.264573] [<ffffffff816f566c>]
ret_from_fork+0x7c/0xb0
Dec 31 12:27:49 freedom kernel: [ 4681.264580] [<ffffffff810846f0>] ?
kthread_create_on_node+0x120/0x120
Dec 31 12:27:49 freedom kernel: [ 4681.264610] INFO: task kworker/u4:0:9975
blocked for more than 120 seconds.
Dec 31 12:27:49 freedom kernel: [ 4681.264722] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 31 12:27:49 freedom kernel: [ 4681.264847] kworker/u4:0 D
ffff88013fd14580 0 9975 2 0x00000000
Dec 31 12:27:49 freedom kernel: [ 4681.264861] Workqueue: writeback
bdi_writeback_workfn (flush-btrfs-4)
Dec 31 12:27:49 freedom kernel: [ 4681.264865] ffff8800a8739538
0000000000000046 ffff8800a8739fd8 0000000000014580
Dec 31 12:27:49 freedom kernel: [ 4681.264873] ffff8800a8739fd8
0000000000014580 ffff8801351e5dc0 ffff8801351e5dc0
Dec 31 12:27:49 freedom kernel: [ 4681.264880] ffff880134c5e6a8
ffff880134c5e6b0 ffffffff00000000 ffff880134c5e6b8
Dec 31 12:27:49 freedom kernel: [ 4681.264887] Call Trace:
Dec 31 12:27:49 freedom kernel: [ 4681.264895] [<ffffffff816eaa79>]
schedule+0x29/0x70
Dec 31 12:27:49 freedom kernel: [ 4681.264902] [<ffffffff816ec465>]
rwsem_down_write_failed+0x105/0x1e0
Dec 31 12:27:49 freedom kernel: [ 4681.264911] [<ffffffff8136257d>] ?
__rwsem_do_wake+0xdd/0x160
Dec 31 12:27:49 freedom kernel: [ 4681.264918] [<ffffffff81369763>]
call_rwsem_down_write_failed+0x13/0x20
Dec 31 12:27:49 freedom kernel: [ 4681.264927] [<ffffffff816e9e7d>] ?
down_write+0x2d/0x30
Dec 31 12:27:49 freedom kernel: [ 4681.264956] [<ffffffffa030fbe0>]
cache_block_group+0x290/0x3b0 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.264963] [<ffffffff81085580>] ?
wake_up_atomic_t+0x30/0x30
Dec 31 12:27:49 freedom kernel: [ 4681.264991] [<ffffffffa0317d48>]
find_free_extent+0xa38/0xac0 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265022] [<ffffffffa0317ef2>]
btrfs_reserve_extent+0xa2/0x1c0 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265056] [<ffffffffa033103d>]
__cow_file_range+0x15d/0x4a0 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265090] [<ffffffffa0331efa>]
cow_file_range+0x8a/0xd0 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265122] [<ffffffffa0332290>]
run_delalloc_range+0x350/0x390 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265158] [<ffffffffa0346bf1>] ?
find_lock_delalloc_range.constprop.42+0x1d1/0x1f0 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265194] [<ffffffffa0348764>]
__extent_writepage+0x304/0x750 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265202] [<ffffffff8109a1d5>] ?
set_next_entity+0x95/0xb0
Dec 31 12:27:49 freedom kernel: [ 4681.265212] [<ffffffff810115c6>] ?
__switch_to+0x126/0x4b0
Dec 31 12:27:49 freedom kernel: [ 4681.265221] [<ffffffff8104dee9>] ?
default_spin_lock_flags+0x9/0x10
Dec 31 12:27:49 freedom kernel: [ 4681.265229] [<ffffffff8113f6c1>] ?
find_get_pages_tag+0xd1/0x180
Dec 31 12:27:49 freedom kernel: [ 4681.265266] [<ffffffffa0348e32>]
extent_write_cache_pages.isra.31.constprop.46+0x282/0x3e0 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265303] [<ffffffffa034928d>]
extent_writepages+0x4d/0x70 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265336] [<ffffffffa032ea90>] ?
btrfs_real_readdir+0x5c0/0x5c0 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265369] [<ffffffffa032caa8>]
btrfs_writepages+0x28/0x30 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265378] [<ffffffff8114a4ae>]
do_writepages+0x1e/0x40
Dec 31 12:27:49 freedom kernel: [ 4681.265387] [<ffffffff811ce7d0>]
__writeback_single_inode+0x40/0x220
Dec 31 12:27:49 freedom kernel: [ 4681.265395] [<ffffffff811ceb4b>]
writeback_sb_inodes+0x19b/0x3b0
Dec 31 12:27:49 freedom kernel: [ 4681.265403] [<ffffffff811cedff>]
__writeback_inodes_wb+0x9f/0xd0
Dec 31 12:27:49 freedom kernel: [ 4681.265411] [<ffffffff811cf623>]
wb_writeback+0x243/0x2c0
Dec 31 12:27:49 freedom kernel: [ 4681.265418] [<ffffffff811d1489>]
bdi_writeback_workfn+0x1b9/0x3d0
Dec 31 12:27:49 freedom kernel: [ 4681.265426] [<ffffffff8107d05c>]
process_one_work+0x17c/0x430
Dec 31 12:27:49 freedom kernel: [ 4681.265432] [<ffffffff8107dcac>]
worker_thread+0x11c/0x3c0
Dec 31 12:27:49 freedom kernel: [ 4681.265439] [<ffffffff8107db90>] ?
manage_workers.isra.24+0x2a0/0x2a0
Dec 31 12:27:49 freedom kernel: [ 4681.265447] [<ffffffff810847b0>]
kthread+0xc0/0xd0
Dec 31 12:27:49 freedom kernel: [ 4681.265454] [<ffffffff810846f0>] ?
kthread_create_on_node+0x120/0x120
Dec 31 12:27:49 freedom kernel: [ 4681.265461] [<ffffffff816f566c>]
ret_from_fork+0x7c/0xb0
Dec 31 12:27:49 freedom kernel: [ 4681.265469] [<ffffffff810846f0>] ?
kthread_create_on_node+0x120/0x120
Dec 31 12:27:49 freedom kernel: [ 4681.265476] INFO: task smbd:10275 blocked
for more than 120 seconds.
Dec 31 12:27:49 freedom kernel: [ 4681.265579] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 31 12:27:49 freedom kernel: [ 4681.265704] smbd D
ffff88013fc14580 0 10275 723 0x00000004
Dec 31 12:27:49 freedom kernel: [ 4681.265711] ffff8800a5abbbc0
0000000000000046 ffff8800a5abbfd8 0000000000014580
Dec 31 12:27:49 freedom kernel: [ 4681.265718] ffff8800a5abbfd8
0000000000014580 ffff880133d5aee0 ffff880137ee2000
Dec 31 12:27:49 freedom kernel: [ 4681.265726] ffff880133db79e8
ffff880133db79e8 0000000000000001 ffff880132d2dc80
Dec 31 12:27:49 freedom kernel: [ 4681.265733] Call Trace:
Dec 31 12:27:49 freedom kernel: [ 4681.265739] [<ffffffff816eaa79>]
schedule+0x29/0x70
Dec 31 12:27:49 freedom kernel: [ 4681.265772] [<ffffffffa03296df>]
wait_current_trans.isra.18+0xbf/0x120 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265778] [<ffffffff81085580>] ?
wake_up_atomic_t+0x30/0x30
Dec 31 12:27:49 freedom kernel: [ 4681.265810] [<ffffffffa032af06>]
start_transaction+0x356/0x520 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265843] [<ffffffffa032b0eb>]
btrfs_start_transaction+0x1b/0x20 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265876] [<ffffffffa0334887>]
btrfs_cont_expand+0x1c7/0x460 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265911] [<ffffffffa033cc26>]
btrfs_file_aio_write+0x346/0x520 [btrfs]
Dec 31 12:27:49 freedom kernel: [ 4681.265919] [<ffffffff811b9810>] ?
poll_select_copy_remaining+0x130/0x130
Dec 31 12:27:49 freedom kernel: [ 4681.265928] [<ffffffff811a6640>]
do_sync_write+0x80/0xb0
Dec 31 12:27:49 freedom kernel: [ 4681.265936] [<ffffffff811a6d7d>]
vfs_write+0xbd/0x1e0
Dec 31 12:27:49 freedom kernel: [ 4681.265942] [<ffffffff811a7932>]
SyS_pwrite64+0x72/0xb0
Dec 31 12:27:49 freedom kernel: [ 4681.265949] [<ffffffff816f571d>]
system_call_fastpath+0x1a/0x1f
--
For a successful technology, reality must take precedence over
public relations, for Nature cannot be fooled.
Richard P. Feynman
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla
@ 2014-01-01 12:37 ` Duncan
2014-01-01 20:08 ` Sulla
2014-01-03 17:25 ` Marc MERLIN
2014-01-02 8:49 ` Jojo
2014-01-05 20:32 ` Chris Murphy
2 siblings, 2 replies; 31+ messages in thread
From: Duncan @ 2014-01-01 12:37 UTC (permalink / raw)
To: linux-btrfs
Sulla posted on Tue, 31 Dec 2013 12:46:04 +0100 as excerpted:
> On my Ubuntu Server 13.10 I use a RAID5 block device consisting of 3
> WD20EARS drives. On this I built an LVM, and in this LVM I use quite
> normal partitions /, /home, SWAP (/boot resides on a RAID1), and also a
> custom /data partition. Everything (except boot and swap) is on btrfs.
>
> Sometimes my system hangs for quite some time (top is showing a high
> wait percentage), then runs on normally. I get kernel messages into
> /var/log/syslog, see below. I am unable to make any sense of the kernel
> messages; there is no reference to the filesystem or drive affected (at
> least I cannot find one).
>
> Question: What is happening here?
> * Is an HDD failing? (SMART looks good, however.)
> * Is something wrong with my btrfs filesystem? With which one?
> * How can I find the cause?
>
> Dec 31 12:27:49 freedom kernel: [ 4681.264112] INFO: task
> btrfs-transacti:529 blocked for more than 120 seconds.
>
> Dec 31 12:27:49 freedom kernel: [ 4681.264239] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
First to put your mind at rest, no, it's unlikely that your hardware is
failing; and it's not an indication of a filesystem bug either. Rather,
it's a characteristic of btrfs behavior in certain corner-cases, and yes,
you /can/ do something about it with some relatively minor btrfs
configuration adjustments... altho on spinning rust at multi-terabyte
sizes, those otherwise minor adjustments might take some time (hours)!
There seem to be two primary btrfs triggers for these "blocked for more
than N seconds" messages. One is COW-related (COW=copy-on-write, the
basis of BTRFS) fragmentation, the other is many-hardlink related. The
only scenario-trigger I've seen for the many-hardlink case, however, has
been when people are using a hardlink-based backup scheme, which you
don't mention, so I'd guess it's the COW-related trigger for you.
A bit of background on COW (assuming I get this correct; I don't claim
to be an expert on it): In general, copy-on-write is a data handling
technique where any modification to the original data is made out-of-line
from the original, then the extent map (be it memory extent map for in-
memory COW applications, or on-device data extent map for filesystems,
or...) is modified, replacing the original inline extent index with that
of the new modification.
The advantage of COW for filesystems, over in-place-modification, is that
should the system crash at just the right (wrong?) moment, before the
full record has been written, an in-place-modification may corrupt the
entire file (or worse yet, the metadata for a whole bunch of files,
effectively killing them all!), while with COW the update is atomic -- at
least in theory, it has either been fully written and you get the new
version, or the remapping hasn't yet occurred and you get the old version
-- no corrupted case which is, if you're lucky, part new and part old, and
if you're unlucky, has something entirely unrelated and very possibly
binary in the middle of what might previously have been, for example, a
plain-text config file.
However, COW-based filesystems work best when most updates either replace
the entire file, or append to the end of the file, luckily the most
common case. COW's primary down side in filesystem implementations is
that for use-cases where only a small piece of the file somewhere in the
middle is modified and saved, then another small piece somewhere else,
and another and another... repeated tens of thousands of times, each
small modification and save gets mapped to a new location and the file
fragments into possibly tens of thousands of extents, each with just the
content of the individual modification made to the file at that point.
On a spinning rust hard drive, the time necessary to seek to each of
those possibly tens of thousands of extents in order to read the file,
as compared to the cost of simply reading the same data were it stored
sequentially in a straight line, is... non-trivial to say the least!
It's exactly that fragmentation and the delays caused by all the seeks to
read an affected file, that result in the stalls and system hangs you are
seeing.
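To put rough numbers on that (a back-of-the-envelope sketch; the extent
count and seek time below are assumptions, not measurements):

```shell
# Back-of-envelope cost of reading a heavily fragmented file on spinning rust.
extents=10000        # "tens of thousands of extents", per the worst case above
seek_ms=10           # assumed ~10 ms average seek for a consumer SATA drive
total_seek_s=$(( extents * seek_ms / 1000 ))
echo "${total_seek_s} seconds of pure seek time"
```

That's well over a minute spent just seeking, before any data is actually
transferred -- comfortably in the range where a 120-second hung-task
watchdog starts firing under load.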
OK, so now that we know what causes it: which files are affected, and what
can you do to help the situation?
Fortunately, COW-fragmentation isn't a situation that dramatically
impacts operations on most files, as obviously if it was, it'd be
unsuited for filesystem use at all. But it does have a dramatic effect
in some cases -- the ones I've seen people report on this list are listed
below:
1) Installation.
Apparently the way some distribution installation scripts work results in
even a brand new installation being highly fragmented. =:^( If in
addition they don't add autodefrag to the mount options used when
mounting the filesystem for the original installation, the problem is
made even worse, since the autodefrag mount option is designed to help
catch some of this sort of issue, and schedule the affected files for
auto-defrag by a separate thread.
The fix here is to run a manual btrfs filesystem defrag -r on the
filesystem immediately after installation completes, and to add
autodefrag to the mount options used for the filesystem from then on, to
keep updates and routine operation from triggering new fragmentation.
(It's possible to do the same with just the autodefrag option over time,
but depending on how fragmented the filesystem was to begin with, some
people report that this makes the problem worse for awhile, and the
system unusable, until the autodefrag mechanism has caught up to the
existing problem. Autodefrag works best to /keep/ an already in good
shape filesystem in good shape; it's not so good at getting one that's
highly fragmented back into good shape. That's what btrfs filesystem
defrag -r is for. =:^)
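For reference, adding autodefrag is just an edit to the option list in
/etc/fstab. A sketch (placeholder UUIDs, written to a temp file here so the
commands are self-contained; substitute your own UUIDs from blkid):

```shell
# Sketch: what fstab entries look like with autodefrag added to the options.
# The UUIDs are placeholders, and we write to /tmp rather than the real fstab.
cat > /tmp/fstab.example <<'EOF'
UUID=placeholder-root  /      btrfs  defaults,autodefrag  0  1
UUID=placeholder-home  /home  btrfs  defaults,autodefrag  0  2
EOF
grep -c autodefrag /tmp/fstab.example
```

A remount (or reboot) is needed before the option takes effect on an
already-mounted filesystem.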
2) Pre-allocated files.
Systemd's journal file is probably the most common single case here, but
it's not the only case, and AFAIK ubuntu doesn't use systemd anyway, so
that's highly unlikely to be your problem.
A less widespread case that's nevertheless common enough is bittorrent
clients that preallocate files at their final size before the download,
then write into them as the torrent chunks are downloaded. BAD situation
for COW filesystems including btrfs, since now the entire file is one
relocated chunk after another. If the file's a multi-gig DVD image or
the like, as mentioned above, that can be tens of thousands of extents!
This situation is *KNOWN* to cause N-second block reports and system
stalls of the nature you're reporting, but of course only triggers for
those running such bittorrent clients.
One potential fix, if your bittorrent client has the option, is to turn
preallocation off. However, it's there for a couple of reasons -- on normal
non-COW filesystems it has exactly the opposite effect, ensuring a file
stays sequentially mapped, AND, by preallocating the file, it's easier to
ensure that there's space available for the entire thing. (Altho if
you're using btrfs' compression option and it compresses the allocation,
more space will still be used as the actual data downloads and the file
is filled in, as that won't compress as well.)
Additionally, there's other cases of pre-allocated files. For these and
for bittorrent if you don't want to or can't turn pre-allocation off,
there's the NOCOW file attribute. See below for that.
3) Virtual machine images.
Virtual machine images tend to be rather large, often several gig, and to
trigger internal-image writes every time the configuration changes or
something is saved to the virtual disk in the image. Again, a big worst-
case for COW-based filesystems such as btrfs, as those internal image-
writes are precisely the sort of behavior that triggers image file
fragmentation.
For these, the NOCOW option is the best. Again, see below.
4) Database files.
Same COW-based-filesystem-worst-case behavior pattern here.
The autodefrag mount option was actually designed to help deal with this
case, however, for small databases (typically the small sqlite databases
used in firefox and thunderbird, for instance). It'll detect the
fragmentation and rewrite the entire file as a single extent. Of course
that works well for reasonably small databases, but won't work so well
for multi-gig databases, or multi-gig VMs or torrent images for that
matter, since the write magnification would be very large (rewriting a
whole multi-gig image for every change of a few bytes). Which is where
the NOCOW file attribute comes in...
Solutions beyond btrfs filesystem defrag -r, and the autodefrag mount
option:
The nodatacow mount option.
At the filesystem level, btrfs has the nodatacow mount option. For use-
cases where there's several files of the same problematic type, say a
bunch of VM images, or a bunch of torrent files downloading to the same
target subdir or subdirectory tree, or a bunch of database files all in
the same directory subtree, creating a dedicated filesystem which can be
mounted with the nodatacow option can make sense.
At some point in the future, btrfs is supposed to support different mount
options per subvolume, and at that point, a simple subvolume mounted with
nodatacow but still located on a main system volume mounted without it,
might make sense, but at this point, differing subvolume mount options
aren't available, so to use this solution, you have to create a fully
separate btrfs filesystem to use the nodatacow option on.
But nodatacow also disables some of the other features of btrfs, such as
checksumming and compression. While those don't work so well with COW-
averse use-cases anyway (for some of the same reasons COW doesn't work on
them), once you get rid of them on a global filesystem level, you're
almost back to the level of a normal filesystem, and might as well use
one. So in that case, rather than a dedicated btrfs mounted with
nodatacow, I'd suggest a dedicated ext4 or reiserfs or xfs or whatever
filesystem instead, particularly since btrfs is still under development,
while these other filesystems have been mature and stable for years.
The NOCOW file attribute.
Simple command form:
chattr +C /path/to/file/or/directory
*CAVEAT! This attribute should be set on new/empty files before they
have any content. The easiest way to do that is to set the attribute on
the parent directory, after which all new files created in it will
inherit the attribute. (Alternatively, touch the file to create it
empty, do the chattr, then append data into it using cat source >> target
or the like.)
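A minimal sketch of that directory-first workflow (directory and file names
are hypothetical; the chattr is guarded so it's simply a no-op on
filesystems without NOCOW support):

```shell
# Set NOCOW on the (empty) directory first; files created afterward inherit it.
dir=$(mktemp -d)
chattr +C "$dir" 2>/dev/null || true   # ignored on non-btrfs filesystems
touch "$dir/vm-image.raw"              # new file -- inherits +C on btrfs
lsattr -d "$dir" 2>/dev/null || true   # on btrfs, shows the C attribute set
ls "$dir"
```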
Meanwhile, if there's a point at which the file exists in its more or
less permanent form and won't be written into any longer (a torrented
file is fully downloaded, or a VM image is backed up), sequentially
copying it elsewhere (possibly using cp --reflink=never if on the same
filesystem, to avoid a reflink copy pointing at the same fragmented
extents!), then deleting the original fragmented version, should
effectively defragment the file too. And since it's not being written
into any more at that point, it should stay defragmented.
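The copy-then-replace step can be sketched like this (a throwaway temp file
is used here so the commands are self-contained; on a real btrfs filesystem
you'd point src at the fragmented file):

```shell
# Sequentially rewrite a file, then swap it into place.
src=$(mktemp)
printf 'example payload' > "$src"
cp --reflink=never "$src" "$src.defrag"  # force a real data copy, not a reflink
mv "$src.defrag" "$src"                  # replace the (fragmented) original
cat "$src"
```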
Or just btrfs filesystem defrag the individual file...
Finally, there's some more work going into autodefrag now, to hopefully
increase its performance, and make it work more efficiently on a bit
larger files as well. The goal is to eliminate the problems with
systemd's journal, among other things, now that it's known to be a common
problem, given systemd's widespread use and the fact that both systemd
and btrfs aim to be the accepted general Linux default within a few years.
Summary:
Figure out what applications on your system have the "internal write"
pattern that causes so much trouble to COW-based filesystems, and turn
off that behavior either in that app (as possible with torrent clients),
or in the filesystem, using either a dedicated filesystem mount, or more
likely, by setting the NOCOW attribute (chattr +C) on the individual
target files or directories.
Figuring out which files and applications are affected is left to the
reader, but the information above should provide a good starting point.
Then btrfs filesystem defrag -r the filesystem and add autodefrag to its
mount options to help keep it free of at least smaller-file fragmentation.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-01 12:37 ` Duncan
@ 2014-01-01 20:08 ` Sulla
2014-01-02 8:38 ` Duncan
2014-01-05 0:12 ` Sulla
2014-01-03 17:25 ` Marc MERLIN
1 sibling, 2 replies; 31+ messages in thread
From: Sulla @ 2014-01-01 20:08 UTC (permalink / raw)
To: linux-btrfs
Dear Duncan!
Thanks very much for your exhaustive answer.
Hm, I also thought of fragmentation, although I don't think this is really
very likely, as my server doesn't serve things that are likely to cause
fragmentation.
It is a mailserver (but only maildir format), a fileserver for Windows clients
(huge files that hardly ever get rewritten), a server for TV recordings (but it
only copies recordings from a sat receiver after they have been recorded, so
no heavy rewriting here), a tiny webserver and all kinds of such things, but
not storage for huge databases, virtual machines or a target for
filesharing clients.
It does, however, serve as a target for a hardlink-based backup program run on
Windows PCs, but only once per month or so, so that shouldn't be too much.
The problem must lie somewhere on the root partition itself, because the
system is already slow before mounting the fat data partitions.
I'll give the defragmentation a try. But
# sudo btrfs filesystem defrag -r
doesn't work, because "-r" is an unknown option (I'm running
Btrfs v0.20-rc1 on an Ubuntu 3.11.0-14-generic kernel).
I'm doing a
# sudo btrfs filesystem defrag / &
on the root directory at the moment.
Question: will this defragment everything or just the root-fs and will I
need to run a defragment on /home as well, as /home is a separate btrfs
filesystem?
I've also added the autodefrag mount option and will do a "mount -a" after
the defragmentation.
I've considered a
# sudo btrfs balance start
as well, would this do any good? How close should I let the data fill the
partition? The large data partitions are 85% used, root is 70% used. Is this
safe or should I add space?
Thanx, Wolfgang
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-01 20:08 ` Sulla
@ 2014-01-02 8:38 ` Duncan
2014-01-03 1:24 ` Kai Krakow
2014-01-05 0:12 ` Sulla
1 sibling, 1 reply; 31+ messages in thread
From: Duncan @ 2014-01-02 8:38 UTC (permalink / raw)
To: linux-btrfs
Sulla posted on Wed, 01 Jan 2014 20:08:21 +0000 as excerpted:
> Dear Duncan!
>
> Thanks very much for your exhaustive answer.
>
> Hm, I also thought of fragmentation, although I don't think this is
> really very likely, as my server doesn't serve things that are likely to
> cause fragmentation.
> It is a mailserver (but only maildir format), a fileserver for Windows
> clients (huge files that hardly ever get rewritten), a server for
> TV recordings (but it only copies recordings from a sat receiver after
> they have been recorded, so no heavy rewriting here), a tiny webserver
> and all kinds of such things, but not storage for huge databases, virtual
> machines or a target for filesharing clients.
> It does, however, serve as a target for a hardlink-based backup program
> run on Windows PCs, but only once per month or so, so that shouldn't be
> too much.
One thing I didn't mention originally was how to check for fragmentation.
filefrag is part of e2fsprogs, and does the trick -- with one caveat.
filefrag currently doesn't know about btrfs compression, and interprets
each 128 KiB block as a separate extent. So if you have btrfs
compression turned on and check a (larger than 128 KiB) file that btrfs
has compressed, filefrag will falsely report fragmentation.
If in doubt, you can always try defragging that individual file and see
if filefrag reports fewer extents or not. If it has fewer extents you
know it was fragmented, if not...
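The over-count is easy to estimate: with compression on, filefrag reports
roughly one "extent" per 128 KiB of file size. A quick sketch (the 4 MiB
size is just an example):

```shell
# Apparent extent count filefrag reports for a btrfs-compressed file.
file_size=$(( 4 * 1024 * 1024 ))   # example: a 4 MiB file
block=$(( 128 * 1024 ))            # btrfs compresses in 128 KiB chunks
apparent_extents=$(( (file_size + block - 1) / block ))
echo "$apparent_extents apparent extents, even if the file is contiguous"
```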
With that you should actually be able to check some of those big files
that you don't think are fragmented, to see.
> The problem must lie somewhere on the root partition itself, because the
> system is already slow before mounting the fat data partitions.
>
> I'll give the defragmentation a try. But
> # sudo btrfs filesystem defrag -r
> doesn't work, because "-r" is an unknown option (I'm running Btrfs
> v0.20-rc1 on an Ubuntu 3.11.0-14-generic kernel).
The -r option was added quite recently.
As the wiki (at https://btrfs.wiki.kernel.org ) urges, btrfs is a
development filesystem and people choosing to test it should really try
to keep current, both because you're unnecessarily putting the data
you're testing on btrfs at risk when running old versions with bugs
patched in newer versions (that part's mostly for the kernel, tho), and
because as a tester, when things /do/ go wrong and you report it, the
reports are far more useful if you're running a current version.
Kernel 3.11.0 is old. 3.12 has been out for well over a month now. And
the btrfs-progs userspace recently switched to kernel-synced versioning
as well, with version 3.12 the latest version, which also happens to be
the first kernel-version-synced version.
That's assuming you don't choose to run the latest git version of the
userspace, and the Linus kernel RCs, which many btrfs testers do. (Tho
last I updated btrfs-progs, about a week ago, the last git commit was
still the version bump to 3.12, but I'm running a git kernel at version
3.13.0-rc5 plus 69 commits.)
So you are encouraged to update. =:^)
However, if you don't choose to upgrade ... (see next)
> I'm doing a # sudo btrfs filesystem defrag / &
> on the root directory at the moment.
... Before the -r option was added, btrfs filesystem defrag would only
defrag the specific file it was pointed at. If pointed at a directory,
it would defrag the directory metadata, but not files or subdirs below it.
The way to defrag the entire system then, involved a rather more
complicated command using find to output a list of everything on the
system, and run defrag individually on each item listed. It's on the
wiki. Let's see if I can find it... (yes, but note the wrapped link):
https://btrfs.wiki.kernel.org/index.php/
UseCases#How_do_I_defragment_many_files.3F
sudo find [subvol [subvol]…] -xdev -type f -exec btrfs filesystem
defragment -- {} +
As the wiki warns, that doesn't recurse into subvolumes (the -xdev keeps
it from going onto non-btrfs filesystems but also keeps it from going
into subvolumes), but you can list them as paths where noted.
> Question: will this defragment everything or just the root-fs and will I
> need to run a defragment on /home as well, as /home is a separate btrfs
> filesystem?
Well, as noted your command doesn't really defragment that much. But the
find command should defragment everything on the named subvolumes.
But of course this is where that bit I mentioned in the original post
about possibly taking hours with multiple terabytes on spinning rust
comes in too. It could take awhile, and when it gets to really
fragmented files, it'll probably trigger the same sort of stalls that have
us discussing the whole thing in the first place, so the system may not
be exactly usable. =:^(
> I've also added the autodefrag mount option and will do a "mount -a" after
> the defragmentation.
>
> I've considered a # sudo btrfs balance start as well, would this do any
> good? How close should I let the data fill the partition? The large data
> partitions are 85% used, root is 70% used. Is this safe or should I add
> space?
!! Be careful !! You mentioned running 3.11. Early versions of both 3.11
and 3.12 had a bug where, if you tried to run a balance and a defrag at
the same time, bad things could happen (lockups or even corrupted data)!
Running just one at a time and letting it finish, then the other, should
be fine. And later stable kernels of both 3.11 and 3.12 have that bug
fixed (as does 3.13). But 3.11.0 is almost certainly still bugged in
that regard, unless ubuntu backported the fix and didn't bump the kernel
version.
But because a full balance rewrites everything anyway, it'll effectively
defrag too. So if you're going to do a balance, you can skip the
defrag. =:^) And since it's likely to take hours at the terabyte scale
on spinning rust, that's just as well.
As for the space question, that's a whole different subject with its own
convolutions. =:^\
Very briefly, the rule of thumb I use is that for partitions of
sufficient size (several GiB low end), you always want btrfs filesystem
show to have at LEAST enough unallocated space left to allocate one each
data and metadata chunk. Data chunks default to 1 GiB, while metadata
chunks default to 256 MiB, but because single-device metadata defaults to
DUP mode, metadata chunks are normally allocated in pairs and that
doubles to half a GiB.
So you need at LEAST 1.5 GiB unallocated, in order to be sure balance
can work, since it allocates a new chunk and writes into it from the old
chunks, until it can free up the old chunks.
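That 1.5 GiB figure is just the default chunk sizes added up; as a sketch:

```shell
# Minimum unallocated space for a balance, per the rule of thumb above.
data_chunk_mib=1024   # data chunks default to 1 GiB
meta_chunk_mib=256    # metadata chunks default to 256 MiB
dup_copies=2          # single-device metadata defaults to DUP (two copies)
min_unalloc_mib=$(( data_chunk_mib + meta_chunk_mib * dup_copies ))
echo "${min_unalloc_mib} MiB minimum unallocated"
```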
Assuming you have large enough filesystems, I'd try to keep twice that, 3
GiB unallocated according to btrfs filesystem show, and would definitely
recommend doing a rebalance any time it starts getting close to that.
If you tend to have many multi-gig files, you'll probably want to keep
enough unallocated space (rounded up to a whole gig, plus the 3 gig
minimum I suggested above) around to handle at least one of those as
well, just so you know you always have space available to move at least
one of those if necessary, without using up your 3 gig safety margin.
Beyond that, take a look at your btrfs filesystem df output. I already
mentioned that data chunk size is 1 GiB, metadata 256 MiB (doubled to 512
MiB for default dup mode for a single device btrfs). So if data says
something like total=248.00GiB, used=123.24GiB (example picked out of
thin air), you know you're running a whole bunch of half empty chunks,
and a balance should trim that down dramatically, to probably
total=124.00GiB altho it's possible it might be 125.00GiB or something,
but in any case it should be FAR closer to used than the twice-used
figure in my example above. Any time total is more than a GiB above
used, a balance is likely to be able to reduce it and return the extra to
the unallocated pool.
Of course the same applies to metadata, keeping in mind its default-dup,
so you're effectively allocating in 512 MiB chunks for it. But any time
total is more than 512 MiB above used, a balance will probably reduce it,
returning the extra space to the unallocated pool.
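The total-versus-used comparison can be scripted too. A sketch, assuming
the default 1 GiB data and 512 MiB dup-metadata allocation units discussed
above; the here-document stands in for real btrfs filesystem df output
(the numbers are invented):

```shell
#!/bin/sh
# Flag chunk types where allocated "total" exceeds "used" by more than one
# allocation unit: 1 GiB for data, 0.5 GiB for dup metadata.
sample=$(mktemp)
cat <<'EOF' > "$sample"
Data, single: total=248.00GiB, used=123.24GiB
Metadata, DUP: total=2.00GiB, used=1.71GiB
EOF

# Split on '=' and ',' so $3 and $5 carry the total/used GiB figures;
# awk's numeric coercion ignores the trailing "GiB".
out=$(awk -F'[=,]' '
/^Data/     { slack = $3 - $5; if (slack > 1.0)
                printf "%s: balance could free ~%.0f GiB\n", $1, slack }
/^Metadata/ { slack = $3 - $5; if (slack > 0.5)
                printf "%s: balance could free ~%.0f GiB\n", $1, slack }
' "$sample")
echo "$out"
```

For the sample above only the data line is flagged; the metadata slack
(0.29 GiB) is below half a GiB, so a balance wouldn't gain much there.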
Of course single vs. dup on single devices, and multiple devices with all
the different btrfs raid modes, throw various curves into the numbers
given above. While it's reasonably straightforward to figure an
individual case, explaining all the permutations gets quite complex. And
while it's not supported yet, eventually btrfs is supposed to support
different raid levels, etc, for different subvolumes, which will throw
even MORE complexity into the thing! And obviously for small single-
digit GiB partitions the rules must be adjusted, even more so for mixed-
blockgroup, which is the default below 1 GiB but makes some sense in the
single-digit GiB size range as well. But the reasonably large single-
device default isn't /too/ bad, even if it takes a bit to explain, as I
did here.
Meanwhile, especially on spinning rust at terabyte sizes, those balances
are going to take awhile, so you probably don't want to run them daily.
And on SSDs, balances (and defrags and anything else for that matter)
should go MUCH faster, but SSDs are limited-write-cycle, and any time you
balance you're rewriting all that data and metadata, thus using up
limited write cycles on all those gigs worth of blocks in one fell swoop!
So either way, doing balances without any clear return probably isn't a
good idea. But when the allocated space gets within a few gigs of total
as shown by btrfs filesystem show, or when total gets multiple gigs above
used as shown by btrfs filesystem df, it's time to consider a balance.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla
2014-01-01 12:37 ` Duncan
@ 2014-01-02 8:49 ` Jojo
2014-01-05 20:32 ` Chris Murphy
2 siblings, 0 replies; 31+ messages in thread
From: Jojo @ 2014-01-02 8:49 UTC (permalink / raw)
To: Sulla, linux-btrfs
On 31.12.2013 12:46, Sulla wrote:
> Dear all!
>
> On my Ubuntu Server 13.10 I use a RAID5 blockdevice consisting of 3 WD20EARS
> drives. On this I built a LVM and in this LVM I use quite normal partitions
> /, /home, SWAP (/boot resides on a RAID1.) and also a custom /data
> partition. Everything (except boot and swap) is on btrfs.
>
> sometimes my system hangs for quite some time (top is showing a high wait
> percentage), then runs on normally. I get kernel messages into
> /var/log/sylsog, see below. I am unable to make any sense of the kernel
> messages, there is no reference to the filesystem or drive affected (at
> least I can not find one).
>
> Question: What is happening here?
> * Is a HDD failing (smart looks good, however)
> * Is something wrong with my btrfs-filesystem? with which one?
> * How can I find the cause?
>
Hello Wolfgang,
first, off-topic: Happy New Year!
Over the holidays one of our servers (Ubuntu 13.04, custom kernel
3.11.4) did quite similar things, also on raid5/raid6.
Our problem was that writing to the backup produced almost the same
kernel log. btrfs-transaction was hanging as well.
Filesystem usage of 83% also looked fine. But that was not true.
After some time-consuming investigation I found that btrfs in 3.11.x
(and possibly other kernels) may have a problem with free-block lists
and fragmentation.
Our server was able to recover by itself after a defragmentation and
compression run.
We had run out of free blocks.
After rebuilding the free-block list and running defrag, the server got
enough free blocks to operate well.
To be able to do that, we were forced to use the btrfs git kernel and
also the btrfs-progs from git (3.13-rcX).
On 26.12.13 I ran:
# umount /ar
# btrfsck --repair --init-extent-tree /dev/sda1
# mount -o clear_cache,skip_balance,autodefrag /dev/sda1 /ar
# btrfs fi defragment -rc /ar/backup
But beware: I thought 83% used space should leave enough "free blocks",
but this was wrong. It seems the btrfs free-block lists can be somewhat
erroneous.
In particular, "balance" may crash if a file has too many
extents/fragments, and allocating space may also hang if
free blocks are running low.
During the defragmentation run the server's response got slow, but read
access never stopped entirely.
Our state today:
root@bk:~# df -m /ar
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/sda1 13232966 7213717 3181874 70% /ar
root@bk:~# btrfs fi show /ar
Label: Archiv+Backup uuid: 72b710aa-49a0-4ff5-a470-231560bfee81
Total devices 5 FS bytes used 6.88TiB
devid 1 size 2.73TiB used 2.70TiB path /dev/sda1
devid 2 size 2.73TiB used 2.70TiB path /dev/sdb1
devid 3 size 2.73TiB used 2.70TiB path /dev/sdc1
devid 4 size 2.73TiB used 2.70TiB path /dev/sdd1
devid 5 size 1.70TiB used 4.25GiB path /dev/sde4
Btrfs v3.12
root@bk:~# btrfs fi df /ar
Data, single: total=8.00MiB, used=0.00
Data, RAID5: total=8.10TiB, used=6.87TiB
System, single: total=4.00MiB, used=0.00
System, RAID5: total=12.00MiB, used=600.00KiB
Metadata, single: total=8.00MiB, used=0.00
Metadata, RAID5: total=12.25GiB, used=10.41GiB
Today the server has completely recovered to full operation.
Is there an ongoing plan to handle such out-of-free-blocks/space
situations more gracefully?
TIA
J. Sauer
--
Jürgen Sauer - automatiX GmbH,
+49-4209-4699, juergen.sauer@automatix.de
Managing director: Jürgen Sauer,
Court of registration: Amtsgericht Walsrode • HRB 120986
VAT ID: DE191468481 • Tax no.: 36/211/08000
GPG public key for signature verification:
http://www.automatix.de/juergen_sauer_publickey.gpg
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-02 8:38 ` Duncan
@ 2014-01-03 1:24 ` Kai Krakow
2014-01-03 9:18 ` Duncan
0 siblings, 1 reply; 31+ messages in thread
From: Kai Krakow @ 2014-01-03 1:24 UTC (permalink / raw)
To: linux-btrfs
Duncan <1i5t5.duncan@cox.net> schrieb:
> But because a full balance rewrites everything anyway, it'll effectively
> defrag too.
Is that really true? I thought it just rewrites each distinct extent and
shuffles chunks around... That would mean it does not merge extents
together.
Regards,
Kai
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-03 1:24 ` Kai Krakow
@ 2014-01-03 9:18 ` Duncan
0 siblings, 0 replies; 31+ messages in thread
From: Duncan @ 2014-01-03 9:18 UTC (permalink / raw)
To: linux-btrfs
Kai Krakow posted on Fri, 03 Jan 2014 02:24:01 +0100 as excerpted:
> Duncan <1i5t5.duncan@cox.net> schrieb:
>
>> But because a full balance rewrites everything anyway, it'll
>> effectively defrag too.
>
> Is that really true? I thought it just rewrites each distinct extent and
> shuffles chunks around... That would mean it does not merge extents
> together.
While I'm not a coder and they're free to correct me if I'm wrong...
With a full balance (there are now options allowing one to do only data,
or only metadata, or for that matter only system, and do other filtering,
say to rebalance only chunks less than 10% used or only those not yet
converted to a new raid level, if desired, but we're talking a full
balance here), all chunks are rewritten, merging data (or metadata) into
fewer chunks if possible, eliminating the then unused chunks and
returning the space they took to the unallocated pool.
Given that everything is being rewritten anyway, a process that can take
hours or even days on multi-terabyte spinning rust filesystems, /not/
doing a file defrag as part of the process would be stupid.
So doing a separate defrag and balance isn't necessary. And while we're
at it, doing a separate scrub and balance isn't necessary, for the same
reason. (If one copy of the data is invalid and there's another, it'll
be used for the rewrite and redup if necessary during the balance and the
invalid copy will simply be erased. If there's no valid copy, then there
will be balance errors and I believe the chunks containing the bad data
are simply not rewritten at all, tho the valid data from them might be
rewritten, leaving only the bad data (I'm not sure which, on that), thus
allowing the admin to try other tools to clean up or recover from the
damage as necessary.)
That's one reason why the balance operation can take so much longer than
a straight sequential read/write of the data might indicate, because it's
doing all that extra work behind the scenes as well.
Tho I'm not sure that it defrags across chunks, particularly if a file's
fragments reach across enough chunks that they'd not have been processed
by the time a written chunk is full and the balance progresses to the
next one. However, given that data chunks are 1 GiB in size, that should
still cut down a multi-thousand-extent file to perhaps a few dozen
extents, one each per rewritten chunk.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-01 12:37 ` Duncan
2014-01-01 20:08 ` Sulla
@ 2014-01-03 17:25 ` Marc MERLIN
2014-01-03 21:34 ` Duncan
2014-01-04 20:48 ` Roger Binns
1 sibling, 2 replies; 31+ messages in thread
From: Marc MERLIN @ 2014-01-03 17:25 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
First, a big thank you for taking the time to post this very informative
message.
On Wed, Jan 01, 2014 at 12:37:42PM +0000, Duncan wrote:
> Apparently the way some distribution installation scripts work results in
> even a brand new installation being highly fragmented. =:^( If in
> addition they don't add autodefrag to the mount options used when
> mounting the filesystem for the original installation, the problem is
> made even worse, since the autodefrag mount option is designed to help
> catch some of this sort of issue, and schedule the affected files for
> auto-defrag by a separate thread.
Assuming you can stomach a bit of occasional performance loss due to
autodefrag, is there a reason not to always have this on btrfs
filesystems in newer kernels? (let's say 3.12+)?
Is there even a reason for this not to become a default mount option in
newer kernels?
> The NOCOW file attribute.
>
> Simple command form:
>
> chattr +C /path/to/file/or/directory
Thank you for that tip, I had been unaware of it 'till now.
This will make my virtualbox image directory much happier :)
> Meanwhile, if there's a point at which the file exists in its more or
> less permanent form and won't be written into any longer (a torrented
> file is fully downloaded, or a VM image is backed up), sequentially
> copying it elsewhere (possibly using cp --reflink=never if on the same
> filesystem, to avoid a reflink copy pointing at the same fragmented
> extents!), then deleting the original fragmented version, should
> effectively defragment the file too. And since it's not being written
> into any more at that point, it should stay defragmented.
>
> Or just btrfs filesystem defrag the individual file..
I know I can do the cp --reflink=never, but that will generate 100GB of
new files and force me to drop all my hourly/daily/weekly snapshots, so
file defrag is definitely a better option.
> Finally, there's some more work going into autodefrag now, to hopefully
> increase its performance, and make it work more efficiently on a bit
> larger files as well. The goal is to eliminate the problems with
> systemd's journal, among other things, now that it's known to be a common
> problem, given systemd's widespread use and the fact that both systemd
> and btrfs aim to be the accepted general Linux default within a few years.
Is there a good guideline on which kinds of btrfs filesystems autodefrag
is likely not a good idea, even if the current code does not have
optimal performance?
I suppose fragmented files that are deleted soon after being written are
a loss, but otherwise it's mostly a win. Am I missing something?
Unfortunately, on an 83GB vdi (virtualbox) file, with 3.12.5, it did a
lot of writing and chewed up my 4 CPUs.
Then, it started to be hard to move my mouse cursor and my procmeter
graph was barely updating seconds.
Next, nothing updated on my X server anymore, not even seconds in time
widgets.
But, I could still sometimes move my mouse cursor, and I could sometimes
see the HD light flicker a bit before going dead again. In other words,
the system wasn't fully deadlocked, but btrfs sure got into a state
where it was unable to finish the job, and took the kernel down with
it (64bit, 8GB of RAM).
I waited 2H and it never came out of it, I had to power down the system
in the end.
Note that this was on a top of the line 500MB/s write Samsung Evo 840 SSD,
not a slow HD.
I think I had enough free space:
Label: 'btrfs_pool1' uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
Total devices 1 FS bytes used 732.14GB
devid 1 size 865.01GB used 865.01GB path /dev/dm-0
Is it possibly expected behaviour for defrag to lock up on big files?
Should I have had more spare free space for it to work?
Other?
On the plus side, the file I was trying to defragment, which hung my
system, was not corrupted by the process.
Any idea what I should try from here?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-03 17:25 ` Marc MERLIN
@ 2014-01-03 21:34 ` Duncan
2014-01-05 6:39 ` Marc MERLIN
2014-01-08 3:22 ` Marc MERLIN
2014-01-04 20:48 ` Roger Binns
1 sibling, 2 replies; 31+ messages in thread
From: Duncan @ 2014-01-03 21:34 UTC (permalink / raw)
To: linux-btrfs
Marc MERLIN posted on Fri, 03 Jan 2014 09:25:06 -0800 as excerpted:
> First, a big thank you for taking the time to post this very informative
> message.
>
> On Wed, Jan 01, 2014 at 12:37:42PM +0000, Duncan wrote:
>> Apparently the way some distribution installation scripts work results
>> in even a brand new installation being highly fragmented. =:^( If in
>> addition they don't add autodefrag to the mount options used when
>> mounting the filesystem for the original installation, the problem is
>> made even worse, since the autodefrag mount option is designed to help
>> catch some of this sort of issue, and schedule the affected files for
>> auto-defrag by a separate thread.
>
> Assuming you can stomach a bit of occasional performance loss due to
> autodefrag, is there a reason not to always have this on btrfs
> filesystems in newer kernels? (let's say 3.12+)?
>
> Is there even a reason for this not to become a default mount option in
> newer kernels?
For big "internal write" files, autodefrag isn't yet well tuned, because
it effectively write-magnifies too much, forcing rewrite of the entire
file for just a small change. If whatever app is more or less constantly
writing those small changes, faster than the file can be rewritten...
I don't know where the break-over might be, but certainly, multi-gig
sized IO-active VMs images or databases aren't something I'd want to use
it with. That's where the NOCOW thing will likely work better.
IIRC someone also mentioned problems with autodefrag and an about 3/4 gig
systemd journal. My gut feeling (IOW, *NOT* benchmarked!) is that double-
digit MiB files should /normally/ be fine, but somewhere in the lower
triple digits, write-magnification could well become an issue, depending
of course on exactly how much active writing the app is doing into the
file.
As I said there's more work going into tuning autodefrag ATM, but as it
is, I couldn't really recommend making it a global default... tho maybe a
distro could enable it by default on a no-VM desktop system (as opposed
to a server). Certainly I'd recommend most desktop types enable it.
>> The NOCOW file attribute.
>>
>> Simple command form:
>>
>> chattr +C /path/to/file/or/directory
>
> Thank you for that tip, I had been unaware of it 'till now.
> This will make my virtualbox image directory much happier :)
I think I said it, but it bears repeating. Once you set that attribute
on the dir, you may want to move the files out of the dir (moving to
another partition makes sure the data is actually moved) and back in, so
they're effectively new files in the dir. Or use something like cat
oldfile > newfile, so you know it's actually creating the new file, not
reflinking. That'll ensure the NOCOW takes effect.
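A sketch of that recreate-the-files step. The directory and file names
are invented for illustration, and chattr +C will only succeed on a
filesystem that supports NOCOW (btrfs here), so the demo lets it fail
gracefully elsewhere:

```shell
#!/bin/sh
# Recreate files inside a +C directory so NOCOW actually applies to them.
set -e
work=$(mktemp -d)
mkdir "$work/vm-images"
# +C on the directory only affects files created in it afterwards.
chattr +C "$work/vm-images" 2>/dev/null || echo "note: +C needs btrfs"

printf 'pretend disk image\n' > "$work/old.vdi"
# cat > newfile forces a real data copy; plain cp might reflink and keep
# pointing at the old fragmented COW extents.
cat "$work/old.vdi" > "$work/vm-images/old.vdi"
cmp -s "$work/old.vdi" "$work/vm-images/old.vdi" && echo "copied for real"
```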
> Unfortunately, on a 83GB vdi (virtualbox) file, with 3.12.5, it did a
> lot of writing and chewed up my 4 CPUs. Then, it started to be hard to
> move my mouse cursor and my procmeter graph was barely updating seconds.
> Next, nothing updated on my X server anymore, not even seconds in time
> widgets.
>
> But, I could still sometimes move my mouse cursor, and I could sometimes
> see the HD light flicker a bit before going dead again. In other words,
> the system wasn't fully deadlocked, but btrfs sure got into a state
> where it was unable to finish the job, and took the kernel down with
> it (64bit, 8GB of RAM).
>
> I waited 2H and it never came out of it, I had to power down the system
> in the end. Note that this was on a top of the line 500MB/s write
> Samsung Evo 840 SSD, not a slow HD.
That was defrag (the command) or autodefrag (the mount option)? I'd
guess defrag (the command).
That's fragmentation for you! What did/does filefrag have to say about
that file? Were you the one that posted the 6-digit extents?
For something that bad, it might be faster to copy/move it off-device
(expect it to take awhile) then move it back. That way you're only
trying to read OR write on the device, not both, and the move elsewhere
should defrag it quite a bit, effectively sequential write, then read and
write on the move back.
But even that might be prohibitive. At some point, you may need to
either simply give up on it (if you're lazy), or get down and dirty with
the tracing/profiling, working with a dev to figure out where it's
spending its time and hopefully get btrfs recoded to work a bit faster
for that sort of thing.
> I think I had enough free space:
> Label: 'btrfs_pool1' uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
> Total devices 1 FS bytes used 732.14GB
> devid 1 size 865.01GB used 865.01GB path /dev/dm-0
>
> Is it possible expected behaviour of defrag to lock up on big files?
> Should I have had more spare free space for it to work?
> Other?
From my understanding it's not the file size, but the number of
fragments. I'm guessing you simply overwhelmed the system. Ideally you
never let it get that bad in the first place. =:^(
As I suggested above, you might try the old school method of defrag, move
the file to a different device, then move it back. And if possible do it
when nothing else is using the system. But it may simply be practically
inaccessible with a current kernel, in which case you'd either have to
work with the devs to optimize, or give it up as a lost cause. =:(
> On the plus side, the file I was trying to defragment and hung my
> system,
> was not corrupted by the process.
>
> Any idea what I should try from here?
Beyond the above, it's let the devs hack on it time. =:^\
One other /narrow/ possibility if you're desperate. You could try
splitting the file into chunks (generic term, not btrfs chunks) of some
arbitrary shorter size, and copying them out. If you split into say 10
parts, then each piece should take roughly a tenth of the time, altho
more fragmented areas will likely take longer. But by splitting into say
100 parts (which would be ~830 MiB apiece), you could at least see the
progress and if there was one particular area where it suddenly got a lot
worse.
I know there's tools for that sort of thing, but I'm not enough into
forensics to know much about them...
Then if the process completed successfully, you could cat the parts back
together again... and the written parts would be basically sequential, so
that should go MUCH faster! =:^)
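A minimal sketch of that split-and-reassemble round trip. The paths are
invented; a real run would use the multi-GiB image and write the parts to
a different device:

```shell
#!/bin/sh
# Split a file into fixed-size parts, then cat them back together.  The
# reassembled copy is written sequentially, so it comes out defragmented.
set -e
work=$(mktemp -d)
seq 1 100000 > "$work/bigfile"               # stand-in for the big file

split -b 100k "$work/bigfile" "$work/part."  # part.aa, part.ab, ...
cat "$work"/part.* > "$work/bigfile.defragged"
cmp -s "$work/bigfile" "$work/bigfile.defragged" && echo "contents identical"
```

Shell glob order sorts the part.* suffixes correctly, so cat reassembles
them in the right sequence.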
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-03 17:25 ` Marc MERLIN
2014-01-03 21:34 ` Duncan
@ 2014-01-04 20:48 ` Roger Binns
1 sibling, 0 replies; 31+ messages in thread
From: Roger Binns @ 2014-01-04 20:48 UTC (permalink / raw)
To: linux-btrfs
On 03/01/14 09:25, Marc MERLIN wrote:
> Is there even a reason for this not to become a default mount option
> in newer kernels?
autodefrag can go insane because it is unbounded. For example I have a
4GB RAM system (3.12, no gui) that kept hanging. I eventually managed to
work out the cause being a MySQL database (about 750MB of data only being
used by tt-rss refreshing RSS feeds every 4 hours).
autodefrag would eventually consume all the RAM and 20GB of swap kicking
off the OOM killer and with so little RAM left for anything else that the
only recourse was sysrq keys.
What I'd love to see is some sort of background worker that does sensible
things. For example it could defragment files, but pick the ones that
need it the most, and I'd love to see extra copies of (meta)data in
currently unused space that is freed as needed. deduping is another
worthwhile option. So is recompressing data that hasn't changed recently
but using larger block sizes to get more effective ratios. Some of these
happen at the moment but they are independent and you have to be aware of
the caveats.
Roger
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-01 20:08 ` Sulla
2014-01-02 8:38 ` Duncan
@ 2014-01-05 0:12 ` Sulla
1 sibling, 0 replies; 31+ messages in thread
From: Sulla @ 2014-01-05 0:12 UTC (permalink / raw)
To: linux-btrfs
Oh gosh, I don't know what went wrong with my btrfs root filesystem, and
I probably never will, either:
The "sudo balance start /" was running fine for about 4 or 5 hours, running
at a system load of ~3 when "balance status /" told me the balancing was on
its way and had completed 19 out of 23 extents.
At this moment the system load started to increase and increase, and
when it reached 147 (!!) (while top was showing me NOTHING was going on)
I reset the computer. TTY1 showed some kernel panics and btrfs bug
messages, but those messages were lost because they never made it to disk.
Fortunately my RAID5 stayed in sync and everything was fine. The system
also booted, but with the same 120+ secs hangs as before. The system was
unusable, as e.g. all IMAP logins timed out.
So
* I booted into a live-CD
* mounted a backup disk
* cp-ed all files of the root fs to the backup disk (it could read them
flawlessly)
* formatted the root-partition to ext4 (yes, I feel sad about it)
* cp-ed all root-files from the backupdisk to the ext4 root system
* removed the subvol=@ boot argument from /boot/grub/grub.cfg
* and rebooted my server.
How I love linux! Wouldn't be possible with M$!!
Now it's running fine again, and the system is responsive as it should
be. No clue 'bout what went wrong, though.
I still have /home and the huge data partitions on btrfs and plan to leave
it so. While it would not be difficult to put /home on ext4 it would be a
major effort to cp the ~3TB data off and on the disks...
Thanx for your support,
Sulla
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-03 21:34 ` Duncan
@ 2014-01-05 6:39 ` Marc MERLIN
2014-01-05 17:09 ` Chris Murphy
2014-01-08 3:22 ` Marc MERLIN
1 sibling, 1 reply; 31+ messages in thread
From: Marc MERLIN @ 2014-01-05 6:39 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote:
> > Thank you for that tip, I had been unaware of it 'till now.
> > This will make my virtualbox image directory much happier :)
>
> I think I said it, but it bears repeating. Once you set that attribute
> on the dir, you may want to move the files out of the dir (to another
> partition would make sure the data is actually moved) and back in, so
> they're effectively new files in the dir. Or use something like cat
> oldfile > newfile, so you know it's actually creating the new file, not
> reflinking. That'll ensure the NOCOW takes effect.
Yes, I got that. That's why I ran btrfs defrag on the files afterwards (I
explained why: a copy would waste lots of snapshot space by replacing all
the blocks needlessly).
> > Unfortunately, on a 83GB vdi (virtualbox) file, with 3.12.5, it did a
> > lot of writing and chewed up my 4 CPUs. Then, it started to be hard to
> > move my mouse cursor and my procmeter graph was barely updating seconds.
> > Next, nothing updated on my X server anymore, not even seconds in time
> > widgets.
> >
> > But, I could still sometimes move my mouse cursor, and I could sometimes
> > see the HD light flicker a bit before going dead again. In other words,
> > the system wasn't fully deadlocked, but btrfs sure got into a state
> > where it was unable to finish the job, and took the kernel down with
> > it (64bit, 8GB of RAM).
> >
> > I waited 2H and it never came out of it, I had to power down the system
> > in the end. Note that this was on a top of the line 500MB/s write
> > Samsung Evo 840 SSD, not a slow HD.
>
> That was defrag (the command) or autodefrag (the mount option)? I'd
> guess defrag (the command).
defrag, the btrfs subcommand.
> That's fragmentation for you! What did/does filefrag have to say about
> that file? Were you the one that posted the 6-digit extents?
Nope, I never posted anything until now. Hopefully you agree that it's
not ok for btrfs/the kernel to just kill my system for over 2H, until I
powered it off, because of defragging one file. I could accept a severe
performance hit if it's not a never-ending loop.
gandalfthegreat:/var/local/nobck/VirtualBox VMs/Win7# filefrag Win7.vdi
Win7.vdi: 156222 extents found
Considering how virtualbox works, that's hardly surprising.
> For something that bad, it might be faster to copy/move it off-device
> (expect it to take awhile) then move it back. That way you're only
> trying to read OR write on the device, not both, and the move elsewhere
> should defrag it quite a bit, effectively sequential write, then read and
> write on the move back.
Yes, I know how I can work around the problem (although I'll likely have
to delete all my historical snapshots to delete the old blocks, which I
don't love to do).
But doesn't it make sense to see why the kernel is near deadlocking on a
single file defrag first?
> But even that might be prohibitive. At some point, you may need to
> either simply give up on it (if you're lazy), or get down and dirty with
> the tracing/profiling, working with a dev to figure out where it's
> spending its time and hopefully get btrfs recoded to work a bit faster
> for that sort of thing.
I'm on my way to a linux conf where I'm speaking, so I have limited time
and can't crash my laptop, but I'm happy to type some commands and give
output.
> As I suggested above, you might try the old school method of defrag, move
> the file to a different device, then move it back. And if possible do it
> when nothing else is using the system. But it may simply be practically
> inaccessible with a current kernel, in which case you'd either have to
> work with the devs to optimize, or give it up as a lost cause. =:(
I can fix my problem; actually virtualbox works fine with the fragmented
file, without even feeling slow, so I don't really need to fix it
urgently. I was just trying it out after your post.
> Then if the process completed successfully, you could cat the parts back
> together again... and the written parts would be basically sequential, so
> that should go MUCH faster! =:^)
All that noted, but I'm not desperate, just trying commands I hadn't
tried yet :)
Thanks for your replies,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 6:39 ` Marc MERLIN
@ 2014-01-05 17:09 ` Chris Murphy
2014-01-05 17:54 ` Jim Salter
0 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2014-01-05 17:09 UTC (permalink / raw)
To: Btrfs BTRFS
On Jan 4, 2014, at 11:39 PM, Marc MERLIN <marc@merlins.org> wrote:
>
> Nope, I never posted anything until now. Hopefully you agree that it's
> not ok for btrfs/the kernel to just kill my system for over 2H, until I
> powered it off, because of defragging one file. I could accept a severe
> performance hit if it's not a never-ending loop.
>
> gandalfthegreat:/var/local/nobck/VirtualBox VMs/Win7# filefrag Win7.vdi
> Win7.vdi: 156222 extents found
>
> Considering how virtualbox works, that's hardly surprising.
I haven't read anything so far indicating defrag applies to the VM container use case, rather nodatacow via xattr +C is the way to go. At least for now.
>
> But doesn't it make sense to see why the kernel is near deadlocking on a
> single file defrag first?
It's better than a panic or corrupt data. So far the best combination
I've found, open to other suggestions though, is the +C attribute on
/var/lib/libvirt/images, creating non-preallocated qcow2 files, and
snapshotting the qcow2 file with qemu-img. Granted, when sysroot is
snapshotted, I'm making btrfs snapshots of these qcow2 files.
Another option is to make /var/lib/libvirt/images a subvolume; then when
sysroot is snapshotted, /var/lib/libvirt/images is immune to being
snapshotted automatically with the parent subvolume. I'd have to
explicitly snapshot it. This may be a better way to go to avoid
accumulation of btrfs snapshots of qcow2 snapshot files.
This may already be a known problem but it's worth sysrq+w, and then dmesg and posting those results if you haven't already.
Chris Murphy
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 17:09 ` Chris Murphy
@ 2014-01-05 17:54 ` Jim Salter
2014-01-05 19:57 ` Duncan
0 siblings, 1 reply; 31+ messages in thread
From: Jim Salter @ 2014-01-05 17:54 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS
On 01/05/2014 12:09 PM, Chris Murphy wrote:
> I haven't read anything so far indicating defrag applies to the VM
> container use case, rather nodatacow via xattr +C is the way to go. At
> least for now.
Can you elaborate on the rationale behind database or VM binaries being
set nodatacow? I experimented with this*, and found no significant (to
me, anyway) performance enhancement with nodatacow on - maybe 10% at
best, and if I understand correctly, that implies losing the live
per-block checksumming of the data that's set nodatacow, meaning you
won't get automatic correction if you're on a redundant array.
All I've heard so far is "better performance" without any more detailed
explanation, and if the only benefit is an added MAYBE 10%ish
performance... I'd rather take the hit, personally.
* "experimented with this" == set up a Win2008R2 test VM and ran
HDTunePro for several runs on binaries stored with and without nodatacow
set, 5G of random and sequential read and write access per run.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 17:54 ` Jim Salter
@ 2014-01-05 19:57 ` Duncan
2014-01-05 20:44 ` Chris Murphy
0 siblings, 1 reply; 31+ messages in thread
From: Duncan @ 2014-01-05 19:57 UTC (permalink / raw)
To: linux-btrfs
Jim Salter posted on Sun, 05 Jan 2014 12:54:44 -0500 as excerpted:
> On 01/05/2014 12:09 PM, Chris Murphy wrote:
>> I haven't read anything so far indicating defrag applies to the VM
>> container use case, rather nodatacow via xattr +C is the way to go. At
>> least for now.
Well, NOCOW from the get-go would certainly be better, but given that the
file is already there and heavily fragmented, my idea was to get it
defragmented and then set +C, to prevent it from recurring.
But I do very little snapshotting here, and as a result hadn't considered
the knockon effect of 100K-plus extents in perhaps 1000 snapshots. I
guess that's what's killing the defrag, however it's initiated. The only
way to get rid of the problem, then, would be to move the file away and
then back, but doing so does still leave all those snapshots with the
crazy fragmentation, and to kill that would require either killing all
those snapshots, or setting them writable and doing the same move out,
move back, on each one! OUCH, but I guess that's why it just seems
impossible to deal with the fragmentation on these things, whether it's
autodefrag, or named file defrag, or doing the whole move out and back
thing, and then having to worry about all those snapshots.
Still, I'd guess ultimately it'll need to be done, whether that's wiping
the filesystem and restoring from backup or whatever.
> Can you elaborate on the rationale behind database or VM binaries being
> set nodatacow? I experimented with this*, and found no significant (to
> me,
> anyway) performance enhancement with nodatacow on - maybe 10% at best,
> and if I understand correctly, that implies losing the live per-block
> checksumming of the data that's set nodatacow, meaning you won't get
> automatic correction if you're on a redundant array.
>
> All I've heard so far is "better performance" without any more detailed
> explanation, and if the only benefit is an added MAYBE 10%ish
> performance... I'd rather take the hit, personally.
>
> * "experimented with this" == set up a Win2008R2 test VM and ran
> HDTunePro for several runs on binaries stored with and without nodatacow
> set, 5G of random and sequential read and write access per run.
Well, the problem isn't just performance; it's that in most such cases
the apps actually have their own data integrity checking and management,
and sometimes the app's integrity management and that of btrfs end up
fighting each other, destroying the data as a result.
In normal operation, everything's fine. But should the system crash at
the wrong moment, btrfs' atomic commit and data integrity mechanisms can
roll back to a slightly earlier version of the file.
Which is normally fine. But hardware is known to lie about having
committed writes that may actually still be sitting in a buffer, so if
the power outage/crash occurs at the wrong moment, ordinary write-barrier
ordering guarantees may be invalid (particularly on large files with
finite-seek-speed devices): the app's own integrity checksum may have
been updated before the data it was supposed to cover actually got to
disk. If btrfs ends up rolling back to that state, btrfs will likely
consider the file fine, but the app's own integrity management will
consider it corrupted, which it actually is.
But if btrfs only stays out of the way, the application often can fix
whatever minor corruption it detects, doing its own roll-backs to an
earlier checkpoint, because it's /designed/ to be able to handle such
problems on filesystems that don't have integrity management.
So having btrfs trying to manage integrity too on such data where the app
already handles it is self-defeating, because neither knows about nor
considers what the other one is doing, and the two end up undoing each
other's careful work.
Again, this isn't something you'll see in normal operation, but several
people have reported exactly that sort of problem with the general large-
internally-written-file, application-self-managed-file-integrity,
scenario. In those cases, the best thing btrfs can do is simply get out
of the way and let the application handle its own integrity management,
and the way to tell btrfs to do that, as well as to do in-place rewrites
instead of COW-based rewrites, is with the NOCOW attribute, chattr +C,
and that must be set before the file gets so fragmented (and multi-
snapshotted in its fragmented state) in the first place.
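A minimal sketch of applying NOCOW before any data lands in a file (the scratch-directory demo and file names are illustrative, not from the thread; on non-btrfs filesystems chattr +C simply fails):

```shell
# Demo in a scratch directory; in practice the target would be something like
# /var/lib/libvirt/images. chattr +C must be applied while the file or
# directory is still empty, since existing extents keep their COW status.
DIR="$(mktemp -d)/images"
mkdir -p "$DIR"
chattr +C "$DIR" 2>/dev/null || echo "note: NOCOW only takes effect on btrfs"
touch "$DIR/disk.img"                 # new files inherit +C from the directory
lsattr -d "$DIR" 2>/dev/null || true  # on btrfs, shows 'C' in the flags column
```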
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla
2014-01-01 12:37 ` Duncan
2014-01-02 8:49 ` Jojo
@ 2014-01-05 20:32 ` Chris Murphy
2014-01-05 21:17 ` Sulla
2 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2014-01-05 20:32 UTC (permalink / raw)
To: Sulla; +Cc: linux-btrfs
On Dec 31, 2013, at 4:46 AM, Sulla <Sulla@gmx.at> wrote:
> Dear all!
>
> On my Ubuntu Server 13.10 I use a RAID5 blockdevice consisting of 3 WD20EARS
Sulla, is this md raid5? If so, can you report the result of mdadm -D <mddevice>? I'm curious what the chunk size is. Thanks.
Chris Murphy
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
[not found] <ADin1n00P0VAdqd01DioM9>
@ 2014-01-05 20:44 ` Duncan
0 siblings, 0 replies; 31+ messages in thread
From: Duncan @ 2014-01-05 20:44 UTC (permalink / raw)
To: Jim Salter; +Cc: Marc MERLIN, linux-btrfs
On Sun, 05 Jan 2014 08:42:46 -0500
Jim Salter <jim@jrs-s.net> wrote:
> On Jan 5, 2014 1:39 AM, Marc MERLIN <marc@merlins.org> wrote:
> >
> > On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote:
> > Yes, I got that. That why I ran btrfs defrag on the files after that
>
> Why are you trying to defrag an SSD? There's no seek penalty for
> moving between fragmented blocks, so defrag isn't really desirable in
> the first place.
[I normally try to reply directly to list but don't believe I've seen
this there yet, but got it direct-mailed so will reply-all in response.]
There's no seek penalty, so the overall problem is dramatically lessened,
as seeking is the significant part of it on spinning rust, correct, but...
SSDs do remain IOPS-bound, and tens or hundreds of thousands of extents
do exact an IOPS (as well as general extent bookkeeping) toll, too.
That's why I ended up enabling autodefrag here when I was first setting
up, even tho I'm on SSD. (Only after asking the list basically the same
question, what good it is autodefrag on SSD, tho.)
Luckily I don't happen to deal with any of the
internal-write-in-huge-files scenarios, however, and I enabled
autodefrag to cover the internal-write-in-small-file scenarios BEFORE I
started putting any data on the filesystems at all, so I'm basically
covered, here, without actually having to do chattr +C on anything.
> That doesn't change the fact that the described lockup sounds like a
> bug not a feature of course, but I think the answer to your personal
> issue on that particular machine is "don't defrag a solid state
> drive".
I now believe the lockup must be due to processing the hundreds of
thousands of extents on all those snapshots, too, in addition to doing
it on the main volume. I don't actually make very extensive use of
snapshots here anyway, so I didn't think about that aspect originally,
but that's gotta be what's throwing the real spanner in the works,
turning a possibly long but workable normal defrag (O(1)) into a lockup
scenario (O(n)) where virtually no progress is made as currently
coded.
--
Duncan - No HTML messages please, as they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 19:57 ` Duncan
@ 2014-01-05 20:44 ` Chris Murphy
0 siblings, 0 replies; 31+ messages in thread
From: Chris Murphy @ 2014-01-05 20:44 UTC (permalink / raw)
To: Btrfs BTRFS, Duncan
On Jan 5, 2014, at 12:57 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>
> But I do very little snapshotting here, and as a result hadn't considered
> the knockon effect of 100K-plus extents in perhaps 1000 snapshots.
I wonder if this is an issue with snapshot aware defrag? Some problems were fixed recently but I'm not sure of the status.
The OP's case involves Btrfs on LVM on (I think) md raid5. The mdadm default stripe size is 512KB, which would be a 1MB full stripe. There are some optimizations for non-full stripe reads and writes for raid5 (not for raid6 so it takes a much bigger performance hit) but nevertheless it might be a factor.
> I
> guess that's what's killing the defrag, however it's initiated. The only
> way to get rid of the problem, then, would be to move the file away and
> then back, but doing so does still leave all those snapshots with the
> crazy fragmentation, and to kill that would require either killing all
> those snapshots, or setting them writable and doing the same move out,
> move back, on each one! OUCH, but I guess that's why it just seems
> impossible to deal with the fragmentation on these things, whether it's
> autodefrag, or named file defrag, or doing the whole move out and back
> thing, and then having to worry about all those snapshots.
It's why in the short term I'm using +C from the get go. And if I had more VM images and qcow2 snapshots, I would put them in a subvolume of their own so that they aren't snapshotted along with rootfs. Using Btrfs within the VM I still get the features I expect and the performance is quite good.
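A sketch of that subvolume layout (the image path assumes libvirt's default and the snapshot path is my own choice; each step is guarded because it needs root on a btrfs filesystem):

```shell
# A subvolume boundary stops a snapshot of the parent from descending into it,
# so images created here are excluded from rootfs snapshots automatically.
IMAGES="${IMAGES:-/var/lib/libvirt/images}"
if btrfs subvolume create "$IMAGES" 2>/dev/null; then
    # A read-only snapshot of / now skips $IMAGES; snapshot the images
    # subvolume explicitly only when you actually want a copy of them.
    btrfs subvolume snapshot -r / "/snapshots/root-$(date +%F)"
else
    echo "skipped: needs root and a btrfs filesystem (sketch only)"
fi
```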
Chris Murphy
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 20:32 ` Chris Murphy
@ 2014-01-05 21:17 ` Sulla
2014-01-05 22:36 ` Brendan Hide
2014-01-05 23:48 ` Chris Murphy
0 siblings, 2 replies; 31+ messages in thread
From: Sulla @ 2014-01-05 21:17 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Dear Chris!
Certainly: I have 3 HDDs, all of which WD20EARS. Originally I wanted to
let btrfs handle all 3 devices directly without making partitions, but
this was impossible, as at least /boot needed to be ext4, at least back
then when I set up the server. And back then btrfs also hadn't raid5-like
functionality, so I decided to put good old partitions and md-Raids and
LVM on them and use btrfs just as plain file-systems on the partitions
provided by LVM.
On the WD disks I thus created 2 partitions each, the first (sdX1) being
~500MiB; the rest, 1.9995 TiB, is one partition, sdX2.
I built a Raid1 on the 3 small partitions sdX1 with ext4 for boot, each
disk is bootable with grub installed into the MBR.
I combined the 3 large partitions into a Raid5 of size 3.64TB:
/proc/mdstat reads:
md0 : active raid1 sda1[5] sdb1[4] sdc1[3]
498676 blocks super 1.2 [3/3] [UUU]
md1 : active raid5 sda2[5] sdb2[4] sdc2[3]
3904907520 blocks super 1.2 level 5, 8k chunk, algorithm 2 [3/3] [UUU]
the information you requested:
# sudo mdadm -D /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Thu Jul 14 18:49:25 2011
Raid Level : raid5
Array Size : 3904907520 (3724.01 GiB 3998.63 GB)
Used Dev Size : 1952453760 (1862.01 GiB 1999.31 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Sun Jan 5 22:07:22 2014
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 8K
Name : freedom:1 (local to host freedom)
UUID : 44b72520:a78af6f7:dba13fb3:2203127d
Events : 576884
Number Major Minor RaidDevice State
4 8 18 0 active sync /dev/sdb2
5 8 2 1 active sync /dev/sda2
3 8 34 2 active sync /dev/sdc2
I use the Raid5 md1 as physical volume for LVM: pvdisplay gives:
--- Physical volume ---
PV Name /dev/md1
VG Name MAIN
PV Size 3.64 TiB / not usable 2.06 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 953346
Free PE 6274
Allocated PE 947072
PV UUID WcuEx8-ehJL-xHdf-ElwF-b9s3-dlmM-KZlDNG
I keep a reserve of 6274 4MiB blocks (=24GiB) in case one of the logical
volumes runs out of space...
I created the following logical volumes, named after their intended
mountpoints:
--- Logical volume ---
LV Path /dev/MAIN/ROOT
LV Name ROOT
VG Name MAIN
LV UUID kURJks-xHox-73B5-n02x-eZfS-agDD-n1dtAm
LV Write Access read/write
LV Creation host, time ,
LV Status available
# open 1
LV Size 19.31 GiB
Current LE 4944
Segments 2
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 252:0
and similar:
--- Logical volume ---
LV Path /dev/MAIN/SWAP: 1.8 GB
LV Path /dev/MAIN/HOME: 18.6 GB
LV Path /dev/MAIN/TMP: 9.3 GB
LV Path /dev/MAIN/DATA1: 2.6 TB
LV Path /dev/MAIN/DATA2: 0.9 TB
As the filesystem I used btrfs during install from an Ubuntu server, I
don't recall which release, might have been 11.10 or 12.04 (?), for all
logical partitions except swap, of course.
any other information I can supply?
regards, Sulla
- --
Cogito cogito ergo cogito sum.
Ambrose Bierce
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 21:17 ` Sulla
@ 2014-01-05 22:36 ` Brendan Hide
2014-01-05 22:57 ` Roman Mamedov
2014-01-06 0:15 ` Chris Murphy
2014-01-05 23:48 ` Chris Murphy
1 sibling, 2 replies; 31+ messages in thread
From: Brendan Hide @ 2014-01-05 22:36 UTC (permalink / raw)
To: Sulla, Chris Murphy; +Cc: linux-btrfs
On 2014/01/05 11:17 PM, Sulla wrote:
> Certainly: I have 3 HDDs, all of which WD20EARS.
Maybe/maybe-not off-topic:
Poor hardware performance, though not necessarily the root cause, can be
a major factor with these errors.
WD Greens (Reds too, for that matter) have poor non-sequential
performance. As an educated guess, I'd say there's a 15% chance this is a
major factor in the problem and, perhaps, a 60% chance it is merely a
"small contributor" to it. Greens are aimed at consumers wanting high
capacity at a low price point. The result is poor performance. See
footnote * re my experience.
My general recommendation (use cases vary of course) is to install a
tiny SSD (60GB, for example) just for the OS. It is typically cheaper
than the larger drives and will be *much* faster. WD Greens and Reds
have good *sequential* throughput but comparatively abysmal random
throughput even in comparison to regular non-SSD consumer drives.
*
I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a
single 250GB IDE disk for the OS. When the very old IDE disk inevitably
died, I decided to use a spare 1.5TB drive for the OS. Performance was
bad enough that I simply bought my first SSD the same week.
--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 22:36 ` Brendan Hide
@ 2014-01-05 22:57 ` Roman Mamedov
2014-01-07 10:22 ` Brendan Hide
2014-01-06 0:15 ` Chris Murphy
1 sibling, 1 reply; 31+ messages in thread
From: Roman Mamedov @ 2014-01-05 22:57 UTC (permalink / raw)
To: Brendan Hide; +Cc: Sulla, Chris Murphy, linux-btrfs
On Mon, 06 Jan 2014 00:36:22 +0200
Brendan Hide <brendan@swiftspirit.co.za> wrote:
> I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a
> single 250GB IDE disk for the OS. When the very old IDE disk inevitably
> died, I decided to use a spare 1.5TB drive for the OS. Performance was
> bad enough that I simply bought my first SSD the same week.
Did you align your partitions to accommodate the 4K sectors of the EARS?
--
With respect,
Roman
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 21:17 ` Sulla
2014-01-05 22:36 ` Brendan Hide
@ 2014-01-05 23:48 ` Chris Murphy
2014-01-05 23:57 ` Chris Murphy
1 sibling, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2014-01-05 23:48 UTC (permalink / raw)
To: Sulla; +Cc: linux-btrfs
On Jan 5, 2014, at 2:17 PM, Sulla <Sulla@gmx.at> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Dear Chris!
>
> Certainly: I have 3 HDDs, all of which WD20EARS.
These drives don't have a configurable SCT ERC, so you need to modify the SCSI block layer timeout:
echo 120 >/sys/block/sdX/device/timeout
You also need to schedule regular scrubs at the md level.
echo check > /sys/block/mdX/md/sync_action
cat /sys/block/mdX/mismatch_cnt
More info about this is in man 4 md, and on the linux-raid list.
>
> 3904907520 blocks super 1.2 level 5, 8k chunk, algorithm 2 [3/3] [UUU]
OK so 8KB chunk, 16KB full stripe, so that doesn't apply to what I was thinking might be the case. The workload is presumably small file sizes, like a mail server?
> any other information I can supply?
I'm not a developer, I don't know if this problem is known or maybe fixed in a newer kernel than 3.11.0 - which has been around for 5-6 months. I think the main suggestion is to try a newer kernel, granted with the configuration of md, lvm, and btrfs you have three layers that will likely have kernel changes. I'd make sure you have backups. While this layout is valid and should work, it's also probably less common and therefore less tested.
Usually in cases of blocking, devs want to see sysrq+w issued. The setup is dmesg -n7, and enabling the sysrq functions. Then reproduce the block, and during the block issue w to the sysrq trigger, then capture the dmesg contents and post the block and any other nearby btrfs messages.
https://www.kernel.org/doc/Documentation/sysrq.txt
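A sketch of that procedure (needs root on the affected box; the output file name is my own choice, and the sysfs/procfs paths are the standard ones):

```shell
# Keep all kernel messages, enable sysrq, then dump blocked (D-state) tasks.
dmesg -n 7 2>/dev/null || echo "dmesg -n 7 needs root, skipping"
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
    # ...reproduce the hang, then:
    echo w > /proc/sysrq-trigger      # dump tasks in uninterruptible sleep
fi
dmesg 2>/dev/null > /tmp/sysrq-w.txt || : > /tmp/sysrq-w.txt
echo "trace saved to /tmp/sysrq-w.txt"
```

The saved trace is what gets pasted into the bug report alongside the nearby btrfs messages.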
Chris Murphy
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 23:48 ` Chris Murphy
@ 2014-01-05 23:57 ` Chris Murphy
2014-01-06 0:25 ` Sulla
0 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2014-01-05 23:57 UTC (permalink / raw)
To: Sulla; +Cc: linux-btrfs
On Jan 5, 2014, at 4:48 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Jan 5, 2014, at 2:17 PM, Sulla <Sulla@gmx.at> wrote:
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Dear Chris!
>>
>> Certainly: I have 3 HDDs, all of which WD20EARS.
>
> These drives don't have a configurable SCT ERC, so you need to modify the SCSI block layer timeout:
>
> echo 120 >/sys/block/sdX/device/timeout
>
> You also need to schedule regular scrubs at the md level as well.
>
> echo check > /sys/block/mdX/md/sync_action
> cat /sys/block/mdX/mismatch_cnt
>
> More info about this is in man 4 md, and on the linux-raid list.
>
>>
>> 3904907520 blocks super 1.2 level 5, 8k chunk, algorithm 2 [3/3] [UUU]
>
> OK so 8KB chunk, 16KB full stripe, so that doesn't apply to what I was thinking might be the case. The workload is presumably small file sizes, like a mail server?
>
>
>> any other information I can supply?
>
> I'm not a developer, I don't know if this problem is known or maybe fixed in a newer kernel than 3.11.0 - which has been around for 5-6 months. I think the main suggestion is to try a newer kernel, granted with the configuration of md, lvm, and btrfs you have three layers that will likely have kernel changes. I'd make sure you have backups. While this layout is valid and should work, it's also probably less common and therefore less tested.
>
> Usually in case of blocking devs want to see sysrq+w issued. The setup is dmesg -n7, and enable sysrq functions. Then reproduce the block, and during the block issue w to the sysrq trigger, then capture dmesg contents and post the block and any other nearby btrfs messages.
>
> https://www.kernel.org/doc/Documentation/sysrq.txt
Also, this thread is pretty cluttered with other conversations by now so I think you're best off starting a new thread with this information, maybe a title of "PROBLEM: btrfs on LVM on md raid, blocking > 120 seconds"
Since it's almost inevitable you'd be asked to test with a newer kernel anyway, you might as well go to 3.13rc7 and see if you can reproduce, if reproducible, be specific with the problem report by following this template:
https://www.kernel.org/pub/linux/docs/lkml/reporting-bugs.html
Chris Murphy
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 22:36 ` Brendan Hide
2014-01-05 22:57 ` Roman Mamedov
@ 2014-01-06 0:15 ` Chris Murphy
2014-01-06 0:19 ` Chris Murphy
1 sibling, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2014-01-06 0:15 UTC (permalink / raw)
To: Btrfs BTRFS; +Cc: Sulla, Brendan Hide
On Jan 5, 2014, at 3:36 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote:
> WD Greens (Reds too, for that matter) have poor non-sequential performance. An educated guess I'd say there's a 15% chance this is a major factor to the problem and, perhaps, a 60% chance it is merely a "small contributor" to the problem. Greens are aimed at consumers wanting high capacity and a low pricepoint. The result is poor performance. See footnote * re my experience.
>
> My general recommendation (use cases vary of course) is to install a tiny SSD (60GB, for example) just for the OS. It is typically cheaper than the larger drives and will be *much* faster. WD Greens and Reds have good *sequential* throughput but comparatively abysmal random throughput even in comparison to regular non-SSD consumer drives.
Another thing with md raid and parallel file systems that's been an issue is cqf. On the XFS list cqf is approximately in the realm of persona non grata. It might be worth Sulla also setting elevator=deadline to see if simply different scheduling is a workaround, not that it's OK to get blocks with cqf. But it might be worth a shot as a more conservative approach than upgrading the kernel from 3.11.0.
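A sketch of testing that without a reboot (the device name is a placeholder; on a 3.11-era kernel the single-queue schedulers offered are noop, deadline, and cfq):

```shell
DEV="${DEV:-sda}"
SCHED="/sys/block/$DEV/queue/scheduler"
if [ -f "$SCHED" ]; then
    cat "$SCHED"                           # active scheduler shown in [brackets]
    echo deadline > "$SCHED" 2>/dev/null \
        || echo "could not switch (need root, or scheduler not available)"
    cat "$SCHED"
else
    echo "no block device named $DEV here (sketch only)"
fi
```

The boot-time equivalent is elevator=deadline on the kernel command line, which applies the scheduler to all devices at once.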
> I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a single 250GB IDE disk for the OS. When the very old IDE disk inevitably died, I decided to use a spare 1.5TB drive for the OS. Performance was bad enough that I simply bought my first SSD the same week.
Yeah for what it's worth, the current WD Green PDF says these drives are not to be used in RAID at all. Not 0, 1, 5 or 6. Even Caviar Black is proscribed from use in RAID environments using multibay chassis, as in, no warranty. It's desktop raid0 and raid1 only, and arguably the lack of configurable SCT ERC makes it not ideal even for raid1.
Anyway, Sulla, how about putting up a smartctl -x for each drive? Curious if there are any bad sectors that have developed, and may be worth filtering all /var/log/messages for the word "reset" and see if you find any of these drives ever being reset by the kernel and if so, post the full output of that.
Chris Murphy
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-06 0:15 ` Chris Murphy
@ 2014-01-06 0:19 ` Chris Murphy
0 siblings, 0 replies; 31+ messages in thread
From: Chris Murphy @ 2014-01-06 0:19 UTC (permalink / raw)
To: Btrfs BTRFS
On Jan 5, 2014, at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Jan 5, 2014, at 3:36 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote:
>
>> WD Greens (Reds too, for that matter) have poor non-sequential performance. An educated guess I'd say there's a 15% chance this is a major factor to the problem and, perhaps, a 60% chance it is merely a "small contributor" to the problem. Greens are aimed at consumers wanting high capacity and a low pricepoint. The result is poor performance. See footnote * re my experience.
>>
>> My general recommendation (use cases vary of course) is to install a tiny SSD (60GB, for example) just for the OS. It is typically cheaper than the larger drives and will be *much* faster. WD Greens and Reds have good *sequential* throughput but comparatively abysmal random throughput even in comparison to regular non-SSD consumer drives.
>
>
> Another thing with md raid and parallel file systems that's been an issue is cqf.
Oops, CFQ!
Chris Murphy
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 23:57 ` Chris Murphy
@ 2014-01-06 0:25 ` Sulla
2014-01-06 0:49 ` Chris Murphy
0 siblings, 1 reply; 31+ messages in thread
From: Sulla @ 2014-01-06 0:25 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Thanks Chris!
Thanks for your support.
>> echo 120 >/sys/block/sdX/device/timeout
timeout is 30 for my HDDs. I'm well aware that the WD green HDDs are not
the perfect ones for servers, but they were cheaper - and quieter - than
the black ones for servers. I'll get the red ones next, though. ;-)
>> You also need to schedule regular scrubs at the md level as well.
Ubuntu does that once a month.
>> cat /sys/block/mdX/mismatch_cnt
this resides in /sys/devices/virtual/block/md1/md/mismatch_cnt on my
machine.
the count is zero.
>> The workload is presumably small file sizes, like a mail server?
Yes. It serves as a mailserver (maildir-format), but also as a samba file
server with quite big files...
btrfs ran fine for more than a year, so I'm not sure how reproducible the
problem is...
I don't really wish to install or compile custom kernels, to be honest.
Not sure how problematic they might be during the next do-release-upgrade...
Sulla
- --
Russian Roulette is not the same without a gun
and baby when it's love, if it's not rough, it isn't fun, fun.
Lady GaGa, "Pokerface"
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-06 0:25 ` Sulla
@ 2014-01-06 0:49 ` Chris Murphy
[not found] ` <52CA06FE.2030802@gmx.at>
0 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2014-01-06 0:49 UTC (permalink / raw)
To: Sulla; +Cc: linux-btrfs
On Jan 5, 2014, at 5:25 PM, Sulla <Sulla@gmx.at> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Thanks Chris!
>
> Thanks for your support.
>
>>> echo 120 >/sys/block/sdX/device/timeout
> timeout is 30 for my HDDs.
I don't think those drives support a configurable timeout; the Green hasn't supported it in years. Where are you getting this information? What do you get from 'smartctl -l scterc /dev/sdX'?
> I don't really wish to install or compile cumstom kernels, to be honest.
If the problem is reproducible, then that's the fastest way to find out if it's been fixed or not. In this case 3.11 is EOL already, no more updates.
Chris Murphy
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
[not found] ` <52CA06FE.2030802@gmx.at>
@ 2014-01-06 1:55 ` Chris Murphy
0 siblings, 0 replies; 31+ messages in thread
From: Chris Murphy @ 2014-01-06 1:55 UTC (permalink / raw)
To: Sulla; +Cc: Btrfs BTRFS
On Jan 5, 2014, at 6:29 PM, Sulla <Sulla@gmx.at> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Chris!
>
> # sudo smartctl -l scterc /dev/sda
> tells me
> SCT Error Recovery Control command not supported
>
> you're right. the /sys/block/sdX/device/timeout file probably is useless then.
OK, there's some confusion. /sys/block/sdX/device/timeout is the SCSI block layer timeout - Linux itself has a timeout for each command issued to a block device, and will reset the link when that timeout is reached. So writing 120 to this file will cause Linux to wait up to 120 seconds for the drive to respond. This is necessary because if there's a bad sector, the drive must report a read error in order for the md driver to reconstruct that data from parity. This is needed both for effective scrubs and for recovery on read error in normal operation. It is not a persistent setting, so you'll want to create a start-up script for it.
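One way to make it persistent is a udev rule (the rules-file name and the sd[a-z] match are illustrative; any file under /etc/udev/rules.d works):

```
# /etc/udev/rules.d/60-block-timeout.rules
# Give desktop drives without configurable SCT ERC up to 120s to respond, so
# the kernel's SCSI command timeout outlasts the drive's internal error recovery.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="120"
```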
Chris Murphy
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-05 22:57 ` Roman Mamedov
@ 2014-01-07 10:22 ` Brendan Hide
0 siblings, 0 replies; 31+ messages in thread
From: Brendan Hide @ 2014-01-07 10:22 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Sulla, Chris Murphy, linux-btrfs
On 2014/01/06 12:57 AM, Roman Mamedov wrote:
> Did you align your partitions to accommodate for the 4K sector of the EARS?
I had, yes. I had to do a lot of research to get the array working
"optimally". I didn't need to repartition the spare so this carried over
to its being used as an OS disk.
I actually lost the "Green" array twice - and learned some valuable lessons:
1. I had an 8-port SCSI card which was dropping the disks due to the
timeout issue mentioned by Chris. That caused the first array failure.
Technically all the data was on the disks - but temporarily
irrecoverable as disks were constantly being dropped. I made a mistake
during ddrescue which simultaneously destroyed two disks' data, meaning
that the recovery operation was finally for nought. The only consolation
was that I had very little data at the time and none of it was
irreplaceable.
2. After replacing the SCSI card with two 4-port SATA cards, a few
months later I still had a double-failure (the second failure being
during the RAID5 rebuild). This time it was only due to bad disks and a
lack of scrubbing/early warning - clearly my own fault.
Having learnt these lessons, I'm now a big fan of scrubbing and backups. ;)
I'm also pushing for RAID15 wherever data is mission-critical. I simply
don't "trust" the reliability of disks any more, and I also better
understand how, by having more and/or larger disks in a RAID5/6 array,
the overall reliability of that array plummets.
--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-03 21:34 ` Duncan
2014-01-05 6:39 ` Marc MERLIN
@ 2014-01-08 3:22 ` Marc MERLIN
2014-01-08 9:45 ` Duncan
1 sibling, 1 reply; 31+ messages in thread
From: Marc MERLIN @ 2014-01-08 3:22 UTC (permalink / raw)
To: Duncan, Chris Murphy; +Cc: linux-btrfs, Jim Salter
On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote:
> IIRC someone also mentioned problems with autodefrag and an about 3/4 gig
> systemd journal. My gut feeling (IOW, *NOT* benchmarked!) is that double-
> digit MiB files should /normally/ be fine, but somewhere in the lower
> triple digits, write-magnification could well become an issue, depending
> of course on exactly how much active writing the app is doing into the
> file.
When I defrag'ed my 83GB vm file with 156222 extents, it was not in use
or being written to.
> As I said there's more work going into tuning autodefrag ATM, but as it
> is, I couldn't really recommend making it a global default... tho maybe a
> distro could enable it by default on a no-VM desktop system (as opposed
> to a server). Certainly I'd recommend most desktop types enable it.
I use VMs on my desktop :) but point taken.
On Sun, Jan 05, 2014 at 10:09:38AM -0700, Chris Murphy wrote:
> > gandalfthegreat:/var/local/nobck/VirtualBox VMs/Win7# filefrag Win7.vdi
> > Win7.vdi: 156222 extents found
> >
> > Considering how virtualbox works, that's hardly surprising.
>
> I haven't read anything so far indicating that defrag applies to the VM container use case; rather, nodatacow via xattr +C is the way to go. At least for now.
Yep, I'll convert the file, but since I found a pretty severe
performance problem, does anyone care to get details off my system
before I make the problem go away for me?
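For anyone else doing the same conversion, the +C flag has a gotcha worth spelling out: it only affects extents written after the flag is set, so it must go on an empty file and the data copied back in. A minimal sketch (convert_nocow is an illustrative name, not an existing tool; assumes GNU coreutils and a btrfs filesystem):

```shell
# Sketch: convert an existing VM image to nodatacow. chattr +C does not
# retroactively un-COW existing extents, so recreate the file empty,
# flag it, then copy the data in with reflinks disabled.
convert_nocow() {
    src=$1
    tmp=$src.nocow.$$
    touch -- "$tmp"
    chattr +C "$tmp" || { rm -f -- "$tmp"; return 1; }  # fails off-btrfs
    cp --reflink=never -- "$src" "$tmp"                 # force a real copy
    mv -- "$tmp" "$src"
}
# e.g. convert_nocow Win7.vdi   (with the VM shut down first)
```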
> It's better than a panic or corrupt data. So far the best combination
To be honest, I'd have taken a panic; it would have saved me two hours
of waiting for a laptop to recover when it was never going to recover :(
Data corruption, sure, obviously :)
> I've found, open to other suggestions though, is +C xattr on
So you're saying that defragmentation has known performance problems
that can't be fixed for now, and that the solution is to avoid getting
fragmented or to recreate the relevant files.
If so, I'll go ahead, I just wanted to make sure I didn't have useful
debug state before clearing my problem.
> This may already be a known problem, but it's worth doing sysrq+w, then capturing dmesg and posting the results if you haven't already.
No, I had not yet, but I'll do this.
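For the archives, capturing that state only takes two commands; sysrq-w dumps every task in uninterruptible (D) sleep to the kernel log, which is the same information the 120-second hung-task warning samples. A sketch, wrapped in a function so the root check is explicit:

```shell
# Dump blocked (D-state) task backtraces to the kernel log, then read
# them back out of dmesg. Needs root and CONFIG_MAGIC_SYSRQ.
sysrq_w() {
    if [ ! -w /proc/sysrq-trigger ]; then
        echo "need root (and sysrq enabled) for /proc/sysrq-trigger" >&2
        return 1
    fi
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 100   # the backtraces land at the end of the log
}
```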
On Sun, Jan 05, 2014 at 01:44:25PM -0700, Duncan wrote:
> [I normally try to reply directly to list but don't believe I've seen
> this there yet, but got it direct-mailed so will reply-all in response.]
I like direct Cc on replies, makes my filter and mutt coloring happier
:)
Dupes with the same message-id are what procmail and others were written
for :)
> I now believe the lockup must be due to processing the hundreds of
> thousands of extents on all those snapshots, too, in addition to doing
That's a good call. I do have this:
gandalfthegreat:/mnt/btrfs_pool1# ls var
var/ var_hourly_20140105_16:00:01/
var_daily_20140102_00:01:01/ var_hourly_20140105_17:00:26/
var_daily_20140103_00:59:28/ var_weekly_20131208_00:02:02/
var_daily_20140104_00:01:01/ var_weekly_20131215_00:02:01/
var_daily_20140105_00:33:14/ var_weekly_20131229_00:02:02/
var_hourly_20140105_05:00:01/ var_weekly_20140105_00:33:14/
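Tangentially, the naming scheme above is simple to reproduce. A hypothetical helper (snapname is an illustrative name, not taken from Marc's actual snapshot script; assumes GNU date) that pairs with btrfs subvolume snapshot -r:

```shell
# Build a snapshot name in the <subvol>_<period>_<YYYYmmdd>_<HH:MM:SS>
# style shown above. Pair it with something like:
#   btrfs subvolume snapshot -r "/mnt/btrfs_pool1/$sub" \
#       "/mnt/btrfs_pool1/$(snapname "$sub" hourly "$(date +%s)")"
snapname() {    # $1=subvolume $2=hourly|daily|weekly $3=epoch seconds
    printf '%s_%s_%s\n' "$1" "$2" "$(date -u -d "@$3" +%Y%m%d_%H:%M:%S)"
}
snapname var hourly 0    # -> var_hourly_19700101_00:00:00
```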
> it on the main volume. I don't actually make very extensive use of
> snapshots here anyway, so I didn't think about that aspect originally,
> but that's gotta be what's throwing the real spanner in the works,
> turning a possibly long but workable normal defrag (O(1)) into a lockup
> scenario (O(n)) where virtually no progress is made as currently
> coded.
That is indeed what I'm seeing, so it's very possible you're right.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
* Re: btrfs-transaction blocked for more than 120 seconds
2014-01-08 3:22 ` Marc MERLIN
@ 2014-01-08 9:45 ` Duncan
0 siblings, 0 replies; 31+ messages in thread
From: Duncan @ 2014-01-08 9:45 UTC (permalink / raw)
To: linux-btrfs
Marc MERLIN posted on Tue, 07 Jan 2014 19:22:58 -0800 as excerpted:
> On Fri, Jan 03, 2014 at 09:34:10PM +0000, Duncan wrote:
>> IIRC someone also mentioned problems with autodefrag and an about 3/4
>> gig systemd journal. My gut feeling (IOW, *NOT* benchmarked!) is that
>> double-digit MiB files should /normally/ be fine, but somewhere in the
>> lower triple digits, write-magnification could well become an issue,
>> depending of course on exactly how much active writing the app is doing
>> into the file.
>
> When I defrag'ed my 83GB vm file with 156222 extents, it was not in use
> or being written to.
Note the scale... I said double-digit _MiB_ should be fine, but somewhere
in the triple-digits write magnification likely becomes a problem (this
based on my memory of someone mentioning an issue with a 3/4 gig systemd
journal file).
You then say 83 _GB_, which may or may not be GiB, but either way, it's
three orders of magnitude above the scale I said should be fine, and two
orders of magnitude above the scale at which I said problems likely start
appearing.
So problems at that size are a given.
> On Sun, Jan 05, 2014 at 10:09:38AM -0700, Chris Murphy wrote:
>> I've found, open to other suggestions though, is +C xattr on
>
> So you're saying that defragmentation has known performance problems
> that can't be fixed for now, and that the solution is to avoid getting
> fragmented or to recreate the relevant files.
> If so, I'll go ahead, I just wanted to make sure I didn't have useful
> debug state before clearing my problem.
Basically, yes. One of the devs said he's just starting to focus on it
again now. So it's a known issue that'll take some work to make better.
However, since he's focusing on it again now, now's the time to report
stuff like the sysrq+w trace mentioned.
> On Sun, Jan 05, 2014 at 01:44:25PM -0700, Duncan wrote:
>> [I normally try to reply directly to list but don't believe I've seen
>> this there yet, but got it direct-mailed so will reply-all in
>> response.]
>
> I like direct Cc on replies, makes my filter and mutt coloring happier
> :)
> Dupes with the same message-id are what procmail and others were written
> for :)
Some of us think this sort of list works best as a public newsgroup...
such distributed discussion is what they were designed for, after all...
and that keeps it separate from actual email. That's where gmane.org
comes in with its list2news (as well as list2web) archiving service. We
subscribe to our lists as newsgroups there, use a news/nntp client for
it, and save our email client for actually handling (more private) email.
If you watch, you'll see links to particular messages on the gmane web
interface posted from time to time. For those using gmane's list2news
service (and obviously for those using its web interface as well) that's
real easy, since gmane adds a header with the web link to messages it
serves on the news interface as well. I've been using gmane for perhaps
a decade now, but apparently it's more popular for people on this list
than I might have expected from other lists, since I see more of those
gmane web links posted.
But I've also noticed that a lot more people on this list want CCed/
direct-mailed too, not just to read it on the list. I generally do that
when I see the explicit request, but /only/ when I see the explicit
request.
>> I now believe the lockup must be due to processing the hundreds of
>> thousands of extents on all those snapshots, too
>
> That's a good call. I do have this:
> gandalfthegreat:/mnt/btrfs_pool1# ls var
> var/ var_hourly_20140105_16:00:01/
> var_daily_20140102_00:01:01/ var_hourly_20140105_17:00:26/
> var_daily_20140103_00:59:28/ var_weekly_20131208_00:02:02/
> var_daily_20140104_00:01:01/ var_weekly_20131215_00:02:01/
> var_daily_20140105_00:33:14/ var_weekly_20131229_00:02:02/
> var_hourly_20140105_05:00:01/ var_weekly_20140105_00:33:14/
>
>> I don't actually make very extensive use of
>> snapshots here anyway, so I didn't think about that aspect originally,
>> but that's gotta be what's throwing the real spanner in the works,
>> turning a possibly long but workable normal defrag (O(1)) into a lockup
>> scenario (O(n)) where virtually no progress is made as currently coded.
>
> That is indeed what I'm seeing, so it's very possible you're right.
That's where the evidence is pointing, ATM. Hopefully the defrag work
they're doing now will turn snapshotted defrag back into O(1), too.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Thread overview: 31+ messages
2013-12-31 11:46 btrfs-transaction blocked for more than 120 seconds Sulla
2014-01-01 12:37 ` Duncan
2014-01-01 20:08 ` Sulla
2014-01-02 8:38 ` Duncan
2014-01-03 1:24 ` Kai Krakow
2014-01-03 9:18 ` Duncan
2014-01-05 0:12 ` Sulla
2014-01-03 17:25 ` Marc MERLIN
2014-01-03 21:34 ` Duncan
2014-01-05 6:39 ` Marc MERLIN
2014-01-05 17:09 ` Chris Murphy
2014-01-05 17:54 ` Jim Salter
2014-01-05 19:57 ` Duncan
2014-01-05 20:44 ` Chris Murphy
2014-01-08 3:22 ` Marc MERLIN
2014-01-08 9:45 ` Duncan
2014-01-04 20:48 ` Roger Binns
2014-01-02 8:49 ` Jojo
2014-01-05 20:32 ` Chris Murphy
2014-01-05 21:17 ` Sulla
2014-01-05 22:36 ` Brendan Hide
2014-01-05 22:57 ` Roman Mamedov
2014-01-07 10:22 ` Brendan Hide
2014-01-06 0:15 ` Chris Murphy
2014-01-06 0:19 ` Chris Murphy
2014-01-05 23:48 ` Chris Murphy
2014-01-05 23:57 ` Chris Murphy
2014-01-06 0:25 ` Sulla
2014-01-06 0:49 ` Chris Murphy
[not found] ` <52CA06FE.2030802@gmx.at>
2014-01-06 1:55 ` Chris Murphy
[not found] <ADin1n00P0VAdqd01DioM9>
2014-01-05 20:44 ` Duncan