* BTRFS free space handling still needs more work: Hangs again
@ 2014-12-26 13:37 Martin Steigerwald
2014-12-26 14:20 ` Martin Steigerwald
` (2 more replies)
0 siblings, 3 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-26 13:37 UTC (permalink / raw)
To: linux-btrfs
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 16557 bytes --]
Hello!
First: have a merry Christmas and enjoy some quiet days.
Second: whenever you feel like it, here is a little rant, but also a bug
report:
I have this on a 3.18 kernel on Debian Sid with a BTRFS Dual SSD RAID with
space_cache, skinny metadata extents – are these a problem? – and
compress=lzo:
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
Total devices 2 FS bytes used 144.41GiB
devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
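(Side note: summing the per-profile totals above already shows why balance will struggle. With RAID1 every chunk is mirrored, so the per-device allocation is simply the sum of the totals. A quick check:

```shell
# Sum the per-profile totals from the `btrfs fi df` output above; with
# RAID1 every chunk is mirrored, so this is also what EACH device has
# allocated in chunks.
awk 'BEGIN {
    data = 154.97      # Data, RAID1 total (GiB)
    meta = 5.00        # Metadata, RAID1 total (GiB)
    system_t = 32 / 1024   # System, RAID1 total (GiB)
    printf "%.2f GiB allocated of 160.00 GiB per device\n", data + meta + system_t
}'
```

This matches the `used 160.00GiB` per device from `btrfs fi sh`: no unallocated device space is left for new chunks, which is exactly the situation where balance and allocations start failing with ENOSPC even though `df` still shows free space.)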
And I had hangs with BTRFS again. This time as I wanted to install tax
return software in a VirtualBox´d Windows XP VM (which I use once a year,
cause I know of no tax return software for Linux that would be suitable for
Germany, and I frankly don´t care about the end of security support, cause
all surfing and other network access I do from the Linux box, and I only
run the VM behind a firewall).
And thus I try the balance dance again:
merkaba:~> btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=5 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=10 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=20 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=30 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=40 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=50 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=60 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=70 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=70 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=70 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=65 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=67 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -musage=10 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -musage=05 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
Okay, not really, ey?
But
merkaba:~> btrfs balance start /home
works.
So I am basically rebalancing everything, without need I bet, causing
more churn on the SSDs than necessary.
Otherwise the alternative would be to make the BTRFS filesystem larger, I bet.
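For what it's worth, the filtered "balance dance" above can be scripted as a loop. A sketch (the `balance_dance` name and the `BTRFS_CMD` override are my own inventions; the override only exists so the loop can be exercised without touching a real filesystem):

```shell
# Retry balance with increasing -dusage filters, mirroring the manual
# "balance dance" above: walk the filter levels upward and stop at the
# first level that errors out (typically ENOSPC once no block group at
# that usage level can be relocated).
BTRFS_CMD=${BTRFS_CMD:-btrfs}   # overridable for dry runs (my addition)

balance_dance() {
    mnt=$1
    for u in 5 10 20 30 40 50 60 70; do
        echo "trying -dusage=$u"
        $BTRFS_CMD balance start -dusage="$u" "$mnt" || return 1
    done
}
```

`balance_dance /home` then runs the same sequence of filtered balances as the transcript above, without having to retype each step.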
Well, this is still not what I would consider stable. So I will still
recommend: if you want to use BTRFS on a server and estimate 25 GiB of
usage, make the drive at least 50 GiB, or even 100 GiB to be on the safe
side. Like I recommended for SLES 11 SP 2/3 BTRFS deployments – but
hey, they meanwhile say "don´t", as in "just don´t use it at all and use SLES
12 instead, cause BTRFS on a 3.0 kernel with a ton of snapper snapshots
is really not anywhere near production or enterprise
reliability" (if you need proof, I think I still have a snapshot of a SLES
11 SP3 VM that broke overnight due to me having installed an LDAP server
while preparing some training slides). Even a 3.12 kernel seems daring regarding
BTRFS, unless SUSE actively backports fixes.
In kernel log the failed attempts look like this:
[ 209.783437] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 210.116416] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 210.455479] BTRFS info (device dm-3): 1 enospc errors during balance
[ 212.915690] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 213.291634] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 213.654145] BTRFS info (device dm-3): 1 enospc errors during balance
[ 219.219584] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 219.531864] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 222.721234] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 223.084007] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 226.418100] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 226.730118] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 230.218590] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 230.559232] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 233.979952] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 234.320569] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 237.672101] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 237.961171] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 241.262757] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 241.594655] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 244.783861] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 245.095942] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[ 245.418042] BTRFS info (device dm-3): relocating block group 500198014976 flags 17
[ 245.544153] BTRFS info (device dm-3): relocating block group 496997761024 flags 17
[ 245.644254] BTRFS info (device dm-3): relocating block group 495924019200 flags 17
[ 246.281001] BTRFS info (device dm-3): relocating block group 488407826432 flags 17
[ 246.449939] BTRFS info (device dm-3): relocating block group 431499509760 flags 17
[ 246.561724] BTRFS info (device dm-3): relocating block group 411804106752 flags 17
[ 246.723997] BTRFS info (device dm-3): relocating block group 409656623104 flags 17
[ 251.770469] BTRFS info (device dm-3): 7 enospc errors during balance
My expectation for a *stable* and *production quality* filesystem would be:
I never, ever get hangs with one kworker running at 100% of one Sandy Bridge
core *for minutes* on a production filesystem, and that's about it.
Especially for a filesystem that claims to still have a good amount of free
space:
merkaba:~> LANG=C df -hT /home
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/msata-home btrfs 160G 146G 25G 86% /home
(yeah, these don´t add up; I attribute that to compression, but hey, who knows)
In the kernel log I have things like this, from somewhat earlier; these I had
not yet perceived as hangs:
Dec 23 23:33:26 merkaba kernel: [23040.621678] ------------[ cut here ]------------
Dec 23 23:33:26 merkaba kernel: [23040.621792] WARNING: CPU: 3 PID: 308 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]()
Dec 23 23:33:26 merkaba kernel: [23040.621796] Modules linked in: mmc_block ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device hid_generic hid_pl ff_memless usbhid hid nls_utf8 nls_cp437 vfat fat uas usb_storage bnep bluetooth binfmt_misc cpufreq_userspace cpufreq_stats pci_stub cpufreq_powersave vboxpci(O) cpufreq_conservative vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ext4 crc16 mbcache jbd2 intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi iwldvm aesni_intel snd_hda_codec_conexant mac80211 aes_x86_64 snd_hda_codec_generic lrw gf128mul glue_helper ablk_helper cryptd psmouse snd_hda_intel serio_raw iwlwifi pcspkr lpc_ich snd_hda_controller i2c_i801 mfd_core snd_hda_codec snd_hwdep cfg80211 snd_pcm snd_timer shpchp thinkpad_acpi nvram snd soundcore rfkill battery ac tpm_tis tpm processor evdev joydev sbs sbshc coretemp hdaps(O) tp_smapi(O) thinkpad_ec(O) loop firewire_sbp2 fuse ecryptfs autofs4 md_mod btrfs xor raid6_pq microcode dm_mirror dm_region_hash dm_log dm_mod sg sr_mod sd_mod cdrom crc32c_intel ahci firewire_ohci libahci sata_sil24 e1000e libata ptp sdhci_pci ehci_pci sdhci firewire_core ehci_hcd crc_itu_t pps_core mmc_core scsi_mod usbcore usb_common thermal
Dec 23 23:33:26 merkaba kernel: [23040.621978] CPU: 3 PID: 308 Comm: btrfs-transacti Tainted: G W O 3.18.0-tp520 #14
Dec 23 23:33:26 merkaba kernel: [23040.621982] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET63WW (1.43 ) 05/08/2013
Dec 23 23:33:26 merkaba kernel: [23040.621985] 0000000000000009 ffff8804044c7d88 ffffffff814a516e 0000000080000000
Dec 23 23:33:26 merkaba kernel: [23040.621992] 0000000000000000 ffff8804044c7dc8 ffffffff8103f83e ffff8804044c7db8
Dec 23 23:33:26 merkaba kernel: [23040.621999] ffffffffc04bd5a1 ffff880037590800 ffff8800a599c320 0000000000000000
Dec 23 23:33:26 merkaba kernel: [23040.622006] Call Trace:
Dec 23 23:33:26 merkaba kernel: [23040.622026] [<ffffffff814a516e>] dump_stack+0x4f/0x7c
Dec 23 23:33:26 merkaba kernel: [23040.622034] [<ffffffff8103f83e>] warn_slowpath_common+0x7c/0x96
Dec 23 23:33:26 merkaba kernel: [23040.622104] [<ffffffffc04bd5a1>] ? btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622111] [<ffffffff8103f8ec>] warn_slowpath_null+0x15/0x17
Dec 23 23:33:26 merkaba kernel: [23040.622164] [<ffffffffc04bd5a1>] btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622211] [<ffffffffc047a830>] btrfs_commit_transaction+0x394/0x8bc [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622254] [<ffffffffc0476dd5>] transaction_kthread+0xf9/0x1af [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622295] [<ffffffffc0476cdc>] ? btrfs_cleanup_transaction+0x43a/0x43a [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622305] [<ffffffff8105697c>] kthread+0xb2/0xba
Dec 23 23:33:26 merkaba kernel: [23040.622312] [<ffffffff814a0000>] ? dcbnl_newmsg+0x14/0xa8
Dec 23 23:33:26 merkaba kernel: [23040.622317] [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 23 23:33:26 merkaba kernel: [23040.622324] [<ffffffff814a9f6c>] ret_from_fork+0x7c/0xb0
Dec 23 23:33:26 merkaba kernel: [23040.622329] [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 23 23:33:26 merkaba kernel: [23040.622334] ---[ end trace 90db5b1c7067cf1d ]---
Dec 23 23:33:56 merkaba kernel: [23070.671999] ------------[ cut here ]------------
Dec 23 23:33:56 merkaba kernel: [23070.672064] WARNING: CPU: 3 PID: 308 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]()
Dec 23 23:33:56 merkaba kernel: [23070.672067] Modules linked in: mmc_block ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device hid_generic hid_pl ff_memless usbhid hid nls_utf8 nls_cp437 vfat fat uas usb_storage bnep bluetooth binfmt_misc cpufreq_userspace cpufreq_stats pci_stub cpufreq_powersave vboxpci(O) cpufreq_conservative vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ext4 crc16 mbcache jbd2 intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi iwldvm aesni_intel snd_hda_codec_conexant mac80211 aes_x86_64 snd_hda_codec_generic lrw gf128mul glue_helper ablk_helper cryptd psmouse snd_hda_intel serio_raw iwlwifi pcspkr lpc_ich snd_hda_controller i2c_i801 mfd_core snd_hda_codec snd_hwdep cfg80211 snd_pcm snd_timer shpchp thinkpad_acpi nvram snd soundcore rfkill battery ac tpm_tis tpm processor evdev joydev sbs sbshc coretemp hdaps(O) tp_smapi(O) thinkpad_ec(O) loop firewire_sbp2 fuse ecryptfs autofs4 md_mod btrfs xor raid6_pq microcode dm_mirror dm_region_hash dm_log dm_mod sg sr_mod sd_mod cdrom crc32c_intel ahci firewire_ohci libahci sata_sil24 e1000e libata ptp sdhci_pci ehci_pci sdhci firewire_core ehci_hcd crc_itu_t pps_core mmc_core scsi_mod usbcore usb_common thermal
Dec 23 23:33:56 merkaba kernel: [23070.672193] CPU: 3 PID: 308 Comm: btrfs-transacti Tainted: G W O 3.18.0-tp520 #14
Dec 23 23:33:56 merkaba kernel: [23070.672196] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET63WW (1.43 ) 05/08/2013
Dec 23 23:33:56 merkaba kernel: [23070.672200] 0000000000000009 ffff8804044c7d88 ffffffff814a516e 0000000080000000
Dec 23 23:33:56 merkaba kernel: [23070.672205] 0000000000000000 ffff8804044c7dc8 ffffffff8103f83e ffff8804044c7db8
Dec 23 23:33:56 merkaba kernel: [23070.672209] ffffffffc04bd5a1 ffff880037590800 ffff8802cd6e50a0 0000000000000000
Dec 23 23:33:56 merkaba kernel: [23070.672214] Call Trace:
Dec 23 23:33:56 merkaba kernel: [23070.672222] [<ffffffff814a516e>] dump_stack+0x4f/0x7c
Dec 23 23:33:56 merkaba kernel: [23070.672229] [<ffffffff8103f83e>] warn_slowpath_common+0x7c/0x96
Dec 23 23:33:56 merkaba kernel: [23070.672264] [<ffffffffc04bd5a1>] ? btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672270] [<ffffffff8103f8ec>] warn_slowpath_null+0x15/0x17
Dec 23 23:33:56 merkaba kernel: [23070.672301] [<ffffffffc04bd5a1>] btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672330] [<ffffffffc047a830>] btrfs_commit_transaction+0x394/0x8bc [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672357] [<ffffffffc0476dd5>] transaction_kthread+0xf9/0x1af [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672383] [<ffffffffc0476cdc>] ? btrfs_cleanup_transaction+0x43a/0x43a [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672389] [<ffffffff8105697c>] kthread+0xb2/0xba
Dec 23 23:33:56 merkaba kernel: [23070.672395] [<ffffffff814a0000>] ? dcbnl_newmsg+0x14/0xa8
Dec 23 23:33:56 merkaba kernel: [23070.672399] [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 23 23:33:56 merkaba kernel: [23070.672405] [<ffffffff814a9f6c>] ret_from_fork+0x7c/0xb0
Dec 23 23:33:56 merkaba kernel: [23070.672409] [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 23 23:33:56 merkaba kernel: [23070.672412] ---[ end trace 90db5b1c7067cf1e ]---
Dec 23 23:34:26 merkaba kernel: [23100.709530] ------------[ cut here ]------------
The recent hangs today are not in the log; I was upset enough to
forcefully switch off the machine. Tax returns are not my all-time favorite,
but tax returns with a hanging filesystem are no fun at all.
I will upgrade to 3.19, starting with 3.19-rc2.
Let's see what this balance will do.
It currently is here:
merkaba:~> btrfs balance status /home
Balance on '/home' is running
32 out of about 164 chunks balanced (53 considered), 80% left
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=142.10GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.33GiB
GlobalReserve, single: total=512.00MiB, used=254.31MiB
So, for once we are told not to balance needlessly, but then, for
stable operation, I need to balance nonetheless?
Well, let's see how it improves things. Last time it did. Considerably.
BTRFS only had these hang problems with 3.15 and 3.16 once the trees had
allocated all remaining space. So I expect this balance to shrink these trees
so that some device space is freed and becomes allocatable again.
Next I will also defragment the Windows VM image, just as an additional
safety net.
Okay, doing something else now while BTRFS hopefully sorts things out.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7ÿôèº{.nÇ+·®+%Ëÿ±éݶ\x17¥wÿº{.nÇ+·¥{±ý»k~ÏâØ^nr¡ö¦zË\x1aëh¨èÚ&£ûàz¿äz¹Þú+Ê+zf£¢·h§~Ûiÿÿïêÿêçz_è®\x0fæj:+v¨þ)ߣøm
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
@ 2014-12-26 14:20 ` Martin Steigerwald
2014-12-26 14:41 ` Martin Steigerwald
2014-12-26 15:59 ` Martin Steigerwald
2014-12-26 22:48 ` Robert White
2 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-26 14:20 UTC (permalink / raw)
To: linux-btrfs
On Friday, 26 December 2014, at 14:37:36, you wrote:
> It currently is here:
>
> merkaba:~> btrfs balance status /home
> Balance on '/home' is running
> 32 out of about 164 chunks balanced (53 considered), 80% left
>
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=154.97GiB, used=142.10GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.33GiB
> GlobalReserve, single: total=512.00MiB, used=254.31MiB
Now I got this:
merkaba:~> btrfs balance start /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> dmesg | tail
[ 4260.276416] BTRFS info (device dm-3): relocating block group 151418568704 flags 17
[ 4274.683349] BTRFS info (device dm-3): found 25089 extents
[ 4295.836590] BTRFS info (device dm-3): found 25089 extents
[ 4296.026778] BTRFS info (device dm-3): relocating block group 150344826880 flags 17
[ 4312.732021] BTRFS info (device dm-3): found 59388 extents
[ 4326.398261] BTRFS info (device dm-3): found 59388 extents
[ 4326.813205] BTRFS info (device dm-3): relocating block group 149271085056 flags 17
[ 4347.346540] BTRFS info (device dm-3): found 104739 extents
[ 4357.160098] BTRFS info (device dm-3): found 104739 extents
[ 4359.304646] BTRFS info (device dm-3): 20 enospc errors during balance
And I wonder about:
> Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599
>
84C7N�����r��y����b�X��ǧv�^�){.n�+����{�n�߲)����w*\x1fjg���\x1e�����ݢj/���z�ޖ��2
> �ޙ����&�)ߡ�a��\x7f��\x1e�G���h�\x0f�j:+v���w��٥
These random chars are not supposed to be there: I better run scrub straight
after this balance.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-26 14:20 ` Martin Steigerwald
@ 2014-12-26 14:41 ` Martin Steigerwald
2014-12-27 3:33 ` Duncan
0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-26 14:41 UTC (permalink / raw)
To: linux-btrfs
On Friday, 26 December 2014, at 15:20:42, you wrote:
> And I wonder about:
> > Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> > GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599
> >
> >
>
>
84C7N�����r��y����b�X��ǧv�^�){.n�+����{�n�߲)����w*\x1fjg���\x1e�����ݢj/���z�ޖ��2
>
> > �ޙ����&�)ߡ�a��\x7f��\x1e�G���h�\x0f�j:+v���w��٥
>
> These random chars are not supposed to be there: I better run scrub
> straight after this balance.
Okay, that's not me, I think. scrub didn´t report any errors, and when I look
in the KMail sent folder I don´t see these random chars either, so it seems
some server on the wire added the garbage.
Let's defragment the file:
merkaba:/home/martin/.VirtualBox/HardDisks> filefrag Winlala.vdi
Winlala.vdi: 41462 extents found
merkaba:/home/martin/.VirtualBox/HardDisks> btrfs filesystem defragment Winlala.vdi
merkaba:/home/martin/.VirtualBox/HardDisks> filefrag Winlala.vdi
Winlala.vdi: 11735 extents found
merkaba:/home/martin/.VirtualBox/HardDisks> sync
merkaba:/home/martin/.VirtualBox/HardDisks> filefrag Winlala.vdi
Winlala.vdi: 11735 extents found
Okay, that together with:
merkaba:~> btrfs fi df /home
Data, RAID1: total=151.95GiB, used=144.68GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.25GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: […]
Total devices 2 FS bytes used 147.94GiB
devid 1 size 160.00GiB used 156.98GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 156.98GiB path /dev/mapper/sata-home
Btrfs v3.17
May do for a while.
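As an aside, the defragment step above could be made conditional on the measured extent count, so large images are only rewritten when they are actually fragmented. A sketch (the helper names, the threshold value, and the command overrides are mine; the overrides only exist so this can be exercised without a btrfs mount):

```shell
# Overridable commands (my addition, purely for testability).
FILEFRAG_CMD=${FILEFRAG_CMD:-filefrag}
DEFRAG_CMD=${DEFRAG_CMD:-"btrfs filesystem defragment"}

# filefrag prints e.g. "Winlala.vdi: 41462 extents found"; field 2 is the count.
extent_count() {
    $FILEFRAG_CMD "$1" | awk '{ print $2 }'
}

# Only defragment when the file has more extents than the threshold,
# to avoid needless rewrites (and SSD churn) on healthy files.
defrag_if_fragmented() {
    file=$1
    threshold=${2:-20000}   # 20000 is an arbitrary example value
    if [ "$(extent_count "$file")" -gt "$threshold" ]; then
        $DEFRAG_CMD "$file"
    fi
}
```

With the numbers above, `defrag_if_fragmented Winlala.vdi` would have fired at 41462 extents and then stayed quiet at 11735.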
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
2014-12-26 14:20 ` Martin Steigerwald
@ 2014-12-26 15:59 ` Martin Steigerwald
2014-12-27 4:26 ` Duncan
2014-12-26 22:48 ` Robert White
2 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-26 15:59 UTC (permalink / raw)
To: linux-btrfs
On Friday, 26 December 2014, at 14:37:36, you wrote:
> I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> space_cache, skinny meta data extents – are these a problem? – and
> compress=lzo:
>
> merkaba:~> btrfs fi sh /home
> Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> Total devices 2 FS bytes used 144.41GiB
> devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=154.97GiB, used=141.12GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.29GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
> And I had hangs with BTRFS again. This time as I wanted to install tax
> return software in a VirtualBox´d Windows XP VM (which I use once a year,
> cause I know of no tax return software for Linux that would be suitable for
> Germany, and I frankly don´t care about the end of security support, cause
> all surfing and other network access I do from the Linux box, and I only
> run the VM behind a firewall).
These are 100% reproducible for me:
1) Have the compress=lzo, space_cache BTRFS Dual SSD RAID 1 with both
devices filled up with trees.
2) Have a Windows XP VM in VirtualBox on that BTRFS RAID 1.
3) Press "Defragment" in Windows (in the hope of being able to use sdelete -c
and then VBoxManage modifyhd Winlala.vdi --compact to reduce the image size).
This gives one kworker thread using up 100% of a core for minutes, with
bursts of btrfs-transaction activity in between, and:
Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce: [Hardware Error]: Machine check events logged
Dec 26 16:18:15 merkaba kernel: [ 8119.879230] CPU2: Core temperature above threshold, cpu clock throttled (total events = 54053)
Dec 26 16:18:15 merkaba kernel: [ 8119.879232] CPU0: Package temperature above threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.879234] CPU3: Core temperature above threshold, cpu clock throttled (total events = 54053)
Dec 26 16:18:15 merkaba kernel: [ 8119.879235] CPU1: Package temperature above threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.879237] CPU3: Package temperature above threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.879245] CPU2: Package temperature above threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.880218] CPU2: Core temperature/speed normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880219] CPU1: Package temperature/speed normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880220] CPU3: Core temperature/speed normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880221] CPU0: Package temperature/speed normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880223] CPU3: Package temperature/speed normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880228] CPU2: Package temperature/speed normal
Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce: [Hardware Error]: Machine check events logged
Dec 26 16:20:57 merkaba kernel: [ 8281.461874] INFO: task kded4:1959 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.464106] Tainted: G O 3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.466361] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.468760] kded4 D ffff88040764ce98 0 1959 1 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.471112] ffff8803efa57bb8 0000000000000002 ffff8803efa57c00 ffff880407f261c0
Dec 26 16:20:57 merkaba kernel: [ 8281.473462] ffff8803efa57fd8 ffff88040764c950 0000000000012300 ffff88040764c950
Dec 26 16:20:57 merkaba kernel: [ 8281.475780] ffff8803efa57ba8 ffff8803eea9a900 ffff8803eea9a904 ffff88040764c950
Dec 26 16:20:57 merkaba kernel: [ 8281.478142] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.480414] [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.482694] [<ffffffff814a72d3>] schedule_preempt_disabled+0x13/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.484979] [<ffffffff814a8440>] __mutex_lock_slowpath+0xab/0x126
Dec 26 16:20:57 merkaba kernel: [ 8281.487271] [<ffffffff81143735>] ? lookup_fast+0x173/0x238
Dec 26 16:20:57 merkaba kernel: [ 8281.489534] [<ffffffff814a84ce>] mutex_lock+0x13/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.491811] [<ffffffff81143c45>] walk_component+0x69/0x17e
Dec 26 16:20:57 merkaba kernel: [ 8281.494092] [<ffffffff81143d88>] lookup_last+0x2e/0x30
Dec 26 16:20:57 merkaba kernel: [ 8281.496416] [<ffffffff81145a32>] path_lookupat+0x83/0x2d9
Dec 26 16:20:57 merkaba kernel: [ 8281.498733] [<ffffffff8121f38c>] ? debug_smp_processor_id+0x17/0x19
Dec 26 16:20:57 merkaba kernel: [ 8281.501074] [<ffffffff8114683c>] ? getname_flags+0x31/0x134
Dec 26 16:20:57 merkaba kernel: [ 8281.503338] [<ffffffff81145cad>] filename_lookup+0x25/0x7a
Dec 26 16:20:57 merkaba kernel: [ 8281.505604] [<ffffffff8114767a>] user_path_at_empty+0x55/0x93
Dec 26 16:20:57 merkaba kernel: [ 8281.507941] [<ffffffff8105ec3e>] ? preempt_count_add+0x7c/0x90
Dec 26 16:20:57 merkaba kernel: [ 8281.510210] [<ffffffff81071751>] ? cpuacct_account_field+0x56/0x5f
Dec 26 16:20:57 merkaba kernel: [ 8281.512499] [<ffffffff81071751>] ? cpuacct_account_field+0x56/0x5f
Dec 26 16:20:57 merkaba kernel: [ 8281.514705] [<ffffffff811476c4>] user_path_at+0xc/0xe
Dec 26 16:20:57 merkaba kernel: [ 8281.517039] [<ffffffff8113ec3b>] vfs_fstatat+0x49/0x84
Dec 26 16:20:57 merkaba kernel: [ 8281.519397] [<ffffffff810be29a>] ? acct_account_cputime+0x17/0x19
Dec 26 16:20:57 merkaba kernel: [ 8281.521686] [<ffffffff8113ec8c>] vfs_stat+0x16/0x18
Dec 26 16:20:57 merkaba kernel: [ 8281.524064] [<ffffffff8113ecd1>] SYSC_newstat+0x15/0x2e
Dec 26 16:20:57 merkaba kernel: [ 8281.526367] [<ffffffff8100cf3f>] ? user_exit+0x13/0x15
Dec 26 16:20:57 merkaba kernel: [ 8281.528792] [<ffffffff8100e21d>] ? syscall_trace_enter_phase1+0x57/0x12a
Dec 26 16:20:57 merkaba kernel: [ 8281.531120] [<ffffffff8100e537>] ? syscall_trace_leave+0xcc/0x10a
Dec 26 16:20:57 merkaba kernel: [ 8281.533577] [<ffffffff814aa264>] ? int_check_syscall_exit_work+0x34/0x3d
Dec 26 16:20:57 merkaba kernel: [ 8281.535977] [<ffffffff8113edb9>] SyS_newstat+0x9/0xb
Dec 26 16:20:57 merkaba kernel: [ 8281.538416] [<ffffffff814aa012>] system_call_fastpath+0x12/0x17
Dec 26 16:20:57 merkaba kernel: [ 8281.540835] INFO: task kactivitymanage:1994 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.540838] Tainted: G O 3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.540838] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.540848] kactivitymanage D 0000000000000000 0 1994 1 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.540851] ffff8803e42efe58 0000000000000002 00000001ff68d6d6 ffff8800c285c950
Dec 26 16:20:57 merkaba kernel: [ 8281.540854] ffff8803e42effd8 ffff8803fda361c0 0000000000012300 ffff8803fda361c0
Dec 26 16:20:57 merkaba kernel: [ 8281.540857] 00000000000034e5 ffff8804059e0348 ffff8804059e034c ffff8803fda361c0
Dec 26 16:20:57 merkaba kernel: [ 8281.540858] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.540862] [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.540865] [<ffffffff814a72d3>] schedule_preempt_disabled+0x13/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.540867] [<ffffffff814a8440>] __mutex_lock_slowpath+0xab/0x126
Dec 26 16:20:57 merkaba kernel: [ 8281.540871] [<ffffffff81150f22>] ? __fget+0x67/0x72
Dec 26 16:20:57 merkaba kernel: [ 8281.540873] [<ffffffff814a84ce>] mutex_lock+0x13/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.540876] [<ffffffff811516e9>] __fdget_pos+0x36/0x3c
Dec 26 16:20:57 merkaba kernel: [ 8281.540878] [<ffffffff8113a393>] fdget_pos+0x9/0x15
Dec 26 16:20:57 merkaba kernel: [ 8281.540881] [<ffffffff8113b39d>] SyS_write+0x19/0x71
Dec 26 16:20:57 merkaba kernel: [ 8281.540884] [<ffffffff814aa264>] ? int_check_syscall_exit_work+0x34/0x3d
Dec 26 16:20:57 merkaba kernel: [ 8281.540886] [<ffffffff814aa012>] system_call_fastpath+0x12/0x17
Dec 26 16:20:57 merkaba kernel: [ 8281.540890] INFO: task plasma-desktop:2013 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.540891] Tainted: G O 3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.540892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.540895] plasma-desktop D ffff8803fda39db8 0 2013 1 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.540898] ffff8803d947fbb8 0000000000000002 ffff8803d947fc00 ffffffff81a16500
Dec 26 16:20:57 merkaba kernel: [ 8281.540900] ffff8803d947ffd8 ffff8803fda39870 0000000000012300 ffff8803fda39870
Dec 26 16:20:57 merkaba kernel: [ 8281.540902] ffff8803d947fba8 ffff8803eea9a900 ffff8803eea9a904 ffff8803fda39870
Dec 26 16:20:57 merkaba kernel: [ 8281.540903] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.540906] [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.540908] [<ffffffff814a72d3>] schedule_preempt_disabled+0x13/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.540910] [<ffffffff814a8440>] __mutex_lock_slowpath+0xab/0x126
Dec 26 16:20:57 merkaba kernel: [ 8281.540913] [<ffffffff81143735>] ? lookup_fast+0x173/0x238
Dec 26 16:20:57 merkaba kernel: [ 8281.540916] [<ffffffff814a84ce>] mutex_lock+0x13/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.540918] [<ffffffff81143c45>] walk_component+0x69/0x17e
Dec 26 16:20:57 merkaba kernel: [ 8281.540921] [<ffffffff813ca675>] ? __sock_recvmsg_nosec+0x29/0x2b
Dec 26 16:20:57 merkaba kernel: [ 8281.540924] [<ffffffff81143d88>] lookup_last+0x2e/0x30
Dec 26 16:20:57 merkaba kernel: [ 8281.540926] [<ffffffff81145a32>] path_lookupat+0x83/0x2d9
Dec 26 16:20:57 merkaba kernel: [ 8281.540929] [<ffffffff8121f38c>] ? debug_smp_processor_id+0x17/0x19
Dec 26 16:20:57 merkaba kernel: [ 8281.540932] [<ffffffff8114683c>] ? getname_flags+0x31/0x134
Dec 26 16:20:57 merkaba kernel: [ 8281.540934] [<ffffffff81145cad>] filename_lookup+0x25/0x7a
Dec 26 16:20:57 merkaba kernel: [ 8281.540937] [<ffffffff8114767a>] user_path_at_empty+0x55/0x93
Dec 26 16:20:57 merkaba kernel: [ 8281.540942] [<ffffffff8105ec3e>] ? preempt_count_add+0x7c/0x90
Dec 26 16:20:57 merkaba kernel: [ 8281.540947] [<ffffffff81071751>] ? cpuacct_account_field+0x56/0x5f
Dec 26 16:20:57 merkaba kernel: [ 8281.540949] [<ffffffff81071751>] ? cpuacct_account_field+0x56/0x5f
Dec 26 16:20:57 merkaba kernel: [ 8281.540952] [<ffffffff811476c4>] user_path_at+0xc/0xe
Dec 26 16:20:57 merkaba kernel: [ 8281.540956] [<ffffffff81160193>] user_statfs+0x2b/0x68
Dec 26 16:20:57 merkaba kernel: [ 8281.540960] [<ffffffff810be29a>] ? acct_account_cputime+0x17/0x19
Dec 26 16:20:57 merkaba kernel: [ 8281.540963] [<ffffffff811601eb>] SYSC_statfs+0x1b/0x3a
Dec 26 16:20:57 merkaba kernel: [ 8281.540965] [<ffffffff8100cf3f>] ? user_exit+0x13/0x15
Dec 26 16:20:57 merkaba kernel: [ 8281.540968] [<ffffffff8100e21d>] ? syscall_trace_enter_phase1+0x57/0x12a
Dec 26 16:20:57 merkaba kernel: [ 8281.540970] [<ffffffff8100e537>] ? syscall_trace_leave+0xcc/0x10a
Dec 26 16:20:57 merkaba kernel: [ 8281.540973] [<ffffffff814aa264>] ? int_check_syscall_exit_work+0x34/0x3d
Dec 26 16:20:57 merkaba kernel: [ 8281.540976] [<ffffffff81160328>] SyS_statfs+0x9/0xb
Dec 26 16:20:57 merkaba kernel: [ 8281.540978] [<ffffffff814aa012>] system_call_fastpath+0x12/0x17
Dec 26 16:20:57 merkaba kernel: [ 8281.540983] INFO: task krunner:2050 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.540984] Tainted: G O 3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.540985] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.540988] krunner D 0000000000000000 0 2050 1 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.540991] ffff8803cb68be58 0000000000000002 ffff8803cb68be28 ffff8803fda39870
Dec 26 16:20:57 merkaba kernel: [ 8281.540993] ffff8803cb68bfd8 ffff8800cecee1c0 0000000000012300 ffff8800cecee1c0
Dec 26 16:20:57 merkaba kernel: [ 8281.540995] 00000000000089ef ffff8804059e0348 ffff8804059e034c ffff8800cecee1c0
Dec 26 16:20:57 merkaba kernel: [ 8281.540996] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.540998] [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.541001] [<ffffffff814a72d3>] schedule_preempt_disabled+0x13/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.541003] [<ffffffff814a8440>] __mutex_lock_slowpath+0xab/0x126
Dec 26 16:20:57 merkaba kernel: [ 8281.541005] [<ffffffff81150f22>] ? __fget+0x67/0x72
Dec 26 16:20:57 merkaba kernel: [ 8281.541008] [<ffffffff814a84ce>] mutex_lock+0x13/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.541010] [<ffffffff811516e9>] __fdget_pos+0x36/0x3c
Dec 26 16:20:57 merkaba kernel: [ 8281.541012] [<ffffffff8113a393>] fdget_pos+0x9/0x15
Dec 26 16:20:57 merkaba kernel: [ 8281.541014] [<ffffffff8113b39d>] SyS_write+0x19/0x71
Dec 26 16:20:57 merkaba kernel: [ 8281.541017] [<ffffffff814aa264>] ? int_check_syscall_exit_work+0x34/0x3d
Dec 26 16:20:57 merkaba kernel: [ 8281.541019] [<ffffffff814aa012>] system_call_fastpath+0x12/0x17
Dec 26 16:20:57 merkaba kernel: [ 8281.541035] INFO: task akonadi_baloo_i:2273 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.541036] Tainted: G O 3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.541036] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.541039] akonadi_baloo_i D ffff8803b773b628 0 2273 2170 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.541041] ffff8803ac8ff948 0000000000000002 ffff88040d80dc00 ffffffff81a16500
Dec 26 16:20:57 merkaba kernel: [ 8281.541043] ffff8803ac8fffd8 ffff8803b773b0e0 0000000000012300 ffff8803b773b0e0
Dec 26 16:20:57 merkaba kernel: [ 8281.541046] ffff8803ac8ff928 7fffffffffffffff ffff8803ac8ffa80 0000000000000002
Dec 26 16:20:57 merkaba kernel: [ 8281.541046] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.541049] [<ffffffff814a8e73>] ? console_conditional_schedule+0x14/0x14
Dec 26 16:20:57 merkaba kernel: [ 8281.541051] [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.541054] [<ffffffff814a8e93>] schedule_timeout+0x20/0xf5
Dec 26 16:20:57 merkaba kernel: [ 8281.541056] [<ffffffff8105eb92>] ? get_parent_ip+0xe/0x3e
Dec 26 16:20:57 merkaba kernel: [ 8281.541058] [<ffffffff8105ec3e>] ? preempt_count_add+0x7c/0x90
Dec 26 16:20:57 merkaba kernel: [ 8281.541061] [<ffffffff814a97db>] ? _raw_spin_lock_irq+0x1c/0x20
Dec 26 16:20:57 merkaba kernel: [ 8281.541063] [<ffffffff814a7999>] __wait_for_common+0x11e/0x163
Dec 26 16:20:57 merkaba kernel: [ 8281.541066] [<ffffffff810607da>] ? wake_up_state+0xd/0xd
Dec 26 16:20:57 merkaba kernel: [ 8281.541069] [<ffffffff814a79fd>] wait_for_completion+0x1f/0x21
Dec 26 16:20:57 merkaba kernel: [ 8281.541072] [<ffffffff8115b5fb>] writeback_inodes_sb_nr+0x8c/0x95
Dec 26 16:20:57 merkaba kernel: [ 8281.541077] [<ffffffff81050101>] ? perf_trace_workqueue_work+0x8e/0x95
Dec 26 16:20:57 merkaba kernel: [ 8281.541115] [<ffffffffc044a44e>] flush_space+0x200/0x426 [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541135] [<ffffffffc044a20c>] ? can_overcommit+0xaa/0xec [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541160] [<ffffffffc044aa48>] reserve_metadata_bytes+0x274/0x368 [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541164] [<ffffffff8105eb92>] ? get_parent_ip+0xe/0x3e
Dec 26 16:20:57 merkaba kernel: [ 8281.541166] [<ffffffff8105ec3e>] ? preempt_count_add+0x7c/0x90
Dec 26 16:20:57 merkaba kernel: [ 8281.541185] [<ffffffffc044b39b>] btrfs_delalloc_reserve_metadata+0x100/0x32c [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541215] [<ffffffffc046c182>] __btrfs_buffered_write+0x1be/0x4a4 [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541218] [<ffffffff81101a6e>] ? kmap_atomic+0x13/0x39
Dec 26 16:20:57 merkaba kernel: [ 8281.541220] [<ffffffff81101aa2>] ? pagefault_enable+0xe/0x21
Dec 26 16:20:57 merkaba kernel: [ 8281.541242] [<ffffffffc046c76b>] btrfs_file_write_iter+0x303/0x40e [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541245] [<ffffffff8113a60a>] new_sync_write+0x77/0x9b
Dec 26 16:20:57 merkaba kernel: [ 8281.541247] [<ffffffff8113ad51>] vfs_write+0xad/0x112
Dec 26 16:20:57 merkaba kernel: [ 8281.541250] [<ffffffff8113b4d1>] SyS_pwrite64+0x5f/0x7d
Dec 26 16:20:57 merkaba kernel: [ 8281.541253] [<ffffffff814aa264>] ? int_check_syscall_exit_work+0x34/0x3d
Dec 26 16:20:57 merkaba kernel: [ 8281.541256] [<ffffffff814aa012>] system_call_fastpath+0x12/0x17
Dec 26 16:20:57 merkaba kernel: [ 8281.541263] INFO: task kworker/u8:1:3336 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.541264] Tainted: G O 3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.541265] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.541268] kworker/u8:1 D 0000000000000000 0 3336 2 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.541285] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541288] ffff880332d6bb78 0000000000000002 ffff88040d80dc00 ffff8803b773b0e0
Dec 26 16:20:57 merkaba kernel: [ 8281.541290] ffff880332d6bfd8 ffff8800c2804950 0000000000012300 ffff8800c2804950
Dec 26 16:20:57 merkaba kernel: [ 8281.541292] ffff880332d6bb58 7fffffffffffffff ffff880332d6bcb0 0000000000000002
Dec 26 16:20:57 merkaba kernel: [ 8281.541293] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.541299] [<ffffffff814a8e73>] ? console_conditional_schedule+0x14/0x14
Dec 26 16:20:57 merkaba kernel: [ 8281.541301] [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.541303] [<ffffffff814a8e93>] schedule_timeout+0x20/0xf5
Dec 26 16:20:57 merkaba kernel: [ 8281.541306] [<ffffffff8105eb92>] ? get_parent_ip+0xe/0x3e
Dec 26 16:20:57 merkaba kernel: [ 8281.541308] [<ffffffff8105ec3e>] ? preempt_count_add+0x7c/0x90
Dec 26 16:20:57 merkaba kernel: [ 8281.541311] [<ffffffff814a97db>] ? _raw_spin_lock_irq+0x1c/0x20
Dec 26 16:20:57 merkaba kernel: [ 8281.541313] [<ffffffff814a7999>] __wait_for_common+0x11e/0x163
Dec 26 16:20:57 merkaba kernel: [ 8281.541317] [<ffffffff810607da>] ? wake_up_state+0xd/0xd
Dec 26 16:20:57 merkaba kernel: [ 8281.541320] [<ffffffff814a79fd>] wait_for_completion+0x1f/0x21
Dec 26 16:20:57 merkaba kernel: [ 8281.541322] [<ffffffff8115b5fb>] writeback_inodes_sb_nr+0x8c/0x95
Dec 26 16:20:57 merkaba kernel: [ 8281.541324] [<ffffffff81050101>] ? perf_trace_workqueue_work+0x8e/0x95
Dec 26 16:20:57 merkaba kernel: [ 8281.541343] [<ffffffffc044a44e>] flush_space+0x200/0x426 [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541346] [<ffffffff814a97bb>] ? _raw_spin_lock+0x1b/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.541348] [<ffffffff814a9841>] ? _raw_spin_unlock+0x11/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.541366] [<ffffffffc044a788>] btrfs_async_reclaim_metadata_space+0x114/0x160 [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541368] [<ffffffff81052962>] process_one_work+0x15e/0x2a9
Dec 26 16:20:57 merkaba kernel: [ 8281.541371] [<ffffffff81052ee1>] worker_thread+0x1f6/0x2a3
Dec 26 16:20:57 merkaba kernel: [ 8281.541374] [<ffffffff81052ceb>] ? rescuer_thread+0x214/0x214
Dec 26 16:20:57 merkaba kernel: [ 8281.541376] [<ffffffff8105697c>] kthread+0xb2/0xba
Dec 26 16:20:57 merkaba kernel: [ 8281.541379] [<ffffffff814a0000>] ? dcbnl_newmsg+0x14/0xa8
Dec 26 16:20:57 merkaba kernel: [ 8281.541381] [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 26 16:20:57 merkaba kernel: [ 8281.541384] [<ffffffff814a9f6c>] ret_from_fork+0x7c/0xb0
Dec 26 16:20:57 merkaba kernel: [ 8281.541386] [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 26 16:21:24 merkaba kernel: [ 8308.678889] device eth0 left promiscuous mode
Dec 26 16:21:24 merkaba kernel: [ 8308.700212] vboxnetflt: 0 out of 34916 packets were not sent (directed to host)
which translates to:
Desktop unusable => hard reboot.
I now resized it from 160 GiB to 170 GiB on both devices.
But I think I will consider moving the VM image to another filesystem.
But at least my description can give an idea of how to reproduce this behaviour.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
2014-12-26 14:20 ` Martin Steigerwald
2014-12-26 15:59 ` Martin Steigerwald
@ 2014-12-26 22:48 ` Robert White
2014-12-27 5:54 ` Duncan
2014-12-27 9:01 ` Martin Steigerwald
2 siblings, 2 replies; 59+ messages in thread
From: Robert White @ 2014-12-26 22:48 UTC (permalink / raw)
To: Martin Steigerwald, linux-btrfs
On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> Hello!
>
> First: Have a merry christmas and enjoy a quiet time in these days.
>
> Second: At a time you feel like it, here is a little rant, but also a bug
> report:
>
> I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> space_cache, skinny meta data extents – are these a problem? – and
> compress=lzo:
(There is no known problem with skinny metadata; it is actually more
efficient than the older format. There have been some anecdotal reports
about mixing skinny and fat metadata, but nothing has ever been
demonstrated to be problematic.)
>
> merkaba:~> btrfs fi sh /home
> Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> Total devices 2 FS bytes used 144.41GiB
> devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=154.97GiB, used=141.12GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.29GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
This filesystem, at the allocation level, is "very full" (see below).
> And I had hangs with BTRFS again. This time as I wanted to install tax
> return software in Virtualbox´d Windows XP VM (which I use once a year
> cause I know no tax return software for Linux which would be suitable for
> Germany and I frankly don´t care about the end of security cause all
> surfing and other network access I will do from the Linux box and I only
> run the VM behind a firewall).
>
>
> And thus I try the balance dance again:
ITEM: Balance... it doesn't do what you think it does... 8-)
"Balancing" is something you should almost never need to do. It is only
for cases of changing geometry (adding disks, switching RAID levels,
etc.) or for cases when you have radically changed allocation behaviors
(like you decided to remove all your VMs, or you decided to remove a
mail spool directory full of thousands of tiny files).
People run balance all the time because they think they should. They are
_usually_ incorrect in that belief.
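A sketch of those legitimate uses, with hypothetical device and mount
paths (the commands are standard btrfs-progs invocations; adjust them to
your own setup):

```shell
# Changing geometry: add a device, then balance so existing chunks are
# redistributed across both devices (here converting to RAID1):
btrfs device add /dev/sdc /mnt/data
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data

# Reclaiming after a radical usage change: only touch block groups that
# are nearly empty (at most 10% used), which is cheap:
btrfs balance start -dusage=10 -musage=10 /mnt/data
```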
>
> merkaba:~> btrfs balance start -dusage=5 -musage=5 /home
> ERROR: error during balancing '/home' - No space left on device
ITEM: Running out of space during a balance is not running out of space
for files. BTRFS has two layers of allocation. That is, there are two
levels of abstraction where "no space" can occur.
The first level of allocation is the "making more BTRFS structures out
of raw device space".
The second level is "allocating space for files inside of existing BTRFS
structures".
Balance is the operation of relocating the BTRFS structures and,
coincidentally, attempting to increase their order while doing that.
So, for instance, "relocating block group some_number_here" requires
finding an unallocated expanse of disk and creating a new/empty block
group there of the current relevant block group size (typically data=1G
or metadata=256M, if you didn't override these settings while making the
filesystem). You can _easily_ end up lacking a 1G contiguous expanse of
raw allocation space on a nearly-full filesystem.
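One way to see whether that first allocation level is exhausted
(assuming /home is the mount point, as in this thread):

```shell
# Per-device view: if "used" equals "size" for every device, there is no
# raw (unallocated) space left in which to create a new block group.
btrfs filesystem show /home

# Chunk-level view: "total" is what the first level has handed out to
# block groups, "used" is how much of that the second level has filled.
btrfs filesystem df /home
```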
NOTE :: This does _not_ happen with other filesystems like EXT4 because
building those filesystems creates a static filesystem-level allocation.
That is, 100% of the disk that EXT4 (etc.) can control is allocated and
initialized at creation time (or on first mount, with EXT4's lazy
initialization).
BTRFS is intentionally different because it wants to be able to adapt as
your usage changes. If you first make millions of tiny files then you
will have a lot of metadata extents and virtually no data extents. If
you erase a lot of those and then start making large files the metadata
will tend to go away and then data extents will be created.
Being a chaotic system, you can get into some corner cases that suck,
but in terms of natural evolution it has more benefits than drawbacks.
> There may be more info in syslog - try dmesg | tail
> merkaba:~#1> btrfs balance start -dusage=5 -musage=5 /home
> ERROR: error during balancing '/home' - No space left on device
> There may be more info in syslog - try dmesg | tail
> merkaba:~#1> btrfs balance start -dusage=5 /home
>
> .... lots deleted for brevity ....
>
> So I am rebalancing everything basically, without need I bet, so causing
> more churn to SSDs than is needed.
Correct, though churn isn't really the issue.
> Otherwise alternative would be to make BTRFS larger I bet.
Correct.
>
>
> Well this is still not what I would consider stable. So I will still
Not a question of stability.
See, doing a balance is like doing a sliding block puzzle. If there isn't
enough room to slide the blocks around, then the blocks will not slide
around. You are simply out of space, and that results in "out of space"
returns. This is not even an error, just a fact.
http://en.wikipedia.org/wiki/15_puzzle
Meditate on the above link. Then ask yourself what happens if you put in
the number 16. 8-)
The below recommendation is incorrect...
> recommend: If you want to use BTRFS on a server and estimate 25 GiB of
> usage, make drive at least 50GiB big or even 100GiB to be on the safe
> side. Like I recommended for SLES 11 SP 2/3 BTRFS deployments – but
> hey, there say meanwhile "don´t" as in "just don´t use it at all and use SLES
> 12 instead, cause BTRFS with 3.0 kernel with a ton of snapper snapshots
> is really not asking for anything even near to production or enterprise
> reliability" (if you need proof, I think I still have a snapshot of a SLES
> 11 SP3 VM that broke over night due to me having installed an LDAP server
> for preparing some training slides). Even 3.12 kernel seems daring regarding
> BTRFS, unless SUSE actively backports fixes.
>
>
> In kernel log the failed attempts look like this:
Already covered.
>
> [ 209.783437] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
> [ 210.116416] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
> My expectation for a *stable* and *production quality* filesystem would be:
>
> I never ever get hangs with one kworker running at 100% of one Sandybridge
> core *for minutes* on a production filesystem, and that's about it.
Now this is one of several other issues.
ITEM: An SSD plus a good fast controller, combined with default
virtual memory and disk scheduler settings, can completely bog a system
down. You can get into a mode where the system begins doing synchronous
writes of vast expanses of dirty cache. The SSD is so fast that there is
effectively zero "wait for IO" time, and the IO subsystem is effectively
locked or just plain busy.
Look at /proc/sys/vm/dirty_background_ratio which is probably set to 10%
of system ram.
You may need/want to change this number to something closer to 4. That's
not a hard suggestion. Some reading and analysis will be needed to find
the best possible tuning for an advanced system.
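For example, this can be inspected and tuned via sysctl; a hedged
sketch (the value 4 is just the starting point suggested above, not a
universal tuning, and the file name under /etc/sysctl.d is arbitrary):

```shell
# Inspect the current writeback thresholds:
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio

# Lower the background ratio so flushing starts earlier, in smaller bursts:
sysctl -w vm.dirty_background_ratio=4

# Make the change persistent across reboots:
echo 'vm.dirty_background_ratio = 4' > /etc/sysctl.d/90-writeback.conf
```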
>
> Especially for a filesystem that claims to still have a good amount of free
> space:
>
> merkaba:~> LANG=C df -hT /home
> Filesystem Type Size Used Avail Use% Mounted on
> /dev/mapper/msata-home btrfs 160G 146G 25G 86% /home
It does have plenty of free space at the file-storage level. (Which is
not the "balance" level where raw disk is converted into file system
"data" or "metadata" extents.)
>
> (yeah, these don´t add up, I account this to compression, but hey, who knows)
No need to "account for" compression.
They add up fine, in the sense that they are separate domains for space
and so are not intended to be taken together. You will notice that you
are not getting "out of space" errors for actually creating/appending files.
>
>
> In kernel log I have things like this, but some earlier time and these I have
> not yet perceived as hangs:
>
> Dec 23 23:33:26 merkaba kernel: [23040.621678] ------------[ cut here ]------------
> Dec 23 23:33:26 merkaba kernel: [23040.621792] WARNING: CPU: 3 PID: 308 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empt
> y+0x2d/0x2f [btrfs]()
> Dec 23 23:33:26 merkaba kernel: [23040.621796] Modules linked in: mmc_block ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c snd
> _usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device hid_generic hid_pl ff_memless usbhid hid nls_utf8 nls_cp437 vfat fat uas usb_stor
> age bnep bluetooth binfmt_misc cpufreq_userspace cpufreq_stats pci_stub cpufreq_powersave vboxpci(O) cpufreq_conservative vboxnetadp(O)
> vboxnetflt(O) vboxdrv(O) ext4 crc16 mbcache jbd2 intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32
> _pclmul ghash_clmulni_intel snd_hda_codec_hdmi iwldvm aesni_intel snd_hda_codec_conexant mac80211 aes_x86_64 snd_hda_codec_generic lrw
> gf128mul glue_helper ablk_helper cryptd psmouse snd_hda_intel serio_raw iwlwifi pcspkr lpc_ich snd_hda_controller i2c_i801 mfd_core snd
> _hda_codec snd_hwdep cfg80211 snd_pcm snd_timer shpchp thinkpad_acpi nvram snd soundcore rfkill battery ac tpm_tis tpm processor evdev
> joydev sbs sbshc coretemp hdaps(O) tp_smapi(O) thinkpad_ec(O) loop firewire_sbp2 fuse ecryptfs autofs4 md_mod btrfs xor raid6_pq microc
> ode dm_mirror dm_region_hash dm_log dm_mod sg sr_mod sd_mod cdrom crc32c_intel ahci firewire_ohci libahci sata_sil24 e1000e libata ptp
> sdhci_pci ehci_pci sdhci firewire_core ehci_hcd crc_itu_t pps_core mmc_core scsi_mod usbcore usb_common thermal
> Dec 23 23:33:26 merkaba kernel: [23040.621978] CPU: 3 PID: 308 Comm: btrfs-transacti Tainted: G W O 3.18.0-tp520 #14
> Dec 23 23:33:26 merkaba kernel: [23040.621982] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET63WW (1.43 ) 05/08/2013
> Dec 23 23:33:26 merkaba kernel: [23040.621985] 0000000000000009 ffff8804044c7d88 ffffffff814a516e 0000000080000000
> Dec 23 23:33:26 merkaba kernel: [23040.621992] 0000000000000000 ffff8804044c7dc8 ffffffff8103f83e ffff8804044c7db8
> Dec 23 23:33:26 merkaba kernel: [23040.621999] ffffffffc04bd5a1 ffff880037590800 ffff8800a599c320 0000000000000000
> Dec 23 23:33:26 merkaba kernel: [23040.622006] Call Trace:
> Dec 23 23:33:26 merkaba kernel: [23040.622026] [<ffffffff814a516e>] dump_stack+0x4f/0x7c
> Dec 23 23:33:26 merkaba kernel: [23040.622034] [<ffffffff8103f83e>] warn_slowpath_common+0x7c/0x96
> Dec 23 23:33:26 merkaba kernel: [23040.622104] [<ffffffffc04bd5a1>] ? btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
> Dec 23 23:33:26 merkaba kernel: [23040.622111] [<ffffffff8103f8ec>] warn_slowpath_null+0x15/0x17
> Dec 23 23:33:26 merkaba kernel: [23040.622164] [<ffffffffc04bd5a1>] btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
> Dec 23 23:33:26 merkaba kernel: [23040.622211] [<ffffffffc047a830>] btrfs_commit_transaction+0x394/0x8bc [btrfs]
> Dec 23 23:33:26 merkaba kernel: [23040.622254] [<ffffffffc0476dd5>] transaction_kthread+0xf9/0x1af [btrfs]
> Dec 23 23:33:26 merkaba kernel: [23040.622295] [<ffffffffc0476cdc>] ? btrfs_cleanup_transaction+0x43a/0x43a [btrfs]
> Dec 23 23:33:26 merkaba kernel: [23040.622305] [<ffffffff8105697c>] kthread+0xb2/0xba
> Dec 23 23:33:26 merkaba kernel: [23040.622312] [<ffffffff814a0000>] ? dcbnl_newmsg+0x14/0xa8
> Dec 23 23:33:26 merkaba kernel: [23040.622317] [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
> Dec 23 23:33:26 merkaba kernel: [23040.622324] [<ffffffff814a9f6c>] ret_from_fork+0x7c/0xb0
> Dec 23 23:33:26 merkaba kernel: [23040.622329] [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
> Dec 23 23:33:26 merkaba kernel: [23040.622334] ---[ end trace 90db5b1c7067cf1d ]---
> Dec 23 23:33:56 merkaba kernel: [23070.671999] ------------[ cut here ]------------
Not sure about either of these; they _could_ be previously unrelated bugs
that have since been fixed, since you say they've stopped happening.
> Dec 23 23:33:56 merkaba kernel: [23070.671999] ------------[ cut here ]------------
> Dec 23 23:33:56 merkaba kernel: [23070.672064] WARNING: CPU: 3 PID: 308 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empt
> y+0x2d/0x2f [btrfs]()
> Dec 23 23:33:56 merkaba kernel: [23070.672067] Modules linked in: mmc_block ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device hid_generic hid_pl ff_memless usbhid hid nls_utf8 nls_cp437 vfat fat uas usb_storage bnep bluetooth binfmt_misc cpufreq_userspace cpufreq_stats pci_stub cpufreq_powersave vboxpci(O) cpufreq_conservative vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ext4 crc16 mbcache jbd2 intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi iwldvm aesni_intel snd_hda_codec_conexant mac80211 aes_x86_64 snd_hda_codec_generic lrw gf128mul glue_helper ablk_helper cryptd psmouse snd_hda_intel serio_raw iwlwifi pcspkr lpc_ich snd_hda_controller i2c_i801 mfd_core snd_hda_codec snd_hwdep cfg80211 snd_pcm snd_timer shpchp thinkpad_acpi nvram snd soundcore rfkill battery ac tpm_tis tpm processor evdev joydev sbs sbshc coretemp hdaps(O) tp_smapi(O) think
pad_ec(O) loop firewire_sbp2 fuse ecryptfs autofs4 md_mod btrfs xor raid6_pq microcode dm_mirror dm_region_hash dm_log dm_mod sg sr_mod sd_mod cdrom crc32c_intel ahci firewire_ohci libahci sata_sil24 e1000e libata ptp sdhci_pci ehci_pci sdhci firewire_core ehci_hcd crc_itu_t pps_core mmc_core scsi_mod usbcore usb_common thermal
> Dec 23 23:33:56 merkaba kernel: [23070.672193] CPU: 3 PID: 308 Comm: btrfs-transacti Tainted: G W O 3.18.0-tp520 #14
> Dec 23 23:33:56 merkaba kernel: [23070.672196] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET63WW (1.43 ) 05/08/2013
> Dec 23 23:33:56 merkaba kernel: [23070.672200] 0000000000000009 ffff8804044c7d88 ffffffff814a516e 0000000080000000
> Dec 23 23:33:56 merkaba kernel: [23070.672205] 0000000000000000 ffff8804044c7dc8 ffffffff8103f83e ffff8804044c7db8
> Dec 23 23:33:56 merkaba kernel: [23070.672209] ffffffffc04bd5a1 ffff880037590800 ffff8802cd6e50a0 0000000000000000
> Dec 23 23:33:56 merkaba kernel: [23070.672214] Call Trace:
> Dec 23 23:33:56 merkaba kernel: [23070.672222] [<ffffffff814a516e>] dump_stack+0x4f/0x7c
> Dec 23 23:33:56 merkaba kernel: [23070.672229] [<ffffffff8103f83e>] warn_slowpath_common+0x7c/0x96
> Dec 23 23:33:56 merkaba kernel: [23070.672264] [<ffffffffc04bd5a1>] ? btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
> Dec 23 23:33:56 merkaba kernel: [23070.672270] [<ffffffff8103f8ec>] warn_slowpath_null+0x15/0x17
> Dec 23 23:33:56 merkaba kernel: [23070.672301] [<ffffffffc04bd5a1>] btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
> Dec 23 23:33:56 merkaba kernel: [23070.672330] [<ffffffffc047a830>] btrfs_commit_transaction+0x394/0x8bc [btrfs]
> Dec 23 23:33:56 merkaba kernel: [23070.672357] [<ffffffffc0476dd5>] transaction_kthread+0xf9/0x1af [btrfs]
> Dec 23 23:33:56 merkaba kernel: [23070.672383] [<ffffffffc0476cdc>] ? btrfs_cleanup_transaction+0x43a/0x43a [btrfs]
> Dec 23 23:33:56 merkaba kernel: [23070.672389] [<ffffffff8105697c>] kthread+0xb2/0xba
> Dec 23 23:33:56 merkaba kernel: [23070.672395] [<ffffffff814a0000>] ? dcbnl_newmsg+0x14/0xa8
> Dec 23 23:33:56 merkaba kernel: [23070.672399] [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
> Dec 23 23:33:56 merkaba kernel: [23070.672405] [<ffffffff814a9f6c>] ret_from_fork+0x7c/0xb0
> Dec 23 23:33:56 merkaba kernel: [23070.672409] [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
> Dec 23 23:33:56 merkaba kernel: [23070.672412] ---[ end trace 90db5b1c7067cf1e ]---
> Dec 23 23:34:26 merkaba kernel: [23100.709530] ------------[ cut here ]------------
>
>
> The recent hangings today are not in the log, I was upset enough to
> forcefully switch off the machine. Tax returns are not my all-time favorite,
> but tax returns with hanging filesystems are no fun at all.
>
>
> I will upgrade to 3.19 with 3.19-rc2.
>
> Lets see what this balance will do.
>
> It currently is here:
>
> merkaba:~> btrfs balance status /home
> Balance on '/home' is running
> 32 out of about 164 chunks balanced (53 considered), 80% left
>
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=154.97GiB, used=142.10GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.33GiB
> GlobalReserve, single: total=512.00MiB, used=254.31MiB
>
>
> So for once, we are told not to balance needlessly, but then in order for
> stable operation I need to balance nonetheless?
Nope. "Needing to balance" just isn't your problem. Being out of space
for new extents is your problem with the balancing you don't need to do.
Which is different from your VM update problem. And that is also different
from your bursty, excessive caching problem.
I've also not seen you say you ever ran a btrfsck. Does a filesystem
check come up clean?
> Well, let's see how it will improve things. Last time it did. Considerably.
> BTRFS only had these hang problems with 3.15 and 3.16 if the trees allocated
> all remaining space. So I expect it to downsize these trees so that some
> device space is freed and becomes allocatable again.
>
> Next I will also defrag the Windows VM image just as an additional safety
> net.
Simply copying the file might help you for a while at least. But in the
long term "too much orderliness" for large files ends up being anti-helpful.
e.g. for disk_file.img:
cp disk_file.img new_disk_file.img; rm disk_file.img; mv new_disk_file.img disk_file.img
Turning off copy-on-write might be helpful (this will turn off
compression as well), but it can be anti-helpful too, depending on the VM
and how it's used.
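For the no-COW route, note that the 'C' attribute only takes effect on
an empty file, so it has to be set before any image data is written; a
sketch with a hypothetical image path:

```shell
# Create an empty file, mark it no-COW, then fill it:
touch /home/vms/winxp-new.img          # hypothetical path
chattr +C /home/vms/winxp-new.img
lsattr /home/vms/winxp-new.img         # the 'C' flag should be listed

# Copy the old image's contents into the no-COW file, then swap names:
cat /home/vms/winxp.img > /home/vms/winxp-new.img
mv /home/vms/winxp-new.img /home/vms/winxp.img
```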
As I learn more about the way BTRFS stores files, particularly deltas to
files, I come to suspect that the "best" storage model for a VM _might_
be exactly the opposite of the normal suggestions. (The most wasteful
possible storage is the Gauss sum of consecutive integers with i=1 and
n=the number of consecutively stored blocks in the file. Ouch. So a file
that is reasonably segmented is "more efficient".)
With a fast SSD, my research suggests that defragging the disk image is
bad. No-COW is good if you don't snapshot often, but each snapshot puts
the file into once-COW mode, which kind of defeats the no-COW if you do it
very often.
But as near as I can tell, starting with an "empty" .qcow file and
growing the system step-wise and _never_ defragging that file tends to
create a chaotically natural expanse that wont hit these corner cases.
(Way more analysis needs to be done here for that to be a real answer.)
As I learn more, I discover that being overly aggressive with balance and
defrag of large files is the opposite of good. The system seems to want
to develop a chaotic layout, and trying to make it orderly seems to make
things worse. For very large files like VM images it seems to amplify
the worst parts.
> Okay, doing something else now as the BTRFS will sort things out hopefully.
To get good natural performance on my (non-SSD) system while running
VM(s), I set aside a chunk of system RAM with the movablecore= kernel
boot option (about 1/4 to 1/3 of physical RAM) and turn down the dirty
background ratio to avoid large synchronous cache flush events.
YMMV.
>
> Ciao,
>
Later.
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-26 14:41 ` Martin Steigerwald
@ 2014-12-27 3:33 ` Duncan
0 siblings, 0 replies; 59+ messages in thread
From: Duncan @ 2014-12-27 3:33 UTC (permalink / raw)
To: linux-btrfs
Martin Steigerwald posted on Fri, 26 Dec 2014 15:41:23 +0100 as excerpted:
> Am Freitag, 26. Dezember 2014, 15:20:42 schrieben Sie:
>> And I wonder about:
>> > Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C
>> > 0040 0710 4AFA B82F 991B EAAC A599
>> >
>> >
>> >
>>
> 84C7N�����r��y����b�X��ǧv�^�){.n�+����{�n�߲)����w*\x1fjg���\x1e�����ݢj/
���z�ޖ��2
>>
>> > �ޙ����&�)ߡ�a��\x7f��\x1e�G���h�\x0f�j:+v���w��٥
>>
>> These random chars are not supposed to be there: I better run scrub
>> straight after this balance.
>
> Okay, that's not me I think. scrub didn't report any errors, and when I
> look in kmail's sent folder I don't see these random chars either, so it
> seems some server on the wire added the garbage.
FWIW...
They didn't show up here on gmane's list2nntp service (message viewed
with pan), either. There were a few strange characters -- your dashes(?)
on either side of the "are these a problem?" showed up as the squares
containing four digits (0080, 0093) that appear when a font doesn't
contain the appropriate character it's being asked to display, and there
were a few others, but that's a common charset/font l10n issue, not the
apparent line noise binary corruption shown above.
So I'd guess it was either the transmission to your mail service, the
mail service itself, or the transmission between them and your mail
client, that corrupted it.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-26 15:59 ` Martin Steigerwald
@ 2014-12-27 4:26 ` Duncan
0 siblings, 0 replies; 59+ messages in thread
From: Duncan @ 2014-12-27 4:26 UTC (permalink / raw)
To: linux-btrfs
Martin Steigerwald posted on Fri, 26 Dec 2014 16:59:09 +0100 as excerpted:
> Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce:
> [Hardware Error]: Machine check events logged
> Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce:
> [Hardware Error]: Machine check events logged
Have you checked these MCEs? What are they?
MCEs are hardware errors. These are *NOT* kernel errors, tho of course
they may /trigger/ kernel errors. The reported event codes can be looked
up and translated into English.
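As a sketch of how those MCE codes could actually be looked up (assuming the mcelog or rasdaemon packages are installed; the exact tooling varies by distro and kernel version):

```shell
# One-shot decode of any pending machine-check events (needs root;
# on Debian: apt install mcelog):
mcelog

# Or, on systems running rasdaemon instead of mcelog:
ras-mc-ctl --errors          # list decoded memory/CPU errors
ras-mc-ctl --summary         # error counts per component
```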
From shortly after the first one until a bit before the second one, you
had hardware thermal throttling; the CPUs, on-chip cache, and possibly
the memory were working pretty hard.
FWIW, I had an AMD machine that would MCE with memory-related errors some
time (about a decade) ago. I had ECC RAM, but it was cheap and
apparently not quite up to the speed it was actually rated for. MemTest
checked the memory out fine, but especially under high stress, it would
sometimes have bus/transit-related corruption, which would sometimes (not
always) trigger those MCEs.
Eventually a BIOS update gave me the ability to turn down the memory
timings, and turning them down just one notch made everything rock-stable
-- I was even able to decrease some of the wait-states to get a bit of
the memory speed back. It just so happened that it was borderline stable
at the rated clock, and turning the memory clock down just one notch was
all it took. Later, I upgraded the RAM (the bad RAM was two half-gig
sticks, back when they were $100+ apiece; I upgraded to four 2-gig
sticks), and the new RAM didn't have the problem at all -- the bad RAM
sticks simply weren't /quite/ stable at their rated speed, that was it.
I run gentoo so of course do a lot of building from sources, and
interestingly enough, the thing that turned out to detect the corruption
the most often was bzip2 compression checksums -- I'd get errors on
source decompression prior to the build rather more often than actual
build failures, altho those would happen occasionally as well, while
redoing it would work fine -- checksums passed, and I never had a build
that actually finished fail to run due to a bad build.
Now here's the thing. Of course a decade ago was well before I was
running btrfs (FWIW I was running reiserfs at the time, and it seemed
pretty resilient given the bad RAM I had), so it was the bzip2 checksums
it failed on.
But guess what btrfs uses for file integrity: checksums. If your MCEs
are either like my memory-related MCEs were, or are similar CPU-cache or
CPU-related ones that would still affect checksumming, btrfs may well be
fighting bad checksums due to the same issues, and that would of course
throw all sorts of wrenches into things. Another thing I've seen
reported as triggering MCEs is bad power (in that case it was a UPS that
was either underpowered or going bad; once it was out of the picture,
the MCEs and problems stopped).
Now I think you're having other btrfs issues as well, some of which are
likely legit bugs. However, your MCEs certainly aren't helping things,
and I'd definitely recommend checking up on them to see what's actually
happening to your hardware. It may well be that without whatever
hardware issues are triggering those MCEs, you'll end up with fewer
btrfs problems as well.
Or maybe not, but it's something to look into, because right now,
regardless of whether they're making things worse physically, they're at
minimum obscuring a troubleshooting picture that would be clearer without
them.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-26 22:48 ` Robert White
@ 2014-12-27 5:54 ` Duncan
2014-12-27 9:01 ` Martin Steigerwald
1 sibling, 0 replies; 59+ messages in thread
From: Duncan @ 2014-12-27 5:54 UTC (permalink / raw)
To: linux-btrfs
Robert White posted on Fri, 26 Dec 2014 14:48:38 -0800 as excerpted:
> ITEM: An SSD plus a good fast controller and default system virtual
> memory and disk scheduler activities can completely bog a system down.
> You can get into a mode where the system begins doing synchronous writes
> of vast expanses of dirty cache. The SSD is so fast that there is
> effectively zero "wait for IO time" and the IO subsystem is effectively
> locked or just plain busy.
>
> Look at /proc/sys/vm/dirty_background_ratio which is probably set to 10%
> of system ram.
>
> You may need/want to change this number to something closer to 4. That's
> not a hard suggestion. Some reading and analysis will be needed to find
> the best possible tuning for an advanced system.
FWIW, I can second at least this part, myself. Half of the base problem
is that memory speeds have increased far faster than storage speeds.
SSDs do help with that, but the problem remains. The other half of the
problem is the comparatively huge memory capacity systems have today,
with the result being that the default percentages of system RAM that
were allowed to be dirty before kicking in background and then foreground
flushing, reasonable back when they were introduced, simply aren't
reasonable any longer, PARTICULARLY on spinning rust, but even on SSD.
vm.dirty_ratio is the percentage of RAM allowed to be dirty before the
system kicks into high-priority write-flush mode.
vm.dirty_background_ratio is the same, but for the threshold at which the
system starts worrying about it at all, doing the work in the background.
Now take my 16 GiB RAM system as an example.
The default background setting is 5%, foreground/high-priority, 10%.
With 16 gigs RAM, that 10% is 1.6 GiB of dirty pages to flush. A
spinning rust drive might do 100 MiB/sec throughput contiguous, but a
real-world number is more like 30-50 MiB/sec.
At 100 MiB/sec, that 1.6 GiB will take 16+ seconds, during which nothing
else can be doing I/O. So let's just divide the speed by 3 and call it
33.3 MiB/sec. Now we're looking at being blocked for nearly 50 seconds
to flush all those dirty blocks. And the system doesn't even START
worrying about it, at even LOW priority, until it has about 25 seconds
worth of full-usage flushing built-up!
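That arithmetic can be sanity-checked in a couple of lines of shell (same numbers as above; a back-of-the-envelope sketch, not a benchmark):

```shell
# 16 GiB RAM, 10% dirty_ratio cap, flushed at 100 MiB/s vs a
# realistic ~33 MiB/s for spinning rust:
ram_mib=$((16 * 1024))                 # 16 GiB expressed in MiB
dirty_mib=$((ram_mib * 10 / 100))      # 10% dirty cap -> 1638 MiB (~1.6 GiB)
echo "dirty cap:    ${dirty_mib} MiB"
echo "at 100 MiB/s: $((dirty_mib / 100)) s"   # -> 16 s
echo "at ~33 MiB/s: $((dirty_mib / 33)) s"    # -> 49 s, i.e. ~50 s
```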
Not only that, but that's *ALSO* 1.6 GiB worth of dirty data that isn't
yet written to storage, and that would be lost in the event of a crash!
Of course there's a timer expiry as well. vm.dirty_writeback_centisecs
(that's the background flush interval) defaults to 499 (5 seconds);
vm.dirty_expire_centisecs defaults to 2999 (30 seconds).
So the first thing to notice is that it's going to take more time to
write the dirty data we're allowing to stack up, than the expiry time!
At least to me, that makes absolutely NO sense! At minimum, we need to
reduce cached writes allowed to stack up to something that can actually
be done before they expire, time-wise. Either that, or trying to depend
on that 30-second expiry to make sure our dirty data is flushed in
something at least /close/ to that isn't going to work so well!
So assuming we think the 30-seconds is logical, the /minimum/ we need to
do is reduce the size cap by half, to 5% high-priority/foreground (which
was as we saw about 25 seconds worth), say 2% lower-priority/background.
But that's STILL about 800 MiB before it kicks to high priority mode at
risk in case of a crash, and I still considered that a bit more than I
wanted.
So what I ended up with here (set for spinning rust before I had SSD),
was:
vm.dirty_background_ratio = 1
(low-priority flush; that's still ~160 MiB, or about 5 seconds worth of
activity at the low-30s MiB/sec figure)
vm.dirty_ratio = 3
(high priority flush, roughly half a GiB, about 15 seconds of activity)
vm.dirty_writeback_centisecs=1000
(10 seconds, background flush timeout, note that the corresponding size
cap is ~5 seconds worth so about 50% duty cycle, a bit high for
background priority, but...)
(I left vm.dirty_expire_centisecs at the default, 2999 or 30 seconds,
since I found that an acceptable amount of work to lose in the case of a
crash. Again, the corresponding size cap is ~15 seconds worth, so a ~50%
duty cycle. This is very reasonable for high priority, as if data is
coming in faster than that, it'll trigger high-priority flushing "billed"
to the processes actually dirtying the memory in the first place, thus
forcing them to slow down and wait for their I/O, in turn allowing other
(CPU-bound) processes to run.)
And while 15-second interactivity latency during disk thrashing isn't
cake, it's at least tolerable, while 50-second latency is HORRIBLE.
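For reference, a sketch of how settings like the above would be applied and persisted (the vm.* names are the standard sysctl knobs; the file name under /etc/sysctl.d/ is just an example):

```shell
# Apply the discussed values at runtime (needs root):
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.dirty_ratio=3
sysctl -w vm.dirty_writeback_centisecs=1000

# Persist them across reboots (example file name):
cat > /etc/sysctl.d/99-writeback.conf <<'EOF'
vm.dirty_background_ratio = 1
vm.dirty_ratio = 3
vm.dirty_writeback_centisecs = 1000
EOF
```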
Meanwhile, with vm.dirty_background_ratio already set to 1 and without
knowing whether it can take a decimal such as 0.5 (I could look I suppose
but I don't really have to), that's the lowest I can go there unless I
set it to zero. HOWEVER, if I wanted to go lower, I could set the actual
size version, vm.dirty_background_bytes, instead. If I needed to go
below ~160 MiB, that's what I'd do. Of course there's a corresponding
vm.dirty_bytes setting as well.
As I said I originally set those up for spinning rust. Now my main
system is SSD, tho I still have secondary backups and media on spinning
rust. But I've seen no reason to change them upward to allow for the
faster SSDs, particularly since were I to do so, I'd be risking that much
more data loss in the event of a crash, and I find that risk balance
about right, just where it is.
And I've been quite happy with btrfs performance on the ssds (the
spinning rust is still reiserfs). Tho of course I do run multiple
smaller independent btrfs instead of the huge all-the-data-eggs-in-a-
single-basket mode most people seem to run. My biggest btrfs is actually
only 24 GiB (on each of two devices but in raid1 mode both data/metadata,
so 24 GiB to work with too), but between working copy and primary backup,
I have nearly a dozen btrfs filesystems. But I don't tend to run into
the scaling issues others see, and being able to do full filesystem
maintenance (scrub/balance/backup/restore-from-backup/etc) in seconds to
minutes per filesystem is nice! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-26 22:48 ` Robert White
2014-12-27 5:54 ` Duncan
@ 2014-12-27 9:01 ` Martin Steigerwald
2014-12-27 9:30 ` Hugo Mills
1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 9:01 UTC (permalink / raw)
To: Robert White; +Cc: linux-btrfs
Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > Hello!
> >
> > First: Have a merry christmas and enjoy a quiet time in these days.
> >
> > Second: At a time you feel like it, here is a little rant, but also a bug
> > report:
> >
> > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > space_cache, skinny meta data extents – are these a problem? – and
>
> > compress=lzo:
> (there is no known problem with skinny metadata, it's actually more
> efficient than the older format. There have been some anecdotes about
> mixing the skinny and fat metadata, but nothing has ever been
> demonstrated problematic.)
>
> > merkaba:~> btrfs fi sh /home
> > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> >
> > Total devices 2 FS bytes used 144.41GiB
> > devid 1 size 160.00GiB used 160.00GiB path
> > /dev/mapper/msata-home
> > devid 2 size 160.00GiB used 160.00GiB path
> > /dev/mapper/sata-home
> >
> > Btrfs v3.17
> > merkaba:~> btrfs fi df /home
> > Data, RAID1: total=154.97GiB, used=141.12GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
>
> This filesystem, at the allocation level, is "very full" (see below).
>
> > And I had hangs with BTRFS again. This time as I wanted to install tax
> > return software in Virtualbox´d Windows XP VM (which I use once a year
> > cause I know no tax return software for Linux which would be suitable for
> > Germany and I frankly don´t care about the end of security cause all
> > surfing and other network access I will do from the Linux box and I only
> > run the VM behind a firewall).
>
> > And thus I try the balance dance again:
> ITEM: Balance... it doesn't do what you think it does... 8-)
>
> "Balancing" is something you should almost never need to do. It is only
> for cases of changing geometry (adding disks, switching RAID levels,
> etc.) or for cases when you've radically changed allocation behaviors
> (like you decided to remove all your VMs or you've decided to remove a
> mail spool directory full of thousands of tiny files).
>
> People run balance all the time because they think they should. They are
> _usually_ incorrect in that belief.
I only see the lockups of BTRFS if the trees *occupy* all space on the device.
I *never* so far saw it lockup if there is still space BTRFS can allocate from
to *extend* a tree.
This may be a bug, but this is what I see.
And no amount of "you should not balance a BTRFS" will make that perception go
away.
See, it is as if I see the sun coming up in the morning and you tell me
"no, it doesn't". That is simply not going to match my perception.
> > merkaba:~> btrfs balance start -dusage=5 -musage=5 /home
> > ERROR: error during balancing '/home' - No space left on device
>
> ITEM: Running out of space during a balance is not running out of space
> for files. BTRFS has two layers of allocation. That is, there are two
> levels of abstraction where "no space" can occur.
I understand that *very* well. I know about the allocation of *device* space
for trees, and I know about the allocation *inside* a tree.
> The first level of allocation is the "making more BTRFS structures out
Skipped the rest of the explanation that I already know.
I also don't buy the explanation that the SSD makes a kworker thread use
100% CPU for minutes - *while* these SSDs are basically idling. A Sandy
Bridge core is not exactly slow, and these are still consumer SSDs; we are
not talking about a million IOPS here.
And again:
This does not ever happen when the trees do *not* fully allocate all
device space. Even the defragmentation run in the Windows XP VM went fine
until the trees allocated all space on the device again.
Try to reread the last two sentences in case it doesn't sink in.
That's why I consider it a bug. I totally agree with you that a balance
should not be necessary, but in my observation it is. That is the actual bug.
And no, no one needs to tell me to nocow the file. Even the extents are no
issue: not with SSDs, which provide good enough random access.
My interpretation of what I see is this: BTRFS free space *in tree* handling
is still not up to production quality.
Now either try out what I describe and see whether you perceive the same,
or if you don't, please don't argue with my perception. You can argue with
my conclusion, but I know what I see here. Thanks.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 9:01 ` Martin Steigerwald
@ 2014-12-27 9:30 ` Hugo Mills
2014-12-27 10:54 ` Martin Steigerwald
` (3 more replies)
0 siblings, 4 replies; 59+ messages in thread
From: Hugo Mills @ 2014-12-27 9:30 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 4823 bytes --]
On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > Hello!
> > >
> > > First: Have a merry christmas and enjoy a quiet time in these days.
> > >
> > > Second: At a time you feel like it, here is a little rant, but also a bug
> > > report:
> > >
> > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > space_cache, skinny meta data extents – are these a problem? – and
> >
> > > compress=lzo:
> > (there is no known problem with skinny metadata, it's actually more
> > efficient than the older format. There has been some anecdotes about
> > mixing the skinny and fat metadata but nothing has ever been
> > demonstrated problematic.)
> >
> > > merkaba:~> btrfs fi sh /home
> > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > >
> > > Total devices 2 FS bytes used 144.41GiB
> > > devid 1 size 160.00GiB used 160.00GiB path
> > > /dev/mapper/msata-home
> > > devid 2 size 160.00GiB used 160.00GiB path
> > > /dev/mapper/sata-home
> > >
> > > Btrfs v3.17
> > > merkaba:~> btrfs fi df /home
> > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > This filesystem, at the allocation level, is "very full" (see below).
> >
> > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > cause I know no tax return software for Linux which would be suitable for
> > > Germany and I frankly don´t care about the end of security cause all
> > > surfing and other network access I will do from the Linux box and I only
> > > run the VM behind a firewall).
> >
> > > And thus I try the balance dance again:
> > ITEM: Balance... it doesn't do what you think it does... 8-)
> >
> > "Balancing" is something you should almost never need to do. It is only
> > for cases of changing geometry (adding disks, switching RAID levels,
> > etc.) of for cases when you've radically changed allocation behaviors
> > (like you decided to remove all your VM's or you've decided to remove a
> > mail spool directory full of thousands of tiny files).
> >
> > People run balance all the time because they think they should. They are
> > _usually_ incorrect in that belief.
>
> I only see the lockups of BTRFS is the trees *occupy* all space on the device.
No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
space. What's more, balance does *not* balance the metadata trees. The
remaining space -- 154.97 GiB -- is unstructured storage for file
data, and you have some 13 GiB of that available for use.
Now, since you're seeing lockups when the space on your disks is
all allocated I'd say that's a bug. However, you're the *only* person
who's reported this as a regular occurrence. Does this happen with all
filesystems you have, or just this one?
> I *never* so far saw it lockup if there is still space BTRFS can allocate from
> to *extend* a tree.
It's not a tree. It's simply space allocation. It's not even space
*usage* you're talking about here -- it's just allocation (i.e. the FS
saying "I'm going to use this piece of disk for this purpose").
> This may be a bug, but this is what I see.
>
> And no amount of "you should not balance a BTRFS" will make that perception go
> away.
>
> See, I see the sun coming out on a morning and you tell me "no, it doesn´t".
> Simply that is not going to match my perception.
Duncan's assertion is correct in its detail. Looking at your space
usage, I would not suggest that running a balance is something you
need to do. Now, since you have these lockups that seem quite
repeatable, there's probably a lurking bug in there, but hacking
around with balance every time you hit it isn't going to get the
problem solved properly.
I think I would suggest the following:
- make sure you have some way of logging your dmesg permanently (use
a different filesystem for /var/log, or a serial console, or a
netconsole)
- when the lockup happens, hit Alt-SysRq-t a few times
- send the dmesg output here, or post to bugzilla.kernel.org
That's probably going to give enough information to the developers
to work out where the lockup is happening, and is clearly the way
forward here.
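A sketch of that capture setup (the addresses, MAC, and interface below are placeholders to adapt to your network):

```shell
# Allow the full SysRq set, including the task dump (needs root):
echo 1 > /proc/sys/kernel/sysrq

# Ship kernel messages to another machine via netconsole
# (format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac):
modprobe netconsole \
    netconsole=6665@192.168.1.2/eth0,6666@192.168.1.1/00:11:22:33:44:55
# On the receiving machine:  nc -l -u 6666 | tee hang.log

# When the hang occurs, dump all task states (equivalent to Alt-SysRq-t):
echo t > /proc/sysrq-trigger
```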
Hugo.
--
Hugo Mills | w.w.w. -- England's batting scorecard
hugo@... carfax.org.uk |
http://carfax.org.uk/ |
PGP: 65E74AC0 |
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 9:30 ` Hugo Mills
@ 2014-12-27 10:54 ` Martin Steigerwald
2014-12-27 11:52 ` Robert White
2014-12-27 11:11 ` Martin Steigerwald
` (2 subsequent siblings)
3 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 10:54 UTC (permalink / raw)
To: Hugo Mills; +Cc: Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 12579 bytes --]
Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > Hello!
> > > >
> > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > >
> > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > bug
> > > > report:
> > > >
> > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > space_cache, skinny meta data extents – are these a problem? – and
> > >
> > > > compress=lzo:
> > > (there is no known problem with skinny metadata, it's actually more
> > > efficient than the older format. There has been some anecdotes about
> > > mixing the skinny and fat metadata but nothing has ever been
> > > demonstrated problematic.)
> > >
> > > > merkaba:~> btrfs fi sh /home
> > > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > >
> > > > Total devices 2 FS bytes used 144.41GiB
> > > > devid 1 size 160.00GiB used 160.00GiB path
> > > > /dev/mapper/msata-home
> > > > devid 2 size 160.00GiB used 160.00GiB path
> > > > /dev/mapper/sata-home
> > > >
> > > > Btrfs v3.17
> > > > merkaba:~> btrfs fi df /home
> > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > >
> > > This filesystem, at the allocation level, is "very full" (see below).
> > >
> > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > cause I know no tax return software for Linux which would be suitable
> > > > for
> > > > Germany and I frankly don´t care about the end of security cause all
> > > > surfing and other network access I will do from the Linux box and I
> > > > only
> > > > run the VM behind a firewall).
> > >
> > > > And thus I try the balance dance again:
> > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > >
> > > "Balancing" is something you should almost never need to do. It is only
> > > for cases of changing geometry (adding disks, switching RAID levels,
> > > etc.) of for cases when you've radically changed allocation behaviors
> > > (like you decided to remove all your VM's or you've decided to remove a
> > > mail spool directory full of thousands of tiny files).
> > >
> > > People run balance all the time because they think they should. They are
> > > _usually_ incorrect in that belief.
> >
> > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > device.
> No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> space. What's more, balance does *not* balance the metadata trees. The
> remaining space -- 154.97 GiB -- is unstructured storage for file
> data, and you have some 13 GiB of that available for use.
Ok, let me rephrase that: the space *reserved* for the trees occupies all
space on the device. Or, put another way: when what I see as "total" in
the btrfs fi df summary adds up to what I see as "size" in btrfs fi sh,
i.e. when "used" in btrfs fi sh equals "size".
What happened here is this:
I tried
https://blogs.oracle.com/virtualbox/entry/how_to_compact_your_virtual
in order to regain some space from the Windows XP VDI file. I just wanted to
get around upsizing the BTRFS again.
And the defragmentation step in Windows ran fast at first, up to about
46-47%. During that fast phase, btrfs fi df showed that BTRFS was quickly
reserving the remaining free device space for data (not metadata).
Only a while after it did so, it got slow again: the Windows
defragmentation process stopped at 46-47% altogether, and then after a
while even the desktop locked up due to processes being blocked on I/O.
I decided to forget about downsizing the VirtualBox VDI file; it will
grow again the next time I work in Windows, and it is already at 18 GB of
its 20 GB maximum, so… I dislike the approach anyway, and I don't even
understand why the defragmentation step would be necessary, as I think
VirtualBox can punch holes into the file for any space not allocated
inside the VM, whether it is defragmented or not.
> Now, since you're seeing lockups when the space on your disks is
> all allocated I'd say that's a bug. However, you're the *only* person
> who's reported this as a regular occurrence. Does this happen with all
> filesystems you have, or just this one?
The *only* person? The compression lockups with 3.15 and 3.16 - quite a
few people saw them, I thought. For me, too, those lockups only happened
with all device space allocated.
And those seem to be gone. In regular use it doesn't lock up totally hard
anymore. But in the case where a process writes a lot into one big
no-cowed file, it seems it can still get into a lockup, this time one
where a kworker thread consumes 100% of a CPU for minutes.
> > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > from to *extend* a tree.
>
> It's not a tree. It's simply space allocation. It's not even space
> *usage* you're talking about here -- it's just allocation (i.e. the FS
> saying "I'm going to use this piece of disk for this purpose").
Okay, I thought it was the space BTRFS reserves for a tree, or rather the
*chunks* the tree manages. I am aware that it isn't already *used* space;
it's just *reserved*.
> > This may be a bug, but this is what I see.
> >
> > And no amount of "you should not balance a BTRFS" will make that
> > perception go away.
> >
> > See, I see the sun coming out on a morning and you tell me "no, it
> > doesn´t". Simply that is not going to match my perception.
>
> Duncan's assertion is correct in its detail. Looking at your space
> usage, I would not suggest that running a balance is something you
> need to do. Now, since you have these lockups that seem quite
> repeatable, there's probably a lurking bug in there, but hacking
> around with balance every time you hit it isn't going to get the
> problem solved properly.
It was Robert who wrote this, I think.
Well, I do not like balancing the FS, but I see the result; I see that it
helps here. And that's about it.
My theory from watching the Windows XP defragmentation case is this:
- For writing into the file, BTRFS needs to actually allocate and use free
space within the current allocation - or, to avoid the words we seem to
misunderstand each other on, it needs to fit the data into
Data, RAID1: total=144.98GiB, used=140.94GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB
i.e. between 144.98 GiB and 140.94 GiB, given that the total space of
this tree - or, if it's not a tree, of the chunks the tree manages - can
*not* be extended anymore.
- What I see now is: as long as it can be extended, BTRFS on this workload
*happily* does so. *Quickly*. Up to the full amount of free, unreserved
space on the device. And *even* if, in my eyes, there is a big enough
difference between total and used in btrfs fi df.
- Then, once all the device space is *reserved*, BTRFS needs to fit the
allocation within the *existing* chunks instead of reserving a new one and
filling the empty one. And I think this is where it runs into problems.
I extended both devices of /home by 10 GiB now, and I was able to complete
some balance steps, with these results.
Original after my last partly failed balance attempts:
Label: 'home' uuid: […]
Total devices 2 FS bytes used 144.20GiB
devid 1 size 170.00GiB used 159.01GiB path /dev/mapper/msata-home
devid 2 size 170.00GiB used 159.01GiB path /dev/mapper/sata-home
Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=153.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.25GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Then balancing, but not all of them:
merkaba:~#1> btrfs balance start -dusage=70 /home
Done, had to relocate 9 out of 162 chunks
merkaba:~> btrfs fi df /home
Data, RAID1: total=146.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.25GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
merkaba:~> btrfs balance start -dusage=80 /home
Done, had to relocate 9 out of 155 chunks
merkaba:~> btrfs fi df /home
Data, RAID1: total=144.98GiB, used=140.94GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: […]
Total devices 2 FS bytes used 144.19GiB
devid 1 size 170.00GiB used 150.01GiB path /dev/mapper/msata-home
devid 2 size 170.00GiB used 150.01GiB path /dev/mapper/sata-home
Btrfs v3.17
This is a situation where I do not see any slowdowns with BTRFS.
As far as I understand the balance commands I used, I told BTRFS the following:
- go and balance all chunks that have 70% or less used
- go and balance all chunks that have 80% or less used
I rarely see any chunks that have 60% or less used and get something like this
if I try:
merkaba:~> btrfs balance start -dusage=60 /home
Done, had to relocate 0 out of 153 chunks
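The "balance dance" with increasing usage filters can also be scripted; a minimal sketch (the filesystem path and percentage steps are examples; run as root):

```shell
#!/bin/sh
# Compact data chunks incrementally, emptiest chunks first; stop on the
# first failure (typically ENOSPC) instead of hammering the filesystem.
FS=/home
for pct in 5 10 20 40 60 80; do
    echo "balancing data chunks <=${pct}% used ..."
    btrfs balance start -dusage="$pct" "$FS" || break
done
btrfs fi df "$FS"
```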
Now my idea is this: BTRFS needs to satisfy the allocations it does for
writing heavily into a cow'ed file from the already reserved space. Yet if
I have lots of chunks that are filled to 60-70%, it needs to spread the
allocations across the 30-40% of each chunk that is not yet used.
My theory is this: if BTRFS needs to do this *heavily*, it at some point
runs into problems while doing so. Apparently it is *easier* to just
reserve a new chunk and fill the fresh chunk instead - otherwise I don't
know why BTRFS behaves like this: during this defragmentation inside the
VM, it prefers to reserve free device space.
And these issues may be due to an inefficient implementation or a bug.
Now if no one else is ever seeing this, it may be a peculiarity of my
filesystem, and heck, I can recreate it from scratch if need be. Yet I
would prefer to find out what is happening here.
> I think I would suggest the following:
>
> - make sure you have some way of logging your dmesg permanently (use
> a different filesystem for /var/log, or a serial console, or a
> netconsole)
>
> - when the lockup happens, hit Alt-SysRq-t a few times
>
> - send the dmesg output here, or post to bugzilla.kernel.org
>
> That's probably going to give enough information to the developers
> to work out where the lockup is happening, and is clearly the way
> forward here.
Thanks, I think this is the way to go.
Actually the logging should be safe, I'd say, because it goes to a
different BTRFS: the one for /, which is also a RAID 1 and which hasn't
shown this behavior yet, although it has also had all its space reserved
for quite some time:
merkaba:~> btrfs fi sh /
Label: 'debian' uuid: […]
Total devices 2 FS bytes used 17.79GiB
devid 1 size 30.00GiB used 30.00GiB path /dev/mapper/sata-debian
devid 2 size 30.00GiB used 30.00GiB path /dev/mapper/msata-debian
Btrfs v3.17
merkaba:~> btrfs fi df /
Data, RAID1: total=27.99GiB, used=17.21GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=596.12MiB
GlobalReserve, single: total=208.00MiB, used=0.00B
*Unless* one BTRFS locking up makes the other lock up as well, logging
should be safe.
Actually, the last hung-task messages made it into the log, as I posted
them here. So I may just try to reproduce this and trigger
echo "t" > /proc/sysrq-trigger
this gives
[32459.707323] systemd-journald[314]: /dev/kmsg buffer overrun, some messages
lost.
but I bet rsyslog will capture it just fine. I may even disable journald to
reduce writes to / while reproducing the bug.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 9:30 ` Hugo Mills
2014-12-27 10:54 ` Martin Steigerwald
@ 2014-12-27 11:11 ` Martin Steigerwald
2014-12-27 12:08 ` Robert White
2014-12-27 13:55 ` Martin Steigerwald
2014-12-27 18:28 ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
3 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 11:11 UTC (permalink / raw)
To: Hugo Mills, Robert White, linux-btrfs
Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> >
> >
> > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > device.
> No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> space. What's more, balance does *not* balance the metadata trees. The
> remaining space -- 154.97 GiB -- is unstructured storage for file
> data, and you have some 13 GiB of that available for use.
>
> Now, since you're seeing lockups when the space on your disks is
> all allocated I'd say that's a bug. However, you're the *only* person
> who's reported this as a regular occurrence. Does this happen with all
> filesystems you have, or just this one?
Okay, just about terms.
What I call trees is this:
merkaba:~> btrfs fi df /
Data, RAID1: total=27.99GiB, used=17.21GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=596.12MiB
GlobalReserve, single: total=208.00MiB, used=0.00B
For me each one of "Data", "System", "Metadata" and "GlobalReserve" is what I
call a "tree".
How would you call it?
I always thought that BTRFS uses a tree structure not only for metadata, but
also for data. But I bet, strictly speaking, that's only to *manage* the chunks
it allocates, and what I see above is the actual chunk usage.
I.e., to get terms straight, how would you call it? I think my understanding of
how BTRFS handles space allocation is fairly correct, but I may be using a term
incorrectly.
I read
> Data, RAID1: total=27.99GiB, used=17.21GiB
as:
I reserved 27.99 GiB for data chunks and used 17.21 GiB in these data chunks
so far. So I have about 10.5 GiB free in these data chunks at the moment, and
all is good.
What it doesn't tell me at all is how the allocated space is distributed across
these chunks. It may be that some chunks are completely empty, or it may be
that each chunk has some space allocated to it but in total that amount of
free space remains. I.e. it doesn't tell me anything about the free-space
fragmentation inside the chunks.
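As a toy illustration of that point (made-up numbers, nothing read from a real btrfs): two filesystems can report identical total/used figures while the free space inside their chunks looks completely different:

```shell
# Two hypothetical layouts, MiB used per 1 GiB chunk: same totals,
# very different free-space shape inside the chunks.
for layout in "1024 1024 0 0" "512 512 512 512"; do
    used=0; largest=0
    for u in $layout; do
        used=$((used + u))
        free=$((1024 - u))
        if [ "$free" -gt "$largest" ]; then largest=$free; fi
    done
    echo "used=${used}MiB largest_free_in_one_chunk=${largest}MiB"
done
```

Both layouts print used=2048MiB, but the largest contiguous free area inside any one chunk is 1024 MiB in the first layout and only 512 MiB in the second.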
Yet I still hold my theory that, in the case of heavily writing to a COW'd
file, BTRFS seems to prefer to reserve new empty chunks on this /home
filesystem of my laptop instead of trying to find free space in existing, only
partially empty chunks. And the lockup only happens when it tries to do the
latter. And no, I think it shouldn't lock up then. I also think it's a bug. I
never said differently.
And yes, I only ever had this on my /home so far. Not on /, which is also
RAID 1 and has had all device space reserved for quite some time; not on
/daten, which only holds large files and is single instead of RAID. Also not
on the server, though the server FS still has lots of unallocated device
space, nor on the 2 TiB eSATA backup HD. Although I do get the impression that
BTRFS has started to get slower there as well: at least the rsync-based backup
script takes quite long meanwhile, and I see rsync reading from the backup
BTRFS and in this case almost fully utilizing the disk for long stretches. But
unlike my /home, the backup disk has some widely spaced snapshots (at roughly
2-week to 1-month intervals, covering about the last half year).
Neither /home nor / on the SSD have snapshots at the moment. So this is
happening without snapshots.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 10:54 ` Martin Steigerwald
@ 2014-12-27 11:52 ` Robert White
2014-12-27 13:16 ` Martin Steigerwald
0 siblings, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-27 11:52 UTC (permalink / raw)
To: Martin Steigerwald, Hugo Mills; +Cc: linux-btrfs
On 12/27/2014 02:54 AM, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
>> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
>>> Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
>>>> On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
>>>>> Hello!
>>>>>
>>>>> First: Have a merry christmas and enjoy a quiet time in these days.
>>>>>
>>>>> Second: At a time you feel like it, here is a little rant, but also a
>>>>> bug
>>>>> report:
>>>>>
>>>>> I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
>>>>> space_cache, skinny meta data extents – are these a problem? – and
>>>>
>>>>> compress=lzo:
>>>> (there is no known problem with skinny metadata, it's actually more
>>>> efficient than the older format. There have been some anecdotes about
>>>> mixing the skinny and fat metadata but nothing has ever been
>>>> demonstrated problematic.)
>>>>
>>>>> merkaba:~> btrfs fi sh /home
>>>>> Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
>>>>>
>>>>> Total devices 2 FS bytes used 144.41GiB
>>>>> devid 1 size 160.00GiB used 160.00GiB path
>>>>> /dev/mapper/msata-home
>>>>> devid 2 size 160.00GiB used 160.00GiB path
>>>>> /dev/mapper/sata-home
>>>>>
>>>>> Btrfs v3.17
>>>>> merkaba:~> btrfs fi df /home
>>>>> Data, RAID1: total=154.97GiB, used=141.12GiB
>>>>> System, RAID1: total=32.00MiB, used=48.00KiB
>>>>> Metadata, RAID1: total=5.00GiB, used=3.29GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>
>>>> This filesystem, at the allocation level, is "very full" (see below).
>>>>
>>>>> And I had hangs with BTRFS again. This time as I wanted to install tax
>>>>> return software in Virtualbox´d Windows XP VM (which I use once a year
>>>>> cause I know no tax return software for Linux which would be suitable
>>>>> for
>>>>> Germany and I frankly don´t care about the end of security cause all
>>>>> surfing and other network access I will do from the Linux box and I
>>>>> only
>>>>> run the VM behind a firewall).
>>>>
>>>>> And thus I try the balance dance again:
>>>> ITEM: Balance... it doesn't do what you think it does... 8-)
>>>>
>>>> "Balancing" is something you should almost never need to do. It is only
>>>> for cases of changing geometry (adding disks, switching RAID levels,
>>>> etc.) of for cases when you've radically changed allocation behaviors
>>>> (like you decided to remove all your VM's or you've decided to remove a
>>>> mail spool directory full of thousands of tiny files).
>>>>
>>>> People run balance all the time because they think they should. They are
>>>> _usually_ incorrect in that belief.
>>>
>>> I only see the lockups of BTRFS is the trees *occupy* all space on the
>>> device.
>> No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
>> space. What's more, balance does *not* balance the metadata trees. The
>> remaining space -- 154.97 GiB -- is unstructured storage for file
>> data, and you have some 13 GiB of that available for use.
>
> Ok, let me rephrase that: then the space *reserved* for the trees occupies all
> space on the device. Or okay: when what I see in btrfs fi df as "total" in
> summary occupies what I see as "size" in btrfs fi sh, i.e. when "used" equals
> "size" in btrfs fi sh.
>
> What happened here is this:
>
> I tried
>
> https://blogs.oracle.com/virtualbox/entry/how_to_compact_your_virtual
>
> in order to regain some space from the Windows XP VDI file. I just wanted to
> get around upsizing the BTRFS again.
>
> And on the defragmentation step in Windows it first ran fast. For about 46-47%
> there, during that fast phase btrfs fi df showed that BTRFS was quickly
> reserving the remaining free device space for data trees (not metadata).
The above statement is word-salad. The storage for data is not a "data
tree", the tree that maps data into a file is metadata. The data is
data. There is no "data tree".
> Only a while after it did so, it got slow again; basically the Windows
> defragmentation process stopped at 46-47% altogether, and then after a while
> even the desktop locked up due to processes being blocked in I/O.
If you've over-organized your very-large data files you can waste
terrific amounts of space.
[---------------------------------------]
[-------] [uuuuuuu] [] [-----]
[------] [-----][----] [-------]
[----]
As you write new segments you don't actually free the lower extents
unless they are _completely_ obscured end-to-end by a later extent. So
if you've _ever_ defragged the BTRFS extent to be fully contiguous and
you've not overwritten each and every byte later, the original expanse
is still going to be there.
In the above example only the "uuu" block is ever freed, and only when
the fourth generation finally covers the little gap.
In the worst case you can end up with (N*(N+1))/2 total blocks used up
on disk when only N blocks are visible. (See the Gauss equation for the
sum of consecutive integers for why this is the correct approximation
for the worst case.)
[------------]
[-----------]
[----------]
...
[-]
Each generation, being one block shorter than the previous one, exposes
N blocks, one from each generation. So 1+2+3+4+5...+N blocks are allocated
if each overwrite is one block shorter than the previous.
So if your original VDI file was all in little pieces all through the
disk, it will waste less space (statistically).
But if you keep on defragging the file internally and externally you can
end up with many times the total file size "in use" to represent the
disk file.
So like I said, if you start trying to _force_ order you will end up
paying significant expenses as the file ages.
COW can help, but every snapshot counts as a generation, so really it's
not necessarily ideal.
I suspect that copying the file as 100 blocks (400k) [or so] at a time
would lead to a file likely to sanitize its history with overwrites.
As it is, coercing order is not your friend. But once done, the best
thing to do is periodically copy the whole file anew to burp the history
out of it.
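That "copy the whole file anew" step might look like this (a sketch assuming GNU coreutils; the 1 MiB demo file is a stand-in for the real image):

```shell
# "Burp" a file's piled-up extent history by rewriting it in full.
f=bigfile.vdi
[ -e "$f" ] || head -c 1048576 /dev/zero > "$f"  # demo stand-in for the real image
cp --reflink=never "$f" "$f.new"  # force a full data copy, no shared extents
mv "$f.new" "$f"                  # replace the original; its old extents can be freed
```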
>
> I decided to forget about this downsizing of the Virtualbox VDI file, it will
> extend again on next Windows work and it is already 18 GB of its maximum 20GB,
> so… I dislike the approach anyway, and don´t even understand why the
> defragmentation step would be necessary as I think Virtualbox can poke holes
> into the file for any space not allocated inside the VM, whether it is
> defragmented or not.
If you don't have trim turned on in both the virtual box and the base
system then there is no discarding to be done. And defrag is "meh" in
your arrangement. [See "lsblk -D" to check whether you are doing real
discards. Check Windows as well.]
Then consider using the _raw_ disk format instead of VDI, since the
"container format" may not pass trim operations through to the
underlying filesystem as such. (I don't know for sure.)
So basically, you've arranged your storage almost exactly wrong by
defragging and such, particularly since you are doing it at both layers.
I know where you got the advice from, but it's not right for the BTRFS
assumptions.
>
>> Now, since you're seeing lockups when the space on your disks is
>> all allocated I'd say that's a bug. However, you're the *only* person
>> who's reported this as a regular occurrence. Does this happen with all
>> filesystems you have, or just this one?
>
> The *only* person? The compression lockups with 3.15 and 3.16, quite some
> people saw them, I thought. For me also these lockups only happened with all
> space on device allocated.
>
> And these seem to be gone. In regular use it doesn't lock up totally hard. But
> in the case where a process writes a lot into one big no-COW'd file, it seems
> it can still get into a lockup, though this time one where a kworker thread
> consumes 100% of a CPU for minutes.
>
>>> I *never* so far saw it lockup if there is still space BTRFS can allocate
>>> from to *extend* a tree.
>>
>> It's not a tree. It's simply space allocation. It's not even space
>> *usage* you're talking about here -- it's just allocation (i.e. the FS
>> saying "I'm going to use this piece of disk for this purpose").
>
> Okay, I thought it is the space BTRFS reserves for a tree, or well, the
> *chunks* the tree manages. I am aware that it isn't already *used* space, it's
> just *reserved*.
>
>>> This may be a bug, but this is what I see.
>>>
>>> And no amount of "you should not balance a BTRFS" will make that
>>> perception go away.
>>>
>>> See, I see the sun coming out on a morning and you tell me "no, it
>>> doesn´t". Simply that is not going to match my perception.
>>
>> Duncan's assertion is correct in its detail. Looking at your space
>> usage, I would not suggest that running a balance is something you
>> need to do. Now, since you have these lockups that seem quite
>> repeatable, there's probably a lurking bug in there, but hacking
>> around with balance every time you hit it isn't going to get the
>> problem solved properly.
>
> It was Robert writing this I think.
>
> Well I do not like to balance the FS, but I see the result, I see that it
> helps here. And thats about it.
>
> My theory from watching the Windows XP defragmentation case is this:
>
> - For writing into the file, BTRFS needs to actually allocate and use free
> space within the current allocation, or, as we seem to have misunderstood
> each other's words, it needs to fit data in
>
> Data, RAID1: total=144.98GiB, used=140.94GiB
>
> between 144.98 GiB and 140.94 GiB, given that the total space of this tree,
> or if it's not a tree, of the chunks that the tree manages, can *not* be
> extended anymore.
If your file was actually COW (and you have _not_ been taking snapshots)
then there is no extenting to be had. But if you are using snapper
(which I believe you mentioned previously) then the snapshots cause a
write boundary and a layer of copying. Frequently taking snapshots of a
COW file is self defeating. If you are going to take snapshots then you
might as well turn copy on write back on and, for the love of pete, stop
defragging things.
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.24GiB
>
> - What I see now is as long as it can be extended, BTRFS on this workload
> *happily* does so. *Quickly*. Up to the amount of the free, unreserved space
> of the device. And *even* if in my eyes there is a big enough difference
> between total and used in btrfs fi df.
>
> - Then as all the device space is *reserved*, BTRFS needs to fit the allocation
> within the *existing* chunks instead of reserving a new one and fill the empty
> one. And I think this is where it gets problems.
>
>
> I extended both devices of /home by 10 GiB now and I was able to complete some
> balance steps with these results.
>
> Original after my last partly failed balance attempts:
>
> Label: 'home' uuid: […]
> Total devices 2 FS bytes used 144.20GiB
> devid 1 size 170.00GiB used 159.01GiB path /dev/mapper/msata-home
> devid 2 size 170.00GiB used 159.01GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=153.98GiB, used=140.95GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.25GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
> Then balancing, but not all of them:
>
> merkaba:~#1> btrfs balance start -dusage=70 /home
> Done, had to relocate 9 out of 162 chunks
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=146.98GiB, used=140.95GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.25GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> merkaba:~> btrfs balance start -dusage=80 /home
> Done, had to relocate 9 out of 155 chunks
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=144.98GiB, used=140.94GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.24GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> merkaba:~> btrfs fi sh /home
> Label: 'home' uuid: […]
> Total devices 2 FS bytes used 144.19GiB
> devid 1 size 170.00GiB used 150.01GiB path /dev/mapper/msata-home
> devid 2 size 170.00GiB used 150.01GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
>
>
> This is a situation where I do not see any slowdowns with BTRFS.
>
> As far as I understand the balance commands I used I told BTRFS the following:
>
> - go and balance all chunks that have 70% or less used
> - go and balance all chunks that have 80% or less used
>
> I rarely see any chunks that have 60% or less used and get something like this
> if I try:
>
> merkaba:~> btrfs balance start -dusage=60 /home
> Done, had to relocate 0 out of 153 chunks
>
>
>
> Now my idea is this: BTRFS will need to satisfy the allocations it needs to
> do for writing heavily into a COW'd file from the already reserved space. Yet
> if I have lots of chunks that are filled between 60-70%, it needs to spread
> the allocations across the 40-30% of each chunk that is not yet used.
>
> My theory is this: if BTRFS has to do this *heavily*, at some point it runs
> into problems while doing so. Apparently it is *easier* to just reserve a new
> chunk and fill that fresh chunk instead. Otherwise I don't know why BTRFS is
> doing it like this, preferring to reserve free device space during this
> defragmentation inside the VM.
When you defrag inside the VM, it gets scrambled through the VDI
container, then layered into the BTRFS filesystem. This can consume vast
amounts of space with no purpose. So...
Don't do that.
> And these issues may be due to an inefficient implementation or bug.
Or just stop fighting the system with all the unnecessary defragging.
Watch the picture as it defrags. Look at all that layered writing.
That's what's killing you.
(I do agree, however, that the implementation can become very
inefficient, especially if you do exactly the wrong things.)
>
> Now if no one else is ever having this, this may be a speciality of my
> filesystem, and heck, I can recreate it from scratch if need be. Yet I would
> prefer to find out what is happening here.
>
>
>> I think I would suggest the following:
>>
>> - make sure you have some way of logging your dmesg permanently (use
>> a different filesystem for /var/log, or a serial console, or a
>> netconsole)
>>
>> - when the lockup happens, hit Alt-SysRq-t a few times
>>
>> - send the dmesg output here, or post to bugzilla.kernel.org
>>
>> That's probably going to give enough information to the developers
>> to work out where the lockup is happening, and is clearly the way
>> forward here.
>
> Thanks, I think this seems to be the way to go.
>
> Actually the logging should be safe, I'd say, because it goes into a different
> BTRFS: the one for /, which is also RAID 1 and hasn't shown this behavior yet,
> although it too has had all space allocated for quite some time:
>
> merkaba:~> btrfs fi sh /
> Label: 'debian' uuid: […]
> Total devices 2 FS bytes used 17.79GiB
> devid 1 size 30.00GiB used 30.00GiB path /dev/mapper/sata-debian
> devid 2 size 30.00GiB used 30.00GiB path /dev/mapper/msata-debian
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /
> Data, RAID1: total=27.99GiB, used=17.21GiB
> System, RAID1: total=8.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=596.12MiB
> GlobalReserve, single: total=208.00MiB, used=0.00B
>
>
> *Unless* one BTRFS locking up also locks up the other, logging should be
> safe.
>
> Actually I got the last task hung messages as I posted them here. So I may
> just try to reproduce this and trigger
>
> echo "t" > /proc/sysrq-trigger
>
> this gives
>
> [32459.707323] systemd-journald[314]: /dev/kmsg buffer overrun, some messages
> lost.
>
> but I bet rsyslog will capture it just fine. I may even disable journald to
> reduce writes to / while reproducing the bug.
>
> Ciao,
>
ASIDE: I've been considering recreating my raw extents with COW turned
_off_, but doing it as a series of 4Meg appends so that the underlying
allocation would look like
[--][--][--][--][--][--][--][--][--][--][--][--][--]...[--][--]
this would net the most naturally discard-ready, cleanable history. The
problem as things stand is the vast expanse of the preallocated base.
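Scaled way down, that append-style rebuild might look like this (a sketch; `chattr +C` only has an effect on btrfs and must be set before data is written, and the 16 MiB total is just a demo size):

```shell
# Rebuild a nocow image as a series of 4 MiB appends, so each append
# becomes its own small extent that later overwrites can fully cover.
img=disk.raw
rm -f "$img" && touch "$img"
chattr +C "$img" 2>/dev/null || true  # nocow; silently skipped off btrfs
for i in 1 2 3 4; do                  # the real image would loop to full size
    dd if=/dev/zero bs=4M count=1 status=none >> "$img"
done
stat -c %s "$img"                     # 16777216 bytes (4 x 4 MiB)
```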
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 11:11 ` Martin Steigerwald
@ 2014-12-27 12:08 ` Robert White
0 siblings, 0 replies; 59+ messages in thread
From: Robert White @ 2014-12-27 12:08 UTC (permalink / raw)
To: Martin Steigerwald, Hugo Mills, linux-btrfs
On 12/27/2014 03:11 AM, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
>>>
>>>
>>> I only see the lockups of BTRFS is the trees *occupy* all space on the
>>> device.
>> No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
>> space. What's more, balance does *not* balance the metadata trees. The
>> remaining space -- 154.97 GiB -- is unstructured storage for file
>> data, and you have some 13 GiB of that available for use.
>>
>> Now, since you're seeing lockups when the space on your disks is
>> all allocated I'd say that's a bug. However, you're the *only* person
>> who's reported this as a regular occurrence. Does this happen with all
>> filesystems you have, or just this one?
>
> Okay, just about terms.
Terms are _really_ important if you want to file and discuss bugs.
> What I call trees is this:
>
> merkaba:~> btrfs fi df /
> Data, RAID1: total=27.99GiB, used=17.21GiB
> System, RAID1: total=8.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=596.12MiB
> GlobalReserve, single: total=208.00MiB, used=0.00B
>
> For me each one of "Data", "System", "Metadata" and "GlobalReserve" is what I
> call a "tree".
>
> How would you call it?
Those are "extents" I think. All of the "Trees" are in the metadata. One
of the trees is the "extent tree". That extent tree is what contains the
list of which regions of the disk are data, or metadata, or
system-metadata (like the superblocks), or the global reserve.
Those extents are then filled with the type of information described.
But all the "trees" are in the metadata extents.
>
> I always thought that BTRFS uses a tree structure not only for metadata, but
> also for data. But I bet strictly spoken thats only to *manage* the chunks it
> allocates and what I see above is the actual chunk usage.
>
> I.e. to get terms straight, how would you call it? I think my understanding of
> how BTRFS handles space allocation is quite correct, but I may use a term
> incorrectly.
>
> I read
>
>> Data, RAID1: total=27.99GiB, used=17.21GiB
>
> as:
>
> I reserved 27.99 GiB for data chunks and used 17.21 GiB in these data chunks
> so far. So I have about 10.5 GiB free in these data chunks at the moment and
> all is good.
>
> What it doesn't tell me at all is how the allocated space is distributed across
> these chunks. It may be that some chunks are completely empty, or it may be
> that each chunk has some space allocated to it but in total that amount of
> free space remains. I.e. it doesn't tell me anything about the free-space
> fragmentation inside the chunks.
>
> Yet I still hold my theory that, in the case of heavily writing to a COW'd
> file, BTRFS seems to prefer to reserve new empty chunks on this /home
> filesystem of my laptop instead of trying to find free space in existing, only
> partially empty chunks. And the lockup only happens when it tries to do the
> latter. And no, I think it shouldn't lock up then. I also think it's a bug. I
> never said differently.
Partly correct. The system (as I understand it) will try to fill old
chunks before allocating new ones. It also prefers the most empty
chunk first. But if you fallocate large extents they can have trouble
finding a home. So let's say you have a systemic process that keeps
making .51GiB files: it will then tend to allocate a new 1GiB data extent
each time (presuming you used default values) because each successive
.51GiB region cannot fit in any existing data extent.
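A back-of-envelope check of that packing failure (my arithmetic, assuming 1 GiB data chunks as described above):

```shell
# A ~0.51 GiB allocation fits only once per 1 GiB chunk, so each new
# file of that size forces a brand-new chunk and strands the remainder.
chunk_mib=1024
file_mib=522                                   # ~0.51 GiB
echo "files per chunk: $((chunk_mib / file_mib))"          # 1
echo "stranded per chunk: $((chunk_mib % file_mib)) MiB"   # 502 MiB
```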
Excessive snapshotting can also contribute to this effect, but only
because it freezes the history.
There are some other odd-out cases.
> And yes, I only ever had this on my /home so far. Not on /, which is also
> RAID 1 and has had all device space reserved for quite some time; not on
> /daten, which only holds large files and is single instead of RAID. Also not
> on the server, though the server FS still has lots of unallocated device
> space, nor on the 2 TiB eSATA backup HD. Although I do get the impression that
> BTRFS has started to get slower there as well: at least the rsync-based backup
> script takes quite long meanwhile, and I see rsync reading from the backup
> BTRFS and in this case almost fully utilizing the disk for long stretches. But
> unlike my /home, the backup disk has some widely spaced snapshots (at roughly
> 2-week to 1-month intervals, covering about the last half year).
>
> Neither /home nor / on the SSD have snapshots at the moment. So this is
> happening without snapshots.
>
> Ciao,
>
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 11:52 ` Robert White
@ 2014-12-27 13:16 ` Martin Steigerwald
2014-12-27 13:49 ` Robert White
2014-12-27 14:00 ` Robert White
0 siblings, 2 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 13:16 UTC (permalink / raw)
To: Robert White; +Cc: Hugo Mills, linux-btrfs
Am Samstag, 27. Dezember 2014, 03:52:56 schrieb Robert White:
> > My theory from watching the Windows XP defragmentation case is this:
> >
> > - For writing into the file BTRFS needs to actually allocate and use free
> > space in the current tree allocation, or, as we seem to misunderstood
> > from the words we use, it needs to fit data in
> >
> > Data, RAID1: total=144.98GiB, used=140.94GiB
> >
> > between 144,98 GiB and 140,94 GiB given that total space of this tree, or
> > if its not a tree, but the chunks in that the tree manages, in these
> > chunks can *not* be extended anymore.
>
> If your file was actually COW (and you have _not_ been taking snapshots)
> then there is no extenting to be had. But if you are using snapper
> (which I believe you mentioned previously) then the snapshots cause a
> write boundary and a layer of copying. Frequently taking snapshots of a
> COW file is self defeating. If you are going to take snapshots then you
> might as well turn copy on write back on and, for the love of pete, stop
> defragging things.
I don't use any snapshots on the filesystems. None, zero, zilch, nada.
And as I understand it, copy on write means: it has to write the new write
requests somewhere else. For this it needs to allocate space, either
within existing chunks or in a newly allocated one.
So with COW, writing to a file will always need to allocate new space
(although it can forget about the old space afterwards, unless a
snapshot holds onto it).
Anyway, I got it reproduced, and I am about to write a lengthy mail about it.
It can easily be reproduced without even using Virtualbox, just with a nice
simple fio job.
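A fio job of the kind described might look like this (a guess at the workload, not the actual job file from the report; path, size, and runtime are placeholders):

```ini
; random writes into one big file, approximating the VM-image workload
[randwrite-bigfile]
filename=/home/test/bigfile
size=8g
rw=randwrite
bs=4k
ioengine=libaio
iodepth=16
runtime=300
time_based
```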
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 13:16 ` Martin Steigerwald
@ 2014-12-27 13:49 ` Robert White
2014-12-27 14:06 ` Martin Steigerwald
2014-12-27 14:00 ` Robert White
1 sibling, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-27 13:49 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Hugo Mills, linux-btrfs
On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 03:52:56 schrieb Robert White:
>>> My theory from watching the Windows XP defragmentation case is this:
>>>
>>> - For writing into the file BTRFS needs to actually allocate and use free
>>> space in the current tree allocation, or, as we seem to misunderstood
>>> from the words we use, it needs to fit data in
>>>
>>> Data, RAID1: total=144.98GiB, used=140.94GiB
>>>
>>> between 144,98 GiB and 140,94 GiB given that total space of this tree, or
>>> if its not a tree, but the chunks in that the tree manages, in these
>>> chunks can *not* be extended anymore.
>>
>> If your file was actually COW (and you have _not_ been taking snapshots)
>> then there is no extenting to be had. But if you are using snapper
>> (which I believe you mentioned previously) then the snapshots cause a
>> write boundary and a layer of copying. Frequently taking snapshots of a
>> COW file is self defeating. If you are going to take snapshots then you
>> might as well turn copy on write back on and, for the love of pete, stop
>> defragging things.
>
> I don´t use any snapshots on the filesystems. None, zero, zilch, nada.
>
> And as I understand it, copy on write means: it has to write the new write
> requests somewhere else. For this it needs to allocate space, either
> within existing chunks or in a newly allocated one.
>
> So with COW, writing to a file will always need to allocate new space
> (although it can forget about the old space afterwards, unless a
> snapshot holds onto it)
It can _only_ forget about the space if absolutely _all_ of the old
extent is overwritten. So if you write 1MiB, then you go back and
overwrite 1MiB-4KiB, then you go back and write 1MiB-8KiB, you've now
got 3MiB-12KiB to represent 1MiB of data. No snapshots involved. The
worst case is quite well understood.
[...--------------] 1MiB
[...-------------] 1MiB-4KiB
[...------------] 1MiB-8KiB
BTRFS will _NOT_ reclaim "part" of any extent. So if this kept going
it would take 256 diminishing overwrites, each 4 KiB less than the prior:
1 MiB == 256 4 KiB blocks.
(256*(256+1))/2 = 32896 4 KiB blocks, or 128.5 MiB of storage allocated and
dedicated to representing 1 MiB of accessible data.
This is a worst case, of course, but it exists and it's _horrible_.
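That arithmetic, spelled out (a sketch of the accounting only, not btrfs code; 1 MiB is 256 blocks of 4 KiB):

```shell
# Worst case: N diminishing overwrites, no old extent ever reclaimable.
block=4096
n=$((1024 * 1024 / block))       # 1 MiB = 256 blocks of 4 KiB
allocated=$((n * (n + 1) / 2))   # Gauss sum: blocks still pinned on disk
echo "$allocated blocks"                      # 32896
echo "$((allocated * block)) bytes for 1 MiB" # 134742016 bytes = 128.5 MiB
```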
And such a file can be "burped" by doing a copy-and-rename, resulting in
returning it to a single 1MiB extent. (I don't know if a "btrfs defrag"
would have identical results, but I think it would.)
The problem is that there isn't (yet) a COW safe way to discard partial
extents. That is, there is no universally safe way (yet implemented) to
turn that first 1MiB into two extents of 1MiB-4K and one 4K extent "in
place" so there is no way (yet) to prevent this worst case.
Doing things like excessive defragging at the BTRFS level, and
defragging inside of a VM, and using certain file types can lead to
pretty awful data wastage. YMMV.
e.g. "too much tidying up and you make a mess".
I offered a pseudocode example a few days back on how this problem might
be dealt with in future, but I've not seen any feedback on it.
>
> Anyway, I got it reproduced. And am about to write a lengthy mail about.
Have fun with that lengthy email, but the devs already know about the
data waste profile of the system. They just don't have a good solution yet.
Practical use cases involving _not_ defragging and _not_ packing files,
or disabling COW and using raw image formats for VM disk storage are,
meanwhile, also well understood.
>
> It can easily be reproduced without even using Virtualbox, just by a nice
> simple fio job.
>
Yep. As I've explained twice now.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 9:30 ` Hugo Mills
2014-12-27 10:54 ` Martin Steigerwald
2014-12-27 11:11 ` Martin Steigerwald
@ 2014-12-27 13:55 ` Martin Steigerwald
2014-12-27 14:54 ` Robert White
2014-12-28 13:00 ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
2014-12-27 18:28 ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
3 siblings, 2 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 13:55 UTC (permalink / raw)
To: Hugo Mills; +Cc: Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 33441 bytes --]
Summarized at
Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401
see below. This is reproducible with fio; no need for Windows XP in
Virtualbox to reproduce the issue. Next I will try to reproduce it on
a freshly created filesystem.
Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > Hello!
> > > >
> > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > >
> > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > bug
> > > > report:
> > > >
> > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > space_cache, skinny meta data extents – are these a problem? – and
> > >
> > > > compress=lzo:
> > > (there is no known problem with skinny metadata, it's actually more
> > > efficient than the older format. There have been some anecdotes about
> > > mixing the skinny and fat metadata but nothing has ever been
> > > demonstrated problematic.)
> > >
> > > > merkaba:~> btrfs fi sh /home
> > > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > >
> > > > Total devices 2 FS bytes used 144.41GiB
> > > > devid 1 size 160.00GiB used 160.00GiB path
> > > > /dev/mapper/msata-home
> > > > devid 2 size 160.00GiB used 160.00GiB path
> > > > /dev/mapper/sata-home
> > > >
> > > > Btrfs v3.17
> > > > merkaba:~> btrfs fi df /home
> > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > >
> > > This filesystem, at the allocation level, is "very full" (see below).
> > >
> > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > cause I know no tax return software for Linux which would be suitable
> > > > for
> > > > Germany and I frankly don´t care about the end of security cause all
> > > > surfing and other network access I will do from the Linux box and I
> > > > only
> > > > run the VM behind a firewall).
> > >
> > > > And thus I try the balance dance again:
> > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > >
> > > "Balancing" is something you should almost never need to do. It is only
> > > for cases of changing geometry (adding disks, switching RAID levels,
> > > etc.) or for cases when you've radically changed allocation behaviors
> > > (like you decided to remove all your VM's or you've decided to remove a
> > > mail spool directory full of thousands of tiny files).
> > >
> > > People run balance all the time because they think they should. They are
> > > _usually_ incorrect in that belief.
> >
> > I only see the lockups of BTRFS if the trees *occupy* all space on the
> > device.
> No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> space. What's more, balance does *not* balance the metadata trees. The
> remaining space -- 154.97 GiB -- is unstructured storage for file
> data, and you have some 13 GiB of that available for use.
>
> Now, since you're seeing lockups when the space on your disks is
> all allocated I'd say that's a bug. However, you're the *only* person
> who's reported this as a regular occurrence. Does this happen with all
> filesystems you have, or just this one?
>
> > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > from to *extend* a tree.
>
> It's not a tree. It's simply space allocation. It's not even space
> *usage* you're talking about here -- it's just allocation (i.e. the FS
> saying "I'm going to use this piece of disk for this purpose").
>
> > This may be a bug, but this is what I see.
> >
> > And no amount of "you should not balance a BTRFS" will make that
> > perception go away.
> >
> > See, I see the sun coming out on a morning and you tell me "no, it
> > doesn´t". Simply that is not going to match my perception.
>
> Duncan's assertion is correct in its detail. Looking at your space
Robert's :)
> usage, I would not suggest that running a balance is something you
> need to do. Now, since you have these lockups that seem quite
> repeatable, there's probably a lurking bug in there, but hacking
> around with balance every time you hit it isn't going to get the
> problem solved properly.
>
> I think I would suggest the following:
>
> - make sure you have some way of logging your dmesg permanently (use
> a different filesystem for /var/log, or a serial console, or a
> netconsole)
>
> - when the lockup happens, hit Alt-SysRq-t a few times
>
> - send the dmesg output here, or post to bugzilla.kernel.org
>
> That's probably going to give enough information to the developers
> to work out where the lockup is happening, and is clearly the way
> forward here.
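Hugo's three steps can be sketched as a small root-only helper (the log directory and repeat count are illustrative assumptions; put the log on a filesystem other than the hanging one):

```shell
# Capture task-state dumps when the lockup happens (run as root).
logdir=/var/log/btrfs-hang          # assumption: not on the hanging fs
stamp() { date +%Y%m%d-%H%M%S; }    # unique timestamp per capture

capture_hang_info() {
    mkdir -p "$logdir"
    for i in 1 2 3; do
        echo t > /proc/sysrq-trigger   # same as hitting Alt-SysRq-t
        sleep 2
    done
    dmesg > "$logdir/dmesg-$(stamp).txt"
}
```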
And I got it reproduced. *Perfectly* reproduced, I´d say.
But let me run the whole story:
1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
Which gave me:
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
Total devices 2 FS bytes used 144.19GiB
devid 1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home
Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=144.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
2) I ran the Virtualbox machine again and defragmented the NTFS filesystem
in the VDI image file. And: It worked *just* fine. Fine as in *fine*. No issues
whatsoever.
I got this during the run:
ATOP - merkaba 2014/12/27 12:58:42 ----------- 10s elapsed
PRC | sys 10.41s | user 1.08s | #proc 357 | #trun 4 | #tslpi 694 | #tslpu 0 | #zombie 0 | no procacct |
CPU | sys 107% | user 11% | irq 0% | idle 259% | wait 23% | guest 0% | curf 3.01GHz | curscal 93% |
cpu | sys 29% | user 3% | irq 0% | idle 63% | cpu002 w 5% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 27% | user 3% | irq 0% | idle 65% | cpu000 w 5% | guest 0% | curf 3.03GHz | curscal 94% |
cpu | sys 26% | user 3% | irq 0% | idle 63% | cpu003 w 8% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 24% | user 2% | irq 0% | idle 68% | cpu001 w 6% | guest 0% | curf 3.00GHz | curscal 93% |
CPL | avg1 1.92 | avg5 1.01 | avg15 0.56 | | csw 501619 | intr 129279 | | numcpu 4 |
MEM | tot 15.5G | free 610.1M | cache 9.1G | buff 0.1M | slab 1.0G | shmem 183.5M | vmbal 0.0M | hptot 0.0M |
SWP | tot 12.0G | free 11.6G | | | | | vmcom 7.1G | vmlim 19.7G |
PAG | scan 219141 | steal 215577 | stall 936 | | | | swin 0 | swout 940 |
LVM | sata-home | busy 53% | read 181413 | write 0 | KiB/w 0 | MBr/s 70.86 | MBw/s 0.00 | avio 0.03 ms |
LVM | sata-swap | busy 2% | read 0 | write 940 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.37 | avio 0.17 ms |
LVM | sata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 1.00 ms |
LVM | msata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms |
DSK | sda | busy 53% | read 181413 | write 477 | KiB/w 7 | MBr/s 70.86 | MBw/s 0.37 | avio 0.03 ms |
DSK | sdb | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms |
NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
9650 - martin martin 22 7.89s 0.65s 0K 128K 705.5M 382.1M -- - S 2 87% VirtualBox
9911 - root root 1 0.69s 0.01s 0K 0K 0K 0K -- - S 3 7% watch
9598 - root root 1 0.38s 0.00s 0K 0K 0K 20K -- - S 0 4% kworker/u8:9
9892 - root root 1 0.36s 0.00s 0K 0K 0K 0K -- - S 1 4% kworker/u8:17
9428 - root root 1 0.30s 0.00s 0K 0K 0K 0K -- - R 0 3% kworker/u8:3
9589 - root root 1 0.23s 0.00s 0K 0K 0K 0K -- - S 1 2% kworker/u8:6
4746 - martin martin 2 0.04s 0.13s 0K -16K 0K 0K -- - R 2 2% konsole
Every 1,0s: cat /proc/meminfo Sat Dec 27 12:59:23 2014
MemTotal: 16210512 kB
MemFree: 786632 kB
MemAvailable: 10271500 kB
Buffers: 52 kB
Cached: 9564340 kB
SwapCached: 70268 kB
Active: 6847560 kB
Inactive: 5257956 kB
Active(anon): 2016412 kB
Inactive(anon): 703076 kB
Active(file): 4831148 kB
Inactive(file): 4554880 kB
Unevictable: 9068 kB
Mlocked: 9068 kB
SwapTotal: 12582908 kB
SwapFree: 12186680 kB
Dirty: 972324 kB
Writeback: 0 kB
AnonPages: 2526340 kB
Mapped: 2457096 kB
Shmem: 173564 kB
Slab: 918128 kB
SReclaimable: 848816 kB
SUnreclaim: 69312 kB
KernelStack: 11200 kB
PageTables: 64556 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 20688164 kB
Committed_AS: 7438348 kB
I am not seeing more than one GiB of dirty here during regular usage and
it is no problem.
And kworker thread CPU usage is just fine. So no, dirty_background_ratio
isn´t an issue on this 16 GiB ThinkPad T520. Please note: I have been giving
Linux performance analysis and tuning courses for about seven years now.
I *know* these knobs. I may have used the wrong terms regarding BTRFS, and my
understanding of BTRFS space allocation can probably be more accurate, but
I do think that I am onto something here. This is no rotating disk; it can handle
the write burst just fine, and I generally do not tune where there is no need
to. Here there isn´t. And it wouldn´t be much more than fine tuning anyway.
With slow devices or with rsync over NFS by all means reduce it. But here it
simply isn´t an issue as you can see with the low kworker thread CPU usage
and the general SSD usage above.
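For reference, the knobs in question can be read like this (plain procfs reads; lowering them, as one might for slow devices, needs root and is shown only as a comment):

```shell
# Show the current writeback thresholds (percent of RAM by default).
for knob in dirty_background_ratio dirty_ratio; do
    printf '%s = %s\n' "$knob" "$(cat /proc/sys/vm/$knob)"
done
# For slow devices (e.g. rsync over NFS), as root one might lower them:
#   sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=10
```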
So defragmentation completed just nice, no issue so far.
But I am close to full device space reservation already:
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:02:40 CET 2014
Label: 'home' uuid: [some UUID]
Total devices 2 FS bytes used 151.58GiB
devid 1 size 160.00GiB used 158.01GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 158.01GiB path /dev/mapper/sata-home
I thought I could trigger it again by defragmenting in Windows XP, but
mind you, it is defragmented already, so that doesn´t do much. I did the
sdelete dance just to trigger something, and while I saw kworker CPU usage
a bit higher, it was not by much.
But finally I got to:
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home' uuid: [some UUID]
Total devices 2 FS bytes used 152.83GiB
devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
So I thought: if Virtualbox can write randomly into a file, so can I.
So I did:
martin@merkaba:~> cat ssd-test.fio
[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file
[seq-write]
rw=write
stonewall
[rand-write]
rw=randwrite
stonewall
And got:
ATOP - merkaba 2014/12/27 13:41:02 ----------- 10s elapsed
PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct |
CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% |
cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% |
cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 |
MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M |
SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
18079 - martin martin 2 9.99s 0.00s 0K 0K 0K 16K -- - R 1 100% fio
4746 - martin martin 2 0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole
3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop
1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg
10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop
while fio was just *laying out* the 4 GiB file. Yes, that's 100% system CPU
for 10 seconds while allocating a 4 GiB file on a filesystem like:
martin@merkaba:~> LANG=C df -hT /home
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
where a 4 GiB file should easily fit, no? (And this output is with the 4
GiB file. So it was even 4 GiB more free before.)
But it gets even more visible:
martin@merkaba:~> fio ssd-test.fio
seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 2 processes
Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
Yes, that's 0 IOPS.
0 IOPS, as in zero IOPS. For minutes.
And here is why:
ATOP - merkaba 2014/12/27 13:46:52 ----------- 10s elapsed
PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct |
CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% |
cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% |
cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% |
CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 |
MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms |
LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms |
LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms |
DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms |
NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 |
NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 |
NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
14973 - root root 1 8.92s 0.00s 0K 0K 0K 144K -- - S 0 89% kworker/u8:14
17450 - root root 1 0.86s 0.00s 0K 0K 0K 32K -- - R 3 9% kworker/u8:5
788 - root root 1 0.25s 0.00s 0K 0K 128K 18880K -- - S 3 3% btrfs-transact
12254 - root root 1 0.14s 0.00s 0K 0K 64K 576K -- - S 2 1% kworker/u8:3
17332 - root root 1 0.11s 0.00s 0K 0K 112K 1348K -- - S 2 1% kworker/u8:4
3291 - martin martin 4 0.01s 0.09s 0K 0K 0K 0K -- - S 1 1% plasma-deskto
ATOP - merkaba 2014/12/27 13:47:12 ----------- 10s elapsed
PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct |
CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% |
cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 |
MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms |
LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms |
LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms |
LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5
4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole
10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18
1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg
17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4
17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6
17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7
3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin
10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop
So BTRFS is basically busy with itself and nothing else. Look at the SSD
usage: the devices are *idling* around. Heck, 2400 write accesses in 10
seconds. That's a joke for SSDs that can do 40000 IOPS (depending on how
and what you measure, of course: request size, read vs. write, iodepth
and so on).
It's kworker/u8:5 utilizing 100% of one core for minutes.
It's the random write case, it seems. Here are the values from the fio job:
martin@merkaba:~> fio ssd-test.fio
seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 2 processes
Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
clat percentiles (usec):
| 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
| 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
| 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9],
| 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128],
| 99.99th=[10304]
bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Seems fine.
But:
rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
clat percentiles (usec):
| 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
| 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6],
| 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10],
| 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116],
| 99.99th=[16711680]
bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
Run status group 1 (all jobs):
WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
What?
Ey, *what*?
Repeating with the random write case.
It's a different kworker now, but a similar result:
ATOP - merkaba 2014/12/27 13:51:48 ----------- 10s elapsed
PRC | sys 10.66s | user 0.25s | #proc 330 | #trun 2 | #tslpi 545 | #tslpu 2 | #zombie 0 | no procacct |
CPU | sys 105% | user 3% | irq 0% | idle 292% | wait 0% | guest 0% | curf 3.07GHz | curscal 95% |
cpu | sys 92% | user 0% | irq 0% | idle 8% | cpu002 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
cpu | sys 8% | user 0% | irq 0% | idle 92% | cpu003 w 0% | guest 0% | curf 3.09GHz | curscal 96% |
cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 2% | user 1% | irq 0% | idle 97% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
CPL | avg1 1.00 | avg5 1.32 | avg15 1.23 | | csw 34484 | intr 23182 | | numcpu 4 |
MEM | tot 15.5G | free 5.4G | cache 8.3G | buff 0.0M | slab 334.8M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
LVM | sata-home | busy 1% | read 36 | write 2502 | KiB/w 4 | MBr/s 0.01 | MBw/s 0.98 | avio 0.06 ms |
LVM | msata-home | busy 1% | read 48 | write 2502 | KiB/w 4 | MBr/s 0.02 | MBw/s 0.98 | avio 0.04 ms |
LVM | msata-debian | busy 0% | read 0 | write 6 | KiB/w 7 | MBr/s 0.00 | MBw/s 0.00 | avio 1.33 ms |
LVM | sata-debian | busy 0% | read 0 | write 6 | KiB/w 7 | MBr/s 0.00 | MBw/s 0.00 | avio 0.17 ms |
DSK | sda | busy 1% | read 36 | write 2494 | KiB/w 4 | MBr/s 0.01 | MBw/s 0.98 | avio 0.06 ms |
DSK | sdb | busy 1% | read 48 | write 2494 | KiB/w 4 | MBr/s 0.02 | MBw/s 0.98 | avio 0.04 ms |
NET | transport | tcpi 32 | tcpo 30 | udpi 2 | udpo 2 | tcpao 2 | tcppo 1 | tcprs 0 |
NET | network | ipi 35 | ipo 32 | ipfrw 0 | deliv 35 | | icmpi 0 | icmpo 0 |
NET | eth0 0% | pcki 19 | pcko 16 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
11746 - root root 1 10.00s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:0
12254 - root root 1 0.16s 0.00s 0K 0K 112K 1712K -- - S 3 2% kworker/u8:3
17517 - root root 1 0.16s 0.00s 0K 0K 144K 1764K -- - S 1 2% kworker/u8:8
And now the graphical environment is locked. Continuing on TTY1.
Doing another fio job with tee so I can capture the output easily.
Wow! I wonder whether this is reproducible on a fresh BTRFS with fio
stressing it. Like a 10 GiB BTRFS with a 5 GiB fio test file, just letting
it run.
Okay, I let the final fio job complete and include the output here.
Okay, and there we are: I do have sysrq-t figures.
This is 1.2 MiB xz packed, so I had better start a bug report about this
and attach it there. I dislike cloud URLs that may disappear at some point.
Now please finally acknowledge that there is an issue. Maybe I was not
using the correct terms at the beginning, but there is a real issue. I have
been doing performance work for half a decade at least; I know an issue
when I see one.
There we go:
Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401
Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 13:16 ` Martin Steigerwald
2014-12-27 13:49 ` Robert White
@ 2014-12-27 14:00 ` Robert White
2014-12-27 14:14 ` Martin Steigerwald
2014-12-27 14:19 ` Robert White
1 sibling, 2 replies; 59+ messages in thread
From: Robert White @ 2014-12-27 14:00 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Hugo Mills, linux-btrfs
On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> It can easily be reproduced without even using Virtualbox, just by a nice
> simple fio job.
>
TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
with one single file...
#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250;counter>0;counter--)); do
dd if=/dev/urandom of=/some/file bs=4k count=$counter
done
exit
Each pass over /some/file is 4k shorter than the previous one, but none
of the extents can be deallocated. The file will be 1MiB in size and usage
will be something like 122.6MiB (31375 4K blocks, if I've done the math
correctly). Larger values of counter will result in quadratically larger
amounts of waste.
Doing the bad things is very bad...
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 13:49 ` Robert White
@ 2014-12-27 14:06 ` Martin Steigerwald
0 siblings, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 14:06 UTC (permalink / raw)
To: Robert White; +Cc: Hugo Mills, linux-btrfs
Am Samstag, 27. Dezember 2014, 05:49:48 schrieb Robert White:
> > Anyway, I got it reproduced. And am about to write a lengthy mail about.
>
> Have fun with that lengthy email, but the devs already know about the
> data waste profile of the system. They just don't have a good solution yet.
>
> Practical use cases involving _not_ defragging and _not_ packing files,
> or disabling COW and using raw image formats for VM disk storage are,
> meanwhile, also well understood.
Okay, then how about a database?
BTRFS is not usable for these kinds of workloads then.
And thats about it.
Not even on SSD.
Yet, what I have shown in my lengthy mail is pathological.
Its even abysmal.
And yet it only happens when BTRFS is forced to pack things into *existing*
chunks. It does not happen when BTRFS can still reserve new chunks and write
to them.
And this makes all the talk that you should not need to rebalance obsolete
when in practice you need to to get decent performance. To get out of your
SSDs what your SSDs can provide instead of waiting for BTRFS to finish being
busy with itself.
Still, I have so far only reproduced it on this /home filesystem. If it is also
reproducible on a freshly created filesystem after some runs of the fio job I
provided, I´d say that there is a performance bug in BTRFS. And that's it.
No talking about technicalities may turn this performance bug observation away.
Heck 254 IOPS from a Dual SSD RAID 1? Are you even kidding me?
I refuse to believe that this is built into the design, no matter how much you
outline its limitations.
And if it is?
Well… then maybe BTRFS won´t save us. Unless you give it a ton of extra free
space: if you use 25 GB, do as I recommend and make it 100 GB big, so it will
always find enough space to waste.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 14:00 ` Robert White
@ 2014-12-27 14:14 ` Martin Steigerwald
2014-12-27 14:21 ` Martin Steigerwald
2014-12-27 14:19 ` Robert White
1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 14:14 UTC (permalink / raw)
To: Robert White; +Cc: Hugo Mills, linux-btrfs
Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White:
> On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> > It can easily be reproduced without even using Virtualbox, just by a nice
> > simple fio job.
>
> TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
> with one single file...
>
> #!/bin/bash
> # not tested, so correct any syntax errors
> typeset -i counter
> for ((counter=250;counter>0;counter--)); do
> dd if=/dev/urandom of=/some/file bs=4k count=$counter
> done
> exit
>
>
> Each pass over /some/file is 4k shorter than the previous one, but none
> of the extents can be deallocated. The file will be 1MiB in size and usage
> will be something like 122.6MiB (if I've done the math correctly).
> Larger values of counter will result in quadratically larger amounts of
> waste.
Robert, I experienced these hang issues even before the defragmenting case. It
happened while just installing a 400 MiB tax returns application into it (that
is no joke, it is that big).
It happens while just using the VM.
Yes, I recommend not to use BTRFS for any VM image or any larger database on
rotating storage for exactly that COW semantics.
But on SSD?
It's busy-looping a CPU core while the flash is basically idling.
I refuse to believe that this is by design.
I do think there is a *bug*.
Either acknowledge it and try to fix it, or say it's by design *without even
looking at it closely enough to be sure that it is not a bug* and limit your
own possibilities by it.
I´d rather see it treated as a bug for now.
Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
randomly writing to a 4 GiB file.
People do these kinds of things. Ditch the defrag-a-Windows-XP-VM case; I had
performance issues even before that, just by installing things into the VM. Databases, VMs,
emulators. And heck, even while just *creating* the file with fio, as I showed.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 14:00 ` Robert White
2014-12-27 14:14 ` Martin Steigerwald
@ 2014-12-27 14:19 ` Robert White
1 sibling, 0 replies; 59+ messages in thread
From: Robert White @ 2014-12-27 14:19 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Hugo Mills, linux-btrfs
On 12/27/2014 06:00 AM, Robert White wrote:
> On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
>> It can easily be reproduced without even using Virtualbox, just by a nice
>> simple fio job.
>>
>
> TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
> with one single file...
>
> #!/bin/bash
> # not tested, so correct any syntax errors
> typeset -i counter
> for ((counter=250;counter>0;counter--)); do
> dd if=/dev/urandom of=/some/file bs=4k count=$counter
> done
> exit 0
Slight correction: you need to prevent the truncate dd performs by
default, and flush the data and metadata to disk after each
invocation. So you need the "conv=" flags.
for ((counter=250;counter>0;counter--)); do
dd if=/dev/urandom of=some_file conv=notrunc,fsync bs=4k count=$counter
done
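The write volume this shrinking loop generates is easy to bound (an arithmetic sketch; pass c writes c blocks of 4 KiB, so the total is quadratic in the starting counter):

```shell
# Total data written by the loop above: sum over c = 250..1 of c * 4 KiB,
# i.e. 250*251/2 blocks of 4 KiB.
total_kib=$(( 250 * 251 / 2 * 4 ))
echo "${total_kib} KiB total written"
```

That is roughly the "125.5" figure quoted for worst-case usage if none of the overwritten extents can be freed; larger starting counters grow this quadratically rather than exponentially.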
>
>
> Each pass over /some/file is 4k shorter than the previous one, but none
> of the extents can be deallocated. File will be 1MiB in size and usage
> will be something like 125.5MiB (if I've done the math correctly).
> larger values of counter will result in exponentially larger amounts of
> waste.
>
> Doing the bad things is very bad...
> --
>
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 14:14 ` Martin Steigerwald
@ 2014-12-27 14:21 ` Martin Steigerwald
2014-12-27 15:14 ` Robert White
0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 14:21 UTC (permalink / raw)
To: Robert White; +Cc: Hugo Mills, linux-btrfs
On Saturday, 27 December 2014, at 15:14:05, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White:
> > On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> > > It can easily be reproduced without even using Virtualbox, just by a
> > > nice
> > > simple fio job.
> >
> > TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
> > with one single file...
> >
> > #!/bin/bash
> > # not tested, so correct any syntax errors
> > typeset -i counter
> > for ((counter=250;counter>0;counter--)); do
> >
> > dd if=/dev/urandom of=/some/file bs=4k count=$counter
> >
> > done
> > exit
> >
> >
> > Each pass over /some/file is 4k shorter than the previous one, but none
> > of the extents can be deallocated. File will be 1MiB in size and usage
> > will be something like 125.5MiB (if I've done the math correctly).
> > larger values of counter will result in exponentially larger amounts of
> > waste.
>
> Robert, I experienced this hang issues even before the defragmenting case.
> It happened while just installed a 400 MiB tax returns application to it
> (that is no joke, it is that big).
>
> It happens while just using the VM.
>
> Yes, I recommend not to use BTRFS for any VM image or any larger database on
> rotating storage for exactly that COW semantics.
>
> But on SSD?
>
> Its busy looping a CPU core and while the flash is basically idling.
>
> I refuse to believe that this is by design.
>
> I do think there is a *bug*.
>
> Either acknowledge it and try to fix it, or say its by design *without even
> looking at it closely enough to be sure that it is not a bug* and limit your
> own possibilities by it.
>
> I´d rather see it treated as a bug for now.
>
> Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
> randomly writing to a 4 GiB file.
>
> People do these kind of things. Ditch that defrag Windows XP VM case, I had
> performance issue even before by just installing things to it. Databases,
> VMs, emulators. And heck even while just *creating* the file with fio as I
> shown.
Add to these use cases things like this:
martin@merkaba:~/.local/share/akonadi/db_data/akonadi> ls -lSh | head -5
insgesamt 2,2G
-rw-rw---- 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
-rw-rw---- 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
-rw-rw---- 1 martin martin 23M Dez 27 15:17 pimitemflagrelation.ibd
-rw-rw---- 1 martin martin 240K Dez 27 15:17 collectiontable.ibd
Or this:
martin@merkaba:~/.local/share/baloo> du -sch * | sort -rh
9,2G insgesamt
8,0G email
1,2G file
51M emailContacts
408K contacts
76K notes
16K calendars
martin@merkaba:~/.local/share/baloo> ls -lSh email | head -5
insgesamt 8,0G
-rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
-rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
-rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
-rw-r--r-- 1 martin martin 63K Dez 27 15:16 postlist.baseA
These will not be as bad as the fio test case, but these files are still
written into. They are updated in place.
And that's running on every Plasma desktop by default. And on GNOME desktops
there is similar stuff.
I haven't seen this spike out a kworker yet though, so maybe the workload is
light enough not to trigger it that easily.
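A mitigation commonly suggested for exactly such in-place-rewrite directories is the btrfs NOCOW file attribute. This is an aside the thread has not discussed, and the paths below are examples from the listings above:

```shell
# NOCOW (chattr +C) disables copy-on-write for files created in the
# directory afterwards; it also disables checksumming and compression
# for those files. Existing files keep their old behavior.
chattr +C ~/.local/share/akonadi/db_data
lsattr -d ~/.local/share/akonadi/db_data   # the 'C' flag should appear
```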
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 13:55 ` Martin Steigerwald
@ 2014-12-27 14:54 ` Robert White
2014-12-27 16:26 ` Hugo Mills
2014-12-28 13:00 ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
1 sibling, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-27 14:54 UTC (permalink / raw)
To: Martin Steigerwald, Hugo Mills; +Cc: linux-btrfs
On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
> Summarized at
>
> Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> see below. This is reproducible with fio, no need for Windows XP in
> Virtualbox to reproduce the issue. Next I will try to reproduce with
> a freshly created filesystem.
>
>
> Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
>> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
>>> Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
>>>> On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
>>>>> Hello!
>>>>>
>>>>> First: Have a merry christmas and enjoy a quiet time in these days.
>>>>>
>>>>> Second: At a time you feel like it, here is a little rant, but also a
>>>>> bug
>>>>> report:
>>>>>
>>>>> I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
>>>>> space_cache, skinny meta data extents – are these a problem? – and
>>>>
>>>>> compress=lzo:
>>>> (there is no known problem with skinny metadata, it's actually more
>>>> efficient than the older format. There has been some anecdotes about
>>>> mixing the skinny and fat metadata but nothing has ever been
>>>> demonstrated problematic.)
>>>>
>>>>> merkaba:~> btrfs fi sh /home
>>>>> Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
>>>>>
>>>>> Total devices 2 FS bytes used 144.41GiB
>>>>> devid 1 size 160.00GiB used 160.00GiB path
>>>>> /dev/mapper/msata-home
>>>>> devid 2 size 160.00GiB used 160.00GiB path
>>>>> /dev/mapper/sata-home
>>>>>
>>>>> Btrfs v3.17
>>>>> merkaba:~> btrfs fi df /home
>>>>> Data, RAID1: total=154.97GiB, used=141.12GiB
>>>>> System, RAID1: total=32.00MiB, used=48.00KiB
>>>>> Metadata, RAID1: total=5.00GiB, used=3.29GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>
>>>> This filesystem, at the allocation level, is "very full" (see below).
>>>>
>>>>> And I had hangs with BTRFS again. This time as I wanted to install tax
>>>>> return software in Virtualbox´d Windows XP VM (which I use once a year
>>>>> cause I know no tax return software for Linux which would be suitable
>>>>> for
>>>>> Germany and I frankly don´t care about the end of security cause all
>>>>> surfing and other network access I will do from the Linux box and I
>>>>> only
>>>>> run the VM behind a firewall).
>>>>
>>>>> And thus I try the balance dance again:
>>>> ITEM: Balance... it doesn't do what you think it does... 8-)
>>>>
>>>> "Balancing" is something you should almost never need to do. It is only
>>>> for cases of changing geometry (adding disks, switching RAID levels,
>>>> etc.) of for cases when you've radically changed allocation behaviors
>>>> (like you decided to remove all your VM's or you've decided to remove a
>>>> mail spool directory full of thousands of tiny files).
>>>>
>>>> People run balance all the time because they think they should. They are
>>>> _usually_ incorrect in that belief.
>>>
>>> I only see the lockups of BTRFS is the trees *occupy* all space on the
>>> device.
>> No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
>> space. What's more, balance does *not* balance the metadata trees. The
>> remaining space -- 154.97 GiB -- is unstructured storage for file
>> data, and you have some 13 GiB of that available for use.
>>
>> Now, since you're seeing lockups when the space on your disks is
>> all allocated I'd say that's a bug. However, you're the *only* person
>> who's reported this as a regular occurrence. Does this happen with all
>> filesystems you have, or just this one?
>>
>>> I *never* so far saw it lockup if there is still space BTRFS can allocate
>>> from to *extend* a tree.
>>
>> It's not a tree. It's simply space allocation. It's not even space
>> *usage* you're talking about here -- it's just allocation (i.e. the FS
>> saying "I'm going to use this piece of disk for this purpose").
>>
>>> This may be a bug, but this is what I see.
>>>
>>> And no amount of "you should not balance a BTRFS" will make that
>>> perception go away.
>>>
>>> See, I see the sun coming out on a morning and you tell me "no, it
>>> doesn´t". Simply that is not going to match my perception.
>>
>> Duncan's assertion is correct in its detail. Looking at your space
>
> Robert's :)
>
>> usage, I would not suggest that running a balance is something you
>> need to do. Now, since you have these lockups that seem quite
>> repeatable, there's probably a lurking bug in there, but hacking
>> around with balance every time you hit it isn't going to get the
>> problem solved properly.
>>
>> I think I would suggest the following:
>>
>> - make sure you have some way of logging your dmesg permanently (use
>> a different filesystem for /var/log, or a serial console, or a
>> netconsole)
>>
>> - when the lockup happens, hit Alt-SysRq-t a few times
>>
>> - send the dmesg output here, or post to bugzilla.kernel.org
>>
>> That's probably going to give enough information to the developers
>> to work out where the lockup is happening, and is clearly the way
>> forward here.
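Hugo's logging suggestions can be sketched concretely (a configuration sketch; the IP addresses, MAC, and interface name are placeholders):

```shell
# Make sure SysRq is enabled so Alt-SysRq-t works during the hang.
sysctl kernel.sysrq=1
# netconsole syntax: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
modprobe netconsole netconsole=6665@192.168.0.2/eth0,6666@192.168.0.1/00:11:22:33:44:55
# During the lockup, the command-line equivalent of Alt-SysRq-t:
echo t > /proc/sysrq-trigger
```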
>
> And I got it reproduced. *Perfectly* reproduced, I´d say.
>
> But let me run the whole story:
>
> 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
>
> Which gave me:
>
> merkaba:~> btrfs fi sh /home
> Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> Total devices 2 FS bytes used 144.19GiB
> devid 1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home
> devid 2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=144.98GiB, used=140.95GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.24GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
> 2) I run the Virtualbox machine again and defragmented the NTFS filesystem
> in the VDI image file. And: It worked *just* fine. Fine as in *fine*. No issues
> whatsoever.
>
>
> I got this during the run:
>
> ATOP - merkaba 2014/12/27 12:58:42 ----------- 10s elapsed
> PRC | sys 10.41s | user 1.08s | #proc 357 | #trun 4 | #tslpi 694 | #tslpu 0 | #zombie 0 | no procacct |
> CPU | sys 107% | user 11% | irq 0% | idle 259% | wait 23% | guest 0% | curf 3.01GHz | curscal 93% |
> cpu | sys 29% | user 3% | irq 0% | idle 63% | cpu002 w 5% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 27% | user 3% | irq 0% | idle 65% | cpu000 w 5% | guest 0% | curf 3.03GHz | curscal 94% |
> cpu | sys 26% | user 3% | irq 0% | idle 63% | cpu003 w 8% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 24% | user 2% | irq 0% | idle 68% | cpu001 w 6% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 1.92 | avg5 1.01 | avg15 0.56 | | csw 501619 | intr 129279 | | numcpu 4 |
> MEM | tot 15.5G | free 610.1M | cache 9.1G | buff 0.1M | slab 1.0G | shmem 183.5M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.6G | | | | | vmcom 7.1G | vmlim 19.7G |
> PAG | scan 219141 | steal 215577 | stall 936 | | | | swin 0 | swout 940 |
> LVM | sata-home | busy 53% | read 181413 | write 0 | KiB/w 0 | MBr/s 70.86 | MBw/s 0.00 | avio 0.03 ms |
> LVM | sata-swap | busy 2% | read 0 | write 940 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.37 | avio 0.17 ms |
> LVM | sata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 1.00 ms |
> LVM | msata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms |
> DSK | sda | busy 53% | read 181413 | write 477 | KiB/w 7 | MBr/s 70.86 | MBw/s 0.37 | avio 0.03 ms |
> DSK | sdb | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms |
> NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> 9650 - martin martin 22 7.89s 0.65s 0K 128K 705.5M 382.1M -- - S 2 87% VirtualBox
> 9911 - root root 1 0.69s 0.01s 0K 0K 0K 0K -- - S 3 7% watch
> 9598 - root root 1 0.38s 0.00s 0K 0K 0K 20K -- - S 0 4% kworker/u8:9
> 9892 - root root 1 0.36s 0.00s 0K 0K 0K 0K -- - S 1 4% kworker/u8:17
> 9428 - root root 1 0.30s 0.00s 0K 0K 0K 0K -- - R 0 3% kworker/u8:3
> 9589 - root root 1 0.23s 0.00s 0K 0K 0K 0K -- - S 1 2% kworker/u8:6
> 4746 - martin martin 2 0.04s 0.13s 0K -16K 0K 0K -- - R 2 2% konsole
>
>
>
> Every 1,0s: cat /proc/meminfo Sat Dec 27 12:59:23 2014
>
> MemTotal: 16210512 kB
> MemFree: 786632 kB
> MemAvailable: 10271500 kB
> Buffers: 52 kB
> Cached: 9564340 kB
> SwapCached: 70268 kB
> Active: 6847560 kB
> Inactive: 5257956 kB
> Active(anon): 2016412 kB
> Inactive(anon): 703076 kB
> Active(file): 4831148 kB
> Inactive(file): 4554880 kB
> Unevictable: 9068 kB
> Mlocked: 9068 kB
> SwapTotal: 12582908 kB
> SwapFree: 12186680 kB
> Dirty: 972324 kB
> Writeback: 0 kB
> AnonPages: 2526340 kB
> Mapped: 2457096 kB
> Shmem: 173564 kB
> Slab: 918128 kB
> SReclaimable: 848816 kB
> SUnreclaim: 69312 kB
> KernelStack: 11200 kB
> PageTables: 64556 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 20688164 kB
> Committed_AS: 7438348 kB
>
>
>
> I am not seeing more than one GiB of dirty here during regular usage and
> it is no problem.
>
> And kworker thread CPU usage is just fine. So no, the dirty_background_ratio
> isn't an issue with this 16 GiB ThinkPad T520. Please note: I have been doing Linux
> performance analysis and tuning courses for about 7 years now.
>
> I *know* these knobs. I may have used wrong terms regarding BTRFS, and my
> understanding of BTRFS space allocation could probably be more accurate, but
> I do think that I am onto something here. This is no rotating disk; it can handle
> the write burst just fine, and I generally do not tune where there is no need to
> tune. Here there isn't. And it wouldn't be much more than fine tuning.
>
> With slow devices or with rsync over NFS by all means reduce it. But here it
> simply isn´t an issue as you can see with the low kworker thread CPU usage
> and the general SSD usage above.
>
>
> So defragmentation completed just nice, no issue so far.
>
> But I am close to full device space reservation already:
>
> merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> Sa 27. Dez 13:02:40 CET 2014
> Label: 'home' uuid: [some UUID]
> Total devices 2 FS bytes used 151.58GiB
> devid 1 size 160.00GiB used 158.01GiB path /dev/mapper/msata-home
> devid 2 size 160.00GiB used 158.01GiB path /dev/mapper/sata-home
>
>
>
> I thought I could trigger it again by defragmenting in Windows XP again, but
> mind you, it's defragmented already, so it doesn't do much. I did the sdelete
> dance just to trigger something, and well, I saw kworker go a bit higher, but not
> much.
>
> But finally I got to:
>
>
>
>
> So I figured: if Virtualbox can write randomly into a file, I can too.
>
> So I did:
>
>
> martin@merkaba:~> cat ssd-test.fio
> [global]
> bs=4k
> #ioengine=libaio
> #iodepth=4
> size=4g
> #direct=1
> runtime=120
> filename=ssd.test.file
>
> [seq-write]
> rw=write
> stonewall
>
> [rand-write]
> rw=randwrite
> stonewall
>
>
>
> And got:
>
> ATOP - merkaba 2014/12/27 13:41:02 ----------- 10s elapsed
> PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct |
> CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% |
> cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% |
> cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 |
> MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> 18079 - martin martin 2 9.99s 0.00s 0K 0K 0K 16K -- - R 1 100% fio
> 4746 - martin martin 2 0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole
> 3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop
> 1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg
> 10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop
>
> while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
> for 10 seconds while allocating a 4 GiB file on a filesystem like:
>
> martin@merkaba:~> LANG=C df -hT /home
> Filesystem Type Size Used Avail Use% Mounted on
> /dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
>
> where a 4 GiB file should easily fit, no? (And this output is with the 4
> GiB file. So it was even 4 GiB more free before.)
No. /usr/bin/df gives only an _approximation_ on BTRFS because of the limits
of the statfs() system call. statfs() was defined around 1990
and "can't understand" the dynamic allocation model used in BTRFS, as it
assumes a fixed geometry for filesystems. You do _not_ have 17G actually
available. You need to rely on btrfs fi df and btrfs fi show to figure
out how much space you _really_ have.
According to this block you have a RAID1 of ~ 160GB expanse (two 160G disks)
> merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> Sa 27. Dez 13:26:39 CET 2014
> Label: 'home' uuid: [some UUID]
> Total devices 2 FS bytes used 152.83GiB
> devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
And according to this block you have about 5.39GiB of free data space:
> Btrfs v3.17
> Data, RAID1: total=154.97GiB, used=149.58GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.26GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
  154.97 (Data)
+   5.00 (Metadata)
+   0.032 (System)
+   0.512 (GlobalReserve)
= 160.51 GiB
Pretty much as close to 160GiB as you are going to get (those numbers
being rounded in places for "human readability"). BTRFS has allocated
100% of the raw storage into typed extents.
A large data file can only fit in the 154.97 - 149.58 = 5.39 GiB of
free data space. Trying to fit that 4GiB file into that 5.39GiB of space becomes
an NP-complete (i.e. "very hard") problem if the free space is heavily fragmented.
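Both quantities can be recomputed directly from the quoted btrfs fi df numbers (a trivial arithmetic check):

```shell
# Per-device allocation: Data + Metadata + System + GlobalReserve (GiB).
awk 'BEGIN { printf "allocated: %.3f GiB\n", 154.97 + 5.00 + 0.032 + 0.512 }'
# Slack still available inside the Data allocation (GiB).
awk 'BEGIN { printf "data free: %.2f GiB\n", 154.97 - 149.58 }'
```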
I also don't know what kind of tool you are using, but it might be
repeatedly trying and failing to fallocate the file as a single extent,
or something equally dumb.
If the tool that takes those .fio files "isn't smart" about transient
allocation failures, it might be trying the same allocation again, and
again, and again, and again... forever... which is not a problem with
BTRFS, but which _would_ lead to runaway CPU usage with no actual disk
activity.
So try again with more normal tools and see if you can allocate 4GiB:
dd if=/dev/urandom of=file bs=1M count=4096
Does that create a four-gig file? It probably works fine.
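A scaled-down version of that suggestion (1 MiB rather than 4 GiB, with a hypothetical /tmp path, so it runs anywhere) exercises the same plain sequential-allocation path:

```shell
# Write 1 MiB via buffered sequential dd, then confirm the resulting size.
dd if=/dev/zero of=/tmp/alloc.test bs=1M count=1 2>/dev/null
stat -c %s /tmp/alloc.test   # prints 1048576
```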
You need to isolate not "overall CPU usage" but _what_ program is doing
what and why. So strace your fio program, or whatever it is, to see what
function call(s) it is making and what is being returned.
But seriously, dude, if the dd works and the fio doesn't, then that's a
problem with fio.
(I've got _zero_ idea what fio is, but if it does "testing" and
repeatedly writes random bits of the file, then since you've only got 5.39G
of space it's likely going to have a lot of problems doing _anything_
"intensive" to a COW file of 4G.)
So yes, that simultaneous write/rewrite test is going to fail. You don't
have enough room to permute that file.
None of the results below "surprise me", given that you _don't_ have
enough room to do the tests you (seem to have) initiated on a COW file.
The minimum space likely needed is just under 8GiB. The maximum could be
much, much larger.
>
>
> But it gets even more visible:
>
> martin@merkaba:~> fio ssd-test.fio
> seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.11
> Starting 2 processes
> Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
>
>
> yes, thats 0 IOPS.
>
> 0 IOPS and in zero IOPS. For minutes.
>
>
>
> And here is why:
>
> ATOP - merkaba 2014/12/27 13:46:52 ----------- 10s elapsed
> PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct |
> CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% |
> cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% |
> cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% |
> CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 |
> MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms |
> LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms |
> LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms |
> DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms |
> NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 |
> NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 |
> NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> 14973 - root root 1 8.92s 0.00s 0K 0K 0K 144K -- - S 0 89% kworker/u8:14
> 17450 - root root 1 0.86s 0.00s 0K 0K 0K 32K -- - R 3 9% kworker/u8:5
> 788 - root root 1 0.25s 0.00s 0K 0K 128K 18880K -- - S 3 3% btrfs-transact
> 12254 - root root 1 0.14s 0.00s 0K 0K 64K 576K -- - S 2 1% kworker/u8:3
> 17332 - root root 1 0.11s 0.00s 0K 0K 112K 1348K -- - S 2 1% kworker/u8:4
> 3291 - martin martin 4 0.01s 0.09s 0K 0K 0K 0K -- - S 1 1% plasma-deskto
>
>
>
>
> ATOP - merkaba 2014/12/27 13:47:12 ----------- 10s elapsed
> PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct |
> CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% |
> cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 |
> MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms |
> LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms |
> LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms |
> LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
> DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
> DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
> NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
> NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
> NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> 17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5
> 4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole
> 10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18
> 1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg
> 17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4
> 17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6
> 17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7
> 3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin
> 10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop
>
>
>
> So BTRFS is basically busy with itself and nothing else. Look at the SSD
> usage. They are *idling* around. Heck, 2400 write accesses in 10 seconds.
> That's a joke for SSDs that can do 40000 IOPS (depending on how and what
> you measure, of course: request size, read vs. write, iodepth and so on).
>
> It's kworker/u8:5 utilizing 100% of one core for minutes.
>
>
>
> It's the random write case, it seems. Here are values from the fio job:
>
> martin@merkaba:~> fio ssd-test.fio
> seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.11
> Starting 2 processes
> Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> clat percentiles (usec):
> | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
> | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
> | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9],
> | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128],
> | 99.99th=[10304]
> bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Seems fine.
>
>
> But:
>
> rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> clat percentiles (usec):
> | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
> | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6],
> | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10],
> | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116],
> | 99.99th=[16711680]
> bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
>
> Run status group 1 (all jobs):
> WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
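fio's summary above is internally consistent with its own raw counters (io=140336KB, bs=4k, runt=137803msec), as a quick check shows:

```shell
# 140336 KiB written as 4 KiB requests over 137.803 s:
awk 'BEGIN { printf "%d IOPS\n", int((140336/4) / 137.803) }'
awk 'BEGIN { printf "%.0f KB/s\n", 140336 / 137.803 }'
```

So the 254 IOPS and ~1018 KB/s figures are fio's measurement, not a reporting glitch.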
>
>
> What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
>
> What?
>
> Ey, *what*?
>
>
>
> Repeating with the random write case.
>
> It's a different kworker now, but a similar result:
>
> ATOP - merkaba 2014/12/27 13:51:48 ----------- 10s elapsed
> PRC | sys 10.66s | user 0.25s | #proc 330 | #trun 2 | #tslpi 545 | #tslpu 2 | #zombie 0 | no procacct |
> CPU | sys 105% | user 3% | irq 0% | idle 292% | wait 0% | guest 0% | curf 3.07GHz | curscal 95% |
> cpu | sys 92% | user 0% | irq 0% | idle 8% | cpu002 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> cpu | sys 8% | user 0% | irq 0% | idle 92% | cpu003 w 0% | guest 0% | curf 3.09GHz | curscal 96% |
> cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 2% | user 1% | irq 0% | idle 97% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 1.00 | avg5 1.32 | avg15 1.23 | | csw 34484 | intr 23182 | | numcpu 4 |
> MEM | tot 15.5G | free 5.4G | cache 8.3G | buff 0.0M | slab 334.8M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | sata-home | busy 1% | read 36 | write 2502 | KiB/w 4 | MBr/s 0.01 | MBw/s 0.98 | avio 0.06 ms |
> LVM | msata-home | busy 1% | read 48 | write 2502 | KiB/w 4 | MBr/s 0.02 | MBw/s 0.98 | avio 0.04 ms |
> LVM | msata-debian | busy 0% | read 0 | write 6 | KiB/w 7 | MBr/s 0.00 | MBw/s 0.00 | avio 1.33 ms |
> LVM | sata-debian | busy 0% | read 0 | write 6 | KiB/w 7 | MBr/s 0.00 | MBw/s 0.00 | avio 0.17 ms |
> DSK | sda | busy 1% | read 36 | write 2494 | KiB/w 4 | MBr/s 0.01 | MBw/s 0.98 | avio 0.06 ms |
> DSK | sdb | busy 1% | read 48 | write 2494 | KiB/w 4 | MBr/s 0.02 | MBw/s 0.98 | avio 0.04 ms |
> NET | transport | tcpi 32 | tcpo 30 | udpi 2 | udpo 2 | tcpao 2 | tcppo 1 | tcprs 0 |
> NET | network | ipi 35 | ipo 32 | ipfrw 0 | deliv 35 | | icmpi 0 | icmpo 0 |
> NET | eth0 0% | pcki 19 | pcko 16 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> 11746 - root root 1 10.00s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:0
> 12254 - root root 1 0.16s 0.00s 0K 0K 112K 1712K -- - S 3 2% kworker/u8:3
> 17517 - root root 1 0.16s 0.00s 0K 0K 144K 1764K -- - S 1 2% kworker/u8:8
>
>
>
> And now the graphical environment is locked. Continuing on TTY1.
>
> Doing another fio job with tee so I can get output easily.
>
> Wow! I wonder whether this is reproducible with a fresh BTRFS with fio stressing it.
>
> Like a 10 GiB BTRFS with 5 GiB fio test file and just letting it run.
>
>
> Okay, I let the final fio job complete and include the output here.
>
>
> Okay, and there we are, and I do have the sysrq-t output.
>
> Okay, this is 1.2 MiB xz-packed. So I had better start a bug report about this
> and attach it there. I dislike cloud URLs that may disappear at some point.
>
>
>
> Now please finally acknowledge that there is an issue. Maybe I was not
> using the correct terms at the beginning, but there is a real issue. I have
> been doing performance work for at least half a decade; I know an issue
> when I see one.
>
>
>
>
> There we go:
>
> Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> Thanks,
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 14:21 ` Martin Steigerwald
@ 2014-12-27 15:14 ` Robert White
2014-12-27 16:01 ` Martin Steigerwald
2014-12-27 16:10 ` Martin Steigerwald
0 siblings, 2 replies; 59+ messages in thread
From: Robert White @ 2014-12-27 15:14 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Hugo Mills, linux-btrfs
On 12/27/2014 06:21 AM, Martin Steigerwald wrote:
> On Saturday, 27 December 2014, 15:14:05, Martin Steigerwald wrote:
>> On Saturday, 27 December 2014, 06:00:48, Robert White wrote:
>>> On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
>>>> It can easily be reproduced without even using Virtualbox, just by a
>>>> nice
>>>> simple fio job.
>>>
>>> TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
>>> with one single file...
>>>
>>> #!/bin/bash
>>> # not tested, so correct any syntax errors
>>> typeset -i counter
>>> for ((counter=250;counter>0;counter--)); do
>>>
>>> dd if=/dev/urandom of=/some/file bs=4k count=$counter
>>>
>>> done
>>> exit
>>>
>>>
>>> Each pass over /some/file is 4k shorter than the previous one, but none
>>> of the extents can be deallocated. File will be 1MiB in size and usage
>>> will be something like 125.5MiB (if I've done the math correctly).
> >>> larger values of counter will result in quadratically larger amounts of
> >>> waste.
>>
>> Robert, I experienced these hang issues even before the defragmenting case.
>> It happened while I merely installed a 400 MiB tax-return application to it
>> (that is no joke, it is that big).
>>
>> It happens while just using the VM.
>>
>> Yes, I recommend not to use BTRFS for any VM image or any larger database on
>> rotating storage for exactly that COW semantics.
>>
>> But on SSD?
>>
>> It's busy-looping a CPU core while the flash is basically idling.
>>
>> I refuse to believe that this is by design.
>>
>> I do think there is a *bug*.
>>
>> Either acknowledge it and try to fix it, or say it's by design *without even
>> looking at it closely enough to be sure that it is not a bug* and limit your
>> own possibilities by it.
>>
>> I'd rather see it treated as a bug for now.
>>
>> Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
>> randomly writing to a 4 GiB file.
>>
>> People do these kinds of things. Ditch that defrag Windows XP VM case; I had
>> performance issues even before, just by installing things into it. Databases,
>> VMs, emulators. And heck, even while just *creating* the file with fio, as I
>> showed.
>
> Add to these use cases things like this:
>
> martin@merkaba:~/.local/share/akonadi/db_data/akonadi> ls -lSh | head -5
> insgesamt 2,2G
> -rw-rw---- 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
> -rw-rw---- 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
> -rw-rw---- 1 martin martin 23M Dez 27 15:17 pimitemflagrelation.ibd
> -rw-rw---- 1 martin martin 240K Dez 27 15:17 collectiontable.ibd
>
>
> Or this:
>
> martin@merkaba:~/.local/share/baloo> du -sch * | sort -rh
> 9,2G insgesamt
> 8,0G email
> 1,2G file
> 51M emailContacts
> 408K contacts
> 76K notes
> 16K calendars
>
> martin@merkaba:~/.local/share/baloo> ls -lSh email | head -5
> insgesamt 8,0G
> -rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
> -rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
> -rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
> -rw-r--r-- 1 martin martin 63K Dez 27 15:16 postlist.baseA
/usr/bin/du and /usr/bin/df and /bin/ls are all _useless_ for showing
the amount of filespace used by a file in BTRFS.
Look at a nice paste of the previously described "worst case" allocation.
Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.41GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Gust rwhite # for ((counter=250;counter>0;counter--)); do dd
if=/dev/urandom of=some_file conv=notrunc,fsync bs=4k count=$counter
>/dev/null 2>&1; done
Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.48GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Gust rwhite # du some_file
1000 some_file
Gust rwhite # ls -lh some_file
-rw-rw-r--+ 1 root root 1000K Dec 27 07:00 some_file
Gust rwhite # rm some_file
Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.41GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Notice that "some_file" shows 1000 blocks in du, and 1000k bytes in ls.
But notice that data used jumps from 340.41GiB to 340.48GiB when the
file is created, then drops back down to 340.41GiB when it's deleted.
Now I have compression turned on, so the amount of growth/shrinkage
changes between each run, but it's _way_ more than 1 MiB; that's like
70MiB (give or take significant rounding in the third place after the
decimal). So I wrote this file in a way that leads to it taking up
_seventy_ _times_ its base size in actual allocated storage. Real files
do not perform this terribly, but they can get pretty ugly in some cases.
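The arithmetic behind that paste is easy to sketch. This is a simplified back-of-the-envelope model (plain Python, no btrfs involved) that assumes every overwritten extent stays pinned, which is the worst case being described here:

```python
KIB = 1024
BLOCK = 4 * KIB  # dd block size from the quoted loop

# Pass `counter` rewrites the first `counter` blocks as fresh CoW extents;
# if no old extent can be reclaimed, total allocation is the triangular sum.
allocated = sum(counter * BLOCK for counter in range(250, 0, -1))
visible_size = 250 * BLOCK  # file size set by the first (largest) pass

print(allocated // KIB)     # 125500 KiB, i.e. roughly 122.6 MiB
print(visible_size // KIB)  # 1000 KiB, matching the du/ls output above
```

That 125,500 KiB figure lines up with the "something like 125.5MiB" estimate earlier in the thread when read as KiB: about 122.6 MiB of allocation held down by a 1000 KiB file.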
You _really_ need to learn how the system works and what its best and
worst cases look like before you start shouting "bug!"
You are using the wrong numbers (e.g. "df") for available space and you
don't know how to estimate what your tools _should_ do for the
conditions observed.
But yes, if you open a file and scribble all over it when your disk is
full to within the same order of magnitude as the size of the file you
are scribbling on, you will get into a condition where the _application_
will aggressively retry the IO. Particularly if that application is a
"test program" or a virtual machine doing asynchronous IO.
That's what those sorts of systems do when they crash against a limit in
the underlying system.
So yeah... out of space plus an aggressive writer equals a spinning CPU.
Before you can assign blame you need to strace your application to see
what call it's making over and over again, and whether it's just being stupid.
> These will not be as bad as the fio test case, but still these files are
> written into. They are updated in place.
>
> And thats running on every Plasma desktop by default. And on GNOME desktops
> there is similar stuff.
>
> I haven't seen this max out a kworker yet, though, so maybe the workload is
> light enough not to trigger it that easily.
>
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 15:14 ` Robert White
@ 2014-12-27 16:01 ` Martin Steigerwald
2014-12-28 0:25 ` Robert White
2014-12-27 16:10 ` Martin Steigerwald
1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 16:01 UTC (permalink / raw)
To: Robert White; +Cc: Hugo Mills, linux-btrfs
On Saturday, 27 December 2014, 07:14:32, Robert White wrote:
> But yes, if you open a file and scribble all over it when your disk is
> full to within the same order of magnitude as the size of the file you
> are scribbling on, you will get into a condition where the _application_
> will aggressively retry the IO. Particularly if that application is a
> "test program" or a virtual machine doing asynchronous IO.
>
> That's what those sorts of systems do when they crash against a limit in
> the underlying system.
>
> So yea... out of space plus agressive writer equals spinning CPU
>
> Before you can assign blame you need to strace your application to see
> what call its making over and over again to see if its just being stupid.
Robert, I am pretty sure that fio does not retry the I/O. If the I/O returns
an error, it exits immediately.
I don't think BTRFS fails an I/O; there is nothing of that in kern.log or
dmesg. It just needs a very long time for it.
And yet, with the BTRFS *is* *full* test case I still can't reproduce the <300
IOPS case. I consistently get about 4800 IOPS, which is just about okay IMHO.
fio just does random I/O. Aggressively, yes. But it would stop on the *first*
*failed* I/O request. I am pretty sure of that.
fio is the Flexible I/O Tester. It has been written mostly by Jens Axboe,
the block layer maintainer of the Linux kernel. So I kindly ask that,
before you assume I use crap tools, you have a look at it.
From how you write, I get the impression that you think everyone else
besides you is just silly and dumb. Please stop this assumption. I may not
always get terms right, and I may make a mistake, as with the wrong df
figure. But I also highly dislike being treated like someone who doesn't
know a thing.
I made my case.
I tried to reproduce it in a test case.
Now I suggest we wait until someone has had an actual look at the sysrq-t traces
in the 25 MiB kern.log I provided in the bug report.
I will now wait for BTRFS developers to comment on this.
I think Chris and Josef and other BTRFS developers actually know what fio
is, so… either they are interested in the <300 IOPS case I cannot yet
reproduce with a fresh filesystem, or they are not.
Even when it is almost as full as it can get and the fio job *barely* completes
without a "no space left on device" error, I still get those 4800 IOPS.
I tested it and took the first run where it actually completed again, after
deleting a partial copy of the /usr/bin directory from the test filesystem.
I have shown this in my test case (see my other mail with the altered subject
line).
So for at least a *small* full filesystem, the "filesystem full" or "BTRFS has
to search aggressively for free space" case *does not* explain what I see
with my /home. Either I need a fuller filesystem for the test case,
maybe one which carries a million files or more, or one that at least
has more chunks to allocate from, or there is more to it and there is
something about my /home that makes it even worse.
So it isn't just the filesystem-full case, and the "all free space allocated
to chunks" condition also does not suffice, as my test case shows (where
BTRFS just won't allocate another data chunk, it seems).
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 15:14 ` Robert White
2014-12-27 16:01 ` Martin Steigerwald
@ 2014-12-27 16:10 ` Martin Steigerwald
1 sibling, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 16:10 UTC (permalink / raw)
To: Robert White; +Cc: Hugo Mills, linux-btrfs
On Saturday, 27 December 2014, 07:14:32, Robert White wrote:
> On 12/27/2014 06:21 AM, Martin Steigerwald wrote:
> > On Saturday, 27 December 2014, 15:14:05, Martin Steigerwald wrote:
> >> On Saturday, 27 December 2014, 06:00:48, Robert White wrote:
> >>> On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> >>>> It can easily be reproduced without even using Virtualbox, just by a
> >>>> nice
> >>>> simple fio job.
> >>>
> >>> TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
> >>> with one single file...
> >>>
> >>> #!/bin/bash
> >>> # not tested, so correct any syntax errors
> >>> typeset -i counter
> >>> for ((counter=250;counter>0;counter--)); do
> >>>
> >>> dd if=/dev/urandom of=/some/file bs=4k count=$counter
> >>>
> >>> done
> >>> exit
> >>>
> >>>
> >>> Each pass over /some/file is 4k shorter than the previous one, but none
> >>> of the extents can be deallocated. File will be 1MiB in size and usage
> >>> will be something like 125.5MiB (if I've done the math correctly).
> >>> larger values of counter will result in quadratically larger amounts of
> >>> waste.
> >>
> >> Robert, I experienced these hang issues even before the defragmenting case.
> >> It happened while I merely installed a 400 MiB tax-return application to it
> >> (that is no joke, it is that big).
> >>
> >> It happens while just using the VM.
> >>
> >> Yes, I recommend not to use BTRFS for any VM image or any larger database on
> >> rotating storage for exactly that COW semantics.
> >>
> >> But on SSD?
> >>
> >> It's busy-looping a CPU core while the flash is basically idling.
> >>
> >> I refuse to believe that this is by design.
> >>
> >> I do think there is a *bug*.
> >>
> >> Either acknowledge it and try to fix it, or say it's by design *without even
> >> looking at it closely enough to be sure that it is not a bug* and limit your
> >> own possibilities by it.
> >>
> >> I'd rather see it treated as a bug for now.
> >>
> >> Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
> >> randomly writing to a 4 GiB file.
> >>
> >> People do these kinds of things. Ditch that defrag Windows XP VM case; I had
> >> performance issues even before, just by installing things into it. Databases,
> >> VMs, emulators. And heck, even while just *creating* the file with fio, as I
> >> showed.
> >
> > Add to these use cases things like this:
> >
> > martin@merkaba:~/.local/share/akonadi/db_data/akonadi> ls -lSh | head -5
> > insgesamt 2,2G
> > -rw-rw---- 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
> > -rw-rw---- 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
> > -rw-rw---- 1 martin martin 23M Dez 27 15:17 pimitemflagrelation.ibd
> > -rw-rw---- 1 martin martin 240K Dez 27 15:17 collectiontable.ibd
> >
> >
> > Or this:
> >
> > martin@merkaba:~/.local/share/baloo> du -sch * | sort -rh
> > 9,2G insgesamt
> > 8,0G email
> > 1,2G file
> > 51M emailContacts
> > 408K contacts
> > 76K notes
> > 16K calendars
> >
> > martin@merkaba:~/.local/share/baloo> ls -lSh email | head -5
> > insgesamt 8,0G
> > -rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
> > -rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
> > -rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
> > -rw-r--r-- 1 martin martin 63K Dez 27 15:16 postlist.baseA
>
> /usr/bin/du and /usr/bin/df and /bin/ls are all _useless_ for showing
> the amount of filespace used by a file in BTRFS.
Yes.
But they are *useful* to demonstrate that there are regular desktop
applications which randomly write into huge files. And that was *exactly*
the point I was trying to make.
Yes, I didn't prove the random aspect. But heck, one is a MySQL database and
one is a Xapian index. I am fairly sure that for a desktop search and for
maildir folder indexing there is some random aspect to the workload. Do you
agree with that?
So what you call "bad" (that was exactly the point I was trying to make)
is going to happen on real systems. Maybe not as fiercely as with a fio job,
granted. And for the said /home, BTRFS worked fine; but after just
installing a 400 MiB application onto the Windows XP VM I had the hang
already, with more than 8 GiB of free space within the chunks at that
time.
If BTRFS drops below 300 IOPS on dual SSDs under disk-full conditions with
workloads like this, it will fail in real-world scenarios. And again, my
recommendation to leave far more free space than with other filesystems
still holds.
Yes, I saw XFS developer Dave Chinner recommend about 50% free space
on XFS for a crazy workload if you want the filesystem in a young
state even after 10 years. So I am fully aware that filesystems will age.
But to *this* extent? After only about six months of running this BTRFS
RAID 1, which started as a fresh single-device BTRFS that I then balanced
as RAID 1 onto the second SSD?
I still think it is a bug, especially as it just does not happen with a
simple disk-full condition, even though I spent several hours trying to
reproduce this worst case.
If it only happens with my /home, I am willing to accept that something may
be borked with it. And I haven't been able to reproduce it with a clean
filesystem yet. So maybe it doesn't happen for others. Then all is fine; I
recreate the FS and forget about it.
But before I do any of this, I will wait to see whether a developer can make
sense of the sysrq-t traces in the syslog.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 14:54 ` Robert White
@ 2014-12-27 16:26 ` Hugo Mills
2014-12-27 17:11 ` Martin Steigerwald
2014-12-28 0:06 ` Robert White
0 siblings, 2 replies; 59+ messages in thread
From: Hugo Mills @ 2014-12-27 16:26 UTC (permalink / raw)
To: Robert White; +Cc: Martin Steigerwald, linux-btrfs
On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote:
> On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
[snip]
> >while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
> >for 10 seconds while allocating a 4 GiB file on a filesystem like:
> >
> >martin@merkaba:~> LANG=C df -hT /home
> >Filesystem Type Size Used Avail Use% Mounted on
> >/dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
> >
> >where a 4 GiB file should easily fit, no? (And this output is with the 4
> >GiB file. So it was even 4 GiB more free before.)
>
> No. /usr/bin/df is an _approximation_ in BTRFS because of the limits
> of the statfs() system call. The statfs() call was defined
> in 1990 and "can't understand" the dynamic allocation model used in
> BTRFS, as it assumes a fixed geometry for filesystems. You do _not_
> have 17G actually available. You need to rely on btrfs fi df and
> btrfs fi show to figure out how much space you _really_ have.
>
> According to this block you have a RAID1 of ~ 160GB expanse (two 160G disks)
>
> > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > Sa 27. Dez 13:26:39 CET 2014
> > Label: 'home' uuid: [some UUID]
> > Total devices 2 FS bytes used 152.83GiB
> > devid 1 size 160.00GiB used 160.00GiB path
> /dev/mapper/msata-home
> > devid 2 size 160.00GiB used 160.00GiB path
> /dev/mapper/sata-home
>
> And according to this block you have about 4.49GiB of data space:
>
> > Btrfs v3.17
> > Data, RAID1: total=154.97GiB, used=149.58GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
>
> 154.97
> 5.00
> 0.032
> + 0.512
>
> Pretty much as close to 160GiB as you are going to get (those
> numbers being rounded up in places for "human readability"). BTRFS
> has allocated 100% of the raw storage into typed chunks.
>
> A large data file can only fit in the 154.97 - 149.58 = 5.39 GiB still
> unused inside the data chunks.
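As a quick cross-check of the quoted accounting (plain arithmetic only; one caveat: the GlobalReserve is carved out of the metadata allocation, so it need not be added on top):

```python
# Chunk totals in GiB from the quoted `btrfs fi df` output. With RAID1,
# each GiB of a chunk occupies one GiB on *each* 160 GiB device.
data_total, data_used = 154.97, 149.58
system_total = 32 / 1024  # 32 MiB
metadata_total = 5.00

allocated_per_device = data_total + system_total + metadata_total
free_in_data_chunks = data_total - data_used

print(f"{allocated_per_device:.2f}")  # 160.00 -> both devices fully allocated
print(f"{free_in_data_chunks:.2f}")   # 5.39 -> unused space inside data chunks
```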
I appreciate that this is something of a minor point in the grand
scheme of things, but I'm afraid I've lost the enthusiasm to engage
with the broader (somewhat rambling, possibly-at-cross-purposes)
conversation in this thread. However...
> Trying to allocate that 4GiB file into that 5.39GiB of space becomes
> an NP-complete (e.g. "very hard") problem if it is very fragmented.
This is... badly mistaken, at best. The problem of where to write a
file into a set of free extents is definitely *not* an NP-hard
problem. It's a P problem, with an O(n log n) solution, where n is the
number of free extents in the free space cache. The simple approach:
fill the first hole with as many bytes as you can, then move on to the
next hole. More complex: order the free extents by size first. Both of
these are O(n log n) algorithms, given an efficient general-purpose
index of free space.
The problem of placing file data isn't a bin-packing problem; it's
not like allocating RAM (where each allocation must be contiguous).
The items being placed may be split as much as you like, although
minimising the amount of splitting is a goal.
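The largest-hole-first strategy can be sketched as a toy allocator; the function and the extent layout below are illustrative only, not btrfs code:

```python
import heapq

def place_file(size, free_extents):
    # free_extents: list of (offset, length) holes. Because file data may
    # be split across holes, one greedy pass over a size-ordered heap is
    # enough; the heap operations make this O(n log n), not bin packing.
    heap = [(-length, offset) for offset, length in free_extents]
    heapq.heapify(heap)  # max-heap by hole length via negation

    placement, remaining = [], size
    while remaining > 0 and heap:
        neg_length, offset = heapq.heappop(heap)
        take = min(-neg_length, remaining)
        placement.append((offset, take))
        remaining -= take
    if remaining:
        raise OSError("ENOSPC: not enough free space")
    return placement

# Hypothetical fragmented free space: three holes totalling 9 MiB.
holes = [(0, 4 << 20), (10 << 20, 2 << 20), (20 << 20, 3 << 20)]
extents = place_file(8 << 20, holes)
print(extents)  # an 8 MiB file split across all three holes
```

Minimising splits is then just a matter of how the free extents are ordered; the hard part in practice is maintaining that index cheaply, not the placement decision itself.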
I suspect that the performance problems that Martin is seeing may
indeed be related to free space fragmentation, in that finding and
creating all of those tiny extents for a huge file is causing
problems. I believe that btrfs isn't alone in this, but it may well be
showing the problem to a far greater degree than other FSes. I don't
have figures to compare, I'm afraid.
> I also don't know what kind of tool you are using, but it might be
> repeatedly trying and failing to fallocate the file as a single
> extent or something equally dumb.
Userspace doesn't, as far as I know, get to make that decision. I've
just read the fallocate(2) man page, and it says nothing at all about
the contiguity of the extent(s) of storage allocated by the call.
Hugo.
[snip]
--
Hugo Mills | O tempura! O moresushi!
hugo@... carfax.org.uk |
http://carfax.org.uk/ |
PGP: 65E74AC0 |
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 16:26 ` Hugo Mills
@ 2014-12-27 17:11 ` Martin Steigerwald
2014-12-27 17:59 ` Martin Steigerwald
2014-12-28 0:06 ` Robert White
1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 17:11 UTC (permalink / raw)
To: Hugo Mills; +Cc: Robert White, linux-btrfs
On Saturday, 27 December 2014, 16:26:42, Hugo Mills wrote:
> On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote:
> > On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
> [snip]
> > >while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
> > >for 10 seconds while allocating a 4 GiB file on a filesystem like:
> > >
> > >martin@merkaba:~> LANG=C df -hT /home
> > >Filesystem Type Size Used Avail Use% Mounted on
> > >/dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
> > >
> > >where a 4 GiB file should easily fit, no? (And this output is with the 4
> > >GiB file. So it was even 4 GiB more free before.)
> >
> > No. /usr/bin/df is an _approximation_ in BTRFS because of the limits
> > of the statfs() system call. The statfs() call was defined
> > in 1990 and "can't understand" the dynamic allocation model used in
> > BTRFS, as it assumes a fixed geometry for filesystems. You do _not_
> > have 17G actually available. You need to rely on btrfs fi df and
> > btrfs fi show to figure out how much space you _really_ have.
> >
> > According to this block you have a RAID1 of ~ 160GB expanse (two 160G disks)
> >
> > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > Sa 27. Dez 13:26:39 CET 2014
> > > Label: 'home' uuid: [some UUID]
> > > Total devices 2 FS bytes used 152.83GiB
> > > devid 1 size 160.00GiB used 160.00GiB path
> > /dev/mapper/msata-home
> > > devid 2 size 160.00GiB used 160.00GiB path
> > /dev/mapper/sata-home
> >
> > And according to this block you have about 4.49GiB of data space:
> >
> > > Btrfs v3.17
> > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > 154.97
> > 5.00
> > 0.032
> > + 0.512
> >
> > Pretty much as close to 160GiB as you are going to get (those
> > numbers being rounded up in places for "human readability"). BTRFS
> > has allocated 100% of the raw storage into typed chunks.
> >
> > A large data file can only fit in the 154.97 - 149.58 = 5.39 GiB still
> > unused inside the data chunks.
>
> I appreciate that this is something of a minor point in the grand
> scheme of things, but I'm afraid I've lost the enthusiasm to engage
> with the broader (somewhat rambling, possibly-at-cross-purposes)
> conversation in this thread. However...
>
> > Trying to allocate that 4GiB file into that 5.39GiB of space becomes
> > an NP-complete (e.g. "very hard") problem if it is very fragmented.
>
> This is... badly mistaken, at best. The problem of where to write a
> file into a set of free extents is definitely *not* an NP-hard
> problem. It's a P problem, with an O(n log n) solution, where n is the
> number of free extents in the free space cache. The simple approach:
> fill the first hole with as many bytes as you can, then move on to the
> next hole. More complex: order the free extents by size first. Both of
> these are O(n log n) algorithms, given an efficient general-purpose
> index of free space.
>
> The problem of placing file data isn't a bin-packing problem; it's
> not like allocating RAM (where each allocation must be contiguous).
> The items being placed may be split as much as you like, although
> minimising the amount of splitting is a goal.
>
> I suspect that the performance problems that Martin is seeing may
> indeed be related to free space fragmentation, in that finding and
> creating all of those tiny extents for a huge file is causing
> problems. I believe that btrfs isn't alone in this, but it may well be
> showing the problem to a far greater degree than other FSes. I don't
> have figures to compare, I'm afraid.
That's what I wanted to hint at.
I suspect an issue with free space fragmentation, and from what I think I see,
btrfs balance minimizes free-space fragmentation within chunks.
And that is my whole case for why I think it helps with my /home
filesystem.
So while btrfs filesystem defragment may help with defragmenting individual
files, possibly at the cost of fragmenting free space, at least under
almost-full conditions, I think there are only three options at the moment
to help with free space fragmentation:
1) reformat and restore via rsync or btrfs send from backup (i.e. file based)
2) make the BTRFS in itself bigger
3) btrfs balance at least chunks, at least those that are not more than 70%
or 80% full.
Do you know of any other ways to deal with it?
So yes, in case it really is free space fragmentation, I do think a balance
may be helpful, even if usually one should not use a balance.
> > I also don't know what kind of tool you are using, but it might be
> > repeatedly trying and failing to fallocate the file as a single
> > extent or something equally dumb.
>
> Userspace doesn't as far as I know, get to make that decision. I've
> just read the fallocate(2) man page, and it says nothing at all about
> the contiguity of the extent(s) storage allocated by the call.
fio fallocates just once, and then writes even if the fallocate call fails.
It was nice to see that at some point BTRFS returned "out of space" on the
fallocate but was still able to write the 4 GiB of random data. I bet
the latter was due to compression: while BTRFS could not guarantee
that the 4 GiB would fit in all cases, i.e. even with incompressible
data, it was able to write out the random buffer fio repeatedly wrote.
I think I will step back from this now; it's the weekend and a quiet time
after all.
I probably got a bit too engaged with this discussion. Yet I had the feeling
I was treated by Robert like someone who doesn't know a thing. I want to
approach this with a willingness to learn, and I don't want an empirical
result to be explained away before someone has even had a closer look at it.
I have had this before, where an expert claimed that he would not reduce the
dirty_background_ratio in an rsync-via-NFS case, and I actually needed to
prove the result to him before he eventually accepted it.
I may be off with my free space fragmentation idea, so let the kern.log
and my results speak for themselves. I don't see much point in continuing
this discussion before a BTRFS developer has had a look at it.
I put the sysrq-trigger t kern.log onto the bug report. The bugzilla does
not seem to be reachable from here at the moment (nginx reports "502 Bad
Gateway"), but I did attach the kern.log to it. In case someone needs it by
mail, just ping me.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 17:11 ` Martin Steigerwald
@ 2014-12-27 17:59 ` Martin Steigerwald
0 siblings, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 17:59 UTC (permalink / raw)
To: Hugo Mills; +Cc: Robert White, linux-btrfs
On Saturday, 27 December 2014, 18:11:21, Martin Steigerwald wrote:
> On Saturday, 27 December 2014, 16:26:42, Hugo Mills wrote:
> > On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote:
> > > On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
> > [snip]
> > > >while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
> > > >for 10 seconds while allocating a 4 GiB file on a filesystem like:
> > > >
> > > >martin@merkaba:~> LANG=C df -hT /home
> > > >Filesystem Type Size Used Avail Use% Mounted on
> > > >/dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
> > > >
> > > >where a 4 GiB file should easily fit, no? (And this output is with the 4
> > > >GiB file. So it was even 4 GiB more free before.)
> > >
> > > No. /usr/bin/df is an _approximation_ in BTRFS because of the limits
> > > of the fsstat() function call. The fstat function call was defined
> > > in 1990 and "can't understand" the dynamic allocation model used in
> > > BTRFS as it assumes fixed geometry for filesystems. You do _not_
> > > have 17G actually available. You need to rely on btrfs fi df and
> > > btrfs fi show to figure out how much space you _really_ have.
> > >
> > > According to this block you have a RAID1 of ~ 160GB expanse (two 160G disks)
> > >
> > > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > > Sa 27. Dez 13:26:39 CET 2014
> > > > Label: 'home' uuid: [some UUID]
> > > > Total devices 2 FS bytes used 152.83GiB
> > > > devid 1 size 160.00GiB used 160.00GiB path
> > > /dev/mapper/msata-home
> > > > devid 2 size 160.00GiB used 160.00GiB path
> > > /dev/mapper/sata-home
> > >
> > > And according to this block you have about 4.49GiB of data space:
> > >
> > > > Btrfs v3.17
> > > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > >
> > > 154.97
> > > 5.00
> > > 0.032
> > > + 0.512
> > >
> > > Pretty much as close to 160GiB as you are going to get (those
> > > numbers being rounded up in places for "human readability") BTRFS
> > > has allocate 100% of the raw storage into typed extents.
> > >
> > > A large datafile can only fit in the 154.97-149.58 = 5.39
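Robert's space arithmetic above can be made mechanical. A minimal sketch in plain Python, with the figures hard-coded from the quoted `btrfs fi df` output (with RAID1 each chunk is mirrored on both devices, so the totals below describe the allocation on *each* 160 GiB device):

```python
# Figures from the quoted "btrfs fi df /home" output, in GiB.
data_total, data_used = 154.97, 149.58
metadata_total = 5.00
system_total = 32 / 1024          # 32 MiB

# Space still writable for file data inside already-allocated chunks.
# This is the number that matters once the devices are fully allocated;
# plain df cannot report it.
data_free = data_total - data_used
print(f"free inside data chunks: {data_free:.2f} GiB")   # ~5.39

# Every bit of each device has already been handed out to chunks:
allocated = data_total + metadata_total + system_total
print(f"allocated per device:    {allocated:.2f} GiB")   # ~160.00
```

(The GlobalReserve is carved out of the metadata allocation, so it is not added separately here.)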
> >
> > I appreciate that this is something of a minor point in the grand
> > scheme of things, but I'm afraid I've lost the enthusiasm to engage
> > with the broader (somewhat rambling, possibly-at-cross-purposes)
> > conversation in this thread. However...
> >
> > > Trying to allocate that 4GiB file into that 5.39GiB of space becomes
> > > an NP-complete (e.g. "very hard") problem if it is very fragmented.
> >
> > This is... badly mistaken, at best. The problem of where to write a
> > file into a set of free extents is definitely *not* an NP-hard
> > problem. It's a P problem, with an O(n log n) solution, where n is the
> > number of free extents in the free space cache. The simple approach:
> > fill the first hole with as many bytes as you can, then move on to the
> > next hole. More complex: order the free extents by size first. Both of
> > these are O(n log n) algorithms, given an efficient general-purpose
> > index of free space.
> >
> > The problem of placing file data isn't a bin-packing problem; it's
> > not like allocating RAM (where each allocation must be contiguous).
> > The items being placed may be split as much as you like, although
> > minimising the amount of splitting is a goal.
> >
> > I suspect that the performance problems that Martin is seeing may
> > indeed be related to free space fragmentation, in that finding and
> > creating all of those tiny extents for a huge file is causing
> > problems. I believe that btrfs isn't alone in this, but it may well be
> > showing the problem to a far greater degree than other FSes. I don't
> > have figures to compare, I'm afraid.
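Hugo's description above, where a file is simply split across however many holes are needed, can be sketched as follows. This is an illustration of the greedy idea only (largest hole first), not the actual btrfs allocator:

```python
def place_file(size, free_extents):
    """Greedily place `size` bytes into free extents, largest first.

    free_extents: list of (offset, length) holes. Returns the list of
    (offset, length) pieces used. The sort dominates: O(n log n) in the
    number of free extents, not a bin-packing search.
    """
    placed = []
    # Largest holes first, to minimise the number of splits.
    for off, length in sorted(free_extents, key=lambda e: -e[1]):
        if size == 0:
            break
        take = min(size, length)
        placed.append((off, take))
        size -= take
    if size:
        raise OSError("ENOSPC: not enough free space")
    return placed

# A file larger than any single hole still fits, by being split:
pieces = place_file(10, [(0, 4), (100, 3), (200, 5)])
print(pieces)  # [(200, 5), (0, 4), (100, 1)]
```

The cost of all that splitting (many tiny extents to track and read back) is a separate question from the cost of *finding* the placement.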
>
> That's what I wanted to hint at.
>
> I suspect an issue with free space fragmentation, and I act on what I
> think I see:
>
> btrfs balance minimizes the fragmentation of free space within chunks.
>
> And that is my whole case for why I think it helps with my /home
> filesystem.
>
> So while btrfs filesystem defragment may help with defragmenting individual
> files, possibly at the cost of fragmenting free space, at least under
> almost-full conditions, I think there are only three options at the moment
> to help with free space fragmentation:
>
> 1) reformat and restore via rsync or btrfs send from backup (i.e. file based)
>
> 2) make the BTRFS in itself bigger
>
> 3) btrfs balance at least some chunks, primarily those that are not more
> than 70% or 80% full.
>
> Do you know of any other ways to deal with it?
Yes.
4) Delete some stuff from it or move it over to a different filesystem.
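Option 3 from the list above, the usage-filtered balance, can be run incrementally so that only mostly-empty chunks are rewritten first. A sketch of such a pass; the thresholds and the /home path are examples, adjust them for the filesystem at hand:

```shell
# Rewrite only block groups that are at most N% full, stepping the
# threshold up; each pass returns emptied chunks to the unallocated pool
# and is much cheaper than a full balance.
for pct in 5 10 25 50; do
    btrfs balance start -dusage=$pct -musage=$pct /home
done

# Verify: per-device "used" should now be below the device size,
# leaving unallocated space for new chunk allocations.
btrfs filesystem show /home
btrfs filesystem df /home
```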
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 9:30 ` Hugo Mills
` (2 preceding siblings ...)
2014-12-27 13:55 ` Martin Steigerwald
@ 2014-12-27 18:28 ` Zygo Blaxell
2014-12-27 18:40 ` Hugo Mills
3 siblings, 1 reply; 59+ messages in thread
From: Zygo Blaxell @ 2014-12-27 18:28 UTC (permalink / raw)
To: Hugo Mills, Martin Steigerwald, Robert White, linux-btrfs
On Sat, Dec 27, 2014 at 09:30:43AM +0000, Hugo Mills wrote:
> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > On Friday, 26 December 2014, 14:48:38 Robert White wrote:
> > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> Now, since you're seeing lockups when the space on your disks is
> all allocated I'd say that's a bug. However, you're the *only* person
> who's reported this as a regular occurrence. Does this happen with all
> filesystems you have, or just this one?
I do see something similar, but there are so many problems going on I
have no idea which ones to report, and which ones are my own doing. :-P
I see lots of CPU being burned when all the disk space is allocated
to chunks, but there is still lots of space free (multiple GB) inside
the chunks.
iotop shows a crapton of disk writes (1-5MB/sec) from one kworker.
There are maybe a few kB/sec of writes through the filesystem at the time.
The filesystem where I see this most is on a laptop, so the disk writes
also hit the CPU again for encryption. There's so much CPU usage it's
worth mentioning twice. :-(
'watch cat /proc/12345/stack' on the active processes shows the kernel
fairly often in that new chunk deallocator function whose name escapes
me at the moment.
Deleting a bunch of data then running balance helps return to sane CPU
usage...for a while (maybe a week?).
It's not technically "locked up" per se, but when a 5KB download takes
a minute or more, most users won't wait around to see the difference.
Kernel versions I'm using are 3.17.7 and 3.18.1.
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 18:28 ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
@ 2014-12-27 18:40 ` Hugo Mills
2014-12-27 19:23 ` BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time) Martin Steigerwald
0 siblings, 1 reply; 59+ messages in thread
From: Hugo Mills @ 2014-12-27 18:40 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Martin Steigerwald, Robert White, linux-btrfs
On Sat, Dec 27, 2014 at 01:28:46PM -0500, Zygo Blaxell wrote:
> On Sat, Dec 27, 2014 at 09:30:43AM +0000, Hugo Mills wrote:
> > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > On Friday, 26 December 2014, 14:48:38 Robert White wrote:
> > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > Now, since you're seeing lockups when the space on your disks is
> > all allocated I'd say that's a bug. However, you're the *only* person
> > who's reported this as a regular occurrence. Does this happen with all
> > filesystems you have, or just this one?
>
> I do see something similar, but there are so many problems going on I
> have no idea which ones to report, and which ones are my own doing. :-P
>
> I see lots of CPU being burned when all the disk space is allocated
> to chunks, but there is still lots of space free (multiple GB) inside
> the chunks.
>
> iotop shows a crapton of disk writes (1-5MB/sec) from one kworker.
> There are maybe a few kB/sec of writes through the filesystem at the time.
>
> The filesystem where I see this most is on a laptop, so the disk writes
> also hit the CPU again for encryption. There's so much CPU usage it's
> worth mentioning twice. :-(
>
> 'watch cat /proc/12345/stack' on the active processes shows the kernel
> fairly often in that new chunk deallocator function whose name escapes
> me at the moment.
>
> Deleting a bunch of data then running balance helps return to sane CPU
> usage...for a while (maybe a week?).
>
> It's not technically "locked up" per se, but when a 5KB download takes
> a minute or more, most users won't wait around to see the difference.
>
> Kernel versions I'm using are 3.17.7 and 3.18.1.
OK, so I'd like to change my statement above.
When I first read Martin's problem, I thought that he was referring
to a complete, hit-the-power-button kind of lock-up. Given that
(erroneous) assumption, I stand by my (now pointless) statement. :)
I realised during a brief conversation on IRC that Martin was
actually referring to long but temporary periods where the machine is
unusable by any process requiring disk activity. There are clearly a
number of people seeing that.
It doesn't stop it being a major problem, but it does change the
interpretation considerably.
Hugo.
--
Hugo Mills | Mixing mathematics and alcohol is dangerous. Don't
hugo@... carfax.org.uk | drink and derive.
http://carfax.org.uk/ |
PGP: 65E74AC0 |
* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
2014-12-27 18:40 ` Hugo Mills
@ 2014-12-27 19:23 ` Martin Steigerwald
2014-12-29 2:07 ` Zygo Blaxell
0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 19:23 UTC (permalink / raw)
To: Hugo Mills; +Cc: Zygo Blaxell, Robert White, linux-btrfs
On Saturday, 27 December 2014, 18:40:17 Hugo Mills wrote:
> On Sat, Dec 27, 2014 at 01:28:46PM -0500, Zygo Blaxell wrote:
> > On Sat, Dec 27, 2014 at 09:30:43AM +0000, Hugo Mills wrote:
> > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > On Friday, 26 December 2014, 14:48:38 Robert White wrote:
> > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > Now, since you're seeing lockups when the space on your disks is
> > > all allocated I'd say that's a bug. However, you're the *only* person
> > > who's reported this as a regular occurrence. Does this happen with all
> > > filesystems you have, or just this one?
> >
> > I do see something similar, but there are so many problems going on I
> > have no idea which ones to report, and which ones are my own doing. :-P
> >
> > I see lots of CPU being burned when all the disk space is allocated
> > to chunks, but there is still lots of space free (multiple GB) inside
> > the chunks.
> >
> > iotop shows a crapton of disk writes (1-5MB/sec) from one kworker.
> > There are maybe a few kB/sec of writes through the filesystem at the time.
> >
> > The filesystem where I see this most is on a laptop, so the disk writes
> > also hit the CPU again for encryption. There's so much CPU usage it's
> > worth mentioning twice. :-(
> >
> > 'watch cat /proc/12345/stack' on the active processes shows the kernel
> > fairly often in that new chunk deallocator function whose name escapes
> > me at the moment.
> >
> > Deleting a bunch of data then running balance helps return to sane CPU
> > usage...for a while (maybe a week?).
> >
> > It's not technically "locked up" per se, but when a 5KB download takes
> > a minute or more, most users won't wait around to see the difference.
> >
> > Kernel versions I'm using are 3.17.7 and 3.18.1.
>
> OK, so I'd like to change my statement above.
>
> When I first read Martin's problem, I thought that he was referring
> to a complete, hit-the-power-button kind of lock-up. Given that
> (erroneous) assumption, I stand by my (now pointless) statement. :)
>
> I realised during a brief conversation on IRC that Martin was
> actually referring to long but temporary periods where the machine is
> unusable by any process requiring disk activity. There's clearly a
> number of people seeing that.
>
> It doesn't stop it being a major problem, but it does change the
> interpretation considerably.
Ah, then my bet about whom I talked to there was right. :)
Yeah, it does not seem to be a complete hang. I thought so initially because,
honestly, after waiting several minutes for my Plasma desktop to come back
I just gave up. Maybe it would have returned at some point; I just didn't
have the patience to wait.
It did return during my last test, where I continued on tty1 (I had all the
testing in a screen session) as the desktop session locked up. Some time
after the test completed I was able to use that desktop again, and I am
still using it.
So the issue I see is: one kworker uses 100% of one core for minutes, and
while it does so, processes doing I/O to the BTRFS filesystem under test
(/home in my case) seem to be stuck in uninterruptible sleep ("D" process
state). While I see this there is no huge load on the SSDs, so it seems to
be something CPU-bound. I didn't yet run strace on the kworker process, or
on the fio process at allocation time; Robert, that's a good suggestion.
From a gut feeling I wouldn't be surprised to see *nothing* in strace, as
my bet is that the kworker thread is busy finding free space inside the
chunks, walking some data structures as it goes. But that really is just a
gut feeling, so an strace would be nice.
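For what it's worth, the hunch that strace would show nothing has a structural reason: a kworker is a kernel thread and never issues system calls, so strace has nothing to report. Sampling the in-kernel stack (as Zygo does elsewhere in this thread) or profiling is more informative; the PID below is a placeholder:

```shell
# strace on a kernel thread shows no syscalls at all; sample its
# in-kernel stack instead (replace 12345 with the busy kworker's PID):
watch -n 1 cat /proc/12345/stack

# Or profile where the CPU time goes, with call graphs (needs root):
perf top -g
```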
I made a backup yesterday, so I think I can try the strace. But I have
already spent a considerable amount of time reproducing this and digging
deeper into it, so likely not this weekend anymore, even though it is
actually some fun. I also see myself neglecting other stuff that is
important to me, so…
My simple test case didn't trigger it, and I do not have another two times
160 GiB available on these SSDs to try with a copy of my home filesystem;
with that I could safely test without bringing the desktop session to a
halt. Maybe someone has an idea on how to "enhance" my test case so that
it reliably triggers the issue.
It may be challenging, though. My /home is quite a filesystem. It has a
maildir with at least a million files (yeah, I am performance-testing KMail
and Akonadi to the limit as well!), it has git repos, this one VM image,
the desktop search index, and the Akonadi database. In other words: it has
been hit with various, I think mostly random, workloads over the last six
months or so. I bet it is not that easy to simulate. Maybe some runs of
compilebench to age the filesystem before the fio test?
That said, BTRFS performs a lot better now. The complete lockups without
any CPU usage on 3.15 and 3.16 are gone for sure. That's wonderful. But
there is this kworker issue now. I noticed it this gravely only while
trying to complete the tax return stuff with the Windows XP VM. It may
have happened otherwise as well (I have seen some backtraces in kern.log),
but it didn't last for minutes. So this is indeed less severe than the
full lockups with 3.15 and 3.16.
Zygo, what are the characteristics of your filesystem? Do you use
compress=lzo and skinny metadata as well? How are the chunks allocated?
What kind of data do you have on it?
Well, now off to some dancing event. It is starting right now. :)
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 16:26 ` Hugo Mills
2014-12-27 17:11 ` Martin Steigerwald
@ 2014-12-28 0:06 ` Robert White
2014-12-28 11:05 ` Martin Steigerwald
1 sibling, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-28 0:06 UTC (permalink / raw)
To: Hugo Mills, Martin Steigerwald, linux-btrfs
Semi off-topic questions...
On 12/27/2014 08:26 AM, Hugo Mills wrote:
> This is... badly mistaken, at best. The problem of where to write a
> file into a set of free extents is definitely *not* an NP-hard
> problem. It's a P problem, with an O(n log n) solution, where n is the
> number of free extents in the free space cache. The simple approach:
> fill the first hole with as many bytes as you can, then move on to the
> next hole. More complex: order the free extents by size first. Both of
> these are O(n log n) algorithms, given an efficient general-purpose
> index of free space.
Which algorithm is actually in use?
Is any attempt made to keep subsequent allocations in the same data extent?
All of "best fit", "first fit", and "first encountered" allocation have
terrible distribution graphs over time.
Without a nod to locality, discontiguous allocation will have
staggeringly bad after-effects in terms of read-ahead.
>
> The problem of placing file data isn't a bin-packing problem; it's
> not like allocating RAM (where each allocation must be contiguous).
> The items being placed may be split as much as you like, although
> minimising the amount of splitting is a goal.
How are compression and re-compression handled? If a linear extent is
compressed to find its on-disk size in bytes, and then there isn't a free
extent large enough to fit it, it has to be cut, recompressed, and then
searched for again, right?
How does the system look for the right cut? How iterative can this get?
Does it always try cutting in half? Does it shave single bytes off the
end? Does it add one byte at a time until it reaches the size of the
extent it is looking at?
Can you get down to a point where you are placing data in five- or ten-byte
chunks somehow? (E.g. what is the smallest chunk you can place? Clearly,
if I open a multi-megabyte file and update a single word or byte, it is
not going to land in metadata, from my reading of the code.) One could
easily end up with a couple million free extents of just a few bytes each,
particularly if largest-first allocation is used.
The degenerate cases here do come straight from the various packing
problems. You may not be executing any of those packing algorithms, but
once you ignore enough of those issues in the easy cases, your free space
will be a fine pink mist suspended in space (both an explosion analogy
and a reference to pink noise 8-) ).
> I suspect that the performance problems that Martin is seeing may
> indeed be related to free space fragmentation, in that finding and
> creating all of those tiny extents for a huge file is causing
> problems. I believe that btrfs isn't alone in this, but it may well be
> showing the problem to a far greater degree than other FSes. I don't
> have figures to compare, I'm afraid.
>
>> I also don't know what kind of tool you are using, but it might be
>> repeatedly trying and failing to fallocate the file as a single
>> extent or something equally dumb.
>
> Userspace doesn't as far as I know, get to make that decision. I've
> just read the fallocate(2) man page, and it says nothing at all about
> the contiguity of the extent(s) storage allocated by the call.
Yep, my bad. But as soon as I saw that fio was starting two threads,
one doing random read/write and another doing sequential read/write,
both on the same file, it set off my "not just creating a file" mindset.
Given the delayed write into/through the cache normally done by casual
file I/O, it seemed likely that fio would be doing something more
aggressive (like using O_DIRECT or repeated fdatasync(), which could get
very tit-for-tat).
Compare that to a VM, in which the guest operating system "knows" it has,
and has used, its "disk space" internally, while the monitor asynchronously
pushes that activity out to real storage, which is usually quite
pathological... you can get into some super pernicious behavior over write
ordering and infinite retries.
So I was wrong about fallocate per se, but applications can be incredibly
dumb. For instance a VM might think it is _inconceivable_ to get an ENOSPC
while rewriting data it has just read from a file it "knows" has no holes,
etc.
Given how much code doesn't even check the results of many function
calls... how many times have you seen code that doesn't look at the
return value of fwrite() or printf()? Or code that, at best, does something
like if (bytes_written < size) retry_remainder();? So sure, I was
imagining an fallocate() in a loop or something equally dumb. 8-)
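For contrast, the retry-the-remainder idiom done correctly looks something like this; a generic sketch (here in Python, with os.write standing in for the fwrite() case), not code from any project discussed in this thread:

```python
import os

def write_all(fd, data):
    """Write all of `data` to fd, retrying on short writes.

    os.write() may write fewer bytes than requested; code that ignores
    its return value silently loses the tail of the buffer.
    """
    view = memoryview(data)
    while view:                 # loop until the whole buffer is gone
        written = os.write(fd, view)
        view = view[written:]   # retry with the unwritten remainder

# Usage: a short write cannot lose data here.
r, w = os.pipe()
write_all(w, b"hello world")
os.close(w)
print(os.read(r, 64))  # b'hello world'
```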
>
> Hugo.
>
> [snip]
>
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-27 16:01 ` Martin Steigerwald
@ 2014-12-28 0:25 ` Robert White
2014-12-28 1:01 ` Bardur Arantsson
0 siblings, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-28 0:25 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Hugo Mills, linux-btrfs
On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
> From how you write I get the impression that you think everyone else
> beside you is just silly and dumb. Please stop this assumption. I may not
> always get terms right, and I may make a mistake as with the wrong df
> figure. But I also highly dislike to feel treated like someone who doesn´t
> know a thing.
Nope. I'm a systems theorist and I demand/require variable isolation.
Not a question of "silly" or "dumb" but a question of "speaking with
sufficient precision and clarity".
For instance you speak of "having an impression" and then decide I've
made an assumption.
I define my position. Explain my terms. Give my examples.
I also risk being utterly wrong because sometimes being completely wrong
gets others to cut away misconceptions and assumptions.
It annoys some people, but it gets results. You have been going around on
this topic for how long? And just today Hugo "got" that your problem is
being CPU-bound for long stretches instead of a hard lockup. We have
stopped talking about "trees" and started talking about free space
management. We have stopped talking about 17G of free space and gotten
down to the 5 or so. Plus, you got angry at me, tried to prove me an
idiot, and so produced test cases and data that are absolutely clear,
including steps to reproduce.
In real life I work on mission critical systems that can get people
killed when they fail. So I have developed the reflex of tenacity in
getting everyone using the same words, talking about the same concepts,
giving concrete examples, and generally bringing the discussion to a
very precise head.
Example: I had two parties in conflict about a system. One party said
that every time they did "an orderly shutdown" the device would hang in
a way that took days to recover from. The other party would examine the
device and say "could not reproduce". Turns out that the two parties
were doing entirely different (but both correct) sequences for "orderly
shutdown". They'd been having that conflict for more than a year. But
since they both _knew_ what an "orderly shutdown" was, they _never_
analyzed what they were saying. (turns out one procedure left a chip in
a state that it wouldn't restart until a capacitor discharged, and the
other procedure did not.)
So yeah, when people make statements that "everybody understands" and
those statements don't agree, I start slicing concepts off one at a time...
It's not about "dumb" or "silly" it's about exact and accurate
descriptions that have been stripped of assumptions and tribal knowledge.
And I don't care if I come off looking like "the bad guy" because I
don't believe in "the bad guy" at all when it comes to making things
more clear and getting out of a communications deadlock. My only goal is
"less broken".
So occasionally annoying... but look... progress!
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 0:25 ` Robert White
@ 2014-12-28 1:01 ` Bardur Arantsson
2014-12-28 4:03 ` Robert White
0 siblings, 1 reply; 59+ messages in thread
From: Bardur Arantsson @ 2014-12-28 1:01 UTC (permalink / raw)
To: linux-btrfs
On 2014-12-28 01:25, Robert White wrote:
> On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
>>> From how you write I get the impression that you think everyone else
>> beside you is just silly and dumb. Please stop this assumption. I may not
>> always get terms right, and I may make a mistake as with the wrong df
>> figure. But I also highly dislike to feel treated like someone who
>> doesn´t
>> know a thing.
>
> Nope. I'm a systems theorist and I demand/require variable isolation.
>
> Not a question of "silly" or "dumb" but a question of "speaking with
> sufficient precision and clarity".
>
> For instance you speak of "having an impression" and then decide I've
> made an assumption.
>
> I define my position. Explain my terms. Give my examples.
>
> I also risk being utterly wrong because sometimes being completely wrong
> gets others to cut away misconceptions and assumptions.
>
> It annoys some people, but it gets results.
Can you please stop this bullshit posturing nonsense? It accomplishes
nothing -- if you're right, your other posts will stand for themselves
and show that you are indeed "the shit" when it comes to these matters,
but this post (so far, I didn't read further) accomplishes nothing other
than (possibly) convincing everyone that you're a pompous/self-important
ass.
Regards,
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 1:01 ` Bardur Arantsson
@ 2014-12-28 4:03 ` Robert White
2014-12-28 12:03 ` Martin Steigerwald
2014-12-28 12:07 ` Martin Steigerwald
0 siblings, 2 replies; 59+ messages in thread
From: Robert White @ 2014-12-28 4:03 UTC (permalink / raw)
To: Bardur Arantsson, linux-btrfs
On 12/27/2014 05:01 PM, Bardur Arantsson wrote:
> On 2014-12-28 01:25, Robert White wrote:
>> On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
>>>> From how you write I get the impression that you think everyone else
>>> beside you is just silly and dumb. Please stop this assumption. I may not
>>> always get terms right, and I may make a mistake as with the wrong df
>>> figure. But I also highly dislike to feel treated like someone who
>>> doesn´t
>>> know a thing.
>>
>> Nope. I'm a systems theorist and I demand/require variable isolation.
>>
>> Not a question of "silly" or "dumb" but a question of "speaking with
>> sufficient precision and clarity".
>>
>> For instance you speak of "having an impression" and then decide I've
>> made an assumption.
>>
>> I define my position. Explain my terms. Give my examples.
>>
>> I also risk being utterly wrong because sometimes being completely wrong
>> gets others to cut away misconceptions and assumptions.
>>
>> It annoys some people, but it gets results.
>
> Can you please stop this bullshit posturing nonsense? It accomlishes
> nothing -- if you're right your other posts will stand for themselves
> and show that you are indeed "the shit" when it comes to these matters,
> but this post (so far, didn't read further) accomplishes nothing other
> than (possibly) convincing everyone that you're a pompous/self-important
> ass.
Really? "accomplishes nothing"?
24 hours ago:
the complaining party was talking about
- Windows XP
- Tax software
- Virtual box
- vdi files
- defragging
- balancing
- "data trees"
- system hanging
And the responding party was saying
"you are the only person reporting this as a regular occurrence" with
the implication that the report was a duplicate or at least might not
get much immediate attention.
Now:
The complaining party has verified a minimal, repeatable case of
simple file allocation on a very fragmented filesystem, and the responding
party and several others have understood and supported the bug.
That's not "accomplishing nothing"; that's called engaging in diagnostics
instead of dismissing a complaint, and sticking with the diagnostic
process until everyone is on the same page.
I never dismissed Martin. I never disbelieved him. I went through his
elements one at a time with examples of what I was taking away from him
and why they didn't match expectations and experimental evidence. We
adjusted our positions and communications.
So you can call it "bullshit posturing nonsense" but I see "taking less
than a day to get to the bottom of a bug report that might not have
gotten significant attention."
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 0:06 ` Robert White
@ 2014-12-28 11:05 ` Martin Steigerwald
0 siblings, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 11:05 UTC (permalink / raw)
To: Robert White; +Cc: Hugo Mills, linux-btrfs
On Saturday, 27 December 2014, 16:06:13 Robert White wrote:
> >
> >> I also don't know what kind of tool you are using, but it might be
> >> repeatedly trying and failing to fallocate the file as a single
> >> extent or something equally dumb.
> >
> > Userspace doesn't as far as I know, get to make that decision. I've
> > just read the fallocate(2) man page, and it says nothing at all about
> > the contiguity of the extent(s) storage allocated by the call.
>
> Yep, my bad. But as soon as I saw that "fio" was starting two threads,
> one doing random read/write and another doing sequential read/write,
> both on the same file, it set off my "not just creating a file" mindset.
> Given the delayed write into/through the cache normally done by casual
> file io, It seemed likely that fio would be doing something more
> aggressive (like using O_DIRECT or repeated fdatasync() which could get
> very tit-for-tat).
Robert, please get to know fio, or *ask*, before jumping to conclusions.
I used this job file:
[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file
#[seq-write]
#rw=write
#stonewall
[rand-write]
rw=randwrite
stonewall
For the first test I still ran seq-write as well, but do you notice the
"stonewall" parameter? It *separates* the jobs from one another. I.e. fio
may be starting two threads, as I think it prepares all threads in
advance, yet it executed only *one* at a time.
From the manpage of fio:
stonewall , wait_for_previous
Wait for preceding jobs in the job file to exit before
starting this one. stonewall implies new_group.
(That said, the first stonewall isn't even needed; I removed the read
jobs from the ssd-test.fio example I based this job on and didn't
remember to remove the statement.)
Thank you a lot for your input. I learned some things from it. For example
that the trees for data handling live in the metadata section. And it is
now very clear to me that btrfs fi df does not display any trees, but the
chunk reservation and usage. I think I knew this before, but I somehow
thought that was combined with the tree; it isn't, at least not in place:
the trees are stored in the metadata chunks. I'd still not call these
extents, though, since to my knowledge that is a file-based term.
I will skip theorizing about algorithms here. I prefer to let measurements
speak and to try to understand them. The best approach to understanding the
ones I made, I think, is what Hugo suggested: a developer looks at the
sysrq-t outputs. So I personally won't speculate any further about given or
not given algorithmic limitations of BTRFS.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 4:03 ` Robert White
@ 2014-12-28 12:03 ` Martin Steigerwald
2014-12-28 17:04 ` Patrik Lundquist
2014-12-28 12:07 ` Martin Steigerwald
1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 12:03 UTC (permalink / raw)
To: Robert White; +Cc: Bardur Arantsson, linux-btrfs
Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
> On 12/27/2014 05:01 PM, Bardur Arantsson wrote:
> > On 2014-12-28 01:25, Robert White wrote:
> >> On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
> >>>> From how you write I get the impression that you think everyone else
> >>> beside you is just silly and dumb. Please stop this assumption. I may not
> >>> always get terms right, and I may make a mistake as with the wrong df
> >>> figure. But I also highly dislike to feel treated like someone who
> >>> doesn´t
> >>> know a thing.
> >>
> >> Nope. I'm a systems theorist and I demand/require variable isolation.
> >>
> >> Not a question of "silly" or "dumb" but a question of "speaking with
> >> sufficient precision and clarity".
> >>
> >> For instance you speak of "having an impression" and then decide I've
> >> made an assumption.
> >>
> >> I define my position. Explain my terms. Give my examples.
> >>
> >> I also risk being utterly wrong because sometimes being completely wrong
> >> gets others to cut away misconceptions and assumptions.
> >>
> >> It annoys some people, but it gets results.
> >
> > Can you please stop this bullshit posturing nonsense? It accomplishes
> > nothing -- if you're right your other posts will stand for themselves
> > and show that you are indeed "the shit" when it comes to these matters,
> > but this post (so far, didn't read further) accomplishes nothing other
> > than (possibly) convincing everyone that you're a pompous/self-important
> > ass.
>
> Really? "accomplishes nothing"?
>
> 24 hours ago:
>
> the complaining party was talking about
>
> - Windows XP
> - Tax software
> - Virtual box
> - vdi files
> - defragging
> - balancing
> - "data trees"
> - system hanging
>
> And the responding party was saying
>
> "you are the only person reporting this as a regular occurrence" with
> the implication that the report was a duplicate or at least might not
> get much immediate attention.
>
> Now:
>
> The complaining party has verified the minimum, repeatable case of
> simple file allocation on a very fragmented system and the responding
> party and several others have understood and supported the bug.
It was repeatable before. That I go from an application case to a simulated
workload case is only natural. Or do you run fio or other load testing apps
as part of your daily work on your computer (unless you are actually
diagnosing performance issues)? I still *use* the computer with
applications. And if that's where I see the performance issue, I report it
as such. Then I think about the kind of workload it creates and go from
there to simplify it to a reproducible case.
At least I read mails, browse the web, run a VM, and such kinds of
things as daily computer usage. And thus it's likely that performance issues
show up like this. Heck, even my server does mail and Owncloud and things.
I only use workload generation tools during my teaching or when analysing
things, not as part of my daily computer usage.
And that doesn't make using a VM any less valid. And if it basically brings
BTRFS to a crawling halt, I report this. It's actually that easy.
> That's not "accomplishing nothing", thats called engaging in diagnostics
> instead of dismissing a complaint, and sticking out the diagnostic
> process until everyone is on the same page.
>
> I never dismissed Martin. I never disbelieved him. I went through his
> elements one at a time with examples of what I was taking away from him
> and why they didn't match expectations and experimental evidence. We
> adjusted our positions and communications.
Robert, I received this differently. I received your input partly as wronging
me. Granted, that motivated me even more to prove things. But I highly
dislike this kind of motivation, as I think I am motivated myself. I like
finding causes of performance bottlenecks. And I prefer positive motivation
over negative.
> So you can call it "bullshit posturing nonsense" but I see "taking less
> than a day to get to the bottom of a bug report that might not have
> gotten significant attention."
And you attribute all of this to your argumentation?
That's bold.
See, Robert, your arguments helped clear up my understanding in some
parts. Especially regarding the terms I have not been very familiar with.
I am grateful for that.
It even helped motivate me to do the further tests, as I got the
impression that you had just been arguing that what I am seeing is
just the way BTRFS necessarily is *algorithmically* and that I was just
using it wrongly. But that said: I have an interest myself in resolving this.
I was prepared to give additional input at a given time. But on that
day I was just fed up with things.
It motivated me to prove the abysmal performance behaviour in a certain
workload.
Robert, your arguments contributed, that's true. But still I did the work of
the actual measurements. I spent the hours on doing the measurements,
with a slight risk of having to restore from backup in case BTRFS
messed up things. I was the one bringing BTRFS to the limits where it
actually shows an issue, instead of theorizing about the limit as being
an algorithmic issue or wrong usage.
I expect the process to be iterative. At first I see something, get an
impression and probably a gut feeling. And then I move on from that.
Maybe you are the superguru who has the complete picture the moment
you see an issue. But I see things and then try to make sense of them,
actively allowing for feedback on the way. I start research then. And this
research is iterative. And yet I am so bold as to post things on this mailing
list even if they are not yet a fully fledged scientific document. Even
if I didn't get all BTRFS-specific terms right – but I still had a quite
accurate understanding of what I was seeing. In order for others to chime
in and give ideas.
At first I partially ranted. Yes, I even said so. Cause I am human. That's
it. I wanted to progress on my tax return and BTRFS messed up. And I
then spent literally hours on fixing things, even copying the VDI file back
from backup, as Windows had run chkdsk on it and I wasn't so sure anymore
about its consistency. Heck, I even succeeded at it. By even doing something
that is *not* recommended, but *still* works: the balance. I have been – I'd
say rightfully so – angry at that. If confronted with theory versus real
world perception, I always take what I perceive first. If theory says a
balance is not supposed to help, I'd still balance if I see that it does. It's
that simple, cause I found it quite fruitless to argue with the world. If
it rains while it shouldn't, I still prepare myself for the rain when going
out.
And heck, my initial impression still stands, Robert. I have shown a case
where BTRFS becomes CPU bound and basically crawls to a halt, and I
still think this is a (performance) bug. We will see whether it is. And no,
it wasn't the dirty background ratio or the SSDs being too fast for the CPU,
as you tried to guess (even though I am using 3.18 with that multiqueue block
I/O handling enabled, at least I think it is enabled). (I am aware of all of
this and I am aware of the work in the Linux kernel to support a million
IOPS or more, and that certain PCIe flash drivers at least at some time
circumvented parts of the block I/O layer. But I also know I just have some
SATA SSDs, connected via SATA-300, and that's a difference in the amount
of IOPS they can actually produce.)
And also, given the history, what I reported is *new*. I saw this for the first
time like this. It doesn't have a fruitless history of not going to the root
cause. The last thing others and I reported is fixed already: the hangs in
3.15 and 3.16. I provided information on these as well, if you care to
look it up in the mailing list archives. I tested patches to solve them. Just
check the mailing list archives for this. It's not even the first BTRFS issue
I helped diagnose.
I simply wasn't aware that this isn't a permanent lock, as I gave up waiting
for the desktop to return after some minutes, cause I simply wanted to get
my work done. At a certain point I may not be willing to spend hours to
find the root cause of a problem, but will just work around it.
So I just wrote that I saw these kworker processes spinning at 100% CPU
and that my desktop had locked up. I didn't include the information on the
process state of the desktop processes, but it was basically the dialog on
IRC I had with Hugo which cleared that up. As far as I am aware, none of your
argumentation contributed to that. For me it was a hang, cause things
hung. Whether permanent or not? How long do you wait to determine that?
I close with how I would like this process to work:
I perceive something, I may have an idea about it, but then I proceed
from there without assuming anyone is "right" or "wrong" about something.
And it is this "wronging" of others that I perceived here, at least in
some of what you wrote, that I want to stop right now. If you didn't send
it, we misunderstood. But that's how I received some of your arguments: as
dismissive of my 10+ years of Linux experience and 6-7 years of
actually *teaching* performance analysis & tuning.
Still I find myself knowing nothing at times. And that's good. Cause that
is how I learn. At first I even saw values that just didn't make any sense.
In the end they did.
So I am learning. You are learning. Everybody is learning.
At first I see something and describe it as it is… I used unclear terms for it
and it helped to clarify this. And then I try to understand what's happening.
That is how it works for me.
I would very much like to proceed on this with that kind of attitude. And in
that sense I look forward to your valuable input on this as it progresses to
a conclusion.
So can we just assume that we are all experts and beginners at the same
time and capable and helpful and willing to learn and go along like this?
That's what I call productive.
Okay, that was lengthy, and I have a part in this. Actually I felt offended.
Maybe by misunderstanding you. But that's how I received some of your
statements.
BTW, I found that the approach from the Oracle blog didn't work at all for
me. I completed a cycle of defrag, sdelete -c and VBoxManage compact, not
so much out of interest in it, but as part of my BTRFS testing, and it
apparently did *nothing* to reduce the size of the file. That was my initial
motivation with it: to reduce the size of the file to make more free space
for BTRFS. And I was using what's recommended by the company whose
developers develop Virtualbox. Disliking the defragment step from the
beginning (as useless on SSD). But I thought, heck, if it gives me a smaller
file, all good.
Next time I will just give it 10 GiB instead of 20 GiB from the beginning. Or
at some point I will find Linux-based tax return software. I wonder what
the authorities would say when I tell them I can't complete my tax returns as
my operating system is not supported by the software necessary for it.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 4:03 ` Robert White
2014-12-28 12:03 ` Martin Steigerwald
@ 2014-12-28 12:07 ` Martin Steigerwald
2014-12-28 14:52 ` Robert White
1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 12:07 UTC (permalink / raw)
To: Robert White; +Cc: Bardur Arantsson, linux-btrfs
Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
> Now:
>
> The complaining party has verified the minimum, repeatable case of
> simple file allocation on a very fragmented system and the responding
> party and several others have understood and supported the bug.
I didn't yet provide such a test case.
At the moment I can only reproduce this case of a kworker thread using a
CPU for minutes with my /home filesystem.
A minimal test case for me would be to be able to reproduce it with a
fresh BTRFS filesystem. But so far, with my test case on a fresh BTRFS, I
get 4800 instead of 270 IOPS.
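As a sanity check on such IOPS figures: fio's reported iops is just completed I/Os divided by runtime, so it can be re-derived from the raw counters it prints. A small sketch using the numbers from the run on /home elsewhere in this thread (seq-write: 1048576 writes in 12204 ms, iops=85920; rand-write: 35084 writes in 137803 ms, iops=254):

```python
# Re-derive fio's reported iops from its raw counters:
# iops = total completed I/Os / runtime in seconds.
def iops(total_ios: int, runtime_ms: int) -> float:
    """I/Os per second from a total I/O count and a runtime in milliseconds."""
    return total_ios / (runtime_ms / 1000.0)

# Counters from the fio output on /home quoted in this thread.
seq_write = iops(1048576, 12204)   # sequential 4k write job
rand_write = iops(35084, 137803)   # random 4k write job

# fio truncates: these come out as ~85920 and ~254.6, matching the
# reported iops=85920 and iops=254 (a factor of ~340 between the jobs).
print(f"seq: {seq_write:.0f} IOPS, rand: {rand_write:.1f} IOPS")
```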
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again (further tests)
2014-12-27 13:55 ` Martin Steigerwald
2014-12-27 14:54 ` Robert White
@ 2014-12-28 13:00 ` Martin Steigerwald
2014-12-28 13:40 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare) Martin Steigerwald
1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 13:00 UTC (permalink / raw)
To: Hugo Mills; +Cc: Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 40181 bytes --]
Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> Summarized at
>
> Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> see below. This is reproducible with fio, no need for Windows XP in
> Virtualbox for reproducing the issue. Next I will try to reproduce with
> a freshly created filesystem.
>
>
> Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > Hello!
> > > > >
> > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > >
> > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > bug
> > > > > report:
> > > > >
> > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > >
> > > > > compress=lzo:
> > > > (there is no known problem with skinny metadata, it's actually more
> > > > efficient than the older format. There has been some anecdotes about
> > > > mixing the skinny and fat metadata but nothing has ever been
> > > > demonstrated problematic.)
> > > >
> > > > > merkaba:~> btrfs fi sh /home
> > > > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > >
> > > > > Total devices 2 FS bytes used 144.41GiB
> > > > > devid 1 size 160.00GiB used 160.00GiB path
> > > > > /dev/mapper/msata-home
> > > > > devid 2 size 160.00GiB used 160.00GiB path
> > > > > /dev/mapper/sata-home
> > > > >
> > > > > Btrfs v3.17
> > > > > merkaba:~> btrfs fi df /home
> > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > >
> > > > This filesystem, at the allocation level, is "very full" (see below).
> > > >
> > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > > cause I know no tax return software for Linux which would be suitable
> > > > > for
> > > > > Germany and I frankly don´t care about the end of security cause all
> > > > > surfing and other network access I will do from the Linux box and I
> > > > > only
> > > > > run the VM behind a firewall).
> > > >
> > > > > And thus I try the balance dance again:
> > > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > > >
> > > > "Balancing" is something you should almost never need to do. It is only
> > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > etc.) of for cases when you've radically changed allocation behaviors
> > > > (like you decided to remove all your VM's or you've decided to remove a
> > > > mail spool directory full of thousands of tiny files).
> > > >
> > > > People run balance all the time because they think they should. They are
> > > > _usually_ incorrect in that belief.
> > >
> > > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > > device.
> > No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > space. What's more, balance does *not* balance the metadata trees. The
> > remaining space -- 154.97 GiB -- is unstructured storage for file
> > data, and you have some 13 GiB of that available for use.
> >
> > Now, since you're seeing lockups when the space on your disks is
> > all allocated I'd say that's a bug. However, you're the *only* person
> > who's reported this as a regular occurrence. Does this happen with all
> > filesystems you have, or just this one?
> >
> > > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > > from to *extend* a tree.
> >
> > It's not a tree. It's simply space allocation. It's not even space
> > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > saying "I'm going to use this piece of disk for this purpose").
> >
> > > This may be a bug, but this is what I see.
> > >
> > > And no amount of "you should not balance a BTRFS" will make that
> > > perception go away.
> > >
> > > See, I see the sun coming out on a morning and you tell me "no, it
> > > doesn´t". Simply that is not going to match my perception.
> >
> > Duncan's assertion is correct in its detail. Looking at your space
>
> Robert's :)
>
> > usage, I would not suggest that running a balance is something you
> > need to do. Now, since you have these lockups that seem quite
> > repeatable, there's probably a lurking bug in there, but hacking
> > around with balance every time you hit it isn't going to get the
> > problem solved properly.
> >
> > I think I would suggest the following:
> >
> > - make sure you have some way of logging your dmesg permanently (use
> > a different filesystem for /var/log, or a serial console, or a
> > netconsole)
> >
> > - when the lockup happens, hit Alt-SysRq-t a few times
> >
> > - send the dmesg output here, or post to bugzilla.kernel.org
> >
> > That's probably going to give enough information to the developers
> > to work out where the lockup is happening, and is clearly the way
> > forward here.
>
> And I got it reproduced. *Perfectly* reproduced, I´d say.
>
> But let me run the whole story:
>
> 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
[… story of trying to reproduce with Windows XP defragmenting which was
unsuccessful as BTRFS still had free device space to allocate new chunks
from …]
> But finally I got to:
>
> merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> Sa 27. Dez 13:26:39 CET 2014
> Label: 'home' uuid: [some UUID]
> Total devices 2 FS bytes used 152.83GiB
> devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
> Data, RAID1: total=154.97GiB, used=149.58GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.26GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
>
> So I did, if Virtualbox can write randomly in a file, I can too.
>
> So I did:
>
>
> martin@merkaba:~> cat ssd-test.fio
> [global]
> bs=4k
> #ioengine=libaio
> #iodepth=4
> size=4g
> #direct=1
> runtime=120
> filename=ssd.test.file
>
> [seq-write]
> rw=write
> stonewall
>
> [rand-write]
> rw=randwrite
> stonewall
>
>
>
> And got:
>
> ATOP - merkaba 2014/12/27 13:41:02 ----------- 10s elapsed
> PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct |
> CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% |
> cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% |
> cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 |
> MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> 18079 - martin martin 2 9.99s 0.00s 0K 0K 0K 16K -- - R 1 100% fio
> 4746 - martin martin 2 0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole
> 3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop
> 1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg
> 10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop
>
> while fio was just *laying out* the 4 GiB file. Yes, that's 100% system CPU
> for 10 seconds while allocating a 4 GiB file on a filesystem like:
>
> martin@merkaba:~> LANG=C df -hT /home
> Filesystem Type Size Used Avail Use% Mounted on
> /dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
>
> where a 4 GiB file should easily fit, no? (And this output is with the 4
> GiB file. So it was even 4 GiB more free before.)
>
>
> But it gets even more visible:
>
> martin@merkaba:~> fio ssd-test.fio
> seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.11
> Starting 2 processes
> Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
> 0$ zsh 1$ zsh 2$ zsh 3-$ zsh 4$ zsh 5$* zsh
>
>
> yes, thats 0 IOPS.
>
> 0 IOPS and in zero IOPS. For minutes.
>
>
>
> And here is why:
>
> ATOP - merkaba 2014/12/27 13:46:52 ----------- 10s elapsed
> PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct |
> CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% |
> cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% |
> cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% |
> CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 |
> MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms |
> LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms |
> LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms |
> DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms |
> NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 |
> NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 |
> NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> 14973 - root root 1 8.92s 0.00s 0K 0K 0K 144K -- - S 0 89% kworker/u8:14
> 17450 - root root 1 0.86s 0.00s 0K 0K 0K 32K -- - R 3 9% kworker/u8:5
> 788 - root root 1 0.25s 0.00s 0K 0K 128K 18880K -- - S 3 3% btrfs-transact
> 12254 - root root 1 0.14s 0.00s 0K 0K 64K 576K -- - S 2 1% kworker/u8:3
> 17332 - root root 1 0.11s 0.00s 0K 0K 112K 1348K -- - S 2 1% kworker/u8:4
> 3291 - martin martin 4 0.01s 0.09s 0K 0K 0K 0K -- - S 1 1% plasma-deskto
>
>
>
>
> ATOP - merkaba 2014/12/27 13:47:12 ----------- 10s elapsed
> PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct |
> CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% |
> cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 |
> MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms |
> LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms |
> LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms |
> LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
> DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
> DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
> NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
> NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
> NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> 17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5
> 4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole
> 10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18
> 1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg
> 17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4
> 17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6
> 17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7
> 3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin
> 10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop
>
>
>
> So BTRFS is basically busy with itself and nothing else. Look at the SSD
> usage. They are *idling* around. Heck, 2400 write accesses in 10 seconds.
> That's a joke with SSDs that can do 40000 IOPS (depending on how and what
> you measure of course, like request size, read, write, iodepth and so on).
>
> Its kworker/u8:5 utilizing 100% of one core for minutes.
>
>
>
> Its the random write case it seems. Here are values from fio job:
>
> martin@merkaba:~> fio ssd-test.fio
> seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.11
> Starting 2 processes
> Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> clat percentiles (usec):
> | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
> | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
> | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9],
> | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128],
> | 99.99th=[10304]
> bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Seems fine.
>
>
> But:
>
> rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> clat percentiles (usec):
> | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
> | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6],
> | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10],
> | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116],
> | 99.99th=[16711680]
> bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
>
> Run status group 1 (all jobs):
> WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
>
>
> What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
>
> What?
>
> Ey, *what*?
>
>
>
> Repeating with the random write case.
>
> Its a different kworker now, but similar result:
>
> ATOP - merkaba 2014/12/27 13:51:48 ----------- 10s elapsed
> PRC | sys 10.66s | user 0.25s | #proc 330 | #trun 2 | #tslpi 545 | #tslpu 2 | #zombie 0 | no procacct |
> CPU | sys 105% | user 3% | irq 0% | idle 292% | wait 0% | guest 0% | curf 3.07GHz | curscal 95% |
> cpu | sys 92% | user 0% | irq 0% | idle 8% | cpu002 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> cpu | sys 8% | user 0% | irq 0% | idle 92% | cpu003 w 0% | guest 0% | curf 3.09GHz | curscal 96% |
> cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> cpu | sys 2% | user 1% | irq 0% | idle 97% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> CPL | avg1 1.00 | avg5 1.32 | avg15 1.23 | | csw 34484 | intr 23182 | | numcpu 4 |
> MEM | tot 15.5G | free 5.4G | cache 8.3G | buff 0.0M | slab 334.8M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> LVM | sata-home | busy 1% | read 36 | write 2502 | KiB/w 4 | MBr/s 0.01 | MBw/s 0.98 | avio 0.06 ms |
> LVM | msata-home | busy 1% | read 48 | write 2502 | KiB/w 4 | MBr/s 0.02 | MBw/s 0.98 | avio 0.04 ms |
> LVM | msata-debian | busy 0% | read 0 | write 6 | KiB/w 7 | MBr/s 0.00 | MBw/s 0.00 | avio 1.33 ms |
> LVM | sata-debian | busy 0% | read 0 | write 6 | KiB/w 7 | MBr/s 0.00 | MBw/s 0.00 | avio 0.17 ms |
> DSK | sda | busy 1% | read 36 | write 2494 | KiB/w 4 | MBr/s 0.01 | MBw/s 0.98 | avio 0.06 ms |
> DSK | sdb | busy 1% | read 48 | write 2494 | KiB/w 4 | MBr/s 0.02 | MBw/s 0.98 | avio 0.04 ms |
> NET | transport | tcpi 32 | tcpo 30 | udpi 2 | udpo 2 | tcpao 2 | tcppo 1 | tcprs 0 |
> NET | network | ipi 35 | ipo 32 | ipfrw 0 | deliv 35 | | icmpi 0 | icmpo 0 |
> NET | eth0 0% | pcki 19 | pcko 16 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> 11746 - root root 1 10.00s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:0
> 12254 - root root 1 0.16s 0.00s 0K 0K 112K 1712K -- - S 3 2% kworker/u8:3
> 17517 - root root 1 0.16s 0.00s 0K 0K 144K 1764K -- - S 1 2% kworker/u8:8
>
>
>
> And now the graphical environment is locked. Continuing on TTY1.
>
> Doing another fio job with tee so I can get output easily.
>
> Wow! I wonder whether this is reproducible with a fresh BTRFS with fio stressing it.
>
> Like a 10 GiB BTRFS with 5 GiB fio test file and just letting it run.
>
>
> Okay, I let the final fio job complete and include the output here.
>
>
> Okay, and there we are and I do have sysrq-t figures.
>
> Okay, this is 1.2 MiB xz-packed. So I'd better start a bug report about this
> and attach it there. I dislike cloud URLs that may disappear at some point.
>
>
>
> Now please finally acknowledge that there is an issue. Maybe I was not
> using the correct terms at the beginning, but there is a real issue. I have
> been doing performance work for at least half a decade; I know an issue
> when I see one.
>
>
>
>
> There we go:
>
> Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
I have done more tests.
This is on the same /home after extending it to 170 GiB and balancing it to
btrfs balance start -dusage=80
It has plenty of free space. I updated the bug report and hope it gives an
easy-to-follow summary. The new tests are in:
https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
Pasting below for discussion on the list. Summary: I easily get 38000 (!)
IOPS. It may be worth shrinking back to 160 GiB, but right now that does
not work: the resize fails with "no space left on device". I may try with
165 or 162 GiB.
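The incremental rebalancing used above (the "balance dance" with a rising -dusage threshold) can be sketched as a loop. This is a dry run: it only prints the commands, since the real thing needs root and a mounted BTRFS; the mount point and threshold steps are assumptions, not taken from the thread.

```shell
#!/bin/sh
# Dry-run sketch of the incremental "balance dance": usage-filtered data
# balances with a rising threshold, so mostly-empty chunks are compacted
# first and device space is handed back for new chunk allocations.
# The `echo` makes this a dry run; drop it to execute for real (needs
# root and btrfs-progs). MNT and the threshold steps are assumptions.
MNT=/home
for usage in 5 10 20 40 60 80; do
    cmd="btrfs balance start -dusage=$usage $MNT"
    echo "$cmd"
done
```

Starting low keeps each balance pass cheap: a chunk that is only 5% used costs almost nothing to relocate, while -dusage=80 already rewrites a lot of data.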
So now we have three IOPS figures:
- 254 IOPS in the worst-case scenario
- 4700 IOPS when trying to reproduce the worst-case scenario with a fresh and
small BTRFS
- 38000 IOPS when /home has unused device space to allocate chunks from
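The spread between these figures tracks how much per-device space is still unallocated to chunks. A minimal arithmetic sketch, with the numbers copied from the `btrfs fi sh` outputs in this thread:

```shell
#!/bin/sh
# Unallocated device space = device size - chunk-allocated space (the
# "used" column of `btrfs fi sh`, which counts allocated chunks, not
# file data). Numbers are copied from the outputs in this mail.

# Fully allocated 160 GiB case (the 254 IOPS hangs):
full_slack=$((160 - 160))
# After growing to 170 GiB and rebalancing (the ~38000 IOPS case):
grown_slack=$((170 - 158))

echo "fully allocated: ${full_slack} GiB unallocated per device"
echo "after growing:   ${grown_slack} GiB unallocated per device"
```

When that slack hits zero, BTRFS can no longer extend itself by allocating a fresh chunk and has to find room inside the existing ones, which is where the stalls appear.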
https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
This is another test with my /home, this time while it has enough free
device space to reserve new chunks from.
Remember this output for the case where it hasn't:
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: [some UUID]
Total devices 2 FS bytes used 144.19GiB
devid 1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home
Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=144.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
clat percentiles (usec):
| 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
| 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6],
| 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10],
| 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116],
| 99.99th=[16711680]
bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
[…]
Run status group 1 (all jobs):
WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
That is where I saw this kworker thread at 100% of one Sandybridge core
for minutes, and where the kern.log with sysrq-t triggers in
https://bugzilla.kernel.org/show_bug.cgi?id=90401#c0
was made.
But now, after extending it to 170 GiB and doing some basic rebalancing
(up to btrfs balance start -dusage=80), I have this:
First attempt:
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 13:13:47 CET 2014
Label: 'home' uuid: [some UUID]
Total devices 2 FS bytes used 151.09GiB
devid 1 size 170.00GiB used 158.03GiB path /dev/mapper/msata-home
devid 2 size 170.00GiB used 158.03GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=153.00GiB, used=147.83GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
rand-write: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/84528KB/0KB /s] [0/21.2K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=9987: Sun Dec 28 13:14:32 2014
write: io=4096.0MB, bw=155304KB/s, iops=38826, runt= 27007msec
clat (usec): min=5, max=28202, avg=22.03, stdev=240.04
lat (usec): min=5, max=28202, avg=22.28, stdev=240.16
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8],
| 30.00th=[ 10], 40.00th=[ 11], 50.00th=[ 12], 60.00th=[ 13],
| 70.00th=[ 14], 80.00th=[ 15], 90.00th=[ 17], 95.00th=[ 23],
| 99.00th=[ 93], 99.50th=[ 175], 99.90th=[ 2096], 99.95th=[ 6816],
| 99.99th=[10176]
bw (KB /s): min=76832, max=413616, per=100.00%, avg=156706.75, stdev=101101.26
lat (usec) : 10=29.85%, 20=62.43%, 50=5.74%, 100=1.07%, 250=0.57%
lat (usec) : 500=0.16%, 750=0.04%, 1000=0.02%
lat (msec) : 2=0.02%, 4=0.01%, 10=0.08%, 20=0.01%, 50=0.01%
cpu : usr=12.05%, sys=47.34%, ctx=86985, majf=0, minf=5
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=155304KB/s, minb=155304KB/s, maxb=155304KB/s, mint=27007msec, maxt=27007msec
Second attempt:
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 13:16:19 CET 2014
Label: 'home' uuid: [some UUID]
Total devices 2 FS bytes used 155.11GiB
devid 1 size 170.00GiB used 162.03GiB path /dev/mapper/msata-home
devid 2 size 170.00GiB used 162.03GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=157.00GiB, used=151.83GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.27GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/113.5MB/0KB /s] [0/29.3K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10043: Sun Dec 28 13:17:34 2014
write: io=4096.0MB, bw=145995KB/s, iops=36498, runt= 28729msec
clat (usec): min=4, max=143201, avg=23.95, stdev=518.47
lat (usec): min=4, max=143201, avg=24.13, stdev=518.48
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8],
| 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 13], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 16],
| 99.00th=[ 33], 99.50th=[ 70], 99.90th=[ 5472], 99.95th=[ 8640],
| 99.99th=[20864]
bw (KB /s): min= 4, max=433760, per=100.00%, avg=149179.63, stdev=136784.14
lat (usec) : 10=38.35%, 20=58.99%, 50=1.96%, 100=0.38%, 250=0.16%
lat (usec) : 500=0.03%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.08%, 20=0.02%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=10.25%, sys=42.40%, ctx=42642, majf=0, minf=8
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=145995KB/s, minb=145995KB/s, maxb=145995KB/s, mint=28729msec, maxt=28729msec
Third attempt:
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 13:18:24 CET 2014
Label: 'home' uuid: [some UUID]
Total devices 2 FS bytes used 156.16GiB
devid 1 size 170.00GiB used 160.03GiB path /dev/mapper/msata-home
devid 2 size 170.00GiB used 160.03GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=155.00GiB, used=152.83GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.34GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/195.7MB/0KB /s] [0/50.9K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10058: Sun Dec 28 13:18:59 2014
write: io=4096.0MB, bw=202184KB/s, iops=50545, runt= 20745msec
clat (usec): min=4, max=28261, avg=15.84, stdev=214.59
lat (usec): min=4, max=28261, avg=16.06, stdev=214.78
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8],
| 30.00th=[ 8], 40.00th=[ 9], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 17],
| 99.00th=[ 52], 99.50th=[ 105], 99.90th=[ 426], 99.95th=[ 980],
| 99.99th=[12736]
bw (KB /s): min= 4, max=426344, per=100.00%, avg=207355.30, stdev=105104.72
lat (usec) : 10=41.44%, 20=55.33%, 50=2.17%, 100=0.54%, 250=0.34%
lat (usec) : 500=0.10%, 750=0.02%, 1000=0.01%
lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.01%
cpu : usr=14.15%, sys=59.06%, ctx=81711, majf=0, minf=6
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=202183KB/s, minb=202183KB/s, maxb=202183KB/s, mint=20745msec, maxt=20745msec
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 13:19:15 CET 2014
Label: 'home' uuid: [some UUID]
Total devices 2 FS bytes used 155.16GiB
devid 1 size 170.00GiB used 162.03GiB path /dev/mapper/msata-home
devid 2 size 170.00GiB used 162.03GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=157.00GiB, used=151.85GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
So here BTRFS was fast. It gets into trouble on my /home when it is
almost full, and much more so than with an empty 10 GiB filesystem, as I
have shown in testcase
https://bugzilla.kernel.org/show_bug.cgi?id=90401#c3
There I had:
merkaba:/mnt/btrfsraid1> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
rand-write: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/20924KB/0KB /s] [0/5231/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=6221: Sat Dec 27 15:34:14 2014
write: io=2645.8MB, bw=22546KB/s, iops=5636, runt=120165msec
clat (usec): min=4, max=3054.8K, avg=174.87, stdev=11455.26
lat (usec): min=4, max=3054.8K, avg=175.03, stdev=11455.27
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 6], 10.00th=[ 6], 20.00th=[ 7],
| 30.00th=[ 7], 40.00th=[ 8], 50.00th=[ 9], 60.00th=[ 10],
| 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 14], 95.00th=[ 17],
| 99.00th=[ 30], 99.50th=[ 40], 99.90th=[ 1992], 99.95th=[25984],
| 99.99th=[411648]
bw (KB /s): min= 168, max=70703, per=100.00%, avg=27325.46, stdev=14887.94
lat (usec) : 10=55.81%, 20=41.12%, 50=2.70%, 100=0.14%, 250=0.06%
lat (usec) : 500=0.02%, 750=0.01%, 1000=0.02%
lat (msec) : 2=0.02%, 4=0.02%, 10=0.02%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%, >=2000=0.01%
cpu : usr=1.56%, sys=5.57%, ctx=29822, majf=0, minf=7
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=677303/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=2645.8MB, aggrb=22545KB/s, minb=22545KB/s, maxb=22545KB/s, mint=120165msec, maxt=120165msec
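The small-filesystem testcase can be reproduced along these lines. This is a sketch under assumptions: the image paths, the 10 GiB size, the loop-device names and the mount point are made up for illustration; the fio job file is the one quoted further down in this mail; and the privileged steps are only printed, since they need root.

```shell
#!/bin/sh
# Sketch: build a small two-device BTRFS RAID1 from image files and hit
# it with the 4k sequential + random write fio job used in this thread.

# The fio job file, as quoted in this mail (sequential write pass, then
# random writes into the same 4 GiB file):
cat > ssd-test.fio <<'EOF'
[global]
bs=4k
size=4g
runtime=120
filename=ssd.test.file

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall
EOF

# Privileged steps are printed, not run (they need root). The loop
# device names /dev/loop0 and /dev/loop1 are illustrative; `losetup -f`
# picks whatever is actually free.
for cmd in \
  "truncate -s 10G /tmp/btrfs-a.img /tmp/btrfs-b.img" \
  "losetup -f --show /tmp/btrfs-a.img" \
  "losetup -f --show /tmp/btrfs-b.img" \
  "mkfs.btrfs -m raid1 -d raid1 /dev/loop0 /dev/loop1" \
  "mount -o compress=lzo /dev/loop0 /mnt/btrfsraid1" \
  "fio ssd-test.fio"
do
  echo "$cmd"
done
```

Run the fio job from inside the mount point; repeating it as the filesystem fills toward full chunk allocation should show whether the IOPS collapse is tied to running out of unallocated device space.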
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare)
2014-12-28 13:00 ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
@ 2014-12-28 13:40 ` Martin Steigerwald
2014-12-28 13:56 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 13:40 UTC (permalink / raw)
To: Hugo Mills; +Cc: Robert White, linux-btrfs
Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
> Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> > Summarized at
> >
> > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> >
> > see below. This is reproducible with fio, no need for Windows XP in
> > Virtualbox for reproducing the issue. Next I will try to reproduce with
> > a freshly created filesystem.
> >
> >
> > Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > Hello!
> > > > > >
> > > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > >
> > > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > > bug
> > > > > > report:
> > > > > >
> > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > > >
> > > > > > compress=lzo:
> > > > > (there is no known problem with skinny metadata, it's actually more
> > > > > efficient than the older format. There have been some anecdotes about
> > > > > mixing skinny and fat metadata, but nothing has ever been
> > > > > demonstrated to be problematic.)
> > > > >
> > > > > > merkaba:~> btrfs fi sh /home
> > > > > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > >
> > > > > > Total devices 2 FS bytes used 144.41GiB
> > > > > > devid 1 size 160.00GiB used 160.00GiB path
> > > > > > /dev/mapper/msata-home
> > > > > > devid 2 size 160.00GiB used 160.00GiB path
> > > > > > /dev/mapper/sata-home
> > > > > >
> > > > > > Btrfs v3.17
> > > > > > merkaba:~> btrfs fi df /home
> > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > >
> > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > >
> > > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > > > cause I know no tax return software for Linux which would be suitable
> > > > > > for
> > > > > > Germany and I frankly don´t care about the end of security cause all
> > > > > > surfing and other network access I will do from the Linux box and I
> > > > > > only
> > > > > > run the VM behind a firewall).
> > > > >
> > > > > > And thus I try the balance dance again:
> > > > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > > > >
> > > > > "Balancing" is something you should almost never need to do. It is only
> > > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > > etc.) or for cases when you've radically changed allocation behaviors
> > > > > (like you decided to remove all your VM's or you've decided to remove a
> > > > > mail spool directory full of thousands of tiny files).
> > > > >
> > > > > People run balance all the time because they think they should. They are
> > > > > _usually_ incorrect in that belief.
> > > >
> > > > I only see the lockups of BTRFS if the trees *occupy* all space on the
> > > > device.
> > > No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > space. What's more, balance does *not* balance the metadata trees. The
> > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > data, and you have some 13 GiB of that available for use.
> > >
> > > Now, since you're seeing lockups when the space on your disks is
> > > all allocated I'd say that's a bug. However, you're the *only* person
> > > who's reported this as a regular occurrence. Does this happen with all
> > > filesystems you have, or just this one?
> > >
> > > > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > > > from to *extend* a tree.
> > >
> > > It's not a tree. It's simply space allocation. It's not even space
> > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > saying "I'm going to use this piece of disk for this purpose").
> > >
> > > > This may be a bug, but this is what I see.
> > > >
> > > > And no amount of "you should not balance a BTRFS" will make that
> > > > perception go away.
> > > >
> > > > See, I see the sun coming out on a morning and you tell me "no, it
> > > > doesn´t". Simply that is not going to match my perception.
> > >
> > > Duncan's assertion is correct in its detail. Looking at your space
> >
> > Robert's :)
> >
> > > usage, I would not suggest that running a balance is something you
> > > need to do. Now, since you have these lockups that seem quite
> > > repeatable, there's probably a lurking bug in there, but hacking
> > > around with balance every time you hit it isn't going to get the
> > > problem solved properly.
> > >
> > > I think I would suggest the following:
> > >
> > > - make sure you have some way of logging your dmesg permanently (use
> > > a different filesystem for /var/log, or a serial console, or a
> > > netconsole)
> > >
> > > - when the lockup happens, hit Alt-SysRq-t a few times
> > >
> > > - send the dmesg output here, or post to bugzilla.kernel.org
> > >
> > > That's probably going to give enough information to the developers
> > > to work out where the lockup is happening, and is clearly the way
> > > forward here.
> >
> > And I got it reproduced. *Perfectly* reproduced, I´d say.
> >
> > But let me run the whole story:
> >
> > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
>
> [… story of trying to reproduce with Windows XP defragmenting which was
> unsuccessful as BTRFS still had free device space to allocate new chunks
> from …]
>
> > But finally I got to:
> >
> > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > Sa 27. Dez 13:26:39 CET 2014
> > Label: 'home' uuid: [some UUID]
> > Total devices 2 FS bytes used 152.83GiB
> > devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> >
> > Btrfs v3.17
> > Data, RAID1: total=154.97GiB, used=149.58GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> >
> >
> > So I thought: if Virtualbox can write randomly into a file, I can too.
> >
> > So I did:
> >
> >
> > martin@merkaba:~> cat ssd-test.fio
> > [global]
> > bs=4k
> > #ioengine=libaio
> > #iodepth=4
> > size=4g
> > #direct=1
> > runtime=120
> > filename=ssd.test.file
> >
> > [seq-write]
> > rw=write
> > stonewall
> >
> > [rand-write]
> > rw=randwrite
> > stonewall
> >
> >
> >
> > And got:
> >
> > ATOP - merkaba 2014/12/27 13:41:02 ----------- 10s elapsed
> > PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct |
> > CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% |
> > cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% |
> > cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 |
> > MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M |
> > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> > NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> >
> > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> > 18079 - martin martin 2 9.99s 0.00s 0K 0K 0K 16K -- - R 1 100% fio
> > 4746 - martin martin 2 0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole
> > 3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop
> > 1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg
> > 10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop
> >
> > while fio was just *laying out* the 4 GiB file. Yes, that's 100% system CPU
> > for 10 seconds while allocating a 4 GiB file on a filesystem like:
> >
> > martin@merkaba:~> LANG=C df -hT /home
> > Filesystem Type Size Used Avail Use% Mounted on
> > /dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
> >
> > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > GiB file already laid out, so there was even 4 GiB more free before.)
> >
> >
> > But it gets even more visible:
> >
> > martin@merkaba:~> fio ssd-test.fio
> > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > fio-2.1.11
> > Starting 2 processes
> > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
> > 0$ zsh 1$ zsh 2$ zsh 3-$ zsh 4$ zsh 5$* zsh
> >
> >
> > Yes, that's 0 IOPS.
> >
> > 0 IOPS, as in zero IOPS. For minutes.
> >
> >
> >
> > And here is why:
> >
> > ATOP - merkaba 2014/12/27 13:46:52 ----------- 10s elapsed
> > PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct |
> > CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% |
> > cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% |
> > cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% |
> > CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 |
> > MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms |
> > LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms |
> > LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms |
> > DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms |
> > NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 |
> > NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 |
> > NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> >
> > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > 14973 - root root 1 8.92s 0.00s 0K 0K 0K 144K -- - S 0 89% kworker/u8:14
> > 17450 - root root 1 0.86s 0.00s 0K 0K 0K 32K -- - R 3 9% kworker/u8:5
> > 788 - root root 1 0.25s 0.00s 0K 0K 128K 18880K -- - S 3 3% btrfs-transact
> > 12254 - root root 1 0.14s 0.00s 0K 0K 64K 576K -- - S 2 1% kworker/u8:3
> > 17332 - root root 1 0.11s 0.00s 0K 0K 112K 1348K -- - S 2 1% kworker/u8:4
> > 3291 - martin martin 4 0.01s 0.09s 0K 0K 0K 0K -- - S 1 1% plasma-deskto
> >
> >
> >
> >
> > ATOP - merkaba 2014/12/27 13:47:12 ----------- 10s elapsed
> > PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct |
> > CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% |
> > cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% |
> > cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 |
> > MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms |
> > LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms |
> > LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms |
> > LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
> > DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
> > DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
> > NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
> > NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
> > NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
> > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> >
> > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > 17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5
> > 4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole
> > 10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18
> > 1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg
> > 17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4
> > 17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6
> > 17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7
> > 3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin
> > 10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop
> >
> >
> >
> > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > usage. They are *idling* around. Heck 2400 write accesses in 10 seconds.
> > That's a joke for SSDs that can do 40000 IOPS (depending, of course, on
> > what and how you measure: request size, read vs. write, iodepth and so on).
> >
> > It's kworker/u8:5 utilizing 100% of one core for minutes.
> >
> >
> >
> > It's the random-write case, it seems. Here are the values from the fio job:
> >
> > martin@merkaba:~> fio ssd-test.fio
> > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > fio-2.1.11
> > Starting 2 processes
> > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> > write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> > clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> > lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> > clat percentiles (usec):
> > | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
> > | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
> > | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9],
> > | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128],
> > | 99.99th=[10304]
> > bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> > lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> > lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> > lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> > cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> > latency : target=0, window=0, percentile=100.00%, depth=1
> >
> > Seems fine.
> >
> >
> > But:
> >
> > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> > write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> > clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> > lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> > clat percentiles (usec):
> > | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
> > | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6],
> > | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10],
> > | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116],
> > | 99.99th=[16711680]
> > bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> > lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> > lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> > cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> > latency : target=0, window=0, percentile=100.00%, depth=1
> >
> > Run status group 0 (all jobs):
> > WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> >
> > Run status group 1 (all jobs):
> > WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> >
> >
> > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> >
> > What?
> >
> > Ey, *what*?
[…]
> > There we go:
> >
> > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> I have done more tests.
>
> This is on the same /home after extending it to 170 GiB and balancing it
> with
> btrfs balance start -dusage=80
>
> It now has plenty of free space. I updated the bug report and hope it
> gives an easy-to-follow summary. The new tests are in:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
>
>
>
> Pasting below for discussion on the list. Summary: I easily get 38000 (!)
> IOPS. It may be worth shrinking back to 160 GiB, but right now that does
> not work: it reports "no space left on device" when I try to downsize.
> I may try 165 or 162 GiB.
>
> So now we have three IOPS figures:
>
> - 254 IOPS in the worst-case scenario
> - 4700 IOPS when trying to reproduce the worst-case scenario with a fresh
> and small BTRFS
> - 38000 IOPS when /home has unused device space to allocate chunks from
>
> https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
>
>
> This is another test.
Okay, and this is the last series of tests for today.

Conclusion:

I cannot manage to bring it to its knees as before, but I come close.
Still it's 8000 IOPS instead of 250 IOPS, in a situation that according
to btrfs fi sh is even *worse* than before.

That hints at the need to look at free space fragmentation, since in the
beginning the problem started appearing with:
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: […]
Total devices 2 FS bytes used 144.41GiB
devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Yes, that's about 13 GiB of free space *within* the chunks.
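That figure can be recomputed directly from the btrfs fi df output. A
small sketch, with the Data line from above pasted in as sample input
(on a live system you would pipe the real output in instead):

```shell
# Recompute the in-chunk slack from a "btrfs fi df" Data line.
# The sample line is the one quoted above.
line='Data, RAID1: total=154.97GiB, used=141.12GiB'
echo "$line" | awk -F'[=G]' '{ printf "%.2f GiB free inside data chunks\n", $2 - $4 }'
```

For the sample line this prints 13.85 GiB, i.e. the roughly 13 GiB of
slack mentioned above.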
So while I can lower the IOPS by bringing the filesystem into a state
where it cannot reserve additional data chunks, I cannot recreate the
abysmal 250 IOPS figure this way. Not even with my /home filesystem.

So there is more to it. I think it's important to look into free space
fragmentation. It seems an *aged* filesystem is needed to recreate the
issue. And it seems the balances really helped, as I am not able to
reproduce it to that extent right now.
So my original idea about free device space to allocate from doesn't
fully explain it either. Something going on *within* the chunks must
explain the worst case: <300 IOPS, a kworker using one core for minutes,
and a locked desktop.
Is there a way to view free space fragmentation in BTRFS?
Test log follows, also added to bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=90401#c9
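The ssd-test.fio job file itself never appears in the thread. A
plausible reconstruction, inferred entirely from the parameter lines fio
echoes in the runs below (rw=randwrite, bs=4k, ioengine=sync, iodepth=1,
a 4 GiB file named ssd.test.file, runs capped at roughly 120 s), might
look like this:

```shell
# Hypothetical reconstruction of ssd-test.fio -- the original is not
# shown in the thread; every value is inferred from fio's own output.
cat > ssd-test.fio <<'EOF'
[rand-write]
rw=randwrite
bs=4k
ioengine=sync
iodepth=1
size=4g
runtime=120
filename=ssd.test.file
EOF
```

The earlier two-group run additionally had a [seq-write] section with
rw=write in front of the random-write job.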
Okay, retesting with
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:01:05 CET 2014
Label: 'home' uuid: […]
Total devices 2 FS bytes used 155.15GiB
devid 1 size 163.00GiB used 159.92GiB path /dev/mapper/msata-home
devid 2 size 163.00GiB used 159.92GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=154.95GiB, used=151.84GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
That's just about 3 GiB left to allocate new data chunks from.
First run – all good:
martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/134.2MB/0KB /s] [0/34.4K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10483: Sun Dec 28 14:02:59 2014
write: io=4096.0MB, bw=218101KB/s, iops=54525, runt= 19231msec
clat (usec): min=4, max=20056, avg=14.87, stdev=143.15
lat (usec): min=4, max=20056, avg=15.09, stdev=143.26
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8],
| 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 13], 80.00th=[ 14], 90.00th=[ 15], 95.00th=[ 17],
| 99.00th=[ 52], 99.50th=[ 99], 99.90th=[ 434], 99.95th=[ 980],
| 99.99th=[ 7968]
bw (KB /s): min=62600, max=424456, per=100.00%, avg=218821.63, stdev=93695.28
lat (usec) : 10=38.19%, 20=58.83%, 50=1.90%, 100=0.59%, 250=0.33%
lat (usec) : 500=0.09%, 750=0.03%, 1000=0.01%
lat (msec) : 2=0.02%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01%
cpu : usr=15.50%, sys=61.86%, ctx=93432, majf=0, minf=5
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=218101KB/s, minb=218101KB/s, maxb=218101KB/s, mint=19231msec, maxt=19231msec
Second run:
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:04:01 CET 2014
Label: 'home' uuid: […]
Total devices 2 FS bytes used 155.23GiB
devid 1 size 163.00GiB used 160.95GiB path /dev/mapper/msata-home
devid 2 size 163.00GiB used 160.95GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=155.98GiB, used=151.91GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.32GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/171.3MB/0KB /s] [0/43.9K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10501: Sun Dec 28 14:05:03 2014
write: io=4096.0MB, bw=220637KB/s, iops=55159, runt= 19010msec
clat (usec): min=4, max=20578, avg=14.45, stdev=160.84
lat (usec): min=4, max=20578, avg=14.65, stdev=160.88
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 7],
| 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 17],
| 99.00th=[ 42], 99.50th=[ 79], 99.90th=[ 278], 99.95th=[ 620],
| 99.99th=[ 9792]
bw (KB /s): min= 5, max=454816, per=100.00%, avg=224700.32, stdev=100763.29
lat (usec) : 10=38.15%, 20=58.73%, 50=2.28%, 100=0.47%, 250=0.26%
lat (usec) : 500=0.06%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01%
cpu : usr=15.83%, sys=63.17%, ctx=74934, majf=0, minf=5
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=220636KB/s, minb=220636KB/s, maxb=220636KB/s, mint=19010msec, maxt=19010msec
Okay, now try the same without space for a free chunk to allocate.

The test file is still there; fio doesn't delete and recreate it on
every attempt, but just writes into it:
martin@merkaba:~> ls -l ssd.test.file
-rw-r--r-- 1 martin martin 4294967296 Dez 28 14:05 ssd.test.file
Okay – with one chunk still left to allocate:
merkaba:~> btrfs filesystem resize 1:161G /home
Resize '/home' of '1:161G'
merkaba:~> btrfs filesystem resize 2:161G /home
Resize '/home' of '2:161G'
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:08:45 CET 2014
Label: 'home' uuid: […]
Total devices 2 FS bytes used 155.15GiB
devid 1 size 161.00GiB used 159.92GiB path /dev/mapper/msata-home
devid 2 size 161.00GiB used 159.92GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=154.95GiB, used=151.84GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
I would have liked to make it allocate the chunks by other means, but it
eventually frees them again afterwards, so I did it this way.

Note that the original file is still there; the space it currently
occupies is already taken into account.
Next test:
martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/130.5MB/0KB /s] [0/33.5K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10563: Sun Dec 28 14:10:34 2014
write: io=4096.0MB, bw=210526KB/s, iops=52631, runt= 19923msec
clat (usec): min=4, max=21820, avg=14.78, stdev=119.40
lat (usec): min=4, max=21821, avg=15.03, stdev=120.26
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8],
| 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 11],
| 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 14], 95.00th=[ 17],
| 99.00th=[ 62], 99.50th=[ 131], 99.90th=[ 490], 99.95th=[ 964],
| 99.99th=[ 6816]
bw (KB /s): min= 3, max=410480, per=100.00%, avg=216892.84, stdev=95620.33
lat (usec) : 10=33.20%, 20=63.71%, 50=1.86%, 100=0.59%, 250=0.42%
lat (usec) : 500=0.12%, 750=0.03%, 1000=0.01%
lat (msec) : 2=0.02%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
cpu : usr=15.13%, sys=62.74%, ctx=94346, majf=0, minf=5
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=210525KB/s, minb=210525KB/s, maxb=210525KB/s, mint=19923msec, maxt=19923msec
Okay, this is still good.
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:11:18 CET 2014
Label: 'home' uuid: […]
Total devices 2 FS bytes used 155.17GiB
devid 1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
devid 2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=155.94GiB, used=151.86GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Now there is no space left to reserve additional chunks. Another test:
martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/152.3MB/0KB /s] [0/38.1K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10580: Sun Dec 28 14:13:26 2014
write: io=4096.0MB, bw=225804KB/s, iops=56450, runt= 18575msec
clat (usec): min=4, max=16669, avg=13.66, stdev=72.88
lat (usec): min=4, max=16669, avg=13.89, stdev=73.06
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8],
| 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 13], 80.00th=[ 14], 90.00th=[ 15], 95.00th=[ 20],
| 99.00th=[ 65], 99.50th=[ 113], 99.90th=[ 314], 99.95th=[ 506],
| 99.99th=[ 2768]
bw (KB /s): min= 4, max=444568, per=100.00%, avg=231326.97, stdev=93374.31
lat (usec) : 10=36.50%, 20=58.44%, 50=3.73%, 100=0.76%, 250=0.44%
lat (usec) : 500=0.09%, 750=0.02%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
cpu : usr=16.35%, sys=68.39%, ctx=127221, majf=0, minf=5
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=225803KB/s, minb=225803KB/s, maxb=225803KB/s, mint=18575msec, maxt=18575msec
Okay, this still does not trigger it.
Another test – it even freed a chunk:
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:14:21 CET 2014
Label: 'home' uuid: […]
Total devices 2 FS bytes used 155.28GiB
devid 1 size 161.00GiB used 160.85GiB path /dev/mapper/msata-home
devid 2 size 161.00GiB used 160.85GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=155.89GiB, used=151.97GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Still good:
martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/156.5MB/0KB /s] [0/40.6K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10589: Sun Dec 28 14:14:37 2014
write: io=4096.0MB, bw=161121KB/s, iops=40280, runt= 26032msec
clat (usec): min=4, max=1228.9K, avg=15.69, stdev=1205.88
lat (usec): min=4, max=1228.9K, avg=15.92, stdev=1205.90
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8],
| 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 13], 80.00th=[ 14], 90.00th=[ 15], 95.00th=[ 19],
| 99.00th=[ 53], 99.50th=[ 96], 99.90th=[ 366], 99.95th=[ 764],
| 99.99th=[ 7776]
bw (KB /s): min= 0, max=431680, per=100.00%, avg=219856.30, stdev=98172.64
lat (usec) : 10=39.24%, 20=55.83%, 50=3.81%, 100=0.63%, 250=0.33%
lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 2000=0.01%
cpu : usr=11.50%, sys=61.08%, ctx=123428, majf=0, minf=6
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=161121KB/s, minb=161121KB/s, maxb=161121KB/s, mint=26032msec, maxt=26032msec
Okay, let's allocate one GiB with fallocate to make free space tighter:
martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-1
0.00user 0.09system 0:00.11elapsed 86%CPU (0avgtext+0avgdata 1752maxresident)k
112inputs+64outputs (1major+89minor)pagefaults 0swaps
martin@merkaba:~> ls -lh 1g-1
-rw-r--r-- 1 martin martin 1,0G Dez 28 14:16 1g
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:16:24 CET 2014
Label: 'home' uuid: […]
Total devices 2 FS bytes used 156.15GiB
devid 1 size 161.00GiB used 160.94GiB path /dev/mapper/msata-home
devid 2 size 161.00GiB used 160.94GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=155.97GiB, used=152.84GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Still not:
martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/132.1MB/0KB /s] [0/34.4K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10632: Sun Dec 28 14:17:12 2014
write: io=4096.0MB, bw=198773KB/s, iops=49693, runt= 21101msec
clat (usec): min=4, max=543255, avg=16.27, stdev=563.85
lat (usec): min=4, max=543255, avg=16.48, stdev=563.91
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8],
| 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 14], 95.00th=[ 17],
| 99.00th=[ 49], 99.50th=[ 98], 99.90th=[ 386], 99.95th=[ 828],
| 99.99th=[10816]
bw (KB /s): min= 4, max=444848, per=100.00%, avg=203909.07, stdev=109502.11
lat (usec) : 10=33.97%, 20=62.99%, 50=2.05%, 100=0.51%, 250=0.33%
lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 750=0.01%
cpu : usr=14.21%, sys=60.44%, ctx=70273, majf=0, minf=6
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=198772KB/s, minb=198772KB/s, maxb=198772KB/s, mint=21101msec, maxt=21101msec
Another 1 GiB file.
Got it:
ATOP - merkaba 2014/12/28 14:18:14 ----------- 10s elapsed
PRC | sys 21.74s | user 2.48s | #proc 382 | #trun 8 | #tslpi 698 | #tslpu 1 | #zombie 0 | no procacct |
CPU | sys 218% | user 24% | irq 1% | idle 155% | wait 2% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 75% | user 5% | irq 0% | idle 19% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 59% | user 3% | irq 0% | idle 37% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 48% | user 6% | irq 1% | idle 45% | cpu000 w 1% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 36% | user 9% | irq 0% | idle 54% | cpu002 w 1% | guest 0% | curf 3.00GHz | curscal 93% |
CPL | avg1 2.13 | avg5 2.37 | avg15 1.92 | | csw 67473 | intr 59152 | | numcpu 4 |
MEM | tot 15.5G | free 1.1G | cache 11.0G | buff 0.1M | slab 740.2M | shmem 190.9M | vmbal 0.0M | hptot 0.0M |
SWP | tot 12.0G | free 11.4G | | | | | vmcom 5.4G | vmlim 19.7G |
PAG | scan 0 | steal 0 | stall 1 | | | | swin 19 | swout 0 |
LVM | sata-home | busy 8% | read 4 | write 26062 | KiB/w 3 | MBr/s 0.00 | MBw/s 10.18 | avio 0.03 ms |
LVM | msata-home | busy 5% | read 4 | write 26062 | KiB/w 3 | MBr/s 0.00 | MBw/s 10.18 | avio 0.02 ms |
LVM | sata-swap | busy 0% | read 19 | write 0 | KiB/w 0 | MBr/s 0.01 | MBw/s 0.00 | avio 0.05 ms |
LVM | msata-debian | busy 0% | read 0 | write 4 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms |
LVM | sata-debian | busy 0% | read 0 | write 4 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms |
DSK | sda | busy 8% | read 23 | write 13239 | KiB/w 7 | MBr/s 0.01 | MBw/s 10.18 | avio 0.06 ms |
DSK | sdb | busy 5% | read 4 | write 14360 | KiB/w 7 | MBr/s 0.00 | MBw/s 10.18 | avio 0.04 ms |
NET | transport | tcpi 18 | tcpo 18 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
NET | network | ipi 18 | ipo 18 | ipfrw 0 | deliv 18 | | icmpi 0 | icmpo 0 |
NET | eth0 0% | pcki 2 | pcko 2 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/5
10657 - martin martin 1 9.88s 0.00s 0K 0K 0K 48K -- - R 1 99% fallocate
9685 - root root 1 9.84s 0.00s 0K 0K 0K 0K -- - D 0 99% kworker/u8:10
martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-2 ; ls -l 1g*
0.00user 59.28system 1:00.21elapsed 98%CPU (0avgtext+0avgdata 1756maxresident)k
0inputs+416outputs (0major+90minor)pagefaults 0swaps
-rw-r--r-- 1 martin martin 1073741824 Dez 28 14:16 1g-1
-rw-r--r-- 1 martin martin 1073741824 Dez 28 14:17 1g-2
One minute of system CPU time for this.
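The same probe can be repeated harmlessly elsewhere. A small sketch, with
1 MiB instead of 1 GiB and a temporary directory instead of /home (it
assumes the filesystem there supports fallocate):

```shell
# Small-scale version of the fallocate probe above: preallocate a file
# and verify the requested size actually landed.
dir=$(mktemp -d)
fallocate -l 1M "$dir/probe"
stat -c '%s' "$dir/probe"   # prints the file size in bytes
rm -rf "$dir"
```

On a healthy filesystem this returns almost instantly; the one minute of
system CPU above is what made the 1 GiB case on the full BTRFS notable.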
Okay, so now another test:
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:19:30 CET 2014
Label: 'home' uuid: […]
Total devices 2 FS bytes used 157.18GiB
devid 1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
devid 2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=155.94GiB, used=153.87GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
I admit this really isn't nice to it anymore, but I want to see where it
starts to become an issue.
ATOP - merkaba 2014/12/28 14:21:18 ----------- 1s elapsed
PRC | sys 1.15s | user 0.16s | #proc 382 | #trun 2 | #tslpi 707 | #tslpu 1 | #zombie 0 | clones 0 | | no procacct |
CPU | sys 163% | user 24% | irq 1% | idle 189% | wait 26% | | steal 0% | guest 0% | curf 3.01GHz | curscal 94% |
cpu | sys 72% | user 1% | irq 0% | idle 25% | cpu001 w 1% | | steal 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 41% | user 9% | irq 0% | idle 32% | cpu002 w 19% | | steal 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 34% | user 10% | irq 0% | idle 53% | cpu003 w 3% | | steal 0% | guest 0% | curf 3.03GHz | curscal 94% |
cpu | sys 16% | user 4% | irq 0% | idle 77% | cpu000 w 3% | | steal 0% | guest 0% | curf 3.00GHz | curscal 93% |
CPL | avg1 2.37 | avg5 2.64 | avg15 2.13 | | | csw 18687 | intr 12435 | | | numcpu 4 |
MEM | tot 15.5G | free 2.5G | cache 9.5G | buff 0.1M | slab 742.6M | shmem 242.8M | shrss 115.5M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
SWP | tot 12.0G | free 11.4G | | | | | | | vmcom 5.5G | vmlim 19.7G |
LVM | msata-home | busy 71% | read 28 | write 8134 | KiB/r 4 | KiB/w 3 | MBr/s 0.11 | MBw/s 31.68 | avq 13.21 | avio 0.06 ms |
LVM | sata-home | busy 40% | read 72 | write 8135 | KiB/r 4 | KiB/w 3 | MBr/s 0.28 | MBw/s 31.69 | avq 41.67 | avio 0.03 ms |
DSK | sdb | busy 71% | read 24 | write 6049 | KiB/r 4 | KiB/w 5 | MBr/s 0.11 | MBw/s 31.68 | avq 5.64 | avio 0.08 ms |
DSK | sda | busy 38% | read 60 | write 5987 | KiB/r 4 | KiB/w 5 | MBr/s 0.28 | MBw/s 31.69 | avq 20.40 | avio 0.04 ms |
NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | tcpie 0 | udpie 0 |
NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | | | icmpi 0 | icmpo 0 |
NET | lo ---- | pcki 16 | pcko 16 | si 20 Kbps | so 20 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
10459 - root root 1 0.70s 0.00s 0K 0K 0K 0K -- - D 3 100% kworker/u8:17
10674 - martin martin 1 0.20s 0.01s 0K 0K 0K 27504K -- - R 0 30% fio
Okay.

It is jumping between 0 IOPS and 40000 IOPS, with one kworker hogging
100% of a core. Still quite okay in terms of IOPS though:
martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/126.8MB/0KB /s] [0/32.5K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10674: Sun Dec 28 14:22:16 2014
write: io=3801.3MB, bw=32415KB/s, iops=8103, runt=120083msec
clat (usec): min=4, max=1809.9K, avg=83.88, stdev=3615.98
lat (usec): min=4, max=1809.9K, avg=84.10, stdev=3616.00
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8],
| 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 11],
| 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 18],
| 99.00th=[ 52], 99.50th=[ 124], 99.90th=[24704], 99.95th=[30592],
| 99.99th=[57088]
bw (KB /s): min= 0, max=417544, per=100.00%, avg=48302.16, stdev=89108.07
lat (usec) : 10=35.61%, 20=60.17%, 50=3.17%, 100=0.47%, 250=0.27%
lat (usec) : 500=0.05%, 750=0.02%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.04%, 50=0.16%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%
cpu : usr=2.37%, sys=29.74%, ctx=202984, majf=0, minf=6
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=973128/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=3801.3MB, aggrb=32415KB/s, minb=32415KB/s, maxb=32415KB/s, mint=120083msec, maxt=120083msec
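While such a run crawls along, the culprit kworker can also be spotted
without atop; plain ps is enough (the exact PIDs are irrelevant, the
point is the kworker/u8:* thread near the top):

```shell
# Show the five biggest CPU consumers; during the stall described above
# a kworker/u8:* thread sits near the top at ~100% of one core.
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 5
```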
I'll stop this here for now.

It's interesting to see that even with:
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:23:11 CET 2014
Label: 'home' uuid: […]
Total devices 2 FS bytes used 157.89GiB
devid 1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
devid 2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=155.94GiB, used=154.59GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
I am not able to fully reproduce it. I can partly reproduce it, but it
behaves much better than before.

I think to go further one needs to look at the free space fragmentation
inside the chunks.
Because in the beginning I had this on a 3.18 kernel on Debian Sid with
a BTRFS dual-SSD RAID with space_cache, skinny metadata extents – are
these a problem? – and compress=lzo:
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: […]
Total devices 2 FS bytes used 144.41GiB
devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
So I had quite some free space *within* the chunks, and it was still a
problem.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
2014-12-28 13:40 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare) Martin Steigerwald
@ 2014-12-28 13:56 ` Martin Steigerwald
2014-12-28 15:00 ` Martin Steigerwald
2014-12-29 9:25 ` Martin Steigerwald
0 siblings, 2 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 13:56 UTC (permalink / raw)
To: Hugo Mills; +Cc: Robert White, linux-btrfs
On Sunday, 28 December 2014, 14:40:32, Martin Steigerwald wrote:
> On Sunday, 28 December 2014, 14:00:19, Martin Steigerwald wrote:
> > On Saturday, 27 December 2014, 14:55:58, Martin Steigerwald wrote:
> > > Summarized at
> > >
> > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > >
> > > see below. This is reproducible with fio; no need for Windows XP in
> > > VirtualBox to reproduce the issue. Next I will try to reproduce with
> > > a freshly created filesystem.
> > >
> > >
> > > > On Saturday, 27 December 2014, 09:30:43, Hugo Mills wrote:
> > > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > > On Friday, 26 December 2014, 14:48:38, Robert White wrote:
> > > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > > Hello!
> > > > > > >
> > > > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > > >
> > > > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > > > bug
> > > > > > > report:
> > > > > > >
> > > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > > > >
> > > > > > > compress=lzo:
> > > > > > (there is no known problem with skinny metadata; it's actually more
> > > > > > efficient than the older format. There have been some anecdotes about
> > > > > > mixing the skinny and fat metadata, but nothing has ever been
> > > > > > demonstrated to be problematic.)
> > > > > >
> > > > > > > merkaba:~> btrfs fi sh /home
> > > > > > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > >
> > > > > > > Total devices 2 FS bytes used 144.41GiB
> > > > > > > devid 1 size 160.00GiB used 160.00GiB path
> > > > > > > /dev/mapper/msata-home
> > > > > > > devid 2 size 160.00GiB used 160.00GiB path
> > > > > > > /dev/mapper/sata-home
> > > > > > >
> > > > > > > Btrfs v3.17
> > > > > > > merkaba:~> btrfs fi df /home
> > > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > >
> > > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > > >
> > > > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > > > return software in a VirtualBox'd Windows XP VM (which I use once a year
> > > > > > > cause I know no tax return software for Linux which would be suitable
> > > > > > > for
> > > > > > > Germany and I frankly don't care about the end of security cause all
> > > > > > > surfing and other network access I will do from the Linux box and I
> > > > > > > only
> > > > > > > run the VM behind a firewall).
> > > > > >
> > > > > > > And thus I try the balance dance again:
> > > > > > ITEM: Balance... it doesn't do what you think it does...
> > > > > >
> > > > > > "Balancing" is something you should almost never need to do. It is only
> > > > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > > > etc.) or for cases when you've radically changed allocation behaviors
> > > > > > (like you decided to remove all your VMs or you've decided to remove a
> > > > > > mail spool directory full of thousands of tiny files).
> > > > > >
> > > > > > People run balance all the time because they think they should. They are
> > > > > > _usually_ incorrect in that belief.
> > > > >
> > > > > I only see the lockups of BTRFS if the trees *occupy* all space on the
> > > > > device.
> > > > No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > > space. What's more, balance does *not* balance the metadata trees. The
> > > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > > data, and you have some 13 GiB of that available for use.
> > > >
> > > > Now, since you're seeing lockups when the space on your disks is
> > > > all allocated I'd say that's a bug. However, you're the *only* person
> > > > who's reported this as a regular occurrence. Does this happen with all
> > > > filesystems you have, or just this one?
> > > >
> > > > > So far I *never* saw it lock up if there is still space BTRFS can
> > > > > allocate from to *extend* a tree.
> > > >
> > > > It's not a tree. It's simply space allocation. It's not even space
> > > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > > saying "I'm going to use this piece of disk for this purpose").
> > > >
> > > > > This may be a bug, but this is what I see.
> > > > >
> > > > > And no amount of "you should not balance a BTRFS" will make that
> > > > > perception go away.
> > > > >
> > > > > See, I see the sun coming out on a morning and you tell me "no, it
> > > > > doesn´t". Simply that is not going to match my perception.
> > > >
> > > > Duncan's assertion is correct in its detail. Looking at your space
> > >
> > > Robert's
> > >
> > > > usage, I would not suggest that running a balance is something you
> > > > need to do. Now, since you have these lockups that seem quite
> > > > repeatable, there's probably a lurking bug in there, but hacking
> > > > around with balance every time you hit it isn't going to get the
> > > > problem solved properly.
> > > >
> > > > I think I would suggest the following:
> > > >
> > > > - make sure you have some way of logging your dmesg permanently (use
> > > > a different filesystem for /var/log, or a serial console, or a
> > > > netconsole)
> > > >
> > > > - when the lockup happens, hit Alt-SysRq-t a few times
> > > >
> > > > - send the dmesg output here, or post to bugzilla.kernel.org
> > > >
> > > > That's probably going to give enough information to the developers
> > > > to work out where the lockup is happening, and is clearly the way
> > > > forward here.
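
As an aside, the netconsole option Hugo mentions is set up via a module parameter; here is a hypothetical sketch (all addresses, ports, and interface names below are placeholders, and the file path is an assumption, e.g. /etc/modprobe.d/netconsole.conf):

```
# Placeholder values -- replace with your own source/target addresses.
# Parameter format: <src-port>@<src-ip>/<src-dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
options netconsole netconsole=6666@192.168.0.10/eth0,514@192.168.0.20/aa:bb:cc:dd:ee:ff
```

On the receiving host, something like `nc -u -l 514` (flag syntax varies between netcat flavors) collects the streamed kernel messages.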
> > >
> > > And I got it reproduced. *Perfectly* reproduced, I´d say.
> > >
> > > But let me run the whole story:
> > >
> > > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> >
> > [… story of trying to reproduce with Windows XP defragmenting which was
> > unsuccessful as BTRFS still had free device space to allocate new chunks
> > from …]
> >
> > > But finally I got to:
> > >
> > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > Sa 27. Dez 13:26:39 CET 2014
> > > Label: 'home' uuid: [some UUID]
> > > Total devices 2 FS bytes used 152.83GiB
> > > devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > > devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > >
> > > Btrfs v3.17
> > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > >
> > >
> > >
> > > So I figured: if Virtualbox can write randomly into a file, I can too.
> > >
> > > So I did:
> > >
> > >
> > > martin@merkaba:~> cat ssd-test.fio
> > > [global]
> > > bs=4k
> > > #ioengine=libaio
> > > #iodepth=4
> > > size=4g
> > > #direct=1
> > > runtime=120
> > > filename=ssd.test.file
> > >
> > > [seq-write]
> > > rw=write
> > > stonewall
> > >
> > > [rand-write]
> > > rw=randwrite
> > > stonewall
> > >
> > >
> > >
> > > And got:
> > >
> > > ATOP - merkaba 2014/12/27 13:41:02 ----------- 10s elapsed
> > > PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct |
> > > CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% |
> > > cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% |
> > > cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> > > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 |
> > > MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M |
> > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > > DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > > NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> > > NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > >
> > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> > > 18079 - martin martin 2 9.99s 0.00s 0K 0K 0K 16K -- - R 1 100% fio
> > > 4746 - martin martin 2 0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole
> > > 3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop
> > > 1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg
> > > 10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop
> > >
> > > while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
> > > for 10 seconds while allocating a 4 GiB file on a filesystem like:
> > >
> > > martin@merkaba:~> LANG=C df -hT /home
> > > Filesystem Type Size Used Avail Use% Mounted on
> > > /dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
> > >
> > > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > > GiB file. So it was even 4 GiB more free before.)
> > >
> > >
> > > But it gets even more visible:
> > >
> > > martin@merkaba:~> fio ssd-test.fio
> > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > fio-2.1.11
> > > Starting 2 processes
> > > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
> > >
> > >
> > > Yes, that's 0 IOPS.
> > >
> > > 0 IOPS, as in zero IOPS. For minutes.
> > >
> > >
> > >
> > > And here is why:
> > >
> > > ATOP - merkaba 2014/12/27 13:46:52 ----------- 10s elapsed
> > > PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct |
> > > CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% |
> > > cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% |
> > > cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% |
> > > CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 |
> > > MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms |
> > > LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms |
> > > LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > > LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > > DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms |
> > > DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms |
> > > NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 |
> > > NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 |
> > > NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > >
> > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > > 14973 - root root 1 8.92s 0.00s 0K 0K 0K 144K -- - S 0 89% kworker/u8:14
> > > 17450 - root root 1 0.86s 0.00s 0K 0K 0K 32K -- - R 3 9% kworker/u8:5
> > > 788 - root root 1 0.25s 0.00s 0K 0K 128K 18880K -- - S 3 3% btrfs-transact
> > > 12254 - root root 1 0.14s 0.00s 0K 0K 64K 576K -- - S 2 1% kworker/u8:3
> > > 17332 - root root 1 0.11s 0.00s 0K 0K 112K 1348K -- - S 2 1% kworker/u8:4
> > > 3291 - martin martin 4 0.01s 0.09s 0K 0K 0K 0K -- - S 1 1% plasma-deskto
> > >
> > >
> > >
> > >
> > > ATOP - merkaba 2014/12/27 13:47:12 ----------- 10s elapsed
> > > PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct |
> > > CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% |
> > > cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% |
> > > cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 |
> > > MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms |
> > > LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms |
> > > LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms |
> > > LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
> > > DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
> > > DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
> > > NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
> > > NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
> > > NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
> > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > >
> > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > > 17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5
> > > 4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole
> > > 10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18
> > > 1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg
> > > 17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4
> > > 17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6
> > > 17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7
> > > 3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin
> > > 10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop
> > >
> > >
> > >
> > > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > > usage: the SSDs are *idling* around. Heck, 2400 write accesses in 10 seconds.
> > > That's a joke with SSDs that can do 40000 IOPS (depending on how and what
> > > you measure of course, like request size, read, write, iodepth and so on).
> > >
> > > It's kworker/u8:5 utilizing 100% of one core for minutes.
> > >
> > >
> > >
> > > It's the random write case, it seems. Here are the values from the fio job:
> > >
> > > martin@merkaba:~> fio ssd-test.fio
> > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > fio-2.1.11
> > > Starting 2 processes
> > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> > > write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> > > clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> > > lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> > > clat percentiles (usec):
> > > | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
> > > | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
> > > | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9],
> > > | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128],
> > > | 99.99th=[10304]
> > > bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> > > lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> > > lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> > > lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> > > cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> > > latency : target=0, window=0, percentile=100.00%, depth=1
> > >
> > > Seems fine.
> > >
> > >
> > > But:
> > >
> > > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> > > write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> > > clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> > > lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> > > clat percentiles (usec):
> > > | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
> > > | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6],
> > > | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10],
> > > | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116],
> > > | 99.99th=[16711680]
> > > bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> > > lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> > > lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> > > cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> > > latency : target=0, window=0, percentile=100.00%, depth=1
> > >
> > > Run status group 0 (all jobs):
> > > WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > >
> > > Run status group 1 (all jobs):
> > > WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
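
As a cross-check, the quoted IOPS and bandwidth follow directly from the issued write count and runtime in the rand-write summary (a quick shell sketch; values copied from the fio output above):

```shell
# 35084 completed 4 KiB writes in 137803 ms, from the rand-write summary.
writes=35084
runtime_ms=137803
iops=$(( writes * 1000 / runtime_ms ))        # -> 254 IOPS
bw_kb=$(( writes * 4 * 1000 / runtime_ms ))   # 4 KiB per write -> ~1018 KB/s
echo "$iops IOPS, ~$bw_kb KB/s"
```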
> > >
> > >
> > > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> > >
> > > What?
> > >
> > > Ey, *what*?
> […]
> > > There we go:
> > >
> > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> >
> > I have done more tests.
> >
> > This is on the same /home after extending it to 170 GiB and balancing it
> > with btrfs balance start -dusage=80
> >
> > It has plenty of free space. I updated the bug report and hope it gives an
> > easy-to-comprehend summary. The new tests are in:
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> >
> >
> >
> > Pasting below for discussion on list. Summary: I easily get 38000 (!)
> > IOPS. It may be an idea to reduce to 160 GiB, but right now this does
> > not work as it says no free space on device when trying to downsize it.
> > I may try with 165 or 162 GiB.
> >
> > So now we have three IOPS figures:
> >
> > - 256 IOPS in worst case scenario
> > - 4700 IOPS when trying to reproduce worst case scenario with a fresh and small
> > BTRFS
> > - 38000 IOPS when /home has unused device space to allocate chunks from
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> >
> >
> > This is another test.
>
>
> Okay, and this is the last series of tests for today.
>
> Conclusion:
>
> I cannot manage to bring it to its knees as before, but I come close to it.
>
> Still it's 8000 IOPS, instead of 250 IOPS, in a situation that, according
> to btrfs fi sh, is even *worse* than before.
>
> That hints at the need to look at the free space fragmentation, as in the
> beginning the problem started appearing with:
>
> merkaba:~> btrfs fi sh /home
> Label: 'home' uuid: […]
> Total devices 2 FS bytes used 144.41GiB
> devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=154.97GiB, used=141.12GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.29GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
>
> Yes, that's 13 GiB of free space *within* the chunks.
>
> So while I can drive down the IOPS by bringing it into a situation where it
> cannot reserve additional data chunks again, I cannot recreate the
> abysmal 250 IOPS figure this way. Not even with my /home filesystem.
>
> So there is more to it. I think it's important to look into free space
> fragmentation. It seems it needs an *aged* filesystem to recreate. And it
> seems the balances really helped, as I am not able to recreate the
> issue to that extent right now.
>
> So this shows my original idea about free device space to allocate from
> also doesn't explain it fully. It seems to be something that's going on
> within the chunks that explains the worst case: <300 IOPS, a kworker using
> one core for minutes, and a locked desktop.
>
> Is there a way to view free space fragmentation in BTRFS?
So to rephrase that:
From what I perceive the worst case issue happens when
1) BTRFS cannot reserve any new chunks from unused device space anymore.
2) The free space in the existing chunks is highly fragmented.
Either one of those conditions alone is not sufficient to trigger it.
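
Condition 1 can be checked mechanically from `btrfs fi show` output. Here is a hedged sketch that parses the output quoted earlier in this thread via a heredoc; on a live system you would pipe in `btrfs fi show /home` instead:

```shell
# Sum (size - used) over all devid lines; 0 means no unallocated device
# space is left, i.e. condition 1 holds. Input is sample output from above.
unalloc=$(awk '/^[[:space:]]*devid/ {
                 gsub("GiB","",$4); gsub("GiB","",$6); s += $4 - $6
               }
               END { print s + 0 }' <<'EOF'
        Total devices 2 FS bytes used 144.41GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
EOF
)
echo "unallocated: ${unalloc} GiB"
```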
That's at least my current idea about it.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 12:07 ` Martin Steigerwald
@ 2014-12-28 14:52 ` Robert White
2014-12-28 15:42 ` Martin Steigerwald
0 siblings, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-28 14:52 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Bardur Arantsson, linux-btrfs
On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
>> Now:
>>
>> The complaining party has verified the minimum, repeatable case of
>> simple file allocation on a very fragmented system and the responding
>> party and several others have understood and supported the bug.
>
> I didn´t yet provide such a test case.
My bad.
>
> At the moment I can only reproduce this kworker thread using a CPU for
> minutes case with my /home filesystem.
>
> A minimal test case for me would be to be able to reproduce it with a
> fresh BTRFS filesystem. But so far with my test case on a fresh BTRFS I
> get 4800 instead of 270 IOPS.
>
A version of the test case to demonstrate absolutely system-clogging
loads is pretty easy to construct.
Make a raid1 filesystem.
Balance it once to make sure the seed filesystem is fully integrated.
Create a bunch of small files that are at least 4K in size, but are
randomly sized. Fill the entire filesystem with them.
BASH Script:

typeset -i counter=0
while dd if=/dev/urandom of=/mnt/Work/$((++counter)) \
      bs=$((4096 + RANDOM)) count=1 2>/dev/null
do
  echo $counter >/dev/null # basically a no-op
done
The while will exit when the dd encounters a full filesystem.
Then delete ~10% of the files with
rm *0
Run the while loop again, then delete a different 10% with "rm *1".
Then again with rm *2, etc...
Do this a few times and with each iteration the CPU usage gets worse and
worse. You'll easily get system-wide stalls on all IO tasks lasting ten
or more seconds.
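
Collected into one script, the fill/delete recipe above might look like this. This is a sketch only: MAX_FILES caps the run so it ages a throwaway scratch directory instead of driving a real filesystem to ENOSPC, and only three delete/refill rounds are shown instead of all ten:

```shell
#!/bin/bash
work=$(mktemp -d)          # scratch directory standing in for /mnt/Work
MAX_FILES=50               # demo cap; the original loops until dd hits ENOSPC
fill() {
    local -i counter=0
    while (( ++counter <= MAX_FILES )); do
        dd if=/dev/urandom of="$work/$counter" bs=$((4096 + RANDOM)) \
           count=1 2>/dev/null || break   # stop early on a full filesystem
    done
}
fill                        # initial fill
for digit in 0 1 2; do      # the original cycles through all ten digits
    rm -f "$work"/*"$digit" # delete ~10% of the files by name suffix
    fill                    # refill the freed, now fragmented, space
done
nfiles=$(ls "$work" | wc -l)
echo "$nfiles files after aging"
rm -rf "$work"
```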
I don't have enough spare storage to do this directly, so I used
loopback devices. First I did it with the loopback files in COW mode.
Then I did it again with the files in NOCOW mode. (the COW files got
thick with overwrite real fast. 8-)
So anyway...
After I got through all ten digits on the rm (that is removing *0, then
refilling, then *1 etc...) I figured the FS image was nicely fragmented.
At that point it was very easy to spike the kworker to 100% CPU with
dd if=/dev/urandom of=/mnt/Work/scratch bs=40k
The dd would read 40k (a CPU spike for /dev/urandom processing), then it
would write the 40k, and the kworker would peg one CPU at 100% and stay
there for a while. Then it would be back to the /dev/urandom spike.
So this laptop has been carefully detuned to prevent certain kinds of
stalls (particularly the moveablecore= reservation, as previously
mentioned, to prevent non-responsiveness of the UI) and I had to go
through /dev/loop so that had a smoothing effect... but yep, there were
clear kworker spikes that _did_ stop the IO path (the system monitor app,
for instance, could not get I/O statistics for ten and fifteen second
intervals and would stop logging/scrolling).
Progressively larger block sizes on the write path made things
progressively worse...
dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
And overwriting the file by just invoking dd again was worse still
(presumably from the juggling act) before resulting in a net
out-of-space condition.
Switching from /dev/urandom to /dev/zero for writing the large file made
things worse still -- probably since there were no respites for the
kworker to catch up etc.
ASIDE: Playing with /proc/sys/vm/dirty_{background_,}ratio had lots of
interesting and difficult to quantify effects on user-space
applications. Cutting in half (5 and 10 instead of 10 and 20
respectively) seemed to give some relief, but going further became harmful
quickly. Diverging the two numbers had odd effects too. But it seemed a
little brittle to play with these settings.
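
For reference, the halved values could be persisted as a sysctl fragment (an assumed sketch, path included; given the brittleness just noted, treat these as experiment values, not a recommendation):

```
# Assumed path: /etc/sysctl.d/99-writeback.conf
# Half the kernel defaults of 10/20 used in the experiment above.
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
```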
SUPER FREAKY THING...
Every time I removed and recreated "scratch" I would get _radically_
different results for how much I could write into that remaining space
and how long it took to do so. In theory I am reusing the exact same
storage again and again. I'm not doing compression (the underlying
filesystem behind the loop devices has compression, but that would be
disabled by the +C attribute). It's not enough space coming-and-going to
cause data extents to be reclaimed or displaced by metadata. And the
filesystem is otherwise completely unused.
But check it out...
Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 1.4952 s, 186 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 292.135 s, 953 kB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
dd: error writing ‘/mnt/Work/scratch’: No space left on device
93+0 records in
92+0 records out
15073280 bytes (15 MB) copied, 0.0453977 s, 332 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
dd: error writing ‘/mnt/Work/scratch’: No space left on device
1090+0 records in
1089+0 records out
178421760 bytes (178 MB) copied, 115.991 s, 1.5 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
dd: error writing ‘/mnt/Work/scratch’: No space left on device
332+0 records in
331+0 records out
54231040 bytes (54 MB) copied, 30.1589 s, 1.8 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
dd: error writing ‘/mnt/Work/scratch’: No space left on device
622+0 records in
621+0 records out
101744640 bytes (102 MB) copied, 37.4813 s, 2.7 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 121.863 s, 2.3 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 24.2909 s, 11.5 MB/s
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
dd: error writing ‘/mnt/Work/scratch’: No space left on device
1709+0 records in
1708+0 records out
279838720 bytes (280 MB) copied, 139.538 s, 2.0 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
dd: error writing ‘/mnt/Work/scratch’: No space left on device
1424+0 records in
1423+0 records out
233144320 bytes (233 MB) copied, 102.257 s, 2.3 MB/s
Gust Work #
(and so on)
So...
Repeatable: yes.
Problematic: yes.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
2014-12-28 13:56 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
@ 2014-12-28 15:00 ` Martin Steigerwald
2014-12-29 9:25 ` Martin Steigerwald
1 sibling, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 15:00 UTC (permalink / raw)
To: Hugo Mills; +Cc: Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 30838 bytes --]
Am Sonntag, 28. Dezember 2014, 14:56:21 schrieb Martin Steigerwald:
> Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
> > Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
> > > Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> > > > Summarized at
> > > >
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > >
> > > > see below. This is reproducable with fio, no need for Windows XP in
> > > > Virtualbox for reproducing the issue. Next I will try to reproduce with
> > > > a freshly creating filesystem.
> > > >
> > > >
> > > > Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > > > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > > > Hello!
> > > > > > > >
> > > > > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > > > >
> > > > > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > > > > bug
> > > > > > > > report:
> > > > > > > >
> > > > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > > > > >
> > > > > > > > compress=lzo:
> > > > > > > (there is no known problem with skinny metadata, it's actually more
> > > > > > > efficient than the older format. There have been some anecdotes about
> > > > > > > mixing the skinny and fat metadata but nothing has ever been
> > > > > > > demonstrated problematic.)
> > > > > > >
> > > > > > > > merkaba:~> btrfs fi sh /home
> > > > > > > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > > >
> > > > > > > > Total devices 2 FS bytes used 144.41GiB
> > > > > > > > devid 1 size 160.00GiB used 160.00GiB path
> > > > > > > > /dev/mapper/msata-home
> > > > > > > > devid 2 size 160.00GiB used 160.00GiB path
> > > > > > > > /dev/mapper/sata-home
> > > > > > > >
> > > > > > > > Btrfs v3.17
> > > > > > > > merkaba:~> btrfs fi df /home
> > > > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > > >
> > > > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > > > >
> > > > > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > > > > > cause I know no tax return software for Linux which would be suitable
> > > > > > > > for
> > > > > > > > Germany and I frankly don´t care about the end of security cause all
> > > > > > > > surfing and other network access I will do from the Linux box and I
> > > > > > > > only
> > > > > > > > run the VM behind a firewall).
> > > > > > >
> > > > > > > > And thus I try the balance dance again:
> > > > > > > ITEM: Balance... it doesn't do what you think it does...
> > > > > > >
> > > > > > > "Balancing" is something you should almost never need to do. It is only
> > > > > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > > > > etc.) or for cases when you've radically changed allocation behaviors
> > > > > > > (like you decided to remove all your VMs or you've decided to remove a
> > > > > > > mail spool directory full of thousands of tiny files).
> > > > > > >
> > > > > > > People run balance all the time because they think they should. They are
> > > > > > > _usually_ incorrect in that belief.
> > > > > >
> > > > > > I only see the lockups of BTRFS if the trees *occupy* all space on the
> > > > > > device.
> > > > > No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > > > space. What's more, balance does *not* balance the metadata trees. The
> > > > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > > > data, and you have some 13 GiB of that available for use.
> > > > >
> > > > > Now, since you're seeing lockups when the space on your disks is
> > > > > all allocated I'd say that's a bug. However, you're the *only* person
> > > > > who's reported this as a regular occurrence. Does this happen with all
> > > > > filesystems you have, or just this one?
> > > > >
> > > > > > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > > > > > from to *extend* a tree.
> > > > >
> > > > > It's not a tree. It's simply space allocation. It's not even space
> > > > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > > > saying "I'm going to use this piece of disk for this purpose").
> > > > >
> > > > > > This may be a bug, but this is what I see.
> > > > > >
> > > > > > And no amount of "you should not balance a BTRFS" will make that
> > > > > > perception go away.
> > > > > >
> > > > > > See, I see the sun coming out on a morning and you tell me "no, it
> > > > > > doesn´t". Simply that is not going to match my perception.
> > > > >
> > > > > Duncan's assertion is correct in its detail. Looking at your space
> > > >
> > > > Robert's
> > > >
> > > > > usage, I would not suggest that running a balance is something you
> > > > > need to do. Now, since you have these lockups that seem quite
> > > > > repeatable, there's probably a lurking bug in there, but hacking
> > > > > around with balance every time you hit it isn't going to get the
> > > > > problem solved properly.
> > > > >
> > > > > I think I would suggest the following:
> > > > >
> > > > > - make sure you have some way of logging your dmesg permanently (use
> > > > > a different filesystem for /var/log, or a serial console, or a
> > > > > netconsole)
> > > > >
> > > > > - when the lockup happens, hit Alt-SysRq-t a few times
> > > > >
> > > > > - send the dmesg output here, or post to bugzilla.kernel.org
> > > > >
> > > > > That's probably going to give enough information to the developers
> > > > > to work out where the lockup is happening, and is clearly the way
> > > > > forward here.
> > > >
> > > > And I got it reproduced. *Perfectly* reproduced, I´d say.
> > > >
> > > > But let me run the whole story:
> > > >
> > > > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> > >
> > > [… story of trying to reproduce with Windows XP defragmenting which was
> > > unsuccessful as BTRFS still had free device space to allocate new chunks
> > > from …]
> > >
> > > > But finally I got to:
> > > >
> > > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > > Sa 27. Dez 13:26:39 CET 2014
> > > > Label: 'home' uuid: [some UUID]
> > > > Total devices 2 FS bytes used 152.83GiB
> > > > devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > > > devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > > >
> > > > Btrfs v3.17
> > > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > >
> > > >
> > > >
> > > > So I thought: if Virtualbox can write randomly into a file, so can I.
> > > >
> > > > So I did:
> > > >
> > > >
> > > > martin@merkaba:~> cat ssd-test.fio
> > > > [global]
> > > > bs=4k
> > > > #ioengine=libaio
> > > > #iodepth=4
> > > > size=4g
> > > > #direct=1
> > > > runtime=120
> > > > filename=ssd.test.file
> > > >
> > > > [seq-write]
> > > > rw=write
> > > > stonewall
> > > >
> > > > [rand-write]
> > > > rw=randwrite
> > > > stonewall
> > > >
> > > >
> > > >
> > > > And got:
> > > >
> > > > ATOP - merkaba 2014/12/27 13:41:02 ----------- 10s elapsed
> > > > PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct |
> > > > CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% |
> > > > cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% |
> > > > cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> > > > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > > > DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > > > NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> > > > 18079 - martin martin 2 9.99s 0.00s 0K 0K 0K 16K -- - R 1 100% fio
> > > > 4746 - martin martin 2 0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole
> > > > 3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop
> > > > 1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg
> > > > 10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop
> > > >
> > > > while fio was just *laying out* the 4 GiB file. Yes, that's 100% system CPU
> > > > for 10 seconds while allocating a 4 GiB file on a filesystem like:
> > > >
> > > > martin@merkaba:~> LANG=C df -hT /home
> > > > Filesystem Type Size Used Avail Use% Mounted on
> > > > /dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
> > > >
> > > > where a 4 GiB file should easily fit, no? (And this output already includes
> > > > the 4 GiB file, so there was 4 GiB more free space before.)
> > > >
> > > >
> > > > But it gets even more visible:
> > > >
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
> > > >
> > > >
> > > > Yes, that's 0 IOPS.
> > > >
> > > > 0 IOPS, as in zero IOPS. For minutes.
> > > >
> > > >
> > > >
> > > > And here is why:
> > > >
> > > > ATOP - merkaba 2014/12/27 13:46:52 ----------- 10s elapsed
> > > > PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct |
> > > > CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% |
> > > > cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% |
> > > > cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% |
> > > > CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms |
> > > > LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms |
> > > > LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > > > LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > > > DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms |
> > > > DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms |
> > > > NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 |
> > > > NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > > > 14973 - root root 1 8.92s 0.00s 0K 0K 0K 144K -- - S 0 89% kworker/u8:14
> > > > 17450 - root root 1 0.86s 0.00s 0K 0K 0K 32K -- - R 3 9% kworker/u8:5
> > > > 788 - root root 1 0.25s 0.00s 0K 0K 128K 18880K -- - S 3 3% btrfs-transact
> > > > 12254 - root root 1 0.14s 0.00s 0K 0K 64K 576K -- - S 2 1% kworker/u8:3
> > > > 17332 - root root 1 0.11s 0.00s 0K 0K 112K 1348K -- - S 2 1% kworker/u8:4
> > > > 3291 - martin martin 4 0.01s 0.09s 0K 0K 0K 0K -- - S 1 1% plasma-deskto
> > > >
> > > >
> > > >
> > > >
> > > > ATOP - merkaba 2014/12/27 13:47:12 ----------- 10s elapsed
> > > > PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct |
> > > > CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% |
> > > > cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms |
> > > > LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms |
> > > > LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms |
> > > > LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
> > > > DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
> > > > DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
> > > > NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
> > > > NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > > > 17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5
> > > > 4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole
> > > > 10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18
> > > > 1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg
> > > > 17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4
> > > > 17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6
> > > > 17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7
> > > > 3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin
> > > > 10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop
> > > >
> > > >
> > > >
> > > > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > > > usage: they are *idling* around. Heck, 2400 write accesses in 10 seconds.
> > > > That's a joke for SSDs that can do 40000 IOPS (depending, of course, on how
> > > > and what you measure: request size, read vs. write, iodepth and so on).
> > > >
> > > > It's kworker/u8:5 utilizing 100% of one core for minutes.
> > > >
> > > >
> > > >
> > > > It seems to be the random-write case. Here are the values from the fio job:
> > > >
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> > > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> > > > write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> > > > clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> > > > lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> > > > clat percentiles (usec):
> > > > | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
> > > > | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
> > > > | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9],
> > > > | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128],
> > > > | 99.99th=[10304]
> > > > bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> > > > lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> > > > lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> > > > lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> > > > cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> > > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> > > > latency : target=0, window=0, percentile=100.00%, depth=1
> > > >
> > > > Seems fine.
> > > >
> > > >
> > > > But:
> > > >
> > > > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> > > > write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> > > > clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> > > > lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> > > > clat percentiles (usec):
> > > > | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
> > > > | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6],
> > > > | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10],
> > > > | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116],
> > > > | 99.99th=[16711680]
> > > > bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> > > > lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> > > > lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> > > > cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> > > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> > > > latency : target=0, window=0, percentile=100.00%, depth=1
> > > >
> > > > Run status group 0 (all jobs):
> > > > WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > > >
> > > > Run status group 1 (all jobs):
> > > > WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> > > >
> > > >
> > > > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> > > >
> > > > What?
> > > >
> > > > Ey, *what*?
> > […]
> > > > There we go:
> > > >
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > >
> > > I have done more tests.
> > >
> > > This is on the same /home after extending it to 170 GiB and balancing it to
> > > btrfs balance start -dusage=80
> > >
> > > It has plenty of free space. I updated the bug report and hope it gives an
> > > easy-to-comprehend summary. The new tests are in:
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> > >
> > >
> > >
> > > Pasting below for discussion on the list. Summary: I easily get 38000 (!)
> > > IOPS. It may be an idea to reduce to 160 GiB again, but right now that
> > > does not work, as it says no space left on device when trying to downsize.
> > > I may try with 165 or 162 GiB.
> > >
> > > So now we have three IOPS figures:
> > >
> > > - 256 IOPS in worst case scenario
> > > - 4700 IOPS when trying to reproduce worst case scenario with a fresh and small
> > > BTRFS
> > > - 38000 IOPS when /home has unused device space to allocate chunks from
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> > >
> > >
> > > This is another test.
> >
> >
> > Okay, and this is the last series of tests for today.
> >
> > Conclusion:
> >
> > I cannot manage to bring it to its knees as before, but I come close.
> >
> > Still, it's 8000 IOPS instead of 250 IOPS, in a situation that is,
> > according to btrfs fi sh, even *worse* than before.
> >
> > That hints at the need to look at free space fragmentation, as in the
> > beginning the problem started appearing with:
> >
> > merkaba:~> btrfs fi sh /home
> > Label: 'home' uuid: […]
> > Total devices 2 FS bytes used 144.41GiB
> > devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> >
> > Btrfs v3.17
> > merkaba:~> btrfs fi df /home
> > Data, RAID1: total=154.97GiB, used=141.12GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> >
> >
> > Yes, that's 13 GiB of free space *within* the chunks.
> >
> > So while I can lower the IOPS by bringing it to a situation where it
> > cannot reserve additional data chunks again, I cannot recreate the
> > abysmal 250 IOPS figure this way. Not even with my /home filesystem.
> >
> > So there is more to it. I think it's important to look into free space
> > fragmentation. It seems it needs an *aged* filesystem to recreate. And
> > it seems the balances really helped, as I am not able to recreate the
> > issue to that extent right now.
> >
> > So this shows that my original idea about free device space to allocate
> > from doesn't explain it fully either. It seems to be something going on
> > within the chunks that explains the worst case: <300 IOPS, a kworker using
> > one core for minutes, and a locked desktop.
> >
> > Is there a way to view free space fragmentation in BTRFS?
>
> So to rephrase that:
>
> From what I perceive the worst case issue happens when
>
> 1) BTRFS cannot reserve any new chunks from unused device space anymore.
>
> 2) The free space in the existing chunks is highly fragmented.
>
> Either one of those conditions alone is not sufficient to trigger it.
>
> That's at least my current idea about it.
One more note about the IOPS. I currently let fio run with:
martin@merkaba:~> cat ssd-test.fio
[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file
#[seq-write]
#rw=write
#stonewall
[rand-write]
rw=randwrite
stonewall
This is buffered I/O using plain read()/write() system calls, so these
IOPS do not reflect the raw device capabilities. I specifically wanted
to test through the page cache to simulate what I see with Virtualbox
writing to the VDI file (i.e. dirty pages piling up and
dirty_background_ratio in effect). Just like a real app.
But that also means the IOPS may read higher, because fio can finish
before all writes have reached the disk.
So when I see <300 IOPS with buffered writes, it means that even through
the page cache BTRFS was not able to yield higher IOPS.
It also means I measure write requests the way an application would issue
them (unless it uses fsync() or direct I/O, which it seems to me
Virtualbox doesn't, at least not with every request).
Just wanted to make that explicit. It's basically visible from what I
commented out in the job file, but still, I thought I'd mention it.
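As an aside, a direct-I/O variant of the job would separate device capability from page-cache behaviour. A sketch only (not a run from this thread; the file name is mine), simply re-enabling the lines commented out above:

```shell
# Sketch: write a direct-I/O variant of the job file, re-enabling the
# commented-out direct=1/libaio/iodepth lines to bypass the page cache.
cat > ssd-test-direct.fio <<'EOF'
[global]
bs=4k
ioengine=libaio
iodepth=4
size=4g
direct=1
runtime=120
filename=ssd.test.file

[rand-write]
rw=randwrite
stonewall
EOF
grep -c '^direct=1' ssd-test-direct.fio   # prints 1
```

With direct=1 the reported IOPS would reflect completed device writes rather than page-cache hits.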
I just tested the effect by reducing the test file to 500 MiB and the runtime
to 10 seconds, and I got 98000 IOPS for that. So the larger test file size,
but especially the longer runtime, forces the kernel to do actual writes due to:
merkaba:~> grep . /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500
(standard values; I still see no need to optimize anything here with
those SSDs, not even with the 16 GiB of RAM the laptop has, as the SSDs
usually can easily keep up, and I'd rather wait for a change in the default
values unless I am convinced of a benefit in manually adapting them in *this*
case)
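To put those ratios into absolute numbers for this 16 GiB laptop, a quick arithmetic sketch:

```shell
# What the default dirty ratios mean in absolute terms on this machine.
mem_mib=15830                    # MemTotal in MiB, as reported by 'free -m'
bg=$(( mem_mib * 10 / 100 ))     # dirty_background_ratio = 10
hard=$(( mem_mib * 20 / 100 ))   # dirty_ratio = 20
echo "background writeback starts at ${bg} MiB, writers throttled at ${hard} MiB"
```

So background writeback kicks in around 1.5 GiB of dirty pages, and writers are throttled around 3 GiB.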
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 14:52 ` Robert White
@ 2014-12-28 15:42 ` Martin Steigerwald
2014-12-28 15:47 ` Martin Steigerwald
2014-12-29 0:27 ` Robert White
0 siblings, 2 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 15:42 UTC (permalink / raw)
To: Robert White; +Cc: Bardur Arantsson, linux-btrfs
Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
> On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
> > Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
> >> Now:
> >>
> >> The complaining party has verified the minimum, repeatable case of
> >> simple file allocation on a very fragmented system and the responding
> >> party and several others have understood and supported the bug.
> >
> > I didn't provide such a test case yet.
>
> My bad.
>
> >
> > At the moment I can only reproduce this kworker thread using a CPU for
> > minutes case with my /home filesystem.
> >
> > A minimal test case for me would be to reproduce it with a
> > fresh BTRFS filesystem. But so far, with my test case on a fresh BTRFS, I
> > get 4800 instead of 270 IOPS.
> >
>
> A version of the test case to demonstrate absolutely system-clogging
> loads is pretty easy to construct.
>
> Make a raid1 filesystem.
> Balance it once to make sure the seed filesystem is fully integrated.
>
> Create a bunch of small files that are at least 4K in size, but are
> randomly sized. Fill the entire filesystem with them.
>
> BASH Script:
> typeset -i counter=0
> while
>     dd if=/dev/urandom of=/mnt/Work/$((++counter)) \
>        bs=$((4096 + $RANDOM)) count=1 2>/dev/null
> do
>     echo $counter >/dev/null # basically a no-op
> done
>
> The while will exit when the dd encounters a full filesystem.
>
> Then delete ~10% of the files with
> rm *0
>
> Run the while loop again, then delete a different 10% with "rm *1".
>
> Then again with rm *2, etc...
>
> Do this a few times and with each iteration the CPU usage gets worse and
> worse. You'll easily get system-wide stalls on all IO tasks lasting ten
> or more seconds.
Thanks Robert. That's wonderful.
I already wondered about such a test case and thought about reproducing
it with fallocate calls instead, to reduce the amount of actual
writes: i.e. some silly fallocate/truncate workload that writes just
some parts with dd seek and then removes things again.
Feel free to add your testcase to the bug report:
[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401
Because anything that helps a BTRFS developer reproduce this will make it
easier to find and fix the root cause.
I think I will try with this little critter:
merkaba:/mnt/btrfsraid1> cat freespracefragment.sh
#!/bin/bash
TESTDIR="./test"
mkdir -p "$TESTDIR"
typeset -i counter=0
while true; do
fallocate -l $((4096 + $RANDOM)) "$TESTDIR/$((++counter))"
echo $counter >/dev/null #basically a noop
done
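Robert's recipe alternates such fill loops with deletions; a minimal sketch of that delete pass (exercised here on throw-away empty files standing in for real fallocate output):

```shell
# Sketch of the delete pass between fill loops: remove every file whose
# name ends in a given digit (~10% of them), scattering holes into the
# free space. Empty files stand in for the fallocate output here.
TESTDIR="./test"
rm -rf "$TESTDIR"
mkdir -p "$TESTDIR"
for i in $(seq 1 100); do : > "$TESTDIR/$i"; done   # stand-ins for the fill loop
rm -f "$TESTDIR"/*0                                 # drop files ending in 0
ls "$TESTDIR" | wc -l                               # 90 files remain
```

Cycling through the digits 0..9 with a refill in between should age the free space the way Robert describes.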
It takes a while; the script itself uses only a few percent of one core,
while keeping the SSDs busier than I thought it would.
I see up to 12000 writes per 10 seconds – that's not that much, yet
it keeps one SSD 80% busy:
ATOP - merkaba 2014/12/28 16:40:57 ----------- 10s elapsed
PRC | sys 1.50s | user 3.47s | #proc 367 | #trun 1 | #tslpi 649 | #tslpu 0 | #zombie 0 | clones 839 | | no procacct |
CPU | sys 30% | user 38% | irq 1% | idle 293% | wait 37% | | steal 0% | guest 0% | curf 1.63GHz | curscal 50% |
cpu | sys 7% | user 11% | irq 1% | idle 75% | cpu000 w 6% | | steal 0% | guest 0% | curf 1.25GHz | curscal 39% |
cpu | sys 8% | user 11% | irq 0% | idle 76% | cpu002 w 4% | | steal 0% | guest 0% | curf 1.55GHz | curscal 48% |
cpu | sys 7% | user 9% | irq 0% | idle 71% | cpu001 w 13% | | steal 0% | guest 0% | curf 1.75GHz | curscal 54% |
cpu | sys 8% | user 7% | irq 0% | idle 71% | cpu003 w 14% | | steal 0% | guest 0% | curf 1.96GHz | curscal 61% |
CPL | avg1 1.69 | avg5 1.30 | avg15 0.94 | | | csw 68387 | intr 36928 | | | numcpu 4 |
MEM | tot 15.5G | free 3.1G | cache 8.8G | buff 4.2M | slab 1.0G | shmem 210.3M | shrss 79.1M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
SWP | tot 12.0G | free 11.5G | | | | | | | vmcom 4.9G | vmlim 19.7G |
LVM | a-btrfsraid1 | busy 80% | read 0 | write 11873 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.31 | avq 1.11 | avio 0.67 ms |
LVM | a-btrfsraid1 | busy 5% | read 0 | write 11873 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.31 | avq 2.45 | avio 0.04 ms |
LVM | msata-home | busy 3% | read 0 | write 175 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.06 | avq 1.71 | avio 1.43 ms |
LVM | msata-debian | busy 0% | read 0 | write 10 | KiB/r 0 | KiB/w 8 | MBr/s 0.00 | MBw/s 0.01 | avq 1.15 | avio 3.40 ms |
LVM | sata-home | busy 0% | read 0 | write 175 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.06 | avq 1.71 | avio 0.04 ms |
LVM | sata-debian | busy 0% | read 0 | write 10 | KiB/r 0 | KiB/w 8 | MBr/s 0.00 | MBw/s 0.01 | avq 1.00 | avio 0.10 ms |
DSK | sdb | busy 80% | read 0 | write 11880 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.38 | avq 1.11 | avio 0.67 ms |
DSK | sda | busy 5% | read 0 | write 12069 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.38 | avq 2.51 | avio 0.04 ms |
NET | transport | tcpi 26 | tcpo 26 | udpi 0 | udpo 0 | tcpao 2 | tcppo 1 | tcprs 0 | tcpie 0 | udpie 0 |
NET | network | ipi 26 | ipo 26 | ipfrw 0 | deliv 26 | | | | icmpi 0 | icmpo 0 |
NET | eth0 0% | pcki 10 | pcko 10 | si 5 Kbps | so 1 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/4
9169 - martin martin 14 0.22s 1.53s 0K 0K 0K 4K -- - S 1 18% amarok
1488 - root root 1 0.34s 0.27s 220K 0K 0K 0K -- - S 2 6% Xorg
6816 - martin martin 7 0.05s 0.44s 0K 0K 0K 0K -- - S 1 5% kmail
24390 - root root 1 0.20s 0.25s 24K 24K 0K 40800K -- - S 0 5% freespracefrag
3268 - martin martin 3 0.08s 0.34s 0K 0K 0K 24K -- - S 0 4% kwin
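For scale, the write count in that 10-second atop interval works out to only about 1200 IOPS per device:

```shell
# 11873 writes in a 10 s atop interval is roughly 1187 IOPS per device --
# far below what these SSDs are capable of.
awk 'BEGIN { printf "%d IOPS\n", 11873 / 10 }'
```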
But only with a low amount of writes:
merkaba:/mnt/btrfsraid1> vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 538424 3326248 4304 9202576 6 11 1968 4029 273 207 15 10 72 3 0
1 0 538424 3325244 4304 9202836 0 0 0 6456 3498 7635 11 8 72 10 0
0 0 538424 3325168 4304 9202932 0 0 0 9032 3719 6764 9 9 74 9 0
0 0 538424 3334508 4304 9202932 0 0 0 8936 3548 6035 7 8 76 9 0
0 0 538424 3334144 4304 9202876 0 0 0 9008 3335 5635 7 7 76 10 0
0 0 538424 3332724 4304 9202728 0 0 0 11240 3555 5699 7 8 76 10 0
2 0 538424 3333328 4304 9202876 0 0 0 9080 3724 6542 8 8 75 9 0
0 0 538424 3333328 4304 9202876 0 0 0 6968 2951 5015 7 7 76 10 0
0 1 538424 3332832 4304 9202584 0 0 0 9160 3663 6772 8 8 76 9 0
Still it keeps one of the two SSDs about 80% busy:
iostat -xz 1
avg-cpu: %user %nice %system %iowait %steal %idle
7,04 0,00 7,04 9,80 0,00 76,13
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 0,00 1220,00 0,00 4556,00 7,47 0,12 0,10 0,00 0,10 0,04 5,10
sdb 0,00 10,00 0,00 1210,00 0,00 4556,00 7,53 0,85 0,70 0,00 0,70 0,66 79,90
dm-2 0,00 0,00 0,00 4,00 0,00 36,00 18,00 0,02 5,00 0,00 5,00 4,25 1,70
dm-5 0,00 0,00 0,00 4,00 0,00 36,00 18,00 0,00 0,25 0,00 0,25 0,25 0,10
dm-6 0,00 0,00 0,00 1216,00 0,00 4520,00 7,43 0,12 0,10 0,00 0,10 0,04 5,00
dm-7 0,00 0,00 0,00 1216,00 0,00 4520,00 7,43 0,84 0,69 0,00 0,69 0,66 79,70
avg-cpu: %user %nice %system %iowait %steal %idle
6,55 0,00 7,81 9,32 0,00 76,32
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 0,00 1203,00 0,00 4472,00 7,43 0,09 0,07 0,00 0,07 0,03 3,80
sdb 0,00 0,00 0,00 1203,00 0,00 4472,00 7,43 0,79 0,66 0,00 0,66 0,64 77,10
dm-6 0,00 0,00 0,00 1203,00 0,00 4472,00 7,43 0,09 0,07 0,00 0,07 0,03 4,00
dm-7 0,00 0,00 0,00 1203,00 0,00 4472,00 7,43 0,79 0,66 0,00 0,66 0,64 77,10
avg-cpu: %user %nice %system %iowait %steal %idle
7,79 0,00 7,79 9,30 0,00 75,13
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 0,00 1202,00 0,00 4468,00 7,43 0,09 0,07 0,00 0,07 0,04 4,70
sdb 0,00 0,00 4,00 1202,00 2048,00 4468,00 10,81 0,86 0,71 4,75 0,70 0,65 78,10
dm-1 0,00 0,00 4,00 0,00 2048,00 0,00 1024,00 0,02 4,75 4,75 0,00 2,00 0,80
dm-6 0,00 0,00 0,00 1202,00 0,00 4468,00 7,43 0,08 0,07 0,00 0,07 0,04 4,60
dm-7 0,00 0,00 0,00 1202,00 0,00 4468,00 7,43 0,84 0,70 0,00 0,70 0,65 77,80
Yet I hit neither full CPU usage nor full SSD usage (just 80%), so
this is another interesting case.
> I don't have enough spare storage to do this directly, so I used
> loopback devices. First I did it with the loopback files in COW mode.
> Then I did it again with the files in NOCOW mode. (the COW files got
> thick with overwrite real fast. 8-)
>
> So anyway...
>
> After I got through all ten digits on the rm (that is removing *0, then
> refilling, then *1 etc...) I figured the FS image was nicely fragmented.
>
> At that point it was very easy to spike the kworker to 100% CPU with
>
> dd if=/dev/urandom of=/mnt/Work/scratch bs=40k
>
> The dd would read 40k (a CPU spike for /dev/urandom processing), then it
> would write the 40k, and the kworker would peg 100% of one CPU and stay
> there for a while. Then it would be back to the /dev/urandom spike.
>
> So this laptop has been carefully detuned to prevent certain kinds of
> stalls (particularly the moveablecore= reservation, as previously
> mentioned, to prevent non-responsiveness of the UI) and I had to go
> through /dev/loop so that had a smoothing effect... but yep, there were
> clear kworker spikes that _did_ stop the IO path (the system monitor app,
> for instance, could not get I/O statistics for ten and fifteen second
> intervals and would stop logging/scrolling).
I think I will look at the movablecore= thing again. I think I had
overlooked it.
> Progressively larger block sizes on the write path made things
> progressively worse...
>
> dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
>
>
> And overwriting the file by just invoking DD again, was worse still
> (presumably from the juggling act) before resulting in a net
> out-of-space condition.
>
> Switching from /dev/urandom to /dev/zero for writing the large file made
> things worse still -- probably since there were no respites for the
> kworker to catch up etc.
>
> ASIDE: Playing with /proc/sys/vm/dirty_{background_,}ratio had lots of
> interesting and difficult to quantify effects on user-space
> applications. Cutting in half (5 and 10 instead of 10 and 20
> respectively) seemed to give some relief, but going further got harmful
> quickly. Diverging numbers was odd too. But it seemed a little brittle
> to play with these numbers.
As said, in usual usage I do not see much reason to poke around with it.
And yes, I know Linus's advice to tune it to a few seconds' worth
of your storage bandwidth. But heck, these SSDs can do 200 MiB/s even
with partially random workloads. So in 5 seconds they could write out
1 GiB, and I have not seen more dirty data than that in the fio test case.
It may make sense to reduce it to about 1 GiB, as 10% of:
merkaba:~> free -m
total used free shared buffers cached
Mem: 15830 11953 3877 207 0 8382
-/+ buffers/cache: 3570 12260
Swap: 12287 526 11761
is still a lot.
merkaba:~> grep . /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500
But as I have never seen a problem with bulk writes piling up for the SSDs,
I didn't. I am quite lazy in that: I only ever change a default when I see
a need to. And yes, on write-heavy servers with 512 GiB RAM or slow rotating
storage it may well be needed to avoid long stalls.
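If I ever did cap it, byte-based thresholds would be the way to keep the limit independent of RAM size (setting a *_bytes knob zeroes its ratio counterpart). A hypothetical, unapplied snippet:

```shell
# Hypothetical sysctl values (not applied anywhere in this thread): pin the
# background threshold to 1 GiB and the hard limit to 2 GiB, regardless of RAM.
cat <<EOF
vm.dirty_background_bytes = $(( 1 * 1024 * 1024 * 1024 ))
vm.dirty_bytes = $(( 2 * 1024 * 1024 * 1024 ))
EOF
```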
> SUPER FREAKY THING...
>
> Every time I removed and recreated "scratch" I would get _radically_
> different results for how much I could write into that remaining space
> and how long it took to do so. In theory I am reusing the exact same
> storage again and again. I'm not doing compression (the underlying
> filesystem behind the loop devices has compression, but that would be
> disabled by the +C attribute). It's not enough space coming-and-going to
> cause data extents to be reclaimed or displaced by metadata. And the
> filesystem is otherwise completely unused.
>
> But check it out...
>
> Gust Work # rm scratch
> Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
> 1700+0 records in
> 1700+0 records out
> 278528000 bytes (279 MB) copied, 1.4952 s, 186 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
> 1700+0 records in
> 1700+0 records out
> 278528000 bytes (279 MB) copied, 292.135 s, 953 kB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 93+0 records in
> 92+0 records out
> 15073280 bytes (15 MB) copied, 0.0453977 s, 332 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 1090+0 records in
> 1089+0 records out
> 178421760 bytes (178 MB) copied, 115.991 s, 1.5 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 332+0 records in
> 331+0 records out
> 54231040 bytes (54 MB) copied, 30.1589 s, 1.8 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 622+0 records in
> 621+0 records out
> 101744640 bytes (102 MB) copied, 37.4813 s, 2.7 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
> 1700+0 records in
> 1700+0 records out
> 278528000 bytes (279 MB) copied, 121.863 s, 2.3 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
> 1700+0 records in
> 1700+0 records out
> 278528000 bytes (279 MB) copied, 24.2909 s, 11.5 MB/s
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 1709+0 records in
> 1708+0 records out
> 279838720 bytes (280 MB) copied, 139.538 s, 2.0 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 1424+0 records in
> 1423+0 records out
> 233144320 bytes (233 MB) copied, 102.257 s, 2.3 MB/s
> Gust Work #
>
> (and so on)
I saw something similar, but with the 2x10 GiB RAID 1 test BTRFS on LVM
volumes. I filled the remaining space with rsync -a /usr/bin several times,
and even after a run aborted with "no space left on device", subsequent
calls could still copy things to it. Later I attributed that to rsync
failing on a large file first and then fitting in smaller files on the
following runs: I used a different destination directory on each rsync
call, so it restarted the copy from the first file every time.
So it's nice to see that you can reproduce this with dd.
> So...
>
> Repeatable: yes.
> Problematic: yes.
Wonderful.
I may try that with my test BTRFS. I could even make it a 2x20 GiB RAID 1
as well.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 15:42 ` Martin Steigerwald
@ 2014-12-28 15:47 ` Martin Steigerwald
2014-12-29 0:27 ` Robert White
1 sibling, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 15:47 UTC (permalink / raw)
To: Robert White; +Cc: Bardur Arantsson, linux-btrfs
Am Sonntag, 28. Dezember 2014, 16:42:20 schrieb Martin Steigerwald:
> Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
> > On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
> > > Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
> > >> Now:
> > >>
> > >> The complaining party has verified the minimum, repeatable case of
> > >> simple file allocation on a very fragmented system and the responding
> > >> party and several others have understood and supported the bug.
> > >
> > > I didn´t yet provide such a test case.
> >
> > My bad.
> >
> > >
> > > At the moment I can only reproduce this kworker thread using a CPU for
> > > minutes case with my /home filesystem.
> > >
> > > A mininmal test case for me would be to be able to reproduce it with a
> > > fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
> > > get 4800 instead of 270 IOPS.
> > >
> >
> > A version of the test case to demonstrate absolutely system-clogging
> > loads is pretty easy to construct.
> >
> > Make a raid1 filesystem.
> > Balance it once to make sure the seed filesystem is fully integrated.
> >
> > Create a bunch of small files that are at least 4K in size, but are
> > randomly sized. Fill the entire filesystem with them.
> >
> > BASH Script:
> > typeset -i counter=0
> > while
> > dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM))
> > count=1 2>/dev/null
> > do
> > echo $counter >/dev/null #basically a noop
> > done
> >
> > The while will exit when the dd encounters a full filesystem.
> >
> > Then delete ~10% of the files with
> > rm *0
> >
> > Run the while loop again, then delete a different 10% with "rm *1".
> >
> > Then again with rm *2, etc...
> >
> > Do this a few times and with each iteration the CPU usage gets worse and
> > worse. You'll easily get system-wide stalls on all IO tasks lasting ten
> > or more seconds.
>
> Thanks Robert. Thats wonderful.
>
> I wondered about such a test case already and thought about reproducing
> it just with fallocate calls instead to reduce the amount of actual
> writes done. I.e. just do some silly fallocate, truncating, write just
> some parts with dd seek and remove things again kind of workload.
>
> Feel free to add your testcase to the bug report:
>
> [Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> Cause anything that helps a BTRFS developer to reproduce will make it easier
> to find and fix the root cause of it.
>
> I think I will try with this little critter:
>
> merkaba:/mnt/btrfsraid1> cat freespracefragment.sh
> #!/bin/bash
>
> TESTDIR="./test"
> mkdir -p "$TESTDIR"
>
> typeset -i counter=0
> while true; do
> fallocate -l $((4096 + $RANDOM)) "$TESTDIR/$((++counter))"
> echo $counter >/dev/null #basically a noop
> done
>
> It takes a time, the script itself is using only a few percent of one core
> there, while busying out the SSDs more heavily than I thought it would do.
> But well I see up to 12000 writes per 10 seconds – thats not that much, still
> it busies one SSD for 80%:
>
> ATOP - merkaba 2014/12/28 16:40:57 ----------- 10s elapsed
> PRC | sys 1.50s | user 3.47s | #proc 367 | #trun 1 | #tslpi 649 | #tslpu 0 | #zombie 0 | clones 839 | | no procacct |
> CPU | sys 30% | user 38% | irq 1% | idle 293% | wait 37% | | steal 0% | guest 0% | curf 1.63GHz | curscal 50% |
> cpu | sys 7% | user 11% | irq 1% | idle 75% | cpu000 w 6% | | steal 0% | guest 0% | curf 1.25GHz | curscal 39% |
> cpu | sys 8% | user 11% | irq 0% | idle 76% | cpu002 w 4% | | steal 0% | guest 0% | curf 1.55GHz | curscal 48% |
> cpu | sys 7% | user 9% | irq 0% | idle 71% | cpu001 w 13% | | steal 0% | guest 0% | curf 1.75GHz | curscal 54% |
> cpu | sys 8% | user 7% | irq 0% | idle 71% | cpu003 w 14% | | steal 0% | guest 0% | curf 1.96GHz | curscal 61% |
> CPL | avg1 1.69 | avg5 1.30 | avg15 0.94 | | | csw 68387 | intr 36928 | | | numcpu 4 |
> MEM | tot 15.5G | free 3.1G | cache 8.8G | buff 4.2M | slab 1.0G | shmem 210.3M | shrss 79.1M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
> SWP | tot 12.0G | free 11.5G | | | | | | | vmcom 4.9G | vmlim 19.7G |
> LVM | a-btrfsraid1 | busy 80% | read 0 | write 11873 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.31 | avq 1.11 | avio 0.67 ms |
> LVM | a-btrfsraid1 | busy 5% | read 0 | write 11873 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.31 | avq 2.45 | avio 0.04 ms |
> LVM | msata-home | busy 3% | read 0 | write 175 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.06 | avq 1.71 | avio 1.43 ms |
> LVM | msata-debian | busy 0% | read 0 | write 10 | KiB/r 0 | KiB/w 8 | MBr/s 0.00 | MBw/s 0.01 | avq 1.15 | avio 3.40 ms |
> LVM | sata-home | busy 0% | read 0 | write 175 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.06 | avq 1.71 | avio 0.04 ms |
> LVM | sata-debian | busy 0% | read 0 | write 10 | KiB/r 0 | KiB/w 8 | MBr/s 0.00 | MBw/s 0.01 | avq 1.00 | avio 0.10 ms |
> DSK | sdb | busy 80% | read 0 | write 11880 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.38 | avq 1.11 | avio 0.67 ms |
> DSK | sda | busy 5% | read 0 | write 12069 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.38 | avq 2.51 | avio 0.04 ms |
> NET | transport | tcpi 26 | tcpo 26 | udpi 0 | udpo 0 | tcpao 2 | tcppo 1 | tcprs 0 | tcpie 0 | udpie 0 |
> NET | network | ipi 26 | ipo 26 | ipfrw 0 | deliv 26 | | | | icmpi 0 | icmpo 0 |
> NET | eth0 0% | pcki 10 | pcko 10 | si 5 Kbps | so 1 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
> NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
>
> PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/4
> 9169 - martin martin 14 0.22s 1.53s 0K 0K 0K 4K -- - S 1 18% amarok
> 1488 - root root 1 0.34s 0.27s 220K 0K 0K 0K -- - S 2 6% Xorg
> 6816 - martin martin 7 0.05s 0.44s 0K 0K 0K 0K -- - S 1 5% kmail
> 24390 - root root 1 0.20s 0.25s 24K 24K 0K 40800K -- - S 0 5% freespracefrag
> 3268 - martin martin 3 0.08s 0.34s 0K 0K 0K 24K -- - S 0 4% kwin
>
>
>
> But only with a low amount of writes:
>
> merkaba:/mnt/btrfsraid1> vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 2 0 538424 3326248 4304 9202576 6 11 1968 4029 273 207 15 10 72 3 0
> 1 0 538424 3325244 4304 9202836 0 0 0 6456 3498 7635 11 8 72 10 0
> 0 0 538424 3325168 4304 9202932 0 0 0 9032 3719 6764 9 9 74 9 0
> 0 0 538424 3334508 4304 9202932 0 0 0 8936 3548 6035 7 8 76 9 0
> 0 0 538424 3334144 4304 9202876 0 0 0 9008 3335 5635 7 7 76 10 0
> 0 0 538424 3332724 4304 9202728 0 0 0 11240 3555 5699 7 8 76 10 0
> 2 0 538424 3333328 4304 9202876 0 0 0 9080 3724 6542 8 8 75 9 0
> 0 0 538424 3333328 4304 9202876 0 0 0 6968 2951 5015 7 7 76 10 0
> 0 1 538424 3332832 4304 9202584 0 0 0 9160 3663 6772 8 8 76 9 0
Let me rephrase that.
On one hand that is rather low, but for this kind of workload, which only
*fallocates* new files, it is actually quite a lot: I only tell it to
*reserve* space for the files, I never tell it to write to them. And yet it
writes about 6-12 MiB/s.
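A minimal, generic sketch of what fallocate does here (not BTRFS-specific):
it reserves the full file size as allocated but unwritten extents, so only
metadata updates have to reach the disk:

```shell
# Reserve 1 MiB without writing any data; blocks get allocated but stay
# unwritten, so reads return zeros and no data I/O happens.
f=$(mktemp)
fallocate -l 1M "$f"
size=$(stat -c %s "$f")     # logical size: 1048576
blocks=$(stat -c %b "$f")   # allocated 512-byte blocks: non-zero
echo "size=$size blocks=$blocks"
# On filesystems with FIEMAP support, filefrag flags the extents "unwritten":
filefrag -v "$f" 2>/dev/null | head -n 5 || true
rm -f "$f"
```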
> Still it busies one of both SSDs for about 80%:
>
> iostat -xz 1
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 7,04 0,00 7,04 9,80 0,00 76,13
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0,00 0,00 0,00 1220,00 0,00 4556,00 7,47 0,12 0,10 0,00 0,10 0,04 5,10
> sdb 0,00 10,00 0,00 1210,00 0,00 4556,00 7,53 0,85 0,70 0,00 0,70 0,66 79,90
> dm-2 0,00 0,00 0,00 4,00 0,00 36,00 18,00 0,02 5,00 0,00 5,00 4,25 1,70
> dm-5 0,00 0,00 0,00 4,00 0,00 36,00 18,00 0,00 0,25 0,00 0,25 0,25 0,10
> dm-6 0,00 0,00 0,00 1216,00 0,00 4520,00 7,43 0,12 0,10 0,00 0,10 0,04 5,00
> dm-7 0,00 0,00 0,00 1216,00 0,00 4520,00 7,43 0,84 0,69 0,00 0,69 0,66 79,70
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 6,55 0,00 7,81 9,32 0,00 76,32
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0,00 0,00 0,00 1203,00 0,00 4472,00 7,43 0,09 0,07 0,00 0,07 0,03 3,80
> sdb 0,00 0,00 0,00 1203,00 0,00 4472,00 7,43 0,79 0,66 0,00 0,66 0,64 77,10
> dm-6 0,00 0,00 0,00 1203,00 0,00 4472,00 7,43 0,09 0,07 0,00 0,07 0,03 4,00
> dm-7 0,00 0,00 0,00 1203,00 0,00 4472,00 7,43 0,79 0,66 0,00 0,66 0,64 77,10
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 7,79 0,00 7,79 9,30 0,00 75,13
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0,00 0,00 0,00 1202,00 0,00 4468,00 7,43 0,09 0,07 0,00 0,07 0,04 4,70
> sdb 0,00 0,00 4,00 1202,00 2048,00 4468,00 10,81 0,86 0,71 4,75 0,70 0,65 78,10
> dm-1 0,00 0,00 4,00 0,00 2048,00 0,00 1024,00 0,02 4,75 4,75 0,00 2,00 0,80
> dm-6 0,00 0,00 0,00 1202,00 0,00 4468,00 7,43 0,08 0,07 0,00 0,07 0,04 4,60
> dm-7 0,00 0,00 0,00 1202,00 0,00 4468,00 7,43 0,84 0,70 0,00 0,70 0,65 77,80
>
>
> But yet, neither I hit full CPU usage nor full SSD usage (just 80%), so
> this is yet another interesting case.
[…]
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 12:03 ` Martin Steigerwald
@ 2014-12-28 17:04 ` Patrik Lundquist
2014-12-29 10:14 ` Martin Steigerwald
0 siblings, 1 reply; 59+ messages in thread
From: Patrik Lundquist @ 2014-12-28 17:04 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs
On 28 December 2014 at 13:03, Martin Steigerwald <Martin@lichtvoll.de> wrote:
>
> BTW, I found that the Oracle blog didn´t work at all for me. I completed
> a cycle of defrag, sdelete -c and VBoxManage compact, [...] and it
> apparently did *nothing* to reduce the size of the file.
They've changed the argument to -z; sdelete -z.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 15:42 ` Martin Steigerwald
2014-12-28 15:47 ` Martin Steigerwald
@ 2014-12-29 0:27 ` Robert White
2014-12-29 9:14 ` Martin Steigerwald
1 sibling, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-29 0:27 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Bardur Arantsson, linux-btrfs
On 12/28/2014 07:42 AM, Martin Steigerwald wrote:
> Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
>> On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
>>> Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
>>>> Now:
>>>>
>>>> The complaining party has verified the minimum, repeatable case of
>>>> simple file allocation on a very fragmented system and the responding
>>>> party and several others have understood and supported the bug.
>>>
>>> I didn´t yet provide such a test case.
>>
>> My bad.
>>
>>>
>>> At the moment I can only reproduce this kworker thread using a CPU for
>>> minutes case with my /home filesystem.
>>>
>>> A mininmal test case for me would be to be able to reproduce it with a
>>> fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
>>> get 4800 instead of 270 IOPS.
>>>
>>
>> A version of the test case to demonstrate absolutely system-clogging
>> loads is pretty easy to construct.
>>
>> Make a raid1 filesystem.
>> Balance it once to make sure the seed filesystem is fully integrated.
>>
>> Create a bunch of small files that are at least 4K in size, but are
>> randomly sized. Fill the entire filesystem with them.
>>
>> BASH Script:
>> typeset -i counter=0
>> while
>> dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM))
>> count=1 2>/dev/null
>> do
>> echo $counter >/dev/null #basically a noop
>> done
>>
>> The while will exit when the dd encounters a full filesystem.
>>
>> Then delete ~10% of the files with
>> rm *0
>>
>> Run the while loop again, then delete a different 10% with "rm *1".
>>
>> Then again with rm *2, etc...
>>
>> Do this a few times and with each iteration the CPU usage gets worse and
>> worse. You'll easily get system-wide stalls on all IO tasks lasting ten
>> or more seconds.
>
> Thanks Robert. Thats wonderful.
>
> I wondered about such a test case already and thought about reproducing
> it just with fallocate calls instead to reduce the amount of actual
> writes done. I.e. just do some silly fallocate, truncating, write just
> some parts with dd seek and remove things again kind of workload.
>
> Feel free to add your testcase to the bug report:
>
> [Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> Cause anything that helps a BTRFS developer to reproduce will make it easier
> to find and fix the root cause of it.
>
> I think I will try with this little critter:
>
> merkaba:/mnt/btrfsraid1> cat freespracefragment.sh
> #!/bin/bash
>
> TESTDIR="./test"
> mkdir -p "$TESTDIR"
>
> typeset -i counter=0
> while true; do
> fallocate -l $((4096 + $RANDOM)) "$TESTDIR/$((++counter))"
> echo $counter >/dev/null #basically a noop
> done
If you don't do the remove/delete passes you won't get as much
fragmentation...
I also noticed that fallocate would not actually create the files in my
toolset, so I had to touch them first. So the theoretical script became
e.g.
typeset -i counter=0
for AA in {0..9}
do
    while
        touch ${TESTDIR}/$((++counter)) &&
        fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((counter))
    do
        if ((counter%100 == 0))
        then
            echo $counter
        fi
    done
    echo "removing ${AA}"
    rm ${TESTDIR}/*${AA}
done
Meanwhile, on my test rig, using fallocate did _not_ result in final
exhaustion of resources. That is, btrfs fi df /mnt/Work didn't show
significant changes on a nearly full filesystem.
I also never got a failure back from fallocate; that is, the inner loop
never terminated. This could be a problem with the system call itself, or
with the application wrapper.
Nor did I reach the CPU saturation I expected.
e.g.
Gust vm # btrfs fi df /mnt/Work/
Data, RAID1: total=1.72GiB, used=1.66GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=256.00MiB, used=57.84MiB
GlobalReserve, single: total=32.00MiB, used=0.00B
time passes while script running...
Gust vm # btrfs fi df /mnt/Work/
Data, RAID1: total=1.72GiB, used=1.66GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=256.00MiB, used=57.84MiB
GlobalReserve, single: total=32.00MiB, used=0.00B
So there may be some limiting factor or something.
Without the actual writes to the actual file expanse I don't get the stalls.
(I added a _touch_ of instrumentation; it makes the various catastrophe
events a little more obvious in context. 8-)
mount /dev/whatever /mnt/Work

typeset -i counter=0
for AA in {0..9}
do
    while
        dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) count=1 2>/dev/null
    do
        if ((counter%100 == 0))
        then
            echo $counter
            if ((counter%1000 == 0))
            then
                btrfs fi df /mnt/Work
            fi
        fi
    done
    btrfs fi df /mnt/Work
    echo "removing ${AA}"
    rm /mnt/Work/*${AA}
    btrfs fi df /mnt/Work
done
So you definitely need the writes to really see the stalls.
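Combining the two approaches from this thread, preallocation plus a small
number of real writes, could look like the following sketch. The path,
sizes, and stride are my assumptions; this is untested on the setups
discussed here:

```shell
# Preallocate one large file, then dirty only scattered 4 KiB blocks inside
# it, so the work comes from converting preallocated (unwritten) extents
# rather than from bulk data writeback.
TESTDIR="${TESTDIR:-$(mktemp -d)}"   # assumption: any writable directory
f="$TESTDIR/prealloc"
fallocate -l 256M "$f"
for i in $(seq 0 99); do
    # one 4 KiB random write every 2 MiB (seek counts in bs-sized blocks);
    # conv=notrunc keeps the preallocated size intact
    dd if=/dev/urandom of="$f" bs=4k count=1 seek=$((i * 512)) \
        conv=notrunc 2>/dev/null
done
sync
echo "size=$(stat -c %s "$f")"
```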
> I may try with with my test BTRFS. I could even make it 2x20 GiB RAID 1
> as well.
I guess I never mentioned it... I am using 4x1GiB NOCOW files through
losetup as the basis of a RAID1. No compression (by virtue of the NOCOW
attribute on the files in the underlying fs, and compression not being set
on the resulting mount). No encryption. No LVM.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
2014-12-27 19:23 ` BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time) Martin Steigerwald
@ 2014-12-29 2:07 ` Zygo Blaxell
2014-12-29 9:32 ` Martin Steigerwald
0 siblings, 1 reply; 59+ messages in thread
From: Zygo Blaxell @ 2014-12-29 2:07 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Hugo Mills, Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 2308 bytes --]
On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
> My simple test case didn´t trigger it, and I so not have another twice 160
> GiB available on this SSDs available to try with a copy of my home
> filesystem. Then I could safely test without bringing the desktop session to
> an halt. Maybe someone has an idea on how to "enhance" my test case in
> order to reliably trigger the issue.
>
> It may be challenging tough. My /home is quite a filesystem. It has a maildir
> with at least one million of files (yeah, I am performance testing KMail and
> Akonadi as well to the limit!), and it has git repos and this one VM image,
> and the desktop search and the Akonadi database. In other words: It has
> been hit nicely with various mostly random I think workloads over the last
> about six months. I bet its not that easy to simulate that. Maybe some runs
> of compilebench to age the filesystem before the fio test?
>
> That said, BTRFS performs a lot better. The complete lockups without any
> CPU usage of 3.15 and 3.16 have gone for sure. Thats wonderful. But there
> is this kworker issue now. I noticed it that gravely just while trying to
> complete this tax returns stuff with the Windows XP VM. Otherwise it may
> have happened, I have seen some backtraces in kern.log, but it didn´t last
> for minutes. So this indeed is of less severity than the full lockups with
> 3.15 and 3.16.
>
> Zygo, was is the characteristics of your filesystem. Do you use
> compress=lzo and skinny metadata as well? How are the chunks allocated?
> What kind of data you have on it?
compress-force (default zlib), no skinny-metadata. Chunks are d=single,
m=dup. Data is a mix of various desktop applications, most active
file sizes from a few hundred K to a few MB, maybe 300k-400k files.
No database or VM workloads. Filesystem is 100GB and is usually between
98 and 99% full (about 1-2GB free).
I have another filesystem which has similar problems when it's 99.99%
full (it's 13TB, so 0.01% is 1.3GB). That filesystem is RAID1 with
skinny-metadata and no-holes.
On various filesystems I have the above CPU-burning problem, a bunch of
irreproducible random crashes, and a hang with a kernel stack that goes
through SyS_unlinkat and btrfs_evict_inode.
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-29 0:27 ` Robert White
@ 2014-12-29 9:14 ` Martin Steigerwald
0 siblings, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-29 9:14 UTC (permalink / raw)
To: Robert White; +Cc: Bardur Arantsson, linux-btrfs
Am Sonntag, 28. Dezember 2014, 16:27:41 schrieb Robert White:
> On 12/28/2014 07:42 AM, Martin Steigerwald wrote:
> > Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
> >> On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
> >>> Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
> >>>> Now:
> >>>>
> >>>> The complaining party has verified the minimum, repeatable case of
> >>>> simple file allocation on a very fragmented system and the responding
> >>>> party and several others have understood and supported the bug.
> >>>
> >>> I didn´t yet provide such a test case.
> >>
> >> My bad.
> >>
> >>>
> >>> At the moment I can only reproduce this kworker thread using a CPU for
> >>> minutes case with my /home filesystem.
> >>>
> >>> A mininmal test case for me would be to be able to reproduce it with a
> >>> fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
> >>> get 4800 instead of 270 IOPS.
> >>>
> >>
> >> A version of the test case to demonstrate absolutely system-clogging
> >> loads is pretty easy to construct.
> >>
> >> Make a raid1 filesystem.
> >> Balance it once to make sure the seed filesystem is fully integrated.
> >>
> >> Create a bunch of small files that are at least 4K in size, but are
> >> randomly sized. Fill the entire filesystem with them.
> >>
> >> BASH Script:
> >> typeset -i counter=0
> >> while
> >> dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM))
> >> count=1 2>/dev/null
> >> do
> >> echo $counter >/dev/null #basically a noop
> >> done
> >>
> >> The while will exit when the dd encounters a full filesystem.
> >>
> >> Then delete ~10% of the files with
> >> rm *0
> >>
> >> Run the while loop again, then delete a different 10% with "rm *1".
> >>
> >> Then again with rm *2, etc...
> >>
> >> Do this a few times and with each iteration the CPU usage gets worse and
> >> worse. You'll easily get system-wide stalls on all IO tasks lasting ten
> >> or more seconds.
> >
> > Thanks Robert. Thats wonderful.
> >
> > I wondered about such a test case already and thought about reproducing
> > it just with fallocate calls instead to reduce the amount of actual
> > writes done. I.e. just do some silly fallocate, truncating, write just
> > some parts with dd seek and remove things again kind of workload.
> >
> > Feel free to add your testcase to the bug report:
> >
> > [Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> >
> > Cause anything that helps a BTRFS developer to reproduce will make it easier
> > to find and fix the root cause of it.
> >
> > I think I will try with this little critter:
> >
> > merkaba:/mnt/btrfsraid1> cat freespracefragment.sh
> > #!/bin/bash
> >
> > TESTDIR="./test"
> > mkdir -p "$TESTDIR"
> >
> > typeset -i counter=0
> > while true; do
> > fallocate -l $((4096 + $RANDOM)) "$TESTDIR/$((++counter))"
> > echo $counter >/dev/null #basically a noop
> > done
>
> If you don't do the remove/delete passes you won't get as much
> fragmentation...
>
> I also noticed that fallocate would not actually create the files in my
> toolset, so I had to touch them first. So the theoretical script became
>
> e.g.
>
> typeset -i counter=0
> for AA in {0..9}
> do
> while
> touch ${TESTDIR}/$((++counter)) &&
> fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((counter))
> do
> if ((counter%100 == 0))
> then
> echo $counter
> fi
> done
> echo "removing ${AA}"
> rm ${TESTDIR}/*${AA}
> done
Hmmm, strange. It did here. I had a ton of files in the test directory.
> Meanwhile, on my test rig using fallocate did _not_ result in final
> exhaustion of resources. That is btrfs fi df /mnt/Work didn't show
> significant changes on a near full expanse.
Hmmm, I had it running until it had allocated about 5 GiB in the data
chunks, but I stopped it yesterday. It took a long time to get there; it
seems quite slow at filling a 10 GiB RAID 1 BTRFS. I bet that is due to the
many forks for the fallocate command.
But it seems my fallocate works differently from yours. Mine is:
merkaba:~> fallocate --version
fallocate von util-linux 2.25.2
> I also never got a failed response back from fallocate, that is the
> inner loop never terminated. This could be a problem with the system
> call itself or it could be a problem with the application wrapper.
Hmmm, it should return a failure like this:
merkaba:/mnt/btrfsraid1> LANG=C fallocate -l 20G 20g
fallocate: fallocate failed: No space left on device
merkaba:/mnt/btrfsraid1#1> echo $?
1
> Nor did I reach the CPU saturation I expected.
No, I didn't reach it either. Just 5% or so for the script itself, and I
didn't see any notable kworker activity. But then, I stopped it before the
filesystem was full.
> e.g.
> Gust vm # btrfs fi df /mnt/Work/
> Data, RAID1: total=1.72GiB, used=1.66GiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=256.00MiB, used=57.84MiB
> GlobalReserve, single: total=32.00MiB, used=0.00B
>
> time passes while script running...
>
> Gust vm # btrfs fi df /mnt/Work/
> Data, RAID1: total=1.72GiB, used=1.66GiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=256.00MiB, used=57.84MiB
> GlobalReserve, single: total=32.00MiB, used=0.00B
>
> So there may be some limiting factor or something.
>
> Without the actual writes to the actual file expanse I don't get the stalls.
Interesting. Then we may have uncovered another performance issue with
fallocate on BTRFS.
>
> (I added a _touch_ of instrumentation, it makes the various catostrophy
> events a little more obvious in context. 8-)
>
> mount /dev/whattever /mnt/Work
> typeset -i counter=0
> for AA in {0..9}
> do
> while
> dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 +
> $RANDOM)) count=1 2>/dev/null
> do
> if ((counter%100 == 0))
> then
> echo $counter
> if ((counter%1000 == 0))
> then
> btrfs fi df /mnt/Work
> fi
> fi
> done
> btrfs fi df /mnt/Work
> echo "removing ${AA}"
> rm /mnt/Work/*${AA}
> btrfs fi df /mnt/Work
> done
>
> So you definitely need the writes to really see the stalls.
Hmmm, interesting. I will try this some time, but right now other things
are also important, so I am taking a break from this.
> > I may try with with my test BTRFS. I could even make it 2x20 GiB RAID 1
> > as well.
>
> I guess I never mentioned it... I am using 4x1GiB NOCOW files through
> losetup as the basis of a RAID1. No compression (by virtue of the NOCOW
> files in underlying fs, and not being set in the resulting mount). No
> encryption. No LVM.
Well okay, I am using BTRFS RAID 1 on two logical volumes in two different
volume groups, each spanning a partition on a different SSD:
Intel SSD 320, 300 GB, on SATA-600 (but the SSD can only do SATA-300) +
Crucial m500, 480 GB, on mSATA-300 (but the SSD could do SATA-600)
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
2014-12-28 13:56 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
2014-12-28 15:00 ` Martin Steigerwald
@ 2014-12-29 9:25 ` Martin Steigerwald
1 sibling, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-29 9:25 UTC (permalink / raw)
To: Hugo Mills; +Cc: Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 29945 bytes --]
Am Sonntag, 28. Dezember 2014, 14:56:21 schrieb Martin Steigerwald:
> Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
> > Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
> > > Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> > > > Summarized at
> > > >
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > >
> > > > see below. This is reproducable with fio, no need for Windows XP in
> > > > Virtualbox for reproducing the issue. Next I will try to reproduce with
> > > > a freshly creating filesystem.
> > > >
> > > >
> > > > Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > > > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > > > Hello!
> > > > > > > >
> > > > > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > > > >
> > > > > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > > > > bug
> > > > > > > > report:
> > > > > > > >
> > > > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > > > > >
> > > > > > > > compress=lzo:
> > > > > > > (there is no known problem with skinny metadata, it's actually more
> > > > > > > efficient than the older format. There has been some anecdotes about
> > > > > > > mixing the skinny and fat metadata but nothing has ever been
> > > > > > > demonstrated problematic.)
> > > > > > >
> > > > > > > > merkaba:~> btrfs fi sh /home
> > > > > > > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > > >
> > > > > > > > Total devices 2 FS bytes used 144.41GiB
> > > > > > > > devid 1 size 160.00GiB used 160.00GiB path
> > > > > > > > /dev/mapper/msata-home
> > > > > > > > devid 2 size 160.00GiB used 160.00GiB path
> > > > > > > > /dev/mapper/sata-home
> > > > > > > >
> > > > > > > > Btrfs v3.17
> > > > > > > > merkaba:~> btrfs fi df /home
> > > > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > > >
> > > > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > > > >
> > > > > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > > > > > cause I know no tax return software for Linux which would be suitable
> > > > > > > > for
> > > > > > > > Germany and I frankly don´t care about the end of security cause all
> > > > > > > > surfing and other network access I will do from the Linux box and I
> > > > > > > > only
> > > > > > > > run the VM behind a firewall).
> > > > > > >
> > > > > > > > And thus I try the balance dance again:
> > > > > > > ITEM: Balance... it doesn't do what you think it does...
> > > > > > >
> > > > > > > "Balancing" is something you should almost never need to do. It is only
> > > > > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > > > > etc.) of for cases when you've radically changed allocation behaviors
> > > > > > > (like you decided to remove all your VM's or you've decided to remove a
> > > > > > > mail spool directory full of thousands of tiny files).
> > > > > > >
> > > > > > > People run balance all the time because they think they should. They are
> > > > > > > _usually_ incorrect in that belief.
> > > > > >
> > > > > > I only see the lockups of BTRFS if the trees *occupy* all space on the
> > > > > > device.
> > > > > No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > > > space. What's more, balance does *not* balance the metadata trees. The
> > > > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > > > data, and you have some 13 GiB of that available for use.
> > > > >
> > > > > Now, since you're seeing lockups when the space on your disks is
> > > > > all allocated I'd say that's a bug. However, you're the *only* person
> > > > > who's reported this as a regular occurrence. Does this happen with all
> > > > > filesystems you have, or just this one?
> > > > >
> > > > > > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > > > > > from to *extend* a tree.
> > > > >
> > > > > It's not a tree. It's simply space allocation. It's not even space
> > > > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > > > saying "I'm going to use this piece of disk for this purpose").
> > > > >
> > > > > > This may be a bug, but this is what I see.
> > > > > >
> > > > > > And no amount of "you should not balance a BTRFS" will make that
> > > > > > perception go away.
> > > > > >
> > > > > > See, I see the sun coming out in the morning and you tell me "no, it
> > > > > > doesn't". Simply put, that is not going to match my perception.
> > > > >
> > > > > Duncan's assertion is correct in its detail. Looking at your space
> > > >
> > > > Robert's
> > > >
> > > > > usage, I would not suggest that running a balance is something you
> > > > > need to do. Now, since you have these lockups that seem quite
> > > > > repeatable, there's probably a lurking bug in there, but hacking
> > > > > around with balance every time you hit it isn't going to get the
> > > > > problem solved properly.
> > > > >
> > > > > I think I would suggest the following:
> > > > >
> > > > > - make sure you have some way of logging your dmesg permanently (use
> > > > > a different filesystem for /var/log, or a serial console, or a
> > > > > netconsole)
> > > > >
> > > > > - when the lockup happens, hit Alt-SysRq-t a few times
> > > > >
> > > > > - send the dmesg output here, or post to bugzilla.kernel.org
> > > > >
> > > > > That's probably going to give enough information to the developers
> > > > > to work out where the lockup is happening, and is clearly the way
> > > > > forward here.
> > > >
> > > > And I got it reproduced. *Perfectly* reproduced, I´d say.
> > > >
> > > > But let me run the whole story:
> > > >
> > > > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> > >
> > > [… story of trying to reproduce with Windows XP defragmenting which was
> > > unsuccessful as BTRFS still had free device space to allocate new chunks
> > > from …]
> > >
> > > > But finally I got to:
> > > >
> > > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > > Sa 27. Dez 13:26:39 CET 2014
> > > > Label: 'home' uuid: [some UUID]
> > > > Total devices 2 FS bytes used 152.83GiB
> > > > devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > > > devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > > >
> > > > Btrfs v3.17
> > > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > >
> > > >
> > > >
> > > > So I thought: if Virtualbox can write randomly into a file, I can too.
> > > >
> > > > So I did:
> > > >
> > > >
> > > > martin@merkaba:~> cat ssd-test.fio
> > > > [global]
> > > > bs=4k
> > > > #ioengine=libaio
> > > > #iodepth=4
> > > > size=4g
> > > > #direct=1
> > > > runtime=120
> > > > filename=ssd.test.file
> > > >
> > > > [seq-write]
> > > > rw=write
> > > > stonewall
> > > >
> > > > [rand-write]
> > > > rw=randwrite
> > > > stonewall
> > > >
> > > >
> > > >
> > > > And got:
> > > >
> > > > ATOP - merkaba 2014/12/27 13:41:02 ----------- 10s elapsed
> > > > PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct |
> > > > CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% |
> > > > cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% |
> > > > cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% |
> > > > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > > > DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms |
> > > > NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
> > > > 18079 - martin martin 2 9.99s 0.00s 0K 0K 0K 16K -- - R 1 100% fio
> > > > 4746 - martin martin 2 0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole
> > > > 3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop
> > > > 1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg
> > > > 10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop
> > > >
> > > > while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
> > > > for 10 seconds while allocating a 4 GiB file on a filesystem like:
> > > >
> > > > martin@merkaba:~> LANG=C df -hT /home
> > > > Filesystem Type Size Used Avail Use% Mounted on
> > > > /dev/mapper/msata-home btrfs 170G 156G 17G 91% /home
> > > >
> > > > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > > > GiB file in place, so there was even 4 GiB more free before.)
> > > >
> > > >
> > > > But it gets even more visible:
> > > >
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
> > > >
> > > >
> > > > Yes, that's 0 IOPS.
> > > >
> > > > 0 IOPS, as in zero IOPS. For minutes.
> > > >
> > > >
> > > >
> > > > And here is why:
> > > >
> > > > ATOP - merkaba 2014/12/27 13:46:52 ----------- 10s elapsed
> > > > PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct |
> > > > CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% |
> > > > cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% |
> > > > cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% |
> > > > CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms |
> > > > LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms |
> > > > LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > > > LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms |
> > > > DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms |
> > > > DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms |
> > > > NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 |
> > > > NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > > > 14973 - root root 1 8.92s 0.00s 0K 0K 0K 144K -- - S 0 89% kworker/u8:14
> > > > 17450 - root root 1 0.86s 0.00s 0K 0K 0K 32K -- - R 3 9% kworker/u8:5
> > > > 788 - root root 1 0.25s 0.00s 0K 0K 128K 18880K -- - S 3 3% btrfs-transact
> > > > 12254 - root root 1 0.14s 0.00s 0K 0K 64K 576K -- - S 2 1% kworker/u8:3
> > > > 17332 - root root 1 0.11s 0.00s 0K 0K 112K 1348K -- - S 2 1% kworker/u8:4
> > > > 3291 - martin martin 4 0.01s 0.09s 0K 0K 0K 0K -- - S 1 1% plasma-deskto
> > > >
> > > >
> > > >
> > > >
> > > > ATOP - merkaba 2014/12/27 13:47:12 ----------- 10s elapsed
> > > > PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct |
> > > > CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% |
> > > > cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
> > > > CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 |
> > > > MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M |
> > > > SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G |
> > > > LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms |
> > > > LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms |
> > > > LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms |
> > > > LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
> > > > DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
> > > > DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
> > > > NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
> > > > NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
> > > > NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
> > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > > >
> > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > > > 17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5
> > > > 4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole
> > > > 10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18
> > > > 1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg
> > > > 17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4
> > > > 17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6
> > > > 17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7
> > > > 3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin
> > > > 10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop
> > > >
> > > >
> > > >
> > > > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > > > usage: the SSDs are *idling* around. Heck, 2400 write accesses in 10 seconds.
> > > > That's a joke with SSDs that can do 40000 IOPS (depending on how and what
> > > > you measure, of course: request size, read, write, iodepth and so on).
> > > >
> > > > It's kworker/u8:5 utilizing 100% of one core for minutes.
> > > >
> > > >
> > > >
> > > > It's the random write case, it seems. Here are values from the fio job:
> > > >
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> > > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> > > > write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> > > > clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> > > > lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> > > > clat percentiles (usec):
> > > > | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
> > > > | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
> > > > | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9],
> > > > | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128],
> > > > | 99.99th=[10304]
> > > > bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> > > > lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> > > > lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> > > > lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> > > > cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> > > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> > > > latency : target=0, window=0, percentile=100.00%, depth=1
> > > >
> > > > Seems fine.
> > > >
> > > >
> > > > But:
> > > >
> > > > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> > > > write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> > > > clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> > > > lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> > > > clat percentiles (usec):
> > > > | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5],
> > > > | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6],
> > > > | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10],
> > > > | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116],
> > > > | 99.99th=[16711680]
> > > > bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> > > > lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> > > > lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> > > > cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> > > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> > > > latency : target=0, window=0, percentile=100.00%, depth=1
> > > >
> > > > Run status group 0 (all jobs):
> > > > WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > > >
> > > > Run status group 1 (all jobs):
> > > > WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> > > >
> > > >
> > > > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> > > >
> > > > What?
> > > >
> > > > Ey, *what*?
> > […]
> > > > There we go:
> > > >
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > >
> > > I have done more tests.
> > >
> > > This is on the same /home after extending it to 170 GiB and balancing it with
> > > btrfs balance start -dusage=80
> > >
> > > It has plenty of free space. I updated the bug report and hope it gives
> > > an easy-to-comprehend summary. The new tests are in:
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> > >
> > >
> > >
> > > Pasting below for discussion on the list. Summary: I easily get 38000 (!)
> > > IOPS. It may be an idea to reduce to 160 GiB, but right now this does
> > > not work, as it says no free space on device when trying to downsize it.
> > > I may try with 165 or 162 GiB.
> > >
> > > So now we have three IOPS figures:
> > >
> > > - 256 IOPS in worst case scenario
> > > - 4700 IOPS when trying to reproduce worst case scenario with a fresh and small
> > > BTRFS
> > > - 38000 IOPS when /home has unused device space to allocate chunks from
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> > >
> > >
> > > This is another test.
> >
> >
> > Okay, and this is the last series of tests for today.
> >
> > Conclusion:
> >
> > I cannot manage to bring it to its knees as before, but I come close.
> >
> > Still, it's 8000 IOPS instead of 250 IOPS, in a situation that, according
> > to btrfs fi sh, is even *worse* than before.
> >
> > That hints at the need to look at the free space fragmentation, as in the
> > beginning the problem started appearing with:
> >
> > merkaba:~> btrfs fi sh /home
> > Label: 'home' uuid: […]
> > Total devices 2 FS bytes used 144.41GiB
> > devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> >
> > Btrfs v3.17
> > merkaba:~> btrfs fi df /home
> > Data, RAID1: total=154.97GiB, used=141.12GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> >
> >
> > Yes, that's 13 GiB of free space *within* the chunks.
> >
> > So while I can get the IOPS down by bringing it into a situation where it
> > cannot reserve additional data chunks again, I cannot recreate the
> > abysmal 250 IOPS figure this way. Not even with my /home filesystem.
> >
> > So there is more to it. I think it's important to look into free space
> > fragmentation. It seems it needs an *aged* filesystem to recreate. And
> > it seems the balances really helped, as I am not able to recreate the
> > issue to that extent right now.
> >
> > So this shows my original idea about free device space to allocate from
> > also doesn't explain it fully. It seems to be something going on
> > within the chunks that explains the worst-case scenario of <300 IOPS, a
> > kworker using one core for minutes, and a locked desktop.
> >
> > Is there a way to view free space fragmentation in BTRFS?
>
> So to rephrase that:
>
> From what I perceive the worst case issue happens when
>
> 1) BTRFS cannot reserve any new chunks from unused device space anymore.
>
> 2) The free space in the existing chunks is highly fragmented.
>
> Either condition alone is not sufficient to trigger it.
>
> That's at least my current idea about it.
With:
merkaba:~> btrfs fi df /home
Data, RAID1: total=163.87GiB, used=146.92GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.94GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: […]
Total devices 2 FS bytes used 150.18GiB
devid 1 size 170.00GiB used 169.84GiB path /dev/mapper/msata-home
devid 2 size 170.00GiB used 169.84GiB path /dev/mapper/sata-home
Btrfs v3.17
I had a noticeable hang during sdelete.exe -z in the Windows XP VM with its 20 GiB VDI file (Patrik on the mailing list told me they have changed the argument from -c to -z, as I wondered why VBoxManage modifyhd Winlala.vdi --compact did not reduce the size of the file).
It was not as bad as before, but the desktop was easily locked for more than 5 seconds.
So this also happens with larger free space *within* the chunks. Before I do the VBoxManage --compact I will rebalance partially.
So this definitely shows it can happen when BTRFS cannot reserve any new
chunks anymore, yet still has *plenty* of free space within the existing data
chunks.
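For reference, the partial rebalance I use can be sketched as a small loop. This is a hedged sketch, not an official tool: the mountpoint and the usage thresholds are my own choices, and the DRY_RUN guard (on by default here) only prints the commands instead of running them.

```shell
#!/bin/sh
# Incremental "balance dance": retry the data balance with growing usage
# thresholds so each pass only rewrites chunks that are less than N% full.
# Low thresholds free whole chunks cheaply; higher ones move more data.
MOUNTPOINT=${MOUNTPOINT:-/home}
DRY_RUN=${DRY_RUN:-1}          # default: only print the commands

run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

for usage in 5 10 20 40 80; do
    run btrfs balance start -dusage="$usage" "$MOUNTPOINT" || break
done
```

With DRY_RUN unset it actually runs the balances; the break stops the loop as soon as a pass fails, so a failed pass is not retried with an even more expensive threshold.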
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
2014-12-29 2:07 ` Zygo Blaxell
@ 2014-12-29 9:32 ` Martin Steigerwald
2015-01-06 20:03 ` Zygo Blaxell
0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-29 9:32 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Hugo Mills, Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 3906 bytes --]
Am Sonntag, 28. Dezember 2014, 21:07:05 schrieb Zygo Blaxell:
> On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
> > My simple test case didn't trigger it, and I do not have another two times
> > 160 GiB available on these SSDs to try with a copy of my home filesystem.
> > With that I could safely test without bringing the desktop session to
> > a halt. Maybe someone has an idea on how to "enhance" my test case in
> > order to reliably trigger the issue.
> >
> > It may be challenging, though. My /home is quite a filesystem. It has a maildir
> > with at least a million files (yeah, I am performance testing KMail and
> > Akonadi to the limit as well!), and it has git repos and this one VM image,
> > and the desktop search and the Akonadi database. In other words: it has
> > been hit nicely with various, I think mostly random, workloads over roughly
> > the last six months. I bet it's not that easy to simulate that. Maybe some runs
> > of compilebench to age the filesystem before the fio test?
> >
> > That said, BTRFS performs a lot better. The complete lockups without any
> > CPU usage of 3.15 and 3.16 are gone for sure. That's wonderful. But there
> > is this kworker issue now. I noticed it that severely only while trying to
> > complete this tax return stuff with the Windows XP VM. It may have
> > happened otherwise as well; I have seen some backtraces in kern.log, but it
> > didn't last for minutes. So this is indeed of less severity than the full
> > lockups with 3.15 and 3.16.
> >
> > Zygo, what are the characteristics of your filesystem? Do you use
> > compress=lzo and skinny metadata as well? How are the chunks allocated?
> > What kind of data do you have on it?
>
> compress-force (default zlib), no skinny-metadata. Chunks are d=single,
> m=dup. Data is a mix of various desktop applications, most active
> file sizes from a few hundred K to a few MB, maybe 300k-400k files.
> No database or VM workloads. Filesystem is 100GB and is usually between
> 98 and 99% full (about 1-2GB free).
>
> I have another filesystem which has similar problems when it's 99.99%
> full (it's 13TB, so 0.01% is 1.3GB). That filesystem is RAID1 with
> skinny-metadata and no-holes.
>
> On various filesystems I have the above CPU-burning problem, a bunch of
> irreproducible random crashes, and a hang with a kernel stack that goes
> through SyS_unlinkat and btrfs_evict_inode.
Zygo, thanks. That desktop filesystem sounds a bit similar to my use case,
with the interesting difference that you have no databases or VMs on it.
That said, I use the Windows XP VM rarely, but using it was what made the issue
so visible for me. Is your desktop filesystem on SSD?
Do you have the chance to extend one of the affected filesystems to check
my theory that this does not happen as long as BTRFS can still allocate new
data chunks? If it's right, your FS should be fluid again as long as you see
more than 1 GiB free
Label: none uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
Total devices 2 FS bytes used 512.00KiB
devid 1 size 10.00GiB used 6.53GiB path /dev/mapper/sata-btrfsraid1
devid 2 size 10.00GiB used 6.53GiB path /dev/mapper/msata-btrfsraid1
between "size" and "used" in btrfs fi sh. I suggest going with at least 2-3
GiB, as BTRFS may allocate just one chunk so quickly that you do not have
the chance to recognize the difference.
Well, and if that works for you, we are back to my recommendation:
more so than with other filesystems, give BTRFS plenty of free space to
operate with. Ideally enough that you always have a minimum of 2-3 GiB of
unused device space left for chunk reservation. One could even write a
Nagios/Icinga monitoring plugin for that :)
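A minimal sketch of such a check could look like the following. It assumes the btrfs fi show output format shown above with all sizes reported in GiB, and the warning threshold is a placeholder; a real plugin would need more robust parsing (other units, locales) and proper Nagios exit codes.

```shell
#!/bin/sh
# Nagios/Icinga-style sketch: warn when the smallest per-device
# unallocated space (size minus used in `btrfs fi show`) drops below a
# threshold.  Assumes all values are reported in GiB.
WARN_GIB=${WARN_GIB:-3}

# Read `btrfs fi show` output on stdin and print the smallest
# "size - used" difference (in GiB) across all devids.
min_unalloc_gib() {
    awk '/devid/ {
            gsub(/GiB/, "")
            for (i = 1; i <= NF; i++) {
                if ($i == "size") size = $(i + 1)
                if ($i == "used") used = $(i + 1)
            }
            free = size - used
            if (min == "" || free < min) min = free
         }
         END { printf "%.2f\n", min }'
}

# Hypothetical usage against a live filesystem:
#   unalloc=$(btrfs fi show /home | min_unalloc_gib)
#   awk -v u="$unalloc" -v w="$WARN_GIB" 'BEGIN { exit (u < w) }' \
#       || echo "WARNING: only ${unalloc} GiB unallocated"
```

For the 10 GiB example above (size 10.00GiB, used 6.53GiB on both devids) this reports 3.47 GiB unallocated, comfortably above the suggested 2-3 GiB floor.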
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again
2014-12-28 17:04 ` Patrik Lundquist
@ 2014-12-29 10:14 ` Martin Steigerwald
0 siblings, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-29 10:14 UTC (permalink / raw)
To: Patrik Lundquist; +Cc: linux-btrfs
Am Sonntag, 28. Dezember 2014, 18:04:31 schrieb Patrik Lundquist:
> On 28 December 2014 at 13:03, Martin Steigerwald <Martin@lichtvoll.de> wrote:
> >
> > BTW, I found that the Oracle blog didn´t work at all for me. I completed
> > a cycle of defrag, sdelete -c and VBoxManage compact, [...] and it
> > apparently did *nothing* to reduce the size of the file.
>
> They've changed the argument to -z; sdelete -z.
Now how cute is that. Thank you. This did the trick:
martin@merkaba:~/.VirtualBox/HardDisks> VBoxManage modifyhd Winlala.vdi --compact
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
martin@merkaba:~/.VirtualBox/HardDisks> ls -lh
total 12G
-rw------- 1 martin martin 12G Dec 29 11:00 Winlala.vdi
martin@merkaba:~/.VirtualBox/HardDisks>
It has been 20 GiB before.
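For the archives, here is the whole cycle in one place, as a sketch: the guest-side steps are comments since they run inside Windows, and the host-side command is wrapped in a dry-run guard (on by default) so the snippet only prints what it would do. The file name Winlala.vdi is of course specific to my setup.

```shell
#!/bin/sh
# Shrink a dynamically allocated VDI: zero free space in the guest,
# then let VirtualBox drop the zeroed blocks on the host.
DRY_RUN=${DRY_RUN:-1}          # default: only print the command

run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

# 1) Inside the Windows guest: defragment, then zero the free space so
#    VirtualBox can detect the unused blocks (newer sdelete versions use
#    -z where older ones used -c):
#        defrag c:
#        sdelete -z c:
# 2) Shut the guest down, then on the Linux host:
run VBoxManage modifyhd Winlala.vdi --compact
```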
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
2014-12-29 9:32 ` Martin Steigerwald
@ 2015-01-06 20:03 ` Zygo Blaxell
2015-01-07 19:08 ` Martin Steigerwald
0 siblings, 1 reply; 59+ messages in thread
From: Zygo Blaxell @ 2015-01-06 20:03 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Hugo Mills, Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 5373 bytes --]
On Mon, Dec 29, 2014 at 10:32:00AM +0100, Martin Steigerwald wrote:
> Am Sonntag, 28. Dezember 2014, 21:07:05 schrieb Zygo Blaxell:
> > On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
> > > My simple test case didn't trigger it, and I do not have another two times
> > > 160 GiB available on these SSDs to try with a copy of my home filesystem.
> > > With that I could safely test without bringing the desktop session to
> > > a halt. Maybe someone has an idea on how to "enhance" my test case in
> > > order to reliably trigger the issue.
> > >
> > > It may be challenging, though. My /home is quite a filesystem. It has a maildir
> > > with at least a million files (yeah, I am performance testing KMail and
> > > Akonadi to the limit as well!), and it has git repos and this one VM image,
> > > and the desktop search and the Akonadi database. In other words: it has
> > > been hit nicely with various, I think mostly random, workloads over roughly
> > > the last six months. I bet it's not that easy to simulate that. Maybe some runs
> > > of compilebench to age the filesystem before the fio test?
> > >
> > > That said, BTRFS performs a lot better. The complete lockups without any
> > > CPU usage of 3.15 and 3.16 are gone for sure. That's wonderful. But there
> > > is this kworker issue now. I noticed it that severely only while trying to
> > > complete this tax return stuff with the Windows XP VM. It may have
> > > happened otherwise as well; I have seen some backtraces in kern.log, but it
> > > didn't last for minutes. So this is indeed of less severity than the full
> > > lockups with 3.15 and 3.16.
> > >
> > > Zygo, what are the characteristics of your filesystem? Do you use
> > > compress=lzo and skinny metadata as well? How are the chunks allocated?
> > > What kind of data do you have on it?
> >
> > compress-force (default zlib), no skinny-metadata. Chunks are d=single,
> > m=dup. Data is a mix of various desktop applications, most active
> > file sizes from a few hundred K to a few MB, maybe 300k-400k files.
> > No database or VM workloads. Filesystem is 100GB and is usually between
> > 98 and 99% full (about 1-2GB free).
> >
> > I have another filesystem which has similar problems when it's 99.99%
> > full (it's 13TB, so 0.01% is 1.3GB). That filesystem is RAID1 with
> > skinny-metadata and no-holes.
> >
> > On various filesystems I have the above CPU-burning problem, a bunch of
> > irreproducible random crashes, and a hang with a kernel stack that goes
> > through SyS_unlinkat and btrfs_evict_inode.
>
> Zygo, thanks. That desktop filesystem sounds a bit similar to my use case,
> with the interesting difference that you have no databases or VMs on it.
>
> > That said, I use the Windows XP VM rarely, but using it was what made the issue
> > so visible for me. Is your desktop filesystem on SSD?
No, but I recently stumbled across the same symptoms on an 8GB SD card
on kernel 3.12.24 (raspberry pi). When the filesystem hit over ~97%
full, all accesses were blocked for several minutes. I was able to
work around it by adjusting the threshold on a garbage collector daemon
(i.e. deleting a lot of expendable files) to keep usage below 90%.
I didn't try to balance the filesystem, and didn't seem to need to.
ext3 has a related problem when it's nearly full: it will try to search
gigabytes of block allocation bitmaps searching for a free block, which
can result in a single 'mkdir' call spending 45 minutes reading a large
slow 99.5% full filesystem.
I'd expect a btrfs filesystem that was nearly full to have a small tree
of cached free space extents and be able to search it quickly even if
the result is negative (i.e. there's no free space). It seems to be
doing something else... :-P
> Do you have the chance to extend one of the affected filesystems to check
> my theory that this does not happen as long as BTRFS can still allocate new
> data chunks? If it's right, your FS should be responsive again as long as you
> see more than 1 GiB free
>
> Label: none uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
> Total devices 2 FS bytes used 512.00KiB
> devid 1 size 10.00GiB used 6.53GiB path /dev/mapper/sata-btrfsraid1
> devid 2 size 10.00GiB used 6.53GiB path /dev/mapper/msata-btrfsraid1
>
> between "size" and "used" in btrfs fi sh. I suggest going with at least 2-3
> GiB, as BTRFS may allocate just one chunk so quickly that you do not have
> the chance to recognize the difference.
So far I've found that problems start when space drops below 1GB free
(although it can go as low as 400MB) and problems stop when space gets
above 1GB free, even without resizing or balancing the filesystem.
I've adjusted free space monitoring thresholds accordingly for now,
and it seems to be keeping things working so far.
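For what it's worth, that kind of threshold check is only a few lines of shell. A minimal sketch (the function name, the / mount point, and the 1 GiB level are illustrative assumptions; a real Nagios/Icinga plugin would add a critical level and the usual exit codes):

```shell
# Hypothetical free-space threshold check along the lines described above.
# check_free_space MOUNT THRESHOLD_KB: prints OK/WARNING, returns 0/1.
check_free_space() {
    mnt=$1
    threshold_kb=$2
    # df's "Available" column in 1 KiB units, stripped of whitespace
    avail_kb=$(df -k --output=avail "$mnt" | tail -n 1 | tr -d ' ')
    if [ "$avail_kb" -lt "$threshold_kb" ]; then
        echo "WARNING: $mnt has only ${avail_kb} KiB available"
        return 1
    fi
    echo "OK: $mnt has ${avail_kb} KiB available"
    return 0
}

# Example: warn when / drops below 1 GiB of available space
check_free_space / $((1024 * 1024)) || echo "below the 1 GiB threshold"
```

(`df --output=avail` assumes GNU coreutils, which is what the systems in this thread run.)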
> Well, and if that works for you, we are back to my recommendation:
>
> More so than with other filesystems, give BTRFS plenty of free space to
> operate with. Ideally enough that you always have a minimum of 2-3 GiB of
> unused device space left for chunk reservation. One could even do some
> Nagios/Icinga monitoring plugin for that :)
>
> --
> Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
2015-01-06 20:03 ` Zygo Blaxell
@ 2015-01-07 19:08 ` Martin Steigerwald
2015-01-07 21:41 ` Zygo Blaxell
2015-01-08 5:45 ` Duncan
0 siblings, 2 replies; 59+ messages in thread
From: Martin Steigerwald @ 2015-01-07 19:08 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Hugo Mills, Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 4639 bytes --]
On Tuesday, 6 January 2015, 15:03:23, Zygo Blaxell wrote:
> On Mon, Dec 29, 2014 at 10:32:00AM +0100, Martin Steigerwald wrote:
> > On Sunday, 28 December 2014, 21:07:05, Zygo Blaxell wrote:
> > > On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
[…]
> > > > Zygo, what are the characteristics of your filesystem? Do you use
> > > > compress=lzo and skinny metadata as well? How are the chunks
> > > > allocated?
> > > > What kind of data do you have on it?
> > >
> > > compress-force (default zlib), no skinny-metadata. Chunks are d=single,
> > > m=dup. Data is a mix of various desktop applications, most active
> > > file sizes from a few hundred K to a few MB, maybe 300k-400k files.
> > > No database or VM workloads. Filesystem is 100GB and is usually between
> > > 98 and 99% full (about 1-2GB free).
> > >
> > > I have another filesystem which has similar problems when it's 99.99%
> > > full (it's 13TB, so 0.01% is 1.3GB). That filesystem is RAID1 with
> > > skinny-metadata and no-holes.
> > >
> > > On various filesystems I have the above CPU-burning problem, a bunch of
> > > irreproducible random crashes, and a hang with a kernel stack that goes
> > > through SyS_unlinkat and btrfs_evict_inode.
> >
> > Zygo, thanks. That desktop filesystem sounds a bit similar to my usecase,
> > with the interesting difference that you have no databases or VMs on it.
> >
> > That said, I use the Windows XP VM rarely, but using it was what made the
> > issue so visible for me. Is your desktop filesystem on SSD?
>
> No, but I recently stumbled across the same symptoms on an 8GB SD card
> on kernel 3.12.24 (raspberry pi). When the filesystem hit over ~97%
> full, all accesses were blocked for several minutes. I was able to
> work around it by adjusting the threshold on a garbage collector daemon
> (i.e. deleting a lot of expendable files) to keep usage below 90%.
> I didn't try to balance the filesystem, and didn't seem to need to.
Interesting.
> ext3 has a related problem when it's nearly full: it will try to search
> gigabytes of block allocation bitmaps searching for a free block, which
> can result in a single 'mkdir' call spending 45 minutes reading a large
> slow 99.5% full filesystem.
OK, that's for bitmap access. Ext4 uses extents. BTRFS can use bitmaps as well,
but also supports extents and I think uses them for most use cases.
> I'd expect a btrfs filesystem that was nearly full to have a small tree
> of cached free space extents and be able to search it quickly even if
> the result is negative (i.e. there's no free space). It seems to be
> doing something else... :-P
Yeah :)
> > Do you have the chance to extend one of the affected filesystems to check
> > my theory that this does not happen as long as BTRFS can still allocate
> > new
> > data chunks? If it's right, your FS should be responsive again as long as
> > you see more than 1 GiB free
> >
> > Label: none uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
> >
> > Total devices 2 FS bytes used 512.00KiB
> > devid 1 size 10.00GiB used 6.53GiB path
> > /dev/mapper/sata-btrfsraid1
> > devid 2 size 10.00GiB used 6.53GiB path
> > /dev/mapper/msata-btrfsraid1
> >
> > between "size" and "used" in btrfs fi sh. I suggest going with at least
> > 2-3
> > GiB, as BTRFS may allocate just one chunk so quickly that you do not have
> > the chance to recognize the difference.
>
> So far I've found that problems start when space drops below 1GB free
> (although it can go as low as 400MB) and problems stop when space gets
> above 1GB free, even without resizing or balancing the filesystem.
> I've adjusted free space monitoring thresholds accordingly for now,
> and it seems to be keeping things working so far.
Just to make sure we are talking about the same thing: you mean space that BTRFS
has not yet reserved for chunks, i.e. the difference between size and used in
btrfs fi sh, right?
No BTRFS developer has commented on this yet, neither in this thread nor in the
bug report I filed at kernel.org.
> > Well, and if that works for you, we are back to my recommendation:
> >
> > More so than with other filesystems, give BTRFS plenty of free space to
> > operate with. Ideally enough that you always have a minimum of 2-3 GiB of
> > unused device space left for chunk reservation. One could even do some
> > Nagios/Icinga monitoring plugin for that :)
Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
2015-01-07 19:08 ` Martin Steigerwald
@ 2015-01-07 21:41 ` Zygo Blaxell
2015-01-08 5:45 ` Duncan
1 sibling, 0 replies; 59+ messages in thread
From: Zygo Blaxell @ 2015-01-07 21:41 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Hugo Mills, Robert White, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 1581 bytes --]
On Wed, Jan 07, 2015 at 08:08:50PM +0100, Martin Steigerwald wrote:
> On Tuesday, 6 January 2015, 15:03:23, Zygo Blaxell wrote:
> > ext3 has a related problem when it's nearly full: it will try to search
> > gigabytes of block allocation bitmaps searching for a free block, which
> > can result in a single 'mkdir' call spending 45 minutes reading a large
> > slow 99.5% full filesystem.
>
> OK, that's for bitmap access. Ext4 uses extents.
...and the problem doesn't happen to the same degree on ext4 as it did
on ext3.
> > So far I've found that problems start when space drops below 1GB free
> > (although it can go as low as 400MB) and problems stop when space gets
> > above 1GB free, even without resizing or balancing the filesystem.
> > I've adjusted free space monitoring thresholds accordingly for now,
> > and it seems to be keeping things working so far.
>
> Just to make sure we are talking about the same thing: you mean space that BTRFS
> has not yet reserved for chunks, i.e. the difference between size and used in
> btrfs fi sh, right?
The number I look at for this issue is statvfs() f_bavail (i.e. the
"Available" column of /bin/df).
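As a side note, the same field can be read directly with GNU stat, which exposes the raw statvfs() values; a quick sketch (GNU coreutils assumed, with / as an example mount point):

```shell
# df's "Available" column is statvfs() f_bavail multiplied by the block size.
mnt=/    # illustrative mount point
bavail=$(stat -f -c %a "$mnt")   # %a = f_bavail: blocks available to unprivileged users
bsize=$(stat -f -c %S "$mnt")    # %S = f_frsize: fundamental block size
echo "f_bavail=$bavail f_frsize=$bsize => $((bavail * bsize)) bytes available"
# For comparison, df reports the same space in 1 KiB units:
df -k --output=avail "$mnt" | tail -n 1
```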
Before the empty-chunk-deallocation code, most of my filesystems would
quickly reach a steady state where all space is allocated to chunks,
and they stay that way unless I have to downsize them.
Now there is free (non-chunk) space on most of my filesystems. I'll try
monitoring btrfs fi df and btrfs fi show under the failing conditions
and see if there are interesting correlations.
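As a sketch of what such monitoring could parse: the unallocated device space is the per-devid size-minus-used gap in `btrfs fi show`. The awk field positions below are assumed from the v3.17 output quoted earlier in this thread, and GiB-only sizes are assumed for simplicity:

```shell
# Sum unallocated device space (size - used) from `btrfs fi show` output.
# Assumes every devid line reports sizes in GiB, as in the quoted output.
unallocated_gib() {
    awk '/devid/ {
        size = $4; used = $6;
        sub(/GiB/, "", size); sub(/GiB/, "", used);
        total += size - used;
    }
    END { printf "%.2f\n", total }'
}

# Example with the output quoted earlier in the thread:
unallocated_gib <<'EOF'
Label: none  uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
	Total devices 2 FS bytes used 512.00KiB
	devid 1 size 10.00GiB used 6.53GiB path /dev/mapper/sata-btrfsraid1
	devid 2 size 10.00GiB used 6.53GiB path /dev/mapper/msata-btrfsraid1
EOF
# prints 6.94
```

On a real system one would pipe `btrfs fi show` into the function instead of a here-document.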
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
2015-01-07 19:08 ` Martin Steigerwald
2015-01-07 21:41 ` Zygo Blaxell
@ 2015-01-08 5:45 ` Duncan
2015-01-08 10:18 ` Martin Steigerwald
1 sibling, 1 reply; 59+ messages in thread
From: Duncan @ 2015-01-08 5:45 UTC (permalink / raw)
To: linux-btrfs
Martin Steigerwald posted on Wed, 07 Jan 2015 20:08:50 +0100 as excerpted:
> No BTRFS developers commented yet on this, neither in this thread nor in
> the bug report at kernel.org I made.
Just a quick general note on this point...
There has in the past been dev comment (I believe referenced on the wiki)
to the effect that on the list they tend to pick particular reports/threads
and work on them until they either fix the issue or (when not urgent) decide
it must wait for something else first. While they are busy pursuing such a
report, they don't read others on the list very closely, and list-only bug
reports may thus get dropped on the floor and never worked on.
The recommendation, then, is to report it to the list, and if not picked
up right away and you plan on being around in a few weeks/months when
they potentially get to it, file a bug on it, so it doesn't get dropped
on the floor.
With the bugzilla.kernel.org report you've followed the recommendation,
but the implication is that you won't necessarily get any comment right
away, only later, when they're not immediately busy looking at some other
bug. So lack of b.k.o comment in the immediate term doesn't mean they're
ignoring the bug or don't value it; it just means they're hot on the
trail of something else ATM and it might take some time to get that
"first comment" engagement.
But the recommendation is to file the bugzilla report precisely so it
does /not/ get lost, and you've done that, so... you've done your part
there and now comes the enforced patience bit of waiting for that
engagement.
But if it takes a bit, I would keep the bug updated every kernel release
or so, with a comment updating status.
(Meanwhile, I've seen no indication of such issues here. Most of my
btrfs are 8-24 GiB each, all SSD, mostly dual-device btrfs raid1 for both
data and metadata. Maybe I don't run those full enough, however. I do have
three mixed-bg mode sub-GiB btrfs, and one of them, a 256 MiB
single-device dup-mode btrfs used as /boot, tends to run reasonably
full, but I've not seen a problem like that there, either. My use-case
probably simply doesn't hit the problem.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
2015-01-08 5:45 ` Duncan
@ 2015-01-08 10:18 ` Martin Steigerwald
2015-01-09 8:25 ` Duncan
0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2015-01-08 10:18 UTC (permalink / raw)
To: linux-btrfs
On Thursday, 8 January 2015, 05:45:56, you wrote:
> Martin Steigerwald posted on Wed, 07 Jan 2015 20:08:50 +0100 as excerpted:
> > No BTRFS developers commented yet on this, neither in this thread nor in
> > the bug report at kernel.org I made.
>
> Just a quick general note on this point...
>
> There has in the past been dev comment (I believe referenced on the wiki)
> to the effect that on the list they tend to pick particular reports/threads
> and work on them until they either fix the issue or (when not urgent) decide
> it must wait for something else first. While they are busy pursuing such a
> report, they don't read others on the list very closely, and list-only bug
> reports may thus get dropped on the floor and never worked on.
>
> The recommendation, then, is to report it to the list, and if not picked
> up right away and you plan on being around in a few weeks/months when
> they potentially get to it, file a bug on it, so it doesn't get dropped
> on the floor.
Duncan, I *did* file a bug.
[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for
minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
2015-01-08 10:18 ` Martin Steigerwald
@ 2015-01-09 8:25 ` Duncan
0 siblings, 0 replies; 59+ messages in thread
From: Duncan @ 2015-01-09 8:25 UTC (permalink / raw)
To: linux-btrfs
Martin Steigerwald posted on Thu, 08 Jan 2015 11:18:40 +0100 as excerpted:
> Duncan, I *did* file a bug.
I think you misunderstood me... I understood that and actually said as
much:
>> But the recommendation is to file the bugzilla report precisely so it
>> does /not/ get lost, and you've done that, so... you've done your part
>> there and now comes the enforced patience bit of waiting [...]
My point was simply that based on the wiki recommendation and the earlier
thread as mentioned on the wiki, the reason /why/ a bugzi report is
preferred over simply reporting it here is that the devs tend to pick
bugs and spend some time digging into them, during which they don't look
too much at other reports here, and they can get lost, while the bugzi
report won't.
Which implies that a failure to respond either to a thread here or a bug
report there is because they're busy working on other bugs, and that
failure to immediately respond isn't to be seen as ignoring the problem,
and is in fact to be expected.
IOW, I was saying now that the bug is filed, you can sit back and wait in
reasonable assurance that it'll be processed in due time, as you've done
your bit and now it's up to them to prioritize and process in due time.
That's a good thing, and I was commending you for taking the time to file
the bug as well. =:^)
... While at the same time commiserating a bit, since I know from
experience how hard that wait for a dev reply can be, and that the wait
is a sort of enforced patience, since at least for a non-coder like me,
there's not much else one can do. =:^(
That said, now that I reread, I can see how what I wrote could appear to
be contingent on an assumed /future/ filing of a bug, and that it wasn't
as clear as I intended that I was commending you for filing it already,
and basically saying, "Be patient, I know how hard it can be to wait."
Words! They be tricky! =:^(
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
end of thread, other threads:[~2015-01-09 8:34 UTC | newest]
Thread overview: 59+ messages
2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
2014-12-26 14:20 ` Martin Steigerwald
2014-12-26 14:41 ` Martin Steigerwald
2014-12-27 3:33 ` Duncan
2014-12-26 15:59 ` Martin Steigerwald
2014-12-27 4:26 ` Duncan
2014-12-26 22:48 ` Robert White
2014-12-27 5:54 ` Duncan
2014-12-27 9:01 ` Martin Steigerwald
2014-12-27 9:30 ` Hugo Mills
2014-12-27 10:54 ` Martin Steigerwald
2014-12-27 11:52 ` Robert White
2014-12-27 13:16 ` Martin Steigerwald
2014-12-27 13:49 ` Robert White
2014-12-27 14:06 ` Martin Steigerwald
2014-12-27 14:00 ` Robert White
2014-12-27 14:14 ` Martin Steigerwald
2014-12-27 14:21 ` Martin Steigerwald
2014-12-27 15:14 ` Robert White
2014-12-27 16:01 ` Martin Steigerwald
2014-12-28 0:25 ` Robert White
2014-12-28 1:01 ` Bardur Arantsson
2014-12-28 4:03 ` Robert White
2014-12-28 12:03 ` Martin Steigerwald
2014-12-28 17:04 ` Patrik Lundquist
2014-12-29 10:14 ` Martin Steigerwald
2014-12-28 12:07 ` Martin Steigerwald
2014-12-28 14:52 ` Robert White
2014-12-28 15:42 ` Martin Steigerwald
2014-12-28 15:47 ` Martin Steigerwald
2014-12-29 0:27 ` Robert White
2014-12-29 9:14 ` Martin Steigerwald
2014-12-27 16:10 ` Martin Steigerwald
2014-12-27 14:19 ` Robert White
2014-12-27 11:11 ` Martin Steigerwald
2014-12-27 12:08 ` Robert White
2014-12-27 13:55 ` Martin Steigerwald
2014-12-27 14:54 ` Robert White
2014-12-27 16:26 ` Hugo Mills
2014-12-27 17:11 ` Martin Steigerwald
2014-12-27 17:59 ` Martin Steigerwald
2014-12-28 0:06 ` Robert White
2014-12-28 11:05 ` Martin Steigerwald
2014-12-28 13:00 ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
2014-12-28 13:40 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare) Martin Steigerwald
2014-12-28 13:56 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
2014-12-28 15:00 ` Martin Steigerwald
2014-12-29 9:25 ` Martin Steigerwald
2014-12-27 18:28 ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
2014-12-27 18:40 ` Hugo Mills
2014-12-27 19:23 ` BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time) Martin Steigerwald
2014-12-29 2:07 ` Zygo Blaxell
2014-12-29 9:32 ` Martin Steigerwald
2015-01-06 20:03 ` Zygo Blaxell
2015-01-07 19:08 ` Martin Steigerwald
2015-01-07 21:41 ` Zygo Blaxell
2015-01-08 5:45 ` Duncan
2015-01-08 10:18 ` Martin Steigerwald
2015-01-09 8:25 ` Duncan