* BTRFS free space handling still needs more work: Hangs again
@ 2014-12-26 13:37 Martin Steigerwald
  2014-12-26 14:20 ` Martin Steigerwald
                   ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-26 13:37 UTC (permalink / raw)
  To: linux-btrfs


Hello!

First: Have a merry christmas and enjoy a quiet time in these days.

Second: Whenever you feel like it, here is a little rant, but also a bug
report:

I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
space_cache, skinny metadata extents – are these a problem? – and
compress=lzo:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
        Total devices 2 FS bytes used 144.41GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


And I had hangs with BTRFS again. This time it happened as I wanted to install
tax return software in a VirtualBox'd Windows XP VM (which I use once a year,
cause I know of no tax return software for Linux that would be suitable for
Germany, and I frankly don't care about the end of security support, cause all
surfing and other network access happens from the Linux box and I only run the
VM behind a firewall).


And thus I try the balance dance again:

merkaba:~> btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=5 /home          
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=10 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=20 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=30 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=40 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=50 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=60 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=70 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=70 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=70 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=65 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~> btrfs balance start -dusage=67 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -musage=10 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -musage=05 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail


Okay, not really, ey?



But

merkaba:~> btrfs balance start /home

works.

So basically I am rebalancing everything, without need I bet, causing more
churn on the SSDs than necessary.
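
(For the record, this whole dance could be scripted instead of typed in by
hand. A minimal sketch: it just loops the exact commands from above and stops
at the first failure, typically the ENOSPC:)

#!/bin/sh
# Step the balance usage filter upwards, mirroring the manual runs above.
# Stop at the first run that errors out (usually "No space left on device").
for pct in 5 10 20 30 40 50 60 70; do
    echo "== btrfs balance start -dusage=$pct /home =="
    btrfs balance start -dusage=$pct /home || break
done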


The alternative would be to make the BTRFS larger, I bet.


Well, this is still not what I would consider stable. So I will still
recommend: if you want to use BTRFS on a server and estimate 25 GiB of
usage, make the drive at least 50 GiB, or even 100 GiB to be on the safe
side. That is what I recommended for SLES 11 SP2/SP3 BTRFS deployments – but
hey, meanwhile they say "don't", as in "just don't use it at all and use SLES
12 instead, cause BTRFS on a 3.0 kernel with a ton of snapper snapshots is
really nowhere near production or enterprise reliability" (if you need proof,
I think I still have a snapshot of a SLES 11 SP3 VM that broke overnight just
because I had installed an LDAP server while preparing some training slides).
Even a 3.12 kernel seems daring regarding BTRFS, unless SUSE actively
backports fixes.


In kernel log the failed attempts look like this:

[  209.783437] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  210.116416] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  210.455479] BTRFS info (device dm-3): 1 enospc errors during balance
[  212.915690] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  213.291634] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  213.654145] BTRFS info (device dm-3): 1 enospc errors during balance
[  219.219584] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  219.531864] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  222.721234] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  223.084007] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  226.418100] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  226.730118] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  230.218590] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  230.559232] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  233.979952] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  234.320569] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  237.672101] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  237.961171] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  241.262757] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  241.594655] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  244.783861] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  245.095942] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
[  245.418042] BTRFS info (device dm-3): relocating block group 500198014976 flags 17
[  245.544153] BTRFS info (device dm-3): relocating block group 496997761024 flags 17
[  245.644254] BTRFS info (device dm-3): relocating block group 495924019200 flags 17
[  246.281001] BTRFS info (device dm-3): relocating block group 488407826432 flags 17
[  246.449939] BTRFS info (device dm-3): relocating block group 431499509760 flags 17
[  246.561724] BTRFS info (device dm-3): relocating block group 411804106752 flags 17
[  246.723997] BTRFS info (device dm-3): relocating block group 409656623104 flags 17
[  251.770469] BTRFS info (device dm-3): 7 enospc errors during balance




My expectation for a *stable* and *production quality* filesystem would be:

I never ever get hangs with one kworker running at 100% of one Sandybridge
core *for minutes* on a production filesystem, and that's about it.

Especially for a filesystem that claims to still have a good amount of free
space:

merkaba:~> LANG=C df -hT /home
Filesystem             Type   Size  Used Avail Use% Mounted on
/dev/mapper/msata-home btrfs  160G  146G   25G  86% /home

(yeah, these don't add up; I attribute this to compression, but hey, who knows)


In the kernel log I have things like this, though from some earlier time, and
these I have not perceived as hangs yet:

Dec 23 23:33:26 merkaba kernel: [23040.621678] ------------[ cut here ]------------
Dec 23 23:33:26 merkaba kernel: [23040.621792] WARNING: CPU: 3 PID: 308 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]()
Dec 23 23:33:26 merkaba kernel: [23040.621796] Modules linked in: mmc_block ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device hid_generic hid_pl ff_memless usbhid hid nls_utf8 nls_cp437 vfat fat uas usb_storage bnep bluetooth binfmt_misc cpufreq_userspace cpufreq_stats pci_stub cpufreq_powersave vboxpci(O) cpufreq_conservative vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ext4 crc16 mbcache jbd2 intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi iwldvm aesni_intel snd_hda_codec_conexant mac80211 aes_x86_64 snd_hda_codec_generic lrw gf128mul glue_helper ablk_helper cryptd psmouse snd_hda_intel serio_raw iwlwifi pcspkr lpc_ich snd_hda_controller i2c_i801 mfd_core snd_hda_codec snd_hwdep cfg80211 snd_pcm snd_timer shpchp thinkpad_acpi nvram snd soundcore rfkill battery ac tpm_tis tpm processor evdev joydev sbs sbshc coretemp hdaps(O) tp_smapi(O) thinkpad_ec(O) loop firewire_sbp2 fuse ecryptfs autofs4 md_mod btrfs xor raid6_pq microcode dm_mirror dm_region_hash dm_log dm_mod sg sr_mod sd_mod cdrom crc32c_intel ahci firewire_ohci libahci sata_sil24 e1000e libata ptp sdhci_pci ehci_pci sdhci firewire_core ehci_hcd crc_itu_t pps_core mmc_core scsi_mod usbcore usb_common thermal
Dec 23 23:33:26 merkaba kernel: [23040.621978] CPU: 3 PID: 308 Comm: btrfs-transacti Tainted: G        W  O   3.18.0-tp520 #14
Dec 23 23:33:26 merkaba kernel: [23040.621982] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET63WW (1.43 ) 05/08/2013
Dec 23 23:33:26 merkaba kernel: [23040.621985]  0000000000000009 ffff8804044c7d88 ffffffff814a516e 0000000080000000
Dec 23 23:33:26 merkaba kernel: [23040.621992]  0000000000000000 ffff8804044c7dc8 ffffffff8103f83e ffff8804044c7db8
Dec 23 23:33:26 merkaba kernel: [23040.621999]  ffffffffc04bd5a1 ffff880037590800 ffff8800a599c320 0000000000000000
Dec 23 23:33:26 merkaba kernel: [23040.622006] Call Trace:
Dec 23 23:33:26 merkaba kernel: [23040.622026]  [<ffffffff814a516e>] dump_stack+0x4f/0x7c
Dec 23 23:33:26 merkaba kernel: [23040.622034]  [<ffffffff8103f83e>] warn_slowpath_common+0x7c/0x96
Dec 23 23:33:26 merkaba kernel: [23040.622104]  [<ffffffffc04bd5a1>] ? btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622111]  [<ffffffff8103f8ec>] warn_slowpath_null+0x15/0x17
Dec 23 23:33:26 merkaba kernel: [23040.622164]  [<ffffffffc04bd5a1>] btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622211]  [<ffffffffc047a830>] btrfs_commit_transaction+0x394/0x8bc [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622254]  [<ffffffffc0476dd5>] transaction_kthread+0xf9/0x1af [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622295]  [<ffffffffc0476cdc>] ? btrfs_cleanup_transaction+0x43a/0x43a [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622305]  [<ffffffff8105697c>] kthread+0xb2/0xba
Dec 23 23:33:26 merkaba kernel: [23040.622312]  [<ffffffff814a0000>] ? dcbnl_newmsg+0x14/0xa8
Dec 23 23:33:26 merkaba kernel: [23040.622317]  [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 23 23:33:26 merkaba kernel: [23040.622324]  [<ffffffff814a9f6c>] ret_from_fork+0x7c/0xb0
Dec 23 23:33:26 merkaba kernel: [23040.622329]  [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 23 23:33:26 merkaba kernel: [23040.622334] ---[ end trace 90db5b1c7067cf1d ]---
Dec 23 23:33:56 merkaba kernel: [23070.671999] ------------[ cut here ]------------


Dec 23 23:33:56 merkaba kernel: [23070.671999] ------------[ cut here ]------------
Dec 23 23:33:56 merkaba kernel: [23070.672064] WARNING: CPU: 3 PID: 308 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]()
Dec 23 23:33:56 merkaba kernel: [23070.672067] Modules linked in: mmc_block ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device hid_generic hid_pl ff_memless usbhid hid nls_utf8 nls_cp437 vfat fat uas usb_storage bnep bluetooth binfmt_misc cpufreq_userspace cpufreq_stats pci_stub cpufreq_powersave vboxpci(O) cpufreq_conservative vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ext4 crc16 mbcache jbd2 intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi iwldvm aesni_intel snd_hda_codec_conexant mac80211 aes_x86_64 snd_hda_codec_generic lrw gf128mul glue_helper ablk_helper cryptd psmouse snd_hda_intel serio_raw iwlwifi pcspkr lpc_ich snd_hda_controller i2c_i801 mfd_core snd_hda_codec snd_hwdep cfg80211 snd_pcm snd_timer shpchp thinkpad_acpi nvram snd soundcore rfkill battery ac tpm_tis tpm processor evdev joydev sbs sbshc coretemp hdaps(O) tp_smapi(O) thinkpad_ec(O) loop firewire_sbp2 fuse ecryptfs autofs4 md_mod btrfs xor raid6_pq microcode dm_mirror dm_region_hash dm_log dm_mod sg sr_mod sd_mod cdrom crc32c_intel ahci firewire_ohci libahci sata_sil24 e1000e libata ptp sdhci_pci ehci_pci sdhci firewire_core ehci_hcd crc_itu_t pps_core mmc_core scsi_mod usbcore usb_common thermal
Dec 23 23:33:56 merkaba kernel: [23070.672193] CPU: 3 PID: 308 Comm: btrfs-transacti Tainted: G        W  O   3.18.0-tp520 #14
Dec 23 23:33:56 merkaba kernel: [23070.672196] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET63WW (1.43 ) 05/08/2013
Dec 23 23:33:56 merkaba kernel: [23070.672200]  0000000000000009 ffff8804044c7d88 ffffffff814a516e 0000000080000000
Dec 23 23:33:56 merkaba kernel: [23070.672205]  0000000000000000 ffff8804044c7dc8 ffffffff8103f83e ffff8804044c7db8
Dec 23 23:33:56 merkaba kernel: [23070.672209]  ffffffffc04bd5a1 ffff880037590800 ffff8802cd6e50a0 0000000000000000
Dec 23 23:33:56 merkaba kernel: [23070.672214] Call Trace:
Dec 23 23:33:56 merkaba kernel: [23070.672222]  [<ffffffff814a516e>] dump_stack+0x4f/0x7c
Dec 23 23:33:56 merkaba kernel: [23070.672229]  [<ffffffff8103f83e>] warn_slowpath_common+0x7c/0x96
Dec 23 23:33:56 merkaba kernel: [23070.672264]  [<ffffffffc04bd5a1>] ? btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672270]  [<ffffffff8103f8ec>] warn_slowpath_null+0x15/0x17
Dec 23 23:33:56 merkaba kernel: [23070.672301]  [<ffffffffc04bd5a1>] btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672330]  [<ffffffffc047a830>] btrfs_commit_transaction+0x394/0x8bc [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672357]  [<ffffffffc0476dd5>] transaction_kthread+0xf9/0x1af [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672383]  [<ffffffffc0476cdc>] ? btrfs_cleanup_transaction+0x43a/0x43a [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672389]  [<ffffffff8105697c>] kthread+0xb2/0xba
Dec 23 23:33:56 merkaba kernel: [23070.672395]  [<ffffffff814a0000>] ? dcbnl_newmsg+0x14/0xa8
Dec 23 23:33:56 merkaba kernel: [23070.672399]  [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 23 23:33:56 merkaba kernel: [23070.672405]  [<ffffffff814a9f6c>] ret_from_fork+0x7c/0xb0
Dec 23 23:33:56 merkaba kernel: [23070.672409]  [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 23 23:33:56 merkaba kernel: [23070.672412] ---[ end trace 90db5b1c7067cf1e ]---
Dec 23 23:34:26 merkaba kernel: [23100.709530] ------------[ cut here ]------------


The recent hangs from today are not in the log; I was upset enough to
forcefully switch off the machine. Tax returns are not my all-time favorite,
but tax returns with a hanging filesystem are no fun at all.


I will upgrade to 3.19, starting with 3.19-rc2.

Let's see what this balance will do.

It currently is here:

merkaba:~> btrfs balance status /home
Balance on '/home' is running
32 out of about 164 chunks balanced (53 considered),  80% left

merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=142.10GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.33GiB
GlobalReserve, single: total=512.00MiB, used=254.31MiB


So on the one hand we are told not to balance needlessly, but then for
stable operation I need to balance nonetheless?

Well, let's see how it will improve things. Last time it did, considerably.
BTRFS only had these hang problems with 3.15 and 3.16 once the trees had
allocated all remaining space. So I expect this balance to downsize the trees
so that some device space is freed up and becomes allocatable again.

Next I will also defrag the Windows VM image just as an additional safety
net.


Okay, doing something else now while BTRFS hopefully sorts things out.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±ý»k~ÏâžØ^n‡r¡ö¦zË\x1aëh™¨è­Ú&£ûàz¿äz¹Þ—ú+€Ê+zf£¢·hšˆ§~†­†Ûiÿÿïêÿ‘êçz_è®\x0fæj:+v‰¨þ)ߣøm


* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
@ 2014-12-26 14:20 ` Martin Steigerwald
  2014-12-26 14:41   ` Martin Steigerwald
  2014-12-26 15:59 ` Martin Steigerwald
  2014-12-26 22:48 ` Robert White
  2 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-26 14:20 UTC (permalink / raw)
  To: linux-btrfs

On Friday, 26 December 2014, 14:37:36, you wrote:
> It currently is here:
> 
> merkaba:~> btrfs balance status /home
> Balance on '/home' is running
> 32 out of about 164 chunks balanced (53 considered),  80% left
> 
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=154.97GiB, used=142.10GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.33GiB
> GlobalReserve, single: total=512.00MiB, used=254.31MiB

Now I got this:

merkaba:~> btrfs balance start /home           

ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> dmesg | tail
[ 4260.276416] BTRFS info (device dm-3): relocating block group 151418568704 flags 17
[ 4274.683349] BTRFS info (device dm-3): found 25089 extents
[ 4295.836590] BTRFS info (device dm-3): found 25089 extents
[ 4296.026778] BTRFS info (device dm-3): relocating block group 150344826880 flags 17
[ 4312.732021] BTRFS info (device dm-3): found 59388 extents
[ 4326.398261] BTRFS info (device dm-3): found 59388 extents
[ 4326.813205] BTRFS info (device dm-3): relocating block group 149271085056 flags 17
[ 4347.346540] BTRFS info (device dm-3): found 104739 extents
[ 4357.160098] BTRFS info (device dm-3): found 104739 extents
[ 4359.304646] BTRFS info (device dm-3): 20 enospc errors during balance

And I wonder about:

> Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599
> 
84C7N�����r��y����b�X��ǧv�^�)޺{.n�+����{�n�߲)����w*\x1fjg���\x1e�����ݢj/���z�ޖ��2
> �ޙ����&�)ߡ�a��\x7f��\x1e�G���h�\x0f�j:+v���w��٥


These random chars are not supposed to be there. I better run a scrub straight
after this balance.
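
(For completeness, the scrub I have in mind is just the plain foreground
invocation with per-device statistics:)

# run the scrub in the foreground (-B) and print per-device stats (-d)
btrfs scrub start -Bd /home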

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-26 14:20 ` Martin Steigerwald
@ 2014-12-26 14:41   ` Martin Steigerwald
  2014-12-27  3:33     ` Duncan
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-26 14:41 UTC (permalink / raw)
  To: linux-btrfs

On Friday, 26 December 2014, 15:20:42, you wrote:
> And I wonder about:
> > Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> > GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599
> >
> > 
> 
> 
84C7N�����r��y����b�X��ǧv�^�)޺{.n�+����{�n�߲)����w*\x1fjg���\x1e�����ݢj/���z�ޖ��2
> 
> > �ޙ����&�)ߡ�a��\x7f��\x1e�G���h�\x0f�j:+v���w��٥
> 
> These random chars are not supposed to be there: I better run scrub
> straight  after this balance.

Okay, that's not me, I think. Scrub didn't report any errors, and when I look
into the KMail sent folder I don't see these random chars either, so it seems
some server on the wire added the garbage.

Let's defragment the file:

merkaba:/home/martin/.VirtualBox/HardDisks> filefrag Winlala.vdi 
Winlala.vdi: 41462 extents found
merkaba:/home/martin/.VirtualBox/HardDisks> btrfs filesystem defragment Winlala.vdi
merkaba:/home/martin/.VirtualBox/HardDisks> filefrag Winlala.vdi                   
Winlala.vdi: 11735 extents found
merkaba:/home/martin/.VirtualBox/HardDisks> sync
merkaba:/home/martin/.VirtualBox/HardDisks> filefrag Winlala.vdi
Winlala.vdi: 11735 extents found
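
(If I want to spot other fragmented images as well, a quick loop over the
directory does it. Keep in mind that filefrag tends to overcount extents on
compressed BTRFS files, since every compressed chunk shows up as its own
extent:)

# print the extent count for every VirtualBox disk image in this directory
for img in *.vdi; do
    filefrag "$img"
done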


Okay, that together with:

merkaba:~> btrfs fi df /home       
Data, RAID1: total=151.95GiB, used=144.68GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.25GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 147.94GiB
        devid    1 size 160.00GiB used 156.98GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 156.98GiB path /dev/mapper/sata-home

Btrfs v3.17

May do for a while.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
  2014-12-26 14:20 ` Martin Steigerwald
@ 2014-12-26 15:59 ` Martin Steigerwald
  2014-12-27  4:26   ` Duncan
  2014-12-26 22:48 ` Robert White
  2 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-26 15:59 UTC (permalink / raw)
  To: linux-btrfs

On Friday, 26 December 2014, 14:37:36, you wrote:
> I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> space_cache, skinny meta data extents – are these a problem? – and
> compress=lzo:
> 
> merkaba:~> btrfs fi sh /home
> Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
>         Total devices 2 FS bytes used 144.41GiB
>         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> 
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=154.97GiB, used=141.12GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.29GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> 
> And I had hangs with BTRFS again. This time as I wanted to install tax
> return software in Virtualbox´d Windows XP VM (which I use once a year
> cause I know no tax return software for Linux which would be suitable for
> Germany and I frankly don´t care about the end of security cause all
> surfing and other network access I will do from the Linux box and I only
> run the VM behind a firewall).

These hangs are 100% reproducible for me:

1) Have the compress=lzo, space_cache BTRFS dual SSD RAID 1 with both devices
completely filled with trees (chunks).

2) Have a Windows XP VM in VirtualBox on that BTRFS RAID 1.

3) Press "Defragment" inside the VM (in the hope of then being able to use
sdelete -c and VBoxManage modifyhd Winlala.vdi --compact to reduce the image
size).

Gives:

One kworker thread using up 100% of a core for minutes, with bursts of the
btrfs-transaction kthread in between, and:

Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce: [Hardware Error]: Machine check events logged
Dec 26 16:18:15 merkaba kernel: [ 8119.879230] CPU2: Core temperature above threshold, cpu clock throttled (total events = 54053)
Dec 26 16:18:15 merkaba kernel: [ 8119.879232] CPU0: Package temperature above threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.879234] CPU3: Core temperature above threshold, cpu clock throttled (total events = 54053)
Dec 26 16:18:15 merkaba kernel: [ 8119.879235] CPU1: Package temperature above threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.879237] CPU3: Package temperature above threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.879245] CPU2: Package temperature above threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.880218] CPU2: Core temperature/speed normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880219] CPU1: Package temperature/speed normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880220] CPU3: Core temperature/speed normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880221] CPU0: Package temperature/speed normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880223] CPU3: Package temperature/speed normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880228] CPU2: Package temperature/speed normal
Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce: [Hardware Error]: Machine check events logged
Dec 26 16:20:57 merkaba kernel: [ 8281.461874] INFO: task kded4:1959 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.464106]       Tainted: G           O   3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.466361] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.468760] kded4           D ffff88040764ce98     0  1959      1 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.471112]  ffff8803efa57bb8 0000000000000002 ffff8803efa57c00 ffff880407f261c0
Dec 26 16:20:57 merkaba kernel: [ 8281.473462]  ffff8803efa57fd8 ffff88040764c950 0000000000012300 ffff88040764c950
Dec 26 16:20:57 merkaba kernel: [ 8281.475780]  ffff8803efa57ba8 ffff8803eea9a900 ffff8803eea9a904 ffff88040764c950
Dec 26 16:20:57 merkaba kernel: [ 8281.478142] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.480414]  [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.482694]  [<ffffffff814a72d3>] schedule_preempt_disabled+0x13/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.484979]  [<ffffffff814a8440>] __mutex_lock_slowpath+0xab/0x126
Dec 26 16:20:57 merkaba kernel: [ 8281.487271]  [<ffffffff81143735>] ? lookup_fast+0x173/0x238
Dec 26 16:20:57 merkaba kernel: [ 8281.489534]  [<ffffffff814a84ce>] mutex_lock+0x13/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.491811]  [<ffffffff81143c45>] walk_component+0x69/0x17e
Dec 26 16:20:57 merkaba kernel: [ 8281.494092]  [<ffffffff81143d88>] lookup_last+0x2e/0x30
Dec 26 16:20:57 merkaba kernel: [ 8281.496416]  [<ffffffff81145a32>] path_lookupat+0x83/0x2d9
Dec 26 16:20:57 merkaba kernel: [ 8281.498733]  [<ffffffff8121f38c>] ? debug_smp_processor_id+0x17/0x19
Dec 26 16:20:57 merkaba kernel: [ 8281.501074]  [<ffffffff8114683c>] ? getname_flags+0x31/0x134
Dec 26 16:20:57 merkaba kernel: [ 8281.503338]  [<ffffffff81145cad>] filename_lookup+0x25/0x7a
Dec 26 16:20:57 merkaba kernel: [ 8281.505604]  [<ffffffff8114767a>] user_path_at_empty+0x55/0x93
Dec 26 16:20:57 merkaba kernel: [ 8281.507941]  [<ffffffff8105ec3e>] ? preempt_count_add+0x7c/0x90
Dec 26 16:20:57 merkaba kernel: [ 8281.510210]  [<ffffffff81071751>] ? cpuacct_account_field+0x56/0x5f
Dec 26 16:20:57 merkaba kernel: [ 8281.512499]  [<ffffffff81071751>] ? cpuacct_account_field+0x56/0x5f
Dec 26 16:20:57 merkaba kernel: [ 8281.514705]  [<ffffffff811476c4>] user_path_at+0xc/0xe
Dec 26 16:20:57 merkaba kernel: [ 8281.517039]  [<ffffffff8113ec3b>] vfs_fstatat+0x49/0x84
Dec 26 16:20:57 merkaba kernel: [ 8281.519397]  [<ffffffff810be29a>] ? acct_account_cputime+0x17/0x19
Dec 26 16:20:57 merkaba kernel: [ 8281.521686]  [<ffffffff8113ec8c>] vfs_stat+0x16/0x18
Dec 26 16:20:57 merkaba kernel: [ 8281.524064]  [<ffffffff8113ecd1>] SYSC_newstat+0x15/0x2e
Dec 26 16:20:57 merkaba kernel: [ 8281.526367]  [<ffffffff8100cf3f>] ? user_exit+0x13/0x15
Dec 26 16:20:57 merkaba kernel: [ 8281.528792]  [<ffffffff8100e21d>] ? syscall_trace_enter_phase1+0x57/0x12a
Dec 26 16:20:57 merkaba kernel: [ 8281.531120]  [<ffffffff8100e537>] ? syscall_trace_leave+0xcc/0x10a
Dec 26 16:20:57 merkaba kernel: [ 8281.533577]  [<ffffffff814aa264>] ? int_check_syscall_exit_work+0x34/0x3d
Dec 26 16:20:57 merkaba kernel: [ 8281.535977]  [<ffffffff8113edb9>] SyS_newstat+0x9/0xb
Dec 26 16:20:57 merkaba kernel: [ 8281.538416]  [<ffffffff814aa012>] system_call_fastpath+0x12/0x17
Dec 26 16:20:57 merkaba kernel: [ 8281.540835] INFO: task kactivitymanage:1994 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.540838]       Tainted: G           O   3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.540838] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.540848] kactivitymanage D 0000000000000000     0  1994      1 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.540851]  ffff8803e42efe58 0000000000000002 00000001ff68d6d6 ffff8800c285c950
Dec 26 16:20:57 merkaba kernel: [ 8281.540854]  ffff8803e42effd8 ffff8803fda361c0 0000000000012300 ffff8803fda361c0
Dec 26 16:20:57 merkaba kernel: [ 8281.540857]  00000000000034e5 ffff8804059e0348 ffff8804059e034c ffff8803fda361c0
Dec 26 16:20:57 merkaba kernel: [ 8281.540858] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.540862]  [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.540865]  [<ffffffff814a72d3>] schedule_preempt_disabled+0x13/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.540867]  [<ffffffff814a8440>] __mutex_lock_slowpath+0xab/0x126
Dec 26 16:20:57 merkaba kernel: [ 8281.540871]  [<ffffffff81150f22>] ? __fget+0x67/0x72
Dec 26 16:20:57 merkaba kernel: [ 8281.540873]  [<ffffffff814a84ce>] mutex_lock+0x13/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.540876]  [<ffffffff811516e9>] __fdget_pos+0x36/0x3c
Dec 26 16:20:57 merkaba kernel: [ 8281.540878]  [<ffffffff8113a393>] fdget_pos+0x9/0x15
Dec 26 16:20:57 merkaba kernel: [ 8281.540881]  [<ffffffff8113b39d>] SyS_write+0x19/0x71
Dec 26 16:20:57 merkaba kernel: [ 8281.540884]  [<ffffffff814aa264>] ? int_check_syscall_exit_work+0x34/0x3d
Dec 26 16:20:57 merkaba kernel: [ 8281.540886]  [<ffffffff814aa012>] system_call_fastpath+0x12/0x17
Dec 26 16:20:57 merkaba kernel: [ 8281.540890] INFO: task plasma-desktop:2013 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.540891]       Tainted: G           O   3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.540892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.540895] plasma-desktop  D ffff8803fda39db8     0  2013      1 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.540898]  ffff8803d947fbb8 0000000000000002 ffff8803d947fc00 ffffffff81a16500
Dec 26 16:20:57 merkaba kernel: [ 8281.540900]  ffff8803d947ffd8 ffff8803fda39870 0000000000012300 ffff8803fda39870
Dec 26 16:20:57 merkaba kernel: [ 8281.540902]  ffff8803d947fba8 ffff8803eea9a900 ffff8803eea9a904 ffff8803fda39870
Dec 26 16:20:57 merkaba kernel: [ 8281.540903] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.540906]  [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.540908]  [<ffffffff814a72d3>] schedule_preempt_disabled+0x13/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.540910]  [<ffffffff814a8440>] __mutex_lock_slowpath+0xab/0x126
Dec 26 16:20:57 merkaba kernel: [ 8281.540913]  [<ffffffff81143735>] ? lookup_fast+0x173/0x238
Dec 26 16:20:57 merkaba kernel: [ 8281.540916]  [<ffffffff814a84ce>] mutex_lock+0x13/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.540918]  [<ffffffff81143c45>] walk_component+0x69/0x17e
Dec 26 16:20:57 merkaba kernel: [ 8281.540921]  [<ffffffff813ca675>] ? __sock_recvmsg_nosec+0x29/0x2b
Dec 26 16:20:57 merkaba kernel: [ 8281.540924]  [<ffffffff81143d88>] lookup_last+0x2e/0x30
Dec 26 16:20:57 merkaba kernel: [ 8281.540926]  [<ffffffff81145a32>] path_lookupat+0x83/0x2d9
Dec 26 16:20:57 merkaba kernel: [ 8281.540929]  [<ffffffff8121f38c>] ? debug_smp_processor_id+0x17/0x19
Dec 26 16:20:57 merkaba kernel: [ 8281.540932]  [<ffffffff8114683c>] ? getname_flags+0x31/0x134
Dec 26 16:20:57 merkaba kernel: [ 8281.540934]  [<ffffffff81145cad>] filename_lookup+0x25/0x7a
Dec 26 16:20:57 merkaba kernel: [ 8281.540937]  [<ffffffff8114767a>] user_path_at_empty+0x55/0x93
Dec 26 16:20:57 merkaba kernel: [ 8281.540942]  [<ffffffff8105ec3e>] ? preempt_count_add+0x7c/0x90
Dec 26 16:20:57 merkaba kernel: [ 8281.540947]  [<ffffffff81071751>] ? cpuacct_account_field+0x56/0x5f
Dec 26 16:20:57 merkaba kernel: [ 8281.540949]  [<ffffffff81071751>] ? cpuacct_account_field+0x56/0x5f
Dec 26 16:20:57 merkaba kernel: [ 8281.540952]  [<ffffffff811476c4>] user_path_at+0xc/0xe
Dec 26 16:20:57 merkaba kernel: [ 8281.540956]  [<ffffffff81160193>] user_statfs+0x2b/0x68
Dec 26 16:20:57 merkaba kernel: [ 8281.540960]  [<ffffffff810be29a>] ? acct_account_cputime+0x17/0x19
Dec 26 16:20:57 merkaba kernel: [ 8281.540963]  [<ffffffff811601eb>] SYSC_statfs+0x1b/0x3a
Dec 26 16:20:57 merkaba kernel: [ 8281.540965]  [<ffffffff8100cf3f>] ? user_exit+0x13/0x15
Dec 26 16:20:57 merkaba kernel: [ 8281.540968]  [<ffffffff8100e21d>] ? syscall_trace_enter_phase1+0x57/0x12a
Dec 26 16:20:57 merkaba kernel: [ 8281.540970]  [<ffffffff8100e537>] ? syscall_trace_leave+0xcc/0x10a
Dec 26 16:20:57 merkaba kernel: [ 8281.540973]  [<ffffffff814aa264>] ? int_check_syscall_exit_work+0x34/0x3d
Dec 26 16:20:57 merkaba kernel: [ 8281.540976]  [<ffffffff81160328>] SyS_statfs+0x9/0xb
Dec 26 16:20:57 merkaba kernel: [ 8281.540978]  [<ffffffff814aa012>] system_call_fastpath+0x12/0x17
Dec 26 16:20:57 merkaba kernel: [ 8281.540983] INFO: task krunner:2050 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.540984]       Tainted: G           O   3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.540985] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.540988] krunner         D 0000000000000000     0  2050      1 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.540991]  ffff8803cb68be58 0000000000000002 ffff8803cb68be28 ffff8803fda39870
Dec 26 16:20:57 merkaba kernel: [ 8281.540993]  ffff8803cb68bfd8 ffff8800cecee1c0 0000000000012300 ffff8800cecee1c0
Dec 26 16:20:57 merkaba kernel: [ 8281.540995]  00000000000089ef ffff8804059e0348 ffff8804059e034c ffff8800cecee1c0
Dec 26 16:20:57 merkaba kernel: [ 8281.540996] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.540998]  [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.541001]  [<ffffffff814a72d3>] schedule_preempt_disabled+0x13/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.541003]  [<ffffffff814a8440>] __mutex_lock_slowpath+0xab/0x126
Dec 26 16:20:57 merkaba kernel: [ 8281.541005]  [<ffffffff81150f22>] ? __fget+0x67/0x72
Dec 26 16:20:57 merkaba kernel: [ 8281.541008]  [<ffffffff814a84ce>] mutex_lock+0x13/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.541010]  [<ffffffff811516e9>] __fdget_pos+0x36/0x3c
Dec 26 16:20:57 merkaba kernel: [ 8281.541012]  [<ffffffff8113a393>] fdget_pos+0x9/0x15
Dec 26 16:20:57 merkaba kernel: [ 8281.541014]  [<ffffffff8113b39d>] SyS_write+0x19/0x71
Dec 26 16:20:57 merkaba kernel: [ 8281.541017]  [<ffffffff814aa264>] ? int_check_syscall_exit_work+0x34/0x3d
Dec 26 16:20:57 merkaba kernel: [ 8281.541019]  [<ffffffff814aa012>] system_call_fastpath+0x12/0x17
Dec 26 16:20:57 merkaba kernel: [ 8281.541035] INFO: task akonadi_baloo_i:2273 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.541036]       Tainted: G           O   3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.541036] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.541039] akonadi_baloo_i D ffff8803b773b628     0  2273   2170 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.541041]  ffff8803ac8ff948 0000000000000002 ffff88040d80dc00 ffffffff81a16500
Dec 26 16:20:57 merkaba kernel: [ 8281.541043]  ffff8803ac8fffd8 ffff8803b773b0e0 0000000000012300 ffff8803b773b0e0
Dec 26 16:20:57 merkaba kernel: [ 8281.541046]  ffff8803ac8ff928 7fffffffffffffff ffff8803ac8ffa80 0000000000000002
Dec 26 16:20:57 merkaba kernel: [ 8281.541046] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.541049]  [<ffffffff814a8e73>] ? console_conditional_schedule+0x14/0x14
Dec 26 16:20:57 merkaba kernel: [ 8281.541051]  [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.541054]  [<ffffffff814a8e93>] schedule_timeout+0x20/0xf5
Dec 26 16:20:57 merkaba kernel: [ 8281.541056]  [<ffffffff8105eb92>] ? get_parent_ip+0xe/0x3e
Dec 26 16:20:57 merkaba kernel: [ 8281.541058]  [<ffffffff8105ec3e>] ? preempt_count_add+0x7c/0x90
Dec 26 16:20:57 merkaba kernel: [ 8281.541061]  [<ffffffff814a97db>] ? _raw_spin_lock_irq+0x1c/0x20
Dec 26 16:20:57 merkaba kernel: [ 8281.541063]  [<ffffffff814a7999>] __wait_for_common+0x11e/0x163
Dec 26 16:20:57 merkaba kernel: [ 8281.541066]  [<ffffffff810607da>] ? wake_up_state+0xd/0xd
Dec 26 16:20:57 merkaba kernel: [ 8281.541069]  [<ffffffff814a79fd>] wait_for_completion+0x1f/0x21
Dec 26 16:20:57 merkaba kernel: [ 8281.541072]  [<ffffffff8115b5fb>] writeback_inodes_sb_nr+0x8c/0x95
Dec 26 16:20:57 merkaba kernel: [ 8281.541077]  [<ffffffff81050101>] ? perf_trace_workqueue_work+0x8e/0x95
Dec 26 16:20:57 merkaba kernel: [ 8281.541115]  [<ffffffffc044a44e>] flush_space+0x200/0x426 [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541135]  [<ffffffffc044a20c>] ? can_overcommit+0xaa/0xec [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541160]  [<ffffffffc044aa48>] reserve_metadata_bytes+0x274/0x368 [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541164]  [<ffffffff8105eb92>] ? get_parent_ip+0xe/0x3e
Dec 26 16:20:57 merkaba kernel: [ 8281.541166]  [<ffffffff8105ec3e>] ? preempt_count_add+0x7c/0x90
Dec 26 16:20:57 merkaba kernel: [ 8281.541185]  [<ffffffffc044b39b>] btrfs_delalloc_reserve_metadata+0x100/0x32c [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541215]  [<ffffffffc046c182>] __btrfs_buffered_write+0x1be/0x4a4 [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541218]  [<ffffffff81101a6e>] ? kmap_atomic+0x13/0x39
Dec 26 16:20:57 merkaba kernel: [ 8281.541220]  [<ffffffff81101aa2>] ? pagefault_enable+0xe/0x21
Dec 26 16:20:57 merkaba kernel: [ 8281.541242]  [<ffffffffc046c76b>] btrfs_file_write_iter+0x303/0x40e [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541245]  [<ffffffff8113a60a>] new_sync_write+0x77/0x9b
Dec 26 16:20:57 merkaba kernel: [ 8281.541247]  [<ffffffff8113ad51>] vfs_write+0xad/0x112
Dec 26 16:20:57 merkaba kernel: [ 8281.541250]  [<ffffffff8113b4d1>] SyS_pwrite64+0x5f/0x7d
Dec 26 16:20:57 merkaba kernel: [ 8281.541253]  [<ffffffff814aa264>] ? int_check_syscall_exit_work+0x34/0x3d
Dec 26 16:20:57 merkaba kernel: [ 8281.541256]  [<ffffffff814aa012>] system_call_fastpath+0x12/0x17
Dec 26 16:20:57 merkaba kernel: [ 8281.541263] INFO: task kworker/u8:1:3336 blocked for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.541264]       Tainted: G           O   3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.541265] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.541268] kworker/u8:1    D 0000000000000000     0  3336      2 0x00000000
Dec 26 16:20:57 merkaba kernel: [ 8281.541285] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541288]  ffff880332d6bb78 0000000000000002 ffff88040d80dc00 ffff8803b773b0e0
Dec 26 16:20:57 merkaba kernel: [ 8281.541290]  ffff880332d6bfd8 ffff8800c2804950 0000000000012300 ffff8800c2804950
Dec 26 16:20:57 merkaba kernel: [ 8281.541292]  ffff880332d6bb58 7fffffffffffffff ffff880332d6bcb0 0000000000000002
Dec 26 16:20:57 merkaba kernel: [ 8281.541293] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.541299]  [<ffffffff814a8e73>] ? console_conditional_schedule+0x14/0x14
Dec 26 16:20:57 merkaba kernel: [ 8281.541301]  [<ffffffff814a6f9a>] schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.541303]  [<ffffffff814a8e93>] schedule_timeout+0x20/0xf5
Dec 26 16:20:57 merkaba kernel: [ 8281.541306]  [<ffffffff8105eb92>] ? get_parent_ip+0xe/0x3e
Dec 26 16:20:57 merkaba kernel: [ 8281.541308]  [<ffffffff8105ec3e>] ? preempt_count_add+0x7c/0x90
Dec 26 16:20:57 merkaba kernel: [ 8281.541311]  [<ffffffff814a97db>] ? _raw_spin_lock_irq+0x1c/0x20
Dec 26 16:20:57 merkaba kernel: [ 8281.541313]  [<ffffffff814a7999>] __wait_for_common+0x11e/0x163
Dec 26 16:20:57 merkaba kernel: [ 8281.541317]  [<ffffffff810607da>] ? wake_up_state+0xd/0xd
Dec 26 16:20:57 merkaba kernel: [ 8281.541320]  [<ffffffff814a79fd>] wait_for_completion+0x1f/0x21
Dec 26 16:20:57 merkaba kernel: [ 8281.541322]  [<ffffffff8115b5fb>] writeback_inodes_sb_nr+0x8c/0x95
Dec 26 16:20:57 merkaba kernel: [ 8281.541324]  [<ffffffff81050101>] ? perf_trace_workqueue_work+0x8e/0x95
Dec 26 16:20:57 merkaba kernel: [ 8281.541343]  [<ffffffffc044a44e>] flush_space+0x200/0x426 [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541346]  [<ffffffff814a97bb>] ? _raw_spin_lock+0x1b/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.541348]  [<ffffffff814a9841>] ? _raw_spin_unlock+0x11/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.541366]  [<ffffffffc044a788>] btrfs_async_reclaim_metadata_space+0x114/0x160 [btrfs]
Dec 26 16:20:57 merkaba kernel: [ 8281.541368]  [<ffffffff81052962>] process_one_work+0x15e/0x2a9
Dec 26 16:20:57 merkaba kernel: [ 8281.541371]  [<ffffffff81052ee1>] worker_thread+0x1f6/0x2a3
Dec 26 16:20:57 merkaba kernel: [ 8281.541374]  [<ffffffff81052ceb>] ? rescuer_thread+0x214/0x214
Dec 26 16:20:57 merkaba kernel: [ 8281.541376]  [<ffffffff8105697c>] kthread+0xb2/0xba
Dec 26 16:20:57 merkaba kernel: [ 8281.541379]  [<ffffffff814a0000>] ? dcbnl_newmsg+0x14/0xa8
Dec 26 16:20:57 merkaba kernel: [ 8281.541381]  [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 26 16:20:57 merkaba kernel: [ 8281.541384]  [<ffffffff814a9f6c>] ret_from_fork+0x7c/0xb0
Dec 26 16:20:57 merkaba kernel: [ 8281.541386]  [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
Dec 26 16:21:24 merkaba kernel: [ 8308.678889] device eth0 left promiscuous mode
Dec 26 16:21:24 merkaba kernel: [ 8308.700212] vboxnetflt: 0 out of 34916 packets were not sent (directed to host)


which translates to:

Desktop unusable => hard reboot.


I now resized it from 160 GiB to 170 GiB on both devices.
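
(For reference, the resize itself is just growing the two underlying devices
and then the filesystem per devid. A rough sketch, assuming
/dev/mapper/msata-home and /dev/mapper/sata-home are LVM logical volumes with
free space left in their volume groups; adjust for whatever the dm setup
really is:)

# grow both underlying devices by 10 GiB (assumption: LVM LVs with spare VG space)
lvextend -L +10G /dev/mapper/msata-home
lvextend -L +10G /dev/mapper/sata-home
# then let the mounted BTRFS grow into the new space, one devid at a time
btrfs filesystem resize 1:max /home
btrfs filesystem resize 2:max /home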

But I think I will consider moving the VM image to another filesystem.

But at least my description can give an idea of how to reproduce this behaviour.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
  2014-12-26 14:20 ` Martin Steigerwald
  2014-12-26 15:59 ` Martin Steigerwald
@ 2014-12-26 22:48 ` Robert White
  2014-12-27  5:54   ` Duncan
  2014-12-27  9:01   ` Martin Steigerwald
  2 siblings, 2 replies; 59+ messages in thread
From: Robert White @ 2014-12-26 22:48 UTC (permalink / raw)
  To: Martin Steigerwald, linux-btrfs

On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> Hello!
>
> First: Have a merry christmas and enjoy a quiet time in these days.
>
> Second: At a time you feel like it, here is a little rant, but also a bug
> report:
>
> I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> space_cache, skinny meta data extents – are these a problem? – and
> compress=lzo:

(There is no known problem with skinny metadata; it's actually more 
efficient than the older format. There have been some anecdotes about 
mixing skinny and fat metadata, but nothing has ever been demonstrated 
to be problematic.)

>
> merkaba:~> btrfs fi sh /home
> Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
>          Total devices 2 FS bytes used 144.41GiB
>          devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
>          devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=154.97GiB, used=141.12GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.29GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

This filesystem, at the allocation level, is "very full" (see below).

> And I had hangs with BTRFS again. This time as I wanted to install tax
> return software in Virtualbox´d Windows XP VM (which I use once a year
> cause I know no tax return software for Linux which would be suitable for
> Germany and I frankly don´t care about the end of security cause all
> surfing and other network access I will do from the Linux box and I only
> run the VM behind a firewall).
>
>
> And thus I try the balance dance again:

ITEM: Balance... it doesn't do what you think it does... 8-)

"Balancing" is something you should almost never need to do. It is only 
for cases of changing geometry (adding disks, switching RAID levels, 
etc.) or for cases when you've radically changed allocation behaviors 
(like you decided to remove all your VMs, or you've decided to remove a 
mail spool directory full of thousands of tiny files).

People run balance all the time because they think they should. They are 
_usually_ incorrect in that belief.

>
> merkaba:~> btrfs balance start -dusage=5 -musage=5 /home
> ERROR: error during balancing '/home' - No space left on device

ITEM: Running out of space during a balance is not running out of space 
for files. BTRFS has two layers of allocation. That is, there are two 
levels of abstraction where "no space" can occur.

The first level of allocation is the "making more BTRFS structures out 
of raw device space".

The second level is "allocating space for files inside of existing BTRFS 
structures".

Balance is the operation of relocating the BTRFS structures and 
attempting to increase their order (coincidentally) while doing that. 
So, for instance, "relocating block group some_number_here" requires 
finding an unallocated expanse of disk, creating a new/empty block group 
there of the current relevant block group size (typically data=1G or 
metadata=256M if you didn't override these settings while making the 
filesystem). You can _easily_ end up lacking a 1G contiguous expanse of 
raw allocation space on a nearly-full filesystem.
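
(A quick way to look at both levels with the tools you already used: the
per-device "size" vs "used" numbers are the first level, the per-type "total"
vs "used" numbers are the second.)

# level 1: raw device space already handed out to block groups ("used" per devid)
btrfs filesystem show /home
# level 2: how full those block groups are ("total" vs "used" per data/metadata type)
btrfs filesystem df /home

In your output both devices show 160.00GiB used of 160.00GiB size, so there is
no unallocated space left for balance to carve a new (typically 1G) block
group out of, even though fi df still shows roughly 14G of slack inside the
existing data block groups.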

NOTE :: This does _not_ happen with other filesystems like EXT4 because 
building those filesystems creates a static filesystem-level allocation. 
That is, 100% of the disk that can be controlled by EXT4 (etc.) is 
allocated and initialized at initial creation time (or at first mount in 
the case of EXT4).

BTRFS is intentionally different because it wants to be able to adapt as 
your usage changes. If you first make millions of tiny files then you 
will have a lot of metadata extents and virtually no data extents. If 
you erase a lot of those and then start making large files the metadata 
will tend to go away and then data extents will be created.

Being a chaotic system, you can get into some corner cases that suck, 
but in terms of natural evolution it has more benefits than drawbacks.


> There may be more info in syslog - try dmesg | tail
> merkaba:~#1> btrfs balance start -dusage=5 -musage=5 /home
> ERROR: error during balancing '/home' - No space left on device
> There may be more info in syslog - try dmesg | tail
> merkaba:~#1> btrfs balance start -dusage=5 /home
>
> .... lots deleted for brevity ....
>
> So I am rebalancing everything basically, without need I bet, so causing
> more churn to SSDs than is needed.

Correct, though churn isn't really the issue.

> Otherwise alternative would be to make BTRFS larger I bet.

Correct.

>
>
> Well this is still not what I would consider stable. So I will still

Not a question of stability.

See, doing a balance is like doing a sliding block puzzle. If there isn't 
enough room to slide the blocks around then the blocks will not slide 
around. You are just out of space and that results in "out of space" 
returns. This is not even an error, just a fact.

http://en.wikipedia.org/wiki/15_puzzle

Meditate on the above link. Then ask yourself what happens if you put in 
the number 16. 8-)

The below recommendation is incorrect...

> recommend: If you want to use BTRFS on a server and estimate 25 GiB of
> usage, make drive at least 50GiB big or even 100GiB to be on the safe
> side. Like I recommended for SLES 11 SP 2/3 BTRFS deployments – but
> hey, there say meanwhile "don´t" as in "just don´t use it at all and use SLES
> 12 instead, cause BTRFS with 3.0 kernel with a ton of snapper snapshots
> is really not asking for anything even near to production or enterprise
> reliability" (if you need proof, I think I still have a snapshot of a SLES
> 11 SP3 VM that broke over night due to me having installed an LDAP server
> for preparing some training slides). Even 3.12 kernel seems daring regarding
> BTRFS, unless SUSE actively backports fixes.
>
>
> In kernel log the failed attempts look like this:
  Already covered.
>
> [  209.783437] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
> [  210.116416] BTRFS info (device dm-3): relocating block group 501238202368 flags 17
> My expectation for a *stable* and *production quality* filesystem would be:
>
> I never ever get hangs with one kworker running at 100% of one Sandybridge
> core *for minutes* on a production filesystem, and that's about it.

Now this is one of several other issues.

ITEM: An SSD plus a good fast controller and default system virtual 
memory and disk scheduler activities can completely bog a system down. 
You can get into a mode where the system begins doing synchronous writes 
of vast expanses of dirty cache. The SSD is so fast that there is 
effectively zero "wait for IO time" and the IO subsystem is effectively 
locked or just plain busy.

Look at /proc/sys/vm/dirty_background_ratio, which is probably set to 10% 
of system RAM.

You may need/want to change this number to something closer to 4. That's 
not a hard suggestion. Some reading and analysis will be needed to find 
the best possible tuning for an advanced system.
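
(Checking and changing it is a one-liner; this is just a sketch, not a
recommendation of the exact value:)

# current value: percentage of RAM that may be dirty before background writeback starts
sysctl vm.dirty_background_ratio
# try a lower value for this boot only
sysctl -w vm.dirty_background_ratio=4
# to make it permanent, add "vm.dirty_background_ratio = 4" to /etc/sysctl.conf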

>
> Especially for a filesystem that claims to still have a good amount of free
> space:
>
> merkaba:~> LANG=C df -hT /home
> Filesystem             Type   Size  Used Avail Use% Mounted on
> /dev/mapper/msata-home btrfs  160G  146G   25G  86% /home

It does have plenty of free space at the file-storage level. (Which is 
not the "balance" level where raw disk is converted into file system 
"data" or "metadata" extents.)

>
> (yeah, these don´t add up, I account this to compression, but hey, who knows)

No need to "account for" compression.

They add up fine, in the sense that they are separate domains for space 
and so are not intended to be taken together. You will notice that you 
are not getting "out of space" errors for actually creating/appending files.

>
>
> In kernel log I have things like this, but some earlier time and these I have
> not yet perceived as hangs:
>
> Dec 23 23:33:26 merkaba kernel: [23040.621678] ------------[ cut here ]------------
> Dec 23 23:33:26 merkaba kernel: [23040.621792] WARNING: CPU: 3 PID: 308 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]()
> Dec 23 23:33:26 merkaba kernel: [23040.621796] Modules linked in: mmc_block ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device hid_generic hid_pl ff_memless usbhid hid nls_utf8 nls_cp437 vfat fat uas usb_storage bnep bluetooth binfmt_misc cpufreq_userspace cpufreq_stats pci_stub cpufreq_powersave vboxpci(O) cpufreq_conservative vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ext4 crc16 mbcache jbd2 intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi iwldvm aesni_intel snd_hda_codec_conexant mac80211 aes_x86_64 snd_hda_codec_generic lrw gf128mul glue_helper ablk_helper cryptd psmouse snd_hda_intel serio_raw iwlwifi pcspkr lpc_ich snd_hda_controller i2c_i801 mfd_core snd_hda_codec snd_hwdep cfg80211 snd_pcm snd_timer shpchp thinkpad_acpi nvram snd soundcore rfkill battery ac tpm_tis tpm processor evdev joydev sbs sbshc coretemp hdaps(O) tp_smapi(O) thinkpad_ec(O) loop firewire_sbp2 fuse ecryptfs autofs4 md_mod btrfs xor raid6_pq microcode dm_mirror dm_region_hash dm_log dm_mod sg sr_mod sd_mod cdrom crc32c_intel ahci firewire_ohci libahci sata_sil24 e1000e libata ptp sdhci_pci ehci_pci sdhci firewire_core ehci_hcd crc_itu_t pps_core mmc_core scsi_mod usbcore usb_common thermal
> Dec 23 23:33:26 merkaba kernel: [23040.621978] CPU: 3 PID: 308 Comm: btrfs-transacti Tainted: G        W  O   3.18.0-tp520 #14
> Dec 23 23:33:26 merkaba kernel: [23040.621982] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET63WW (1.43 ) 05/08/2013
> Dec 23 23:33:26 merkaba kernel: [23040.621985]  0000000000000009 ffff8804044c7d88 ffffffff814a516e 0000000080000000
> Dec 23 23:33:26 merkaba kernel: [23040.621992]  0000000000000000 ffff8804044c7dc8 ffffffff8103f83e ffff8804044c7db8
> Dec 23 23:33:26 merkaba kernel: [23040.621999]  ffffffffc04bd5a1 ffff880037590800 ffff8800a599c320 0000000000000000
> Dec 23 23:33:26 merkaba kernel: [23040.622006] Call Trace:
> Dec 23 23:33:26 merkaba kernel: [23040.622026]  [<ffffffff814a516e>] dump_stack+0x4f/0x7c
> Dec 23 23:33:26 merkaba kernel: [23040.622034]  [<ffffffff8103f83e>] warn_slowpath_common+0x7c/0x96
> Dec 23 23:33:26 merkaba kernel: [23040.622104]  [<ffffffffc04bd5a1>] ? btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
> Dec 23 23:33:26 merkaba kernel: [23040.622111]  [<ffffffff8103f8ec>] warn_slowpath_null+0x15/0x17
> Dec 23 23:33:26 merkaba kernel: [23040.622164]  [<ffffffffc04bd5a1>] btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
> Dec 23 23:33:26 merkaba kernel: [23040.622211]  [<ffffffffc047a830>] btrfs_commit_transaction+0x394/0x8bc [btrfs]
> Dec 23 23:33:26 merkaba kernel: [23040.622254]  [<ffffffffc0476dd5>] transaction_kthread+0xf9/0x1af [btrfs]
> Dec 23 23:33:26 merkaba kernel: [23040.622295]  [<ffffffffc0476cdc>] ? btrfs_cleanup_transaction+0x43a/0x43a [btrfs]
> Dec 23 23:33:26 merkaba kernel: [23040.622305]  [<ffffffff8105697c>] kthread+0xb2/0xba
> Dec 23 23:33:26 merkaba kernel: [23040.622312]  [<ffffffff814a0000>] ? dcbnl_newmsg+0x14/0xa8
> Dec 23 23:33:26 merkaba kernel: [23040.622317]  [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
> Dec 23 23:33:26 merkaba kernel: [23040.622324]  [<ffffffff814a9f6c>] ret_from_fork+0x7c/0xb0
> Dec 23 23:33:26 merkaba kernel: [23040.622329]  [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
> Dec 23 23:33:26 merkaba kernel: [23040.622334] ---[ end trace 90db5b1c7067cf1d ]---
> Dec 23 23:33:56 merkaba kernel: [23070.671999] ------------[ cut here ]------------

Not sure about either of these; they _could_ be unrelated bugs that have 
since been fixed, since you say they've stopped happening.

> Dec 23 23:33:56 merkaba kernel: [23070.671999] ------------[ cut here ]------------
> Dec 23 23:33:56 merkaba kernel: [23070.672064] WARNING: CPU: 3 PID: 308 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]()
> Dec 23 23:33:56 merkaba kernel: [23070.672067] Modules linked in: mmc_block ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device hid_generic hid_pl ff_memless usbhid hid nls_utf8 nls_cp437 vfat fat uas usb_storage bnep bluetooth binfmt_misc cpufreq_userspace cpufreq_stats pci_stub cpufreq_powersave vboxpci(O) cpufreq_conservative vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ext4 crc16 mbcache jbd2 intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi iwldvm aesni_intel snd_hda_codec_conexant mac80211 aes_x86_64 snd_hda_codec_generic lrw gf128mul glue_helper ablk_helper cryptd psmouse snd_hda_intel serio_raw iwlwifi pcspkr lpc_ich snd_hda_controller i2c_i801 mfd_core snd_hda_codec snd_hwdep cfg80211 snd_pcm snd_timer shpchp thinkpad_acpi nvram snd soundcore rfkill battery ac tpm_tis tpm processor evdev joydev sbs sbshc coretemp hdaps(O) tp_smapi(O) thinkpad_ec(O) loop firewire_sbp2 fuse ecryptfs autofs4 md_mod btrfs xor raid6_pq microcode dm_mirror dm_region_hash dm_log dm_mod sg sr_mod sd_mod cdrom crc32c_intel ahci firewire_ohci libahci sata_sil24 e1000e libata ptp sdhci_pci ehci_pci sdhci firewire_core ehci_hcd crc_itu_t pps_core mmc_core scsi_mod usbcore usb_common thermal
> Dec 23 23:33:56 merkaba kernel: [23070.672193] CPU: 3 PID: 308 Comm: btrfs-transacti Tainted: G        W  O   3.18.0-tp520 #14
> Dec 23 23:33:56 merkaba kernel: [23070.672196] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET63WW (1.43 ) 05/08/2013
> Dec 23 23:33:56 merkaba kernel: [23070.672200]  0000000000000009 ffff8804044c7d88 ffffffff814a516e 0000000080000000
> Dec 23 23:33:56 merkaba kernel: [23070.672205]  0000000000000000 ffff8804044c7dc8 ffffffff8103f83e ffff8804044c7db8
> Dec 23 23:33:56 merkaba kernel: [23070.672209]  ffffffffc04bd5a1 ffff880037590800 ffff8802cd6e50a0 0000000000000000
> Dec 23 23:33:56 merkaba kernel: [23070.672214] Call Trace:
> Dec 23 23:33:56 merkaba kernel: [23070.672222]  [<ffffffff814a516e>] dump_stack+0x4f/0x7c
> Dec 23 23:33:56 merkaba kernel: [23070.672229]  [<ffffffff8103f83e>] warn_slowpath_common+0x7c/0x96
> Dec 23 23:33:56 merkaba kernel: [23070.672264]  [<ffffffffc04bd5a1>] ? btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
> Dec 23 23:33:56 merkaba kernel: [23070.672270]  [<ffffffff8103f8ec>] warn_slowpath_null+0x15/0x17
> Dec 23 23:33:56 merkaba kernel: [23070.672301]  [<ffffffffc04bd5a1>] btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
> Dec 23 23:33:56 merkaba kernel: [23070.672330]  [<ffffffffc047a830>] btrfs_commit_transaction+0x394/0x8bc [btrfs]
> Dec 23 23:33:56 merkaba kernel: [23070.672357]  [<ffffffffc0476dd5>] transaction_kthread+0xf9/0x1af [btrfs]
> Dec 23 23:33:56 merkaba kernel: [23070.672383]  [<ffffffffc0476cdc>] ? btrfs_cleanup_transaction+0x43a/0x43a [btrfs]
> Dec 23 23:33:56 merkaba kernel: [23070.672389]  [<ffffffff8105697c>] kthread+0xb2/0xba
> Dec 23 23:33:56 merkaba kernel: [23070.672395]  [<ffffffff814a0000>] ? dcbnl_newmsg+0x14/0xa8
> Dec 23 23:33:56 merkaba kernel: [23070.672399]  [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
> Dec 23 23:33:56 merkaba kernel: [23070.672405]  [<ffffffff814a9f6c>] ret_from_fork+0x7c/0xb0
> Dec 23 23:33:56 merkaba kernel: [23070.672409]  [<ffffffff810568ca>] ? __kthread_parkme+0x62/0x62
> Dec 23 23:33:56 merkaba kernel: [23070.672412] ---[ end trace 90db5b1c7067cf1e ]---
> Dec 23 23:34:26 merkaba kernel: [23100.709530] ------------[ cut here ]------------
>
>
> The recent hangings today are not in the log, I was upset enough to
> forcefully switch of the machine. Tax returns are not my all time favorite,
> but tax returns with hanging filesystems is no fun at all.
>
>
> I will upgrade to 3.19 with 3.19-rc2.
>
> Lets see what this balance will do.
>
> It currently is here:
>
> merkaba:~> btrfs balance status /home
> Balance on '/home' is running
> 32 out of about 164 chunks balanced (53 considered),  80% left
>
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=154.97GiB, used=142.10GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.33GiB
> GlobalReserve, single: total=512.00MiB, used=254.31MiB
>
>
> So for once, we are told not to balance needlessly, but then in order for
> stable operation I need to balance nonetheless?

Nope. "needing to Balance" just isn't your problem. Being out of space 
for new extents is your problem with the balancing you don't need to do. 
Which is different than your VM update problem. And is also different 
than your bursty, excessive caching problem.

I've also not seen you say you ever ran a btrfsck. Does a filesystem 
check come up clean?
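
(A rough sketch of that, assuming the usual tooling; run it read-only from a 
rescue or live environment with the filesystem unmounted, using the device 
path from your fi show output:

btrfs check /dev/mapper/msata-home

btrfsck is just the older name for the same tool.)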


> Well, let's see how it will improve things. Last time it did. Considerably.
> BTRFS only had these hang problems with 3.15 and 3.16 if trees allocated
> all remaining space. So I expect it to downsize these trees so that some
> device space is freed up to be allocatable again.
>
> Next I will also defrag the Windows VM image just as an additional safety
> net.

Simply copying the file might help you for a while at least. But in the 
long term "too much orderliness" for large files ends up being anti-helpful.

e.g. for disk_file.img: cp disk_file.img new_disk_file.img; rm 
disk_file.img; mv new_disk_file.img disk_file.img

Turning off copy-on-write might be helpful (this will turn off 
compression for that file as well), but it can be anti-helpful too, depending 
on the VM and how it's used.
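
(If you do try no-COW, a minimal sketch, assuming GNU coreutils; the attribute 
only affects data written after it is set, so set it on an empty directory 
or empty file first. The directory name is just an example:

mkdir /home/vm-nocow
chattr +C /home/vm-nocow
cp --reflink=never disk_file.img /home/vm-nocow/disk_file.img

The freshly copied file inherits the NOCOW attribute from the directory.)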

As I learn more about the way BTRFS stores files, particularly deltas to 
files, I come to suspect that the "best" storage model for a VM _might_ 
be exactly the opposite of the normal suggestions. (The most wasteful 
possible storage is the Gauss sum of consecutive integers from i=1 to n, 
where n is the number of consecutively stored blocks in the file. Ouch. So a 
file that is reasonably segmented is "more efficient".)

With a fast SSD, my research suggests that defragging the disk image is 
bad. No-COW is good if you don't snapshot often, but each snapshot puts 
the file through one more COW pass, which kind of defeats the No-COW if you 
do it very often.

But as near as I can tell, starting with an "empty" .qcow file and 
growing the system step-wise and _never_ defragging that file tends to 
create a chaotically natural expanse that won't hit these corner cases. 
(Way more analysis needs to be done here for that to be a real answer.)

As I learn more I discover that being overly aggressive with balance and 
defrag of large files is the opposite of good. The system seems to want 
to develop a chaotic layout, and trying to make it orderly seems to make 
things worse. For very large files like VM images it seems to amplify 
the worst parts.

> Okay, doing something else now as the BTRFS will sort things out hopefully.

To get good natural performance on my (non-SSD) system while running 
VM(s) I classify a chunk of the system RAM as movable-only with the 
movablecore= kernel boot option (about 1/4 to 1/3 of physical RAM) and turn 
down the dirty background ratio to avoid large synchronous cache flush events.
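
(A rough sketch of that, with illustrative numbers only; say a 16 GiB machine, 
reserving about a quarter of RAM as movable-only and starting background 
writeback earlier. On Debian the boot option goes into /etc/default/grub, 
followed by update-grub and a reboot:

GRUB_CMDLINE_LINUX_DEFAULT="quiet movablecore=4G"

and the vm knob can go into a sysctl.d snippet:

echo "vm.dirty_background_ratio = 4" > /etc/sysctl.d/99-writeback.conf
sysctl --system

None of the numbers are recommendations, just placeholders to tune.)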

YMMV.

>
> Ciao,
>
Later.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-26 14:41   ` Martin Steigerwald
@ 2014-12-27  3:33     ` Duncan
  0 siblings, 0 replies; 59+ messages in thread
From: Duncan @ 2014-12-27  3:33 UTC (permalink / raw)
  To: linux-btrfs

Martin Steigerwald posted on Fri, 26 Dec 2014 15:41:23 +0100 as excerpted:

> Am Freitag, 26. Dezember 2014, 15:20:42 schrieben Sie:
>> And I wonder about:
>> > Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C
>> > 0040 0710 4AFA  B82F 991B EAAC A599
>> >
>> >
>> > 
>> 
> 84C7N�����r��y����b�X��ǧv�^�)޺{.n�+����{�n�߲)����w*\x1fjg���\x1e�����ݢj/
���z�ޖ��2
>> 
>> > �ޙ����&�)ߡ�a��\x7f��\x1e�G���h�\x0f�j:+v���w��٥
>> 
>> These random chars are not supposed to be there: I better run scrub
>> straight  after this balance.
> 
> Okay, thats not me I think. scrub didn´t report any errors and when I
> look in kmail send folder I don´t see these random chars as well, so it
> seems some server on the wire added the garbage.

FWIW...

They didn't show up here on gmane's list2nntp service (message viewed 
with pan), either.  There were a few strange characters -- your dashes(?) 
on either side of the "are these a problem?" showed up as the squares 
containing four digits (0080, 0093) that appear when a font doesn't 
contain the appropriate character it's being asked to display, and there 
were a few others, but that's a common charset/font l10n issue, not the 
apparent line noise binary corruption shown above.

So I'd guess it was either the transmission to your mail service, something 
at the mail service itself, or the transmission between them and your mail 
client, that corrupted it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-26 15:59 ` Martin Steigerwald
@ 2014-12-27  4:26   ` Duncan
  0 siblings, 0 replies; 59+ messages in thread
From: Duncan @ 2014-12-27  4:26 UTC (permalink / raw)
  To: linux-btrfs

Martin Steigerwald posted on Fri, 26 Dec 2014 16:59:09 +0100 as excerpted:

> Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce:
> [Hardware Error]: Machine check events logged
> Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce:
> [Hardware Error]: Machine check events logged

Have you checked these MCEs?  What are they?

MCEs are hardware errors.  These are *NOT* kernel errors, tho of course 
they may /trigger/ kernel errors.  The reported event codes can be looked 
up and translated into English. 
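
(A hedged pointer on how: the mcelog tool will do that translation; e.g. feed 
it the MCE lines saved from dmesg,

mcelog --ascii < saved-mce-lines.txt

and it should spell out the bank, address and probable cause. On newer setups 
rasdaemon does the same job.)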

From shortly after the first one until a bit before the second one here, 
you had hardware thermal throttling; the CPUs, on-chip cache, and 
possibly the memory, were working pretty hard.

FWIW, I had an AMD machine that would MCE with memory-related errors some 
time (about a decade) ago.  I had ECC RAM, but it was cheap and 
apparently not quite up to the speeds it was actually rated for.  MemTest 
checked out the memory fine, but under high stress especially, it would 
sometimes have bus/transit-related corruption, which would sometimes (not 
always) trigger those MCEs.

Eventually a BIOS update gave me the ability to turn down the memory 
timings, and turning them down just one notch made everything rock-stable 
-- I was even able to decrease some of the wait-states to get a bit of 
the memory speed back.  It just so happened that it was borderline stable 
at the rated clock, and turning the memory clock down just one notch was 
all it took.  Later, I upgraded the RAM (the bad RAM was two half-gig 
sticks, back when they were $100+ a piece, I upgraded to four 2-gig 
sticks), and the new RAM didn't have the problem at all -- the bad RAM 
sticks simply weren't /quite/ stable at the rated speed, that was it.

I run gentoo so of course do a lot of building from sources, and 
interestingly enough, the thing that turned out to detect the corruption 
most often was bzip2 compression checksums -- I'd get errors on source 
decompression prior to the build rather more often than actual build 
failures, altho those would happen occasionally as well, while redoing it 
would work fine -- checksums passed, and I never had a build that 
actually finished fail to run due to a bad build.

Now here's the thing.  Of course a decade ago was well before I was 
running btrfs (FWIW I was running reiserfs at the time, and it seemed 
pretty resilient given the bad RAM I had), so it was the bzip2 checksums 
it failed on.

But guess what btrfs uses for file integrity: checksums.  If your MCEs 
are either like my memory-related MCEs were, or are similar CPU-cache or 
CPU-related but still something that would affect checksumming, btrfs may 
well be fighting bad checksums due to the same issues, and that would of 
course throw all sorts of wrenches into things.  Another thing I've seen 
reported as triggering MCEs is bad power (in that case it was either an 
underpowered or a failing UPS; once it was out of the picture, the MCEs 
and problems stopped).

Now I think you're having other btrfs issues as well, some of which are 
likely legit bugs.  However, your MCEs certainly aren't helping things, 
and I'd definitely recommend checking up on them to see what's actually 
happening to your hardware.  It may well be that without whatever 
hardware issues are triggering those MCEs, you may end up with fewer 
btrfs problems as well.

Or maybe not, but it's something to look into, because right now, 
regardless of whether they're making things worse physically, they're at 
minimum obscuring a troubleshooting picture that would be clearer without 
them.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-26 22:48 ` Robert White
@ 2014-12-27  5:54   ` Duncan
  2014-12-27  9:01   ` Martin Steigerwald
  1 sibling, 0 replies; 59+ messages in thread
From: Duncan @ 2014-12-27  5:54 UTC (permalink / raw)
  To: linux-btrfs

Robert White posted on Fri, 26 Dec 2014 14:48:38 -0800 as excerpted:

> ITEM: An SSD plus a good fast controller and default system virtual
> memory and disk scheduler activities can completely bog a system down.
> You can get into a mode where the system begins doing synchronous writes
> of vast expanses of dirty cache. The SSD is so fast that there is
> effectively zero "wait for IO time" and the IO subsystem is effectively
> locked or just plain busy.
> 
> Look at /proc/sys/vm/dirty_background_ratio which is probably set to 10%
> of system ram.
> 
> You may need/want to change this number to something closer to 4. That's
> not a hard suggestion. Some reading and analysis will be needed to find
> the best possible tuning for an advanced system.

FWIW, I can second at least this part, myself.  Half of the base problem 
is that memory speeds have increased far faster than storage speeds.  
SSDs do help with that, but the problem remains.  The other half of the 
problem is the comparatively huge memory capacity systems have today, 
with the result being that the default percentages of system RAM that 
were allowed to be dirty before kicking in background and then foreground 
flushing, reasonable back when they were introduced, simply aren't 
reasonable any longer, PARTICULARLY on spinning rust, but even on SSD.

vm.dirty_ratio is the percentage of RAM allowed to be dirty before the 
system kicks into high-priority write-flush mode.  
vm.dirty_background_ratio is likewise, but is where the system starts even 
worrying about it at all, doing the work in the background.

Now take my 16 GiB RAM system as an example.

The default background setting is 5%, foreground/high-priority, 10%.  
With 16 gigs RAM, that 10% is 1.6 GiB of dirty pages to flush.  A 
spinning rust drive might do 100 MiB/sec throughput contiguous, but a 
real-world number is more like 30-50 MiB/sec.

At 100 MiB/sec, that 1.6 GiB will take 16+ seconds, during which nothing 
else can be doing I/O.  So let's just divide the speed by 3 and call it 
33.3 MiB/sec.  Now we're looking at being blocked for nearly 50 seconds 
to flush all those dirty blocks.  And the system doesn't even START 
worrying about it, at even LOW priority, until it has about 25 seconds 
worth of full-usage flushing built-up!

Not only that, but that's *ALSO* 1.6 GiB worth of dirty data that isn't 
yet written to storage, and that would be lost in the event of a crash!

Of course there's a timer expiry as well.  vm.dirty_writeback_centisecs 
(that's background) defaults to 499 (5 seconds), 
vm.dirty_expire_centisecs defaults to 2999 (30 seconds).
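
(To check what a given box is actually using:

sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_writeback_centisecs vm.dirty_expire_centisecs

The same numbers also live under /proc/sys/vm/.)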

So the first thing to notice is that it's going to take more time to 
write the dirty data we're allowing to stack up, than the expiry time!  
At least to me, that makes absolutely NO sense!  At minimum, we need to 
reduce cached writes allowed to stack up to something that can actually 
be done before they expire, time-wise.  Either that, or trying to depend 
on that 30-second expiry to make sure our dirty data is flushed in 
something at least /close/ to that isn't going to work so well!

So assuming we think the 30-seconds is logical, the /minimum/ we need to 
do is reduce the size cap by half, to 5% high-priority/foreground (which 
was as we saw about 25 seconds worth), say 2% lower-priority/background.

But that's STILL about 800 MiB before it kicks to high priority mode at 
risk in case of a crash, and I still considered that a bit more than I 
wanted.

So what I ended up with here (set for spinning rust before I had SSD), 
was:

vm.dirty_background_ratio = 1

(low priority flush; that's still ~160 MiB, or about 5 seconds worth of 
activity at low-30s MiB/sec)

vm.dirty_ratio = 3

(high priority flush, roughly half a GiB, about 15 seconds of activity)

vm.dirty_writeback_centisecs=1000

(10 seconds, background flush timeout, note that the corresponding size 
cap is ~5 seconds worth so about 50% duty cycle, a bit high for 
background priority, but...)

(I left vm.dirty_expire_centisecs at the default, 2999 or 30 seconds, 
since I found that an acceptable amount of work to lose in the case of a 
crash.  Again, the corresponding size cap is ~15 seconds worth, so a ~50% 
duty cycle.  This is very reasonable for high priority, as if data is 
coming in faster than that, it'll trigger high priority flushing "billed" 
to the processes actually dirtying the memory in the first place, thus 
forcing them to slow down and wait for their IO, in turn allowing other 
(CPU-bound) processes to run.)

And while 15-second interactivity latency during disk thrashing isn't 
cake, it's at least tolerable, while 50-second latency is HORRIBLE.

Meanwhile, with vm.dirty_background_ratio already set to 1 and without 
knowing whether it can take a decimal such as 0.5 (I could look I suppose 
but I don't really have to), that's the lowest I can go there unless I 
set it to zero.  HOWEVER, if I wanted to go lower, I could set the actual 
size version, vm.dirty_background_bytes, instead.  If I needed to go 
below ~160 MiB, that's what I'd do.  Of course there's a corresponding 
vm.dirty_bytes setting as well.
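
(For completeness, a minimal way to make such values stick across boots; the 
file name is just the usual convention:

# /etc/sysctl.d/99-writeback.conf
vm.dirty_background_ratio = 1
vm.dirty_ratio = 3
vm.dirty_writeback_centisecs = 1000

then sysctl --system applies it without a reboot.)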


As I said I originally set those up for spinning rust.  Now my main 
system is SSD, tho I still have secondary backups and media on spinning 
rust.  But I've seen no reason to change them upward to allow for the 
faster SSDs, particularly since were I to do so, I'd be risking that much 
more data loss in the event of a crash, and I find that risk balance 
about right, just where it is.

And I've been quite happy with btrfs performance on the ssds (the 
spinning rust is still reiserfs).  Tho of course I do run multiple 
smaller independent btrfs instead of the huge all-the-data-eggs-in-a-
single-basket mode most people seem to run.  My biggest btrfs is actually 
only 24 GiB (on each of two devices but in raid1 mode both data/metadata, 
so 24 GiB to work with too), but between working copy and primary backup, 
I have nearly a dozen btrfs filesystems.  But I don't tend to run into 
the scaling issues others see, and being able to do full filesystem 
maintenance (scrub/balance/backup/restore-from-backup/etc) in seconds to 
minutes per filesystem is nice! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-26 22:48 ` Robert White
  2014-12-27  5:54   ` Duncan
@ 2014-12-27  9:01   ` Martin Steigerwald
  2014-12-27  9:30     ` Hugo Mills
  1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27  9:01 UTC (permalink / raw)
  To: Robert White; +Cc: linux-btrfs

Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > Hello!
> > 
> > First: Have a merry christmas and enjoy a quiet time in these days.
> > 
> > Second: At a time you feel like it, here is a little rant, but also a bug
> > report:
> > 
> > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > space_cache, skinny meta data extents – are these a problem? – and
> 
> > compress=lzo:
> (there is no known problem with skinny metadata, it's actually more
> efficient than the older format. There has been some anecdotes about
> mixing the skinny and fat metadata but nothing has ever been
> demonstrated problematic.)
> 
> > merkaba:~> btrfs fi sh /home
> > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > 
> >          Total devices 2 FS bytes used 144.41GiB
> >          devid    1 size 160.00GiB used 160.00GiB path
> >          /dev/mapper/msata-home
> >          devid    2 size 160.00GiB used 160.00GiB path
> >          /dev/mapper/sata-home
> > 
> > Btrfs v3.17
> > merkaba:~> btrfs fi df /home
> > Data, RAID1: total=154.97GiB, used=141.12GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> This filesystem, at the allocation level, is "very full" (see below).
> 
> > And I had hangs with BTRFS again. This time as I wanted to install tax
> > return software in Virtualbox´d Windows XP VM (which I use once a year
> > cause I know no tax return software for Linux which would be suitable for
> > Germany and I frankly don´t care about the end of security cause all
> > surfing and other network access I will do from the Linux box and I only
> > run the VM behind a firewall).
> 
> > And thus I try the balance dance again:
> ITEM: Balance... it doesn't do what you think it does... 8-)
> 
> "Balancing" is something you should almost never need to do. It is only
> for cases of changing geometry (adding disks, switching RAID levels,
> etc.) of for cases when you've radically changed allocation behaviors
> (like you decided to remove all your VM's or you've decided to remove a
> mail spool directory full of thousands of tiny files).
> 
> People run balance all the time because they think they should. They are
> _usually_ incorrect in that belief.

I only see the lockups of BTRFS if the trees *occupy* all space on the device.

I *never* so far saw it lock up if there is still space BTRFS can allocate from 
to *extend* a tree.

This may be a bug, but this is what I see.

And no amount of "you should not balance a BTRFS" will make that perception go 
away.

See, I see the sun coming out on a morning and you tell me "no, it doesn´t". 
Simply that is not going to match my perception.

> > merkaba:~> btrfs balance start -dusage=5 -musage=5 /home
> > ERROR: error during balancing '/home' - No space left on device
> 
> ITEM: Running out of space during a balance is not running out of space
> for files. BTRFS has two layers of allocation. That is, there are two
> levels of abstraction where "no space" can occur.

I understand that *very* well. I know about the allocation of *device* space 
for trees and I know about the allocation *inside* a tree.

> The first level of allocation is the "making more BTRFS structures out

Skipped the rest of the explanation that I already know. 

I also don´t buy the explanation that the SSD makes a kworker thread use 100% 
CPU for minutes - *while* these SSDs are basically idling. A Sandy Bridge core 
is not exactly slow, and these are still consumer SSDs; we are not talking 
about a million IOPS here.

And again:

This never happens when the trees do *not* fully allocate all device 
space. Even the defragmentation inside the Windows XP VM ran fine until after 
the trees allocated all space on the device again.

Try to reread the last two sentences in case it doesn´t sink in.


That´s why I consider it a bug. I totally agree with you that a balance should 
not be necessary, but in my observation it is. That is the actual bug.




And no, no one needs to tell me to nocow the file. Even the extents are no 
issue: not with SSDs, which provide good enough random access.

My interpretation from what I see is this: BTRFS free space *in tree* handling 
is still not up to production quality.


Now you either try out what I describe and see whether you perceive the same, 
or if you don´t, please don´t argue with my perception. You can argue with my 
conclusion, but I know what I see here. Thanks.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27  9:01   ` Martin Steigerwald
@ 2014-12-27  9:30     ` Hugo Mills
  2014-12-27 10:54       ` Martin Steigerwald
                         ` (3 more replies)
  0 siblings, 4 replies; 59+ messages in thread
From: Hugo Mills @ 2014-12-27  9:30 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4823 bytes --]

On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > Hello!
> > > 
> > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > 
> > > Second: At a time you feel like it, here is a little rant, but also a bug
> > > report:
> > > 
> > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > space_cache, skinny meta data extents – are these a problem? – and
> > 
> > > compress=lzo:
> > (there is no known problem with skinny metadata, it's actually more
> > efficient than the older format. There has been some anecdotes about
> > mixing the skinny and fat metadata but nothing has ever been
> > demonstrated problematic.)
> > 
> > > merkaba:~> btrfs fi sh /home
> > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > 
> > >          Total devices 2 FS bytes used 144.41GiB
> > >          devid    1 size 160.00GiB used 160.00GiB path
> > >          /dev/mapper/msata-home
> > >          devid    2 size 160.00GiB used 160.00GiB path
> > >          /dev/mapper/sata-home
> > > 
> > > Btrfs v3.17
> > > merkaba:~> btrfs fi df /home
> > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > 
> > This filesystem, at the allocation level, is "very full" (see below).
> > 
> > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > cause I know no tax return software for Linux which would be suitable for
> > > Germany and I frankly don´t care about the end of security cause all
> > > surfing and other network access I will do from the Linux box and I only
> > > run the VM behind a firewall).
> > 
> > > And thus I try the balance dance again:
> > ITEM: Balance... it doesn't do what you think it does... 8-)
> > 
> > "Balancing" is something you should almost never need to do. It is only
> > for cases of changing geometry (adding disks, switching RAID levels,
> > etc.) of for cases when you've radically changed allocation behaviors
> > (like you decided to remove all your VM's or you've decided to remove a
> > mail spool directory full of thousands of tiny files).
> > 
> > People run balance all the time because they think they should. They are
> > _usually_ incorrect in that belief.
> 
> I only see the lockups of BTRFS is the trees *occupy* all space on the device.

   No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
space. What's more, balance does *not* balance the metadata trees. The
remaining space -- 154.97 GiB -- is unstructured storage for file
data, and you have some 13 GiB of that available for use.

   Now, since you're seeing lockups when the space on your disks is
all allocated I'd say that's a bug. However, you're the *only* person
who's reported this as a regular occurrence. Does this happen with all
filesystems you have, or just this one?

> I *never* so far saw it lockup if there is still space BTRFS can allocate from 
> to *extend* a tree.

   It's not a tree. It's simply space allocation. It's not even space
*usage* you're talking about here -- it's just allocation (i.e. the FS
saying "I'm going to use this piece of disk for this purpose").

> This may be a bug, but this is what I see.
> 
> And no amount of "you should not balance a BTRFS" will make that perception go 
> away.
> 
> See, I see the sun coming out on a morning and you tell me "no, it doesn´t". 
> Simply that is not going to match my perception.

   Duncan's assertion is correct in its detail. Looking at your space
usage, I would not suggest that running a balance is something you
need to do. Now, since you have these lockups that seem quite
repeatable, there's probably a lurking bug in there, but hacking
around with balance every time you hit it isn't going to get the
problem solved properly.

   I think I would suggest the following:

 - make sure you have some way of logging your dmesg permanently (use
   a different filesystem for /var/log, or a serial console, or a
   netconsole; a rough netconsole sketch follows below)

 - when the lockup happens, hit Alt-SysRq-t a few times

 - send the dmesg output here, or post to bugzilla.kernel.org
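
   (For the netconsole option, a rough sketch only; the IPs, port, interface
   and MAC below are placeholders, and the exact syntax is documented in
   Documentation/networking/netconsole.txt:

   modprobe netconsole netconsole=@/eth0,6666@192.168.1.10/00:11:22:33:44:55

   then on the receiving machine something like
   "nc -u -l -p 6666 | tee btrfs-hang.log", or "nc -u -l 6666" with a
   BSD-flavoured netcat.)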

   That's probably going to give enough information to the developers
to work out where the lockup is happening, and is clearly the way
forward here.

   Hugo.

-- 
Hugo Mills             | w.w.w. -- England's batting scorecard
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: 65E74AC0          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27  9:30     ` Hugo Mills
@ 2014-12-27 10:54       ` Martin Steigerwald
  2014-12-27 11:52         ` Robert White
  2014-12-27 11:11       ` Martin Steigerwald
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 10:54 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 12579 bytes --]

Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > Hello!
> > > > 
> > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > 
> > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > bug
> > > > report:
> > > > 
> > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > space_cache, skinny meta data extents – are these a problem? – and
> > > 
> > > > compress=lzo:
> > > (there is no known problem with skinny metadata, it's actually more
> > > efficient than the older format. There has been some anecdotes about
> > > mixing the skinny and fat metadata but nothing has ever been
> > > demonstrated problematic.)
> > > 
> > > > merkaba:~> btrfs fi sh /home
> > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > 
> > > >          Total devices 2 FS bytes used 144.41GiB
> > > >          devid    1 size 160.00GiB used 160.00GiB path
> > > >          /dev/mapper/msata-home
> > > >          devid    2 size 160.00GiB used 160.00GiB path
> > > >          /dev/mapper/sata-home
> > > > 
> > > > Btrfs v3.17
> > > > merkaba:~> btrfs fi df /home
> > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > 
> > > This filesystem, at the allocation level, is "very full" (see below).
> > > 
> > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > cause I know no tax return software for Linux which would be suitable
> > > > for
> > > > Germany and I frankly don´t care about the end of security cause all
> > > > surfing and other network access I will do from the Linux box and I
> > > > only
> > > > run the VM behind a firewall).
> > > 
> > > > And thus I try the balance dance again:
> > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > > 
> > > "Balancing" is something you should almost never need to do. It is only
> > > for cases of changing geometry (adding disks, switching RAID levels,
> > > etc.) of for cases when you've radically changed allocation behaviors
> > > (like you decided to remove all your VM's or you've decided to remove a
> > > mail spool directory full of thousands of tiny files).
> > > 
> > > People run balance all the time because they think they should. They are
> > > _usually_ incorrect in that belief.
> > 
> > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > device.
>    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> space. What's more, balance does *not* balance the metadata trees. The
> remaining space -- 154.97 GiB -- is unstructured storage for file
> data, and you have some 13 GiB of that available for use.

Ok, let me rephrase that: then the space *reserved* for the trees occupies all 
space on the device. Or okay: when what I see in btrfs fi df as "total", summed 
up, occupies what I see as "size" in btrfs fi sh, i.e. when "used" equals 
"size" in btrfs fi sh.

What happened here is this:

I tried

 https://blogs.oracle.com/virtualbox/entry/how_to_compact_your_virtual

in order to regain some space from the Windows XP VDI file. I just wanted to 
get around upsizing the BTRFS again.

And on the defragmentation step in Windows it first ran fast. For about 46-47% 
there, during that fast phase, btrfs fi df showed that BTRFS was quickly 
reserving the remaining free device space for data trees (not metadata).

Only a while after it did so, it got slow again; basically the Windows 
defragmentation process stopped at 46-47% altogether, and then after a while 
even the desktop locked up due to processes being blocked in I/O.

I decided to forget about this downsizing of the Virtualbox VDI file; it will 
grow again on the next Windows work anyway, and it is already 18 GB of its 
maximum 20 GB, so… I dislike the approach anyway, and don´t even understand why 
the defragmentation step would be necessary, as I think Virtualbox can poke 
holes into the file for any space not allocated inside the VM, whether it is 
defragmented or not.

>    Now, since you're seeing lockups when the space on your disks is
> all allocated I'd say that's a bug. However, you're the *only* person
> who's reported this as a regular occurrence. Does this happen with all
> filesystems you have, or just this one?

The *only* person? The compression lockups with 3.15 and 3.16: quite a few 
people saw them, I thought. For me, too, those lockups only happened with all 
space on the device allocated.

And these seem to be gone. In regular use it doesn´t lock up totally hard. But 
in the case where a process writes a lot into one big no-cowed file, it seems 
it can still get into a lockup, but this time one where a kworker thread 
consumes 100% of a CPU for minutes.

> > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > from to *extend* a tree.
> 
>    It's not a tree. It's simply space allocation. It's not even space
> *usage* you're talking about here -- it's just allocation (i.e. the FS
> saying "I'm going to use this piece of disk for this purpose").

Okay, I thought it is the space BTRFS reserves for a tree, or well, the *chunks* 
the tree manages. I am aware that it isn´t already *used* space, it´s just 
*reserved*.

> > This may be a bug, but this is what I see.
> > 
> > And no amount of "you should not balance a BTRFS" will make that
> > perception go away.
> > 
> > See, I see the sun coming out on a morning and you tell me "no, it
> > doesn´t". Simply that is not going to match my perception.
> 
>    Duncan's assertion is correct in its detail. Looking at your space
> usage, I would not suggest that running a balance is something you
> need to do. Now, since you have these lockups that seem quite
> repeatable, there's probably a lurking bug in there, but hacking
> around with balance every time you hit it isn't going to get the
> problem solved properly.

It was Robert writing this I think.

Well I do not like to balance the FS, but I see the result, I see that it 
helps here. And that´s about it.

My theory from watching the Windows XP defragmentation case is this:

- For writing into the file BTRFS needs to actually allocate and use free space 
within the current allocation, or, as we seem to have misunderstood each 
other's words, it needs to fit the data in

Data, RAID1: total=144.98GiB, used=140.94GiB

i.e. between 144,98 GiB and 140,94 GiB, given that the total space of this tree 
(or, if it's not a tree, the chunks that the tree manages) can *not* be 
extended anymore.

System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB

- What I see now is as long as it can be extended, BTRFS on this workload 
*happily* does so. *Quickly*. Up to the amount of the free, unreserved space 
of the device. And *even* if in my eyes there is a big enough difference 
between total and used in btrfs fi df.

- Then, as all the device space is *reserved*, BTRFS needs to fit the allocation 
within the *existing* chunks instead of reserving a new one and filling the 
empty one. And I think this is where it runs into problems.


I extended both devices of /home by 10 GiB now and I was able to complete some 
balance steps with these results.

Original after my last partly failed balance attempts:

Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 144.20GiB
        devid    1 size 170.00GiB used 159.01GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 159.01GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=153.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.25GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


Then balancing, but not all of them:

merkaba:~#1> btrfs balance start -dusage=70 /home
Done, had to relocate 9 out of 162 chunks
merkaba:~> btrfs fi df /home                   
Data, RAID1: total=146.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.25GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
merkaba:~> btrfs balance start -dusage=80 /home
Done, had to relocate 9 out of 155 chunks
merkaba:~> btrfs fi df /home                   
Data, RAID1: total=144.98GiB, used=140.94GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 144.19GiB
        devid    1 size 170.00GiB used 150.01GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 150.01GiB path /dev/mapper/sata-home

Btrfs v3.17


This is a situation where I do not see any slowdowns with BTRFS.

As far as I understand the balance commands I used I told BTRFS the following:

- go and balance all chunks that have 70% or less used
- go and balance all chunks that have 80% or less used

I rarely see any chunks that have 60% or less used and get something like this 
if I try:

merkaba:~> btrfs balance start -dusage=60 /home
Done, had to relocate 0 out of 153 chunks



Now my idea is this: BTRFS will need to satisfy the allocations it needs to do 
for writing heavily into a cow´ed file from the already reserved space. Yet if 
I have lots of chunks that are filled to between 60-70%, it needs to spread the 
allocations across the 30-40% of each chunk that is not yet used.

My theory is this: if BTRFS needs to do this *heavily*, at some point it gets 
into problems while doing so. Apparently it is *easier* to just reserve a new 
chunk and fill the fresh chunk. Otherwise I don´t know why BTRFS is doing 
it like this: it prefers to reserve free device space during this 
defragmentation inside the VM.

And these issues may be due to an inefficient implementation or bug.

Now if no one else is ever seeing this, it may be a speciality of my 
filesystem, and heck, I can recreate it from scratch if need be. Yet I would 
prefer to find out what is happening here.


>    I think I would suggest the following:
> 
>  - make sure you have some way of logging your dmesg permanently (use
>    a different filesystem for /var/log, or a serial console, or a
>    netconsole)
> 
>  - when the lockup happens, hit Alt-SysRq-t a few times
> 
>  - send the dmesg output here, or post to bugzilla.kernel.org
> 
>    That's probably going to give enough information to the developers
> to work out where the lockup is happening, and is clearly the way
> forward here.

Thanks, I think this seems to be a way to go.

Actually the logging should be safe I´d say, cause it goes into a different 
BTRFS. The BTRFS for /, which is also a RAID 1 and which didn´t show this 
behavior yet, although it has also had all space reserved for quite some time:

merkaba:~> btrfs fi sh /                       
Label: 'debian'  uuid: […]
        Total devices 2 FS bytes used 17.79GiB
        devid    1 size 30.00GiB used 30.00GiB path /dev/mapper/sata-debian
        devid    2 size 30.00GiB used 30.00GiB path /dev/mapper/msata-debian

Btrfs v3.17
merkaba:~> btrfs fi df /
Data, RAID1: total=27.99GiB, used=17.21GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=596.12MiB
GlobalReserve, single: total=208.00MiB, used=0.00B


*Unless* one BTRFS locking up makes the other lock up as well, logging should 
be safe.

Actually I got the last task hung messages as I posted them here. So I may 
just try to reproduce this and trigger

echo "t" > /proc/sysrq-trigger

this gives

[32459.707323] systemd-journald[314]: /dev/kmsg buffer overrun, some messages 
lost.

but I bet rsyslog will capture it just fine. I may even disable journald to 
reduce writes to / while reproducing the bug.
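
(One thing that may help with that /dev/kmsg overrun, as far as I understand, 
is booting with a larger kernel log buffer, e.g. adding

log_buf_len=4M

to the kernel command line, so the whole sysrq-t dump fits before lines scroll 
out.)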

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27  9:30     ` Hugo Mills
  2014-12-27 10:54       ` Martin Steigerwald
@ 2014-12-27 11:11       ` Martin Steigerwald
  2014-12-27 12:08         ` Robert White
  2014-12-27 13:55       ` Martin Steigerwald
  2014-12-27 18:28       ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
  3 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 11:11 UTC (permalink / raw)
  To: Hugo Mills, Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3474 bytes --]

Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > 
> >
> > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > device.
>    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> space. What's more, balance does *not* balance the metadata trees. The
> remaining space -- 154.97 GiB -- is unstructured storage for file
> data, and you have some 13 GiB of that available for use.
> 
>    Now, since you're seeing lockups when the space on your disks is
> all allocated I'd say that's a bug. However, you're the *only* person
> who's reported this as a regular occurrence. Does this happen with all
> filesystems you have, or just this one?

Okay, just about terms.

What I call trees is this:

merkaba:~> btrfs fi df /
Data, RAID1: total=27.99GiB, used=17.21GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=596.12MiB
GlobalReserve, single: total=208.00MiB, used=0.00B

For me each one of "Data", "System", "Metadata" and "GlobalReserve" is what I 
call a "tree".

How would you call it?

I always thought that BTRFS uses a tree structure not only for metadata, but 
also for data. But I bet, strictly speaking, that´s only to *manage* the chunks 
it allocates, and what I see above is the actual chunk usage.

I.e. to get terms straight: what would you call it? I think my understanding of 
how BTRFS handles space allocation is quite correct, but I may be using a term 
incorrectly.

I read

> Data, RAID1: total=27.99GiB, used=17.21GiB

as:

I reserved 27,99 GiB for data chunks and used 17,21 GiB in these data chunks 
so far. So I have about 10,5 GiB free in these data chunks at the moment and 
all is good.

What it doesn´t tell me at all is how the allocated space is distributed onto 
these chunks. It may be that some chunks are completely empty, or not. It may 
be that each chunk has some space allocated to it but in total there is that 
amount of free space. I.e. it doesn´t tell me anything about the free 
space fragmentation inside the chunks.

Yet I still hold to my theory that in the case of heavy writing to a COW´d file 
BTRFS seems to prefer to reserve new empty chunks on this /home filesystem of 
my laptop instead of trying to find free space in existing, only partially 
filled chunks. And the lockup only happens when it tries to do the latter. And 
no, I think it shouldn´t lock up then. I also think it´s a bug. I never said 
differently.

And yes, I only ever had this on my /home so far. Not on /, which is also RAID 
1 and has had all device space reserved for quite some time; not on /daten, 
which only holds large files and is single instead of RAID. Also not on the 
server, but the server FS still has lots of unallocated device space; nor on 
the 2 TiB eSATA backup HD, although I do get the impression that BTRFS has 
started to get slower there as well; at least the rsync-based backup script 
takes quite long meanwhile, and I see rsync reading from the backup BTRFS and 
in this case almost fully utilizing the disk for longer times. But unlike my 
/home, the backup disk has some snapshots widely spaced in time (roughly 2-week 
to 1-month intervals, covering about the last half year).

Neither /home nor / on the SSD have snapshots at the moment. So this is 
happening without snapshots.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 10:54       ` Martin Steigerwald
@ 2014-12-27 11:52         ` Robert White
  2014-12-27 13:16           ` Martin Steigerwald
  0 siblings, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-27 11:52 UTC (permalink / raw)
  To: Martin Steigerwald, Hugo Mills; +Cc: linux-btrfs

On 12/27/2014 02:54 AM, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
>> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
>>> Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
>>>> On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
>>>>> Hello!
>>>>>
>>>>> First: Have a merry christmas and enjoy a quiet time in these days.
>>>>>
>>>>> Second: At a time you feel like it, here is a little rant, but also a
>>>>> bug
>>>>> report:
>>>>>
>>>>> I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
>>>>> space_cache, skinny meta data extents – are these a problem? – and
>>>>
>>>>> compress=lzo:
>>>> (there is no known problem with skinny metadata, it's actually more
>>>> efficient than the older format. There has been some anecdotes about
>>>> mixing the skinny and fat metadata but nothing has ever been
>>>> demonstrated problematic.)
>>>>
>>>>> merkaba:~> btrfs fi sh /home
>>>>> Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
>>>>>
>>>>>           Total devices 2 FS bytes used 144.41GiB
>>>>>           devid    1 size 160.00GiB used 160.00GiB path
>>>>>           /dev/mapper/msata-home
>>>>>           devid    2 size 160.00GiB used 160.00GiB path
>>>>>           /dev/mapper/sata-home
>>>>>
>>>>> Btrfs v3.17
>>>>> merkaba:~> btrfs fi df /home
>>>>> Data, RAID1: total=154.97GiB, used=141.12GiB
>>>>> System, RAID1: total=32.00MiB, used=48.00KiB
>>>>> Metadata, RAID1: total=5.00GiB, used=3.29GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>
>>>> This filesystem, at the allocation level, is "very full" (see below).
>>>>
>>>>> And I had hangs with BTRFS again. This time as I wanted to install tax
>>>>> return software in Virtualbox´d Windows XP VM (which I use once a year
>>>>> cause I know no tax return software for Linux which would be suitable
>>>>> for
>>>>> Germany and I frankly don´t care about the end of security cause all
>>>>> surfing and other network access I will do from the Linux box and I
>>>>> only
>>>>> run the VM behind a firewall).
>>>>
>>>>> And thus I try the balance dance again:
>>>> ITEM: Balance... it doesn't do what you think it does... 8-)
>>>>
>>>> "Balancing" is something you should almost never need to do. It is only
>>>> for cases of changing geometry (adding disks, switching RAID levels,
>>>> etc.) of for cases when you've radically changed allocation behaviors
>>>> (like you decided to remove all your VM's or you've decided to remove a
>>>> mail spool directory full of thousands of tiny files).
>>>>
>>>> People run balance all the time because they think they should. They are
>>>> _usually_ incorrect in that belief.
>>>
>>> I only see the lockups of BTRFS is the trees *occupy* all space on the
>>> device.
>>     No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
>> space. What's more, balance does *not* balance the metadata trees. The
>> remaining space -- 154.97 GiB -- is unstructured storage for file
>> data, and you have some 13 GiB of that available for use.
>
> Ok, let me rephrase that: Then the space *reserved* for the trees occupies all
> space on the device. Or okay, when that I see in btrfs fi df as "total" in
> summary occupies what I see as "size" in btrfs fi sh, i.e. when "used" equals
> space in "btrfs fi sh"
>
> What happened here is this:
>
> I tried
>
>   https://blogs.oracle.com/virtualbox/entry/how_to_compact_your_virtual
>
> in order to regain some space from the Windows XP VDI file. I just wanted to
> get around upsizing the BTRFS again.
>
> And on the defragementation step in Windows it first ran fast. For about 46-47%
> there, during that fast phase btrfs fi df showed that BTRFS was quickly
> reserving the remaining free device space for data trees (not metadata).

The above statement is word-salad. The storage for data is not a "data 
tree", the tree that maps data into a file is metadata. The data is 
data. There is no "data tree".

> Only after a while after it did so, it got slow again, basically the Windows
> defragmentation process stopped at 46-47% altogether and then after a while
> even the desktop locked due to processes being blocked in I/O.

If you've over-organized your very-large data files you can waste 
some terrific amounts of space.

[---------------------------------------]
   [-------]     [uuuuuuu]  [] [-----]
       [------] [-----][----]   [-------]
                    [----]

As you write new segments you don't actually free the lower extents 
unless they are _completely_ obscured end-to-end by a later extent. So 
if you've _ever_ defragged the BTRFS extent to be fully contiguous and 
you've not overwritten each and every byte later, the original expanse 
is still going to be there.

In the above example only the "uuu" block is ever freed, and only when 
the fourth generation finally covers the little gap.

In the worst case you can end up with (N*(N+1))/2 total blocks used up 
on disk when only N blocks are visible. (See the Gauss equation for the 
sum of consecutive integers for why this is the correct approximation 
for the worst case.)

[------------]
[-----------]
[----------]
...
[-]

Each generation, being one block shorter than the previous one, exposes 
one block, so N blocks stay visible, one from each generation. So 
1+2+3+4+5...+N blocks are allocated if each overwrite is one block shorter 
than the previous.
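
(Worked through with small numbers, purely illustrative: for N = 4, the 
surviving generations pin 4 + 3 + 2 + 1 = 4*(4+1)/2 = 10 blocks on disk, even 
though only 4 blocks are visible in the file.)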

So if your original VDI file was all in little pieces all through the 
disk, it will waste less space (statistically).

But if you keep on defragging the file internally and externally you can 
end up with many times the total file size "in use" to represent the 
disk file.

So like I said, if you start trying to _force_ order you will end up 
paying significant expenses as the file ages.

COW can help, but every snapshot counts as a generation, so really it's 
not necessarily ideal.

I suspect that copying the file as 100 blocks (400k) [or so] at a time 
would lead to a file likely to sanitize its history with overwrites.

As it is, coercing order is not your friend. But once done, the best 
thing to do is periodically copy the whole file anew to burp the history 
out of it.
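
(A rough sketch of that periodic re-copy, with the VM shut down; the block 
size is just dd's I/O buffer, not a promise about the resulting extent layout:

dd if=disk_file.img of=new_disk_file.img bs=400K
mv new_disk_file.img disk_file.img

The mv replaces the old file, so its accumulated history gets freed, assuming 
no snapshot still references it.)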

>
> I decided to forget about this downsizing of the Virtualbox VDI file, it will
> extend again on next Windows work and it is already 18 GB of its maximum 20GB,
> so… I dislike the approach anyway, and don´t even understand why the
> defragmentation step would be necessary as I think Virtualbox can poke holes
> into the file for any space not allocated inside the VM, whether it is
> defragmented or not.

If you don't have trim turned on in both the VirtualBox guest and the base 
system then there is no discarding to be done. And defrag is "meh" in 
your arrangement. [See "lsblk -D" to check whether you are doing real discards. 
Check Windows as well.]
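
(Roughly, with made-up values:

lsblk -D /dev/sda
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda         0      512B        2G         0

Non-zero DISC-GRAN/DISC-MAX means discards make it to that layer; all zeros 
means they are being dropped somewhere in the stack. And with dm-crypt in the 
picture, as your /dev/mapper paths suggest, the crypt layer also has to be set 
up to pass discards through, e.g. the allow-discards option.)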

Then consider using _raw_ disk format instead of VDI, since the 
"container format" may not result in trim operations coming through to 
the underlying filesystem as such. (I don't know for sure.)

So basically, you've arranged your storage almost exactly wrong by 
defragging and such, particularly since you are doing it at both layers.

I know where you got the advice from, but it's not right for the BTRFS 
assumptions.

>
>>     Now, since you're seeing lockups when the space on your disks is
>> all allocated I'd say that's a bug. However, you're the *only* person
>> who's reported this as a regular occurrence. Does this happen with all
>> filesystems you have, or just this one?
>
> The *only* person? The compression lockups with 3.15 and 3.16, quite some
> people saw them, I thought. For me also these lockups only happened with all
> space on device allocated.
>
> And these seem to be gone. In regular use it doesn´t lockup totally hard. But
> in the a processes writes a lot into one big no-cowed file case, it seems it
> can still get into a lockup, but this time one where a kworker thread consumes
> 100% of CPU for minutes.



>
>>> I *never* so far saw it lockup if there is still space BTRFS can allocate
>>> from to *extend* a tree.
>>
>>     It's not a tree. It's simply space allocation. It's not even space
>> *usage* you're talking about here -- it's just allocation (i.e. the FS
>> saying "I'm going to use this piece of disk for this purpose").
>
> Okay, I thought it is the space BTRFS reserves for a tree or well the *chunks*
> the tree manages. I am aware of that it isn´t already *used* space, its just
> *reserved*
>
>>> This may be a bug, but this is what I see.
>>>
>>> And no amount of "you should not balance a BTRFS" will make that
>>> perception go away.
>>>
>>> See, I see the sun coming out on a morning and you tell me "no, it
>>> doesn´t". Simply that is not going to match my perception.
>>
>>     Duncan's assertion is correct in its detail. Looking at your space
>> usage, I would not suggest that running a balance is something you
>> need to do. Now, since you have these lockups that seem quite
>> repeatable, there's probably a lurking bug in there, but hacking
>> around with balance every time you hit it isn't going to get the
>> problem solved properly.
>
> It was Robert writing this I think.
>
> Well I do not like to balance the FS, but I see the result, I see that it
> helps here. And thats about it.
>
> My theory from watching the Windows XP defragmentation case is this:
>
> - For writing into the file BTRFS needs to actually allocate and use free space
> in the current tree allocation, or, as we seem to misunderstood from the words
> we use, it needs to fit data in
>
> Data, RAID1: total=144.98GiB, used=140.94GiB
>
> between 144,98 GiB and 140,94 GiB given that total space of this tree, or if
> its not a tree, but the chunks in that the tree manages, in these chunks can
> *not* be extended anymore.

If your file was actually no-COW (and you have _not_ been taking snapshots) 
then there is no extending to be had. But if you are using snapper 
(which I believe you mentioned previously) then the snapshots cause a 
write boundary and a layer of copying. Frequently taking snapshots of a 
no-COW file is self-defeating. If you are going to take snapshots then you 
might as well turn copy-on-write back on and, for the love of Pete, stop 
defragging things.


> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.24GiB
>
> - What I see now is as long as it can be extended, BTRFS on this workload
> *happily* does so. *Quickly*. Up to the amount of the free, unreserved space
> of the device. And *even* if in my eyes there is a big enough difference
> between total and used in btrfs fi df.
>
> - Then as all the device space is *reserved*, BTRFS needs to fit the allocation
> within the *existing* chunks instead of reserving a new one and filling the
> empty one. And I think this is where it runs into problems.
>
>
> I extended both devices of /home by 10 GiB now and I was able to complete some
> balance steps with these results.
>
> Original after my last partly failed balance attempts:
>
> Label: 'home'  uuid: […]
>          Total devices 2 FS bytes used 144.20GiB
>          devid    1 size 170.00GiB used 159.01GiB path /dev/mapper/msata-home
>          devid    2 size 170.00GiB used 159.01GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=153.98GiB, used=140.95GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.25GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
> Then balancing, but not all of them:
>
> merkaba:~#1> btrfs balance start -dusage=70 /home
> Done, had to relocate 9 out of 162 chunks
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=146.98GiB, used=140.95GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.25GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> merkaba:~> btrfs balance start -dusage=80 /home
> Done, had to relocate 9 out of 155 chunks
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=144.98GiB, used=140.94GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.24GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> merkaba:~> btrfs fi sh /home
> Label: 'home'  uuid: […]
>          Total devices 2 FS bytes used 144.19GiB
>          devid    1 size 170.00GiB used 150.01GiB path /dev/mapper/msata-home
>          devid    2 size 170.00GiB used 150.01GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
>
>
> This is a situation where I do not see any slowdowns with BTRFS.
>
> As far as I understand the balance commands I used I told BTRFS the following:
>
> - go and balance all chunks that has 70% or less used
> - go and balance all chunks that have 80% or less used
>
> I rarely see any chunks that have 60% or less used and get something like this
> if I try:
>
> merkaba:~> btrfs balance start -dusage=60 /home
> Done, had to relocate 0 out of 153 chunks
>
>
>
> Now my idea is this: BTRFS will need to satisfy the allocations it needs to do
> for writing heavily into a COW'ed file from the already reserved space. Yet if
> I have lots of chunks that are filled between 60-70%, it needs to spread the
> allocations into the 40-30% of each chunk that is not yet used.
>
> My theory is this: If BTRFS needs to do this *heavily*, at some point it runs
> into problems while doing so. Apparently it is *easier* for it to just reserve
> a new chunk and fill the fresh chunk instead. Otherwise I don't know why BTRFS
> is doing it like this: it prefers to reserve free device space during this
> defragmentation inside the VM.

When you defrag inside the VM, it gets scrambled through the VDI 
container, then layered into the BTRFS filesystem. This can consume vast 
amounts of space with no purpose. So...

Don't do that.


> And these issues may be due to an inefficient implementation or bug.

Or just stop fighting the system with all the unnecessary defragging. 
Watch the picture as it defrags. Look at all that layered writing. 
That's what's killing you.

(I do agree, however, that the implementation can become very 
inefficient, especially if you do exactly the wrong things.)

>
> Now if no one else is ever seeing this, it may be a speciality of my
> filesystem, and heck, I can recreate it from scratch if need be. Yet I would
> prefer to find out what is happening here.
>
>
>>     I think I would suggest the following:
>>
>>   - make sure you have some way of logging your dmesg permanently (use
>>     a different filesystem for /var/log, or a serial console, or a
>>     netconsole)
>>
>>   - when the lockup happens, hit Alt-SysRq-t a few times
>>
>>   - send the dmesg output here, or post to bugzilla.kernel.org
>>
>>     That's probably going to give enough information to the developers
>> to work out where the lockup is happening, and is clearly the way
>> forward here.
>
> Thanks, I think this seems to be a way to go.
>
> Actually the logging should be safe I'd say, cause it goes into a different
> BTRFS: the BTRFS for /, which is also a RAID 1 and which didn't show this
> behavior yet, although it also has had all space reserved for quite some time:
>
> merkaba:~> btrfs fi sh /
> Label: 'debian'  uuid: […]
>          Total devices 2 FS bytes used 17.79GiB
>          devid    1 size 30.00GiB used 30.00GiB path /dev/mapper/sata-debian
>          devid    2 size 30.00GiB used 30.00GiB path /dev/mapper/msata-debian
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /
> Data, RAID1: total=27.99GiB, used=17.21GiB
> System, RAID1: total=8.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=596.12MiB
> GlobalReserve, single: total=208.00MiB, used=0.00B
>
>
> *Unless* one BTRFS locking up makes the other one lock up as well, logging
> should be safe.
>
> Actually I got the last task hung messages as I posted them here. So I may
> just try to reproduce this and trigger
>
> echo "t" > /proc/sysrq-trigger
>
> this gives
>
> [32459.707323] systemd-journald[314]: /dev/kmsg buffer overrun, some messages
> lost.
>
> but I bet rsyslog will capture it just nice. I may even disable journald to
> reduce writes to / during reproducing the bug.
>
> Ciao,
>


ASIDE: I've been considering recreating my raw extents with COW turned 
_off_, but doing it as a series of 4Meg appends so that the underlying 
allocation would look like

[--][--][--][--][--][--][--][--][--][--][--][--][--]...[--][--]

this would net the most naturally discard-ready/cleanable history.

It's the vast expanse of the preallocated base.
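
Roughly this kind of thing is what I have in mind; untested, names made up, and
the per-append fsync is what should keep each 4Meg chunk as its own extent:

#!/bin/bash
src=/vm/disk.raw ; dst=/vm/disk.raw.new
touch "$dst" && chattr +C "$dst"    # NOCOW only sticks while the file is empty
blocks=$(( ( $(stat -c%s "$src") + 4194303 ) / 4194304 ))
for ((i=0; i<blocks; i++)); do
  dd if="$src" of="$dst" bs=4M count=1 skip=$i seek=$i conv=notrunc,fsync
done
mv "$dst" "$src"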

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 11:11       ` Martin Steigerwald
@ 2014-12-27 12:08         ` Robert White
  0 siblings, 0 replies; 59+ messages in thread
From: Robert White @ 2014-12-27 12:08 UTC (permalink / raw)
  To: Martin Steigerwald, Hugo Mills, linux-btrfs

On 12/27/2014 03:11 AM, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
>>>
>>>
>>> I only see the lockups of BTRFS is the trees *occupy* all space on the
>>> device.
>>     No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
>> space. What's more, balance does *not* balance the metadata trees. The
>> remaining space -- 154.97 GiB -- is unstructured storage for file
>> data, and you have some 13 GiB of that available for use.
>>
>>     Now, since you're seeing lockups when the space on your disks is
>> all allocated I'd say that's a bug. However, you're the *only* person
>> who's reported this as a regular occurrence. Does this happen with all
>> filesystems you have, or just this one?
>
> Okay, just about terms.

Terms are _really_ important if you want to file and discuss bugs.

> What I call trees is this:
>
> merkaba:~> btrfs fi df /
> Data, RAID1: total=27.99GiB, used=17.21GiB
> System, RAID1: total=8.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=596.12MiB
> GlobalReserve, single: total=208.00MiB, used=0.00B
>
> For me each one of "Data", "System", "Metadata" and "GlobalReserve" is what I
> call a "tree".
>
> What would you call it?

Those are "extents" I think. All of the "Trees" are in the metadata. One 
of the trees is the "extent tree". That extent tree is what contains the 
list of which regions of the disk are data, or metadata, or 
system-metadata (like the superblocks), or the global reserve.

Those extents are then filled with the type of information described.

But all the "trees" are in the metadata extents.

>
> I always thought that BTRFS uses a tree structure not only for metadata, but
> also for data. But I bet strictly speaking that's only to *manage* the chunks it
> allocates, and what I see above is the actual chunk usage.
>
> I.e., to get the terms straight, what would you call it? I think my understanding of
> how BTRFS handles space allocation is quite correct, but I may be using a term
> incorrectly.
>
> I read
>
>> Data, RAID1: total=27.99GiB, used=17.21GiB
>
> as:
>
> I reserved 27.99 GiB for data chunks and used 17.21 GiB in these data chunks
> so far. So I have about 10.5 GiB free in these data chunks at the moment and
> all is good.
>
> What it doesn't tell me at all is how the allocated space is distributed over
> these chunks. It may be that some chunks are completely empty, or not. It may be
> that each chunk has some space allocated to it but in total there is that
> amount of free space left. I.e. it doesn't tell me anything about the free
> space fragmentation inside the chunks.
>
> Yet I still hold my theory that in the case of heavily writing to a COW'd file
> BTRFS seems to prefer to reserve new empty chunks on this /home filesystem of
> my laptop instead of trying to find free space in existing, only partially filled
> chunks. And the lockup only happens when it tries to do the latter. And no, I
> think it shouldn't lock up then. I also think it's a bug. I never said
> differently.

Partly correct. The system (as I understand it) will try to fill old 
chunks before allocating new ones. It also prefers the most empty 
chunk first. But if you fallocate large extents they can have trouble 
finding a home. So let's say you have a systemic process that keeps 
making .51GiB files; then it will tend to allocate a new 1GiB data extent 
each time (presuming you used default values) because each successive 
.51GiB region cannot fit in any existing data extent.
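
A rough, untested way to watch that effect (mount point made up, and the exact
behaviour depends on the kernel version): fallocate a few just-over-half-a-chunk
files and check the data allocation after each one:

for i in 1 2 3 4; do
  fallocate -l 550M /mnt/test/big$i
  sync
  btrfs fi df /mnt/test | grep ^Data    # watch the Data "total" figure grow
done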

Excessive snapshotting can also contribute to this effect, but only 
because it freezes the history.

There are some other odd-out cases.

> And yes, I only ever had this on my /home so far. Not on /, which is also RAID
> 1 and has had all device space reserved for quite some time, and not on /daten,
> which only holds large files and is single instead of RAID. Also not on the
> server, but the server FS still has lots of unallocated device space, nor on the
> 2 TiB eSATA backup HD, although I do get the impression that BTRFS started to
> get slower there as well: at least the rsync based backup script takes quite
> long meanwhile, and I see rsync reading from the backup BTRFS and in this case
> almost fully utilizing the disk for longer times. But unlike my /home the
> backup disk has some widely distributed snapshots (at about 2 week to 1 month
> intervals, covering about the last half year).
>
> Neither /home nor / on the SSD have snapshots at the moment. So this is
> happening without snapshots.
>
> Ciao,
>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 11:52         ` Robert White
@ 2014-12-27 13:16           ` Martin Steigerwald
  2014-12-27 13:49             ` Robert White
  2014-12-27 14:00             ` Robert White
  0 siblings, 2 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 13:16 UTC (permalink / raw)
  To: Robert White; +Cc: Hugo Mills, linux-btrfs

Am Samstag, 27. Dezember 2014, 03:52:56 schrieb Robert White:
> > My theory from watching the Windows XP defragmentation case is this:
> > 
> > - For writing into the file BTRFS needs to actually allocate and use free
> > space in the current tree allocation, or, as we seem to misunderstood
> > from the words we use, it needs to fit data in
> > 
> > Data, RAID1: total=144.98GiB, used=140.94GiB
> > 
> > between 144,98 GiB and 140,94 GiB given that total space of this tree, or
> > if its not a tree, but the chunks in that the tree manages, in these
> > chunks can *not* be extended anymore.
> 
> If your file was actually COW (and you have _not_ been taking snapshots) 
> then there is no extenting to be had. But if you are using snapper 
> (which I believe you mentioned previously) then the snapshots cause a 
> write boundary and a layer of copying. Frequently taking snapshots of a 
> COW file is self defeating. If you are going to take snapshots then you 
> might as well turn copy on write back on and, for the love of pete, stop 
> defragging things.

I don't use any snapshots on the filesystems. None, zero, zilch, nada.

And as I understand it, copy on write means: It has to write the new write 
requests somewhere else. For this it needs to allocate space, either 
within existing chunks or in a newly allocated one.

So for COW, when writing to a file it will always need to allocate new space 
(although it can forget about the old space afterwards, unless a 
snapshot is holding it).

Anyway, I got it reproduced. And I am about to write a lengthy mail about it.

It can easily be reproduced without even using Virtualbox, just by a nice 
simple fio job.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 13:16           ` Martin Steigerwald
@ 2014-12-27 13:49             ` Robert White
  2014-12-27 14:06               ` Martin Steigerwald
  2014-12-27 14:00             ` Robert White
  1 sibling, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-27 13:49 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Hugo Mills, linux-btrfs

On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 03:52:56 schrieb Robert White:
>>> My theory from watching the Windows XP defragmentation case is this:
>>>
>>> - For writing into the file BTRFS needs to actually allocate and use free
>>> space in the current tree allocation, or, as we seem to misunderstood
>>> from the words we use, it needs to fit data in
>>>
>>> Data, RAID1: total=144.98GiB, used=140.94GiB
>>>
>>> between 144,98 GiB and 140,94 GiB given that total space of this tree, or
>>> if its not a tree, but the chunks in that the tree manages, in these
>>> chunks can *not* be extended anymore.
>>
>> If your file was actually COW (and you have _not_ been taking snapshots)
>> then there is no extenting to be had. But if you are using snapper
>> (which I believe you mentioned previously) then the snapshots cause a
>> write boundary and a layer of copying. Frequently taking snapshots of a
>> COW file is self defeating. If you are going to take snapshots then you
>> might as well turn copy on write back on and, for the love of pete, stop
>> defragging things.
>
> I don´t use any snapshots on the filesystems. None, zero, zilch, nada.
>
> And as I understand it copy on write means: It has to write the new write
> requests to somewhere else. For this it needs to allocate space. Either
> withing existing chunks or in a newly allocated one.
>
> So for COW when writing to a file it will always need to allocate new space
> (although it can forget about the old space afterwards unless there isn´t a
> snapshot holding it)

It can _only_ forget about the space if absolutely _all_ of the old 
extent is overwritten. So if you write 1MiB, then you go back and 
overwrite 1MiB-4KiB, then you go back and write 1MiB-8KiB, you've now 
got 3MiB-12KiB to represent 1MiB of data. No snapshots involved. The 
worst case is quite well understood.

[...--------------] 1MiB
[...-------------]  1MiB-4KiB
[...------------]   1MiB-8KiB


BTRFS will _NOT_ reclaim just "part" of an extent. So if this kept going 
it would take 250 diminishing overwrites, each 4k less than the prior:

1MiB is roughly 250 4KiB blocks.
(250*(250+1))/2 = 31375 4KiB blocks, or about 122.6MiB of storage allocated and 
dedicated to representing 1MiB of accessible data.

This is a worst case, of course, but it exists and it's _horrible_.

And such a file can be "burped" by doing a copy-and-rename, resulting in 
returning it to a single 1MiB extent. (I don't know if a "btrfs defrag" 
would have identical results, but I think it would.)
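
The burp itself is nothing fancy; roughly, with a made-up file name:

cp --reflink=never bloated.file bloated.file.new   # full rewrite into fresh extents
sync
mv bloated.file.new bloated.file                   # old extents become reclaimable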

The problem is that there isn't (yet) a COW safe way to discard partial 
extents. That is, there is no universally safe way (yet implemented) to 
turn that first 1MiB into two extents of 1MiB-4K and one 4K extent "in 
place" so there is no way (yet) to prevent this worst case.

Doing things like excessive defragging at the BTRFS level, and 
defragging inside of a VM, and using certain file types can lead to 
pretty awful data wastage. YMMV.

e.g. "too much tidying up and you make a mess".

I offered a pseudocode example a few days back on how this problem might 
be dealt with in future, but I've not seen any feedback on it.

>
> Anyway, I got it reproduced. And am about to write a lengthy mail about.

Have fun with that lengthy email, but the devs already know about the 
data waste profile of the system. They just don't have a good solution yet.

Practical use cases involving _not_ defragging and _not_ packing files, 
or disabling COW and using raw image formats for VM disk storage are, 
meanwhile, also well understood.

>
> It can easily be reproduced without even using Virtualbox, just by a nice
> simple fio job.
>

Yep. As I've explained twice now.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27  9:30     ` Hugo Mills
  2014-12-27 10:54       ` Martin Steigerwald
  2014-12-27 11:11       ` Martin Steigerwald
@ 2014-12-27 13:55       ` Martin Steigerwald
  2014-12-27 14:54         ` Robert White
  2014-12-28 13:00         ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
  2014-12-27 18:28       ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
  3 siblings, 2 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 13:55 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 33441 bytes --]

Summarized at

Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

see below. This is reproducible with fio, no need for Windows XP in
Virtualbox to reproduce the issue. Next I will try to reproduce it with
a freshly created filesystem.


Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > Hello!
> > > > 
> > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > 
> > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > bug
> > > > report:
> > > > 
> > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > space_cache, skinny meta data extents – are these a problem? – and
> > > 
> > > > compress=lzo:
> > > (there is no known problem with skinny metadata, it's actually more
> > > efficient than the older format. There has been some anecdotes about
> > > mixing the skinny and fat metadata but nothing has ever been
> > > demonstrated problematic.)
> > > 
> > > > merkaba:~> btrfs fi sh /home
> > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > 
> > > >          Total devices 2 FS bytes used 144.41GiB
> > > >          devid    1 size 160.00GiB used 160.00GiB path
> > > >          /dev/mapper/msata-home
> > > >          devid    2 size 160.00GiB used 160.00GiB path
> > > >          /dev/mapper/sata-home
> > > > 
> > > > Btrfs v3.17
> > > > merkaba:~> btrfs fi df /home
> > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > 
> > > This filesystem, at the allocation level, is "very full" (see below).
> > > 
> > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > cause I know no tax return software for Linux which would be suitable
> > > > for
> > > > Germany and I frankly don´t care about the end of security cause all
> > > > surfing and other network access I will do from the Linux box and I
> > > > only
> > > > run the VM behind a firewall).
> > > 
> > > > And thus I try the balance dance again:
> > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > > 
> > > "Balancing" is something you should almost never need to do. It is only
> > > for cases of changing geometry (adding disks, switching RAID levels,
> > > etc.) of for cases when you've radically changed allocation behaviors
> > > (like you decided to remove all your VM's or you've decided to remove a
> > > mail spool directory full of thousands of tiny files).
> > > 
> > > People run balance all the time because they think they should. They are
> > > _usually_ incorrect in that belief.
> > 
> > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > device.
>    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> space. What's more, balance does *not* balance the metadata trees. The
> remaining space -- 154.97 GiB -- is unstructured storage for file
> data, and you have some 13 GiB of that available for use.
> 
>    Now, since you're seeing lockups when the space on your disks is
> all allocated I'd say that's a bug. However, you're the *only* person
> who's reported this as a regular occurrence. Does this happen with all
> filesystems you have, or just this one?
> 
> > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > from to *extend* a tree.
> 
>    It's not a tree. It's simply space allocation. It's not even space
> *usage* you're talking about here -- it's just allocation (i.e. the FS
> saying "I'm going to use this piece of disk for this purpose").
> 
> > This may be a bug, but this is what I see.
> > 
> > And no amount of "you should not balance a BTRFS" will make that
> > perception go away.
> > 
> > See, I see the sun coming out on a morning and you tell me "no, it
> > doesn´t". Simply that is not going to match my perception.
> 
>    Duncan's assertion is correct in its detail. Looking at your space

Robert's :)

> usage, I would not suggest that running a balance is something you
> need to do. Now, since you have these lockups that seem quite
> repeatable, there's probably a lurking bug in there, but hacking
> around with balance every time you hit it isn't going to get the
> problem solved properly.
> 
>    I think I would suggest the following:
> 
>  - make sure you have some way of logging your dmesg permanently (use
>    a different filesystem for /var/log, or a serial console, or a
>    netconsole)
> 
>  - when the lockup happens, hit Alt-SysRq-t a few times
> 
>  - send the dmesg output here, or post to bugzilla.kernel.org
> 
>    That's probably going to give enough information to the developers
> to work out where the lockup is happening, and is clearly the way
> forward here.

And I got it reproduced. *Perfectly* reproduced, I´d say.

But let me run the whole story:

1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.

Which gave me:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
        Total devices 2 FS bytes used 144.19GiB
        devid    1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=144.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


2) I ran the Virtualbox machine again and defragmented the NTFS filesystem
in the VDI image file. And: It worked *just* fine. Fine as in *fine*. No issues
whatsoever.


I got this during the run:

ATOP - merkaba                          2014/12/27  12:58:42                          -----------                           10s elapsed
PRC |  sys   10.41s |  user   1.08s |  #proc    357  | #trun      4  |  #tslpi   694 |  #tslpu     0 |  #zombie    0  | no  procacct  |
CPU |  sys     107% |  user     11% |  irq       0%  | idle    259%  |  wait     23% |  guest     0% |  curf 3.01GHz  | curscal  93%  |
cpu |  sys      29% |  user      3% |  irq       0%  | idle     63%  |  cpu002 w  5% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys      27% |  user      3% |  irq       0%  | idle     65%  |  cpu000 w  5% |  guest     0% |  curf 3.03GHz  | curscal  94%  |
cpu |  sys      26% |  user      3% |  irq       0%  | idle     63%  |  cpu003 w  8% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys      24% |  user      2% |  irq       0%  | idle     68%  |  cpu001 w  6% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
CPL |  avg1    1.92 |  avg5    1.01 |  avg15   0.56  |               |  csw   501619 |  intr  129279 |                | numcpu     4  |
MEM |  tot    15.5G |  free  610.1M |  cache   9.1G  | buff    0.1M  |  slab    1.0G |  shmem 183.5M |  vmbal   0.0M  | hptot   0.0M  |
SWP |  tot    12.0G |  free   11.6G |                |               |               |               |  vmcom   7.1G  | vmlim  19.7G  |
PAG |  scan  219141 |  steal 215577 |  stall    936  |               |               |               |  swin       0  | swout    940  |
LVM |     sata-home |  busy     53% |  read  181413  | write      0  |  KiB/w      0 |  MBr/s  70.86 |  MBw/s   0.00  | avio 0.03 ms  |
LVM |     sata-swap |  busy      2% |  read       0  | write    940  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.37  | avio 0.17 ms  |
LVM |   sata-debian |  busy      0% |  read       0  | write      1  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 1.00 ms  |
LVM |  msata-debian |  busy      0% |  read       0  | write      1  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.00 ms  |
DSK |           sda |  busy     53% |  read  181413  | write    477  |  KiB/w      7 |  MBr/s  70.86 |  MBw/s   0.37  | avio 0.03 ms  |
DSK |           sdb |  busy      0% |  read       0  | write      1  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.00 ms  |
NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv     16 |               |  icmpi      0  | icmpo      0  |
NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |

  PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
 9650      -   martin    martin       22   7.89s    0.65s      0K     128K  705.5M   382.1M  --     -  S       2   87%   VirtualBox
 9911      -   root      root          1   0.69s    0.01s      0K       0K      0K       0K  --     -  S       3    7%   watch
 9598      -   root      root          1   0.38s    0.00s      0K       0K      0K      20K  --     -  S       0    4%   kworker/u8:9
 9892      -   root      root          1   0.36s    0.00s      0K       0K      0K       0K  --     -  S       1    4%   kworker/u8:17
 9428      -   root      root          1   0.30s    0.00s      0K       0K      0K       0K  --     -  R       0    3%   kworker/u8:3
 9589      -   root      root          1   0.23s    0.00s      0K       0K      0K       0K  --     -  S       1    2%   kworker/u8:6 
 4746      -   martin    martin        2   0.04s    0.13s      0K     -16K      0K       0K  --     -  R       2    2%   konsole



Every 1,0s: cat /proc/meminfo                                                                                  Sat Dec 27 12:59:23 2014

MemTotal:       16210512 kB
MemFree:          786632 kB
MemAvailable:   10271500 kB
Buffers:              52 kB
Cached:          9564340 kB
SwapCached:        70268 kB
Active:          6847560 kB
Inactive:        5257956 kB
Active(anon):    2016412 kB
Inactive(anon):   703076 kB
Active(file):    4831148 kB
Inactive(file):  4554880 kB
Unevictable:        9068 kB
Mlocked:            9068 kB
SwapTotal:      12582908 kB
SwapFree:       12186680 kB
Dirty:            972324 kB
Writeback:             0 kB
AnonPages:       2526340 kB
Mapped:          2457096 kB
Shmem:            173564 kB
Slab:             918128 kB
SReclaimable:     848816 kB
SUnreclaim:        69312 kB
KernelStack:       11200 kB
PageTables:        64556 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    20688164 kB
Committed_AS:    7438348 kB



I am not seeing more than one GiB of dirty pages here during regular usage, and
it is no problem.

And kworker thread CPU usage is just fine. So no, the dirty_background_ratio
isn't an issue with this 16 GiB ThinkPad T520. Please note: I have been giving Linux
performance analysis and tuning courses for about 7 years now.

I *know* these knobs. I may have used wrong terms regarding BTRFS, and my
understanding of BTRFS space allocation can probably be more accurate, but
I do think that I am onto something here. This is no rotating disk, it can handle
the write burst just fine, and I generally do not tune where there is no need to
tune. Here there isn't. And it wouldn't be much more than fine tuning anyway.

With slow devices or with rsync over NFS, by all means reduce it. But here it
simply isn't an issue, as you can see from the low kworker thread CPU usage
and the general SSD usage above.
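
Just for completeness, the knobs in question are the usual vm sysctls, e.g.:

sysctl vm.dirty_background_ratio vm.dirty_ratio
sysctl vm.dirty_background_bytes vm.dirty_bytes   # byte variants override the ratios when non-zero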


So defragmentation completed just nice, no issue so far.

But I am close to full device space reservation already:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:02:40 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 151.58GiB
        devid    1 size 160.00GiB used 158.01GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 158.01GiB path /dev/mapper/sata-home



I thought I could trigger it again by defragmenting in Windows XP again, but
mind you, it's defragmented already, so that doesn't do much. I did the sdelete
dance just to trigger something, and well, I saw kworker a bit higher, but not
much.

But finally I got to:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 152.83GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



So I thought: if Virtualbox can write randomly into a file, I can too.

So I did:


martin@merkaba:~> cat ssd-test.fio 
[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall



And got:

ATOP - merkaba                          2014/12/27  13:41:02                          -----------                           10s elapsed
PRC |  sys   10.14s |  user   0.38s |  #proc    332  | #trun      2  |  #tslpi   548 |  #tslpu     0 |  #zombie    0  | no  procacct  |
CPU |  sys     102% |  user      4% |  irq       0%  | idle    295%  |  wait      0% |  guest     0% |  curf 3.10GHz  | curscal  96%  |
cpu |  sys      76% |  user      0% |  irq       0%  | idle     24%  |  cpu001 w  0% |  guest     0% |  curf 3.20GHz  | curscal  99%  |
cpu |  sys      24% |  user      1% |  irq       0%  | idle     75%  |  cpu000 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu003 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
CPL |  avg1    0.82 |  avg5    0.78 |  avg15   0.99  |               |  csw     6233 |  intr   12023 |                | numcpu     4  |
MEM |  tot    15.5G |  free    4.0G |  cache   9.7G  | buff    0.0M  |  slab  333.1M |  shmem 206.6M |  vmbal   0.0M  | hptot   0.0M  |
SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
LVM |     sata-home |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
DSK |           sda |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv     16 |               |  icmpi      0  | icmpo      0  |
NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |

  PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
18079      -   martin    martin        2   9.99s    0.00s      0K       0K      0K      16K  --     -  R       1  100%   fio
 4746      -   martin    martin        2   0.01s    0.14s      0K       0K      0K       0K  --     -  S       2    2%   konsole
 3291      -   martin    martin        4   0.01s    0.11s      0K       0K      0K       0K  --     -  S       0    1%   plasma-desktop
 1488      -   root      root          1   0.03s    0.04s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
10036      -   root      root          1   0.04s    0.02s      0K       0K      0K       0K  --     -  R       2    1%   atop

while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
for 10 seconds while allocating a 4 GiB file on a filesystem like:

martin@merkaba:~> LANG=C df -hT /home
Filesystem             Type   Size  Used Avail Use% Mounted on
/dev/mapper/msata-home btrfs  170G  156G   17G  91% /home

where a 4 GiB file should easily fit, no? (And this output is with the 4
GiB file already in place, so there was even 4 GiB more free before.)


But it gets even more visible:

martin@merkaba:~> fio ssd-test.fio
seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 2 processes
Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]       


Yes, that's 0 IOPS.

0 IOPS as in zero IOPS. For minutes.



And here is why:

ATOP - merkaba                          2014/12/27  13:46:52                          -----------                           10s elapsed
PRC |  sys   10.77s |  user   0.31s |  #proc    334  | #trun      2  |  #tslpi   548 |  #tslpu     3 |  #zombie    0  | no  procacct  |
CPU |  sys     108% |  user      3% |  irq       0%  | idle    286%  |  wait      2% |  guest     0% |  curf 3.08GHz  | curscal  96%  |
cpu |  sys      72% |  user      1% |  irq       0%  | idle     28%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys      19% |  user      0% |  irq       0%  | idle     81%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys      11% |  user      1% |  irq       0%  | idle     87%  |  cpu003 w  1% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
cpu |  sys       6% |  user      1% |  irq       0%  | idle     91%  |  cpu002 w  1% |  guest     0% |  curf 3.11GHz  | curscal  97%  |
CPL |  avg1    2.78 |  avg5    1.34 |  avg15   1.12  |               |  csw    50192 |  intr   32379 |                | numcpu     4  |
MEM |  tot    15.5G |  free    5.0G |  cache   8.7G  | buff    0.0M  |  slab  332.6M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
LVM |     sata-home |  busy      5% |  read     160  | write  11177  |  KiB/w      3 |  MBr/s   0.06 |  MBw/s   4.36  | avio 0.05 ms  |
LVM |    msata-home |  busy      4% |  read      28  | write  11177  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   4.36  | avio 0.04 ms  |
LVM |   sata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
LVM |  msata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
DSK |           sda |  busy      5% |  read     160  | write  10200  |  KiB/w      4 |  MBr/s   0.06 |  MBw/s   4.69  | avio 0.05 ms  |
DSK |           sdb |  busy      4% |  read      28  | write  10558  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   4.69  | avio 0.04 ms  |
NET |  transport    |  tcpi      35 |  tcpo      33  | udpi       3  |  udpo       3 |  tcpao      2 |  tcppo      1  | tcprs      0  |
NET |  network      |  ipi       38 |  ipo       36  | ipfrw      0  |  deliv     38 |               |  icmpi      0  | icmpo      0  |
NET |  eth0      0% |  pcki      22 |  pcko      20  | si    9 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |

  PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
14973      -   root      root          1   8.92s    0.00s      0K       0K      0K     144K  --     -  S       0   89%   kworker/u8:14
17450      -   root      root          1   0.86s    0.00s      0K       0K      0K      32K  --     -  R       3    9%   kworker/u8:5
  788      -   root      root          1   0.25s    0.00s      0K       0K    128K   18880K  --     -  S       3    3%   btrfs-transact
12254      -   root      root          1   0.14s    0.00s      0K       0K     64K     576K  --     -  S       2    1%   kworker/u8:3
17332      -   root      root          1   0.11s    0.00s      0K       0K    112K    1348K  --     -  S       2    1%   kworker/u8:4
 3291      -   martin    martin        4   0.01s    0.09s      0K       0K      0K       0K  --     -  S       1    1%   plasma-deskto




ATOP - merkaba                          2014/12/27  13:47:12                          -----------                           10s elapsed
PRC |  sys   10.78s |  user   0.44s |  #proc    334  | #trun      3  |  #tslpi   547 |  #tslpu     3 |  #zombie    0  | no  procacct  |
CPU |  sys     106% |  user      4% |  irq       0%  | idle    288%  |  wait      1% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys      93% |  user      0% |  irq       0%  | idle      7%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys       7% |  user      0% |  irq       0%  | idle     93%  |  cpu003 w  0% |  guest     0% |  curf 3.01GHz  | curscal  94%  |
cpu |  sys       3% |  user      2% |  irq       0%  | idle     94%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
CPL |  avg1    3.33 |  avg5    1.56 |  avg15   1.20  |               |  csw    38253 |  intr   23104 |                | numcpu     4  |
MEM |  tot    15.5G |  free    4.9G |  cache   8.7G  | buff    0.0M  |  slab  336.5M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
LVM |    msata-home |  busy      2% |  read       0  | write   2337  |  KiB/w      3 |  MBr/s   0.00 |  MBw/s   0.91  | avio 0.07 ms  |
LVM |     sata-home |  busy      2% |  read      36  | write   2337  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   0.91  | avio 0.07 ms  |
LVM |  msata-debian |  busy      1% |  read       1  | write   1630  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.65  | avio 0.03 ms  |
LVM |   sata-debian |  busy      0% |  read       0  | write   1019  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.41  | avio 0.02 ms  |
DSK |           sdb |  busy      2% |  read       1  | write   2545  |  KiB/w      5 |  MBr/s   0.00 |  MBw/s   1.45  | avio 0.07 ms  |
DSK |           sda |  busy      1% |  read      36  | write   2461  |  KiB/w      5 |  MBr/s   0.01 |  MBw/s   1.28  | avio 0.06 ms  |
NET |  transport    |  tcpi      20 |  tcpo      20  | udpi       1  |  udpo       1 |  tcpao      1 |  tcppo      1  | tcprs      0  |
NET |  network      |  ipi       21 |  ipo       21  | ipfrw      0  |  deliv     21 |               |  icmpi      0  | icmpo      0  |
NET |  eth0      0% |  pcki       5 |  pcko       5  | si    0 Kbps  |  so    0 Kbps |  erri       0 |  erro       0  | drpo       0  |
NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |

  PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
17450      -   root      root          1   9.96s    0.00s      0K       0K      0K       0K  --     -  R       2  100%   kworker/u8:5
 4746      -   martin    martin        2   0.06s    0.15s      0K       0K      0K       0K  --     -  S       1    2%   konsole
10508      -   root      root          1   0.13s    0.00s      0K       0K     96K    4048K  --     -  S       1    1%   kworker/u8:18
 1488      -   root      root          1   0.06s    0.06s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
17332      -   root      root          1   0.12s    0.00s      0K       0K     96K     580K  --     -  R       3    1%   kworker/u8:4
17454      -   root      root          1   0.11s    0.00s      0K       0K     32K    4416K  --     -  D       1    1%   kworker/u8:6
17516      -   root      root          1   0.09s    0.00s      0K       0K     16K     136K  --     -  S       3    1%   kworker/u8:7
 3268      -   martin    martin        3   0.02s    0.05s      0K       0K      0K       0K  --     -  S       1    1%   kwin
10036      -   root      root          1   0.05s    0.02s      0K       0K      0K       0K  --     -  R       0    1%   atop



So BTRFS is basically busy with itself and nothing else. Look at the SSD
usage. They are *idling* around. Heck, 2400 write accesses in 10 seconds.
That's a joke with SSDs that can do 40000 IOPS (depending on how and what
you measure of course, like request size, read, write, iodepth and so on).

It's kworker/u8:5 utilizing 100% of one core for minutes.
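
For comparison, the kind of fio one-liner one would use to get a raw 4k random
write IOPS figure out of such an SSD (test file path made up, and the numbers of
course depend heavily on iodepth, direct I/O and so on):

fio --name=baseline --filename=/mnt/scratch/fio.test --size=4g \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
    --direct=1 --runtime=60 --time_based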



It's the random write case, it seems. Here are the values from the fio job:

martin@merkaba:~> fio ssd-test.fio
seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 2 processes
Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
  write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
    clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
     lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    9],
     | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 2128],
     | 99.99th=[10304]
    bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
    lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
    lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
  cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Seems fine.


But:

rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
  write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
    clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
     lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
     | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
     | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
     | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
     | 99.99th=[16711680]
    bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
    lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
    lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
  cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec

Run status group 1 (all jobs):
  WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec


What? 254 IOPS? With a Dual SSD BTRFS RAID 1?

What?

Ey, *what*?



Repeating with the random write case.

It's a different kworker now, but similar result:

ATOP - merkaba                          2014/12/27  13:51:48                          -----------                           10s elapsed
PRC |  sys   10.66s |  user   0.25s |  #proc    330  | #trun      2  |  #tslpi   545 |  #tslpu     2 |  #zombie    0  | no  procacct  |
CPU |  sys     105% |  user      3% |  irq       0%  | idle    292%  |  wait      0% |  guest     0% |  curf 3.07GHz  | curscal  95%  |
cpu |  sys      92% |  user      0% |  irq       0%  | idle      8%  |  cpu002 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
cpu |  sys       8% |  user      0% |  irq       0%  | idle     92%  |  cpu003 w  0% |  guest     0% |  curf 3.09GHz  | curscal  96%  |
cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys       2% |  user      1% |  irq       0%  | idle     97%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
CPL |  avg1    1.00 |  avg5    1.32 |  avg15   1.23  |               |  csw    34484 |  intr   23182 |                | numcpu     4  |
MEM |  tot    15.5G |  free    5.4G |  cache   8.3G  | buff    0.0M  |  slab  334.8M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
LVM |     sata-home |  busy      1% |  read      36  | write   2502  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   0.98  | avio 0.06 ms  |
LVM |    msata-home |  busy      1% |  read      48  | write   2502  |  KiB/w      4 |  MBr/s   0.02 |  MBw/s   0.98  | avio 0.04 ms  |
LVM |  msata-debian |  busy      0% |  read       0  | write      6  |  KiB/w      7 |  MBr/s   0.00 |  MBw/s   0.00  | avio 1.33 ms  |
LVM |   sata-debian |  busy      0% |  read       0  | write      6  |  KiB/w      7 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.17 ms  |
DSK |           sda |  busy      1% |  read      36  | write   2494  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   0.98  | avio 0.06 ms  |
DSK |           sdb |  busy      1% |  read      48  | write   2494  |  KiB/w      4 |  MBr/s   0.02 |  MBw/s   0.98  | avio 0.04 ms  |
NET |  transport    |  tcpi      32 |  tcpo      30  | udpi       2  |  udpo       2 |  tcpao      2 |  tcppo      1  | tcprs      0  |
NET |  network      |  ipi       35 |  ipo       32  | ipfrw      0  |  deliv     35 |               |  icmpi      0  | icmpo      0  |
NET |  eth0      0% |  pcki      19 |  pcko      16  | si    9 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |

  PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
11746      -   root      root          1  10.00s    0.00s      0K       0K      0K       0K  --     -  R       2  100%   kworker/u8:0
12254      -   root      root          1   0.16s    0.00s      0K       0K    112K    1712K  --     -  S       3    2%   kworker/u8:3  
17517      -   root      root          1   0.16s    0.00s      0K       0K    144K    1764K  --     -  S       1    2%   kworker/u8:8



And now the graphical environment is locked. Continuing on TTY1.

Doing another fio job with tee so I can capture the output easily.

Wow! I wonder whether this is reproducible with a fresh BTRFS with fio stressing it.

Like a 10 GiB BTRFS with a 5 GiB fio test file and just letting it run, as sketched below.
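
Untested sketch of that, with made-up paths, reusing the fio job from above with
size=5g in it:

truncate -s 10G /tmp/btrfs-test.img
loopdev=$(losetup -f --show /tmp/btrfs-test.img)
mkfs.btrfs "$loopdev"
mkdir -p /mnt/btrfs-test
mount "$loopdev" /mnt/btrfs-test
cd /mnt/btrfs-test && fio ~/ssd-test.fio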


Okay, I let the final fio job complete and include the output here.


Okay, and there we are, and I do have the sysrq-t output.

Okay, this is 1.2 MiB xz packed. So I better start a bug report about this
and attach it there. I dislike cloud URLs that may disappear at some point.



Now please finally acknowledge that there is an issue. Maybe I was not
using the correct terms at the beginning, but there is a real issue. I have
been doing performance work for half a decade at least; I know an issue
when I see one.




There we go:

Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 13:16           ` Martin Steigerwald
  2014-12-27 13:49             ` Robert White
@ 2014-12-27 14:00             ` Robert White
  2014-12-27 14:14               ` Martin Steigerwald
  2014-12-27 14:19               ` Robert White
  1 sibling, 2 replies; 59+ messages in thread
From: Robert White @ 2014-12-27 14:00 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Hugo Mills, linux-btrfs

On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> It can easily be reproduced without even using Virtualbox, just by a nice
> simple fio job.
>

TL;DR: If you want a worst-case example of consuming a BTRFS filesystem 
with one single file...

#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250;counter>0;counter--)); do
  dd if=/dev/urandom of=/some/file bs=4k count=$counter
done
exit


Each pass over /some/file is 4k shorter than the previous one, but none 
of the extents can be deallocated. The file will be about 1MiB in size and the 
usage will be something like 122.6MiB (if I've done the math correctly). 
Larger values of counter will result in quadratically larger amounts of 
waste.

Doing the bad things is very bad...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 13:49             ` Robert White
@ 2014-12-27 14:06               ` Martin Steigerwald
  0 siblings, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 14:06 UTC (permalink / raw)
  To: Robert White; +Cc: Hugo Mills, linux-btrfs

Am Samstag, 27. Dezember 2014, 05:49:48 schrieb Robert White:
> > Anyway, I got it reproduced. And am about to write a lengthy mail about.
> 
> Have fun with that lengthy email, but the devs already know about the 
> data waste profile of the system. They just don't have a good solution yet.
> 
> Practical use cases involving _not_ defragging and _not_ packing files, 
> or disabling COW and using raw image formats for VM disk storage are, 
> meanwhile, also well understood.

Okay, then how about a database?

BTRFS is not usable for these kinds of workloads then.

And that's about it.

Not even on SSD.

Yet, what I have shown in my lengthy mail is pathological.

It's even abysmal.

And yet it only happens when BTRFS is forced to pack things into *existing* 
chunks. It does not happen when BTRFS can still reserve new chunks and write 
to them.

And this makes all the talk that you should not need to rebalance obsolete 
when in practice you need to in order to get decent performance. To get out of your 
SSDs what your SSDs can provide, instead of waiting for BTRFS to finish being 
busy with itself.

Still, I have so far only reproduced it on this /home filesystem. If it is also 
reproducible on a freshly created filesystem after some runs of the fio job I 
provided, I'd say that there is a performance bug in BTRFS. And that's it.

No talking about technicalities will turn this performance bug observation away. 
Heck, 254 IOPS from a Dual SSD RAID 1? Are you even kidding me?

I refuse to believe that this is built into the design, no matter how much you 
outline its limitations.

And if it is?

Well… then maybe BTRFS won't save us. Unless you give it a ton of extra free 
space. Unless you do as I recommend: if you use 25 GB, make it 100 GB 
big, so it will always find enough space to waste.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 14:00             ` Robert White
@ 2014-12-27 14:14               ` Martin Steigerwald
  2014-12-27 14:21                 ` Martin Steigerwald
  2014-12-27 14:19               ` Robert White
  1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 14:14 UTC (permalink / raw)
  To: Robert White; +Cc: Hugo Mills, linux-btrfs

Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White:
> On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> > It can easily be reproduced without even using Virtualbox, just by a nice
> > simple fio job.
> 
> TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
> with one single file...
> 
> #!/bin/bash
> # not tested, so correct any syntax errors
> typeset -i counter
> for ((counter=250;counter>0;counter--)); do
>   dd if=/dev/urandom of=/some/file bs=4k count=$counter
> done
> exit
> 
> 
> Each pass over /some/file is 4k shorter than the previous one, but none
> of the extents can be deallocated. File will be 1MiB in size and usage
> will be something like 125.5MiB (if I've done the math correctly).
> larger values of counter will result in exponentially larger amounts of
> waste.

Robert, I experienced these hang issues even before the defragmenting case. It 
happened while I just installed a 400 MiB tax return application into it (that is 
no joke, it is that big).

It happens while just using the VM.

Yes, I recommend not to use BTRFS for any VM image or any larger database on 
rotating storage, exactly because of these COW semantics.

But on SSD?

It's busy-looping a CPU core while the flash is basically idling.

I refuse to believe that this is by design.

I do think there is a *bug*.

Either acknowledge it and try to fix it, or say it's by design *without even 
looking at it closely enough to be sure that it is not a bug* and limit your 
own possibilities by it.

I´d rather see it treated as a bug for now.

Come on, 254 IOPS on a filesystem with still 17 GiB of free space while 
randomly writing to a 4 GiB file.

People do these kinds of things. Ditch that defrag-the-Windows-XP-VM case; I
had performance issues even before, just by installing things into it.
Databases, VMs, emulators. And heck, even while just *creating* the file with
fio, as I showed.
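
P.S.: The usual workaround I am aware of for VM images and databases on BTRFS
– I have not benchmarked it on this setup – is to mark them nodatacow, for
example by setting the attribute on a (here hypothetical) directory so that
new files created in it inherit it:

mkdir -p ~/VMs      # hypothetical example directory
chattr +C ~/VMs     # new files created in this directory inherit No_COW
lsattr -d ~/VMs     # should now show the 'C' attribute

That disables COW for those files, and with it checksumming and compression.
A workaround, not an excuse for the behaviour above.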

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 14:00             ` Robert White
  2014-12-27 14:14               ` Martin Steigerwald
@ 2014-12-27 14:19               ` Robert White
  1 sibling, 0 replies; 59+ messages in thread
From: Robert White @ 2014-12-27 14:19 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Hugo Mills, linux-btrfs

On 12/27/2014 06:00 AM, Robert White wrote:
> On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
>> It can easily be reproduced without even using Virtualbox, just by a nice
>> simple fio job.
>>
>
> TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
> with one single file...
>
> #!/bin/bash
> # not tested, so correct any syntax errors
> typeset -i counter
> for ((counter=250;counter>0;counter--)); do
>   dd if=/dev/urandom of=/some/file bs=4k count=$counter
> done
> exit 0

Slight correction: you need to prevent the truncate dd performs by
default, and flush the data and metadata to disk after each
invocation. So you need the "conv=" flags.

for ((counter=250;counter>0;counter--)); do
dd if=/dev/urandom of=some_file conv=notrunc,fsync bs=4k count=$counter
done
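
For reference, the worst-case allocation of that loop is easy to estimate
(a back-of-the-envelope sketch, assuming every pass pins a fresh set of
extents):

echo $(( 4 * 250 * 251 / 2 )) KiB    # 4 KiB * (1 + 2 + ... + 250) = 125500 KiB, roughly 122.6 MiB

So the waste grows with the square of the starting counter.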



>
>
> Each pass over /some/file is 4k shorter than the previous one, but none
> of the extents can be deallocated. File will be 1MiB in size and usage
> will be something like 125.5MiB (if I've done the math correctly).
> larger values of counter will result in exponentially larger amounts of
> waste.
>
> Doing the bad things is very bad...
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 14:14               ` Martin Steigerwald
@ 2014-12-27 14:21                 ` Martin Steigerwald
  2014-12-27 15:14                   ` Robert White
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 14:21 UTC (permalink / raw)
  To: Robert White; +Cc: Hugo Mills, linux-btrfs

Am Samstag, 27. Dezember 2014, 15:14:05 schrieb Martin Steigerwald:
> Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White:
> > On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> > > It can easily be reproduced without even using Virtualbox, just by a
> > > nice
> > > simple fio job.
> > 
> > TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
> > with one single file...
> > 
> > #!/bin/bash
> > # not tested, so correct any syntax errors
> > typeset -i counter
> > for ((counter=250;counter>0;counter--)); do
> > 
> >   dd if=/dev/urandom of=/some/file bs=4k count=$counter
> > 
> > done
> > exit
> > 
> > 
> > Each pass over /some/file is 4k shorter than the previous one, but none
> > of the extents can be deallocated. File will be 1MiB in size and usage
> > will be something like 125.5MiB (if I've done the math correctly).
> > larger values of counter will result in exponentially larger amounts of
> > waste.
> 
> Robert, I experienced this hang issues even before the defragmenting case.
> It happened while just installed a 400 MiB tax returns application to it
> (that is no joke, it is that big).
> 
> It happens while just using the VM.
> 
> Yes, I recommend not to use BTRFS for any VM image or any larger database on
> rotating storage for exactly that COW semantics.
> 
> But on SSD?
> 
> Its busy looping a CPU core and while the flash is basically idling.
> 
> I refuse to believe that this is by design.
> 
> I do think there is a *bug*.
> 
> Either acknowledge it and try to fix it, or say its by design *without even
> looking at it closely enough to be sure that it is not a bug* and limit your
> own possibilities by it.
> 
> I´d rather see it treated as a bug for now.
> 
> Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
> randomly writing to a 4 GiB file.
> 
> People do these kind of things. Ditch that defrag Windows XP VM case, I had
> performance issue even before by just installing things to it. Databases,
> VMs, emulators. And heck even while just *creating* the file with fio as I
> shown.

Add to these use cases things like this:

martin@merkaba:~/.local/share/akonadi/db_data/akonadi> ls -lSh | head -5
insgesamt 2,2G
-rw-rw---- 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
-rw-rw---- 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
-rw-rw---- 1 martin martin  23M Dez 27 15:17 pimitemflagrelation.ibd
-rw-rw---- 1 martin martin 240K Dez 27 15:17 collectiontable.ibd


Or this:

martin@merkaba:~/.local/share/baloo> du -sch * | sort -rh
9,2G    insgesamt
8,0G    email
1,2G    file
51M     emailContacts
408K    contacts
76K     notes
16K     calendars

martin@merkaba:~/.local/share/baloo> ls -lSh email | head -5
insgesamt 8,0G
-rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
-rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
-rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
-rw-r--r-- 1 martin martin  63K Dez 27 15:16 postlist.baseA



These will not be as bad as the fio test case, but still these files are
written into. They are updated in place.

And that's running on every Plasma desktop by default. And on GNOME desktops
there is similar stuff.

I haven´t seen this peg a kworker yet, though, so maybe the workload is
light enough not to trigger it that easily.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 13:55       ` Martin Steigerwald
@ 2014-12-27 14:54         ` Robert White
  2014-12-27 16:26           ` Hugo Mills
  2014-12-28 13:00         ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
  1 sibling, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-27 14:54 UTC (permalink / raw)
  To: Martin Steigerwald, Hugo Mills; +Cc: linux-btrfs

On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
> Summarized at
>
> Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> see below. This is reproducible with fio, no need for Windows XP in
> Virtualbox for reproducing the issue. Next I will try to reproduce with
> a freshly created filesystem.
>
>
> Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
>> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
>>> Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
>>>> On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
>>>>> Hello!
>>>>>
>>>>> First: Have a merry christmas and enjoy a quiet time in these days.
>>>>>
>>>>> Second: At a time you feel like it, here is a little rant, but also a
>>>>> bug
>>>>> report:
>>>>>
>>>>> I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
>>>>> space_cache, skinny meta data extents – are these a problem? – and
>>>>
>>>>> compress=lzo:
>>>> (there is no known problem with skinny metadata, it's actually more
>>>> efficient than the older format. There has been some anecdotes about
>>>> mixing the skinny and fat metadata but nothing has ever been
>>>> demonstrated problematic.)
>>>>
>>>>> merkaba:~> btrfs fi sh /home
>>>>> Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
>>>>>
>>>>>           Total devices 2 FS bytes used 144.41GiB
>>>>>           devid    1 size 160.00GiB used 160.00GiB path
>>>>>           /dev/mapper/msata-home
>>>>>           devid    2 size 160.00GiB used 160.00GiB path
>>>>>           /dev/mapper/sata-home
>>>>>
>>>>> Btrfs v3.17
>>>>> merkaba:~> btrfs fi df /home
>>>>> Data, RAID1: total=154.97GiB, used=141.12GiB
>>>>> System, RAID1: total=32.00MiB, used=48.00KiB
>>>>> Metadata, RAID1: total=5.00GiB, used=3.29GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>
>>>> This filesystem, at the allocation level, is "very full" (see below).
>>>>
>>>>> And I had hangs with BTRFS again. This time as I wanted to install tax
>>>>> return software in Virtualbox´d Windows XP VM (which I use once a year
>>>>> cause I know no tax return software for Linux which would be suitable
>>>>> for
>>>>> Germany and I frankly don´t care about the end of security cause all
>>>>> surfing and other network access I will do from the Linux box and I
>>>>> only
>>>>> run the VM behind a firewall).
>>>>
>>>>> And thus I try the balance dance again:
>>>> ITEM: Balance... it doesn't do what you think it does... 8-)
>>>>
>>>> "Balancing" is something you should almost never need to do. It is only
>>>> for cases of changing geometry (adding disks, switching RAID levels,
>>>> etc.) of for cases when you've radically changed allocation behaviors
>>>> (like you decided to remove all your VM's or you've decided to remove a
>>>> mail spool directory full of thousands of tiny files).
>>>>
>>>> People run balance all the time because they think they should. They are
>>>> _usually_ incorrect in that belief.
>>>
>>> I only see the lockups of BTRFS is the trees *occupy* all space on the
>>> device.
>>     No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
>> space. What's more, balance does *not* balance the metadata trees. The
>> remaining space -- 154.97 GiB -- is unstructured storage for file
>> data, and you have some 13 GiB of that available for use.
>>
>>     Now, since you're seeing lockups when the space on your disks is
>> all allocated I'd say that's a bug. However, you're the *only* person
>> who's reported this as a regular occurrence. Does this happen with all
>> filesystems you have, or just this one?
>>
>>> I *never* so far saw it lockup if there is still space BTRFS can allocate
>>> from to *extend* a tree.
>>
>>     It's not a tree. It's simply space allocation. It's not even space
>> *usage* you're talking about here -- it's just allocation (i.e. the FS
>> saying "I'm going to use this piece of disk for this purpose").
>>
>>> This may be a bug, but this is what I see.
>>>
>>> And no amount of "you should not balance a BTRFS" will make that
>>> perception go away.
>>>
>>> See, I see the sun coming out on a morning and you tell me "no, it
>>> doesn´t". Simply that is not going to match my perception.
>>
>>     Duncan's assertion is correct in its detail. Looking at your space
>
> Robert's :)
>
>> usage, I would not suggest that running a balance is something you
>> need to do. Now, since you have these lockups that seem quite
>> repeatable, there's probably a lurking bug in there, but hacking
>> around with balance every time you hit it isn't going to get the
>> problem solved properly.
>>
>>     I think I would suggest the following:
>>
>>   - make sure you have some way of logging your dmesg permanently (use
>>     a different filesystem for /var/log, or a serial console, or a
>>     netconsole)
>>
>>   - when the lockup happens, hit Alt-SysRq-t a few times
>>
>>   - send the dmesg output here, or post to bugzilla.kernel.org
>>
>>     That's probably going to give enough information to the developers
>> to work out where the lockup is happening, and is clearly the way
>> forward here.
>
> And I got it reproduced. *Perfectly* reproduced, I´d say.
>
> But let me run the whole story:
>
> 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
>
> Which gave me:
>
> merkaba:~> btrfs fi sh /home
> Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
>          Total devices 2 FS bytes used 144.19GiB
>          devid    1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home
>          devid    2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home
>
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=144.98GiB, used=140.95GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.24GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
> 2) I run the Virtualbox machine again and defragmented the NTFS filesystem
> in the VDI image file. And: It worked *just* fine. Fine as in *fine*. No issues
> whatsoever.
>
>
> I got this during the run:
>
> ATOP - merkaba                          2014/12/27  12:58:42                          -----------                           10s elapsed
> PRC |  sys   10.41s |  user   1.08s |  #proc    357  | #trun      4  |  #tslpi   694 |  #tslpu     0 |  #zombie    0  | no  procacct  |
> CPU |  sys     107% |  user     11% |  irq       0%  | idle    259%  |  wait     23% |  guest     0% |  curf 3.01GHz  | curscal  93%  |
> cpu |  sys      29% |  user      3% |  irq       0%  | idle     63%  |  cpu002 w  5% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys      27% |  user      3% |  irq       0%  | idle     65%  |  cpu000 w  5% |  guest     0% |  curf 3.03GHz  | curscal  94%  |
> cpu |  sys      26% |  user      3% |  irq       0%  | idle     63%  |  cpu003 w  8% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys      24% |  user      2% |  irq       0%  | idle     68%  |  cpu001 w  6% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> CPL |  avg1    1.92 |  avg5    1.01 |  avg15   0.56  |               |  csw   501619 |  intr  129279 |                | numcpu     4  |
> MEM |  tot    15.5G |  free  610.1M |  cache   9.1G  | buff    0.1M  |  slab    1.0G |  shmem 183.5M |  vmbal   0.0M  | hptot   0.0M  |
> SWP |  tot    12.0G |  free   11.6G |                |               |               |               |  vmcom   7.1G  | vmlim  19.7G  |
> PAG |  scan  219141 |  steal 215577 |  stall    936  |               |               |               |  swin       0  | swout    940  |
> LVM |     sata-home |  busy     53% |  read  181413  | write      0  |  KiB/w      0 |  MBr/s  70.86 |  MBw/s   0.00  | avio 0.03 ms  |
> LVM |     sata-swap |  busy      2% |  read       0  | write    940  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.37  | avio 0.17 ms  |
> LVM |   sata-debian |  busy      0% |  read       0  | write      1  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 1.00 ms  |
> LVM |  msata-debian |  busy      0% |  read       0  | write      1  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.00 ms  |
> DSK |           sda |  busy     53% |  read  181413  | write    477  |  KiB/w      7 |  MBr/s  70.86 |  MBw/s   0.37  | avio 0.03 ms  |
> DSK |           sdb |  busy      0% |  read       0  | write      1  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.00 ms  |
> NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv     16 |               |  icmpi      0  | icmpo      0  |
> NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
>
>    PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
>   9650      -   martin    martin       22   7.89s    0.65s      0K     128K  705.5M   382.1M  --     -  S       2   87%   VirtualBox
>   9911      -   root      root          1   0.69s    0.01s      0K       0K      0K       0K  --     -  S       3    7%   watch
>   9598      -   root      root          1   0.38s    0.00s      0K       0K      0K      20K  --     -  S       0    4%   kworker/u8:9
>   9892      -   root      root          1   0.36s    0.00s      0K       0K      0K       0K  --     -  S       1    4%   kworker/u8:17
>   9428      -   root      root          1   0.30s    0.00s      0K       0K      0K       0K  --     -  R       0    3%   kworker/u8:3
>   9589      -   root      root          1   0.23s    0.00s      0K       0K      0K       0K  --     -  S       1    2%   kworker/u8:6
>   4746      -   martin    martin        2   0.04s    0.13s      0K     -16K      0K       0K  --     -  R       2    2%   konsole
>
>
>
> Every 1,0s: cat /proc/meminfo                                                                                  Sat Dec 27 12:59:23 2014
>
> MemTotal:       16210512 kB
> MemFree:          786632 kB
> MemAvailable:   10271500 kB
> Buffers:              52 kB
> Cached:          9564340 kB
> SwapCached:        70268 kB
> Active:          6847560 kB
> Inactive:        5257956 kB
> Active(anon):    2016412 kB
> Inactive(anon):   703076 kB
> Active(file):    4831148 kB
> Inactive(file):  4554880 kB
> Unevictable:        9068 kB
> Mlocked:            9068 kB
> SwapTotal:      12582908 kB
> SwapFree:       12186680 kB
> Dirty:            972324 kB
> Writeback:             0 kB
> AnonPages:       2526340 kB
> Mapped:          2457096 kB
> Shmem:            173564 kB
> Slab:             918128 kB
> SReclaimable:     848816 kB
> SUnreclaim:        69312 kB
> KernelStack:       11200 kB
> PageTables:        64556 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    20688164 kB
> Committed_AS:    7438348 kB
>
>
>
> I am not seeing more than one GiB of dirty here during regular usage and
> it is no problem.
>
> And kworker thread CPU usage just fine. So no, the dirty_background_ratio
> isn´t an issue with this 16 GiB ThinkPad T520. Please note: I do Linux
> performance analysis and tuning courses for about 7 years or so meanwhile.
>
> I *know* these knobs. I may have used wrong terms regarding BTRFS, and my
> understanding of BTRFS space allocation probably can be more accurate, but
> I do think that I am onto something here. This is no rotating disk, it can handle
> the write burst just fine and I generally do not tune where there is no need to
> tune. Here there isn´t. And it wouldn´t be much more than a fine tuning.
>
> With slow devices or with rsync over NFS by all means reduce it. But here it
> simply isn´t an issue as you can see with the low kworker thread CPU usage
> and the general SSD usage above.
>
>
> So defragmentation completed just nice, no issue so far.
>
> But I am close to full device space reservation already:
>
> merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> Sa 27. Dez 13:02:40 CET 2014
> Label: 'home'  uuid: [some UUID]
>          Total devices 2 FS bytes used 151.58GiB
>          devid    1 size 160.00GiB used 158.01GiB path /dev/mapper/msata-home
>          devid    2 size 160.00GiB used 158.01GiB path /dev/mapper/sata-home
>
>
>
> I thought I can trigger it again by defragmenting in Windows XP again, but
> mind you, its defragmented already so it doesn´t to much. I did the sdelete
> dance just to trigger something and well I saw kworker a bit higher, but not
> much.
>
> But finally I got to:
>
>
>
>
> So I thought: if Virtualbox can write randomly into a file, I can too.
>
> So I did:
>
>
> martin@merkaba:~> cat ssd-test.fio
> [global]
> bs=4k
> #ioengine=libaio
> #iodepth=4
> size=4g
> #direct=1
> runtime=120
> filename=ssd.test.file
>
> [seq-write]
> rw=write
> stonewall
>
> [rand-write]
> rw=randwrite
> stonewall
>
>
>
> And got:
>
> ATOP - merkaba                          2014/12/27  13:41:02                          -----------                           10s elapsed
> PRC |  sys   10.14s |  user   0.38s |  #proc    332  | #trun      2  |  #tslpi   548 |  #tslpu     0 |  #zombie    0  | no  procacct  |
> CPU |  sys     102% |  user      4% |  irq       0%  | idle    295%  |  wait      0% |  guest     0% |  curf 3.10GHz  | curscal  96%  |
> cpu |  sys      76% |  user      0% |  irq       0%  | idle     24%  |  cpu001 w  0% |  guest     0% |  curf 3.20GHz  | curscal  99%  |
> cpu |  sys      24% |  user      1% |  irq       0%  | idle     75%  |  cpu000 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu003 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> CPL |  avg1    0.82 |  avg5    0.78 |  avg15   0.99  |               |  csw     6233 |  intr   12023 |                | numcpu     4  |
> MEM |  tot    15.5G |  free    4.0G |  cache   9.7G  | buff    0.0M  |  slab  333.1M |  shmem 206.6M |  vmbal   0.0M  | hptot   0.0M  |
> SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> LVM |     sata-home |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> DSK |           sda |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv     16 |               |  icmpi      0  | icmpo      0  |
> NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
>
>    PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
> 18079      -   martin    martin        2   9.99s    0.00s      0K       0K      0K      16K  --     -  R       1  100%   fio
>   4746      -   martin    martin        2   0.01s    0.14s      0K       0K      0K       0K  --     -  S       2    2%   konsole
>   3291      -   martin    martin        4   0.01s    0.11s      0K       0K      0K       0K  --     -  S       0    1%   plasma-desktop
>   1488      -   root      root          1   0.03s    0.04s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> 10036      -   root      root          1   0.04s    0.02s      0K       0K      0K       0K  --     -  R       2    1%   atop
>
> while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
> for 10 seconds while allocating a 4 GiB file on a filesystem like:
>
> martin@merkaba:~> LANG=C df -hT /home
> Filesystem             Type   Size  Used Avail Use% Mounted on
> /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
>
> where a 4 GiB file should easily fit, no? (And this output is with the 4
> GiB file. So it was even 4 GiB more free before.)

No. /usr/bin/df is an _approximation_ in BTRFS because of the limits of
the statfs() system call it relies on. That interface dates from around 1990
and "can't understand" the dynamic allocation model used in BTRFS, as it
assumes fixed geometry for filesystems. You do _not_ have 17G actually
available. You need to rely on btrfs fi df and btrfs fi show to figure
out how much space you _really_ have.
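
You can see the three different views side by side (just an illustration,
using the same tools you already have):

stat -f /home          # the raw statfs(2) numbers that plain df works from
btrfs fi show /home    # how much raw space is allocated on each device
btrfs fi df /home      # how those allocated chunks are actually used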

According to this block you have a RAID1 of ~ 160GB expanse (two 160G disks)

 > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
 > Sa 27. Dez 13:26:39 CET 2014
 > Label: 'home'  uuid: [some UUID]
 >          Total devices 2 FS bytes used 152.83GiB
 >          devid    1 size 160.00GiB used 160.00GiB path 
/dev/mapper/msata-home
 >          devid    2 size 160.00GiB used 160.00GiB path 
/dev/mapper/sata-home

And according to this block you have about 5.39GiB of unused data space:

 > Btrfs v3.17
 > Data, RAID1: total=154.97GiB, used=149.58GiB
 > System, RAID1: total=32.00MiB, used=48.00KiB
 > Metadata, RAID1: total=5.00GiB, used=3.26GiB
 > GlobalReserve, single: total=512.00MiB, used=0.00B

154.97
   5.00
   0.032
+ 0.512

Pretty much as close to 160GiB as you are going to get (those numbers
being rounded up in places for "human readability"). BTRFS has allocated
100% of the raw storage into typed chunks.

A large data file can only fit into the 154.97 - 149.58 = 5.39 GiB of unused data space.

Trying to allocate that 4GiB file into that 5.39GiB of space becomes an
NP-complete (i.e. "very hard") problem if that space is very fragmented.

I also don't know what kind of tool you are using, but it might be 
repeatedly trying and failing to fallocate the file as a single extent 
or something equally dumb.

If the tool that takes those .fio files "isn't smart" about transient 
allocation failures it might be trying the same allocation again, and 
again, and again, and again.... forever... which is not a problem with 
BTRFS but which _would_ lead to runaway CPU usage with no actual disk 
activity.

So try again with more normal tools and see if you can allocate 4GiB.

dd if=/dev/urandom of=file bs=1M count=4096

Does that create a four-gig file? Probably works fine.

You need to isolate not "overall cpu usage" but _what_ program is doing
what and why. So strace your fio program or whatever it is to see what
function call(s) it is making and what is being returned.
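
Something like this would be a start (just a sketch; the output file name is
arbitrary and <fio-pid> is whatever PID fio shows up as):

strace -f -c -o fio-syscalls.txt fio ssd-test.fio    # per-syscall counts, times and error returns
strace -f -e trace=desc -p <fio-pid>                 # or attach and watch the file-descriptor syscalls live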

But seriously dude, if the DD works and the fio doesn't, then that's a 
problem with fio.

(I've got _zero_ idea what fio is, but if it does "testing" and
repeatedly writes random bits of the file, then, since you've only got 5.39G
of space, it's likely going to have a lot of problems doing _anything_
"intensive" to a COW file of 4G.)

So yea, that simultaneous write/rewrite test is going to fail. You don't 
have enough room to permute that file.

None of the results below "surprise me" given that you _don't_ have 
enough room to do the tests you (seem to have) initiated on a COW file. 
Minimum likely needed space is just under 8GiB. Maximum could be much, 
much larger.

>
>
> But it gets even more visible:
>
> martin@merkaba:~> fio ssd-test.fio
> seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.11
> Starting 2 processes
> Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
> 0$ zsh  1$ zsh  2$ zsh  3-$ zsh  4$ zsh  5$* zsh
>
>
> yes, thats 0 IOPS.
>
> 0 IOPS and in zero IOPS. For minutes.
>
>
>
> And here is why:
>
> ATOP - merkaba                          2014/12/27  13:46:52                          -----------                           10s elapsed
> PRC |  sys   10.77s |  user   0.31s |  #proc    334  | #trun      2  |  #tslpi   548 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> CPU |  sys     108% |  user      3% |  irq       0%  | idle    286%  |  wait      2% |  guest     0% |  curf 3.08GHz  | curscal  96%  |
> cpu |  sys      72% |  user      1% |  irq       0%  | idle     28%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys      19% |  user      0% |  irq       0%  | idle     81%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys      11% |  user      1% |  irq       0%  | idle     87%  |  cpu003 w  1% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> cpu |  sys       6% |  user      1% |  irq       0%  | idle     91%  |  cpu002 w  1% |  guest     0% |  curf 3.11GHz  | curscal  97%  |
> CPL |  avg1    2.78 |  avg5    1.34 |  avg15   1.12  |               |  csw    50192 |  intr   32379 |                | numcpu     4  |
> MEM |  tot    15.5G |  free    5.0G |  cache   8.7G  | buff    0.0M  |  slab  332.6M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> LVM |     sata-home |  busy      5% |  read     160  | write  11177  |  KiB/w      3 |  MBr/s   0.06 |  MBw/s   4.36  | avio 0.05 ms  |
> LVM |    msata-home |  busy      4% |  read      28  | write  11177  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   4.36  | avio 0.04 ms  |
> LVM |   sata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> LVM |  msata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> DSK |           sda |  busy      5% |  read     160  | write  10200  |  KiB/w      4 |  MBr/s   0.06 |  MBw/s   4.69  | avio 0.05 ms  |
> DSK |           sdb |  busy      4% |  read      28  | write  10558  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   4.69  | avio 0.04 ms  |
> NET |  transport    |  tcpi      35 |  tcpo      33  | udpi       3  |  udpo       3 |  tcpao      2 |  tcppo      1  | tcprs      0  |
> NET |  network      |  ipi       38 |  ipo       36  | ipfrw      0  |  deliv     38 |               |  icmpi      0  | icmpo      0  |
> NET |  eth0      0% |  pcki      22 |  pcko      20  | si    9 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
>
>    PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> 14973      -   root      root          1   8.92s    0.00s      0K       0K      0K     144K  --     -  S       0   89%   kworker/u8:14
> 17450      -   root      root          1   0.86s    0.00s      0K       0K      0K      32K  --     -  R       3    9%   kworker/u8:5
>    788      -   root      root          1   0.25s    0.00s      0K       0K    128K   18880K  --     -  S       3    3%   btrfs-transact
> 12254      -   root      root          1   0.14s    0.00s      0K       0K     64K     576K  --     -  S       2    1%   kworker/u8:3
> 17332      -   root      root          1   0.11s    0.00s      0K       0K    112K    1348K  --     -  S       2    1%   kworker/u8:4
>   3291      -   martin    martin        4   0.01s    0.09s      0K       0K      0K       0K  --     -  S       1    1%   plasma-deskto
>
>
>
>
> ATOP - merkaba                          2014/12/27  13:47:12                          -----------                           10s elapsed
> PRC |  sys   10.78s |  user   0.44s |  #proc    334  | #trun      3  |  #tslpi   547 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> CPU |  sys     106% |  user      4% |  irq       0%  | idle    288%  |  wait      1% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys      93% |  user      0% |  irq       0%  | idle      7%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys       7% |  user      0% |  irq       0%  | idle     93%  |  cpu003 w  0% |  guest     0% |  curf 3.01GHz  | curscal  94%  |
> cpu |  sys       3% |  user      2% |  irq       0%  | idle     94%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> CPL |  avg1    3.33 |  avg5    1.56 |  avg15   1.20  |               |  csw    38253 |  intr   23104 |                | numcpu     4  |
> MEM |  tot    15.5G |  free    4.9G |  cache   8.7G  | buff    0.0M  |  slab  336.5M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> LVM |    msata-home |  busy      2% |  read       0  | write   2337  |  KiB/w      3 |  MBr/s   0.00 |  MBw/s   0.91  | avio 0.07 ms  |
> LVM |     sata-home |  busy      2% |  read      36  | write   2337  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   0.91  | avio 0.07 ms  |
> LVM |  msata-debian |  busy      1% |  read       1  | write   1630  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.65  | avio 0.03 ms  |
> LVM |   sata-debian |  busy      0% |  read       0  | write   1019  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.41  | avio 0.02 ms  |
> DSK |           sdb |  busy      2% |  read       1  | write   2545  |  KiB/w      5 |  MBr/s   0.00 |  MBw/s   1.45  | avio 0.07 ms  |
> DSK |           sda |  busy      1% |  read      36  | write   2461  |  KiB/w      5 |  MBr/s   0.01 |  MBw/s   1.28  | avio 0.06 ms  |
> NET |  transport    |  tcpi      20 |  tcpo      20  | udpi       1  |  udpo       1 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> NET |  network      |  ipi       21 |  ipo       21  | ipfrw      0  |  deliv     21 |               |  icmpi      0  | icmpo      0  |
> NET |  eth0      0% |  pcki       5 |  pcko       5  | si    0 Kbps  |  so    0 Kbps |  erri       0 |  erro       0  | drpo       0  |
> NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
>
>    PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> 17450      -   root      root          1   9.96s    0.00s      0K       0K      0K       0K  --     -  R       2  100%   kworker/u8:5
>   4746      -   martin    martin        2   0.06s    0.15s      0K       0K      0K       0K  --     -  S       1    2%   konsole
> 10508      -   root      root          1   0.13s    0.00s      0K       0K     96K    4048K  --     -  S       1    1%   kworker/u8:18
>   1488      -   root      root          1   0.06s    0.06s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> 17332      -   root      root          1   0.12s    0.00s      0K       0K     96K     580K  --     -  R       3    1%   kworker/u8:4
> 17454      -   root      root          1   0.11s    0.00s      0K       0K     32K    4416K  --     -  D       1    1%   kworker/u8:6
> 17516      -   root      root          1   0.09s    0.00s      0K       0K     16K     136K  --     -  S       3    1%   kworker/u8:7
>   3268      -   martin    martin        3   0.02s    0.05s      0K       0K      0K       0K  --     -  S       1    1%   kwin
> 10036      -   root      root          1   0.05s    0.02s      0K       0K      0K       0K  --     -  R       0    1%   atop
>
>
>
> So BTRFS is basically busy with itself and nothing else. Look at the SSD
> usage. They are *idling* around. Heck 2400 write accesses in 10 seconds.
> Thats a joke with SSDs that can do 40000 IOPS (depending on how and what
> you measure of course, like request size, read, write, iodepth and so).
>
> Its kworker/u8:5 utilizing 100% of one core for minutes.
>
>
>
> Its the random write case it seems. Here are values from fio job:
>
> martin@merkaba:~> fio ssd-test.fio
> seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.11
> Starting 2 processes
> Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
>    write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
>      clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
>       lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
>      clat percentiles (usec):
>       |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
>       | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
>       | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    9],
>       | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 2128],
>       | 99.99th=[10304]
>      bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
>      lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
>      lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
>      lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
>    cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Seems fine.
>
>
> But:
>
> rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
>    write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
>      clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
>       lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
>      clat percentiles (usec):
>       |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
>       | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
>       | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
>       | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
>       | 99.99th=[16711680]
>      bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
>      lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
>      lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
>    cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>    WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
>
> Run status group 1 (all jobs):
>    WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
>
>
> What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
>
> What?
>
> Ey, *what*?
>
>
>
> Repeating with the random write case.
>
> Its a different kworker now, but similar result:
>
> ATOP - merkaba                          2014/12/27  13:51:48                          -----------                           10s elapsed
> PRC |  sys   10.66s |  user   0.25s |  #proc    330  | #trun      2  |  #tslpi   545 |  #tslpu     2 |  #zombie    0  | no  procacct  |
> CPU |  sys     105% |  user      3% |  irq       0%  | idle    292%  |  wait      0% |  guest     0% |  curf 3.07GHz  | curscal  95%  |
> cpu |  sys      92% |  user      0% |  irq       0%  | idle      8%  |  cpu002 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> cpu |  sys       8% |  user      0% |  irq       0%  | idle     92%  |  cpu003 w  0% |  guest     0% |  curf 3.09GHz  | curscal  96%  |
> cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys       2% |  user      1% |  irq       0%  | idle     97%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> CPL |  avg1    1.00 |  avg5    1.32 |  avg15   1.23  |               |  csw    34484 |  intr   23182 |                | numcpu     4  |
> MEM |  tot    15.5G |  free    5.4G |  cache   8.3G  | buff    0.0M  |  slab  334.8M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> LVM |     sata-home |  busy      1% |  read      36  | write   2502  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   0.98  | avio 0.06 ms  |
> LVM |    msata-home |  busy      1% |  read      48  | write   2502  |  KiB/w      4 |  MBr/s   0.02 |  MBw/s   0.98  | avio 0.04 ms  |
> LVM |  msata-debian |  busy      0% |  read       0  | write      6  |  KiB/w      7 |  MBr/s   0.00 |  MBw/s   0.00  | avio 1.33 ms  |
> LVM |   sata-debian |  busy      0% |  read       0  | write      6  |  KiB/w      7 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.17 ms  |
> DSK |           sda |  busy      1% |  read      36  | write   2494  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   0.98  | avio 0.06 ms  |
> DSK |           sdb |  busy      1% |  read      48  | write   2494  |  KiB/w      4 |  MBr/s   0.02 |  MBw/s   0.98  | avio 0.04 ms  |
> NET |  transport    |  tcpi      32 |  tcpo      30  | udpi       2  |  udpo       2 |  tcpao      2 |  tcppo      1  | tcprs      0  |
> NET |  network      |  ipi       35 |  ipo       32  | ipfrw      0  |  deliv     35 |               |  icmpi      0  | icmpo      0  |
> NET |  eth0      0% |  pcki      19 |  pcko      16  | si    9 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
>
>    PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
> 11746      -   root      root          1  10.00s    0.00s      0K       0K      0K       0K  --     -  R       2  100%   kworker/u8:0
> 12254      -   root      root          1   0.16s    0.00s      0K       0K    112K    1712K  --     -  S       3    2%   kworker/u8:3
> 17517      -   root      root          1   0.16s    0.00s      0K       0K    144K    1764K  --     -  S       1    2%   kworker/u8:8
>
>
>
> And now the graphical environment is locked. Continuing on TTY1.
>
> Doing another fio job with tee so I can get output easily.
>
> Wow! I wonder whether this is reproducible with a fresh BTRFS with fio stressing it.
>
> Like a 10 GiB BTRFS with 5 GiB fio test file and just letting it run.
>
>
> Okay, I let the final fio job complete and include the output here.
>
>
> Okay, and there we are and I do have sysrq-t figures.
>
> Okay, this is 1.2 MiB xz packed. So I'd better start a bug report about this
> and attach it there. I dislike cloud URLs that may disappear at some point.
>
>
>
> Now please finally acknowledge that there is an issue. Maybe I was not
> using the correct terms at the beginning, but there is a real issue. I have
> been doing performance work for at least half a decade; I know an issue
> when I see it.
>
>
>
>
> There we go:
>
> Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> Thanks,
>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 14:21                 ` Martin Steigerwald
@ 2014-12-27 15:14                   ` Robert White
  2014-12-27 16:01                     ` Martin Steigerwald
  2014-12-27 16:10                     ` Martin Steigerwald
  0 siblings, 2 replies; 59+ messages in thread
From: Robert White @ 2014-12-27 15:14 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Hugo Mills, linux-btrfs

On 12/27/2014 06:21 AM, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 15:14:05 schrieb Martin Steigerwald:
>> Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White:
>>> On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
>>>> It can easily be reproduced without even using Virtualbox, just by a
>>>> nice
>>>> simple fio job.
>>>
>>> TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
>>> with one single file...
>>>
>>> #!/bin/bash
>>> # not tested, so correct any syntax errors
>>> typeset -i counter
>>> for ((counter=250;counter>0;counter--)); do
>>>
>>>    dd if=/dev/urandom of=/some/file bs=4k count=$counter
>>>
>>> done
>>> exit
>>>
>>>
>>> Each pass over /some/file is 4k shorter than the previous one, but none
>>> of the extents can be deallocated. File will be 1MiB in size and usage
>>> will be something like 125.5MiB (if I've done the math correctly).
>>> larger values of counter will result in exponentially larger amounts of
>>> waste.
>>
>> Robert, I experienced this hang issues even before the defragmenting case.
>> It happened while just installed a 400 MiB tax returns application to it
>> (that is no joke, it is that big).
>>
>> It happens while just using the VM.
>>
>> Yes, I recommend not to use BTRFS for any VM image or any larger database on
>> rotating storage for exactly that COW semantics.
>>
>> But on SSD?
>>
>> Its busy looping a CPU core and while the flash is basically idling.
>>
>> I refuse to believe that this is by design.
>>
>> I do think there is a *bug*.
>>
>> Either acknowledge it and try to fix it, or say its by design *without even
>> looking at it closely enough to be sure that it is not a bug* and limit your
>> own possibilities by it.
>>
>> I´d rather see it treated as a bug for now.
>>
>> Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
>> randomly writing to a 4 GiB file.
>>
>> People do these kind of things. Ditch that defrag Windows XP VM case, I had
>> performance issue even before by just installing things to it. Databases,
>> VMs, emulators. And heck even while just *creating* the file with fio as I
>> shown.
>
> Add to these use cases things like this:
>
> martin@merkaba:~/.local/share/akonadi/db_data/akonadi> ls -lSh | head -5
> insgesamt 2,2G
> -rw-rw---- 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
> -rw-rw---- 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
> -rw-rw---- 1 martin martin  23M Dez 27 15:17 pimitemflagrelation.ibd
> -rw-rw---- 1 martin martin 240K Dez 27 15:17 collectiontable.ibd
>
>
> Or this:
>
> martin@merkaba:~/.local/share/baloo> du -sch * | sort -rh
> 9,2G    insgesamt
> 8,0G    email
> 1,2G    file
> 51M     emailContacts
> 408K    contacts
> 76K     notes
> 16K     calendars
>
> martin@merkaba:~/.local/share/baloo> ls -lSh email | head -5
> insgesamt 8,0G
> -rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
> -rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
> -rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
> -rw-r--r-- 1 martin martin  63K Dez 27 15:16 postlist.baseA

/usr/bin/du and /usr/bin/df and /bin/ls are all _useless_ for showing 
the amount of filespace used by a file in BTRFS.

Look at a nice paste of the previously described "worst case" allocation.

Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.41GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Gust rwhite # for ((counter=250;counter>0;counter--)); do dd 
if=/dev/urandom of=some_file conv=notrunc,fsync bs=4k count=$counter 
 >/dev/null 2>&1; done

Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.48GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Gust rwhite # du some_file
1000    some_file

Gust rwhite # ls -lh some_file
-rw-rw-r--+ 1 root root 1000K Dec 27 07:00 some_file

Gust rwhite # rm some_file
Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.41GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Notice that "some_file" shows 1000 blocks in du, and 1000k bytes in ls.

But notice that data used jumps from 340.41GiB to 340.48GiB when the 
file is created, then drops back down to 340.41GiB when it's deleted.

Now I have compression turned on so the amount of growth/shrinkage 
changes between each run, but it's _Way_ more than 1Meg, that's like 
70MiB (give or take significant rounding in the third place after the 
decimal). So I wrote this file in a way that leads to it taking up 
_seventy_ _times_ its base size in actual allocated storage. Real files 
do not perform this terribly, but they can get pretty ugly in some cases.
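
If you want to see an individual file's real on-disk layout, filefrag is far
more useful than ls or du here (it uses the FIEMAP ioctl, which btrfs
supports):

filefrag -v some_file | head    # lists the extents the current file contents map to

It will not show the unreferenced tails of old extents that are still pinned,
but it makes the fragmentation itself obvious.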

You _really_ need to learn how the system works and what its best and 
worst cases look like before you start shouting "bug!"

You are using the wrong numbers (e.g. "df") for available space and you 
don't know how to estimate what your tools _should_ do for the 
conditions observed.

But yes, if you open a file and scribble all over it when your disk is 
full to within the same order of magnitude as the size of the file you 
are scribbling on, you will get into a condition where the _application_ 
will aggressively retry the IO. Particularly if that application is a 
"test program" or a virtual machine doing asynchronous IO.

That's what those sorts of systems do when they crash against a limit in 
the underlying system.

So yea... out of space plus aggressive writer equals spinning CPU

Before you can assign blame you need to strace your application to see
what call it's making over and over again and whether it's just being stupid.

> These will not be as bad as the fio test case, but still these files are
> written into. They are updated in place.
>
> And thats running on every Plasma desktop by default. And on GNOME desktops
> there is similar stuff.
>
> I haven´t seen this spike out a kworker yet tough, so maybe the workload is
> light enough not to trigger it that easily.
>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 15:14                   ` Robert White
@ 2014-12-27 16:01                     ` Martin Steigerwald
  2014-12-28  0:25                       ` Robert White
  2014-12-27 16:10                     ` Martin Steigerwald
  1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 16:01 UTC (permalink / raw)
  To: Robert White; +Cc: Hugo Mills, linux-btrfs

Am Samstag, 27. Dezember 2014, 07:14:32 schrieb Robert White:
> But yes, if you open a file and scribble all over it when your disk is 
> full to within the same order of magnitude as the size of the file you 
> are scribbling on, you will get into a condition where the _application_ 
> will aggressively retry the IO. Particularly if that application is a 
> "test program" or a virtual machine doing asynchronous IO.
> 
> That's what those sorts of systems do when they crash against a limit in 
> the underlying system.
> 
> So yea... out of space plus agressive writer equals spinning CPU
> 
> Before you can assign blame you need to strace your application to see 
> what call its making over and over again to see if its just being stupid.

Robert, I am pretty sure that fio does not retry the I/O. If the I/O returns
an error, it exits immediately.

I don´t think BTRFS fails an I/O – there is nothing of that in kern.log or
dmesg. But it just takes a very long time to complete it.

And yet, with the BTRFS-*is*-*full* test case I still can´t reproduce the <300
IOPS case. I consistently get about 4800 IOPS, which is just about okay IMHO.

fio just does random I/O. Aggressively, yes. But it would stop on the *first*
*failed* I/O request. I am pretty sure of that.

fio is the flexible I/O tester. It has been written mostly by Jens Axboe, the
block layer maintainer of the Linux kernel. So I kindly ask that you have a
look at it before you assume I use crap tools.

From how you write I get the impression that you think everyone else
besides you is just silly and dumb. Please stop this assumption. I may not
always get terms right, and I may make a mistake, as with the wrong df
figure. But I also highly dislike being treated like someone who doesn´t
know a thing.

I made my case.

I tried to reproduce it in a test case.

Now I suggest we wait until someone has had an actual look at the sysrq-t
traces in the 25 MiB kern.log I provided in the bug report.
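
(For anyone who wants to capture the same kind of trace: the /proc interface
is equivalent to hitting Alt-SysRq-t on the console, roughly:

echo 1 > /proc/sys/kernel/sysrq     # allow all SysRq functions
echo t > /proc/sysrq-trigger        # dump the state of every task into the kernel log
dmesg > sysrq-t.log                 # save it (any file name), or pull it from a persistent syslog
)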

I will now wait for BTRFS developers to comment on this.

I think Chris and Josef and other BTRFS developers actually know what fio
is, so… either they are interested in that <300 IOPS case I cannot yet
reproduce with a fresh filesystem or not.


Even when it is almost as full as it can get and fio *barely* completes
without a "no space left on device" error, I still get those 4800 IOPS.
I tested it and took the first run that actually completed again, after
deleting a partial copy of the /usr/bin directory from the test filesystem.

As I have shown in my test case (see my other mail with the altered subject
line).

So at least for a *small* full filesystem, the "filesystem is full, BTRFS has
to search aggressively for free space" explanation *does not* explain what I
see with my /home. Either I need a fuller filesystem for the test case,
maybe one which carries a million files or more, or one that at least
has more chunks to allocate from, or there is more to it and something
about my /home makes it even worse.

So it isn´t just the filesystem-full case, and the "all free space allocated
to chunks" condition also does not suffice, as my test case shows (where
BTRFS just won´t allocate another data chunk, it seems).

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 15:14                   ` Robert White
  2014-12-27 16:01                     ` Martin Steigerwald
@ 2014-12-27 16:10                     ` Martin Steigerwald
  1 sibling, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 16:10 UTC (permalink / raw)
  To: Robert White; +Cc: Hugo Mills, linux-btrfs

Am Samstag, 27. Dezember 2014, 07:14:32 schrieb Robert White:
> On 12/27/2014 06:21 AM, Martin Steigerwald wrote:
> > Am Samstag, 27. Dezember 2014, 15:14:05 schrieb Martin Steigerwald:
> >> Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White:
> >>> On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> >>>> It can easily be reproduced without even using Virtualbox, just by a
> >>>> nice
> >>>> simple fio job.
> >>>
> >>> TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
> >>> with one single file...
> >>>
> >>> #!/bin/bash
> >>> # not tested, so correct any syntax errors
> >>> typeset -i counter
> >>> for ((counter=250;counter>0;counter--)); do
> >>>
> >>>    dd if=/dev/urandom of=/some/file bs=4k count=$counter
> >>>
> >>> done
> >>> exit
> >>>
> >>>
> >>> Each pass over /some/file is 4k shorter than the previous one, but none
> >>> of the extents can be deallocated. File will be 1MiB in size and usage
> >>> will be something like 125.5MiB (if I've done the math correctly).
> >>> larger values of counter will result in exponentially larger amounts of
> >>> waste.
> >>
> >> Robert, I experienced this hang issues even before the defragmenting case.
> >> It happened while just installed a 400 MiB tax returns application to it
> >> (that is no joke, it is that big).
> >>
> >> It happens while just using the VM.
> >>
> >> Yes, I recommend not to use BTRFS for any VM image or any larger database on
> >> rotating storage for exactly that COW semantics.
> >>
> >> But on SSD?
> >>
> >> Its busy looping a CPU core and while the flash is basically idling.
> >>
> >> I refuse to believe that this is by design.
> >>
> >> I do think there is a *bug*.
> >>
> >> Either acknowledge it and try to fix it, or say its by design *without even
> >> looking at it closely enough to be sure that it is not a bug* and limit your
> >> own possibilities by it.
> >>
> >> I´d rather see it treated as a bug for now.
> >>
> >> Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
> >> randomly writing to a 4 GiB file.
> >>
> >> People do these kind of things. Ditch that defrag Windows XP VM case, I had
> >> performance issue even before by just installing things to it. Databases,
> >> VMs, emulators. And heck even while just *creating* the file with fio as I
> >> shown.
> >
> > Add to these use cases things like this:
> >
> > martin@merkaba:~/.local/share/akonadi/db_data/akonadi> ls -lSh | head -5
> > insgesamt 2,2G
> > -rw-rw---- 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
> > -rw-rw---- 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
> > -rw-rw---- 1 martin martin  23M Dez 27 15:17 pimitemflagrelation.ibd
> > -rw-rw---- 1 martin martin 240K Dez 27 15:17 collectiontable.ibd
> >
> >
> > Or this:
> >
> > martin@merkaba:~/.local/share/baloo> du -sch * | sort -rh
> > 9,2G    insgesamt
> > 8,0G    email
> > 1,2G    file
> > 51M     emailContacts
> > 408K    contacts
> > 76K     notes
> > 16K     calendars
> >
> > martin@merkaba:~/.local/share/baloo> ls -lSh email | head -5
> > insgesamt 8,0G
> > -rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
> > -rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
> > -rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
> > -rw-r--r-- 1 martin martin  63K Dez 27 15:16 postlist.baseA
> 
> /usr/bin/du and /usr/bin/df and /bin/ls are all _useless_ for showing 
> the amount of filespace used by a file in BTRFS.

Yes.

But they are *useful* to demonstrate that there are regular desktop
applications which randomly write into huge files. And that was *exactly*
the point I was trying to make.

Yes, I didn´t prove the random aspect. But heck, one is a MySQL database
and one is a Xapian index. I am fairly sure that for a desktop search and
for maildir folder indexing there is some random aspect in the workload.
Do you agree with that?

So what you call as "bad" – that was my exact point I was going to make
– point is going to happen on systems. Maybe not as fierce as a fio job,
granted. And for these said /home BTRFS worked fine, but for just
installed a 400 MiB application onto the Windows XP I had the hang
already. With more than 8 GiB of free space within the chunks at that
time.

If BTRFS drops to <300 IOPS on a dual-SSD setup under near-full conditions
with workloads like this, it will fail in real-world scenarios. And again,
my recommendation to leave way more free space than with other filesystems
still holds.

Yes, I saw XFS developer Dave Chinner recommend keeping about 50% free
space on XFS for a crazy workload if you want the filesystem to stay in a
young state even after 10 years. So I am fully aware that filesystems age.

But to *this* extent? After the roughly six months I have actually been
running this BTRFS RAID 1, which started as a fresh single-device BTRFS
that I then balanced as RAID 1 onto the second SSD?

I still think it is a bug. Especially as it just does not happen with a
simple disk-full condition; I spent several hours trying to reproduce this
worst case.

If it only happens with my /home, I am willing to accept that something may
be borked with it. And I haven´t been able to reproduce it with a clean
filesystem yet. So maybe it doesn´t happen for others. Then all is fine, I
recreate the FS and forget about it.

But before I do any of this, I will wait and see whether a developer can
make sense of the sysrq-t output in syslog.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 14:54         ` Robert White
@ 2014-12-27 16:26           ` Hugo Mills
  2014-12-27 17:11             ` Martin Steigerwald
  2014-12-28  0:06             ` Robert White
  0 siblings, 2 replies; 59+ messages in thread
From: Hugo Mills @ 2014-12-27 16:26 UTC (permalink / raw)
  To: Robert White; +Cc: Martin Steigerwald, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4052 bytes --]

On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote:
> On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
[snip]
> >while fio was just *laying* out the 4 GiB file. Yes, thats 100% system CPU
> >for 10 seconds while allocatiing a 4 GiB file on a filesystem like:
> >
> >martin@merkaba:~> LANG=C df -hT /home
> >Filesystem             Type   Size  Used Avail Use% Mounted on
> >/dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> >
> >where a 4 GiB file should easily fit, no? (And this output is with the 4
> >GiB file. So it was even 4 GiB more free before.)
> 
> No. /usr/bin/df is an _approximation_ in BTRFS because of the limits
> of the fsstat() function call. The fstat function call was defined
> in 1990 and "can't understand" the dynamic allocation model used in
> BTRFS as it assumes fixed geometry for filesystems. You do _not_
> have 17G actually available. You need to rely on btrfs fi df and
> btrfs fi show to figure out how much space you _really_ have.
> 
> According to this block you have a RAID1 of ~ 160GB expanse (two 160G disks)
> 
> > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > Sa 27. Dez 13:26:39 CET 2014
> > Label: 'home'  uuid: [some UUID]
> >          Total devices 2 FS bytes used 152.83GiB
> >          devid    1 size 160.00GiB used 160.00GiB path
> /dev/mapper/msata-home
> >          devid    2 size 160.00GiB used 160.00GiB path
> /dev/mapper/sata-home
> 
> And according to this block you have about 4.49GiB of data space:
> 
> > Btrfs v3.17
> > Data, RAID1: total=154.97GiB, used=149.58GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> 154.97
>   5.00
>   0.032
> + 0.512
> 
> Pretty much as close to 160GiB as you are going to get (those
> numbers being rounded up in places for "human readability") BTRFS
> has allocate 100% of the raw storage into typed extents.
> 
> A large datafile can only fit in the 154.97-149.58 = 5.39

   I appreciate that this is something of a minor point in the grand
scheme of things, but I'm afraid I've lost the enthusiasm to engage
with the broader (somewhat rambling, possibly-at-cross-purposes)
conversation in this thread. However...

> Trying to allocate that 4GiB file into that 5.39GiB of space becomes
> an NP-complete (e.g. "very hard") problem if it is very fragmented.

   This is... badly mistaken, at best. The problem of where to write a
file into a set of free extents is definitely *not* an NP-hard
problem. It's a P problem, with an O(n log n) solution, where n is the
number of free extents in the free space cache. The simple approach:
fill the first hole with as many bytes as you can, then move on to the
next hole. More complex: order the free extents by size first. Both of
these are O(n log n) algorithms, given an efficient general-purpose
index of free space.

   The problem of placing file data isn't a bin-packing problem; it's
not like allocating RAM (where each allocation must be contiguous).
The items being placed may be split as much as you like, although
minimising the amount of splitting is a goal.
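
   To make that concrete, here is a minimal shell sketch of the greedy
"fill the first hole, split as needed" idea – purely an illustration of
the approach, not btrfs's actual allocator, and the hole sizes are made
up:

#!/bin/bash
# Place a request into a list of free holes, splitting it across holes
# as needed. With the holes kept in a sorted index this is the
# O(n log n) approach described above; a plain loop is used here for
# readability.
free_extents=(1048576 262144 4096 524288)   # hole sizes in bytes (example)
request=$((1 * 1024 * 1024))                # bytes to place

for hole in "${free_extents[@]}"; do
    (( request <= 0 )) && break
    take=$(( hole < request ? hole : request ))   # fill this hole as far as possible
    echo "write an extent of $take bytes into a $hole-byte hole"
    request=$(( request - take ))
done
(( request > 0 )) && echo "ENOSPC: $request bytes could not be placed"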

   I suspect that the performance problems that Martin is seeing may
indeed be related to free space fragmentation, in that finding and
creating all of those tiny extents for a huge file is causing
problems. I believe that btrfs isn't alone in this, but it may well be
showing the problem to a far greater degree than other FSes. I don't
have figures to compare, I'm afraid.

> I also don't know what kind of tool you are using, but it might be
> repeatedly trying and failing to fallocate the file as a single
> extent or something equally dumb.

   Userspace doesn't as far as I know, get to make that decision. I've
just read the fallocate(2) man page, and it says nothing at all about
the contiguity of the extent(s) storage allocated by the call.

   Hugo.

[snip]

-- 
Hugo Mills             | O tempura! O moresushi!
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: 65E74AC0          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 16:26           ` Hugo Mills
@ 2014-12-27 17:11             ` Martin Steigerwald
  2014-12-27 17:59               ` Martin Steigerwald
  2014-12-28  0:06             ` Robert White
  1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 17:11 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 6800 bytes --]

Am Samstag, 27. Dezember 2014, 16:26:42 schrieb Hugo Mills:
> On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote:
> > On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
> [snip]
> > >while fio was just *laying* out the 4 GiB file. Yes, thats 100% system CPU
> > >for 10 seconds while allocatiing a 4 GiB file on a filesystem like:
> > >
> > >martin@merkaba:~> LANG=C df -hT /home
> > >Filesystem             Type   Size  Used Avail Use% Mounted on
> > >/dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > >
> > >where a 4 GiB file should easily fit, no? (And this output is with the 4
> > >GiB file. So it was even 4 GiB more free before.)
> > 
> > No. /usr/bin/df is an _approximation_ in BTRFS because of the limits
> > of the fsstat() function call. The fstat function call was defined
> > in 1990 and "can't understand" the dynamic allocation model used in
> > BTRFS as it assumes fixed geometry for filesystems. You do _not_
> > have 17G actually available. You need to rely on btrfs fi df and
> > btrfs fi show to figure out how much space you _really_ have.
> > 
> > According to this block you have a RAID1 of ~ 160GB expanse (two 160G disks)
> > 
> > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > Sa 27. Dez 13:26:39 CET 2014
> > > Label: 'home'  uuid: [some UUID]
> > >          Total devices 2 FS bytes used 152.83GiB
> > >          devid    1 size 160.00GiB used 160.00GiB path
> > /dev/mapper/msata-home
> > >          devid    2 size 160.00GiB used 160.00GiB path
> > /dev/mapper/sata-home
> > 
> > And according to this block you have about 4.49GiB of data space:
> > 
> > > Btrfs v3.17
> > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > 
> > 154.97
> >   5.00
> >   0.032
> > + 0.512
> > 
> > Pretty much as close to 160GiB as you are going to get (those
> > numbers being rounded up in places for "human readability") BTRFS
> > has allocate 100% of the raw storage into typed extents.
> > 
> > A large datafile can only fit in the 154.97-149.58 = 5.39
> 
>    I appreciate that this is something of a minor point in the grand
> scheme of things, but I'm afraid I've lost the enthusiasm to engage
> with the broader (somewhat rambling, possibly-at-cross-purposes)
> conversation in this thread. However...
> 
> > Trying to allocate that 4GiB file into that 5.39GiB of space becomes
> > an NP-complete (e.g. "very hard") problem if it is very fragmented.
> 
>    This is... badly mistaken, at best. The problem of where to write a
> file into a set of free extents is definitely *not* an NP-hard
> problem. It's a P problem, with an O(n log n) solution, where n is the
> number of free extents in the free space cache. The simple approach:
> fill the first hole with as many bytes as you can, then move on to the
> next hole. More complex: order the free extents by size first. Both of
> these are O(n log n) algorithms, given an efficient general-purpose
> index of free space.
> 
>    The problem of placing file data isn't a bin-packing problem; it's
> not like allocating RAM (where each allocation must be contiguous).
> The items being placed may be split as much as you like, although
> minimising the amount of splitting is a goal.
> 
>    I suspect that the performance problems that Martin is seeing may
> indeed be related to free space fragmentation, in that finding and
> creating all of those tiny extents for a huge file is causing
> problems. I believe that btrfs isn't alone in this, but it may well be
> showing the problem to a far greater degree than other FSes. I don't
> have figures to compare, I'm afraid.

Thats what I wanted to hint at.

I suspect an issue with free space fragmentation, and here is what I think
I see:

btrfs balance reduces the fragmentation of free space within the chunks.

And that is my whole case for why I think it helps with my /home
filesystem.

So while btrfs filesystem defragment may help with defragmenting individual
files, possibly at the cost of fragmenting free space, at least under
almost-full conditions, I think there are only three options at the moment
to help with free space fragmentation:

1) reformat and restore from backup via rsync or btrfs send (i.e. file based)

2) make the BTRFS filesystem itself bigger

3) btrfs balance at least some chunks, for example those that are no more
than 70% or 80% full (see the sketch below).

Do you know of any other ways to deal with it?

So yes, in case it really is free space fragmentation, I do think a balance
may be helpful. Even if one usually should not need to run a balance.
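
To give option 3) above a concrete shape, here is a hedged sketch of such a
usage-filtered balance – mount point and step values are just examples:

#!/bin/bash
# Rewrite only chunks that are at most N% full, raising the threshold
# step by step so each pass relocates as little data as possible.
for usage in 10 25 50 75; do
    btrfs balance start -dusage=$usage /home
done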
 
> > I also don't know what kind of tool you are using, but it might be
> > repeatedly trying and failing to fallocate the file as a single
> > extent or something equally dumb.
> 
>    Userspace doesn't as far as I know, get to make that decision. I've
> just read the fallocate(2) man page, and it says nothing at all about
> the contiguity of the extent(s) storage allocated by the call.

fio fallocates just once, and then writes even if the fallocate call fails.

It was nice to see at some point that BTRFS returned out of space on the
fallocate but was still able to write the 4 GiB of random data. I bet
the latter was due to compression. So while it could not guarantee
that the 4 GiB would fit in all cases, i.e. even with incompressible
data, it was able to write out the random buffer fio repeatedly wrote.
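
As an aside, fio lets one control that pre-allocation from the command line;
a hedged sketch of a comparison run that skips fallocate entirely (the
option names are from my reading of the fio manual, so double-check them):

# the same 4 GiB random write job, but without the posix_fallocate() step
fio --name=rand-write --rw=randwrite --bs=4k --size=4g \
    --filename=ssd.test.file --fallocate=none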


I think I will step back from this now; it´s the weekend and a quiet time
after all.

I probably got a bit too engaged with this discussion. Yet, I had the feeling
I was treated by Robert like someone who doesn´t know a thing. I want to
approach this with a willingness to learn, and I don´t want to interpret
an empirical result away before someone even had a closer look at it.

I had this before where an expert claimed that he wouldn´t reduce the
dirty_background_ratio on an rsync via NFS case and I actually needed to
prove the result to him before he – I don´t even know – eventually
accepted it.

I may be off with my free space fragmentation idea, so let the kern.log
and my results speak for themselves. I don´t see much point in continuing
this discussion before a BTRFS developer has had a look at it.

I put the sysrq-trigger t kern.log onto the bug report. The bugzilla does
not seem to be reachable from here at the moment – nginx reports "502 Bad
Gateway" – but I did attach the kern.log to it. In case someone needs it by
mail, just ping me.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 17:11             ` Martin Steigerwald
@ 2014-12-27 17:59               ` Martin Steigerwald
  0 siblings, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 17:59 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5091 bytes --]

Am Samstag, 27. Dezember 2014, 18:11:21 schrieb Martin Steigerwald:
> Am Samstag, 27. Dezember 2014, 16:26:42 schrieb Hugo Mills:
> > On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote:
> > > On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
> > [snip]
> > > >while fio was just *laying* out the 4 GiB file. Yes, thats 100% system CPU
> > > >for 10 seconds while allocatiing a 4 GiB file on a filesystem like:
> > > >
> > > >martin@merkaba:~> LANG=C df -hT /home
> > > >Filesystem             Type   Size  Used Avail Use% Mounted on
> > > >/dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > > >
> > > >where a 4 GiB file should easily fit, no? (And this output is with the 4
> > > >GiB file. So it was even 4 GiB more free before.)
> > > 
> > > No. /usr/bin/df is an _approximation_ in BTRFS because of the limits
> > > of the fsstat() function call. The fstat function call was defined
> > > in 1990 and "can't understand" the dynamic allocation model used in
> > > BTRFS as it assumes fixed geometry for filesystems. You do _not_
> > > have 17G actually available. You need to rely on btrfs fi df and
> > > btrfs fi show to figure out how much space you _really_ have.
> > > 
> > > According to this block you have a RAID1 of ~ 160GB expanse (two 160G disks)
> > > 
> > > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > > Sa 27. Dez 13:26:39 CET 2014
> > > > Label: 'home'  uuid: [some UUID]
> > > >          Total devices 2 FS bytes used 152.83GiB
> > > >          devid    1 size 160.00GiB used 160.00GiB path
> > > /dev/mapper/msata-home
> > > >          devid    2 size 160.00GiB used 160.00GiB path
> > > /dev/mapper/sata-home
> > > 
> > > And according to this block you have about 4.49GiB of data space:
> > > 
> > > > Btrfs v3.17
> > > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > 
> > > 154.97
> > >   5.00
> > >   0.032
> > > + 0.512
> > > 
> > > Pretty much as close to 160GiB as you are going to get (those
> > > numbers being rounded up in places for "human readability") BTRFS
> > > has allocate 100% of the raw storage into typed extents.
> > > 
> > > A large datafile can only fit in the 154.97-149.58 = 5.39
> > 
> >    I appreciate that this is something of a minor point in the grand
> > scheme of things, but I'm afraid I've lost the enthusiasm to engage
> > with the broader (somewhat rambling, possibly-at-cross-purposes)
> > conversation in this thread. However...
> > 
> > > Trying to allocate that 4GiB file into that 5.39GiB of space becomes
> > > an NP-complete (e.g. "very hard") problem if it is very fragmented.
> > 
> >    This is... badly mistaken, at best. The problem of where to write a
> > file into a set of free extents is definitely *not* an NP-hard
> > problem. It's a P problem, with an O(n log n) solution, where n is the
> > number of free extents in the free space cache. The simple approach:
> > fill the first hole with as many bytes as you can, then move on to the
> > next hole. More complex: order the free extents by size first. Both of
> > these are O(n log n) algorithms, given an efficient general-purpose
> > index of free space.
> > 
> >    The problem of placing file data isn't a bin-packing problem; it's
> > not like allocating RAM (where each allocation must be contiguous).
> > The items being placed may be split as much as you like, although
> > minimising the amount of splitting is a goal.
> > 
> >    I suspect that the performance problems that Martin is seeing may
> > indeed be related to free space fragmentation, in that finding and
> > creating all of those tiny extents for a huge file is causing
> > problems. I believe that btrfs isn't alone in this, but it may well be
> > showing the problem to a far greater degree than other FSes. I don't
> > have figures to compare, I'm afraid.
> 
> Thats what I wanted to hint at.
> 
> I suspect an issue with free space fragmentation and do what I think I see:
> 
> btrfs balance minimizes free space in chunk fragmentation.
> 
> And that is my whole case on why I think it does help with my /home
> filesystem.
> 
> So while btrfs filesystem defragment may help with defragmenting individual
> files, possibly at the cost of fragmenting free space at least on filesystem
> almost full conditions, I think to help with free space fragmentation there
> are only three options at the moment:
> 
> 1) reformat and restore via rsync or btrfs send from backup (i.e. file based)
> 
> 2) make the BTRFS in itself bigger
> 
> 3) btrfs balance at least chunks, at least those that are not more than 70%
> or 80% full.
> 
> Do you know of any other ways to deal with it?

Yes.

4) Delete some stuff from it or move it over to a different filesystem.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27  9:30     ` Hugo Mills
                         ` (2 preceding siblings ...)
  2014-12-27 13:55       ` Martin Steigerwald
@ 2014-12-27 18:28       ` Zygo Blaxell
  2014-12-27 18:40         ` Hugo Mills
  3 siblings, 1 reply; 59+ messages in thread
From: Zygo Blaxell @ 2014-12-27 18:28 UTC (permalink / raw)
  To: Hugo Mills, Martin Steigerwald, Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1582 bytes --]

On Sat, Dec 27, 2014 at 09:30:43AM +0000, Hugo Mills wrote:
> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
>    Now, since you're seeing lockups when the space on your disks is
> all allocated I'd say that's a bug. However, you're the *only* person
> who's reported this as a regular occurrence. Does this happen with all
> filesystems you have, or just this one?

I do see something similar, but there are so many problems going on I
have no idea which ones to report, and which ones are my own doing.  :-P

I see lots of CPU being burned when all the disk space is allocated
to chunks, but there is still lots of space free (multiple GB) inside
the chunks.

iotop shows a crapton of disk writes (1-5MB/sec) from one kworker.
There are maybe a few kB/sec of writes through the filesystem at the time.

The filesystem where I see this most is on a laptop, so the disk writes
also hit the CPU again for encryption.  There's so much CPU usage it's
worth mentioning twice.  :-(

'watch cat /proc/12345/stack' on the active processes shows the kernel
fairly often in that new chunk deallocator function whose name escapes
me at the moment.
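
For reference, a hedged sketch of one way to do that – the PID picking is
approximate and it needs root:

# find the kworker burning the most CPU and watch its kernel stack
pid=$(ps -eo pid,comm,%cpu --sort=-%cpu | awk '$2 ~ /^kworker/ {print $1; exit}')
watch -n 1 "cat /proc/$pid/stack"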

Deleting a bunch of data then running balance helps return to sane CPU
usage...for a while (maybe a week?).

It's not technically "locked up" per se, but when a 5KB download takes
a minute or more, most users won't wait around to see the difference.

Kernel versions I'm using are 3.17.7 and 3.18.1.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 18:28       ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
@ 2014-12-27 18:40         ` Hugo Mills
  2014-12-27 19:23           ` BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time) Martin Steigerwald
  0 siblings, 1 reply; 59+ messages in thread
From: Hugo Mills @ 2014-12-27 18:40 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Martin Steigerwald, Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2481 bytes --]

On Sat, Dec 27, 2014 at 01:28:46PM -0500, Zygo Blaxell wrote:
> On Sat, Dec 27, 2014 at 09:30:43AM +0000, Hugo Mills wrote:
> > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> >    Now, since you're seeing lockups when the space on your disks is
> > all allocated I'd say that's a bug. However, you're the *only* person
> > who's reported this as a regular occurrence. Does this happen with all
> > filesystems you have, or just this one?
> 
> I do see something similar, but there are so many problems going on I
> have no idea which ones to report, and which ones are my own doing.  :-P
> 
> I see lots of CPU being burned when all the disk space is allocated
> to chunks, but there is still lots of space free (multiple GB) inside
> the chunks.
> 
> iotop shows a crapton of disk writes (1-5MB/sec) from one kworker.
> There are maybe a few kB/sec of writes through the filesystem at the time.
> 
> The filesystem where I see this most is on a laptop, so the disk writes
> also hit the CPU again for encryption.  There's so much CPU usage it's
> worth mentioning twice.  :-(
> 
> 'watch cat /proc/12345/stack' on the active processes shows the kernel
> fairly often in that new chunk deallocator function whose name escapes
> me at the moment.
> 
> Deleting a bunch of data then running balance helps return to sane CPU
> usage...for a while (maybe a week?).
> 
> It's not technically "locked up" per se, but when a 5KB download takes
> a minute or more, most users won't wait around to see the difference.
> 
> Kernel versions I'm using are 3.17.7 and 3.18.1.

   OK, so I'd like to change my statement above.

   When I first read Martin's problem, I thought that he was referring
to a complete, hit-the-power-button kind of lock-up. Given that
(erroneous) assumption, I stand by my (now pointless) statement. :)

   I realised during a brief conversation on IRC that Martin was
actually referring to long but temporary periods where the machine is
unusable by any process requiring disk activity. There's clearly a
number of people seeing that.

   It doesn't stop it being a major problem, but it does change the
interpretation considerably.

   Hugo.

-- 
Hugo Mills             | Mixing mathematics and alcohol is dangerous. Don't
hugo@... carfax.org.uk | drink and derive.
http://carfax.org.uk/  |
PGP: 65E74AC0          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
  2014-12-27 18:40         ` Hugo Mills
@ 2014-12-27 19:23           ` Martin Steigerwald
  2014-12-29  2:07             ` Zygo Blaxell
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-27 19:23 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Zygo Blaxell, Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5782 bytes --]

Am Samstag, 27. Dezember 2014, 18:40:17 schrieb Hugo Mills:
> On Sat, Dec 27, 2014 at 01:28:46PM -0500, Zygo Blaxell wrote:
> > On Sat, Dec 27, 2014 at 09:30:43AM +0000, Hugo Mills wrote:
> > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > >    Now, since you're seeing lockups when the space on your disks is
> > > all allocated I'd say that's a bug. However, you're the *only* person
> > > who's reported this as a regular occurrence. Does this happen with all
> > > filesystems you have, or just this one?
> > 
> > I do see something similar, but there are so many problems going on I
> > have no idea which ones to report, and which ones are my own doing.  :-P
> > 
> > I see lots of CPU being burned when all the disk space is allocated
> > to chunks, but there is still lots of space free (multiple GB) inside
> > the chunks.
> > 
> > iotop shows a crapton of disk writes (1-5MB/sec) from one kworker.
> > There are maybe a few kB/sec of writes through the filesystem at the time.
> > 
> > The filesystem where I see this most is on a laptop, so the disk writes
> > also hit the CPU again for encryption.  There's so much CPU usage it's
> > worth mentioning twice.  :-(
> > 
> > 'watch cat /proc/12345/stack' on the active processes shows the kernel
> > fairly often in that new chunk deallocator function whose name escapes
> > me at the moment.
> > 
> > Deleting a bunch of data then running balance helps return to sane CPU
> > usage...for a while (maybe a week?).
> > 
> > It's not technically "locked up" per se, but when a 5KB download takes
> > a minute or more, most users won't wait around to see the difference.
> > 
> > Kernel versions I'm using are 3.17.7 and 3.18.1.
> 
>    OK, so I'd like to change my statement above.
> 
>    When I first read Martin's problem, I thought that he was referring
> to a complete, hit-the-power-button kind of lock-up. Given that
> (erroneous) assumption, I stand by my (now pointless) statement. :)
> 
>    I realised during a brief conversation on IRC that Martin was
> actually referring to long but temporary periods where the machine is
> unusable by any process requiring disk activity. There's clearly a
> number of people seeing that.
> 
>    It doesn't stop it being a major problem, but it does change the
> interpretation considerably.

Ah, then my bet about whom I talked with there was right. :)

Yeah, it does not seem to be a complete hang. I thought so initially, cause
honestly after waiting several minutes for my Plasma desktop to come back
I just gave up. Maybe it would have returned at some point. I just didn´t
have the patience to wait.

It did return during my last test, where I continued on tty1 (I had all the
testing in a screen session) as the desktop session locked up. Some time
after the test completed I was able to use that desktop again and I am
still using it.

So the issue I see is: one kworker uses 100% of one core for minutes, and
while it does so, processes that do I/O to the BTRFS I test (/home in my
case) seem to be stuck in uninterruptible sleep ("D" process state). While I
see this there is no huge load on the SSDs, so… it seems to be something
CPU bound. I didn´t yet use strace on the kworker process – or, at the
allocation time, on the fio process –, Robert, that´s a good suggestion. From
a gut feeling I wouldn´t be surprised to see *nothing* in strace, as my bet
is that the kworker thread deals with finding free space inside the chunks
and works on some data structures while doing so. But that is really just
a gut feeling, and so a strace would be nice.
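
For reference, one way to capture that state – standard procps and sysrq,
nothing BTRFS-specific, and it needs root:

# list tasks in uninterruptible sleep ("D") with the kernel function they wait in
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
# dump all kernel stacks into the kernel log (the sysrq-t output mentioned here)
echo t > /proc/sysrq-trigger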

I made a backup yesterday, so I think I can try the strace. But I also spent
a considerable amount of time reproducing it and digging deeper into it,
so likely not this weekend anymore, although this is even kind of fun. But
I see myself neglecting other stuff that´s important to me as well, so…

My simple test case didn´t trigger it, and I do not have another 2×160 GiB
available on these SSDs to try with a copy of my home filesystem. Then I
could safely test without bringing the desktop session to a halt. Maybe
someone has an idea on how to "enhance" my test case in order to reliably
trigger the issue.

It may be challenging though. My /home is quite a filesystem. It has a
maildir with at least one million files (yeah, I am performance testing
KMail and Akonadi to the limit as well!), and it has git repos and this one
VM image, and the desktop search and the Akonadi database. In other words:
it has been hit nicely with various, mostly random I think, workloads over
the last roughly six months. I bet it´s not that easy to simulate that.
Maybe some runs of compilebench to age the filesystem before the fio test?
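
A hedged sketch of that aging idea – the directory, iteration counts and
job file name are made up, and the compilebench options are from memory,
so double-check them:

#!/bin/bash
# Let compilebench create and churn lots of small files to fragment free
# space, then run the random write fio job on the aged filesystem.
mkdir -p /home/aging
for run in 1 2 3; do
    compilebench -D /home/aging -i 30 -r 100
done
fio rand-write.fio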

That said, BTRFS performs a lot better now. The complete lockups without any
CPU usage of 3.15 and 3.16 are gone for sure. That´s wonderful. But there
is this kworker issue now. I only noticed it this severely while trying to
complete this tax returns stuff with the Windows XP VM. It may have happened
otherwise too – I have seen some backtraces in kern.log – but it didn´t last
for minutes. So this indeed is of less severity than the full lockups with
3.15 and 3.16.

Zygo, what are the characteristics of your filesystem? Do you use
compress=lzo and skinny metadata as well? How are the chunks allocated?
What kind of data do you have on it?

Well now off to some dancing event. Thats just right now :)

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 16:26           ` Hugo Mills
  2014-12-27 17:11             ` Martin Steigerwald
@ 2014-12-28  0:06             ` Robert White
  2014-12-28 11:05               ` Martin Steigerwald
  1 sibling, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-28  0:06 UTC (permalink / raw)
  To: Hugo Mills, Martin Steigerwald, linux-btrfs

Semi off-topic questions...

On 12/27/2014 08:26 AM, Hugo Mills wrote:
>     This is... badly mistaken, at best. The problem of where to write a
> file into a set of free extents is definitely *not* an NP-hard
> problem. It's a P problem, with an O(n log n) solution, where n is the
> number of free extents in the free space cache. The simple approach:
> fill the first hole with as many bytes as you can, then move on to the
> next hole. More complex: order the free extents by size first. Both of
> these are O(n log n) algorithms, given an efficient general-purpose
> index of free space.

Which algorithm is actually in use?

Is any attempt made to keep subsequent allocations in the same data extent?

All of "best fit", "first fit", and "first encountered" allocation have 
terrible distribution graphs over time.

Without a nod to locality, discontiguous allocation will have
staggeringly bad after-effects in terms of read-ahead.

>
>     The problem of placing file data isn't a bin-packing problem; it's
> not like allocating RAM (where each allocation must be contiguous).
> The items being placed may be split as much as you like, although
> minimising the amount of splitting is a goal.

How are compression and re-compression handled? If a linear extent is
compressed to find its on-disk size in bytes, and then there isn't a free
extent large enough to fit it, it has to be cut, then recompressed, then
searched for again, right?

How does the system look for the right cut? How iterative can this get?
Does it always try cutting in half? Does it shave single bytes off the
end? Does it add one byte at a time till it reaches the size of the
extent it's looking at?

Can you get down to a point where you are placing data in five or ten 
byte chunks somehow? (e.g. what's the smallest chunk you can place? 
clearly if I open a multi-megabyte file and update a single word or byte 
it's not going to land in metadata from my reading of the code.) One 
could easily end up with a couple million free extents of just a few 
bytes each, particularly if largest-first allocation is used.

The degenerate cases here do come straight from the various packing 
problems. You may not be executing any of those packing algorithms but 
once you ignore enough of those issues in the easy cases your free space 
will be a fine pink mist suspended in space. (both an explosion analogy 
and a reference to pink noise 8-) ).

>     I suspect that the performance problems that Martin is seeing may
> indeed be related to free space fragmentation, in that finding and
> creating all of those tiny extents for a huge file is causing
> problems. I believe that btrfs isn't alone in this, but it may well be
> showing the problem to a far greater degree than other FSes. I don't
> have figures to compare, I'm afraid.

>
>> I also don't know what kind of tool you are using, but it might be
>> repeatedly trying and failing to fallocate the file as a single
>> extent or something equally dumb.
>
>     Userspace doesn't as far as I know, get to make that decision. I've
> just read the fallocate(2) man page, and it says nothing at all about
> the contiguity of the extent(s) storage allocated by the call.

Yep, my bad. But as soon as I saw that "fio" was starting two threads, 
one doing random read/write and another doing sequential read/write, 
both on the same file, it set off my "not just creating a file" mindset. 
Given the delayed write into/through the cache normally done by casual 
file io, It seemed likely that fio would be doing something more 
aggressive (like using O_DIRECT or repeated fdatasync() which could get 
very tit-for-tat).

Compare that to a VM in which the guest operating system "knows" it has, 
and has used, its "disk space" internally, and the subsequent async 
activity of the monitor to push that activity out to real storage which 
is usually quite pathological... well you can get into some super 
pernicious behavior over write ordering and infinite retries.

So I was wrong about fallocate per se; applications can be incredibly
dumb. For instance a VM might think it's _inconceivable_ to get an ENOSPC
while rewriting data it's just read from a file it "knows" has no holes, etc.

Given how lots of code doesn't even check the results of many function
calls... how many times have you seen code that doesn't look at the
return value of fwrite() or printf()? Or code that, at best, does something
like if (bytes_written < size) retry_remainder();? So sure, I was
imagining an fallocate() in a loop or something equally dumb. 8-)

>
>     Hugo.
>
> [snip]
>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-27 16:01                     ` Martin Steigerwald
@ 2014-12-28  0:25                       ` Robert White
  2014-12-28  1:01                         ` Bardur Arantsson
  0 siblings, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-28  0:25 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Hugo Mills, linux-btrfs

On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
> From how you write I get the impression that you think everyone else
> beside you is just silly and dumb. Please stop this assumption. I may not
> always get terms right, and I may make a mistake as with the wrong df
> figure. But I also highly dislike to feel treated like someone who doesn´t
> know a thing.

Nope. I'm a systems theorist and I demand/require variable isolation.

Not a question of "silly" or "dumb" but a question of "speaking with 
sufficient precision and clarity".

For instance you speak of "having an impression" and then decide I've 
made an assumption.

I define my position. Explain my terms. Give my examples.

I also risk being utterly wrong because sometimes being completely wrong 
gets others to cut away misconceptions and assumptions.

It annoys some people, but it gets results. You've been going around on
this topic for how long? And just today Hugo "got" that your problem is
becoming CPU bound (a long-running process) instead of a hard lockup. We've
stopped talking about "trees" and started talking about free space
management. We've stopped talking about 17G of free space and gotten
down to the 5 or so, plus you've gotten angry at me, tried to prove me
an idiot, and so produced test cases and data that are absolutely clear,
including steps to reproduce.

In real life I work on mission critical systems that can get people 
killed when they fail. So I have developed the reflex of tenacity in 
getting everyone using the same words, talking about the same concepts, 
giving concrete examples, and generally bringing the discussion to a 
very precise head.

Example: I had two parties in conflict about a system. One party said 
that every time they did "an orderly shutdown" the device would hang in 
a way that took days to recover from. The other party would examine the 
device and say "could not reproduce". Turns out that the two parties 
were doing entirely different (but both correct) sequences for "orderly 
shutdown". They'd been having that conflict for more than a year. But 
since they both _knew_ what an "orderly shutdown" was, they _never_ 
analyzed what they were saying. (turns out one procedure left a chip in 
a state that it wouldn't restart until a capacitor discharged, and the 
other procedure did not.)

So yea, when people make statements that "everybody understands" and
those statements don't agree, I start slicing concepts off one at a time...

It's not about "dumb" or "silly" it's about exact and accurate 
descriptions that have been stripped of assumptions and tribal knowledge.

And I don't care if I come off looking like "the bad guy" because I 
don't believe in "the bad guy" at all when it comes to making things 
more clear and getting out of a communications deadlock. My only goal is 
"less broken".

So occasionally annoying... but look... progress!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28  0:25                       ` Robert White
@ 2014-12-28  1:01                         ` Bardur Arantsson
  2014-12-28  4:03                           ` Robert White
  0 siblings, 1 reply; 59+ messages in thread
From: Bardur Arantsson @ 2014-12-28  1:01 UTC (permalink / raw)
  To: linux-btrfs

On 2014-12-28 01:25, Robert White wrote:
> On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
>>> From how you write I get the impression that you think everyone else
>> beside you is just silly and dumb. Please stop this assumption. I may not
>> always get terms right, and I may make a mistake as with the wrong df
>> figure. But I also highly dislike to feel treated like someone who
>> doesn´t
>> know a thing.
> 
> Nope. I'm a systems theorist and I demand/require variable isolation.
> 
> Not a question of "silly" or "dumb" but a question of "speaking with
> sufficient precision and clarity".
> 
> For instance you speak of "having an impression" and then decide I've
> made an assumption.
> 
> I define my position. Explain my terms. Give my examples.
> 
> I also risk being utterly wrong because sometimes being completely wrong
> gets others to cut away misconceptions and assumptions.
> 
> It annoys some people, but it gets results.

Can you please stop this bullshit posturing nonsense? It accomplishes
nothing -- if you're right your other posts will stand for themselves
and show that you are indeed "the shit" when it comes to these matters,
but this post (so far, didn't read further) accomplishes nothing other
than (possibly) convincing everyone that you're a pompous/self-important
ass.

Regards,


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28  1:01                         ` Bardur Arantsson
@ 2014-12-28  4:03                           ` Robert White
  2014-12-28 12:03                             ` Martin Steigerwald
  2014-12-28 12:07                             ` Martin Steigerwald
  0 siblings, 2 replies; 59+ messages in thread
From: Robert White @ 2014-12-28  4:03 UTC (permalink / raw)
  To: Bardur Arantsson, linux-btrfs

On 12/27/2014 05:01 PM, Bardur Arantsson wrote:
> On 2014-12-28 01:25, Robert White wrote:
>> On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
>>>>  From how you write I get the impression that you think everyone else
>>> beside you is just silly and dumb. Please stop this assumption. I may not
>>> always get terms right, and I may make a mistake as with the wrong df
>>> figure. But I also highly dislike to feel treated like someone who
>>> doesn´t
>>> know a thing.
>>
>> Nope. I'm a systems theorist and I demand/require variable isolation.
>>
>> Not a question of "silly" or "dumb" but a question of "speaking with
>> sufficient precision and clarity".
>>
>> For instance you speak of "having an impression" and then decide I've
>> made an assumption.
>>
>> I define my position. Explain my terms. Give my examples.
>>
>> I also risk being utterly wrong because sometimes being completely wrong
>> gets others to cut away misconceptions and assumptions.
>>
>> It annoys some people, but it gets results.
>
> Can you please stop this bullshit posturing nonsense? It accomlishes
> nothing -- if you're right your other posts will stand for themselves
> and show that you are indeed "the shit" when it comes to these matters,
> but this post (so far, didn't read further) accomplishes nothing other
> than (possibly) convincing everyone that you're a pompous/self-important
> ass.

Really? "accomplishes nothing"?

24 hours ago:

the complaining party was talking about

- Windows XP
- Tax software
- Virtual box
- vdi files
- defragging
- balancing
- "data trees"
- system hanging

And the responding party was saying

"you are the only person reporting this as a regular occurrence" with 
the implication that the report was a duplicate or at least might not 
get much immediate attention.

Now:

The complaining party has verified the minimum, repeatable case of 
simple file allocation on a very fragmented system and the responding 
party and several others have understood and supported the bug.

That's not "accomplishing nothing", thats called engaging in diagnostics 
instead of dismissing a complaint, and sticking out the diagnostic 
process until everyone is on the same page.

I never dismissed Martin. I never disbelieved him. I went through his 
elements one at a time with examples of what I was taking away from him 
and why they didn't match expectations and experimental evidence. We 
adjusted our positions and communications.

So you can call it "bullshit posturing nonsense" but I see "taking less 
than a day to get to the bottom of a bug report that might not have 
gotten significant attention."

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28  0:06             ` Robert White
@ 2014-12-28 11:05               ` Martin Steigerwald
  0 siblings, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 11:05 UTC (permalink / raw)
  To: Robert White; +Cc: Hugo Mills, linux-btrfs

Am Samstag, 27. Dezember 2014, 16:06:13 schrieb Robert White:
> >
> >> I also don't know what kind of tool you are using, but it might be
> >> repeatedly trying and failing to fallocate the file as a single
> >> extent or something equally dumb.
> >
> >     Userspace doesn't as far as I know, get to make that decision. I've
> > just read the fallocate(2) man page, and it says nothing at all about
> > the contiguity of the extent(s) storage allocated by the call.
> 
> Yep, my bad. But as soon as I saw that "fio" was starting two threads, 
> one doing random read/write and another doing sequential read/write, 
> both on the same file, it set off my "not just creating a file" mindset. 
> Given the delayed write into/through the cache normally done by casual 
> file io, It seemed likely that fio would be doing something more 
> aggressive (like using O_DIRECT or repeated fdatasync() which could get 
> very tit-for-tat).

Robert, please get to know fio, or *ask*, before jumping to conclusions.

I used this:

[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file

#[seq-write]
#rw=write
#stonewall

[rand-write]
rw=randwrite
stonewall


In the first test I still ran the seq-write job, but did you note the
"stonewall" parameter? It *separates* both jobs from one another. I.e. fio
may be starting two threads, as I think it prepares all threads in advance,
yet it did execute only *one* at a time.

From the manpage of fio:

       stonewall , wait_for_previous
              Wait  for  preceding  jobs  in the job file to exit before
              starting this one.  stonewall implies new_group.

(that said, the first stonewall isn´t even needed, but I removed the read
jobs from the ssd-test.fio example that I based this job on and didn´t
remember to remove the statement)


Thank you a lot for your input. I learned some things from it. For example
that the trees for the data handling are in the metadata section. And now
I am very clear that btrfs fi df does not display any trees but the chunk
reservation and usage. I think I knew this before, but I thought somehow
that was combined with the trees, but it isn´t, at least not in place; the
trees are stored in the metadata chunks. I´d still not call these extents
though, cause that´s a file-based thing as far as I know.

I skip theorizing about algorithms here. I prefer to let measurements
speak and try to understand them. The best approach to understand the ones
I made, I think, is what Hugo suggested: a developer looks at the sysrq-t
outputs. So I personally won´t speculate any further about algorithmic
limitations BTRFS may or may not have.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28  4:03                           ` Robert White
@ 2014-12-28 12:03                             ` Martin Steigerwald
  2014-12-28 17:04                               ` Patrik Lundquist
  2014-12-28 12:07                             ` Martin Steigerwald
  1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 12:03 UTC (permalink / raw)
  To: Robert White; +Cc: Bardur Arantsson, linux-btrfs

Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
> On 12/27/2014 05:01 PM, Bardur Arantsson wrote:
> > On 2014-12-28 01:25, Robert White wrote:
> >> On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
> >>>>  From how you write I get the impression that you think everyone else
> >>> beside you is just silly and dumb. Please stop this assumption. I may not
> >>> always get terms right, and I may make a mistake as with the wrong df
> >>> figure. But I also highly dislike to feel treated like someone who
> >>> doesn´t
> >>> know a thing.
> >>
> >> Nope. I'm a systems theorist and I demand/require variable isolation.
> >>
> >> Not a question of "silly" or "dumb" but a question of "speaking with
> >> sufficient precision and clarity".
> >>
> >> For instance you speak of "having an impression" and then decide I've
> >> made an assumption.
> >>
> >> I define my position. Explain my terms. Give my examples.
> >>
> >> I also risk being utterly wrong because sometimes being completely wrong
> >> gets others to cut away misconceptions and assumptions.
> >>
> >> It annoys some people, but it gets results.
> >
> > Can you please stop this bullshit posturing nonsense? It accomlishes
> > nothing -- if you're right your other posts will stand for themselves
> > and show that you are indeed "the shit" when it comes to these matters,
> > but this post (so far, didn't read further) accomplishes nothing other
> > than (possibly) convincing everyone that you're a pompous/self-important
> > ass.
> 
> Really? "accomplishes nothing"?
> 
> 24 hours ago:
> 
> the complaining party was talking about
> 
> - Windows XP
> - Tax software
> - Virtual box
> - vdi files
> - defragging
> - balancing
> - "data trees"
> - system hanging
> 
> And the responding party was saying
> 
> "you are the only person reporting this as a regular occurrence" with 
> the implication that the report was a duplicate or at least might not 
> get much immediate attention.
> 
> Now:
> 
> The complaining party has verified the minimum, repeatable case of 
> simple file allocation on a very fragmented system and the responding 
> party and several others have understood and supported the bug.

It was repeatable before. That I go from an application case to a simulated
workload case is only natural. Or do you run fio or other load-testing apps
as a part of your daily work on your computer (unless you are actually
diagnosing performance issues)? I still *use* the computer with
applications. And if that´s where I see the performance issue, I report it
as such. Then I think about the kind of workload it creates and go from
there to simplify it to a reproducible case.

At least I read mails, browse the web, run a VM, and do such kinds of
things as daily computer usage. And thus it´s likely that performance issues
show up like this. Heck, even my server does mail and Owncloud and things.

I only use workload generation tools during my teaching or when analysing
things, not as part of my daily computer usage.

And that doesn´t make using a VM any less valid. And if it basically brings
BTRFS to a halt, I report this. It´s actually that easy.

> That's not "accomplishing nothing", thats called engaging in diagnostics 
> instead of dismissing a complaint, and sticking out the diagnostic 
> process until everyone is on the same page.
> 
> I never dismissed Martin. I never disbelieved him. I went through his 
> elements one at a time with examples of what I was taking away from him 
> and why they didn't match expectations and experimental evidence. We 
> adjusted our positions and communications.

Robert, I received this differently. I received your input partly as wronging
me. Granted, that motivated me even more to prove things. But I highly
dislike this kind of motivation, as I think I am motivated by myself. I like
finding causes of performance bottlenecks. And I prefer positive motivation
to negative motivation.

> So you can call it "bullshit posturing nonsense" but I see "taking less 
> than a day to get to the bottom of a bug report that might not have 
> gotten significant attention."

And you attribute all of this to your argumentation?

That´s bold.

See, Robert, your arguments helped with clearing up my understanding in
some parts. Especially on the terms I have not been very familiar with.

I am grateful for that.

It even helped motivate me to do the further tests, as I got the
impression that you had just been arguing that what I am seeing is
just the way BTRFS necessarily is *algorithmically* and that I was just
using it wrongly. But that said: I have an interest myself in resolving
this. I was prepared to give additional input at a given time. But right
on this day I was just fed up with things.

It motivated me to prove the abysmal performance behaviour under a certain
workload.

Robert, your arguments contributed, that´s true. But still I did the work of
the actual measurements. I spent the hours on doing the measurements,
with a slight risk of having to restore from backup in case BTRFS would
mess things up. I was the one bringing BTRFS to the limits where it
actually shows an issue, instead of theorizing about the limit as being
an algorithmic issue or wrong usage.

I expect the process to be iterative. At first I see something, get an
impression and probably a gut feeling. And then I move on from that.

Maybe you are the superguru who has the complete picture at the time
you see an issue. But I see things and then try to make sense of them,
actively allowing for feedback on the way. I start researching then. And
this research is iterative. And yet, I am so bold as to post things on this
mailing list even if they are not yet a fully fledged scientific document.
Even if I didn´t get all BTRFS-specific terms right – but I still had quite
an accurate understanding of what I was seeing. In order for others to
chime in and give ideas.

At first I partially ranted. Yes, I even said so. Cause I am human. That´s
it. I wanted to progress on my tax return and BTRFS messed up. And I
spent literally hours on fixing things then, even copying back the VDI file
from backup as Windows did run chkdsk on it and I wasn´t so sure anymore
about its consistency. Heck, I even succeeded at it. By doing something
that is *not* recommended, but *still* works: the balance. I have been – I´d
say rightfully so – angry about that. If confronted with theory and with
real world perception, I always take what I perceive first. If theory says
a balance is not supposed to help, I´d still balance if I see that it does.
It´s that simple, cause I found it quite fruitless to argue with the world.
If it rains while it shouldn´t, I still prepare myself for the rain when
going out.

And heck, my initial impression still stands, Robert. I have shown a case
where BTRFS becomes CPU bound and basically crawls to a halt, and I
still think this is a (performance) bug. We will see whether it is. And no,
it wasn´t the dirty background ratio or the SSDs being too fast for the CPU,
as you tried to guess (even though I am using 3.18 with that multiqueue
block I/O handling enabled, at least I think it is enabled). (I am aware of
all of this and I am aware of the work in the Linux kernel to support a
million IOPS or more, and that certain PCIe flash drivers at least at some
time circumvented parts of the block I/O layer. But I also know I just have
some SATA SSDs, connected via SATA-300, and that´s a difference in the
amount of IOPS they can actually produce.)

And also, given the history, what I reported is *new*. I saw it like this
for the first time. It doesn´t have a fruitless history of not going to the
root cause. The last thing others and I reported is fixed already: the hangs
in 3.15 and 3.16. I provided information on these as well, if you care to
look it up in the mailing list archives. I tested patches to solve them.
Just check the mailing list archives for this. It´s not even the first BTRFS
issue I helped diagnose.

I simply wasn´t aware that this isn´t a permanent lock, as I gave up waiting
for the desktop to return after some minutes, cause I simply wanted to get
my work done. At a certain point I may not be willing to spend hours to
find the root cause of a problem, but will just work around it.

So I just wrote that I saw this kworker process spinning at 100% CPU
and that my desktop had locked up. I didn´t include the information on the
process state of the desktop processes, but it was basically the dialog on
IRC I had with Hugo which cleared that up. As far as I am aware none of your
argumentation contributed to that. For me it was a hang, cause things
hung. Whether permanent or not? How long do you wait to determine that?


I close with how I like this process to work:

I perceive something, I may have an idea about it, but then I proceed
from there without assuming anyone is "right" or "wrong" about something.

And it is this "wronging" of others that I perceived here, that I think I
received as a message from you – at least that´s how I received some of what
you wrote – and that I want to stop right now. If you didn´t send it, we
misunderstood each other. But that´s how I received some of your arguments: as
dismissive of my 10+ years of Linux experience and my 6-7 years of
actually *teaching* performance analysis & tuning.

Still I find myself knowing nothing at times. And that´s good. Because that
is how I learn. That is how I even came to see values that just didn´t make
any sense in the beginning. In the end they did.

So I am learning. You are learning. Everybody is learning.

At first I see something and describe it as it is… I used unclear terms for it
and it helped to clarify this. And then I try to understand what´s happening.

That is how it works for me.

I very much like to proceed on this with that kind of attitude. And in that
sense I look forward to your valuable input on this as it progresses to
a conclusion.

So can we just assume that we are all experts and beginners at the same
time and capable and helpful and willing to learn and go along like this?

That´s what I call productive.


Okay, that was lengthy, and I have a part in this. Actually I felt offended.
Maybe by misunderstanding you. But that´s how I received some of your
statements.



BTW, I found that the recipe from the Oracle blog didn´t work at all for me. I
completed a cycle of defrag, sdelete -c and VBoxManage compact – not because I
was much interested in it in itself, but also as part of my BTRFS testing – and
it apparently did *nothing* to reduce the size of the file. That was my initial
motivation with it: to reduce the size of the file to make more free space
for BTRFS. And I was using what´s recommended by the company whose
developers develop Virtualbox. Disliking the defragment step from the
beginning (as useless on an SSD). But I thought, heck, if it gives me a smaller
file, all good.
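
For reference, the cycle was roughly as follows – the drive letter and VDI path
are placeholders and the VBoxManage invocation is from memory (VirtualBox 4.x
syntax), so take it as a sketch rather than a transcript. Inside the Windows XP
guest:

defrag c:
sdelete -c c:

Then, with the VM shut down, on the Linux host:

VBoxManage modifyhd /path/to/winxp.vdi --compact

The sdelete pass is supposed to clear the freed blocks so that --compact can
actually drop them from the dynamically allocated VDI – which in my case it
apparently did not.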

Next time I will just give it 10 GiB instead of 20 GiB from the beginning. Or
at some point I will find Linux based tax return software. I wonder what
the authorities would say when I tell them I can´t complete my tax return as
my operating system is not supported by the software necessary for it.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28  4:03                           ` Robert White
  2014-12-28 12:03                             ` Martin Steigerwald
@ 2014-12-28 12:07                             ` Martin Steigerwald
  2014-12-28 14:52                               ` Robert White
  1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 12:07 UTC (permalink / raw)
  To: Robert White; +Cc: Bardur Arantsson, linux-btrfs

Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
> Now:
> 
> The complaining party has verified the minimum, repeatable case of 
> simple file allocation on a very fragmented system and the responding 
> party and several others have understood and supported the bug.

I didn´t yet provide such a test case.

At the moment I can only reproduce this case of a kworker thread using a CPU
for minutes with my /home filesystem.

A minimal test case for me would be to be able to reproduce it with a
fresh BTRFS filesystem. But so far, with my testcase on the fresh BTRFS, I
get 4800 instead of 270 IOPS.
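
What I have in mind is something along these lines – image paths, sizes and the
loop devices are placeholders, and the open question is how to drive such a
fresh filesystem into the fully allocated state where the problem shows up:

truncate -s 10G /tmp/btrfs-a.img /tmp/btrfs-b.img
losetup /dev/loop0 /tmp/btrfs-a.img
losetup /dev/loop1 /tmp/btrfs-b.img
mkfs.btrfs -d raid1 -m raid1 /dev/loop0 /dev/loop1
mkdir -p /mnt/btrfsraid1
mount -o compress=lzo,space_cache /dev/loop0 /mnt/btrfsraid1
cd /mnt/btrfsraid1 && fio ~/ssd-test.fio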

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (further tests)
  2014-12-27 13:55       ` Martin Steigerwald
  2014-12-27 14:54         ` Robert White
@ 2014-12-28 13:00         ` Martin Steigerwald
  2014-12-28 13:40           ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare) Martin Steigerwald
  1 sibling, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 13:00 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 40181 bytes --]

Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> Summarized at
> 
> Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> see below. This is reproducible with fio, no need for Windows XP in
> Virtualbox for reproducing the issue. Next I will try to reproduce with
> a freshly created filesystem.
> 
> 
> Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > Hello!
> > > > > 
> > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > 
> > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > bug
> > > > > report:
> > > > > 
> > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > > 
> > > > > compress=lzo:
> > > > (there is no known problem with skinny metadata, it's actually more
> > > > efficient than the older format. There has been some anecdotes about
> > > > mixing the skinny and fat metadata but nothing has ever been
> > > > demonstrated problematic.)
> > > > 
> > > > > merkaba:~> btrfs fi sh /home
> > > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > 
> > > > >          Total devices 2 FS bytes used 144.41GiB
> > > > >          devid    1 size 160.00GiB used 160.00GiB path
> > > > >          /dev/mapper/msata-home
> > > > >          devid    2 size 160.00GiB used 160.00GiB path
> > > > >          /dev/mapper/sata-home
> > > > > 
> > > > > Btrfs v3.17
> > > > > merkaba:~> btrfs fi df /home
> > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > 
> > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > 
> > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > > cause I know no tax return software for Linux which would be suitable
> > > > > for
> > > > > Germany and I frankly don´t care about the end of security cause all
> > > > > surfing and other network access I will do from the Linux box and I
> > > > > only
> > > > > run the VM behind a firewall).
> > > > 
> > > > > And thus I try the balance dance again:
> > > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > > > 
> > > > "Balancing" is something you should almost never need to do. It is only
> > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > etc.) of for cases when you've radically changed allocation behaviors
> > > > (like you decided to remove all your VM's or you've decided to remove a
> > > > mail spool directory full of thousands of tiny files).
> > > > 
> > > > People run balance all the time because they think they should. They are
> > > > _usually_ incorrect in that belief.
> > > 
> > > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > > device.
> >    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > space. What's more, balance does *not* balance the metadata trees. The
> > remaining space -- 154.97 GiB -- is unstructured storage for file
> > data, and you have some 13 GiB of that available for use.
> > 
> >    Now, since you're seeing lockups when the space on your disks is
> > all allocated I'd say that's a bug. However, you're the *only* person
> > who's reported this as a regular occurrence. Does this happen with all
> > filesystems you have, or just this one?
> > 
> > > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > > from to *extend* a tree.
> > 
> >    It's not a tree. It's simply space allocation. It's not even space
> > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > saying "I'm going to use this piece of disk for this purpose").
> > 
> > > This may be a bug, but this is what I see.
> > > 
> > > And no amount of "you should not balance a BTRFS" will make that
> > > perception go away.
> > > 
> > > See, I see the sun coming out on a morning and you tell me "no, it
> > > doesn´t". Simply that is not going to match my perception.
> > 
> >    Duncan's assertion is correct in its detail. Looking at your space
> 
> Robert's :)
> 
> > usage, I would not suggest that running a balance is something you
> > need to do. Now, since you have these lockups that seem quite
> > repeatable, there's probably a lurking bug in there, but hacking
> > around with balance every time you hit it isn't going to get the
> > problem solved properly.
> > 
> >    I think I would suggest the following:
> > 
> >  - make sure you have some way of logging your dmesg permanently (use
> >    a different filesystem for /var/log, or a serial console, or a
> >    netconsole)
> > 
> >  - when the lockup happens, hit Alt-SysRq-t a few times
> > 
> >  - send the dmesg output here, or post to bugzilla.kernel.org
> > 
> >    That's probably going to give enough information to the developers
> > to work out where the lockup is happening, and is clearly the way
> > forward here.
> 
> And I got it reproduced. *Perfectly* reproduced, I´d say.
> 
> But let me run the whole story:
> 
> 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.

[… story of trying to reproduce with Windows XP defragmenting which was
unsuccessful as BTRFS still had free device space to allocate new chunks
from …]

> But finally I got to:
> 
> merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> Sa 27. Dez 13:26:39 CET 2014
> Label: 'home'  uuid: [some UUID]
>         Total devices 2 FS bytes used 152.83GiB
>         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
>         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> 
> Btrfs v3.17
> Data, RAID1: total=154.97GiB, used=149.58GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.26GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> 
> 
> So I did, if Virtualbox can write randomly in a file, I can too.
> 
> So I did:
> 
> 
> martin@merkaba:~> cat ssd-test.fio 
> [global]
> bs=4k
> #ioengine=libaio
> #iodepth=4
> size=4g
> #direct=1
> runtime=120
> filename=ssd.test.file
> 
> [seq-write]
> rw=write
> stonewall
> 
> [rand-write]
> rw=randwrite
> stonewall
> 
> 
> 
> And got:
> 
> ATOP - merkaba                          2014/12/27  13:41:02                          -----------                           10s elapsed
> PRC |  sys   10.14s |  user   0.38s |  #proc    332  | #trun      2  |  #tslpi   548 |  #tslpu     0 |  #zombie    0  | no  procacct  |
> CPU |  sys     102% |  user      4% |  irq       0%  | idle    295%  |  wait      0% |  guest     0% |  curf 3.10GHz  | curscal  96%  |
> cpu |  sys      76% |  user      0% |  irq       0%  | idle     24%  |  cpu001 w  0% |  guest     0% |  curf 3.20GHz  | curscal  99%  |
> cpu |  sys      24% |  user      1% |  irq       0%  | idle     75%  |  cpu000 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu003 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> CPL |  avg1    0.82 |  avg5    0.78 |  avg15   0.99  |               |  csw     6233 |  intr   12023 |                | numcpu     4  |
> MEM |  tot    15.5G |  free    4.0G |  cache   9.7G  | buff    0.0M  |  slab  333.1M |  shmem 206.6M |  vmbal   0.0M  | hptot   0.0M  |
> SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> LVM |     sata-home |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> DSK |           sda |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv     16 |               |  icmpi      0  | icmpo      0  |
> NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> 
>   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
> 18079      -   martin    martin        2   9.99s    0.00s      0K       0K      0K      16K  --     -  R       1  100%   fio
>  4746      -   martin    martin        2   0.01s    0.14s      0K       0K      0K       0K  --     -  S       2    2%   konsole
>  3291      -   martin    martin        4   0.01s    0.11s      0K       0K      0K       0K  --     -  S       0    1%   plasma-desktop
>  1488      -   root      root          1   0.03s    0.04s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> 10036      -   root      root          1   0.04s    0.02s      0K       0K      0K       0K  --     -  R       2    1%   atop
> 
> while fio was just *laying* out the 4 GiB file. Yes, that´s 100% system CPU
> for 10 seconds while allocating a 4 GiB file on a filesystem like:
> 
> martin@merkaba:~> LANG=C df -hT /home
> Filesystem             Type   Size  Used Avail Use% Mounted on
> /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> 
> where a 4 GiB file should easily fit, no? (And this output is with the 4
> GiB file. So it was even 4 GiB more free before.)
> 
> 
> But it gets even more visible:
> 
> martin@merkaba:~> fio ssd-test.fio
> seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.11
> Starting 2 processes
> Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]       
> 0$ zsh  1$ zsh  2$ zsh  3-$ zsh  4$ zsh  5$* zsh                                   
> 
> 
> yes, that´s 0 IOPS.
> 
> 0 IOPS, as in zero IOPS. For minutes.
> 
> 
> 
> And here is why:
> 
> ATOP - merkaba                          2014/12/27  13:46:52                          -----------                           10s elapsed
> PRC |  sys   10.77s |  user   0.31s |  #proc    334  | #trun      2  |  #tslpi   548 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> CPU |  sys     108% |  user      3% |  irq       0%  | idle    286%  |  wait      2% |  guest     0% |  curf 3.08GHz  | curscal  96%  |
> cpu |  sys      72% |  user      1% |  irq       0%  | idle     28%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys      19% |  user      0% |  irq       0%  | idle     81%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys      11% |  user      1% |  irq       0%  | idle     87%  |  cpu003 w  1% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> cpu |  sys       6% |  user      1% |  irq       0%  | idle     91%  |  cpu002 w  1% |  guest     0% |  curf 3.11GHz  | curscal  97%  |
> CPL |  avg1    2.78 |  avg5    1.34 |  avg15   1.12  |               |  csw    50192 |  intr   32379 |                | numcpu     4  |
> MEM |  tot    15.5G |  free    5.0G |  cache   8.7G  | buff    0.0M  |  slab  332.6M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> LVM |     sata-home |  busy      5% |  read     160  | write  11177  |  KiB/w      3 |  MBr/s   0.06 |  MBw/s   4.36  | avio 0.05 ms  |
> LVM |    msata-home |  busy      4% |  read      28  | write  11177  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   4.36  | avio 0.04 ms  |
> LVM |   sata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> LVM |  msata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> DSK |           sda |  busy      5% |  read     160  | write  10200  |  KiB/w      4 |  MBr/s   0.06 |  MBw/s   4.69  | avio 0.05 ms  |
> DSK |           sdb |  busy      4% |  read      28  | write  10558  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   4.69  | avio 0.04 ms  |
> NET |  transport    |  tcpi      35 |  tcpo      33  | udpi       3  |  udpo       3 |  tcpao      2 |  tcppo      1  | tcprs      0  |
> NET |  network      |  ipi       38 |  ipo       36  | ipfrw      0  |  deliv     38 |               |  icmpi      0  | icmpo      0  |
> NET |  eth0      0% |  pcki      22 |  pcko      20  | si    9 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> 
>   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> 14973      -   root      root          1   8.92s    0.00s      0K       0K      0K     144K  --     -  S       0   89%   kworker/u8:14
> 17450      -   root      root          1   0.86s    0.00s      0K       0K      0K      32K  --     -  R       3    9%   kworker/u8:5
>   788      -   root      root          1   0.25s    0.00s      0K       0K    128K   18880K  --     -  S       3    3%   btrfs-transact
> 12254      -   root      root          1   0.14s    0.00s      0K       0K     64K     576K  --     -  S       2    1%   kworker/u8:3
> 17332      -   root      root          1   0.11s    0.00s      0K       0K    112K    1348K  --     -  S       2    1%   kworker/u8:4
>  3291      -   martin    martin        4   0.01s    0.09s      0K       0K      0K       0K  --     -  S       1    1%   plasma-deskto
> 
> 
> 
> 
> ATOP - merkaba                          2014/12/27  13:47:12                          -----------                           10s elapsed
> PRC |  sys   10.78s |  user   0.44s |  #proc    334  | #trun      3  |  #tslpi   547 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> CPU |  sys     106% |  user      4% |  irq       0%  | idle    288%  |  wait      1% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys      93% |  user      0% |  irq       0%  | idle      7%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys       7% |  user      0% |  irq       0%  | idle     93%  |  cpu003 w  0% |  guest     0% |  curf 3.01GHz  | curscal  94%  |
> cpu |  sys       3% |  user      2% |  irq       0%  | idle     94%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> CPL |  avg1    3.33 |  avg5    1.56 |  avg15   1.20  |               |  csw    38253 |  intr   23104 |                | numcpu     4  |
> MEM |  tot    15.5G |  free    4.9G |  cache   8.7G  | buff    0.0M  |  slab  336.5M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> LVM |    msata-home |  busy      2% |  read       0  | write   2337  |  KiB/w      3 |  MBr/s   0.00 |  MBw/s   0.91  | avio 0.07 ms  |
> LVM |     sata-home |  busy      2% |  read      36  | write   2337  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   0.91  | avio 0.07 ms  |
> LVM |  msata-debian |  busy      1% |  read       1  | write   1630  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.65  | avio 0.03 ms  |
> LVM |   sata-debian |  busy      0% |  read       0  | write   1019  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.41  | avio 0.02 ms  |
> DSK |           sdb |  busy      2% |  read       1  | write   2545  |  KiB/w      5 |  MBr/s   0.00 |  MBw/s   1.45  | avio 0.07 ms  |
> DSK |           sda |  busy      1% |  read      36  | write   2461  |  KiB/w      5 |  MBr/s   0.01 |  MBw/s   1.28  | avio 0.06 ms  |
> NET |  transport    |  tcpi      20 |  tcpo      20  | udpi       1  |  udpo       1 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> NET |  network      |  ipi       21 |  ipo       21  | ipfrw      0  |  deliv     21 |               |  icmpi      0  | icmpo      0  |
> NET |  eth0      0% |  pcki       5 |  pcko       5  | si    0 Kbps  |  so    0 Kbps |  erri       0 |  erro       0  | drpo       0  |
> NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> 
>   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> 17450      -   root      root          1   9.96s    0.00s      0K       0K      0K       0K  --     -  R       2  100%   kworker/u8:5
>  4746      -   martin    martin        2   0.06s    0.15s      0K       0K      0K       0K  --     -  S       1    2%   konsole
> 10508      -   root      root          1   0.13s    0.00s      0K       0K     96K    4048K  --     -  S       1    1%   kworker/u8:18
>  1488      -   root      root          1   0.06s    0.06s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> 17332      -   root      root          1   0.12s    0.00s      0K       0K     96K     580K  --     -  R       3    1%   kworker/u8:4
> 17454      -   root      root          1   0.11s    0.00s      0K       0K     32K    4416K  --     -  D       1    1%   kworker/u8:6
> 17516      -   root      root          1   0.09s    0.00s      0K       0K     16K     136K  --     -  S       3    1%   kworker/u8:7
>  3268      -   martin    martin        3   0.02s    0.05s      0K       0K      0K       0K  --     -  S       1    1%   kwin
> 10036      -   root      root          1   0.05s    0.02s      0K       0K      0K       0K  --     -  R       0    1%   atop
> 
> 
> 
> So BTRFS is basically busy with itself and nothing else. Look at the SSD
> usage. They are *idling* around. Heck, 2400 write accesses in 10 seconds.
> That´s a joke with SSDs that can do 40000 IOPS (depending on how and what
> you measure of course, like request size, read, write, iodepth and so on).
> 
> It´s kworker/u8:5 utilizing 100% of one core for minutes.
> 
> 
> 
> It´s the random write case, it seems. Here are the values from the fio job:
> 
> martin@merkaba:~> fio ssd-test.fio
> seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.1.11
> Starting 2 processes
> Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
>   write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
>     clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
>      lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
>     clat percentiles (usec):
>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
>      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
>      | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    9],
>      | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 2128],
>      | 99.99th=[10304]
>     bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
>     lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
>     lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
>     lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
>   cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Seems fine.
> 
> 
> But:
> 
> rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
>   write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
>     clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
>      lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
>     clat percentiles (usec):
>      |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
>      | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
>      | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
>      | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
>      | 99.99th=[16711680]
>     bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
>     lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
>     lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
>   cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>   WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> 
> 
> What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> 
> What?
> 
> Ey, *what*?
> 
> 
> 
> Repeating with the random write case.
> 
> It´s a different kworker now, but a similar result:
> 
> ATOP - merkaba                          2014/12/27  13:51:48                          -----------                           10s elapsed
> PRC |  sys   10.66s |  user   0.25s |  #proc    330  | #trun      2  |  #tslpi   545 |  #tslpu     2 |  #zombie    0  | no  procacct  |
> CPU |  sys     105% |  user      3% |  irq       0%  | idle    292%  |  wait      0% |  guest     0% |  curf 3.07GHz  | curscal  95%  |
> cpu |  sys      92% |  user      0% |  irq       0%  | idle      8%  |  cpu002 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> cpu |  sys       8% |  user      0% |  irq       0%  | idle     92%  |  cpu003 w  0% |  guest     0% |  curf 3.09GHz  | curscal  96%  |
> cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> cpu |  sys       2% |  user      1% |  irq       0%  | idle     97%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> CPL |  avg1    1.00 |  avg5    1.32 |  avg15   1.23  |               |  csw    34484 |  intr   23182 |                | numcpu     4  |
> MEM |  tot    15.5G |  free    5.4G |  cache   8.3G  | buff    0.0M  |  slab  334.8M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> LVM |     sata-home |  busy      1% |  read      36  | write   2502  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   0.98  | avio 0.06 ms  |
> LVM |    msata-home |  busy      1% |  read      48  | write   2502  |  KiB/w      4 |  MBr/s   0.02 |  MBw/s   0.98  | avio 0.04 ms  |
> LVM |  msata-debian |  busy      0% |  read       0  | write      6  |  KiB/w      7 |  MBr/s   0.00 |  MBw/s   0.00  | avio 1.33 ms  |
> LVM |   sata-debian |  busy      0% |  read       0  | write      6  |  KiB/w      7 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.17 ms  |
> DSK |           sda |  busy      1% |  read      36  | write   2494  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   0.98  | avio 0.06 ms  |
> DSK |           sdb |  busy      1% |  read      48  | write   2494  |  KiB/w      4 |  MBr/s   0.02 |  MBw/s   0.98  | avio 0.04 ms  |
> NET |  transport    |  tcpi      32 |  tcpo      30  | udpi       2  |  udpo       2 |  tcpao      2 |  tcppo      1  | tcprs      0  |
> NET |  network      |  ipi       35 |  ipo       32  | ipfrw      0  |  deliv     35 |               |  icmpi      0  | icmpo      0  |
> NET |  eth0      0% |  pcki      19 |  pcko      16  | si    9 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> 
>   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
> 11746      -   root      root          1  10.00s    0.00s      0K       0K      0K       0K  --     -  R       2  100%   kworker/u8:0
> 12254      -   root      root          1   0.16s    0.00s      0K       0K    112K    1712K  --     -  S       3    2%   kworker/u8:3  
> 17517      -   root      root          1   0.16s    0.00s      0K       0K    144K    1764K  --     -  S       1    2%   kworker/u8:8
> 
> 
> 
> And now the graphical environment is locked. Continuing on TTY1.
> 
> Doing another fio job with tee so I can get output easily.
> 
> Wow! I wonder whether this is reproducible with a fresh BTRFS with fio stressing it.
> 
> Like a 10 GiB BTRFS with 5 GiB fio test file and just letting it run.
> 
> 
> Okay, I let the final fio job complete and include the output here.
> 
> 
> Okay, and there we are and I do have sysrq-t figures. 
> 
> Okay, this is 1.2 MiB xz packed. So I better start a bug report about this
> and attach it there. I dislike cloud URLs that may disappear at some time.
> 
> 
> 
> Now please finally acknowledge that there is an issue. Maybe I was not
> using the correct terms at the beginning, but there is a real issue. I have
> been doing performance work for at least half a decade; I know that there is
> an issue when I see it.
> 
> 
> 
> 
> There we go:
> 
> Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401

I have done more tests.

This is on the same /home after extending it to 170 GiB and balancing it with
btrfs balance start -dusage=80
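
For completeness: I do not have the exact shell history anymore, but the grow
went roughly along these lines – the lvextend and resize steps are from memory,
only the balance command is exactly the one quoted above:

lvextend -L 170G /dev/mapper/msata-home
lvextend -L 170G /dev/mapper/sata-home
btrfs filesystem resize 1:max /home
btrfs filesystem resize 2:max /home
btrfs balance start -dusage=80 /home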

It now has plenty of free space. I updated the bug report and hope it
gives a summary that is easy enough to comprehend. The new tests are in:

https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6



Pasting below for discussion on the list. Summary: I easily get 38000 (!)
IOPS. It may be an idea to reduce to 160 GiB again, but right now this does
not work, as it says no space left on device when trying to downsize it.
I may try with 165 or 162 GiB.

So now we have three IOPS figures:

- 256 IOPS in worst case scenario
- 4700 IOPS when trying to reproduce worst case scenario with a fresh and small
BTRFS
- 38000 IOPS when /home has unused device space to allocate chunks from
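
The deciding factor between these three figures appears to be whether the
devices still have unallocated space to create new chunks from, which is easy
to check with the commands I keep pasting:

# any unallocated device space left? compare "size" and "used" per devid
btrfs fi show /home
# and the slack inside already allocated chunks: compare "total" and "used"
btrfs fi df /home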

https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8


This is another test.

With my /home. This time while it has enough free device space to reserve
new chunks from.

Remember this for the case where it hasn´t:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 144.19GiB
        devid    1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=144.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
  write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
    clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
     lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
     | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
     | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
     | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
     | 99.99th=[16711680]
    bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
    lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
    lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
  cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1
[…]
Run status group 1 (all jobs):
  WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec

That is where I saw this kworker thread at 100% of one Sandybridge core
for minutes. This is also the run that the kern.log with sysrq-t triggers in

https://bugzilla.kernel.org/show_bug.cgi?id=90401#c0

was made from.


But now, as I extended it to 170 GiB and did some basic rebalancing
(up to btrfs balance start -dusage=80), I have this:


First attempt:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 13:13:47 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 151.09GiB
        devid    1 size 170.00GiB used 158.03GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 158.03GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=153.00GiB, used=147.83GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


martin@merkaba:~> fio ssd-test.fio 
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
rand-write: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/84528KB/0KB /s] [0/21.2K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=9987: Sun Dec 28 13:14:32 2014
  write: io=4096.0MB, bw=155304KB/s, iops=38826, runt= 27007msec
    clat (usec): min=5, max=28202, avg=22.03, stdev=240.04
     lat (usec): min=5, max=28202, avg=22.28, stdev=240.16
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[   10], 40.00th=[   11], 50.00th=[   12], 60.00th=[   13],
     | 70.00th=[   14], 80.00th=[   15], 90.00th=[   17], 95.00th=[   23],
     | 99.00th=[   93], 99.50th=[  175], 99.90th=[ 2096], 99.95th=[ 6816],
     | 99.99th=[10176]
    bw (KB  /s): min=76832, max=413616, per=100.00%, avg=156706.75, stdev=101101.26
    lat (usec) : 10=29.85%, 20=62.43%, 50=5.74%, 100=1.07%, 250=0.57%
    lat (usec) : 500=0.16%, 750=0.04%, 1000=0.02%
    lat (msec) : 2=0.02%, 4=0.01%, 10=0.08%, 20=0.01%, 50=0.01%
  cpu          : usr=12.05%, sys=47.34%, ctx=86985, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=155304KB/s, minb=155304KB/s, maxb=155304KB/s, mint=27007msec, maxt=27007msec



Second attempt:



merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 13:16:19 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 155.11GiB
        devid    1 size 170.00GiB used 162.03GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 162.03GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=157.00GiB, used=151.83GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.27GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/113.5MB/0KB /s] [0/29.3K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10043: Sun Dec 28 13:17:34 2014
  write: io=4096.0MB, bw=145995KB/s, iops=36498, runt= 28729msec
    clat (usec): min=4, max=143201, avg=23.95, stdev=518.47
     lat (usec): min=4, max=143201, avg=24.13, stdev=518.48
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    9], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   13], 80.00th=[   13], 90.00th=[   15], 95.00th=[   16],
     | 99.00th=[   33], 99.50th=[   70], 99.90th=[ 5472], 99.95th=[ 8640],
     | 99.99th=[20864]
    bw (KB  /s): min=    4, max=433760, per=100.00%, avg=149179.63, stdev=136784.14
    lat (usec) : 10=38.35%, 20=58.99%, 50=1.96%, 100=0.38%, 250=0.16%
    lat (usec) : 500=0.03%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.08%, 20=0.02%, 50=0.01%
    lat (msec) : 100=0.01%, 250=0.01%
  cpu          : usr=10.25%, sys=42.40%, ctx=42642, majf=0, minf=8
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=145995KB/s, minb=145995KB/s, maxb=145995KB/s, mint=28729msec, maxt=28729msec



Third attempt:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 13:18:24 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 156.16GiB
        devid    1 size 170.00GiB used 160.03GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 160.03GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.00GiB, used=152.83GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.34GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/195.7MB/0KB /s] [0/50.9K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10058: Sun Dec 28 13:18:59 2014
  write: io=4096.0MB, bw=202184KB/s, iops=50545, runt= 20745msec
    clat (usec): min=4, max=28261, avg=15.84, stdev=214.59
     lat (usec): min=4, max=28261, avg=16.06, stdev=214.78
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[    9], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   15], 95.00th=[   17],
     | 99.00th=[   52], 99.50th=[  105], 99.90th=[  426], 99.95th=[  980],
     | 99.99th=[12736]
    bw (KB  /s): min=    4, max=426344, per=100.00%, avg=207355.30, stdev=105104.72
    lat (usec) : 10=41.44%, 20=55.33%, 50=2.17%, 100=0.54%, 250=0.34%
    lat (usec) : 500=0.10%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.01%
  cpu          : usr=14.15%, sys=59.06%, ctx=81711, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=202183KB/s, minb=202183KB/s, maxb=202183KB/s, mint=20745msec, maxt=20745msec


merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 13:19:15 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 155.16GiB
        devid    1 size 170.00GiB used 162.03GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 162.03GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=157.00GiB, used=151.85GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



So here BTRFS was fast. It is getting into trouble on my /home when it´s
almost full. But more so than with an empty 10 GiB filesystem, as I
have shown in the testcase
https://bugzilla.kernel.org/show_bug.cgi?id=90401#c3


There I had:

merkaba:/mnt/btrfsraid1> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
rand-write: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/20924KB/0KB /s] [0/5231/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=6221: Sat Dec 27 15:34:14 2014
  write: io=2645.8MB, bw=22546KB/s, iops=5636, runt=120165msec
    clat (usec): min=4, max=3054.8K, avg=174.87, stdev=11455.26
     lat (usec): min=4, max=3054.8K, avg=175.03, stdev=11455.27
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    6], 10.00th=[    6], 20.00th=[    7],
     | 30.00th=[    7], 40.00th=[    8], 50.00th=[    9], 60.00th=[   10],
     | 70.00th=[   11], 80.00th=[   12], 90.00th=[   14], 95.00th=[   17],
     | 99.00th=[   30], 99.50th=[   40], 99.90th=[ 1992], 99.95th=[25984],
     | 99.99th=[411648]
    bw (KB  /s): min=  168, max=70703, per=100.00%, avg=27325.46, stdev=14887.94
    lat (usec) : 10=55.81%, 20=41.12%, 50=2.70%, 100=0.14%, 250=0.06%
    lat (usec) : 500=0.02%, 750=0.01%, 1000=0.02%
    lat (msec) : 2=0.02%, 4=0.02%, 10=0.02%, 20=0.01%, 50=0.01%
    lat (msec) : 100=0.01%, 250=0.01%, 500=0.02%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%, >=2000=0.01%
  cpu          : usr=1.56%, sys=5.57%, ctx=29822, majf=0, minf=7
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=677303/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=2645.8MB, aggrb=22545KB/s, minb=22545KB/s, maxb=22545KB/s, mint=120165msec, maxt=120165msec



-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare)
  2014-12-28 13:00         ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
@ 2014-12-28 13:40           ` Martin Steigerwald
  2014-12-28 13:56             ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 13:40 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 52581 bytes --]

Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
> Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> > Summarized at
> > 
> > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > 
> > see below. This is reproducible with fio, no need for Windows XP in
> > Virtualbox for reproducing the issue. Next I will try to reproduce with
> > a freshly created filesystem.
> > 
> > 
> > Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > Hello!
> > > > > > 
> > > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > > 
> > > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > > bug
> > > > > > report:
> > > > > > 
> > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > > > 
> > > > > > compress=lzo:
> > > > > (there is no known problem with skinny metadata, it's actually more
> > > > > efficient than the older format. There has been some anecdotes about
> > > > > mixing the skinny and fat metadata but nothing has ever been
> > > > > demonstrated problematic.)
> > > > > 
> > > > > > merkaba:~> btrfs fi sh /home
> > > > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > 
> > > > > >          Total devices 2 FS bytes used 144.41GiB
> > > > > >          devid    1 size 160.00GiB used 160.00GiB path
> > > > > >          /dev/mapper/msata-home
> > > > > >          devid    2 size 160.00GiB used 160.00GiB path
> > > > > >          /dev/mapper/sata-home
> > > > > > 
> > > > > > Btrfs v3.17
> > > > > > merkaba:~> btrfs fi df /home
> > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > 
> > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > > 
> > > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > > > cause I know no tax return software for Linux which would be suitable
> > > > > > for
> > > > > > Germany and I frankly don´t care about the end of security cause all
> > > > > > surfing and other network access I will do from the Linux box and I
> > > > > > only
> > > > > > run the VM behind a firewall).
> > > > > 
> > > > > > And thus I try the balance dance again:
> > > > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > > > > 
> > > > > "Balancing" is something you should almost never need to do. It is only
> > > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > > etc.) of for cases when you've radically changed allocation behaviors
> > > > > (like you decided to remove all your VM's or you've decided to remove a
> > > > > mail spool directory full of thousands of tiny files).
> > > > > 
> > > > > People run balance all the time because they think they should. They are
> > > > > _usually_ incorrect in that belief.
> > > > 
> > > > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > > > device.
> > >    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > space. What's more, balance does *not* balance the metadata trees. The
> > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > data, and you have some 13 GiB of that available for use.
> > > 
> > >    Now, since you're seeing lockups when the space on your disks is
> > > all allocated I'd say that's a bug. However, you're the *only* person
> > > who's reported this as a regular occurrence. Does this happen with all
> > > filesystems you have, or just this one?
> > > 
> > > > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > > > from to *extend* a tree.
> > > 
> > >    It's not a tree. It's simply space allocation. It's not even space
> > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > saying "I'm going to use this piece of disk for this purpose").
> > > 
> > > > This may be a bug, but this is what I see.
> > > > 
> > > > And no amount of "you should not balance a BTRFS" will make that
> > > > perception go away.
> > > > 
> > > > See, I see the sun coming out on a morning and you tell me "no, it
> > > > doesn´t". Simply that is not going to match my perception.
> > > 
> > >    Duncan's assertion is correct in its detail. Looking at your space
> > 
> > Robert's :)
> > 
> > > usage, I would not suggest that running a balance is something you
> > > need to do. Now, since you have these lockups that seem quite
> > > repeatable, there's probably a lurking bug in there, but hacking
> > > around with balance every time you hit it isn't going to get the
> > > problem solved properly.
> > > 
> > >    I think I would suggest the following:
> > > 
> > >  - make sure you have some way of logging your dmesg permanently (use
> > >    a different filesystem for /var/log, or a serial console, or a
> > >    netconsole)
> > > 
> > >  - when the lockup happens, hit Alt-SysRq-t a few times
> > > 
> > >  - send the dmesg output here, or post to bugzilla.kernel.org
> > > 
> > >    That's probably going to give enough information to the developers
> > > to work out where the lockup is happening, and is clearly the way
> > > forward here.
> > 
> > And I got it reproduced. *Perfectly* reproduced, I´d say.
> > 
> > But let me run the whole story:
> > 
> > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> 
> [… story of trying to reproduce with Windows XP defragmenting which was
> unsuccessful as BTRFS still had free device space to allocate new chunks
> from …]
> 
> > But finally I got to:
> > 
> > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > Sa 27. Dez 13:26:39 CET 2014
> > Label: 'home'  uuid: [some UUID]
> >         Total devices 2 FS bytes used 152.83GiB
> >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > 
> > Btrfs v3.17
> > Data, RAID1: total=154.97GiB, used=149.58GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> > 
> > 
> > 
> > So I did, if Virtualbox can write randomly in a file, I can too.
> > 
> > So I did:
> > 
> > 
> > martin@merkaba:~> cat ssd-test.fio 
> > [global]
> > bs=4k
> > #ioengine=libaio
> > #iodepth=4
> > size=4g
> > #direct=1
> > runtime=120
> > filename=ssd.test.file
> > 
> > [seq-write]
> > rw=write
> > stonewall
> > 
> > [rand-write]
> > rw=randwrite
> > stonewall
> > 
> > 
> > 
> > And got:
> > 
> > ATOP - merkaba                          2014/12/27  13:41:02                          -----------                           10s elapsed
> > PRC |  sys   10.14s |  user   0.38s |  #proc    332  | #trun      2  |  #tslpi   548 |  #tslpu     0 |  #zombie    0  | no  procacct  |
> > CPU |  sys     102% |  user      4% |  irq       0%  | idle    295%  |  wait      0% |  guest     0% |  curf 3.10GHz  | curscal  96%  |
> > cpu |  sys      76% |  user      0% |  irq       0%  | idle     24%  |  cpu001 w  0% |  guest     0% |  curf 3.20GHz  | curscal  99%  |
> > cpu |  sys      24% |  user      1% |  irq       0%  | idle     75%  |  cpu000 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu003 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > CPL |  avg1    0.82 |  avg5    0.78 |  avg15   0.99  |               |  csw     6233 |  intr   12023 |                | numcpu     4  |
> > MEM |  tot    15.5G |  free    4.0G |  cache   9.7G  | buff    0.0M  |  slab  333.1M |  shmem 206.6M |  vmbal   0.0M  | hptot   0.0M  |
> > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > LVM |     sata-home |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > DSK |           sda |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv     16 |               |  icmpi      0  | icmpo      0  |
> > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > 
> >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
> > 18079      -   martin    martin        2   9.99s    0.00s      0K       0K      0K      16K  --     -  R       1  100%   fio
> >  4746      -   martin    martin        2   0.01s    0.14s      0K       0K      0K       0K  --     -  S       2    2%   konsole
> >  3291      -   martin    martin        4   0.01s    0.11s      0K       0K      0K       0K  --     -  S       0    1%   plasma-desktop
> >  1488      -   root      root          1   0.03s    0.04s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> > 10036      -   root      root          1   0.04s    0.02s      0K       0K      0K       0K  --     -  R       2    1%   atop
> > 
> > while fio was just *laying* out the 4 GiB file. Yes, that´s 100% system CPU
> > for 10 seconds while allocating a 4 GiB file on a filesystem like:
> > 
> > martin@merkaba:~> LANG=C df -hT /home
> > Filesystem             Type   Size  Used Avail Use% Mounted on
> > /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > 
> > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > GiB file. So it was even 4 GiB more free before.)
> > 
> > 
> > But it gets even more visible:
> > 
> > martin@merkaba:~> fio ssd-test.fio
> > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > fio-2.1.11
> > Starting 2 processes
> > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]       
> > 0$ zsh  1$ zsh  2$ zsh  3-$ zsh  4$ zsh  5$* zsh                                   
> > 
> > 
> > yes, that´s 0 IOPS.
> > 
> > 0 IOPS, as in zero IOPS. For minutes.
> > 
> > 
> > 
> > And here is why:
> > 
> > ATOP - merkaba                          2014/12/27  13:46:52                          -----------                           10s elapsed
> > PRC |  sys   10.77s |  user   0.31s |  #proc    334  | #trun      2  |  #tslpi   548 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > CPU |  sys     108% |  user      3% |  irq       0%  | idle    286%  |  wait      2% |  guest     0% |  curf 3.08GHz  | curscal  96%  |
> > cpu |  sys      72% |  user      1% |  irq       0%  | idle     28%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys      19% |  user      0% |  irq       0%  | idle     81%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys      11% |  user      1% |  irq       0%  | idle     87%  |  cpu003 w  1% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > cpu |  sys       6% |  user      1% |  irq       0%  | idle     91%  |  cpu002 w  1% |  guest     0% |  curf 3.11GHz  | curscal  97%  |
> > CPL |  avg1    2.78 |  avg5    1.34 |  avg15   1.12  |               |  csw    50192 |  intr   32379 |                | numcpu     4  |
> > MEM |  tot    15.5G |  free    5.0G |  cache   8.7G  | buff    0.0M  |  slab  332.6M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > LVM |     sata-home |  busy      5% |  read     160  | write  11177  |  KiB/w      3 |  MBr/s   0.06 |  MBw/s   4.36  | avio 0.05 ms  |
> > LVM |    msata-home |  busy      4% |  read      28  | write  11177  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   4.36  | avio 0.04 ms  |
> > LVM |   sata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > LVM |  msata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > DSK |           sda |  busy      5% |  read     160  | write  10200  |  KiB/w      4 |  MBr/s   0.06 |  MBw/s   4.69  | avio 0.05 ms  |
> > DSK |           sdb |  busy      4% |  read      28  | write  10558  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   4.69  | avio 0.04 ms  |
> > NET |  transport    |  tcpi      35 |  tcpo      33  | udpi       3  |  udpo       3 |  tcpao      2 |  tcppo      1  | tcprs      0  |
> > NET |  network      |  ipi       38 |  ipo       36  | ipfrw      0  |  deliv     38 |               |  icmpi      0  | icmpo      0  |
> > NET |  eth0      0% |  pcki      22 |  pcko      20  | si    9 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > 
> >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > 14973      -   root      root          1   8.92s    0.00s      0K       0K      0K     144K  --     -  S       0   89%   kworker/u8:14
> > 17450      -   root      root          1   0.86s    0.00s      0K       0K      0K      32K  --     -  R       3    9%   kworker/u8:5
> >   788      -   root      root          1   0.25s    0.00s      0K       0K    128K   18880K  --     -  S       3    3%   btrfs-transact
> > 12254      -   root      root          1   0.14s    0.00s      0K       0K     64K     576K  --     -  S       2    1%   kworker/u8:3
> > 17332      -   root      root          1   0.11s    0.00s      0K       0K    112K    1348K  --     -  S       2    1%   kworker/u8:4
> >  3291      -   martin    martin        4   0.01s    0.09s      0K       0K      0K       0K  --     -  S       1    1%   plasma-deskto
> > 
> > 
> > 
> > 
> > ATOP - merkaba                          2014/12/27  13:47:12                          -----------                           10s elapsed
> > PRC |  sys   10.78s |  user   0.44s |  #proc    334  | #trun      3  |  #tslpi   547 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > CPU |  sys     106% |  user      4% |  irq       0%  | idle    288%  |  wait      1% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys      93% |  user      0% |  irq       0%  | idle      7%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys       7% |  user      0% |  irq       0%  | idle     93%  |  cpu003 w  0% |  guest     0% |  curf 3.01GHz  | curscal  94%  |
> > cpu |  sys       3% |  user      2% |  irq       0%  | idle     94%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > CPL |  avg1    3.33 |  avg5    1.56 |  avg15   1.20  |               |  csw    38253 |  intr   23104 |                | numcpu     4  |
> > MEM |  tot    15.5G |  free    4.9G |  cache   8.7G  | buff    0.0M  |  slab  336.5M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > LVM |    msata-home |  busy      2% |  read       0  | write   2337  |  KiB/w      3 |  MBr/s   0.00 |  MBw/s   0.91  | avio 0.07 ms  |
> > LVM |     sata-home |  busy      2% |  read      36  | write   2337  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   0.91  | avio 0.07 ms  |
> > LVM |  msata-debian |  busy      1% |  read       1  | write   1630  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.65  | avio 0.03 ms  |
> > LVM |   sata-debian |  busy      0% |  read       0  | write   1019  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.41  | avio 0.02 ms  |
> > DSK |           sdb |  busy      2% |  read       1  | write   2545  |  KiB/w      5 |  MBr/s   0.00 |  MBw/s   1.45  | avio 0.07 ms  |
> > DSK |           sda |  busy      1% |  read      36  | write   2461  |  KiB/w      5 |  MBr/s   0.01 |  MBw/s   1.28  | avio 0.06 ms  |
> > NET |  transport    |  tcpi      20 |  tcpo      20  | udpi       1  |  udpo       1 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > NET |  network      |  ipi       21 |  ipo       21  | ipfrw      0  |  deliv     21 |               |  icmpi      0  | icmpo      0  |
> > NET |  eth0      0% |  pcki       5 |  pcko       5  | si    0 Kbps  |  so    0 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > 
> >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > 17450      -   root      root          1   9.96s    0.00s      0K       0K      0K       0K  --     -  R       2  100%   kworker/u8:5
> >  4746      -   martin    martin        2   0.06s    0.15s      0K       0K      0K       0K  --     -  S       1    2%   konsole
> > 10508      -   root      root          1   0.13s    0.00s      0K       0K     96K    4048K  --     -  S       1    1%   kworker/u8:18
> >  1488      -   root      root          1   0.06s    0.06s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> > 17332      -   root      root          1   0.12s    0.00s      0K       0K     96K     580K  --     -  R       3    1%   kworker/u8:4
> > 17454      -   root      root          1   0.11s    0.00s      0K       0K     32K    4416K  --     -  D       1    1%   kworker/u8:6
> > 17516      -   root      root          1   0.09s    0.00s      0K       0K     16K     136K  --     -  S       3    1%   kworker/u8:7
> >  3268      -   martin    martin        3   0.02s    0.05s      0K       0K      0K       0K  --     -  S       1    1%   kwin
> > 10036      -   root      root          1   0.05s    0.02s      0K       0K      0K       0K  --     -  R       0    1%   atop
> > 
> > 
> > 
> > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > usage. They are *idling* around. Heck 2400 write accesses in 10 seconds.
> > Thats a joke with SSDs that can do 40000 IOPS (depending on how and what
> > you measure of course, like request size, read, write, iodepth and so).
> > 
> > Its kworker/u8:5 utilizing 100% of one core for minutes.
> > 
> > 
> > 
> > Its the random write case it seems. Here are values from fio job:
> > 
> > martin@merkaba:~> fio ssd-test.fio
> > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > fio-2.1.11
> > Starting 2 processes
> > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> >   write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> >     clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> >      lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> >     clat percentiles (usec):
> >      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
> >      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
> >      | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    9],
> >      | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 2128],
> >      | 99.99th=[10304]
> >     bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> >     lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> >     lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> >     lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> >   cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > 
> > Seems fine.
> > 
> > 
> > But:
> > 
> > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> >   write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> >     clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> >      lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> >     clat percentiles (usec):
> >      |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
> >      | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
> >      | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
> >      | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
> >      | 99.99th=[16711680]
> >     bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> >     lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> >     lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> >   cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > 
> > Run status group 0 (all jobs):
> >   WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > 
> > Run status group 1 (all jobs):
> >   WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> > 
> > 
> > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> > 
> > What?
> > 
> > Ey, *what*?
[…] 
> > There we go:
> > 
> > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> I have done more tests.
> 
> This is on the same /home after extending it to 170 GiB and balancing it to
> btrfs balance start -dusage=80
> 
> It has plenty of free space free. I updated the bug report and hope it can
> give an easy enough to comprehend summary. The new tests are in:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> 
> 
> 
> Pasting below for discussion on list. Summary: I easily get 38000 (!)
> IOPS. It may be an idea to reduce to 160 GiB, but right now this does
> not work as it says no free space on device when trying to downsize it.
> I may try with 165 or 162GiB.
> 
> So now we have three IOPS figures:
> 
> - 256 IOPS in worst case scenario
> - 4700 IOPS when trying to reproduce worst case scenario with a fresh and small
> BTRFS
> - 38000 IOPS when /home has unused device space to allocate chunks from
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> 
> 
> This is another test.


Okay, and this is the last series of tests for today.

Conclusion:

I cannot manage to bring it to its knees as before, but I come close to it.

Still, it is 8000 IOPS instead of 250 IOPS, in a situation that, according to
btrfs fi sh, is even *worse* than before.

That hints at the need to look at free space fragmentation, as in the
beginning the problem started appearing with:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 144.41GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



Yes, that's 13 GiB of free space *within* the chunks.
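
To make the two kinds of "free" explicit, this is just arithmetic on the
btrfs fi sh / btrfs fi df output above (a quick check with bc, nothing
btrfs-specific):

# unallocated raw space btrfs could still carve new chunks from (fi sh: size - used)
echo '160.00 - 160.00' | bc    # 0 GiB per device
# free space inside the already allocated data chunks (fi df: total - used)
echo '154.97 - 141.12' | bc    # 13.85 GiB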

So while I can push the IOPS down by bringing it into a situation where it
cannot reserve additional data chunks again, I cannot recreate the
abysmal 250 IOPS figure this way. Not even with my /home filesystem.

So there is more to it. I think it's important to look into free space
fragmentation. It seems it needs an *aged* filesystem to recreate, and
it seems the balances really helped, as I am not able to recreate the
issue to that extent right now.

So this shows that my original idea about free device space to allocate from
doesn't explain it fully either. It seems to be something going on
within the chunks that explains the worst-case scenario: <300 IOPS, a
kworker using one core for minutes, and a locked-up desktop.

Is there a way to view free space fragmentation in BTRFS?
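
I am not aware of a ready-made command for that. The closest I can think of
is something along these lines; only a sketch, needs root, and the
btrfs-debug-tree output format may differ between btrfs-progs versions, so
treat the numbers with care:

# not free space, but the flip side: how fragmented the test file itself got
filefrag -v ssd.test.file

# very rough look at the holes between allocated extents: dump the extent tree
# and print the gaps between consecutive EXTENT_ITEM keys "(start EXTENT_ITEM length)";
# ignores block group boundaries and skinny METADATA_ITEMs, best done on an
# unmounted filesystem
btrfs-debug-tree /dev/mapper/msata-home 2>/dev/null | \
  awk '/ EXTENT_ITEM / { gsub(/[()]/, "");
         start = $4; len = $6;
         if (prev_end != "" && start > prev_end) print start - prev_end;
         prev_end = start + len }' | sort -n | uniq -c | tail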




Test log follows, also added to bug report:

https://bugzilla.kernel.org/show_bug.cgi?id=90401#c9


Okay, retesting with 

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:01:05 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.15GiB
        devid    1 size 163.00GiB used 159.92GiB path /dev/mapper/msata-home
        devid    2 size 163.00GiB used 159.92GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=154.95GiB, used=151.84GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


That is just about 3 GiB per device (163.00 - 159.92 GiB) left to reserve new
data chunks from.

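For reference, the job file for these runs is the earlier ssd-test.fio trimmed
down to just the random write job. Roughly like this (reconstructed from the
fio headers in the runs below, so a sketch rather than a verbatim copy):

# reconstructed from the fio output below; not the literal file
[global]
bs=4k
size=4g
runtime=120
filename=ssd.test.file

[rand-write]
rw=randwrite
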
First run – all good:

martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/134.2MB/0KB /s] [0/34.4K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10483: Sun Dec 28 14:02:59 2014
  write: io=4096.0MB, bw=218101KB/s, iops=54525, runt= 19231msec
    clat (usec): min=4, max=20056, avg=14.87, stdev=143.15
     lat (usec): min=4, max=20056, avg=15.09, stdev=143.26
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   13], 80.00th=[   14], 90.00th=[   15], 95.00th=[   17],
     | 99.00th=[   52], 99.50th=[   99], 99.90th=[  434], 99.95th=[  980],
     | 99.99th=[ 7968]
    bw (KB  /s): min=62600, max=424456, per=100.00%, avg=218821.63, stdev=93695.28
    lat (usec) : 10=38.19%, 20=58.83%, 50=1.90%, 100=0.59%, 250=0.33%
    lat (usec) : 500=0.09%, 750=0.03%, 1000=0.01%
    lat (msec) : 2=0.02%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01%
  cpu          : usr=15.50%, sys=61.86%, ctx=93432, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=218101KB/s, minb=218101KB/s, maxb=218101KB/s, mint=19231msec, maxt=19231msec




Second run:


merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:04:01 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.23GiB
        devid    1 size 163.00GiB used 160.95GiB path /dev/mapper/msata-home
        devid    2 size 163.00GiB used 160.95GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.98GiB, used=151.91GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.32GiB
GlobalReserve, single: total=512.00MiB, used=0.00B




martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/171.3MB/0KB /s] [0/43.9K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10501: Sun Dec 28 14:05:03 2014
  write: io=4096.0MB, bw=220637KB/s, iops=55159, runt= 19010msec
    clat (usec): min=4, max=20578, avg=14.45, stdev=160.84
     lat (usec): min=4, max=20578, avg=14.65, stdev=160.88
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    7],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   15], 95.00th=[   17],
     | 99.00th=[   42], 99.50th=[   79], 99.90th=[  278], 99.95th=[  620],
     | 99.99th=[ 9792]
    bw (KB  /s): min=    5, max=454816, per=100.00%, avg=224700.32, stdev=100763.29
    lat (usec) : 10=38.15%, 20=58.73%, 50=2.28%, 100=0.47%, 250=0.26%
    lat (usec) : 500=0.06%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01%
  cpu          : usr=15.83%, sys=63.17%, ctx=74934, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=220636KB/s, minb=220636KB/s, maxb=220636KB/s, mint=19010msec, maxt=19010msec



Okay, now try the same without space for a free chunk to allocate.

The testfile is still there, fio doesn´t delete and recreate it on
every attempt, but just writes into it:

martin@merkaba:~> ls -l ssd.test.file 
-rw-r--r-- 1 martin martin 4294967296 Dez 28 14:05 ssd.test.file


Okay – with still one chunk to allocate:

merkaba:~> btrfs filesystem resize 1:161G /home
Resize '/home' of '1:161G'
merkaba:~> btrfs filesystem resize 2:161G /home
Resize '/home' of '2:161G'
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:08:45 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.15GiB
        devid    1 size 161.00GiB used 159.92GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 159.92GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=154.95GiB, used=151.84GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



I would have liked to make it allocate the chunks by other means, but it
eventually frees them again afterwards, so I did it this way.

Note, we still have the original file there. The space it currently
occupies is already taken into account.


Next test:

martin@merkaba:~> fio ssd-test.fio   
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/130.5MB/0KB /s] [0/33.5K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10563: Sun Dec 28 14:10:34 2014
  write: io=4096.0MB, bw=210526KB/s, iops=52631, runt= 19923msec
    clat (usec): min=4, max=21820, avg=14.78, stdev=119.40
     lat (usec): min=4, max=21821, avg=15.03, stdev=120.26
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    9], 40.00th=[   10], 50.00th=[   11], 60.00th=[   11],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   14], 95.00th=[   17],
     | 99.00th=[   62], 99.50th=[  131], 99.90th=[  490], 99.95th=[  964],
     | 99.99th=[ 6816]
    bw (KB  /s): min=    3, max=410480, per=100.00%, avg=216892.84, stdev=95620.33
    lat (usec) : 10=33.20%, 20=63.71%, 50=1.86%, 100=0.59%, 250=0.42%
    lat (usec) : 500=0.12%, 750=0.03%, 1000=0.01%
    lat (msec) : 2=0.02%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
  cpu          : usr=15.13%, sys=62.74%, ctx=94346, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=210525KB/s, minb=210525KB/s, maxb=210525KB/s, mint=19923msec, maxt=19923msec


Okay, this is still good.

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:11:18 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.17GiB
        devid    1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.94GiB, used=151.86GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B




Now there is no space left to reserve additional chunks anymore. Another test:


martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/152.3MB/0KB /s] [0/38.1K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10580: Sun Dec 28 14:13:26 2014
  write: io=4096.0MB, bw=225804KB/s, iops=56450, runt= 18575msec
    clat (usec): min=4, max=16669, avg=13.66, stdev=72.88
     lat (usec): min=4, max=16669, avg=13.89, stdev=73.06
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   13], 80.00th=[   14], 90.00th=[   15], 95.00th=[   20],
     | 99.00th=[   65], 99.50th=[  113], 99.90th=[  314], 99.95th=[  506],
     | 99.99th=[ 2768]
    bw (KB  /s): min=    4, max=444568, per=100.00%, avg=231326.97, stdev=93374.31
    lat (usec) : 10=36.50%, 20=58.44%, 50=3.73%, 100=0.76%, 250=0.44%
    lat (usec) : 500=0.09%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=16.35%, sys=68.39%, ctx=127221, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=225803KB/s, minb=225803KB/s, maxb=225803KB/s, mint=18575msec, maxt=18575msec



Okay, this still does not trigger it.


Another test, and it even freed some chunk space:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:14:21 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.28GiB
        devid    1 size 161.00GiB used 160.85GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.85GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.89GiB, used=151.97GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



Still good:

martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/156.5MB/0KB /s] [0/40.6K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10589: Sun Dec 28 14:14:37 2014
  write: io=4096.0MB, bw=161121KB/s, iops=40280, runt= 26032msec
    clat (usec): min=4, max=1228.9K, avg=15.69, stdev=1205.88
     lat (usec): min=4, max=1228.9K, avg=15.92, stdev=1205.90
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   13], 80.00th=[   14], 90.00th=[   15], 95.00th=[   19],
     | 99.00th=[   53], 99.50th=[   96], 99.90th=[  366], 99.95th=[  764],
     | 99.99th=[ 7776]
    bw (KB  /s): min=    0, max=431680, per=100.00%, avg=219856.30, stdev=98172.64
    lat (usec) : 10=39.24%, 20=55.83%, 50=3.81%, 100=0.63%, 250=0.33%
    lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 2000=0.01%
  cpu          : usr=11.50%, sys=61.08%, ctx=123428, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=161121KB/s, minb=161121KB/s, maxb=161121KB/s, mint=26032msec, maxt=26032msec



Okay, let's allocate one GiB with fallocate to make free space tighter:

martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-1
0.00user 0.09system 0:00.11elapsed 86%CPU (0avgtext+0avgdata 1752maxresident)k
112inputs+64outputs (1major+89minor)pagefaults 0swaps
martin@merkaba:~> ls -lh 1g-1
-rw-r--r-- 1 martin martin 1,0G Dez 28 14:16 1g


merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:16:24 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 156.15GiB
        devid    1 size 161.00GiB used 160.94GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.94GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.97GiB, used=152.84GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


Still not:

martin@merkaba:~> fio ssd-test.fio                  
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/132.1MB/0KB /s] [0/34.4K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10632: Sun Dec 28 14:17:12 2014
  write: io=4096.0MB, bw=198773KB/s, iops=49693, runt= 21101msec
    clat (usec): min=4, max=543255, avg=16.27, stdev=563.85
     lat (usec): min=4, max=543255, avg=16.48, stdev=563.91
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    9], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   14], 95.00th=[   17],
     | 99.00th=[   49], 99.50th=[   98], 99.90th=[  386], 99.95th=[  828],
     | 99.99th=[10816]
    bw (KB  /s): min=    4, max=444848, per=100.00%, avg=203909.07, stdev=109502.11
    lat (usec) : 10=33.97%, 20=62.99%, 50=2.05%, 100=0.51%, 250=0.33%
    lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 750=0.01%
  cpu          : usr=14.21%, sys=60.44%, ctx=70273, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=198772KB/s, minb=198772KB/s, maxb=198772KB/s, mint=21101msec, maxt=21101msec




Another 1G file.

Got it:

ATOP - merkaba                       2014/12/28  14:18:14                       -----------                        10s elapsed
PRC | sys   21.74s | user   2.48s | #proc    382 | #trun      8 |  #tslpi   698 | #tslpu     1 | #zombie    0 | no  procacct |
CPU | sys     218% | user     24% | irq       1% | idle    155% |  wait      2% | guest     0% | curf 3.00GHz | curscal  93% |
cpu | sys      75% | user      5% | irq       0% | idle     19% |  cpu003 w  0% | guest     0% | curf 3.00GHz | curscal  93% |
cpu | sys      59% | user      3% | irq       0% | idle     37% |  cpu001 w  0% | guest     0% | curf 3.00GHz | curscal  93% |
cpu | sys      48% | user      6% | irq       1% | idle     45% |  cpu000 w  1% | guest     0% | curf 3.00GHz | curscal  93% |
cpu | sys      36% | user      9% | irq       0% | idle     54% |  cpu002 w  1% | guest     0% | curf 3.00GHz | curscal  93% |
CPL | avg1    2.13 | avg5    2.37 | avg15   1.92 |              |  csw    67473 | intr   59152 |              | numcpu     4 |
MEM | tot    15.5G | free    1.1G | cache  11.0G | buff    0.1M |  slab  740.2M | shmem 190.9M | vmbal   0.0M | hptot   0.0M |
SWP | tot    12.0G | free   11.4G |              |              |               |              | vmcom   5.4G | vmlim  19.7G |
PAG | scan       0 | steal      0 | stall      1 |              |               |              | swin      19 | swout      0 |
LVM |    sata-home | busy      8% | read       4 | write  26062 |  KiB/w      3 | MBr/s   0.00 | MBw/s  10.18 | avio 0.03 ms |
LVM |   msata-home | busy      5% | read       4 | write  26062 |  KiB/w      3 | MBr/s   0.00 | MBw/s  10.18 | avio 0.02 ms |
LVM |    sata-swap | busy      0% | read      19 | write      0 |  KiB/w      0 | MBr/s   0.01 | MBw/s   0.00 | avio 0.05 ms |
LVM | msata-debian | busy      0% | read       0 | write      4 |  KiB/w      4 | MBr/s   0.00 | MBw/s   0.00 | avio 0.00 ms |
LVM |  sata-debian | busy      0% | read       0 | write      4 |  KiB/w      4 | MBr/s   0.00 | MBw/s   0.00 | avio 0.00 ms |
DSK |          sda | busy      8% | read      23 | write  13239 |  KiB/w      7 | MBr/s   0.01 | MBw/s  10.18 | avio 0.06 ms |
DSK |          sdb | busy      5% | read       4 | write  14360 |  KiB/w      7 | MBr/s   0.00 | MBw/s  10.18 | avio 0.04 ms |
NET | transport    | tcpi      18 | tcpo      18 | udpi       0 |  udpo       0 | tcpao      1 | tcppo      1 | tcprs      0 |
NET | network      | ipi       18 | ipo       18 | ipfrw      0 |  deliv     18 |              | icmpi      0 | icmpo      0 |
NET | eth0      0% | pcki       2 | pcko       2 | si    0 Kbps |  so    0 Kbps | erri       0 | erro       0 | drpo       0 |
NET | lo      ---- | pcki      16 | pcko      16 | si    2 Kbps |  so    2 Kbps | erri       0 | erro       0 | drpo       0 |

  PID   TID  RUID      EUID       THR  SYSCPU  USRCPU   VGROW   RGROW   RDDSK   WRDSK  ST  EXC  S  CPUNR   CPU  CMD        1/5
10657     -  martin    martin       1   9.88s   0.00s      0K      0K      0K     48K  --    -  R      1   99%  fallocate
 9685     -  root      root         1   9.84s   0.00s      0K      0K      0K      0K  --    -  D      0   99%  kworker/u8:10



martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-2 ; ls -l 1g*
0.00user 59.28system 1:00.21elapsed 98%CPU (0avgtext+0avgdata 1756maxresident)k
0inputs+416outputs (0major+90minor)pagefaults 0swaps
-rw-r--r-- 1 martin martin 1073741824 Dez 28 14:16 1g-1
-rw-r--r-- 1 martin martin 1073741824 Dez 28 14:17 1g-2


One minute of system CPU time for this single 1 GiB fallocate.
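
For the bug report it would probably also help to capture what that kworker is
actually doing while it spins, in the spirit of Hugo's SysRq-T suggestion. A
sketch, with the pid taken from the atop output above (9685 here), run as root
and with perf installed:

cat /proc/9685/stack                   # current kernel stack of the spinning kworker
perf record -g -p 9685 -- sleep 10     # sample it for ten seconds
perf report                            # see where the CPU time goes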



Okay, so now another test:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:19:30 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 157.18GiB
        devid    1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.94GiB, used=153.87GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



I admit, this now really isn´t nice to it anymore, but I want to see
where it starts to become an issue.



ATOP - merkaba                                  2014/12/28  14:21:18                                  -----------                                  1s elapsed
PRC | sys    1.15s | user   0.16s | #proc    382  | #trun      2 | #tslpi   707 | #tslpu     1 | #zombie    0 | clones     0  |              | no  procacct |
CPU | sys     163% | user     24% | irq       1%  | idle    189% | wait     26% |              | steal     0% | guest     0%  | curf 3.01GHz | curscal  94% |
cpu | sys      72% | user      1% | irq       0%  | idle     25% | cpu001 w  1% |              | steal     0% | guest     0%  | curf 3.00GHz | curscal  93% |
cpu | sys      41% | user      9% | irq       0%  | idle     32% | cpu002 w 19% |              | steal     0% | guest     0%  | curf 3.00GHz | curscal  93% |
cpu | sys      34% | user     10% | irq       0%  | idle     53% | cpu003 w  3% |              | steal     0% | guest     0%  | curf 3.03GHz | curscal  94% |
cpu | sys      16% | user      4% | irq       0%  | idle     77% | cpu000 w  3% |              | steal     0% | guest     0%  | curf 3.00GHz | curscal  93% |
CPL | avg1    2.37 | avg5    2.64 | avg15   2.13  |              |              | csw    18687 | intr   12435 |               |              | numcpu     4 |
MEM | tot    15.5G | free    2.5G | cache   9.5G  | buff    0.1M | slab  742.6M | shmem 242.8M | shrss 115.5M | vmbal   0.0M  | hptot   0.0M | hpuse   0.0M |
SWP | tot    12.0G | free   11.4G |               |              |              |              |              |               | vmcom   5.5G | vmlim  19.7G |
LVM |   msata-home | busy     71% | read      28  | write   8134 | KiB/r      4 | KiB/w      3 | MBr/s   0.11 | MBw/s  31.68  | avq    13.21 | avio 0.06 ms |
LVM |    sata-home | busy     40% | read      72  | write   8135 | KiB/r      4 | KiB/w      3 | MBr/s   0.28 | MBw/s  31.69  | avq    41.67 | avio 0.03 ms |
DSK |          sdb | busy     71% | read      24  | write   6049 | KiB/r      4 | KiB/w      5 | MBr/s   0.11 | MBw/s  31.68  | avq     5.64 | avio 0.08 ms |
DSK |          sda | busy     38% | read      60  | write   5987 | KiB/r      4 | KiB/w      5 | MBr/s   0.28 | MBw/s  31.69  | avq    20.40 | avio 0.04 ms |
NET | transport    | tcpi      16 | tcpo      16  | udpi       0 | udpo       0 | tcpao      1 | tcppo      1 | tcprs      0  | tcpie      0 | udpie      0 |
NET | network      | ipi       16 | ipo       16  | ipfrw      0 | deliv     16 |              |              |               | icmpi      0 | icmpo      0 |
NET | lo      ---- | pcki      16 | pcko      16  | si   20 Kbps | so   20 Kbps | coll       0 | erri       0 | erro       0  | drpi       0 | drpo       0 |

  PID     TID    RUID        EUID         THR    SYSCPU    USRCPU     VGROW     RGROW    RDDSK     WRDSK    ST    EXC    S    CPUNR     CPU    CMD        1/2
10459       -    root        root           1     0.70s     0.00s        0K        0K       0K        0K    --      -    D        3    100%    kworker/u8:17
10674       -    martin      martin         1     0.20s     0.01s        0K        0K       0K    27504K    --      -    R        0     30%    fio



Okay.

It is jumping between 0 IOPS and 40000 IOPS, with one kworker hogging 100% of
a core.



Still quite okay in terms of IOPS, though:

martin@merkaba:~> fio ssd-test.fio                              
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/126.8MB/0KB /s] [0/32.5K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10674: Sun Dec 28 14:22:16 2014
  write: io=3801.3MB, bw=32415KB/s, iops=8103, runt=120083msec
    clat (usec): min=4, max=1809.9K, avg=83.88, stdev=3615.98
     lat (usec): min=4, max=1809.9K, avg=84.10, stdev=3616.00
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    9], 40.00th=[   10], 50.00th=[   11], 60.00th=[   11],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   15], 95.00th=[   18],
     | 99.00th=[   52], 99.50th=[  124], 99.90th=[24704], 99.95th=[30592],
     | 99.99th=[57088]
    bw (KB  /s): min=    0, max=417544, per=100.00%, avg=48302.16, stdev=89108.07
    lat (usec) : 10=35.61%, 20=60.17%, 50=3.17%, 100=0.47%, 250=0.27%
    lat (usec) : 500=0.05%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.04%, 50=0.16%
    lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=2.37%, sys=29.74%, ctx=202984, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=973128/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=3801.3MB, aggrb=32415KB/s, minb=32415KB/s, maxb=32415KB/s, mint=120083msec, maxt=120083msec



I stop this here now.


It's interesting to see that even with:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:23:11 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 157.89GiB
        devid    1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.94GiB, used=154.59GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

I am not able to fully reproduce it.


I can partly reproduce it, but it behaves way better than before.



I think to go further one needs to have a look at the free space
fragmentation inside the chunks.

As in the beginning I had:

I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
space_cache, skinny meta data extents – are these a problem? – and
compress=lzo:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 144.41GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



So I had quite some free space *within* the chunks and it still was a
problem.










-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
  2014-12-28 13:40           ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare) Martin Steigerwald
@ 2014-12-28 13:56             ` Martin Steigerwald
  2014-12-28 15:00               ` Martin Steigerwald
  2014-12-29  9:25               ` Martin Steigerwald
  0 siblings, 2 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 13:56 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 27801 bytes --]

Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
> Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
> > Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> > > Summarized at
> > > 
> > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > 
> > > see below. This is reproducable with fio, no need for Windows XP in
> > > Virtualbox for reproducing the issue. Next I will try to reproduce with
> > > a freshly creating filesystem.
> > > 
> > > 
> > > Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > > Hello!
> > > > > > > 
> > > > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > > > 
> > > > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > > > bug
> > > > > > > report:
> > > > > > > 
> > > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > > > > 
> > > > > > > compress=lzo:
> > > > > > (there is no known problem with skinny metadata, it's actually more
> > > > > > efficient than the older format. There has been some anecdotes about
> > > > > > mixing the skinny and fat metadata but nothing has ever been
> > > > > > demonstrated problematic.)
> > > > > > 
> > > > > > > merkaba:~> btrfs fi sh /home
> > > > > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > > 
> > > > > > >          Total devices 2 FS bytes used 144.41GiB
> > > > > > >          devid    1 size 160.00GiB used 160.00GiB path
> > > > > > >          /dev/mapper/msata-home
> > > > > > >          devid    2 size 160.00GiB used 160.00GiB path
> > > > > > >          /dev/mapper/sata-home
> > > > > > > 
> > > > > > > Btrfs v3.17
> > > > > > > merkaba:~> btrfs fi df /home
> > > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > > 
> > > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > > > 
> > > > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > > > > cause I know no tax return software for Linux which would be suitable
> > > > > > > for
> > > > > > > Germany and I frankly don´t care about the end of security cause all
> > > > > > > surfing and other network access I will do from the Linux box and I
> > > > > > > only
> > > > > > > run the VM behind a firewall).
> > > > > > 
> > > > > > > And thus I try the balance dance again:
> > > > > > ITEM: Balance... it doesn't do what you think it does... 
> > > > > > 
> > > > > > "Balancing" is something you should almost never need to do. It is only
> > > > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > > > etc.) of for cases when you've radically changed allocation behaviors
> > > > > > (like you decided to remove all your VM's or you've decided to remove a
> > > > > > mail spool directory full of thousands of tiny files).
> > > > > > 
> > > > > > People run balance all the time because they think they should. They are
> > > > > > _usually_ incorrect in that belief.
> > > > > 
> > > > > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > > > > device.
> > > >    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > > space. What's more, balance does *not* balance the metadata trees. The
> > > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > > data, and you have some 13 GiB of that available for use.
> > > > 
> > > >    Now, since you're seeing lockups when the space on your disks is
> > > > all allocated I'd say that's a bug. However, you're the *only* person
> > > > who's reported this as a regular occurrence. Does this happen with all
> > > > filesystems you have, or just this one?
> > > > 
> > > > > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > > > > from to *extend* a tree.
> > > > 
> > > >    It's not a tree. It's simply space allocation. It's not even space
> > > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > > saying "I'm going to use this piece of disk for this purpose").
> > > > 
> > > > > This may be a bug, but this is what I see.
> > > > > 
> > > > > And no amount of "you should not balance a BTRFS" will make that
> > > > > perception go away.
> > > > > 
> > > > > See, I see the sun coming out on a morning and you tell me "no, it
> > > > > doesn´t". Simply that is not going to match my perception.
> > > > 
> > > >    Duncan's assertion is correct in its detail. Looking at your space
> > > 
> > > Robert's 
> > > 
> > > > usage, I would not suggest that running a balance is something you
> > > > need to do. Now, since you have these lockups that seem quite
> > > > repeatable, there's probably a lurking bug in there, but hacking
> > > > around with balance every time you hit it isn't going to get the
> > > > problem solved properly.
> > > > 
> > > >    I think I would suggest the following:
> > > > 
> > > >  - make sure you have some way of logging your dmesg permanently (use
> > > >    a different filesystem for /var/log, or a serial console, or a
> > > >    netconsole)
> > > > 
> > > >  - when the lockup happens, hit Alt-SysRq-t a few times
> > > > 
> > > >  - send the dmesg output here, or post to bugzilla.kernel.org
> > > > 
> > > >    That's probably going to give enough information to the developers
> > > > to work out where the lockup is happening, and is clearly the way
> > > > forward here.
> > > 
> > > And I got it reproduced. *Perfectly* reproduced, I´d say.
> > > 
> > > But let me run the whole story:
> > > 
> > > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> > 
> > [… story of trying to reproduce with Windows XP defragmenting which was
> > unsuccessful as BTRFS still had free device space to allocate new chunks
> > from …]
> > 
> > > But finally I got to:
> > > 
> > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > Sa 27. Dez 13:26:39 CET 2014
> > > Label: 'home'  uuid: [some UUID]
> > >         Total devices 2 FS bytes used 152.83GiB
> > >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > > 
> > > Btrfs v3.17
> > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > 
> > > 
> > > 
> > > So I did, if Virtualbox can write randomly in a file, I can too.
> > > 
> > > So I did:
> > > 
> > > 
> > > martin@merkaba:~> cat ssd-test.fio 
> > > [global]
> > > bs=4k
> > > #ioengine=libaio
> > > #iodepth=4
> > > size=4g
> > > #direct=1
> > > runtime=120
> > > filename=ssd.test.file
> > > 
> > > [seq-write]
> > > rw=write
> > > stonewall
> > > 
> > > [rand-write]
> > > rw=randwrite
> > > stonewall
> > > 
> > > 
> > > 
> > > And got:
> > > 
> > > ATOP - merkaba                          2014/12/27  13:41:02                          -----------                           10s elapsed
> > > PRC |  sys   10.14s |  user   0.38s |  #proc    332  | #trun      2  |  #tslpi   548 |  #tslpu     0 |  #zombie    0  | no  procacct  |
> > > CPU |  sys     102% |  user      4% |  irq       0%  | idle    295%  |  wait      0% |  guest     0% |  curf 3.10GHz  | curscal  96%  |
> > > cpu |  sys      76% |  user      0% |  irq       0%  | idle     24%  |  cpu001 w  0% |  guest     0% |  curf 3.20GHz  | curscal  99%  |
> > > cpu |  sys      24% |  user      1% |  irq       0%  | idle     75%  |  cpu000 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu003 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > CPL |  avg1    0.82 |  avg5    0.78 |  avg15   0.99  |               |  csw     6233 |  intr   12023 |                | numcpu     4  |
> > > MEM |  tot    15.5G |  free    4.0G |  cache   9.7G  | buff    0.0M  |  slab  333.1M |  shmem 206.6M |  vmbal   0.0M  | hptot   0.0M  |
> > > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > LVM |     sata-home |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > > DSK |           sda |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > > NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > > NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv     16 |               |  icmpi      0  | icmpo      0  |
> > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > 
> > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
> > > 18079      -   martin    martin        2   9.99s    0.00s      0K       0K      0K      16K  --     -  R       1  100%   fio
> > >  4746      -   martin    martin        2   0.01s    0.14s      0K       0K      0K       0K  --     -  S       2    2%   konsole
> > >  3291      -   martin    martin        4   0.01s    0.11s      0K       0K      0K       0K  --     -  S       0    1%   plasma-desktop
> > >  1488      -   root      root          1   0.03s    0.04s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> > > 10036      -   root      root          1   0.04s    0.02s      0K       0K      0K       0K  --     -  R       2    1%   atop
> > > 
> > > while fio was just *laying* out the 4 GiB file. Yes, thats 100% system CPU
> > > for 10 seconds while allocatiing a 4 GiB file on a filesystem like:
> > > 
> > > martin@merkaba:~> LANG=C df -hT /home
> > > Filesystem             Type   Size  Used Avail Use% Mounted on
> > > /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > > 
> > > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > > GiB file. So it was even 4 GiB more free before.)
> > > 
> > > 
> > > But it gets even more visible:
> > > 
> > > martin@merkaba:~> fio ssd-test.fio
> > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > fio-2.1.11
> > > Starting 2 processes
> > > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]       
> > > 0$ zsh  1$ zsh  2$ zsh  3-$ zsh  4$ zsh  5$* zsh                                   
> > > 
> > > 
> > > yes, thats 0 IOPS.
> > > 
> > > 0 IOPS and in zero IOPS. For minutes.
> > > 
> > > 
> > > 
> > > And here is why:
> > > 
> > > ATOP - merkaba                          2014/12/27  13:46:52                          -----------                           10s elapsed
> > > PRC |  sys   10.77s |  user   0.31s |  #proc    334  | #trun      2  |  #tslpi   548 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > > CPU |  sys     108% |  user      3% |  irq       0%  | idle    286%  |  wait      2% |  guest     0% |  curf 3.08GHz  | curscal  96%  |
> > > cpu |  sys      72% |  user      1% |  irq       0%  | idle     28%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > cpu |  sys      19% |  user      0% |  irq       0%  | idle     81%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > cpu |  sys      11% |  user      1% |  irq       0%  | idle     87%  |  cpu003 w  1% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > > cpu |  sys       6% |  user      1% |  irq       0%  | idle     91%  |  cpu002 w  1% |  guest     0% |  curf 3.11GHz  | curscal  97%  |
> > > CPL |  avg1    2.78 |  avg5    1.34 |  avg15   1.12  |               |  csw    50192 |  intr   32379 |                | numcpu     4  |
> > > MEM |  tot    15.5G |  free    5.0G |  cache   8.7G  | buff    0.0M  |  slab  332.6M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > LVM |     sata-home |  busy      5% |  read     160  | write  11177  |  KiB/w      3 |  MBr/s   0.06 |  MBw/s   4.36  | avio 0.05 ms  |
> > > LVM |    msata-home |  busy      4% |  read      28  | write  11177  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   4.36  | avio 0.04 ms  |
> > > LVM |   sata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > > LVM |  msata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > > DSK |           sda |  busy      5% |  read     160  | write  10200  |  KiB/w      4 |  MBr/s   0.06 |  MBw/s   4.69  | avio 0.05 ms  |
> > > DSK |           sdb |  busy      4% |  read      28  | write  10558  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   4.69  | avio 0.04 ms  |
> > > NET |  transport    |  tcpi      35 |  tcpo      33  | udpi       3  |  udpo       3 |  tcpao      2 |  tcppo      1  | tcprs      0  |
> > > NET |  network      |  ipi       38 |  ipo       36  | ipfrw      0  |  deliv     38 |               |  icmpi      0  | icmpo      0  |
> > > NET |  eth0      0% |  pcki      22 |  pcko      20  | si    9 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > 
> > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > > 14973      -   root      root          1   8.92s    0.00s      0K       0K      0K     144K  --     -  S       0   89%   kworker/u8:14
> > > 17450      -   root      root          1   0.86s    0.00s      0K       0K      0K      32K  --     -  R       3    9%   kworker/u8:5
> > >   788      -   root      root          1   0.25s    0.00s      0K       0K    128K   18880K  --     -  S       3    3%   btrfs-transact
> > > 12254      -   root      root          1   0.14s    0.00s      0K       0K     64K     576K  --     -  S       2    1%   kworker/u8:3
> > > 17332      -   root      root          1   0.11s    0.00s      0K       0K    112K    1348K  --     -  S       2    1%   kworker/u8:4
> > >  3291      -   martin    martin        4   0.01s    0.09s      0K       0K      0K       0K  --     -  S       1    1%   plasma-deskto
> > > 
> > > 
> > > 
> > > 
> > > ATOP - merkaba                          2014/12/27  13:47:12                          -----------                           10s elapsed
> > > PRC |  sys   10.78s |  user   0.44s |  #proc    334  | #trun      3  |  #tslpi   547 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > > CPU |  sys     106% |  user      4% |  irq       0%  | idle    288%  |  wait      1% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > cpu |  sys      93% |  user      0% |  irq       0%  | idle      7%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > cpu |  sys       7% |  user      0% |  irq       0%  | idle     93%  |  cpu003 w  0% |  guest     0% |  curf 3.01GHz  | curscal  94%  |
> > > cpu |  sys       3% |  user      2% |  irq       0%  | idle     94%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > CPL |  avg1    3.33 |  avg5    1.56 |  avg15   1.20  |               |  csw    38253 |  intr   23104 |                | numcpu     4  |
> > > MEM |  tot    15.5G |  free    4.9G |  cache   8.7G  | buff    0.0M  |  slab  336.5M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > LVM |    msata-home |  busy      2% |  read       0  | write   2337  |  KiB/w      3 |  MBr/s   0.00 |  MBw/s   0.91  | avio 0.07 ms  |
> > > LVM |     sata-home |  busy      2% |  read      36  | write   2337  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   0.91  | avio 0.07 ms  |
> > > LVM |  msata-debian |  busy      1% |  read       1  | write   1630  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.65  | avio 0.03 ms  |
> > > LVM |   sata-debian |  busy      0% |  read       0  | write   1019  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.41  | avio 0.02 ms  |
> > > DSK |           sdb |  busy      2% |  read       1  | write   2545  |  KiB/w      5 |  MBr/s   0.00 |  MBw/s   1.45  | avio 0.07 ms  |
> > > DSK |           sda |  busy      1% |  read      36  | write   2461  |  KiB/w      5 |  MBr/s   0.01 |  MBw/s   1.28  | avio 0.06 ms  |
> > > NET |  transport    |  tcpi      20 |  tcpo      20  | udpi       1  |  udpo       1 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > > NET |  network      |  ipi       21 |  ipo       21  | ipfrw      0  |  deliv     21 |               |  icmpi      0  | icmpo      0  |
> > > NET |  eth0      0% |  pcki       5 |  pcko       5  | si    0 Kbps  |  so    0 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > 
> > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > > 17450      -   root      root          1   9.96s    0.00s      0K       0K      0K       0K  --     -  R       2  100%   kworker/u8:5
> > >  4746      -   martin    martin        2   0.06s    0.15s      0K       0K      0K       0K  --     -  S       1    2%   konsole
> > > 10508      -   root      root          1   0.13s    0.00s      0K       0K     96K    4048K  --     -  S       1    1%   kworker/u8:18
> > >  1488      -   root      root          1   0.06s    0.06s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> > > 17332      -   root      root          1   0.12s    0.00s      0K       0K     96K     580K  --     -  R       3    1%   kworker/u8:4
> > > 17454      -   root      root          1   0.11s    0.00s      0K       0K     32K    4416K  --     -  D       1    1%   kworker/u8:6
> > > 17516      -   root      root          1   0.09s    0.00s      0K       0K     16K     136K  --     -  S       3    1%   kworker/u8:7
> > >  3268      -   martin    martin        3   0.02s    0.05s      0K       0K      0K       0K  --     -  S       1    1%   kwin
> > > 10036      -   root      root          1   0.05s    0.02s      0K       0K      0K       0K  --     -  R       0    1%   atop
> > > 
> > > 
> > > 
> > > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > > usage. They are *idling* around. Heck 2400 write accesses in 10 seconds.
> > > Thats a joke with SSDs that can do 40000 IOPS (depending on how and what
> > > you measure of course, like request size, read, write, iodepth and so).
> > > 
> > > Its kworker/u8:5 utilizing 100% of one core for minutes.
> > > 
> > > 
> > > 
> > > Its the random write case it seems. Here are values from fio job:
> > > 
> > > martin@merkaba:~> fio ssd-test.fio
> > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > fio-2.1.11
> > > Starting 2 processes
> > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> > >   write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> > >     clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> > >      lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> > >     clat percentiles (usec):
> > >      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
> > >      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
> > >      | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    9],
> > >      | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 2128],
> > >      | 99.99th=[10304]
> > >     bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> > >     lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> > >     lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> > >     lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> > >   cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > >      issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > 
> > > Seems fine.
> > > 
> > > 
> > > But:
> > > 
> > > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> > >   write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> > >     clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> > >      lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> > >     clat percentiles (usec):
> > >      |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
> > >      | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
> > >      | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
> > >      | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
> > >      | 99.99th=[16711680]
> > >     bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> > >     lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> > >     lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> > >   cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > >      issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > 
> > > Run status group 0 (all jobs):
> > >   WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > > 
> > > Run status group 1 (all jobs):
> > >   WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> > > 
> > > 
> > > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> > > 
> > > What?
> > > 
> > > Ey, *what*?
> […] 
> > > There we go:
> > > 
> > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > 
> > I have done more tests.
> > 
> > This is on the same /home after extending it to 170 GiB and balancing it to
> > btrfs balance start -dusage=80
> > 
> > It has plenty of free space free. I updated the bug report and hope it can
> > give an easy enough to comprehend summary. The new tests are in:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> > 
> > 
> > 
> > Pasting below for discussion on list. Summary: I easily get 38000 (!)
> > IOPS. It may be an idea to reduce to 160 GiB, but right now this does
> > not work as it says no free space on device when trying to downsize it.
> > I may try with 165 or 162GiB.
> > 
> > So now we have three IOPS figures:
> > 
> > - 256 IOPS in worst case scenario
> > - 4700 IOPS when trying to reproduce worst case scenario with a fresh and small
> > BTRFS
> > - 38000 IOPS when /home has unused device space to allocate chunks from
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> > 
> > 
> > This is another test.
> 
> 
> Okay, and this is the last series of tests for today.
> 
> Conclusion:
> 
> I cannot manage to get it down to the knees as before, but I come near to it.
> 
> Still its 8000 IOPS, instead of 250 IOPS, in an according to btrfs fi sh
> even *worse* situation than before.
> 
> That hints me at the need to look at the free space fragmentation, as in the
> beginning the problem started appearing with:
> 
> merkaba:~> btrfs fi sh /home
> Label: 'home'  uuid: […]
>         Total devices 2 FS bytes used 144.41GiB
>         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
>         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> 
> Btrfs v3.17
> merkaba:~> btrfs fi df /home
> Data, RAID1: total=154.97GiB, used=141.12GiB
> System, RAID1: total=32.00MiB, used=48.00KiB
> Metadata, RAID1: total=5.00GiB, used=3.29GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> 
> 
> Yes, thats 13 GiB of free space *within* the chunks.
> 
> So while I can get it down in IOPS by bringing it to a situation where it
> can not reserve additional data chunks again, I cannot recreate the
> abysmal 250 IOPS figure by this. Not even with my /home filesystem.
> 
> So there is more to it. I think its important to look into free space
> fragmentation. It seems it needs an *aged* filesystem to recreate. At
> it seems the balances really helped. As I am not able to recreate the
> issue to that extent right now.
> 
> So this shows my original idea about free device space to allocate from
> also doesn´t explain it fully. It seems to be something thats going on
> within the chunks that explains the worst case <300 IOPS, kworker using
> one core for minutes and desktop locked scenario.
> 
> Is there a way to view free space fragmentation in BTRFS?

So to rephrase that:

From what I perceive, the worst case issue happens when both of the following hold:

1) BTRFS cannot reserve any new chunks from unused device space anymore.

2) The free space in the existing chunks is highly fragmented.

Either of those conditions on its own is not sufficient to trigger it.

That's at least my current idea about it.
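
(For what it's worth, condition 1 can be checked quickly from the btrfs fi
show output -- a rough, untested sketch that assumes the output format
shown above:)

btrfs fi show /home | awk '
    /devid/ {
        # $4 = device size, $6 = allocated ("used"), $8 = device path
        if ($4 == $6)
            print "  " $8 ": fully allocated (" $6 ")"
        else
            print "  " $8 ": " $4 " total, " $6 " allocated"
    }'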

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28 12:07                             ` Martin Steigerwald
@ 2014-12-28 14:52                               ` Robert White
  2014-12-28 15:42                                 ` Martin Steigerwald
  0 siblings, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-28 14:52 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Bardur Arantsson, linux-btrfs

On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
>> Now:
>>
>> The complaining party has verified the minimum, repeatable case of
>> simple file allocation on a very fragmented system and the responding
>> party and several others have understood and supported the bug.
>
> I didn´t yet provide such a test case.

My bad.

>
> At the moment I can only reproduce this kworker thread using a CPU for
> minutes case with my /home filesystem.
>
> A mininmal test case for me would be to be able to reproduce it with a
> fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
> get 4800 instead of 270 IOPS.
>

A version of the test case to demonstrate absolutely system-clogging 
loads is pretty easy to construct.

Make a raid1 filesystem.
Balance it once to make sure the seed filesystem is fully integrated.

Create a bunch of small files that are at least 4K in size, but are 
randomly sized. Fill the entire filesystem with them.

BASH Script:
typeset -i counter=0
# create randomly sized files (4 KiB .. ~36 KiB) until the filesystem is
# full; dd then fails with ENOSPC and the while loop exits
while
  dd if=/dev/urandom of=/mnt/Work/$((++counter)) \
     bs=$((4096 + $RANDOM)) count=1 2>/dev/null
do
  echo $counter >/dev/null #basically a noop
done

The while will exit when the dd encounters a full filesystem.

Then delete ~10% of the files with
rm *0

Run the while loop again, then delete a different 10% with "rm *1".

Then again with rm *2, etc...

Do this a few times, and with each iteration the CPU usage gets worse and 
worse. You'll easily get system-wide stalls on all IO tasks lasting ten 
or more seconds.
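
(Purely as an illustration, the whole fill/delete cycle could be wrapped up
like this -- an untested sketch, with /mnt/Work being the mount point used
above:)

typeset -i counter=0
fill() {
    # create randomly sized files until dd hits ENOSPC
    while
      dd if=/dev/urandom of=/mnt/Work/$((++counter)) \
         bs=$((4096 + $RANDOM)) count=1 2>/dev/null
    do
      :
    done
}

for digit in 0 1 2 3 4 5 6 7 8 9
do
    fill                        # refill the filesystem to ENOSPC
    rm -f /mnt/Work/*"$digit"   # then delete the ~10% of files ending in $digit
done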

I don't have enough spare storage to do this directly, so I used 
loopback devices. First I did it with the loopback files in COW mode. 
Then I did it again with the files in NOCOW mode. (the COW files got 
thick with overwrite real fast. 8-)
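
(For completeness, the loopback setup I mean is roughly this -- just a
sketch, sizes and paths are made up; the +C attribute has to be set while
the backing files are still empty:)

mkdir -p /var/tmp/btrfs-test /mnt/Work
touch /var/tmp/btrfs-test/a.img /var/tmp/btrfs-test/b.img
chattr +C /var/tmp/btrfs-test/a.img /var/tmp/btrfs-test/b.img  # NOCOW backing files
truncate -s 5G /var/tmp/btrfs-test/a.img /var/tmp/btrfs-test/b.img
losetup /dev/loop0 /var/tmp/btrfs-test/a.img
losetup /dev/loop1 /var/tmp/btrfs-test/b.img
mkfs.btrfs -d raid1 -m raid1 /dev/loop0 /dev/loop1
mount /dev/loop0 /mnt/Work
btrfs balance start /mnt/Work   # one full balance, as described above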

So anyway...

After I got through all ten digits on the rm (that is removing *0, then 
refilling, then *1 etc...) I figured the FS image was nicely fragmented.

At that point it was very easy to spike the kworker to 100% CPU with

dd if=/dev/urandom of=/mnt/Work/scratch bs=40k

The dd would read 40k (a CPU spike for /dev/urandom processing), then it 
would write the 40k and the kworker would peg 100% on one CPU and stay 
there for a while. Then it would be back to the /dev/urandom spike.

So this laptop has been carefully detuned to prevent certain kinds of 
stalls (particularly the movablecore= reservation, as previously 
mentioned, to prevent non-responsiveness of the UI) and I had to go 
through /dev/loop, which had a smoothing effect... but yep, there were 
clear kworker spikes that _did_ stop the IO path (the system monitor app, 
for instance, could not get I/O statistics for ten- and fifteen-second 
intervals and would stop logging/scrolling).

Progressively larger block sizes on the write path made things 
progressively worse...

dd if=/dev/urandom of=/mnt/Work/scratch bs=160k


And overwriting the file by just invoking dd again was worse still 
(presumably from the juggling act), before resulting in a net 
out-of-space condition.

Switching from /dev/urandom to /dev/zero for writing the large file made 
things worse still -- probably since there were no respites for the 
kworker to catch up etc.

ASIDE: Playing with /proc/sys/vm/dirty_{background_,}ratio had lots of 
interesting and difficult-to-quantify effects on user-space 
applications. Cutting them in half (5 and 10 instead of 10 and 20 
respectively) seemed to give some relief, but going further got harmful 
quickly. Letting the two numbers diverge had odd effects too. But it 
seemed a little brittle to play with these numbers.
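
(For reference, the kind of tweak I mean -- the values are just the halved
example from above, not a recommendation:)

sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
# or equivalently:
echo 5  > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio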

SUPER FREAKY THING...

Every time I removed and recreated "scratch" I would get _radically_ 
different results for how much I could write into that remaining space 
and how long it took to do so. In theory I am reusing the exact same 
storage again and again. I'm not doing compression (the underlying 
filesystem behind the loop devices has compression, but that would be 
disabled by the +C attribute). It's not enough space coming-and-going to 
cause data extents to be reclaimed or displaced by metadata. And the 
filesystem is otherwise completely unused.

But check it out...

Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 1.4952 s, 186 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 292.135 s, 953 kB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
dd: error writing ‘/mnt/Work/scratch’: No space left on device
93+0 records in
92+0 records out
15073280 bytes (15 MB) copied, 0.0453977 s, 332 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
dd: error writing ‘/mnt/Work/scratch’: No space left on device
1090+0 records in
1089+0 records out
178421760 bytes (178 MB) copied, 115.991 s, 1.5 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
dd: error writing ‘/mnt/Work/scratch’: No space left on device
332+0 records in
331+0 records out
54231040 bytes (54 MB) copied, 30.1589 s, 1.8 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
dd: error writing ‘/mnt/Work/scratch’: No space left on device
622+0 records in
621+0 records out
101744640 bytes (102 MB) copied, 37.4813 s, 2.7 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 121.863 s, 2.3 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 24.2909 s, 11.5 MB/s
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
dd: error writing ‘/mnt/Work/scratch’: No space left on device
1709+0 records in
1708+0 records out
279838720 bytes (280 MB) copied, 139.538 s, 2.0 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
dd: error writing ‘/mnt/Work/scratch’: No space left on device
1424+0 records in
1423+0 records out
233144320 bytes (233 MB) copied, 102.257 s, 2.3 MB/s
Gust Work #

(and so on)

So...

Repeatable: yes.
Problematic: yes.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
  2014-12-28 13:56             ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
@ 2014-12-28 15:00               ` Martin Steigerwald
  2014-12-29  9:25               ` Martin Steigerwald
  1 sibling, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 15:00 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 30838 bytes --]

Am Sonntag, 28. Dezember 2014, 14:56:21 schrieb Martin Steigerwald:
> Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
> > Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
> > > Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> > > > Summarized at
> > > > 
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > > 
> > > > see below. This is reproducable with fio, no need for Windows XP in
> > > > Virtualbox for reproducing the issue. Next I will try to reproduce with
> > > > a freshly creating filesystem.
> > > > 
> > > > 
> > > > Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > > > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > > > Hello!
> > > > > > > > 
> > > > > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > > > > 
> > > > > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > > > > bug
> > > > > > > > report:
> > > > > > > > 
> > > > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > > > > > 
> > > > > > > > compress=lzo:
> > > > > > > (there is no known problem with skinny metadata, it's actually more
> > > > > > > efficient than the older format. There has been some anecdotes about
> > > > > > > mixing the skinny and fat metadata but nothing has ever been
> > > > > > > demonstrated problematic.)
> > > > > > > 
> > > > > > > > merkaba:~> btrfs fi sh /home
> > > > > > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > > > 
> > > > > > > >          Total devices 2 FS bytes used 144.41GiB
> > > > > > > >          devid    1 size 160.00GiB used 160.00GiB path
> > > > > > > >          /dev/mapper/msata-home
> > > > > > > >          devid    2 size 160.00GiB used 160.00GiB path
> > > > > > > >          /dev/mapper/sata-home
> > > > > > > > 
> > > > > > > > Btrfs v3.17
> > > > > > > > merkaba:~> btrfs fi df /home
> > > > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > > > 
> > > > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > > > > 
> > > > > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > > > > > cause I know no tax return software for Linux which would be suitable
> > > > > > > > for
> > > > > > > > Germany and I frankly don´t care about the end of security cause all
> > > > > > > > surfing and other network access I will do from the Linux box and I
> > > > > > > > only
> > > > > > > > run the VM behind a firewall).
> > > > > > > 
> > > > > > > > And thus I try the balance dance again:
> > > > > > > ITEM: Balance... it doesn't do what you think it does... 
> > > > > > > 
> > > > > > > "Balancing" is something you should almost never need to do. It is only
> > > > > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > > > > etc.) of for cases when you've radically changed allocation behaviors
> > > > > > > (like you decided to remove all your VM's or you've decided to remove a
> > > > > > > mail spool directory full of thousands of tiny files).
> > > > > > > 
> > > > > > > People run balance all the time because they think they should. They are
> > > > > > > _usually_ incorrect in that belief.
> > > > > > 
> > > > > > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > > > > > device.
> > > > >    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > > > space. What's more, balance does *not* balance the metadata trees. The
> > > > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > > > data, and you have some 13 GiB of that available for use.
> > > > > 
> > > > >    Now, since you're seeing lockups when the space on your disks is
> > > > > all allocated I'd say that's a bug. However, you're the *only* person
> > > > > who's reported this as a regular occurrence. Does this happen with all
> > > > > filesystems you have, or just this one?
> > > > > 
> > > > > > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > > > > > from to *extend* a tree.
> > > > > 
> > > > >    It's not a tree. It's simply space allocation. It's not even space
> > > > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > > > saying "I'm going to use this piece of disk for this purpose").
> > > > > 
> > > > > > This may be a bug, but this is what I see.
> > > > > > 
> > > > > > And no amount of "you should not balance a BTRFS" will make that
> > > > > > perception go away.
> > > > > > 
> > > > > > See, I see the sun coming out on a morning and you tell me "no, it
> > > > > > doesn´t". Simply that is not going to match my perception.
> > > > > 
> > > > >    Duncan's assertion is correct in its detail. Looking at your space
> > > > 
> > > > Robert's 
> > > > 
> > > > > usage, I would not suggest that running a balance is something you
> > > > > need to do. Now, since you have these lockups that seem quite
> > > > > repeatable, there's probably a lurking bug in there, but hacking
> > > > > around with balance every time you hit it isn't going to get the
> > > > > problem solved properly.
> > > > > 
> > > > >    I think I would suggest the following:
> > > > > 
> > > > >  - make sure you have some way of logging your dmesg permanently (use
> > > > >    a different filesystem for /var/log, or a serial console, or a
> > > > >    netconsole)
> > > > > 
> > > > >  - when the lockup happens, hit Alt-SysRq-t a few times
> > > > > 
> > > > >  - send the dmesg output here, or post to bugzilla.kernel.org
> > > > > 
> > > > >    That's probably going to give enough information to the developers
> > > > > to work out where the lockup is happening, and is clearly the way
> > > > > forward here.
> > > > 
> > > > And I got it reproduced. *Perfectly* reproduced, I´d say.
> > > > 
> > > > But let me run the whole story:
> > > > 
> > > > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> > > 
> > > [… story of trying to reproduce with Windows XP defragmenting which was
> > > unsuccessful as BTRFS still had free device space to allocate new chunks
> > > from …]
> > > 
> > > > But finally I got to:
> > > > 
> > > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > > Sa 27. Dez 13:26:39 CET 2014
> > > > Label: 'home'  uuid: [some UUID]
> > > >         Total devices 2 FS bytes used 152.83GiB
> > > >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > > >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > > > 
> > > > Btrfs v3.17
> > > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > 
> > > > 
> > > > 
> > > > So I did, if Virtualbox can write randomly in a file, I can too.
> > > > 
> > > > So I did:
> > > > 
> > > > 
> > > > martin@merkaba:~> cat ssd-test.fio 
> > > > [global]
> > > > bs=4k
> > > > #ioengine=libaio
> > > > #iodepth=4
> > > > size=4g
> > > > #direct=1
> > > > runtime=120
> > > > filename=ssd.test.file
> > > > 
> > > > [seq-write]
> > > > rw=write
> > > > stonewall
> > > > 
> > > > [rand-write]
> > > > rw=randwrite
> > > > stonewall
> > > > 
> > > > 
> > > > 
> > > > And got:
> > > > 
> > > > ATOP - merkaba                          2014/12/27  13:41:02                          -----------                           10s elapsed
> > > > PRC |  sys   10.14s |  user   0.38s |  #proc    332  | #trun      2  |  #tslpi   548 |  #tslpu     0 |  #zombie    0  | no  procacct  |
> > > > CPU |  sys     102% |  user      4% |  irq       0%  | idle    295%  |  wait      0% |  guest     0% |  curf 3.10GHz  | curscal  96%  |
> > > > cpu |  sys      76% |  user      0% |  irq       0%  | idle     24%  |  cpu001 w  0% |  guest     0% |  curf 3.20GHz  | curscal  99%  |
> > > > cpu |  sys      24% |  user      1% |  irq       0%  | idle     75%  |  cpu000 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > > > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu003 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > CPL |  avg1    0.82 |  avg5    0.78 |  avg15   0.99  |               |  csw     6233 |  intr   12023 |                | numcpu     4  |
> > > > MEM |  tot    15.5G |  free    4.0G |  cache   9.7G  | buff    0.0M  |  slab  333.1M |  shmem 206.6M |  vmbal   0.0M  | hptot   0.0M  |
> > > > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > > LVM |     sata-home |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > > > DSK |           sda |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > > > NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > > > NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv     16 |               |  icmpi      0  | icmpo      0  |
> > > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > 
> > > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
> > > > 18079      -   martin    martin        2   9.99s    0.00s      0K       0K      0K      16K  --     -  R       1  100%   fio
> > > >  4746      -   martin    martin        2   0.01s    0.14s      0K       0K      0K       0K  --     -  S       2    2%   konsole
> > > >  3291      -   martin    martin        4   0.01s    0.11s      0K       0K      0K       0K  --     -  S       0    1%   plasma-desktop
> > > >  1488      -   root      root          1   0.03s    0.04s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> > > > 10036      -   root      root          1   0.04s    0.02s      0K       0K      0K       0K  --     -  R       2    1%   atop
> > > > 
> > > > while fio was just *laying* out the 4 GiB file. Yes, thats 100% system CPU
> > > > for 10 seconds while allocatiing a 4 GiB file on a filesystem like:
> > > > 
> > > > martin@merkaba:~> LANG=C df -hT /home
> > > > Filesystem             Type   Size  Used Avail Use% Mounted on
> > > > /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > > > 
> > > > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > > > GiB file. So it was even 4 GiB more free before.)
> > > > 
> > > > 
> > > > But it gets even more visible:
> > > > 
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]       
> > > > 0$ zsh  1$ zsh  2$ zsh  3-$ zsh  4$ zsh  5$* zsh                                   
> > > > 
> > > > 
> > > > yes, thats 0 IOPS.
> > > > 
> > > > 0 IOPS and in zero IOPS. For minutes.
> > > > 
> > > > 
> > > > 
> > > > And here is why:
> > > > 
> > > > ATOP - merkaba                          2014/12/27  13:46:52                          -----------                           10s elapsed
> > > > PRC |  sys   10.77s |  user   0.31s |  #proc    334  | #trun      2  |  #tslpi   548 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > > > CPU |  sys     108% |  user      3% |  irq       0%  | idle    286%  |  wait      2% |  guest     0% |  curf 3.08GHz  | curscal  96%  |
> > > > cpu |  sys      72% |  user      1% |  irq       0%  | idle     28%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys      19% |  user      0% |  irq       0%  | idle     81%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys      11% |  user      1% |  irq       0%  | idle     87%  |  cpu003 w  1% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > > > cpu |  sys       6% |  user      1% |  irq       0%  | idle     91%  |  cpu002 w  1% |  guest     0% |  curf 3.11GHz  | curscal  97%  |
> > > > CPL |  avg1    2.78 |  avg5    1.34 |  avg15   1.12  |               |  csw    50192 |  intr   32379 |                | numcpu     4  |
> > > > MEM |  tot    15.5G |  free    5.0G |  cache   8.7G  | buff    0.0M  |  slab  332.6M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > > > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > > LVM |     sata-home |  busy      5% |  read     160  | write  11177  |  KiB/w      3 |  MBr/s   0.06 |  MBw/s   4.36  | avio 0.05 ms  |
> > > > LVM |    msata-home |  busy      4% |  read      28  | write  11177  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   4.36  | avio 0.04 ms  |
> > > > LVM |   sata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > > > LVM |  msata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > > > DSK |           sda |  busy      5% |  read     160  | write  10200  |  KiB/w      4 |  MBr/s   0.06 |  MBw/s   4.69  | avio 0.05 ms  |
> > > > DSK |           sdb |  busy      4% |  read      28  | write  10558  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   4.69  | avio 0.04 ms  |
> > > > NET |  transport    |  tcpi      35 |  tcpo      33  | udpi       3  |  udpo       3 |  tcpao      2 |  tcppo      1  | tcprs      0  |
> > > > NET |  network      |  ipi       38 |  ipo       36  | ipfrw      0  |  deliv     38 |               |  icmpi      0  | icmpo      0  |
> > > > NET |  eth0      0% |  pcki      22 |  pcko      20  | si    9 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > 
> > > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > > > 14973      -   root      root          1   8.92s    0.00s      0K       0K      0K     144K  --     -  S       0   89%   kworker/u8:14
> > > > 17450      -   root      root          1   0.86s    0.00s      0K       0K      0K      32K  --     -  R       3    9%   kworker/u8:5
> > > >   788      -   root      root          1   0.25s    0.00s      0K       0K    128K   18880K  --     -  S       3    3%   btrfs-transact
> > > > 12254      -   root      root          1   0.14s    0.00s      0K       0K     64K     576K  --     -  S       2    1%   kworker/u8:3
> > > > 17332      -   root      root          1   0.11s    0.00s      0K       0K    112K    1348K  --     -  S       2    1%   kworker/u8:4
> > > >  3291      -   martin    martin        4   0.01s    0.09s      0K       0K      0K       0K  --     -  S       1    1%   plasma-deskto
> > > > 
> > > > 
> > > > 
> > > > 
> > > > ATOP - merkaba                          2014/12/27  13:47:12                          -----------                           10s elapsed
> > > > PRC |  sys   10.78s |  user   0.44s |  #proc    334  | #trun      3  |  #tslpi   547 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > > > CPU |  sys     106% |  user      4% |  irq       0%  | idle    288%  |  wait      1% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys      93% |  user      0% |  irq       0%  | idle      7%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys       7% |  user      0% |  irq       0%  | idle     93%  |  cpu003 w  0% |  guest     0% |  curf 3.01GHz  | curscal  94%  |
> > > > cpu |  sys       3% |  user      2% |  irq       0%  | idle     94%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > CPL |  avg1    3.33 |  avg5    1.56 |  avg15   1.20  |               |  csw    38253 |  intr   23104 |                | numcpu     4  |
> > > > MEM |  tot    15.5G |  free    4.9G |  cache   8.7G  | buff    0.0M  |  slab  336.5M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > > > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > > LVM |    msata-home |  busy      2% |  read       0  | write   2337  |  KiB/w      3 |  MBr/s   0.00 |  MBw/s   0.91  | avio 0.07 ms  |
> > > > LVM |     sata-home |  busy      2% |  read      36  | write   2337  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   0.91  | avio 0.07 ms  |
> > > > LVM |  msata-debian |  busy      1% |  read       1  | write   1630  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.65  | avio 0.03 ms  |
> > > > LVM |   sata-debian |  busy      0% |  read       0  | write   1019  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.41  | avio 0.02 ms  |
> > > > DSK |           sdb |  busy      2% |  read       1  | write   2545  |  KiB/w      5 |  MBr/s   0.00 |  MBw/s   1.45  | avio 0.07 ms  |
> > > > DSK |           sda |  busy      1% |  read      36  | write   2461  |  KiB/w      5 |  MBr/s   0.01 |  MBw/s   1.28  | avio 0.06 ms  |
> > > > NET |  transport    |  tcpi      20 |  tcpo      20  | udpi       1  |  udpo       1 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > > > NET |  network      |  ipi       21 |  ipo       21  | ipfrw      0  |  deliv     21 |               |  icmpi      0  | icmpo      0  |
> > > > NET |  eth0      0% |  pcki       5 |  pcko       5  | si    0 Kbps  |  so    0 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > 
> > > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > > > 17450      -   root      root          1   9.96s    0.00s      0K       0K      0K       0K  --     -  R       2  100%   kworker/u8:5
> > > >  4746      -   martin    martin        2   0.06s    0.15s      0K       0K      0K       0K  --     -  S       1    2%   konsole
> > > > 10508      -   root      root          1   0.13s    0.00s      0K       0K     96K    4048K  --     -  S       1    1%   kworker/u8:18
> > > >  1488      -   root      root          1   0.06s    0.06s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> > > > 17332      -   root      root          1   0.12s    0.00s      0K       0K     96K     580K  --     -  R       3    1%   kworker/u8:4
> > > > 17454      -   root      root          1   0.11s    0.00s      0K       0K     32K    4416K  --     -  D       1    1%   kworker/u8:6
> > > > 17516      -   root      root          1   0.09s    0.00s      0K       0K     16K     136K  --     -  S       3    1%   kworker/u8:7
> > > >  3268      -   martin    martin        3   0.02s    0.05s      0K       0K      0K       0K  --     -  S       1    1%   kwin
> > > > 10036      -   root      root          1   0.05s    0.02s      0K       0K      0K       0K  --     -  R       0    1%   atop
> > > > 
> > > > 
> > > > 
> > > > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > > > usage. They are *idling* around. Heck 2400 write accesses in 10 seconds.
> > > > Thats a joke with SSDs that can do 40000 IOPS (depending on how and what
> > > > you measure of course, like request size, read, write, iodepth and so).
> > > > 
> > > > Its kworker/u8:5 utilizing 100% of one core for minutes.
> > > > 
> > > > 
> > > > 
> > > > Its the random write case it seems. Here are values from fio job:
> > > > 
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> > > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> > > >   write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> > > >     clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> > > >      lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> > > >     clat percentiles (usec):
> > > >      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
> > > >      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
> > > >      | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    9],
> > > >      | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 2128],
> > > >      | 99.99th=[10304]
> > > >     bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> > > >     lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> > > >     lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> > > >     lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> > > >   cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > > 
> > > > Seems fine.
> > > > 
> > > > 
> > > > But:
> > > > 
> > > > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> > > >   write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> > > >     clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> > > >      lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> > > >     clat percentiles (usec):
> > > >      |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
> > > >      | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
> > > >      | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
> > > >      | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
> > > >      | 99.99th=[16711680]
> > > >     bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> > > >     lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> > > >     lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> > > >   cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > > 
> > > > Run status group 0 (all jobs):
> > > >   WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > > > 
> > > > Run status group 1 (all jobs):
> > > >   WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> > > > 
> > > > 
> > > > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> > > > 
> > > > What?
> > > > 
> > > > Ey, *what*?
> > […] 
> > > > There we go:
> > > > 
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > 
> > > I have done more tests.
> > > 
> > > This is on the same /home after extending it to 170 GiB and balancing it to
> > > btrfs balance start -dusage=80
> > > 
> > > It has plenty of free space free. I updated the bug report and hope it can
> > > give an easy enough to comprehend summary. The new tests are in:
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> > > 
> > > 
> > > 
> > > Pasting below for discussion on list. Summary: I easily get 38000 (!)
> > > IOPS. It may be an idea to reduce to 160 GiB, but right now this does
> > > not work as it says no free space on device when trying to downsize it.
> > > I may try with 165 or 162GiB.
> > > 
> > > So now we have three IOPS figures:
> > > 
> > > - 256 IOPS in worst case scenario
> > > - 4700 IOPS when trying to reproduce worst case scenario with a fresh and small
> > > BTRFS
> > > - 38000 IOPS when /home has unused device space to allocate chunks from
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> > > 
> > > 
> > > This is another test.
> > 
> > 
> > Okay, and this is the last series of tests for today.
> > 
> > Conclusion:
> > 
> > I cannot manage to get it down to the knees as before, but I come near to it.
> > 
> > Still its 8000 IOPS, instead of 250 IOPS, in an according to btrfs fi sh
> > even *worse* situation than before.
> > 
> > That hints me at the need to look at the free space fragmentation, as in the
> > beginning the problem started appearing with:
> > 
> > merkaba:~> btrfs fi sh /home
> > Label: 'home'  uuid: […]
> >         Total devices 2 FS bytes used 144.41GiB
> >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > 
> > Btrfs v3.17
> > merkaba:~> btrfs fi df /home
> > Data, RAID1: total=154.97GiB, used=141.12GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> > 
> > 
> > 
> > Yes, thats 13 GiB of free space *within* the chunks.
> > 
> > So while I can get it down in IOPS by bringing it to a situation where it
> > can not reserve additional data chunks again, I cannot recreate the
> > abysmal 250 IOPS figure by this. Not even with my /home filesystem.
> > 
> > So there is more to it. I think its important to look into free space
> > fragmentation. It seems it needs an *aged* filesystem to recreate. At
> > it seems the balances really helped. As I am not able to recreate the
> > issue to that extent right now.
> > 
> > So this shows my original idea about free device space to allocate from
> > also doesn´t explain it fully. It seems to be something thats going on
> > within the chunks that explains the worst case <300 IOPS, kworker using
> > one core for minutes and desktop locked scenario.
> > 
> > Is there a way to view free space fragmentation in BTRFS?
> 
> So to rephrase that:
> 
> From what I perceive the worst case issue happens when
> 
> 1) BTRFS cannot reserve any new chunks from unused device space anymore.
> 
> 2) The free space in the existing chunks is highly fragmented.
> 
> Only one of those conditions is not sufficient to trigger it.
> 
> Thats at least my current idea about it.

One more note about the IOPS. I currently let fio run with:

martin@merkaba:~> cat ssd-test.fio 
[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file

#[seq-write]
#rw=write
#stonewall

[rand-write]
rw=randwrite
stonewall


This is using buffered I/O via read()/write() system calls, so these
IOPS do not reflect the raw device capabilities. I specifically wanted
to test through the page cache to simulate what I see with Virtualbox
writing to the VDI file (i.e. dirty pages piling up and dirty_background_ratio
in effect). Just like with a real app.

But that also means that the IOPS may be higher because fio ends before all
of the writes have been completed to disk.

So when I reach <300 IOPS with buffered writes, it means that even through
the page cache BTRFS was not able to yield higher IOPS.

But it also means that I measure write requests like an application would
be doing (unless it uses fsync() or direct I/O, which it seems to me
Virtualbox doesn´t, at least not with every request).

Just wanted to make that explicit. It's basically visible in the job file
from what I commented out in there, but still, I thought I'd mention it.
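
(If one wanted to take the page cache out of the picture, the same job can
be run with direct I/O or with an fsync after every write -- an untested
variation, passing the parameters on the fio command line:)

fio --name=rand-write-direct --filename=ssd.test.file \
    --rw=randwrite --bs=4k --size=4g --runtime=120 \
    --ioengine=libaio --iodepth=4 --direct=1

fio --name=rand-write-fsync --filename=ssd.test.file \
    --rw=randwrite --bs=4k --size=4g --runtime=120 \
    --ioengine=sync --fsync=1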

I just tested the effect by reducing the test file to 500 MiB and the runtime
to 10 seconds, and I got 98000 IOPS for that. So the larger test file size,
but especially the longer runtime, forces the kernel to do actual writes due to:

merkaba:~> grep . /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500

(standard values, I still see no need to optimize anything in here with
those SSDs, not even with the 16 GiB of RAM the laptop has, as the SSDs
can usually easily keep up, and I´d rather wait for a change in the default
value unless I am convinced of a benefit in manually adapting it in *this*
case)

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28 14:52                               ` Robert White
@ 2014-12-28 15:42                                 ` Martin Steigerwald
  2014-12-28 15:47                                   ` Martin Steigerwald
  2014-12-29  0:27                                   ` Robert White
  0 siblings, 2 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 15:42 UTC (permalink / raw)
  To: Robert White; +Cc: Bardur Arantsson, linux-btrfs

Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
> On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
> > Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
> >> Now:
> >>
> >> The complaining party has verified the minimum, repeatable case of
> >> simple file allocation on a very fragmented system and the responding
> >> party and several others have understood and supported the bug.
> >
> > I didn´t yet provide such a test case.
> 
> My bad.
> 
> >
> > At the moment I can only reproduce this kworker thread using a CPU for
> > minutes case with my /home filesystem.
> >
> > A mininmal test case for me would be to be able to reproduce it with a
> > fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
> > get 4800 instead of 270 IOPS.
> >
> 
> A version of the test case to demonstrate absolutely system-clogging 
> loads is pretty easy to construct.
> 
> Make a raid1 filesystem.
> Balance it once to make sure the seed filesystem is fully integrated.
> 
> Create a bunch of small files that are at least 4K in size, but are 
> randomly sized. Fill the entire filesystem with them.
> 
> BASH Script:
> typeset -i counter=0
> while
>   dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) 
> count=1 2>/dev/null
> do
> echo $counter >/dev/null #basically a noop
> done
>
> The while will exit when the dd encounters a full filesystem.
> 
> Then delete ~10% of the files with
> rm *0
> 
> Run the while loop again, then delete a different 10% with "rm *1".
> 
> Then again with rm *2, etc...
> 
> Do this a few times and with each iteration the CPU usage gets worse and 
> worse. You'll easily get system-wide stalls on all IO tasks lasting ten 
> or more seconds.

Thanks Robert. That's wonderful.

I wondered about such a test case already and thought about reproducing
it just with fallocate calls instead, to reduce the amount of actual
writes done. I.e. some silly workload of fallocating, truncating, writing
just some parts with dd seek, and removing things again.
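
(Roughly what I have in mind -- an untested sketch, the numbers are picked
arbitrarily:)

TESTDIR="./test"
mkdir -p "$TESTDIR"
typeset -i i=0
while fallocate -l $((4096 + $RANDOM)) "$TESTDIR/$((++i))"
do
        # overwrite one 4 KiB block somewhere inside the file
        dd if=/dev/urandom of="$TESTDIR/$i" bs=4k count=1 \
           seek=$((RANDOM % 8)) conv=notrunc 2>/dev/null
        # now and then truncate or remove an earlier file again
        if (( i % 10 == 0 )); then truncate -s 4096 "$TESTDIR/$((i - 5))"; fi
        if (( i % 17 == 0 )); then rm -f "$TESTDIR/$((i - 9))"; fi
done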

Feel free to add your testcase to the bug report:

[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

Because anything that helps a BTRFS developer to reproduce it will make it
easier to find and fix the root cause.

I think I will try with this little critter:

merkaba:/mnt/btrfsraid1> cat freespracefragment.sh 
#!/bin/bash

TESTDIR="./test"
mkdir -p "$TESTDIR"

typeset -i counter=0
while true; do
        fallocate -l $((4096 + $RANDOM)) "$TESTDIR/$((++counter))"
        echo $counter >/dev/null #basically a noop
done

It takes a while. The script itself is using only a few percent of one core
there, while busying out the SSDs more heavily than I thought it would.
But well, I see up to 12000 writes per 10 seconds – that's not that much,
still it keeps one SSD about 80% busy:

ATOP - merkaba                                 2014/12/28  16:40:57                                 -----------                                   10s elapsed
PRC | sys    1.50s | user   3.47s | #proc    367  | #trun      1 | #tslpi   649 | #tslpu     0 | #zombie    0 | clones   839  |              | no  procacct |
CPU | sys      30% | user     38% | irq       1%  | idle    293% | wait     37% |              | steal     0% | guest     0%  | curf 1.63GHz | curscal  50% |
cpu | sys       7% | user     11% | irq       1%  | idle     75% | cpu000 w  6% |              | steal     0% | guest     0%  | curf 1.25GHz | curscal  39% |
cpu | sys       8% | user     11% | irq       0%  | idle     76% | cpu002 w  4% |              | steal     0% | guest     0%  | curf 1.55GHz | curscal  48% |
cpu | sys       7% | user      9% | irq       0%  | idle     71% | cpu001 w 13% |              | steal     0% | guest     0%  | curf 1.75GHz | curscal  54% |
cpu | sys       8% | user      7% | irq       0%  | idle     71% | cpu003 w 14% |              | steal     0% | guest     0%  | curf 1.96GHz | curscal  61% |
CPL | avg1    1.69 | avg5    1.30 | avg15   0.94  |              |              | csw    68387 | intr   36928 |               |              | numcpu     4 |
MEM | tot    15.5G | free    3.1G | cache   8.8G  | buff    4.2M | slab    1.0G | shmem 210.3M | shrss  79.1M | vmbal   0.0M  | hptot   0.0M | hpuse   0.0M |
SWP | tot    12.0G | free   11.5G |               |              |              |              |              |               | vmcom   4.9G | vmlim  19.7G |
LVM | a-btrfsraid1 | busy     80% | read       0  | write  11873 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.31  | avq     1.11 | avio 0.67 ms |
LVM | a-btrfsraid1 | busy      5% | read       0  | write  11873 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.31  | avq     2.45 | avio 0.04 ms |
LVM |   msata-home | busy      3% | read       0  | write    175 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   0.06  | avq     1.71 | avio 1.43 ms |
LVM | msata-debian | busy      0% | read       0  | write     10 | KiB/r      0 | KiB/w      8 | MBr/s   0.00 | MBw/s   0.01  | avq     1.15 | avio 3.40 ms |
LVM |    sata-home | busy      0% | read       0  | write    175 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   0.06  | avq     1.71 | avio 0.04 ms |
LVM |  sata-debian | busy      0% | read       0  | write     10 | KiB/r      0 | KiB/w      8 | MBr/s   0.00 | MBw/s   0.01  | avq     1.00 | avio 0.10 ms |
DSK |          sdb | busy     80% | read       0  | write  11880 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.38  | avq     1.11 | avio 0.67 ms |
DSK |          sda | busy      5% | read       0  | write  12069 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.38  | avq     2.51 | avio 0.04 ms |
NET | transport    | tcpi      26 | tcpo      26  | udpi       0 | udpo       0 | tcpao      2 | tcppo      1 | tcprs      0  | tcpie      0 | udpie      0 |
NET | network      | ipi       26 | ipo       26  | ipfrw      0 | deliv     26 |              |              |               | icmpi      0 | icmpo      0 |
NET | eth0      0% | pcki      10 | pcko      10  | si    5 Kbps | so    1 Kbps | coll       0 | erri       0 | erro       0  | drpi       0 | drpo       0 |
NET | lo      ---- | pcki      16 | pcko      16  | si    2 Kbps | so    2 Kbps | coll       0 | erri       0 | erro       0  | drpi       0 | drpo       0 |

  PID     TID    RUID        EUID         THR    SYSCPU    USRCPU     VGROW     RGROW    RDDSK     WRDSK    ST    EXC    S    CPUNR     CPU    CMD        1/4
 9169       -    martin      martin        14     0.22s     1.53s        0K        0K       0K        4K    --      -    S        1     18%    amarok
 1488       -    root        root           1     0.34s     0.27s      220K        0K       0K        0K    --      -    S        2      6%    Xorg
 6816       -    martin      martin         7     0.05s     0.44s        0K        0K       0K        0K    --      -    S        1      5%    kmail
24390       -    root        root           1     0.20s     0.25s       24K       24K       0K    40800K    --      -    S        0      5%    freespracefrag
 3268       -    martin      martin         3     0.08s     0.34s        0K        0K       0K       24K    --      -    S        0      4%    kwin



But only with a low amount of writes:

merkaba:/mnt/btrfsraid1> vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 538424 3326248   4304 9202576    6   11  1968  4029  273  207 15 10 72  3  0
 1  0 538424 3325244   4304 9202836    0    0     0  6456 3498 7635 11  8 72 10  0
 0  0 538424 3325168   4304 9202932    0    0     0  9032 3719 6764  9  9 74  9  0
 0  0 538424 3334508   4304 9202932    0    0     0  8936 3548 6035  7  8 76  9  0
 0  0 538424 3334144   4304 9202876    0    0     0  9008 3335 5635  7  7 76 10  0
 0  0 538424 3332724   4304 9202728    0    0     0 11240 3555 5699  7  8 76 10  0
 2  0 538424 3333328   4304 9202876    0    0     0  9080 3724 6542  8  8 75  9  0
 0  0 538424 3333328   4304 9202876    0    0     0  6968 2951 5015  7  7 76 10  0
 0  1 538424 3332832   4304 9202584    0    0     0  9160 3663 6772  8  8 76  9  0


Still it keeps one of the two SSDs about 80% busy:

iostat -xz 1

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7,04    0,00    7,04    9,80    0,00   76,13

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00 1220,00     0,00  4556,00     7,47     0,12    0,10    0,00    0,10   0,04   5,10
sdb               0,00    10,00    0,00 1210,00     0,00  4556,00     7,53     0,85    0,70    0,00    0,70   0,66  79,90
dm-2              0,00     0,00    0,00    4,00     0,00    36,00    18,00     0,02    5,00    0,00    5,00   4,25   1,70
dm-5              0,00     0,00    0,00    4,00     0,00    36,00    18,00     0,00    0,25    0,00    0,25   0,25   0,10
dm-6              0,00     0,00    0,00 1216,00     0,00  4520,00     7,43     0,12    0,10    0,00    0,10   0,04   5,00
dm-7              0,00     0,00    0,00 1216,00     0,00  4520,00     7,43     0,84    0,69    0,00    0,69   0,66  79,70

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6,55    0,00    7,81    9,32    0,00   76,32

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,09    0,07    0,00    0,07   0,03   3,80
sdb               0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,79    0,66    0,00    0,66   0,64  77,10
dm-6              0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,09    0,07    0,00    0,07   0,03   4,00
dm-7              0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,79    0,66    0,00    0,66   0,64  77,10

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7,79    0,00    7,79    9,30    0,00   75,13

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00 1202,00     0,00  4468,00     7,43     0,09    0,07    0,00    0,07   0,04   4,70
sdb               0,00     0,00    4,00 1202,00  2048,00  4468,00    10,81     0,86    0,71    4,75    0,70   0,65  78,10
dm-1              0,00     0,00    4,00    0,00  2048,00     0,00  1024,00     0,02    4,75    4,75    0,00   2,00   0,80
dm-6              0,00     0,00    0,00 1202,00     0,00  4468,00     7,43     0,08    0,07    0,00    0,07   0,04   4,60
dm-7              0,00     0,00    0,00 1202,00     0,00  4468,00     7,43     0,84    0,70    0,00    0,70   0,65  77,80


But still, I hit neither full CPU usage nor full SSD usage (just 80%), so
this is yet another interesting case.

> I don't have enough spare storage to do this directly, so I used 
> loopback devices. First I did it with the loopback files in COW mode. 
> Then I did it again with the files in NOCOW mode. (the COW files got 
> thick with overwrite real fast. 8-)
> 
> So anyway...
> 
> After I got through all ten digits on the rm (that is removing *0, then 
> refilling, then *1 etc...) I figured the FS image was nicely fragmented.
> 
> At that point it was very easy to spike the kworker to 100% CPU with
> 
> dd if=/dev/urandom of=/mnt/Work/scratch bs=40k
> 
> The dd would read 40k (a CPU spike for /dev/urandom processing), then it 
> would write the 40k and the kworker would peg 100% on one CPU and stay 
> there for a while. Then it would be back to the /dev/urandom spike.
> 
> So this laptop has been carefully detuned to prevent certain kinds of 
> stalls (particularly the moveablecore= reservation, as previously 
> mentioned, to prevent non-responsiveness of the UI) and I had to go 
> through /dev/loop so that had a smoothing effect... but yep, there were 
> clear kworker spikes that _did_ stop the IO path (the system monitor app, 
> for instance,  could not get I/O statistics for ten and fifteen second 
> intervals and would stop logging/scrolling).

I think I will look at the moveablecore= thing again. I think I overlooked
this before.
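
As far as I understand it, that is a kernel boot parameter (spelled
movablecore=, I believe), so trying it would amount to something like the
following on this Debian system (a sketch only, and the size is just an
arbitrary example, not a recommendation):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet movablecore=512M"
# then run update-grub and reboot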

> Progressively larger block sizes on the write path made things 
> progressively worse...
> 
> dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
> 
> 
> And overwriting the file by just invoking DD again, was worse still 
> (presumably from the juggling act) before resulting in a net 
> out-of-space condition.
> 
> Switching from /dev/urandom to /dev/zero for writing the large file made 
> things worse still -- probably since there were no respites for the 
> kworker to catch up etc.
> 
> ASIDE: Playing with /proc/sys/vm/dirty_{background_,}ratio had lots of 
> interesting and difficult to quantify effects on user-space 
> applications. Cutting in half (5 and 10 instead of 10 and 20 
> respectively) seemed to give some relief, but going further got harmful 
> quickly. Diverging numbers was odd too. But it seemed a little brittle 
> to play with these numbers.

As I said, in usual usage I do not see much reason to poke around with it.
And yes, I know Linus' advice to tune it to roughly a few seconds' worth
of your storage bandwidth. But these SSDs can do 200 MiB/s even with
partially random workloads, so in 5 seconds they could write out about
1 GiB. And I have not seen more dirty memory than that in the fio test case.

It may make sense to reduce it to 1 GiB as 10% of:

merkaba:~> free -m
             total       used       free     shared    buffers     cached
Mem:         15830      11953       3877        207          0       8382
-/+ buffers/cache:       3570      12260
Swap:        12287        526      11761

is still a lot.

merkaba:~> grep . /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500

But as I have never seen a problem with bulk writes piling up for the SSDs,
I didn´t. I am quite lazy about that: I only ever change a default when I see
a need to. And yes, on write-heavy servers with 512 GiB of RAM or slow rotating
storage it may well be needed to avoid long stalls.
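
Just to illustrate what I mean, and not something I have actually set here:
if one wanted absolute limits instead of ratios, the byte-based knobs could
be set roughly like this (the 1 GiB / 2 GiB values are only assumptions
following the rough calculation above; once the *_bytes files are set, the
kernel ignores the corresponding *_ratio values):

merkaba:~> echo $((1*1024*1024*1024)) > /proc/sys/vm/dirty_background_bytes
merkaba:~> echo $((2*1024*1024*1024)) > /proc/sys/vm/dirty_bytes

or persistently via a sysctl snippet, e.g. /etc/sysctl.d/dirty.conf:

vm.dirty_background_bytes = 1073741824
vm.dirty_bytes = 2147483648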

> SUPER FREAKY THING...
> 
> Every time I removed and recreated "scratch" I would get _radically_ 
> different results for how much I could write into that remaining space 
> and how long it took to do so. In theory I am reusing the exact same 
> storage again and again. I'm not doing compression (the underlying 
> filesystem behind the loop devices has compression but that would be 
> disabled by the +C attribute). It's not enough space coming-and-going to 
> cause data extents to be reclaimed or displaced by metadata. And the 
> filesystem is otherwise completely unused.
> 
> But check it out...
> 
> Gust Work # rm scratch
> Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
> 1700+0 records in
> 1700+0 records out
> 278528000 bytes (279 MB) copied, 1.4952 s, 186 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
> 1700+0 records in
> 1700+0 records out
> 278528000 bytes (279 MB) copied, 292.135 s, 953 kB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 93+0 records in
> 92+0 records out
> 15073280 bytes (15 MB) copied, 0.0453977 s, 332 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 1090+0 records in
> 1089+0 records out
> 178421760 bytes (178 MB) copied, 115.991 s, 1.5 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 332+0 records in
> 331+0 records out
> 54231040 bytes (54 MB) copied, 30.1589 s, 1.8 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 622+0 records in
> 621+0 records out
> 101744640 bytes (102 MB) copied, 37.4813 s, 2.7 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
> 1700+0 records in
> 1700+0 records out
> 278528000 bytes (279 MB) copied, 121.863 s, 2.3 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
> 1700+0 records in
> 1700+0 records out
> 278528000 bytes (279 MB) copied, 24.2909 s, 11.5 MB/s
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 1709+0 records in
> 1708+0 records out
> 279838720 bytes (280 MB) copied, 139.538 s, 2.0 MB/s
> Gust Work # rm scratch
> Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k
> dd: error writing ‘/mnt/Work/scratch’: No space left on device
> 1424+0 records in
> 1423+0 records out
> 233144320 bytes (233 MB) copied, 102.257 s, 2.3 MB/s
> Gust Work #
> 
> (and so on)

I saw something similar, but with the 2x10 GiB RAID 1 test BTRFS on LVM
volumes. I filled the remaining space with rsync -a /usr/bin to it several
times, and even after it aborted with "no space left on device", subsequent
calls could still copy things to it. But later I attributed that to it
perhaps having failed on a large file on the first run and then fitting in
smaller files on subsequent calls, since I used a different destination
directory for each rsync call and it thus restarted the copy from scratch
with the first file every time.
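
Roughly, the fill procedure was along these lines (a sketch from memory; the
mount point and the number of passes are placeholders, not the exact commands
I used):

for i in 1 2 3 4 5; do
        rsync -a /usr/bin "/mnt/btrfsraid1/fill-$i/"
done

With a fresh destination directory per pass, each run starts over with the
first file, which is what I think explains the differing results.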

So it's nice to see that you can reproduce this with dd.
 
> So...
> 
> Repeatable: yes.
> Problematic: yes.

Wonderful.

I may try this with my test BTRFS. I could even make it 2x20 GiB RAID 1
as well.
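
If I get to it, recreating the test filesystem at that size would look roughly
like this (the volume group and logical volume names are only placeholders
following my naming scheme, so this is a sketch rather than the exact commands):

merkaba:~> lvcreate -L 20G -n btrfstest sata
merkaba:~> lvcreate -L 20G -n btrfstest msata
merkaba:~> mkfs.btrfs -d raid1 -m raid1 /dev/sata/btrfstest /dev/msata/btrfstest
merkaba:~> mount /dev/sata/btrfstest /mnt/btrfsraid1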

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28 15:42                                 ` Martin Steigerwald
@ 2014-12-28 15:47                                   ` Martin Steigerwald
  2014-12-29  0:27                                   ` Robert White
  1 sibling, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-28 15:47 UTC (permalink / raw)
  To: Robert White; +Cc: Bardur Arantsson, linux-btrfs

Am Sonntag, 28. Dezember 2014, 16:42:20 schrieb Martin Steigerwald:
> Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
> > On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
> > > Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
> > >> Now:
> > >>
> > >> The complaining party has verified the minimum, repeatable case of
> > >> simple file allocation on a very fragmented system and the responding
> > >> party and several others have understood and supported the bug.
> > >
> > > I didn´t yet provide such a test case.
> > 
> > My bad.
> > 
> > >
> > > At the moment I can only reproduce this kworker thread using a CPU for
> > > minutes case with my /home filesystem.
> > >
> > > A mininmal test case for me would be to be able to reproduce it with a
> > > fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
> > > get 4800 instead of 270 IOPS.
> > >
> > 
> > A version of the test case to demonstrate absolutely system-clogging 
> > loads is pretty easy to construct.
> > 
> > Make a raid1 filesystem.
> > Balance it once to make sure the seed filesystem is fully integrated.
> > 
> > Create a bunch of small files that are at least 4K in size, but are 
> > randomly sized. Fill the entire filesystem with them.
> > 
> > BASH Script:
> > typeset -i counter=0
> > while
> >   dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) 
> > count=1 2>/dev/null
> > do
> > echo $counter >/dev/null #basically a noop
> > done
> >
> > The while will exit when the dd encounters a full filesystem.
> > 
> > Then delete ~10% of the files with
> > rm *0
> > 
> > Run the while loop again, then delete a different 10% with "rm *1".
> > 
> > Then again with rm *2, etc...
> > 
> > Do this a few times and with each iteration the CPU usage gets worse and 
> > worse. You'll easily get system-wide stalls on all IO tasks lasting ten 
> > or more seconds.
> 
> Thanks Robert. Thats wonderful.
> 
> I wondered about such a test case already and thought about reproducing
> it just with fallocate calls instead to reduce the amount of actual
> writes done. I.e. just do some silly fallocate, truncating, write just
> some parts with dd seek and remove things again kind of workload.
> 
> Feel free to add your testcase to the bug report:
> 
> [Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> Cause anything that helps a BTRFS developer to reproduce will make it easier
> to find and fix the root cause of it.
> 
> I think I will try with this little critter:
> 
> merkaba:/mnt/btrfsraid1> cat freespracefragment.sh 
> #!/bin/bash
> 
> TESTDIR="./test"
> mkdir -p "$TESTDIR"
> 
> typeset -i counter=0
> while true; do
>         fallocate -l $((4096 + $RANDOM)) "$TESTDIR/$((++counter))"
>         echo $counter >/dev/null #basically a noop
> done
> 
> It takes a time, the script itself is using only a few percent of one core
> there, while busying out the SSDs more heavily than I thought it would do.
> But well I see up to 12000 writes per 10 seconds – thats not that much, still
> it busies one SSD for 80%:
> 
> ATOP - merkaba                                 2014/12/28  16:40:57                                 -----------                                   10s elapsed
> PRC | sys    1.50s | user   3.47s | #proc    367  | #trun      1 | #tslpi   649 | #tslpu     0 | #zombie    0 | clones   839  |              | no  procacct |
> CPU | sys      30% | user     38% | irq       1%  | idle    293% | wait     37% |              | steal     0% | guest     0%  | curf 1.63GHz | curscal  50% |
> cpu | sys       7% | user     11% | irq       1%  | idle     75% | cpu000 w  6% |              | steal     0% | guest     0%  | curf 1.25GHz | curscal  39% |
> cpu | sys       8% | user     11% | irq       0%  | idle     76% | cpu002 w  4% |              | steal     0% | guest     0%  | curf 1.55GHz | curscal  48% |
> cpu | sys       7% | user      9% | irq       0%  | idle     71% | cpu001 w 13% |              | steal     0% | guest     0%  | curf 1.75GHz | curscal  54% |
> cpu | sys       8% | user      7% | irq       0%  | idle     71% | cpu003 w 14% |              | steal     0% | guest     0%  | curf 1.96GHz | curscal  61% |
> CPL | avg1    1.69 | avg5    1.30 | avg15   0.94  |              |              | csw    68387 | intr   36928 |               |              | numcpu     4 |
> MEM | tot    15.5G | free    3.1G | cache   8.8G  | buff    4.2M | slab    1.0G | shmem 210.3M | shrss  79.1M | vmbal   0.0M  | hptot   0.0M | hpuse   0.0M |
> SWP | tot    12.0G | free   11.5G |               |              |              |              |              |               | vmcom   4.9G | vmlim  19.7G |
> LVM | a-btrfsraid1 | busy     80% | read       0  | write  11873 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.31  | avq     1.11 | avio 0.67 ms |
> LVM | a-btrfsraid1 | busy      5% | read       0  | write  11873 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.31  | avq     2.45 | avio 0.04 ms |
> LVM |   msata-home | busy      3% | read       0  | write    175 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   0.06  | avq     1.71 | avio 1.43 ms |
> LVM | msata-debian | busy      0% | read       0  | write     10 | KiB/r      0 | KiB/w      8 | MBr/s   0.00 | MBw/s   0.01  | avq     1.15 | avio 3.40 ms |
> LVM |    sata-home | busy      0% | read       0  | write    175 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   0.06  | avq     1.71 | avio 0.04 ms |
> LVM |  sata-debian | busy      0% | read       0  | write     10 | KiB/r      0 | KiB/w      8 | MBr/s   0.00 | MBw/s   0.01  | avq     1.00 | avio 0.10 ms |
> DSK |          sdb | busy     80% | read       0  | write  11880 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.38  | avq     1.11 | avio 0.67 ms |
> DSK |          sda | busy      5% | read       0  | write  12069 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.38  | avq     2.51 | avio 0.04 ms |
> NET | transport    | tcpi      26 | tcpo      26  | udpi       0 | udpo       0 | tcpao      2 | tcppo      1 | tcprs      0  | tcpie      0 | udpie      0 |
> NET | network      | ipi       26 | ipo       26  | ipfrw      0 | deliv     26 |              |              |               | icmpi      0 | icmpo      0 |
> NET | eth0      0% | pcki      10 | pcko      10  | si    5 Kbps | so    1 Kbps | coll       0 | erri       0 | erro       0  | drpi       0 | drpo       0 |
> NET | lo      ---- | pcki      16 | pcko      16  | si    2 Kbps | so    2 Kbps | coll       0 | erri       0 | erro       0  | drpi       0 | drpo       0 |
> 
>   PID     TID    RUID        EUID         THR    SYSCPU    USRCPU     VGROW     RGROW    RDDSK     WRDSK    ST    EXC    S    CPUNR     CPU    CMD        1/4
>  9169       -    martin      martin        14     0.22s     1.53s        0K        0K       0K        4K    --      -    S        1     18%    amarok
>  1488       -    root        root           1     0.34s     0.27s      220K        0K       0K        0K    --      -    S        2      6%    Xorg
>  6816       -    martin      martin         7     0.05s     0.44s        0K        0K       0K        0K    --      -    S        1      5%    kmail
> 24390       -    root        root           1     0.20s     0.25s       24K       24K       0K    40800K    --      -    S        0      5%    freespracefrag
>  3268       -    martin      martin         3     0.08s     0.34s        0K        0K       0K       24K    --      -    S        0      4%    kwin
> 
> 
> 
> But only with a low amount of writes:
> 
> merkaba:/mnt/btrfsraid1> vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  2  0 538424 3326248   4304 9202576    6   11  1968  4029  273  207 15 10 72  3  0
>  1  0 538424 3325244   4304 9202836    0    0     0  6456 3498 7635 11  8 72 10  0
>  0  0 538424 3325168   4304 9202932    0    0     0  9032 3719 6764  9  9 74  9  0
>  0  0 538424 3334508   4304 9202932    0    0     0  8936 3548 6035  7  8 76  9  0
>  0  0 538424 3334144   4304 9202876    0    0     0  9008 3335 5635  7  7 76 10  0
>  0  0 538424 3332724   4304 9202728    0    0     0 11240 3555 5699  7  8 76 10  0
>  2  0 538424 3333328   4304 9202876    0    0     0  9080 3724 6542  8  8 75  9  0
>  0  0 538424 3333328   4304 9202876    0    0     0  6968 2951 5015  7  7 76 10  0
>  0  1 538424 3332832   4304 9202584    0    0     0  9160 3663 6772  8  8 76  9  0

Let me rephrase that.

On one hand rather low, but for this kind of workload, which just *fallocates*
new files, actually quite a lot. I just tell it to *reserve* the space for the
files, I do not tell it to write to them. And yet it's about 6-12 MiB/s.

> Still it busies one of both SSDs for about 80%:
> 
> iostat -xz 1
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            7,04    0,00    7,04    9,80    0,00   76,13
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0,00     0,00    0,00 1220,00     0,00  4556,00     7,47     0,12    0,10    0,00    0,10   0,04   5,10
> sdb               0,00    10,00    0,00 1210,00     0,00  4556,00     7,53     0,85    0,70    0,00    0,70   0,66  79,90
> dm-2              0,00     0,00    0,00    4,00     0,00    36,00    18,00     0,02    5,00    0,00    5,00   4,25   1,70
> dm-5              0,00     0,00    0,00    4,00     0,00    36,00    18,00     0,00    0,25    0,00    0,25   0,25   0,10
> dm-6              0,00     0,00    0,00 1216,00     0,00  4520,00     7,43     0,12    0,10    0,00    0,10   0,04   5,00
> dm-7              0,00     0,00    0,00 1216,00     0,00  4520,00     7,43     0,84    0,69    0,00    0,69   0,66  79,70
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            6,55    0,00    7,81    9,32    0,00   76,32
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,09    0,07    0,00    0,07   0,03   3,80
> sdb               0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,79    0,66    0,00    0,66   0,64  77,10
> dm-6              0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,09    0,07    0,00    0,07   0,03   4,00
> dm-7              0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,79    0,66    0,00    0,66   0,64  77,10
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            7,79    0,00    7,79    9,30    0,00   75,13
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0,00     0,00    0,00 1202,00     0,00  4468,00     7,43     0,09    0,07    0,00    0,07   0,04   4,70
> sdb               0,00     0,00    4,00 1202,00  2048,00  4468,00    10,81     0,86    0,71    4,75    0,70   0,65  78,10
> dm-1              0,00     0,00    4,00    0,00  2048,00     0,00  1024,00     0,02    4,75    4,75    0,00   2,00   0,80
> dm-6              0,00     0,00    0,00 1202,00     0,00  4468,00     7,43     0,08    0,07    0,00    0,07   0,04   4,60
> dm-7              0,00     0,00    0,00 1202,00     0,00  4468,00     7,43     0,84    0,70    0,00    0,70   0,65  77,80
> 
> 
> But yet, neither I hit full CPU usage nor full SSD usage (just 80%), so
> this is yet another interesting case.
[…]
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28 12:03                             ` Martin Steigerwald
@ 2014-12-28 17:04                               ` Patrik Lundquist
  2014-12-29 10:14                                 ` Martin Steigerwald
  0 siblings, 1 reply; 59+ messages in thread
From: Patrik Lundquist @ 2014-12-28 17:04 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs

On 28 December 2014 at 13:03, Martin Steigerwald <Martin@lichtvoll.de> wrote:
>
> BTW, I found that the Oracle blog didn´t work at all for me. I completed
> a cycle of defrag, sdelete -c and VBoxManage compact, [...] and it
> apparently did *nothing* to reduce the size of the file.

They've changed the argument to -z; sdelete -z.
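
So the full cycle from the blog would then roughly be (my reading only, with
the drive letter and image path as placeholders, not verified here):

# inside the Windows guest:
#   defrag C:
#   sdelete -z C:
# then on the Linux host:
VBoxManage modifyhd /path/to/WinXP.vdi --compact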

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28 15:42                                 ` Martin Steigerwald
  2014-12-28 15:47                                   ` Martin Steigerwald
@ 2014-12-29  0:27                                   ` Robert White
  2014-12-29  9:14                                     ` Martin Steigerwald
  1 sibling, 1 reply; 59+ messages in thread
From: Robert White @ 2014-12-29  0:27 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Bardur Arantsson, linux-btrfs

On 12/28/2014 07:42 AM, Martin Steigerwald wrote:
> Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
>> On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
>>> Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
>>>> Now:
>>>>
>>>> The complaining party has verified the minimum, repeatable case of
>>>> simple file allocation on a very fragmented system and the responding
>>>> party and several others have understood and supported the bug.
>>>
>>> I didn´t yet provide such a test case.
>>
>> My bad.
>>
>>>
>>> At the moment I can only reproduce this kworker thread using a CPU for
>>> minutes case with my /home filesystem.
>>>
>>> A mininmal test case for me would be to be able to reproduce it with a
>>> fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
>>> get 4800 instead of 270 IOPS.
>>>
>>
>> A version of the test case to demonstrate absolutely system-clogging
>> loads is pretty easy to construct.
>>
>> Make a raid1 filesystem.
>> Balance it once to make sure the seed filesystem is fully integrated.
>>
>> Create a bunch of small files that are at least 4K in size, but are
>> randomly sized. Fill the entire filesystem with them.
>>
>> BASH Script:
>> typeset -i counter=0
>> while
>>    dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM))
>> count=1 2>/dev/null
>> do
>> echo $counter >/dev/null #basically a noop
>> done
>>
>> The while will exit when the dd encounters a full filesystem.
>>
>> Then delete ~10% of the files with
>> rm *0
>>
>> Run the while loop again, then delete a different 10% with "rm *1".
>>
>> Then again with rm *2, etc...
>>
>> Do this a few times and with each iteration the CPU usage gets worse and
>> worse. You'll easily get system-wide stalls on all IO tasks lasting ten
>> or more seconds.
>
> Thanks Robert. Thats wonderful.
>
> I wondered about such a test case already and thought about reproducing
> it just with fallocate calls instead to reduce the amount of actual
> writes done. I.e. just do some silly fallocate, truncating, write just
> some parts with dd seek and remove things again kind of workload.
>
> Feel free to add your testcase to the bug report:
>
> [Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>
> Cause anything that helps a BTRFS developer to reproduce will make it easier
> to find and fix the root cause of it.
>
> I think I will try with this little critter:
>
> merkaba:/mnt/btrfsraid1> cat freespracefragment.sh
> #!/bin/bash
>
> TESTDIR="./test"
> mkdir -p "$TESTDIR"
>
> typeset -i counter=0
> while true; do
>          fallocate -l $((4096 + $RANDOM)) "$TESTDIR/$((++counter))"
>          echo $counter >/dev/null #basically a noop
> done

If you don't do the remove/delete passes you won't get as much 
fragmentation...

I also noticed that fallocate would not actually create the files in my 
toolset, so I had to touch them first. So the theoretical script became

e.g.

typeset -i counter=0
for AA in {0..9}
do
   while
     touch ${TESTDIR}/$((++counter)) &&
     fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((counter))
   do
     if ((counter%100 == 0))
     then
       echo $counter
     fi
   done
   echo "removing ${AA}"
   rm ${TESTDIR}/*${AA}
done

Meanwhile, on my test rig using fallocate did _not_ result in final 
exhaustion of resources. That is btrfs fi df /mnt/Work didn't show 
significant changes on a near full expanse.

I also never got a failed response back from fallocate, that is the 
inner loop never terminated. This could be a problem with the system 
call itself or it could be a problem with the application wrapper.
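
One way to narrow that down, just a suggestion and not something I have run
yet, would be to watch the return value of the actual system call, e.g.
(the scratch file name is arbitrary):

strace -e trace=fallocate fallocate -l 1G /mnt/Work/testfile

If the syscall returns ENOSPC while the tool still exits 0, the wrapper is at
fault; if the syscall itself returns 0, then btrfs really is granting the
reservation.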

Nor did I reach the CPU saturation I expected.

e.g.
Gust vm # btrfs fi df /mnt/Work/
Data, RAID1: total=1.72GiB, used=1.66GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=256.00MiB, used=57.84MiB
GlobalReserve, single: total=32.00MiB, used=0.00B

time passes while script running...

Gust vm # btrfs fi df /mnt/Work/
Data, RAID1: total=1.72GiB, used=1.66GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=256.00MiB, used=57.84MiB
GlobalReserve, single: total=32.00MiB, used=0.00B

So there may be some limiting factor or something.

Without the actual writes to the actual file expanse I don't get the stalls.

(I added a _touch_ of instrumentation; it makes the various catastrophe 
events a little more obvious in context. 8-)

mount /dev/whattever /mnt/Work
typeset -i counter=0
for AA in {0..9}
do
   while
     dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) count=1 2>/dev/null
   do
     if ((counter%100 == 0))
     then
       echo $counter
       if ((counter%1000 == 0))
       then
         btrfs fi df /mnt/Work
       fi
     fi
   done
   btrfs fi df /mnt/Work
   echo "removing ${AA}"
   rm /mnt/Work/*${AA}
   btrfs fi df /mnt/Work
done

So you definitely need the writes to really see the stalls.

> I may try with with my test BTRFS. I could even make it 2x20 GiB RAID 1
> as well.

I guess I never mentioned it... I am using 4x1GiB NOCOW files through 
losetup as the basis of a RAID1. No compression (by virtue of the NOCOW 
files in underlying fs, and not being set in the resulting mount). No 
encryption. No LVM.
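
In other words the rig was set up roughly like this (paths are placeholders;
the one detail that matters is that chattr +C has to be applied while the
backing files are still empty):

cd /var/tmp/btrfs-test          # placeholder directory on the host fs
for i in 0 1 2 3; do
    touch img$i
    chattr +C img$i             # NOCOW, so the host fs neither CoWs nor compresses it
    fallocate -l 1G img$i
    losetup /dev/loop$i img$i
done
mkfs.btrfs -d raid1 -m raid1 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
mount /dev/loop0 /mnt/Work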

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
  2014-12-27 19:23           ` BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time) Martin Steigerwald
@ 2014-12-29  2:07             ` Zygo Blaxell
  2014-12-29  9:32               ` Martin Steigerwald
  0 siblings, 1 reply; 59+ messages in thread
From: Zygo Blaxell @ 2014-12-29  2:07 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Hugo Mills, Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2308 bytes --]

On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
> My simple test case didn´t trigger it, and I do not have another twice 160
> GiB available on these SSDs to try with a copy of my home
> filesystem. Then I could safely test without bringing the desktop session to
> a halt. Maybe someone has an idea on how to "enhance" my test case in
> order to reliably trigger the issue.
> 
> It may be challenging though. My /home is quite a filesystem. It has a maildir
> with at least one million of files (yeah, I am performance testing KMail and
> Akonadi as well to the limit!), and it has git repos and this one VM image,
> and the desktop search and the Akonadi database. In other words: It has
> been hit nicely with various mostly random I think workloads over the last
> about six months. I bet its not that easy to simulate that. Maybe some runs
> of compilebench to age the filesystem before the fio test?
> 
> That said, BTRFS performs a lot better. The complete lockups without any
> CPU usage of 3.15 and 3.16 have gone for sure. Thats wonderful. But there
> is this kworker issue now. I noticed it that gravely just while trying to
> complete this tax returns stuff with the Windows XP VM. Otherwise it may
> have happened, I have seen some backtraces in kern.log, but it didn´t last
> for minutes. So this indeed is of less severity than the full lockups with
> 3.15 and 3.16.
> 
> Zygo, what are the characteristics of your filesystem? Do you use
> compress=lzo and skinny metadata as well? How are the chunks allocated?
> What kind of data do you have on it?

compress-force (default zlib), no skinny-metadata.  Chunks are d=single,
m=dup.  Data is a mix of various desktop applications, most active
file sizes from a few hundred K to a few MB, maybe 300k-400k files.
No database or VM workloads.  Filesystem is 100GB and is usually between
98 and 99% full (about 1-2GB free).
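
In mkfs/mount terms that corresponds roughly to (device name is a placeholder,
and this is only a sketch of the configuration, not the exact commands used):

mkfs.btrfs -d single -m dup /dev/sdX1
mount -o compress-force=zlib /dev/sdX1 /mnt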

I have another filesystem which has similar problems when it's 99.99%
full (it's 13TB, so 0.01% is 1.3GB).  That filesystem is RAID1 with
skinny-metadata and no-holes.

On various filesystems I have the above CPU-burning problem, a bunch of
irreproducible random crashes, and a hang with a kernel stack that goes
through SyS_unlinkat and btrfs_evict_inode.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-29  0:27                                   ` Robert White
@ 2014-12-29  9:14                                     ` Martin Steigerwald
  0 siblings, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-29  9:14 UTC (permalink / raw)
  To: Robert White; +Cc: Bardur Arantsson, linux-btrfs

Am Sonntag, 28. Dezember 2014, 16:27:41 schrieb Robert White:
> On 12/28/2014 07:42 AM, Martin Steigerwald wrote:
> > Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
> >> On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
> >>> Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
> >>>> Now:
> >>>>
> >>>> The complaining party has verified the minimum, repeatable case of
> >>>> simple file allocation on a very fragmented system and the responding
> >>>> party and several others have understood and supported the bug.
> >>>
> >>> I didn´t yet provide such a test case.
> >>
> >> My bad.
> >>
> >>>
> >>> At the moment I can only reproduce this kworker thread using a CPU for
> >>> minutes case with my /home filesystem.
> >>>
> >>> A mininmal test case for me would be to be able to reproduce it with a
> >>> fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
> >>> get 4800 instead of 270 IOPS.
> >>>
> >>
> >> A version of the test case to demonstrate absolutely system-clogging
> >> loads is pretty easy to construct.
> >>
> >> Make a raid1 filesystem.
> >> Balance it once to make sure the seed filesystem is fully integrated.
> >>
> >> Create a bunch of small files that are at least 4K in size, but are
> >> randomly sized. Fill the entire filesystem with them.
> >>
> >> BASH Script:
> >> typeset -i counter=0
> >> while
> >>    dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM))
> >> count=1 2>/dev/null
> >> do
> >> echo $counter >/dev/null #basically a noop
> >> done
> >>
> >> The while will exit when the dd encounters a full filesystem.
> >>
> >> Then delete ~10% of the files with
> >> rm *0
> >>
> >> Run the while loop again, then delete a different 10% with "rm *1".
> >>
> >> Then again with rm *2, etc...
> >>
> >> Do this a few times and with each iteration the CPU usage gets worse and
> >> worse. You'll easily get system-wide stalls on all IO tasks lasting ten
> >> or more seconds.
> >
> > Thanks Robert. Thats wonderful.
> >
> > I wondered about such a test case already and thought about reproducing
> > it just with fallocate calls instead to reduce the amount of actual
> > writes done. I.e. just do some silly fallocate, truncating, write just
> > some parts with dd seek and remove things again kind of workload.
> >
> > Feel free to add your testcase to the bug report:
> >
> > [Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> >
> > Cause anything that helps a BTRFS developer to reproduce will make it easier
> > to find and fix the root cause of it.
> >
> > I think I will try with this little critter:
> >
> > merkaba:/mnt/btrfsraid1> cat freespracefragment.sh
> > #!/bin/bash
> >
> > TESTDIR="./test"
> > mkdir -p "$TESTDIR"
> >
> > typeset -i counter=0
> > while true; do
> >          fallocate -l $((4096 + $RANDOM)) "$TESTDIR/$((++counter))"
> >          echo $counter >/dev/null #basically a noop
> > done
> 
> If you don't do the remove/delete passes you won't get as much 
> fragmentation...
> 
> I also noticed that fallocate would not actually create the files in my 
> toolset, so I had to touch them first. So the theoretical script became
> 
> e.g.
> 
> typeset -i counter=0
> for AA in {0..9}
> do
>    while
>      touch ${TESTDIR}/$((++counter)) &&
>      fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((counter))
>    do
>      if ((counter%100 == 0))
>      then
>        echo $counter
>      fi
>    done
>    echo "removing ${AA}"
>    rm ${TESTDIR}/*${AA}
> done

Hmmm, strange. It did here. I had a ton of files in the test directory.

> Meanwhile, on my test rig using fallocate did _not_ result in final 
> exhaustion of resources. That is btrfs fi df /mnt/Work didn't show 
> significant changes on a near full expanse.

Hmmm, I had it running until it had allocated about 5 GiB in the data chunks.

But I stopped it yesterday. It took a long time to get there. It seems to be
quite slow at filling a 10 GiB RAID-1 BTRFS. I bet that may be due to the many
forks for the fallocate command.

But it seems my fallocate works differently than yours. I have fallocate
from:

merkaba:~> fallocate --version
fallocate von util-linux 2.25.2

> I also never got a failed response back from fallocate, that is the 
> inner loop never terminated. This could be a problem with the system 
> call itself or it could be a problem with the application wrapper.

Hmmm, it should return a failure like this:

merkaba:/mnt/btrfsraid1> LANG=C fallocate -l 20G 20g
fallocate: fallocate failed: No space left on device
merkaba:/mnt/btrfsraid1#1> echo $?
1
 
> Nor did I reach the CPU saturation I expected.

No, I didn´t reach it either. Just 5% or so for the script itself, and I
didn´t see any notable kworker activity. But then, I stopped it before the
filesystem was full.

> e.g.
> Gust vm # btrfs fi df /mnt/Work/
> Data, RAID1: total=1.72GiB, used=1.66GiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=256.00MiB, used=57.84MiB
> GlobalReserve, single: total=32.00MiB, used=0.00B
> 
> time passes while script running...
> 
> Gust vm # btrfs fi df /mnt/Work/
> Data, RAID1: total=1.72GiB, used=1.66GiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=256.00MiB, used=57.84MiB
> GlobalReserve, single: total=32.00MiB, used=0.00B
> 
> So there may be some limiting factor or something.
> 
> Without the actual writes to the actual file expanse I don't get the stalls.

Interesting. We may have uncovered another performance issue with fallocate
on BTRFS then.

> 
> (I added a _touch_ of instrumentation, it makes the various catostrophy 
> events a little more obvious in context. 8-)
> 
> mount /dev/whattever /mnt/Work
> typeset -i counter=0
> for AA in {0..9}
> do
>    while
>      dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + 
> $RANDOM)) count=1 2>/dev/null
>    do
>      if ((counter%100 == 0))
>      then
>        echo $counter
>        if ((counter%1000 == 0))
>        then
>          btrfs fi df /mnt/Work
>        fi
>      fi
>    done
>    btrfs fi df /mnt/Work
>    echo "removing ${AA}"
>    rm /mnt/Work/*${AA}
>    btrfs fi df /mnt/Work
> done
> 
> So you definitely need the writes to really see the stalls.

Hmmm, interesting. I will try this some time. But right now there is other
stuff that is also important, so I am taking a break from this.

> > I may try with with my test BTRFS. I could even make it 2x20 GiB RAID 1
> > as well.
> 
> I guess I never mentioned it... I am using 4x1GiB NOCOW files through 
> losetup as the basis of a RAID1. No compression (by virtue of the NOCOW 
> files in underlying fs, and not being set in the resulting mount). No 
> encryption. No LVM.

Well okay, I am using BTRFS RAID 1 on two logical volumes in two different
volume groups, each sitting on its own partition on one of two different SSDs:

Intel SSD 320 with 300 GB on SATA-600 (but SSD can only do SATA-300) +
Crucial m500 480 GB on mSATA-300 (but SSD could do SATA-600)

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
  2014-12-28 13:56             ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
  2014-12-28 15:00               ` Martin Steigerwald
@ 2014-12-29  9:25               ` Martin Steigerwald
  1 sibling, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-29  9:25 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 29945 bytes --]

Am Sonntag, 28. Dezember 2014, 14:56:21 schrieb Martin Steigerwald:
> Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
> > Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
> > > Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> > > > Summarized at
> > > > 
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > > 
> > > > see below. This is reproducable with fio, no need for Windows XP in
> > > > Virtualbox for reproducing the issue. Next I will try to reproduce with
> > > > a freshly creating filesystem.
> > > > 
> > > > 
> > > > Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > > > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > > > Hello!
> > > > > > > > 
> > > > > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > > > > 
> > > > > > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > > > > > bug
> > > > > > > > report:
> > > > > > > > 
> > > > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > > > > > space_cache, skinny meta data extents – are these a problem? – and
> > > > > > > 
> > > > > > > > compress=lzo:
> > > > > > > (there is no known problem with skinny metadata, it's actually more
> > > > > > > efficient than the older format. There has been some anecdotes about
> > > > > > > mixing the skinny and fat metadata but nothing has ever been
> > > > > > > demonstrated problematic.)
> > > > > > > 
> > > > > > > > merkaba:~> btrfs fi sh /home
> > > > > > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > > > 
> > > > > > > >          Total devices 2 FS bytes used 144.41GiB
> > > > > > > >          devid    1 size 160.00GiB used 160.00GiB path
> > > > > > > >          /dev/mapper/msata-home
> > > > > > > >          devid    2 size 160.00GiB used 160.00GiB path
> > > > > > > >          /dev/mapper/sata-home
> > > > > > > > 
> > > > > > > > Btrfs v3.17
> > > > > > > > merkaba:~> btrfs fi df /home
> > > > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > > > 
> > > > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > > > > 
> > > > > > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > > > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > > > > > cause I know no tax return software for Linux which would be suitable
> > > > > > > > for
> > > > > > > > Germany and I frankly don´t care about the end of security cause all
> > > > > > > > surfing and other network access I will do from the Linux box and I
> > > > > > > > only
> > > > > > > > run the VM behind a firewall).
> > > > > > > 
> > > > > > > > And thus I try the balance dance again:
> > > > > > > ITEM: Balance... it doesn't do what you think it does... 
> > > > > > > 
> > > > > > > "Balancing" is something you should almost never need to do. It is only
> > > > > > > for cases of changing geometry (adding disks, switching RAID levels,
> > > > > > > etc.) of for cases when you've radically changed allocation behaviors
> > > > > > > (like you decided to remove all your VM's or you've decided to remove a
> > > > > > > mail spool directory full of thousands of tiny files).
> > > > > > > 
> > > > > > > People run balance all the time because they think they should. They are
> > > > > > > _usually_ incorrect in that belief.
> > > > > > 
> > > > > > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > > > > > device.
> > > > >    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > > > space. What's more, balance does *not* balance the metadata trees. The
> > > > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > > > data, and you have some 13 GiB of that available for use.
> > > > > 
> > > > >    Now, since you're seeing lockups when the space on your disks is
> > > > > all allocated I'd say that's a bug. However, you're the *only* person
> > > > > who's reported this as a regular occurrence. Does this happen with all
> > > > > filesystems you have, or just this one?
> > > > > 
> > > > > > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > > > > > from to *extend* a tree.
> > > > > 
> > > > >    It's not a tree. It's simply space allocation. It's not even space
> > > > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > > > saying "I'm going to use this piece of disk for this purpose").
> > > > > 
> > > > > > This may be a bug, but this is what I see.
> > > > > > 
> > > > > > And no amount of "you should not balance a BTRFS" will make that
> > > > > > perception go away.
> > > > > > 
> > > > > > See, I see the sun coming out on a morning and you tell me "no, it
> > > > > > doesn´t". Simply that is not going to match my perception.
> > > > > 
> > > > >    Duncan's assertion is correct in its detail. Looking at your space
> > > > 
> > > > Robert's 
> > > > 
> > > > > usage, I would not suggest that running a balance is something you
> > > > > need to do. Now, since you have these lockups that seem quite
> > > > > repeatable, there's probably a lurking bug in there, but hacking
> > > > > around with balance every time you hit it isn't going to get the
> > > > > problem solved properly.
> > > > > 
> > > > >    I think I would suggest the following:
> > > > > 
> > > > >  - make sure you have some way of logging your dmesg permanently (use
> > > > >    a different filesystem for /var/log, or a serial console, or a
> > > > >    netconsole)
> > > > > 
> > > > >  - when the lockup happens, hit Alt-SysRq-t a few times
> > > > > 
> > > > >  - send the dmesg output here, or post to bugzilla.kernel.org
> > > > > 
> > > > >    That's probably going to give enough information to the developers
> > > > > to work out where the lockup is happening, and is clearly the way
> > > > > forward here.
> > > > 
> > > > And I got it reproduced. *Perfectly* reproduced, I´d say.
> > > > 
> > > > But let me run the whole story:
> > > > 
> > > > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> > > 
> > > [… story of trying to reproduce with Windows XP defragmenting which was
> > > unsuccessful as BTRFS still had free device space to allocate new chunks
> > > from …]
> > > 
> > > > But finally I got to:
> > > > 
> > > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > > Sa 27. Dez 13:26:39 CET 2014
> > > > Label: 'home'  uuid: [some UUID]
> > > >         Total devices 2 FS bytes used 152.83GiB
> > > >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > > >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > > > 
> > > > Btrfs v3.17
> > > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > 
> > > > 
> > > > 
> > > > So I did, if Virtualbox can write randomly in a file, I can too.
> > > > 
> > > > So I did:
> > > > 
> > > > 
> > > > martin@merkaba:~> cat ssd-test.fio 
> > > > [global]
> > > > bs=4k
> > > > #ioengine=libaio
> > > > #iodepth=4
> > > > size=4g
> > > > #direct=1
> > > > runtime=120
> > > > filename=ssd.test.file
> > > > 
> > > > [seq-write]
> > > > rw=write
> > > > stonewall
> > > > 
> > > > [rand-write]
> > > > rw=randwrite
> > > > stonewall
> > > > 
> > > > 
> > > > 
> > > > And got:
> > > > 
> > > > ATOP - merkaba                          2014/12/27  13:41:02                          -----------                           10s elapsed
> > > > PRC |  sys   10.14s |  user   0.38s |  #proc    332  | #trun      2  |  #tslpi   548 |  #tslpu     0 |  #zombie    0  | no  procacct  |
> > > > CPU |  sys     102% |  user      4% |  irq       0%  | idle    295%  |  wait      0% |  guest     0% |  curf 3.10GHz  | curscal  96%  |
> > > > cpu |  sys      76% |  user      0% |  irq       0%  | idle     24%  |  cpu001 w  0% |  guest     0% |  curf 3.20GHz  | curscal  99%  |
> > > > cpu |  sys      24% |  user      1% |  irq       0%  | idle     75%  |  cpu000 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > > > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu003 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > CPL |  avg1    0.82 |  avg5    0.78 |  avg15   0.99  |               |  csw     6233 |  intr   12023 |                | numcpu     4  |
> > > > MEM |  tot    15.5G |  free    4.0G |  cache   9.7G  | buff    0.0M  |  slab  333.1M |  shmem 206.6M |  vmbal   0.0M  | hptot   0.0M  |
> > > > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > > LVM |     sata-home |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > > > DSK |           sda |  busy      0% |  read       8  | write      0  |  KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > > > NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > > > NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv     16 |               |  icmpi      0  | icmpo      0  |
> > > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > 
> > > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
> > > > 18079      -   martin    martin        2   9.99s    0.00s      0K       0K      0K      16K  --     -  R       1  100%   fio
> > > >  4746      -   martin    martin        2   0.01s    0.14s      0K       0K      0K       0K  --     -  S       2    2%   konsole
> > > >  3291      -   martin    martin        4   0.01s    0.11s      0K       0K      0K       0K  --     -  S       0    1%   plasma-desktop
> > > >  1488      -   root      root          1   0.03s    0.04s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> > > > 10036      -   root      root          1   0.04s    0.02s      0K       0K      0K       0K  --     -  R       2    1%   atop
> > > > 
> > > > while fio was just *laying* out the 4 GiB file. Yes, thats 100% system CPU
> > > > for 10 seconds while allocatiing a 4 GiB file on a filesystem like:
> > > > 
> > > > martin@merkaba:~> LANG=C df -hT /home
> > > > Filesystem             Type   Size  Used Avail Use% Mounted on
> > > > /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > > > 
> > > > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > > > GiB file. So it was even 4 GiB more free before.)
> > > > 
> > > > 
> > > > But it gets even more visible:
> > > > 
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]       
> > > > 0$ zsh  1$ zsh  2$ zsh  3-$ zsh  4$ zsh  5$* zsh                                   
> > > > 
> > > > 
> > > > yes, thats 0 IOPS.
> > > > 
> > > > 0 IOPS and in zero IOPS. For minutes.
> > > > 
> > > > 
> > > > 
> > > > And here is why:
> > > > 
> > > > ATOP - merkaba                          2014/12/27  13:46:52                          -----------                           10s elapsed
> > > > PRC |  sys   10.77s |  user   0.31s |  #proc    334  | #trun      2  |  #tslpi   548 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > > > CPU |  sys     108% |  user      3% |  irq       0%  | idle    286%  |  wait      2% |  guest     0% |  curf 3.08GHz  | curscal  96%  |
> > > > cpu |  sys      72% |  user      1% |  irq       0%  | idle     28%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys      19% |  user      0% |  irq       0%  | idle     81%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys      11% |  user      1% |  irq       0%  | idle     87%  |  cpu003 w  1% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > > > cpu |  sys       6% |  user      1% |  irq       0%  | idle     91%  |  cpu002 w  1% |  guest     0% |  curf 3.11GHz  | curscal  97%  |
> > > > CPL |  avg1    2.78 |  avg5    1.34 |  avg15   1.12  |               |  csw    50192 |  intr   32379 |                | numcpu     4  |
> > > > MEM |  tot    15.5G |  free    5.0G |  cache   8.7G  | buff    0.0M  |  slab  332.6M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > > > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > > LVM |     sata-home |  busy      5% |  read     160  | write  11177  |  KiB/w      3 |  MBr/s   0.06 |  MBw/s   4.36  | avio 0.05 ms  |
> > > > LVM |    msata-home |  busy      4% |  read      28  | write  11177  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   4.36  | avio 0.04 ms  |
> > > > LVM |   sata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > > > LVM |  msata-debian |  busy      0% |  read       0  | write    844  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > > > DSK |           sda |  busy      5% |  read     160  | write  10200  |  KiB/w      4 |  MBr/s   0.06 |  MBw/s   4.69  | avio 0.05 ms  |
> > > > DSK |           sdb |  busy      4% |  read      28  | write  10558  |  KiB/w      4 |  MBr/s   0.01 |  MBw/s   4.69  | avio 0.04 ms  |
> > > > NET |  transport    |  tcpi      35 |  tcpo      33  | udpi       3  |  udpo       3 |  tcpao      2 |  tcppo      1  | tcprs      0  |
> > > > NET |  network      |  ipi       38 |  ipo       36  | ipfrw      0  |  deliv     38 |               |  icmpi      0  | icmpo      0  |
> > > > NET |  eth0      0% |  pcki      22 |  pcko      20  | si    9 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > 
> > > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > > > 14973      -   root      root          1   8.92s    0.00s      0K       0K      0K     144K  --     -  S       0   89%   kworker/u8:14
> > > > 17450      -   root      root          1   0.86s    0.00s      0K       0K      0K      32K  --     -  R       3    9%   kworker/u8:5
> > > >   788      -   root      root          1   0.25s    0.00s      0K       0K    128K   18880K  --     -  S       3    3%   btrfs-transact
> > > > 12254      -   root      root          1   0.14s    0.00s      0K       0K     64K     576K  --     -  S       2    1%   kworker/u8:3
> > > > 17332      -   root      root          1   0.11s    0.00s      0K       0K    112K    1348K  --     -  S       2    1%   kworker/u8:4
> > > >  3291      -   martin    martin        4   0.01s    0.09s      0K       0K      0K       0K  --     -  S       1    1%   plasma-deskto
> > > > 
> > > > 
> > > > 
> > > > 
> > > > ATOP - merkaba                          2014/12/27  13:47:12                          -----------                           10s elapsed
> > > > PRC |  sys   10.78s |  user   0.44s |  #proc    334  | #trun      3  |  #tslpi   547 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > > > CPU |  sys     106% |  user      4% |  irq       0%  | idle    288%  |  wait      1% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys      93% |  user      0% |  irq       0%  | idle      7%  |  cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys       7% |  user      0% |  irq       0%  | idle     93%  |  cpu003 w  0% |  guest     0% |  curf 3.01GHz  | curscal  94%  |
> > > > cpu |  sys       3% |  user      2% |  irq       0%  | idle     94%  |  cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > CPL |  avg1    3.33 |  avg5    1.56 |  avg15   1.20  |               |  csw    38253 |  intr   23104 |                | numcpu     4  |
> > > > MEM |  tot    15.5G |  free    4.9G |  cache   8.7G  | buff    0.0M  |  slab  336.5M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > > > SWP |  tot    12.0G |  free   11.7G |                |               |               |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > > LVM |    msata-home |  busy      2% |  read       0  | write   2337  |  KiB/w      3 |  MBr/s   0.00 |  MBw/s   0.91  | avio 0.07 ms  |
> > > > LVM |     sata-home |  busy      2% |  read      36  | write   2337  |  KiB/w      3 |  MBr/s   0.01 |  MBw/s   0.91  | avio 0.07 ms  |
> > > > LVM |  msata-debian |  busy      1% |  read       1  | write   1630  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.65  | avio 0.03 ms  |
> > > > LVM |   sata-debian |  busy      0% |  read       0  | write   1019  |  KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.41  | avio 0.02 ms  |
> > > > DSK |           sdb |  busy      2% |  read       1  | write   2545  |  KiB/w      5 |  MBr/s   0.00 |  MBw/s   1.45  | avio 0.07 ms  |
> > > > DSK |           sda |  busy      1% |  read      36  | write   2461  |  KiB/w      5 |  MBr/s   0.01 |  MBw/s   1.28  | avio 0.06 ms  |
> > > > NET |  transport    |  tcpi      20 |  tcpo      20  | udpi       1  |  udpo       1 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > > > NET |  network      |  ipi       21 |  ipo       21  | ipfrw      0  |  deliv     21 |               |  icmpi      0  | icmpo      0  |
> > > > NET |  eth0      0% |  pcki       5 |  pcko       5  | si    0 Kbps  |  so    0 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > 
> > > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > > > 17450      -   root      root          1   9.96s    0.00s      0K       0K      0K       0K  --     -  R       2  100%   kworker/u8:5
> > > >  4746      -   martin    martin        2   0.06s    0.15s      0K       0K      0K       0K  --     -  S       1    2%   konsole
> > > > 10508      -   root      root          1   0.13s    0.00s      0K       0K     96K    4048K  --     -  S       1    1%   kworker/u8:18
> > > >  1488      -   root      root          1   0.06s    0.06s      0K       0K      0K       0K  --     -  S       0    1%   Xorg
> > > > 17332      -   root      root          1   0.12s    0.00s      0K       0K     96K     580K  --     -  R       3    1%   kworker/u8:4
> > > > 17454      -   root      root          1   0.11s    0.00s      0K       0K     32K    4416K  --     -  D       1    1%   kworker/u8:6
> > > > 17516      -   root      root          1   0.09s    0.00s      0K       0K     16K     136K  --     -  S       3    1%   kworker/u8:7
> > > >  3268      -   martin    martin        3   0.02s    0.05s      0K       0K      0K       0K  --     -  S       1    1%   kwin
> > > > 10036      -   root      root          1   0.05s    0.02s      0K       0K      0K       0K  --     -  R       0    1%   atop
> > > > 
> > > > 
> > > > 
> > > > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > > > usage: the drives are *idling* around. Heck, 2400 write accesses in 10 seconds.
> > > > That's a joke for SSDs that can do 40000 IOPS (depending on how and what
> > > > you measure of course, like request size, read, write, iodepth and so on).
> > > > 
> > > > It's kworker/u8:5 utilizing 100% of one core for minutes.
> > > > 
> > > > 
> > > > 
> > > > It's the random write case, it seems. Here are the values from the fio job:
> > > > 
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s]
> > > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> > > >   write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> > > >     clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> > > >      lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> > > >     clat percentiles (usec):
> > > >      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
> > > >      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
> > > >      | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    9],
> > > >      | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 2128],
> > > >      | 99.99th=[10304]
> > > >     bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20
> > > >     lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> > > >     lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> > > >     lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> > > >   cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > > 
> > > > Seems fine.
> > > > 
> > > > 
> > > > But:
> > > > 
> > > > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> > > >   write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> > > >     clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> > > >      lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> > > >     clat percentiles (usec):
> > > >      |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
> > > >      | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
> > > >      | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
> > > >      | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
> > > >      | 99.99th=[16711680]
> > > >     bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> > > >     lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> > > >     lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> > > >   cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > >      issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > > 
> > > > Run status group 0 (all jobs):
> > > >   WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > > > 
> > > > Run status group 1 (all jobs):
> > > >   WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> > > > 
> > > > 
> > > > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> > > > 
> > > > What?
> > > > 
> > > > Ey, *what*?
> > […] 
> > > > There we go:
> > > > 
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > 
> > > I have done more tests.
> > > 
> > > This is on the same /home after extending it to 170 GiB and balancing it
> > > with btrfs balance start -dusage=80
> > > 
> > > It has plenty of free space. I updated the bug report and hope it gives
> > > an easy-to-comprehend summary. The new tests are in:
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> > > 
> > > 
> > > 
> > > Pasting below for discussion on the list. Summary: I easily get 38000 (!)
> > > IOPS. It may be an idea to reduce it to 160 GiB again, but right now this
> > > does not work, as it says no space left on device when trying to downsize it.
> > > I may try with 165 or 162 GiB.
> > > 
> > > So now we have three IOPS figures:
> > > 
> > > - 256 IOPS in the worst-case scenario
> > > - 4700 IOPS when trying to reproduce the worst-case scenario with a fresh and
> > > small BTRFS filesystem
> > > - 38000 IOPS when /home has unused device space to allocate chunks from
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> > > 
> > > 
> > > This is another test.
> > 
> > 
> > Okay, and this is the last series of tests for today.
> > 
> > Conclusion:
> > 
> > I cannot manage to bring it to its knees as before, but I come close to it.
> > 
> > Still, it's 8000 IOPS instead of 250 IOPS, in a situation that according to
> > btrfs fi sh is even *worse* than before.
> > 
> > That hints at the need to look at free space fragmentation, as in the
> > beginning the problem started appearing with:
> > 
> > merkaba:~> btrfs fi sh /home
> > Label: 'home'  uuid: […]
> >         Total devices 2 FS bytes used 144.41GiB
> >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > 
> > Btrfs v3.17
> > merkaba:~> btrfs fi df /home
> > Data, RAID1: total=154.97GiB, used=141.12GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> > 
> > 
> > 
> > Yes, that's 13 GiB of free space *within* the chunks.
> > 
> > So while I can get the IOPS down by bringing it into a situation where it
> > cannot reserve additional data chunks again, I cannot recreate the
> > abysmal 250 IOPS figure this way. Not even with my /home filesystem.
> > 
> > So there is more to it. I think it's important to look into free space
> > fragmentation. It seems it needs an *aged* filesystem to recreate. And
> > it seems the balances really helped, as I am not able to recreate the
> > issue to that extent right now.
> > 
> > So this shows my original idea about free device space to allocate from
> > also doesn't explain it fully. It seems to be something that is going on
> > within the chunks that explains the worst-case scenario of <300 IOPS, a
> > kworker using one core for minutes and a locked desktop.
> > 
> > Is there a way to view free space fragmentation in BTRFS?
> 
> So to rephrase that:
> 
> From what I perceive, the worst-case issue happens when
> 
> 1) BTRFS cannot reserve any new chunks from unused device space anymore.
> 
> 2) The free space in the existing chunks is highly fragmented.
> 
> Either of these conditions alone is not sufficient to trigger it.
> 
> That's at least my current idea about it.

With

merkaba:~> btrfs fi df /home
Data, RAID1: total=163.87GiB, used=146.92GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.94GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 150.18GiB
        devid    1 size 170.00GiB used 169.84GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 169.84GiB path /dev/mapper/sata-home

Btrfs v3.17

I had a noticeable hang during sdelete.exe -z in the Windows XP VM with its 20 GiB
VDI file (Patrik on the mailing list told me they have changed the argument from
-c to -z, as I wondered why VBoxManage modifyhd Winlala.vdi --compact did not
reduce the size of the file).

It was not as bad as before, but the desktop was easily locked for more than 5 seconds.

So this also happens with larger free space *within* the chunks. Before I do the VBoxManage --compact, I will now rebalance partially.

So this definitely shows it can happen when BTRFS cannot reserve any new
chunks anymore, yet still has *plenty* of free space within the existing data
chunks.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
  2014-12-29  2:07             ` Zygo Blaxell
@ 2014-12-29  9:32               ` Martin Steigerwald
  2015-01-06 20:03                 ` Zygo Blaxell
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-29  9:32 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Hugo Mills, Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3906 bytes --]

On Sunday, 28 December 2014, 21:07:05, Zygo Blaxell wrote:
> On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
> > My simple test case didn't trigger it, and I do not have another twice 160
> > GiB available on these SSDs to try with a copy of my home
> > filesystem. Then I could safely test without bringing the desktop session to
> > a halt. Maybe someone has an idea on how to "enhance" my test case in
> > order to reliably trigger the issue.
> > 
> > It may be challenging though. My /home is quite a filesystem. It has a maildir
> > with at least one million files (yeah, I am performance testing KMail and
> > Akonadi as well, to the limit!), and it has git repos and this one VM image,
> > and the desktop search and the Akonadi database. In other words: it has
> > been hit nicely with various, mostly random I think, workloads over the last
> > roughly six months. I bet it's not that easy to simulate that. Maybe some runs
> > of compilebench to age the filesystem before the fio test?
> > 
> > That said, BTRFS performs a lot better. The complete lockups without any
> > CPU usage of 3.15 and 3.16 are gone for sure. That's wonderful. But there
> > is this kworker issue now. I noticed it that gravely only while trying to
> > complete this tax return stuff with the Windows XP VM. Otherwise it may
> > have happened (I have seen some backtraces in kern.log), but it didn't last
> > for minutes. So this indeed is of less severity than the full lockups with
> > 3.15 and 3.16.
> > 
> > Zygo, what are the characteristics of your filesystem? Do you use
> > compress=lzo and skinny metadata as well? How are the chunks allocated?
> > What kind of data do you have on it?
> 
> compress-force (default zlib), no skinny-metadata.  Chunks are d=single,
> m=dup.  Data is a mix of various desktop applications, most active
> file sizes from a few hundred K to a few MB, maybe 300k-400k files.
> No database or VM workloads.  Filesystem is 100GB and is usually between
> 98 and 99% full (about 1-2GB free).
> 
> I have another filesystem which has similar problems when it's 99.99%
> full (it's 13TB, so 0.01% is 1.3GB).  That filesystem is RAID1 with
> skinny-metadata and no-holes.
> 
> On various filesystems I have the above CPU-burning problem, a bunch of
> irreproducible random crashes, and a hang with a kernel stack that goes
> through SyS_unlinkat and btrfs_evict_inode.

Zygo, thanks. That desktop filesystem sounds a bit similar to my use case,
with the interesting difference that you have no databases or VMs on it.

That said, I use the Windows XP VM rarely, but using it was what made the issue
so visible for me. Is your desktop filesystem on SSD?

Do you have the chance to extend one of the affected filesystems to check
my theory that this does not happen as long as BTRFS can still allocate new
data chunks? If it's right, your FS should be responsive again as long as you
see more than 1 GiB free

Label: none  uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
        Total devices 2 FS bytes used 512.00KiB
        devid    1 size 10.00GiB used 6.53GiB path /dev/mapper/sata-btrfsraid1
        devid    2 size 10.00GiB used 6.53GiB path /dev/mapper/msata-btrfsraid1

between "size" and "used" in btrfs fi sh. I suggest going with at least 2-3
GiB, as BTRFS may allocate just one chunk so quickly that you do not have
the chance to recognize the difference.

Well, and if that works for you, we are back to my recommendation:

More so than with other filesystems, give BTRFS plenty of free space to
operate with. Ideally enough that you always have a minimum of 2-3 GiB of
unused device space left for chunk reservation. One could even write a
Nagios/Icinga monitoring plugin for that :)
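
For illustration, a minimal sketch of what such a check could look like (a
hypothetical script, not an existing plugin; it parses btrfs fi show and
assumes the per-device sizes are printed in GiB, as in the output above):

#!/bin/bash
# Hypothetical check: warn when any device of a BTRFS filesystem has less
# than WARN_GIB of unallocated space left for chunk reservation, i.e. the
# difference between "size" and "used" in "btrfs fi show" (usually needs root).
MOUNTPOINT="${1:-/home}"
WARN_GIB="${2:-3}"

min_unalloc=$(btrfs fi show "$MOUNTPOINT" | awk '
    /devid/ {
        size = $4; used = $6
        gsub(/GiB/, "", size); gsub(/GiB/, "", used)
        unalloc = size - used
        if (min == "" || unalloc < min) min = unalloc
    }
    END { print min }')

if [ -z "$min_unalloc" ]; then
    echo "UNKNOWN: could not parse btrfs fi show for $MOUNTPOINT"
    exit 3
fi

if awk -v m="$min_unalloc" -v w="$WARN_GIB" 'BEGIN { exit !(m < w) }'; then
    echo "WARNING: only $min_unalloc GiB unallocated on $MOUNTPOINT"
    exit 1
fi

echo "OK: $min_unalloc GiB unallocated on $MOUNTPOINT"
exit 0

A real plugin would also have to handle MiB/TiB units and a CRITICAL
threshold, but the idea is simply to alert before the unallocated device
space runs out.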

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again
  2014-12-28 17:04                               ` Patrik Lundquist
@ 2014-12-29 10:14                                 ` Martin Steigerwald
  0 siblings, 0 replies; 59+ messages in thread
From: Martin Steigerwald @ 2014-12-29 10:14 UTC (permalink / raw)
  To: Patrik Lundquist; +Cc: linux-btrfs

On Sunday, 28 December 2014, 18:04:31, Patrik Lundquist wrote:
> On 28 December 2014 at 13:03, Martin Steigerwald <Martin@lichtvoll.de> wrote:
> >
> > BTW, I found that the Oracle blog didn't work at all for me. I completed
> > a cycle of defrag, sdelete -c and VBoxManage compact, [...] and it
> > apparently did *nothing* to reduce the size of the file.
> 
> They've changed the argument to -z; sdelete -z.

Now how cute is that. Thank you. This did the trick:

martin@merkaba:~/.VirtualBox/HardDisks> VBoxManage modifyhd Winlala.vdi --compact
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
martin@merkaba:~/.VirtualBox/HardDisks> ls -lh
insgesamt 12G
-rw------- 1 martin martin 12G Dez 29 11:00 Winlala.vdi
martin@merkaba:~/.VirtualBox/HardDisks>

It was 20 GiB before.
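
For the record, a short recap of the sequence that worked here, as a sketch
(the guest drive letter is an assumption, and the VM should be powered off
before compacting):

# Step 1, inside the Windows XP guest: zero out the free space so that
# VirtualBox can reclaim it (newer sdelete versions use -z for this):
#
#     sdelete.exe -z c:
#
# Step 2, on the Linux host, with the VM powered off:
VBoxManage modifyhd Winlala.vdi --compact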

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
  2014-12-29  9:32               ` Martin Steigerwald
@ 2015-01-06 20:03                 ` Zygo Blaxell
  2015-01-07 19:08                   ` Martin Steigerwald
  0 siblings, 1 reply; 59+ messages in thread
From: Zygo Blaxell @ 2015-01-06 20:03 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Hugo Mills, Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5373 bytes --]

On Mon, Dec 29, 2014 at 10:32:00AM +0100, Martin Steigerwald wrote:
> On Sunday, 28 December 2014, 21:07:05, Zygo Blaxell wrote:
> > On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
> > > My simple test case didn't trigger it, and I do not have another twice 160
> > > GiB available on these SSDs to try with a copy of my home
> > > filesystem. Then I could safely test without bringing the desktop session to
> > > a halt. Maybe someone has an idea on how to "enhance" my test case in
> > > order to reliably trigger the issue.
> > > 
> > > It may be challenging though. My /home is quite a filesystem. It has a maildir
> > > with at least one million files (yeah, I am performance testing KMail and
> > > Akonadi as well, to the limit!), and it has git repos and this one VM image,
> > > and the desktop search and the Akonadi database. In other words: it has
> > > been hit nicely with various, mostly random I think, workloads over the last
> > > roughly six months. I bet it's not that easy to simulate that. Maybe some runs
> > > of compilebench to age the filesystem before the fio test?
> > > 
> > > That said, BTRFS performs a lot better. The complete lockups without any
> > > CPU usage of 3.15 and 3.16 are gone for sure. That's wonderful. But there
> > > is this kworker issue now. I noticed it that gravely only while trying to
> > > complete this tax return stuff with the Windows XP VM. Otherwise it may
> > > have happened (I have seen some backtraces in kern.log), but it didn't last
> > > for minutes. So this indeed is of less severity than the full lockups with
> > > 3.15 and 3.16.
> > > 
> > > Zygo, what are the characteristics of your filesystem? Do you use
> > > compress=lzo and skinny metadata as well? How are the chunks allocated?
> > > What kind of data do you have on it?
> > 
> > compress-force (default zlib), no skinny-metadata.  Chunks are d=single,
> > m=dup.  Data is a mix of various desktop applications, most active
> > file sizes from a few hundred K to a few MB, maybe 300k-400k files.
> > No database or VM workloads.  Filesystem is 100GB and is usually between
> > 98 and 99% full (about 1-2GB free).
> > 
> > I have another filesystem which has similar problems when it's 99.99%
> > full (it's 13TB, so 0.01% is 1.3GB).  That filesystem is RAID1 with
> > skinny-metadata and no-holes.
> > 
> > On various filesystems I have the above CPU-burning problem, a bunch of
> > irreproducible random crashes, and a hang with a kernel stack that goes
> > through SyS_unlinkat and btrfs_evict_inode.
> 
> Zygo, thanks. That desktop filesystem sounds a bit similar to my usecase,
> with the interesting difference that you have no databases or VMs on it.
> 
> That said, I use the Windows XP rarely, but using it was what made the issue
> so visible for me. Is your desktop filesystem on SSD?

No, but I recently stumbled across the same symptoms on an 8GB SD card
on kernel 3.12.24 (raspberry pi).  When the filesystem hit over ~97%
full, all accesses were blocked for several minutes.  I was able to
work around it by adjusting the threshold on a garbage collector daemon
(i.e. deleting a lot of expendable files) to keep usage below 90%.
I didn't try to balance the filesystem, and didn't seem to need to.

ext3 has a related problem when it's nearly full:  it will try to search
gigabytes of block allocation bitmaps searching for a free block, which
can result in a single 'mkdir' call spending 45 minutes reading a large
slow 99.5% full filesystem.

I'd expect a btrfs filesystem that was nearly full to have a small tree
of cached free space extents and be able to search it quickly even if
the result is negative (i.e. there's no free space).  It seems to be
doing something else... :-P

> Do you have the chance to extend one of the affected filesystems to check
> my theory that this does not happen as long as BTRFS can still allocate new
> data chunks? If its right, your FS should be fluent again as long as you see
> more than 1 GiB free
> 
> Label: none  uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
>         Total devices 2 FS bytes used 512.00KiB
>         devid    1 size 10.00GiB used 6.53GiB path /dev/mapper/sata-btrfsraid1
>         devid    2 size 10.00GiB used 6.53GiB path /dev/mapper/msata-btrfsraid1
> 
> between "size" and "used" in btrfs fi sh. I suggest going with at least 2-3
> GiB, as BTRFS may allocate just one chunk so quickly that you do not have
> the chance to recognize the difference.

So far I've found that problems start when space drops below 1GB free
(although it can go as low as 400MB) and problems stop when space gets
above 1GB free, even without resizing or balancing the filesystem.
I've adjusted free space monitoring thresholds accordingly for now,
and it seems to be keeping things working so far.

> Well, and if thats works for you, we are back to my recommendation:
> 
> More so than with other filesystems give BTRFS plenty of free space to
> operate with. At best as much, that you always have a mininum of 2-3 GiB
> unused device space for chunk reservation left. One could even do some
> Nagios/Icinga monitoring plugin for that :)
> 
> -- 
> Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
  2015-01-06 20:03                 ` Zygo Blaxell
@ 2015-01-07 19:08                   ` Martin Steigerwald
  2015-01-07 21:41                     ` Zygo Blaxell
  2015-01-08  5:45                     ` Duncan
  0 siblings, 2 replies; 59+ messages in thread
From: Martin Steigerwald @ 2015-01-07 19:08 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Hugo Mills, Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4639 bytes --]

On Tuesday, 6 January 2015, 15:03:23, Zygo Blaxell wrote:
> On Mon, Dec 29, 2014 at 10:32:00AM +0100, Martin Steigerwald wrote:
> > On Sunday, 28 December 2014, 21:07:05, Zygo Blaxell wrote:
> > > On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
[…]
> > > > Zygo, what are the characteristics of your filesystem? Do you use
> > > > compress=lzo and skinny metadata as well? How are the chunks
> > > > allocated?
> > > > What kind of data do you have on it?
> > > 
> > > compress-force (default zlib), no skinny-metadata.  Chunks are d=single,
> > > m=dup.  Data is a mix of various desktop applications, most active
> > > file sizes from a few hundred K to a few MB, maybe 300k-400k files.
> > > No database or VM workloads.  Filesystem is 100GB and is usually between
> > > 98 and 99% full (about 1-2GB free).
> > > 
> > > I have another filesystem which has similar problems when it's 99.99%
> > > full (it's 13TB, so 0.01% is 1.3GB).  That filesystem is RAID1 with
> > > skinny-metadata and no-holes.
> > > 
> > > On various filesystems I have the above CPU-burning problem, a bunch of
> > > irreproducible random crashes, and a hang with a kernel stack that goes
> > > through SyS_unlinkat and btrfs_evict_inode.
> > 
> > Zygo, thanks. That desktop filesystem sounds a bit similar to my usecase,
> > with the interesting difference that you have no databases or VMs on it.
> > 
> > That said, I use the Windows XP rarely, but using it was what made the
> > issue so visible for me. Is your desktop filesystem on SSD?
> 
> No, but I recently stumbled across the same symptoms on an 8GB SD card
> on kernel 3.12.24 (raspberry pi).  When the filesystem hit over ~97%
> full, all accesses were blocked for several minutes.  I was able to
> work around it by adjusting the threshold on a garbage collector daemon
> (i.e. deleting a lot of expendable files) to keep usage below 90%.
> I didn't try to balance the filesystem, and didn't seem to need to.

Interesting.

> ext3 has a related problem when it's nearly full:  it will try to search
> gigabytes of block allocation bitmaps searching for a free block, which
> can result in a single 'mkdir' call spending 45 minutes reading a large
> slow 99.5% full filesystem.

OK, that's for bitmap access. Ext4 uses extents. BTRFS can use bitmaps as well,
but it also supports extents and I think uses them for most use cases.

> I'd expect a btrfs filesystem that was nearly full to have a small tree
> of cached free space extents and be able to search it quickly even if
> the result is negative (i.e. there's no free space).  It seems to be
> doing something else... :-P

Yeah :)


> > Do you have the chance to extend one of the affected filesystems to check
> > my theory that this does not happen as long as BTRFS can still allocate
> > new
> > data chunks? If its right, your FS should be fluent again as long as you
> > see more than 1 GiB free
> > 
> > Label: none  uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
> > 
> >         Total devices 2 FS bytes used 512.00KiB
> >         devid    1 size 10.00GiB used 6.53GiB path
> >         /dev/mapper/sata-btrfsraid1
> >         devid    2 size 10.00GiB used 6.53GiB path
> >         /dev/mapper/msata-btrfsraid1
> > 
> > between "size" and "used" in btrfs fi sh. I suggest going with at least
> > 2-3
> > GiB, as BTRFS may allocate just one chunk so quickly that you do not have
> > the chance to recognize the difference.
> 
> So far I've found that problems start when space drops below 1GB free
> (although it can go as low as 400MB) and problems stop when space gets
> above 1GB free, even without resizing or balancing the filesystem.
> I've adjusted free space monitoring thresholds accordingly for now,
> and it seems to be keeping things working so far.

Just to see whether we are on the same terms: You talk about space that BTRFS 
has not yet reserved for chunks, i.e. the difference between size and used in 
btrfs fi sh, right?

No BTRFS developer has commented on this yet, neither in this thread nor in the
bug report I filed at kernel.org.

> > Well, and if thats works for you, we are back to my recommendation:
> > 
> > More so than with other filesystems give BTRFS plenty of free space to
> > operate with. At best as much, that you always have a mininum of 2-3 GiB
> > unused device space for chunk reservation left. One could even do some
> > Nagios/Icinga monitoring plugin for that :)

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
  2015-01-07 19:08                   ` Martin Steigerwald
@ 2015-01-07 21:41                     ` Zygo Blaxell
  2015-01-08  5:45                     ` Duncan
  1 sibling, 0 replies; 59+ messages in thread
From: Zygo Blaxell @ 2015-01-07 21:41 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Hugo Mills, Robert White, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1581 bytes --]

On Wed, Jan 07, 2015 at 08:08:50PM +0100, Martin Steigerwald wrote:
> On Tuesday, 6 January 2015, 15:03:23, Zygo Blaxell wrote:
> > ext3 has a related problem when it's nearly full:  it will try to search
> > gigabytes of block allocation bitmaps searching for a free block, which
> > can result in a single 'mkdir' call spending 45 minutes reading a large
> > slow 99.5% full filesystem.
> 
> OK, that's for bitmap access. Ext4 uses extents. 

...and the problem doesn't happen to the same degree on ext4 as it did
on ext3.

> > So far I've found that problems start when space drops below 1GB free
> > (although it can go as low as 400MB) and problems stop when space gets
> > above 1GB free, even without resizing or balancing the filesystem.
> > I've adjusted free space monitoring thresholds accordingly for now,
> > and it seems to be keeping things working so far.
> 
> Just to see whether we are on the same terms: You talk about space that BTRFS 
> has not yet reserved for chunks, i.e. the difference between size and used in 
> btrfs fi sh, right?

The number I look at for this issue is statvfs() f_bavail (i.e. the
"Available" column of /bin/df).

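As a minimal illustration (my own sketch, not the actual monitoring in use
here), that same f_bavail number can be checked from a shell against the
roughly 1GB threshold quoted above:

# Hypothetical threshold check: f_bavail as reported in the "Available"
# column of a POSIX df, warning below roughly 1 GiB.
MOUNTPOINT=/home
avail_kib=$(df -P -k "$MOUNTPOINT" | awk 'NR == 2 { print $4 }')
if [ "$avail_kib" -lt $((1024 * 1024)) ]; then
    echo "WARNING: only ${avail_kib} KiB available (f_bavail) on $MOUNTPOINT"
fi
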
Before the empty-chunk-deallocation code, most of my filesystems would
quickly reach a steady state where all space is allocated to chunks,
and they stay that way unless I have to downsize them.

Now there is free (non-chunk) space on most of my filesystems.  I'll try
monitoring btrfs fi df and btrfs fi show under the failing conditions
and see if there are interesting correlations.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
  2015-01-07 19:08                   ` Martin Steigerwald
  2015-01-07 21:41                     ` Zygo Blaxell
@ 2015-01-08  5:45                     ` Duncan
  2015-01-08 10:18                       ` Martin Steigerwald
  1 sibling, 1 reply; 59+ messages in thread
From: Duncan @ 2015-01-08  5:45 UTC (permalink / raw)
  To: linux-btrfs

Martin Steigerwald posted on Wed, 07 Jan 2015 20:08:50 +0100 as excerpted:

> No BTRFS developers commented yet on this, neither in this thread nor in
> the bug report at kernel.org I made.

Just a quick general note on this point...

There has in the past (and I believe referenced on the wiki) been dev 
comment to the effect that on the list they tend to find particular 
reports/threads and work on them until they find and either fix the issue 
or (when not urgent) decide it must wait for something else, first.  
During the time they're busy pursuing such a report, they don't read 
others on the list very closely, and such list-only bug reports may thus 
get dropped on the floor and never worked on.

The recommendation, then, is to report it to the list, and if not picked 
up right away and you plan on being around in a few weeks/months when 
they potentially get to it, file a bug on it, so it doesn't get dropped 
on the floor.

With the bugzilla.kernel.org report you've followed the recommendation, 
but the implication is that you won't necessarily get any comment right 
away, only later, when they're not immediately busy looking at some other 
bug.  So lack of b.k.o comment in the immediate term doesn't mean they're 
ignoring the bug or don't value it; it just means they're hot on the 
trail of something else ATM and it might take some time to get that 
"first comment" engagement.

But the recommendation is to file the bugzilla report precisely so it 
does /not/ get lost, and you've done that, so... you've done your part 
there and now comes the enforced patience bit of waiting for that 
engagement.

But if it takes a bit, I would keep the bug updated every kernel release 
or so, with a comment updating status.

(Meanwhile, I've seen no indication of such issues here.  Most of my 
btrfs are 8-24 GiB each, all SSD, mostly dual-device btrfs raid1 both 
data/metadata.  Maybe I don't run those full enough, however.  I do have 
three mixed-bg mode sub-GiB btrfs, however, with one of them, a 256 MiB 
single-device dup-mode btrfs, used as /boot, that tends to run reasonably 
full, but I've not seen a problem like that there, either.  But my use-
case probably simply doesn't hit the problem.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
  2015-01-08  5:45                     ` Duncan
@ 2015-01-08 10:18                       ` Martin Steigerwald
  2015-01-09  8:25                         ` Duncan
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Steigerwald @ 2015-01-08 10:18 UTC (permalink / raw)
  To: linux-btrfs

Am Donnerstag, 8. Januar 2015, 05:45:56 schrieben Sie:
> Martin Steigerwald posted on Wed, 07 Jan 2015 20:08:50 +0100 as excerpted:
> > No BTRFS developers commented yet on this, neither in this thread nor in
> > the bug report at kernel.org I made.
> 
> Just a quick general note on this point...
> 
> There has in the past (and I believe referenced on the wiki) been dev 
> comment to the effect that on the list they tend to find particular 
> reports/threads and work on them until they find and either fix the issue 
> or (when not urgent) decide it must wait for something else, first.  
> During the time they're busy pursuing such a report, they don't read 
> others on the list very closely, and such list-only bug reports may thus 
> get dropped on the floor and never worked on.
> 
> The recommendation, then, is to report it to the list, and if not picked 
> up right away and you plan on being around in a few weeks/months when 
> they potentially get to it, file a bug on it, so it doesn't get dropped 
> on the floor.

Duncan, I *did* file a bug.

[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for 
minutes on random write into big file

https://bugzilla.kernel.org/show_bug.cgi?id=90401

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
  2015-01-08 10:18                       ` Martin Steigerwald
@ 2015-01-09  8:25                         ` Duncan
  0 siblings, 0 replies; 59+ messages in thread
From: Duncan @ 2015-01-09  8:25 UTC (permalink / raw)
  To: linux-btrfs

Martin Steigerwald posted on Thu, 08 Jan 2015 11:18:40 +0100 as excerpted:

> Duncan, I *did* file a bug.

I think you misunderstood me... I understood that and actually said as 
much:

>> But the recommendation is to file the bugzilla report precisely so it
>> does /not/ get lost, and you've done that, so... you've done your part
>> there and now comes the enforced patience bit of waiting [...]

My point was simply that based on the wiki recommendation and the earlier 
thread as mentioned on the wiki, the reason /why/ a bugzi report is 
preferred over simply reporting it here is that the devs tend to pick 
bugs and spend some time digging into them, during which they don't look 
too much at other reports here, and they can get lost, while the bugzi 
report won't.

Which implies that a failure to respond either to a thread here or a bug 
report there is because they're busy working on other bugs, and that 
failure to immediately respond isn't to be seen as ignoring the problem, 
and is in fact to be expected.

IOW, I was saying now that the bug is filed, you can sit back and wait in 
reasonable assurance that it'll be processed in due time, as you've done 
your bit and now it's up to them to prioritize and process in due time.  
That's a good thing, and I was commending you for taking the time to file 
the bug as well. =:^)

... While at the same time commiserating a bit, since I know from 
experience how hard that wait for a dev reply can be, and that the wait 
is sort of an enforced patience, as at least for a non-coder like me 
there's not much else one can do. =:^(

That said, now that I reread, I can see how what I wrote could appear to 
be contingent on an assumed /future/ filing of a bug, and that it wasn't 
as clear as I intended that I was commending you for filing it already, 
and basically saying, "Be patient, I know how hard it can be to wait."

Words!  They be tricky! =:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2015-01-09  8:34 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
2014-12-26 14:20 ` Martin Steigerwald
2014-12-26 14:41   ` Martin Steigerwald
2014-12-27  3:33     ` Duncan
2014-12-26 15:59 ` Martin Steigerwald
2014-12-27  4:26   ` Duncan
2014-12-26 22:48 ` Robert White
2014-12-27  5:54   ` Duncan
2014-12-27  9:01   ` Martin Steigerwald
2014-12-27  9:30     ` Hugo Mills
2014-12-27 10:54       ` Martin Steigerwald
2014-12-27 11:52         ` Robert White
2014-12-27 13:16           ` Martin Steigerwald
2014-12-27 13:49             ` Robert White
2014-12-27 14:06               ` Martin Steigerwald
2014-12-27 14:00             ` Robert White
2014-12-27 14:14               ` Martin Steigerwald
2014-12-27 14:21                 ` Martin Steigerwald
2014-12-27 15:14                   ` Robert White
2014-12-27 16:01                     ` Martin Steigerwald
2014-12-28  0:25                       ` Robert White
2014-12-28  1:01                         ` Bardur Arantsson
2014-12-28  4:03                           ` Robert White
2014-12-28 12:03                             ` Martin Steigerwald
2014-12-28 17:04                               ` Patrik Lundquist
2014-12-29 10:14                                 ` Martin Steigerwald
2014-12-28 12:07                             ` Martin Steigerwald
2014-12-28 14:52                               ` Robert White
2014-12-28 15:42                                 ` Martin Steigerwald
2014-12-28 15:47                                   ` Martin Steigerwald
2014-12-29  0:27                                   ` Robert White
2014-12-29  9:14                                     ` Martin Steigerwald
2014-12-27 16:10                     ` Martin Steigerwald
2014-12-27 14:19               ` Robert White
2014-12-27 11:11       ` Martin Steigerwald
2014-12-27 12:08         ` Robert White
2014-12-27 13:55       ` Martin Steigerwald
2014-12-27 14:54         ` Robert White
2014-12-27 16:26           ` Hugo Mills
2014-12-27 17:11             ` Martin Steigerwald
2014-12-27 17:59               ` Martin Steigerwald
2014-12-28  0:06             ` Robert White
2014-12-28 11:05               ` Martin Steigerwald
2014-12-28 13:00         ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
2014-12-28 13:40           ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare) Martin Steigerwald
2014-12-28 13:56             ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
2014-12-28 15:00               ` Martin Steigerwald
2014-12-29  9:25               ` Martin Steigerwald
2014-12-27 18:28       ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
2014-12-27 18:40         ` Hugo Mills
2014-12-27 19:23           ` BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time) Martin Steigerwald
2014-12-29  2:07             ` Zygo Blaxell
2014-12-29  9:32               ` Martin Steigerwald
2015-01-06 20:03                 ` Zygo Blaxell
2015-01-07 19:08                   ` Martin Steigerwald
2015-01-07 21:41                     ` Zygo Blaxell
2015-01-08  5:45                     ` Duncan
2015-01-08 10:18                       ` Martin Steigerwald
2015-01-09  8:25                         ` Duncan
