* Another ENOSPC situation
@ 2016-04-01 13:40 Marc Haber
  2016-04-01 15:44 ` Henk Slager
  2016-04-02  4:55 ` Duncan
  0 siblings, 2 replies; 11+ messages in thread
From: Marc Haber @ 2016-04-01 13:40 UTC (permalink / raw)
  To: linux-btrfs

Hi,

just for a change, this is another btrfs on a different host. The host
is also running Debian unstable with mainline kernels, the btrfs in
question was created (not converted) in March 2015 with btrfs-tools
3.17. It is the root fs of my main work notebook which is under
workstation load, with lots of snapshots being created and deleted.

Balance immediately fails with ENOSPC.

balance -dprofiles=single -dusage=1 goes through "fine" ("had to
relocate 0 out of 602 chunks").

balance -dprofiles=single -dusage=2 also fails with ENOSPC immediately.

[4/502]mh@swivel:~$ sudo btrfs fi usage /
Overall:
    Device size:                 600.00GiB
    Device allocated:            600.00GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                        413.40GiB
    Free (estimated):            148.20GiB      (min: 148.20GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:553.93GiB, Used:405.73GiB
   /dev/mapper/swivelbtr         553.93GiB

Metadata,DUP: Size:23.00GiB, Used:3.83GiB
   /dev/mapper/swivelbtr          46.00GiB

System,DUP: Size:32.00MiB, Used:112.00KiB
   /dev/mapper/swivelbtr          64.00MiB

Unallocated:
   /dev/mapper/swivelbtr           1.00MiB
[5/503]mh@swivel:~$ 
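
The numbers in the usage output can be cross-checked with a little arithmetic. The following is only a sketch using the figures printed above (rounded to two decimals by btrfs-progs, so sums are approximate); the function and variable names are made up for illustration. It shows that practically the whole device is handed out to chunks, while about 148 GiB sit unused inside already-allocated data chunks. With only 1 MiB unallocated there is no room to create a fresh chunk to relocate into, which would be consistent with balance returning ENOSPC right away.

```python
# Figures copied from the "btrfs fi usage /" output above, all in GiB.
DEVICE_SIZE = 600.00
DATA_SIZE, DATA_USED = 553.93, 405.73   # Data,single
METADATA_RAW = 46.00                    # Metadata,DUP: 23.00 GiB stored twice
SYSTEM_RAW = 64.00 / 1024               # System,DUP: 64 MiB raw on disk


def summarize():
    """Return (raw space allocated to chunks, free space inside data chunks)."""
    allocated = DATA_SIZE + METADATA_RAW + SYSTEM_RAW
    free_in_data_chunks = DATA_SIZE - DATA_USED
    return allocated, free_in_data_chunks


allocated, free_in_data = summarize()
print(f"allocated to chunks     : {allocated:.2f} of {DEVICE_SIZE:.2f} GiB")
print(f"free inside data chunks : {free_in_data:.2f} GiB")
```

The second figure matches the "Free (estimated)" line, which suggests the free space is real but locked inside partially filled data chunks.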

btrfs balance -mprofiles seems to do something. One kworker and one
btrfs-transaction process hog one CPU core each for hours, while
blocking the filesystem for minutes at a time, which leaves the host
nearly unusable, up to the point of "clock and mouse pointer
frozen for nearly ten minutes".

The btrfs balance cancel I issued after four hours of this state
took eleven minutes on its own to complete.

These are all the log entries that were produced after starting btrfs
balance -mprofiles at 09:43:
Apr  1 12:18:21 swivel kernel: [253651.970413] BTRFS info (device dm-14): found 3523 extents
Apr  1 12:18:21 swivel kernel: [253652.035572] BTRFS info (device dm-14): relocating block group 1538365849600 flags 36
Apr  1 13:30:57 swivel kernel: [258007.653597] BTRFS info (device dm-14): found 3585 extents
Apr  1 13:30:57 swivel kernel: [258007.746541] BTRFS info (device dm-14): relocating block group 1536755236864 flags 36
Apr  1 13:49:39 swivel kernel: [259130.296184] BTRFS info (device dm-14): found 3047 extents
Apr  1 13:49:39 swivel kernel: [259130.357314] BTRFS info (device dm-14): relocating block group 1528702173184 flags 36
Apr  1 14:30:00 swivel kernel: [261550.776348] BTRFS info (device dm-14): found 4200 extents
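
Two things can be read out of these log lines with a short script. This is a sketch, not a tool: the bit values are the block group type and profile flags from the kernel's btrfs headers, the timestamps are copied from the lines above, and the helper names are invented. It also assumes that the gap between two consecutive "found N extents" lines corresponds to one relocated block group.

```python
from datetime import datetime

# Block group type/profile bits as defined in the kernel's btrfs headers.
BLOCK_GROUP_BITS = {
    1: "DATA", 2: "SYSTEM", 4: "METADATA",
    8: "RAID0", 16: "RAID1", 32: "DUP", 64: "RAID10",
}


def decode_flags(flags):
    """Return the names of all bits set in a 'flags NN' value from the log."""
    return [name for bit, name in BLOCK_GROUP_BITS.items() if flags & bit]


# Wall-clock times of the "found N extents" lines quoted above.
FOUND_AT = ["12:18:21", "13:30:57", "13:49:39", "14:30:00"]


def gaps_in_seconds(stamps):
    """Seconds elapsed between consecutive timestamps on the same day."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


print(decode_flags(36))
for seconds in gaps_in_seconds(FOUND_AT):
    print(f"{seconds // 60} min {seconds % 60} s")
```

So "flags 36" denotes metadata block groups in the DUP profile, and under the assumption above each of them took somewhere between roughly 19 and 73 minutes to relocate.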

This kernel trace from 11:16 is not btrfs-related, is it? I guess it's
Bluetooth-related, since it happened at the same time as the Bluetooth
device dropping out and coming back in:
Apr  1 11:16:38 swivel kernel: [249948.993751] usb 1-1.4: USB disconnect, device number 39
Apr  1 11:16:38 swivel systemd[1]: Starting Load/Save RF Kill Switch Status...
Apr  1 11:16:38 swivel systemd[1]: Started Load/Save RF Kill Switch Status.
Apr  1 11:16:38 swivel systemd[1]: bluetooth.target: Unit not needed anymore. Stopping.
Apr  1 11:16:38 swivel systemd[1]: Stopped target Bluetooth.
Apr  1 11:16:38 swivel laptop-mode: Laptop mode
Apr  1 11:16:38 swivel laptop-mode: enabled, not active
Apr  1 11:16:39 swivel kernel: [249949.211549] usb 1-1.4: new full-speed USB device number 40 using ehci-pci
Apr  1 11:16:39 swivel kernel: [249949.308386] usb 1-1.4: New USB device found, idVendor=0a5c, idProduct=217f
Apr  1 11:16:39 swivel kernel: [249949.308397] usb 1-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Apr  1 11:16:39 swivel kernel: [249949.308402] usb 1-1.4: Product: Broadcom Bluetooth Device
Apr  1 11:16:39 swivel kernel: [249949.308407] usb 1-1.4: Manufacturer: Broadcom Corp
Apr  1 11:16:39 swivel kernel: [249949.308412] usb 1-1.4: SerialNumber: CCAF78F1274F
Apr  1 11:16:39 swivel systemd[1]: Reached target Bluetooth.
Apr  1 11:16:39 swivel kernel: [249949.507794] ------------[ cut here ]------------
Apr  1 11:16:39 swivel kernel: [249949.507810] WARNING: CPU: 1 PID: 11 at arch/x86/kernel/cpu/perf_event_intel_ds.c:325 reserve_ds_buffers+0x102/0x326()
Apr  1 11:16:39 swivel kernel: [249949.507813] alloc_bts_buffer: BTS buffer allocation failure
Apr  1 11:16:39 swivel kernel: [249949.507816] Modules linked in: cpuid hid_generic usbhid hid e1000e tun ctr ccm rfcomm bridge stp llc cpufreq_userspace cpufreq_stats cpufreq_conservative cpufreq_powersave nf_conntrack_netlink nfnetlink bnep binfmt_misc intel_rapl x86_pkg_temp_thermal arc4 intel_powerclamp kvm_intel kvm irqbypass iwldvm snd_hda_codec_conexant snd_hda_codec_generic mac80211 input_leds btusb btbcm i2c_i801 snd_hda_intel btintel snd_hda_codec bluetooth iwlwifi snd_hda_core cfg80211 snd_hwdep sg snd_pcm_oss snd_mixer_oss lpc_ich mfd_core snd_pcm shpchp snd_timer thinkpad_acpi nvram snd battery soundcore rfkill ac tpm_tis tpm evdev processor xt_TCPMSS xt_tcpudp iptable_mangle iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables ip6table_filter ip6_tables x_tables coretemp tp_smapi(O) thinkpad_ec(O) loop drbd lru_cache libcrc32c crc32c_generic autofs4 btrfs xor raid6_pq ext4 crc16 mbcache jbd2 algif_skcipher af_alg dm_snapshot dm_bufio dm_crypt dm_mod md_mod sd_mod usb_storage crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel jitterentropy_rng hmac sha256_ssse3 sha256_generic drbg ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse ahci libahci i915 libata ehci_pci sdhci_pci i2c_algo_bit scsi_mod ehci_hcd sdhci drm_kms_helper mmc_core usbcore usb_common drm i2c_core thermal video button [last unloaded: e1000e]
Apr  1 11:16:39 swivel kernel: [249949.507974] CPU: 1 PID: 11 Comm: watchdog/1 Tainted: G           O    4.5.0-zgws1 #2
Apr  1 11:16:39 swivel kernel: [249949.507978] Hardware name: LENOVO 4240CTO/4240CTO, BIOS 8AET63WW (1.43 ) 05/08/2013
Apr  1 11:16:39 swivel kernel: [249949.507981]  0000000000000047 ffffffff811e6bf3 ffff88017dd4fce8 0000000000000009
Apr  1 11:16:39 swivel kernel: [249949.507987]  ffffffff81052619 ffffffff810230d1 0000000000000000 ffff88017dd4fd40
Apr  1 11:16:39 swivel kernel: [249949.507992]  0000000000000000 00000000000106b0 ffffffff81052671 ffffffff817188a1
Apr  1 11:16:39 swivel kernel: [249949.507997] Call Trace:
Apr  1 11:16:39 swivel kernel: [249949.508006]  [<ffffffff811e6bf3>] ? dump_stack+0x5a/0x6f
Apr  1 11:16:39 swivel kernel: [249949.508013]  [<ffffffff81052619>] ? warn_slowpath_common+0x8e/0xa3
Apr  1 11:16:39 swivel kernel: [249949.508019]  [<ffffffff810230d1>] ? reserve_ds_buffers+0x102/0x326
Apr  1 11:16:39 swivel kernel: [249949.508024]  [<ffffffff81052671>] ? warn_slowpath_fmt+0x43/0x4b
Apr  1 11:16:39 swivel kernel: [249949.508029]  [<ffffffff810230d1>] ? reserve_ds_buffers+0x102/0x326
Apr  1 11:16:39 swivel kernel: [249949.508035]  [<ffffffff810bfefd>] ? watchdog_timer_fn+0x1ad/0x1ad
Apr  1 11:16:39 swivel kernel: [249949.508040]  [<ffffffff8101e15f>] ? x86_reserve_hardware+0xb9/0xc8
Apr  1 11:16:39 swivel kernel: [249949.508045]  [<ffffffff8101e1b9>] ? x86_pmu_event_init+0x4b/0x1bb
Apr  1 11:16:39 swivel kernel: [249949.508050]  [<ffffffff810dfca7>] ? perf_try_init_event+0x3d/0x6c
Apr  1 11:16:39 swivel kernel: [249949.508055]  [<ffffffff810e17d2>] ? perf_event_alloc+0x3c2/0x500
Apr  1 11:16:39 swivel kernel: [249949.508060]  [<ffffffff810e2cb7>] ? perf_event_create_kernel_counter+0x1f/0x122
Apr  1 11:16:39 swivel kernel: [249949.508065]  [<ffffffff810bfc0e>] ? watchdog_enable+0x9d/0x199
Apr  1 11:16:39 swivel kernel: [249949.508071]  [<ffffffff8106a8df>] ? smpboot_thread_fn+0xf7/0x13a
Apr  1 11:16:39 swivel kernel: [249949.508075]  [<ffffffff8106a7e8>] ? sort_range+0x17/0x17
Apr  1 11:16:39 swivel kernel: [249949.508081]  [<ffffffff81068756>] ? kthread+0x95/0x9d
Apr  1 11:16:39 swivel kernel: [249949.508085]  [<ffffffff810686c1>] ? kthread_parkme+0x16/0x16
Apr  1 11:16:39 swivel kernel: [249949.508092]  [<ffffffff8141dcff>] ? ret_from_fork+0x3f/0x70
Apr  1 11:16:39 swivel kernel: [249949.508097]  [<ffffffff810686c1>] ? kthread_parkme+0x16/0x16
Apr  1 11:16:39 swivel kernel: [249949.508100] ---[ end trace e082dccd90d0875a ]---
Apr  1 11:16:39 swivel kernel: [249949.509383] watchdog/1: page allocation failure: order:4, mode:0x26040c0
Apr  1 11:16:39 swivel kernel: [249949.509390] CPU: 1 PID: 11 Comm: watchdog/1 Tainted: G        W  O    4.5.0-zgws1 #2
Apr  1 11:16:39 swivel kernel: [249949.509392] Hardware name: LENOVO 4240CTO/4240CTO, BIOS 8AET63WW (1.43 ) 05/08/2013
Apr  1 11:16:39 swivel kernel: [249949.509395]  0000000000000047 ffffffff811e6bf3 0000000000000001 ffff88017dd4fb60
Apr  1 11:16:39 swivel kernel: [249949.509401]  ffffffff810f01af 0000000000000010 0000000000000040 0000000000000000
Apr  1 11:16:39 swivel kernel: [249949.509407]  00004c0800000001 000000001e5eeb00 0000000000000004 0000000000000040
Apr  1 11:16:39 swivel kernel: [249949.509412] Call Trace:
Apr  1 11:16:39 swivel kernel: [249949.509418]  [<ffffffff811e6bf3>] ? dump_stack+0x5a/0x6f
Apr  1 11:16:39 swivel kernel: [249949.509425]  [<ffffffff810f01af>] ? warn_alloc_failed+0x10f/0x127
Apr  1 11:16:39 swivel kernel: [249949.509431]  [<ffffffff810f2a55>] ? __alloc_pages_nodemask+0x8cc/0x966
Apr  1 11:16:39 swivel kernel: [249949.509438]  [<ffffffff8112dd2c>] ? kmem_getpages+0x50/0x12c
Apr  1 11:16:39 swivel kernel: [249949.509442]  [<ffffffff8112e046>] ? fallback_alloc+0xfe/0x1a7
Apr  1 11:16:39 swivel kernel: [249949.509447]  [<ffffffff8112e73d>] ? kmem_cache_alloc_node_trace+0x89/0x14b
Apr  1 11:16:39 swivel kernel: [249949.509454]  [<ffffffff8102314f>] ? reserve_ds_buffers+0x180/0x326
Apr  1 11:16:39 swivel kernel: [249949.509459]  [<ffffffff810bfefd>] ? watchdog_timer_fn+0x1ad/0x1ad
Apr  1 11:16:39 swivel kernel: [249949.509463]  [<ffffffff8101e15f>] ? x86_reserve_hardware+0xb9/0xc8
Apr  1 11:16:39 swivel kernel: [249949.509468]  [<ffffffff8101e1b9>] ? x86_pmu_event_init+0x4b/0x1bb
Apr  1 11:16:39 swivel kernel: [249949.509472]  [<ffffffff810dfca7>] ? perf_try_init_event+0x3d/0x6c
Apr  1 11:16:39 swivel kernel: [249949.509477]  [<ffffffff810e17d2>] ? perf_event_alloc+0x3c2/0x500
Apr  1 11:16:39 swivel kernel: [249949.509482]  [<ffffffff810e2cb7>] ? perf_event_create_kernel_counter+0x1f/0x122
Apr  1 11:16:39 swivel kernel: [249949.509487]  [<ffffffff810bfc0e>] ? watchdog_enable+0x9d/0x199
Apr  1 11:16:39 swivel kernel: [249949.509491]  [<ffffffff8106a8df>] ? smpboot_thread_fn+0xf7/0x13a
Apr  1 11:16:39 swivel kernel: [249949.509495]  [<ffffffff8106a7e8>] ? sort_range+0x17/0x17
Apr  1 11:16:39 swivel kernel: [249949.509500]  [<ffffffff81068756>] ? kthread+0x95/0x9d
Apr  1 11:16:39 swivel kernel: [249949.509505]  [<ffffffff810686c1>] ? kthread_parkme+0x16/0x16
Apr  1 11:16:39 swivel kernel: [249949.509510]  [<ffffffff8141dcff>] ? ret_from_fork+0x3f/0x70
Apr  1 11:16:39 swivel kernel: [249949.509515]  [<ffffffff810686c1>] ? kthread_parkme+0x16/0x16
Apr  1 11:16:39 swivel kernel: [249949.509519] Mem-Info:
Apr  1 11:16:39 swivel kernel: [249949.509529] active_anon:1107088 inactive_anon:326101 isolated_anon:0
Apr  1 11:16:39 swivel kernel: [249949.509529]  active_file:1104846 inactive_file:1367650 isolated_file:0
Apr  1 11:16:39 swivel kernel: [249949.509529]  unevictable:2526 dirty:14757 writeback:0 unstable:0
Apr  1 11:16:39 swivel kernel: [249949.509529]  slab_reclaimable:56106 slab_unreclaimable:33051
Apr  1 11:16:39 swivel kernel: [249949.509529]  mapped:67336 shmem:87440 pagetables:12012 bounce:0
Apr  1 11:16:39 swivel kernel: [249949.509529]  free:30592 free_pcp:170 free_cma:0
Apr  1 11:16:39 swivel kernel: [249949.509538] Node 0 DMA free:15360kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15360kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr  1 11:16:39 swivel kernel: [249949.509553] lowmem_reserve[]: 0 3403 15919 15919
Apr  1 11:16:39 swivel kernel: [249949.509559] Node 0 DMA32 free:64968kB min:3436kB low:4292kB high:5152kB active_anon:475148kB inactive_anon:357880kB active_file:1173604kB inactive_file:1314960kB unevictable:3416kB isolated(anon):0kB isolated(file):0kB present:3561088kB managed:3487816kB mlocked:3416kB dirty:13592kB writeback:0kB mapped:55924kB shmem:70004kB slab_reclaimable:47096kB slab_unreclaimable:17888kB kernel_stack:2000kB pagetables:8308kB unstable:0kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
Apr  1 11:16:39 swivel kernel: [249949.509575] lowmem_reserve[]: 0 0 12516 12516
Apr  1 11:16:39 swivel kernel: [249949.509580] Node 0 Normal free:42040kB min:12648kB low:15808kB high:18972kB active_anon:3953204kB inactive_anon:946524kB active_file:3245780kB inactive_file:4155640kB unevictable:6688kB isolated(anon):0kB isolated(file):0kB present:13080576kB managed:12816596kB mlocked:6688kB dirty:45436kB writeback:0kB mapped:213420kB shmem:279756kB slab_reclaimable:177328kB slab_unreclaimable:114316kB kernel_stack:8688kB pagetables:39740kB unstable:0kB bounce:0kB free_pcp:764kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr  1 11:16:39 swivel kernel: [249949.509596] lowmem_reserve[]: 0 0 0 0
Apr  1 11:16:39 swivel kernel: [249949.509601] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB
Apr  1 11:16:39 swivel kernel: [249949.509619] Node 0 DMA32: 11548*4kB (UME) 2282*8kB (UME) 55*16kB (UM) 2*32kB (UM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 65392kB
Apr  1 11:16:39 swivel kernel: [249949.509638] Node 0 Normal: 3736*4kB (UME) 3206*8kB (UE) 131*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42688kB
Apr  1 11:16:39 swivel kernel: [249949.509657] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr  1 11:16:39 swivel kernel: [249949.509661] 2561271 total pagecache pages
Apr  1 11:16:39 swivel kernel: [249949.509664] 616 pages in swap cache
Apr  1 11:16:39 swivel kernel: [249949.509667] Swap cache stats: add 28221, delete 27605, find 294750/295285
Apr  1 11:16:39 swivel kernel: [249949.509670] Free swap  = 8277324kB
Apr  1 11:16:39 swivel kernel: [249949.509672] Total swap = 8386556kB
Apr  1 11:16:39 swivel kernel: [249949.509674] 4164412 pages RAM
Apr  1 11:16:39 swivel kernel: [249949.509676] 0 pages HighMem/MovableOnly
Apr  1 11:16:39 swivel kernel: [249949.509678] 84469 pages reserved
Apr  1 11:16:39 swivel kernel: [249949.509681] 0 pages hwpoisoned
Apr  1 11:16:39 swivel kernel: [249949.509717] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
Apr  1 11:16:39 swivel kernel: [249949.537265] EXT4-fs (dm-16): re-mounted. Opts: data=ordered,commit=0
Apr  1 11:16:39 swivel systemd[1]: Reloading Laptop Mode Tools.
Apr  1 11:16:39 swivel kernel: [249949.664133] thinkpad_acpi: EC reports that Thermal Table has changed
Apr  1 11:16:39 swivel kernel: [249949.723795] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
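
One way to read the "page allocation failure: order:4" message is to check the per-zone free lists in the Mem-Info dump. The sketch below parses the DMA32 and Normal lines copied from the log; the helper names are hypothetical, and the 4 kB page size is taken from the smallest block size shown in the dump. An order-4 request needs 2^4 contiguous pages, i.e. 64 kB.

```python
import re

# Free-list lines for the two zones copied from the Mem-Info dump above.
BUDDY_LINES = {
    "DMA32": "11548*4kB (UME) 2282*8kB (UME) 55*16kB (UM) 2*32kB (UM) "
             "0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB",
    "Normal": "3736*4kB (UME) 3206*8kB (UE) 131*16kB (U) 0*32kB 0*64kB "
              "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB",
}

PAGE_KB = 4  # size of an order-0 block in the dump


def parse(line):
    """Return a list of (count, block size in kB) pairs from one zone line."""
    return [(int(c), int(s)) for c, s in re.findall(r"(\d+)\*(\d+)kB", line)]


def free_blocks_at_least(line, order):
    """Number of free blocks big enough for an allocation of the given order."""
    need_kb = PAGE_KB << order
    return sum(count for count, size in parse(line) if size >= need_kb)


def total_free_kb(line):
    """Total free memory in the zone, in kB."""
    return sum(count * size for count, size in parse(line))


for zone, line in BUDDY_LINES.items():
    print(zone, total_free_kb(line), "kB free,",
          free_blocks_at_least(line, 4), "blocks of 64 kB or more")
```

Both zones have tens of megabytes free but no free block of 64 kB or larger, so the failure looks like memory fragmentation hitting the perf/watchdog allocation rather than anything in the btrfs code path itself. Whether the page cache load from the running balance contributed to that fragmentation cannot be told from this dump alone.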



This btrfs is ripe for the backup-format-restore procedure, right?

Greetings
Marc


-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


* Re: Another ENOSPC situation
  2016-04-01 13:40 Another ENOSPC situation Marc Haber
@ 2016-04-01 15:44 ` Henk Slager
  2016-04-01 16:30   ` Marc Haber
  2016-04-02  4:55 ` Duncan
  1 sibling, 1 reply; 11+ messages in thread
From: Henk Slager @ 2016-04-01 15:44 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber <mh+linux-btrfs@zugschlus.de> wrote:
> btrfs balance -mprofiles seems to do something. One kworker and one
> btrfs-transaction process hog one CPU core each for hours, while
> blocking the filesystem for minutes at a time, which leaves the host
> nearly unusable, up to the point of "clock and mouse pointer
> frozen for nearly ten minutes".

I assume you still have your every 10 minutes snapshotting running
while balancing?


* Re: Another ENOSPC situation
  2016-04-01 15:44 ` Henk Slager
@ 2016-04-01 16:30   ` Marc Haber
  2016-04-01 16:50     ` Marc Haber
  0 siblings, 1 reply; 11+ messages in thread
From: Marc Haber @ 2016-04-01 16:30 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
> On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber <mh+linux-btrfs@zugschlus.de> wrote:
> > btrfs balance -mprofiles seems to do something. one kworked and one
> > btrfs-transaction process hog one CPU core each for hours, while
> > blocking the filesystem for minutes apiece, which leads to the host
> > being nearly unuseable up to the point of "clock and mouse pointer
> > frozen for nearly ten minutes".
> 
> I assume you still have your every 10 minutes snapshotting running
> while balancing?

No, I disabled the cronjob before trying the balance. I might be
crazy, but not stup^wunexperienced.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

* Re: Another ENOSPC situation
  2016-04-01 16:30   ` Marc Haber
@ 2016-04-01 16:50     ` Marc Haber
  2016-04-01 19:20       ` Henk Slager
  0 siblings, 1 reply; 11+ messages in thread
From: Marc Haber @ 2016-04-01 16:50 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber <mh+linux-btrfs@zugschlus.de> wrote:
> > > btrfs balance -mprofiles seems to do something. one kworked and one
> > > btrfs-transaction process hog one CPU core each for hours, while
> > > blocking the filesystem for minutes apiece, which leads to the host
> > > being nearly unuseable up to the point of "clock and mouse pointer
> > > frozen for nearly ten minutes".
> > 
> > I assume you still have your every 10 minutes snapshotting running
> > while balancing?
> 
> No, I disabled the cronjob before trying the balance. I might be
> crazy, but not stup^wunexperienced.

That being said, I would still expect the code not to allow _this_
kind of effect on the entire system when two allegedly incompatible
operations run simultaneously. I mean, Linux is a multi-user,
multi-tasking operating system where one simply cannot expect all
processes to be cooperative with each other. We have operating
systems to prevent this kind of issue, not to cause it.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

* Re: Another ENOSPC situation
  2016-04-01 16:50     ` Marc Haber
@ 2016-04-01 19:20       ` Henk Slager
  2016-04-01 20:40         ` Marc Haber
  0 siblings, 1 reply; 11+ messages in thread
From: Henk Slager @ 2016-04-01 19:20 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber <mh+linux-btrfs@zugschlus.de> wrote:
> On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
>> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
>> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber <mh+linux-btrfs@zugschlus.de> wrote:
>> > > btrfs balance -mprofiles seems to do something. one kworked and one
>> > > btrfs-transaction process hog one CPU core each for hours, while
>> > > blocking the filesystem for minutes apiece, which leads to the host
>> > > being nearly unuseable up to the point of "clock and mouse pointer
>> > > frozen for nearly ten minutes".
>> >
>> > I assume you still have your every 10 minutes snapshotting running
>> > while balancing?
>>
>> No, I disabled the cronjob before trying the balance. I might be
>> crazy, but not stup^wunexperienced.
>
> That being said, I would still expect the code not to allow _this_
> kind of effect on the entire system when two allegedly incompatible
> operations run simultaneously. I mean, Linux is a multi-user,
> multi-tasking operating system where one simply cannot expect all
> processes to be cooperative with each other. We have operating
> systems to prevent this kind of issue, not to cause it.

Maybe look at it differently: Does user mh have trouble using this
laptop w.r.t. storing files?

In openSUSE Tumbleweed (the snapshot from the end of March), root access
is needed to change the default snapshotting config; otherwise you
will have a 10-year history. After that change has been made according
to the needs of the user, there is no need to run a manual balance.

* Re: Another ENOSPC situation
  2016-04-01 19:20       ` Henk Slager
@ 2016-04-01 20:40         ` Marc Haber
  2016-04-01 23:39           ` Henk Slager
  0 siblings, 1 reply; 11+ messages in thread
From: Marc Haber @ 2016-04-01 20:40 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Apr 01, 2016 at 09:20:52PM +0200, Henk Slager wrote:
> On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber <mh+linux-btrfs@zugschlus.de> wrote:
> > On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
> >> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
> >> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber <mh+linux-btrfs@zugschlus.de> wrote:
> >> > > btrfs balance -mprofiles seems to do something. one kworked and one
> >> > > btrfs-transaction process hog one CPU core each for hours, while
> >> > > blocking the filesystem for minutes apiece, which leads to the host
> >> > > being nearly unuseable up to the point of "clock and mouse pointer
> >> > > frozen for nearly ten minutes".
> >> >
> >> > I assume you still have your every 10 minutes snapshotting running
> >> > while balancing?
> >>
> >> No, I disabled the cronjob before trying the balance. I might be
> >> crazy, but not stup^wunexperienced.
> >
> > That being said, I would still expect the code not to allow _this_
> > kind of effect on the entire system when two allegedly incompatible
> > operations run simultaneously. I mean, Linux is a multi-user,
> > multi-tasking operating system where one simply cannot expect all
> > processes to be cooperative with each other. We have operating
> > systems to prevent this kind of issue, not to cause it.
> 
> Maybe look at it differently: Does user mh have trouble using this
> laptop w.r.t. storing files?

No. I would have cried murder otherwise.

> In openSUSE Tumbleweed (the snapshot from the end of March), root access
> is needed to change the default snapshotting config; otherwise you
> will have a 10-year history. After that change has been made according
> to the needs of the user, there is no need to run a manual balance.

So you are saying that balancing a filesystem should never be
necessary? Or what are you trying to say?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

* Re: Another ENOSPC situation
  2016-04-01 20:40         ` Marc Haber
@ 2016-04-01 23:39           ` Henk Slager
  0 siblings, 0 replies; 11+ messages in thread
From: Henk Slager @ 2016-04-01 23:39 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Apr 1, 2016 at 10:40 PM, Marc Haber <mh+linux-btrfs@zugschlus.de> wrote:
> On Fri, Apr 01, 2016 at 09:20:52PM +0200, Henk Slager wrote:
>> On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber <mh+linux-btrfs@zugschlus.de> wrote:
>> > On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
>> >> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
>> >> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber <mh+linux-btrfs@zugschlus.de> wrote:
>> >> > > btrfs balance -mprofiles seems to do something. one kworked and one
>> >> > > btrfs-transaction process hog one CPU core each for hours, while
>> >> > > blocking the filesystem for minutes apiece, which leads to the host
>> >> > > being nearly unuseable up to the point of "clock and mouse pointer
>> >> > > frozen for nearly ten minutes".
>> >> >
>> >> > I assume you still have your every 10 minutes snapshotting running
>> >> > while balancing?
>> >>
>> >> No, I disabled the cronjob before trying the balance. I might be
>> >> crazy, but not stup^wunexperienced.
>> >
>> > That being said, I would still expect the code not to allow _this_
>> > kind of effect on the entire system when two allegedly incompatible
>> > operations run simultaneously. I mean, Linux is a multi-user,
>> > multi-tasking operating system where one simply cannot expect all
>> > processes to be cooperative with each other. We have operating
>> > systems to prevent this kind of issue, not to cause it.
>>
>> Maybe look at it differently: Does user mh have trouble using this
>> laptop w.r.t. storing files?
>
> No. I would have cried murder otherwise.
>
>> In openSUSE Tumbleweed (the snapshot from the end of March), root access
>> is needed to change the default snapshotting config; otherwise you
>> will have a 10-year history. After that change has been made according
>> to the needs of the user, there is no need to run a manual balance.
>
> So you are saying that balancing a filesystem should never be
> necessary? Or what are you trying to say?

There is a package, btrfsmaintenance, which does balancing for the
user after it is configured by root according to the user's wishes and
needs.

The key thing I want to say is that you should change your snapshotting
rate and/or policy. It has been hinted at before, and I think it is more
a psychological issue than a technical one.

* Re: Another ENOSPC situation
  2016-04-01 13:40 Another ENOSPC situation Marc Haber
  2016-04-01 15:44 ` Henk Slager
@ 2016-04-02  4:55 ` Duncan
  2016-04-02  5:43   ` Chris Murphy
  1 sibling, 1 reply; 11+ messages in thread
From: Duncan @ 2016-04-02  4:55 UTC (permalink / raw)
  To: linux-btrfs

Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:

> Hi,
> 
> just for a change, this is another btrfs on a different host. The host
> is also running Debian unstable with mainline kernels, the btrfs in
> question was created (not converted) in March 2015 with btrfs-tools
> 3.17. It is the root fs of my main work notebook which is under
> workstation load, with lots of snapshots being created and deleted.
> 
> Balance immediately fails with ENOSPC
> 
> balance -dprofiles=single -dusage=1 goes through "fine" ("had to
> relocate 0 out of 602 chunks")
> 
> balance -dprofiles=single -dusage=2 also ENOSPCes immediately.
> 
> [4/502]mh@swivel:~$ sudo btrfs fi usage /
> Overall:
>     Device size:                 600.00GiB
>     Device allocated:            600.00GiB
>     Device unallocated:            1.00MiB

That's the problem right there.  The admin didn't do his job and spot the 
near full allocation issue (perhaps with the help of some script set to 
run periodically and tell him about it) before it got critical, and now 
there's no room left to balance, to fix the problem.

This despite the fact that the admin chose to run a not yet entirely 
stable filesystem that's well known to run off the rails in precisely 
this sort of way, occasionally, with specific use-cases such as heavy 
snapshotting more often than others.
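
The kind of periodic check mentioned above could be sketched roughly as follows. This is only an illustration, not an existing tool: the function names and the 5 GiB threshold are assumptions made up for the example, and the sample text is the "Overall:" block from the report quoted in this thread. In practice the text would come from running `btrfs filesystem usage` on the mount point.

```python
import re

# Binary unit multipliers as printed by `btrfs filesystem usage`.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

# "Overall:" block copied from the report quoted in this thread.
SAMPLE = """\
Overall:
    Device size:                 600.00GiB
    Device allocated:            600.00GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                        413.40GiB
    Free (estimated):            148.20GiB      (min: 148.20GiB)
"""


def parse_size(text):
    """Convert a string such as '600.00GiB' into a number of bytes."""
    match = re.fullmatch(r"([0-9.]+)([KMGT]iB|B)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def overall_fields(usage_output):
    """Collect the 'Device ...' lines of the report into a dict of byte counts."""
    fields = {}
    for line in usage_output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().startswith("Device "):
            fields[name.strip()] = parse_size(value.split()[0])
    return fields


def allocation_warning(usage_output, min_unallocated=5 * 2**30):
    """Return a warning string when unallocated space is below the threshold.

    The 5 GiB default is an arbitrary assumption for this sketch.
    """
    fields = overall_fields(usage_output)
    unallocated = fields["Device unallocated"]
    if unallocated >= min_unallocated:
        return None
    share = fields["Device allocated"] / fields["Device size"]
    return (f"only {unallocated / 2**20:.2f} MiB unallocated "
            f"({share:.0%} of device allocated)")


print(allocation_warning(SAMPLE))
```

Run from cron or a timer, a check along these lines would flag the filesystem while there is still room to balance. It looks at allocation rather than at the used/free figures, which is the distinction this reply turns on.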

>     Device missing:                  0.00B
>     Used:                        413.40GiB
>     Free (estimated):            148.20GiB      (min: 148.20GiB)

Tho the used vs. free isn't all that bad... it's just that the allocated 
vs. unallocated was allowed to run off the rails and get the filesystem 
in a bind.

But that does mean it should be possible to do something about it. =:^)

>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
> Data,single: Size:553.93GiB, Used:405.73GiB
>    /dev/mapper/swivelbtr         553.93GiB
> 
> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>    /dev/mapper/swivelbtr          46.00GiB
> 
> System,DUP: Size:32.00MiB, Used:112.00KiB
>    /dev/mapper/swivelbtr          64.00MiB
> 
> Unallocated:
>    /dev/mapper/swivelbtr           1.00MiB
> [5/503]mh@swivel:~$

Both data and metadata have several GiB free, data ~140 GiB free, and 
metadata isn't into global reserve, so the system isn't totally wedged, 
only partially, due to the lack of unallocated space.
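
As a rough cross-check of those figures, the slack inside the already-allocated chunks can be computed from the two per-type lines of the report. The snippet is a sketch under a simplifying assumption: it only understands lines whose sizes are both printed in GiB, so the System line is skipped. The helper name is made up for the example.

```python
import re

# Per-type lines copied from the report quoted in this thread.
REPORT = [
    "Data,single: Size:553.93GiB, Used:405.73GiB",
    "Metadata,DUP: Size:23.00GiB, Used:3.83GiB",
    "System,DUP: Size:32.00MiB, Used:112.00KiB",
]

# Assumption: only lines with both values in GiB are of interest here.
CHUNK_LINE = re.compile(
    r"^(?P<kind>\w+),(?P<profile>\w+): "
    r"Size:(?P<size>[0-9.]+)GiB, Used:(?P<used>[0-9.]+)GiB$"
)


def slack_gib(lines):
    """Return allocated-but-unused space per chunk type, in GiB."""
    slack = {}
    for line in lines:
        match = CHUNK_LINE.match(line.strip())
        if match:
            size = float(match["size"])
            used = float(match["used"])
            slack[match["kind"]] = round(size - used, 2)
    return slack


print(slack_gib(REPORT))
```

The data figure agrees with the "Free (estimated): 148.20GiB" line of the report, which is why ordinary writes still succeed even though nothing is left unallocated.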

> btrfs balance -mprofiles seems to do something. one kworked and one
> btrfs-transaction process hog one CPU core each for hours, while
> blocking the filesystem for minutes apiece, which leads to the host
> being nearly unuseable up to the point of "clock and mouse pointer
> frozen for nearly ten minutes".
> 
> The btrfs balance cancel I issued after four hours of this state took
> eleven minutes alone to complete.

It's worth noting as an aside that Linux isn't necessarily tuned for 
interactivity by default, tho there are definitely ways to make it more 
so.  Additionally, on some mobos at least, it's possible to tweak the 
BIOS balance between interactivity and thruput.  An old Tyan board (PCI 
not the newer PCIE, which avoids some of the problems with multiple 
dedicated buses) I had was tilted a bit heavily toward thruput, which did 
make sense as it was actually a server board, until I tweaked things a 
bit.  That made a LOT of difference, curing the dragging, but also curing 
occasional audio runouts, etc.  Turns out it was simply tuned to do huge 
bus "packets" (I forgot the proper in-context term, and that board died a 
few years ago, so...), increasing thruput, but also increasing latency 
beyond what the sound card and keyboard/mouse (or in that case the human 
operating them) could reasonably deal with.  By shortening the PCI 
"packet length", it reduced thruput a bit but greatly improved latency, 
letting other users have their turn when they needed it, not some time 
later.

Of course in addition to PCIE putting many of those things on dedicated 
buses these days, ssds are so much faster that a lot of things that could 
potentially be problems on spinning rust, simply don't tend to be issues 
on ssds.  As much as anything, I think that's what a lot of users 
bothered by such problems are turning to, and I'd bet that's a good part 
of why SSDs are as popular as they are, as well.  I know I've simply not 
had many of the problems here that others had, and while I think part of 
it is the multiple relatively small but independent filesystems and part 
of it may be because I don't use snapshotting, I also think a major part 
of it is simply that the SSDs I'm running btrfs on are simply so much 
faster than spinning rust that the problems either don't occur, or if 
they do, they're done before I even notice them.

FWIW, I do still use spinning rust, but for my media partition and 
(second) backups, not for anything speed critical at all.  And FWIW, I 
still use reiserfs on that spinning rust, not btrfs, which I only use on 
the SSDs.

But I'll skip the tuning detail discussion here.  If necessary, that 
could go in a different thread.

[snip logging and apparently unrelated traceback]

> This btrfs is ripe for the backup-format-restore procedure, right?

What was the exact balance -mprofiles filter you used?  The -mprofiles 
alone says -m, metadata, profiles, but doesn't say what to actually /do/ 
with the profiles, no selection, no conversion, no nothing, so it would 
presumably do exactly the same thing as -m by itself.  While that could 
conceivably help if you let it do the full metadata balance (which 
is what the effect would be), that's not the most efficient way to go 
about it, for sure.

OK, so until you have at least a GiB unallocated, attempting to even 
touch data chunks (beyond -dusage=0) is likely to result in an ENOSPC as 
there's simply not enough room to write a new chunk.  (The only reason 
-dusage=1 went through is that there were no non-empty chunks at or 
below 1% usage to relocate; apparently there's at least one between 1% 
and 2%, and that is where -dusage=2 failed.)  And -dusage=0 and 
-dusage=1 didn't free any entirely empty chunks, so you can't get out of 
the tight spot that way.

Did you try -musage=0 and incrementing by 5 or 10% at a time from there?  
That's what I'd try next, hoping that would give me some gigs to work 
with, tho it's possible it will ENOSPC as well.  But assuming it works...

According to btrfs fi usage, you have 23 GiB worth of metadata chunks, 
but under 4 GiB nominal usage, so say 4.5 GiB with the half GiB of global 
reserve.  With those figures, were you to rebalance all metadata, you'd 
probably end up freeing 18 GiB or so of metadata chunks.  However, if 
you've incremented usage gradually until it's starting to take "too 
long", and it has freed say 3-5 GiB, hopefully that'll be enough to work 
with to start rebalancing the data chunks, which is where the real payoff 
should be as there's ~140 GiB that should be reclaimable there.
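
The estimate above can be written out as a small calculation. This is an idealized upper bound, not a prediction: it assumes a full metadata balance could pack everything perfectly, and it uses the exact 3.83 GiB figure where the text rounds to "under 4 GiB". The raw figure follows from the report's metadata ratio of 2.00 (DUP).

```python
# Figures copied from the `btrfs fi usage` output quoted in this thread (GiB).
METADATA_ALLOCATED = 23.00
METADATA_USED = 3.83
GLOBAL_RESERVE = 0.50
METADATA_RATIO = 2.00  # DUP keeps two copies of every metadata chunk


def reclaimable_metadata(allocated, used, reserve):
    """Upper bound on the logical metadata space a full balance could free."""
    return allocated - (used + reserve)


logical = reclaimable_metadata(METADATA_ALLOCATED, METADATA_USED, GLOBAL_RESERVE)
raw = logical * METADATA_RATIO  # space returned to "unallocated" on the device

print(f"logical: {logical:.2f} GiB, raw on device: {raw:.2f} GiB")
```

Because of DUP, every logical GiB of metadata chunk that gets freed returns about twice that to the unallocated pool, so even a partial -musage pass may free enough to start on data chunks.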

So assuming you can free some gigs with -musage=, then try -dusage, again 
incrementing, until you reach something reasonable, say at least 50 GiB, 
unallocated.
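
The incremental approach described above can be laid out as a list of commands. The sketch below only builds and prints the command lines; it does not run anything. The step size and upper limit are assumptions chosen for illustration, and each command would have to be run as root, with `btrfs fi usage` checked between steps to see how much has become unallocated.

```python
def balance_steps(mountpoint, kind, start=0, stop=50, step=10):
    """Yield `btrfs balance start` command lines with a rising usage filter.

    kind is 'm' for metadata (-musage=N) or 'd' for data (-dusage=N).
    The default stop and step values are assumptions for this sketch.
    """
    if kind not in ("m", "d"):
        raise ValueError("kind must be 'm' (metadata) or 'd' (data)")
    for usage in range(start, stop + 1, step):
        yield ["btrfs", "balance", "start", f"-{kind}usage={usage}", mountpoint]


# Metadata first, to win back some unallocated space, then data.
for kind in ("m", "d"):
    for command in balance_steps("/", kind, start=0, stop=30, step=10):
        print(" ".join(command))
```

Stopping early is part of the idea: once a step starts taking too long or enough space is unallocated, there is no need to continue to higher usage values.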

I wouldn't touch the profiles filter unless you have to.  And as you 
suggested, in that case it could well be faster to simply do the backup/
format/restore thing.

Of course you should already have a backup if it's worth backing up, so 
you shouldn't really need to worry about that step unless you want to 
freshen up your backup, and can simply do the mkfs and restore steps.

Depending on how deep into -musage= and -dusage= you have to go, and how 
long it takes on your spinning rust, it may actually be faster to do the 
mkfs and restore from backup in any case, but given what you posted so 
far, it's not necessary yet, so your choice, based I guess on what you 
think will be faster vs what you might lose... really not a lot if you've 
already deleted all the snapshots and you either have current backups or 
freshen them before doing the blow-away.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: Another ENOSPC situation
  2016-04-02  4:55 ` Duncan
@ 2016-04-02  5:43   ` Chris Murphy
  2016-04-02  7:31     ` Duncan
  2016-04-05 13:39     ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 11+ messages in thread
From: Chris Murphy @ 2016-04-02  5:43 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri, Apr 1, 2016 at 10:55 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:

>> [4/502]mh@swivel:~$ sudo btrfs fi usage /
>> Overall:
>>     Device size:                 600.00GiB
>>     Device allocated:            600.00GiB
>>     Device unallocated:            1.00MiB
>
> That's the problem right there.  The admin didn't do his job and spot the
> near full allocation issue


I don't yet agree this is an admin problem. This is the 2nd or 3rd
case we've seen only recently where there's plenty of space in all
chunk types and yet ENOSPC happens, seemingly only because there's no
unallocated space remaining. I don't know that this is a regression
for sure, but it sure seems like one.



>>
>> Data,single: Size:553.93GiB, Used:405.73GiB
>>    /dev/mapper/swivelbtr         553.93GiB
>>
>> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>>    /dev/mapper/swivelbtr          46.00GiB
>>
>> System,DUP: Size:32.00MiB, Used:112.00KiB
>>    /dev/mapper/swivelbtr          64.00MiB
>>
>> Unallocated:
>>    /dev/mapper/swivelbtr           1.00MiB
>> [5/503]mh@swivel:~$
>
> Both data and metadata have several GiB free, data ~140 GiB free, and
> metadata isn't into global reserve, so the system isn't totally wedged,
> only partially, due to the lack of unallocated space.

Unallocated space alone hasn't ever caused this that I can remember.
It's most often been totally full metadata chunks, with free space in
allocated data chunks, with no unallocated space out of which to
create another metadata chunk to write out changes.

There should be plenty of space for either a -dusage=1 or -musage=1
balance to free up a bunch of partially allocated chunks. Offhand I
don't think the profiles filter is helpful in this case.

OK so where I could be wrong is that I'm expecting balance doesn't
require unallocated space to work. I'd expect that it can COW extents
from one chunk into another existing chunk (of the same type) and then
once that's successful, free up that chunk, i.e. revert it back to
unallocated. If balance can only copy into newly allocated chunks,
that seems like a big problem. I thought that problem had been fixed
a very long time ago.

And what we don't see from 'usage' that we will see from 'df' is the
GlobalReserve values. I'd like to see that.

Anyway, in the meantime there is a work around:

btrfs dev add

Just add a device, even if it's an 8GiB flash drive. But it can be a
spare space on a partition, or it can be a logical volume, or whatever
you want. That'll add some gigs of unallocated space. Now the balance
will work, or for absolutely sure there's a bug (and a new one because
this has always worked in the past). After whatever filtered or full
balance is done, make sure to 'btrfs dev rem' and confirm it's gone
with 'btrfs fi show' before removing the device. It's a two device
volume until that device is successfully removed and is in something
of a fragile state until then because any loss of data on that 2nd
device has a good chance of face planting the file system.
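
Spelled out, the workaround is a fixed sequence of four commands. The sketch below just assembles them in order and prints them; nothing is executed. The device path /dev/sdX1 is a placeholder and the -dusage value is an assumption, since the text above leaves the choice of balance filter open.

```python
def temporary_device_plan(mountpoint, spare_device, data_usage=10):
    """Return the command lines for the add / balance / remove / verify cycle.

    spare_device is whatever extra block device is available;
    data_usage is an assumed filter value for this sketch.
    """
    return [
        ["btrfs", "device", "add", spare_device, mountpoint],
        ["btrfs", "balance", "start", f"-dusage={data_usage}", mountpoint],
        ["btrfs", "device", "remove", spare_device, mountpoint],
        ["btrfs", "filesystem", "show", mountpoint],
    ]


for command in temporary_device_plan("/", "/dev/sdX1"):
    print(" ".join(command))
```

The order matters: the spare device must not be unplugged until the remove step has finished and the last command no longer lists it.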



-- 
Chris Murphy

* Re: Another ENOSPC situation
  2016-04-02  5:43   ` Chris Murphy
@ 2016-04-02  7:31     ` Duncan
  2016-04-05 13:39     ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 11+ messages in thread
From: Duncan @ 2016-04-02  7:31 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Fri, 01 Apr 2016 23:43:46 -0600 as excerpted:

> On Fri, Apr 1, 2016 at 10:55 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:
> 
>>> [4/502]mh@swivel:~$ sudo btrfs fi usage /
>>> Overall:
>>>     Device size:                 600.00GiB
>>>     Device allocated:            600.00GiB
>>>     Device unallocated:            1.00MiB
>>
>> That's the problem right there.  The admin didn't do his job and spot
>> the near full allocation issue
> 
> 
> I don't yet agree this is an admin problem. This is the 2nd or 3rd case
> we've seen only recently where there's plenty of space in all chunk
> types and yet ENOSPC happens, seemingly only because there's no
> unallocated space remaining. I don't know that this is a regression for
> sure, but it sure seems like one.

Notice that he said _balance_ failed with ENOSPC.  He did _NOT_ say he 
was getting it in ordinary usage, just yet.  Which would fit a 100% 
allocated situation, with plenty of space left in both data and metadata 
chunks.  The plenty of space left inside the chunks would keep ordinary 
usage from running into problems just yet, but balance really /does/ need 
room to allocate at least one new chunk in order to properly handle the 
chunk rewrite via COW.  (At least for data, metadata seems to work a bit 
differently.  See below.)

Balance has always failed with ENOSPC if there was no unallocated space 
left.  It used to happen all the time, before btrfs learned how to delete 
empty chunks in 3.17, but while that helps, it only works for literally 
/empty/ chunks.  Chunks with even a single block/node still in use don't 
get deleted automatically.

What I think is happening now is that the empty-chunk deletion from 3.17 
on helped, but only so much.  People with particular usage patterns, and 
I'd strongly suspect heavy snapshotting, don't tend to fully empty their 
chunks to the extent that those with other usage patterns do.  For them, 
deleting empty chunks never freed enough chunks to keep up, and it has 
been just long enough now that we're beginning to see the problem 
reported again.

>>> Data,single: Size:553.93GiB, Used:405.73GiB
>>>    /dev/mapper/swivelbtr         553.93GiB
>>>
>>> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>>>    /dev/mapper/swivelbtr          46.00GiB
>>>
>>> System,DUP: Size:32.00MiB, Used:112.00KiB
>>>    /dev/mapper/swivelbtr          64.00MiB
>>>
>>> Unallocated:
>>>    /dev/mapper/swivelbtr           1.00MiB
>>> [5/503]mh@swivel:~$
>>
>> Both data and metadata have several GiB free, data ~140 GiB free, and
>> metadata isn't into global reserve, so the system isn't totally wedged,
>> only partially, due to the lack of unallocated space.
> 
> Unallocated space alone hasn't ever caused this that I can remember.
> It's most often been totally full metadata chunks, with free space in
> allocated data chunks, with no unallocated space out of which to create
> another metadata chunk to write out changes.

Unallocated space alone doesn't cause ENOSPC with normal operations; for 
those you're correct, running out of either data or metadata space is 
required as well.  (Normally it's metadata that runs out, but I recall 
seeing one post from someone who had metadata room but full data.  The 
behavior was.. "interesting", as he could do renames, etc, and even 
create small files as long as they were small enough to stay in 
metadata.  As soon as he tried to do anything that needed an actual data 
extent, however, ENOSPC.)

But balance has always required space to allocate at least one chunk, as 
COW means the existing chunk can't be released until everything is 
rewritten into the new one.

Tho it seems that btrfs can sometimes either write very small metadata 
chunks, which don't forget are dup by default on a single device, as they 
are in this case.  He has 1 MiB unallocated.  Split in half that's 512 
KiB.  I'm not sure if btrfs can go that small, but if it can, and it can 
find a low enough usage metadata chunk to write into it, freeing the 
larger metadata chunk...

Or maybe btrfs can actually use the global reserve for that, since global 
reserve is part of metadata.  If it can, a 512 MiB global reserve would 
be just large enough to write the two copies of a nominally 256 MiB 
metadata chunk.

Either way, I've seen a number of times now where btrfs was able to 
balance metadata, when it had less than the 256 (*2 if dup) MiB 
unallocated that would normally be required.  Maybe it /is/ able to use 
global reserve for that, which would allow it to work, as long as 
metadata isn't so tight that it's already using global reserve.  That's 
actually what I bet it's doing, now that I think about it.  Because as 
long as the global reserve isn't being used, 512 MiB of global reserve 
would be exactly 2*256 MiB metadata chunks, and if they're unused, that 
would allow balance to claim them without actually having to allocate 
them.  But, I'd bet it works only if global reserve remains at absolutely 
0 usage.
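
The size arithmetic behind that guess can be checked in a few lines. Note what this does and does not show: it only confirms that the numbers line up. Whether balance really draws on the global reserve is the speculation above, not something the snippet can establish. The function names are made up for the example.

```python
KIB = 2**10
MIB = 2**20

GLOBAL_RESERVE = 512 * MIB   # "Global reserve: 512.00MiB (used: 0.00B)"
UNALLOCATED = 1 * MIB        # "Device unallocated: 1.00MiB"
METADATA_CHUNK = 256 * MIB   # nominal metadata chunk size used in the text


def dup_raw_size(chunk_bytes, copies=2):
    """Raw space needed to store one chunk under the DUP profile."""
    return chunk_bytes * copies


def fits(available_bytes, chunk_bytes):
    """True if a DUP copy pair of the chunk fits into the available space."""
    return dup_raw_size(chunk_bytes) <= available_bytes


def largest_dup_chunk(available_bytes, copies=2):
    """Largest chunk whose DUP copies would fit into the available space."""
    return available_bytes // copies


print(fits(GLOBAL_RESERVE, METADATA_CHUNK))   # reserve vs. one DUP chunk
print(fits(UNALLOCATED, METADATA_CHUNK))      # the 1 MiB left unallocated
print(largest_dup_chunk(UNALLOCATED) // KIB, "KiB")
```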

> There should be plenty of space for either a -dusage=1 or -musage=1
> balance to free up a bunch of partially allocated chunks. Offhand I
> don't think the profiles filter is helpful in this case.
> 
> OK so where I could be wrong is that I'm expecting balance doesn't
> require unallocated space to work. I'd expect that it can COW extents from
> one chunk into another existing chunk (of the same type) and then once
> that's successful, free up that chunk, i.e. revert it back to
> unallocated. If balance can only copy into newly allocated chunks, that
> seems like a big problem. I thought that problem had been fixed a very
> long time ago.

I don't believe it can.  It has to create new chunks.  (Tho if it works 
as in the metadata and global reserve discussion above, that would be an 
exception, as it could then use those 100% unused metadata global reserve 
chunks without having to actually allocate them first.)

> And what we don't see from 'usage' that we will see from 'df' is the
> GlobalReserve values. I'd like to see that.

Actually... look again.  It's there, and I quoted it, but you snipped 
that part. =:^)

Tho I don't blame you, an actually usable btrfs fi usage is new enough to 
all of us that we're still getting used to it, and don't have its format 
hard-wired into our wetware by repetition just yet, as we do btrfs fi 
show and btrfs fi df.  I know there's been several times when I "lost" a 
figure in fi usage that I knew was there... somewhere! and had to start 
from the top and go thru every line one by one to find it, because I just 
don't know usage like I know show and df yet. =:^\

Plus, I think it's a bit more difficult because the display is more 
spread out, so there's more "haystack" to lose the "needle" in. =;^P

But I suppose we'll get used to it, over time.

> Anyway, in the meantime there is a work around:
> 
> btrfs dev add
> 
> Just add a device, even if it's an 8GiB flash drive. But it can be a
> spare space on a partition, or it can be a logical volume, or whatever
> you want. That'll add some gigs of unallocated space. Now the balance
> will work, or for absolutely sure there's a bug (and a new one because
> this has always worked in the past). After whatever filtered or full
> balance is done, make sure to 'btrfs dev rem' and confirm it's gone with
> 'btrfs fi show' before removing the device. It's a two device volume
> until that device is successfully removed and is in something of a
> fragile state until then because any loss of data on that 2nd device has
> a good chance of face planting the file system.

Agreed.  This is the next step if he can't finagle enough room out of 
metadata without it ENOSPCing.  But as I said, I've actually seen it 
(metadata only, not data... until metadata shrunk enough to leave some 
gigs unallocated) work a couple times recently when I didn't think it 
could due to no unallocated space, and I'm actually beginning to think 
that's due to balance being able to use the (metadata-only) global 
reserve.

Which would make sense, but it's either a relatively new development, or 
one I simply didn't know about previously.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: Another ENOSPC situation
  2016-04-02  5:43   ` Chris Murphy
  2016-04-02  7:31     ` Duncan
@ 2016-04-05 13:39     ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-05 13:39 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS

On 2016-04-02 01:43, Chris Murphy wrote:
> On Fri, Apr 1, 2016 at 10:55 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:
>
>>> [4/502]mh@swivel:~$ sudo btrfs fi usage /
>>> Overall:
>>>      Device size:                 600.00GiB
>>>      Device allocated:            600.00GiB
>>>      Device unallocated:            1.00MiB
>>
>> That's the problem right there.  The admin didn't do his job and spot the
>> near full allocation issue
>
>
> I don't yet agree this is an admin problem. This is the 2nd or 3rd
> case we've seen only recently where there's plenty of space in all
> chunk types and yet ENOSPC happens, seemingly only because there's no
> unallocated space remaining. I don't know that this is a regression
> for sure, but it sure seems like one.
I personally don't think it's a regression.  I've hit this myself before 
(although I make a point not to anymore, having to jump through hoops to 
the degree I did to get the FS working again tends to provide a pretty 
big incentive to not let it happen again), I know a couple of other 
people who have and never reported it here or on IRC, and I'd be willing 
to bet that the reason we're seeing it recently is that more 'regular' 
users (in contrast to system administrators or developers) are using 
BTRFS, and they tend to be more likely to hit such issues (because 
they're not as likely to know about them in the first place, let alone 
how to avoid them).
>
>
>
>>>
>>> Data,single: Size:553.93GiB, Used:405.73GiB
>>>     /dev/mapper/swivelbtr         553.93GiB
>>>
>>> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>>>     /dev/mapper/swivelbtr          46.00GiB
>>>
>>> System,DUP: Size:32.00MiB, Used:112.00KiB
>>>     /dev/mapper/swivelbtr          64.00MiB
>>>
>>> Unallocated:
>>>     /dev/mapper/swivelbtr           1.00MiB
>>> [5/503]mh@swivel:~$
>>
>> Both data and metadata have several GiB free, data ~140 GiB free, and
>> metadata isn't into global reserve, so the system isn't totally wedged,
>> only partially, due to the lack of unallocated space.
>
> Unallocated space alone hasn't ever caused this that I can remember.
> It's most often been totally full metadata chunks, with free space in
> allocated data chunks, with no unallocated space out of which to
> create another metadata chunk to write out changes.
>
> There should be plenty of space for either a -dusage=1 or -musage=1
> balance to free up a bunch of partially allocated chunks. Offhand I
> don't think the profiles filter is helpful in this case.
>
> OK so where I could be wrong is that I'm expecting balance doesn't
> require unallocated space to work. I'd expect that it can COW extents
> from one chunk into another existing chunk (of the same type) and then
> once that's successful, free up that chunk, i.e. revert it back to
> unallocated. If balance can only copy into newly allocated chunks,
> that seems like a big problem. I thought that problem had been fixed
> a very long time ago.
Balance has always allocated new chunks.  This is IMHO one of the big 
issues with the current implementation of it (the other being that it 
can't be made asynchronous without some creative userspace work).  If we 
aren't converting chunk types and we're on a single device FS, we should 
be tail-packing existing chunks before we try to allocate new ones.
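
To make the arithmetic concrete, here is a rough sketch in Python using the figures from the `btrfs fi usage` output quoted above. The 1 GiB chunk size is an assumption for illustration only (actual chunk sizes vary), and the function names are made up for this example. It shows how a filesystem can have plenty of free space inside its existing chunks and still have no room for balance to allocate a new one.

```python
# Figures copied from the quoted `btrfs fi usage /` output.
MIB_PER_GIB = 1024

data_size_gib, data_used_gib = 553.93, 405.73
meta_size_gib, meta_used_gib = 23.00, 3.83
unallocated_mib = 1.00

# ASSUMPTION (not from the thread): balance needs room for one new chunk
# of roughly 1 GiB. Real chunk sizes depend on the filesystem.
ASSUMED_CHUNK_MIB = 1 * MIB_PER_GIB


def free_inside_chunks(size_gib, used_gib):
    """Space that is allocated to chunks but not yet used, in GiB."""
    return round(size_gib - used_gib, 2)


def can_allocate_new_chunk(unallocated, chunk=ASSUMED_CHUNK_MIB):
    """True if the unallocated space could hold one more chunk."""
    return unallocated >= chunk


data_free = free_inside_chunks(data_size_gib, data_used_gib)
meta_free = free_inside_chunks(meta_size_gib, meta_used_gib)

print(f"free inside data chunks:     {data_free} GiB")
print(f"free inside metadata chunks: {meta_free} GiB")
print(f"room for a new chunk:        {can_allocate_new_chunk(unallocated_mib)}")
```

The data figure matches the "Free (estimated)" line of the quoted output, yet with only 1 MiB unallocated there is nowhere to put a new chunk.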
>
> And what we don't see from 'usage' that we will see from 'df' is the
> GlobalReserve values. I'd like to see that.
>
> Anyway, in the meantime there is a workaround:
>
> btrfs dev add
>
> Just add a device, even if it's an 8GiB flash drive. But it can be a
> spare space on a partition, or it can be a logical volume, or whatever
> you want. That'll add some gigs of unallocated space. Now the balance
> will work, or for absolutely sure there's a bug (and a new one because
> this has always worked in the past). After whatever filtered or full
> balance is done, make sure to 'btrfs dev rem' and confirm it's gone
> with 'btrfs fi show' before removing the device. It's a two device
> volume until that device is successfully removed and is in something
> of a fragile state until then because any loss of data on that 2nd
> device has a good chance of face planting the file system.
If you can ensure with a relative degree of certainty that you won't 
lose power or crash, and you have lots of RAM, a small ramdisk (or even 
zram) works well for this too.  I wouldn't use either personally for a 
critical filesystem (I'd pull out the disk and hook it up internally to 
another system with spare disk space and handle things there), but both 
options should work fine.
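
The order of operations in the workaround quoted above can be written out as a dry-run sketch in Python. Nothing is executed; the script only prints the commands in sequence. The device name `/dev/sdX`, the helper name, and the usage filter value of 10 are placeholders chosen for illustration, not values from this thread.

```python
import shlex


def workaround_commands(mountpoint, spare_device, data_usage_pct):
    """Return the workaround steps as argv lists. Nothing is run."""
    return [
        # 1. Add a temporary device to gain some unallocated space.
        ["btrfs", "device", "add", spare_device, mountpoint],
        # 2. Run a filtered balance now that new chunks can be allocated.
        ["btrfs", "balance", "start", f"-dusage={data_usage_pct}", mountpoint],
        # 3. Remove the temporary device again.
        ["btrfs", "device", "remove", spare_device, mountpoint],
        # 4. Confirm the device is gone before physically detaching it.
        ["btrfs", "filesystem", "show", mountpoint],
    ]


# Hypothetical values: adjust the device and filter for the real system.
steps = workaround_commands("/", "/dev/sdX", 10)
for argv in steps:
    print(shlex.join(argv))
```

Step 4 matters because, until the removal in step 3 has completed, the filesystem is a two-device volume and depends on the temporary device.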



Thread overview: 11+ messages
2016-04-01 13:40 Another ENOSPC situation Marc Haber
2016-04-01 15:44 ` Henk Slager
2016-04-01 16:30   ` Marc Haber
2016-04-01 16:50     ` Marc Haber
2016-04-01 19:20       ` Henk Slager
2016-04-01 20:40         ` Marc Haber
2016-04-01 23:39           ` Henk Slager
2016-04-02  4:55 ` Duncan
2016-04-02  5:43   ` Chris Murphy
2016-04-02  7:31     ` Duncan
2016-04-05 13:39     ` Austin S. Hemmelgarn
