On Tue, Apr 23, 2019 at 07:06:51PM -0400, Zygo Blaxell wrote:
> I had a test filesystem that ran out of unallocated space, then ran
> out of metadata space during a snapshot delete, and forced readonly.
> The workload before the failure was a lot of rsync and bees dedupe
> combined with random snapshot creates and deletes.

Had this happen again on a production filesystem, this time on 5.1.11,
and it happened during orphan inode cleanup instead of snapshot delete:

	[14303.076134][T20882] BTRFS: error (device dm-21) in add_to_free_space_tree:1037: errno=-28 No space left
	[14303.076144][T20882] BTRFS: error (device dm-21) in __btrfs_free_extent:7196: errno=-28 No space left
	[14303.076157][T20882] BTRFS: error (device dm-21) in btrfs_run_delayed_refs:3008: errno=-28 No space left
	[14303.076203][T20882] BTRFS error (device dm-21): Error removing orphan entry, stopping orphan cleanup
	[14303.076210][T20882] BTRFS error (device dm-21): could not do orphan cleanup -22
	[14303.076281][T20882] BTRFS error (device dm-21): commit super ret -30
	[14303.357337][T20882] BTRFS error (device dm-21): open_ctree failed

Same fix:  I bumped the reserved size limit from 512M to 2G and mounted
normally.  (OK, technically, I booted my old 5.0.21 kernel--but my 5.0.21
kernel has the 2G reserved space patch below in it.)

I've not been able to repeat this ENOSPC behavior under test conditions
in the last two months of trying, but it's now happened twice in different
places, so it has non-zero repeatability.

> I tried the usual fix strategies:
> 
> 	1.  Immediately after mount, try to balance to free space for
> 	metadata
> 
> 	2.  Immediately after mount, add additional disks to provide
> 	unallocated space for metadata
> 
> 	3.  Mount -o nossd to increase metadata density
> 
> #3 had no effect.  #1 failed consistently.
> 
> #2 was successful, but the additional space was not used because
> btrfs couldn't allocate chunks for metadata because it ran out of
> metadata space for new metadata chunks.
> 
> When btrfs-cleaner tried to remove the first pending deleted snapshot,
> it started a transaction that failed due to lack of metadata space.
> Since the transaction failed, the filesystem reverts to its earlier state,
> and exactly the same thing happens on the next mount.  The 'btrfs dev
> add' in #2 is successful only if it is executed immediately after mount,
> before the btrfs-cleaner thread wakes up.
> 
> Here's what the kernel said during one of the attempts:
> 
> 	[41263.822252] BTRFS info (device dm-3): use zstd compression, level 0
> 	[41263.825135] BTRFS info (device dm-3): using free space tree
> 	[41263.827319] BTRFS info (device dm-3): has skinny extents
> 	[42046.463356] ------------[ cut here ]------------
> 	[42046.463387] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: errno=-28 No space left
> 	[42046.463404] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: errno=-28 No space left
> 	[42046.463407] BTRFS info (device dm-3): forced readonly
> 	[42046.463414] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:3011: errno=-28 No space left
> 	[42046.463429] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10517: errno=-28 No space left
> 	[42046.463548] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10520: errno=-28 No space left
> 	[42046.471363] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:3011: errno=-28 No space left
> 	[42046.471475] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10517: errno=-28 No space left
> 	[42046.471506] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10520: errno=-28 No space left
> 	[42046.473672] BTRFS: error (device dm-3) in btrfs_drop_snapshot:9489: errno=-28 No space left
> 	[42046.475643] WARNING: CPU: 0 PID: 10187 at fs/btrfs/extent-tree.c:7056 __btrfs_free_extent+0x364/0xf60
> 	[42046.475645] Modules linked in: mq_deadline bfq dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio joydev ppdev crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel dm_mod snd_pcm aesni_intel aes_x86_64 snd_timer crypto_simd cryptd glue_helper sr_mod snd cdrom psmouse sg soundcore input_leds pcspkr serio_raw ide_pci_generic i2c_piix4 bochs_drm parport_pc piix rtc_cmos floppy parport pcc_cpufreq ide_core qemu_fw_cfg evbug evdev ip_tables x_tables ipv6 crc_ccitt autofs4
> 	[42046.475677] CPU: 0 PID: 10187 Comm: btrfs-transacti Tainted: G    B   W         5.0.8-zb64-10a85e8a1569+ #1
> 	[42046.475678] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> 	[42046.475681] RIP: 0010:__btrfs_free_extent+0x364/0xf60
> 	[42046.475684] Code: 50 f0 48 0f ba a8 90 22 00 00 02 72 1f 8b 85 88 fe ff ff 83 f8 fb 0f 84 59 04 00 00 89 c6 48 c7 c7 00 85 be b8 e8 3c 1b 9b ff <0f> 0b 8b 8d 88 fe ff ff 48 8b bd 90 fe ff ff ba 90 1b 00 00 48 c7
> 	[42046.475685] RSP: 0018:ffff888103d8f8e0 EFLAGS: 00010282
> 	[42046.475688] RAX: 0000000000000000 RBX: ffff88802d9ce370 RCX: ffffffffb7224ca2
> 	[42046.475689] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88811960dfac
> 	[42046.475691] RBP: ffff888103d8fa98 R08: ffffed10232c24c1 R09: ffffed10232c24c0
> 	[42046.475693] R10: ffff888025c94aeb R11: ffffed10232c24c1 R12: 0000000000a5054e
> 	[42046.475694] R13: 0000000000003000 R14: 000000000000214a R15: ffff888103d8fa70
> 	[42046.475696] FS:  0000000000000000(0000) GS:ffff888119400000(0000) knlGS:0000000000000000
> 	[42046.475698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[42046.475700] CR2: 00007f1b84271718 CR3: 00000000a3636001 CR4: 00000000001606f0
> 	[42046.475702] Call Trace:
> 	[42046.475707]  ? trace_hardirqs_off+0x24/0x120
> 	[42046.475714]  ? update_block_group+0x6d0/0x6d0
> 	[42046.475718]  ? __lock_acquire+0xf8/0x26e0
> 	[42046.475721]  ? __kasan_slab_free+0x14d/0x230
> 	[42046.475724]  ? btrfs_delayed_ref_lock+0x3e/0x100
> 	[42046.475727]  ? __lock_acquire+0xf8/0x26e0
> 	[42046.475730]  ? kthread+0x1a9/0x200
> 	[42046.475733]  ? debug_show_all_locks+0x210/0x210
> 	[42046.475737]  ? __btrfs_run_delayed_refs+0xce/0x210
> 	[42046.475742]  ? debug_show_all_locks+0x210/0x210
> 	[42046.475747]  run_delayed_data_ref+0x15a/0x400
> 	[42046.475752]  ? alloc_reserved_file_extent+0x520/0x520
> 	[42046.475756]  ? rb_next+0x25/0x90
> 	[42046.475761]  run_one_delayed_ref+0x71/0xe0
> 	[42046.475765]  btrfs_run_delayed_refs_for_head+0x284/0x580
> 	[42046.475772]  __btrfs_run_delayed_refs+0xeb/0x210
> 	[42046.475777]  ? btrfs_run_delayed_refs_for_head+0x580/0x580
> 	[42046.475781]  ? debug_show_all_locks+0x210/0x210
> 	[42046.475786]  btrfs_run_delayed_refs+0xc0/0x260
> 	[42046.475792]  btrfs_commit_transaction+0xf0/0x10f0
> 	[42046.475795]  ? do_raw_spin_unlock+0xa8/0x140
> 	[42046.475799]  ? _raw_spin_unlock+0x27/0x40
> 	[42046.475803]  ? btrfs_record_root_in_trans+0x24/0xb0
> 	[42046.475807]  ? btrfs_apply_pending_changes+0xb0/0xb0
> 	[42046.475809]  ? start_transaction+0x14a/0x760
> 	[42046.475816]  transaction_kthread+0x218/0x250
> 	[42046.475822]  kthread+0x1a9/0x200
> 	[42046.475824]  ? btrfs_cleanup_transaction+0xb80/0xb80
> 	[42046.475827]  ? kthread_park+0xb0/0xb0
> 	[42046.475831]  ret_from_fork+0x3a/0x50
> 	[42046.475837] irq event stamp: 0
> 	[42046.475839] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
> 	[42046.475843] hardirqs last disabled at (0): [<ffffffffb70efe95>] copy_process.part.7+0xad5/0x3b80
> 	[42046.475846] softirqs last  enabled at (0): [<ffffffffb70efe95>] copy_process.part.7+0xad5/0x3b80
> 	[42046.475847] softirqs last disabled at (0): [<0000000000000000>]           (null)
> 	[42046.475849] ---[ end trace b02d556137ea688c ]---
> 	[42046.475852] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: errno=-28 No space left
> 	[42046.475859] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:3011: errno=-28 No space left
> 	[42046.475878] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10517: errno=-28 No space left
> 	[42046.475905] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10520: errno=-28 No space left
> 
> I noticed that during the few minutes when the filesystem was running,
> the "global reserve" number in "btrfs fi usage" output kept going up to
> 503M or so, then the transaction failed.
> 
> I then applied this patch to the kernel (5.0.8):
> 
> 	diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> 	index 83791d13c204..59ba456e2314 100644
> 	--- a/fs/btrfs/extent-tree.c
> 	+++ b/fs/btrfs/extent-tree.c
> 	@@ -5790,7 +5790,7 @@ static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
> 		spin_lock(&sinfo->lock);
> 		spin_lock(&block_rsv->lock);
> 
> 	-       block_rsv->size = min_t(u64, num_bytes, SZ_512M);
> 	+       block_rsv->size = min_t(u64, num_bytes, SZ_1G);
> 
> 		if (block_rsv->reserved < block_rsv->size) {
> 			num_bytes = btrfs_space_info_used(sinfo, true);
> 
> Then mounted the filesystem.
> 
> With the larger reserved block size, btrfs-cleaner was able to get
> past its earlier failures, allocating 3GB of new metadata chunks in the
> first transaction after it started deleting snapshots.  After this, the
> filesystem deleted some more snapshots (there are currently 2470 deleted
> snapshots remaining on this filesystem).
> 
> This isn't really a solution, though, for two reasons:
> 
> 1.  The commit that adds the limit on block_rsv->size (fdf30d1c1b3
> "Btrfs: limit the global reserve to 512mb") points out:
> 
>     A user reported a problem where he was getting early ENOSPC with
>     hundreds of gigs of free data space and 6 gigs of free metadata space.
>     This is because the global block reserve was taking up the entire
>     free metadata space.  This is ridiculous, we have infrastructure in
>     place to throttle if we start using too much of the global reserve,
>     so instead of letting it get this huge just limit it to 512mb so that
>     users can still get work done.  This allowed the user to complete
>     his rsync without issues. 
> 
> In other words, something should be handling the case when reserved
> space runs out, and that something is failing.
> 
> 2.  The 1G reserved limit still isn't sufficient.  The filesystem deletes
> a few dozen snapshots, then fails again later:
> 
> 	[ 3705.477036] CPU: 2 PID: 8388 Comm: btrfs-transacti Tainted: G    B             5.0.9-zb64-b6eed6abc880+ #3
> 	[ 3705.481510] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> 	[ 3705.484038] RIP: 0010:__btrfs_free_extent+0x364/0xf60
> 	[ 3705.485061] Code: 50 f0 48 0f ba a8 90 22 00 00 02 72 1f 8b 85 88 fe ff ff 83 f8 fb 0f 84 59 04 00 00 89 c6 48 c7 c7 80 85 be 8b e8 fc 1b 9b ff <0f> 0b 8b 8d 88 fe ff ff 48 8b bd 90 fe ff ff ba 90 1b 00 00 48 c7
> 	[ 3705.488046] RSP: 0018:ffff8881c29cf8e0 EFLAGS: 00010282
> 	[ 3705.488868] RAX: 0000000000000000 RBX: ffff888121dbdf20 RCX: ffffffff8a224a92
> 	[ 3705.490129] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8881f5e0dfac
> 	[ 3705.491246] RBP: ffff8881c29cfa98 R08: ffffed103ebc24c1 R09: ffffed103ebc24c0
> 	[ 3705.492376] R10: ffff88819e6d430b R11: ffffed103ebc24c1 R12: 00000000009ecb21
> 	[ 3705.493501] R13: 0000000000040000 R14: 00000000000021b3 R15: ffff8881c29cfa70
> 	[ 3705.494619] FS:  0000000000000000(0000) GS:ffff8881f5c00000(0000) knlGS:0000000000000000
> 	[ 3705.495883] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[ 3705.496781] CR2: 0000556bc833ec70 CR3: 00000001ef9e0001 CR4: 00000000001606e0
> 	[ 3705.497872] Call Trace:
> 	[ 3705.498279]  ? update_block_group+0x6d0/0x6d0
> 	[ 3705.498950]  ? btrfs_run_delayed_refs_for_head+0x352/0x580
> 	[ 3705.499816]  ? kthread+0x1a9/0x200
> 	[ 3705.500379]  ? debug_show_all_locks+0x210/0x210
> 	[ 3705.501078]  ? __btrfs_run_delayed_refs+0xce/0x210
> 	[ 3705.501813]  ? debug_show_all_locks+0x210/0x210
> 	[ 3705.502516]  run_delayed_data_ref+0x15a/0x400
> 	[ 3705.503193]  ? alloc_reserved_file_extent+0x520/0x520
> 	[ 3705.503968]  ? rb_next+0x3c/0x90
> 	[ 3705.504479]  run_one_delayed_ref+0x71/0xe0
> 	[ 3705.505116]  btrfs_run_delayed_refs_for_head+0x284/0x580
> 	[ 3705.505935]  __btrfs_run_delayed_refs+0xeb/0x210
> 	[ 3705.506649]  ? btrfs_run_delayed_refs_for_head+0x580/0x580
> 	[ 3705.507495]  ? debug_show_all_locks+0x210/0x210
> 	[ 3705.508201]  btrfs_run_delayed_refs+0xc0/0x260
> 	[ 3705.508912]  btrfs_commit_transaction+0xf0/0x10f0
> 	[ 3705.509654]  ? do_raw_spin_unlock+0xa8/0x140
> 	[ 3705.510343]  ? _raw_spin_unlock+0x27/0x40
> 	[ 3705.510979]  ? btrfs_record_root_in_trans+0x24/0xb0
> 	[ 3705.511748]  ? btrfs_apply_pending_changes+0xb0/0xb0
> 	[ 3705.512533]  ? start_transaction+0x14a/0x760
> 	[ 3705.513216]  transaction_kthread+0x218/0x250
> 	[ 3705.513900]  kthread+0x1a9/0x200
> 	[ 3705.514425]  ? btrfs_cleanup_transaction+0xb80/0xb80
> 	[ 3705.515233]  ? kthread_park+0xb0/0xb0
> 	[ 3705.515829]  ret_from_fork+0x3a/0x50
> 	[ 3705.516408] irq event stamp: 0
> 	[ 3705.516898] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
> 	[ 3705.518054] hardirqs last disabled at (0): [<ffffffff8a0efe65>] copy_process.part.7+0xad5/0x3b80
> 	[ 3705.519419] softirqs last  enabled at (0): [<ffffffff8a0efe65>] copy_process.part.7+0xad5/0x3b80
> 	[ 3705.520779] softirqs last disabled at (0): [<0000000000000000>]           (null)
> 	[ 3705.521944] ---[ end trace 3a5f720e889bc0d3 ]---
> 	[ 3705.522713] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: errno=-28 No space left
> 	[ 3705.524069] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:3011: errno=-28 No space left
> 
> If I increase the reserved limit again to 2G, btrfs-cleaner gets past
> this shapshot (so far).  It takes more than 10 minutes to delete some
> of the snapshots, so this will take a while to finish all 2470.
> 
> The current btrfs fi usage output worries me:
> 
> 	Overall:
> 	    Device size:                   2.02TiB
> 	    Device allocated:              2.01TiB
> 	    Device unallocated:           10.03GiB
> 	    Device missing:                  0.00B
> 	    Used:                          1.70TiB
> 	    Free (estimated):            296.02GiB      (min: 291.01GiB)
> 	    Data ratio:                       1.00
> 	    Metadata ratio:                   2.00
> 	    Global reserve:                2.00GiB      (used: 864.00KiB)
> 
> 	Data,single: Size:1.81TiB, Used:1.53TiB
> 	   /dev/mapper/vgnebtest-tvdb    928.97GiB
> 	   /dev/mapper/vgnebtest-tvdc    929.00GiB
> 
> 	Metadata,RAID1: Size:99.99GiB, Used:82.29GiB
> 	   /dev/mapper/vgnebtest-lvol1     5.00GiB
> 	   /dev/mapper/vgnebtest-lvol2     5.00GiB
> 	   /dev/mapper/vgnebtest-tvdb     94.99GiB
> 	   /dev/mapper/vgnebtest-tvdc     94.99GiB
> 
> 	System,RAID1: Size:8.00MiB, Used:240.00KiB
> 	   /dev/mapper/vgnebtest-tvdb      8.00MiB
> 	   /dev/mapper/vgnebtest-tvdc      8.00MiB
> 
> 	Unallocated:
> 	   /dev/mapper/vgnebtest-lvol1     5.00GiB
> 	   /dev/mapper/vgnebtest-lvol2     5.00GiB
> 	   /dev/mapper/vgnebtest-tvdb     30.00MiB
> 	   /dev/mapper/vgnebtest-tvdc      1.00MiB
> 
> Note "lvol1" and "lvol2" were added after the initial failure.  
> 
> There is clearly a lot of metadata space available, but the filesystem still
> intermittently hits ENOSPC while trying to free space.