* btrfs fi defrag hangs on small files, 100% CPU thread
@ 2022-01-16 19:15 Anthony Ruhier
  2022-01-17 12:10 ` Filipe Manana
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Ruhier @ 2022-01-16 19:15 UTC (permalink / raw)
  To: linux-btrfs


[-- Attachment #1.1.1: Type: text/plain, Size: 2935 bytes --]

Hi,
Since I upgraded from Linux 5.15 to 5.16, `btrfs filesystem defrag 
-t128K` hangs on small files (~1 byte) and triggers what seems to be an 
infinite loop in the kernel. One CPU thread ends up pinned at 100%. I 
cannot kill the process, and rebooting is blocked by btrfs.
This is a copy of bug https://bugzilla.kernel.org/show_bug.cgi?id=215498

Rebooting into Linux 5.15 shows no issue, and I have no problem running 
a defrag on bigger files (I filter out files smaller than 3.9KB).
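For illustration, the size filtering can be sketched roughly like this (a 
hedged sketch with a hypothetical demo directory; `echo` stands in for the 
real `btrfs filesystem defrag -t 128K` call so the selection logic can be 
shown without touching a btrfs mount):

```shell
# Sketch of the size-filtered manual defrag (hypothetical paths).
target=$(mktemp -d)
head -c 100  /dev/zero > "$target/small.txt"   # below the ~4K cutoff
head -c 8192 /dev/zero > "$target/big.txt"     # above the cutoff
# -size +4k skips files of 4KiB or less, mirroring the 3.9KB filter;
# replace `echo` with the real defrag invocation on an actual btrfs mount.
find "$target" -xdev -type f -size +4k \
    -exec echo btrfs filesystem defrag -t 128K {} \;
```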

I discussed this on #btrfs on IRC; here's what we debugged:

I can replicate the issue by copying an impacted file with 
`cp --reflink=never`. I attached one of the impacted files to this 
report, named README.md.

Someone suggested it could be a bug related to the inline extent, so we 
tried to check that.

filefrag shows that README.md is a single inline extent. I created a new 
18-byte file with random text (slightly bigger than the other file) that 
is also a single inline extent. This file doesn't trigger the bug and 
defragments without issue.
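The inline-extent check itself was along these lines (a sketch: `filefrag 
-v` reports an `inline` flag for btrfs inline extents; on filesystems that 
don't inline small files the flag is simply absent, so treat the output as 
illustrative only):

```shell
f=$(mktemp)
head -c 18 /dev/urandom > "$f"      # 18-byte file, as in the experiment above
# On btrfs, a small file like this shows one extent flagged "inline":
command -v filefrag >/dev/null && filefrag -v "$f" || true
```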

I remounted my system with `max_inline=0` and created a copy of 
README.md. `filefrag` shows that the new copy is now 1 regular (not 
inline) extent. This new file also triggers the bug, so the problem 
doesn't seem to be due to the inline extent.

Someone asked me to provide the output of `perf top` while the defrag is 
stuck:

     28.70%  [kernel]          [k] generic_bin_search
     14.90%  [kernel]          [k] free_extent_buffer
     13.17%  [kernel]          [k] btrfs_search_slot
     12.63%  [kernel]          [k] btrfs_root_node
      8.33%  [kernel]          [k] btrfs_get_64
      3.88%  [kernel]          [k] __down_read_common.llvm
      3.00%  [kernel]          [k] up_read
      2.63%  [kernel]          [k] read_block_for_search
      2.40%  [kernel]          [k] read_extent_buffer
      1.38%  [kernel]          [k] memset_erms
      1.11%  [kernel]          [k] find_extent_buffer
      0.69%  [kernel]          [k] kmem_cache_free
      0.69%  [kernel]          [k] memcpy_erms
      0.57%  [kernel]          [k] kmem_cache_alloc
      0.45%  [kernel]          [k] radix_tree_lookup

I can reproduce the bug on 2 different machines, running 2 different 
Linux distributions (Arch and Gentoo) with 2 different kernel configs.
One kernel is compiled with Clang, the other with GCC.

Kernel version: 5.16.0
Mount options:
     Machine 1: 
rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
     Machine 2: rw,noatime,compress-force=zstd:3,nossd,space_cache=v2

When the error happens, no message is shown in dmesg.

Thanks,
Anthony Ruhier


[-- Attachment #1.1.2: README.md --]
[-- Type: text/markdown, Size: 1 bytes --]

 

[-- Attachment #1.1.3: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 21807 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-16 19:15 btrfs fi defrag hangs on small files, 100% CPU thread Anthony Ruhier
@ 2022-01-17 12:10 ` Filipe Manana
  2022-01-17 14:24   ` Anthony Ruhier
  0 siblings, 1 reply; 14+ messages in thread
From: Filipe Manana @ 2022-01-17 12:10 UTC (permalink / raw)
  To: Anthony Ruhier; +Cc: linux-btrfs

On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
> Hi,
> Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem defrag -t128K`
> hangs on small files (~1 byte) and triggers what it seems to be a loop in
> the kernel. It results in one CPU thread running being used at 100%. I
> cannot kill the process, and rebooting is blocked by btrfs.
> It is a copy of the bug https://bugzilla.kernel.org/show_bug.cgi?id=215498
> 
> Rebooting to linux 5.15 shows no issue. I have no issue to run a defrag on
> bigger files (I filter out files smaller than 3.9KB).
> 
> I had a conversation on #btrfs on IRC, so here's what we debugged:
> 
> I can replicate the issue by copying a file impacted by this bug, by using
> `cp --reflink=never`. I attached one of the impacted files to this bug,
> named README.md.
> 
> Someone told me that it could be a bug due to the inline extent. So we tried
> to check that.
> 
> filefrag shows that the file Readme.md is 1 inline extent. I tried to create
> a new file with random text, of 18 bytes (slightly bigger than the other
> file), that is also 1 inline extent. This file doesn't trigger the bug and
> has no issue to be defragmented.
> 
> I tried to mount my system with `max_inline=0`, created a copy of README.md.
> `filefrag` shows me that the new file is now 1 extent, not inline. This new
> file also triggers the bug, so it doesn't seem to be due to the inline
> extent.
> 
> Someone asked me to provide the output of a perf top when the defrag is
> stuck:
> 
>     28.70%  [kernel]          [k] generic_bin_search
>     14.90%  [kernel]          [k] free_extent_buffer
>     13.17%  [kernel]          [k] btrfs_search_slot
>     12.63%  [kernel]          [k] btrfs_root_node
>      8.33%  [kernel]          [k] btrfs_get_64
>      3.88%  [kernel]          [k] __down_read_common.llvm
>      3.00%  [kernel]          [k] up_read
>      2.63%  [kernel]          [k] read_block_for_search
>      2.40%  [kernel]          [k] read_extent_buffer
>      1.38%  [kernel]          [k] memset_erms
>      1.11%  [kernel]          [k] find_extent_buffer
>      0.69%  [kernel]          [k] kmem_cache_free
>      0.69%  [kernel]          [k] memcpy_erms
>      0.57%  [kernel]          [k] kmem_cache_alloc
>      0.45%  [kernel]          [k] radix_tree_lookup
> 
> I can reproduce the bug on 2 different machines, running 2 different linux
> distributions (Arch and Gentoo) with 2 different kernel configs.
> This kernel is compiled with clang, the other with GCC.
> 
> Kernel version: 5.16.0
> Mount options:
>     Machine 1:
> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
>     Machine 2: rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
> 
> When the error happens, no message is shown in dmesg.

This is very likely the same issue as reported at this thread:

https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb

Are you able to test the patch posted there?

Thanks.

> 
> Thanks,
> Anthony Ruhier
> 

>  






* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-17 12:10 ` Filipe Manana
@ 2022-01-17 14:24   ` Anthony Ruhier
  2022-01-17 16:52     ` Filipe Manana
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Ruhier @ 2022-01-17 14:24 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs


[-- Attachment #1.1.1: Type: text/plain, Size: 4286 bytes --]

Thanks for the answer.

I had the exact same issue as in the thread you linked, and I have 
monitoring graphs showing that btrfs-cleaner did constant writes for 12 
hours right after I upgraded to Linux 5.16. Weirdly enough, the issue 
almost disappeared after I ran a btrfs balance filtered on 10% data 
usage.
That's also why I initially disabled autodefrag, which led to 
discovering this bug when I switched to manual defragmentation (which, 
in the end, makes more sense with my setup anyway).

I tried that patch, but sadly it doesn't help with the initial issue. I 
can't say anything about the bug in the other thread, as the 
btrfs-cleaner problem has disappeared (I still see some writes from it, 
but they are so rare that I can't tell whether they're normal or not).

Thanks,
Anthony

On 17/01/2022 at 13:10, Filipe Manana wrote:
> On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
>> Hi,
>> Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem defrag -t128K`
>> hangs on small files (~1 byte) and triggers what it seems to be a loop in
>> the kernel. It results in one CPU thread running being used at 100%. I
>> cannot kill the process, and rebooting is blocked by btrfs.
>> It is a copy of the bug https://bugzilla.kernel.org/show_bug.cgi?id=215498
>>
>> Rebooting to linux 5.15 shows no issue. I have no issue to run a defrag on
>> bigger files (I filter out files smaller than 3.9KB).
>>
>> I had a conversation on #btrfs on IRC, so here's what we debugged:
>>
>> I can replicate the issue by copying a file impacted by this bug, by using
>> `cp --reflink=never`. I attached one of the impacted files to this bug,
>> named README.md.
>>
>> Someone told me that it could be a bug due to the inline extent. So we tried
>> to check that.
>>
>> filefrag shows that the file Readme.md is 1 inline extent. I tried to create
>> a new file with random text, of 18 bytes (slightly bigger than the other
>> file), that is also 1 inline extent. This file doesn't trigger the bug and
>> has no issue to be defragmented.
>>
>> I tried to mount my system with `max_inline=0`, created a copy of README.md.
>> `filefrag` shows me that the new file is now 1 extent, not inline. This new
>> file also triggers the bug, so it doesn't seem to be due to the inline
>> extent.
>>
>> Someone asked me to provide the output of a perf top when the defrag is
>> stuck:
>>
>>      28.70%  [kernel]          [k] generic_bin_search
>>      14.90%  [kernel]          [k] free_extent_buffer
>>      13.17%  [kernel]          [k] btrfs_search_slot
>>      12.63%  [kernel]          [k] btrfs_root_node
>>       8.33%  [kernel]          [k] btrfs_get_64
>>       3.88%  [kernel]          [k] __down_read_common.llvm
>>       3.00%  [kernel]          [k] up_read
>>       2.63%  [kernel]          [k] read_block_for_search
>>       2.40%  [kernel]          [k] read_extent_buffer
>>       1.38%  [kernel]          [k] memset_erms
>>       1.11%  [kernel]          [k] find_extent_buffer
>>       0.69%  [kernel]          [k] kmem_cache_free
>>       0.69%  [kernel]          [k] memcpy_erms
>>       0.57%  [kernel]          [k] kmem_cache_alloc
>>       0.45%  [kernel]          [k] radix_tree_lookup
>>
>> I can reproduce the bug on 2 different machines, running 2 different linux
>> distributions (Arch and Gentoo) with 2 different kernel configs.
>> This kernel is compiled with clang, the other with GCC.
>>
>> Kernel version: 5.16.0
>> Mount options:
>>      Machine 1:
>> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
>>      Machine 2: rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
>>
>> When the error happens, no message is shown in dmesg.
> This is very likely the same issue as reported at this thread:
>
> https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb
>
> Are you able to test the patch posted there?
>
> Thanks.
>
>> Thanks,
>> Anthony Ruhier
>>
>>   

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 21807 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]


* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-17 14:24   ` Anthony Ruhier
@ 2022-01-17 16:52     ` Filipe Manana
  2022-01-17 17:04       ` Anthony Ruhier
  0 siblings, 1 reply; 14+ messages in thread
From: Filipe Manana @ 2022-01-17 16:52 UTC (permalink / raw)
  To: Anthony Ruhier; +Cc: linux-btrfs

On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
> Thanks for the answer.
> 
> I had the exact same issue as in the thread you've linked, and have some
> monitoring and graphs that showed that btrfs-cleaner did constant writes
> during 12 hours just after I upgraded to linux 5.16. Weirdly enough, the
> issue almost disappeared after I did a btrfs balance by filtering on 10%
> usage of data.
> But that's why I initially disabled autodefrag, what has lead to discovering
> this bug as I switched to manual defragmentation (which, in the end, makes
> more sense anyway with my setup).
> 
> I tried this patch, but sadly it doesn't help for the initial issue. I
> cannot say for the bug in the other thread, as the problem with
> btrfs-cleaner disappeared (I can still see some writes from it, but it so
> rare that I cannot say if it's normal or not).

Ok, reading your first mail more carefully, I see there's the case of the
1 byte file, which possibly does not share a cause with the excessive IO
problem triggered by autodefrag.

For the 1 byte file problem, I've just sent a fix:

https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/

It's actually trivial to trigger.

Can you check if it also fixes your problem with autodefrag?

If not, then try the following (after applying the 1 byte file fix):

https://pastebin.com/raw/EbEfk1tF

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a5bd6926f7ff..db795e226cca 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 				  u64 newer_than, bool do_compress,
 				  bool locked, struct list_head *target_list)
 {
+	const u32 min_thresh = extent_thresh / 2;
 	u64 cur = start;
 	int ret = 0;
 
@@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		struct extent_map *em;
 		struct defrag_target_range *new;
 		bool next_mergeable = true;
+		u64 range_start;
 		u64 range_len;
 
 		em = defrag_lookup_extent(&inode->vfs_inode, cur, locked);
@@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		if (em->generation < newer_than)
 			goto next;
 
+		/*
+		 * Our start offset might be in the middle of an existing extent
+		 * map, so take that into account.
+		 */
+		range_len = em->len - (cur - em->start);
+
+		/*
+		 * If there's already a good range for delalloc within the range
+		 * covered by the extent map, skip it, otherwise we can end up
+		 * doing on the same IO range for a long time when using auto
+		 * defrag.
+		 */
+		range_start = cur;
+		if (count_range_bits(&inode->io_tree, &range_start,
+				     range_start + range_len - 1, min_thresh,
+				     EXTENT_DELALLOC, 1) >= min_thresh)
+			goto next;
+
 		/*
 		 * For do_compress case, we want to compress all valid file
 		 * extents, thus no @extent_thresh or mergeable check.
@@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		if (do_compress)
 			goto add;
 
-		/* Skip too large extent */
-		if (em->len >= extent_thresh)
+		/* Skip large enough ranges. */
+		if (range_len >= extent_thresh)
 			goto next;
 
 		next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,


Thanks.


> 
> Thanks,
> Anthony
> 
> On 17/01/2022 at 13:10, Filipe Manana wrote:
> > On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
> > > Hi,
> > > Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem defrag -t128K`
> > > hangs on small files (~1 byte) and triggers what it seems to be a loop in
> > > the kernel. It results in one CPU thread running being used at 100%. I
> > > cannot kill the process, and rebooting is blocked by btrfs.
> > > It is a copy of the bug https://bugzilla.kernel.org/show_bug.cgi?id=215498
> > > 
> > > Rebooting to linux 5.15 shows no issue. I have no issue to run a defrag on
> > > bigger files (I filter out files smaller than 3.9KB).
> > > 
> > > I had a conversation on #btrfs on IRC, so here's what we debugged:
> > > 
> > > I can replicate the issue by copying a file impacted by this bug, by using
> > > `cp --reflink=never`. I attached one of the impacted files to this bug,
> > > named README.md.
> > > 
> > > Someone told me that it could be a bug due to the inline extent. So we tried
> > > to check that.
> > > 
> > > filefrag shows that the file Readme.md is 1 inline extent. I tried to create
> > > a new file with random text, of 18 bytes (slightly bigger than the other
> > > file), that is also 1 inline extent. This file doesn't trigger the bug and
> > > has no issue to be defragmented.
> > > 
> > > I tried to mount my system with `max_inline=0`, created a copy of README.md.
> > > `filefrag` shows me that the new file is now 1 extent, not inline. This new
> > > file also triggers the bug, so it doesn't seem to be due to the inline
> > > extent.
> > > 
> > > Someone asked me to provide the output of a perf top when the defrag is
> > > stuck:
> > > 
> > >      28.70%  [kernel]          [k] generic_bin_search
> > >      14.90%  [kernel]          [k] free_extent_buffer
> > >      13.17%  [kernel]          [k] btrfs_search_slot
> > >      12.63%  [kernel]          [k] btrfs_root_node
> > >       8.33%  [kernel]          [k] btrfs_get_64
> > >       3.88%  [kernel]          [k] __down_read_common.llvm
> > >       3.00%  [kernel]          [k] up_read
> > >       2.63%  [kernel]          [k] read_block_for_search
> > >       2.40%  [kernel]          [k] read_extent_buffer
> > >       1.38%  [kernel]          [k] memset_erms
> > >       1.11%  [kernel]          [k] find_extent_buffer
> > >       0.69%  [kernel]          [k] kmem_cache_free
> > >       0.69%  [kernel]          [k] memcpy_erms
> > >       0.57%  [kernel]          [k] kmem_cache_alloc
> > >       0.45%  [kernel]          [k] radix_tree_lookup
> > > 
> > > I can reproduce the bug on 2 different machines, running 2 different linux
> > > distributions (Arch and Gentoo) with 2 different kernel configs.
> > > This kernel is compiled with clang, the other with GCC.
> > > 
> > > Kernel version: 5.16.0
> > > Mount options:
> > >      Machine 1:
> > > rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
> > >      Machine 2: rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
> > > 
> > > When the error happens, no message is shown in dmesg.
> > This is very likely the same issue as reported at this thread:
> > 
> > https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb
> > 
> > Are you able to test the patch posted there?
> > 
> > Thanks.
> > 
> > > Thanks,
> > > Anthony Ruhier
> > > 







* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-17 16:52     ` Filipe Manana
@ 2022-01-17 17:04       ` Anthony Ruhier
  2022-01-17 17:50         ` Anthony Ruhier
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Ruhier @ 2022-01-17 17:04 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 7852 bytes --]

I'm going to apply your patch for the 1-byte file and quickly confirm 
whether it works.
Thanks a lot for your patch!

About the autodefrag issue: it's going to be trickier to check that your 
patch fixes it, because for whatever reason the problem seems to have 
resolved itself (or at least, btrfs-cleaner does far fewer writes) after 
a partial btrfs balance.
I'll look at the amount of writes again after several hours. For this 
one, maybe check with the author of the other bug: if they can easily 
reproduce the issue, it will be easier to verify the fix.

Thanks,
Anthony

On 17/01/2022 at 17:52, Filipe Manana wrote:
> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>> Thanks for the answer.
>>
>> I had the exact same issue as in the thread you've linked, and have some
>> monitoring and graphs that showed that btrfs-cleaner did constant writes
>> during 12 hours just after I upgraded to linux 5.16. Weirdly enough, the
>> issue almost disappeared after I did a btrfs balance by filtering on 10%
>> usage of data.
>> But that's why I initially disabled autodefrag, what has lead to discovering
>> this bug as I switched to manual defragmentation (which, in the end, makes
>> more sense anyway with my setup).
>>
>> I tried this patch, but sadly it doesn't help for the initial issue. I
>> cannot say for the bug in the other thread, as the problem with
>> btrfs-cleaner disappeared (I can still see some writes from it, but it so
>> rare that I cannot say if it's normal or not).
> Ok, reading better your first mail, I see there's the case of the 1 byte
> file, which is possibly not the cause from the autodefrag causing the
> excessive IO problem.
>
> For the 1 byte file problem, I've just sent a fix:
>
> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/
>
> It's actually trivial to trigger.
>
> Can you check if it also fixes your problem with autodefrag?
>
> If not, then try the following (after applying the 1 file fix):
>
> https://pastebin.com/raw/EbEfk1tF
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index a5bd6926f7ff..db795e226cca 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>   				  u64 newer_than, bool do_compress,
>   				  bool locked, struct list_head *target_list)
>   {
> +	const u32 min_thresh = extent_thresh / 2;
>   	u64 cur = start;
>   	int ret = 0;
>   
> @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>   		struct extent_map *em;
>   		struct defrag_target_range *new;
>   		bool next_mergeable = true;
> +		u64 range_start;
>   		u64 range_len;
>   
>   		em = defrag_lookup_extent(&inode->vfs_inode, cur, locked);
> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>   		if (em->generation < newer_than)
>   			goto next;
>   
> +		/*
> +		 * Our start offset might be in the middle of an existing extent
> +		 * map, so take that into account.
> +		 */
> +		range_len = em->len - (cur - em->start);
> +
> +		/*
> +		 * If there's already a good range for delalloc within the range
> +		 * covered by the extent map, skip it, otherwise we can end up
> +		 * doing on the same IO range for a long time when using auto
> +		 * defrag.
> +		 */
> +		range_start = cur;
> +		if (count_range_bits(&inode->io_tree, &range_start,
> +				     range_start + range_len - 1, min_thresh,
> +				     EXTENT_DELALLOC, 1) >= min_thresh)
> +			goto next;
> +
>   		/*
>   		 * For do_compress case, we want to compress all valid file
>   		 * extents, thus no @extent_thresh or mergeable check.
> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>   		if (do_compress)
>   			goto add;
>   
> -		/* Skip too large extent */
> -		if (em->len >= extent_thresh)
> +		/* Skip large enough ranges. */
> +		if (range_len >= extent_thresh)
>   			goto next;
>   
>   		next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
>
>
> Thanks.
>
>
>> Thanks,
>> Anthony
>>
>> On 17/01/2022 at 13:10, Filipe Manana wrote:
>>> On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
>>>> Hi,
>>>> Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem defrag -t128K`
>>>> hangs on small files (~1 byte) and triggers what it seems to be a loop in
>>>> the kernel. It results in one CPU thread running being used at 100%. I
>>>> cannot kill the process, and rebooting is blocked by btrfs.
>>>> It is a copy of the bug https://bugzilla.kernel.org/show_bug.cgi?id=215498
>>>>
>>>> Rebooting to linux 5.15 shows no issue. I have no issue to run a defrag on
>>>> bigger files (I filter out files smaller than 3.9KB).
>>>>
>>>> I had a conversation on #btrfs on IRC, so here's what we debugged:
>>>>
>>>> I can replicate the issue by copying a file impacted by this bug, by using
>>>> `cp --reflink=never`. I attached one of the impacted files to this bug,
>>>> named README.md.
>>>>
>>>> Someone told me that it could be a bug due to the inline extent. So we tried
>>>> to check that.
>>>>
>>>> filefrag shows that the file Readme.md is 1 inline extent. I tried to create
>>>> a new file with random text, of 18 bytes (slightly bigger than the other
>>>> file), that is also 1 inline extent. This file doesn't trigger the bug and
>>>> has no issue to be defragmented.
>>>>
>>>> I tried to mount my system with `max_inline=0`, created a copy of README.md.
>>>> `filefrag` shows me that the new file is now 1 extent, not inline. This new
>>>> file also triggers the bug, so it doesn't seem to be due to the inline
>>>> extent.
>>>>
>>>> Someone asked me to provide the output of a perf top when the defrag is
>>>> stuck:
>>>>
>>>>       28.70%  [kernel]          [k] generic_bin_search
>>>>       14.90%  [kernel]          [k] free_extent_buffer
>>>>       13.17%  [kernel]          [k] btrfs_search_slot
>>>>       12.63%  [kernel]          [k] btrfs_root_node
>>>>        8.33%  [kernel]          [k] btrfs_get_64
>>>>        3.88%  [kernel]          [k] __down_read_common.llvm
>>>>        3.00%  [kernel]          [k] up_read
>>>>        2.63%  [kernel]          [k] read_block_for_search
>>>>        2.40%  [kernel]          [k] read_extent_buffer
>>>>        1.38%  [kernel]          [k] memset_erms
>>>>        1.11%  [kernel]          [k] find_extent_buffer
>>>>        0.69%  [kernel]          [k] kmem_cache_free
>>>>        0.69%  [kernel]          [k] memcpy_erms
>>>>        0.57%  [kernel]          [k] kmem_cache_alloc
>>>>        0.45%  [kernel]          [k] radix_tree_lookup
>>>>
>>>> I can reproduce the bug on 2 different machines, running 2 different linux
>>>> distributions (Arch and Gentoo) with 2 different kernel configs.
>>>> This kernel is compiled with clang, the other with GCC.
>>>>
>>>> Kernel version: 5.16.0
>>>> Mount options:
>>>>       Machine 1:
>>>> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
>>>>       Machine 2: rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
>>>>
>>>> When the error happens, no message is shown in dmesg.
>>> This is very likely the same issue as reported at this thread:
>>>
>>> https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb
>>>
>>> Are you able to test the patch posted there?
>>>
>>> Thanks.
>>>
>>>> Thanks,
>>>> Anthony Ruhier
>>>>
>
>
>
>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]


* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-17 17:04       ` Anthony Ruhier
@ 2022-01-17 17:50         ` Anthony Ruhier
  2022-01-17 17:56           ` Filipe Manana
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Ruhier @ 2022-01-17 17:50 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 9182 bytes --]

Filipe,

Just a quick update after my previous email: your patch fixed the issue 
for `btrfs fi defrag`.
Thanks a lot! I closed my bug on Bugzilla.

I'll keep you posted about the autodefrag.

On 17/01/2022 at 18:04, Anthony Ruhier wrote:
> I'm going to apply your patch for the 1B file, and quickly confirm if 
> it works.
> Thanks a lot for your patch!
>
> About the autodefrag issue, it's going to be trickier to check that 
> your patch fixes it, because for whatever reason the problem seems to 
> have resolved itself (or at least, btrfs-cleaner does way less writes) 
> after a partial btrfs balance.
> I'll try and look the amount of writes after several hours. Maybe for 
> this one, see with the author of the other bug. If they can easily 
> reproduce the issue then it's going to be easier to check.
>
> Thanks,
> Anthony
>
> On 17/01/2022 at 17:52, Filipe Manana wrote:
>> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>>> Thanks for the answer.
>>>
>>> I had the exact same issue as in the thread you've linked, and have 
>>> some
>>> monitoring and graphs that showed that btrfs-cleaner did constant 
>>> writes
>>> during 12 hours just after I upgraded to linux 5.16. Weirdly enough, 
>>> the
>>> issue almost disappeared after I did a btrfs balance by filtering on 
>>> 10%
>>> usage of data.
>>> But that's why I initially disabled autodefrag, what has lead to 
>>> discovering
>>> this bug as I switched to manual defragmentation (which, in the end, 
>>> makes
>>> more sense anyway with my setup).
>>>
>>> I tried this patch, but sadly it doesn't help for the initial issue. I
>>> cannot say for the bug in the other thread, as the problem with
>>> btrfs-cleaner disappeared (I can still see some writes from it, but 
>>> it so
>>> rare that I cannot say if it's normal or not).
>> Ok, reading better your first mail, I see there's the case of the 1 byte
>> file, which is possibly not the cause from the autodefrag causing the
>> excessive IO problem.
>>
>> For the 1 byte file problem, I've just sent a fix:
>>
>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/ 
>>
>>
>> It's actually trivial to trigger.
>>
>> Can you check if it also fixes your problem with autodefrag?
>>
>> If not, then try the following (after applying the 1 file fix):
>>
>> https://pastebin.com/raw/EbEfk1tF
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index a5bd6926f7ff..db795e226cca 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct 
>> btrfs_inode *inode,
>>                     u64 newer_than, bool do_compress,
>>                     bool locked, struct list_head *target_list)
>>   {
>> +    const u32 min_thresh = extent_thresh / 2;
>>       u64 cur = start;
>>       int ret = 0;
>>   @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct 
>> btrfs_inode *inode,
>>           struct extent_map *em;
>>           struct defrag_target_range *new;
>>           bool next_mergeable = true;
>> +        u64 range_start;
>>           u64 range_len;
>>             em = defrag_lookup_extent(&inode->vfs_inode, cur, locked);
>> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct 
>> btrfs_inode *inode,
>>           if (em->generation < newer_than)
>>               goto next;
>>   +        /*
>> +         * Our start offset might be in the middle of an existing 
>> extent
>> +         * map, so take that into account.
>> +         */
>> +        range_len = em->len - (cur - em->start);
>> +
>> +        /*
>> +         * If there's already a good range for delalloc within the 
>> range
>> +         * covered by the extent map, skip it, otherwise we can end up
>> +         * doing on the same IO range for a long time when using auto
>> +         * defrag.
>> +         */
>> +        range_start = cur;
>> +        if (count_range_bits(&inode->io_tree, &range_start,
>> +                     range_start + range_len - 1, min_thresh,
>> +                     EXTENT_DELALLOC, 1) >= min_thresh)
>> +            goto next;
>> +
>>           /*
>>            * For do_compress case, we want to compress all valid file
>>            * extents, thus no @extent_thresh or mergeable check.
>> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct 
>> btrfs_inode *inode,
>>           if (do_compress)
>>               goto add;
>>   -        /* Skip too large extent */
>> -        if (em->len >= extent_thresh)
>> +        /* Skip large enough ranges. */
>> +        if (range_len >= extent_thresh)
>>               goto next;
>>             next_mergeable = 
>> defrag_check_next_extent(&inode->vfs_inode, em,
>>
>>
>> Thanks.
>>
>>
>>> Thanks,
>>> Anthony
>>>
>>> On 17/01/2022 at 13:10, Filipe Manana wrote:
>>>> On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
>>>>> Hi,
>>>>> Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem defrag 
>>>>> -t128K`
>>>>> hangs on small files (~1 byte) and triggers what seems to be a
>>>>> loop in the kernel. It results in one CPU thread being used at
>>>>> 100%. I cannot kill the process, and rebooting is blocked by btrfs.
>>>>> It is a copy of the bug: https://bugzilla.kernel.org/show_bug.cgi?id=215498
>>>>>
>>>>> Rebooting to linux 5.15 shows no issue. I have no issue to run a 
>>>>> defrag on
>>>>> bigger files (I filter out files smaller than 3.9KB).
>>>>>
>>>>> I had a conversation on #btrfs on IRC, so here's what we debugged:
>>>>>
>>>>> I can replicate the issue by copying a file impacted by this bug, 
>>>>> by using
>>>>> `cp --reflink=never`. I attached one of the impacted files to this 
>>>>> bug,
>>>>> named README.md.
>>>>>
>>>>> Someone told me that it could be a bug due to the inline extent. 
>>>>> So we tried
>>>>> to check that.
>>>>>
>>>>> filefrag shows that the file Readme.md is 1 inline extent. I tried 
>>>>> to create
>>>>> a new file with random text, of 18 bytes (slightly bigger than the 
>>>>> other
>>>>> file), that is also 1 inline extent. This file doesn't trigger the 
>>>>> bug and
>>>>> has no issue to be defragmented.
>>>>>
>>>>> I tried to mount my system with `max_inline=0`, created a copy of 
>>>>> README.md.
>>>>> `filefrag` shows me that the new file is now 1 extent, not inline. 
>>>>> This new
>>>>> file also triggers the bug, so it doesn't seem to be due to the 
>>>>> inline
>>>>> extent.
>>>>>
>>>>> Someone asked me to provide the output of a perf top when the 
>>>>> defrag is
>>>>> stuck:
>>>>>
>>>>>       28.70%  [kernel]          [k] generic_bin_search
>>>>>       14.90%  [kernel]          [k] free_extent_buffer
>>>>>       13.17%  [kernel]          [k] btrfs_search_slot
>>>>>       12.63%  [kernel]          [k] btrfs_root_node
>>>>>        8.33%  [kernel]          [k] btrfs_get_64
>>>>>        3.88%  [kernel]          [k] __down_read_common.llvm
>>>>>        3.00%  [kernel]          [k] up_read
>>>>>        2.63%  [kernel]          [k] read_block_for_search
>>>>>        2.40%  [kernel]          [k] read_extent_buffer
>>>>>        1.38%  [kernel]          [k] memset_erms
>>>>>        1.11%  [kernel]          [k] find_extent_buffer
>>>>>        0.69%  [kernel]          [k] kmem_cache_free
>>>>>        0.69%  [kernel]          [k] memcpy_erms
>>>>>        0.57%  [kernel]          [k] kmem_cache_alloc
>>>>>        0.45%  [kernel]          [k] radix_tree_lookup
>>>>>
>>>>> I can reproduce the bug on 2 different machines, running 2 
>>>>> different linux
>>>>> distributions (Arch and Gentoo) with 2 different kernel configs.
>>>>> This kernel is compiled with clang, the other with GCC.
>>>>>
>>>>> Kernel version: 5.16.0
>>>>> Mount options:
>>>>>       Machine 1:
>>>>> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag 
>>>>>
>>>>>       Machine 2: 
>>>>> rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
>>>>>
>>>>> When the error happens, no message is shown in dmesg.
>>>> This is very likely the same issue as reported at this thread:
>>>>
>>>> https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb 
>>>>
>>>>
>>>> Are you able to test the patch posted there?
>>>>
>>>> Thanks.
>>>>
>>>>> Thanks,
>>>>> Anthony Ruhier
>>>>>
>>
>>
>>
>>



* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-17 17:50         ` Anthony Ruhier
@ 2022-01-17 17:56           ` Filipe Manana
  2022-01-17 19:39             ` Anthony Ruhier
  0 siblings, 1 reply; 14+ messages in thread
From: Filipe Manana @ 2022-01-17 17:56 UTC (permalink / raw)
  To: Anthony Ruhier; +Cc: linux-btrfs

On Mon, Jan 17, 2022 at 5:50 PM Anthony Ruhier <aruhier@mailbox.org> wrote:
>
> Filipe,
>
> Just a quick update after my previous email, your patch fixed the issue
> for `btrfs fi defrag`.
> Thanks a lot! I closed my bug on bugzilla.
>
> I'll keep you in touch about the autodefrag.

Please do.
The 1 byte file case was very specific, so it's probably a different issue.

Thanks for testing!

>
> Le 17/01/2022 à 18:04, Anthony Ruhier a écrit :
> > I'm going to apply your patch for the 1B file, and quickly confirm if
> > it works.
> > Thanks a lot for your patch!
> >
> > About the autodefrag issue, it's going to be trickier to check that
> > your patch fixes it, because for whatever reason the problem seems to
> > have resolved itself (or at least, btrfs-cleaner does way less writes)
> > after a partial btrfs balance.
> > I'll try to look at the amount of writes after several hours. Maybe for
> > this one, check with the author of the other bug. If they can easily
> > reproduce the issue then it's going to be easier to check.
> >
> > Thanks,
> > Anthony
> >
> > Le 17/01/2022 à 17:52, Filipe Manana a écrit :
> >> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
> >>> Thanks for the answer.
> >>>
> >>> I had the exact same issue as in the thread you've linked, and have
> >>> some
> >>> monitoring and graphs that showed that btrfs-cleaner did constant
> >>> writes
> >>> during 12 hours just after I upgraded to linux 5.16. Weirdly enough,
> >>> the
> >>> issue almost disappeared after I did a btrfs balance by filtering on
> >>> 10%
> >>> usage of data.
> >>> But that's why I initially disabled autodefrag, which led to discovering
> >>> this bug as I switched to manual defragmentation (which, in the end,
> >>> makes more sense anyway with my setup).
> >>>
> >>> I tried this patch, but sadly it doesn't help for the initial issue. I
> >>> cannot say for the bug in the other thread, as the problem with
> >>> btrfs-cleaner disappeared (I can still see some writes from it, but
> >>> it so
> >>> rare that I cannot say if it's normal or not).
> >> Ok, reading your first mail more carefully, I see there's the case of the
> >> 1-byte file, which is possibly not the same cause as the autodefrag
> >> excessive IO problem.
> >>
> >> For the 1 byte file problem, I've just sent a fix:
> >>
> >> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/
> >>
> >>
> >> It's actually trivial to trigger.
> >>
> >> Can you check if it also fixes your problem with autodefrag?
> >>
> >> If not, then try the following (after applying the 1 file fix):
> >>
> >> https://pastebin.com/raw/EbEfk1tF
> >>
> >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> >> index a5bd6926f7ff..db795e226cca 100644
> >> --- a/fs/btrfs/ioctl.c
> >> +++ b/fs/btrfs/ioctl.c
> >> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct
> >> btrfs_inode *inode,
> >>                     u64 newer_than, bool do_compress,
> >>                     bool locked, struct list_head *target_list)
> >>   {
> >> +    const u32 min_thresh = extent_thresh / 2;
> >>       u64 cur = start;
> >>       int ret = 0;
> >>   @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct
> >> btrfs_inode *inode,
> >>           struct extent_map *em;
> >>           struct defrag_target_range *new;
> >>           bool next_mergeable = true;
> >> +        u64 range_start;
> >>           u64 range_len;
> >>             em = defrag_lookup_extent(&inode->vfs_inode, cur, locked);
> >> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct
> >> btrfs_inode *inode,
> >>           if (em->generation < newer_than)
> >>               goto next;
> >>   +        /*
> >> +         * Our start offset might be in the middle of an existing
> >> extent
> >> +         * map, so take that into account.
> >> +         */
> >> +        range_len = em->len - (cur - em->start);
> >> +
> >> +        /*
> >> +         * If there's already a good range for delalloc within the
> >> range
> >> +         * covered by the extent map, skip it, otherwise we can end up
> >> +         * doing on the same IO range for a long time when using auto
> >> +         * defrag.
> >> +         */
> >> +        range_start = cur;
> >> +        if (count_range_bits(&inode->io_tree, &range_start,
> >> +                     range_start + range_len - 1, min_thresh,
> >> +                     EXTENT_DELALLOC, 1) >= min_thresh)
> >> +            goto next;
> >> +
> >>           /*
> >>            * For do_compress case, we want to compress all valid file
> >>            * extents, thus no @extent_thresh or mergeable check.
> >> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct
> >> btrfs_inode *inode,
> >>           if (do_compress)
> >>               goto add;
> >>   -        /* Skip too large extent */
> >> -        if (em->len >= extent_thresh)
> >> +        /* Skip large enough ranges. */
> >> +        if (range_len >= extent_thresh)
> >>               goto next;
> >>             next_mergeable =
> >> defrag_check_next_extent(&inode->vfs_inode, em,
> >>
> >>
> >> Thanks.
> >>
> >>
> >>> Thanks,
> >>> Anthony
> >>>
> >>> Le 17/01/2022 à 13:10, Filipe Manana a écrit :
> >>>> On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
> >>>>> Hi,
> >>>>> Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem defrag
> >>>>> -t128K`
> >>>>> hangs on small files (~1 byte) and triggers what seems to be a
> >>>>> loop in the kernel. It results in one CPU thread being used at
> >>>>> 100%. I cannot kill the process, and rebooting is blocked by btrfs.
> >>>>> It is a copy of the bug: https://bugzilla.kernel.org/show_bug.cgi?id=215498
> >>>>>
> >>>>> Rebooting to linux 5.15 shows no issue. I have no issue to run a
> >>>>> defrag on
> >>>>> bigger files (I filter out files smaller than 3.9KB).
> >>>>>
> >>>>> I had a conversation on #btrfs on IRC, so here's what we debugged:
> >>>>>
> >>>>> I can replicate the issue by copying a file impacted by this bug,
> >>>>> by using
> >>>>> `cp --reflink=never`. I attached one of the impacted files to this
> >>>>> bug,
> >>>>> named README.md.
> >>>>>
> >>>>> Someone told me that it could be a bug due to the inline extent.
> >>>>> So we tried
> >>>>> to check that.
> >>>>>
> >>>>> filefrag shows that the file Readme.md is 1 inline extent. I tried
> >>>>> to create
> >>>>> a new file with random text, of 18 bytes (slightly bigger than the
> >>>>> other
> >>>>> file), that is also 1 inline extent. This file doesn't trigger the
> >>>>> bug and
> >>>>> has no issue to be defragmented.
> >>>>>
> >>>>> I tried to mount my system with `max_inline=0`, created a copy of
> >>>>> README.md.
> >>>>> `filefrag` shows me that the new file is now 1 extent, not inline.
> >>>>> This new
> >>>>> file also triggers the bug, so it doesn't seem to be due to the
> >>>>> inline
> >>>>> extent.
> >>>>>
> >>>>> Someone asked me to provide the output of a perf top when the
> >>>>> defrag is
> >>>>> stuck:
> >>>>>
> >>>>>       28.70%  [kernel]          [k] generic_bin_search
> >>>>>       14.90%  [kernel]          [k] free_extent_buffer
> >>>>>       13.17%  [kernel]          [k] btrfs_search_slot
> >>>>>       12.63%  [kernel]          [k] btrfs_root_node
> >>>>>        8.33%  [kernel]          [k] btrfs_get_64
> >>>>>        3.88%  [kernel]          [k] __down_read_common.llvm
> >>>>>        3.00%  [kernel]          [k] up_read
> >>>>>        2.63%  [kernel]          [k] read_block_for_search
> >>>>>        2.40%  [kernel]          [k] read_extent_buffer
> >>>>>        1.38%  [kernel]          [k] memset_erms
> >>>>>        1.11%  [kernel]          [k] find_extent_buffer
> >>>>>        0.69%  [kernel]          [k] kmem_cache_free
> >>>>>        0.69%  [kernel]          [k] memcpy_erms
> >>>>>        0.57%  [kernel]          [k] kmem_cache_alloc
> >>>>>        0.45%  [kernel]          [k] radix_tree_lookup
> >>>>>
> >>>>> I can reproduce the bug on 2 different machines, running 2
> >>>>> different linux
> >>>>> distributions (Arch and Gentoo) with 2 different kernel configs.
> >>>>> This kernel is compiled with clang, the other with GCC.
> >>>>>
> >>>>> Kernel version: 5.16.0
> >>>>> Mount options:
> >>>>>       Machine 1:
> >>>>> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
> >>>>>
> >>>>>       Machine 2:
> >>>>> rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
> >>>>>
> >>>>> When the error happens, no message is shown in dmesg.
> >>>> This is very likely the same issue as reported at this thread:
> >>>>
> >>>> https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb
> >>>>
> >>>>
> >>>> Are you able to test the patch posted there?
> >>>>
> >>>> Thanks.
> >>>>
> >>>>> Thanks,
> >>>>> Anthony Ruhier
> >>>>>
> >>
> >>
> >>
> >>


* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-17 17:56           ` Filipe Manana
@ 2022-01-17 19:39             ` Anthony Ruhier
  2022-01-17 23:15               ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Ruhier @ 2022-01-17 19:39 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



I have some good news and bad news: the bad news is that it didn't fix 
the autodefrag problem (I applied the 2 patches).

The good news is that when I enable autodefrag, I can quickly see if the 
problem is still there or not.
It's not as obvious as the constant writes that happened just after my
upgrade to linux 5.16, because the problem has mitigated itself a bit
since then, but it's still noticeable.

If I compare the write average (in total, I don't have it per process) 
when taking idle periods on the same machine:
     Linux 5.16:
         without autodefrag: ~10KiB/s
         with autodefrag: between 1 and 2MiB/s

     Linux 5.15:
         with autodefrag: ~10KiB/s (around the same as without autodefrag on 5.16)

Feel free to ask me anything to help your debugging; just try to be
fairly explicit about what I should do, as I'm not experienced in
filesystem debugging.
Thanks
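
For per-process numbers (which the totals above lack), a small helper can
sample one process's write rate from /proc/<pid>/io. This is only a sketch:
`proc_write_rate` is a made-up name, and it assumes a kernel built with task
IO accounting (the default on distro kernels).

```shell
# Sketch: sample a process's write rate in KiB/s over an interval,
# using the write_bytes counter from /proc/<pid>/io.
proc_write_rate() {
    pid="$1"
    interval="${2:-5}"
    before=$(awk '/^write_bytes:/ {print $2}' "/proc/$pid/io")
    sleep "$interval"
    after=$(awk '/^write_bytes:/ {print $2}' "/proc/$pid/io")
    echo $(( (after - before) / 1024 / interval ))
}

# Example: watch btrfs-cleaner for 10 seconds (PID lookup assumed):
# proc_write_rate "$(pgrep -o btrfs-cleaner)" 10
```

Reading another process's io file may require root; comparing the result with
and without autodefrag should show whether btrfs-cleaner accounts for the
extra 1-2MiB/s.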

Le 17/01/2022 à 18:56, Filipe Manana a écrit :
> On Mon, Jan 17, 2022 at 5:50 PM Anthony Ruhier <aruhier@mailbox.org> wrote:
>> Filipe,
>>
>> Just a quick update after my previous email, your patch fixed the issue
>> for `btrfs fi defrag`.
>> Thanks a lot! I closed my bug on bugzilla.
>>
>> I'll keep you in touch about the autodefrag.
> Please do.
> The 1 byte file case was very specific, so it's probably a different issue.
>
> Thanks for testing!
>
>> Le 17/01/2022 à 18:04, Anthony Ruhier a écrit :
>>> I'm going to apply your patch for the 1B file, and quickly confirm if
>>> it works.
>>> Thanks a lot for your patch!
>>>
>>> About the autodefrag issue, it's going to be trickier to check that
>>> your patch fixes it, because for whatever reason the problem seems to
>>> have resolved itself (or at least, btrfs-cleaner does way less writes)
>>> after a partial btrfs balance.
>>> I'll try to look at the amount of writes after several hours. Maybe for
>>> this one, check with the author of the other bug. If they can easily
>>> reproduce the issue then it's going to be easier to check.
>>>
>>> Thanks,
>>> Anthony
>>>
>>> Le 17/01/2022 à 17:52, Filipe Manana a écrit :
>>>> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>>>>> Thanks for the answer.
>>>>>
>>>>> I had the exact same issue as in the thread you've linked, and have
>>>>> some
>>>>> monitoring and graphs that showed that btrfs-cleaner did constant
>>>>> writes
>>>>> during 12 hours just after I upgraded to linux 5.16. Weirdly enough,
>>>>> the
>>>>> issue almost disappeared after I did a btrfs balance by filtering on
>>>>> 10%
>>>>> usage of data.
>>>>> But that's why I initially disabled autodefrag, which led to discovering
>>>>> this bug as I switched to manual defragmentation (which, in the end,
>>>>> makes more sense anyway with my setup).
>>>>>
>>>>> I tried this patch, but sadly it doesn't help for the initial issue. I
>>>>> cannot say for the bug in the other thread, as the problem with
>>>>> btrfs-cleaner disappeared (I can still see some writes from it, but
>>>>> it so
>>>>> rare that I cannot say if it's normal or not).
>>>> Ok, reading your first mail more carefully, I see there's the case of the
>>>> 1-byte file, which is possibly not the same cause as the autodefrag
>>>> excessive IO problem.
>>>>
>>>> For the 1 byte file problem, I've just sent a fix:
>>>>
>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/
>>>>
>>>>
>>>> It's actually trivial to trigger.
>>>>
>>>> Can you check if it also fixes your problem with autodefrag?
>>>>
>>>> If not, then try the following (after applying the 1 file fix):
>>>>
>>>> https://pastebin.com/raw/EbEfk1tF
>>>>
>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>> index a5bd6926f7ff..db795e226cca 100644
>>>> --- a/fs/btrfs/ioctl.c
>>>> +++ b/fs/btrfs/ioctl.c
>>>> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct
>>>> btrfs_inode *inode,
>>>>                      u64 newer_than, bool do_compress,
>>>>                      bool locked, struct list_head *target_list)
>>>>    {
>>>> +    const u32 min_thresh = extent_thresh / 2;
>>>>        u64 cur = start;
>>>>        int ret = 0;
>>>>    @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct
>>>> btrfs_inode *inode,
>>>>            struct extent_map *em;
>>>>            struct defrag_target_range *new;
>>>>            bool next_mergeable = true;
>>>> +        u64 range_start;
>>>>            u64 range_len;
>>>>              em = defrag_lookup_extent(&inode->vfs_inode, cur, locked);
>>>> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct
>>>> btrfs_inode *inode,
>>>>            if (em->generation < newer_than)
>>>>                goto next;
>>>>    +        /*
>>>> +         * Our start offset might be in the middle of an existing
>>>> extent
>>>> +         * map, so take that into account.
>>>> +         */
>>>> +        range_len = em->len - (cur - em->start);
>>>> +
>>>> +        /*
>>>> +         * If there's already a good range for delalloc within the
>>>> range
>>>> +         * covered by the extent map, skip it, otherwise we can end up
>>>> +         * doing on the same IO range for a long time when using auto
>>>> +         * defrag.
>>>> +         */
>>>> +        range_start = cur;
>>>> +        if (count_range_bits(&inode->io_tree, &range_start,
>>>> +                     range_start + range_len - 1, min_thresh,
>>>> +                     EXTENT_DELALLOC, 1) >= min_thresh)
>>>> +            goto next;
>>>> +
>>>>            /*
>>>>             * For do_compress case, we want to compress all valid file
>>>>             * extents, thus no @extent_thresh or mergeable check.
>>>> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct
>>>> btrfs_inode *inode,
>>>>            if (do_compress)
>>>>                goto add;
>>>>    -        /* Skip too large extent */
>>>> -        if (em->len >= extent_thresh)
>>>> +        /* Skip large enough ranges. */
>>>> +        if (range_len >= extent_thresh)
>>>>                goto next;
>>>>              next_mergeable =
>>>> defrag_check_next_extent(&inode->vfs_inode, em,
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>> Thanks,
>>>>> Anthony
>>>>>
>>>>> Le 17/01/2022 à 13:10, Filipe Manana a écrit :
>>>>>> On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
>>>>>>> Hi,
>>>>>>> Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem defrag
>>>>>>> -t128K`
>>>>>>> hangs on small files (~1 byte) and triggers what seems to be a
>>>>>>> loop in the kernel. It results in one CPU thread being used at
>>>>>>> 100%. I cannot kill the process, and rebooting is blocked by btrfs.
>>>>>>> It is a copy of the bug: https://bugzilla.kernel.org/show_bug.cgi?id=215498
>>>>>>>
>>>>>>> Rebooting to linux 5.15 shows no issue. I have no issue to run a
>>>>>>> defrag on
>>>>>>> bigger files (I filter out files smaller than 3.9KB).
>>>>>>>
>>>>>>> I had a conversation on #btrfs on IRC, so here's what we debugged:
>>>>>>>
>>>>>>> I can replicate the issue by copying a file impacted by this bug,
>>>>>>> by using
>>>>>>> `cp --reflink=never`. I attached one of the impacted files to this
>>>>>>> bug,
>>>>>>> named README.md.
>>>>>>>
>>>>>>> Someone told me that it could be a bug due to the inline extent.
>>>>>>> So we tried
>>>>>>> to check that.
>>>>>>>
>>>>>>> filefrag shows that the file Readme.md is 1 inline extent. I tried
>>>>>>> to create
>>>>>>> a new file with random text, of 18 bytes (slightly bigger than the
>>>>>>> other
>>>>>>> file), that is also 1 inline extent. This file doesn't trigger the
>>>>>>> bug and
>>>>>>> has no issue to be defragmented.
>>>>>>>
>>>>>>> I tried to mount my system with `max_inline=0`, created a copy of
>>>>>>> README.md.
>>>>>>> `filefrag` shows me that the new file is now 1 extent, not inline.
>>>>>>> This new
>>>>>>> file also triggers the bug, so it doesn't seem to be due to the
>>>>>>> inline
>>>>>>> extent.
>>>>>>>
>>>>>>> Someone asked me to provide the output of a perf top when the
>>>>>>> defrag is
>>>>>>> stuck:
>>>>>>>
>>>>>>>        28.70%  [kernel]          [k] generic_bin_search
>>>>>>>        14.90%  [kernel]          [k] free_extent_buffer
>>>>>>>        13.17%  [kernel]          [k] btrfs_search_slot
>>>>>>>        12.63%  [kernel]          [k] btrfs_root_node
>>>>>>>         8.33%  [kernel]          [k] btrfs_get_64
>>>>>>>         3.88%  [kernel]          [k] __down_read_common.llvm
>>>>>>>         3.00%  [kernel]          [k] up_read
>>>>>>>         2.63%  [kernel]          [k] read_block_for_search
>>>>>>>         2.40%  [kernel]          [k] read_extent_buffer
>>>>>>>         1.38%  [kernel]          [k] memset_erms
>>>>>>>         1.11%  [kernel]          [k] find_extent_buffer
>>>>>>>         0.69%  [kernel]          [k] kmem_cache_free
>>>>>>>         0.69%  [kernel]          [k] memcpy_erms
>>>>>>>         0.57%  [kernel]          [k] kmem_cache_alloc
>>>>>>>         0.45%  [kernel]          [k] radix_tree_lookup
>>>>>>>
>>>>>>> I can reproduce the bug on 2 different machines, running 2
>>>>>>> different linux
>>>>>>> distributions (Arch and Gentoo) with 2 different kernel configs.
>>>>>>> This kernel is compiled with clang, the other with GCC.
>>>>>>>
>>>>>>> Kernel version: 5.16.0
>>>>>>> Mount options:
>>>>>>>        Machine 1:
>>>>>>> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
>>>>>>>
>>>>>>>        Machine 2:
>>>>>>> rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
>>>>>>>
>>>>>>> When the error happens, no message is shown in dmesg.
>>>>>> This is very likely the same issue as reported at this thread:
>>>>>>
>>>>>> https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb
>>>>>>
>>>>>>
>>>>>> Are you able to test the patch posted there?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Anthony Ruhier
>>>>>>>
>>>>
>>>>
>>>>



* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-17 19:39             ` Anthony Ruhier
@ 2022-01-17 23:15               ` Qu Wenruo
  2022-01-17 23:32                 ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2022-01-17 23:15 UTC (permalink / raw)
  To: Anthony Ruhier, Filipe Manana; +Cc: linux-btrfs



On 2022/1/18 03:39, Anthony Ruhier wrote:
> I have some good news and bad news: the bad news is that it didn't fix
> the autodefrag problem (I applied the 2 patches).
>
> The good news is that when I enable autodefrag, I can quickly see if the
> problem is still there or not.
> It's not as obvious as the constant writes that happened just after my
> upgrade to linux 5.16, because the problem has mitigated itself a bit
> since then, but it's still noticeable.
>
> If I compare the write average (in total, I don't have it per process)
> when taking idle periods on the same machine:
>      Linux 5.16:
>          without autodefrag: ~10KiB/s
>          with autodefrag: between 1 and 2MiB/s
>
>      Linux 5.15:
>          with autodefrag: ~10KiB/s (around the same as without autodefrag on 5.16)
>
> Feel free to ask me anything to help your debugging; just try to be
> fairly explicit about what I should do, as I'm not experienced in
> filesystem debugging.

Mind testing the following diff (along with the previous two patches from
Filipe)?

I screwed up: the refactor changed how the defragged bytes are accounted,
and I didn't notice that autodefrag relies on that to requeue the inode.

The refactor was originally meant to add support for subpage defrag, but it
looks like I pushed the boundary too hard and changed some behaviors.


diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 11204dbbe053..c720260f9656 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -312,7 +312,9 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
          */
         if (num_defrag == BTRFS_DEFRAG_BATCH) {
                 defrag->last_offset = range.start;
+               /*
                 btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
+               */
         } else if (defrag->last_offset && !defrag->cycled) {
                 /*
                  * we didn't fill our defrag batch, but
@@ -321,7 +323,9 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
                  */
                 defrag->last_offset = 0;
                 defrag->cycled = 1;
+               /*
                 btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
+               */
         } else {
                 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
         }
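
To make explicit what those two commented-out calls control, here is a toy
shell model of the requeue decision in __btrfs_run_defrag_inode() (semantics
read off the diff above; BTRFS_DEFRAG_BATCH is 1024 in fs/btrfs/file.c,
everything else is illustrative):

```shell
# Toy model of the requeue decision in __btrfs_run_defrag_inode().
# Only the control flow is mirrored, nothing else.
BATCH=1024

requeue_decision() {
    num_defrag="$1"   # extents defragged in this pass
    last_offset="$2"  # where the previous pass stopped
    cycled="$3"       # 1 if we already wrapped around to offset 0
    if [ "$num_defrag" -eq "$BATCH" ]; then
        echo requeue      # full batch: more of the file may remain
    elif [ "$last_offset" -ne 0 ] && [ "$cycled" -eq 0 ]; then
        echo requeue      # short batch mid-file: retry once from offset 0
    else
        echo free         # done: drop the defrag record
    fi
}

requeue_decision 1024 0 0      # prints: requeue
requeue_decision 10   4096 0   # prints: requeue
requeue_decision 10   0    1   # prints: free
```

The diff disables both `requeue` branches, so if the constant background
writes stop with it applied, the requeue path is what keeps feeding the same
inode back for another defrag pass.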



> Thanks
>
> Le 17/01/2022 à 18:56, Filipe Manana a écrit :
>> On Mon, Jan 17, 2022 at 5:50 PM Anthony Ruhier <aruhier@mailbox.org>
>> wrote:
>>> Filipe,
>>>
>>> Just a quick update after my previous email, your patch fixed the issue
>>> for `btrfs fi defrag`.
>>> Thanks a lot! I closed my bug on bugzilla.
>>>
>>> I'll keep you in touch about the autodefrag.
>> Please do.
>> The 1 byte file case was very specific, so it's probably a different
>> issue.
>>
>> Thanks for testing!
>>
>>> Le 17/01/2022 à 18:04, Anthony Ruhier a écrit :
>>>> I'm going to apply your patch for the 1B file, and quickly confirm if
>>>> it works.
>>>> Thanks a lot for your patch!
>>>>
>>>> About the autodefrag issue, it's going to be trickier to check that
>>>> your patch fixes it, because for whatever reason the problem seems to
>>>> have resolved itself (or at least, btrfs-cleaner does way less writes)
>>>> after a partial btrfs balance.
>>>> I'll try to look at the amount of writes after several hours. Maybe for
>>>> this one, check with the author of the other bug. If they can easily
>>>> reproduce the issue then it's going to be easier to check.
>>>>
>>>> Thanks,
>>>> Anthony
>>>>
>>>> Le 17/01/2022 à 17:52, Filipe Manana a écrit :
>>>>> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>>>>>> Thanks for the answer.
>>>>>>
>>>>>> I had the exact same issue as in the thread you've linked, and have
>>>>>> some
>>>>>> monitoring and graphs that showed that btrfs-cleaner did constant
>>>>>> writes
>>>>>> during 12 hours just after I upgraded to linux 5.16. Weirdly enough,
>>>>>> the
>>>>>> issue almost disappeared after I did a btrfs balance by filtering on
>>>>>> 10%
>>>>>> usage of data.
>>>>>> But that's why I initially disabled autodefrag, which led to discovering
>>>>>> this bug as I switched to manual defragmentation (which, in the end,
>>>>>> makes more sense anyway with my setup).
>>>>>>
>>>>>> I tried this patch, but sadly it doesn't help for the initial
>>>>>> issue. I
>>>>>> cannot say for the bug in the other thread, as the problem with
>>>>>> btrfs-cleaner disappeared (I can still see some writes from it, but
>>>>>> it so
>>>>>> rare that I cannot say if it's normal or not).
>>>>> Ok, reading your first mail more carefully, I see there's the case of
>>>>> the 1-byte file, which is possibly not the same cause as the autodefrag
>>>>> excessive IO problem.
>>>>>
>>>>> For the 1 byte file problem, I've just sent a fix:
>>>>>
>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/
>>>>>
>>>>>
>>>>>
>>>>> It's actually trivial to trigger.
>>>>>
>>>>> Can you check if it also fixes your problem with autodefrag?
>>>>>
>>>>> If not, then try the following (after applying the 1 file fix):
>>>>>
>>>>> https://pastebin.com/raw/EbEfk1tF
>>>>>
>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>> index a5bd6926f7ff..db795e226cca 100644
>>>>> --- a/fs/btrfs/ioctl.c
>>>>> +++ b/fs/btrfs/ioctl.c
>>>>> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct
>>>>> btrfs_inode *inode,
>>>>>                      u64 newer_than, bool do_compress,
>>>>>                      bool locked, struct list_head *target_list)
>>>>>    {
>>>>> +    const u32 min_thresh = extent_thresh / 2;
>>>>>        u64 cur = start;
>>>>>        int ret = 0;
>>>>>    @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct
>>>>> btrfs_inode *inode,
>>>>>            struct extent_map *em;
>>>>>            struct defrag_target_range *new;
>>>>>            bool next_mergeable = true;
>>>>> +        u64 range_start;
>>>>>            u64 range_len;
>>>>>              em = defrag_lookup_extent(&inode->vfs_inode, cur,
>>>>> locked);
>>>>> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct
>>>>> btrfs_inode *inode,
>>>>>            if (em->generation < newer_than)
>>>>>                goto next;
>>>>>    +        /*
>>>>> +         * Our start offset might be in the middle of an existing
>>>>> extent
>>>>> +         * map, so take that into account.
>>>>> +         */
>>>>> +        range_len = em->len - (cur - em->start);
>>>>> +
>>>>> +        /*
>>>>> +         * If there's already a good range for delalloc within the
>>>>> range
>>>>> +         * covered by the extent map, skip it, otherwise we can
>>>>> end up
>>>>> +         * doing on the same IO range for a long time when using auto
>>>>> +         * defrag.
>>>>> +         */
>>>>> +        range_start = cur;
>>>>> +        if (count_range_bits(&inode->io_tree, &range_start,
>>>>> +                     range_start + range_len - 1, min_thresh,
>>>>> +                     EXTENT_DELALLOC, 1) >= min_thresh)
>>>>> +            goto next;
>>>>> +
>>>>>            /*
>>>>>             * For do_compress case, we want to compress all valid file
>>>>>             * extents, thus no @extent_thresh or mergeable check.
>>>>> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct
>>>>> btrfs_inode *inode,
>>>>>            if (do_compress)
>>>>>                goto add;
>>>>>    -        /* Skip too large extent */
>>>>> -        if (em->len >= extent_thresh)
>>>>> +        /* Skip large enough ranges. */
>>>>> +        if (range_len >= extent_thresh)
>>>>>                goto next;
>>>>>              next_mergeable =
>>>>> defrag_check_next_extent(&inode->vfs_inode, em,
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Anthony
>>>>>>
>>>>>> Le 17/01/2022 à 13:10, Filipe Manana a écrit :
>>>>>>> On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
>>>>>>>> Hi,
>>>>>>>> Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem defrag
>>>>>>>> -t128K`
>>>>>>>> hangs on small files (~1 byte) and triggers what seems to be a
>>>>>>>> loop in the kernel. It results in one CPU thread being used at
>>>>>>>> 100%. I cannot kill the process, and rebooting is blocked by btrfs.
>>>>>>>> It is a copy of the bug: https://bugzilla.kernel.org/show_bug.cgi?id=215498
>>>>>>>>
>>>>>>>> Rebooting to linux 5.15 shows no issue. I have no issue to run a
>>>>>>>> defrag on
>>>>>>>> bigger files (I filter out files smaller than 3.9KB).
>>>>>>>>
>>>>>>>> I had a conversation on #btrfs on IRC, so here's what we debugged:
>>>>>>>>
>>>>>>>> I can replicate the issue by copying a file impacted by this bug,
>>>>>>>> by using
>>>>>>>> `cp --reflink=never`. I attached one of the impacted files to this
>>>>>>>> bug,
>>>>>>>> named README.md.
>>>>>>>>
>>>>>>>> Someone told me that it could be a bug due to the inline extent.
>>>>>>>> So we tried
>>>>>>>> to check that.
>>>>>>>>
>>>>>>>> filefrag shows that the file Readme.md is 1 inline extent. I tried
>>>>>>>> to create
>>>>>>>> a new file with random text, of 18 bytes (slightly bigger than the
>>>>>>>> other
>>>>>>>> file), that is also 1 inline extent. This file doesn't trigger the
>>>>>>>> bug and
>>>>>>>> has no issue to be defragmented.
>>>>>>>>
>>>>>>>> I tried to mount my system with `max_inline=0`, created a copy of
>>>>>>>> README.md.
>>>>>>>> `filefrag` shows me that the new file is now 1 extent, not inline.
>>>>>>>> This new
>>>>>>>> file also triggers the bug, so it doesn't seem to be due to the
>>>>>>>> inline
>>>>>>>> extent.
>>>>>>>>
>>>>>>>> Someone asked me to provide the output of a perf top when the
>>>>>>>> defrag is
>>>>>>>> stuck:
>>>>>>>>
>>>>>>>>        28.70%  [kernel]          [k] generic_bin_search
>>>>>>>>        14.90%  [kernel]          [k] free_extent_buffer
>>>>>>>>        13.17%  [kernel]          [k] btrfs_search_slot
>>>>>>>>        12.63%  [kernel]          [k] btrfs_root_node
>>>>>>>>         8.33%  [kernel]          [k] btrfs_get_64
>>>>>>>>         3.88%  [kernel]          [k] __down_read_common.llvm
>>>>>>>>         3.00%  [kernel]          [k] up_read
>>>>>>>>         2.63%  [kernel]          [k] read_block_for_search
>>>>>>>>         2.40%  [kernel]          [k] read_extent_buffer
>>>>>>>>         1.38%  [kernel]          [k] memset_erms
>>>>>>>>         1.11%  [kernel]          [k] find_extent_buffer
>>>>>>>>         0.69%  [kernel]          [k] kmem_cache_free
>>>>>>>>         0.69%  [kernel]          [k] memcpy_erms
>>>>>>>>         0.57%  [kernel]          [k] kmem_cache_alloc
>>>>>>>>         0.45%  [kernel]          [k] radix_tree_lookup
>>>>>>>>
>>>>>>>> I can reproduce the bug on 2 different machines, running 2
>>>>>>>> different linux
>>>>>>>> distributions (Arch and Gentoo) with 2 different kernel configs.
>>>>>>>> This kernel is compiled with clang, the other with GCC.
>>>>>>>>
>>>>>>>> Kernel version: 5.16.0
>>>>>>>> Mount options:
>>>>>>>>        Machine 1:
>>>>>>>> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
>>>>>>>>
>>>>>>>>
>>>>>>>>        Machine 2:
>>>>>>>> rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
>>>>>>>>
>>>>>>>> When the error happens, no message is shown in dmesg.
>>>>>>> This is very likely the same issue as reported at this thread:
>>>>>>>
>>>>>>> https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Are you able to test the patch posted there?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anthony Ruhier
>>>>>>>>
>>>>>
>>>>>
>>>>>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-17 23:15               ` Qu Wenruo
@ 2022-01-17 23:32                 ` Qu Wenruo
  2022-01-18  1:58                   ` Anthony Ruhier
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2022-01-17 23:32 UTC (permalink / raw)
  To: Anthony Ruhier, Filipe Manana; +Cc: linux-btrfs



On 2022/1/18 07:15, Qu Wenruo wrote:
>
>
> On 2022/1/18 03:39, Anthony Ruhier wrote:
>> I have some good news and bad news: the bad news is that it didn't fix
>> the autodefrag problem (I applied the 2 patches).
>>
>> The good news is that when I enable autodefrag, I can quickly see if the
>> problem is still there or not.
>> It's not super obvious compared to the amount of writes that happened
>> constantly just after my upgrade to linux 5.16, because since then the
>> problem mitigated itself a bit, but it's still noticeable.
>>
>> If I compare the write average (in total, I don't have it per process)
>> when taking idle periods on the same machine:
>>      Linux 5.16:
>>          without autodefrag: ~ 10KiB/s
>>          with autodefrag: between 1 and 2MiB/s.
>>
>>      Linux 5.15:
>>          with autodefrag:~ 10KiB/s (around the same as without
>> autodefrag on 5.16)
>>
>> Feel free to ask me anything to help your debugging, just try to be
>> quite explicit about what I should do, I'm not experimented in
>> filesystems debugging.
>
> Would you mind testing the following diff (along with the previous two
> patches from Filipe)?
>
> I screwed up: the refactor changed how the defragged bytes are
> accounted, and I didn't notice that autodefrag relies on that to
> requeue the inode.
>
> The refactor was originally meant to add support for subpage defrag,
> but it looks like I pushed the boundary too hard and changed some
> behaviors.

Sorry, the previous diff has a memory leak.

The autodefrag code is more complex than I thought; it has extra logic
to requeue inodes for further defrag.

Please try this one instead.

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 11204dbbe053..096feecf2602 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -305,26 +305,7 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
        num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
                                       BTRFS_DEFRAG_BATCH);
         sb_end_write(fs_info->sb);
-       /*
-        * if we filled the whole defrag batch, there
-        * must be more work to do.  Queue this defrag
-        * again
-        */
-       if (num_defrag == BTRFS_DEFRAG_BATCH) {
-               defrag->last_offset = range.start;
-               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
-       } else if (defrag->last_offset && !defrag->cycled) {
-               /*
-                * we didn't fill our defrag batch, but
-                * we didn't start at zero.  Make sure we loop
-                * around to the start of the file.
-                */
-               defrag->last_offset = 0;
-               defrag->cycled = 1;
-               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
-       } else {
-               kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
-       }
+       kmem_cache_free(btrfs_inode_defrag_cachep, defrag);

         iput(inode);
         return 0;

>
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 11204dbbe053..c720260f9656 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -312,7 +312,9 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>           */
>          if (num_defrag == BTRFS_DEFRAG_BATCH) {
>                  defrag->last_offset = range.start;
> +               /*
>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> +               */
>          } else if (defrag->last_offset && !defrag->cycled) {
>                  /*
>                   * we didn't fill our defrag batch, but
> @@ -321,7 +323,9 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>                   */
>                  defrag->last_offset = 0;
>                  defrag->cycled = 1;
> +               /*
>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> +               */
>          } else {
>                  kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>          }
>
>
>
>> Thanks
>>
>> Le 17/01/2022 à 18:56, Filipe Manana a écrit :
>>> On Mon, Jan 17, 2022 at 5:50 PM Anthony Ruhier <aruhier@mailbox.org>
>>> wrote:
>>>> Filipe,
>>>>
>>>> Just a quick update after my previous email, your patch fixed the issue
>>>> for `btrfs fi defrag`.
>>>> Thanks a lot! I closed my bug on bugzilla.
>>>>
>>>> I'll keep you in touch about the autodefrag.
>>> Please do.
>>> The 1 byte file case was very specific, so it's probably a different
>>> issue.
>>>
>>> Thanks for testing!
>>>
>>>> Le 17/01/2022 à 18:04, Anthony Ruhier a écrit :
>>>>> I'm going to apply your patch for the 1B file, and quickly confirm if
>>>>> it works.
>>>>> Thanks a lot for your patch!
>>>>>
>>>>> About the autodefrag issue, it's going to be trickier to check that
>>>>> your patch fixes it, because for whatever reason the problem seems to
>>>>> have resolved itself (or at least, btrfs-cleaner does way less writes)
>>>>> after a partial btrfs balance.
>>>>> I'll try and look the amount of writes after several hours. Maybe for
>>>>> this one, see with the author of the other bug. If they can easily
>>>>> reproduce the issue then it's going to be easier to check.
>>>>>
>>>>> Thanks,
>>>>> Anthony
>>>>>
>>>>> Le 17/01/2022 à 17:52, Filipe Manana a écrit :
>>>>>> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>>>>>>> Thanks for the answer.
>>>>>>>
>>>>>>> I had the exact same issue as in the thread you've linked, and have
>>>>>>> some
>>>>>>> monitoring and graphs that showed that btrfs-cleaner did constant
>>>>>>> writes
>>>>>>> during 12 hours just after I upgraded to linux 5.16. Weirdly enough,
>>>>>>> the
>>>>>>> issue almost disappeared after I did a btrfs balance by filtering on
>>>>>>> 10%
>>>>>>> usage of data.
>>>>>>> But that's why I initially disabled autodefrag, what has lead to
>>>>>>> discovering
>>>>>>> this bug as I switched to manual defragmentation (which, in the end,
>>>>>>> makes
>>>>>>> more sense anyway with my setup).
>>>>>>>
>>>>>>> I tried this patch, but sadly it doesn't help for the initial
>>>>>>> issue. I
>>>>>>> cannot say for the bug in the other thread, as the problem with
>>>>>>> btrfs-cleaner disappeared (I can still see some writes from it, but
>>>>>>> it so
>>>>>>> rare that I cannot say if it's normal or not).
>>>>>> Ok, reading better your first mail, I see there's the case of the 1
>>>>>> byte
>>>>>> file, which is possibly not the cause from the autodefrag causing the
>>>>>> excessive IO problem.
>>>>>>
>>>>>> For the 1 byte file problem, I've just sent a fix:
>>>>>>
>>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> It's actually trivial to trigger.
>>>>>>
>>>>>> Can you check if it also fixes your problem with autodefrag?
>>>>>>
>>>>>> If not, then try the following (after applying the 1 file fix):
>>>>>>
>>>>>> https://pastebin.com/raw/EbEfk1tF
>>>>>>
>>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>>> index a5bd6926f7ff..db795e226cca 100644
>>>>>> --- a/fs/btrfs/ioctl.c
>>>>>> +++ b/fs/btrfs/ioctl.c
>>>>>> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>>>>>>                      u64 newer_than, bool do_compress,
>>>>>>                      bool locked, struct list_head *target_list)
>>>>>>    {
>>>>>> +    const u32 min_thresh = extent_thresh / 2;
>>>>>>        u64 cur = start;
>>>>>>        int ret = 0;
>>>>>>    @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>>>>>>            struct extent_map *em;
>>>>>>            struct defrag_target_range *new;
>>>>>>            bool next_mergeable = true;
>>>>>> +        u64 range_start;
>>>>>>            u64 range_len;
>>>>>>              em = defrag_lookup_extent(&inode->vfs_inode, cur, locked);
>>>>>> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>>>>>>            if (em->generation < newer_than)
>>>>>>                goto next;
>>>>>>    +        /*
>>>>>> +         * Our start offset might be in the middle of an existing extent
>>>>>> +         * map, so take that into account.
>>>>>> +         */
>>>>>> +        range_len = em->len - (cur - em->start);
>>>>>> +
>>>>>> +        /*
>>>>>> +         * If there's already a good range for delalloc within the range
>>>>>> +         * covered by the extent map, skip it, otherwise we can end up
>>>>>> +         * doing on the same IO range for a long time when using auto
>>>>>> +         * defrag.
>>>>>> +         */
>>>>>> +        range_start = cur;
>>>>>> +        if (count_range_bits(&inode->io_tree, &range_start,
>>>>>> +                     range_start + range_len - 1, min_thresh,
>>>>>> +                     EXTENT_DELALLOC, 1) >= min_thresh)
>>>>>> +            goto next;
>>>>>> +
>>>>>>            /*
>>>>>>             * For do_compress case, we want to compress all valid file
>>>>>>             * extents, thus no @extent_thresh or mergeable check.
>>>>>> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>>>>>>            if (do_compress)
>>>>>>                goto add;
>>>>>>    -        /* Skip too large extent */
>>>>>> -        if (em->len >= extent_thresh)
>>>>>> +        /* Skip large enough ranges. */
>>>>>> +        if (range_len >= extent_thresh)
>>>>>>                goto next;
>>>>>>              next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Anthony
>>>>>>>
>>>>>>> Le 17/01/2022 à 13:10, Filipe Manana a écrit :
>>>>>>>> [...]
>>>>>>
>>>>>>
>>>>>>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-17 23:32                 ` Qu Wenruo
@ 2022-01-18  1:58                   ` Anthony Ruhier
  2022-01-18  2:22                     ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Ruhier @ 2022-01-18  1:58 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana


[-- Attachment #1.1: Type: text/plain, Size: 16235 bytes --]

Hi Qu,

No problem, thanks a lot for your patch! After a quick look over a
10-minute period, it looks like your patch fixed the issue. I've applied
the 2 patches from Filipe plus yours.
I'll let my NAS run overnight and watch the average write rate over a
longer period. I'll send an update in ~10 hours.

Thanks a lot for your help,
Anthony

Le 18/01/2022 à 00:32, Qu Wenruo a écrit :
>
>
> On 2022/1/18 07:15, Qu Wenruo wrote:
>>
>>
>> On 2022/1/18 03:39, Anthony Ruhier wrote:
>>> I have some good news and bad news: the bad news is that it didn't fix
>>> the autodefrag problem (I applied the 2 patches).
>>>
>>> The good news is that when I enable autodefrag, I can quickly see if 
>>> the
>>> problem is still there or not.
>>> It's not super obvious compared to the amount of writes that happened
>>> constantly just after my upgrade to linux 5.16, because since then the
>>> problem mitigated itself a bit, but it's still noticeable.
>>>
>>> If I compare the write average (in total, I don't have it per process)
>>> when taking idle periods on the same machine:
>>>      Linux 5.16:
>>>          without autodefrag: ~ 10KiB/s
>>>          with autodefrag: between 1 and 2MiB/s.
>>>
>>>      Linux 5.15:
>>>          with autodefrag:~ 10KiB/s (around the same as without
>>> autodefrag on 5.16)
>>>
>>> Feel free to ask me anything to help your debugging, just try to be
>>> quite explicit about what I should do, I'm not experimented in
>>> filesystems debugging.
>>
>> Mind to test the following diff (along with previous two patches from
>> Filipe)?
>>
>> I screwed up, the refactor changed how the defraged bytes accounting,
>> which I didn't notice that autodefrag relies on that to requeue the 
>> inode.
>>
>> The refactor is originally to add support for subpage defrag, but it
>> looks like I pushed the boundary too hard to change some behaviors.
>
> Sorry, previous diff has a memory leakage bug.
>
> The autodefrag code is more complex than I thought, it has extra logic
> to handle defrag.
>
> Please try this one instead.
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 11204dbbe053..096feecf2602 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -305,26 +305,7 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>         num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
>                                        BTRFS_DEFRAG_BATCH);
>         sb_end_write(fs_info->sb);
> -       /*
> -        * if we filled the whole defrag batch, there
> -        * must be more work to do.  Queue this defrag
> -        * again
> -        */
> -       if (num_defrag == BTRFS_DEFRAG_BATCH) {
> -               defrag->last_offset = range.start;
> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> -       } else if (defrag->last_offset && !defrag->cycled) {
> -               /*
> -                * we didn't fill our defrag batch, but
> -                * we didn't start at zero.  Make sure we loop
> -                * around to the start of the file.
> -                */
> -               defrag->last_offset = 0;
> -               defrag->cycled = 1;
> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> -       } else {
> -               kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> -       }
> +       kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>
>         iput(inode);
>         return 0;
>
>>
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 11204dbbe053..c720260f9656 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -312,7 +312,9 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>>           */
>>          if (num_defrag == BTRFS_DEFRAG_BATCH) {
>>                  defrag->last_offset = range.start;
>> +               /*
>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>> +               */
>>          } else if (defrag->last_offset && !defrag->cycled) {
>>                  /*
>>                   * we didn't fill our defrag batch, but
>> @@ -321,7 +323,9 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>>                   */
>>                  defrag->last_offset = 0;
>>                  defrag->cycled = 1;
>> +               /*
>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>> +               */
>>          } else {
>>                  kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>          }
>>
>>
>>
>>> Thanks
>>>
>>> Le 17/01/2022 à 18:56, Filipe Manana a écrit :
>>>> On Mon, Jan 17, 2022 at 5:50 PM Anthony Ruhier <aruhier@mailbox.org>
>>>> wrote:
>>>>> Filipe,
>>>>>
>>>>> Just a quick update after my previous email, your patch fixed the 
>>>>> issue
>>>>> for `btrfs fi defrag`.
>>>>> Thanks a lot! I closed my bug on bugzilla.
>>>>>
>>>>> I'll keep you in touch about the autodefrag.
>>>> Please do.
>>>> The 1 byte file case was very specific, so it's probably a different
>>>> issue.
>>>>
>>>> Thanks for testing!
>>>>
>>>>> Le 17/01/2022 à 18:04, Anthony Ruhier a écrit :
>>>>>> I'm going to apply your patch for the 1B file, and quickly 
>>>>>> confirm if
>>>>>> it works.
>>>>>> Thanks a lot for your patch!
>>>>>>
>>>>>> About the autodefrag issue, it's going to be trickier to check that
>>>>>> your patch fixes it, because for whatever reason the problem 
>>>>>> seems to
>>>>>> have resolved itself (or at least, btrfs-cleaner does way less 
>>>>>> writes)
>>>>>> after a partial btrfs balance.
>>>>>> I'll try and look the amount of writes after several hours. Maybe 
>>>>>> for
>>>>>> this one, see with the author of the other bug. If they can easily
>>>>>> reproduce the issue then it's going to be easier to check.
>>>>>>
>>>>>> Thanks,
>>>>>> Anthony
>>>>>>
>>>>>> Le 17/01/2022 à 17:52, Filipe Manana a écrit :
>>>>>>> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>>>>>>>> Thanks for the answer.
>>>>>>>>
>>>>>>>> I had the exact same issue as in the thread you've linked, and 
>>>>>>>> have
>>>>>>>> some
>>>>>>>> monitoring and graphs that showed that btrfs-cleaner did constant
>>>>>>>> writes
>>>>>>>> during 12 hours just after I upgraded to linux 5.16. Weirdly 
>>>>>>>> enough,
>>>>>>>> the
>>>>>>>> issue almost disappeared after I did a btrfs balance by 
>>>>>>>> filtering on
>>>>>>>> 10%
>>>>>>>> usage of data.
>>>>>>>> But that's why I initially disabled autodefrag, what has lead to
>>>>>>>> discovering
>>>>>>>> this bug as I switched to manual defragmentation (which, in the 
>>>>>>>> end,
>>>>>>>> makes
>>>>>>>> more sense anyway with my setup).
>>>>>>>>
>>>>>>>> I tried this patch, but sadly it doesn't help for the initial
>>>>>>>> issue. I
>>>>>>>> cannot say for the bug in the other thread, as the problem with
>>>>>>>> btrfs-cleaner disappeared (I can still see some writes from it, 
>>>>>>>> but
>>>>>>>> it so
>>>>>>>> rare that I cannot say if it's normal or not).
>>>>>>> Ok, reading better your first mail, I see there's the case of the 1
>>>>>>> byte
>>>>>>> file, which is possibly not the cause from the autodefrag 
>>>>>>> causing the
>>>>>>> excessive IO problem.
>>>>>>>
>>>>>>> For the 1 byte file problem, I've just sent a fix:
>>>>>>>
>>>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/ 
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> It's actually trivial to trigger.
>>>>>>>
>>>>>>> Can you check if it also fixes your problem with autodefrag?
>>>>>>>
>>>>>>> If not, then try the following (after applying the 1 file fix):
>>>>>>>
>>>>>>> https://pastebin.com/raw/EbEfk1tF
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anthony
>>>>>>>>
>>>>>>>> Le 17/01/2022 à 13:10, Filipe Manana a écrit :
>>>>>>>>> [...]
>>>>>>>
>>>>>>>
>>>>>>>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-18  1:58                   ` Anthony Ruhier
@ 2022-01-18  2:22                     ` Qu Wenruo
  2022-01-18 10:57                       ` Anthony Ruhier
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2022-01-18  2:22 UTC (permalink / raw)
  To: Anthony Ruhier; +Cc: linux-btrfs, Filipe Manana



On 2022/1/18 09:58, Anthony Ruhier wrote:
> Hi Qu,
>
> No problem, thanks a lot for your patch! After a quick look on a 10min
> period, it looks like your patch fixed the issue. I've applied the 2
> patches from Filipe and yours.
> I'll let my NAS run over the night and watch the writes average on a
> longer period. I'll send an update in ~10 hours.

I'm really sorry for all the problems.

Currently I believe the root cause is that I changed the return value of
btrfs_defrag_file(), which should return the number of sectors defragged.

Even though I renamed the variable to "sectors_defragged", it still
returns the value in bytes instead.

(Furthermore, it tends to report more bytes than were really defragged.)

The diff I mentioned is not really meant to solve the problem, but to help
us pin it down.
(The diff itself makes autodefrag do a little less work, so it may not
defrag as hard as before.)

Thank you very much for helping to fix the mess I caused.


BTW, if you have spare machines, would you mind testing the following two
patches? (If everything works as expected, they should be the final fixes,
so please test them without the other patches.)

https://patchwork.kernel.org/project/linux-btrfs/patch/20220118021544.18543-1-wqu@suse.com/
https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/

Thanks,
Qu


>
> Thanks a lot for your help,
> Anthony
>
> Le 18/01/2022 à 00:32, Qu Wenruo a écrit :
>>
>>
>> On 2022/1/18 07:15, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/1/18 03:39, Anthony Ruhier wrote:
>>>> I have some good news and bad news: the bad news is that it didn't fix
>>>> the autodefrag problem (I applied the 2 patches).
>>>>
>>>> The good news is that when I enable autodefrag, I can quickly see if
>>>> the
>>>> problem is still there or not.
>>>> It's not super obvious compared to the amount of writes that happened
>>>> constantly just after my upgrade to linux 5.16, because since then the
>>>> problem mitigated itself a bit, but it's still noticeable.
>>>>
>>>> If I compare the write average (in total, I don't have it per process)
>>>> when taking idle periods on the same machine:
>>>>      Linux 5.16:
>>>>          without autodefrag: ~ 10KiB/s
>>>>          with autodefrag: between 1 and 2MiB/s.
>>>>
>>>>      Linux 5.15:
>>>>          with autodefrag:~ 10KiB/s (around the same as without
>>>> autodefrag on 5.16)
>>>>
>>>> Feel free to ask me anything to help your debugging, just try to be
>>>> quite explicit about what I should do, I'm not experimented in
>>>> filesystems debugging.
>>>
>>> Mind to test the following diff (along with previous two patches from
>>> Filipe)?
>>>
>>> I screwed up, the refactor changed how the defraged bytes accounting,
>>> which I didn't notice that autodefrag relies on that to requeue the
>>> inode.
>>>
>>> The refactor is originally to add support for subpage defrag, but it
>>> looks like I pushed the boundary too hard to change some behaviors.
>>
>> Sorry, previous diff has a memory leakage bug.
>>
>> The autodefrag code is more complex than I thought, it has extra logic
>> to handle defrag.
>>
>> Please try this one instead.
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 11204dbbe053..096feecf2602 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -305,26 +305,7 @@ static int __btrfs_run_defrag_inode(struct
>> btrfs_fs_info *fs_info,
>>         num_defrag = btrfs_defrag_file(inode, NULL, &range,
>> defrag->transid,
>>                                        BTRFS_DEFRAG_BATCH);
>>         sb_end_write(fs_info->sb);
>> -       /*
>> -        * if we filled the whole defrag batch, there
>> -        * must be more work to do.  Queue this defrag
>> -        * again
>> -        */
>> -       if (num_defrag == BTRFS_DEFRAG_BATCH) {
>> -               defrag->last_offset = range.start;
>> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>> -       } else if (defrag->last_offset && !defrag->cycled) {
>> -               /*
>> -                * we didn't fill our defrag batch, but
>> -                * we didn't start at zero.  Make sure we loop
>> -                * around to the start of the file.
>> -                */
>> -               defrag->last_offset = 0;
>> -               defrag->cycled = 1;
>> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>> -       } else {
>> -               kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>> -       }
>> +       kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>
>>         iput(inode);
>>         return 0;
>>
>>>
>>>
>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>> index 11204dbbe053..c720260f9656 100644
>>> --- a/fs/btrfs/file.c
>>> +++ b/fs/btrfs/file.c
>>> @@ -312,7 +312,9 @@ static int __btrfs_run_defrag_inode(struct
>>> btrfs_fs_info *fs_info,
>>>           */
>>>          if (num_defrag == BTRFS_DEFRAG_BATCH) {
>>>                  defrag->last_offset = range.start;
>>> +               /*
>>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>> +               */
>>>          } else if (defrag->last_offset && !defrag->cycled) {
>>>                  /*
>>>                   * we didn't fill our defrag batch, but
>>> @@ -321,7 +323,9 @@ static int __btrfs_run_defrag_inode(struct
>>> btrfs_fs_info *fs_info,
>>>                   */
>>>                  defrag->last_offset = 0;
>>>                  defrag->cycled = 1;
>>> +               /*
>>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>> +               */
>>>          } else {
>>>                  kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>          }
>>>
>>>
>>>
>>>> Thanks
>>>>
>>>> Le 17/01/2022 à 18:56, Filipe Manana a écrit :
>>>>> On Mon, Jan 17, 2022 at 5:50 PM Anthony Ruhier <aruhier@mailbox.org>
>>>>> wrote:
>>>>>> Filipe,
>>>>>>
>>>>>> Just a quick update after my previous email, your patch fixed the
>>>>>> issue
>>>>>> for `btrfs fi defrag`.
>>>>>> Thanks a lot! I closed my bug on bugzilla.
>>>>>>
>>>>>> I'll keep you in touch about the autodefrag.
>>>>> Please do.
>>>>> The 1 byte file case was very specific, so it's probably a different
>>>>> issue.
>>>>>
>>>>> Thanks for testing!
>>>>>
>>>>>> Le 17/01/2022 à 18:04, Anthony Ruhier a écrit :
>>>>>>> I'm going to apply your patch for the 1B file, and quickly
>>>>>>> confirm if
>>>>>>> it works.
>>>>>>> Thanks a lot for your patch!
>>>>>>>
>>>>>>> About the autodefrag issue, it's going to be trickier to check that
>>>>>>> your patch fixes it, because for whatever reason the problem
>>>>>>> seems to
>>>>>>> have resolved itself (or at least, btrfs-cleaner does way less
>>>>>>> writes)
>>>>>>> after a partial btrfs balance.
>>>>>>> I'll try and look the amount of writes after several hours. Maybe
>>>>>>> for
>>>>>>> this one, see with the author of the other bug. If they can easily
>>>>>>> reproduce the issue then it's going to be easier to check.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anthony
>>>>>>>
>>>>>>> Le 17/01/2022 à 17:52, Filipe Manana a écrit :
>>>>>>>> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>>>>>>>>> Thanks for the answer.
>>>>>>>>>
>>>>>>>>> I had the exact same issue as in the thread you've linked, and
>>>>>>>>> have
>>>>>>>>> some
>>>>>>>>> monitoring and graphs that showed that btrfs-cleaner did constant
>>>>>>>>> writes
>>>>>>>>> during 12 hours just after I upgraded to linux 5.16. Weirdly
>>>>>>>>> enough,
>>>>>>>>> the
>>>>>>>>> issue almost disappeared after I did a btrfs balance by
>>>>>>>>> filtering on
>>>>>>>>> 10%
>>>>>>>>> usage of data.
>>>>>>>>> But that's why I initially disabled autodefrag, what has lead to
>>>>>>>>> discovering
>>>>>>>>> this bug as I switched to manual defragmentation (which, in the
>>>>>>>>> end,
>>>>>>>>> makes
>>>>>>>>> more sense anyway with my setup).
>>>>>>>>>
>>>>>>>>> I tried this patch, but sadly it doesn't help for the initial
>>>>>>>>> issue. I
>>>>>>>>> cannot say for the bug in the other thread, as the problem with
>>>>>>>>> btrfs-cleaner disappeared (I can still see some writes from it,
>>>>>>>>> but
>>>>>>>>> it so
>>>>>>>>> rare that I cannot say if it's normal or not).
>>>>>>>> Ok, reading better your first mail, I see there's the case of the 1
>>>>>>>> byte
>>>>>>>> file, which is possibly not the cause from the autodefrag
>>>>>>>> causing the
>>>>>>>> excessive IO problem.
>>>>>>>>
>>>>>>>> For the 1 byte file problem, I've just sent a fix:
>>>>>>>>
>>>>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> It's actually trivial to trigger.
>>>>>>>>
>>>>>>>> Can you check if it also fixes your problem with autodefrag?
>>>>>>>>
>>>>>>>> If not, then try the following (after applying the 1 file fix):
>>>>>>>>
>>>>>>>> https://pastebin.com/raw/EbEfk1tF
>>>>>>>>
>>>>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>>>>> index a5bd6926f7ff..db795e226cca 100644
>>>>>>>> --- a/fs/btrfs/ioctl.c
>>>>>>>> +++ b/fs/btrfs/ioctl.c
>>>>>>>> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct
>>>>>>>> btrfs_inode *inode,
>>>>>>>>                      u64 newer_than, bool do_compress,
>>>>>>>>                      bool locked, struct list_head *target_list)
>>>>>>>>    {
>>>>>>>> +    const u32 min_thresh = extent_thresh / 2;
>>>>>>>>        u64 cur = start;
>>>>>>>>        int ret = 0;
>>>>>>>>    @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct
>>>>>>>> btrfs_inode *inode,
>>>>>>>>            struct extent_map *em;
>>>>>>>>            struct defrag_target_range *new;
>>>>>>>>            bool next_mergeable = true;
>>>>>>>> +        u64 range_start;
>>>>>>>>            u64 range_len;
>>>>>>>>              em = defrag_lookup_extent(&inode->vfs_inode, cur,
>>>>>>>> locked);
>>>>>>>> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct
>>>>>>>> btrfs_inode *inode,
>>>>>>>>            if (em->generation < newer_than)
>>>>>>>>                goto next;
>>>>>>>>    +        /*
>>>>>>>> +         * Our start offset might be in the middle of an existing
>>>>>>>> extent
>>>>>>>> +         * map, so take that into account.
>>>>>>>> +         */
>>>>>>>> +        range_len = em->len - (cur - em->start);
>>>>>>>> +
>>>>>>>> +        /*
>>>>>>>> +         * If there's already a good range for delalloc within the
>>>>>>>> range
>>>>>>>> +         * covered by the extent map, skip it, otherwise we can
>>>>>>>> end up
>>>>>>>> +         * doing on the same IO range for a long time when using
>>>>>>>> auto
>>>>>>>> +         * defrag.
>>>>>>>> +         */
>>>>>>>> +        range_start = cur;
>>>>>>>> +        if (count_range_bits(&inode->io_tree, &range_start,
>>>>>>>> +                     range_start + range_len - 1, min_thresh,
>>>>>>>> +                     EXTENT_DELALLOC, 1) >= min_thresh)
>>>>>>>> +            goto next;
>>>>>>>> +
>>>>>>>>            /*
>>>>>>>>             * For do_compress case, we want to compress all valid
>>>>>>>> file
>>>>>>>>             * extents, thus no @extent_thresh or mergeable check.
>>>>>>>> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct
>>>>>>>> btrfs_inode *inode,
>>>>>>>>            if (do_compress)
>>>>>>>>                goto add;
>>>>>>>>    -        /* Skip too large extent */
>>>>>>>> -        if (em->len >= extent_thresh)
>>>>>>>> +        /* Skip large enough ranges. */
>>>>>>>> +        if (range_len >= extent_thresh)
>>>>>>>>                goto next;
>>>>>>>>              next_mergeable =
>>>>>>>> defrag_check_next_extent(&inode->vfs_inode, em,
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anthony
>>>>>>>>>
>>>>>>>>> Le 17/01/2022 à 13:10, Filipe Manana a écrit :
>>>>>>>>>> On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem
>>>>>>>>>>> defrag
>>>>>>>>>>> -t128K`
>>>>>>>>>>> hangs on small files (~1 byte) and triggers what it seems to
>>>>>>>>>>> be a
>>>>>>>>>>> loop in
>>>>>>>>>>> the kernel. It results in one CPU thread running being used at
>>>>>>>>>>> 100%. I
>>>>>>>>>>> cannot kill the process, and rebooting is blocked by btrfs.
>>>>>>>>>>> It is a copy of the
>>>>>>>>>>> bughttps://bugzilla.kernel.org/show_bug.cgi?id=215498
>>>>>>>>>>>
>>>>>>>>>>> Rebooting to linux 5.15 shows no issue. I have no issue to run a
>>>>>>>>>>> defrag on
>>>>>>>>>>> bigger files (I filter out files smaller than 3.9KB).
>>>>>>>>>>>
>>>>>>>>>>> I had a conversation on #btrfs on IRC, so here's what we
>>>>>>>>>>> debugged:
>>>>>>>>>>>
>>>>>>>>>>> I can replicate the issue by copying a file impacted by this
>>>>>>>>>>> bug,
>>>>>>>>>>> by using
>>>>>>>>>>> `cp --reflink=never`. I attached one of the impacted files to
>>>>>>>>>>> this
>>>>>>>>>>> bug,
>>>>>>>>>>> named README.md.
>>>>>>>>>>>
>>>>>>>>>>> Someone told me that it could be a bug due to the inline extent.
>>>>>>>>>>> So we tried
>>>>>>>>>>> to check that.
>>>>>>>>>>>
>>>>>>>>>>> filefrag shows that the file Readme.md is 1 inline extent. I
>>>>>>>>>>> tried
>>>>>>>>>>> to create
>>>>>>>>>>> a new file with random text, of 18 bytes (slightly bigger
>>>>>>>>>>> than the
>>>>>>>>>>> other
>>>>>>>>>>> file), that is also 1 inline extent. This file doesn't
>>>>>>>>>>> trigger the
>>>>>>>>>>> bug and
>>>>>>>>>>> has no issue to be defragmented.
>>>>>>>>>>>
>>>>>>>>>>> I tried to mount my system with `max_inline=0`, created a
>>>>>>>>>>> copy of
>>>>>>>>>>> README.md.
>>>>>>>>>>> `filefrag` shows me that the new file is now 1 extent, not
>>>>>>>>>>> inline.
>>>>>>>>>>> This new
>>>>>>>>>>> file also triggers the bug, so it doesn't seem to be due to the
>>>>>>>>>>> inline
>>>>>>>>>>> extent.
>>>>>>>>>>>
>>>>>>>>>>> Someone asked me to provide the output of a perf top when the
>>>>>>>>>>> defrag is
>>>>>>>>>>> stuck:
>>>>>>>>>>>
>>>>>>>>>>>        28.70%  [kernel]          [k] generic_bin_search
>>>>>>>>>>>        14.90%  [kernel]          [k] free_extent_buffer
>>>>>>>>>>>        13.17%  [kernel]          [k] btrfs_search_slot
>>>>>>>>>>>        12.63%  [kernel]          [k] btrfs_root_node
>>>>>>>>>>>         8.33%  [kernel]          [k] btrfs_get_64
>>>>>>>>>>>         3.88%  [kernel]          [k] __down_read_common.llvm
>>>>>>>>>>>         3.00%  [kernel]          [k] up_read
>>>>>>>>>>>         2.63%  [kernel]          [k] read_block_for_search
>>>>>>>>>>>         2.40%  [kernel]          [k] read_extent_buffer
>>>>>>>>>>>         1.38%  [kernel]          [k] memset_erms
>>>>>>>>>>>         1.11%  [kernel]          [k] find_extent_buffer
>>>>>>>>>>>         0.69%  [kernel]          [k] kmem_cache_free
>>>>>>>>>>>         0.69%  [kernel]          [k] memcpy_erms
>>>>>>>>>>>         0.57%  [kernel]          [k] kmem_cache_alloc
>>>>>>>>>>>         0.45%  [kernel]          [k] radix_tree_lookup
>>>>>>>>>>>
>>>>>>>>>>> I can reproduce the bug on 2 different machines, running 2
>>>>>>>>>>> different linux
>>>>>>>>>>> distributions (Arch and Gentoo) with 2 different kernel configs.
>>>>>>>>>>> This kernel is compiled with clang, the other with GCC.
>>>>>>>>>>>
>>>>>>>>>>> Kernel version: 5.16.0
>>>>>>>>>>> Mount options:
>>>>>>>>>>>        Machine 1:
>>>>>>>>>>> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>        Machine 2:
>>>>>>>>>>> rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
>>>>>>>>>>>
>>>>>>>>>>> When the error happens, no message is shown in dmesg.
>>>>>>>>>> This is very likely the same issue as reported at this thread:
>>>>>>>>>>
>>>>>>>>>> https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Are you able to test the patch posted there?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Anthony Ruhier
>>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-18  2:22                     ` Qu Wenruo
@ 2022-01-18 10:57                       ` Anthony Ruhier
  2022-01-18 11:58                         ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Ruhier @ 2022-01-18 10:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana


[-- Attachment #1.1: Type: text/plain, Size: 18617 bytes --]

Indeed, the problem just took longer to appear with your previous patch.
This time it wasn't from btrfs-cleaner itself, but I still saw a lot of
writes.
I'm trying the last one and will keep you posted!

Thanks

Le 18/01/2022 à 03:22, Qu Wenruo a écrit :
>
>
> On 2022/1/18 09:58, Anthony Ruhier wrote:
>> Hi Qu,
>>
>> No problem, thanks a lot for your patch! After a quick look on a 10min
>> period, it looks like your patch fixed the issue. I've applied the 2
>> patches from Filipe and yours.
>> I'll let my NAS run over the night and watch the writes average on a
>> longer period. I'll send an update in ~10 hours.
>
> I'm really sorry for all the problems.
>
> Currently I believe the root cause is I changed the return value of
> btrfs_defrag_file(), which should return the number of sectors defragged.
>
> Even I renamed the variable to "sectors_defragged", I still return it in
> bytes instead.
>
> (Furthermore it tends to report more bytes than really defragged).
>
> The diff I mentioned is not really to solve the problem, but to help us
> to pin down the problem.
> (The diff itself will make autodefrag to do a little less work, thus it
> may not defrag as hard as previously)
>
> Thank you very much helping fixing the mess I caused.
>
>
> BTW, if you have spare machines, mind to test the following two patches?
> (if everything works as expected, they should be the final fixes, thus
> please test without other patches)
>
> https://patchwork.kernel.org/project/linux-btrfs/patch/20220118021544.18543-1-wqu@suse.com/ 
>
> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/ 
>
>
> Thanks,
> Qu
>
>
>>
>> Thanks a lot for your help,
>> Anthony
>>
>> Le 18/01/2022 à 00:32, Qu Wenruo a écrit :
>>>
>>>
>>> On 2022/1/18 07:15, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2022/1/18 03:39, Anthony Ruhier wrote:
>>>>> I have some good news and bad news: the bad news is that it didn't 
>>>>> fix
>>>>> the autodefrag problem (I applied the 2 patches).
>>>>>
>>>>> The good news is that when I enable autodefrag, I can quickly see if
>>>>> the
>>>>> problem is still there or not.
>>>>> It's not super obvious compared to the amount of writes that happened
>>>>> constantly just after my upgrade to linux 5.16, because since then 
>>>>> the
>>>>> problem mitigated itself a bit, but it's still noticeable.
>>>>>
>>>>> If I compare the write average (in total, I don't have it per 
>>>>> process)
>>>>> when taking idle periods on the same machine:
>>>>>      Linux 5.16:
>>>>>          without autodefrag: ~ 10KiB/s
>>>>>          with autodefrag: between 1 and 2MiB/s.
>>>>>
>>>>>      Linux 5.15:
>>>>>          with autodefrag:~ 10KiB/s (around the same as without
>>>>> autodefrag on 5.16)
>>>>>
>>>>> Feel free to ask me anything to help your debugging, just try to be
>>>>> quite explicit about what I should do, I'm not experimented in
>>>>> filesystems debugging.
>>>>
>>>> Mind to test the following diff (along with previous two patches from
>>>> Filipe)?
>>>>
>>>> I screwed up, the refactor changed how the defraged bytes accounting,
>>>> which I didn't notice that autodefrag relies on that to requeue the
>>>> inode.
>>>>
>>>> The refactor is originally to add support for subpage defrag, but it
>>>> looks like I pushed the boundary too hard to change some behaviors.
>>>
>>> Sorry, previous diff has a memory leakage bug.
>>>
>>> The autodefrag code is more complex than I thought, it has extra logic
>>> to handle defrag.
>>>
>>> Please try this one instead.
>>>
>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>> index 11204dbbe053..096feecf2602 100644
>>> --- a/fs/btrfs/file.c
>>> +++ b/fs/btrfs/file.c
>>> @@ -305,26 +305,7 @@ static int __btrfs_run_defrag_inode(struct
>>> btrfs_fs_info *fs_info,
>>>         num_defrag = btrfs_defrag_file(inode, NULL, &range,
>>> defrag->transid,
>>>                                        BTRFS_DEFRAG_BATCH);
>>>         sb_end_write(fs_info->sb);
>>> -       /*
>>> -        * if we filled the whole defrag batch, there
>>> -        * must be more work to do.  Queue this defrag
>>> -        * again
>>> -        */
>>> -       if (num_defrag == BTRFS_DEFRAG_BATCH) {
>>> -               defrag->last_offset = range.start;
>>> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>> -       } else if (defrag->last_offset && !defrag->cycled) {
>>> -               /*
>>> -                * we didn't fill our defrag batch, but
>>> -                * we didn't start at zero.  Make sure we loop
>>> -                * around to the start of the file.
>>> -                */
>>> -               defrag->last_offset = 0;
>>> -               defrag->cycled = 1;
>>> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>> -       } else {
>>> -               kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>> -       }
>>> +       kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>
>>>         iput(inode);
>>>         return 0;
>>>
>>>>
>>>>
>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>> index 11204dbbe053..c720260f9656 100644
>>>> --- a/fs/btrfs/file.c
>>>> +++ b/fs/btrfs/file.c
>>>> @@ -312,7 +312,9 @@ static int __btrfs_run_defrag_inode(struct
>>>> btrfs_fs_info *fs_info,
>>>>           */
>>>>          if (num_defrag == BTRFS_DEFRAG_BATCH) {
>>>>                  defrag->last_offset = range.start;
>>>> +               /*
>>>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>>> +               */
>>>>          } else if (defrag->last_offset && !defrag->cycled) {
>>>>                  /*
>>>>                   * we didn't fill our defrag batch, but
>>>> @@ -321,7 +323,9 @@ static int __btrfs_run_defrag_inode(struct
>>>> btrfs_fs_info *fs_info,
>>>>                   */
>>>>                  defrag->last_offset = 0;
>>>>                  defrag->cycled = 1;
>>>> +               /*
>>>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>>> +               */
>>>>          } else {
>>>>                  kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>>          }
>>>>
>>>>
>>>>
>>>>> Thanks
>>>>>
>>>>> Le 17/01/2022 à 18:56, Filipe Manana a écrit :
>>>>>> On Mon, Jan 17, 2022 at 5:50 PM Anthony Ruhier <aruhier@mailbox.org>
>>>>>> wrote:
>>>>>>> Filipe,
>>>>>>>
>>>>>>> Just a quick update after my previous email, your patch fixed the
>>>>>>> issue
>>>>>>> for `btrfs fi defrag`.
>>>>>>> Thanks a lot! I closed my bug on bugzilla.
>>>>>>>
>>>>>>> I'll keep you in touch about the autodefrag.
>>>>>> Please do.
>>>>>> The 1 byte file case was very specific, so it's probably a different
>>>>>> issue.
>>>>>>
>>>>>> Thanks for testing!
>>>>>>
>>>>>>> Le 17/01/2022 à 18:04, Anthony Ruhier a écrit :
>>>>>>>> I'm going to apply your patch for the 1B file, and quickly
>>>>>>>> confirm if
>>>>>>>> it works.
>>>>>>>> Thanks a lot for your patch!
>>>>>>>>
>>>>>>>> About the autodefrag issue, it's going to be trickier to check 
>>>>>>>> that
>>>>>>>> your patch fixes it, because for whatever reason the problem
>>>>>>>> seems to
>>>>>>>> have resolved itself (or at least, btrfs-cleaner does way less
>>>>>>>> writes)
>>>>>>>> after a partial btrfs balance.
>>>>>>>> I'll try and look the amount of writes after several hours. Maybe
>>>>>>>> for
>>>>>>>> this one, see with the author of the other bug. If they can easily
>>>>>>>> reproduce the issue then it's going to be easier to check.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anthony
>>>>>>>>
>>>>>>>> Le 17/01/2022 à 17:52, Filipe Manana a écrit :
>>>>>>>>> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>>>>>>>>>> Thanks for the answer.
>>>>>>>>>>
>>>>>>>>>> I had the exact same issue as in the thread you've linked, and
>>>>>>>>>> have
>>>>>>>>>> some
>>>>>>>>>> monitoring and graphs that showed that btrfs-cleaner did 
>>>>>>>>>> constant
>>>>>>>>>> writes
>>>>>>>>>> during 12 hours just after I upgraded to linux 5.16. Weirdly
>>>>>>>>>> enough,
>>>>>>>>>> the
>>>>>>>>>> issue almost disappeared after I did a btrfs balance by
>>>>>>>>>> filtering on
>>>>>>>>>> 10%
>>>>>>>>>> usage of data.
>>>>>>>>>> But that's why I initially disabled autodefrag, what has lead to
>>>>>>>>>> discovering
>>>>>>>>>> this bug as I switched to manual defragmentation (which, in the
>>>>>>>>>> end,
>>>>>>>>>> makes
>>>>>>>>>> more sense anyway with my setup).
>>>>>>>>>>
>>>>>>>>>> I tried this patch, but sadly it doesn't help for the initial
>>>>>>>>>> issue. I
>>>>>>>>>> cannot say for the bug in the other thread, as the problem with
>>>>>>>>>> btrfs-cleaner disappeared (I can still see some writes from it,
>>>>>>>>>> but
>>>>>>>>>> it so
>>>>>>>>>> rare that I cannot say if it's normal or not).
>>>>>>>>> Ok, reading better your first mail, I see there's the case of 
>>>>>>>>> the 1
>>>>>>>>> byte
>>>>>>>>> file, which is possibly not the cause from the autodefrag
>>>>>>>>> causing the
>>>>>>>>> excessive IO problem.
>>>>>>>>>
>>>>>>>>> For the 1 byte file problem, I've just sent a fix:
>>>>>>>>>
>>>>>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/ 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It's actually trivial to trigger.
>>>>>>>>>
>>>>>>>>> Can you check if it also fixes your problem with autodefrag?
>>>>>>>>>
>>>>>>>>> If not, then try the following (after applying the 1 file fix):
>>>>>>>>>
>>>>>>>>> https://pastebin.com/raw/EbEfk1tF
>>>>>>>>>
>>>>>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>>>>>> index a5bd6926f7ff..db795e226cca 100644
>>>>>>>>> --- a/fs/btrfs/ioctl.c
>>>>>>>>> +++ b/fs/btrfs/ioctl.c
>>>>>>>>> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct
>>>>>>>>> btrfs_inode *inode,
>>>>>>>>>                      u64 newer_than, bool do_compress,
>>>>>>>>>                      bool locked, struct list_head *target_list)
>>>>>>>>>    {
>>>>>>>>> +    const u32 min_thresh = extent_thresh / 2;
>>>>>>>>>        u64 cur = start;
>>>>>>>>>        int ret = 0;
>>>>>>>>>    @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct
>>>>>>>>> btrfs_inode *inode,
>>>>>>>>>            struct extent_map *em;
>>>>>>>>>            struct defrag_target_range *new;
>>>>>>>>>            bool next_mergeable = true;
>>>>>>>>> +        u64 range_start;
>>>>>>>>>            u64 range_len;
>>>>>>>>>              em = defrag_lookup_extent(&inode->vfs_inode, cur,
>>>>>>>>> locked);
>>>>>>>>> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct
>>>>>>>>> btrfs_inode *inode,
>>>>>>>>>            if (em->generation < newer_than)
>>>>>>>>>                goto next;
>>>>>>>>>    +        /*
>>>>>>>>> +         * Our start offset might be in the middle of an 
>>>>>>>>> existing
>>>>>>>>> extent
>>>>>>>>> +         * map, so take that into account.
>>>>>>>>> +         */
>>>>>>>>> +        range_len = em->len - (cur - em->start);
>>>>>>>>> +
>>>>>>>>> +        /*
>>>>>>>>> +         * If there's already a good range for delalloc 
>>>>>>>>> within the
>>>>>>>>> range
>>>>>>>>> +         * covered by the extent map, skip it, otherwise we can
>>>>>>>>> end up
>>>>>>>>> +         * doing on the same IO range for a long time when using
>>>>>>>>> auto
>>>>>>>>> +         * defrag.
>>>>>>>>> +         */
>>>>>>>>> +        range_start = cur;
>>>>>>>>> +        if (count_range_bits(&inode->io_tree, &range_start,
>>>>>>>>> +                     range_start + range_len - 1, min_thresh,
>>>>>>>>> +                     EXTENT_DELALLOC, 1) >= min_thresh)
>>>>>>>>> +            goto next;
>>>>>>>>> +
>>>>>>>>>            /*
>>>>>>>>>             * For do_compress case, we want to compress all valid
>>>>>>>>> file
>>>>>>>>>             * extents, thus no @extent_thresh or mergeable check.
>>>>>>>>> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct
>>>>>>>>> btrfs_inode *inode,
>>>>>>>>>            if (do_compress)
>>>>>>>>>                goto add;
>>>>>>>>>    -        /* Skip too large extent */
>>>>>>>>> -        if (em->len >= extent_thresh)
>>>>>>>>> +        /* Skip large enough ranges. */
>>>>>>>>> +        if (range_len >= extent_thresh)
>>>>>>>>>                goto next;
>>>>>>>>>              next_mergeable =
>>>>>>>>> defrag_check_next_extent(&inode->vfs_inode, em,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Anthony
>>>>>>>>>>
>>>>>>>>>> Le 17/01/2022 à 13:10, Filipe Manana a écrit :


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs fi defrag hangs on small files, 100% CPU thread
  2022-01-18 10:57                       ` Anthony Ruhier
@ 2022-01-18 11:58                         ` Qu Wenruo
  0 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2022-01-18 11:58 UTC (permalink / raw)
  To: Anthony Ruhier; +Cc: linux-btrfs, Filipe Manana



On 2022/1/18 18:57, Anthony Ruhier wrote:
> Indeed, the problem just took longer to appear with your previous
> patch. It wasn't from btrfs-cleaner itself this time, but I still had a
> lot of writes.
> I'm trying the latest one and will keep you posted!

And another autodefrag-related problem exposed...

Thankfully, the involved fixes are all small enough.

So would you please try the following patches? (The first one from
Filipe is untouched, the 2nd one is mostly the same but with minor
tweaks to the commit message, and the final one is new):

https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/

https://patchwork.kernel.org/project/linux-btrfs/patch/20220118071904.29991-1-wqu@suse.com/

https://lore.kernel.org/linux-btrfs/20220118115352.52126-1-wqu@suse.com/T/#u

Thanks,
Qu

>
> Thanks
>
> On 18/01/2022 at 03:22, Qu Wenruo wrote:
>>
>>
>> On 2022/1/18 09:58, Anthony Ruhier wrote:
>>> Hi Qu,
>>>
>>> No problem, thanks a lot for your patch! After a quick look over a
>>> 10-minute period, it looks like your patch fixed the issue. I've applied
>>> the 2 patches from Filipe plus yours.
>>> I'll let my NAS run overnight and watch the write average over a longer
>>> period. I'll send an update in ~10 hours.
>>
>> I'm really sorry for all the problems.
>>
>> Currently I believe the root cause is that I changed the return value of
>> btrfs_defrag_file(), which should return the number of sectors defragged.
>>
>> Even though I renamed the variable to "sectors_defragged", I still return
>> it in bytes instead.
>>
>> (Furthermore, it tends to report more bytes than were really defragged.)
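As an aside, the unit mix-up Qu describes can be sketched with a tiny user-space model (illustrative only: the sector size and batch constant are assumptions, not taken from the kernel source):

```python
SECTORSIZE = 4096          # assumed sector size
BTRFS_DEFRAG_BATCH = 1024  # assumed batch size, counted in sectors

def defrag_file(bytes_defragged, return_bytes):
    # Models the return value of btrfs_defrag_file(): it should report
    # *sectors*, but the bug made it report *bytes*.
    if return_bytes:
        return bytes_defragged            # buggy: wrong unit
    return bytes_defragged // SECTORSIZE  # intended: sectors

# One full batch of work is BTRFS_DEFRAG_BATCH sectors of data:
full_batch = BTRFS_DEFRAG_BATCH * SECTORSIZE

# With the intended unit, the caller's "batch filled, requeue" comparison
# against BTRFS_DEFRAG_BATCH behaves as designed...
assert defrag_file(full_batch, return_bytes=False) == BTRFS_DEFRAG_BATCH
# ...with bytes, the same amount of work no longer matches the batch size,
# so the caller misjudges how much was really defragged.
assert defrag_file(full_batch, return_bytes=True) != BTRFS_DEFRAG_BATCH
```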
>>
>> The diff I mentioned is not really meant to solve the problem, but to
>> help us pin it down.
>> (The diff itself will make autodefrag do a little less work, thus it may
>> not defrag as hard as before.)
>>
>> Thank you very much for helping fix the mess I caused.
>>
>>
>> BTW, if you have spare machines, would you mind testing the following two
>> patches? (If everything works as expected, they should be the final fixes,
>> so please test without other patches.)
>>
>> https://patchwork.kernel.org/project/linux-btrfs/patch/20220118021544.18543-1-wqu@suse.com/
>>
>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/
>>
>>
>> Thanks,
>> Qu
>>
>>
>>>
>>> Thanks a lot for your help,
>>> Anthony
>>>
>>> On 18/01/2022 at 00:32, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2022/1/18 07:15, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2022/1/18 03:39, Anthony Ruhier wrote:
>>>>>> I have some good news and bad news: the bad news is that it didn't
>>>>>> fix the autodefrag problem (I applied the 2 patches).
>>>>>>
>>>>>> The good news is that when I enable autodefrag, I can quickly see if
>>>>>> the problem is still there or not.
>>>>>> It's not super obvious compared to the amount of writes that happened
>>>>>> constantly just after my upgrade to linux 5.16, because since then the
>>>>>> problem mitigated itself a bit, but it's still noticeable.
>>>>>>
>>>>>> If I compare the write average (in total, I don't have it per
>>>>>> process) when taking idle periods on the same machine:
>>>>>>      Linux 5.16:
>>>>>>          without autodefrag: ~ 10KiB/s
>>>>>>          with autodefrag: between 1 and 2MiB/s.
>>>>>>
>>>>>>      Linux 5.15:
>>>>>>          with autodefrag: ~10KiB/s (around the same as without
>>>>>> autodefrag on 5.16)
>>>>>>
>>>>>> Feel free to ask me anything to help your debugging; just try to be
>>>>>> quite explicit about what I should do, as I'm not experienced in
>>>>>> filesystem debugging.
>>>>>
>>>>> Would you mind testing the following diff (along with the previous two
>>>>> patches from Filipe)?
>>>>>
>>>>> I screwed up: the refactor changed how the defragged bytes are
>>>>> accounted, and I didn't notice that autodefrag relies on that count to
>>>>> requeue the inode.
>>>>>
>>>>> The refactor was originally meant to add support for subpage defrag,
>>>>> but it looks like I pushed the boundary too hard and changed some
>>>>> behaviors.
>>>>
>>>> Sorry, previous diff has a memory leakage bug.
>>>>
>>>> The autodefrag code is more complex than I thought; it has extra logic
>>>> around the defrag itself.
>>>>
>>>> Please try this one instead.
>>>>
>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>> index 11204dbbe053..096feecf2602 100644
>>>> --- a/fs/btrfs/file.c
>>>> +++ b/fs/btrfs/file.c
>>>> @@ -305,26 +305,7 @@ static int __btrfs_run_defrag_inode(struct
>>>> btrfs_fs_info *fs_info,
>>>>         num_defrag = btrfs_defrag_file(inode, NULL, &range,
>>>> defrag->transid,
>>>>                                        BTRFS_DEFRAG_BATCH);
>>>>         sb_end_write(fs_info->sb);
>>>> -       /*
>>>> -        * if we filled the whole defrag batch, there
>>>> -        * must be more work to do.  Queue this defrag
>>>> -        * again
>>>> -        */
>>>> -       if (num_defrag == BTRFS_DEFRAG_BATCH) {
>>>> -               defrag->last_offset = range.start;
>>>> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>>> -       } else if (defrag->last_offset && !defrag->cycled) {
>>>> -               /*
>>>> -                * we didn't fill our defrag batch, but
>>>> -                * we didn't start at zero.  Make sure we loop
>>>> -                * around to the start of the file.
>>>> -                */
>>>> -               defrag->last_offset = 0;
>>>> -               defrag->cycled = 1;
>>>> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>>> -       } else {
>>>> -               kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>> -       }
>>>> +       kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>>
>>>>         iput(inode);
>>>>         return 0;
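For readers following along, the requeue policy that this diff removes can be modeled in user space roughly like this (a sketch under assumed semantics, not the kernel implementation):

```python
BTRFS_DEFRAG_BATCH = 1024  # sectors defragged per autodefrag pass (assumed)

def run_defrag_pass(defrag, defrag_file):
    """Model the tail of __btrfs_run_defrag_inode().

    defrag: dict with 'last_offset' and 'cycled' fields.
    defrag_file: callable(start) -> (num_defrag, next_start).
    Returns True when the inode would be requeued for another pass.
    """
    num_defrag, next_start = defrag_file(defrag['last_offset'])
    if num_defrag == BTRFS_DEFRAG_BATCH:
        # Filled the whole batch: there is likely more work, requeue.
        defrag['last_offset'] = next_start
        return True
    if defrag['last_offset'] and not defrag['cycled']:
        # Batch not filled, but the scan did not start at zero: wrap
        # around to the start of the file exactly once.
        defrag['last_offset'] = 0
        defrag['cycled'] = True
        return True
    # Done: the kernel frees the defrag record at this point.
    return False
```

The point of the model: any mis-accounting that keeps satisfying one of the two requeue branches turns this into an endless loop of background writes, which is consistent with the constant autodefrag IO reported in this thread.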
>>>>
>>>>>
>>>>>
>>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>>> index 11204dbbe053..c720260f9656 100644
>>>>> --- a/fs/btrfs/file.c
>>>>> +++ b/fs/btrfs/file.c
>>>>> @@ -312,7 +312,9 @@ static int __btrfs_run_defrag_inode(struct
>>>>> btrfs_fs_info *fs_info,
>>>>>           */
>>>>>          if (num_defrag == BTRFS_DEFRAG_BATCH) {
>>>>>                  defrag->last_offset = range.start;
>>>>> +               /*
>>>>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>>>> +               */
>>>>>          } else if (defrag->last_offset && !defrag->cycled) {
>>>>>                  /*
>>>>>                   * we didn't fill our defrag batch, but
>>>>> @@ -321,7 +323,9 @@ static int __btrfs_run_defrag_inode(struct
>>>>> btrfs_fs_info *fs_info,
>>>>>                   */
>>>>>                  defrag->last_offset = 0;
>>>>>                  defrag->cycled = 1;
>>>>> +               /*
>>>>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>>>> +               */
>>>>>          } else {
>>>>>                  kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>>>          }
>>>>>
>>>>>
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On 17/01/2022 at 18:56, Filipe Manana wrote:
>>>>>>> On Mon, Jan 17, 2022 at 5:50 PM Anthony Ruhier <aruhier@mailbox.org>
>>>>>>> wrote:
>>>>>>>> Filipe,
>>>>>>>>
>>>>>>>> Just a quick update after my previous email: your patch fixed the
>>>>>>>> issue for `btrfs fi defrag`.
>>>>>>>> Thanks a lot! I closed my bug on bugzilla.
>>>>>>>>
>>>>>>>> I'll keep you in touch about the autodefrag.
>>>>>>> Please do.
>>>>>>> The 1 byte file case was very specific, so it's probably a different
>>>>>>> issue.
>>>>>>>
>>>>>>> Thanks for testing!
>>>>>>>
>>>>>>>> On 17/01/2022 at 18:04, Anthony Ruhier wrote:
>>>>>>>>> I'm going to apply your patch for the 1B file, and quickly
>>>>>>>>> confirm if
>>>>>>>>> it works.
>>>>>>>>> Thanks a lot for your patch!
>>>>>>>>>
>>>>>>>>> About the autodefrag issue, it's going to be trickier to check that
>>>>>>>>> your patch fixes it, because for whatever reason the problem seems to
>>>>>>>>> have resolved itself (or at least, btrfs-cleaner does way fewer
>>>>>>>>> writes) after a partial btrfs balance.
>>>>>>>>> I'll try to look at the amount of writes after several hours. Maybe
>>>>>>>>> for this one, check with the author of the other bug. If they can
>>>>>>>>> easily reproduce the issue then it's going to be easier to verify.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anthony
>>>>>>>>>
>>>>>>>>> On 17/01/2022 at 17:52, Filipe Manana wrote:
>>>>>>>>>> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>>>>>>>>>>> Thanks for the answer.
>>>>>>>>>>>
>>>>>>>>>>> I had the exact same issue as in the thread you've linked, and have
>>>>>>>>>>> some monitoring and graphs that showed that btrfs-cleaner did
>>>>>>>>>>> constant writes for 12 hours just after I upgraded to linux 5.16.
>>>>>>>>>>> Weirdly enough, the issue almost disappeared after I did a btrfs
>>>>>>>>>>> balance filtering on 10% data usage.
>>>>>>>>>>> But that's why I initially disabled autodefrag, which led to
>>>>>>>>>>> discovering this bug when I switched to manual defragmentation
>>>>>>>>>>> (which, in the end, makes more sense anyway with my setup).
>>>>>>>>>>>
>>>>>>>>>>> I tried this patch, but sadly it doesn't help for the initial
>>>>>>>>>>> issue. I cannot say for the bug in the other thread, as the problem
>>>>>>>>>>> with btrfs-cleaner disappeared (I can still see some writes from
>>>>>>>>>>> it, but it's so rare that I cannot say if it's normal or not).
>>>>>>>>>> Ok, reading your first mail more carefully, I see there's the case
>>>>>>>>>> of the 1 byte file, which is possibly not related to the autodefrag
>>>>>>>>>> issue causing the excessive IO problem.
>>>>>>>>>>
>>>>>>>>>> For the 1 byte file problem, I've just sent a fix:
>>>>>>>>>>
>>>>>>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/
>>>>>>>>>>
>>>>>>>>>> It's actually trivial to trigger.
>>>>>>>>>>
>>>>>>>>>> Can you check if it also fixes your problem with autodefrag?
>>>>>>>>>>
>>>>>>>>>> If not, then try the following (after applying the 1 byte file fix):
>>>>>>>>>>
>>>>>>>>>> https://pastebin.com/raw/EbEfk1tF
>>>>>>>>>>
>>>>>>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>>>>>>> index a5bd6926f7ff..db795e226cca 100644
>>>>>>>>>> --- a/fs/btrfs/ioctl.c
>>>>>>>>>> +++ b/fs/btrfs/ioctl.c
>>>>>>>>>> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct
>>>>>>>>>> btrfs_inode *inode,
>>>>>>>>>>                      u64 newer_than, bool do_compress,
>>>>>>>>>>                      bool locked, struct list_head *target_list)
>>>>>>>>>>    {
>>>>>>>>>> +    const u32 min_thresh = extent_thresh / 2;
>>>>>>>>>>        u64 cur = start;
>>>>>>>>>>        int ret = 0;
>>>>>>>>>>    @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct
>>>>>>>>>> btrfs_inode *inode,
>>>>>>>>>>            struct extent_map *em;
>>>>>>>>>>            struct defrag_target_range *new;
>>>>>>>>>>            bool next_mergeable = true;
>>>>>>>>>> +        u64 range_start;
>>>>>>>>>>            u64 range_len;
>>>>>>>>>>              em = defrag_lookup_extent(&inode->vfs_inode, cur,
>>>>>>>>>> locked);
>>>>>>>>>> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct
>>>>>>>>>> btrfs_inode *inode,
>>>>>>>>>>            if (em->generation < newer_than)
>>>>>>>>>>                goto next;
>>>>>>>>>>    +        /*
>>>>>>>>>> +         * Our start offset might be in the middle of an
>>>>>>>>>> existing
>>>>>>>>>> extent
>>>>>>>>>> +         * map, so take that into account.
>>>>>>>>>> +         */
>>>>>>>>>> +        range_len = em->len - (cur - em->start);
>>>>>>>>>> +
>>>>>>>>>> +        /*
>>>>>>>>>> +         * If there's already a good range for delalloc
>>>>>>>>>> within the
>>>>>>>>>> range
>>>>>>>>>> +         * covered by the extent map, skip it, otherwise we can
>>>>>>>>>> end up
>>>>>>>>>> +         * doing on the same IO range for a long time when using
>>>>>>>>>> auto
>>>>>>>>>> +         * defrag.
>>>>>>>>>> +         */
>>>>>>>>>> +        range_start = cur;
>>>>>>>>>> +        if (count_range_bits(&inode->io_tree, &range_start,
>>>>>>>>>> +                     range_start + range_len - 1, min_thresh,
>>>>>>>>>> +                     EXTENT_DELALLOC, 1) >= min_thresh)
>>>>>>>>>> +            goto next;
>>>>>>>>>> +
>>>>>>>>>>            /*
>>>>>>>>>>             * For do_compress case, we want to compress all valid
>>>>>>>>>> file
>>>>>>>>>>             * extents, thus no @extent_thresh or mergeable check.
>>>>>>>>>> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct
>>>>>>>>>> btrfs_inode *inode,
>>>>>>>>>>            if (do_compress)
>>>>>>>>>>                goto add;
>>>>>>>>>>    -        /* Skip too large extent */
>>>>>>>>>> -        if (em->len >= extent_thresh)
>>>>>>>>>> +        /* Skip large enough ranges. */
>>>>>>>>>> +        if (range_len >= extent_thresh)
>>>>>>>>>>                goto next;
>>>>>>>>>>              next_mergeable =
>>>>>>>>>> defrag_check_next_extent(&inode->vfs_inode, em,
>>>>>>>>>>
>>>>>>>>>>
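A rough user-space model of the selection logic in the patch above (hypothetical helper names; the kernel tracks delalloc via count_range_bits() on the inode's io_tree, modeled here as a plain set of byte offsets):

```python
def collect_targets(extents, delalloc_bytes, start, length, extent_thresh):
    """Pick ranges worth defragmenting.

    extents: list of (start, len) extent maps, sorted by start.
    delalloc_bytes: set of byte offsets currently marked delalloc (toy model).
    Returns the list of (start, len) ranges selected for defrag.
    """
    min_thresh = extent_thresh // 2
    targets = []
    cur = start
    end = start + length
    while cur < end:
        em = next((e for e in extents if e[0] <= cur < e[0] + e[1]), None)
        if em is None:
            break
        # Our start offset might be in the middle of the extent map.
        range_len = em[1] - (cur - em[0])
        # New check: if enough of the range is already delalloc, skip it,
        # so autodefrag stops redoing IO on the same range.
        delalloc = sum(1 for b in range(cur, cur + range_len)
                       if b in delalloc_bytes)
        if delalloc < min_thresh and range_len < extent_thresh:
            targets.append((cur, range_len))
        cur += range_len
    return targets
```

For example, a 64-byte extent under a 128-byte threshold is selected, but the same range is skipped once it is fully covered by delalloc, and a 256-byte range is skipped for already being large enough.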
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Anthony
>>>>>>>>>>>
>>>>>>>>>>> On 17/01/2022 at 13:10, Filipe Manana wrote:
>>>>>>>>>>>> On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem
>>>>>>>>>>>>> defrag
>>>>>>>>>>>>> -t128K`
>>>>>>>>>>>>> hangs on small files (~1 byte) and triggers what it seems to
>>>>>>>>>>>>> be a
>>>>>>>>>>>>> loop in
>>>>>>>>>>>>> the kernel. It results in one CPU thread running being used at
>>>>>>>>>>>>> 100%. I
>>>>>>>>>>>>> cannot kill the process, and rebooting is blocked by btrfs.
>>>>>>>>>>>>> It is a copy of the bug:
>>>>>>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=215498
>>>>>>>>>>>>>
>>>>>>>>>>>>> Rebooting to linux 5.15 shows no issue. I have no issue to
>>>>>>>>>>>>> run a
>>>>>>>>>>>>> defrag on
>>>>>>>>>>>>> bigger files (I filter out files smaller than 3.9KB).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I had a conversation on #btrfs on IRC, so here's what we
>>>>>>>>>>>>> debugged:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can replicate the issue by copying a file impacted by this
>>>>>>>>>>>>> bug,
>>>>>>>>>>>>> by using
>>>>>>>>>>>>> `cp --reflink=never`. I attached one of the impacted files to
>>>>>>>>>>>>> this
>>>>>>>>>>>>> bug,
>>>>>>>>>>>>> named README.md.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Someone told me that it could be a bug due to the inline
>>>>>>>>>>>>> extent.
>>>>>>>>>>>>> So we tried
>>>>>>>>>>>>> to check that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> filefrag shows that the file Readme.md is 1 inline extent. I
>>>>>>>>>>>>> tried
>>>>>>>>>>>>> to create
>>>>>>>>>>>>> a new file with random text, of 18 bytes (slightly bigger
>>>>>>>>>>>>> than the
>>>>>>>>>>>>> other
>>>>>>>>>>>>> file), that is also 1 inline extent. This file doesn't
>>>>>>>>>>>>> trigger the
>>>>>>>>>>>>> bug and
>>>>>>>>>>>>> has no issue to be defragmented.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried to mount my system with `max_inline=0`, created a
>>>>>>>>>>>>> copy of
>>>>>>>>>>>>> README.md.
>>>>>>>>>>>>> `filefrag` shows me that the new file is now 1 extent, not
>>>>>>>>>>>>> inline.
>>>>>>>>>>>>> This new
>>>>>>>>>>>>> file also triggers the bug, so it doesn't seem to be due to
>>>>>>>>>>>>> the
>>>>>>>>>>>>> inline
>>>>>>>>>>>>> extent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Someone asked me to provide the output of a perf top when the
>>>>>>>>>>>>> defrag is
>>>>>>>>>>>>> stuck:
>>>>>>>>>>>>>
>>>>>>>>>>>>>        28.70%  [kernel]          [k] generic_bin_search
>>>>>>>>>>>>>        14.90%  [kernel]          [k] free_extent_buffer
>>>>>>>>>>>>>        13.17%  [kernel]          [k] btrfs_search_slot
>>>>>>>>>>>>>        12.63%  [kernel]          [k] btrfs_root_node
>>>>>>>>>>>>>         8.33%  [kernel]          [k] btrfs_get_64
>>>>>>>>>>>>>         3.88%  [kernel]          [k] __down_read_common.llvm
>>>>>>>>>>>>>         3.00%  [kernel]          [k] up_read
>>>>>>>>>>>>>         2.63%  [kernel]          [k] read_block_for_search
>>>>>>>>>>>>>         2.40%  [kernel]          [k] read_extent_buffer
>>>>>>>>>>>>>         1.38%  [kernel]          [k] memset_erms
>>>>>>>>>>>>>         1.11%  [kernel]          [k] find_extent_buffer
>>>>>>>>>>>>>         0.69%  [kernel]          [k] kmem_cache_free
>>>>>>>>>>>>>         0.69%  [kernel]          [k] memcpy_erms
>>>>>>>>>>>>>         0.57%  [kernel]          [k] kmem_cache_alloc
>>>>>>>>>>>>>         0.45%  [kernel]          [k] radix_tree_lookup
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can reproduce the bug on 2 different machines, running 2
>>>>>>>>>>>>> different linux
>>>>>>>>>>>>> distributions (Arch and Gentoo) with 2 different kernel
>>>>>>>>>>>>> configs.
>>>>>>>>>>>>> This kernel is compiled with clang, the other with GCC.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kernel version: 5.16.0
>>>>>>>>>>>>> Mount options:
>>>>>>>>>>>>>        Machine 1:
>>>>>>>>>>>>> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
>>>>>>>>>>>>>
>>>>>>>>>>>>>        Machine 2:
>>>>>>>>>>>>> rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
>>>>>>>>>>>>>
>>>>>>>>>>>>> When the error happens, no message is shown in dmesg.
>>>>>>>>>>>> This is very likely the same issue as reported at this thread:
>>>>>>>>>>>>
>>>>>>>>>>>> https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb
>>>>>>>>>>>>
>>>>>>>>>>>> Are you able to test the patch posted there?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Anthony Ruhier
>>>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-01-18 11:58 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-16 19:15 btrfs fi defrag hangs on small files, 100% CPU thread Anthony Ruhier
2022-01-17 12:10 ` Filipe Manana
2022-01-17 14:24   ` Anthony Ruhier
2022-01-17 16:52     ` Filipe Manana
2022-01-17 17:04       ` Anthony Ruhier
2022-01-17 17:50         ` Anthony Ruhier
2022-01-17 17:56           ` Filipe Manana
2022-01-17 19:39             ` Anthony Ruhier
2022-01-17 23:15               ` Qu Wenruo
2022-01-17 23:32                 ` Qu Wenruo
2022-01-18  1:58                   ` Anthony Ruhier
2022-01-18  2:22                     ` Qu Wenruo
2022-01-18 10:57                       ` Anthony Ruhier
2022-01-18 11:58                         ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.