linux-btrfs.vger.kernel.org archive mirror
From: Anthony Ruhier <aruhier@mailbox.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
	Filipe Manana <fdmanana@kernel.org>
Subject: Re: btrfs fi defrag hangs on small files, 100% CPU thread
Date: Tue, 18 Jan 2022 02:58:50 +0100	[thread overview]
Message-ID: <8ecb84e1-b8da-20f2-8730-8c1403ac7d00@mailbox.org> (raw)
In-Reply-To: <e811a5b7-1ffe-b062-5c53-b3da4c26d04d@gmx.com>



Hi Qu,

No problem, thanks a lot for your patch! After a quick look over a
10-minute period, it looks like your patch fixed the issue. I applied
the two patches from Filipe plus yours.
I'll let my NAS run overnight and watch the average write rate over a
longer period. I'll send an update in ~10 hours.

Thanks a lot for your help,
Anthony

On 18/01/2022 at 00:32, Qu Wenruo wrote:
>
>
> On 2022/1/18 07:15, Qu Wenruo wrote:
>>
>>
>> On 2022/1/18 03:39, Anthony Ruhier wrote:
>>> I have some good news and bad news: the bad news is that it didn't fix
>>> the autodefrag problem (I applied both patches).
>>>
>>> The good news is that when I enable autodefrag, I can quickly see
>>> whether the problem is still there.
>>> It's not as obvious as the constant writes that happened just after my
>>> upgrade to Linux 5.16, because the problem has mitigated itself a bit
>>> since then, but it's still noticeable.
>>>
>>> If I compare the write average (in total, I don't have it per process)
>>> when taking idle periods on the same machine:
>>>      Linux 5.16:
>>>          without autodefrag: ~10KiB/s
>>>          with autodefrag: between 1 and 2MiB/s
>>>
>>>      Linux 5.15:
>>>          with autodefrag: ~10KiB/s (around the same as without
>>>          autodefrag on 5.16)
>>>
>>> Feel free to ask me for anything that would help your debugging; just
>>> try to be explicit about what I should do, as I'm not experienced in
>>> filesystem debugging.
>>
>> Would you mind testing the following diff (along with the previous two
>> patches from Filipe)?
>>
>> I screwed up: the refactor changed how the defragged bytes are
>> accounted, and I didn't notice that autodefrag relies on that to
>> requeue the inode.
>>
>> The refactor was originally meant to add support for subpage defrag,
>> but it looks like I pushed the boundary too hard and changed some
>> behaviors.
>
> Sorry, the previous diff had a memory leak.
>
> The autodefrag code is more complex than I thought; it has extra
> logic to requeue inodes for further defrag.
>
> Please try this one instead.
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 11204dbbe053..096feecf2602 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -305,26 +305,7 @@ static int __btrfs_run_defrag_inode(struct
> btrfs_fs_info *fs_info,
>         num_defrag = btrfs_defrag_file(inode, NULL, &range,
> defrag->transid,
>                                        BTRFS_DEFRAG_BATCH);
>         sb_end_write(fs_info->sb);
> -       /*
> -        * if we filled the whole defrag batch, there
> -        * must be more work to do.  Queue this defrag
> -        * again
> -        */
> -       if (num_defrag == BTRFS_DEFRAG_BATCH) {
> -               defrag->last_offset = range.start;
> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> -       } else if (defrag->last_offset && !defrag->cycled) {
> -               /*
> -                * we didn't fill our defrag batch, but
> -                * we didn't start at zero.  Make sure we loop
> -                * around to the start of the file.
> -                */
> -               defrag->last_offset = 0;
> -               defrag->cycled = 1;
> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> -       } else {
> -               kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> -       }
> +       kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>
>         iput(inode);
>         return 0;
>
>>
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 11204dbbe053..c720260f9656 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -312,7 +312,9 @@ static int __btrfs_run_defrag_inode(struct
>> btrfs_fs_info *fs_info,
>>           */
>>          if (num_defrag == BTRFS_DEFRAG_BATCH) {
>>                  defrag->last_offset = range.start;
>> +               /*
>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>> +               */
>>          } else if (defrag->last_offset && !defrag->cycled) {
>>                  /*
>>                   * we didn't fill our defrag batch, but
>> @@ -321,7 +323,9 @@ static int __btrfs_run_defrag_inode(struct
>> btrfs_fs_info *fs_info,
>>                   */
>>                  defrag->last_offset = 0;
>>                  defrag->cycled = 1;
>> +               /*
>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>> +               */
>>          } else {
>>                  kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>          }
>>
>>
>>
>>> Thanks
>>>
>>> On 17/01/2022 at 18:56, Filipe Manana wrote:
>>>> On Mon, Jan 17, 2022 at 5:50 PM Anthony Ruhier <aruhier@mailbox.org>
>>>> wrote:
>>>>> Filipe,
>>>>>
>>>>> Just a quick update after my previous email: your patch fixed the
>>>>> issue for `btrfs fi defrag`.
>>>>> Thanks a lot! I closed my bug on Bugzilla.
>>>>>
>>>>> I'll keep you in touch about the autodefrag.
>>>> Please do.
>>>> The 1 byte file case was very specific, so it's probably a different
>>>> issue.
>>>>
>>>> Thanks for testing!
>>>>
>>>>> On 17/01/2022 at 18:04, Anthony Ruhier wrote:
>>>>>> I'm going to apply your patch for the 1-byte file and quickly confirm
>>>>>> whether it works.
>>>>>> Thanks a lot for your patch!
>>>>>>
>>>>>> About the autodefrag issue, it's going to be trickier to check that
>>>>>> your patch fixes it, because for whatever reason the problem seems to
>>>>>> have resolved itself (or at least, btrfs-cleaner does far fewer
>>>>>> writes) after a partial btrfs balance.
>>>>>> I'll try to look at the amount of writes after several hours. Maybe
>>>>>> for this one, check with the author of the other bug: if they can
>>>>>> easily reproduce the issue then it's going to be easier to verify.
>>>>>>
>>>>>> Thanks,
>>>>>> Anthony
>>>>>>
>>>>>> On 17/01/2022 at 17:52, Filipe Manana wrote:
>>>>>>> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>>>>>>>> Thanks for the answer.
>>>>>>>>
>>>>>>>> I had the exact same issue as in the thread you've linked, and have
>>>>>>>> some monitoring and graphs showing that btrfs-cleaner did constant
>>>>>>>> writes for 12 hours just after I upgraded to Linux 5.16. Weirdly
>>>>>>>> enough, the issue almost disappeared after I did a btrfs balance
>>>>>>>> filtering on 10% data usage.
>>>>>>>> But that's also why I initially disabled autodefrag, which led to
>>>>>>>> discovering this bug when I switched to manual defragmentation
>>>>>>>> (which, in the end, makes more sense with my setup anyway).
>>>>>>>>
>>>>>>>> I tried this patch, but sadly it doesn't help with the initial
>>>>>>>> issue. I cannot say for the bug in the other thread, as the problem
>>>>>>>> with btrfs-cleaner disappeared (I can still see some writes from it,
>>>>>>>> but they're so rare that I cannot say whether they're normal or not).
>>>>>>> Ok, reading your first mail more carefully, I see there's the case
>>>>>>> of the 1 byte file, which is possibly unrelated to the autodefrag
>>>>>>> excessive IO problem.
>>>>>>>
>>>>>>> For the 1 byte file problem, I've just sent a fix:
>>>>>>>
>>>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/ 
>>>>>>>
>>>>>>> It's actually trivial to trigger.
>>>>>>>
>>>>>>> Can you check if it also fixes your problem with autodefrag?
>>>>>>>
>>>>>>> If not, then try the following (after applying the 1 file fix):
>>>>>>>
>>>>>>> https://pastebin.com/raw/EbEfk1tF
>>>>>>>
>>>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>>>> index a5bd6926f7ff..db795e226cca 100644
>>>>>>> --- a/fs/btrfs/ioctl.c
>>>>>>> +++ b/fs/btrfs/ioctl.c
>>>>>>> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct
>>>>>>> btrfs_inode *inode,
>>>>>>>                      u64 newer_than, bool do_compress,
>>>>>>>                      bool locked, struct list_head *target_list)
>>>>>>>    {
>>>>>>> +    const u32 min_thresh = extent_thresh / 2;
>>>>>>>        u64 cur = start;
>>>>>>>        int ret = 0;
>>>>>>>    @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct
>>>>>>> btrfs_inode *inode,
>>>>>>>            struct extent_map *em;
>>>>>>>            struct defrag_target_range *new;
>>>>>>>            bool next_mergeable = true;
>>>>>>> +        u64 range_start;
>>>>>>>            u64 range_len;
>>>>>>>              em = defrag_lookup_extent(&inode->vfs_inode, cur,
>>>>>>> locked);
>>>>>>> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct
>>>>>>> btrfs_inode *inode,
>>>>>>>            if (em->generation < newer_than)
>>>>>>>                goto next;
>>>>>>>
>>>>>>> +        /*
>>>>>>> +         * Our start offset might be in the middle of an existing
>>>>>>> +         * extent map, so take that into account.
>>>>>>> +         */
>>>>>>> +        range_len = em->len - (cur - em->start);
>>>>>>> +
>>>>>>> +        /*
>>>>>>> +         * If there's already a good range for delalloc within the
>>>>>>> +         * range covered by the extent map, skip it, otherwise we
>>>>>>> +         * can end up doing IO on the same range for a long time
>>>>>>> +         * when using autodefrag.
>>>>>>> +         */
>>>>>>> +        range_start = cur;
>>>>>>> +        if (count_range_bits(&inode->io_tree, &range_start,
>>>>>>> +                     range_start + range_len - 1, min_thresh,
>>>>>>> +                     EXTENT_DELALLOC, 1) >= min_thresh)
>>>>>>> +            goto next;
>>>>>>> +
>>>>>>>            /*
>>>>>>>             * For do_compress case, we want to compress all valid
>>>>>>>             * file extents, thus no @extent_thresh or mergeable check.
>>>>>>> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct
>>>>>>> btrfs_inode *inode,
>>>>>>>            if (do_compress)
>>>>>>>                goto add;
>>>>>>>
>>>>>>> -        /* Skip too large extent */
>>>>>>> -        if (em->len >= extent_thresh)
>>>>>>> +        /* Skip large enough ranges. */
>>>>>>> +        if (range_len >= extent_thresh)
>>>>>>>                goto next;
>>>>>>>              next_mergeable =
>>>>>>> defrag_check_next_extent(&inode->vfs_inode, em,
>>>>>>>
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anthony
>>>>>>>>
>>>>>>>> On 17/01/2022 at 13:10, Filipe Manana wrote:
>>>>>>>>> On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> Since I upgraded from Linux 5.15 to 5.16, `btrfs filesystem defrag
>>>>>>>>>> -t128K` hangs on small files (~1 byte) and triggers what seems to
>>>>>>>>>> be a loop in the kernel. It results in one CPU thread being used
>>>>>>>>>> at 100%. I cannot kill the process, and rebooting is blocked by
>>>>>>>>>> btrfs.
>>>>>>>>>> It is a copy of the bug
>>>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=215498
>>>>>>>>>>
>>>>>>>>>> Rebooting into Linux 5.15 shows no issue, and I have no problem
>>>>>>>>>> running a defrag on bigger files (I filter out files smaller than
>>>>>>>>>> 3.9KB).
>>>>>>>>>>
>>>>>>>>>> I had a conversation on #btrfs on IRC, so here's what we 
>>>>>>>>>> debugged:
>>>>>>>>>>
>>>>>>>>>> I can replicate the issue by copying a file impacted by this bug
>>>>>>>>>> with `cp --reflink=never`. I attached one of the impacted files,
>>>>>>>>>> named README.md, to this bug.
>>>>>>>>>>
>>>>>>>>>> Someone told me that it could be a bug due to the inline extent.
>>>>>>>>>> So we tried
>>>>>>>>>> to check that.
>>>>>>>>>>
>>>>>>>>>> filefrag shows that README.md is 1 inline extent. I tried to
>>>>>>>>>> create a new file with random text, of 18 bytes (slightly bigger
>>>>>>>>>> than the other file), that is also 1 inline extent. This file
>>>>>>>>>> doesn't trigger the bug and defragments without issue.
>>>>>>>>>>
>>>>>>>>>> I tried mounting my system with `max_inline=0` and created a copy
>>>>>>>>>> of README.md. `filefrag` shows that the new file is now 1 regular
>>>>>>>>>> (non-inline) extent. This new file also triggers the bug, so the
>>>>>>>>>> problem doesn't seem to be due to the inline extent.
>>>>>>>>>>
>>>>>>>>>> Someone asked me to provide the output of `perf top` while the
>>>>>>>>>> defrag is stuck:
>>>>>>>>>>
>>>>>>>>>>        28.70%  [kernel]          [k] generic_bin_search
>>>>>>>>>>        14.90%  [kernel]          [k] free_extent_buffer
>>>>>>>>>>        13.17%  [kernel]          [k] btrfs_search_slot
>>>>>>>>>>        12.63%  [kernel]          [k] btrfs_root_node
>>>>>>>>>>         8.33%  [kernel]          [k] btrfs_get_64
>>>>>>>>>>         3.88%  [kernel]          [k] __down_read_common.llvm
>>>>>>>>>>         3.00%  [kernel]          [k] up_read
>>>>>>>>>>         2.63%  [kernel]          [k] read_block_for_search
>>>>>>>>>>         2.40%  [kernel]          [k] read_extent_buffer
>>>>>>>>>>         1.38%  [kernel]          [k] memset_erms
>>>>>>>>>>         1.11%  [kernel]          [k] find_extent_buffer
>>>>>>>>>>         0.69%  [kernel]          [k] kmem_cache_free
>>>>>>>>>>         0.69%  [kernel]          [k] memcpy_erms
>>>>>>>>>>         0.57%  [kernel]          [k] kmem_cache_alloc
>>>>>>>>>>         0.45%  [kernel]          [k] radix_tree_lookup
>>>>>>>>>>
>>>>>>>>>> I can reproduce the bug on 2 different machines, running 2
>>>>>>>>>> different Linux distributions (Arch and Gentoo) with 2 different
>>>>>>>>>> kernel configs. One kernel is compiled with Clang, the other with
>>>>>>>>>> GCC.
>>>>>>>>>>
>>>>>>>>>> Kernel version: 5.16.0
>>>>>>>>>> Mount options:
>>>>>>>>>>        Machine 1:
>>>>>>>>>> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag 
>>>>>>>>>>
>>>>>>>>>>        Machine 2:
>>>>>>>>>> rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
>>>>>>>>>>
>>>>>>>>>> When the error happens, no message is shown in dmesg.
>>>>>>>>> This is very likely the same issue as reported at this thread:
>>>>>>>>>
>>>>>>>>> https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb 
>>>>>>>>>
>>>>>>>>> Are you able to test the patch posted there?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Anthony Ruhier
>>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>



Thread overview: 14+ messages
2022-01-16 19:15 btrfs fi defrag hangs on small files, 100% CPU thread Anthony Ruhier
2022-01-17 12:10 ` Filipe Manana
2022-01-17 14:24   ` Anthony Ruhier
2022-01-17 16:52     ` Filipe Manana
2022-01-17 17:04       ` Anthony Ruhier
2022-01-17 17:50         ` Anthony Ruhier
2022-01-17 17:56           ` Filipe Manana
2022-01-17 19:39             ` Anthony Ruhier
2022-01-17 23:15               ` Qu Wenruo
2022-01-17 23:32                 ` Qu Wenruo
2022-01-18  1:58                   ` Anthony Ruhier [this message]
2022-01-18  2:22                     ` Qu Wenruo
2022-01-18 10:57                       ` Anthony Ruhier
2022-01-18 11:58                         ` Qu Wenruo
