linux-btrfs.vger.kernel.org archive mirror
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Anthony Ruhier <aruhier@mailbox.org>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
	Filipe Manana <fdmanana@kernel.org>
Subject: Re: btrfs fi defrag hangs on small files, 100% CPU thread
Date: Tue, 18 Jan 2022 10:22:55 +0800
Message-ID: <2d4ed9ca-52a7-3a39-1a4f-995c6ed0d070@gmx.com>
In-Reply-To: <8ecb84e1-b8da-20f2-8730-8c1403ac7d00@mailbox.org>



On 2022/1/18 09:58, Anthony Ruhier wrote:
> Hi Qu,
>
> No problem, thanks a lot for your patch! After a quick look over a
> 10-minute period, it looks like your patch fixed the issue. I've applied
> Filipe's two patches plus yours.
> I'll let my NAS run over the night and watch the writes average on a
> longer period. I'll send an update in ~10 hours.

I'm really sorry for all the problems.

Currently I believe the root cause is that I changed the return value of
btrfs_defrag_file(), which should return the number of sectors defragged.

Even though I renamed the variable to "sectors_defragged", I still returned
it in bytes instead.

(Furthermore, it tends to report more bytes than were really defragged.)

The diff I mentioned is not really meant to solve the problem, but to help
us pin it down.
(The diff itself makes autodefrag do a little less work, so it may not
defrag as aggressively as before.)

Thank you very much for helping to fix the mess I caused.


BTW, if you have spare machines, would you mind testing the following two
patches?
(If everything works as expected, they should be the final fixes, so please
test them without any other patches applied.)

https://patchwork.kernel.org/project/linux-btrfs/patch/20220118021544.18543-1-wqu@suse.com/
https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/

Thanks,
Qu


>
> Thanks a lot for your help,
> Anthony
>
> Le 18/01/2022 à 00:32, Qu Wenruo a écrit :
>>
>>
>> On 2022/1/18 07:15, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/1/18 03:39, Anthony Ruhier wrote:
>>>> I have some good news and bad news: the bad news is that it didn't fix
>>>> the autodefrag problem (I applied the 2 patches).
>>>>
>>>> The good news is that when I enable autodefrag, I can quickly see if
>>>> the
>>>> problem is still there or not.
>>>> It's not as obvious as the constant writes that happened just after
>>>> my upgrade to linux 5.16, because the problem has since mitigated
>>>> itself a bit, but it's still noticeable.
>>>>
>>>> If I compare the write average (in total, I don't have it per process)
>>>> when taking idle periods on the same machine:
>>>>      Linux 5.16:
>>>>          without autodefrag: ~ 10KiB/s
>>>>          with autodefrag: between 1 and 2MiB/s.
>>>>
>>>>      Linux 5.15:
>>>>          with autodefrag: ~ 10KiB/s (around the same as without
>>>> autodefrag on 5.16)
>>>>
>>>> Feel free to ask me anything to help your debugging, just try to be
>>>> quite explicit about what I should do, as I'm not experienced in
>>>> filesystem debugging.
>>>
>>> Would you mind testing the following diff (along with the previous two
>>> patches from Filipe)?
>>>
>>> I screwed up: the refactor changed how the defragged bytes are
>>> accounted, and I didn't notice that autodefrag relies on that count to
>>> requeue the inode.
>>>
>>> The refactor was originally meant to add support for subpage defrag, but
>>> it looks like I pushed the boundary too hard and changed some behaviors.
>>
>> Sorry, previous diff has a memory leakage bug.
>>
>> The autodefrag code is more complex than I thought; it has extra logic
>> to requeue inodes for further defrag work.
>>
>> Please try this one instead.
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 11204dbbe053..096feecf2602 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -305,26 +305,7 @@ static int __btrfs_run_defrag_inode(struct
>> btrfs_fs_info *fs_info,
>>         num_defrag = btrfs_defrag_file(inode, NULL, &range,
>> defrag->transid,
>>                                        BTRFS_DEFRAG_BATCH);
>>         sb_end_write(fs_info->sb);
>> -       /*
>> -        * if we filled the whole defrag batch, there
>> -        * must be more work to do.  Queue this defrag
>> -        * again
>> -        */
>> -       if (num_defrag == BTRFS_DEFRAG_BATCH) {
>> -               defrag->last_offset = range.start;
>> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>> -       } else if (defrag->last_offset && !defrag->cycled) {
>> -               /*
>> -                * we didn't fill our defrag batch, but
>> -                * we didn't start at zero.  Make sure we loop
>> -                * around to the start of the file.
>> -                */
>> -               defrag->last_offset = 0;
>> -               defrag->cycled = 1;
>> -               btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>> -       } else {
>> -               kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>> -       }
>> +       kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>
>>         iput(inode);
>>         return 0;
>>
>>>
>>>
>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>> index 11204dbbe053..c720260f9656 100644
>>> --- a/fs/btrfs/file.c
>>> +++ b/fs/btrfs/file.c
>>> @@ -312,7 +312,9 @@ static int __btrfs_run_defrag_inode(struct
>>> btrfs_fs_info *fs_info,
>>>           */
>>>          if (num_defrag == BTRFS_DEFRAG_BATCH) {
>>>                  defrag->last_offset = range.start;
>>> +               /*
>>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>> +               */
>>>          } else if (defrag->last_offset && !defrag->cycled) {
>>>                  /*
>>>                   * we didn't fill our defrag batch, but
>>> @@ -321,7 +323,9 @@ static int __btrfs_run_defrag_inode(struct
>>> btrfs_fs_info *fs_info,
>>>                   */
>>>                  defrag->last_offset = 0;
>>>                  defrag->cycled = 1;
>>> +               /*
>>>                  btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>> +               */
>>>          } else {
>>>                  kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>          }
>>>
>>>
>>>
>>>> Thanks
>>>>
>>>> Le 17/01/2022 à 18:56, Filipe Manana a écrit :
>>>>> On Mon, Jan 17, 2022 at 5:50 PM Anthony Ruhier <aruhier@mailbox.org>
>>>>> wrote:
>>>>>> Filipe,
>>>>>>
>>>>>> Just a quick update after my previous email, your patch fixed the
>>>>>> issue
>>>>>> for `btrfs fi defrag`.
>>>>>> Thanks a lot! I closed my bug on bugzilla.
>>>>>>
>>>>>> I'll keep you in touch about the autodefrag.
>>>>> Please do.
>>>>> The 1 byte file case was very specific, so it's probably a different
>>>>> issue.
>>>>>
>>>>> Thanks for testing!
>>>>>
>>>>>> Le 17/01/2022 à 18:04, Anthony Ruhier a écrit :
>>>>>>> I'm going to apply your patch for the 1B file, and quickly
>>>>>>> confirm if
>>>>>>> it works.
>>>>>>> Thanks a lot for your patch!
>>>>>>>
>>>>>>> About the autodefrag issue, it's going to be trickier to check
>>>>>>> that your patch fixes it, because for whatever reason the problem
>>>>>>> seems to have resolved itself (or at least, btrfs-cleaner does far
>>>>>>> fewer writes) after a partial btrfs balance.
>>>>>>> I'll try to look at the amount of writes after several hours. Maybe
>>>>>>> for this one, check with the author of the other bug: if they can
>>>>>>> easily reproduce the issue, it's going to be easier to verify.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anthony
>>>>>>>
>>>>>>> Le 17/01/2022 à 17:52, Filipe Manana a écrit :
>>>>>>>> On Mon, Jan 17, 2022 at 03:24:00PM +0100, Anthony Ruhier wrote:
>>>>>>>>> Thanks for the answer.
>>>>>>>>>
>>>>>>>>> I had the exact same issue as in the thread you've linked, and
>>>>>>>>> have some monitoring and graphs showing that btrfs-cleaner did
>>>>>>>>> constant writes for 12 hours just after I upgraded to linux 5.16.
>>>>>>>>> Weirdly enough, the issue almost disappeared after I did a btrfs
>>>>>>>>> balance filtered on 10% data usage.
>>>>>>>>> But that's also why I initially disabled autodefrag, which led to
>>>>>>>>> discovering this bug when I switched to manual defragmentation
>>>>>>>>> (which, in the end, makes more sense with my setup anyway).
>>>>>>>>>
>>>>>>>>> I tried this patch, but sadly it doesn't help with the initial
>>>>>>>>> issue. I cannot say for the bug in the other thread, as the
>>>>>>>>> problem with btrfs-cleaner disappeared (I can still see some
>>>>>>>>> writes from it, but it's so rare that I cannot say whether it's
>>>>>>>>> normal or not).
>>>>>>>> Ok, reading your first mail more carefully, I see there's the
>>>>>>>> case of the 1 byte file, which is possibly unrelated to the
>>>>>>>> excessive IO problem caused by autodefrag.
>>>>>>>>
>>>>>>>> For the 1 byte file problem, I've just sent a fix:
>>>>>>>>
>>>>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/bcbfce0ff7e21bbfed2484b1457e560edf78020d.1642436805.git.fdmanana@suse.com/
>>>>>>>>
>>>>>>>> It's actually trivial to trigger.
>>>>>>>>
>>>>>>>> Can you check if it also fixes your problem with autodefrag?
>>>>>>>>
>>>>>>>> If not, then try the following (after applying the 1 file fix):
>>>>>>>>
>>>>>>>> https://pastebin.com/raw/EbEfk1tF
>>>>>>>>
>>>>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>>>>> index a5bd6926f7ff..db795e226cca 100644
>>>>>>>> --- a/fs/btrfs/ioctl.c
>>>>>>>> +++ b/fs/btrfs/ioctl.c
>>>>>>>> @@ -1191,6 +1191,7 @@ static int defrag_collect_targets(struct
>>>>>>>> btrfs_inode *inode,
>>>>>>>>                      u64 newer_than, bool do_compress,
>>>>>>>>                      bool locked, struct list_head *target_list)
>>>>>>>>    {
>>>>>>>> +    const u32 min_thresh = extent_thresh / 2;
>>>>>>>>        u64 cur = start;
>>>>>>>>        int ret = 0;
>>>>>>>>    @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct
>>>>>>>> btrfs_inode *inode,
>>>>>>>>            struct extent_map *em;
>>>>>>>>            struct defrag_target_range *new;
>>>>>>>>            bool next_mergeable = true;
>>>>>>>> +        u64 range_start;
>>>>>>>>            u64 range_len;
>>>>>>>>              em = defrag_lookup_extent(&inode->vfs_inode, cur,
>>>>>>>> locked);
>>>>>>>> @@ -1213,6 +1215,24 @@ static int defrag_collect_targets(struct
>>>>>>>> btrfs_inode *inode,
>>>>>>>>            if (em->generation < newer_than)
>>>>>>>>                goto next;
>>>>>>>>    +        /*
>>>>>>>> +         * Our start offset might be in the middle of an existing
>>>>>>>> extent
>>>>>>>> +         * map, so take that into account.
>>>>>>>> +         */
>>>>>>>> +        range_len = em->len - (cur - em->start);
>>>>>>>> +
>>>>>>>> +        /*
>>>>>>>> +         * If there's already a good range for delalloc within the
>>>>>>>> range
>>>>>>>> +         * covered by the extent map, skip it, otherwise we can
>>>>>>>> end up
>>>>>>>> +         * doing on the same IO range for a long time when using
>>>>>>>> auto
>>>>>>>> +         * defrag.
>>>>>>>> +         */
>>>>>>>> +        range_start = cur;
>>>>>>>> +        if (count_range_bits(&inode->io_tree, &range_start,
>>>>>>>> +                     range_start + range_len - 1, min_thresh,
>>>>>>>> +                     EXTENT_DELALLOC, 1) >= min_thresh)
>>>>>>>> +            goto next;
>>>>>>>> +
>>>>>>>>            /*
>>>>>>>>             * For do_compress case, we want to compress all valid
>>>>>>>> file
>>>>>>>>             * extents, thus no @extent_thresh or mergeable check.
>>>>>>>> @@ -1220,8 +1240,8 @@ static int defrag_collect_targets(struct
>>>>>>>> btrfs_inode *inode,
>>>>>>>>            if (do_compress)
>>>>>>>>                goto add;
>>>>>>>>    -        /* Skip too large extent */
>>>>>>>> -        if (em->len >= extent_thresh)
>>>>>>>> +        /* Skip large enough ranges. */
>>>>>>>> +        if (range_len >= extent_thresh)
>>>>>>>>                goto next;
>>>>>>>>              next_mergeable =
>>>>>>>> defrag_check_next_extent(&inode->vfs_inode, em,
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anthony
>>>>>>>>>
>>>>>>>>> Le 17/01/2022 à 13:10, Filipe Manana a écrit :
>>>>>>>>>> On Sun, Jan 16, 2022 at 08:15:37PM +0100, Anthony Ruhier wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> Since I upgraded from linux 5.15 to 5.16, `btrfs filesystem
>>>>>>>>>>> defrag -t128K` hangs on small files (~1 byte) and triggers what
>>>>>>>>>>> seems to be a loop in the kernel. It results in one CPU thread
>>>>>>>>>>> being used at 100%. I cannot kill the process, and rebooting is
>>>>>>>>>>> blocked by btrfs.
>>>>>>>>>>> It is a copy of the bug
>>>>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=215498
>>>>>>>>>>>
>>>>>>>>>>> Rebooting to linux 5.15 shows no issue. I have no issue running
>>>>>>>>>>> a defrag on bigger files (I filter out files smaller than
>>>>>>>>>>> 3.9KB).
>>>>>>>>>>>
>>>>>>>>>>> I had a conversation on #btrfs on IRC, so here's what we
>>>>>>>>>>> debugged:
>>>>>>>>>>>
>>>>>>>>>>> I can replicate the issue by copying a file impacted by this
>>>>>>>>>>> bug using `cp --reflink=never`. I attached one of the impacted
>>>>>>>>>>> files, named README.md, to this bug.
>>>>>>>>>>>
>>>>>>>>>>> Someone told me that it could be a bug due to the inline extent.
>>>>>>>>>>> So we tried
>>>>>>>>>>> to check that.
>>>>>>>>>>>
>>>>>>>>>>> filefrag shows that the file README.md is 1 inline extent. I
>>>>>>>>>>> created a new file with random text, of 18 bytes (slightly
>>>>>>>>>>> bigger than the other file), that is also 1 inline extent. This
>>>>>>>>>>> file doesn't trigger the bug and defragments without issue.
>>>>>>>>>>>
>>>>>>>>>>> I mounted my system with `max_inline=0` and created a copy of
>>>>>>>>>>> README.md. `filefrag` shows that the new file is now 1 regular
>>>>>>>>>>> extent, not inline. This new file also triggers the bug, so it
>>>>>>>>>>> doesn't seem to be due to the inline extent.
>>>>>>>>>>>
>>>>>>>>>>> Someone asked me to provide the output of a perf top when the
>>>>>>>>>>> defrag is
>>>>>>>>>>> stuck:
>>>>>>>>>>>
>>>>>>>>>>>        28.70%  [kernel]          [k] generic_bin_search
>>>>>>>>>>>        14.90%  [kernel]          [k] free_extent_buffer
>>>>>>>>>>>        13.17%  [kernel]          [k] btrfs_search_slot
>>>>>>>>>>>        12.63%  [kernel]          [k] btrfs_root_node
>>>>>>>>>>>         8.33%  [kernel]          [k] btrfs_get_64
>>>>>>>>>>>         3.88%  [kernel]          [k] __down_read_common.llvm
>>>>>>>>>>>         3.00%  [kernel]          [k] up_read
>>>>>>>>>>>         2.63%  [kernel]          [k] read_block_for_search
>>>>>>>>>>>         2.40%  [kernel]          [k] read_extent_buffer
>>>>>>>>>>>         1.38%  [kernel]          [k] memset_erms
>>>>>>>>>>>         1.11%  [kernel]          [k] find_extent_buffer
>>>>>>>>>>>         0.69%  [kernel]          [k] kmem_cache_free
>>>>>>>>>>>         0.69%  [kernel]          [k] memcpy_erms
>>>>>>>>>>>         0.57%  [kernel]          [k] kmem_cache_alloc
>>>>>>>>>>>         0.45%  [kernel]          [k] radix_tree_lookup
>>>>>>>>>>>
>>>>>>>>>>> I can reproduce the bug on 2 different machines, running 2
>>>>>>>>>>> different linux distributions (Arch and Gentoo) with 2
>>>>>>>>>>> different kernel configs. One kernel is compiled with clang,
>>>>>>>>>>> the other with GCC.
>>>>>>>>>>>
>>>>>>>>>>> Kernel version: 5.16.0
>>>>>>>>>>> Mount options:
>>>>>>>>>>>        Machine 1:
>>>>>>>>>>> rw,noatime,compress-force=zstd:2,ssd,discard=async,space_cache=v2,autodefrag
>>>>>>>>>>>
>>>>>>>>>>>        Machine 2:
>>>>>>>>>>> rw,noatime,compress-force=zstd:3,nossd,space_cache=v2
>>>>>>>>>>>
>>>>>>>>>>> When the error happens, no message is shown in dmesg.
>>>>>>>>>> This is very likely the same issue as reported at this thread:
>>>>>>>>>>
>>>>>>>>>> https://lore.kernel.org/linux-btrfs/YeVawBBE3r6hVhgs@debian9.Home/T/#ma1c8a9848c9b7e4edb471f7be184599d38e288bb
>>>>>>>>>>
>>>>>>>>>> Are you able to test the patch posted there?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Anthony Ruhier
>>>>>>>>>>>


Thread overview: 14+ messages
2022-01-16 19:15 btrfs fi defrag hangs on small files, 100% CPU thread Anthony Ruhier
2022-01-17 12:10 ` Filipe Manana
2022-01-17 14:24   ` Anthony Ruhier
2022-01-17 16:52     ` Filipe Manana
2022-01-17 17:04       ` Anthony Ruhier
2022-01-17 17:50         ` Anthony Ruhier
2022-01-17 17:56           ` Filipe Manana
2022-01-17 19:39             ` Anthony Ruhier
2022-01-17 23:15               ` Qu Wenruo
2022-01-17 23:32                 ` Qu Wenruo
2022-01-18  1:58                   ` Anthony Ruhier
2022-01-18  2:22                     ` Qu Wenruo [this message]
2022-01-18 10:57                       ` Anthony Ruhier
2022-01-18 11:58                         ` Qu Wenruo
