From: Hans van Kranenburg <hans@knorrie.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Subject: Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
Date: Mon, 27 Jul 2020 19:17:50 +0200
Message-ID: <ae50b6a5-0f1e-282e-61d0-4ff37372a3ca@knorrie.org>
In-Reply-To: <f3cba5f0-8cc4-a521-3bba-2c02ff6c93a2@gmx.com>

Hi!

On 7/27/20 1:09 PM, Qu Wenruo wrote:
> 
> 
> On 2020/7/25 10:24 PM, Hans van Kranenburg wrote:
>> Hi,
>>
>> I have a filesystem here that I'm filling up with data from elsewhere.
>> Most of it is done by rsync, and part by send/receive. So, receiving
>> data over the network, and then writing the files to disk. There can be
>> a dozen of these processes running in parallel.
>>
>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>> varying X) often is using nearly 100% cpu, while enormously slowing down
>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>> processes.
> 
> The name is in fact pretty strange.
> It doesn't follow the btrfs workqueue names.
> 
> There are two types of kernel threads used by btrfs:
>   [...]
> 
> As you can see, there is no one named like "flush-btrfs".

Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here,
currently. ;]

> Thus I guess it's from other part of the stack.
> 
> Also, the call trace shows that the kernel thread is only doing
> page writeback, which calls back into the page write hooks of btrfs.
> 
> So I guess it may not be btrfs, but something else trying to do all the
> writeback.

Yes, so, from the writeback path it re-enters btrfs code, right?

> But still, the CPU usage is a problem; it shouldn't cost so much
> CPU time just to write back pages from btrfs.

It's find_free_extent that is the culprit.

>> [...]
>>
>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
>>
>> I just did a 'perf record -g -a sleep 60' while disk writes were down to
>> under 1MiB (!) per second, and then 'perf report'. Attached a 'screen
>> shot' of it. Also attached an example of what nmon shows to give an idea
>> about the situation.
>>
>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
>> see network and disk write speed ramping up, easily over 200 MiB/s,
>> until it soon plummets again.
>>
>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
>> is booted now).
>>
>> So, what I'm looking for is:
>> * Does anyone else see this happen when doing a lot of concurrent writes?
>> * What does this flush thing do?
>> * Why is it using 100% cpu all the time?
>> * How can I debug this more?
> 
> bcc based runtime probes I guess?
> 
> Since it's almost a dead CPU-burning loop, a regular sleep-based lockup
> detector won't help much.

Here's a flame graph of 180 seconds, taken from the kernel thread pid:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
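
If the flame graph doesn't pin it down far enough, a bcc-based probe
like you suggest is probably the next step. A minimal sketch of what I
have in mind, assuming the bcc Python bindings are installed and that
the find_free_extent symbol is visible to kprobes (it's a btrfs-internal
function, so it might have been inlined away on some kernels):

#!/usr/bin/env python3
# Histogram of find_free_extent() latency, via bcc kprobes.
# Sketch only; assumes bcc is installed and the symbol is probeable.
from bcc import BPF
import time

prog = r"""
BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(dist);

int kprobe__find_free_extent(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int kretprobe__find_free_extent(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;
    /* log2 histogram of microseconds spent per call */
    dist.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
print("Tracing find_free_extent()... hit Ctrl-C to print the histogram.")
try:
    time.sleep(3600)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("usecs")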

> You can try trace events first to see which trace event gets executed
> most frequently, then try to add probe points to pin down the real cause.

From the default collection of trace events, I already got the following
a few days ago, by enabling find_free_extent and btrfs_cow_block:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt

From the timestamps you can see how long this takes; not much gets done
per second.
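
For reference, I enabled those events and captured the output roughly
like this (just a sketch; it assumes tracefs is mounted at the usual
/sys/kernel/debug/tracing location and that we run as root):

#!/usr/bin/env python3
# Enable the btrfs find_free_extent and btrfs_cow_block trace events
# and dump trace_pipe with timestamps. Sketch; assumes tracefs is
# mounted at the usual debugfs location.
import sys

TRACING = "/sys/kernel/debug/tracing"
EVENTS = ["btrfs/find_free_extent", "btrfs/btrfs_cow_block"]

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

for event in EVENTS:
    write(f"{TRACING}/events/{event}/enable", "1")

try:
    with open(f"{TRACING}/trace_pipe") as pipe:
        for line in pipe:
            sys.stdout.write(line)
except KeyboardInterrupt:
    pass
finally:
    for event in EVENTS:
        write(f"{TRACING}/events/{event}/enable", "0")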

The spin lock part must be spin_lock(&space_info->lock) because that's
the only one in find_free_extent.

So, what I think is, like I already mentioned on Saturday:
find_free_extent is probably doing things in a really inefficient way.
It keeps scanning around for a single small free space gap in a really
expensive way, and it does that all over again for each tiny metadata
block or file that needs to be placed somewhere (also notice the
empty_size=0), instead of just throwing all of it on disk after each
other when the search is otherwise slowing everyone down.

And then we have the spin lock part and btrfs_get_block_group, which
also account for a significant part of the time spent. All of that
together adds up to the continuous 100% cpu.

What I *can* try is switching to the ssd_spread mount option, to force
it to do much more yolo allocation, but I'm afraid this will result in a
sudden blast of metadata block groups getting allocated. Or, maybe try
it for a minute or so and compare the trace pipe output...
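
Something like this could do that comparison, counting find_free_extent
events per second from trace_pipe before and after a remount with
ssd_spread (sketch; it assumes the event is still enabled as above, and
note that the counter only ticks over when events actually arrive):

#!/usr/bin/env python3
# Count find_free_extent trace events per second from trace_pipe, to
# compare allocator behaviour before and after
# 'mount -o remount,ssd_spread'. Sketch; assumes the find_free_extent
# event is already enabled.
import time

TRACE_PIPE = "/sys/kernel/debug/tracing/trace_pipe"

count = 0
window_start = time.monotonic()

with open(TRACE_PIPE) as pipe:
    for line in pipe:
        if "find_free_extent" in line:
            count += 1
        now = time.monotonic()
        if now - window_start >= 1.0:
            print(f"{count} find_free_extent events/s")
            count = 0
            window_start = now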

> But personally speaking, it's better to shrink the workload, to find a
> minimal workload that reproduces the 100% CPU burn, so that you need
> fewer probes and less time to pin down the problem.

I think I can reproduce it with a single btrfs receive operation, as
long as it has a large number of small files in it.
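
To build such a reproducer, something like this could generate the
source tree to snapshot and send (hypothetical sketch; the path, file
count and size range are arbitrary picks):

#!/usr/bin/env python3
# Generate a tree with a large number of small files, to use as the
# source of a btrfs send | btrfs receive reproducer. Hypothetical
# sketch; directory layout and file sizes are arbitrary.
import os
import random

BASE = "/mnt/scratch/smallfiles"   # assumption: adjust to taste
DIRS = 256
FILES_PER_DIR = 4096
SIZE_RANGE = (256, 16 * 1024)      # 256 bytes .. 16 KiB

random.seed(0)
for d in range(DIRS):
    dirpath = os.path.join(BASE, f"dir{d:03d}")
    os.makedirs(dirpath, exist_ok=True)
    for f in range(FILES_PER_DIR):
        size = random.randint(*SIZE_RANGE)
        with open(os.path.join(dirpath, f"file{f:05d}"), "wb") as fh:
            fh.write(os.urandom(size))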

Currently I'm almost exclusively copying data onto the filesystem; the
only data removals come from a targeted dedupe script for known big
files that are identical, but which could not easily be handled by
rsync.

So, when find_free_extent wants to write extent tree blocks, it might be
playing outside in the woods all the time, searching around for little
holes that were caused by previous cow operations.

Large files, like postgresql database dumps, are not a problem. Copying
them over just runs at >200-300 MiB/s, and there the read speed of the
old storage I'm moving things off of is actually the bottleneck.

>> * Ultimately, of course... how can we improve this?
>>
>> [...]

I will see if I can experiment a bit with putting more trace points in.

I remember reading around in the extent allocator code, until I got
dizzy from it. Deja vu...

https://www.spinics.net/lists/linux-btrfs/msg70624.html

Thanks
Hans

