From: Hans van Kranenburg <hans@knorrie.org>
To: linux-btrfs@vger.kernel.org
Subject: Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
Date: Sat, 25 Jul 2020 23:27:55 +0200
Message-ID: <a370a2bf-25b5-eaa0-a5e5-595c967e82a6@knorrie.org>
In-Reply-To: <2523ce77-31a3-ecec-f36d-8d74132eae02@knorrie.org>

Ok,

On 7/25/20 4:24 PM, Hans van Kranenburg wrote:
> Hi,
> 
> I have a filesystem here that I'm filling up with data from elsewhere.
> Most of it is done by rsync, and part by send/receive. So, receiving
> data over the network, and then writing the files to disk. There can be
> a dozen of these processes running in parallel.
> 
> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
> varying X) often is using nearly 100% cpu, while enormously slowing down
> disk writes. This shows as disk IO wait for the rsync and btrfs receive
> processes.

Both rsync and btrfs receive are handling incoming streams of data
which contain huge numbers of files of all kinds of sizes, including a
lot of small files. (E.g. a receive job for an imap server, fed from 1G
split send dump files, is running right now; that's maildir storage,
with every email in a separate file. And there are also customers
storing tiny XML fragments from XML webservice calls in files, to
process later.)

> The underlying storage (iSCSI connected over 10Gb/s network) can easily
> eat a few hundred MiB/s. When looking at actual disk busyness on the
> storage device, percentages <5% utilization are reported for the actual
> disks.
> 
> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
> 
> I just did a 'perf record -g -a sleep 60' while disk writes were down to
> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
> shot' of it. Also attached an example of what nmon shows to give an idea
> about the situation.
> 
> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
> see network and disk write speed ramping up, easily over 200 MiB/s,
> until it soon plummets again.
> 
> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
> is booted now).
> 
> So, what I'm looking for is:
> * Does anyone else see this happen when doing a lot of concurrent writes?
> * What does this flush thing do?

I have been playing around with the available trace point stuff in
/sys/kernel/debug/tracing/events/btrfs a bit more.
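
(For anyone who wants to follow along, this is roughly what I'm doing;
the paths assume debugfs is mounted at /sys/kernel/debug, and enabling
the whole btrfs event group is just the lazy way to start looking:)

    # the lazy way: switch on all btrfs tracepoints and watch the firehose
    cd /sys/kernel/debug/tracing
    echo 1 > events/btrfs/enable
    cat trace_pipe
    # and off again when done
    echo 0 > events/btrfs/enable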

After rsync or btrfs receive has some new data available for the fs to
write to disk... what needs to happen before it can actually go into
the write cache and end up on disk?

Well, we need to find a place to put it.

What's kworker/u16:X+flush-btrfs-2 doing?

It's doing find_free_extent.
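
(A cheap way to double-check that without perf: take the PID of the
busy kworker from top and dump its kernel stack a few times. The 1234
below is of course a placeholder for whatever PID shows up on your
box:)

    # 1234 = PID of the busy kworker/u16:X+flush-btrfs-2 as shown in top
    # needs root; prints the kernel stack at this very moment
    sudo cat /proc/1234/stack
    # repeat a handful of times; find_free_extent keeps showing up in mine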

> * Why is it using 100% cpu all the time?

Because it spends all its time doing find_free_extent. And almost all of
it is happening for EXTENT_TREE.

    len=16384 empty_size=0 flags=4(METADATA)

I could have known :O And again it's the extent allocator making my life
miserable.
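
(The 'almost all of it' is just me counting tracepoint output; a rough
sketch, assuming your find_free_extent trace lines carry a
root=...(...) field like the ones I'm looking at:)

    # collect the allocator tracepoint for half a minute, then count per tree
    cd /sys/kernel/debug/tracing
    echo 1 > events/btrfs/find_free_extent/enable
    timeout 30 cat trace_pipe > /tmp/ffe.log
    grep -o 'root=[0-9]*([^)]*)' /tmp/ffe.log | sort | uniq -c | sort -rn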

> * How can I debug this more?

I will read more of the relevant code and see if I can put some more
custom trace points in. I will also have a closer look at the
empty_size thing, again...

> * Ultimately, of course... how can we improve this?

Well, by having an idea of the 'pressure' on it. If we're slowing
everyone down a lot, switch to some more yolo allocation which just
appends all new writes and flushes them to disk, instead of spending
100% kernel cpu to find some tiny old 16k hole of free metadata space
to put a single write in.

So, yes, I could hack around a bit in the code again and flip around
some mount option which does the 'ssd_spread' behavior for metadata
only. I did that on another filesystem a while ago, which helped a lot
for throughput, but it also allocated a ridiculous amount of new
metadata block groups.
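
(The closest stock knob, for anyone who wants to experiment without
patching: just remount with ssd_spread. Note that unlike my hack this
is not metadata-only, and /mnt/whatever is a placeholder for your
mount point:)

    # ssd_spread makes the allocator prefer bigger contiguous areas of
    # unused space instead of filling small holes; applies to the whole
    # fs, not just metadata
    mount -o remount,ssd_spread /mnt/whatever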

I'd like to redesign all of this, but -EPOSSIBLE because -EWORKTIME

Now I have a btrfs receive running for >1.5T of customer garbage...
Receive cannot stop and resume. So am I going to abort and redo it while
it's at 75.2GiB 5:57:46 [1.44MiB/s]? Or just sit it out?

Stay tuned for the next episode of Knorrie's btrfs adventures! :)

K

