From: Hans van Kranenburg <hans@knorrie.org>
To: "Holger Hoffstätte" <holger@applied-asynchrony.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
Date: Sat, 25 Jul 2020 18:43:09 +0200
Message-ID: <4771c445-dcb4-77c4-7cb6-07a52f8025f6@knorrie.org>
In-Reply-To: <6b4041a7-cbf7-07b6-0f30-8141d60a7d51@applied-asynchrony.com>

On 7/25/20 5:37 PM, Holger Hoffstätte wrote:
> On 2020-07-25 16:24, Hans van Kranenburg wrote:
>> Hi,
>>
>> I have a filesystem here that I'm filling up with data from elsewhere.
>> Most of it is done by rsync, and part by send/receive. So, receiving
>> data over the network, and then writing the files to disk. There can be
>> a dozen of these processes running in parallel.
>>
>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>> varying X) often is using nearly 100% cpu, while enormously slowing down
>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>> processes.
> 
> <snip>
> 
> I cannot speak to anything btrfs-specific (other than the usual write
> storms), however..
> 
>> [<0>] rq_qos_wait+0xfa/0x170
>> [<0>] wbt_wait+0x98/0xe0
>> [<0>] __rq_qos_throttle+0x23/0x30

I need to cat /proc/<pid>/stack a huge number of times in a loop and
only once in a while catch this sort of output.
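
(For reference, the sampling loop is nothing fancier than roughly
this; the pid is just whatever the kworker thread currently has:)

  # dump the kernel stack of the flush thread in a tight loop;
  # most samples are boring, only some catch it in the rq_qos/wbt path
  pid=1234   # example value, pid of kworker/u16:X+flush-btrfs-2
  while true; do
      cat /proc/$pid/stack
      echo ----
      sleep 0.1
  done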

> ..this means that you have CONFIG_BLK_WBT{_MQ} enabled and are using
> an IO scheduler that observes writeback throttling. AFAIK all MQ-capable
> schedulers (also SQ ones in 4.19 IIRC!) do so except for BFQ, which has
> its own mechanism to regulate fairness vs. latency and explicitly turns
> WBT off.
> 
> WBT aka 'writeback throttling' throttles background writes according to
> latency/throughput of the underlying block device in favor of readers.
> It is meant to protect interactive/low-latency/desktop apps from heavy
> bursts of background writeback activity. I tested early versions and
> provided feedback to Jens Axboe; it really is helpful when it works,
> but obviously cannot cater to every situation. There have been reports
> that it is unhelpful for write-only/heavy workloads and may lead to
> queueing pileup.
> 
> You can tune the expected latency of device writes via:
> /sys/block/sda/queue/wbt_lat_usec.

Yes, I have been playing around with it earlier, without any effect on
the symptoms.

I just did this again: echo 0 > wbt_lat_usec for all of the involved
block devices. When looking at the events/wbt trace point, I see that
wbt activity stops at that moment.
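
(Roughly like this, with sdX being example device names and tracefs
mounted in the usual place:)

  # turn off writeback throttling for the involved devices
  for dev in sda sdb sdc; do        # example device names
      echo 0 > /sys/block/$dev/queue/wbt_lat_usec
  done

  # enable the wbt tracepoints and watch the activity stop
  echo 1 > /sys/kernel/debug/tracing/events/wbt/enable
  cat /sys/kernel/debug/tracing/trace_pipe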

No difference in symptoms.

> You might also check whether your vm.dirty_{background}_bytes and
> vm.dirty_expire_centisecs are too high; distro defaults almost always
> are. Lowering them leads to more evenly spaced out write traffic.

Dirty buffers were ~2G in size. I can modify the numbers to make that
bigger or smaller; there's absolutely no change in the behavior of the
system.
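
(What I have been fiddling with, roughly; the numbers here are just
example values:)

  # current amount of dirty / writeback pages
  grep -E '^(Dirty|Writeback):' /proc/meminfo

  # example values only, I tried both larger and smaller ones
  sysctl -w vm.dirty_background_bytes=268435456   # 256M
  sysctl -w vm.dirty_bytes=1073741824             # 1G
  sysctl -w vm.dirty_expire_centisecs=1500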

> Without knowing more it's difficult to say exactly what is going on,
> but if your underlying storage has latency spikes

It doesn't. It's idle, waiting to finally get some data sent to it.

> it is very
> likely that you are looking at queueing pileup caused by multiple WBTs
> choking each other. Having other unrelated queueing & throttling
> mechanisms (in your case the network) in the mix is unlikely to help.
> I'm not going to comment on iSCSI in general.. :^)
> 
> OTOH I also have 10G networking here and no such problems, even when
> pushing large amounts of data over NFS at ~750 MB/s - and I have WBT
> enabled everywhere.
> 
> So maybe start small and either ramp up the wbt latency sysctl or
> decrease dirty_background_bytes to start flushing sooner, depending
> on how it's set. As a last resort you can rebuild your kernels with
> CONFIG_BLK_WBT/CONFIG_BLK_WBT_MQ disabled.

Overall processing speed is inversely proportional to the cpu usage of
this kworker/u16:X+flush-btrfs-2 thread. If it reaches >95% kernel cpu
usage, everything slows down. The network is idle, the disks are idle.
Incoming rsync speed drops, the speed at which btrfs receive reads its
input drops, etc. As soon as kworker/u16:X+flush-btrfs-2 cpu usage gets
below ~95% again, throughput goes up.
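
(This is plainly visible by just putting per-thread cpu usage next to
disk utilization, e.g.:)

  # per-thread view: the kworker/u16:X+flush-btrfs-2 thread sits near
  # 100% cpu while rsync and btrfs receive sit in iowait
  top -H

  # meanwhile the disks are mostly idle
  iostat -x 5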

I do not see how writeback problems would result in a
kworker/u16:X+flush-btrfs-2 doing 100% cpu. I think it's the other way
around, and that's why I want to know what this thread is actually busy
doing instead of shoveling the data towards the disks.
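
(I guess the obvious next step is to profile the thread itself;
something roughly like this, with 1234 being whatever pid the kworker
has at that moment:)

  # sample where the kworker burns its cpu time
  perf record -g -t 1234 -- sleep 30
  perf report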

As far as I can see, all the iowait reported by rsync and btrfs receive
is there because they want to hand their writes to the btrfs code in
the kernel, but this kworker/u16:X+flush-btrfs-2 is in their way, so
they are blocked, even before anything gets queued anywhere. Or doesn't
that make sense?

So, the problem is located *before* all the things you mention above
even come into play.

Hans
