linux-bcache.vger.kernel.org archive mirror
* Fix degraded system performance due to workqueue overload
@ 2021-01-27 13:23 Kai Krakow
  2021-01-27 13:23 ` [PATCH 1/2] Revert "bcache: Kill btree_io_wq" Kai Krakow
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Kai Krakow @ 2021-01-27 13:23 UTC
  To: linux-bcache

Over the past months (and, looking back, even years), I have seen system
performance and latency degrade badly whenever bcache is active.

Finally, with kernel 5.10, I was able to locate the problem:

[250336.887598] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 72s!
[250336.887606] Showing busy workqueues and worker pools:
[250336.887607] workqueue events: flags=0x0
[250336.887608]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=3/256 refcnt=4
[250336.887611]     pending: psi_avgs_work, psi_avgs_work, psi_avgs_work
[250336.887619]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=15/256 refcnt=16
[250336.887621]     in-flight: 3760137:psi_avgs_work
[250336.887624]     pending: psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work, psi_avgs_work
[250336.887637]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[250336.887639]     pending: psi_avgs_work
[250336.887643] workqueue events_power_efficient: flags=0x80
[250336.887644]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[250336.887646]     pending: do_cache_clean
[250336.887651] workqueue mm_percpu_wq: flags=0x8
[250336.887651]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256 refcnt=4
[250336.887653]     pending: lru_add_drain_per_cpu BAR(60), vmstat_update
[250336.887666] workqueue bcache: flags=0x8
[250336.887667]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[250336.887668]     pending: cached_dev_nodata
[250336.887681] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=72s workers=2 idle: 3760136
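
A lockup dump like the one above is easier to read once the pending work
items are counted per workqueue and CPU. The following is a minimal Python
sketch that does this. The function name pending_per_cpu and the regular
expressions are assumptions made for illustration, written against the line
layout visible in this excerpt only; the sample is a shortened copy of the
log above.

```python
import re
from collections import Counter

# Patterns below are assumptions derived from the excerpt in this mail,
# not from a specification of the kernel's dump format.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")
WORKQUEUE = re.compile(r"^workqueue (\S+): flags=")
PWQ = re.compile(r"^\s*pwq \d+: cpus=(\S+) ")
PENDING = re.compile(r"^\s*pending: (.*)$")

SAMPLE_LOG = """
[250336.887607] workqueue events: flags=0x0
[250336.887608]   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=3/256 refcnt=4
[250336.887611]     pending: psi_avgs_work, psi_avgs_work, psi_avgs_work
[250336.887651] workqueue mm_percpu_wq: flags=0x8
[250336.887651]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256 refcnt=4
[250336.887653]     pending: lru_add_drain_per_cpu BAR(60), vmstat_update
[250336.887666] workqueue bcache: flags=0x8
[250336.887667]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[250336.887668]     pending: cached_dev_nodata
"""


def pending_per_cpu(log_text):
    """Return {(workqueue, cpus): Counter of pending work function names}."""
    result = {}
    wq = cpus = None
    for raw in log_text.splitlines():
        line = TIMESTAMP.sub("", raw)  # drop the "[seconds.micros] " prefix
        if m := WORKQUEUE.match(line):
            wq, cpus = m.group(1), None
        elif m := PWQ.match(line):
            cpus = m.group(1)
        elif (m := PENDING.match(line)) and wq is not None and cpus is not None:
            # Entries look like "name" or "name BAR(60)"; keep only the name.
            names = [item.split()[0] for item in m.group(1).split(",") if item.strip()]
            result.setdefault((wq, cpus), Counter()).update(names)
    return result


for (wq_name, cpu), counts in pending_per_cpu(SAMPLE_LOG).items():
    print(f"{wq_name} on cpus={cpu}: {dict(counts)}")
```

Grouping by CPU makes it visible that, in the full dump, the pending items
of several workqueues (events, mm_percpu_wq, bcache) all sit on cpus=2, the
pool reported as stuck.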

I was able to track that back to the following commit:
56b30770b27d54d68ad51eccc6d888282b568cee ("bcache: Kill btree_io_wq")

Reverting that commit (with some adjustments due to later code changes)
improved my desktop latency a lot, I mean really a lot. The system was
finally able to handle somewhat higher loads without stalling for
several seconds and without spiking load into the hundreds while doing a
lot of write IO.

So I dug a little deeper and found that the assumption behind this old
commit may no longer hold: bcache simply overwhelms the system_wq with
too many or too long-running workers. The system_wq should really only
be used for work that completes almost instantly, and it should not be
flooded with a lot of workers, which bcache seems to do (look at how
many kthreads it creates from workers):

# ps aux | grep 'kworker/.*bc' | wc -l
131

And this is on a mostly idle system; the count may easily reach 700+.
Also, with my patches in place, that number seems to be lower overall.
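
For tracking that thread count over time (for example before and after
applying the patches), the shell pipeline above can be ported to a small
Python helper. This is only a sketch: the function names are made up for
illustration, and the sample lines are hypothetical, not taken from a real
system.

```python
import re
import subprocess


def count_bcache_kworkers(ps_lines, pattern=r"kworker/.*bc"):
    """Count lines matching the same regex used with grep in the mail."""
    rx = re.compile(pattern)
    return sum(1 for line in ps_lines if rx.search(line))


def live_count():
    """Run `ps aux` and count matching kworker threads (not called here)."""
    out = subprocess.run(["ps", "aux"], capture_output=True, text=True, check=True)
    return count_bcache_kworkers(out.stdout.splitlines())


# Hypothetical `ps aux`-style lines, shortened for illustration only.
sample = [
    "root  101  0.0  0.0  0  0 ?  I  10:00  0:00 [kworker/2:1-bcache]",
    "root  102  0.0  0.0  0  0 ?  I  10:00  0:00 [kworker/0:0-events]",
    "root  103  0.0  0.0  0  0 ?  I  10:00  0:00 [kworker/5:2+bcache]",
]
print(count_bcache_kworkers(sample))
```

One difference from the shell version: `ps aux | grep ...` can list the
grep process itself, because its command line contains the pattern, so the
shell count may be one higher than the number of actual kworker threads.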

So I added another commit (patch 2) that moves more work, the journal
work, over to a dedicated workqueue ("bcache: Move journal work to new
background wq").

I tested this by overloading my desktop system with the following
parallel load:

  * A big download at 1 Gbit/s, resulting in 60+ MB/s write
  * Active IPFS daemon
  * Watching a YouTube video
  * Fully syncing 4 IMAP accounts with MailSpring
  * Running a Gentoo system update (compiling packages)
  * Browsing the web
  * Running a Windows VM (Qemu) with Outlook and defragmentation
  * Starting and closing several applications and clicking in them

IO setup: 4x HDD (2+2+4+4 TB) btrfs RAID-0 with 850 GB SSD bcache
Kernel 5.10.10

Without the patches, the system would have come to a stop, probably not
recovering from it (last time I tried, a clean shutdown took 1+ hour).
With the patches, the system easily survives and feels overall smooth
with only a small perceivable lag.

Boot times are more consistent, too, and faster when bcache is mostly
cold due to a previous system update.

Write rates of the system are smoother now and can easily sustain a
constant load of 200-300 MB/s, while previously I would see long stalls
followed by vastly reduced write performance (down to 5-20 MB/s).

I'm not sure whether my patches have side effects that I am not aware
of, but they work great for me: all write-related desktop stalling is
gone.

-- 
Regards,
Kai




end of thread, other threads:[~2021-01-29 17:03 UTC | newest]

Thread overview: 22+ messages
2021-01-27 13:23 Fix degraded system performance due to workqueue overload Kai Krakow
2021-01-27 13:23 ` [PATCH 1/2] Revert "bcache: Kill btree_io_wq" Kai Krakow
2021-01-27 16:28   ` Kai Krakow
2021-01-27 13:23 ` [PATCH 2/2] bcache: Move journal work to new background wq Kai Krakow
2021-01-27 16:28   ` Kai Krakow
2021-01-27 15:27 ` Fix degraded system performance due to workqueue overload Coly Li
2021-01-27 16:39 ` [PATCH 1/2] Revert "bcache: Kill btree_io_wq" Kai Krakow
2021-01-27 16:39   ` [PATCH 2/2] bcache: Move journal work to new background wq Kai Krakow
2021-01-28 10:09 ` Fix degraded system performance due to workqueue overload Kai Krakow
2021-01-28 10:50 ` [PATCH v2 1/2] Revert "bcache: Kill btree_io_wq" Kai Krakow
2021-01-28 10:50   ` [PATCH v2 2/2] bcache: Move journal work to new background wq Kai Krakow
2021-01-28 16:37     ` Kai Krakow
2021-01-28 16:41       ` Kai Krakow
     [not found]         ` <988ba514-c607-688b-555d-18fbbb069f48@suse.de>
2021-01-29 16:36           ` Kai Krakow
2021-01-28 23:28 ` [PATCH v3 1/3] Revert "bcache: Kill btree_io_wq" Kai Krakow
2021-01-28 23:28   ` [PATCH v3 2/3] bcache: Give btree_io_wq correct semantics again Kai Krakow
2021-01-28 23:28   ` [PATCH v3 3/3] bcache: Move journal work to new background wq Kai Krakow
     [not found]     ` <a52b9107-7e84-0fea-6095-84a9576d7cc4@suse.de>
2021-01-29 16:37       ` Kai Krakow
     [not found]   ` <4fe07714-e5bf-4be3-6023-74b507ee54be@suse.de>
2021-01-29 16:59     ` [PATCH v3 1/3] Revert "bcache: Kill btree_io_wq" Kai Krakow
2021-01-29 16:40 ` [PATCH v4 " Kai Krakow
2021-01-29 16:40   ` [PATCH v4 2/3] bcache: Give btree_io_wq correct semantics again Kai Krakow
2021-01-29 16:40   ` [PATCH v4 3/3] bcache: Move journal work to new flush wq Kai Krakow
