From: Petr Mladek <pmladek@suse.com>
To: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>,
	Michal Koutny <mkoutny@suse.com>,
	linux-kernel@vger.kernel.org, Petr Mladek <pmladek@suse.com>
Subject: [RFC 0/5] workqueue: Debugging improvements
Date: Wed,  1 Feb 2023 14:45:38 +0100
Message-ID: <20230201134543.13687-1-pmladek@suse.com>

The workqueue watchdog provides a lot of information when a stall is
detected. The report says a lot about what workqueues and worker pools
are active and what is being blocked. Unfortunately, it does not provide
much information about what caused the stall.

In particular, it did not help me to get to the root of the following problems:

    + New workers were not created because the system reached PID limit.
      Admins limited it too much in a cloud.

    + A networking driver was not loaded because systemd killed modprobe
      when switching the root from initrd to the booted system.

      Surprisingly, it was quite reproducible. Interrupts are not handled
      immediately in kernel code. The wait in kthread_create_on_node()
      was one of the few locations where they are. So the race window
      evidently was not trivial.


1st patch fixes a misleading "hung" time report.

2nd, 3rd, and 4th patches add warnings into create_worker() and
create_rescuer(). These rather persistent errors are printed only once
until the creation succeeds again. Otherwise the output would be too
noisy and might even break the watchdog.
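
As a rough illustration of the "print once until it succeeds again"
behavior, here is a userspace sketch. It is not the code from the
patches: struct create_state, note_create_result() and the message text
are made up for the example, and the real bookkeeping in
kernel/workqueue.c may look quite different.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-pool state; names are illustrative only. */
struct create_state {
	bool failure_reported;	/* a failure streak has already been logged */
	int warnings;		/* how many warnings were actually printed */
};

/*
 * Record the result of one creation attempt (0 means success).
 * Only the first failure of a streak is printed; a success re-arms
 * the warning so that the next failure is reported again.
 */
static void note_create_result(struct create_state *st, int err)
{
	if (!err) {
		st->failure_reported = false;
		return;
	}

	if (st->failure_reported)
		return;

	st->failure_reported = true;
	st->warnings++;
	fprintf(stderr, "failed to create a worker: error %d\n", err);
}
```

With this shape, a retry loop that keeps failing produces a single line
in the log instead of one line per attempt.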

5th patch adds printing of backtraces of CPU-bound workers that might
block CPU-bound workqueues. The candidate is well defined to keep
the number of backtraces small. It always printed only the right one
during my testing.


The first 4 patches would have helped me to debug the real problems
that I ran into.

The 5th patch is theoretical. I did not see this case in practice.
But it looks realistic enough. And it worked very well when I
simulated the problem. IMHO, it should be pretty useful.


Petr Mladek (5):
  workqueue: Fix hung time report of worker pools
  workqueue: Warn when a new worker could not be created
  workqueue: Interrupted create_worker() is not a repeated event
  workqueue: Warn when a rescuer could not be created
  workqueue: Print backtraces from CPUs with hung CPU bound workqueues

 kernel/workqueue.c | 186 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 177 insertions(+), 9 deletions(-)

-- 
2.35.3



Thread overview: 13+ messages
2023-02-01 13:45 Petr Mladek [this message]
2023-02-01 13:45 ` [RFC 1/5] workqueue: Fix hung time report of worker pools Petr Mladek
2023-02-01 13:45 ` [RFC 2/5] workqueue: Warn when a new worker could not be created Petr Mladek
2023-02-02 23:30   ` Tejun Heo
2023-02-03 14:10     ` Petr Mladek
2023-02-03 19:29       ` Tejun Heo
2023-02-15 18:02   ` Michal Koutný
2023-02-16  9:43     ` Petr Mladek
2023-02-01 13:45 ` [RFC 3/5] workqueue: Interrupted create_worker() is not a repeated event Petr Mladek
2023-02-01 13:45 ` [RFC 4/5] workqueue: Warn when a rescuer could not be created Petr Mladek
2023-02-01 13:45 ` [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues Petr Mladek
2023-02-02 23:45   ` Tejun Heo
2023-02-03 14:26     ` Petr Mladek