linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/5] workqueue: Debugging improvements
@ 2023-02-01 13:45 Petr Mladek
  2023-02-01 13:45 ` [RFC 1/5] workqueue: Fix hung time report of worker pools Petr Mladek
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Petr Mladek @ 2023-02-01 13:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel, Petr Mladek

The workqueue watchdog provides a lot of impormation when a stall is
detected. The report says a lot about what workqueues and worker pools
are active and what is being blocked. Unfortunately, it does not provide
much information about what caused the stall.

In particular, it did not help me to get root of the following problems:

    + New workers were not created because the system reached PID limit.
      Admins limited it too much in a cloud.

    + A networking driver was not loaded because systemd killed modprobe
      when switching the root from initrd to the booted system.

      It was surprisingly quite reproducible. Interrupts are not handled
      immediately in kernel code. The wait in kthread_create_on_node()
      was one of few locations. So the race window evidently was not
      trivial.


1st patch fixes a misleading "hung" time report.

2nd, 3rd, and 4rd patches add warnings into create_worker() and
create_rescuer(). The rather persistent errors are printed only once
until it succeeds again. Otherwise it would be too noisy and might even
break the watchdog.

5th patch adds printing bracktraces of CPU-bound workers that might
block CPU-bound workqueues. The candidate is well defined to keep
the number of backtraces small. It always printed only the right one
during my testing.


The first 4 patches would have helped me to debug the real problems
that I met.

The 5th patch is theoretical. I did not see this case in practice.
But it looks realistic enough. And it worked very well when I
simulated the problem. IMHO, it should be pretty useful.


Petr Mladek (5):
  workqueue: Fix hung time report of worker pools
  workqueue: Warn when a new worker could not be created
  workqueue: Interrupted create_worker() is not a repeated event
  workqueue: Warn when a rescuer could not be created
  workqueue: Print backtraces from CPUs with hung CPU bound workqueues

 kernel/workqueue.c | 186 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 177 insertions(+), 9 deletions(-)

-- 
2.35.3


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC 1/5] workqueue: Fix hung time report of worker pools
  2023-02-01 13:45 [RFC 0/5] workqueue: Debugging improvements Petr Mladek
@ 2023-02-01 13:45 ` Petr Mladek
  2023-02-01 13:45 ` [RFC 2/5] workqueue: Warn when a new worker could not be created Petr Mladek
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Petr Mladek @ 2023-02-01 13:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel, Petr Mladek

The workqueue watchdog prints a warning when there is no progress in
a worker pool. Where the progress means that the pool started processing
a pending work item.

Note that it is perfectly fine to process work items much longer.
The progress should be guaranteed by waking up or creating idle
workers.

show_one_worker_pool() prints state of non-idle worker pool. It shows
a delay since the last pool->watchdog_ts.

The timestamp is updated when a first pending work is queued in
__queue_work(). Also it is updated when a work is dequeued for
processing in worker_thread() and rescuer_thread().

The delay is misleading when there is no pending work item. In this
case it shows how long the last work item is being proceed. Show
zero instead. There is no stall if there is no pending work.

Fixes: 82607adcf9cdf40fb7b ("workqueue: implement lockup detector")
Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/workqueue.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 07895deca271..5419d12e56ae 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4848,10 +4848,16 @@ static void show_one_worker_pool(struct worker_pool *pool)
 	struct worker *worker;
 	bool first = true;
 	unsigned long flags;
+	unsigned long hung = 0;
 
 	raw_spin_lock_irqsave(&pool->lock, flags);
 	if (pool->nr_workers == pool->nr_idle)
 		goto next_pool;
+
+	/* How long the first pending work is waiting for a worker. */
+	if (!list_empty(&pool->worklist))
+		hung = jiffies_to_msecs(jiffies - pool->watchdog_ts) / 1000;
+
 	/*
 	 * Defer printing to avoid deadlocks in console drivers that
 	 * queue work while holding locks also taken in their write
@@ -4860,9 +4866,7 @@ static void show_one_worker_pool(struct worker_pool *pool)
 	printk_deferred_enter();
 	pr_info("pool %d:", pool->id);
 	pr_cont_pool_info(pool);
-	pr_cont(" hung=%us workers=%d",
-		jiffies_to_msecs(jiffies - pool->watchdog_ts) / 1000,
-		pool->nr_workers);
+	pr_cont(" hung=%lus workers=%d", hung, pool->nr_workers);
 	if (pool->manager)
 		pr_cont(" manager: %d",
 			task_pid_nr(pool->manager->task));
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 2/5] workqueue: Warn when a new worker could not be created
  2023-02-01 13:45 [RFC 0/5] workqueue: Debugging improvements Petr Mladek
  2023-02-01 13:45 ` [RFC 1/5] workqueue: Fix hung time report of worker pools Petr Mladek
@ 2023-02-01 13:45 ` Petr Mladek
  2023-02-02 23:30   ` Tejun Heo
  2023-02-15 18:02   ` Michal Koutný
  2023-02-01 13:45 ` [RFC 3/5] workqueue: Interrupted create_worker() is not a repeated event Petr Mladek
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 13+ messages in thread
From: Petr Mladek @ 2023-02-01 13:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel, Petr Mladek

The workqueue watchdog reports a lockup when there was not any progress
in the worker pool for a long time. The progress means that a pending
work item starts being proceed.

The progress is guaranteed by using idle workers or creating new workers
for pending work items.

There are several reasons why a new worker could not be created:

   + there is not enough memory

   + there is no free pool ID (IDR API)

   + the system reached PID limit

   + the process creating the new worker was interrupted

   + the last idle worker (manager) has not been scheduled for a long
     time. It was not able to even start creating the kthread.

None of these failures is reported at the moment. The only clue is that
show_one_worker_pool() prints that there is a manager. It is the last
idle worker that is responsible for creating a new one. But it is not
clear if create_worker() is repeatedly failing and why.

Make the debugging easier by printing warnings in create_worker().

The error code is important, especially from kthread_create_on_node().
It helps to distinguish the various reasons. For example, reaching
memory limit (-ENOMEM), other system limits (-EAGAIN), or process
interrupted (-EINTR).

Print the warnings only once during the stall. printk() might touch
the watchdog and prevent reaching the watchdog thresh. For example,
see touch_nmi_watchdog() in serial8250_console_write(). Anyway,
many of these failures are quite persistent. And it does not make
sense to report the same problem every second, see CREATE_COOLDOWN.

In addition print a message when create_worker() succeeded again.

Also print how many times it failed in the hung report when it
still keeps failing.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/workqueue.c | 82 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 77 insertions(+), 5 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5419d12e56ae..ab109ef7a7c0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1911,7 +1911,68 @@ static void worker_detach_from_pool(struct worker *worker)
 		complete(detach_completion);
 }
 
-/**
+static int create_worker_failed;
+static DEFINE_SPINLOCK(create_worker_failed_lock);
+
+static __printf(2, 3) __cold
+void __print_create_worker_failure(long err, const char *fmt, ...)
+{
+	spin_lock_irq(&create_worker_failed_lock);
+
+	/*
+	 * Report potentially repeated failures only once during a stall.
+	 * Otherwise, it might be noisy. Also slow serial console drivers
+	 * touch watchdogs so that more frequent messages would prevent
+	 * reaching the watchdog thresh.
+	 */
+	if (!create_worker_failed) {
+		va_list args;
+
+		va_start(args, fmt);
+		vprintk(fmt, args);
+		va_end(args);
+	}
+
+	create_worker_failed++;
+
+	spin_unlock_irq(&create_worker_failed_lock);
+}
+
+#define print_create_worker_failure(msg, err)	\
+	do {					\
+		long _err = err;		\
+						\
+		__print_create_worker_failure(_err, KERN_WARNING msg ":%pe\n", (void *)_err); \
+	} while (0)
+
+static void print_create_worker_success(void)
+{
+	spin_lock_irq(&create_worker_failed_lock);
+
+	if (create_worker_failed) {
+		pr_info("workqueue: Successfully created a worker after %d attempts\n",
+			create_worker_failed);
+	}
+	create_worker_failed = 0;
+
+	spin_unlock_irq(&create_worker_failed_lock);
+}
+
+static void print_create_worker_failed_num(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&create_worker_failed_lock, flags);
+
+	if (create_worker_failed) {
+		pr_info("workqueue: %d times failed to create a new worker since the last success\n",
+			create_worker_failed);
+	}
+
+	spin_unlock_irqrestore(&create_worker_failed_lock, flags);
+}
+
+ /**
  * create_worker - create a new workqueue worker
  * @pool: pool the new worker will belong to
  *
@@ -1931,12 +1992,16 @@ static struct worker *create_worker(struct worker_pool *pool)
 
 	/* ID is needed to determine kthread name */
 	id = ida_alloc(&pool->worker_ida, GFP_KERNEL);
-	if (id < 0)
+	if (id < 0) {
+		print_create_worker_failure("workqueue: Failed to allocate a pool ID", id);
 		return NULL;
+	}
 
 	worker = alloc_worker(pool->node);
-	if (!worker)
+	if (!worker) {
+		print_create_worker_failure("workqueue: Failed to allocate a worker", -ENOMEM);
 		goto fail;
+	}
 
 	worker->id = id;
 
@@ -1948,8 +2013,11 @@ static struct worker *create_worker(struct worker_pool *pool)
 
 	worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
 					      "kworker/%s", id_buf);
-	if (IS_ERR(worker->task))
+	if (IS_ERR(worker->task)) {
+		print_create_worker_failure("workqueue: Failed to create a worker thread",
+					    PTR_ERR(worker->task));
 		goto fail;
+	}
 
 	set_user_nice(worker->task, pool->attrs->nice);
 	kthread_bind_mask(worker->task, pool->attrs->cpumask);
@@ -1964,6 +2032,8 @@ static struct worker *create_worker(struct worker_pool *pool)
 	wake_up_process(worker->task);
 	raw_spin_unlock_irq(&pool->lock);
 
+	print_create_worker_success();
+
 	return worker;
 
 fail:
@@ -5880,8 +5950,10 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 
 	rcu_read_unlock();
 
-	if (lockup_detected)
+	if (lockup_detected) {
+		print_create_worker_failed_num();
 		show_all_workqueues();
+	}
 
 	wq_watchdog_reset_touched();
 	mod_timer(&wq_watchdog_timer, jiffies + thresh);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 3/5] workqueue: Interrupted create_worker() is not a repeated event
  2023-02-01 13:45 [RFC 0/5] workqueue: Debugging improvements Petr Mladek
  2023-02-01 13:45 ` [RFC 1/5] workqueue: Fix hung time report of worker pools Petr Mladek
  2023-02-01 13:45 ` [RFC 2/5] workqueue: Warn when a new worker could not be created Petr Mladek
@ 2023-02-01 13:45 ` Petr Mladek
  2023-02-01 13:45 ` [RFC 4/5] workqueue: Warn when a rescuer could not be created Petr Mladek
  2023-02-01 13:45 ` [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues Petr Mladek
  4 siblings, 0 replies; 13+ messages in thread
From: Petr Mladek @ 2023-02-01 13:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel, Petr Mladek

kthread_create_on_node() might get interrupted(). It is rare but realistic.
For example, when an unbound workqueue is allocated in module_init()
callback. It is done in the context of the "modprobe" process. And,
for example, systemd might kill pending processes when switching root
from initrd to the booted system.

Handle the interrupt a special way. It a one-off event in compare with
the other failures. Always print this warning. And do not block the other
warnings.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/workqueue.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ab109ef7a7c0..5f3327e119b4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1925,7 +1925,7 @@ void __print_create_worker_failure(long err, const char *fmt, ...)
 	 * touch watchdogs so that more frequent messages would prevent
 	 * reaching the watchdog thresh.
 	 */
-	if (!create_worker_failed) {
+	if (!create_worker_failed || err == -EINTR) {
 		va_list args;
 
 		va_start(args, fmt);
@@ -1933,7 +1933,8 @@ void __print_create_worker_failure(long err, const char *fmt, ...)
 		va_end(args);
 	}
 
-	create_worker_failed++;
+	if (err != -EINTR)
+		create_worker_failed++;
 
 	spin_unlock_irq(&create_worker_failed_lock);
 }
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 4/5] workqueue: Warn when a rescuer could not be created
  2023-02-01 13:45 [RFC 0/5] workqueue: Debugging improvements Petr Mladek
                   ` (2 preceding siblings ...)
  2023-02-01 13:45 ` [RFC 3/5] workqueue: Interrupted create_worker() is not a repeated event Petr Mladek
@ 2023-02-01 13:45 ` Petr Mladek
  2023-02-01 13:45 ` [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues Petr Mladek
  4 siblings, 0 replies; 13+ messages in thread
From: Petr Mladek @ 2023-02-01 13:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel, Petr Mladek

Rescuers are created when a workqueue with WQ_MEM_RECLAIM is allocated.
It typically happens during the system boot.

systemd switches the root filesystem from initrd to the booted system
during boot. It kills processes that block the switch for too long.
One of the process might be modprobe that tries to create a workqueue.

These problems are hard to reproduce. Also alloc_workqueue() does not
pass the error code. Make the debugging easier by printing a warning,
similar to create_worker().

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/workqueue.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5f3327e119b4..58b835420435 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1914,9 +1914,11 @@ static void worker_detach_from_pool(struct worker *worker)
 static int create_worker_failed;
 static DEFINE_SPINLOCK(create_worker_failed_lock);
 
-static __printf(2, 3) __cold
-void __print_create_worker_failure(long err, const char *fmt, ...)
+static __printf(3, 4) __cold
+void __print_create_worker_failure(bool is_rescuer, long err, const char *fmt, ...)
 {
+	bool one_off_failure;
+
 	spin_lock_irq(&create_worker_failed_lock);
 
 	/*
@@ -1925,7 +1927,8 @@ void __print_create_worker_failure(long err, const char *fmt, ...)
 	 * touch watchdogs so that more frequent messages would prevent
 	 * reaching the watchdog thresh.
 	 */
-	if (!create_worker_failed || err == -EINTR) {
+	one_off_failure = (err == -EINTR || is_rescuer);
+	if (!create_worker_failed || one_off_failure) {
 		va_list args;
 
 		va_start(args, fmt);
@@ -1933,7 +1936,7 @@ void __print_create_worker_failure(long err, const char *fmt, ...)
 		va_end(args);
 	}
 
-	if (err != -EINTR)
+	if (!one_off_failure)
 		create_worker_failed++;
 
 	spin_unlock_irq(&create_worker_failed_lock);
@@ -1943,7 +1946,9 @@ void __print_create_worker_failure(long err, const char *fmt, ...)
 	do {					\
 		long _err = err;		\
 						\
-		__print_create_worker_failure(_err, KERN_WARNING msg ":%pe\n", (void *)_err); \
+		__print_create_worker_failure(false, _err,			\
+					      KERN_WARNING msg ":%pe\n",	\
+					      (void *)_err);			\
 	} while (0)
 
 static void print_create_worker_success(void)
@@ -4323,6 +4328,15 @@ static int wq_clamp_max_active(int max_active, unsigned int flags,
 	return clamp_val(max_active, 1, lim);
 }
 
+#define print_create_rescuer_failure(msg, name, err)			\
+	do {								\
+		long _err = err;					\
+									\
+		__print_create_worker_failure(true, _err,		\
+					      KERN_WARNING msg ": %s :%pe\n",	\
+					      name, (void *)_err);	\
+	} while (0)
+
 /*
  * Workqueues which may be used during memory reclaim should have a rescuer
  * to guarantee forward progress.
@@ -4336,13 +4350,18 @@ static int init_rescuer(struct workqueue_struct *wq)
 		return 0;
 
 	rescuer = alloc_worker(NUMA_NO_NODE);
-	if (!rescuer)
+	if (!rescuer) {
+		print_create_rescuer_failure("workqueue: Failed to allocate a rescuer",
+					     wq->name, -ENOMEM);
 		return -ENOMEM;
+	}
 
 	rescuer->rescue_wq = wq;
 	rescuer->task = kthread_create(rescuer_thread, rescuer, "%s", wq->name);
 	if (IS_ERR(rescuer->task)) {
 		ret = PTR_ERR(rescuer->task);
+		print_create_rescuer_failure("workqueue: Failed to create a rescuer thread",
+					     wq->name, ret);
 		kfree(rescuer);
 		return ret;
 	}
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues
  2023-02-01 13:45 [RFC 0/5] workqueue: Debugging improvements Petr Mladek
                   ` (3 preceding siblings ...)
  2023-02-01 13:45 ` [RFC 4/5] workqueue: Warn when a rescuer could not be created Petr Mladek
@ 2023-02-01 13:45 ` Petr Mladek
  2023-02-02 23:45   ` Tejun Heo
  4 siblings, 1 reply; 13+ messages in thread
From: Petr Mladek @ 2023-02-01 13:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel, Petr Mladek

The workqueue watchdog reports a lockup when there was not any progress
in the worker pool for a long time. The progress means that a pending
work item starts being proceed.

Worker pools for unbound workqueues always wake up an idle worker and
try to process the work immediately. The last idle worker has to create
new worker first. The stall might happen only when a new worker could
not be created in which case some warnings should get printed.
Another problem might be too high load. In this case, workers are
victims of a global system problem.

Worker pools for CPU bound workqueues are designed for lightweight
work items that do not need much CPU time. They are ideally proceed
on a single worker. New worker is used only when a work is sleeping.
It brings one additional problem. The stall might happen when
the CPU-bound workqueue is used for CPU-intensive work.

The problem happens when the CPU-intensive work is not sleeping/waiting
for a long time. Such a worker would be in TASK_RUNNING state.

In this case, it might be useful to see backtrace from the problematic
worker.

wq_watchdog_timer_fn() starts with one line report for each stalled worker
pool. All the extra information are printed later.

The most precise solution would be to make a note in struct worker_pool
when the first check found a stall in a CPU-bound workqueue.

A good enough heuristic should be printing backtraces of workers in
TASK_RUNNING state on CPUs where the stall was detected. After all,
there are just two CPU-bound workers pools for normal and high priority
workers.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/workqueue.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 58b835420435..c6017d178b91 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -49,6 +49,7 @@
 #include <linux/moduleparam.h>
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
+#include <linux/sched/debug.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
 
@@ -5911,6 +5912,68 @@ static struct timer_list wq_watchdog_timer;
 
 static unsigned long wq_watchdog_touched = INITIAL_JIFFIES;
 static DEFINE_PER_CPU(unsigned long, wq_watchdog_touched_cpu) = INITIAL_JIFFIES;
+static DEFINE_PER_CPU(bool, wq_watchdog_hung_cpu);
+
+/*
+ * Show workers that might prevent processing pending work items.
+ * The only candidates are CPU-bound workers in a running state.
+ * Pending work items should be handled by another idle worker in all
+ * other situations.
+ */
+static bool show_pool_suspicious_workers(struct worker_pool *pool, bool shown_title)
+{
+	bool shown_any = false;
+	struct worker *worker;
+	unsigned long flags;
+	int bkt;
+
+	raw_spin_lock_irqsave(&pool->lock, flags);
+
+	if (pool->cpu < 0)
+		goto out;
+
+	if (!per_cpu(wq_watchdog_hung_cpu, pool->cpu))
+		goto out;
+
+	if (list_empty(&pool->worklist))
+		goto out;
+
+	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
+		if (!task_is_running(worker->task))
+			continue;
+
+		if (!shown_title) {
+			pr_info("Showing backtraces of running workers in stuck CPU-bound worker pools:\n");
+			shown_title = true;
+		}
+
+		shown_any = true;
+		pr_info("pool %d:\n", pool->id);
+		sched_show_task(worker->task);
+	}
+out:
+	raw_spin_unlock_irqrestore(&pool->lock, flags);
+	return shown_any;
+}
+
+static void show_suspicious_workers(void)
+{
+	struct worker_pool *pool;
+	bool shown_title = false;
+	int pi;
+
+	rcu_read_lock();
+
+	for_each_pool(pool, pi) {
+		bool shown;
+
+		shown = show_pool_suspicious_workers(pool, shown_title);
+		if (shown)
+			shown_title = shown;
+	}
+
+	rcu_read_unlock();
+}
 
 static void wq_watchdog_reset_touched(void)
 {
@@ -5927,11 +5990,15 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 	bool lockup_detected = false;
 	unsigned long now = jiffies;
 	struct worker_pool *pool;
+	int cpu;
 	int pi;
 
 	if (!thresh)
 		return;
 
+	for_each_online_cpu(cpu)
+		per_cpu(wq_watchdog_hung_cpu, cpu) = false;
+
 	rcu_read_lock();
 
 	for_each_pool(pool, pi) {
@@ -5961,11 +6028,15 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 		/* did we stall? */
 		if (time_after(now, ts + thresh)) {
 			lockup_detected = true;
+			if (pool->cpu >= 0)
+				per_cpu(wq_watchdog_hung_cpu, pool->cpu) = true;
 			pr_emerg("BUG: workqueue lockup - pool");
 			pr_cont_pool_info(pool);
 			pr_cont(" stuck for %us!\n",
 				jiffies_to_msecs(now - pool_ts) / 1000);
 		}
+
+
 	}
 
 	rcu_read_unlock();
@@ -5973,6 +6044,7 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 	if (lockup_detected) {
 		print_create_worker_failed_num();
 		show_all_workqueues();
+		show_suspicious_workers();
 	}
 
 	wq_watchdog_reset_touched();
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC 2/5] workqueue: Warn when a new worker could not be created
  2023-02-01 13:45 ` [RFC 2/5] workqueue: Warn when a new worker could not be created Petr Mladek
@ 2023-02-02 23:30   ` Tejun Heo
  2023-02-03 14:10     ` Petr Mladek
  2023-02-15 18:02   ` Michal Koutný
  1 sibling, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2023-02-02 23:30 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel

Hello,

On Wed, Feb 01, 2023 at 02:45:40PM +0100, Petr Mladek wrote:
> +static __printf(2, 3) __cold
> +void __print_create_worker_failure(long err, const char *fmt, ...)
> +{
> +	spin_lock_irq(&create_worker_failed_lock);
> +
> +	/*
> +	 * Report potentially repeated failures only once during a stall.
> +	 * Otherwise, it might be noisy. Also slow serial console drivers
> +	 * touch watchdogs so that more frequent messages would prevent
> +	 * reaching the watchdog thresh.
> +	 */
> +	if (!create_worker_failed) {
> +		va_list args;
> +
> +		va_start(args, fmt);
> +		vprintk(fmt, args);
> +		va_end(args);
> +	}
> +
> +	create_worker_failed++;
> +
> +	spin_unlock_irq(&create_worker_failed_lock);
> +}

That's pretty elaborate. Why not just use printk_ratelimited()?

> @@ -1931,12 +1992,16 @@ static struct worker *create_worker(struct worker_pool *pool)
>  
>  	/* ID is needed to determine kthread name */
>  	id = ida_alloc(&pool->worker_ida, GFP_KERNEL);
> -	if (id < 0)
> +	if (id < 0) {
> +		print_create_worker_failure("workqueue: Failed to allocate a pool ID", id);
                                                                             ^
									  worker ID?

> @@ -5880,8 +5950,10 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
>  
>  	rcu_read_unlock();
>  
> -	if (lockup_detected)
> +	if (lockup_detected) {
> +		print_create_worker_failed_num();
>  		show_all_workqueues();
> +	}

I'd just use printk_ratelimited() and leave it at that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues
  2023-02-01 13:45 ` [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues Petr Mladek
@ 2023-02-02 23:45   ` Tejun Heo
  2023-02-03 14:26     ` Petr Mladek
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2023-02-02 23:45 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel

Hello,

I generally really like it.

> +static bool show_pool_suspicious_workers(struct worker_pool *pool, bool shown_title)

Maybe the name can be a bit more specific? show_cpu_pool_hog()?

> +{
> +	bool shown_any = false;
> +	struct worker *worker;
> +	unsigned long flags;
> +	int bkt;
> +
> +	raw_spin_lock_irqsave(&pool->lock, flags);
> +
> +	if (pool->cpu < 0)
> +		goto out;

This can be tested before grabbling the lock.

> +	if (!per_cpu(wq_watchdog_hung_cpu, pool->cpu))
> +		goto out;

Given that the state is per-pool, would it make sense to mark this on the
pool instead?

> +	if (list_empty(&pool->worklist))
> +		goto out;

This condition isn't really necessary, right?

> +	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
> +		if (!task_is_running(worker->task))
> +			continue;
> +
> +		if (!shown_title) {
> +			pr_info("Showing backtraces of running workers in stuck CPU-bound worker pools:\n");
> +			shown_title = true;
> +		}
> +
> +		shown_any = true;
> +		pr_info("pool %d:\n", pool->id);
> +		sched_show_task(worker->task);
> +	}
> +out:
> +	raw_spin_unlock_irqrestore(&pool->lock, flags);
> +	return shown_any;
> +}
> +
> +static void show_suspicious_workers(void)
> +{
> +	struct worker_pool *pool;
> +	bool shown_title = false;
> +	int pi;
> +
> +	rcu_read_lock();
> +
> +	for_each_pool(pool, pi) {
> +		bool shown;
> +
> +		shown = show_pool_suspicious_workers(pool, shown_title);
> +		if (shown)
> +			shown_title = shown;

Maybe, move shown to the outer scope and:

	shown |= show_pool_suspicious_workers(pool, show);

> +	}
> +
> +	rcu_read_unlock();
> +}
>  
>  static void wq_watchdog_reset_touched(void)
>  {

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/5] workqueue: Warn when a new worker could not be created
  2023-02-02 23:30   ` Tejun Heo
@ 2023-02-03 14:10     ` Petr Mladek
  2023-02-03 19:29       ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Petr Mladek @ 2023-02-03 14:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel

On Thu 2023-02-02 13:30:39, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 01, 2023 at 02:45:40PM +0100, Petr Mladek wrote:
> > +static __printf(2, 3) __cold
> > +void __print_create_worker_failure(long err, const char *fmt, ...)
> > +{
> > +	spin_lock_irq(&create_worker_failed_lock);
> > +
> > +	/*
> > +	 * Report potentially repeated failures only once during a stall.
> > +	 * Otherwise, it might be noisy. Also slow serial console drivers
> > +	 * touch watchdogs so that more frequent messages would prevent
> > +	 * reaching the watchdog thresh.
> > +	 */
> > +	if (!create_worker_failed) {
> > +		va_list args;
> > +
> > +		va_start(args, fmt);
> > +		vprintk(fmt, args);
> > +		va_end(args);
> > +	}
> > +
> > +	create_worker_failed++;
> > +
> > +	spin_unlock_irq(&create_worker_failed_lock);
> > +}
> 
> That's pretty elaborate.

Yeah, I am not super happy with it either.

> Why not just use printk_ratelimited()?

The default printk_ratelimited() is not usable because it uses:

     #define DEFAULT_RATELIMIT_INTERVAL	(5 * HZ)
     #define DEFAULT_RATELIMIT_BURST		10

It allows 10 messages per 5 seconds. It would be still too noisy.
maybe_create_worker() tries to create a new worker every second.

And more importantly, it would break both softlockup and workqueue
watchdogs. See touch_nmi_watchdog() in serial8250_console_write().
By other words, it would break both softlockup and workqueue watchdogs.


A solution would be to use a custom printk_ratelimited_wq() that would
allow printing one message per 2 * wq_watchdog_thresh.
Something like:

#define printk_ratelimited_wq(fmt, ...)					\
({									\
	static DEFINE_RATELIMIT_STATE(_rs, 60 * HZ, 1);			\
									\
	if (__ratelimit(&_rs))						\
		printk(fmt, ##__VA_ARGS__);				\
})

I admit that it is much easier than __print_create_worker_failure().

The only problem might be that wq_watchdog_thresh might be modified
at runtime. But it can be solved by sharing the same
struct ratelimit_state rs in all printk_ratelimited_wq() calls
and updating it together with wq_watchdog_thresh.

Would you prefer the custom printk_ratelimited_wq(), please?

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues
  2023-02-02 23:45   ` Tejun Heo
@ 2023-02-03 14:26     ` Petr Mladek
  0 siblings, 0 replies; 13+ messages in thread
From: Petr Mladek @ 2023-02-03 14:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel

Hello,

On Thu 2023-02-02 13:45:05, Tejun Heo wrote:
> > +static bool show_pool_suspicious_workers(struct worker_pool *pool, bool shown_title)
> > +{
> > +	bool shown_any = false;
> > +	struct worker *worker;
> > +	unsigned long flags;
> > +	int bkt;
> > +
> > +	raw_spin_lock_irqsave(&pool->lock, flags);
> > +
> > +	if (pool->cpu < 0)
> > +		goto out;
> 
> This can be tested before grabbling the lock.

I see.

> > +	if (!per_cpu(wq_watchdog_hung_cpu, pool->cpu))
> > +		goto out;
> 
> Given that the state is per-pool, would it make sense to mark this on the
> pool instead?

Makes sense. I think that I started with the per-CPU variable before
I sorted my thoughts about what backtraces were useful ;-)

> > +	if (list_empty(&pool->worklist))
> > +		goto out;
> 
> This condition isn't really necessary, right?

IMHO, it should be there. The watchdog reports the problem only when
there  are pending work items, see

		if (list_empty(&pool->worklist))
			continue;

in wq_watchdog_timer_fn().

My understanding is that it is OK to process work items longer
time when they are sleeping and waiting for something.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/5] workqueue: Warn when a new worker could not be created
  2023-02-03 14:10     ` Petr Mladek
@ 2023-02-03 19:29       ` Tejun Heo
  0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2023-02-03 19:29 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Lai Jiangshan, Michal Koutny, linux-kernel

On Fri, Feb 03, 2023 at 03:10:28PM +0100, Petr Mladek wrote:
> A solution would be to use a custom printk_ratelimited_wq() that would
> allow printing one message per 2 * wq_watchdog_thresh.
> Something like:
> 
> #define printk_ratelimited_wq(fmt, ...)					\
> ({									\
> 	static DEFINE_RATELIMIT_STATE(_rs, 60 * HZ, 1);			\
> 									\
> 	if (__ratelimit(&_rs))						\
> 		printk(fmt, ##__VA_ARGS__);				\
> })
> 
> I admit that it is much easier than __print_create_worker_failure().
> 
> The only problem might be that wq_watchdog_thresh might be modified
> at runtime. But it can be solved by sharing the same
> struct ratelimit_state rs in all printk_ratelimited_wq() calls
> and updating it together with wq_watchdog_thresh.
> 
> Would you prefer the custom printk_ratelimited_wq(), please?

Yeah, I think so.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/5] workqueue: Warn when a new worker could not be created
  2023-02-01 13:45 ` [RFC 2/5] workqueue: Warn when a new worker could not be created Petr Mladek
  2023-02-02 23:30   ` Tejun Heo
@ 2023-02-15 18:02   ` Michal Koutný
  2023-02-16  9:43     ` Petr Mladek
  1 sibling, 1 reply; 13+ messages in thread
From: Michal Koutný @ 2023-02-15 18:02 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Tejun Heo, Lai Jiangshan, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1598 bytes --]

Hello.

On Wed, Feb 01, 2023 at 02:45:40PM +0100, Petr Mladek <pmladek@suse.com> wrote:
>    + the system reached PID limit
or threads-max limit.

FTR, I was once considering something like

--->8---
diff --git a/kernel/fork.c b/kernel/fork.c
index 867b46d6fd0a..bba05ecc3765 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1684,8 +1684,10 @@ static __latent_entropy struct task_struct *copy_process(
         * to stop root fork bombs.
         */
        retval = -EAGAIN;
-       if (nr_threads >= max_threads)
+       if (nr_threads >= max_threads) {
+               printk_once(KERN_INFO "clone failed due to threads-max limit\n");
                goto bad_fork_cleanup_count;
+       }

        delayacct_tsk_init(p);  /* Must remain after dup_task_struct() */
        p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
@@ -1816,6 +1818,7 @@ static __latent_entropy struct task_struct *copy_process(
        if (pid != &init_struct_pid) {
                pid = alloc_pid(p->nsproxy->pid_ns_for_children);
                if (IS_ERR(pid)) {
+                       printk_once(KERN_INFO "fork failed to find pid\n");
                        retval = PTR_ERR(pid);
                        goto bad_fork_cleanup_thread;
                }
--->8---

Effects of the global limits on anything but kthreads should be much less
important and easier to troubleshoot anyway.
Covering kworkers with your changes should be useful and substitute my idea
above. Take that as my support for this patch (from my perspective reporting
*_once would be enough to guide a troubleshooter).

Thanks,
Michal



[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC 2/5] workqueue: Warn when a new worker could not be created
  2023-02-15 18:02   ` Michal Koutný
@ 2023-02-16  9:43     ` Petr Mladek
  0 siblings, 0 replies; 13+ messages in thread
From: Petr Mladek @ 2023-02-16  9:43 UTC (permalink / raw)
  To: Michal Koutný; +Cc: Tejun Heo, Lai Jiangshan, linux-kernel

On Wed 2023-02-15 19:02:04, Michal Koutný wrote:
> Hello.
> 
> On Wed, Feb 01, 2023 at 02:45:40PM +0100, Petr Mladek <pmladek@suse.com> wrote:
> >    + the system reached PID limit
> or threads-max limit.
> 
> FTR, I was once considering something like
> 
> --->8---
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 867b46d6fd0a..bba05ecc3765 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1684,8 +1684,10 @@ static __latent_entropy struct task_struct *copy_process(
>          * to stop root fork bombs.
>          */
>         retval = -EAGAIN;
> -       if (nr_threads >= max_threads)
> +       if (nr_threads >= max_threads) {
> +               printk_once(KERN_INFO "clone failed due to threads-max limit\n");
>                 goto bad_fork_cleanup_count;
> +       }
> 
>         delayacct_tsk_init(p);  /* Must remain after dup_task_struct() */
>         p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
> @@ -1816,6 +1818,7 @@ static __latent_entropy struct task_struct *copy_process(
>         if (pid != &init_struct_pid) {
>                 pid = alloc_pid(p->nsproxy->pid_ns_for_children);
>                 if (IS_ERR(pid)) {
> +                       printk_once(KERN_INFO "fork failed to find pid\n");
>                         retval = PTR_ERR(pid);
>                         goto bad_fork_cleanup_thread;
>                 }
> --->8---
> 
> Effects of the global limits on anything but kthreads should be much less
> important and easier to troubleshoot anyway.
> Covering kworkers with your changes should be useful and substitute my idea
> above.

I was not brave enough to touch kernel/fork.c.

> Take that as my support for this patch (from my perspective reporting
> *_once would be enough to guide a troubleshooter).

My experience is that users often send logs with many stall reports and
the earlier messages are missing/lost. This is why I wanted to repeat
them.

Anyway, thanks for the support. I am going to send v2 when time permits.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-02-16  9:43 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-01 13:45 [RFC 0/5] workqueue: Debugging improvements Petr Mladek
2023-02-01 13:45 ` [RFC 1/5] workqueue: Fix hung time report of worker pools Petr Mladek
2023-02-01 13:45 ` [RFC 2/5] workqueue: Warn when a new worker could not be created Petr Mladek
2023-02-02 23:30   ` Tejun Heo
2023-02-03 14:10     ` Petr Mladek
2023-02-03 19:29       ` Tejun Heo
2023-02-15 18:02   ` Michal Koutný
2023-02-16  9:43     ` Petr Mladek
2023-02-01 13:45 ` [RFC 3/5] workqueue: Interrupted create_worker() is not a repeated event Petr Mladek
2023-02-01 13:45 ` [RFC 4/5] workqueue: Warn when a rescuer could not be created Petr Mladek
2023-02-01 13:45 ` [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues Petr Mladek
2023-02-02 23:45   ` Tejun Heo
2023-02-03 14:26     ` Petr Mladek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).