From: Tejun Heo <tj@kernel.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Sandeep Dhavale <dhavale@google.com>,
jiangshanlai@gmail.com, torvalds@linux-foundation.org,
peterz@infradead.org, linux-kernel@vger.kernel.org,
kernel-team@meta.com, joshdon@google.com, brho@google.com,
briannorris@chromium.org, nhuck@google.com, agk@redhat.com,
snitzer@kernel.org, void@manifault.com, kernel-team@android.com
Subject: Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
Date: Thu, 8 Jun 2023 12:50:25 -1000 [thread overview]
Message-ID: <ZIJbMQOu_k07jkFf@slm.duckdns.org> (raw)
In-Reply-To: <fb3461cd-3fc2-189a-a86b-c638816a2440@amd.com>
Hello,
On Thu, Jun 08, 2023 at 08:31:34AM +0530, K Prateek Nayak wrote:
...
> Thank you for sharing the debug branch. I've managed to hit some one of
> the WARN_ON_ONCE() consistently but I still haven't seen a kernel panic
> yet. Sharing the traces below:
Yeah, that's good. It does a dirty fix-up. Shouldn't crash.
> o Early Boot
>
> [ 4.182411] ------------[ cut here ]------------
> [ 4.186313] WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:1130 kick_pool+0xdb/0xe0
> [ 4.186313] Modules linked in:
> [ 4.186313] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-rc1-tj-wq-valid-cpu+ #481
> [ 4.186313] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
> [ 4.186313] RIP: 0010:kick_pool+0xdb/0xe0
> [ 4.186313] Code: 6b c0 d0 01 73 24 41 89 45 64 49 8b 54 24 f8 48 89 d0 30 c0 83 e2 04 ba 00 00 00 00 48 0f 44 c2 48 83 80 c0 00 00 00 01 eb 82 <0f> 0b eb dc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
> [ 4.186313] RSP: 0018:ffffbc1b800e7dd8 EFLAGS: 00010046
> [ 4.186313] RAX: 0000000000000100 RBX: ffff97c73d2321c0 RCX: 0000000000000000
> [ 4.186313] RDX: 0000000000000040 RSI: 0000000000000001 RDI: ffff9788c0159728
> [ 4.186313] RBP: ffffbc1b800e7df0 R08: 0000000000000100 R09: ffff9788c01593e0
> [ 4.186313] R10: ffff9788c01593c0 R11: 0000000000000001 R12: ffffffff8c582430
> [ 4.186313] R13: ffff9788c03fcd40 R14: 0000000000000000 R15: ffff97c73d2324b0
> [ 4.186313] FS: 0000000000000000(0000) GS:ffff97c73d200000(0000) knlGS:0000000000000000
> [ 4.186313] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4.186313] CR2: ffff97cecee01000 CR3: 000000470d43a001 CR4: 0000000000770ef0
> [ 4.186313] PKRU: 55555554
> [ 4.186313] Call Trace:
> [ 4.186313] <TASK>
> [ 4.186313] create_worker+0x14e/0x280
> [ 4.186313] ? wake_up_process+0x15/0x20
> [ 4.186313] workqueue_init+0x22a/0x3d0
> [ 4.186313] kernel_init_freeable+0x1fe/0x4f0
> [ 4.186313] ? __pfx_kernel_init+0x10/0x10
> [ 4.186313] kernel_init+0x1b/0x1f0
> [ 4.186313] ? __pfx_kernel_init+0x10/0x10
> [ 4.186313] ret_from_fork+0x2c/0x50
> [ 4.186313] </TASK>
> [ 4.186313] ---[ end trace 0000000000000000 ]---
>
> o I consistently see a WARN_ON_ONCE() in kick_pool() being hit when I
> run "sudo ./stress-ng --iomix 96 --timeout 1m". I've seen few
> different stack traces so far. Including all below just in case:
...
> This is the same WARN_ON_ONCE() you had added in the HEAD commit:
>
> $ scripts/faddr2line vmlinux kick_pool+0xdb
> kick_pool+0xdb/0xe0:
> kick_pool at kernel/workqueue.c:1130 (discriminator 1)
>
> $ sed -n 1130,1132p kernel/workqueue.c
> if (!WARN_ON_ONCE(wake_cpu >= nr_cpu_ids))
> p->wake_cpu = wake_cpu;
> get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
>
> Let me know if you need any more data from my test setup.
> P.S. The kernel is still up and running (~30min) despite hitting this
> WARN_ON_ONCE() in my case :)
Okay, that was me being stupid and not initializing the new fields for
per-cpu workqueues. Can you please test the following branch? It should have
both bugs fixed properly.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2
If that doesn't crash, I'd love to hear how it affects the perf regressions
reported over that past few months.
Thanks.
--
tejun
next prev parent reply other threads:[~2023-06-08 22:50 UTC|newest]
Thread overview: 73+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-19 0:16 [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Tejun Heo
2023-05-19 0:16 ` [PATCH 01/24] workqueue: Drop the special locking rule for worker->flags and worker_pool->flags Tejun Heo
2023-05-19 0:16 ` [PATCH 02/24] workqueue: Cleanups around process_scheduled_works() Tejun Heo
2023-05-19 0:16 ` [PATCH 03/24] workqueue: Not all work insertion needs to wake up a worker Tejun Heo
2023-05-23 9:54 ` Lai Jiangshan
2023-05-23 21:37 ` Tejun Heo
2023-08-08 1:15 ` [PATCH v2 " Tejun Heo
2023-05-19 0:16 ` [PATCH 04/24] workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq Tejun Heo
2023-05-19 0:16 ` [PATCH 05/24] workqueue: Relocate worker and work management functions Tejun Heo
2023-05-19 0:16 ` [PATCH 06/24] workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa Tejun Heo
2023-05-19 0:16 ` [PATCH 07/24] workqueue: Use a kthread_worker to release pool_workqueues Tejun Heo
2023-05-19 0:16 ` [PATCH 08/24] workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones Tejun Heo
2023-05-19 0:16 ` [PATCH 09/24] workqueue: Make unbound workqueues to use per-cpu pool_workqueues Tejun Heo
2023-05-22 6:41 ` Leon Romanovsky
2023-05-22 12:27 ` Dennis Dalessandro
2023-05-19 0:16 ` [PATCH 10/24] workqueue: Rename workqueue_attrs->no_numa to ->ordered Tejun Heo
2023-05-19 0:16 ` [PATCH 11/24] workqueue: Rename NUMA related names to use pod instead Tejun Heo
2023-05-19 0:16 ` [PATCH 12/24] workqueue: Move wq_pod_init() below workqueue_init() Tejun Heo
2023-05-19 0:16 ` [PATCH 13/24] workqueue: Initialize unbound CPU pods later in the boot Tejun Heo
2023-05-19 0:16 ` [PATCH 14/24] workqueue: Generalize unbound CPU pods Tejun Heo
2023-05-30 8:06 ` K Prateek Nayak
2023-06-07 1:50 ` Tejun Heo
2023-05-30 21:18 ` Sandeep Dhavale
2023-05-31 12:14 ` K Prateek Nayak
2023-06-07 22:13 ` Tejun Heo
2023-06-08 3:01 ` K Prateek Nayak
2023-06-08 22:50 ` Tejun Heo [this message]
2023-06-09 3:43 ` K Prateek Nayak
2023-06-14 18:49 ` Sandeep Dhavale
2023-06-21 20:14 ` Tejun Heo
2023-06-19 4:30 ` Swapnil Sapkal
2023-06-21 20:38 ` Tejun Heo
2023-07-05 7:04 ` K Prateek Nayak
2023-07-05 18:39 ` Tejun Heo
2023-07-11 3:02 ` K Prateek Nayak
2023-07-31 23:52 ` Tejun Heo
2023-08-08 1:08 ` Tejun Heo
2023-05-19 0:17 ` [PATCH 15/24] workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration Tejun Heo
2023-05-19 0:17 ` [PATCH 16/24] workqueue: Modularize wq_pod_type initialization Tejun Heo
2023-05-19 0:17 ` [PATCH 17/24] workqueue: Add multiple affinity scopes and interface to select them Tejun Heo
2023-05-19 0:17 ` [PATCH 18/24] workqueue: Factor out work to worker assignment and collision handling Tejun Heo
2023-05-19 0:17 ` [PATCH 19/24] workqueue: Factor out need_more_worker() check and worker wake-up Tejun Heo
2023-05-19 0:17 ` [PATCH 20/24] workqueue: Add workqueue_attrs->__pod_cpumask Tejun Heo
2023-05-19 0:17 ` [PATCH 21/24] workqueue: Implement non-strict affinity scope for unbound workqueues Tejun Heo
2023-05-19 0:17 ` [PATCH 22/24] workqueue: Add "Affinity Scopes and Performance" section to documentation Tejun Heo
2023-05-19 0:17 ` [PATCH 23/24] workqueue: Add pool_workqueue->cpu Tejun Heo
2023-05-19 0:17 ` [PATCH 24/24] workqueue: Implement localize-to-issuing-CPU for unbound workqueues Tejun Heo
2023-05-19 0:41 ` [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality Linus Torvalds
2023-05-19 22:35 ` Tejun Heo
2023-05-19 23:03 ` Tejun Heo
2023-05-23 1:51 ` Linus Torvalds
2023-05-23 17:59 ` Linus Torvalds
2023-05-23 20:08 ` Rik van Riel
2023-05-23 21:36 ` Sandeep Dhavale
2023-05-23 11:18 ` Peter Zijlstra
2023-05-23 16:12 ` Vincent Guittot
2023-05-24 7:34 ` Peter Zijlstra
2023-05-24 13:15 ` Vincent Guittot
2023-06-05 4:46 ` Gautham R. Shenoy
2023-06-07 14:42 ` Libo Chen
2023-05-26 1:12 ` Tejun Heo
2023-05-30 11:32 ` Peter Zijlstra
2023-06-12 23:56 ` Brian Norris
2023-06-13 2:48 ` Tejun Heo
2023-06-13 9:26 ` Pin-yen Lin
2023-06-21 19:16 ` Tejun Heo
2023-06-21 19:31 ` Linus Torvalds
2023-06-29 9:49 ` Pin-yen Lin
2023-07-03 21:47 ` Tejun Heo
2023-08-08 1:22 ` Tejun Heo
2023-08-08 2:58 ` K Prateek Nayak
2023-08-08 7:59 ` Tejun Heo
2023-08-18 4:05 ` K Prateek Nayak
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZIJbMQOu_k07jkFf@slm.duckdns.org \
--to=tj@kernel.org \
--cc=agk@redhat.com \
--cc=brho@google.com \
--cc=briannorris@chromium.org \
--cc=dhavale@google.com \
--cc=jiangshanlai@gmail.com \
--cc=joshdon@google.com \
--cc=kernel-team@android.com \
--cc=kernel-team@meta.com \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=nhuck@google.com \
--cc=peterz@infradead.org \
--cc=snitzer@kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).