* [PATCH] workqueue: missing NOT while checking if Workqueue is offline @ 2022-05-28 20:07 Geraldo Nascimento 2022-05-29 1:25 ` Geraldo Nascimento 0 siblings, 1 reply; 9+ messages in thread From: Geraldo Nascimento @ 2022-05-28 20:07 UTC (permalink / raw) To: Tejun Heo; +Cc: Lai Jiangshan, LKML Greetings, This is a one-character patch but very important as the kernel workqueue __cancel_work_timer will cancel active work without the NOT operator added. During early boot wq_online is false so with the NOT added it will evaluate to true. Conversely, after boot is done, workqueue is now true and we want it to evaluate to false because otherwise it will cancel important work. Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com> --- workqueue.c 2022-05-28 16:54:12.024176123 -0300 +++ workqueue.c 2022-05-28 16:54:37.698176135 -0300 @@ -3158,7 +3158,7 @@ static bool __cancel_work_timer(struct w * This allows canceling during early boot. We know that @work * isn't executing. */ - if (wq_online) + if (!wq_online) __flush_work(work, true); clear_work_data(work); ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline 2022-05-28 20:07 [PATCH] workqueue: missing NOT while checking if Workqueue is offline Geraldo Nascimento @ 2022-05-29 1:25 ` Geraldo Nascimento 2022-05-29 4:29 ` Geraldo Nascimento 0 siblings, 1 reply; 9+ messages in thread From: Geraldo Nascimento @ 2022-05-29 1:25 UTC (permalink / raw) To: Tejun Heo; +Cc: Lai Jiangshan, LKML On Sat, May 28, 2022 at 05:07:08PM -0300, Geraldo Nascimento wrote: > Greetings, > Hi, again, > This is a one-character patch but very important as the kernel workqueue > __cancel_work_timer will cancel active work without the NOT operator > added. > > During early boot wq_online is false so with the NOT added it will evaluate > to true. Conversely, after boot is done, workqueue I meant wq_online. After boot, wq_online will evaluate to true, current code might as well have an if (true) there. I hurried up the patch because if I'm right this is a major show stopper to drivers that make use of cancel_work_timer(). I hit it through amdgpu in conjuction with amdkfd. > is now true and we want > it to evaluate to false because otherwise it will cancel important work. > > Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com> > > --- workqueue.c 2022-05-28 16:54:12.024176123 -0300 > +++ workqueue.c 2022-05-28 16:54:37.698176135 -0300 > @@ -3158,7 +3158,7 @@ static bool __cancel_work_timer(struct w > * This allows canceling during early boot. We know that @work > * isn't executing. > */ > - if (wq_online) > + if (!wq_online) > __flush_work(work, true); > > clear_work_data(work); ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline 2022-05-29 1:25 ` Geraldo Nascimento @ 2022-05-29 4:29 ` Geraldo Nascimento 2022-05-29 5:24 ` Tejun Heo 0 siblings, 1 reply; 9+ messages in thread From: Geraldo Nascimento @ 2022-05-29 4:29 UTC (permalink / raw) To: Tejun Heo; +Cc: Lai Jiangshan, LKML On Sat, May 28, 2022 at 10:25:55PM -0300, Geraldo Nascimento wrote: > On Sat, May 28, 2022 at 05:07:08PM -0300, Geraldo Nascimento wrote: > > Greetings, > > > > Hi, again, > And again, Hi! Doing my due dilligence, it seems. > > This is a one-character patch but very important as the kernel workqueue > > __cancel_work_timer will cancel active work It won't cancel important work, seems it's just a WARN_ON but it's very annoying. My understanding was the NOT was needed to call __flush_work() before kthreads are spawned. During early boot, as the comment says, before workqueue_init() fires. There's a bug report at https://gitlab.freedesktop.org/drm/amd/-/issues/1898 and Felix Kuehling is right that this bug is only triggered when you try to use amdkfd ( and the Kconfigs that implies) without HSA_AMD_SVM configured. It makes sense to me that NOT operator is missing however, since in the warning I was coming from _cancel_work_timer() to __flush_work(), something that should not be done? I would like very much to hear the opinion of the maintainers! Thanks, Geraldo Nascimento >> without the NOT operator > > added. > > > > During early boot wq_online is false so with the NOT added it will evaluate > > to true. Conversely, after boot is done, workqueue > > I meant wq_online. After boot, wq_online will evaluate to true, current > code might as well have an if (true) there. I hurried up the patch > because if I'm right this is a major show stopper to drivers that make > use of cancel_work_timer(). I hit it through amdgpu in conjuction with amdkfd. > > > is now true and we want > > it to evaluate to false because otherwise it will cancel important work. > > > > Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com> > > > > --- workqueue.c 2022-05-28 16:54:12.024176123 -0300 > > +++ workqueue.c 2022-05-28 16:54:37.698176135 -0300 > > @@ -3158,7 +3158,7 @@ static bool __cancel_work_timer(struct w > > * This allows canceling during early boot. We know that @work > > * isn't executing. > > */ > > - if (wq_online) > > + if (!wq_online) > > __flush_work(work, true); > > > > clear_work_data(work); ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline 2022-05-29 4:29 ` Geraldo Nascimento @ 2022-05-29 5:24 ` Tejun Heo 2022-05-29 5:53 ` Geraldo Nascimento 0 siblings, 1 reply; 9+ messages in thread From: Tejun Heo @ 2022-05-29 5:24 UTC (permalink / raw) To: Geraldo Nascimento; +Cc: Lai Jiangshan, LKML On Sun, May 29, 2022 at 01:29:32AM -0300, Geraldo Nascimento wrote: > I would like very much to hear the opinion of the maintainers! I have a hard time understanding what you're trying to do. Can you please slow down and start from describing the problem itself? Thanks. -- tejun ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline 2022-05-29 5:24 ` Tejun Heo @ 2022-05-29 5:53 ` Geraldo Nascimento 2022-05-29 6:14 ` Tejun Heo 0 siblings, 1 reply; 9+ messages in thread From: Geraldo Nascimento @ 2022-05-29 5:53 UTC (permalink / raw) To: Tejun Heo; +Cc: Lai Jiangshan, LKML On Sat, May 28, 2022 at 07:24:41PM -1000, Tejun Heo wrote: > On Sun, May 29, 2022 at 01:29:32AM -0300, Geraldo Nascimento wrote: > > I would like very much to hear the opinion of the maintainers! > > I have a hard time understanding what you're trying to do. Can you please > slow down and start from describing the problem itself? Hi Tejun, Sorry for the hurry. The problem is best described in https://gitlab.freedesktop.org/drm/amd/-/issues/1898 From my understanding from the context of __cancel_work_timer() we should not ever call __flush_work() but I may be wrong. In the present case as described in AMD's GitLab __cancel_work_timer() is being called by cancel_delayed_work_sync() inside kfd_process_notifier_release() from drivers/gpu/drm/amd/amdkfd/kfd_process.c:1157 (Linux 5.18). We should only call __flush_work() from __cancel_work_timer() if workqueue_init() is not yet initialized, that's possible during early boot though not very likely. Anyway that's before kthreads are spwaned, so we are sure that particular work isn't executing, hence why it's safe to call __flush_work() in this particular case. The comment on kernel/workqueue.c:3157 (for Linux 5.18) says it best: /* * This allows canceling during early boot. We know that @work * isn't executing. */ if (wq_online) __flush_work(work, true); If __flush_work() is ever called during early boot it will result in a WARN_ON because workqueue is not online. I have no idea if that's OK though it hasn't harmed my machine. Of course I don't want to introduce bugs, I wanna solve them, and I appreciate your cautious approach. Thank you for the work. What is not OK apparently is trying to use amdkfd without HSA_AMD_SVM configured! :) Thank you, Geraldo Nascimento > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline 2022-05-29 5:53 ` Geraldo Nascimento @ 2022-05-29 6:14 ` Tejun Heo 2022-05-29 6:41 ` Geraldo Nascimento 2022-05-31 4:13 ` Geraldo Nascimento 0 siblings, 2 replies; 9+ messages in thread From: Tejun Heo @ 2022-05-29 6:14 UTC (permalink / raw) To: Geraldo Nascimento; +Cc: Lai Jiangshan, LKML On Sun, May 29, 2022 at 02:53:39AM -0300, Geraldo Nascimento wrote: > On Sat, May 28, 2022 at 07:24:41PM -1000, Tejun Heo wrote: > > On Sun, May 29, 2022 at 01:29:32AM -0300, Geraldo Nascimento wrote: > > > I would like very much to hear the opinion of the maintainers! > > > > I have a hard time understanding what you're trying to do. Can you please > > slow down and start from describing the problem itself? > > Hi Tejun, > > Sorry for the hurry. > > The problem is best described in https://gitlab.freedesktop.org/drm/amd/-/issues/1898 > > From my understanding from the context of __cancel_work_timer() we should not > ever call __flush_work() but I may be wrong. In the present case as Yeah, you're wrong. > described in AMD's GitLab __cancel_work_timer() is being called by > cancel_delayed_work_sync() inside kfd_process_notifier_release() > from drivers/gpu/drm/amd/amdkfd/kfd_process.c:1157 (Linux 5.18). Have you confirmed that that actually is the warning which is triggering? I don't see how that condition would trigger that late during the boot and the warning line being reported doesn't match v5.16 source code, so I'm not sure but skimming the instructon sequence, that's the second UD2 sequence, so I'm gonna guess that's the second WARN_ON - the !work->func one and someone else on the gitlab bug report seems to agree too. It's usually a lot more helpful if the bug report is complete - include the full warning message with some context at least, make sure that the kernel you're using is an upstream one or something close enough. If not, point to the source tree. Also, try to clearly distinguish what you know and what you suspect. Both can help but mixing them up together tends to cause confusion for everyone involved. It just looks like the code is trying to cancel a work item which hasn't been initialized and what it prolly needs is an ifdef around that cancel call depending on the config option. Thanks. -- tejun ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline 2022-05-29 6:14 ` Tejun Heo @ 2022-05-29 6:41 ` Geraldo Nascimento 2022-05-31 4:13 ` Geraldo Nascimento 1 sibling, 0 replies; 9+ messages in thread From: Geraldo Nascimento @ 2022-05-29 6:41 UTC (permalink / raw) To: Tejun Heo; +Cc: Lai Jiangshan, LKML On Sat, May 28, 2022 at 08:14:23PM -1000, Tejun Heo wrote: > On Sun, May 29, 2022 at 02:53:39AM -0300, Geraldo Nascimento wrote: > > On Sat, May 28, 2022 at 07:24:41PM -1000, Tejun Heo wrote: > > > On Sun, May 29, 2022 at 01:29:32AM -0300, Geraldo Nascimento wrote: > > > > I would like very much to hear the opinion of the maintainers! > > > > > > I have a hard time understanding what you're trying to do. Can you please > > > slow down and start from describing the problem itself? > > > > Hi Tejun, > > > > Sorry for the hurry. > > > > The problem is best described in https://gitlab.freedesktop.org/drm/amd/-/issues/1898 > > > > From my understanding from the context of __cancel_work_timer() we should not > > ever call __flush_work() but I may be wrong. In the present case as > > Yeah, you're wrong. > No problem from my side, sorry for wasting your time. > > described in AMD's GitLab __cancel_work_timer() is being called by > > cancel_delayed_work_sync() inside kfd_process_notifier_release() > > from drivers/gpu/drm/amd/amdkfd/kfd_process.c:1157 (Linux 5.18). > > Have you confirmed that that actually is the warning which is triggering? I > don't see how that condition would trigger that late during the boot and the > warning line being reported doesn't match v5.16 source code, so I'm not sure > but skimming the instructon sequence, that's the second UD2 sequence, so I'm > gonna guess that's the second WARN_ON - the !work->func one and someone else > on the gitlab bug report seems to agree too. While I can confirm from my dmesg traces it's the second WARN_ON (the one on (!work->func)) that triggers, remember it's being called from __flush_work() due to the lack of NOT operator on wq_online, inside __cancel_work_timer(). To be honest, for me, it only makes sense to call __flush_work() from the context of __cancel_work_timer() if we are sure the work isn't executing. One of the few ways to be sure is when kthreads haven't initialized yet, that means workqueue_init() hasn't fired yet and wq_online is still false, so it makes sense to negate that false and exceptionally call __flush_work() without waiting for completion of the work - i.e., __flush_work() will WARN_ON workqueue offline condition and return false immediately because task was already idle. I know I may be repeating myself, but I'm trying to make my point, from the little understanding I have of the kernel. And I know that you know best, and that the possibility of a bug like that lying undiscovered on a code-base as scrutinized as the Linux kernel is, is very small. > > It's usually a lot more helpful if the bug report is complete - include the > full warning message with some context at least, make sure that the kernel > you're using is an upstream one or something close enough. If not, point to > the source tree. Also, try to clearly distinguish what you know and what you > suspect. Both can help but mixing them up together tends to cause confusion > for everyone involved. Sorry. Will try to do better. > > It just looks like the code is trying to cancel a work item which hasn't > been initialized and what it prolly needs is an ifdef around that cancel > call depending on the config option. Thanks for looking into it, Geraldo Nascimento > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline 2022-05-29 6:14 ` Tejun Heo 2022-05-29 6:41 ` Geraldo Nascimento @ 2022-05-31 4:13 ` Geraldo Nascimento 2022-05-31 17:43 ` Tejun Heo 1 sibling, 1 reply; 9+ messages in thread From: Geraldo Nascimento @ 2022-05-31 4:13 UTC (permalink / raw) To: Tejun Heo; +Cc: Lai Jiangshan, LKML On Sat, May 28, 2022 at 08:14:23PM -1000, Tejun Heo wrote: > Yeah, you're wrong. Hi Tejun, you were completely right about me being wrong. It's a real shame to me and I am only asking for an apology in case there were any doubts. With my "patch" I was unable to cancel service on the GPU ("sending SIGINT to MPV because I like to Ctrl+C") at the same time as running ROCm inference on the AMD Renoir APU with amdkfd of amdgpu. My machine froze beyond expectation of recovery, I would like to see what e.g. netconsole would report but it's irrelevant, it would case a major regression in the kernel tree if this was ever introduced. Sorry for the misunderstanding, I guess it's the burden of maintaning, too. My deepest apologies. I'll follow your advice and try to be more careful and detailed without mixing what I know and what I don't next time. Many thanks, Geraldo Nascimento ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline 2022-05-31 4:13 ` Geraldo Nascimento @ 2022-05-31 17:43 ` Tejun Heo 0 siblings, 0 replies; 9+ messages in thread From: Tejun Heo @ 2022-05-31 17:43 UTC (permalink / raw) To: Geraldo Nascimento; +Cc: Lai Jiangshan, LKML Hello, On Tue, May 31, 2022 at 01:13:03AM -0300, Geraldo Nascimento wrote: > you were completely right about me being wrong. It's a real shame to me > and I am only asking for an apology in case there were any doubts. I don't think there's any need for shame or aplogies. You weren't familiar with the code base and it just takes some effort to pick up skills for good bug reporting. I'm sure you'll pick them up in no time. Thanks. -- tejun ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2022-05-31 17:43 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-05-28 20:07 [PATCH] workqueue: missing NOT while checking if Workqueue is offline Geraldo Nascimento 2022-05-29 1:25 ` Geraldo Nascimento 2022-05-29 4:29 ` Geraldo Nascimento 2022-05-29 5:24 ` Tejun Heo 2022-05-29 5:53 ` Geraldo Nascimento 2022-05-29 6:14 ` Tejun Heo 2022-05-29 6:41 ` Geraldo Nascimento 2022-05-31 4:13 ` Geraldo Nascimento 2022-05-31 17:43 ` Tejun Heo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).