linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] workqueue: missing NOT while checking if Workqueue is offline
@ 2022-05-28 20:07 Geraldo Nascimento
  2022-05-29  1:25 ` Geraldo Nascimento
  0 siblings, 1 reply; 9+ messages in thread
From: Geraldo Nascimento @ 2022-05-28 20:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, LKML

Greetings,

This is a one-character patch but very important as the kernel workqueue
__cancel_work_timer will cancel active work without the NOT operator
added.

During early boot wq_online is false so with the NOT added it will evaluate
to true. Conversely, after boot is done, workqueue is now true and we want
it to evaluate to false because otherwise it will cancel important work.

Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com>

--- workqueue.c	2022-05-28 16:54:12.024176123 -0300
+++ workqueue.c	2022-05-28 16:54:37.698176135 -0300
@@ -3158,7 +3158,7 @@ static bool __cancel_work_timer(struct w
 	 * This allows canceling during early boot.  We know that @work
 	 * isn't executing.
 	 */
-	if (wq_online)
+	if (!wq_online)
 		__flush_work(work, true);
 
 	clear_work_data(work);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline
  2022-05-28 20:07 [PATCH] workqueue: missing NOT while checking if Workqueue is offline Geraldo Nascimento
@ 2022-05-29  1:25 ` Geraldo Nascimento
  2022-05-29  4:29   ` Geraldo Nascimento
  0 siblings, 1 reply; 9+ messages in thread
From: Geraldo Nascimento @ 2022-05-29  1:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, LKML

On Sat, May 28, 2022 at 05:07:08PM -0300, Geraldo Nascimento wrote:
> Greetings,
> 

Hi, again,

> This is a one-character patch but very important as the kernel workqueue
> __cancel_work_timer will cancel active work without the NOT operator
> added.
> 
> During early boot wq_online is false so with the NOT added it will evaluate
> to true. Conversely, after boot is done, workqueue

I meant wq_online. After boot, wq_online will evaluate to true, current
code might as well have an if (true) there. I hurried up the patch
because if I'm right this is a major show stopper to drivers that make
use of cancel_work_timer(). I hit it through amdgpu in conjuction with amdkfd.

> is now true and we want
> it to evaluate to false because otherwise it will cancel important work.
> 
> Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com>
> 
> --- workqueue.c	2022-05-28 16:54:12.024176123 -0300
> +++ workqueue.c	2022-05-28 16:54:37.698176135 -0300
> @@ -3158,7 +3158,7 @@ static bool __cancel_work_timer(struct w
>  	 * This allows canceling during early boot.  We know that @work
>  	 * isn't executing.
>  	 */
> -	if (wq_online)
> +	if (!wq_online)
>  		__flush_work(work, true);
>  
>  	clear_work_data(work);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline
  2022-05-29  1:25 ` Geraldo Nascimento
@ 2022-05-29  4:29   ` Geraldo Nascimento
  2022-05-29  5:24     ` Tejun Heo
  0 siblings, 1 reply; 9+ messages in thread
From: Geraldo Nascimento @ 2022-05-29  4:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, LKML

On Sat, May 28, 2022 at 10:25:55PM -0300, Geraldo Nascimento wrote:
> On Sat, May 28, 2022 at 05:07:08PM -0300, Geraldo Nascimento wrote:
> > Greetings,
> > 
> 
> Hi, again,
> 

And again, Hi!

Doing my due dilligence, it seems.

> > This is a one-character patch but very important as the kernel workqueue
> > __cancel_work_timer will cancel active work

It won't cancel important work, seems it's just a WARN_ON but it's very
annoying. My understanding was the NOT was needed to call __flush_work()
before kthreads are spawned. During early boot, as the comment says, before
workqueue_init() fires.

There's a bug report at https://gitlab.freedesktop.org/drm/amd/-/issues/1898
and Felix Kuehling is right that this bug is only triggered when you try
to use amdkfd ( and the Kconfigs that implies) without HSA_AMD_SVM
configured.

It makes sense to me that NOT operator is missing however, since in the warning
I was coming from _cancel_work_timer() to __flush_work(), something that
should not be done?

I would like very much to hear the opinion of the maintainers!

Thanks,
Geraldo Nascimento

>> without the NOT operator
> > added.
> > 
> > During early boot wq_online is false so with the NOT added it will evaluate
> > to true. Conversely, after boot is done, workqueue
> 
> I meant wq_online. After boot, wq_online will evaluate to true, current
> code might as well have an if (true) there. I hurried up the patch
> because if I'm right this is a major show stopper to drivers that make
> use of cancel_work_timer(). I hit it through amdgpu in conjuction with amdkfd.
> 
> > is now true and we want
> > it to evaluate to false because otherwise it will cancel important work.
> > 
> > Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com>
> > 
> > --- workqueue.c	2022-05-28 16:54:12.024176123 -0300
> > +++ workqueue.c	2022-05-28 16:54:37.698176135 -0300
> > @@ -3158,7 +3158,7 @@ static bool __cancel_work_timer(struct w
> >  	 * This allows canceling during early boot.  We know that @work
> >  	 * isn't executing.
> >  	 */
> > -	if (wq_online)
> > +	if (!wq_online)
> >  		__flush_work(work, true);
> >  
> >  	clear_work_data(work);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline
  2022-05-29  4:29   ` Geraldo Nascimento
@ 2022-05-29  5:24     ` Tejun Heo
  2022-05-29  5:53       ` Geraldo Nascimento
  0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2022-05-29  5:24 UTC (permalink / raw)
  To: Geraldo Nascimento; +Cc: Lai Jiangshan, LKML

On Sun, May 29, 2022 at 01:29:32AM -0300, Geraldo Nascimento wrote:
> I would like very much to hear the opinion of the maintainers!

I have a hard time understanding what you're trying to do. Can you please
slow down and start from describing the problem itself?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline
  2022-05-29  5:24     ` Tejun Heo
@ 2022-05-29  5:53       ` Geraldo Nascimento
  2022-05-29  6:14         ` Tejun Heo
  0 siblings, 1 reply; 9+ messages in thread
From: Geraldo Nascimento @ 2022-05-29  5:53 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, LKML

On Sat, May 28, 2022 at 07:24:41PM -1000, Tejun Heo wrote:
> On Sun, May 29, 2022 at 01:29:32AM -0300, Geraldo Nascimento wrote:
> > I would like very much to hear the opinion of the maintainers!
> 
> I have a hard time understanding what you're trying to do. Can you please
> slow down and start from describing the problem itself?

Hi Tejun,

Sorry for the hurry.

The problem is best described in https://gitlab.freedesktop.org/drm/amd/-/issues/1898

From my understanding from the context of __cancel_work_timer() we should not
ever call __flush_work() but I may be wrong. In the present case as
described in AMD's GitLab __cancel_work_timer() is being called by
cancel_delayed_work_sync() inside kfd_process_notifier_release()
from drivers/gpu/drm/amd/amdkfd/kfd_process.c:1157 (Linux 5.18).

We should only call __flush_work() from __cancel_work_timer() if
workqueue_init() is not yet initialized, that's possible during
early boot though not very likely. Anyway that's before kthreads are
spwaned, so we are sure that particular work isn't executing, hence
why it's safe to call __flush_work() in this particular case.
The comment on kernel/workqueue.c:3157 (for Linux 5.18) says it best:	

	/*
	 * This allows canceling during early boot.  We know that @work
	 * isn't executing.
	 */
	 	if (wq_online)
		__flush_work(work, true);

If __flush_work() is ever called during early boot it will result in a
WARN_ON because workqueue is not online. I have no idea if that's OK
though it hasn't harmed my machine. Of course I don't want to introduce
bugs, I wanna solve them, and I appreciate your cautious approach. Thank
you for the work.

What is not OK apparently is trying to use amdkfd without HSA_AMD_SVM configured! :)

Thank you,
Geraldo Nascimento

> 
> Thanks.
> 
> -- 
> tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline
  2022-05-29  5:53       ` Geraldo Nascimento
@ 2022-05-29  6:14         ` Tejun Heo
  2022-05-29  6:41           ` Geraldo Nascimento
  2022-05-31  4:13           ` Geraldo Nascimento
  0 siblings, 2 replies; 9+ messages in thread
From: Tejun Heo @ 2022-05-29  6:14 UTC (permalink / raw)
  To: Geraldo Nascimento; +Cc: Lai Jiangshan, LKML

On Sun, May 29, 2022 at 02:53:39AM -0300, Geraldo Nascimento wrote:
> On Sat, May 28, 2022 at 07:24:41PM -1000, Tejun Heo wrote:
> > On Sun, May 29, 2022 at 01:29:32AM -0300, Geraldo Nascimento wrote:
> > > I would like very much to hear the opinion of the maintainers!
> > 
> > I have a hard time understanding what you're trying to do. Can you please
> > slow down and start from describing the problem itself?
> 
> Hi Tejun,
> 
> Sorry for the hurry.
> 
> The problem is best described in https://gitlab.freedesktop.org/drm/amd/-/issues/1898
> 
> From my understanding from the context of __cancel_work_timer() we should not
> ever call __flush_work() but I may be wrong. In the present case as

Yeah, you're wrong.

> described in AMD's GitLab __cancel_work_timer() is being called by
> cancel_delayed_work_sync() inside kfd_process_notifier_release()
> from drivers/gpu/drm/amd/amdkfd/kfd_process.c:1157 (Linux 5.18).

Have you confirmed that that actually is the warning which is triggering? I
don't see how that condition would trigger that late during the boot and the
warning line being reported doesn't match v5.16 source code, so I'm not sure
but skimming the instructon sequence, that's the second UD2 sequence, so I'm
gonna guess that's the second WARN_ON - the !work->func one and someone else
on the gitlab bug report seems to agree too.

It's usually a lot more helpful if the bug report is complete - include the
full warning message with some context at least, make sure that the kernel
you're using is an upstream one or something close enough. If not, point to
the source tree. Also, try to clearly distinguish what you know and what you
suspect. Both can help but mixing them up together tends to cause confusion
for everyone involved.

It just looks like the code is trying to cancel a work item which hasn't
been initialized and what it prolly needs is an ifdef around that cancel
call depending on the config option.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline
  2022-05-29  6:14         ` Tejun Heo
@ 2022-05-29  6:41           ` Geraldo Nascimento
  2022-05-31  4:13           ` Geraldo Nascimento
  1 sibling, 0 replies; 9+ messages in thread
From: Geraldo Nascimento @ 2022-05-29  6:41 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, LKML

On Sat, May 28, 2022 at 08:14:23PM -1000, Tejun Heo wrote:
> On Sun, May 29, 2022 at 02:53:39AM -0300, Geraldo Nascimento wrote:
> > On Sat, May 28, 2022 at 07:24:41PM -1000, Tejun Heo wrote:
> > > On Sun, May 29, 2022 at 01:29:32AM -0300, Geraldo Nascimento wrote:
> > > > I would like very much to hear the opinion of the maintainers!
> > > 
> > > I have a hard time understanding what you're trying to do. Can you please
> > > slow down and start from describing the problem itself?
> > 
> > Hi Tejun,
> > 
> > Sorry for the hurry.
> > 
> > The problem is best described in https://gitlab.freedesktop.org/drm/amd/-/issues/1898
> > 
> > From my understanding from the context of __cancel_work_timer() we should not
> > ever call __flush_work() but I may be wrong. In the present case as
> 
> Yeah, you're wrong.
>

No problem from my side, sorry for wasting your time.

> > described in AMD's GitLab __cancel_work_timer() is being called by
> > cancel_delayed_work_sync() inside kfd_process_notifier_release()
> > from drivers/gpu/drm/amd/amdkfd/kfd_process.c:1157 (Linux 5.18).
> 
> Have you confirmed that that actually is the warning which is triggering? I
> don't see how that condition would trigger that late during the boot and the
> warning line being reported doesn't match v5.16 source code, so I'm not sure
> but skimming the instructon sequence, that's the second UD2 sequence, so I'm
> gonna guess that's the second WARN_ON - the !work->func one and someone else
> on the gitlab bug report seems to agree too.

While I can confirm from my dmesg traces it's the second WARN_ON (the one on
(!work->func)) that triggers, remember it's being called from
__flush_work() due to the lack of NOT operator on wq_online, inside
__cancel_work_timer().

To be honest, for me, it only makes sense to call __flush_work() from
the context of __cancel_work_timer() if we are sure the work isn't
executing. One of the few ways to be sure is when kthreads haven't
initialized yet, that means workqueue_init() hasn't fired yet and
wq_online is still false, so it makes sense to negate that false and
exceptionally call __flush_work() without waiting for completion of the
work - i.e., __flush_work() will WARN_ON workqueue offline condition and
return false immediately because task was already idle.

I know I may be repeating myself, but I'm trying to make my point, from
the little understanding I have of the kernel. And I know that you know best,
and that the possibility of a bug like that lying undiscovered on a
code-base as scrutinized as the Linux kernel is, is very small.

> 
> It's usually a lot more helpful if the bug report is complete - include the
> full warning message with some context at least, make sure that the kernel
> you're using is an upstream one or something close enough. If not, point to
> the source tree. Also, try to clearly distinguish what you know and what you
> suspect. Both can help but mixing them up together tends to cause confusion
> for everyone involved.

Sorry. Will try to do better.

> 
> It just looks like the code is trying to cancel a work item which hasn't
> been initialized and what it prolly needs is an ifdef around that cancel
> call depending on the config option.

Thanks for looking into it,
Geraldo Nascimento

> 
> Thanks.
> 
> -- 
> tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline
  2022-05-29  6:14         ` Tejun Heo
  2022-05-29  6:41           ` Geraldo Nascimento
@ 2022-05-31  4:13           ` Geraldo Nascimento
  2022-05-31 17:43             ` Tejun Heo
  1 sibling, 1 reply; 9+ messages in thread
From: Geraldo Nascimento @ 2022-05-31  4:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lai Jiangshan, LKML

On Sat, May 28, 2022 at 08:14:23PM -1000, Tejun Heo wrote:
> Yeah, you're wrong.

Hi Tejun,

you were completely right about me being wrong. It's a real shame to me
and I am only asking for an apology in case there were any doubts.

With my "patch" I was unable to cancel service on the GPU ("sending
SIGINT to MPV because I like to Ctrl+C") at the same time as running
ROCm inference on the AMD Renoir APU with amdkfd of amdgpu.

My machine froze beyond expectation of recovery, I would like to see
what e.g. netconsole would report but it's irrelevant, it would case a
major regression in the kernel tree if this was ever introduced.

Sorry for the misunderstanding, I guess it's the burden of maintaning,
too. My deepest apologies.

I'll follow your advice and try to be more careful and detailed without
mixing what I know and what I don't next time.

Many thanks,
Geraldo Nascimento

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] workqueue: missing NOT while checking if Workqueue is offline
  2022-05-31  4:13           ` Geraldo Nascimento
@ 2022-05-31 17:43             ` Tejun Heo
  0 siblings, 0 replies; 9+ messages in thread
From: Tejun Heo @ 2022-05-31 17:43 UTC (permalink / raw)
  To: Geraldo Nascimento; +Cc: Lai Jiangshan, LKML

Hello,

On Tue, May 31, 2022 at 01:13:03AM -0300, Geraldo Nascimento wrote:
> you were completely right about me being wrong. It's a real shame to me
> and I am only asking for an apology in case there were any doubts.

I don't think there's any need for shame or aplogies. You weren't familiar
with the code base and it just takes some effort to pick up skills for good
bug reporting. I'm sure you'll pick them up in no time.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-05-31 17:43 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-28 20:07 [PATCH] workqueue: missing NOT while checking if Workqueue is offline Geraldo Nascimento
2022-05-29  1:25 ` Geraldo Nascimento
2022-05-29  4:29   ` Geraldo Nascimento
2022-05-29  5:24     ` Tejun Heo
2022-05-29  5:53       ` Geraldo Nascimento
2022-05-29  6:14         ` Tejun Heo
2022-05-29  6:41           ` Geraldo Nascimento
2022-05-31  4:13           ` Geraldo Nascimento
2022-05-31 17:43             ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).