SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.

All of lore.kernel.org
 help / color / mirror / Atom feed

* SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
@ 2007-02-23  3:29 Johannes Berg
  2007-02-23 11:54 ` Rafael J. Wysocki
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Berg @ 2007-02-23  3:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Linus Torvalds, Dave Vasilevsky, Pavel Machek,
	Alexey Starikovskiy, Nigel Cunningham, linux-pm

[-- Attachment #1.1: Type: text/plain, Size: 2439 bytes --]

Hi,

After first debugging a while and then bisecting I found out why my quad
G5 won't suspend any longer.

Let me explain. The patch in question (committed as
ed746e3b18f4df18afa3763155972c5835f284c5, but the other ones around that
for other suspend methods will have the same problems) modifies the
suspend sequence to be like this:

freeze_processes
swsusp_shrink_memory
platform_prepare
device_suspend
disable_nonboot_cpus
[...]

while previously it was

disable_nonboot_cpus
freeze_processes
platform_prepare
swsusp_shrink_memory
[...]

The only thing I'm worried about here is the ordering of
freeze_processes vs. disable_nonboot_cpus. The problem with this new
ordering is with workqueues, specifically freezable per-CPU workqueues
which consist of one kthread per CPU, bound to a single CPU. Now, when
CPUs are hot-unplugged, the workqueue code (by having a cpu notifier
called) will kill the thread for the CPU that is being unplugged. If you
look into kernel/workqueue.c, you'll notice that this is done by a
regular kthread_stop() as one might expect.

However, and this is the problem, for any freezable workqueue, the
workqueue kthread will be frozen at this point! Hence, kthread_stop()
will wait forever for the thread to finish, blocking the suspend
process.

Now, as for a solution, I don't really have a great idea yet. We have a
bunch of things we could do:
 (1) simply change the ordering to disable nonboot CPUs much earlier
 (2a) teach kthread_stop() about frozen processes and that it doesn't
      need to wait for them because they'll die once they wake up again
 (2b) teach kthread_stop() about frozen processes modify the freezer to
      allow waking up a process that is destined to die
 (3) teach the workqueue code about suspend

Of these options,

(1) would work, but also only punts the problem until someone wants to
do multi-threaded suspend (as if...).

(2a) would sort-of work, but what if someone unplugs a CPU while the
system is suspended [will that even work]? the thread would get really
stuck there, bound to a CPU that no longer exists.

(2b) should be possible, but would require some sort of per-thread
exit-the-freezer API

(3) is icky

I think I prefer (2b) or alternatively (1). In any case, with the commit
mentioned above reverted, my quad G5 can suspend to disk again and I'm
happy that it isn't my fault ;)

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-23  3:29 SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al Johannes Berg
@ 2007-02-23 11:54 ` Rafael J. Wysocki
  2007-02-23 12:17   ` Johannes Berg
  0 siblings, 1 reply; 13+ messages in thread
From: Rafael J. Wysocki @ 2007-02-23 11:54 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Andrew Morton, Linus Torvalds, Dave Vasilevsky, Pavel Machek,
	Alexey Starikovskiy, Nigel Cunningham, linux-pm

Hi,

On Friday, 23 February 2007 04:29, Johannes Berg wrote:
> Hi,
> 
> After first debugging a while and then bisecting I found out why my quad
> G5 won't suspend any longer.
> 
> Let me explain. The patch in question (committed as
> ed746e3b18f4df18afa3763155972c5835f284c5, but the other ones around that
> for other suspend methods will have the same problems) modifies the
> suspend sequence to be like this:
> 
> freeze_processes
> swsusp_shrink_memory
> platform_prepare
> device_suspend
> disable_nonboot_cpus
> [...]
> 
> while previously it was
> 
> disable_nonboot_cpus
> freeze_processes
> platform_prepare
> swsusp_shrink_memory
> [...]
> 
> 
> The only thing I'm worried about here is the ordering of
> freeze_processes vs. disable_nonboot_cpus. The problem with this new
> ordering is with workqueues, specifically freezable per-CPU workqueues
> which consist of one kthread per CPU, bound to a single CPU. Now, when
> CPUs are hot-unplugged, the workqueue code (by having a cpu notifier
> called) will kill the thread for the CPU that is being unplugged. If you
> look into kernel/workqueue.c, you'll notice that this is done by a
> regular kthread_stop() as one might expect.
> 
> However, and this is the problem, for any freezable workqueue, the
> workqueue kthread will be frozen at this point! Hence, kthread_stop()
> will wait forever for the thread to finish, blocking the suspend
> process.

Hm, the only freezable workqueues I was aware of were those in XFS.

Moreover, the patch has got _a_ _lot_ of testing on SMP on x86_64
and I believe it works for people on i386 too.  So the workqueues in question
seem to be architecture-specific.  Is that correct?

> Now, as for a solution, I don't really have a great idea yet. We have a
> bunch of things we could do:
>  (1) simply change the ordering to disable nonboot CPUs much earlier
>  (2a) teach kthread_stop() about frozen processes and that it doesn't
>       need to wait for them because they'll die once they wake up again
>  (2b) teach kthread_stop() about frozen processes modify the freezer to
>       allow waking up a process that is destined to die
>  (3) teach the workqueue code about suspend
> 
> Of these options,
> 
> (1) would work, but also only punts the problem until someone wants to
> do multi-threaded suspend (as if...).

It will also break symmetry with the resume code that has to be like this
because of ACPI-related issues.

> (2a) would sort-of work, but what if someone unplugs a CPU while the
> system is suspended [will that even work]? the thread would get really
> stuck there, bound to a CPU that no longer exists.

Right now we are working on using the task freezer for CPU hotplugging and if
that works, this won't be an issue.
 
> (2b) should be possible, but would require some sort of per-thread
> exit-the-freezer API
> 
> (3) is icky

The workqueue code knows about the suspend already, that's why we have
create_freezeable_worqueue(), for example.

I'd like to first understand why the workqueues in question here are freezable.

> I think I prefer (2b) or alternatively (1). In any case, with the commit
> mentioned above reverted, my quad G5 can suspend to disk again and I'm
> happy that it isn't my fault ;)

Could you please check if the appended patch (on top of the commit you have
reverted) changes anything?

Rafael

---
 kernel/power/disk.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6.21-rc1/kernel/power/disk.c
===================================================================
--- linux-2.6.21-rc1.orig/kernel/power/disk.c
+++ linux-2.6.21-rc1/kernel/power/disk.c
@@ -132,9 +132,13 @@ int pm_suspend_disk(void)
 	if (error)
 		goto Thaw;
 
+	error = disable_nonboot_cpus();
+	if (error)
+		goto Enable_cpus;
+
 	error = platform_prepare();
 	if (error)
-		goto Thaw;
+		goto Enable_cpus;
 
 	suspend_console();
 	error = device_suspend(PMSG_FREEZE);
@@ -142,10 +146,6 @@ int pm_suspend_disk(void)
 		printk(KERN_ERR "PM: Some devices failed to suspend\n");
 		goto Resume_devices;
 	}
-	error = disable_nonboot_cpus();
-	if (error)
-		goto Enable_cpus;
-
 	if (pm_disk_mode == PM_DISK_TEST) {
 		printk("swsusp debug: Waiting for 5 seconds.\n");
 		mdelay(5000);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-23 11:54 ` Rafael J. Wysocki
@ 2007-02-23 12:17   ` Johannes Berg
  2007-02-23 13:25     ` Rafael J. Wysocki
  2007-02-23 13:31     ` Johannes Berg
  0 siblings, 2 replies; 13+ messages in thread
From: Johannes Berg @ 2007-02-23 12:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Linus Torvalds, Dave Vasilevsky, Pavel Machek,
	Nigel Cunningham, Alexey Starikovskiy, linux-pm

[-- Attachment #1.1: Type: text/plain, Size: 2054 bytes --]

[correcting address for Nigel]

Hi Rafael,

> Hm, the only freezable workqueues I was aware of were those in XFS.

Well, I have an XFS root fs. In fact, the one that was mostly causing it
was the xfs one, but I'm not sure it was all the time. The same bug
happens at resume time (if I manually offline all nonboot CPUs before
suspend), at which point I couldn't tell whether it was XFS or something
else.

> Moreover, the patch has got _a_ _lot_ of testing on SMP on x86_64
> and I believe it works for people on i386 too.  So the workqueues in question
> seem to be architecture-specific.  Is that correct?

There could be some arch-specific workqueues as well, but at least in
one case I saw xfsdatad causing it.

> > (1) would work, but also only punts the problem until someone wants to
> > do multi-threaded suspend (as if...).
> 
> It will also break symmetry with the resume code that has to be like this
> because of ACPI-related issues.

Ok. I don't have any ACPI issues due to the lack of ACPI ;)

> > (2a) would sort-of work, but what if someone unplugs a CPU while the
> > system is suspended [will that even work]? the thread would get really
> > stuck there, bound to a CPU that no longer exists.
> 
> Right now we are working on using the task freezer for CPU hotplugging and if
> that works, this won't be an issue.

Care to elaborate? It doesn't seem to make sense to freeze tasks that
are running on some other CPU if that one is going offline, but I'm
probably misunderstanding you here.

> I'd like to first understand why the workqueues in question here are freezable.

I don't think that matters much. We can't have freezable per-CPU
workqueues and then forbid using them.

> Could you please check if the appended patch (on top of the commit you have
> reverted) changes anything?

I can give it a go, but it doesn't look as though it'd help. It still
freezes all threads before disabling CPUs, and my debugging certainly
pinpointed to kthread_stop() in the workqueue.

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-23 12:17   ` Johannes Berg
@ 2007-02-23 13:25     ` Rafael J. Wysocki
  2007-02-23 20:23       ` Johannes Berg
  2007-02-23 13:31     ` Johannes Berg
  1 sibling, 1 reply; 13+ messages in thread
From: Rafael J. Wysocki @ 2007-02-23 13:25 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Andrew Morton, Linus Torvalds, Dave Vasilevsky, Pavel Machek,
	Nigel Cunningham, Alexey Starikovskiy, linux-pm

Hi,

On Friday, 23 February 2007 13:17, Johannes Berg wrote:
> [correcting address for Nigel]
> 
> Hi Rafael,
> 
> > Hm, the only freezable workqueues I was aware of were those in XFS.
> 
> Well, I have an XFS root fs.

I see.

> In fact, the one that was mostly causing it was the xfs one, but I'm not
> sure it was all the time. The same bug happens at resume time (if I manually
> offline all nonboot CPUs before suspend), at which point I couldn't tell
> whether it was XFS or something else.

Probably XFS too.

> > Moreover, the patch has got _a_ _lot_ of testing on SMP on x86_64
> > and I believe it works for people on i386 too.  So the workqueues in question
> > seem to be architecture-specific.  Is that correct?
> 
> There could be some arch-specific workqueues as well, but at least in
> one case I saw xfsdatad causing it.

Yes, that makes sense.

In that case the fastest fix would be to revert the commit in question (and
some others too), but I don't think it will be satisfactory for the people with
the ACPI issues related to the resume code ordering.  Moreover, it really
is more reasonable to disable nonboot CPUs after tasks have been frozen
(I think it's also necessary for the CPU hotplugging with the help of the
freezer; see below).

I'll try to find a better solution later today.

[Why, o why people don't test -mm kernels???  The patch that causes the problem
has been in -mm since 2.6.20-rc2-mm1 and _nobody_ has reported any problem
with it until now.  Sigh.]

> > > (1) would work, but also only punts the problem until someone wants to
> > > do multi-threaded suspend (as if...).
> > 
> > It will also break symmetry with the resume code that has to be like this
> > because of ACPI-related issues.
> 
> Ok. I don't have any ACPI issues due to the lack of ACPI ;)

Lucky you. ;-)

> > > (2a) would sort-of work, but what if someone unplugs a CPU while the
> > > system is suspended [will that even work]? the thread would get really
> > > stuck there, bound to a CPU that no longer exists.
> > 
> > Right now we are working on using the task freezer for CPU hotplugging and if
> > that works, this won't be an issue.
> 
> Care to elaborate? It doesn't seem to make sense to freeze tasks that
> are running on some other CPU if that one is going offline, but I'm
> probably misunderstanding you here.

The idea is to freeze tasks every time before CPUs are hot(un)plugged.  This
way we can avoid nasty locking-related issues that have been haunting us
for some time.  And yes, we are going to freeze all tasks for this purpose,
although it seems inefficient at first sight.

Of course if we do it, the suspend code will have to be changed so that the
freezing of tasks related to the CPU hotplug doesn't interfere with the
freezing of tasks related to the suspend in a destructive way.  Then, probably,
the problem you have discovered will go away automatically.

For this reason I'd like to keep the ordering of code in disk.c as it is now,
because something like

freeze_processes(SUSPEND)
...
freeze_processes(CPU_HOTPLUG)
...
thaw_processes(CPU_HOTPLUG)
...
thaw_processes(SUSPEND)

seems to be more logical than any other ordering.

> > I'd like to first understand why the workqueues in question here are freezable.
> 
> I don't think that matters much. We can't have freezable per-CPU
> workqueues and then forbid using them.

Yes.

> > Could you please check if the appended patch (on top of the commit you have
> > reverted) changes anything?
> 
> I can give it a go, but it doesn't look as though it'd help. It still
> freezes all threads before disabling CPUs, and my debugging certainly
> pinpointed to kthread_stop() in the workqueue.

Yes, you're right.

I wasn't quite sure which workqueues were causing the problem, but if that's
XFS, I know enough.

The fix is needed, because we'd like to make the majority of worqueues
freezable, but the return to the old code ordering doesn't seem to be a good
solution in the long run.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-23 12:17   ` Johannes Berg
  2007-02-23 13:25     ` Rafael J. Wysocki
@ 2007-02-23 13:31     ` Johannes Berg
  1 sibling, 0 replies; 13+ messages in thread
From: Johannes Berg @ 2007-02-23 13:31 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Linus Torvalds, Dave Vasilevsky, Pavel Machek,
	Nigel Cunningham, Alexey Starikovskiy, linux-pm


[-- Attachment #1.1: Type: text/plain, Size: 551 bytes --]

On Fri, 2007-02-23 at 13:19 +0100, Johannes Berg wrote:

> I can give it a go, but it doesn't look as though it'd help. It still
> freezes all threads before disabling CPUs, and my debugging certainly
> pinpointed to kthread_stop() in the workqueue.

As expected, it prints:
CPU#1 offline
CPU 1 dead (my message when it goes into the dead loop)
kthread_stop: stopping migration/1
kthread_stop: stopping ksoftirq/1
kthread_stop: stopping watchdog/1
kthread_stop: kcryptd/1
kthread_stop: stopping xfsdatad/1

and then hangs.

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-23 13:25     ` Rafael J. Wysocki
@ 2007-02-23 20:23       ` Johannes Berg
  2007-02-24  0:01         ` Rafael J. Wysocki
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Berg @ 2007-02-23 20:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Linus Torvalds, Dave Vasilevsky, Pavel Machek,
	Nigel Cunningham, Alexey Starikovskiy, linux-pm

[-- Attachment #1.1: Type: text/plain, Size: 2445 bytes --]

Hi,

> In that case the fastest fix would be to revert the commit in question (and
> some others too),

Yeah, the same thing exists for uswsusp and regular suspend-to-ram
afaict.

> but I don't think it will be satisfactory for the people with
> the ACPI issues related to the resume code ordering.  Moreover, it really
> is more reasonable to disable nonboot CPUs after tasks have been frozen
> (I think it's also necessary for the CPU hotplugging with the help of the
> freezer; see below).
> 
> I'll try to find a better solution later today.

Great.

> [Why, o why people don't test -mm kernels???  The patch that causes the problem
> has been in -mm since 2.6.20-rc2-mm1 and _nobody_ has reported any problem
> with it until now.  Sigh.]

Sorry. It's hard enough to follow powerpc.git and keep my suspend
patches working on top of that.

> The idea is to freeze tasks every time before CPUs are hot(un)plugged.  This
> way we can avoid nasty locking-related issues that have been haunting us
> for some time.  And yes, we are going to freeze all tasks for this purpose,
> although it seems inefficient at first sight.

It's not like CPU hotplug is very frequent so that's perfectly fine to
me.

> Of course if we do it, the suspend code will have to be changed so that the
> freezing of tasks related to the CPU hotplug doesn't interfere with the
> freezing of tasks related to the suspend in a destructive way.  Then, probably,
> the problem you have discovered will go away automatically.

Yeah, I agree that would probably fix it, but the freezer guarantee
becomes slightly less, you have some (most) tasks frozen and some then
need to exit the freezer again to cleanly exit. I'm not sure you can do
that w/o getting into possible trouble with locking, i.e. imagine the
cleanup-path needing to wait for completion of some other event from a
thread that's frozen... (not that I'm sure whether that's legal or not
anyway)

> For this reason I'd like to keep the ordering of code in disk.c as it is now,
> because something like

[...]

> seems to be more logical than any other ordering.

I used to think of CPU hotplug as a simple mechanism to avoid multi-CPU
problems so I was perfectly fine with disabling them very early during
suspend. I can't really see what problems ACPI might have with that
since the user can anyway do echo 0 > cpu3/online before suspend, no?

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-23 20:23       ` Johannes Berg
@ 2007-02-24  0:01         ` Rafael J. Wysocki
  2007-02-24  0:31           ` Johannes Berg
  0 siblings, 1 reply; 13+ messages in thread
From: Rafael J. Wysocki @ 2007-02-24  0:01 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Andrew Morton, Linus Torvalds, Dave Vasilevsky, Pavel Machek,
	Nigel Cunningham, Alexey Starikovskiy, linux-pm

Hi,

On Friday, 23 February 2007 21:23, Johannes Berg wrote:
> Hi,
> 
> > In that case the fastest fix would be to revert the commit in question (and
> > some others too),
> 
> Yeah, the same thing exists for uswsusp and regular suspend-to-ram
> afaict.

Exactly.

> > but I don't think it will be satisfactory for the people with
> > the ACPI issues related to the resume code ordering.  Moreover, it really
> > is more reasonable to disable nonboot CPUs after tasks have been frozen
> > (I think it's also necessary for the CPU hotplugging with the help of the
> > freezer; see below).
> > 
> > I'll try to find a better solution later today.
> 
> Great.

I'm looking at it right now, but this is quite complicated.

Could you please try the appended patch?  It's somewhat hackish, but may work.

The idea is to do nothing on CPU unplug if the CPU's worker thread is frozen
and later check in the thread itself if it has been replaced by another one.

Greetings,
Rafael


 kernel/workqueue.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6.20-mm2/kernel/workqueue.c
===================================================================
--- linux-2.6.20-mm2.orig/kernel/workqueue.c	2007-02-23 22:48:50.000000000 +0100
+++ linux-2.6.20-mm2/kernel/workqueue.c	2007-02-24 00:58:57.000000000 +0100
@@ -316,7 +316,13 @@ static int worker_thread(void *__cwq)
 	do_sigaction(SIGCHLD, &sa, (struct k_sigaction *)0);
 
 	for (;;) {
-		try_to_freeze();
+		if (try_to_freeze()) {
+			/* We've just exited the refrigerator.  If our CPU is
+			 * a nonboot one, we might have been replaced.
+			 */
+			if (cwq->thread != current)
+				break;
+		}
 
 		prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
 		if (!cwq->should_stop && list_empty(&cwq->worklist))
@@ -713,7 +719,7 @@ static void cleanup_workqueue_thread(str
 	int alive = 0;
 
 	spin_lock_irq(&cwq->lock);
-	if (cwq->thread != NULL) {
+	if (cwq->thread != NULL && !frozen(cwq->thread)) {
 		insert_wq_barrier(cwq, &barr, 1);
 		cwq->should_stop = 1;
 		alive = 1;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-24  0:01         ` Rafael J. Wysocki
@ 2007-02-24  0:31           ` Johannes Berg
  2007-02-24  8:57             ` Rafael J. Wysocki
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Berg @ 2007-02-24  0:31 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Linus Torvalds, Dave Vasilevsky, Pavel Machek,
	Nigel Cunningham, Alexey Starikovskiy, linux-pm


[-- Attachment #1.1: Type: text/plain, Size: 1719 bytes --]

Hi,

> Could you please try the appended patch?  It's somewhat hackish, but may work.

Not before Monday or Tuesday unfortunately, I'm away from the machine.
Maybe I can find someone else willing to test who has the same box.

> The idea is to do nothing on CPU unplug if the CPU's worker thread is frozen
> and later check in the thread itself if it has been replaced by another one.

Yeah that should work. That patch won't apply to my tree though unless I
actually go to -mm2 as well :)

> Greetings,
> Rafael
> 
> 
>  kernel/workqueue.c |   10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.20-mm2/kernel/workqueue.c
> ===================================================================
> --- linux-2.6.20-mm2.orig/kernel/workqueue.c	2007-02-23 22:48:50.000000000 +0100
> +++ linux-2.6.20-mm2/kernel/workqueue.c	2007-02-24 00:58:57.000000000 +0100
> @@ -316,7 +316,13 @@ static int worker_thread(void *__cwq)
>  	do_sigaction(SIGCHLD, &sa, (struct k_sigaction *)0);
>  
>  	for (;;) {
> -		try_to_freeze();
> +		if (try_to_freeze()) {
> +			/* We've just exited the refrigerator.  If our CPU is
> +			 * a nonboot one, we might have been replaced.
> +			 */
> +			if (cwq->thread != current)
> +				break;
> +		}
>  
>  		prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
>  		if (!cwq->should_stop && list_empty(&cwq->worklist))
> @@ -713,7 +719,7 @@ static void cleanup_workqueue_thread(str
>  	int alive = 0;
>  
>  	spin_lock_irq(&cwq->lock);
> -	if (cwq->thread != NULL) {
> +	if (cwq->thread != NULL && !frozen(cwq->thread)) {
>  		insert_wq_barrier(cwq, &barr, 1);
>  		cwq->should_stop = 1;
>  		alive = 1;
> 

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-24  0:31           ` Johannes Berg
@ 2007-02-24  8:57             ` Rafael J. Wysocki
  2007-02-24 20:54               ` Rafael J. Wysocki
  0 siblings, 1 reply; 13+ messages in thread
From: Rafael J. Wysocki @ 2007-02-24  8:57 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Andrew Morton, Linus Torvalds, Dave Vasilevsky, Pavel Machek,
	Nigel Cunningham, Alexey Starikovskiy, linux-pm

On Saturday, 24 February 2007 01:31, Johannes Berg wrote:
> Hi,
> 
> > Could you please try the appended patch?  It's somewhat hackish, but may work.
> 
> Not before Monday or Tuesday unfortunately, I'm away from the machine.

Okay, no big deal.  I think I'll be able to reproduce the problem here. ;-)

> Maybe I can find someone else willing to test who has the same box.
> 
> > The idea is to do nothing on CPU unplug if the CPU's worker thread is frozen
> > and later check in the thread itself if it has been replaced by another one.
> 
> Yeah that should work. That patch won't apply to my tree though unless I
> actually go to -mm2 as well :)

Ah, sorry.  I'll prepare a version against 2.6.21-rc1.  Apart from this, I
think there's a better solution, but I have to verify that.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-24  8:57             ` Rafael J. Wysocki
@ 2007-02-24 20:54               ` Rafael J. Wysocki
  2007-02-24 21:07                 ` Johannes Berg
  2007-03-12 16:57                 ` Roman Jarosz
  0 siblings, 2 replies; 13+ messages in thread
From: Rafael J. Wysocki @ 2007-02-24 20:54 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Andrew Morton, Linus Torvalds, Pavel Machek, Nigel Cunningham,
	Alexey Starikovskiy, linux-pm, Dave Vasilevsky

On Saturday, 24 February 2007 09:57, Rafael J. Wysocki wrote:
> On Saturday, 24 February 2007 01:31, Johannes Berg wrote:
> > Hi,
> > 
> > > Could you please try the appended patch?  It's somewhat hackish, but may work.
> > 
> > Not before Monday or Tuesday unfortunately, I'm away from the machine.
> 
> Okay, no big deal.  I think I'll be able to reproduce the problem here. ;-)
> 
> > Maybe I can find someone else willing to test who has the same box.
> > 
> > > The idea is to do nothing on CPU unplug if the CPU's worker thread is frozen
> > > and later check in the thread itself if it has been replaced by another one.
> > 
> > Yeah that should work. That patch won't apply to my tree though unless I
> > actually go to -mm2 as well :)
> 
> Ah, sorry.  I'll prepare a version against 2.6.21-rc1.  Apart from this, I
> think there's a better solution, but I have to verify that.

No, this one is better IMO.

Appended is the version against 2.6.21-rc1.  It seems to work for me, but I'd
like someone else to confirm it.

Greetings,
Rafael

 kernel/workqueue.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc1/kernel/workqueue.c
===================================================================
--- linux-2.6.21-rc1.orig/kernel/workqueue.c
+++ linux-2.6.21-rc1/kernel/workqueue.c
@@ -376,8 +376,19 @@ static int worker_thread(void *__cwq)
 
 	set_current_state(TASK_INTERRUPTIBLE);
 	while (!kthread_should_stop()) {
-		if (cwq->freezeable)
-			try_to_freeze();
+		if (try_to_freeze()) {
+			/* We've just left the refrigerator.  If our CPU is
+			 * a nonboot one, we might have been replaced.
+			 * The lock is taken to prevent the race with
+			 * cleanup_workqueue_thread() from happening
+			 */
+			spin_lock_irq(&cwq->lock);
+			if (cwq->thread != current) {
+				spin_unlock_irq(&cwq->lock);
+				break;
+			}
+			spin_unlock_irq(&cwq->lock);
+		}
 
 		add_wait_queue(&cwq->more_work, &wait);
 		if (list_empty(&cwq->worklist))
@@ -539,6 +550,9 @@ static void cleanup_workqueue_thread(str
 	cwq = per_cpu_ptr(wq->cpu_wq, cpu);
 	spin_lock_irqsave(&cwq->lock, flags);
 	p = cwq->thread;
+	if (p && frozen(p))
+		p = NULL;
+
 	cwq->thread = NULL;
 	spin_unlock_irqrestore(&cwq->lock, flags);
 	if (p)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-24 20:54               ` Rafael J. Wysocki
@ 2007-02-24 21:07                 ` Johannes Berg
  2007-03-12 16:57                 ` Roman Jarosz
  1 sibling, 0 replies; 13+ messages in thread
From: Johannes Berg @ 2007-02-24 21:07 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Linus Torvalds, Pavel Machek, Nigel Cunningham,
	Alexey Starikovskiy, linux-pm, Dave Vasilevsky


[-- Attachment #1.1: Type: text/plain, Size: 288 bytes --]

On Sat, 2007-02-24 at 21:54 +0100, Rafael J. Wysocki wrote:

> No, this one is better IMO.
> 
> Appended is the version against 2.6.21-rc1.  It seems to work for me, but I'd
> like someone else to confirm it.

I'll give it a try at the beginning of next week, thanks.

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-02-24 20:54               ` Rafael J. Wysocki
  2007-02-24 21:07                 ` Johannes Berg
@ 2007-03-12 16:57                 ` Roman Jarosz
  2007-03-12 18:14                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 13+ messages in thread
From: Roman Jarosz @ 2007-03-12 16:57 UTC (permalink / raw)
  To: linux-pm

Rafael J. Wysocki <rjw <at> sisk.pl> writes:

> 
> On Saturday, 24 February 2007 09:57, Rafael J. Wysocki wrote:
> > On Saturday, 24 February 2007 01:31, Johannes Berg wrote:
> > > Hi,
> > > 
> > > > Could you please try the appended patch?  It's somewhat hackish, but 
may work.
> > > 
> > > Not before Monday or Tuesday unfortunately, I'm away from the machine.
> > 
> > Okay, no big deal.  I think I'll be able to reproduce the problem here. 
> > 
> > > Maybe I can find someone else willing to test who has the same box.
> > > 
> > > > The idea is to do nothing on CPU unplug if the CPU's worker thread is 
frozen
> > > > and later check in the thread itself if it has been replaced by another 
one.
> > > 
> > > Yeah that should work. That patch won't apply to my tree though unless I
> > > actually go to -mm2 as well :)
> > 
> > Ah, sorry.  I'll prepare a version against 2.6.21-rc1.  Apart from this, I
> > think there's a better solution, but I have to verify that.
> 
> No, this one is better IMO.
> 
> Appended is the version against 2.6.21-rc1.  It seems to work for me, but I'd
> like someone else to confirm it.
> 

I had the same problem as Johannes and I can confirm that your patch works.

Though I'm not sure if it's caused by XFS because I don't have any XFS 
partitions, but I have XFS support compiled in kernel.
My root partition is reiserfs.

Tested with suspend to ram on:
kernel 2.6.21-rc3-git7
Notebook ASUS A6JC, Intel Core Duo

Thanks,
Roman Jarosz

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.
  2007-03-12 16:57                 ` Roman Jarosz
@ 2007-03-12 18:14                   ` Rafael J. Wysocki
  0 siblings, 0 replies; 13+ messages in thread
From: Rafael J. Wysocki @ 2007-03-12 18:14 UTC (permalink / raw)
  To: linux-pm; +Cc: Roman Jarosz

Hi,

On Monday, 12 March 2007 17:57, Roman Jarosz wrote:
> Rafael J. Wysocki <rjw <at> sisk.pl> writes:
> 
> > 
> > On Saturday, 24 February 2007 09:57, Rafael J. Wysocki wrote:
> > > On Saturday, 24 February 2007 01:31, Johannes Berg wrote:
> > > > Hi,
> > > > 
> > > > > Could you please try the appended patch?  It's somewhat hackish, but 
> may work.
> > > > 
> > > > Not before Monday or Tuesday unfortunately, I'm away from the machine.
> > > 
> > > Okay, no big deal.  I think I'll be able to reproduce the problem here. 
> > > 
> > > > Maybe I can find someone else willing to test who has the same box.
> > > > 
> > > > > The idea is to do nothing on CPU unplug if the CPU's worker thread is 
> frozen
> > > > > and later check in the thread itself if it has been replaced by another 
> one.
> > > > 
> > > > Yeah that should work. That patch won't apply to my tree though unless I
> > > > actually go to -mm2 as well :)
> > > 
> > > Ah, sorry.  I'll prepare a version against 2.6.21-rc1.  Apart from this, I
> > > think there's a better solution, but I have to verify that.
> > 
> > No, this one is better IMO.
> > 
> > Appended is the version against 2.6.21-rc1.  It seems to work for me, but I'd
> > like someone else to confirm it.
> > 
> 
> I had the same problem as Johannes and I can confirm that your patch works.

Thanks for testing, but I'm afraid it won't be merged.  Instead, we have
decided to make the XFS workqueues nonfreezable again, but the patch for
that hasn't made it to the mainline yet (it was sent to the XFS maintainers
some time ago).

> Though I'm not sure if it's caused by XFS because I don't have any XFS 
> partitions, but I have XFS support compiled in kernel.

This probably is enough to trigger the problem.

> My root partition is reiserfs.
> 
> Tested with suspend to ram on:
> kernel 2.6.21-rc3-git7
> Notebook ASUS A6JC, Intel Core Duo

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-03-12 18:14 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-23  3:29 SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al Johannes Berg
2007-02-23 11:54 ` Rafael J. Wysocki
2007-02-23 12:17   ` Johannes Berg
2007-02-23 13:25     ` Rafael J. Wysocki
2007-02-23 20:23       ` Johannes Berg
2007-02-24  0:01         ` Rafael J. Wysocki
2007-02-24  0:31           ` Johannes Berg
2007-02-24  8:57             ` Rafael J. Wysocki
2007-02-24 20:54               ` Rafael J. Wysocki
2007-02-24 21:07                 ` Johannes Berg
2007-03-12 16:57                 ` Roman Jarosz
2007-03-12 18:14                   ` Rafael J. Wysocki
2007-02-23 13:31     ` Johannes Berg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.