linux-kernel.vger.kernel.org archive mirror
* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-05-23 13:29 Yu, Luming
  2006-05-23 17:12 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-05-23 13:29 UTC (permalink / raw)
  To: trenn, Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos, Carl-Daniel Hailfinger


>> exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
>> exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0
>> exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2
>> 
>> repeated endlessly.

Hmm, interesting.  This looks like the same error as with the TP 600X.

>
>This sounds like the problem Daniel had on his Samsung P35 recently.
>He could fix it by getting rid of some asus_unhide_smbus stuff or the
>other way around, adding asus_unhide_smbus quirks in the S3 resume code.
>
>This thread was recently posted on lkml:
>Re: [patch] smbus unhiding kills thermal management
>
>Here are some more details, for me that sounds related...:
>https://bugzilla.novell.com/show_bug.cgi?id=173420
>

But this Samsung P35 doesn't have _GLK, so I think the TP 600X has
a different problem from the Samsung P35.

Actually, Sanjoy has a workaround for the TP 600X S3 issue.
What we need to do is come up with a clean patch.
It is on the to-do list.

Thanks,
Luming


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-05-23 13:29 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT] Yu, Luming
@ 2006-05-23 17:12 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-05-23 17:12 UTC (permalink / raw)
  To: Yu, Luming
  Cc: trenn, linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos,
	Carl-Daniel Hailfinger

[Trimmed lists that seemed unrelated: v4l-dvb-maintainer@linuxtv.org,
 video4linux-list@redhat.com, linux-usb-devel@lists.sourceforge.net,
 linux-ide@vger.kernel.org, linux-input@atrey.karlin.mff.cuni.cz]

> But this Samsung P35 doesn't have _GLK, so I think the TP 600X has
> a different problem from the Samsung P35.

You're right.  I tried 2.6.16.18, which has the smbus patch, but it
didn't help the resume.  I need to test more whether it helps the fan,
but I doubt it will.

2.6.17-rc4 (with vanilla DSDT) does strange things to the fan.  At
boot, the fan is often on.  The trip point is 37 C (the DSDT default)
and temperature, say, 40 C.  That's fine and the fan should be on.
But if I set the trip point to 45 C and the poll interval to 100
seconds, the fan remains on.  I have to set the trip point and polling
interval a second time for the fan to turn off.  With 2.6.16-rc5, it
would turn off after the first setting.

Also, and I need to check which kernel it is (either 2.6.16.18 or
2.6.17-rc4), during S3 sleep, the right speaker made a quiet hiss.  I
imagine that will run down the battery pretty quickly.  It's a new
behavior since 2.6.16-rc5.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-05-21  0:12     ` Sanjoy Mahajan
  2006-05-21  0:40       ` Carl-Daniel Hailfinger
@ 2006-05-22  9:55       ` Pavel Machek
  1 sibling, 0 replies; 86+ messages in thread
From: Pavel Machek @ 2006-05-22  9:55 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: trenn, Yu, Luming, linux-kernel, Linus Torvalds, Andrew Morton,
	Tom Seeley, Dave Jones, Jiri Slaby, michael, mchehab,
	v4l-dvb-maintainer, video4linux-list, Brian Marete,
	Ryan Phillips, gregkh, linux-usb-devel, Brown, Len, linux-acpi,
	Mark Lord, Randy Dunlap, jgarzik, linux-ide, Duncan,
	Pavlik Vojtech, linux-input, Meelis Roos, Carl-Daniel Hailfinger

Hi!

> > https://bugzilla.novell.com/show_bug.cgi?id=173420
> 
> From Comment #30 at the above URL: "The Linux ACPI code seems to
> actively prevent the fan from running and that worries me."
> 
> I saw that as well, and found the following recipe would work around
> the problem:
> 
> 1. Set the trip point to, say, 70 C -- well above the actual
>    temperature.
> 
> 2. Then set the trip to anything reasonable that's under the current
>    temperature (27 C always works).  Now the fan turns on, and behaves
>    fine from then.
> 
> My explanation is that, before step 1, the fan is off but the OS
> thinks it's on.  So the dialogue goes something like:
> 
> Hardware (from EC or BIOS?): Ack, I'm overheating, turn on the fan now!
> OS: There, there, take it easy.  I've checked bit fields in my
>      memory, and the fan is on.  So I don't have to do anything.
> Hardware: Ack, ...
> OS: There, there, ...
> [Hence the 100% kacpid CPU usage]
> 
> Based on this explanation, I added a resume method to the fan driver.
> It would turn on the fan and mark it as on.  So then the internal OS
> state matched the actual state.  The fix didn't work for at least one
> reason: ACPI drivers didn't have suspend/resume methods (though now
> there are test patches to add those methods).

Can you redo your patches with those methods?

> Another fix, probably worth doing anyway, is to turn on the fan if the
> BIOS asks for it, whether or not the OS thinks it's on.  The chance of
> the two pieces of information getting out of synch, and the hardware
> damage it can cause, is enough to make it worthwhile.  The reverse

There should be a 0% chance of hardware damage. Fan failure means overheating,
which means emergency power cutoff. I even tested it by putting paper into the
fan blades several times. It mostly works.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-05-21  1:30         ` Joshua Hudson
@ 2006-05-21  3:53           ` Lee Revell
  0 siblings, 0 replies; 86+ messages in thread
From: Lee Revell @ 2006-05-21  3:53 UTC (permalink / raw)
  To: Joshua Hudson; +Cc: linux-kernel

On Sat, 2006-05-20 at 18:30 -0700, Joshua Hudson wrote:
> On 5/20/06, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
> > Please try kernel 2.6.16.17 (just released). It has the SMBus fix which
> > may fix resume and fan behaviour.
> 
> Am I the only person who read that as 2.6.17 the first time around?

I think it's evidence that the -stable process is working brilliantly.
We have 17 point releases worth of bug fixes that would not have been
available under the previous model.

Lee



* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-05-21  0:40       ` Carl-Daniel Hailfinger
@ 2006-05-21  1:30         ` Joshua Hudson
  2006-05-21  3:53           ` Lee Revell
  0 siblings, 1 reply; 86+ messages in thread
From: Joshua Hudson @ 2006-05-21  1:30 UTC (permalink / raw)
  To: linux-kernel

On 5/20/06, Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:
> Sanjoy Mahajan wrote:
> > That seems likely, thanks for the pointer: Besides the ACPI sleep
> > hangs, this machine (TP 600X) has fan troubles upon S3 resume.  The
> > problems don't do harm (the damn fan keeps turning on when it
> > shouldn't), but that's probably chance.  Various patches that I tested
> > for S3 resume hangs reversed this fan behavior, making the fan refuse
> > to turn on when it should have.  The same problem happened after
> > resume from swsusp (bugzilla #5000).
>
> Please try kernel 2.6.16.17 (just released). It has the SMBus fix which
> may fix resume and fan behaviour.

Am I the only person who read that as 2.6.17 the first time around?


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-05-21  0:12     ` Sanjoy Mahajan
@ 2006-05-21  0:40       ` Carl-Daniel Hailfinger
  2006-05-21  1:30         ` Joshua Hudson
  2006-05-22  9:55       ` Pavel Machek
  1 sibling, 1 reply; 86+ messages in thread
From: Carl-Daniel Hailfinger @ 2006-05-21  0:40 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: trenn, Yu, Luming, linux-kernel, Linus Torvalds, Andrew Morton,
	Tom Seeley, Dave Jones, Jiri Slaby, michael, mchehab,
	v4l-dvb-maintainer, video4linux-list, Brian Marete,
	Ryan Phillips, gregkh, linux-usb-devel, Brown, Len, linux-acpi,
	Mark Lord, Randy Dunlap, jgarzik, linux-ide, Duncan,
	Pavlik Vojtech, linux-input, Meelis Roos

Sanjoy Mahajan wrote:
> That seems likely, thanks for the pointer: Besides the ACPI sleep
> hangs, this machine (TP 600X) has fan troubles upon S3 resume.  The
> problems don't do harm (the damn fan keeps turning on when it
> shouldn't), but that's probably chance.  Various patches that I tested
> for S3 resume hangs reversed this fan behavior, making the fan refuse
> to turn on when it should have.  The same problem happened after
> resume from swsusp (bugzilla #5000).

Please try kernel 2.6.16.17 (just released). It has the SMBus fix which
may fix resume and fan behaviour.


Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-05-19 13:44   ` Thomas Renninger
@ 2006-05-21  0:12     ` Sanjoy Mahajan
  2006-05-21  0:40       ` Carl-Daniel Hailfinger
  2006-05-22  9:55       ` Pavel Machek
  0 siblings, 2 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-05-21  0:12 UTC (permalink / raw)
  To: trenn
  Cc: Yu, Luming, linux-kernel, Linus Torvalds, Andrew Morton,
	Tom Seeley, Dave Jones, Jiri Slaby, michael, mchehab,
	v4l-dvb-maintainer, video4linux-list, Brian Marete,
	Ryan Phillips, gregkh, linux-usb-devel, Brown, Len, linux-acpi,
	Mark Lord, Randy Dunlap, jgarzik, linux-ide, Duncan,
	Pavlik Vojtech, linux-input, Meelis Roos, Carl-Daniel Hailfinger

> This sounds like the problem Daniel had on his Samsung P35 recently.
> He could fix it by getting rid of some asus_unhide_smbus stuff or the
> other way around, adding asus_unhide_smbus quirks in the S3 resume code.
> 
> This thread was recently posted on lkml:
> Re: [patch] smbus unhiding kills thermal management
> 

That seems likely, thanks for the pointer: Besides the ACPI sleep
hangs, this machine (TP 600X) has fan troubles upon S3 resume.  The
problems don't do harm (the damn fan keeps turning on when it
shouldn't), but that's probably chance.  Various patches that I tested
for S3 resume hangs reversed this fan behavior, making the fan refuse
to turn on when it should have.  The same problem happened after
resume from swsusp (bugzilla #5000).

> https://bugzilla.novell.com/show_bug.cgi?id=173420

From Comment #30 at the above URL: "The Linux ACPI code seems to
actively prevent the fan from running and that worries me."

I saw that as well, and found the following recipe would work around
the problem:

1. Set the trip point to, say, 70 C -- well above the actual
   temperature.

2. Then set the trip to anything reasonable that's under the current
   temperature (27 C always works).  Now the fan turns on, and behaves
   fine from then.

My explanation is that, before step 1, the fan is off but the OS
thinks it's on.  So the dialogue goes something like:

Hardware (from EC or BIOS?): Ack, I'm overheating, turn on the fan now!
OS: There, there, take it easy.  I've checked bit fields in my
     memory, and the fan is on.  So I don't have to do anything.
Hardware: Ack, ...
OS: There, there, ...
[Hence the 100% kacpid CPU usage]

Based on this explanation, I added a resume method to the fan driver.
It would turn on the fan and mark it as on.  So then the internal OS
state matched the actual state.  The fix didn't work for at least one
reason: ACPI drivers didn't have suspend/resume methods (though now
there are test patches to add those methods).

Another fix, probably worth doing anyway, is to turn on the fan if the
BIOS asks for it, whether or not the OS thinks it's on.  The chance of
the two pieces of information getting out of synch, and the hardware
damage it can cause, is enough to make it worthwhile.  The reverse
case can try to optimize (if BIOS asks to shut off the fan, shut it
off only if OS thinks it's on).  That creates no danger: just extra
fan noise if the fan is on but the OS thinks it's off.
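
The asymmetric policy described above can be sketched as a small userspace
model.  This is only an illustration under stated assumptions: the struct
and function names (fan_model, fan_set_cached, fan_set_safe) are
hypothetical and are not taken from the kernel's ACPI fan driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not the kernel's fan driver. */
struct fan_model {
	bool hw_on;	/* what the hardware is actually doing */
	bool cached_on;	/* what the OS believes */
	int hw_writes;	/* number of commands sent to the hardware */
};

/* Cache-trusting policy: skip the hardware write whenever the cached
 * state already matches the request. */
void fan_set_cached(struct fan_model *f, bool want_on)
{
	if (f->cached_on == want_on)
		return;
	f->hw_on = want_on;
	f->cached_on = want_on;
	f->hw_writes++;
}

/* Proposed policy: an "on" request always reaches the hardware; an
 * "off" request is skipped only if the OS believes the fan is off. */
void fan_set_safe(struct fan_model *f, bool want_on)
{
	if (!want_on && !f->cached_on)
		return;
	f->hw_on = want_on;
	f->cached_on = want_on;
	f->hw_writes++;
}
```

Starting from the out-of-sync state (fan off, OS believes on), an "on"
request through fan_set_cached() never reaches the hardware, while
fan_set_safe() turns the fan on.  In the reverse mismatch (fan on, OS
believes off), fan_set_safe() skips the "off" request, so the cost is
only extra fan noise.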

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-10  5:26 ` 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT] Sanjoy Mahajan
@ 2006-05-19 13:44   ` Thomas Renninger
  2006-05-21  0:12     ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Renninger @ 2006-05-19 13:44 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: Yu, Luming, linux-kernel, Linus Torvalds, Andrew Morton,
	Tom Seeley, Dave Jones, Jiri Slaby, michael, mchehab,
	v4l-dvb-maintainer, video4linux-list, Brian Marete,
	Ryan Phillips, gregkh, linux-usb-devel, Brown, Len, linux-acpi,
	Mark Lord, Randy Dunlap, jgarzik, linux-ide, Duncan,
	Pavlik Vojtech, linux-input, Meelis Roos, Carl-Daniel Hailfinger

On Fri, 2006-03-10 at 00:26 -0500, Sanjoy Mahajan wrote:
> [Re: bugme #5989, head no longer hanging in shame]
> 
> From: "Yu, Luming" <luming.yu@intel.com>
> > I suggest you to retest, and post dmesg with UN-modified BIOS.
> 
> I'm now running/testing an unmodified DSDT with 2.6.16-rc5.  For a while
> I had no S3 hangs, but I just noticed them again.  The error is the same
> as with the modified DSDT (with slightly different offsets):
> 
> exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
> exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0
> exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2
> 
> repeated endlessly.

This sounds like the problem Daniel had on his Samsung P35 recently.
He could fix it by getting rid of some asus_unhide_smbus stuff or the
other way around, adding asus_unhide_smbus quirks in the S3 resume code.

This thread was recently posted on lkml:
Re: [patch] smbus unhiding kills thermal management

Here are some more details, for me that sounds related...:
https://bugzilla.novell.com/show_bug.cgi?id=173420

     Thomas



* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-04-05  3:03 Yu, Luming
  0 siblings, 0 replies; 86+ messages in thread
From: Yu, Luming @ 2006-04-05  3:03 UTC (permalink / raw)
  To: Sanjoy Mahajan, linux-acpi; +Cc: linux-kernel, Andrew Morton, Brown, Len


>diff -r ac486e270597 -r abd89292c539 drivers/acpi/osl.c
>--- a/drivers/acpi/osl.c	Sat Mar 18 08:35:34 2006 -0500
>+++ b/drivers/acpi/osl.c	Thu Mar 30 10:59:57 2006 -0500
>@@ -634,6 +634,8 @@ static void acpi_os_execute_deferred(voi
> 	return_VOID;
> }
> 
>+extern int acpi_in_suspend;
>+
> acpi_status
> acpi_os_queue_for_execution(u32 priority,
> 			    acpi_osd_exec_callback function, void *context)
>@@ -643,6 +645,8 @@ acpi_os_queue_for_execution(u32 priority
> 	struct work_struct *task;
> 
> 	ACPI_FUNCTION_TRACE("os_queue_for_execution");
>+	if (acpi_in_suspend)	/* in case kacpid is causing the queue */
>+		return_ACPI_STATUS(AE_OK);

The request will be dropped silently, which is ugly.
At least, you need to put a warning here.
The long-term solution is to fix the invoker so that it does NOT ask
kacpid to invoke AML methods during the suspend-resume period.
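
One possible shape for "drop, but not silently" is sketched below as a
userspace model.  It is a sketch under assumptions, not the ACPI OSL code:
deferred_queue, queue_for_execution, and the QUEUE_* status values are
hypothetical names, and returning a distinct status (instead of AE_OK as
in the patch) is an extra choice made here so the caller can tell that the
request was refused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum queue_status { QUEUE_OK = 0, QUEUE_DROPPED = 1 };

/* Hypothetical stand-in for the deferred-execution queue. */
struct deferred_queue {
	bool in_suspend;	/* plays the role of acpi_in_suspend */
	unsigned int pending;	/* requests accepted and not yet run */
	unsigned int dropped;	/* requests refused during suspend */
};

/* Refuse new work while suspending, but count and report each drop
 * so the real invoker can be tracked down later. */
enum queue_status queue_for_execution(struct deferred_queue *q)
{
	if (q->in_suspend) {
		q->dropped++;
		fprintf(stderr,
			"deferred request dropped during suspend (%u so far)\n",
			q->dropped);
		return QUEUE_DROPPED;
	}
	q->pending++;
	return QUEUE_OK;
}
```

The drop counter makes it possible to see after resume how often the
guard fired, which is a hint about how much work the long-term fix in
the invoker still has to cover.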

> 
> 	ACPI_DEBUG_PRINT((ACPI_DB_EXEC,
> 			  "Scheduling function [%p(%p)] for deferred execution.\n",
>diff -r ac486e270597 -r abd89292c539 drivers/acpi/sleep/main.c
>--- a/drivers/acpi/sleep/main.c	Sat Mar 18 08:35:34 2006 -0500
>+++ b/drivers/acpi/sleep/main.c	Thu Mar 30 10:59:57 2006 -0500
>@@ -19,6 +19,12 @@
> #include <acpi/acpi_drivers.h>
> #include "sleep.h"
> 
>+/* for functions putting machine to sleep to know that we're
>+   suspending, so that they can be careful about what AML methods they
>+   invoke (to avoid trying untested BIOS code paths) */
>+int acpi_in_suspend;
>+EXPORT_SYMBOL(acpi_in_suspend);
>+
> u8 sleep_states[ACPI_S_STATE_COUNT];
> 
> static struct pm_ops acpi_pm_ops;
>@@ -55,6 +61,8 @@ static int acpi_pm_prepare(suspend_state
> 		printk("acpi_pm_prepare does not support %d \n", pm_state);
> 		return -EPERM;
> 	}
>+	acpi_os_wait_events_complete(NULL);
>+	acpi_in_suspend = TRUE;
> 	return acpi_sleep_prepare(acpi_state);

There is a race condition here.
Probably it should be:
	acpi_in_suspend = TRUE;
	acpi_os_wait_events_complete(NULL);
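
Why the order matters can be shown with a deterministic userspace model.
This is a sketch under assumptions: suspend_model and the model_*/prepare_*
functions are hypothetical names, model_drain() only stands in for
acpi_os_wait_events_complete(), and the late_event argument simulates a
request that arrives between the two steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the flag plus the deferred-work queue. */
struct suspend_model {
	bool in_suspend;	/* plays the role of acpi_in_suspend */
	unsigned int pending;	/* queued, not yet executed requests */
};

/* Mirrors the patched queue function: refuse work while suspending. */
bool model_queue(struct suspend_model *m)
{
	if (m->in_suspend)
		return false;
	m->pending++;
	return true;
}

/* Stand-in for waiting until all queued requests have completed. */
void model_drain(struct suspend_model *m)
{
	m->pending = 0;
}

/* Order in the posted patch: drain first, then set the flag. */
unsigned int prepare_drain_then_flag(struct suspend_model *m, bool late_event)
{
	model_drain(m);
	if (late_event)
		model_queue(m);	/* accepted: the flag is not set yet */
	m->in_suspend = true;
	return m->pending;
}

/* Suggested order: set the flag first, then drain. */
unsigned int prepare_flag_then_drain(struct suspend_model *m, bool late_event)
{
	m->in_suspend = true;
	if (late_event)
		model_queue(m);	/* refused: the flag is already set */
	model_drain(m);
	return m->pending;
}
```

With a late event, prepare_drain_then_flag() leaves one request pending
when the machine goes to sleep, while prepare_flag_then_drain() leaves
none.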

> }
> 
>@@ -132,6 +140,7 @@ static int acpi_pm_finish(suspend_state_
> 	u32 acpi_state = acpi_suspend_states[pm_state];
> 
> 	acpi_leave_sleep_state(acpi_state);
>+	acpi_in_suspend = FALSE;
> 	acpi_disable_wakeup_device(acpi_state);
> 
> 	/* reset firmware waking vector */
>diff -r ac486e270597 -r abd89292c539 drivers/acpi/thermal.c
>--- a/drivers/acpi/thermal.c	Sat Mar 18 08:35:34 2006 -0500
>+++ b/drivers/acpi/thermal.c	Thu Mar 30 10:59:57 2006 -0500
>@@ -79,6 +79,8 @@ static int tzp;
> static int tzp;
> module_param(tzp, int, 0);
> MODULE_PARM_DESC(tzp, "Thermal zone polling frequency, in 1/10 seconds.\n");
>+
>+extern int acpi_in_suspend;
> 
> static int acpi_thermal_add(struct acpi_device *device);
> static int acpi_thermal_remove(struct acpi_device *device, int type);
>@@ -683,6 +685,8 @@ static void acpi_thermal_run(unsigned lo
> static void acpi_thermal_run(unsigned long data)
> {
> 	struct acpi_thermal *tz = (struct acpi_thermal *)data;
>+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
>+		return_VOID;	/* so don't do them */

If you fixed kacpid, then this part could be removed.

> 	if (!tz->zombie)
> 		acpi_os_queue_for_execution(OSD_PRIORITY_GPE,
> 					    acpi_thermal_check, (void *)data);
>@@ -705,6 +709,8 @@ static void acpi_thermal_check(void *dat
> 
> 	state = tz->state;
> 
>+	if (acpi_in_suspend)
>+		return_VOID;

Could this cause trouble for the caller?

> 	result = acpi_thermal_get_temperature(tz);
> 	if (result)
> 		return_VOID;
>@@ -1224,6 +1230,9 @@ static void acpi_thermal_notify(acpi_han
> 	struct acpi_device *device = NULL;
> 
> 	ACPI_FUNCTION_TRACE("acpi_thermal_notify");
>+
>+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
>+		return_VOID;	/* so don't do them */

Could this cause trouble for the caller?

> 
> 	if (!tz)
> 		return_VOID;


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-24  1:31 Yu, Luming
@ 2006-04-04  6:49 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-04-04  6:49 UTC (permalink / raw)
  To: linux-acpi; +Cc: linux-kernel, Yu, Luming, Andrew Morton, Brown, Len

Some light at the end of a long tunnel of debugging, so the end of this
email has a patch for review.  This was done with lots of help from
Luming Yu.  The discussions that went off list for a while are on the
bugzilla page [kernel bugzilla #5989].

When ec_intr=1 became the default, my TP 600X started hanging on S3
sleep (in _PTS()).  With a minimal kernel config, just a few acpi
modules built, it would hang on the first sleep...only if the thermal
module was loaded.  But that was with a hacked DSDT, so perhaps the bug
was due to the DSDT hacks.

With the vanilla (latest BIOS, v1.11) DSDT, it wouldn't hang in the
minimal configuration at all.  But with my regular, production
configuration it would hang on the 2nd sleep, so my DSDT hacks probably
helped expose, but not create the problem.  (Some of the modifications
were to fix thermal polling.)

So I hacked thermal so that it would load only a specified thermal zone
(one of THM{0,2,6,7}) -- given as a module parameter -- and also would
stop partway through setting up the thermal zone, with how far to get
specified by another thermal parameter.  Experiments with this modified
module showed that the bug appeared even when THM0 was the only zone,
and disappeared if no TMP method was called.

Perhaps there is more than one bug, but THM0 has at least one, so I did
most of the remaining testing with just THM0.  With just THM0 and a
kernel that returned 27 C for any temperature -- without calling ACPI
methods -- the system wouldn't hang.  Bisecting within the THM0 method
in the DSDT eventually showed that the trouble came in the _TMP method:

            Method (_TMP, 0, NotSerialized)
            {
                \_SB.PCI0.ISA0.EC0.UPDT ()
                Store (\_SB.PCI0.ISA0.EC0.TMP0, Local0)
                If (LGreater (Local0, 0x0AAC))
                {
                    Return (Local0)
                }
                Else
                {
                    Return (0x0BB8)
                }
            }
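
For readers decoding the constants: ACPI's _TMP returns the temperature
in tenths of a kelvin, so 0x0AAC is 2732 (about 0 C) and 0x0BB8 is 3000
(about 27 C).  The method therefore passes through any reading above
freezing and substitutes roughly 27 C otherwise.  A minimal C rendering
of that logic follows; tmp_filter and dk_to_deci_celsius are hypothetical
helper names, and the 273.2 K offset used for the Celsius conversion is
an assumption of this sketch.

```c
#include <assert.h>

#define TMP_FLOOR_DK	0x0AAC	/* 2732 tenths of a kelvin, about 0 C */
#define TMP_FALLBACK_DK	0x0BB8	/* 3000 tenths of a kelvin, about 27 C */

/* Mirrors the If/Else in THM0._TMP: readings at or below the floor
 * are replaced by the fallback value. */
unsigned int tmp_filter(unsigned int raw_dk)
{
	return raw_dk > TMP_FLOOR_DK ? raw_dk : TMP_FALLBACK_DK;
}

/* Convert tenths of a kelvin to tenths of a degree Celsius, taking
 * 273.2 K as 0 C (an assumption of this sketch). */
int dk_to_deci_celsius(unsigned int dk)
{
	return (int)dk - 2732;
}
```

For example, a raw reading of 0 from the EC comes back as 3000, which
converts to 268 tenths of a degree, i.e. 26.8 C.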

Further bisection showed that commenting out the EC0.UPDT() stopped the
hangs.  By the way, only THM0._TMP does an EC0.UPDT(); THM{2,6,7} just
load the temperatures from the EC (so with thermal polling only on, say,
THM2, the system won't notice when temperatures get too high -- you have
to turn it on for THM0 -- and the DSDT hack mentioned above worked
around this bug).  But back to bisecting BIOS-supplied DSDT.

EC0.UPDT() is this:

                    Method (UPDT, 0, NotSerialized)
                    {
                        If (IGNR)
                        {
                            Decrement (IGNR)
                        }
                        Else
                        {
                            If (H8DR)
                            {
                                If (Acquire (I2CM, 0x0064)) {}
                                Else
                                {
                                    Store (I2RB (Zero, 0x01, 0x04), Local7)
                                    If (Local7)
                                    {
                                        Fatal (0x01, 0x80000003, Local7)
                                    }
                                    Else
                                    {
                                        Store (HBS0, TMP0)
                                        Store (HBS2, TMP2)
                                        Store (HBS6, TMP6)
                                        Store (HBS7, TMP7)
                                    }

                                    Release (I2CM)
                                }
                            }
                        }
                    }

Bisecting within it showed that 

  Store (I2RB (Zero, 0x01, 0x04), Local7)

caused problems.  It looks like:

                    Method (I2RB, 3, NotSerialized)
                    {
                        Store (Arg0, HCSL)
                        Store (ShiftLeft (Arg1, 0x01), HMAD)
                        Store (Arg2, HMCM)
                        Store (0x0B, HMPR)
                        Return (CHKS ())
                    }

For hacking, I made an I2RBcopy() method and an UPDTcopy() for
THM0._TMP() to call, which would call I2RBcopy() instead of I2RB().  By
the way, the actual method names had to be four characters or the iasl
compiler got unhappy, but I'll use the longer forms for clarity.

This debugging version of I2RBcopy was useful:

                    Method (I2RBcopy, 3, NotSerialized)
                    {
                        Store (Arg0, HCSL)
                        Store (ShiftLeft (Arg1, 0x01), HMAD)
                        Store (Arg2, HMCM)
                        Store (0x0B, HMPR)
			Store(local0, Debug)
                        Return (local0)
                        Return (CHKS ())
                    }

It showed -- once I found the right debug_level/debug_layer combination
(layer=0x90, level=0x1F) -- that I2RBcopy was called during wakeup in a
strange spot.  You'd expect it to be called after a THM0._TMP(), but it
would be called during the _SST() method; see comment #43 at bug 5989
and the dmesgs attached there, of which the relevant extract is:

Execute Method: [\_WAK] (Node e3f8aa48)
Execute Method: [\_TZ_.THM0._PSV] (Node e3f8bdc8)
Execute Method: [\_TZ_.THM0._TC1] (Node e3f8bd48)
Execute Method: [\_TZ_.THM0._TC2] (Node e3f8bd08)
Execute Method: [\_TZ_.THM0._TSP] (Node e3f8bcc8)
Execute Method: [\_TZ_.THM0._AC0] (Node e3f8bec8)
Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf08)
Execute Method: [\_SI_._SST] (Node e3f8a848)
# why does the Debug line show up here, and not right after the THM0._TMP line?
[ACPI Debug]  Integer: 0x00000000
Execute Method: [\_SB_.LID0._PSW] (Node c1574808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1574708)

So the next step was to try the acpi_serialize boot option to make the
method calls more like how Windows interprets ACPI code (presumably well
tested when the BIOS was made).  The system still hung on the 2nd sleep.

One explanation was that kacpid was jumping the gun and the wakeup
methods were not running in the right order.  So after several hacks and
tests of them, the changes converged to the following (see the diff for
the full details):

- Create an exported symbol: int acpi_in_suspend;

- In acpi_pm_prepare(), set acpi_in_suspend just before preparing to
  sleep. 

- In acpi_pm_finish, unset acpi_in_suspend after leaving sleep state.

- If acpi_in_suspend is true:
  -- Don't do anything in acpi_os_queue_for_execution()
  -- or in acpi_thermal_run()
  -- or in acpi_thermal_check()
  -- or in acpi_thermal_notify()

By itself, and returning to the vanilla DSDT, this change seemed to fix
the problem: The system lasted through two sleep/wake cycles, which was
encouraging.  Alas, it hung on the 4th sleep.

But there's good news.  I used that patch with

   acpi_os_name="Microsoft Windows" acpi_serialize

and I haven't been able to hang it (again, now with the vanilla DSDT).
[Thanks to Len Brown for the recent ACPI changes that document
acpi_os_name.]  

I did 14 sleep/wake cycles with no problem.  The first eight cycles were
with debug_{level,layer}=0x10, which was the combination that hung more
easily than layer=0x90,level=0x1F.  To check whether the debug params
matter, I did the last six cycles with layer=0x90,level=0x1F -- also
fine.

To further flush out any bug, I turned on thermal polling: every 1
second for each of the four thermal zones (starting them 0.25 seconds
apart to maximize the chance that a thermal poll happens at a dangerous
time during the suspend or wake).  Six further cycles worked fine.  Then
I unloaded and loaded the thermal module, which often helps produce
hangs on the next suspend, and that was fine for six more cycles.  I
haven't been able to hang it at all.

To summarize:

1. hangs on 4th S3 : patch in #63.
2. hangs on 2nd S3 : vanilla + acpi_os_name="Microsoft Windows"
3. hangs on 2nd S3 : vanilla + acpi_os_name="Microsoft Windows" acpi_serialize
4. doesn't hang    :   patch + acpi_os_name="Microsoft Windows"

Probably the patch + acpi_serialize will hang too, but I'm not sure.
The unclean version of the patch (i.e. with debugging printk's) did
hang with just acpi_serialize.

The slight problems: the fan state and polling frequencies are garbled
on resume.  By fan state being garbled, I mean that it is sometimes on
even though the temps are all below the trip points, and acpi -t
reports that the fan is off.  So the system doesn't have the right fan
state.  Similarly with the thermal polling: It reports 1 second for
each zone, but it's not polling at all (no TMP methods reported in the
dmesgs).  If I set the polling intervals to 1 second, then it starts
polling again.

Also, I need to wake it up using the power switch instead of the Fn
key.  Using the power switch produces a couple sharp clicks on the
speaker.

Here's the diff.  It needs review.  For example, are these changes
solving the real problem, or do they just smother it under a bunch of
hacks?  If the basic idea of acpi_in_suspend is sound, are there other
areas that need to check for it?  For example, should EC UPDT() be
disabled during sleep/wake?

Is the change something that other machines should use by default, or
should use if sleep hangs, or should avoid?  My suspicion is that it
solves a real problem, which will be triggered on some machines in some
circumstances, although whether it's the whole problem I'm not sure.  I
had one (unreproducible) S3 hang with a vanilla kernel even though
thermal was being unloaded every time by that sleep script (the other
hangs discussed above had thermal loaded).  So the _TMP story isn't the
whole one.

On the other hand, the patch is at least part of the story.  For
example, bug 5037 <http://bugzilla.kernel.org/show_bug.cgi?id=5037#c17>,
where pre-emptible kernels would often hang going to sleep, is probably
related to this problem of methods running out of order.  [Although I
haven't tried recompiling with all preemption turned on to see whether
that problem goes away.]



diff -r ac486e270597 -r abd89292c539 drivers/acpi/osl.c
--- a/drivers/acpi/osl.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/osl.c	Thu Mar 30 10:59:57 2006 -0500
@@ -634,6 +634,8 @@ static void acpi_os_execute_deferred(voi
 	return_VOID;
 }
 
+extern int acpi_in_suspend;
+
 acpi_status
 acpi_os_queue_for_execution(u32 priority,
 			    acpi_osd_exec_callback function, void *context)
@@ -643,6 +645,8 @@ acpi_os_queue_for_execution(u32 priority
 	struct work_struct *task;
 
 	ACPI_FUNCTION_TRACE("os_queue_for_execution");
+	if (acpi_in_suspend)	/* in case kacpid is causing the queue */
+		return_ACPI_STATUS(AE_OK);
 
 	ACPI_DEBUG_PRINT((ACPI_DB_EXEC,
 			  "Scheduling function [%p(%p)] for deferred execution.\n",
diff -r ac486e270597 -r abd89292c539 drivers/acpi/sleep/main.c
--- a/drivers/acpi/sleep/main.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/sleep/main.c	Thu Mar 30 10:59:57 2006 -0500
@@ -19,6 +19,12 @@
 #include <acpi/acpi_drivers.h>
 #include "sleep.h"
 
+/* for functions putting machine to sleep to know that we're
+   suspending, so that they can be careful about what AML methods they
+   invoke (to avoid trying untested BIOS code paths) */
+int acpi_in_suspend;
+EXPORT_SYMBOL(acpi_in_suspend);
+
 u8 sleep_states[ACPI_S_STATE_COUNT];
 
 static struct pm_ops acpi_pm_ops;
@@ -55,6 +61,8 @@ static int acpi_pm_prepare(suspend_state
 		printk("acpi_pm_prepare does not support %d \n", pm_state);
 		return -EPERM;
 	}
+	acpi_os_wait_events_complete(NULL);
+	acpi_in_suspend = TRUE;
 	return acpi_sleep_prepare(acpi_state);
 }
 
@@ -132,6 +140,7 @@ static int acpi_pm_finish(suspend_state_
 	u32 acpi_state = acpi_suspend_states[pm_state];
 
 	acpi_leave_sleep_state(acpi_state);
+	acpi_in_suspend = FALSE;
 	acpi_disable_wakeup_device(acpi_state);
 
 	/* reset firmware waking vector */
diff -r ac486e270597 -r abd89292c539 drivers/acpi/thermal.c
--- a/drivers/acpi/thermal.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/thermal.c	Thu Mar 30 10:59:57 2006 -0500
@@ -79,6 +79,8 @@ static int tzp;
 static int tzp;
 module_param(tzp, int, 0);
 MODULE_PARM_DESC(tzp, "Thermal zone polling frequency, in 1/10 seconds.\n");
+
+extern int acpi_in_suspend;
 
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
@@ -683,6 +685,8 @@ static void acpi_thermal_run(unsigned lo
 static void acpi_thermal_run(unsigned long data)
 {
 	struct acpi_thermal *tz = (struct acpi_thermal *)data;
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+		return_VOID;	/* so don't do them */
 	if (!tz->zombie)
 		acpi_os_queue_for_execution(OSD_PRIORITY_GPE,
 					    acpi_thermal_check, (void *)data);
@@ -705,6 +709,8 @@ static void acpi_thermal_check(void *dat
 
 	state = tz->state;
 
+	if (acpi_in_suspend)
+		return_VOID;
 	result = acpi_thermal_get_temperature(tz);
 	if (result)
 		return_VOID;
@@ -1224,6 +1230,9 @@ static void acpi_thermal_notify(acpi_han
 	struct acpi_device *device = NULL;
 
 	ACPI_FUNCTION_TRACE("acpi_thermal_notify");
+
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+		return_VOID;	/* so don't do them */
 
 	if (!tz)
 		return_VOID;

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-24  1:31 Yu, Luming
  2006-04-04  6:49 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-24  1:31 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> Does it mean we need to slow down  acpi_ec_intr_read/write ?
>> Could you try to insert acpi_os_stall (100)  after  ACPI_DEBUG_PRINT
>> statement both in acpi_ec_intr_read/write.
>
>I added that line in those two places.  The result refused to hang with
>acpi_debug_layer=0x00100010, but it did hang (on the usual second sleep)
>with it set to 0x10.

Really strange how several printks could change the results.
Could you try replacing acpi_os_stall with acpi_os_sleep(1)
in acpi_ec_intr_read/write?

>
>> Hmmm, then I cannot get the ec access log for hang case?!
>
>It seems difficult, but let's keep trying if you have other ideas for
>how to get it.
>

Also, please change I2RB copy to: 

                    Method (I2RBcopy, 3, NotSerialized)
                    {
                        Store (Arg0, HCSL)
                        Store (ShiftLeft (Arg1, 0x01), HMAD)
                        Store (Arg2, HMCM)
                        Store (0x0B, HMPR)

                        Store (CHKS (), Local0)
                        Store (Local0, Debug)

                        Return (Local0)
                    }
Then boot with acpi_dbg_layer=0x10 acpi_dbg_level=0x10 and post the full
logs (unedited) for both the non-hang case and the hang case on bugzilla.
There should be some clues in them.

Thanks,
Luming

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-22  4:58 Yu, Luming
  2006-03-22  5:13 ` Sanjoy Mahajan
@ 2006-03-24  1:17 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-24  1:17 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> So perhaps I should bisect in _SST and put in the debug lines there?
>> Here's another idea, which is a terrible hack.  But there are lots of
>> lines in the DSDT like
>>    If (LOr (SPS, WNTF))
>> which I imagine is saying "If something or if WinNT".  So,
>> what if Linux
>> pretends to be WinNT (or W98F -- which is another common
>> test), at least
>> for the 600x?  Maybe those code paths are known to work.
>> 

> Yes, you can try that.

I tried the patch below.  

It went to sleep fine.  It wouldn't wake up with the Fn key or
closing/opening the lid (both methods wake up Linux sleep).  But the
power switch did the trick, and it made it through most of the wakeup.
But the screen came back full of garbage and not taking keyboard
input (at least not to X), and it might have been stuck in PCI0._INI.
That was the last Execute Method on the serial console, and after that
line it was just printing dots one at a time: .......

So I had to power it down (by holding the power switch until it turned
off) and could never try the second sleep.

-Sanjoy


summary:     Pretend to be Windows 98 in DSDT.

diff -r bf1b330b9a7f -r 8109ef6f6d19 dsdt/600x.dsl
--- a/dsdt/600x.dsl	Tue Mar 21 12:11:19 2006 -0500
+++ b/dsdt/600x.dsl	Thu Mar 23 19:49:10 2006 -0500
@@ -1090,7 +1090,7 @@ DefinitionBlock ("DSDT.aml", "DSDT", 1, 
             })
             Method (_INI, 0, NotSerialized)
             {
-                If (LEqual (SCMP (\_OS, "Microsoft Windows"), Zero))
+                If (One) /* LEqual (SCMP (\_OS, "Microsoft Windows"), Zero)) */
                 {
                     Store (One, W98F)
                 }





^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-23  9:10 Yu, Luming
@ 2006-03-23 19:19 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-23 19:19 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> Does it mean we need to slow down  acpi_ec_intr_read/write ?
> Could you try to insert acpi_os_stall (100)  after  ACPI_DEBUG_PRINT
> statement both in acpi_ec_intr_read/write.

I added that line in those two places.  The result refused to hang with
acpi_debug_layer=0x00100010, but it did hang (on the usual second sleep)
with it set to 0x10.

> Hmmm, then I cannot get the ec access log for hang case?!

It seems difficult, but let's keep trying if you have other ideas for
how to get it.

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenal

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-23  9:10 Yu, Luming
  2006-03-23 19:19 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-23  9:10 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>   Good, then the hang should be caused by:
>
>			     Store (Arg0, HCSL)
>			     Store (ShiftLeft (Arg1, 0x01), HMAD)
>			     Store (Arg2, HMCM)
>			     Store (0x0B, HMPR)
>
>   Could you add this at the beginning of this block:
>	   Store (Arg0,  Debug)
>   And add this at the end of this block:
>	   Store( HMPR, Debug)
>
>I added those two lines to the DSDT with only THM0 zone, but with
>nothing else commented out.  Below are the dmesgs for one sleep-wake
>cycle, plus an 'acpi -t'.  I thought it would hang if I did one more
>cycle, but it didn't.  So I tried five more, and it was fine too.
>
>Then I reset /proc/acpi/acpi_debug_layer to 0x10 (the boot parameter is
>acpi_dbg_layer although the /proc file is acpi_debug_layer), and
>unloaded and reloaded the thermal module.  And it hung in the (expected)
>two cycles.  I've seen this behavior before: It won't hang with lots of
>debugging turned on, but it does hang with less debugging.  Strange!

Hmmm, then I cannot get the ec access log for hang case?!

	acpi_hw_low_level_read(8, data, &ec->common.data_addr);
	ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Read [%02x] from address [%02x]\n",
			  *data, address));

Does it mean we need to slow down  acpi_ec_intr_read/write ?
Could you try to insert acpi_os_stall (100)  after  ACPI_DEBUG_PRINT
statement both in acpi_ec_intr_read/write.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-23  4:46 Yu, Luming
@ 2006-03-23  6:25 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-23  6:25 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

   Good, then the hang should be caused by:

			     Store (Arg0, HCSL)
			     Store (ShiftLeft (Arg1, 0x01), HMAD)
			     Store (Arg2, HMCM)
			     Store (0x0B, HMPR)

   Could you add this at the beginning of this block:
	   Store (Arg0,  Debug)
   And add this at the end of this block:
	   Store( HMPR, Debug)

I added those two lines to the DSDT with only THM0 zone, but with
nothing else commented out.  Below are the dmesgs for one sleep-wake
cycle, plus an 'acpi -t'.  I thought it would hang if I did one more
cycle, but it didn't.  So I tried five more, and it was fine too.

Then I reset /proc/acpi/acpi_debug_layer to 0x10 (the boot parameter is
acpi_dbg_layer although the /proc file is acpi_debug_layer), and
unloaded and reloaded the thermal module.  And it hung in the (expected)
two cycles.  I've seen this behavior before: It won't hang with lots of
debugging turned on, but it does hang with less debugging.  Strange!

> Yes, that's good idea to have separate i2rb copy for THM0 which we are
> hacking.

I tried that, but the ACPI system got a bit sick.  It didn't support any
Sn states, for example.  So I must have done something wrong and I'll
come back to the idea later.

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenal


eth0: removing device
Unloaded prism54 driver
PM: Preparing system for mem sleep
Stopping tasks: ====================================================|
Execute Method: [\_SB_.LID0._PSW] (Node c1574808)
 acpi_ec-0458 [23] ec_intr_read          : Read [02] from address [32]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [06] to address [32]
Execute Method: [\_SB_.SLPB._PSW] (Node c1574708)
Execute Method: [\_S3_] (Node e3f8a988)
Execute Method: [\_PTS] (Node e3f8ab48)
 acpi_ec-0458 [23] ec_intr_read          : Read [d6] from address [00]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [96] to address [00]
 acpi_ec-0458 [23] ec_intr_read          : Read [9f] from address [05]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [9e] to address [05]
 acpi_ec-0458 [22] ec_intr_read          : Read [87] from address [16]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [07] to address [16]
 acpi_ec-0458 [22] ec_intr_read          : Read [2f] from address [17]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [2e] to address [17]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [54]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [55]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [56]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [57]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [58]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [59]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [5a]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [00] to address [5b]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [5c]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [5d]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [5e]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [5f]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [00] to address [60]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [83] to address [61]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [00] to address [62]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [00] to address [63]
 acpi_ec-0458 [23] ec_intr_read          : Read [00] from address [3a]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [00] to address [3a]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [02] to address [52]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [09] to address [53]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [10] to address [74]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [0a] to address [50]
 acpi_ec-0458 [22] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [22] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [22] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [23] ec_intr_read          : Read [06] from address [32]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [16] to address [32]
Execute Method: [\_SI_._SST] (Node e3f8a8c8)
 acpi_ec-0508 [23] ec_intr_write         : Wrote [03] to address [06]
 acpi_ec-0458 [23] ec_intr_read          : Read [00] from address [3a]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [01] to address [3a]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [01] to address [0e]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [00] to address [0d]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [00] to address [0c]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [80] to address [0e]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [00] to address [0d]
 acpi_ec-0508 [23] ec_intr_write         : Wrote [80] to address [0c]
uhci_hcd 0000:00:07.2: suspend_rh
uhci_hcd 0000:00:07.2: uhci_suspend
uhci_hcd 0000:00:07.2: --> PCI D0/legacy
PM: Entering mem sleep
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Back to C!
PM: Finishing wakeup.
Execute Method: [\_GPE._L0B] (Node e3f8a848)
 acpi_ec-0458 [23] ec_intr_read          : Read [10] from address [4e]
PCI: Found IRQ 11 for device 0000:00:02.0
PCI: Sharing IRQ 11 with 0000:00:06.0
PCI: Sharing IRQ 11 with 0000:01:00.0
 acpi_ec-0741 [06] ec_gpe_intr_query     : Evaluating _Q42
Execute Method: [\_SB_.PCI0.ISA0.EC0_._Q42] (Node e3f82408)
 acpi_ec-0458 [26] ec_intr_read          : Read [01] from address [3a]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [01] to address [3a]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [02] to address [52]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [04] to address [53]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [0b] to address [50]
 acpi_ec-0458 [25] ec_intr_read          : Read [0b] from address [50]
 acpi_ec-0458 [25] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [25] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [25] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [25] ec_intr_read          : Read [80] from address [54]
 acpi_ec-0458 [25] ec_intr_read          : Read [0c] from address [55]
 acpi_ec-0458 [25] ec_intr_read          : Read [4e] from address [58]
 acpi_ec-0458 [25] ec_intr_read          : Read [0c] from address [59]
 acpi_ec-0458 [25] ec_intr_read          : Read [f4] from address [60]
 acpi_ec-0458 [25] ec_intr_read          : Read [0b] from address [61]
 acpi_ec-0458 [25] ec_intr_read          : Read [08] from address [62]
 acpi_ec-0458 [25] ec_intr_read          : Read [0c] from address [63]
PCI: Found IRQ 11 for device 0000:00:02.1
uhci_hcd 0000:00:07.2: PCI legacy resume
PCI: Found IRQ 11 for device 0000:00:07.2
uhci_hcd 0000:00:07.2: uhci_resume
uhci_hcd 0000:00:07.2: uhci_check_and_reset_hc: legsup = 0x2000
uhci_hcd 0000:00:07.2: Performing full reset
usb usb1: root hub lost power or was reset
uhci_hcd 0000:00:07.2: suspend_rh
usb usb1: finish resume
uhci_hcd 0000:00:07.2: wakeup_rh
Restarting tasks...<7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
 done
Execute Method: [\_SI_._SST] (Node e3f8a8c8)
 acpi_ec-0508 [26] ec_intr_write         : Wrote [01] to address [0e]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [00] to address [0d]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [01] to address [0c]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [80] to address [0e]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [80] to address [0d]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [80] to address [0c]
Execute Method: [\_WAK] (Node e3f8aac8)
 acpi_ec-0458 [26] ec_intr_read          : Read [96] from address [00]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [d6] to address [00]
 acpi_ec-0458 [26] ec_intr_read          : Read [16] from address [32]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [16] to address [32]
 acpi_ec-0458 [26] ec_intr_read          : Read [16] from address [32]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [06] to address [32]
 acpi_ec-0458 [26] ec_intr_read          : Read [06] from address [32]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [06] to address [32]
 acpi_ec-0458 [25] ec_intr_read          : Read [34] from address [36]
 acpi_ec-0458 [25] ec_intr_read          : <7>uhci_hcd 0000:00:07.2: suspend_rh (auto-stop)
Read [14] from address [34]
 acpi_ec-0458 [25] ec_intr_read          : Read [07] from address [16]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [87] to address [16]
 acpi_ec-0458 [26] ec_intr_read          : Read [01] from address [3a]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [01] to address [3a]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [81] to address [52]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [00] to address [53]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [07] to address [50]
 acpi_ec-0458 [25] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [25] ec_intr_read          : Read [90] from address [51]
 acpi_ec-0458 [25] ec_intr_read          : Read [90] from address [51]
 acpi_ec-0458 [25] ec_intr_read          : Read [90] from address [51]
 acpi_ec-0458 [26] ec_intr_read          : Read [06] from address [32]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [06] to address [32]
 acpi_ec-0458 [25] ec_intr_read          : Read [0f] from address [28]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [8f] to address [28]
 acpi_ec-0458 [25] ec_intr_read          : Read [0f] from address [28]
 acpi_ec-0458 [25] ec_intr_read          : Read [0f] from address [28]
 acpi_ec-0458 [25] ec_intr_read          : Read [0f] from address [28]
 acpi_ec-0458 [25] ec_intr_read          : Read [86] from address [39]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [18] to address [0e]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [00] to address [0d]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [10] to address [0c]
 acpi_ec-0458 [25] ec_intr_read          : Read [28] from address [17]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [29] to address [17]
 acpi_ec-0458 [26] ec_intr_read          : Read [01] from address [3a]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [01] to address [3a]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [02] to address [52]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [04] to address [53]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [0b] to address [50]
 acpi_ec-0458 [25] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [25] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [25] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [25] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [25] ec_intr_read          : Read [80] from address [54]
 acpi_ec-0458 [25] ec_intr_read          : Read [0c] from address [55]
 acpi_ec-0458 [25] ec_intr_read          : Read [4e] from address [58]
 acpi_ec-0458 [25] ec_intr_read          : Read [0c] from address [59]
 acpi_ec-0458 [25] ec_intr_read          : Read [f4] from address [60]
 acpi_ec-0458 [25] ec_intr_read          : Read [0b] from address [61]
 acpi_ec-0458 [25] ec_intr_read          : Read [08] from address [62]
 acpi_ec-0458 [25] ec_intr_read          : Read [0c] from address [63]
Execute Method: [\_TZ_.THM0._PSV] (Node e3f8be48)
Execute Method: [\_TZ_.THM0._TC1] (Node e3f8bdc8)
Execute Method: [\_TZ_.THM0._TC2] (Node e3f8bd88)
Execute Method: [\_TZ_.THM0._TSP] (Node e3f8bd48)
Execute Method: [\_TZ_.THM0._AC0] (Node e3f8bf48)
 acpi_ec-0458 [34] ec_intr_read          : Read [00] from address [20]
Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf88)
 acpi_ec-0458 [36] ec_intr_read          : Read [01] from address [3a]
 acpi_ec-0508 [36] ec_intr_write         : Wrote [01] to address [3a]
 acpi_ec-0508 [36] ec_intr_write         : Wrote [02] to address [52]
 acpi_ec-0508 [36] ec_intr_write         : Wrote [04] to address [53]
 acpi_ec-0508 [36] ec_intr_write         : Wrote [0b] to address [50]
 acpi_ec-0458 [35] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [35] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [35] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [35] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [35] ec_intr_read          : Read [8a] from address [54]
 acpi_ec-0458 [35] ec_intr_read          : Read [0c] from address [55]
 acpi_ec-0458 [35] ec_intr_read          : Read [4e] from address [58]
 acpi_ec-0458 [35] ec_intr_read          : Read [0c] from address [59]
 acpi_ec-0458 [35] ec_intr_read          : Read [f4] from address [60]
 acpi_ec-0458 [35] ec_intr_read          : Read [0b] from address [61]
 acpi_ec-0458 [35] ec_intr_read          : Read [08] from address [62]
 acpi_ec-0458 [35] ec_intr_read          : Read [0c] from address [63]
Execute Method: [\_SI_._SST] (Node e3f8a8c8)
 acpi_ec-0458 [26] ec_intr_read          : Read [01] from address [3a]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [00] to address [3a]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [05] to address [06]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [01] to address [0e]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [00] to address [0d]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [01] to address [0c]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [80] to address [0e]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [00] to address [0d]
 acpi_ec-0508 [26] ec_intr_write         : Wrote [00] to address [0c]
Execute Method: [\_SB_.LID0._PSW] (Node c1574808)
 acpi_ec-0458 [27] ec_intr_read          : Read [06] from address [32]
 acpi_ec-0508 [27] ec_intr_write         : Wrote [02] to address [32]
Execute Method: [\_SB_.SLPB._PSW] (Node c1574708)
ds: ds_open(socket 0)
ds: ds_open(socket 1)
ds: ds_open(socket 2)
pccard: card ejected from slot 1
PCMCIA: socket e36a8828: *** DANGER *** unable to remove socket power
ds: ds_release(socket 0)
ds: ds_release(socket 1)
# I think this is where 'acpi -t' happens
Execute Method: [\_SB_.PCI0.ISA0.EC0_.BAT1._BST] (Node e3f82b48)
 acpi_ec-0458 [28] ec_intr_read          : Read [00] from address [3a]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [00] to address [3a]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [02] to address [52]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [11] to address [53]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [0b] to address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [27] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [27] ec_intr_read          : Read [1c] from address [74]
 acpi_ec-0458 [27] ec_intr_read          : Read [84] from address [64]
 acpi_ec-0458 [27] ec_intr_read          : Read [30] from address [65]
 acpi_ec-0458 [27] ec_intr_read          : Read [ff] from address [60]
 acpi_ec-0458 [27] ec_intr_read          : Read [66] from address [61]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [62]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [63]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [66]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [67]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [54]
 acpi_ec-0458 [27] ec_intr_read          : Read [86] from address [39]
 acpi_ec-0458 [27] ec_intr_read          : Read [86] from address [39]
 acpi_ec-0458 [27] ec_intr_read          : Read [86] from address [39]
Execute Method: [\_SB_.PCI0.ISA0.EC0_.BAT1._BIF] (Node e3f82b88)
 acpi_ec-0458 [27] ec_intr_read          : Read [86] from address [39]
 acpi_ec-0458 [28] ec_intr_read          : Read [00] from address [3a]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [00] to address [3a]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [02] to address [52]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [11] to address [53]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [0b] to address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [27] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [27] ec_intr_read          : Read [1c] from address [74]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [54]
 acpi_ec-0458 [27] ec_intr_read          : Read [20] from address [58]
 acpi_ec-0458 [27] ec_intr_read          : Read [76] from address [59]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [5a]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [5b]
 acpi_ec-0458 [27] ec_intr_read          : Read [ff] from address [5c]
 acpi_ec-0458 [27] ec_intr_read          : Read [66] from address [5d]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [5e]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [5f]
Execute Method: [\_SB_.PCI0.ISA0.EC0_.BAT0._BST] (Node e3f82f08)
 acpi_ec-0458 [28] ec_intr_read          : Read [00] from address [3a]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [00] to address [3a]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [02] to address [52]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [10] to address [53]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [0b] to address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [27] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [27] ec_intr_read          : Read [1c] from address [74]
 acpi_ec-0458 [27] ec_intr_read          : Read [c0] from address [64]
 acpi_ec-0458 [27] ec_intr_read          : Read [30] from address [65]
 acpi_ec-0458 [27] ec_intr_read          : Read [80] from address [60]
 acpi_ec-0458 [27] ec_intr_read          : Read [89] from address [61]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [62]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [63]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [66]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [67]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [54]
 acpi_ec-0458 [27] ec_intr_read          : Read [86] from address [38]
 acpi_ec-0458 [27] ec_intr_read          : Read [86] from address [38]
 acpi_ec-0458 [27] ec_intr_read          : Read [86] from address [38]
Execute Method: [\_SB_.PCI0.ISA0.EC0_.BAT0._BIF] (Node e3f82f48)
 acpi_ec-0458 [27] ec_intr_read          : Read [86] from address [38]
 acpi_ec-0458 [28] ec_intr_read          : Read [00] from address [3a]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [00] to address [3a]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [02] to address [52]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [10] to address [53]
 acpi_ec-0508 [28] ec_intr_write         : Wrote [0b] to address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [27] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [27] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [27] ec_intr_read          : Read [1c] from address [74]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [54]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [58]
 acpi_ec-0458 [27] ec_intr_read          : Read [87] from address [59]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [5a]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [5b]
 acpi_ec-0458 [27] ec_intr_read          : Read [80] from address [5c]
 acpi_ec-0458 [27] ec_intr_read          : Read [89] from address [5d]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [5e]
 acpi_ec-0458 [27] ec_intr_read          : Read [00] from address [5f]
Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf88)
 acpi_ec-0458 [29] ec_intr_read          : Read [00] from address [3a]
 acpi_ec-0508 [29] ec_intr_write         : Wrote [00] to address [3a]
 acpi_ec-0508 [29] ec_intr_write         : Wrote [02] to address [52]
 acpi_ec-0508 [29] ec_intr_write         : Wrote [04] to address [53]
 acpi_ec-0508 [29] ec_intr_write         : Wrote [0b] to address [50]
 acpi_ec-0458 [28] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [28] ec_intr_read          : Read [00] from address [50]
 acpi_ec-0458 [28] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [28] ec_intr_read          : Read [80] from address [51]
 acpi_ec-0458 [28] ec_intr_read          : Read [6c] from address [54]
 acpi_ec-0458 [28] ec_intr_read          : Read [0c] from address [55]
 acpi_ec-0458 [28] ec_intr_read          : Read [4e] from address [58]
 acpi_ec-0458 [28] ec_intr_read          : Read [0c] from address [59]
 acpi_ec-0458 [28] ec_intr_read          : Read [f4] from address [60]
 acpi_ec-0458 [28] ec_intr_read          : Read [0b] from address [61]
 acpi_ec-0458 [28] ec_intr_read          : Read [08] from address [62]
 acpi_ec-0458 [28] ec_intr_read          : Read [0c] from address [63]


^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-23  4:46 Yu, Luming
  2006-03-23  6:25 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-23  4:46 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>  How about this. The side effect of this change is that _BIF, _BST could
>  NOT work. But I think it's just ok.
>
>
>		      Method (I2RB, 3, NotSerialized)
>		      {
>			  Store (Arg0, HCSL)
>			  Store (ShiftLeft (Arg1, 0x01), HMAD)
>			  Store (Arg2, HMCM)
>			  Store (0x0B, HMPR)
>	    /*              Return (CHKS ())*/
>		      }
>
>It hangs in the usual way (2nd sleep).  The boot messages had two Fatal
>opcodes, but that must be the _BIF and _BST that you mentioned:

Good, then the hang should be caused by:

			  Store (Arg0, HCSL)
			  Store (ShiftLeft (Arg1, 0x01), HMAD)
			  Store (Arg2, HMCM)
			  Store (0x0B, HMPR)

Could you add this at the beginning of this block:
	Store (Arg0,  Debug)
And add this at the end of this block:
	Store( HMPR, Debug)

Also change the boot options to acpi_debug_layer=0x00100010
acpi_debug_level=0x10, so that I can verify whether the EC access is OK.

>
>  Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf88)
>  ACPI: Fatal opcode executed
>  Execute Method: [\_TZ_.THM0._PSV] (Node e3f8be48)
>  Execute Method: [\_TZ_.THM0._TC1] (Node e3f8bdc8)
>  Execute Method: [\_TZ_.THM0._TC2] (Node e3f8bd88)
>  Execute Method: [\_TZ_.THM0._TSP] (Node e3f8bd48)
>  Execute Method: [\_TZ_.THM0._AC0] (Node e3f8bf48)
>  Execute Method: [\_TZ_.THM0._SCP] (Node e3f8bec8)
>  Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf88)
>  ACPI: Fatal opcode executed
>  ACPI: Thermal Zone [THM0] (47 C)
>
>With later modifications (e.g. commenting out one of the Store lines), I
>could Return(0x00) instead of commenting out the line.  Let me know
>which ones to try.

Probably yes.

>
>One more thought.  We know that commenting out the UPDT call in _TMP
>fixes the hang.  By bisecting the UPDT method, however, we change every
>call to UPDT, including the one in THM0._TMP.  So we're making extra
>changes beyond what is needed to fix the hang (and maybe producing
>another hang?).
>
>But let's continue this bisection since it's almost done.  If we
>eventually find the offending statement, we can use the information in
>order to find the smallest change that fixes the hang.  We make a copy
>of the original UPDT method, call it UPDTCOPY, say; same for I2RB.  Then
>THM0._TMP can call EC0.UPDTCOPY(), which calls I2RBCOPY.  And we modify
>I2RBCOPY, but we leave I2RB and UPDT alone.
>
Yes, it's a good idea to have a separate I2RB copy for THM0, which is the
zone we are hacking on.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-22  7:28 Yu, Luming
@ 2006-03-22 14:16 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-22 14:16 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

  How about this. The side effect of this change is that _BIF, _BST could
  NOT work. But I think it's just ok.


		      Method (I2RB, 3, NotSerialized)
		      {
			  Store (Arg0, HCSL)
			  Store (ShiftLeft (Arg1, 0x01), HMAD)
			  Store (Arg2, HMCM)
			  Store (0x0B, HMPR)
	    /*              Return (CHKS ())*/
		      }

It hangs in the usual way (2nd sleep).  The boot messages had two Fatal
opcodes, but that must be the _BIF and _BST that you mentioned:

  Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf88)
  ACPI: Fatal opcode executed
  Execute Method: [\_TZ_.THM0._PSV] (Node e3f8be48)
  Execute Method: [\_TZ_.THM0._TC1] (Node e3f8bdc8)
  Execute Method: [\_TZ_.THM0._TC2] (Node e3f8bd88)
  Execute Method: [\_TZ_.THM0._TSP] (Node e3f8bd48)
  Execute Method: [\_TZ_.THM0._AC0] (Node e3f8bf48)
  Execute Method: [\_TZ_.THM0._SCP] (Node e3f8bec8)
  Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf88)
  ACPI: Fatal opcode executed
  ACPI: Thermal Zone [THM0] (47 C)

With later modifications (e.g. commenting out one of the Store lines), I
could Return(0x00) instead of commenting out the line.  Let me know
which ones to try.  

One more thought.  We know that commenting out the UPDT call in _TMP
fixes the hang.  By bisecting the UPDT method, however, we change every
call to UPDT, including the one in THM0._TMP.  So we're making extra
changes beyond what is needed to fix the hang (and maybe producing
another hang?).

But let's continue this bisection since it's almost done.  If we
eventually find the offending statement, we can use the information in
order to find the smallest change that fixes the hang.  We make a copy
of the original UPDT method, call it UPDTCOPY, say; same for I2RB.  Then
THM0._TMP can call EC0.UPDTCOPY(), which calls I2RBCOPY.  And we modify
I2RBCOPY, but we leave I2RB and UPDT alone.

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-22  7:28 Yu, Luming
  2006-03-22 14:16 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-22  7:28 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>Since I don't think Fatal() is being called, I guess the problem is
>in I2RB.  But all those magic numbers in I2RB make me reluctant to take
>out lines, unless you tell me which changes won't harm the hardware.
>

How about this? The side effect of this change is that _BIF and _BST may
NOT work, but I think that's OK.


                    Method (I2RB, 3, NotSerialized)
                    {
                        Store (Arg0, HCSL)
                        Store (ShiftLeft (Arg1, 0x01), HMAD)
                        Store (Arg2, HMCM)
                        Store (0x0B, HMPR)
          /*              Return (CHKS ())*/
                    }

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-22  1:30 Yu, Luming
  2006-03-22  4:35 ` Sanjoy Mahajan
@ 2006-03-22  7:15 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-22  7:15 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

So the kernel with this UPDT() hung at the 2nd sleep:

                    Method (UPDT, 0, NotSerialized)
                    {
                        If (IGNR)
                        {
                            Decrement (IGNR)
                        }
                        Else
                        {
                            If (H8DR)
                            {
                                If (Acquire (I2CM, 0x0064)) {}
                                Else
                                {
                                    Store (I2RB (Zero, 0x01, 0x04), Local7)
                                    If (Local7)
                                    {
                                        Fatal (0x01, 0x80000003, Local7)
                                    }

                                    Release (I2CM)
                                }
                            }
                        }
                    }

Relative to a working kernel (well, a kernel that hung only once and
never again on any later reboot), these are the extra lines:

                                    Store (I2RB (Zero, 0x01, 0x04), Local7)
                                    If (Local7)
                                    {
                                        Fatal (0x01, 0x80000003, Local7)
                                    }

Since I don't think Fatal() is being called, I guess the problem is
in I2RB.  But all those magic numbers in I2RB make me reluctant to take
out lines, unless you tell me which changes won't harm the hardware.

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-22  1:34 Yu, Luming
@ 2006-03-22  7:00 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-22  7:00 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

I tried the following kernels (all with only THM0):

1. No other changes: It hangs, as before and as expected.

2. Commented out a large chunk of the UPDT() method:

diff -r ac2b38909dfa -r c431c477d3b6 dsdt/600x.dsl
--- a/dsdt/600x.dsl	Tue Mar 21 17:12:47 2006 -0500
+++ b/dsdt/600x.dsl	Wed Mar 22 00:22:21 2006 -0500
@@ -4132,19 +4132,6 @@ DefinitionBlock ("DSDT.aml", "DSDT", 1, 
                                 If (Acquire (I2CM, 0x0064)) {}
                                 Else
                                 {
-                                    Store (I2RB (Zero, 0x01, 0x04), Local7)
-                                    If (Local7)
-                                    {
-                                        Fatal (0x01, 0x80000003, Local7)
-                                    }
-                                    Else
-                                    {
-                                        Store (HBS0, TMP0)
-                                        Store (HBS2, TMP2)
-                                        Store (HBS6, TMP6)
-                                        Store (HBS7, TMP7)
-                                    }
-
                                     Release (I2CM)
                                 }
                             }

So now it just grabs and releases the I2CM lock.  

This kernel hung on the first sleep, but I couldn't reproduce that
behavior.  I tried two more boots, and each time it never hung.  I even
thought it might depend on the result of previous boots, so I tried
kernel #1 again and got the same hang, and then tried #2.  But it still
wouldn't hang.  So I went back through the serial console logs to check
whether I was hallucinating, and I was not.  This kernel had indeed hung
on the first sleep, but only the first time I booted it.

So I'm going to assume that it's okay, and that if it isn't okay, it's
because of another, more intermittent bug.

3. Commented out EC0.UPDT() call in THM0._TMP, and this kernel was fine,
   which is how it behaved a couple days ago, and is what I expected.

I'm about to try a smaller change (continuing the bisect):

diff -r ac2b38909dfa -r f10a309b8385 dsdt/600x.dsl
--- a/dsdt/600x.dsl	Tue Mar 21 17:12:47 2006 -0500
+++ b/dsdt/600x.dsl	Wed Mar 22 01:44:12 2006 -0500
@@ -4137,13 +4137,6 @@ DefinitionBlock ("DSDT.aml", "DSDT", 1, 
                                     {
                                         Fatal (0x01, 0x80000003, Local7)
                                     }
-                                    Else
-                                    {
-                                        Store (HBS0, TMP0)
-                                        Store (HBS2, TMP2)
-                                        Store (HBS6, TMP6)
-                                        Store (HBS7, TMP7)
-                                    }
 
                                     Release (I2CM)
                                 }


After that bisection there's not much more to change in UPDT().
However, I can drill down into I2RB() because of this line in UPDT():

    Store (I2RB (Zero, 0x01, 0x04), Local7)

But I have no idea what's safe to experiment with in I2RB():

                    Method (I2RB, 3, NotSerialized)
                    {
                        Store (Arg0, HCSL)
                        Store (ShiftLeft (Arg1, 0x01), HMAD)
                        Store (Arg2, HMCM)
                        Store (0x0B, HMPR)
                        Return (CHKS ())
                    }


All those lines look like tricky hardware manipulations.
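
To see exactly what that UPDT() call asks I2RB() to write, here is a rough
Python model of the four Store statements.  It only mirrors the arithmetic
in the listing above.  The meaning of HCSL, HMAD, HMCM and HMPR is not
stated anywhere in this thread, and i2rb_writes is a made-up helper name,
not something in the DSDT or the kernel.

```python
def i2rb_writes(arg0, arg1, arg2):
    """Return the (field, value) pairs written by I2RB's Store statements, in order."""
    return [
        ("HCSL", arg0),       # Store (Arg0, HCSL)
        ("HMAD", arg1 << 1),  # Store (ShiftLeft (Arg1, 0x01), HMAD)
        ("HMCM", arg2),       # Store (Arg2, HMCM)
        ("HMPR", 0x0B),       # Store (0x0B, HMPR)
    ]


# The call made from UPDT(): I2RB (Zero, 0x01, 0x04)
for field, value in i2rb_writes(0, 0x01, 0x04):
    print(f"{field} <- 0x{value:02X}")
```

For the arguments used in UPDT() this gives HCSL=0x00, HMAD=0x02,
HMCM=0x04 and HMPR=0x0B, followed by the CHKS() call that the model
leaves out.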

By the way, which debug_{level,layer} settings will show the lines of
the human-readable DSDT as they are executed?

-Sanjoy

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-22  4:58 Yu, Luming
@ 2006-03-22  5:13 ` Sanjoy Mahajan
  2006-03-24  1:17 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-22  5:13 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> Please don't give up. :-)

I haven't!

> I need to know which statement in EC0.UPDT that could trigger the
> problem.  That is very important to understand the problem
> correctly.

> This is still my assumption that some AML code needed to be avoided
> in suspend/resume, I need data support. So, we need to dig more in
> EC0.UPDT.

You've convinced me!

> If we cannot find that statement, then I will doubt the testing
> results that guided us here.

Yes, the testing is often frustrating and unreliable because the bug
is not 100% reproducible, which is why I think it's more than 1 bug
(one reliable bug, related to THM0, and one or more flakey ones).

>> However, we do have one more piece of data.  When it hangs, it hangs in
>> \_SI._SST, because I see that line on successful sleeps (as the last

> I don't know this. I always assume the hang is at _PTS.SMPI

Oh, I think you're right.  Robert Moore's comment at the bugzilla
entry agrees with what you say.  I was (I don't know why) assuming
that the ACPI system printed "Execute Method ..." after it finished
executing the method.  In which case, seeing the PTS but not the SST
made me think that PTS worked but SST failed.  However, what you say
is consistent with ACPI printing "Execute Method" as it begins a
method -- in which case seeing the PTS but not the SST means it fails
in PTS (and in PTS.SMPI).
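
If "Execute Method" really is printed on entry, then the last such line in
a hung log names the method that never returned.  A small sketch of that
reading, in Python; the sample log tail and its node addresses are invented
for illustration, and last_method_entered is a hypothetical helper:

```python
import re

# Matches trace lines such as: Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf88)
ENTRY = re.compile(r"Execute Method: \[([^\]]+)\]")


def last_method_entered(log_lines):
    """Return the path of the last method whose entry line appears, or None."""
    last = None
    for line in log_lines:
        match = ENTRY.search(line)
        if match:
            last = match.group(1)
    return last


# Hypothetical tail of a serial-console log from a sleep attempt that hung.
hung_log = [
    r"Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf88)",
    r"Execute Method: [\_PTS] (Node e3f80000)",
]
print(last_method_entered(hung_log))
```

Under the other assumption (a line printed only after the method
finishes), the same log would instead point at whatever runs after _PTS,
which is the mistake I was making.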

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.


http://bugzilla.kernel.org/show_bug.cgi?id=5989





------- Additional Comments From Robert.Moore@intel.com  2006-02-06 14:46 

You are stuck in the loop in the method below. I would guess that the
code is waiting for a response from the SMI bios and it never happens.

    Method (SMPI, 1, NotSerialized)
    {
        Store (S_AX, Local0)
        Store (0x81, APMD)
        While (LEqual (S_AH, 0xA6))
        {
            Sleep (0x64)
            Store (Local0, S_AX)
            Store (0x81, APMD)
        }
    }
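
A rough Python model of that While loop may make the failure mode easier
to see.  It is only a sketch: the "firmware" is a stand-in callable, the
poll cap exists only so the model terminates (the ASL has no such cap),
and smpi_polls and firmware_answering_after are hypothetical names.

```python
BUSY = 0xA6  # value of S_AH that keeps the While loop in SMPI spinning


def smpi_polls(read_s_ah, max_polls=50):
    """Count loop iterations until S_AH stops reading BUSY.

    Returns None if it is still BUSY after max_polls iterations.
    """
    polls = 0
    while read_s_ah() == BUSY:
        if polls == max_polls:
            return None
        polls += 1  # in the ASL: Sleep (0x64), then store S_AX and APMD again
    return polls


def firmware_answering_after(n):
    """Stand-in SMI handler: reports BUSY n times, then 0x00."""
    state = {"left": n}

    def read():
        if state["left"] > 0:
            state["left"] -= 1
            return BUSY
        return 0x00

    return read


print(smpi_polls(firmware_answering_after(3)))  # handler eventually answers
print(smpi_polls(lambda: BUSY))                 # handler never answers
```

With a handler that never clears S_AH the real method has no exit at
all, which matches the endlessly repeated exregion lines.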

    OperationRegion (MNVS, SystemMemory, 0x23FDF000, 0x1000)
    Field (MNVS, DWordAcc, NoLock, Preserve)
    {
        Offset (0xFC0), 
        S_AX,   16, 

    OperationRegion (APMC, SystemIO, 0xB2, 0x01)
    Field (APMC, ByteAcc, NoLock, Preserve)
    {
        APMD,   8


Store (Local0, S_AX)

exregion-0182 [29] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0


Store (0x81, APMD)

exregion-0287 [30] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-22  4:58 Yu, Luming
  2006-03-22  5:13 ` Sanjoy Mahajan
  2006-03-24  1:17 ` Sanjoy Mahajan
  0 siblings, 2 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-22  4:58 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> We can do bisection in EC0.UPDT to find out which statement cause
>> hang?
>
>Yes, though see below for why I don't think it'll help no matter what we
>find there.

Please don't give up. :-)

I need to know which statement in EC0.UPDT triggers the problem.
That is very important for understanding the problem correctly.
If we cannot find that statement, then I will doubt the testing
results that guided us here.

>
>> My assumption is that since Windows works well, then these BIOS code
>> should have been tested ok. The only possible excuse for BIOS is that
>Linux is using unnecessary/untested code path for Suspend/resume.  So,
>> Eventually, we need to disable unnecessary BIOS call for
>> suspend/resume
>
>Maybe we're not collecting the right data in that case.  We know that
>commenting out the call to UPDT in THM0.TMP fixes the hang.  But it does
>not follow that the osl suspend code should avoid running UPDT.

It is still my assumption that some AML code needs to be avoided
during suspend/resume, but I need data to support it. So we need to dig
deeper into EC0.UPDT.


>
>The hang may work like this: Between boot and sleep, calling UPDT messes
>up something in the ec [which is why it takes >1 sleep to cause a hang].
>When the system tries to sleep, that something triggers and the ec
>hangs.  But it may hang somewhere else than UPDT, and avoiding UPDT
>during sleep will not fix it.

If the BIOS does NOT behave correctly, then anything can happen.

>
>However, we do have one more piece of data.  When it hangs, it hangs in
>\_SI._SST, because I see that line on successful sleeps (as the last

I didn't know this. I have always assumed the hang is at _PTS.SMPI

>method before the beep) but not when it hangs (and then I also don't
>hear a beep).  There are lots of calls to EC0.XXX, including to
>EC0.BEEP, within _SST, which isn't surprising if the EC is the problem.

It could be. But there should be something that triggers it.

>So perhaps I should bisect in _SST and put in the debug lines there?
>
>Here's another idea, which is a terrible hack.  But there are lots of
>lines in the DSDT like
>   If (LOr (SPS, WNTF))
>which I imagine is saying "If something or if WinNT".  So, what if Linux
>pretends to be WinNT (or W98F -- which is another common test), at least
>for the 600x?  Maybe those code paths are known to work.
>
Yes, you can try that.

Thanks,
Luming

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-22  1:30 Yu, Luming
@ 2006-03-22  4:35 ` Sanjoy Mahajan
  2006-03-22  7:15 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-22  4:35 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> We can do bisection in EC0.UPDT to find out which statement cause
> hang?

Yes, though see below for why I don't think it'll help no matter what we
find there.

> My assumption is that since Windows works well, then these BIOS code
> should have been tested ok. The only possible excuse for BIOS is that
> Linux is using unnecessary/untested code path for Suspend/resume.  So,
> Eventually, we need to disable unnecessary BIOS call for
> suspend/resume

Maybe we're not collecting the right data in that case.  We know that
commenting out the call to UPDT in THM0.TMP fixes the hang.  But it does
not follow that the osl suspend code should avoid running UPDT.

The hang may work like this: Between boot and sleep, calling UPDT messes
up something in the ec [which is why it takes >1 sleep to cause a hang].
When the system tries to sleep, that something triggers and the ec
hangs.  But it may hang somewhere else than UPDT, and avoiding UPDT
during sleep will not fix it.

However, we do have one more piece of data.  When it hangs, it hangs in
\_SI._SST, because I see that line on successful sleeps (as the last
method before the beep) but not when it hangs (and then I also don't
hear a beep).  There are lots of calls to EC0.XXX, including to
EC0.BEEP, within _SST, which isn't surprising if the EC is the problem.
So perhaps I should bisect in _SST and put in the debug lines there?

Here's another idea, which is a terrible hack.  But there are lots of
lines in the DSDT like
   If (LOr (SPS, WNTF))
which I imagine is saying "If something or if WinNT".  So, what if Linux
pretends to be WinNT (or W98F -- which is another common test), at least
for the 600x?  Maybe those code paths are known to work.

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-22  1:34 Yu, Luming
  2006-03-22  7:00 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-22  1:34 UTC (permalink / raw)
  To: Yu, Luming, Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>
>Hmm, you seems to prefer depth-first search algorithm?
>I like it too. :-)
>
>
>>
>>One bug is quite repeatable and we know a lot about it. With all zones
>>except THM0 commented out, the system hung.  With the EC0.UPDT line in
>>THM0._TMP also commented out, the system didn't hang.  So there's a
>>problem related to the EC, even with only THM0.  And finding that
>>problem may give ideas for what else may be wrong.
>
>We can do bisection in EC0.UPDT to find out which statement cause hang?
>Hmm, we are going to fix BIOS. :-)

You can insert debug statements in EC0.UPDT to help debug:

Store (IGNR, Debug)
Store (" before release I2CM", Debug)
Store (HBS7, TMP7)	
....

>
>My assumption is that since Windows works well, then these BIOS code
>should have been tested ok. The only possible excuse for BIOS is that
>Linux is using unnecessary/untested code path for Suspend/resume.
>So, Eventually, we need to disable unnecessary BIOS call for suspend/resume

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-22  1:30 Yu, Luming
  2006-03-22  4:35 ` Sanjoy Mahajan
  2006-03-22  7:15 ` Sanjoy Mahajan
  0 siblings, 2 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-22  1:30 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>Two more experiments:
>
>  With a vanilla kernel, I faked EC0.UPDT() to just return 0x00, and the
>  system hung on the second sleep.
>
>  Then, again in the DSDT, I also faked the 4 _TMP methods (one in each
>  thermal zone), and the system hung on the second sleep.
>
>I think we've raced too far ahead by trying to debug many thermal zones
>at once.  Perhaps there are two bugs.  So let's find them one by one.

Hmm, you seem to prefer a depth-first search algorithm?
I like it too. :-)


>
>One bug is quite repeatable and we know a lot about it. With all zones
>except THM0 commented out, the system hung.  With the EC0.UPDT line in
>THM0._TMP also commented out, the system didn't hang.  So there's a
>problem related to the EC, even with only THM0.  And finding that
>problem may give ideas for what else may be wrong.

Can we do a bisection in EC0.UPDT to find out which statement causes the
hang?  Hmm, we are going to fix the BIOS. :-)

My assumption is that since Windows works well, this BIOS code
should have been tested OK. The only possible excuse for the BIOS is that
Linux is using an unnecessary/untested code path for suspend/resume.
So, eventually, we need to disable the unnecessary BIOS calls for
suspend/resume.

Thanks,
Luming

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-21  9:11 Yu, Luming
  2006-03-21 20:37 ` Sanjoy Mahajan
@ 2006-03-21 22:09 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-21 22:09 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

Two more experiments:

  With a vanilla kernel, I faked EC0.UPDT() to just return 0x00, and the
  system hung on the second sleep.

  Then, again in the DSDT, I also faked the 4 _TMP methods (one in each
  thermal zone), and the system hung on the second sleep.

I think we've raced too far ahead by trying to debug many thermal zones
at once.  Perhaps there are two bugs.  So let's find them one by one.

One bug is quite repeatable and we know a lot about it. With all zones
except THM0 commented out, the system hung.  With the EC0.UPDT line in
THM0._TMP also commented out, the system didn't hang.  So there's a
problem related to the EC, even with only THM0.  And finding that
problem may give ideas for what else may be wrong.

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-21  9:11 Yu, Luming
@ 2006-03-21 20:37 ` Sanjoy Mahajan
  2006-03-21 22:09 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-21 20:37 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

The following tests all have acpi_evaluate_integer() hacked to return
_TMP=27C.

>> The kernel panic for the don't-load-THM2 kernel is very strange.  I
>> had another kernel panic while doing another set of tests, which I
>> also couldn't explain.  The only difference between the no-THM0 and
>> the no-THM2 kernels is:

> Could you just printk device->pnp? it could be null point (due to you
> hack?)

device->pnp is a struct and I couldn't figure out how to printk it, so
I just printk'ed device->pnp.bus_id (most of its other elements aren't
initialized by then anyway):

diff -r ac486e270597 -r 8b088512dd1d drivers/acpi/thermal.c
--- a/drivers/acpi/thermal.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/thermal.c	Tue Mar 21 11:32:31 2006 -0500
@@ -1324,6 +1324,7 @@ static int acpi_thermal_add(struct acpi_
 
 	if (!device)
 		return_VALUE(-EINVAL);
+	printk(KERN_INFO PREFIX "pnp.bus_id=0x%x\n", (u32) device->pnp.bus_id);
 
 	tz = kmalloc(sizeof(struct acpi_thermal), GFP_KERNEL);
 	if (!tz)

It produced nothing surprising:

ACPI: pnp.bus_id=0xe3ed7830
ACPI: pnp.bus_id=0xe3ed7430
ACPI: pnp.bus_id=0xe3ed7030
ACPI: pnp.bus_id=0xe3ed8c30
ACPI: pnp.bus_id=0xe3ed4030

for THM0,2,6,7, and _TZ.

So I still don't know why getting rid of THM2 in the kernel causes the
panic.

But while I had this kernel booted, I tried a few sleep cycles, and it
hung on the second one as expected (it's just the vanilla kernel&DSDT
with acpi_evaluate_integer() hacked to return _TMP=27C).

>> THM6			Hangs (4th cycle)
> Is it still hang at SMPI?

It looked like the usual hang, but I had debug_{layer,level}=0x10.  I
increased debug_layer to 0xFFFFFFFF to see the function traces.
However, the hang didn't occur even after 15 cycles. So I rebooted with
debug_layer=0x10 and still couldn't reproduce the hang even after 12
cycles.  But the same kernel hung yesterday after 4 cycles [I save all
the kernels tagged by their revision hash], so I don't know what to
think about THM6.
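
One way to put numbers on "couldn't reproduce" is to ask how likely a run
of clean cycles would be if the hang struck at random.  The sketch below
assumes each sleep cycle hangs independently with a fixed probability,
which may well be false here (earlier boots might matter), and the 0.25
figure is only a guess based on one hang in four cycles.  The function
names are made up.

```python
import math


def prob_all_clean(p_hang, cycles):
    """Probability of seeing no hang in `cycles` independent sleep cycles."""
    return (1.0 - p_hang) ** cycles


def cycles_for_confidence(p_hang, confidence=0.95):
    """Smallest n such that n clean cycles would occur with probability <= 1 - confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_hang))


P_GUESS = 0.25  # illustrative only: one hang observed in four cycles
for n in (12, 15):
    print(n, "clean cycles:", round(prob_all_clean(P_GUESS, n), 4))
print("cycles needed for 95%:", cycles_for_confidence(P_GUESS))
```

If the per-cycle chance were really that high, 12 or 15 clean cycles in a
row would be unlikely, so either the true rate is much lower or something
other than chance changed between yesterday's run and today's.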

>> THM2			"kernel panic! attempted to kill init"

> I guess, if you fake DSDT by completely removing THM2 you won't see
> this.

Right, it booted fine when I removed THM2 from the DSDT instead of from
the kernel.

>> So THM6 seems healthy, but THM0 and THM7 (and maybe THM2) interact
>> badly.  If I unload THM2, THM6, and THM7, then it's okay (previous
>> experiments with faking _TMP but with only THM0 loaded).  But
>> unloading THM6 is not enough.

> Please try to remove THM2 judge if it is JUST the problem of THM0 &&
> THM7.

I tried the kernel with THM2 taken out of the DSDT, and it was fine (so
the total change was that plus _TMP faked in acpi_evaluate_integer()).

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-21  9:11 Yu, Luming
  2006-03-21 20:37 ` Sanjoy Mahajan
  2006-03-21 22:09 ` Sanjoy Mahajan
  0 siblings, 2 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-21  9:11 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>With _TMP faked in the kernel and one whole zone ignored, this is what I
>get:
>
>Zone to ignore	|	Result
>------------------------------------------------------------------------
>THM0			OK (10 cycles)
>THM2			"kernel panic! attempted to kill init"

I guess that if you fake the DSDT by completely removing THM2,
you won't see this.

>THM6			Hangs (4th cycle)
Does it still hang at SMPI?

>THM7			OK (8 cycles)
>
>So THM6 seems healthy, but THM0 and THM7 (and maybe THM2) interact
>badly.  If I unload THM2, THM6, and THM7, then it's okay (previous
>experiments with faking _TMP but with only THM0 loaded).  But unloading
>THM6 is not enough.

Please try removing THM2 to judge whether it is JUST a
problem of THM0 && THM7.

>
>The kernel panic for the don't-load-THM2 kernel is very strange.  I had
>another kernel panic while doing another set of tests, which I also
>couldn't explain.  The only difference between the no-THM0 and the
>no-THM2 kernels is:

Could you just printk device->pnp? It could be a null pointer (due to
your hack?)

>
>diff -r b7ad6c906aba -r 213308f0ec31 drivers/acpi/thermal.c
>--- a/drivers/acpi/thermal.c	Tue Mar 21 02:23:30 2006 -0500
>+++ b/drivers/acpi/thermal.c	Tue Mar 21 02:36:42 2006 -0500
>@@ -1324,7 +1324,7 @@ static int acpi_thermal_add(struct acpi_
> 
> 	if (!device)
> 		return_VALUE(-EINVAL);
>-	if (strcmp("THM2", device->pnp.bus_id) == 0) {
>+	if (strcmp("THM0", device->pnp.bus_id) == 0) {
> 	    printk(KERN_INFO PREFIX "thermal_add: ignoring %s\n",
> 		   device->pnp.bus_id);
> 	    return_VALUE(-EINVAL);
>
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-21  1:38 Yu, Luming
  2006-03-21  7:27 ` Sanjoy Mahajan
@ 2006-03-21  8:47 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-21  8:47 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

With _TMP faked in the kernel and one whole zone ignored, this is what I
get:

Zone to ignore	|	Result
------------------------------------------------------------------------
THM0			OK (10 cycles)
THM2			"kernel panic! attempted to kill init"
THM6			Hangs (4th cycle)
THM7			OK (8 cycles)

So THM6 seems healthy, but THM0 and THM7 (and maybe THM2) interact
badly.  If I unload THM2, THM6, and THM7, then it's okay (previous
experiments with faking _TMP but with only THM0 loaded).  But unloading
THM6 is not enough.

The kernel panic for the don't-load-THM2 kernel is very strange.  I had
another kernel panic while doing another set of tests, which I also
couldn't explain.  The only difference between the no-THM0 and the
no-THM2 kernels is:

diff -r b7ad6c906aba -r 213308f0ec31 drivers/acpi/thermal.c
--- a/drivers/acpi/thermal.c	Tue Mar 21 02:23:30 2006 -0500
+++ b/drivers/acpi/thermal.c	Tue Mar 21 02:36:42 2006 -0500
@@ -1324,7 +1324,7 @@ static int acpi_thermal_add(struct acpi_
 
 	if (!device)
 		return_VALUE(-EINVAL);
-	if (strcmp("THM2", device->pnp.bus_id) == 0) {
+	if (strcmp("THM0", device->pnp.bus_id) == 0) {
 	    printk(KERN_INFO PREFIX "thermal_add: ignoring %s\n",
 		   device->pnp.bus_id);
 	    return_VALUE(-EINVAL);


-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-21  1:38 Yu, Luming
@ 2006-03-21  7:27 ` Sanjoy Mahajan
  2006-03-21  8:47 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-21  7:27 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> From previous experience, we know _THM0._TMP causes problem.  If you
> fake _TMP for all THM, what could happen?

It still hangs on the second sleep.  I faked them in the kernel instead
of the DSDT, by faking them in acpi_evaluate_integer() like so:

diff -r ac486e270597 -r 959c4fa10a36 drivers/acpi/utils.c
--- a/drivers/acpi/utils.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/utils.c	Mon Mar 20 20:52:01 2006 -0500
@@ -270,7 +270,15 @@ acpi_evaluate_integer(acpi_handle handle
 	memset(element, 0, sizeof(union acpi_object));
 	buffer.length = sizeof(union acpi_object);
 	buffer.pointer = element;
-	status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
+	if (strcmp(pathname, "_TMP") != 0)
+	  status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
+	else {
+	  printk(KERN_INFO PREFIX "acpi_evaluate_integer: Faking _TMP\n");
+	  status = AE_OK;
+	  element->type = ACPI_TYPE_INTEGER;
+	  element->integer.value = 3000; /* 27 C, in deciKelvins */
+	}
+
 	if (ACPI_FAILURE(status)) {
 		acpi_util_eval_error(handle, pathname, status);
 		return_ACPI_STATUS(status);


Each thermal zone that loaded produced printk's like "Faking _TMP", etc.,
so the patch was working.  It shouldn't change the result if instead I
make all the _TMP methods in the DSDT return 0xBB8 (3000 decimal, the
same value that the patch fakes).
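
As a check on the faked value, here is a minimal sketch of the unit
conversion, assuming a hypothetical helper decikelvin_to_celsius() that
rounds to the nearest degree and uses 2732 as the deciKelvin offset for
0 C:

```c
/* Convert a temperature in tenths of a kelvin to whole degrees Celsius,
 * rounding to nearest.  2732 approximates 273.15 K in deciKelvins. */
long decikelvin_to_celsius(long dk)
{
	long tenths = dk - 2732;	/* tenths of a degree Celsius */

	return tenths >= 0 ? (tenths + 5) / 10 : (tenths - 5) / 10;
}
```

With this helper both 3000 and 0xBB8 come out as 27, matching the
comment in the patch.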

So my plan, which I'm trying now, is to keep _TMP faked for all zones,
and take away one zone at a time until the hang goes away.  If I take
away all of THM[267], then it won't hang (since THM0 by itself hangs but
THM0 without _TMP does not hang).  But I hope that an earlier
combination in the search will not hang.

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-21  1:38 Yu, Luming
  2006-03-21  7:27 ` Sanjoy Mahajan
  2006-03-21  8:47 ` Sanjoy Mahajan
  0 siblings, 2 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-21  1:38 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> I think you need to continue to find out which THMs, which methods
>> cause s3 hang when THM0._TMP disabled.
>
>So far I've found that if (with no THM0 loaded) I load exactly one of
>THM2, THM6, or THM7, then there's no hang.  Now I am looking for which
>combinations of the THM[0267] zones cause the problem.

Hmm, I guess you don't need to try each combination of THM[0267].
From previous experience, we know THM0._TMP causes a problem.
If you fake _TMP for all THM zones, what happens?

If you have verified that _TMP causes the issue by faking it in the DSDT,
we probably need to keep digging into the UPDT method:

                    Method (UPDT, 0, NotSerialized)
                    {
                        If (IGNR)
                        {
                            Decrement (IGNR)
                        }
                        Else
                        {
                            If (H8DR)
                            {
                                If (Acquire (I2CM, 0x0064)) {}
                                Else
                                {
                                    Store (I2RB (Zero, 0x01, 0x04), Local7)
                                    If (Local7)
                                    {
                                        Fatal (0x01, 0x80000003, Local7)
                                    }
                                    Else
                                    {
                                        Store (HBS0, TMP0)
                                        Store (HBS2, TMP2)
                                        Store (HBS6, TMP6)
                                        Store (HBS7, TMP7)
                                    }

                                    Release (I2CM)
                                }
                            }
                        }
                    }
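
To make the control flow of UPDT easier to follow, here is a rough
userspace model in C.  It is only a sketch: struct ec_state, struct
ec_ops and the stub functions are hypothetical, and I2RB is not shown in
this thread, so the model assumes only what the AML itself shows (a
nonzero return leads to Fatal).  Note that Acquire returns true when it
times out, so the real work happens in the Else branch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the DSDT objects; none of this is kernel code. */
struct ec_state {
	unsigned ignr;		/* IGNR: number of updates still to skip */
	bool h8dr;		/* H8DR: gate for talking to the controller */
	unsigned char hbs[4];	/* HBS0, HBS2, HBS6, HBS7 */
	unsigned char tmp[4];	/* TMP0, TMP2, TMP6, TMP7 */
	bool fatal;		/* set where the AML would execute Fatal() */
};

/* Hooks standing in for Acquire (I2CM, 0x0064), I2RB and Release (I2CM). */
struct ec_ops {
	bool (*acquire_timed_out)(void);	/* true: mutex NOT obtained */
	unsigned (*i2rb)(void);			/* nonzero is an error */
	void (*release)(void);
};

/* Trivial stubs so the model can be exercised. */
bool never_times_out(void) { return false; }
unsigned read_ok(void) { return 0; }
void release_noop(void) { }

void updt(struct ec_state *s, const struct ec_ops *ops)
{
	unsigned status;
	int i;

	if (s->ignr) {			/* If (IGNR) { Decrement (IGNR) } */
		s->ignr--;
		return;
	}
	if (!s->h8dr)
		return;
	if (ops->acquire_timed_out())	/* the empty If branch */
		return;

	status = ops->i2rb();
	if (status) {
		s->fatal = true;	/* Fatal (0x01, 0x80000003, Local7) */
	} else {
		for (i = 0; i < 4; i++)	/* Store (HBSn, TMPn) */
			s->tmp[i] = s->hbs[i];
	}
	ops->release();
}
```

The model shows that the cached TMPn values only change when IGNR is
zero, H8DR is set, the mutex is obtained within the timeout, and I2RB
reports success.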

Thanks,
Luming

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-19  4:12 Yu, Luming
  2006-03-19 14:33 ` Sanjoy Mahajan
@ 2006-03-20  6:39 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-20  6:39 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> I think you need to continue to find out which THMs, which methods
> cause s3 hang when THM0._TMP disabled.

So far I've found that if (with no THM0 loaded) I load exactly one of
THM2, THM6, or THM7, then there's no hang.  Now I am looking for which
combinations of the THM[0267] zones cause the problem.
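
The search space is small.  A minimal sketch that enumerates the
combinations of two or more zones with a 4-bit mask (the function names
are made up for illustration):

```c
#include <stdio.h>

/* The four thermal zones named in the thread, one bit each. */
static const char *const zone_names[4] = { "THM0", "THM2", "THM6", "THM7" };

/* Number of zones selected by a 4-bit mask. */
int zones_in_mask(unsigned mask)
{
	int n = 0;
	int bit;

	for (bit = 0; bit < 4; bit++)
		if (mask & (1u << bit))
			n++;
	return n;
}

/* Print every combination of two or more zones; return how many there are. */
int list_multi_zone_combinations(void)
{
	unsigned mask;
	int bit, count = 0;

	for (mask = 0; mask < 16; mask++) {
		if (zones_in_mask(mask) < 2)
			continue;	/* skip the empty set and single zones */
		for (bit = 0; bit < 4; bit++)
			if (mask & (1u << bit))
				printf("%s ", zone_names[bit]);
		printf("\n");
		count++;
	}
	return count;
}
```

That gives 6 pairs, 4 triples and the full set, so at most 11 kernels to
build and test.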

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-19  4:12 Yu, Luming
@ 2006-03-19 14:33 ` Sanjoy Mahajan
  2006-03-20  6:39 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-19 14:33 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> Maybe I need to make a summary here for this issue:
> 1. The s3 hang is in While-loop in SMPI that looks like
> waiting BIOS response.

Right.

> 2. If THM2, THM6, THM7 disabled, disabling THM0._TMP
> fix the s3 hang.

Right.  And many ways of disabling THM0._TMP fix the hang:

1. making acpi_evaluate_integer() not evaluate _TMP methods.
2. the short-term fix using acpi_in_suspend
3. taking out \_SB.PCI0.ISA0.EC0.UPDT () line from _TMP method.

> I think you need to continue to find out which THMs, which methods
> cause s3 hang when THM0._TMP disabled.  I assume the problem is:
> THM0._TMP && THMx._XXX && THMy._YYY..

I agree, and am testing the other thermal methods one at a time.  I
suspect that THMx.AC0 will be involved, but we'll see.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-19  4:12 Yu, Luming
  2006-03-19 14:33 ` Sanjoy Mahajan
  2006-03-20  6:39 ` Sanjoy Mahajan
  0 siblings, 2 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-19  4:12 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> Do you load processor driver?
>
>It loads at boot.  When thermal loads, it pulls in processor:
>
>$ lsmod | grep thermal
>thermal                17224  0 
>processor              30080  1 thermal
>

Maybe I need to summarize this issue:
1. The S3 hang is in a While loop in SMPI that appears to be
waiting for a BIOS response.
2. With THM2, THM6, and THM7 disabled, disabling THM0._TMP
fixes the S3 hang.

I think you need to continue to find out which THMs and which methods
cause the S3 hang when THM0._TMP is disabled.
I assume the problem is:
THM0._TMP && THMx._XXX && THMy._YYY..

Thanks,
Luming

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-18 17:08 Yu, Luming
@ 2006-03-18 20:12 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-18 20:12 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> Do you load processor driver?

It loads at boot.  When thermal loads, it pulls in processor:

$ lsmod | grep thermal
thermal                17224  0 
processor              30080  1 thermal

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-18 17:08 Yu, Luming
  2006-03-18 20:12 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-18 17:08 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos


>>> PM: Preparing system for mem sleep
>>> Stopping tasks: =======================================================|
>
>> Did you see any methods before and after this line in hang case on
>> screen?  If yes, do you recall what they are?
>
>I capture across a serial console, so here are the exact msgs (I just
>ran the second sleep and got the usual hang).  This is with vanilla
>2.6.16-rc5 (and vanilla DSDT):
>
>Stopping tasks: =========================================================|
>Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
>Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
>Execute Method: [\_S3_] (Node c157a988)
>Execute Method: [\_PTS] (Node c157ab48)
>
>The screen itself is full of garbage because the first sleep/wake messes
>up the console.  Along with a giant white square that fills most of the
>screen, I see a fuzzy, dotted version of the above messages, plus one
>more line "ACPI" and then a flashing underscore cursor after that.  I
>don't know if it was trying to printk "ACPI" but then the rest of the
>message got lost, or it hung before printing it, or whether the ACPI is
>from a previous dmesg (i.e. the first sleep/wake) that didn't get
>cleared properly.

Do you load processor driver?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-18 16:37 Yu, Luming
@ 2006-03-18 17:03 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-18 17:03 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> PM: Preparing system for mem sleep
>> Stopping tasks: =======================================================|

> Did you see any methods before and after this line in hang case on
> screen?  If yes, do you recall what they are?

I capture across a serial console, so here are the exact msgs (I just
ran the second sleep and got the usual hang).  This is with vanilla
2.6.16-rc5 (and vanilla DSDT):

Stopping tasks: =========================================================|
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
Execute Method: [\_S3_] (Node c157a988)
Execute Method: [\_PTS] (Node c157ab48)

The screen itself is full of garbage because the first sleep/wake messes
up the console.  Along with a giant white square that fills most of the
screen, I see a fuzzy, dotted version of the above messages, plus one
more line "ACPI" and then a flashing underscore cursor after that.  I
don't know if it was trying to printk "ACPI" but then the rest of the
message got lost, or it hung before printing it, or whether the ACPI is
from a previous dmesg (i.e. the first sleep/wake) that didn't get
cleared properly.

-Sanjoy

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-18 16:37 Yu, Luming
  2006-03-18 17:03 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-18 16:37 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos


>Here first are the dmesgs from suspending with a vanilla 2.6.16-rc5.  I
>did only one cycle so that it didn't hang and I could edit this email
>without rebooting (but later suspends produce the same method calls, I'm
>90% sure):
>
># the sleep dmesgs
>PM: Preparing system for mem sleep
>Stopping tasks: =======================================================|

Did you see any methods before and after this line on screen in the
hang case?  If yes, do you recall what they are?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-18 15:58 Yu, Luming
@ 2006-03-18 16:27 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-18 16:27 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> just return AE_OK, because we are hacking. :-)

I found that out the hard way.  I first tried AE_BAD_PARAMETER but the
kernel panicked on boot, which I don't understand.  So then I switched to
returning AE_OK, and it booted fine.  But it hung on the *second* sleep
cycle.  So the problem got worse, or the bug is slithering around and
shows up in odd places depending on what we do.

>> That's in my lilo.conf so all kernels I test use those options.  I
>> can send you the dmesgs from the suspends without the ugly hack (and
>> will send them from the upcoming suspends, with the ugly hack).

> Thanks, I'm waiting for that to understand if the hack is clean for
> killing unwanted AML methods call.

Here first are the dmesgs from suspending with a vanilla 2.6.16-rc5.  I
did only one cycle so that it didn't hang and I could edit this email
without rebooting (but later suspends produce the same method calls, I'm
90% sure):

# the sleep dmesgs
PM: Preparing system for mem sleep
Stopping tasks: =======================================================|
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
Execute Method: [\_S3_] (Node c157a988)
Execute Method: [\_PTS] (Node c157ab48)
Execute Method: [\_SI_._SST] (Node c157a8c8)
uhci_hcd 0000:00:07.2: suspend_rh
uhci_hcd 0000:00:07.2: uhci_suspend
uhci_hcd 0000:00:07.2: --> PCI D0/legacy
PM: Entering mem sleep

# and here are the wakeup dmesgs

Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Back to C!
PM: Finishing wakeup.
Execute Method: [\_GPE._L0B] (Node c157a848)
PCI: Found IRQ 11 for device 0000:00:02.0
PCI: Sharing IRQ 11 with 0000:00:06.0
PCI: Sharing IRQ 11 with 0000:01:00.0
PCI: Found IRQ 11 for device 0000:00:02.1
uhci_hcd 0000:00:07.2: PCI legacy resume
PCI: Found IRQ 11 for device 0000:00:07.2
uhci_hcd 0000:00:07.2: uhci_resume
uhci_hcd 0000:00:07.2: uhci_check_and_reset_hc: legsup = 0x2000
uhci_hcd 0000:00:07.2: Performing full reset
usb usb1: root hub lost power or was reset
uhci_hcd 0000:00:07.2: suspend_rh
usb usb1: finish resume
uhci_hcd 0000:00:07.2: wakeup_rh
Restarting tasks...<7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
 done
Execute Method: [\_SI_._SST] (Node c157a8c8)
Execute Method: [\_WAK] (Node c157aac8)
Execute Method: [\_TZ_.THM0._PSV] (Node c157be48)
Execute Method: [\_TZ_.THM0._TC1] (Node c157bdc8)
Execute Method: [\_TZ_.THM0._TC2] (Node c157bd88)
Execute Method: [\_TZ_.THM0._TSP] (Node c157bd48)
Execute Method: [\_TZ_.THM0._AC0] (Node c157bf48)
Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)
Execute Method: [\_SI_._SST] (Node c157a8c8)
Execute Method: [\_TZ_.THM2._AC0] (Node c157bb48)
Execute Method: [\_TZ_.THM2._TMP] (Node c157bb88)
Execute Method: [\_TZ_.THM6._AC0] (Node c157b908)
Execute Method: [\_TZ_.THM6._TMP] (Node c157b948)
Execute Method: [\_TZ_.THM7._AC0] (Node c157b6c8)
Execute Method: [\_TZ_.THM7._TMP] (Node c157b708)<7>uhci_hcd 0000:00:07.2: suspend_rh (auto-stop)

Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
# next msgs are from 'cardctl eject'
ds: ds_open(socket 0)
ds: ds_open(socket 1)
ds: ds_open(socket 2)
pccard: card ejected from slot 1
PCMCIA: socket e231c028: *** DANGER *** unable to remove socket power
ds: ds_release(socket 0)
ds: ds_release(socket 1)


# now for the dmesgs with the latest hack (returning AE_OK) for the
# first suspend cycle (which didn't hang):

Stopping tasks: ====================================================|
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
Execute Method: [\_S3_] (Node c157a988)
Execute Method: [\_PTS] (Node c157ab48)
Execute Method: [\_SI_._SST] (Node c157a8c8)
uhci_hcd 0000:00:07.2: suspend_rh
uhci_hcd 0000:00:07.2: uhci_suspend
uhci_hcd 0000:00:07.2: --> PCI D0/legacy
PM: Entering mem sleep

# and the wakeup msgs

Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Back to C!
PM: Finishing wakeup.
PCI: Found IRQ 11 for device 0000:00:02.0
PCI: Sharing IRQ 11 with 0000:00:06.0
PCI: Sharing IRQ 11 with 0000:01:00.0
PCI: Found IRQ 11 for device 0000:00:02.1
uhci_hcd 0000:00:07.2: PCI legacy resume
PCI: Found IRQ 11 for device 0000:00:07.2
uhci_hcd 0000:00:07.2: uhci_resume
uhci_hcd 0000:00:07.2: uhci_check_and_reset_hc: legsup = 0x2000
uhci_hcd 0000:00:07.2: Performing full reset
usb usb1: root hub lost power or was reset
uhci_hcd 0000:00:07.2: suspend_rh
usb usb1: finish resume
uhci_hcd 0000:00:07.2: wakeup_rh
Restarting tasks...<7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
 done
Execute Method: [\_SI_._SST] (Node c157a8c8)
Execute Method: [\_WAK] (Node c157aac8)
Execute Method: [\_TZ_.THM0._PSV] (Node c157be48)
Execute Method: [\_TZ_.THM0._TC1] (Node c157bdc8)
Execute Method: [\_TZ_.THM0._TC2] (Node c157bd88)
Execute Method: [\_TZ_.THM0._TSP] (Node c157bd48)
Execute Method: [\_TZ_.THM0._AC0] (Node c157bf48)
Execute Method: [\_SI_._SST] (Node c157a8c8)
Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)
Execute Method: [\_TZ_.THM2._AC0] (Node c157bb48)
Execute Method: [\_TZ_.THM2._TMP] (Node c157bb88)
Execute Method: [\_TZ_.PFN0._OFF] (Node c157a288)
Execute Method: [\_TZ_.PFN0._STA] (Node c157a308)
Execute Method: [\_TZ_.THM6._AC0] (Node c157b908)
Execute Method: <7>uhci_hcd 0000:00:07.2: suspend_rh (auto-stop)
[\_TZ_.THM6._TMP] (Node c157b948)
Execute Method: [\_TZ_.THM7._AC0] (Node c157b6c8)
Execute Method: [\_TZ_.THM7._TMP] (Node c157b708)
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
# next msgs are from 'cardctl eject'
ds: ds_open(socket 0)
ds: ds_open(socket 1)
ds: ds_open(socket 2)
pccard: card ejected from slot 1
PCMCIA: socket e34e1828: *** DANGER *** unable to remove socket power
ds: ds_release(socket 0)
ds: ds_release(socket 1)
# I think these were after the wakeup when the fan turned on.
Execute Method: [\_SB_.PCI0.ISA0.EC0_._Q42] (Node c1572408)
Execute Method: [\_TZ_.THM2._AC0] (Node c157bb48)
Execute Method: [\_TZ_.THM2._TMP] (Node c157bb88)
Execute Method: [\_TZ_.PFN0._ON_] (Node c157a2c8)
Execute Method: [\_TZ_.PFN0._STA] (Node c157a308)
Execute Method: [\_TZ_.THM2._TMP] (Node c157bb88)
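
To compare the method calls between two captures, the paths can be
pulled out of the "Execute Method:" lines.  A minimal sketch, assuming a
hypothetical helper extract_method() and paths shorter than 64 bytes:

```c
#include <stdio.h>
#include <string.h>

/* Copy the method path from a line such as
 *   Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)
 * into out, which must hold at least 64 bytes.
 * Returns 1 on success, 0 if the line carries no bracketed path. */
int extract_method(const char *line, char *out)
{
	const char *p = strstr(line, "Execute Method: [");

	if (p == NULL)
		return 0;
	/* %63[^]] reads up to, but not including, the closing bracket. */
	return sscanf(p, "Execute Method: [%63[^]]]", out) == 1;
}
```

Lines where another printk was interleaved right after "Execute Method:"
are rejected rather than guessed at, so they need a manual look.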

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-18 15:58 Yu, Luming
  2006-03-18 16:27 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-18 15:58 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos


>> Please try additional ugly hack
>>  5. in acpi_os_queue_for_execution:
>>	if(acpi_in_suspend == YES)
>>		do nothing.
>
>Am compiling it.  If acpi_in_suspend, I've had it do
>return_ACPI_STATUS(AE_BAD_PARAMETER).  Is there a better error code to
>use?  I didn't want to use AE_OK, since the caller might think that
>the function will be executed eventually, and might do something silly
>like wait for it to be executed -- and produce another hang.  I didn't
>know, but to be safe I wanted to return an error code.

Just return AE_OK, because we are hacking. :-)
The only place that could have an issue is acpi_ev_global_lock_handler;
you can add a printk there to see what happens.

>
>> Also, please add acpi_debug_layer=0x10 acpi_debug_leve=0x10 boot
>> option, then you can observe what methods were executed before
>> suspend.
>
>That's in my lilo.conf so all kernels I test use those options.  I can
>send you the dmesgs from the suspends without the ugly hack (and will
>send them from the upcoming suspends, with the ugly hack).

Thanks, I'm waiting for those to see whether the hack cleanly
suppresses the unwanted AML method calls.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-18 15:10 Yu, Luming
@ 2006-03-18 15:48 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-18 15:48 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> Please try additional ugly hack
>  5. in acpi_os_queue_for_execution:
>	if(acpi_in_suspend == YES)
>		do nothing.

Am compiling it.  If acpi_in_suspend, I've had it do
return_ACPI_STATUS(AE_BAD_PARAMETER).  Is there a better error code to
use?  I didn't want to use AE_OK, since the caller might think that
the function will be executed eventually, and might do something silly
like wait for it to be executed -- and produce another hang.  I didn't
know, but to be safe I wanted to return an error code.

> Also, please add acpi_debug_layer=0x10 acpi_debug_leve=0x10 boot
> option, then you can observe what methods were executed before
> suspend.

That's in my lilo.conf so all kernels I test use those options.  I can
send you the dmesgs from the suspends without the ugly hack (and will
send them from the upcoming suspends, with the ugly hack).

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-18 15:10 Yu, Luming
  2006-03-18 15:48 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-18 15:10 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos


>> Hmm,  probably, you need to do :
>>
>> 4. in acpi_thermal_notify,
>>       if (acpi_in_suspend == YES)
>>               do nothing.
>
>I've just tested that.  It suspended twice without problem, which made
>me think the problem was solved.  But it hung on the third suspend!

I'm NOT surprised by that hang, because kacpid is a kernel worker
thread that has the PF_NOFREEZE flag, which means kacpid won't be
frozen.  I tried to freeze kacpid, but ended up with this conclusion.
From my understanding, kernel worker threads should be frozen for
safety, because kacpid could invoke AML methods that we are trying
to avoid during suspend.

Please try additional ugly hack
 5. in acpi_os_queue_for_execution:
	if(acpi_in_suspend == YES)
		do nothing.

Also, please add the acpi_debug_layer=0x10 acpi_debug_level=0x10
boot options, then you can observe what methods were executed
before suspend.
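
A rough userspace model of hack 5, only to show the intended behaviour.
The names in_suspend, queue_for_execution(), pm_prepare(), pm_finish()
and count_call() are stand-ins invented for this sketch, and the model
runs the callback directly instead of deferring it to kacpid:

```c
#include <stddef.h>

typedef void (*work_fn)(void *context);

int in_suspend;		/* models acpi_in_suspend */

/* Model of the hack: while suspending, drop the work but report
 * success (0) so the caller does not treat it as an error. */
int queue_for_execution(work_fn fn, void *context)
{
	if (fn == NULL)
		return -1;
	if (in_suspend)
		return 0;	/* do nothing */
	fn(context);		/* real code would defer this to kacpid */
	return 0;
}

void pm_prepare(void) { in_suspend = 1; }	/* as in acpi_pm_prepare */
void pm_finish(void)  { in_suspend = 0; }	/* as in acpi_pm_finish */

/* Example callback: count how often it actually ran. */
void count_call(void *context)
{
	(*(int *)context)++;
}
```

Between pm_prepare() and pm_finish() every queued callback is silently
dropped, which is the point of the hack: no AML method can be started
from the queue while the machine is going to sleep.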

Thanks,
Luming

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-18 13:24 Yu, Luming
@ 2006-03-18 14:37 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-18 14:37 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> Hmm,  probably, you need to do :
>
> 4. in acpi_thermal_notify,
>       if (acpi_in_suspend == YES)
>               do nothing.

I've just tested that.  It suspended twice without problem, which made
me think the problem was solved.  But it hung on the third suspend!

I placed all the source under revision control, since I was spending
more time moving versions of files back and forth (and fixing the
inevitable mistakes, forgetting which version was which) than I would by
figuring out how to use the SCM.  So here is its generated diff between
the vanilla kernel (with config file that uses vanilla DSDT) and the
kernel that I just tested.

As you can see, the only change is the short-term fix including item 4
above.  It doesn't do anything else, e.g. there's no code to load just
THM0 (which is probably why it hung).  

Perhaps the other thermal zones have different problems, or maybe
there's yet another source of thermal method calls?

-Sanjoy


diff -r ac486e270597 -r 03c54e90f75d drivers/acpi/sleep/main.c
--- a/drivers/acpi/sleep/main.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/sleep/main.c	Sat Mar 18 09:08:04 2006 -0500
@@ -19,6 +19,12 @@
 #include <acpi/acpi_drivers.h>
 #include "sleep.h"
 
+/* for functions putting machine to sleep to know that we're
+   suspending, so that they can be careful about what AML methods they
+   invoke (to avoid trying untested BIOS code paths) */
+int acpi_in_suspend;
+EXPORT_SYMBOL(acpi_in_suspend);
+
 u8 sleep_states[ACPI_S_STATE_COUNT];
 
 static struct pm_ops acpi_pm_ops;
@@ -55,6 +61,8 @@ static int acpi_pm_prepare(suspend_state
 		printk("acpi_pm_prepare does not support %d \n", pm_state);
 		return -EPERM;
 	}
+	acpi_os_wait_events_complete(NULL);
+	acpi_in_suspend = TRUE;
 	return acpi_sleep_prepare(acpi_state);
 }
 
@@ -131,6 +139,7 @@ static int acpi_pm_finish(suspend_state_
 {
 	u32 acpi_state = acpi_suspend_states[pm_state];
 
+	acpi_in_suspend = FALSE;
 	acpi_leave_sleep_state(acpi_state);
 	acpi_disable_wakeup_device(acpi_state);
 
diff -r ac486e270597 -r 03c54e90f75d drivers/acpi/thermal.c
--- a/drivers/acpi/thermal.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/thermal.c	Sat Mar 18 09:08:04 2006 -0500
@@ -79,6 +79,8 @@ static int tzp;
 static int tzp;
 module_param(tzp, int, 0);
 MODULE_PARM_DESC(tzp, "Thermal zone polling frequency, in 1/10 seconds.\n");
+
+extern int acpi_in_suspend;
 
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
@@ -683,6 +685,8 @@ static void acpi_thermal_run(unsigned lo
 static void acpi_thermal_run(unsigned long data)
 {
 	struct acpi_thermal *tz = (struct acpi_thermal *)data;
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+		return;		/* so don't do them */
 	if (!tz->zombie)
 		acpi_os_queue_for_execution(OSD_PRIORITY_GPE,
 					    acpi_thermal_check, (void *)data);
@@ -1224,6 +1228,9 @@ static void acpi_thermal_notify(acpi_han
 	struct acpi_device *device = NULL;
 
 	ACPI_FUNCTION_TRACE("acpi_thermal_notify");
+
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+		return_VOID;	/* so don't do them */
 
 	if (!tz)
 		return_VOID;


^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-18 13:24 Yu, Luming
  2006-03-18 14:37 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-18 13:24 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> The short-term proper way could be:
>> 1. add a global variable: acpi_in_suspend.
>> 2. in acpi_pm_prepare:
>>	a.call acpi_os_wait_events_complete()
>> 	b.set acpi_in_suspend = YES.
>>    in acpi_pm_finish :
>> 	set acpi_in_suspend = NO.
>> 3. in acpi_thermal_run:
>> 	if (acpi_in_suspend == YES)
>>		do nothing.
>
>I tested the included diff to implement the above short-term fix.  It
>also hung on the second sleep.  BUT, it's the same reason that the
>utils.c change didn't help: because acpi_thermal_add() was loading
>THM[0267].  After the usual modification to acpi_thermal_add() to have
>it ignore THM[267], the system didn't hang (12 cycles).  Which is
>progress.

Hmm, you probably also need to do:

4. in acpi_thermal_notify,
	if (acpi_in_suspend == YES)
		do nothing.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-18  2:02 Yu, Luming
@ 2006-03-18  7:23 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-18  7:23 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> Which is why I think the minimal change is the diff above to
>> utils.c [to make acpi_evaluate_integer fake _TMP].  With that
>> change the system never hung.

> Good, this is exactly what I wanted.  How many times you tested with
> this hack without hang?

Sadly, I just tried it again and it hung.  But from looking at my old
emails and test results, I know why.  I made the previous tests with
only THM0 loaded.  The bisection began by loading only THM0 (by having
acpi_thermal_add() ignore THM[267]).  The hang still happened, so I
never tested whether THM[267] also have problems: chase one problem at
a time.

To test the theory, I recompiled to recreate the kernel with the
utils.c change [to make acpi_evaluate_integer fake _TMP] and with only
THM0 loaded (i.e. what I had tested and reported a few days ago).  It
didn't hang (10 cycles), which repeats my previous result.

> The short-term proper way could be:
> 1. add a global variable: acpi_in_suspend.
> 2. in acpi_pm_prepare:
>	a.call acpi_os_wait_events_complete()
> 	b.set acpi_in_suspend = YES.
>    in acpi_pm_finish :
> 	set acpi_in_suspend = NO.
> 3. in acpi_thermal_run:
> 	if (acpi_in_suspend == YES)
>		do nothing.

I tested the included diff to implement the above short-term fix.  It
also hung on the second sleep.  BUT, it's the same reason that the
utils.c change didn't help: because acpi_thermal_add() was loading
THM[0267].  After the usual modification to acpi_thermal_add() to have
it ignore THM[267], the system didn't hang (12 cycles).  Which is
progress.

So I conclude that this diff does fix the THM0 problem, but that at
least one other thermal zone has a problem, and the problem is not
_TMP.  Or at least, the problem is not the same one that THM0 has
(running thermal threads at suspend time), otherwise the diff would
fix it like it fixed THM0.

I guess I should try loading only one of THM2,6,7 to see which zones
besides THM0 produce a problem, and then narrow the problem to one
method within the zone.

-Sanjoy

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-18  2:02 Yu, Luming
  2006-03-18  7:23 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-18  2:02 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> So, please try hack thermal.c by removing calls to _TMP.
>
>I did something like that before, by changing acpi_evaluate_integer()
>to return 3000 if it is asked for _TMP.  
>
>--- a/utils.c	2006-03-15 01:42:34.000000000 -0500
>+++ b/utils.c	2006-03-14 23:36:59.000000000 -0500
>@@ -270,7 +270,15 @@ acpi_evaluate_integer(acpi_handle handle
> 	memset(element, 0, sizeof(union acpi_object));
> 	buffer.length = sizeof(union acpi_object);
> 	buffer.pointer = element;
>-	status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
>+	if (strcmp(pathname, "_TMP") != 0)
>+	  status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
>+	else {
>+	  printk(KERN_INFO PREFIX "acpi_evaluate_integer: Faking _TMP\n");
>+	  status = AE_OK;
>+	  element->type = ACPI_TYPE_INTEGER;
>+	  element->integer.value = 3000; /* 27 C, in deciKelvins */
>+	}
>+
> 	if (ACPI_FAILURE(status)) {
> 		acpi_util_eval_error(handle, pathname, status);
> 		return_ACPI_STATUS(status);
>
>
>The alternative, obvious change in thermal.c (diff below) turns out
>not to be a minimal change.  If acpi_thermal_get_temperature() returns
>with a failure, then most of the later methods in THM0 aren't
>executed, so one is actually commenting out much more than _TMP.
>
>Which is why I think the minimal change is the diff above to utils.c.
>With that change the system never hung.

Good, this is exactly what I wanted.  How many times did you test with
this hack without a hang?  If the S3 hang really goes away, then you can
probably move on and come up with a real patch that could go into 2.6.16.
What do you think? :-)

The short-term proper way could be:
1. add a global variable: acpi_in_suspend.
2. in acpi_pm_prepare:
	a.call acpi_os_wait_events_complete()
	b.set acpi_in_suspend = YES.
   in acpi_pm_finish :
	set acpi_in_suspend = NO.
3. in acpi_thermal_run:
	if (acpi_in_suspend == YES)
		do nothing.
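
The three steps above can be sketched in user space as follows.  This is
only an illustration of the gating idea, not a kernel patch: every
function name ending in _sketch or _stub is hypothetical, and the real
acpi_pm_prepare(), acpi_pm_finish() and acpi_thermal_run() do much more
than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Step 1: the global flag (plain bool here; locking is ignored). */
static bool acpi_in_suspend;

/* Counts how often the simulated polling body "evaluated _TMP". */
static int tmp_evaluations;

/* Hypothetical stand-in for acpi_os_wait_events_complete(). */
void wait_events_complete_stub(void)
{
	/* Nothing is queued in this sketch, so there is nothing to drain. */
}

/* Step 2, suspend side: drain pending work, then block new work. */
void pm_prepare_sketch(void)
{
	wait_events_complete_stub();	/* 2a */
	acpi_in_suspend = true;		/* 2b */
}

/* Step 2, resume side: allow thermal work again. */
void pm_finish_sketch(void)
{
	/* In this sketch, finish is only meaningful after prepare. */
	assert(acpi_in_suspend);
	acpi_in_suspend = false;
}

/* Step 3: polling body; returns true if it would have called _TMP. */
bool thermal_run_sketch(void)
{
	if (acpi_in_suspend)
		return false;		/* do nothing while suspending */
	tmp_evaluations++;
	return true;
}
```

Draining before setting the flag follows the order of steps 2a and 2b;
work that is queued between the two steps is not handled by this sketch.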

The long-term proper way should be:
1. The ACPI subsystem should stop invoking the BIOS before suspend, except
for the few AML methods that are required to put the platform into the
S3 state.  Otherwise, untested BIOS code paths could cause trouble for
Linux, because I assume such platforms were tested under Windows.

Thanks,
Luming
 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-17  7:50 Yu, Luming
@ 2006-03-17 18:43 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-17 18:43 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> So, please try hack thermal.c by removing calls to _TMP.

I did something like that before, by changing acpi_evaluate_integer()
to return 3000 if it is asked for _TMP.  

--- a/utils.c	2006-03-15 01:42:34.000000000 -0500
+++ b/utils.c	2006-03-14 23:36:59.000000000 -0500
@@ -270,7 +270,15 @@ acpi_evaluate_integer(acpi_handle handle
 	memset(element, 0, sizeof(union acpi_object));
 	buffer.length = sizeof(union acpi_object);
 	buffer.pointer = element;
-	status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
+	if (strcmp(pathname, "_TMP") != 0)
+	  status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
+	else {
+	  printk(KERN_INFO PREFIX "acpi_evaluate_integer: Faking _TMP\n");
+	  status = AE_OK;
+	  element->type = ACPI_TYPE_INTEGER;
+	  element->integer.value = 3000; /* 27 C, in deciKelvins */
+	}
+
 	if (ACPI_FAILURE(status)) {
 		acpi_util_eval_error(handle, pathname, status);
 		return_ACPI_STATUS(status);


The alternative, obvious change in thermal.c (diff below) turns out
not to be a minimal change.  If acpi_thermal_get_temperature() returns
with a failure, then most of the later methods in THM0 aren't
executed, so one is actually commenting out much more than _TMP.

Which is why I think the minimal change is the diff above to utils.c.
With that change the system never hung.

Or should I do a compromise modification, where calls from thermal.c
to _TMP use the hacked acpi_evaluate_integer()
[e.g. acpi_evaluate_integer_called_from_thermal()], but other calls to
_TMP get the unhacked version?
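
A user-space sketch of that compromise, assuming a thermal-only wrapper
in front of the normal evaluator.  Both function names below are
hypothetical and the stub merely stands in for the real
acpi_evaluate_integer(); the 3000 is the same faked reading as in the
utils.c diff above.

```c
#include <assert.h>
#include <string.h>

#define FAKE_TMP_DK 3000ULL	/* faked reading, in tenths of kelvin */

/* Hypothetical stand-in for the unhacked evaluator. */
int evaluate_integer_stub(const char *pathname, unsigned long long *data)
{
	assert(pathname != NULL && data != NULL);
	*data = 1234;		/* arbitrary marker: "the firmware answered" */
	return 0;
}

/* Wrapper that only the thermal driver would call. */
int evaluate_integer_from_thermal(const char *pathname,
				  unsigned long long *data)
{
	assert(pathname != NULL && data != NULL);
	if (strcmp(pathname, "_TMP") == 0) {
		*data = FAKE_TMP_DK;	/* never reaches the firmware */
		return 0;
	}
	/* Every other method goes through the normal path. */
	return evaluate_integer_stub(pathname, data);
}
```

Callers outside the thermal driver would keep using the unhacked
evaluator, so a _TMP request from anywhere else still reaches the BIOS.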

Here is the diff for commenting out _TMP directly in thermal.c, which
I think I've tried already but I'll try it again.  I'm sure it'll
work, though.

--- a/thermal.c	2006-03-16 09:45:30.000000000 -0500
+++ b/thermal.c	2006-03-17 09:00:30.000000000 -0500
@@ -222,7 +222,7 @@ static int acpi_thermal_get_temperature(
 
 	ACPI_FUNCTION_TRACE("acpi_thermal_get_temperature");
 
-	if (!tz)
+	if (!tz || strcmp(tz->handle, "_TMP") == 0)
 		return_VALUE(-EINVAL);
 
 	tz->last_temperature = tz->temperature;



^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-17  7:50 Yu, Luming
  2006-03-17 18:43 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-17  7:50 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> How about re-testing dummy _PSV and dummy _AC0 in DSDT?
>
>Just retested and you were right.  This time I managed to get it to
>hang, after many cycles of sleep.sh and "modprobe -r thermal ;
>modprobe thermal" mixed in.
>

Hmmm, so I think this is a problem with _TMP alone.

It is neither _TMP && (_PSV || _AC0),
nor _TMP || _PSV || _AC0.

So, please try hacking thermal.c by removing the calls to _TMP,
and do a stress test with the vanilla kernel and vanilla DSDT, with
only thermal.c hacked.

Anyway, the clean way to fix your problem might be:

 add a suspend method to the thermal driver that disables invocation
of the AML methods that might trigger BIOS issues.

Thanks,
Luming

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-17  6:57 Yu, Luming
  2006-03-17  7:11 ` Sanjoy Mahajan
@ 2006-03-17  7:32 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-17  7:32 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> How about re-testing dummy _PSV and dummy _AC0 in DSDT?

Just retested and you were right.  This time I managed to get it to
hang, after many cycles of sleep.sh and "modprobe -r thermal ;
modprobe thermal" mixed in.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-17  6:57 Yu, Luming
@ 2006-03-17  7:11 ` Sanjoy Mahajan
  2006-03-17  7:32 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-17  7:11 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

By the way, I wonder if this problem is the same as
<http://bugzilla.kernel.org/show_bug.cgi?id=5037>, about S3 hangs with
kernel pre-empts enabled.

> How about re-testing dummy _PSV and dummy _AC0 in DSDT?

I'll do that.  It's the one data point that I'm not sure about.  With
dummy _PSV, it hangs, though it takes a bit of stressing it before it
hangs.  But with dummy _PSV and dummy _AC0, I could not make it hang.
I tried it twice, each time stressing it as much as I could (about 10
or so cycles, with thermal polling thrown in as well as module loading
and unloading).  Even though it didn't hang, it did get *very*
sluggish at times, and once woke up with load=8.2 even though no
processes were running.  Lots of ACPI threads?

I'll test it just with sleep.sh, no thermal polling.  Maybe also with
loading and unloading thermal.ko.

> How about just faking _TMP in DSDT. I'm sure you have done this
> before.

This one I've tried, and it worked fine (no hang).  I tested it for a
while and then retested it.  It also works fine if I take out just the
EC0.UPDT line in _TMP (with AC0 already taken out).

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-17  6:57 Yu, Luming
  2006-03-17  7:11 ` Sanjoy Mahajan
  2006-03-17  7:32 ` Sanjoy Mahajan
  0 siblings, 2 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-17  6:57 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> Hmm,  we can continue to have fun with debugging. Right?
>
>Definitely, I haven't given up.

Great!

>
>>> The second sleep.sh hangs going to sleep.  It is in an endless loop
>>> printing the following line, once per second (from the
>>> polling_frequency):
>>>
>>>  Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)
>
>I don't think these lines are a problem.  They just reflect that
>thermal polling is happening once per second.  So even though the ACPI
>system is hanging in the SMPI loop (as you say below), it is alive
>enough to poll the temperature sensors.
>
>> Also please mute THM0 polling.
>
>I retested the hacked kernel (with faked thermal_active/passive)
>but with no thermal polling, just doing
>
>  cat THM*/polling_frequency (they were all 'polling disabled')
>  sleep.sh  (works)
>  sleep.sh  (hangs in the usual SMPI loop)
>
>and it hangs as usual.

Good news: there is no new branch to track.
I assume the problem is still like _TMP & (_PSV | _AC0).

How about re-testing dummy _PSV and dummy _AC0 in the DSDT?
Your test result with dummy _PSV and dummy _AC0 IS NOT consistent
with the result of hacking acpi_thermal_passive/active.
Maybe I need to reconsider the impact of _PSV or _AC0 on the
platform.

How about just faking _TMP in the DSDT?  I'm sure you have done this
before, but I need to confirm that the problem is NOT _TMP | _PSV | _AC0.

Thanks,
Luming

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-17  1:17 Yu, Luming
@ 2006-03-17  6:28 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-17  6:28 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> Hmm,  we can continue to have fun with debugging. Right?

Definitely, I haven't given up.

>> The second sleep.sh hangs going to sleep.  It is in an endless loop
>> printing the following line, once per second (from the
>> polling_frequency):
>>
>>  Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)

I don't think these lines are a problem.  They just reflect that
thermal polling is happening once per second.  So even though the ACPI
system is hanging in the SMPI loop (as you say below), it is alive
enough to poll the temperature sensors.

> Also please mute THM0 polling.

I retested the hacked kernel (with faked thermal_active/passive)
but with no thermal polling, just doing

  cat THM*/polling_frequency (they were all 'polling disabled')
  sleep.sh  (works)
  sleep.sh  (hangs in the usual SMPI loop)

and it hangs as usual.

> This should be the different problem from the previous reported hang.
> I recall it was hanging at a loop in SMPI waiting for BIOS's response.
> Please confirm, 

I just retested vanilla 2.6.16-rc5 (vanilla kernel, vanilla DSDT),
with polling_interval=1 (second).  My earlier tests with that kernel
had polling_interval=100, and the easiest way to reproduce the hang
was:

  echo 100 > THM0/polling_interval
  modprobe -r thermal ; modprobe thermal
  sleep.sh  (this hangs)

With this method, the system would hang on the *first* sleep cycle.
The other method to produce the hang, with thermal polling muted, was:

  echo 0 > THM0/polling_interval (and the rest of them, to make sure)
  sleep.sh  (it comes back)
  sleep.sh  (this one hangs)

I tried the same method but with 1 second instead of 100 seconds:

  echo 1 > THM0/polling_interval
  sleep.sh  (this one works, maybe because I didn't do the modprobing)
  sleep.sh  (this hangs)

The second sleep.sh hangs in the usual loop, which produces the
ex-region etc. loop, but interspersed in that dmesg output
is the output from the thermal polling.  So I also see 

  Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)

plus its associated function traces (ec_intr_write or something like
that -- I saved all the log files).

One other point is that we haven't yet used a piece of information:
that the system never hangs if I boot with ec_intr=0.  Actually,
that's why I tried commenting out the \_SB.PCI0.ISA0.EC0.UPDT () line
in _TMP method, and it did 'solve' the problem (at least, it did with
AC0 faked -- I haven't tried keeping AC0 but taking out just that
line).

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-17  1:17 Yu, Luming
  2006-03-17  6:28 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-17  1:17 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>Bad news.  It hangs when I do the usual stress test:

Hmm,  we can continue to have fun with debugging. Right?

>
>echo 1 > THM0/polling_frequency
>sleep.sh
>sleep.sh
>
>The second sleep.sh hangs going to sleep.  It is in an endless loop
>printing the following line, once per second (from the
>polling_frequency):
>
>  Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)

This should be a different problem from the previously reported hang.
I recall that one was hanging in a loop in SMPI waiting for the BIOS's
response.  Please confirm.  Also, please mute THM0 polling.

>
>> Please also make sure you have vanilla DSDT
>
>$ grep DSDT /boot/config-2.6.16-rc5.fake-thermal_active+passive
># CONFIG_ACPI_CUSTOM_DSDT is not set
>
>> vanilla Kernel, and just hacked acpi_thermal_active/passive.
>
>Only diff between pristine 2.6.16-rc5 tree and mine is:
>
>diff -rup /tmp/linux-2.6.16-rc5/drivers/acpi/thermal.c /usr/src/linux-2.6.16-rc5/drivers/acpi/thermal.c
>--- /tmp/linux-2.6.16-rc5/drivers/acpi/thermal.c	2006-02-27 00:09:35.000000000 -0500
>+++ /usr/src/linux-2.6.16-rc5/drivers/acpi/thermal.c	2006-03-16 09:45:30.000000000 -0500
>@@ -526,6 +526,8 @@ static void acpi_thermal_passive(struct 
> 
> 	ACPI_FUNCTION_TRACE("acpi_thermal_passive");
> 
>+	return;
>+
> 	if (!tz || !tz->trips.passive.flags.valid)
> 		return;
> 
>@@ -615,6 +617,8 @@ static void acpi_thermal_active(struct a
> 
> 	ACPI_FUNCTION_TRACE("acpi_thermal_active");
> 
>+	return;
>+
> 	if (!tz)
> 		return;
> 
>

This looks ok for debugging.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-16  8:18 Yu, Luming
@ 2006-03-16 15:15 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-16 15:15 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

Bad news.  It hangs when I do the usual stress test:

echo 1 > THM0/polling_frequency
sleep.sh
sleep.sh

The second sleep.sh hangs going to sleep.  It is in an endless loop
printing the following line, once per second (from the
polling_frequency):

  Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)

> Please also make sure you have vanilla DSDT

$ grep DSDT /boot/config-2.6.16-rc5.fake-thermal_active+passive
# CONFIG_ACPI_CUSTOM_DSDT is not set

> vanilla Kernel, and just hacked acpi_thermal_active/passive.

Only diff between pristine 2.6.16-rc5 tree and mine is:

diff -rup /tmp/linux-2.6.16-rc5/drivers/acpi/thermal.c /usr/src/linux-2.6.16-rc5/drivers/acpi/thermal.c
--- /tmp/linux-2.6.16-rc5/drivers/acpi/thermal.c	2006-02-27 00:09:35.000000000 -0500
+++ /usr/src/linux-2.6.16-rc5/drivers/acpi/thermal.c	2006-03-16 09:45:30.000000000 -0500
@@ -526,6 +526,8 @@ static void acpi_thermal_passive(struct 
 
 	ACPI_FUNCTION_TRACE("acpi_thermal_passive");
 
+	return;
+
 	if (!tz || !tz->trips.passive.flags.valid)
 		return;
 
@@ -615,6 +617,8 @@ static void acpi_thermal_active(struct a
 
 	ACPI_FUNCTION_TRACE("acpi_thermal_active");
 
+	return;
+
 	if (!tz)
 		return;
 

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-16  8:18 Yu, Luming
  2006-03-16 15:15 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-16  8:18 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos


>> To verify this, please hack acpi_thermal_active.
>
>Do you mean hack it for now to return without doing anything (like if
>'tz' wasn't valid)?  Or do it farther in the function, like by
>changing
>
>				result =
>				    acpi_bus_set_power(active->devices.
>						       handles[j],
>						       ACPI_STATE_D0);
>to 
>
>				result = 1;
>
>>  Disable active/passive cooling request before suspend.

Yes, just return, and DON'T do anything that could impact the platform.

>
>Do I need to hack acpi_thermal_passive() as well?

Yes.

Please also make sure you have the vanilla DSDT, the vanilla kernel, and
just the hacked acpi_thermal_active/passive.

I'm waiting for your good news.
If it is the root cause, you will probably need to come up with a real
patch. :-)


Thanks,
Luming

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-16  7:28 Yu, Luming
@ 2006-03-16  7:57 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-16  7:57 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> To verify this, please hack acpi_thermal_active.

Do you mean hack it for now to return without doing anything (like if
'tz' wasn't valid)?  Or do it farther in the function, like by
changing

				result =
				    acpi_bus_set_power(active->devices.
						       handles[j],
						       ACPI_STATE_D0);
to 

				result = 1;

>  Disable active/passive cooling request before suspend.

Do I need to hack acpi_thermal_passive() as well?

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-16  7:28 Yu, Luming
  2006-03-16  7:57 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-16  7:28 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>It doesn't hang.  Though it seemed close to hanging a couple times,
>but after a 5-10 second pause always managed to go to sleep.  I tried
>about 15 sleep cycles, with a few echo 1 > polling_frequency thrown in.

The ACPI spec defines:

_PSV  : thermal zone object that returns the passive trip point in
	tenths of degrees Kelvin.

_ACx:  thermal zone object that returns active cooling policy 
	threshold values in tenths of degrees Kelvin.
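
For readers not used to tenths of kelvin, here is a small conversion
sketch.  The helper name is hypothetical (it is not a kernel function);
the arithmetic only assumes that 0 C is 273.15 K.

```c
#include <assert.h>

/*
 * Convert tenths of kelvin to thousandths of a degree Celsius.
 * 0 C is 273.15 K, i.e. 2731.5 tenths of kelvin, so the input is
 * scaled by 100 first to stay in integer arithmetic.
 */
long decikelvin_to_millicelsius(long dk)
{
	assert(dk >= 0);	/* a kelvin reading cannot be negative */
	return dk * 100 - 273150;
}
```

With this, a reading of 3000 (0x0BB8) comes out as 26850, i.e. 26.85 C,
and 0x0AAC (2732) comes out as 50, i.e. 0.05 C.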

I suspect that, when it hung, the system was trying to start active
cooling with the fan in acpi_thermal_active(), and that request somehow
conflicted with _PTS's call to SMPI in the BIOS.  So, the solution is:

	Disable active/passive cooling requests before suspend.

To verify this, please hack acpi_thermal_active.

We need a suspend/resume method for ACPI thermal to cleanly solve
your problem.

Thanks,
Luming

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-16  6:41 Yu, Luming
  2006-03-16  6:54 ` Sanjoy Mahajan
@ 2006-03-16  7:14 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-16  7:14 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> I found the common code in _PSV and _AC0
>    Store (DerefOf (Index (DerefOf (MODP (0x01)), Local1)), Local0)
> Could you just comment out that?

It doesn't hang.  Though it seemed close to hanging a couple times,
but after a 5-10 second pause always managed to go to sleep.  I tried
about 15 sleep cycles, with a few echo 1 > polling_frequency thrown in.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-16  6:41 Yu, Luming
@ 2006-03-16  6:54 ` Sanjoy Mahajan
  2006-03-16  7:14 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-16  6:54 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

Okay it's compiling with that change.  Now those two methods look
like:

      Method (_PSV, 0, NotSerialized)
      {
        /* Store (DerefOf (Index (DerefOf (MODP (0x00)), 0x01)),  Local0) */
	   Return (Local0)
      }
      Method (_AC0, 0, NotSerialized)
      {
	  If (H8DR)
	  {
	      Store (\_SB.PCI0.ISA0.EC0.HT00, Local1)
	  }
	  Else
	  {
	      And (\_SB.RBEC (0x20), 0x01, Local1)
	  }

	  Store (Local1, \_TZ.THM0.AC0M)
       /* Store (DerefOf (Index (DerefOf (MODP (0x01)), Local1)), Local0) */
	  Return (Local0)
      }

But I have two worries:

1. The lines that I commented out are not identical (if they are
   identical in your setup, maybe we have different disassembled
   DSDT's?).

2. With those lines commented out, the local variables might contain
   garbage, since those lines initialize them.  The iasl compiler also
   worries about this:

  thm0-ac0psv-line.dsl 10504:                 Return (Local0)
  Error    1013 - Method local variable is not initialized ^  (Local0)

  thm0-ac0psv-line.dsl 10520:                  Return (Local0)
  Error    1013 -  Method local variable is not initialized ^  (Local0)

Should I change the Return statement to Return(0)?

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-15  8:02 Yu, Luming
  2006-03-16  0:03 ` Sanjoy Mahajan
  2006-03-16  5:47 ` Sanjoy Mahajan
@ 2006-03-16  6:46 ` Sanjoy Mahajan
  2 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-16  6:46 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

I started subdividing the _TMP method, and found that

hangs: -AC0 (as reported in the last email)
okay:  -AC0-TMP (also in last email)

but

okay: -AC0-one line of TMP, by which I mean getting rid of the 
EC0.UPDT() line below:

            Method (_TMP, 0, NotSerialized)
            {
                \_SB.PCI0.ISA0.EC0.UPDT ()
                Store (\_SB.PCI0.ISA0.EC0.TMP0, Local0)
                If (LGreater (Local0, 0x0AAC))
                {
                    Return (Local0)
                }
                Else
                {
                    Return (0x0BB8)
                }
            }

So that's a small change in which having a line means the hang
happens, and not having the line means it goes away.
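
For reference, the If/Else part of that method written out in C.  This
is only a restatement of the ASL above: the function name is made up,
raw_dk stands in for the EC's TMP0 field, and the EC0.UPDT() call is
deliberately left out because it has no user-space equivalent.

```c
#include <assert.h>

#define TMP_FLOOR_DK    0x0AAC	/* 2732 tenths of kelvin, about 0 C */
#define TMP_FALLBACK_DK 0x0BB8	/* 3000 tenths of kelvin, about 27 C */

/* Mirrors the LGreater test in _TMP; values are in tenths of kelvin. */
unsigned int tmp_result(unsigned int raw_dk)
{
	unsigned int result;

	if (raw_dk > TMP_FLOOR_DK)
		result = raw_dk;		/* plausible reading */
	else
		result = TMP_FALLBACK_DK;	/* implausible: report 0x0BB8 */

	/* The method never reports a value at or below the floor. */
	assert(result > TMP_FLOOR_DK);
	return result;
}
```

The fallback is the same 0x0BB8 that the faked _TMP returns, so the
faked method behaves like the real one with an implausible EC reading
and without the UPDT() call.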

By the way, I just checked -AC0-TMP and it was okay (no hang).  That
data point is consistent with TMP & (PSV | AC0).

> I found the common code in _PSV and _AC0
>    Store (DerefOf (Index (DerefOf (MODP (0x01)), Local1)), Local0)
> Could you just comment out that?

I will try that right now (leaving TMP as in the vanilla DSDT).

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-16  6:41 Yu, Luming
  2006-03-16  6:54 ` Sanjoy Mahajan
  2006-03-16  7:14 ` Sanjoy Mahajan
  0 siblings, 2 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-16  6:41 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>   hang iff (TMP & (PSV | AC0)).

Very interesting! 

I found the common code in _PSV and _AC0

 Store (DerefOf (Index (DerefOf (MODP (0x01)), Local1)), Local0)

Could you just comment out that?

We are very near the root cause.

Thanks,
luming




^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-15  8:02 Yu, Luming
  2006-03-16  0:03 ` Sanjoy Mahajan
@ 2006-03-16  5:47 ` Sanjoy Mahajan
  2006-03-16  6:46 ` Sanjoy Mahajan
  2 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-16  5:47 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

Here are the test results on the DSDT variants, as a tree.  THM0 is the
root (meaning a DSDT with only THM0).  -XYZ means 'fake the method XYZ
in THM0, relative to the situation in the parent DSDT':

hang:	 THM0
okay:		-TMP
hang:		-PSV
okay:			-AC0 (i.e. THM0 methods but no PSV, no AC0)
hang:			-SCP
hang:		-AC0

The first two results are consistent with the view that TMP is the
problem.  

From the first five results, I convinced myself that TMP needed AC0
around to cause a problem, and vice versa: hang iff (TMP & AC0).  But
the last result (-AC0) surprised me.  Now I think: 
   hang iff (TMP & (PSV | AC0)).
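
As a cross-check, the two candidate rules can be compared against the
six results in the tree above.  This is just a sketch: the struct and
function names are made up, "true" means the method kept its real body,
and SCP is left out because neither rule mentions it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One DSDT variant: which methods are real, and what was observed. */
struct variant {
	const char *name;
	bool tmp, psv, ac0;	/* true = real method, false = faked */
	bool hung;		/* observed outcome */
};

/* The six results from the tree, in the same order. */
static const struct variant observed[] = {
	{ "THM0",     true,  true,  true,  true  },
	{ "-TMP",     false, true,  true,  false },
	{ "-PSV",     true,  false, true,  true  },
	{ "-PSV-AC0", true,  false, false, false },
	{ "-PSV-SCP", true,  false, true,  true  },
	{ "-AC0",     true,  true,  false, true  },
};

/* First guess: hang iff (TMP & AC0). */
bool rule_tmp_and_ac0(const struct variant *v)
{
	return v->tmp && v->ac0;
}

/* Revised guess: hang iff (TMP & (PSV | AC0)). */
bool rule_tmp_and_psv_or_ac0(const struct variant *v)
{
	return v->tmp && (v->psv || v->ac0);
}

/* Count the observations that a rule predicts wrongly. */
size_t count_mismatches(bool (*rule)(const struct variant *))
{
	size_t i, bad = 0;

	assert(rule != NULL);
	for (i = 0; i < sizeof(observed) / sizeof(observed[0]); i++)
		if (rule(&observed[i]) != observed[i].hung)
			bad++;
	return bad;
}
```

The first rule mispredicts only the -AC0 variant; the revised rule
matches all six rows.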

The -PSV-AC0 DSDT, which did not hang, seemed close to hanging.  After
a few cycles, it became very sluggish and the load was 8.2 on wakeup.
But the sluggishness disappeared after a couple more cycles, and I
couldn't produce a hang (tried two reboots, each with different
permutations of sleep.sh or "echo 1 > THM0/polling_frequency").

The -AC0 DSDT hung upon doing 

 echo 1 > THM0/polling_frequency ; sleep.sh; sleep.sh

and it got in an endless loop that showed it sluggishly executing
(over and over again) THM0._TMP.

It's probably not coincidence that TMP and AC0 both use the EC.
Although PSV doesn't.

I didn't make any tests on the MODP method.  And the TC1, TC2, and TSP
methods seemed so trivial (just returning a constant) that it didn't
seem worth testing them.

I keep the kernels around for each permutation, so I can retest any of
the above, or send the THM0 portions of the .dsl files.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-15  8:02 Yu, Luming
@ 2006-03-16  0:03 ` Sanjoy Mahajan
  2006-03-16  5:47 ` Sanjoy Mahajan
  2006-03-16  6:46 ` Sanjoy Mahajan
  2 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-16  0:03 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

Your intuition was right.  Testing by changing only the DSDT gives
slightly different results than by changing the kernel drivers.  So
far the results:

with only THM0 in the DSDT: HANGS (but only on the second sleep, not
   the first one, so a slight difference with the kernel-testing data)

with only THM0 and all methods doing nothing (either returning 0 or,
   for _TMP, 0xBB8):  NO hang

with only THM0 and all methods except _TMP doing nothing, but _TMP
   doing its normal code: NO hang [that's the difference between DSDT
   and kernel testing]

More bisection coming.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-15  8:02 Yu, Luming
  2006-03-16  0:03 ` Sanjoy Mahajan
                   ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-15  8:02 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>So, before I begin that search, which THM0 methods can I safely get
>rid of?  All of AC0M, AC1M, PSL, TC1, TC2, TSP, TBL0, MODP, _CRT, AL0,
>PSL?  That'll leave MODE, _TMP, _AC0, _SCP, PSV to bisect among.

For example, I would fake these methods this way:

            Method (_TMP, 0, NotSerialized)
            {
                Return (0x0BB8)
            }

            Method (_PSV, 0, NotSerialized)
            {
                Return (0)
            }

Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)
Execute Method: [\_TZ_.THM0._PSV] (Node c157be48)
Execute Method: [\_TZ_.THM0._TC1] (Node c157bdc8)
Execute Method: [\_TZ_.THM0._TC2] (Node c157bd88)
Execute Method: [\_TZ_.THM0._TSP] (Node c157bd48)
Execute Method: [\_TZ_.THM0._AC0] (Node c157bf48)
Execute Method: [\_TZ_.THM0._SCP] (Node c157bec8)
Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-15  6:47 Yu, Luming
@ 2006-03-15  7:06 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-15  7:06 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

[Sorry, forgot to reply to all the first time.]

How about this plan (basically what you suggested, but I want to
confirm its sensibility since these tests take a while):

Since I know (or think I know) that just THM0 with just _TMP causes
problems when investigating it by modifying the kernel, I'll instead
use a vanilla kernel but modify the DSDT as follows:

Remove the THM2, THM6, and THM7 code blocks.  See whether the DSDT
even compiles (e.g. TWAK might complain because of the "Notify
(\_TZ.THM7, 0x81)").  If it does, see if the resulting kernel hangs.
If it hangs (which I expect), then I'll bisect within THM0 to find at
least one method that causes a problem.

So, before I begin that search, which THM0 methods can I safely get
rid of?  All of PSL, TC1, TC2, TSP, MODP?  That'll leave _TMP, _AC0,
_SCP, PSV to bisect among.  I'll assume that the NAME statements are
okay.

Here's the THM0 code:

        ThermalZone (THM0)
        {
            Name (MODE, 0x00)
            Name (AC0M, 0x00)
            Name (AC1M, 0x00)
            Name (TBL0, Package (0x02)
            {
                Package (0x03)
                {
                    Package (0x02)
                    {
                        0x0E62, 
                        0x0E49
                    }, 

                    Package (0x02)
                    {
                        0x0E30, 
                        0x0DBD
                    }, 

                    Package (0x02)
                    {
                        0x0E26, 
                        0x0DB3
                    }
                }, 

                Package (0x03)
                {
                    Package (0x02)
                    {
                        0x0E26, 
                        0x0DB3
                    }, 

                    Package (0x02)
                    {
                        0x0E62, 
                        0x0E49
                    }, 

                    Package (0x02)
                    {
                        0x0E30, 
                        0x0DBD
                    }
                }
            })
            Method (MODP, 1, NotSerialized)
            {
                Return (Index (DerefOf (Index (TBL0, MODE)), Arg0))
            }

            Method (_TMP, 0, NotSerialized)
            {
                \_SB.PCI0.ISA0.EC0.UPDT ()
                Store (\_SB.PCI0.ISA0.EC0.TMP0, Local0)
                If (LGreater (Local0, 0x0AAC))
                {
                    Return (Local0)
                }
                Else
                {
                    Return (0x0BB8)
                }
            }

            Method (_AC0, 0, NotSerialized)
            {
                If (H8DR)
                {
                    Store (\_SB.PCI0.ISA0.EC0.HT00, Local1)
                }
                Else
                {
                    And (\_SB.RBEC (0x20), 0x01, Local1)
                }

                Store (Local1, \_TZ.THM0.AC0M)
                Store (DerefOf (Index (DerefOf (MODP (0x01)),
                Local1)), Local0)
                Return (Local0)
            }

            Name (_CRT, 0x0E80)
            Method (_SCP, 1, NotSerialized)
            {
                Notify (\_TZ.THM0, 0x81)
            }

            Name (_AL0, Package (0x01)
            {
                FN00
            })
            Method (_PSV, 0, NotSerialized)
            {
                Store (DerefOf (Index (DerefOf (MODP (0x00)), 0x01)),
            Local0)
                Return (Local0)
            }

            Name (_PSL, Package (0x01)
            {
                \_PR.CPU0
            })
            Method (_TC1, 0, NotSerialized)
            {
                Return (TTC1)
            }

            Method (_TC2, 0, NotSerialized)
            {
                Return (TTC2)
            }

            Method (_TSP, 0, NotSerialized)
            {
                Return (TTSP)
            }
        }
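
To make the table lookups in the ASL above easier to follow, here is a
rough Python model of MODP, _PSV, _AC0, and _TMP.  It is a sketch, not
a replacement for reading the ASL: the EC accesses are replaced by
plain arguments (ec_reading stands in for EC0.TMP0 after UPDT, and ht00
for the bit read from HT00 or RBEC), the store to AC0M is omitted, and
the lower-case function names are invented for the example.

```python
# TBL0 copied from the ASL: TBL0[MODE][Arg0] is a two-element package.
TBL0 = [
    [[0x0E62, 0x0E49], [0x0E30, 0x0DBD], [0x0E26, 0x0DB3]],
    [[0x0E26, 0x0DB3], [0x0E62, 0x0E49], [0x0E30, 0x0DBD]],
]

def modp(mode, arg0):
    """MODP: pick the package for trip point arg0 in the current MODE."""
    return TBL0[mode][arg0]

def psv(mode=0):
    """_PSV: element 1 of MODP(0).  No EC access is involved."""
    return modp(mode, 0)[1]

def ac0(ht00, mode=0):
    """_AC0: element of MODP(1) selected by a bit that comes from the EC."""
    return modp(mode, 1)[ht00]

def tmp(ec_reading):
    """_TMP: return the EC reading if above 0x0AAC, else 0x0BB8."""
    return ec_reading if ec_reading > 0x0AAC else 0x0BB8

print(hex(psv()), hex(ac0(0)), hex(ac0(1)), hex(tmp(0)))
```

With MODE at its initial value of 0, this model gives 0x0E49 for _PSV
and either 0x0E30 or 0x0DBD for _AC0, depending on the EC bit.  The
point of the model is that _PSV is a pure table lookup, while _TMP and
_AC0 both depend on a value obtained through the EC.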

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-15  6:47 Yu, Luming
  2006-03-15  7:06 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-15  6:47 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos


>> If you remove the real THM0._TMP, and fake a dummy THM0._TMP in
>> DSDT, and don't change anything in kernel, then if S3 works well, I
>> will be convinced that THM0._TMP was causing trouble.
>
>I'll try it, to test my theory above!  But one clarification first: Do
>you mean that I use a vanilla thermal.c, or should I keep using the
>modified thermal.c with zone_to_keep=0 as the module parameter?  I
>don't think I should revert to the vanilla thermal.c.  Suppose that
>there are two bugs, which I think is likely (see previous email).  Commenting
>out only THM0._TMP but preserving everything else in the DSDT & kernel
>might eliminate any bug caused by THM0._TMP.  But if it still hangs --
>and I'm pretty sure it will -- it means there's another bug
>somewhere else.

OK, I'm fine with whichever way you choose to start.  But I think you
need to verify the findings with the UN-modified kernel, UN-modified
thermal.c, and the other pieces that can reproduce the S3 hang with the
UN-modified DSDT.




* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-15  6:16 Yu, Luming
@ 2006-03-15  6:35 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-15  6:35 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> If you do it in this way, all thermal zone's _TMP will be faked.

Loading 'thermal' with zone_to_keep=0 meant that it skipped THM{2,6,7}
(the only other zones).  But only THM0 was loaded, so any path that
included, say, THM2._TMP wouldn't get executed because of lines like:

     if (!tz)
	return_VALUE(-EINVAL);

Plus the dmesgs show all cases when _TMP was faked (each fakery
produces a printk).  In the experiment with zone_to_keep=0, the only
cases were with THM0.

> If you remove the real THM0._TMP, and fake a dummy THM0._TMP in
> DSDT, and don't change anything in kernel, then if S3 works well, I
> will be convinced that THM0._TMP was causing trouble.

I'll try it, to test my theory above!  But one clarification first: Do
you mean that I use a vanilla thermal.c, or should I keep using the
modified thermal.c with zone_to_keep=0 as the module parameter?  I
don't think I should revert to the vanilla thermal.c.  Suppose that
there are two bugs, which I think is likely (see previous email).  Commenting
out only THM0._TMP but preserving everything else in the DSDT & kernel
might eliminate any bug caused by THM0._TMP.  But if it still hangs --
and I'm pretty sure it will -- it means there's another bug
somewhere else.

Here's why I'm sure it will hang.  When I commented out all
evaluations of _TMP (modifying utils.c), but used a vanilla thermal.c,
it still hung.  And commenting out all _TMP's means I commented out
THM0._TMP.  So vanilla thermal.c + no THM0._TMP should hang too.

> Ok, Let's change the way of hacking. Let's start bisection without
> touching kernel, instead with DSDT.

No problem I think.

> Firstly, you need to find out which THM.

The zone_to_keep=0 tests show that THM0 causes a problem, don't they?
Other zones may also cause a problem, but THM0 can do it all alone.

> Then, which Methods.  

The test that hung on the first S3 sleep, with zone_to_keep=0 and
bisect_get_info=1, shows that just THM0._TMP can cause a problem --
since no other methods got executed.

As with figuring out which zones cause problems, other methods may
also cause the problem.  So I want to make sure I use a bisection
method that will work even if there is more than one bug, whether in
multiple zones or in multiple methods in the same zone.
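
One way to get a search that tolerates more than one culprit is to
drop components one at a time and keep each removal only if the hang
survives it.  The sketch below assumes a hypothetical hangs() oracle
(in practice, one build-boot-suspend test per call) and placeholder
names m1..m4; neither comes from the actual DSDT.

```python
def minimal_hanging_set(components, hangs):
    """Greedily remove components while the hang still reproduces.

    Returns a set in which every remaining component is needed for
    the hang, using one hangs() call per component.
    """
    kept = list(components)
    for c in components:
        trial = [k for k in kept if k != c]
        if hangs(trial):      # still hangs without c, so c is not needed
            kept = trial
    return kept

def pretend_hangs(present):
    """Hypothetical oracle with two separate bugs:
    m4 alone hangs, and m1 together with m3 hangs."""
    return "m4" in present or ("m1" in present and "m3" in present)

first = minimal_hanging_set(["m1", "m2", "m3", "m4"], pretend_hangs)
print(first)
# Rerun without the culprit found so far to expose the second bug.
second = minimal_hanging_set(["m1", "m2", "m3"], pretend_hangs)
print(second)
```

Unlike a halving search, this never assumes that exactly one half
contains the problem, so a second bug cannot hide the first.  The cost
is one sleep test per component per pass rather than a logarithmic
number of tests.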

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-15  6:25 Yu, Luming
  0 siblings, 0 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-15  6:25 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos


>One sad piece of data that I came across, perhaps worth investigating
>further after this one is chased down:
>
>As described in the last email, the combination of _TMP fakery (in
>utils.c) plus the bisecting version of thermal.c (loading only the
>zone THM0 and then only up to bisect_get_info=1) got rid of the hangs.
>
>So I got bold and tried _TMP fakery but with the vanilla thermal.c.
>The idea being that if _TMP is to blame for all the problems, then S3
>sleep should work fine with this setup.  But it hung in the usual way,
>on the second sleep.  Below are the dmesgs after the usual boot-time
>ones.
>
>This experiment produces a hang even with _TMP faked, whereas the
>previous experiment didn't (also with _TMP faked but, after the boot,
>loading only the THM0 zone and only doing the _TMP methods of it, even
>on wake).  So one of the non-TMP methods below must be causing a
>problem?  My suspicion is that it's one of the methods called on wake
>(_THM0._PSV or ._TC1, etc. or maybe one of the other zone's methods),
>which would explain why the first sleep goes fine but the second one
>fails.
>
>I don't think it's any of the calls made when 'thermal' is loading at
>boot time, because the same calls happen in the previous experiment.
>In that experiment, thermal loads normally (with _TMP faked), and only
>after boot do I unload it and replace it with
>
>  modprobe thermal zone_to_keep=0 bisect_get_info=1
>
>Anyway, here are the dmesgs for this experiment (hangs on 2nd sleep):

OK, let's change the way of hacking.  Let's start the bisection
without touching the kernel, using the DSDT instead.
First, you need to find out which THM zone.
Then, which methods.
Finally, which statements trigger the S3 hang.

Thanks,
Luming


* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-15  6:16 Yu, Luming
  2006-03-15  6:35 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-15  6:16 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>> Could you just comment out _TMP in kernel or in DSDT, 
>
>I think it needs both excisions: If I comment out just the kernel _TMP
>calls, the DSDT might slip one in through the interpreter.  If I
>comment out just the DSDT _TMP calls, then the kernel can still call
>_TMP.  So instead I modified acpi_evaluate_integer() to return 27 C
>(3000 dK) if it's ever asked for a temperature, without doing any
>actual work:
>
>--- utils.c.orig	2006-02-27 00:09:35.000000000 -0500
>+++ utils.c		2006-03-14 23:36:59.000000000 -0500
>@@ -270,7 +270,15 @@ acpi_evaluate_integer(acpi_handle handle
>   memset(element, 0, sizeof(union acpi_object));
>   buffer.length = sizeof(union acpi_object);
>   buffer.pointer = element;
>-  status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
>+  if (strcmp(pathname, "_TMP") != 0)
>+    status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
>+  else {
>+    printk(KERN_INFO PREFIX "acpi_evaluate_integer: Faking _TMP\n");
>+    status = AE_OK;
>+    element->type = ACPI_TYPE_INTEGER;
>+    element->integer.value = 3000; /* 27 C, in deciKelvins */
>+  }
>+
>	if (ACPI_FAILURE(status)) {
>	   acpi_util_eval_error(handle, pathname, status);
>					return_ACPI_STATUS(status);
>
>This diff is in addition to the previous debugging changes to
>thermal.c.

If you do it in this way, all thermal zone's _TMP will be faked.
If you remove the real THM0._TMP, and fake a dummy THM0._TMP
in DSDT, and don't change anything in kernel, then if S3 works
well, I will be convinced that THM0._TMP was causing trouble.
Yes, I'm asking you to override DSDT for debugging. :-)
But please make sure you don't change other things in the DSDT;
otherwise the result still won't be trusted. :-)

Anyway, I'm studying THM0._TMP and trying to figure out how it is
related to the EC.

Thanks,
Luming

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-15  1:46 Yu, Luming
  2006-03-15  5:40 ` Sanjoy Mahajan
@ 2006-03-15  5:57 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-15  5:57 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

One sad piece of data that I came across, perhaps worth investigating
further after this one is chased down:

As described in the last email, the combination of _TMP fakery (in
utils.c) plus the bisecting version of thermal.c (loading only the
zone THM0 and then only up to bisect_get_info=1) got rid of the hangs.

So I got bold and tried _TMP fakery but with the vanilla thermal.c.
The idea being that if _TMP is to blame for all the problems, then S3
sleep should work fine with this setup.  But it hung in the usual way,
on the second sleep.  Below are the dmesgs after the usual boot-time
ones.

This experiment produces a hang even with _TMP faked, whereas the
previous experiment didn't (also with _TMP faked but, after the boot,
loading only the THM0 zone and only doing the _TMP methods of it, even
on wake).  So one of the non-TMP methods below must be causing a
problem?  My suspicion is that it's one of the methods called on wake
(_THM0._PSV or ._TC1, etc. or maybe one of the other zone's methods),
which would explain why the first sleep goes fine but the second one
fails.

I don't think it's any of the calls made when 'thermal' is loading at
boot time, because the same calls happen in the previous experiment.
In that experiment, thermal loads normally (with _TMP faked), and only
after boot do I unload it and replace it with

  modprobe thermal zone_to_keep=0 bisect_get_info=1

Anyway, here are the dmesgs for this experiment (hangs on 2nd sleep):

# loading 'thermal' on boot (with vanilla thermal.c, so it loads
# all the thermal zones):
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM0._PSV] (Node c157be48)
Execute Method: [\_TZ_.THM0._TC1] (Node c157bdc8)
Execute Method: [\_TZ_.THM0._TC2] (Node c157bd88)
Execute Method: [\_TZ_.THM0._TSP] (Node c157bd48)
Execute Method: [\_TZ_.THM0._AC0] (Node c157bf48)
Execute Method: [\_TZ_.THM0._SCP] (Node c157bec8)
ACPI: acpi_evaluate_integer: Faking _TMP
ACPI: Thermal Zone [THM0] (27 C)
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM2._AC0] (Node c157bb48)
Execute Method: [\_TZ_.THM2._SCP] (Node c157bac8)
ACPI: acpi_evaluate_integer: Faking _TMP
ACPI: Thermal Zone [THM2] (27 C)
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM6._AC0] (Node c157b908)
Execute Method: [\_TZ_.THM6._SCP] (Node c157b888)
ACPI: acpi_evaluate_integer: Faking _TMP
ACPI: Thermal Zone [THM6] (27 C)
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM7._AC0] (Node c157b6c8)
Execute Method: [\_TZ_.THM7._SCP] (Node c157b648)
ACPI: acpi_evaluate_integer: Faking _TMP
ACPI: Thermal Zone [THM7] (27 C)

# from "echo 100 > THM0/polling_frequency"
ACPI: acpi_evaluate_integer: Faking _TMP
# now doing the 'sleep.sh' script
# though for consistency maybe I should first do 
#   'modprobe -r thermal  ; modprobe thermal' 
eth0: removing device
Unloaded prism54 driver
PM: Preparing system for mem sleep
Stopping tasks: ====================================================|
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
Execute Method: [\_S3_] (Node c157a988)
Execute Method: [\_PTS] (Node c157ab48)
Execute Method: [\_SI_._SST] (Node c157a8c8)
uhci_hcd 0000:00:07.2: suspend_rh
uhci_hcd 0000:00:07.2: uhci_suspend
uhci_hcd 0000:00:07.2: --> PCI D0/legacy
PM: Entering mem sleep
# wake it up
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Back to C!
PM: Finishing wakeup.
Execute Method: [\_GPE._L0B] (Node c157a848)
PCI: Found IRQ 11 for device 0000:00:02.0
PCI: Sharing IRQ 11 with 0000:00:06.0
PCI: Sharing IRQ 11 with 0000:01:00.0
PCI: Found IRQ 11 for device 0000:00:02.1
uhci_hcd 0000:00:07.2: PCI legacy resume
PCI: Found IRQ 11 for device 0000:00:07.2
uhci_hcd 0000:00:07.2: uhci_resume
uhci_hcd 0000:00:07.2: uhci_check_and_reset_hc: legsup = 0x2000
uhci_hcd 0000:00:07.2: Performing full reset
usb usb1: root hub lost power or was reset
uhci_hcd 0000:00:07.2: suspend_rh
usb usb1: finish resume
uhci_hcd 0000:00:07.2: wakeup_rh
Restarting tasks...<7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
 done
Execute Method: [\_SI_._SST] (Node c157a8c8)
Execute Method: [\_WAK] (Node c157aac8)
Execute Method: [\_TZ_.THM0._PSV] (Node c157be48)
Execute Method: [\_TZ_.THM0._TC1] (Node c157bdc8)
Execute Method: [\_TZ_.THM0._TC2] (Node c157bd88)
Execute Method: [\_TZ_.THM0._TSP] (Node c157bd48)
Execute Method: [\_TZ_.THM0._AC0] (Node c157bf48)
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM2._AC0] (Node c157bb48)
Execute Method: [\_SI_._SST] (Node c157a8c8)
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM6._AC0] (Node c157b908)
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM7._AC0] (Node c157b6c8)
ACPI: acpi_evaluate_integer: Faking _TMP
uhci_hcd 0000:00:07.2: suspend_rh (auto-stop)
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
ds: ds_open(socket 0)
ds: ds_open(socket 1)
ds: ds_open(socket 2)
pccard: card ejected from slot 1
PCMCIA: socket e3003c28: *** DANGER *** unable to remove socket power
ds: ds_release(socket 0)
ds: ds_release(socket 1)
PM: Preparing system for mem sleep

# and it hangs here.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-15  1:46 Yu, Luming
@ 2006-03-15  5:40 ` Sanjoy Mahajan
  2006-03-15  5:57 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-15  5:40 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

> Could you just comment out _TMP in kernel or in DSDT, 

I think it needs both excisions: If I comment out just the kernel _TMP
calls, the DSDT might slip one in through the interpreter.  If I
comment out just the DSDT _TMP calls, then the kernel can still call
_TMP.  So instead I modified acpi_evaluate_integer() to return 27 C
(3000 dK) if it's ever asked for a temperature, without doing any
actual work:

--- utils.c.orig	2006-02-27 00:09:35.000000000 -0500
+++ utils.c		2006-03-14 23:36:59.000000000 -0500
@@ -270,7 +270,15 @@ acpi_evaluate_integer(acpi_handle handle
   memset(element, 0, sizeof(union acpi_object));
   buffer.length = sizeof(union acpi_object);
   buffer.pointer = element;
-  status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
+  if (strcmp(pathname, "_TMP") != 0)
+    status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
+  else {
+    printk(KERN_INFO PREFIX "acpi_evaluate_integer: Faking _TMP\n");
+    status = AE_OK;
+    element->type = ACPI_TYPE_INTEGER;
+    element->integer.value = 3000; /* 27 C, in deciKelvins */
+  }
+
	if (ACPI_FAILURE(status)) {
	   acpi_util_eval_error(handle, pathname, status);
					return_ACPI_STATUS(status);

This diff is in addition to the previous debugging changes to
thermal.c.
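
For reference, the "27 C" for a faked value of 3000 follows from the
deciKelvin arithmetic.  The snippet below is a small sketch of that
conversion; it takes 0 C as 2732 dK (273.15 K rounded to tenths), and
the helper name dk_to_celsius is invented for the example.

```python
def dk_to_celsius(dk):
    """Convert tenths of a kelvin to degrees Celsius, taking 0 C as 2732 dK."""
    return (dk - 2732) / 10

faked = 3000  # the value the patched acpi_evaluate_integer() hands back
print(dk_to_celsius(faked))         # a little under 27 C
print(round(dk_to_celsius(faked)))  # rounds to the 27 C shown in dmesg
```

So 3000 dK is strictly 26.8 C, and the "Thermal Zone [THM0] (27 C)"
lines are consistent with that value after rounding to whole degrees.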

> and do several S3 suspend /resume Cycles without remove thermal
> module, I want to make sure we are at right place to drill down.

I repeated yesterday's experiments: 

  echo 100 > /proc/acpi/thermal_zone/THM0/polling_frequency
  modprobe -rv thermal
  modprobe thermal zone_to_keep=0 bisect_get_info=1
  sleep.sh

with the modified utils.c (being careful to install the new kernel and
reboot, not just reinstall modules, since utils.c is part of the acpi
builtins).  And, unlike yesterday (when _TMP was unhacked), there was
no hang.  Nor did it hang after five more sleep-wake cycles.

Here are the dmesgs starting when 'thermal' is loaded at boot
(i.e. with the above patch but no special zone_to_keep etc. params),
and then with the commands above:

# during boot
ACPI: thermal_add: THM0
# next line is from the utils.c modification to return 27 C always
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM0._PSV] (Node c157be48)
Execute Method: [\_TZ_.THM0._TC1] (Node c157bdc8)
Execute Method: [\_TZ_.THM0._TC2] (Node c157bd88)
Execute Method: [\_TZ_.THM0._TSP] (Node c157bd48)
Execute Method: [\_TZ_.THM0._AC0] (Node c157bf48)
Execute Method: [\_TZ_.THM0._SCP] (Node c157bec8)
ACPI: acpi_evaluate_integer: Faking _TMP
ACPI: Thermal Zone [THM0] (27 C)
ACPI: thermal_add: THM2
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM2._AC0] (Node c157bb48)
Execute Method: [\_TZ_.THM2._SCP] (Node c157bac8)
ACPI: acpi_evaluate_integer: Faking _TMP
ACPI: Thermal Zone [THM2] (27 C)
ACPI: thermal_add: THM6
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM6._AC0] (Node c157b908)
Execute Method: [\_TZ_.THM6._SCP] (Node c157b888)
ACPI: acpi_evaluate_integer: Faking _TMP
ACPI: Thermal Zone [THM6] (27 C)
ACPI: thermal_add: THM7
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_TZ_.THM7._AC0] (Node c157b6c8)
Execute Method: [\_TZ_.THM7._SCP] (Node c157b648)
ACPI: acpi_evaluate_integer: Faking _TMP
ACPI: Thermal Zone [THM7] (27 C)
ACPI: thermal_add: _TZ
ACPI: acpi_evaluate_integer: Faking _TMP

# booting is done.  Now for
#   echo 100 > /proc/acpi/thermal_zone/THM0/polling_frequency
ACPI: acpi_evaluate_integer: Faking _TMP
# now "modprobe -rv thermal; modprobe thermal zone_to_keep=0 bisect_get_info=1"
ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: thermal_add: THM0
ACPI: acpi_evaluate_integer: Faking _TMP
ACPI: thermal_get_info: got temperature, but bisect_get_info = 1 so exiting
ACPI: acpi_evaluate_integer: Faking _TMP
ACPI: Thermal Zone [THM0] (27 C)
ACPI: thermal_add: THM2
ACPI: thermal_add: ignoring THM2
ACPI: thermal_add: THM6
ACPI: thermal_add: ignoring THM6
ACPI: thermal_add: THM7
ACPI: thermal_add: ignoring THM7
ACPI: thermal_add: _TZ
ACPI: thermal_add: ignoring _TZ
# now sleep.sh
eth0: removing device
Unloaded prism54 driver
PM: Preparing system for mem sleep
Stopping tasks: =======================================================|
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
Execute Method: [\_S3_] (Node c157a988)
Execute Method: [\_PTS] (Node c157ab48)
Execute Method: [\_SI_._SST] (Node c157a8c8)
uhci_hcd 0000:00:07.2: suspend_rh
uhci_hcd 0000:00:07.2: uhci_suspend
uhci_hcd 0000:00:07.2: --> PCI D0/legacy
PM: Entering mem sleep
# hit "Fn" key to wake it up
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Back to C!
PM: Finishing wakeup.
Execute Method: [\_GPE._L0B] (Node c157a848)
PCI: Found IRQ 11 for device 0000:00:02.0
PCI: Sharing IRQ 11 with 0000:00:06.0
PCI: Sharing IRQ 11 with 0000:01:00.0
PCI: Found IRQ 11 for device 0000:00:02.1
uhci_hcd 0000:00:07.2: PCI legacy resume
PCI: Found IRQ 11 for device 0000:00:07.2
uhci_hcd 0000:00:07.2: uhci_resume
uhci_hcd 0000:00:07.2: uhci_check_and_reset_hc: legsup = 0x2000
uhci_hcd 0000:00:07.2: Performing full reset
usb usb1: root hub lost power or was reset
uhci_hcd 0000:00:07.2: suspend_rh
usb usb1: finish resume
uhci_hcd 0000:00:07.2: wakeup_rh
Restarting tasks...<7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
 done
Execute Method: [\_SI_._SST] (Node c157a8c8)
Execute Method: [\_WAK] (Node c157aac8)
Execute Method: [\_TZ_.THM0._PSV] (Node c157be48)
Execute Method: [\_TZ_.THM0._TC1] (Node c157bdc8)
Execute Method: [\_TZ_.THM0._TC2] (Node c157bd88)
Execute Method: [\_TZ_.THM0._TSP] (Node c157bd48)
Execute Method: [\_TZ_.THM0._AC0] (Node c157bf48)
ACPI: acpi_evaluate_integer: Faking _TMP
Execute Method: [\_SI_._SST] (Node c157a8c8)
uhci_hcd 0000:00:07.2: suspend_rh (auto-stop)
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
ds: ds_open(socket 0)
ds: ds_open(socket 1)
ds: ds_open(socket 2)
# from explicit 'cardctl eject' in sleep.sh's wake portion (to save battery)
pccard: card ejected from slot 1
PCMCIA: socket e36dac28: *** DANGER *** unable to remove socket power
ds: ds_release(socket 0)
ds: ds_release(socket 1)
ACPI: acpi_evaluate_integer: Faking _TMP

# and I can keep doing 'sleep.sh' with no problem

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-15  1:46 Yu, Luming
  2006-03-15  5:40 ` Sanjoy Mahajan
  2006-03-15  5:57 ` Sanjoy Mahajan
  0 siblings, 2 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-15  1:46 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

>
>[I've trimmed non-relevant lists (v4l-dvb-maintainer@linuxtv.org,
>video4linux-list@redhat.com, linux-ide@vger.kernel.org,
>linux-input@atrey.karlin.mff.cuni.cz,
>linux-usb-devel@lists.sourceforge.net) from the CC.  Let me know if
>anyone else wants to be trimmed.]
>
>> Could you do bisection to find out which methods or which thermal
>> zone cause trouble?  To do that, you have to hack thermal.c by
>> commenting out some calls of evaluating methods below.  I hope it is
>> easy for you!  :-)
>
>I eventually muddled my way there.  The short story is that I can
>reproduce the hang -- on the FIRST S3 cycle -- when the _TMP method is
>called a few times, just for THM0.  

Excellent!
Could you just comment out _TMP in the kernel or in the DSDT, and do
several S3 suspend/resume cycles without removing the thermal module?
I want to make sure we are at the right place to drill down.

Thanks for your testing reports.  They are impressive. :-)

--Luming

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-14  1:48 Yu, Luming
@ 2006-03-14  8:28 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-14  8:28 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, Brian Marete,
	Ryan Phillips, gregkh, Brown, Len, linux-acpi, Mark Lord,
	Randy Dunlap, jgarzik, Duncan, Pavlik Vojtech, Meelis Roos

[I've trimmed non-relevant lists (v4l-dvb-maintainer@linuxtv.org,
video4linux-list@redhat.com, linux-ide@vger.kernel.org,
linux-input@atrey.karlin.mff.cuni.cz,
linux-usb-devel@lists.sourceforge.net) from the CC.  Let me know if
anyone else wants to be trimmed.]

> Could you do bisection to find out which methods or which thermal
> zone cause trouble?  To do that, you have to hack thermal.c by
> commenting out some calls of evaluating methods below.  I hope it is
> easy for you!  :-)

I eventually muddled my way there.  The short story is that I can
reproduce the hang -- on the FIRST S3 cycle -- when the _TMP method is
called a few times, just for THM0.  The boot loads 'thermal' as usual
and produces the usual list of method calls.  The following snippet
shows the annotated dmesgs after the boot and until the hang, produced
by:

  echo 100 > /proc/acpi/thermal_zone/THM0/polling_frequency
  modprobe -rv thermal
  modprobe thermal zone_to_keep=0 bisect_get_info=1
  sleep.sh

# from "echo 100 > /proc/acpi/thermal_zone/THM0/polling_frequency"
Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)

# "modprobe -rv thermal" produces no output

# the next msgs are due to
#   modprobe thermal zone_to_keep=0 bisect_get_info=1
# (see below for details on the two new debugging params)
# 'thermal' loads 'processor', which produces the next two lines:
ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
ACPI: Processor [CPU0] (supports 8 throttling states)

# now loading 'thermal' with the params above
ACPI: thermal_add: THM0
Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)

# the bisect_get_info parameter value says how far
# to go into acpi_thermal_get_info() before exiting.
ACPI: thermal_get_info: got temperature, but bisect_get_info = 1 so exiting

# next line is probably from the acpi_thermal_check(tz) in acpi_thermal_add()
Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)

# next line means THM0 is sorted out
ACPI: Thermal Zone [THM0] (42 C)

# the other zones are ignored (set by the zone_to_keep=0)
ACPI: thermal_add: THM2
ACPI: thermal_add: ignoring THM2
ACPI: thermal_add: THM6
ACPI: thermal_add: ignoring THM6
ACPI: thermal_add: THM7
ACPI: thermal_add: ignoring THM7
ACPI: thermal_add: _TZ
ACPI: thermal_add: ignoring _TZ

# now 'thermal' has loaded but only with THM0
# now for the first "sleep.sh", which unloads a few drivers etc.:
Unloaded prism54 driver
@@@@ SLEEP

# and it hangs!  Without the  
#    echo 100 > /proc/acpi/thermal_zone/THM0/polling_frequency
# at the beginning, a second cycle is required to reproduce the hang.
# That requirement makes debugging tricky because the wakeup runs a
# bunch of thermal methods, plus the serial console doesn't work right
# away, so several lines of dmesgs get lost.

I've included the diff for the thermal.c changes.  I added two module
parameters:

  zone_to_keep=N 
  bisect_get_info=N

where zone_to_keep says which of THM{0,2,6,7} to load --
acpi_thermal_add() just returns -EINVAL when it's asked to load one of
the others.  So if zone_to_keep=8 for example, then no thermal zones
will be loaded.  The bisect_get_info value says how far to go into
acpi_thermal_get_info() before exiting.
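
A minimal user-space sketch of the zone_to_keep filter, assuming the
same "THM<N>" name comparison as in the diff below.  The keep_zone
function is hypothetical and exists only for illustration; it is not
part of the patch.

```bash
#!/bin/bash
# Hypothetical model of the zone_to_keep check; not kernel code.
# keep_zone ZONE_TO_KEEP BUS_ID: exit status 0 if the zone would be loaded.
keep_zone() {
    local zone_to_keep=$1 bus_id=$2
    # A negative value (the default is -1) disables the filter.
    if (( zone_to_keep < 0 )); then
        return 0
    fi
    # Literal string comparison against "THM<N>", like the strcmp() call.
    [[ "THM${zone_to_keep}" == "$bus_id" ]]
}

# Replay the zone names from the dmesgs above with zone_to_keep=0.
for zone in THM0 THM2 THM6 THM7 _TZ; do
    echo "thermal_add: $zone"
    if ! keep_zone 0 "$zone"; then
        echo "thermal_add: ignoring $zone"
    fi
done
```

With zone_to_keep=8 no name matches, so every zone is reported as
ignored, which corresponds to the control experiment.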

Other data:

1. As a control experiment, I loaded thermal with zone_to_keep=8 (so
   no thermal zones loaded or methods called).  Then I could S3 cycle
   many many times and never noticed a problem.

2. If I loaded thermal with zone_to_keep=0 bisect_get_info=1 (so as in
   the dmesgs above but without the "echo 100 >
   THM0/polling_frequency"), then the second S3 sleep invariably would
   hang.  Compared to the dmesgs above (which have no wake), these
   dmesgs had extra lines.  On wakeup, a bunch of the other THM0
   methods would be executed:

   Execute Method: [\_SI_._SST] (Node c157a8c8)
   Execute Method: [\_WAK] (Node c157aac8)
   Execute Method: [\_TZ_.THM0._PSV] (Node c157be48)
   Execute Method: [\_TZ_.THM0._TC1] (Node c157bdc8)
   Execute Method: [\_TZ_.THM0._TC2] (Node c157bd88)
   Execute Method: [\_TZ_.THM0._TSP] (Node c157bd48)
   Execute Method: [\_TZ_.THM0._AC0] (Node c157bf48)
   Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)
   Execute Method: [\_SI_._SST] (Node c157a8c8)

   even though I had set bisect_get_info=1 to prevent those methods
   from being run.  Eventually I realized that on wakeup they were
   probably being called by someone else calling acpi_thermal_check()
   and I should have been more clever in how I did the bisect.

   After this wakeup, the second sleep would hang as usual.

I suppose my next step will be to make acpi_thermal_get_temperature()
return (set by a third module param) before it evaluates the _TMP
method and see whether the hang remains.  But hopefully the data above
is already useful for analysis while I collect more data.
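
For reference, a sketch of how one trial per bisect stage could be
laid out.  It only prints the commands and does not run them.  The
stage meanings are read off the bisect_get_info checks in the diff;
the two helper functions are hypothetical names used for illustration.

```bash
#!/bin/bash
# Sketch only: prints (does not run) the commands for each bisect stage.
stage_description() {
    case "$1" in
        1) echo "exit after reading the temperature (_TMP)" ;;
        2) echo "exit after reading the trip points (_CRT, _PSV, etc.)" ;;
        3) echo "exit after setting the cooling mode (_SCP)" ;;
        4) echo "exit after getting the polling frequency" ;;
        *) echo "run all of acpi_thermal_get_info()" ;;
    esac
}

# bisect_commands ZONE STAGE: the reload-and-sleep sequence for one trial.
bisect_commands() {
    local zone=$1 stage=$2
    echo "# stage $stage: $(stage_description "$stage")"
    echo "modprobe -rv thermal"
    echo "modprobe thermal zone_to_keep=$zone bisect_get_info=$stage"
    echo "sleep.sh"
}

for stage in 1 2 3 4; do
    bisect_commands 0 "$stage"
done
```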

Here is the diff for thermal.c

--- thermal.c.orig	2006-03-14 01:01:05.000000000 -0500
+++ thermal.c	2006-03-14 01:16:58.000000000 -0500
@@ -76,10 +76,13 @@
 MODULE_DESCRIPTION(ACPI_THERMAL_DRIVER_NAME);
 MODULE_LICENSE("GPL");
 
-static int tzp;
+static int tzp, zone_to_keep = -1, bisect_get_info;
 module_param(tzp, int, 0);
+module_param(zone_to_keep, int, 0);
+module_param(bisect_get_info, int, 0);
 MODULE_PARM_DESC(tzp, "Thermal zone polling frequency, in 1/10 seconds.\n");
 
+
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
 static int acpi_thermal_state_open_fs(struct inode *inode, struct file *file);
@@ -1268,13 +1271,29 @@
 	if (result)
 		return_VALUE(result);
 
+	if (bisect_get_info == 1) {
+	  printk (KERN_INFO PREFIX "thermal_get_info: got temperature, but bisect_get_info = %d so exiting\n", bisect_get_info);
+	  return_VALUE(0);
+	}
+
 	/* Get trip points [_CRT, _PSV, etc.] (required) */
 	result = acpi_thermal_get_trip_points(tz);
 	if (result)
 		return_VALUE(result);
 
+	if (bisect_get_info == 2) {
+	  printk (KERN_INFO PREFIX "thermal_get_info: got trip points, but bisect_get_info = %d so exiting\n", bisect_get_info);
+	  return_VALUE(0);
+	}
+
 	/* Set the cooling mode [_SCP] to active cooling (default) */
 	result = acpi_thermal_set_cooling_mode(tz, ACPI_THERMAL_MODE_ACTIVE);
+
+	if (bisect_get_info == 3) {
+	  printk (KERN_INFO PREFIX "thermal_get_info: set active cooling, but bisect_get_info = %d so exiting with %d\n", bisect_get_info, result);
+	  return_VALUE(result);
+	}
+
 	if (!result)
 		tz->flags.cooling_mode = 1;
 	else {
@@ -1306,6 +1325,11 @@
 	else
 		acpi_thermal_get_polling_frequency(tz);
 
+	if (bisect_get_info == 4) {
+	  printk (KERN_INFO PREFIX "thermal_get_info: got default polling frequency, but bisect_get_info = %d so exiting\n", bisect_get_info);
+	  return_VALUE(0);
+	}
+
 	/* Get devices in this thermal zone [_TZD] (optional) */
 	result = acpi_thermal_get_devices(tz);
 	if (!result)
@@ -1319,12 +1343,25 @@
 	int result = 0;
 	acpi_status status = AE_OK;
 	struct acpi_thermal *tz = NULL;
+	char zone_to_keep_str[10];
 
 	ACPI_FUNCTION_TRACE("acpi_thermal_add");
 
 	if (!device)
 		return_VALUE(-EINVAL);
 
+	/* debugging bugzilla.kernel.org #5989 (second S3 sleep hangs) */
+	printk(KERN_INFO PREFIX "thermal_add: %s\n", device->pnp.bus_id);
+	if (zone_to_keep >= 0) {
+	  snprintf (zone_to_keep_str, sizeof(zone_to_keep_str),
+		    "THM%d", zone_to_keep);
+	  if (strcmp(zone_to_keep_str, device->pnp.bus_id) != 0) {
+	    printk(KERN_INFO PREFIX "thermal_add: ignoring %s\n",
+		   device->pnp.bus_id);
+	    return_VALUE(-EINVAL);
+	  }
+	}
+
 	tz = kmalloc(sizeof(struct acpi_thermal), GFP_KERNEL);
 	if (!tz)
 		return_VALUE(-ENOMEM);

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-14  1:48 Yu, Luming
  2006-03-14  8:28 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-14  1:48 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

>> Hmm, could you file dmesgs with thermal module loaded and unloaded?
>
>Filed at bugzilla.

Excellent!

>Let me know if there's a different permutation of debug options that I
>should try.  I wasn't sure whether you meant that I should leave all
>the debug values at 0x10.  Or whether I should still include
>acpi_debug=0xffffffff on top of the other options.

So far, it's OK.  I saw the method calls listed below.  Could you do a
bisection to find out which methods or which thermal zone cause trouble?
To do that, you have to hack thermal.c by commenting out
some of the calls that evaluate the methods below.
I hope it is easy for you!  :-)

Thanks,
Luming

Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)
Execute Method: [\_TZ_.THM0._PSV] (Node c157be48)
Execute Method: [\_TZ_.THM0._TC1] (Node c157bdc8)
Execute Method: [\_TZ_.THM0._TC2] (Node c157bd88)
Execute Method: [\_TZ_.THM0._TSP] (Node c157bd48)
Execute Method: [\_TZ_.THM0._AC0] (Node c157bf48)
Execute Method: [\_TZ_.THM0._SCP] (Node c157bec8)
Execute Method: [\_TZ_.THM0._TMP] (Node c157bf88)
ACPI: Thermal Zone [THM0] (47 C)
Execute Method: [\_TZ_.THM2._TMP] (Node c157bb88)
Execute Method: [\_TZ_.THM2._AC0] (Node c157bb48)
Execute Method: [\_TZ_.THM2._SCP] (Node c157bac8)
Execute Method: [\_TZ_.THM2._TMP] (Node c157bb88)
Execute Method: [\_TZ_.PFN0._ON_] (Node c157a2c8)
Execute Method: [\_TZ_.PFN0._STA] (Node c157a308)
ACPI: Thermal Zone [THM2] (40 C)
Execute Method: [\_TZ_.THM6._TMP] (Node c157b948)
Execute Method: [\_TZ_.THM6._AC0] (Node c157b908)
Execute Method: [\_TZ_.THM6._SCP] (Node c157b888)
Execute Method: [\_TZ_.THM6._TMP] (Node c157b948)
ACPI: Thermal Zone [THM6] (30 C)
Execute Method: [\_TZ_.THM7._TMP] (Node c157b708)
Execute Method: [\_TZ_.THM7._AC0] (Node c157b6c8)
Execute Method: [\_TZ_.THM7._SCP] (Node c157b648)
Execute Method: [\_TZ_.THM7._TMP] (Node c157b708)
ACPI: Thermal Zone [THM7] (33 C)


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-13  8:35 Yu, Luming
@ 2006-03-13 15:21 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-13 15:21 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

> Hmm, could you file dmesgs with thermal module loaded and unloaded?

Filed at bugzilla.

> I saw this acpi_debug=0xffffffff.

Sorry, it's a legacy from trying to debug #5112, and I've removed it
for getting the above dmesgs.  I'm not even sure what that option does
since it's not documented in the kernel-parameters.txt, but it does
increase the amount of debugging info.

> I used to use acpi_debug_layer=0x10 acpi_debug_level=0x10
> Could you try that?

For the above dmesgs I booted with acpi_dbg_level=0x10
acpi_dbg_layer=0x10 and then did two sleep-wake cycles with no thermal
module (both went fine), then one cycle with the thermal module loaded
(went fine), and then the usual failing second sleep with the thermal
module still loaded.  The sleep-wake cycles themselves (i.e. once the
system booted) were done with acpi_debug_level=0x1F rather than the
0x10 boot value.

Let me know if there's a different permutation of debug options that I
should try.  I wasn't sure whether you meant that I should leave all
the debug values at 0x10.  Or whether I should still include
acpi_debug=0xffffffff on top of the other options.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-13  8:35 Yu, Luming
  2006-03-13 15:21 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-13  8:35 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos


Thanks for your debug information.

>
>> Could you try to mute thermal poll?
>
>Done.  The sleep.sh script now has
>
>echo 0 > /proc/acpi/thermal_zone/THM2/polling_frequency
>echo 0 > /proc/acpi/thermal_zone/THM0/polling_frequency
>sleep 1

Hmm, could you file dmesgs with the thermal module loaded and
unloaded?

>
>> I need the full log  for S3 suspend failure not just snippets.
>> Please attach it on bugzilla.kernel.org
>
>Done.

I saw this acpi_debug=0xffffffff.

I used to use acpi_debug_layer=0x10 acpi_debug_level=0x10
Could you try that?


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-13  4:51 Yu, Luming
@ 2006-03-13  7:28 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-13  7:28 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

> Could you try to mute thermal poll?

Done.  The sleep.sh script now has

echo 0 > /proc/acpi/thermal_zone/THM2/polling_frequency
echo 0 > /proc/acpi/thermal_zone/THM0/polling_frequency
sleep 1
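
The two echo lines mute only THM0 and THM2.  A hedged sketch of muting
every zone at once, assuming each zone directory under
/proc/acpi/thermal_zone has a polling_frequency file as THM0 and THM2
do.  The mute_polling helper is hypothetical, and the demo runs against
a scratch directory rather than /proc.

```bash
#!/bin/bash
# mute_polling [BASE_DIR]: write 0 to every */polling_frequency under BASE_DIR.
mute_polling() {
    local base=${1:-/proc/acpi/thermal_zone} file
    for file in "$base"/*/polling_frequency; do
        [ -e "$file" ] || continue   # the glob matched nothing
        echo 0 > "$file"
    done
}

# Demonstration on a scratch directory that mimics two zones.
demo=$(mktemp -d)
mkdir "$demo/THM0" "$demo/THM2"
echo 100 > "$demo/THM0/polling_frequency"
echo 100 > "$demo/THM2/polling_frequency"
mute_polling "$demo"
cat "$demo/THM0/polling_frequency" "$demo/THM2/polling_frequency"
rm -rf "$demo"
```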

> I need the full log  for S3 suspend failure not just snippets.
> Please attach it on bugzilla.kernel.org

Done.

> The log for S3 suspend success cannot help me to track down.

For completeness, I didn't excise that portion of the log.  It's not
many lines, plus it doesn't make it harder to find the failing
portion: The suspend failure happens after the second "@@@@ SLEEP" in
the log.

Should I turn on more acpi_debug_level debugging?

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-13  4:51 Yu, Luming
  2006-03-13  7:28 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-13  4:51 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

>> I need the acpi trace log before _PTS to see what kind of thermal
>> related methods got called.
>
>Alas, I've included all the dmesg's.  

I need the full log  for S3 suspend failure not just snippets.
Please attach it on bugzilla.kernel.org

The log for S3 suspend success cannot help me to track down.


>
>Below is the script that I use to enter S3 sleep.  It gets rid of
>troublesome modules and stops services that don't sleep well.  Then
>(for debugging) it sends the kernel version and boot parameters across
>the serial console (the @@@@ SLEEP line), raises the debug level to
>0x1F, does a sync (in case the sleep hangs, since this is my
>production machine), and then enters mem sleep.
>
>So nothing in it should trigger any thermal methods; except that I
>usually have the THM2 trip point raised to 45C with a polling time of
>100 seconds.  So once in a while a thermal poll will happen while sleep
>is being set up.  I am not sure whether it would be reported in the
>dmesgs if it happened; but the S3 failure happens much more often than
>such a thermal polling would happen, so I doubt the S3 failure
>requires a thermal poll.

Could you try to mute thermal poll?

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-13  2:00 Yu, Luming
@ 2006-03-13  4:38 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-13  4:38 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

> I need the acpi trace log before _PTS to see what kind of thermal
> related methods got called.

Alas, I've included all the dmesg's.  

Below is the script that I use to enter S3 sleep.  It gets rid of
troublesome modules and stops services that don't sleep well.  Then
(for debugging) it sends the kernel version and boot parameters across
the serial console (the @@@@ SLEEP line), raises the debug level to
0x1F, does a sync (in case the sleep hangs, since this is my
production machine), and then enters mem sleep.

So nothing in it should trigger any thermal methods; except that I
usually have the THM2 trip point raised to 45C with a polling time of
100 seconds.  So once in a while a thermal poll will happen while sleep
is being set up.  I am not sure whether it would be reported in the
dmesgs if it happened; but the S3 failure happens much more often than
such a thermal polling would happen, so I doubt the S3 failure
requires a thermal poll.

#!/bin/bash -x
# S3 (suspend to memory), with cleanups before and after
sync
ifdown eth0
remove='prism54 xircom_cb xircom_tulip_cb' 
remove2='snd_pcm_oss snd_cs46xx'
modprobe -rv $remove
modprobe -rv $remove2
/etc/init.d/chrony stop  > /dev/null

sleep 1

(echo "@@@@ SLEEP" ; date ; uname -a ; cat /proc/cmdline ) > /dev/tts/0
echo 0x0000001F > /proc/acpi/debug_level
sync
sleep 2
echo -n mem > /sys/power/state
[stuff for wakeup snipped]

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-13  2:00 Yu, Luming
  2006-03-13  4:38 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-13  2:00 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

>width) Address=0000000023FDFFC0
>exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2
>exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
>exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
>exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0
>exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2
>
>And then these above four lines (exregion-0185, -0185, -0185, -0290)
>repeat until I reboot.
>

If I understand correctly, it was due to LEqual(S_AH, 0xA6) always being
true.
Did the SMM BIOS code fail to respond, or fail to respond correctly,
to the request made by "Store (0x81, APMD)" because of an issue caused
by the thermal module?
I need the ACPI trace log before _PTS to see what kind of thermal-related
methods got called.

    Method (SMPI, 1, NotSerialized)
    {
        Store (S_AX, Local0)
        Store (0x81, APMD)
        While (LEqual (S_AH, 0xA6))
        {
            Sleep (0x64)
            Store (Local0, S_AX)
            Store (0x81, APMD)
        }
    }
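
To make the suspected failure mode concrete, here is a small shell
model of the SMPI loop.  It is a sketch, not ACPI code: smpi_model is a
hypothetical name, and the S_AH values are supplied by hand instead of
coming from the SMM BIOS.  A run in which S_AH never leaves 0xA6 would
be consistent with the endlessly repeating memory and I/O accesses in
the log.

```bash
#!/bin/bash
# Hypothetical model of the SMPI loop; not ACPI code.
# Each argument is the S_AH value the simulated firmware shows on one poll.
# Prints the modeled operations; returns 1 if S_AH never leaves 0xA6.
smpi_model() {
    local s_ah
    echo "read  S_AX"                 # Store (S_AX, Local0)
    echo "write APMD"                 # Store (0x81, APMD)
    for s_ah in "$@"; do
        echo "read  S_AH"             # While (LEqual (S_AH, 0xA6))
        if (( s_ah != 0xA6 )); then
            return 0                  # condition false: the method returns
        fi
        # Sleep (0x64) would go here; the model does not wait.
        echo "write S_AX"             # Store (Local0, S_AX)
        echo "write APMD"             # Store (0x81, APMD)
    done
    return 1                          # samples exhausted while still 0xA6
}

# Firmware stays busy for two polls, then answers.
smpi_model 0xA6 0xA6 0x00
```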


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-10  6:46 Yu, Luming
  2006-03-10 13:27 ` Sanjoy Mahajan
@ 2006-03-10 13:36 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-10 13:36 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

Actually, there's no point only attaching it at bugzilla, since the
relevant output is shorter than I thought:

/proc/cmdline: auto BOOT_IMAGE=16-rc5 ro root=305 idebus=66 apm=off acpi=force pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios cpufreq.debug=7 acpi_debug=0xffffffff

where 16-rc5 is my LILO label for 2.6.16-rc5

The extract from the dmesgs:

Stopping tasks: ========================================================|
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
 acpi_ec-0458 [37] ec_intr_read          : Read [02] from address [32]
 acpi_ec-0508 [37] ec_intr_write         : Wrote [06] to address [32]
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
Execute Method: [\_S3_] (Node c157a988)
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
Execute Method: [\_PTS] (Node c157ab48)
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0
exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC4
exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC4
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0
exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2

And then these above four lines (exregion-0185, -0185, -0185, -0290)
repeat until I reboot.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-10  6:46 Yu, Luming
@ 2006-03-10 13:27 ` Sanjoy Mahajan
  2006-03-10 13:36 ` Sanjoy Mahajan
  1 sibling, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-10 13:27 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

> What do you mean of "slither away" ? 
> bug go away?

I can no longer trigger it, at least not with the usual procedure.  I
doubt that it goes away (i.e. that it is solved), only that it
slithers into hiding, like bugs that disappear when compiling a C
program with -g but show up when compiling without -g.

> echo -n 0x10 > /proc/acpi/debug_layer
> echo -n 0x10 > /proc/acpi/debug_level

Oh, I always have more info turned on in my sleep.sh script
(debug_layer = 0xFFFF3FFF to begin with, and the script sets
debug_level to 0x1F).  I'll attach the slightly trimmed log file to
the bugme report.  If it's too much information, let me know and I'll
retest with just the above settings.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-10  6:46 Yu, Luming
  2006-03-10 13:27 ` Sanjoy Mahajan
  2006-03-10 13:36 ` Sanjoy Mahajan
  0 siblings, 2 replies; 86+ messages in thread
From: Yu, Luming @ 2006-03-10  6:46 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

> exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 
>>> width) Address=00000000000000B2
>>>
>>> repeated endlessly.
>
>> I need calltrace for this 
>
>Looking at /proc/acpi/debug_level, I see several debugging choices
>that might give the calltrace you want.  Let me know which ones are
>essential (I'd turn all of them on; however, I found when trying to
>track this down earlier that the bug would slither away if I had too
>much debugging turned on):

What do you mean by "slither away"?
Does the bug go away?

>
>ACPI_LV_DISPATCH	       0x00000100 [ ]
>ACPI_LV_EXEC		       0x00000200 [ ]
>ACPI_LV_NAMES		       0x00000400 [ ]
>ACPI_LV_FUNCTIONS	       0x00200000 [ ]
>
>By the way, a long standing buglet for me is that 'cat
>/proc/acpi/debug_level' truncates the output to 1024 bytes.  So I have
>to do 'cat /proc/acpi/debug_level | cat' so that the first cat doesn't
>find that its stdout is a tty and try to reduce its buffer size from
>4096 (big enough) to 1024.  A patch is available at
><http://bugzilla.kernel.org/show_bug.cgi?id=5076>

Let's start with:

echo -n 0x10 > /proc/acpi/debug_layer
echo -n 0x10 > /proc/acpi/debug_level

>
>> BTW, do you still think this is a regression?
>
>I'm 95% sure, because booting with ec_intr=0 avoids the problem, so
>the commit that made ec_intr=1 the default almost certainly also makes
>this bug appear.

why NOT 100% sure? :-)

* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-03-10  6:12 Yu, Luming
@ 2006-03-10  6:27 ` Sanjoy Mahajan
  0 siblings, 0 replies; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-10  6:27 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

> I assume you have tested ec_intr=0 and ec_intr=1.

Right: I forgot to mention it, but I did test it both ways, and
ec_intr=0 is fine.

>> These noises happen when printing via the wireless card or via USB
>> (to a different HP inkjet),

> Interesting, open bug for this.

I cannot reproduce it with the vanilla DSDT, only with the modified
one.  But:

> The ground rule is Don't use modified DSDT.
> If you do that,  the results won't be trusted.

>> exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 
>> width) Address=00000000000000B2
>>
>> repeated endlessly.

> I need calltrace for this 

Looking at /proc/acpi/debug_level, I see several debugging choices
that might give the calltrace you want.  Let me know which ones are
essential (I'd turn all of them on; however, I found when trying to
track this down earlier that the bug would slither away if I had too
much debugging turned on):

ACPI_LV_DISPATCH	       0x00000100 [ ]
ACPI_LV_EXEC		       0x00000200 [ ]
ACPI_LV_NAMES		       0x00000400 [ ]
ACPI_LV_FUNCTIONS	       0x00200000 [ ]
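
If all four were wanted at once, the flags would be OR-ed into a single
mask.  A small sketch, using only the four values listed above;
combine_levels is a hypothetical helper, and the script prints the mask
instead of writing it to /proc/acpi/debug_level.

```bash
#!/bin/bash
# Flag values copied from the /proc/acpi/debug_level listing above.
ACPI_LV_DISPATCH=0x00000100
ACPI_LV_EXEC=0x00000200
ACPI_LV_NAMES=0x00000400
ACPI_LV_FUNCTIONS=0x00200000

# combine_levels FLAG...: OR the flags together and print the mask in hex.
combine_levels() {
    local mask=0 flag
    for flag in "$@"; do
        mask=$(( mask | flag ))
    done
    printf '0x%08X\n' "$mask"
}

combine_levels "$ACPI_LV_DISPATCH" "$ACPI_LV_EXEC" \
               "$ACPI_LV_NAMES" "$ACPI_LV_FUNCTIONS"
# The printed value could then be applied by hand, for example:
#   echo <mask> > /proc/acpi/debug_level
```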

By the way, a long standing buglet for me is that 'cat
/proc/acpi/debug_level' truncates the output to 1024 bytes.  So I have
to do 'cat /proc/acpi/debug_level | cat' so that the first cat doesn't
find that its stdout is a tty and try to reduce its buffer size from
4096 (big enough) to 1024.  A patch is available at
<http://bugzilla.kernel.org/show_bug.cgi?id=5076>

> BTW, do you still think this is a regression?

I'm 95% sure, because booting with ec_intr=0 avoids the problem, so
the commit that made ec_intr=1 the default almost certainly also makes
this bug appear.

-Sanjoy

`Never underestimate the evil of which men of power are capable.'
         --Bertrand Russell, _War Crimes in Vietnam_, chapter 1.

* RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
@ 2006-03-10  6:12 Yu, Luming
  2006-03-10  6:27 ` Sanjoy Mahajan
  0 siblings, 1 reply; 86+ messages in thread
From: Yu, Luming @ 2006-03-10  6:12 UTC (permalink / raw)
  To: Sanjoy Mahajan
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

>From: "Yu, Luming" <luming.yu@intel.com>
>> I suggest you to retest, and post dmesg with UN-modified BIOS.
>
>I'm now running/testing an unmodified DSDT with 2.6.16-rc5.  For a while
>I had no S3 hangs, but I just noticed them again.  The error is the same
>as with the modified DSDT (with slightly different offsets):

I assume you have tested ec_intr=0 and ec_intr=1.

>
>exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
>exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0
>exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2
>
>repeated endlessly.

I need a call trace for this.

>
>I think the problem resurfaced once I decided to let my sleep.sh script
>leave the thermal driver loaded before going into S3 (suspecting that
>the bug might come back if I did that).

Clearly, it's thermal-related. We need to narrow it down here.

>
>So I suspect that my modified DSDT didn't cause the S3 problems, it
>merely exposed one even in the minimal configuration discussed in the
>#5989 report.

The ground rule is: don't use a modified DSDT.
If you do, the results can't be trusted.

>
>Which makes me wonder about another bug that disappeared when I switched
>to the vanilla DSDT: While printing (via gs+hpijs to an HP photosmart
>2710 via the wireless card), the system makes double-beeps as if it were
>having the AC adapter plugged and unplugged.  These noises happen when
>printing via the wireless card or via USB (to a different HP inkjet),

Interesting; please open a bug for this.

>but not when printing via the parallel port to a Lexmark laserprinter
>(using just gs).  Since I didn't do anything to the battery code in the
>DSDT, I now wonder whether changing the DSDT merely exposed the issue
>but didn't create it.
>
>[From an earlier msg:]
>> I think the truth is, for 5989, we need to fix thermal and processor
>> driver issue.
>
>I agree, although I think the processor driver is not the culprit.  My
>earlier testing (with the modified DSDT) worked fine with the
>processor module loaded, but hung with processor + thermal loaded.
>

ok, we need to start from thermal.  

BTW, do you still think this is a regression?

Thanks,
Luming


* Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]
  2006-02-27  9:04 2.6.16-rc5: known regressions Yu, Luming
@ 2006-03-10  5:26 ` Sanjoy Mahajan
  2006-05-19 13:44   ` Thomas Renninger
  0 siblings, 1 reply; 86+ messages in thread
From: Sanjoy Mahajan @ 2006-03-10  5:26 UTC (permalink / raw)
  To: Yu, Luming
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Tom Seeley,
	Dave Jones, Jiri Slaby, michael, mchehab, v4l-dvb-maintainer,
	video4linux-list, Brian Marete, Ryan Phillips, gregkh,
	linux-usb-devel, Brown, Len, linux-acpi, Mark Lord, Randy Dunlap,
	jgarzik, linux-ide, Duncan, Pavlik Vojtech, linux-input,
	Meelis Roos

[Re: bugme #5989, head no longer hanging in shame]

From: "Yu, Luming" <luming.yu@intel.com>
> I suggest you retest, and post dmesg with the UN-modified BIOS.

I'm now running/testing an unmodified DSDT with 2.6.16-rc5.  For a while
I had no S3 hangs, but I just noticed them again.  The error is the same
as with the modified DSDT (with slightly different offsets):

exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0
exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2

repeated endlessly.
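
To confirm from a captured log that it really is the same three accesses
looping, the lines can be tallied.  This is only a rough sketch: the
line layout is taken from the three lines quoted above, the names
parse() and summarize() are made up for illustration, and reading the
0/1 field as an operation code is an assumption.

```python
import re
from collections import Counter

# Layout inferred from the quoted debug lines, e.g.
# exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
LINE_RE = re.compile(
    r"^(?P<src>\w+-\d+) \[(?P<level>\d+)\] (?P<func>\w+): "
    r"(?P<space>\w+) (?P<op>[01]) \((?P<width>\d+) width\) "
    r"Address=(?P<addr>[0-9A-Fa-f]+)$"
)


def parse(line):
    """Return (space, op, width, address) for a matching line, else None.

    'op' is the 0/1 field; treating it as an operation code is an assumption.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return (m["space"], int(m["op"]), int(m["width"]), int(m["addr"], 16))


def summarize(lines):
    """Count how often each distinct region access appears in the log."""
    return Counter(a for a in map(parse, lines) if a is not None)
```

If the hang is the loop described above, the summary should contain only
a few distinct entries with nearly equal counts.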

I think the problem resurfaced once I decided to let my sleep.sh script
leave the thermal driver loaded before going into S3 (suspecting that
the bug might come back if I did that).
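
For anyone trying to reproduce this, the two variants of the sleep
sequence can be sketched as follows.  This is not the actual sleep.sh
(which is shell and is not shown here); the function names are
hypothetical, and entering S3 by writing "mem" to /sys/power/state is
an assumption about the setup.

```python
import subprocess


def s3_plan(unload_modules=("thermal",)):
    """Build the command list: unload modules, enter S3, reload them."""
    steps = [["modprobe", "-r", m] for m in unload_modules]
    # Assumed way of entering S3; the real script may use another interface.
    steps.append(["sh", "-c", "echo mem > /sys/power/state"])
    # Reload in reverse order after resume.
    steps += [["modprobe", m] for m in reversed(unload_modules)]
    return steps


def run(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling s3_plan(()) gives the variant that leaves thermal loaded, which
is the case where the hangs came back.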

So I suspect that my modified DSDT didn't cause the S3 problems, it
merely exposed one even in the minimal configuration discussed in the
#5989 report.

Which makes me wonder about another bug that disappeared when I switched
to the vanilla DSDT: While printing (via gs+hpijs to an HP photosmart
2710 via the wireless card), the system makes double-beeps as if it were
having the AC adapter plugged and unplugged.  These noises happen when
printing via the wireless card or via USB (to a different HP inkjet),
but not when printing via the parallel port to a Lexmark laserprinter
(using just gs).  Since I didn't do anything to the battery code in the
DSDT, I now wonder whether changing the DSDT merely exposed the issue
but didn't create it.

[From an earlier msg:]
> I think the truth is, for 5989, we need to fix thermal and processor
> driver issue.

I agree, although I think the processor driver is not the culprit.  My
earlier testing (with the modified DSDT) worked fine with the
processor module loaded, but hung with processor + thermal loaded.

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenel

^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2006-05-23 17:13 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-05-23 13:29 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT] Yu, Luming
2006-05-23 17:12 ` Sanjoy Mahajan
  -- strict thread matches above, loose matches on Subject: below --
2006-04-05  3:03 Yu, Luming
2006-03-24  1:31 Yu, Luming
2006-04-04  6:49 ` Sanjoy Mahajan
2006-03-23  9:10 Yu, Luming
2006-03-23 19:19 ` Sanjoy Mahajan
2006-03-23  4:46 Yu, Luming
2006-03-23  6:25 ` Sanjoy Mahajan
2006-03-22  7:28 Yu, Luming
2006-03-22 14:16 ` Sanjoy Mahajan
2006-03-22  4:58 Yu, Luming
2006-03-22  5:13 ` Sanjoy Mahajan
2006-03-24  1:17 ` Sanjoy Mahajan
2006-03-22  1:34 Yu, Luming
2006-03-22  7:00 ` Sanjoy Mahajan
2006-03-22  1:30 Yu, Luming
2006-03-22  4:35 ` Sanjoy Mahajan
2006-03-22  7:15 ` Sanjoy Mahajan
2006-03-21  9:11 Yu, Luming
2006-03-21 20:37 ` Sanjoy Mahajan
2006-03-21 22:09 ` Sanjoy Mahajan
2006-03-21  1:38 Yu, Luming
2006-03-21  7:27 ` Sanjoy Mahajan
2006-03-21  8:47 ` Sanjoy Mahajan
2006-03-19  4:12 Yu, Luming
2006-03-19 14:33 ` Sanjoy Mahajan
2006-03-20  6:39 ` Sanjoy Mahajan
2006-03-18 17:08 Yu, Luming
2006-03-18 20:12 ` Sanjoy Mahajan
2006-03-18 16:37 Yu, Luming
2006-03-18 17:03 ` Sanjoy Mahajan
2006-03-18 15:58 Yu, Luming
2006-03-18 16:27 ` Sanjoy Mahajan
2006-03-18 15:10 Yu, Luming
2006-03-18 15:48 ` Sanjoy Mahajan
2006-03-18 13:24 Yu, Luming
2006-03-18 14:37 ` Sanjoy Mahajan
2006-03-18  2:02 Yu, Luming
2006-03-18  7:23 ` Sanjoy Mahajan
2006-03-17  7:50 Yu, Luming
2006-03-17 18:43 ` Sanjoy Mahajan
2006-03-17  6:57 Yu, Luming
2006-03-17  7:11 ` Sanjoy Mahajan
2006-03-17  7:32 ` Sanjoy Mahajan
2006-03-17  1:17 Yu, Luming
2006-03-17  6:28 ` Sanjoy Mahajan
2006-03-16  8:18 Yu, Luming
2006-03-16 15:15 ` Sanjoy Mahajan
2006-03-16  7:28 Yu, Luming
2006-03-16  7:57 ` Sanjoy Mahajan
2006-03-16  6:41 Yu, Luming
2006-03-16  6:54 ` Sanjoy Mahajan
2006-03-16  7:14 ` Sanjoy Mahajan
2006-03-15  8:02 Yu, Luming
2006-03-16  0:03 ` Sanjoy Mahajan
2006-03-16  5:47 ` Sanjoy Mahajan
2006-03-16  6:46 ` Sanjoy Mahajan
2006-03-15  6:47 Yu, Luming
2006-03-15  7:06 ` Sanjoy Mahajan
2006-03-15  6:25 Yu, Luming
2006-03-15  6:16 Yu, Luming
2006-03-15  6:35 ` Sanjoy Mahajan
2006-03-15  1:46 Yu, Luming
2006-03-15  5:40 ` Sanjoy Mahajan
2006-03-15  5:57 ` Sanjoy Mahajan
2006-03-14  1:48 Yu, Luming
2006-03-14  8:28 ` Sanjoy Mahajan
2006-03-13  8:35 Yu, Luming
2006-03-13 15:21 ` Sanjoy Mahajan
2006-03-13  4:51 Yu, Luming
2006-03-13  7:28 ` Sanjoy Mahajan
2006-03-13  2:00 Yu, Luming
2006-03-13  4:38 ` Sanjoy Mahajan
2006-03-10  6:46 Yu, Luming
2006-03-10 13:27 ` Sanjoy Mahajan
2006-03-10 13:36 ` Sanjoy Mahajan
2006-03-10  6:12 Yu, Luming
2006-03-10  6:27 ` Sanjoy Mahajan
2006-02-27  9:04 2.6.16-rc5: known regressions Yu, Luming
2006-03-10  5:26 ` 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT] Sanjoy Mahajan
2006-05-19 13:44   ` Thomas Renninger
2006-05-21  0:12     ` Sanjoy Mahajan
2006-05-21  0:40       ` Carl-Daniel Hailfinger
2006-05-21  1:30         ` Joshua Hudson
2006-05-21  3:53           ` Lee Revell
2006-05-22  9:55       ` Pavel Machek
