All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling
@ 2022-05-19 14:35 Zhang Rui
  2022-05-19 14:35 ` [PATCH V2 1/4] PM: wakeup: expose pm_wakeup_pending to modules Zhang Rui
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Zhang Rui @ 2022-05-19 14:35 UTC (permalink / raw)
  To: rjw; +Cc: linux-pm, daniel.lezcano, sumeet.r.pawnikar, len.brown

On some Intel client platforms like SKL/KBL/CNL/CML, there is a
PCH thermal sensor that monitors the PCH temperature and blocks the system
from entering S0ix in case it overheats.

Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to PCH
temperature above threshold") introduces a delay loop to cool the
temperature down for this purpose.

However, in practice, we found that the time it takes to cool the PCH down
below threshold highly depends on the initial PCH temperature when the
delay starts, as well as the ambient temperature.

For example, on a Dell XPS 9360 laptop, the problem can be triggered 
1. when it is suspended with heavy workload running.
or
2. when it is moved from New Hampshire to Florida.

In these cases, the 1 second delay is not sufficient. As a result, the
system stays in a shallower power state like PCx instead of S0ix, and
drains the battery power, without user' notice.

In order to fix this, we
1. move the delay to .suspend_noirq phase instead, in order to
   do the cooling when the system is in a more quiescent state
2. expand the default overall cooling delay timeout to 60 seconds.
3. make sure the temperature is below threshold rather than equal to it.

Compared with V1, the last four patches are dropped from the series, and
we focus on the PCH Overheat issue only. Plus, splitted one of the patch
according to Rafael' suggestion.

thanks,
rui

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH V2 1/4] PM: wakeup: expose pm_wakeup_pending to modules
  2022-05-19 14:35 [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling Zhang Rui
@ 2022-05-19 14:35 ` Zhang Rui
  2022-05-19 14:35 ` [PATCH V2 2/4] thermal: intel: pch: move cooling delay to suspend_noirq phase Zhang Rui
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Zhang Rui @ 2022-05-19 14:35 UTC (permalink / raw)
  To: rjw; +Cc: linux-pm, daniel.lezcano, sumeet.r.pawnikar, len.brown

intel_pch_thermal driver needs a long delay to cool itself (60 seconds
in maximum) during suspend. When a wakeup event occures during the
delay, it is better for the intel_pch_thermal driver to detect this and
quit cooling because the suspend is likely to abort anyway.

Thus expose pm_wakeup_pending to modules so that intel_pch_thermal
driver can be aware of the wakeup events.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Sumeet Pawnikar <sumeet.r.pawnikar@intel.com>
---
 drivers/base/power/wakeup.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c
index a57d469676ca..11a4ffe91367 100644
--- a/drivers/base/power/wakeup.c
+++ b/drivers/base/power/wakeup.c
@@ -930,6 +930,7 @@ bool pm_wakeup_pending(void)
 
 	return ret || atomic_read(&pm_abort_suspend) > 0;
 }
+EXPORT_SYMBOL_GPL(pm_wakeup_pending);
 
 void pm_system_wakeup(void)
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH V2 2/4] thermal: intel: pch: move cooling delay to suspend_noirq phase
  2022-05-19 14:35 [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling Zhang Rui
  2022-05-19 14:35 ` [PATCH V2 1/4] PM: wakeup: expose pm_wakeup_pending to modules Zhang Rui
@ 2022-05-19 14:35 ` Zhang Rui
  2022-05-19 14:35 ` [PATCH V2 3/4] thermal: intel: pch: enhance overheat handling Zhang Rui
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Zhang Rui @ 2022-05-19 14:35 UTC (permalink / raw)
  To: rjw; +Cc: linux-pm, daniel.lezcano, sumeet.r.pawnikar, len.brown

Move the PCH Thermal driver suspend callback to suspend_noirq to do
cooling while the system is more quiescent.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Sumeet Pawnikar <sumeet.r.pawnikar@intel.com>
---
 drivers/thermal/intel/intel_pch_thermal.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/intel/intel_pch_thermal.c b/drivers/thermal/intel/intel_pch_thermal.c
index 527c91f5960b..c0f651b5905d 100644
--- a/drivers/thermal/intel/intel_pch_thermal.c
+++ b/drivers/thermal/intel/intel_pch_thermal.c
@@ -193,6 +193,7 @@ static int pch_wpt_get_temp(struct pch_thermal_device *ptd, int *temp)
 	return 0;
 }
 
+/* Cool the PCH when it's overheat in .suspend_noirq phase */
 static int pch_wpt_suspend(struct pch_thermal_device *ptd)
 {
 	u8 tsel;
@@ -455,7 +456,7 @@ static void intel_pch_thermal_remove(struct pci_dev *pdev)
 	pci_disable_device(pdev);
 }
 
-static int intel_pch_thermal_suspend(struct device *device)
+static int intel_pch_thermal_suspend_noirq(struct device *device)
 {
 	struct pch_thermal_device *ptd = dev_get_drvdata(device);
 
@@ -495,7 +496,7 @@ static const struct pci_device_id intel_pch_thermal_id[] = {
 MODULE_DEVICE_TABLE(pci, intel_pch_thermal_id);
 
 static const struct dev_pm_ops intel_pch_pm_ops = {
-	.suspend = intel_pch_thermal_suspend,
+	.suspend_noirq = intel_pch_thermal_suspend_noirq,
 	.resume = intel_pch_thermal_resume,
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH V2 3/4] thermal: intel: pch: enhance overheat handling
  2022-05-19 14:35 [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling Zhang Rui
  2022-05-19 14:35 ` [PATCH V2 1/4] PM: wakeup: expose pm_wakeup_pending to modules Zhang Rui
  2022-05-19 14:35 ` [PATCH V2 2/4] thermal: intel: pch: move cooling delay to suspend_noirq phase Zhang Rui
@ 2022-05-19 14:35 ` Zhang Rui
  2022-05-19 14:35 ` [PATCH V2 4/4] thermal: intel: pch: improve the cooling delay log Zhang Rui
  2022-05-19 17:42 ` [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling Rafael J. Wysocki
  4 siblings, 0 replies; 6+ messages in thread
From: Zhang Rui @ 2022-05-19 14:35 UTC (permalink / raw)
  To: rjw; +Cc: linux-pm, daniel.lezcano, sumeet.r.pawnikar, len.brown

Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to PCH
temperature above threshold") introduces delay loop mechanism that allows
PCH temperature to go down below threshold during suspend so it won't
block S0ix. And the default overall delay timeout is 1 second.

However, in practice, we found that the time it takes to cool the PCH down
below threshold highly depends on the initial PCH temperature when the
delay starts, as well as the ambient temperature.
And in some cases, the 1 second delay is not sufficient. As a result, the
system stays in a shallower power state like PCx instead of S0ix, and
drains the battery power, without user' notice.

To make sure S0ix is not blocked by the PCH overheating, we
1. expand the default overall timeout to 60 seconds.
2. make sure the temperature is below threshold rather than equal to it.

At the same time, as the cooling delay can be much longer and many wakeup
events (ACPI Power Button press, USB mouse move, etc) becomes valid in the
suspend_noirq phase, add detection of wakeup event so that the driver
does not delay blindly when the system suspend is likely to abort soon.

This patch may introduce longer suspend time, but only in the cases when
the system overheats and Linux used to enter a shallower S2idle state,
say, PCx instead of S0ix.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Sumeet Pawnikar <sumeet.r.pawnikar@intel.com>
---
 drivers/thermal/intel/intel_pch_thermal.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/thermal/intel/intel_pch_thermal.c b/drivers/thermal/intel/intel_pch_thermal.c
index c0f651b5905d..b7b32e2f5ae2 100644
--- a/drivers/thermal/intel/intel_pch_thermal.c
+++ b/drivers/thermal/intel/intel_pch_thermal.c
@@ -70,8 +70,8 @@ static unsigned int delay_timeout = 100;
 module_param(delay_timeout, int, 0644);
 MODULE_PARM_DESC(delay_timeout, "amount of time delay for each iteration.");
 
-/* Number of iterations for cooling delay, 10 counts by default for now */
-static unsigned int delay_cnt = 10;
+/* Number of iterations for cooling delay, 600 counts by default for now */
+static unsigned int delay_cnt = 600;
 module_param(delay_cnt, int, 0644);
 MODULE_PARM_DESC(delay_cnt, "total number of iterations for time delay.");
 
@@ -197,7 +197,7 @@ static int pch_wpt_get_temp(struct pch_thermal_device *ptd, int *temp)
 static int pch_wpt_suspend(struct pch_thermal_device *ptd)
 {
 	u8 tsel;
-	u8 pch_delay_cnt = 1;
+	int pch_delay_cnt = 1;
 	u16 pch_thr_temp, pch_cur_temp;
 
 	/* Shutdown the thermal sensor if it is not enabled by BIOS */
@@ -234,7 +234,10 @@ static int pch_wpt_suspend(struct pch_thermal_device *ptd)
 	 * which helps to indentify the reason why S0ix entry was rejected.
 	 */
 	while (pch_delay_cnt <= delay_cnt) {
-		if (pch_cur_temp <= pch_thr_temp)
+		if (pch_cur_temp < pch_thr_temp)
+			break;
+
+		if (pm_wakeup_pending())
 			break;
 
 		dev_warn(&ptd->pdev->dev,
@@ -246,7 +249,7 @@ static int pch_wpt_suspend(struct pch_thermal_device *ptd)
 		pch_delay_cnt++;
 	}
 
-	if (pch_cur_temp > pch_thr_temp)
+	if (pch_cur_temp >= pch_thr_temp)
 		dev_warn(&ptd->pdev->dev,
 			"CPU-PCH is hot [%dC] even after delay, continue to suspend. S0ix might fail\n",
 			pch_cur_temp);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH V2 4/4] thermal: intel: pch: improve the cooling delay log
  2022-05-19 14:35 [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling Zhang Rui
                   ` (2 preceding siblings ...)
  2022-05-19 14:35 ` [PATCH V2 3/4] thermal: intel: pch: enhance overheat handling Zhang Rui
@ 2022-05-19 14:35 ` Zhang Rui
  2022-05-19 17:42 ` [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling Rafael J. Wysocki
  4 siblings, 0 replies; 6+ messages in thread
From: Zhang Rui @ 2022-05-19 14:35 UTC (permalink / raw)
  To: rjw; +Cc: linux-pm, daniel.lezcano, sumeet.r.pawnikar, len.brown

Previously, during suspend, intel_pch_thermal driver logs for every
cooling iteration, about the current PCH temperature and number of cooling
iterations that have been tried, like below

[  100.955526] intel_pch_thermal 0000:00:14.2: CPU-PCH current temp [53C] higher than the threshold temp [50C], sleep 1 times for 100 ms duration
[  101.064156] intel_pch_thermal 0000:00:14.2: CPU-PCH current temp [53C] higher than the threshold temp [50C], sleep 2 times for 100 ms duration

After changing the default delay_cnt to 600, in practice, it is common to
see tens of the above messages if the system is suspended when PCH
overheats. Thus, change this log message from dev_warn to dev_dbg because
it is only useful when we want to check the temperature trend.

At the same time, there is always a one-line message given by the driver
with the patch applied, with below four possibilities.

1. PCH is cool, no cooling delay needed
[ 1791.902853] intel_pch_thermal 0000:00:12.0: CPU-PCH is cool [48C]

2. PCH overheats and becomes cool after the cooling delays
[ 1475.511617] intel_pch_thermal 0000:00:12.0: CPU-PCH is cool [49C] after 30700 ms delay

3. PCH still overheats after the overall cooling timeout
[ 2250.157487] intel_pch_thermal 0000:00:12.0: CPU-PCH is hot [60C] after 60000 ms delay. S0ix might fail

4. PCH aborts cooling because of wakeup event detected during the delay
[ 1933.639509] intel_pch_thermal 0000:00:12.0: Wakeup event detected, abort cooling

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Sumeet Pawnikar <sumeet.r.pawnikar@intel.com>
---
 drivers/thermal/intel/intel_pch_thermal.c | 31 +++++++++++++++--------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/thermal/intel/intel_pch_thermal.c b/drivers/thermal/intel/intel_pch_thermal.c
index b7b32e2f5ae2..c1fa2b29b153 100644
--- a/drivers/thermal/intel/intel_pch_thermal.c
+++ b/drivers/thermal/intel/intel_pch_thermal.c
@@ -197,7 +197,7 @@ static int pch_wpt_get_temp(struct pch_thermal_device *ptd, int *temp)
 static int pch_wpt_suspend(struct pch_thermal_device *ptd)
 {
 	u8 tsel;
-	int pch_delay_cnt = 1;
+	int pch_delay_cnt = 0;
 	u16 pch_thr_temp, pch_cur_temp;
 
 	/* Shutdown the thermal sensor if it is not enabled by BIOS */
@@ -233,29 +233,38 @@ static int pch_wpt_suspend(struct pch_thermal_device *ptd)
 	 * temperature stays above threshold, notify the warning message
 	 * which helps to indentify the reason why S0ix entry was rejected.
 	 */
-	while (pch_delay_cnt <= delay_cnt) {
+	while (pch_delay_cnt < delay_cnt) {
 		if (pch_cur_temp < pch_thr_temp)
 			break;
 
-		if (pm_wakeup_pending())
-			break;
+		if (pm_wakeup_pending()) {
+			dev_warn(&ptd->pdev->dev, "Wakeup event detected, abort cooling\n");
+			return 0;
+		}
 
-		dev_warn(&ptd->pdev->dev,
+		pch_delay_cnt++;
+		dev_dbg(&ptd->pdev->dev,
 			"CPU-PCH current temp [%dC] higher than the threshold temp [%dC], sleep %d times for %d ms duration\n",
 			pch_cur_temp, pch_thr_temp, pch_delay_cnt, delay_timeout);
 		msleep(delay_timeout);
 		/* Read the PCH current temperature for next cycle. */
 		pch_cur_temp = GET_PCH_TEMP(WPT_TEMP_TSR & readw(ptd->hw_base + WPT_TEMP));
-		pch_delay_cnt++;
 	}
 
 	if (pch_cur_temp >= pch_thr_temp)
 		dev_warn(&ptd->pdev->dev,
-			"CPU-PCH is hot [%dC] even after delay, continue to suspend. S0ix might fail\n",
-			pch_cur_temp);
-	else
-		dev_info(&ptd->pdev->dev,
-			"CPU-PCH is cool [%dC], continue to suspend\n", pch_cur_temp);
+			"CPU-PCH is hot [%dC] after %d ms delay. S0ix might fail\n",
+			pch_cur_temp, pch_delay_cnt * delay_timeout);
+	else {
+		if (pch_delay_cnt)
+			dev_info(&ptd->pdev->dev,
+				"CPU-PCH is cool [%dC] after %d ms delay\n",
+				pch_cur_temp, pch_delay_cnt * delay_timeout);
+		else
+			dev_info(&ptd->pdev->dev,
+				"CPU-PCH is cool [%dC]\n",
+				pch_cur_temp);
+	}
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling
  2022-05-19 14:35 [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling Zhang Rui
                   ` (3 preceding siblings ...)
  2022-05-19 14:35 ` [PATCH V2 4/4] thermal: intel: pch: improve the cooling delay log Zhang Rui
@ 2022-05-19 17:42 ` Rafael J. Wysocki
  4 siblings, 0 replies; 6+ messages in thread
From: Rafael J. Wysocki @ 2022-05-19 17:42 UTC (permalink / raw)
  To: Zhang Rui
  Cc: Rafael J. Wysocki, Linux PM, Daniel Lezcano, Sumeet Pawnikar, Len Brown

On Thu, May 19, 2022 at 4:35 PM Zhang Rui <rui.zhang@intel.com> wrote:
>
> On some Intel client platforms like SKL/KBL/CNL/CML, there is a
> PCH thermal sensor that monitors the PCH temperature and blocks the system
> from entering S0ix in case it overheats.
>
> Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to PCH
> temperature above threshold") introduces a delay loop to cool the
> temperature down for this purpose.
>
> However, in practice, we found that the time it takes to cool the PCH down
> below threshold highly depends on the initial PCH temperature when the
> delay starts, as well as the ambient temperature.
>
> For example, on a Dell XPS 9360 laptop, the problem can be triggered
> 1. when it is suspended with heavy workload running.
> or
> 2. when it is moved from New Hampshire to Florida.
>
> In these cases, the 1 second delay is not sufficient. As a result, the
> system stays in a shallower power state like PCx instead of S0ix, and
> drains the battery power, without user' notice.
>
> In order to fix this, we
> 1. move the delay to .suspend_noirq phase instead, in order to
>    do the cooling when the system is in a more quiescent state
> 2. expand the default overall cooling delay timeout to 60 seconds.
> 3. make sure the temperature is below threshold rather than equal to it.
>
> Compared with V1, the last four patches are dropped from the series, and
> we focus on the PCH Overheat issue only. Plus, splitted one of the patch
> according to Rafael' suggestion.

All patches in the series applied as 5.19 material, thanks!

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-05-19 17:42 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-19 14:35 [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling Zhang Rui
2022-05-19 14:35 ` [PATCH V2 1/4] PM: wakeup: expose pm_wakeup_pending to modules Zhang Rui
2022-05-19 14:35 ` [PATCH V2 2/4] thermal: intel: pch: move cooling delay to suspend_noirq phase Zhang Rui
2022-05-19 14:35 ` [PATCH V2 3/4] thermal: intel: pch: enhance overheat handling Zhang Rui
2022-05-19 14:35 ` [PATCH V2 4/4] thermal: intel: pch: improve the cooling delay log Zhang Rui
2022-05-19 17:42 ` [PATCH V2 0/4] PM/Thermal: Enhance PCH overheat handling Rafael J. Wysocki

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.