linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/3] thermal: core: Add indication for userspace usage
@ 2020-11-28 17:54 Kai-Heng Feng
  2020-11-28 17:54 ` [PATCH 2/3] thermal: int340x: Indicate " Kai-Heng Feng
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Kai-Heng Feng @ 2020-11-28 17:54 UTC (permalink / raw)
  To: rui.zhang, daniel.lezcano, amitk
  Cc: Kai-Heng Feng, open list:THERMAL, open list

We are seeing thermal shutdown on Intel based mobile workstations, the
shutdown happens during the first trip handle in
thermal_zone_device_register():
kernel: thermal thermal_zone15: critical temperature reached (101 C), shutting down

However, we shouldn't do a thermal shutdown here, since
1) We may want to use a dedicated daemon, Intel's thermald in this case,
to handle thermal shutdown.

2) For ACPI based system, _CRT doesn't mean shutdown unless it's inside
ThermalZone. ACPI Spec, 11.4.4 _CRT (Critical Temperature):
"... If this object it present under a device, the device’s driver
evaluates this object to determine the device’s critical cooling
temperature trip point. This value may then be used by the device’s
driver to program an internal device temperature sensor trip point."

So a "critical trip" here merely means we should take a more aggressive
cooling method.

So add an indication to let thermal core know it should leave thermal
device to userspace to handle.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/thermal/thermal_core.c | 3 +++
 include/linux/thermal.h        | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index c6d74bc1c90b..6561e3767529 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1477,6 +1477,9 @@ thermal_zone_device_register(const char *type, int trips, int mask,
 			goto unregister;
 	}
 
+	if (tz->tzp && tz->tzp->userspace)
+		thermal_zone_device_disable(tz);
+
 	mutex_lock(&thermal_list_lock);
 	list_add_tail(&tz->node, &thermal_tz_list);
 	mutex_unlock(&thermal_list_lock);
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index d07ea27e72a9..e8e8fac78fc8 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -247,6 +247,8 @@ struct thermal_zone_params {
 	 */
 	bool no_hwmon;
 
+	bool userspace;
+
 	int num_tbps;	/* Number of tbp entries */
 	struct thermal_bind_params *tbp;
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/3] thermal: int340x: Indicate userspace usage
  2020-11-28 17:54 [PATCH 1/3] thermal: core: Add indication for userspace usage Kai-Heng Feng
@ 2020-11-28 17:54 ` Kai-Heng Feng
  2020-11-30  5:29   ` Srinivas Pandruvada
  2020-11-28 17:54 ` [PATCH 3/3] thermal: intel: intel_pch_thermal: " Kai-Heng Feng
  2020-11-30  7:57 ` [PATCH 1/3] thermal: core: Add indication for " Daniel Lezcano
  2 siblings, 1 reply; 18+ messages in thread
From: Kai-Heng Feng @ 2020-11-28 17:54 UTC (permalink / raw)
  To: rui.zhang, daniel.lezcano, amitk
  Cc: Kai-Heng Feng, Andrzej Pietrasiewicz, Bartlomiej Zolnierkiewicz,
	Peter Kaestle, Matthew Garrett, Gayatri Kammela,
	Srinivas Pandruvada, Takashi Iwai, Andrew Morton,
	Andy Shevchenko, Akinobu Mita, open list:THERMAL, open list

The device isn't present under ACPI ThermalZone, and there's a dedicated
userspace daemon for this thermal device.

Let thermal core know it shouldn't handle trips to avoid surprising
thermal shutdown.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/thermal/intel/int340x_thermal/int3400_thermal.c     | 1 +
 .../thermal/intel/int340x_thermal/int340x_thermal_zone.c    | 6 +-----
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/thermal/intel/int340x_thermal/int3400_thermal.c b/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
index 0966551cbaaa..2002bc96eb3c 100644
--- a/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
+++ b/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
@@ -439,6 +439,7 @@ static struct thermal_zone_device_ops int3400_thermal_ops = {
 static struct thermal_zone_params int3400_thermal_params = {
 	.governor_name = "user_space",
 	.no_hwmon = true,
+	.userspace = true,
 };
 
 static void int3400_setup_gddv(struct int3400_thermal_priv *priv)
diff --git a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
index 6e479deff76b..a103eb42ef2d 100644
--- a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
+++ b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
@@ -208,6 +208,7 @@ EXPORT_SYMBOL_GPL(int340x_thermal_read_trips);
 static struct thermal_zone_params int340x_thermal_params = {
 	.governor_name = "user_space",
 	.no_hwmon = true,
+	.userspace = true,
 };
 
 struct int34x_thermal_zone *int340x_thermal_zone_add(struct acpi_device *adev,
@@ -259,14 +260,9 @@ struct int34x_thermal_zone *int340x_thermal_zone_add(struct acpi_device *adev,
 		ret = PTR_ERR(int34x_thermal_zone->zone);
 		goto err_thermal_zone;
 	}
-	ret = thermal_zone_device_enable(int34x_thermal_zone->zone);
-	if (ret)
-		goto err_enable;
 
 	return int34x_thermal_zone;
 
-err_enable:
-	thermal_zone_device_unregister(int34x_thermal_zone->zone);
 err_thermal_zone:
 	acpi_lpat_free_conversion_table(int34x_thermal_zone->lpat_table);
 	kfree(int34x_thermal_zone->aux_trips);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/3] thermal: intel: intel_pch_thermal: Indicate userspace usage
  2020-11-28 17:54 [PATCH 1/3] thermal: core: Add indication for userspace usage Kai-Heng Feng
  2020-11-28 17:54 ` [PATCH 2/3] thermal: int340x: Indicate " Kai-Heng Feng
@ 2020-11-28 17:54 ` Kai-Heng Feng
  2020-11-30  7:57 ` [PATCH 1/3] thermal: core: Add indication for " Daniel Lezcano
  2 siblings, 0 replies; 18+ messages in thread
From: Kai-Heng Feng @ 2020-11-28 17:54 UTC (permalink / raw)
  To: rui.zhang, daniel.lezcano, amitk
  Cc: Kai-Heng Feng, Sumeet Pawnikar, Andy Shevchenko, Gayatri Kammela,
	Andrzej Pietrasiewicz, Chuhong Yuan, Akinobu Mita,
	open list:THERMAL, open list

The device isn't present under ACPI ThermalZone, and there's a dedicated
userspace daemon for this thermal device.

Let thermal core know it shouldn't handle trips to avoid surprising
thermal shutdown.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/thermal/intel/intel_pch_thermal.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/thermal/intel/intel_pch_thermal.c b/drivers/thermal/intel/intel_pch_thermal.c
index 3b813ebb6ca1..e55e6318d733 100644
--- a/drivers/thermal/intel/intel_pch_thermal.c
+++ b/drivers/thermal/intel/intel_pch_thermal.c
@@ -270,6 +270,10 @@ static struct thermal_zone_device_ops tzd_ops = {
 	.get_trip_temp = pch_get_trip_temp,
 };
 
+static struct thermal_zone_params tzd_params = {
+	.userspace = true,
+};
+
 enum board_ids {
 	board_hsw,
 	board_wpt,
@@ -346,21 +350,16 @@ static int intel_pch_thermal_probe(struct pci_dev *pdev,
 		goto error_cleanup;
 
 	ptd->tzd = thermal_zone_device_register(bi->name, nr_trips, 0, ptd,
-						&tzd_ops, NULL, 0, 0);
+						&tzd_ops, &tzd_params, 0, 0);
 	if (IS_ERR(ptd->tzd)) {
 		dev_err(&pdev->dev, "Failed to register thermal zone %s\n",
 			bi->name);
 		err = PTR_ERR(ptd->tzd);
 		goto error_cleanup;
 	}
-	err = thermal_zone_device_enable(ptd->tzd);
-	if (err)
-		goto err_unregister;
 
 	return 0;
 
-err_unregister:
-	thermal_zone_device_unregister(ptd->tzd);
 error_cleanup:
 	iounmap(ptd->hw_base);
 error_release:
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] thermal: int340x: Indicate userspace usage
  2020-11-28 17:54 ` [PATCH 2/3] thermal: int340x: Indicate " Kai-Heng Feng
@ 2020-11-30  5:29   ` Srinivas Pandruvada
  2020-11-30  5:46     ` Kai-Heng Feng
  0 siblings, 1 reply; 18+ messages in thread
From: Srinivas Pandruvada @ 2020-11-30  5:29 UTC (permalink / raw)
  To: Kai-Heng Feng, rui.zhang, daniel.lezcano, amitk
  Cc: Andrzej Pietrasiewicz, Bartlomiej Zolnierkiewicz, Peter Kaestle,
	Matthew Garrett, Gayatri Kammela, Takashi Iwai, Andrew Morton,
	Andy Shevchenko, Akinobu Mita, open list:THERMAL, open list

On Sun, 2020-11-29 at 01:54 +0800, Kai-Heng Feng wrote:
> The device isn't present under ACPI ThermalZone, and there's a
> dedicated
> userspace daemon for this thermal device.
> 
> Let thermal core know it shouldn't handle trips to avoid surprising
> thermal shutdown.
> 
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/thermal/intel/int340x_thermal/int3400_thermal.c     | 1 +
>  .../thermal/intel/int340x_thermal/int340x_thermal_zone.c    | 6 +---
> --
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
> b/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
> index 0966551cbaaa..2002bc96eb3c 100644
> --- a/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
> +++ b/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
> @@ -439,6 +439,7 @@ static struct thermal_zone_device_ops
> int3400_thermal_ops = {
>  static struct thermal_zone_params int3400_thermal_params = {
>  	.governor_name = "user_space",
>  	.no_hwmon = true,
> +	.userspace = true,
I am copied on only this patch, so I don't know what is this attribute?
I think it is new.

>  };
>  
>  static void int3400_setup_gddv(struct int3400_thermal_priv *priv)
> diff --git
> a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
> b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
> index 6e479deff76b..a103eb42ef2d 100644
> --- a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
> +++ b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
> @@ -208,6 +208,7 @@ EXPORT_SYMBOL_GPL(int340x_thermal_read_trips);
>  static struct thermal_zone_params int340x_thermal_params = {
>  	.governor_name = "user_space",
>  	.no_hwmon = true,
> +	.userspace = true,
>  };
>  
>  struct int34x_thermal_zone *int340x_thermal_zone_add(struct
> acpi_device *adev,
> @@ -259,14 +260,9 @@ struct int34x_thermal_zone
> *int340x_thermal_zone_add(struct acpi_device *adev,
>  		ret = PTR_ERR(int34x_thermal_zone->zone);
>  		goto err_thermal_zone;
>  	}
> -	ret = thermal_zone_device_enable(int34x_thermal_zone->zone);
> -	if (ret)
> -		goto err_enable;

What is the effect of this?
The INT340X zones don't need to poll for temperature. When HW notifies
then user space gets notified via user space governor. Not sure if the
not enabling break that path.

Thanks,
Srinivas

>  
>  	return int34x_thermal_zone;
>  
> -err_enable:
> -	thermal_zone_device_unregister(int34x_thermal_zone->zone);
>  err_thermal_zone:
>  	acpi_lpat_free_conversion_table(int34x_thermal_zone-
> >lpat_table);
>  	kfree(int34x_thermal_zone->aux_trips);


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] thermal: int340x: Indicate userspace usage
  2020-11-30  5:29   ` Srinivas Pandruvada
@ 2020-11-30  5:46     ` Kai-Heng Feng
  0 siblings, 0 replies; 18+ messages in thread
From: Kai-Heng Feng @ 2020-11-30  5:46 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Zhang Rui, daniel.lezcano, amitk, Andrzej Pietrasiewicz,
	Bartlomiej Zolnierkiewicz, Peter Kaestle, Matthew Garrett,
	Gayatri Kammela, Takashi Iwai, Andrew Morton, Andy Shevchenko,
	Akinobu Mita, open list:THERMAL, open list



> On Nov 30, 2020, at 13:29, Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> wrote:
> 
> On Sun, 2020-11-29 at 01:54 +0800, Kai-Heng Feng wrote:
>> The device isn't present under ACPI ThermalZone, and there's a
>> dedicated
>> userspace daemon for this thermal device.
>> 
>> Let thermal core know it shouldn't handle trips to avoid surprising
>> thermal shutdown.
>> 
>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>> ---
>> drivers/thermal/intel/int340x_thermal/int3400_thermal.c     | 1 +
>> .../thermal/intel/int340x_thermal/int340x_thermal_zone.c    | 6 +---
>> --
>> 2 files changed, 2 insertions(+), 5 deletions(-)
>> 
>> diff --git a/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
>> b/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
>> index 0966551cbaaa..2002bc96eb3c 100644
>> --- a/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
>> +++ b/drivers/thermal/intel/int340x_thermal/int3400_thermal.c
>> @@ -439,6 +439,7 @@ static struct thermal_zone_device_ops
>> int3400_thermal_ops = {
>> static struct thermal_zone_params int3400_thermal_params = {
>> 	.governor_name = "user_space",
>> 	.no_hwmon = true,
>> +	.userspace = true,
> I am copied on only this patch, so I don't know what is this attribute?
> I think it is new.

Ok. The first one doesn't seem to be sent out correctly.

Series resent.

> 
>> };
>> 
>> static void int3400_setup_gddv(struct int3400_thermal_priv *priv)
>> diff --git
>> a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
>> b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
>> index 6e479deff76b..a103eb42ef2d 100644
>> --- a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
>> +++ b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
>> @@ -208,6 +208,7 @@ EXPORT_SYMBOL_GPL(int340x_thermal_read_trips);
>> static struct thermal_zone_params int340x_thermal_params = {
>> 	.governor_name = "user_space",
>> 	.no_hwmon = true,
>> +	.userspace = true,
>> };
>> 
>> struct int34x_thermal_zone *int340x_thermal_zone_add(struct
>> acpi_device *adev,
>> @@ -259,14 +260,9 @@ struct int34x_thermal_zone
>> *int340x_thermal_zone_add(struct acpi_device *adev,
>> 		ret = PTR_ERR(int34x_thermal_zone->zone);
>> 		goto err_thermal_zone;
>> 	}
>> -	ret = thermal_zone_device_enable(int34x_thermal_zone->zone);
>> -	if (ret)
>> -		goto err_enable;
> 
> What is the effect of this?
> The INT340X zones don't need to poll for temperature. When HW notifies
> then user space gets notified via user space governor. Not sure if the
> not enabling break that path.

thermal_zone_device_disable()
  thermal_notify_tz_disable()
    thermal_genl_send_event(THERMAL_GENL_EVENT_TZ_DISABLE, &p)

I think it doesn't affect user_space governor.

Kai-Heng

> 
> Thanks,
> Srinivas
> 
>> 
>> 	return int34x_thermal_zone;
>> 
>> -err_enable:
>> -	thermal_zone_device_unregister(int34x_thermal_zone->zone);
>> err_thermal_zone:
>> 	acpi_lpat_free_conversion_table(int34x_thermal_zone-
>>> lpat_table);
>> 	kfree(int34x_thermal_zone->aux_trips);


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-11-28 17:54 [PATCH 1/3] thermal: core: Add indication for userspace usage Kai-Heng Feng
  2020-11-28 17:54 ` [PATCH 2/3] thermal: int340x: Indicate " Kai-Heng Feng
  2020-11-28 17:54 ` [PATCH 3/3] thermal: intel: intel_pch_thermal: " Kai-Heng Feng
@ 2020-11-30  7:57 ` Daniel Lezcano
  2020-11-30  8:23   ` Kai-Heng Feng
  2 siblings, 1 reply; 18+ messages in thread
From: Daniel Lezcano @ 2020-11-30  7:57 UTC (permalink / raw)
  To: Kai-Heng Feng, rui.zhang, amitk
  Cc: open list:THERMAL, open list, Srinivas Pandruvada


[Added Srinivas]

On 28/11/2020 18:54, Kai-Heng Feng wrote:
> We are seeing thermal shutdown on Intel based mobile workstations, the
> shutdown happens during the first trip handle in
> thermal_zone_device_register():
> kernel: thermal thermal_zone15: critical temperature reached (101 C), shutting down
> 
> However, we shouldn't do a thermal shutdown here, since
> 1) We may want to use a dedicated daemon, Intel's thermald in this case,
> to handle thermal shutdown.
> 
> 2) For ACPI based system, _CRT doesn't mean shutdown unless it's inside
> ThermalZone. ACPI Spec, 11.4.4 _CRT (Critical Temperature):
> "... If this object it present under a device, the device’s driver
> evaluates this object to determine the device’s critical cooling
> temperature trip point. This value may then be used by the device’s
> driver to program an internal device temperature sensor trip point."
> 
> So a "critical trip" here merely means we should take a more aggressive
> cooling method.

Well, actually it is stated before:

"This object, when defined under a thermal zone, returns the critical
temperature at which OSPM must shutdown the system".

That is what does the thermal subsystem, no ?

> So add an indication to let thermal core know it should leave thermal
> device to userspace to handle.

You may want to check the 'HOT' trip point and then use the notification
mechanism to get notified in userspace and take action from there (eg.
offline some CPUs).

> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/thermal/thermal_core.c | 3 +++
>  include/linux/thermal.h        | 2 ++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index c6d74bc1c90b..6561e3767529 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -1477,6 +1477,9 @@ thermal_zone_device_register(const char *type, int trips, int mask,
>  			goto unregister;
>  	}
>  
> +	if (tz->tzp && tz->tzp->userspace)
> +		thermal_zone_device_disable(tz);
> +
>  	mutex_lock(&thermal_list_lock);
>  	list_add_tail(&tz->node, &thermal_tz_list);
>  	mutex_unlock(&thermal_list_lock);
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index d07ea27e72a9..e8e8fac78fc8 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -247,6 +247,8 @@ struct thermal_zone_params {
>  	 */
>  	bool no_hwmon;
>  
> +	bool userspace;
> +
>  	int num_tbps;	/* Number of tbp entries */
>  	struct thermal_bind_params *tbp;
>  
> 


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-11-30  7:57 ` [PATCH 1/3] thermal: core: Add indication for " Daniel Lezcano
@ 2020-11-30  8:23   ` Kai-Heng Feng
  2020-11-30 16:19     ` Srinivas Pandruvada
  0 siblings, 1 reply; 18+ messages in thread
From: Kai-Heng Feng @ 2020-11-30  8:23 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Zhang Rui, amitk, open list:THERMAL, open list, Srinivas Pandruvada



> On Nov 30, 2020, at 15:57, Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
> 
> 
> [Added Srinivas]
> 
> On 28/11/2020 18:54, Kai-Heng Feng wrote:
>> We are seeing thermal shutdown on Intel based mobile workstations, the
>> shutdown happens during the first trip handle in
>> thermal_zone_device_register():
>> kernel: thermal thermal_zone15: critical temperature reached (101 C), shutting down
>> 
>> However, we shouldn't do a thermal shutdown here, since
>> 1) We may want to use a dedicated daemon, Intel's thermald in this case,
>> to handle thermal shutdown.
>> 
>> 2) For ACPI based system, _CRT doesn't mean shutdown unless it's inside
>> ThermalZone. ACPI Spec, 11.4.4 _CRT (Critical Temperature):
>> "... If this object it present under a device, the device’s driver
>> evaluates this object to determine the device’s critical cooling
>> temperature trip point. This value may then be used by the device’s
>> driver to program an internal device temperature sensor trip point."
>> 
>> So a "critical trip" here merely means we should take a more aggressive
>> cooling method.
> 
> Well, actually it is stated before:
> 
> "This object, when defined under a thermal zone, returns the critical
> temperature at which OSPM must shutdown the system".

This means specifically for the ACPI ThermalZone in AML, e.g.:

ThermalZone (TZ0) {
....
    Method(_CRT) { ... }
 } // end of TZ0

However the device is not under any ACPI ThermalZone.

> 
> That is what does the thermal subsystem, no ?
> 
>> So add an indication to let thermal core know it should leave thermal
>> device to userspace to handle.
> 
> You may want to check the 'HOT' trip point and then use the notification
> mechanism to get notified in userspace and take action from there (eg.
> offline some CPUs).

For this particular issue we are facing, the thermal shutdown happens in thermal_zone_device_register() and userspace isn't up yet.

Kai-Heng

> 
>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>> ---
>> drivers/thermal/thermal_core.c | 3 +++
>> include/linux/thermal.h        | 2 ++
>> 2 files changed, 5 insertions(+)
>> 
>> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
>> index c6d74bc1c90b..6561e3767529 100644
>> --- a/drivers/thermal/thermal_core.c
>> +++ b/drivers/thermal/thermal_core.c
>> @@ -1477,6 +1477,9 @@ thermal_zone_device_register(const char *type, int trips, int mask,
>> 			goto unregister;
>> 	}
>> 
>> +	if (tz->tzp && tz->tzp->userspace)
>> +		thermal_zone_device_disable(tz);
>> +
>> 	mutex_lock(&thermal_list_lock);
>> 	list_add_tail(&tz->node, &thermal_tz_list);
>> 	mutex_unlock(&thermal_list_lock);
>> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
>> index d07ea27e72a9..e8e8fac78fc8 100644
>> --- a/include/linux/thermal.h
>> +++ b/include/linux/thermal.h
>> @@ -247,6 +247,8 @@ struct thermal_zone_params {
>> 	 */
>> 	bool no_hwmon;
>> 
>> +	bool userspace;
>> +
>> 	int num_tbps;	/* Number of tbp entries */
>> 	struct thermal_bind_params *tbp;
>> 
>> 
> 
> 
> -- 
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-11-30  8:23   ` Kai-Heng Feng
@ 2020-11-30 16:19     ` Srinivas Pandruvada
  2020-11-30 18:04       ` Kai-Heng Feng
  0 siblings, 1 reply; 18+ messages in thread
From: Srinivas Pandruvada @ 2020-11-30 16:19 UTC (permalink / raw)
  To: Kai-Heng Feng, Daniel Lezcano
  Cc: Zhang Rui, amitk, open list:THERMAL, open list

On Mon, 2020-11-30 at 16:23 +0800, Kai-Heng Feng wrote:
> > On Nov 30, 2020, at 15:57, Daniel Lezcano <
> > daniel.lezcano@linaro.org> wrote:
> > 
> > 
> > [Added Srinivas]
> > 
> > On 28/11/2020 18:54, Kai-Heng Feng wrote:
> > > We are seeing thermal shutdown on Intel based mobile
> > > workstations, the
> > > shutdown happens during the first trip handle in
> > > thermal_zone_device_register():
> > > kernel: thermal thermal_zone15: critical temperature reached (101
> > > C), shutting down
> > > 
> > > However, we shouldn't do a thermal shutdown here, since
> > > 1) We may want to use a dedicated daemon, Intel's thermald in
> > > this case,
> > > to handle thermal shutdown.
> > > 
> > > 2) For ACPI based system, _CRT doesn't mean shutdown unless it's
> > > inside
> > > ThermalZone. ACPI Spec, 11.4.4 _CRT (Critical Temperature):
> > > "... If this object it present under a device, the device’s
> > > driver
> > > evaluates this object to determine the device’s critical cooling
> > > temperature trip point. This value may then be used by the
> > > device’s
> > > driver to program an internal device temperature sensor trip
> > > point."
> > > 
> > > So a "critical trip" here merely means we should take a more
> > > aggressive
> > > cooling method.
> > 
> > Well, actually it is stated before:
> > 
> > "This object, when defined under a thermal zone, returns the
> > critical
> > temperature at which OSPM must shutdown the system".
> 
> This means specifically for the ACPI ThermalZone in AML, e.g.:
> 
> ThermalZone (TZ0) {
> ....
>     Method(_CRT) { ... }
>  } // end of TZ0
> 
> However the device is not under any ACPI ThermalZone.
> 
> > That is what does the thermal subsystem, no ?
> > 
> > > So add an indication to let thermal core know it should leave
> > > thermal
> > > device to userspace to handle.
> > 
> > You may want to check the 'HOT' trip point and then use the
> > notification
> > mechanism to get notified in userspace and take action from there
> > (eg.
> > offline some CPUs).
> 
> For this particular issue we are facing, the thermal shutdown happens
> in thermal_zone_device_register() and userspace isn't up yet.

What about creating an new callback

enum thermal_trip_status {
	THERMAL_TRIP_DISABLED = 0,
	THERMAL_TRIP_ENABLED,
};

int get_trip_status(struct thermal_zone_device *, int trip, enum
thermal_trip_status *state);

Then in 
static void handle_thermal_trip(struct thermal_zone_device *tz, int
trip)
{

/* before tz->ops->get_trip_temp(tz, trip, &trip_temp); */
if (tz->ops->get_trip_status) {
	enum thermal_trip_status *status;

	if (!tz->ops->get_trip_status(tz, trip, &status)) {
		if (status == THERMAL_TRIP_DISABLED)
			return;	
	}
}
...
...

}


This callback will help the cases:
- Allows drivers to selectively disable certain trips during init state
or system resume where there can be spikes or always. int340x drivers
can disable always.
- Still give options for drivers to handle critical trip even if they
are bound to user space governors. User space process may be dead, so
still allow kernel to process graceful shutdown

Thanks,
Srinivas

> 
> Kai-Heng
> 
> > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > ---
> > > drivers/thermal/thermal_core.c | 3 +++
> > > include/linux/thermal.h        | 2 ++
> > > 2 files changed, 5 insertions(+)
> > > 
> > > diff --git a/drivers/thermal/thermal_core.c
> > > b/drivers/thermal/thermal_core.c
> > > index c6d74bc1c90b..6561e3767529 100644
> > > --- a/drivers/thermal/thermal_core.c
> > > +++ b/drivers/thermal/thermal_core.c
> > > @@ -1477,6 +1477,9 @@ thermal_zone_device_register(const char
> > > *type, int trips, int mask,
> > > 			goto unregister;
> > > 	}
> > > 
> > > +	if (tz->tzp && tz->tzp->userspace)
> > > +		thermal_zone_device_disable(tz);
> > > +
> > > 	mutex_lock(&thermal_list_lock);
> > > 	list_add_tail(&tz->node, &thermal_tz_list);
> > > 	mutex_unlock(&thermal_list_lock);
> > > diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> > > index d07ea27e72a9..e8e8fac78fc8 100644
> > > --- a/include/linux/thermal.h
> > > +++ b/include/linux/thermal.h
> > > @@ -247,6 +247,8 @@ struct thermal_zone_params {
> > > 	 */
> > > 	bool no_hwmon;
> > > 
> > > +	bool userspace;
> > > +
> > > 	int num_tbps;	/* Number of tbp entries */
> > > 	struct thermal_bind_params *tbp;
> > > 
> > > 
> > 
> > -- 
> > <http://www.linaro.org/> Linaro.org │ Open source software for ARM
> > SoCs
> > 
> > Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> > <http://twitter.com/#!/linaroorg> Twitter |
> > <http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-11-30 16:19     ` Srinivas Pandruvada
@ 2020-11-30 18:04       ` Kai-Heng Feng
  2020-11-30 18:13         ` Srinivas Pandruvada
  0 siblings, 1 reply; 18+ messages in thread
From: Kai-Heng Feng @ 2020-11-30 18:04 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Daniel Lezcano, Zhang Rui, amitk, open list:THERMAL, open list



> On Dec 1, 2020, at 00:19, Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> wrote:
> 
> On Mon, 2020-11-30 at 16:23 +0800, Kai-Heng Feng wrote:
>>> On Nov 30, 2020, at 15:57, Daniel Lezcano <
>>> daniel.lezcano@linaro.org> wrote:
>>> 
>>> 
>>> [Added Srinivas]
>>> 
>>> On 28/11/2020 18:54, Kai-Heng Feng wrote:
>>>> We are seeing thermal shutdown on Intel based mobile
>>>> workstations, the
>>>> shutdown happens during the first trip handle in
>>>> thermal_zone_device_register():
>>>> kernel: thermal thermal_zone15: critical temperature reached (101
>>>> C), shutting down
>>>> 
>>>> However, we shouldn't do a thermal shutdown here, since
>>>> 1) We may want to use a dedicated daemon, Intel's thermald in
>>>> this case,
>>>> to handle thermal shutdown.
>>>> 
>>>> 2) For ACPI based system, _CRT doesn't mean shutdown unless it's
>>>> inside
>>>> ThermalZone. ACPI Spec, 11.4.4 _CRT (Critical Temperature):
>>>> "... If this object it present under a device, the device’s
>>>> driver
>>>> evaluates this object to determine the device’s critical cooling
>>>> temperature trip point. This value may then be used by the
>>>> device’s
>>>> driver to program an internal device temperature sensor trip
>>>> point."
>>>> 
>>>> So a "critical trip" here merely means we should take a more
>>>> aggressive
>>>> cooling method.
>>> 
>>> Well, actually it is stated before:
>>> 
>>> "This object, when defined under a thermal zone, returns the
>>> critical
>>> temperature at which OSPM must shutdown the system".
>> 
>> This means specifically for the ACPI ThermalZone in AML, e.g.:
>> 
>> ThermalZone (TZ0) {
>> ....
>>    Method(_CRT) { ... }
>> } // end of TZ0
>> 
>> However the device is not under any ACPI ThermalZone.
>> 
>>> That is what does the thermal subsystem, no ?
>>> 
>>>> So add an indication to let thermal core know it should leave
>>>> thermal
>>>> device to userspace to handle.
>>> 
>>> You may want to check the 'HOT' trip point and then use the
>>> notification
>>> mechanism to get notified in userspace and take action from there
>>> (eg.
>>> offline some CPUs).
>> 
>> For this particular issue we are facing, the thermal shutdown happens
>> in thermal_zone_device_register() and userspace isn't up yet.
> 
> What about creating an new callback
> 
> enum thermal_trip_status {
> 	THERMAL_TRIP_DISABLED = 0,
> 	THERMAL_TRIP_ENABLED,
> };
> 
> int get_trip_status(struct thermal_zone_device *, int trip, enum
> thermal_trip_status *state);
> 
> Then in 
> static void handle_thermal_trip(struct thermal_zone_device *tz, int
> trip)
> {
> 
> /* before tz->ops->get_trip_temp(tz, trip, &trip_temp); */
> if (tz->ops->get_trip_status) {
> 	enum thermal_trip_status *status;
> 
> 	if (!tz->ops->get_trip_status(tz, trip, &status)) {
> 		if (status == THERMAL_TRIP_DISABLED)
> 			return;	
> 	}
> }
> ...
> ...
> 
> }
> 
> 
> This callback will help the cases:
> - Allows drivers to selectively disable certain trips during init state
> or system resume where there can be spikes or always. int340x drivers
> can disable always.

This sounds really great. This is indeed can happen on system resume, before userspace process thaw.

> - Still give options for drivers to handle critical trip even if they
> are bound to user space governors. User space process may be dead, so
> still allow kernel to process graceful shutdown

To make the scenario happen, do we need a new sysfs to let usespace enable it with THERMAL_TRIP_ENABLED?

Kai-Heng

> 
> Thanks,
> Srinivas
> 
>> 
>> Kai-Heng
>> 
>>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>>> ---
>>>> drivers/thermal/thermal_core.c | 3 +++
>>>> include/linux/thermal.h        | 2 ++
>>>> 2 files changed, 5 insertions(+)
>>>> 
>>>> diff --git a/drivers/thermal/thermal_core.c
>>>> b/drivers/thermal/thermal_core.c
>>>> index c6d74bc1c90b..6561e3767529 100644
>>>> --- a/drivers/thermal/thermal_core.c
>>>> +++ b/drivers/thermal/thermal_core.c
>>>> @@ -1477,6 +1477,9 @@ thermal_zone_device_register(const char
>>>> *type, int trips, int mask,
>>>> 			goto unregister;
>>>> 	}
>>>> 
>>>> +	if (tz->tzp && tz->tzp->userspace)
>>>> +		thermal_zone_device_disable(tz);
>>>> +
>>>> 	mutex_lock(&thermal_list_lock);
>>>> 	list_add_tail(&tz->node, &thermal_tz_list);
>>>> 	mutex_unlock(&thermal_list_lock);
>>>> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
>>>> index d07ea27e72a9..e8e8fac78fc8 100644
>>>> --- a/include/linux/thermal.h
>>>> +++ b/include/linux/thermal.h
>>>> @@ -247,6 +247,8 @@ struct thermal_zone_params {
>>>> 	 */
>>>> 	bool no_hwmon;
>>>> 
>>>> +	bool userspace;
>>>> +
>>>> 	int num_tbps;	/* Number of tbp entries */
>>>> 	struct thermal_bind_params *tbp;
>>>> 
>>>> 
>>> 
>>> -- 
>>> <http://www.linaro.org/> Linaro.org │ Open source software for ARM
>>> SoCs
>>> 
>>> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
>>> <http://twitter.com/#!/linaroorg> Twitter |
>>> <http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-11-30 18:04       ` Kai-Heng Feng
@ 2020-11-30 18:13         ` Srinivas Pandruvada
  2020-11-30 18:22           ` Kai-Heng Feng
  0 siblings, 1 reply; 18+ messages in thread
From: Srinivas Pandruvada @ 2020-11-30 18:13 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Daniel Lezcano, Zhang Rui, amitk, open list:THERMAL, open list

On Tue, 2020-12-01 at 02:04 +0800, Kai-Heng Feng wrote:
> > On Dec 1, 2020, at 00:19, Srinivas Pandruvada <
> > srinivas.pandruvada@linux.intel.com> wrote:
> > 
> > On Mon, 2020-11-30 at 16:23 +0800, Kai-Heng Feng wrote:
> > > > On Nov 30, 2020, at 15:57, Daniel Lezcano <
> > > > daniel.lezcano@linaro.org> wrote:
> > > > 
> > > > 
> > > > [Added Srinivas]
> > > > 
> > > > On 28/11/2020 18:54, Kai-Heng Feng wrote:
> > > > > We are seeing thermal shutdown on Intel based mobile
> > > > > workstations, the
> > > > > shutdown happens during the first trip handle in
> > > > > thermal_zone_device_register():
> > > > > kernel: thermal thermal_zone15: critical temperature reached
> > > > > (101
> > > > > C), shutting down
> > > > > 
> > > > > However, we shouldn't do a thermal shutdown here, since
> > > > > 1) We may want to use a dedicated daemon, Intel's thermald in
> > > > > this case,
> > > > > to handle thermal shutdown.
> > > > > 
> > > > > 2) For ACPI based system, _CRT doesn't mean shutdown unless
> > > > > it's
> > > > > inside
> > > > > ThermalZone. ACPI Spec, 11.4.4 _CRT (Critical Temperature):
> > > > > "... If this object it present under a device, the device’s
> > > > > driver
> > > > > evaluates this object to determine the device’s critical
> > > > > cooling
> > > > > temperature trip point. This value may then be used by the
> > > > > device’s
> > > > > driver to program an internal device temperature sensor trip
> > > > > point."
> > > > > 
> > > > > So a "critical trip" here merely means we should take a more
> > > > > aggressive
> > > > > cooling method.
> > > > 
> > > > Well, actually it is stated before:
> > > > 
> > > > "This object, when defined under a thermal zone, returns the
> > > > critical
> > > > temperature at which OSPM must shutdown the system".
> > > 
> > > This means specifically for the ACPI ThermalZone in AML, e.g.:
> > > 
> > > ThermalZone (TZ0) {
> > > ....
> > >    Method(_CRT) { ... }
> > > } // end of TZ0
> > > 
> > > However the device is not under any ACPI ThermalZone.
> > > 
> > > > That is what does the thermal subsystem, no ?
> > > > 
> > > > > So add an indication to let thermal core know it should leave
> > > > > thermal
> > > > > device to userspace to handle.
> > > > 
> > > > You may want to check the 'HOT' trip point and then use the
> > > > notification
> > > > mechanism to get notified in userspace and take action from
> > > > there
> > > > (eg.
> > > > offline some CPUs).
> > > 
> > > For this particular issue we are facing, the thermal shutdown
> > > happens
> > > in thermal_zone_device_register() and userspace isn't up yet.
> > 
> > What about creating an new callback
> > 
> > enum thermal_trip_status {
> > 	THERMAL_TRIP_DISABLED = 0,
> > 	THERMAL_TRIP_ENABLED,
> > };
> > 
> > int get_trip_status(struct thermal_zone_device *, int trip, enum
> > thermal_trip_status *state);
> > 
> > Then in 
> > static void handle_thermal_trip(struct thermal_zone_device *tz, int
> > trip)
> > {
> > 
> > /* before tz->ops->get_trip_temp(tz, trip, &trip_temp); */
> > if (tz->ops->get_trip_status) {
> > 	enum thermal_trip_status *status;
> > 
> > 	if (!tz->ops->get_trip_status(tz, trip, &status)) {
> > 		if (status == THERMAL_TRIP_DISABLED)
> > 			return;	
> > 	}
> > }
> > ...
> > ...
> > 
> > }
> > 
> > 
> > This callback will help the cases:
> > - Allows drivers to selectively disable certain trips during init
> > state
> > or system resume where there can be spikes or always. int340x
> > drivers
> > can disable always.
> 
> This sounds really great. This is indeed can happen on system resume,
> before userspace process thaw.
> 
> > - Still give options for drivers to handle critical trip even if
> > they
> > are bound to user space governors. User space process may be dead,
> > so
> > still allow kernel to process graceful shutdown
> 
> To make the scenario happen, do we need a new sysfs to let usespace
> enable it with THERMAL_TRIP_ENABLED?
This should be drivers call not user space.

Thanks,
Srinivas


> 
> Kai-Heng
> 
> > Thanks,
> > Srinivas
> > 
> > > Kai-Heng
> > > 
> > > > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > > ---
> > > > > drivers/thermal/thermal_core.c | 3 +++
> > > > > include/linux/thermal.h        | 2 ++
> > > > > 2 files changed, 5 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/thermal/thermal_core.c
> > > > > b/drivers/thermal/thermal_core.c
> > > > > index c6d74bc1c90b..6561e3767529 100644
> > > > > --- a/drivers/thermal/thermal_core.c
> > > > > +++ b/drivers/thermal/thermal_core.c
> > > > > @@ -1477,6 +1477,9 @@ thermal_zone_device_register(const char
> > > > > *type, int trips, int mask,
> > > > > 			goto unregister;
> > > > > 	}
> > > > > 
> > > > > +	if (tz->tzp && tz->tzp->userspace)
> > > > > +		thermal_zone_device_disable(tz);
> > > > > +
> > > > > 	mutex_lock(&thermal_list_lock);
> > > > > 	list_add_tail(&tz->node, &thermal_tz_list);
> > > > > 	mutex_unlock(&thermal_list_lock);
> > > > > diff --git a/include/linux/thermal.h
> > > > > b/include/linux/thermal.h
> > > > > index d07ea27e72a9..e8e8fac78fc8 100644
> > > > > --- a/include/linux/thermal.h
> > > > > +++ b/include/linux/thermal.h
> > > > > @@ -247,6 +247,8 @@ struct thermal_zone_params {
> > > > > 	 */
> > > > > 	bool no_hwmon;
> > > > > 
> > > > > +	bool userspace;
> > > > > +
> > > > > 	int num_tbps;	/* Number of tbp entries */
> > > > > 	struct thermal_bind_params *tbp;
> > > > > 
> > > > > 
> > > > 
> > > > -- 
> > > > <http://www.linaro.org/> Linaro.org │ Open source software for
> > > > ARM
> > > > SoCs
> > > > 
> > > > Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook
> > > > |
> > > > <http://twitter.com/#!/linaroorg> Twitter |
> > > > <http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-11-30 18:13         ` Srinivas Pandruvada
@ 2020-11-30 18:22           ` Kai-Heng Feng
  2020-11-30 18:39             ` Srinivas Pandruvada
  0 siblings, 1 reply; 18+ messages in thread
From: Kai-Heng Feng @ 2020-11-30 18:22 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Daniel Lezcano, Zhang Rui, amitk, open list:THERMAL, open list


> On Dec 1, 2020, at 02:13, Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> wrote:

[snipped] 

>>> What about creating an new callback
>>> 
>>> enum thermal_trip_status {
>>> 	THERMAL_TRIP_DISABLED = 0,
>>> 	THERMAL_TRIP_ENABLED,
>>> };
>>> 
>>> int get_trip_status(struct thermal_zone_device *, int trip, enum
>>> thermal_trip_status *state);
>>> 
>>> Then in 
>>> static void handle_thermal_trip(struct thermal_zone_device *tz, int
>>> trip)
>>> {
>>> 
>>> /* before tz->ops->get_trip_temp(tz, trip, &trip_temp); */
>>> if (tz->ops->get_trip_status) {
>>> 	enum thermal_trip_status *status;
>>> 
>>> 	if (!tz->ops->get_trip_status(tz, trip, &status)) {
>>> 		if (status == THERMAL_TRIP_DISABLED)
>>> 			return;	
>>> 	}
>>> }
>>> ...
>>> ...
>>> 
>>> }
>>> 
>>> 
>>> This callback will help the cases:
>>> - Allows drivers to selectively disable certain trips during init
>>> state
>>> or system resume where there can be spikes or always. int340x
>>> drivers
>>> can disable always.
>> 
>> This sounds really great. This is indeed can happen on system resume,
>> before userspace process thaw.
>> 
>>> - Still give options for drivers to handle critical trip even if
>>> they
>>> are bound to user space governors. User space process may be dead,
>>> so
>>> still allow kernel to process graceful shutdown
>> 
>> To make the scenario happen, do we need a new sysfs to let usespace
>> enable it with THERMAL_TRIP_ENABLED?
> This should be drivers call not user space.

Understood. So after thermal_zone_device_register(), the driver can decide to what to return on get_trip_temp().
Let me work on a new patch if there's no other concern.

Kai-Heng

> 
> Thanks,
> Srinivas


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-11-30 18:22           ` Kai-Heng Feng
@ 2020-11-30 18:39             ` Srinivas Pandruvada
  2020-12-07  5:36               ` Kai-Heng Feng
  0 siblings, 1 reply; 18+ messages in thread
From: Srinivas Pandruvada @ 2020-11-30 18:39 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Daniel Lezcano, Zhang Rui, amitk, open list:THERMAL, open list

On Tue, 2020-12-01 at 02:22 +0800, Kai-Heng Feng wrote:
> > On Dec 1, 2020, at 02:13, Srinivas Pandruvada <
> > srinivas.pandruvada@linux.intel.com> wrote:
> 
> [snipped] 
> 
> > > > What about creating an new callback
> > > > 
> > > > enum thermal_trip_status {
> > > > 	THERMAL_TRIP_DISABLED = 0,
> > > > 	THERMAL_TRIP_ENABLED,
> > > > };
> > > > 
> > > > int get_trip_status(struct thermal_zone_device *, int trip,
> > > > enum
> > > > thermal_trip_status *state);
> > > > 
> > > > Then in 
> > > > static void handle_thermal_trip(struct thermal_zone_device *tz,
> > > > int
> > > > trip)
> > > > {
> > > > 
> > > > /* before tz->ops->get_trip_temp(tz, trip, &trip_temp); */
> > > > if (tz->ops->get_trip_status) {
> > > > 	enum thermal_trip_status *status;
> > > > 
> > > > 	if (!tz->ops->get_trip_status(tz, trip, &status)) {
> > > > 		if (status == THERMAL_TRIP_DISABLED)
> > > > 			return;	
> > > > 	}
> > > > }
> > > > ...
> > > > ...
> > > > 
> > > > }
> > > > 
> > > > 
> > > > This callback will help the cases:
> > > > - Allows drivers to selectively disable certain trips during
> > > > init
> > > > state
> > > > or system resume where there can be spikes or always. int340x
> > > > drivers
> > > > can disable always.
> > > 
> > > This sounds really great. This is indeed can happen on system
> > > resume,
> > > before userspace process thaw.
> > > 
> > > > - Still give options for drivers to handle critical trip even
> > > > if
> > > > they
> > > > are bound to user space governors. User space process may be
> > > > dead,
> > > > so
> > > > still allow kernel to process graceful shutdown
> > > 
> > > To make the scenario happen, do we need a new sysfs to let
> > > usespace
> > > enable it with THERMAL_TRIP_ENABLED?
> > This should be drivers call not user space.
> 
> Understood. So after thermal_zone_device_register(), the driver can
> decide to what to return on get_trip_temp().
get_trip_status()

> Let me work on a new patch if there's no other concern.
Better to wait for confirmation from Daniel and others.

Thanks,
Srinivas

> 
> Kai-Heng
> 
> > Thanks,
> > Srinivas


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-11-30 18:39             ` Srinivas Pandruvada
@ 2020-12-07  5:36               ` Kai-Heng Feng
  2020-12-09  9:30                 ` Daniel Lezcano
  0 siblings, 1 reply; 18+ messages in thread
From: Kai-Heng Feng @ 2020-12-07  5:36 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Srinivas Pandruvada, Zhang Rui, amitk, open list:THERMAL, open list



> On Dec 1, 2020, at 02:39, Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> wrote:
> 
> On Tue, 2020-12-01 at 02:22 +0800, Kai-Heng Feng wrote:
>>> On Dec 1, 2020, at 02:13, Srinivas Pandruvada <
>>> srinivas.pandruvada@linux.intel.com> wrote:
>> 
>> [snipped] 
>> 
>>>>> What about creating an new callback
>>>>> 
>>>>> enum thermal_trip_status {
>>>>> 	THERMAL_TRIP_DISABLED = 0,
>>>>> 	THERMAL_TRIP_ENABLED,
>>>>> };
>>>>> 
>>>>> int get_trip_status(struct thermal_zone_device *, int trip,
>>>>> enum
>>>>> thermal_trip_status *state);
>>>>> 
>>>>> Then in 
>>>>> static void handle_thermal_trip(struct thermal_zone_device *tz,
>>>>> int
>>>>> trip)
>>>>> {
>>>>> 
>>>>> /* before tz->ops->get_trip_temp(tz, trip, &trip_temp); */
>>>>> if (tz->ops->get_trip_status) {
>>>>> 	enum thermal_trip_status *status;
>>>>> 
>>>>> 	if (!tz->ops->get_trip_status(tz, trip, &status)) {
>>>>> 		if (status == THERMAL_TRIP_DISABLED)
>>>>> 			return;	
>>>>> 	}
>>>>> }
>>>>> ...
>>>>> ...
>>>>> 
>>>>> }
>>>>> 
>>>>> 
>>>>> This callback will help the cases:
>>>>> - Allows drivers to selectively disable certain trips during
>>>>> init
>>>>> state
>>>>> or system resume where there can be spikes or always. int340x
>>>>> drivers
>>>>> can disable always.
>>>> 
>>>> This sounds really great. This is indeed can happen on system
>>>> resume,
>>>> before userspace process thaw.
>>>> 
>>>>> - Still give options for drivers to handle critical trip even
>>>>> if
>>>>> they
>>>>> are bound to user space governors. User space process may be
>>>>> dead,
>>>>> so
>>>>> still allow kernel to process graceful shutdown
>>>> 
>>>> To make the scenario happen, do we need a new sysfs to let
>>>> usespace
>>>> enable it with THERMAL_TRIP_ENABLED?
>>> This should be drivers call not user space.
>> 
>> Understood. So after thermal_zone_device_register(), the driver can
>> decide to what to return on get_trip_temp().
> get_trip_status()
> 
>> Let me work on a new patch if there's no other concern.
> Better to wait for confirmation from Daniel and others.

Daniel,

Do you like Srinivas' proposed solution?

I hope we can find a solution in upstream kernel soon.

Kai-Heng

> 
> Thanks,
> Srinivas
> 
>> 
>> Kai-Heng
>> 
>>> Thanks,
>>> Srinivas


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-12-07  5:36               ` Kai-Heng Feng
@ 2020-12-09  9:30                 ` Daniel Lezcano
  2020-12-09 16:10                   ` Srinivas Pandruvada
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Lezcano @ 2020-12-09  9:30 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Srinivas Pandruvada, Zhang Rui, amitk, open list:THERMAL, open list

On 07/12/2020 06:36, Kai-Heng Feng wrote:
> 
> 
>> On Dec 1, 2020, at 02:39, Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> wrote:
>>
>> On Tue, 2020-12-01 at 02:22 +0800, Kai-Heng Feng wrote:
>>>> On Dec 1, 2020, at 02:13, Srinivas Pandruvada <
>>>> srinivas.pandruvada@linux.intel.com> wrote:
>>>
>>> [snipped] 
>>>
>>>>>> What about creating an new callback
>>>>>>
>>>>>> enum thermal_trip_status {
>>>>>> 	THERMAL_TRIP_DISABLED = 0,
>>>>>> 	THERMAL_TRIP_ENABLED,
>>>>>> };
>>>>>>
>>>>>> int get_trip_status(struct thermal_zone_device *, int trip,
>>>>>> enum
>>>>>> thermal_trip_status *state);
>>>>>>
>>>>>> Then in 
>>>>>> static void handle_thermal_trip(struct thermal_zone_device *tz,
>>>>>> int
>>>>>> trip)
>>>>>> {
>>>>>>
>>>>>> /* before tz->ops->get_trip_temp(tz, trip, &trip_temp); */
>>>>>> if (tz->ops->get_trip_status) {
>>>>>> 	enum thermal_trip_status *status;
>>>>>>
>>>>>> 	if (!tz->ops->get_trip_status(tz, trip, &status)) {
>>>>>> 		if (status == THERMAL_TRIP_DISABLED)
>>>>>> 			return;	
>>>>>> 	}
>>>>>> }
>>>>>> ...
>>>>>> ...
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>> This callback will help the cases:
>>>>>> - Allows drivers to selectively disable certain trips during
>>>>>> init
>>>>>> state
>>>>>> or system resume where there can be spikes or always. int340x
>>>>>> drivers
>>>>>> can disable always.
>>>>>
>>>>> This sounds really great. This is indeed can happen on system
>>>>> resume,
>>>>> before userspace process thaw.
>>>>>
>>>>>> - Still give options for drivers to handle critical trip even
>>>>>> if
>>>>>> they
>>>>>> are bound to user space governors. User space process may be
>>>>>> dead,
>>>>>> so
>>>>>> still allow kernel to process graceful shutdown
>>>>>
>>>>> To make the scenario happen, do we need a new sysfs to let
>>>>> usespace
>>>>> enable it with THERMAL_TRIP_ENABLED?
>>>> This should be drivers call not user space.
>>>
>>> Understood. So after thermal_zone_device_register(), the driver can
>>> decide to what to return on get_trip_temp().
>> get_trip_status()
>>
>>> Let me work on a new patch if there's no other concern.
>> Better to wait for confirmation from Daniel and others.
> 
> Daniel,
> 
> Do you like Srinivas' proposed solution?
> 
> I hope we can find a solution in upstream kernel soon.

(just trying to figure out the full context)

If the device is enumerated outside of a thermal zone, the sensor should
not register in the thermal zone no ?



-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-12-09  9:30                 ` Daniel Lezcano
@ 2020-12-09 16:10                   ` Srinivas Pandruvada
  0 siblings, 0 replies; 18+ messages in thread
From: Srinivas Pandruvada @ 2020-12-09 16:10 UTC (permalink / raw)
  To: Daniel Lezcano, Kai-Heng Feng
  Cc: Zhang Rui, amitk, open list:THERMAL, open list

On Wed, 2020-12-09 at 10:30 +0100, Daniel Lezcano wrote:
> On 07/12/2020 06:36, Kai-Heng Feng wrote:
> > 
> > > On Dec 1, 2020, at 02:39, Srinivas Pandruvada <
> > > srinivas.pandruvada@linux.intel.com> wrote:
> > > 
> > > On Tue, 2020-12-01 at 02:22 +0800, Kai-Heng Feng wrote:
> > > > > On Dec 1, 2020, at 02:13, Srinivas Pandruvada <
> > > > > srinivas.pandruvada@linux.intel.com> wrote:
> > > > 
> > > > [snipped] 
> > > > 
> > > > > > > What about creating an new callback
> > > > > > > 
> > > > > > > enum thermal_trip_status {
> > > > > > > 	THERMAL_TRIP_DISABLED = 0,
> > > > > > > 	THERMAL_TRIP_ENABLED,
> > > > > > > };
> > > > > > > 
> > > > > > > int get_trip_status(struct thermal_zone_device *, int
> > > > > > > trip,
> > > > > > > enum
> > > > > > > thermal_trip_status *state);
> > > > > > > 
> > > > > > > Then in 
> > > > > > > static void handle_thermal_trip(struct
> > > > > > > thermal_zone_device *tz,
> > > > > > > int
> > > > > > > trip)
> > > > > > > {
> > > > > > > 
> > > > > > > /* before tz->ops->get_trip_temp(tz, trip, &trip_temp);
> > > > > > > */
> > > > > > > if (tz->ops->get_trip_status) {
> > > > > > > 	enum thermal_trip_status *status;
> > > > > > > 
> > > > > > > 	if (!tz->ops->get_trip_status(tz, trip, &status)) {
> > > > > > > 		if (status == THERMAL_TRIP_DISABLED)
> > > > > > > 			return;	
> > > > > > > 	}
> > > > > > > }
> > > > > > > ...
> > > > > > > ...
> > > > > > > 
> > > > > > > }
> > > > > > > 
> > > > > > > 
> > > > > > > This callback will help the cases:
> > > > > > > - Allows drivers to selectively disable certain trips
> > > > > > > during
> > > > > > > init
> > > > > > > state
> > > > > > > or system resume where there can be spikes or always.
> > > > > > > int340x
> > > > > > > drivers
> > > > > > > can disable always.
> > > > > > 
> > > > > > This sounds really great. This is indeed can happen on
> > > > > > system
> > > > > > resume,
> > > > > > before userspace process thaw.
> > > > > > 
> > > > > > > - Still give options for drivers to handle critical trip
> > > > > > > even
> > > > > > > if
> > > > > > > they
> > > > > > > are bound to user space governors. User space process may
> > > > > > > be
> > > > > > > dead,
> > > > > > > so
> > > > > > > still allow kernel to process graceful shutdown
> > > > > > 
> > > > > > To make the scenario happen, do we need a new sysfs to let
> > > > > > usespace
> > > > > > enable it with THERMAL_TRIP_ENABLED?
> > > > > This should be drivers call not user space.
> > > > 
> > > > Understood. So after thermal_zone_device_register(), the driver
> > > > can
> > > > decide to what to return on get_trip_temp().
> > > get_trip_status()
> > > 
> > > > Let me work on a new patch if there's no other concern.
> > > Better to wait for confirmation from Daniel and others.
> > 
> > Daniel,
> > 
> > Do you like Srinivas' proposed solution?
> > 
> > I hope we can find a solution in upstream kernel soon.
> 
> (just trying to figure out the full context)
> 
> If the device is enumerated outside of a thermal zone, the sensor
> should
> not register in the thermal zone no ?

Other trips are fine, so sensor registry is still required for passive
and active control. We just need to ignore critical trip. These table
are tested by OEM on Windows, where critical trip doesn't result in
immediate shutdown.

Thanks,
Srinivas

> 
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-12-14 18:21 ` Matthew Garrett
@ 2020-12-15 12:49   ` Kai-Heng Feng
  0 siblings, 0 replies; 18+ messages in thread
From: Kai-Heng Feng @ 2020-12-15 12:49 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Zhang, Rui, Daniel Lezcano, amitk, Andrzej Pietrasiewicz,
	Srinivas Pandruvada, open list:THERMAL, open list

On Tue, Dec 15, 2020 at 2:22 AM Matthew Garrett <mjg59@google.com> wrote:
>
> On Sun, Nov 29, 2020 at 9:36 PM Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
> >
> > We are seeing thermal shutdown on Intel based mobile workstations, the
> > shutdown happens during the first trip handle in
> > thermal_zone_device_register():
> > kernel: thermal thermal_zone15: critical temperature reached (101 C), shutting down
>
> Is the temperature reported by the thermal zone actually correct here?
> 101 C seems extremely excessive.

According to ODM/OEM, it's correct.
It's a short period when Intel Turbo Boost kicks in.

Kai-Heng

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] thermal: core: Add indication for userspace usage
  2020-11-30  5:36 Kai-Heng Feng
@ 2020-12-14 18:21 ` Matthew Garrett
  2020-12-15 12:49   ` Kai-Heng Feng
  0 siblings, 1 reply; 18+ messages in thread
From: Matthew Garrett @ 2020-12-14 18:21 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Zhang, Rui, daniel.lezcano, amitk, andrzej.p,
	Srinivas Pandruvada, open list:THERMAL, open list

On Sun, Nov 29, 2020 at 9:36 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> We are seeing thermal shutdown on Intel based mobile workstations, the
> shutdown happens during the first trip handle in
> thermal_zone_device_register():
> kernel: thermal thermal_zone15: critical temperature reached (101 C), shutting down

Is the temperature reported by the thermal zone actually correct here?
101 C seems extremely excessive.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] thermal: core: Add indication for userspace usage
@ 2020-11-30  5:36 Kai-Heng Feng
  2020-12-14 18:21 ` Matthew Garrett
  0 siblings, 1 reply; 18+ messages in thread
From: Kai-Heng Feng @ 2020-11-30  5:36 UTC (permalink / raw)
  To: rui.zhang, daniel.lezcano, amitk
  Cc: andrzej.p, mjg59, srinivas.pandruvada, Kai-Heng Feng,
	open list:THERMAL, open list

We are seeing thermal shutdown on Intel based mobile workstations, the
shutdown happens during the first trip handle in
thermal_zone_device_register():
kernel: thermal thermal_zone15: critical temperature reached (101 C), shutting down

However, we shouldn't do a thermal shutdown here, since
1) We may want to use a dedicated daemon, Intel's thermald in this case,
to handle thermal shutdown.

2) For ACPI based system, _CRT doesn't mean shutdown unless it's inside
ThermalZone. ACPI Spec, 11.4.4 _CRT (Critical Temperature):
"... If this object it present under a device, the device’s driver
evaluates this object to determine the device’s critical cooling
temperature trip point. This value may then be used by the device’s
driver to program an internal device temperature sensor trip point."

So a "critical trip" here merely means we should take a more aggressive
cooling method.

So add an indication to let thermal core know it should leave thermal
device to userspace to handle.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/thermal/thermal_core.c | 3 +++
 include/linux/thermal.h        | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index c6d74bc1c90b..6561e3767529 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1477,6 +1477,9 @@ thermal_zone_device_register(const char *type, int trips, int mask,
 			goto unregister;
 	}
 
+	if (tz->tzp && tz->tzp->userspace)
+		thermal_zone_device_disable(tz);
+
 	mutex_lock(&thermal_list_lock);
 	list_add_tail(&tz->node, &thermal_tz_list);
 	mutex_unlock(&thermal_list_lock);
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index d07ea27e72a9..e8e8fac78fc8 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -247,6 +247,8 @@ struct thermal_zone_params {
 	 */
 	bool no_hwmon;
 
+	bool userspace;
+
 	int num_tbps;	/* Number of tbp entries */
 	struct thermal_bind_params *tbp;
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-12-15 12:51 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-28 17:54 [PATCH 1/3] thermal: core: Add indication for userspace usage Kai-Heng Feng
2020-11-28 17:54 ` [PATCH 2/3] thermal: int340x: Indicate " Kai-Heng Feng
2020-11-30  5:29   ` Srinivas Pandruvada
2020-11-30  5:46     ` Kai-Heng Feng
2020-11-28 17:54 ` [PATCH 3/3] thermal: intel: intel_pch_thermal: " Kai-Heng Feng
2020-11-30  7:57 ` [PATCH 1/3] thermal: core: Add indication for " Daniel Lezcano
2020-11-30  8:23   ` Kai-Heng Feng
2020-11-30 16:19     ` Srinivas Pandruvada
2020-11-30 18:04       ` Kai-Heng Feng
2020-11-30 18:13         ` Srinivas Pandruvada
2020-11-30 18:22           ` Kai-Heng Feng
2020-11-30 18:39             ` Srinivas Pandruvada
2020-12-07  5:36               ` Kai-Heng Feng
2020-12-09  9:30                 ` Daniel Lezcano
2020-12-09 16:10                   ` Srinivas Pandruvada
2020-11-30  5:36 Kai-Heng Feng
2020-12-14 18:21 ` Matthew Garrett
2020-12-15 12:49   ` Kai-Heng Feng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).