[PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
@ 2023-06-07  0:37 Eduardo Valentin
  2023-06-07  6:32 ` Zhang, Rui
  2023-06-07  9:24 ` Daniel Lezcano
  0 siblings, 2 replies; 27+ messages in thread
From: Eduardo Valentin @ 2023-06-07  0:37 UTC (permalink / raw)
  To: evalenti, eduval, rafael, daniel.lezcano, linux-pm, linux-kernel
  Cc: Amit Kucheria, Zhang Rui

From: Eduardo Valentin <eduval@amazon.com>

As the thermal zone caches the current and last temperature
value, the sysfs interface can use that instead of
forcing an actual update or read from the device.
This way, if multiple userspace requests are coming
in, we avoid storming the device with multiple reads
and potentially clogging the timing requirement
for the governors.

Cc: "Rafael J. Wysocki" <rafael@kernel.org> (supporter:THERMAL)
Cc: Daniel Lezcano <daniel.lezcano@linaro.org> (supporter:THERMAL)
Cc: Amit Kucheria <amitk@kernel.org> (reviewer:THERMAL)
Cc: Zhang Rui <rui.zhang@intel.com> (reviewer:THERMAL)
Cc: linux-pm@vger.kernel.org (open list:THERMAL)
Cc: linux-kernel@vger.kernel.org (open list)

Signed-off-by: Eduardo Valentin <eduval@amazon.com>
---
 drivers/thermal/thermal_sysfs.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/thermal/thermal_sysfs.c b/drivers/thermal/thermal_sysfs.c
index b6daea2398da..a240c58d9e08 100644
--- a/drivers/thermal/thermal_sysfs.c
+++ b/drivers/thermal/thermal_sysfs.c
@@ -35,12 +35,23 @@ static ssize_t
 temp_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	struct thermal_zone_device *tz = to_thermal_zone(dev);
-	int temperature, ret;
-
-	ret = thermal_zone_get_temp(tz, &temperature);
+	int temperature;
 
-	if (ret)
-		return ret;
+	/*
+	 * don't force new update from external reads
+	 * This way we avoid messing up with time constraints.
+	 */
+	if (tz->mode == THERMAL_DEVICE_DISABLED) {
+		int r;
+
+		r = thermal_zone_get_temp(tz, &temperature); /* holds tz->lock*/
+		if (r)
+			return r;
+	} else {
+		mutex_lock(&tz->lock);
+		temperature = tz->temperature;
+		mutex_unlock(&tz->lock);
+	}
 
 	return sprintf(buf, "%d\n", temperature);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-07  0:37 [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs Eduardo Valentin
@ 2023-06-07  6:32 ` Zhang, Rui
  2023-06-07 16:28   ` Eduardo Valentin
  2023-06-07  9:24 ` Daniel Lezcano
  1 sibling, 1 reply; 27+ messages in thread
From: Zhang, Rui @ 2023-06-07  6:32 UTC (permalink / raw)
  To: linux-pm, Valentin, Eduardo, rafael, evalenti, daniel.lezcano,
	linux-kernel
  Cc: amitk

On Tue, 2023-06-06 at 17:37 -0700, Eduardo Valentin wrote:
> From: Eduardo Valentin <eduval@amazon.com>
> 
> As the thermal zone caches the current and last temperature
> value, the sysfs interface can use that instead of
> forcing an actual update or read from the device.
> This way, if multiple userspace requests are coming
> in, we avoid storming the device with multiple reads
> and potentially clogging the timing requirement
> for the governors.
> 
> Cc: "Rafael J. Wysocki" <rafael@kernel.org> (supporter:THERMAL)
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> (supporter:THERMAL)
> Cc: Amit Kucheria <amitk@kernel.org> (reviewer:THERMAL)
> Cc: Zhang Rui <rui.zhang@intel.com> (reviewer:THERMAL)
> Cc: linux-pm@vger.kernel.org (open list:THERMAL)
> Cc: linux-kernel@vger.kernel.org (open list)
> 
> Signed-off-by: Eduardo Valentin <eduval@amazon.com>
> ---
>  drivers/thermal/thermal_sysfs.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/thermal/thermal_sysfs.c
> b/drivers/thermal/thermal_sysfs.c
> index b6daea2398da..a240c58d9e08 100644
> --- a/drivers/thermal/thermal_sysfs.c
> +++ b/drivers/thermal/thermal_sysfs.c
> @@ -35,12 +35,23 @@ static ssize_t
>  temp_show(struct device *dev, struct device_attribute *attr, char
> *buf)
>  {
>         struct thermal_zone_device *tz = to_thermal_zone(dev);
> -       int temperature, ret;
> -
> -       ret = thermal_zone_get_temp(tz, &temperature);
> +       int temperature;
>  
> -       if (ret)
> -               return ret;
> +       /*
> +        * don't force new update from external reads
> +        * This way we avoid messing up with time constraints.
> +        */
> +       if (tz->mode == THERMAL_DEVICE_DISABLED) {
> +               int r;
> +
> +               r = thermal_zone_get_temp(tz, &temperature); /* holds
> tz->lock*/

what is the expected behavior of a disabled zone?

IMO, the hardware may not be functional at this point, and reading the
temperature should be avoided, as we do in
__thermal_zone_device_update().

should we just return failure in this case?

userspace should poke the temp attribute for enabled zones only.

thanks,
rui
> +               if (r)
> +                       return r;
> +       } else {
> +               mutex_lock(&tz->lock);
> +               temperature = tz->temperature;
> +               mutex_unlock(&tz->lock);
> +       }
>  
>         return sprintf(buf, "%d\n", temperature);
>  }


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-07  0:37 [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs Eduardo Valentin
  2023-06-07  6:32 ` Zhang, Rui
@ 2023-06-07  9:24 ` Daniel Lezcano
  2023-06-07 16:38   ` Eduardo Valentin
  1 sibling, 1 reply; 27+ messages in thread
From: Daniel Lezcano @ 2023-06-07  9:24 UTC (permalink / raw)
  To: Eduardo Valentin, eduval, rafael, linux-pm, linux-kernel
  Cc: Amit Kucheria, Zhang Rui


Hi Eduardo,

On 07/06/2023 02:37, Eduardo Valentin wrote:
> From: Eduardo Valentin <eduval@amazon.com>
> 
> As the thermal zone caches the current and last temperature
> value, the sysfs interface can use that instead of
> forcing an actual update or read from the device.

If the read fails, userspace can handle that by using the previous 
value. Do we really want to hide driver dysfunctions?

> This way, if multiple userspace requests are coming
> in, we avoid storming the device with multiple reads
> and potentially clogging the timing requirement
> for the governors.


Can you elaborate 'the timing requirement for the governors' ? I'm 
missing the point

> Cc: "Rafael J. Wysocki" <rafael@kernel.org> (supporter:THERMAL)
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> (supporter:THERMAL)
> Cc: Amit Kucheria <amitk@kernel.org> (reviewer:THERMAL)
> Cc: Zhang Rui <rui.zhang@intel.com> (reviewer:THERMAL)
> Cc: linux-pm@vger.kernel.org (open list:THERMAL)
> Cc: linux-kernel@vger.kernel.org (open list)
> 
> Signed-off-by: Eduardo Valentin <eduval@amazon.com>
> ---
>   drivers/thermal/thermal_sysfs.c | 21 ++++++++++++++++-----
>   1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/thermal/thermal_sysfs.c b/drivers/thermal/thermal_sysfs.c
> index b6daea2398da..a240c58d9e08 100644
> --- a/drivers/thermal/thermal_sysfs.c
> +++ b/drivers/thermal/thermal_sysfs.c
> @@ -35,12 +35,23 @@ static ssize_t
>   temp_show(struct device *dev, struct device_attribute *attr, char *buf)
>   {
>   	struct thermal_zone_device *tz = to_thermal_zone(dev);
> -	int temperature, ret;
> -
> -	ret = thermal_zone_get_temp(tz, &temperature);
> +	int temperature;
>   
> -	if (ret)
> -		return ret;
> +	/*
> +	 * don't force new update from external reads
> +	 * This way we avoid messing up with time constraints.
> +	 */
> +	if (tz->mode == THERMAL_DEVICE_DISABLED) {
> +		int r;
> +
> +		r = thermal_zone_get_temp(tz, &temperature); /* holds tz->lock*/
> +		if (r)
> +			return r;
> +	} else {
> +		mutex_lock(&tz->lock);
> +		temperature = tz->temperature;
> +		mutex_unlock(&tz->lock);
> +	}

No please, we are pushing since several weeks a lot of changes to 
encapsulate the thermal zone device structure and prevent external core 
components to use the internals directly. Even if we can consider the 
thermal_sysfs as part of the core code, that changes is not sysfs related.

>   	return sprintf(buf, "%d\n", temperature);
>   }

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-07  6:32 ` Zhang, Rui
@ 2023-06-07 16:28   ` Eduardo Valentin
  0 siblings, 0 replies; 27+ messages in thread
From: Eduardo Valentin @ 2023-06-07 16:28 UTC (permalink / raw)
  To: Zhang, Rui
  Cc: linux-pm, Valentin, Eduardo, rafael, evalenti, daniel.lezcano,
	linux-kernel, amitk

Rui!

Long time no chatting! In this case, no email exchange. Good to hear from you.

On Wed, Jun 07, 2023 at 06:32:46AM +0000, Zhang, Rui wrote:
> 
> 
> 
> On Tue, 2023-06-06 at 17:37 -0700, Eduardo Valentin wrote:
> > From: Eduardo Valentin <eduval@amazon.com>
> >
> > As the thermal zone caches the current and last temperature
> > value, the sysfs interface can use that instead of
> > forcing an actual update or read from the device.
> > This way, if multiple userspace requests are coming
> > in, we avoid storming the device with multiple reads
> > and potentially clogging the timing requirement
> > for the governors.
> >
> > Cc: "Rafael J. Wysocki" <rafael@kernel.org> (supporter:THERMAL)
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org> (supporter:THERMAL)
> > Cc: Amit Kucheria <amitk@kernel.org> (reviewer:THERMAL)
> > Cc: Zhang Rui <rui.zhang@intel.com> (reviewer:THERMAL)
> > Cc: linux-pm@vger.kernel.org (open list:THERMAL)
> > Cc: linux-kernel@vger.kernel.org (open list)
> >
> > Signed-off-by: Eduardo Valentin <eduval@amazon.com>
> > ---
> >  drivers/thermal/thermal_sysfs.c | 21 ++++++++++++++++-----
> >  1 file changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/thermal/thermal_sysfs.c
> > b/drivers/thermal/thermal_sysfs.c
> > index b6daea2398da..a240c58d9e08 100644
> > --- a/drivers/thermal/thermal_sysfs.c
> > +++ b/drivers/thermal/thermal_sysfs.c
> > @@ -35,12 +35,23 @@ static ssize_t
> >  temp_show(struct device *dev, struct device_attribute *attr, char
> > *buf)
> >  {
> >         struct thermal_zone_device *tz = to_thermal_zone(dev);
> > -       int temperature, ret;
> > -
> > -       ret = thermal_zone_get_temp(tz, &temperature);
> > +       int temperature;
> >
> > -       if (ret)
> > -               return ret;
> > +       /*
> > +        * don't force new update from external reads
> > +        * This way we avoid messing up with time constraints.
> > +        */
> > +       if (tz->mode == THERMAL_DEVICE_DISABLED) {
> > +               int r;
> > +
> > +               r = thermal_zone_get_temp(tz, &temperature); /* holds
> > tz->lock*/
> 
> what is the expected behavior of a disabled zone?
> 
> IMO, the hardware may not be functional at this point, and reading the
> temperature should be avoided, as we do in
> __thermal_zone_device_update().
> 
> should we just return failure in this case?
> 
> userspace should poke the temp attribute for enabled zones only.

While I see your point, My understanding is that thermal zone mode
is either kernel mode or userspace mode, which to my interpretation,
it dictating where the control is, not that there is a malfunction,
necessarily.

Taking that perspective, the expected behavior here is to have a
in userspace control/governor, where it:
1. disables the in kernel control
2. monitors the thermal zone by reading the /temp property
3. Actuates on the assigned cooling devices for the thermal zone.

The above setup works pretty well for non critical control, where
the system design or state does not require an in kernel control.
And for that scenario, the proposed cached value will not be updated
given that the in kernel thread is not collecting/updating temperature
values anymore, therefore, the sysfs entry has to talk to the
driver to get the most current value.

For the failure case you referred to, Rui, This patch will handle it
too. It will talk to the driver, if the device is malfunction, the
driver will return an error which will be reported back
to userspace, as an error code upon read, which is expected behavior
for userspace to know that there is a problem.

> 
> thanks,
> rui
> > +               if (r)
> > +                       return r;
> > +       } else {
> > +               mutex_lock(&tz->lock);
> > +               temperature = tz->temperature;
> > +               mutex_unlock(&tz->lock);
> > +       }
> >
> >         return sprintf(buf, "%d\n", temperature);
> >  }
> 

-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-07  9:24 ` Daniel Lezcano
@ 2023-06-07 16:38   ` Eduardo Valentin
  2023-06-07 18:23     ` Daniel Lezcano
  2023-06-10 17:24     ` Russell Haley
  0 siblings, 2 replies; 27+ messages in thread
From: Eduardo Valentin @ 2023-06-07 16:38 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Eduardo Valentin, eduval, rafael, linux-pm, linux-kernel,
	Amit Kucheria, Zhang Rui

Hey Daniel,

Thanks for taking the time to read the patch.

On Wed, Jun 07, 2023 at 11:24:21AM +0200, Daniel Lezcano wrote:
> 
> 
> 
> Hi Eduardo,
> 
> On 07/06/2023 02:37, Eduardo Valentin wrote:
> > From: Eduardo Valentin <eduval@amazon.com>
> > 
> > As the thermal zone caches the current and last temperature
> > value, the sysfs interface can use that instead of
> > forcing an actual update or read from the device.
> 
> If the read fails, userspace can handle that by using the previous
> value. Do we really want to hide driver dysfunctions?

Good point.

In fact I thought of this exact problem. I sent only this patch,
but it has more changes to come.

The next changes will replicate the current design of
storing last_temperature in the thermal zone to also store
the last return value, success or error, on the thermal zone
too so that we can use here at the front end to report back
to userspace when the reads are failing.

But yes, you are right, we do not want to keep reporting
a successful read when the thermal zone thread has been
failing to update the value, that needs to be reported
up back to userspace.

> 
> > This way, if multiple userspace requests are coming
> > in, we avoid storming the device with multiple reads
> > and potentially clogging the timing requirement
> > for the governors.
> 
> 
> Can you elaborate 'the timing requirement for the governors' ? I'm
> missing the point

The point is to avoid contention on the device update path.
Governor that use differential equations on temperature over time
will be very time sensitive. Step wise, power allocator, or any
PID will be very sensitive to time. So, If userspace is hitting
this API too often we can see cases where the updates needed to
service userspace may defer/delay the execution of the governor
logic.

Despite that, there is really no point to have more updates than
what was configured for the thermal zone to support. Say that
we configure a thermal zone to update itself every 500ms, yet
userspace keeps sending reads every 100ms, we do not need necessarily
to do a trip to the device every single time to update the temperature,
as per the design for the thermal zone.

> 
> > Cc: "Rafael J. Wysocki" <rafael@kernel.org> (supporter:THERMAL)
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org> (supporter:THERMAL)
> > Cc: Amit Kucheria <amitk@kernel.org> (reviewer:THERMAL)
> > Cc: Zhang Rui <rui.zhang@intel.com> (reviewer:THERMAL)
> > Cc: linux-pm@vger.kernel.org (open list:THERMAL)
> > Cc: linux-kernel@vger.kernel.org (open list)
> > 
> > Signed-off-by: Eduardo Valentin <eduval@amazon.com>
> > ---
> >   drivers/thermal/thermal_sysfs.c | 21 ++++++++++++++++-----
> >   1 file changed, 16 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/thermal/thermal_sysfs.c b/drivers/thermal/thermal_sysfs.c
> > index b6daea2398da..a240c58d9e08 100644
> > --- a/drivers/thermal/thermal_sysfs.c
> > +++ b/drivers/thermal/thermal_sysfs.c
> > @@ -35,12 +35,23 @@ static ssize_t
> >   temp_show(struct device *dev, struct device_attribute *attr, char *buf)
> >   {
> >       struct thermal_zone_device *tz = to_thermal_zone(dev);
> > -     int temperature, ret;
> > -
> > -     ret = thermal_zone_get_temp(tz, &temperature);
> > +     int temperature;
> > 
> > -     if (ret)
> > -             return ret;
> > +     /*
> > +      * don't force new update from external reads
> > +      * This way we avoid messing up with time constraints.
> > +      */
> > +     if (tz->mode == THERMAL_DEVICE_DISABLED) {
> > +             int r;
> > +
> > +             r = thermal_zone_get_temp(tz, &temperature); /* holds tz->lock*/
> > +             if (r)
> > +                     return r;
> > +     } else {
> > +             mutex_lock(&tz->lock);
> > +             temperature = tz->temperature;
> > +             mutex_unlock(&tz->lock);
> > +     }
> 
> No please, we are pushing since several weeks a lot of changes to
> encapsulate the thermal zone device structure and prevent external core
> components to use the internals directly. Even if we can consider the
> thermal_sysfs as part of the core code, that changes is not sysfs related.

Can you clarify your concern, is it the direct access ? The lock ? 
what is the concern?

What is your suggestion here? Do you want me to write a helper
function that gets tz->temperature without doing a ops->get_temp()?

Let me know.

> 
> >       return sprintf(buf, "%d\n", temperature);
> >   }
> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
> 

-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-07 16:38   ` Eduardo Valentin
@ 2023-06-07 18:23     ` Daniel Lezcano
  2023-06-08 17:44       ` Eduardo Valentin
  2023-06-10 17:24     ` Russell Haley
  1 sibling, 1 reply; 27+ messages in thread
From: Daniel Lezcano @ 2023-06-07 18:23 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: eduval, rafael, linux-pm, linux-kernel, Amit Kucheria, Zhang Rui

On 07/06/2023 18:38, Eduardo Valentin wrote:
> Hey Daniel,
> 
> Thanks for taking the time to read the patch.
> 
> On Wed, Jun 07, 2023 at 11:24:21AM +0200, Daniel Lezcano wrote:
>>
>>
>>
>> Hi Eduardo,
>>
>> On 07/06/2023 02:37, Eduardo Valentin wrote:
>>> From: Eduardo Valentin <eduval@amazon.com>
>>>
>>> As the thermal zone caches the current and last temperature
>>> value, the sysfs interface can use that instead of
>>> forcing an actual update or read from the device.
>>
>> If the read fails, userspace can handle that by using the previous
>> value. Do we really want to hide driver dysfunctions?
> 
> Good point.
> 
> In fact I thought of this exact problem. I sent only this patch,
> but it has more changes to come.
> 
> The next changes will replicate the current design of
> storing last_temperature in the thermal zone to also store
> the last return value, success or error, on the thermal zone
> too so that we can use here at the front end to report back
> to userspace when the reads are failing.
 >
> But yes, you are right, we do not want to keep reporting
> a successful read when the thermal zone thread has been
> failing to update the value, that needs to be reported
> up back to userspace.

IIUC, you want the temperature to be updated only by the polling thread 
and when the userspace reads the temperature, it reads a cached value, 
is that correct ?

>>> This way, if multiple userspace requests are coming
>>> in, we avoid storming the device with multiple reads
>>> and potentially clogging the timing requirement
>>> for the governors.

Sorry, I'm not convinced :/

>> Can you elaborate 'the timing requirement for the governors' ? I'm
>> missing the point
> 
> 
> The point is to avoid contention on the device update path.
> Governor that use differential equations on temperature over time
> will be very time sensitive. Step wise, power allocator, or any
> PID will be very sensitive to time. So, If userspace is hitting
> this API too often we can see cases where the updates needed to
> service userspace may defer/delay the execution of the governor
> logic.

AFAIK, reading a temperature value is usually between less than 1us and 
10us (depending on the sampling configuration in the driver).

I've done some measurements to read a temperature through sysfs and 
netlink. It is between 2us and 7us on a platforms where reading a 900ns 
latency sensor, sysfs being faster.

The time sensitive governor is the power allocator and usually, the 
sampling period is between 100ms and 250ms.

The thermal zones with fast thermal transitions may need faster sampling 
period but the hardware offloads the mitigation in this case by sampling 
every 100*us*, IIRC.

Given that, I'm not sure we are facing a design issue with thermal 
framework.

Do you have a use case with some measurements to spot an issue or is it 
a potential issue you identified ?

> Despite that, there is really no point to have more updates than
> what was configured for the thermal zone to support. Say that
> we configure a thermal zone to update itself every 500ms, yet
> userspace keeps sending reads every 100ms, we do not need necessarily
> to do a trip to the device every single time to update the temperature,
> as per the design for the thermal zone.

Sorry, I do not agree. The thermal zone is configured with a monitoring 
sampling period to detect trip point crossing and it can be configured 
without sampling period at all, just basing on trip point crossing 
interrupt.

The userspace has the right to read the current temperature it is 
interested in. For instance, the 'thermometer' in 
tools/thermal/thermometer may want to read the temperature at a high 
rate in order to profile the thermal zones with/without the mitigation 
kicking in.

Caching the values just break the current behavior.

>>> Cc: "Rafael J. Wysocki" <rafael@kernel.org> (supporter:THERMAL)
>>> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> (supporter:THERMAL)
>>> Cc: Amit Kucheria <amitk@kernel.org> (reviewer:THERMAL)
>>> Cc: Zhang Rui <rui.zhang@intel.com> (reviewer:THERMAL)
>>> Cc: linux-pm@vger.kernel.org (open list:THERMAL)
>>> Cc: linux-kernel@vger.kernel.org (open list)
>>>
>>> Signed-off-by: Eduardo Valentin <eduval@amazon.com>
>>> ---
>>>    drivers/thermal/thermal_sysfs.c | 21 ++++++++++++++++-----
>>>    1 file changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/thermal/thermal_sysfs.c b/drivers/thermal/thermal_sysfs.c
>>> index b6daea2398da..a240c58d9e08 100644
>>> --- a/drivers/thermal/thermal_sysfs.c
>>> +++ b/drivers/thermal/thermal_sysfs.c
>>> @@ -35,12 +35,23 @@ static ssize_t
>>>    temp_show(struct device *dev, struct device_attribute *attr, char *buf)
>>>    {
>>>        struct thermal_zone_device *tz = to_thermal_zone(dev);
>>> -     int temperature, ret;
>>> -
>>> -     ret = thermal_zone_get_temp(tz, &temperature);
>>> +     int temperature;
>>>
>>> -     if (ret)
>>> -             return ret;
>>> +     /*
>>> +      * don't force new update from external reads
>>> +      * This way we avoid messing up with time constraints.
>>> +      */
>>> +     if (tz->mode == THERMAL_DEVICE_DISABLED) {
>>> +             int r;
>>> +
>>> +             r = thermal_zone_get_temp(tz, &temperature); /* holds tz->lock*/
>>> +             if (r)
>>> +                     return r;
>>> +     } else {
>>> +             mutex_lock(&tz->lock);
>>> +             temperature = tz->temperature;
>>> +             mutex_unlock(&tz->lock);
>>> +     }
>>
>> No please, we are pushing since several weeks a lot of changes to
>> encapsulate the thermal zone device structure and prevent external core
>> components to use the internals directly. Even if we can consider the
>> thermal_sysfs as part of the core code, that changes is not sysfs related.
> 
> Can you clarify your concern, is it the direct access ? The lock ?
> what is the concern?
> 
> What is your suggestion here? Do you want me to write a helper
> function that gets tz->temperature without doing a ops->get_temp()?

The concern is the thermal framework code is not really in a good shape 
and the internals leaked around in the different drivers all around the 
subsystems, that led to drivers tampering with the thermal zone device 
structure data (including taking locks).

There are ongoing efforts to do some house cleaning around the thermal 
zone device structure encapsulation as well as getting ride of the 
different ops get_trip* by replacing that with a generic trip point 
structure to deal with in the thermal core code.

Those with the final objective to have the trip points ordered and 
handle the trip point crossing correctly detected (it is currently 
somehow broken).

Here the code does directly access tz->*.

Even not being convinced by the change proposal, I would have handle 
this value cache in the thermal_zone_get_temp() function directly in 
order to preserve the code self-encapsulation.

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-07 18:23     ` Daniel Lezcano
@ 2023-06-08 17:44       ` Eduardo Valentin
  2023-06-12  8:17         ` Daniel Lezcano
  0 siblings, 1 reply; 27+ messages in thread
From: Eduardo Valentin @ 2023-06-08 17:44 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Eduardo Valentin, eduval, rafael, linux-pm, linux-kernel,
	Amit Kucheria, Zhang Rui

Hello Daniel,

On Wed, Jun 07, 2023 at 08:23:08PM +0200, Daniel Lezcano wrote:
> 
> 
> 
> On 07/06/2023 18:38, Eduardo Valentin wrote:
> > Hey Daniel,
> > 
> > Thanks for taking the time to read the patch.
> > 
> > On Wed, Jun 07, 2023 at 11:24:21AM +0200, Daniel Lezcano wrote:
> > > 
> > > 
> > > 
> > > Hi Eduardo,
> > > 
> > > On 07/06/2023 02:37, Eduardo Valentin wrote:
> > > > From: Eduardo Valentin <eduval@amazon.com>
> > > > 
> > > > As the thermal zone caches the current and last temperature
> > > > value, the sysfs interface can use that instead of
> > > > forcing an actual update or read from the device.
> > > 
> > > If the read fails, userspace can handle that by using the previous
> > > value. Do we really want to hide driver dysfunctions?
> > 
> > Good point.
> > 
> > In fact I thought of this exact problem. I sent only this patch,
> > but it has more changes to come.
> > 
> > The next changes will replicate the current design of
> > storing last_temperature in the thermal zone to also store
> > the last return value, success or error, on the thermal zone
> > too so that we can use here at the front end to report back
> > to userspace when the reads are failing.
> >
> > But yes, you are right, we do not want to keep reporting
> > a successful read when the thermal zone thread has been
> > failing to update the value, that needs to be reported
> > up back to userspace.
> 
> IIUC, you want the temperature to be updated only by the polling thread
> and when the userspace reads the temperature, it reads a cached value,
> is that correct ?

That is correct. I want to minimize the amount of actual driver / device accesses.

> 
> > > > This way, if multiple userspace requests are coming
> > > > in, we avoid storming the device with multiple reads
> > > > and potentially clogging the timing requirement
> > > > for the governors.
> 
> Sorry, I'm not convinced :/
> 
> > > Can you elaborate 'the timing requirement for the governors' ? I'm
> > > missing the point
> > 
> > 
> > The point is to avoid contention on the device update path.
> > Governor that use differential equations on temperature over time
> > will be very time sensitive. Step wise, power allocator, or any
> > PID will be very sensitive to time. So, If userspace is hitting
> > this API too often we can see cases where the updates needed to
> > service userspace may defer/delay the execution of the governor
> > logic.
> 
> AFAIK, reading a temperature value is usually between less than 1us and
> 10us (depending on the sampling configuration in the driver).
> 
> I've done some measurements to read a temperature through sysfs and
> netlink. It is between 2us and 7us on a platforms where reading a 900ns
> latency sensor, sysfs being faster.
> 
> The time sensitive governor is the power allocator and usually, the
> sampling period is between 100ms and 250ms.
> 
> The thermal zones with fast thermal transitions may need faster sampling
> period but the hardware offloads the mitigation in this case by sampling
> every 100*us*, IIRC.
> 
> Given that, I'm not sure we are facing a design issue with thermal
> framework.

The numbers above works well for memory IO devices, like a on DIE bandgap.
In that case, I agree, the concern is less of a factual issue.
The numbers becomes completely different if you have a monitor
based on a I2C that shares a bus with multiple other devices :-)
and you have multiple userspace requests coming in.

In that case there are more than one contention. One is the simple
fact that it is going to contend on thermal zone thread and userspace access.
The other contention is the I2C bus lock itself. All of that adds up.


> 
> Do you have a use case with some measurements to spot an issue or is it
> a potential issue you identified ?


yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
and needs to update the zone every 100ms. Each read in this bus, if done alone
would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
well technically 9, but rounding for the sake of the example, which gets you
50 / 100KHz = 500 us. That is for a single read. You add one single extra 
userspace read triggering an unused device update, that is already a 1ms drift.
Basically you looking at 0.5% for each extra userspace read competing in this
sysfs node. You add extra devices in the same I2C bus, your governor is looking
at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
I did not even include the lock overhead considered for this CPU ;-)

Again, this is not about controlling the DIE temperature of the CPU you
are running the thermal subsystem. This is about controlling
a target device.

> 
> > Despite that, there is really no point to have more updates than
> > what was configured for the thermal zone to support. Say that
> > we configure a thermal zone to update itself every 500ms, yet
> > userspace keeps sending reads every 100ms, we do not need necessarily
> > to do a trip to the device every single time to update the temperature,
> > as per the design for the thermal zone.
> 
> Sorry, I do not agree. The thermal zone is configured with a monitoring
> sampling period to detect trip point crossing and it can be configured
> without sampling period at all, just basing on trip point crossing
> interrupt.

That is right, one can derive a working policy solely based on
IRQ. But not true for all cases is it? The stepwise and power allocator
will need to have intermediate checks before actual trip crossing. 
Specially for the allocator case, which is more time bound. Situation
becomes even more time sensitive when the cooling effect of your
cooling device is on the hundreds of milliseconds to seconds, like a fan,
not on nano or microseconds, like throttling.

> 
> The userspace has the right to read the current temperature it is
> interested in. For instance, the 'thermometer' in
> tools/thermal/thermometer may want to read the temperature at a high
> rate in order to profile the thermal zones with/without the mitigation
> kicking in.
> 
> Caching the values just break the current behavior.

I see your concern here. Yeah, let's make sure we dont break this tool.

Does the tool need to have the thermal zone in kernel mode?

The thermal zone will still behave the same way while reading
/temp after this patch if the thermal zone is set to non kernel mode.


Alternatively we can make the caching behavior non-default/optional too. 

> 
> > > > Cc: "Rafael J. Wysocki" <rafael@kernel.org> (supporter:THERMAL)
> > > > Cc: Daniel Lezcano <daniel.lezcano@linaro.org> (supporter:THERMAL)
> > > > Cc: Amit Kucheria <amitk@kernel.org> (reviewer:THERMAL)
> > > > Cc: Zhang Rui <rui.zhang@intel.com> (reviewer:THERMAL)
> > > > Cc: linux-pm@vger.kernel.org (open list:THERMAL)
> > > > Cc: linux-kernel@vger.kernel.org (open list)
> > > > 
> > > > Signed-off-by: Eduardo Valentin <eduval@amazon.com>
> > > > ---
> > > >    drivers/thermal/thermal_sysfs.c | 21 ++++++++++++++++-----
> > > >    1 file changed, 16 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/drivers/thermal/thermal_sysfs.c b/drivers/thermal/thermal_sysfs.c
> > > > index b6daea2398da..a240c58d9e08 100644
> > > > --- a/drivers/thermal/thermal_sysfs.c
> > > > +++ b/drivers/thermal/thermal_sysfs.c
> > > > @@ -35,12 +35,23 @@ static ssize_t
> > > >    temp_show(struct device *dev, struct device_attribute *attr, char *buf)
> > > >    {
> > > >        struct thermal_zone_device *tz = to_thermal_zone(dev);
> > > > -     int temperature, ret;
> > > > -
> > > > -     ret = thermal_zone_get_temp(tz, &temperature);
> > > > +     int temperature;
> > > > 
> > > > -     if (ret)
> > > > -             return ret;
> > > > +     /*
> > > > +      * don't force new update from external reads
> > > > +      * This way we avoid messing up with time constraints.
> > > > +      */
> > > > +     if (tz->mode == THERMAL_DEVICE_DISABLED) {
> > > > +             int r;
> > > > +
> > > > +             r = thermal_zone_get_temp(tz, &temperature); /* holds tz->lock*/
> > > > +             if (r)
> > > > +                     return r;
> > > > +     } else {
> > > > +             mutex_lock(&tz->lock);
> > > > +             temperature = tz->temperature;
> > > > +             mutex_unlock(&tz->lock);
> > > > +     }
> > > 
> > > No please, we are pushing since several weeks a lot of changes to
> > > encapsulate the thermal zone device structure and prevent external core
> > > components to use the internals directly. Even if we can consider the
> > > thermal_sysfs as part of the core code, that changes is not sysfs related.
> > 
> > Can you clarify your concern, is it the direct access ? The lock ?
> > what is the concern?
> > 
> > What is your suggestion here? Do you want me to write a helper
> > function that gets tz->temperature without doing a ops->get_temp()?
> 
> The concern is the thermal framework code is not really in a good shape
> and the internals leaked around in the different drivers all around the
> subsystems, that led to drivers tampering with the thermal zone device
> structure data (including taking locks).
> 
> There are ongoing efforts to do some house cleaning around the thermal
> zone device structure encapsulation as well as getting ride of the
> different ops get_trip* by replacing that with a generic trip point
> structure to deal with in the thermal core code.
> 
> Those with the final objective to have the trip points ordered and
> handle the trip point crossing correctly detected (it is currently
> somehow broken).
> 
> Here the code does directly access tz->*.
> 
> Even not being convinced by the change proposal, I would have handle
> this value cache in the thermal_zone_get_temp() function directly in
> order to preserve the code self-encapsulation.

Ok. So, is your proposal to simply get rid of the direct access? So long the
code is kept self encapsulated, this concern would be addressed?

> 
> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
> 

-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-07 16:38   ` Eduardo Valentin
  2023-06-07 18:23     ` Daniel Lezcano
@ 2023-06-10 17:24     ` Russell Haley
  1 sibling, 0 replies; 27+ messages in thread
From: Russell Haley @ 2023-06-10 17:24 UTC (permalink / raw)
  To: Eduardo Valentin, Daniel Lezcano
  Cc: eduval, rafael, linux-pm, linux-kernel, Amit Kucheria, Zhang Rui

On 6/7/23 11:38, Eduardo Valentin wrote:
>> Can you elaborate 'the timing requirement for the governors' ? I'm
>> missing the point
> 
> 
> The point is to avoid contention on the device update path.
> Governor that use differential equations on temperature over time
> will be very time sensitive. Step wise, power allocator, or any
> PID will be very sensitive to time. So, If userspace is hitting
> this API too often we can see cases where the updates needed to
> service userspace may defer/delay the execution of the governor
> logic.
> 
> Despite that, there is really no point to have more updates than
> what was configured for the thermal zone to support. Say that
> we configure a thermal zone to update itself every 500ms, yet
> userspace keeps sending reads every 100ms, we do not need necessarily
> to do a trip to the device every single time to update the temperature,
> as per the design for the thermal zone.

A userspace governor might *also* use PID or filter multiple samples
taken at high rate. I specifically switched my python fan control script
from the Intel coretemp hwmon to the x86_pkg_tmp thermal zone because of
the coretemp driver's annoying 1-second caching behavior.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-08 17:44       ` Eduardo Valentin
@ 2023-06-12  8:17         ` Daniel Lezcano
  2023-06-21  5:06           ` Eduardo Valentin
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Lezcano @ 2023-06-12  8:17 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: eduval, rafael, linux-pm, linux-kernel, Amit Kucheria, Zhang Rui


Hi Eduardo,

On 08/06/2023 19:44, Eduardo Valentin wrote:

[ ... ]

>> Do you have a use case with some measurements to spot an issue or is it
>> a potential issue you identified ?
> 
> 
> yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
> and needs to update the zone every 100ms. Each read in this bus, if done alone
> would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
> well technically 9, but rounding for the sake of the example, which gets you
> 50 / 100KHz = 500 us. That is for a single read. You add one single extra
> userspace read triggering an unused device update, that is already a 1ms drift.
> Basically you looking at 0.5% for each extra userspace read competing in this
> sysfs node. You add extra devices in the same I2C bus, your governor is looking
> at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
> I did not even include the lock overhead considered for this CPU ;-)
> 
> Again, this is not about controlling the DIE temperature of the CPU you
> are running the thermal subsystem. This is about controlling
> a target device.

Ok. The target device is on a bus which is slow and prone to contention.

This hardware is not designed to be monitored with a high precision, so 
reading the temperature at a high rate does not really make sense.

Moreover (putting apart a potential contention), the delayed read does 
not change the time interval, which remains the same from the governor 
point of view.

In addition, i2c sensors are usually handled in the hwmon subsystem 
which are registered in the thermal framework from there. Those have 
most of their 'read' callback with a cached value in a jiffies based way 
eg. [1].

So the feature already exists for slow devices and are handled in the 
drivers directly via the hwmon subsystem.

 From my POV, the feature is not needed in the thermal framework.



[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/tree/drivers/hwmon/lm95234.c#n163



-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-12  8:17         ` Daniel Lezcano
@ 2023-06-21  5:06           ` Eduardo Valentin
  2023-06-21 11:43             ` Daniel Lezcano
  0 siblings, 1 reply; 27+ messages in thread
From: Eduardo Valentin @ 2023-06-21  5:06 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Eduardo Valentin, eduval, rafael, linux-pm, linux-kernel,
	Amit Kucheria, Zhang Rui

On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote:
> 
> 
> 
> Hi Eduardo,
> 
> On 08/06/2023 19:44, Eduardo Valentin wrote:
> 
> [ ... ]
> 
> > > Do you have a use case with some measurements to spot an issue or is it
> > > a potential issue you identified ?
> > 
> > 
> > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
> > and needs to update the zone every 100ms. Each read in this bus, if done alone
> > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
> > well technically 9, but rounding for the sake of the example, which gets you
> > 50 / 100KHz = 500 us. That is for a single read. You add one single extra
> > userspace read triggering an unused device update, that is already a 1ms drift.
> > Basically you looking at 0.5% for each extra userspace read competing in this
> > sysfs node. You add extra devices in the same I2C bus, your governor is looking
> > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
> > I did not even include the lock overhead considered for this CPU ;-)
> > 
> > Again, this is not about controlling the DIE temperature of the CPU you
> > are running the thermal subsystem. This is about controlling
> > a target device.
> 
> Ok. The target device is on a bus which is slow and prone to contention.
> 
> This hardware is not designed to be monitored with a high precision, so
> reading the temperature at a high rate does not really make sense.

On the contrary, it needs even more precision and any extra delay adds to
loss on accuracy :-)

> 
> Moreover (putting apart a potential contention), the delayed read does
> not change the time interval, which remains the same from the governor
> point of view.

It does not change the governor update interval and that is a property of
the thermal zone. Correct. And that is the intention of the change.
The actual temperature updates driven by the governor will always
result in a driver call. While a userspace call will not be in the way
of the governor update.

Sysfs reads, However, with the current code as is, it may cause
jittering on the actual execution of the governor throttle function.
 causing the computation of the desired outcome cooling device being skewed.

> 
> In addition, i2c sensors are usually handled in the hwmon subsystem
> which are registered in the thermal framework from there. Those have
> most of their 'read' callback with a cached value in a jiffies based way
> eg. [1].

I guess what you are really saying is: go read the hwmon sysfs node,
or, hwmon solves this for us, which unfortunately is not true for all devices.

> 
> So the feature already exists for slow devices and are handled in the
> drivers directly via the hwmon subsystem.
> 
> From my POV, the feature is not needed in the thermal framework.

The fact that hwmon does it in some way is another evidence of the
actual problem. Telling that this has to be solved by another subsystem
for a sysfs node that is part of thermal subsystem does not really solve
the problem. Also as I mentioned, this is not common on all hwmon
devices, and not all I2C devices are hwmon devices. In fact, I2C
was just one example of a slow device. There are more I can quote
that are not necessarily under the hwmon case.

Not sure if you missed, but an alternative for the difference of
opinion on how this should behave is to have caching for response
of sysfs read of tz/temp  as an option/configuration. Then we let
userspace to choose which behavior it wants.

> 
> 
> 
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/tree/drivers/hwmon/lm95234.c#n163
> 
> 
> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
> 

-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-21  5:06           ` Eduardo Valentin
@ 2023-06-21 11:43             ` Daniel Lezcano
  2023-06-22  4:56               ` Eduardo Valentin
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Lezcano @ 2023-06-21 11:43 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: eduval, rafael, linux-pm, linux-kernel, Amit Kucheria, Zhang Rui

On 21/06/2023 07:06, Eduardo Valentin wrote:
> On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote:
>>
>>
>>
>> Hi Eduardo,
>>
>> On 08/06/2023 19:44, Eduardo Valentin wrote:
>>
>> [ ... ]
>>
>>>> Do you have a use case with some measurements to spot an issue or is it
>>>> a potential issue you identified ?
>>>
>>>
>>> yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
>>> and needs to update the zone every 100ms. Each read in this bus, if done alone
>>> would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
>>> well technically 9, but rounding for the sake of the example, which gets you
>>> 50 / 100KHz = 500 us. That is for a single read. You add one single extra
>>> userspace read triggering an unused device update, that is already a 1ms drift.
>>> Basically you looking at 0.5% for each extra userspace read competing in this
>>> sysfs node. You add extra devices in the same I2C bus, your governor is looking
>>> at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
>>> I did not even include the lock overhead considered for this CPU ;-)
>>>
>>> Again, this is not about controlling the DIE temperature of the CPU you
>>> are running the thermal subsystem. This is about controlling
>>> a target device.
>>
>> Ok. The target device is on a bus which is slow and prone to contention.
>>
>> This hardware is not designed to be monitored with a high precision, so
>> reading the temperature at a high rate does not really make sense.
> 
> On the contrary, it needs even more precision and any extra delay adds to
> loss on accuracy :-)

What I meant is if the hardware designer thought there could be a 
problem with the thermal zone they would have put another kind of 
sensor, not one with a i2c based communication.


>> Moreover (putting apart a potential contention), the delayed read does
>> not change the time interval, which remains the same from the governor
>> point of view.
> 
> It does not change the governor update interval and that is a property of
> the thermal zone. Correct. And that is the intention of the change.
> The actual temperature updates driven by the governor will always
> result in a driver call. While a userspace call will not be in the way
> of the governor update.
> 
> Sysfs reads, However, with the current code as is, it may cause
> jittering on the actual execution of the governor throttle function.
>   causing the computation of the desired outcome cooling device being skewed.
> 
>>
>> In addition, i2c sensors are usually handled in the hwmon subsystem
>> which are registered in the thermal framework from there. Those have
>> most of their 'read' callback with a cached value in a jiffies based way
>> eg. [1].
> 
> I guess what you are really saying is: go read the hwmon sysfs node,
> or, hwmon solves this for us, which unfortunately is not true for all devices.

I meant the i2c sensors are under the hwmon subsystem. This subsystem is 
connected with the thermal framework, so when a hwmon sensor is created, 
it register this sensor as a thermal zone.


>> So the feature already exists for slow devices and are handled in the
>> drivers directly via the hwmon subsystem.
>>
>>  From my POV, the feature is not needed in the thermal framework.
> 
> The fact that hwmon does it in some way is another evidence of the
> actual problem.

Not really, it shows the i2c sensors are in the hwmon subsystems.


> Telling that this has to be solved by another subsystem
> for a sysfs node that is part of thermal subsystem does not really solve
> the problem. Also as I mentioned, this is not common on all hwmon
> devices, and not all I2C devices are hwmon devices. In fact, I2C
> was just one example of a slow device. There are more I can quote
> that are not necessarily under the hwmon case.

Yes, please. Can you give examples with existing drivers in the thermal 
framework and observed issues on specific platforms ? Numbers would help.

> Not sure if you missed, but an alternative for the difference of
> opinion on how this should behave is to have caching for response
> of sysfs read of tz/temp  as an option/configuration. Then we let
> userspace to choose which behavior it wants.

Before that, you have to prove the feature is really needed and show how 
the patches solves an issue. At this point, this is not demonstrated.




-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-21 11:43             ` Daniel Lezcano
@ 2023-06-22  4:56               ` Eduardo Valentin
  2023-06-23 17:31                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 27+ messages in thread
From: Eduardo Valentin @ 2023-06-22  4:56 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Eduardo Valentin, eduval, rafael, linux-pm, linux-kernel,
	Amit Kucheria, Zhang Rui

On Wed, Jun 21, 2023 at 01:43:26PM +0200, Daniel Lezcano wrote:
> 
> 
> 
> On 21/06/2023 07:06, Eduardo Valentin wrote:
> > On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote:
> > > 
> > > 
> > > 
> > > Hi Eduardo,
> > > 
> > > On 08/06/2023 19:44, Eduardo Valentin wrote:
> > > 
> > > [ ... ]
> > > 
> > > > > Do you have a use case with some measurements to spot an issue or is it
> > > > > a potential issue you identified ?
> > > > 
> > > > 
> > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
> > > > and needs to update the zone every 100ms. Each read in this bus, if done alone
> > > > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
> > > > well technically 9, but rounding for the sake of the example, which gets you
> > > > 50 / 100KHz = 500 us. That is for a single read. You add one single extra
> > > > userspace read triggering an unused device update, that is already a 1ms drift.
> > > > Basically you looking at 0.5% for each extra userspace read competing in this
> > > > sysfs node. You add extra devices in the same I2C bus, your governor is looking
> > > > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
> > > > I did not even include the lock overhead considered for this CPU ;-)
> > > > 
> > > > Again, this is not about controlling the DIE temperature of the CPU you
> > > > are running the thermal subsystem. This is about controlling
> > > > a target device.
> > > 
> > > Ok. The target device is on a bus which is slow and prone to contention.
> > > 
> > > This hardware is not designed to be monitored with a high precision, so
> > > reading the temperature at a high rate does not really make sense.
> > 
> > On the contrary, it needs even more precision and any extra delay adds to
> > loss on accuracy :-)
> 
> What I meant is if the hardware designer thought there could be a
> problem with the thermal zone they would have put another kind of
> sensor, not one with a i2c based communication.

No, that is not a problem in the physical thermal zone. Or to cover
for a hardware design flaw. This is an actual typical case. However,
yes, designer must take into account any sort of delays or jittering
in the control system, and typically one would add some thermal margins
on the control system. But to the original point here, we should eliminate
unnecessary jittering or delay in the control system.

> 
> 
> > > Moreover (putting apart a potential contention), the delayed read does
> > > not change the time interval, which remains the same from the governor
> > > point of view.
> > 
> > It does not change the governor update interval and that is a property of
> > the thermal zone. Correct. And that is the intention of the change.
> > The actual temperature updates driven by the governor will always
> > result in a driver call. While a userspace call will not be in the way
> > of the governor update.
> > 
> > Sysfs reads, However, with the current code as is, it may cause
> > jittering on the actual execution of the governor throttle function.
> >   causing the computation of the desired outcome cooling device being skewed.
> > 
> > > 
> > > In addition, i2c sensors are usually handled in the hwmon subsystem
> > > which are registered in the thermal framework from there. Those have
> > > most of their 'read' callback with a cached value in a jiffies based way
> > > eg. [1].
> > 
> > I guess what you are really saying is: go read the hwmon sysfs node,
> > or, hwmon solves this for us, which unfortunately is not true for all devices.
> 
> I meant the i2c sensors are under the hwmon subsystem. This subsystem is
> connected with the thermal framework, so when a hwmon sensor is created,
> it register this sensor as a thermal zone.
> 
> 
> > > So the feature already exists for slow devices and are handled in the
> > > drivers directly via the hwmon subsystem.
> > > 
> > >  From my POV, the feature is not needed in the thermal framework.
> > 
> > The fact that hwmon does it in some way is another evidence of the
> > actual problem.
> 
> Not really, it shows the i2c sensors are in the hwmon subsystems.

They are there too. And hwmon also sees same problem of too frequent
device update. The problem is there regardless of the subsystem we use
to represent the device. Also, I dont buy the argument that this is
a problem of hwmon because it is already supported to plug in
hwmon devices in the thermal subsystem. That is actually the original
design in fact :-). So running a control in the thermal subsystem
on top of inputs from hwmon, which happens to support I2C devices,
is not a problem for hwmon to solve, since the control is in the
thermal subsystem, and hwmon does not offer control solutions.


> 
> 
> > Telling that this has to be solved by another subsystem
> > for a sysfs node that is part of thermal subsystem does not really solve
> > the problem. Also as I mentioned, this is not common on all hwmon
> > devices, and not all I2C devices are hwmon devices. In fact, I2C
> > was just one example of a slow device. There are more I can quote
> > that are not necessarily under the hwmon case.
> 
> Yes, please. Can you give examples with existing drivers in the thermal
> framework and observed issues on specific platforms ? Numbers would help.

I believe I already gave you numbers. And as I explained above,
the driver does not need to sit on thermal subsystem only,
we already support plugging in I2C/pmbus devices on the thermal
subsystem, so basically all drivers on hwmon that has the REGISTER_TZ
flag are actual samples for this problem


-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-22  4:56               ` Eduardo Valentin
@ 2023-06-23 17:31                 ` Rafael J. Wysocki
  2023-06-28 21:10                   ` Eduardo Valentin
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-06-23 17:31 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: Daniel Lezcano, eduval, rafael, linux-pm, linux-kernel,
	Amit Kucheria, Zhang Rui

On Thu, Jun 22, 2023 at 6:56 AM Eduardo Valentin <evalenti@kernel.org> wrote:
>
> On Wed, Jun 21, 2023 at 01:43:26PM +0200, Daniel Lezcano wrote:
> >
> >
> >
> > On 21/06/2023 07:06, Eduardo Valentin wrote:
> > > On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote:
> > > >
> > > >
> > > >
> > > > Hi Eduardo,
> > > >
> > > > On 08/06/2023 19:44, Eduardo Valentin wrote:
> > > >
> > > > [ ... ]
> > > >
> > > > > > Do you have a use case with some measurements to spot an issue or is it
> > > > > > a potential issue you identified ?
> > > > >
> > > > >
> > > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
> > > > > and needs to update the zone every 100ms. Each read in this bus, if done alone
> > > > > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
> > > > > well technically 9, but rounding for the sake of the example, which gets you
> > > > > 50 / 100KHz = 500 us. That is for a single read. You add one single extra
> > > > > userspace read triggering an unused device update, that is already a 1ms drift.
> > > > > Basically you looking at 0.5% for each extra userspace read competing in this
> > > > > sysfs node. You add extra devices in the same I2C bus, your governor is looking
> > > > > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
> > > > > I did not even include the lock overhead considered for this CPU ;-)
> > > > >
> > > > > Again, this is not about controlling the DIE temperature of the CPU you
> > > > > are running the thermal subsystem. This is about controlling
> > > > > a target device.
> > > >
> > > > Ok. The target device is on a bus which is slow and prone to contention.
> > > >
> > > > This hardware is not designed to be monitored with a high precision, so
> > > > reading the temperature at a high rate does not really make sense.
> > >
> > > On the contrary, it needs even more precision and any extra delay adds to
> > > loss on accuracy :-)
> >
> > What I meant is if the hardware designer thought there could be a
> > problem with the thermal zone they would have put another kind of
> > sensor, not one with a i2c based communication.
>
> No, that is not a problem in the physical thermal zone. Or to cover
> for a hardware design flaw. This is an actual typical case. However,
> yes, designer must take into account any sort of delays or jittering
> in the control system, and typically one would add some thermal margins
> on the control system. But to the original point here, we should eliminate
> unnecessary jittering or delay in the control system.
>
> >
> >
> > > > Moreover (putting apart a potential contention), the delayed read does
> > > > not change the time interval, which remains the same from the governor
> > > > point of view.
> > >
> > > It does not change the governor update interval and that is a property of
> > > the thermal zone. Correct. And that is the intention of the change.
> > > The actual temperature updates driven by the governor will always
> > > result in a driver call. While a userspace call will not be in the way
> > > of the governor update.
> > >
> > > Sysfs reads, However, with the current code as is, it may cause
> > > jittering on the actual execution of the governor throttle function.
> > >   causing the computation of the desired outcome cooling device being skewed.
> > >
> > > >
> > > > In addition, i2c sensors are usually handled in the hwmon subsystem
> > > > which are registered in the thermal framework from there. Those have
> > > > most of their 'read' callback with a cached value in a jiffies based way
> > > > eg. [1].
> > >
> > > I guess what you are really saying is: go read the hwmon sysfs node,
> > > or, hwmon solves this for us, which unfortunately is not true for all devices.
> >
> > I meant the i2c sensors are under the hwmon subsystem. This subsystem is
> > connected with the thermal framework, so when a hwmon sensor is created,
> > it register this sensor as a thermal zone.
> >
> >
> > > > So the feature already exists for slow devices and are handled in the
> > > > drivers directly via the hwmon subsystem.
> > > >
> > > >  From my POV, the feature is not needed in the thermal framework.
> > >
> > > The fact that hwmon does it in some way is another evidence of the
> > > actual problem.
> >
> > Not really, it shows the i2c sensors are in the hwmon subsystems.
>
> They are there too. And hwmon also sees same problem of too frequent
> device update. The problem is there regardless of the subsystem we use
> to represent the device. Also, I dont buy the argument that this is
> a problem of hwmon because it is already supported to plug in
> hwmon devices in the thermal subsystem. That is actually the original
> design in fact :-). So running a control in the thermal subsystem
> on top of inputs from hwmon, which happens to support I2C devices,
> is not a problem for hwmon to solve, since the control is in the
> thermal subsystem, and hwmon does not offer control solutions.

Regardless of where the problem is etc, if my understanding of the
patch is correct, it is proposing to change the behavior of a
well-known sysfs interface in a way that is likely to be incompatible
with at least some of its users.  This is an obvious no-go in kernel
development and I would expect you to be well aware of it.

IOW, if you want the cached value to be returned, a new interface (eg.
a new sysfs attribute) is needed.

And I think that the use case is not really i2c sensors in general,
because at least in some cases they work just fine AFAICS, but a
platform with a control loop running in the kernel that depends on
sensor reads carried out at a specific, approximately constant, rate
that can be disturbed by user space checking the sensor temperature
via sysfs at "wrong" times.  If at the same time the user space
program in question doesn't care about the most recent values reported
by the sensor, it may very well use the values cached by the in-kernel
control loop.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-23 17:31                 ` Rafael J. Wysocki
@ 2023-06-28 21:10                   ` Eduardo Valentin
  2023-06-30  8:16                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 27+ messages in thread
From: Eduardo Valentin @ 2023-06-28 21:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Eduardo Valentin, Daniel Lezcano, eduval, linux-pm, linux-kernel,
	Amit Kucheria, Zhang Rui

On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote:
> 
> 
> 
> On Thu, Jun 22, 2023 at 6:56 AM Eduardo Valentin <evalenti@kernel.org> wrote:
> >
> > On Wed, Jun 21, 2023 at 01:43:26PM +0200, Daniel Lezcano wrote:
> > >
> > >
> > >
> > > On 21/06/2023 07:06, Eduardo Valentin wrote:
> > > > On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote:
> > > > >
> > > > >
> > > > >
> > > > > Hi Eduardo,
> > > > >
> > > > > On 08/06/2023 19:44, Eduardo Valentin wrote:
> > > > >
> > > > > [ ... ]
> > > > >
> > > > > > > Do you have a use case with some measurements to spot an issue or is it
> > > > > > > a potential issue you identified ?
> > > > > >
> > > > > >
> > > > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz)
> > > > > > and needs to update the zone every 100ms. Each read in this bus, if done alone
> > > > > > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte,
> > > > > > well technically 9, but rounding for the sake of the example, which gets you
> > > > > > 50 / 100KHz = 500 us. That is for a single read. You add one single extra
> > > > > > userspace read triggering an unused device update, that is already a 1ms drift.
> > > > > > Basically you looking at 0.5% for each extra userspace read competing in this
> > > > > > sysfs node. You add extra devices in the same I2C bus, your governor is looking
> > > > > > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz.
> > > > > > I did not even include the lock overhead considered for this CPU ;-)
> > > > > >
> > > > > > Again, this is not about controlling the DIE temperature of the CPU you
> > > > > > are running the thermal subsystem. This is about controlling
> > > > > > a target device.
> > > > >
> > > > > Ok. The target device is on a bus which is slow and prone to contention.
> > > > >
> > > > > This hardware is not designed to be monitored with a high precision, so
> > > > > reading the temperature at a high rate does not really make sense.
> > > >
> > > > On the contrary, it needs even more precision and any extra delay adds to
> > > > loss on accuracy :-)
> > >
> > > What I meant is if the hardware designer thought there could be a
> > > problem with the thermal zone they would have put another kind of
> > > sensor, not one with a i2c based communication.
> >
> > No, that is not a problem in the physical thermal zone. Or to cover
> > for a hardware design flaw. This is an actual typical case. However,
> > yes, designer must take into account any sort of delays or jittering
> > in the control system, and typically one would add some thermal margins
> > on the control system. But to the original point here, we should eliminate
> > unnecessary jittering or delay in the control system.
> >
> > >
> > >
> > > > > Moreover (putting apart a potential contention), the delayed read does
> > > > > not change the time interval, which remains the same from the governor
> > > > > point of view.
> > > >
> > > > It does not change the governor update interval and that is a property of
> > > > the thermal zone. Correct. And that is the intention of the change.
> > > > The actual temperature updates driven by the governor will always
> > > > result in a driver call. While a userspace call will not be in the way
> > > > of the governor update.
> > > >
> > > > Sysfs reads, However, with the current code as is, it may cause
> > > > jittering on the actual execution of the governor throttle function.
> > > >   causing the computation of the desired outcome cooling device being skewed.
> > > >
> > > > >
> > > > > In addition, i2c sensors are usually handled in the hwmon subsystem
> > > > > which are registered in the thermal framework from there. Those have
> > > > > most of their 'read' callback with a cached value in a jiffies based way
> > > > > eg. [1].
> > > >
> > > > I guess what you are really saying is: go read the hwmon sysfs node,
> > > > or, hwmon solves this for us, which unfortunately is not true for all devices.
> > >
> > > I meant the i2c sensors are under the hwmon subsystem. This subsystem is
> > > connected with the thermal framework, so when a hwmon sensor is created,
> > > it register this sensor as a thermal zone.
> > >
> > >
> > > > > So the feature already exists for slow devices and are handled in the
> > > > > drivers directly via the hwmon subsystem.
> > > > >
> > > > >  From my POV, the feature is not needed in the thermal framework.
> > > >
> > > > The fact that hwmon does it in some way is another evidence of the
> > > > actual problem.
> > >
> > > Not really, it shows the i2c sensors are in the hwmon subsystems.
> >
> > They are there too. And hwmon also sees same problem of too frequent
> > device update. The problem is there regardless of the subsystem we use
> > to represent the device. Also, I dont buy the argument that this is
> > a problem of hwmon because it is already supported to plug in
> > hwmon devices in the thermal subsystem. That is actually the original
> > design in fact :-). So running a control in the thermal subsystem
> > on top of inputs from hwmon, which happens to support I2C devices,
> > is not a problem for hwmon to solve, since the control is in the
> > thermal subsystem, and hwmon does not offer control solutions.
> 
> Regardless of where the problem is etc, if my understanding of the
> patch is correct, it is proposing to change the behavior of a
> well-known sysfs interface in a way that is likely to be incompatible
> with at least some of its users.  This is an obvious no-go in kernel
> development and I would expect you to be well aware of it.

yeah I get it.

> 
> IOW, if you want the cached value to be returned, a new interface (eg.
> a new sysfs attribute) is needed.

Yeah, I am fine with either a new sysfs entry to return the cached value,
or a new sysfs entry to change the behavior of the existing /temp, as I
mentioned before, either way works for me, if changing the existing one
is too intrusive.

> 
> And I think that the use case is not really i2c sensors in general,

I2C was just the factual example I had, but you are right, the use case
is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C,
the actual issue just happen to be easier to see when I2C devices, slower
than typical MMIO devices, are being used as input for the control.

> because at least in some cases they work just fine AFAICS, but a
> platform with a control loop running in the kernel that depends on
> sensor reads carried out at a specific, approximately constant, rate
> that can be disturbed by user space checking the sensor temperature
> via sysfs at "wrong" times.  If at the same time the user space
> program in question doesn't care about the most recent values reported
> by the sensor, it may very well use the values cached by the in-kernel
> control loop.

That is right, the balance between supporting user space reads and
running the control timely is the actual original concern. The problem
fades out a bit when you have device reads in the us / ns time scale
and control update is in 100s of ms. But becomes more apparent on slower
devices, when reads are in ms and policy update is in the 100s ms, that is
why the I2C case was quoted. But nothing wrong with I2C, or supporting
I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ,
the problem is on having to support the control in kernel and a read in
userspace that can create jitter to the control.

And as you properly stated, for this use case, the userspace does not care
about the most recent value of the device, so that is why the change
proposes to give cached values.

On the flip side though, there may be user space based policies that
require the most recent device value. But in this case, the expectation
is to disable the in kernel policy and switch the thermal zone to
mode == disabled. And that is also why this patch will go the path
to request the most recent device value when the /temp sysfs entry
is read and the mode is disabled.

I would suggest to have an addition sysfs entry that sets the
thermal zone into cached mode or not, let's say for the sake of the
discussion, we call it 'cached_values', with default to 'not cached'.
This way, we could support:

a) Default, current situation, where all reads in /temp are always backed up
with an actual device .get_temp(). Nothing changes here, keeps reading
under /temp, and so long system designer is satisfied with jittering,
no need to change anything.

b) When one has control in kernel, and frequent userspace reads on /temp
but system designer wants to protect the control in kernel to avoid jittering.
Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'.
Then userspace will get updated values as frequent as the kernel control has
and the kernel control will not be disturbed by frequent device reads.

c) When one has control in userspace, and wants to have the most frequent
device read. Here, one can simply disable the in kernel control by
setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set
to 'not cached'. Well in fact, the way I thought this originally in this patch
was to simply always read the device when /temp is read is 'mode' is 'disabled'.

I believe you proposed to have another sysfs entry  sysfs entry for reading cached temperature.
Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work.
The only possible downside is to have two entries to read temperature.

Strong opinions on any of the above?

-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-28 21:10                   ` Eduardo Valentin
@ 2023-06-30  8:16                     ` Rafael J. Wysocki
  2023-06-30 10:11                       ` Daniel Lezcano
  2023-07-01  1:37                       ` Eduardo Valentin
  0 siblings, 2 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-06-30  8:16 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: Rafael J. Wysocki, Daniel Lezcano, eduval, linux-pm,
	linux-kernel, Amit Kucheria, Zhang Rui

On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
>
> On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote:
> >

[cut]

> >
> > Regardless of where the problem is etc, if my understanding of the
> > patch is correct, it is proposing to change the behavior of a
> > well-known sysfs interface in a way that is likely to be incompatible
> > with at least some of its users.  This is an obvious no-go in kernel
> > development and I would expect you to be well aware of it.
>
> yeah I get it.
>
> >
> > IOW, if you want the cached value to be returned, a new interface (eg.
> > a new sysfs attribute) is needed.
>
> Yeah, I am fine with either a new sysfs entry to return the cached value,
> or a new sysfs entry to change the behavior of the existing /temp, as I
> mentioned before, either way works for me, if changing the existing one
> is too intrusive.
>
> >
> > And I think that the use case is not really i2c sensors in general,
>
> I2C was just the factual example I had, but you are right, the use case
> is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C,
> the actual issue just happen to be easier to see when I2C devices, slower
> than typical MMIO devices, are being used as input for the control.
>
> > because at least in some cases they work just fine AFAICS, but a
> > platform with a control loop running in the kernel that depends on
> > sensor reads carried out at a specific, approximately constant, rate
> > that can be disturbed by user space checking the sensor temperature
> > via sysfs at "wrong" times.  If at the same time the user space
> > program in question doesn't care about the most recent values reported
> > by the sensor, it may very well use the values cached by the in-kernel
> > control loop.
>
> That is right, the balance between supporting user space reads and
> running the control timely is the actual original concern. The problem
> fades out a bit when you have device reads in the us / ns time scale
> and control update is in 100s of ms. But becomes more apparent on slower
> devices, when reads are in ms and policy update is in the 100s ms, that is
> why the I2C case was quoted. But nothing wrong with I2C, or supporting
> I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ,
> the problem is on having to support the control in kernel and a read in
> userspace that can create jitter to the control.
>
> And as you properly stated, for this use case, the userspace does not care
> about the most recent value of the device, so that is why the change
> proposes to give cached values.
>
> On the flip side though, there may be user space based policies that
> require the most recent device value. But in this case, the expectation
> is to disable the in kernel policy and switch the thermal zone to
> mode == disabled. And that is also why this patch will go the path
> to request the most recent device value when the /temp sysfs entry
> is read and the mode is disabled.
>
> I would suggest to have an addition sysfs entry that sets the
> thermal zone into cached mode or not, let's say for the sake of the
> discussion, we call it 'cached_values', with default to 'not cached'.
> This way, we could support:
>
> a) Default, current situation, where all reads in /temp are always backed up
> with an actual device .get_temp(). Nothing changes here, keeps reading
> under /temp, and so long system designer is satisfied with jittering,
> no need to change anything.
>
> b) When one has control in kernel, and frequent userspace reads on /temp
> but system designer wants to protect the control in kernel to avoid jittering.
> Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'.
> Then userspace will get updated values as frequent as the kernel control has
> and the kernel control will not be disturbed by frequent device reads.
>
> c) When one has control in userspace, and wants to have the most frequent
> device read. Here, one can simply disable the in kernel control by
> setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set
> to 'not cached'. Well in fact, the way I thought this originally in this patch
> was to simply always read the device when /temp is read is 'mode' is 'disabled'.
>
> I believe you proposed to have another sysfs entry  sysfs entry for reading cached temperature.
> Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work.
> The only possible downside is to have two entries to read temperature.
>
> Strong opinions on any of the above?

So what about adding a new zone attribute that can be used to specify
the preferred caching time for the temperature?

That is, if the time interval between two consecutive updates of the
cached temperature value is less than the value of the new attribute,
the cached temperature value will be returned by "temp".  Otherwise,
it will cause the sensor to be read and the value obtained from it
will be returned to user space and cached.

If the value of the new attribute is 0, everything will work as it
does now (which will also need to be the default behavior).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-30  8:16                     ` Rafael J. Wysocki
@ 2023-06-30 10:11                       ` Daniel Lezcano
  2023-06-30 10:46                         ` Rafael J. Wysocki
  2023-07-01  1:38                         ` Eduardo Valentin
  2023-07-01  1:37                       ` Eduardo Valentin
  1 sibling, 2 replies; 27+ messages in thread
From: Daniel Lezcano @ 2023-06-30 10:11 UTC (permalink / raw)
  To: Rafael J. Wysocki, Eduardo Valentin
  Cc: eduval, linux-pm, linux-kernel, Amit Kucheria, Zhang Rui


Hi Rafael,

On 30/06/2023 10:16, Rafael J. Wysocki wrote:
> On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:

[ ... ]

> So what about adding a new zone attribute that can be used to specify
> the preferred caching time for the temperature?
> 
> That is, if the time interval between two consecutive updates of the
> cached temperature value is less than the value of the new attribute,
> the cached temperature value will be returned by "temp".  Otherwise,
> it will cause the sensor to be read and the value obtained from it
> will be returned to user space and cached.
> 
> If the value of the new attribute is 0, everything will work as it
> does now (which will also need to be the default behavior).

I'm still not convinced about the feature.

Eduardo provided some numbers but they seem based on the characteristics 
of the I2C, not to a real use case. Eduardo?

Before adding more complexity in the thermal framework and yet another 
sysfs entry, it would be interesting to have an experiment and show the 
impact of both configurations, not from a timing point of view but with 
a temperature mitigation accuracy.

Without a real use case, this feature does make really sense IMO.


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-30 10:11                       ` Daniel Lezcano
@ 2023-06-30 10:46                         ` Rafael J. Wysocki
  2023-06-30 12:09                           ` Daniel Lezcano
  2023-07-01  1:38                         ` Eduardo Valentin
  1 sibling, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-06-30 10:46 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Rafael J. Wysocki, Eduardo Valentin, eduval, linux-pm,
	linux-kernel, Amit Kucheria, Zhang Rui

Hi Daniel,

On Fri, Jun 30, 2023 at 12:11 PM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
>
> Hi Rafael,
>
> On 30/06/2023 10:16, Rafael J. Wysocki wrote:
> > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
>
> [ ... ]
>
> > So what about adding a new zone attribute that can be used to specify
> > the preferred caching time for the temperature?
> >
> > That is, if the time interval between two consecutive updates of the
> > cached temperature value is less than the value of the new attribute,
> > the cached temperature value will be returned by "temp".  Otherwise,
> > it will cause the sensor to be read and the value obtained from it
> > will be returned to user space and cached.
> >
> > If the value of the new attribute is 0, everything will work as it
> > does now (which will also need to be the default behavior).
>
> I'm still not convinced about the feature.
>
> Eduardo provided some numbers but they seem based on the characteristics
> of the I2C, not to a real use case. Eduardo?
>
> Before adding more complexity in the thermal framework and yet another
> sysfs entry, it would be interesting to have an experiment and show the
> impact of both configurations, not from a timing point of view but with
> a temperature mitigation accuracy.
>
> Without a real use case, this feature does make really sense IMO.

I'm kind of unsure why you think that it is not a good idea in general
to have a way to limit the rate of accessing a temperature sensor, for
energy-efficiency reasons if nothing more.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-30 10:46                         ` Rafael J. Wysocki
@ 2023-06-30 12:09                           ` Daniel Lezcano
  2023-07-01  1:49                             ` Eduardo Valentin
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Lezcano @ 2023-06-30 12:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Eduardo Valentin, eduval, linux-pm, linux-kernel, Amit Kucheria,
	Zhang Rui

On 30/06/2023 12:46, Rafael J. Wysocki wrote:
> Hi Daniel,
> 
> On Fri, Jun 30, 2023 at 12:11 PM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>>
>>
>> Hi Rafael,
>>
>> On 30/06/2023 10:16, Rafael J. Wysocki wrote:
>>> On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
>>
>> [ ... ]
>>
>>> So what about adding a new zone attribute that can be used to specify
>>> the preferred caching time for the temperature?
>>>
>>> That is, if the time interval between two consecutive updates of the
>>> cached temperature value is less than the value of the new attribute,
>>> the cached temperature value will be returned by "temp".  Otherwise,
>>> it will cause the sensor to be read and the value obtained from it
>>> will be returned to user space and cached.
>>>
>>> If the value of the new attribute is 0, everything will work as it
>>> does now (which will also need to be the default behavior).
>>
>> I'm still not convinced about the feature.
>>
>> Eduardo provided some numbers but they seem based on the characteristics
>> of the I2C, not to a real use case. Eduardo?
>>
>> Before adding more complexity in the thermal framework and yet another
>> sysfs entry, it would be interesting to have an experiment and show the
>> impact of both configurations, not from a timing point of view but with
>> a temperature mitigation accuracy.
>>
>> Without a real use case, this feature does make really sense IMO.
> 
> I'm kind of unsure why you think that it is not a good idea in general
> to have a way to limit the rate of accessing a temperature sensor, for
> energy-efficiency reasons if nothing more.

I don't think it is not a good idea. I've no judgement with the proposed 
change.

But I'm not convinced it is really useful, that is why having a real use 
case and some numbers showing that feature solves the issue would be nice.

It is illogical we want a fast and accurate response on a specific 
hardware and then design it with slow sensors and contention prone bus.

In Eduardo's example, we have 100ms monitoring rate on a I2C. This rate 
is usually to monitor CPUs with very fast transitions. With a remote 
site, the monitoring rate would be much slower, so if there is a 
contention in the bus because a dumb process is reading constantly the 
temperature, then it should be negligible.

All that are hypothesis, that is why having a real use case would help 
to figure out the temperature limit drift at mitigation time.

Assuming it is really needed, I'm not sure that should be exported via 
sysfs. It is a driver issue and it may register the thermal zone with a 
parameter telling the userspace rate limit.

On the other side, hwmon and thermal are connected. hwmon drivers 
register a thermal zone and thermal drivers add themselves in the hwmon 
sysfs directory. The temperature cache is handled in the driver level in 
the hwmon subsystems and we want to handle the temperature cache at the 
thermal sysfs level. How will we cope with this inconsistency?

As a side note, slow drivers are usually going under drivers/hwmon.

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-30  8:16                     ` Rafael J. Wysocki
  2023-06-30 10:11                       ` Daniel Lezcano
@ 2023-07-01  1:37                       ` Eduardo Valentin
  2023-07-06 13:02                         ` Rafael J. Wysocki
  1 sibling, 1 reply; 27+ messages in thread
From: Eduardo Valentin @ 2023-07-01  1:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Eduardo Valentin, Daniel Lezcano, eduval, linux-pm, linux-kernel,
	Amit Kucheria, Zhang Rui

On Fri, Jun 30, 2023 at 10:16:38AM +0200, Rafael J. Wysocki wrote:
> 
> 
> 
> On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
> >
> > On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote:
> > >
> 
> [cut]
> 
> > >
> > > Regardless of where the problem is etc, if my understanding of the
> > > patch is correct, it is proposing to change the behavior of a
> > > well-known sysfs interface in a way that is likely to be incompatible
> > > with at least some of its users.  This is an obvious no-go in kernel
> > > development and I would expect you to be well aware of it.
> >
> > yeah I get it.
> >
> > >
> > > IOW, if you want the cached value to be returned, a new interface (eg.
> > > a new sysfs attribute) is needed.
> >
> > Yeah, I am fine with either a new sysfs entry to return the cached value,
> > or a new sysfs entry to change the behavior of the existing /temp, as I
> > mentioned before, either way works for me, if changing the existing one
> > is too intrusive.
> >
> > >
> > > And I think that the use case is not really i2c sensors in general,
> >
> > I2C was just the factual example I had, but you are right, the use case
> > is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C,
> > the actual issue just happen to be easier to see when I2C devices, slower
> > than typical MMIO devices, are being used as input for the control.
> >
> > > because at least in some cases they work just fine AFAICS, but a
> > > platform with a control loop running in the kernel that depends on
> > > sensor reads carried out at a specific, approximately constant, rate
> > > that can be disturbed by user space checking the sensor temperature
> > > via sysfs at "wrong" times.  If at the same time the user space
> > > program in question doesn't care about the most recent values reported
> > > by the sensor, it may very well use the values cached by the in-kernel
> > > control loop.
> >
> > That is right, the balance between supporting user space reads and
> > running the control timely is the actual original concern. The problem
> > fades out a bit when you have device reads in the us / ns time scale
> > and control update is in 100s of ms. But becomes more apparent on slower
> > devices, when reads are in ms and policy update is in the 100s ms, that is
> > why the I2C case was quoted. But nothing wrong with I2C, or supporting
> > I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ,
> > the problem is on having to support the control in kernel and a read in
> > userspace that can create jitter to the control.
> >
> > And as you properly stated, for this use case, the userspace does not care
> > about the most recent value of the device, so that is why the change
> > proposes to give cached values.
> >
> > On the flip side though, there may be user space based policies that
> > require the most recent device value. But in this case, the expectation
> > is to disable the in kernel policy and switch the thermal zone to
> > mode == disabled. And that is also why this patch will go the path
> > to request the most recent device value when the /temp sysfs entry
> > is read and the mode is disabled.
> >
> > I would suggest to have an addition sysfs entry that sets the
> > thermal zone into cached mode or not, let's say for the sake of the
> > discussion, we call it 'cached_values', with default to 'not cached'.
> > This way, we could support:
> >
> > a) Default, current situation, where all reads in /temp are always backed up
> > with an actual device .get_temp(). Nothing changes here, keeps reading
> > under /temp, and so long system designer is satisfied with jittering,
> > no need to change anything.
> >
> > b) When one has control in kernel, and frequent userspace reads on /temp
> > but system designer wants to protect the control in kernel to avoid jittering.
> > Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'.
> > Then userspace will get updated values as frequent as the kernel control has
> > and the kernel control will not be disturbed by frequent device reads.
> >
> > c) When one has control in userspace, and wants to have the most frequent
> > device read. Here, one can simply disable the in kernel control by
> > setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set
> > to 'not cached'. Well in fact, the way I thought this originally in this patch
> > was to simply always read the device when /temp is read is 'mode' is 'disabled'.
> >
> > I believe you proposed to have another sysfs entry  sysfs entry for reading cached temperature.
> > Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work.
> > The only possible downside is to have two entries to read temperature.
> >
> > Strong opinions on any of the above?
> 
> So what about adding a new zone attribute that can be used to specify
> the preferred caching time for the temperature?
> 
> That is, if the time interval between two consecutive updates of the
> cached temperature value is less than the value of the new attribute,
> the cached temperature value will be returned by "temp".  Otherwise,
> it will cause the sensor to be read and the value obtained from it
> will be returned to user space and cached.
> 
> If the value of the new attribute is 0, everything will work as it
> does now (which will also need to be the default behavior).


Yeah, that makes sense to me. I can cook something up in the next version.

-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-30 10:11                       ` Daniel Lezcano
  2023-06-30 10:46                         ` Rafael J. Wysocki
@ 2023-07-01  1:38                         ` Eduardo Valentin
  2023-07-01 14:20                           ` Daniel Lezcano
  1 sibling, 1 reply; 27+ messages in thread
From: Eduardo Valentin @ 2023-07-01  1:38 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Rafael J. Wysocki, Eduardo Valentin, eduval, linux-pm,
	linux-kernel, Amit Kucheria, Zhang Rui

Hey Daniel,

On Fri, Jun 30, 2023 at 12:11:25PM +0200, Daniel Lezcano wrote:
> 
> 
> 
> Hi Rafael,
> 
> On 30/06/2023 10:16, Rafael J. Wysocki wrote:
> > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
> 
> [ ... ]
> 
> > So what about adding a new zone attribute that can be used to specify
> > the preferred caching time for the temperature?
> > 
> > That is, if the time interval between two consecutive updates of the
> > cached temperature value is less than the value of the new attribute,
> > the cached temperature value will be returned by "temp".  Otherwise,
> > it will cause the sensor to be read and the value obtained from it
> > will be returned to user space and cached.
> > 
> > If the value of the new attribute is 0, everything will work as it
> > does now (which will also need to be the default behavior).
> 
> I'm still not convinced about the feature.
> 
> Eduardo provided some numbers but they seem based on the characteristics
> of the I2C, not to a real use case. Eduardo?

Why I2C is not a real use case?



-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-06-30 12:09                           ` Daniel Lezcano
@ 2023-07-01  1:49                             ` Eduardo Valentin
  2023-07-01  7:28                               ` Daniel Lezcano
  0 siblings, 1 reply; 27+ messages in thread
From: Eduardo Valentin @ 2023-07-01  1:49 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Rafael J. Wysocki, Eduardo Valentin, eduval, linux-pm,
	linux-kernel, Amit Kucheria, Zhang Rui

Hello,

On Fri, Jun 30, 2023 at 02:09:44PM +0200, Daniel Lezcano wrote:
> 
> 
> 
> On 30/06/2023 12:46, Rafael J. Wysocki wrote:
> > Hi Daniel,
> > 
> > On Fri, Jun 30, 2023 at 12:11 PM Daniel Lezcano
> > <daniel.lezcano@linaro.org> wrote:
> > > 
> > > 
> > > Hi Rafael,
> > > 
> > > On 30/06/2023 10:16, Rafael J. Wysocki wrote:
> > > > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
> > > 
> > > [ ... ]
> > > 
> > > > So what about adding a new zone attribute that can be used to specify
> > > > the preferred caching time for the temperature?
> > > > 
> > > > That is, if the time interval between two consecutive updates of the
> > > > cached temperature value is less than the value of the new attribute,
> > > > the cached temperature value will be returned by "temp".  Otherwise,
> > > > it will cause the sensor to be read and the value obtained from it
> > > > will be returned to user space and cached.
> > > > 
> > > > If the value of the new attribute is 0, everything will work as it
> > > > does now (which will also need to be the default behavior).
> > > 
> > > I'm still not convinced about the feature.
> > > 
> > > Eduardo provided some numbers but they seem based on the characteristics
> > > of the I2C, not to a real use case. Eduardo?
> > > 
> > > Before adding more complexity in the thermal framework and yet another
> > > sysfs entry, it would be interesting to have an experiment and show the
> > > impact of both configurations, not from a timing point of view but with
> > > a temperature mitigation accuracy.
> > > 
> > > Without a real use case, this feature does make really sense IMO.
> > 
> > I'm kind of unsure why you think that it is not a good idea in general
> > to have a way to limit the rate of accessing a temperature sensor, for
> > energy-efficiency reasons if nothing more.
> 
> I don't think it is not a good idea. I've no judgement with the proposed
> change.
> 
> But I'm not convinced it is really useful, that is why having a real use
> case and some numbers showing that feature solves the issue would be nice.
> 
> It is illogical we want a fast and accurate response on a specific
> hardware and then design it with slow sensors and contention prone bus.

Totally agree, but at same time, this is real world :-)

> 
> In Eduardo's example, we have 100ms monitoring rate on a I2C. This rate
> is usually to monitor CPUs with very fast transitions. With a remote
> site, the monitoring rate would be much slower, so if there is a
> contention in the bus because a dumb process is reading constantly the
> temperature, then it should be negligible.
> 
> All that are hypothesis, that is why having a real use case would help
> to figure out the temperature limit drift at mitigation time.

Yeah, I guess the problem here is that you are assuming I2C is not a real
use case, not sure why. But it is and very common design in fact.

> 
> Assuming it is really needed, I'm not sure that should be exported via
> sysfs. It is a driver issue and it may register the thermal zone with a
> parameter telling the userspace rate limit.
> 
> On the other side, hwmon and thermal are connected. hwmon drivers
> register a thermal zone and thermal drivers add themselves in the hwmon
> sysfs directory. The temperature cache is handled in the driver level in
> the hwmon subsystems and we want to handle the temperature cache at the
> thermal sysfs level. How will we cope with this inconsistency?

Yeah, I do not see this, again, as where to handle cache type of design problem only.
This is really a protective / defensive code on the thermal core to avoid
userspace interfering on a kernel based control.


I agree that drivers may be free to go and defend themselves against
too frequent userspace requests, like they do, as you already shared
a link in another email. But saying that it is up to the driver to do this
is basically saying that the thermal subsystem do not care about their
own threads being delayed by a too frequent reads on a sysfs entry
created by the thermal subsystem, just because it is drivers responsability
to cache. To that is a missing defensive code. 

> 
> As a side note, slow drivers are usually going under drivers/hwmon.

Have you seen this code?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850


I also do not understand when you say slow drivers are usually going under
drivers/hwmon, does it really matter? One can design a thermal zone
that is connected to a hwmon device as input. Why would that be illogical?


> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
> 

-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-07-01  1:49                             ` Eduardo Valentin
@ 2023-07-01  7:28                               ` Daniel Lezcano
  2023-07-05 22:49                                 ` Eduardo Valentin
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Lezcano @ 2023-07-01  7:28 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: Rafael J. Wysocki, eduval, linux-pm, linux-kernel, Amit Kucheria,
	Zhang Rui

Eduardo,

On 01/07/2023 03:49, Eduardo Valentin wrote:

[ ... ]

>> All that are hypothesis, that is why having a real use case would help
>> to figure out the temperature limit drift at mitigation time.
> 
> Yeah, I guess the problem here is that you are assuming I2C is not a real
> use case, not sure why. But it is and very common design in fact.

If it is so common you should be able to reproduce the issue and give 
numbers. At this point, what I read is "that may happen because I2C is 
slow and we may monitor it at an insane rate, so let's cache the value".

>> Assuming it is really needed, I'm not sure that should be exported via
>> sysfs. It is a driver issue and it may register the thermal zone with a
>> parameter telling the userspace rate limit.
>>
>> On the other side, hwmon and thermal are connected. hwmon drivers
>> register a thermal zone and thermal drivers add themselves in the hwmon
>> sysfs directory. The temperature cache is handled in the driver level in
>> the hwmon subsystems and we want to handle the temperature cache at the
>> thermal sysfs level. How will we cope with this inconsistency?
> 
> Yeah, I do not see this, again, as where to handle cache type of design problem only.
> This is really a protective / defensive code on the thermal core to avoid
> userspace interfering on a kernel based control.
> 
> 
> I agree that drivers may be free to go and defend themselves against
> too frequent userspace requests, like they do, as you already shared
> a link in another email. But saying that it is up to the driver to do this
> is basically saying that the thermal subsystem do not care about their
> own threads being delayed by a too frequent reads on a sysfs entry
> created by the thermal subsystem, just because it is drivers responsability
> to cache. To that is a missing defensive code.

No, the core code has not to be defensive against bad hardware design.

If multiple processes are reading in an infinite loop the temperature, 
they will constantly take the lock, and as the monitoring thread is a 
CFS task, this one will be considered as the readers and be delayed, 
with probably a mitigation temperature drift. Here we have a missing 
defensive / optimized code against a DoS but it is unrelated to the 
hardware and the fix is not caching the value.

>> As a side note, slow drivers are usually going under drivers/hwmon.
> 
> Have you seen this code?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850

Yes, and ?

That is what I said, the hwmon and the thermal zone are connected.

> I also do not understand when you say slow drivers are usually going under
> drivers/hwmon, does it really matter? One can design a thermal zone
> that is connected to a hwmon device as input. Why would that be illogical?

I'm not saying it is illogical. I'm pointing the slow drivers are under 
hwmon subsystems and usually don't aim in-kernel mitigation. The 
get_temp ops is going through hwmon and the drivers may cache the 
values. So *if* there is an in-kernel mitigation, the value will be 
already cached usually.

I do believe I raised some valid concerns regarding the approach. Could 
please take them into account instead of eluding them?

1. A real use case with temperature limit drift (easy to reproduce 
because it is very common)

2. How about the consistency between hwmon and thermal? (one driver but 
two ways to access the temperature - one may cache and the other not)

Another question regarding the I2C example, if another subsystem is 
using the I2C, won't it take the bus lock and create the contention 
also? I mean it is possible to create a mitigation drift by reading 
constantly another sensor value (eg. voltage or whatever) ?

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-07-01  1:38                         ` Eduardo Valentin
@ 2023-07-01 14:20                           ` Daniel Lezcano
  0 siblings, 0 replies; 27+ messages in thread
From: Daniel Lezcano @ 2023-07-01 14:20 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: Rafael J. Wysocki, eduval, linux-pm, linux-kernel, Amit Kucheria,
	Zhang Rui


Hi Eduardo,

On 01/07/2023 03:38, Eduardo Valentin wrote:
> Hey Daniel,
> 
> On Fri, Jun 30, 2023 at 12:11:25PM +0200, Daniel Lezcano wrote:
>>
>>
>>
>> Hi Rafael,
>>
>> On 30/06/2023 10:16, Rafael J. Wysocki wrote:
>>> On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
>>
>> [ ... ]
>>
>>> So what about adding a new zone attribute that can be used to specify
>>> the preferred caching time for the temperature?
>>>
>>> That is, if the time interval between two consecutive updates of the
>>> cached temperature value is less than the value of the new attribute,
>>> the cached temperature value will be returned by "temp".  Otherwise,
>>> it will cause the sensor to be read and the value obtained from it
>>> will be returned to user space and cached.
>>>
>>> If the value of the new attribute is 0, everything will work as it
>>> does now (which will also need to be the default behavior).
>>
>> I'm still not convinced about the feature.
>>
>> Eduardo provided some numbers but they seem based on the characteristics
>> of the I2C, not to a real use case. Eduardo?
> 
> Why I2C is not a real use case?

What I meant is "I2C is slow, ok. But what is the setup where the 
problem arises?"


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-07-01  7:28                               ` Daniel Lezcano
@ 2023-07-05 22:49                                 ` Eduardo Valentin
  2023-07-06 13:22                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 27+ messages in thread
From: Eduardo Valentin @ 2023-07-05 22:49 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Eduardo Valentin, Rafael J. Wysocki, eduval, linux-pm,
	linux-kernel, Amit Kucheria, Zhang Rui

On Sat, Jul 01, 2023 at 09:28:31AM +0200, Daniel Lezcano wrote:
> 
> 
> 
> Eduardo,
> 
> On 01/07/2023 03:49, Eduardo Valentin wrote:
> 
> [ ... ]
> 
> > > All that are hypothesis, that is why having a real use case would help
> > > to figure out the temperature limit drift at mitigation time.
> > 
> > Yeah, I guess the problem here is that you are assuming I2C is not a real
> > use case, not sure why. But it is and very common design in fact.
> 
> If it is so common you should be able to reproduce the issue and give
> numbers. At this point, what I read is "that may happen because I2C is
> slow and we may monitor it at an insane rate, so let's cache the value".
> 
> > > Assuming it is really needed, I'm not sure that should be exported via
> > > sysfs. It is a driver issue and it may register the thermal zone with a
> > > parameter telling the userspace rate limit.
> > > 
> > > On the other side, hwmon and thermal are connected. hwmon drivers
> > > register a thermal zone and thermal drivers add themselves in the hwmon
> > > sysfs directory. The temperature cache is handled in the driver level in
> > > the hwmon subsystems and we want to handle the temperature cache at the
> > > thermal sysfs level. How will we cope with this inconsistency?
> > 
> > Yeah, I do not see this, again, as where to handle cache type of design problem only.
> > This is really a protective / defensive code on the thermal core to avoid
> > userspace interfering on a kernel based control.
> > 
> > 
> > I agree that drivers may be free to go and defend themselves against
> > too frequent userspace requests, like they do, as you already shared
> > a link in another email. But saying that it is up to the driver to do this
> > is basically saying that the thermal subsystem do not care about their
> > own threads being delayed by a too frequent reads on a sysfs entry
> > created by the thermal subsystem, just because it is drivers responsability
> > to cache. To that is a missing defensive code.
> 
> No, the core code has not to be defensive against bad hardware design.

I do not understand why you are calling this a bad hardware design.

> 
> If multiple processes are reading in an infinite loop the temperature,
> they will constantly take the lock, and as the monitoring thread is a
> CFS task, this one will be considered as the readers and be delayed,
> with probably a mitigation temperature drift. Here we have a missing
> defensive / optimized code against a DoS but it is unrelated to the
> hardware and the fix is not caching the value.

I am not even going into the CFS discussion, yet. But if we go that direction,
here we are having a case of a userspace process contending into
a in kernel process, regardless of the userspace process priority.

But again that is not the main point of discussion for this change.

> 
> > > As a side note, slow drivers are usually going under drivers/hwmon.
> > 
> > Have you seen this code?
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850
> 
> Yes, and ?

I mean , I am sure can git grep for all occurences if that is what are looking for.
> 
> That is what I said, the hwmon and the thermal zone are connected.
> 
> > I also do not understand when you say slow drivers are usually going under
> > drivers/hwmon, does it really matter? One can design a thermal zone
> > that is connected to a hwmon device as input. Why would that be illogical?
> 
> I'm not saying it is illogical. I'm pointing the slow drivers are under
> hwmon subsystems and usually don't aim in-kernel mitigation. The
> get_temp ops is going through hwmon and the drivers may cache the
> values. So *if* there is an in-kernel mitigation, the value will be
> already cached usually.

Oh I see. yes, I totally give you that the most common case is to have
the in kernel policy written with drivers under thermal, but neglecting
the existence of drivers under /hwmon that connects into the thermal subsystem
is not fair, is it? They are there, they connect to thermal subsystem
and one can certainly have an in kernel control with hwmon drivers as input,
I am not understanding why protecting against those cases make the change
invalid or overcomplicates the subsystem design.

> 
> I do believe I raised some valid concerns regarding the approach. Could
> please take them into account instead of eluding them
> 
> 1. A real use case with temperature limit drift (easy to reproduce
> because it is very common)
> 
> 2. How about the consistency between hwmon and thermal? (one driver but
> two ways to access the temperature - one may cache and the other not)

I am not eluding anything. But this conversation seams to not move forward
because of the assumption that building in kernel control based on
hwmon drivers is either wrong or uncommon.  I fail to see that as the
main argument to be discussed mainly because it is a problem that exists.
Driver count is also not a meaningful metric to conclude if the problem
is common or not. How many motherboard can you think that are out there
that may have an lm75 to represent an actual temperature point in the
PCB? Or an lm75 that may come from a PCI board? That is what I meant
by common design.

Seams to me that you are trying to make a point that this problem
is not worth having a fix for, even if you already recognized the
basic point of it, because (a) this is not happening MMIO based
drivers which is the (today) common case for drivers under /thermal
and (b) hwmon drivers are supposed to feed in only control in userspace.

for both argument that seams to restrict the thermal subsystem
to only MMIO base devices drivers to feed into CPU temperature control,
which is a limited application of it in real world scenario, for
any real life scenario really, mobile. BMC, or any embedded case. 
When in fact, the abstraction and desing on the thermal subsystem today
is pure control of temperature and can be used in any application
that does it. Limiting the application of the thermal subsystem to
only MMIO based devices is, well, limiting :-).

> 
> Another question regarding the I2C example, if another subsystem is
> using the I2C, won't it take the bus lock and create the contention
> also? I mean it is possible to create a mitigation drift by reading
> constantly another sensor value (eg. voltage or whatever) ?

Yes, if a flood of events in the bus happen, then the drift will also happen,
specially if reads are aligned with the in kernel monitoring thread
update / call to .get_temp().

> 
> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
> 

-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-07-01  1:37                       ` Eduardo Valentin
@ 2023-07-06 13:02                         ` Rafael J. Wysocki
  0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-06 13:02 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: Rafael J. Wysocki, Daniel Lezcano, eduval, linux-pm,
	linux-kernel, Amit Kucheria, Zhang Rui

On Sat, Jul 1, 2023 at 3:37 AM Eduardo Valentin <evalenti@kernel.org> wrote:
>
> On Fri, Jun 30, 2023 at 10:16:38AM +0200, Rafael J. Wysocki wrote:
> >
> >
> >
> > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote:
> > >
> > > On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote:
> > > >
> >
> > [cut]
> >
> > > >
> > > > Regardless of where the problem is etc, if my understanding of the
> > > > patch is correct, it is proposing to change the behavior of a
> > > > well-known sysfs interface in a way that is likely to be incompatible
> > > > with at least some of its users.  This is an obvious no-go in kernel
> > > > development and I would expect you to be well aware of it.
> > >
> > > yeah I get it.
> > >
> > > >
> > > > IOW, if you want the cached value to be returned, a new interface (eg.
> > > > a new sysfs attribute) is needed.
> > >
> > > Yeah, I am fine with either a new sysfs entry to return the cached value,
> > > or a new sysfs entry to change the behavior of the existing /temp, as I
> > > mentioned before, either way works for me, if changing the existing one
> > > is too intrusive.
> > >
> > > >
> > > > And I think that the use case is not really i2c sensors in general,
> > >
> > > I2C was just the factual example I had, but you are right, the use case
> > > is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C,
> > > the actual issue just happen to be easier to see when I2C devices, slower
> > > than typical MMIO devices, are being used as input for the control.
> > >
> > > > because at least in some cases they work just fine AFAICS, but a
> > > > platform with a control loop running in the kernel that depends on
> > > > sensor reads carried out at a specific, approximately constant, rate
> > > > that can be disturbed by user space checking the sensor temperature
> > > > via sysfs at "wrong" times.  If at the same time the user space
> > > > program in question doesn't care about the most recent values reported
> > > > by the sensor, it may very well use the values cached by the in-kernel
> > > > control loop.
> > >
> > > That is right, the balance between supporting user space reads and
> > > running the control timely is the actual original concern. The problem
> > > fades out a bit when you have device reads in the us / ns time scale
> > > and control update is in 100s of ms. But becomes more apparent on slower
> > > devices, when reads are in ms and policy update is in the 100s ms, that is
> > > why the I2C case was quoted. But nothing wrong with I2C, or supporting
> > > I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ,
> > > the problem is on having to support the control in kernel and a read in
> > > userspace that can create jitter to the control.
> > >
> > > And as you properly stated, for this use case, the userspace does not care
> > > about the most recent value of the device, so that is why the change
> > > proposes to give cached values.
> > >
> > > On the flip side though, there may be user space based policies that
> > > require the most recent device value. But in this case, the expectation
> > > is to disable the in kernel policy and switch the thermal zone to
> > > mode == disabled. And that is also why this patch will go the path
> > > to request the most recent device value when the /temp sysfs entry
> > > is read and the mode is disabled.
> > >
> > > I would suggest to have an addition sysfs entry that sets the
> > > thermal zone into cached mode or not, let's say for the sake of the
> > > discussion, we call it 'cached_values', with default to 'not cached'.
> > > This way, we could support:
> > >
> > > a) Default, current situation, where all reads in /temp are always backed up
> > > with an actual device .get_temp(). Nothing changes here, keeps reading
> > > under /temp, and so long system designer is satisfied with jittering,
> > > no need to change anything.
> > >
> > > b) When one has control in kernel, and frequent userspace reads on /temp
> > > but system designer wants to protect the control in kernel to avoid jittering.
> > > Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'.
> > > Then userspace will get updated values as frequent as the kernel control has
> > > and the kernel control will not be disturbed by frequent device reads.
> > >
> > > c) When one has control in userspace, and wants to have the most frequent
> > > device read. Here, one can simply disable the in kernel control by
> > > setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set
> > > to 'not cached'. Well in fact, the way I thought this originally in this patch
> > > was to simply always read the device when /temp is read is 'mode' is 'disabled'.
> > >
> > > I believe you proposed to have another sysfs entry  sysfs entry for reading cached temperature.
> > > Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work.
> > > The only possible downside is to have two entries to read temperature.
> > >
> > > Strong opinions on any of the above?
> >
> > So what about adding a new zone attribute that can be used to specify
> > the preferred caching time for the temperature?
> >
> > That is, if the time interval between two consecutive updates of the
> > cached temperature value is less than the value of the new attribute,
> > the cached temperature value will be returned by "temp".  Otherwise,
> > it will cause the sensor to be read and the value obtained from it
> > will be returned to user space and cached.
> >
> > If the value of the new attribute is 0, everything will work as it
> > does now (which will also need to be the default behavior).
>
> Yeah, that makes sense to me. I can cook something up in the next version.

Yes, please.

Also I think that the $subject patch was inspired by observations made
on a specific system in practice.  It would be good to say what the
system is and include some numbers illustrating how severe the problem
is (that is, what is expected and what is observed and why the
discrepancy is attributed to the effect of direct sensor accesses from
user space via sysfs).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-07-05 22:49                                 ` Eduardo Valentin
@ 2023-07-06 13:22                                   ` Rafael J. Wysocki
  2023-07-07 17:14                                     ` Eduardo Valentin
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-06 13:22 UTC (permalink / raw)
  To: Eduardo Valentin, Daniel Lezcano
  Cc: Rafael J. Wysocki, eduval, linux-pm, linux-kernel, Amit Kucheria,
	Zhang Rui

On Thu, Jul 6, 2023 at 12:50 AM Eduardo Valentin <evalenti@kernel.org> wrote:
>
> On Sat, Jul 01, 2023 at 09:28:31AM +0200, Daniel Lezcano wrote:
> >
> >
> >
> > Eduardo,
> >
> > On 01/07/2023 03:49, Eduardo Valentin wrote:
> >
> > [ ... ]
> >
> > > > All that are hypothesis, that is why having a real use case would help
> > > > to figure out the temperature limit drift at mitigation time.
> > >
> > > Yeah, I guess the problem here is that you are assuming I2C is not a real
> > > use case, not sure why. But it is and very common design in fact.
> >
> > If it is so common you should be able to reproduce the issue and give
> > numbers. At this point, what I read is "that may happen because I2C is
> > slow and we may monitor it at an insane rate, so let's cache the value".
> >
> > > > Assuming it is really needed, I'm not sure that should be exported via
> > > > sysfs. It is a driver issue and it may register the thermal zone with a
> > > > parameter telling the userspace rate limit.
> > > >
> > > > On the other side, hwmon and thermal are connected. hwmon drivers
> > > > register a thermal zone and thermal drivers add themselves in the hwmon
> > > > sysfs directory. The temperature cache is handled in the driver level in
> > > > the hwmon subsystems and we want to handle the temperature cache at the
> > > > thermal sysfs level. How will we cope with this inconsistency?
> > >
> > > Yeah, I do not see this, again, as where to handle cache type of design problem only.
> > > This is really a protective / defensive code on the thermal core to avoid
> > > userspace interfering on a kernel based control.
> > >
> > >
> > > I agree that drivers may be free to go and defend themselves against
> > > too frequent userspace requests, like they do, as you already shared
> > > a link in another email. But saying that it is up to the driver to do this
> > > is basically saying that the thermal subsystem do not care about their
> > > own threads being delayed by a too frequent reads on a sysfs entry
> > > created by the thermal subsystem, just because it is drivers responsability
> > > to cache. To that is a missing defensive code.
> >
> > No, the core code has not to be defensive against bad hardware design.
>
> I do not understand why you are calling this a bad hardware design.
>
> >
> > If multiple processes are reading in an infinite loop the temperature,
> > they will constantly take the lock, and as the monitoring thread is a
> > CFS task, this one will be considered as the readers and be delayed,
> > with probably a mitigation temperature drift. Here we have a missing
> > defensive / optimized code against a DoS but it is unrelated to the
> > hardware and the fix is not caching the value.
>
> I am not even going into the CFS discussion, yet. But if we go that direction,
> here we are having a case of a userspace process contending into
> a in kernel process, regardless of the userspace process priority.
>
> But again that is not the main point of discussion for this change.
>
> >
> > > > As a side note, slow drivers are usually going under drivers/hwmon.
> > >
> > > Have you seen this code?
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850
> >
> > Yes, and ?
>
>
> I mean , I am sure can git grep for all occurences if that is what are looking for.
> >
> > That is what I said, the hwmon and the thermal zone are connected.
> >
> > > I also do not understand when you say slow drivers are usually going under
> > > drivers/hwmon, does it really matter? One can design a thermal zone
> > > that is connected to a hwmon device as input. Why would that be illogical?
> >
> > I'm not saying it is illogical. I'm pointing the slow drivers are under
> > hwmon subsystems and usually don't aim in-kernel mitigation. The
> > get_temp ops is going through hwmon and the drivers may cache the
> > values. So *if* there is an in-kernel mitigation, the value will be
> > already cached usually.
>
> Oh I see. yes, I totally give you that the most common case is to have
> the in kernel policy written with drivers under thermal, but neglecting
> the existence of drivers under /hwmon that connects into the thermal subsystem
> is not fair, is it? They are there, they connect to thermal subsystem
> and one can certainly have an in kernel control with hwmon drivers as input,
> I am not understanding why protecting against those cases make the change
> invalid or overcomplicates the subsystem design.
>
> >
> > I do believe I raised some valid concerns regarding the approach. Could
> > please take them into account instead of eluding them
> >
> > 1. A real use case with temperature limit drift (easy to reproduce
> > because it is very common)
> >
> > 2. How about the consistency between hwmon and thermal? (one driver but
> > two ways to access the temperature - one may cache and the other not)
>
> I am not eluding anything. But this conversation seams to not move forward
> because of the assumption that building in kernel control based on
> hwmon drivers is either wrong or uncommon.  I fail to see that as the
> main argument to be discussed mainly because it is a problem that exists.
> Driver count is also not a meaningful metric to conclude if the problem
> is common or not. How many motherboard can you think that are out there
> that may have an lm75 to represent an actual temperature point in the
> PCB? Or an lm75 that may come from a PCI board? That is what I meant
> by common design.
>
> Seams to me that you are trying to make a point that this problem
> is not worth having a fix for, even if you already recognized the
> basic point of it, because (a) this is not happening MMIO based
> drivers which is the (today) common case for drivers under /thermal
> and (b) hwmon drivers are supposed to feed in only control in userspace.
>
> for both argument that seams to restrict the thermal subsystem
> to only MMIO base devices drivers to feed into CPU temperature control,
> which is a limited application of it in real world scenario, for
> any real life scenario really, mobile. BMC, or any embedded case.
> When in fact, the abstraction and desing on the thermal subsystem today
> is pure control of temperature and can be used in any application
> that does it. Limiting the application of the thermal subsystem to
> only MMIO based devices is, well, limiting :-).
>
> >
> > Another question regarding the I2C example, if another subsystem is
> > using the I2C, won't it take the bus lock and create the contention
> > also? I mean it is possible to create a mitigation drift by reading
> > constantly another sensor value (eg. voltage or whatever) ?
>
> Yes, if a flood of events in the bus happen, then the drift will also happen,
> specially if reads are aligned with the in kernel monitoring thread
> update / call to .get_temp().

This is all in general terms, though.

I think that Daniel has been asking for a specific example, because if
a new sysfs i/f is added with a certain issue in mind and then it
turns out to be never used by anyone, because the issue is theoretical
or not severe enough for anyone to care, it is a pure maintenance
burden.  It would be good to have at least some assurance that this
will not be the case.

Daniel, even if you think that the hardware in question is
misdesigned, Linux has a long record of supporting misdesigned
hardware, so this wouldn't be the first time ever.  However, I do
agree that it would be good to have a specific example supporting this
case.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs
  2023-07-06 13:22                                   ` Rafael J. Wysocki
@ 2023-07-07 17:14                                     ` Eduardo Valentin
  0 siblings, 0 replies; 27+ messages in thread
From: Eduardo Valentin @ 2023-07-07 17:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Eduardo Valentin, Daniel Lezcano, eduval, linux-pm, linux-kernel,
	Amit Kucheria, Zhang Rui

On Thu, Jul 06, 2023 at 03:22:55PM +0200, Rafael J. Wysocki wrote:
> 
> 
> 
> On Thu, Jul 6, 2023 at 12:50 AM Eduardo Valentin <evalenti@kernel.org> wrote:
> >
> > On Sat, Jul 01, 2023 at 09:28:31AM +0200, Daniel Lezcano wrote:
> > >
> > >
> > >
> > > Eduardo,
> > >
> > > On 01/07/2023 03:49, Eduardo Valentin wrote:
> > >
> > > [ ... ]
> > >
> > > > > All that are hypothesis, that is why having a real use case would help
> > > > > to figure out the temperature limit drift at mitigation time.
> > > >
> > > > Yeah, I guess the problem here is that you are assuming I2C is not a real
> > > > use case, not sure why. But it is and very common design in fact.
> > >
> > > If it is so common you should be able to reproduce the issue and give
> > > numbers. At this point, what I read is "that may happen because I2C is
> > > slow and we may monitor it at an insane rate, so let's cache the value".
> > >
> > > > > Assuming it is really needed, I'm not sure that should be exported via
> > > > > sysfs. It is a driver issue and it may register the thermal zone with a
> > > > > parameter telling the userspace rate limit.
> > > > >
> > > > > On the other side, hwmon and thermal are connected. hwmon drivers
> > > > > register a thermal zone and thermal drivers add themselves in the hwmon
> > > > > sysfs directory. The temperature cache is handled in the driver level in
> > > > > the hwmon subsystems and we want to handle the temperature cache at the
> > > > > thermal sysfs level. How will we cope with this inconsistency?
> > > >
> > > > Yeah, I do not see this, again, as where to handle cache type of design problem only.
> > > > This is really a protective / defensive code on the thermal core to avoid
> > > > userspace interfering on a kernel based control.
> > > >
> > > >
> > > > I agree that drivers may be free to go and defend themselves against
> > > > too frequent userspace requests, like they do, as you already shared
> > > > a link in another email. But saying that it is up to the driver to do this
> > > > is basically saying that the thermal subsystem do not care about their
> > > > own threads being delayed by a too frequent reads on a sysfs entry
> > > > created by the thermal subsystem, just because it is drivers responsability
> > > > to cache. To that is a missing defensive code.
> > >
> > > No, the core code has not to be defensive against bad hardware design.
> >
> > I do not understand why you are calling this a bad hardware design.
> >
> > >
> > > If multiple processes are reading in an infinite loop the temperature,
> > > they will constantly take the lock, and as the monitoring thread is a
> > > CFS task, this one will be considered as the readers and be delayed,
> > > with probably a mitigation temperature drift. Here we have a missing
> > > defensive / optimized code against a DoS but it is unrelated to the
> > > hardware and the fix is not caching the value.
> >
> > I am not even going into the CFS discussion, yet. But if we go that direction,
> > here we are having a case of a userspace process contending into
> > a in kernel process, regardless of the userspace process priority.
> >
> > But again that is not the main point of discussion for this change.
> >
> > >
> > > > > As a side note, slow drivers are usually going under drivers/hwmon.
> > > >
> > > > Have you seen this code?
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850
> > >
> > > Yes, and ?
> >
> >
> > I mean , I am sure can git grep for all occurences if that is what are looking for.
> > >
> > > That is what I said, the hwmon and the thermal zone are connected.
> > >
> > > > I also do not understand when you say slow drivers are usually going under
> > > > drivers/hwmon, does it really matter? One can design a thermal zone
> > > > that is connected to a hwmon device as input. Why would that be illogical?
> > >
> > > I'm not saying it is illogical. I'm pointing the slow drivers are under
> > > hwmon subsystems and usually don't aim in-kernel mitigation. The
> > > get_temp ops is going through hwmon and the drivers may cache the
> > > values. So *if* there is an in-kernel mitigation, the value will be
> > > already cached usually.
> >
> > Oh I see. yes, I totally give you that the most common case is to have
> > the in kernel policy written with drivers under thermal, but neglecting
> > the existence of drivers under /hwmon that connects into the thermal subsystem
> > is not fair, is it? They are there, they connect to thermal subsystem
> > and one can certainly have an in kernel control with hwmon drivers as input,
> > I am not understanding why protecting against those cases make the change
> > invalid or overcomplicates the subsystem design.
> >
> > >
> > > I do believe I raised some valid concerns regarding the approach. Could
> > > please take them into account instead of eluding them
> > >
> > > 1. A real use case with temperature limit drift (easy to reproduce
> > > because it is very common)
> > >
> > > 2. How about the consistency between hwmon and thermal? (one driver but
> > > two ways to access the temperature - one may cache and the other not)
> >
> > I am not eluding anything. But this conversation seams to not move forward
> > because of the assumption that building in kernel control based on
> > hwmon drivers is either wrong or uncommon.  I fail to see that as the
> > main argument to be discussed mainly because it is a problem that exists.
> > Driver count is also not a meaningful metric to conclude if the problem
> > is common or not. How many motherboard can you think that are out there
> > that may have an lm75 to represent an actual temperature point in the
> > PCB? Or an lm75 that may come from a PCI board? That is what I meant
> > by common design.
> >
> > Seams to me that you are trying to make a point that this problem
> > is not worth having a fix for, even if you already recognized the
> > basic point of it, because (a) this is not happening MMIO based
> > drivers which is the (today) common case for drivers under /thermal
> > and (b) hwmon drivers are supposed to feed in only control in userspace.
> >
> > for both argument that seams to restrict the thermal subsystem
> > to only MMIO base devices drivers to feed into CPU temperature control,
> > which is a limited application of it in real world scenario, for
> > any real life scenario really, mobile. BMC, or any embedded case.
> > When in fact, the abstraction and desing on the thermal subsystem today
> > is pure control of temperature and can be used in any application
> > that does it. Limiting the application of the thermal subsystem to
> > only MMIO based devices is, well, limiting :-).
> >
> > >
> > > Another question regarding the I2C example, if another subsystem is
> > > using the I2C, won't it take the bus lock and create the contention
> > > also? I mean it is possible to create a mitigation drift by reading
> > > constantly another sensor value (eg. voltage or whatever) ?
> >
> > Yes, if a flood of events in the bus happen, then the drift will also happen,
> > specially if reads are aligned with the in kernel monitoring thread
> > update / call to .get_temp().
> 
> This is all in general terms, though.
> 
> I think that Daniel has been asking for a specific example, because if
> a new sysfs i/f is added with a certain issue in mind and then it
> turns out to be never used by anyone, because the issue is theoretical
> or not severe enough for anyone to care, it is a pure maintenance
> burden.  It would be good to have at least some assurance that this
> will not be the case.
> 
> Daniel, even if you think that the hardware in question is
> misdesigned, Linux has a long record of supporting misdesigned
> hardware, so this wouldn't be the first time ever.  However, I do
> agree that it would be good to have a specific example supporting this
> case.


I will add some examples in the next version with the numbers I quoted in this thread.

Also, generally speaking, this series uses the thermal subsystem mostly
for board monitoring use cases, which includes applications like BMC,
and that has generated some churn because the current set of existing
drivers mainly targets in-kernel control for the running system CPU temperature monitoring.
However, even since time of thermal device tree binding conception, this limitation
was never part of the original thoughts on enhancing the thermal subsystem from
ACPI only to a more general solution.

For this reason, next iteration I will start with enhancing documentation.
I took a brief look in the existing master tree and plenty of
the existing documentation references to board monitoring have been lost
in reshuffle of thermal documentation files, including thermal device tree
bindings, which is a shame. I will recover those examples with refreshed examples.
Thermal doc has been not the prettiest anyways so it is always good to spend
some time enhancing it.

-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-07-07 17:14 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-07  0:37 [PATCH 1/1] thermal: sysfs: avoid actual readings from sysfs Eduardo Valentin
2023-06-07  6:32 ` Zhang, Rui
2023-06-07 16:28   ` Eduardo Valentin
2023-06-07  9:24 ` Daniel Lezcano
2023-06-07 16:38   ` Eduardo Valentin
2023-06-07 18:23     ` Daniel Lezcano
2023-06-08 17:44       ` Eduardo Valentin
2023-06-12  8:17         ` Daniel Lezcano
2023-06-21  5:06           ` Eduardo Valentin
2023-06-21 11:43             ` Daniel Lezcano
2023-06-22  4:56               ` Eduardo Valentin
2023-06-23 17:31                 ` Rafael J. Wysocki
2023-06-28 21:10                   ` Eduardo Valentin
2023-06-30  8:16                     ` Rafael J. Wysocki
2023-06-30 10:11                       ` Daniel Lezcano
2023-06-30 10:46                         ` Rafael J. Wysocki
2023-06-30 12:09                           ` Daniel Lezcano
2023-07-01  1:49                             ` Eduardo Valentin
2023-07-01  7:28                               ` Daniel Lezcano
2023-07-05 22:49                                 ` Eduardo Valentin
2023-07-06 13:22                                   ` Rafael J. Wysocki
2023-07-07 17:14                                     ` Eduardo Valentin
2023-07-01  1:38                         ` Eduardo Valentin
2023-07-01 14:20                           ` Daniel Lezcano
2023-07-01  1:37                       ` Eduardo Valentin
2023-07-06 13:02                         ` Rafael J. Wysocki
2023-06-10 17:24     ` Russell Haley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).