linux-pm.vger.kernel.org archive mirror
* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
       [not found]     ` <faaf027c-e01c-6801-9a0c-ab7e0ba669a1@linaro.org>
@ 2018-02-26  4:30       ` Viresh Kumar
  2018-03-13 19:15         ` Daniel Lezcano
  2018-04-04  8:50         ` Daniel Lezcano
  2018-03-27  3:43       ` Leo Yan
  1 sibling, 2 replies; 22+ messages in thread
From: Viresh Kumar @ 2018-02-26  4:30 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: edubezval, kevin.wangtao, leo.yan, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm

On 23-02-18, 12:28, Daniel Lezcano wrote:
> On 23/02/2018 08:34, Viresh Kumar wrote:
> > On 21-02-18, 16:29, Daniel Lezcano wrote:
> >> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> >> index 5c219dc..9340216 100644
> >> --- a/drivers/thermal/cpu_cooling.c
> >> +++ b/drivers/thermal/cpu_cooling.c
> >> @@ -10,18 +10,32 @@
> >>   *		Viresh Kumar <viresh.kumar@linaro.org>
> >>   *
> >>   */
> >> +#undef DEBUG
> > 
> > Why is this required ?
> 
> It is usually added, so if you set the -DDEBUG flag when compiling, you
> don't get all the pr_debug traces for all files, but just the ones
> where you commented out the #undef above. pr_debug is a no-op otherwise.

Yeah, but this is a mess as you need to go edit the files before
enabling debug with it. Everyone prefers the dynamic debug thing now,
where we don't need such stuff. Just drop it.

> >> +#define pr_fmt(fmt) "CPU cooling: " fmt
> > 
> > I think you can use the dev_***() routines instead, as you can
> > directly get the CPU device from anywhere.
> 
> Can we postpone this change for later? The whole file is using pr_*
> (cpufreq_cooling included). There is only one place where dev_err is
> used, but it is removed by patch 3/7.

okay.

> >> +	while (1) {
> >> +		s64 next_wakeup;
> >> +
> >> +		prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);
> >> +
> >> +		schedule();
> >> +
> >> +		atomic_inc(&idle_cdev->count);
> >> +
> >> +		play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
> >> +
> >> +		/*
> >> +		 * The last CPU waking up is in charge of setting the
> >> +		 * timer. If the CPU is hotplugged, the timer will
> >> +		 * move to another CPU (which may not belong to the
> >> +		 * same cluster) but that is not a problem as the
> >> +		 * timer will be set again by another CPU belonging to
> >> +		 * the cluster, so this mechanism is self adaptive and
> >> +		 * does not require any hotplugging dance.
> >> +		 */
> > 
> > Well this depends on how CPU hotplug really happens. What happens to
> > the per-cpu-tasks which are in the middle of something when hotplug
> > happens?  Does hotplug wait for those per-cpu-tasks to finish ?

Missed this one ?

> >> +int cpuidle_cooling_register(void)
> >> +{
> >> +	struct cpuidle_cooling_device *idle_cdev = NULL;
> >> +	struct thermal_cooling_device *cdev;
> >> +	struct cpuidle_cooling_tsk *cct;
> >> +	struct task_struct *tsk;
> >> +	struct device_node *np;
> >> +	cpumask_t *cpumask;
> >> +	char dev_name[THERMAL_NAME_LENGTH];
> >> +	int ret = -ENOMEM, cpu;
> >> +	int index = 0;
> >> +
> >> +	for_each_possible_cpu(cpu) {
> >> +		cpumask = topology_core_cpumask(cpu);
> >> +
> >> +		cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
> >> +
> >> +		/*
> >> +		 * This condition makes the first cpu belonging to the
> >> +		 * cluster to create a cooling device and allocates
> >> +		 * the structure. Others CPUs belonging to the same
> >> +		 * cluster will just increment the refcount on the
> >> +		 * cooling device structure and initialize it.
> >> +		 */
> >> +		if (cpu == cpumask_first(cpumask)) {
> > 
> > Your function still has a few assumptions about cpu numbering and it will
> > break in a few cases. What if the CPUs on a big.LITTLE system (4x4) are
> > present in this order: B L L L L B B B  ??
> > 
> > This configuration can happen if CPUs in DT are marked as: 0-3 LITTLE,
> > 4-7 big and a big CPU is used by the boot loader to bring up Linux.
> 
> Ok, how can I sort it out ?

I would do something like this:

        cpumask_copy(possible, cpu_possible_mask);
        
        while (!cpumask_empty(possible)) {
                first = cpumask_first(possible);
                cpumask = topology_core_cpumask(first);
                cpumask_andnot(possible, possible, cpumask);
        
                allocate_cooling_dev(first); //This is most of this function in your patch.
        
                //Iterate over the cluster mask itself; don't clear bits
                //in it, it points at the topology core sibling mask.
                for_each_cpu(temp, cpumask) {
                        //rest init "temp"
                }
        
                //Everything done, register cooling device for cpumask.
        }

> >> +			np = of_cpu_device_node_get(cpu);
> >> +
> >> +			idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
> >> +			if (!idle_cdev)
> >> +				goto out_fail;
> >> +
> >> +			idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
> >> +
> >> +			atomic_set(&idle_cdev->count, 0);
> > 
> > This should already be 0, isn't it ?
> 
> Yes.

I read it as, "I will drop it" :)

-- 
viresh


* Re: [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation
       [not found] ` <1519226968-19821-6-git-send-email-daniel.lezcano@linaro.org>
@ 2018-03-06 23:19   ` Pavel Machek
  2018-03-07 11:42     ` Daniel Lezcano
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Machek @ 2018-03-06 23:19 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: edubezval, kevin.wangtao, leo.yan, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm,
	Jonathan Corbet, open list:DOCUMENTATION


Hi!

> --- /dev/null
> +++ b/Documentation/thermal/cpu-idle-cooling.txt
> @@ -0,0 +1,165 @@
> +
> +Situation:
> +----------
> +

Can we have some real header here? Also if this is .rst, maybe it
should be marked so?

> +Under certain circumstances, the SoC reaches a temperature exceeding
> +the allocated power budget or the maximum temperature limit. The

I don't understand. Power budget is in W, temperature is in
kelvin. Temperature can't exceed power budget AFAICT.

> +former must be mitigated to stabilize the SoC temperature around the
> +temperature control using the defined cooling devices, the latter

later?

> +catastrophic situation where a radical decision must be taken to
> +reduce the temperature under the critical threshold, that can impact
> +the performances.

performance.

> +Another situation is when the silicon reaches a certain temperature
> +which continues to increase even if the dynamic leakage is reduced to
> +its minimum by clock gating the component. The runaway phenomena will
> +continue with the static leakage and only powering down the component,
> +thus dropping the dynamic and static leakage will allow the component
> +to cool down. This situation is critical.

Critical here, critical there. I have trouble following
it. Theoretically hardware should protect itself, because you don't
want kernel bug to damage your CPU?

> +Last but not least, the system can ask for a specific power budget but
> +because of the OPP density, we can only choose an OPP with a power
> +budget lower than the requested one and underuse the CPU, thus losing
> +performances. In other words, one OPP under uses the CPU with a

performance.

> +lesser than the power budget and the next OPP exceed the power budget,
> +an intermediate OPP could have been used if it were present.

was.

> +Solutions:
> +----------
> +
> +If we can remove the static and the dynamic leakage for a specific
> +duration in a controlled period, the SoC temperature will
> +decrease. Acting at the idle state duration or the idle cycle

"should" decrease? If you are in bad environment..

> +The Operating Performance Point (OPP) density has a great influence on
> +the control precision of cpufreq, however different vendors have a
> +plethora of OPP density, and some have large power gap between OPPs,
> +that will result in loss of performance during thermal control and
> +loss of power in other scenes.

scene seems to be wrong word here.

> +At a specific OPP, we can assume injecting idle cycle on all CPUs,

Extra comma?

> +Idle Injection:
> +---------------
> +
> +The base concept of the idle injection is to force the CPU to go to an
> +idle state for a specified time each control cycle, it provides
> +another way to control CPU power and heat in addition to
> +cpufreq. Ideally, if all CPUs of a cluster inject idle synchronously,
> +this cluster can get into the deepest idle state and achieve minimum
> +power consumption, but that will also increase system response latency
> +if we inject less than cpuidle latency.

I don't understand last sentence.

> +The mitigation begins with a maximum period value which decrease

decreases?
  
> +more cooling effect is requested. When the period duration is equal
> to
> +the idle duration, then we are in a situation the platform can’t
> +dissipate the heat enough and the mitigation fails. In this case

fast enough?

> +situation is considered critical and there is nothing to do. The idle

Nothing to do? Maybe power the system down?

> +The idle injection duration value must comply with the constraints:
> +
> +- It is lesser or equal to the latency we tolerate when the mitigation

less ... than the latency

> +Minimum period
> +--------------
> +
> +The idle injection duration being fixed, it is obvious the minimum
> +period can’t be lesser than that, otherwise we will be scheduling the

less.

> +Practically, if the running power is lesses than the targeted power,

less.

> +However, in this demonstration we ignore three aspects:
> +
> + * The static leakage is not defined here, we can introduce it in the
> +   equation but assuming it will be zero most of the time as it is

, but?

Best regards,
								Pavel



* Re: [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation
  2018-03-06 23:19   ` [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation Pavel Machek
@ 2018-03-07 11:42     ` Daniel Lezcano
  2018-03-08  8:59       ` Pavel Machek
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Lezcano @ 2018-03-07 11:42 UTC (permalink / raw)
  To: Pavel Machek
  Cc: edubezval, kevin.wangtao, leo.yan, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm,
	Jonathan Corbet, open list:DOCUMENTATION

On 07/03/2018 00:19, Pavel Machek wrote:
> Hi!

Hi Pavel,

thanks for taking the time to review the documentation.

>> --- /dev/null
>> +++ b/Documentation/thermal/cpu-idle-cooling.txt
>> @@ -0,0 +1,165 @@
>> +
>> +Situation:
>> +----------
>> +
> 
> Can we have some real header here? Also if this is .rst, maybe it
> should be marked so?

Ok, I will fix it.

>> +Under certain circumstances, the SoC reaches a temperature exceeding
>> +the allocated power budget or the maximum temperature limit. The
> 
> I don't understand. Power budget is in W, temperature is in
> kelvin. Temperature can't exceed power budget AFAICT.

Yes, it is badly worded. Is the following better ?

"
Under certain circumstances a SoC can reach the maximum temperature
limit or is unable to stabilize the temperature around a temperature
control.

When the SoC has to stabilize the temperature, the kernel can act on a
cooling device to mitigate the dissipated power.

When the maximum temperature is reached and to prevent a catastrophic
situation a radical decision must be taken to reduce the temperature
under the critical threshold, that impacts the performance.

"

>> +former must be mitigated to stabilize the SoC temperature around the
>> +temperature control using the defined cooling devices, the latter
> 
> later?
> 
>> +catastrophic situation where a radical decision must be taken to
>> +reduce the temperature under the critical threshold, that can impact
>> +the performances.
> 
> performance.
> 
>> +Another situation is when the silicon reaches a certain temperature
>> +which continues to increase even if the dynamic leakage is reduced to
>> +its minimum by clock gating the component. The runaway phenomena will
>> +continue with the static leakage and only powering down the component,
>> +thus dropping the dynamic and static leakage will allow the component
>> +to cool down. This situation is critical.
> 
> Critical here, critical there. I have trouble following
> it. Theoretically hardware should protect itself, because you don't
> want kernel bug to damage your CPU?

There are several levels of protection. The first level is mitigating
the temperature from the kernel; the next is a reset line in the temperature
sensor which will trigger a reboot of the CPUs. Usually it is a register where
you write the maximum temperature, from the driver itself. I never tried
to write 1000°C in this register to see if I can burn the board.

I know some boards have another level of thermal protection in the
hardware itself and some others don't.

In any case, from a kernel point of view, it is a critical situation as
we are about to hard reboot the system, and in this case it is preferable
to drastically drop the performance but give the system the opportunity
to run in a degraded mode.

>> +Last but not least, the system can ask for a specific power budget but
>> +because of the OPP density, we can only choose an OPP with a power
>> +budget lower than the requested one and underuse the CPU, thus losing
>> +performances. In other words, one OPP under uses the CPU with a
> 
> performance.
> 
>> +lesser than the power budget and the next OPP exceed the power budget,
>> +an intermediate OPP could have been used if it were present.
> 
> was.
> 
>> +Solutions:
>> +----------
>> +
>> +If we can remove the static and the dynamic leakage for a specific
>> +duration in a controlled period, the SoC temperature will
>> +decrease. Acting at the idle state duration or the idle cycle
> 
> "should" decrease? If you are in bad environment..

No, it will decrease in any case because of the static leakage drop. The
bad environment will impact the speed of this decrease.

>> +The Operating Performance Point (OPP) density has a great influence on
>> +the control precision of cpufreq, however different vendors have a
>> +plethora of OPP density, and some have large power gap between OPPs,
>> +that will result in loss of performance during thermal control and
>> +loss of power in other scenes.
> 
> scene seems to be wrong word here.

yes, 'scenario' will be better :)

>> +At a specific OPP, we can assume injecting idle cycle on all CPUs,
> 
> Extra comma?
> 
>> +Idle Injection:
>> +---------------
>> +
>> +The base concept of the idle injection is to force the CPU to go to an
>> +idle state for a specified time each control cycle, it provides
>> +another way to control CPU power and heat in addition to
>> +cpufreq. Ideally, if all CPUs of a cluster inject idle synchronously,
>> +this cluster can get into the deepest idle state and achieve minimum
>> +power consumption, but that will also increase system response latency
>> +if we inject less than cpuidle latency.
> 
> I don't understand last sentence.

Is it better ?

"Ideally, if all CPUs, belonging to the same cluster, inject their idle
cycle synchronously, the cluster can reach its power down state with a
minimum power consumption and static leakage drop. However, these idle
cycles injection will add extra latencies as the CPUs will have to
wakeup from a deep sleep state."


>> +The mitigation begins with a maximum period value which decrease
> 
> decreases?
>   
>> +more cooling effect is requested. When the period duration is equal
>> to
>> +the idle duration, then we are in a situation the platform can’t
>> +dissipate the heat enough and the mitigation fails. In this case
> 
> fast enough?
> 
>> +situation is considered critical and there is nothing to do. The idle
> 
> Nothing to do? Maybe power the system down?

Nothing to do == the mitigation can't handle the situation, it reached
its limit. We can't do better.

Solution: add an emergency thermal shutdown (which is an orthogonal
feature to be added to the thermal framework).

Sidenote: it is a very unlikely case, as we are idle most of the time
when the heat is hard to dissipate. I tested this with a proto-SoC with
an interesting thermal behavior (temperature jumps insanely high),
running at full blast and with bad heat dissipation, and the mitigation
never reached the limit.

>> +The idle injection duration value must comply with the constraints:
>> +
>> +- It is lesser or equal to the latency we tolerate when the mitigation
> 
> less ... than the latency
> 
>> +Minimum period
>> +--------------
>> +
>> +The idle injection duration being fixed, it is obvious the minimum
>> +period can’t be lesser than that, otherwise we will be scheduling the
> 
> less.
> 
>> +Practically, if the running power is lesses than the targeted power,
> 
> less.
> 
>> +However, in this demonstration we ignore three aspects:
>> +
>> + * The static leakage is not defined here, we can introduce it in the
>> +   equation but assuming it will be zero most of the time as it is
> 
> , but?
> 
> Best regards,

Thanks!




* Re: [PATCH V2 0/7] CPU cooling device new strategies
       [not found] <1519226968-19821-1-git-send-email-daniel.lezcano@linaro.org>
       [not found] ` <1519226968-19821-6-git-send-email-daniel.lezcano@linaro.org>
@ 2018-03-07 17:09 ` Eduardo Valentin
  2018-03-07 18:57   ` Daniel Lezcano
       [not found] ` <1519226968-19821-7-git-send-email-daniel.lezcano@linaro.org>
  2 siblings, 1 reply; 22+ messages in thread
From: Eduardo Valentin @ 2018-03-07 17:09 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: kevin.wangtao, leo.yan, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm

Hello Daniel,

On Wed, Feb 21, 2018 at 04:29:21PM +0100, Daniel Lezcano wrote:
> Changelog:
>   V2:
>      - Dropped the cpu combo cooling device
>      - Added the acked-by tags
>      - Replaced task array by a percpu structure
>      - Fixed the copyright dates
>      - Fixed the extra lines
>      - Fixed the compilation macros to be reused
>      - Fixed the list node removal
>      - Massaged a couple of function names
> 
> 
> The following series provides a new way to cool down a SoC by reducing
> the dissipated power on the CPUs. Based on the initial work from Kevin
> Wangtao, the series implements a CPU cooling device based on idle
> injection, relying on the cpuidle framework.

Nice! Glad to see that Linaro took this work again. I have a few
questions, as follows.

> 
> The patchset is designed to have the current DT binding for the
> cpufreq cooling device to be compatible with the new cooling devices.
> 
> Different cpu cooling devices can not co-exist on the system, the cpu
> cooling device is enabled or not, and one cooling strategy is selected
> (cpufreq or cpuidle). It is not possible to have all of them available
> at the same time. However, the goal is to enable them all and be able
> to switch from one to another at runtime but that needs a rework of the
> thermal framework which is orthogonal to the feature we are providing.
> 

Can you please elaborate on the limitations you found? Please be more
specific.

> This series is divided into two parts.
> 
> The first part just provides trivial changes for the copyright and
> removes an unused field in the cpu freq cooling device structure.
> 

Ok..

> The second part provides the idle injection cooling device, allowing a SoC
> without a cpufreq driver to use this cooling device as an alternative.
> 

which is awesome!


> The preliminary benchmarks show the following changes:
> 
> On the hikey6220, dhrystone shows a throughput increase of 40% for an
> increase of the latency of 16% while sysbench shows a latency increase
> of 5%.

I don't follow these numbers. Throughput increase while injecting idle?
Compared to what? Percentages of what? Please be more specific to better
describe your work.

> 
> Initially, the first version provided also the cpuidle + cpufreq combo
> cooling device but regarding the latest comments, there is a misfit with
> how the cpufreq cooling device and suspend/resume/cpu hotplug/module
> loading|unloading behave together while the combo cooling device was
> designed assuming the cpufreq cooling device was always there. This
> dynamic is investigated and the combo cooling device will be posted
> separately after this series gets merged.

Yeah, this is one of the confusing parts. Could you please
remind us of the limitations here? Why can't we enable CPUfreq
on higher trip points and CPUidle on lower trip points, for example?
Especially from a system design point of view, the system engineer
typically would benefit from idle injections to achieve overall
average CPU frequencies in a more granular fashion, for example,
achieving performance vs. cooling between available real
frequencies, avoiding real switches.

Also, there is a major design question here. After Linaro's attempt
to send a cpufreq+cpuidle cooling device(s), there was an attempt
to generalize and extend the intel powerclamp driver. Do you mind
explaining why refactoring intel powerclamp is not possible? The basic
idea is the same, no?



> 
> Daniel Lezcano (7):
>   thermal/drivers/cpu_cooling: Fixup the header and copyright
>   thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
>   thermal/drivers/cpu_cooling: Remove pointless field
>   thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
>   thermal/drivers/cpu_cooling: Add idle cooling device documentation
>   thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
>   cpuidle/drivers/cpuidle-arm: Register the cooling device
> 
>  Documentation/thermal/cpu-idle-cooling.txt | 165 ++++++++++
>  drivers/cpuidle/cpuidle-arm.c              |   5 +
>  drivers/thermal/Kconfig                    |  30 +-
>  drivers/thermal/cpu_cooling.c              | 480 +++++++++++++++++++++++++++--
>  include/linux/cpu_cooling.h                |  15 +-
>  5 files changed, 668 insertions(+), 27 deletions(-)
>  create mode 100644 Documentation/thermal/cpu-idle-cooling.txt
> 
> -- 
> 2.7.4
> 


* Re: [PATCH V2 0/7] CPU cooling device new strategies
  2018-03-07 17:09 ` [PATCH V2 0/7] CPU cooling device new strategies Eduardo Valentin
@ 2018-03-07 18:57   ` Daniel Lezcano
  2018-03-08 12:03     ` Daniel Thompson
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Lezcano @ 2018-03-07 18:57 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: kevin.wangtao, leo.yan, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm


Hi Eduardo,


On 07/03/2018 18:09, Eduardo Valentin wrote:
> Hello Daniel,
> 
> On Wed, Feb 21, 2018 at 04:29:21PM +0100, Daniel Lezcano wrote:
>> Changelog:
>>   V2:
>>      - Dropped the cpu combo cooling device
>>      - Added the acked-by tags
>>      - Replaced task array by a percpu structure
>>      - Fixed the copyright dates
>>      - Fixed the extra lines
>>      - Fixed the compilation macros to be reused
>>      - Fixed the list node removal
>>      - Massaged a couple of function names
>>
>>
>> The following series provides a new way to cool down a SoC by reducing
>> the dissipated power on the CPUs. Based on the initial work from Kevin
>> Wangtao, the series implements a CPU cooling device based on idle
>> injection, relying on the cpuidle framework.
> 
> Nice! Glad to see that Linaro took this work again. I have a few
> questions, as follows.
> 
>>
>> The patchset is designed to have the current DT binding for the
>> cpufreq cooling device to be compatible with the new cooling devices.
>>
>> Different cpu cooling devices can not co-exist on the system, the cpu
>> cooling device is enabled or not, and one cooling strategy is selected
>> (cpufreq or cpuidle). It is not possible to have all of them available
>> at the same time. However, the goal is to enable them all and be able
>> to switch from one to another at runtime but that needs a rework of the
>> thermal framework which is orthogonal to the feature we are providing.
>>
> 
> Can you please elaborate on the limitations you found? Please be more
> specific.

I did not find the limitation because the dynamic change was not
implemented. My concern is mainly about changing the cooling device while
we are mitigating; I'm not sure the thermal framework supports that at the
moment.

This is why Viresh proposed to first add the idle injection and then
support the dynamic change before resending the combo cooling device.

>> This series is divided into two parts.
>>
>> The first part just provides trivial changes for the copyright and
>> removes an unused field in the cpu freq cooling device structure.
>>
> 
> Ok..
> 
>> The second part provides the idle injection cooling device, allowing a SoC
>> without a cpufreq driver to use this cooling device as an alternative.
>>
> 
> which is awesome!
> 
> 
>> The preliminary benchmarks show the following changes:
>>
>> On the hikey6220, dhrystone shows a throughtput increase of 40% for an
>> increase of the latency of 16% while sysbench shows a latency increase
>> of 5%.
> 
> I don't follow these numbers. Throughput increase while injecting idle?
> compared to what? percentages of what? Please be more specific to better
> describer your work..

The dhrystone throughput is based on the virtual timer: when we are
running, it is at the max OPP, so the throughput increases. But in terms
of real time, it obviously takes more time to complete as we are
artificially inserting idle cycles. With the cpufreq governor, we run at
a lower OPP, so the throughput is less for dhrystone but it takes less
time to complete.

The percentages compare the cpufreq and cpuidle cooling devices. I will
take care of presenting the results more clearly in the next version.

>> Initially, the first version provided also the cpuidle + cpufreq combo
>> cooling device but regarding the latest comments, there is a misfit with
>> how the cpufreq cooling device and suspend/resume/cpu hotplug/module
>> loading|unloading behave together while the combo cooling device was
>> designed assuming the cpufreq cooling device was always there. This
>> dynamic is investigated and the combo cooling device will be posted
>> separetely after this series gets merged.
> 
> Yeah, this is one of the confusing parts. Could you please
> remind us of the limitations here? Why can't we enable CPUfreq
> on higher trip points and CPUidle on lower trip points, for example?

Sorry, I'm not getting the question. We don't want to enable cpuidle or
cpufreq at a certain point, but to combine the cooling effect of both in
order to get the best power / performance tradeoff.

Let me give an example with a simple SoC - one core.

Let's say we have 4 OPPs and a core-sleep idle state. Respectively, the
OPPs consume 100mW, 500mW, 2W, 4W. Now the CPU is doing intensive work,
running at the highest OPP and thus consuming 4W. The temperature increases
and reaches 75°C, which is the mitigation point and where the sustainable
power is 1.7W.

 - With the cpufreq cooling device, we can't have 4W, so we go back and
forth between 2W and 500mW.

 - With the cpuidle cooling device, we are at the highest OPP (there is
no cpufreq driver) and we insert 57.5% of idle duration

 - With the combo cooling device, we compute the round-up OPP (here 2W)
and we insert idle cycles for the remaining power to reach the
sustainable power, so 15%.

With the combo, we increase the performance for the same requested
power. There are no state2power callbacks yet, but we expect the
combination of dropping the static leakage and using the higher OPP to give
better results in terms of performance and mitigation on energy-hungry
CPUs like the recent big ARM cpus with the IPA governor.
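
For reference, the idle ratios above come from a simple duty cycle model,
assuming the power consumed in the idle state and the static leakage are
negligible. A minimal sketch of that arithmetic (the helper below is purely
illustrative, it is not part of this series):

        /*
         * Idle duty cycle (in percent) needed to bring the average power
         * of a given OPP down to the sustainable power, ignoring static
         * leakage and idle state power.
         */
        static unsigned int idle_duty_cycle_pct(unsigned int sustainable_mw,
                                                unsigned int opp_mw)
        {
                if (sustainable_mw >= opp_mw)
                        return 0;       /* no injection needed at this OPP */

                /* idle% = (1 - P_sustainable / P_opp) * 100 */
                return 100 - (100 * sustainable_mw) / opp_mw;
        }

        /*
         * idle_duty_cycle_pct(1700, 4000) -> 58  (cpuidle alone at 4W, ~57.5%)
         * idle_duty_cycle_pct(1700, 2000) -> 15  (combo at the round-up 2W OPP)
         */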

Going straight to the point of your question above, we can see the
cpufreq cooling device and the cpuidle cooling device have to
collaborate. If we unregister the cpufreq device, we have to do the math
for the power again in the combo cooling device. It is not a problem by
itself but needs extra thought in the current code.

> Specially from a system design point of view, the system engineer
> typically would benefit of idle injections to achieve overall
> average CPU frequencies in a more granular fashion, for example,
> achieving performance vs. cooling between available real
> frequencies, avoiding real switches.
>
> Also, there is a major design question here. After Linaro's attempt
> to send a cpufreq+cpuidle cooling device(s), there was an attempt
> to generalize and extend intel powerclamp driver. 

I'm not aware of such an attempt.

> Do you mind
> explaining why refactoring intel powerclamp is not possible? Basic
> idea is the same, no?

Basically the idea is the same: run synchronized idle threads and call
play_idle(). That is all. That aside, the intel_powerclamp is very x86
centric and contains a plethora of code not fitting our purpose: it
increases the idle duration, while we increase the number of idle
cycles but keep the idle duration constant in order to keep control of
the latency for user interactivity. If you compare the idle
injection thread code (powerclamp and cpuidle cooling device), you
will also notice they are very different in terms of implementation.

The combo cooling device collaborates with the cpufreq cooling device
and reuses the DT binding, and finally it uses the power information
provided in the DT. The idle injection is a building block for the combo
cooling device.

Initially I thought we should refactor the intel_powerclamp, but it
appears the combo cooling device reuses the cpufreq and cpuidle cooling
devices, so it makes sense to have them all in a single file, evolving into
a single cooling device with different strategies.



>> Daniel Lezcano (7):
>>   thermal/drivers/cpu_cooling: Fixup the header and copyright
>>   thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
>>   thermal/drivers/cpu_cooling: Remove pointless field
>>   thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
>>   thermal/drivers/cpu_cooling: Add idle cooling device documentation
>>   thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
>>   cpuidle/drivers/cpuidle-arm: Register the cooling device
>>
>>  Documentation/thermal/cpu-idle-cooling.txt | 165 ++++++++++
>>  drivers/cpuidle/cpuidle-arm.c              |   5 +
>>  drivers/thermal/Kconfig                    |  30 +-
>>  drivers/thermal/cpu_cooling.c              | 480 +++++++++++++++++++++++++++--
>>  include/linux/cpu_cooling.h                |  15 +-
>>  5 files changed, 668 insertions(+), 27 deletions(-)
>>  create mode 100644 Documentation/thermal/cpu-idle-cooling.txt
>>
>> -- 
>> 2.7.4
>>




* Re: [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation
  2018-03-07 11:42     ` Daniel Lezcano
@ 2018-03-08  8:59       ` Pavel Machek
  2018-03-08 11:54         ` Daniel Thompson
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Machek @ 2018-03-08  8:59 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: edubezval, kevin.wangtao, leo.yan, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm,
	Jonathan Corbet, open list:DOCUMENTATION


Hi!

> >> +Under certain circumstances, the SoC reaches a temperature exceeding
> >> +the allocated power budget or the maximum temperature limit. The
> > 
> > I don't understand. Power budget is in W, temperature is in
> > kelvin. Temperature can't exceed power budget AFAICT.
> 
> Yes, it is badly worded. Is the following better ?
> 
> "
> Under certain circumstances a SoC can reach the maximum temperature
> limit or is unable to stabilize the temperature around a temperature
> control.
> 
> When the SoC has to stabilize the temperature, the kernel can act on a
> cooling device to mitigate the dissipated power.
> 
> When the maximum temperature is reached and to prevent a catastrophic
> situation a radical decision must be taken to reduce the temperature
> under the critical threshold, that impacts the performance.
> 
> "

Actually... if hardware is expected to protect itself, I'd tone it
down. No need to be all catastrophic and critical... But yes, better.

> > Critical here, critical there. I have trouble following
> > it. Theoretically hardware should protect itself, because you don't
> > want kernel bug to damage your CPU?
> 
> There are several levels of protection. The first level is mitigating
> the temperature from the kernel, then in the temperature sensor a reset
> line will trigger the reboot of the CPUs. Usually it is a register where
> you write the maximum temperature, from the driver itself. I never tried
> to write 1000°C in this register and see if I can burn the board.
> 
> I know some boards have another level of thermal protection in the
> hardware itself and some other don't.
> 
> In any case, from a kernel point of view, it is a critical situation as
> we are about to hard reboot the system and in this case it is preferable
> to drop drastically the performance but give the opportunity to the
> system to run in a degraded mode.

Agreed, you want to keep going. In the ACPI world, we shut down when
the critical trip point is reached, so this is somewhat confusing.

> >> +Solutions:
> >> +----------
> >> +
> >> +If we can remove the static and the dynamic leakage for a specific
> >> +duration in a controlled period, the SoC temperature will
> >> +decrease. Acting at the idle state duration or the idle cycle
> > 
> > "should" decrease? If you are in bad environment..
> 
> No, it will decrease in any case because of the static leakage drop. The
> bad environment will impact the speed of this decrease.

I meant... if the ambient temperature is 105°C, there's not much you can do
to cool the system down :-).

> >> +Idle Injection:
> >> +---------------
> >> +
> >> +The base concept of the idle injection is to force the CPU to go to an
> >> +idle state for a specified time each control cycle, it provides
> >> +another way to control CPU power and heat in addition to
> >> +cpufreq. Ideally, if all CPUs of a cluster inject idle synchronously,
> >> +this cluster can get into the deepest idle state and achieve minimum
> >> +power consumption, but that will also increase system response latency
> >> +if we inject less than cpuidle latency.
> > 
> > I don't understand last sentence.
> 
> Is it better ?
> 
> "Ideally, if all CPUs, belonging to the same cluster, inject their idle
> cycle synchronously, the cluster can reach its power down state with a
> minimum power consumption and static leakage drop. However, these idle
> cycles injection will add extra latencies as the CPUs will have to
> wakeup from a deep sleep state."

Extra comma "CPUs , belonging". But yes, better.

> Thanks!

You are welcome. Best regards,
									Pavel



* Re: [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation
  2018-03-08  8:59       ` Pavel Machek
@ 2018-03-08 11:54         ` Daniel Thompson
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Thompson @ 2018-03-08 11:54 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Daniel Lezcano, edubezval, kevin.wangtao, leo.yan,
	vincent.guittot, amit.kachhap, linux-kernel, javi.merino,
	rui.zhang, linux-pm, Jonathan Corbet, open list:DOCUMENTATION

On Thu, Mar 08, 2018 at 09:59:49AM +0100, Pavel Machek wrote:
> Hi!
> 
> > >> +Under certain circumstances, the SoC reaches a temperature exceeding
> > >> +the allocated power budget or the maximum temperature limit. The
> > > 
> > > I don't understand. Power budget is in W, temperature is in
> > > kelvin. Temperature can't exceed power budget AFAICT.
> > 
> > Yes, it is badly worded. Is the following better ?
> > 
> > "
> > Under certain circumstances a SoC can reach the maximum temperature
> > limit or is unable to stabilize the temperature around a temperature
> > control.
> > 
> > When the SoC has to stabilize the temperature, the kernel can act on a
> > cooling device to mitigate the dissipated power.
> > 
> > When the maximum temperature is reached and to prevent a catastrophic
> > situation a radical decision must be taken to reduce the temperature
> > under the critical threshold, that impacts the performance.
> > 
> > "
> 
> Actually... if hardware is expected to protect itself, I'd tone it
> down. No need to be all catastrophic and critical... But yes, better.

Makes sense. For a thermally overcommitted but passively cooled device
working close to the max operating temperature, it is not a critical
situation requiring a radical reaction, it is normal operation.

Put another way, it would be severely bogus to attach KERN_CRIT
messages to reaching the cooling threshold.


Daniel.


> > > Critical here, critical there. I have trouble following
> > > it. Theoretically hardware should protect itself, because you don't
> > > want kernel bug to damage your CPU?
> > 
> > There are several levels of protection. The first level is mitigating
> > the temperature from the kernel, then in the temperature sensor a reset
> > line will trigger the reboot of the CPUs. Usually it is a register where
> > you write the maximum temperature, from the driver itself. I never tried
> > to write 1000°C in this register and see if I can burn the board.
> > 
> > I know some boards have another level of thermal protection in the
> > hardware itself and some other don't.
> > 
> > In any case, from a kernel point of view, it is a critical situation as
> > we are about to hard reboot the system and in this case it is preferable
> > to drop drastically the performance but give the opportunity to the
> > system to run in a degraded mode.
> 
> Agreed you want to keep going. In ACPI world, we shutdown when
> critical trip point is reached, so this is somehow confusing.
> 
> > >> +Solutions:
> > >> +----------
> > >> +
> > >> +If we can remove the static and the dynamic leakage for a specific
> > >> +duration in a controlled period, the SoC temperature will
> > >> +decrease. Acting at the idle state duration or the idle cycle
> > > 
> > > "should" decrease? If you are in bad environment..
> > 
> > No, it will decrease in any case because of the static leakage drop. The
> > bad environment will impact the speed of this decrease.
> 
> I meant... if ambient temperature is 105C, there's not much you can do
> to cool system down :-).
> 
> > >> +Idle Injection:
> > >> +---------------
> > >> +
> > >> +The base concept of the idle injection is to force the CPU to go to an
> > >> +idle state for a specified time each control cycle, it provides
> > >> +another way to control CPU power and heat in addition to
> > >> +cpufreq. Ideally, if all CPUs of a cluster inject idle synchronously,
> > >> +this cluster can get into the deepest idle state and achieve minimum
> > >> +power consumption, but that will also increase system response latency
> > >> +if we inject less than cpuidle latency.
> > > 
> > > I don't understand last sentence.
> > 
> > Is it better ?
> > 
> > "Ideally, if all CPUs, belonging to the same cluster, inject their idle
> > cycle synchronously, the cluster can reach its power down state with a
> > minimum power consumption and static leakage drop. However, these idle
> > cycles injection will add extra latencies as the CPUs will have to
> > wakeup from a deep sleep state."
> 
> Extra comma "CPUs , belonging". But yes, better.
> 
> > Thanks!
> 
> You are welcome. Best regards,
> 									Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH V2 0/7] CPU cooling device new strategies
  2018-03-07 18:57   ` Daniel Lezcano
@ 2018-03-08 12:03     ` Daniel Thompson
  2018-03-26 14:30       ` Leo Yan
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Thompson @ 2018-03-08 12:03 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Eduardo Valentin, kevin.wangtao, leo.yan, vincent.guittot,
	amit.kachhap, linux-kernel, javi.merino, rui.zhang, linux-pm

On Wed, Mar 07, 2018 at 07:57:17PM +0100, Daniel Lezcano wrote:
> >> The preliminary benchmarks show the following changes:
> >>
> >> On the hikey6220, dhrystone shows a throughtput increase of 40% for an
> >> increase of the latency of 16% while sysbench shows a latency increase
> >> of 5%.
> > 
> > I don't follow these numbers. Throughput increase while injecting idle?
> > compared to what? percentages of what? Please be more specific to better
> > describer your work..
> 
> The dhrystone throughput is based on the virtual timer, when we are
> running, it is at max opp, so the throughput increases. But regarding
> the real time, it takes obviously more time to achieve as we are
> artificially inserting idle cycles. With the cpufreq governor, we run at
> a lower opp, so the throughput is less for dhrystone but it takes less
> time to achieve.
> 
> Percentages are comparing cpufreq vs cpuidle cooling devices. I will
> take care of presenting the results in a more clear way in the next version.

I think we should also note that the current hikey settings for cpufreq
are very badly tuned for this platform. It has a single temp threshold
and it jumps from max freq to min freq.

IIRC Leo's work on Hikey thermals correctly, it would be much better if
it used the power-allocator thermal governor, or if it copied some of
the Samsung 32-bit platforms by configuring the step governor with a
graduated response: a slightly lower threshold that moves two steps back in
the OPP table (which is still a fairly high clock speed... but is
thermally sustainable).


Daniel.


> >> Initially, the first version provided also the cpuidle + cpufreq combo
> >> cooling device but regarding the latest comments, there is a misfit with
> >> how the cpufreq cooling device and suspend/resume/cpu hotplug/module
> >> loading|unloading behave together while the combo cooling device was
> >> designed assuming the cpufreq cooling device was always there. This
> >> dynamic is investigated and the combo cooling device will be posted
> >> separetely after this series gets merged.
> > 
> > Yeah, this is one of the confusing parts. Could you please
> > remind us of the limitations here? Why can't we enable CPUfreq
> > on higher trip points and CPUidle on lower trip points, for example?
> 
> Sorry, I'm not getting the question. We don't want to enable cpuidle or
> cpufreq at certain point but combine the cooling effect of both in order
> to get the best tradeoff power / performance.
> 
> Let me give an example with a simple SoC - one core.
> 
> Let's say we have 4 OPPs and a core-sleep idle state. Respectively, the
> OPPs consume 100mW, 500mW, 2W, 4W. Now the CPU is in an intensive work
> running at the highest OPP, thus consuming 4W. The temperature increases
> and reaches 75°C which is the mitigation point and where the sustainable
> power is 1.7W.
> 
>  - With the cpufreq cooling device, we can't have 4W, so we go back and
> forth between 2W and 500mW.
> 
>  - With the cpuidle cooling device, we are at the highest OPP (there is
> no cpufreq driver) and we insert 57.5% of idle duration
> 
>  - With the combo cooling device, we compute the round-up OPP (here 2W)
> and we insert idle cycles for the remaining power to reach the
> sustainable power, so 15%.
> 
> With the combo, we increase the performances for the same requested
> power. There is no yet the state2power callbacks but we expect the
> combination of dropping the static leakage and the higher OPP to give
> better results in terms of performance and mitigation on energy eager
> CPUs like the recent big ARM cpus with the IPA governor.
> 
> Going straight to the point of your question above, we can see the
> cpufreq cooling device and the cpuidle cooling device have to
> collaborate. If we unregister the cpufreq device, we have to do the math
> for the power again in the combo cooling device. It is not a problem by
> itself but needs an extra reflexion in the current code.
> 
> > Specially from a system design point of view, the system engineer
> > typically would benefit of idle injections to achieve overall
> > average CPU frequencies in a more granular fashion, for example,
> > achieving performance vs. cooling between available real
> > frequencies, avoiding real switches.
> >
> > Also, there is a major design question here. After Linaro's attempt
> > to send a cpufreq+cpuidle cooling device(s), there was an attempt
> > to generalize and extend intel powerclamp driver. 
> 
> I'm not aware of such attempt.
> 
> > Do you mind
> > explaining why refactoring intel powerclamp is not possible? Basic
> > idea is the same, no?
> 
> Basically the idea is the same: run synchronized idle thread and call
> play_idle(). That is all. Putting apart the intel_powerclamp is very x86
> centric and contains a plethora of code not fitting our purpose, it
> increases the idle duration while we are increasing the number of idle
> cycles but keep the idle duration constant in order to have a control on
> the latency for the user interactivity. If you compare the idle
> injection threads codes (powerclamp and cpuidle cooling device), you
> will also notice they are very different in terms of implementation.
> 
> The combo cooling device collaborates with the cpufreq cooling device
> and reuses the DT binding, and finally it uses the power information
> provided in the DT. The idle injection is a brick to the combo cooling
> device.
> 
> Initially I thought we should refactor the intel_powerclamp but it
> appears the combo cooling device reuses the cpufreq and cpuidle cooling
> device, making sense to have them all in a single file and evolving to a
> single cooling device with different strategies.
> 
> 
> 
> >> Daniel Lezcano (7):
> >>   thermal/drivers/cpu_cooling: Fixup the header and copyright
> >>   thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
> >>   thermal/drivers/cpu_cooling: Remove pointless field
> >>   thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
> >>   thermal/drivers/cpu_cooling: Add idle cooling device documentation
> >>   thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
> >>   cpuidle/drivers/cpuidle-arm: Register the cooling device
> >>
> >>  Documentation/thermal/cpu-idle-cooling.txt | 165 ++++++++++
> >>  drivers/cpuidle/cpuidle-arm.c              |   5 +
> >>  drivers/thermal/Kconfig                    |  30 +-
> >>  drivers/thermal/cpu_cooling.c              | 480 +++++++++++++++++++++++++++--
> >>  include/linux/cpu_cooling.h                |  15 +-
> >>  5 files changed, 668 insertions(+), 27 deletions(-)
> >>  create mode 100644 Documentation/thermal/cpu-idle-cooling.txt
> >>
> >> -- 
> >> 2.7.4
> >>
> 
> 
> -- 
>  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
> 


* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
  2018-02-26  4:30       ` [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver Viresh Kumar
@ 2018-03-13 19:15         ` Daniel Lezcano
  2018-04-04  8:50         ` Daniel Lezcano
  1 sibling, 0 replies; 22+ messages in thread
From: Daniel Lezcano @ 2018-03-13 19:15 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: edubezval, kevin.wangtao, leo.yan, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm

On 26/02/2018 05:30, Viresh Kumar wrote:

[ ... ]

>>>> +		/*
>>>> +		 * The last CPU waking up is in charge of setting the
>>>> +		 * timer. If the CPU is hotplugged, the timer will
>>>> +		 * move to another CPU (which may not belong to the
>>>> +		 * same cluster) but that is not a problem as the
>>>> +		 * timer will be set again by another CPU belonging to
>>>> +		 * the cluster, so this mechanism is self adaptive and
>>>> +		 * does not require any hotplugging dance.
>>>> +		 */
>>>
>>> Well this depends on how CPU hotplug really happens. What happens to
>>> the per-cpu-tasks which are in the middle of something when hotplug
>>> happens?  Does hotplug wait for those per-cpu-tasks to finish ?
> 
> Missed this one ?

To be frank, I don't know. I have been through the hotplug code and didn't
find the answer. AFAICT, sched_cpu_dying() in sched/core.c disables the
rq and then migrates the tasks. I assume this kthread is not migrated
and stays in a sleeping state until the rq is online again. As the wakeup
callback discards offline cpus, the kthread is never woken up.

I created a situation where that happens: mitigation (raised with
dhrystone) + cpu hotplugging (offline/online loop) and never hit any issue.

However, I'm wondering whether the 'struct smp_hotplug_thread'
infrastructure (git show f97f8f06) should be used instead.
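
For illustration, a minimal sketch of what that could look like with the
smpboot per-cpu kthread infrastructure (the names, the 'injection_enabled'
flag and the 10ms value are made up for the example, this is not the patch):

        #include <linux/cpu.h>
        #include <linux/percpu.h>
        #include <linux/sched.h>
        #include <linux/smpboot.h>

        static DEFINE_PER_CPU(struct task_struct *, inject_tsk);
        static bool injection_enabled;  /* would be driven by set_cur_state() */

        static int inject_should_run(unsigned int cpu)
        {
                return READ_ONCE(injection_enabled);
        }

        static void inject_fn(unsigned int cpu)
        {
                play_idle(10);  /* forced idle duration, in ms */
        }

        static struct smp_hotplug_thread inject_threads = {
                .store                  = &inject_tsk,
                .thread_should_run      = inject_should_run,
                .thread_fn              = inject_fn,
                .thread_comm            = "idle_inject/%u",
        };

        static int __init inject_threads_init(void)
        {
                /* smpboot parks/unparks the per-cpu threads across hotplug */
                return smpboot_register_percpu_thread(&inject_threads);
        }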


>>>> +int cpuidle_cooling_register(void)
>>>> +{
>>>> +	struct cpuidle_cooling_device *idle_cdev = NULL;
>>>> +	struct thermal_cooling_device *cdev;
>>>> +	struct cpuidle_cooling_tsk *cct;
>>>> +	struct task_struct *tsk;
>>>> +	struct device_node *np;
>>>> +	cpumask_t *cpumask;
>>>> +	char dev_name[THERMAL_NAME_LENGTH];
>>>> +	int ret = -ENOMEM, cpu;
>>>> +	int index = 0;
>>>> +
>>>> +	for_each_possible_cpu(cpu) {
>>>> +		cpumask = topology_core_cpumask(cpu);
>>>> +
>>>> +		cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
>>>> +
>>>> +		/*
>>>> +		 * This condition makes the first cpu belonging to the
>>>> +		 * cluster to create a cooling device and allocates
>>>> +		 * the structure. Others CPUs belonging to the same
>>>> +		 * cluster will just increment the refcount on the
>>>> +		 * cooling device structure and initialize it.
>>>> +		 */
>>>> +		if (cpu == cpumask_first(cpumask)) {
>>>
>>> Your function still have few assumptions of cpu numbering and it will
>>> break in few cases. What if the CPUs on a big Little system (4x4) are
>>> present in this order: B L L L L B B B  ??
>>>
>>> This configuration can happen if CPUs in DT are marked as: 0-3 LITTLE,
>>> 4-7 big and a big CPU is used by the boot loader to bring up Linux.
>>
>> Ok, how can I sort it out ?
> 
> I would do something like this:
> 
>         cpumask_copy(possible, cpu_possible_mask);
>         
>         while (!cpumask_empty(possible)) {
>                 first = cpumask_first(possible);
>                 cpumask = topology_core_cpumask(first);
>                 cpumask_andnot(possible, possible, cpumask);
>         
>                 allocate_cooling_dev(first); //This is most of this function in your patch.
>         
>                 while (!cpumask_empty(cpumask)) {
>                         temp = cpumask_first(possible);
>                         //rest init "temp"
>                         cpumask_clear_cpu(temp, cpumask);
>                 }
>         
>                 //Everything done, register cooling device for cpumask.
>         }
> 
>>>> +			np = of_cpu_device_node_get(cpu);
>>>> +
>>>> +			idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
>>>> +			if (!idle_cdev)
>>>> +				goto out_fail;
>>>> +
>>>> +			idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
>>>> +
>>>> +			atomic_set(&idle_cdev->count, 0);
>>>
>>> This should already be 0, isn't it ?
>>
>> Yes.
> 
> I read it as, "I will drop it" :)
> 




* Re: [PATCH V2 0/7] CPU cooling device new strategies
  2018-03-08 12:03     ` Daniel Thompson
@ 2018-03-26 14:30       ` Leo Yan
  2018-03-27  9:35         ` Daniel Lezcano
  0 siblings, 1 reply; 22+ messages in thread
From: Leo Yan @ 2018-03-26 14:30 UTC (permalink / raw)
  To: Daniel Thompson
  Cc: Daniel Lezcano, Eduardo Valentin, kevin.wangtao, vincent.guittot,
	amit.kachhap, linux-kernel, javi.merino, rui.zhang, linux-pm

On Thu, Mar 08, 2018 at 12:03:52PM +0000, Daniel Thompson wrote:
> On Wed, Mar 07, 2018 at 07:57:17PM +0100, Daniel Lezcano wrote:
> > >> The preliminary benchmarks show the following changes:
> > >>
> > >> On the hikey6220, dhrystone shows a throughtput increase of 40% for an
> > >> increase of the latency of 16% while sysbench shows a latency increase
> > >> of 5%.
> > > 
> > > I don't follow these numbers. Throughput increase while injecting idle?
> > > compared to what? percentages of what? Please be more specific to better
> > > describer your work..
> > 
> > The dhrystone throughput is based on the virtual timer, when we are
> > running, it is at max opp, so the throughput increases. But regarding
> > the real time, it takes obviously more time to achieve as we are
> > artificially inserting idle cycles. With the cpufreq governor, we run at
> > a lower opp, so the throughput is less for dhrystone but it takes less
> > time to achieve.
> > 
> > Percentages are comparing cpufreq vs cpuidle cooling devices. I will
> > take care of presenting the results in a more clear way in the next version.
> 
> I think we should also note that the current hikey settings for cpufreq
> are very badly tuned for this platform. It has a single temp threshold
> and it jumps from max freq to min freq.
> 
> IIRC Leo's work on Hikey thermals correctly it would be much better if 
> it used the power-allocator thermal governor or if if copied some of 
> the Samsung 32-bit platform by configuring the step governor with a 
> graduated with a slightly lower threshold that moves two stops back in 
> the OPP table (which is still fairly high clock speed... but it
> thermally sustainable).

I think Daniel L. is working on this patch set with the 'power-allocator'
governor, and the parameters 'sustainable-power = <3326>' and
'dynamic-power-coefficient = <311>' are profiled values for the Hikey
platform.  For now we only consider dynamic power and skip static leakage
for the 'power-allocator' governor.  All these parameters are merged
into the mainline Linux kernel.

Daniel L. could correct me if I misunderstand the testing conditions.

Thanks,
Leo Yan


* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
       [not found] ` <1519226968-19821-7-git-send-email-daniel.lezcano@linaro.org>
       [not found]   ` <20180223073432.GF26947@vireshk-i7>
@ 2018-03-27  2:03   ` Leo Yan
  2018-03-27 10:26     ` Daniel Lezcano
  2018-03-27  3:35   ` Leo Yan
  2 siblings, 1 reply; 22+ messages in thread
From: Leo Yan @ 2018-03-27  2:03 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: edubezval, kevin.wangtao, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm,
	Viresh Kumar

Hi Daniel,

On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
> The cpu idle cooling driver performs synchronized idle injection across all
> cpus belonging to the same cluster and offers a new method to cool down a SoC.
> 
> Each cluster has its own idle cooling device, each core has its own idle
> injection thread, each idle injection thread uses play_idle to enter idle.  In
> order to reach the deepest idle state, each cooling device has the idle
> injection threads synchronized together.
> 
> It has some similarity with the Intel powerclamp driver, but it is actually
> designed to work on the ARM architecture via the DT, with a mathematical
> proof of the power model which comes with the Documentation.
> 
> The idle injection cycle is fixed while the running cycle is variable. That
> allows control over the device reactivity for the user experience. At
> the mitigation point the idle threads are unparked, they play idle for the
> specified amount of time and they schedule themselves. The last thread sets
> the next idle injection deadline and when the timer expires it wakes up all
> the threads which in turn play idle again. Meanwhile the running cycle is
> changed by set_cur_state.  When the mitigation ends, the threads are parked.
> The algorithm is self adaptive, so there is no need to handle hotplugging.

The idle injection threads are RT threads (FIFO), and I saw that
play_idle() sets/clears the PF_IDLE flag for them.  Will the idle
injection threads' utilization be accounted into the RT utilization?

If the idle injection threads' utilization is accounted as RT task
utilization, will this impact the 'schedutil' CPUFreq governor's OPP
selection?

[...]

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
       [not found] ` <1519226968-19821-7-git-send-email-daniel.lezcano@linaro.org>
       [not found]   ` <20180223073432.GF26947@vireshk-i7>
  2018-03-27  2:03   ` Leo Yan
@ 2018-03-27  3:35   ` Leo Yan
  2018-03-27 10:56     ` Daniel Lezcano
  2 siblings, 1 reply; 22+ messages in thread
From: Leo Yan @ 2018-03-27  3:35 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: edubezval, kevin.wangtao, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm,
	Viresh Kumar

On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:

[...]

> +/**
> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
> + * @arg: a void pointer containing the idle cooling device address
> + *
> + * This main function does basically two operations:
> + *
> + * - Goes idle for a specific amount of time
> + *
> + * - Sets a timer to wake up all the idle injection threads after a
> + *   running period
> + *
> + * That happens only when the mitigation is enabled, otherwise the
> + * task is scheduled out.
> + *
> + * In order to keep the tasks synchronized together, it is the last
> + * task exiting the idle period which is in charge of setting the
> + * timer.
> + *
> + * This function never returns.
> + */
> +static int cpuidle_cooling_injection_thread(void *arg)
> +{
> +	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };

I am just wondering if we should set the priority to (MAX_RT_PRIO - 1)?
Otherwise I am concerned that the cluster may fail to enter a deep idle
state when any CPU's idle injection thread is preempted by other higher
priority RT threads, so the CPUs lose their alignment for idle state
entry/exit.
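
Something like the below is what I have in mind (only a sketch of the
suggestion, not tested):

        struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

        /*
         * Highest FIFO priority, so other RT tasks cannot delay an
         * injection thread and break the cluster-wide idle alignment.
         */
        sched_setscheduler(current, SCHED_FIFO, &param);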

> +	struct cpuidle_cooling_device *idle_cdev = arg;
> +	struct cpuidle_cooling_tsk *cct = per_cpu_ptr(&cpuidle_cooling_tsk,
> +						      smp_processor_id());
> +	DEFINE_WAIT(wait);
> +
> +	set_freezable();
> +
> +	sched_setscheduler(current, SCHED_FIFO, &param);
> +
> +	while (1) {
> +		s64 next_wakeup;
> +
> +		prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);
> +
> +		schedule();
> +
> +		atomic_inc(&idle_cdev->count);
> +
> +		play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
> +
> +		/*
> +		 * The last CPU waking up is in charge of setting the
> +		 * timer. If the CPU is hotplugged, the timer will
> +		 * move to another CPU (which may not belong to the
> +		 * same cluster) but that is not a problem as the
> +		 * timer will be set again by another CPU belonging to
> +		 * the cluster, so this mechanism is self adaptive and
> +		 * does not require any hotplugging dance.
> +		 */
> +		if (!atomic_dec_and_test(&idle_cdev->count))
> +			continue;
> +
> +		if (!idle_cdev->state)
> +			continue;
> +
> +		next_wakeup = cpuidle_cooling_runtime(idle_cdev);
> +
> +		hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
> +			      HRTIMER_MODE_REL_PINNED);

If the SoC temperature decreases below the trip point, will the timer be
disabled in that case?  Or will the next timer event be set with a big
value from next_wakeup?

[...]

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
       [not found]     ` <faaf027c-e01c-6801-9a0c-ab7e0ba669a1@linaro.org>
  2018-02-26  4:30       ` [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver Viresh Kumar
@ 2018-03-27  3:43       ` Leo Yan
  2018-03-27 11:10         ` Daniel Lezcano
  1 sibling, 1 reply; 22+ messages in thread
From: Leo Yan @ 2018-03-27  3:43 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Viresh Kumar, edubezval, kevin.wangtao, vincent.guittot,
	amit.kachhap, linux-kernel, javi.merino, rui.zhang,
	daniel.thompson, linux-pm

On Fri, Feb 23, 2018 at 12:28:51PM +0100, Daniel Lezcano wrote:
> On 23/02/2018 08:34, Viresh Kumar wrote:
> > On 21-02-18, 16:29, Daniel Lezcano wrote:

> [ ... ]
> 
> >> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
> >> +{
> >> +	s64 next_wakeup;
> >> +	int state = idle_cdev->state;
> >> +
> >> +	/*
> >> +	 * The function must never be called when there is no
> >> +	 * mitigation because:
> >> +	 * - that does not make sense
> >> +	 * - we end up with a division by zero
> >> +	 */
> >> +	BUG_ON(!state);
> > 
> > As there is no locking in place, we can surely hit this case. What if
> > the state changed to 0 right before this routine was called ?
> > 
> > I would suggest we should just return 0 in that case and get away with
> > the BUG_ON().

Here, if 'state' equals 0 and we return 0, the return value is the same
as when 'state' = 100, which makes the return value ambiguous.

I think for 'state' = 0 we should return -1, to indicate that the
hrtimer must not be set in this case?
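
For example (only a rough sketch, assuming 'state' is the idle duty
cycle in percent and 'idle_cycle' the fixed idle duration in
microseconds, as in this patch; div64_s64() comes from linux/math64.h):

        static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
        {
                int state = idle_cdev->state;

                /* No mitigation: tell the caller not to arm the hrtimer */
                if (!state)
                        return -1;

                /* running time = idle time * (100 - state) / state, in ns */
                return div64_s64((s64)idle_cdev->idle_cycle * (100 - state),
                                 state) * NSEC_PER_USEC;
        }

and in the injection loop:

        next_wakeup = cpuidle_cooling_runtime(idle_cdev);
        if (next_wakeup < 0)
                continue;

        hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
                      HRTIMER_MODE_REL_PINNED);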

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 0/7] CPU cooling device new strategies
  2018-03-26 14:30       ` Leo Yan
@ 2018-03-27  9:35         ` Daniel Lezcano
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Lezcano @ 2018-03-27  9:35 UTC (permalink / raw)
  To: Leo Yan, Daniel Thompson
  Cc: Eduardo Valentin, kevin.wangtao, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, linux-pm

On 26/03/2018 16:30, Leo Yan wrote:
> On Thu, Mar 08, 2018 at 12:03:52PM +0000, Daniel Thompson wrote:
>> On Wed, Mar 07, 2018 at 07:57:17PM +0100, Daniel Lezcano wrote:
>>>>> The preliminary benchmarks show the following changes:
>>>>>
>>>>> On the hikey6220, dhrystone shows a throughput increase of 40% for a
>>>>> latency increase of 16%, while sysbench shows a latency increase
>>>>> of 5%.
>>>>
>>>> I don't follow these numbers. Throughput increase while injecting idle?
>>>> Compared to what? Percentages of what? Please be more specific to better
>>>> describe your work.
>>>
>>> The dhrystone throughput is based on the virtual timer: when we are
>>> running, we are at the max OPP, so the throughput increases. But in
>>> terms of real time, it obviously takes longer to complete because we
>>> are artificially inserting idle cycles. With the cpufreq governor, we
>>> run at a lower OPP, so the dhrystone throughput is lower but it takes
>>> less time to complete.
>>>
>>> The percentages compare the cpufreq and cpuidle cooling devices. I will
>>> take care of presenting the results more clearly in the next version.
>>
>> I think we should also note that the current Hikey settings for cpufreq
>> are very badly tuned for this platform. There is a single temperature
>> threshold and it jumps from max freq to min freq.
>>
>> IIRC Leo's work on Hikey thermals correctly, it would be much better if
>> it used the power-allocator thermal governor, or if it copied some of
>> the Samsung 32-bit platforms by configuring the step-wise governor with
>> a slightly lower threshold that moves two steps back in the OPP table
>> (which is still a fairly high clock speed... but thermally sustainable).
> 
> I think Daniel L. is working on this patch set with the 'power-allocator'
> governor, and the parameters 'sustainable-power = <3326>' and
> 'dynamic-power-coefficient = <311>' are profiled values for the Hikey
> platform.  For now we only consider dynamic power and skip the static
> leakage for the 'power-allocator' governor.  All these parameters are
> already merged into the mainline Linux kernel.
> 
> Daniel L. could correct me if I misunderstand the testing conditions.

Well, the first iteration is without the power-allocator governor API; it
was tested with the step-wise governor only. But you are right that it
will use the dynamic-power-coefficient and sustainable-power properties
and will implement the power-allocator version of the API. I'm working on
the power-allocator version for the idle injection + OPP change, as we
need to compute the capacity equivalence between the idle-injection
cycles + OPP and a lower OPP in order to change the OPP for an optimized
power / performance trade-off.
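
To make the equivalence explicit, the back-of-the-envelope reasoning I
have in mind is: if a cluster runs a fraction r = running / (running +
idle) of the time at frequency f_opp, the capacity it delivers is
roughly r * f_opp, so an idle injection duty cycle at a given OPP is
equivalent to a lower OPP f_eq ~= r * f_opp, and the combo device can
then pick the closest OPP at or below that value.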


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
  2018-03-27  2:03   ` Leo Yan
@ 2018-03-27 10:26     ` Daniel Lezcano
  2018-03-27 12:28       ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Lezcano @ 2018-03-27 10:26 UTC (permalink / raw)
  To: Leo Yan
  Cc: edubezval, kevin.wangtao, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm,
	Viresh Kumar

On 27/03/2018 04:03, Leo Yan wrote:
> Hi Daniel,
> 
> On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
>> The cpu idle cooling driver performs synchronized idle injection across all
>> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>>
>> Each cluster has its own idle cooling device, each core has its own idle
>> injection thread, each idle injection thread uses play_idle to enter idle.  In
>> order to reach the deepest idle state, each cooling device has the idle
>> injection threads synchronized together.
>>
>> It has some similarity with the Intel powerclamp driver, but it is actually
>> designed to work on the ARM architecture via the DT, with a mathematical
>> proof of the power model which comes with the Documentation.
>>
>> The idle injection cycle is fixed while the running cycle is variable. That
>> allows control over the device reactivity for the user experience. At
>> the mitigation point the idle threads are unparked, they play idle for the
>> specified amount of time and they schedule themselves. The last thread sets
>> the next idle injection deadline and when the timer expires it wakes up all
>> the threads which in turn play idle again. Meanwhile the running cycle is
>> changed by set_cur_state.  When the mitigation ends, the threads are parked.
>> The algorithm is self adaptive, so there is no need to handle hotplugging.
> 
> The idle injection threads are RT threads (FIFO), and I saw that
> play_idle() sets/clears the PF_IDLE flag for them.  Will the idle
> injection threads' utilization be accounted into the RT utilization?
> 
> If the idle injection threads' utilization is accounted as RT task
> utilization, will this impact the 'schedutil' CPUFreq governor's OPP
> selection?

Hi Leo,

The idle injection task has a very low utilization when it is not in
the play_idle function: basically it wakes up, sets a timer and calls
play_idle().

Regarding the use case, the idle injection is the basic building block
for a combo cooling device with cpufreq + cpuidle. When the idle
injection is used alone, it is because there is no cpufreq driver for
the platform. If there is a cpufreq driver, then we should end up with
the cpu cooling device where we have control of the OPP (and there are
no idle injection threads) or the combo cooling device.

Unless I'm missing something, the idle injection threads won't impact
the OPP selection.





-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
  2018-03-27  3:35   ` Leo Yan
@ 2018-03-27 10:56     ` Daniel Lezcano
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Lezcano @ 2018-03-27 10:56 UTC (permalink / raw)
  To: Leo Yan
  Cc: edubezval, kevin.wangtao, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm,
	Viresh Kumar

On 27/03/2018 05:35, Leo Yan wrote:
> On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
> 
> [...]
> 
>> +/**
>> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
>> + * @arg: a void pointer containing the idle cooling device address
>> + *
>> + * This main function does basically two operations:
>> + *
>> + * - Goes idle for a specific amount of time
>> + *
>> + * - Sets a timer to wake up all the idle injection threads after a
>> + *   running period
>> + *
>> + * That happens only when the mitigation is enabled, otherwise the
>> + * task is scheduled out.
>> + *
>> + * In order to keep the tasks synchronized together, it is the last
>> + * task exiting the idle period which is in charge of setting the
>> + * timer.
>> + *
>> + * This function never returns.
>> + */
>> +static int cpuidle_cooling_injection_thread(void *arg)
>> +{
>> +	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
> 
> I am just wondering if we should set the priority to (MAX_RT_PRIO - 1)?
> Otherwise I am concerned that the cluster may fail to enter a deep idle
> state when any CPU's idle injection thread is preempted by other higher
> priority RT threads, so the CPUs lose their alignment for idle state
> entry/exit.

I do believe we should consider other RT tasks more important than the
idle injection threads.

>> +	struct cpuidle_cooling_device *idle_cdev = arg;
>> +	struct cpuidle_cooling_tsk *cct = per_cpu_ptr(&cpuidle_cooling_tsk,
>> +						      smp_processor_id());
>> +	DEFINE_WAIT(wait);
>> +
>> +	set_freezable();
>> +
>> +	sched_setscheduler(current, SCHED_FIFO, &param);
>> +
>> +	while (1) {
>> +		s64 next_wakeup;
>> +
>> +		prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);
>> +
>> +		schedule();
>> +
>> +		atomic_inc(&idle_cdev->count);
>> +
>> +		play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
>> +
>> +		/*
>> +		 * The last CPU waking up is in charge of setting the
>> +		 * timer. If the CPU is hotplugged, the timer will
>> +		 * move to another CPU (which may not belong to the
>> +		 * same cluster) but that is not a problem as the
>> +		 * timer will be set again by another CPU belonging to
>> +		 * the cluster, so this mechanism is self adaptive and
>> +		 * does not require any hotplugging dance.
>> +		 */
>> +		if (!atomic_dec_and_test(&idle_cdev->count))
>> +			continue;
>> +
>> +		if (!idle_cdev->state)
>> +			continue;
>> +
>> +		next_wakeup = cpuidle_cooling_runtime(idle_cdev);
>> +
>> +		hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
>> +			      HRTIMER_MODE_REL_PINNED);
> 
> If the SoC temperature decreases below the trip point, will the timer be
> disabled in that case?  Or will the next timer event be set with a big
> value from next_wakeup?

Another timer (the polling one) will update the 'state' variable to zero
via set_cur_state(). In the worst case, we check idle_cdev->state
right before it is updated and we end up with one extra idle injection
cycle, which is perfectly fine.
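
To illustrate (a minimal sketch of that path only; the wakeup of the
injection threads when the mitigation starts is left out):

        static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
                                                 unsigned long state)
        {
                struct cpuidle_cooling_device *idle_cdev = cdev->devdata;

                /*
                 * Stored without extra locking: the injection loop reads
                 * idle_cdev->state after play_idle() and simply stops
                 * rearming the hrtimer once it sees zero.
                 */
                idle_cdev->state = state;

                return 0;
        }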





-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
  2018-03-27  3:43       ` Leo Yan
@ 2018-03-27 11:10         ` Daniel Lezcano
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Lezcano @ 2018-03-27 11:10 UTC (permalink / raw)
  To: Leo Yan
  Cc: Viresh Kumar, edubezval, kevin.wangtao, vincent.guittot,
	amit.kachhap, linux-kernel, javi.merino, rui.zhang,
	daniel.thompson, linux-pm

On 27/03/2018 05:43, Leo Yan wrote:
> On Fri, Feb 23, 2018 at 12:28:51PM +0100, Daniel Lezcano wrote:
>> On 23/02/2018 08:34, Viresh Kumar wrote:
>>> On 21-02-18, 16:29, Daniel Lezcano wrote:
> 
>> [ ... ]
>>
>>>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
>>>> +{
>>>> +	s64 next_wakeup;
>>>> +	int state = idle_cdev->state;
>>>> +
>>>> +	/*
>>>> +	 * The function must never be called when there is no
>>>> +	 * mitigation because:
>>>> +	 * - that does not make sense
>>>> +	 * - we end up with a division by zero
>>>> +	 */
>>>> +	BUG_ON(!state);
>>>
>>> As there is no locking in place, we can surely hit this case. What if
>>> the state changed to 0 right before this routine was called ?
>>>
>>> I would suggest we should just return 0 in that case and get away with
>>> the BUG_ON().
> 
> Here, if 'state' equals 0 and we return 0, the return value is the same
> as when 'state' = 100, which makes the return value ambiguous.
> 
> I think for 'state' = 0 we should return -1, to indicate that the
> hrtimer must not be set in this case?

Yeah, I will look at how to make this smoother.

Thanks

  -- Daniel


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
  2018-03-27 10:26     ` Daniel Lezcano
@ 2018-03-27 12:28       ` Juri Lelli
  2018-03-27 12:31         ` Daniel Lezcano
  0 siblings, 1 reply; 22+ messages in thread
From: Juri Lelli @ 2018-03-27 12:28 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Leo Yan, edubezval, kevin.wangtao, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm,
	Viresh Kumar

Hi Daniel,

On 27/03/18 12:26, Daniel Lezcano wrote:
> On 27/03/2018 04:03, Leo Yan wrote:
> > Hi Daniel,
> > 
> > On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
> >> The cpu idle cooling driver performs synchronized idle injection across all
> >> cpus belonging to the same cluster and offers a new method to cool down a SoC.
> >>
> >> Each cluster has its own idle cooling device, each core has its own idle
> >> injection thread, each idle injection thread uses play_idle to enter idle.  In
> >> order to reach the deepest idle state, each cooling device has the idle
> >> injection threads synchronized together.
> >>
> >> It has some similarity with the Intel powerclamp driver, but it is actually
> >> designed to work on the ARM architecture via the DT, with a mathematical
> >> proof of the power model which comes with the Documentation.
> >>
> >> The idle injection cycle is fixed while the running cycle is variable. That
> >> allows control over the device reactivity for the user experience. At
> >> the mitigation point the idle threads are unparked, they play idle for the
> >> specified amount of time and they schedule themselves. The last thread sets
> >> the next idle injection deadline and when the timer expires it wakes up all
> >> the threads which in turn play idle again. Meanwhile the running cycle is
> >> changed by set_cur_state.  When the mitigation ends, the threads are parked.
> >> The algorithm is self adaptive, so there is no need to handle hotplugging.
> > 
> > The idle injection threads are RT threads (FIFO), and I saw that
> > play_idle() sets/clears the PF_IDLE flag for them.  Will the idle
> > injection threads' utilization be accounted into the RT utilization?
> > 
> > If the idle injection threads' utilization is accounted as RT task
> > utilization, will this impact the 'schedutil' CPUFreq governor's OPP
> > selection?
> 
> Hi Leo,
> 
> The idle injection task has a very low utilization when it is not in
> the play_idle function: basically it wakes up, sets a timer and calls
> play_idle().
> 
> Regarding the use case, the idle injection is the basic building block
> for a combo cooling device with cpufreq + cpuidle. When the idle
> injection is used alone, it is because there is no cpufreq driver for
> the platform. If there is a cpufreq driver, then we should end up with
> the cpu cooling device where we have control of the OPP (and there are
> no idle injection threads) or the combo cooling device.
> 
> Unless I'm missing something, the idle injection threads won't impact
> the OPP selection.

Mmm, they actually might. schedutil selects max OPP as soon as it sees
an RT thread. I fear these guys might generate unwanted spikes. Maybe
you can filter them out?

Best,

- Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
  2018-03-27 12:28       ` Juri Lelli
@ 2018-03-27 12:31         ` Daniel Lezcano
  2018-03-27 13:08           ` Juri Lelli
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Lezcano @ 2018-03-27 12:31 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Leo Yan, edubezval, kevin.wangtao, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm,
	Viresh Kumar

On 27/03/2018 14:28, Juri Lelli wrote:
> Hi Daniel,
> 
> On 27/03/18 12:26, Daniel Lezcano wrote:
>> On 27/03/2018 04:03, Leo Yan wrote:
>>> Hi Daniel,
>>>
>>> On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
>>>> The cpu idle cooling driver performs synchronized idle injection across all
>>>> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>>>>
>>>> Each cluster has its own idle cooling device, each core has its own idle
>>>> injection thread, each idle injection thread uses play_idle to enter idle.  In
>>>> order to reach the deepest idle state, each cooling device has the idle
>>>> injection threads synchronized together.
>>>>
>>>> It has some similarity with the Intel powerclamp driver, but it is actually
>>>> designed to work on the ARM architecture via the DT, with a mathematical
>>>> proof of the power model which comes with the Documentation.
>>>>
>>>> The idle injection cycle is fixed while the running cycle is variable. That
>>>> allows control over the device reactivity for the user experience. At
>>>> the mitigation point the idle threads are unparked, they play idle for the
>>>> specified amount of time and they schedule themselves. The last thread sets
>>>> the next idle injection deadline and when the timer expires it wakes up all
>>>> the threads which in turn play idle again. Meanwhile the running cycle is
>>>> changed by set_cur_state.  When the mitigation ends, the threads are parked.
>>>> The algorithm is self adaptive, so there is no need to handle hotplugging.
>>>
>>> The idle injection threads are RT threads (FIFO), and I saw that
>>> play_idle() sets/clears the PF_IDLE flag for them.  Will the idle
>>> injection threads' utilization be accounted into the RT utilization?
>>>
>>> If the idle injection threads' utilization is accounted as RT task
>>> utilization, will this impact the 'schedutil' CPUFreq governor's OPP
>>> selection?
>>
>> Hi Leo,
>>
>> The idle injection task has a very low utilization when it is not in
>> the play_idle function: basically it wakes up, sets a timer and calls
>> play_idle().
>>
>> Regarding the use case, the idle injection is the basic building block
>> for a combo cooling device with cpufreq + cpuidle. When the idle
>> injection is used alone, it is because there is no cpufreq driver for
>> the platform. If there is a cpufreq driver, then we should end up with
>> the cpu cooling device where we have control of the OPP (and there are
>> no idle injection threads) or the combo cooling device.
>>
>> Unless I'm missing something, the idle injection threads won't impact
>> the OPP selection.
> 
> Mmm, they actually might. schedutil selects max OPP as soon as it sees
> an RT thread. I fear these guys might generate unwanted spikes. Maybe
> you can filter them out?

Yes, absolutely. Leo pointed it also.

We might want to change the check at:

https://elixir.bootlin.com/linux/v4.16-rc7/source/kernel/sched/cpufreq_schedutil.c#L364

in order to ignore PF_IDLE tagged tasks.
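
Something along these lines, purely as an illustration (not an actual
patch against schedutil, and task_is_idle_injection() is a made-up
helper name):

        /*
         * The idle injection threads carry PF_IDLE while they are inside
         * play_idle(), so they can be told apart from regular RT tasks.
         */
        static inline bool task_is_idle_injection(struct task_struct *p)
        {
                return !!(p->flags & PF_IDLE);
        }

so that the "RT task => max frequency" shortcut could become something
like:

        if ((flags & SCHED_CPUFREQ_RT_DL) &&
            !task_is_idle_injection(current))
                next_f = policy->cpuinfo.max_freq;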



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
  2018-03-27 12:31         ` Daniel Lezcano
@ 2018-03-27 13:08           ` Juri Lelli
  0 siblings, 0 replies; 22+ messages in thread
From: Juri Lelli @ 2018-03-27 13:08 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Leo Yan, edubezval, kevin.wangtao, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm,
	Viresh Kumar

On 27/03/18 14:31, Daniel Lezcano wrote:
> On 27/03/2018 14:28, Juri Lelli wrote:
> > Hi Daniel,
> > 
> > On 27/03/18 12:26, Daniel Lezcano wrote:
> >> On 27/03/2018 04:03, Leo Yan wrote:
> >>> Hi Daniel,
> >>>
> >>> On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
> >>>> The cpu idle cooling driver performs synchronized idle injection across all
> >>>> cpus belonging to the same cluster and offers a new method to cool down a SoC.
> >>>>
> >>>> Each cluster has its own idle cooling device, each core has its own idle
> >>>> injection thread, each idle injection thread uses play_idle to enter idle.  In
> >>>> order to reach the deepest idle state, each cooling device has the idle
> >>>> injection threads synchronized together.
> >>>>
> >>>> It has some similarity with the Intel powerclamp driver, but it is actually
> >>>> designed to work on the ARM architecture via the DT, with a mathematical
> >>>> proof of the power model which comes with the Documentation.
> >>>>
> >>>> The idle injection cycle is fixed while the running cycle is variable. That
> >>>> allows control over the device reactivity for the user experience. At
> >>>> the mitigation point the idle threads are unparked, they play idle for the
> >>>> specified amount of time and they schedule themselves. The last thread sets
> >>>> the next idle injection deadline and when the timer expires it wakes up all
> >>>> the threads which in turn play idle again. Meanwhile the running cycle is
> >>>> changed by set_cur_state.  When the mitigation ends, the threads are parked.
> >>>> The algorithm is self adaptive, so there is no need to handle hotplugging.
> >>>
> >>> The idle injection threads are RT threads (FIFO), and I saw that
> >>> play_idle() sets/clears the PF_IDLE flag for them.  Will the idle
> >>> injection threads' utilization be accounted into the RT utilization?
> >>>
> >>> If the idle injection threads' utilization is accounted as RT task
> >>> utilization, will this impact the 'schedutil' CPUFreq governor's OPP
> >>> selection?
> >>
> >> Hi Leo,
> >>
> >> The idle injection task has a very low utilization when it is not in
> >> the play_idle function: basically it wakes up, sets a timer and calls
> >> play_idle().
> >>
> >> Regarding the use case, the idle injection is the basic building block
> >> for a combo cooling device with cpufreq + cpuidle. When the idle
> >> injection is used alone, it is because there is no cpufreq driver for
> >> the platform. If there is a cpufreq driver, then we should end up with
> >> the cpu cooling device where we have control of the OPP (and there are
> >> no idle injection threads) or the combo cooling device.
> >>
> >> Unless I'm missing something, the idle injection threads won't impact
> >> the OPP selection.
> > 
> > Mmm, they actually might. schedutil selects max OPP as soon as it sees
> > an RT thread. I fear these guys might generate unwanted spikes. Maybe
> > you can filter them out?
> 
> Yes, absolutely. Leo pointed it also.
> 
> We might want to change the check at:
> 
> https://elixir.bootlin.com/linux/v4.16-rc7/source/kernel/sched/cpufreq_schedutil.c#L364
> 
> in order to ignore PF_IDLE tagged tasks.

We might, yes. And also for the update_single cases, I guess.

Best,

- Juri

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
  2018-02-26  4:30       ` [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver Viresh Kumar
  2018-03-13 19:15         ` Daniel Lezcano
@ 2018-04-04  8:50         ` Daniel Lezcano
  2018-04-05  4:49           ` Viresh Kumar
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Lezcano @ 2018-04-04  8:50 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: edubezval, kevin.wangtao, leo.yan, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm

On 26/02/2018 05:30, Viresh Kumar wrote:

[ ... ]

>>>> +
>>>> +	for_each_possible_cpu(cpu) {
>>>> +		cpumask = topology_core_cpumask(cpu);
>>>> +
>>>> +		cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
>>>> +
>>>> +		/*
>>>> +		 * This condition makes the first cpu belonging to the
>>>> +		 * cluster to create a cooling device and allocates
>>>> +		 * the structure. Others CPUs belonging to the same
>>>> +		 * cluster will just increment the refcount on the
>>>> +		 * cooling device structure and initialize it.
>>>> +		 */
>>>> +		if (cpu == cpumask_first(cpumask)) {
>>>
>>> Your function still has a few assumptions about CPU numbering and it
>>> will break in a few cases. What if the CPUs on a big.LITTLE system
>>> (4x4) are present in this order: B L L L L B B B ??
>>>
>>> This configuration can happen if the CPUs in the DT are marked as:
>>> 0-3 LITTLE, 4-7 big, and a big CPU is used by the boot loader to
>>> bring up Linux.
>>
>> Ok, how can I sort it out ?
> 
> I would do something like this:
> 
>         cpumask_copy(possible, cpu_possible_mask);
>         
>         while (!cpumask_empty(possible)) {
>                 first = cpumask_first(possible);
>                 cpumask = topology_core_cpumask(first);
>                 cpumask_andnot(possible, possible, cpumask);
>         
>                 allocate_cooling_dev(first); //This is most of this function in your patch.
>         
>                 while (!cpumask_empty(cpumask)) {
>                         temp = cpumask_first(cpumask);
>                         //rest init "temp"
>                         cpumask_clear_cpu(temp, cpumask);
>                 }
>         
>                 //Everything done, register cooling device for cpumask.
>         }


Mmh, that sounds very complex. Maybe it is simpler to count the number
of clusters and initialize the idle_cdev for each cluster, and then go
through this loop with the cluster cpumask.




-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
  2018-04-04  8:50         ` Daniel Lezcano
@ 2018-04-05  4:49           ` Viresh Kumar
  0 siblings, 0 replies; 22+ messages in thread
From: Viresh Kumar @ 2018-04-05  4:49 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: edubezval, kevin.wangtao, leo.yan, vincent.guittot, amit.kachhap,
	linux-kernel, javi.merino, rui.zhang, daniel.thompson, linux-pm

On 04-04-18, 10:50, Daniel Lezcano wrote:
> Mmh, that sounds very complex. May be it is simpler to count the number
> of cluster and initialize the idle_cdev for each cluster and then go for
> this loop with the cluster cpumask.

Maybe, not sure. I have had such code in the past and it was quite
straightforward to understand :)

You can go with whichever version you like.
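
FWIW, fleshed out it would look something like this (untested sketch;
cpuidle_cooling_alloc() and cpuidle_cooling_init_tsk() are made-up
names for your existing allocation and per-CPU thread setup code):

        struct cpuidle_cooling_device *idle_cdev;
        const struct cpumask *cluster;
        cpumask_var_t possible;
        int first, cpu, ret = 0;

        if (!alloc_cpumask_var(&possible, GFP_KERNEL))
                return -ENOMEM;

        cpumask_copy(possible, cpu_possible_mask);

        while (!cpumask_empty(possible)) {
                /* First CPU of a cluster we have not visited yet */
                first = cpumask_first(possible);
                cluster = topology_core_cpumask(first);

                /* Drop the whole cluster from the CPUs left to visit */
                cpumask_andnot(possible, possible, cluster);

                /* One cooling device per cluster ... */
                idle_cdev = cpuidle_cooling_alloc(cluster);
                if (!idle_cdev) {
                        ret = -ENOMEM;
                        break;
                }

                /* ... and one injection thread per CPU of the cluster */
                for_each_cpu(cpu, cluster)
                        cpuidle_cooling_init_tsk(idle_cdev, cpu);
        }

        free_cpumask_var(possible);
        return ret;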

-- 
viresh

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-04-05  4:49 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1519226968-19821-1-git-send-email-daniel.lezcano@linaro.org>
     [not found] ` <1519226968-19821-6-git-send-email-daniel.lezcano@linaro.org>
2018-03-06 23:19   ` [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation Pavel Machek
2018-03-07 11:42     ` Daniel Lezcano
2018-03-08  8:59       ` Pavel Machek
2018-03-08 11:54         ` Daniel Thompson
2018-03-07 17:09 ` [PATCH V2 0/7] CPU cooling device new strategies Eduardo Valentin
2018-03-07 18:57   ` Daniel Lezcano
2018-03-08 12:03     ` Daniel Thompson
2018-03-26 14:30       ` Leo Yan
2018-03-27  9:35         ` Daniel Lezcano
     [not found] ` <1519226968-19821-7-git-send-email-daniel.lezcano@linaro.org>
     [not found]   ` <20180223073432.GF26947@vireshk-i7>
     [not found]     ` <faaf027c-e01c-6801-9a0c-ab7e0ba669a1@linaro.org>
2018-02-26  4:30       ` [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver Viresh Kumar
2018-03-13 19:15         ` Daniel Lezcano
2018-04-04  8:50         ` Daniel Lezcano
2018-04-05  4:49           ` Viresh Kumar
2018-03-27  3:43       ` Leo Yan
2018-03-27 11:10         ` Daniel Lezcano
2018-03-27  2:03   ` Leo Yan
2018-03-27 10:26     ` Daniel Lezcano
2018-03-27 12:28       ` Juri Lelli
2018-03-27 12:31         ` Daniel Lezcano
2018-03-27 13:08           ` Juri Lelli
2018-03-27  3:35   ` Leo Yan
2018-03-27 10:56     ` Daniel Lezcano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).