Message ID  20210301212149.228775daniel.lezcano@linaro.org 

State  New, archived 
Hi Daniel,

I've started reviewing the series, please find some comments below.

On 3/1/21 9:21 PM, Daniel Lezcano wrote:
> Currently the power consumption is based on the current OPP power
> assuming the entire performance domain is fully loaded.
>
> That gives very gross power estimation and we can do much better by
> using the load to scale the power consumption.
>
> Use the utilization to normalize and scale the power usage over the
> max possible power.
>
> Tested on a rock960 with 2 big CPUS, the power consumption estimation
> conforms with the expected one.
>
> Before this change:
>
> ~$ ~/dhrystone -t 1 -l 10000&
> ~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
> 2260000
>
> After this change:
>
> ~$ ~/dhrystone -t 1 -l 10000&
> ~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
> 1130000
>
> ~$ ~/dhrystone -t 2 -l 10000&
> ~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
> 2260000
>
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> ---
>  drivers/powercap/dtpm_cpu.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
> index e728ebd6d0ca..8379b96468ef 100644
> --- a/drivers/powercap/dtpm_cpu.c
> +++ b/drivers/powercap/dtpm_cpu.c
> @@ -68,27 +68,40 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
>  	return power_limit;
>  }
>
> +static u64 scale_pd_power_uw(struct cpumask *cpus, u64 power)

renamed 'cpus' into 'pd_mask', see below

> +{
> +	unsigned long max, util;
> +	int cpu, load = 0;

IMHO 'int load' looks odd when used with 'util' and 'max'.
I would put in the line above to have them all the same type and
renamed to 'sum_util'.

> +
> +	for_each_cpu(cpu, cpus) {

I would avoid the temporary CPU mask in the get_pd_power_uw()
with this modified loop:

for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {

> +		max = arch_scale_cpu_capacity(cpu);
> +		util = sched_cpu_util(cpu, max);
> +		load += ((util * 100) / max);

Below you can find 3 optimizations. Since we are not in the hot
path here, it's up to if you would like to use all/some of them
or just ignore.

1st optimization.
If we use 'load += (util << 10) / max' in the loop, then
we could avoid div by 100 and use a right shift:
(power * load) >> 10

2nd optimization.
Since we use EM CPU mask, which span all CPUs with the same
arch_scale_cpu_capacity(), you can avoid N divs inside the loop
and do it once, below the loop.

3rd optimization.
If we just simply add all 'util' into 'sum_util' (no mul or div in
the loop), then we might just have simple macro

#define CALC_POWER_USAGE(power, sum_util, max) \
	(((power * (sum_util << 10)) / max) >> 10)

> +	}
> +
> +	return (power * load) / 100;
> +}
> +
>  static u64 get_pd_power_uw(struct dtpm *dtpm)
>  {
>  	struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
>  	struct em_perf_domain *pd;
>  	struct cpumask cpus;

Since we don't need the 'nr_cpus' we also don't need the cpumask
which occupy stack; Maybe use
	struct cpumask *pd_mask;
then

>  	unsigned long freq;
> -	int i, nr_cpus;
> +	int i;
>
>  	pd = em_cpu_get(dtpm_cpu->cpu);
>  	freq = cpufreq_quick_get(dtpm_cpu->cpu);
>
>  	cpumask_and(&cpus, cpu_online_mask, to_cpumask(pd->cpus));

remove ^^^^^
and set
	pd_mask = em_span_cpus(pd);

> -	nr_cpus = cpumask_weight(&cpus);
>
>  	for (i = 0; i < pd->nr_perf_states; i++) {
>
>  		if (pd->table[i].frequency < freq)
>  			continue;
>
> -		return pd->table[i].power *
> -			MICROWATT_PER_MILLIWATT * nr_cpus;
> +		return scale_pd_power_uw(&cpus, pd->table[i].power *
> +					 MICROWATT_PER_MILLIWATT);

Instead of '&cpus' I would put 'pd_mask' and that should do the job.

>  	}
>
>  	return 0;
>

Apart from that, the design idea with util looks good.

Regards,
Lukasz
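
To make the three suggested optimizations concrete, here is a minimal user-space sketch, not kernel code. It compares the percentage-based loop from the patch with the shift-based variants from the review. The function names and the use of plain arrays in place of `sched_cpu_util()` and `arch_scale_cpu_capacity()` are assumptions made for illustration, and all CPUs are assumed to share one `max` capacity, as the review states for an EM perf domain.

```c
#include <assert.h>
#include <stdint.h>

/* As in the posted patch: per-CPU load as an integer percentage. */
uint64_t scale_percent(const uint64_t *util, int n, uint64_t max,
		       uint64_t power)
{
	uint64_t load = 0;

	assert(max != 0);
	for (int i = 0; i < n; i++)
		load += (util[i] * 100) / max;	/* truncates per CPU */

	return (power * load) / 100;
}

/* 1st optimization: per-CPU load in 1/1024 units, shift instead of /100. */
uint64_t scale_shift(const uint64_t *util, int n, uint64_t max,
		     uint64_t power)
{
	uint64_t load = 0;

	assert(max != 0);
	for (int i = 0; i < n; i++)
		load += (util[i] << 10) / max;

	return (power * load) >> 10;
}

/* 2nd and 3rd optimizations: sum raw util, divide once after the loop. */
uint64_t scale_sum(const uint64_t *util, int n, uint64_t max,
		   uint64_t power)
{
	uint64_t sum_util = 0;

	assert(max != 0);
	for (int i = 0; i < n; i++)
		sum_util += util[i];

	return ((power * (sum_util << 10)) / max) >> 10;
}
```

With four CPUs at util 1000 out of 1024 and a hypothetical per-CPU power of 100000 uW, the percentage variant loses precision to the per-CPU truncation, while both shift-based variants return the exact value of 100000 * 4000 / 1024.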
Hi Lukasz,

thanks for your comments, one question below.

On 09/03/2021 11:01, Lukasz Luba wrote:

[ ... ]

>> +static u64 scale_pd_power_uw(struct cpumask *cpus, u64 power)
>
> renamed 'cpus' into 'pd_mask', see below
>
>> +{
>> +	unsigned long max, util;
>> +	int cpu, load = 0;
>
> IMHO 'int load' looks odd when used with 'util' and 'max'.
> I would put in the line above to have them all the same type and
> renamed to 'sum_util'.
>
>> +
>> +	for_each_cpu(cpu, cpus) {
>
> I would avoid the temporary CPU mask in the get_pd_power_uw()
> with this modified loop:
>
> for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
>
>
>> +		max = arch_scale_cpu_capacity(cpu);
>> +		util = sched_cpu_util(cpu, max);
>> +		load += ((util * 100) / max);
>
> Below you can find 3 optimizations. Since we are not in the hot
> path here, it's up to if you would like to use all/some of them
> or just ignore.
>
> 1st optimization.
> If we use 'load += (util << 10) / max' in the loop, then
> we could avoid div by 100 and use a right shift:
> (power * load) >> 10
>
> 2nd optimization.
> Since we use EM CPU mask, which span all CPUs with the same
> arch_scale_cpu_capacity(), you can avoid N divs inside the loop
> and do it once, below the loop.
>
> 3rd optimization.
> If we just simply add all 'util' into 'sum_util' (no mul or div in
> the loop), then we might just have simple macro
>
> #define CALC_POWER_USAGE(power, sum_util, max) \
> 	(((power * (sum_util << 10)) / max) >> 10)

I don't understand the 'max' division, I was expecting here something
like: ((sum_util << 10) / sum_max) >> 10)

no ?
On 09/03/2021 11:01, Lukasz Luba wrote:
> Hi Daniel,
>
> I've started reviewing the series, please find some comments below.
>
> On 3/1/21 9:21 PM, Daniel Lezcano wrote:
>> Currently the power consumption is based on the current OPP power
>> assuming the entire performance domain is fully loaded.
>>
>> That gives very gross power estimation and we can do much better by
>> using the load to scale the power consumption.
>>
>> Use the utilization to normalize and scale the power usage over the
>> max possible power.
>>
>> Tested on a rock960 with 2 big CPUS, the power consumption estimation
>> conforms with the expected one.
>>
>> Before this change:
>>
>> ~$ ~/dhrystone -t 1 -l 10000&
>> ~$ cat
>> /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
>>
>> 2260000
>>
>> After this change:
>>
>> ~$ ~/dhrystone -t 1 -l 10000&
>> ~$ cat
>> /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
>>
>> 1130000
>>
>> ~$ ~/dhrystone -t 2 -l 10000&
>> ~$ cat
>> /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
>>
>> 2260000
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>> ---
>>  drivers/powercap/dtpm_cpu.c | 21 +++++++++++++++++----
>>  1 file changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
>> index e728ebd6d0ca..8379b96468ef 100644
>> --- a/drivers/powercap/dtpm_cpu.c
>> +++ b/drivers/powercap/dtpm_cpu.c
>> @@ -68,27 +68,40 @@ static u64 set_pd_power_limit(struct dtpm *dtpm,
>> u64 power_limit)
>>  	return power_limit;
>>  }
>> +static u64 scale_pd_power_uw(struct cpumask *cpus, u64 power)
>
> renamed 'cpus' into 'pd_mask', see below
>
>> +{
>> +	unsigned long max, util;
>> +	int cpu, load = 0;
>
> IMHO 'int load' looks odd when used with 'util' and 'max'.
> I would put in the line above to have them all the same type and
> renamed to 'sum_util'.
>
>> +
>> +	for_each_cpu(cpu, cpus) {
>
> I would avoid the temporary CPU mask in the get_pd_power_uw()
> with this modified loop:
>
> for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
>
>
>> +		max = arch_scale_cpu_capacity(cpu);
>> +		util = sched_cpu_util(cpu, max);
>> +		load += ((util * 100) / max);
>
> Below you can find 3 optimizations. Since we are not in the hot
> path here, it's up to if you would like to use all/some of them
> or just ignore.
>
> 1st optimization.
> If we use 'load += (util << 10) / max' in the loop, then
> we could avoid div by 100 and use a right shift:
> (power * load) >> 10
>
> 2nd optimization.
> Since we use EM CPU mask, which span all CPUs with the same
> arch_scale_cpu_capacity(), you can avoid N divs inside the loop
> and do it once, below the loop.
>
> 3rd optimization.
> If we just simply add all 'util' into 'sum_util' (no mul or div in
> the loop), then we might just have simple macro
>
> #define CALC_POWER_USAGE(power, sum_util, max) \
> 	(((power * (sum_util << 10)) / max) >> 10)

static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
{
	unsigned long max, sum_max = 0, sum_util = 0;
	int cpu;

	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
		max = arch_scale_cpu_capacity(cpu);
		sum_util += sched_cpu_util(cpu, max);
		sum_max += max;
	}

	return (power * ((sum_util << 10) / sum_max)) >> 10;
}

??
On 3/9/21 7:03 PM, Daniel Lezcano wrote:
>
> Hi Lukasz,
>
> thanks for your comments, one question below.
>
> On 09/03/2021 11:01, Lukasz Luba wrote:
>
> [ ... ]
>
>>> +static u64 scale_pd_power_uw(struct cpumask *cpus, u64 power)
>>
>> renamed 'cpus' into 'pd_mask', see below
>>
>>> +{
>>> +	unsigned long max, util;
>>> +	int cpu, load = 0;
>>
>> IMHO 'int load' looks odd when used with 'util' and 'max'.
>> I would put in the line above to have them all the same type and
>> renamed to 'sum_util'.
>>
>>> +
>>> +	for_each_cpu(cpu, cpus) {
>>
>> I would avoid the temporary CPU mask in the get_pd_power_uw()
>> with this modified loop:
>>
>> for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
>>
>>
>>> +		max = arch_scale_cpu_capacity(cpu);
>>> +		util = sched_cpu_util(cpu, max);
>>> +		load += ((util * 100) / max);
>>
>> Below you can find 3 optimizations. Since we are not in the hot
>> path here, it's up to if you would like to use all/some of them
>> or just ignore.
>>
>> 1st optimization.
>> If we use 'load += (util << 10) / max' in the loop, then
>> we could avoid div by 100 and use a right shift:
>> (power * load) >> 10
>>
>> 2nd optimization.
>> Since we use EM CPU mask, which span all CPUs with the same
>> arch_scale_cpu_capacity(), you can avoid N divs inside the loop
>> and do it once, below the loop.
>>
>> 3rd optimization.
>> If we just simply add all 'util' into 'sum_util' (no mul or div in
>> the loop), then we might just have simple macro
>>
>> #define CALC_POWER_USAGE(power, sum_util, max) \
>> 	(((power * (sum_util << 10)) / max) >> 10)
>
> I don't understand the 'max' division, I was expecting here something
> like: ((sum_util << 10) / sum_max) >> 10)
>
> no ?
>

No, it should be single 'max', which is in range 0..1024.
We would like to calculate the power for the whole perf domain,
e.g. 4 CPUs almost fully utilized would have util ~1000, then total
power should be around ~4 * EM_table[i].power.
This '~4' is coming from 4 utils divided by one max util
4000 / 1024

The 'max' in the equation can be put before the bracket, as well
as 'power'.

If we had floating point number, simple power for cpu1, cpu2, cpuN
would be just:

power_1 = power * util_1 / max
power_2 = power * util_2 / max
power_N = power * util_N / max
(since they have the same 'max' capacity and the same EM 'power')

The total domain power would be:
total_power = power_1 + power_2 + ... + power_N
which is:
total_power = (power * util_1 / max) + (power * util_2 / max) + ... +
              (power * util_N / max)
put the 'power' and 'max' before the bracket:
total_power = power * (util_1 + util_2 + ... + util_N) * (1/max)
introduce the 'sum_util':
sum_util = util_1 + util_2 + ... + util_N
then:
total_power = power * sum_util / max

Unfortunately, we don't use floating point, so temporary fixed
point tricks, thus the '<< 10' and '>> 10' avoid some errors
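
The difference between dividing by a single 'max' and dividing by 'sum_max' can be checked numerically with a small user-space sketch. The function names below are hypothetical, and the 100000 uW per-CPU power is an illustrative value rather than one taken from a real EM table. The inputs follow the 4-CPU example above (util ~1000 each, max 1024).

```c
#include <assert.h>
#include <stdint.h>

/* Single 'max': total_power = power * sum_util / max, in fixed point. */
uint64_t power_single_max(uint64_t power, uint64_t sum_util, uint64_t max)
{
	assert(max != 0);
	return ((power * (sum_util << 10)) / max) >> 10;
}

/* 'sum_max' variant: normalizes by the capacity of the whole domain. */
uint64_t power_sum_max(uint64_t power, uint64_t sum_util, uint64_t sum_max)
{
	assert(sum_max != 0);
	return (power * ((sum_util << 10) / sum_max)) >> 10;
}
```

For sum_util = 4000, the single 'max' form yields 390625 uW, close to 4 times the per-CPU power, which is the whole-domain figure the derivation aims for. The 'sum_max' form with sum_max = 4096 yields 97656 uW, slightly under the power of one CPU, because it computes the average utilization of the domain rather than the sum.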
diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
index e728ebd6d0ca..8379b96468ef 100644
--- a/drivers/powercap/dtpm_cpu.c
+++ b/drivers/powercap/dtpm_cpu.c
@@ -68,27 +68,40 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 	return power_limit;
 }
 
+static u64 scale_pd_power_uw(struct cpumask *cpus, u64 power)
+{
+	unsigned long max, util;
+	int cpu, load = 0;
+
+	for_each_cpu(cpu, cpus) {
+		max = arch_scale_cpu_capacity(cpu);
+		util = sched_cpu_util(cpu, max);
+		load += ((util * 100) / max);
+	}
+
+	return (power * load) / 100;
+}
+
 static u64 get_pd_power_uw(struct dtpm *dtpm)
 {
 	struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
 	struct em_perf_domain *pd;
 	struct cpumask cpus;
 	unsigned long freq;
-	int i, nr_cpus;
+	int i;
 
 	pd = em_cpu_get(dtpm_cpu->cpu);
 	freq = cpufreq_quick_get(dtpm_cpu->cpu);
 
 	cpumask_and(&cpus, cpu_online_mask, to_cpumask(pd->cpus));
-	nr_cpus = cpumask_weight(&cpus);
 
 	for (i = 0; i < pd->nr_perf_states; i++) {
 
 		if (pd->table[i].frequency < freq)
 			continue;
 
-		return pd->table[i].power *
-			MICROWATT_PER_MILLIWATT * nr_cpus;
+		return scale_pd_power_uw(&cpus, pd->table[i].power *
+					 MICROWATT_PER_MILLIWATT);
 	}
 
 	return 0;
Currently the power consumption is based on the current OPP power
assuming the entire performance domain is fully loaded.

That gives very gross power estimation and we can do much better by
using the load to scale the power consumption.

Use the utilization to normalize and scale the power usage over the
max possible power.

Tested on a rock960 with 2 big CPUS, the power consumption estimation
conforms with the expected one.

Before this change:

~$ ~/dhrystone -t 1 -l 10000&
~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
2260000

After this change:

~$ ~/dhrystone -t 1 -l 10000&
~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
1130000

~$ ~/dhrystone -t 2 -l 10000&
~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
2260000

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 drivers/powercap/dtpm_cpu.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)
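
The before/after readings in the commit message can be reproduced with a small user-space model. This is a sketch under stated assumptions: the per-CPU OPP power of 1130000 uW is inferred as 2260000 / 2 from the reported values, a busy CPU is idealized as util == max (1024), an idle CPU as util == 0, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Before the patch: OPP power times the number of online CPUs. */
uint64_t estimate_before(uint64_t power_uw, int nr_cpus)
{
	return power_uw * (uint64_t)nr_cpus;
}

/* After the patch: OPP power scaled by the summed per-CPU load in %. */
uint64_t estimate_after(uint64_t power_uw, const uint64_t *util, int n,
			uint64_t max)
{
	uint64_t load = 0;

	assert(max != 0);
	for (int i = 0; i < n; i++)
		load += (util[i] * 100) / max;

	return (power_uw * load) / 100;
}
```

Under these assumptions, one busy CPU out of two gives 1130000 uW with the new estimate, where the old one reported 2260000 uW regardless of load. Two busy CPUs give 2260000 uW in both models.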