On Thu, 30 Dec 2021, Rafael J. Wysocki wrote:

> On Thu, Dec 30, 2021 at 7:21 PM Julia Lawall wrote:
> >
> > On Thu, 30 Dec 2021, Rafael J. Wysocki wrote:
> >
> > > On Thu, Dec 30, 2021 at 6:54 PM Julia Lawall wrote:
> > > >
> > > > > > The effect is the same. But that approach is indeed simpler than patching
> > > > > > the kernel.
> > > > >
> > > > > It is also applicable when intel_pstate runs in the active mode.
> > > > >
> > > > > As for the results that you have reported, it looks like the package
> > > > > power on these systems is dominated by package voltage and going from
> > > > > P-state 20 to P-state 21 causes that voltage to increase significantly
> > > > > (the observed RAM energy usage pattern is consistent with that). This
> > > > > means that running at P-states above 20 is only really justified if
> > > > > there is a strict performance requirement that can't be met otherwise.
> > > > >
> > > > > Can you please check what value is there in the base_frequency sysfs
> > > > > attribute under cpuX/cpufreq/?
> > > >
> > > > 2100000, which should be pstate 21
> > > >
> > > > >
> > > > > I'm guessing that the package voltage level for P-states 10 and 20 is
> > > > > the same, so the power difference between them is not significant
> > > > > relative to the difference between P-state 20 and 21 and if increasing
> > > > > the P-state causes some extra idle time to appear in the workload
> > > > > (even though there is not enough of it to prevent the overall
> > > > > utilization from increasing), then the overall power draw when running
> > > > > at P-state 10 may be greater than for P-state 20.
> > > >
> > > > My impression is that the package voltage level for P-states 10 to 20 is
> > > > high enough that increasing the frequency has little impact. But the code
> > > > runs twice as fast, which reduces the execution time a lot, saving energy.
> > > >
> > > > My first experiment had only one running thread. I also tried running 32
> > > > spinning threads for 10 seconds, i.e. using up one package and leaving the
> > > > other idle. In this case, instead of staying around 600J for pstates
> > > > 10-20, the energy rises from 743J to 946J. But there is still a gap between
> > > > 20 and 21, with 21 being 1392J.
> > > >
> > > > > You can check if there is any C-state residency difference between
> > > > > these two cases by running the workload under turbostat in each of
> > > > > them.
> > > >
> > > > The C1 and C6 cases (CPU%c1 and CPU%c6) are about the same between 20 and
> > > > 21, whether with 1 thread or with 32 threads.
> > >
> > > I meant to compare P-state 10 and P-state 20.
> > >
> > > 20 and 21 are really close as far as the performance is concerned, so
> > > I wouldn't expect to see any significant C-state residency difference
> > > between them.
> >
> > There's also no difference between 10 and 20. This seems normal, because
> > the same cores are either fully used or fully idle in both cases. The
> > idle ones are almost always in C6.
>
> The turbostat output sent by you previously shows that the CPUs doing
> the work are only about 15-or-less percent busy, though, and you get
> quite a bit of C-state residency on them. I'm assuming that this is
> for 1 running thread.
>
> Can you please run the 32 spinning threads workload (i.e. on one
> package) and with P-state locked to 10 and then to 20 under turbostat
> and send me the turbostat output for both runs?

Attached.

Pstate 10: spin_minmax_10_dahu-9_5.15.0freq_schedutil_11.turbo
Pstate 20: spin_minmax_20_dahu-9_5.15.0freq_schedutil_11.turbo

julia
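
For anyone following along, the numbers quoted in this thread can be cross-checked with a short script. This is only a sketch: it assumes each P-state step corresponds to 100 MHz (which matches base_frequency 2100000 kHz being read as pstate 21 above), and the helper names are made up for illustration. It reads base_frequency from the usual cpufreq sysfs location if present and otherwise falls back to the value reported in the thread.

```python
from pathlib import Path

# Assumption: one P-state step equals 100 MHz (100000 kHz), consistent with
# base_frequency 2100000 kHz being described as pstate 21 in this thread.
BUS_CLOCK_KHZ = 100_000


def khz_to_pstate(freq_khz: int) -> int:
    """Convert a cpufreq frequency in kHz to a P-state number."""
    return freq_khz // BUS_CLOCK_KHZ


def pstate_to_khz(pstate: int) -> int:
    """Convert a P-state number to the matching cpufreq frequency in kHz."""
    return pstate * BUS_CLOCK_KHZ


def read_base_frequency_khz(cpu: int = 0):
    """Return base_frequency in kHz for one CPU, or None if it cannot be read."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/base_frequency")
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def relative_increase(before_j: float, after_j: float) -> float:
    """Fractional energy increase when going from before_j to after_j."""
    return (after_j - before_j) / before_j


if __name__ == "__main__":
    base_khz = read_base_frequency_khz()
    if base_khz is None:
        base_khz = 2_100_000  # value reported in this thread
    print(f"base_frequency {base_khz} kHz -> pstate {khz_to_pstate(base_khz)}")

    # Energy figures for the 32-thread run, as quoted in this thread (joules).
    print(f"pstate 10 -> 20: {relative_increase(743, 946):+.0%}")
    print(f"pstate 20 -> 21: {relative_increase(946, 1392):+.0%}")
```

With the 32-thread figures, the single step from 20 to 21 costs noticeably more than the whole range from 10 to 20, which is the gap discussed above.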