Over the last week I tested v4+pollv2 and now v5+pollv3. With v5, I
observe a particular idle behavior that I have not seen before with v4.
On a dual-socket Skylake system the idle power increases from 74.1 W
(system total) to 85.5 W with a 300 Hz build and even to 138.3 W with a
1000 Hz build. A similar Haswell-EP system is also affected.

There are phases during which one core keeps switching to the highest
C-state, but does not disable the sched tick. Every 4th sched tick, a
kworker is briefly scheduled on that core. Every wakeup of a single core
from C6 more than doubles the package power consumption of *both*
sockets for ~500 us, resulting in significantly increased sustained
power consumption.

This is illustrated in [1]. For a comparison of a "normal" phase (same
kernel), see [2]. For a global view of the effect on a 1000 Hz build,
see [3].

I have not yet found any particular triggers or the specific interaction
between the sched tick and the kworker. I'm not sure how this was
introduced in v5. I would guess it could be a feedback loop that I was
concerned about initially.

I have more findings from v4, but this seems much more impactful.

[1] https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/rjwv5_idle_300Hz.png
[2] https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/rjwv5_idle_300Hz_ok.png
[3] https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/rjwv5_idle_1000Hz.png

On 2018-03-15 22:59, Rafael J. Wysocki wrote:
> Hi All,
>
> Thanks a lot for the feedback so far!
>
> One more respin after the last batch of comments from Peter and Frederic.
>
> The previous summary that still applies:
>
> On Sunday, March 4, 2018 11:21:30 PM CET Rafael J. Wysocki wrote:
>>
>> The problem is that if we stop the sched tick in
>> tick_nohz_idle_enter() and then the idle governor predicts short idle
>> duration, we lose regardless of whether or not it is right.
>>
>> If it is right, we've lost already, because we stopped the tick
>> unnecessarily.
>> If it is not right, we'll lose going forward, because
>> the idle state selected by the governor is going to be too shallow and
>> we'll draw too much power (that has been reported recently to actually
>> happen often enough for people to care).
>>
>> This patch series is an attempt to improve the situation and the idea
>> here is to make the decision whether or not to stop the tick deeper in
>> the idle loop and in particular after running the idle state selection
>> in the path where the idle governor is invoked. This way the problem
>> can be avoided, because the idle duration predicted by the idle governor
>> can be used to decide whether or not to stop the tick so that the tick
>> is only stopped if that value is large enough (and, consequently, the
>> idle state selected by the governor is deep enough).
>>
>> The series tries to avoid adding too much new code, rather reorder the
>> existing code and make it more fine-grained.
>>
>> Patch 1 prepares the tick-sched code for the subsequent modifications and it
>> doesn't change the code's functionality (at least not intentionally).
>>
>> Patch 2 starts pushing the tick stopping decision deeper into the idle
>> loop, but that is limited to do_idle() and tick_nohz_irq_exit().
>>
>> Patch 3 makes cpuidle_idle_call() decide whether or not to stop the tick
>> and sets the stage for the subsequent changes.
>>
>> Patch 4 adds a bool pointer argument to cpuidle_select() and the ->select
>> governor callback allowing them to return a "nohz" hint on whether or not to
>> stop the tick to the caller. It also adds code to decide what value to
>> return as "nohz" to the menu governor.
>>
>> Patch 5 reorders the idle state selection with respect to the stopping of
>> the tick and causes the additional "nohz" hint from cpuidle_select() to be
>> used for deciding whether or not to stop the tick.
>>
>> Patch 6 causes the menu governor to refine the state selection in case the
>> tick is not going to be stopped and the already selected state may not fit
>> before the next tick time.
>>
>> Patch 7 deals with the situation in which the tick was stopped previously,
>> but the idle governor still predicts short idle.
>
> This series is complementary to the poll_idle() patch at
>
> https://patchwork.kernel.org/patch/10282237/
>
> Thanks,
> Rafael
>

-- 
Dipl. Inf. Thomas Ilsche
Computer Scientist
Highly Adaptive Energy-Efficient Computing
CRC 912 HAEC: http://tu-dresden.de/sfb912
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: +49 351 463-42168
Fax: +49 351 463-37773
E-Mail: thomas.ilsche@tu-dresden.de
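
As a rough cross-check of the idle power numbers reported at the top of
this mail, the sketch below compares the relative increase in system
power with the fraction of time the packages would spend in the elevated
state. It assumes (this is not measured) that in the affected phases
every sched tick causes one wakeup from C6 and that each wakeup is
followed by ~500 us of elevated power. The function names are
hypothetical and only serve this illustration.

```python
# Back-of-the-envelope check, not a measurement.
# Assumption: one C6 wakeup per sched tick, ~500 us elevated power each.

def relative_increase(baseline_w, observed_w):
    """Relative increase of observed power over the baseline."""
    return (observed_w - baseline_w) / baseline_w


def elevated_duty_cycle(tick_hz, elevated_s=500e-6):
    """Fraction of time spent in the elevated state, capped at 1.0."""
    return min(1.0, tick_hz * elevated_s)


BASELINE_W = 74.1  # reported idle power (system total) without the effect

if __name__ == "__main__":
    for tick_hz, observed_w in ((300, 85.5), (1000, 138.3)):
        inc = relative_increase(BASELINE_W, observed_w)
        duty = elevated_duty_cycle(tick_hz)
        print(f"{tick_hz:5d} Hz: power +{inc:.1%}, elevated duty cycle {duty:.0%}")
```

Under these assumptions the 300 Hz build spends about 15% of the time in
the elevated state, against a system power increase of about 15%; the
1000 Hz build spends about 50%, against an increase of about 87%. Note
that the baseline is system power, not package power, so this is only a
plausibility check.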