* RFC: /sys/power/policy_preference
From: Len Brown @ 2010-06-16 21:05 UTC
To: Linux Power Management List, Linux Kernel Mailing List, linux-acpi

Create /sys/power/policy_preference, giving user-space
the ability to express its preference for kernel based
power vs. performance decisions in a single place.

This gives kernel sub-systems and drivers a central place
to discover this system-wide policy preference.
It also allows user-space to not have to be updated
every time a sub-system or driver adds a new power/perf knob.

policy_preference has 5 levels, from max_performance
through max_powersave.  Here is how 4 parts of the kernel
might respond to those 5 levels:

max_performance (unwilling to sacrifice any performance)
	scheduler: default (optimized for performance)
	cpuidle: disable all C-states except polling mode
	ondemand: disable all P-states except max perf
	msr_ia32_energy_perf_bias: 0 of 15

performance (care primarily about performance)
	scheduler: default (optimized for performance)
	cpuidle: enable all C-states subject to QOS
	ondemand: all P-states, using no bias
	msr_ia32_energy_perf_bias: 3 of 15

balanced (default)
	scheduler: enable sched_mc_power_savings
	cpuidle: enable all C-states subject to QOS
	ondemand: all P-states, powersave_bias=5
	msr_ia32_energy_perf_bias: 7 of 15

powersave (can sacrifice measurable performance)
	scheduler: enable sched_smt_power_savings
	cpuidle: enable all C-states, subject to QOS
	ondemand: disable turbo mode, powersave_bias=10
	msr_ia32_energy_perf_bias: 11 of 15

max_powersave (can sacrifice significant performance)
	scheduler: enable sched_smt_power_savings
	cpuidle: enable all C-states, subject to QOS
	ondemand: min P-state (do not invoke T-states)
	msr_ia32_energy_perf_bias: 15 of 15

Note that today Linux is typically operating in the mode
called "performance" above, rather than "balanced",
which is proposed to be the default.  While a system
should work well if left in "balanced" mode, it is likely
that some users would want to use "powersave" when on
battery and perhaps shift to "performance" on A/C.

Please let me know what you think.

thanks,
Len Brown, Intel Open Source Technology Center

* RE: [linux-pm] RFC: /sys/power/policy_preference
From: Igor.Stoppa @ 2010-06-17 6:03 UTC
To: lenb, linux-pm, linux-kernel, linux-acpi

hi,

> From: Len Brown [lenb@kernel.org]
> policy_preference has 5 levels, from max_performance
> through max_powersave.  Here is how 4 parts of the kernel
> might respond to those 5 levels:

[levels description]

i do understand that you are mostly targeting acpi based systems,
but even there, based on static leaks, it might not be always true
that lower frequencies are correlated to higher power savings
(or maybe i have misunderstood your draft - i am not so fluent in acpi)

> it is likely
> that some users would want to use "powersave" when on
> battery and perhaps shift to "performance" on A/C.

if we consider also the thermal envelope and the fact that "performance"
might steal power from a charging battery, even on A/C it might not be
possible to settle down in one state permanently.

Or do you expect other mechanisms to intervene?

Cheers, igor

* RE: [linux-pm] RFC: /sys/power/policy_preference
From: Len Brown @ 2010-06-17 19:00 UTC
To: Igor.Stoppa
Cc: linux-pm, linux-kernel, linux-acpi

On Thu, 17 Jun 2010, Igor.Stoppa@nokia.com wrote:

> i do understand that you are mostly targeting acpi based systems,
> but even there, based on static leaks, it might not be always true
> that lower frequencies are correlated to higher power savings
> (or maybe i have misunderstood your draft - i am not so fluent in acpi)

Right, my assertion is that ondemand deals only with P-states,
where, by definition, the deeper the P-state the lower the voltage,
the higher the efficiency.

I assume that ondemand is not used to enable T-states,
where the clock is throttled w/o lowering the voltage.
I put a note to try to make that clear under max_powersave:
"ondemand: min P-state (do not invoke T-states)"

Of course it is also possible for a processor to do a poor
job implementing P-states and a great job optimizing idle states,
such that race to idle were always a win.  However, on such
a processor it would make more sense to simply disable P-states.

> > it is likely
> > that some users would want to use "powersave" when on
> > battery and perhaps shift to "performance" on A/C.
>
> if we consider also the thermal envelope and the fact that "performance"
> might steal power from a charging battery, even on A/C it might not be
> possible to settle down in one state permanently.
>
> Or do you expect other mechanisms to intervene?

Laptop BIOSes commonly implement a scheme that maximizes
performance on AC and biases towards saving energy on DC.

That, of course, is just one example use-model.
Here Linux user-space can choose whatever policy
makes sense for them at run-time.

cheers,
-Len Brown, Intel Open Source Technology Center

* Re: [linux-pm] RFC: /sys/power/policy_preference
From: Victor Lowther @ 2010-06-17 16:14 UTC
To: Len Brown
Cc: Linux Power Management List, Linux Kernel Mailing List, linux-acpi

On Jun 16, 2010, at 4:05 PM, Len Brown <lenb@kernel.org> wrote:

> Create /sys/power/policy_preference, giving user-space
> the ability to express its preference for kernel based
> power vs. performance decisions in a single place.
>
> This gives kernel sub-systems and drivers a central place
> to discover this system-wide policy preference.
> It also allows user-space to not have to be updated
> every time a sub-system or driver adds a new power/perf knob.

I would prefer documenting all the current knobs and adding them to
pm-utils so that pm-powersave knows about and can manage them. Once
that is done, creating arbitrary powersave levels should be fairly
simple.

> policy_preference has 5 levels, from max_performance
> through max_powersave.  Here is how 4 parts of the kernel
> might respond to those 5 levels:

[snip]

* Re: [linux-pm] RFC: /sys/power/policy_preference
From: Len Brown @ 2010-06-17 19:02 UTC
To: Victor Lowther
Cc: Linux Power Management List, Linux Kernel Mailing List, linux-acpi

> On Jun 16, 2010, at 4:05 PM, Len Brown <lenb@kernel.org> wrote:
>
> > Create /sys/power/policy_preference, giving user-space
> > the ability to express its preference for kernel based
> > power vs. performance decisions in a single place.
> >
> > This gives kernel sub-systems and drivers a central place
> > to discover this system-wide policy preference.
> > It also allows user-space to not have to be updated
> > every time a sub-system or driver adds a new power/perf knob.
>
> I would prefer documenting all the current knobs and adding them to pm-utils
> so that pm-powersave knows about and can manage them. Once that is done,
> creating arbitrary powersave levels should be fairly simple.

The idea here is to not require user-space to need updating
whenever a future knob is invented.  We can do a great job
at documenting the past, but a poor job of documenting the future:-)

cheers,
Len Brown, Intel Open Source Technology Center

* Re: [linux-pm] RFC: /sys/power/policy_preference
From: Victor Lowther @ 2010-06-17 22:23 UTC
To: Len Brown
Cc: Linux Power Management List, Linux Kernel Mailing List, linux-acpi

On Thu, Jun 17, 2010 at 2:02 PM, Len Brown <lenb@kernel.org> wrote:
>
>> On Jun 16, 2010, at 4:05 PM, Len Brown <lenb@kernel.org> wrote:
>>
>> > Create /sys/power/policy_preference, giving user-space
>> > the ability to express its preference for kernel based
>> > power vs. performance decisions in a single place.
>> >
>> > This gives kernel sub-systems and drivers a central place
>> > to discover this system-wide policy preference.
>> > It also allows user-space to not have to be updated
>> > every time a sub-system or driver adds a new power/perf knob.
>>
>> I would prefer documenting all the current knobs and adding them to pm-utils
>> so that pm-powersave knows about and can manage them. Once that is done,
>> creating arbitrary powersave levels should be fairly simple.
>
> The idea here is to not require user-space to need updating
> whenever a future knob is invented.  We can do a great job
> at documenting the past, but a poor job of documenting the future:-)

Well, I would suggest that the habit of not documenting what is
happening with power management in the kernel needs to change, then.
Having the documentation and example code for how to tweak the various
power management settings from userspace is inherently more flexible
than trying to expose a single knob from the kernel to userspace for
power management, with little loss of flexibility.

> cheers,
> Len Brown, Intel Open Source Technology Center

* Re: [linux-pm] RFC: /sys/power/policy_preference
From: Len Brown @ 2010-06-18 5:56 UTC
To: Victor Lowther
Cc: Linux Power Management List, Linux Kernel Mailing List, linux-acpi

On Thu, 17 Jun 2010, Victor Lowther wrote:

> > The idea here is to not require user-space to need updating
> > whenever a future knob is invented.  We can do a great job
> > at documenting the past, but a poor job of documenting the future:-)
>
> Well, I would suggest that the habit of not documenting what is
> happening with power management in the kernel needs to change, then.

Actually some of the knobs I showed in the examples
have been documented for *years*, yet are ignored
by user-space today.  I don't want to insult user-space
programmers, but the reality is that simpler is usually better.

> Having the documentation and example code for how to tweak the various
> power management settings from userspace is inherently more flexible
> than trying to expose a single knob from the kernel to userspace for
> power management, with little loss of flexibility.

Yes, the ultimate in flexibility is to update user-space whenever
some new driver or new knob appears in the kernel.  I'm not proposing
that ability be taken away.  I'm proposing that in many cases it
is unnecessary.

The idea is to have the ability to add something to the
kernel and avoid the need to make any change to user-space.

thanks,
-Len Brown, Intel Open Source Technology Center

* Re: [linux-pm] RFC: /sys/power/policy_preference
From: Victor Lowther @ 2010-06-18 11:55 UTC
To: Len Brown
Cc: Linux Power Management List, Linux Kernel Mailing List, linux-acpi

On Fri, Jun 18, 2010 at 12:56 AM, Len Brown <lenb@kernel.org> wrote:
> On Thu, 17 Jun 2010, Victor Lowther wrote:
>
>> > The idea here is to not require user-space to need updating
>> > whenever a future knob is invented.  We can do a great job
>> > at documenting the past, but a poor job of documenting the future:-)
>>
>> Well, I would suggest that the habit of not documenting what is
>> happening with power management in the kernel needs to change, then.
>
> Actually some of the knobs I showed in the examples
> have been documented for *years*, yet are ignored
> by user-space today.  I don't want to insult user-space
> programmers, but the reality is that simpler is usually better.

Let me explain where I am coming from, then. I maintain pm-utils, one
of the main low-level bodies of userspace code that concerns itself
with power management. I am currently in the process of standardizing
some of the more common power management tweaks so that they will work
in a cross-distro manner, and know from this that the documentation we
have is badly fragmented -- if you know exactly what you are looking
for, you can google or grep for it, but if you do not, there is no
easy way to find a list of all the power management settings you can
tune.

>> Having the documentation and example code for how to tweak the various
>> power management settings from userspace is inherently more flexible
>> than trying to expose a single knob from the kernel to userspace for
>> power management, with little loss of flexibility.
>
> Yes, the ultimate in flexibility is to update user-space whenever
> some new driver or new knob appears in the kernel.  I'm not proposing
> that ability be taken away.  I'm proposing that in many cases it
> is unnecessary.

I disagree. Most of userspace does not care about how the system is
trying to save power. I maintain one that does, and I do not like the
idea of adding another knob whose entire purpose is to map other,
already existing knobs onto a line, especially when we can do that in
userspace easily enough if anyone actually wants it.

> The idea is to have the ability to add something to the
> kernel and avoid the need to make any change to user-space.

Userspace in this case consists mainly of
acpi-scripts/pm-utils/laptop-mode-tools, upower,
g-p-m/kpowersave/x-p-m, and X. I can only speak for pm-utils, but the
model pm-utils, acpi-scripts, and laptop-mode-tools use does not map
to your proposed knob at all. We use a two-state model -- either we
are on AC power and use the kernel's default power state, or we are on
battery power and set power management to a distro- or user-chosen set
of parameters. I am working on making pm-utils contain some predefined
powersaving policies, but I do not expect them to change the two-state
model much more than changing which power management tweaks are used
in the on-ac and on-battery states.

> thanks,
> -Len Brown, Intel Open Source Technology Center

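The two-state model described in the message above can be sketched in a few lines of C. This is only an illustration of the idea, not how pm-utils is implemented: struct tweak, battery_tweaks and tweaks_for_state() are hypothetical names, and the two example entries merely reuse knob names mentioned earlier in this thread with made-up values.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical representation of one power management tweak. */
struct tweak {
	const char *knob;	/* name of the setting */
	const char *value;	/* value to apply while on battery */
};

/*
 * Hypothetical distro- or user-chosen parameter set for the on-battery
 * state.  The entries are examples only, not recommended values.
 */
static const struct tweak battery_tweaks[] = {
	{ "sched_mc_power_savings", "1" },
	{ "powersave_bias", "10" },
};

/*
 * Two-state model: on AC power nothing is changed and the kernel's
 * defaults stay in effect (NULL, count 0); on battery the chosen
 * parameter set is returned for the caller to apply.
 */
const struct tweak *tweaks_for_state(bool on_battery, size_t *count)
{
	if (!on_battery) {
		*count = 0;
		return NULL;
	}
	*count = sizeof(battery_tweaks) / sizeof(battery_tweaks[0]);
	return battery_tweaks;
}
```

Under this model, predefined powersaving policies would only swap the contents of the on-battery table; the AC/battery split itself stays the same.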
* Re: [linux-pm] RFC: /sys/power/policy_preference
From: Vaidyanathan Srinivasan @ 2010-06-19 15:17 UTC
To: Victor Lowther
Cc: Len Brown, Linux Power Management List, Linux Kernel Mailing List, linux-acpi

* Victor Lowther <victor.lowther@gmail.com> [2010-06-17 11:14:50]:

> On Jun 16, 2010, at 4:05 PM, Len Brown <lenb@kernel.org> wrote:
>
> >Create /sys/power/policy_preference, giving user-space
> >the ability to express its preference for kernel based
> >power vs. performance decisions in a single place.
> >
> >This gives kernel sub-systems and drivers a central place
> >to discover this system-wide policy preference.
> >It also allows user-space to not have to be updated
> >every time a sub-system or driver adds a new power/perf knob.
>
> I would prefer documenting all the current knobs and adding them to
> pm-utils so that pm-powersave knows about and can manage them. Once
> that is done, creating arbitrary powersave levels should be fairly
> simple.

Hi Len,

Reading through this thread, I prefer the above recommendation.

We have three main dimensions of (power savings) control (cpufreq,
cpuidle and scheduler) and you are combining them into a single policy
in the kernel.  The challenges are as follows:

* Number of policies will always limit flexibility
* More dimensions of control will be added in future and your
  intention is to transparently include them within these defined
  policies
* Even with the current implementations, power savings and performance
  impact widely vary based on system topology and workload.  There is
  no easy method to define modes such that one mode will _always_
  consume less power than the other
* Each subsystem can override the policy settings and create more
  combinations anyway

Your argument is that these modes can serve as a good default and allow
the user to tune the knobs directly for more sophisticated policies.
But in that case all kernel subsystems should default to the balanced
policy and let the user tweak individual subsystems for other modes.

On the other hand having the policy definitions in user space allows
us to create more flexible policies by considering higher level
factors like workload behavior, utilization, platform features,
power/thermal constraints etc.

--Vaidy

[snip]

* Re: [linux-pm] RFC: /sys/power/policy_preference
From: Rafael J. Wysocki @ 2010-06-19 19:04 UTC
To: svaidy, Linux Kernel Mailing List
Cc: Victor Lowther, Len Brown, linux-acpi, Matthew Garrett, linux-pm

On Saturday, June 19, 2010, Vaidyanathan Srinivasan wrote:
> * Victor Lowther <victor.lowther@gmail.com> [2010-06-17 11:14:50]:
>
> > On Jun 16, 2010, at 4:05 PM, Len Brown <lenb@kernel.org> wrote:
> >
> > >Create /sys/power/policy_preference, giving user-space
> > >the ability to express its preference for kernel based
> > >power vs. performance decisions in a single place.
> > >
> > >This gives kernel sub-systems and drivers a central place
> > >to discover this system-wide policy preference.
> > >It also allows user-space to not have to be updated
> > >every time a sub-system or driver adds a new power/perf knob.
> >
> > I would prefer documenting all the current knobs and adding them to
> > pm-utils so that pm-powersave knows about and can manage them. Once
> > that is done, creating arbitrary powersave levels should be fairly
> > simple.
>
> Hi Len,
>
> Reading through this thread, I prefer the above recommendation.

It also reflects my opinion quite well.

> We have three main dimensions of (power savings) control (cpufreq,
> cpuidle and scheduler) and you are combining them into a single policy
> in the kernel.

There's more than that, because we're in the process of adding runtime PM
features to I/O device drivers.

> The challenges are as follows:
>
> * Number of policies will always limit flexibility
> * More dimensions of control will be added in future and your
>   intention is to transparently include them within these defined
>   policies
> * Even with the current implementations, power savings and performance
>   impact widely vary based on system topology and workload.  There is
>   no easy method to define modes such that one mode will _always_
>   consume less power than the other
> * Each subsystem can override the policy settings and create more
>   combinations anyway
>
> Your argument is that these modes can serve as a good default and allow
> the user to tune the knobs directly for more sophisticated policies.
> But in that case all kernel subsystems should default to the balanced
> policy and let the user tweak individual subsystems for other modes.
>
> On the other hand having the policy definitions in user space allows
> us to create more flexible policies by considering higher level
> factors like workload behavior, utilization, platform features,
> power/thermal constraints etc.

The policy_preference levels as proposed are also really arbitrary and they
will usually mean different things on different systems.  If the interpretation
of these values is left to device drivers, then (for example) different network
adapter drivers may interpret "performance" differently and that will lead to
different types of behavior depending on which of them is used.  I think we
should rather use interfaces that unambiguously tell the driver what to do.

Thanks,
Rafael

* Re: RFC: /sys/power/policy_preference
From: Mike Chan @ 2010-06-17 20:48 UTC
To: Len Brown
Cc: Linux Power Management List, Linux Kernel Mailing List, linux-acpi

On Wed, Jun 16, 2010 at 2:05 PM, Len Brown <lenb@kernel.org> wrote:
> Create /sys/power/policy_preference, giving user-space
> the ability to express its preference for kernel based
> power vs. performance decisions in a single place.
>
> This gives kernel sub-systems and drivers a central place
> to discover this system-wide policy preference.
> It also allows user-space to not have to be updated
> every time a sub-system or driver adds a new power/perf knob.
>

This might be ok as a convenience feature for userspace, but if that is
the sole intention, is 5 states enough? Are these values sufficient? I
can say at least for Android this probably won't be as useful
(but perhaps on your platforms it makes sense).

As for a place for subsystems and drivers to check for what
performance mode you're in, does my driver have to check two places now?
What's stopping someone from overriding cpufreq, or cpuidle? I might be
confused here (if I am someone please correct me) but isn't this
somewhat along the lines of pm runtime / pm qos if drivers want to
check what power / performance state the system is in?

-- Mike

> policy_preference has 5 levels, from max_performance
> through max_powersave.  Here is how 4 parts of the kernel
> might respond to those 5 levels:

[snip]

* Re: RFC: /sys/power/policy_preference
From: Len Brown @ 2010-06-18 6:25 UTC
To: Mike Chan
Cc: Linux Power Management List, Linux Kernel Mailing List, linux-acpi

On Thu, 17 Jun 2010, Mike Chan wrote:

> On Wed, Jun 16, 2010 at 2:05 PM, Len Brown <lenb@kernel.org> wrote:
> > Create /sys/power/policy_preference, giving user-space
> > the ability to express its preference for kernel based
> > power vs. performance decisions in a single place.
> >
> > This gives kernel sub-systems and drivers a central place
> > to discover this system-wide policy preference.
> > It also allows user-space to not have to be updated
> > every time a sub-system or driver adds a new power/perf knob.
> >
>
> This might be ok as a convenience feature for userspace, but if that is
> the sole intention, is 5 states enough?
>
> Are these values sufficient? I
> can say at least for Android this probably won't be as useful
> (but perhaps on your platforms it makes sense).

Honestly, my first thought was to use 100 values -- a percentage.
But I quickly got talked out of it by people much wiser than me.

Consider that the vendors that are cleaning Linux's clock
on laptops seem quite content with 3 values at the user-interface.
So one might argue that 5 levels is already 66% more complexity
than needed:-)

Some suggested special case states, eg for HPC.
But those needs didn't fit into this simple
power vs performance continuum, and every consumer
of this interface needs to understand every state,
so adding special states would be a mistake.

The folks that do HPC and the folks that do embedded
devices are smart enough to tune their systems
without using this rather blunt instrument.
They should continue to do so, and this mechanism
should not get in their way.  For example, if this
mechanism is used to update powersave_bias inside ondemand,
but at the same time somebody tunes powersave_bias
by hand, the by-hand tuning must win.

> As for a place for subsystems and drivers to check for what
> performance mode you're in, does my driver have to check two places now?
> What's stopping someone from overriding cpufreq, or cpuidle? I might be
> confused here (if I am someone please correct me) but isn't this
> somewhat along the lines of pm runtime / pm qos if drivers want to
> check what power / performance state the system is in?

pm runtime and pm qos are much bigger hammers,
and this mechanism is intended to complement them,
not replace them.

Simply stated, this mechanism is intended just to give
a global hint of the user's power vs. performance
preference at a given time.  There are places in the
kernel and drivers where power vs performance decisions
are made with zero concept of user preference,
and this hint can help there.  Other parts of
the kernel don't care, or have sufficient information
to make informed decisions, and thus they simply
wouldn't need to make use of this hint.

thanks,
Len Brown, Intel Open Source Technology Center

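The "by-hand tuning must win" rule from the message above could look roughly like this. It is a sketch under stated assumptions, not the ondemand governor's actual code: struct tunable and both functions are hypothetical names, and the real powersave_bias attribute carries no such flag as far as this thread says.

```c
#include <stdbool.h>

/* Hypothetical wrapper around a tunable such as powersave_bias. */
struct tunable {
	unsigned int value;
	bool set_by_hand;	/* true once somebody has written the knob directly */
};

/* A direct write by the administrator: always accepted and remembered. */
void tunable_set_by_hand(struct tunable *t, unsigned int value)
{
	t->value = value;
	t->set_by_hand = true;
}

/*
 * An update driven by the global policy preference: applied only while
 * nobody has tuned the knob by hand.  Returns true if the value changed
 * hands to the policy, false if the by-hand setting was left alone.
 */
bool tunable_apply_policy(struct tunable *t, unsigned int value)
{
	if (t->set_by_hand)
		return false;
	t->value = value;
	return true;
}
```

With this shape the global hint stays a default: it fills in values nobody has expressed an opinion about and never overwrites an explicit setting.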
* Re: [linux-pm] RFC: /sys/power/policy_preference
From: Dipankar Sarma @ 2010-06-21 20:10 UTC
To: Len Brown
Cc: Linux Power Management List, Linux Kernel Mailing List, linux-acpi

On Wed, Jun 16, 2010 at 05:05:26PM -0400, Len Brown wrote:
> Create /sys/power/policy_preference, giving user-space
> the ability to express its preference for kernel based
> power vs. performance decisions in a single place.
>
> policy_preference has 5 levels, from max_performance
> through max_powersave.  Here is how 4 parts of the kernel
> might respond to those 5 levels:

In theory this makes sense. We have been toying with something
like this, but the difficulty is that outside of benchmarking
environment, it is hard to figure out what mode to set when.
Also, the impact could be different for different workloads.
We should probably have a broader discussion around this with data -
I will share some measurements on impact of such power modes.

> max_performance (unwilling to sacrifice any performance)
> 	scheduler: default (optimized for performance)
> 	cpuidle: disable all C-states except polling mode
> 	ondemand: disable all P-states except max perf
> 	msr_ia32_energy_perf_bias: 0 of 15
>
> performance (care primarily about performance)
> 	scheduler: default (optimized for performance)
> 	cpuidle: enable all C-states subject to QOS
> 	ondemand: all P-states, using no bias
> 	msr_ia32_energy_perf_bias: 3 of 15
>
> balanced (default)
> 	scheduler: enable sched_mc_power_savings
> 	cpuidle: enable all C-states subject to QOS
> 	ondemand: all P-states, powersave_bias=5
> 	msr_ia32_energy_perf_bias: 7 of 15

Would there be sufficient difference between performance
and balanced ?

> powersave (can sacrifice measurable performance)
> 	scheduler: enable sched_smt_power_savings
> 	cpuidle: enable all C-states, subject to QOS
> 	ondemand: disable turbo mode, powersave_bias=10
> 	msr_ia32_energy_perf_bias: 11 of 15
>
> max_powersave (can sacrifice significant performance)
> 	scheduler: enable sched_smt_power_savings
> 	cpuidle: enable all C-states, subject to QOS
> 	ondemand: min P-state (do not invoke T-states)
> 	msr_ia32_energy_perf_bias: 15 of 15

Thanks
Dipankar

* x86_energy_perf_policy.c 2010-06-16 21:05 RFC: /sys/power/policy_preference Len Brown ` (3 preceding siblings ...) 2010-06-21 20:10 ` [linux-pm] " Dipankar Sarma @ 2010-09-28 16:17 ` Len Brown 2010-10-23 4:40 ` [PATCH] tools: add x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS Len Brown 4 siblings, 1 reply; 26+ messages in thread
From: Len Brown @ 2010-09-28 16:17 UTC (permalink / raw)
To: Linux Power Management List, Linux Kernel Mailing List, linux-acpi, x86

/*
In June, I proposed /sys/power/policy_preference
to consolidate the knobs that user-space needs to
turn to tell the kernel its performance/energy preference.

The feedback I got was that user-space doesn't want
the kernel to consolidate anything, but instead wants
the kernel to expose everything and user-space will
be able to keep up with new devices and hooks, as long
as they are sufficiently documented.

I think that past history and the current state of affairs
suggest that user-space will come up short, but who am I to judge?

So here is a utility to implement the user-space
approach for Intel's new ENERGY_PERF_BIAS MSR.
(You'll see it on some Westmere, and all Sandy Bridge processors)

The utility translates the words "powersave",
"normal", or "performance" into the right bits for
this register, and scribbles on /dev/cpu/<N>/msr,
as appropriate.

I'll be delighted to re-implement this in a different way
if consensus emerges that a better way exists.

thanks,
Len Brown
Intel Open Source Technology Center
*/

/*
 * x86_energy_perf_policy -- set the energy versus performance
 * policy preference bias on recent X86 processors.
 */
/*
 * Copyright (c) 2010, Intel Corporation.
 * Len Brown <len.brown@intel.com>
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms and conditions of the GNU General Public License,
 * version 2, as published by the Free Software Foundation.
 *
 * This program is distributed in the hope it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
 * more details.
 *
 * You should have received a copy of the GNU General Public License along with
 * this program; if not, write to the Free Software Foundation, Inc.,
 * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
 */

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/resource.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/time.h>
#include <stdlib.h>
#include <string.h>	/* strcmp(), strncmp() */

unsigned int verbose;		/* set with -v */
unsigned int read_only;		/* set with -r */
char *progname;
unsigned long long new_bias;
int cpu = -1;

/*
 * Usage:
 *
 * -c cpu: limit action to a single CPU (default is all CPUs)
 * -v: verbose output (can invoke more than once)
 * -r: read-only, don't change any settings
 *
 * performance
 *	Performance is paramount.
 *	Unwilling to sacrifice any performance
 *	for the sake of energy saving. (hardware default)
 *
 * normal
 *	Can tolerate minor performance compromise
 *	for potentially significant energy savings.
 *	(reasonable default for most desktops and servers)
 *
 * powersave
 *	Can tolerate significant performance hit
 *	to maximize energy savings.
 *
 * n
 *	a numerical value to write to the underlying MSR.
 */
void usage(void)
{
	printf("%s: [-c cpu] [-v] "
		"(-r | 'performance' | 'normal' | 'powersave' | n)\n",
		progname);
}

/*
 * MSR_IA32_ENERGY_PERF_BIAS allows software to convey
 * its policy for the relative importance of performance
 * versus energy savings.
 *
 * The hardware uses this information in model-specific ways
 * when it must choose trade-offs between performance and
 * energy consumption.
 *
 * This policy hint does not supersede Processor Performance states
 * (P-states) or CPU Idle power states (C-states), but allows
 * software to have influence where it has been unable to
 * express a preference in the past.
 *
 * For example, this setting may tell the hardware how
 * aggressively or conservatively to control frequency
 * in the "turbo range" above the explicitly OS-controlled
 * P-state frequency range. It may also tell the hardware
 * how aggressively it should enter the OS requested C-states.
 *
 * The support for this feature is indicated by CPUID.06H.ECX.bit3
 * per the Intel Architectures Software Developer's Manual.
 */

#define MSR_IA32_ENERGY_PERF_BIAS	0x000001b0

#define BIAS_PERFORMANCE	0
#define BIAS_BALANCE		6
#define BIAS_POWERSAVE		15

void cmdline(int argc, char **argv)
{
	int opt;

	progname = argv[0];

	while ((opt = getopt(argc, argv, "+rvc:")) != -1) {
		switch (opt) {
		case 'c':
			cpu = atoi(optarg);
			break;
		case 'r':
			read_only = 1;
			break;
		case 'v':
			verbose++;
			break;
		default:
			usage();
			exit(-1);
		}
	}
	/* if -r, then should be no additional optind */
	if (read_only && (argc > optind)) {
		usage();
		exit(-1);
	}

	/*
	 * if no -r , then must be one additional optind
	 */
	if (!read_only) {

		if (argc != optind + 1) {
			printf("must supply -r or policy param\n");
			usage();
			exit(-1);
		}

		if (!strcmp("performance", argv[optind])) {
			new_bias = BIAS_PERFORMANCE;
		} else if (!strcmp("normal", argv[optind])) {
			new_bias = BIAS_BALANCE;
		} else if (!strcmp("powersave", argv[optind])) {
			new_bias = BIAS_POWERSAVE;
		} else {
			new_bias = atoll(argv[optind]);
			if (new_bias > BIAS_POWERSAVE) {
				usage();
				exit(-1);
			}
		}
		printf("new_bias 0x%016llx\n", new_bias);
	}
}

/*
 * validate_cpuid()
 * returns on success, quietly exits on failure (make verbose with -v)
 */
void validate_cpuid(void)
{
	unsigned int eax, ebx, ecx, edx, max_level;
	char brand[16];
	unsigned int fms, family, model, stepping, ht_capable;

	eax = ebx = ecx = edx = 0;

	asm("cpuid" : "=a" (max_level), "=b" (ebx), "=c" (ecx),
		"=d" (edx) : "a" (0));

	/* the vendor string is spread over EBX, EDX, ECX, 4 bytes each */
	sprintf(brand, "%.4s%.4s%.4s",
		(char *)&ebx, (char *)&edx, (char *)&ecx);

	if (strncmp(brand, "GenuineIntel", 12)) {
		if (verbose)
			printf("CPUID: %s != GenuineIntel\n", brand);
		exit(-1);
	}

	asm("cpuid" : "=a" (fms), "=c" (ecx), "=d" (edx) : "a" (1) : "ebx");
	family = (fms >> 8) & 0xf;
	model = (fms >> 4) & 0xf;
	stepping = fms & 0xf;
	if (family == 6 || family == 0xf)
		model += ((fms >> 16) & 0xf) << 4;

	if (verbose > 1)
		printf("CPUID %s %d levels family:model:stepping "
			"0x%x:%x:%x (%d:%d:%d)\n", brand, max_level,
			family, model, stepping, family, model, stepping);

	if (!(edx & (1 << 5))) {
		if (verbose)
			printf("CPUID: no MSR\n");
		exit(-1);
	}

	/*
	 * Support for MSR_IA32_ENERGY_PERF_BIAS
	 * is indicated by CPUID.06H.ECX.bit3
	 */
	asm("cpuid" : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (6));
	if (verbose)
		printf("CPUID.06H.ECX: 0x%x\n", ecx);
	if (!(ecx & (1 << 3))) {
		if (verbose)
			printf("CPUID: No MSR_IA32_ENERGY_PERF_BIAS\n");
		exit(-1);
	}
	return;	/* success */
}

void check_dev_msr(void)
{
	struct stat sb;

	if (stat("/dev/cpu/0/msr", &sb)) {
		printf("no /dev/cpu/0/msr\n");
		printf("Try \"# modprobe msr\"\n");
		exit(-5);
	}
}

unsigned long long get_msr(int cpu, int offset)
{
	unsigned long long msr;
	char msr_path[32];
	int retval;
	int fd;

	sprintf(msr_path, "/dev/cpu/%d/msr", cpu);
	fd = open(msr_path, O_RDONLY);
	if (fd < 0) {
		perror(msr_path);
		exit(-1);
	}

	retval = pread(fd, &msr, sizeof msr, offset);

	if (retval != sizeof msr) {
		printf("pread cpu%d 0x%x = %d\n", cpu, offset, retval);
		exit(-2);
	}
	close(fd);
	return msr;
}

unsigned long long put_msr(int cpu, unsigned long long new_msr, int offset)
{
	unsigned long long old_msr;
	char msr_path[32];
	int retval;
	int fd;

	sprintf(msr_path, "/dev/cpu/%d/msr", cpu);
	fd = open(msr_path, O_RDWR);
	if (fd < 0) {
		perror(msr_path);
		exit(-1);
	}

	/* read the old value first so the caller can report the change */
	retval = pread(fd, &old_msr, sizeof old_msr, offset);
	if (retval != sizeof old_msr) {
		perror("pread");
		printf("pread cpu%d 0x%x = %d\n", cpu, offset, retval);
		exit(-2);
	}

	retval = pwrite(fd, &new_msr, sizeof new_msr, offset);
	if (retval != sizeof new_msr) {
		perror("pwrite");
		printf("pwrite cpu%d 0x%x = %d\n", cpu, offset, retval);
		exit(-2);
	}

	close(fd);

	return old_msr;
}

void print_msr(int cpu)
{
	printf("cpu%d: 0x%016llx\n",
		cpu, get_msr(cpu, MSR_IA32_ENERGY_PERF_BIAS));
}

void update_msr(int cpu)
{
	unsigned long long previous_msr;

	previous_msr = put_msr(cpu, new_bias, MSR_IA32_ENERGY_PERF_BIAS);

	if (verbose)
		printf("cpu%d msr0x%x 0x%016llx -> 0x%016llx\n",
			cpu, MSR_IA32_ENERGY_PERF_BIAS, previous_msr, new_bias);

	return;
}

char *proc_stat = "/proc/stat";
/*
 * run func() on every cpu listed in /proc/stat
 */
void for_every_cpu(void (func)(int))
{
	FILE *fp;
	int cpu_count;
	int retval;

	fp = fopen(proc_stat, "r");
	if (fp == NULL) {
		perror(proc_stat);
		exit(-1);
	}

	/* skip the aggregate "cpu" line */
	retval = fscanf(fp, "cpu %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n");
	if (retval != 0) {
		perror("/proc/stat format");
		exit(-1);
	}

	for (cpu_count = 0; ; cpu_count++) {
		int cpu;

		retval = fscanf(fp,
			"cpu%d %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n",
			&cpu);
		if (retval != 1)
			break;	/* no more cpuN lines; fall through to fclose() */

		func(cpu);
	}
	fclose(fp);
}

int main(int argc, char **argv)
{
	cmdline(argc, argv);

	if (verbose > 1)
		printf("x86_energy_perf_policy Aug 2, 2010"
			" - Len Brown <lenb@kernel.org>\n");
	if (verbose > 1 && !read_only)
		printf("new_bias %lld\n", new_bias);

	validate_cpuid();
	check_dev_msr();

	if (cpu != -1) {
		if (read_only)
			print_msr(cpu);
		else
			update_msr(cpu);
	} else {
		if (read_only)
			for_every_cpu(print_msr);
		else
			for_every_cpu(update_msr);
	}

	return 0;
}

^ permalink raw reply [flat|nested] 26+ messages in thread
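A note on the numeric "n" argument: the utility above parses it with atoll(), which returns 0 for non-numeric input, so a mistyped policy word would silently select the "performance" bias. The sketch below shows one stricter way to parse the argument using strtoull(). parse_bias() is a hypothetical helper, not part of the posted program; only the three policy words and the BIAS_* values are taken from it.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Values as defined in the posted utility. */
#define BIAS_PERFORMANCE	0
#define BIAS_BALANCE		6
#define BIAS_POWERSAVE		15

/*
 * parse_bias() -- hypothetical helper, not in the original program.
 * Accepts a policy word or a number in 0..15 (decimal, octal or hex).
 * Stores the bias in *bias and returns 0, or returns -1 on bad input.
 */
int parse_bias(const char *arg, unsigned long long *bias)
{
	char *end;
	unsigned long long val;

	if (!strcmp(arg, "performance")) {
		*bias = BIAS_PERFORMANCE;
		return 0;
	}
	if (!strcmp(arg, "normal")) {
		*bias = BIAS_BALANCE;
		return 0;
	}
	if (!strcmp(arg, "powersave")) {
		*bias = BIAS_POWERSAVE;
		return 0;
	}

	/* reject empty input and a leading minus sign */
	if (*arg == '\0' || *arg == '-')
		return -1;

	errno = 0;
	val = strtoull(arg, &end, 0);
	/* reject overflow, no digits, trailing junk, and out-of-range values */
	if (errno != 0 || end == arg || *end != '\0' || val > BIAS_POWERSAVE)
		return -1;

	*bias = val;
	return 0;
}
```

A caller could replace the atoll() branch in cmdline() with a single parse_bias(argv[optind], &new_bias) check and call usage() on failure.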
* [PATCH] tools: add x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS 2010-09-28 16:17 ` x86_energy_perf_policy.c Len Brown @ 2010-10-23 4:40 ` Len Brown 2010-10-27 3:23 ` Andrew Morton 2010-11-15 16:07 ` [PATCH RESEND] tools: add power/x86/x86_energy_perf_policy " Len Brown 0 siblings, 2 replies; 26+ messages in thread From: Len Brown @ 2010-10-23 4:40 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-pm, linux-kernel, linux-acpi, x86 From: Len Brown <len.brown@intel.com> MSR_IA32_ENERGY_PERF_BIAS first became available on Westmere Xeon. It is implemented in all Sandy Bridge processors -- mobile, desktop and server. It is expected to become increasingly important in subsequent generations. x86_energy_perf_policy is a user-space utility to set this hardware energy vs performance policy hint in the processor. Most systems would benefit from "x86_energy_perf_policy normal" at system startup, as the hardware default is maximum performance at the expense of energy efficiency. See the comments in the source code for more information. Linux-2.6.36 added "epb" to /proc/cpuinfo to indicate if an x86 processor supports MSR_IA32_ENERGY_PERF_BIAS, though the kernel does not actually program the MSR. In March, Venkatesh Pallipadi proposed a small driver that programmed MSR_IA32_ENERGY_PERF_BIAS, based on the cpufreq governor in use. It also offered a boot-time cmdline option to override. http://lkml.org/lkml/2010/3/4/457 But hiding the hardware policy behind the governor choice was deemed "kinda icky". So in June, I proposed a generic user/kernel API to consolidate the power/performance policy trade-off. "RFC: /sys/power/policy_preference" http://lkml.org/lkml/2010/6/16/399 That is my preference for implementing this capability, but I received no support on the list. So in September, I sent x86_energy_perf_policy.c to LKML, a user-space utility that scribbles directly to the MSR. 
http://lkml.org/lkml/2010/9/28/246 Here is the same utility re-sent, this time proposed to reside in the kernel tools directory. Signed-off-by: Len Brown <len.brown@intel.com> --- tools/power/x86/x86_energy_perf_policy/Makefile | 7 + .../x86_energy_perf_policy.c | 358 ++++++++++++++++++++ 2 files changed, 365 insertions(+), 0 deletions(-) create mode 100644 tools/power/x86/x86_energy_perf_policy/Makefile create mode 100644 tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c diff --git a/tools/power/x86/x86_energy_perf_policy/Makefile b/tools/power/x86/x86_energy_perf_policy/Makefile new file mode 100644 index 0000000..b0763da --- /dev/null +++ b/tools/power/x86/x86_energy_perf_policy/Makefile @@ -0,0 +1,7 @@ +x86_energy_perf_policy : x86_energy_perf_policy.c + +clean : + rm -f x86_energy_perf_policy + +install : + install x86_energy_perf_policy /usr/bin/x86_energy_perf_policy diff --git a/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c new file mode 100644 index 0000000..89394d9 --- /dev/null +++ b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c @@ -0,0 +1,358 @@ +/* + * x86_energy_perf_policy -- set the energy versus performance + * policy preference bias on recent X86 processors. + */ +/* + * Copyright (c) 2010, Intel Corporation. + * Len Brown <len.brown@intel.com> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA. + */ + +#include <stdio.h> +#include <unistd.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/resource.h> +#include <fcntl.h> +#include <signal.h> +#include <sys/time.h> +#include <stdlib.h> + +unsigned int verbose; /* set with -v */ +unsigned int read_only; /* set with -r */ +char *progname; +unsigned long long new_bias; +int cpu = -1; + +/* + * Usage: + * + * -c cpu: limit action to a single CPU (default is all CPUs) + * -v: verbose output (can invoke more than once) + * -r: read-only, don't change any settings + * + * performance + * Performance is paramount. + * Unwilling to sacrifice any performance + * for the sake of energy saving. (hardware default) + * + * normal + * Can tolerate minor performance compromise + * for potentially significant energy savings. + * (reasonable default for most desktops and servers) + * + * powersave + * Can tolerate significant performance hit + * to maximize energy savings. + * + * n + * a numerical value to write to the underlying MSR. + */ +void usage(void) +{ + printf("%s: [-c cpu] [-v] " + "(-r | 'performance' | 'normal' | 'powersave' | n)\n", + progname); +} + +/* + * MSR_IA32_ENERGY_PERF_BIAS allows software to convey + * its policy for the relative importance of performance + * versus energy savings. + * + * The hardware uses this information in model-specific ways + * when it must choose trade-offs between performance and + * energy consumption. + * + * This policy hint does not supersede Processor Performance states + * (P-states) or CPU Idle power states (C-states), but allows + * software to have influence where it has been unable to + * express a preference in the past.
+ * + * For example, this setting may tell the hardware how + * aggressively or conservatively to control frequency + * in the "turbo range" above the explicitly OS-controlled + * P-state frequency range. It may also tell the hardware + * how aggressively it should enter the OS requested C-states. + * + * The support for this feature is indicated by CPUID.06H.ECX.bit3 + * per the Intel Architectures Software Developer's Manual. + */ + +#define MSR_IA32_ENERGY_PERF_BIAS 0x000001b0 + +#define BIAS_PERFORMANCE 0 +#define BIAS_BALANCE 6 +#define BIAS_POWERSAVE 15 + +void cmdline(int argc, char **argv) { + int opt; + + progname = argv[0]; + + while ((opt = getopt(argc, argv, "+rvc:")) != -1) { + switch (opt) { + case 'c': + cpu = atoi(optarg); + break; + case 'r': + read_only = 1; + break; + case 'v': + verbose++; + break; + default: + usage(); + exit(-1); + } + } + /* if -r, then should be no additional optind */ + if (read_only && (argc > optind)) { + usage(); + exit(-1); + } + + /* + * if no -r , then must be one additional optind + */ + if (!read_only) { + + if (argc != optind + 1) { + printf("must supply -r or policy param\n"); + usage(); + exit(-1); + } + + if (!strcmp("performance", argv[optind])) { + new_bias = BIAS_PERFORMANCE; + } else if (!strcmp("normal", argv[optind])) { + new_bias = BIAS_BALANCE; + } else if (!strcmp("powersave", argv[optind])) { + new_bias = BIAS_POWERSAVE; + } else { + new_bias = atoll(argv[optind]); + if (new_bias > BIAS_POWERSAVE) { + usage(); + exit(-1); + } + } + } +} + +/* + * validate_cpuid() + * returns on success, quietly exits on failure (make verbose with -v) + */ +void validate_cpuid(void) +{ + unsigned int eax, ebx, ecx, edx, max_level; + char brand[16]; + unsigned int fms, family, model, stepping, ht_capable; + + eax = ebx = ecx = edx = 0; + + asm("cpuid" : "=a" (max_level), "=b" (ebx), "=c" (ecx), + "=d" (edx) : "a" (0)); + + sprintf(brand, "%.4s%.4s%.4s", &ebx, &edx, &ecx); + + if (strncmp(brand, "GenuineIntel", 12)) { + if
(verbose) + printf("CPUID: %s != GenuineIntel\n", brand); + exit(-1); + } + + asm("cpuid" : "=a" (fms), "=c" (ecx), "=d" (edx) : "a" (1) : "ebx"); + family = (fms >> 8) & 0xf; + model = (fms >> 4) & 0xf; + stepping = fms & 0xf; + if (family == 6 || family == 0xf) + model += ((fms >> 16) & 0xf) << 4; + + if (verbose > 1) + printf("CPUID %s %d levels family:model:stepping " + "0x%x:%x:%x (%d:%d:%d)\n", brand, max_level, + family, model, stepping, family, model, stepping); + + if (!(edx & (1 << 5))) { + if (verbose) + printf("CPUID: no MSR\n"); + exit(-1); + } + + /* + * Support for MSR_IA32_ENERGY_PERF_BIAS + * is indicated by CPUID.06H.ECX.bit3 + */ + asm("cpuid" : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (6)); + if (verbose) + printf("CPUID.06H.ECX: 0x%x\n", ecx); + if (!(ecx & (1 << 3))) { + if (verbose) + printf("CPUID: No MSR_IA32_ENERGY_PERF_BIAS\n"); + exit(-1); + } + return; /* success */ +} + +void check_dev_msr(void) { + struct stat sb; + + if (stat("/dev/cpu/0/msr", &sb)) { + printf("no /dev/cpu/0/msr\n"); + printf("Try \"# modprobe msr\"\n"); + exit(-5); + } +} + +unsigned long long get_msr(int cpu, int offset) +{ + unsigned long long msr; + char msr_path[32]; + int retval; + int fd; + + sprintf(msr_path, "/dev/cpu/%d/msr", cpu); + fd = open(msr_path, O_RDONLY); + if (fd < 0) { + perror(msr_path); + exit(-1); + } + + retval = pread(fd, &msr, sizeof msr, offset); + + if (retval != sizeof msr) { + printf("pread cpu%d 0x%x = %d\n", cpu, offset, retval); + exit(-2); + } + close(fd); + return msr; +} + +unsigned long long put_msr(int cpu, unsigned long long new_msr, int offset) +{ + unsigned long long old_msr; + char msr_path[32]; + int retval; + int fd; + + sprintf(msr_path, "/dev/cpu/%d/msr", cpu); + fd = open(msr_path, O_RDWR); + if (fd < 0) { + perror(msr_path); + exit(-1); + } + + retval = pread(fd, &old_msr, sizeof old_msr, offset); + if (retval != sizeof old_msr) { + perror("pread"); + printf("pread cpu%d 0x%x = %d\n", cpu, offset, retval); +
exit(-2); + } + + retval = pwrite(fd, &new_msr, sizeof new_msr, offset); + if (retval != sizeof new_msr) { + perror("pwrite"); + printf("pwrite cpu%d 0x%x = %d\n", cpu, offset, retval); + exit(-2); + } + + close(fd); + + return old_msr; +} + +void print_msr(int cpu) +{ + printf("cpu%d: 0x%016llx\n", + cpu, get_msr(cpu, MSR_IA32_ENERGY_PERF_BIAS)); +} + +void update_msr(int cpu) +{ + unsigned long long previous_msr; + + previous_msr = put_msr(cpu, new_bias, MSR_IA32_ENERGY_PERF_BIAS); + + if (verbose) + printf("cpu%d msr0x%x 0x%016llx -> 0x%016llx\n", + cpu, MSR_IA32_ENERGY_PERF_BIAS, previous_msr, new_bias); + + return; +} + +char *proc_stat = "/proc/stat"; +/* + * run func() on every cpu in /dev/cpu + */ +void for_every_cpu(void (func)(int)) +{ + FILE *fp; + int cpu_count; + int retval; + + fp = fopen(proc_stat, "r"); + if (fp == NULL) { + perror(proc_stat); + exit(-1); + } + + retval = fscanf(fp, "cpu %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n"); + if (retval != 0) { + perror("/proc/stat format"); + exit(-1); + } + + for (cpu_count = 0; ; cpu_count++) { + int cpu; + + retval = fscanf(fp, + "cpu%u %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n", + &cpu); + if (retval != 1) + return; + + func(cpu); + } + fclose(fp); +} + +int main(int argc, char **argv) +{ + cmdline(argc, argv); + + if (verbose > 1) + printf("x86_energy_perf_policy Aug 2, 2010" + " - Len Brown <lenb@kernel.org>\n"); + if (verbose > 1 && !read_only) + printf("new_bias %lld\n", new_bias); + + validate_cpuid(); + check_dev_msr(); + + if (cpu != -1) { + if (read_only) + print_msr(cpu); + else + update_msr(cpu); + } else { + if (read_only) + for_every_cpu(print_msr); + else + for_every_cpu(update_msr); + } + + return 0; +} -- 1.7.3.1.127.g1bb28 ^ permalink raw reply related [flat|nested] 26+ messages in thread
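The commit message above notes that Linux 2.6.36 added "epb" to /proc/cpuinfo. As a rough sketch, a user-space tool could look for that flag instead of (or in addition to) executing CPUID itself. The helpers line_has_flag() and cpuinfo_has_epb() below are hypothetical names and are not part of the patch. The code assumes the flags appear on lines beginning with "flags" as space-separated words, and that each such line fits in the 4096-byte buffer.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-delimited word in `line`. */
int line_has_flag(const char *line, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = line;

	if (len == 0)
		return 0;

	while ((p = strstr(p, flag)) != NULL) {
		int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
		int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return 1;
		p++;	/* substring of a longer word; keep searching */
	}
	return 0;
}

/*
 * Return 1 if a "flags" line in /proc/cpuinfo lists "epb", 0 if none does,
 * and -1 if the file cannot be opened.
 */
int cpuinfo_has_epb(void)
{
	char line[4096];	/* assumed large enough for one flags line */
	FILE *fp = fopen("/proc/cpuinfo", "r");
	int found = 0;

	if (fp == NULL)
		return -1;

	while (!found && fgets(line, sizeof(line), fp) != NULL)
		if (strncmp(line, "flags", 5) == 0)
			found = line_has_flag(line, "epb");

	fclose(fp);
	return found;
}
```

Matching whole words matters here because a plain substring search for "epb" would also match any longer flag name that happens to contain those letters.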
* Re: [PATCH] tools: add x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS 2010-10-23 4:40 ` [PATCH] tools: add x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS Len Brown @ 2010-10-27 3:23 ` Andrew Morton 2010-10-27 6:01 ` Ingo Molnar 2010-11-15 16:07 ` [PATCH RESEND] tools: add power/x86/x86_energy_perf_policy " Len Brown 1 sibling, 1 reply; 26+ messages in thread From: Andrew Morton @ 2010-10-27 3:23 UTC (permalink / raw) To: Len Brown; +Cc: linux-pm, linux-kernel, linux-acpi, x86 On Sat, 23 Oct 2010 00:40:18 -0400 (EDT) Len Brown <lenb@kernel.org> wrote: > MSR_IA32_ENERGY_PERF_BIAS first became available on Westmere Xeon. > It is implemented in all Sandy Bridge processors -- mobile, desktop and server. > It is expected to become increasingly important in subsequent generations. > > x86_energy_perf_policy is a user-space utility to set this > hardware energy vs performance policy hint in the processor. > Most systems would benefit from "x86_energy_perf_policy normal" > at system startup, as the hardware default is maximum performance > at the expense of energy efficiency. See the comments > in the source code for more information. > > Linux-2.6.36 added "epb" to /proc/cpuinfo to indicate > if an x86 processor supports MSR_IA32_ENERGY_PERF_BIAS, > though the kernel does not actually program the MSR. > > In March, Venkatesh Pallipadi proposed a small driver > that programmed MSR_IA32_ENERGY_PERF_BIAS, based on > the cpufreq governor in use. It also offered > a boot-time cmdline option to override. > http://lkml.org/lkml/2010/3/4/457 > But hiding the hardware policy behind the > governor choice was deemed "kinda icky". > > So in June, I proposed a generic user/kernel API to > consolidate the power/performance policy trade-off. > "RFC: /sys/power/policy_preference" > http://lkml.org/lkml/2010/6/16/399 > That is my preference for implementing this capability, > but I received no support on the list. 
> > So in September, I sent x86_energy_perf_policy.c to LKML, > a user-space utility that scribbles directly to the MSR. > http://lkml.org/lkml/2010/9/28/246 > > Here is the same utility re-sent, this time proposed > to reside in the kernel tools directory. > > Signed-off-by: Len Brown <len.brown@intel.com> > --- > tools/power/x86/x86_energy_perf_policy/Makefile | 7 + > .../x86_energy_perf_policy.c | 358 ++++++++++++++++++++ > 2 files changed, 365 insertions(+), 0 deletions(-) > create mode 100644 tools/power/x86/x86_energy_perf_policy/Makefile > create mode 100644 tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c tools/power/x86, eh? It seems a better place than under Documentation/, where such things have thus far landed! I looked briefly, wondering about the kbuild situation. It doesn't appear to be wired up, so one has to manually enter that directory and type `make'? I guess that's OK as an interim thing but longer-term I suppose we should have some more complete build and deployment system. So (thinking out loud) a `make' would invoke a `make tools', and that `make tools' would build the tools which are specific to the target arch[*], and any generic ones. And a `make tools_install' would install those tools in, I guess, /lib/modules/$(uname -r)/bin. Or something else. We'd need input from the distro guys to get this right. [*]: building tools for the `target arch' would require a far more extensive cross-build environment than is needed for just kernel cross-compilation. This is perhaps Just Too Hard and perhaps a `make tools_install' should copy the *source* into /lib/modules/$(uname -r)/src and you then finish the build on the target. Or something else. The mind boggles. So for now, just parking the source down in ./tools/ and deferring the problem sounds a fine idea ;) A number of programs down under Documentation/ should be moved into tools/ as well. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] tools: add x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS 2010-10-27 3:23 ` Andrew Morton @ 2010-10-27 6:01 ` Ingo Molnar 2010-10-27 11:43 ` Arnaldo Carvalho de Melo 0 siblings, 1 reply; 26+ messages in thread From: Ingo Molnar @ 2010-10-27 6:01 UTC (permalink / raw) To: Andrew Morton, Arnaldo Carvalho de Melo, Peter Zijlstra Cc: Len Brown, linux-pm, linux-kernel, linux-acpi, x86 * Andrew Morton <akpm@linux-foundation.org> wrote: > On Sat, 23 Oct 2010 00:40:18 -0400 (EDT) Len Brown <lenb@kernel.org> wrote: > > > MSR_IA32_ENERGY_PERF_BIAS first became available on Westmere Xeon. > > It is implemented in all Sandy Bridge processors -- mobile, desktop and server. > > It is expected to become increasingly important in subsequent generations. > > > > x86_energy_perf_policy is a user-space utility to set this > > hardware energy vs performance policy hint in the processor. > > Most systems would benefit from "x86_energy_perf_policy normal" > > at system startup, as the hardware default is maximum performance > > at the expense of energy efficiency. See the comments > > in the source code for more information. > > > > Linux-2.6.36 added "epb" to /proc/cpuinfo to indicate > > if an x86 processor supports MSR_IA32_ENERGY_PERF_BIAS, > > though the kernel does not actually program the MSR. > > > > In March, Venkatesh Pallipadi proposed a small driver > > that programmed MSR_IA32_ENERGY_PERF_BIAS, based on > > the cpufreq governor in use. It also offered > > a boot-time cmdline option to override. > > http://lkml.org/lkml/2010/3/4/457 > > But hiding the hardware policy behind the > > governor choice was deemed "kinda icky". > > > > So in June, I proposed a generic user/kernel API to > > consolidate the power/performance policy trade-off. > > "RFC: /sys/power/policy_preference" > > http://lkml.org/lkml/2010/6/16/399 > > That is my preference for implementing this capability, > > but I received no support on the list. 
> > > > So in September, I sent x86_energy_perf_policy.c to LKML, > > a user-space utility that scribbles directly to the MSR. > > http://lkml.org/lkml/2010/9/28/246 > > > > Here is the same utility re-sent, this time proposed > > to reside in the kernel tools directory. > > > > Signed-off-by: Len Brown <len.brown@intel.com> > > --- > > tools/power/x86/x86_energy_perf_policy/Makefile | 7 + > > .../x86_energy_perf_policy.c | 358 ++++++++++++++++++++ > > 2 files changed, 365 insertions(+), 0 deletions(-) > > create mode 100644 tools/power/x86/x86_energy_perf_policy/Makefile > > create mode 100644 tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c > > tools/power/x86, eh? It seems a better place than under > Documentation/, where such things have thus far landed! > > I looked briefly, wondering about the kbuild situation. It doesn't > appear to be wired up, so one has to manually enter that directory and > type `make'? > > I guess that's OK as an interim thing but longer-term I suppose we > should have some more complete build and deployment system. So > (thinking out loud) a `make' would invoke a `make tools', and that > `make tools' would build the tools which are specific to the target > arch[*], and any generic ones. And a `make tools_install' would install > those tools in, I guess, /lib/modules/$(uname -r)/bin. In terms of build and documentation environment, tools/perf/ has one cloned/inherited from Git, which is rather good and functional. Sharing it with the kernel's build system depends on the kbuild developers being interested in it. Thanks, Ingo ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] tools: add x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS 2010-10-27 6:01 ` Ingo Molnar @ 2010-10-27 11:43 ` Arnaldo Carvalho de Melo 0 siblings, 0 replies; 26+ messages in thread From: Arnaldo Carvalho de Melo @ 2010-10-27 11:43 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Peter Zijlstra, Len Brown, linux-pm, linux-kernel, linux-acpi, x86 Em Wed, Oct 27, 2010 at 08:01:39AM +0200, Ingo Molnar escreveu: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > On Sat, 23 Oct 2010 00:40:18 -0400 (EDT) Len Brown <lenb@kernel.org> wrote: > > tools/power/x86, eh? It seems a better place than under > > Documentation/, where such things have thus far landed! > > I looked briefly, wondering about the kbuild situation. It doesn't > > appear to be wired up, so one has to manually enter that directory > > and type `make'? > > I guess that's OK as an interim thing but longer-term I suppose we > > should have some more complete build and deployment system. So > > (thinking out loud) a `make' would invoke a `make tools', and that > > `make tools' would build the tools which are specific to the target > > arch[*], and any generic ones. And a `make tools_install' would > > install those tools in, I guess, /lib/modules/$(uname -r)/bin. > In terms of build and documentation environment, tools/perf/ has one > cloned/inherited from Git, which is rather good and functional. > Sharing it with the kernel's build system depends on the kbuild > developers being interested in it. Yes, that is how it is today, I glued it to the main makefile in at least one case: [acme@doppio linux]$ make help | grep perf perf-tar-src-pkg - Build perf-2.6.36-rc7.tar source tarball perf-targz-src-pkg - Build perf-2.6.36-rc7.tar.gz source tarball perf-tarbz2-src-pkg - Build perf-2.6.36-rc7.tar.bz2 source tarball [acme@doppio linux]$ I'd love to glue it some more, even using Kconfig and 'make toolsconfig' for configuring the tools: . Want the TUI? . 
Want to link with DWARF? Needed for features x, y and z Getting it done this way will provide examples that hopefully would lead to more kernel coding practices and infrastructure being adopted by (hell is freezing) userland programmers. This is specially important now that there are more kernel programmers writing userland code, lets hope that at least them continue to use those practices and infrastructures ;-) - Arnaldo ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH RESEND] tools: add power/x86/x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS 2010-10-23 4:40 ` [PATCH] tools: add x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS Len Brown 2010-10-27 3:23 ` Andrew Morton @ 2010-11-15 16:07 ` Len Brown 2010-11-17 11:35 ` Andi Kleen 2010-11-24 5:31 ` [PATCH v2] tools: create power/x86/x86_energy_perf_policy Len Brown 1 sibling, 2 replies; 26+ messages in thread From: Len Brown @ 2010-11-15 16:07 UTC (permalink / raw) To: Greg Kroah-Hartman; +Cc: linux-pm, linux-kernel, linux-acpi, x86 From: Len Brown <len.brown@intel.com> MSR_IA32_ENERGY_PERF_BIAS first became available on Westmere Xeon. It is implemented in all Sandy Bridge processors -- mobile, desktop and server. It is expected to become increasingly important in subsequent generations. x86_energy_perf_policy is a user-space utility to set this hardware energy vs performance policy hint in the processor. Most systems would benefit from "x86_energy_perf_policy normal" at system startup, as the hardware default is maximum performance at the expense of energy efficiency. See the comments in the source code for more information. Linux-2.6.36 added "epb" to /proc/cpuinfo to indicate if an x86 processor supports MSR_IA32_ENERGY_PERF_BIAS, though the kernel does not actually program the MSR. In March, Venkatesh Pallipadi proposed a small driver that programmed MSR_IA32_ENERGY_PERF_BIAS, based on the cpufreq governor in use. It also offered a boot-time cmdline option to override. http://lkml.org/lkml/2010/3/4/457 But hiding the hardware policy behind the governor choice was deemed "kinda icky". In June, I proposed a generic user/kernel API to consolidate the power/performance policy trade-off. "RFC: /sys/power/policy_preference" http://lkml.org/lkml/2010/6/16/399 That is my preference for implementing this capability, but I received no support on the list. 
In September, I sent x86_energy_perf_policy.c to LKML, a user-space utility that scribbles directly to the MSR. http://lkml.org/lkml/2010/9/28/246 Here is the same utility re-sent, this time proposed to reside in the kernel tools directory. Signed-off-by: Len Brown <len.brown@intel.com> --- tools/power/x86/x86_energy_perf_policy/Makefile | 7 + .../x86_energy_perf_policy.c | 358 ++++++++++++++++++++ 2 files changed, 365 insertions(+), 0 deletions(-) create mode 100644 tools/power/x86/x86_energy_perf_policy/Makefile create mode 100644 tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c diff --git a/tools/power/x86/x86_energy_perf_policy/Makefile b/tools/power/x86/x86_energy_perf_policy/Makefile new file mode 100644 index 0000000..b0763da --- /dev/null +++ b/tools/power/x86/x86_energy_perf_policy/Makefile @@ -0,0 +1,7 @@ +x86_energy_perf_policy : x86_energy_perf_policy.c + +clean : + rm -f x86_energy_perf_policy + +install : + install x86_energy_perf_policy /usr/bin/x86_energy_perf_policy diff --git a/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c new file mode 100644 index 0000000..89394d9 --- /dev/null +++ b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c @@ -0,0 +1,358 @@ +/* + * x86_energy_perf_policy -- set the energy versus performance + * policy preference bias on recent X86 processors. + */ +/* + * Copyright (c) 2010, Intel Corporation. + * Len Brown <len.brown@intel.com> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA. + */ + +#include <stdio.h> +#include <unistd.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/resource.h> +#include <fcntl.h> +#include <signal.h> +#include <sys/time.h> +#include <stdlib.h> + +unsigned int verbose; /* set with -v */ +unsigned int read_only; /* set with -r */ +char *progname; +unsigned long long new_bias; +int cpu = -1; + +/* + * Usage: + * + * -c cpu: limit action to a single CPU (default is all CPUs) + * -v: verbose output (can invoke more than once) + * -r: read-only, don't change any settings + * + * performance + * Performance is paramount. + * Unwilling to sacrafice any performance + * for the sake of energy saving. (hardware default) + * + * normal + * Can tolerate minor performance compromise + * for potentially significant energy savings. + * (reasonable default for most desktops and servers) + * + * powersave + * Can tolerate significant performance hit + * to maximize energy savings. + * + * n + * a numerical value to write to the underlying MSR. + */ +void usage(void) +{ + printf("%s: [-c cpu] [-v] " + "(-r | 'performance' | 'normal' | 'powersave' | n)\n", + progname); +} + +/* + * MSR_IA32_ENERGY_PERF_BIAS allows software to convey + * its policy for the relative importance of performance + * versus energy savings. + * + * The hardware uses this information in model-specific ways + * when it must choose trade-offs between performance and + * energy consumption. + * + * This policy hint does not supercede Processor Performance states + * (P-states) or CPU Idle power states (C-states), but allows + * software to have influence where it has been unable to + * express a preference in the past. 
+ * + * For example, this setting may tell the hardware how + * aggressively or conservatively to control frequency + * in the "turbo range" above the explicitly OS-controlled + * P-state frequency range. It may also tell the hardware + * how aggressively is should enter the OS requestec C-states. + * + * The support for this feature is indicated by CPUID.06H.ECX.bit3 + * per the Intel Architectures Software Developer's Manual. + */ + +#define MSR_IA32_ENERGY_PERF_BIAS 0x000001b0 + +#define BIAS_PERFORMANCE 0 +#define BIAS_BALANCE 6 +#define BIAS_POWERSAVE 15 + +cmdline(int argc, char **argv) { + int opt; + + progname = argv[0]; + + while ((opt = getopt(argc, argv, "+rvc:")) != -1) { + switch (opt) { + case 'c': + cpu = atoi(optarg); + break; + case 'r': + read_only = 1; + break; + case 'v': + verbose++; + break; + default: + usage(); + exit(-1); + } + } + /* if -r, then should be no additional optind */ + if (read_only && (argc > optind)) { + usage(); + exit(-1); + } + + /* + * if no -r , then must be one additional optind + */ + if (!read_only) { + + if (argc != optind + 1) { + printf("must supply -r or policy param\n"); + usage(); + exit(-1); + } + + if (!strcmp("performance", argv[optind])) { + new_bias = BIAS_PERFORMANCE; + } else if (!strcmp("normal", argv[optind])) { + new_bias = BIAS_BALANCE; + } else if (!strcmp("powersave", argv[optind])) { + new_bias = BIAS_POWERSAVE; + } else { + new_bias = atoll(argv[optind]); + if (new_bias > BIAS_POWERSAVE) { + usage(); + exit(-1); + } + } + } +} + +/* + * validate_cpuid() + * returns on success, quietly exits on failure (make verbose with -v) + */ +void validate_cpuid(void) +{ + unsigned int eax, ebx, ecx, edx, max_level; + char brand[16]; + unsigned int fms, family, model, stepping, ht_capable; + + eax = ebx = ecx = edx = 0; + + asm("cpuid" : "=a" (max_level), "=b" (ebx), "=c" (ecx), + "=d" (edx) : "a" (0)); + + sprintf(brand, "%.4s%.4s%.4s", &ebx, &edx, &ecx); + + if (strncmp(brand, "GenuineIntel", 12)) { + if 
(verbose) + printf("CPUID: %s != GenuineIntel\n", brand); + exit(-1); + } + + asm("cpuid" : "=a" (fms), "=c" (ecx), "=d" (edx) : "a" (1) : "ebx"); + family = (fms >> 8) & 0xf; + model = (fms >> 4) & 0xf; + stepping = fms & 0xf; + if (family == 6 || family == 0xf) + model += ((fms >> 16) & 0xf) << 4; + + if (verbose > 1) + printf("CPUID %s %d levels family:model:stepping " + "0x%x:%x:%x (%d:%d:%d)\n", brand, max_level, + family, model, stepping, family, model, stepping); + + if (!(edx & (1 << 5))) { + if (verbose) + printf("CPUID: no MSR\n"); + exit(-1); + } + + /* + * Support for MSR_IA32_ENERGY_PERF_BIAS + * is indicated by CPUID.06H.ECX.bit3 + */ + asm("cpuid" : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (6)); + if (verbose) + printf("CPUID.06H.ECX: 0x%x\n", ecx); + if (!(ecx & (1 << 3))) { + if (verbose) + printf("CPUID: No MSR_IA32_ENERGY_PERF_BIAS\n"); + exit(-1); + } + return; /* success */ +} + +check_dev_msr() { + struct stat sb; + + if (stat("/dev/cpu/0/msr", &sb)) { + printf("no /dev/cpu/0/msr\n"); + printf("Try \"# modprobe msr\"\n"); + exit(-5); + } +} + +unsigned long long get_msr(int cpu, int offset) +{ + unsigned long long msr; + char msr_path[32]; + int retval; + int fd; + + sprintf(msr_path, "/dev/cpu/%d/msr", cpu); + fd = open(msr_path, O_RDONLY); + if (fd < 0) { + perror(msr_path); + exit(-1); + } + + retval = pread(fd, &msr, sizeof msr, offset); + + if (retval != sizeof msr) { + printf("pread cpu%d 0x%x = %d\n", cpu, offset, retval); + exit(-2); + } + close(fd); + return msr; +} + +unsigned long long put_msr(int cpu, unsigned long long new_msr, int offset) +{ + unsigned long long old_msr; + char msr_path[32]; + int retval; + int fd; + + sprintf(msr_path, "/dev/cpu/%d/msr", cpu); + fd = open(msr_path, O_RDWR); + if (fd < 0) { + perror(msr_path); + exit(-1); + } + + retval = pread(fd, &old_msr, sizeof old_msr, offset); + if (retval != sizeof old_msr) { + perror("pwrite"); + printf("pread cpu%d 0x%x = %d\n", cpu, offset, retval); + 
exit(-2); + } + + retval = pwrite(fd, &new_msr, sizeof new_msr, offset); + if (retval != sizeof new_msr) { + perror("pwrite"); + printf("pwrite cpu%d 0x%x = %d\n", cpu, offset, retval); + exit(-2); + } + + close(fd); + + return old_msr; +} + +void print_msr(int cpu) +{ + printf("cpu%d: 0x%016llx\n", + cpu, get_msr(cpu, MSR_IA32_ENERGY_PERF_BIAS)); +} + +void update_msr(int cpu) +{ + unsigned long long previous_msr; + + previous_msr = put_msr(cpu, new_bias, MSR_IA32_ENERGY_PERF_BIAS); + + if (verbose) + printf("cpu%d msr0x%x 0x%016llx -> 0x%016llx\n", + cpu, MSR_IA32_ENERGY_PERF_BIAS, previous_msr, new_bias); + + return; +} + +char *proc_stat = "/proc/stat"; +/* + * run func() on every cpu in /dev/cpu + */ +void for_every_cpu(void (func)(int)) +{ + FILE *fp; + int cpu_count; + int retval; + + fp = fopen(proc_stat, "r"); + if (fp == NULL) { + perror(proc_stat); + exit(-1); + } + + retval = fscanf(fp, "cpu %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n"); + if (retval != 0) { + perror("/proc/stat format"); + exit(-1); + } + + for (cpu_count = 0; ; cpu_count++) { + int cpu; + + retval = fscanf(fp, + "cpu%u %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n", + &cpu); + if (retval != 1) + return; + + func(cpu); + } + fclose(fp); +} + +int main(int argc, char **argv) +{ + cmdline(argc, argv); + + if (verbose > 1) + printf("x86_energy_perf_policy Aug 2, 2010" + " - Len Brown <lenb@kernel.org>\n"); + if (verbose > 1 && !read_only) + printf("new_bias %lld\n", new_bias); + + validate_cpuid(); + check_dev_msr(); + + if (cpu != -1) { + if (read_only) + print_msr(cpu); + else + update_msr(cpu); + } else { + if (read_only) + for_every_cpu(print_msr); + else + for_every_cpu(update_msr); + } + + return 0; +} -- 1.7.3.1.127.g1bb28 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH RESEND] tools: add power/x86/x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS
  2010-11-15 16:07 ` [PATCH RESEND] tools: add power/x86/x86_energy_perf_policy " Len Brown
@ 2010-11-17 11:35 ` Andi Kleen
  2010-11-22 20:13 ` Len Brown
  2010-11-24 5:31 ` [PATCH v2] tools: create power/x86/x86_energy_perf_policy Len Brown
  1 sibling, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2010-11-17 11:35 UTC (permalink / raw)
To: Len Brown; +Cc: Greg Kroah-Hartman, linux-pm, linux-kernel, linux-acpi, x86

Len Brown <lenb@kernel.org> writes:

> @@ -0,0 +1,7 @@
> +x86_energy_perf_policy : x86_energy_perf_policy.c
> +
> +clean :
> +	rm -f x86_energy_perf_policy
> +
> +install :
> +	install x86_energy_perf_policy /usr/bin/x86_energy_perf_policy

It's not clear to me how this Makefile ensures it's only built on x86.

If someone on another architecture does a full tools build in the future (I think that is not wired up yet, but should eventually) such a mechanism would be needed.

> +
> +/*
> + * Usage:
...

This full comment and parts of the following comments describing the semantics need to be available somewhere to the user who may not have easy access to the source.

Can you make it display in usage or convert it to a manpage? I would prefer a manpage.

> +
> +cmdline(int argc, char **argv) {

No type?

> +	int opt;
> +
> +	progname = argv[0];
> +
> +	while ((opt = getopt(argc, argv, "+rvc:")) != -1) {

Maybe it's me, but I prefer having long options too (getopt_long). These are easier to memorize.

> +
> +	/*
> +	 * if no -r , then must be one additional optind
> +	 */
> +	if (!read_only) {
> +
> +		if (argc != optind + 1) {
> +			printf("must supply -r or policy param\n");
> +			usage();
> +			exit(-1);

-1 is an unusual exit code. Better use 1.

An obvious improvement would be to put the exit() into usage()

> +		}
> +
> +		if (!strcmp("performance", argv[optind])) {
> +			new_bias = BIAS_PERFORMANCE;
> +		} else if (!strcmp("normal", argv[optind])) {
> +			new_bias = BIAS_BALANCE;
> +		} else if (!strcmp("powersave", argv[optind])) {
> +			new_bias = BIAS_POWERSAVE;
> +		} else {
> +			new_bias = atoll(argv[optind]);

If you used strtoull() you could actually check if the input is really a number (end == argv[optind])

> +	eax = ebx = ecx = edx = 0;
> +
> +	asm("cpuid" : "=a" (max_level), "=b" (ebx), "=c" (ecx),
> +	    "=d" (edx) : "a" (0));

Strictly for 386/early 486 you would need to check if cpuid is available using pushf too. Perhaps it's safer to use cpuinfo

> +
> +check_dev_msr() {

Return type missing again

> +	struct stat sb;
> +
> +	if (stat("/dev/cpu/0/msr", &sb)) {
> +		printf("no /dev/cpu/0/msr\n");

This will fail if we eventually implement cpu 0 hotplug... Better readdir or similar.

> +		printf("Try \"# modprobe msr\"\n");
> +		exit(-5);

Again -5 is unusual.

> +	char msr_path[32];
> +	int retval;
> +	int fd;
> +
> +	sprintf(msr_path, "/dev/cpu/%d/msr", cpu);
> +	fd = open(msr_path, O_RDONLY);
> +	if (fd < 0) {
> +		perror(msr_path);
> +		exit(-1);

This should be a soft error because the CPU can go away any time.

> +/*
> + * run func() on every cpu in /dev/cpu
> + */
> +void for_every_cpu(void (func)(int))
> +{
> +	FILE *fp;
> +	int cpu_count;
> +	int retval;
> +
> +	fp = fopen(proc_stat, "r");

Using /proc/stat to get the number of CPUs is unusual and you don't handle holes in the cpu numbers which can happen due to hotplug.

I would just readdir or fnmatch the MSR /dev/cpu/* directories.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply [flat|nested] 26+ messages in thread
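The strtoull() suggestion in the review above can be sketched roughly as follows. This is only an illustration, not code from the patch: parse_bias() and BIAS_MAX are hypothetical names, and the 0..15 limit is taken from BIAS_POWERSAVE in the patch. Rejecting trailing characters and out-of-range values goes slightly beyond the (end == argv[optind]) check mentioned in the review.

```c
#include <errno.h>
#include <stdlib.h>

#define BIAS_MAX 15	/* assumed upper bound, same value as BIAS_POWERSAVE */

/*
 * Hypothetical helper: parse a numeric bias argument.
 * Returns 0 and stores the value in *out on success,
 * -1 if the string is not a number in the range 0..BIAS_MAX.
 */
int parse_bias(const char *arg, unsigned long long *out)
{
	char *end;
	unsigned long long val;

	errno = 0;
	val = strtoull(arg, &end, 0);	/* base 0: accepts decimal, octal, hex */

	if (end == arg)			/* no digits consumed at all */
		return -1;
	if (*end != '\0')		/* trailing junk, e.g. "7x" */
		return -1;
	if (errno == ERANGE || val > BIAS_MAX)
		return -1;

	*out = val;
	return 0;
}
```

With atoll(), as in the RESEND patch, a mistyped policy name such as "nromal" falls through the strcmp() chain, converts to 0 and is silently written as the performance setting; a check along these lines would turn that into a usage error instead.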
* Re: [PATCH RESEND] tools: add power/x86/x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS
  2010-11-17 11:35 ` Andi Kleen
@ 2010-11-22 20:13 ` Len Brown
  2010-11-22 20:33 ` Andi Kleen
  0 siblings, 1 reply; 26+ messages in thread
From: Len Brown @ 2010-11-22 20:13 UTC (permalink / raw)
To: Andi Kleen; +Cc: Greg Kroah-Hartman, linux-pm, linux-kernel, linux-acpi, x86

Hi Andi,

Thank you for the review! Responses below.

> > +install :
> > +	install x86_energy_perf_policy /usr/bin/x86_energy_perf_policy
>
> It's not clear to me how this Makefile ensures it's only
> built on x86.
>
> If someone on another architecture does a full tools build
> in the future (I think that is not wired up yet, but should
> eventually) such a mechanism would be needed.

Per the comments from Andrew and others, the concept of a "full tools build" doesn't actually exist (yet).

So I guess the only assurance that somebody not on x86 would not run make in this directory is that this utility lives in tools/power/x86/

Note that there are other utilities under tools which have no Makefile at all...

> ...I would prefer a manpage

I'll be happy to write a manpage. Is there a good example I should follow?

> > +cmdline(int argc, char **argv) {
>
> No type?

okay, now void.

> > +	while ((opt = getopt(argc, argv, "+rvc:")) != -1) {
>
> Maybe it's me, but I prefer having long options too (getopt_long)
> These are easier to memorize.

I'm not inclined to bother, as the use-case for this utility is to be invoked by another program, and the options available are really there just for verification/debugging, and don't really merit being memorized by a human after that task.

> An obvious improvement would be to put the exit() into usage()

done.

> > +			new_bias = atoll(argv[optind]);
>
> If you used strtoull() you could actually check if the input
> is really a number (end == argv[optind])

done.

> > +	asm("cpuid" : "=a" (max_level), "=b" (ebx), "=c" (ecx),
> > +	    "=d" (edx) : "a" (0));
>
> Strictly for 386/early 486 you would need to check if cpuid
> is available using pushf too. Perhaps it's safer to use cpuinfo

Meh, maybe simpler to crash on 486 and earlier? :-) I'm not fond of parsing /proc/cpuinfo.

> > +check_dev_msr() {
>
> Return type missing again

routine deleted.

> > +	struct stat sb;
> > +
> > +	if (stat("/dev/cpu/0/msr", &sb)) {
> > +		printf("no /dev/cpu/0/msr\n");
>
> This will fail if we eventually implement cpu 0 hotplug...
> Better readdir or similar.

It is simpler to delete check_dev_msr() and stumble forward assuming /dev/cpu/*/msr exists, and print a message and exit if it doesn't.

> > +		printf("Try \"# modprobe msr\"\n");
> > +		exit(-5);
>
> Again -5 is unusual.

okay, I changed all the exits to 1.

> > +	sprintf(msr_path, "/dev/cpu/%d/msr", cpu);
> > +	fd = open(msr_path, O_RDONLY);
> > +	if (fd < 0) {
> > +		perror(msr_path);
> > +		exit(-1);
>
> This should be a soft error because the CPU can go away
> any time.

In the highly unlikely scenario that somebody uses the -r option to exercise the read-only code, and simultaneously invokes and completes a cpu hot remove during the execution of this utility, I think the utility exiting is just as useful as, and less complicated than, handling a soft error. In either case, the user would probably simply re-invoke the utility to see what the current state of the settled machine is.

> > +/*
> > + * run func() on every cpu in /dev/cpu
> > + */
...
> > +	fp = fopen(proc_stat, "r");
>
> Using /proc/stat to get the number of CPUs is unusual
> and you don't handle holes in the cpu numbers which
> can happen due to hotplug.

The code does handle holes in the cpu number namespace. The "cpu_count" variable was a hold-over from an older version that did not, and so I've deleted it.

> I would just readdir or fnmatch the MSR /dev/cpu/* directories.

I used to do that, but Arjan convinced me to use /proc/stat. turbostat, rdmsr, and wrmsr all use /proc/stat.

thanks,
-Len Brown, Intel Open Source Technology Center

^ permalink raw reply [flat|nested] 26+ messages in thread
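To illustrate the point about holes in the CPU numbering, here is a minimal sketch of taking CPU numbers from the "cpuN" lines of /proc/stat rather than counting from 0. It is not the code from the patch: list_cpus() is a hypothetical helper, and the sketch assumes that the per-cpu lines appear before any other line, fit in 256 bytes, and that offline CPUs are simply absent from the file.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: collect the CPU numbers named by the per-cpu
 * "cpuN ..." lines of a /proc/stat-style stream.
 * Returns the number of entries stored in cpus[] (at most max).
 */
int list_cpus(FILE *fp, int *cpus, int max)
{
	char line[256];
	int count = 0;

	while (count < max && fgets(line, sizeof(line), fp)) {
		int id;

		/* the "cpu" lines come first; stop at the first other line */
		if (strncmp(line, "cpu", 3) != 0)
			break;
		/* skip the aggregate "cpu  ..." line, which carries no number */
		if (!isdigit((unsigned char)line[3]))
			continue;
		if (sscanf(line + 3, "%d", &id) == 1)
			cpus[count++] = id;
	}
	return count;
}
```

Because each number is read from the line itself, a stream listing cpu0 and cpu2 yields exactly those two numbers, so a missing cpu1 is skipped rather than guessed.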
* Re: [PATCH RESEND] tools: add power/x86/x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS
  2010-11-22 20:13 ` Len Brown
@ 2010-11-22 20:33 ` Andi Kleen
  2010-11-23 4:48 ` Len Brown
  0 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2010-11-22 20:33 UTC (permalink / raw)
To: Len Brown
Cc: Andi Kleen, Greg Kroah-Hartman, linux-pm, linux-kernel, linux-acpi, x86

On Mon, Nov 22, 2010 at 03:13:24PM -0500, Len Brown wrote:
> Per the comments from Andrew and others, the concept of a
> "full tools build" doesn't actually exist (yet).
>
> So I guess the only assurance that somebody not on x86 would not run
> make in this directory is that this utility lives in tools/power/x86/
>
> Note that there are other utilities under tools
> which have no Makefile at all...

I suspect this will need to be fixed at some point.

e.g. kernel rpms probably don't want to hard code all of this but just call some standard make file target. And the kernel eventually needs a make install_user or similar.

> > ...I would prefer a manpage
>
> I'll be happy to write a manpage.
> Is there a good example I should follow?

Just pick one from /usr/share/man. You can grep for my name if you want one written by me, but I don't claim they are necessarily better than others @)

> I'm not inclined to bother, as the use-case for this utility
> is to be invoked by another program, and the options available

What other program?

I could well imagine administrators sticking this into their boot.locals to set the policy they want.

> In the highly unlikely scenario that somebody uses
> the -r option to exercise the read-only code,
> and simultaneously invokes and completes a cpu hot remove

FWIW there are setups where core offlining can happen automatically in response to an error.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RESEND] tools: add power/x86/x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS
  2010-11-22 20:33 ` Andi Kleen
@ 2010-11-23 4:48 ` Len Brown
  0 siblings, 0 replies; 26+ messages in thread
From: Len Brown @ 2010-11-23 4:48 UTC (permalink / raw)
To: Andi Kleen; +Cc: Greg Kroah-Hartman, linux-pm, linux-kernel, linux-acpi, x86

On Mon, 22 Nov 2010, Andi Kleen wrote:

> On Mon, Nov 22, 2010 at 03:13:24PM -0500, Len Brown wrote:
> > Per the comments from Andrew and others, the concept of a
> > "full tools build" doesn't actually exist (yet).
> >
> > So I guess the only assurance that somebody not on x86 would not run
> > make in this directory is that this utility lives in tools/power/x86/
> >
> > Note that there are other utilities under tools
> > which have no Makefile at all...
>
> I suspect this will need to be fixed at some point.
>
> e.g. kernel rpms probably don't want to hard code all of this
> but just call some standard make file target. And the kernel
> eventually needs a make install_user or similar.

I agree, but I don't volunteer to set up such a build system as part of this particular patch. As I mentioned, supplying any Makefile is a step better than some of the peers...

> > I'm not inclined to bother, as the use-case for this utility
> > is to be invoked by another program, and the options available
>
> What other program?
>
> I could well imagine administrators sticking this
> into their boot.locals to set the policy they want.

Right, and that would be a program. It is unlikely that users are going to be typing this command, except into an admin script.

> > In the highly unlikely scenario that somebody uses
> > the -r option to exercise the read-only code,
> > and simultaneously invokes and completes a cpu hot remove
>
> FWIW there are setups where core offlining can happen
> automatically in response to an error.

Understood. I think it is fine if this utility simply exits if that error occurs while it is running. (turbostat, OTOH, may be long running, and it treats vanishing processors as a recoverable error)

thanks,
-Len Brown, Intel Open Source Technology Center

^ permalink raw reply [flat|nested] 26+ messages in thread
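For comparison, the "soft error" behaviour discussed above could look roughly like the sketch below, in which a failed MSR read is reported to the caller instead of terminating the program. This is an assumption-laden illustration rather than code from the patch or from turbostat: read_msr_soft() is a hypothetical name, and it takes the device path as a parameter so the caller decides which /dev/cpu/N/msr node to open.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical variant of get_msr(): report failure to the caller
 * instead of exiting. Returns 0 on success, -1 on any error.
 */
int read_msr_soft(const char *msr_path, off_t offset, unsigned long long *val)
{
	ssize_t retval;
	int fd;

	fd = open(msr_path, O_RDONLY);
	if (fd < 0) {
		perror(msr_path);	/* e.g. the CPU went offline */
		return -1;
	}

	retval = pread(fd, val, sizeof(*val), offset);
	close(fd);

	if (retval != (ssize_t)sizeof(*val)) {
		fprintf(stderr, "pread %s 0x%llx = %zd\n", msr_path,
			(unsigned long long)offset, retval);
		return -1;
	}
	return 0;
}
```

A caller looping over CPUs could then skip a CPU that returns -1 and carry on with the rest. Whether that extra handling is worthwhile for a short-lived utility is exactly the trade-off debated in the messages above.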
* [PATCH v2] tools: create power/x86/x86_energy_perf_policy 2010-11-15 16:07 ` [PATCH RESEND] tools: add power/x86/x86_energy_perf_policy " Len Brown 2010-11-17 11:35 ` Andi Kleen @ 2010-11-24 5:31 ` Len Brown 2010-11-25 5:52 ` Chen Gong 1 sibling, 1 reply; 26+ messages in thread From: Len Brown @ 2010-11-24 5:31 UTC (permalink / raw) To: Greg Kroah-Hartman; +Cc: linux-pm, linux-kernel, linux-acpi, x86 From: Len Brown <len.brown@intel.com> MSR_IA32_ENERGY_PERF_BIAS first became available on Westmere Xeon. It is implemented in all Sandy Bridge processors -- mobile, desktop and server. It is expected to become increasingly important in subsequent generations. x86_energy_perf_policy is a user-space utility to set this hardware energy vs performance policy hint in the processor. Most systems would benefit from "x86_energy_perf_policy normal" at system startup, as the hardware default is maximum performance at the expense of energy efficiency. Linux-2.6.36 added "epb" to /proc/cpuinfo to indicate if an x86 processor supports MSR_IA32_ENERGY_PERF_BIAS, though the kernel does not actually program the MSR. In March, Venkatesh Pallipadi proposed a small driver that programmed MSR_IA32_ENERGY_PERF_BIAS, based on the cpufreq governor in use. It also offered a boot-time cmdline option to override. http://lkml.org/lkml/2010/3/4/457 But hiding the hardware policy behind the governor choice was deemed "kinda icky". So in June, I proposed a generic user/kernel API to consolidate the power/performance policy trade-off. "RFC: /sys/power/policy_preference" http://lkml.org/lkml/2010/6/16/399 That is my preference for implementing this capability, but I received no support on the list. So in September, I sent x86_energy_perf_policy.c to LKML, a user-space utility that scribbles directly to the MSR. http://lkml.org/lkml/2010/9/28/246 Here is the same utility re-sent, this time proposed to reside in the kernel tools directory. 
Signed-off-by: Len Brown <len.brown@intel.com> --- v2 create man page minor tweaks in response to review comments tools/power/x86/x86_energy_perf_policy/Makefile | 8 + .../x86_energy_perf_policy.8 | 104 +++++++ .../x86_energy_perf_policy.c | 325 ++++++++++++++++++++ diff --git a/tools/power/x86/x86_energy_perf_policy/Makefile b/tools/power/x86/x86_energy_perf_policy/Makefile new file mode 100644 index 0000000..f458237 --- /dev/null +++ b/tools/power/x86/x86_energy_perf_policy/Makefile @@ -0,0 +1,8 @@ +x86_energy_perf_policy : x86_energy_perf_policy.c + +clean : + rm -f x86_energy_perf_policy + +install : + install x86_energy_perf_policy /usr/bin/ + install x86_energy_perf_policy.8 /usr/share/man/man8/ diff --git a/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8 b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8 new file mode 100644 index 0000000..8eaaad6 --- /dev/null +++ b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8 @@ -0,0 +1,104 @@ +.\" This page Copyright (C) 2010 Len Brown <len.brown@intel.com> +.\" Distributed under the GPL, Copyleft 1994. +.TH X86_ENERGY_PERF_POLICY 8 +.SH NAME +x86_energy_perf_policy \- read or write MSR_IA32_ENERGY_PERF_BIAS +.SH SYNOPSIS +.ft B +.B x86_energy_perf_policy +.RB [ "\-c cpu" ] +.RB [ "\-v" ] +.RB "\-r" +.br +.B x86_energy_perf_policy +.RB [ "\-c cpu" ] +.RB [ "\-v" ] +.RB 'performance' +.br +.B x86_energy_perf_policy +.RB [ "\-c cpu" ] +.RB [ "\-v" ] +.RB 'normal' +.br +.B x86_energy_perf_policy +.RB [ "\-c cpu" ] +.RB [ "\-v" ] +.RB 'powersave' +.br +.B x86_energy_perf_policy +.RB [ "\-c cpu" ] +.RB [ "\-v" ] +.RB n +.br +.SH DESCRIPTION +\fBx86_energy_perf_policy\fP +allows software to convey +its policy for the relative importance of performance +versus energy savings to the processor. + +The processor uses this information in model-specific ways +when it must select trade-offs between performance and +energy efficiency. 
+ +This policy hint does not supersede Processor Performance states +(P-states) or CPU Idle power states (C-states), but allows +software to have influence where it would otherwise be unable +to express a preference. + +For example, this setting may tell the hardware how +aggressively or conservatively to control frequency +in the "turbo range" above the explicitly OS-controlled +P-state frequency range. It may also tell the hardware +how aggressively is should enter the OS requested C-states. + +Support for this feature is indicated by CPUID.06H.ECX.bit3 +per the Intel Architectures Software Developer's Manual. + +.SS Options +\fB-c\fP limits operation to a single CPU. +The default is to operate on all CPUs. +Note that MSR_IA32_ENERGY_PERF_BIAS is defined per +logical processor, but that the initial implementations +of the MSR were shared among all processors in each package. +.PP +\fB-v\fP increases verbosity. By default +x86_energy_perf_policy is silent. +.PP +\fB-r\fP is for "read-only" mode - the unchanged state +is read and displayed. +.PP +.I performance +Set a policy where performance is paramount. +The processor will be unwilling to sacrifice any performance +for the sake of energy saving. This is the hardware default. +.PP +.I normal +Set a policy with a normal balance between performance and energy efficiency. +The processor will tolerate minor performance compromise +for potentially significant energy savings. +This reasonable default for most desktops and servers. +.PP +.I powersave +Set a policy where the processor can accept +a measurable performance hit to maximize energy efficiency. +.PP +.I n +Set MSR_IA32_ENERGY_PERF_BIAS to the specified number. +The range of valid numbers is 0-15, where 0 is maximum +performance and 15 is maximum energy efficiency. + +.SH NOTES +.B "x86_energy_perf_policy " +runs only as root. 
+.SH FILES +.ta +.nf +/dev/cpu/*/msr +.fi + +.SH "SEE ALSO" +msr(4) +.PP +.SH AUTHORS +.nf +Written by Len Brown <len.brown@intel.com> diff --git a/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c new file mode 100644 index 0000000..b539923 --- /dev/null +++ b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c @@ -0,0 +1,325 @@ +/* + * x86_energy_perf_policy -- set the energy versus performance + * policy preference bias on recent X86 processors. + */ +/* + * Copyright (c) 2010, Intel Corporation. + * Len Brown <len.brown@intel.com> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA. + */ + +#include <stdio.h> +#include <unistd.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/resource.h> +#include <fcntl.h> +#include <signal.h> +#include <sys/time.h> +#include <stdlib.h> +#include <string.h> + +unsigned int verbose; /* set with -v */ +unsigned int read_only; /* set with -r */ +char *progname; +unsigned long long new_bias; +int cpu = -1; + +/* + * Usage: + * + * -c cpu: limit action to a single CPU (default is all CPUs) + * -v: verbose output (can invoke more than once) + * -r: read-only, don't change any settings + * + * performance + * Performance is paramount. + * Unwilling to sacrafice any performance + * for the sake of energy saving. 
(hardware default)
+ *
+ * normal
+ *	Can tolerate minor performance compromise
+ *	for potentially significant energy savings.
+ *	(reasonable default for most desktops and servers)
+ *
+ * powersave
+ *	Can tolerate significant performance hit
+ *	to maximize energy savings.
+ *
+ * n
+ *	a numerical value to write to the underlying MSR.
+ */
+void usage(void)
+{
+	printf("%s: [-c cpu] [-v] "
+		"(-r | 'performance' | 'normal' | 'powersave' | n)\n",
+		progname);
+	exit(1);
+}
+
+#define MSR_IA32_ENERGY_PERF_BIAS 0x000001b0
+
+#define BIAS_PERFORMANCE 0
+#define BIAS_BALANCE 6
+#define BIAS_POWERSAVE 15
+
+void cmdline(int argc, char **argv)
+{
+	int opt;
+
+	progname = argv[0];
+
+	while ((opt = getopt(argc, argv, "+rvc:")) != -1) {
+		switch (opt) {
+		case 'c':
+			cpu = atoi(optarg);
+			break;
+		case 'r':
+			read_only = 1;
+			break;
+		case 'v':
+			verbose++;
+			break;
+		default:
+			usage();
+		}
+	}
+	/* if -r, then should be no additional optind */
+	if (read_only && (argc > optind))
+		usage();
+
+	/*
+	 * if no -r , then must be one additional optind
+	 */
+	if (!read_only) {
+
+		if (argc != optind + 1) {
+			printf("must supply -r or policy param\n");
+			usage();
+		}
+
+		if (!strcmp("performance", argv[optind])) {
+			new_bias = BIAS_PERFORMANCE;
+		} else if (!strcmp("normal", argv[optind])) {
+			new_bias = BIAS_BALANCE;
+		} else if (!strcmp("powersave", argv[optind])) {
+			new_bias = BIAS_POWERSAVE;
+		} else {
+			char *endptr;
+
+			new_bias = strtoull(argv[optind], &endptr, 0);
+			if (endptr == argv[optind] ||
+			    new_bias > BIAS_POWERSAVE) {
+				fprintf(stderr, "invalid value: %s\n",
+					argv[optind]);
+				usage();
+			}
+		}
+	}
+}
+
+/*
+ * validate_cpuid()
+ * returns on success, quietly exits on failure (make verbose with -v)
+ */
+void validate_cpuid(void)
+{
+	unsigned int eax, ebx, ecx, edx, max_level;
+	char brand[16];
+	unsigned int fms, family, model, stepping;
+
+	eax = ebx = ecx = edx = 0;
+
+	asm("cpuid" : "=a" (max_level), "=b" (ebx), "=c" (ecx),
+		"=d" (edx) : "a" (0));
+
+	if (ebx != 0x756e6547 || edx != 0x49656e69 || ecx != 0x6c65746e) {
+		if (verbose)
+			fprintf(stderr, "%.4s%.4s%.4s != GenuineIntel",
+				(char *)&ebx, (char *)&edx, (char *)&ecx);
+		exit(1);
+	}
+
+	asm("cpuid" : "=a" (fms), "=c" (ecx), "=d" (edx) : "a" (1) : "ebx");
+	family = (fms >> 8) & 0xf;
+	model = (fms >> 4) & 0xf;
+	stepping = fms & 0xf;
+	if (family == 6 || family == 0xf)
+		model += ((fms >> 16) & 0xf) << 4;
+
+	if (verbose > 1)
+		printf("CPUID %s %d levels family:model:stepping "
+			"0x%x:%x:%x (%d:%d:%d)\n", brand, max_level,
+			family, model, stepping, family, model, stepping);
+
+	if (!(edx & (1 << 5))) {
+		if (verbose)
+			printf("CPUID: no MSR\n");
+		exit(1);
+	}
+
+	/*
+	 * Support for MSR_IA32_ENERGY_PERF_BIAS
+	 * is indicated by CPUID.06H.ECX.bit3
+	 */
+	asm("cpuid" : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (6));
+	if (verbose)
+		printf("CPUID.06H.ECX: 0x%x\n", ecx);
+	if (!(ecx & (1 << 3))) {
+		if (verbose)
+			printf("CPUID: No MSR_IA32_ENERGY_PERF_BIAS\n");
+		exit(1);
+	}
+	return;	/* success */
+}
+
+unsigned long long get_msr(int cpu, int offset)
+{
+	unsigned long long msr;
+	char msr_path[32];
+	int retval;
+	int fd;
+
+	sprintf(msr_path, "/dev/cpu/%d/msr", cpu);
+	fd = open(msr_path, O_RDONLY);
+	if (fd < 0) {
+		printf("Try \"# modprobe msr\"\n");
+		perror(msr_path);
+		exit(1);
+	}
+
+	retval = pread(fd, &msr, sizeof msr, offset);
+
+	if (retval != sizeof msr) {
+		printf("pread cpu%d 0x%x = %d\n", cpu, offset, retval);
+		exit(-2);
+	}
+	close(fd);
+	return msr;
+}
+
+unsigned long long put_msr(int cpu, unsigned long long new_msr, int offset)
+{
+	unsigned long long old_msr;
+	char msr_path[32];
+	int retval;
+	int fd;
+
+	sprintf(msr_path, "/dev/cpu/%d/msr", cpu);
+	fd = open(msr_path, O_RDWR);
+	if (fd < 0) {
+		perror(msr_path);
+		exit(1);
+	}
+
+	retval = pread(fd, &old_msr, sizeof old_msr, offset);
+	if (retval != sizeof old_msr) {
+		perror("pwrite");
+		printf("pread cpu%d 0x%x = %d\n", cpu, offset, retval);
+		exit(-2);
+	}
+
+	retval = pwrite(fd, &new_msr, sizeof new_msr, offset);
+	if (retval != sizeof new_msr) {
+		perror("pwrite");
+		printf("pwrite cpu%d 0x%x = %d\n", cpu, offset, retval);
+		exit(-2);
+	}
+
+	close(fd);
+
+	return old_msr;
+}
+
+void print_msr(int cpu)
+{
+	printf("cpu%d: 0x%016llx\n",
+		cpu, get_msr(cpu, MSR_IA32_ENERGY_PERF_BIAS));
+}
+
+void update_msr(int cpu)
+{
+	unsigned long long previous_msr;
+
+	previous_msr = put_msr(cpu, new_bias, MSR_IA32_ENERGY_PERF_BIAS);
+
+	if (verbose)
+		printf("cpu%d msr0x%x 0x%016llx -> 0x%016llx\n",
+			cpu, MSR_IA32_ENERGY_PERF_BIAS, previous_msr, new_bias);
+
+	return;
+}
+
+char *proc_stat = "/proc/stat";
+/*
+ * run func() on every cpu in /dev/cpu
+ */
+void for_every_cpu(void (func)(int))
+{
+	FILE *fp;
+	int retval;
+
+	fp = fopen(proc_stat, "r");
+	if (fp == NULL) {
+		perror(proc_stat);
+		exit(1);
+	}
+
+	retval = fscanf(fp, "cpu %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n");
+	if (retval != 0) {
+		perror("/proc/stat format");
+		exit(1);
+	}
+
+	while (1) {
+		int cpu;
+
+		retval = fscanf(fp,
+			"cpu%u %*d %*d %*d %*d %*d %*d %*d %*d %*d %*d\n",
+			&cpu);
+		if (retval != 1)
+			return;
+
+		func(cpu);
+	}
+	fclose(fp);
+}
+
+int main(int argc, char **argv)
+{
+	cmdline(argc, argv);
+
+	if (verbose > 1)
+		printf("x86_energy_perf_policy Nov 24, 2010"
+			" - Len Brown <lenb@kernel.org>\n");
+	if (verbose > 1 && !read_only)
+		printf("new_bias %lld\n", new_bias);
+
+	validate_cpuid();
+
+	if (cpu != -1) {
+		if (read_only)
+			print_msr(cpu);
+		else
+			update_msr(cpu);
+	} else {
+		if (read_only)
+			for_every_cpu(print_msr);
+		else
+			for_every_cpu(update_msr);
+	}
+
+	return 0;
+}
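The policy-argument handling in cmdline() can be read in isolation as a mapping from a string to a bias value in the range 0-15. The following is a minimal sketch of that mapping as a standalone function, not code from the patch: parse_bias() is a hypothetical name, and unlike the posted cmdline() it also rejects trailing characters (for example "7x"), which is an added assumption about the desired behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Bias values as defined in the posted patch. */
#define BIAS_PERFORMANCE	0
#define BIAS_BALANCE		6
#define BIAS_POWERSAVE		15

/*
 * parse_bias() is a hypothetical helper, not part of the patch.
 * It maps a policy name or a number (any base accepted by strtoull
 * with base 0) to a bias value.
 * Returns 0 and stores the value in *bias on success, -1 otherwise.
 */
int parse_bias(const char *arg, unsigned long long *bias)
{
	char *endptr;
	unsigned long long val;

	assert(arg != NULL && bias != NULL);

	if (!strcmp(arg, "performance")) {
		*bias = BIAS_PERFORMANCE;
		return 0;
	}
	if (!strcmp(arg, "normal")) {
		*bias = BIAS_BALANCE;
		return 0;
	}
	if (!strcmp(arg, "powersave")) {
		*bias = BIAS_POWERSAVE;
		return 0;
	}

	errno = 0;
	val = strtoull(arg, &endptr, 0);
	/* Reject empty input, trailing garbage, overflow and out-of-range. */
	if (endptr == arg || *endptr != '\0' || errno == ERANGE ||
	    val > BIAS_POWERSAVE)
		return -1;

	*bias = val;
	return 0;
}
```

Keeping the parsing separate from getopt() handling would let the range check be exercised without touching /dev/cpu/*/msr.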
* Re: [PATCH v2] tools: create power/x86/x86_energy_perf_policy
From: Chen Gong @ 2010-11-25 5:52 UTC
To: Len Brown; +Cc: Greg Kroah-Hartman, linux-pm, linux-kernel, linux-acpi, x86

On 11/24/2010 1:31 PM, Len Brown wrote:
> From: Len Brown <len.brown@intel.com>
>
> [...]
>
> +void usage(void)
> +{
> +	printf("%s: [-c cpu] [-v] "
> +		"(-r | 'performance' | 'normal' | 'powersave' | n)\n",
> +		progname);
> +	exit(1);
> +}
>
> [...]

I have two questions.

1. The usage message looks too terse. Without reading the comments
in the source code, I cannot tell what the parameters mean, for
example -v versus -vv. How about making those comments part of
the usage output?

2. The parameter "normal | performance | powersave | n" looks odd.
Why can't it be an option like the others (-r, -v, etc.)? For example,
I can't invoke the tool as
"./x86_energy_perf_policy -c 0 normal -v"
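Question 1 asks for the source comments to become part of the usage output. A minimal sketch of what that could look like is below. The descriptions are copied from the usage comment in the patch; the layout, the usage_text[] array and the print_usage() name are assumptions for illustration, not something proposed in the thread.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical expanded help text. The wording follows the usage
 * comment in the posted patch; the column layout is an assumption.
 */
static const char usage_text[] =
	"[-c cpu] [-v] (-r | 'performance' | 'normal' | 'powersave' | n)\n"
	"  -c cpu       limit action to a single CPU (default is all CPUs)\n"
	"  -v           verbose output (can invoke more than once)\n"
	"  -r           read-only, don't change any settings\n"
	"  performance  Performance is paramount. Unwilling to sacrifice\n"
	"               any performance for the sake of energy saving.\n"
	"               (hardware default)\n"
	"  normal       Can tolerate minor performance compromise for\n"
	"               potentially significant energy savings.\n"
	"               (reasonable default for most desktops and servers)\n"
	"  powersave    Can tolerate significant performance hit\n"
	"               to maximize energy savings.\n"
	"  n            a numerical value to write to the underlying MSR.\n";

/*
 * print_usage() is a hypothetical replacement for the body of usage().
 * It writes the help text to fp and returns the fprintf() result;
 * the caller decides whether to exit.
 */
int print_usage(FILE *fp, const char *progname)
{
	assert(fp != NULL && progname != NULL);
	return fprintf(fp, "%s: %s", progname, usage_text);
}
```

On question 2, the ordering restriction follows from the leading '+' in the option string "+rvc:". With GNU getopt(), a leading '+' stops option processing at the first non-option argument, so "-c 0 normal -v" leaves two arguments after the options and fails the "argc != optind + 1" check in cmdline().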
* Re: [PATCH v2] tools: create power/x86/x86_energy_perf_policy
From: Chen Gong @ 2010-11-25 8:59 UTC
To: Chen Gong
Cc: Len Brown, Greg Kroah-Hartman, linux-pm, linux-kernel, linux-acpi, x86

On 11/25/2010 1:52 PM, Chen Gong wrote:
> On 11/24/2010 1:31 PM, Len Brown wrote:
>> [...]
>
> I have 2 questions.
>
> [...]

One more question. According to the spec, software should write 1 to
MSR 0x1FC[18] to enable this function after setting the Energy Policy
on all threads in a package.
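The enable step described in this message would be a read-modify-write of a single bit through the same /dev/cpu/N/msr interface the tool already uses. The sketch below shows that pattern only. The MSR number 0x1FC and bit 18 are taken from the message above and have not been checked against the specification here; the macro and function names are hypothetical, and whether the tool should perform this write at all is the open question in the thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Values quoted from the message above; not verified against the spec. */
#define ASSUMED_ENABLE_MSR	0x1fc
#define ASSUMED_ENABLE_BIT	18

/* Return value with the given bit set; other bits are left unchanged. */
uint64_t msr_set_bit(uint64_t value, unsigned int bit)
{
	assert(bit < 64);
	return value | (UINT64_C(1) << bit);
}

/*
 * Hypothetical helper: set the assumed enable bit on one CPU via a
 * read-modify-write of /dev/cpu/<cpu>/msr (requires root and the
 * msr driver). Returns 0 on success, -1 on any failure.
 */
int enable_energy_policy(int cpu)
{
	char path[32];
	uint64_t val;
	int ret = -1;
	int fd;

	snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	if (pread(fd, &val, sizeof val, ASSUMED_ENABLE_MSR) ==
	    (ssize_t)sizeof val) {
		val = msr_set_bit(val, ASSUMED_ENABLE_BIT);
		if (pwrite(fd, &val, sizeof val, ASSUMED_ENABLE_MSR) ==
		    (ssize_t)sizeof val)
			ret = 0;
	}

	close(fd);
	return ret;
}
```

Reading the register first preserves the other bits in it, which a blind write of a constant would clobber.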
Thread overview: 26+ messages
2010-06-16 21:05 RFC: /sys/power/policy_preference Len Brown
2010-06-17  6:03 ` [linux-pm] " Igor.Stoppa
2010-06-17 19:00 ` Len Brown
2010-06-17 16:14 ` Victor Lowther
2010-06-17 19:02 ` Len Brown
2010-06-17 22:23 ` Victor Lowther
2010-06-18  5:56 ` Len Brown
2010-06-18 11:55 ` Victor Lowther
2010-06-19 15:17 ` Vaidyanathan Srinivasan
2010-06-19 19:04 ` Rafael J. Wysocki
2010-06-17 20:48 ` Mike Chan
2010-06-18  6:25 ` Len Brown
2010-06-21 20:10 ` [linux-pm] " Dipankar Sarma
2010-09-28 16:17 ` x86_energy_perf_policy.c Len Brown
2010-10-23  4:40 ` [PATCH] tools: add x86_energy_perf_policy to program MSR_IA32_ENERGY_PERF_BIAS Len Brown
2010-10-27  3:23 ` Andrew Morton
2010-10-27  6:01 ` Ingo Molnar
2010-10-27 11:43 ` Arnaldo Carvalho de Melo
2010-11-15 16:07 ` [PATCH RESEND] tools: add power/x86/x86_energy_perf_policy " Len Brown
2010-11-17 11:35 ` Andi Kleen
2010-11-22 20:13 ` Len Brown
2010-11-22 20:33 ` Andi Kleen
2010-11-23  4:48 ` Len Brown
2010-11-24  5:31 ` [PATCH v2] tools: create power/x86/x86_energy_perf_policy Len Brown
2010-11-25  5:52 ` Chen Gong
2010-11-25  8:59 ` Chen Gong