On Sun, Jun 08, 2014 at 07:26:29AM +0800, Yuyang Du wrote:
> Ok. I think we understand each other. But one more thing, I said P ~ V^3,
> because P ~ V^2*f and f ~ V, so P ~ V^3. Maybe some frequencies share the same
> voltage, but you can still safely assume V changes with f in general, and it
> will be more and more so, since we do need finer control over power consumption.

I didn't know the frequency part was proportional to another voltage
term; ok, then the cubic term makes sense.

> > Sure, but realize that we must fully understand this governor and
> > integrate it in the scheduler if we're to attain the goal of IPC/watt
> > optimized scheduling behaviour.
>
> Attain the goal of IPC/watt optimized?
>
> I don't see how it can be done like this. As I said, what is unknown for
> prediction is perf scaling *and* changing workload. So the challenge for pstate
> control is in both. But I see more challenge in the changing workload than
> in the performance scaling or the resulting IPC impact (if workload is
> fixed).

But for the scheduler the workload change isn't that big a problem; we
know the history of each task, we know when tasks wake up and when we
move them around. Therefore we can fairly accurately predict this.

And given a simple P-state model (like ARM) where the CPU simply does
what you tell it to, that all works out. We can change P-state at task
wakeup/sleep/migration and compute the most efficient P-state, and task
distribution, for the new task-set.

> Currently, all freq governors take CPU utilization (load%) as the indicator
> (target), which can serve both: workload and perf scaling.

So the current cpufreq stuff is terminally broken in too many ways: it's
sampling based, so it misses a lot of changes, and it's strictly CPU
local, so it completely misses SMP information (like the migrations
etc..).
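As a quick numeric sanity check of the P ~ V^2*f, f ~ V, hence P ~ V^3
relation quoted above, here is a small sketch (all constants and
operating points are made up for illustration; this is not measurement
code):

```python
# Sketch: classic CMOS dynamic power P = C * V^2 * f; with V ~ f this
# gives P ~ f^3. All numbers below are illustrative, not real hardware.

def dynamic_power(V, f, C=1.0):
    """Dynamic power P = C * V^2 * f (arbitrary units)."""
    return C * V * V * f

# Assume voltage scales linearly with frequency: V = k * f.
k = 0.5
for f in (1.0, 2.0, 3.0):
    V = k * f
    print(f"f={f:.1f}  V={V:.2f}  P={dynamic_power(V, f):.3f}")

# Doubling f (and hence V) multiplies P by 2^3 = 8, i.e. the cubic term:
ratio = dynamic_power(k * 2.0, 2.0) / dynamic_power(k * 1.0, 1.0)
print(ratio)  # 8.0
```

The "some frequencies share the same voltage" caveat would show up here
as a step function for V rather than the linear `V = k * f` assumed
above.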
If we move a 50% task from CPU1 to CPU0, a sampling thing takes time to
adjust on both CPUs, whereas if it's scheduler driven, we can instantly
adjust and be done, because we _know_ what we moved.

Now some of that is due to hysterical raisins, and some of that due to
broken hardware (hardware that needs to schedule in order to change its
state because it's behind some broken bus or other). But we should
basically kill off cpufreq for anything recent and sane.

> As for IPC/watt optimized, I don't see how it can be practical. Too micro to
> be used for the general well-being?

What other target would you optimize for? The purpose here is to build
an energy aware scheduler, one that schedules tasks so that the total
amount of energy, for the given amount of work, is minimal.

So we can't measure in Watt, since if we forced the CPU into the lowest
P-state (or even C-state for that matter) work would simply not
complete. So we need a complete energy term.

Now, IPC is instructions/cycle, Watt is Joule/second, so IPC/Watt is:

  instructions   second
  ------------ * ------ ~ instructions / Joule
     cycle       Joule

seeing how both cycles and seconds are time units.

So for any given amount of instructions (the work that needs to be
done), we want the minimal amount of energy consumed, and IPC/Watt is
the natural metric to measure this over an entire workload.
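To make the instructions/Joule unit algebra concrete, a toy calculation
(all numbers invented; the helper name and operating points are
hypothetical, not anything from the kernel):

```python
# Toy illustration of the instructions/Joule metric: run a fixed-IPC
# workload at two P-states and compare energy efficiency. All numbers
# are invented for illustration.

def instructions_per_joule(ipc, freq_hz, power_watt, seconds):
    """Instructions retired per Joule consumed over a run of `seconds`."""
    instructions = ipc * freq_hz * seconds   # instructions = IPC * cycles
    joules = power_watt * seconds            # energy = power * time
    return instructions / joules

# A slow, low-power point vs. a fast point: with P ~ f^3, doubling the
# frequency costs 8x the power.
low  = instructions_per_joule(ipc=1.0, freq_hz=1.0e9, power_watt=1.0, seconds=1.0)
high = instructions_per_joule(ipc=1.0, freq_hz=2.0e9, power_watt=8.0, seconds=1.0)

print(low / high)  # 4.0 -- the fast P-state does 1/4 the work per Joule
```

Note the `seconds` term cancels out, which is the point of the metric:
it ranks P-states by energy per unit of work, independent of how long
the workload takes to complete.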