* [PATCH 0/6] cpufreq: schedutil governor

From: Rafael J. Wysocki @ 2016-03-02  1:56 UTC
To: Linux PM list
Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette
0 siblings, 7 replies; 158+ messages in thread

Hi,

My previous intro message still applies somewhat, so here's a link:

http://marc.info/?l=linux-pm&m=145609673008122&w=2

The executive summary of the motivation is that I wanted to do two things: use the utilization data from the scheduler (it is passed to the governor as arguments of the update callbacks anyway) and make it possible to set the CPU frequency without involving process context (fast frequency switching).

Both have been prototyped in the previous RFCs:

https://patchwork.kernel.org/patch/8426691/
https://patchwork.kernel.org/patch/8426741/

but in the meantime I found a couple of issues in there.

First off, the common governor code relied on by the previous version resets the sample delay to 0 in order to force an immediate frequency update. That doesn't work with the new governor, though, because it computes the frequency to set in a cpufreq_update_util() callback and (when fast switching is not used) passes that to a work item which sets the frequency and then restores the sample delay. Thus if a sysfs write sets the sample delay to 0 while work_in_progress is in effect, the work item will overwrite that value and the update will be discarded.

When using fast switching, the previous version would update the sample delay from a scheduler path, but (on a 32-bit system) that might clash with an update from sysfs, leading to a result that's completely off. That value would be less than the correct sample delay (I think), so in practice it shouldn't matter that much, but still it's not nice.
The above means that schedutil cannot really share as much code as I thought it could with "ondemand" and "conservative". Moreover, I wanted to have a "rate_limit" tunable (instead of the sampling rate, which doesn't mean what its name suggests in schedutil), but that would be the only tunable used by schedutil, so I ended up having to define a new struct to point to from struct dbs_data just to hold that single value, and I would need to define ->init() and ->exit() callbacks for the governor for that reason (and the common tunables in struct dbs_data wouldn't be used). Not to mention the fact that the majority of the common governor code is not really used by schedutil anyway.

Taking the above into account, I decided to decouple schedutil from the other governors, but I wanted to avoid duplicating some of the tunables manipulation code. Hence patches [3-4/6], which move that code into a separate file so schedutil can use it too without pulling in the rest of the common "ondemand" and "conservative" code along with it.

Patch [5/6] adds support for fast switching to the core and the ACPI driver, but doesn't hook it up to anything useful. That is done in the last patch, which actually adds the new governor.

That depends on two patches I sent previously: [1/6], which makes cpufreq_update_util() use RCU-sched (one change from the previous version, as requested by Peter), and [2/6], which reworks acpi-cpufreq so that fast switching (added later in patch [5/6]) can work with all of the frequency setting methods the driver may use.

Comments welcome.

Thanks,
Rafael
* [PATCH 1/6] cpufreq: Reduce cpufreq_update_util() overhead a bit

From: Rafael J. Wysocki @ 2016-03-02  2:04 UTC
To: Linux PM list, Peter Zijlstra
Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Use the observation that cpufreq_update_util() is only called
by the scheduler with rq->lock held, so the callers of
cpufreq_set_update_util_data() can use synchronize_sched()
instead of synchronize_rcu() to wait for cpufreq_update_util()
to complete.  Moreover, if they are updated to do that,
rcu_read_(un)lock() calls in cpufreq_update_util() might be
replaced with rcu_read_(un)lock_sched(), respectively, but
those aren't really necessary, because the scheduler calls
that function from RCU-sched read-side critical sections
already.

In addition to that, if cpufreq_set_update_util_data() checks
the func field in the struct update_util_data before setting
the per-CPU pointer to it, the data->func check may be dropped
from cpufreq_update_util() as well.

Make the above changes to reduce the overhead from
cpufreq_update_util() in the scheduler paths invoking it
and to make the cleanup after removing its callbacks
somewhat less heavy-weight.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

Supersedes https://patchwork.kernel.org/patch/8443191/

---
 drivers/cpufreq/cpufreq.c          | 23 ++++++++++++++++-------
 drivers/cpufreq/cpufreq_governor.c |  2 +-
 drivers/cpufreq/intel_pstate.c     |  4 ++--
 3 files changed, 19 insertions(+), 10 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -77,12 +77,15 @@ static DEFINE_PER_CPU(struct update_util
  * to call from cpufreq_update_util(). That function will be called from an RCU
  * read-side critical section, so it must not sleep.
  *
- * Callers must use RCU callbacks to free any memory that might be accessed
- * via the old update_util_data pointer or invoke synchronize_rcu() right after
- * this function to avoid use-after-free.
+ * Callers must use RCU-sched callbacks to free any memory that might be
+ * accessed via the old update_util_data pointer or invoke synchronize_sched()
+ * right after this function to avoid use-after-free.
  */
 void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
 {
+	if (WARN_ON(data && !data->func))
+		return;
+
 	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
 }
 EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
@@ -95,18 +98,24 @@ EXPORT_SYMBOL_GPL(cpufreq_set_update_uti
  *
  * This function is called by the scheduler on every invocation of
  * update_load_avg() on the CPU whose utilization is being updated.
+ *
+ * It can only be called from RCU-sched read-side critical sections.
  */
 void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
 {
 	struct update_util_data *data;

-	rcu_read_lock();
+#ifdef CONFIG_LOCKDEP
+	WARN_ON(debug_locks && !rcu_read_lock_sched_held());
+#endif

 	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
-	if (data && data->func)
+	/*
+	 * If this isn't inside of an RCU-sched read-side critical section, data
+	 * may become NULL after the check below.
+	 */
+	if (data)
 		data->func(data, time, util, max);
-
-	rcu_read_unlock();
 }

 /* Flag to suspend/resume CPUFreq governors */
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -280,7 +280,7 @@ static inline void gov_clear_update_util
 	for_each_cpu(i, policy->cpus)
 		cpufreq_set_update_util_data(i, NULL);

-	synchronize_rcu();
+	synchronize_sched();
 }

 static void gov_cancel_work(struct cpufreq_policy *policy)
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -1174,7 +1174,7 @@ static void intel_pstate_stop_cpu(struct
 	pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);

 	cpufreq_set_update_util_data(cpu_num, NULL);
-	synchronize_rcu();
+	synchronize_sched();

 	if (hwp_active)
 		return;
@@ -1442,7 +1442,7 @@ out:
 	for_each_online_cpu(cpu) {
 		if (all_cpu_data[cpu]) {
 			cpufreq_set_update_util_data(cpu, NULL);
-			synchronize_rcu();
+			synchronize_sched();
 			kfree(all_cpu_data[cpu]);
 		}
 	}
* Re: [PATCH 1/6] cpufreq: Reduce cpufreq_update_util() overhead a bit

From: Viresh Kumar @ 2016-03-03  5:48 UTC
To: Rafael J. Wysocki
Cc: Linux PM list, Peter Zijlstra, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Vincent Guittot, Michael Turquette

On 02-03-16, 03:04, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Use the observation that cpufreq_update_util() is only called
> by the scheduler with rq->lock held, so the callers of
> cpufreq_set_update_util_data() can use synchronize_sched()
> instead of synchronize_rcu() to wait for cpufreq_update_util()
> to complete.  Moreover, if they are updated to do that,
> rcu_read_(un)lock() calls in cpufreq_update_util() might be
> replaced with rcu_read_(un)lock_sched(), respectively, but
> those aren't really necessary, because the scheduler calls
> that function from RCU-sched read-side critical sections
> already.
>
> In addition to that, if cpufreq_set_update_util_data() checks
> the func field in the struct update_util_data before setting
> the per-CPU pointer to it, the data->func check may be dropped
> from cpufreq_update_util() as well.
>
> Make the above changes to reduce the overhead from
> cpufreq_update_util() in the scheduler paths invoking it
> and to make the cleanup after removing its callbacks
> somewhat less heavy-weight.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> Supersedes https://patchwork.kernel.org/patch/8443191/

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

--
viresh
* Re: [PATCH 1/6] cpufreq: Reduce cpufreq_update_util() overhead a bit

From: Juri Lelli @ 2016-03-03 11:47 UTC
To: Rafael J. Wysocki
Cc: Linux PM list, Peter Zijlstra, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette

Hi,

On 02/03/16 03:04, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>

[...]

> @@ -95,18 +98,24 @@ EXPORT_SYMBOL_GPL(cpufreq_set_update_uti
>   *
>   * This function is called by the scheduler on every invocation of
>   * update_load_avg() on the CPU whose utilization is being updated.
> + *
> + * It can only be called from RCU-sched read-side critical sections.
>   */
>  void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
>  {
>  	struct update_util_data *data;
>
> -	rcu_read_lock();
> +#ifdef CONFIG_LOCKDEP
> +	WARN_ON(debug_locks && !rcu_read_lock_sched_held());
> +#endif
>
> 	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));

I think you need to s/rcu_dereference/rcu_dereference_sched/ here or
RCU will complain:

[    0.106313] ===============================
[    0.106322] [ INFO: suspicious RCU usage. ]
[    0.106334] 4.5.0-rc6+ #93 Not tainted
[    0.106342] -------------------------------
[    0.106353] /media/hdd1tb/work/integration/kernel/drivers/cpufreq/cpufreq.c:113 suspicious rcu_dereference_check() usage!
[    0.106361]
[    0.106361] other info that might help us debug this:
[    0.106361]
[    0.106375]
[    0.106375] rcu_scheduler_active = 1, debug_locks = 1
[    0.106387] 1 lock held by swapper/0/0:
[    0.106395]  #0:  (&rq->lock){-.....}, at: [<ffffffc000743204>] __schedule+0xec/0xadc
[    0.106436]
[    0.106436] stack backtrace:
[    0.106450] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.5.0-rc6+ #93
[    0.106459] Hardware name: ARM Juno development board (r2) (DT)
[    0.106468] Call trace:
[    0.106483] [<ffffffc00008a8a8>] dump_backtrace+0x0/0x210
[    0.106496] [<ffffffc00008aad8>] show_stack+0x20/0x28
[    0.106511] [<ffffffc0004261a4>] dump_stack+0xa8/0xe0
[    0.106526] [<ffffffc000120e9c>] lockdep_rcu_suspicious+0xd4/0x114
[    0.106540] [<ffffffc0005d8180>] cpufreq_update_util+0xd4/0xd8
[    0.106554] [<ffffffc000105b9c>] set_next_entity+0x540/0xf7c
[    0.106569] [<ffffffc00010f78c>] pick_next_task_fair+0x9c/0x754
[    0.106580] [<ffffffc00074351c>] __schedule+0x404/0xadc
[    0.106592] [<ffffffc000743de0>] schedule+0x40/0xa0
[    0.106603] [<ffffffc000744094>] schedule_preempt_disabled+0x1c/0x2c
[    0.106617] [<ffffffc000741190>] rest_init+0x14c/0x164
[    0.106631] [<ffffffc0009f9990>] start_kernel+0x3c0/0x3d4
[    0.106642] [<ffffffc0000811b4>] 0xffffffc0000811b4

Best,

- Juri
* Re: [PATCH 1/6] cpufreq: Reduce cpufreq_update_util() overhead a bit

From: Peter Zijlstra @ 2016-03-03 13:04 UTC
To: Juri Lelli
Cc: Rafael J. Wysocki, Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette

On Thu, Mar 03, 2016 at 11:47:01AM +0000, Juri Lelli wrote:
> > +#ifdef CONFIG_LOCKDEP
> > +	WARN_ON(debug_locks && !rcu_read_lock_sched_held());
> > +#endif
> >
> > 	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
>
> I think you need to s/rcu_dereference/rcu_dereference_sched/ here or
> RCU will complain:

Ah, indeed ;-)
* [PATCH 2/6][Resend] cpufreq: acpi-cpufreq: Make read and write operations more efficient

From: Rafael J. Wysocki @ 2016-03-02  2:05 UTC
To: Linux PM list
Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Setting a new CPU frequency and reading the current request value
in the ACPI cpufreq driver each involve at least two switch
instructions (there are more if the policy is shared).  One of them
is in drv_read/write(), which prepares a command structure, and the
other happens in the subsequent do_drv_read/write() when that
structure is interpreted.  However, all of those switches may be
avoided by using function pointers.

To that end, add two function pointers to struct acpi_cpufreq_data
to represent read and write operations on the frequency register
and set them up during policy initialization to point to the pair
of routines suitable for the given processor (Intel/AMD MSR access
or I/O port access).  Then, use those pointers in do_drv_read/write()
and modify drv_read/write() to prepare the command structure for
them without any checks.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/acpi-cpufreq.c | 208 ++++++++++++++++++-----------------------
 1 file changed, 95 insertions(+), 113 deletions(-)

Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -70,6 +70,8 @@ struct acpi_cpufreq_data {
 	unsigned int cpu_feature;
 	unsigned int acpi_perf_cpu;
 	cpumask_var_t freqdomain_cpus;
+	void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val);
+	u32 (*cpu_freq_read)(struct acpi_pct_register *reg);
 };

 /* acpi_perf_data is a pointer to percpu data. */
@@ -243,125 +245,119 @@ static unsigned extract_freq(u32 val, st
 	}
 }

-struct msr_addr {
-	u32 reg;
-};
+u32 cpu_freq_read_intel(struct acpi_pct_register *not_used)
+{
+	u32 val, dummy;

-struct io_addr {
-	u16 port;
-	u8 bit_width;
-};
+	rdmsr(MSR_IA32_PERF_CTL, val, dummy);
+	return val;
+}
+
+void cpu_freq_write_intel(struct acpi_pct_register *not_used, u32 val)
+{
+	u32 lo, hi;
+
+	rdmsr(MSR_IA32_PERF_CTL, lo, hi);
+	lo = (lo & ~INTEL_MSR_RANGE) | (val & INTEL_MSR_RANGE);
+	wrmsr(MSR_IA32_PERF_CTL, lo, hi);
+}
+
+u32 cpu_freq_read_amd(struct acpi_pct_register *not_used)
+{
+	u32 val, dummy;
+
+	rdmsr(MSR_AMD_PERF_CTL, val, dummy);
+	return val;
+}
+
+void cpu_freq_write_amd(struct acpi_pct_register *not_used, u32 val)
+{
+	wrmsr(MSR_AMD_PERF_CTL, val, 0);
+}
+
+u32 cpu_freq_read_io(struct acpi_pct_register *reg)
+{
+	u32 val;
+
+	acpi_os_read_port(reg->address, &val, reg->bit_width);
+	return val;
+}
+
+void cpu_freq_write_io(struct acpi_pct_register *reg, u32 val)
+{
+	acpi_os_write_port(reg->address, val, reg->bit_width);
+}

 struct drv_cmd {
-	unsigned int type;
-	const struct cpumask *mask;
-	union {
-		struct msr_addr msr;
-		struct io_addr io;
-	} addr;
+	struct acpi_pct_register *reg;
 	u32 val;
+	union {
+		void (*write)(struct acpi_pct_register *reg, u32 val);
+		u32 (*read)(struct acpi_pct_register *reg);
+	} func;
 };

 /* Called via smp_call_function_single(), on the target CPU */
 static void do_drv_read(void *_cmd)
 {
 	struct drv_cmd *cmd = _cmd;
-	u32 h;

-	switch (cmd->type) {
-	case SYSTEM_INTEL_MSR_CAPABLE:
-	case SYSTEM_AMD_MSR_CAPABLE:
-		rdmsr(cmd->addr.msr.reg, cmd->val, h);
-		break;
-	case SYSTEM_IO_CAPABLE:
-		acpi_os_read_port((acpi_io_address)cmd->addr.io.port,
-				  &cmd->val,
-				  (u32)cmd->addr.io.bit_width);
-		break;
-	default:
-		break;
-	}
+	cmd->val = cmd->func.read(cmd->reg);
 }

-/* Called via smp_call_function_many(), on the target CPUs */
-static void do_drv_write(void *_cmd)
+static u32 drv_read(struct acpi_cpufreq_data *data, const struct cpumask *mask)
 {
-	struct drv_cmd *cmd = _cmd;
-	u32 lo, hi;
+	struct acpi_processor_performance *perf = to_perf_data(data);
+	struct drv_cmd cmd = {
+		.reg = &perf->control_register,
+		.func.read = data->cpu_freq_read,
+	};
+	int err;

-	switch (cmd->type) {
-	case SYSTEM_INTEL_MSR_CAPABLE:
-		rdmsr(cmd->addr.msr.reg, lo, hi);
-		lo = (lo & ~INTEL_MSR_RANGE) | (cmd->val & INTEL_MSR_RANGE);
-		wrmsr(cmd->addr.msr.reg, lo, hi);
-		break;
-	case SYSTEM_AMD_MSR_CAPABLE:
-		wrmsr(cmd->addr.msr.reg, cmd->val, 0);
-		break;
-	case SYSTEM_IO_CAPABLE:
-		acpi_os_write_port((acpi_io_address)cmd->addr.io.port,
-				   cmd->val,
-				   (u32)cmd->addr.io.bit_width);
-		break;
-	default:
-		break;
-	}
+	err = smp_call_function_any(mask, do_drv_read, &cmd, 1);
+	WARN_ON_ONCE(err);	/* smp_call_function_any() was buggy? */
+	return cmd.val;
 }

-static void drv_read(struct drv_cmd *cmd)
+/* Called via smp_call_function_many(), on the target CPUs */
+static void do_drv_write(void *_cmd)
 {
-	int err;
-	cmd->val = 0;
+	struct drv_cmd *cmd = _cmd;

-	err = smp_call_function_any(cmd->mask, do_drv_read, cmd, 1);
-	WARN_ON_ONCE(err);	/* smp_call_function_any() was buggy? */
+	cmd->func.write(cmd->reg, cmd->val);
 }

-static void drv_write(struct drv_cmd *cmd)
+static void drv_write(struct acpi_cpufreq_data *data,
+		      const struct cpumask *mask, u32 val)
 {
+	struct acpi_processor_performance *perf = to_perf_data(data);
+	struct drv_cmd cmd = {
+		.reg = &perf->control_register,
+		.val = val,
+		.func.write = data->cpu_freq_write,
+	};
 	int this_cpu;

 	this_cpu = get_cpu();
-	if (cpumask_test_cpu(this_cpu, cmd->mask))
-		do_drv_write(cmd);
-	smp_call_function_many(cmd->mask, do_drv_write, cmd, 1);
+	if (cpumask_test_cpu(this_cpu, mask))
+		do_drv_write(&cmd);
+
+	smp_call_function_many(mask, do_drv_write, &cmd, 1);
 	put_cpu();
 }

-static u32
-get_cur_val(const struct cpumask *mask, struct acpi_cpufreq_data *data)
+static u32 get_cur_val(const struct cpumask *mask, struct acpi_cpufreq_data *data)
 {
-	struct acpi_processor_performance *perf;
-	struct drv_cmd cmd;
+	u32 val;

 	if (unlikely(cpumask_empty(mask)))
 		return 0;

-	switch (data->cpu_feature) {
-	case SYSTEM_INTEL_MSR_CAPABLE:
-		cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
-		cmd.addr.msr.reg = MSR_IA32_PERF_CTL;
-		break;
-	case SYSTEM_AMD_MSR_CAPABLE:
-		cmd.type = SYSTEM_AMD_MSR_CAPABLE;
-		cmd.addr.msr.reg = MSR_AMD_PERF_CTL;
-		break;
-	case SYSTEM_IO_CAPABLE:
-		cmd.type = SYSTEM_IO_CAPABLE;
-		perf = to_perf_data(data);
-		cmd.addr.io.port = perf->control_register.address;
-		cmd.addr.io.bit_width = perf->control_register.bit_width;
-		break;
-	default:
-		return 0;
-	}
-
-	cmd.mask = mask;
-	drv_read(&cmd);
+	val = drv_read(data, mask);

-	pr_debug("get_cur_val = %u\n", cmd.val);
+	pr_debug("get_cur_val = %u\n", val);

-	return cmd.val;
+	return val;
 }

 static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
@@ -416,7 +412,7 @@ static int acpi_cpufreq_target(struct cp
 {
 	struct acpi_cpufreq_data *data = policy->driver_data;
 	struct acpi_processor_performance *perf;
-	struct drv_cmd cmd;
+	const struct cpumask *mask;
 	unsigned int next_perf_state = 0; /* Index into perf table */
 	int result = 0;
@@ -438,37 +434,17 @@ static int acpi_cpufreq_target(struct cp
 		}
 	}

-	switch (data->cpu_feature) {
-	case SYSTEM_INTEL_MSR_CAPABLE:
-		cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
-		cmd.addr.msr.reg = MSR_IA32_PERF_CTL;
-		cmd.val = (u32) perf->states[next_perf_state].control;
-		break;
-	case SYSTEM_AMD_MSR_CAPABLE:
-		cmd.type = SYSTEM_AMD_MSR_CAPABLE;
-		cmd.addr.msr.reg = MSR_AMD_PERF_CTL;
-		cmd.val = (u32) perf->states[next_perf_state].control;
-		break;
-	case SYSTEM_IO_CAPABLE:
-		cmd.type = SYSTEM_IO_CAPABLE;
-		cmd.addr.io.port = perf->control_register.address;
-		cmd.addr.io.bit_width = perf->control_register.bit_width;
-		cmd.val = (u32) perf->states[next_perf_state].control;
-		break;
-	default:
-		return -ENODEV;
-	}
-
-	/* cpufreq holds the hotplug lock, so we are safe from here on */
-	if (policy->shared_type != CPUFREQ_SHARED_TYPE_ANY)
-		cmd.mask = policy->cpus;
-	else
-		cmd.mask = cpumask_of(policy->cpu);
+	/*
+	 * The core won't allow CPUs to go away until the governor has been
+	 * stopped, so we can rely on the stability of policy->cpus.
+	 */
+	mask = policy->shared_type == CPUFREQ_SHARED_TYPE_ANY ?
+		cpumask_of(policy->cpu) : policy->cpus;

-	drv_write(&cmd);
+	drv_write(data, mask, perf->states[next_perf_state].control);

 	if (acpi_pstate_strict) {
-		if (!check_freqs(cmd.mask, data->freq_table[index].frequency,
+		if (!check_freqs(mask, data->freq_table[index].frequency,
				 data)) {
			pr_debug("acpi_cpufreq_target failed (%d)\n",
				 policy->cpu);
@@ -738,15 +714,21 @@ static int acpi_cpufreq_cpu_init(struct
 		}
 		pr_debug("SYSTEM IO addr space\n");
 		data->cpu_feature = SYSTEM_IO_CAPABLE;
+		data->cpu_freq_read = cpu_freq_read_io;
+		data->cpu_freq_write = cpu_freq_write_io;
 		break;
 	case ACPI_ADR_SPACE_FIXED_HARDWARE:
 		pr_debug("HARDWARE addr space\n");
 		if (check_est_cpu(cpu)) {
 			data->cpu_feature = SYSTEM_INTEL_MSR_CAPABLE;
+			data->cpu_freq_read = cpu_freq_read_intel;
+			data->cpu_freq_write = cpu_freq_write_intel;
 			break;
 		}
 		if (check_amd_hwpstate_cpu(cpu)) {
 			data->cpu_feature = SYSTEM_AMD_MSR_CAPABLE;
+			data->cpu_freq_read = cpu_freq_read_amd;
+			data->cpu_freq_write = cpu_freq_write_amd;
 			break;
 		}
 		result = -ENODEV;
* [PATCH 3/6] cpufreq: governor: New data type for management part of dbs_data 2016-03-02 1:56 [PATCH 0/6] cpufreq: schedutil governor Rafael J. Wysocki 2016-03-02 2:04 ` [PATCH 1/6] cpufreq: Reduce cpufreq_update_util() overhead a bit Rafael J. Wysocki 2016-03-02 2:05 ` [PATCH 2/6][Resend] cpufreq: acpi-cpufreq: Make read and write operations more efficient Rafael J. Wysocki @ 2016-03-02 2:08 ` Rafael J. Wysocki 2016-03-03 5:53 ` Viresh Kumar 2016-03-02 2:10 ` [PATCH 4/6] cpufreq: governor: Move abstract gov_tunables code to a seperate file Rafael J. Wysocki ` (3 subsequent siblings) 6 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-02 2:08 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> In addition to fields representing governor tunables, struct dbs_data contains some fields needed for the management of objects of that type. As it turns out, that part of struct dbs_data may be shared with (future) governors that won't use the common code used by "ondemand" and "conservative", so move it to a separate struct type and modify the code using struct dbs_data to follow. Signed-off-by: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> --- drivers/cpufreq/cpufreq_conservative.c | 15 +++-- drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++------------- drivers/cpufreq/cpufreq_governor.h | 36 +++++++------ drivers/cpufreq/cpufreq_ondemand.c | 19 ++++-- 4 files changed, 97 insertions(+), 63 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -41,6 +41,13 @@ /* Ondemand Sampling types */ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; +struct gov_tunables { + struct kobject kobj; + struct list_head policy_list; + struct mutex update_lock; + int usage_count; +}; + /* * Abbreviations: * dbs: used as a shortform for demand based switching It helps to keep variable @@ -52,7 +59,7 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; /* Governor demand based switching data (per-policy or global). */ struct dbs_data { - int usage_count; + struct gov_tunables gt; void *tuners; unsigned int min_sampling_rate; unsigned int ignore_nice_load; @@ -60,37 +67,34 @@ struct dbs_data { unsigned int sampling_down_factor; unsigned int up_threshold; unsigned int io_is_busy; - - struct kobject kobj; - struct list_head policy_dbs_list; - /* - * Protect concurrent updates to governor tunables from sysfs, - * policy_dbs_list and usage_count. 
- */ - struct mutex mutex; }; +static inline struct dbs_data *to_dbs_data(struct gov_tunables *gt) +{ + return container_of(gt, struct dbs_data, gt); +} + /* Governor's specific attributes */ -struct dbs_data; struct governor_attr { struct attribute attr; - ssize_t (*show)(struct dbs_data *dbs_data, char *buf); - ssize_t (*store)(struct dbs_data *dbs_data, const char *buf, - size_t count); + ssize_t (*show)(struct gov_tunables *gt, char *buf); + ssize_t (*store)(struct gov_tunables *gt, const char *buf, size_t count); }; #define gov_show_one(_gov, file_name) \ static ssize_t show_##file_name \ -(struct dbs_data *dbs_data, char *buf) \ +(struct gov_tunables *gt, char *buf) \ { \ + struct dbs_data *dbs_data = to_dbs_data(gt); \ struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \ return sprintf(buf, "%u\n", tuners->file_name); \ } #define gov_show_one_common(file_name) \ static ssize_t show_##file_name \ -(struct dbs_data *dbs_data, char *buf) \ +(struct gov_tunables *gt, char *buf) \ { \ + struct dbs_data *dbs_data = to_dbs_data(gt); \ return sprintf(buf, "%u\n", dbs_data->file_name); \ } @@ -184,7 +188,7 @@ void od_register_powersave_bias_handler( (struct cpufreq_policy *, unsigned int, unsigned int), unsigned int powersave_bias); void od_unregister_powersave_bias_handler(void); -ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf, +ssize_t store_sampling_rate(struct gov_tunables *gt, const char *buf, size_t count); void gov_update_cpu_data(struct dbs_data *dbs_data); #endif /* _CPUFREQ_GOVERNOR_H */ Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -42,9 +42,10 @@ static DEFINE_MUTEX(gov_dbs_data_mutex); * This must be called with dbs_data->mutex held, otherwise traversing * policy_dbs_list isn't safe. 
*/ -ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf, +ssize_t store_sampling_rate(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); struct policy_dbs_info *policy_dbs; unsigned int rate; int ret; @@ -58,7 +59,7 @@ ssize_t store_sampling_rate(struct dbs_d * We are operating under dbs_data->mutex and so the list and its * entries can't be freed concurrently. */ - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, >->policy_list, list) { mutex_lock(&policy_dbs->timer_mutex); /* * On 32-bit architectures this may race with the @@ -95,7 +96,7 @@ void gov_update_cpu_data(struct dbs_data { struct policy_dbs_info *policy_dbs; - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &dbs_data->gt.policy_list, list) { unsigned int j; for_each_cpu(j, policy_dbs->policy->cpus) { @@ -110,9 +111,9 @@ void gov_update_cpu_data(struct dbs_data } EXPORT_SYMBOL_GPL(gov_update_cpu_data); -static inline struct dbs_data *to_dbs_data(struct kobject *kobj) +static inline struct gov_tunables *to_gov_tunables(struct kobject *kobj) { - return container_of(kobj, struct dbs_data, kobj); + return container_of(kobj, struct gov_tunables, kobj); } static inline struct governor_attr *to_gov_attr(struct attribute *attr) @@ -123,25 +124,24 @@ static inline struct governor_attr *to_g static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, char *buf) { - struct dbs_data *dbs_data = to_dbs_data(kobj); struct governor_attr *gattr = to_gov_attr(attr); - return gattr->show(dbs_data, buf); + return gattr->show(to_gov_tunables(kobj), buf); } static ssize_t governor_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t count) { - struct dbs_data *dbs_data = to_dbs_data(kobj); + struct gov_tunables *gt = to_gov_tunables(kobj); struct governor_attr *gattr = to_gov_attr(attr); int ret = -EBUSY; - 
mutex_lock(&dbs_data->mutex); + mutex_lock(>->update_lock); - if (dbs_data->usage_count) - ret = gattr->store(dbs_data, buf, count); + if (gt->usage_count) + ret = gattr->store(gt, buf, count); - mutex_unlock(&dbs_data->mutex); + mutex_unlock(>->update_lock); return ret; } @@ -424,6 +424,41 @@ static void free_policy_dbs_info(struct gov->free(policy_dbs); } +static void gov_tunables_init(struct gov_tunables *gt, + struct list_head *list_node) +{ + INIT_LIST_HEAD(>->policy_list); + mutex_init(>->update_lock); + gt->usage_count = 1; + list_add(list_node, >->policy_list); +} + +static void gov_tunables_get(struct gov_tunables *gt, + struct list_head *list_node) +{ + mutex_lock(>->update_lock); + gt->usage_count++; + list_add(list_node, >->policy_list); + mutex_unlock(>->update_lock); +} + +static unsigned int gov_tunables_put(struct gov_tunables *gt, + struct list_head *list_node) +{ + unsigned int count; + + mutex_lock(>->update_lock); + list_del(list_node); + count = --gt->usage_count; + mutex_unlock(>->update_lock); + if (count) + return count; + + kobject_put(>->kobj); + mutex_destroy(>->update_lock); + return 0; +} + static int cpufreq_governor_init(struct cpufreq_policy *policy) { struct dbs_governor *gov = dbs_governor_of(policy); @@ -452,10 +487,7 @@ static int cpufreq_governor_init(struct policy_dbs->dbs_data = dbs_data; policy->governor_data = policy_dbs; - mutex_lock(&dbs_data->mutex); - dbs_data->usage_count++; - list_add(&policy_dbs->list, &dbs_data->policy_dbs_list); - mutex_unlock(&dbs_data->mutex); + gov_tunables_get(&dbs_data->gt, &policy_dbs->list); goto out; } @@ -465,8 +497,7 @@ static int cpufreq_governor_init(struct goto free_policy_dbs_info; } - INIT_LIST_HEAD(&dbs_data->policy_dbs_list); - mutex_init(&dbs_data->mutex); + gov_tunables_init(&dbs_data->gt, &policy_dbs->list); ret = gov->init(dbs_data, !policy->governor->initialized); if (ret) @@ -486,14 +517,11 @@ static int cpufreq_governor_init(struct if (!have_governor_per_policy()) 
gov->gdbs_data = dbs_data; - policy->governor_data = policy_dbs; - policy_dbs->dbs_data = dbs_data; - dbs_data->usage_count = 1; - list_add(&policy_dbs->list, &dbs_data->policy_dbs_list); + policy->governor_data = policy_dbs; gov->kobj_type.sysfs_ops = &governor_sysfs_ops; - ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type, + ret = kobject_init_and_add(&dbs_data->gt.kobj, &gov->kobj_type, get_governor_parent_kobj(policy), "%s", gov->gov.name); if (!ret) @@ -522,29 +550,21 @@ static int cpufreq_governor_exit(struct struct dbs_governor *gov = dbs_governor_of(policy); struct policy_dbs_info *policy_dbs = policy->governor_data; struct dbs_data *dbs_data = policy_dbs->dbs_data; - int count; + unsigned int count; /* Protect gov->gdbs_data against concurrent updates. */ mutex_lock(&gov_dbs_data_mutex); - mutex_lock(&dbs_data->mutex); - list_del(&policy_dbs->list); - count = --dbs_data->usage_count; - mutex_unlock(&dbs_data->mutex); + count = gov_tunables_put(&dbs_data->gt, &policy_dbs->list); - if (!count) { - kobject_put(&dbs_data->kobj); - - policy->governor_data = NULL; + policy->governor_data = NULL; + if (!count) { if (!have_governor_per_policy()) gov->gdbs_data = NULL; gov->exit(dbs_data, policy->governor->initialized == 1); - mutex_destroy(&dbs_data->mutex); kfree(dbs_data); - } else { - policy->governor_data = NULL; } free_policy_dbs_info(policy_dbs, gov); Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c +++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c @@ -207,9 +207,10 @@ static unsigned int od_dbs_timer(struct /************************** sysfs interface ************************/ static struct dbs_governor od_dbs_gov; -static ssize_t store_io_is_busy(struct dbs_data *dbs_data, const char *buf, +static ssize_t store_io_is_busy(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); 
unsigned int input; int ret; @@ -224,9 +225,10 @@ static ssize_t store_io_is_busy(struct d return count; } -static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf, +static ssize_t store_up_threshold(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -240,9 +242,10 @@ static ssize_t store_up_threshold(struct return count; } -static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data, +static ssize_t store_sampling_down_factor(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); struct policy_dbs_info *policy_dbs; unsigned int input; int ret; @@ -254,7 +257,7 @@ static ssize_t store_sampling_down_facto dbs_data->sampling_down_factor = input; /* Reset down sampling multiplier in case it was active */ - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &gt->policy_list, list) { /* * Doing this without locking might lead to using different * rate_mult values in od_update() and od_dbs_timer(). 
@@ -267,9 +270,10 @@ static ssize_t store_sampling_down_facto return count; } -static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data, +static ssize_t store_ignore_nice_load(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); unsigned int input; int ret; @@ -291,9 +295,10 @@ static ssize_t store_ignore_nice_load(st return count; } -static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf, +static ssize_t store_powersave_bias(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); struct od_dbs_tuners *od_tuners = dbs_data->tuners; struct policy_dbs_info *policy_dbs; unsigned int input; @@ -308,7 +313,7 @@ static ssize_t store_powersave_bias(stru od_tuners->powersave_bias = input; - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) + list_for_each_entry(policy_dbs, &gt->policy_list, list) ondemand_powersave_bias_init(policy_dbs->policy); return count; Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c +++ linux-pm/drivers/cpufreq/cpufreq_conservative.c @@ -129,9 +129,10 @@ static struct notifier_block cs_cpufreq_ /************************** sysfs interface ************************/ static struct dbs_governor cs_dbs_gov; -static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data, +static ssize_t store_sampling_down_factor(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -143,9 +144,10 @@ static ssize_t store_sampling_down_facto return count; } -static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf, +static ssize_t store_up_threshold(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); struct 
cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; @@ -158,9 +160,10 @@ static ssize_t store_up_threshold(struct return count; } -static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf, +static ssize_t store_down_threshold(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; @@ -175,9 +178,10 @@ static ssize_t store_down_threshold(stru return count; } -static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data, +static ssize_t store_ignore_nice_load(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); unsigned int input; int ret; @@ -199,9 +203,10 @@ static ssize_t store_ignore_nice_load(st return count; } -static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf, +static ssize_t store_freq_step(struct gov_tunables *gt, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(gt); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; ^ permalink raw reply [flat|nested] 158+ messages in thread
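For readers less familiar with the pattern in patch 3/6 above, the two-step `container_of()` dispatch behind `to_gov_tunables()` and `to_dbs_data()` can be modelled in plain user-space C. This is a hedged, simplified sketch: the kernel's kobject and sysfs machinery is replaced by bare placeholder structs, and the struct members shown are reduced from the real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct kobject { int placeholder; };	/* stand-in only */

struct gov_tunables {
	struct kobject kobj;	/* embedded, as in the patch */
	int usage_count;
};

struct dbs_data {
	struct gov_tunables gt;	/* management part embedded up front */
	unsigned int sampling_rate;
};

/* sysfs hands the callbacks the embedded kobject; walk back out. */
struct gov_tunables *to_gov_tunables(struct kobject *kobj)
{
	return container_of(kobj, struct gov_tunables, kobj);
}

struct dbs_data *to_dbs_data(struct gov_tunables *gt)
{
	return container_of(gt, struct dbs_data, gt);
}
```

The point of the split is that code shared with other governors only ever sees `struct gov_tunables`, while the dbs-specific store callbacks recover their `struct dbs_data` with the second `container_of()` step.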
* Re: [PATCH 3/6] cpufreq: governor: New data type for management part of dbs_data 2016-03-02 2:08 ` [PATCH 3/6] cpufreq: governor: New data type for management part of dbs_data Rafael J. Wysocki @ 2016-03-03 5:53 ` Viresh Kumar 2016-03-03 19:26 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Viresh Kumar @ 2016-03-03 5:53 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Vincent Guittot, Michael Turquette On 02-03-16, 03:08, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > In addition to fields representing governor tunables, struct dbs_data > contains some fields needed for the management of objects of that > type. As it turns out, that part of struct dbs_data may be shared > with (future) governors that won't use the common code used by > "ondemand" and "conservative", so move it to a separate struct type > and modify the code using struct dbs_data to follow. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > drivers/cpufreq/cpufreq_conservative.c | 15 +++-- > drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++------------- > drivers/cpufreq/cpufreq_governor.h | 36 +++++++------ > drivers/cpufreq/cpufreq_ondemand.c | 19 ++++-- > 4 files changed, 97 insertions(+), 63 deletions(-) > > Index: linux-pm/drivers/cpufreq/cpufreq_governor.h > =================================================================== > --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h > +++ linux-pm/drivers/cpufreq/cpufreq_governor.h > @@ -41,6 +41,13 @@ > /* Ondemand Sampling types */ > enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; > > +struct gov_tunables { > + struct kobject kobj; > + struct list_head policy_list; > + struct mutex update_lock; > + int usage_count; > +}; Everything else looks fine, but I don't think that you have named it properly. 
Every thing else present in struct dbs_data are tunables, but not this. And so gov_tunables doesn't suit at all here.. -- viresh ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 3/6] cpufreq: governor: New data type for management part of dbs_data 2016-03-03 5:53 ` Viresh Kumar @ 2016-03-03 19:26 ` Rafael J. Wysocki 2016-03-04 5:49 ` Viresh Kumar 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-03 19:26 UTC (permalink / raw) To: Viresh Kumar Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Vincent Guittot, Michael Turquette On Thu, Mar 3, 2016 at 6:53 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote: > On 02-03-16, 03:08, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> In addition to fields representing governor tunables, struct dbs_data >> contains some fields needed for the management of objects of that >> type. As it turns out, that part of struct dbs_data may be shared >> with (future) governors that won't use the common code used by >> "ondemand" and "conservative", so move it to a separate struct type >> and modify the code using struct dbs_data to follow. >> >> Signed-off-by: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> >> --- >> drivers/cpufreq/cpufreq_conservative.c | 15 +++-- >> drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++------------- >> drivers/cpufreq/cpufreq_governor.h | 36 +++++++------ >> drivers/cpufreq/cpufreq_ondemand.c | 19 ++++-- >> 4 files changed, 97 insertions(+), 63 deletions(-) >> >> Index: linux-pm/drivers/cpufreq/cpufreq_governor.h >> =================================================================== >> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h >> +++ linux-pm/drivers/cpufreq/cpufreq_governor.h >> @@ -41,6 +41,13 @@ >> /* Ondemand Sampling types */ >> enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; >> >> +struct gov_tunables { >> + struct kobject kobj; >> + struct list_head policy_list; >> + struct mutex update_lock; >> + int usage_count; >> +}; > > Everything else looks fine, but I don't think that you have named it > properly. Every thing else present in struct dbs_data are tunables, > but not this. And so gov_tunables doesn't suit at all here.. So this is a totally bicycle shed discussion argument which makes it seriously irritating. Does it really matter so much how this structure is called? Essentially, it is something to build your tunables structure around and you can treat it as a counterpart of a C++ abstract class. So the name *does* make sense in that context. That said, what about gov_attr_set? Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 3/6] cpufreq: governor: New data type for management part of dbs_data 2016-03-03 19:26 ` Rafael J. Wysocki @ 2016-03-04 5:49 ` Viresh Kumar 0 siblings, 0 replies; 158+ messages in thread From: Viresh Kumar @ 2016-03-04 5:49 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Vincent Guittot, Michael Turquette On 03-03-16, 20:26, Rafael J. Wysocki wrote: > So this is a totally bicycle shed discussion argument which makes it > seriously irritating. > > Does it really matter so much how this structure is called? > Essentially, it is something to build your tunables structure around > and you can treat it as a counterpart of a C++ abstract class. So the > name *does* make sense in that context. :( I thought you will apply this patch to linux-next *now*, as it was quite independent and so gave such comment. > That said, what about gov_attr_set? Maybe just gov_kobj or whatever you wish. -- viresh ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH 4/6] cpufreq: governor: Move abstract gov_tunables code to a seperate file 2016-03-02 1:56 [PATCH 0/6] cpufreq: schedutil governor Rafael J. Wysocki ` (2 preceding siblings ...) 2016-03-02 2:08 ` [PATCH 3/6] cpufreq: governor: New data type for management part of dbs_data Rafael J. Wysocki @ 2016-03-02 2:10 ` Rafael J. Wysocki 2016-03-03 6:03 ` Viresh Kumar 2016-03-02 2:12 ` [PATCH 5/6] cpufreq: Support for fast frequency switching Rafael J. Wysocki ` (2 subsequent siblings) 6 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-02 2:10 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Move abstract code related to struct gov_tunables to a separate (new) file so it can be shared with (future) governors that won't share more code with "ondemand" and "conservative". No intentional functional changes. Signed-off-by: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> --- drivers/cpufreq/Kconfig | 4 + drivers/cpufreq/Makefile | 1 drivers/cpufreq/cpufreq_governor.c | 82 --------------------------- drivers/cpufreq/cpufreq_governor.h | 6 ++ drivers/cpufreq/cpufreq_governor_tunables.c | 84 ++++++++++++++++++++++++++++ 5 files changed, 95 insertions(+), 82 deletions(-) Index: linux-pm/drivers/cpufreq/Kconfig =================================================================== --- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -18,7 +18,11 @@ config CPU_FREQ if CPU_FREQ +config CPU_FREQ_GOV_TUNABLES + bool + config CPU_FREQ_GOV_COMMON + select CPU_FREQ_GOV_TUNABLES select IRQ_WORK bool Index: linux-pm/drivers/cpufreq/Makefile =================================================================== --- linux-pm.orig/drivers/cpufreq/Makefile +++ linux-pm/drivers/cpufreq/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) += obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o +obj-$(CONFIG_CPU_FREQ_GOV_TUNABLES) += cpufreq_governor_tunables.o obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -111,53 +111,6 @@ void gov_update_cpu_data(struct dbs_data } EXPORT_SYMBOL_GPL(gov_update_cpu_data); -static inline struct gov_tunables *to_gov_tunables(struct kobject *kobj) -{ - return container_of(kobj, struct gov_tunables, kobj); -} - -static inline struct governor_attr *to_gov_attr(struct attribute *attr) -{ - return container_of(attr, struct governor_attr, attr); -} - -static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, - char *buf) -{ - struct governor_attr *gattr = to_gov_attr(attr); - - return 
gattr->show(to_gov_tunables(kobj), buf); -} - -static ssize_t governor_store(struct kobject *kobj, struct attribute *attr, - const char *buf, size_t count) -{ - struct gov_tunables *gt = to_gov_tunables(kobj); - struct governor_attr *gattr = to_gov_attr(attr); - int ret = -EBUSY; - - mutex_lock(&gt->update_lock); - - if (gt->usage_count) - ret = gattr->store(gt, buf, count); - - mutex_unlock(&gt->update_lock); - - return ret; -} - -/* - * Sysfs Ops for accessing governor attributes. - * - * All show/store invocations for governor specific sysfs attributes, will first - * call the below show/store callbacks and the attribute specific callback will - * be called from within it. - */ -static const struct sysfs_ops governor_sysfs_ops = { - .show = governor_show, - .store = governor_store, -}; - unsigned int dbs_update(struct cpufreq_policy *policy) { struct policy_dbs_info *policy_dbs = policy->governor_data; @@ -424,41 +377,6 @@ static void free_policy_dbs_info(struct gov->free(policy_dbs); } -static void gov_tunables_init(struct gov_tunables *gt, - struct list_head *list_node) -{ - INIT_LIST_HEAD(&gt->policy_list); - mutex_init(&gt->update_lock); - gt->usage_count = 1; - list_add(list_node, &gt->policy_list); -} - -static void gov_tunables_get(struct gov_tunables *gt, - struct list_head *list_node) -{ - mutex_lock(&gt->update_lock); - gt->usage_count++; - list_add(list_node, &gt->policy_list); - mutex_unlock(&gt->update_lock); -} - -static unsigned int gov_tunables_put(struct gov_tunables *gt, - struct list_head *list_node) -{ - unsigned int count; - - mutex_lock(&gt->update_lock); - list_del(list_node); - count = --gt->usage_count; - mutex_unlock(&gt->update_lock); - if (count) - return count; - - kobject_put(&gt->kobj); - mutex_destroy(&gt->update_lock); - return 0; -} - static int cpufreq_governor_init(struct cpufreq_policy *policy) { struct dbs_governor *gov = dbs_governor_of(policy); Index: linux-pm/drivers/cpufreq/cpufreq_governor.h 
=================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -48,6 +48,12 @@ struct gov_tunables { int usage_count; }; +extern const struct sysfs_ops governor_sysfs_ops; + +void gov_tunables_init(struct gov_tunables *gt, struct list_head *list_node); +void gov_tunables_get(struct gov_tunables *gt, struct list_head *list_node); +unsigned int gov_tunables_put(struct gov_tunables *gt, struct list_head *list_node); + /* * Abbreviations: * dbs: used as a shortform for demand based switching It helps to keep variable Index: linux-pm/drivers/cpufreq/cpufreq_governor_tunables.c =================================================================== --- /dev/null +++ linux-pm/drivers/cpufreq/cpufreq_governor_tunables.c @@ -0,0 +1,84 @@ +/* + * Abstract code for CPUFreq governor tunables handling. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#include "cpufreq_governor.h" + +static inline struct gov_tunables *to_gov_tunables(struct kobject *kobj) +{ + return container_of(kobj, struct gov_tunables, kobj); +} + +static inline struct governor_attr *to_gov_attr(struct attribute *attr) +{ + return container_of(attr, struct governor_attr, attr); +} + +static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, + char *buf) +{ + struct governor_attr *gattr = to_gov_attr(attr); + + return gattr->show(to_gov_tunables(kobj), buf); +} + +static ssize_t governor_store(struct kobject *kobj, struct attribute *attr, + const char *buf, size_t count) +{ + struct gov_tunables *gt = to_gov_tunables(kobj); + struct governor_attr *gattr = to_gov_attr(attr); + int ret; + + mutex_lock(&gt->update_lock); + ret = gt->usage_count ? gattr->store(gt, buf, count) : -EBUSY; + mutex_unlock(&gt->update_lock); + return ret; +} + +const struct sysfs_ops governor_sysfs_ops = { + .show = governor_show, + .store = governor_store, +}; +EXPORT_SYMBOL_GPL(governor_sysfs_ops); + +void gov_tunables_init(struct gov_tunables *gt, struct list_head *list_node) +{ + INIT_LIST_HEAD(&gt->policy_list); + mutex_init(&gt->update_lock); + gt->usage_count = 1; + list_add(list_node, &gt->policy_list); +} +EXPORT_SYMBOL_GPL(gov_tunables_init); + +void gov_tunables_get(struct gov_tunables *gt, struct list_head *list_node) +{ + mutex_lock(&gt->update_lock); + gt->usage_count++; + list_add(list_node, &gt->policy_list); + mutex_unlock(&gt->update_lock); +} +EXPORT_SYMBOL_GPL(gov_tunables_get); + +unsigned int gov_tunables_put(struct gov_tunables *gt, struct list_head *list_node) +{ + unsigned int count; + + mutex_lock(&gt->update_lock); + list_del(list_node); + count = --gt->usage_count; + mutex_unlock(&gt->update_lock); + if (count) + return count; + + kobject_put(&gt->kobj); + mutex_destroy(&gt->update_lock); + return 0; +} +EXPORT_SYMBOL_GPL(gov_tunables_put); ^ permalink raw reply [flat|nested] 158+ messages in thread
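The lifecycle implemented by gov_tunables_init()/get()/put() in the patch above can be illustrated with a small user-space model. This is a hypothetical sketch: the kobject, mutex, and list bookkeeping are stripped away, leaving only the reference count that the init/get/put logic actually decides on.

```c
#include <assert.h>

struct tunables_model {
	int usage_count;	/* mirrors gov_tunables.usage_count */
};

/* First policy using the tunables: the count starts at 1. */
void model_init(struct tunables_model *t)
{
	t->usage_count = 1;
}

/* Another policy attaches to the shared tunables object. */
void model_get(struct tunables_model *t)
{
	t->usage_count++;
}

/* A policy detaches; returns the remaining count, 0 meaning the
 * caller is the last user and may free the object. */
int model_put(struct tunables_model *t)
{
	return --t->usage_count;
}
```

In the kernel code, cpufreq_governor_exit() uses exactly this return value: only when gov_tunables_put() reports zero does it tear down gov->gdbs_data and kfree() the dbs_data.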
* Re: [PATCH 4/6] cpufreq: governor: Move abstract gov_tunables code to a seperate file 2016-03-02 2:10 ` [PATCH 4/6] cpufreq: governor: Move abstract gov_tunables code to a seperate file Rafael J. Wysocki @ 2016-03-03 6:03 ` Viresh Kumar 0 siblings, 0 replies; 158+ messages in thread From: Viresh Kumar @ 2016-03-03 6:03 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Vincent Guittot, Michael Turquette On 02-03-16, 03:10, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Move abstract code related to struct gov_tunables to a separate (new) > file so it can be shared with (future) goverernors that won't share > more code with "ondemand" and "conservative". > > No intentional functional changes. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > drivers/cpufreq/Kconfig | 4 + > drivers/cpufreq/Makefile | 1 > drivers/cpufreq/cpufreq_governor.c | 82 --------------------------- > drivers/cpufreq/cpufreq_governor.h | 6 ++ > drivers/cpufreq/cpufreq_governor_tunables.c | 84 ++++++++++++++++++++++++++++ These aren't governor tunables, isn't it? Tunables were the fields that could be tuned, but this is something else. -- viresh ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH 5/6] cpufreq: Support for fast frequency switching 2016-03-02 1:56 [PATCH 0/6] cpufreq: schedutil governor Rafael J. Wysocki ` (3 preceding siblings ...) 2016-03-02 2:10 ` [PATCH 4/6] cpufreq: governor: Move abstract gov_tunables code to a seperate file Rafael J. Wysocki @ 2016-03-02 2:12 ` Rafael J. Wysocki 2016-03-03 6:00 ` Viresh Kumar ` (2 more replies) 2016-03-02 2:27 ` [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki 6 siblings, 3 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-02 2:12 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Modify the ACPI cpufreq driver to provide a method for switching CPU frequencies from interrupt context and update the cpufreq core to support that method if available. Introduce a new cpufreq driver callback, ->fast_switch, to be invoked for frequency switching from interrupt context via a new helper function, cpufreq_driver_fast_switch(). Add a new policy flag, fast_switch_possible, to be set if fast frequency switching can be used for the given policy. Implement the ->fast_switch callback in the ACPI cpufreq driver and make it set fast_switch_possible during policy initialization as appropriate. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- The most important change from the previous version is that the ->fast_switch() callback takes an additional "relation" argument and now the governor can use it to choose a selection method. 
--- drivers/cpufreq/acpi-cpufreq.c | 53 +++++++++++++++++++++++++++++++++++++++++ drivers/cpufreq/cpufreq.c | 33 +++++++++++++++++++++++++ include/linux/cpufreq.h | 6 ++++ 3 files changed, 92 insertions(+) Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c @@ -458,6 +458,55 @@ static int acpi_cpufreq_target(struct cp return result; } +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq, + unsigned int relation) +{ + struct acpi_cpufreq_data *data = policy->driver_data; + struct acpi_processor_performance *perf; + struct cpufreq_frequency_table *entry, *found; + unsigned int next_perf_state, next_freq, freq; + + /* + * Find the closest frequency above target_freq or equal to it. + * + * The table is sorted in the reverse order with respect to the + * frequency and all of the entries are valid (see the initialization). + */ + entry = data->freq_table; + do { + entry++; + freq = entry->frequency; + } while (freq >= target_freq && freq != CPUFREQ_TABLE_END); + found = entry - 1; + /* + * Use the one found or the previous one, depending on the relation. + * CPUFREQ_RELATION_H is not taken into account here, but it is not + * expected to be passed to this function anyway. 
+ */ + next_freq = found->frequency; + if (freq == CPUFREQ_TABLE_END || relation != CPUFREQ_RELATION_C || + target_freq - freq >= next_freq - target_freq) { + next_perf_state = found->driver_data; + } else { + next_freq = freq; + next_perf_state = entry->driver_data; + } + + perf = to_perf_data(data); + if (perf->state == next_perf_state) { + if (unlikely(data->resume)) + data->resume = 0; + else + return next_freq; + } + + data->cpu_freq_write(&perf->control_register, + perf->states[next_perf_state].control); + perf->state = next_perf_state; + return next_freq; +} + static unsigned long acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu) { @@ -740,6 +789,9 @@ static int acpi_cpufreq_cpu_init(struct goto err_unreg; } + policy->fast_switch_possible = !acpi_pstate_strict && + !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY); + data->freq_table = kzalloc(sizeof(*data->freq_table) * (perf->state_count+1), GFP_KERNEL); if (!data->freq_table) { @@ -874,6 +926,7 @@ static struct freq_attr *acpi_cpufreq_at static struct cpufreq_driver acpi_cpufreq_driver = { .verify = cpufreq_generic_frequency_table_verify, .target_index = acpi_cpufreq_target, + .fast_switch = acpi_cpufreq_fast_switch, .bios_limit = acpi_processor_get_bios_limit, .init = acpi_cpufreq_cpu_init, .exit = acpi_cpufreq_cpu_exit, Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -1772,6 +1772,39 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie * GOVERNORS * *********************************************************************/ +/** + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch. + * @policy: cpufreq policy to switch the frequency for. + * @target_freq: New frequency to set (may be approximate). + * @relation: Relation to use for frequency selection. 
+ * + * Carry out a fast frequency switch from interrupt context. + * + * This function must not be called if policy->fast_switch_possible is unset. + * + * Governors calling this function must guarantee that it will never be invoked + * twice in parallel for the same policy and that it will never be called in + * parallel with either ->target() or ->target_index() for the same policy. + * + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch() + * callback, the hardware configuration must be preserved. + */ +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq, unsigned int relation) +{ + unsigned int freq; + + if (target_freq == policy->cur) + return; + + freq = cpufreq_driver->fast_switch(policy, target_freq, relation); + if (freq != CPUFREQ_ENTRY_INVALID) { + policy->cur = freq; + trace_cpu_frequency(freq, smp_processor_id()); + } +} +EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch); + /* Must set freqs->new to intermediate frequency */ static int __target_intermediate(struct cpufreq_policy *policy, struct cpufreq_freqs *freqs, int index) Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -81,6 +81,7 @@ struct cpufreq_policy { struct cpufreq_governor *governor; /* see below */ void *governor_data; char last_governor[CPUFREQ_NAME_LEN]; /* last governor used */ + bool fast_switch_possible; struct work_struct update; /* if update_policy() needs to be * called, but you're in IRQ context */ @@ -270,6 +271,9 @@ struct cpufreq_driver { unsigned int relation); /* Deprecated */ int (*target_index)(struct cpufreq_policy *policy, unsigned int index); + unsigned int (*fast_switch)(struct cpufreq_policy *policy, + unsigned int target_freq, + unsigned int relation); /* * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION * unset. 
@@ -484,6 +488,8 @@ struct cpufreq_governor { }; /* Pass a target to the cpufreq driver */ +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq, unsigned int relation); int cpufreq_driver_target(struct cpufreq_policy *policy, unsigned int target_freq, unsigned int relation); ^ permalink raw reply [flat|nested] 158+ messages in thread
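The frequency-table walk in acpi_cpufreq_fast_switch() above can be exercised in isolation. The following standalone sketch uses hypothetical names (the real code operates on struct cpufreq_frequency_table entries) but reproduces the same selection: the table is sorted in descending order and terminated by a sentinel, and with the "closest" relation the entry just below the target wins when it is nearer than the one above.

```c
#include <assert.h>

#define TABLE_END 0xFFFFFFFFu	/* stand-in for CPUFREQ_TABLE_END */

/* Example table, descending order, sentinel-terminated (values in MHz). */
static const unsigned int sample_table[] = { 2000, 1500, 1000, TABLE_END };

/*
 * table[] must be sorted in descending frequency order and terminated
 * by TABLE_END; target is assumed not to exceed table[0], since cpufreq
 * governors clamp requests to the policy limits before calling in.
 */
unsigned int pick_freq(const unsigned int *table, unsigned int target,
		       int closest)
{
	const unsigned int *entry = table, *found;
	unsigned int freq, above;

	/* Walk down until we fall below the target or hit the end. */
	do {
		entry++;
		freq = *entry;
	} while (freq >= target && freq != TABLE_END);
	found = entry - 1;
	above = *found;		/* lowest frequency >= target */

	/* Without the "closest" relation, always round up. */
	if (freq == TABLE_END || !closest || target - freq >= above - target)
		return above;
	return freq;		/* highest frequency < target is nearer */
}
```

Note that, like the patch, the sketch resolves distance ties upward: when the entries above and below are equally far from the target, the higher frequency is chosen.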
* Re: [PATCH 5/6] cpufreq: Support for fast frequency switching 2016-03-02 2:12 ` [PATCH 5/6] cpufreq: Support for fast frequency switching Rafael J. Wysocki @ 2016-03-03 6:00 ` Viresh Kumar 2016-03-04 2:15 ` Rafael J. Wysocki 2016-03-03 11:16 ` Peter Zijlstra 2016-03-03 11:18 ` Peter Zijlstra 2 siblings, 1 reply; 158+ messages in thread From: Viresh Kumar @ 2016-03-03 6:00 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Vincent Guittot, Michael Turquette On 02-03-16, 03:12, Rafael J. Wysocki wrote: > Index: linux-pm/drivers/cpufreq/cpufreq.c > =================================================================== > --- linux-pm.orig/drivers/cpufreq/cpufreq.c > +++ linux-pm/drivers/cpufreq/cpufreq.c > @@ -1772,6 +1772,39 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie > * GOVERNORS * > *********************************************************************/ > > +/** > + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch. > + * @policy: cpufreq policy to switch the frequency for. > + * @target_freq: New frequency to set (may be approximate). > + * @relation: Relation to use for frequency selection. > + * > + * Carry out a fast frequency switch from interrupt context. > + * > + * This function must not be called if policy->fast_switch_possible is unset. > + * > + * Governors calling this function must guarantee that it will never be invoked > + * twice in parallel for the same policy and that it will never be called in > + * parallel with either ->target() or ->target_index() for the same policy. > + * > + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch() > + * callback, the hardware configuration must be preserved. 
> + */ > +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, > + unsigned int target_freq, unsigned int relation) > +{ > + unsigned int freq; > + > + if (target_freq == policy->cur) Maybe an unlikely() here ? > + return; > + > + freq = cpufreq_driver->fast_switch(policy, target_freq, relation); > + if (freq != CPUFREQ_ENTRY_INVALID) { > + policy->cur = freq; Hmm.. What will happen to the code relying on the cpufreq-notifiers now ? > + trace_cpu_frequency(freq, smp_processor_id()); > + } > +} > +EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch); -- viresh ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 5/6] cpufreq: Support for fast frequency switching 2016-03-03 6:00 ` Viresh Kumar @ 2016-03-04 2:15 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 2:15 UTC (permalink / raw) To: Viresh Kumar Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Vincent Guittot, Michael Turquette On Thu, Mar 3, 2016 at 7:00 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote: > On 02-03-16, 03:12, Rafael J. Wysocki wrote: >> Index: linux-pm/drivers/cpufreq/cpufreq.c >> =================================================================== >> --- linux-pm.orig/drivers/cpufreq/cpufreq.c >> +++ linux-pm/drivers/cpufreq/cpufreq.c >> @@ -1772,6 +1772,39 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie >> * GOVERNORS * >> *********************************************************************/ >> >> +/** >> + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch. >> + * @policy: cpufreq policy to switch the frequency for. >> + * @target_freq: New frequency to set (may be approximate). >> + * @relation: Relation to use for frequency selection. >> + * >> + * Carry out a fast frequency switch from interrupt context. >> + * >> + * This function must not be called if policy->fast_switch_possible is unset. >> + * >> + * Governors calling this function must guarantee that it will never be invoked >> + * twice in parallel for the same policy and that it will never be called in >> + * parallel with either ->target() or ->target_index() for the same policy. >> + * >> + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch() >> + * callback, the hardware configuration must be preserved. >> + */ >> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, >> + unsigned int target_freq, unsigned int relation) >> +{ >> + unsigned int freq; >> + >> + if (target_freq == policy->cur) > > Maybe an unlikely() here ? 
> >> + return; >> + >> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation); >> + if (freq != CPUFREQ_ENTRY_INVALID) { >> + policy->cur = freq; > > Hmm.. What will happen to the code relying on the cpufreq-notifiers > now ? It will have a problem. For that code it's like the CPU changing the frequency and not telling it (which is not unusual for that matter). Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 5/6] cpufreq: Support for fast frequency switching 2016-03-02 2:12 ` [PATCH 5/6] cpufreq: Support for fast frequency switching Rafael J. Wysocki 2016-03-03 6:00 ` Viresh Kumar @ 2016-03-03 11:16 ` Peter Zijlstra 2016-03-03 20:56 ` Rafael J. Wysocki 2016-03-03 11:18 ` Peter Zijlstra 2 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 11:16 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette On Wed, Mar 02, 2016 at 03:12:33AM +0100, Rafael J. Wysocki wrote: > The most important change from the previous version is that the > ->fast_switch() callback takes an additional "relation" argument > and now the governor can use it to choose a selection method. > +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy, > + unsigned int target_freq, > + unsigned int relation) Would it make sense to replace the {target_freq, relation} pair with something like the CPPC {min_freq, max_freq} pair? Then you could use the closest frequency to max provided it is larger than min. This communicates more actual information in the same number of parameters and would thereby allow for a more flexible (better) frequency selection. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 5/6] cpufreq: Support for fast frequency switching 2016-03-03 11:16 ` Peter Zijlstra @ 2016-03-03 20:56 ` Rafael J. Wysocki 2016-03-03 21:12 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-03 20:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette On Thu, Mar 3, 2016 at 12:16 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Mar 02, 2016 at 03:12:33AM +0100, Rafael J. Wysocki wrote: >> The most important change from the previous version is that the >> ->fast_switch() callback takes an additional "relation" argument >> and now the governor can use it to choose a selection method. > >> +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy, >> + unsigned int target_freq, >> + unsigned int relation) > > Would it make sense to replace the {target_freq, relation} pair with > something like the CPPC {min_freq, max_freq} pair? Yes, it would in general, but since I use __cpufreq_driver_target() in the "slow driver" case, that would need to be reworked too for consistency. So I'd prefer to do that later. > Then you could use the closest frequency to max provided it is larger > than min. > > This communicates more actual information in the same number of > parameters and would thereby allow for a more flexible (better) > frequency selection. Agreed. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 5/6] cpufreq: Support for fast frequency switching 2016-03-03 20:56 ` Rafael J. Wysocki @ 2016-03-03 21:12 ` Peter Zijlstra 0 siblings, 0 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 21:12 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette On Thu, Mar 03, 2016 at 09:56:40PM +0100, Rafael J. Wysocki wrote: > On Thu, Mar 3, 2016 at 12:16 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Mar 02, 2016 at 03:12:33AM +0100, Rafael J. Wysocki wrote: > >> The most important change from the previous version is that the > >> ->fast_switch() callback takes an additional "relation" argument > >> and now the governor can use it to choose a selection method. > > > >> +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy, > >> + unsigned int target_freq, > >> + unsigned int relation) > > > > Would it make sense to replace the {target_freq, relation} pair with > > something like the CPPC {min_freq, max_freq} pair? > > Yes, it would in general, but since I use __cpufreq_driver_target() in > the "slow driver" case, that would need to be reworked too for > consistency. So I'd prefer to do that later. OK, fair enough. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 5/6] cpufreq: Support for fast frequency switching 2016-03-02 2:12 ` [PATCH 5/6] cpufreq: Support for fast frequency switching Rafael J. Wysocki 2016-03-03 6:00 ` Viresh Kumar 2016-03-03 11:16 ` Peter Zijlstra @ 2016-03-03 11:18 ` Peter Zijlstra 2016-03-03 19:39 ` Rafael J. Wysocki 2 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 11:18 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette On Wed, Mar 02, 2016 at 03:12:33AM +0100, Rafael J. Wysocki wrote: > +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, > + unsigned int target_freq, unsigned int relation) > +{ > + unsigned int freq; > + > + if (target_freq == policy->cur) > + return; But what if relation is different from last time? ;-) ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 5/6] cpufreq: Support for fast frequency switching 2016-03-03 11:18 ` Peter Zijlstra @ 2016-03-03 19:39 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-03 19:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette On Thu, Mar 3, 2016 at 12:18 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Mar 02, 2016 at 03:12:33AM +0100, Rafael J. Wysocki wrote: >> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, >> + unsigned int target_freq, unsigned int relation) >> +{ >> + unsigned int freq; >> + >> + if (target_freq == policy->cur) >> + return; > > But what if relation is different from last time? ;-) Doh. Never mind, I'll drop this check (that said, this mistake is present elsewhere too IIRC). ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-02 1:56 [PATCH 0/6] cpufreq: schedutil governor Rafael J. Wysocki ` (4 preceding siblings ...) 2016-03-02 2:12 ` [PATCH 5/6] cpufreq: Support for fast frequency switching Rafael J. Wysocki @ 2016-03-02 2:27 ` Rafael J. Wysocki 2016-03-02 17:10 ` Vincent Guittot 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki 6 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-02 2:27 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Add a new cpufreq scaling governor, called "schedutil", that uses scheduler-provided CPU utilization information as input for making its decisions. Doing that is possible after commit fe7034338ba0 (cpufreq: Add mechanism for registering utilization update callbacks) that introduced cpufreq_update_util() called by the scheduler on utilization changes (from CFS) and RT/DL task status updates. In particular, CPU frequency scaling decisions may be based on the utilization data passed to cpufreq_update_util() by CFS. The new governor is relatively simple. The frequency selection formula used by it is essentially the same as the one used by the "ondemand" governor, although it doesn't use the additional up_threshold parameter, but instead of computing the load as the "non-idle CPU time" to "total CPU time" ratio, it takes the utilization data provided by CFS as input. More specifically, it represents "load" as the util/max ratio, where util and max are the utilization and CPU capacity coming from CFS. All of the computations are carried out in the utilization update handlers provided by the new governor.
One of those handlers is used for cpufreq policies shared between multiple CPUs and the other one is for policies with one CPU only (and therefore it doesn't need to use any extra synchronization means). The governor supports fast frequency switching if that is supported by the cpufreq driver in use and possible for the given policy. In the fast switching case, all operations of the governor take place in its utilization update handlers. If fast switching cannot be used, the frequency switch operations are carried out with the help of a work item which only calls __cpufreq_driver_target() (under a mutex) to trigger a frequency update (to a value already computed beforehand in one of the utilization update handlers). Currently, the governor treats all of the RT and DL tasks as "unknown utilization" and sets the frequency to the allowed maximum when updated from the RT or DL sched classes. That heavy-handed approach should be replaced with something more subtle and specifically targeted at RT and DL tasks. The governor shares some tunables management code with the "ondemand" and "conservative" governors and uses some common definitions from cpufreq_governor.h, but apart from that it is stand-alone. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- In addition to the changes mentioned in the intro message [0/6] this also tweaks the frequency selection formula in a couple of ways. First off, it uses min and max frequencies correctly (the formula from "ondemand" is applied to cpuinfo.min/max_freq like the original and policy->min/max are applied to the result later). Second, RELATION_L is used most of the time except for the bottom 1/4 of the available frequency range (but also note that DL tasks are treated in the same way as RT ones, meaning f_max is always used for them). Finally, the condition for discarding idle policy CPUs was modified to also work if the rate limit is below the scheduling rate. 
The code in sugov_init/exit/stop() and the irq_work handler look very similar to the analogous code in cpufreq_governor.c, but it is different enough that trying to avoid that duplication was not practical. Thanks, Rafael --- drivers/cpufreq/Kconfig | 26 + drivers/cpufreq/Makefile | 1 drivers/cpufreq/cpufreq_schedutil.c | 501 ++++++++++++++++++++++++++++++++++++ 3 files changed, 528 insertions(+) Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c =================================================================== --- /dev/null +++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c @@ -0,0 +1,501 @@ +/* + * CPUFreq governor based on scheduler-provided CPU utilization data. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/percpu-defs.h> +#include <linux/slab.h> + +#include "cpufreq_governor.h" + +struct sugov_tunables { + struct gov_tunables gt; + unsigned int rate_limit_us; +}; + +struct sugov_policy { + struct cpufreq_policy *policy; + + struct sugov_tunables *tunables; + struct list_head tunables_hook; + + raw_spinlock_t update_lock; /* For shared policies */ + u64 last_freq_update_time; + s64 freq_update_delay_ns; + unsigned int next_freq; + + /* The next fields are only needed if fast switch cannot be used. */ + unsigned int relation; + struct irq_work irq_work; + struct work_struct work; + struct mutex work_lock; + bool work_in_progress; + + bool need_freq_update; +}; + +struct sugov_cpu { + struct update_util_data update_util; + struct sugov_policy *sg_policy; + + /* The fields below are only needed when sharing a policy. 
*/ + unsigned long util; + unsigned long max; + u64 last_update; +}; + +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); + +/************************ Governor internals ***********************/ + +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) +{ + u64 delta_ns; + + if (sg_policy->work_in_progress) + return false; + + if (unlikely(sg_policy->need_freq_update)) { + sg_policy->need_freq_update = false; + return true; + } + + delta_ns = time - sg_policy->last_freq_update_time; + return (s64)delta_ns >= sg_policy->freq_update_delay_ns; +} + +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, + unsigned long util, unsigned long max, + unsigned int next_freq) +{ + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int rel; + + if (next_freq > policy->max) + next_freq = policy->max; + else if (next_freq < policy->min) + next_freq = policy->min; + + sg_policy->last_freq_update_time = time; + if (sg_policy->next_freq == next_freq) + return; + + sg_policy->next_freq = next_freq; + /* + * If utilization is less than max / 4, use RELATION_C to allow the + * minimum frequency to be selected more often in case the distance from + * it to the next available frequency in the table is significant. + */ + rel = util < (max >> 2) ? 
CPUFREQ_RELATION_C : CPUFREQ_RELATION_L; + if (policy->fast_switch_possible) { + cpufreq_driver_fast_switch(policy, next_freq, rel); + } else { + sg_policy->relation = rel; + sg_policy->work_in_progress = true; + irq_work_queue(&sg_policy->irq_work); + } +} + +static void sugov_update_single(struct update_util_data *data, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + unsigned int min_f, max_f, next_f; + + if (!sugov_should_update_freq(sg_policy, time)) + return; + + min_f = sg_policy->policy->cpuinfo.min_freq; + max_f = sg_policy->policy->cpuinfo.max_freq; + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; + + sugov_update_commit(sg_policy, time, util, max, next_f); +} + +static unsigned int sugov_next_freq(struct sugov_policy *sg_policy, + unsigned long util, unsigned long max) +{ + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int min_f = policy->cpuinfo.min_freq; + unsigned int max_f = policy->cpuinfo.max_freq; + u64 last_freq_update_time = sg_policy->last_freq_update_time; + unsigned int j; + + if (util > max) + return max_f; + + for_each_cpu(j, policy->cpus) { + struct sugov_cpu *j_sg_cpu; + unsigned long j_util, j_max; + u64 delta_ns; + + if (j == smp_processor_id()) + continue; + + j_sg_cpu = &per_cpu(sugov_cpu, j); + /* + * If the CPU utilization was last updated before the previous + * frequency update and the time elapsed between the last update + * of the CPU utilization and the last frequency update is long + * enough, don't take the CPU into account as it probably is + * idle now. 
+ */ + delta_ns = last_freq_update_time - j_sg_cpu->last_update; + if ((s64)delta_ns > NSEC_PER_SEC / HZ) + continue; + + j_util = j_sg_cpu->util; + j_max = j_sg_cpu->max; + if (j_util > j_max) + return max_f; + + if (j_util * max > j_max * util) { + util = j_util; + max = j_max; + } + } + + return min_f + util * (max_f - min_f) / max; +} + +static void sugov_update_shared(struct update_util_data *data, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + unsigned int next_f; + + raw_spin_lock(&sg_policy->update_lock); + + sg_cpu->util = util; + sg_cpu->max = max; + sg_cpu->last_update = time; + + if (sugov_should_update_freq(sg_policy, time)) { + next_f = sugov_next_freq(sg_policy, util, max); + sugov_update_commit(sg_policy, time, util, max, next_f); + } + + raw_spin_unlock(&sg_policy->update_lock); +} + +static void sugov_work(struct work_struct *work) +{ + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work); + + mutex_lock(&sg_policy->work_lock); + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, + sg_policy->relation); + mutex_unlock(&sg_policy->work_lock); + + sg_policy->work_in_progress = false; +} + +static void sugov_irq_work(struct irq_work *irq_work) +{ + struct sugov_policy *sg_policy; + + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); + schedule_work(&sg_policy->work); +} + +/************************** sysfs interface ************************/ + +static struct sugov_tunables *global_tunables; +static DEFINE_MUTEX(global_tunables_lock); + +static inline struct sugov_tunables *to_sugov_tunables(struct gov_tunables *gt) +{ + return container_of(gt, struct sugov_tunables, gt); +} + +static ssize_t rate_limit_us_show(struct gov_tunables *gt, char *buf) +{ + struct sugov_tunables *tunables = to_sugov_tunables(gt); + + return sprintf(buf, "%u\n", 
tunables->rate_limit_us); +} + +static ssize_t rate_limit_us_store(struct gov_tunables *gt, const char *buf, + size_t count) +{ + struct sugov_tunables *tunables = to_sugov_tunables(gt); + struct sugov_policy *sg_policy; + unsigned int rate_limit_us; + int ret; + + ret = sscanf(buf, "%u", &rate_limit_us); + if (ret != 1) + return -EINVAL; + + tunables->rate_limit_us = rate_limit_us; + + list_for_each_entry(sg_policy, >->policy_list, tunables_hook) + sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC; + + return count; +} + +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us); + +static struct attribute *sugov_attributes[] = { + &rate_limit_us.attr, + NULL +}; + +static struct kobj_type sugov_tunables_ktype = { + .default_attrs = sugov_attributes, + .sysfs_ops = &governor_sysfs_ops, +}; + +/********************** cpufreq governor interface *********************/ + +static struct cpufreq_governor schedutil_gov; + +static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + + sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL); + if (!sg_policy) + return NULL; + + sg_policy->policy = policy; + init_irq_work(&sg_policy->irq_work, sugov_irq_work); + INIT_WORK(&sg_policy->work, sugov_work); + mutex_init(&sg_policy->work_lock); + raw_spin_lock_init(&sg_policy->update_lock); + return sg_policy; +} + +static void sugov_policy_free(struct sugov_policy *sg_policy) +{ + mutex_destroy(&sg_policy->work_lock); + kfree(sg_policy); +} + +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy) +{ + struct sugov_tunables *tunables; + + tunables = kzalloc(sizeof(*tunables), GFP_KERNEL); + if (tunables) + gov_tunables_init(&tunables->gt, &sg_policy->tunables_hook); + + return tunables; +} + +static void sugov_tunables_free(struct sugov_tunables *tunables) +{ + if (!have_governor_per_policy()) + global_tunables = NULL; + + kfree(tunables); +} + +static int sugov_init(struct 
cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + struct sugov_tunables *tunables; + unsigned int lat; + int ret = 0; + + /* State should be equivalent to EXIT */ + if (policy->governor_data) + return -EBUSY; + + sg_policy = sugov_policy_alloc(policy); + if (!sg_policy) + return -ENOMEM; + + mutex_lock(&global_tunables_lock); + + if (global_tunables) { + if (WARN_ON(have_governor_per_policy())) { + ret = -EINVAL; + goto free_sg_policy; + } + policy->governor_data = sg_policy; + sg_policy->tunables = global_tunables; + + gov_tunables_get(&global_tunables->gt, &sg_policy->tunables_hook); + goto out; + } + + tunables = sugov_tunables_alloc(sg_policy); + if (!tunables) { + ret = -ENOMEM; + goto free_sg_policy; + } + + tunables->rate_limit_us = LATENCY_MULTIPLIER; + lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC; + if (lat) + tunables->rate_limit_us *= lat; + + if (!have_governor_per_policy()) + global_tunables = tunables; + + policy->governor_data = sg_policy; + sg_policy->tunables = tunables; + + ret = kobject_init_and_add(&tunables->gt.kobj, &sugov_tunables_ktype, + get_governor_parent_kobj(policy), "%s", + schedutil_gov.name); + if (!ret) + goto out; + + /* Failure, so roll back. 
*/ + policy->governor_data = NULL; + sugov_tunables_free(tunables); + + free_sg_policy: + pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret); + sugov_policy_free(sg_policy); + + out: + mutex_unlock(&global_tunables_lock); + return ret; +} + +static int sugov_exit(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + struct sugov_tunables *tunables = sg_policy->tunables; + unsigned int count; + + mutex_lock(&global_tunables_lock); + + count = gov_tunables_put(&tunables->gt, &sg_policy->tunables_hook); + policy->governor_data = NULL; + if (!count) + sugov_tunables_free(tunables); + + mutex_unlock(&global_tunables_lock); + + sugov_policy_free(sg_policy); + return 0; +} + +static int sugov_start(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC; + sg_policy->last_freq_update_time = 0; + sg_policy->next_freq = UINT_MAX; + sg_policy->work_in_progress = false; + sg_policy->need_freq_update = false; + + for_each_cpu(cpu, policy->cpus) { + struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu); + + sg_cpu->sg_policy = sg_policy; + if (policy_is_shared(policy)) { + sg_cpu->util = ULONG_MAX; + sg_cpu->max = 0; + sg_cpu->last_update = 0; + sg_cpu->update_util.func = sugov_update_shared; + } else { + sg_cpu->update_util.func = sugov_update_single; + } + cpufreq_set_update_util_data(cpu, &sg_cpu->update_util); + } + return 0; +} + +static int sugov_stop(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + for_each_cpu(cpu, policy->cpus) + cpufreq_set_update_util_data(cpu, NULL); + + synchronize_sched(); + + irq_work_sync(&sg_policy->irq_work); + cancel_work_sync(&sg_policy->work); + return 0; +} + +static int sugov_limits(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = 
policy->governor_data; + + if (!policy->fast_switch_possible) { + mutex_lock(&sg_policy->work_lock); + + if (policy->max < policy->cur) + __cpufreq_driver_target(policy, policy->max, + CPUFREQ_RELATION_H); + else if (policy->min > policy->cur) + __cpufreq_driver_target(policy, policy->min, + CPUFREQ_RELATION_L); + + mutex_unlock(&sg_policy->work_lock); + } + + sg_policy->need_freq_update = true; + return 0; +} + +int sugov_governor(struct cpufreq_policy *policy, unsigned int event) +{ + if (event == CPUFREQ_GOV_POLICY_INIT) { + return sugov_init(policy); + } else if (policy->governor_data) { + switch (event) { + case CPUFREQ_GOV_POLICY_EXIT: + return sugov_exit(policy); + case CPUFREQ_GOV_START: + return sugov_start(policy); + case CPUFREQ_GOV_STOP: + return sugov_stop(policy); + case CPUFREQ_GOV_LIMITS: + return sugov_limits(policy); + } + } + return -EINVAL; +} + +static struct cpufreq_governor schedutil_gov = { + .name = "schedutil", + .governor = sugov_governor, + .max_transition_latency = TRANSITION_LATENCY_LIMIT, + .owner = THIS_MODULE, +}; + +static int __init sugov_module_init(void) +{ + return cpufreq_register_governor(&schedutil_gov); +} + +static void __exit sugov_module_exit(void) +{ + cpufreq_unregister_governor(&schedutil_gov); +} + +MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>"); +MODULE_DESCRIPTION("Utilization-based CPU frequency selection"); +MODULE_LICENSE("GPL"); + +#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL +struct cpufreq_governor *cpufreq_default_governor(void) +{ + return &schedutil_gov; +} + +fs_initcall(sugov_module_init); +#else +module_init(sugov_module_init); +#endif +module_exit(sugov_module_exit); Index: linux-pm/drivers/cpufreq/Kconfig =================================================================== --- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. 
If unsure have a look at the help section of the driver. Fallback governor will be the performance governor. + +config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL + bool "schedutil" + select CPU_FREQ_GOV_SCHEDUTIL + select CPU_FREQ_GOV_PERFORMANCE + help + Use the 'schedutil' CPUFreq governor by default. If unsure, + have a look at the help section of that governor. The fallback + governor will be 'performance'. + endchoice config CPU_FREQ_GOV_PERFORMANCE @@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE If in doubt, say N. +config CPU_FREQ_GOV_SCHEDUTIL + tristate "'schedutil' cpufreq policy governor" + depends on CPU_FREQ + select CPU_FREQ_GOV_TUNABLES + select IRQ_WORK + help + The frequency selection formula used by this governor is analogous + to the one used by 'ondemand', but instead of computing CPU load + as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU + utilization data provided by the scheduler as input. + + To compile this driver as a module, choose M here: the + module will be called cpufreq_schedutil. + + If in doubt, say N. + comment "CPU frequency scaling drivers" config CPUFREQ_DT Index: linux-pm/drivers/cpufreq/Makefile =================================================================== --- linux-pm.orig/drivers/cpufreq/Makefile +++ linux-pm/drivers/cpufreq/Makefile @@ -10,6 +10,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE) += obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) += cpufreq_userspace.o obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o obj-$(CONFIG_CPU_FREQ_GOV_TUNABLES) += cpufreq_governor_tunables.o ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-02 2:27 ` [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki @ 2016-03-02 17:10 ` Vincent Guittot 2016-03-02 17:58 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Vincent Guittot @ 2016-03-02 17:10 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Michael Turquette Hi Rafael, On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Add a new cpufreq scaling governor, called "schedutil", that uses > scheduler-provided CPU utilization information as input for making > its decisions. > > Doing that is possible after commit fe7034338ba0 (cpufreq: Add > mechanism for registering utilization update callbacks) that > introduced cpufreq_update_util() called by the scheduler on > utilization changes (from CFS) and RT/DL task status updates. > In particular, CPU frequency scaling decisions may be based on > the the utilization data passed to cpufreq_update_util() by CFS. > > The new governor is relatively simple. > > The frequency selection formula used by it is essentially the same > as the one used by the "ondemand" governor, although it doesn't use > the additional up_threshold parameter, but instead of computing the > load as the "non-idle CPU time" to "total CPU time" ratio, it takes > the utilization data provided by CFS as input. More specifically, > it represents "load" as the util/max ratio, where util and max > are the utilization and CPU capacity coming from CFS. 
> [snip] > + > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, > + unsigned long util, unsigned long max, > + unsigned int next_freq) > +{ > + struct cpufreq_policy *policy = sg_policy->policy; > + unsigned int rel; > + > + if (next_freq > policy->max) > + next_freq = policy->max; > + else if (next_freq < policy->min) > + next_freq = policy->min; > + > + sg_policy->last_freq_update_time = time; > + if (sg_policy->next_freq == next_freq) > + return; > + > + sg_policy->next_freq = next_freq; > + /* > + * If utilization is less than max / 4, use RELATION_C to allow the > + * minimum frequency to be selected more often in case the distance from > + * it to the next available frequency in the table is significant. > + */ > + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L; > + if (policy->fast_switch_possible) { > + cpufreq_driver_fast_switch(policy, next_freq, rel); > + } else { > + sg_policy->relation = rel; > + sg_policy->work_in_progress = true; > + irq_work_queue(&sg_policy->irq_work); > + } > +} > + > +static void sugov_update_single(struct update_util_data *data, u64 time, > + unsigned long util, unsigned long max) > +{ > + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util); > + struct sugov_policy *sg_policy = sg_cpu->sg_policy; > + unsigned int min_f, max_f, next_f; > + > + if (!sugov_should_update_freq(sg_policy, time)) > + return; > + > + min_f = sg_policy->policy->cpuinfo.min_freq; > + max_f = sg_policy->policy->cpuinfo.max_freq; > + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; I think it has been pointed out in another email's thread but you should change the way the next_f is computed. util reflects the utilization of a CPU from 0 to its max compute capacity whereas ondemand was using the load at the current frequency during the last time window. 
I understand that you want to keep the same formula as ondemand as a starting point, but you use a different input to calculate the next frequency, so I don't see the rationale for keeping this formula. That said, even the simpler formula next_f = util > max ? max_f : util * max_f / max will not work properly if frequency invariance is enabled, because the utilization becomes capped by the current compute capacity, so next_f will never be higher than the current frequency (unless a task moves onto the rq). That was one reason for using a threshold in the sched-freq proposal (and there is ongoing development trying to solve this limitation). IIUC, frequency invariance is not enabled on your platform, so you have not seen that problem, but you have probably seen that the selection of next_f was not really stable. Without frequency invariance, the utilization will be overestimated when running at a lower frequency, so the governor will probably select a frequency that is higher than necessary; but then the utilization will decrease at this higher frequency, so the governor will probably decrease the frequency, and so on, until it finds the right frequency that generates the right utilization value. Regards, Vincent > + > + sugov_update_commit(sg_policy, time, util, max, next_f); > +} > + > +static unsigned int sugov_next_freq(struct sugov_policy *sg_policy, > + unsigned long util, unsigned long max) > +{ > + struct cpufreq_policy *policy = sg_policy->policy; > + unsigned int min_f = policy->cpuinfo.min_freq; > + unsigned int max_f = policy->cpuinfo.max_freq; > + u64 last_freq_update_time = sg_policy->last_freq_update_time; > + unsigned int j; > + > + if (util > max) > + return max_f; > + > + for_each_cpu(j, policy->cpus) { > + struct sugov_cpu *j_sg_cpu; > + unsigned long j_util, j_max; > + u64 delta_ns; > + > + if (j == smp_processor_id()) > + continue; > + > + j_sg_cpu = &per_cpu(sugov_cpu, j); > + /* > + * If the CPU utilization was last updated before the previous > + * frequency update
and the time elapsed between the last update > + * of the CPU utilization and the last frequency update is long > + * enough, don't take the CPU into account as it probably is > + * idle now. > + */ > + delta_ns = last_freq_update_time - j_sg_cpu->last_update; > + if ((s64)delta_ns > NSEC_PER_SEC / HZ) > + continue; > + > + j_util = j_sg_cpu->util; > + j_max = j_sg_cpu->max; > + if (j_util > j_max) > + return max_f; > + > + if (j_util * max > j_max * util) { > + util = j_util; > + max = j_max; > + } > + } > + > + return min_f + util * (max_f - min_f) / max; > +} > + [snip] ^ permalink raw reply [flat|nested] 158+ messages in thread
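For a shared policy, sugov_next_freq() above reduces to three rules: drop any CPU whose last utilization update is older than the last frequency update by more than about one tick (it is probably idle), jump straight to max_f if any remaining CPU reports util > max, and otherwise keep the (util, max) pair with the greatest util/max ratio. A minimal user-space model of that selection follows; the struct, tick length and sample values are stand-ins for the kernel's per-CPU sugov_cpu state and NSEC_PER_SEC / HZ, and the calling CPU's own sample is treated like any other here:

```c
#include <assert.h>
#include <stdint.h>

#define TICK_NS 10000000ULL /* stand-in for NSEC_PER_SEC / HZ at HZ=100 */

struct cpu_sample {
	unsigned long util;   /* utilization reported for this CPU */
	unsigned long max;    /* compute capacity of this CPU */
	uint64_t last_update; /* time of the last utilization update */
};

/*
 * Model of sugov_next_freq(): pick one frequency for the whole policy
 * from per-CPU (util, max) samples, skipping CPUs that look idle.
 */
static unsigned int next_freq(const struct cpu_sample *cpu, int ncpus,
			      uint64_t last_freq_update_time,
			      unsigned int min_f, unsigned int max_f)
{
	unsigned long util = 0, max = 1;
	int i;

	for (i = 0; i < ncpus; i++) {
		/* Stale sample: the CPU is probably idle now, ignore it. */
		if ((int64_t)(last_freq_update_time - cpu[i].last_update) >
		    (int64_t)TICK_NS)
			continue;
		if (cpu[i].util > cpu[i].max)
			return max_f;
		/* Keep the sample with the largest util/max ratio. */
		if (cpu[i].util * max > cpu[i].max * util) {
			util = cpu[i].util;
			max = cpu[i].max;
		}
	}
	return min_f + (unsigned int)(util * (max_f - min_f) / max);
}
```

With min_f = 500000 kHz and max_f = 1000000 kHz, a fresh sample of util = 400 out of max = 1024 selects about 695 MHz, while a busier but stale sample on another CPU is ignored.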
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-02 17:10 ` Vincent Guittot @ 2016-03-02 17:58 ` Rafael J. Wysocki 2016-03-02 22:49 ` Rafael J. Wysocki 2016-03-03 13:07 ` Vincent Guittot 0 siblings, 2 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-02 17:58 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot <vincent.guittot@linaro.org> wrote: > Hi Rafael, > > > On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> Add a new cpufreq scaling governor, called "schedutil", that uses >> scheduler-provided CPU utilization information as input for making >> its decisions. >> >> Doing that is possible after commit fe7034338ba0 (cpufreq: Add >> mechanism for registering utilization update callbacks) that >> introduced cpufreq_update_util() called by the scheduler on >> utilization changes (from CFS) and RT/DL task status updates. >> In particular, CPU frequency scaling decisions may be based on >> the the utilization data passed to cpufreq_update_util() by CFS. >> >> The new governor is relatively simple. >> >> The frequency selection formula used by it is essentially the same >> as the one used by the "ondemand" governor, although it doesn't use >> the additional up_threshold parameter, but instead of computing the >> load as the "non-idle CPU time" to "total CPU time" ratio, it takes >> the utilization data provided by CFS as input. More specifically, >> it represents "load" as the util/max ratio, where util and max >> are the utilization and CPU capacity coming from CFS. 
>> > > [snip] > >> + >> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, >> + unsigned long util, unsigned long max, >> + unsigned int next_freq) >> +{ >> + struct cpufreq_policy *policy = sg_policy->policy; >> + unsigned int rel; >> + >> + if (next_freq > policy->max) >> + next_freq = policy->max; >> + else if (next_freq < policy->min) >> + next_freq = policy->min; >> + >> + sg_policy->last_freq_update_time = time; >> + if (sg_policy->next_freq == next_freq) >> + return; >> + >> + sg_policy->next_freq = next_freq; >> + /* >> + * If utilization is less than max / 4, use RELATION_C to allow the >> + * minimum frequency to be selected more often in case the distance from >> + * it to the next available frequency in the table is significant. >> + */ >> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L; >> + if (policy->fast_switch_possible) { >> + cpufreq_driver_fast_switch(policy, next_freq, rel); >> + } else { >> + sg_policy->relation = rel; >> + sg_policy->work_in_progress = true; >> + irq_work_queue(&sg_policy->irq_work); >> + } >> +} >> + >> +static void sugov_update_single(struct update_util_data *data, u64 time, >> + unsigned long util, unsigned long max) >> +{ >> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util); >> + struct sugov_policy *sg_policy = sg_cpu->sg_policy; >> + unsigned int min_f, max_f, next_f; >> + >> + if (!sugov_should_update_freq(sg_policy, time)) >> + return; >> + >> + min_f = sg_policy->policy->cpuinfo.min_freq; >> + max_f = sg_policy->policy->cpuinfo.max_freq; >> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; > > I think it has been pointed out in another email's thread but you > should change the way the next_f is computed. util reflects the > utilization of a CPU from 0 to its max compute capacity whereas > ondemand was using the load at the current frequency during the last > time window. 
I have understood that you want to keep same formula than > ondemand as a starting point but you use a different input to > calculate the next frequency so i don't see the rational of keeping > this formula. It is a formula that causes the entire available frequency range to be utilized proportionally to the utilization as reported by the scheduler (modulo the policy->min/max limits). Its (significant IMO) advantage is that it doesn't require any additional factors that would need to be determined somehow. > Saying that, even the simple formula next_f = util > max > ? max_f : util * (max_f) / max will not work properly if the frequency > invariance is enable because the utilization becomes capped by the > current compute capacity so next_f will never be higher than current > freq (unless a task move on the rq). That was one reason of using a > threshold in sched-freq proposal (and there are on going dev to try to > solve this limitation). Well, a different formula will have to be used along with frequency invariance, then. > IIIUC, frequency invariance is not enable on your platform so you have > not seen the problem but you have probably see that selection of your > next_f was not really stable. Without frequency invariance, the > utilization will be overestimated when running at lower frequency so > the governor will probably select a frequency that is higher than > necessary but then the utilization will decrease at this higher > frequency so the governor will probably decrease the frequency and so > on until you found the right frequency that will generate the right > utilisation value I don't have any problems with that to be honest and if you aim at selecting the perfect frequency at the first attempt, then good luck with that anyway. Now, I'm not saying that the formula used in this patch cannot be improved or similar. It very well may be possible to improve it. I'm only saying that it is good enough to start with, because of the reasons mentioned above. 
Still, if you can suggest to me what other formula specifically should be used here, I'll consider using it. Which will probably mean comparing the two and seeing which one leads to better results. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-02 17:58 ` Rafael J. Wysocki @ 2016-03-02 22:49 ` Rafael J. Wysocki 2016-03-03 12:20 ` Peter Zijlstra 2016-03-03 14:01 ` Vincent Guittot 2016-03-03 13:07 ` Vincent Guittot 1 sibling, 2 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-02 22:49 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Wed, Mar 2, 2016 at 6:58 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot > <vincent.guittot@linaro.org> wrote: >> Hi Rafael, >> >> >> On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>> >>> Add a new cpufreq scaling governor, called "schedutil", that uses >>> scheduler-provided CPU utilization information as input for making >>> its decisions. >>> >>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add >>> mechanism for registering utilization update callbacks) that >>> introduced cpufreq_update_util() called by the scheduler on >>> utilization changes (from CFS) and RT/DL task status updates. >>> In particular, CPU frequency scaling decisions may be based on >>> the the utilization data passed to cpufreq_update_util() by CFS. >>> >>> The new governor is relatively simple. >>> >>> The frequency selection formula used by it is essentially the same >>> as the one used by the "ondemand" governor, although it doesn't use >>> the additional up_threshold parameter, but instead of computing the >>> load as the "non-idle CPU time" to "total CPU time" ratio, it takes >>> the utilization data provided by CFS as input. More specifically, >>> it represents "load" as the util/max ratio, where util and max >>> are the utilization and CPU capacity coming from CFS. 
>>> >> >> [snip] >> >>> + >>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, >>> + unsigned long util, unsigned long max, >>> + unsigned int next_freq) >>> +{ >>> + struct cpufreq_policy *policy = sg_policy->policy; >>> + unsigned int rel; >>> + >>> + if (next_freq > policy->max) >>> + next_freq = policy->max; >>> + else if (next_freq < policy->min) >>> + next_freq = policy->min; >>> + >>> + sg_policy->last_freq_update_time = time; >>> + if (sg_policy->next_freq == next_freq) >>> + return; >>> + >>> + sg_policy->next_freq = next_freq; >>> + /* >>> + * If utilization is less than max / 4, use RELATION_C to allow the >>> + * minimum frequency to be selected more often in case the distance from >>> + * it to the next available frequency in the table is significant. >>> + */ >>> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L; >>> + if (policy->fast_switch_possible) { >>> + cpufreq_driver_fast_switch(policy, next_freq, rel); >>> + } else { >>> + sg_policy->relation = rel; >>> + sg_policy->work_in_progress = true; >>> + irq_work_queue(&sg_policy->irq_work); >>> + } >>> +} >>> + >>> +static void sugov_update_single(struct update_util_data *data, u64 time, >>> + unsigned long util, unsigned long max) >>> +{ >>> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util); >>> + struct sugov_policy *sg_policy = sg_cpu->sg_policy; >>> + unsigned int min_f, max_f, next_f; >>> + >>> + if (!sugov_should_update_freq(sg_policy, time)) >>> + return; >>> + >>> + min_f = sg_policy->policy->cpuinfo.min_freq; >>> + max_f = sg_policy->policy->cpuinfo.max_freq; >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; >> >> I think it has been pointed out in another email's thread but you >> should change the way the next_f is computed. 
util reflects the >> utilization of a CPU from 0 to its max compute capacity whereas >> ondemand was using the load at the current frequency during the last >> time window. I have understood that you want to keep same formula than >> ondemand as a starting point but you use a different input to >> calculate the next frequency so i don't see the rational of keeping >> this formula. > > It is a formula that causes the entire available frequency range to be > utilized proportionally to the utilization as reported by the > scheduler (modulo the policy->min/max limits). Its (significant IMO) > advantage is that it doesn't require any additional factors that would > need to be determined somehow. In case a more formal derivation of this formula is needed, it is based on the following 3 assumptions: (1) Performance is a linear function of frequency. (2) Required performance is a linear function of the utilization ratio x = util/max as provided by the scheduler (0 <= x <= 1). (3) The minimum possible frequency (min_freq) corresponds to x = 0 and the maximum possible frequency (max_freq) corresponds to x = 1. (1) and (2) combined imply that f = a * x + b (f - frequency, a, b - constants to be determined) and then (3) quite trivially leads to b = min_freq and a = max_freq - min_freq. Now, of course, the linearity assumptions may be questioned, but then it's just the first approximation. If you go any further, though, you end up with an expansion series like this: f(x) = c_0 + c_1 * x + c_2 * x^2 + c_3 * x^3 + ... where all of the c_j need to be determined in principle. With luck, if you can guess what kind of a function f(x) may be, it may be possible to reduce the number of coefficients to determine, but question is whether or not that is going to work universally for all systems. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
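Written out, the three assumptions above pin the line down uniquely; this is only the derivation from the message restated in equation form:

```latex
\begin{align*}
  f(x) &= a\,x + b, \qquad x = \frac{util}{max}, \quad 0 \le x \le 1 \\
  f(0) &= min\_freq \;\Rightarrow\; b = min\_freq \\
  f(1) &= max\_freq \;\Rightarrow\; a = max\_freq - min\_freq \\
  f(x) &= min\_freq + \frac{util}{max}\,\bigl(max\_freq - min\_freq\bigr)
\end{align*}
```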
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-02 22:49 ` Rafael J. Wysocki @ 2016-03-03 12:20 ` Peter Zijlstra 2016-03-03 12:32 ` Juri Lelli 2016-03-03 16:24 ` Rafael J. Wysocki 2016-03-03 14:01 ` Vincent Guittot 1 sibling, 2 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 12:20 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Vincent Guittot, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote: > >>> + min_f = sg_policy->policy->cpuinfo.min_freq; > >>> + max_f = sg_policy->policy->cpuinfo.max_freq; > >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; > In case a more formal derivation of this formula is needed, it is > based on the following 3 assumptions: > > (1) Performance is a linear function of frequency. > (2) Required performance is a linear function of the utilization ratio > x = util/max as provided by the scheduler (0 <= x <= 1). > (3) The minimum possible frequency (min_freq) corresponds to x = 0 and > the maximum possible frequency (max_freq) corresponds to x = 1. > > (1) and (2) combined imply that > > f = a * x + b > > (f - frequency, a, b - constants to be determined) and then (3) quite > trivially leads to b = min_freq and a = max_freq - min_freq. 3 is the problem, that just doesn't make sense and is probably the reason why you see very little selection of the min freq. Suppose a machine with the following frequencies: 500, 750, 1000 And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) = 700 make any sense? Per your point 1, it should be asking for 0.4 * 1000 = 400. Because, per 1, at 500 it runs exactly half as fast as at 1000, and we only need 0.4 times as much. Therefore 500 is more than sufficient. Note.
we all know that 1 is a 'broken' assumption, but lacking anything better I think it's a reasonable one to make. ^ permalink raw reply [flat|nested] 158+ messages in thread
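Peter's numbers are easy to check mechanically. The sketch below models both mappings with the integer arithmetic used in the patch (utilization expressed on the usual 0..1024 capacity scale, frequencies in MHz; an illustrative user-space model, not the kernel code):

```c
#include <assert.h>

/* Rafael's mapping: f = min_f + x * (max_f - min_f), with x = util/max */
static unsigned int map_range(unsigned long util, unsigned long max,
			      unsigned int min_f, unsigned int max_f)
{
	return min_f + (unsigned int)(util * (max_f - min_f) / max);
}

/* Peter's proposal: f = x * max_f, so the request scales from zero */
static unsigned int map_zero(unsigned long util, unsigned long max,
			     unsigned int max_f)
{
	return (unsigned int)(util * max_f / max);
}
```

For util = 410 (about 0.4 * 1024), map_range() requests 700 MHz, which RELATION_L rounds up to the 750 MHz OPP, while map_zero() requests 400 MHz, already satisfied by the 500 MHz step.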
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 12:20 ` Peter Zijlstra @ 2016-03-03 12:32 ` Juri Lelli 2016-03-03 16:24 ` Rafael J. Wysocki 1 sibling, 0 replies; 158+ messages in thread From: Juri Lelli @ 2016-03-03 12:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Vincent Guittot, Rafael J. Wysocki, Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On 03/03/16 13:20, Peter Zijlstra wrote: > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote: > > >>> + min_f = sg_policy->policy->cpuinfo.min_freq; > > >>> + max_f = sg_policy->policy->cpuinfo.max_freq; > > >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; > > > In case a more formal derivation of this formula is needed, it is > > based on the following 3 assumptions: > > > > (1) Performance is a linear function of frequency. > > (2) Required performance is a linear function of the utilization ratio > > x = util/max as provided by the scheduler (0 <= x <= 1). > > > (3) The minimum possible frequency (min_freq) corresponds to x = 0 and > > the maximum possible frequency (max_freq) corresponds to x = 1. > > > > (1) and (2) combined imply that > > > > f = a * x + b > > > > (f - frequency, a, b - constants to be determined) and then (3) quite > > trivially leads to b = min_freq and a = max_freq - min_freq. > > 3 is the problem, that just doesn't make sense and is probably the > reason why you see very little selection of the min freq. > > Suppose a machine with the following frequencies: > > 500, 750, 1000 > > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) = > 700 make any sense? Per your point 1, it should should be asking for > 0.4 * 1000 = 400. > > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we > only need 0.4 times as much. Therefore 500 is more than sufficient. 
> Oh, and that is probably also why the governor can reach max OPP with freq invariance enabled (the point Vincent was making). When we run at 500 the util signal is capped at that capacity, but the formula makes us requesting more, so we can jump to the next step and so on. ^ permalink raw reply [flat|nested] 158+ messages in thread
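The ramp Juri describes can be reproduced with a toy model: three OPPs with capacity proportional to frequency, a CPU that is 100% busy, and utilization capped at the capacity of the current OPP the way frequency invariance would cap it (all values illustrative):

```c
#include <assert.h>

static const unsigned int opp[] = { 500, 750, 1000 }; /* MHz */
#define NR_OPP 3
#define MIN_F  500
#define MAX_F  1000

/* Round the raw request up to the next available OPP (RELATION_L). */
static unsigned int relation_l(unsigned int raw)
{
	int i;

	for (i = 0; i < NR_OPP; i++)
		if (opp[i] >= raw)
			return opp[i];
	return opp[NR_OPP - 1];
}

/*
 * One governor evaluation for a 100%-busy CPU under frequency
 * invariance: util is capped at the capacity of the current OPP
 * (capacity == frequency here), yet the min_f offset pushes the
 * request above the current frequency.
 */
static unsigned int step(unsigned int cur)
{
	unsigned long util = cur; /* capped, fully busy */
	unsigned int raw = MIN_F + (unsigned int)(util * (MAX_F - MIN_F) / MAX_F);

	return relation_l(raw);
}
```

Starting from 500 MHz the requests climb 500 → 750 → 1000 even though the reported utilization never exceeds the current capacity, i.e. the min_f offset alone is enough to escalate to the top OPP.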
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 12:20 ` Peter Zijlstra 2016-03-03 12:32 ` Juri Lelli @ 2016-03-03 16:24 ` Rafael J. Wysocki 2016-03-03 16:37 ` Peter Zijlstra 1 sibling, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-03 16:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Vincent Guittot, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote: > >> >>> + min_f = sg_policy->policy->cpuinfo.min_freq; >> >>> + max_f = sg_policy->policy->cpuinfo.max_freq; >> >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; > >> In case a more formal derivation of this formula is needed, it is >> based on the following 3 assumptions: >> >> (1) Performance is a linear function of frequency. >> (2) Required performance is a linear function of the utilization ratio >> x = util/max as provided by the scheduler (0 <= x <= 1). > >> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and >> the maximum possible frequency (max_freq) corresponds to x = 1. >> >> (1) and (2) combined imply that >> >> f = a * x + b >> >> (f - frequency, a, b - constants to be determined) and then (3) quite >> trivially leads to b = min_freq and a = max_freq - min_freq. > > 3 is the problem, that just doesn't make sense and is probably the > reason why you see very little selection of the min freq. It is about mapping the entire [0,1] interval to the available frequency range. It will overprovision things (the smaller x the more), but then it may help the race-to-idle a bit in theory.
> Suppose a machine with the following frequencies: > > 500, 750, 1000 > > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) = > 700 make any sense? Per your point 1, it should should be asking for > 0.4 * 1000 = 400. > > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we > only need 0.4 times as much. Therefore 500 is more than sufficient. OK, but then I don't see why this reasoning only applies to the lower bound of the frequency range. Is there any reason why x = 1 should be the only point mapping to max_freq? If not, then I think it's reasonable to map the middle of the available frequency range to x = 0.5 and then we have b = 0 and a = (max_freq + min_freq) / 2. I'll try that and see how it goes. > Note. we all know that 1 is a 'broken' assumption, but lacking anything > better I think its a reasonable one to make. Right. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 16:24 ` Rafael J. Wysocki @ 2016-03-03 16:37 ` Peter Zijlstra 2016-03-03 16:47 ` Peter Zijlstra 2016-03-03 16:55 ` Juri Lelli 0 siblings, 2 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 16:37 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Vincent Guittot, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote: > On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote: > >> >>> + min_f = sg_policy->policy->cpuinfo.min_freq; > >> >>> + max_f = sg_policy->policy->cpuinfo.max_freq; > >> >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; > > > >> In case a more formal derivation of this formula is needed, it is > >> based on the following 3 assumptions: > >> > >> (1) Performance is a linear function of frequency. > >> (2) Required performance is a linear function of the utilization ratio > >> x = util/max as provided by the scheduler (0 <= x <= 1). > > > >> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and > >> the maximum possible frequency (max_freq) corresponds to x = 1. > >> > >> (1) and (2) combined imply that > >> > >> f = a * x + b > >> > >> (f - frequency, a, b - constants to be determined) and then (3) quite > >> trivially leads to b = min_freq and a = max_freq - min_freq. > > > > 3 is the problem, that just doesn't make sense and is probably the > > reason why you see very little selection of the min freq. > > It is about mapping the entire [0,1] interval to the available frequency range. Yeah, but I don't see why that makes sense.. 
> It will overprovision things (the smaller x the more), but then it may > help the race-to-idle a bit in theory. So, since we also have the cpuidle information, could we not make a better guess at race-to-idle? > > Suppose a machine with the following frequencies: > > > > 500, 750, 1000 > > > > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) = > > 700 make any sense? Per your point 1, it should be asking for > > 0.4 * 1000 = 400. > > > > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we > > only need 0.4 times as much. Therefore 500 is more than sufficient. > > OK, but then I don't see why this reasoning only applies to the lower > bound of the frequency range. Is there any reason why x = 1 should be > the only point mapping to max_freq? Well, everything that goes over the second to last freq would end up at the last (max) freq. Take again the 500,750,1000 example, everything that's >750 would end up at 1000 (for relation_l, >875 for _c). But given the platform's cpuidle information, maybe coupled with an avg idle est, we can compute the benefit of race-to-idle and overprovision based on that, right? > If not, then I think it's reasonable to map the middle of the > available frequency range to x = 0.5 and then we have b = 0 and a = > (max_freq + min_freq) / 2. So I really think that approach falls apart on the low util bits, you effectively always run above min speed, even if min is already vastly overprovisioned. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 16:37 ` Peter Zijlstra @ 2016-03-03 16:47 ` Peter Zijlstra 2016-03-04 1:14 ` Rafael J. Wysocki 2016-03-03 16:55 ` Juri Lelli 1 sibling, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 16:47 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Vincent Guittot, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Thu, Mar 03, 2016 at 05:37:35PM +0100, Peter Zijlstra wrote: > On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote: > > >> f = a * x + b > > If not, then I think it's reasonable to map the middle of the > > available frequency range to x = 0.5 and then we have b = 0 and a = > > (max_freq + min_freq) / 2. > > So I really think that approach falls apart on the low util bits, you > effectively always run above min speed, even if min is already vstly > over provisioned. Ah nevermind, I cannot read. Yes that is worth trying I suppose. But the b=0,a=1 thing seems more natural still. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 16:47 ` Peter Zijlstra @ 2016-03-04 1:14 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 1:14 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Vincent Guittot, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Thu, Mar 3, 2016 at 5:47 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Mar 03, 2016 at 05:37:35PM +0100, Peter Zijlstra wrote: >> On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote: >> > >> f = a * x + b > >> > If not, then I think it's reasonable to map the middle of the >> > available frequency range to x = 0.5 and then we have b = 0 and a = >> > (max_freq + min_freq) / 2. That actually should be a = max_freq + min_freq, because I want (max_freq + min_freq) / 2 = a / 2. >> So I really think that approach falls apart on the low util bits, you >> effectively always run above min speed, even if min is already vstly >> over provisioned. > > Ah nevermind, I cannot read. Yes that is worth trying I suppose. But the > b=0,a=1 thing seems more natural still. It is somewhat imbalanced, though. If all of the values of x are equally probable (or equally frequent), the probability of running above the middle frequency is lower than the probability of running below it. ^ permalink raw reply [flat|nested] 158+ messages in thread
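With the correction above, the alternative mapping becomes f = (max_freq + min_freq) * x with no constant term, clamped to the policy limits afterwards as sugov_update_commit() already does. A quick numeric sketch (illustrative values, user-space model, clamping omitted):

```c
#include <assert.h>

/*
 * Alternative mapping: f = (max_f + min_f) * util / max, so that
 * x = 0.5 lands exactly on the midpoint of the frequency range.
 * The result would still be clamped to [policy->min, policy->max].
 */
static unsigned int map_mid(unsigned long util, unsigned long max,
			    unsigned int min_f, unsigned int max_f)
{
	return (unsigned int)((unsigned long)(max_f + min_f) * util / max);
}
```

For min_f = 500 and max_f = 1000 (slope 1500): x = 0.5 lands exactly on the 750 midpoint, anything below x = 1/3 is clamped up to min_f, and anything above x = 2/3 already requests max_f; the imbalance Rafael mentions is removed at the cost of saturating the outer thirds of the range.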
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 16:37 ` Peter Zijlstra 2016-03-03 16:47 ` Peter Zijlstra @ 2016-03-03 16:55 ` Juri Lelli 2016-03-03 16:56 ` Peter Zijlstra 1 sibling, 1 reply; 158+ messages in thread From: Juri Lelli @ 2016-03-03 16:55 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Vincent Guittot, Rafael J. Wysocki, Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On 03/03/16 17:37, Peter Zijlstra wrote: > On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote: > > On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote: > > >> >>> + min_f = sg_policy->policy->cpuinfo.min_freq; > > >> >>> + max_f = sg_policy->policy->cpuinfo.max_freq; > > >> >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; > > > > > >> In case a more formal derivation of this formula is needed, it is > > >> based on the following 3 assumptions: > > >> > > >> (1) Performance is a linear function of frequency. > > >> (2) Required performance is a linear function of the utilization ratio > > >> x = util/max as provided by the scheduler (0 <= x <= 1). > > > > > >> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and > > >> the maximum possible frequency (max_freq) corresponds to x = 1. > > >> > > >> (1) and (2) combined imply that > > >> > > >> f = a * x + b > > >> > > >> (f - frequency, a, b - constants to be determined) and then (3) quite > > >> trivially leads to b = min_freq and a = max_freq - min_freq. > > > > > > 3 is the problem, that just doesn't make sense and is probably the > > > reason why you see very little selection of the min freq. > > > > It is about mapping the entire [0,1] interval to the available frequency range. > > Yeah, but I don't see why that makes sense.. 
> > > I till overprovision things (the smaller x the more), but then it may > > help the race-to-idle a bit in theory. > > So, since we also have the cpuidle information, could we not make a > better guess at race-to-idle? > > > > Suppose a machine with the following frequencies: > > > > > > 500, 750, 1000 > > > > > > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) = > > > 700 make any sense? Per your point 1, it should should be asking for > > > 0.4 * 1000 = 400. > > > > > > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we > > > only need 0.4 times as much. Therefore 500 is more than sufficient. > > > > OK, but then I don't see why this reasoning only applies to the lower > > bound of the frequency range. Is there any reason why x = 1 should be > > the only point mapping to max_freq? > > Well, everything that goes over the second to last freq would end up at > the last (max) freq. > > Take again the 500,750,1000 example, everything that's >750 would end up > at 1000 (for relation_l, >875 for _c). > > But given the platform's cpuidle information, maybe coupled with an avg > idle est, we can compute the benefit of race-to-idle and over provision > based on that, right? > Shouldn't this kind of considerations be a scheduler thing? I'm not really getting why we want to put more "intelligence" in a new governor. Also, if I understand Ingo's point correctly, I think we want to make this kind of policy decisions inside the scheduler. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 16:55 ` Juri Lelli @ 2016-03-03 16:56 ` Peter Zijlstra 2016-03-03 17:14 ` Juri Lelli 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 16:56 UTC (permalink / raw) To: Juri Lelli Cc: Rafael J. Wysocki, Vincent Guittot, Rafael J. Wysocki, Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Thu, Mar 03, 2016 at 04:55:44PM +0000, Juri Lelli wrote: > On 03/03/16 17:37, Peter Zijlstra wrote: > > But given the platform's cpuidle information, maybe coupled with an avg > > idle est, we can compute the benefit of race-to-idle and over provision > > based on that, right? > > > > Shouldn't this kind of considerations be a scheduler thing? I'm not > really getting why we want to put more "intelligence" in a new governor. > Also, if I understand Ingo's point correctly, I think we want to make > this kind of policy decisions inside the scheduler. Well sure, put it in kernel/sched/cpufreq.c or wherever. My point was more that we don't have to guess/hardcode race-to-idle assumptions but can actually calculate some of that. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 16:56 ` Peter Zijlstra @ 2016-03-03 17:14 ` Juri Lelli 0 siblings, 0 replies; 158+ messages in thread From: Juri Lelli @ 2016-03-03 17:14 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Vincent Guittot, Rafael J. Wysocki, Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On 03/03/16 17:56, Peter Zijlstra wrote: > On Thu, Mar 03, 2016 at 04:55:44PM +0000, Juri Lelli wrote: > > On 03/03/16 17:37, Peter Zijlstra wrote: > > > But given the platform's cpuidle information, maybe coupled with an avg > > > idle est, we can compute the benefit of race-to-idle and over provision > > > based on that, right? > > > > > > > Shouldn't this kind of considerations be a scheduler thing? I'm not > > really getting why we want to put more "intelligence" in a new governor. > > Also, if I understand Ingo's point correctly, I think we want to make > > this kind of policy decisions inside the scheduler. > > Well sure, put it in kernel/sched/cpufreq.c or wherever. My point was > more that we don't have to guess/hardcode race-to-idle assumptions but > can actually calculate some of that. > Right, thanks for clarifying! ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-02 22:49 ` Rafael J. Wysocki 2016-03-03 12:20 ` Peter Zijlstra @ 2016-03-03 14:01 ` Vincent Guittot 2016-03-03 15:38 ` Peter Zijlstra 1 sibling, 1 reply; 158+ messages in thread From: Vincent Guittot @ 2016-03-03 14:01 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On 2 March 2016 at 23:49, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Wed, Mar 2, 2016 at 6:58 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: >> On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot >> <vincent.guittot@linaro.org> wrote: >>> Hi Rafael, >>> >>> >>> On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>>> >>>> Add a new cpufreq scaling governor, called "schedutil", that uses >>>> scheduler-provided CPU utilization information as input for making >>>> its decisions. >>>> >>>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add >>>> mechanism for registering utilization update callbacks) that >>>> introduced cpufreq_update_util() called by the scheduler on >>>> utilization changes (from CFS) and RT/DL task status updates. >>>> In particular, CPU frequency scaling decisions may be based on >>>> the the utilization data passed to cpufreq_update_util() by CFS. >>>> >>>> The new governor is relatively simple. >>>> >>>> The frequency selection formula used by it is essentially the same >>>> as the one used by the "ondemand" governor, although it doesn't use >>>> the additional up_threshold parameter, but instead of computing the >>>> load as the "non-idle CPU time" to "total CPU time" ratio, it takes >>>> the utilization data provided by CFS as input. 
More specifically,
>>>> it represents "load" as the util/max ratio, where util and max
>>>> are the utilization and CPU capacity coming from CFS.
>>>>
>>>
>>> [snip]
>>>
>>>> +
>>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>>> +				unsigned long util, unsigned long max,
>>>> +				unsigned int next_freq)
>>>> +{
>>>> +	struct cpufreq_policy *policy = sg_policy->policy;
>>>> +	unsigned int rel;
>>>> +
>>>> +	if (next_freq > policy->max)
>>>> +		next_freq = policy->max;
>>>> +	else if (next_freq < policy->min)
>>>> +		next_freq = policy->min;
>>>> +
>>>> +	sg_policy->last_freq_update_time = time;
>>>> +	if (sg_policy->next_freq == next_freq)
>>>> +		return;
>>>> +
>>>> +	sg_policy->next_freq = next_freq;
>>>> +	/*
>>>> +	 * If utilization is less than max / 4, use RELATION_C to allow the
>>>> +	 * minimum frequency to be selected more often in case the distance from
>>>> +	 * it to the next available frequency in the table is significant.
>>>> +	 */
>>>> +	rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
>>>> +	if (policy->fast_switch_possible) {
>>>> +		cpufreq_driver_fast_switch(policy, next_freq, rel);
>>>> +	} else {
>>>> +		sg_policy->relation = rel;
>>>> +		sg_policy->work_in_progress = true;
>>>> +		irq_work_queue(&sg_policy->irq_work);
>>>> +	}
>>>> +}
>>>> +
>>>> +static void sugov_update_single(struct update_util_data *data, u64 time,
>>>> +				unsigned long util, unsigned long max)
>>>> +{
>>>> +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
>>>> +	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>>>> +	unsigned int min_f, max_f, next_f;
>>>> +
>>>> +	if (!sugov_should_update_freq(sg_policy, time))
>>>> +		return;
>>>> +
>>>> +	min_f = sg_policy->policy->cpuinfo.min_freq;
>>>> +	max_f = sg_policy->policy->cpuinfo.max_freq;
>>>> +	next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>>>
>>> I think it has been pointed out in another email's thread but you
>>> should change the way the next_f is computed. util reflects the
>>> utilization of a CPU from 0 to its max compute capacity whereas
>>> ondemand was using the load at the current frequency during the last
>>> time window. I have understood that you want to keep same formula than
>>> ondemand as a starting point but you use a different input to
>>> calculate the next frequency so i don't see the rational of keeping
>>> this formula.
>>
>> It is a formula that causes the entire available frequency range to be
>> utilized proportionally to the utilization as reported by the
>> scheduler (modulo the policy->min/max limits).  Its (significant IMO)
>> advantage is that it doesn't require any additional factors that would
>> need to be determined somehow.
>
> In case a more formal derivation of this formula is needed, it is
> based on the following 3 assumptions:
>
> (1) Performance is a linear function of frequency.
> (2) Required performance is a linear function of the utilization ratio
> x = util/max as provided by the scheduler (0 <= x <= 1).

Just to mention that the utilization that you are using varies with
the frequency, which adds another variable to your equation.

> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> the maximum possible frequency (max_freq) corresponds to x = 1.
>
> (1) and (2) combined imply that
>
>     f = a * x + b
>
> (f - frequency, a, b - constants to be determined) and then (3) quite
> trivially leads to b = min_freq and a = max_freq - min_freq.
>
> Now, of course, the linearity assumptions may be questioned, but then
> it's just the first approximation.  If you go any further, though, you
> end up with an expansion series like this:
>
>     f(x) = c_0 + c_1 * x + c_2 * x^2 + c_3 * x^3 + ...
>
> where all of the c_j need to be determined in principle.  With luck,
> if you can guess what kind of a function f(x) may be, it may be
> possible to reduce the number of coefficients to determine, but
> question is whether or not that is going to work universally for all
> systems.
>
> Thanks,
> Rafael

^ permalink raw reply	[flat|nested] 158+ messages in thread
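As a self-contained illustration of the linear mapping derived in the message above: this is a sketch, not the kernel code; the function name `pick_next_freq` and the 500 MHz / 3 GHz example frequencies used in the comments are invented.

```c
#include <assert.h>

/*
 * Sketch of the linear frequency selection derived above:
 * f(x) = min_f + x * (max_f - min_f) with x = util / max,
 * saturating at max_f when util exceeds max.  Frequencies are in kHz;
 * the concrete numbers are hypothetical.
 */
static unsigned int pick_next_freq(unsigned long util, unsigned long max,
                                   unsigned int min_f, unsigned int max_f)
{
        if (util > max)
                return max_f;
        return min_f + (unsigned int)(util * (max_f - min_f) / max);
}
```

For example, with min_f = 500000 kHz and max_f = 3000000 kHz, util = max/2 maps to the midpoint of the range, 1750000 kHz, matching b = min_freq and a = max_freq - min_freq in the derivation.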
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 14:01 ` Vincent Guittot @ 2016-03-03 15:38 ` Peter Zijlstra 2016-03-03 16:28 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 15:38 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Thu, Mar 03, 2016 at 03:01:15PM +0100, Vincent Guittot wrote: > > In case a more formal derivation of this formula is needed, it is > > based on the following 3 assumptions: > > > > (1) Performance is a linear function of frequency. > > (2) Required performance is a linear function of the utilization ratio > > x = util/max as provided by the scheduler (0 <= x <= 1). > > Just to mention that the utilization that you are using, varies with > the frequency which add another variable in your equation Right, x86 hasn't implemented arch_scale_freq_capacity(), so the utilization values we use are all over the map. If we lower freq, the util will go up, which would result in us bumping the freq again, etc.. ^ permalink raw reply [flat|nested] 158+ messages in thread
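The feedback loop Peter describes can be shown with a toy model (all numbers invented; `observed_util` is not a kernel function): without arch_scale_freq_capacity(), a fixed amount of work per period appears busier at lower clock speeds, roughly by the factor max_f / cur_f, so lowering the frequency inflates the utilization the governor sees and prompts it to raise the frequency again.

```c
#include <assert.h>

/*
 * Toy model of utilization without frequency invariance: a workload
 * with frequency-invariant demand `invariant_util` appears busier when
 * the CPU runs below max_f, by roughly max_f / cur_f.
 */
static unsigned long observed_util(unsigned long invariant_util,
                                   unsigned long cur_f, unsigned long max_f)
{
        return invariant_util * max_f / cur_f;
}
```

Halving the frequency in this model doubles the observed utilization, which is exactly the "lower freq, util goes up, bump freq again" cycle described above.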
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 15:38 ` Peter Zijlstra @ 2016-03-03 16:28 ` Peter Zijlstra 2016-03-03 16:42 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 16:28 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Thu, Mar 03, 2016 at 04:38:17PM +0100, Peter Zijlstra wrote: > On Thu, Mar 03, 2016 at 03:01:15PM +0100, Vincent Guittot wrote: > > > In case a more formal derivation of this formula is needed, it is > > > based on the following 3 assumptions: > > > > > > (1) Performance is a linear function of frequency. > > > (2) Required performance is a linear function of the utilization ratio > > > x = util/max as provided by the scheduler (0 <= x <= 1). > > > > Just to mention that the utilization that you are using, varies with > > the frequency which add another variable in your equation > > Right, x86 hasn't implemented arch_scale_freq_capacity(), so the > utilization values we use are all over the map. If we lower freq, the > util will go up, which would result in us bumping the freq again, etc.. Something like the completely untested below should maybe work. Rafael? 
---
 arch/x86/include/asm/topology.h | 19 +++++++++++++++++++
 arch/x86/kernel/smpboot.c       | 24 ++++++++++++++++++++++++
 kernel/sched/core.c             |  1 +
 kernel/sched/sched.h            |  7 +++++++
 4 files changed, 51 insertions(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 7f991bd5031b..af7b7259db94 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -146,4 +146,23 @@ struct pci_bus;
 int x86_pci_root_bus_node(int bus);
 void x86_pci_root_bus_resources(int bus, struct list_head *resources);

+#ifdef CONFIG_SMP
+
+#define arch_scale_freq_tick arch_scale_freq_tick
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static inline arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+	if (static_cpu_has(X86_FEATURE_APERFMPERF))
+		return per_cpu(arch_cpu_freq, cpu);
+	else
+		return SCHED_CAPACITY_SCALE;
+}
+
+extern void arch_scale_freq_tick(void);
+
+#endif
+
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 3bf1e0b5f827..7d459577ee44 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1647,3 +1647,27 @@ void native_play_dead(void)
 }

 #endif
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
+
+void arch_scale_freq_tick(void)
+{
+	u64 aperf, mperf;
+	u64 acnt, mcnt;
+
+	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
+		return;
+
+	aperf = rdmsrl(MSR_IA32_APERF);
+	mperf = rdmsrl(MSR_IA32_APERF);
+
+	acnt = aperf - this_cpu_read(arch_prev_aperf);
+	mcnt = mperf - this_cpu_read(arch_prev_mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+
+	this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 96e323b26ea9..35dbf909afb2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2901,6 +2901,7 @@ void scheduler_tick(void)
 	struct rq *rq = cpu_rq(cpu);
 	struct task_struct *curr = rq->curr;

+	arch_scale_freq_tick();
 	sched_clock_tick();

 	raw_spin_lock(&rq->lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index baa32075f98e..c3825c920e3f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1408,6 +1408,13 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 }
 #endif

+#ifndef arch_scale_freq_tick
+static __always_inline
+void arch_scale_freq_tick(void)
+{
+}
+#endif
+
 #ifndef arch_scale_cpu_capacity
 static __always_inline
 unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)

^ permalink raw reply related	[flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 16:28 ` Peter Zijlstra @ 2016-03-03 16:42 ` Peter Zijlstra 2016-03-03 17:28 ` Dietmar Eggemann 2016-03-03 18:58 ` Rafael J. Wysocki 2 siblings, 0 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 16:42 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Thu, Mar 03, 2016 at 05:28:29PM +0100, Peter Zijlstra wrote:

> +void arch_scale_freq_tick(void)
> +{
> +	u64 aperf, mperf;
> +	u64 acnt, mcnt;
> +
> +	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
> +		return;
> +
> +	aperf = rdmsrl(MSR_IA32_APERF);
> +	mperf = rdmsrl(MSR_IA32_APERF);

Actually reading MPERF increases the chances of this working.

> +
> +	acnt = aperf - this_cpu_read(arch_prev_aperf);
> +	mcnt = mperf - this_cpu_read(arch_prev_mperf);
> +
> +	this_cpu_write(arch_prev_aperf, aperf);
> +	this_cpu_write(arch_prev_mperf, mperf);
> +
> +	this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
> +}

^ permalink raw reply	[flat|nested] 158+ messages in thread
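With the MPERF read corrected, the tick handler boils down to a single ratio. A stand-alone sketch of that arithmetic (the counter values in the tests are invented; on real hardware the deltas come from the MSR_IA32_APERF and MSR_IA32_MPERF MSRs, which per the SDM advance only in C0):

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/*
 * Ratio of the APERF delta to the MPERF delta, scaled to
 * SCHED_CAPACITY_SCALE.  APERF advances at the actual frequency and
 * MPERF at a fixed reference frequency, so the result tracks the
 * average running frequency over the sampling interval.
 */
static unsigned long freq_scale(uint64_t aperf, uint64_t prev_aperf,
                                uint64_t mperf, uint64_t prev_mperf)
{
        uint64_t acnt = aperf - prev_aperf;
        uint64_t mcnt = mperf - prev_mperf;

        return (unsigned long)(acnt * SCHED_CAPACITY_SCALE / mcnt);
}
```

Equal deltas yield SCHED_CAPACITY_SCALE (running at the reference frequency); a larger APERF delta (turbo) yields a value above it, a smaller one (throttled) below.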
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 16:28 ` Peter Zijlstra 2016-03-03 16:42 ` Peter Zijlstra @ 2016-03-03 17:28 ` Dietmar Eggemann 2016-03-03 18:26 ` Peter Zijlstra 2016-03-03 18:58 ` Rafael J. Wysocki 2 siblings, 1 reply; 158+ messages in thread From: Dietmar Eggemann @ 2016-03-03 17:28 UTC (permalink / raw) To: Peter Zijlstra, Vincent Guittot Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On 03/03/16 16:28, Peter Zijlstra wrote: > On Thu, Mar 03, 2016 at 04:38:17PM +0100, Peter Zijlstra wrote: >> On Thu, Mar 03, 2016 at 03:01:15PM +0100, Vincent Guittot wrote: >>>> In case a more formal derivation of this formula is needed, it is >>>> based on the following 3 assumptions: >>>> >>>> (1) Performance is a linear function of frequency. >>>> (2) Required performance is a linear function of the utilization ratio >>>> x = util/max as provided by the scheduler (0 <= x <= 1). >>> >>> Just to mention that the utilization that you are using, varies with >>> the frequency which add another variable in your equation >> >> Right, x86 hasn't implemented arch_scale_freq_capacity(), so the >> utilization values we use are all over the map. If we lower freq, the >> util will go up, which would result in us bumping the freq again, etc.. > > Something like the completely untested below should maybe work. > > Rafael? > [...] 
> +void arch_scale_freq_tick(void)
> +{
> +	u64 aperf, mperf;
> +	u64 acnt, mcnt;
> +
> +	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
> +		return;
> +
> +	aperf = rdmsrl(MSR_IA32_APERF);
> +	mperf = rdmsrl(MSR_IA32_APERF);
> +
> +	acnt = aperf - this_cpu_read(arch_prev_aperf);
> +	mcnt = mperf - this_cpu_read(arch_prev_mperf);
> +
> +	this_cpu_write(arch_prev_aperf, aperf);
> +	this_cpu_write(arch_prev_mperf, mperf);
> +
> +	this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));

Wasn't there the problem that this ratio goes to zero if the cpu is idle
in the old power estimation approach on x86?

[...]

^ permalink raw reply	[flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 17:28 ` Dietmar Eggemann @ 2016-03-03 18:26 ` Peter Zijlstra 2016-03-03 19:14 ` Dietmar Eggemann 2016-03-08 13:09 ` Peter Zijlstra 0 siblings, 2 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-03 18:26 UTC (permalink / raw) To: Dietmar Eggemann Cc: Vincent Guittot, Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Thu, Mar 03, 2016 at 05:28:55PM +0000, Dietmar Eggemann wrote:

> > +void arch_scale_freq_tick(void)
> > +{
> > +	u64 aperf, mperf;
> > +	u64 acnt, mcnt;
> > +
> > +	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
> > +		return;
> > +
> > +	aperf = rdmsrl(MSR_IA32_APERF);
> > +	mperf = rdmsrl(MSR_IA32_APERF);
> > +
> > +	acnt = aperf - this_cpu_read(arch_prev_aperf);
> > +	mcnt = mperf - this_cpu_read(arch_prev_mperf);
> > +
> > +	this_cpu_write(arch_prev_aperf, aperf);
> > +	this_cpu_write(arch_prev_mperf, mperf);
> > +
> > +	this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
>
> Wasn't there the problem that this ratio goes to zero if the cpu is idle
> in the old power estimation approach on x86?

Yeah, there was something funky.

SDM says they only count in C0 (ie. !idle), so it _should_ work.

^ permalink raw reply	[flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 18:26 ` Peter Zijlstra @ 2016-03-03 19:14 ` Dietmar Eggemann 0 siblings, 0 replies; 158+ messages in thread From: Dietmar Eggemann @ 2016-03-03 19:14 UTC (permalink / raw) To: Peter Zijlstra Cc: Vincent Guittot, Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On 03/03/16 18:26, Peter Zijlstra wrote: > On Thu, Mar 03, 2016 at 05:28:55PM +0000, Dietmar Eggemann wrote:

>>> +void arch_scale_freq_tick(void)
>>> +{
>>> +	u64 aperf, mperf;
>>> +	u64 acnt, mcnt;
>>> +
>>> +	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
>>> +		return;
>>> +
>>> +	aperf = rdmsrl(MSR_IA32_APERF);
>>> +	mperf = rdmsrl(MSR_IA32_APERF);
>>> +
>>> +	acnt = aperf - this_cpu_read(arch_prev_aperf);
>>> +	mcnt = mperf - this_cpu_read(arch_prev_mperf);
>>> +
>>> +	this_cpu_write(arch_prev_aperf, aperf);
>>> +	this_cpu_write(arch_prev_mperf, mperf);
>>> +
>>> +	this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
>>
>> Wasn't there the problem that this ratio goes to zero if the cpu is idle
>> in the old power estimation approach on x86?
>
> Yeah, there was something funky.
>
> SDM says they only count in C0 (ie. !idle), so it _should_ work.

I see, back then the problem was 0 capacity in idle but this is about frequency.

^ permalink raw reply	[flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 18:26 ` Peter Zijlstra 2016-03-03 19:14 ` Dietmar Eggemann @ 2016-03-08 13:09 ` Peter Zijlstra 1 sibling, 0 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-08 13:09 UTC (permalink / raw) To: Dietmar Eggemann Cc: Vincent Guittot, Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On Thu, Mar 03, 2016 at 07:26:24PM +0100, Peter Zijlstra wrote: > On Thu, Mar 03, 2016 at 05:28:55PM +0000, Dietmar Eggemann wrote: > > Wasn't there the problem that this ratio goes to zero if the cpu is idle > > in the old power estimation approach on x86? > > Yeah, there was something funky. So it might have been that when we're nearly idle the hardware runs at low frequency, which under that old code would have resulted in lowering the capacity of that cpu. Which in turn would have resulted in the scheduler moving the little work it had away and it being even more idle. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 16:28 ` Peter Zijlstra 2016-03-03 16:42 ` Peter Zijlstra 2016-03-03 17:28 ` Dietmar Eggemann @ 2016-03-03 18:58 ` Rafael J. Wysocki 2 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-03 18:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Vincent Guittot, Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Thu, Mar 3, 2016 at 5:28 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Mar 03, 2016 at 04:38:17PM +0100, Peter Zijlstra wrote: >> On Thu, Mar 03, 2016 at 03:01:15PM +0100, Vincent Guittot wrote: >> > > In case a more formal derivation of this formula is needed, it is >> > > based on the following 3 assumptions: >> > > >> > > (1) Performance is a linear function of frequency. >> > > (2) Required performance is a linear function of the utilization ratio >> > > x = util/max as provided by the scheduler (0 <= x <= 1). >> > >> > Just to mention that the utilization that you are using, varies with >> > the frequency which add another variable in your equation >> >> Right, x86 hasn't implemented arch_scale_freq_capacity(), so the >> utilization values we use are all over the map. If we lower freq, the >> util will go up, which would result in us bumping the freq again, etc.. > > Something like the completely untested below should maybe work. > > Rafael? It looks reasonable (modulo the MPERF reading typo you've noticed), but can we get back to that later? I'll first try to address the Ingo's feedback (which I hope I understood correctly) and some other comments people had and resend the series. 
> ---
>  arch/x86/include/asm/topology.h | 19 +++++++++++++++++++
>  arch/x86/kernel/smpboot.c       | 24 ++++++++++++++++++++++++
>  kernel/sched/core.c             |  1 +
>  kernel/sched/sched.h            |  7 +++++++
>  4 files changed, 51 insertions(+)
>
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 7f991bd5031b..af7b7259db94 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -146,4 +146,23 @@ struct pci_bus;
>  int x86_pci_root_bus_node(int bus);
>  void x86_pci_root_bus_resources(int bus, struct list_head *resources);
>
> +#ifdef CONFIG_SMP
> +
> +#define arch_scale_freq_tick arch_scale_freq_tick
> +#define arch_scale_freq_capacity arch_scale_freq_capacity
> +
> +DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +static inline arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +{
> +	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> +		return per_cpu(arch_cpu_freq, cpu);
> +	else
> +		return SCHED_CAPACITY_SCALE;
> +}
> +
> +extern void arch_scale_freq_tick(void);
> +
> +#endif
> +
>  #endif /* _ASM_X86_TOPOLOGY_H */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 3bf1e0b5f827..7d459577ee44 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1647,3 +1647,27 @@ void native_play_dead(void)
>  }
>
>  #endif
> +
> +static DEFINE_PER_CPU(u64, arch_prev_aperf);
> +static DEFINE_PER_CPU(u64, arch_prev_mperf);
> +DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +void arch_scale_freq_tick(void)
> +{
> +	u64 aperf, mperf;
> +	u64 acnt, mcnt;
> +
> +	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
> +		return;
> +
> +	aperf = rdmsrl(MSR_IA32_APERF);
> +	mperf = rdmsrl(MSR_IA32_APERF);
> +
> +	acnt = aperf - this_cpu_read(arch_prev_aperf);
> +	mcnt = mperf - this_cpu_read(arch_prev_mperf);
> +
> +	this_cpu_write(arch_prev_aperf, aperf);
> +	this_cpu_write(arch_prev_mperf, mperf);
> +
> +	this_cpu_write(arch_cpu_freq, div64_u64(acnt * SCHED_CAPACITY_SCALE, mcnt));
> +}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 96e323b26ea9..35dbf909afb2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2901,6 +2901,7 @@ void scheduler_tick(void)
>  	struct rq *rq = cpu_rq(cpu);
>  	struct task_struct *curr = rq->curr;
>
> +	arch_scale_freq_tick();
>  	sched_clock_tick();
>
>  	raw_spin_lock(&rq->lock);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index baa32075f98e..c3825c920e3f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1408,6 +1408,13 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
>  }
>  #endif
>
> +#ifndef arch_scale_freq_tick
> +static __always_inline
> +void arch_scale_freq_tick(void)
> +{
> +}
> +#endif
> +
>  #ifndef arch_scale_cpu_capacity
>  static __always_inline
>  unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)

^ permalink raw reply	[flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-02 17:58 ` Rafael J. Wysocki 2016-03-02 22:49 ` Rafael J. Wysocki @ 2016-03-03 13:07 ` Vincent Guittot 2016-03-03 20:06 ` Steve Muckle 1 sibling, 1 reply; 158+ messages in thread From: Vincent Guittot @ 2016-03-03 13:07 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Michael Turquette On 2 March 2016 at 18:58, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot > <vincent.guittot@linaro.org> wrote: >> Hi Rafael, >> >> >> On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>> >>> Add a new cpufreq scaling governor, called "schedutil", that uses >>> scheduler-provided CPU utilization information as input for making >>> its decisions. >>> >>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add >>> mechanism for registering utilization update callbacks) that >>> introduced cpufreq_update_util() called by the scheduler on >>> utilization changes (from CFS) and RT/DL task status updates. >>> In particular, CPU frequency scaling decisions may be based on >>> the the utilization data passed to cpufreq_update_util() by CFS. >>> >>> The new governor is relatively simple. >>> >>> The frequency selection formula used by it is essentially the same >>> as the one used by the "ondemand" governor, although it doesn't use >>> the additional up_threshold parameter, but instead of computing the >>> load as the "non-idle CPU time" to "total CPU time" ratio, it takes >>> the utilization data provided by CFS as input. More specifically, >>> it represents "load" as the util/max ratio, where util and max >>> are the utilization and CPU capacity coming from CFS. 
>>>
>>
>> [snip]
>>
>>> +
>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>> +				unsigned long util, unsigned long max,
>>> +				unsigned int next_freq)
>>> +{
>>> +	struct cpufreq_policy *policy = sg_policy->policy;
>>> +	unsigned int rel;
>>> +
>>> +	if (next_freq > policy->max)
>>> +		next_freq = policy->max;
>>> +	else if (next_freq < policy->min)
>>> +		next_freq = policy->min;
>>> +
>>> +	sg_policy->last_freq_update_time = time;
>>> +	if (sg_policy->next_freq == next_freq)
>>> +		return;
>>> +
>>> +	sg_policy->next_freq = next_freq;
>>> +	/*
>>> +	 * If utilization is less than max / 4, use RELATION_C to allow the
>>> +	 * minimum frequency to be selected more often in case the distance from
>>> +	 * it to the next available frequency in the table is significant.
>>> +	 */
>>> +	rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
>>> +	if (policy->fast_switch_possible) {
>>> +		cpufreq_driver_fast_switch(policy, next_freq, rel);
>>> +	} else {
>>> +		sg_policy->relation = rel;
>>> +		sg_policy->work_in_progress = true;
>>> +		irq_work_queue(&sg_policy->irq_work);
>>> +	}
>>> +}
>>> +
>>> +static void sugov_update_single(struct update_util_data *data, u64 time,
>>> +				unsigned long util, unsigned long max)
>>> +{
>>> +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
>>> +	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>>> +	unsigned int min_f, max_f, next_f;
>>> +
>>> +	if (!sugov_should_update_freq(sg_policy, time))
>>> +		return;
>>> +
>>> +	min_f = sg_policy->policy->cpuinfo.min_freq;
>>> +	max_f = sg_policy->policy->cpuinfo.max_freq;
>>> +	next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>>
>> I think it has been pointed out in another email's thread but you
>> should change the way the next_f is computed. util reflects the
>> utilization of a CPU from 0 to its max compute capacity whereas
>> ondemand was using the load at the current frequency during the last
>> time window. I have understood that you want to keep same formula than
>> ondemand as a starting point but you use a different input to
>> calculate the next frequency so i don't see the rational of keeping
>> this formula.
>
> It is a formula that causes the entire available frequency range to be
> utilized proportionally to the utilization as reported by the
> scheduler (modulo the policy->min/max limits).  Its (significant IMO)
> advantage is that it doesn't require any additional factors that would
> need to be determined somehow.
>
>> Saying that, even the simple formula next_f = util > max
>> ? max_f : util * (max_f) / max will not work properly if the frequency
>> invariance is enable because the utilization becomes capped by the
>> current compute capacity so next_f will never be higher than current
>> freq (unless a task move on the rq). That was one reason of using a
>> threshold in sched-freq proposal (and there are on going dev to try to
>> solve this limitation).
>
> Well, a different formula will have to be used along with frequency
> invariance, then.
>
>> IIIUC, frequency invariance is not enable on your platform so you have
>> not seen the problem but you have probably see that selection of your
>> next_f was not really stable. Without frequency invariance, the
>> utilization will be overestimated when running at lower frequency so
>> the governor will probably select a frequency that is higher than
>> necessary but then the utilization will decrease at this higher
>> frequency so the governor will probably decrease the frequency and so
>> on until you found the right frequency that will generate the right
>> utilisation value
>
> I don't have any problems with that to be honest and if you aim at
> selecting the perfect frequency at the first attempt, then good luck
> with that anyway.
I mainly want to prevent useless periodic frequency switches caused by a utilization that changes with the current frequency (when frequency invariance is not used) and that can make the formula select a frequency other than the current one. That is what I can see when testing it.

Sorry for the late reply; I was trying to run some tests on my board but was facing a crash issue (not linked to your patchset). So I have done some tests and I can see such unstable behavior. I generated a load of 33% at max frequency (3 ms of work every 9 ms) and I can see the frequency toggling without any good reason. That said, I can see a similar thing with ondemand.

Vincent

>
> Now, I'm not saying that the formula used in this patch cannot be
> improved or similar.  It very well may be possible to improve it.  I'm
> only saying that it is good enough to start with, because of the
> reasons mentioned above.
>
> Still, if you can suggest to me what other formula specifically should
> be used here, I'll consider using it.  Which will probably mean
> comparing the two and seeing which one leads to better results.
>
> Thanks,
> Rafael

^ permalink raw reply	[flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 13:07 ` Vincent Guittot @ 2016-03-03 20:06 ` Steve Muckle 2016-03-03 20:20 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Steve Muckle @ 2016-03-03 20:06 UTC (permalink / raw) To: Vincent Guittot, Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On 03/03/2016 05:07 AM, Vincent Guittot wrote: > I mainly want to prevent any useless and periodic frequency switch > because of an utilization that changes with the current frequency (if > frequency invariance is not used) and that can make the formula > selects another frequency than the current one. That what i can see > when testing it . > > Sorry for the late reply, i was trying to do some test on my board but > was facing some crash issue (not link with your patchset). So i have > done some tests and i can see such instable behavior. I have generated > a load of 33% at max frequency (3ms runs every 9ms) and i can see the > frequency that toggles without any good reason. Saying that, i can see > similar thing with ondemand. FWIW I ran some performance numbers on my chromebook 2. Initially I forgot to bring in the frequency invariance support but that yielded an opportunity to see the impact of it. The tests below consist of a periodic workload. The OH (overhead) numbers show how close the workload got to running as slow as fmin (100% = as slow as powersave gov, 0% = as fast as perf gov). The OR (overrun) number is the count of instances where the busy work exceeded the period. First a comparison of schedutil with and without frequency invariance. Run and period are in milliseconds. 
             scu (no inv)     scu (w/inv)
run  period  busy %   OR  OH       OR  OH
  1     100   1.00%    0  79.72%    0  95.86%
 10    1000   1.00%    0  24.52%    0  71.61%
  1      10  10.00%    0  21.25%    0  41.78%
 10     100  10.00%    0  26.06%    0  47.96%
100    1000  10.00%    0   6.36%    0  26.03%
  6      33  18.18%    0  15.67%    0  31.61%
 66     333  19.82%    0   8.94%    0  29.46%
  4      10  40.00%    0   6.26%    0  12.93%
 40     100  40.00%    0   6.93%    2  14.08%
400    1000  40.00%    0   1.65%    0  11.58%
  5       9  55.56%    0   3.70%    0   7.70%
 50      90  55.56%    1   4.19%    6   8.06%
500     900  55.56%    0   1.35%    5   6.94%
  9      12  75.00%    0   1.60%   56   3.59%
 90     120  75.00%    0   1.88%   21   3.94%
900    1200  75.00%    0   0.73%    4   4.41%

Frequency invariance causes schedutil overhead to increase noticeably. I
haven't dug into traces or anything. Perhaps this is due to the
algorithm overshooting then overcorrecting etc., I do not yet know.

Here is a comparison, with frequency invariance, of ondemand and
interactive with schedfreq and schedutil. The first two columns (run and
period) are omitted so the table will fit.

         ondemand     interactive   schedfreq    schedutil
busy %   OR  OH       OR  OH        OR  OH       OR  OH
 1.00%    0  68.96%    0  100.04%    0  78.49%    0  95.86%
 1.00%    0  25.04%    0   22.59%    0  72.56%    0  71.61%
10.00%    0  21.75%    0   63.08%    0  52.40%    0  41.78%
10.00%    0  12.17%    0   14.41%    0  17.33%    0  47.96%
10.00%    0   2.57%    0    2.17%    0   0.29%    0  26.03%
18.18%    0  12.39%    0    9.39%    0  17.34%    0  31.61%
19.82%    0   3.74%    0    3.42%    0  12.26%    0  29.46%
40.00%    2   6.26%    1   12.23%    0   6.15%    0  12.93%
40.00%    0   0.47%    0    0.05%    0   2.68%    2  14.08%
40.00%    0   0.60%    0    0.50%    0   1.22%    0  11.58%
55.56%    2   4.25%    5    5.97%    0   2.51%    0   7.70%
55.56%    0   1.89%    0    0.04%    0   1.71%    6   8.06%
55.56%    0   0.50%    0    0.47%    0   1.82%    5   6.94%
75.00%    2   1.65%    1    0.46%    0   0.26%   56   3.59%
75.00%    0   1.68%    0    0.05%    0   0.49%   21   3.94%
75.00%    0   0.28%    0    0.23%    0   0.62%    4   4.41%

Aside from the 2nd and 3rd tests schedutil is showing decreased
performance across the board. The fifth test is particularly bad.

The catch is that I do not have power numbers to go with this data, as
I'm not currently equipped to gather them. So more analysis is
definitely needed to capture the full story.
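To make the OH metric concrete: the benchmark tool itself is not shown in this thread, so the linear normalization below is an assumption, but it matches the stated endpoints (0% = as fast as the performance governor, 100% = as slow as the powersave governor).

```python
# Sketch (not from the thread) of one plausible way to compute the
# "overhead" (OH) metric: linearly map the measured completion time
# onto [0%, 100%], where the perf-governor time maps to 0% and the
# powersave-governor time maps to 100%.

def overhead_pct(t_measured, t_perf, t_powersave):
    """Return OH as a percentage of the perf-to-powersave time span."""
    return 100.0 * (t_measured - t_perf) / (t_powersave - t_perf)

# e.g. a run taking 3.5ms when the perf governor needs 3ms and the
# powersave governor needs 8ms lands at roughly 10% overhead.
print(overhead_pct(3.5, 3.0, 8.0))
```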
thanks, Steve ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 20:06 ` Steve Muckle @ 2016-03-03 20:20 ` Rafael J. Wysocki 2016-03-03 21:37 ` Steve Muckle 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-03 20:20 UTC (permalink / raw) To: Steve Muckle Cc: Vincent Guittot, Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Thu, Mar 3, 2016 at 9:06 PM, Steve Muckle <steve.muckle@linaro.org> wrote: > On 03/03/2016 05:07 AM, Vincent Guittot wrote: >> I mainly want to prevent any useless and periodic frequency switch >> because of an utilization that changes with the current frequency (if >> frequency invariance is not used) and that can make the formula >> selects another frequency than the current one. That what i can see >> when testing it . >> >> Sorry for the late reply, i was trying to do some test on my board but >> was facing some crash issue (not link with your patchset). So i have >> done some tests and i can see such instable behavior. I have generated >> a load of 33% at max frequency (3ms runs every 9ms) and i can see the >> frequency that toggles without any good reason. Saying that, i can see >> similar thing with ondemand. > > FWIW I ran some performance numbers on my chromebook 2. Initially I > forgot to bring in the frequency invariance support but that yielded an > opportunity to see the impact of it. > > The tests below consist of a periodic workload. The OH (overhead) > numbers show how close the workload got to running as slow as fmin (100% > = as slow as powersave gov, 0% = as fast as perf gov). The OR (overrun) > number is the count of instances where the busy work exceeded the period. > > First a comparison of schedutil with and without frequency invariance. > Run and period are in milliseconds. 
>
>              scu (no inv)     scu (w/inv)
> run  period  busy %   OR  OH       OR  OH
>   1     100   1.00%    0  79.72%    0  95.86%
>  10    1000   1.00%    0  24.52%    0  71.61%
>   1      10  10.00%    0  21.25%    0  41.78%
>  10     100  10.00%    0  26.06%    0  47.96%
> 100    1000  10.00%    0   6.36%    0  26.03%
>   6      33  18.18%    0  15.67%    0  31.61%
>  66     333  19.82%    0   8.94%    0  29.46%
>   4      10  40.00%    0   6.26%    0  12.93%
>  40     100  40.00%    0   6.93%    2  14.08%
> 400    1000  40.00%    0   1.65%    0  11.58%
>   5       9  55.56%    0   3.70%    0   7.70%
>  50      90  55.56%    1   4.19%    6   8.06%
> 500     900  55.56%    0   1.35%    5   6.94%
>   9      12  75.00%    0   1.60%   56   3.59%
>  90     120  75.00%    0   1.88%   21   3.94%
> 900    1200  75.00%    0   0.73%    4   4.41%
>
> Frequency invariance causes schedutil overhead to increase noticeably. I
> haven't dug into traces or anything. Perhaps this is due to the
> algorithm overshooting then overcorrecting etc., I do not yet know.

So as I said, the formula I used didn't take invariance into account,
so that's quite as expected.

> Here is a comparison, with frequency invariance, of ondemand and
> interactive with schedfreq and schedutil. The first two columns (run and
> period) are omitted so the table will fit.
>
>          ondemand     interactive   schedfreq    schedutil
> busy %   OR  OH       OR  OH        OR  OH       OR  OH
>  1.00%    0  68.96%    0  100.04%    0  78.49%    0  95.86%
>  1.00%    0  25.04%    0   22.59%    0  72.56%    0  71.61%
> 10.00%    0  21.75%    0   63.08%    0  52.40%    0  41.78%
> 10.00%    0  12.17%    0   14.41%    0  17.33%    0  47.96%
> 10.00%    0   2.57%    0    2.17%    0   0.29%    0  26.03%
> 18.18%    0  12.39%    0    9.39%    0  17.34%    0  31.61%
> 19.82%    0   3.74%    0    3.42%    0  12.26%    0  29.46%
> 40.00%    2   6.26%    1   12.23%    0   6.15%    0  12.93%
> 40.00%    0   0.47%    0    0.05%    0   2.68%    2  14.08%
> 40.00%    0   0.60%    0    0.50%    0   1.22%    0  11.58%
> 55.56%    2   4.25%    5    5.97%    0   2.51%    0   7.70%
> 55.56%    0   1.89%    0    0.04%    0   1.71%    6   8.06%
> 55.56%    0   0.50%    0    0.47%    0   1.82%    5   6.94%
> 75.00%    2   1.65%    1    0.46%    0   0.26%   56   3.59%
> 75.00%    0   1.68%    0    0.05%    0   0.49%   21   3.94%
> 75.00%    0   0.28%    0    0.23%    0   0.62%    4   4.41%
>
> Aside from the 2nd and 3rd tests schedutil is showing decreased
> performance across the board. The fifth test is particularly bad.
I guess you mean performance in terms of the overhead? Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 20:20 ` Rafael J. Wysocki @ 2016-03-03 21:37 ` Steve Muckle 2016-03-07 2:41 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Steve Muckle @ 2016-03-03 21:37 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Vincent Guittot, Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar

On 03/03/2016 12:20 PM, Rafael J. Wysocki wrote:
>> Here is a comparison, with frequency invariance, of ondemand and
>> interactive with schedfreq and schedutil. The first two columns (run and
>> period) are omitted so the table will fit.
>>
>>          ondemand     interactive   schedfreq    schedutil
>> busy %   OR  OH       OR  OH        OR  OH       OR  OH
>>  1.00%    0  68.96%    0  100.04%    0  78.49%    0  95.86%
>>  1.00%    0  25.04%    0   22.59%    0  72.56%    0  71.61%
>> 10.00%    0  21.75%    0   63.08%    0  52.40%    0  41.78%
>> 10.00%    0  12.17%    0   14.41%    0  17.33%    0  47.96%
>> 10.00%    0   2.57%    0    2.17%    0   0.29%    0  26.03%
>> 18.18%    0  12.39%    0    9.39%    0  17.34%    0  31.61%
>> 19.82%    0   3.74%    0    3.42%    0  12.26%    0  29.46%
>> 40.00%    2   6.26%    1   12.23%    0   6.15%    0  12.93%
>> 40.00%    0   0.47%    0    0.05%    0   2.68%    2  14.08%
>> 40.00%    0   0.60%    0    0.50%    0   1.22%    0  11.58%
>> 55.56%    2   4.25%    5    5.97%    0   2.51%    0   7.70%
>> 55.56%    0   1.89%    0    0.04%    0   1.71%    6   8.06%
>> 55.56%    0   0.50%    0    0.47%    0   1.82%    5   6.94%
>> 75.00%    2   1.65%    1    0.46%    0   0.26%   56   3.59%
>> 75.00%    0   1.68%    0    0.05%    0   0.49%   21   3.94%
>> 75.00%    0   0.28%    0    0.23%    0   0.62%    4   4.41%
>>
>> Aside from the 2nd and 3rd tests schedutil is showing decreased
>> performance across the board. The fifth test is particularly bad.
>
> I guess you mean performance in terms of the overhead?

Correct. This overhead metric describes how fast the workload completes,
with 0% equaling the perf governor and 100% equaling the powersave
governor. So it's a reflection of general performance using the
governor.
It's called "overhead" I imagine (the metric predates my involvement) as it is something introduced/caused by the policy of the governor. thanks, Steve ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-03 21:37 ` Steve Muckle @ 2016-03-07 2:41 ` Rafael J. Wysocki 2016-03-08 11:27 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-07 2:41 UTC (permalink / raw) To: Steve Muckle Cc: Rafael J. Wysocki, Vincent Guittot, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar

On Thursday, March 03, 2016 01:37:59 PM Steve Muckle wrote:
> On 03/03/2016 12:20 PM, Rafael J. Wysocki wrote:
> >> Here is a comparison, with frequency invariance, of ondemand and
> >> interactive with schedfreq and schedutil. The first two columns (run and
> >> period) are omitted so the table will fit.
> >>
> >>          ondemand     interactive   schedfreq    schedutil
> >> busy %   OR  OH       OR  OH        OR  OH       OR  OH
> >>  1.00%    0  68.96%    0  100.04%    0  78.49%    0  95.86%
> >>  1.00%    0  25.04%    0   22.59%    0  72.56%    0  71.61%
> >> 10.00%    0  21.75%    0   63.08%    0  52.40%    0  41.78%
> >> 10.00%    0  12.17%    0   14.41%    0  17.33%    0  47.96%
> >> 10.00%    0   2.57%    0    2.17%    0   0.29%    0  26.03%
> >> 18.18%    0  12.39%    0    9.39%    0  17.34%    0  31.61%
> >> 19.82%    0   3.74%    0    3.42%    0  12.26%    0  29.46%
> >> 40.00%    2   6.26%    1   12.23%    0   6.15%    0  12.93%
> >> 40.00%    0   0.47%    0    0.05%    0   2.68%    2  14.08%
> >> 40.00%    0   0.60%    0    0.50%    0   1.22%    0  11.58%
> >> 55.56%    2   4.25%    5    5.97%    0   2.51%    0   7.70%
> >> 55.56%    0   1.89%    0    0.04%    0   1.71%    6   8.06%
> >> 55.56%    0   0.50%    0    0.47%    0   1.82%    5   6.94%
> >> 75.00%    2   1.65%    1    0.46%    0   0.26%   56   3.59%
> >> 75.00%    0   1.68%    0    0.05%    0   0.49%   21   3.94%
> >> 75.00%    0   0.28%    0    0.23%    0   0.62%    4   4.41%
> >>
> >> Aside from the 2nd and 3rd tests schedutil is showing decreased
> >> performance across the board. The fifth test is particularly bad.
> >
> > I guess you mean performance in terms of the overhead?
>
> Correct. This overhead metric describes how fast the workload completes,
> with 0% equaling the perf governor and 100% equaling the powersave
> governor. So it's a reflection of general performance using the
> governor. It's called "overhead" I imagine (the metric predates my
> involvement) as it is something introduced/caused by the policy of the
> governor.

If my understanding of the frequency invariant utilization idea is
correct, it is about re-scaling utilization so it is always relative to
the capacity at the max frequency.

If that's the case, then instead of using

	x = util_raw / max

we will use something like

	y = (util_raw / max) * (f / max_freq) (f - current frequency).

This means that

(1)	x = y * max_freq / f

Now, say we have an agreed-on (linear) formula for f depending on x:

	f = a * x + b

and if you say "Look, if I substitute y for x in this formula, it
doesn't produce correct results", then I can only say "It doesn't,
because it can't". It *obviously* won't work, because instead of
substituting y for x, you need to substitute the right-hand side of (1)
for it. Then you'll get

	f = a * y * max_freq / f + b

which is obviously nonlinear, so there's no hope that the same formula
will ever work for both "raw" and "frequency invariant" utilization.

To me this means that looking for a formula that will work for both is
just pointless and there are 3 possibilities:

(a) Look for a good enough formula to apply to "raw" utilization and
    then switch over when all architectures start to use "frequency
    invariant" utilization.

(b) Make all architectures use "frequency invariant" and then look for
    a working formula (seems rather less than realistic to me to be
    honest).

(c) Code for using either "raw" or "frequency invariant" depending on
    a callback flag or something like that.

I, personally, would go for (a) at this point, because that's the
easiest one, but (c) would be doable too IMO, so I don't care that much
as long as it is not (b).
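The nonlinearity argument above can be checked numerically. A sketch with made-up coefficients (A, B, and MAX_FREQ are illustrative values, not anything from the patchset):

```python
# Numeric sketch: a formula linear in the "raw" utilization x cannot
# also be linear in the frequency-invariant utilization y, because
# y = x * f / max_freq depends on the selected frequency f itself.

MAX_FREQ = 2000.0       # illustrative max frequency
A, B = 1500.0, 500.0    # hypothetical linear coefficients: f = A*x + B

def freq_from_raw(x):
    return A * x + B

def invariant_util(x, f):
    # rescale raw utilization to be relative to capacity at max freq
    return x * (f / MAX_FREQ)

x = 0.6
f = freq_from_raw(x)        # A*0.6 + B, i.e. about 1400
y = invariant_util(x, f)    # 0.6 * 1400/2000, i.e. about 0.42

# Feeding y straight into the same linear formula does NOT reproduce f:
assert freq_from_raw(y) != f
# Recovering f from y needs the nonlinear relation f = A*y*max_freq/f + B:
assert abs(f - (A * y * MAX_FREQ / f + B)) < 1e-6
```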
Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-07 2:41 ` Rafael J. Wysocki @ 2016-03-08 11:27 ` Peter Zijlstra 2016-03-08 18:00 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-08 11:27 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Steve Muckle, Rafael J. Wysocki, Vincent Guittot, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar

On Mon, Mar 07, 2016 at 03:41:15AM +0100, Rafael J. Wysocki wrote:

> If my understanding of the requency invariant utilization idea is correct,
> it is about re-scaling utilization so it is always relative to the capacity
> at the max frequency.

Right. So if a workload runs for 5ms @1GHz and 10ms @500MHz, it would
still result in the exact same utilization.

> If that's the case, then instead of using
> x = util_raw / max
> we will use something like
> y = (util_raw / max) * (f / max_freq) (f - current frequency).

I don't get the last term. Assuming fixed frequency hardware (we can't
really assume anything else) I get to:

	util = util_raw * (current_freq / max_freq)	(1)
	x = util / max					(2)

> so there's no hope that the same formula will ever work for both "raw"
> and "frequency invariant" utilization.

Here I agree, however the above (current_freq / max_freq) term is easily
computable, and really the only thing we can assume if the arch doesn't
implement freq invariant accounting.

> (c) Code for using either "raw" or "frequency invariant" depending on
> a callback flag or something like that.

Seeing how frequency invariance is an arch feature, and cpufreq drivers
are also typically arch specific, do we really need a flag at this
level?

In any case, I think the only difference between the two formulas should
be the addition of (1) for the platforms that do not already implement
frequency invariance.
That is actually correct for platforms which do as told with their DVFS bits. And there's really not much else we can do short of implementing the scheduler arch hook to do better. > (b) Make all architecuters use "frequency invariant" and then look for a > working formula (seems rather less than realistic to me to be honest). There was a proposal to implement arch_scale_freq_capacity() as a weak function and have it serve the cpufreq selected frequency for (1) so that everything would default to that. We didn't do that because that makes the function call and multiplications unconditional. It's cheaper to add (1) to the cpufreq side when selecting a freq rather than at every single time we update the util statistics. ^ permalink raw reply [flat|nested] 158+ messages in thread
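The rescaling Peter labels (1) can be sketched as follows; the values are illustrative only, mirroring his 5ms @1GHz vs 10ms @500MHz example:

```python
# Sketch of frequency-invariance rescaling (1): a fixed amount of work
# yields the same invariant utilization regardless of the frequency it
# happened to run at.  Example values only.

MAX_FREQ = 1000.0  # MHz, illustrative

def invariant_util(util_raw, current_freq):
    # (1): scale the raw (time-based) utilization by f / max_freq
    return util_raw * (current_freq / MAX_FREQ)

# The same work is 5ms of a 10ms period at 1000MHz (util_raw = 0.5)
# or 10ms of a 10ms period at 500MHz (util_raw = 1.0); after the
# rescaling both report an invariant utilization of 0.5.
assert invariant_util(0.5, 1000.0) == invariant_util(1.0, 500.0) == 0.5
```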
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-08 11:27 ` Peter Zijlstra @ 2016-03-08 18:00 ` Rafael J. Wysocki 2016-03-08 19:26 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-08 18:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Steve Muckle, Rafael J. Wysocki, Vincent Guittot, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Mar 07, 2016 at 03:41:15AM +0100, Rafael J. Wysocki wrote: > >> If my understanding of the requency invariant utilization idea is correct, >> it is about re-scaling utilization so it is always relative to the capacity >> at the max frequency. > > Right. So if a workload runs for 5ms at @1GHz and 10ms @500MHz, it would > still result in the exact same utilization. > >> If that's the case, then instead of using >> x = util_raw / max >> we will use something like >> y = (util_raw / max) * (f / max_freq) (f - current frequency). > > I don't get the last term. The "(f - current frequency)" thing? It doesn't belong to the formula, sorry for the confusion. So it is almost the same as your (1) below (except for the max in the denominator), so my y is your x. :-) > Assuming fixed frequency hardware (we can't > really assume anything else) I get to: > > util = util_raw * (current_freq / max_freq) (1) > x = util / max (2) > >> so there's no hope that the same formula will ever work for both "raw" >> and "frequency invariant" utilization. > > Here I agree, however the above (current_freq / max_freq) term is easily > computable, and really the only thing we can assume if the arch doesn't > implement freq invariant accounting. Right. >> (c) Code for using either "raw" or "frequency invariant" depending on >> a callback flag or something like that. 
>
> Seeing how frequency invariance is an arch feature, and cpufreq drivers
> are also typically arch specific, do we really need a flag at this
> level?

The next frequency is selected by the governor, and that's why the flag
is needed there. The driver only gets a frequency to set.

Now, the governor needs to work with different platforms, so it needs
to know how to deal with the given one.

> In any case, I think the only difference between the two formula should
> be the addition of (1) for the platforms that do not already implement
> frequency invariance.

OK

So I'm reading this as a statement that linear is a better
approximation for frequency invariant utilization.

This means that on platforms where the utilization is frequency
invariant we should use

	next_freq = a * x

(where x is given by (2) above) and for platforms where the
utilization is not frequency invariant

	next_freq = a * x * current_freq / max_freq

and it all boils down to finding a.

Now, it seems reasonable for a to be something like
(1 + 1/n) * max_freq, so for non-frequency invariant we get

	next_freq = (1 + 1/n) * current_freq * x

> That is actually correct for platforms which do as told with their DVFS
> bits. And there's really not much else we can do short of implementing
> the scheduler arch hook to do better.
>
>> (b) Make all architecuters use "frequency invariant" and then look for a
>> working formula (seems rather less than realistic to me to be honest).
>
> There was a proposal to implement arch_scale_freq_capacity() as a weak
> function and have it serve the cpufreq selected frequency for (1) so
> that everything would default to that.
>
> We didn't do that because that makes the function call and
> multiplications unconditional. It's cheaper to add (1) to the cpufreq
> side when selecting a freq rather than at every single time we update
> the util statistics.

That's fine by me. My point was that we need different formulas for
frequency invariant utilization and the other, basically.
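The two formulas above can be sketched side by side; the choice n = 4 (a 25% margin) and the frequency values are illustrative assumptions, not from the patchset:

```python
# Sketch of the two frequency-selection formulas, with the assumed
# coefficient a = (1 + 1/n) * max_freq.  Example values only.

MAX_FREQ = 1200.0
N = 4  # hypothetical margin choice: 1 + 1/4 = 1.25

A = (1 + 1.0 / N) * MAX_FREQ

def next_freq_invariant(x):
    # x is frequency-invariant utilization in [0, 1]
    return A * x

def next_freq_raw(x_raw, current_freq):
    # raw utilization needs the extra current_freq/max_freq scaling,
    # which collapses to (1 + 1/n) * current_freq * x_raw
    return A * x_raw * current_freq / MAX_FREQ

# The two agree when the raw utilization is measured at max frequency:
assert next_freq_raw(0.5, MAX_FREQ) == next_freq_invariant(0.5) == 750.0
```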
^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-08 18:00 ` Rafael J. Wysocki @ 2016-03-08 19:26 ` Peter Zijlstra 2016-03-08 20:05 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-08 19:26 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Steve Muckle, Vincent Guittot, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: > On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > Seeing how frequency invariance is an arch feature, and cpufreq drivers > > are also typically arch specific, do we really need a flag at this > > level? > > The next frequency is selected by the governor and that's why. The > driver gets a frequency to set only. > > Now, the governor needs to work with different platforms, so it needs > to know how to deal with the given one. Ah, indeed. In any case, the availability of arch_sched_scale_freq() is a compile time thingy, so we can, at compile time, know what to use. > > In any case, I think the only difference between the two formula should > > be the addition of (1) for the platforms that do not already implement > > frequency invariance. > > OK > > So I'm reading this as a statement that linear is a better > approximation for frequency invariant utilization. Well, (1) is what the scheduler does with frequency invariance, except that allows a more flexible definition of 'current frequency' by asking for it every time we update the util stats. But if a platform doesn't need this, ie. it has a fixed frequency, or simply doesn't provide anything like this, assuming we run at the frequency we asked for is a reasonable assumption no? 
> This means that on platforms where the utilization is frequency > invariant we should use > > next_freq = a * x > > (where x is given by (2) above) and for platforms where the > utilization is not frequency invariant > > next_freq = a * x * current_freq / max_freq > > and all boils down to finding a. Right. > Now, it seems reasonable for a to be something like (1 + 1/n) * > max_freq, so for non-frequency invariant we get > > nex_freq = (1 + 1/n) * current_freq * x This seems like a big leap; where does: (1 + 1/n) * max_freq come from? And what is 'n'? ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-08 19:26 ` Peter Zijlstra @ 2016-03-08 20:05 ` Rafael J. Wysocki 2016-03-09 10:15 ` Juri Lelli 2016-03-09 16:39 ` Peter Zijlstra 0 siblings, 2 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-08 20:05 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Rafael J. Wysocki, Steve Muckle, Vincent Guittot, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > >> > Seeing how frequency invariance is an arch feature, and cpufreq drivers >> > are also typically arch specific, do we really need a flag at this >> > level? >> >> The next frequency is selected by the governor and that's why. The >> driver gets a frequency to set only. >> >> Now, the governor needs to work with different platforms, so it needs >> to know how to deal with the given one. > > Ah, indeed. In any case, the availability of arch_sched_scale_freq() is > a compile time thingy, so we can, at compile time, know what to use. > >> > In any case, I think the only difference between the two formula should >> > be the addition of (1) for the platforms that do not already implement >> > frequency invariance. >> >> OK >> >> So I'm reading this as a statement that linear is a better >> approximation for frequency invariant utilization. > > Well, (1) is what the scheduler does with frequency invariance, except > that allows a more flexible definition of 'current frequency' by asking > for it every time we update the util stats. > > But if a platform doesn't need this, ie. 
it has a fixed frequency, or
> simply doesn't provide anything like this, assuming we run at the
> frequency we asked for is a reasonable assumption no?
>
>> This means that on platforms where the utilization is frequency
>> invariant we should use
>>
>> next_freq = a * x
>>
>> (where x is given by (2) above) and for platforms where the
>> utilization is not frequency invariant
>>
>> next_freq = a * x * current_freq / max_freq
>>
>> and all boils down to finding a.
>
> Right.

However, that doesn't seem to be in agreement with Steve's results
posted earlier in this thread.

Also theoretically, with frequency invariance, the only way you can get
to 100% utilization is by running at the max frequency, so the closer to
100% you get, the faster you need to run to get any further. That
indicates nonlinear to me.

>> Now, it seems reasonable for a to be something like (1 + 1/n) *
>> max_freq, so for non-frequency invariant we get
>>
>> nex_freq = (1 + 1/n) * current_freq * x
>
> This seems like a big leap; where does:
>
>	(1 + 1/n) * max_freq
>
> come from? And what is 'n'?

a = max_freq gives next_freq = max_freq for x = 1, but with that choice
of a you may never get to x = 1 with frequency invariant utilization
because of the feedback effect mentioned above, so the 1/n produces the
extra boost needed for that (n is a positive integer).

Quite frankly, to me it looks like linear really is a better
approximation for "raw" utilization. That is, for frequency invariant
x we should take:

	next_freq = a * x * max_freq / current_freq

(and if x is not frequency invariant, the right-hand side becomes
a * x). Then, the extra boost needed to get to x = 1 for frequency
invariant utilization is produced by the (max_freq / current_freq)
factor that is greater than 1 as long as we are not running at max_freq,
and a can be chosen as max_freq.

^ permalink raw reply	[flat|nested] 158+ messages in thread
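The boost factor described in that last paragraph can be illustrated numerically; the frequencies below are example values only:

```python
# Sketch of the boost effect: with a = max_freq and frequency-invariant
# utilization x, the max_freq/current_freq factor lets the request climb
# above the current frequency even though x itself is capped by it.

MAX_FREQ = 1200.0  # illustrative

def next_freq(x_inv, current_freq):
    # next_freq = a * x * max_freq / current_freq, with a = max_freq
    return MAX_FREQ * x_inv * MAX_FREQ / current_freq

# A CPU 100% busy at 600MHz can only report x_inv = 600/1200 = 0.5,
# but the boost factor (1200/600 = 2) still drives the request to max:
assert next_freq(0.5, 600.0) == 1200.0
# while a half-busy CPU at 600MHz (x_inv = 0.25) just asks to stay put:
assert next_freq(0.25, 600.0) == 600.0
```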
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-08 20:05 ` Rafael J. Wysocki @ 2016-03-09 10:15 ` Juri Lelli 2016-03-09 23:41 ` Rafael J. Wysocki 2016-03-09 16:39 ` Peter Zijlstra 1 sibling, 1 reply; 158+ messages in thread From: Juri Lelli @ 2016-03-09 10:15 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Peter Zijlstra, Rafael J. Wysocki, Steve Muckle, Vincent Guittot, Linux PM list, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar Hi, sorry if I didn't reply yet. Trying to cope with jetlag and talks/meetings these days :-). Let me see if I'm getting what you are discussing, though. On 08/03/16 21:05, Rafael J. Wysocki wrote: > On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: > >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: [...] > a = max_freq gives next_freq = max_freq for x = 1, but with that > choice of a you may never get to x = 1 with frequency invariant > because of the feedback effect mentioned above, so the 1/n produces > the extra boost needed for that (n is a positive integer). > > Quite frankly, to me it looks like linear really is a better > approximation for "raw" utilization. That is, for frequency invariant > x we should take: > > next_freq = a * x * max_freq / current_freq > > (and if x is not frequency invariant, the right-hand side becomes a * > x). Then, the extra boost needed to get to x = 1 for frequency > invariant is produced by the (max_freq / current_freq) factor that is > greater than 1 as long as we are not running at max_freq and a can be > chosen as max_freq. > Expanding terms again, your original formula (without the 1.1 factor of the last version) was: next_freq = util / max_cap * max_freq and this doesn't work when we have freq invariance since util won't go over curr_cap. 
What you propose above is to add another factor, so that we have:

	next_freq = util / max_cap * max_freq / curr_freq * max_freq

which should give us the opportunity to reach max_freq also with freq
invariance.

This should actually be the same as doing:

	next_freq = util / max_cap * max_cap / curr_cap * max_freq

We are basically scaling how much the cpu is busy at curr_cap back to
the 0..1024 scale. And we use this to select next_freq. Also, we can
simplify this to:

	next_freq = util / curr_cap * max_freq

and we save some ops.

However, if that is correct, I think we might have a problem, as we are
skewing OPP selection towards higher frequencies. Let's suppose we have
a platform with 3 OPPs:

	freq	cap
	1200	1024
	 900	 768
	 600	 512

As soon as a task reaches a utilization of 257 we will be selecting the
second OPP, as

	next_freq = 257 / 512 * 1200 ~ 602

while the cpu is only 50% busy in this case. And we will go to the max
OPP when reaching ~492 (~64% of 768).

That said, I guess this might work as a first solution, but we will
probably need something better in the future. I understand Rafael's
concerns regarding margins, but it seems to me that some kind of
additional parameter will probably be needed anyway to fix this.

Just to say again how we handle this in schedfreq: with a -20% margin
applied to the lowest OPP we will get to the next one when utilization
reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
which is less aggressive and might be better IMHO.

Best,

- Juri

^ permalink raw reply	[flat|nested] 158+ messages in thread
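The 3-OPP example can be reproduced numerically. A sketch, assuming selection rounds up to the lowest OPP whose frequency covers the request (the OPP values are the ones from the mail):

```python
# Reproducing the 3-OPP example: with next_freq = util / curr_cap *
# max_freq, OPP selection skews towards higher frequencies.

OPPS = [(600, 512), (900, 768), (1200, 1024)]  # (freq, capacity)
MAX_FREQ = 1200.0

def next_freq(util, curr_cap):
    return util / curr_cap * MAX_FREQ

def select_opp(requested):
    # assumed policy: lowest OPP whose frequency covers the request
    for freq, cap in OPPS:
        if freq >= requested:
            return freq
    return OPPS[-1][0]

# At util = 257 while running at the lowest OPP (cap 512) the request
# already exceeds 600 (257/512 * 1200 ~ 602), so the second OPP is
# chosen even though the cpu is only about 50% busy:
assert next_freq(257, 512) > 600
assert select_opp(next_freq(257, 512)) == 900
# One utilization unit lower and we would have stayed at 600MHz:
assert select_opp(next_freq(256, 512)) == 600
```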
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-09 10:15 ` Juri Lelli @ 2016-03-09 23:41 ` Rafael J. Wysocki 2016-03-10 4:30 ` Juri Lelli 2016-03-10 23:19 ` Michael Turquette 0 siblings, 2 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-09 23:41 UTC (permalink / raw) To: Juri Lelli Cc: Rafael J. Wysocki, Peter Zijlstra, Rafael J. Wysocki, Steve Muckle, Vincent Guittot, Linux PM list, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote: > Hi, > > sorry if I didn't reply yet. Trying to cope with jetlag and > talks/meetings these days :-). Let me see if I'm getting what you are > discussing, though. > > On 08/03/16 21:05, Rafael J. Wysocki wrote: >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > [...] > >> a = max_freq gives next_freq = max_freq for x = 1, but with that >> choice of a you may never get to x = 1 with frequency invariant >> because of the feedback effect mentioned above, so the 1/n produces >> the extra boost needed for that (n is a positive integer). >> >> Quite frankly, to me it looks like linear really is a better >> approximation for "raw" utilization. That is, for frequency invariant >> x we should take: >> >> next_freq = a * x * max_freq / current_freq >> >> (and if x is not frequency invariant, the right-hand side becomes a * >> x). Then, the extra boost needed to get to x = 1 for frequency >> invariant is produced by the (max_freq / current_freq) factor that is >> greater than 1 as long as we are not running at max_freq and a can be >> chosen as max_freq. 
>> > > Expanding terms again, your original formula (without the 1.1 factor of > the last version) was: > > next_freq = util / max_cap * max_freq > > and this doesn't work when we have freq invariance since util won't go > over curr_cap. Can you please remind me what curr_cap is? > What you propose above is to add another factor, so that we have: > > next_freq = util / max_cap * max_freq / curr_freq * max_freq > > which should give us the opportunity to reach max_freq also with freq > invariance. > > This should actually be the same of doing: > > next_freq = util / max_cap * max_cap / curr_cap * max_freq > > We are basically scaling how much the cpu is busy at curr_cap back to > the 0..1024 scale. And we use this to select next_freq. Also, we can > simplify this to: > > next_freq = util / curr_cap * max_freq > > and we save some ops. > > However, if that is correct, I think we might have a problem, as we are > skewing OPP selection towards higher frequencies. Let's suppose we have > a platform with 3 OPPs: > > freq cap > 1200 1024 > 900 768 > 600 512 > > As soon a task reaches an utilization of 257 we will be selecting the > second OPP as > > next_freq = 257 / 512 * 1200 ~ 602 > > While the cpu is only 50% busy in this case. And we will go at max OPP > when reaching ~492 (~64% of 768). > > That said, I guess this might work as a first solution, but we will > probably need something better in the future. I understand Rafael's > concerns regardin margins, but it seems to me that some kind of > additional parameter will be probably needed anyway to fix this. > Just to say again how we handle this in schedfreq, with a -20% margin > applied to the lowest OPP we will get to the next one when utilization > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones, > which is less aggressive and might be better IMHO. 
Well, Peter says that my idea is incorrect, so I'll go for

next_freq = C * current_freq * util_raw / max

where C > 1 (and likely C < 1.5) instead.

That means C has to be determined somehow or guessed. The 80% tipping
point condition seems reasonable to me, though, which leads to C = 1.25.

^ permalink raw reply [flat|nested] 158+ messages in thread
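[Editor's note: Rafael's proposed formula can be sketched in the integer arithmetic a governor would actually use. This is an illustrative sketch only, not the schedutil code; the function name and units (kHz) are assumptions. C = 1.25 is encoded as multiply-by-5, divide-by-4.]

```c
#include <assert.h>

/*
 * Sketch of: next_freq = C * current_freq * util_raw / max, with C = 1.25
 * written as 5/4 to stay in integer math. current_freq in kHz; util_raw
 * and max as passed to cpufreq_update_util(). Hypothetical helper name.
 */
static unsigned long long next_freq(unsigned long long current_freq,
				    unsigned long long util_raw,
				    unsigned long long max)
{
	return 5ULL * current_freq * util_raw / (4ULL * max);
}
```

At util_raw/max = 0.8 this returns exactly current_freq, so any utilization above 80% requests a frequency higher than the current one (the tipping point); at 100% it requests 1.25 * current_freq.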
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-09 23:41 ` Rafael J. Wysocki @ 2016-03-10 4:30 ` Juri Lelli 2016-03-10 21:01 ` Rafael J. Wysocki 2016-03-10 23:19 ` Michael Turquette 1 sibling, 1 reply; 158+ messages in thread From: Juri Lelli @ 2016-03-10 4:30 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Peter Zijlstra, Rafael J. Wysocki, Steve Muckle, Vincent Guittot, Linux PM list, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On 10/03/16 00:41, Rafael J. Wysocki wrote: > On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote: > > Hi, > > > > sorry if I didn't reply yet. Trying to cope with jetlag and > > talks/meetings these days :-). Let me see if I'm getting what you are > > discussing, though. > > > > On 08/03/16 21:05, Rafael J. Wysocki wrote: > >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: > >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: > >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > [...] > > > >> a = max_freq gives next_freq = max_freq for x = 1, but with that > >> choice of a you may never get to x = 1 with frequency invariant > >> because of the feedback effect mentioned above, so the 1/n produces > >> the extra boost needed for that (n is a positive integer). > >> > >> Quite frankly, to me it looks like linear really is a better > >> approximation for "raw" utilization. That is, for frequency invariant > >> x we should take: > >> > >> next_freq = a * x * max_freq / current_freq > >> > >> (and if x is not frequency invariant, the right-hand side becomes a * > >> x). Then, the extra boost needed to get to x = 1 for frequency > >> invariant is produced by the (max_freq / current_freq) factor that is > >> greater than 1 as long as we are not running at max_freq and a can be > >> chosen as max_freq. 
> >> > > > > Expanding terms again, your original formula (without the 1.1 factor of > > the last version) was: > > > > next_freq = util / max_cap * max_freq > > > > and this doesn't work when we have freq invariance since util won't go > > over curr_cap. > > Can you please remind me what curr_cap is? > The capacity at current frequency. > > What you propose above is to add another factor, so that we have: > > > > next_freq = util / max_cap * max_freq / curr_freq * max_freq > > > > which should give us the opportunity to reach max_freq also with freq > > invariance. > > > > This should actually be the same of doing: > > > > next_freq = util / max_cap * max_cap / curr_cap * max_freq > > > > We are basically scaling how much the cpu is busy at curr_cap back to > > the 0..1024 scale. And we use this to select next_freq. Also, we can > > simplify this to: > > > > next_freq = util / curr_cap * max_freq > > > > and we save some ops. > > > > However, if that is correct, I think we might have a problem, as we are > > skewing OPP selection towards higher frequencies. Let's suppose we have > > a platform with 3 OPPs: > > > > freq cap > > 1200 1024 > > 900 768 > > 600 512 > > > > As soon a task reaches an utilization of 257 we will be selecting the > > second OPP as > > > > next_freq = 257 / 512 * 1200 ~ 602 > > > > While the cpu is only 50% busy in this case. And we will go at max OPP > > when reaching ~492 (~64% of 768). > > > > That said, I guess this might work as a first solution, but we will > > probably need something better in the future. I understand Rafael's > > concerns regardin margins, but it seems to me that some kind of > > additional parameter will be probably needed anyway to fix this. 
> > Just to say again how we handle this in schedfreq, with a -20% margin > > applied to the lowest OPP we will get to the next one when utilization > > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones, > > which is less aggressive and might be better IMHO. > > Well, Peter says that my idea is incorrect, so I'll go for > > next_freq = C * current_freq * util_raw / max > > where C > 1 (and likely C < 1.5) instead. > > That means C has to be determined somehow or guessed. The 80% tipping > point condition seems reasonable to me, though, which leads to C = > 1.25. > Right. So, when using freq. invariant util we have: next_freq = C * curr_freq * util / curr_cap as util_raw = util * max / curr_cap What Vincent is saying makes sense, though. If we use arch_scale_freq_capacity() as denominator instead of max, we can use a single formula for both cases. Best, - Juri ^ permalink raw reply [flat|nested] 158+ messages in thread
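[Editor's note: Juri's identity util_raw = util * max / curr_cap means the raw-utilization form and the frequency-invariant rewrite request the same frequency. A standalone sketch checking that numerically with the hypothetical 3-OPP table from earlier in the thread; none of this is kernel code.]

```c
#include <assert.h>
#include <math.h>

/* Rafael's raw form: next_freq = C * curr_freq * util_raw / max */
static double next_freq_raw(double curr_freq, double util_raw, double max)
{
	return 1.25 * curr_freq * util_raw / max;
}

/* Juri's rewrite for frequency-invariant util: C * curr_freq * util / curr_cap */
static double next_freq_inv(double curr_freq, double util, double curr_cap)
{
	return 1.25 * curr_freq * util / curr_cap;
}
```

With the OPP table above (curr_freq = 900, curr_cap = 768, max = 1024) and an invariant util of 384 (50% busy at the current OPP), util_raw = 384 * 1024 / 768 = 512, and both forms request 562.5.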
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-10 4:30 ` Juri Lelli @ 2016-03-10 21:01 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-10 21:01 UTC (permalink / raw) To: Juri Lelli Cc: Rafael J. Wysocki, Peter Zijlstra, Steve Muckle, Vincent Guittot, Linux PM list, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Thursday, March 10, 2016 11:30:34 AM Juri Lelli wrote: > On 10/03/16 00:41, Rafael J. Wysocki wrote: > > On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote: > > > Hi, > > > > > > sorry if I didn't reply yet. Trying to cope with jetlag and > > > talks/meetings these days :-). Let me see if I'm getting what you are > > > discussing, though. > > > > > > On 08/03/16 21:05, Rafael J. Wysocki wrote: > > >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: > > >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > [...] > > > > > >> a = max_freq gives next_freq = max_freq for x = 1, but with that > > >> choice of a you may never get to x = 1 with frequency invariant > > >> because of the feedback effect mentioned above, so the 1/n produces > > >> the extra boost needed for that (n is a positive integer). > > >> > > >> Quite frankly, to me it looks like linear really is a better > > >> approximation for "raw" utilization. That is, for frequency invariant > > >> x we should take: > > >> > > >> next_freq = a * x * max_freq / current_freq > > >> > > >> (and if x is not frequency invariant, the right-hand side becomes a * > > >> x). 
Then, the extra boost needed to get to x = 1 for frequency > > >> invariant is produced by the (max_freq / current_freq) factor that is > > >> greater than 1 as long as we are not running at max_freq and a can be > > >> chosen as max_freq. > > >> > > > > > > Expanding terms again, your original formula (without the 1.1 factor of > > > the last version) was: > > > > > > next_freq = util / max_cap * max_freq > > > > > > and this doesn't work when we have freq invariance since util won't go > > > over curr_cap. > > > > Can you please remind me what curr_cap is? > > > > The capacity at current frequency. I see, thanks! > > > What you propose above is to add another factor, so that we have: > > > > > > next_freq = util / max_cap * max_freq / curr_freq * max_freq > > > > > > which should give us the opportunity to reach max_freq also with freq > > > invariance. > > > > > > This should actually be the same of doing: > > > > > > next_freq = util / max_cap * max_cap / curr_cap * max_freq > > > > > > We are basically scaling how much the cpu is busy at curr_cap back to > > > the 0..1024 scale. And we use this to select next_freq. Also, we can > > > simplify this to: > > > > > > next_freq = util / curr_cap * max_freq > > > > > > and we save some ops. > > > > > > However, if that is correct, I think we might have a problem, as we are > > > skewing OPP selection towards higher frequencies. Let's suppose we have > > > a platform with 3 OPPs: > > > > > > freq cap > > > 1200 1024 > > > 900 768 > > > 600 512 > > > > > > As soon a task reaches an utilization of 257 we will be selecting the > > > second OPP as > > > > > > next_freq = 257 / 512 * 1200 ~ 602 > > > > > > While the cpu is only 50% busy in this case. And we will go at max OPP > > > when reaching ~492 (~64% of 768). > > > > > > That said, I guess this might work as a first solution, but we will > > > probably need something better in the future. 
I understand Rafael's > > > concerns regardin margins, but it seems to me that some kind of > > > additional parameter will be probably needed anyway to fix this. > > > Just to say again how we handle this in schedfreq, with a -20% margin > > > applied to the lowest OPP we will get to the next one when utilization > > > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones, > > > which is less aggressive and might be better IMHO. > > > > Well, Peter says that my idea is incorrect, so I'll go for > > > > next_freq = C * current_freq * util_raw / max > > > > where C > 1 (and likely C < 1.5) instead. > > > > That means C has to be determined somehow or guessed. The 80% tipping > > point condition seems reasonable to me, though, which leads to C = > > 1.25. > > > > Right. So, when using freq. invariant util we have: > > next_freq = C * curr_freq * util / curr_cap > > as > > util_raw = util * max / curr_cap > > What Vincent is saying makes sense, though. If we use > arch_scale_freq_capacity() as denominator instead of max, we can use a > single formula for both cases. I'm not convinced about that yet, but let me think about it some more. :-) Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-09 23:41 ` Rafael J. Wysocki 2016-03-10 4:30 ` Juri Lelli @ 2016-03-10 23:19 ` Michael Turquette 1 sibling, 0 replies; 158+ messages in thread From: Michael Turquette @ 2016-03-10 23:19 UTC (permalink / raw) To: Rafael J. Wysocki, Juri Lelli Cc: Rafael J. Wysocki, Peter Zijlstra, Rafael J. Wysocki, Steve Muckle, Vincent Guittot, Linux PM list, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Ingo Molnar Quoting Rafael J. Wysocki (2016-03-09 15:41:34) > On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote: > > Hi, > > > > sorry if I didn't reply yet. Trying to cope with jetlag and > > talks/meetings these days :-). Let me see if I'm getting what you are > > discussing, though. > > > > On 08/03/16 21:05, Rafael J. Wysocki wrote: > >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: > >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: > >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > [...] > > > >> a = max_freq gives next_freq = max_freq for x = 1, but with that > >> choice of a you may never get to x = 1 with frequency invariant > >> because of the feedback effect mentioned above, so the 1/n produces > >> the extra boost needed for that (n is a positive integer). > >> > >> Quite frankly, to me it looks like linear really is a better > >> approximation for "raw" utilization. That is, for frequency invariant > >> x we should take: > >> > >> next_freq = a * x * max_freq / current_freq > >> > >> (and if x is not frequency invariant, the right-hand side becomes a * > >> x). Then, the extra boost needed to get to x = 1 for frequency > >> invariant is produced by the (max_freq / current_freq) factor that is > >> greater than 1 as long as we are not running at max_freq and a can be > >> chosen as max_freq. 
> >> > > > > Expanding terms again, your original formula (without the 1.1 factor of > > the last version) was: > > > > next_freq = util / max_cap * max_freq > > > > and this doesn't work when we have freq invariance since util won't go > > over curr_cap. > > Can you please remind me what curr_cap is? > > > What you propose above is to add another factor, so that we have: > > > > next_freq = util / max_cap * max_freq / curr_freq * max_freq > > > > which should give us the opportunity to reach max_freq also with freq > > invariance. > > > > This should actually be the same of doing: > > > > next_freq = util / max_cap * max_cap / curr_cap * max_freq > > > > We are basically scaling how much the cpu is busy at curr_cap back to > > the 0..1024 scale. And we use this to select next_freq. Also, we can > > simplify this to: > > > > next_freq = util / curr_cap * max_freq > > > > and we save some ops. > > > > However, if that is correct, I think we might have a problem, as we are > > skewing OPP selection towards higher frequencies. Let's suppose we have > > a platform with 3 OPPs: > > > > freq cap > > 1200 1024 > > 900 768 > > 600 512 > > > > As soon a task reaches an utilization of 257 we will be selecting the > > second OPP as > > > > next_freq = 257 / 512 * 1200 ~ 602 > > > > While the cpu is only 50% busy in this case. And we will go at max OPP > > when reaching ~492 (~64% of 768). > > > > That said, I guess this might work as a first solution, but we will > > probably need something better in the future. I understand Rafael's > > concerns regardin margins, but it seems to me that some kind of > > additional parameter will be probably needed anyway to fix this. > > Just to say again how we handle this in schedfreq, with a -20% margin > > applied to the lowest OPP we will get to the next one when utilization > > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones, > > which is less aggressive and might be better IMHO. 
>
> Well, Peter says that my idea is incorrect, so I'll go for
>
> next_freq = C * current_freq * util_raw / max
>
> where C > 1 (and likely C < 1.5) instead.
>
> That means C has to be determined somehow or guessed. The 80% tipping
> point condition seems reasonable to me, though, which leads to C =
> 1.25.

Right, that is the same value used in the schedfreq series:

+/*
+ * Capacity margin added to CFS and RT capacity requests to provide
+ * some head room if task utilization further increases.
+ */
+unsigned int capacity_margin = 1280;

Regards,
Mike

^ permalink raw reply [flat|nested] 158+ messages in thread
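[Editor's note: the connection is that 1280 on the 0..1024 SCHED_CAPACITY_SCALE is exactly the factor 1.25. A standalone sketch; only capacity_margin comes from the quoted snippet, the helper name is hypothetical.]

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/* Value from the quoted schedfreq snippet. */
static unsigned int capacity_margin = 1280;

/* Hypothetical helper: pad a capacity request by the margin (x1.25). */
static unsigned long add_capacity_margin(unsigned long cap)
{
	return cap * capacity_margin / SCHED_CAPACITY_SCALE;
}
```

Padding 819 (about 80% of the scale) gives 1023, just below full capacity, so crossing roughly 80% busy is what pushes the request past the current OPP — the same tipping point as C = 1.25.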
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-08 20:05 ` Rafael J. Wysocki 2016-03-09 10:15 ` Juri Lelli @ 2016-03-09 16:39 ` Peter Zijlstra 2016-03-09 23:28 ` Rafael J. Wysocki 1 sibling, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-09 16:39 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Steve Muckle, Vincent Guittot, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote: > >> This means that on platforms where the utilization is frequency > >> invariant we should use > >> > >> next_freq = a * x > >> > >> (where x is given by (2) above) and for platforms where the > >> utilization is not frequency invariant > >> > >> next_freq = a * x * current_freq / max_freq > >> > >> and all boils down to finding a. > > > > Right. > > However, that doesn't seem to be in agreement with the Steve's results > posted earlier in this thread. I could not make anything of those numbers. > Also theoretically, with frequency invariant, the only way you can get > to 100% utilization is by running at the max frequency, so the closer > to 100% you get, the faster you need to run to get any further. That > indicates nonlinear to me. I'm not seeing that, you get that by using a > 1. No need for non-linear. > >> Now, it seems reasonable for a to be something like (1 + 1/n) * > >> max_freq, so for non-frequency invariant we get > >> > >> nex_freq = (1 + 1/n) * current_freq * x > > > > This seems like a big leap; where does: > > > > (1 + 1/n) * max_freq > > > > come from? And what is 'n'? 
> a = max_freq gives next_freq = max_freq for x = 1, next_freq = a * x * current_freq / max_freq [ a := max_freq, x := 1 ] -> = max_freq * 1 * current_freq / max_freq = current_freq != max_freq But I think I see what you're saying; because at x = 1, current_frequency must be max_frequency. Per your earlier point. > but with that choice of a you may never get to x = 1 with frequency > invariant because of the feedback effect mentioned above, so the 1/n > produces the extra boost needed for that (n is a positive integer). OK, so that gets us: a = (1 + 1/n) ; n > 0 [ I would not have chosen (1 + 1/n), but lets stick to that ] So for n = 4 that gets you: a = 1.25, which effectively gets you an 80% utilization tipping point. That is, 1.25 * .8 = 1, iow. you'll pick the next frequency (assuming RELATION_L like selection). Together this gets you: next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq = (1 + 1/n) * x * current_freq Again, with n = 4, x > .8 will result in a next_freq > current_freq, and hence (RELATION_L) pick a higher one. > Quite frankly, to me it looks like linear really is a better > approximation for "raw" utilization. That is, for frequency invariant > x we should take: > > next_freq = a * x * max_freq / current_freq (its very confusing how you use 'x' for both invariant and non-invariant). That doesn't make sense, remember: util = \Sum_i u_i * freq_i / max_freq (1) Which for systems where freq_i is constant reduces to: util = util_raw * current_freq / max_freq (2) But you cannot reverse this. IOW you cannot try and divide out current_freq on a frequency invariant metric. So going by: next_freq = (1 + 1/n) * max_freq * util (3) if we substitute (2) into (3) we get: = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq = (1 + 1/n) * current_freq * util_raw (4) Which gets you two formula with the same general behaviour. As (2) is the only approximation of (1) we can make. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-09 16:39 ` Peter Zijlstra @ 2016-03-09 23:28 ` Rafael J. Wysocki 2016-03-10 3:44 ` Vincent Guittot 2016-03-10 8:43 ` Peter Zijlstra 0 siblings, 2 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-09 23:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Rafael J. Wysocki, Steve Muckle, Vincent Guittot, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Wed, Mar 9, 2016 at 5:39 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote: >> >> This means that on platforms where the utilization is frequency >> >> invariant we should use >> >> >> >> next_freq = a * x >> >> >> >> (where x is given by (2) above) and for platforms where the >> >> utilization is not frequency invariant >> >> >> >> next_freq = a * x * current_freq / max_freq >> >> >> >> and all boils down to finding a. >> > >> > Right. >> >> However, that doesn't seem to be in agreement with the Steve's results >> posted earlier in this thread. > > I could not make anything of those numbers. > >> Also theoretically, with frequency invariant, the only way you can get >> to 100% utilization is by running at the max frequency, so the closer >> to 100% you get, the faster you need to run to get any further. That >> indicates nonlinear to me. > > I'm not seeing that, you get that by using a > 1. No need for > non-linear. OK >> >> Now, it seems reasonable for a to be something like (1 + 1/n) * >> >> max_freq, so for non-frequency invariant we get >> >> >> >> nex_freq = (1 + 1/n) * current_freq * x (*) (see below) >> > This seems like a big leap; where does: >> > >> > (1 + 1/n) * max_freq >> > >> > come from? And what is 'n'? 
> >> a = max_freq gives next_freq = max_freq for x = 1, > > next_freq = a * x * current_freq / max_freq > > [ a := max_freq, x := 1 ] -> > > = max_freq * 1 * current_freq / max_freq > = current_freq > > != max_freq > > But I think I see what you're saying; because at x = 1, > current_frequency must be max_frequency. Per your earlier point. Correct. >> but with that choice of a you may never get to x = 1 with frequency >> invariant because of the feedback effect mentioned above, so the 1/n >> produces the extra boost needed for that (n is a positive integer). > > OK, so that gets us: > > a = (1 + 1/n) ; n > 0 > > [ I would not have chosen (1 + 1/n), but lets stick to that ] Well, what would you choose then? :-) > So for n = 4 that gets you: a = 1.25, which effectively gets you an 80% > utilization tipping point. That is, 1.25 * .8 = 1, iow. you'll pick the > next frequency (assuming RELATION_L like selection). > > Together this gets you: > > next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq > = (1 + 1/n) * x * current_freq That seems to be what I said above (*), isn't it? > Again, with n = 4, x > .8 will result in a next_freq > current_freq, and > hence (RELATION_L) pick a higher one. OK >> Quite frankly, to me it looks like linear really is a better >> approximation for "raw" utilization. That is, for frequency invariant >> x we should take: >> >> next_freq = a * x * max_freq / current_freq > > (its very confusing how you use 'x' for both invariant and > non-invariant). > > That doesn't make sense, remember: > > util = \Sum_i u_i * freq_i / max_freq (1) > > Which for systems where freq_i is constant reduces to: > > util = util_raw * current_freq / max_freq (2) > > But you cannot reverse this. IOW you cannot try and divide out > current_freq on a frequency invariant metric. I see. 
> So going by:
>
> next_freq = (1 + 1/n) * max_freq * util (3)

I think that should be

next_freq = (1 + 1/n) * max_freq * util / max

(where max is the second argument of cpufreq_update_util) or the
dimensions on both sides don't match.

> if we substitute (2) into (3) we get:
>
> = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq
> = (1 + 1/n) * current_freq * util_raw (4)
>
> Which gets you two formula with the same general behaviour. As (2) is
> the only approximation of (1) we can make.

OK

So since utilization is not frequency invariant in the current
mainline (or linux-next for that matter) AFAIC, I'm going to use the
following in the next version of the schedutil patch series:

next_freq = 1.25 * current_freq * util_raw / max

where util_raw and max are what I get from cpufreq_update_util().

1.25 is for the 80% tipping point which I think is reasonable.

^ permalink raw reply [flat|nested] 158+ messages in thread
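[Editor's note: Peter's substitution, with the /max normalization Rafael adds, can be checked numerically. With util = util_raw * current_freq / max_freq (formula (2)), forms (3) and (4) request the same frequency. A sketch under hypothetical numbers; not kernel code.]

```c
#include <assert.h>
#include <math.h>

/* (3) with the /max correction: next_freq = (1 + 1/n) * max_freq * util / max */
static double next_from_invariant(int n, double max_freq, double util, double max)
{
	return (1.0 + 1.0 / n) * max_freq * util / max;
}

/* (4): next_freq = (1 + 1/n) * current_freq * util_raw / max */
static double next_from_raw(int n, double current_freq, double util_raw, double max)
{
	return (1.0 + 1.0 / n) * current_freq * util_raw / max;
}
```

For example, with n = 4, max = 1024, max_freq = 1200, current_freq = 900 and util_raw = 512, formula (2) gives util = 512 * 900 / 1200 = 384, and both functions return 562.5.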
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-09 23:28 ` Rafael J. Wysocki @ 2016-03-10 3:44 ` Vincent Guittot 2016-03-10 10:07 ` Peter Zijlstra 2016-03-10 8:43 ` Peter Zijlstra 1 sibling, 1 reply; 158+ messages in thread From: Vincent Guittot @ 2016-03-10 3:44 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Peter Zijlstra, Rafael J. Wysocki, Steve Muckle, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On 10 March 2016 at 06:28, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Wed, Mar 9, 2016 at 5:39 PM, Peter Zijlstra <peterz@infradead.org> wrote: >> On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote: >>> >> This means that on platforms where the utilization is frequency >>> >> invariant we should use >>> >> >>> >> next_freq = a * x >>> >> >>> >> (where x is given by (2) above) and for platforms where the >>> >> utilization is not frequency invariant >>> >> >>> >> next_freq = a * x * current_freq / max_freq >>> >> >>> >> and all boils down to finding a. >>> > >>> > Right. >>> >>> However, that doesn't seem to be in agreement with the Steve's results >>> posted earlier in this thread. >> >> I could not make anything of those numbers. >> >>> Also theoretically, with frequency invariant, the only way you can get >>> to 100% utilization is by running at the max frequency, so the closer >>> to 100% you get, the faster you need to run to get any further. That >>> indicates nonlinear to me. >> >> I'm not seeing that, you get that by using a > 1. No need for >> non-linear. > > OK > >>> >> Now, it seems reasonable for a to be something like (1 + 1/n) * >>> >> max_freq, so for non-frequency invariant we get >>> >> >>> >> nex_freq = (1 + 1/n) * current_freq * x > > (*) (see below) > >>> > This seems like a big leap; where does: >>> > >>> > (1 + 1/n) * max_freq >>> > >>> > come from? And what is 'n'? 
>> >>> a = max_freq gives next_freq = max_freq for x = 1, >> >> next_freq = a * x * current_freq / max_freq >> >> [ a := max_freq, x := 1 ] -> >> >> = max_freq * 1 * current_freq / max_freq >> = current_freq >> >> != max_freq >> >> But I think I see what you're saying; because at x = 1, >> current_frequency must be max_frequency. Per your earlier point. > > Correct. > >>> but with that choice of a you may never get to x = 1 with frequency >>> invariant because of the feedback effect mentioned above, so the 1/n >>> produces the extra boost needed for that (n is a positive integer). >> >> OK, so that gets us: >> >> a = (1 + 1/n) ; n > 0 >> >> [ I would not have chosen (1 + 1/n), but lets stick to that ] > > Well, what would you choose then? :-) > >> So for n = 4 that gets you: a = 1.25, which effectively gets you an 80% >> utilization tipping point. That is, 1.25 * .8 = 1, iow. you'll pick the >> next frequency (assuming RELATION_L like selection). >> >> Together this gets you: >> >> next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq >> = (1 + 1/n) * x * current_freq > > That seems to be what I said above (*), isn't it? > >> Again, with n = 4, x > .8 will result in a next_freq > current_freq, and >> hence (RELATION_L) pick a higher one. > > OK > >>> Quite frankly, to me it looks like linear really is a better >>> approximation for "raw" utilization. That is, for frequency invariant >>> x we should take: >>> >>> next_freq = a * x * max_freq / current_freq >> >> (its very confusing how you use 'x' for both invariant and >> non-invariant). >> >> That doesn't make sense, remember: >> >> util = \Sum_i u_i * freq_i / max_freq (1) >> >> Which for systems where freq_i is constant reduces to: >> >> util = util_raw * current_freq / max_freq (2) >> >> But you cannot reverse this. IOW you cannot try and divide out >> current_freq on a frequency invariant metric. > > I see. 
> >> So going by: >> >> next_freq = (1 + 1/n) * max_freq * util (3) > > I think that should be > > next_freq = (1 + 1/n) * max_freq * util / max > > (where max is the second argument of cpufreq_update_util) or the > dimensions on both sides don't match. > >> if we substitute (2) into (3) we get: >> >> = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq >> = (1 + 1/n) * current_freq * util_raw (4) >> >> Which gets you two formula with the same general behaviour. As (2) is >> the only approximation of (1) we can make. > > OK > > So since utilization is not frequency invariant in the current > mainline (or linux-next for that matter) AFAIC, I'm going to use the > following in the next version of the schedutil patch series: > > next_freq = 1.25 * current_freq * util_raw / max > > where util_raw and max are what I get from cpufreq_update_util(). > > 1.25 is for the 80% tipping point which I think is reasonable. We have the arch_scale_freq_capacity function that is arch dependent and can be used to merge the 2 formula that were described by peter above. By default, arch_scale_freq_capacity return SCHED_CAPACITY_SCALE which is max capacity but when arch_scale_freq_capacity is defined by an architecture, arch_scale_freq_capacity returns current_freq * max_capacity/max_freq so can't we use arch_scale_freq in your formula ? Taking your formula above it becomes: next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity() Without invariance feature, we have the same formula than above : next_freq = 1.25 * current_freq * util_raw / max because SCHED_CAPACITY_SCALE is max capacity With invariance feature, we have next_freq = 1.25 * current_freq * util / (current_freq*max_capacity/max_freq) = 1.25 * util * max_freq / max which is the formula that has to be used with frequency invariant utilization. 
So we have one formula that works for both configurations (this is not
really optimized for an invariant system, because we multiply and then
divide by current_freq in two different places, but it's better than a
wrong formula).

Now, arch_scale_freq_capacity is available in the kernel/sched/sched.h
header file, which can only be accessed by scheduler code... Maybe we
can pass the arch_scale_freq_capacity value instead of the max one as a
parameter of the update_util function prototype.

Vincent

^ permalink raw reply [flat|nested] 158+ messages in thread
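[Editor's note: the arithmetic of Vincent's unified formula can be checked in isolation; Peter's objection below is about current_freq being a fluid quantity, not about the algebra. In this sketch arch_scale_freq_capacity() is modeled as a plain function for the invariant case (the default would simply return max capacity, 1024); all names and values are illustrative.]

```c
#include <assert.h>
#include <math.h>

/* Invariant case: returns current_freq * max_capacity / max_freq. */
static double arch_scale_freq_cap(double curr_freq, double max_freq,
				  double max_cap)
{
	return curr_freq * max_cap / max_freq;
}

/* Vincent's unified form: 1.25 * curr_freq * util / arch_scale_freq_capacity() */
static double next_freq_unified(double curr_freq, double max_freq,
				double util, double max_cap)
{
	return 1.25 * curr_freq * util /
	       arch_scale_freq_cap(curr_freq, max_freq, max_cap);
}

/* The frequency-invariant form it should reduce to: 1.25 * util * max_freq / max */
static double next_freq_invariant(double util, double max_freq, double max_cap)
{
	return 1.25 * util * max_freq / max_cap;
}
```

With curr_freq = 900, max_freq = 1200, max_cap = 1024 and util = 384, the denominator is 900 * 1024 / 1200 = 768 and both forms give 562.5 — current_freq cancels out, as the algebra says.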
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-10 3:44 ` Vincent Guittot @ 2016-03-10 10:07 ` Peter Zijlstra 2016-03-10 10:26 ` Vincent Guittot [not found] ` <CAKfTPtCbjgbJn+68NJPCnmPFtcHD0wGmZRYaw37zSqPXNpo_Uw@mail.gmail.com> 0 siblings, 2 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-10 10:07 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Rafael J. Wysocki, Steve Muckle, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Thu, Mar 10, 2016 at 10:44:21AM +0700, Vincent Guittot wrote: > We have the arch_scale_freq_capacity function that is arch dependent > and can be used to merge the 2 formula that were described by peter > above. > By default, arch_scale_freq_capacity return SCHED_CAPACITY_SCALE which > is max capacity > but when arch_scale_freq_capacity is defined by an architecture, > arch_scale_freq_capacity returns current_freq * max_capacity/max_freq However, current_freq is a very fluid thing, it might (and will) change very rapidly on some platforms. This is the same point I made earlier, you cannot try and divide out current_freq from the invariant measure. > so can't we use arch_scale_freq in your formula ? Taking your formula > above it becomes: > next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity() No, that cannot work, nor makes any sense, per the above. > With invariance feature, we have: > > next_freq = 1.25 * current_freq * util / (current_freq*max_capacity/max_freq) > = 1.25 * util * max_freq / max > > which is the formula that has to be used with frequency invariant > utilization. Wrong, you cannot talk about current_freq in the invariant case. 
> May be we can pass arch_scale_freq_capacity value instead of max one
> as a parameter of update_util function prototype

No, since its a compile time thing, we can simply do:

#ifdef arch_scale_freq_capacity
	next_freq = (1 + 1/n) * max_freq * (util / max)
#else
	next_freq = (1 + 1/n) * current_freq * (util_raw / max)
#endif

^ permalink raw reply [flat|nested] 158+ messages in thread
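[Editor's note: Peter's compile-time selection could look like the following standalone sketch. ARCH_SCALE_FREQ_CAPACITY, the function name, and the 5/4 integer encoding of (1 + 1/n) with n = 4 are all illustrative, not the kernel's actual symbols.]

```c
#include <assert.h>

/*
 * Define ARCH_SCALE_FREQ_CAPACITY when the architecture provides
 * frequency-invariant utilization (hypothetical macro name).
 * (1 + 1/n) with n = 4 is written as 5/4 to stay in integer math.
 */
static unsigned long long compute_next_freq(unsigned long long max_freq,
					    unsigned long long current_freq,
					    unsigned long long util,
					    unsigned long long max)
{
#ifdef ARCH_SCALE_FREQ_CAPACITY
	/* util is frequency invariant */
	return 5ULL * max_freq * util / (4ULL * max);
#else
	/* util is raw and already scales with the current frequency */
	return 5ULL * current_freq * util / (4ULL * max);
#endif
}
```

Without the macro defined (the raw branch), a CPU at 900000 kHz that is 50% busy (util 512 of 1024) requests 1.25 * 450000 = 562500 kHz.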
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-10 10:07 ` Peter Zijlstra @ 2016-03-10 10:26 ` Vincent Guittot [not found] ` <CAKfTPtCbjgbJn+68NJPCnmPFtcHD0wGmZRYaw37zSqPXNpo_Uw@mail.gmail.com> 1 sibling, 0 replies; 158+ messages in thread From: Vincent Guittot @ 2016-03-10 10:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Rafael J. Wysocki, Steve Muckle, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On 10 March 2016 at 17:07, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Mar 10, 2016 at 10:44:21AM +0700, Vincent Guittot wrote: >> We have the arch_scale_freq_capacity function that is arch dependent >> and can be used to merge the 2 formula that were described by peter >> above. >> By default, arch_scale_freq_capacity return SCHED_CAPACITY_SCALE which >> is max capacity >> but when arch_scale_freq_capacity is defined by an architecture, > >> arch_scale_freq_capacity returns current_freq * max_capacity/max_freq > > However, current_freq is a very fluid thing, it might (and will) change > very rapidly on some platforms. > > This is the same point I made earlier, you cannot try and divide out > current_freq from the invariant measure. > >> so can't we use arch_scale_freq in your formula ? Taking your formula >> above it becomes: >> next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity() > > No, that cannot work, nor makes any sense, per the above. > >> With invariance feature, we have: >> >> next_freq = 1.25 * current_freq * util / (current_freq*max_capacity/max_freq) >> = 1.25 * util * max_freq / max >> >> which is the formula that has to be used with frequency invariant >> utilization. > > Wrong, you cannot talk about current_freq in the invariant case. 
> >> May be we can pass arch_scale_freq_capacity value instead of max one >> as a parameter of update_util function prototype > > No, since its a compile time thing, we can simply do: > > #ifdef arch_scale_freq_capacity > next_freq = (1 + 1/n) * max_freq * (util / max) > #else > next_freq = (1 + 1/n) * current_freq * (util_raw / max) > #endif Selecting the formula at compile time is clearly better. I wrongly thought that it couldn't be accepted as a solution. ^ permalink raw reply [flat|nested] 158+ messages in thread
[parent not found: <CAKfTPtCbjgbJn+68NJPCnmPFtcHD0wGmZRYaw37zSqPXNpo_Uw@mail.gmail.com>]
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data [not found] ` <CAKfTPtCbjgbJn+68NJPCnmPFtcHD0wGmZRYaw37zSqPXNpo_Uw@mail.gmail.com> @ 2016-03-10 10:30 ` Peter Zijlstra 2016-03-10 10:56 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-10 10:30 UTC (permalink / raw) To: Vincent Guittot Cc: Juri Lelli, mturquette, Linux PM list, Ingo Molnar, Srinivas Pandruvada, Rafael J. Wysocki, Viresh Kumar, Rafael J. Wysocki, Linux Kernel Mailing List, Steve Muckle, ACPI Devel Maling List On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote: > > No, since its a compile time thing, we can simply do: > > > > #ifdef arch_scale_freq_capacity > > next_freq = (1 + 1/n) * max_freq * (util / max) > > #else > > next_freq = (1 + 1/n) * current_freq * (util_raw / max) > > #endif > > selecting formula at compilation is clearly better. I wrongly thought that > it can't be accepted as a solution. Well, it's bound to get more 'interesting', since I foresee implementations not always actually doing the invariant thing. Take for example the thing I sent: lkml.kernel.org/r/20160303162829.GB6375@twins.programming.kicks-ass.net It shows both why you cannot talk about current_freq and that the above needs a little more help (for the !X86_FEATURE_APERFMPERF case). But the !arch_scale_freq_capacity case should indeed be that simple. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-10 10:30 ` Peter Zijlstra @ 2016-03-10 10:56 ` Peter Zijlstra 2016-03-10 22:28 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-10 10:56 UTC (permalink / raw) To: Vincent Guittot Cc: Juri Lelli, mturquette, Linux PM list, Ingo Molnar, Srinivas Pandruvada, Rafael J. Wysocki, Viresh Kumar, Rafael J. Wysocki, Linux Kernel Mailing List, Steve Muckle, ACPI Devel Maling List On Thu, Mar 10, 2016 at 11:30:08AM +0100, Peter Zijlstra wrote: > On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote: > > > > No, since its a compile time thing, we can simply do: > > > > > > #ifdef arch_scale_freq_capacity > > > next_freq = (1 + 1/n) * max_freq * (util / max) > > > #else > > > next_freq = (1 + 1/n) * current_freq * (util_raw / max) > > > #endif > > > > selecting formula at compilation is clearly better. I wrongly thought that > > it can't be accepted as a solution. > > Well, its bound to get more 'interesting' since I forse implementations > not always actually doing the invariant thing. > > Take for example the thing I send: > > lkml.kernel.org/r/20160303162829.GB6375@twins.programming.kicks-ass.net > > it both shows why you cannot talk about current_freq but also that the > above needs a little more help (for the !X86_FEATURE_APERFMPERF case). > > But the !arch_scale_freq_capacity case should indeed be that simple. Maybe something like: #ifdef arch_scale_freq_capacity #ifndef arch_scale_freq_invariant #define arch_scale_freq_invariant() (true) #endif #else /* arch_scale_freq_capacity */ #define arch_scale_freq_invariant() (false) #endif if (arch_scale_freq_invariant()) And have archs that have a conditional arch_scale_freq_capacity() implementation provide an arch_scale_freq_invariant() implementation. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-10 10:56 ` Peter Zijlstra @ 2016-03-10 22:28 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-10 22:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Vincent Guittot, Juri Lelli, mturquette, Linux PM list, Ingo Molnar, Srinivas Pandruvada, Rafael J. Wysocki, Viresh Kumar, Linux Kernel Mailing List, Steve Muckle, ACPI Devel Maling List On Thursday, March 10, 2016 11:56:14 AM Peter Zijlstra wrote: > On Thu, Mar 10, 2016 at 11:30:08AM +0100, Peter Zijlstra wrote: > > On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote: > > > > > > No, since its a compile time thing, we can simply do: > > > > > > > > #ifdef arch_scale_freq_capacity > > > > next_freq = (1 + 1/n) * max_freq * (util / max) > > > > #else > > > > next_freq = (1 + 1/n) * current_freq * (util_raw / max) > > > > #endif > > > > > > selecting formula at compilation is clearly better. I wrongly thought that > > > it can't be accepted as a solution. > > > > Well, its bound to get more 'interesting' since I forse implementations > > not always actually doing the invariant thing. > > > > Take for example the thing I send: > > > > lkml.kernel.org/r/20160303162829.GB6375@twins.programming.kicks-ass.net > > > > it both shows why you cannot talk about current_freq but also that the > > above needs a little more help (for the !X86_FEATURE_APERFMPERF case). > > > > But the !arch_scale_freq_capacity case should indeed be that simple. > > Maybe something like: > > #ifdef arch_scale_freq_capacity > #ifndef arch_scale_freq_invariant > #define arch_scale_freq_invariant() (true) > #endif > #else /* arch_scale_freq_capacity */ > #define arch_scale_freq_invariant() (false) > #endif > > if (arch_scale_freq_invariant()) > > And have archs that have conditional arch_scale_freq_capacity() > implementation provide a arch_scale_freq_invariant implementation. 
Yeah, looks workable to me. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-09 23:28 ` Rafael J. Wysocki 2016-03-10 3:44 ` Vincent Guittot @ 2016-03-10 8:43 ` Peter Zijlstra 1 sibling, 0 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-10 8:43 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Steve Muckle, Vincent Guittot, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Michael Turquette, Ingo Molnar On Thu, Mar 10, 2016 at 12:28:52AM +0100, Rafael J. Wysocki wrote: > > [ I would not have chosen (1 + 1/n), but lets stick to that ] > > Well, what would you choose then? :-) 1/p; 0 < p < 1 or so, where p then represents the percentile threshold at which you want to bump to the next freq. > I think that should be > > next_freq = (1 + 1/n) * max_freq * util / max > > (where max is the second argument of cpufreq_update_util) or the > dimensions on both sides don't match. Well yes, but so far we were treating util (and util_raw) as 0 < u < 1 values, so already normalized against max. But yes.. > > if we substitute (2) into (3) we get: > > > > = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq > > = (1 + 1/n) * current_freq * util_raw (4) > > > > Which gets you two formula with the same general behaviour. As (2) is > > the only approximation of (1) we can make. > > OK > > So since utilization is not frequency invariant in the current > mainline (or linux-next for that matter) AFAIC, I'm going to use the > following in the next version of the schedutil patch series: > > next_freq = 1.25 * current_freq * util_raw / max > > where util_raw and max are what I get from cpufreq_update_util(). > > 1.25 is for the 80% tipping point which I think is reasonable. OK. ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v2 0/10] cpufreq: schedutil governor 2016-03-02 1:56 [PATCH 0/6] cpufreq: schedutil governor Rafael J. Wysocki ` (5 preceding siblings ...) 2016-03-02 2:27 ` [PATCH 6/6] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki @ 2016-03-04 2:56 ` Rafael J. Wysocki 2016-03-04 2:58 ` [PATCH v2 1/10] cpufreq: Reduce cpufreq_update_util() overhead a bit Rafael J. Wysocki ` (10 more replies) 6 siblings, 11 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 2:56 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wednesday, March 02, 2016 02:56:28 AM Rafael J. Wysocki wrote: > Hi, > > My previous intro message still applies somewhat, so here's a link: > > http://marc.info/?l=linux-pm&m=145609673008122&w=2 > > The executive summary of the motivation is that I wanted to do two things: > use the utilization data from the scheduler (it's passed to the governor > as aguments of update callbacks anyway) and make it possible to set > CPU frequency without involving process context (fast frequency switching). > > Both have been prototyped in the previous RFCs: > > https://patchwork.kernel.org/patch/8426691/ > https://patchwork.kernel.org/patch/8426741/ > [cut] > > Comments welcome. There were quite a few comments to address, so here's a new version. First off, my interpretation of what Ingo said earlier today (or yesterday depending on your time zone) is that he wants all of the code dealing with the util and max values to be located in kernel/sched/. I can understand the motivation here, although schedutil shares some amount of code with the other governors, so the dependency on cpufreq will still be there, even if the code goes to kernel/sched/. Nevertheless, I decided to make that change just to see how it would look, if nothing else.
To that end, I revived a patch I had before the first schedutil one to remove util/max from the cpufreq hooks [7/10], moved the scheduler-related code from drivers/cpufreq/cpufreq.c to kernel/sched/cpufreq.c (new file) on top of that [8/10] and reintroduced cpufreq_update_util() in a slightly different form [9/10]. I did it this way in case it turns out to be necessary to apply [7/10] and [8/10] for the time being and defer the rest to the next cycle. Apart from that, I changed the frequency selection formula in the new governor to next_freq = util * max_freq / max and it seems to work. That allowed the code to be simplified somewhat, as I don't need the extra relation field in struct sugov_policy now (RELATION_L is used everywhere). Finally, I tried to address the bikeshed comment from Viresh about the "wrong" names of data types etc. related to governor sysfs attribute handling. Hopefully, the new ones are better. There are small tweaks all over on top of that. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v2 1/10] cpufreq: Reduce cpufreq_update_util() overhead a bit 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki @ 2016-03-04 2:58 ` Rafael J. Wysocki 2016-03-09 12:39 ` Peter Zijlstra 2016-03-04 2:59 ` [PATCH v2 2/10][Resend] cpufreq: acpi-cpufreq: Make read and write operations more efficient Rafael J. Wysocki ` (9 subsequent siblings) 10 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 2:58 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Use the observation that cpufreq_update_util() is only called by the scheduler with rq->lock held, so the callers of cpufreq_set_update_util_data() can use synchronize_sched() instead of synchronize_rcu() to wait for cpufreq_update_util() to complete. Moreover, if they are updated to do that, rcu_read_(un)lock() calls in cpufreq_update_util() might be replaced with rcu_read_(un)lock_sched(), respectively, but those aren't really necessary, because the scheduler calls that function from RCU-sched read-side critical sections already. In addition to that, if cpufreq_set_update_util_data() checks the func field in the struct update_util_data before setting the per-CPU pointer to it, the data->func check may be dropped from cpufreq_update_util() as well. Make the above changes to reduce the overhead from cpufreq_update_util() in the scheduler paths invoking it and to make the cleanup after removing its callbacks less heavy-weight somewhat. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> --- Changes from the previous version: - Use rcu_dereference_sched() in cpufreq_update_util(). 
--- drivers/cpufreq/cpufreq.c | 25 +++++++++++++++++-------- drivers/cpufreq/cpufreq_governor.c | 2 +- drivers/cpufreq/intel_pstate.c | 4 ++-- 3 files changed, 20 insertions(+), 11 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -77,12 +77,15 @@ static DEFINE_PER_CPU(struct update_util * to call from cpufreq_update_util(). That function will be called from an RCU * read-side critical section, so it must not sleep. * - * Callers must use RCU callbacks to free any memory that might be accessed - * via the old update_util_data pointer or invoke synchronize_rcu() right after - * this function to avoid use-after-free. + * Callers must use RCU-sched callbacks to free any memory that might be + * accessed via the old update_util_data pointer or invoke synchronize_sched() + * right after this function to avoid use-after-free. */ void cpufreq_set_update_util_data(int cpu, struct update_util_data *data) { + if (WARN_ON(data && !data->func)) + return; + rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data); } EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data); @@ -95,18 +98,24 @@ EXPORT_SYMBOL_GPL(cpufreq_set_update_uti * * This function is called by the scheduler on every invocation of * update_load_avg() on the CPU whose utilization is being updated. + * + * It can only be called from RCU-sched read-side critical sections. 
*/ void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) { struct update_util_data *data; - rcu_read_lock(); +#ifdef CONFIG_LOCKDEP + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); +#endif - data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data)); - if (data && data->func) + data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)); + /* + * If this isn't inside of an RCU-sched read-side critical section, data + * may become NULL after the check below. + */ + if (data) data->func(data, time, util, max); - - rcu_read_unlock(); } /* Flag to suspend/resume CPUFreq governors */ Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -280,7 +280,7 @@ static inline void gov_clear_update_util for_each_cpu(i, policy->cpus) cpufreq_set_update_util_data(i, NULL); - synchronize_rcu(); + synchronize_sched(); } static void gov_cancel_work(struct cpufreq_policy *policy) Index: linux-pm/drivers/cpufreq/intel_pstate.c =================================================================== --- linux-pm.orig/drivers/cpufreq/intel_pstate.c +++ linux-pm/drivers/cpufreq/intel_pstate.c @@ -1174,7 +1174,7 @@ static void intel_pstate_stop_cpu(struct pr_debug("intel_pstate: CPU %d exiting\n", cpu_num); cpufreq_set_update_util_data(cpu_num, NULL); - synchronize_rcu(); + synchronize_sched(); if (hwp_active) return; @@ -1442,7 +1442,7 @@ out: for_each_online_cpu(cpu) { if (all_cpu_data[cpu]) { cpufreq_set_update_util_data(cpu, NULL); - synchronize_rcu(); + synchronize_sched(); kfree(all_cpu_data[cpu]); } } ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 1/10] cpufreq: Reduce cpufreq_update_util() overhead a bit 2016-03-04 2:58 ` [PATCH v2 1/10] cpufreq: Reduce cpufreq_update_util() overhead a bit Rafael J. Wysocki @ 2016-03-09 12:39 ` Peter Zijlstra 2016-03-09 14:17 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-09 12:39 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Fri, Mar 04, 2016 at 03:58:22AM +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Use the observation that cpufreq_update_util() is only called > by the scheduler with rq->lock held, so the callers of > cpufreq_set_update_util_data() can use synchronize_sched() > instead of synchronize_rcu() to wait for cpufreq_update_util() > to complete. Moreover, if they are updated to do that, > rcu_read_(un)lock() calls in cpufreq_update_util() might be > replaced with rcu_read_(un)lock_sched(), respectively, but > those aren't really necessary, because the scheduler calls > that function from RCU-sched read-side critical sections > already. > > In addition to that, if cpufreq_set_update_util_data() checks > the func field in the struct update_util_data before setting > the per-CPU pointer to it, the data->func check may be dropped > from cpufreq_update_util() as well. > > Make the above changes to reduce the overhead from > cpufreq_update_util() in the scheduler paths invoking it > and to make the cleanup after removing its callbacks less > heavy-weight somewhat. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Acked-by: Viresh Kumar <viresh.kumar@linaro.org> > --- > > Changes from the previous version: > - Use rcu_dereference_sched() in cpufreq_update_util(). Which I think also shows the WARN_ON I insisted upon is redundant. 
In any case, I cannot object to reducing overhead, esp. as this whole patch was suggested by me in the first place, so: Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> That said, how about the below? It avoids a function call. Ideally the whole thing would be a single direct function call, but because of the current situation with multiple governors we're stuck with the indirect call :/ --- drivers/cpufreq/cpufreq.c | 30 +----------------------------- include/linux/cpufreq.h | 33 +++++++++++++++++++++++++++------ 2 files changed, 28 insertions(+), 35 deletions(-) diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index b6dd41824368..d594bf18cb02 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -65,7 +65,7 @@ static struct cpufreq_driver *cpufreq_driver; static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data); static DEFINE_RWLOCK(cpufreq_driver_lock); -static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); +DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); /** * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer. @@ -90,34 +90,6 @@ void cpufreq_set_update_util_data(int cpu, struct update_util_data *data) } EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data); -/** - * cpufreq_update_util - Take a note about CPU utilization changes. - * @time: Current time. - * @util: Current utilization. - * @max: Utilization ceiling. - * - * This function is called by the scheduler on every invocation of - * update_load_avg() on the CPU whose utilization is being updated. - * - * It can only be called from RCU-sched read-side critical sections. 
- */ -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) -{ - struct update_util_data *data; - -#ifdef CONFIG_LOCKDEP - WARN_ON(debug_locks && !rcu_read_lock_sched_held()); -#endif - - data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)); - /* - * If this isn't inside of an RCU-sched read-side critical section, data - * may become NULL after the check below. - */ - if (data) - data->func(data, time, util, max); -} - /* Flag to suspend/resume CPUFreq governors */ static bool cpufreq_suspended; diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 277024ff2289..62d2a1d623e9 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -146,7 +146,33 @@ static inline bool policy_is_shared(struct cpufreq_policy *policy) extern struct kobject *cpufreq_global_kobject; #ifdef CONFIG_CPU_FREQ -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max); + +struct update_util_data { + void (*func)(struct update_util_data *data, + u64 time, unsigned long util, unsigned long max); +}; + +DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); + +/** + * cpufreq_update_util - Take a note about CPU utilization changes. + * @time: Current time. + * @util: Current utilization. + * @max: Utilization ceiling. + * + * This function is called by the scheduler on every invocation of + * update_load_avg() on the CPU whose utilization is being updated. + * + * It can only be called from RCU-sched read-side critical sections. + */ +static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) +{ + struct update_util_data *data; + + data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)); + if (data) + data->func(data, time, util, max); +} /** * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. 
@@ -169,11 +195,6 @@ static inline void cpufreq_trigger_update(u64 time) cpufreq_update_util(time, ULONG_MAX, 0); } -struct update_util_data { - void (*func)(struct update_util_data *data, - u64 time, unsigned long util, unsigned long max); -}; - void cpufreq_set_update_util_data(int cpu, struct update_util_data *data); unsigned int cpufreq_get(unsigned int cpu); ^ permalink raw reply related [flat|nested] 158+ messages in thread
* Re: [PATCH v2 1/10] cpufreq: Reduce cpufreq_update_util() overhead a bit 2016-03-09 12:39 ` Peter Zijlstra @ 2016-03-09 14:17 ` Rafael J. Wysocki 2016-03-09 15:29 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-09 14:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 9, 2016 at 1:39 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Mar 04, 2016 at 03:58:22AM +0100, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> Use the observation that cpufreq_update_util() is only called >> by the scheduler with rq->lock held, so the callers of >> cpufreq_set_update_util_data() can use synchronize_sched() >> instead of synchronize_rcu() to wait for cpufreq_update_util() >> to complete. Moreover, if they are updated to do that, >> rcu_read_(un)lock() calls in cpufreq_update_util() might be >> replaced with rcu_read_(un)lock_sched(), respectively, but >> those aren't really necessary, because the scheduler calls >> that function from RCU-sched read-side critical sections >> already. >> >> In addition to that, if cpufreq_set_update_util_data() checks >> the func field in the struct update_util_data before setting >> the per-CPU pointer to it, the data->func check may be dropped >> from cpufreq_update_util() as well. >> >> Make the above changes to reduce the overhead from >> cpufreq_update_util() in the scheduler paths invoking it >> and to make the cleanup after removing its callbacks less >> heavy-weight somewhat. >> >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> >> --- >> >> Changes from the previous version: >> - Use rcu_dereference_sched() in cpufreq_update_util(). 
> > Which I think also shows the WARN_ON I insisted upon is redundant. > > In any case, I cannot object to reducing overhead, esp. as this whole > patch was suggested by me in the first place, so: > > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Thanks! > That said, how about the below? It avoids a function call. That is fine by me. What about taking it a bit further, though, and moving the definition of cpufreq_update_util_data to somewhere under kernel/sched/ (like kernel/sched/cpufreq.c maybe)? Then, the whole static inline void cpufreq_update_util() definition can go into kernel/sched/sched.h (it doesn't have to be visible anywhere beyond kernel/sched/) and the only thing that needs to be exported to cpufreq will be a helper (or two), to set/clear the cpufreq_update_util_data pointers. I'll try to cut a patch doing that later today for illustration. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 1/10] cpufreq: Reduce cpufreq_update_util() overhead a bit 2016-03-09 14:17 ` Rafael J. Wysocki @ 2016-03-09 15:29 ` Peter Zijlstra 2016-03-09 21:35 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-09 15:29 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 09, 2016 at 03:17:48PM +0100, Rafael J. Wysocki wrote: > > That said, how about the below? It avoids a function call. > > That is fine by me. > > What about taking it a bit further, though, and moving the definition > of cpufreq_update_util_data to somewhere under kernel/sched/ (like > kernel/sched/cpufreq.c maybe)? > > Then, the whole static inline void cpufreq_update_util() definition > can go into kernel/sched/sched.h (it doesn't have to be visible > anywhere beyond kernel/sched/) and the only thing that needs to be > exported to cpufreq will be a helper (or two), to set/clear the > cpufreq_update_util_data pointers. > > I'll try to cut a patch doing that later today for illustration. Right, that's a blend with your second patch. Sure. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 1/10] cpufreq: Reduce cpufreq_update_util() overhead a bit 2016-03-09 15:29 ` Peter Zijlstra @ 2016-03-09 21:35 ` Rafael J. Wysocki 2016-03-10 9:19 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-09 21:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wednesday, March 09, 2016 04:29:34 PM Peter Zijlstra wrote: > On Wed, Mar 09, 2016 at 03:17:48PM +0100, Rafael J. Wysocki wrote: > > > That said, how about the below? It avoids a function call. > > > > That is fine by me. > > > > What about taking it a bit further, though, and moving the definition > > of cpufreq_update_util_data to somewhere under kernel/sched/ (like > > kernel/sched/cpufreq.c maybe)? > > > > Then, the whole static inline void cpufreq_update_util() definition > > can go into kernel/sched/sched.h (it doesn't have to be visible > > anywhere beyond kernel/sched/) and the only thing that needs to be > > exported to cpufreq will be a helper (or two), to set/clear the > > cpufreq_update_util_data pointers. > > > > I'll try to cut a patch doing that later today for illustration. > > Right, that's a blend with your second patch. Sure. OK, patch below. --- From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Subject: [PATCH] cpufreq: Move scheduler-related code to the sched directory Create cpufreq.c under kernel/sched/ and move the cpufreq code related to the scheduler to that file and to sched.h. Redefine cpufreq_update_util() as a static inline function to avoid function calls at its call sites in the scheduler code (as suggested by Peter Zijlstra). Also move the definition of struct update_util_data and declaration of cpufreq_set_update_util_data() from include/linux/cpufreq.h to include/linux/sched.h. Signed-off-by: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> --- drivers/cpufreq/cpufreq.c | 53 ------------------------------------- drivers/cpufreq/cpufreq_governor.c | 1 include/linux/cpufreq.h | 34 ----------------------- include/linux/sched.h | 9 ++++++ kernel/sched/Makefile | 1 kernel/sched/cpufreq.c | 37 +++++++++++++++++++++++++ kernel/sched/sched.h | 49 +++++++++++++++++++++++++++++++++- 7 files changed, 96 insertions(+), 88 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -65,59 +65,6 @@ static struct cpufreq_driver *cpufreq_dr static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data); static DEFINE_RWLOCK(cpufreq_driver_lock); -static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); - -/** - * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer. - * @cpu: The CPU to set the pointer for. - * @data: New pointer value. - * - * Set and publish the update_util_data pointer for the given CPU. That pointer - * points to a struct update_util_data object containing a callback function - * to call from cpufreq_update_util(). That function will be called from an RCU - * read-side critical section, so it must not sleep. - * - * Callers must use RCU-sched callbacks to free any memory that might be - * accessed via the old update_util_data pointer or invoke synchronize_sched() - * right after this function to avoid use-after-free. - */ -void cpufreq_set_update_util_data(int cpu, struct update_util_data *data) -{ - if (WARN_ON(data && !data->func)) - return; - - rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data); -} -EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data); - -/** - * cpufreq_update_util - Take a note about CPU utilization changes. - * @time: Current time. - * @util: Current utilization. - * @max: Utilization ceiling. 
- * - * This function is called by the scheduler on every invocation of - * update_load_avg() on the CPU whose utilization is being updated. - * - * It can only be called from RCU-sched read-side critical sections. - */ -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) -{ - struct update_util_data *data; - -#ifdef CONFIG_LOCKDEP - WARN_ON(debug_locks && !rcu_read_lock_sched_held()); -#endif - - data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)); - /* - * If this isn't inside of an RCU-sched read-side critical section, data - * may become NULL after the check below. - */ - if (data) - data->func(data, time, util, max); -} - /* Flag to suspend/resume CPUFreq governors */ static bool cpufreq_suspended; Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -146,36 +146,6 @@ static inline bool policy_is_shared(stru extern struct kobject *cpufreq_global_kobject; #ifdef CONFIG_CPU_FREQ -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max); - -/** - * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. - * @time: Current time. - * - * The way cpufreq is currently arranged requires it to evaluate the CPU - * performance state (frequency/voltage) on a regular basis to prevent it from - * being stuck in a completely inadequate performance level for too long. - * That is not guaranteed to happen if the updates are only triggered from CFS, - * though, because they may not be coming in if RT or deadline tasks are active - * all the time (or there are RT and DL tasks only). - * - * As a workaround for that issue, this function is called by the RT and DL - * sched classes to trigger extra cpufreq updates to prevent it from stalling, - * but that really is a band-aid. 
Going forward it should be replaced with - * solutions targeted more specifically at RT and DL tasks. - */ -static inline void cpufreq_trigger_update(u64 time) -{ - cpufreq_update_util(time, ULONG_MAX, 0); -} - -struct update_util_data { - void (*func)(struct update_util_data *data, - u64 time, unsigned long util, unsigned long max); -}; - -void cpufreq_set_update_util_data(int cpu, struct update_util_data *data); - unsigned int cpufreq_get(unsigned int cpu); unsigned int cpufreq_quick_get(unsigned int cpu); unsigned int cpufreq_quick_get_max(unsigned int cpu); @@ -187,10 +157,6 @@ int cpufreq_update_policy(unsigned int c bool have_governor_per_policy(void); struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy); #else -static inline void cpufreq_update_util(u64 time, unsigned long util, - unsigned long max) {} -static inline void cpufreq_trigger_update(u64 time) {} - static inline unsigned int cpufreq_get(unsigned int cpu) { return 0; Index: linux-pm/kernel/sched/cpufreq.c =================================================================== --- /dev/null +++ linux-pm/kernel/sched/cpufreq.c @@ -0,0 +1,37 @@ +/* + * Scheduler code and data structures related to cpufreq. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include "sched.h" + +DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); + +/** + * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer. + * @cpu: The CPU to set the pointer for. + * @data: New pointer value. + * + * Set and publish the update_util_data pointer for the given CPU. That pointer + * points to a struct update_util_data object containing a callback function + * to call from cpufreq_update_util(). 
That function will be called from an RCU + * read-side critical section, so it must not sleep. + * + * Callers must use RCU-sched callbacks to free any memory that might be + * accessed via the old update_util_data pointer or invoke synchronize_sched() + * right after this function to avoid use-after-free. + */ +void cpufreq_set_update_util_data(int cpu, struct update_util_data *data) +{ + if (WARN_ON(data && !data->func)) + return; + + rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data); +} +EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data); Index: linux-pm/kernel/sched/sched.h =================================================================== --- linux-pm.orig/kernel/sched/sched.h +++ linux-pm/kernel/sched/sched.h @@ -9,7 +9,6 @@ #include <linux/irq_work.h> #include <linux/tick.h> #include <linux/slab.h> -#include <linux/cpufreq.h> #include "cpupri.h" #include "cpudeadline.h" @@ -1739,3 +1738,51 @@ static inline u64 irq_time_read(int cpu) } #endif /* CONFIG_64BIT */ #endif /* CONFIG_IRQ_TIME_ACCOUNTING */ + +#ifdef CONFIG_CPU_FREQ +DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); + +/** + * cpufreq_update_util - Take a note about CPU utilization changes. + * @time: Current time. + * @util: Current utilization. + * @max: Utilization ceiling. + * + * This function is called by the scheduler on every invocation of + * update_load_avg() on the CPU whose utilization is being updated. + * + * It can only be called from RCU-sched read-side critical sections. + */ +static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) +{ + struct update_util_data *data; + + data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)); + if (data) + data->func(data, time, util, max); +} + +/** + * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. + * @time: Current time. 
+ * + * The way cpufreq is currently arranged requires it to evaluate the CPU + * performance state (frequency/voltage) on a regular basis to prevent it from + * being stuck in a completely inadequate performance level for too long. + * That is not guaranteed to happen if the updates are only triggered from CFS, + * though, because they may not be coming in if RT or deadline tasks are active + * all the time (or there are RT and DL tasks only). + * + * As a workaround for that issue, this function is called by the RT and DL + * sched classes to trigger extra cpufreq updates to prevent it from stalling, + * but that really is a band-aid. Going forward it should be replaced with + * solutions targeted more specifically at RT and DL tasks. + */ +static inline void cpufreq_trigger_update(u64 time) +{ + cpufreq_update_util(time, ULONG_MAX, 0); +} +#else +static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {} +static inline void cpufreq_trigger_update(u64 time) {} +#endif /* CONFIG_CPU_FREQ */ Index: linux-pm/include/linux/sched.h =================================================================== --- linux-pm.orig/include/linux/sched.h +++ linux-pm/include/linux/sched.h @@ -3207,4 +3207,13 @@ static inline unsigned long rlimit_max(u return task_rlimit_max(current, limit); } +#ifdef CONFIG_CPU_FREQ +struct update_util_data { + void (*func)(struct update_util_data *data, + u64 time, unsigned long util, unsigned long max); +}; + +void cpufreq_set_update_util_data(int cpu, struct update_util_data *data); +#endif /* CONFIG_CPU_FREQ */ + #endif Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -18,6 +18,7 @@ #include <linux/export.h> #include <linux/kernel_stat.h> +#include <linux/sched.h> #include <linux/slab.h> #include "cpufreq_governor.h" Index: 
linux-pm/kernel/sched/Makefile =================================================================== --- linux-pm.orig/kernel/sched/Makefile +++ linux-pm/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_gr obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_FREQ) += cpufreq.o
* Re: [PATCH v2 1/10] cpufreq: Reduce cpufreq_update_util() overhead a bit 2016-03-09 21:35 ` Rafael J. Wysocki @ 2016-03-10 9:19 ` Peter Zijlstra 0 siblings, 0 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-10 9:19 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 09, 2016 at 10:35:02PM +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Subject: [PATCH] cpufreq: Move scheduler-related code to the sched directory > > Create cpufreq.c under kernel/sched/ and move the cpufreq code > related to the scheduler to that file and to sched.h. > > Redefine cpufreq_update_util() as a static inline function to avoid > function calls at its call sites in the scheduler code (as suggested > by Peter Zijlstra). > > Also move the definition of struct update_util_data and declaration > of cpufreq_set_update_util_data() from include/linux/cpufreq.h to > include/linux/sched.h. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > drivers/cpufreq/cpufreq.c | 53 ------------------------------------- > drivers/cpufreq/cpufreq_governor.c | 1 > include/linux/cpufreq.h | 34 ----------------------- > include/linux/sched.h | 9 ++++++ > kernel/sched/Makefile | 1 > kernel/sched/cpufreq.c | 37 +++++++++++++++++++++++++ > kernel/sched/sched.h | 49 +++++++++++++++++++++++++++++++++- > 7 files changed, 96 insertions(+), 88 deletions(-) Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* [PATCH v2 2/10][Resend] cpufreq: acpi-cpufreq: Make read and write operations more efficient 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki 2016-03-04 2:58 ` [PATCH v2 1/10] cpufreq: Reduce cpufreq_update_util() overhead a bit Rafael J. Wysocki @ 2016-03-04 2:59 ` Rafael J. Wysocki 2016-03-04 3:01 ` [PATCH v2 3/10] cpufreq: governor: New data type for management part of dbs_data Rafael J. Wysocki ` (8 subsequent siblings) 10 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 2:59 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Setting a new CPU frequency and reading the current request value in the ACPI cpufreq driver each involve at least two switch instructions (more if the policy is shared). One of them is in drv_read/write(), which prepares a command structure, and the other happens in the subsequent do_drv_read/write() when that structure is interpreted. However, all of those switches may be avoided by using function pointers. To that end, add two function pointers to struct acpi_cpufreq_data to represent read and write operations on the frequency register and set them up during policy initialization to point to the pair of routines suitable for the given processor (Intel/AMD MSR access or I/O port access). Then, use those pointers in do_drv_read/write() and modify drv_read/write() to prepare the command structure for them without any checks. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- No changes.
--- drivers/cpufreq/acpi-cpufreq.c | 208 ++++++++++++++++++----------------------- 1 file changed, 95 insertions(+), 113 deletions(-) Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c @@ -70,6 +70,8 @@ struct acpi_cpufreq_data { unsigned int cpu_feature; unsigned int acpi_perf_cpu; cpumask_var_t freqdomain_cpus; + void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val); + u32 (*cpu_freq_read)(struct acpi_pct_register *reg); }; /* acpi_perf_data is a pointer to percpu data. */ @@ -243,125 +245,119 @@ static unsigned extract_freq(u32 val, st } } -struct msr_addr { - u32 reg; -}; +u32 cpu_freq_read_intel(struct acpi_pct_register *not_used) +{ + u32 val, dummy; -struct io_addr { - u16 port; - u8 bit_width; -}; + rdmsr(MSR_IA32_PERF_CTL, val, dummy); + return val; +} + +void cpu_freq_write_intel(struct acpi_pct_register *not_used, u32 val) +{ + u32 lo, hi; + + rdmsr(MSR_IA32_PERF_CTL, lo, hi); + lo = (lo & ~INTEL_MSR_RANGE) | (val & INTEL_MSR_RANGE); + wrmsr(MSR_IA32_PERF_CTL, lo, hi); +} + +u32 cpu_freq_read_amd(struct acpi_pct_register *not_used) +{ + u32 val, dummy; + + rdmsr(MSR_AMD_PERF_CTL, val, dummy); + return val; +} + +void cpu_freq_write_amd(struct acpi_pct_register *not_used, u32 val) +{ + wrmsr(MSR_AMD_PERF_CTL, val, 0); +} + +u32 cpu_freq_read_io(struct acpi_pct_register *reg) +{ + u32 val; + + acpi_os_read_port(reg->address, &val, reg->bit_width); + return val; +} + +void cpu_freq_write_io(struct acpi_pct_register *reg, u32 val) +{ + acpi_os_write_port(reg->address, val, reg->bit_width); +} struct drv_cmd { - unsigned int type; - const struct cpumask *mask; - union { - struct msr_addr msr; - struct io_addr io; - } addr; + struct acpi_pct_register *reg; u32 val; + union { + void (*write)(struct acpi_pct_register *reg, u32 val); + u32 (*read)(struct acpi_pct_register *reg); + } func; }; /* 
Called via smp_call_function_single(), on the target CPU */ static void do_drv_read(void *_cmd) { struct drv_cmd *cmd = _cmd; - u32 h; - switch (cmd->type) { - case SYSTEM_INTEL_MSR_CAPABLE: - case SYSTEM_AMD_MSR_CAPABLE: - rdmsr(cmd->addr.msr.reg, cmd->val, h); - break; - case SYSTEM_IO_CAPABLE: - acpi_os_read_port((acpi_io_address)cmd->addr.io.port, - &cmd->val, - (u32)cmd->addr.io.bit_width); - break; - default: - break; - } + cmd->val = cmd->func.read(cmd->reg); } -/* Called via smp_call_function_many(), on the target CPUs */ -static void do_drv_write(void *_cmd) +static u32 drv_read(struct acpi_cpufreq_data *data, const struct cpumask *mask) { - struct drv_cmd *cmd = _cmd; - u32 lo, hi; + struct acpi_processor_performance *perf = to_perf_data(data); + struct drv_cmd cmd = { + .reg = &perf->control_register, + .func.read = data->cpu_freq_read, + }; + int err; - switch (cmd->type) { - case SYSTEM_INTEL_MSR_CAPABLE: - rdmsr(cmd->addr.msr.reg, lo, hi); - lo = (lo & ~INTEL_MSR_RANGE) | (cmd->val & INTEL_MSR_RANGE); - wrmsr(cmd->addr.msr.reg, lo, hi); - break; - case SYSTEM_AMD_MSR_CAPABLE: - wrmsr(cmd->addr.msr.reg, cmd->val, 0); - break; - case SYSTEM_IO_CAPABLE: - acpi_os_write_port((acpi_io_address)cmd->addr.io.port, - cmd->val, - (u32)cmd->addr.io.bit_width); - break; - default: - break; - } + err = smp_call_function_any(mask, do_drv_read, &cmd, 1); + WARN_ON_ONCE(err); /* smp_call_function_any() was buggy? */ + return cmd.val; } -static void drv_read(struct drv_cmd *cmd) +/* Called via smp_call_function_many(), on the target CPUs */ +static void do_drv_write(void *_cmd) { - int err; - cmd->val = 0; + struct drv_cmd *cmd = _cmd; - err = smp_call_function_any(cmd->mask, do_drv_read, cmd, 1); - WARN_ON_ONCE(err); /* smp_call_function_any() was buggy? 
*/ + cmd->func.write(cmd->reg, cmd->val); } -static void drv_write(struct drv_cmd *cmd) +static void drv_write(struct acpi_cpufreq_data *data, + const struct cpumask *mask, u32 val) { + struct acpi_processor_performance *perf = to_perf_data(data); + struct drv_cmd cmd = { + .reg = &perf->control_register, + .val = val, + .func.write = data->cpu_freq_write, + }; int this_cpu; this_cpu = get_cpu(); - if (cpumask_test_cpu(this_cpu, cmd->mask)) - do_drv_write(cmd); - smp_call_function_many(cmd->mask, do_drv_write, cmd, 1); + if (cpumask_test_cpu(this_cpu, mask)) + do_drv_write(&cmd); + + smp_call_function_many(mask, do_drv_write, &cmd, 1); put_cpu(); } -static u32 -get_cur_val(const struct cpumask *mask, struct acpi_cpufreq_data *data) +static u32 get_cur_val(const struct cpumask *mask, struct acpi_cpufreq_data *data) { - struct acpi_processor_performance *perf; - struct drv_cmd cmd; + u32 val; if (unlikely(cpumask_empty(mask))) return 0; - switch (data->cpu_feature) { - case SYSTEM_INTEL_MSR_CAPABLE: - cmd.type = SYSTEM_INTEL_MSR_CAPABLE; - cmd.addr.msr.reg = MSR_IA32_PERF_CTL; - break; - case SYSTEM_AMD_MSR_CAPABLE: - cmd.type = SYSTEM_AMD_MSR_CAPABLE; - cmd.addr.msr.reg = MSR_AMD_PERF_CTL; - break; - case SYSTEM_IO_CAPABLE: - cmd.type = SYSTEM_IO_CAPABLE; - perf = to_perf_data(data); - cmd.addr.io.port = perf->control_register.address; - cmd.addr.io.bit_width = perf->control_register.bit_width; - break; - default: - return 0; - } - - cmd.mask = mask; - drv_read(&cmd); + val = drv_read(data, mask); - pr_debug("get_cur_val = %u\n", cmd.val); + pr_debug("get_cur_val = %u\n", val); - return cmd.val; + return val; } static unsigned int get_cur_freq_on_cpu(unsigned int cpu) @@ -416,7 +412,7 @@ static int acpi_cpufreq_target(struct cp { struct acpi_cpufreq_data *data = policy->driver_data; struct acpi_processor_performance *perf; - struct drv_cmd cmd; + const struct cpumask *mask; unsigned int next_perf_state = 0; /* Index into perf table */ int result = 0; @@ -438,37 
+434,17 @@ static int acpi_cpufreq_target(struct cp } } - switch (data->cpu_feature) { - case SYSTEM_INTEL_MSR_CAPABLE: - cmd.type = SYSTEM_INTEL_MSR_CAPABLE; - cmd.addr.msr.reg = MSR_IA32_PERF_CTL; - cmd.val = (u32) perf->states[next_perf_state].control; - break; - case SYSTEM_AMD_MSR_CAPABLE: - cmd.type = SYSTEM_AMD_MSR_CAPABLE; - cmd.addr.msr.reg = MSR_AMD_PERF_CTL; - cmd.val = (u32) perf->states[next_perf_state].control; - break; - case SYSTEM_IO_CAPABLE: - cmd.type = SYSTEM_IO_CAPABLE; - cmd.addr.io.port = perf->control_register.address; - cmd.addr.io.bit_width = perf->control_register.bit_width; - cmd.val = (u32) perf->states[next_perf_state].control; - break; - default: - return -ENODEV; - } - - /* cpufreq holds the hotplug lock, so we are safe from here on */ - if (policy->shared_type != CPUFREQ_SHARED_TYPE_ANY) - cmd.mask = policy->cpus; - else - cmd.mask = cpumask_of(policy->cpu); + /* + * The core won't allow CPUs to go away until the governor has been + * stopped, so we can rely on the stability of policy->cpus. + */ + mask = policy->shared_type == CPUFREQ_SHARED_TYPE_ANY ? 
+ cpumask_of(policy->cpu) : policy->cpus; - drv_write(&cmd); + drv_write(data, mask, perf->states[next_perf_state].control); if (acpi_pstate_strict) { - if (!check_freqs(cmd.mask, data->freq_table[index].frequency, + if (!check_freqs(mask, data->freq_table[index].frequency, data)) { pr_debug("acpi_cpufreq_target failed (%d)\n", policy->cpu); @@ -738,15 +714,21 @@ static int acpi_cpufreq_cpu_init(struct } pr_debug("SYSTEM IO addr space\n"); data->cpu_feature = SYSTEM_IO_CAPABLE; + data->cpu_freq_read = cpu_freq_read_io; + data->cpu_freq_write = cpu_freq_write_io; break; case ACPI_ADR_SPACE_FIXED_HARDWARE: pr_debug("HARDWARE addr space\n"); if (check_est_cpu(cpu)) { data->cpu_feature = SYSTEM_INTEL_MSR_CAPABLE; + data->cpu_freq_read = cpu_freq_read_intel; + data->cpu_freq_write = cpu_freq_write_intel; break; } if (check_amd_hwpstate_cpu(cpu)) { data->cpu_feature = SYSTEM_AMD_MSR_CAPABLE; + data->cpu_freq_read = cpu_freq_read_amd; + data->cpu_freq_write = cpu_freq_write_amd; break; } result = -ENODEV; ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v2 3/10] cpufreq: governor: New data type for management part of dbs_data 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki 2016-03-04 2:58 ` [PATCH v2 1/10] cpufreq: Reduce cpufreq_update_util() overhead a bit Rafael J. Wysocki 2016-03-04 2:59 ` [PATCH v2 2/10][Resend] cpufreq: acpi-cpufreq: Make read and write operations more efficient Rafael J. Wysocki @ 2016-03-04 3:01 ` Rafael J. Wysocki 2016-03-04 5:52 ` Viresh Kumar 2016-03-04 3:03 ` [PATCH v2 4/10] cpufreq: governor: Move abstract gov_attr_set code to seperate file Rafael J. Wysocki ` (7 subsequent siblings) 10 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 3:01 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> In addition to fields representing governor tunables, struct dbs_data contains some fields needed for the management of objects of that type. As it turns out, that part of struct dbs_data may be shared with (future) governors that won't use the common code used by "ondemand" and "conservative", so move it to a separate struct type and modify the code using struct dbs_data to follow. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Changes from the previous version: - The new data type is called gov_attr_set now (instead of gov_tunables) and some variable names etc have been changed to follow. 
--- drivers/cpufreq/cpufreq_conservative.c | 25 +++++---- drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++------------- drivers/cpufreq/cpufreq_governor.h | 35 +++++++----- drivers/cpufreq/cpufreq_ondemand.c | 29 ++++++---- 4 files changed, 107 insertions(+), 72 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -41,6 +41,13 @@ /* Ondemand Sampling types */ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; +struct gov_attr_set { + struct kobject kobj; + struct list_head policy_list; + struct mutex update_lock; + int usage_count; +}; + /* * Abbreviations: * dbs: used as a shortform for demand based switching It helps to keep variable @@ -52,7 +59,7 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; /* Governor demand based switching data (per-policy or global). */ struct dbs_data { - int usage_count; + struct gov_attr_set attr_set; void *tuners; unsigned int min_sampling_rate; unsigned int ignore_nice_load; @@ -60,37 +67,35 @@ struct dbs_data { unsigned int sampling_down_factor; unsigned int up_threshold; unsigned int io_is_busy; - - struct kobject kobj; - struct list_head policy_dbs_list; - /* - * Protect concurrent updates to governor tunables from sysfs, - * policy_dbs_list and usage_count. 
- */ - struct mutex mutex; }; +static inline struct dbs_data *to_dbs_data(struct gov_attr_set *attr_set) +{ + return container_of(attr_set, struct dbs_data, attr_set); +} + /* Governor's specific attributes */ -struct dbs_data; struct governor_attr { struct attribute attr; - ssize_t (*show)(struct dbs_data *dbs_data, char *buf); - ssize_t (*store)(struct dbs_data *dbs_data, const char *buf, + ssize_t (*show)(struct gov_attr_set *attr_set, char *buf); + ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf, size_t count); }; #define gov_show_one(_gov, file_name) \ static ssize_t show_##file_name \ -(struct dbs_data *dbs_data, char *buf) \ +(struct gov_attr_set *attr_set, char *buf) \ { \ + struct dbs_data *dbs_data = to_dbs_data(attr_set); \ struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \ return sprintf(buf, "%u\n", tuners->file_name); \ } #define gov_show_one_common(file_name) \ static ssize_t show_##file_name \ -(struct dbs_data *dbs_data, char *buf) \ +(struct gov_attr_set *attr_set, char *buf) \ { \ + struct dbs_data *dbs_data = to_dbs_data(attr_set); \ return sprintf(buf, "%u\n", dbs_data->file_name); \ } @@ -184,7 +189,7 @@ void od_register_powersave_bias_handler( (struct cpufreq_policy *, unsigned int, unsigned int), unsigned int powersave_bias); void od_unregister_powersave_bias_handler(void); -ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf, +ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf, size_t count); void gov_update_cpu_data(struct dbs_data *dbs_data); #endif /* _CPUFREQ_GOVERNOR_H */ Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -43,9 +43,10 @@ static DEFINE_MUTEX(gov_dbs_data_mutex); * This must be called with dbs_data->mutex held, otherwise traversing * policy_dbs_list isn't safe. 
*/ -ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf, +ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct policy_dbs_info *policy_dbs; unsigned int rate; int ret; @@ -59,7 +60,7 @@ ssize_t store_sampling_rate(struct dbs_d * We are operating under dbs_data->mutex and so the list and its * entries can't be freed concurrently. */ - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &attr_set->policy_list, list) { mutex_lock(&policy_dbs->timer_mutex); /* * On 32-bit architectures this may race with the @@ -96,7 +97,7 @@ void gov_update_cpu_data(struct dbs_data { struct policy_dbs_info *policy_dbs; - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) { unsigned int j; for_each_cpu(j, policy_dbs->policy->cpus) { @@ -111,9 +112,9 @@ void gov_update_cpu_data(struct dbs_data } EXPORT_SYMBOL_GPL(gov_update_cpu_data); -static inline struct dbs_data *to_dbs_data(struct kobject *kobj) +static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj) { - return container_of(kobj, struct dbs_data, kobj); + return container_of(kobj, struct gov_attr_set, kobj); } static inline struct governor_attr *to_gov_attr(struct attribute *attr) @@ -124,25 +125,24 @@ static inline struct governor_attr *to_g static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, char *buf) { - struct dbs_data *dbs_data = to_dbs_data(kobj); struct governor_attr *gattr = to_gov_attr(attr); - return gattr->show(dbs_data, buf); + return gattr->show(to_gov_attr_set(kobj), buf); } static ssize_t governor_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t count) { - struct dbs_data *dbs_data = to_dbs_data(kobj); + struct gov_attr_set *attr_set = to_gov_attr_set(kobj); struct governor_attr *gattr = 
to_gov_attr(attr); int ret = -EBUSY; - mutex_lock(&dbs_data->mutex); + mutex_lock(&attr_set->update_lock); - if (dbs_data->usage_count) - ret = gattr->store(dbs_data, buf, count); + if (attr_set->usage_count) + ret = gattr->store(attr_set, buf, count); - mutex_unlock(&dbs_data->mutex); + mutex_unlock(&attr_set->update_lock); return ret; } @@ -424,6 +424,41 @@ static void free_policy_dbs_info(struct gov->free(policy_dbs); } +static void gov_attr_set_init(struct gov_attr_set *attr_set, + struct list_head *list_node) +{ + INIT_LIST_HEAD(&attr_set->policy_list); + mutex_init(&attr_set->update_lock); + attr_set->usage_count = 1; + list_add(list_node, &attr_set->policy_list); +} + +static void gov_attr_set_get(struct gov_attr_set *attr_set, + struct list_head *list_node) +{ + mutex_lock(&attr_set->update_lock); + attr_set->usage_count++; + list_add(list_node, &attr_set->policy_list); + mutex_unlock(&attr_set->update_lock); +} + +static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, + struct list_head *list_node) +{ + unsigned int count; + + mutex_lock(&attr_set->update_lock); + list_del(list_node); + count = --attr_set->usage_count; + mutex_unlock(&attr_set->update_lock); + if (count) + return count; + + kobject_put(&attr_set->kobj); + mutex_destroy(&attr_set->update_lock); + return 0; +} + static int cpufreq_governor_init(struct cpufreq_policy *policy) { struct dbs_governor *gov = dbs_governor_of(policy); @@ -452,10 +487,7 @@ static int cpufreq_governor_init(struct policy_dbs->dbs_data = dbs_data; policy->governor_data = policy_dbs; - mutex_lock(&dbs_data->mutex); - dbs_data->usage_count++; - list_add(&policy_dbs->list, &dbs_data->policy_dbs_list); - mutex_unlock(&dbs_data->mutex); + gov_attr_set_get(&dbs_data->attr_set, &policy_dbs->list); goto out; } @@ -465,8 +497,7 @@ static int cpufreq_governor_init(struct goto free_policy_dbs_info; } - INIT_LIST_HEAD(&dbs_data->policy_dbs_list); - mutex_init(&dbs_data->mutex); + gov_attr_set_init(&dbs_data->attr_set, 
&policy_dbs->list); ret = gov->init(dbs_data, !policy->governor->initialized); if (ret) @@ -486,14 +517,11 @@ static int cpufreq_governor_init(struct if (!have_governor_per_policy()) gov->gdbs_data = dbs_data; - policy->governor_data = policy_dbs; - policy_dbs->dbs_data = dbs_data; - dbs_data->usage_count = 1; - list_add(&policy_dbs->list, &dbs_data->policy_dbs_list); + policy->governor_data = policy_dbs; gov->kobj_type.sysfs_ops = &governor_sysfs_ops; - ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type, + ret = kobject_init_and_add(&dbs_data->attr_set.kobj, &gov->kobj_type, get_governor_parent_kobj(policy), "%s", gov->gov.name); if (!ret) @@ -522,29 +550,21 @@ static int cpufreq_governor_exit(struct struct dbs_governor *gov = dbs_governor_of(policy); struct policy_dbs_info *policy_dbs = policy->governor_data; struct dbs_data *dbs_data = policy_dbs->dbs_data; - int count; + unsigned int count; /* Protect gov->gdbs_data against concurrent updates. */ mutex_lock(&gov_dbs_data_mutex); - mutex_lock(&dbs_data->mutex); - list_del(&policy_dbs->list); - count = --dbs_data->usage_count; - mutex_unlock(&dbs_data->mutex); + count = gov_attr_set_put(&dbs_data->attr_set, &policy_dbs->list); - if (!count) { - kobject_put(&dbs_data->kobj); - - policy->governor_data = NULL; + policy->governor_data = NULL; + if (!count) { if (!have_governor_per_policy()) gov->gdbs_data = NULL; gov->exit(dbs_data, policy->governor->initialized == 1); - mutex_destroy(&dbs_data->mutex); kfree(dbs_data); - } else { - policy->governor_data = NULL; } free_policy_dbs_info(policy_dbs, gov); Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c +++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c @@ -207,9 +207,10 @@ static unsigned int od_dbs_timer(struct /************************** sysfs interface ************************/ static struct dbs_governor od_dbs_gov; -static ssize_t 
store_io_is_busy(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_io_is_busy(struct gov_attr_set *attr_set, const char *buf, + size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; @@ -224,9 +225,10 @@ static ssize_t store_io_is_busy(struct d return count; } -static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_up_threshold(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -240,9 +242,10 @@ static ssize_t store_up_threshold(struct return count; } -static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct policy_dbs_info *policy_dbs; unsigned int input; int ret; @@ -254,7 +257,7 @@ static ssize_t store_sampling_down_facto dbs_data->sampling_down_factor = input; /* Reset down sampling multiplier in case it was active */ - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &attr_set->policy_list, list) { /* * Doing this without locking might lead to using different * rate_mult values in od_update() and od_dbs_timer(). 
@@ -267,9 +270,10 @@ static ssize_t store_sampling_down_facto return count; } -static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; @@ -291,9 +295,10 @@ static ssize_t store_ignore_nice_load(st return count; } -static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_powersave_bias(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct od_dbs_tuners *od_tuners = dbs_data->tuners; struct policy_dbs_info *policy_dbs; unsigned int input; @@ -308,7 +313,7 @@ static ssize_t store_powersave_bias(stru od_tuners->powersave_bias = input; - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) + list_for_each_entry(policy_dbs, &attr_set->policy_list, list) ondemand_powersave_bias_init(policy_dbs->policy); return count; Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c +++ linux-pm/drivers/cpufreq/cpufreq_conservative.c @@ -129,9 +129,10 @@ static struct notifier_block cs_cpufreq_ /************************** sysfs interface ************************/ static struct dbs_governor cs_dbs_gov; -static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -143,9 +144,10 @@ static ssize_t store_sampling_down_facto return count; } -static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf, - size_t count) 
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; @@ -158,9 +160,10 @@ static ssize_t store_up_threshold(struct return count; } -static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_down_threshold(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; @@ -175,9 +178,10 @@ static ssize_t store_down_threshold(stru return count; } -static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; @@ -199,9 +203,10 @@ static ssize_t store_ignore_nice_load(st return count; } -static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_freq_step(struct gov_attr_set *attr_set, const char *buf, + size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 3/10] cpufreq: governor: New data type for management part of dbs_data 2016-03-04 3:01 ` [PATCH v2 3/10] cpufreq: governor: New data type for management part of dbs_data Rafael J. Wysocki @ 2016-03-04 5:52 ` Viresh Kumar 0 siblings, 0 replies; 158+ messages in thread From: Viresh Kumar @ 2016-03-04 5:52 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Vincent Guittot, Michael Turquette, Ingo Molnar On 04-03-16, 04:01, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > In addition to fields representing governor tunables, struct dbs_data > contains some fields needed for the management of objects of that > type. As it turns out, that part of struct dbs_data may be shared > with (future) governors that won't use the common code used by > "ondemand" and "conservative", so move it to a separate struct type > and modify the code using struct dbs_data to follow. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > Changes from the previous version: > - The new data type is called gov_attr_set now (instead of gov_tunables) > and some variable names etc have been changed to follow. Acked-by: Viresh Kumar <viresh.kumar@linaro.org> -- viresh ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v2 4/10] cpufreq: governor: Move abstract gov_attr_set code to separate file 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki ` (2 preceding siblings ...) 2016-03-04 3:01 ` [PATCH v2 3/10] cpufreq: governor: New data type for management part of dbs_data Rafael J. Wysocki @ 2016-03-04 3:03 ` Rafael J. Wysocki 2016-03-04 5:52 ` Viresh Kumar 2016-03-04 3:05 ` [PATCH v2 5/10] cpufreq: Move governor attribute set headers to cpufreq.h Rafael J. Wysocki ` (6 subsequent siblings) 10 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 3:03 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Move abstract code related to struct gov_attr_set to a separate (new) file so it can be shared with (future) governors that won't share more code with "ondemand" and "conservative". No intentional functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Changes from the previous version: - Different name of the new file. - Different name of the new Kconfig symbol.
--- drivers/cpufreq/Kconfig | 4 + drivers/cpufreq/Makefile | 1 drivers/cpufreq/cpufreq_governor.c | 82 --------------------------- drivers/cpufreq/cpufreq_governor.h | 6 ++ drivers/cpufreq/cpufreq_governor_attr_set.c | 84 ++++++++++++++++++++++++++++ 5 files changed, 95 insertions(+), 82 deletions(-) Index: linux-pm/drivers/cpufreq/Kconfig =================================================================== --- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -18,7 +18,11 @@ config CPU_FREQ if CPU_FREQ +config CPU_FREQ_GOV_ATTR_SET + bool + config CPU_FREQ_GOV_COMMON + select CPU_FREQ_GOV_ATTR_SET select IRQ_WORK bool Index: linux-pm/drivers/cpufreq/Makefile =================================================================== --- linux-pm.orig/drivers/cpufreq/Makefile +++ linux-pm/drivers/cpufreq/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) += obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o +obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -112,53 +112,6 @@ void gov_update_cpu_data(struct dbs_data } EXPORT_SYMBOL_GPL(gov_update_cpu_data); -static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj) -{ - return container_of(kobj, struct gov_attr_set, kobj); -} - -static inline struct governor_attr *to_gov_attr(struct attribute *attr) -{ - return container_of(attr, struct governor_attr, attr); -} - -static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, - char *buf) -{ - struct governor_attr *gattr = to_gov_attr(attr); - - return gattr->show(to_gov_attr_set(kobj), buf); -} - -static 
ssize_t governor_store(struct kobject *kobj, struct attribute *attr, - const char *buf, size_t count) -{ - struct gov_attr_set *attr_set = to_gov_attr_set(kobj); - struct governor_attr *gattr = to_gov_attr(attr); - int ret = -EBUSY; - - mutex_lock(&attr_set->update_lock); - - if (attr_set->usage_count) - ret = gattr->store(attr_set, buf, count); - - mutex_unlock(&attr_set->update_lock); - - return ret; -} - -/* - * Sysfs Ops for accessing governor attributes. - * - * All show/store invocations for governor specific sysfs attributes, will first - * call the below show/store callbacks and the attribute specific callback will - * be called from within it. - */ -static const struct sysfs_ops governor_sysfs_ops = { - .show = governor_show, - .store = governor_store, -}; - unsigned int dbs_update(struct cpufreq_policy *policy) { struct policy_dbs_info *policy_dbs = policy->governor_data; @@ -424,41 +377,6 @@ static void free_policy_dbs_info(struct gov->free(policy_dbs); } -static void gov_attr_set_init(struct gov_attr_set *attr_set, - struct list_head *list_node) -{ - INIT_LIST_HEAD(&attr_set->policy_list); - mutex_init(&attr_set->update_lock); - attr_set->usage_count = 1; - list_add(list_node, &attr_set->policy_list); -} - -static void gov_attr_set_get(struct gov_attr_set *attr_set, - struct list_head *list_node) -{ - mutex_lock(&attr_set->update_lock); - attr_set->usage_count++; - list_add(list_node, &attr_set->policy_list); - mutex_unlock(&attr_set->update_lock); -} - -static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, - struct list_head *list_node) -{ - unsigned int count; - - mutex_lock(&attr_set->update_lock); - list_del(list_node); - count = --attr_set->usage_count; - mutex_unlock(&attr_set->update_lock); - if (count) - return count; - - kobject_put(&attr_set->kobj); - mutex_destroy(&attr_set->update_lock); - return 0; -} - static int cpufreq_governor_init(struct cpufreq_policy *policy) { struct dbs_governor *gov = dbs_governor_of(policy); Index: 
linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -48,6 +48,12 @@ struct gov_attr_set { int usage_count; }; +extern const struct sysfs_ops governor_sysfs_ops; + +void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node); +void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node); +unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node); + /* * Abbreviations: * dbs: used as a shortform for demand based switching It helps to keep variable Index: linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c =================================================================== --- /dev/null +++ linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c @@ -0,0 +1,84 @@ +/* + * Abstract code for CPUFreq governor tunable sysfs attributes. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#include "cpufreq_governor.h" + +static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj) +{ + return container_of(kobj, struct gov_attr_set, kobj); +} + +static inline struct governor_attr *to_gov_attr(struct attribute *attr) +{ + return container_of(attr, struct governor_attr, attr); +} + +static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, + char *buf) +{ + struct governor_attr *gattr = to_gov_attr(attr); + + return gattr->show(to_gov_attr_set(kobj), buf); +} + +static ssize_t governor_store(struct kobject *kobj, struct attribute *attr, + const char *buf, size_t count) +{ + struct gov_attr_set *attr_set = to_gov_attr_set(kobj); + struct governor_attr *gattr = to_gov_attr(attr); + int ret; + + mutex_lock(&attr_set->update_lock); + ret = attr_set->usage_count ? gattr->store(attr_set, buf, count) : -EBUSY; + mutex_unlock(&attr_set->update_lock); + return ret; +} + +const struct sysfs_ops governor_sysfs_ops = { + .show = governor_show, + .store = governor_store, +}; +EXPORT_SYMBOL_GPL(governor_sysfs_ops); + +void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node) +{ + INIT_LIST_HEAD(&attr_set->policy_list); + mutex_init(&attr_set->update_lock); + attr_set->usage_count = 1; + list_add(list_node, &attr_set->policy_list); +} +EXPORT_SYMBOL_GPL(gov_attr_set_init); + +void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node) +{ + mutex_lock(&attr_set->update_lock); + attr_set->usage_count++; + list_add(list_node, &attr_set->policy_list); + mutex_unlock(&attr_set->update_lock); +} +EXPORT_SYMBOL_GPL(gov_attr_set_get); + +unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node) +{ + unsigned int count; + + mutex_lock(&attr_set->update_lock); + list_del(list_node); + count = --attr_set->usage_count; + mutex_unlock(&attr_set->update_lock); + if (count) + return count; + + kobject_put(&attr_set->kobj); + mutex_destroy(&attr_set->update_lock); + 
return 0; +} +EXPORT_SYMBOL_GPL(gov_attr_set_put); ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 4/10] cpufreq: governor: Move abstract gov_attr_set code to separate file 2016-03-04 3:03 ` [PATCH v2 4/10] cpufreq: governor: Move abstract gov_attr_set code to separate file Rafael J. Wysocki @ 2016-03-04 5:52 ` Viresh Kumar 0 siblings, 0 replies; 158+ messages in thread From: Viresh Kumar @ 2016-03-04 5:52 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Vincent Guittot, Michael Turquette, Ingo Molnar On 04-03-16, 04:03, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Move abstract code related to struct gov_attr_set to a separate (new) > file so it can be shared with (future) governors that won't share > more code with "ondemand" and "conservative". > > No intentional functional changes. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > Changes from the previous version: > - Different name of the new file. > - Different name of the new Kconfig symbol. Acked-by: Viresh Kumar <viresh.kumar@linaro.org> -- viresh ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v2 5/10] cpufreq: Move governor attribute set headers to cpufreq.h 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki ` (3 preceding siblings ...) 2016-03-04 3:03 ` [PATCH v2 4/10] cpufreq: governor: Move abstract gov_attr_set code to separate file Rafael J. Wysocki @ 2016-03-04 3:05 ` Rafael J. Wysocki 2016-03-04 5:53 ` Viresh Kumar 2016-03-04 3:07 ` [PATCH v2 6/10] cpufreq: Support for fast frequency switching Rafael J. Wysocki ` (5 subsequent siblings) 10 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 3:05 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Move definitions and function headers related to struct gov_attr_set to include/linux/cpufreq.h so they can be used by (future) governors located outside of drivers/cpufreq/. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- New patch. Needed to move cpufreq_schedutil.c to kernel/sched/.
--- drivers/cpufreq/cpufreq_governor.h | 21 --------------------- include/linux/cpufreq.h | 23 +++++++++++++++++++++++ 2 files changed, 23 insertions(+), 21 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -41,19 +41,6 @@ /* Ondemand Sampling types */ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; -struct gov_attr_set { - struct kobject kobj; - struct list_head policy_list; - struct mutex update_lock; - int usage_count; -}; - -extern const struct sysfs_ops governor_sysfs_ops; - -void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node); -void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node); -unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node); - /* * Abbreviations: * dbs: used as a shortform for demand based switching It helps to keep variable @@ -80,14 +67,6 @@ static inline struct dbs_data *to_dbs_da return container_of(attr_set, struct dbs_data, attr_set); } -/* Governor's specific attributes */ -struct governor_attr { - struct attribute attr; - ssize_t (*show)(struct gov_attr_set *attr_set, char *buf); - ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf, - size_t count); -}; - #define gov_show_one(_gov, file_name) \ static ssize_t show_##file_name \ (struct gov_attr_set *attr_set, char *buf) \ Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -462,6 +462,29 @@ void cpufreq_unregister_governor(struct struct cpufreq_governor *cpufreq_default_governor(void); struct cpufreq_governor *cpufreq_fallback_governor(void); +/* Governor attribute set */ +struct gov_attr_set { + struct kobject kobj; + struct list_head policy_list; + struct mutex 
update_lock; + int usage_count; +}; + +/* sysfs ops for cpufreq governors */ +extern const struct sysfs_ops governor_sysfs_ops; + +void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node); +void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node); +unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node); + +/* Governor sysfs attribute */ +struct governor_attr { + struct attribute attr; + ssize_t (*show)(struct gov_attr_set *attr_set, char *buf); + ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf, + size_t count); +}; + /********************************************************************* * FREQUENCY TABLE HELPERS * *********************************************************************/ ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 5/10] cpufreq: Move governor attribute set headers to cpufreq.h 2016-03-04 3:05 ` [PATCH v2 5/10] cpufreq: Move governor attribute set headers to cpufreq.h Rafael J. Wysocki @ 2016-03-04 5:53 ` Viresh Kumar 0 siblings, 0 replies; 158+ messages in thread From: Viresh Kumar @ 2016-03-04 5:53 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Vincent Guittot, Michael Turquette, Ingo Molnar On 04-03-16, 04:05, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Move definitions and function headers related to struct gov_attr_set > to include/linux/cpufreq.h so they can be used by (future) governors > located outside of drivers/cpufreq/. > > No functional changes. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > New patch. Needed to move cpufreq_schedutil.c to kernel/sched/. Acked-by: Viresh Kumar <viresh.kumar@linaro.org> -- viresh ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki ` (4 preceding siblings ...) 2016-03-04 3:05 ` [PATCH v2 5/10] cpufreq: Move governor attribute set headers to cpufreq.h Rafael J. Wysocki @ 2016-03-04 3:07 ` Rafael J. Wysocki 2016-03-04 22:18 ` Steve Muckle 2016-03-04 3:12 ` [PATCH v2 7/10] cpufreq: Rework the scheduler hooks for triggering updates Rafael J. Wysocki ` (4 subsequent siblings) 10 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 3:07 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Modify the ACPI cpufreq driver to provide a method for switching CPU frequencies from interrupt context and update the cpufreq core to support that method if available. Introduce a new cpufreq driver callback, ->fast_switch, to be invoked for frequency switching from interrupt context via a new helper function, cpufreq_driver_fast_switch(). Add a new policy flag, fast_switch_possible, to be set if fast frequency switching can be used for the given policy. Implement the ->fast_switch callback in the ACPI cpufreq driver and make it set fast_switch_possible during policy initialization as appropriate. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Changes from the previous version: - Drop a bogus check from cpufreq_driver_fast_switch(). 
--- drivers/cpufreq/acpi-cpufreq.c | 53 +++++++++++++++++++++++++++++++++++++++++ drivers/cpufreq/cpufreq.c | 30 +++++++++++++++++++++++ include/linux/cpufreq.h | 6 ++++ 3 files changed, 89 insertions(+) Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c @@ -458,6 +458,55 @@ static int acpi_cpufreq_target(struct cp return result; } +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq, + unsigned int relation) +{ + struct acpi_cpufreq_data *data = policy->driver_data; + struct acpi_processor_performance *perf; + struct cpufreq_frequency_table *entry, *found; + unsigned int next_perf_state, next_freq, freq; + + /* + * Find the closest frequency above target_freq or equal to it. + * + * The table is sorted in the reverse order with respect to the + * frequency and all of the entries are valid (see the initialization). + */ + entry = data->freq_table; + do { + entry++; + freq = entry->frequency; + } while (freq >= target_freq && freq != CPUFREQ_TABLE_END); + found = entry - 1; + /* + * Use the one found or the previous one, depending on the relation. + * CPUFREQ_RELATION_H is not taken into account here, but it is not + * expected to be passed to this function anyway. 
+ */ + next_freq = found->frequency; + if (freq == CPUFREQ_TABLE_END || relation != CPUFREQ_RELATION_C || + target_freq - freq >= next_freq - target_freq) { + next_perf_state = found->driver_data; + } else { + next_freq = freq; + next_perf_state = entry->driver_data; + } + + perf = to_perf_data(data); + if (perf->state == next_perf_state) { + if (unlikely(data->resume)) + data->resume = 0; + else + return next_freq; + } + + data->cpu_freq_write(&perf->control_register, + perf->states[next_perf_state].control); + perf->state = next_perf_state; + return next_freq; +} + static unsigned long acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu) { @@ -740,6 +789,9 @@ static int acpi_cpufreq_cpu_init(struct goto err_unreg; } + policy->fast_switch_possible = !acpi_pstate_strict && + !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY); + data->freq_table = kzalloc(sizeof(*data->freq_table) * (perf->state_count+1), GFP_KERNEL); if (!data->freq_table) { @@ -874,6 +926,7 @@ static struct freq_attr *acpi_cpufreq_at static struct cpufreq_driver acpi_cpufreq_driver = { .verify = cpufreq_generic_frequency_table_verify, .target_index = acpi_cpufreq_target, + .fast_switch = acpi_cpufreq_fast_switch, .bios_limit = acpi_processor_get_bios_limit, .init = acpi_cpufreq_cpu_init, .exit = acpi_cpufreq_cpu_exit, Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -1719,6 +1719,36 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie * GOVERNORS * *********************************************************************/ +/** + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch. + * @policy: cpufreq policy to switch the frequency for. + * @target_freq: New frequency to set (may be approximate). + * @relation: Relation to use for frequency selection. 
+ * + * Carry out a fast frequency switch from interrupt context. + * + * This function must not be called if policy->fast_switch_possible is unset. + * + * Governors calling this function must guarantee that it will never be invoked + * twice in parallel for the same policy and that it will never be called in + * parallel with either ->target() or ->target_index() for the same policy. + * + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch() + * callback, the hardware configuration must be preserved. + */ +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq, unsigned int relation) +{ + unsigned int freq; + + freq = cpufreq_driver->fast_switch(policy, target_freq, relation); + if (freq != CPUFREQ_ENTRY_INVALID) { + policy->cur = freq; + trace_cpu_frequency(freq, smp_processor_id()); + } +} +EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch); + /* Must set freqs->new to intermediate frequency */ static int __target_intermediate(struct cpufreq_policy *policy, struct cpufreq_freqs *freqs, int index) Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -81,6 +81,7 @@ struct cpufreq_policy { struct cpufreq_governor *governor; /* see below */ void *governor_data; char last_governor[CPUFREQ_NAME_LEN]; /* last governor used */ + bool fast_switch_possible; struct work_struct update; /* if update_policy() needs to be * called, but you're in IRQ context */ @@ -236,6 +237,9 @@ struct cpufreq_driver { unsigned int relation); /* Deprecated */ int (*target_index)(struct cpufreq_policy *policy, unsigned int index); + unsigned int (*fast_switch)(struct cpufreq_policy *policy, + unsigned int target_freq, + unsigned int relation); /* * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION * unset. 
@@ -450,6 +454,8 @@ struct cpufreq_governor { }; /* Pass a target to the cpufreq driver */ +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq, unsigned int relation); int cpufreq_driver_target(struct cpufreq_policy *policy, unsigned int target_freq, unsigned int relation); ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-04 3:07 ` [PATCH v2 6/10] cpufreq: Support for fast frequency switching Rafael J. Wysocki @ 2016-03-04 22:18 ` Steve Muckle 2016-03-04 22:32 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Steve Muckle @ 2016-03-04 22:18 UTC (permalink / raw) To: Rafael J. Wysocki, Linux PM list Cc: Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On 03/03/2016 07:07 PM, Rafael J. Wysocki wrote: > +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, > + unsigned int target_freq, unsigned int relation) > +{ > + unsigned int freq; > + > + freq = cpufreq_driver->fast_switch(policy, target_freq, relation); > + if (freq != CPUFREQ_ENTRY_INVALID) { > + policy->cur = freq; > + trace_cpu_frequency(freq, smp_processor_id()); > + } > +} Even if there are platforms which may change the CPU frequency behind cpufreq's back, breaking the transition notifiers, I'm worried about the addition of an interface which itself breaks them. The platforms which do change CPU frequency on their own have probably evolved to live with or work around this behavior. As other platforms migrate to fast frequency switching they might be surprised when things don't work as advertised. I'm not sure what the easiest way to deal with this is. I see the transition notifiers are the srcu type, which I understand to be blocking. Going through the tree and reworking everyone's callbacks and changing the type to atomic is obviously not realistic. How about modifying cpufreq_register_notifier to return an error if the driver has a fast_switch callback installed and an attempt to register a transition notifier is made? In the future, perhaps an additional atomic transition callback type can be added, which platform/driver owners can switch to if they wish to use fast transitions with their platform. 
thanks, Steve ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-04 22:18 ` Steve Muckle @ 2016-03-04 22:32 ` Rafael J. Wysocki 2016-03-04 22:40 ` Rafael J. Wysocki 2016-03-04 22:58 ` Rafael J. Wysocki 0 siblings, 2 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 22:32 UTC (permalink / raw) To: Steve Muckle Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Fri, Mar 4, 2016 at 11:18 PM, Steve Muckle <steve.muckle@linaro.org> wrote: > On 03/03/2016 07:07 PM, Rafael J. Wysocki wrote: >> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, >> + unsigned int target_freq, unsigned int relation) >> +{ >> + unsigned int freq; >> + >> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation); >> + if (freq != CPUFREQ_ENTRY_INVALID) { >> + policy->cur = freq; >> + trace_cpu_frequency(freq, smp_processor_id()); >> + } >> +} > > Even if there are platforms which may change the CPU frequency behind > cpufreq's back, breaking the transition notifiers, I'm worried about the > addition of an interface which itself breaks them. The platforms which > do change CPU frequency on their own have probably evolved to live with > or work around this behavior. As other platforms migrate to fast > frequency switching they might be surprised when things don't work as > advertised. Well, intel_pstate doesn't do notifies at all, so anything depending on them is already broken when it is used. Let alone the hardware P-states coordination mechanism (HWP) where the frequency is controlled by the processor itself entirely. That said I see your point. > I'm not sure what the easiest way to deal with this is. I see the > transition notifiers are the srcu type, which I understand to be > blocking. 
Going through the tree and reworking everyone's callbacks and > changing the type to atomic is obviously not realistic. Right. > How about modifying cpufreq_register_notifier to return an error if the > driver has a fast_switch callback installed and an attempt to register a > transition notifier is made? That sounds like a good idea. There also is the CPUFREQ_ASYNC_NOTIFICATION driver flag that in principle might be used as a workaround, but I'm not sure how much work that would require ATM. > In the future, perhaps an additional atomic transition callback type can > be added, which platform/driver owners can switch to if they wish to use > fast transitions with their platform. I guess you mean an atomic notification mechanism based on registering callbacks? While technically viable that's somewhat risky, because we are in a fast path and allowing anyone to add stuff to it would be asking for trouble IMO. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-04 22:32 ` Rafael J. Wysocki @ 2016-03-04 22:40 ` Rafael J. Wysocki 2016-03-04 23:18 ` Rafael J. Wysocki 2016-03-04 22:58 ` Rafael J. Wysocki 1 sibling, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 22:40 UTC (permalink / raw) To: Steve Muckle Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Fri, Mar 4, 2016 at 11:32 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Fri, Mar 4, 2016 at 11:18 PM, Steve Muckle <steve.muckle@linaro.org> wrote: >> On 03/03/2016 07:07 PM, Rafael J. Wysocki wrote: >>> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, >>> + unsigned int target_freq, unsigned int relation) >>> +{ >>> + unsigned int freq; >>> + >>> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation); >>> + if (freq != CPUFREQ_ENTRY_INVALID) { >>> + policy->cur = freq; >>> + trace_cpu_frequency(freq, smp_processor_id()); >>> + } >>> +} >> >> Even if there are platforms which may change the CPU frequency behind >> cpufreq's back, breaking the transition notifiers, I'm worried about the >> addition of an interface which itself breaks them. The platforms which >> do change CPU frequency on their own have probably evolved to live with >> or work around this behavior. As other platforms migrate to fast >> frequency switching they might be surprised when things don't work as >> advertised. > > Well, intel_pstate doesn't do notifies at all, so anything depending > on them is already broken when it is used. Let alone the hardware > P-states coordination mechanism (HWP) where the frequency is > controlled by the processor itself entirely. > > That said I see your point. > >> I'm not sure what the easiest way to deal with this is. 
I see the >> transition notifiers are the srcu type, which I understand to be >> blocking. Going through the tree and reworking everyone's callbacks and >> changing the type to atomic is obviously not realistic. > > Right. > >> How about modifying cpufreq_register_notifier to return an error if the >> driver has a fast_switch callback installed and an attempt to register a >> transition notifier is made? > > That sounds like a good idea. > > There also is the CPUFREQ_ASYNC_NOTIFICATION driver flag that in > principle might be used as a workaround, but I'm not sure how much > work that would require ATM. What I mean is that drivers using it are supposed to handle the notifications by calling cpufreq_freq_transition_begin()/end() by themselves, so theoretically there is a mechanism already in place for that. I guess what might be done would be to spawn a work item to carry out a notify when the frequency changes. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-04 22:40 ` Rafael J. Wysocki @ 2016-03-04 23:18 ` Rafael J. Wysocki 2016-03-04 23:56 ` Steve Muckle 2016-03-05 16:49 ` Peter Zijlstra 0 siblings, 2 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 23:18 UTC (permalink / raw) To: Steve Muckle Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Fri, Mar 4, 2016 at 11:40 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Fri, Mar 4, 2016 at 11:32 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: >> On Fri, Mar 4, 2016 at 11:18 PM, Steve Muckle <steve.muckle@linaro.org> wrote: >>> On 03/03/2016 07:07 PM, Rafael J. Wysocki wrote: >>>> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, >>>> + unsigned int target_freq, unsigned int relation) >>>> +{ >>>> + unsigned int freq; >>>> + >>>> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation); >>>> + if (freq != CPUFREQ_ENTRY_INVALID) { >>>> + policy->cur = freq; >>>> + trace_cpu_frequency(freq, smp_processor_id()); >>>> + } >>>> +} >>> >>> Even if there are platforms which may change the CPU frequency behind >>> cpufreq's back, breaking the transition notifiers, I'm worried about the >>> addition of an interface which itself breaks them. The platforms which >>> do change CPU frequency on their own have probably evolved to live with >>> or work around this behavior. As other platforms migrate to fast >>> frequency switching they might be surprised when things don't work as >>> advertised. >> >> Well, intel_pstate doesn't do notifies at all, so anything depending >> on them is already broken when it is used. Let alone the hardware >> P-states coordination mechanism (HWP) where the frequency is >> controlled by the processor itself entirely. >> >> That said I see your point. 
>> >>> I'm not sure what the easiest way to deal with this is. I see the >>> transition notifiers are the srcu type, which I understand to be >>> blocking. Going through the tree and reworking everyone's callbacks and >>> changing the type to atomic is obviously not realistic. >> >> Right. >> >>> How about modifying cpufreq_register_notifier to return an error if the >>> driver has a fast_switch callback installed and an attempt to register a >>> transition notifier is made? >> >> That sounds like a good idea. >> >> There also is the CPUFREQ_ASYNC_NOTIFICATION driver flag that in >> principle might be used as a workaround, but I'm not sure how much >> work that would require ATM. > > What I mean is that drivers using it are supposed to handle the > notifications by calling cpufreq_freq_transition_begin(/end() by > themselves, so theoretically there is a mechanism already in place for > that. > > I guess what might be done would be to spawn a work item to carry out > a notify when the frequency changes. In fact, the mechanism may be relatively simple if I'm not mistaken. In the "fast switch" case, the governor may spawn a work item that will just execute cpufreq_get() on policy->cpu. That will notice that policy->cur is different from the real current frequency and will re-adjust. Of course, cpufreq_driver_fast_switch() will need to be modified so it doesn't update policy->cur then perhaps with a comment that the governor using it will be responsible for that. And the governor will need to avoid spawning that work item too often (basically, if one has been spawned already and hasn't completed, no need to spawn a new one, and maybe rate-limit it?), but all that looks reasonably straightforward. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-04 23:18 ` Rafael J. Wysocki @ 2016-03-04 23:56 ` Steve Muckle 2016-03-05 0:18 ` Rafael J. Wysocki 2016-03-05 16:49 ` Peter Zijlstra 1 sibling, 1 reply; 158+ messages in thread From: Steve Muckle @ 2016-03-04 23:56 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On 03/04/2016 03:18 PM, Rafael J. Wysocki wrote: > In fact, the mechanism may be relatively simple if I'm not mistaken. > > In the "fast switch" case, the governor may spawn a work item that > will just execute cpufreq_get() on policy->cpu. That will notice that > policy->cur is different from the real current frequency and will > re-adjust. > > Of course, cpufreq_driver_fast_switch() will need to be modified so it > doesn't update policy->cur then perhaps with a comment that the > governor using it will be responsible for that. > > And the governor will need to avoid spawning that work item too often > (basically, if one has been spawned already and hasn't completed, no > need to spawn a new one, and maybe rate-limit it?), but all that looks > reasonably straightforward. It is another option though definitely a compromise. The semantics seem different since you'd potentially have multiple freq changes before a single notifier went through, so stuff might still break. The fast path would also be more expensive given the workqueue activity that could translate into additional task wakeups. Honestly I wonder if it's better to just try the "no notifiers with fast drivers" approach to start. The notifiers could always be added if platform owners complain that they absolutely require them. thanks, Steve ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-04 23:56 ` Steve Muckle @ 2016-03-05 0:18 ` Rafael J. Wysocki 2016-03-05 11:58 ` Ingo Molnar 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-05 0:18 UTC (permalink / raw) To: Steve Muckle Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Sat, Mar 5, 2016 at 12:56 AM, Steve Muckle <steve.muckle@linaro.org> wrote: > On 03/04/2016 03:18 PM, Rafael J. Wysocki wrote: >> In fact, the mechanism may be relatively simple if I'm not mistaken. >> >> In the "fast switch" case, the governor may spawn a work item that >> will just execute cpufreq_get() on policy->cpu. That will notice that >> policy->cur is different from the real current frequency and will >> re-adjust. >> >> Of course, cpufreq_driver_fast_switch() will need to be modified so it >> doesn't update policy->cur then perhaps with a comment that the >> governor using it will be responsible for that. >> >> And the governor will need to avoid spawning that work item too often >> (basically, if one has been spawned already and hasn't completed, no >> need to spawn a new one, and maybe rate-limit it?), but all that looks >> reasonably straightforward. > > It is another option though definitely a compromise. The semantics seem > different since you'd potentially have multiple freq changes before a > single notifier went through, so stuff might still break. Here I'm not worried. That's basically equivalent to someone doing a "get" and seeing an unexpected frequency in the driver output which is covered already and things need to cope with it or they are just really broken. > The fast path would also be more expensive given the workqueue activity that could > translate into additional task wakeups. 
That's a valid concern, so maybe there can be a driver flag to indicate that this has to be done if ->fast_switch is in use? Or something like fast_switch_notify_rate that will tell the governor how often to notify things about transitions if ->fast_switch is in use with either 0 or all ones meaning "never"? That might be a policy property even, so the driver may set this depending on what platform it is used on. > Honestly I wonder if it's better to just try the "no notifiers with fast > drivers" approach to start. The notifiers could always be added if > platform owners complain that they absolutely require them. Well, I'm not sure what happens if we start to fail notifier registrations. It may not be a well tested error code path. :-) Besides, there is the problem with registering notifiers before the driver and I don't think we can fail driver registration if notifiers have already been registered. We may not be able to register a "fast" driver at all in that case. But that whole thing is your worry, not mine. :-) Had I been worrying about that, I would have added some bandaid for that to the patches. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-05 0:18 ` Rafael J. Wysocki @ 2016-03-05 11:58 ` Ingo Molnar 0 siblings, 0 replies; 158+ messages in thread From: Ingo Molnar @ 2016-03-05 11:58 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Steve Muckle, Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Thomas Gleixner, Peter Zijlstra * Rafael J. Wysocki <rafael@kernel.org> wrote: > > Honestly I wonder if it's better to just try the "no notifiers with fast > > drivers" approach to start. The notifiers could always be added if platform > > owners complain that they absolutely require them. > > Well, I'm not sure what happens if we start to fail notifier registrations. It > may not be a well tested error code path. :-) Yeah, so as a general principle 'struct notifier_block' is a really bad interface with poor and fragile semantics, and we are trying to get rid of them everywhere from core kernel code. For example Thomas Gleixner et al are working on eliminating them from the CPU hotplug code - which will get rid of most remaining notifier uses from the scheduler as well. So please add explicit cpufreq driver callback functions instead, which can be filled in by a platform if needed. No notifiers! Thanks, Ingo ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-04 23:18 ` Rafael J. Wysocki 2016-03-04 23:56 ` Steve Muckle @ 2016-03-05 16:49 ` Peter Zijlstra 2016-03-06 2:17 ` Rafael J. Wysocki 1 sibling, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-05 16:49 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Steve Muckle, Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Sat, Mar 05, 2016 at 12:18:54AM +0100, Rafael J. Wysocki wrote: > >>> Even if there are platforms which may change the CPU frequency behind > >>> cpufreq's back, breaking the transition notifiers, I'm worried about the > >>> addition of an interface which itself breaks them. The platforms which > >>> do change CPU frequency on their own have probably evolved to live with > >>> or work around this behavior. As other platforms migrate to fast > >>> frequency switching they might be surprised when things don't work as > >>> advertised. There's only 43 sites of cpufreq_register_notifier in 37 files, that should be fairly simple to audit. > >>> I'm not sure what the easiest way to deal with this is. I see the > >>> transition notifiers are the srcu type, which I understand to be > >>> blocking. Going through the tree and reworking everyone's callbacks and > >>> changing the type to atomic is obviously not realistic. > >> > >> Right. Even if it was (and per the above it looks entirely feasible), that's just not going to happen. We're not ever going to call random notifier crap from this deep within the scheduler. > >>> How about modifying cpufreq_register_notifier to return an error if the > >>> driver has a fast_switch callback installed and an attempt to register a > >>> transition notifier is made? > >> > >> That sounds like a good idea. Agreed, fail the stuff hard. 
Simply make cpufreq_register_notifier a __must_check function and add error handling to all call sites. > > I guess what might be done would be to spawn a work item to carry out > > a notify when the frequency changes. > > In fact, the mechanism may be relatively simple if I'm not mistaken. > > In the "fast switch" case, the governor may spawn a work item that > will just execute cpufreq_get() on policy->cpu. That will notice that > policy->cur is different from the real current frequency and will > re-adjust. > > Of course, cpufreq_driver_fast_switch() will need to be modified so it > doesn't update policy->cur then perhaps with a comment that the > governor using it will be responsible for that. No no no, that's just horrible. Why would you want to keep this notification stuff alive? If your platform can change frequency 'fast' you don't want notifiers. What's the point of a notification that says: "At some point in the random past my frequency has changed, and it likely has changed again since then, do 'something'." That's pointless. If you have dependent clock domains or whatever, you simply _cannot_ be fast. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-05 16:49 ` Peter Zijlstra @ 2016-03-06 2:17 ` Rafael J. Wysocki 2016-03-07 8:00 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-06 2:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Steve Muckle, Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Sat, Mar 5, 2016 at 5:49 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Sat, Mar 05, 2016 at 12:18:54AM +0100, Rafael J. Wysocki wrote: > >> >>> Even if there are platforms which may change the CPU frequency behind >> >>> cpufreq's back, breaking the transition notifiers, I'm worried about the >> >>> addition of an interface which itself breaks them. The platforms which >> >>> do change CPU frequency on their own have probably evolved to live with >> >>> or work around this behavior. As other platforms migrate to fast >> >>> frequency switching they might be surprised when things don't work as >> >>> advertised. > > There's only 43 sites of cpufreq_register_notifier in 37 files, that > should be fairly simple to audit. > >> >>> I'm not sure what the easiest way to deal with this is. I see the >> >>> transition notifiers are the srcu type, which I understand to be >> >>> blocking. Going through the tree and reworking everyone's callbacks and >> >>> changing the type to atomic is obviously not realistic. >> >> >> >> Right. > > Even if it was (and per the above it looks entirely feasible), that's > just not going to happen. We're not ever going to call random notifier > crap from this deep within the scheduler. > >> >>> How about modifying cpufreq_register_notifier to return an error if the >> >>> driver has a fast_switch callback installed and an attempt to register a >> >>> transition notifier is made? >> >> >> >> That sounds like a good idea. 
> > Agreed, fail the stuff hard. > > Simply make cpufreq_register_notifier a __must_check function and add > error handling to all call sites. Quite frankly, I don't see a compelling reason to do anything about the notifications at this point. The ACPI driver is the only one that will support fast switching for the time being and on practically all platforms that can use the ACPI driver the transition notifications cannot be relied on anyway for a few reasons. First, if intel_pstate or HWP is in use, they won't be coming at all. Second, anything turbo will just change frequency at will without notifying (like HWP). Finally, if they are coming, whoever receives them is notified about the frequency that is requested and not the real one, which is misleading, because (a) the request may just make the CPU go into the turbo range and then see above or (b) if the CPU is in a platform-coordinated package, its request will only be granted if it's the winning one. >> > I guess what might be done would be to spawn a work item to carry out >> > a notify when the frequency changes. >> >> In fact, the mechanism may be relatively simple if I'm not mistaken. >> >> In the "fast switch" case, the governor may spawn a work item that >> will just execute cpufreq_get() on policy->cpu. That will notice that >> policy->cur is different from the real current frequency and will >> re-adjust. >> >> Of course, cpufreq_driver_fast_switch() will need to be modified so it >> doesn't update policy->cur then perhaps with a comment that the >> governor using it will be responsible for that. > > No no no, that's just horrible. Why would you want to keep this > notification stuff alive? If your platform can change frequency 'fast' > you don't want notifiers. I'm not totally sure about that. > > What's the point of a notification that says: "At some point in the > random past my frequency has changed, and it likely has changed again > since then, do 'something'." > > That's pointless. 
If you have dependent clock domains or whatever, you > simply _cannot_ be fast. > What about thermal? They don't need to get very accurate information, but they need to be updated on a regular basis. It would do if they get averages instead of momentary values (and may be better even). ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-06 2:17 ` Rafael J. Wysocki @ 2016-03-07 8:00 ` Peter Zijlstra 2016-03-07 13:15 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-07 8:00 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Steve Muckle, Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Sun, Mar 06, 2016 at 03:17:09AM +0100, Rafael J. Wysocki wrote: > > Agreed, fail the stuff hard. > > > > Simply make cpufreq_register_notifier a __must_check function and add > > error handling to all call sites. > > Quite frankly, I don't see a compelling reason to do anything about > the notifications at this point. > > The ACPI driver is the only one that will support fast switching for > the time being and on practically all platforms that can use the ACPI > driver the transition notifications cannot be relied on anyway for a > few reasons. First, if intel_pstate or HWP is in use, they won't be > coming at all. Second, anything turbo will just change frequency at > will without notifying (like HWP). Finally, if they are coming, > whoever receives them is notified about the frequency that is > requested and not the real one, which is misleading, because (a) the > request may just make the CPU go into the turbo range and then see > above or (b) if the CPU is in a platform-coordinated package, its > request will only be granted if it's the winning one. Sure I know all that. But that, to me, seems like an argument for why you should have done this a long time ago. Someone registering a notifier you _know_ won't be called reliably is a sure sign of borkage. And you want to be notified (pun intended) of borkage. So the alternative option to making the registration fail, is making the registration WARN (and possibly disable fast support in the driver). 
But I do think something wants to be done here. > > No no no, that's just horrible. Why would you want to keep this > > notification stuff alive? If your platform can change frequency 'fast' > > you don't want notifiers. > > I'm not totally sure about that. I am, per definition, if you need to call notifiers, you're not fast. I would really suggest making that a hard rule and enforcing it. > > What's the point of a notification that says: "At some point in the > > random past my frequency has changed, and it likely has changed again > > since then, do 'something'." > > > > That's pointless. If you have dependent clock domains or whatever, you > > simply _cannot_ be fast. > > > > What about thermal? They don't need to get very accurate information, > > but they need to be updated on a regular basis. It would do if they > > get averages instead of momentary values (and may be better even). Thermal should be an integral part of cpufreq, but if they need a callback from the switching hook (and here I would like to remind everyone that this is inside scheduler hot paths and the more code you stuff in the harder the performance regressions will hit you in the face) it can get a direct function call. No need for no stinking notifiers. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-07 8:00 ` Peter Zijlstra @ 2016-03-07 13:15 ` Rafael J. Wysocki 2016-03-07 13:32 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-07 13:15 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Steve Muckle, Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Mon, Mar 7, 2016 at 9:00 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Sun, Mar 06, 2016 at 03:17:09AM +0100, Rafael J. Wysocki wrote: >> > Agreed, fail the stuff hard. >> > >> > Simply make cpufreq_register_notifier a __must_check function and add >> > error handling to all call sites. >> >> Quite frankly, I don't see a compelling reason to do anything about >> the notifications at this point. >> >> The ACPI driver is the only one that will support fast switching for >> the time being and on practically all platforms that can use the ACPI >> driver the transition notifications cannot be relied on anyway for a >> few reasons. First, if intel_pstate or HWP is in use, they won't be >> coming at all. Second, anything turbo will just change frequency at >> will without notifying (like HWP). Finally, if they are coming, >> whoever receives them is notified about the frequency that is >> requested and not the real one, which is misleading, because (a) the >> request may just make the CPU go into the turbo range and then see >> above or (b) if the CPU is in a platform-coordinated package, its >> request will only be granted if it's the winning one. > > Sure I know all that. But that, to me, seems like an argument for why > you should have done this a long time ago. While I generally agree with this, I don't quite see why cleaning that up necessarily has to be connected to the current patch series which is my point. 
> Someone registering a notifier you _know_ won't be called reliably is a > sure sign of borkage. And you want to be notified (pun intended) of > borkage. > > So the alternative option to making the registration fail, is making the > registration WARN (and possibly disable fast support in the driver). > > But I do think something wants to be done here. So here's what I can do for the "fast switch" thing. There is the fast_switch_possible policy flag that's necessary anyway. I can make notifier registration fail when that is set for at least one policy and I can make the setting of it fail if at least one notifier has already been registered. However, without spending too much time on chasing code dependencies I sort of suspect that it will uncover things that register cpufreq notifiers early and it won't be possible to use fast switch without sorting that out. And that won't even change anything apart from removing some code that has not worked for quite a while already and nobody noticed. >> > No no no, that's just horrible. Why would you want to keep this >> > notification stuff alive? If your platform can change frequency 'fast' >> > you don't want notifiers. >> >> I'm not totally sure about that. > I am, per definition, if you need to call notifiers, you're not fast. > > I would really suggest making that a hard rule and enforcing it. OK, but see above. It is doable for the "fast switch" thing, but it won't help in all of the other cases when notifications are not reliable. >> > What's the point of a notification that says: "At some point in the >> > random past my frequency has changed, and it likely has changed again >> > since then, do 'something'." >> > >> > That's pointless. If you have dependent clock domains or whatever, you >> > simply _cannot_ be fast. >> > >> >> What about thermal? They don't need to get very accurate information, >> but they need to be updated on a regular basis.
It would do if they >> get averages instead of momentary values (and may be better even). > > Thermal, should be an integral part of cpufreq, but if they need a > callback from the switching hook (and here I would like to remind > everyone that this is inside scheduler hot paths and the more code you > stuff in the harder the performance regressions will hit you in the > face) Calling notifiers (or any kind of callbacks that anyone can register) from there is out of the question. > it can get a direct function call. No need for no stinking > notifiers. I'm not talking about hooks in the switching code but *some* way to let stuff know about frequency changes. If it changes frequently enough, it's not practical and not even necessary to cause things like thermal to react on every change, but I think there needs to be a way to make them reevaluate things regularly. Arguably, they might set a timer for that, but why would they need a timer if they could get triggered by the code that actually makes changes? ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-07 13:15 ` Rafael J. Wysocki @ 2016-03-07 13:32 ` Peter Zijlstra 2016-03-07 13:42 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-07 13:32 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Steve Muckle, Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Mon, Mar 07, 2016 at 02:15:47PM +0100, Rafael J. Wysocki wrote: > On Mon, Mar 7, 2016 at 9:00 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > Sure I know all that. But that, to me, seems like an argument for why > > you should have done this a long time ago. > > While I generally agree with this, I don't quite see why cleaning that > up necessarily has to be connected to the current patch series which > is my point. Ah OK, fair enough I suppose. But someone should stick this on their TODO list, we should not 'forget' about this (again). > > But I do think something wants to be done here. > > So here's what I can do for the "fast switch" thing. > > There is the fast_switch_possible policy flag that's necessary anyway. > I can make notifier registration fail when that is set for at least > one policy and I can make the setting of it fail if at least one > notifier has already been registered. > > However, without spending too much time on chasing code dependencies i > sort of suspect that it will uncover things that register cpufreq > notifiers early and it won't be possible to use fast switch without > sorting that out. The two x86 users don't register notifiers when CONSTANT_TSC, which seems to be the right thing. Much of the other users seem unlikely to be used on x86, so I suspect the initial fallout will be very limited. *groan* modules, cpufreq allows drivers to be modules, so init sequences are poorly defined at best :/ Yes that blows. 
> And that won't even change anything apart from > removing some code that has not worked for quite a while already and > nobody noticed. Which is always a good thing, but yes, we can do this later. > It is doable for the "fast switch" thing, but it won't help in all of > the other cases when notifications are not reliable. Right, you can maybe add a 'NOTIFIERS_BROKEN' flag to the intel_p_state and HWP drivers or so, and trigger off of that. > If it changes frequently enough, it's not practical and not even > necessary to cause things like thermal to react on every change, but I > think there needs to be a way to make them reevaluate things > regularly. Arguably, they might set a timer for that, but why would > they need a timer if they could get triggered by the code that > actually makes changes? So that very much depends on what thermal actually needs; but I suspect that using a timer is cheaper than using irq_work to kick off something else. The irq_work is a LAPIC write (self IPI), just as the timer. However timers can be coalesced, resulting in, on average, less timer reprogramming than there are handlers ran. Now, if thermal can do without work and can run in-line just like the fast freq switch, then yes, that might make sense. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-07 13:32 ` Peter Zijlstra @ 2016-03-07 13:42 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-07 13:42 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Steve Muckle, Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Mon, Mar 7, 2016 at 2:32 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Mar 07, 2016 at 02:15:47PM +0100, Rafael J. Wysocki wrote: >> On Mon, Mar 7, 2016 at 9:00 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >> > Sure I know all that. But that, to me, seems like an argument for why >> > you should have done this a long time ago. >> >> While I generally agree with this, I don't quite see why cleaning that >> up necessarily has to be connected to the current patch series which >> is my point. > > Ah OK, fair enough I suppose. But someone should stick this on their > TODO list, we should not 'forget' about this (again). Sure. >> > But I do think something wants to be done here. >> >> So here's what I can do for the "fast switch" thing. >> >> There is the fast_switch_possible policy flag that's necessary anyway. >> I can make notifier registration fail when that is set for at least >> one policy and I can make the setting of it fail if at least one >> notifier has already been registered. >> >> However, without spending too much time on chasing code dependencies i >> sort of suspect that it will uncover things that register cpufreq >> notifiers early and it won't be possible to use fast switch without >> sorting that out. > > The two x86 users don't register notifiers when CONSTANT_TSC, which > seems to be the right thing. > > Much of the other users seem unlikely to be used on x86, so I suspect > the initial fallout will be very limited. OK, let me try this then. 
> *groan* modules, cpufreq allows drivers to be modules, so init sequences > are poorly defined at best :/ Yes that blows. Yup. >> And that won't even change anything apart from >> removing some code that has not worked for quite a while already and >> nobody noticed. > > Which is always a good thing, but yes, we can do this later. > >> It is doable for the "fast switch" thing, but it won't help in all of >> the other cases when notifications are not reliable. > > Right, you can maybe add a 'NOTIFIERS_BROKEN' flag to the intel_pstate > and HWP drivers or so, and trigger off of that. Something like that, yes. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-04 22:32 ` Rafael J. Wysocki 2016-03-04 22:40 ` Rafael J. Wysocki @ 2016-03-04 22:58 ` Rafael J. Wysocki 2016-03-04 23:59 ` Steve Muckle 1 sibling, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 22:58 UTC (permalink / raw) To: Steve Muckle Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Fri, Mar 4, 2016 at 11:32 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Fri, Mar 4, 2016 at 11:18 PM, Steve Muckle <steve.muckle@linaro.org> wrote: >> On 03/03/2016 07:07 PM, Rafael J. Wysocki wrote: >>> +void cpufreq_driver_fast_switch(struct cpufreq_policy *policy, >>> + unsigned int target_freq, unsigned int relation) >>> +{ >>> + unsigned int freq; >>> + >>> + freq = cpufreq_driver->fast_switch(policy, target_freq, relation); >>> + if (freq != CPUFREQ_ENTRY_INVALID) { >>> + policy->cur = freq; >>> + trace_cpu_frequency(freq, smp_processor_id()); >>> + } >>> +} >> >> Even if there are platforms which may change the CPU frequency behind >> cpufreq's back, breaking the transition notifiers, I'm worried about the >> addition of an interface which itself breaks them. The platforms which >> do change CPU frequency on their own have probably evolved to live with >> or work around this behavior. As other platforms migrate to fast >> frequency switching they might be surprised when things don't work as >> advertised. > > Well, intel_pstate doesn't do notifies at all, so anything depending > on them is already broken when it is used. Let alone the hardware > P-states coordination mechanism (HWP) where the frequency is > controlled by the processor itself entirely. > > That said I see your point. > >> I'm not sure what the easiest way to deal with this is. 
I see the >> transition notifiers are the SRCU type, which I understand to be >> blocking. Going through the tree and reworking everyone's callbacks and >> changing the type to atomic is obviously not realistic. > > Right. > >> How about modifying cpufreq_register_notifier to return an error if the >> driver has a fast_switch callback installed and an attempt to register a >> transition notifier is made? > > That sounds like a good idea. Transition notifiers may be registered before the driver is registered, so that won't help in all cases. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching 2016-03-04 22:58 ` Rafael J. Wysocki @ 2016-03-04 23:59 ` Steve Muckle 0 siblings, 0 replies; 158+ messages in thread From: Steve Muckle @ 2016-03-04 23:59 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On 03/04/2016 02:58 PM, Rafael J. Wysocki wrote: >>> How about modifying cpufreq_register_notifier to return an error if the >>> >> driver has a fast_switch callback installed and an attempt to register a >>> >> transition notifier is made? >> > >> > That sounds like a good idea. > > Transition notifiers may be registered before the driver is > registered, so that won't help in all cases. Could that hole be closed by a similar check in cpufreq_register_driver()? I.e. if the transition_notifier list is not empty, fail to register the driver (if the driver has a fast_switch routine)? Or alternatively, the fast_switch routine is not installed. thanks, Steve ^ permalink raw reply [flat|nested] 158+ messages in thread
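The two-sided check being discussed (refuse to register a transition notifier once fast switching is in use, and refuse to enable fast switching once a notifier exists) can be sketched as a minimal user-space model. All names here are illustrative, the error value stands in for -EBUSY, and the locking a real implementation would need around these checks is omitted:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (not the kernel API) of the mutual exclusion between
 * cpufreq transition notifiers and fast frequency switching. */

static int notifier_count;		/* registered transition notifiers */
static bool fast_switch_enabled;	/* driver uses the fast_switch path */

/* Mirrors the proposed cpufreq_register_notifier() check: refuse a
 * transition notifier while fast switching is in use. */
int model_register_notifier(void)
{
	if (fast_switch_enabled)
		return -1;		/* stand-in for -EBUSY */
	notifier_count++;
	return 0;
}

/* Mirrors the check proposed for driver registration: if transition
 * notifiers already exist, don't enable fast switching. */
int model_enable_fast_switch(void)
{
	if (notifier_count > 0)
		return -1;
	fast_switch_enabled = true;
	return 0;
}
```

In the kernel, the second check would live in cpufreq_register_driver(), either failing registration or simply not installing the fast_switch callback, per the alternative suggested above.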
* [PATCH v2 7/10] cpufreq: Rework the scheduler hooks for triggering updates 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki ` (5 preceding siblings ...) 2016-03-04 3:07 ` [PATCH v2 6/10] cpufreq: Support for fast frequency switching Rafael J. Wysocki @ 2016-03-04 3:12 ` Rafael J. Wysocki 2016-03-04 3:14 ` [PATCH v2 8/10] cpufreq: Move scheduler-related code to the sched directory Rafael J. Wysocki ` (3 subsequent siblings) 10 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 3:12 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Commit fe7034338ba0 (cpufreq: Add mechanism for registering utilization update callbacks) added cpufreq_update_util() to be called by the scheduler (from the CFS part) on utilization updates. The goal was to allow CFS to pass utilization information to cpufreq and to trigger it to evaluate the frequency/voltage configuration (P-state) of every CPU on a regular basis. However, the last two arguments of that function are never used by the current code, so CFS might simply call cpufreq_trigger_update() instead of it (like the RT and DL sched classes). For this reason, drop the last two arguments of cpufreq_update_util(), rename it to cpufreq_trigger_update() and modify CFS to call it. Moreover, since the utilization is not involved in that now, rename data types, functions and variables related to cpufreq_trigger_update() to reflect that (eg. struct update_util_data becomes struct freq_update_hook and so on). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- New patch. Not strictly necessary, but I like the new names better. 
:-) --- drivers/cpufreq/cpufreq.c | 52 +++++++++++++++++++++---------------- drivers/cpufreq/cpufreq_governor.c | 25 ++++++++--------- drivers/cpufreq/cpufreq_governor.h | 2 - drivers/cpufreq/intel_pstate.c | 15 ++++------ include/linux/cpufreq.h | 32 ++-------------------- kernel/sched/deadline.c | 2 - kernel/sched/fair.c | 13 +-------- kernel/sched/rt.c | 2 - 8 files changed, 58 insertions(+), 85 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -65,57 +65,65 @@ static struct cpufreq_driver *cpufreq_dr static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data); static DEFINE_RWLOCK(cpufreq_driver_lock); -static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); +static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook); /** - * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer. + * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer. * @cpu: The CPU to set the pointer for. - * @data: New pointer value. + * @hook: New pointer value. * - * Set and publish the update_util_data pointer for the given CPU. That pointer - * points to a struct update_util_data object containing a callback function - * to call from cpufreq_update_util(). That function will be called from an RCU - * read-side critical section, so it must not sleep. + * Set and publish the freq_update_hook pointer for the given CPU. That pointer + * points to a struct freq_update_hook object containing a callback function + * to call from cpufreq_trigger_update(). That function will be called from + * an RCU read-side critical section, so it must not sleep. * * Callers must use RCU-sched callbacks to free any memory that might be * accessed via the old update_util_data pointer or invoke synchronize_sched() * right after this function to avoid use-after-free. 
*/ -void cpufreq_set_update_util_data(int cpu, struct update_util_data *data) +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook) { - if (WARN_ON(data && !data->func)) + if (WARN_ON(hook && !hook->func)) return; - rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data); + rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); } -EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data); +EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook); /** - * cpufreq_update_util - Take a note about CPU utilization changes. + * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. * @time: Current time. - * @util: Current utilization. - * @max: Utilization ceiling. * - * This function is called by the scheduler on every invocation of - * update_load_avg() on the CPU whose utilization is being updated. + * The way cpufreq is currently arranged requires it to evaluate the CPU + * performance state (frequency/voltage) on a regular basis. To facilitate + * that, this function is called by update_load_avg() in CFS when executed for + * the current CPU's runqueue. * - * It can only be called from RCU-sched read-side critical sections. + * However, this isn't sufficient to prevent the CPU from being stuck in a + * completely inadequate performance level for too long, because the calls + * from CFS will not be made if RT or deadline tasks are active all the time + * (or there are RT and DL tasks only). + * + * As a workaround for that issue, this function is called by the RT and DL + * sched classes to trigger extra cpufreq updates to prevent it from stalling, + * but that really is a band-aid. Going forward it should be replaced with + * solutions targeted more specifically at RT and DL tasks. 
*/ -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) +void cpufreq_trigger_update(u64 time) { - struct update_util_data *data; + struct freq_update_hook *hook; #ifdef CONFIG_LOCKDEP WARN_ON(debug_locks && !rcu_read_lock_sched_held()); #endif - data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)); + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); /* * If this isn't inside of an RCU-sched read-side critical section, data * may become NULL after the check below. */ - if (data) - data->func(data, time, util, max); + if (hook) + hook->func(hook, time); } /* Flag to suspend/resume CPUFreq governors */ Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -146,35 +146,13 @@ static inline bool policy_is_shared(stru extern struct kobject *cpufreq_global_kobject; #ifdef CONFIG_CPU_FREQ -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max); +void cpufreq_trigger_update(u64 time); -/** - * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. - * @time: Current time. - * - * The way cpufreq is currently arranged requires it to evaluate the CPU - * performance state (frequency/voltage) on a regular basis to prevent it from - * being stuck in a completely inadequate performance level for too long. - * That is not guaranteed to happen if the updates are only triggered from CFS, - * though, because they may not be coming in if RT or deadline tasks are active - * all the time (or there are RT and DL tasks only). - * - * As a workaround for that issue, this function is called by the RT and DL - * sched classes to trigger extra cpufreq updates to prevent it from stalling, - * but that really is a band-aid. Going forward it should be replaced with - * solutions targeted more specifically at RT and DL tasks. 
- */ -static inline void cpufreq_trigger_update(u64 time) -{ - cpufreq_update_util(time, ULONG_MAX, 0); -} - -struct update_util_data { - void (*func)(struct update_util_data *data, - u64 time, unsigned long util, unsigned long max); +struct freq_update_hook { + void (*func)(struct freq_update_hook *hook, u64 time); }; -void cpufreq_set_update_util_data(int cpu, struct update_util_data *data); +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook); unsigned int cpufreq_get(unsigned int cpu); unsigned int cpufreq_quick_get(unsigned int cpu); @@ -187,8 +165,6 @@ int cpufreq_update_policy(unsigned int c bool have_governor_per_policy(void); struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy); #else -static inline void cpufreq_update_util(u64 time, unsigned long util, - unsigned long max) {} static inline void cpufreq_trigger_update(u64 time) {} static inline unsigned int cpufreq_get(unsigned int cpu) Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -62,10 +62,10 @@ ssize_t store_sampling_rate(struct dbs_d mutex_lock(&policy_dbs->timer_mutex); /* * On 32-bit architectures this may race with the - * sample_delay_ns read in dbs_update_util_handler(), but that + * sample_delay_ns read in dbs_freq_update_handler(), but that * really doesn't matter. If the read returns a value that's * too big, the sample will be skipped, but the next invocation - * of dbs_update_util_handler() (when the update has been + * of dbs_freq_update_handler() (when the update has been * completed) will take a sample. 
* * If this runs in parallel with dbs_work_handler(), we may end @@ -257,7 +257,7 @@ unsigned int dbs_update(struct cpufreq_p } EXPORT_SYMBOL_GPL(dbs_update); -static void gov_set_update_util(struct policy_dbs_info *policy_dbs, +static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs, unsigned int delay_us) { struct cpufreq_policy *policy = policy_dbs->policy; @@ -269,16 +269,16 @@ static void gov_set_update_util(struct p for_each_cpu(cpu, policy->cpus) { struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu); - cpufreq_set_update_util_data(cpu, &cdbs->update_util); + cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook); } } -static inline void gov_clear_update_util(struct cpufreq_policy *policy) +static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy) { int i; for_each_cpu(i, policy->cpus) - cpufreq_set_update_util_data(i, NULL); + cpufreq_set_freq_update_hook(i, NULL); synchronize_sched(); } @@ -287,7 +287,7 @@ static void gov_cancel_work(struct cpufr { struct policy_dbs_info *policy_dbs = policy->governor_data; - gov_clear_update_util(policy_dbs->policy); + gov_clear_freq_update_hooks(policy_dbs->policy); irq_work_sync(&policy_dbs->irq_work); cancel_work_sync(&policy_dbs->work); atomic_set(&policy_dbs->work_count, 0); @@ -331,10 +331,9 @@ static void dbs_irq_work(struct irq_work schedule_work(&policy_dbs->work); } -static void dbs_update_util_handler(struct update_util_data *data, u64 time, - unsigned long util, unsigned long max) +static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time) { - struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util); + struct cpu_dbs_info *cdbs = container_of(hook, struct cpu_dbs_info, update_hook); struct policy_dbs_info *policy_dbs = cdbs->policy_dbs; u64 delta_ns, lst; @@ -403,7 +402,7 @@ static struct policy_dbs_info *alloc_pol struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j); j_cdbs->policy_dbs = policy_dbs; - j_cdbs->update_util.func = 
dbs_update_util_handler; + j_cdbs->update_hook.func = dbs_freq_update_handler; } return policy_dbs; } @@ -419,7 +418,7 @@ static void free_policy_dbs_info(struct struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j); j_cdbs->policy_dbs = NULL; - j_cdbs->update_util.func = NULL; + j_cdbs->update_hook.func = NULL; } gov->free(policy_dbs); } @@ -586,7 +585,7 @@ static int cpufreq_governor_start(struct gov->start(policy); - gov_set_update_util(policy_dbs, sampling_rate); + gov_set_freq_update_hooks(policy_dbs, sampling_rate); return 0; } Index: linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -144,7 +144,7 @@ struct cpu_dbs_info { * wake-up from idle. */ unsigned int prev_load; - struct update_util_data update_util; + struct freq_update_hook update_hook; struct policy_dbs_info *policy_dbs; }; Index: linux-pm/drivers/cpufreq/intel_pstate.c =================================================================== --- linux-pm.orig/drivers/cpufreq/intel_pstate.c +++ linux-pm/drivers/cpufreq/intel_pstate.c @@ -103,7 +103,7 @@ struct _pid { struct cpudata { int cpu; - struct update_util_data update_util; + struct freq_update_hook update_hook; struct pstate_data pstate; struct vid_data vid; @@ -1019,10 +1019,9 @@ static inline void intel_pstate_adjust_b sample->freq); } -static void intel_pstate_update_util(struct update_util_data *data, u64 time, - unsigned long util, unsigned long max) +static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time) { - struct cpudata *cpu = container_of(data, struct cpudata, update_util); + struct cpudata *cpu = container_of(hook, struct cpudata, update_hook); u64 delta_ns = time - cpu->sample.time; if ((s64)delta_ns >= pid_params.sample_rate_ns) { @@ -1088,8 +1087,8 @@ static int intel_pstate_init_cpu(unsigne intel_pstate_busy_pid_reset(cpu); intel_pstate_sample(cpu, 
0); - cpu->update_util.func = intel_pstate_update_util; - cpufreq_set_update_util_data(cpunum, &cpu->update_util); + cpu->update_hook.func = intel_pstate_freq_update; + cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook); pr_debug("intel_pstate: controlling: cpu %d\n", cpunum); @@ -1173,7 +1172,7 @@ static void intel_pstate_stop_cpu(struct pr_debug("intel_pstate: CPU %d exiting\n", cpu_num); - cpufreq_set_update_util_data(cpu_num, NULL); + cpufreq_set_freq_update_hook(cpu_num, NULL); synchronize_sched(); if (hwp_active) @@ -1441,7 +1440,7 @@ out: get_online_cpus(); for_each_online_cpu(cpu) { if (all_cpu_data[cpu]) { - cpufreq_set_update_util_data(cpu, NULL); + cpufreq_set_freq_update_hook(cpu, NULL); synchronize_sched(); kfree(all_cpu_data[cpu]); } Index: linux-pm/kernel/sched/fair.c =================================================================== --- linux-pm.orig/kernel/sched/fair.c +++ linux-pm/kernel/sched/fair.c @@ -2839,8 +2839,6 @@ static inline void update_load_avg(struc update_tg_load_avg(cfs_rq, 0); if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) { - unsigned long max = rq->cpu_capacity_orig; - /* * There are a few boundary cases this might miss but it should * get called often enough that that should (hopefully) not be @@ -2849,16 +2847,9 @@ static inline void update_load_avg(struc * the next tick/schedule should update. * * It will not get called when we go idle, because the idle - * thread is a different class (!fair), nor will the utilization - * number include things like RT tasks. - * - * As is, the util number is not freq-invariant (we'd have to - * implement arch_scale_freq_capacity() for that). - * - * See cpu_util(). + * thread is a different class (!fair). 
*/ - cpufreq_update_util(rq_clock(rq), - min(cfs_rq->avg.util_avg, max), max); + cpufreq_trigger_update(rq_clock(rq)); } } Index: linux-pm/kernel/sched/deadline.c =================================================================== --- linux-pm.orig/kernel/sched/deadline.c +++ linux-pm/kernel/sched/deadline.c @@ -726,7 +726,7 @@ static void update_curr_dl(struct rq *rq if (!dl_task(curr) || !on_dl_rq(dl_se)) return; - /* Kick cpufreq (see the comment in linux/cpufreq.h). */ + /* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */ if (cpu_of(rq) == smp_processor_id()) cpufreq_trigger_update(rq_clock(rq)); Index: linux-pm/kernel/sched/rt.c =================================================================== --- linux-pm.orig/kernel/sched/rt.c +++ linux-pm/kernel/sched/rt.c @@ -945,7 +945,7 @@ static void update_curr_rt(struct rq *rq if (curr->sched_class != &rt_sched_class) return; - /* Kick cpufreq (see the comment in linux/cpufreq.h). */ + /* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */ if (cpu_of(rq) == smp_processor_id()) cpufreq_trigger_update(rq_clock(rq)); ^ permalink raw reply [flat|nested] 158+ messages in thread
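The hook publication scheme in the patch above can be approximated in user space, with rcu_assign_pointer()/rcu_dereference_sched() stood in for by C11 release/acquire atomics. This is only a sketch of the pointer-publication half; the grace-period side (synchronize_sched() before freeing a hook) has no analogue here, and NR_CPUS and the callback are invented for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* User-space model of the per-CPU freq_update_hook pattern. */

struct freq_update_hook {
	void (*func)(struct freq_update_hook *hook, unsigned long long time);
};

#define NR_CPUS 4
static _Atomic(struct freq_update_hook *) update_hook[NR_CPUS];

static void set_freq_update_hook(int cpu, struct freq_update_hook *hook)
{
	/* Publish: release ordering makes the hook's contents visible
	 * before the pointer itself, like rcu_assign_pointer(). */
	atomic_store_explicit(&update_hook[cpu], hook, memory_order_release);
}

static void trigger_update(int cpu, unsigned long long time)
{
	/* Subscribe: acquire load pairs with the release store above,
	 * like rcu_dereference_sched().  A NULL hook means no governor
	 * is attached to this CPU, so the update is simply dropped. */
	struct freq_update_hook *hook =
		atomic_load_explicit(&update_hook[cpu], memory_order_acquire);

	if (hook)
		hook->func(hook, time);
}

/* Example consumer counting how often its callback fires. */
static int calls;
static void count_calls(struct freq_update_hook *hook, unsigned long long time)
{
	(void)hook; (void)time;
	calls++;
}
```

The design point the patch preserves is that readers never block and never take a lock: the scheduler-side caller only does a dependency-free load and an indirect call, which is why the callback must not sleep.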
* [PATCH v2 8/10] cpufreq: Move scheduler-related code to the sched directory 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki ` (6 preceding siblings ...) 2016-03-04 3:12 ` [PATCH v2 7/10] cpufreq: Rework the scheduler hooks for triggering updates Rafael J. Wysocki @ 2016-03-04 3:14 ` Rafael J. Wysocki 2016-03-04 3:18 ` [PATCH v2 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() Rafael J. Wysocki ` (2 subsequent siblings) 10 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 3:14 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Create cpufreq.c under kernel/sched/ and move the cpufreq code related to the scheduler to that file. Also move the headers related to that code from cpufreq.h to sched.h. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- New patch. --- drivers/cpufreq/cpufreq.c | 61 ------------------------------ drivers/cpufreq/cpufreq_governor.c | 1 drivers/cpufreq/intel_pstate.c | 1 include/linux/cpufreq.h | 10 ----- include/linux/sched.h | 12 ++++++ kernel/sched/Makefile | 1 kernel/sched/cpufreq.c | 73 +++++++++++++++++++++++++++++++++++++ 7 files changed, 88 insertions(+), 71 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -65,67 +65,6 @@ static struct cpufreq_driver *cpufreq_dr static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data); static DEFINE_RWLOCK(cpufreq_driver_lock); -static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook); - -/** - * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer. 
- * @cpu: The CPU to set the pointer for. - * @hook: New pointer value. - * - * Set and publish the freq_update_hook pointer for the given CPU. That pointer - * points to a struct freq_update_hook object containing a callback function - * to call from cpufreq_trigger_update(). That function will be called from - * an RCU read-side critical section, so it must not sleep. - * - * Callers must use RCU-sched callbacks to free any memory that might be - * accessed via the old update_util_data pointer or invoke synchronize_sched() - * right after this function to avoid use-after-free. - */ -void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook) -{ - if (WARN_ON(hook && !hook->func)) - return; - - rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); -} -EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook); - -/** - * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. - * @time: Current time. - * - * The way cpufreq is currently arranged requires it to evaluate the CPU - * performance state (frequency/voltage) on a regular basis. To facilitate - * that, this function is called by update_load_avg() in CFS when executed for - * the current CPU's runqueue. - * - * However, this isn't sufficient to prevent the CPU from being stuck in a - * completely inadequate performance level for too long, because the calls - * from CFS will not be made if RT or deadline tasks are active all the time - * (or there are RT and DL tasks only). - * - * As a workaround for that issue, this function is called by the RT and DL - * sched classes to trigger extra cpufreq updates to prevent it from stalling, - * but that really is a band-aid. Going forward it should be replaced with - * solutions targeted more specifically at RT and DL tasks. 
- */ -void cpufreq_trigger_update(u64 time) -{ - struct freq_update_hook *hook; - -#ifdef CONFIG_LOCKDEP - WARN_ON(debug_locks && !rcu_read_lock_sched_held()); -#endif - - hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); - /* - * If this isn't inside of an RCU-sched read-side critical section, data - * may become NULL after the check below. - */ - if (hook) - hook->func(hook, time); -} - /* Flag to suspend/resume CPUFreq governors */ static bool cpufreq_suspended; Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -18,6 +18,7 @@ #include <linux/export.h> #include <linux/kernel_stat.h> +#include <linux/sched.h> #include <linux/slab.h> #include "cpufreq_governor.h" Index: linux-pm/drivers/cpufreq/intel_pstate.c =================================================================== --- linux-pm.orig/drivers/cpufreq/intel_pstate.c +++ linux-pm/drivers/cpufreq/intel_pstate.c @@ -21,6 +21,7 @@ #include <linux/list.h> #include <linux/cpu.h> #include <linux/cpufreq.h> +#include <linux/sched.h> #include <linux/sysfs.h> #include <linux/types.h> #include <linux/fs.h> Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -146,14 +146,6 @@ static inline bool policy_is_shared(stru extern struct kobject *cpufreq_global_kobject; #ifdef CONFIG_CPU_FREQ -void cpufreq_trigger_update(u64 time); - -struct freq_update_hook { - void (*func)(struct freq_update_hook *hook, u64 time); -}; - -void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook); - unsigned int cpufreq_get(unsigned int cpu); unsigned int cpufreq_quick_get(unsigned int cpu); unsigned int cpufreq_quick_get_max(unsigned int cpu); @@ -165,8 +157,6 @@ int cpufreq_update_policy(unsigned 
int c bool have_governor_per_policy(void); struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy); #else -static inline void cpufreq_trigger_update(u64 time) {} - static inline unsigned int cpufreq_get(unsigned int cpu) { return 0; Index: linux-pm/include/linux/sched.h =================================================================== --- linux-pm.orig/include/linux/sched.h +++ linux-pm/include/linux/sched.h @@ -2362,6 +2362,18 @@ extern u64 scheduler_tick_max_deferment( static inline bool sched_can_stop_tick(void) { return false; } #endif +#ifdef CONFIG_CPU_FREQ +void cpufreq_trigger_update(u64 time); + +struct freq_update_hook { + void (*func)(struct freq_update_hook *hook, u64 time); +}; + +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook); +#else +static inline void cpufreq_trigger_update(u64 time) {} +#endif + #ifdef CONFIG_SCHED_AUTOGROUP extern void sched_autogroup_create_attach(struct task_struct *p); extern void sched_autogroup_detach(struct task_struct *p); Index: linux-pm/kernel/sched/Makefile =================================================================== --- linux-pm.orig/kernel/sched/Makefile +++ linux-pm/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_gr obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_FREQ) += cpufreq.o Index: linux-pm/kernel/sched/cpufreq.c =================================================================== --- /dev/null +++ linux-pm/kernel/sched/cpufreq.c @@ -0,0 +1,73 @@ +/* + * Scheduler code and data structures related to cpufreq. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#include <linux/sched.h> + +static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook); + +/** + * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer. + * @cpu: The CPU to set the pointer for. + * @hook: New pointer value. + * + * Set and publish the freq_update_hook pointer for the given CPU. That pointer + * points to a struct freq_update_hook object containing a callback function + * to call from cpufreq_trigger_update(). That function will be called from + * an RCU read-side critical section, so it must not sleep. + * + * Callers must use RCU-sched callbacks to free any memory that might be + * accessed via the old update_util_data pointer or invoke synchronize_sched() + * right after this function to avoid use-after-free. + */ +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook) +{ + if (WARN_ON(hook && !hook->func)) + return; + + rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); +} +EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook); + +/** + * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. + * @time: Current time. + * + * The way cpufreq is currently arranged requires it to evaluate the CPU + * performance state (frequency/voltage) on a regular basis. To facilitate + * that, this function is called by update_load_avg() in CFS when executed for + * the current CPU's runqueue. + * + * However, this isn't sufficient to prevent the CPU from being stuck in a + * completely inadequate performance level for too long, because the calls + * from CFS will not be made if RT or deadline tasks are active all the time + * (or there are RT and DL tasks only). + * + * As a workaround for that issue, this function is called by the RT and DL + * sched classes to trigger extra cpufreq updates to prevent it from stalling, + * but that really is a band-aid. Going forward it should be replaced with + * solutions targeted more specifically at RT and DL tasks. 
+ */ +void cpufreq_trigger_update(u64 time) +{ + struct freq_update_hook *hook; + +#ifdef CONFIG_LOCKDEP + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); +#endif + + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); + /* + * If this isn't inside of an RCU-sched read-side critical section, hook + * may become NULL after the check below. + */ + if (hook) + hook->func(hook, time); +} ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v2 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki ` (7 preceding siblings ...) 2016-03-04 3:14 ` [PATCH v2 8/10] cpufreq: Move scheduler-related code to the sched directory Rafael J. Wysocki @ 2016-03-04 3:18 ` Rafael J. Wysocki 2016-03-04 10:50 ` Juri Lelli 2016-03-04 13:30 ` [PATCH v3 " Rafael J. Wysocki 2016-03-04 3:35 ` [PATCH v2 10/10] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki 2016-03-08 2:23 ` [PATCH v3 0/7] cpufreq: schedutil governor Rafael J. Wysocki 10 siblings, 2 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 3:18 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> A subsequent change set will introduce a new cpufreq governor using CPU utilization information from the scheduler, so introduce cpufreq_update_util() (again) to allow that information to be passed to the new governor and make cpufreq_trigger_update() call it internally. To that end, add a new ->update_util callback pointer to struct freq_update_hook to be set by entities that want to use the util and max arguments and make cpufreq_update_util() use that callback if available or the ->func callback that only takes the time argument otherwise. In addition to that, arrange helpers to set/clear the utilization update hooks in such a way that the full ->update_util callbacks can only be set by code inside the kernel/sched/ directory. Update the current users of cpufreq_set_freq_update_hook() to use the new helpers. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- New patch. 
Maybe slightly over the top, but at least it should be clear who uses the util and max arguments and who doesn't use them after it. --- drivers/cpufreq/cpufreq_governor.c | 76 +++++++++++++-------------- drivers/cpufreq/intel_pstate.c | 8 +- include/linux/sched.h | 10 +-- kernel/sched/cpufreq.c | 101 +++++++++++++++++++++++++++++-------- kernel/sched/fair.c | 8 ++ kernel/sched/sched.h | 16 +++++ 6 files changed, 150 insertions(+), 69 deletions(-) Index: linux-pm/include/linux/sched.h =================================================================== --- linux-pm.orig/include/linux/sched.h +++ linux-pm/include/linux/sched.h @@ -2363,15 +2363,15 @@ static inline bool sched_can_stop_tick(v #endif #ifdef CONFIG_CPU_FREQ -void cpufreq_trigger_update(u64 time); - struct freq_update_hook { void (*func)(struct freq_update_hook *hook, u64 time); + void (*update_util)(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max); }; -void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook); -#else -static inline void cpufreq_trigger_update(u64 time) {} +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook, + void (*func)(struct freq_update_hook *hook, u64 time)); +void cpufreq_clear_freq_update_hook(int cpu); #endif #ifdef CONFIG_SCHED_AUTOGROUP Index: linux-pm/kernel/sched/cpufreq.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq.c +++ linux-pm/kernel/sched/cpufreq.c @@ -9,12 +9,12 @@ * published by the Free Software Foundation. */ -#include <linux/sched.h> +#include "sched.h" static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook); /** - * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer. + * set_freq_update_hook - Populate the CPU's freq_update_hook pointer. * @cpu: The CPU to set the pointer for. * @hook: New pointer value. 
* @@ -27,23 +27,96 @@ static DEFINE_PER_CPU(struct freq_update * accessed via the old update_util_data pointer or invoke synchronize_sched() * right after this function to avoid use-after-free. */ -void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook) +static void set_freq_update_hook(int cpu, struct freq_update_hook *hook) { - if (WARN_ON(hook && !hook->func)) + rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); +} + +/** + * cpufreq_set_freq_update_hook - Set the CPU's frequency update callback. + * @cpu: The CPU to set the callback for. + * @hook: New freq_update_hook pointer value. + * @func: Callback function to use with the new hook. + */ +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook, + void (*func)(struct freq_update_hook *hook, u64 time)) +{ + if (WARN_ON(!hook || !func)) return; - rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); + hook->func = func; + set_freq_update_hook(cpu, hook); } EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook); /** + * cpufreq_set_update_util_hook - Set the CPU's utilization update callback. + * @cpu: The CPU to set the callback for. + * @hook: New freq_update_hook pointer value. + * @update_util: Callback function to use with the new hook. + */ +void cpufreq_set_update_util_hook(int cpu, struct freq_update_hook *hook, + void (*update_util)(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max)) +{ + if (WARN_ON(!hook || !update_util)) + return; + + hook->update_util = update_util; + set_freq_update_hook(cpu, hook); +} +EXPORT_SYMBOL_GPL(cpufreq_set_update_util_hook); + +/** + * cpufreq_clear_freq_update_hook - Clear the CPU's freq_update_hook pointer. + * @cpu: The CPU to clear the pointer for. + */ +void cpufreq_clear_freq_update_hook(int cpu) +{ + set_freq_update_hook(cpu, NULL); +} +EXPORT_SYMBOL_GPL(cpufreq_clear_freq_update_hook); + +/** + * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time. + * @util: CPU utilization. + * @max: CPU capacity. + * + * This function is called on every invocation of update_load_avg() on the CPU + * whose utilization is being updated. + * + * It can only be called from RCU-sched read-side critical sections. + */ +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) +{ + struct freq_update_hook *hook; + +#ifdef CONFIG_LOCKDEP + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); +#endif + + hook = rcu_dereference(*this_cpu_ptr(&cpufreq_freq_update_hook)); + /* + * If this isn't inside of an RCU-sched read-side critical section, hook + * may become NULL after the check below. + */ + if (hook) { + if (hook->update_util) + hook->update_util(hook, time, util, max); + else + hook->func(hook, time); + } +} + +/** * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. * @time: Current time. * * The way cpufreq is currently arranged requires it to evaluate the CPU * performance state (frequency/voltage) on a regular basis. To facilitate - * that, this function is called by update_load_avg() in CFS when executed for - * the current CPU's runqueue. + * that, cpufreq_update_util() is called by update_load_avg() in CFS when + * executed for the current CPU's runqueue. * * However, this isn't sufficient to prevent the CPU from being stuck in a * completely inadequate performance level for too long, because the calls @@ -57,17 +130,5 @@ EXPORT_SYMBOL_GPL(cpufreq_set_freq_updat */ void cpufreq_trigger_update(u64 time) { - struct freq_update_hook *hook; - -#ifdef CONFIG_LOCKDEP - WARN_ON(debug_locks && !rcu_read_lock_sched_held()); -#endif - - hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); - /* - * If this isn't inside of an RCU-sched read-side critical section, hook - * may become NULL after the check below. 
- */ - if (hook) - hook->func(hook, time); + cpufreq_update_util(time, ULONG_MAX, 0); } Index: linux-pm/kernel/sched/fair.c =================================================================== --- linux-pm.orig/kernel/sched/fair.c +++ linux-pm/kernel/sched/fair.c @@ -2839,6 +2839,8 @@ static inline void update_load_avg(struc update_tg_load_avg(cfs_rq, 0); if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) { + unsigned long max = rq->cpu_capacity_orig; + /* * There are a few boundary cases this might miss but it should * get called often enough that that should (hopefully) not be @@ -2847,9 +2849,11 @@ static inline void update_load_avg(struc * the next tick/schedule should update. * * It will not get called when we go idle, because the idle - * thread is a different class (!fair). + * thread is a different class (!fair), nor will the utilization + * number include things like RT tasks. */ - cpufreq_trigger_update(rq_clock(rq)); + cpufreq_update_util(rq_clock(rq), + min(cfs_rq->avg.util_avg, max), max); } } Index: linux-pm/kernel/sched/sched.h =================================================================== --- linux-pm.orig/kernel/sched/sched.h +++ linux-pm/kernel/sched/sched.h @@ -1739,3 +1739,19 @@ static inline u64 irq_time_read(int cpu) } #endif /* CONFIG_64BIT */ #endif /* CONFIG_IRQ_TIME_ACCOUNTING */ + +#ifdef CONFIG_CPU_FREQ +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max); +void cpufreq_trigger_update(u64 time); +void cpufreq_set_update_util_hook(int cpu, struct freq_update_hook *hook, + void (*update_util)(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max)); +static inline void cpufreq_clear_update_util_hook(int cpu) +{ + cpufreq_clear_freq_update_hook(cpu); +} +#else +static inline void cpufreq_update_util(u64 time, unsigned long util, + unsigned long max) {} +static inline void cpufreq_trigger_update(u64 time) {} +#endif /* CONFIG_CPU_FREQ */ Index: linux-pm/drivers/cpufreq/intel_pstate.c
=================================================================== --- linux-pm.orig/drivers/cpufreq/intel_pstate.c +++ linux-pm/drivers/cpufreq/intel_pstate.c @@ -1088,8 +1088,8 @@ static int intel_pstate_init_cpu(unsigne intel_pstate_busy_pid_reset(cpu); intel_pstate_sample(cpu, 0); - cpu->update_hook.func = intel_pstate_freq_update; - cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook); + cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook, + intel_pstate_freq_update); pr_debug("intel_pstate: controlling: cpu %d\n", cpunum); @@ -1173,7 +1173,7 @@ static void intel_pstate_stop_cpu(struct pr_debug("intel_pstate: CPU %d exiting\n", cpu_num); - cpufreq_set_freq_update_hook(cpu_num, NULL); + cpufreq_clear_freq_update_hook(cpu_num); synchronize_sched(); if (hwp_active) @@ -1441,7 +1441,7 @@ out: get_online_cpus(); for_each_online_cpu(cpu) { if (all_cpu_data[cpu]) { - cpufreq_set_freq_update_hook(cpu, NULL); + cpufreq_clear_freq_update_hook(cpu); synchronize_sched(); kfree(all_cpu_data[cpu]); } Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -211,43 +211,6 @@ unsigned int dbs_update(struct cpufreq_p } EXPORT_SYMBOL_GPL(dbs_update); -static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs, - unsigned int delay_us) -{ - struct cpufreq_policy *policy = policy_dbs->policy; - int cpu; - - gov_update_sample_delay(policy_dbs, delay_us); - policy_dbs->last_sample_time = 0; - - for_each_cpu(cpu, policy->cpus) { - struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu); - - cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook); - } -} - -static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy) -{ - int i; - - for_each_cpu(i, policy->cpus) - cpufreq_set_freq_update_hook(i, NULL); - - synchronize_sched(); -} - -static void gov_cancel_work(struct cpufreq_policy *policy) 
-{ - struct policy_dbs_info *policy_dbs = policy->governor_data; - - gov_clear_freq_update_hooks(policy_dbs->policy); - irq_work_sync(&policy_dbs->irq_work); - cancel_work_sync(&policy_dbs->work); - atomic_set(&policy_dbs->work_count, 0); - policy_dbs->work_in_progress = false; -} - static void dbs_work_handler(struct work_struct *work) { struct policy_dbs_info *policy_dbs; @@ -334,6 +297,44 @@ static void dbs_freq_update_handler(stru irq_work_queue(&policy_dbs->irq_work); } +static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs, + unsigned int delay_us) +{ + struct cpufreq_policy *policy = policy_dbs->policy; + int cpu; + + gov_update_sample_delay(policy_dbs, delay_us); + policy_dbs->last_sample_time = 0; + + for_each_cpu(cpu, policy->cpus) { + struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu); + + cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook, + dbs_freq_update_handler); + } +} + +static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy) +{ + int i; + + for_each_cpu(i, policy->cpus) + cpufreq_clear_freq_update_hook(i); + + synchronize_sched(); +} + +static void gov_cancel_work(struct cpufreq_policy *policy) +{ + struct policy_dbs_info *policy_dbs = policy->governor_data; + + gov_clear_freq_update_hooks(policy_dbs->policy); + irq_work_sync(&policy_dbs->irq_work); + cancel_work_sync(&policy_dbs->work); + atomic_set(&policy_dbs->work_count, 0); + policy_dbs->work_in_progress = false; +} + static struct policy_dbs_info *alloc_policy_dbs_info(struct cpufreq_policy *policy, struct dbs_governor *gov) { @@ -356,7 +357,6 @@ static struct policy_dbs_info *alloc_pol struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j); j_cdbs->policy_dbs = policy_dbs; - j_cdbs->update_hook.func = dbs_freq_update_handler; } return policy_dbs; } ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() 2016-03-04 3:18 ` [PATCH v2 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() Rafael J. Wysocki @ 2016-03-04 10:50 ` Juri Lelli 2016-03-04 12:58 ` Rafael J. Wysocki 2016-03-04 13:30 ` [PATCH v3 " Rafael J. Wysocki 1 sibling, 1 reply; 158+ messages in thread From: Juri Lelli @ 2016-03-04 10:50 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar Hi Rafael, On 04/03/16 04:18, Rafael J. Wysocki wrote: [...] > +/** > + * cpufreq_update_util - Take a note about CPU utilization changes. > + * @time: Current time. > + * @util: CPU utilization. > + * @max: CPU capacity. > + * > + * This function is called on every invocation of update_load_avg() on the CPU > + * whose utilization is being updated. > + * > + * It can only be called from RCU-sched read-side critical sections. > + */ > +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) > +{ > + struct freq_update_hook *hook; > + > +#ifdef CONFIG_LOCKDEP > + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); > +#endif > + > + hook = rcu_dereference(*this_cpu_ptr(&cpufreq_freq_update_hook)); Small fix. You forgot to change this to rcu_dereference_sched() (you only fixed that in 01/10). Best, - Juri ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() 2016-03-04 10:50 ` Juri Lelli @ 2016-03-04 12:58 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 12:58 UTC (permalink / raw) To: Juri Lelli Cc: Rafael J. Wysocki, Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Fri, Mar 4, 2016 at 11:50 AM, Juri Lelli <juri.lelli@arm.com> wrote: > Hi Rafael, > > On 04/03/16 04:18, Rafael J. Wysocki wrote: > > [...] > >> +/** >> + * cpufreq_update_util - Take a note about CPU utilization changes. >> + * @time: Current time. >> + * @util: CPU utilization. >> + * @max: CPU capacity. >> + * >> + * This function is called on every invocation of update_load_avg() on the CPU >> + * whose utilization is being updated. >> + * >> + * It can only be called from RCU-sched read-side critical sections. >> + */ >> +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) >> +{ >> + struct freq_update_hook *hook; >> + >> +#ifdef CONFIG_LOCKDEP >> + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); >> +#endif >> + >> + hook = rcu_dereference(*this_cpu_ptr(&cpufreq_freq_update_hook)); > > Small fix. You forgot to change this to rcu_dereference_sched() (you > only fixed that in 01/10). Yup, thanks! I had to propagate the change throughout the queue and forgot about the last step. I'll send an updated patch shortly. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v3 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() 2016-03-04 3:18 ` [PATCH v2 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() Rafael J. Wysocki 2016-03-04 10:50 ` Juri Lelli @ 2016-03-04 13:30 ` Rafael J. Wysocki 2016-03-04 21:21 ` Steve Muckle 1 sibling, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 13:30 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> A subsequent change set will introduce a new cpufreq governor using CPU utilization information from the scheduler, so introduce cpufreq_update_util() (again) to allow that information to be passed to the new governor and make cpufreq_trigger_update() call it internally. To that end, add a new ->update_util callback pointer to struct freq_update_hook to be set by entities that want to use the util and max arguments and make cpufreq_update_util() use that callback if available or the ->func callback that only takes the time argument otherwise. In addition to that, arrange helpers to set/clear the utilization update hooks in such a way that the full ->update_util callbacks can only be set by code inside the kernel/sched/ directory. Update the current users of cpufreq_set_freq_update_hook() to use the new helpers. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Changes from v2: - Use rcu_dereference_sched() in cpufreq_update_util(). 
--- drivers/cpufreq/cpufreq_governor.c | 76 +++++++++++++-------------- drivers/cpufreq/intel_pstate.c | 8 +- include/linux/sched.h | 10 +-- kernel/sched/cpufreq.c | 101 +++++++++++++++++++++++++++++-------- kernel/sched/fair.c | 8 ++ kernel/sched/sched.h | 16 +++++ 6 files changed, 150 insertions(+), 69 deletions(-) Index: linux-pm/include/linux/sched.h =================================================================== --- linux-pm.orig/include/linux/sched.h +++ linux-pm/include/linux/sched.h @@ -2363,15 +2363,15 @@ static inline bool sched_can_stop_tick(v #endif #ifdef CONFIG_CPU_FREQ -void cpufreq_trigger_update(u64 time); - struct freq_update_hook { void (*func)(struct freq_update_hook *hook, u64 time); + void (*update_util)(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max); }; -void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook); -#else -static inline void cpufreq_trigger_update(u64 time) {} +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook, + void (*func)(struct freq_update_hook *hook, u64 time)); +void cpufreq_clear_freq_update_hook(int cpu); #endif #ifdef CONFIG_SCHED_AUTOGROUP Index: linux-pm/kernel/sched/cpufreq.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq.c +++ linux-pm/kernel/sched/cpufreq.c @@ -9,12 +9,12 @@ * published by the Free Software Foundation. */ -#include <linux/sched.h> +#include "sched.h" static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook); /** - * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer. + * set_freq_update_hook - Populate the CPU's freq_update_hook pointer. * @cpu: The CPU to set the pointer for. * @hook: New pointer value. * @@ -27,23 +27,96 @@ static DEFINE_PER_CPU(struct freq_update * accessed via the old update_util_data pointer or invoke synchronize_sched() * right after this function to avoid use-after-free. 
*/ -void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook) +static void set_freq_update_hook(int cpu, struct freq_update_hook *hook) { - if (WARN_ON(hook && !hook->func)) + rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); +} + +/** + * cpufreq_set_freq_update_hook - Set the CPU's frequency update callback. + * @cpu: The CPU to set the callback for. + * @hook: New freq_update_hook pointer value. + * @func: Callback function to use with the new hook. + */ +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook, + void (*func)(struct freq_update_hook *hook, u64 time)) +{ + if (WARN_ON(!hook || !func)) return; - rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); + hook->func = func; + set_freq_update_hook(cpu, hook); } EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook); /** + * cpufreq_set_update_util_hook - Set the CPU's utilization update callback. + * @cpu: The CPU to set the callback for. + * @hook: New freq_update_hook pointer value. + * @update_util: Callback function to use with the new hook. + */ +void cpufreq_set_update_util_hook(int cpu, struct freq_update_hook *hook, + void (*update_util)(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max)) +{ + if (WARN_ON(!hook || !update_util)) + return; + + hook->update_util = update_util; + set_freq_update_hook(cpu, hook); +} +EXPORT_SYMBOL_GPL(cpufreq_set_update_util_hook); + +/** + * cpufreq_clear_freq_update_hook - Clear the CPU's freq_update_hook pointer. + * @cpu: The CPU to clear the pointer for. + */ +void cpufreq_clear_freq_update_hook(int cpu) +{ + set_freq_update_hook(cpu, NULL); +} +EXPORT_SYMBOL_GPL(cpufreq_clear_freq_update_hook); + +/** + * cpufreq_update_util - Take a note about CPU utilization changes. + * @time: Current time. + * @util: CPU utilization. + * @max: CPU capacity. + * + * This function is called on every invocation of update_load_avg() on the CPU + * whose utilization is being updated.
+ * + * It can only be called from RCU-sched read-side critical sections. + */ +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) +{ + struct freq_update_hook *hook; + +#ifdef CONFIG_LOCKDEP + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); +#endif + + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); + /* + * If this isn't inside of an RCU-sched read-side critical section, hook + * may become NULL after the check below. + */ + if (hook) { + if (hook->update_util) + hook->update_util(hook, time, util, max); + else + hook->func(hook, time); + } +} + +/** * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. * @time: Current time. * * The way cpufreq is currently arranged requires it to evaluate the CPU * performance state (frequency/voltage) on a regular basis. To facilitate - * that, this function is called by update_load_avg() in CFS when executed for - * the current CPU's runqueue. + * that, cpufreq_update_util() is called by update_load_avg() in CFS when + * executed for the current CPU's runqueue. * * However, this isn't sufficient to prevent the CPU from being stuck in a * completely inadequate performance level for too long, because the calls @@ -57,17 +130,5 @@ EXPORT_SYMBOL_GPL(cpufreq_set_freq_updat */ void cpufreq_trigger_update(u64 time) { - struct freq_update_hook *hook; - -#ifdef CONFIG_LOCKDEP - WARN_ON(debug_locks && !rcu_read_lock_sched_held()); -#endif - - hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); - /* - * If this isn't inside of an RCU-sched read-side critical section, hook - * may become NULL after the check below. 
- */ - if (hook) - hook->func(hook, time); + cpufreq_update_util(time, ULONG_MAX, 0); } Index: linux-pm/kernel/sched/fair.c =================================================================== --- linux-pm.orig/kernel/sched/fair.c +++ linux-pm/kernel/sched/fair.c @@ -2839,6 +2839,8 @@ static inline void update_load_avg(struc update_tg_load_avg(cfs_rq, 0); if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) { + unsigned long max = rq->cpu_capacity_orig; + /* * There are a few boundary cases this might miss but it should * get called often enough that that should (hopefully) not be @@ -2847,9 +2849,11 @@ static inline void update_load_avg(struc * the next tick/schedule should update. * * It will not get called when we go idle, because the idle - * thread is a different class (!fair). + * thread is a different class (!fair), nor will the utilization + * number include things like RT tasks. */ - cpufreq_trigger_update(rq_clock(rq)); + cpufreq_update_util(rq_clock(rq), + min(cfs_rq->avg.util_avg, max), max); } } Index: linux-pm/kernel/sched/sched.h =================================================================== --- linux-pm.orig/kernel/sched/sched.h +++ linux-pm/kernel/sched/sched.h @@ -1739,3 +1739,19 @@ static inline u64 irq_time_read(int cpu) } #endif /* CONFIG_64BIT */ #endif /* CONFIG_IRQ_TIME_ACCOUNTING */ + +#ifdef CONFIG_CPU_FREQ +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max); +void cpufreq_trigger_update(u64 time); +void cpufreq_set_update_util_hook(int cpu, struct freq_update_hook *hook, + void (*update_util)(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max)); +static inline void cpufreq_clear_update_util_hook(int cpu) +{ + cpufreq_clear_freq_update_hook(cpu); +} +#else +static inline void cpufreq_update_util(u64 time, unsigned long util, + unsigned long max) {} +static inline void cpufreq_trigger_update(u64 time) {} +#endif /* CONFIG_CPU_FREQ */ Index: linux-pm/drivers/cpufreq/intel_pstate.c
=================================================================== --- linux-pm.orig/drivers/cpufreq/intel_pstate.c +++ linux-pm/drivers/cpufreq/intel_pstate.c @@ -1088,8 +1088,8 @@ static int intel_pstate_init_cpu(unsigne intel_pstate_busy_pid_reset(cpu); intel_pstate_sample(cpu, 0); - cpu->update_hook.func = intel_pstate_freq_update; - cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook); + cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook, + intel_pstate_freq_update); pr_debug("intel_pstate: controlling: cpu %d\n", cpunum); @@ -1173,7 +1173,7 @@ static void intel_pstate_stop_cpu(struct pr_debug("intel_pstate: CPU %d exiting\n", cpu_num); - cpufreq_set_freq_update_hook(cpu_num, NULL); + cpufreq_clear_freq_update_hook(cpu_num); synchronize_sched(); if (hwp_active) @@ -1441,7 +1441,7 @@ out: get_online_cpus(); for_each_online_cpu(cpu) { if (all_cpu_data[cpu]) { - cpufreq_set_freq_update_hook(cpu, NULL); + cpufreq_clear_freq_update_hook(cpu); synchronize_sched(); kfree(all_cpu_data[cpu]); } Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -211,43 +211,6 @@ unsigned int dbs_update(struct cpufreq_p } EXPORT_SYMBOL_GPL(dbs_update); -static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs, - unsigned int delay_us) -{ - struct cpufreq_policy *policy = policy_dbs->policy; - int cpu; - - gov_update_sample_delay(policy_dbs, delay_us); - policy_dbs->last_sample_time = 0; - - for_each_cpu(cpu, policy->cpus) { - struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu); - - cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook); - } -} - -static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy) -{ - int i; - - for_each_cpu(i, policy->cpus) - cpufreq_set_freq_update_hook(i, NULL); - - synchronize_sched(); -} - -static void gov_cancel_work(struct cpufreq_policy
*policy) -{ - struct policy_dbs_info *policy_dbs = policy->governor_data; - - gov_clear_freq_update_hooks(policy_dbs->policy); - irq_work_sync(&policy_dbs->irq_work); - cancel_work_sync(&policy_dbs->work); - atomic_set(&policy_dbs->work_count, 0); - policy_dbs->work_in_progress = false; -} - static void dbs_work_handler(struct work_struct *work) { struct policy_dbs_info *policy_dbs; @@ -334,6 +297,44 @@ static void dbs_freq_update_handler(stru irq_work_queue(&policy_dbs->irq_work); } +static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs, + unsigned int delay_us) +{ + struct cpufreq_policy *policy = policy_dbs->policy; + int cpu; + + gov_update_sample_delay(policy_dbs, delay_us); + policy_dbs->last_sample_time = 0; + + for_each_cpu(cpu, policy->cpus) { + struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu); + + cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook, + dbs_freq_update_handler); + } +} + +static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy) +{ + int i; + + for_each_cpu(i, policy->cpus) + cpufreq_clear_freq_update_hook(i); + + synchronize_sched(); +} + +static void gov_cancel_work(struct cpufreq_policy *policy) +{ + struct policy_dbs_info *policy_dbs = policy->governor_data; + + gov_clear_freq_update_hooks(policy_dbs->policy); + irq_work_sync(&policy_dbs->irq_work); + cancel_work_sync(&policy_dbs->work); + atomic_set(&policy_dbs->work_count, 0); + policy_dbs->work_in_progress = false; +} + static struct policy_dbs_info *alloc_policy_dbs_info(struct cpufreq_policy *policy, struct dbs_governor *gov) { @@ -356,7 +357,6 @@ static struct policy_dbs_info *alloc_pol struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j); j_cdbs->policy_dbs = policy_dbs; - j_cdbs->update_hook.func = dbs_freq_update_handler; } return policy_dbs; } ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v3 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() 2016-03-04 13:30 ` [PATCH v3 " Rafael J. Wysocki @ 2016-03-04 21:21 ` Steve Muckle 2016-03-04 21:27 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Steve Muckle @ 2016-03-04 21:21 UTC (permalink / raw) To: Rafael J. Wysocki, Linux PM list Cc: Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On 03/04/2016 05:30 AM, Rafael J. Wysocki wrote: > +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) > +{ > + struct freq_update_hook *hook; > + > +#ifdef CONFIG_LOCKDEP > + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); > +#endif > + > + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); > + /* > + * If this isn't inside of an RCU-sched read-side critical section, hook > + * may become NULL after the check below. > + */ > + if (hook) { > + if (hook->update_util) > + hook->update_util(hook, time, util, max); > + else > + hook->func(hook, time); > + } Is it worth having two hook types? ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v3 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() 2016-03-04 21:21 ` Steve Muckle @ 2016-03-04 21:27 ` Rafael J. Wysocki 2016-03-04 21:36 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 21:27 UTC (permalink / raw) To: Steve Muckle Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Fri, Mar 4, 2016 at 10:21 PM, Steve Muckle <steve.muckle@linaro.org> wrote: > On 03/04/2016 05:30 AM, Rafael J. Wysocki wrote: >> +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) >> +{ >> + struct freq_update_hook *hook; >> + >> +#ifdef CONFIG_LOCKDEP >> + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); >> +#endif >> + >> + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); >> + /* >> + * If this isn't inside of an RCU-sched read-side critical section, hook >> + * may become NULL after the check below. >> + */ >> + if (hook) { >> + if (hook->update_util) >> + hook->update_util(hook, time, util, max); >> + else >> + hook->func(hook, time); >> + } > > Is it worth having two hook types? Well, that's why I said "maybe over the top" in the changelog comments. :-) If we want to isolate the "old" governors from util/max entirely, then yes. If we don't care that much, then no. I'm open to both possibilities. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v3 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() 2016-03-04 21:27 ` Rafael J. Wysocki @ 2016-03-04 21:36 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 21:36 UTC (permalink / raw) To: Steve Muckle Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Fri, Mar 4, 2016 at 10:27 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Fri, Mar 4, 2016 at 10:21 PM, Steve Muckle <steve.muckle@linaro.org> wrote: >> On 03/04/2016 05:30 AM, Rafael J. Wysocki wrote: >>> +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) >>> +{ >>> + struct freq_update_hook *hook; >>> + >>> +#ifdef CONFIG_LOCKDEP >>> + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); >>> +#endif >>> + >>> + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); >>> + /* >>> + * If this isn't inside of an RCU-sched read-side critical section, hook >>> + * may become NULL after the check below. >>> + */ >>> + if (hook) { >>> + if (hook->update_util) >>> + hook->update_util(hook, time, util, max); >>> + else >>> + hook->func(hook, time); >>> + } >> >> Is it worth having two hook types? > > Well, that's why I said "maybe over the top" in the changelog comments. :-) > > If we want to isolate the "old" governors from util/max entirely, then yes. > > If we don't care that much, then no. > > I'm open to both possibilities. But in the latter case I don't see a particular reason to put the new governor under kernel/sched/ too and as I wrote in the changelog comments to patch [10/10], I personally think that it would be cleaner to keep it under drivers/cpufreq/. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v2 10/10] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki ` (8 preceding siblings ...) 2016-03-04 3:18 ` [PATCH v2 9/10] cpufreq: sched: Re-introduce cpufreq_update_util() Rafael J. Wysocki @ 2016-03-04 3:35 ` Rafael J. Wysocki 2016-03-04 11:26 ` Juri Lelli 2016-03-08 2:23 ` [PATCH v3 0/7] cpufreq: schedutil governor Rafael J. Wysocki 10 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 3:35 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Add a new cpufreq scaling governor, called "schedutil", that uses scheduler-provided CPU utilization information as input for making its decisions. Doing that is possible after commit fe7034338ba0 (cpufreq: Add mechanism for registering utilization update callbacks) that introduced cpufreq_update_util() called by the scheduler on utilization changes (from CFS) and RT/DL task status updates. In particular, CPU frequency scaling decisions may be based on the utilization data passed to cpufreq_update_util() by CFS. The new governor is relatively simple. The frequency selection formula used by it is next_freq = util * max_freq / max where util and max are the utilization and CPU capacity coming from CFS. All of the computations are carried out in the utilization update handlers provided by the new governor. One of those handlers is used for cpufreq policies shared between multiple CPUs and the other one is for policies with one CPU only (and therefore it doesn't need to use any extra synchronization means). The governor supports fast frequency switching if that is supported by the cpufreq driver in use and possible for the given policy. 
In the fast switching case, all operations of the governor take place in its utilization update handlers. If fast switching cannot be used, the frequency switch operations are carried out with the help of a work item which only calls __cpufreq_driver_target() (under a mutex) to trigger a frequency update (to a value already computed beforehand in one of the utilization update handlers). Currently, the governor treats all of the RT and DL tasks as "unknown utilization" and sets the frequency to the allowed maximum when updated from the RT or DL sched classes. That heavy-handed approach should be replaced with something more subtle and specifically targeted at RT and DL tasks. The governor shares some sysfs attributes management code with the "ondemand" and "conservative" governors and uses some common definitions from cpufreq.h, but apart from that it is stand-alone. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Changes from the previous version: - New frequency selection formula and modifications related to that. - The file is now located in kernel/sched/. Initially, I had hoped that it would be possible to split the code into a library part that might go into kernel/sched/ and the governor interface plus sysfs-related code, but that split would have been artificial and I wanted the governor to be one module as a whole. So that didn't work out. Also the way it is configured and built is somewhat bizarre, as the Kconfig options are in the cpufreq Kconfig, but the code they are related to is located in kernel/sched/ (which is not exactly straightforward). Overall, I'd be happier if the governor could stay in drivers/cpufreq/. 
--- drivers/cpufreq/Kconfig | 26 + drivers/cpufreq/cpufreq_governor.h | 1 include/linux/cpufreq.h | 3 kernel/sched/Makefile | 1 kernel/sched/cpufreq_schedutil.c | 487 +++++++++++++++++++++++++++++++++++++ 5 files changed, 517 insertions(+), 1 deletion(-) Index: linux-pm/drivers/cpufreq/Kconfig =================================================================== --- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor. + +config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL + bool "schedutil" + select CPU_FREQ_GOV_SCHEDUTIL + select CPU_FREQ_GOV_PERFORMANCE + help + Use the 'schedutil' CPUFreq governor by default. If unsure, + have a look at the help section of that governor. The fallback + governor will be 'performance'. + endchoice config CPU_FREQ_GOV_PERFORMANCE @@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE If in doubt, say N. +config CPU_FREQ_GOV_SCHEDUTIL + tristate "'schedutil' cpufreq policy governor" + depends on CPU_FREQ + select CPU_FREQ_GOV_ATTR_SET + select IRQ_WORK + help + The frequency selection formula used by this governor is analogous + to the one used by 'ondemand', but instead of computing CPU load + as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU + utilization data provided by the scheduler as input. + + To compile this driver as a module, choose M here: the + module will be called cpufreq_schedutil. + + If in doubt, say N. + comment "CPU frequency scaling drivers" config CPUFREQ_DT Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- /dev/null +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -0,0 +1,487 @@ +/* + * CPUFreq governor based on scheduler-provided CPU utilization data. 
+ * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/module.h> + +#include "sched.h" + +struct sugov_tunables { + struct gov_attr_set attr_set; + unsigned int rate_limit_us; +}; + +struct sugov_policy { + struct cpufreq_policy *policy; + + struct sugov_tunables *tunables; + struct list_head tunables_hook; + + raw_spinlock_t update_lock; /* For shared policies */ + u64 last_freq_update_time; + s64 freq_update_delay_ns; + unsigned int next_freq; + + /* The next fields are only needed if fast switch cannot be used. */ + struct irq_work irq_work; + struct work_struct work; + struct mutex work_lock; + bool work_in_progress; + + bool need_freq_update; +}; + +struct sugov_cpu { + struct freq_update_hook update_hook; + struct sugov_policy *sg_policy; + + /* The fields below are only needed when sharing a policy. 
*/ + unsigned long util; + unsigned long max; + u64 last_update; +}; + +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); + +/************************ Governor internals ***********************/ + +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) +{ + u64 delta_ns; + + if (sg_policy->work_in_progress) + return false; + + if (unlikely(sg_policy->need_freq_update)) { + sg_policy->need_freq_update = false; + return true; + } + + delta_ns = time - sg_policy->last_freq_update_time; + return (s64)delta_ns >= sg_policy->freq_update_delay_ns; +} + +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, + unsigned int next_freq) +{ + struct cpufreq_policy *policy = sg_policy->policy; + + if (next_freq > policy->max) + next_freq = policy->max; + else if (next_freq < policy->min) + next_freq = policy->min; + + sg_policy->last_freq_update_time = time; + if (sg_policy->next_freq == next_freq) + return; + + sg_policy->next_freq = next_freq; + if (policy->fast_switch_possible) { + cpufreq_driver_fast_switch(policy, next_freq, CPUFREQ_RELATION_L); + } else { + sg_policy->work_in_progress = true; + irq_work_queue(&sg_policy->irq_work); + } +} + +static void sugov_update_single(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + unsigned int max_f, next_f; + + if (!sugov_should_update_freq(sg_policy, time)) + return; + + max_f = sg_policy->policy->cpuinfo.max_freq; + next_f = util > max ? 
max_f : util * max_f / max; + sugov_update_commit(sg_policy, time, next_f); +} + +static unsigned int sugov_next_freq(struct sugov_policy *sg_policy, + unsigned long util, unsigned long max) +{ + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int max_f = policy->cpuinfo.max_freq; + u64 last_freq_update_time = sg_policy->last_freq_update_time; + unsigned int j; + + if (util > max) + return max_f; + + for_each_cpu(j, policy->cpus) { + struct sugov_cpu *j_sg_cpu; + unsigned long j_util, j_max; + u64 delta_ns; + + if (j == smp_processor_id()) + continue; + + j_sg_cpu = &per_cpu(sugov_cpu, j); + /* + * If the CPU utilization was last updated before the previous + * frequency update and the time elapsed between the last update + * of the CPU utilization and the last frequency update is long + * enough, don't take the CPU into account as it probably is + * idle now. + */ + delta_ns = last_freq_update_time - j_sg_cpu->last_update; + if ((s64)delta_ns > NSEC_PER_SEC / HZ) + continue; + + j_util = j_sg_cpu->util; + j_max = j_sg_cpu->max; + if (j_util > j_max) + return max_f; + + if (j_util * max > j_max * util) { + util = j_util; + max = j_max; + } + } + + return util * max_f / max; +} + +static void sugov_update_shared(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + unsigned int next_f; + + raw_spin_lock(&sg_policy->update_lock); + + sg_cpu->util = util; + sg_cpu->max = max; + sg_cpu->last_update = time; + + if (sugov_should_update_freq(sg_policy, time)) { + next_f = sugov_next_freq(sg_policy, util, max); + sugov_update_commit(sg_policy, time, next_f); + } + + raw_spin_unlock(&sg_policy->update_lock); +} + +static void sugov_work(struct work_struct *work) +{ + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work); + + mutex_lock(&sg_policy->work_lock); + 
__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, + CPUFREQ_RELATION_L); + mutex_unlock(&sg_policy->work_lock); + + sg_policy->work_in_progress = false; +} + +static void sugov_irq_work(struct irq_work *irq_work) +{ + struct sugov_policy *sg_policy; + + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); + schedule_work(&sg_policy->work); +} + +/************************** sysfs interface ************************/ + +static struct sugov_tunables *global_tunables; +static DEFINE_MUTEX(global_tunables_lock); + +static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set) +{ + return container_of(attr_set, struct sugov_tunables, attr_set); +} + +static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + + return sprintf(buf, "%u\n", tunables->rate_limit_us); +} + +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, + size_t count) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + struct sugov_policy *sg_policy; + unsigned int rate_limit_us; + int ret; + + ret = sscanf(buf, "%u", &rate_limit_us); + if (ret != 1) + return -EINVAL; + + tunables->rate_limit_us = rate_limit_us; + + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) + sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC; + + return count; +} + +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us); + +static struct attribute *sugov_attributes[] = { + &rate_limit_us.attr, + NULL +}; + +static struct kobj_type sugov_tunables_ktype = { + .default_attrs = sugov_attributes, + .sysfs_ops = &governor_sysfs_ops, +}; + +/********************** cpufreq governor interface *********************/ + +static struct cpufreq_governor schedutil_gov; + +static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + + sg_policy = 
kzalloc(sizeof(*sg_policy), GFP_KERNEL); + if (!sg_policy) + return NULL; + + sg_policy->policy = policy; + init_irq_work(&sg_policy->irq_work, sugov_irq_work); + INIT_WORK(&sg_policy->work, sugov_work); + mutex_init(&sg_policy->work_lock); + raw_spin_lock_init(&sg_policy->update_lock); + return sg_policy; +} + +static void sugov_policy_free(struct sugov_policy *sg_policy) +{ + mutex_destroy(&sg_policy->work_lock); + kfree(sg_policy); +} + +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy) +{ + struct sugov_tunables *tunables; + + tunables = kzalloc(sizeof(*tunables), GFP_KERNEL); + if (tunables) + gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook); + + return tunables; +} + +static void sugov_tunables_free(struct sugov_tunables *tunables) +{ + if (!have_governor_per_policy()) + global_tunables = NULL; + + kfree(tunables); +} + +static int sugov_init(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + struct sugov_tunables *tunables; + unsigned int lat; + int ret = 0; + + /* State should be equivalent to EXIT */ + if (policy->governor_data) + return -EBUSY; + + sg_policy = sugov_policy_alloc(policy); + if (!sg_policy) + return -ENOMEM; + + mutex_lock(&global_tunables_lock); + + if (global_tunables) { + if (WARN_ON(have_governor_per_policy())) { + ret = -EINVAL; + goto free_sg_policy; + } + policy->governor_data = sg_policy; + sg_policy->tunables = global_tunables; + + gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook); + goto out; + } + + tunables = sugov_tunables_alloc(sg_policy); + if (!tunables) { + ret = -ENOMEM; + goto free_sg_policy; + } + + tunables->rate_limit_us = LATENCY_MULTIPLIER; + lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC; + if (lat) + tunables->rate_limit_us *= lat; + + if (!have_governor_per_policy()) + global_tunables = tunables; + + policy->governor_data = sg_policy; + sg_policy->tunables = tunables; + + ret = 
kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype, + get_governor_parent_kobj(policy), "%s", + schedutil_gov.name); + if (!ret) + goto out; + + /* Failure, so roll back. */ + policy->governor_data = NULL; + sugov_tunables_free(tunables); + + free_sg_policy: + pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret); + sugov_policy_free(sg_policy); + + out: + mutex_unlock(&global_tunables_lock); + return ret; +} + +static int sugov_exit(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + struct sugov_tunables *tunables = sg_policy->tunables; + unsigned int count; + + mutex_lock(&global_tunables_lock); + + count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook); + policy->governor_data = NULL; + if (!count) + sugov_tunables_free(tunables); + + mutex_unlock(&global_tunables_lock); + + sugov_policy_free(sg_policy); + return 0; +} + +static int sugov_start(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC; + sg_policy->last_freq_update_time = 0; + sg_policy->next_freq = UINT_MAX; + sg_policy->work_in_progress = false; + sg_policy->need_freq_update = false; + + for_each_cpu(cpu, policy->cpus) { + struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu); + + sg_cpu->sg_policy = sg_policy; + if (policy_is_shared(policy)) { + sg_cpu->util = ULONG_MAX; + sg_cpu->max = 0; + sg_cpu->last_update = 0; + cpufreq_set_update_util_hook(cpu, &sg_cpu->update_hook, + sugov_update_shared); + } else { + cpufreq_set_update_util_hook(cpu, &sg_cpu->update_hook, + sugov_update_single); + } + } + return 0; +} + +static int sugov_stop(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + for_each_cpu(cpu, policy->cpus) + cpufreq_clear_update_util_hook(cpu); + + synchronize_sched(); + + 
irq_work_sync(&sg_policy->irq_work); + cancel_work_sync(&sg_policy->work); + return 0; +} + +static int sugov_limits(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + + if (!policy->fast_switch_possible) { + mutex_lock(&sg_policy->work_lock); + + if (policy->max < policy->cur) + __cpufreq_driver_target(policy, policy->max, + CPUFREQ_RELATION_H); + else if (policy->min > policy->cur) + __cpufreq_driver_target(policy, policy->min, + CPUFREQ_RELATION_L); + + mutex_unlock(&sg_policy->work_lock); + } + + sg_policy->need_freq_update = true; + return 0; +} + +int sugov_governor(struct cpufreq_policy *policy, unsigned int event) +{ + if (event == CPUFREQ_GOV_POLICY_INIT) { + return sugov_init(policy); + } else if (policy->governor_data) { + switch (event) { + case CPUFREQ_GOV_POLICY_EXIT: + return sugov_exit(policy); + case CPUFREQ_GOV_START: + return sugov_start(policy); + case CPUFREQ_GOV_STOP: + return sugov_stop(policy); + case CPUFREQ_GOV_LIMITS: + return sugov_limits(policy); + } + } + return -EINVAL; +} + +static struct cpufreq_governor schedutil_gov = { + .name = "schedutil", + .governor = sugov_governor, + .owner = THIS_MODULE, +}; + +static int __init sugov_module_init(void) +{ + return cpufreq_register_governor(&schedutil_gov); +} + +static void __exit sugov_module_exit(void) +{ + cpufreq_unregister_governor(&schedutil_gov); +} + +MODULE_AUTHOR("Rafael J. 
Wysocki <rafael.j.wysocki@intel.com>"); +MODULE_DESCRIPTION("Utilization-based CPU frequency selection"); +MODULE_LICENSE("GPL"); + +#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL +struct cpufreq_governor *cpufreq_default_governor(void) +{ + return &schedutil_gov; +} + +fs_initcall(sugov_module_init); +#else +module_init(sugov_module_init); +#endif +module_exit(sugov_module_exit); Index: linux-pm/kernel/sched/Makefile =================================================================== --- linux-pm.orig/kernel/sched/Makefile +++ linux-pm/kernel/sched/Makefile @@ -20,3 +20,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o obj-$(CONFIG_CPU_FREQ) += cpufreq.o +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o Index: linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -34,7 +34,6 @@ * this governor will not work. All times here are in us (micro seconds). */ #define MIN_SAMPLING_RATE_RATIO (2) -#define LATENCY_MULTIPLIER (1000) #define MIN_LATENCY_MULTIPLIER (20) #define TRANSITION_LATENCY_LIMIT (10 * 1000 * 1000) Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -468,6 +468,9 @@ void cpufreq_unregister_governor(struct struct cpufreq_governor *cpufreq_default_governor(void); struct cpufreq_governor *cpufreq_fallback_governor(void); +/* Coefficient for computing default sampling rate/rate limit in governors */ +#define LATENCY_MULTIPLIER (1000) + /* Governor attribute set */ struct gov_attr_set { struct kobject kobj; ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 10/10] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-04 3:35 ` [PATCH v2 10/10] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki @ 2016-03-04 11:26 ` Juri Lelli 2016-03-04 13:19 ` Rafael J. Wysocki 2016-03-04 15:56 ` Srinivas Pandruvada 0 siblings, 2 replies; 158+ messages in thread From: Juri Lelli @ 2016-03-04 11:26 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar Hi Rafael, On 04/03/16 04:35, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Add a new cpufreq scaling governor, called "schedutil", that uses > scheduler-provided CPU utilization information as input for making > its decisions. > > Doing that is possible after commit fe7034338ba0 (cpufreq: Add > mechanism for registering utilization update callbacks) that > introduced cpufreq_update_util() called by the scheduler on > utilization changes (from CFS) and RT/DL task status updates. > In particular, CPU frequency scaling decisions may be based on > the the utilization data passed to cpufreq_update_util() by CFS. > > The new governor is relatively simple. > > The frequency selection formula used by it is > > next_freq = util * max_freq / max > > where util and max are the utilization and CPU capacity coming from CFS. > The formula looks better to me now. However, problem is that, if you have freq. invariance, util will slowly saturate to the current capacity. So, we won't trigger OPP changes for a task that for example starts light and then becomes big. This is the same problem we faced with schedfreq. The current solution there is to use a margin for calculating a threshold (80% of current capacity ATM). Once util goes above that threshold we trigger an OPP change. 
Current policy is pretty aggressive, we go to max_f and then adapt to the "real" util during successive enqueues. This was also thought to cope with the fact that PELT seems slow to react to abrupt changes in tasks' behaviour. I'm not saying this is the definitive solution, but I fear something along this line is needed when you add freq invariance in the mix. Best, - Juri ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 10/10] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-04 11:26 ` Juri Lelli @ 2016-03-04 13:19 ` Rafael J. Wysocki 2016-03-04 15:56 ` Srinivas Pandruvada 1 sibling, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-04 13:19 UTC (permalink / raw) To: Juri Lelli Cc: Rafael J. Wysocki, Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Fri, Mar 4, 2016 at 12:26 PM, Juri Lelli <juri.lelli@arm.com> wrote: > Hi Rafael, Hi, > On 04/03/16 04:35, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> Add a new cpufreq scaling governor, called "schedutil", that uses >> scheduler-provided CPU utilization information as input for making >> its decisions. >> >> Doing that is possible after commit fe7034338ba0 (cpufreq: Add >> mechanism for registering utilization update callbacks) that >> introduced cpufreq_update_util() called by the scheduler on >> utilization changes (from CFS) and RT/DL task status updates. >> In particular, CPU frequency scaling decisions may be based on >> the the utilization data passed to cpufreq_update_util() by CFS. >> >> The new governor is relatively simple. >> >> The frequency selection formula used by it is >> >> next_freq = util * max_freq / max >> >> where util and max are the utilization and CPU capacity coming from CFS. >> > > The formula looks better to me now. However, problem is that, if you > have freq. invariance, util will slowly saturate to the current > capacity. So, we won't trigger OPP changes for a task that for example > starts light and then becomes big. > > This is the same problem we faced with schedfreq. The current solution > there is to use a margin for calculating a threshold (80% of current > capacity ATM). Once util goes above that threshold we trigger an OPP > change. 
Current policy is pretty aggressive, we go to max_f and then > adapt to the "real" util during successive enqueues. This was also > tought to cope with the fact that PELT seems slow to react to abrupt > changes in tasks behaviour. > > I'm not saying this is the definitive solution, but I fear something > along this line is needed when you add freq invariance in the mix. I really would like to avoid adding factors that need to be determined experimentally, because the result of that tends to depend on the system where the experiment is carried out and tunables simply don't work (99% or maybe even more users don't change the defaults anyway). So I would really like to use a formula that's based on some science and doesn't depend on additional input. Now, since the equation generally is f = a * x + b (f - frequency, x = util/max) and there are good arguments for b = 0, it all boils down to what number to take as a. a = max_freq is a good candidate (that's what I'm using right now), but it may turn out to be too small. Another reasonable candidate is a = min_freq + max_freq, because then x = 0.5 selects the frequency in the middle of the available range, but that may turn out to be way too big if min_freq is high (like higher than 50% of max_freq). I need to think more about that and admittedly my understanding of the frequency invariance consequences is limited ATM. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v2 10/10] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-04 11:26 ` Juri Lelli 2016-03-04 13:19 ` Rafael J. Wysocki @ 2016-03-04 15:56 ` Srinivas Pandruvada 1 sibling, 0 replies; 158+ messages in thread From: Srinivas Pandruvada @ 2016-03-04 15:56 UTC (permalink / raw) To: Juri Lelli, Rafael J. Wysocki Cc: Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Fri, 2016-03-04 at 11:26 +0000, Juri Lelli wrote: > Hi Rafael, > > On 04/03/16 04:35, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > Add a new cpufreq scaling governor, called "schedutil", that uses > > scheduler-provided CPU utilization information as input for making > > its decisions. > > > > Doing that is possible after commit fe7034338ba0 (cpufreq: Add > > mechanism for registering utilization update callbacks) that > > introduced cpufreq_update_util() called by the scheduler on > > utilization changes (from CFS) and RT/DL task status updates. > > In particular, CPU frequency scaling decisions may be based on > > the the utilization data passed to cpufreq_update_util() by CFS. > > > > The new governor is relatively simple. > > > > The frequency selection formula used by it is > > > > next_freq = util * max_freq / max > > > > where util and max are the utilization and CPU capacity coming from > > CFS. > > > > The formula looks better to me now. However, problem is that, if you > have freq. invariance, util will slowly saturate to the current > capacity. So, we won't trigger OPP changes for a task that for > example > starts light and then becomes big. > > This is the same problem we faced with schedfreq. The current > solution > there is to use a margin for calculating a threshold (80% of current > capacity ATM). Once util goes above that threshold we trigger an OPP > change. 
Current policy is pretty aggressive, we go to max_f and then > adapt to the "real" util during successive enqueues. This was also > tought to cope with the fact that PELT seems slow to react to abrupt > changes in tasks behaviour. > I also tried something like this in intel_pstate with scheduler util, where you ramp up to turbo when a threshold percent is exceeded, then ramp down slowly in steps. This helped some workloads like tbench to perform better, but it resulted in lower performance/watt on the specpower server workload. The problem is finding what is the right threshold value. Thanks, Srinivas > I'm not saying this is the definitive solution, but I fear something > along this line is needed when you add freq invariance in the mix. > > Best, > > - Juri > -- > To unsubscribe from this list: send the line "unsubscribe linux-pm" > in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v3 0/7] cpufreq: schedutil governor 2016-03-04 2:56 ` [PATCH v2 0/10] cpufreq: schedutil governor Rafael J. Wysocki ` (9 preceding siblings ...) 2016-03-04 3:35 ` [PATCH v2 10/10] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki @ 2016-03-08 2:23 ` Rafael J. Wysocki 2016-03-08 2:25 ` [PATCH v3 1/7][Resend] cpufreq: Rework the scheduler hooks for triggering updates Rafael J. Wysocki ` (7 more replies) 10 siblings, 8 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-08 2:23 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Friday, March 04, 2016 03:56:09 AM Rafael J. Wysocki wrote: > On Wednesday, March 02, 2016 02:56:28 AM Rafael J. Wysocki wrote: > > Hi, > > > > My previous intro message still applies somewhat, so here's a link: > > > > http://marc.info/?l=linux-pm&m=145609673008122&w=2 > > > > The executive summary of the motivation is that I wanted to do two things: > > use the utilization data from the scheduler (it's passed to the governor > > as aguments of update callbacks anyway) and make it possible to set > > CPU frequency without involving process context (fast frequency switching). > > > > Both have been prototyped in the previous RFCs: > > > > https://patchwork.kernel.org/patch/8426691/ > > https://patchwork.kernel.org/patch/8426741/ > > > > [cut] > > > > > Comments welcome. > > There were quite a few comments to address, so here's a new version. > > First off, my interpretation of what Ingo said earlier today (or yesterday > depending on your time zone) is that he wants all of the code dealing with > the util and max values to be located in kernel/sched/. 
I can understand > the motivation here, although schedutil shares some amount of code with > the other governors, so the dependency on cpufreq will still be there, even > if the code goes to kernel/sched/. Nevertheless, I decided to make that > change just to see how it would look like if not for anything else. > > To that end, I revived a patch I had before the first schedutil one to > remove util/max from the cpufreq hooks [7/10], moved the scheduler-related > code from drivers/cpufreq/cpufreq.c to kernel/sched/cpufreq.c (new file) > on top of that [8/10] and reintroduced cpufreq_update_util() in a slightly > different form [9/10]. I did it this way in case it turns out to be > necessary to apply [7/10] and [8/10] for the time being and defer the rest > to the next cycle. > > Apart from that, I changed the frequency selection formula in the new > governor to next_freq = util * max_freq / max and it seems to work. That > allowed the code to be simplified somewhat as I don't need the extra > relation field in struct sugov_policy now (RELATION_L is used everywhere). > > Finally, I tried to address the bikeshed comment from Viresh about the > "wrong" names of data types etc related to governor sysfs attributes > handling. Hopefully, the new ones are better. > > There are small tweaks all over on top of that. I've taken patches [1-2/10] from the previous iteration into linux-next as they were not controversial and improved things anyway. What follows is reordered a bit and reworked with respect to the v2. Patches [1-4/7] have not been modified (ie. resends). Patch [5/7] (fast switch support) has a mechanism to deal with notifiers included (works for me with the ACPI driver) and cpufreq_driver_fast_switch() is just a wrapper around the driver callback now (because the governor needs to do frequency tracing by itself as it turns out). Patch [6/7] makes the hooks use util and max arguments again, but this time the callback function format is the same for everyone (ie. 
4 arguments) and the new governor added by patch [7/7] goes into drivers/cpufreq/ as that is *much* cleaner IMO. The new frequency formula has been tweaked a bit once more to make more util/max values map to the top-most frequency (that matters for systems where turbo is "encoded" by an extra frequency level whose frequency is greater by 1 MHz than the previous one, for example). At this point I'm inclined to take patches [1-2/7] into linux-next for 4.6, because they set a clear boundary between the current linux-next code, which doesn't really use the utilization data, and schedutil, and defer the rest until after the 4.6 merge window. That will allow the new next-frequency formula to be tested, and maybe we can do something about passing util data from DL to cpufreq_update_util() in the meantime. If anyone has any issues with that plan, please let me know. Thanks, Rafael ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v3 1/7][Resend] cpufreq: Rework the scheduler hooks for triggering updates 2016-03-08 2:23 ` [PATCH v3 0/7] cpufreq: schedutil governor Rafael J. Wysocki @ 2016-03-08 2:25 ` Rafael J. Wysocki 2016-03-09 13:41 ` Peter Zijlstra 2016-03-08 2:26 ` [PATCH v3 2/7][Resend] cpufreq: Move scheduler-related code to the sched directory Rafael J. Wysocki ` (6 subsequent siblings) 7 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-08 2:25 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Commit fe7034338ba0 (cpufreq: Add mechanism for registering utilization update callbacks) added cpufreq_update_util() to be called by the scheduler (from the CFS part) on utilization updates. The goal was to allow CFS to pass utilization information to cpufreq and to trigger it to evaluate the frequency/voltage configuration (P-state) of every CPU on a regular basis. However, the last two arguments of that function are never used by the current code, so CFS might simply call cpufreq_trigger_update() instead of it (like the RT and DL sched classes). For this reason, drop the last two arguments of cpufreq_update_util(), rename it to cpufreq_trigger_update() and modify CFS to call it. Moreover, since the utilization is not involved in that now, rename data types, functions and variables related to cpufreq_trigger_update() to reflect that (eg. struct update_util_data becomes struct freq_update_hook and so on). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- No changes from v2. 
--- drivers/cpufreq/cpufreq.c | 52 +++++++++++++++++++++---------------- drivers/cpufreq/cpufreq_governor.c | 25 ++++++++--------- drivers/cpufreq/cpufreq_governor.h | 2 - drivers/cpufreq/intel_pstate.c | 15 ++++------ include/linux/cpufreq.h | 32 ++-------------------- kernel/sched/deadline.c | 2 - kernel/sched/fair.c | 13 +-------- kernel/sched/rt.c | 2 - 8 files changed, 58 insertions(+), 85 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -65,57 +65,65 @@ static struct cpufreq_driver *cpufreq_dr static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data); static DEFINE_RWLOCK(cpufreq_driver_lock); -static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); +static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook); /** - * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer. + * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer. * @cpu: The CPU to set the pointer for. - * @data: New pointer value. + * @hook: New pointer value. * - * Set and publish the update_util_data pointer for the given CPU. That pointer - * points to a struct update_util_data object containing a callback function - * to call from cpufreq_update_util(). That function will be called from an RCU - * read-side critical section, so it must not sleep. + * Set and publish the freq_update_hook pointer for the given CPU. That pointer + * points to a struct freq_update_hook object containing a callback function + * to call from cpufreq_trigger_update(). That function will be called from + * an RCU read-side critical section, so it must not sleep. * * Callers must use RCU-sched callbacks to free any memory that might be * accessed via the old update_util_data pointer or invoke synchronize_sched() * right after this function to avoid use-after-free. 
*/ -void cpufreq_set_update_util_data(int cpu, struct update_util_data *data) +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook) { - if (WARN_ON(data && !data->func)) + if (WARN_ON(hook && !hook->func)) return; - rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data); + rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); } -EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data); +EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook); /** - * cpufreq_update_util - Take a note about CPU utilization changes. + * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. * @time: Current time. - * @util: Current utilization. - * @max: Utilization ceiling. * - * This function is called by the scheduler on every invocation of - * update_load_avg() on the CPU whose utilization is being updated. + * The way cpufreq is currently arranged requires it to evaluate the CPU + * performance state (frequency/voltage) on a regular basis. To facilitate + * that, this function is called by update_load_avg() in CFS when executed for + * the current CPU's runqueue. * - * It can only be called from RCU-sched read-side critical sections. + * However, this isn't sufficient to prevent the CPU from being stuck in a + * completely inadequate performance level for too long, because the calls + * from CFS will not be made if RT or deadline tasks are active all the time + * (or there are RT and DL tasks only). + * + * As a workaround for that issue, this function is called by the RT and DL + * sched classes to trigger extra cpufreq updates to prevent it from stalling, + * but that really is a band-aid. Going forward it should be replaced with + * solutions targeted more specifically at RT and DL tasks. 
*/ -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) +void cpufreq_trigger_update(u64 time) { - struct update_util_data *data; + struct freq_update_hook *hook; #ifdef CONFIG_LOCKDEP WARN_ON(debug_locks && !rcu_read_lock_sched_held()); #endif - data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)); + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); /* * If this isn't inside of an RCU-sched read-side critical section, data * may become NULL after the check below. */ - if (data) - data->func(data, time, util, max); + if (hook) + hook->func(hook, time); } /* Flag to suspend/resume CPUFreq governors */ Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -146,35 +146,13 @@ static inline bool policy_is_shared(stru extern struct kobject *cpufreq_global_kobject; #ifdef CONFIG_CPU_FREQ -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max); +void cpufreq_trigger_update(u64 time); -/** - * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. - * @time: Current time. - * - * The way cpufreq is currently arranged requires it to evaluate the CPU - * performance state (frequency/voltage) on a regular basis to prevent it from - * being stuck in a completely inadequate performance level for too long. - * That is not guaranteed to happen if the updates are only triggered from CFS, - * though, because they may not be coming in if RT or deadline tasks are active - * all the time (or there are RT and DL tasks only). - * - * As a workaround for that issue, this function is called by the RT and DL - * sched classes to trigger extra cpufreq updates to prevent it from stalling, - * but that really is a band-aid. Going forward it should be replaced with - * solutions targeted more specifically at RT and DL tasks. 
- */ -static inline void cpufreq_trigger_update(u64 time) -{ - cpufreq_update_util(time, ULONG_MAX, 0); -} - -struct update_util_data { - void (*func)(struct update_util_data *data, - u64 time, unsigned long util, unsigned long max); +struct freq_update_hook { + void (*func)(struct freq_update_hook *hook, u64 time); }; -void cpufreq_set_update_util_data(int cpu, struct update_util_data *data); +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook); unsigned int cpufreq_get(unsigned int cpu); unsigned int cpufreq_quick_get(unsigned int cpu); @@ -187,8 +165,6 @@ int cpufreq_update_policy(unsigned int c bool have_governor_per_policy(void); struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy); #else -static inline void cpufreq_update_util(u64 time, unsigned long util, - unsigned long max) {} static inline void cpufreq_trigger_update(u64 time) {} static inline unsigned int cpufreq_get(unsigned int cpu) Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -62,10 +62,10 @@ ssize_t store_sampling_rate(struct dbs_d mutex_lock(&policy_dbs->timer_mutex); /* * On 32-bit architectures this may race with the - * sample_delay_ns read in dbs_update_util_handler(), but that + * sample_delay_ns read in dbs_freq_update_handler(), but that * really doesn't matter. If the read returns a value that's * too big, the sample will be skipped, but the next invocation - * of dbs_update_util_handler() (when the update has been + * of dbs_freq_update_handler() (when the update has been * completed) will take a sample. 
* * If this runs in parallel with dbs_work_handler(), we may end @@ -257,7 +257,7 @@ unsigned int dbs_update(struct cpufreq_p } EXPORT_SYMBOL_GPL(dbs_update); -static void gov_set_update_util(struct policy_dbs_info *policy_dbs, +static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs, unsigned int delay_us) { struct cpufreq_policy *policy = policy_dbs->policy; @@ -269,16 +269,16 @@ static void gov_set_update_util(struct p for_each_cpu(cpu, policy->cpus) { struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu); - cpufreq_set_update_util_data(cpu, &cdbs->update_util); + cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook); } } -static inline void gov_clear_update_util(struct cpufreq_policy *policy) +static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy) { int i; for_each_cpu(i, policy->cpus) - cpufreq_set_update_util_data(i, NULL); + cpufreq_set_freq_update_hook(i, NULL); synchronize_sched(); } @@ -287,7 +287,7 @@ static void gov_cancel_work(struct cpufr { struct policy_dbs_info *policy_dbs = policy->governor_data; - gov_clear_update_util(policy_dbs->policy); + gov_clear_freq_update_hooks(policy_dbs->policy); irq_work_sync(&policy_dbs->irq_work); cancel_work_sync(&policy_dbs->work); atomic_set(&policy_dbs->work_count, 0); @@ -331,10 +331,9 @@ static void dbs_irq_work(struct irq_work schedule_work(&policy_dbs->work); } -static void dbs_update_util_handler(struct update_util_data *data, u64 time, - unsigned long util, unsigned long max) +static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time) { - struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util); + struct cpu_dbs_info *cdbs = container_of(hook, struct cpu_dbs_info, update_hook); struct policy_dbs_info *policy_dbs = cdbs->policy_dbs; u64 delta_ns, lst; @@ -403,7 +402,7 @@ static struct policy_dbs_info *alloc_pol struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j); j_cdbs->policy_dbs = policy_dbs; - j_cdbs->update_util.func = 
dbs_update_util_handler; + j_cdbs->update_hook.func = dbs_freq_update_handler; } return policy_dbs; } @@ -419,7 +418,7 @@ static void free_policy_dbs_info(struct struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j); j_cdbs->policy_dbs = NULL; - j_cdbs->update_util.func = NULL; + j_cdbs->update_hook.func = NULL; } gov->free(policy_dbs); } @@ -586,7 +585,7 @@ static int cpufreq_governor_start(struct gov->start(policy); - gov_set_update_util(policy_dbs, sampling_rate); + gov_set_freq_update_hooks(policy_dbs, sampling_rate); return 0; } Index: linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -144,7 +144,7 @@ struct cpu_dbs_info { * wake-up from idle. */ unsigned int prev_load; - struct update_util_data update_util; + struct freq_update_hook update_hook; struct policy_dbs_info *policy_dbs; }; Index: linux-pm/drivers/cpufreq/intel_pstate.c =================================================================== --- linux-pm.orig/drivers/cpufreq/intel_pstate.c +++ linux-pm/drivers/cpufreq/intel_pstate.c @@ -103,7 +103,7 @@ struct _pid { struct cpudata { int cpu; - struct update_util_data update_util; + struct freq_update_hook update_hook; struct pstate_data pstate; struct vid_data vid; @@ -1019,10 +1019,9 @@ static inline void intel_pstate_adjust_b sample->freq); } -static void intel_pstate_update_util(struct update_util_data *data, u64 time, - unsigned long util, unsigned long max) +static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time) { - struct cpudata *cpu = container_of(data, struct cpudata, update_util); + struct cpudata *cpu = container_of(hook, struct cpudata, update_hook); u64 delta_ns = time - cpu->sample.time; if ((s64)delta_ns >= pid_params.sample_rate_ns) { @@ -1088,8 +1087,8 @@ static int intel_pstate_init_cpu(unsigne intel_pstate_busy_pid_reset(cpu); intel_pstate_sample(cpu, 
0); - cpu->update_util.func = intel_pstate_update_util; - cpufreq_set_update_util_data(cpunum, &cpu->update_util); + cpu->update_hook.func = intel_pstate_freq_update; + cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook); pr_debug("intel_pstate: controlling: cpu %d\n", cpunum); @@ -1173,7 +1172,7 @@ static void intel_pstate_stop_cpu(struct pr_debug("intel_pstate: CPU %d exiting\n", cpu_num); - cpufreq_set_update_util_data(cpu_num, NULL); + cpufreq_set_freq_update_hook(cpu_num, NULL); synchronize_sched(); if (hwp_active) @@ -1441,7 +1440,7 @@ out: get_online_cpus(); for_each_online_cpu(cpu) { if (all_cpu_data[cpu]) { - cpufreq_set_update_util_data(cpu, NULL); + cpufreq_set_freq_update_hook(cpu, NULL); synchronize_sched(); kfree(all_cpu_data[cpu]); } Index: linux-pm/kernel/sched/fair.c =================================================================== --- linux-pm.orig/kernel/sched/fair.c +++ linux-pm/kernel/sched/fair.c @@ -2839,8 +2839,6 @@ static inline void update_load_avg(struc update_tg_load_avg(cfs_rq, 0); if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) { - unsigned long max = rq->cpu_capacity_orig; - /* * There are a few boundary cases this might miss but it should * get called often enough that that should (hopefully) not be @@ -2849,16 +2847,9 @@ static inline void update_load_avg(struc * the next tick/schedule should update. * * It will not get called when we go idle, because the idle - * thread is a different class (!fair), nor will the utilization - * number include things like RT tasks. - * - * As is, the util number is not freq-invariant (we'd have to - * implement arch_scale_freq_capacity() for that). - * - * See cpu_util(). + * thread is a different class (!fair). 
*/ - cpufreq_update_util(rq_clock(rq), - min(cfs_rq->avg.util_avg, max), max); + cpufreq_trigger_update(rq_clock(rq)); } } Index: linux-pm/kernel/sched/deadline.c =================================================================== --- linux-pm.orig/kernel/sched/deadline.c +++ linux-pm/kernel/sched/deadline.c @@ -726,7 +726,7 @@ static void update_curr_dl(struct rq *rq if (!dl_task(curr) || !on_dl_rq(dl_se)) return; - /* Kick cpufreq (see the comment in linux/cpufreq.h). */ + /* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */ if (cpu_of(rq) == smp_processor_id()) cpufreq_trigger_update(rq_clock(rq)); Index: linux-pm/kernel/sched/rt.c =================================================================== --- linux-pm.orig/kernel/sched/rt.c +++ linux-pm/kernel/sched/rt.c @@ -945,7 +945,7 @@ static void update_curr_rt(struct rq *rq if (curr->sched_class != &rt_sched_class) return; - /* Kick cpufreq (see the comment in linux/cpufreq.h). */ + /* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */ if (cpu_of(rq) == smp_processor_id()) cpufreq_trigger_update(rq_clock(rq)); ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v3 1/7][Resend] cpufreq: Rework the scheduler hooks for triggering updates 2016-03-08 2:25 ` [PATCH v3 1/7][Resend] cpufreq: Rework the scheduler hooks for triggering updates Rafael J. Wysocki @ 2016-03-09 13:41 ` Peter Zijlstra 2016-03-09 14:02 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-09 13:41 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Tue, Mar 08, 2016 at 03:25:16AM +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Commit fe7034338ba0 (cpufreq: Add mechanism for registering > utilization update callbacks) added cpufreq_update_util() to be > called by the scheduler (from the CFS part) on utilization updates. > The goal was to allow CFS to pass utilization information to cpufreq > and to trigger it to evaluate the frequency/voltage configuration > (P-state) of every CPU on a regular basis. > > However, the last two arguments of that function are never used by > the current code, so CFS might simply call cpufreq_trigger_update() > instead of it (like the RT and DL sched classes). > > For this reason, drop the last two arguments of cpufreq_update_util(), > rename it to cpufreq_trigger_update() and modify CFS to call it. > > Moreover, since the utilization is not involved in that now, rename > data types, functions and variables related to cpufreq_trigger_update() > to reflect that (eg. struct update_util_data becomes struct > freq_update_hook and so on). > -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) > +void cpufreq_trigger_update(u64 time) So I'm not convinced about this. Yes the utility of this function is twofold. One to allow in-situ frequency adjustments where possible, but two, also very much to allow using the statistics already gathered. 
Sure, 4.5 will not have any such users, but who cares. And I'm really not too worried about 'random' people suddenly using it to base work on. Either people are already participating in these discussions and will thus be aware of whatever concerns there might be, or we'll tell them when they post their code. And when they don't participate and don't post their code, I really don't care about them anyway :-) ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v3 1/7][Resend] cpufreq: Rework the scheduler hooks for triggering updates 2016-03-09 13:41 ` Peter Zijlstra @ 2016-03-09 14:02 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-09 14:02 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 9, 2016 at 2:41 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Mar 08, 2016 at 03:25:16AM +0100, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> Commit fe7034338ba0 (cpufreq: Add mechanism for registering >> utilization update callbacks) added cpufreq_update_util() to be >> called by the scheduler (from the CFS part) on utilization updates. >> The goal was to allow CFS to pass utilization information to cpufreq >> and to trigger it to evaluate the frequency/voltage configuration >> (P-state) of every CPU on a regular basis. >> >> However, the last two arguments of that function are never used by >> the current code, so CFS might simply call cpufreq_trigger_update() >> instead of it (like the RT and DL sched classes). >> >> For this reason, drop the last two arguments of cpufreq_update_util(), >> rename it to cpufreq_trigger_update() and modify CFS to call it. >> >> Moreover, since the utilization is not involved in that now, rename >> data types, functions and variables related to cpufreq_trigger_update() >> to reflect that (eg. struct update_util_data becomes struct >> freq_update_hook and so on). > >> -void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) >> +void cpufreq_trigger_update(u64 time) > > So I'm not convinced about this. Yes the utility of this function is > twofold. 
One to allow in-situ frequency adjustments where possible, but > two, also very much to allow using the statistics already gathered. > > Sure, 4.5 will not have any such users, but who cares. > > And I'm really not too worried about 'random' people suddenly using it > to base work on. Either people are already participating in these > discussions and will thus be aware of whatever concerns there might be, > or we'll tell them when they post their code. > > And when they don't participate and don't post their code, I really > don't care about them anyway :-) OK ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v3 2/7][Resend] cpufreq: Move scheduler-related code to the sched directory 2016-03-08 2:23 ` [PATCH v3 0/7] cpufreq: schedutil governor Rafael J. Wysocki 2016-03-08 2:25 ` [PATCH v3 1/7][Resend] cpufreq: Rework the scheduler hooks for triggering updates Rafael J. Wysocki @ 2016-03-08 2:26 ` Rafael J. Wysocki 2016-03-08 2:28 ` [PATCH v3 3/7][Resend] cpufreq: governor: New data type for management part of dbs_data Rafael J. Wysocki ` (5 subsequent siblings) 7 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-08 2:26 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Create cpufreq.c under kernel/sched/ and move the cpufreq code related to the scheduler to that file. Also move the headers related to that code from cpufreq.h to sched.h. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- No changes from v2. --- drivers/cpufreq/cpufreq.c | 61 ------------------------------ drivers/cpufreq/cpufreq_governor.c | 1 drivers/cpufreq/intel_pstate.c | 1 include/linux/cpufreq.h | 10 ----- include/linux/sched.h | 12 ++++++ kernel/sched/Makefile | 1 kernel/sched/cpufreq.c | 73 +++++++++++++++++++++++++++++++++++++ 7 files changed, 88 insertions(+), 71 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -65,67 +65,6 @@ static struct cpufreq_driver *cpufreq_dr static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data); static DEFINE_RWLOCK(cpufreq_driver_lock); -static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook); - -/** - * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer. 
- * @cpu: The CPU to set the pointer for. - * @hook: New pointer value. - * - * Set and publish the freq_update_hook pointer for the given CPU. That pointer - * points to a struct freq_update_hook object containing a callback function - * to call from cpufreq_trigger_update(). That function will be called from - * an RCU read-side critical section, so it must not sleep. - * - * Callers must use RCU-sched callbacks to free any memory that might be - * accessed via the old update_util_data pointer or invoke synchronize_sched() - * right after this function to avoid use-after-free. - */ -void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook) -{ - if (WARN_ON(hook && !hook->func)) - return; - - rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); -} -EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook); - -/** - * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. - * @time: Current time. - * - * The way cpufreq is currently arranged requires it to evaluate the CPU - * performance state (frequency/voltage) on a regular basis. To facilitate - * that, this function is called by update_load_avg() in CFS when executed for - * the current CPU's runqueue. - * - * However, this isn't sufficient to prevent the CPU from being stuck in a - * completely inadequate performance level for too long, because the calls - * from CFS will not be made if RT or deadline tasks are active all the time - * (or there are RT and DL tasks only). - * - * As a workaround for that issue, this function is called by the RT and DL - * sched classes to trigger extra cpufreq updates to prevent it from stalling, - * but that really is a band-aid. Going forward it should be replaced with - * solutions targeted more specifically at RT and DL tasks. 
- */ -void cpufreq_trigger_update(u64 time) -{ - struct freq_update_hook *hook; - -#ifdef CONFIG_LOCKDEP - WARN_ON(debug_locks && !rcu_read_lock_sched_held()); -#endif - - hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); - /* - * If this isn't inside of an RCU-sched read-side critical section, data - * may become NULL after the check below. - */ - if (hook) - hook->func(hook, time); -} - /* Flag to suspend/resume CPUFreq governors */ static bool cpufreq_suspended; Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -18,6 +18,7 @@ #include <linux/export.h> #include <linux/kernel_stat.h> +#include <linux/sched.h> #include <linux/slab.h> #include "cpufreq_governor.h" Index: linux-pm/drivers/cpufreq/intel_pstate.c =================================================================== --- linux-pm.orig/drivers/cpufreq/intel_pstate.c +++ linux-pm/drivers/cpufreq/intel_pstate.c @@ -21,6 +21,7 @@ #include <linux/list.h> #include <linux/cpu.h> #include <linux/cpufreq.h> +#include <linux/sched.h> #include <linux/sysfs.h> #include <linux/types.h> #include <linux/fs.h> Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -146,14 +146,6 @@ static inline bool policy_is_shared(stru extern struct kobject *cpufreq_global_kobject; #ifdef CONFIG_CPU_FREQ -void cpufreq_trigger_update(u64 time); - -struct freq_update_hook { - void (*func)(struct freq_update_hook *hook, u64 time); -}; - -void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook); - unsigned int cpufreq_get(unsigned int cpu); unsigned int cpufreq_quick_get(unsigned int cpu); unsigned int cpufreq_quick_get_max(unsigned int cpu); @@ -165,8 +157,6 @@ int cpufreq_update_policy(unsigned 
int c bool have_governor_per_policy(void); struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy); #else -static inline void cpufreq_trigger_update(u64 time) {} - static inline unsigned int cpufreq_get(unsigned int cpu) { return 0; Index: linux-pm/include/linux/sched.h =================================================================== --- linux-pm.orig/include/linux/sched.h +++ linux-pm/include/linux/sched.h @@ -2362,6 +2362,18 @@ extern u64 scheduler_tick_max_deferment( static inline bool sched_can_stop_tick(void) { return false; } #endif +#ifdef CONFIG_CPU_FREQ +void cpufreq_trigger_update(u64 time); + +struct freq_update_hook { + void (*func)(struct freq_update_hook *hook, u64 time); +}; + +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook); +#else +static inline void cpufreq_trigger_update(u64 time) {} +#endif + #ifdef CONFIG_SCHED_AUTOGROUP extern void sched_autogroup_create_attach(struct task_struct *p); extern void sched_autogroup_detach(struct task_struct *p); Index: linux-pm/kernel/sched/Makefile =================================================================== --- linux-pm.orig/kernel/sched/Makefile +++ linux-pm/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_gr obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_FREQ) += cpufreq.o Index: linux-pm/kernel/sched/cpufreq.c =================================================================== --- /dev/null +++ linux-pm/kernel/sched/cpufreq.c @@ -0,0 +1,73 @@ +/* + * Scheduler code and data structures related to cpufreq. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#include <linux/sched.h> + +static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook); + +/** + * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer. + * @cpu: The CPU to set the pointer for. + * @hook: New pointer value. + * + * Set and publish the freq_update_hook pointer for the given CPU. That pointer + * points to a struct freq_update_hook object containing a callback function + * to call from cpufreq_trigger_update(). That function will be called from + * an RCU read-side critical section, so it must not sleep. + * + * Callers must use RCU-sched callbacks to free any memory that might be + * accessed via the old update_util_data pointer or invoke synchronize_sched() + * right after this function to avoid use-after-free. + */ +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook) +{ + if (WARN_ON(hook && !hook->func)) + return; + + rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); +} +EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook); + +/** + * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. + * @time: Current time. + * + * The way cpufreq is currently arranged requires it to evaluate the CPU + * performance state (frequency/voltage) on a regular basis. To facilitate + * that, this function is called by update_load_avg() in CFS when executed for + * the current CPU's runqueue. + * + * However, this isn't sufficient to prevent the CPU from being stuck in a + * completely inadequate performance level for too long, because the calls + * from CFS will not be made if RT or deadline tasks are active all the time + * (or there are RT and DL tasks only). + * + * As a workaround for that issue, this function is called by the RT and DL + * sched classes to trigger extra cpufreq updates to prevent it from stalling, + * but that really is a band-aid. Going forward it should be replaced with + * solutions targeted more specifically at RT and DL tasks. 
+ */ +void cpufreq_trigger_update(u64 time) +{ + struct freq_update_hook *hook; + +#ifdef CONFIG_LOCKDEP + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); +#endif + + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); + /* + * If this isn't inside of an RCU-sched read-side critical section, hook + * may become NULL after the check below. + */ + if (hook) + hook->func(hook, time); +} ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v3 3/7][Resend] cpufreq: governor: New data type for management part of dbs_data
  2016-03-08 2:23 ` [PATCH v3 0/7] cpufreq: schedutil governor Rafael J. Wysocki
  2016-03-08 2:25 ` [PATCH v3 1/7][Resend] cpufreq: Rework the scheduler hooks for triggering updates Rafael J. Wysocki
  2016-03-08 2:26 ` [PATCH v3 2/7][Resend] cpufreq: Move scheduler-related code to the sched directory Rafael J. Wysocki
@ 2016-03-08 2:28 ` Rafael J. Wysocki
  2016-03-08 2:29 ` [PATCH v3 4/7][Resend] cpufreq: governor: Move abstract gov_attr_set code to separate file Rafael J. Wysocki
  ` (4 subsequent siblings)
  7 siblings, 0 replies; 158+ messages in thread
From: Rafael J. Wysocki @ 2016-03-08 2:28 UTC (permalink / raw)
To: Linux PM list
Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In addition to fields representing governor tunables, struct dbs_data contains some fields needed for the management of objects of that type. As it turns out, that part of struct dbs_data may be shared with (future) governors that won't use the common code used by "ondemand" and "conservative", so move it to a separate struct type and modify the code using struct dbs_data to follow.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---

No changes from v2.
--- drivers/cpufreq/cpufreq_conservative.c | 25 +++++---- drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++------------- drivers/cpufreq/cpufreq_governor.h | 35 +++++++----- drivers/cpufreq/cpufreq_ondemand.c | 29 ++++++---- 4 files changed, 107 insertions(+), 72 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -41,6 +41,13 @@ /* Ondemand Sampling types */ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; +struct gov_attr_set { + struct kobject kobj; + struct list_head policy_list; + struct mutex update_lock; + int usage_count; +}; + /* * Abbreviations: * dbs: used as a shortform for demand based switching It helps to keep variable @@ -52,7 +59,7 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; /* Governor demand based switching data (per-policy or global). */ struct dbs_data { - int usage_count; + struct gov_attr_set attr_set; void *tuners; unsigned int min_sampling_rate; unsigned int ignore_nice_load; @@ -60,37 +67,35 @@ struct dbs_data { unsigned int sampling_down_factor; unsigned int up_threshold; unsigned int io_is_busy; - - struct kobject kobj; - struct list_head policy_dbs_list; - /* - * Protect concurrent updates to governor tunables from sysfs, - * policy_dbs_list and usage_count. 
- */ - struct mutex mutex; }; +static inline struct dbs_data *to_dbs_data(struct gov_attr_set *attr_set) +{ + return container_of(attr_set, struct dbs_data, attr_set); +} + /* Governor's specific attributes */ -struct dbs_data; struct governor_attr { struct attribute attr; - ssize_t (*show)(struct dbs_data *dbs_data, char *buf); - ssize_t (*store)(struct dbs_data *dbs_data, const char *buf, + ssize_t (*show)(struct gov_attr_set *attr_set, char *buf); + ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf, size_t count); }; #define gov_show_one(_gov, file_name) \ static ssize_t show_##file_name \ -(struct dbs_data *dbs_data, char *buf) \ +(struct gov_attr_set *attr_set, char *buf) \ { \ + struct dbs_data *dbs_data = to_dbs_data(attr_set); \ struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \ return sprintf(buf, "%u\n", tuners->file_name); \ } #define gov_show_one_common(file_name) \ static ssize_t show_##file_name \ -(struct dbs_data *dbs_data, char *buf) \ +(struct gov_attr_set *attr_set, char *buf) \ { \ + struct dbs_data *dbs_data = to_dbs_data(attr_set); \ return sprintf(buf, "%u\n", dbs_data->file_name); \ } @@ -184,7 +189,7 @@ void od_register_powersave_bias_handler( (struct cpufreq_policy *, unsigned int, unsigned int), unsigned int powersave_bias); void od_unregister_powersave_bias_handler(void); -ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf, +ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf, size_t count); void gov_update_cpu_data(struct dbs_data *dbs_data); #endif /* _CPUFREQ_GOVERNOR_H */ Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -43,9 +43,10 @@ static DEFINE_MUTEX(gov_dbs_data_mutex); * This must be called with dbs_data->mutex held, otherwise traversing * policy_dbs_list isn't safe. 
*/ -ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf, +ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct policy_dbs_info *policy_dbs; unsigned int rate; int ret; @@ -59,7 +60,7 @@ ssize_t store_sampling_rate(struct dbs_d * We are operating under dbs_data->mutex and so the list and its * entries can't be freed concurrently. */ - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &attr_set->policy_list, list) { mutex_lock(&policy_dbs->timer_mutex); /* * On 32-bit architectures this may race with the @@ -96,7 +97,7 @@ void gov_update_cpu_data(struct dbs_data { struct policy_dbs_info *policy_dbs; - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) { unsigned int j; for_each_cpu(j, policy_dbs->policy->cpus) { @@ -111,9 +112,9 @@ void gov_update_cpu_data(struct dbs_data } EXPORT_SYMBOL_GPL(gov_update_cpu_data); -static inline struct dbs_data *to_dbs_data(struct kobject *kobj) +static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj) { - return container_of(kobj, struct dbs_data, kobj); + return container_of(kobj, struct gov_attr_set, kobj); } static inline struct governor_attr *to_gov_attr(struct attribute *attr) @@ -124,25 +125,24 @@ static inline struct governor_attr *to_g static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, char *buf) { - struct dbs_data *dbs_data = to_dbs_data(kobj); struct governor_attr *gattr = to_gov_attr(attr); - return gattr->show(dbs_data, buf); + return gattr->show(to_gov_attr_set(kobj), buf); } static ssize_t governor_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t count) { - struct dbs_data *dbs_data = to_dbs_data(kobj); + struct gov_attr_set *attr_set = to_gov_attr_set(kobj); struct governor_attr *gattr = 
to_gov_attr(attr); int ret = -EBUSY; - mutex_lock(&dbs_data->mutex); + mutex_lock(&attr_set->update_lock); - if (dbs_data->usage_count) - ret = gattr->store(dbs_data, buf, count); + if (attr_set->usage_count) + ret = gattr->store(attr_set, buf, count); - mutex_unlock(&dbs_data->mutex); + mutex_unlock(&attr_set->update_lock); return ret; } @@ -424,6 +424,41 @@ static void free_policy_dbs_info(struct gov->free(policy_dbs); } +static void gov_attr_set_init(struct gov_attr_set *attr_set, + struct list_head *list_node) +{ + INIT_LIST_HEAD(&attr_set->policy_list); + mutex_init(&attr_set->update_lock); + attr_set->usage_count = 1; + list_add(list_node, &attr_set->policy_list); +} + +static void gov_attr_set_get(struct gov_attr_set *attr_set, + struct list_head *list_node) +{ + mutex_lock(&attr_set->update_lock); + attr_set->usage_count++; + list_add(list_node, &attr_set->policy_list); + mutex_unlock(&attr_set->update_lock); +} + +static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, + struct list_head *list_node) +{ + unsigned int count; + + mutex_lock(&attr_set->update_lock); + list_del(list_node); + count = --attr_set->usage_count; + mutex_unlock(&attr_set->update_lock); + if (count) + return count; + + kobject_put(&attr_set->kobj); + mutex_destroy(&attr_set->update_lock); + return 0; +} + static int cpufreq_governor_init(struct cpufreq_policy *policy) { struct dbs_governor *gov = dbs_governor_of(policy); @@ -452,10 +487,7 @@ static int cpufreq_governor_init(struct policy_dbs->dbs_data = dbs_data; policy->governor_data = policy_dbs; - mutex_lock(&dbs_data->mutex); - dbs_data->usage_count++; - list_add(&policy_dbs->list, &dbs_data->policy_dbs_list); - mutex_unlock(&dbs_data->mutex); + gov_attr_set_get(&dbs_data->attr_set, &policy_dbs->list); goto out; } @@ -465,8 +497,7 @@ static int cpufreq_governor_init(struct goto free_policy_dbs_info; } - INIT_LIST_HEAD(&dbs_data->policy_dbs_list); - mutex_init(&dbs_data->mutex); + gov_attr_set_init(&dbs_data->attr_set, 
&policy_dbs->list); ret = gov->init(dbs_data, !policy->governor->initialized); if (ret) @@ -486,14 +517,11 @@ static int cpufreq_governor_init(struct if (!have_governor_per_policy()) gov->gdbs_data = dbs_data; - policy->governor_data = policy_dbs; - policy_dbs->dbs_data = dbs_data; - dbs_data->usage_count = 1; - list_add(&policy_dbs->list, &dbs_data->policy_dbs_list); + policy->governor_data = policy_dbs; gov->kobj_type.sysfs_ops = &governor_sysfs_ops; - ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type, + ret = kobject_init_and_add(&dbs_data->attr_set.kobj, &gov->kobj_type, get_governor_parent_kobj(policy), "%s", gov->gov.name); if (!ret) @@ -522,29 +550,21 @@ static int cpufreq_governor_exit(struct struct dbs_governor *gov = dbs_governor_of(policy); struct policy_dbs_info *policy_dbs = policy->governor_data; struct dbs_data *dbs_data = policy_dbs->dbs_data; - int count; + unsigned int count; /* Protect gov->gdbs_data against concurrent updates. */ mutex_lock(&gov_dbs_data_mutex); - mutex_lock(&dbs_data->mutex); - list_del(&policy_dbs->list); - count = --dbs_data->usage_count; - mutex_unlock(&dbs_data->mutex); + count = gov_attr_set_put(&dbs_data->attr_set, &policy_dbs->list); - if (!count) { - kobject_put(&dbs_data->kobj); - - policy->governor_data = NULL; + policy->governor_data = NULL; + if (!count) { if (!have_governor_per_policy()) gov->gdbs_data = NULL; gov->exit(dbs_data, policy->governor->initialized == 1); - mutex_destroy(&dbs_data->mutex); kfree(dbs_data); - } else { - policy->governor_data = NULL; } free_policy_dbs_info(policy_dbs, gov); Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c +++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c @@ -207,9 +207,10 @@ static unsigned int od_dbs_timer(struct /************************** sysfs interface ************************/ static struct dbs_governor od_dbs_gov; -static ssize_t 
store_io_is_busy(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_io_is_busy(struct gov_attr_set *attr_set, const char *buf, + size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; @@ -224,9 +225,10 @@ static ssize_t store_io_is_busy(struct d return count; } -static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_up_threshold(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -240,9 +242,10 @@ static ssize_t store_up_threshold(struct return count; } -static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct policy_dbs_info *policy_dbs; unsigned int input; int ret; @@ -254,7 +257,7 @@ static ssize_t store_sampling_down_facto dbs_data->sampling_down_factor = input; /* Reset down sampling multiplier in case it was active */ - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &attr_set->policy_list, list) { /* * Doing this without locking might lead to using different * rate_mult values in od_update() and od_dbs_timer(). 
@@ -267,9 +270,10 @@ static ssize_t store_sampling_down_facto return count; } -static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; @@ -291,9 +295,10 @@ static ssize_t store_ignore_nice_load(st return count; } -static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_powersave_bias(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct od_dbs_tuners *od_tuners = dbs_data->tuners; struct policy_dbs_info *policy_dbs; unsigned int input; @@ -308,7 +313,7 @@ static ssize_t store_powersave_bias(stru od_tuners->powersave_bias = input; - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) + list_for_each_entry(policy_dbs, &attr_set->policy_list, list) ondemand_powersave_bias_init(policy_dbs->policy); return count; Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c +++ linux-pm/drivers/cpufreq/cpufreq_conservative.c @@ -129,9 +129,10 @@ static struct notifier_block cs_cpufreq_ /************************** sysfs interface ************************/ static struct dbs_governor cs_dbs_gov; -static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -143,9 +144,10 @@ static ssize_t store_sampling_down_facto return count; } -static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf, - size_t count) 
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; @@ -158,9 +160,10 @@ static ssize_t store_up_threshold(struct return count; } -static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_down_threshold(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; @@ -175,9 +178,10 @@ static ssize_t store_down_threshold(stru return count; } -static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; @@ -199,9 +203,10 @@ static ssize_t store_ignore_nice_load(st return count; } -static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_freq_step(struct gov_attr_set *attr_set, const char *buf, + size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v3 4/7][Resend] cpufreq: governor: Move abstract gov_attr_set code to separate file
  2016-03-08 2:23 ` [PATCH v3 0/7] cpufreq: schedutil governor Rafael J. Wysocki
  ` (2 preceding siblings ...)
  2016-03-08 2:28 ` [PATCH v3 3/7][Resend] cpufreq: governor: New data type for management part of dbs_data Rafael J. Wysocki
@ 2016-03-08 2:29 ` Rafael J. Wysocki
  2016-03-08 2:38 ` [PATCH v3 5/7] cpufreq: Support for fast frequency switching Rafael J. Wysocki
  ` (3 subsequent siblings)
  7 siblings, 0 replies; 158+ messages in thread
From: Rafael J. Wysocki @ 2016-03-08 2:29 UTC (permalink / raw)
To: Linux PM list
Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Move abstract code related to struct gov_attr_set to a separate (new) file so it can be shared with (future) governors that won't share more code with "ondemand" and "conservative".

No intentional functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---

No changes from v2.
--- drivers/cpufreq/Kconfig | 4 + drivers/cpufreq/Makefile | 1 drivers/cpufreq/cpufreq_governor.c | 82 --------------------------- drivers/cpufreq/cpufreq_governor.h | 6 ++ drivers/cpufreq/cpufreq_governor_attr_set.c | 84 ++++++++++++++++++++++++++++ 5 files changed, 95 insertions(+), 82 deletions(-) Index: linux-pm/drivers/cpufreq/Kconfig =================================================================== --- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -18,7 +18,11 @@ config CPU_FREQ if CPU_FREQ +config CPU_FREQ_GOV_ATTR_SET + bool + config CPU_FREQ_GOV_COMMON + select CPU_FREQ_GOV_ATTR_SET select IRQ_WORK bool Index: linux-pm/drivers/cpufreq/Makefile =================================================================== --- linux-pm.orig/drivers/cpufreq/Makefile +++ linux-pm/drivers/cpufreq/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) += obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o +obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -112,53 +112,6 @@ void gov_update_cpu_data(struct dbs_data } EXPORT_SYMBOL_GPL(gov_update_cpu_data); -static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj) -{ - return container_of(kobj, struct gov_attr_set, kobj); -} - -static inline struct governor_attr *to_gov_attr(struct attribute *attr) -{ - return container_of(attr, struct governor_attr, attr); -} - -static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, - char *buf) -{ - struct governor_attr *gattr = to_gov_attr(attr); - - return gattr->show(to_gov_attr_set(kobj), buf); -} - -static 
ssize_t governor_store(struct kobject *kobj, struct attribute *attr, - const char *buf, size_t count) -{ - struct gov_attr_set *attr_set = to_gov_attr_set(kobj); - struct governor_attr *gattr = to_gov_attr(attr); - int ret = -EBUSY; - - mutex_lock(&attr_set->update_lock); - - if (attr_set->usage_count) - ret = gattr->store(attr_set, buf, count); - - mutex_unlock(&attr_set->update_lock); - - return ret; -} - -/* - * Sysfs Ops for accessing governor attributes. - * - * All show/store invocations for governor specific sysfs attributes, will first - * call the below show/store callbacks and the attribute specific callback will - * be called from within it. - */ -static const struct sysfs_ops governor_sysfs_ops = { - .show = governor_show, - .store = governor_store, -}; - unsigned int dbs_update(struct cpufreq_policy *policy) { struct policy_dbs_info *policy_dbs = policy->governor_data; @@ -424,41 +377,6 @@ static void free_policy_dbs_info(struct gov->free(policy_dbs); } -static void gov_attr_set_init(struct gov_attr_set *attr_set, - struct list_head *list_node) -{ - INIT_LIST_HEAD(&attr_set->policy_list); - mutex_init(&attr_set->update_lock); - attr_set->usage_count = 1; - list_add(list_node, &attr_set->policy_list); -} - -static void gov_attr_set_get(struct gov_attr_set *attr_set, - struct list_head *list_node) -{ - mutex_lock(&attr_set->update_lock); - attr_set->usage_count++; - list_add(list_node, &attr_set->policy_list); - mutex_unlock(&attr_set->update_lock); -} - -static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, - struct list_head *list_node) -{ - unsigned int count; - - mutex_lock(&attr_set->update_lock); - list_del(list_node); - count = --attr_set->usage_count; - mutex_unlock(&attr_set->update_lock); - if (count) - return count; - - kobject_put(&attr_set->kobj); - mutex_destroy(&attr_set->update_lock); - return 0; -} - static int cpufreq_governor_init(struct cpufreq_policy *policy) { struct dbs_governor *gov = dbs_governor_of(policy); Index: 
linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -48,6 +48,12 @@ struct gov_attr_set { int usage_count; }; +extern const struct sysfs_ops governor_sysfs_ops; + +void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node); +void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node); +unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node); + /* * Abbreviations: * dbs: used as a shortform for demand based switching It helps to keep variable Index: linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c =================================================================== --- /dev/null +++ linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c @@ -0,0 +1,84 @@ +/* + * Abstract code for CPUFreq governor tunable sysfs attributes. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#include "cpufreq_governor.h" + +static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj) +{ + return container_of(kobj, struct gov_attr_set, kobj); +} + +static inline struct governor_attr *to_gov_attr(struct attribute *attr) +{ + return container_of(attr, struct governor_attr, attr); +} + +static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, + char *buf) +{ + struct governor_attr *gattr = to_gov_attr(attr); + + return gattr->show(to_gov_attr_set(kobj), buf); +} + +static ssize_t governor_store(struct kobject *kobj, struct attribute *attr, + const char *buf, size_t count) +{ + struct gov_attr_set *attr_set = to_gov_attr_set(kobj); + struct governor_attr *gattr = to_gov_attr(attr); + int ret; + + mutex_lock(&attr_set->update_lock); + ret = attr_set->usage_count ? gattr->store(attr_set, buf, count) : -EBUSY; + mutex_unlock(&attr_set->update_lock); + return ret; +} + +const struct sysfs_ops governor_sysfs_ops = { + .show = governor_show, + .store = governor_store, +}; +EXPORT_SYMBOL_GPL(governor_sysfs_ops); + +void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node) +{ + INIT_LIST_HEAD(&attr_set->policy_list); + mutex_init(&attr_set->update_lock); + attr_set->usage_count = 1; + list_add(list_node, &attr_set->policy_list); +} +EXPORT_SYMBOL_GPL(gov_attr_set_init); + +void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node) +{ + mutex_lock(&attr_set->update_lock); + attr_set->usage_count++; + list_add(list_node, &attr_set->policy_list); + mutex_unlock(&attr_set->update_lock); +} +EXPORT_SYMBOL_GPL(gov_attr_set_get); + +unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node) +{ + unsigned int count; + + mutex_lock(&attr_set->update_lock); + list_del(list_node); + count = --attr_set->usage_count; + mutex_unlock(&attr_set->update_lock); + if (count) + return count; + + kobject_put(&attr_set->kobj); + mutex_destroy(&attr_set->update_lock); + 
return 0; +} +EXPORT_SYMBOL_GPL(gov_attr_set_put); ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v3 5/7] cpufreq: Support for fast frequency switching
  2016-03-08 2:23 ` [PATCH v3 0/7] cpufreq: schedutil governor Rafael J. Wysocki
  ` (3 preceding siblings ...)
  2016-03-08 2:29 ` [PATCH v3 4/7][Resend] cpufreq: governor: Move abstract gov_attr_set code to separate file Rafael J. Wysocki
@ 2016-03-08 2:38 ` Rafael J. Wysocki
  2016-03-08 2:41 ` [PATCH v3 6/7] cpufreq: sched: Re-introduce cpufreq_update_util() Rafael J. Wysocki
  ` (2 subsequent siblings)
  7 siblings, 0 replies; 158+ messages in thread
From: Rafael J. Wysocki @ 2016-03-08 2:38 UTC (permalink / raw)
To: Linux PM list
Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] cpufreq: Support for fast frequency switching

Modify the ACPI cpufreq driver to provide a method for switching CPU frequencies from interrupt context and update the cpufreq core to support that method if available.

Introduce a new cpufreq driver callback, ->fast_switch, to be invoked for frequency switching from interrupt context by (future) governors supporting that feature via a (new) helper function, cpufreq_driver_fast_switch().

Add a new policy flag, fast_switch_possible, to be set if fast frequency switching can be used for the given policy and add a helper for setting that flag.

Since fast frequency switching is inherently incompatible with cpufreq transition notifiers, make it possible to set the fast_switch_possible flag only if there are no transition notifiers already registered and make the registration of new transition notifiers fail if the fast_switch_possible flag is set for at least one policy.

Implement the ->fast_switch callback in the ACPI cpufreq driver and make it set fast_switch_possible during policy initialization as appropriate.

Signed-off-by: Rafael J.
Wysocki <rafael.j.wysocki@intel.com> --- Changes from v2: - The driver ->fast_switch callback and cpufreq_driver_fast_switch() don't need the relation argument as they will always do RELATION_L now. - New mechanism to make fast switch and cpufreq notifiers mutually exclusive. - cpufreq_driver_fast_switch() doesn't do anything in addition to invoking the driver callback and returns its return value. --- drivers/cpufreq/acpi-cpufreq.c | 42 ++++++++++++++++++++ drivers/cpufreq/cpufreq.c | 85 ++++++++++++++++++++++++++++++++++++++--- include/linux/cpufreq.h | 6 ++ 3 files changed, 127 insertions(+), 6 deletions(-) Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c @@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp return result; } +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq) +{ + struct acpi_cpufreq_data *data = policy->driver_data; + struct acpi_processor_performance *perf; + struct cpufreq_frequency_table *entry; + unsigned int next_perf_state, next_freq, freq; + + /* + * Find the closest frequency above target_freq. + * + * The table is sorted in the reverse order with respect to the + * frequency and all of the entries are valid (see the initialization). 
+ */ + entry = data->freq_table; + do { + entry++; + freq = entry->frequency; + } while (freq >= target_freq && freq != CPUFREQ_TABLE_END); + entry--; + next_freq = entry->frequency; + next_perf_state = entry->driver_data; + + perf = to_perf_data(data); + if (perf->state == next_perf_state) { + if (unlikely(data->resume)) + data->resume = 0; + else + return next_freq; + } + + data->cpu_freq_write(&perf->control_register, + perf->states[next_perf_state].control); + perf->state = next_perf_state; + return next_freq; +} + static unsigned long acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu) { @@ -740,6 +777,10 @@ static int acpi_cpufreq_cpu_init(struct goto err_unreg; } + if (!acpi_pstate_strict && !(policy_is_shared(policy) + && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY)) + cpufreq_enable_fast_switch(policy); + data->freq_table = kzalloc(sizeof(*data->freq_table) * (perf->state_count+1), GFP_KERNEL); if (!data->freq_table) { @@ -874,6 +915,7 @@ static struct freq_attr *acpi_cpufreq_at static struct cpufreq_driver acpi_cpufreq_driver = { .verify = cpufreq_generic_frequency_table_verify, .target_index = acpi_cpufreq_target, + .fast_switch = acpi_cpufreq_fast_switch, .bios_limit = acpi_processor_get_bios_limit, .init = acpi_cpufreq_cpu_init, .exit = acpi_cpufreq_cpu_exit, Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -81,6 +81,7 @@ struct cpufreq_policy { struct cpufreq_governor *governor; /* see below */ void *governor_data; char last_governor[CPUFREQ_NAME_LEN]; /* last governor used */ + bool fast_switch_possible; struct work_struct update; /* if update_policy() needs to be * called, but you're in IRQ context */ @@ -156,6 +157,7 @@ int cpufreq_get_policy(struct cpufreq_po int cpufreq_update_policy(unsigned int cpu); bool have_governor_per_policy(void); struct kobject 
*get_governor_parent_kobj(struct cpufreq_policy *policy); +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy); #else static inline unsigned int cpufreq_get(unsigned int cpu) { @@ -236,6 +238,8 @@ struct cpufreq_driver { unsigned int relation); /* Deprecated */ int (*target_index)(struct cpufreq_policy *policy, unsigned int index); + unsigned int (*fast_switch)(struct cpufreq_policy *policy, + unsigned int target_freq); /* * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION * unset. @@ -450,6 +454,8 @@ struct cpufreq_governor { }; /* Pass a target to the cpufreq driver */ +unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq); int cpufreq_driver_target(struct cpufreq_policy *policy, unsigned int target_freq, unsigned int relation); Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -428,6 +428,26 @@ void cpufreq_freq_transition_end(struct } EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end); +/* + * Fast frequency switching status count. Positive means "enabled", negative + * means "disabled" and 0 means "don't care". 
+ */
+static int cpufreq_fast_switch_count;
+static DEFINE_MUTEX(cpufreq_fast_switch_lock);
+
+void cpufreq_enable_fast_switch(struct cpufreq_policy *policy)
+{
+	mutex_lock(&cpufreq_fast_switch_lock);
+	if (cpufreq_fast_switch_count >= 0) {
+		cpufreq_fast_switch_count++;
+		policy->fast_switch_possible = true;
+	} else {
+		pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n",
+			policy->cpu);
+	}
+	mutex_unlock(&cpufreq_fast_switch_lock);
+}
+EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch);

 /*********************************************************************
  * SYSFS INTERFACE *
@@ -1074,6 +1094,23 @@ static void cpufreq_policy_free(struct c
 	kfree(policy);
 }

+static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy)
+{
+	if (policy->fast_switch_possible) {
+		mutex_lock(&cpufreq_fast_switch_lock);
+
+		if (!WARN_ON(cpufreq_fast_switch_count <= 0))
+			cpufreq_fast_switch_count--;
+
+		mutex_unlock(&cpufreq_fast_switch_lock);
+	}
+
+	if (cpufreq_driver->exit) {
+		cpufreq_driver->exit(policy);
+		policy->freq_table = NULL;
+	}
+}
+
 static int cpufreq_online(unsigned int cpu)
 {
 	struct cpufreq_policy *policy;
@@ -1237,8 +1274,7 @@ static int cpufreq_online(unsigned int c

 out_exit_policy:
 	up_write(&policy->rwsem);

-	if (cpufreq_driver->exit)
-		cpufreq_driver->exit(policy);
+	cpufreq_driver_exit_policy(policy);

 out_free_policy:
 	cpufreq_policy_free(policy, !new_policy);
 	return ret;
@@ -1335,10 +1371,7 @@ static void cpufreq_offline(unsigned int
 	 * since this is a core component, and is essential for the
 	 * subsequent light-weight ->init() to succeed.
*/ - if (cpufreq_driver->exit) { - cpufreq_driver->exit(policy); - policy->freq_table = NULL; - } + cpufreq_driver_exit_policy(policy); unlock: up_write(&policy->rwsem); @@ -1665,8 +1698,18 @@ int cpufreq_register_notifier(struct not switch (list) { case CPUFREQ_TRANSITION_NOTIFIER: + mutex_lock(&cpufreq_fast_switch_lock); + + if (cpufreq_fast_switch_count > 0) { + mutex_unlock(&cpufreq_fast_switch_lock); + return -EPERM; + } ret = srcu_notifier_chain_register( &cpufreq_transition_notifier_list, nb); + if (!ret) + cpufreq_fast_switch_count--; + + mutex_unlock(&cpufreq_fast_switch_lock); break; case CPUFREQ_POLICY_NOTIFIER: ret = blocking_notifier_chain_register( @@ -1699,8 +1742,14 @@ int cpufreq_unregister_notifier(struct n switch (list) { case CPUFREQ_TRANSITION_NOTIFIER: + mutex_lock(&cpufreq_fast_switch_lock); + ret = srcu_notifier_chain_unregister( &cpufreq_transition_notifier_list, nb); + if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0)) + cpufreq_fast_switch_count++; + + mutex_unlock(&cpufreq_fast_switch_lock); break; case CPUFREQ_POLICY_NOTIFIER: ret = blocking_notifier_chain_unregister( @@ -1719,6 +1768,30 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie * GOVERNORS * *********************************************************************/ +/** + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch. + * @policy: cpufreq policy to switch the frequency for. + * @target_freq: New frequency to set (may be approximate). + * + * Carry out a fast frequency switch from interrupt context. + * + * This function must not be called if policy->fast_switch_possible is unset. + * + * Governors calling this function must guarantee that it will never be invoked + * twice in parallel for the same policy and that it will never be called in + * parallel with either ->target() or ->target_index() for the same policy. 
+ * + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch() + * callback to indicate an error condition, the hardware configuration must be + * preserved. + */ +unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq) +{ + return cpufreq_driver->fast_switch(policy, target_freq); +} +EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch); + /* Must set freqs->new to intermediate frequency */ static int __target_intermediate(struct cpufreq_policy *policy, struct cpufreq_freqs *freqs, int index) ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v3 6/7] cpufreq: sched: Re-introduce cpufreq_update_util() 2016-03-08 2:23 ` [PATCH v3 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (4 preceding siblings ...) 2016-03-08 2:38 ` [PATCH v3 5/7] cpufreq: Support for fast frequency switching Rafael J. Wysocki @ 2016-03-08 2:41 ` Rafael J. Wysocki 2016-03-08 2:50 ` [PATCH v3 7/7] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki 7 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-08 2:41 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> A subsequent change set will introduce a new cpufreq governor using CPU utilization information from the scheduler, so introduce cpufreq_update_util() (again) to allow that information to be passed to the new governor and make cpufreq_trigger_update() call it internally. To that end, modify the ->func callback pointer in struct freq_update_hook to take the util and max arguments in addition to the time one and arrange helpers to set/clear the utilization update hooks accordingly. Modify the current users of cpufreq utilization update callbacks to take the above changes into account. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Changes from v2: - One ->func callback for all of the users of struct freq_update_hook. 
--- drivers/cpufreq/cpufreq_governor.c | 80 ++++++++++++++++++------------------- drivers/cpufreq/intel_pstate.c | 12 +++-- include/linux/sched.h | 12 ++--- kernel/sched/cpufreq.c | 80 +++++++++++++++++++++++++++---------- kernel/sched/fair.c | 8 ++- kernel/sched/sched.h | 9 ++++ 6 files changed, 129 insertions(+), 72 deletions(-) Index: linux-pm/include/linux/sched.h =================================================================== --- linux-pm.orig/include/linux/sched.h +++ linux-pm/include/linux/sched.h @@ -2363,15 +2363,15 @@ static inline bool sched_can_stop_tick(v #endif #ifdef CONFIG_CPU_FREQ -void cpufreq_trigger_update(u64 time); - struct freq_update_hook { - void (*func)(struct freq_update_hook *hook, u64 time); + void (*func)(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max); }; -void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook); -#else -static inline void cpufreq_trigger_update(u64 time) {} +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook, + void (*func)(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max)); +void cpufreq_clear_freq_update_hook(int cpu); #endif #ifdef CONFIG_SCHED_AUTOGROUP Index: linux-pm/kernel/sched/cpufreq.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq.c +++ linux-pm/kernel/sched/cpufreq.c @@ -9,12 +9,12 @@ * published by the Free Software Foundation. */ -#include <linux/sched.h> +#include "sched.h" static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook); /** - * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer. + * set_freq_update_hook - Populate the CPU's freq_update_hook pointer. * @cpu: The CPU to set the pointer for. * @hook: New pointer value. 
* @@ -27,23 +27,75 @@ static DEFINE_PER_CPU(struct freq_update * accessed via the old update_util_data pointer or invoke synchronize_sched() * right after this function to avoid use-after-free. */ -void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook) +static void set_freq_update_hook(int cpu, struct freq_update_hook *hook) { - if (WARN_ON(hook && !hook->func)) + rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); +} + +/** + * cpufreq_set_freq_update_hook - Set the CPU's frequency update callback. + * @cpu: The CPU to set the callback for. + * @hook: New freq_update_hook pointer value. + * @func: Callback function to use with the new hook. + */ +void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook, + void (*func)(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max)) +{ + if (WARN_ON(!hook || !func)) return; - rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook); + hook->func = func; + set_freq_update_hook(cpu, hook); } EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook); /** + * cpufreq_clear_freq_update_hook - Clear the CPU's freq_update_hook pointer. + * @cpu: The CPU to clear the pointer for. + */ +void cpufreq_clear_freq_update_hook(int cpu) +{ + set_freq_update_hook(cpu, NULL); +} +EXPORT_SYMBOL_GPL(cpufreq_clear_freq_update_hook); + +/** + * cpufreq_update_util - Take a note about CPU utilization changes. + * @time: Current time. + * @util: CPU utilization. + * @max: CPU capacity. + * + * This function is called on every invocation of update_load_avg() on the CPU + * whose utilization is being updated. + * + * It can only be called from RCU-sched read-side critical sections.
+ */ +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) +{ + struct freq_update_hook *hook; + +#ifdef CONFIG_LOCKDEP + WARN_ON(debug_locks && !rcu_read_lock_sched_held()); +#endif + + hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); + /* + * If this isn't inside of an RCU-sched read-side critical section, hook + * may become NULL after the check below. + */ + if (hook) + hook->func(hook, time, util, max); +} + +/** * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed. * @time: Current time. * * The way cpufreq is currently arranged requires it to evaluate the CPU * performance state (frequency/voltage) on a regular basis. To facilitate - * that, this function is called by update_load_avg() in CFS when executed for - * the current CPU's runqueue. + * that, cpufreq_update_util() is called by update_load_avg() in CFS when + * executed for the current CPU's runqueue. * * However, this isn't sufficient to prevent the CPU from being stuck in a * completely inadequate performance level for too long, because the calls @@ -57,17 +109,5 @@ EXPORT_SYMBOL_GPL(cpufreq_set_freq_updat */ void cpufreq_trigger_update(u64 time) { - struct freq_update_hook *hook; - -#ifdef CONFIG_LOCKDEP - WARN_ON(debug_locks && !rcu_read_lock_sched_held()); -#endif - - hook = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_freq_update_hook)); - /* - * If this isn't inside of an RCU-sched read-side critical section, hook - * may become NULL after the check below. 
- */ - if (hook) - hook->func(hook, time); + cpufreq_update_util(time, ULONG_MAX, 0); } Index: linux-pm/kernel/sched/fair.c =================================================================== --- linux-pm.orig/kernel/sched/fair.c +++ linux-pm/kernel/sched/fair.c @@ -2839,6 +2839,8 @@ static inline void update_load_avg(struc update_tg_load_avg(cfs_rq, 0); if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) { + unsigned long max = rq->cpu_capacity_orig; + /* * There are a few boundary cases this might miss but it should * get called often enough that that should (hopefully) not be @@ -2847,9 +2849,11 @@ static inline void update_load_avg(struc * the next tick/schedule should update. * * It will not get called when we go idle, because the idle - * thread is a different class (!fair). + * thread is a different class (!fair), nor will the utilization + * number include things like RT tasks. */ - cpufreq_trigger_update(rq_clock(rq)); + cpufreq_update_util(rq_clock(rq), + min(cfs_rq->avg.util_avg, max), max); } } Index: linux-pm/drivers/cpufreq/intel_pstate.c =================================================================== --- linux-pm.orig/drivers/cpufreq/intel_pstate.c +++ linux-pm/drivers/cpufreq/intel_pstate.c @@ -1020,7 +1020,9 @@ static inline void intel_pstate_adjust_b sample->freq); } -static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time) +static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time, + unsigned long util_not_used, + unsigned long max_not_used) { struct cpudata *cpu = container_of(hook, struct cpudata, update_hook); u64 delta_ns = time - cpu->sample.time; @@ -1088,8 +1090,8 @@ static int intel_pstate_init_cpu(unsigne intel_pstate_busy_pid_reset(cpu); intel_pstate_sample(cpu, 0); - cpu->update_hook.func = intel_pstate_freq_update; - cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook); + cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook, + intel_pstate_freq_update); pr_debug("intel_pstate: 
controlling: cpu %d\n", cpunum); @@ -1173,7 +1175,7 @@ static void intel_pstate_stop_cpu(struct pr_debug("intel_pstate: CPU %d exiting\n", cpu_num); - cpufreq_set_freq_update_hook(cpu_num, NULL); + cpufreq_clear_freq_update_hook(cpu_num); synchronize_sched(); if (hwp_active) @@ -1441,7 +1443,7 @@ out: get_online_cpus(); for_each_online_cpu(cpu) { if (all_cpu_data[cpu]) { - cpufreq_set_freq_update_hook(cpu, NULL); + cpufreq_clear_freq_update_hook(cpu); synchronize_sched(); kfree(all_cpu_data[cpu]); } Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -211,43 +211,6 @@ unsigned int dbs_update(struct cpufreq_p } EXPORT_SYMBOL_GPL(dbs_update); -static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs, - unsigned int delay_us) -{ - struct cpufreq_policy *policy = policy_dbs->policy; - int cpu; - - gov_update_sample_delay(policy_dbs, delay_us); - policy_dbs->last_sample_time = 0; - - for_each_cpu(cpu, policy->cpus) { - struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu); - - cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook); - } -} - -static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy) -{ - int i; - - for_each_cpu(i, policy->cpus) - cpufreq_set_freq_update_hook(i, NULL); - - synchronize_sched(); -} - -static void gov_cancel_work(struct cpufreq_policy *policy) -{ - struct policy_dbs_info *policy_dbs = policy->governor_data; - - gov_clear_freq_update_hooks(policy_dbs->policy); - irq_work_sync(&policy_dbs->irq_work); - cancel_work_sync(&policy_dbs->work); - atomic_set(&policy_dbs->work_count, 0); - policy_dbs->work_in_progress = false; -} - static void dbs_work_handler(struct work_struct *work) { struct policy_dbs_info *policy_dbs; @@ -285,7 +248,9 @@ static void dbs_irq_work(struct irq_work schedule_work(&policy_dbs->work); } -static void 
dbs_freq_update_handler(struct freq_update_hook *hook, u64 time) +static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time, + unsigned long util_not_used, + unsigned long max_not_used) { struct cpu_dbs_info *cdbs = container_of(hook, struct cpu_dbs_info, update_hook); struct policy_dbs_info *policy_dbs = cdbs->policy_dbs; @@ -334,6 +299,44 @@ static void dbs_freq_update_handler(stru irq_work_queue(&policy_dbs->irq_work); } +static void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs, + unsigned int delay_us) +{ + struct cpufreq_policy *policy = policy_dbs->policy; + int cpu; + + gov_update_sample_delay(policy_dbs, delay_us); + policy_dbs->last_sample_time = 0; + + for_each_cpu(cpu, policy->cpus) { + struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu); + + cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook, + dbs_freq_update_handler); + } +} + +static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy) +{ + int i; + + for_each_cpu(i, policy->cpus) + cpufreq_clear_freq_update_hook(i); + + synchronize_sched(); +} + +static void gov_cancel_work(struct cpufreq_policy *policy) +{ + struct policy_dbs_info *policy_dbs = policy->governor_data; + + gov_clear_freq_update_hooks(policy_dbs->policy); + irq_work_sync(&policy_dbs->irq_work); + cancel_work_sync(&policy_dbs->work); + atomic_set(&policy_dbs->work_count, 0); + policy_dbs->work_in_progress = false; +} + static struct policy_dbs_info *alloc_policy_dbs_info(struct cpufreq_policy *policy, struct dbs_governor *gov) { @@ -356,7 +359,6 @@ static struct policy_dbs_info *alloc_pol struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j); j_cdbs->policy_dbs = policy_dbs; - j_cdbs->update_hook.func = dbs_freq_update_handler; } return policy_dbs; } Index: linux-pm/kernel/sched/sched.h =================================================================== --- linux-pm.orig/kernel/sched/sched.h +++ linux-pm/kernel/sched/sched.h @@ -1739,3 +1739,12 @@ static inline u64 irq_time_read(int 
cpu) } #endif /* CONFIG_64BIT */ #endif /* CONFIG_IRQ_TIME_ACCOUNTING */ + +#ifdef CONFIG_CPU_FREQ +void cpufreq_update_util(u64 time, unsigned long util, unsigned long max); +void cpufreq_trigger_update(u64 time); +#else +static inline void cpufreq_update_util(u64 time, unsigned long util, + unsigned long max) {} +static inline void cpufreq_trigger_update(u64 time) {} +#endif /* CONFIG_CPU_FREQ */ ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v3 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-08 2:23 ` [PATCH v3 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (5 preceding siblings ...) 2016-03-08 2:41 ` [PATCH v3 6/7] cpufreq: sched: Re-introduce cpufreq_update_util() Rafael J. Wysocki @ 2016-03-08 2:50 ` Rafael J. Wysocki 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki 7 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-08 2:50 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Add a new cpufreq scaling governor, called "schedutil", that uses scheduler-provided CPU utilization information as input for making its decisions. Doing that is possible after commit fe7034338ba0 (cpufreq: Add mechanism for registering utilization update callbacks) that introduced cpufreq_update_util() called by the scheduler on utilization changes (from CFS) and RT/DL task status updates. In particular, CPU frequency scaling decisions may be based on the utilization data passed to cpufreq_update_util() by CFS. The new governor is relatively simple. The frequency selection formula used by it is f = 1.1 * max_freq * util / max where util and max are the utilization and CPU capacity coming from CFS and max_freq is the nominal maximum frequency of the CPU (as reported by the cpufreq driver). All of the computations are carried out in the utilization update handlers provided by the new governor. One of those handlers is used for cpufreq policies shared between multiple CPUs and the other one is for policies with one CPU only (and therefore it doesn't need to use any extra synchronization means).
The governor supports fast frequency switching if that is supported by the cpufreq driver in use and possible for the given policy. In the fast switching case, all operations of the governor take place in its utilization update handlers. If fast switching cannot be used, the frequency switch operations are carried out with the help of a work item which only calls __cpufreq_driver_target() (under a mutex) to trigger a frequency update (to a value already computed beforehand in one of the utilization update handlers). Currently, the governor treats all of the RT and DL tasks as "unknown utilization" and sets the frequency to the allowed maximum when updated from the RT or DL sched classes. That heavy-handed approach should be replaced with something more subtle and specifically targeted at RT and DL tasks. The governor shares some tunables management code with the "ondemand" and "conservative" governors and uses some common definitions from cpufreq_governor.h, but apart from that it is stand-alone. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Changes from v2: - The governor goes into drivers/cpufreq/. - The "next frequency" formula has an additional 1.1 factor to allow more util/max values to map onto the top-most frequency in case the distance between that and the previous one is disproportionately small. - sugov_update_commit() traces CPU frequency even if the new one is the same as the previous one (otherwise, if the system is 100% loaded for long enough, powertop starts to report that all CPUs are 100% idle).
--- drivers/cpufreq/Kconfig | 26 + drivers/cpufreq/Makefile | 1 drivers/cpufreq/cpufreq_schedutil.c | 509 ++++++++++++++++++++++++++++++++++++ 3 files changed, 536 insertions(+) Index: linux-pm/drivers/cpufreq/Kconfig =================================================================== --- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor. + +config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL + bool "schedutil" + select CPU_FREQ_GOV_SCHEDUTIL + select CPU_FREQ_GOV_PERFORMANCE + help + Use the 'schedutil' CPUFreq governor by default. If unsure, + have a look at the help section of that governor. The fallback + governor will be 'performance'. + endchoice config CPU_FREQ_GOV_PERFORMANCE @@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE If in doubt, say N. +config CPU_FREQ_GOV_SCHEDUTIL + tristate "'schedutil' cpufreq policy governor" + depends on CPU_FREQ + select CPU_FREQ_GOV_ATTR_SET + select IRQ_WORK + help + The frequency selection formula used by this governor is analogous + to the one used by 'ondemand', but instead of computing CPU load + as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU + utilization data provided by the scheduler as input. + + To compile this driver as a module, choose M here: the + module will be called cpufreq_schedutil. + + If in doubt, say N. + comment "CPU frequency scaling drivers" config CPUFREQ_DT Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c =================================================================== --- /dev/null +++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c @@ -0,0 +1,509 @@ +/* + * CPUFreq governor based on scheduler-provided CPU utilization data. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/module.h> +#include <linux/slab.h> +#include <trace/events/power.h> + +#include "cpufreq_governor.h" + +struct sugov_tunables { + struct gov_attr_set attr_set; + unsigned int rate_limit_us; +}; + +struct sugov_policy { + struct cpufreq_policy *policy; + + struct sugov_tunables *tunables; + struct list_head tunables_hook; + + raw_spinlock_t update_lock; /* For shared policies */ + u64 last_freq_update_time; + s64 freq_update_delay_ns; + unsigned int next_freq; + unsigned int driver_freq; + unsigned int max_freq; + + /* The next fields are only needed if fast switch cannot be used. */ + struct irq_work irq_work; + struct work_struct work; + struct mutex work_lock; + bool work_in_progress; + + bool need_freq_update; +}; + +struct sugov_cpu { + struct freq_update_hook update_hook; + struct sugov_policy *sg_policy; + + /* The fields below are only needed when sharing a policy. 
*/ + unsigned long util; + unsigned long max; + u64 last_update; +}; + +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); + +/************************ Governor internals ***********************/ + +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) +{ + u64 delta_ns; + + if (sg_policy->work_in_progress) + return false; + + if (unlikely(sg_policy->need_freq_update)) { + sg_policy->need_freq_update = false; + return true; + } + + delta_ns = time - sg_policy->last_freq_update_time; + return (s64)delta_ns >= sg_policy->freq_update_delay_ns; +} + +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, + unsigned int next_freq) +{ + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int freq; + + if (next_freq > policy->max) + next_freq = policy->max; + else if (next_freq < policy->min) + next_freq = policy->min; + + sg_policy->last_freq_update_time = time; + if (sg_policy->next_freq == next_freq) { + if (!policy->fast_switch_possible) + return; + + freq = sg_policy->driver_freq; + } else { + sg_policy->next_freq = next_freq; + if (!policy->fast_switch_possible) { + sg_policy->work_in_progress = true; + irq_work_queue(&sg_policy->irq_work); + return; + } + freq = cpufreq_driver_fast_switch(policy, next_freq); + if (freq == CPUFREQ_ENTRY_INVALID) + return; + + sg_policy->driver_freq = freq; + } + policy->cur = freq; + trace_cpu_frequency(freq, smp_processor_id()); +} + +static void sugov_update_single(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + unsigned int max_f, next_f; + + if (!sugov_should_update_freq(sg_policy, time)) + return; + + max_f = sg_policy->max_freq; + next_f = util > max ? 
max_f : util * max_f / max; + sugov_update_commit(sg_policy, time, next_f); +} + +static unsigned int sugov_next_freq(struct sugov_policy *sg_policy, + unsigned long util, unsigned long max) +{ + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int max_f = sg_policy->max_freq; + u64 last_freq_update_time = sg_policy->last_freq_update_time; + unsigned int j; + + if (util > max) + return max_f; + + for_each_cpu(j, policy->cpus) { + struct sugov_cpu *j_sg_cpu; + unsigned long j_util, j_max; + u64 delta_ns; + + if (j == smp_processor_id()) + continue; + + j_sg_cpu = &per_cpu(sugov_cpu, j); + /* + * If the CPU utilization was last updated before the previous + * frequency update and the time elapsed between the last update + * of the CPU utilization and the last frequency update is long + * enough, don't take the CPU into account as it probably is + * idle now. + */ + delta_ns = last_freq_update_time - j_sg_cpu->last_update; + if ((s64)delta_ns > NSEC_PER_SEC / HZ) + continue; + + j_util = j_sg_cpu->util; + j_max = j_sg_cpu->max; + if (j_util > j_max) + return max_f; + + if (j_util * max > j_max * util) { + util = j_util; + max = j_max; + } + } + + return util * max_f / max; +} + +static void sugov_update_shared(struct freq_update_hook *hook, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + unsigned int next_f; + + raw_spin_lock(&sg_policy->update_lock); + + sg_cpu->util = util; + sg_cpu->max = max; + sg_cpu->last_update = time; + + if (sugov_should_update_freq(sg_policy, time)) { + next_f = sugov_next_freq(sg_policy, util, max); + sugov_update_commit(sg_policy, time, next_f); + } + + raw_spin_unlock(&sg_policy->update_lock); +} + +static void sugov_work(struct work_struct *work) +{ + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work); + + mutex_lock(&sg_policy->work_lock); + 
__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, + CPUFREQ_RELATION_L); + mutex_unlock(&sg_policy->work_lock); + + sg_policy->work_in_progress = false; +} + +static void sugov_irq_work(struct irq_work *irq_work) +{ + struct sugov_policy *sg_policy; + + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); + schedule_work(&sg_policy->work); +} + +/************************** sysfs interface ************************/ + +static struct sugov_tunables *global_tunables; +static DEFINE_MUTEX(global_tunables_lock); + +static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set) +{ + return container_of(attr_set, struct sugov_tunables, attr_set); +} + +static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + + return sprintf(buf, "%u\n", tunables->rate_limit_us); +} + +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, + size_t count) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + struct sugov_policy *sg_policy; + unsigned int rate_limit_us; + int ret; + + ret = sscanf(buf, "%u", &rate_limit_us); + if (ret != 1) + return -EINVAL; + + tunables->rate_limit_us = rate_limit_us; + + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) + sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC; + + return count; +} + +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us); + +static struct attribute *sugov_attributes[] = { + &rate_limit_us.attr, + NULL +}; + +static struct kobj_type sugov_tunables_ktype = { + .default_attrs = sugov_attributes, + .sysfs_ops = &governor_sysfs_ops, +}; + +/********************** cpufreq governor interface *********************/ + +static struct cpufreq_governor schedutil_gov; + +static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy) +{ + unsigned int max_f = policy->cpuinfo.max_freq; + struct 
sugov_policy *sg_policy; + + sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL); + if (!sg_policy) + return NULL; + + sg_policy->policy = policy; + /* + * Take the proportionality coefficient between util/max and frequency + * to be 1.1 times the nominal maximum frequency to boost performance + * slightly on systems with a narrow top-most frequency bin. + */ + sg_policy->max_freq = max_f + max_f / 10; + init_irq_work(&sg_policy->irq_work, sugov_irq_work); + INIT_WORK(&sg_policy->work, sugov_work); + mutex_init(&sg_policy->work_lock); + raw_spin_lock_init(&sg_policy->update_lock); + return sg_policy; +} + +static void sugov_policy_free(struct sugov_policy *sg_policy) +{ + mutex_destroy(&sg_policy->work_lock); + kfree(sg_policy); +} + +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy) +{ + struct sugov_tunables *tunables; + + tunables = kzalloc(sizeof(*tunables), GFP_KERNEL); + if (tunables) + gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook); + + return tunables; +} + +static void sugov_tunables_free(struct sugov_tunables *tunables) +{ + if (!have_governor_per_policy()) + global_tunables = NULL; + + kfree(tunables); +} + +static int sugov_init(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + struct sugov_tunables *tunables; + unsigned int lat; + int ret = 0; + + /* State should be equivalent to EXIT */ + if (policy->governor_data) + return -EBUSY; + + sg_policy = sugov_policy_alloc(policy); + if (!sg_policy) + return -ENOMEM; + + mutex_lock(&global_tunables_lock); + + if (global_tunables) { + if (WARN_ON(have_governor_per_policy())) { + ret = -EINVAL; + goto free_sg_policy; + } + policy->governor_data = sg_policy; + sg_policy->tunables = global_tunables; + + gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook); + goto out; + } + + tunables = sugov_tunables_alloc(sg_policy); + if (!tunables) { + ret = -ENOMEM; + goto free_sg_policy; + } + + tunables->rate_limit_us = 
LATENCY_MULTIPLIER; + lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC; + if (lat) + tunables->rate_limit_us *= lat; + + if (!have_governor_per_policy()) + global_tunables = tunables; + + policy->governor_data = sg_policy; + sg_policy->tunables = tunables; + + ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype, + get_governor_parent_kobj(policy), "%s", + schedutil_gov.name); + if (!ret) + goto out; + + /* Failure, so roll back. */ + policy->governor_data = NULL; + sugov_tunables_free(tunables); + + free_sg_policy: + pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret); + sugov_policy_free(sg_policy); + + out: + mutex_unlock(&global_tunables_lock); + return ret; +} + +static int sugov_exit(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + struct sugov_tunables *tunables = sg_policy->tunables; + unsigned int count; + + mutex_lock(&global_tunables_lock); + + count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook); + policy->governor_data = NULL; + if (!count) + sugov_tunables_free(tunables); + + mutex_unlock(&global_tunables_lock); + + sugov_policy_free(sg_policy); + return 0; +} + +static int sugov_start(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC; + sg_policy->last_freq_update_time = 0; + sg_policy->next_freq = UINT_MAX; + sg_policy->work_in_progress = false; + sg_policy->need_freq_update = false; + + for_each_cpu(cpu, policy->cpus) { + struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu); + + sg_cpu->sg_policy = sg_policy; + if (policy_is_shared(policy)) { + sg_cpu->util = ULONG_MAX; + sg_cpu->max = 0; + sg_cpu->last_update = 0; + cpufreq_set_freq_update_hook(cpu, &sg_cpu->update_hook, + sugov_update_shared); + } else { + cpufreq_set_freq_update_hook(cpu, &sg_cpu->update_hook, + 
sugov_update_single); + } + } + return 0; +} + +static int sugov_stop(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + for_each_cpu(cpu, policy->cpus) + cpufreq_clear_freq_update_hook(cpu); + + synchronize_sched(); + + irq_work_sync(&sg_policy->irq_work); + cancel_work_sync(&sg_policy->work); + return 0; +} + +static int sugov_limits(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + + if (!policy->fast_switch_possible) { + mutex_lock(&sg_policy->work_lock); + + if (policy->max < policy->cur) + __cpufreq_driver_target(policy, policy->max, + CPUFREQ_RELATION_H); + else if (policy->min > policy->cur) + __cpufreq_driver_target(policy, policy->min, + CPUFREQ_RELATION_L); + + mutex_unlock(&sg_policy->work_lock); + } + + sg_policy->need_freq_update = true; + return 0; +} + +int sugov_governor(struct cpufreq_policy *policy, unsigned int event) +{ + if (event == CPUFREQ_GOV_POLICY_INIT) { + return sugov_init(policy); + } else if (policy->governor_data) { + switch (event) { + case CPUFREQ_GOV_POLICY_EXIT: + return sugov_exit(policy); + case CPUFREQ_GOV_START: + return sugov_start(policy); + case CPUFREQ_GOV_STOP: + return sugov_stop(policy); + case CPUFREQ_GOV_LIMITS: + return sugov_limits(policy); + } + } + return -EINVAL; +} + +static struct cpufreq_governor schedutil_gov = { + .name = "schedutil", + .governor = sugov_governor, + .owner = THIS_MODULE, +}; + +static int __init sugov_module_init(void) +{ + return cpufreq_register_governor(&schedutil_gov); +} + +static void __exit sugov_module_exit(void) +{ + cpufreq_unregister_governor(&schedutil_gov); +} + +MODULE_AUTHOR("Rafael J. 
Wysocki <rafael.j.wysocki@intel.com>"); +MODULE_DESCRIPTION("Utilization-based CPU frequency selection"); +MODULE_LICENSE("GPL"); + +#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL +struct cpufreq_governor *cpufreq_default_governor(void) +{ + return &schedutil_gov; +} + +fs_initcall(sugov_module_init); +#else +module_init(sugov_module_init); +#endif +module_exit(sugov_module_exit); Index: linux-pm/drivers/cpufreq/Makefile =================================================================== --- linux-pm.orig/drivers/cpufreq/Makefile +++ linux-pm/drivers/cpufreq/Makefile @@ -12,6 +12,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += c obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v4 0/7] cpufreq: schedutil governor
2016-03-08 2:23 ` [PATCH v3 0/7] cpufreq: schedutil governor Rafael J. Wysocki
` (6 preceding siblings ...)
2016-03-08 2:50 ` [PATCH v3 7/7] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki
@ 2016-03-16 14:41 ` Rafael J. Wysocki
2016-03-16 14:43 ` [PATCH v4 1/7] cpufreq: sched: Helpers to add and remove update_util hooks Rafael J. Wysocki
` (11 more replies)
7 siblings, 12 replies; 158+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16 14:41 UTC (permalink / raw)
To: Linux PM list
Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar

Hi,

Here's a new iteration of the schedutil governor series. It is based on linux-next (particularly on the material from my pull request for 4.6-rc1), so I'm not resending the patches already included there. It has been present in my pm-cpufreq-experimental branch for a few days.

The first patch is new, but it is just something I think would be useful (and seems to be kind of compatible with things currently under discussion: http://marc.info/?l=linux-pm&m=145813384117349&w=4).

The next four patches are needed for sharing code between the new governor and the existing ones. Three of them have not changed since the previous iteration of the series and the fourth one is new (but it only moves some symbols around).

Patch [6/7] adds fast frequency switching support to cpufreq. It has changed since the previous version. Most importantly, there's a new fast_switch_enabled field in struct cpufreq_policy which is to be set when fast switching is actually enabled for the given policy, and governors are supposed to set it (using a helper function provided for that). This way, notifier registrations are only affected if someone is really using fast switching, which prevents existing setups from being affected.
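[Editorial aside: the interaction described above — fast switching can only be enabled when the driver supports it, and it conflicts with transition notifiers — can be modeled in a few lines of plain C. This is a userspace toy with hypothetical, simplified types, not the kernel code from patch [6/7].]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the fast_switch_enabled semantics described above.
 * Simplified, hypothetical types -- not the actual kernel structures.
 * A governor may only enable fast switching if the driver declared it
 * possible, and enabling is refused while transition notifiers are
 * registered, so existing notifier-based setups keep working unchanged.
 */
struct toy_policy {
	bool fast_switch_possible;	/* set by the driver */
	bool fast_switch_enabled;	/* set via the helper below */
};

static int transition_notifier_count;	/* stand-in for notifier registrations */

static bool toy_enable_fast_switch(struct toy_policy *p)
{
	/* Refuse if the driver can't do it or notifiers are in use. */
	if (!p->fast_switch_possible || transition_notifier_count > 0)
		return false;
	p->fast_switch_enabled = true;
	return true;
}
```

The point of the helper-based design is that the check happens once, at governor start, rather than on every frequency update.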
Patch [7/7] introduces the schedutil governor. There are a few changes in it from the previous version.

First off, I've attempted to address some points made during the recent discussion on the next frequency selection formula (http://marc.info/?t=145688568600003&r=1&w=4). It essentially uses the formula from http://marc.info/?l=linux-acpi&m=145756618321500&w=4 (bottom of the message body), but with the modification that if the utilization is frequency-invariant, it will use max_freq instead of the current frequency. It uses the mechanism suggested by Peter to recognize whether or not the utilization is frequency-invariant (http://marc.info/?l=linux-kernel&m=145760739700716&w=4).

Second, because of the above, the schedutil governor goes into kernel/sched/ (again). Namely, I don't want arch_scale_freq_invariant() to be visible to all cpufreq governors that won't need it.

Now, since we seem to want to build upon this series (see Mike's recent patchset: http://marc.info/?l=linux-kernel&m=145793318016832&w=4), I need you to tell me what to change before it is good enough to be queued up for 4.7 (assuming that my 4.6 material is merged, that is).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 158+ messages in thread
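[Editorial aside: the frequency selection formula referenced above can be sketched as follows. This is one reading of the linked discussion — next_freq = 1.25 * freq * util / max, with freq being max_freq when the utilization is frequency-invariant and the current frequency otherwise — and not necessarily the exact code in patch [7/7]; the 1.25 headroom factor is an assumption from that thread.]

```c
#include <assert.h>

/*
 * Sketch of the next-frequency formula described in the message above.
 * The headroom term (freq >> 2, i.e. 25%) keeps the CPU from running
 * exactly at the utilization breakeven point; whether the base is the
 * maximum or the current frequency depends on whether the utilization
 * numbers are frequency-invariant.
 */
static unsigned long toy_next_freq(unsigned long util, unsigned long max,
				   unsigned long cur_freq,
				   unsigned long max_freq,
				   int freq_invariant)
{
	unsigned long freq = freq_invariant ? max_freq : cur_freq;

	return (freq + (freq >> 2)) * util / max;
}
```

With frequency-invariant utilization the result is independent of the current operating point; without it, the formula has to chase the target iteratively from the current frequency.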
* [PATCH v4 1/7] cpufreq: sched: Helpers to add and remove update_util hooks
2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki
@ 2016-03-16 14:43 ` Rafael J. Wysocki
2016-03-16 14:44 ` [PATCH v4 2/7] cpufreq: governor: New data type for management part of dbs_data Rafael J. Wysocki
` (10 subsequent siblings)
11 siblings, 0 replies; 158+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16 14:43 UTC (permalink / raw)
To: Linux PM list
Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Replace the single helper for adding and removing cpufreq utilization update hooks, cpufreq_set_update_util_data(), with a pair of helpers, cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(), and modify the users of cpufreq_set_update_util_data() accordingly.

With the new helpers, the code using them doesn't need to worry about the internals of struct update_util_data; in particular, it doesn't need to populate the func field in it upfront.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
New patch.
--- drivers/cpufreq/cpufreq_governor.c | 76 ++++++++++++++++++------------------- drivers/cpufreq/intel_pstate.c | 8 +-- include/linux/sched.h | 5 +- kernel/sched/cpufreq.c | 48 ++++++++++++++++++----- 4 files changed, 83 insertions(+), 54 deletions(-) Index: linux-pm/include/linux/sched.h =================================================================== --- linux-pm.orig/include/linux/sched.h +++ linux-pm/include/linux/sched.h @@ -3213,7 +3213,10 @@ struct update_util_data { u64 time, unsigned long util, unsigned long max); }; -void cpufreq_set_update_util_data(int cpu, struct update_util_data *data); +void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data, + void (*func)(struct update_util_data *data, u64 time, + unsigned long util, unsigned long max)); +void cpufreq_remove_update_util_hook(int cpu); #endif /* CONFIG_CPU_FREQ */ #endif Index: linux-pm/kernel/sched/cpufreq.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq.c +++ linux-pm/kernel/sched/cpufreq.c @@ -14,24 +14,50 @@ DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); /** - * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer. + * cpufreq_add_update_util_hook - Populate the CPU's update_util_data pointer. * @cpu: The CPU to set the pointer for. * @data: New pointer value. + * @func: Callback function to set for the CPU. * - * Set and publish the update_util_data pointer for the given CPU. That pointer - * points to a struct update_util_data object containing a callback function - * to call from cpufreq_update_util(). That function will be called from an RCU - * read-side critical section, so it must not sleep. + * Set and publish the update_util_data pointer for the given CPU. * - * Callers must use RCU-sched callbacks to free any memory that might be - * accessed via the old update_util_data pointer or invoke synchronize_sched() - * right after this function to avoid use-after-free. 
+ * The update_util_data pointer of @cpu is set to @data and the callback + * function pointer in the target struct update_util_data is set to @func. + * That function will be called by cpufreq_update_util() from RCU-sched + * read-side critical sections, so it must not sleep. @data will always be + * passed to it as the first argument which allows the function to get to the + * target update_util_data structure and its container. + * + * The update_util_data pointer of @cpu must be NULL when this function is + * called or it will WARN() and return with no effect. */ -void cpufreq_set_update_util_data(int cpu, struct update_util_data *data) +void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data, + void (*func)(struct update_util_data *data, u64 time, + unsigned long util, unsigned long max)) { - if (WARN_ON(data && !data->func)) + if (WARN_ON(!data || !func)) return; + if (WARN_ON(per_cpu(cpufreq_update_util_data, cpu))) + return; + + data->func = func; rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data); } -EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data); +EXPORT_SYMBOL_GPL(cpufreq_add_update_util_hook); + +/** + * cpufreq_remove_update_util_hook - Clear the CPU's update_util_data pointer. + * @cpu: The CPU to clear the pointer for. + * + * Clear the update_util_data pointer for the given CPU. + * + * Callers must use RCU-sched callbacks to free any memory that might be + * accessed via the old update_util_data pointer or invoke synchronize_sched() + * right after this function to avoid use-after-free. 
+ */ +void cpufreq_remove_update_util_hook(int cpu) +{ + rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL); +} +EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook); Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -258,43 +258,6 @@ unsigned int dbs_update(struct cpufreq_p } EXPORT_SYMBOL_GPL(dbs_update); -static void gov_set_update_util(struct policy_dbs_info *policy_dbs, - unsigned int delay_us) -{ - struct cpufreq_policy *policy = policy_dbs->policy; - int cpu; - - gov_update_sample_delay(policy_dbs, delay_us); - policy_dbs->last_sample_time = 0; - - for_each_cpu(cpu, policy->cpus) { - struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu); - - cpufreq_set_update_util_data(cpu, &cdbs->update_util); - } -} - -static inline void gov_clear_update_util(struct cpufreq_policy *policy) -{ - int i; - - for_each_cpu(i, policy->cpus) - cpufreq_set_update_util_data(i, NULL); - - synchronize_sched(); -} - -static void gov_cancel_work(struct cpufreq_policy *policy) -{ - struct policy_dbs_info *policy_dbs = policy->governor_data; - - gov_clear_update_util(policy_dbs->policy); - irq_work_sync(&policy_dbs->irq_work); - cancel_work_sync(&policy_dbs->work); - atomic_set(&policy_dbs->work_count, 0); - policy_dbs->work_in_progress = false; -} - static void dbs_work_handler(struct work_struct *work) { struct policy_dbs_info *policy_dbs; @@ -382,6 +345,44 @@ static void dbs_update_util_handler(stru irq_work_queue(&policy_dbs->irq_work); } +static void gov_set_update_util(struct policy_dbs_info *policy_dbs, + unsigned int delay_us) +{ + struct cpufreq_policy *policy = policy_dbs->policy; + int cpu; + + gov_update_sample_delay(policy_dbs, delay_us); + policy_dbs->last_sample_time = 0; + + for_each_cpu(cpu, policy->cpus) { + struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu); + + 
cpufreq_add_update_util_hook(cpu, &cdbs->update_util, + dbs_update_util_handler); + } +} + +static inline void gov_clear_update_util(struct cpufreq_policy *policy) +{ + int i; + + for_each_cpu(i, policy->cpus) + cpufreq_remove_update_util_hook(i); + + synchronize_sched(); +} + +static void gov_cancel_work(struct cpufreq_policy *policy) +{ + struct policy_dbs_info *policy_dbs = policy->governor_data; + + gov_clear_update_util(policy_dbs->policy); + irq_work_sync(&policy_dbs->irq_work); + cancel_work_sync(&policy_dbs->work); + atomic_set(&policy_dbs->work_count, 0); + policy_dbs->work_in_progress = false; +} + static struct policy_dbs_info *alloc_policy_dbs_info(struct cpufreq_policy *policy, struct dbs_governor *gov) { @@ -404,7 +405,6 @@ static struct policy_dbs_info *alloc_pol struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j); j_cdbs->policy_dbs = policy_dbs; - j_cdbs->update_util.func = dbs_update_util_handler; } return policy_dbs; } Index: linux-pm/drivers/cpufreq/intel_pstate.c =================================================================== --- linux-pm.orig/drivers/cpufreq/intel_pstate.c +++ linux-pm/drivers/cpufreq/intel_pstate.c @@ -1089,8 +1089,8 @@ static int intel_pstate_init_cpu(unsigne intel_pstate_busy_pid_reset(cpu); intel_pstate_sample(cpu, 0); - cpu->update_util.func = intel_pstate_update_util; - cpufreq_set_update_util_data(cpunum, &cpu->update_util); + cpufreq_add_update_util_hook(cpunum, &cpu->update_util, + intel_pstate_update_util); pr_debug("intel_pstate: controlling: cpu %d\n", cpunum); @@ -1174,7 +1174,7 @@ static void intel_pstate_stop_cpu(struct pr_debug("intel_pstate: CPU %d exiting\n", cpu_num); - cpufreq_set_update_util_data(cpu_num, NULL); + cpufreq_remove_update_util_hook(cpu_num); synchronize_sched(); if (hwp_active) @@ -1442,7 +1442,7 @@ out: get_online_cpus(); for_each_online_cpu(cpu) { if (all_cpu_data[cpu]) { - cpufreq_set_update_util_data(cpu, NULL); + cpufreq_remove_update_util_hook(cpu); synchronize_sched(); 
kfree(all_cpu_data[cpu]); } ^ permalink raw reply [flat|nested] 158+ messages in thread
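[Editorial aside: the semantics of the helper pair introduced by this patch — publish a per-CPU pointer and fill in its callback atomically from the caller's point of view, refuse to overwrite an already-published hook, and clear it on removal — can be modeled in plain userspace C. Hypothetical simplified types; the real kernel versions use rcu_assign_pointer() and WARN(), and removal must be followed by synchronize_sched() before freeing the data.]

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

struct toy_update_util_data {
	void (*func)(struct toy_update_util_data *data);
};

/* stand-in for the per-CPU cpufreq_update_util_data pointers */
static struct toy_update_util_data *toy_per_cpu_hook[TOY_NR_CPUS];

/* Publish @data for @cpu after setting its callback; refuse to
 * overwrite an existing hook (the kernel version WARNs instead). */
static int toy_add_update_util_hook(int cpu, struct toy_update_util_data *data,
				    void (*func)(struct toy_update_util_data *))
{
	if (!data || !func || toy_per_cpu_hook[cpu])
		return -1;
	data->func = func;
	toy_per_cpu_hook[cpu] = data;
	return 0;
}

/* Clear the hook; callers must still wait for in-flight callbacks
 * (synchronize_sched() in the kernel) before freeing @data. */
static void toy_remove_update_util_hook(int cpu)
{
	toy_per_cpu_hook[cpu] = NULL;
}

static void toy_dummy_cb(struct toy_update_util_data *data)
{
	(void)data;
}
```

Having the add helper own the func assignment is what lets intel_pstate and the dbs governors drop their manual `update_util.func = ...` lines in the diff above.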
* [PATCH v4 2/7] cpufreq: governor: New data type for management part of dbs_data
2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki
2016-03-16 14:43 ` [PATCH v4 1/7] cpufreq: sched: Helpers to add and remove update_util hooks Rafael J. Wysocki
@ 2016-03-16 14:44 ` Rafael J. Wysocki
2016-03-16 14:45 ` [PATCH v4 3/7] cpufreq: governor: Move abstract gov_attr_set code to separate file Rafael J. Wysocki
` (9 subsequent siblings)
11 siblings, 0 replies; 158+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16 14:44 UTC (permalink / raw)
To: Linux PM list
Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In addition to fields representing governor tunables, struct dbs_data contains some fields needed for the management of objects of that type. As it turns out, that part of struct dbs_data may be shared with (future) governors that won't use the common code used by "ondemand" and "conservative", so move it to a separate struct type and modify the code using struct dbs_data to follow.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---
No changes from the previous version.
--- drivers/cpufreq/cpufreq_conservative.c | 25 +++++---- drivers/cpufreq/cpufreq_governor.c | 90 ++++++++++++++++++++------------- drivers/cpufreq/cpufreq_governor.h | 35 +++++++----- drivers/cpufreq/cpufreq_ondemand.c | 29 ++++++---- 4 files changed, 107 insertions(+), 72 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -41,6 +41,13 @@ /* Ondemand Sampling types */ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; +struct gov_attr_set { + struct kobject kobj; + struct list_head policy_list; + struct mutex update_lock; + int usage_count; +}; + /* * Abbreviations: * dbs: used as a shortform for demand based switching It helps to keep variable @@ -52,7 +59,7 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; /* Governor demand based switching data (per-policy or global). */ struct dbs_data { - int usage_count; + struct gov_attr_set attr_set; void *tuners; unsigned int min_sampling_rate; unsigned int ignore_nice_load; @@ -60,37 +67,35 @@ struct dbs_data { unsigned int sampling_down_factor; unsigned int up_threshold; unsigned int io_is_busy; - - struct kobject kobj; - struct list_head policy_dbs_list; - /* - * Protect concurrent updates to governor tunables from sysfs, - * policy_dbs_list and usage_count. 
- */ - struct mutex mutex; }; +static inline struct dbs_data *to_dbs_data(struct gov_attr_set *attr_set) +{ + return container_of(attr_set, struct dbs_data, attr_set); +} + /* Governor's specific attributes */ -struct dbs_data; struct governor_attr { struct attribute attr; - ssize_t (*show)(struct dbs_data *dbs_data, char *buf); - ssize_t (*store)(struct dbs_data *dbs_data, const char *buf, + ssize_t (*show)(struct gov_attr_set *attr_set, char *buf); + ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf, size_t count); }; #define gov_show_one(_gov, file_name) \ static ssize_t show_##file_name \ -(struct dbs_data *dbs_data, char *buf) \ +(struct gov_attr_set *attr_set, char *buf) \ { \ + struct dbs_data *dbs_data = to_dbs_data(attr_set); \ struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \ return sprintf(buf, "%u\n", tuners->file_name); \ } #define gov_show_one_common(file_name) \ static ssize_t show_##file_name \ -(struct dbs_data *dbs_data, char *buf) \ +(struct gov_attr_set *attr_set, char *buf) \ { \ + struct dbs_data *dbs_data = to_dbs_data(attr_set); \ return sprintf(buf, "%u\n", dbs_data->file_name); \ } @@ -184,7 +189,7 @@ void od_register_powersave_bias_handler( (struct cpufreq_policy *, unsigned int, unsigned int), unsigned int powersave_bias); void od_unregister_powersave_bias_handler(void); -ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf, +ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf, size_t count); void gov_update_cpu_data(struct dbs_data *dbs_data); #endif /* _CPUFREQ_GOVERNOR_H */ Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -43,9 +43,10 @@ static DEFINE_MUTEX(gov_dbs_data_mutex); * This must be called with dbs_data->mutex held, otherwise traversing * policy_dbs_list isn't safe. 
*/ -ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf, +ssize_t store_sampling_rate(struct gov_attr_set *attr_set, const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct policy_dbs_info *policy_dbs; unsigned int rate; int ret; @@ -59,7 +60,7 @@ ssize_t store_sampling_rate(struct dbs_d * We are operating under dbs_data->mutex and so the list and its * entries can't be freed concurrently. */ - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &attr_set->policy_list, list) { mutex_lock(&policy_dbs->timer_mutex); /* * On 32-bit architectures this may race with the @@ -96,7 +97,7 @@ void gov_update_cpu_data(struct dbs_data { struct policy_dbs_info *policy_dbs; - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) { unsigned int j; for_each_cpu(j, policy_dbs->policy->cpus) { @@ -111,9 +112,9 @@ void gov_update_cpu_data(struct dbs_data } EXPORT_SYMBOL_GPL(gov_update_cpu_data); -static inline struct dbs_data *to_dbs_data(struct kobject *kobj) +static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj) { - return container_of(kobj, struct dbs_data, kobj); + return container_of(kobj, struct gov_attr_set, kobj); } static inline struct governor_attr *to_gov_attr(struct attribute *attr) @@ -124,25 +125,24 @@ static inline struct governor_attr *to_g static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, char *buf) { - struct dbs_data *dbs_data = to_dbs_data(kobj); struct governor_attr *gattr = to_gov_attr(attr); - return gattr->show(dbs_data, buf); + return gattr->show(to_gov_attr_set(kobj), buf); } static ssize_t governor_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t count) { - struct dbs_data *dbs_data = to_dbs_data(kobj); + struct gov_attr_set *attr_set = to_gov_attr_set(kobj); struct governor_attr *gattr = 
to_gov_attr(attr); int ret = -EBUSY; - mutex_lock(&dbs_data->mutex); + mutex_lock(&attr_set->update_lock); - if (dbs_data->usage_count) - ret = gattr->store(dbs_data, buf, count); + if (attr_set->usage_count) + ret = gattr->store(attr_set, buf, count); - mutex_unlock(&dbs_data->mutex); + mutex_unlock(&attr_set->update_lock); return ret; } @@ -425,6 +425,41 @@ static void free_policy_dbs_info(struct gov->free(policy_dbs); } +static void gov_attr_set_init(struct gov_attr_set *attr_set, + struct list_head *list_node) +{ + INIT_LIST_HEAD(&attr_set->policy_list); + mutex_init(&attr_set->update_lock); + attr_set->usage_count = 1; + list_add(list_node, &attr_set->policy_list); +} + +static void gov_attr_set_get(struct gov_attr_set *attr_set, + struct list_head *list_node) +{ + mutex_lock(&attr_set->update_lock); + attr_set->usage_count++; + list_add(list_node, &attr_set->policy_list); + mutex_unlock(&attr_set->update_lock); +} + +static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, + struct list_head *list_node) +{ + unsigned int count; + + mutex_lock(&attr_set->update_lock); + list_del(list_node); + count = --attr_set->usage_count; + mutex_unlock(&attr_set->update_lock); + if (count) + return count; + + kobject_put(&attr_set->kobj); + mutex_destroy(&attr_set->update_lock); + return 0; +} + static int cpufreq_governor_init(struct cpufreq_policy *policy) { struct dbs_governor *gov = dbs_governor_of(policy); @@ -453,10 +488,7 @@ static int cpufreq_governor_init(struct policy_dbs->dbs_data = dbs_data; policy->governor_data = policy_dbs; - mutex_lock(&dbs_data->mutex); - dbs_data->usage_count++; - list_add(&policy_dbs->list, &dbs_data->policy_dbs_list); - mutex_unlock(&dbs_data->mutex); + gov_attr_set_get(&dbs_data->attr_set, &policy_dbs->list); goto out; } @@ -466,8 +498,7 @@ static int cpufreq_governor_init(struct goto free_policy_dbs_info; } - INIT_LIST_HEAD(&dbs_data->policy_dbs_list); - mutex_init(&dbs_data->mutex); + gov_attr_set_init(&dbs_data->attr_set, 
&policy_dbs->list); ret = gov->init(dbs_data, !policy->governor->initialized); if (ret) @@ -487,14 +518,11 @@ static int cpufreq_governor_init(struct if (!have_governor_per_policy()) gov->gdbs_data = dbs_data; - policy->governor_data = policy_dbs; - policy_dbs->dbs_data = dbs_data; - dbs_data->usage_count = 1; - list_add(&policy_dbs->list, &dbs_data->policy_dbs_list); + policy->governor_data = policy_dbs; gov->kobj_type.sysfs_ops = &governor_sysfs_ops; - ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type, + ret = kobject_init_and_add(&dbs_data->attr_set.kobj, &gov->kobj_type, get_governor_parent_kobj(policy), "%s", gov->gov.name); if (!ret) @@ -523,29 +551,21 @@ static int cpufreq_governor_exit(struct struct dbs_governor *gov = dbs_governor_of(policy); struct policy_dbs_info *policy_dbs = policy->governor_data; struct dbs_data *dbs_data = policy_dbs->dbs_data; - int count; + unsigned int count; /* Protect gov->gdbs_data against concurrent updates. */ mutex_lock(&gov_dbs_data_mutex); - mutex_lock(&dbs_data->mutex); - list_del(&policy_dbs->list); - count = --dbs_data->usage_count; - mutex_unlock(&dbs_data->mutex); + count = gov_attr_set_put(&dbs_data->attr_set, &policy_dbs->list); - if (!count) { - kobject_put(&dbs_data->kobj); - - policy->governor_data = NULL; + policy->governor_data = NULL; + if (!count) { if (!have_governor_per_policy()) gov->gdbs_data = NULL; gov->exit(dbs_data, policy->governor->initialized == 1); - mutex_destroy(&dbs_data->mutex); kfree(dbs_data); - } else { - policy->governor_data = NULL; } free_policy_dbs_info(policy_dbs, gov); Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c +++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c @@ -207,9 +207,10 @@ static unsigned int od_dbs_timer(struct /************************** sysfs interface ************************/ static struct dbs_governor od_dbs_gov; -static ssize_t 
store_io_is_busy(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_io_is_busy(struct gov_attr_set *attr_set, const char *buf, + size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; @@ -224,9 +225,10 @@ static ssize_t store_io_is_busy(struct d return count; } -static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_up_threshold(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -240,9 +242,10 @@ static ssize_t store_up_threshold(struct return count; } -static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct policy_dbs_info *policy_dbs; unsigned int input; int ret; @@ -254,7 +257,7 @@ static ssize_t store_sampling_down_facto dbs_data->sampling_down_factor = input; /* Reset down sampling multiplier in case it was active */ - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) { + list_for_each_entry(policy_dbs, &attr_set->policy_list, list) { /* * Doing this without locking might lead to using different * rate_mult values in od_update() and od_dbs_timer(). 
@@ -267,9 +270,10 @@ static ssize_t store_sampling_down_facto return count; } -static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; @@ -291,9 +295,10 @@ static ssize_t store_ignore_nice_load(st return count; } -static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_powersave_bias(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct od_dbs_tuners *od_tuners = dbs_data->tuners; struct policy_dbs_info *policy_dbs; unsigned int input; @@ -308,7 +313,7 @@ static ssize_t store_powersave_bias(stru od_tuners->powersave_bias = input; - list_for_each_entry(policy_dbs, &dbs_data->policy_dbs_list, list) + list_for_each_entry(policy_dbs, &attr_set->policy_list, list) ondemand_powersave_bias_init(policy_dbs->policy); return count; Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c +++ linux-pm/drivers/cpufreq/cpufreq_conservative.c @@ -129,9 +129,10 @@ static struct notifier_block cs_cpufreq_ /************************** sysfs interface ************************/ static struct dbs_governor cs_dbs_gov; -static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -143,9 +144,10 @@ static ssize_t store_sampling_down_facto return count; } -static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf, - size_t count) 
+static ssize_t store_up_threshold(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; @@ -158,9 +160,10 @@ static ssize_t store_up_threshold(struct return count; } -static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_down_threshold(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; @@ -175,9 +178,10 @@ static ssize_t store_down_threshold(stru return count; } -static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data, - const char *buf, size_t count) +static ssize_t store_ignore_nice_load(struct gov_attr_set *attr_set, + const char *buf, size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); unsigned int input; int ret; @@ -199,9 +203,10 @@ static ssize_t store_ignore_nice_load(st return count; } -static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf, - size_t count) +static ssize_t store_freq_step(struct gov_attr_set *attr_set, const char *buf, + size_t count) { + struct dbs_data *dbs_data = to_dbs_data(attr_set); struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v4 3/7] cpufreq: governor: Move abstract gov_attr_set code to separate file
2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki
2016-03-16 14:43 ` [PATCH v4 1/7] cpufreq: sched: Helpers to add and remove update_util hooks Rafael J. Wysocki
2016-03-16 14:44 ` [PATCH v4 2/7] cpufreq: governor: New data type for management part of dbs_data Rafael J. Wysocki
@ 2016-03-16 14:45 ` Rafael J. Wysocki
2016-03-16 14:46 ` [PATCH v4 4/7] cpufreq: Move governor attribute set headers to cpufreq.h Rafael J. Wysocki
` (8 subsequent siblings)
11 siblings, 0 replies; 158+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16 14:45 UTC (permalink / raw)
To: Linux PM list
Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Move abstract code related to struct gov_attr_set to a separate (new) file so it can be shared with (future) governors that won't share more code with "ondemand" and "conservative".

No intentional functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---
No changes from the previous version.
--- drivers/cpufreq/Kconfig | 4 + drivers/cpufreq/Makefile | 1 drivers/cpufreq/cpufreq_governor.c | 82 --------------------------- drivers/cpufreq/cpufreq_governor.h | 6 ++ drivers/cpufreq/cpufreq_governor_attr_set.c | 84 ++++++++++++++++++++++++++++ 5 files changed, 95 insertions(+), 82 deletions(-) Index: linux-pm/drivers/cpufreq/Kconfig =================================================================== --- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -18,7 +18,11 @@ config CPU_FREQ if CPU_FREQ +config CPU_FREQ_GOV_ATTR_SET + bool + config CPU_FREQ_GOV_COMMON + select CPU_FREQ_GOV_ATTR_SET select IRQ_WORK bool Index: linux-pm/drivers/cpufreq/Makefile =================================================================== --- linux-pm.orig/drivers/cpufreq/Makefile +++ linux-pm/drivers/cpufreq/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE) += obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND) += cpufreq_ondemand.o obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o +obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o Index: linux-pm/drivers/cpufreq/cpufreq_governor.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c +++ linux-pm/drivers/cpufreq/cpufreq_governor.c @@ -112,53 +112,6 @@ void gov_update_cpu_data(struct dbs_data } EXPORT_SYMBOL_GPL(gov_update_cpu_data); -static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj) -{ - return container_of(kobj, struct gov_attr_set, kobj); -} - -static inline struct governor_attr *to_gov_attr(struct attribute *attr) -{ - return container_of(attr, struct governor_attr, attr); -} - -static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, - char *buf) -{ - struct governor_attr *gattr = to_gov_attr(attr); - - return gattr->show(to_gov_attr_set(kobj), buf); -} - -static 
ssize_t governor_store(struct kobject *kobj, struct attribute *attr, - const char *buf, size_t count) -{ - struct gov_attr_set *attr_set = to_gov_attr_set(kobj); - struct governor_attr *gattr = to_gov_attr(attr); - int ret = -EBUSY; - - mutex_lock(&attr_set->update_lock); - - if (attr_set->usage_count) - ret = gattr->store(attr_set, buf, count); - - mutex_unlock(&attr_set->update_lock); - - return ret; -} - -/* - * Sysfs Ops for accessing governor attributes. - * - * All show/store invocations for governor specific sysfs attributes, will first - * call the below show/store callbacks and the attribute specific callback will - * be called from within it. - */ -static const struct sysfs_ops governor_sysfs_ops = { - .show = governor_show, - .store = governor_store, -}; - unsigned int dbs_update(struct cpufreq_policy *policy) { struct policy_dbs_info *policy_dbs = policy->governor_data; @@ -425,41 +378,6 @@ static void free_policy_dbs_info(struct gov->free(policy_dbs); } -static void gov_attr_set_init(struct gov_attr_set *attr_set, - struct list_head *list_node) -{ - INIT_LIST_HEAD(&attr_set->policy_list); - mutex_init(&attr_set->update_lock); - attr_set->usage_count = 1; - list_add(list_node, &attr_set->policy_list); -} - -static void gov_attr_set_get(struct gov_attr_set *attr_set, - struct list_head *list_node) -{ - mutex_lock(&attr_set->update_lock); - attr_set->usage_count++; - list_add(list_node, &attr_set->policy_list); - mutex_unlock(&attr_set->update_lock); -} - -static unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, - struct list_head *list_node) -{ - unsigned int count; - - mutex_lock(&attr_set->update_lock); - list_del(list_node); - count = --attr_set->usage_count; - mutex_unlock(&attr_set->update_lock); - if (count) - return count; - - kobject_put(&attr_set->kobj); - mutex_destroy(&attr_set->update_lock); - return 0; -} - static int cpufreq_governor_init(struct cpufreq_policy *policy) { struct dbs_governor *gov = dbs_governor_of(policy); Index: 
linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -48,6 +48,12 @@ struct gov_attr_set { int usage_count; }; +extern const struct sysfs_ops governor_sysfs_ops; + +void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node); +void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node); +unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node); + /* * Abbreviations: * dbs: used as a shortform for demand based switching It helps to keep variable Index: linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c =================================================================== --- /dev/null +++ linux-pm/drivers/cpufreq/cpufreq_governor_attr_set.c @@ -0,0 +1,84 @@ +/* + * Abstract code for CPUFreq governor tunable sysfs attributes. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#include "cpufreq_governor.h" + +static inline struct gov_attr_set *to_gov_attr_set(struct kobject *kobj) +{ + return container_of(kobj, struct gov_attr_set, kobj); +} + +static inline struct governor_attr *to_gov_attr(struct attribute *attr) +{ + return container_of(attr, struct governor_attr, attr); +} + +static ssize_t governor_show(struct kobject *kobj, struct attribute *attr, + char *buf) +{ + struct governor_attr *gattr = to_gov_attr(attr); + + return gattr->show(to_gov_attr_set(kobj), buf); +} + +static ssize_t governor_store(struct kobject *kobj, struct attribute *attr, + const char *buf, size_t count) +{ + struct gov_attr_set *attr_set = to_gov_attr_set(kobj); + struct governor_attr *gattr = to_gov_attr(attr); + int ret; + + mutex_lock(&attr_set->update_lock); + ret = attr_set->usage_count ? gattr->store(attr_set, buf, count) : -EBUSY; + mutex_unlock(&attr_set->update_lock); + return ret; +} + +const struct sysfs_ops governor_sysfs_ops = { + .show = governor_show, + .store = governor_store, +}; +EXPORT_SYMBOL_GPL(governor_sysfs_ops); + +void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node) +{ + INIT_LIST_HEAD(&attr_set->policy_list); + mutex_init(&attr_set->update_lock); + attr_set->usage_count = 1; + list_add(list_node, &attr_set->policy_list); +} +EXPORT_SYMBOL_GPL(gov_attr_set_init); + +void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node) +{ + mutex_lock(&attr_set->update_lock); + attr_set->usage_count++; + list_add(list_node, &attr_set->policy_list); + mutex_unlock(&attr_set->update_lock); +} +EXPORT_SYMBOL_GPL(gov_attr_set_get); + +unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node) +{ + unsigned int count; + + mutex_lock(&attr_set->update_lock); + list_del(list_node); + count = --attr_set->usage_count; + mutex_unlock(&attr_set->update_lock); + if (count) + return count; + + kobject_put(&attr_set->kobj); + mutex_destroy(&attr_set->update_lock); + 
return 0; +} +EXPORT_SYMBOL_GPL(gov_attr_set_put); ^ permalink raw reply [flat|nested] 158+ messages in thread
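The reference-counting contract of the gov_attr_set helpers moved above can be modelled in a few lines outside the kernel. The sketch below is a toy C model only: the kobject, the policy list and update_lock are elided, and the `released` flag is a hypothetical stand-in for the final kobject_put(); none of these names are kernel code.

```c
#include <assert.h>

/* Toy model of the gov_attr_set usage_count protocol: init takes the
 * first reference, get takes another, and the last put releases the set. */
struct toy_attr_set {
	int usage_count;
	int released;	/* stand-in for kobject_put() having run */
};

static void toy_attr_set_init(struct toy_attr_set *s)
{
	s->usage_count = 1;	/* the first policy holds one reference */
	s->released = 0;
}

static void toy_attr_set_get(struct toy_attr_set *s)
{
	s->usage_count++;	/* another policy joins the set */
}

static unsigned int toy_attr_set_put(struct toy_attr_set *s)
{
	unsigned int count = --s->usage_count;

	if (count)
		return count;	/* other policies still use the set */
	s->released = 1;	/* last user: drop the kobject, destroy the lock */
	return 0;
}
```

The nonzero return from put tells the caller that other policies still reference the tunables, which is why governors only free their dbs_data when gov_attr_set_put() returns 0.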
* [PATCH v4 4/7] cpufreq: Move governor attribute set headers to cpufreq.h 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (2 preceding siblings ...) 2016-03-16 14:45 ` [PATCH v4 3/7] cpufreq: governor: Move abstract gov_attr_set code to separate file Rafael J. Wysocki @ 2016-03-16 14:46 ` Rafael J. Wysocki 2016-03-16 14:47 ` [PATCH v4 5/7] cpufreq: Move governor symbols " Rafael J. Wysocki ` (7 subsequent siblings) 11 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 14:46 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Move definitions and function headers related to struct gov_attr_set to include/linux/cpufreq.h so they can be used by (future) governors located outside of drivers/cpufreq/. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> --- This one was present in v2, no changes since then.
--- drivers/cpufreq/cpufreq_governor.h | 21 --------------------- include/linux/cpufreq.h | 23 +++++++++++++++++++++++ 2 files changed, 23 insertions(+), 21 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -41,19 +41,6 @@ /* Ondemand Sampling types */ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; -struct gov_attr_set { - struct kobject kobj; - struct list_head policy_list; - struct mutex update_lock; - int usage_count; -}; - -extern const struct sysfs_ops governor_sysfs_ops; - -void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node); -void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node); -unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node); - /* * Abbreviations: * dbs: used as a shortform for demand based switching It helps to keep variable @@ -80,14 +67,6 @@ static inline struct dbs_data *to_dbs_da return container_of(attr_set, struct dbs_data, attr_set); } -/* Governor's specific attributes */ -struct governor_attr { - struct attribute attr; - ssize_t (*show)(struct gov_attr_set *attr_set, char *buf); - ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf, - size_t count); -}; - #define gov_show_one(_gov, file_name) \ static ssize_t show_##file_name \ (struct gov_attr_set *attr_set, char *buf) \ Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -462,6 +462,29 @@ void cpufreq_unregister_governor(struct struct cpufreq_governor *cpufreq_default_governor(void); struct cpufreq_governor *cpufreq_fallback_governor(void); +/* Governor attribute set */ +struct gov_attr_set { + struct kobject kobj; + struct list_head policy_list; + struct mutex 
update_lock; + int usage_count; +}; + +/* sysfs ops for cpufreq governors */ +extern const struct sysfs_ops governor_sysfs_ops; + +void gov_attr_set_init(struct gov_attr_set *attr_set, struct list_head *list_node); +void gov_attr_set_get(struct gov_attr_set *attr_set, struct list_head *list_node); +unsigned int gov_attr_set_put(struct gov_attr_set *attr_set, struct list_head *list_node); + +/* Governor sysfs attribute */ +struct governor_attr { + struct attribute attr; + ssize_t (*show)(struct gov_attr_set *attr_set, char *buf); + ssize_t (*store)(struct gov_attr_set *attr_set, const char *buf, + size_t count); +}; + /********************************************************************* * FREQUENCY TABLE HELPERS * *********************************************************************/ ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v4 5/7] cpufreq: Move governor symbols to cpufreq.h 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (3 preceding siblings ...) 2016-03-16 14:46 ` [PATCH v4 4/7] cpufreq: Move governor attribute set headers to cpufreq.h Rafael J. Wysocki @ 2016-03-16 14:47 ` Rafael J. Wysocki 2016-03-16 14:52 ` [PATCH v4 6/7] cpufreq: Support for fast frequency switching Rafael J. Wysocki ` (6 subsequent siblings) 11 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 14:47 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Move definitions of symbols related to transition latency and sampling rate to include/linux/cpufreq.h so they can be used by (future) governors located outside of drivers/cpufreq/. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- New patch. --- drivers/cpufreq/cpufreq_governor.h | 14 -------------- include/linux/cpufreq.h | 14 ++++++++++++++ 2 files changed, 14 insertions(+), 14 deletions(-) Index: linux-pm/drivers/cpufreq/cpufreq_governor.h =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h +++ linux-pm/drivers/cpufreq/cpufreq_governor.h @@ -24,20 +24,6 @@ #include <linux/module.h> #include <linux/mutex.h> -/* - * The polling frequency depends on the capability of the processor. Default - * polling frequency is 1000 times the transition latency of the processor. The - * governor will work on any processor with transition latency <= 10ms, using - * appropriate sampling rate. - * - * For CPUs with transition latency > 10ms (mostly drivers with CPUFREQ_ETERNAL) - * this governor will not work. All times here are in us (micro seconds). 
- */ -#define MIN_SAMPLING_RATE_RATIO (2) -#define LATENCY_MULTIPLIER (1000) -#define MIN_LATENCY_MULTIPLIER (20) -#define TRANSITION_LATENCY_LIMIT (10 * 1000 * 1000) - /* Ondemand Sampling types */ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -426,6 +426,20 @@ static inline unsigned long cpufreq_scal #define CPUFREQ_POLICY_POWERSAVE (1) #define CPUFREQ_POLICY_PERFORMANCE (2) +/* + * The polling frequency depends on the capability of the processor. Default + * polling frequency is 1000 times the transition latency of the processor. The + * ondemand governor will work on any processor with transition latency <= 10ms, + * using appropriate sampling rate. + * + * For CPUs with transition latency > 10ms (mostly drivers with CPUFREQ_ETERNAL) + * the ondemand governor will not work. All times here are in us (microseconds). + */ +#define MIN_SAMPLING_RATE_RATIO (2) +#define LATENCY_MULTIPLIER (1000) +#define MIN_LATENCY_MULTIPLIER (20) +#define TRANSITION_LATENCY_LIMIT (10 * 1000 * 1000) + /* Governor Events */ #define CPUFREQ_GOV_START 1 #define CPUFREQ_GOV_STOP 2 ^ permalink raw reply [flat|nested] 158+ messages in thread
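As a rough illustration of how the macros moved to cpufreq.h above are meant to be used, the sketch below derives a governor's default and minimum sampling periods from a driver's transition latency. The helper names and the exact arithmetic are illustrative, not kernel code (the real ondemand/conservative logic also factors in HZ and sysfs tuner input); note the unit mix: the sampling-rate helpers work in microseconds while TRANSITION_LATENCY_LIMIT is compared against the driver's latency in nanoseconds.

```c
#include <assert.h>

#define MIN_SAMPLING_RATE_RATIO		(2)
#define LATENCY_MULTIPLIER		(1000)
#define MIN_LATENCY_MULTIPLIER		(20)
#define TRANSITION_LATENCY_LIMIT	(10 * 1000 * 1000)

/* Default sampling period: 1000x the transition latency (both in us). */
static unsigned int default_sampling_rate_us(unsigned int latency_us)
{
	return latency_us * LATENCY_MULTIPLIER;
}

/* Lowest sampling period a user may configure: 20x the latency (us). */
static unsigned int min_sampling_rate_us(unsigned int latency_us)
{
	return latency_us * MIN_LATENCY_MULTIPLIER;
}

/* Drivers with latency above 10 ms (in ns, e.g. CPUFREQ_ETERNAL) are
 * rejected by the sampling governors. */
static int latency_ok(unsigned int transition_latency_ns)
{
	return transition_latency_ns <= TRANSITION_LATENCY_LIMIT;
}
```

So a CPU with a 10 us transition latency would get a 10 ms default sampling period and could be tuned down to 200 us, while a driver reporting more than 10 ms of latency is rejected outright.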
* [PATCH v4 6/7] cpufreq: Support for fast frequency switching 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (4 preceding siblings ...) 2016-03-16 14:47 ` [PATCH v4 5/7] cpufreq: Move governor symbols " Rafael J. Wysocki @ 2016-03-16 14:52 ` Rafael J. Wysocki 2016-03-16 15:35 ` Peter Zijlstra 2016-03-16 15:43 ` Peter Zijlstra 2016-03-16 14:59 ` [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki ` (5 subsequent siblings) 11 siblings, 2 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 14:52 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Modify the ACPI cpufreq driver to provide a method for switching CPU frequencies from interrupt context and update the cpufreq core to support that method if available. Introduce a new cpufreq driver callback, ->fast_switch, to be invoked for frequency switching from interrupt context by (future) governors supporting that feature via (new) helper function cpufreq_driver_fast_switch(). Add two new policy flags, fast_switch_possible, to be set by the cpufreq driver if fast frequency switching can be used for the given policy and fast_switch_enabled, to be set by the governor if it is going to use fast frequency switching for the given policy. Also add a helper for setting the latter. Since fast frequency switching is inherently incompatible with cpufreq transition notifiers, make it possible to set the fast_switch_enabled only if there are no transition notifiers already registered and make the registration of new transition notifiers fail if fast_switch_enabled is set for at least one policy. 
Implement the ->fast_switch callback in the ACPI cpufreq driver and make it set fast_switch_possible during policy initialization as appropriate. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Changes from v3: - New fast_switch_enabled field in struct cpufreq_policy to help avoid affecting existing setups by setting the fast_switch_possible flag in the driver. - __cpufreq_get() skips the policy->cur check if fast_switch_enabled is set. Changes from v2: - The driver ->fast_switch callback and cpufreq_driver_fast_switch() don't need the relation argument as they will always do RELATION_L now. - New mechanism to make fast switch and cpufreq notifiers mutually exclusive. - cpufreq_driver_fast_switch() doesn't do anything in addition to invoking the driver callback and returns its return value. --- drivers/cpufreq/acpi-cpufreq.c | 41 +++++++++++++++ drivers/cpufreq/cpufreq.c | 108 +++++++++++++++++++++++++++++++++++++---- include/linux/cpufreq.h | 9 +++ 3 files changed, 149 insertions(+), 9 deletions(-) Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c @@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp return result; } +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq) +{ + struct acpi_cpufreq_data *data = policy->driver_data; + struct acpi_processor_performance *perf; + struct cpufreq_frequency_table *entry; + unsigned int next_perf_state, next_freq, freq; + + /* + * Find the closest frequency above target_freq. + * + * The table is sorted in the reverse order with respect to the + * frequency and all of the entries are valid (see the initialization). 
+ */ + entry = data->freq_table; + do { + entry++; + freq = entry->frequency; + } while (freq >= target_freq && freq != CPUFREQ_TABLE_END); + entry--; + next_freq = entry->frequency; + next_perf_state = entry->driver_data; + + perf = to_perf_data(data); + if (perf->state == next_perf_state) { + if (unlikely(data->resume)) + data->resume = 0; + else + return next_freq; + } + + data->cpu_freq_write(&perf->control_register, + perf->states[next_perf_state].control); + perf->state = next_perf_state; + return next_freq; +} + static unsigned long acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu) { @@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct goto err_unreg; } + policy->fast_switch_possible = !acpi_pstate_strict && + !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY); + data->freq_table = kzalloc(sizeof(*data->freq_table) * (perf->state_count+1), GFP_KERNEL); if (!data->freq_table) { @@ -874,6 +914,7 @@ static struct freq_attr *acpi_cpufreq_at static struct cpufreq_driver acpi_cpufreq_driver = { .verify = cpufreq_generic_frequency_table_verify, .target_index = acpi_cpufreq_target, + .fast_switch = acpi_cpufreq_fast_switch, .bios_limit = acpi_processor_get_bios_limit, .init = acpi_cpufreq_cpu_init, .exit = acpi_cpufreq_cpu_exit, Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -102,6 +102,10 @@ struct cpufreq_policy { */ struct rw_semaphore rwsem; + /* Fast switch flags */ + bool fast_switch_possible; /* Set by the driver. 
*/ + bool fast_switch_enabled; + /* Synchronization for frequency transitions */ bool transition_ongoing; /* Tracks transition status */ spinlock_t transition_lock; @@ -156,6 +160,7 @@ int cpufreq_get_policy(struct cpufreq_po int cpufreq_update_policy(unsigned int cpu); bool have_governor_per_policy(void); struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy); +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy); #else static inline unsigned int cpufreq_get(unsigned int cpu) { @@ -236,6 +241,8 @@ struct cpufreq_driver { unsigned int relation); /* Deprecated */ int (*target_index)(struct cpufreq_policy *policy, unsigned int index); + unsigned int (*fast_switch)(struct cpufreq_policy *policy, + unsigned int target_freq); /* * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION * unset. @@ -464,6 +471,8 @@ struct cpufreq_governor { }; /* Pass a target to the cpufreq driver */ +unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq); int cpufreq_driver_target(struct cpufreq_policy *policy, unsigned int target_freq, unsigned int relation); Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -428,6 +428,39 @@ void cpufreq_freq_transition_end(struct } EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end); +/* + * Fast frequency switching status count. Positive means "enabled", negative + * means "disabled" and 0 means "not decided yet". + */ +static int cpufreq_fast_switch_count; +static DEFINE_MUTEX(cpufreq_fast_switch_lock); + +/** + * cpufreq_enable_fast_switch - Enable fast frequency switching for policy. + * @policy: cpufreq policy to enable fast frequency switching for. + * + * Try to enable fast frequency switching for @policy. 
+ * + * The attempt will fail if there is at least one transition notifier registered + * at this point, as fast frequency switching is quite fundamentally at odds + * with transition notifiers. Thus if successful, it will make registration of + * transition notifiers fail going forward. + * + * Call under policy->rwsem. + */ +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy) +{ + mutex_lock(&cpufreq_fast_switch_lock); + if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) { + cpufreq_fast_switch_count++; + policy->fast_switch_enabled = true; + } else { + pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n", + policy->cpu); + } + mutex_unlock(&cpufreq_fast_switch_lock); +} +EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch); /********************************************************************* * SYSFS INTERFACE * @@ -1083,6 +1116,24 @@ static void cpufreq_policy_free(struct c kfree(policy); } +static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy) +{ + if (policy->fast_switch_enabled) { + mutex_lock(&cpufreq_fast_switch_lock); + + policy->fast_switch_enabled = false; + if (!WARN_ON(cpufreq_fast_switch_count <= 0)) + cpufreq_fast_switch_count--; + + mutex_unlock(&cpufreq_fast_switch_lock); + } + + if (cpufreq_driver->exit) { + cpufreq_driver->exit(policy); + policy->freq_table = NULL; + } +} + static int cpufreq_online(unsigned int cpu) { struct cpufreq_policy *policy; @@ -1236,8 +1287,7 @@ static int cpufreq_online(unsigned int c out_exit_policy: up_write(&policy->rwsem); - if (cpufreq_driver->exit) - cpufreq_driver->exit(policy); + cpufreq_driver_exit_policy(policy); out_free_policy: cpufreq_policy_free(policy, !new_policy); return ret; @@ -1334,10 +1384,7 @@ static void cpufreq_offline(unsigned int * since this is a core component, and is essential for the * subsequent light-weight ->init() to succeed. 
*/ - if (cpufreq_driver->exit) { - cpufreq_driver->exit(policy); - policy->freq_table = NULL; - } + cpufreq_driver_exit_policy(policy); unlock: up_write(&policy->rwsem); @@ -1444,8 +1491,12 @@ static unsigned int __cpufreq_get(struct ret_freq = cpufreq_driver->get(policy->cpu); - /* Updating inactive policies is invalid, so avoid doing that. */ - if (unlikely(policy_is_inactive(policy))) + /* + * Updating inactive policies is invalid, so avoid doing that. Also + * if fast frequency switching is used with the given policy, the check + * against policy->cur is pointless, so skip it in that case too. + */ + if (unlikely(policy_is_inactive(policy)) || policy->fast_switch_enabled) return ret_freq; if (ret_freq && policy->cur && @@ -1457,7 +1508,6 @@ static unsigned int __cpufreq_get(struct schedule_work(&policy->update); } } - return ret_freq; } @@ -1653,8 +1703,18 @@ int cpufreq_register_notifier(struct not switch (list) { case CPUFREQ_TRANSITION_NOTIFIER: + mutex_lock(&cpufreq_fast_switch_lock); + + if (cpufreq_fast_switch_count > 0) { + mutex_unlock(&cpufreq_fast_switch_lock); + return -EPERM; + } ret = srcu_notifier_chain_register( &cpufreq_transition_notifier_list, nb); + if (!ret) + cpufreq_fast_switch_count--; + + mutex_unlock(&cpufreq_fast_switch_lock); break; case CPUFREQ_POLICY_NOTIFIER: ret = blocking_notifier_chain_register( @@ -1687,8 +1747,14 @@ int cpufreq_unregister_notifier(struct n switch (list) { case CPUFREQ_TRANSITION_NOTIFIER: + mutex_lock(&cpufreq_fast_switch_lock); + ret = srcu_notifier_chain_unregister( &cpufreq_transition_notifier_list, nb); + if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0)) + cpufreq_fast_switch_count++; + + mutex_unlock(&cpufreq_fast_switch_lock); break; case CPUFREQ_POLICY_NOTIFIER: ret = blocking_notifier_chain_unregister( @@ -1707,6 +1773,30 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie * GOVERNORS * *********************************************************************/ +/** + * cpufreq_driver_fast_switch - Carry out a 
fast CPU frequency switch. + * @policy: cpufreq policy to switch the frequency for. + * @target_freq: New frequency to set (may be approximate). + * + * Carry out a fast frequency switch from interrupt context. + * + * This function must not be called if policy->fast_switch_enabled is unset. + * + * Governors calling this function must guarantee that it will never be invoked + * twice in parallel for the same policy and that it will never be called in + * parallel with either ->target() or ->target_index() for the same policy. + * + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch() + * callback to indicate an error condition, the hardware configuration must be + * preserved. + */ +unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq) +{ + return cpufreq_driver->fast_switch(policy, target_freq); +} +EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch); + /* Must set freqs->new to intermediate frequency */ static int __target_intermediate(struct cpufreq_policy *policy, struct cpufreq_freqs *freqs, int index) ^ permalink raw reply [flat|nested] 158+ messages in thread
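The counting protocol the patch above uses to make fast switching and transition notifiers mutually exclusive can be modelled in isolation. This is a toy sketch with hypothetical names: cpufreq_fast_switch_lock is elided, and the two functions stand in for cpufreq_enable_fast_switch() and the CPUFREQ_TRANSITION_NOTIFIER branch of cpufreq_register_notifier(). The invariant is that the counter is positive while fast switching is enabled, negative while transition notifiers are registered, and zero while neither side has claimed the system.

```c
#include <assert.h>

/* Toy model of cpufreq_fast_switch_count: > 0 means fast switching is
 * enabled, < 0 means transition notifiers exist, 0 means undecided. */
static int toy_fast_switch_count;

static int toy_enable_fast_switch(void)
{
	if (toy_fast_switch_count < 0)
		return -1;	/* a transition notifier got there first */
	toy_fast_switch_count++;
	return 0;
}

static int toy_register_transition_notifier(void)
{
	if (toy_fast_switch_count > 0)
		return -1;	/* -EPERM in the patch: fast switching in use */
	toy_fast_switch_count--;
	return 0;
}
```

Whichever side claims the counter first wins; the other side keeps failing until every enable (or registration) has been undone and the counter returns to zero.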
* Re: [PATCH v4 6/7] cpufreq: Support for fast frequency switching 2016-03-16 14:52 ` [PATCH v4 6/7] cpufreq: Support for fast frequency switching Rafael J. Wysocki @ 2016-03-16 15:35 ` Peter Zijlstra 2016-03-16 16:58 ` Rafael J. Wysocki 2016-03-16 15:43 ` Peter Zijlstra 1 sibling, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-16 15:35 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 03:52:28PM +0100, Rafael J. Wysocki wrote: > +/** > + * cpufreq_enable_fast_switch - Enable fast frequency switching for policy. > + * @policy: cpufreq policy to enable fast frequency switching for. > + * > + * Try to enable fast frequency switching for @policy. > + * > + * The attempt will fail if there is at least one transition notifier registered > + * at this point, as fast frequency switching is quite fundamentally at odds > + * with transition notifiers. Thus if successful, it will make registration of > + * transition notifiers fail going forward. > + * > + * Call under policy->rwsem. Nobody reads a comment.. > + */ > +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy) > +{ lockdep_assert_held(&policy->rwsem); While everybody complains when there's a big nasty splat in their dmesg ;-) > + mutex_lock(&cpufreq_fast_switch_lock); > + if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) { > + cpufreq_fast_switch_count++; > + policy->fast_switch_enabled = true; > + } else { > + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n", > + policy->cpu); > + } > + mutex_unlock(&cpufreq_fast_switch_lock); > +} ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 6/7] cpufreq: Support for fast frequency switching 2016-03-16 15:35 ` Peter Zijlstra @ 2016-03-16 16:58 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 16:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 4:35 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Mar 16, 2016 at 03:52:28PM +0100, Rafael J. Wysocki wrote: >> +/** >> + * cpufreq_enable_fast_switch - Enable fast frequency switching for policy. >> + * @policy: cpufreq policy to enable fast frequency switching for. >> + * >> + * Try to enable fast frequency switching for @policy. >> + * >> + * The attempt will fail if there is at least one transition notifier registered >> + * at this point, as fast frequency switching is quite fundamentally at odds >> + * with transition notifiers. Thus if successful, it will make registration of >> + * transition notifiers fail going forward. >> + * >> + * Call under policy->rwsem. > > Nobody reads a comment.. > >> + */ >> +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy) >> +{ > > lockdep_assert_held(&policy->rwsem); > > While everybody complains when there's a big nasty splat in their dmesg > ;-) OK ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 6/7] cpufreq: Support for fast frequency switching 2016-03-16 14:52 ` [PATCH v4 6/7] cpufreq: Support for fast frequency switching Rafael J. Wysocki 2016-03-16 15:35 ` Peter Zijlstra @ 2016-03-16 15:43 ` Peter Zijlstra 2016-03-16 16:58 ` Rafael J. Wysocki 1 sibling, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-16 15:43 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 03:52:28PM +0100, Rafael J. Wysocki wrote: > +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy) > +{ > + mutex_lock(&cpufreq_fast_switch_lock); > + if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) { > + cpufreq_fast_switch_count++; > + policy->fast_switch_enabled = true; > + } else { > + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n", > + policy->cpu); This happens because there's transition notifiers, right? Would it make sense to iterate the notifier here and print the notifier function symbol for each? That way we've got a clue as to where to start looking when this happens. > + } > + mutex_unlock(&cpufreq_fast_switch_lock); > +} > @@ -1653,8 +1703,18 @@ int cpufreq_register_notifier(struct not > > switch (list) { > case CPUFREQ_TRANSITION_NOTIFIER: > + mutex_lock(&cpufreq_fast_switch_lock); > + > + if (cpufreq_fast_switch_count > 0) { > + mutex_unlock(&cpufreq_fast_switch_lock); So while theoretically (it has a return code) cpufreq_register_notifier() could fail, it never actually did. Now we do. Do we want to add a WARN here? 
> + return -EPERM; > + } > ret = srcu_notifier_chain_register( > &cpufreq_transition_notifier_list, nb); > + if (!ret) > + cpufreq_fast_switch_count--; > + > + mutex_unlock(&cpufreq_fast_switch_lock); > break; > case CPUFREQ_POLICY_NOTIFIER: > ret = blocking_notifier_chain_register( ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 6/7] cpufreq: Support for fast frequency switching 2016-03-16 15:43 ` Peter Zijlstra @ 2016-03-16 16:58 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 16:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 4:43 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Mar 16, 2016 at 03:52:28PM +0100, Rafael J. Wysocki wrote: >> +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy) >> +{ >> + mutex_lock(&cpufreq_fast_switch_lock); >> + if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) { >> + cpufreq_fast_switch_count++; >> + policy->fast_switch_enabled = true; >> + } else { >> + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n", >> + policy->cpu); > > This happens because there's transition notifiers, right? Would it make > sense to iterate the notifier here and print the notifier function > symbol for each? That way we've got a clue as to where to start looking > when this happens. OK >> + } >> + mutex_unlock(&cpufreq_fast_switch_lock); >> +} > >> @@ -1653,8 +1703,18 @@ int cpufreq_register_notifier(struct not >> >> switch (list) { >> case CPUFREQ_TRANSITION_NOTIFIER: >> + mutex_lock(&cpufreq_fast_switch_lock); >> + >> + if (cpufreq_fast_switch_count > 0) { >> + mutex_unlock(&cpufreq_fast_switch_lock); > > So while theoretically (it has a return code) > cpufreq_register_notifier() could fail, it never actually did. Now we > do. Do we want to add a WARN here? Like if (WARN_ON(cpufreq_fast_switch_count > 0)) { That can be done. :-) ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (5 preceding siblings ...) 2016-03-16 14:52 ` [PATCH v4 6/7] cpufreq: Support for fast frequency switching Rafael J. Wysocki @ 2016-03-16 14:59 ` Rafael J. Wysocki 2016-03-16 17:35 ` Peter Zijlstra ` (4 more replies) 2016-03-16 15:27 ` [PATCH v4 0/7] cpufreq: schedutil governor Peter Zijlstra ` (4 subsequent siblings) 11 siblings, 5 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 14:59 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Add a new cpufreq scaling governor, called "schedutil", that uses scheduler-provided CPU utilization information as input for making its decisions. Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add mechanism for registering utilization update callbacks) that introduced cpufreq_update_util() called by the scheduler on utilization changes (from CFS) and RT/DL task status updates. In particular, CPU frequency scaling decisions may be based on the utilization data passed to cpufreq_update_util() by CFS. The new governor is relatively simple. The frequency selection formula used by it depends on whether or not the utilization is frequency-invariant. In the frequency-invariant case the new CPU frequency is given by next_freq = 1.25 * max_freq * util / max where util and max are the last two arguments of cpufreq_update_util(). In turn, if util is not frequency-invariant, the maximum frequency in the above formula is replaced with the current frequency of the CPU: next_freq = 1.25 * curr_freq * util / max The coefficient 1.25 corresponds to the frequency tipping point at (util / max) = 0.8. 
All of the computations are carried out in the utilization update handlers provided by the new governor. One of those handlers is used for cpufreq policies shared between multiple CPUs and the other one is for policies with one CPU only (and therefore it doesn't need to use any extra synchronization means). The governor supports fast frequency switching if that is supported by the cpufreq driver in use and possible for the given policy. In the fast switching case, all operations of the governor take place in its utilization update handlers. If fast switching cannot be used, the frequency switch operations are carried out with the help of a work item which only calls __cpufreq_driver_target() (under a mutex) to trigger a frequency update (to a value already computed beforehand in one of the utilization update handlers). Currently, the governor treats all of the RT and DL tasks as "unknown utilization" and sets the frequency to the allowed maximum when updated from the RT or DL sched classes. That heavy-handed approach should be replaced with something more subtle and specifically targeted at RT and DL tasks. The governor shares some tunables management code with the "ondemand" and "conservative" governors and uses some common definitions from cpufreq_governor.h, but apart from that it is stand-alone. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- The reason I decided to use policy->cur as the current frequency representation is that it happens to hold the value in question in both the "fast switch" and the "work item" cases quite naturally. The (rather theoretical) concern about it is that policy->cur may be updated by the core asynchronously if it thinks that it got out of sync with the "real" setting (as reported by the driver's ->get routine). I don't think it will turn out to be a real problem in practice, though. 
Changes from v3: - The "next frequency" formula based on http://marc.info/?l=linux-acpi&m=145756618321500&w=4 and http://marc.info/?l=linux-kernel&m=145760739700716&w=4 - The governor goes into kernel/sched/ (again). Changes from v2: - The governor goes into drivers/cpufreq/. - The "next frequency" formula has an additional 1.1 factor to allow more util/max values to map onto the top-most frequency in case the distance between that and the previous one is disproportionately small. - sugov_update_commit() traces CPU frequency even if the new one is the same as the previous one (otherwise, if the system is 100% loaded for long enough, powertop starts to report that all CPUs are 100% idle). --- drivers/cpufreq/Kconfig | 26 + kernel/sched/Makefile | 1 kernel/sched/cpufreq_schedutil.c | 531 +++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 8 4 files changed, 566 insertions(+) Index: linux-pm/drivers/cpufreq/Kconfig =================================================================== --- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor. + +config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL + bool "schedutil" + select CPU_FREQ_GOV_SCHEDUTIL + select CPU_FREQ_GOV_PERFORMANCE + help + Use the 'schedutil' CPUFreq governor by default. If unsure, + have a look at the help section of that governor. The fallback + governor will be 'performance'. + endchoice config CPU_FREQ_GOV_PERFORMANCE @@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE If in doubt, say N. 
+config CPU_FREQ_GOV_SCHEDUTIL + tristate "'schedutil' cpufreq policy governor" + depends on CPU_FREQ + select CPU_FREQ_GOV_ATTR_SET + select IRQ_WORK + help + The frequency selection formula used by this governor is analogous + to the one used by 'ondemand', but instead of computing CPU load + as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU + utilization data provided by the scheduler as input. + + To compile this driver as a module, choose M here: the + module will be called cpufreq_schedutil. + + If in doubt, say N. + comment "CPU frequency scaling drivers" config CPUFREQ_DT Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- /dev/null +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -0,0 +1,531 @@ +/* + * CPUFreq governor based on scheduler-provided CPU utilization data. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/cpufreq.h> +#include <linux/module.h> +#include <linux/slab.h> +#include <trace/events/power.h> + +#include "sched.h" + +struct sugov_tunables { + struct gov_attr_set attr_set; + unsigned int rate_limit_us; +}; + +struct sugov_policy { + struct cpufreq_policy *policy; + + struct sugov_tunables *tunables; + struct list_head tunables_hook; + + raw_spinlock_t update_lock; /* For shared policies */ + u64 last_freq_update_time; + s64 freq_update_delay_ns; + unsigned int next_freq; + + /* The next fields are only needed if fast switch cannot be used. 
*/ + struct irq_work irq_work; + struct work_struct work; + struct mutex work_lock; + bool work_in_progress; + + bool need_freq_update; +}; + +struct sugov_cpu { + struct update_util_data update_util; + struct sugov_policy *sg_policy; + + /* The fields below are only needed when sharing a policy. */ + unsigned long util; + unsigned long max; + u64 last_update; +}; + +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); + +/************************ Governor internals ***********************/ + +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) +{ + u64 delta_ns; + + if (sg_policy->work_in_progress) + return false; + + if (unlikely(sg_policy->need_freq_update)) { + sg_policy->need_freq_update = false; + return true; + } + + delta_ns = time - sg_policy->last_freq_update_time; + return (s64)delta_ns >= sg_policy->freq_update_delay_ns; +} + +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, + unsigned int next_freq) +{ + struct cpufreq_policy *policy = sg_policy->policy; + + if (next_freq > policy->max) + next_freq = policy->max; + else if (next_freq < policy->min) + next_freq = policy->min; + + sg_policy->last_freq_update_time = time; + if (sg_policy->next_freq == next_freq) { + if (policy->fast_switch_enabled) + trace_cpu_frequency(policy->cur, smp_processor_id()); + + return; + } + + sg_policy->next_freq = next_freq; + if (policy->fast_switch_enabled) { + unsigned int freq; + + freq = cpufreq_driver_fast_switch(policy, next_freq); + if (freq == CPUFREQ_ENTRY_INVALID) + return; + + policy->cur = freq; + trace_cpu_frequency(freq, smp_processor_id()); + } else { + sg_policy->work_in_progress = true; + irq_work_queue(&sg_policy->irq_work); + } +} + +/** + * get_next_freq - Compute a new frequency for a given cpufreq policy. + * @policy: cpufreq policy object to compute the new frequency for. + * @util: Current CPU utilization. + * @max: CPU capacity. 
+ * + * If the utilization is frequency-invariant, choose the new frequency to be + * proportional to it, that is + * + * next_freq = C * max_freq * util / max + * + * Otherwise, approximate the would-be frequency-invariant utilization by + * util_raw * (curr_freq / max_freq) which leads to + * + * next_freq = C * curr_freq * util_raw / max + * + * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8. + */ +static unsigned int get_next_freq(struct cpufreq_policy *policy, + unsigned long util, unsigned long max) +{ + unsigned int freq = arch_scale_freq_invariant() ? + policy->cpuinfo.max_freq : policy->cur; + + return (freq + (freq >> 2)) * util / max; +} + +static void sugov_update_single(struct update_util_data *hook, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int next_f; + + if (!sugov_should_update_freq(sg_policy, time)) + return; + + next_f = util <= max ? 
+ get_next_freq(policy, util, max) : policy->cpuinfo.max_freq; + sugov_update_commit(sg_policy, time, next_f); +} + +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy, + unsigned long util, unsigned long max) +{ + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int max_f = policy->cpuinfo.max_freq; + u64 last_freq_update_time = sg_policy->last_freq_update_time; + unsigned int j; + + if (util > max) + return max_f; + + for_each_cpu(j, policy->cpus) { + struct sugov_cpu *j_sg_cpu; + unsigned long j_util, j_max; + u64 delta_ns; + + if (j == smp_processor_id()) + continue; + + j_sg_cpu = &per_cpu(sugov_cpu, j); + /* + * If the CPU utilization was last updated before the previous + * frequency update and the time elapsed between the last update + * of the CPU utilization and the last frequency update is long + * enough, don't take the CPU into account as it probably is + * idle now. + */ + delta_ns = last_freq_update_time - j_sg_cpu->last_update; + if ((s64)delta_ns > NSEC_PER_SEC / HZ) + continue; + + j_util = j_sg_cpu->util; + j_max = j_sg_cpu->max; + if (j_util > j_max) + return max_f; + + if (j_util * max > j_max * util) { + util = j_util; + max = j_max; + } + } + + return get_next_freq(policy, util, max); +} + +static void sugov_update_shared(struct update_util_data *hook, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + unsigned int next_f; + + raw_spin_lock(&sg_policy->update_lock); + + sg_cpu->util = util; + sg_cpu->max = max; + sg_cpu->last_update = time; + + if (sugov_should_update_freq(sg_policy, time)) { + next_f = sugov_next_freq_shared(sg_policy, util, max); + sugov_update_commit(sg_policy, time, next_f); + } + + raw_spin_unlock(&sg_policy->update_lock); +} + +static void sugov_work(struct work_struct *work) +{ + struct sugov_policy *sg_policy = container_of(work, struct 
sugov_policy, work); + + mutex_lock(&sg_policy->work_lock); + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, + CPUFREQ_RELATION_L); + mutex_unlock(&sg_policy->work_lock); + + sg_policy->work_in_progress = false; +} + +static void sugov_irq_work(struct irq_work *irq_work) +{ + struct sugov_policy *sg_policy; + + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); + schedule_work(&sg_policy->work); +} + +/************************** sysfs interface ************************/ + +static struct sugov_tunables *global_tunables; +static DEFINE_MUTEX(global_tunables_lock); + +static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set) +{ + return container_of(attr_set, struct sugov_tunables, attr_set); +} + +static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + + return sprintf(buf, "%u\n", tunables->rate_limit_us); +} + +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, + size_t count) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + struct sugov_policy *sg_policy; + unsigned int rate_limit_us; + int ret; + + ret = sscanf(buf, "%u", &rate_limit_us); + if (ret != 1) + return -EINVAL; + + tunables->rate_limit_us = rate_limit_us; + + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) + sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC; + + return count; +} + +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us); + +static struct attribute *sugov_attributes[] = { + &rate_limit_us.attr, + NULL +}; + +static struct kobj_type sugov_tunables_ktype = { + .default_attrs = sugov_attributes, + .sysfs_ops = &governor_sysfs_ops, +}; + +/********************** cpufreq governor interface *********************/ + +static struct cpufreq_governor schedutil_gov; + +static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy) +{ 
+ struct sugov_policy *sg_policy; + + sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL); + if (!sg_policy) + return NULL; + + sg_policy->policy = policy; + init_irq_work(&sg_policy->irq_work, sugov_irq_work); + INIT_WORK(&sg_policy->work, sugov_work); + mutex_init(&sg_policy->work_lock); + raw_spin_lock_init(&sg_policy->update_lock); + return sg_policy; +} + +static void sugov_policy_free(struct sugov_policy *sg_policy) +{ + mutex_destroy(&sg_policy->work_lock); + kfree(sg_policy); +} + +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy) +{ + struct sugov_tunables *tunables; + + tunables = kzalloc(sizeof(*tunables), GFP_KERNEL); + if (tunables) + gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook); + + return tunables; +} + +static void sugov_tunables_free(struct sugov_tunables *tunables) +{ + if (!have_governor_per_policy()) + global_tunables = NULL; + + kfree(tunables); +} + +static int sugov_init(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + struct sugov_tunables *tunables; + unsigned int lat; + int ret = 0; + + /* State should be equivalent to EXIT */ + if (policy->governor_data) + return -EBUSY; + + sg_policy = sugov_policy_alloc(policy); + if (!sg_policy) + return -ENOMEM; + + mutex_lock(&global_tunables_lock); + + if (global_tunables) { + if (WARN_ON(have_governor_per_policy())) { + ret = -EINVAL; + goto free_sg_policy; + } + policy->governor_data = sg_policy; + sg_policy->tunables = global_tunables; + + gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook); + goto out; + } + + tunables = sugov_tunables_alloc(sg_policy); + if (!tunables) { + ret = -ENOMEM; + goto free_sg_policy; + } + + tunables->rate_limit_us = LATENCY_MULTIPLIER; + lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC; + if (lat) + tunables->rate_limit_us *= lat; + + if (!have_governor_per_policy()) + global_tunables = tunables; + + policy->governor_data = sg_policy; + sg_policy->tunables = 
tunables; + + ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype, + get_governor_parent_kobj(policy), "%s", + schedutil_gov.name); + if (!ret) + goto out; + + /* Failure, so roll back. */ + policy->governor_data = NULL; + sugov_tunables_free(tunables); + + free_sg_policy: + pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret); + sugov_policy_free(sg_policy); + + out: + mutex_unlock(&global_tunables_lock); + return ret; +} + +static int sugov_exit(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + struct sugov_tunables *tunables = sg_policy->tunables; + unsigned int count; + + mutex_lock(&global_tunables_lock); + + count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook); + policy->governor_data = NULL; + if (!count) + sugov_tunables_free(tunables); + + mutex_unlock(&global_tunables_lock); + + sugov_policy_free(sg_policy); + return 0; +} + +static int sugov_start(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + cpufreq_enable_fast_switch(policy); + + sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC; + sg_policy->last_freq_update_time = 0; + sg_policy->next_freq = UINT_MAX; + sg_policy->work_in_progress = false; + sg_policy->need_freq_update = false; + + for_each_cpu(cpu, policy->cpus) { + struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu); + + sg_cpu->sg_policy = sg_policy; + if (policy_is_shared(policy)) { + sg_cpu->util = ULONG_MAX; + sg_cpu->max = 0; + sg_cpu->last_update = 0; + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, + sugov_update_shared); + } else { + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, + sugov_update_single); + } + } + return 0; +} + +static int sugov_stop(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + for_each_cpu(cpu, policy->cpus) + 
cpufreq_remove_update_util_hook(cpu); + + synchronize_sched(); + + irq_work_sync(&sg_policy->irq_work); + cancel_work_sync(&sg_policy->work); + return 0; +} + +static int sugov_limits(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + + if (!policy->fast_switch_enabled) { + mutex_lock(&sg_policy->work_lock); + + if (policy->max < policy->cur) + __cpufreq_driver_target(policy, policy->max, + CPUFREQ_RELATION_H); + else if (policy->min > policy->cur) + __cpufreq_driver_target(policy, policy->min, + CPUFREQ_RELATION_L); + + mutex_unlock(&sg_policy->work_lock); + } + + sg_policy->need_freq_update = true; + return 0; +} + +int sugov_governor(struct cpufreq_policy *policy, unsigned int event) +{ + if (event == CPUFREQ_GOV_POLICY_INIT) { + return sugov_init(policy); + } else if (policy->governor_data) { + switch (event) { + case CPUFREQ_GOV_POLICY_EXIT: + return sugov_exit(policy); + case CPUFREQ_GOV_START: + return sugov_start(policy); + case CPUFREQ_GOV_STOP: + return sugov_stop(policy); + case CPUFREQ_GOV_LIMITS: + return sugov_limits(policy); + } + } + return -EINVAL; +} + +static struct cpufreq_governor schedutil_gov = { + .name = "schedutil", + .governor = sugov_governor, + .owner = THIS_MODULE, +}; + +static int __init sugov_module_init(void) +{ + return cpufreq_register_governor(&schedutil_gov); +} + +static void __exit sugov_module_exit(void) +{ + cpufreq_unregister_governor(&schedutil_gov); +} + +MODULE_AUTHOR("Rafael J. 
Wysocki <rafael.j.wysocki@intel.com>"); +MODULE_DESCRIPTION("Utilization-based CPU frequency selection"); +MODULE_LICENSE("GPL"); + +#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL +struct cpufreq_governor *cpufreq_default_governor(void) +{ + return &schedutil_gov; +} + +fs_initcall(sugov_module_init); +#else +module_init(sugov_module_init); +#endif +module_exit(sugov_module_exit); Index: linux-pm/kernel/sched/Makefile =================================================================== --- linux-pm.orig/kernel/sched/Makefile +++ linux-pm/kernel/sched/Makefile @@ -20,3 +20,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o obj-$(CONFIG_CPU_FREQ) += cpufreq.o +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o Index: linux-pm/kernel/sched/sched.h =================================================================== --- linux-pm.orig/kernel/sched/sched.h +++ linux-pm/kernel/sched/sched.h @@ -1786,3 +1786,11 @@ static inline void cpufreq_trigger_updat static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {} static inline void cpufreq_trigger_update(u64 time) {} #endif /* CONFIG_CPU_FREQ */ + +#ifdef arch_scale_freq_capacity +#ifndef arch_scale_freq_invariant +#define arch_scale_freq_invariant() (true) +#endif +#else /* arch_scale_freq_capacity */ +#define arch_scale_freq_invariant() (false) +#endif ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 14:59 ` [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki @ 2016-03-16 17:35 ` Peter Zijlstra 2016-03-16 21:42 ` Rafael J. Wysocki 2016-03-16 17:36 ` Peter Zijlstra ` (3 subsequent siblings) 4 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-16 17:35 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > +static unsigned int get_next_freq(struct cpufreq_policy *policy, > + unsigned long util, unsigned long max) > +{ > + unsigned int freq = arch_scale_freq_invariant() ? > + policy->cpuinfo.max_freq : policy->cur; > + > + return (freq + (freq >> 2)) * util / max; > +} > + > +static void sugov_update_single(struct update_util_data *hook, u64 time, > + unsigned long util, unsigned long max) > +{ > + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); > + struct sugov_policy *sg_policy = sg_cpu->sg_policy; > + struct cpufreq_policy *policy = sg_policy->policy; > + unsigned int next_f; > + > + if (!sugov_should_update_freq(sg_policy, time)) > + return; > + > + next_f = util <= max ? > + get_next_freq(policy, util, max) : policy->cpuinfo.max_freq; I'm not sure that is correct, would not something like this be more accurate? if (util > max) util = max; next_f = get_next_freq(policy, util, max); After all, if we clip util we will still only increment to the next freq with our multiplication factor. Hmm, or was this meant to deal with the DL/RT stuff? 
Would then not something like: /* ULONG_MAX is used to force max_freq for Real-Time policies */ if (util == ULONG_MAX) { next_f = policy->cpuinfo.max_freq; } else { if (util > max) util = max; next_f = get_next_freq(policy, util, max); } Be clearer? > + sugov_update_commit(sg_policy, time, next_f); > +} ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 17:35 ` Peter Zijlstra @ 2016-03-16 21:42 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 21:42 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wednesday, March 16, 2016 06:35:41 PM Peter Zijlstra wrote: > On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > > > +static unsigned int get_next_freq(struct cpufreq_policy *policy, > > + unsigned long util, unsigned long max) > > +{ > > + unsigned int freq = arch_scale_freq_invariant() ? > > + policy->cpuinfo.max_freq : policy->cur; > > + > > + return (freq + (freq >> 2)) * util / max; > > +} > > + > > +static void sugov_update_single(struct update_util_data *hook, u64 time, > > + unsigned long util, unsigned long max) > > +{ > > + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); > > + struct sugov_policy *sg_policy = sg_cpu->sg_policy; > > + struct cpufreq_policy *policy = sg_policy->policy; > > + unsigned int next_f; > > + > > + if (!sugov_should_update_freq(sg_policy, time)) > > + return; > > + > > + next_f = util <= max ? > > + get_next_freq(policy, util, max) : policy->cpuinfo.max_freq; > > I'm not sure that is correct, would not something like this be more > accurate? > > if (util > max) > util = max; > next_f = get_next_freq(policy, util, max); > > After all, if we clip util we will still only increment to the next freq > with our multiplication factor. > > Hmm, or was this meant to deal with the DL/RT stuff? Yes, it was. 
> Would then not something like: > > /* ULONG_MAX is used to force max_freq for Real-Time policies */ > if (util == ULONG_MAX) { > next_f = policy->cpuinfo.max_freq; > } else { > if (util > max) That cannot happen given the way CFS deals with max before passing it to cpufreq_update_util(). > util = max; > next_f = get_next_freq(policy, util, max); > } > > Be clearer? > > > + sugov_update_commit(sg_policy, time, next_f); > > +} So essentially I can replace the util > max check with the util == ULONG_MAX one (here and in some other places) if that helps to understand the code, but functionally that won't change anything. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 14:59 ` [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki 2016-03-16 17:35 ` Peter Zijlstra @ 2016-03-16 17:36 ` Peter Zijlstra 2016-03-16 21:34 ` Rafael J. Wysocki 2016-03-16 17:52 ` Peter Zijlstra ` (2 subsequent siblings) 4 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-16 17:36 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > + if ((s64)delta_ns > NSEC_PER_SEC / HZ) That's TICK_NSEC ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 17:36 ` Peter Zijlstra @ 2016-03-16 21:34 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 21:34 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wednesday, March 16, 2016 06:36:46 PM Peter Zijlstra wrote: > On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > > + if ((s64)delta_ns > NSEC_PER_SEC / HZ) > > That's TICK_NSEC OK (I didn't know we had a separate symbol for that) ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 14:59 ` [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki 2016-03-16 17:35 ` Peter Zijlstra 2016-03-16 17:36 ` Peter Zijlstra @ 2016-03-16 17:52 ` Peter Zijlstra 2016-03-16 21:38 ` Rafael J. Wysocki 2016-03-16 17:53 ` Peter Zijlstra 2016-03-16 18:14 ` Peter Zijlstra 4 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-16 17:52 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > +static void sugov_work(struct work_struct *work) > +{ > + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work); > + > + mutex_lock(&sg_policy->work_lock); > + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, > + CPUFREQ_RELATION_L); > + mutex_unlock(&sg_policy->work_lock); > + Be aware that the below store can creep up and become visible before the unlock. AFAICT that doesn't really matter, but still. > + sg_policy->work_in_progress = false; > +} > + > +static void sugov_irq_work(struct irq_work *irq_work) > +{ > + struct sugov_policy *sg_policy; > + > + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); > + schedule_work(&sg_policy->work); > +} If you care what cpu the work runs on, you should schedule_work_on(), regular schedule_work() can end up on any random cpu (although typically it does not). In particular schedule_work() -> queue_work() -> queue_work_on(.cpu = WORK_CPU_UNBOUND) -> __queue_work() if (req_cpu == UNBOUND) cpu = wq_select_unbound_cpu(), which has a Round-Robin 'feature' to detect just such dependencies. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 17:52 ` Peter Zijlstra @ 2016-03-16 21:38 ` Rafael J. Wysocki 2016-03-16 22:39 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 21:38 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wednesday, March 16, 2016 06:52:11 PM Peter Zijlstra wrote: > On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > > +static void sugov_work(struct work_struct *work) > > +{ > > + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work); > > + > > + mutex_lock(&sg_policy->work_lock); > > + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, > > + CPUFREQ_RELATION_L); > > + mutex_unlock(&sg_policy->work_lock); > > + > > Be aware that the below store can creep up and become visible before the > unlock. AFAICT that doesn't really matter, but still. It doesn't matter. :-) Had it mattered, I would have used memory barriers. > > + sg_policy->work_in_progress = false; > > +} > > + > > +static void sugov_irq_work(struct irq_work *irq_work) > > +{ > > + struct sugov_policy *sg_policy; > > + > > + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); > > + schedule_work(&sg_policy->work); > > +} > > If you care what cpu the work runs on, you should schedule_work_on(), > regular schedule_work() can end up on any random cpu (although typically > it does not). I know, but I don't care too much. "ondemand" and "conservative" use schedule_work() for the same thing, so drivers need to cope with that if they need things to run on a particular CPU. That said I guess things would be a bit more efficient if the work was scheduled on the same CPU that had queued up the irq_work. 
It also wouldn't be too difficult to implement, so I'll make that change. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 21:38 ` Rafael J. Wysocki @ 2016-03-16 22:39 ` Peter Zijlstra 0 siblings, 0 replies; 158+ messages in thread From: Peter Zijlstra @ 2016-03-16 22:39 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 10:38:14PM +0100, Rafael J. Wysocki wrote: > > If you care what cpu the work runs on, you should schedule_work_on(), > > regular schedule_work() can end up on any random cpu (although typically > > it does not). > > I know, but I don't care too much. > > "ondemand" and "conservative" use schedule_work() for the same thing, so > drivers need to cope with that if they need things to run on a particular > CPU. Or are just plain buggy -- like a lot of code that uses schedule_work() for per-cpu thingies; that is, its a fairly common bug and only recently did we add that RR thing. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 14:59 ` [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki ` (2 preceding siblings ...) 2016-03-16 17:52 ` Peter Zijlstra @ 2016-03-16 17:53 ` Peter Zijlstra 2016-03-16 21:48 ` Rafael J. Wysocki 2016-03-16 18:14 ` Peter Zijlstra 4 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-16 17:53 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, > + unsigned int next_freq) > +{ > + struct cpufreq_policy *policy = sg_policy->policy; > + > + if (next_freq > policy->max) > + next_freq = policy->max; > + else if (next_freq < policy->min) > + next_freq = policy->min; I'm still very much undecided on these policy min/max thresholds. I don't particularly like them. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 17:53 ` Peter Zijlstra @ 2016-03-16 21:48 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 21:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wednesday, March 16, 2016 06:53:41 PM Peter Zijlstra wrote: > On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, > > + unsigned int next_freq) > > +{ > > + struct cpufreq_policy *policy = sg_policy->policy; > > + > > + if (next_freq > policy->max) > > + next_freq = policy->max; > > + else if (next_freq < policy->min) > > + next_freq = policy->min; > > I'm still very much undecided on these policy min/max thresholds. I > don't particularly like them. These are for consistency mostly. It actually occurs to me that __cpufreq_driver_target() does that already anyway, so they can be moved into the "fast switch" branch. Which means that the code needs to be rearranged a bit here. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 14:59 ` [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki ` (3 preceding siblings ...) 2016-03-16 17:53 ` Peter Zijlstra @ 2016-03-16 18:14 ` Peter Zijlstra 2016-03-16 21:38 ` Rafael J. Wysocki 4 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-16 18:14 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, > + unsigned int next_freq) > +{ > + struct cpufreq_policy *policy = sg_policy->policy; > + > + if (next_freq > policy->max) > + next_freq = policy->max; > + else if (next_freq < policy->min) > + next_freq = policy->min; > + > + sg_policy->last_freq_update_time = time; > + if (sg_policy->next_freq == next_freq) { > + if (policy->fast_switch_enabled) > + trace_cpu_frequency(policy->cur, smp_processor_id()); > + > + return; > + } > + > + sg_policy->next_freq = next_freq; > + if (policy->fast_switch_enabled) { > + unsigned int freq; > + > + freq = cpufreq_driver_fast_switch(policy, next_freq); So you're assuming a RELATION_L for ->fast_switch() ? 
> + if (freq == CPUFREQ_ENTRY_INVALID) > + return; > + > + policy->cur = freq; > + trace_cpu_frequency(freq, smp_processor_id()); > + } else { > + sg_policy->work_in_progress = true; > + irq_work_queue(&sg_policy->irq_work); > + } > +} > +static void sugov_work(struct work_struct *work) > +{ > + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work); > + > + mutex_lock(&sg_policy->work_lock); > + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, > + CPUFREQ_RELATION_L); As per here, which I assume matches semantics on that point. > + mutex_unlock(&sg_policy->work_lock); > + > + sg_policy->work_in_progress = false; > +} ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 18:14 ` Peter Zijlstra @ 2016-03-16 21:38 ` Rafael J. Wysocki 2016-03-16 22:40 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 21:38 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wednesday, March 16, 2016 07:14:20 PM Peter Zijlstra wrote: > On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, > > + unsigned int next_freq) > > +{ > > + struct cpufreq_policy *policy = sg_policy->policy; > > + > > + if (next_freq > policy->max) > > + next_freq = policy->max; > > + else if (next_freq < policy->min) > > + next_freq = policy->min; > > + > > + sg_policy->last_freq_update_time = time; > > + if (sg_policy->next_freq == next_freq) { > > + if (policy->fast_switch_enabled) > > + trace_cpu_frequency(policy->cur, smp_processor_id()); > > + > > + return; > > + } > > + > > + sg_policy->next_freq = next_freq; > > + if (policy->fast_switch_enabled) { > > + unsigned int freq; > > + > > + freq = cpufreq_driver_fast_switch(policy, next_freq); > > So you're assuming a RELATION_L for ->fast_switch() ? Yes, I am. 
> > + if (freq == CPUFREQ_ENTRY_INVALID) > > + return; > > + > > + policy->cur = freq; > > + trace_cpu_frequency(freq, smp_processor_id()); > > + } else { > > + sg_policy->work_in_progress = true; > > + irq_work_queue(&sg_policy->irq_work); > > + } > > +} > > > > +static void sugov_work(struct work_struct *work) > > +{ > > + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work); > > + > > + mutex_lock(&sg_policy->work_lock); > > + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, > > + CPUFREQ_RELATION_L); > > As per here, which I assume matches semantics on that point. Correct. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 21:38 ` Rafael J. Wysocki @ 2016-03-16 22:40 ` Peter Zijlstra 2016-03-16 22:53 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-16 22:40 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 10:38:55PM +0100, Rafael J. Wysocki wrote: > On Wednesday, March 16, 2016 07:14:20 PM Peter Zijlstra wrote: > > On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > > > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, > > > + unsigned int next_freq) > > > +{ > > > + struct cpufreq_policy *policy = sg_policy->policy; > > > + > > > + if (next_freq > policy->max) > > > + next_freq = policy->max; > > > + else if (next_freq < policy->min) > > > + next_freq = policy->min; > > > + > > > + sg_policy->last_freq_update_time = time; > > > + if (sg_policy->next_freq == next_freq) { > > > + if (policy->fast_switch_enabled) > > > + trace_cpu_frequency(policy->cur, smp_processor_id()); > > > + > > > + return; > > > + } > > > + > > > + sg_policy->next_freq = next_freq; > > > + if (policy->fast_switch_enabled) { > > > + unsigned int freq; > > > + > > > + freq = cpufreq_driver_fast_switch(policy, next_freq); > > > > So you're assuming a RELATION_L for ->fast_switch() ? > > Yes, I am. Should we document that fact somewhere? Or alternatively, if you already did, I simply missed it. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 22:40 ` Peter Zijlstra @ 2016-03-16 22:53 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 22:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wednesday, March 16, 2016 11:40:54 PM Peter Zijlstra wrote: > On Wed, Mar 16, 2016 at 10:38:55PM +0100, Rafael J. Wysocki wrote: > > On Wednesday, March 16, 2016 07:14:20 PM Peter Zijlstra wrote: > > > On Wed, Mar 16, 2016 at 03:59:18PM +0100, Rafael J. Wysocki wrote: > > > > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, > > > > + unsigned int next_freq) > > > > +{ > > > > + struct cpufreq_policy *policy = sg_policy->policy; > > > > + > > > > + if (next_freq > policy->max) > > > > + next_freq = policy->max; > > > > + else if (next_freq < policy->min) > > > > + next_freq = policy->min; > > > > + > > > > + sg_policy->last_freq_update_time = time; > > > > + if (sg_policy->next_freq == next_freq) { > > > > + if (policy->fast_switch_enabled) > > > > + trace_cpu_frequency(policy->cur, smp_processor_id()); > > > > + > > > > + return; > > > > + } > > > > + > > > > + sg_policy->next_freq = next_freq; > > > > + if (policy->fast_switch_enabled) { > > > > + unsigned int freq; > > > > + > > > > + freq = cpufreq_driver_fast_switch(policy, next_freq); > > > > > > So you're assuming a RELATION_L for ->fast_switch() ? > > > > Yes, I am. > > Should we document that fact somewhere? Or alternatively, if you already > did, I simply missed it. I thought I did, but clearly that's not the case (I think I wrote about that in a changelog comments somewhere). I'll document it in the kerneldoc for cpufreq_driver_fast_switch() (patch [6/7]). 
^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 0/7] cpufreq: schedutil governor 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (6 preceding siblings ...) 2016-03-16 14:59 ` [PATCH v4 7/7] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki @ 2016-03-16 15:27 ` Peter Zijlstra 2016-03-16 16:20 ` Rafael J. Wysocki 2016-03-16 23:51 ` [PATCH v5 6/7][Update] cpufreq: Support for fast frequency switching Rafael J. Wysocki ` (3 subsequent siblings) 11 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-16 15:27 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar Could you please start a new thread for each posting? I only accidentally saw this. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v4 0/7] cpufreq: schedutil governor 2016-03-16 15:27 ` [PATCH v4 0/7] cpufreq: schedutil governor Peter Zijlstra @ 2016-03-16 16:20 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 16:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Wed, Mar 16, 2016 at 4:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > Could you please start a new thread for each posting? I only > accidentally saw this. I will in the future. ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v5 6/7][Update] cpufreq: Support for fast frequency switching 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (7 preceding siblings ...) 2016-03-16 15:27 ` [PATCH v4 0/7] cpufreq: schedutil governor Peter Zijlstra @ 2016-03-16 23:51 ` Rafael J. Wysocki 2016-03-17 11:35 ` Juri Lelli 2016-03-17 0:01 ` [PATCH v5 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki ` (2 subsequent siblings) 11 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-16 23:51 UTC (permalink / raw) To: Linux PM list Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Subject: [PATCH] cpufreq: Support for fast frequency switching Modify the ACPI cpufreq driver to provide a method for switching CPU frequencies from interrupt context and update the cpufreq core to support that method if available. Introduce a new cpufreq driver callback, ->fast_switch, to be invoked for frequency switching from interrupt context by (future) governors supporting that feature via (new) helper function cpufreq_driver_fast_switch(). Add two new policy flags, fast_switch_possible, to be set by the cpufreq driver if fast frequency switching can be used for the given policy and fast_switch_enabled, to be set by the governor if it is going to use fast frequency switching for the given policy. Also add a helper for setting the latter. Since fast frequency switching is inherently incompatible with cpufreq transition notifiers, make it possible to set the fast_switch_enabled only if there are no transition notifiers already registered and make the registration of new transition notifiers fail if fast_switch_enabled is set for at least one policy. 
Implement the ->fast_switch callback in the ACPI cpufreq driver and make it set fast_switch_possible during policy initialization as appropriate. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Addressing comments from Peter. Changes from v4: - If cpufreq_enable_fast_switch() is about to fail, it will print the list of currently registered transition notifiers. - Added lock_assert_held(&policy->rwsem) to cpufreq_enable_fast_switch(). - Added WARN_ON() to the (cpufreq_fast_switch_count > 0) check in cpufreq_register_notifier(). - Modified the kerneldoc comment of cpufreq_driver_fast_switch() to mention the RELATION_L expectation regarding the ->fast_switch callback. Changes from v3: - New fast_switch_enabled field in struct cpufreq_policy to help avoid affecting existing setups by setting the fast_switch_possible flag in the driver. - __cpufreq_get() skips the policy->cur check if fast_switch_enabled is set. Changes from v2: - The driver ->fast_switch callback and cpufreq_driver_fast_switch() don't need the relation argument as they will always do RELATION_L now. - New mechanism to make fast switch and cpufreq notifiers mutually exclusive. - cpufreq_driver_fast_switch() doesn't do anything in addition to invoking the driver callback and returns its return value. 
--- drivers/cpufreq/acpi-cpufreq.c | 41 +++++++++++++ drivers/cpufreq/cpufreq.c | 127 ++++++++++++++++++++++++++++++++++++++--- include/linux/cpufreq.h | 9 ++ 3 files changed, 168 insertions(+), 9 deletions(-) Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c @@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp return result; } +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq) +{ + struct acpi_cpufreq_data *data = policy->driver_data; + struct acpi_processor_performance *perf; + struct cpufreq_frequency_table *entry; + unsigned int next_perf_state, next_freq, freq; + + /* + * Find the closest frequency above target_freq. + * + * The table is sorted in the reverse order with respect to the + * frequency and all of the entries are valid (see the initialization). + */ + entry = data->freq_table; + do { + entry++; + freq = entry->frequency; + } while (freq >= target_freq && freq != CPUFREQ_TABLE_END); + entry--; + next_freq = entry->frequency; + next_perf_state = entry->driver_data; + + perf = to_perf_data(data); + if (perf->state == next_perf_state) { + if (unlikely(data->resume)) + data->resume = 0; + else + return next_freq; + } + + data->cpu_freq_write(&perf->control_register, + perf->states[next_perf_state].control); + perf->state = next_perf_state; + return next_freq; +} + static unsigned long acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu) { @@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct goto err_unreg; } + policy->fast_switch_possible = !acpi_pstate_strict && + !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY); + data->freq_table = kzalloc(sizeof(*data->freq_table) * (perf->state_count+1), GFP_KERNEL); if (!data->freq_table) { @@ -874,6 +914,7 @@ static struct freq_attr 
*acpi_cpufreq_at static struct cpufreq_driver acpi_cpufreq_driver = { .verify = cpufreq_generic_frequency_table_verify, .target_index = acpi_cpufreq_target, + .fast_switch = acpi_cpufreq_fast_switch, .bios_limit = acpi_processor_get_bios_limit, .init = acpi_cpufreq_cpu_init, .exit = acpi_cpufreq_cpu_exit, Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -102,6 +102,10 @@ struct cpufreq_policy { */ struct rw_semaphore rwsem; + /* Fast switch flags */ + bool fast_switch_possible; /* Set by the driver. */ + bool fast_switch_enabled; + /* Synchronization for frequency transitions */ bool transition_ongoing; /* Tracks transition status */ spinlock_t transition_lock; @@ -156,6 +160,7 @@ int cpufreq_get_policy(struct cpufreq_po int cpufreq_update_policy(unsigned int cpu); bool have_governor_per_policy(void); struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy); +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy); #else static inline unsigned int cpufreq_get(unsigned int cpu) { @@ -236,6 +241,8 @@ struct cpufreq_driver { unsigned int relation); /* Deprecated */ int (*target_index)(struct cpufreq_policy *policy, unsigned int index); + unsigned int (*fast_switch)(struct cpufreq_policy *policy, + unsigned int target_freq); /* * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION * unset. 
@@ -464,6 +471,8 @@ struct cpufreq_governor { }; /* Pass a target to the cpufreq driver */ +unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq); int cpufreq_driver_target(struct cpufreq_policy *policy, unsigned int target_freq, unsigned int relation); Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -428,6 +428,54 @@ void cpufreq_freq_transition_end(struct } EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end); +/* + * Fast frequency switching status count. Positive means "enabled", negative + * means "disabled" and 0 means "not decided yet". + */ +static int cpufreq_fast_switch_count; +static DEFINE_MUTEX(cpufreq_fast_switch_lock); + +static void cpufreq_list_transition_notifiers(void) +{ + struct notifier_block *nb; + + pr_info("cpufreq: Registered transition notifiers:\n"); + + mutex_lock(&cpufreq_transition_notifier_list.mutex); + + for (nb = cpufreq_transition_notifier_list.head; nb; nb = nb->next) + pr_info("cpufreq: %pF\n", nb->notifier_call); + + mutex_unlock(&cpufreq_transition_notifier_list.mutex); +} + +/** + * cpufreq_enable_fast_switch - Enable fast frequency switching for policy. + * @policy: cpufreq policy to enable fast frequency switching for. + * + * Try to enable fast frequency switching for @policy. + * + * The attempt will fail if there is at least one transition notifier registered + * at this point, as fast frequency switching is quite fundamentally at odds + * with transition notifiers. Thus if successful, it will make registration of + * transition notifiers fail going forward. 
+ */ +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy) +{ + lockdep_assert_held(&policy->rwsem); + + mutex_lock(&cpufreq_fast_switch_lock); + if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) { + cpufreq_fast_switch_count++; + policy->fast_switch_enabled = true; + } else { + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n", + policy->cpu); + cpufreq_list_transition_notifiers(); + } + mutex_unlock(&cpufreq_fast_switch_lock); +} +EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch); /********************************************************************* * SYSFS INTERFACE * @@ -1083,6 +1131,24 @@ static void cpufreq_policy_free(struct c kfree(policy); } +static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy) +{ + if (policy->fast_switch_enabled) { + mutex_lock(&cpufreq_fast_switch_lock); + + policy->fast_switch_enabled = false; + if (!WARN_ON(cpufreq_fast_switch_count <= 0)) + cpufreq_fast_switch_count--; + + mutex_unlock(&cpufreq_fast_switch_lock); + } + + if (cpufreq_driver->exit) { + cpufreq_driver->exit(policy); + policy->freq_table = NULL; + } +} + static int cpufreq_online(unsigned int cpu) { struct cpufreq_policy *policy; @@ -1236,8 +1302,7 @@ static int cpufreq_online(unsigned int c out_exit_policy: up_write(&policy->rwsem); - if (cpufreq_driver->exit) - cpufreq_driver->exit(policy); + cpufreq_driver_exit_policy(policy); out_free_policy: cpufreq_policy_free(policy, !new_policy); return ret; @@ -1334,10 +1399,7 @@ static void cpufreq_offline(unsigned int * since this is a core component, and is essential for the * subsequent light-weight ->init() to succeed. */ - if (cpufreq_driver->exit) { - cpufreq_driver->exit(policy); - policy->freq_table = NULL; - } + cpufreq_driver_exit_policy(policy); unlock: up_write(&policy->rwsem); @@ -1444,8 +1506,12 @@ static unsigned int __cpufreq_get(struct ret_freq = cpufreq_driver->get(policy->cpu); - /* Updating inactive policies is invalid, so avoid doing that. 
*/ - if (unlikely(policy_is_inactive(policy))) + /* + * Updating inactive policies is invalid, so avoid doing that. Also + * if fast frequency switching is used with the given policy, the check + * against policy->cur is pointless, so skip it in that case too. + */ + if (unlikely(policy_is_inactive(policy)) || policy->fast_switch_enabled) return ret_freq; if (ret_freq && policy->cur && @@ -1457,7 +1523,6 @@ static unsigned int __cpufreq_get(struct schedule_work(&policy->update); } } - return ret_freq; } @@ -1653,8 +1718,18 @@ int cpufreq_register_notifier(struct not switch (list) { case CPUFREQ_TRANSITION_NOTIFIER: + mutex_lock(&cpufreq_fast_switch_lock); + + if (WARN_ON(cpufreq_fast_switch_count > 0)) { + mutex_unlock(&cpufreq_fast_switch_lock); + return -EPERM; + } ret = srcu_notifier_chain_register( &cpufreq_transition_notifier_list, nb); + if (!ret) + cpufreq_fast_switch_count--; + + mutex_unlock(&cpufreq_fast_switch_lock); break; case CPUFREQ_POLICY_NOTIFIER: ret = blocking_notifier_chain_register( @@ -1687,8 +1762,14 @@ int cpufreq_unregister_notifier(struct n switch (list) { case CPUFREQ_TRANSITION_NOTIFIER: + mutex_lock(&cpufreq_fast_switch_lock); + ret = srcu_notifier_chain_unregister( &cpufreq_transition_notifier_list, nb); + if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0)) + cpufreq_fast_switch_count++; + + mutex_unlock(&cpufreq_fast_switch_lock); break; case CPUFREQ_POLICY_NOTIFIER: ret = blocking_notifier_chain_unregister( @@ -1707,6 +1788,34 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie * GOVERNORS * *********************************************************************/ +/** + * cpufreq_driver_fast_switch - Carry out a fast CPU frequency switch. + * @policy: cpufreq policy to switch the frequency for. + * @target_freq: New frequency to set (may be approximate). + * + * Carry out a fast frequency switch from interrupt context. 
+ * + * The driver's ->fast_switch() callback invoked by this function is expected to + * select the minimum available frequency greater than or equal to @target_freq + * (CPUFREQ_RELATION_L). + * + * This function must not be called if policy->fast_switch_enabled is unset. + * + * Governors calling this function must guarantee that it will never be invoked + * twice in parallel for the same policy and that it will never be called in + * parallel with either ->target() or ->target_index() for the same policy. + * + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch() + * callback to indicate an error condition, the hardware configuration must be + * preserved. + */ +unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq) +{ + return cpufreq_driver->fast_switch(policy, target_freq); +} +EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch); + /* Must set freqs->new to intermediate frequency */ static int __target_intermediate(struct cpufreq_policy *policy, struct cpufreq_freqs *freqs, int index) ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v5 6/7][Update] cpufreq: Support for fast frequency switching 2016-03-16 23:51 ` [PATCH v5 6/7][Update] cpufreq: Support for fast frequency switching Rafael J. Wysocki @ 2016-03-17 11:35 ` Juri Lelli 2016-03-17 11:40 ` Peter Zijlstra 0 siblings, 1 reply; 158+ messages in thread From: Juri Lelli @ 2016-03-17 11:35 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar Hi, On 17/03/16 00:51, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Subject: [PATCH] cpufreq: Support for fast frequency switching > [...] > +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy) > +{ > + lockdep_assert_held(&policy->rwsem); > + > + mutex_lock(&cpufreq_fast_switch_lock); > + if (policy->fast_switch_possible && cpufreq_fast_switch_count >= 0) { > + cpufreq_fast_switch_count++; > + policy->fast_switch_enabled = true; > + } else { > + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n", Ultra-minor nit: s/freqnency/frequency/ Also, is this really a warning or just a debug message? (everything seems to work fine on Juno even if this is printed :-)). Best, - Juri ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v5 6/7][Update] cpufreq: Support for fast frequency switching 2016-03-17 11:35 ` Juri Lelli @ 2016-03-17 11:40 ` Peter Zijlstra 2016-03-17 11:48 ` Juri Lelli 0 siblings, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-17 11:40 UTC (permalink / raw) To: Juri Lelli Cc: Rafael J. Wysocki, Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Thu, Mar 17, 2016 at 11:35:07AM +0000, Juri Lelli wrote: > > + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n", > > Ultra-minor nit: s/freqnency/frequency/ > > Also, is this really a warning or just a debug message? (everything > seems to work fine on Juno even if this is printed :-)). I would consider it a warn; this _should_ not happen. If your platform supports fast_switch, then you really rather want to use it. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v5 6/7][Update] cpufreq: Support for fast frequency switching 2016-03-17 11:40 ` Peter Zijlstra @ 2016-03-17 11:48 ` Juri Lelli 2016-03-17 12:53 ` Rafael J. Wysocki 0 siblings, 1 reply; 158+ messages in thread From: Juri Lelli @ 2016-03-17 11:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On 17/03/16 12:40, Peter Zijlstra wrote: > On Thu, Mar 17, 2016 at 11:35:07AM +0000, Juri Lelli wrote: > > > > + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n", > > > > Ultra-minor nit: s/freqnency/frequency/ > > > > Also, is this really a warning or just a debug message? (everything > > seems to work fine on Juno even if this is printed :-)). > > I would consider it a warn; this _should_ not happen. If your platform > supports fast_switch, then you really rather want to use it. > Mmm, right. So, something seems not correct here, as I get this warning when I select schedutil on Juno (that doesn't support fast_switch). ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v5 6/7][Update] cpufreq: Support for fast frequency switching 2016-03-17 11:48 ` Juri Lelli @ 2016-03-17 12:53 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-17 12:53 UTC (permalink / raw) To: Juri Lelli Cc: Peter Zijlstra, Rafael J. Wysocki, Linux PM list, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Thu, Mar 17, 2016 at 12:48 PM, Juri Lelli <juri.lelli@arm.com> wrote: > On 17/03/16 12:40, Peter Zijlstra wrote: >> On Thu, Mar 17, 2016 at 11:35:07AM +0000, Juri Lelli wrote: >> >> > > + pr_warn("cpufreq: CPU%u: Fast freqnency switching not enabled\n", >> > >> > Ultra-minor nit: s/freqnency/frequency/ >> > >> > Also, is this really a warning or just a debug message? (everything >> > seems to work fine on Juno even if this is printed :-)). >> >> I would consider it a warn; this _should_ not happen. If your platform >> supports fast_switch, then you really rather want to use it. >> > > Mmm, right. So, something seems not correct here, as I get this warning > when I select schedutil on Juno (that doesn't support fast_switch). There is a mistake here. The message should not be printed if policy->fast_switch_possible is not set. Will fix. ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v5 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (8 preceding siblings ...) 2016-03-16 23:51 ` [PATCH v5 6/7][Update] cpufreq: Support for fast frequency switching Rafael J. Wysocki @ 2016-03-17 0:01 ` Rafael J. Wysocki 2016-03-17 11:30 ` Juri Lelli 2016-03-17 11:36 ` Peter Zijlstra 2016-03-17 15:54 ` [PATCH v6 6/7][Update] cpufreq: Support for fast frequency switching Rafael J. Wysocki 2016-03-17 16:01 ` [PATCH v6 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki 11 siblings, 2 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-17 0:01 UTC (permalink / raw) To: Linux PM list, Peter Zijlstra Cc: Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Add a new cpufreq scaling governor, called "schedutil", that uses scheduler-provided CPU utilization information as input for making its decisions. Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add mechanism for registering utilization update callbacks) that introduced cpufreq_update_util() called by the scheduler on utilization changes (from CFS) and RT/DL task status updates. In particular, CPU frequency scaling decisions may be based on the utilization data passed to cpufreq_update_util() by CFS. The new governor is relatively simple. The frequency selection formula used by it depends on whether or not the utilization is frequency-invariant. In the frequency-invariant case the new CPU frequency is given by next_freq = 1.25 * max_freq * util / max where util and max are the last two arguments of cpufreq_update_util(). 
In turn, if util is not frequency-invariant, the maximum frequency in the above formula is replaced with the current frequency of the CPU: next_freq = 1.25 * curr_freq * util / max The coefficient 1.25 corresponds to the frequency tipping point at (util / max) = 0.8. All of the computations are carried out in the utilization update handlers provided by the new governor. One of those handlers is used for cpufreq policies shared between multiple CPUs and the other one is for policies with one CPU only (and therefore it doesn't need to use any extra synchronization means). The governor supports fast frequency switching if that is supported by the cpufreq driver in use and possible for the given policy. In the fast switching case, all operations of the governor take place in its utilization update handlers. If fast switching cannot be used, the frequency switch operations are carried out with the help of a work item which only calls __cpufreq_driver_target() (under a mutex) to trigger a frequency update (to a value already computed beforehand in one of the utilization update handlers). Currently, the governor treats all of the RT and DL tasks as "unknown utilization" and sets the frequency to the allowed maximum when updated from the RT or DL sched classes. That heavy-handed approach should be replaced with something more subtle and specifically targeted at RT and DL tasks. The governor shares some tunables management code with the "ondemand" and "conservative" governors and uses some common definitions from cpufreq_governor.h, but apart from that it is stand-alone. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Addressing comments from Peter. Changes from v4: - Use TICK_NSEC in sugov_next_freq_shared(). - Use schedule_work_on() to schedule work items and replace work_in_progress with work_cpu (which is used both for scheduling work items and as a "work in progress" marker). 
- Rearrange sugov_update_commit() to only check policy->min/max if fast switching is enabled. - Replace util > max checks with util == ULONG_MAX checks to make it clear that they are about a special case (RT/DL). Changes from v3: - The "next frequency" formula based on http://marc.info/?l=linux-acpi&m=145756618321500&w=4 and http://marc.info/?l=linux-kernel&m=145760739700716&w=4 - The governor goes into kernel/sched/ (again). Changes from v2: - The governor goes into drivers/cpufreq/. - The "next frequency" formula has an additional 1.1 factor to allow more util/max values to map onto the top-most frequency in case the distance between that and the previous one is disproportionately small. - sugov_update_commit() traces CPU frequency even if the new one is the same as the previous one (otherwise, if the system is 100% loaded for long enough, powertop starts to report that all CPUs are 100% idle). --- drivers/cpufreq/Kconfig | 26 + kernel/sched/Makefile | 1 kernel/sched/cpufreq_schedutil.c | 527 +++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 8 4 files changed, 562 insertions(+) Index: linux-pm/drivers/cpufreq/Kconfig =================================================================== --- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor. + +config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL + bool "schedutil" + select CPU_FREQ_GOV_SCHEDUTIL + select CPU_FREQ_GOV_PERFORMANCE + help + Use the 'schedutil' CPUFreq governor by default. If unsure, + have a look at the help section of that governor. The fallback + governor will be 'performance'. + endchoice config CPU_FREQ_GOV_PERFORMANCE @@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE If in doubt, say N.
+config CPU_FREQ_GOV_SCHEDUTIL + tristate "'schedutil' cpufreq policy governor" + depends on CPU_FREQ + select CPU_FREQ_GOV_ATTR_SET + select IRQ_WORK + help + The frequency selection formula used by this governor is analogous + to the one used by 'ondemand', but instead of computing CPU load + as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU + utilization data provided by the scheduler as input. + + To compile this driver as a module, choose M here: the + module will be called cpufreq_schedutil. + + If in doubt, say N. + comment "CPU frequency scaling drivers" config CPUFREQ_DT Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- /dev/null +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -0,0 +1,527 @@ +/* + * CPUFreq governor based on scheduler-provided CPU utilization data. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/cpufreq.h> +#include <linux/module.h> +#include <linux/slab.h> +#include <trace/events/power.h> + +#include "sched.h" + +struct sugov_tunables { + struct gov_attr_set attr_set; + unsigned int rate_limit_us; +}; + +struct sugov_policy { + struct cpufreq_policy *policy; + + struct sugov_tunables *tunables; + struct list_head tunables_hook; + + raw_spinlock_t update_lock; /* For shared policies */ + u64 last_freq_update_time; + s64 freq_update_delay_ns; + unsigned int next_freq; + + /* The next fields are only needed if fast switch cannot be used. 
*/ + struct irq_work irq_work; + struct work_struct work; + struct mutex work_lock; + unsigned int work_cpu; + + bool need_freq_update; +}; + +struct sugov_cpu { + struct update_util_data update_util; + struct sugov_policy *sg_policy; + + /* The fields below are only needed when sharing a policy. */ + unsigned long util; + unsigned long max; + u64 last_update; +}; + +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); + +/************************ Governor internals ***********************/ + +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) +{ + u64 delta_ns; + + if (sg_policy->work_cpu != UINT_MAX) + return false; + + if (unlikely(sg_policy->need_freq_update)) { + sg_policy->need_freq_update = false; + return true; + } + + delta_ns = time - sg_policy->last_freq_update_time; + return (s64)delta_ns >= sg_policy->freq_update_delay_ns; +} + +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, + unsigned int next_freq) +{ + struct cpufreq_policy *policy = sg_policy->policy; + + sg_policy->last_freq_update_time = time; + + if (policy->fast_switch_enabled) { + if (next_freq > policy->max) + next_freq = policy->max; + else if (next_freq < policy->min) + next_freq = policy->min; + + if (sg_policy->next_freq == next_freq) { + trace_cpu_frequency(policy->cur, smp_processor_id()); + return; + } + sg_policy->next_freq = next_freq; + next_freq = cpufreq_driver_fast_switch(policy, next_freq); + if (next_freq == CPUFREQ_ENTRY_INVALID) + return; + + policy->cur = next_freq; + trace_cpu_frequency(next_freq, smp_processor_id()); + } else if (sg_policy->next_freq != next_freq) { + sg_policy->work_cpu = smp_processor_id(); + irq_work_queue(&sg_policy->irq_work); + } +} + +/** + * get_next_freq - Compute a new frequency for a given cpufreq policy. + * @policy: cpufreq policy object to compute the new frequency for. + * @util: Current CPU utilization. + * @max: CPU capacity. 
+ * + * If the utilization is frequency-invariant, choose the new frequency to be + * proportional to it, that is + * + * next_freq = C * max_freq * util / max + * + * Otherwise, approximate the would-be frequency-invariant utilization by + * util_raw * (curr_freq / max_freq) which leads to + * + * next_freq = C * curr_freq * util_raw / max + * + * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8. + */ +static unsigned int get_next_freq(struct cpufreq_policy *policy, + unsigned long util, unsigned long max) +{ + unsigned int freq = arch_scale_freq_invariant() ? + policy->cpuinfo.max_freq : policy->cur; + + return (freq + (freq >> 2)) * util / max; +} + +static void sugov_update_single(struct update_util_data *hook, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int next_f; + + if (!sugov_should_update_freq(sg_policy, time)) + return; + + next_f = util == ULONG_MAX ? 
policy->cpuinfo.max_freq : + get_next_freq(policy, util, max); + sugov_update_commit(sg_policy, time, next_f); +} + +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy, + unsigned long util, unsigned long max) +{ + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int max_f = policy->cpuinfo.max_freq; + u64 last_freq_update_time = sg_policy->last_freq_update_time; + unsigned int j; + + if (util == ULONG_MAX) + return max_f; + + for_each_cpu(j, policy->cpus) { + struct sugov_cpu *j_sg_cpu; + unsigned long j_util, j_max; + u64 delta_ns; + + if (j == smp_processor_id()) + continue; + + j_sg_cpu = &per_cpu(sugov_cpu, j); + /* + * If the CPU utilization was last updated before the previous + * frequency update and the time elapsed between the last update + * of the CPU utilization and the last frequency update is long + * enough, don't take the CPU into account as it probably is + * idle now. + */ + delta_ns = last_freq_update_time - j_sg_cpu->last_update; + if ((s64)delta_ns > TICK_NSEC) + continue; + + j_util = j_sg_cpu->util; + if (j_util == ULONG_MAX) + return max_f; + + j_max = j_sg_cpu->max; + if (j_util * max > j_max * util) { + util = j_util; + max = j_max; + } + } + + return get_next_freq(policy, util, max); +} + +static void sugov_update_shared(struct update_util_data *hook, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + unsigned int next_f; + + raw_spin_lock(&sg_policy->update_lock); + + sg_cpu->util = util; + sg_cpu->max = max; + sg_cpu->last_update = time; + + if (sugov_should_update_freq(sg_policy, time)) { + next_f = sugov_next_freq_shared(sg_policy, util, max); + sugov_update_commit(sg_policy, time, next_f); + } + + raw_spin_unlock(&sg_policy->update_lock); +} + +static void sugov_work(struct work_struct *work) +{ + struct sugov_policy *sg_policy = container_of(work, struct 
sugov_policy, work); + + mutex_lock(&sg_policy->work_lock); + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, + CPUFREQ_RELATION_L); + mutex_unlock(&sg_policy->work_lock); + + sg_policy->work_cpu = UINT_MAX; +} + +static void sugov_irq_work(struct irq_work *irq_work) +{ + struct sugov_policy *sg_policy; + + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); + schedule_work_on(sg_policy->work_cpu, &sg_policy->work); +} + +/************************** sysfs interface ************************/ + +static struct sugov_tunables *global_tunables; +static DEFINE_MUTEX(global_tunables_lock); + +static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set) +{ + return container_of(attr_set, struct sugov_tunables, attr_set); +} + +static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + + return sprintf(buf, "%u\n", tunables->rate_limit_us); +} + +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, + size_t count) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + struct sugov_policy *sg_policy; + unsigned int rate_limit_us; + int ret; + + ret = sscanf(buf, "%u", &rate_limit_us); + if (ret != 1) + return -EINVAL; + + tunables->rate_limit_us = rate_limit_us; + + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) + sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC; + + return count; +} + +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us); + +static struct attribute *sugov_attributes[] = { + &rate_limit_us.attr, + NULL +}; + +static struct kobj_type sugov_tunables_ktype = { + .default_attrs = sugov_attributes, + .sysfs_ops = &governor_sysfs_ops, +}; + +/********************** cpufreq governor interface *********************/ + +static struct cpufreq_governor schedutil_gov; + +static struct sugov_policy *sugov_policy_alloc(struct 
cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + + sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL); + if (!sg_policy) + return NULL; + + sg_policy->policy = policy; + init_irq_work(&sg_policy->irq_work, sugov_irq_work); + INIT_WORK(&sg_policy->work, sugov_work); + mutex_init(&sg_policy->work_lock); + raw_spin_lock_init(&sg_policy->update_lock); + return sg_policy; +} + +static void sugov_policy_free(struct sugov_policy *sg_policy) +{ + mutex_destroy(&sg_policy->work_lock); + kfree(sg_policy); +} + +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy) +{ + struct sugov_tunables *tunables; + + tunables = kzalloc(sizeof(*tunables), GFP_KERNEL); + if (tunables) + gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook); + + return tunables; +} + +static void sugov_tunables_free(struct sugov_tunables *tunables) +{ + if (!have_governor_per_policy()) + global_tunables = NULL; + + kfree(tunables); +} + +static int sugov_init(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + struct sugov_tunables *tunables; + unsigned int lat; + int ret = 0; + + /* State should be equivalent to EXIT */ + if (policy->governor_data) + return -EBUSY; + + sg_policy = sugov_policy_alloc(policy); + if (!sg_policy) + return -ENOMEM; + + mutex_lock(&global_tunables_lock); + + if (global_tunables) { + if (WARN_ON(have_governor_per_policy())) { + ret = -EINVAL; + goto free_sg_policy; + } + policy->governor_data = sg_policy; + sg_policy->tunables = global_tunables; + + gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook); + goto out; + } + + tunables = sugov_tunables_alloc(sg_policy); + if (!tunables) { + ret = -ENOMEM; + goto free_sg_policy; + } + + tunables->rate_limit_us = LATENCY_MULTIPLIER; + lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC; + if (lat) + tunables->rate_limit_us *= lat; + + if (!have_governor_per_policy()) + global_tunables = tunables; + + policy->governor_data = sg_policy; + 
sg_policy->tunables = tunables; + + ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype, + get_governor_parent_kobj(policy), "%s", + schedutil_gov.name); + if (!ret) + goto out; + + /* Failure, so roll back. */ + policy->governor_data = NULL; + sugov_tunables_free(tunables); + + free_sg_policy: + pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret); + sugov_policy_free(sg_policy); + + out: + mutex_unlock(&global_tunables_lock); + return ret; +} + +static int sugov_exit(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + struct sugov_tunables *tunables = sg_policy->tunables; + unsigned int count; + + mutex_lock(&global_tunables_lock); + + count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook); + policy->governor_data = NULL; + if (!count) + sugov_tunables_free(tunables); + + mutex_unlock(&global_tunables_lock); + + sugov_policy_free(sg_policy); + return 0; +} + +static int sugov_start(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + cpufreq_enable_fast_switch(policy); + + sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC; + sg_policy->last_freq_update_time = 0; + sg_policy->next_freq = UINT_MAX; + sg_policy->work_cpu = UINT_MAX; + sg_policy->need_freq_update = false; + + for_each_cpu(cpu, policy->cpus) { + struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu); + + sg_cpu->sg_policy = sg_policy; + if (policy_is_shared(policy)) { + sg_cpu->util = ULONG_MAX; + sg_cpu->max = 0; + sg_cpu->last_update = 0; + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, + sugov_update_shared); + } else { + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, + sugov_update_single); + } + } + return 0; +} + +static int sugov_stop(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + for_each_cpu(cpu, 
policy->cpus) + cpufreq_remove_update_util_hook(cpu); + + synchronize_sched(); + + irq_work_sync(&sg_policy->irq_work); + cancel_work_sync(&sg_policy->work); + return 0; +} + +static int sugov_limits(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + + if (!policy->fast_switch_enabled) { + mutex_lock(&sg_policy->work_lock); + + if (policy->max < policy->cur) + __cpufreq_driver_target(policy, policy->max, + CPUFREQ_RELATION_H); + else if (policy->min > policy->cur) + __cpufreq_driver_target(policy, policy->min, + CPUFREQ_RELATION_L); + + mutex_unlock(&sg_policy->work_lock); + } + + sg_policy->need_freq_update = true; + return 0; +} + +int sugov_governor(struct cpufreq_policy *policy, unsigned int event) +{ + if (event == CPUFREQ_GOV_POLICY_INIT) { + return sugov_init(policy); + } else if (policy->governor_data) { + switch (event) { + case CPUFREQ_GOV_POLICY_EXIT: + return sugov_exit(policy); + case CPUFREQ_GOV_START: + return sugov_start(policy); + case CPUFREQ_GOV_STOP: + return sugov_stop(policy); + case CPUFREQ_GOV_LIMITS: + return sugov_limits(policy); + } + } + return -EINVAL; +} + +static struct cpufreq_governor schedutil_gov = { + .name = "schedutil", + .governor = sugov_governor, + .owner = THIS_MODULE, +}; + +static int __init sugov_module_init(void) +{ + return cpufreq_register_governor(&schedutil_gov); +} + +static void __exit sugov_module_exit(void) +{ + cpufreq_unregister_governor(&schedutil_gov); +} + +MODULE_AUTHOR("Rafael J. 
Wysocki <rafael.j.wysocki@intel.com>"); +MODULE_DESCRIPTION("Utilization-based CPU frequency selection"); +MODULE_LICENSE("GPL"); + +#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL +struct cpufreq_governor *cpufreq_default_governor(void) +{ + return &schedutil_gov; +} + +fs_initcall(sugov_module_init); +#else +module_init(sugov_module_init); +#endif +module_exit(sugov_module_exit); Index: linux-pm/kernel/sched/Makefile =================================================================== --- linux-pm.orig/kernel/sched/Makefile +++ linux-pm/kernel/sched/Makefile @@ -20,3 +20,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o obj-$(CONFIG_CPU_FREQ) += cpufreq.o +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o Index: linux-pm/kernel/sched/sched.h =================================================================== --- linux-pm.orig/kernel/sched/sched.h +++ linux-pm/kernel/sched/sched.h @@ -1786,3 +1786,11 @@ static inline void cpufreq_trigger_updat static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {} static inline void cpufreq_trigger_update(u64 time) {} #endif /* CONFIG_CPU_FREQ */ + +#ifdef arch_scale_freq_capacity +#ifndef arch_scale_freq_invariant +#define arch_scale_freq_invariant() (true) +#endif +#else /* arch_scale_freq_capacity */ +#define arch_scale_freq_invariant() (false) +#endif ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v5 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-17 0:01 ` [PATCH v5 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki @ 2016-03-17 11:30 ` Juri Lelli 2016-03-17 12:54 ` Rafael J. Wysocki 2016-03-17 11:36 ` Peter Zijlstra 1 sibling, 1 reply; 158+ messages in thread From: Juri Lelli @ 2016-03-17 11:30 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Peter Zijlstra, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar Hi Rafael, On 17/03/16 01:01, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [...] > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, > + unsigned int next_freq) > +{ > + struct cpufreq_policy *policy = sg_policy->policy; > + > + sg_policy->last_freq_update_time = time; > + > + if (policy->fast_switch_enabled) { > + if (next_freq > policy->max) > + next_freq = policy->max; > + else if (next_freq < policy->min) > + next_freq = policy->min; > + > + if (sg_policy->next_freq == next_freq) { > + trace_cpu_frequency(policy->cur, smp_processor_id()); > + return; > + } > + sg_policy->next_freq = next_freq; > + next_freq = cpufreq_driver_fast_switch(policy, next_freq); > + if (next_freq == CPUFREQ_ENTRY_INVALID) > + return; > + > + policy->cur = next_freq; > + trace_cpu_frequency(next_freq, smp_processor_id()); > + } else if (sg_policy->next_freq != next_freq) { > + sg_policy->work_cpu = smp_processor_id(); + sg_policy->next_freq = next_freq; > + irq_work_queue(&sg_policy->irq_work); > + } > +} Or we remain at max_f :-). Best, - Juri ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v5 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-17 11:30 ` Juri Lelli @ 2016-03-17 12:54 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-17 12:54 UTC (permalink / raw) To: Juri Lelli Cc: Rafael J. Wysocki, Linux PM list, Peter Zijlstra, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Thu, Mar 17, 2016 at 12:30 PM, Juri Lelli <juri.lelli@arm.com> wrote: > Hi Rafael, > > On 17/03/16 01:01, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > [...] > >> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, >> + unsigned int next_freq) >> +{ >> + struct cpufreq_policy *policy = sg_policy->policy; >> + >> + sg_policy->last_freq_update_time = time; >> + >> + if (policy->fast_switch_enabled) { >> + if (next_freq > policy->max) >> + next_freq = policy->max; >> + else if (next_freq < policy->min) >> + next_freq = policy->min; >> + >> + if (sg_policy->next_freq == next_freq) { >> + trace_cpu_frequency(policy->cur, smp_processor_id()); >> + return; >> + } >> + sg_policy->next_freq = next_freq; >> + next_freq = cpufreq_driver_fast_switch(policy, next_freq); >> + if (next_freq == CPUFREQ_ENTRY_INVALID) >> + return; >> + >> + policy->cur = next_freq; >> + trace_cpu_frequency(next_freq, smp_processor_id()); >> + } else if (sg_policy->next_freq != next_freq) { >> + sg_policy->work_cpu = smp_processor_id(); > > + sg_policy->next_freq = next_freq; > Doh. >> + irq_work_queue(&sg_policy->irq_work); >> + } >> +} > > Or we remain at max_f :-). Sure, thanks! Will fix. ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v5 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-17 0:01 ` [PATCH v5 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki 2016-03-17 11:30 ` Juri Lelli @ 2016-03-17 11:36 ` Peter Zijlstra 2016-03-17 12:54 ` Rafael J. Wysocki 1 sibling, 1 reply; 158+ messages in thread From: Peter Zijlstra @ 2016-03-17 11:36 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Thu, Mar 17, 2016 at 01:01:45AM +0100, Rafael J. Wysocki wrote: > + } else if (sg_policy->next_freq != next_freq) { > + sg_policy->work_cpu = smp_processor_id(); > + irq_work_queue(&sg_policy->irq_work); > + } > +} > +static void sugov_irq_work(struct irq_work *irq_work) > +{ > + struct sugov_policy *sg_policy; > + > + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); > + schedule_work_on(sg_policy->work_cpu, &sg_policy->work); > +} Not sure I see the point of ->work_cpu, irq_work_queue() does guarantee the same CPU, so the above is identical to: schedule_work_on(smp_processor_id(), &sq_policy->work); ^ permalink raw reply [flat|nested] 158+ messages in thread
* Re: [PATCH v5 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-17 11:36 ` Peter Zijlstra @ 2016-03-17 12:54 ` Rafael J. Wysocki 0 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-17 12:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM list, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar On Thu, Mar 17, 2016 at 12:36 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Mar 17, 2016 at 01:01:45AM +0100, Rafael J. Wysocki wrote: >> + } else if (sg_policy->next_freq != next_freq) { >> + sg_policy->work_cpu = smp_processor_id(); >> + irq_work_queue(&sg_policy->irq_work); >> + } >> +} > >> +static void sugov_irq_work(struct irq_work *irq_work) >> +{ >> + struct sugov_policy *sg_policy; >> + >> + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); >> + schedule_work_on(sg_policy->work_cpu, &sg_policy->work); >> +} > > Not sure I see the point of ->work_cpu, irq_work_queue() does guarantee > the same CPU, so the above is identical to: > > schedule_work_on(smp_processor_id(), &sq_policy->work); OK I'll do that and restore work_in_progress, then. ^ permalink raw reply [flat|nested] 158+ messages in thread
* [PATCH v6 6/7][Update] cpufreq: Support for fast frequency switching 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (9 preceding siblings ...) 2016-03-17 0:01 ` [PATCH v5 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki @ 2016-03-17 15:54 ` Rafael J. Wysocki 2016-03-17 16:01 ` [PATCH v6 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki 11 siblings, 0 replies; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-17 15:54 UTC (permalink / raw) To: Linux PM list, Juri Lelli, Peter Zijlstra Cc: Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Modify the ACPI cpufreq driver to provide a method for switching CPU frequencies from interrupt context and update the cpufreq core to support that method if available. Introduce a new cpufreq driver callback, ->fast_switch, to be invoked for frequency switching from interrupt context by (future) governors supporting that feature via (new) helper function cpufreq_driver_fast_switch(). Add two new policy flags, fast_switch_possible, to be set by the cpufreq driver if fast frequency switching can be used for the given policy and fast_switch_enabled, to be set by the governor if it is going to use fast frequency switching for the given policy. Also add a helper for setting the latter. Since fast frequency switching is inherently incompatible with cpufreq transition notifiers, make it possible to set the fast_switch_enabled only if there are no transition notifiers already registered and make the registration of new transition notifiers fail if fast_switch_enabled is set for at least one policy. Implement the ->fast_switch callback in the ACPI cpufreq driver and make it set fast_switch_possible during policy initialization as appropriate. 
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Addressing comments, fixes. Changes from v5: - cpufreq_enable_fast_switch() fixed to avoid printing a confusing message if fast_switch_possible is not set for the policy. - Fixed a typo in that message. - Removed the WARN_ON() from the (cpufreq_fast_switch_count > 0) check in cpufreq_register_notifier(), because it triggered false-positive warnings from the cpufreq_stats module (cpufreq_stats don't work with the fast switching, because it is based on notifiers). Changes from v4: - If cpufreq_enable_fast_switch() is about to fail, it will print the list of currently registered transition notifiers. - Added lock_assert_held(&policy->rwsem) to cpufreq_enable_fast_switch(). - Added WARN_ON() to the (cpufreq_fast_switch_count > 0) check in cpufreq_register_notifier(). - Modified the kerneldoc comment of cpufreq_driver_fast_switch() to mention the RELATION_L expectation regarding the ->fast_switch callback. Changes from v3: - New fast_switch_enabled field in struct cpufreq_policy to help avoid affecting existing setups by setting the fast_switch_possible flag in the driver. - __cpufreq_get() skips the policy->cur check if fast_switch_enabled is set. Changes from v2: - The driver ->fast_switch callback and cpufreq_driver_fast_switch() don't need the relation argument as they will always do RELATION_L now. - New mechanism to make fast switch and cpufreq notifiers mutually exclusive. - cpufreq_driver_fast_switch() doesn't do anything in addition to invoking the driver callback and returns its return value. 
--- drivers/cpufreq/acpi-cpufreq.c | 41 ++++++++++++ drivers/cpufreq/cpufreq.c | 130 ++++++++++++++++++++++++++++++++++++++--- include/linux/cpufreq.h | 9 ++ 3 files changed, 171 insertions(+), 9 deletions(-) Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c @@ -458,6 +458,43 @@ static int acpi_cpufreq_target(struct cp return result; } +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq) +{ + struct acpi_cpufreq_data *data = policy->driver_data; + struct acpi_processor_performance *perf; + struct cpufreq_frequency_table *entry; + unsigned int next_perf_state, next_freq, freq; + + /* + * Find the closest frequency above target_freq. + * + * The table is sorted in the reverse order with respect to the + * frequency and all of the entries are valid (see the initialization). + */ + entry = data->freq_table; + do { + entry++; + freq = entry->frequency; + } while (freq >= target_freq && freq != CPUFREQ_TABLE_END); + entry--; + next_freq = entry->frequency; + next_perf_state = entry->driver_data; + + perf = to_perf_data(data); + if (perf->state == next_perf_state) { + if (unlikely(data->resume)) + data->resume = 0; + else + return next_freq; + } + + data->cpu_freq_write(&perf->control_register, + perf->states[next_perf_state].control); + perf->state = next_perf_state; + return next_freq; +} + static unsigned long acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu) { @@ -740,6 +777,9 @@ static int acpi_cpufreq_cpu_init(struct goto err_unreg; } + policy->fast_switch_possible = !acpi_pstate_strict && + !(policy_is_shared(policy) && policy->shared_type != CPUFREQ_SHARED_TYPE_ANY); + data->freq_table = kzalloc(sizeof(*data->freq_table) * (perf->state_count+1), GFP_KERNEL); if (!data->freq_table) { @@ -874,6 +914,7 @@ static struct freq_attr 
*acpi_cpufreq_at static struct cpufreq_driver acpi_cpufreq_driver = { .verify = cpufreq_generic_frequency_table_verify, .target_index = acpi_cpufreq_target, + .fast_switch = acpi_cpufreq_fast_switch, .bios_limit = acpi_processor_get_bios_limit, .init = acpi_cpufreq_cpu_init, .exit = acpi_cpufreq_cpu_exit, Index: linux-pm/include/linux/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/cpufreq.h +++ linux-pm/include/linux/cpufreq.h @@ -102,6 +102,10 @@ struct cpufreq_policy { */ struct rw_semaphore rwsem; + /* Fast switch flags */ + bool fast_switch_possible; /* Set by the driver. */ + bool fast_switch_enabled; + /* Synchronization for frequency transitions */ bool transition_ongoing; /* Tracks transition status */ spinlock_t transition_lock; @@ -156,6 +160,7 @@ int cpufreq_get_policy(struct cpufreq_po int cpufreq_update_policy(unsigned int cpu); bool have_governor_per_policy(void); struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy); +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy); #else static inline unsigned int cpufreq_get(unsigned int cpu) { @@ -236,6 +241,8 @@ struct cpufreq_driver { unsigned int relation); /* Deprecated */ int (*target_index)(struct cpufreq_policy *policy, unsigned int index); + unsigned int (*fast_switch)(struct cpufreq_policy *policy, + unsigned int target_freq); /* * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION * unset. 
@@ -464,6 +471,8 @@ struct cpufreq_governor { }; /* Pass a target to the cpufreq driver */ +unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq); int cpufreq_driver_target(struct cpufreq_policy *policy, unsigned int target_freq, unsigned int relation); Index: linux-pm/drivers/cpufreq/cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/cpufreq.c +++ linux-pm/drivers/cpufreq/cpufreq.c @@ -428,6 +428,57 @@ void cpufreq_freq_transition_end(struct } EXPORT_SYMBOL_GPL(cpufreq_freq_transition_end); +/* + * Fast frequency switching status count. Positive means "enabled", negative + * means "disabled" and 0 means "not decided yet". + */ +static int cpufreq_fast_switch_count; +static DEFINE_MUTEX(cpufreq_fast_switch_lock); + +static void cpufreq_list_transition_notifiers(void) +{ + struct notifier_block *nb; + + pr_info("cpufreq: Registered transition notifiers:\n"); + + mutex_lock(&cpufreq_transition_notifier_list.mutex); + + for (nb = cpufreq_transition_notifier_list.head; nb; nb = nb->next) + pr_info("cpufreq: %pF\n", nb->notifier_call); + + mutex_unlock(&cpufreq_transition_notifier_list.mutex); +} + +/** + * cpufreq_enable_fast_switch - Enable fast frequency switching for policy. + * @policy: cpufreq policy to enable fast frequency switching for. + * + * Try to enable fast frequency switching for @policy. + * + * The attempt will fail if there is at least one transition notifier registered + * at this point, as fast frequency switching is quite fundamentally at odds + * with transition notifiers. Thus if successful, it will make registration of + * transition notifiers fail going forward. 
+ */ +void cpufreq_enable_fast_switch(struct cpufreq_policy *policy) +{ + lockdep_assert_held(&policy->rwsem); + + if (!policy->fast_switch_possible) + return; + + mutex_lock(&cpufreq_fast_switch_lock); + if (cpufreq_fast_switch_count >= 0) { + cpufreq_fast_switch_count++; + policy->fast_switch_enabled = true; + } else { + pr_warn("cpufreq: CPU%u: Fast frequency switching not enabled\n", + policy->cpu); + cpufreq_list_transition_notifiers(); + } + mutex_unlock(&cpufreq_fast_switch_lock); +} +EXPORT_SYMBOL_GPL(cpufreq_enable_fast_switch); /********************************************************************* * SYSFS INTERFACE * @@ -1083,6 +1134,24 @@ static void cpufreq_policy_free(struct c kfree(policy); } +static void cpufreq_driver_exit_policy(struct cpufreq_policy *policy) +{ + if (policy->fast_switch_enabled) { + mutex_lock(&cpufreq_fast_switch_lock); + + policy->fast_switch_enabled = false; + if (!WARN_ON(cpufreq_fast_switch_count <= 0)) + cpufreq_fast_switch_count--; + + mutex_unlock(&cpufreq_fast_switch_lock); + } + + if (cpufreq_driver->exit) { + cpufreq_driver->exit(policy); + policy->freq_table = NULL; + } +} + static int cpufreq_online(unsigned int cpu) { struct cpufreq_policy *policy; @@ -1236,8 +1305,7 @@ static int cpufreq_online(unsigned int c out_exit_policy: up_write(&policy->rwsem); - if (cpufreq_driver->exit) - cpufreq_driver->exit(policy); + cpufreq_driver_exit_policy(policy); out_free_policy: cpufreq_policy_free(policy, !new_policy); return ret; @@ -1334,10 +1402,7 @@ static void cpufreq_offline(unsigned int * since this is a core component, and is essential for the * subsequent light-weight ->init() to succeed. 
*/ - if (cpufreq_driver->exit) { - cpufreq_driver->exit(policy); - policy->freq_table = NULL; - } + cpufreq_driver_exit_policy(policy); unlock: up_write(&policy->rwsem); @@ -1444,8 +1509,12 @@ static unsigned int __cpufreq_get(struct ret_freq = cpufreq_driver->get(policy->cpu); - /* Updating inactive policies is invalid, so avoid doing that. */ - if (unlikely(policy_is_inactive(policy))) + /* + * Updating inactive policies is invalid, so avoid doing that. Also + * if fast frequency switching is used with the given policy, the check + * against policy->cur is pointless, so skip it in that case too. + */ + if (unlikely(policy_is_inactive(policy)) || policy->fast_switch_enabled) return ret_freq; if (ret_freq && policy->cur && @@ -1457,7 +1526,6 @@ static unsigned int __cpufreq_get(struct schedule_work(&policy->update); } } - return ret_freq; } @@ -1653,8 +1721,18 @@ int cpufreq_register_notifier(struct not switch (list) { case CPUFREQ_TRANSITION_NOTIFIER: + mutex_lock(&cpufreq_fast_switch_lock); + + if (cpufreq_fast_switch_count > 0) { + mutex_unlock(&cpufreq_fast_switch_lock); + return -EBUSY; + } ret = srcu_notifier_chain_register( &cpufreq_transition_notifier_list, nb); + if (!ret) + cpufreq_fast_switch_count--; + + mutex_unlock(&cpufreq_fast_switch_lock); break; case CPUFREQ_POLICY_NOTIFIER: ret = blocking_notifier_chain_register( @@ -1687,8 +1765,14 @@ int cpufreq_unregister_notifier(struct n switch (list) { case CPUFREQ_TRANSITION_NOTIFIER: + mutex_lock(&cpufreq_fast_switch_lock); + ret = srcu_notifier_chain_unregister( &cpufreq_transition_notifier_list, nb); + if (!ret && !WARN_ON(cpufreq_fast_switch_count >= 0)) + cpufreq_fast_switch_count++; + + mutex_unlock(&cpufreq_fast_switch_lock); break; case CPUFREQ_POLICY_NOTIFIER: ret = blocking_notifier_chain_unregister( @@ -1707,6 +1791,34 @@ EXPORT_SYMBOL(cpufreq_unregister_notifie * GOVERNORS * *********************************************************************/ +/** + * cpufreq_driver_fast_switch - Carry out a 
fast CPU frequency switch. + * @policy: cpufreq policy to switch the frequency for. + * @target_freq: New frequency to set (may be approximate). + * + * Carry out a fast frequency switch from interrupt context. + * + * The driver's ->fast_switch() callback invoked by this function is expected to + * select the minimum available frequency greater than or equal to @target_freq + * (CPUFREQ_RELATION_L). + * + * This function must not be called if policy->fast_switch_enabled is unset. + * + * Governors calling this function must guarantee that it will never be invoked + * twice in parallel for the same policy and that it will never be called in + * parallel with either ->target() or ->target_index() for the same policy. + * + * If CPUFREQ_ENTRY_INVALID is returned by the driver's ->fast_switch() + * callback to indicate an error condition, the hardware configuration must be + * preserved. + */ +unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy, + unsigned int target_freq) +{ + return cpufreq_driver->fast_switch(policy, target_freq); +} +EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch); + /* Must set freqs->new to intermediate frequency */ static int __target_intermediate(struct cpufreq_policy *policy, struct cpufreq_freqs *freqs, int index) ^ permalink raw reply [flat|nested] 158+ messages in thread
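[Editorial note: the cpufreq_fast_switch_count convention used in the patch above — positive meaning "fast switching enabled", negative meaning "transition notifiers registered", zero meaning "not decided yet" — can be sketched as a hypothetical stand-alone model. This is not kernel code; the locking (cpufreq_fast_switch_lock) is omitted.]

```c
#include <assert.h>

/*
 * Minimal model of the cpufreq_fast_switch_count protocol:
 *   > 0  fast switching enabled on that many policies,
 *   < 0  that many transition notifiers registered,
 *  == 0  neither side has claimed the mechanism yet.
 * The real code holds cpufreq_fast_switch_lock around every access.
 */
static int fast_switch_count;

/* Mirrors cpufreq_enable_fast_switch(): refuse if notifiers exist. */
static int enable_fast_switch(void)
{
	if (fast_switch_count < 0)
		return -1;
	fast_switch_count++;
	return 0;
}

/* Mirrors the CPUFREQ_TRANSITION_NOTIFIER case of
 * cpufreq_register_notifier(): refuse (-EBUSY in the patch) if fast
 * switching is in use anywhere. */
static int register_transition_notifier(void)
{
	if (fast_switch_count > 0)
		return -1;
	fast_switch_count--;
	return 0;
}
```

Once one side moves the count away from zero, the other side keeps failing until the count returns to zero, which is how the patch makes fast switching and transition notifiers mutually exclusive.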
* [PATCH v6 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-16 14:41 ` [PATCH v4 0/7] cpufreq: schedutil governor Rafael J. Wysocki ` (10 preceding siblings ...) 2016-03-17 15:54 ` [PATCH v6 6/7][Update] cpufreq: Support for fast frequency switching Rafael J. Wysocki @ 2016-03-17 16:01 ` Rafael J. Wysocki 2016-03-18 12:34 ` Patrick Bellasi 11 siblings, 1 reply; 158+ messages in thread From: Rafael J. Wysocki @ 2016-03-17 16:01 UTC (permalink / raw) To: Linux PM list, Peter Zijlstra, Juri Lelli Cc: Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Subject: [PATCH] cpufreq: schedutil: New governor based on scheduler utilization data Add a new cpufreq scaling governor, called "schedutil", that uses scheduler-provided CPU utilization information as input for making its decisions. Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add mechanism for registering utilization update callbacks) that introduced cpufreq_update_util() called by the scheduler on utilization changes (from CFS) and RT/DL task status updates. In particular, CPU frequency scaling decisions may be based on the utilization data passed to cpufreq_update_util() by CFS. The new governor is relatively simple. The frequency selection formula used by it depends on whether or not the utilization is frequency-invariant. In the frequency-invariant case the new CPU frequency is given by next_freq = 1.25 * max_freq * util / max where util and max are the last two arguments of cpufreq_update_util(). In turn, if util is not frequency-invariant, the maximum frequency in the above formula is replaced with the current frequency of the CPU: next_freq = 1.25 * curr_freq * util / max The coefficient 1.25 corresponds to the frequency tipping point at (util / max) = 0.8.
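[Editorial note: the tipping point described above can be checked with a stand-alone sketch of the formula. This is hypothetical illustration code, not part of the patch; the kernel implements the same arithmetic in get_next_freq() below, with 1.25 expressed in integers as freq + freq/4.]

```c
#include <assert.h>

/*
 * Sketch of the schedutil frequency formula:
 *   next_freq = 1.25 * ref_freq * util / max
 * where ref_freq is max_freq when util is frequency-invariant and
 * curr_freq otherwise. The factor 1.25 is ref_freq + ref_freq/4,
 * keeping everything in integer arithmetic.
 */
static unsigned int next_freq(unsigned int ref_freq,
			      unsigned long util, unsigned long max)
{
	return (ref_freq + (ref_freq >> 2)) * util / max;
}
```

At the tipping point util/max = 0.8 this returns exactly ref_freq, so any utilization above 80% of the reference capacity selects a higher frequency.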
All of the computations are carried out in the utilization update handlers provided by the new governor. One of those handlers is used for cpufreq policies shared between multiple CPUs and the other one is for policies with one CPU only (and therefore it doesn't need to use any extra synchronization means). The governor supports fast frequency switching if that is supported by the cpufreq driver in use and possible for the given policy. In the fast switching case, all operations of the governor take place in its utilization update handlers. If fast switching cannot be used, the frequency switch operations are carried out with the help of a work item which only calls __cpufreq_driver_target() (under a mutex) to trigger a frequency update (to a value already computed beforehand in one of the utilization update handlers). Currently, the governor treats all of the RT and DL tasks as "unknown utilization" and sets the frequency to the allowed maximum when updated from the RT or DL sched classes. That heavy-handed approach should be replaced with something more subtle and specifically targeted at RT and DL tasks. The governor shares some tunables management code with the "ondemand" and "conservative" governors and uses some common definitions from cpufreq_governor.h, but apart from that it is stand-alone. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- Addressing comments from Peter and Juri, fixes. Changes from v5: - Fixed sugov_update_commit() to set sg_policy->next_freq properly in the "work item" branch. - Used smp_processor_id() in sugov_irq_work() and restored work_in_progress. Changes from v4: - Use TICK_NSEC in sugov_next_freq_shared(). - Use schedule_work_on() to schedule work items and replace work_in_progress with work_cpu (which is used both for scheduling work items and as a "work in progress" marker). - Rearrange sugov_update_commit() to only check policy->min/max if fast switching is enabled. 
- Replace util > max checks with util == ULONG_MAX checks to make it clear that they are about a special case (RT/DL). Changes from v3: - The "next frequency" formula based on http://marc.info/?l=linux-acpi&m=145756618321500&w=4 and http://marc.info/?l=linux-kernel&m=145760739700716&w=4 - The governor goes into kernel/sched/ (again). Changes from v2: - The governor goes into drivers/cpufreq/. - The "next frequency" formula has an additional 1.1 factor to allow more util/max values to map onto the top-most frequency in case the distance between that and the previous one is unproportionally small. - sugov_update_commit() traces CPU frequency even if the new one is the same as the previous one (otherwise, if the system is 100% loaded for long enough, powertop starts to report that all CPUs are 100% idle). --- drivers/cpufreq/Kconfig | 26 + kernel/sched/Makefile | 1 kernel/sched/cpufreq_schedutil.c | 528 +++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 8 4 files changed, 563 insertions(+) Index: linux-pm/drivers/cpufreq/Kconfig =================================================================== --- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor. + +config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL + bool "schedutil" + select CPU_FREQ_GOV_SCHEDUTIL + select CPU_FREQ_GOV_PERFORMANCE + help + Use the 'schedutil' CPUFreq governor by default. If unsure, + have a look at the help section of that governor. The fallback + governor will be 'performance'. + endchoice config CPU_FREQ_GOV_PERFORMANCE @@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE If in doubt, say N. 
+config CPU_FREQ_GOV_SCHEDUTIL + tristate "'schedutil' cpufreq policy governor" + depends on CPU_FREQ + select CPU_FREQ_GOV_ATTR_SET + select IRQ_WORK + help + The frequency selection formula used by this governor is analogous + to the one used by 'ondemand', but instead of computing CPU load + as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU + utilization data provided by the scheduler as input. + + To compile this driver as a module, choose M here: the + module will be called cpufreq_schedutil. + + If in doubt, say N. + comment "CPU frequency scaling drivers" config CPUFREQ_DT Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- /dev/null +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -0,0 +1,528 @@ +/* + * CPUFreq governor based on scheduler-provided CPU utilization data. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/cpufreq.h> +#include <linux/module.h> +#include <linux/slab.h> +#include <trace/events/power.h> + +#include "sched.h" + +struct sugov_tunables { + struct gov_attr_set attr_set; + unsigned int rate_limit_us; +}; + +struct sugov_policy { + struct cpufreq_policy *policy; + + struct sugov_tunables *tunables; + struct list_head tunables_hook; + + raw_spinlock_t update_lock; /* For shared policies */ + u64 last_freq_update_time; + s64 freq_update_delay_ns; + unsigned int next_freq; + + /* The next fields are only needed if fast switch cannot be used. 
*/ + struct irq_work irq_work; + struct work_struct work; + struct mutex work_lock; + bool work_in_progress; + + bool need_freq_update; +}; + +struct sugov_cpu { + struct update_util_data update_util; + struct sugov_policy *sg_policy; + + /* The fields below are only needed when sharing a policy. */ + unsigned long util; + unsigned long max; + u64 last_update; +}; + +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); + +/************************ Governor internals ***********************/ + +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) +{ + u64 delta_ns; + + if (sg_policy->work_in_progress) + return false; + + if (unlikely(sg_policy->need_freq_update)) { + sg_policy->need_freq_update = false; + return true; + } + + delta_ns = time - sg_policy->last_freq_update_time; + return (s64)delta_ns >= sg_policy->freq_update_delay_ns; +} + +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, + unsigned int next_freq) +{ + struct cpufreq_policy *policy = sg_policy->policy; + + sg_policy->last_freq_update_time = time; + + if (policy->fast_switch_enabled) { + if (next_freq > policy->max) + next_freq = policy->max; + else if (next_freq < policy->min) + next_freq = policy->min; + + if (sg_policy->next_freq == next_freq) { + trace_cpu_frequency(policy->cur, smp_processor_id()); + return; + } + sg_policy->next_freq = next_freq; + next_freq = cpufreq_driver_fast_switch(policy, next_freq); + if (next_freq == CPUFREQ_ENTRY_INVALID) + return; + + policy->cur = next_freq; + trace_cpu_frequency(next_freq, smp_processor_id()); + } else if (sg_policy->next_freq != next_freq) { + sg_policy->next_freq = next_freq; + sg_policy->work_in_progress = true; + irq_work_queue(&sg_policy->irq_work); + } +} + +/** + * get_next_freq - Compute a new frequency for a given cpufreq policy. + * @policy: cpufreq policy object to compute the new frequency for. + * @util: Current CPU utilization. + * @max: CPU capacity. 
+ * + * If the utilization is frequency-invariant, choose the new frequency to be + * proportional to it, that is + * + * next_freq = C * max_freq * util / max + * + * Otherwise, approximate the would-be frequency-invariant utilization by + * util_raw * (curr_freq / max_freq) which leads to + * + * next_freq = C * curr_freq * util_raw / max + * + * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8. + */ +static unsigned int get_next_freq(struct cpufreq_policy *policy, + unsigned long util, unsigned long max) +{ + unsigned int freq = arch_scale_freq_invariant() ? + policy->cpuinfo.max_freq : policy->cur; + + return (freq + (freq >> 2)) * util / max; +} + +static void sugov_update_single(struct update_util_data *hook, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int next_f; + + if (!sugov_should_update_freq(sg_policy, time)) + return; + + next_f = util == ULONG_MAX ? 
policy->cpuinfo.max_freq : + get_next_freq(policy, util, max); + sugov_update_commit(sg_policy, time, next_f); +} + +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy, + unsigned long util, unsigned long max) +{ + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int max_f = policy->cpuinfo.max_freq; + u64 last_freq_update_time = sg_policy->last_freq_update_time; + unsigned int j; + + if (util == ULONG_MAX) + return max_f; + + for_each_cpu(j, policy->cpus) { + struct sugov_cpu *j_sg_cpu; + unsigned long j_util, j_max; + u64 delta_ns; + + if (j == smp_processor_id()) + continue; + + j_sg_cpu = &per_cpu(sugov_cpu, j); + /* + * If the CPU utilization was last updated before the previous + * frequency update and the time elapsed between the last update + * of the CPU utilization and the last frequency update is long + * enough, don't take the CPU into account as it probably is + * idle now. + */ + delta_ns = last_freq_update_time - j_sg_cpu->last_update; + if ((s64)delta_ns > TICK_NSEC) + continue; + + j_util = j_sg_cpu->util; + if (j_util == ULONG_MAX) + return max_f; + + j_max = j_sg_cpu->max; + if (j_util * max > j_max * util) { + util = j_util; + max = j_max; + } + } + + return get_next_freq(policy, util, max); +} + +static void sugov_update_shared(struct update_util_data *hook, u64 time, + unsigned long util, unsigned long max) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + unsigned int next_f; + + raw_spin_lock(&sg_policy->update_lock); + + sg_cpu->util = util; + sg_cpu->max = max; + sg_cpu->last_update = time; + + if (sugov_should_update_freq(sg_policy, time)) { + next_f = sugov_next_freq_shared(sg_policy, util, max); + sugov_update_commit(sg_policy, time, next_f); + } + + raw_spin_unlock(&sg_policy->update_lock); +} + +static void sugov_work(struct work_struct *work) +{ + struct sugov_policy *sg_policy = container_of(work, struct 
sugov_policy, work); + + mutex_lock(&sg_policy->work_lock); + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, + CPUFREQ_RELATION_L); + mutex_unlock(&sg_policy->work_lock); + + sg_policy->work_in_progress = false; +} + +static void sugov_irq_work(struct irq_work *irq_work) +{ + struct sugov_policy *sg_policy; + + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); + schedule_work_on(smp_processor_id(), &sg_policy->work); +} + +/************************** sysfs interface ************************/ + +static struct sugov_tunables *global_tunables; +static DEFINE_MUTEX(global_tunables_lock); + +static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set) +{ + return container_of(attr_set, struct sugov_tunables, attr_set); +} + +static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + + return sprintf(buf, "%u\n", tunables->rate_limit_us); +} + +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, + size_t count) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + struct sugov_policy *sg_policy; + unsigned int rate_limit_us; + int ret; + + ret = sscanf(buf, "%u", &rate_limit_us); + if (ret != 1) + return -EINVAL; + + tunables->rate_limit_us = rate_limit_us; + + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) + sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC; + + return count; +} + +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us); + +static struct attribute *sugov_attributes[] = { + &rate_limit_us.attr, + NULL +}; + +static struct kobj_type sugov_tunables_ktype = { + .default_attrs = sugov_attributes, + .sysfs_ops = &governor_sysfs_ops, +}; + +/********************** cpufreq governor interface *********************/ + +static struct cpufreq_governor schedutil_gov; + +static struct sugov_policy *sugov_policy_alloc(struct 
cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + + sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL); + if (!sg_policy) + return NULL; + + sg_policy->policy = policy; + init_irq_work(&sg_policy->irq_work, sugov_irq_work); + INIT_WORK(&sg_policy->work, sugov_work); + mutex_init(&sg_policy->work_lock); + raw_spin_lock_init(&sg_policy->update_lock); + return sg_policy; +} + +static void sugov_policy_free(struct sugov_policy *sg_policy) +{ + mutex_destroy(&sg_policy->work_lock); + kfree(sg_policy); +} + +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy) +{ + struct sugov_tunables *tunables; + + tunables = kzalloc(sizeof(*tunables), GFP_KERNEL); + if (tunables) + gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook); + + return tunables; +} + +static void sugov_tunables_free(struct sugov_tunables *tunables) +{ + if (!have_governor_per_policy()) + global_tunables = NULL; + + kfree(tunables); +} + +static int sugov_init(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + struct sugov_tunables *tunables; + unsigned int lat; + int ret = 0; + + /* State should be equivalent to EXIT */ + if (policy->governor_data) + return -EBUSY; + + sg_policy = sugov_policy_alloc(policy); + if (!sg_policy) + return -ENOMEM; + + mutex_lock(&global_tunables_lock); + + if (global_tunables) { + if (WARN_ON(have_governor_per_policy())) { + ret = -EINVAL; + goto free_sg_policy; + } + policy->governor_data = sg_policy; + sg_policy->tunables = global_tunables; + + gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook); + goto out; + } + + tunables = sugov_tunables_alloc(sg_policy); + if (!tunables) { + ret = -ENOMEM; + goto free_sg_policy; + } + + tunables->rate_limit_us = LATENCY_MULTIPLIER; + lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC; + if (lat) + tunables->rate_limit_us *= lat; + + if (!have_governor_per_policy()) + global_tunables = tunables; + + policy->governor_data = sg_policy; + 
sg_policy->tunables = tunables; + + ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype, + get_governor_parent_kobj(policy), "%s", + schedutil_gov.name); + if (!ret) + goto out; + + /* Failure, so roll back. */ + policy->governor_data = NULL; + sugov_tunables_free(tunables); + + free_sg_policy: + pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret); + sugov_policy_free(sg_policy); + + out: + mutex_unlock(&global_tunables_lock); + return ret; +} + +static int sugov_exit(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + struct sugov_tunables *tunables = sg_policy->tunables; + unsigned int count; + + mutex_lock(&global_tunables_lock); + + count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook); + policy->governor_data = NULL; + if (!count) + sugov_tunables_free(tunables); + + mutex_unlock(&global_tunables_lock); + + sugov_policy_free(sg_policy); + return 0; +} + +static int sugov_start(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + cpufreq_enable_fast_switch(policy); + + sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC; + sg_policy->last_freq_update_time = 0; + sg_policy->next_freq = UINT_MAX; + sg_policy->work_in_progress = false; + sg_policy->need_freq_update = false; + + for_each_cpu(cpu, policy->cpus) { + struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu); + + sg_cpu->sg_policy = sg_policy; + if (policy_is_shared(policy)) { + sg_cpu->util = ULONG_MAX; + sg_cpu->max = 0; + sg_cpu->last_update = 0; + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, + sugov_update_shared); + } else { + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, + sugov_update_single); + } + } + return 0; +} + +static int sugov_stop(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + for_each_cpu(cpu, 
policy->cpus) + cpufreq_remove_update_util_hook(cpu); + + synchronize_sched(); + + irq_work_sync(&sg_policy->irq_work); + cancel_work_sync(&sg_policy->work); + return 0; +} + +static int sugov_limits(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + + if (!policy->fast_switch_enabled) { + mutex_lock(&sg_policy->work_lock); + + if (policy->max < policy->cur) + __cpufreq_driver_target(policy, policy->max, + CPUFREQ_RELATION_H); + else if (policy->min > policy->cur) + __cpufreq_driver_target(policy, policy->min, + CPUFREQ_RELATION_L); + + mutex_unlock(&sg_policy->work_lock); + } + + sg_policy->need_freq_update = true; + return 0; +} + +int sugov_governor(struct cpufreq_policy *policy, unsigned int event) +{ + if (event == CPUFREQ_GOV_POLICY_INIT) { + return sugov_init(policy); + } else if (policy->governor_data) { + switch (event) { + case CPUFREQ_GOV_POLICY_EXIT: + return sugov_exit(policy); + case CPUFREQ_GOV_START: + return sugov_start(policy); + case CPUFREQ_GOV_STOP: + return sugov_stop(policy); + case CPUFREQ_GOV_LIMITS: + return sugov_limits(policy); + } + } + return -EINVAL; +} + +static struct cpufreq_governor schedutil_gov = { + .name = "schedutil", + .governor = sugov_governor, + .owner = THIS_MODULE, +}; + +static int __init sugov_module_init(void) +{ + return cpufreq_register_governor(&schedutil_gov); +} + +static void __exit sugov_module_exit(void) +{ + cpufreq_unregister_governor(&schedutil_gov); +} + +MODULE_AUTHOR("Rafael J. 
Wysocki <rafael.j.wysocki@intel.com>"); +MODULE_DESCRIPTION("Utilization-based CPU frequency selection"); +MODULE_LICENSE("GPL"); + +#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL +struct cpufreq_governor *cpufreq_default_governor(void) +{ + return &schedutil_gov; +} + +fs_initcall(sugov_module_init); +#else +module_init(sugov_module_init); +#endif +module_exit(sugov_module_exit); Index: linux-pm/kernel/sched/Makefile =================================================================== --- linux-pm.orig/kernel/sched/Makefile +++ linux-pm/kernel/sched/Makefile @@ -20,3 +20,4 @@ obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o obj-$(CONFIG_CPU_FREQ) += cpufreq.o +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o Index: linux-pm/kernel/sched/sched.h =================================================================== --- linux-pm.orig/kernel/sched/sched.h +++ linux-pm/kernel/sched/sched.h @@ -1786,3 +1786,11 @@ static inline void cpufreq_trigger_updat static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {} static inline void cpufreq_trigger_update(u64 time) {} #endif /* CONFIG_CPU_FREQ */ + +#ifdef arch_scale_freq_capacity +#ifndef arch_scale_freq_invariant +#define arch_scale_freq_invariant() (true) +#endif +#else /* arch_scale_freq_capacity */ +#define arch_scale_freq_invariant() (false) +#endif ^ permalink raw reply [flat|nested] 158+ messages in thread
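[Editorial note: one detail worth calling out in sugov_next_freq_shared() above is the comparison j_util * max > j_max * util, which finds the CPU with the highest util/max ratio without a per-CPU division. A hypothetical stand-alone sketch of that selection, not code from the patch:]

```c
#include <assert.h>

struct cpu_sample {
	unsigned long util;
	unsigned long max;
};

/*
 * Return the sample with the highest util/max ratio. Comparing
 * a.util/a.max against b.util/b.max is done by cross-multiplying,
 * as in sugov_next_freq_shared(), avoiding division and the
 * rounding error that would come with it.
 */
static struct cpu_sample pick_max_ratio(const struct cpu_sample *s, int n)
{
	struct cpu_sample best = s[0];
	int i;

	for (i = 1; i < n; i++)
		if (s[i].util * best.max > s[i].max * best.util)
			best = s[i];
	return best;
}
```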
* Re: [PATCH v6 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data 2016-03-17 16:01 ` [PATCH v6 7/7][Update] cpufreq: schedutil: New governor based on scheduler utilization data Rafael J. Wysocki @ 2016-03-18 12:34 ` Patrick Bellasi 0 siblings, 0 replies; 158+ messages in thread From: Patrick Bellasi @ 2016-03-18 12:34 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM list, Peter Zijlstra, Juri Lelli, Steve Muckle, ACPI Devel Maling List, Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar, Vincent Guittot, Michael Turquette, Ingo Molnar Hi Rafael, all, I have (yet another) consideration regarding the definition of the margin for the frequency selection. On 17-Mar 17:01, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Subject: [PATCH] cpufreq: schedutil: New governor based on scheduler utilization data > > Add a new cpufreq scaling governor, called "schedutil", that uses > scheduler-provided CPU utilization information as input for making > its decisions. > > Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add > mechanism for registering utilization update callbacks) that > introduced cpufreq_update_util() called by the scheduler on > utilization changes (from CFS) and RT/DL task status updates. > In particular, CPU frequency scaling decisions may be based on > the the utilization data passed to cpufreq_update_util() by CFS. > > The new governor is relatively simple. > > The frequency selection formula used by it depends on whether or not > the utilization is frequency-invariant. In the frequency-invariant > case the new CPU frequency is given by > > next_freq = 1.25 * max_freq * util / max > > where util and max are the last two arguments of cpufreq_update_util(). 
> In turn, if util is not frequency-invariant, the maximum frequency in > the above formula is replaced with the current frequency of the CPU: > > next_freq = 1.25 * curr_freq * util / max > > The coefficient 1.25 corresponds to the frequency tipping point at > (util / max) = 0.8.

In both these formulas the OPP jump is driven by a margin which is effectively proportional to the capacity of the current OPP. For example, if we consider a simple system with this set of OPPs: {200, 400, 600, 800, 1000} MHz, and we apply the formula for the frequency-invariant case, we get:

  util/max   min_opp   min_util   margin
    1.0       1000      0.80       20%
    0.8        800      0.64       16%
    0.6        600      0.48       12%
    0.4        400      0.32        8%
    0.2        200      0.16        4%

Where:
- min_opp: the minimum OPP which can satisfy a (util/max) capacity request
- min_util: the minimum utilization value which effectively triggers a switch to the upper OPP
- margin: the effective capacity margin to remain at min_opp

This means that when running at the lowest OPP we can build up to 16% utilization (i.e. 4% less than the capacity of that OPP) before jumping to the next one. But, for example, to switch to the 800 MHz OPP we need to build up just 4% extra utilization beyond the 60% capacity of the 600 MHz OPP (i.e. we jump at 64%, 16% less than the capacity of the 800 MHz OPP). This is a really simple example, with OPPs that are equally distributed. However, the question is: does it really make sense to have different effective margins for different starting OPPs?

AFAIU, this solution biases the frequency selection towards higher OPPs. The bigger the utilization of a CPU, the more likely we are to run at a higher minimum OPP. The advantage is a reduced time to reach the highest OPP, which can be beneficial for performance-oriented workloads. The disadvantage is a quite likely reduction of residencies at mid-range OPPs. We should also consider that, at least in its current implementation, PELT "builds up" more slowly when running at lower OPPs, which further amplifies this imbalance in OPP residencies.
IMO, biasing the selection of one OPP over another is something which
sounds more like a "policy" than a "mechanism".  Since the goal here
should be to provide just a mechanism, perhaps a different approach
can be evaluated.

Have we ever considered using a "constant margin" for each OPP?

The value of such a margin can still be defined as a (configurable)
percentage of the max (or min) OPP.  But once defined, the same margin
can be used to decide when to switch to the next OPP.

In the previous example, considering a 5% margin wrt the max capacity,
these are the new margins:

    util/max   min_opp   min_util   margin
      1.0       1000       0.95      5%
      0.8        800       0.75      5%
      0.6        600       0.55      5%
      0.4        400       0.35      5%
      0.2        200       0.15      5%

That means that whether we are running at the lowest OPP or at a
mid-range one, we always need to build up the same amount of
utilization before switching to the next one.

What is the translation into residency times?  This is still affected
by the PELT behavior when running at different OPPs, but IMO it should
improve a bit the fairness of OPP selection.

Moreover, from an implementation standpoint, what is now a couple of
multiplications and a comparison can potentially be reduced to a
single comparison, e.g.

    next_freq = util > (curr_cap - margin) ? curr_freq + 1 : curr_freq

where margin is pre-computed to be, for example, 51 (i.e. 5% of 1024),
as well as (curr_cap - margin), which can be cached at each OPP change.

-- 
#include <best/regards.h>

Patrick Bellasi
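For comparison, a minimal Python sketch of this constant-margin idea;
the OPP table, the 5% figure, and the single-step jump to the next OPP
come from the example in this mail, while the `next_freq` helper and
its signature are illustrative assumptions, not kernel code:

```python
# Constant-margin OPP selection sketch: jump to the next OPP once
# utilization comes within a fixed margin (5% of max capacity here)
# of the current OPP's capacity, regardless of which OPP we are at.

opps = [200, 400, 600, 800, 1000]   # example OPP set, MHz
max_freq = opps[-1]
margin = 0.05                        # 5% of max capacity (e.g. 51/1024)

def next_freq(curr_freq, util, max_util):
    curr_cap = curr_freq / max_freq              # capacity of current OPP
    if curr_freq < max_freq and util / max_util > curr_cap - margin:
        return opps[opps.index(curr_freq) + 1]   # step to the next OPP
    return curr_freq

print(next_freq(200, 0.16, 1.0))   # 400: 0.16 exceeds 0.20 - 0.05
print(next_freq(800, 0.70, 1.0))   # 800: below the 0.75 threshold
```

Note the second case: under the proportional-margin formula the jump
from 800 MHz would already have happened at util = 0.64, while the
constant margin keeps the CPU at 800 MHz until util reaches 0.75.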