* [GIT PULL] Scheduler changes for v6.7
From: Ingo Molnar @ 2023-10-28 12:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Borislav Petkov,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

Linus,

Please pull the latest sched/core git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2023-10-28

   # HEAD: 984ffb6a4366752c949f7b39640aecdce222607f sched/fair: Remove SIS_PROP
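
To fetch and verify the tag locally, the usual flow is roughly the
following (a sketch):

   git fetch git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2023-10-28
   git log -1 --oneline FETCH_HEAD   # should report 984ffb6a4366 ("sched/fair: Remove SIS_PROP")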

Scheduler changes for v6.7 are:

 - Fair scheduler (SCHED_OTHER) improvements:

    - Remove the old and now unused SIS_PROP code & option
    - Scan cluster before LLC in the wake-up path
    - Use candidate prev/recent_used CPU if scanning failed for cluster wakeup

 - NUMA scheduling improvements:

    - Improve the VMA access-PID code to better skip/scan VMAs
    - Extend tracing to cover VMA-skipping decisions
    - Improve/fix the recently introduced sched_numa_find_nth_cpu() code
    - Generalize numa_map_to_online_node()

 - Energy scheduling improvements:

    - Remove the EM_MAX_COMPLEXITY limit
    - Add tracepoints to track energy computation
    - Make the behavior of the 'sched_energy_aware' sysctl more consistent
    - Consolidate and clean up access to a CPU's max compute capacity
    - Fix uclamp code corner cases

 - RT scheduling improvements:

    - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
    - Drive the ->rto_mask with rt_rq->pushable_tasks updates

 - Scheduler scalability improvements:

    - Rate-limit updates to tg->load_avg
    - On x86 disable IBRS when CPU is offline to improve single-threaded performance
    - Micro-optimize in_task() and in_interrupt()
    - Micro-optimize the PSI code
    - Avoid updating PSI triggers and ->rtpoll_total when there are no state changes

 - Core scheduler infrastructure improvements:

    - Use saved_state to reduce some spurious freezer wakeups
    - Bring in a handful of fast-headers improvements to scheduler headers
    - Make the scheduler UAPI headers more widely usable by user-space
    - Simplify the control flow of scheduler syscalls by using lock guards
    - Fix sched_setaffinity() vs. CPU hotplug race

 - Scheduler debuggability improvements:

    - Disallow writing invalid values to sched_rt_period_us
    - Fix a race in the rq-clock debugging code triggering warnings
    - Fix a warning in the bandwidth distribution code
    - Micro-optimize in_atomic_preempt_off() checks
    - Enforce that the tasklist_lock is held in for_each_thread()
    - Print the TGID in sched_show_task()
    - Remove the /proc/sys/kernel/sched_child_runs_first sysctl

 - Misc cleanups & fixes

 Thanks,

	Ingo

------------------>
Aaron Lu (1):
      sched/fair: Ratelimit update to tg->load_avg

Barry Song (2):
      sched: Add cpus_share_resources API
      sched/fair: Scan cluster before scanning LLC in wake-up path

Chengming Zhou (1):
      sched/fair: Fix cfs_rq_is_decayed() on !SMP

Colin Ian King (1):
      sched/headers: Remove comment referring to rq::cpu_load, since this has been removed

Cyril Hrubis (3):
      sched/rt: Disallow writing invalid values to sched_rt_period_us
      sched/rt/docs: Clarify & fix sched_rt_* sysctl docs
      sched/rt/docs: Use 'real-time' instead of 'realtime'

Elliot Berman (2):
      sched/core: Remove ifdeffery for saved_state
      freezer,sched: Use saved_state to reduce some spurious wakeups

Fan Yu (1):
      sched/psi: Update poll => rtpoll in relevant comments

Finn Thain (1):
      sched/core: Optimize in_task() and in_interrupt() a bit

GUO Zihua (1):
      sched/headers: Remove duplicated includes in kernel/sched/sched.h

Haifeng Xu (1):
      sched/psi: Bail out early from irq time accounting

Hao Jia (1):
      sched/core: Fix RQCF_ACT_SKIP leak

Ingo Molnar (12):
      sched/fair: Rename check_preempt_wakeup() to check_preempt_wakeup_fair()
      sched/fair: Rename check_preempt_curr() to wakeup_preempt()
      sched/headers: Add header guard to <linux/sched/deadline.h>
      sched/headers: Standardize the <linux/sched/vhost_task.h> header guard name
      sched/headers: Standardize the <linux/sched/type.h> header guard #endif
      sched/headers: Standardize the <linux/sched/smt.h> header guard #endif
      <linux/list.h>: Introduce the list_for_each_reverse() method
      sched/core: Update stale comment in try_to_wake_up()
      sched/nohz: Update idle load-balancing (ILB) comments
      sched/nohz: Use consistent variable names in find_new_ilb() and kick_ilb()
      sched/nohz: Remove unnecessarily complex error handling pattern from find_new_ilb()
      sched/topology: Move the declaration of 'schedutil_gov' to kernel/sched/sched.h

Jiapeng Chong (1):
      sched/fair: Remove duplicate #include

Joel Fernandes (Google) (1):
      sched/nohz: Update comments about NEWILB_KICK

Josh Don (2):
      sched/fair: Make cfs_rq->throttled_csd_list available on !SMP
      sched/fair: Fix warning in bandwidth distribution

Kir Kolyshkin (1):
      sched/headers: Move 'struct sched_param' out of uapi, to work around glibc/musl breakage

Liming Wu (1):
      sched/debug: Avoid checking in_atomic_preempt_off() twice in schedule_debug()

Mathieu Desnoyers (1):
      sched: Make PELT acronym definition searchable

Matthew Wilcox (Oracle) (1):
      sched: Assert for_each_thread() is properly locked

Mel Gorman (5):
      sched/numa: Document vma_numab_state fields
      sched/numa: Rename vma_numab_state::access_pids[] => ::pids_active[], ::next_pid_reset => ::pids_active_reset
      sched/numa: Trace decisions related to skipping VMAs
      sched/numa: Complete scanning of partial VMAs regardless of PID activity
      sched/numa: Complete scanning of inactive VMAs when there is no alternative

Peter Zijlstra (11):
      sched: Simplify set_user_nice()
      sched: Simplify syscalls
      sched: Simplify sched_{set,get}affinity()
      sched: Simplify yield_to()
      sched: Simplify sched_rr_get_interval()
      sched: Simplify sched_move_task()
      sched: Simplify tg_set_cfs_bandwidth()
      sched: Misc cleanups
      sched/topology: Rename 'DIE' domain to 'PKG'
      sched: Fix stop_one_cpu_nowait() vs hotplug
      sched/fair: Remove SIS_PROP

Pierre Gondois (1):
      sched/topology: Remove the EM_MAX_COMPLEXITY limit

Qais Yousef (3):
      sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
      sched/uclamp: Ignore (util == 0) optimization in feec() when p_util_max = 0
      sched/debug: Add new tracepoint to track compute energy computation

Raghavendra K T (1):
      sched/numa: Move up the access pid reset logic

Sebastian Andrzej Siewior (2):
      sched/debug: Remove the /proc/sys/kernel/sched_child_runs_first sysctl
      sched/debug: Update stale reference to sched_debug.c

Shrikanth Hegde (1):
      sched/topology: Change behaviour of the 'sched_energy_aware' sysctl, based on the platform

Uros Bizjak (1):
      sched/core: Use do-while instead of for loop in set_nr_if_polling()

Valentin Schneider (2):
      sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask
      sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded

Vincent Guittot (1):
      sched/topology: Consolidate and clean up access to a CPU's max compute capacity

Waiman Long (4):
      x86/speculation: Add __update_spec_ctrl() helper
      x86/idle: Disable IBRS when CPU is offline to improve single-threaded performance
      intel_idle: Use __update_spec_ctrl() in intel_idle_ibrs()
      intel_idle: Add ibrs_off module parameter to force-disable IBRS

Wang Jinchao (1):
      sched/core: Refactor the task_flags check for worker sleeping in sched_submit_work()

Yajun Deng (2):
      sched/debug: Print 'tgid' in sched_show_task()
      sched/rt: Change the type of 'sysctl_sched_rt_period' from 'unsigned int' to 'int'

Yang Yang (3):
      sched/psi: Change update_triggers() to a 'void' function
      sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
      sched/psi: Delete the 'update_total' function parameter from update_triggers()

Yicong Yang (1):
      sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup

Yiwei Lin (1):
      sched/fair: Remove unused 'curr' argument from pick_next_entity()

Yu Liao (1):
      sched/headers: Remove duplicate header inclusions

Yury Norov (6):
      numa: Generalize numa_map_to_online_node()
      sched/fair: Fix open-coded numa_nearest_node()
      sched/topology: Fix sched_numa_find_nth_cpu() in CPU-less case
      sched/topology: Fix sched_numa_find_nth_cpu() in non-NUMA case
      sched/topology: Handle NUMA_NO_NODE in sched_numa_find_nth_cpu()
      sched/topology: Fix sched_numa_find_nth_cpu() comment


 Documentation/admin-guide/pm/intel_idle.rst |  17 +-
 Documentation/admin-guide/sysctl/kernel.rst |   3 +-
 Documentation/scheduler/sched-capacity.rst  |  13 +-
 Documentation/scheduler/sched-energy.rst    |  29 +-
 Documentation/scheduler/sched-rt-group.rst  |  40 +-
 arch/powerpc/kernel/smp.c                   |   4 +-
 arch/s390/kernel/topology.c                 |   2 +-
 arch/x86/include/asm/spec-ctrl.h            |  11 +
 arch/x86/kernel/smpboot.c                   |  12 +-
 drivers/idle/intel_idle.c                   |  18 +-
 include/linux/cpu.h                         |   2 +
 include/linux/list.h                        |   8 +
 include/linux/mm.h                          |   4 +-
 include/linux/mm_types.h                    |  31 +-
 include/linux/numa.h                        |   7 +-
 include/linux/preempt.h                     |  15 +-
 include/linux/sched.h                       |   7 +-
 include/linux/sched/deadline.h              |   4 +
 include/linux/sched/numa_balancing.h        |  10 +
 include/linux/sched/sd_flags.h              |   7 +
 include/linux/sched/signal.h                |   3 +-
 include/linux/sched/smt.h                   |   2 +-
 include/linux/sched/topology.h              |  10 +-
 include/linux/sched/types.h                 |   2 +-
 include/linux/sched/vhost_task.h            |   7 +-
 include/linux/topology.h                    |   2 +-
 include/trace/events/sched.h                |  57 +++
 include/uapi/linux/sched/types.h            |   4 -
 kernel/freezer.c                            |  41 +-
 kernel/sched/build_utility.c                |   1 -
 kernel/sched/core.c                         | 651 ++++++++++++----------------
 kernel/sched/cpudeadline.c                  |   2 +-
 kernel/sched/deadline.c                     |  73 +---
 kernel/sched/debug.c                        |   7 +-
 kernel/sched/fair.c                         | 432 ++++++++++--------
 kernel/sched/features.h                     |   1 -
 kernel/sched/idle.c                         |   4 +-
 kernel/sched/pelt.c                         |   2 +-
 kernel/sched/psi.c                          |  58 ++-
 kernel/sched/rt.c                           |  95 ++--
 kernel/sched/sched.h                        |  45 +-
 kernel/sched/stop_task.c                    |   4 +-
 kernel/sched/topology.c                     | 213 +++++----
 lib/cpumask.c                               |   4 +-
 mm/mempolicy.c                              |  18 +-
 45 files changed, 1015 insertions(+), 967 deletions(-)

* Re: [GIT PULL] Scheduler changes for v6.7
From: pr-tracker-bot @ 2023-10-30 23:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

The pull request you sent on Sat, 28 Oct 2023 14:23:34 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2023-10-28

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/63ce50fff9240d66cf3b59663f458f55ba6dcfcc

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

* [GIT PULL] Scheduler changes for v6.8
From: Ingo Molnar @ 2024-01-08 14:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider


Linus,

Please pull the latest sched/core git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2024-01-08

   # HEAD: cdb3033e191fd03da2d7da23b9cd448dfa180a8e Merge branch 'sched/urgent' into sched/core, to pick up pending v6.7 fixes for the v6.8 merge window

Scheduler changes for v6.8:

 - Energy scheduling:

    - Consolidate how the max compute capacity is
      used in the scheduler and how we calculate
      the frequency for a level of utilization.

    - Rework interface between the scheduler and
      the schedutil governor

    - Simplify the util_est logic

 - Deadline scheduler:

    - Work more towards reducing SCHED_DEADLINE
      starvation of lower-priority tasks (e.g., SCHED_OTHER)
      when higher-priority tasks monopolize CPU
      cycles, via the introduction of 'deadline servers'
      (nested/2-level scheduling).
      "Fair servers" to make use of this facility are
      not introduced yet.

 - EEVDF:

    - Introduce O(1) fastpath for EEVDF task selection

 - NUMA balancing:

    - Tune the NUMA-balancing vma scanning logic some more,
      to better distribute the probability
      of a particular vma getting scanned.

 - Plus misc fixes, cleanups and updates.

 Thanks,

	Ingo

------------------>
Abel Wu (2):
      sched/eevdf: Sort the rbtree by virtual deadline
      sched/eevdf: O(1) fastpath for task selection

Elliot Berman (1):
      freezer,sched: Clean saved_state when restoring it during thaw

Frederic Weisbecker (2):
      sched/cpuidle: Comment about timers requirements VS idle handler
      sched/timers: Explain why idle task schedules out on remote timer enqueue

Paul E. McKenney (1):
      sched: Use WRITE_ONCE() for p->on_rq

Peter Zijlstra (6):
      sched: Unify runtime accounting across classes
      sched: Remove vruntime from trace_sched_stat_runtime()
      sched: Unify more update_curr*()
      sched/deadline: Collect sched_dl_entity initialization
      sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
      sched/deadline: Introduce deadline servers

Pierre Gondois (1):
      sched/fair: Use all little CPUs for CPU-bound workloads

Raghavendra K T (1):
      sched/numa: Fix mm numa_scan_seq based unconditional scan

Vincent Guittot (13):
      sched/pelt: Avoid underestimation of task utilization
      sched/cpufreq: Rework schedutil governor performance estimation
      sched/cpufreq: Rework iowait boost
      sched/topology: Add a new arch_scale_freq_ref() method
      cpufreq: Use the fixed and coherent frequency for scaling capacity
      cpufreq/schedutil: Use a fixed reference frequency
      energy_model: Use a fixed reference frequency
      cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}()
      cpufreq/cppc: Set the frequency used for computing the capacity
      arm64/amu: Use capacity_ref_freq() to set AMU ratio
      sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
      sched/fair: Simplify util_est
      sched/fair: Fix tg->load when offlining a CPU

Wang Jinchao (1):
      sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()

Wenyu Huang (1):
      sched/doc: Update documentation after renames and synchronize Chinese version

Yiwei Lin (1):
      sched/fair: Update min_vruntime for reweight_entity() correctly


 Documentation/scheduler/sched-design-CFS.rst       |   8 +-
 Documentation/scheduler/schedutil.rst              |   7 +-
 .../zh_CN/scheduler/sched-design-CFS.rst           |   8 +-
 .../translations/zh_CN/scheduler/schedutil.rst     |   7 +-
 arch/arm/include/asm/topology.h                    |   1 +
 arch/arm64/include/asm/topology.h                  |   1 +
 arch/arm64/kernel/topology.c                       |  26 +-
 arch/riscv/include/asm/topology.h                  |   1 +
 drivers/acpi/cppc_acpi.c                           | 104 +++++
 drivers/base/arch_topology.c                       |  56 ++-
 drivers/cpufreq/cppc_cpufreq.c                     | 139 +-----
 drivers/cpufreq/cpufreq.c                          |   4 +-
 include/acpi/cppc_acpi.h                           |   2 +
 include/linux/arch_topology.h                      |   8 +
 include/linux/cpufreq.h                            |   1 +
 include/linux/energy_model.h                       |   7 +-
 include/linux/mm_types.h                           |   3 +
 include/linux/sched.h                              |  75 ++--
 include/linux/sched/topology.h                     |   8 +
 include/trace/events/sched.h                       |  15 +-
 kernel/freezer.c                                   |   1 +
 kernel/sched/core.c                                | 140 +++---
 kernel/sched/cpufreq_schedutil.c                   |  90 ++--
 kernel/sched/deadline.c                            | 477 +++++++++++++--------
 kernel/sched/debug.c                               |  18 +-
 kernel/sched/fair.c                                | 449 ++++++++++---------
 kernel/sched/features.h                            |   1 -
 kernel/sched/idle.c                                |  30 ++
 kernel/sched/pelt.h                                |   4 +-
 kernel/sched/rt.c                                  |  15 +-
 kernel/sched/sched.h                               | 135 ++----
 kernel/sched/stop_task.c                           |  13 +-
 32 files changed, 1054 insertions(+), 800 deletions(-)

* Re: [GIT PULL] Scheduler changes for v6.8
From: pr-tracker-bot @ 2024-01-09  4:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

The pull request you sent on Mon, 8 Jan 2024 15:07:07 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2024-01-08

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/bfe8eb3b85c571f7e94e1039f59b462505b8e0fc

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

* Re: [GIT PULL] Scheduler changes for v6.8
From: Linus Torvalds @ 2024-01-10 22:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider

On Mon, 8 Jan 2024 at 06:07, Ingo Molnar <mingo@kernel.org> wrote:
>
> Please pull the latest sched/core git tree from:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2024-01-08

Just a note that I'm currently bisecting into this merge for a
horrendous performance regression.

It makes my empty kernel build go from 22 seconds to 44 seconds, and
makes a full kernel build enormously slower too.

I haven't finished the bisection, but it's now inside *just* this
pull, so I can already tell that I'm going to revert something in
here, because this has been making my merge window miserable.
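
For reference, narrowing a regression like this down to one merge with
git bisect looks roughly as follows (a sketch; the good/bad endpoints
are placeholders for whatever commits bracket the regression):

   git bisect start
   git bisect bad HEAD        # tree that shows the 44-second empty builds
   git bisect good v6.7       # placeholder: last known-good point
   # rebuild and re-time the empty build at each step, marking each
   # candidate good or bad, until git names the first bad commit
   git bisect reset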

You've been warned,

                Linus

* Re: [GIT PULL] Scheduler changes for v6.8
From: Linus Torvalds @ 2024-01-10 22:41 UTC (permalink / raw)
  To: Ingo Molnar, Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Wed, 10 Jan 2024 at 14:19, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Just a note that I'm currently bisecting into this merge for a
> horrendous performance regression.
>
> It makes my empty kernel build go from 22 seconds to 44 seconds, and
> makes a full kernel build enormously slower too.
>
> I haven't finished the bisection, but it's now inside *just* this
> pull, so I can already tell that I'm going to revert something in
> here, because this has been making my merge window miserable.

It's one of these two:

  f12560779f9d sched/cpufreq: Rework iowait boost
  9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation

one more boot to go, then I'll try to revert whichever causes my
machine to perform horribly much worse.

             Linus

* Re: [GIT PULL] Scheduler changes for v6.8
From: Linus Torvalds @ 2024-01-10 22:57 UTC (permalink / raw)
  To: Ingo Molnar, Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It's one of these two:
>
>   f12560779f9d sched/cpufreq: Rework iowait boost
>   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
>
> one more boot to go, then I'll try to revert whichever causes my
> machine to perform horribly much worse.

I guess it should come as no surprise that the result is

   9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit

but to revert cleanly I will have to revert all of

      b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
      f12560779f9d ("sched/cpufreq: Rework iowait boost")
      9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
performance estimation")
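
A clean revert of all three, in the order listed, is roughly the
following (a sketch, assuming the reverts apply without conflicts):

   git revert --no-edit b3edde44e5d4 f12560779f9d 9c0b4bb7f630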

This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.

I'll keep that revert in my private test-tree for now (so that I have
a working machine again), but I'll move it to my main branch soon
unless somebody has a quick fix for this problem.

                Linus

* Re: [GIT PULL] Scheduler changes for v6.8
From: Vincent Guittot @ 2024-01-11  8:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider

On Wednesday, 10 Jan 2024 at 14:57:14 (-0800), Linus Torvalds wrote:
> On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It's one of these two:
> >
> >   f12560779f9d sched/cpufreq: Rework iowait boost
> >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> >
> > one more boot to go, then I'll try to revert whichever causes my
> > machine to perform horribly much worse.
> 
> I guess it should come as no surprise that the result is
> 
>    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> 
> but to revert cleanly I will have to revert all of
> 
>       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
>       f12560779f9d ("sched/cpufreq: Rework iowait boost")
>       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> performance estimation")
> 
> This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.

Could you confirm that the cpufreq governor is schedutil and the driver is
amd-pstate on your system?

I'm also interested in the output of amd_pstate, to confirm that it uses the
adjust_perf callback.

I suppose that you don't use the uclamp feature, and AMD doesn't use EAS, so
that leaves the change to the 'min' parameter of adjust_perf, which was
probably always 0 (unless you use the deadline scheduler) and which now takes
IRQ pressure into account.
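
Both can be checked from sysfs with something like this (assuming the
usual layout, with policy0 as a representative CPU):

   cat /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
   cat /sys/devices/system/cpu/cpufreq/policy0/scaling_driver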

Could you try the patch below, which restores the previous min value?

---
 kernel/sched/cpufreq_schedutil.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..3fe8ac6ce9cc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -194,10 +194,11 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
 {
 	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+	struct rq *rq = cpu_rq(sg_cpu->cpu);

 	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
 	util = max(util, boost);
-	sg_cpu->bw_min = min;
+	sg_cpu->bw_min = cpu_bw_dl(rq);
 	sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
 }

@@ -442,7 +443,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
 		sg_cpu->util = prev_util;

-	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
+	cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_min),
 				   sg_cpu->util, max_cap);

 	sg_cpu->sg_policy->last_freq_update_time = time;
--
2.34.1


> 
> I'll keep that revert in my private test-tree for now (so that I have
> a working machine again), but I'll move it to my main branch soon
> unless somebody has a quick fix for this problem.
> 
>                 Linus

* Re: [GIT PULL] Scheduler changes for v6.8
From: Ingo Molnar @ 2024-01-11  9:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 8 Jan 2024 at 06:07, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > Please pull the latest sched/core git tree from:
> >
> >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2024-01-08
> 
> Just a note that I'm currently bisecting into this merge for a
> horrendous performance regression.
> 
> It makes my empty kernel build go from 22 seconds to 44 seconds, and
> makes a full kernel build enormously slower too.

Ouch, that's horrible ...

> I haven't finished the bisection, but it's now inside *just* this
> pull, so I can already tell that I'm going to revert something in
> here, because this has been making my merge window miserable.

Just as a quick test, does switching to the 'performance' governor work
around the regression?

  for ((cpu=0; cpu < $(nproc); cpu++)); do echo performance > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor; done
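
The switch can then be double-checked with a quick read-back (a sketch):

   grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor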

Thanks,

	Ingo

* [GIT PULL] scheduler fixes
From: Ingo Molnar @ 2024-01-11 11:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It's one of these two:
> >
> >   f12560779f9d sched/cpufreq: Rework iowait boost
> >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> >
> > one more boot to go, then I'll try to revert whichever causes my
> > machine to perform horribly much worse.
> 
> I guess it should come as no surprise that the result is
> 
>    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> 
> but to revert cleanly I will have to revert all of
> 
>       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
>       f12560779f9d ("sched/cpufreq: Rework iowait boost")
>       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> performance estimation")
> 
> This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> 
> I'll keep that revert in my private test-tree for now (so that I have
> a working machine again), but I'll move it to my main branch soon
> unless somebody has a quick fix for this problem.

Thanks a lot for bisecting this, and ack on the revert in any case; these
are relatively fresh changes that clearly didn't get enough testing - sorry!

I also made the revert in sched/urgent & added a changelog, which you can 
pull from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-urgent-2024-01-11

   # HEAD: 250ce3c1169743f3575cc5937fccd72380052795 Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits

Revert recent changes to the schedutil logic, to address a bad
performance regression that increased kernel build time on Linus's
64-CPU desktop system substantially.

Lightly build and boot tested.

 Thanks,

	Ingo

------------------>
Ingo Molnar (1):
      Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits


 include/linux/energy_model.h     |  1 +
 kernel/sched/core.c              | 90 +++++++++++++++++++++++-----------------
 kernel/sched/cpufreq_schedutil.c | 90 ++++++++++++----------------------------
 kernel/sched/fair.c              | 22 ++--------
 kernel/sched/sched.h             | 84 +++++++++++++++++++++++++++++++++----
 5 files changed, 160 insertions(+), 127 deletions(-)

* [tip: sched/urgent] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits
From: tip-bot2 for Ingo Molnar @ 2024-01-11 11:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Linus Torvalds, Ingo Molnar, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     250ce3c1169743f3575cc5937fccd72380052795
Gitweb:        https://git.kernel.org/tip/250ce3c1169743f3575cc5937fccd72380052795
Author:        Ingo Molnar <mingo@kernel.org>
AuthorDate:    Thu, 11 Jan 2024 11:45:17 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 11 Jan 2024 11:51:10 +01:00

Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits

This reverts the following commits:

  b3edde44e5d450 ("cpufreq/schedutil: Use a fixed reference frequency")
  f12560779f9d73 ("sched/cpufreq: Rework iowait boost")
  9c0b4bb7f6303c ("sched/cpufreq: Rework schedutil governor performance estimation")

Because Linus reported a bad performance regression with the
schedutil governor, one that increased the time his empty
kernel build took from 22 to 44 seconds (and that can be
measured in full builds as well), and bisected it back to 9c0b4bb7f6303c.

Until we have a proper fix, revert the broken commit and its
dependent commits.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com
---
 include/linux/energy_model.h     |  1 +-
 kernel/sched/core.c              | 90 +++++++++++++++++--------------
 kernel/sched/cpufreq_schedutil.c | 90 +++++++++----------------------
 kernel/sched/fair.c              | 22 +-------
 kernel/sched/sched.h             | 84 ++++++++++++++++++++++++++---
 5 files changed, 160 insertions(+), 127 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 88d91e0..c19e7ef 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -243,6 +243,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ref_freq = arch_scale_freq_ref(cpu);
 
+	max_util = map_util_perf(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ref_freq, scale_cpu);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be49..2de77a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7467,13 +7467,18 @@ int sched_core_idle_cpu(int cpu)
  * required to meet deadlines.
  */
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long *min,
-				 unsigned long *max)
+				 enum cpu_util_type type,
+				 struct task_struct *p)
 {
-	unsigned long util, irq, scale;
+	unsigned long dl_util, util, irq, max;
 	struct rq *rq = cpu_rq(cpu);
 
-	scale = arch_scale_cpu_capacity(cpu);
+	max = arch_scale_cpu_capacity(cpu);
+
+	if (!uclamp_is_used() &&
+	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
+		return max;
+	}
 
 	/*
 	 * Early check to see if IRQ/steal time saturates the CPU, can be
@@ -7481,49 +7486,45 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * update_irq_load_avg().
 	 */
 	irq = cpu_util_irq(rq);
-	if (unlikely(irq >= scale)) {
-		if (min)
-			*min = scale;
-		if (max)
-			*max = scale;
-		return scale;
-	}
-
-	if (min) {
-		/*
-		 * The minimum utilization returns the highest level between:
-		 * - the computed DL bandwidth needed with the IRQ pressure which
-		 *   steals time to the deadline task.
-		 * - The minimum performance requirement for CFS and/or RT.
-		 */
-		*min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
-
-		/*
-		 * When an RT task is runnable and uclamp is not used, we must
-		 * ensure that the task will run at maximum compute capacity.
-		 */
-		if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
-			*min = max(*min, scale);
-	}
+	if (unlikely(irq >= max))
+		return max;
 
 	/*
 	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
 	 * CFS tasks and we use the same metric to track the effective
 	 * utilization (PELT windows are synchronized) we can directly add them
 	 * to obtain the CPU's actual utilization.
+	 *
+	 * CFS and RT utilization can be boosted or capped, depending on
+	 * utilization clamp constraints requested by currently RUNNABLE
+	 * tasks.
+	 * When there are no CFS RUNNABLE tasks, clamps are released and
+	 * frequency will be gracefully reduced with the utilization decay.
 	 */
 	util = util_cfs + cpu_util_rt(rq);
-	util += cpu_util_dl(rq);
+	if (type == FREQUENCY_UTIL)
+		util = uclamp_rq_util_with(rq, util, p);
+
+	dl_util = cpu_util_dl(rq);
 
 	/*
-	 * The maximum hint is a soft bandwidth requirement, which can be lower
-	 * than the actual utilization because of uclamp_max requirements.
+	 * For frequency selection we do not make cpu_util_dl() a permanent part
+	 * of this sum because we want to use cpu_bw_dl() later on, but we need
+	 * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
+	 * that we select f_max when there is no idle time.
+	 *
+	 * NOTE: numerical errors or stop class might cause us to not quite hit
+	 * saturation when we should -- something for later.
 	 */
-	if (max)
-		*max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+	if (util + dl_util >= max)
+		return max;
 
-	if (util >= scale)
-		return scale;
+	/*
+	 * OTOH, for energy computation we need the estimated running time, so
+	 * include util_dl and ignore dl_bw.
+	 */
+	if (type == ENERGY_UTIL)
+		util += dl_util;
 
 	/*
 	 * There is still idle time; further improve the number by using the
@@ -7534,15 +7535,28 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 *   U' = irq + --------- * U
 	 *                 max
 	 */
-	util = scale_irq_capacity(util, irq, scale);
+	util = scale_irq_capacity(util, irq, max);
 	util += irq;
 
-	return min(scale, util);
+	/*
+	 * Bandwidth required by DEADLINE must always be granted while, for
+	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+	 * to gracefully reduce the frequency when no tasks show up for longer
+	 * periods of time.
+	 *
+	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
+	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
+	 * an interface. So, we only do the latter for now.
+	 */
+	if (type == FREQUENCY_UTIL)
+		util += cpu_bw_dl(rq);
+
+	return min(max, util);
 }
 
 unsigned long sched_cpu_util(int cpu)
 {
-	return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
 }
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c09..5888176 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -47,7 +47,7 @@ struct sugov_cpu {
 	u64			last_update;
 
 	unsigned long		util;
-	unsigned long		bw_min;
+	unsigned long		bw_dl;
 
 	/* The field below is for single-CPU policies only: */
 #ifdef CONFIG_NO_HZ_COMMON
@@ -115,28 +115,6 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
 }
 
 /**
- * get_capacity_ref_freq - get the reference frequency that has been used to
- * correlate frequency and compute capacity for a given cpufreq policy. We use
- * the CPU managing it for the arch_scale_freq_ref() call in the function.
- * @policy: the cpufreq policy of the CPU in question.
- *
- * Return: the reference CPU frequency to compute a capacity.
- */
-static __always_inline
-unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
-{
-	unsigned int freq = arch_scale_freq_ref(policy->cpu);
-
-	if (freq)
-		return freq;
-
-	if (arch_scale_freq_invariant())
-		return policy->cpuinfo.max_freq;
-
-	return policy->cur;
-}
-
-/**
  * get_next_freq - Compute a new frequency for a given cpufreq policy.
  * @sg_policy: schedutil policy object to compute the new frequency for.
  * @util: Current CPU utilization.
@@ -162,9 +140,10 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 				  unsigned long util, unsigned long max)
 {
 	struct cpufreq_policy *policy = sg_policy->policy;
-	unsigned int freq;
+	unsigned int freq = arch_scale_freq_invariant() ?
+				policy->cpuinfo.max_freq : policy->cur;
 
-	freq = get_capacity_ref_freq(policy);
+	util = map_util_perf(util);
 	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
@@ -174,31 +153,14 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
 
-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
-				 unsigned long min,
-				 unsigned long max)
-{
-	/* Add dvfs headroom to actual utilization */
-	actual = map_util_perf(actual);
-	/* Actually we don't need to target the max performance */
-	if (actual < max)
-		max = actual;
-
-	/*
-	 * Ensure at least minimum performance while providing more compute
-	 * capacity when possible.
-	 */
-	return max(min, max);
-}
-
-static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
 {
-	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+	unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
+	struct rq *rq = cpu_rq(sg_cpu->cpu);
 
-	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
-	util = max(util, boost);
-	sg_cpu->bw_min = min;
-	sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
+	sg_cpu->bw_dl = cpu_bw_dl(rq);
+	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
+					  FREQUENCY_UTIL, NULL);
 }
 
 /**
@@ -289,16 +251,18 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
  * This mechanism is designed to boost high frequently IO waiting tasks, while
  * being more conservative on tasks which does sporadic IO operations.
  */
-static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
+static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 			       unsigned long max_cap)
 {
+	unsigned long boost;
+
 	/* No boost currently required */
 	if (!sg_cpu->iowait_boost)
-		return 0;
+		return;
 
 	/* Reset boost if the CPU appears to have been idle enough */
 	if (sugov_iowait_reset(sg_cpu, time, false))
-		return 0;
+		return;
 
 	if (!sg_cpu->iowait_boost_pending) {
 		/*
@@ -307,7 +271,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 		sg_cpu->iowait_boost >>= 1;
 		if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
 			sg_cpu->iowait_boost = 0;
-			return 0;
+			return;
 		}
 	}
 
@@ -317,7 +281,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 	 * sg_cpu->util is already in capacity scale; convert iowait_boost
 	 * into the same scale so we can compare.
 	 */
-	return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+	boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+	boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
+	if (sg_cpu->util < boost)
+		sg_cpu->util = boost;
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -339,7 +306,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
  */
 static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
 {
-	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
+	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
 		sg_cpu->sg_policy->limits_changed = true;
 }
 
@@ -347,8 +314,6 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 					      u64 time, unsigned long max_cap,
 					      unsigned int flags)
 {
-	unsigned long boost;
-
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
@@ -357,8 +322,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
 		return false;
 
-	boost = sugov_iowait_apply(sg_cpu, time, max_cap);
-	sugov_get_util(sg_cpu, boost);
+	sugov_get_util(sg_cpu);
+	sugov_iowait_apply(sg_cpu, time, max_cap);
 
 	return true;
 }
@@ -442,8 +407,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
 		sg_cpu->util = prev_util;
 
-	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
-				   sg_cpu->util, max_cap);
+	cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
+				   map_util_perf(sg_cpu->util), max_cap);
 
 	sg_cpu->sg_policy->last_freq_update_time = time;
 }
@@ -459,10 +424,9 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 
 	for_each_cpu(j, policy->cpus) {
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
-		unsigned long boost;
 
-		boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
-		sugov_get_util(j_sg_cpu, boost);
+		sugov_get_util(j_sg_cpu);
+		sugov_iowait_apply(j_sg_cpu, time, max_cap);
 
 		util = max(j_sg_cpu->util, util);
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030..f43021f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7820,7 +7820,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 	for_each_cpu(cpu, pd_cpus) {
 		unsigned long util = cpu_util(cpu, p, -1, 0);
 
-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
 	}
 
 	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7843,7 +7843,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 	for_each_cpu(cpu, pd_cpus) {
 		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
 		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
-		unsigned long eff_util, min, max;
+		unsigned long eff_util;
 
 		/*
 		 * Performance domain frequency: utilization clamping
@@ -7852,23 +7852,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		eff_util = effective_cpu_util(cpu, util, &min, &max);
-
-		/* Task's uclamp can modify min and max value */
-		if (tsk && uclamp_is_used()) {
-			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
-			/*
-			 * If there is no active max uclamp constraint,
-			 * directly use task's one, otherwise keep max.
-			 */
-			if (uclamp_rq_is_idle(cpu_rq(cpu)))
-				max = uclamp_eff_value(p, UCLAMP_MAX);
-			else
-				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
-		}
-
-		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+		eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
 		max_util = max(max_util, eff_util);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54b..8a70d51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2994,14 +2994,24 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif
 
 #ifdef CONFIG_SMP
-unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long *min,
-				 unsigned long *max);
-
-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
-				 unsigned long min,
-				 unsigned long max);
+/**
+ * enum cpu_util_type - CPU utilization type
+ * @FREQUENCY_UTIL:	Utilization used to select frequency
+ * @ENERGY_UTIL:	Utilization used during energy calculation
+ *
+ * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
+ * need to be aggregated differently depending on the usage made of them. This
+ * enum is used within effective_cpu_util() to differentiate the types of
+ * utilization expected by the callers, and adjust the aggregation accordingly.
+ */
+enum cpu_util_type {
+	FREQUENCY_UTIL,
+	ENERGY_UTIL,
+};
 
+unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
+				 enum cpu_util_type type,
+				 struct task_struct *p);
 
 /*
  * Verify the fitness of task @p to run on @cpu taking into account the
@@ -3058,6 +3068,59 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
 	return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
 }
 
+/**
+ * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
+ * @rq:		The rq to clamp against. Must not be NULL.
+ * @util:	The util value to clamp.
+ * @p:		The task to clamp against. Can be NULL if you want to clamp
+ *		against @rq only.
+ *
+ * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
+ *
+ * If sched_uclamp_used static key is disabled, then just return the util
+ * without any clamping since uclamp aggregation at the rq level in the fast
+ * path is disabled, rendering this operation a NOP.
+ *
+ * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
+ * will return the correct effective uclamp value of the task even if the
+ * static key is disabled.
+ */
+static __always_inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+				  struct task_struct *p)
+{
+	unsigned long min_util = 0;
+	unsigned long max_util = 0;
+
+	if (!static_branch_likely(&sched_uclamp_used))
+		return util;
+
+	if (p) {
+		min_util = uclamp_eff_value(p, UCLAMP_MIN);
+		max_util = uclamp_eff_value(p, UCLAMP_MAX);
+
+		/*
+		 * Ignore last runnable task's max clamp, as this task will
+		 * reset it. Similarly, no need to read the rq's min clamp.
+		 */
+		if (uclamp_rq_is_idle(rq))
+			goto out;
+	}
+
+	min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
+	max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
+out:
+	/*
+	 * Since CPU's {min,max}_util clamps are MAX aggregated considering
+	 * RUNNABLE tasks with _different_ clamps, we can end up with an
+	 * inversion. Fix it now when the clamps are applied.
+	 */
+	if (unlikely(min_util >= max_util))
+		return min_util;
+
+	return clamp(util, min_util, max_util);
+}
+
 /* Is the rq being capped/throttled by uclamp_max? */
 static inline bool uclamp_rq_is_capped(struct rq *rq)
 {
@@ -3095,6 +3158,13 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
 	return SCHED_CAPACITY_SCALE;
 }
 
+static inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+				  struct task_struct *p)
+{
+	return util;
+}
+
 static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }
 
 static inline bool uclamp_is_used(void)

* Re: [GIT PULL] scheduler fixes
From: Vincent Guittot @ 2024-01-11 13:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider

On Thu, 11 Jan 2024 at 12:09, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > It's one of these two:
> > >
> > >   f12560779f9d sched/cpufreq: Rework iowait boost
> > >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > >
> > > one more boot to go, then I'll try to revert whichever causes my
> > > machine to perform horribly much worse.
> >
> > I guess it should come as no surprise that the result is
> >
> >    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> >
> > but to revert cleanly I will have to revert all of
> >
> >       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> >       f12560779f9d ("sched/cpufreq: Rework iowait boost")
> >       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > performance estimation")
> >
> > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> >
> > I'll keep that revert in my private test-tree for now (so that I have
> > a working machine again), but I'll move it to my main branch soon
> > unless somebody has a quick fix for this problem.
>
> Thanks a lot for bisecting this, and ack on the revert in any case; these
> are relatively fresh changes that clearly didn't get enough testing - sorry!

b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency") is
entangled with other patches, but I can provide a clean revert of only:

f12560779f9d ("sched/cpufreq: Rework iowait boost")
9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")

if the fix that I proposed doesn't work:
https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/

>
> I also made the revert in sched/urgent & added a changelog, which you can
> pull from:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-urgent-2024-01-11
>
>    # HEAD: 250ce3c1169743f3575cc5937fccd72380052795 Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits
>
> Revert recent changes to the sched_util logic, to address a bad
> performance regression that increased kernel build time on Linus's
> 64-CPU desktop system substantially.
>
> Lightly build and boot tested.
>
>  Thanks,
>
>         Ingo
>
> ------------------>
> Ingo Molnar (1):
>       Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits
>
>
>  include/linux/energy_model.h     |  1 +
>  kernel/sched/core.c              | 90 +++++++++++++++++++++++-----------------
>  kernel/sched/cpufreq_schedutil.c | 90 ++++++++++++----------------------------
>  kernel/sched/fair.c              | 22 ++--------
>  kernel/sched/sched.h             | 84 +++++++++++++++++++++++++++++++++----
>  5 files changed, 160 insertions(+), 127 deletions(-)

* Re: [GIT PULL] Scheduler changes for v6.8
From: Linus Torvalds @ 2024-01-11 17:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider

On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Could you confirm that the cpufreq governor is schedutil and the driver is
> amd-pstate on your system?

schedutil yes, amd-pstate no. I actually just use acpi_cpufreq

>
> I'm also interested in the output of amd_pstate, to confirm that it uses the
> adjust_perf callback.
>
> I suppose that you don't use the uclamp feature, and AMD doesn't use EAS, so
> that leaves the change to the 'min' parameter of adjust_perf, which was
> probably always 0 (unless you use the deadline scheduler) and which now takes
> IRQ pressure into account.
>
> Could you try the patch below, which restores the previous min value?
>
> ---
>  kernel/sched/cpufreq_schedutil.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..3fe8ac6ce9cc 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -194,10 +194,11 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>  static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>  {
>         unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> +       struct rq *rq = cpu_rq(sg_cpu->cpu);
>
>         util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
>         util = max(util, boost);
> -       sg_cpu->bw_min = min;
> +       sg_cpu->bw_min = cpu_bw_dl(rq);
>         sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
>  }
>
> @@ -442,7 +443,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
>             sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
>                 sg_cpu->util = prev_util;
>
> -       cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
> +       cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_min),
>                                    sg_cpu->util, max_cap);
>
>         sg_cpu->sg_policy->last_freq_update_time = time;
> --
> 2.34.1
>
>
> >
> > I'll keep that revert in my private test-tree for now (so that I have
> > a working machine again), but I'll move it to my main branch soon
> > unless somebody has a quick fix for this problem.
> >
> >                 Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-11 17:45           ` Linus Torvalds
@ 2024-01-11 17:53             ` Linus Torvalds
  2024-01-11 18:16               ` Vincent Guittot
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2024-01-11 17:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider

On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > Could you confirm that the cpufreq governor is schedutil and the driver is
> > amd-pstate on your system?
>
> schedutil yes, amd-pstate no. I actually just use acpi-cpufreq.

Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
unfinished quoting removal.

And don't ask me why it's acpi-cpufreq-driven. I have X86_AMD_PSTATE=y, but

    /sys/devices/system/cpu/cpufreq/policy0/scaling_driver

clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says

    amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled

which is presumably the reason my machine uses acpi-cpufreq.

I will also test out your other questions, but I need to go back and
do more pull requests first.

              Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-11 17:53             ` Linus Torvalds
@ 2024-01-11 18:16               ` Vincent Guittot
  2024-01-12 14:23                 ` Dietmar Eggemann
  0 siblings, 1 reply; 54+ messages in thread
From: Vincent Guittot @ 2024-01-11 18:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider

On Thu, 11 Jan 2024 at 18:53, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > Could you confirm that the cpufreq governor is schedutil and the driver is
> > > amd-pstate on your system?
> >
> > schedutil yes, amd-pstate no. I actually just use acpi-cpufreq.
>
> Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
> unfinished quoting removal.
>
> And don't ask me why it's acpi-cpufreq-driven. I have X86_AMD_PSTATE=y, but
>
>     /sys/devices/system/cpu/cpufreq/policy0/scaling_driver
>
> clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says

That seems to be the right place to look.

>
>     amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
>
> which is presumably the reason my machine uses acpi-cpufreq.
>
> I will also test out your other questions, but I need to go back and
> do more pull requests first.

ok, thanks

I'm going to continue checking what else could trigger such a regression,
keeping in mind that your system should not have been impacted by these
changes.


>
>               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit
  2024-01-11 13:04           ` Vincent Guittot
@ 2024-01-11 20:48             ` Ingo Molnar
  2024-01-11 22:22               ` Vincent Guittot
  2024-01-12 18:24               ` Ingo Molnar
  0 siblings, 2 replies; 54+ messages in thread
From: Ingo Molnar @ 2024-01-11 20:48 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Linus Torvalds, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider


* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> On Thu, 11 Jan 2024 at 12:09, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > It's one of these two:
> > > >
> > > >   f12560779f9d sched/cpufreq: Rework iowait boost
> > > >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > >
> > > > one more boot to go, then I'll try to revert whichever causes my
> > > > machine to perform horribly much worse.
> > >
> > > I guess it should come as no surprise that the result is
> > >
> > >    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > >
> > > but to revert cleanly I will have to revert all of
> > >
> > >       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > >       f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > >       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > performance estimation")
> > >
> > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > >
> > > I'll keep that revert in my private test-tree for now (so that I have
> > > a working machine again), but I'll move it to my main branch soon
> > > unless somebody has a quick fix for this problem.
> >
> > Thanks a lot for bisecting this, and ack on the revert in any case, these
> > are relatively fresh changes that clearly didn't get enough testing - sorry!
> 
> b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency") is
> linked with other patches.

Indeed.

> I can provide a clean revert of only:
> f12560779f9d ("sched/cpufreq: Rework iowait boost")
> 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")

I've done this too, see this new commit in sched/urgent:

  60ee1706bd11 ("Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit")

Also attached below.

> if the fix that I proposed doesn't work:
> https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/

Yeah - although of course Linus is free to just pull the revert as well. 
I'll try to reproduce the regression locally as well.

Thanks,

	Ingo

===============================>
From: Ingo Molnar <mingo@kernel.org>
Date: Thu, 11 Jan 2024 11:45:17 +0100
Subject: [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit

This reverts the following commits:

  f12560779f9d73 ("sched/cpufreq: Rework iowait boost")
  9c0b4bb7f6303c ("sched/cpufreq: Rework schedutil governor performance estimation")

Because Linus reported a bad performance regression with the
schedutil governor that increased the time his empty
kernel build took from 22 to 44 seconds (and can be similarly
measured in full builds as well), and bisected it back to 9c0b4bb7f6303c.

Until we have a proper fix, revert the broken commit and its
dependent commit.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com
---
 include/linux/energy_model.h     |  1 +
 kernel/sched/core.c              | 90 +++++++++++++++++++++++-----------------
 kernel/sched/cpufreq_schedutil.c | 64 +++++++++++-----------------
 kernel/sched/fair.c              | 22 ++--------
 kernel/sched/sched.h             | 84 +++++++++++++++++++++++++++++++++----
 5 files changed, 158 insertions(+), 103 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 88d91e087471..c19e7effe764 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -243,6 +243,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ref_freq = arch_scale_freq_ref(cpu);
 
+	max_util = map_util_perf(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ref_freq, scale_cpu);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..038eeaf76d2d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7468,13 +7468,18 @@ int sched_core_idle_cpu(int cpu)
  * required to meet deadlines.
  */
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long *min,
-				 unsigned long *max)
+				 enum cpu_util_type type,
+				 struct task_struct *p)
 {
-	unsigned long util, irq, scale;
+	unsigned long dl_util, util, irq, max;
 	struct rq *rq = cpu_rq(cpu);
 
-	scale = arch_scale_cpu_capacity(cpu);
+	max = arch_scale_cpu_capacity(cpu);
+
+	if (!uclamp_is_used() &&
+	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
+		return max;
+	}
 
 	/*
 	 * Early check to see if IRQ/steal time saturates the CPU, can be
@@ -7482,49 +7487,45 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * update_irq_load_avg().
 	 */
 	irq = cpu_util_irq(rq);
-	if (unlikely(irq >= scale)) {
-		if (min)
-			*min = scale;
-		if (max)
-			*max = scale;
-		return scale;
-	}
-
-	if (min) {
-		/*
-		 * The minimum utilization returns the highest level between:
-		 * - the computed DL bandwidth needed with the IRQ pressure which
-		 *   steals time to the deadline task.
-		 * - The minimum performance requirement for CFS and/or RT.
-		 */
-		*min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
-
-		/*
-		 * When an RT task is runnable and uclamp is not used, we must
-		 * ensure that the task will run at maximum compute capacity.
-		 */
-		if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
-			*min = max(*min, scale);
-	}
+	if (unlikely(irq >= max))
+		return max;
 
 	/*
 	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
 	 * CFS tasks and we use the same metric to track the effective
 	 * utilization (PELT windows are synchronized) we can directly add them
 	 * to obtain the CPU's actual utilization.
+	 *
+	 * CFS and RT utilization can be boosted or capped, depending on
+	 * utilization clamp constraints requested by currently RUNNABLE
+	 * tasks.
+	 * When there are no CFS RUNNABLE tasks, clamps are released and
+	 * frequency will be gracefully reduced with the utilization decay.
 	 */
 	util = util_cfs + cpu_util_rt(rq);
-	util += cpu_util_dl(rq);
+	if (type == FREQUENCY_UTIL)
+		util = uclamp_rq_util_with(rq, util, p);
+
+	dl_util = cpu_util_dl(rq);
 
 	/*
-	 * The maximum hint is a soft bandwidth requirement, which can be lower
-	 * than the actual utilization because of uclamp_max requirements.
+	 * For frequency selection we do not make cpu_util_dl() a permanent part
+	 * of this sum because we want to use cpu_bw_dl() later on, but we need
+	 * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
+	 * that we select f_max when there is no idle time.
+	 *
+	 * NOTE: numerical errors or stop class might cause us to not quite hit
+	 * saturation when we should -- something for later.
 	 */
-	if (max)
-		*max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+	if (util + dl_util >= max)
+		return max;
 
-	if (util >= scale)
-		return scale;
+	/*
+	 * OTOH, for energy computation we need the estimated running time, so
+	 * include util_dl and ignore dl_bw.
+	 */
+	if (type == ENERGY_UTIL)
+		util += dl_util;
 
 	/*
 	 * There is still idle time; further improve the number by using the
@@ -7535,15 +7536,28 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 *   U' = irq + --------- * U
 	 *                 max
 	 */
-	util = scale_irq_capacity(util, irq, scale);
+	util = scale_irq_capacity(util, irq, max);
 	util += irq;
 
-	return min(scale, util);
+	/*
+	 * Bandwidth required by DEADLINE must always be granted while, for
+	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+	 * to gracefully reduce the frequency when no tasks show up for longer
+	 * periods of time.
+	 *
+	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
+	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
+	 * an interface. So, we only do the latter for now.
+	 */
+	if (type == FREQUENCY_UTIL)
+		util += cpu_bw_dl(rq);
+
+	return min(max, util);
 }
 
 unsigned long sched_cpu_util(int cpu)
 {
-	return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
 }
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..5f8729c41d0a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -47,7 +47,7 @@ struct sugov_cpu {
 	u64			last_update;
 
 	unsigned long		util;
-	unsigned long		bw_min;
+	unsigned long		bw_dl;
 
 	/* The field below is for single-CPU policies only: */
 #ifdef CONFIG_NO_HZ_COMMON
@@ -164,6 +164,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	struct cpufreq_policy *policy = sg_policy->policy;
 	unsigned int freq;
 
+	util = map_util_perf(util);
 	freq = get_capacity_ref_freq(policy);
 	freq = map_util_freq(util, freq, max);
 
@@ -174,31 +175,14 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
 
-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
-				 unsigned long min,
-				 unsigned long max)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
 {
-	/* Add dvfs headroom to actual utilization */
-	actual = map_util_perf(actual);
-	/* Actually we don't need to target the max performance */
-	if (actual < max)
-		max = actual;
+	unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
+	struct rq *rq = cpu_rq(sg_cpu->cpu);
 
-	/*
-	 * Ensure at least minimum performance while providing more compute
-	 * capacity when possible.
-	 */
-	return max(min, max);
-}
-
-static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
-{
-	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
-
-	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
-	util = max(util, boost);
-	sg_cpu->bw_min = min;
-	sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
+	sg_cpu->bw_dl = cpu_bw_dl(rq);
+	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
+					  FREQUENCY_UTIL, NULL);
 }
 
 /**
@@ -289,16 +273,18 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
  * This mechanism is designed to boost high frequently IO waiting tasks, while
  * being more conservative on tasks which does sporadic IO operations.
  */
-static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
+static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 			       unsigned long max_cap)
 {
+	unsigned long boost;
+
 	/* No boost currently required */
 	if (!sg_cpu->iowait_boost)
-		return 0;
+		return;
 
 	/* Reset boost if the CPU appears to have been idle enough */
 	if (sugov_iowait_reset(sg_cpu, time, false))
-		return 0;
+		return;
 
 	if (!sg_cpu->iowait_boost_pending) {
 		/*
@@ -307,7 +293,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 		sg_cpu->iowait_boost >>= 1;
 		if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
 			sg_cpu->iowait_boost = 0;
-			return 0;
+			return;
 		}
 	}
 
@@ -317,7 +303,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 	 * sg_cpu->util is already in capacity scale; convert iowait_boost
 	 * into the same scale so we can compare.
 	 */
-	return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+	boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+	boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
+	if (sg_cpu->util < boost)
+		sg_cpu->util = boost;
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -339,7 +328,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
  */
 static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
 {
-	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
+	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
 		sg_cpu->sg_policy->limits_changed = true;
 }
 
@@ -347,8 +336,6 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 					      u64 time, unsigned long max_cap,
 					      unsigned int flags)
 {
-	unsigned long boost;
-
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
@@ -357,8 +344,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
 		return false;
 
-	boost = sugov_iowait_apply(sg_cpu, time, max_cap);
-	sugov_get_util(sg_cpu, boost);
+	sugov_get_util(sg_cpu);
+	sugov_iowait_apply(sg_cpu, time, max_cap);
 
 	return true;
 }
@@ -442,8 +429,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
 		sg_cpu->util = prev_util;
 
-	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
-				   sg_cpu->util, max_cap);
+	cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
+				   map_util_perf(sg_cpu->util), max_cap);
 
 	sg_cpu->sg_policy->last_freq_update_time = time;
 }
@@ -459,10 +446,9 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 
 	for_each_cpu(j, policy->cpus) {
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
-		unsigned long boost;
 
-		boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
-		sugov_get_util(j_sg_cpu, boost);
+		sugov_get_util(j_sg_cpu);
+		sugov_iowait_apply(j_sg_cpu, time, max_cap);
 
 		util = max(j_sg_cpu->util, util);
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..f2bb83675e4a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7820,7 +7820,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 	for_each_cpu(cpu, pd_cpus) {
 		unsigned long util = cpu_util(cpu, p, -1, 0);
 
-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
 	}
 
 	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7843,7 +7843,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 	for_each_cpu(cpu, pd_cpus) {
 		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
 		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
-		unsigned long eff_util, min, max;
+		unsigned long eff_util;
 
 		/*
 		 * Performance domain frequency: utilization clamping
@@ -7852,23 +7852,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		eff_util = effective_cpu_util(cpu, util, &min, &max);
-
-		/* Task's uclamp can modify min and max value */
-		if (tsk && uclamp_is_used()) {
-			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
-			/*
-			 * If there is no active max uclamp constraint,
-			 * directly use task's one, otherwise keep max.
-			 */
-			if (uclamp_rq_is_idle(cpu_rq(cpu)))
-				max = uclamp_eff_value(p, UCLAMP_MAX);
-			else
-				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
-		}
-
-		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+		eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
 		max_util = max(max_util, eff_util);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001fe047bd5d..eb7e07a1abcc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3005,14 +3005,24 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif
 
 #ifdef CONFIG_SMP
+/**
+ * enum cpu_util_type - CPU utilization type
+ * @FREQUENCY_UTIL:	Utilization used to select frequency
+ * @ENERGY_UTIL:	Utilization used during energy calculation
+ *
+ * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
+ * need to be aggregated differently depending on the usage made of them. This
+ * enum is used within effective_cpu_util() to differentiate the types of
+ * utilization expected by the callers, and adjust the aggregation accordingly.
+ */
+enum cpu_util_type {
+	FREQUENCY_UTIL,
+	ENERGY_UTIL,
+};
+
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long *min,
-				 unsigned long *max);
-
-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
-				 unsigned long min,
-				 unsigned long max);
-
+				 enum cpu_util_type type,
+				 struct task_struct *p);
 
 /*
  * Verify the fitness of task @p to run on @cpu taking into account the
@@ -3069,6 +3079,59 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
 	return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
 }
 
+/**
+ * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
+ * @rq:		The rq to clamp against. Must not be NULL.
+ * @util:	The util value to clamp.
+ * @p:		The task to clamp against. Can be NULL if you want to clamp
+ *		against @rq only.
+ *
+ * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
+ *
+ * If sched_uclamp_used static key is disabled, then just return the util
+ * without any clamping since uclamp aggregation at the rq level in the fast
+ * path is disabled, rendering this operation a NOP.
+ *
+ * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
+ * will return the correct effective uclamp value of the task even if the
+ * static key is disabled.
+ */
+static __always_inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+				  struct task_struct *p)
+{
+	unsigned long min_util = 0;
+	unsigned long max_util = 0;
+
+	if (!static_branch_likely(&sched_uclamp_used))
+		return util;
+
+	if (p) {
+		min_util = uclamp_eff_value(p, UCLAMP_MIN);
+		max_util = uclamp_eff_value(p, UCLAMP_MAX);
+
+		/*
+		 * Ignore last runnable task's max clamp, as this task will
+		 * reset it. Similarly, no need to read the rq's min clamp.
+		 */
+		if (uclamp_rq_is_idle(rq))
+			goto out;
+	}
+
+	min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
+	max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
+out:
+	/*
+	 * Since CPU's {min,max}_util clamps are MAX aggregated considering
+	 * RUNNABLE tasks with _different_ clamps, we can end up with an
+	 * inversion. Fix it now when the clamps are applied.
+	 */
+	if (unlikely(min_util >= max_util))
+		return min_util;
+
+	return clamp(util, min_util, max_util);
+}
+
 /* Is the rq being capped/throttled by uclamp_max? */
 static inline bool uclamp_rq_is_capped(struct rq *rq)
 {
@@ -3106,6 +3169,13 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
 	return SCHED_CAPACITY_SCALE;
 }
 
+static inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+				  struct task_struct *p)
+{
+	return util;
+}
+
 static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }
 
 static inline bool uclamp_is_used(void)

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [tip: sched/urgent] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit
  2024-01-10 22:19   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2024-01-11 11:14     ` [tip: sched/urgent] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits tip-bot2 for Ingo Molnar
@ 2024-01-11 20:55     ` tip-bot2 for Ingo Molnar
  3 siblings, 0 replies; 54+ messages in thread
From: tip-bot2 for Ingo Molnar @ 2024-01-11 20:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Linus Torvalds, Ingo Molnar, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     60ee1706bd11669b0530efb4e92526854b9a6364
Gitweb:        https://git.kernel.org/tip/60ee1706bd11669b0530efb4e92526854b9a6364
Author:        Ingo Molnar <mingo@kernel.org>
AuthorDate:    Thu, 11 Jan 2024 11:45:17 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 11 Jan 2024 21:42:42 +01:00

Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit

This reverts the following commits:

  f12560779f9d73 ("sched/cpufreq: Rework iowait boost")
  9c0b4bb7f6303c ("sched/cpufreq: Rework schedutil governor performance estimation")

Because Linus reported a bad performance regression with the
schedutil governor that increased the time his empty
kernel build took from 22 to 44 seconds (and can be similarly
measured in full builds as well), and bisected it back to 9c0b4bb7f6303c.

Until we have a proper fix, revert the broken commit and its
dependent commit.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com
---
 include/linux/energy_model.h     |  1 +-
 kernel/sched/core.c              | 90 +++++++++++++++++--------------
 kernel/sched/cpufreq_schedutil.c | 64 ++++++++--------------
 kernel/sched/fair.c              | 22 +-------
 kernel/sched/sched.h             | 84 ++++++++++++++++++++++++++---
 5 files changed, 158 insertions(+), 103 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 88d91e0..c19e7ef 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -243,6 +243,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ref_freq = arch_scale_freq_ref(cpu);
 
+	max_util = map_util_perf(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ref_freq, scale_cpu);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc..038eeaf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7468,13 +7468,18 @@ int sched_core_idle_cpu(int cpu)
  * required to meet deadlines.
  */
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long *min,
-				 unsigned long *max)
+				 enum cpu_util_type type,
+				 struct task_struct *p)
 {
-	unsigned long util, irq, scale;
+	unsigned long dl_util, util, irq, max;
 	struct rq *rq = cpu_rq(cpu);
 
-	scale = arch_scale_cpu_capacity(cpu);
+	max = arch_scale_cpu_capacity(cpu);
+
+	if (!uclamp_is_used() &&
+	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
+		return max;
+	}
 
 	/*
 	 * Early check to see if IRQ/steal time saturates the CPU, can be
@@ -7482,49 +7487,45 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * update_irq_load_avg().
 	 */
 	irq = cpu_util_irq(rq);
-	if (unlikely(irq >= scale)) {
-		if (min)
-			*min = scale;
-		if (max)
-			*max = scale;
-		return scale;
-	}
-
-	if (min) {
-		/*
-		 * The minimum utilization returns the highest level between:
-		 * - the computed DL bandwidth needed with the IRQ pressure which
-		 *   steals time to the deadline task.
-		 * - The minimum performance requirement for CFS and/or RT.
-		 */
-		*min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
-
-		/*
-		 * When an RT task is runnable and uclamp is not used, we must
-		 * ensure that the task will run at maximum compute capacity.
-		 */
-		if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
-			*min = max(*min, scale);
-	}
+	if (unlikely(irq >= max))
+		return max;
 
 	/*
 	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
 	 * CFS tasks and we use the same metric to track the effective
 	 * utilization (PELT windows are synchronized) we can directly add them
 	 * to obtain the CPU's actual utilization.
+	 *
+	 * CFS and RT utilization can be boosted or capped, depending on
+	 * utilization clamp constraints requested by currently RUNNABLE
+	 * tasks.
+	 * When there are no CFS RUNNABLE tasks, clamps are released and
+	 * frequency will be gracefully reduced with the utilization decay.
 	 */
 	util = util_cfs + cpu_util_rt(rq);
-	util += cpu_util_dl(rq);
+	if (type == FREQUENCY_UTIL)
+		util = uclamp_rq_util_with(rq, util, p);
+
+	dl_util = cpu_util_dl(rq);
 
 	/*
-	 * The maximum hint is a soft bandwidth requirement, which can be lower
-	 * than the actual utilization because of uclamp_max requirements.
+	 * For frequency selection we do not make cpu_util_dl() a permanent part
+	 * of this sum because we want to use cpu_bw_dl() later on, but we need
+	 * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
+	 * that we select f_max when there is no idle time.
+	 *
+	 * NOTE: numerical errors or stop class might cause us to not quite hit
+	 * saturation when we should -- something for later.
 	 */
-	if (max)
-		*max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+	if (util + dl_util >= max)
+		return max;
 
-	if (util >= scale)
-		return scale;
+	/*
+	 * OTOH, for energy computation we need the estimated running time, so
+	 * include util_dl and ignore dl_bw.
+	 */
+	if (type == ENERGY_UTIL)
+		util += dl_util;
 
 	/*
 	 * There is still idle time; further improve the number by using the
@@ -7535,15 +7536,28 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 *   U' = irq + --------- * U
 	 *                 max
 	 */
-	util = scale_irq_capacity(util, irq, scale);
+	util = scale_irq_capacity(util, irq, max);
 	util += irq;
 
-	return min(scale, util);
+	/*
+	 * Bandwidth required by DEADLINE must always be granted while, for
+	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+	 * to gracefully reduce the frequency when no tasks show up for longer
+	 * periods of time.
+	 *
+	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
+	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
+	 * an interface. So, we only do the latter for now.
+	 */
+	if (type == FREQUENCY_UTIL)
+		util += cpu_bw_dl(rq);
+
+	return min(max, util);
 }
 
 unsigned long sched_cpu_util(int cpu)
 {
-	return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
 }
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c09..5f8729c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -47,7 +47,7 @@ struct sugov_cpu {
 	u64			last_update;
 
 	unsigned long		util;
-	unsigned long		bw_min;
+	unsigned long		bw_dl;
 
 	/* The field below is for single-CPU policies only: */
 #ifdef CONFIG_NO_HZ_COMMON
@@ -164,6 +164,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	struct cpufreq_policy *policy = sg_policy->policy;
 	unsigned int freq;
 
+	util = map_util_perf(util);
 	freq = get_capacity_ref_freq(policy);
 	freq = map_util_freq(util, freq, max);
 
@@ -174,31 +175,14 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
 
-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
-				 unsigned long min,
-				 unsigned long max)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
 {
-	/* Add dvfs headroom to actual utilization */
-	actual = map_util_perf(actual);
-	/* Actually we don't need to target the max performance */
-	if (actual < max)
-		max = actual;
+	unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
+	struct rq *rq = cpu_rq(sg_cpu->cpu);
 
-	/*
-	 * Ensure at least minimum performance while providing more compute
-	 * capacity when possible.
-	 */
-	return max(min, max);
-}
-
-static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
-{
-	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
-
-	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
-	util = max(util, boost);
-	sg_cpu->bw_min = min;
-	sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
+	sg_cpu->bw_dl = cpu_bw_dl(rq);
+	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
+					  FREQUENCY_UTIL, NULL);
 }
 
 /**
@@ -289,16 +273,18 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
  * This mechanism is designed to boost high frequently IO waiting tasks, while
  * being more conservative on tasks which does sporadic IO operations.
  */
-static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
+static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 			       unsigned long max_cap)
 {
+	unsigned long boost;
+
 	/* No boost currently required */
 	if (!sg_cpu->iowait_boost)
-		return 0;
+		return;
 
 	/* Reset boost if the CPU appears to have been idle enough */
 	if (sugov_iowait_reset(sg_cpu, time, false))
-		return 0;
+		return;
 
 	if (!sg_cpu->iowait_boost_pending) {
 		/*
@@ -307,7 +293,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 		sg_cpu->iowait_boost >>= 1;
 		if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
 			sg_cpu->iowait_boost = 0;
-			return 0;
+			return;
 		}
 	}
 
@@ -317,7 +303,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 	 * sg_cpu->util is already in capacity scale; convert iowait_boost
 	 * into the same scale so we can compare.
 	 */
-	return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+	boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+	boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
+	if (sg_cpu->util < boost)
+		sg_cpu->util = boost;
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -339,7 +328,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
  */
 static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
 {
-	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
+	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
 		sg_cpu->sg_policy->limits_changed = true;
 }
 
@@ -347,8 +336,6 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 					      u64 time, unsigned long max_cap,
 					      unsigned int flags)
 {
-	unsigned long boost;
-
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
@@ -357,8 +344,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
 		return false;
 
-	boost = sugov_iowait_apply(sg_cpu, time, max_cap);
-	sugov_get_util(sg_cpu, boost);
+	sugov_get_util(sg_cpu);
+	sugov_iowait_apply(sg_cpu, time, max_cap);
 
 	return true;
 }
@@ -442,8 +429,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
 		sg_cpu->util = prev_util;
 
-	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
-				   sg_cpu->util, max_cap);
+	cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
+				   map_util_perf(sg_cpu->util), max_cap);
 
 	sg_cpu->sg_policy->last_freq_update_time = time;
 }
@@ -459,10 +446,9 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 
 	for_each_cpu(j, policy->cpus) {
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
-		unsigned long boost;
 
-		boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
-		sugov_get_util(j_sg_cpu, boost);
+		sugov_get_util(j_sg_cpu);
+		sugov_iowait_apply(j_sg_cpu, time, max_cap);
 
 		util = max(j_sg_cpu->util, util);
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e..f2bb836 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7820,7 +7820,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 	for_each_cpu(cpu, pd_cpus) {
 		unsigned long util = cpu_util(cpu, p, -1, 0);
 
-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
 	}
 
 	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7843,7 +7843,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 	for_each_cpu(cpu, pd_cpus) {
 		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
 		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
-		unsigned long eff_util, min, max;
+		unsigned long eff_util;
 
 		/*
 		 * Performance domain frequency: utilization clamping
@@ -7852,23 +7852,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		eff_util = effective_cpu_util(cpu, util, &min, &max);
-
-		/* Task's uclamp can modify min and max value */
-		if (tsk && uclamp_is_used()) {
-			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
-			/*
-			 * If there is no active max uclamp constraint,
-			 * directly use task's one, otherwise keep max.
-			 */
-			if (uclamp_rq_is_idle(cpu_rq(cpu)))
-				max = uclamp_eff_value(p, UCLAMP_MAX);
-			else
-				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
-		}
-
-		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+		eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
 		max_util = max(max_util, eff_util);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001fe04..eb7e07a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3005,14 +3005,24 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif
 
 #ifdef CONFIG_SMP
-unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long *min,
-				 unsigned long *max);
-
-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
-				 unsigned long min,
-				 unsigned long max);
+/**
+ * enum cpu_util_type - CPU utilization type
+ * @FREQUENCY_UTIL:	Utilization used to select frequency
+ * @ENERGY_UTIL:	Utilization used during energy calculation
+ *
+ * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
+ * need to be aggregated differently depending on the usage made of them. This
+ * enum is used within effective_cpu_util() to differentiate the types of
+ * utilization expected by the callers, and adjust the aggregation accordingly.
+ */
+enum cpu_util_type {
+	FREQUENCY_UTIL,
+	ENERGY_UTIL,
+};
 
+unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
+				 enum cpu_util_type type,
+				 struct task_struct *p);
 
 /*
  * Verify the fitness of task @p to run on @cpu taking into account the
@@ -3069,6 +3079,59 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
 	return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
 }
 
+/**
+ * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
+ * @rq:		The rq to clamp against. Must not be NULL.
+ * @util:	The util value to clamp.
+ * @p:		The task to clamp against. Can be NULL if you want to clamp
+ *		against @rq only.
+ *
+ * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
+ *
+ * If sched_uclamp_used static key is disabled, then just return the util
+ * without any clamping since uclamp aggregation at the rq level in the fast
+ * path is disabled, rendering this operation a NOP.
+ *
+ * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
+ * will return the correct effective uclamp value of the task even if the
+ * static key is disabled.
+ */
+static __always_inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+				  struct task_struct *p)
+{
+	unsigned long min_util = 0;
+	unsigned long max_util = 0;
+
+	if (!static_branch_likely(&sched_uclamp_used))
+		return util;
+
+	if (p) {
+		min_util = uclamp_eff_value(p, UCLAMP_MIN);
+		max_util = uclamp_eff_value(p, UCLAMP_MAX);
+
+		/*
+		 * Ignore last runnable task's max clamp, as this task will
+		 * reset it. Similarly, no need to read the rq's min clamp.
+		 */
+		if (uclamp_rq_is_idle(rq))
+			goto out;
+	}
+
+	min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
+	max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
+out:
+	/*
+	 * Since CPU's {min,max}_util clamps are MAX aggregated considering
+	 * RUNNABLE tasks with _different_ clamps, we can end up with an
+	 * inversion. Fix it now when the clamps are applied.
+	 */
+	if (unlikely(min_util >= max_util))
+		return min_util;
+
+	return clamp(util, min_util, max_util);
+}
+
 /* Is the rq being capped/throttled by uclamp_max? */
 static inline bool uclamp_rq_is_capped(struct rq *rq)
 {
@@ -3106,6 +3169,13 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
 	return SCHED_CAPACITY_SCALE;
 }
 
+static inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+				  struct task_struct *p)
+{
+	return util;
+}
+
 static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }
 
 static inline bool uclamp_is_used(void)

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit
  2024-01-11 20:48             ` [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit Ingo Molnar
@ 2024-01-11 22:22               ` Vincent Guittot
  2024-01-12 18:24               ` Ingo Molnar
  1 sibling, 0 replies; 54+ messages in thread
From: Vincent Guittot @ 2024-01-11 22:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider

On Thu, 11 Jan 2024 at 21:48, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> > On Thu, 11 Jan 2024 at 12:09, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > >
> > > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > > <torvalds@linux-foundation.org> wrote:
> > > > >
> > > > > It's one of these two:
> > > > >
> > > > >   f12560779f9d sched/cpufreq: Rework iowait boost
> > > > >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > > >
> > > > > one more boot to go, then I'll try to revert whichever causes my
> > > > > machine to perform horribly much worse.
> > > >
> > > > I guess it should come as no surprise that the result is
> > > >
> > > >    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > > >
> > > > but to revert cleanly I will have to revert all of
> > > >
> > > >       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > > >       f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > > >       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > > performance estimation")
> > > >
> > > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > > >
> > > > I'll keep that revert in my private test-tree for now (so that I have
> > > > a working machine again), but I'll move it to my main branch soon
> > > > unless somebody has a quick fix for this problem.
> > >
> > > Thanks a lot for bisecting this, and ack on the revert in any case, these
> > > are relatively fresh changes that clearly didn't get enough testing - sorry!
> >
> > b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency") is
> > linked with other patches.
>
> Indeed.
>
> > I can provide a clean revert of only:
> > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
>
> I've done this too, see this new commit in sched/urgent:
>
>   60ee1706bd11 ("Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit")

Thanks, your new commit looks good to me.

>
> Also attached below.
>
> > if the fix that I proposed doesn't work:
> > https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/
>
> Yeah - although of course Linus is free to just pull the revert as well.

Yes, I know

> I'll try to reproduce the regression locally as well.

OK, thanks

Vincent

>
> Thanks,
>
>         Ingo
>
> ===============================>
> From: Ingo Molnar <mingo@kernel.org>
> Date: Thu, 11 Jan 2024 11:45:17 +0100
> Subject: [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit
>
> This reverts the following commits:
>
>   f12560779f9d73 ("sched/cpufreq: Rework iowait boost")
>   9c0b4bb7f6303c ("sched/cpufreq: Rework schedutil governor performance estimation")
>
> Because Linus reported a bad performance regression with the
> schedutil governor that increased the time his empty
> kernel build took from 22 to 44 seconds (and can be similarly
> measured in full builds as well), and bisected it back to 9c0b4bb7f6303c.
>
> Until we have a proper fix, revert the broken commit and its
> dependent commit.
>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Link: https://lore.kernel.org/r/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com
> ---
>  include/linux/energy_model.h     |  1 +
>  kernel/sched/core.c              | 90 +++++++++++++++++++++++-----------------
>  kernel/sched/cpufreq_schedutil.c | 64 +++++++++++-----------------
>  kernel/sched/fair.c              | 22 ++--------
>  kernel/sched/sched.h             | 84 +++++++++++++++++++++++++++++++++----
>  5 files changed, 158 insertions(+), 103 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 88d91e087471..c19e7effe764 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -243,6 +243,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>         scale_cpu = arch_scale_cpu_capacity(cpu);
>         ref_freq = arch_scale_freq_ref(cpu);
>
> +       max_util = map_util_perf(max_util);
>         max_util = min(max_util, allowed_cpu_cap);
>         freq = map_util_freq(max_util, ref_freq, scale_cpu);
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9116bcc90346..038eeaf76d2d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7468,13 +7468,18 @@ int sched_core_idle_cpu(int cpu)
>   * required to meet deadlines.
>   */
>  unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> -                                unsigned long *min,
> -                                unsigned long *max)
> +                                enum cpu_util_type type,
> +                                struct task_struct *p)
>  {
> -       unsigned long util, irq, scale;
> +       unsigned long dl_util, util, irq, max;
>         struct rq *rq = cpu_rq(cpu);
>
> -       scale = arch_scale_cpu_capacity(cpu);
> +       max = arch_scale_cpu_capacity(cpu);
> +
> +       if (!uclamp_is_used() &&
> +           type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
> +               return max;
> +       }
>
>         /*
>          * Early check to see if IRQ/steal time saturates the CPU, can be
> @@ -7482,49 +7487,45 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>          * update_irq_load_avg().
>          */
>         irq = cpu_util_irq(rq);
> -       if (unlikely(irq >= scale)) {
> -               if (min)
> -                       *min = scale;
> -               if (max)
> -                       *max = scale;
> -               return scale;
> -       }
> -
> -       if (min) {
> -               /*
> -                * The minimum utilization returns the highest level between:
> -                * - the computed DL bandwidth needed with the IRQ pressure which
> -                *   steals time to the deadline task.
> -                * - The minimum performance requirement for CFS and/or RT.
> -                */
> -               *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
> -
> -               /*
> -                * When an RT task is runnable and uclamp is not used, we must
> -                * ensure that the task will run at maximum compute capacity.
> -                */
> -               if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
> -                       *min = max(*min, scale);
> -       }
> +       if (unlikely(irq >= max))
> +               return max;
>
>         /*
>          * Because the time spend on RT/DL tasks is visible as 'lost' time to
>          * CFS tasks and we use the same metric to track the effective
>          * utilization (PELT windows are synchronized) we can directly add them
>          * to obtain the CPU's actual utilization.
> +        *
> +        * CFS and RT utilization can be boosted or capped, depending on
> +        * utilization clamp constraints requested by currently RUNNABLE
> +        * tasks.
> +        * When there are no CFS RUNNABLE tasks, clamps are released and
> +        * frequency will be gracefully reduced with the utilization decay.
>          */
>         util = util_cfs + cpu_util_rt(rq);
> -       util += cpu_util_dl(rq);
> +       if (type == FREQUENCY_UTIL)
> +               util = uclamp_rq_util_with(rq, util, p);
> +
> +       dl_util = cpu_util_dl(rq);
>
>         /*
> -        * The maximum hint is a soft bandwidth requirement, which can be lower
> -        * than the actual utilization because of uclamp_max requirements.
> +        * For frequency selection we do not make cpu_util_dl() a permanent part
> +        * of this sum because we want to use cpu_bw_dl() later on, but we need
> +        * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
> +        * that we select f_max when there is no idle time.
> +        *
> +        * NOTE: numerical errors or stop class might cause us to not quite hit
> +        * saturation when we should -- something for later.
>          */
> -       if (max)
> -               *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
> +       if (util + dl_util >= max)
> +               return max;
>
> -       if (util >= scale)
> -               return scale;
> +       /*
> +        * OTOH, for energy computation we need the estimated running time, so
> +        * include util_dl and ignore dl_bw.
> +        */
> +       if (type == ENERGY_UTIL)
> +               util += dl_util;
>
>         /*
>          * There is still idle time; further improve the number by using the
> @@ -7535,15 +7536,28 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>          *   U' = irq + --------- * U
>          *                 max
>          */
> -       util = scale_irq_capacity(util, irq, scale);
> +       util = scale_irq_capacity(util, irq, max);
>         util += irq;
>
> -       return min(scale, util);
> +       /*
> +        * Bandwidth required by DEADLINE must always be granted while, for
> +        * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
> +        * to gracefully reduce the frequency when no tasks show up for longer
> +        * periods of time.
> +        *
> +        * Ideally we would like to set bw_dl as min/guaranteed freq and util +
> +        * bw_dl as requested freq. However, cpufreq is not yet ready for such
> +        * an interface. So, we only do the latter for now.
> +        */
> +       if (type == FREQUENCY_UTIL)
> +               util += cpu_bw_dl(rq);
> +
> +       return min(max, util);
>  }
>
>  unsigned long sched_cpu_util(int cpu)
>  {
> -       return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
> +       return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
>  }
>  #endif /* CONFIG_SMP */
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..5f8729c41d0a 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -47,7 +47,7 @@ struct sugov_cpu {
>         u64                     last_update;
>
>         unsigned long           util;
> -       unsigned long           bw_min;
> +       unsigned long           bw_dl;
>
>         /* The field below is for single-CPU policies only: */
>  #ifdef CONFIG_NO_HZ_COMMON
> @@ -164,6 +164,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>         struct cpufreq_policy *policy = sg_policy->policy;
>         unsigned int freq;
>
> +       util = map_util_perf(util);
>         freq = get_capacity_ref_freq(policy);
>         freq = map_util_freq(util, freq, max);
>
> @@ -174,31 +175,14 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>         return cpufreq_driver_resolve_freq(policy, freq);
>  }
>
> -unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
> -                                unsigned long min,
> -                                unsigned long max)
> +static void sugov_get_util(struct sugov_cpu *sg_cpu)
>  {
> -       /* Add dvfs headroom to actual utilization */
> -       actual = map_util_perf(actual);
> -       /* Actually we don't need to target the max performance */
> -       if (actual < max)
> -               max = actual;
> +       unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
> +       struct rq *rq = cpu_rq(sg_cpu->cpu);
>
> -       /*
> -        * Ensure at least minimum performance while providing more compute
> -        * capacity when possible.
> -        */
> -       return max(min, max);
> -}
> -
> -static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> -{
> -       unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> -
> -       util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> -       util = max(util, boost);
> -       sg_cpu->bw_min = min;
> -       sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> +       sg_cpu->bw_dl = cpu_bw_dl(rq);
> +       sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
> +                                         FREQUENCY_UTIL, NULL);
>  }
>
>  /**
> @@ -289,16 +273,18 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
>   * This mechanism is designed to boost high frequently IO waiting tasks, while
>   * being more conservative on tasks which does sporadic IO operations.
>   */
> -static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> +static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
>                                unsigned long max_cap)
>  {
> +       unsigned long boost;
> +
>         /* No boost currently required */
>         if (!sg_cpu->iowait_boost)
> -               return 0;
> +               return;
>
>         /* Reset boost if the CPU appears to have been idle enough */
>         if (sugov_iowait_reset(sg_cpu, time, false))
> -               return 0;
> +               return;
>
>         if (!sg_cpu->iowait_boost_pending) {
>                 /*
> @@ -307,7 +293,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
>                 sg_cpu->iowait_boost >>= 1;
>                 if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
>                         sg_cpu->iowait_boost = 0;
> -                       return 0;
> +                       return;
>                 }
>         }
>
> @@ -317,7 +303,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
>          * sg_cpu->util is already in capacity scale; convert iowait_boost
>          * into the same scale so we can compare.
>          */
> -       return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
> +       boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
> +       boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
> +       if (sg_cpu->util < boost)
> +               sg_cpu->util = boost;
>  }
>
>  #ifdef CONFIG_NO_HZ_COMMON
> @@ -339,7 +328,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
>   */
>  static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
>  {
> -       if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
> +       if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
>                 sg_cpu->sg_policy->limits_changed = true;
>  }
>
> @@ -347,8 +336,6 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>                                               u64 time, unsigned long max_cap,
>                                               unsigned int flags)
>  {
> -       unsigned long boost;
> -
>         sugov_iowait_boost(sg_cpu, time, flags);
>         sg_cpu->last_update = time;
>
> @@ -357,8 +344,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>         if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
>                 return false;
>
> -       boost = sugov_iowait_apply(sg_cpu, time, max_cap);
> -       sugov_get_util(sg_cpu, boost);
> +       sugov_get_util(sg_cpu);
> +       sugov_iowait_apply(sg_cpu, time, max_cap);
>
>         return true;
>  }
> @@ -442,8 +429,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
>             sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
>                 sg_cpu->util = prev_util;
>
> -       cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
> -                                  sg_cpu->util, max_cap);
> +       cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
> +                                  map_util_perf(sg_cpu->util), max_cap);
>
>         sg_cpu->sg_policy->last_freq_update_time = time;
>  }
> @@ -459,10 +446,9 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>
>         for_each_cpu(j, policy->cpus) {
>                 struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
> -               unsigned long boost;
>
> -               boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
> -               sugov_get_util(j_sg_cpu, boost);
> +               sugov_get_util(j_sg_cpu);
> +               sugov_iowait_apply(j_sg_cpu, time, max_cap);
>
>                 util = max(j_sg_cpu->util, util);
>         }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 533547e3c90a..f2bb83675e4a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7820,7 +7820,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
>         for_each_cpu(cpu, pd_cpus) {
>                 unsigned long util = cpu_util(cpu, p, -1, 0);
>
> -               busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> +               busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
>         }
>
>         eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
> @@ -7843,7 +7843,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
>         for_each_cpu(cpu, pd_cpus) {
>                 struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
>                 unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
> -               unsigned long eff_util, min, max;
> +               unsigned long eff_util;
>
>                 /*
>                  * Performance domain frequency: utilization clamping
> @@ -7852,23 +7852,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
>                  * NOTE: in case RT tasks are running, by default the
>                  * FREQUENCY_UTIL's utilization can be max OPP.
>                  */
> -               eff_util = effective_cpu_util(cpu, util, &min, &max);
> -
> -               /* Task's uclamp can modify min and max value */
> -               if (tsk && uclamp_is_used()) {
> -                       min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
> -
> -                       /*
> -                        * If there is no active max uclamp constraint,
> -                        * directly use task's one, otherwise keep max.
> -                        */
> -                       if (uclamp_rq_is_idle(cpu_rq(cpu)))
> -                               max = uclamp_eff_value(p, UCLAMP_MAX);
> -                       else
> -                               max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
> -               }
> -
> -               eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
> +               eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
>                 max_util = max(max_util, eff_util);
>         }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 001fe047bd5d..eb7e07a1abcc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3005,14 +3005,24 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>  #endif
>
>  #ifdef CONFIG_SMP
> +/**
> + * enum cpu_util_type - CPU utilization type
> + * @FREQUENCY_UTIL:    Utilization used to select frequency
> + * @ENERGY_UTIL:       Utilization used during energy calculation
> + *
> + * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
> + * need to be aggregated differently depending on the usage made of them. This
> + * enum is used within effective_cpu_util() to differentiate the types of
> + * utilization expected by the callers, and adjust the aggregation accordingly.
> + */
> +enum cpu_util_type {
> +       FREQUENCY_UTIL,
> +       ENERGY_UTIL,
> +};
> +
>  unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> -                                unsigned long *min,
> -                                unsigned long *max);
> -
> -unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
> -                                unsigned long min,
> -                                unsigned long max);
> -
> +                                enum cpu_util_type type,
> +                                struct task_struct *p);
>
>  /*
>   * Verify the fitness of task @p to run on @cpu taking into account the
> @@ -3069,6 +3079,59 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
>         return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
>  }
>
> +/**
> + * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
> + * @rq:                The rq to clamp against. Must not be NULL.
> + * @util:      The util value to clamp.
> + * @p:         The task to clamp against. Can be NULL if you want to clamp
> + *             against @rq only.
> + *
> + * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
> + *
> + * If sched_uclamp_used static key is disabled, then just return the util
> + * without any clamping since uclamp aggregation at the rq level in the fast
> + * path is disabled, rendering this operation a NOP.
> + *
> + * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
> + * will return the correct effective uclamp value of the task even if the
> + * static key is disabled.
> + */
> +static __always_inline
> +unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> +                                 struct task_struct *p)
> +{
> +       unsigned long min_util = 0;
> +       unsigned long max_util = 0;
> +
> +       if (!static_branch_likely(&sched_uclamp_used))
> +               return util;
> +
> +       if (p) {
> +               min_util = uclamp_eff_value(p, UCLAMP_MIN);
> +               max_util = uclamp_eff_value(p, UCLAMP_MAX);
> +
> +               /*
> +                * Ignore last runnable task's max clamp, as this task will
> +                * reset it. Similarly, no need to read the rq's min clamp.
> +                */
> +               if (uclamp_rq_is_idle(rq))
> +                       goto out;
> +       }
> +
> +       min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
> +       max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
> +out:
> +       /*
> +        * Since CPU's {min,max}_util clamps are MAX aggregated considering
> +        * RUNNABLE tasks with _different_ clamps, we can end up with an
> +        * inversion. Fix it now when the clamps are applied.
> +        */
> +       if (unlikely(min_util >= max_util))
> +               return min_util;
> +
> +       return clamp(util, min_util, max_util);
> +}
> +
>  /* Is the rq being capped/throttled by uclamp_max? */
>  static inline bool uclamp_rq_is_capped(struct rq *rq)
>  {
> @@ -3106,6 +3169,13 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
>         return SCHED_CAPACITY_SCALE;
>  }
>
> +static inline
> +unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> +                                 struct task_struct *p)
> +{
> +       return util;
> +}
> +
>  static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }
>
>  static inline bool uclamp_is_used(void)


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-11 18:16               ` Vincent Guittot
@ 2024-01-12 14:23                 ` Dietmar Eggemann
  2024-01-12 16:58                   ` Vincent Guittot
  2024-01-12 18:18                   ` Qais Yousef
  0 siblings, 2 replies; 54+ messages in thread
From: Dietmar Eggemann @ 2024-01-12 14:23 UTC (permalink / raw)
  To: Vincent Guittot, Linus Torvalds
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On 11/01/2024 19:16, Vincent Guittot wrote:
> On Thu, 11 Jan 2024 at 18:53, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
>>> <vincent.guittot@linaro.org> wrote:
>>>>
>>>> Could you confirm that the cpufreq governor is schedutil and that the
>>>> driver is amd-pstate on your system?
>>>
>>> schedutil yes, amd-pstate no. I actually just use acpi_cpufreq
>>
>> Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
>> unfinished quoting removal.
>>
>> And don't ask me why it's acpi_pstate-driven. I have X86_AMD_PSTATE=y, but
>>
>>     /sys/devices/system/cpu/cpufreq/policy0/scaling_driver
>>
>> clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says
> 
> That seems to be the right place to look
> 
>>
>>     amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
>>
>> which is presumably the reason my machine uses acpi-pstate.
>>
>> I will also test out your other questions, but I need to go back and
>> do more pull requests first.
> 
> ok, thanks
> 
> I'm going to continue checking what else could trigger such a regression,
> bearing in mind that your system should not have been impacted by these
> changes.

I can't see the regression on my

  20-core (40-thread) Intel Xeon CPU E5-2690 v2

with 'schedutil' and 'acpi-cpufreq'.

f12560779f9d - sched/cpufreq: Rework iowait boost                              <- (w/ patches)
9c0b4bb7f630 - sched/cpufreq: Rework schedutil governor performance estimation
50181c0cff31 - sched/pelt: Avoid underestimation of task utilization           <- (base)
...

# cpufreq-info -c 0 -e
...
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us.
  hardware limits: 1.20 GHz - 3.00 GHz
  available frequency steps: 3.00 GHz, 3.00 GHz, 2.90 GHz, 2.70 GHz, 2.60 GHz, 2.50 GHz, 2.40 GHz, 2.20 GHz,
                             2.10 GHz, 2.00 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.30 GHz, 1.20 GHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance, schedutil
  current policy: frequency should be within 1.20 GHz and 3.00 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency is 1.20 GHz (asserted by call to hardware).


cpufreq is still fast-switching, so no schedutil 'sugov' DL threads.
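
(A quick way to double-check that, assuming the kthreads keep the usual
"sugov:<policy-cpu>" naming and that ps prints SCHED_DEADLINE as class DLN;
on a fast-switching setup this should print nothing:)

	ps -e -o pid,cls,comm | grep sugov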


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-12 14:23                 ` Dietmar Eggemann
@ 2024-01-12 16:58                   ` Vincent Guittot
  2024-01-12 18:18                   ` Qais Yousef
  1 sibling, 0 replies; 54+ messages in thread
From: Vincent Guittot @ 2024-01-12 16:58 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Linus Torvalds, Ingo Molnar, linux-kernel, Peter Zijlstra,
	Thomas Gleixner, Juri Lelli, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider

On Fri, 12 Jan 2024 at 15:23, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 11/01/2024 19:16, Vincent Guittot wrote:
> > On Thu, 11 Jan 2024 at 18:53, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
> >> <torvalds@linux-foundation.org> wrote:
> >>>
> >>> On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
> >>> <vincent.guittot@linaro.org> wrote:
> >>>>
> > >>>> Could you confirm that the cpufreq governor is schedutil and that the
> > >>>> driver is amd-pstate on your system?
> >>>
> >>> schedutil yes, amd-pstate no. I actually just use acpi_cpufreq
> >>
> >> Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
> >> unfinished quoting removal.
> >>
> >> And don't ask me why it's acpi_pstate-driven. I have X86_AMD_PSTATE=y, but
> >>
> >>     /sys/devices/system/cpu/cpufreq/policy0/scaling_driver
> >>
> >> clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says
> >
> > That seems to be the right place to look
> >
> >>
> >>     amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
> >>
> >> which is presumably the reason my machine uses acpi-pstate.
> >>
> >> I will also test out your other questions, but I need to go back and
> >> do more pull requests first.
> >
> > ok, thanks
> >
> > I'm going to continue checking what else could trigger such a regression,
> > bearing in mind that your system should not have been impacted by these
> > changes.
>
> I can't see the regression on my
>
>   20-core (40-thread) Intel Xeon CPU E5-2690 v2
>
> with 'schedutil' and 'acpi-cpufreq'.

Thanks for the tests

>
> f12560779f9d - sched/cpufreq: Rework iowait boost                              <- (w/ patches)
> 9c0b4bb7f630 - sched/cpufreq: Rework schedutil governor performance estimation
> 50181c0cff31 - sched/pelt: Avoid underestimation of task utilization           <- (base)
> ...
>
> # cpufreq-info -c 0 -e
> ...
> analyzing CPU 0:
>   driver: acpi-cpufreq
>   CPUs which run at the same hardware frequency: 0
>   CPUs which need to have their frequency coordinated by software: 0
>   maximum transition latency: 10.0 us.
>   hardware limits: 1.20 GHz - 3.00 GHz
>   available frequency steps: 3.00 GHz, 3.00 GHz, 2.90 GHz, 2.70 GHz, 2.60 GHz, 2.50 GHz, 2.40 GHz, 2.20 GHz,
>                              2.10 GHz, 2.00 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.30 GHz, 1.20 GHz
>   available cpufreq governors: conservative, ondemand, userspace, powersave, performance, schedutil
>   current policy: frequency should be within 1.20 GHz and 3.00 GHz.
>                   The governor "schedutil" may decide which speed to use
>                   within this range.
>   current CPU frequency is 1.20 GHz (asserted by call to hardware).
>
>
> cpufreq is still fast-switching, so no schedutil 'sugov' DL threads.


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-12 14:23                 ` Dietmar Eggemann
  2024-01-12 16:58                   ` Vincent Guittot
@ 2024-01-12 18:18                   ` Qais Yousef
  2024-01-12 19:03                     ` Vincent Guittot
  1 sibling, 1 reply; 54+ messages in thread
From: Qais Yousef @ 2024-01-12 18:18 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On 01/12/24 15:23, Dietmar Eggemann wrote:
> On 11/01/2024 19:16, Vincent Guittot wrote:
> > On Thu, 11 Jan 2024 at 18:53, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
> >> <torvalds@linux-foundation.org> wrote:
> >>>
> >>> On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
> >>> <vincent.guittot@linaro.org> wrote:
> >>>>
> > >>>> Could you confirm that the cpufreq governor is schedutil and that the
> > >>>> driver is amd-pstate on your system?
> >>>
> >>> schedutil yes, amd-pstate no. I actually just use acpi_cpufreq
> >>
> >> Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
> >> unfinished quoting removal.
> >>
> >> And don't ask me why it's acpi_pstate-driven. I have X86_AMD_PSTATE=y, but
> >>
> >>     /sys/devices/system/cpu/cpufreq/policy0/scaling_driver
> >>
> >> clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says
> > 
> > That seems to be the right place to look
> > 
> >>
> >>     amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
> >>
> >> which is presumably the reason my machine uses acpi-pstate.
> >>
> >> I will also test out your other questions, but I need to go back and
> >> do more pull requests first.
> > 
> > ok, thanks
> > 
> > I'm going to continue checking what else could trigger such a regression,
> > bearing in mind that your system should not have been impacted by these
> > changes.
> 
> I can't see the regression on my
> 
>   20-core (40-thread) Intel Xeon CPU E5-2690 v2
> 
> with 'schedutil' and 'acpi-cpufreq'.

I tried to reproduce this on an AMD 3900X 12-core system. I don't see any
difference in compiling defconfig with and without the two patches reverted:
~1m26s either way.

using schedutil and acpi-cpufreq driver too.

I disabled CONFIG_UCLAMP_TASK and that didn't make a difference.

I would have expected the iowait boost change to be the more troublesome of
the two, being the more subtle one, tbh.


* Re: [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit
  2024-01-11 20:48             ` [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit Ingo Molnar
  2024-01-11 22:22               ` Vincent Guittot
@ 2024-01-12 18:24               ` Ingo Molnar
  1 sibling, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2024-01-12 18:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Linus Torvalds, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider


* Ingo Molnar <mingo@kernel.org> wrote:

> > I can provide a clean revert of only:
> > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
> 
> I've done this too, see this new commit in sched/urgent:
> 
>   60ee1706bd11 ("Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit")
> 
> Also attached below.
> 
> > if the fix that I proposed doesn't work:
> > https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/
> 
> Yeah - although of course Linus is free to just pull the revert as well. 
> I'll try to reproduce the regression locally as well.

Update & heads up: unfortunately I'm unable to reproduce the regression on 
a fairly similar system with a Threadripper 3970X CPU.

Kernel build times are very close, with or without the revert, on vanilla 
v6.7 or v6.7+sched/core.

Here are a few results where I tried to quantify kernel build times without 
having to wait a long time.

Re-building the kernel/**.o object files in a loop:

   $ perf stat --pre 'rm -f kernel/*.o kernel/*/*.o kernel/*/*/*.o' --null --sync --repeat 3 make -j64 kernel/ >/dev/null


    # v6.7.0:
    # bootup default schedutil governor:
              24.521 +- 0.077 seconds time elapsed  ( +-  0.31% )
              24.644 +- 0.071 seconds time elapsed  ( +-  0.29% )

    # cpufreq-max:
              24.452 +- 0.110 seconds time elapsed  ( +-  0.45% )
              24.482 +- 0.048 seconds time elapsed  ( +-  0.20% )

    # v6.7.0+sched/core:
    # bootup default schedutil governor:
              24.666 +- 0.063 seconds time elapsed  ( +-  0.26% )
              24.809 +- 0.118 seconds time elapsed  ( +-  0.48% )

The fully-cached build numbers are very close to each other, and during the 
hot phase of the kernel build all CPUs are saturated.

The 2x performance regression that Linus is seeing is either some 
pathological wakeup behavior, or perhaps a case of the cores not 
transitioning frequencies. The difference between the lowest and highest 
frequency is pretty substantial (at least on my box):

  cpu MHz		: 2200.000
  ...
  cpu MHz		: 4000.000


There was *one* test, when the tree was cache-cold, where I saw really bad 
performance (which I didn't really expect with my nvram system), with -j32 
builds:

   Performance counter stats for 'make -j32 kernel/' (3 runs):

              64.34 +- 39.22 seconds time elapsed  ( +- 60.95% )
              25.08 +- 0.142 seconds time elapsed  ( +-  0.56% )
              24.97 +- 0.072 seconds time elapsed  ( +-  0.29% )

Unfortunately that outlier was on a vanilla v6.7 bootup.

As a next step I could try Linus's specific config; maybe there's some 
detail in it that makes the difference.

The commit that Linus bisected to (9c0b4bb7f6303c) doesn't *seem* wrong in 
itself, especially without uclamp [I presume Linus doesn't use 
CONFIG_UCLAMP_TASK=y and the cpu.uclamp.min/uclamp.max cgroup interface 
that goes with it?], but the commit changes how we use the schedutil 
utilization metrics, which could change scheduling patterns - which is why 
I spent many hours yesterday and today trying to find a pathological 
workload to reproduce this. No luck so far.

Linus: I can send a pull request for the 2-commit revert, or maybe you 
could try Vincent's guess-patch that tries to restore the previous behavior 
as closely as possible.

Thanks,

	Ingo


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-10 22:57       ` Linus Torvalds
  2024-01-11  8:11         ` Vincent Guittot
  2024-01-11 11:09         ` [GIT PULL] scheduler fixes Ingo Molnar
@ 2024-01-12 18:26         ` Ingo Molnar
  2024-01-14  9:12         ` Wyes Karny
  3 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2024-01-12 18:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.

Same here:

   model name	: AMD Ryzen Threadripper 3970X 32-Core Processor

Could you please send me your desktop system's .config privately?

If it's triggered by something in the .config, I might be able to 
reproduce.

Thanks,

	Ingo


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-12 18:18                   ` Qais Yousef
@ 2024-01-12 19:03                     ` Vincent Guittot
  2024-01-12 20:30                       ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Vincent Guittot @ 2024-01-12 19:03 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Dietmar Eggemann, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On Fri, 12 Jan 2024 at 19:18, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/12/24 15:23, Dietmar Eggemann wrote:
> > On 11/01/2024 19:16, Vincent Guittot wrote:
> > > On Thu, 11 Jan 2024 at 18:53, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > >>
> > >> On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
> > >> <torvalds@linux-foundation.org> wrote:
> > >>>
> > >>> On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
> > >>> <vincent.guittot@linaro.org> wrote:
> > >>>>
> > >>>> Could you confirm that the cpufreq governor is schedutil and that the
> > >>>> driver is amd-pstate on your system?
> > >>>
> > >>> schedutil yes, amd-pstate no. I actually just use acpi_cpufreq
> > >>
> > >> Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
> > >> unfinished quoting removal.
> > >>
> > >> And don't ask me why it's acpi_pstate-driven. I have X86_AMD_PSTATE=y, but
> > >>
> > >>     /sys/devices/system/cpu/cpufreq/policy0/scaling_driver
> > >>
> > >> clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says
> > >
> > > That seems to be the right place to look
> > >
> > >>
> > >>     amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
> > >>
> > >> which is presumably the reason my machine uses acpi-pstate.
> > >>
> > >> I will also test out your other questions, but I need to go back and
> > >> do more pull requests first.
> > >
> > > ok, thanks
> > >
> > > I'm going to continue checking what else could trigger such a regression,
> > > bearing in mind that your system should not have been impacted by these
> > > changes.
> >
> > I can't see the regression on my
> >
> >   20-core (40-thread) Intel Xeon CPU E5-2690 v2
> >
> > with 'schedutil' and 'acpi-cpufreq'.
>
> I tried to reproduce this on an AMD 3900X 12-core system. I don't see any
> difference in compiling defconfig with and without the two patches reverted:
> ~1m26s either way.

Thanks for testing.
I haven't been able to reproduce the problem either. But my systems are
quite different.

>
> using schedutil and acpi-cpufreq driver too.
>
> I disabled CONFIG_UCLAMP_TASK and that didn't make a difference.
>
> I would have expected the iowait boost change to be the more troublesome of
> the two, being the more subtle one, tbh.


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-12 19:03                     ` Vincent Guittot
@ 2024-01-12 20:30                       ` Linus Torvalds
  2024-01-12 20:49                         ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2024-01-12 20:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Qais Yousef, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

Ok, so testing a bit more. On a working kernel, when I do an empty
"make" (which is the fast test I've used), it's all single-threaded
because it's just 'make' doing tons of stat calls and string
operations.

And "cat /proc/cpuinfo | grep MHz" shows a nice clear signal:

  ...
  cpu MHz : 2200.000
  cpu MHz : 2200.000
  cpu MHz : 4425.339
  cpu MHz : 2200.000
  ...

so it boosts up to the top boost frequency.

Without the revert, doing the same thing, what I see is very
different. It's all just

  ...
  cpu MHz : 2200.000
  cpu MHz : 2200.000
  cpu MHz : 2200.000
  cpu MHz : 2200.000
  ...

which certainly explains why it takes 45s rather than 22s to do a full
empty build.

Btw, the "full empty build" I do is literally just

    timestamp sh -c "make -j128 > ../makes"

where 'timestamp' is my stupid little wrapper program that just shows
elapsed time as the command is progressing (as opposed to "time",
which just shows it at the end).
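
(For reference, a hypothetical stand-in for that wrapper as a small shell
function; just a sketch, not the actual tool:)

	# run the command in the background, print elapsed time until it exits
	timestamp() {
		"$@" &
		local pid=$! t0=$SECONDS
		while kill -0 "$pid" 2>/dev/null; do
			printf '%ds elapsed\n' "$((SECONDS - t0))"
			sleep 5
		done
		wait "$pid"	# propagate the command's exit status
	}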

Side note: that 4425.339 is very much the boost frequency; 'cpupower' reports:

  $ cpupower frequency-info
  analyzing CPU 0:
    driver: acpi-cpufreq
    CPUs which run at the same hardware frequency: 0
    CPUs which need to have their frequency coordinated by software: 0
    maximum transition latency:  Cannot determine or is not supported.
    hardware limits: 2.20 GHz - 3.70 GHz
    available frequency steps:  3.70 GHz, 2.80 GHz, 2.20 GHz
    available cpufreq governors: conservative ondemand userspace
powersave performance schedutil
    current policy: frequency should be within 2.20 GHz and 3.70 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
    current CPU frequency: Unable to call hardware
    current CPU frequency: 2.20 GHz (asserted by call to kernel)
    boost state support:
      Supported: yes
      Active: no

and for all I know the scheduler got confused by the fact that it
thinks the hardware limits are 2.2-3.7 GHz. But the 3970X has a boost
frequency of 4.5GHz, and yes, I very much want it.
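
(Whether the platform-level boost toggle is enabled can also be read from
the acpi-cpufreq sysfs knob, assuming the driver exposes it:)

	cat /sys/devices/system/cpu/cpufreq/boost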

I will test Vincent's test-patch next.

                Linus


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-12 20:30                       ` Linus Torvalds
@ 2024-01-12 20:49                         ` Linus Torvalds
  2024-01-12 21:04                           ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2024-01-12 20:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Qais Yousef, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On Fri, 12 Jan 2024 at 12:30, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I will test Vincent's test-patch next.

The patch at

    https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/

makes absolutely no difference. All cores stay at 2.2GHz (ok, so
there's noise, but we're talking "within a couple of MHz of 2.2GHz").

             Linus


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-12 20:49                         ` Linus Torvalds
@ 2024-01-12 21:04                           ` Linus Torvalds
  2024-01-13  1:04                             ` Qais Yousef
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2024-01-12 21:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Qais Yousef, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On Fri, 12 Jan 2024 at 12:49, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> All cores stay at 2.2GHz (ok, so there's noise, but we're
> talking "within a couple of MHz of 2.2GHz").

Note: that is true also when every CPU is fully loaded and I do a real
full build.

So the "empty make" is just my quick test that happens to be
single-threaded and should take just 20s. All my real builds slow down
too, because all CPUs stay at the minimum frequency.

And I just verified that Ingo's revert that only reverts two commits
(commit 60ee1706bd11 in the tip tree), makes things work correctly for
me.

Not surprising, since the bisection clearly pointed at just commit
9c0b4bb7f6303c being the one that caused the issue, but I decided to
just double-check anyway.

So with that revert, for the single-threaded case I see 4GHz+ numbers
(they spread from a single CPU to multiple CPUs once you run the
benchmark a few times).

And then when I run a full parallel build (rather than the
single-threaded empty one), the frequencies drop to ~3.85GHz for the
all-cpu case.

                Linus


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-12 21:04                           ` Linus Torvalds
@ 2024-01-13  1:04                             ` Qais Yousef
  2024-01-13  1:24                               ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Qais Yousef @ 2024-01-13  1:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On 01/12/24 13:04, Linus Torvalds wrote:
> On Fri, 12 Jan 2024 at 12:49, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > All cores stay at 2.2GHz (ok, so there's noise, but we're
> > talking "within a couple of MHz of 2.2GHz").
> 
> Note: that is true also when every CPU is fully loaded and I do a real
> full build.

That is odd. I can't see how the patch could cause this yet; could you try
with a different compiler, if possible? It seems like some operation goes
wrong and we end up with util, or one of its transformations, becoming
0 or close to it.

I tried your steps and I see frequencies changing. I have gcc
11.4.0 by default on my system.

I usually use perfetto, but it should be easy to see frequency updates from
the power/cpu_frequency trace event.

	echo 1 | sudo tee /sys/kernel/tracing/tracing_on
	echo 1 | sudo tee /sys/kernel/tracing/events/power/cpu_frequency/enable
	sudo cat /sys/kernel/tracing/trace

or

	sudo cat /sys/kernel/tracing/trace_pipe

to clear the buffer and block on read

Or easier with trace-cmd

	sudo trace-cmd record -e power/cpu_frequency $COMPILE_CMD
	sudo trace-cmd report

If there's no idle time, the frequency updates will stop once we reach the
highest frequency.
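
(A complementary check that doesn't depend on the trace events firing is to
sample what the kernel reports, e.g.:)

	# show how many CPUs sit at each frequency, refreshed every second
	watch -n1 "grep 'cpu MHz' /proc/cpuinfo | sort | uniq -c"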

When I run it I see 2.2GHz and 3.8GHz values only, with and without the patches
reverted.

          <idle>-0     [017]5089547182905: cpu_frequency:        state=2200000 cpu_id=17
           <...>-77147 [015]5089548006123: cpu_frequency:        state=3800000 cpu_id=15
          <idle>-0     [004]5089553342808: cpu_frequency:        state=3800000 cpu_id=4
          <idle>-0     [003]5089566989667: cpu_frequency:        state=2200000 cpu_id=3
              rm-77149 [005]5089569246195: cpu_frequency:        state=3800000 cpu_id=5
          <idle>-0     [004]5089570419322: cpu_frequency:        state=2200000 cpu_id=4
          <idle>-0     [017]5089570826267: cpu_frequency:        state=3800000 cpu_id=17
          <idle>-0     [003]5089575702589: cpu_frequency:        state=3800000 cpu_id=3
          <idle>-0     [017]5089576045916: cpu_frequency:        state=2200000 cpu_id=17
          <idle>-0     [003]5089584007141: cpu_frequency:        state=2200000 cpu_id=3
          <idle>-0     [015]5089593025639: cpu_frequency:        state=2200000 cpu_id=15
          <idle>-0     [010]5089593661028: cpu_frequency:        state=2800000 cpu_id=10
          <idle>-0     [023]5089595181029: cpu_frequency:        state=2200000 cpu_id=23
           <...>-77153 [015]5089595202328: cpu_frequency:        state=3800000 cpu_id=15
          <idle>-0     [017]5089596112508: cpu_frequency:        state=3800000 cpu_id=17
          <idle>-0     [003]5089601227012: cpu_frequency:        state=3800000 cpu_id=3
          <idle>-0     [017]5089601303574: cpu_frequency:        state=2200000 cpu_id=17
          <idle>-0     [005]5089611995487: cpu_frequency:        state=2200000 cpu_id=5
          <idle>-0     [017]5089612446143: cpu_frequency:        state=3800000 cpu_id=17
          <idle>-0     [003]5089612461191: cpu_frequency:        state=2200000 cpu_id=3
             cc1-77159 [009]5089616006631: cpu_frequency:        state=3800000 cpu_id=9
          <idle>-0     [003]5089618213587: cpu_frequency:        state=3800000 cpu_id=3
          <idle>-0     [017]5089618245105: cpu_frequency:        state=2200000 cpu_id=17
          <idle>-0     [003]5089624066151: cpu_frequency:        state=2200000 cpu_id=3
          <idle>-0     [017]5089627031955: cpu_frequency:        state=3800000 cpu_id=17
          <idle>-0     [003]5089632148220: cpu_frequency:        state=3800000 cpu_id=3
          <idle>-0     [017]5089633114584: cpu_frequency:        state=2200000 cpu_id=17
         objcopy-77166 [023]5089635324796: cpu_frequency:        state=3800000 cpu_id=23
          <idle>-0     [003]5089636043349: cpu_frequency:        state=2200000 cpu_id=3
          <idle>-0     [018]5089636071762: cpu_frequency:        state=2200000 cpu_id=18
          <idle>-0     [017]5089636511027: cpu_frequency:        state=3800000 cpu_id=17
          <idle>-0     [003]5089638171879: cpu_frequency:        state=3800000 cpu_id=3
           build-77168 [019]5089640011393: cpu_frequency:        state=3800000 cpu_id=19
          <idle>-0     [017]5089652020092: cpu_frequency:        state=2200000 cpu_id=17
          <idle>-0     [004]5089653340503: cpu_frequency:        state=3800000 cpu_id=4
          <idle>-0     [004]5089654532595: cpu_frequency:        state=2200000 cpu_id=4
          <idle>-0     [003]5089656013393: cpu_frequency:        state=2200000 cpu_id=3
          <idle>-0     [004]5089666815072: cpu_frequency:        state=3800000 cpu_id=4
          <idle>-0     [011]5089697117342: cpu_frequency:        state=3800000 cpu_id=11
             cat-77170 [010]5089697219972: cpu_frequency:        state=3800000 cpu_id=10
          <idle>-0     [004]5089697313957: cpu_frequency:        state=2200000 cpu_id=4
          <idle>-0     [017]5089699129526: cpu_frequency:        state=3800000 cpu_id=17
              ln-77172 [016]5089699710505: cpu_frequency:        state=3800000 cpu_id=16
          <idle>-0     [022]5089700249275: cpu_frequency:        state=2200000 cpu_id=22
          <idle>-0     [023]5089700316449: cpu_frequency:        state=2200000 cpu_id=23
          <idle>-0     [009]5089700372223: cpu_frequency:        state=2200000 cpu_id=9


This is what cpupower frequency-info returns on my system

	analyzing CPU 0:
	  driver: acpi-cpufreq
	  CPUs which run at the same hardware frequency: 0
	  CPUs which need to have their frequency coordinated by software: 0
	  maximum transition latency:  Cannot determine or is not supported.
	  hardware limits: 2.20 GHz - 4.67 GHz
	  available frequency steps:  3.80 GHz, 2.80 GHz, 2.20 GHz
	  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
	  current policy: frequency should be within 2.20 GHz and 3.80 GHz.
			  The governor "schedutil" may decide which speed to use
			  within this range.
	  current CPU frequency: Unable to call hardware
	  current CPU frequency: 2.20 GHz (asserted by call to kernel)
	  boost state support:
	    Supported: yes
	    Active: no


To my understanding, schedutil doesn't know about turbo frequencies, but we
can read the current freq via counters, which I think /proc/cpuinfo uses. The
trace events will only show 3.8GHz as that's what we know about, but
/proc/cpuinfo shows above 4GHz if I run a single-threaded workload, similar
to what you've seen. Again, with and without the patches reverted.
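
(Those counters are the APERF/MPERF MSRs; a rough way to read them directly,
assuming msr-tools is installed and taking the 3.8GHz nominal frequency here
as the reference, is effective MHz ~= nominal MHz * delta-APERF / delta-MPERF:)

	sudo modprobe msr
	# IA32_APERF is MSR 0xe8, IA32_MPERF is MSR 0xe7 (AMD implements both too)
	a0=$(sudo rdmsr -p 0 -u 0xe8); m0=$(sudo rdmsr -p 0 -u 0xe7)
	sleep 1
	a1=$(sudo rdmsr -p 0 -u 0xe8); m1=$(sudo rdmsr -p 0 -u 0xe7)
	echo "CPU0 effective MHz: $(( 3800 * (a1 - a0) / (m1 - m0) ))"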

I have to say, I am on tip/sched/core still. I probably should switch to your
tree to make sure there are no unexpected differences that can affect
reproducibility.


Cheers

--
Qais Yousef

> 
> So the "empty make" is just my quick test that happens to be
> single-threaded and should take just 20s. All my real builds slow down
> too, because all CPUs stay at the minimum frequency.
> 
> And I just verified that Ingo's revert that only reverts two commits
> (commit 60ee1706bd11 in the tip tree), makes things work correctly for
> me.
> 
> Not surprising, since the bisection clearly pointed at just commit
> 9c0b4bb7f6303c being the one that caused the issue, but I decided to
> just double-check anyway.
> 
> So with that revert, for the single-threaded case I see 4GHz+ numbers
> (they spread from a single CPU to multiple CPUs once you run the
> benchmark a few times).
> 
> And then when I run a full parallel build (rather than the
> single-threaded empty one), the frequencies drop to ~3.85GHz for the
> all-cpu case.
> 
>                 Linus


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-13  1:04                             ` Qais Yousef
@ 2024-01-13  1:24                               ` Linus Torvalds
  2024-01-13  1:31                                 ` Linus Torvalds
  2024-01-13 18:37                                 ` Qais Yousef
  0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2024-01-13  1:24 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On Fri, 12 Jan 2024 at 17:04, Qais Yousef <qyousef@layalina.io> wrote:
>
> That is odd. I can't see how the patch could cause this yet; could you try
> with a different compiler, if possible?

I use two different compilers - I do my allmodconfig builds with gcc,
and the kernels I boot with clang, so my problems have been with a
kernel built with

   clang version 17.0.6

but to check that it's not a compiler issue I just did another try
with my current public tip of tree (ie *without* any reverts for this
issue) and gcc:

    gcc version 13.2.1

and the behavior is exactly the same: all cores are stuck at 2.2GHz.

So no, it's not compiler-dependent.

> I usually use perfetto, but it should be easy to see frequency updates from
> the power/cpu_frequency trace event.
>
>         echo 1 | sudo tee /sys/kernel/tracing/tracing_on
>         echo 1 | sudo tee /sys/kernel/tracing/events/power/cpu_frequency/enable
>         sudo cat /sys/kernel/tracing/trace

Shows absolutely nothing. Or rather, it shows the header with

  # entries-in-buffer/entries-written: 0/0   #P:64

and that's it.

With a *working* kernel, I get events, setting the frequency to either
2.2GHz (idle) or 3.8GHz (work).

IOW, the tracing output is 100% consistent with "that commit breaks everything".

                  Linus


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-13  1:24                               ` Linus Torvalds
@ 2024-01-13  1:31                                 ` Linus Torvalds
  2024-01-13 10:47                                   ` Vincent Guittot
  2024-01-13 18:37                                 ` Qais Yousef
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2024-01-13  1:31 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On Fri, 12 Jan 2024 at 17:24, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> With a *working* kernel, I get events, setting the frequency to either
> 2.2GHz (idle) or 3.8GHz (work).

Just to fix that - not 3.8GHz, but in addition to 2.2 I see 2.8 or 3.7:

  ...
  <idle>-0       [034] d.s..   208.340412: cpu_frequency: state=2200000 cpu_id=34
     cc1-101686  [034] d.h..   208.342402: cpu_frequency: state=2800000 cpu_id=34
     cc1-101686  [034] d.h..   208.343401: cpu_frequency: state=3700000 cpu_id=34
      sh-108794  [029] d.h..   216.401014: cpu_frequency: state=2200000 cpu_id=29
      sh-108794  [029] d....   216.402670: cpu_frequency: state=2800000 cpu_id=29
genksyms-108565  [029] d.h..   216.404005: cpu_frequency: state=3700000 cpu_id=29
  ...

etc.

              Linus


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-13  1:31                                 ` Linus Torvalds
@ 2024-01-13 10:47                                   ` Vincent Guittot
  2024-01-13 18:33                                     ` Qais Yousef
  0 siblings, 1 reply; 54+ messages in thread
From: Vincent Guittot @ 2024-01-13 10:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Qais Yousef, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

Le vendredi 12 janv. 2024 à 17:31:13 (-0800), Linus Torvalds a écrit :
> On Fri, 12 Jan 2024 at 17:24, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > With a *working* kernel, I get events, setting the frequency to either
> > 2.2GHz (idle) or 3.8GHz (work).
> 
> Just to fix that - not 3.8GHz, but in addition to 2.2 I see 2.8 or 3.7:

IIUC, with the commit you stay at the lowest frequency all the time, which
is clearly weird. One change that could create such behavior is in
sugov_effective_cpu_perf(), where we take the min between the actual
utilization and the max allowed. If max is set to the min capacity for
whatever reason, then you stay stuck at the lowest capacity/frequency.

What is the output of

	/sys/devices/system/cpu/cpu0/cpu_capacity ?

It should be 1024 or close to it.

Could you try the patch below? It skips this step and ensures that the
actual utilization is used to select the frequency.

---
 kernel/sched/cpufreq_schedutil.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..e420e2ee1a10 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -181,7 +181,6 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 	/* Add dvfs headroom to actual utilization */
 	actual = map_util_perf(actual);
 	/* Actually we don't need to target the max performance */
-	if (actual < max)
 		max = actual;
 
 	/*
-- 
2.34.1

> 
>   ...
>   <idle>-0       [034] d.s..   208.340412: cpu_frequency: state=2200000 cpu_id=34
>      cc1-101686  [034] d.h..   208.342402: cpu_frequency: state=2800000 cpu_id=34
>      cc1-101686  [034] d.h..   208.343401: cpu_frequency: state=3700000 cpu_id=34
>       sh-108794  [029] d.h..   216.401014: cpu_frequency: state=2200000 cpu_id=29
>       sh-108794  [029] d....   216.402670: cpu_frequency: state=2800000 cpu_id=29
> genksyms-108565  [029] d.h..   216.404005: cpu_frequency: state=3700000 cpu_id=29
>   ...
> 
> etc.
> 
>               Linus


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-13 10:47                                   ` Vincent Guittot
@ 2024-01-13 18:33                                     ` Qais Yousef
  0 siblings, 0 replies; 54+ messages in thread
From: Qais Yousef @ 2024-01-13 18:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Linus Torvalds, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On 01/13/24 11:47, Vincent Guittot wrote:
> Le vendredi 12 janv. 2024 à 17:31:13 (-0800), Linus Torvalds a écrit :
> > On Fri, 12 Jan 2024 at 17:24, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > With a *working* kernel, I get events, setting the frequency to either
> > > 2.2GHz (idle) or 3.8GHz (work).
> > 
> > Just to fix that - not 3.8GHz, but in addition to 2.2 I see 2.8 or 3.7:
> 
> IIUC, with the commit you stay at the lowest frequency all the time, which
> is clearly weird. One change that could create such behavior is in
> sugov_effective_cpu_perf(), where we take the min between the actual
> utilization and the max allowed. If max is set to the min capacity for
> whatever reason, then you stay stuck at the lowest capacity/frequency.

I tried on Linus' ToT (052d534373b7) with and without CONFIG_UCLAMP_TASK=y and
I still can't reproduce.

> 
> What is the output of 
> /sys/devices/system/cpu/cpu0/cpu_capacity

Since it's a symmetric (SMP) system, these files aren't generated.

There's an acpi_cppc node though. For CPU0:

	$ grep . /sys/devices/system/cpu/cpu0/acpi_cppc/*
	/sys/devices/system/cpu/cpu0/acpi_cppc/feedback_ctrs:ref:500461208742 del:522909165272
	/sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf:216
	/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq:550
	/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_nonlinear_perf:62
	/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_perf:20
	/sys/devices/system/cpu/cpu0/acpi_cppc/nominal_freq:3801
	/sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf:135
	/sys/devices/system/cpu/cpu0/acpi_cppc/reference_perf:135
	/sys/devices/system/cpu/cpu0/acpi_cppc/wraparound_time:18446744073709551615

highest_perf for all CPUs

	$ grep . /sys/devices/system/cpu/cpu*/acpi_cppc/highest_perf
	/sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf:216
	/sys/devices/system/cpu/cpu10/acpi_cppc/highest_perf:166
	/sys/devices/system/cpu/cpu11/acpi_cppc/highest_perf:176
	/sys/devices/system/cpu/cpu12/acpi_cppc/highest_perf:216
	/sys/devices/system/cpu/cpu13/acpi_cppc/highest_perf:216
	/sys/devices/system/cpu/cpu14/acpi_cppc/highest_perf:211
	/sys/devices/system/cpu/cpu15/acpi_cppc/highest_perf:206
	/sys/devices/system/cpu/cpu16/acpi_cppc/highest_perf:201
	/sys/devices/system/cpu/cpu17/acpi_cppc/highest_perf:196
	/sys/devices/system/cpu/cpu18/acpi_cppc/highest_perf:191
	/sys/devices/system/cpu/cpu19/acpi_cppc/highest_perf:186
	/sys/devices/system/cpu/cpu1/acpi_cppc/highest_perf:216
	/sys/devices/system/cpu/cpu20/acpi_cppc/highest_perf:181
	/sys/devices/system/cpu/cpu21/acpi_cppc/highest_perf:171
	/sys/devices/system/cpu/cpu22/acpi_cppc/highest_perf:166
	/sys/devices/system/cpu/cpu23/acpi_cppc/highest_perf:176
	/sys/devices/system/cpu/cpu2/acpi_cppc/highest_perf:211
	/sys/devices/system/cpu/cpu3/acpi_cppc/highest_perf:206
	/sys/devices/system/cpu/cpu4/acpi_cppc/highest_perf:201
	/sys/devices/system/cpu/cpu5/acpi_cppc/highest_perf:196
	/sys/devices/system/cpu/cpu6/acpi_cppc/highest_perf:191
	/sys/devices/system/cpu/cpu7/acpi_cppc/highest_perf:186
	/sys/devices/system/cpu/cpu8/acpi_cppc/highest_perf:181
	/sys/devices/system/cpu/cpu9/acpi_cppc/highest_perf:171

Could we be overshooting somehow now and hitting a bug in the frequency
translation in cpufreq, by any chance?
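
(FWIW, if the boost ratio is derived as highest_perf / nominal_perf, the
numbers above are at least self-consistent for the slower cores:
3801 MHz * 166 / 135 ~= 4674 MHz, matching the 4.67GHz upper hardware limit
cpupower reports here. The 216 values on the preferred cores would imply
well over 6GHz though, so a translation picking up the wrong per-CPU
highest_perf could plausibly misbehave.)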

> 
> It should be 1024 or close to it.
> 
> Could you try the patch below? It skips this step and ensures that the
> actual utilization is used to select the frequency.
> 
> ---
>  kernel/sched/cpufreq_schedutil.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..e420e2ee1a10 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -181,7 +181,6 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>  	/* Add dvfs headroom to actual utilization */
>  	actual = map_util_perf(actual);
>  	/* Actually we don't need to target the max performance */
> -	if (actual < max)
>  		max = actual;
>  
>  	/*
> -- 
> 2.34.1


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-13  1:24                               ` Linus Torvalds
  2024-01-13  1:31                                 ` Linus Torvalds
@ 2024-01-13 18:37                                 ` Qais Yousef
  1 sibling, 0 replies; 54+ messages in thread
From: Qais Yousef @ 2024-01-13 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On 01/12/24 17:24, Linus Torvalds wrote:
> On Fri, 12 Jan 2024 at 17:04, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > That is odd. I can't see how the patch could cause this yet; could you try
> > with a different compiler, if possible?
> 
> I use two different compilers - I do my allmodconfig builds with gcc,
> and the kernels I boot with clang, so my problems have been with a
> kernel built with
> 
>    clang version 17.0.6
> 
> but to check that it's not a compiler issue I just did another try
> with my current public tip of tree (ie *without* any reverts for this
> issue) and gcc:
> 
>     gcc version 13.2.1
> 
> and the behavior is exactly the same: all cores are stuck at 2.2GHz.
> 
> So no, it's not compiler-dependent.
> 
> > I usually use perfetto, but it should be easy to see frequency updates from
> > the power/cpu_frequency trace event.
> >
> >         echo 1 | sudo tee /sys/kernel/tracing/tracing_on
> >         echo 1 | sudo tee /sys/kernel/tracing/events/power/cpu_frequency/enable
> >         sudo cat /sys/kernel/tracing/trace
> 
> Shows absolutely nothing. Or rather, it shows the header with
> 
>   # entries-in-buffer/entries-written: 0/0   #P:64
> 
> and that's it.
> 
> With a *working* kernel, I get events, setting the frequency to either
> 2.2GHz (idle) or 3.8GHz (work).
> 
> IOW, the tracing output is 100% consistent with "that commit breaks everything".

Thanks for all the info. It seems that for some reason we no longer think we
need to update the frequency. Not sure why yet. It's strange that I can't
reproduce it either, even running from the top of your tree. I ran with these
patches on M1, Pixel, Intel and now AMD and haven't noticed anything since
they got merged.


* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-10 22:57       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2024-01-12 18:26         ` [GIT PULL] Scheduler changes for v6.8 Ingo Molnar
@ 2024-01-14  9:12         ` Wyes Karny
  2024-01-14 11:18           ` Vincent Guittot
  3 siblings, 1 reply; 54+ messages in thread
From: Wyes Karny @ 2024-01-14  9:12 UTC (permalink / raw)
  To: Linus Torvalds, Vincent Guittot, Qais Yousef
  Cc: Ingo Molnar, Vincent Guittot, linux-kernel, Peter Zijlstra,
	Thomas Gleixner, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It's one of these two:
> >
> >   f12560779f9d sched/cpufreq: Rework iowait boost
> >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> >
> > one more boot to go, then I'll try to revert whichever causes my
> > machine to perform horribly much worse.
> 
> I guess it should come as no surprise that the result is
> 
>    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> 
> but to revert cleanly I will have to revert all of
> 
>       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
>       f12560779f9d ("sched/cpufreq: Rework iowait boost")
>       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
> 
> This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> 
> I'll keep that revert in my private test-tree for now (so that I have
> a working machine again), but I'll move it to my main branch soon
> unless somebody has a quick fix for this problem.

Hi Linus,

I'm able to reproduce this issue with my AMD Ryzen 5600G system, but only
if I disable CPPC in the BIOS and boot with acpi-cpufreq + schedutil.
(I believe CPPC is disabled in your case too, since the "_CPC object is not
present" message showed up.) With CPPC enabled in the BIOS, the issue is not
seen on my system. On AMD, acpi-cpufreq also uses the _CPC object to
determine the boost ratio. When CPPC is disabled in the BIOS, something goes
wrong and the max capacity becomes zero.
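
(If max really comes out as ~0 there, the min(actual, max) step in
sugov_effective_cpu_perf() would clamp the computed performance level to ~0,
so the governor would always resolve to the lowest OPP; that would fit the
@sugov_max_cap histogram below, where every sample lands in [0, 2), and
@sugov_freq being pinned to a single lowest bucket.)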

Hi Vincent, Qais,

I have collected some data with bpftrace:

sudo bpftrace -e 'kretprobe:effective_cpu_util /cpu == 1/ { @eff_util = lhist(retval, 0, 1200, 50);} kprobe:get_next_freq /cpu == 1/ { @sugov_eff_util = lhist(arg1, 0, 1200, 50); @sugov_max_cap = lhist(arg2, 0, 1000, 2);} kretprobe:get_next_freq /cpu == 1/ { @sugov_freq = lhist(retval, 1000000, 5000000, 100000);}'

while running: taskset -c 1 make

issue case:

Attaching 3 probes...
@eff_util:
[0, 50)             1263 |@                                                   |
[50, 100)            517 |                                                    |
[100, 150)           233 |                                                    |
[150, 200)           297 |                                                    |
[200, 250)           162 |                                                    |
[250, 300)            98 |                                                    |
[300, 350)            75 |                                                    |
[350, 400)           205 |                                                    |
[400, 450)           210 |                                                    |
[450, 500)            16 |                                                    |
[500, 550)          1532 |@                                                   |
[550, 600)          1026 |                                                    |
[600, 650)           761 |                                                    |
[650, 700)           876 |                                                    |
[700, 750)          1085 |                                                    |
[750, 800)           891 |                                                    |
[800, 850)           816 |                                                    |
[850, 900)           983 |                                                    |
[900, 950)           661 |                                                    |
[950, 1000)          759 |                                                    |
[1000, 1050)       57433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_eff_util:
[0, 50)             1074 |                                                    |
[50, 100)            571 |                                                    |
[100, 150)           259 |                                                    |
[150, 200)           169 |                                                    |
[200, 250)           237 |                                                    |
[250, 300)           156 |                                                    |
[300, 350)            91 |                                                    |
[350, 400)            46 |                                                    |
[400, 450)            52 |                                                    |
[450, 500)           195 |                                                    |
[500, 550)           175 |                                                    |
[550, 600)            46 |                                                    |
[600, 650)           493 |                                                    |
[650, 700)          1424 |@                                                   |
[700, 750)           646 |                                                    |
[750, 800)           628 |                                                    |
[800, 850)           612 |                                                    |
[850, 900)           840 |                                                    |
[900, 950)           893 |                                                    |
[950, 1000)          640 |                                                    |
[1000, 1050)       60679 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_freq:
[1400000, 1500000)   69911 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_max_cap:
[0, 2)             69926 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|


good case:

Attaching 3 probes...
@eff_util:
[0, 50)              246 |@                                                   |
[50, 100)            150 |@                                                   |
[100, 150)           191 |@                                                   |
[150, 200)           239 |@                                                   |
[200, 250)           117 |                                                    |
[250, 300)          2101 |@@@@@@@@@@@@@@@                                     |
[300, 350)          2284 |@@@@@@@@@@@@@@@@                                    |
[350, 400)           713 |@@@@@                                               |
[400, 450)           151 |@                                                   |
[450, 500)           154 |@                                                   |
[500, 550)          1121 |@@@@@@@@                                            |
[550, 600)          1901 |@@@@@@@@@@@@@                                       |
[600, 650)          1208 |@@@@@@@@                                            |
[650, 700)           606 |@@@@                                                |
[700, 750)           557 |@@@                                                 |
[750, 800)           872 |@@@@@@                                              |
[800, 850)          1092 |@@@@@@@                                             |
[850, 900)          1416 |@@@@@@@@@@                                          |
[900, 950)          1107 |@@@@@@@                                             |
[950, 1000)         1051 |@@@@@@@                                             |
[1000, 1050)        7260 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_eff_util:
[0, 50)              241 |                                                    |
[50, 100)            149 |                                                    |
[100, 150)            72 |                                                    |
[150, 200)            95 |                                                    |
[200, 250)            43 |                                                    |
[250, 300)            49 |                                                    |
[300, 350)            19 |                                                    |
[350, 400)            56 |                                                    |
[400, 450)            22 |                                                    |
[450, 500)            29 |                                                    |
[500, 550)          1840 |@@@@@@                                              |
[550, 600)          1476 |@@@@@                                               |
[600, 650)          1027 |@@@                                                 |
[650, 700)           473 |@                                                   |
[700, 750)           366 |@                                                   |
[750, 800)           627 |@@                                                  |
[800, 850)           930 |@@@                                                 |
[850, 900)          1285 |@@@@                                                |
[900, 950)           971 |@@@                                                 |
[950, 1000)          946 |@@@                                                 |
[1000, 1050)       13839 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_freq:
[1400000, 1500000)     648 |@                                                   |
[1500000, 1600000)       0 |                                                    |
[1600000, 1700000)       0 |                                                    |
[1700000, 1800000)      25 |                                                    |
[1800000, 1900000)       0 |                                                    |
[1900000, 2000000)       0 |                                                    |
[2000000, 2100000)       0 |                                                    |
[2100000, 2200000)       0 |                                                    |
[2200000, 2300000)       0 |                                                    |
[2300000, 2400000)       0 |                                                    |
[2400000, 2500000)       0 |                                                    |
[2500000, 2600000)       0 |                                                    |
[2600000, 2700000)       0 |                                                    |
[2700000, 2800000)       0 |                                                    |
[2800000, 2900000)       0 |                                                    |
[2900000, 3000000)       0 |                                                    |
[3000000, 3100000)       0 |                                                    |
[3100000, 3200000)       0 |                                                    |
[3200000, 3300000)       0 |                                                    |
[3300000, 3400000)       0 |                                                    |
[3400000, 3500000)       0 |                                                    |
[3500000, 3600000)       0 |                                                    |
[3600000, 3700000)       0 |                                                    |
[3700000, 3800000)       0 |                                                    |
[3800000, 3900000)       0 |                                                    |
[3900000, 4000000)   23879 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_max_cap:
[0, 2)             24555 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

In both cases max_cap reads as zero, but the selected freq is incorrect only
in the bad case.

Thanks,
Wyes


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14  9:12         ` Wyes Karny
@ 2024-01-14 11:18           ` Vincent Guittot
  2024-01-14 12:37             ` Wyes Karny
  0 siblings, 1 reply; 54+ messages in thread
From: Vincent Guittot @ 2024-01-14 11:18 UTC (permalink / raw)
  To: Wyes Karny
  Cc: Linus Torvalds, Qais Yousef, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

Hi Wyes,

On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
> On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > It's one of these two:
> > >
> > >   f12560779f9d sched/cpufreq: Rework iowait boost
> > >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > >
> > > one more boot to go, then I'll try to revert whichever causes my
> > > machine to perform horribly much worse.
> > 
> > I guess it should come as no surprise that the result is
> > 
> >    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > 
> > but to revert cleanly I will have to revert all of
> > 
> >       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> >       f12560779f9d ("sched/cpufreq: Rework iowait boost")
> >       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > performance estimation")
> > 
> > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > 
> > I'll keep that revert in my private test-tree for now (so that I have
> > a working machine again), but I'll move it to my main branch soon
> > unless somebody has a quick fix for this problem.
> 
> Hi Linus,
> 
> I'm able to reproduce this issue on my AMD Ryzen 5600G system, but only
> if I disable CPPC in BIOS and boot with acpi-cpufreq + schedutil.  (I
> believe CPPC is disabled in your case too, since the "_CPC object is not
> present" message shows up in your log.)  With CPPC enabled in BIOS, the
> issue is not seen on my system.  On AMD, acpi-cpufreq also uses the _CPC
> object to determine the boost ratio.  When CPPC is disabled in BIOS,
> something goes wrong and the max capacity ends up as zero.
> 
> Hi Vincent, Qais,
> 
> I have collected some data with bpftracing:

Thanks for your test results

> 
> sudo bpftrace -e 'kretprobe:effective_cpu_util /cpu == 1/ { @eff_util = lhist(retval, 0, 1200, 50);} kprobe:get_next_freq /cpu == 1/ { @sugov_eff_util = lhist(arg1, 0, 1200, 50); @sugov_max_cap = lhist(arg2, 0, 1000, 2);} kretprobe:get_next_freq /cpu == 1/ { @sugov_freq = lhist(retval, 1000000, 5000000, 100000);}'
> 
> while running: taskset -c 1 make
> 
> issue case:
> 
> Attaching 3 probes...
> @eff_util:
> [0, 50)             1263 |@                                                   |
> [50, 100)            517 |                                                    |
> [100, 150)           233 |                                                    |
> [150, 200)           297 |                                                    |
> [200, 250)           162 |                                                    |
> [250, 300)            98 |                                                    |
> [300, 350)            75 |                                                    |
> [350, 400)           205 |                                                    |
> [400, 450)           210 |                                                    |
> [450, 500)            16 |                                                    |
> [500, 550)          1532 |@                                                   |
> [550, 600)          1026 |                                                    |
> [600, 650)           761 |                                                    |
> [650, 700)           876 |                                                    |
> [700, 750)          1085 |                                                    |
> [750, 800)           891 |                                                    |
> [800, 850)           816 |                                                    |
> [850, 900)           983 |                                                    |
> [900, 950)           661 |                                                    |
> [950, 1000)          759 |                                                    |
> [1000, 1050)       57433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 

OK, so the output of effective_cpu_util() seems correct, or at least capped
at the max utilization value. For it to be correct, arch_scale_cpu_capacity(cpu)
must be non-zero, because of:

unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
				 unsigned long *min,
				 unsigned long *max)
{
	unsigned long util, irq, scale;
	struct rq *rq = cpu_rq(cpu);

	scale = arch_scale_cpu_capacity(cpu);

	/*
	 * Early check to see if IRQ/steal time saturates the CPU, can be
	 * because of inaccuracies in how we track these -- see
	 * update_irq_load_avg().
	 */
	irq = cpu_util_irq(rq);
	if (unlikely(irq >= scale)) {
		if (min)
			*min = scale;
		if (max)
			*max = scale;
		return scale;
	}
...
}

If arch_scale_cpu_capacity(cpu) returned 0, effective_cpu_util() would return
0 too: with scale == 0, the unsigned check irq >= scale always takes the early
exit, which returns scale.

Now see below

> @sugov_eff_util:
> [0, 50)             1074 |                                                    |
> [50, 100)            571 |                                                    |
> [100, 150)           259 |                                                    |
> [150, 200)           169 |                                                    |
> [200, 250)           237 |                                                    |
> [250, 300)           156 |                                                    |
> [300, 350)            91 |                                                    |
> [350, 400)            46 |                                                    |
> [400, 450)            52 |                                                    |
> [450, 500)           195 |                                                    |
> [500, 550)           175 |                                                    |
> [550, 600)            46 |                                                    |
> [600, 650)           493 |                                                    |
> [650, 700)          1424 |@                                                   |
> [700, 750)           646 |                                                    |
> [750, 800)           628 |                                                    |
> [800, 850)           612 |                                                    |
> [850, 900)           840 |                                                    |
> [900, 950)           893 |                                                    |
> [950, 1000)          640 |                                                    |
> [1000, 1050)       60679 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> @sugov_freq:
> [1400000, 1500000)   69911 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> @sugov_max_cap:
> [0, 2)             69926 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

In get_next_freq(struct sugov_policy *sg_policy, unsigned long util, unsigned long max)

max is 0, and we come from this path:

static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
				     unsigned int flags)
{

...
	max_cap = arch_scale_cpu_capacity(sg_cpu->cpu);

	if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
		return;

	next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
...

so here arch_scale_cpu_capacity(sg_cpu->cpu) returns 0 ...

AFAICT, the AMD platform uses the default:
static __always_inline
unsigned long arch_scale_cpu_capacity(int cpu)
{
	return SCHED_CAPACITY_SCALE;
}

That returns SCHED_CAPACITY_SCALE, i.e. 1024, never 0, so I'm missing
something here.

> 
> 
> good case:
> 
> Attaching 3 probes...
> @eff_util:
> [0, 50)              246 |@                                                   |
> [50, 100)            150 |@                                                   |
> [100, 150)           191 |@                                                   |
> [150, 200)           239 |@                                                   |
> [200, 250)           117 |                                                    |
> [250, 300)          2101 |@@@@@@@@@@@@@@@                                     |
> [300, 350)          2284 |@@@@@@@@@@@@@@@@                                    |
> [350, 400)           713 |@@@@@                                               |
> [400, 450)           151 |@                                                   |
> [450, 500)           154 |@                                                   |
> [500, 550)          1121 |@@@@@@@@                                            |
> [550, 600)          1901 |@@@@@@@@@@@@@                                       |
> [600, 650)          1208 |@@@@@@@@                                            |
> [650, 700)           606 |@@@@                                                |
> [700, 750)           557 |@@@                                                 |
> [750, 800)           872 |@@@@@@                                              |
> [800, 850)          1092 |@@@@@@@                                             |
> [850, 900)          1416 |@@@@@@@@@@                                          |
> [900, 950)          1107 |@@@@@@@                                             |
> [950, 1000)         1051 |@@@@@@@                                             |
> [1000, 1050)        7260 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> @sugov_eff_util:
> [0, 50)              241 |                                                    |
> [50, 100)            149 |                                                    |
> [100, 150)            72 |                                                    |
> [150, 200)            95 |                                                    |
> [200, 250)            43 |                                                    |
> [250, 300)            49 |                                                    |
> [300, 350)            19 |                                                    |
> [350, 400)            56 |                                                    |
> [400, 450)            22 |                                                    |
> [450, 500)            29 |                                                    |
> [500, 550)          1840 |@@@@@@                                              |
> [550, 600)          1476 |@@@@@                                               |
> [600, 650)          1027 |@@@                                                 |
> [650, 700)           473 |@                                                   |
> [700, 750)           366 |@                                                   |
> [750, 800)           627 |@@                                                  |
> [800, 850)           930 |@@@                                                 |
> [850, 900)          1285 |@@@@                                                |
> [900, 950)           971 |@@@                                                 |
> [950, 1000)          946 |@@@                                                 |
> [1000, 1050)       13839 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> @sugov_freq:
> [1400000, 1500000)     648 |@                                                   |
> [1500000, 1600000)       0 |                                                    |
> [1600000, 1700000)       0 |                                                    |
> [1700000, 1800000)      25 |                                                    |
> [1800000, 1900000)       0 |                                                    |
> [1900000, 2000000)       0 |                                                    |
> [2000000, 2100000)       0 |                                                    |
> [2100000, 2200000)       0 |                                                    |
> [2200000, 2300000)       0 |                                                    |
> [2300000, 2400000)       0 |                                                    |
> [2400000, 2500000)       0 |                                                    |
> [2500000, 2600000)       0 |                                                    |
> [2600000, 2700000)       0 |                                                    |
> [2700000, 2800000)       0 |                                                    |
> [2800000, 2900000)       0 |                                                    |
> [2900000, 3000000)       0 |                                                    |
> [3000000, 3100000)       0 |                                                    |
> [3100000, 3200000)       0 |                                                    |
> [3200000, 3300000)       0 |                                                    |
> [3300000, 3400000)       0 |                                                    |
> [3400000, 3500000)       0 |                                                    |
> [3500000, 3600000)       0 |                                                    |
> [3600000, 3700000)       0 |                                                    |
> [3700000, 3800000)       0 |                                                    |
> [3800000, 3900000)       0 |                                                    |
> [3900000, 4000000)   23879 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> @sugov_max_cap:
> [0, 2)             24555 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> In both cases max_cap reads as zero, but the selected freq is incorrect only
> in the bad case.

Also, we have in get_next_freq():
	freq = map_util_freq(util, freq, max);
	       --> util * freq / max

If max were 0, we should have seen an error?
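
For reference, map_util_freq() is essentially this scale mapping (a sketch of
the helper from include/linux/sched/cpufreq.h):

static inline unsigned long map_util_freq(unsigned long util,
					  unsigned long freq, unsigned long cap)
{
	/* cap == 0 here would be a divide by zero */
	return freq * util / cap;
}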

There is something strange here that I don't understand.

Could you trace the value of sg_cpu->util on return from sugov_get_util()?

Thanks for your help
Vincent

> 
> Thanks,
> Wyes
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 11:18           ` Vincent Guittot
@ 2024-01-14 12:37             ` Wyes Karny
  2024-01-14 13:02               ` Dietmar Eggemann
  2024-01-14 13:03               ` Vincent Guittot
  0 siblings, 2 replies; 54+ messages in thread
From: Wyes Karny @ 2024-01-14 12:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Linus Torvalds, Qais Yousef, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
> Hi Wyes,
> 
> On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
> > On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > It's one of these two:
> > > >
> > > >   f12560779f9d sched/cpufreq: Rework iowait boost
> > > >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > >
> > > > one more boot to go, then I'll try to revert whichever causes my
> > > > machine to perform horribly much worse.
> > > 
> > > I guess it should come as no surprise that the result is
> > > 
> > >    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > > 
> > > but to revert cleanly I will have to revert all of
> > > 
> > >       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > >       f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > >       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > performance estimation")
> > > 
> > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > > 
> > > I'll keep that revert in my private test-tree for now (so that I have
> > > a working machine again), but I'll move it to my main branch soon
> > > unless somebody has a quick fix for this problem.
> > 
> > Hi Linus,
> > 
> > I'm able to reproduce this issue on my AMD Ryzen 5600G system, but only
> > if I disable CPPC in BIOS and boot with acpi-cpufreq + schedutil.  (I
> > believe CPPC is disabled in your case too, since the "_CPC object is not
> > present" message shows up in your log.)  With CPPC enabled in BIOS, the
> > issue is not seen on my system.  On AMD, acpi-cpufreq also uses the _CPC
> > object to determine the boost ratio.  When CPPC is disabled in BIOS,
> > something goes wrong and the max capacity ends up as zero.
> > 
> > Hi Vincent, Qais,
> > 
> > I have collected some data with bpftracing:
> 
> Thanks for your test results
> 
> > 
> > sudo bpftrace -e 'kretprobe:effective_cpu_util /cpu == 1/ { @eff_util = lhist(retval, 0, 1200, 50);} kprobe:get_next_freq /cpu == 1/ { @sugov_eff_util = lhist(arg1, 0, 1200, 50); @sugov_max_cap = lhist(arg2, 0, 1000, 2);} kretprobe:get_next_freq /cpu == 1/ { @sugov_freq = lhist(retval, 1000000, 5000000, 100000);}'
> > 
> > while running: taskset -c 1 make
> > 
> > issue case:
> > 
> > Attaching 3 probes...
> > @eff_util:
> > [0, 50)             1263 |@                                                   |
> > [50, 100)            517 |                                                    |
> > [100, 150)           233 |                                                    |
> > [150, 200)           297 |                                                    |
> > [200, 250)           162 |                                                    |
> > [250, 300)            98 |                                                    |
> > [300, 350)            75 |                                                    |
> > [350, 400)           205 |                                                    |
> > [400, 450)           210 |                                                    |
> > [450, 500)            16 |                                                    |
> > [500, 550)          1532 |@                                                   |
> > [550, 600)          1026 |                                                    |
> > [600, 650)           761 |                                                    |
> > [650, 700)           876 |                                                    |
> > [700, 750)          1085 |                                                    |
> > [750, 800)           891 |                                                    |
> > [800, 850)           816 |                                                    |
> > [850, 900)           983 |                                                    |
> > [900, 950)           661 |                                                    |
> > [950, 1000)          759 |                                                    |
> > [1000, 1050)       57433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> 
> OK, so the output of effective_cpu_util() seems correct, or at least capped
> at the max utilization value. For it to be correct, arch_scale_cpu_capacity(cpu)
> must be non-zero, because of:
> 
> unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> 				 unsigned long *min,
> 				 unsigned long *max)
> {
> 	unsigned long util, irq, scale;
> 	struct rq *rq = cpu_rq(cpu);
> 
> 	scale = arch_scale_cpu_capacity(cpu);
> 
> 	/*
> 	 * Early check to see if IRQ/steal time saturates the CPU, can be
> 	 * because of inaccuracies in how we track these -- see
> 	 * update_irq_load_avg().
> 	 */
> 	irq = cpu_util_irq(rq);
> 	if (unlikely(irq >= scale)) {
> 		if (min)
> 			*min = scale;
> 		if (max)
> 			*max = scale;
> 		return scale;
> 	}
> ...
> }
> 
> If arch_scale_cpu_capacity(cpu) returned 0, effective_cpu_util() would return
> 0 too.
> 
> Now see below
> 
> > @sugov_eff_util:
> > [0, 50)             1074 |                                                    |
> > [50, 100)            571 |                                                    |
> > [100, 150)           259 |                                                    |
> > [150, 200)           169 |                                                    |
> > [200, 250)           237 |                                                    |
> > [250, 300)           156 |                                                    |
> > [300, 350)            91 |                                                    |
> > [350, 400)            46 |                                                    |
> > [400, 450)            52 |                                                    |
> > [450, 500)           195 |                                                    |
> > [500, 550)           175 |                                                    |
> > [550, 600)            46 |                                                    |
> > [600, 650)           493 |                                                    |
> > [650, 700)          1424 |@                                                   |
> > [700, 750)           646 |                                                    |
> > [750, 800)           628 |                                                    |
> > [800, 850)           612 |                                                    |
> > [850, 900)           840 |                                                    |
> > [900, 950)           893 |                                                    |
> > [950, 1000)          640 |                                                    |
> > [1000, 1050)       60679 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> > @sugov_freq:
> > [1400000, 1500000)   69911 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> > @sugov_max_cap:
> > [0, 2)             69926 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> In get_next_freq(struct sugov_policy *sg_policy, unsigned long util, unsigned long max)
> 
> max is 0, and we come from this path:
> 
> static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
> 				     unsigned int flags)
> {
> 
> ...
> 	max_cap = arch_scale_cpu_capacity(sg_cpu->cpu);
> 
> 	if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
> 		return;
> 
> 	next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
> ...
> 
> so here arch_scale_cpu_capacity(sg_cpu->cpu) returns 0 ...
> 
> AFAICT, the AMD platform uses the default:
> static __always_inline
> unsigned long arch_scale_cpu_capacity(int cpu)
> {
> 	return SCHED_CAPACITY_SCALE;
> }
> 
> That returns SCHED_CAPACITY_SCALE, i.e. 1024, never 0, so I'm missing
> something here.
> 
> > 
> > 
> > good case:
> > 
> > Attaching 3 probes...
> > @eff_util:
> > [0, 50)              246 |@                                                   |
> > [50, 100)            150 |@                                                   |
> > [100, 150)           191 |@                                                   |
> > [150, 200)           239 |@                                                   |
> > [200, 250)           117 |                                                    |
> > [250, 300)          2101 |@@@@@@@@@@@@@@@                                     |
> > [300, 350)          2284 |@@@@@@@@@@@@@@@@                                    |
> > [350, 400)           713 |@@@@@                                               |
> > [400, 450)           151 |@                                                   |
> > [450, 500)           154 |@                                                   |
> > [500, 550)          1121 |@@@@@@@@                                            |
> > [550, 600)          1901 |@@@@@@@@@@@@@                                       |
> > [600, 650)          1208 |@@@@@@@@                                            |
> > [650, 700)           606 |@@@@                                                |
> > [700, 750)           557 |@@@                                                 |
> > [750, 800)           872 |@@@@@@                                              |
> > [800, 850)          1092 |@@@@@@@                                             |
> > [850, 900)          1416 |@@@@@@@@@@                                          |
> > [900, 950)          1107 |@@@@@@@                                             |
> > [950, 1000)         1051 |@@@@@@@                                             |
> > [1000, 1050)        7260 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> > @sugov_eff_util:
> > [0, 50)              241 |                                                    |
> > [50, 100)            149 |                                                    |
> > [100, 150)            72 |                                                    |
> > [150, 200)            95 |                                                    |
> > [200, 250)            43 |                                                    |
> > [250, 300)            49 |                                                    |
> > [300, 350)            19 |                                                    |
> > [350, 400)            56 |                                                    |
> > [400, 450)            22 |                                                    |
> > [450, 500)            29 |                                                    |
> > [500, 550)          1840 |@@@@@@                                              |
> > [550, 600)          1476 |@@@@@                                               |
> > [600, 650)          1027 |@@@                                                 |
> > [650, 700)           473 |@                                                   |
> > [700, 750)           366 |@                                                   |
> > [750, 800)           627 |@@                                                  |
> > [800, 850)           930 |@@@                                                 |
> > [850, 900)          1285 |@@@@                                                |
> > [900, 950)           971 |@@@                                                 |
> > [950, 1000)          946 |@@@                                                 |
> > [1000, 1050)       13839 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> > @sugov_freq:
> > [1400000, 1500000)     648 |@                                                   |
> > [1500000, 1600000)       0 |                                                    |
> > [1600000, 1700000)       0 |                                                    |
> > [1700000, 1800000)      25 |                                                    |
> > [1800000, 1900000)       0 |                                                    |
> > [1900000, 2000000)       0 |                                                    |
> > [2000000, 2100000)       0 |                                                    |
> > [2100000, 2200000)       0 |                                                    |
> > [2200000, 2300000)       0 |                                                    |
> > [2300000, 2400000)       0 |                                                    |
> > [2400000, 2500000)       0 |                                                    |
> > [2500000, 2600000)       0 |                                                    |
> > [2600000, 2700000)       0 |                                                    |
> > [2700000, 2800000)       0 |                                                    |
> > [2800000, 2900000)       0 |                                                    |
> > [2900000, 3000000)       0 |                                                    |
> > [3000000, 3100000)       0 |                                                    |
> > [3100000, 3200000)       0 |                                                    |
> > [3200000, 3300000)       0 |                                                    |
> > [3300000, 3400000)       0 |                                                    |
> > [3400000, 3500000)       0 |                                                    |
> > [3500000, 3600000)       0 |                                                    |
> > [3600000, 3700000)       0 |                                                    |
> > [3700000, 3800000)       0 |                                                    |
> > [3800000, 3900000)       0 |                                                    |
> > [3900000, 4000000)   23879 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> > @sugov_max_cap:
> > [0, 2)             24555 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> > In both cases max_cap reads as zero, but the selected freq is incorrect only
> > in the bad case.
> 
> Also, we have in get_next_freq():
> 	freq = map_util_freq(util, freq, max);
> 	       --> util * freq / max
> 
> If max were 0, we should have seen an error?
> 
> There is something strange here that I don't understand.
> 
> Could you trace the value of sg_cpu->util on return from sugov_get_util()?

Yeah, correct: something was wrong in the bpftrace readings; max_cap is
not zero in the traces.

             git-5511    [001] d.h1.   427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
             git-5511    [001] d.h1.   427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
             git-5511    [001] d.h1.   427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
             git-5511    [001] d.h1.   427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
             git-5511    [001] d.h1.   427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
             git-5511    [001] d.h1.   427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
             git-5511    [001] d.h1.   427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
             git-5511    [001] d.h1.   427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
             git-5511    [001] d.h1.   427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
             git-5511    [001] d.s1.   427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
             git-5511    [001] d.s1.   427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024

Debug patch applied:

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..5c9b3e1de7a0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,

        freq = get_capacity_ref_freq(policy);
        freq = map_util_freq(util, freq, max);
+       trace_printk("[DEBUG] : freq %u, util %lu, max %lu\n", freq, util, max);

        if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
                return sg_policy->next_freq;
@@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
        util = max(util, boost);
        sg_cpu->bw_min = min;
        sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
+       trace_printk("[DEBUG] : util %lu, sg_cpu->util %lu\n", util, sg_cpu->util);
 }

 /**


So, I guess map_util_freq is going wrong somewhere.

Thanks,
Wyes
> 
> > Thanks for your help
> Vincent
> 
> > 
> > Thanks,
> > Wyes
> > 

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 12:37             ` Wyes Karny
@ 2024-01-14 13:02               ` Dietmar Eggemann
  2024-01-14 13:05                 ` Vincent Guittot
  2024-01-14 13:03               ` Vincent Guittot
  1 sibling, 1 reply; 54+ messages in thread
From: Dietmar Eggemann @ 2024-01-14 13:02 UTC (permalink / raw)
  To: Wyes Karny, Vincent Guittot
  Cc: Linus Torvalds, Qais Yousef, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On 14/01/2024 13:37, Wyes Karny wrote:
> On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
>> Hi Wyes,
>>
>> On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
>>> On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
>>>> On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
>>>> <torvalds@linux-foundation.org> wrote:

[...]

> Yeah, correct: something was wrong in the bpftrace readings; max_cap is
> not zero in the traces.
> 
>              git-5511    [001] d.h1.   427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>              git-5511    [001] d.h1.   427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>              git-5511    [001] d.h1.   427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>              git-5511    [001] d.h1.   427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>              git-5511    [001] d.h1.   427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>              git-5511    [001] d.h1.   427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>              git-5511    [001] d.h1.   427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>              git-5511    [001] d.h1.   427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>              git-5511    [001] d.h1.   427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>              git-5511    [001] d.s1.   427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>              git-5511    [001] d.s1.   427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> 
> Debug patch applied:
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..5c9b3e1de7a0 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> 
>         freq = get_capacity_ref_freq(policy);
>         freq = map_util_freq(util, freq, max);
> +       trace_printk("[DEBUG] : freq %u, util %lu, max %lu\n", freq, util, max);
> 
>         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
>                 return sg_policy->next_freq;
> @@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>         util = max(util, boost);
>         sg_cpu->bw_min = min;
>         sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> +       trace_printk("[DEBUG] : util %lu, sg_cpu->util %lu\n", util, sg_cpu->util);
>  }
> 
>  /**
> 
> 
> So, I guess map_util_freq is going wrong somewhere.

sugov_update_single_freq() -> get_next_freq() -> get_capacity_ref_freq()

Is arch_scale_freq_invariant() true in both cases, so that
get_capacity_ref_freq() returns 'policy->cpuinfo.max_freq' and not just
'policy->cur'?
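
For context, the branch being asked about looks like this (a sketch based on
the patch further down; the function in the tree may handle more cases):

unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
{
	if (arch_scale_freq_invariant())
		return policy->cpuinfo.max_freq;

	return policy->cur;
}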


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 12:37             ` Wyes Karny
  2024-01-14 13:02               ` Dietmar Eggemann
@ 2024-01-14 13:03               ` Vincent Guittot
  2024-01-14 15:12                 ` Qais Yousef
  2024-01-14 18:11                 ` Wyes Karny
  1 sibling, 2 replies; 54+ messages in thread
From: Vincent Guittot @ 2024-01-14 13:03 UTC (permalink / raw)
  To: Wyes Karny
  Cc: Linus Torvalds, Qais Yousef, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Sun, 14 Jan 2024 at 13:38, Wyes Karny <wkarny@gmail.com> wrote:
>
> On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
> > Hi Wyes,
> >
> > On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
> > > On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> > > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > > <torvalds@linux-foundation.org> wrote:
> > > > >
> > > > > It's one of these two:
> > > > >
> > > > >   f12560779f9d sched/cpufreq: Rework iowait boost
> > > > >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > > >
> > > > > one more boot to go, then I'll try to revert whichever causes my
> > > > > machine to perform horribly much worse.
> > > >
> > > > I guess it should come as no surprise that the result is
> > > >
> > > >    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > > >
> > > > but to revert cleanly I will have to revert all of
> > > >
> > > >       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > > >       f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > > >       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > > performance estimation")
> > > >
> > > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > > >
> > > > I'll keep that revert in my private test-tree for now (so that I have
> > > > a working machine again), but I'll move it to my main branch soon
> > > > unless somebody has a quick fix for this problem.
> > >
> > > Hi Linus,
> > >
> > > I'm able to reproduce this issue on my AMD Ryzen 5600G system, but only
> > > if I disable CPPC in BIOS and boot with acpi-cpufreq + schedutil.  (I
> > > believe CPPC is disabled in your case too, since the "_CPC object is not
> > > present" message shows up in your log.)  With CPPC enabled in BIOS, the
> > > issue is not seen on my system.  On AMD, acpi-cpufreq also uses the _CPC
> > > object to determine the boost ratio.  When CPPC is disabled in BIOS,
> > > something goes wrong and the max capacity ends up as zero.
> > >
> > > Hi Vincent, Qais,
> > >

...

> >
> > There is something strange that I don't understand
> >
> > Could you trace the value of sg_cpu->util on return from sugov_get_util()?
>
> Yeah, correct: something was wrong in the bpftrace readings; max_cap is
> not zero in the traces.
>
>              git-5511    [001] d.h1.   427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>              git-5511    [001] d.h1.   427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>              git-5511    [001] d.h1.   427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>              git-5511    [001] d.h1.   427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>              git-5511    [001] d.h1.   427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>              git-5511    [001] d.h1.   427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>              git-5511    [001] d.h1.   427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>              git-5511    [001] d.h1.   427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>              git-5511    [001] d.h1.   427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>              git-5511    [001] d.s1.   427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>              git-5511    [001] d.s1.   427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>
> Debug patch applied:
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..5c9b3e1de7a0 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>
>         freq = get_capacity_ref_freq(policy);
>         freq = map_util_freq(util, freq, max);
> +       trace_printk("[DEBUG] : freq %u, util %lu, max %lu\n", freq, util, max);
>
>         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
>                 return sg_policy->next_freq;
> @@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>         util = max(util, boost);
>         sg_cpu->bw_min = min;
>         sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> +       trace_printk("[DEBUG] : util %lu, sg_cpu->util %lu\n", util, sg_cpu->util);
>  }
>
>  /**
>
>
> So, I guess map_util_freq is going wrong somewhere.

Thanks for the trace. It was really helpful, and I think I got the
root cause.

The problem comes from get_capacity_ref_freq(), which returns the
current freq when arch_scale_freq_invariant() is not enabled, and the
fact that we now apply map_util_perf() earlier in the path, where it
is then capped by the max capacity.
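
In numbers (a sketch using the traced values, and assuming the non-invariant
branch, so the reference freq is policy->cur):

	unsigned long util = 1024, max = 1024;
	unsigned long ref  = 1400000;		/* policy->cur, lowest OPP (kHz) */
	unsigned long freq = ref * util / max;	/* = 1400000, map_util_freq() */

The requested frequency equals the current one, so the governor keeps asking
for the frequency it is already running at and never ramps up.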

Could you try the below ?

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e420e2ee1a10..611c621543f4 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
        if (arch_scale_freq_invariant())
                return policy->cpuinfo.max_freq;

-       return policy->cur;
+       return policy->cur + (policy->cur >> 2);
 }

 /**
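
With this change, the same traced values pick a higher OPP (same sketch and
assumptions as above):

	unsigned long util = 1024, max = 1024;
	unsigned long cur  = 1400000;			/* kHz, lowest OPP */
	unsigned long ref  = cur + (cur >> 2);		/* 1750000 kHz */
	unsigned long freq = ref * util / max;		/* 1750000 kHz > cur */

A saturated CPU now requests more than its current frequency, so it can ramp
up step by step instead of being stuck at the lowest OPP.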



>
> Thanks,
> Wyes
> >
> > Thanks for your help
> > Vincent
> >
> > >
> > > Thanks,
> > > Wyes
> > >

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 13:02               ` Dietmar Eggemann
@ 2024-01-14 13:05                 ` Vincent Guittot
  0 siblings, 0 replies; 54+ messages in thread
From: Vincent Guittot @ 2024-01-14 13:05 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Wyes Karny, Linus Torvalds, Qais Yousef, Ingo Molnar,
	linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Sun, 14 Jan 2024 at 14:02, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 14/01/2024 13:37, Wyes Karny wrote:
> > On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
> >> Hi Wyes,
> >>
> >> On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
> >>> On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> >>>> On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> >>>> <torvalds@linux-foundation.org> wrote:
>
> [...]
>
> > Yeah, correct: something was wrong in the bpftrace readings; max_cap is
> > not zero in the traces.
> >
> >              git-5511    [001] d.h1.   427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >              git-5511    [001] d.h1.   427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> >              git-5511    [001] d.h1.   427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >              git-5511    [001] d.h1.   427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> >              git-5511    [001] d.h1.   427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >              git-5511    [001] d.h1.   427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> >              git-5511    [001] d.h1.   427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >              git-5511    [001] d.h1.   427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> >              git-5511    [001] d.h1.   427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >              git-5511    [001] d.s1.   427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> >              git-5511    [001] d.s1.   427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >
> > Debug patch applied:
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 95c3c097083e..5c9b3e1de7a0 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >
> >         freq = get_capacity_ref_freq(policy);
> >         freq = map_util_freq(util, freq, max);
> > +       trace_printk("[DEBUG] : freq %u, util %lu, max %lu\n", freq, util, max);
> >
> >         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> >                 return sg_policy->next_freq;
> > @@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> >         util = max(util, boost);
> >         sg_cpu->bw_min = min;
> >         sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> > +       trace_printk("[DEBUG] : util %lu, sg_cpu->util %lu\n", util, sg_cpu->util);
> >  }
> >
> >  /**
> >
> >
> > So, I guess map_util_freq is going wrong somewhere.
>
> sugov_update_single_freq() -> get_next_freq() -> get_capacity_ref_freq()
>
> Is arch_scale_freq_invariant() true in both cases, so that
> get_capacity_ref_freq() returns 'policy->cpuinfo.max_freq' and not just
> 'policy->cur'?

That's also my assumption, and the change that I sent should fix it.

>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 13:03               ` Vincent Guittot
@ 2024-01-14 15:12                 ` Qais Yousef
  2024-01-14 15:20                   ` Vincent Guittot
  2024-01-14 18:11                 ` Wyes Karny
  1 sibling, 1 reply; 54+ messages in thread
From: Qais Yousef @ 2024-01-14 15:12 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Wyes Karny, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On 01/14/24 14:03, Vincent Guittot wrote:

> Thanks for the trace. It was really helpful, and I think I got the
> root cause.
> 
> The problem comes from get_capacity_ref_freq(), which returns the
> current freq when arch_scale_freq_invariant() is not enabled, and the
> fact that we now apply map_util_perf() earlier in the path, where it
> is then capped by the max capacity.
> 
> Could you try the below ?
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e420e2ee1a10..611c621543f4 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
>         if (arch_scale_freq_invariant())
>                 return policy->cpuinfo.max_freq;
> 
> -       return policy->cur;
> +       return policy->cur + (policy->cur >> 2);
>  }
> 
>  /**

Is this a test patch or a proper fix? I can't see it being the latter. It seems
the current logic fails when util is already 1024, and I think we're trying to
fix the invariance issue too late.

Is the problem that we can't read policy->cur in the scheduler to fix up the
util while it's being updated, and that's why it's done here in this case?

If this is the problem, shouldn't the logic be: if util is max, always go to
the max frequency? I don't think we have enough info to correct the invariance
here, IIUC. All we can see is that the system is saturated at this frequency,
and whether a small jump or a big jump is required is hard to tell.

Something like this

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..473d0352030b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -164,8 +164,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
        struct cpufreq_policy *policy = sg_policy->policy;
        unsigned int freq;

-       freq = get_capacity_ref_freq(policy);
-       freq = map_util_freq(util, freq, max);
+       if (util != max) {
+               freq = get_capacity_ref_freq(policy);
+               freq = map_util_freq(util, freq, max);
+       } else {
+               freq = policy->cpuinfo.max_freq;
+       }

        if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
                return sg_policy->next_freq;

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 15:12                 ` Qais Yousef
@ 2024-01-14 15:20                   ` Vincent Guittot
  2024-01-14 19:58                     ` Qais Yousef
  0 siblings, 1 reply; 54+ messages in thread
From: Vincent Guittot @ 2024-01-14 15:20 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Wyes Karny, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Sun, 14 Jan 2024 at 16:12, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/14/24 14:03, Vincent Guittot wrote:
>
> > Thanks for the trace. It was really helpful and I think that I got the
> > root cause.
> >
> > The problem comes from get_capacity_ref_freq() which returns current
> > freq when arch_scale_freq_invariant() is not enabled, and the fact that
> > we apply map_util_perf() earlier in the path now which is then capped
> > by max capacity.
> >
> > Could you try the below ?
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index e420e2ee1a10..611c621543f4 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct
> > cpufreq_policy *policy)
> >         if (arch_scale_freq_invariant())
> >                 return policy->cpuinfo.max_freq;
> >
> > -       return policy->cur;
> > +       return policy->cur + policy->cur >> 2;
> >  }
> >
> >  /**
>
> Is this a test patch or a proper fix? I can't see it being the latter. It seems

It's a proper fix. It's the same mechanism that is used already:
 - Either you add margin on the utilization to go above the current freq
before it is fully used. This is what was done previously.
 - Or you add margin on the freq range to select a higher freq than the
current one before it becomes fully used.
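
To illustrate (a rough sketch with hypothetical helpers, not kernel code;
it only assumes the freq = ref * util / max mapping that map_util_freq()
does here):

	/* 1) margin on the utilization (the old behavior) */
	unsigned long headroom_on_util(unsigned long util, unsigned long ref,
				       unsigned long max)
	{
		util += util >> 2;		/* util *= 1.25 */
		return ref * util / max;
	}

	/* 2) margin on the reference frequency (the fix) */
	unsigned long headroom_on_freq(unsigned long util, unsigned long ref,
				       unsigned long max)
	{
		ref += ref >> 2;		/* ref *= 1.25 */
		return ref * util / max;	/* same result, modulo rounding */
	}

Either way the requested frequency crosses the current OPP before the CPU
gets fully used.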

> the current logic fails when util is already 1024, and I think we're trying to
> fix the invariance issue too late.
>
> Is the problem that we can't read policy->cur in the scheduler to fix the util
> while it's being updated, and that's why it's done here in this case?
>
> If this is the problem, shouldn't the logic be if util is max then always go to
> max frequency? I don't think we have enough info to correct the invariance here
> IIUC. All we can see is that the system is saturated at this frequency, and whether
> a small jump or a big jump is required is hard to tell.
>
> Something like this
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..473d0352030b 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -164,8 +164,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>         struct cpufreq_policy *policy = sg_policy->policy;
>         unsigned int freq;
>
> -       freq = get_capacity_ref_freq(policy);
> -       freq = map_util_freq(util, freq, max);
> +       if (util != max) {
> +               freq = get_capacity_ref_freq(policy);
> +               freq = map_util_freq(util, freq, max);
> +       } else {
> +               freq = policy->cpuinfo.max_freq;
> +       }

This is not correct because you will have to wait to reach full
utilization at the current OPP, possibly the lowest OPP, before moving
directly to the max OPP.

>
>         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
>                 return sg_policy->next_freq;

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 13:03               ` Vincent Guittot
  2024-01-14 15:12                 ` Qais Yousef
@ 2024-01-14 18:11                 ` Wyes Karny
  2024-01-14 18:18                   ` Vincent Guittot
  1 sibling, 1 reply; 54+ messages in thread
From: Wyes Karny @ 2024-01-14 18:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Linus Torvalds, Qais Yousef, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Sun, Jan 14, 2024 at 02:03:14PM +0100, Vincent Guittot wrote:
> On Sun, 14 Jan 2024 at 13:38, Wyes Karny <wkarny@gmail.com> wrote:
> >
> > On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
> > > Hi Wyes,
> > >
> > > Le dimanche 14 janv. 2024 à 14:42:40 (+0530), Wyes Karny a écrit :
> > > > On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> > > > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > > > <torvalds@linux-foundation.org> wrote:
> > > > > >
> > > > > > It's one of these two:
> > > > > >
> > > > > >   f12560779f9d sched/cpufreq: Rework iowait boost
> > > > > >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > > > >
> > > > > > one more boot to go, then I'll try to revert whichever causes my
> > > > > > machine to perform horribly much worse.
> > > > >
> > > > > I guess it should come as no surprise that the result is
> > > > >
> > > > >    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > > > >
> > > > > but to revert cleanly I will have to revert all of
> > > > >
> > > > >       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > > > >       f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > > > >       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > > > performance estimation")
> > > > >
> > > > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > > > >
> > > > > I'll keep that revert in my private test-tree for now (so that I have
> > > > > a working machine again), but I'll move it to my main branch soon
> > > > > unless somebody has a quick fix for this problem.
> > > >
> > > > Hi Linus,
> > > >
> > > > I'm able to reproduce this issue with my AMD Ryzen 5600G system.  But
> > > > only if I disable CPPC in BIOS and boot with acpi-cpufreq + schedutil.
> > > > > (I believe for your case CPPC is also disabled, as the "_CPC object is not
> > > > > present" log came up). With CPPC enabled in BIOS, the issue is not seen on my
> > > > > system. For AMD, acpi-cpufreq also uses the _CPC object to determine the boost ratio.
> > > > When CPPC is disabled in BIOS something is going wrong and max
> > > > capacity is becoming zero.
> > > >
> > > > Hi Vincent, Qais,
> > > >
> 
> ...
> 
> > >
> > > There is something strange that I don't understand
> > >
> > > Could you trace on the return of sugov_get_util()
> > > the value of sg_cpu->util ?
> >
> > Yeah, correct: something was wrong in the bpftrace readings; max_cap is
> > not zero in traces.
> >
> >              git-5511    [001] d.h1.   427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >              git-5511    [001] d.h1.   427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> >              git-5511    [001] d.h1.   427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >              git-5511    [001] d.h1.   427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> >              git-5511    [001] d.h1.   427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >              git-5511    [001] d.h1.   427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> >              git-5511    [001] d.h1.   427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >              git-5511    [001] d.h1.   427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> >              git-5511    [001] d.h1.   427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >              git-5511    [001] d.s1.   427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> >              git-5511    [001] d.s1.   427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >
> > Debug patch applied:
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 95c3c097083e..5c9b3e1de7a0 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >
> >         freq = get_capacity_ref_freq(policy);
> >         freq = map_util_freq(util, freq, max);
> > +       trace_printk("[DEBUG] : freq %llu, util %llu, max %llu\n", freq, util, max);
> >
> >         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> >                 return sg_policy->next_freq;
> > @@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> >         util = max(util, boost);
> >         sg_cpu->bw_min = min;
> >         sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> > +       trace_printk("[DEBUG] : util %llu, sg_cpu->util %llu\n", util, sg_cpu->util);
> >  }
> >
> >  /**
> >
> >
> > So, I guess map_util_freq is going wrong somewhere.
> 
> Thanks for the trace. It was really helpful and I think that I got the
> root cause.
> 
> The problem comes from get_capacity_ref_freq() which returns current
> freq when arch_scale_freq_invariant() is not enabled, and the fact that
> we apply map_util_perf() earlier in the path now which is then capped
> by max capacity.
> 
> Could you try the below ?
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e420e2ee1a10..611c621543f4 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct
> cpufreq_policy *policy)
>         if (arch_scale_freq_invariant())
>                 return policy->cpuinfo.max_freq;
> 
> -       return policy->cur;
> +       return policy->cur + policy->cur >> 2;
>  }
> 
>  /**

The issue seems to be fixed with this (but modified a bit by me for operator precedence):

patch:

@@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
        if (arch_scale_freq_invariant())
                return policy->cpuinfo.max_freq;
 
-       return policy->cur;
+       return policy->cur + (policy->cur >> 2);
 }
 
 /**
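
(For the record, the precedence detail matters here: in C, + binds tighter
than >>, so the original expression parsed as

	(policy->cur + policy->cur) >> 2	/* == policy->cur / 2 */

i.e. it requested half the current frequency instead of the intended
policy->cur * 1.25.)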

trace:
            make-7912    [001] d..2.   182.070005: sugov_get_util: [DEBUG] : util 595, sg_cpu->util 743
            make-7912    [001] d..2.   182.070006: get_next_freq.constprop.0: [DEBUG] : freq 3537231, util 743, max 1024
              sh-7956    [001] d..2.   182.070494: sugov_get_util: [DEBUG] : util 835, sg_cpu->util 1024
              sh-7956    [001] d..2.   182.070495: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
              sh-7956    [001] d..2.   182.070576: sugov_get_util: [DEBUG] : util 955, sg_cpu->util 1024
              sh-7956    [001] d..2.   182.070576: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
              sh-7957    [001] d.h1.   182.072120: sugov_get_util: [DEBUG] : util 990, sg_cpu->util 1024
              sh-7957    [001] d.h1.   182.072121: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
              nm-7957    [001] dNh1.   182.076088: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
              nm-7957    [001] dNh1.   182.076089: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
            grep-7958    [001] d..2.   182.076833: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024


Thanks,
Wyes

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 18:11                 ` Wyes Karny
@ 2024-01-14 18:18                   ` Vincent Guittot
  0 siblings, 0 replies; 54+ messages in thread
From: Vincent Guittot @ 2024-01-14 18:18 UTC (permalink / raw)
  To: Wyes Karny
  Cc: Linus Torvalds, Qais Yousef, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Sun, 14 Jan 2024 at 19:11, Wyes Karny <wkarny@gmail.com> wrote:
>
> On Sun, Jan 14, 2024 at 02:03:14PM +0100, Vincent Guittot wrote:
> > On Sun, 14 Jan 2024 at 13:38, Wyes Karny <wkarny@gmail.com> wrote:
> > >
> > > On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
> > > > Hi Wyes,
> > > >
> > > > Le dimanche 14 janv. 2024 à 14:42:40 (+0530), Wyes Karny a écrit :
> > > > > On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> > > > > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > > > > <torvalds@linux-foundation.org> wrote:
> > > > > > >
> > > > > > > It's one of these two:
> > > > > > >
> > > > > > >   f12560779f9d sched/cpufreq: Rework iowait boost
> > > > > > >   9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > > > > >
> > > > > > > one more boot to go, then I'll try to revert whichever causes my
> > > > > > > machine to perform horribly much worse.
> > > > > >
> > > > > > I guess it should come as no surprise that the result is
> > > > > >
> > > > > >    9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > > > > >
> > > > > > but to revert cleanly I will have to revert all of
> > > > > >
> > > > > >       b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > > > > >       f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > > > > >       9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > > > > performance estimation")
> > > > > >
> > > > > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > > > > >
> > > > > > I'll keep that revert in my private test-tree for now (so that I have
> > > > > > a working machine again), but I'll move it to my main branch soon
> > > > > > unless somebody has a quick fix for this problem.
> > > > >
> > > > > Hi Linus,
> > > > >
> > > > > I'm able to reproduce this issue with my AMD Ryzen 5600G system.  But
> > > > > only if I disable CPPC in BIOS and boot with acpi-cpufreq + schedutil.
> > > > > (I believe for your case CPPC is also disabled, as the "_CPC object is not
> > > > > present" log came up). With CPPC enabled in BIOS, the issue is not seen on my
> > > > > system. For AMD, acpi-cpufreq also uses the _CPC object to determine the boost ratio.
> > > > > When CPPC is disabled in BIOS something is going wrong and max
> > > > > capacity is becoming zero.
> > > > >
> > > > > Hi Vincent, Qais,
> > > > >
> >
> > ...
> >
> > > >
> > > > There is something strange that I don't understand
> > > >
> > > > Could you trace on the return of sugov_get_util()
> > > > the value of sg_cpu->util ?
> > >
> > > Yeah, correct: something was wrong in the bpftrace readings; max_cap is
> > > not zero in traces.
> > >
> > >              git-5511    [001] d.h1.   427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > >              git-5511    [001] d.h1.   427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > >              git-5511    [001] d.h1.   427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > >              git-5511    [001] d.h1.   427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > >              git-5511    [001] d.h1.   427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > >              git-5511    [001] d.h1.   427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > >              git-5511    [001] d.h1.   427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > >              git-5511    [001] d.h1.   427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > >              git-5511    [001] d.h1.   427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > >              git-5511    [001] d.s1.   427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > >              git-5511    [001] d.s1.   427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > >
> > > Debug patch applied:
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 95c3c097083e..5c9b3e1de7a0 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> > >
> > >         freq = get_capacity_ref_freq(policy);
> > >         freq = map_util_freq(util, freq, max);
> > > +       trace_printk("[DEBUG] : freq %llu, util %llu, max %llu\n", freq, util, max);
> > >
> > >         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > >                 return sg_policy->next_freq;
> > > @@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> > >         util = max(util, boost);
> > >         sg_cpu->bw_min = min;
> > >         sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> > > +       trace_printk("[DEBUG] : util %llu, sg_cpu->util %llu\n", util, sg_cpu->util);
> > >  }
> > >
> > >  /**
> > >
> > >
> > > So, I guess map_util_freq is going wrong somewhere.
> >
> > Thanks for the trace. It was really helpful and I think that I got the
> > root cause.
> >
> > The problem comes from get_capacity_ref_freq() which returns current
> > freq when arch_scale_freq_invariant() is not enabled, and the fact that
> > we apply map_util_perf() earlier in the path now which is then capped
> > by max capacity.
> >
> > Could you try the below ?
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index e420e2ee1a10..611c621543f4 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct
> > cpufreq_policy *policy)
> >         if (arch_scale_freq_invariant())
> >                 return policy->cpuinfo.max_freq;
> >
> > -       return policy->cur;
> > +       return policy->cur + policy->cur >> 2;
> >  }
> >
> >  /**
>
> The issue seems to be fixed with this (but modified a bit by me for operator precedence):

Ok, thanks for your test

>
> patch:
>
> @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
>         if (arch_scale_freq_invariant())
>                 return policy->cpuinfo.max_freq;
>
> -       return policy->cur;
> +       return policy->cur + (policy->cur >> 2);

Fair enough.
I'll prepare a patch with the fix and will add your Tested-by.

Thanks for your help

>  }
>
>  /**
>
> trace:
>             make-7912    [001] d..2.   182.070005: sugov_get_util: [DEBUG] : util 595, sg_cpu->util 743
>             make-7912    [001] d..2.   182.070006: get_next_freq.constprop.0: [DEBUG] : freq 3537231, util 743, max 1024
>               sh-7956    [001] d..2.   182.070494: sugov_get_util: [DEBUG] : util 835, sg_cpu->util 1024
>               sh-7956    [001] d..2.   182.070495: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
>               sh-7956    [001] d..2.   182.070576: sugov_get_util: [DEBUG] : util 955, sg_cpu->util 1024
>               sh-7956    [001] d..2.   182.070576: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
>               sh-7957    [001] d.h1.   182.072120: sugov_get_util: [DEBUG] : util 990, sg_cpu->util 1024
>               sh-7957    [001] d.h1.   182.072121: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
>               nm-7957    [001] dNh1.   182.076088: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>               nm-7957    [001] dNh1.   182.076089: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
>             grep-7958    [001] d..2.   182.076833: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>
>
> Thanks,
> Wyes

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 15:20                   ` Vincent Guittot
@ 2024-01-14 19:58                     ` Qais Yousef
  2024-01-14 23:37                       ` Qais Yousef
                                         ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Qais Yousef @ 2024-01-14 19:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Wyes Karny, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

[-- Attachment #1: Type: text/plain, Size: 4877 bytes --]

On 01/14/24 16:20, Vincent Guittot wrote:
> On Sun, 14 Jan 2024 at 16:12, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 01/14/24 14:03, Vincent Guittot wrote:
> >
> > > Thanks for the trace. It was really helpful and I think that I got the
> > > root cause.
> > >
> > > The problem comes from get_capacity_ref_freq() which returns current
> > > freq when arch_scale_freq_invariant() is not enabled, and the fact that
> > > we apply map_util_perf() earlier in the path now which is then capped
> > > by max capacity.
> > >
> > > Could you try the below ?
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index e420e2ee1a10..611c621543f4 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct
> > > cpufreq_policy *policy)
> > >         if (arch_scale_freq_invariant())
> > >                 return policy->cpuinfo.max_freq;
> > >
> > > -       return policy->cur;
> > > +       return policy->cur + policy->cur >> 2;
> > >  }
> > >
> > >  /**
> >
> > Is this a test patch or a proper fix? I can't see it being the latter. It seems
> 
> It's a proper fix. It's the same mechanism that is used already:
>  - Either you add margin on the utilization to go above the current freq
> before it is fully used. This is what was done previously.
>  - Or you add margin on the freq range to select a higher freq than the
> current one before it becomes fully used.

Aren't we applying the 25% headroom twice then?

> 
> > the current logic fails when util is already 1024, and I think we're trying to
> > fix the invariance issue too late.
> >
> > Is the problem that we can't read policy->cur in the scheduler to fix the util
> > while it's being updated, and that's why it's done here in this case?
> >
> > If this is the problem, shouldn't the logic be if util is max then always go to
> > max frequency? I don't think we have enough info to correct the invariance here
> > IIUC. All we can see is that the system is saturated at this frequency, and whether
> > a small jump or a big jump is required is hard to tell.
> >
> > Something like this
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 95c3c097083e..473d0352030b 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -164,8 +164,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >         struct cpufreq_policy *policy = sg_policy->policy;
> >         unsigned int freq;
> >
> > -       freq = get_capacity_ref_freq(policy);
> > -       freq = map_util_freq(util, freq, max);
> > +       if (util != max) {
> > +               freq = get_capacity_ref_freq(policy);
> > +               freq = map_util_freq(util, freq, max);
> > +       } else {
> > +               freq = policy->cpuinfo.max_freq;
> > +       }
> 
> This is not correct because you will have to wait to reach full
> utilization at the current OPP possibly the lowest OPP before moving
> directly to max OPP

Isn't this already the case? The ratio ((util + headroom) / max) will be less
than 1 until util is 80% (with the 25% headroom). And for all values <= 80% *
max, we will request a frequency smaller than or equal to policy->cur, no?

ie:

	util = 600
	max = 1024

	freq = 1.25 * 600 * policy->cur / 1024 = 0.73 * policy->cur

((util + headroom) / max) must be greater than 1 for us to start going above
policy->cur - which seems to have been working by accident IIUC.

So yes, my proposal is incorrect, but the conversion still doesn't seem right
to me now.

I could reproduce the problem now (thanks Wyes!). I have 3 freqs on my system

2.2GHz, 2.8GHz and 3.8GHz

which (I believe) translates into capacities

~592, ~754, 1024

which means we should pick 2.8GHz as soon as util * 1.25 > 592, which
translates into util = ~473.

But what I see is that we go to 2.8GHz when we jump from 650 to 680 (see the
attached picture), which is what you'd expect since we now apply two headrooms,
which means the ratio ((util + headroom) / max) will be greater than 1 only
after we go above this value:

	1024 * 0.8 * 0.8 = ~655

So I think the math makes sense logically, but we're missing some other
correction factor.

When I re-enable CPPC I see for the same test that we go into 3.8GHz straight
away. My test is simple busyloop via

	cat /dev/zero > /dev/null

I see the CPU util_avg is at 523 at fork. I expected us to run at 2.8GHz here
to be honest, but I am not sure if util_cfs_boost() and util_est() are maybe
causing us to be slightly above 523, and that's why we start at max freq.

Or I've done the math wrong :-) But the two don't behave the same for the same
kernel with and without CPPC.

> 
> >
> >         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> >                 return sg_policy->next_freq;

[-- Attachment #2: cppc_freq_fix.png --]
[-- Type: image/png, Size: 27789 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 19:58                     ` Qais Yousef
@ 2024-01-14 23:37                       ` Qais Yousef
  2024-01-15  6:25                         ` Wyes Karny
  2024-01-15  8:21                       ` Vincent Guittot
  2024-01-15  8:42                       ` David Laight
  2 siblings, 1 reply; 54+ messages in thread
From: Qais Yousef @ 2024-01-14 23:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Wyes Karny, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

[-- Attachment #1: Type: text/plain, Size: 3634 bytes --]

On 01/14/24 19:58, Qais Yousef wrote:

> > This is not correct because you will have to wait to reach full
> > utilization at the current OPP possibly the lowest OPP before moving
> > directly to max OPP
> 
> Isn't this already the case? The ratio (util+headroom/max) will be less than
> 1 until util is 80% (with 25% headroom). And for all values <= 80% * max, we
> will request a frequency smaller than/equal policy->cur, no?
> 
> ie:
> 
> 	util = 600
> 	max = 1024
> 
> 	freq = 1.25 * 600 * policy->cur / 1024 = 0.73 * policy->cur
> 
> (util+headroom/max) must be greater than 1 for us to start going above
> policy->cur - which seems to have been working by accident IIUC.
> 
> So yes my proposal is incorrect, but it seems the conversion is not right to me
> now.
> 
> I could reproduce the problem now (thanks Wyes!). I have 3 freqs on my system
> 
> 2.2GHz, 2.8GHz and 3.8GHz
> 
> which (I believe) translates into capacities
> 
> ~592, ~754, 1024
> 
> which means we should pick 2.8GHz as soon as util * 1.25 > 592; which
> translates into util = ~473.
> 
> But what I see is that we go to 2.8GHz when we jump from 650 to 680 (see
> attached picture), which is what you'd expect since we apply two headrooms now,
> which means the ratio ((util + headroom) / max) will be greater than 1 only after we go above
> this value
> 
> 	1024 * 0.8 * 0.8 = ~655
> 
> So I think the math makes sense logically, but we're missing some other
> correction factor.
> 
> When I re-enable CPPC I see for the same test that we go into 3.8GHz straight
> away. My test is simple busyloop via
> 
> 	cat /dev/zero > /dev/null
> 
> I see the CPU util_avg is at 523 at fork. I expected us to run to 2.8GHz here
> to be honest, but I am not sure if util_cfs_boost() and util_est() are maybe
> causing us to be slightly above 523 and that's why we start with max freq.
> 
> Or I've done the math wrong :-) But the two don't behave the same for the same
> kernel with and without CPPC.

I think the relationship should be:

	freq = util * f_curr / cap_curr

(patch below)

With that I see (almost) the expected behavior (picture attached). We go to
2.8GHz when we are above 500. But the move to 3.8GHz is a bit earlier, at 581
(instead of 754 * 0.8 = 603). Not sure why. With the 25% headroom, 581 becomes
726. So it's a tad too early.


diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..155f96a44fa0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -123,7 +123,8 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
  * Return: the reference CPU frequency to compute a capacity.
  */
 static __always_inline
-unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
+unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy,
+                                   unsigned long *max)
 {
        unsigned int freq = arch_scale_freq_ref(policy->cpu);
 
@@ -133,6 +134,9 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
        if (arch_scale_freq_invariant())
                return policy->cpuinfo.max_freq;
 
+       if (max)
+               *max = policy->cur * (*max) / policy->cpuinfo.max_freq;
+
        return policy->cur;
 }
 
@@ -164,7 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
        struct cpufreq_policy *policy = sg_policy->policy;
        unsigned int freq;
 
-       freq = get_capacity_ref_freq(policy);
+       freq = get_capacity_ref_freq(policy, &max);
        freq = map_util_freq(util, freq, max);
 
        if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
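
For reference, plugging the scaled max back into map_util_freq() (which here
computes ref * util / max) shows what this patch ends up requesting:

	freq = policy->cur * util / (policy->cur * max / cpuinfo.max_freq)
	     = util * cpuinfo.max_freq / max

i.e. the same formula as the frequency-invariant case, only derived from
policy->cur, which would explain why the behavior gets close to the CPPC
(frequency-invariant) one.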

[-- Attachment #2: cppc_freq_fix2.png --]
[-- Type: image/png, Size: 30853 bytes --]

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 23:37                       ` Qais Yousef
@ 2024-01-15  6:25                         ` Wyes Karny
  2024-01-15 11:59                           ` Qais Yousef
  0 siblings, 1 reply; 54+ messages in thread
From: Wyes Karny @ 2024-01-15  6:25 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Vincent Guittot, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

Hi Qais,

On Mon, Jan 15, 2024 at 5:07 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/14/24 19:58, Qais Yousef wrote:
>
> > > This is not correct because you will have to wait to reach full
> > > utilization at the current OPP possibly the lowest OPP before moving
> > > directly to max OPP
> >
> > Isn't this already the case? The ratio (util+headroom/max) will be less than
> > 1 until util is 80% (with 25% headroom). And for all values <= 80% * max, we
> > will request a frequency smaller than/equal policy->cur, no?
> >
> > ie:
> >
> >       util = 600
> >       max = 1024
> >
> >       freq = 1.25 * 600 * policy->cur / 1024 = 0.73 * policy->cur
> >
> > (util+headroom/max) must be greater than 1 for us to start going above
> > policy->cur - which seems to have been working by accident IIUC.
> >
> > So yes my proposal is incorrect, but it seems the conversion is not right to me
> > now.
> >
> > I could reproduce the problem now (thanks Wyes!). I have 3 freqs on my system
> >
> > 2.2GHz, 2.8GHz and 3.8GHz
> >
> > which (I believe) translates into capacities
> >
> > ~592, ~754, 1024
> >
> > which means we should pick 2.8GHz as soon as util * 1.25 > 592; which
> > translates into util = ~473.
> >
> > But what I see is that we go to 2.8GHz when we jump from 650 to 680 (see
> > attached picture), which is what you'd expect since we apply two headrooms now,
> > which means the ratio ((util + headroom) / max) will be greater than 1 only after we go above
> > this value
> >
> >       1024 * 0.8 * 0.8 = ~655
> >
> > So I think the math makes sense logically, but we're missing some other
> > correction factor.
> >
> > When I re-enable CPPC I see for the same test that we go into 3.8GHz straight
> > away. My test is simple busyloop via
> >
> >       cat /dev/zero > /dev/null
> >
> > I see the CPU util_avg is at 523 at fork. I expected us to run to 2.8GHz here
> > to be honest, but I am not sure if util_cfs_boost() and util_est() are maybe
> > causing us to be slightly above 523 and that's why we start with max freq.
> >
> > Or I've done the math wrong :-) But the two don't behave the same for the same
> > kernel with and without CPPC.
>
> I think the relationship should be:
>
>         freq = util * f_curr / cap_curr

I guess to know the curr_cap correctly we need to know the max_freq,
which is not available when CPPC is disabled.

>
> (patch below)
>
> with that I see (almost) the expected behavior (picture attached). We go to
> 2.8GHz when we are above 500. But the move to 3.8GHz is a bit earlier at 581
> (instead of 754 * 0.8 = 603). Not sure why. With 25% headroom 581 is 726. So
> it's a tad too early.
>
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..155f96a44fa0 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -123,7 +123,8 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
>   * Return: the reference CPU frequency to compute a capacity.
>   */
>  static __always_inline
> -unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> +unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy,
> +                                   unsigned long *max)
>  {
>         unsigned int freq = arch_scale_freq_ref(policy->cpu);
>
> @@ -133,6 +134,9 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
>         if (arch_scale_freq_invariant())
>                 return policy->cpuinfo.max_freq;
>
> +       if (max)
> +               *max = policy->cur * (*max) / policy->cpuinfo.max_freq;

But when freq invariance is disabled we don't have policy->cpuinfo.max_freq.

Thanks,
Wyes
> +
>         return policy->cur;
>  }
>
> @@ -164,7 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>         struct cpufreq_policy *policy = sg_policy->policy;
>         unsigned int freq;
>
> -       freq = get_capacity_ref_freq(policy);
> +       freq = get_capacity_ref_freq(policy, &max);
>         freq = map_util_freq(util, freq, max);
>
>         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)



-- 
Thanks & Regards
Wyes

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 19:58                     ` Qais Yousef
  2024-01-14 23:37                       ` Qais Yousef
@ 2024-01-15  8:21                       ` Vincent Guittot
  2024-01-15 12:09                         ` Qais Yousef
  2024-01-15  8:42                       ` David Laight
  2 siblings, 1 reply; 54+ messages in thread
From: Vincent Guittot @ 2024-01-15  8:21 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Wyes Karny, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Sun, 14 Jan 2024 at 20:58, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/14/24 16:20, Vincent Guittot wrote:
> > On Sun, 14 Jan 2024 at 16:12, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 01/14/24 14:03, Vincent Guittot wrote:
> > >
> > > > Thanks for the trace. It was really helpful and I think that I got the
> > > > root cause.
> > > >
> > > > The problem comes from get_capacity_ref_freq() which returns current
> > > > freq when arch_scale_freq_invariant() is not enabled, and the fact that
> > > > we apply map_util_perf() earlier in the path now which is then capped
> > > > by max capacity.
> > > >
> > > > Could you try the below ?
> > > >
> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > index e420e2ee1a10..611c621543f4 100644
> > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct
> > > > cpufreq_policy *policy)
> > > >         if (arch_scale_freq_invariant())
> > > >                 return policy->cpuinfo.max_freq;
> > > >
> > > > -       return policy->cur;
> > > > +       return policy->cur + policy->cur >> 2;
> > > >  }
> > > >
> > > >  /**
> > >
> > > Is this a test patch or a proper fix? I can't see it being the latter. It seems
> >
> > It's a proper fix. It's the same mechanism that is used already:
> >  - Either you add margin on the utilization to go above the current freq
> > before it is fully used. This is what was done previously.
> >  - Or you add margin on the freq range to select a higher freq than the
> > current one before it becomes fully used.
>
> Aren't we applying the 25% headroom twice then?

In some cases, yes. But whereas the 1st one is generic and can be
filtered with uclamp, the 2nd one is there to enable platforms without
frequency invariance to jump to the next OPP.

>
> >
> > > the current logic fails when util is already 1024, and I think we're trying to
> > > fix the invariance issue too late.
> > >
> > > Is the problem that we can't read policy->cur in the scheduler to fix the util
> > > while it's being updated, and that's why it's done here in this case?
> > >
> > > If this is the problem, shouldn't the logic be if util is max then always go to
> > > max frequency? I don't think we have enough info to correct the invariance here
> > > IIUC. All we can see is that the system is saturated at this frequency, and whether
> > > a small jump or a big jump is required is hard to tell.
> > >
> > > Something like this
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 95c3c097083e..473d0352030b 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -164,8 +164,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> > >         struct cpufreq_policy *policy = sg_policy->policy;
> > >         unsigned int freq;
> > >
> > > -       freq = get_capacity_ref_freq(policy);
> > > -       freq = map_util_freq(util, freq, max);
> > > +       if (util != max) {
> > > +               freq = get_capacity_ref_freq(policy);
> > > +               freq = map_util_freq(util, freq, max);
> > > +       } else {
> > > +               freq = policy->cpuinfo.max_freq;
> > > +       }
> >
> > This is not correct because you will have to wait to reach full
> > utilization at the current OPP possibly the lowest OPP before moving
> > directly to max OPP
>
> Isn't this already the case? The ratio (util+headroom/max) will be less than
> 1 until util is 80% (with 25% headroom). And for all values <= 80% * max, we
> will request a frequency smaller than/equal policy->cur, no?

In your proposal, we stay at the current OPP until util == 1024, because
util can't be higher than max, and then we switch directly to max_freq.

In the previous behavior and with the fix, we can go to the next OPP
because either util can be higher than max (previous behavior) or the
selected freq can be higher than the current one (fix behavior), and we
don't have to wait for util == max.

>
> ie:
>
>         util = 600
>         max = 1024
>
>         freq = 1.25 * 600 * policy->cur / 1024 = 0.73 * policy->cur
>
> (util+headroom/max) must be greater than 1 for us to start going above
> policy->cur - which seems to have been working by accident IIUC.

No, this is not by accident.

>
> So yes my proposal is incorrect, but it seems the conversion is not right to me
> now.
>
> I could reproduce the problem now (thanks Wyes!). I have 3 freqs on my system
>
> 2.2GHz, 2.8GHz and 3.8GHz
>
> which (I believe) translates into capacities
>
> ~592, ~754, 1024
>
> which means we should pick 2.8GHz as soon as util * 1.25 > 592; which
> translates into util = ~473.

Keep in mind that util is the utilization at the current OPP, not at full
scale, because we don't have frequency invariance.

>
> But what I see is that we go to 2.8GHz when we jump from 650 to 680 (see
> attached picture), which is what you'd expect since we apply two headrooms now,
> which means the ratio ((util + headroom) / max) will be greater than 1 only after we go above
> this value
>
>         1024 * 0.8 * 0.8 = ~655
>
> So I think the math makes sense logically, but we're missing some other
> correction factor.
>
> When I re-enable CPPC I see for the same test that we go into 3.8GHz straight
> away. My test is simple busyloop via
>
>         cat /dev/zero > /dev/null
>
> I see the CPU util_avg is at 523 at fork. I expected us to run to 2.8GHz here
> to be honest, but I am not sure if util_cfs_boost() and util_est() are maybe
> causing us to be slightly above 523 and that's why we start with max freq.
>
> Or I've done the math wrong :-) But the two don't behave the same for the same
> kernel with and without CPPC.

They will never behave the same because they can't (sketched below):
- with invariance, the utilization is the utilization at max capacity,
so we can easily jump several OPPs to go directly to the right one
- without invariance, the utilization is the utilization at the current
OPP, so we can only jump a limited number of OPPs
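
A rough sketch of the two cases (my notation: raw_util is what PELT
accumulates while the CPU runs at f_curr, and 1.25 is the dvfs headroom):

	with invariance:     util ~= raw_util * f_curr / f_max    (full-scale)
	                     freq  = 1.25 * util * f_max / 1024   (any OPP reachable)

	without invariance:  util ~= raw_util                     (relative to f_curr)
	                     freq  = 1.25 * util * f_curr / 1024  (at most 1.25 * f_curr,
	                                                           i.e. the next OPP or so)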

>
> >
> > >
> > >         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > >                 return sg_policy->next_freq;

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [GIT PULL] Scheduler changes for v6.8
  2024-01-14 19:58                     ` Qais Yousef
  2024-01-14 23:37                       ` Qais Yousef
  2024-01-15  8:21                       ` Vincent Guittot
@ 2024-01-15  8:42                       ` David Laight
  2 siblings, 0 replies; 54+ messages in thread
From: David Laight @ 2024-01-15  8:42 UTC (permalink / raw)
  To: 'Qais Yousef', Vincent Guittot
  Cc: Wyes Karny, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

> My test is simple busyloop via
> 
> 	cat /dev/zero > /dev/null

I usually use the userspace loop:
	while :; do :; done

Quite good for testing whether the fans speed up!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-15  6:25                         ` Wyes Karny
@ 2024-01-15 11:59                           ` Qais Yousef
  0 siblings, 0 replies; 54+ messages in thread
From: Qais Yousef @ 2024-01-15 11:59 UTC (permalink / raw)
  To: Wyes Karny
  Cc: Vincent Guittot, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

Hi Wyes

On 01/15/24 11:55, Wyes Karny wrote:

> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 95c3c097083e..155f96a44fa0 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -123,7 +123,8 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
> >   * Return: the reference CPU frequency to compute a capacity.
> >   */
> >  static __always_inline
> > -unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> > +unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy,
> > +                                   unsigned long *max)
> >  {
> >         unsigned int freq = arch_scale_freq_ref(policy->cpu);
> >
> > @@ -133,6 +134,9 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> >         if (arch_scale_freq_invariant())
> >                 return policy->cpuinfo.max_freq;
> >
> > +       if (max)
> > +               *max = policy->cur * (*max) / policy->cpuinfo.max_freq;
> 
> But when freq invariance is disabled we don't have policy->cpuinfo.max_freq.

They are correctly reported as 3.8GHz for me:

$ grep . /sys/devices/system/cpu/cpufreq/policy*/cpuinfo_max_freq
/sys/devices/system/cpu/cpufreq/policy0/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy10/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy11/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy12/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy13/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy14/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy15/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy16/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy17/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy18/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy19/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy20/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy21/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy22/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy23/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy2/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy3/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy4/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy5/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy6/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy7/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy8/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy9/cpuinfo_max_freq:3800000

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-15  8:21                       ` Vincent Guittot
@ 2024-01-15 12:09                         ` Qais Yousef
  2024-01-15 13:26                           ` Vincent Guittot
  0 siblings, 1 reply; 54+ messages in thread
From: Qais Yousef @ 2024-01-15 12:09 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Wyes Karny, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On 01/15/24 09:21, Vincent Guittot wrote:

> > Or I've done the math wrong :-) But the two don't behave the same for the same
> > kernel with and without CPPC.
> 
> They will never behave the same because they can't
> - with invariance, the utilization is the utilization at max capacity
> so we can easily jump several OPP to go directly to the right one
> - without invariance, the utilization is the utilization at current
> OPP so we can only jump to a limited number of OPP

I am probably missing some subtlety, but the behavior looks more sensible to
me when we divide by the current capacity instead of the max one.

It seems what you're saying is that the capacity range for each OPP is 0-1024.
And it's when we know that we've saturated the current capacity level that we
decide to move on.

As I am trying to remove the hardcoded headroom values I am wary of another
one. But it seems this is a bandaid scenario anyway, so maybe I shouldn't
worry too much about it.


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-15 12:09                         ` Qais Yousef
@ 2024-01-15 13:26                           ` Vincent Guittot
  2024-01-15 14:03                             ` Dietmar Eggemann
  0 siblings, 1 reply; 54+ messages in thread
From: Vincent Guittot @ 2024-01-15 13:26 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Wyes Karny, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Mon, 15 Jan 2024 at 13:09, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/15/24 09:21, Vincent Guittot wrote:
>
> > > Or I've done the math wrong :-) But the two don't behave the same for the same
> > > kernel with and without CPPC.
> >
> > They will never behave the same because they can't
> > - with invariance, the utilization is the utilization at max capacity
> > so we can easily jump several OPP to go directly to the right one
> > - without invariance, the utilization is the utilization at current
> > OPP so we can only jump to a limited number of OPP
>
> I am probably missing some subtlety, but the behavior looks more sensible to
> me when we divide by current capacity instead of max one.
>
> It seems what you're saying is that the capacity range for each OPP is 0-1024.

Yes that's the case when you don't have frequency invariance

> And it's when we know that we've saturated the current capacity level that we decide
> to move on.

yes

>
> As I am trying to remove the hardcoded headroom values I am wary of another
> one. But it seems this is a bandaid scenario anyway; so maybe I shouldn't worry
> too much about it.
>
>
> Cheers
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-15 13:26                           ` Vincent Guittot
@ 2024-01-15 14:03                             ` Dietmar Eggemann
  2024-01-15 15:26                               ` Vincent Guittot
  0 siblings, 1 reply; 54+ messages in thread
From: Dietmar Eggemann @ 2024-01-15 14:03 UTC (permalink / raw)
  To: Vincent Guittot, Qais Yousef
  Cc: Wyes Karny, Linus Torvalds, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider

On 15/01/2024 14:26, Vincent Guittot wrote:
> On Mon, 15 Jan 2024 at 13:09, Qais Yousef <qyousef@layalina.io> wrote:
>>
>> On 01/15/24 09:21, Vincent Guittot wrote:
>>
>>>> Or I've done the math wrong :-) But the two don't behave the same for the same
>>>> kernel with and without CPPC.
>>>
>>> They will never behave the same because they can't
>>> - with invariance, the utilization is the utilization at max capacity
>>> so we can easily jump several OPP to go directly to the right one
>>> - without invariance, the utilization is the utilization at current
>>> OPP so we can only jump to a limited number of OPP
>>
>> I am probably missing some subtlety, but the behavior looks more sensible to
>> me when we divide by current capacity instead of max one.
>>
>> It seems what you're saying is that the capacity range for each OPP is 0-1024.
> 
> Yes that's the case when you don't have frequency invariance
> 
>> And it's when we know that we've saturated the current capacity level that we decide
>> to move on.
> 
> yes
> 
>>
>> As I am trying to remove the hardcoded headroom values I am wary of another
>> one. But it seems this is a bandaid scenario anyway; so maybe I shouldn't worry
>> too much about it.

I still don't fully understand this fix.

We had:

sugov_update_single_freq()

  sugov_update_single_common()

  next_f = get_next_freq()

   freq = arch_scale_freq_invariant() ?
          policy->cpuinfo.max_freq : policy->cur (**) <- (2) !freq_inv


  util = map_util_perf(util);                     <- (1) util *= 1.25

  freq = map_util_freq(util, freq, max);          <- (3)
}



And now there is:

sugov_update_single_freq()

  sugov_update_single_common()

    sugov_get_util()

      sg_cpu->util = sugov_effective_cpu_perf()

        /* Add dvfs headroom to actual utilization */
        actual = map_util_perf(actual)            <- (1) util *= 1.25

  next_f = get_next_freq()

    freq = get_capacity_ref_freq()

      return policy->cur (*)                      <- (2) !freq_inv

    freq = map_util_freq(util, freq, max)         <- (3)

Still not clear to me why we need this extra 'policy->cur *= 1.25' here
(*) and not here (**)



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-15 14:03                             ` Dietmar Eggemann
@ 2024-01-15 15:26                               ` Vincent Guittot
  2024-01-15 20:05                                 ` Dietmar Eggemann
  0 siblings, 1 reply; 54+ messages in thread
From: Vincent Guittot @ 2024-01-15 15:26 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Qais Yousef, Wyes Karny, Linus Torvalds, Ingo Molnar,
	linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On Mon, 15 Jan 2024 at 15:03, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 15/01/2024 14:26, Vincent Guittot wrote:
> > On Mon, 15 Jan 2024 at 13:09, Qais Yousef <qyousef@layalina.io> wrote:
> >>
> >> On 01/15/24 09:21, Vincent Guittot wrote:
> >>
> >>>> Or I've done the math wrong :-) But the two don't behave the same for the same
> >>>> kernel with and without CPPC.
> >>>
> >>> They will never behave the same because they can't
> >>> - with invariance, the utilization is the utilization at max capacity
> >>> so we can easily jump several OPP to go directly to the right one
> >>> - without invariance, the utilization is the utilization at current
> >>> OPP so we can only jump to a limited number of OPP
> >>
> >> I am probably missing some subtlety, but the behavior looks more sensible to
> >> me when we divide by current capacity instead of max one.
> >>
> >> It seems what you're saying is that the capacity range for each OPP is 0-1024.
> >
> > Yes that's the case when you don't have frequency invariance
> >
> >> And it's when we know that we've saturated the current capacity level that we decide
> >> to move on.
> >
> > yes
> >
> >>
> >> As I am trying to remove the hardcoded headroom values I am wary of another
> >> one. But it seems this is a bandaid scenario anyway; so maybe I shouldn't worry
> >> too much about it.
>
> I still don't fully understand this fix.
>
> We had:
>
> sugov_update_single_freq()
>
>   sugov_update_single_common()
>
>   next_f = get_next_freq()
>
>    freq = arch_scale_freq_invariant() ?
>           policy->cpuinfo.max_freq : policy->cur (**) <- (2) !freq_inv
>
>
>   util = map_util_perf(util);                     <- (1) util *= 1.25
>
>   freq = map_util_freq(util, freq, max);          <- (3)
> }
>
>
>
> And now there is:
>
> sugov_update_single_freq()
>
>   sugov_update_single_common()
>
>     sugov_get_util()
>
>       sg_cpu->util = sugov_effective_cpu_perf()
>
>         /* Add dvfs headroom to actual utilization */
>         actual = map_util_perf(actual)            <- (1) util *= 1.25
>
>   next_f = get_next_freq()
>
>     freq = get_capacity_ref_freq()
>
>       return policy->cur (*)                      <- (2) !freq_inv
>
>     freq = map_util_freq(util, freq, max)         <- (3)
>
> Still not clear to me why we need this extra 'policy->cur *= 1.25' here
> (*) and not here (**)

Now util can't be higher than max, to handle clamping use cases, whereas
it could be the case before. The jump to the next OPP was previously done
with util going higher than max, and it's now done with freq going higher
than policy->cur.
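
To spell it out with numbers (my sketch; CPU fully busy at the current OPP,
so raw util ~= 1024 and max = 1024, with "cur" the current frequency):

	old:     util = 1024 * 1.25 = 1280 (not clamped)
	         freq = cur * 1280 / 1024 = 1.25 * cur    -> next OPP
	broken:  util = min(1024 * 1.25, 1024) = 1024
	         freq = cur * 1024 / 1024 = cur            -> stuck at cur
	fixed:   util = 1024, ref = cur + cur/4
	         freq = ref * 1024 / 1024 = 1.25 * cur     -> next OPP

The 25% margin survives the clamping because it now lives on the frequency
side of map_util_freq().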

>
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [GIT PULL] Scheduler changes for v6.8
  2024-01-15 15:26                               ` Vincent Guittot
@ 2024-01-15 20:05                                 ` Dietmar Eggemann
  0 siblings, 0 replies; 54+ messages in thread
From: Dietmar Eggemann @ 2024-01-15 20:05 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Qais Yousef, Wyes Karny, Linus Torvalds, Ingo Molnar,
	linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On 15/01/2024 16:26, Vincent Guittot wrote:
> On Mon, 15 Jan 2024 at 15:03, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 15/01/2024 14:26, Vincent Guittot wrote:
>>> On Mon, 15 Jan 2024 at 13:09, Qais Yousef <qyousef@layalina.io> wrote:
>>>>
>>>> On 01/15/24 09:21, Vincent Guittot wrote:

[...]

> Now, util can't be higher than max to handle clamping use cases
> whereas it could be the case before. The jump to next OPP was
> previously done with util being higher than max and it's now done with
> freq being higher than policy->cur

Ah, OK:

  util = map_util_perf(util)     <- (1) util *= 1.25

in the old code was doing this, and you now do this via the frequency for
the !freq_inv case. I just saw that you already mention this in the
patch header of:
https://lkml.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2024-01-15 20:05 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-28 12:23 [GIT PULL] Scheduler changes for v6.7 Ingo Molnar
2023-10-30 23:50 ` pr-tracker-bot
2024-01-08 14:07 ` [GIT PULL] Scheduler changes for v6.8 Ingo Molnar
2024-01-09  4:04   ` pr-tracker-bot
2024-01-10 22:19   ` Linus Torvalds
2024-01-10 22:41     ` Linus Torvalds
2024-01-10 22:57       ` Linus Torvalds
2024-01-11  8:11         ` Vincent Guittot
2024-01-11 17:45           ` Linus Torvalds
2024-01-11 17:53             ` Linus Torvalds
2024-01-11 18:16               ` Vincent Guittot
2024-01-12 14:23                 ` Dietmar Eggemann
2024-01-12 16:58                   ` Vincent Guittot
2024-01-12 18:18                   ` Qais Yousef
2024-01-12 19:03                     ` Vincent Guittot
2024-01-12 20:30                       ` Linus Torvalds
2024-01-12 20:49                         ` Linus Torvalds
2024-01-12 21:04                           ` Linus Torvalds
2024-01-13  1:04                             ` Qais Yousef
2024-01-13  1:24                               ` Linus Torvalds
2024-01-13  1:31                                 ` Linus Torvalds
2024-01-13 10:47                                   ` Vincent Guittot
2024-01-13 18:33                                     ` Qais Yousef
2024-01-13 18:37                                 ` Qais Yousef
2024-01-11 11:09         ` [GIT PULL] scheduler fixes Ingo Molnar
2024-01-11 13:04           ` Vincent Guittot
2024-01-11 20:48             ` [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit Ingo Molnar
2024-01-11 22:22               ` Vincent Guittot
2024-01-12 18:24               ` Ingo Molnar
2024-01-12 18:26         ` [GIT PULL] Scheduler changes for v6.8 Ingo Molnar
2024-01-14  9:12         ` Wyes Karny
2024-01-14 11:18           ` Vincent Guittot
2024-01-14 12:37             ` Wyes Karny
2024-01-14 13:02               ` Dietmar Eggemann
2024-01-14 13:05                 ` Vincent Guittot
2024-01-14 13:03               ` Vincent Guittot
2024-01-14 15:12                 ` Qais Yousef
2024-01-14 15:20                   ` Vincent Guittot
2024-01-14 19:58                     ` Qais Yousef
2024-01-14 23:37                       ` Qais Yousef
2024-01-15  6:25                         ` Wyes Karny
2024-01-15 11:59                           ` Qais Yousef
2024-01-15  8:21                       ` Vincent Guittot
2024-01-15 12:09                         ` Qais Yousef
2024-01-15 13:26                           ` Vincent Guittot
2024-01-15 14:03                             ` Dietmar Eggemann
2024-01-15 15:26                               ` Vincent Guittot
2024-01-15 20:05                                 ` Dietmar Eggemann
2024-01-15  8:42                       ` David Laight
2024-01-14 18:11                 ` Wyes Karny
2024-01-14 18:18                   ` Vincent Guittot
2024-01-11  9:33     ` Ingo Molnar
2024-01-11 11:14     ` [tip: sched/urgent] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits tip-bot2 for Ingo Molnar
2024-01-11 20:55     ` [tip: sched/urgent] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit tip-bot2 for Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).