From: Ingo Molnar <mingo@kernel.org>
To: Qais Yousef <qyousef@layalina.io>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Christian Loehle <christian.loehle@arm.com>,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] sched: Consolidate cpufreq updates
Date: Tue, 26 Mar 2024 09:20:31 +0100
Message-ID: <ZgKFT5b423hfQdl9@gmail.com>
In-Reply-To: <20240324020139.1032473-1-qyousef@layalina.io>


* Qais Yousef <qyousef@layalina.io> wrote:

> Results of `perf stat --repeat 10 perf bench sched pipe` on AMD 3900X to
> verify any potential overhead because of the addition at context switch
> 
> Before:
> -------
> 
> 	Performance counter stats for 'perf bench sched pipe' (10 runs):
> 
> 		 16,839.74 msec task-clock:u              #    1.158 CPUs utilized            ( +-  0.52% )
> 			 0      context-switches:u        #    0.000 /sec
> 			 0      cpu-migrations:u          #    0.000 /sec
> 		     1,390      page-faults:u             #   83.903 /sec                     ( +-  0.06% )
> 	       333,773,107      cycles:u                  #    0.020 GHz                      ( +-  0.70% )  (83.72%)
> 		67,050,466      stalled-cycles-frontend:u #   19.94% frontend cycles idle     ( +-  2.99% )  (83.23%)
> 		37,763,775      stalled-cycles-backend:u  #   11.23% backend cycles idle      ( +-  2.18% )  (83.09%)
> 		84,456,137      instructions:u            #    0.25  insn per cycle
> 							  #    0.83  stalled cycles per insn  ( +-  0.02% )  (83.01%)
> 		34,097,544      branches:u                #    2.058 M/sec                    ( +-  0.02% )  (83.52%)
> 		 8,038,902      branch-misses:u           #   23.59% of all branches          ( +-  0.03% )  (83.44%)
> 
> 		   14.5464 +- 0.0758 seconds time elapsed  ( +-  0.52% )
> 
> After:
> -------
> 
> 	Performance counter stats for 'perf bench sched pipe' (10 runs):
> 
> 		 16,219.58 msec task-clock:u              #    1.130 CPUs utilized            ( +-  0.80% )
> 			 0      context-switches:u        #    0.000 /sec
> 			 0      cpu-migrations:u          #    0.000 /sec
> 		     1,391      page-faults:u             #   85.163 /sec                     ( +-  0.06% )
> 	       342,768,312      cycles:u                  #    0.021 GHz                      ( +-  0.63% )  (83.36%)
> 		66,231,208      stalled-cycles-frontend:u #   18.91% frontend cycles idle     ( +-  2.34% )  (83.95%)
> 		39,055,410      stalled-cycles-backend:u  #   11.15% backend cycles idle      ( +-  1.80% )  (82.73%)
> 		84,475,662      instructions:u            #    0.24  insn per cycle
> 							  #    0.82  stalled cycles per insn  ( +-  0.02% )  (83.05%)
> 		34,067,160      branches:u                #    2.086 M/sec                    ( +-  0.02% )  (83.67%)
> 		 8,042,888      branch-misses:u           #   23.60% of all branches          ( +-  0.07% )  (83.25%)
> 
> 		    14.358 +- 0.116 seconds time elapsed  ( +-  0.81% )

Noise caused by too many counters & the vagaries of multi-CPU scheduling is 
drowning out any results here.

I'd suggest something like this to measure same-CPU context-switching 
overhead:

    taskset 1 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe
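
(Here 'taskset 1' is a CPU affinity bitmask, i.e. CPU #0 only; if an 
explicit CPU list reads better, this should be equivalent:

    taskset -c 0 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe

Both pipe tasks inherit the affinity, so every context switch stays on that 
one CPU.)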

... and make sure the cpufreq governor is at 'performance' first:

    for ((cpu=0; cpu < $(nproc); cpu++)); do echo performance > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor; done
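
To double check that the change took effect you can print the governor of 
every CPU, assuming the usual cpufreq sysfs layout:

    grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor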

With that approach you should see much, much lower noise levels even with 
just 3 runs:

 Performance counter stats for 'perf bench sched pipe' (3 runs):

    51,616,501,297      cycles                           #    3.188 GHz                         ( +-  0.05% )
    37,523,641,203      instructions                     #    0.73  insn per cycle              ( +-  0.08% )
         16,191.01 msec task-clock                       #    0.999 CPUs utilized               ( +-  0.04% )

          16.20511 +- 0.00578 seconds time elapsed  ( +-  0.04% )
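
Once you are done, remember to flip the governor back, e.g. assuming 
'schedutil' was the previous setting:

    for ((cpu=0; cpu < $(nproc); cpu++)); do echo schedutil > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor; done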

Thanks,

	Ingo
