* [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
@ 2016-02-18 11:11 Mel Gorman
2016-02-18 19:43 ` Rafael J. Wysocki
2016-02-19 11:11 ` Stephane Gasparini
0 siblings, 2 replies; 11+ messages in thread
From: Mel Gorman @ 2016-02-18 11:11 UTC (permalink / raw)
To: Rafael Wysocki
Cc: Dirk Brandewie, Ingo Molnar, Peter Zijlstra, Matt Fleming,
Mike Galbraith, Linux-PM, LKML, Mel Gorman
(cc'ing pm and scheduler people as the problem could be blamed on either
subsystem depending on your point of view)
The PID relies on samples of equal time but this does not apply for
deferrable timers when the CPU is idle. intel_pstate checks if the actual
duration between samples is large and if so, the "busyness" of the CPU
is scaled.
This assumes the delay was a deferred timer but a workload may simply have
been idle for a short time if it's context switching between a server and
client or waiting very briefly on IO. It's compounded by the problem that
server/clients migrate between CPUs due to wake-affine trying to maximise
hot cache usage. In such cases, the cores are not considered busy and the
frequency is dropped prematurely.
This patch increases the hold-off value before the busyness is scaled. It
was selected based simply on testing until the desired result was found.
Tests were conducted with workloads that are either client/server based
or short-lived IO.
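
As a rough illustration of the threshold being changed, the check can be sketched as below. This is a minimal stand-alone sketch, not the driver code: holdoff_exceeded() is a hypothetical helper, and the 10 ms sample period mentioned afterwards is an assumed value for pid_params.sample_rate_ms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define USEC_PER_MSEC 1000L

/*
 * Hypothetical helper (not present in intel_pstate.c): returns true when
 * the gap between two samples is long enough for the busyness to be
 * scaled down. 'multiplier' is 3 before the patch and 12 after it.
 */
static bool holdoff_exceeded(int64_t duration_us, uint32_t sample_rate_ms,
			     uint32_t multiplier)
{
	int64_t sample_time = (int64_t)sample_rate_ms * USEC_PER_MSEC;

	assert(duration_us >= 0);
	return duration_us > sample_time * multiplier;
}
```

With an assumed 10 ms sample period, a 50 ms gap exceeds the old 30 ms threshold but not the new 120 ms one, so holdoff_exceeded(50000, 10, 3) is true while holdoff_exceeded(50000, 10, 12) is false.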
dbench4
                          4.5.0-rc2            4.5.0-rc2
                            vanilla          sample-v1r1
Hmean    mb/sec-1     309.82 (  0.00%)    327.01 (  5.55%)
Hmean    mb/sec-2     594.92 (  0.00%)    613.02 (  3.04%)
Hmean    mb/sec-4     669.17 (  0.00%)    712.27 (  6.44%)
Hmean    mb/sec-8     700.82 (  0.00%)    724.04 (  3.31%)
Hmean    mb/sec-64    425.38 (  0.00%)    448.02 (  5.32%)

              4.5.0-rc2    4.5.0-rc2
                vanilla  sample-v1r1
Mean %Busy        27.28        26.81
Mean CPU%c1       42.50        44.29
Mean CPU%c3        7.16         7.14
Mean CPU%c6       23.05        21.76
Mean CPU%c7        0.00         0.00
Mean CorWatt       4.60         5.08
Mean PkgWatt       6.83         7.32
There is a fairly sizable performance boost from the modification and while
the percentage of time spent in C1 is increased, it is not by a substantial
amount and the power usage increase is tiny.
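
For reference, the percentage column and the "Hmean" label can be reproduced with a sketch like the one below. It assumes that Hmean denotes the harmonic mean of the per-iteration throughputs; the function names are hypothetical and the per-iteration samples themselves are not part of this mail.

```c
#include <assert.h>
#include <stddef.h>

/* Harmonic mean of n positive samples: n / sum(1 / x_i). */
static double hmean(const double *x, size_t n)
{
	double inv_sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		assert(x[i] > 0.0);
		inv_sum += 1.0 / x[i];
	}
	return (double)n / inv_sum;
}

/* Percentage change of 'patched' relative to 'baseline'. */
static double pct_gain(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (patched - baseline) * 100.0 / baseline;
}
```

For the first dbench4 row, pct_gain(309.82, 327.01) comes out at roughly 5.55, which matches the figure reported in the table.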
iozone for small files and varying block sizes. Format is IOOperation-filesize-recordsize
4.5.0-rc2 4.5.0-rc2
vanilla sample-v1r1
Hmean SeqWrite-200704-1 740152.30 ( 0.00%) 748432.35 ( 1.12%)
Hmean SeqWrite-200704-2 1052506.25 ( 0.00%) 1169065.30 ( 11.07%)
Hmean SeqWrite-200704-4 1450716.41 ( 0.00%) 1725335.69 ( 18.93%)
Hmean SeqWrite-200704-8 1523917.72 ( 0.00%) 1881610.25 ( 23.47%)
Hmean SeqWrite-200704-16 1572519.89 ( 0.00%) 1750277.07 ( 11.30%)
Hmean SeqWrite-200704-32 1611078.69 ( 0.00%) 1923796.62 ( 19.41%)
Hmean SeqWrite-200704-64 1656755.37 ( 0.00%) 1892766.99 ( 14.25%)
Hmean SeqWrite-200704-128 1641739.24 ( 0.00%) 1952081.27 ( 18.90%)
Hmean SeqWrite-200704-256 1660046.05 ( 0.00%) 1931237.50 ( 16.34%)
Hmean SeqWrite-200704-512 1634394.86 ( 0.00%) 1860369.95 ( 13.83%)
Hmean SeqWrite-200704-1024 1629526.38 ( 0.00%) 1810320.92 ( 11.09%)
Hmean SeqWrite-401408-1 828943.43 ( 0.00%) 876152.50 ( 5.70%)
Hmean SeqWrite-401408-2 1231519.20 ( 0.00%) 1368986.18 ( 11.16%)
Hmean SeqWrite-401408-4 1724109.56 ( 0.00%) 1838265.22 ( 6.62%)
Hmean SeqWrite-401408-8 1806615.84 ( 0.00%) 1969611.74 ( 9.02%)
Hmean SeqWrite-401408-16 1859268.96 ( 0.00%) 2003005.51 ( 7.73%)
Hmean SeqWrite-401408-32 1887759.67 ( 0.00%) 2415913.37 ( 27.98%)
Hmean SeqWrite-401408-64 1941717.11 ( 0.00%) 1971929.24 ( 1.56%)
Hmean SeqWrite-401408-128 1919515.58 ( 0.00%) 2127647.53 ( 10.84%)
Hmean SeqWrite-401408-256 1908766.57 ( 0.00%) 2067473.02 ( 8.31%)
Hmean SeqWrite-401408-512 1908999.37 ( 0.00%) 2195587.56 ( 15.01%)
Hmean SeqWrite-401408-1024 1912232.98 ( 0.00%) 2150068.56 ( 12.44%)
Hmean Rewrite-200704-1 1151067.57 ( 0.00%) 1155309.64 ( 0.37%)
Hmean Rewrite-200704-2 1786824.53 ( 0.00%) 1837093.18 ( 2.81%)
Hmean Rewrite-200704-4 2539338.19 ( 0.00%) 2649019.78 ( 4.32%)
Hmean Rewrite-200704-8 2687411.53 ( 0.00%) 2785202.26 ( 3.64%)
Hmean Rewrite-200704-16 2709445.97 ( 0.00%) 2805580.76 ( 3.55%)
Hmean Rewrite-200704-32 2735718.43 ( 0.00%) 2807532.87 ( 2.63%)
Hmean Rewrite-200704-64 2782754.97 ( 0.00%) 2952024.38 ( 6.08%)
Hmean Rewrite-200704-128 2791889.73 ( 0.00%) 2805048.02 ( 0.47%)
Hmean Rewrite-200704-256 2711596.34 ( 0.00%) 2828896.54 ( 4.33%)
Hmean Rewrite-200704-512 2665066.25 ( 0.00%) 2868058.05 ( 7.62%)
Hmean Rewrite-200704-1024 2675375.89 ( 0.00%) 2685664.19 ( 0.38%)
Hmean Rewrite-401408-1 1350713.78 ( 0.00%) 1358762.21 ( 0.60%)
Hmean Rewrite-401408-2 2079420.61 ( 0.00%) 2097399.02 ( 0.86%)
Hmean Rewrite-401408-4 2889535.90 ( 0.00%) 2912795.03 ( 0.80%)
Hmean Rewrite-401408-8 3068155.32 ( 0.00%) 3090915.84 ( 0.74%)
Hmean Rewrite-401408-16 3103789.43 ( 0.00%) 3162486.65 ( 1.89%)
Hmean Rewrite-401408-32 3112447.72 ( 0.00%) 3243067.63 ( 4.20%)
Hmean Rewrite-401408-64 3232651.39 ( 0.00%) 3227701.02 ( -0.15%)
Hmean Rewrite-401408-128 3149556.47 ( 0.00%) 3165694.24 ( 0.51%)
Hmean Rewrite-401408-256 3093348.93 ( 0.00%) 3104229.97 ( 0.35%)
Hmean Rewrite-401408-512 3026305.45 ( 0.00%) 3121151.02 ( 3.13%)
Hmean Rewrite-401408-1024 3005431.18 ( 0.00%) 3046910.32 ( 1.38%)
4.5.0-rc2 4.5.0-rc2
vanilla sample-v1r1
Mean %Busy 3.10 3.09
Mean CPU%c1 6.16 5.55
Mean CPU%c3 0.08 0.10
Mean CPU%c6 90.65 91.26
Mean CPU%c7 0.00 0.00
Mean CorWatt 1.71 1.74
Mean PkgWatt 3.88 3.91
Max %Busy 16.51 16.22
Max CPU%c1 17.03 21.99
Max CPU%c3 2.57 2.15
Max CPU%c6 96.39 96.31
Max CPU%c7 0.00 0.00
Max CorWatt 5.40 5.42
Max PkgWatt 7.53 7.56
The other operations are omitted as they showed no performance difference.
For sequential writes and rewrites there is a massive gain in throughput
for very small files. The increase in power consumption is negligible.
It is known that the increase is not universal. Machines with more cores see
a much smaller benefit, so the rate of CPU migrations is a factor.
netperf-UDP_STREAM
4.5.0-rc2 4.5.0-rc2
vanilla sample-v1r1
Hmean send-64 233.96 ( 0.00%) 244.76 ( 4.61%)
Hmean send-128 466.74 ( 0.00%) 479.16 ( 2.66%)
Hmean send-256 929.12 ( 0.00%) 964.00 ( 3.75%)
Hmean send-1024 3631.36 ( 0.00%) 3781.89 ( 4.15%)
Hmean send-2048 6984.60 ( 0.00%) 7169.60 ( 2.65%)
Hmean send-3312 10792.94 ( 0.00%) 11103.42 ( 2.88%)
Hmean send-4096 12895.57 ( 0.00%) 13112.58 ( 1.68%)
Hmean send-8192 23057.34 ( 0.00%) 23443.80 ( 1.68%)
Hmean send-16384 37871.11 ( 0.00%) 38292.60 ( 1.11%)
Hmean recv-64 233.89 ( 0.00%) 244.71 ( 4.63%)
Hmean recv-128 466.63 ( 0.00%) 479.09 ( 2.67%)
Hmean recv-256 928.88 ( 0.00%) 963.74 ( 3.75%)
Hmean recv-1024 3630.54 ( 0.00%) 3780.96 ( 4.14%)
Hmean recv-2048 6983.20 ( 0.00%) 7167.55 ( 2.64%)
Hmean recv-3312 10790.92 ( 0.00%) 11100.63 ( 2.87%)
Hmean recv-4096 12891.37 ( 0.00%) 13110.35 ( 1.70%)
Hmean recv-8192 23054.79 ( 0.00%) 23438.27 ( 1.66%)
Hmean recv-16384 37866.79 ( 0.00%) 38283.73 ( 1.10%)
4.5.0-rc2 4.5.0-rc2
vanilla sample-v1r1
Mean %Busy 37.30 37.10
Mean CPU%c1 37.52 37.30
Mean CPU%c3 0.10 0.10
Mean CPU%c6 25.08 25.49
Mean CPU%c7 0.00 0.00
Mean CorWatt 11.20 11.18
Mean PkgWatt 13.30 13.28
Max %Busy 50.64 51.73
Max CPU%c1 49.80 50.53
Max CPU%c3 9.14 8.95
Max CPU%c6 62.46 63.48
Max CPU%c7 0.00 0.00
Max CorWatt 16.46 16.44
Max PkgWatt 18.58 18.55
In this test, the client/server are pinned to cores so the scheduler
decisions are not a factor. There is still a mild performance boost
with no impact on power consumption.
cyclictest-pinned
                        4.5.0-rc2            4.5.0-rc2
                          vanilla          sample-v1r1
Amean    LatAvg       3.00 (  0.00%)      2.64 ( 11.94%)
Amean    LatMax     156.93 (  0.00%)    106.89 ( 31.89%)

              4.5.0-rc2    4.5.0-rc2
                vanilla  sample-v1r1
Mean %Busy        99.74        99.73
Mean CPU%c1        0.02         0.02
Mean CPU%c3        0.00         0.01
Mean CPU%c6        0.23         0.24
Mean CPU%c7        0.00         0.00
Mean CorWatt       5.06         5.92
Mean PkgWatt       7.12         7.99
Max  %Busy       100.00       100.00
Max  CPU%c1        3.88         3.50
Max  CPU%c3        0.71         0.99
Max  CPU%c6       41.79        43.17
Max  CPU%c7        0.00         0.00
Max  CorWatt       6.80         8.66
Max  PkgWatt       8.85        10.71
This test measures how quickly a task wakes up after a timeout. The test
could be defeated by selecting a different timeout value that is outside
the new hold-off value. Furthermore, a workload that is very sensitive to
wakeup latencies should use the performance governor. Nevertheless it's
interesting to note the impact of increasing the hold-off value. There is
an increase in power usage because the CPU remains active during sleep times.
In all cases, there are some CPU migrations because wakers pull wakees to
nearby CPUs. It could be argued that the workload should be pinned but this
puts a burden on the user that may not even be possible in all cases. The
scheduler could try keeping processes on the same CPUs but that would impact
cache hotness and cause a different class of issues. It is inevitable that
there will be some conflict between power management and scheduling decisions
but there are some gains from delaying idling slightly without a severe impact
on power consumption.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
drivers/cpufreq/intel_pstate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index cd83d477e32d..54250084174a 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 	sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
 	duration_us = ktime_us_delta(cpu->sample.time,
 				     cpu->last_sample_time);
-	if (duration_us > sample_time * 3) {
+	if (duration_us > sample_time * 12) {
 		sample_ratio = div_fp(int_tofp(sample_time),
 				      int_tofp(duration_us));
 		core_busy = mul_fp(core_busy, sample_ratio);
--
2.6.4
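
For readers unfamiliar with the fixed-point helpers used in the hunk above, the following is a user-space sketch of how they behave. It assumes 8 fractional bits, which is how I recall the driver defining FRAC_BITS; the helper bodies are modelled on that recollection rather than copied from the tree, and scaled_busy() is a hypothetical wrapper added only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8	/* assumption: 8 fractional bits */
#define int_tofp(X) ((int64_t)(X) << FRAC_BITS)
#define fp_toint(X) ((X) >> FRAC_BITS)

/* Multiply two fixed-point values, dropping the extra fractional bits. */
static int32_t mul_fp(int32_t x, int32_t y)
{
	return (int32_t)(((int64_t)x * (int64_t)y) >> FRAC_BITS);
}

/* Divide two fixed-point values, keeping FRAC_BITS of precision. */
static int32_t div_fp(int64_t x, int64_t y)
{
	assert(y != 0);
	return (int32_t)((x << FRAC_BITS) / y);
}

/* Hypothetical wrapper: busy * (sample_time / duration), as an integer. */
static int32_t scaled_busy(int32_t busy, int64_t sample_time_us,
			   int64_t duration_us)
{
	int32_t ratio = div_fp(int_tofp(sample_time_us),
			       int_tofp(duration_us));

	return (int32_t)fp_toint(mul_fp((int32_t)int_tofp(busy), ratio));
}
```

Because each step truncates, scaled_busy(100, 10000, 50000) evaluates to 19 rather than the exact 20; the ratio 0.2 is stored as 51/256.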
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
2016-02-18 11:11 [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled Mel Gorman
@ 2016-02-18 19:43 ` Rafael J. Wysocki
2016-02-18 21:09 ` Doug Smythies
2016-02-18 23:29 ` Pandruvada, Srinivas
2016-02-19 11:11 ` Stephane Gasparini
1 sibling, 2 replies; 11+ messages in thread
From: Rafael J. Wysocki @ 2016-02-18 19:43 UTC (permalink / raw)
To: Mel Gorman
Cc: Rafael Wysocki, Dirk Brandewie, Ingo Molnar, Peter Zijlstra,
Matt Fleming, Mike Galbraith, Linux-PM, LKML
Hi Mel,
On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman
<mgorman@techsingularity.net> wrote:
[cut]
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> drivers/cpufreq/intel_pstate.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index cd83d477e32d..54250084174a 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
> sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
> duration_us = ktime_us_delta(cpu->sample.time,
> cpu->last_sample_time);
> - if (duration_us > sample_time * 3) {
> + if (duration_us > sample_time * 12) {
> sample_ratio = div_fp(int_tofp(sample_time),
> int_tofp(duration_us));
> core_busy = mul_fp(core_busy, sample_ratio);
> --
I've been considering making a change like this, but I wasn't quite
sure how much greater the multiplier should be, so I've queued this
one up for 4.6.
That said please note that we're planning to make one significant
change to intel_pstate in the 4.6 cycle that's very likely to affect
your results.
It is currently present in linux-next (commit 402c43ed2d74 "cpufreq:
intel_pstate: Replace timers with utilization update callbacks" in the
linux-next branch of the linux-pm.git tree, that depends on commit
fe7034338ba0 "cpufreq: Add mechanism for registering utilization
update callbacks" in the same branch). Also you can just pull from
the pm-cpufreq-test branch in linux-pm.git, but that contains much
more material.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
2016-02-18 19:43 ` Rafael J. Wysocki
@ 2016-02-18 21:09 ` Doug Smythies
2016-02-19 10:49 ` Mel Gorman
2016-02-23 14:04 ` Mel Gorman
2016-02-18 23:29 ` Pandruvada, Srinivas
1 sibling, 2 replies; 11+ messages in thread
From: Doug Smythies @ 2016-02-18 21:09 UTC (permalink / raw)
To: 'Rafael J. Wysocki', 'Mel Gorman'
Cc: 'Rafael Wysocki', 'Ingo Molnar',
'Peter Zijlstra', 'Matt Fleming',
'Mike Galbraith', 'Linux-PM', 'LKML',
'Srinivas Pandruvada'
On 2016.02.18 Rafael J. Wysocki wrote:
On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman wrote:
>>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> ---
>> drivers/cpufreq/intel_pstate.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> index cd83d477e32d..54250084174a 100644
>> --- a/drivers/cpufreq/intel_pstate.c
>> +++ b/drivers/cpufreq/intel_pstate.c
>> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
>> sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
>> duration_us = ktime_us_delta(cpu->sample.time,
>> cpu->last_sample_time);
>> - if (duration_us > sample_time * 3) {
>> + if (duration_us > sample_time * 12) {
>> sample_ratio = div_fp(int_tofp(sample_time),
>> int_tofp(duration_us));
>> core_busy = mul_fp(core_busy, sample_ratio);
>> --
The immediately preceding comment needs to be changed also.
Note that with duration related scaling only coming in at such a high
ratio it might be worth saving the divide and just setting it to 0.
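
To illustrate the suggestion: once the hold-off is 12X, the ratio sample_time / duration is always below 1/12, so the scaled value is small in any case. A sketch of the two variants, in plain integer arithmetic and with hypothetical function names (neither is the driver code):

```c
#include <assert.h>
#include <stdint.h>

#define HOLDOFF_MULT 12

/* Variant in the patch: scale by sample_time / duration past the hold-off. */
static int64_t busy_scaled(int64_t busy, int64_t sample_time_us,
			   int64_t duration_us)
{
	assert(sample_time_us > 0 && duration_us > 0);
	if (duration_us > sample_time_us * HOLDOFF_MULT)
		return busy * sample_time_us / duration_us;
	return busy;
}

/* Suggested variant: skip the divide and treat the CPU as idle. */
static int64_t busy_zeroed(int64_t busy, int64_t sample_time_us,
			   int64_t duration_us)
{
	assert(sample_time_us > 0 && duration_us > 0);
	if (duration_us > sample_time_us * HOLDOFF_MULT)
		return 0;
	return busy;
}
```

For a busy value of 100 and a 130 ms gap with a 10 ms sample time, the first variant yields 7 and the second 0; below the hold-off both return the input unchanged.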
> I've been considering making a change like this, but I wasn't quite
> sure how much greater the multiplier should be, so I've queued this
> one up for 4.6.
> That said please note that we're planning to make one significant
> change to intel_pstate in the 4.6 cycle that's very likely to affect
> your results.
Rafael:
I started to test Mel's change added to your 3 patch set version 10.
I only have one data point so far. I selected my test from one of Mel's
better results (although there is no reason to expect my computer to give
its best results under the same operating conditions):
Stock kernel 4.5-rc4 just for reference:
Linux s15 4.5.0-040500rc4-generic #201602141731 SMP Sun Feb 14 22:33:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
Output is in Kbytes/sec
KB reclen write rewrite
401408 32 1895293 3035291
_________________________________________________________________
Kernel 4.5-rc4 + rjw 3 patch set version 10 (nominal 3X duration holdoff)
Linux s15 4.5.0-rc4-rjwv10 #167 SMP Mon Feb 15 14:23:10 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
Output is in Kbytes/sec
KB reclen write rewrite
401408 32 2010558 3086354
401408 32 1945126 3127472
401408 32 1944807 3110387
401408 32 1948620 3110002
AVE 1962278 3108554
Performance mode, for comparison:
KB reclen write rewrite
401408 32 2870111 5023311
401408 32 2869642 5149213
401408 32 2792053 5100280
401408 32 2863887 5149229
_________________________________________________________________
Kernel 4.5-rc4 + rjw 3 patch set version 10 + mg 12X duration hold-off
Linux s15 4.5.0-rc4-rjwv10-12 #169 SMP Thu Feb 18 08:15:33 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
Output is in Kbytes/sec
KB reclen write rewrite
401408 32 1989670 3100580
401408 32 2062291 3112463
401408 32 2107637 3233567
401408 32 2111772 3340610
AVE 2067843 3196805
Gain Versus 3X 5.4% 2.8%
_________________________________________________________________
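
The AVE and gain rows above can be re-derived from the individual runs with a sketch like the following. The arrays hold the iozone figures listed in this mail; mean() and gain_pct() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* iozone results in Kbytes/sec, copied from the runs listed above. */
static const double write_3x[]    = { 2010558, 1945126, 1944807, 1948620 };
static const double write_12x[]   = { 1989670, 2062291, 2107637, 2111772 };
static const double rewrite_3x[]  = { 3086354, 3127472, 3110387, 3110002 };
static const double rewrite_12x[] = { 3100580, 3112463, 3233567, 3340610 };

/* Arithmetic mean of n samples. */
static double mean(const double *x, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += x[i];
	return sum / (double)n;
}

/* Percentage gain of 'test' over 'base'. */
static double gain_pct(double base, double test)
{
	assert(base > 0.0);
	return (test - base) * 100.0 / base;
}
```

gain_pct(mean(write_3x, 4), mean(write_12x, 4)) works out to about 5.4 and the rewrite equivalent to about 2.8, in line with the figures quoted.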
Mel: Did you observe any downside conditions?
For example, here are some trace samples taken from my computer:
Duration kick in = 3X
Core busy = 101
Current pstate = 16
Load = 2.2%
Duration = 43.815 mSec
Scaled busy = 48
Next Pstate = 16 (= minimum for my computer)
If duration kick in = 12X then
Scaled busy = 214
Next pstate = 38 (= Max turbo for my computer)
Note: I do NOT have an operational example where it matters in terms
of energy use or whatever. I am just suggesting that we look.
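
The two scaled busy values in the trace above are consistent with the following integer sketch. It assumes a 10 ms sample time (so the hold-off is 30 ms at 3X and 120 ms at 12X) and takes 214 as the value before any duration scaling; trace_scaled_busy() is a hypothetical name and the arithmetic is plain integer rather than the driver's fixed-point.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: apply duration scaling to 'busy' only when the
 * sample gap exceeds 'mult' sample periods.
 */
static int64_t trace_scaled_busy(int64_t busy, int64_t duration_us,
				 int64_t sample_time_us, int mult)
{
	assert(sample_time_us > 0 && duration_us > 0 && mult > 0);
	if (duration_us > sample_time_us * mult)
		return busy * sample_time_us / duration_us;
	return busy;
}
```

With a duration of 43815 us, the 3X rule applies and gives 214 * 10000 / 43815 = 48, whereas with 12X the gap is inside the hold-off and the value stays at 214.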
... Doug
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
2016-02-18 19:43 ` Rafael J. Wysocki
2016-02-18 21:09 ` Doug Smythies
@ 2016-02-18 23:29 ` Pandruvada, Srinivas
2016-02-18 23:33 ` Rafael J. Wysocki
1 sibling, 1 reply; 11+ messages in thread
From: Pandruvada, Srinivas @ 2016-02-18 23:29 UTC (permalink / raw)
To: mgorman, rafael
Cc: matt, mingo, peterz, Brandewie, Dirk J, linux-kernel, linux-pm,
rjw, umgwanakikbuti
On Thu, 2016-02-18 at 20:43 +0100, Rafael J. Wysocki wrote:
> Hi Mel,
>
> On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman
> <mgorman@techsingularity.net> wrote:
>
> [cut]
>
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> > drivers/cpufreq/intel_pstate.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index cd83d477e32d..54250084174a 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -999,7 +999,7 @@ static inline int32_t
> > get_target_pstate_use_performance(struct cpudata *cpu)
> > sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
> > duration_us = ktime_us_delta(cpu->sample.time,
> > cpu->last_sample_time);
> > - if (duration_us > sample_time * 3) {
> > + if (duration_us > sample_time * 12) {
> > sample_ratio = div_fp(int_tofp(sample_time),
> > int_tofp(duration_us));
> > core_busy = mul_fp(core_busy, sample_ratio);
> > --
>
> I've been considering making a change like this, but I wasn't quite
> sure how much greater the multiplier should be, so I've queued this
> one up for 4.6.
>
We need to test power impact on different server workloads. So please
hold on.
We have server folks complaining that we already consume too much
power.
Thanks,
Srinivas
> That said please note that we're planning to make one significant
> change to intel_pstate in the 4.6 cycle that's very likely to affect
> your results.
>
> It is currently present in linux-next (commit 402c43ed2d74 "cpufreq:
> intel_pstate: Replace timers with utilization update callbacks" in
> the
> linux-next branch of the linux-pm.git tree, that depends on commit
> fe7034338ba0 "cpufreq: Add mechanism for registering utilization
> update callbacks" in the same branch). Also you can just pull from
> the pm-cpufreq-test branch in linux-pm.git, but that contains much
> more material.
>
> Thanks,
> Rafael
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
2016-02-18 23:29 ` Pandruvada, Srinivas
@ 2016-02-18 23:33 ` Rafael J. Wysocki
0 siblings, 0 replies; 11+ messages in thread
From: Rafael J. Wysocki @ 2016-02-18 23:33 UTC (permalink / raw)
To: Pandruvada, Srinivas
Cc: mgorman, rafael, matt, mingo, peterz, Brandewie, Dirk J,
linux-kernel, linux-pm, rjw, umgwanakikbuti
On Fri, Feb 19, 2016 at 12:29 AM, Pandruvada, Srinivas
<srinivas.pandruvada@intel.com> wrote:
> On Thu, 2016-02-18 at 20:43 +0100, Rafael J. Wysocki wrote:
>> Hi Mel,
>>
>> On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman
>> <mgorman@techsingularity.net> wrote:
>>
>> [cut]
>>
>> >
>> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> > ---
>> > drivers/cpufreq/intel_pstate.c | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/cpufreq/intel_pstate.c
>> > b/drivers/cpufreq/intel_pstate.c
>> > index cd83d477e32d..54250084174a 100644
>> > --- a/drivers/cpufreq/intel_pstate.c
>> > +++ b/drivers/cpufreq/intel_pstate.c
>> > @@ -999,7 +999,7 @@ static inline int32_t
>> > get_target_pstate_use_performance(struct cpudata *cpu)
>> > sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
>> > duration_us = ktime_us_delta(cpu->sample.time,
>> > cpu->last_sample_time);
>> > - if (duration_us > sample_time * 3) {
>> > + if (duration_us > sample_time * 12) {
>> > sample_ratio = div_fp(int_tofp(sample_time),
>> > int_tofp(duration_us));
>> > core_busy = mul_fp(core_busy, sample_ratio);
>> > --
>>
>> I've been considering making a change like this, but I wasn't quite
>> sure how much greater the multiplier should be, so I've queued this
>> one up for 4.6.
>>
> We need to test power impact on different server workloads. So please
> hold on.
> We have server folks complaining that we already consume too much
> power.
I'll drop the commit if it turns out to cause too much energy to be consumed.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
2016-02-18 21:09 ` Doug Smythies
@ 2016-02-19 10:49 ` Mel Gorman
2016-02-23 14:04 ` Mel Gorman
1 sibling, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2016-02-19 10:49 UTC (permalink / raw)
To: Doug Smythies
Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
'Ingo Molnar', 'Peter Zijlstra',
'Matt Fleming', 'Mike Galbraith',
'Linux-PM', 'LKML', 'Srinivas Pandruvada'
On Thu, Feb 18, 2016 at 01:09:26PM -0800, Doug Smythies wrote:
> On 2016.02.18 Rafael J. Wysocki wrote:
> On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman wrote:
> >>
> >> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> >> ---
> >> drivers/cpufreq/intel_pstate.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> >> index cd83d477e32d..54250084174a 100644
> >> --- a/drivers/cpufreq/intel_pstate.c
> >> +++ b/drivers/cpufreq/intel_pstate.c
> >> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
> >> sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
> >> duration_us = ktime_us_delta(cpu->sample.time,
> >> cpu->last_sample_time);
> >> - if (duration_us > sample_time * 3) {
> >> + if (duration_us > sample_time * 12) {
> >> sample_ratio = div_fp(int_tofp(sample_time),
> >> int_tofp(duration_us));
> >> core_busy = mul_fp(core_busy, sample_ratio);
> >> --
>
> The immediately preceding comment needs to be changed also.
Yes, it does. Thanks.
> Note that with duration related scaling only coming in at such a high
> ratio it might be worth saving the divide and just setting it to 0.
>
That sounds reasonable. I've queued up a test based on this as well as
tests with the linux-next branch from linux-pm to see what falls out.
> > I've been considering making a change like this, but I wasn't quite
> > sure how much greater the multiplier should be, so I've queued this
> > one up for 4.6.
>
> > That said please note that we're planning to make one significant
> > change to intel_pstate in the 4.6 cycle that's very likely to affect
> > your results.
>
> Rafael:
>
> I started to test Mel's change added to your 3 patch set version 10.
>
> I only have one data point so far, I selected the test I did from one of Mel's
> better results (although there is no reason to expect my computer to have
> best results for the same operating conditions):
>
It's a reasonable expectation.
> Stock kernel 4.5-rc4 just for reference:
> Linux s15 4.5.0-040500rc4-generic #201602141731 SMP Sun Feb 14 22:33:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
> Output is in Kbytes/sec
>
> KB reclen write rewrite
> 401408 32 1895293 3035291
> _________________________________________________________________
>
> Kernel 4.5-rc4 + rjw 3 patch set version 10 (nominal 3X duration holdoff)
> Linux s15 4.5.0-rc4-rjwv10 #167 SMP Mon Feb 15 14:23:10 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
> Output is in Kbytes/sec
>
> KB reclen write rewrite
> 401408 32 2010558 3086354
> 401408 32 1945126 3127472
> 401408 32 1944807 3110387
> 401408 32 1948620 3110002
> AVE 1962278 3108554
>
> Performance mode, for comparison:
>
> KB reclen write rewrite
> 401408 32 2870111 5023311
> 401408 32 2869642 5149213
> 401408 32 2792053 5100280
> 401408 32 2863887 5149229
> _________________________________________________________________
>
> Kernel 4.5-rc4 + rjw 3 patch set version 10 + mg 12X duration hold-off
> Linux s15 4.5.0-rc4-rjwv10-12 #169 SMP Thu Feb 18 08:15:33 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
> Output is in Kbytes/sec
>
> KB reclen write rewrite
> 401408 32 1989670 3100580
> 401408 32 2062291 3112463
> 401408 32 2107637 3233567
> 401408 32 2111772 3340610
> AVE 2067843 3196805
> Gain Versus 3X 5.4% 2.8%
> _________________________________________________________________
>
> Mel: Did you observe any downside conditions?
>
Not so far but my expectation is that any downside would be power consumption
related. At worst, I expect the patch to have little or no performance
impact in cases where there are a lot of cores, a lot of migration and the
CPU core is idle longer than the new hold-off period. For power-consumption,
I'm relying entirely on the output of turbostat to tell me if there are
problems which may or may not be sufficient.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
2016-02-18 11:11 [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled Mel Gorman
2016-02-18 19:43 ` Rafael J. Wysocki
@ 2016-02-19 11:11 ` Stephane Gasparini
2016-02-19 16:38 ` Doug Smythies
1 sibling, 1 reply; 11+ messages in thread
From: Stephane Gasparini @ 2016-02-19 11:11 UTC (permalink / raw)
To: Mel Gorman
Cc: Rafael Wysocki, Dirk Brandewie, Ingo Molnar, Peter Zijlstra,
Matt Fleming, Mike Galbraith, Linux-PM, LKML
[-- Attachment #1: Type: text/plain, Size: 14996 bytes --]
The issue you are reporting looks like one we improved on Android by using
the average P-state instead of the last requested P-state.
We know that this improves ffmpeg encoding performance when using the
load algorithm.
See the attached patch.
This patch is only applied to get_target_pstate_use_cpu_load; however, you can
give it a try on get_target_pstate_use_performance.
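
As a stand-alone illustration of the average P-state calculation in the attached patch: APERF advances at the actual running frequency and MPERF at a fixed reference frequency while the core is not idle, so their ratio over a sample window scales the maximum non-turbo P-state to the average one actually delivered. The sketch below uses a hypothetical avg_pstate() with plain arguments in place of struct cpudata.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space version of the patch's get_avg_pstate():
 * max_pstate * delta(APERF) / delta(MPERF) over one sample window.
 */
static int32_t avg_pstate(int32_t max_pstate, uint64_t aperf_delta,
			  uint64_t mperf_delta)
{
	assert(max_pstate >= 0);
	assert(mperf_delta != 0);
	return (int32_t)(((uint64_t)max_pstate * aperf_delta) / mperf_delta);
}
```

For example, with a maximum P-state of 24 and APERF advancing half as fast as MPERF, avg_pstate(24, 500, 1000) returns 12.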
IPLoad+Avg-Pstate vs IP Load:

Benchmark                  ∆Perf    ∆Power
SmartBench-Gaming          -0.1%    -10.4%
SmartBench-Productivity    -0.8%    -10.4%
CandyCrush                   n/a    -17.4%
AngryBirds                   n/a     -5.9%
videoPlayback                n/a    -13.9%
audioPlayback                n/a     -4.9%
IcyRocks-0-0                0.0%     -4.0%
IcyRocks-20-50              0.0%    -38.4%
IcyRocks-40-100             0.1%     -2.8%
IcyRocks-60-150             1.4%     -0.6%
IcyRocks-80-200             2.9%      0.7%
IcyRocks-100-250            1.1%      0.4%
iozone RR                  -2.7%     -4.2%
iozone RW                  -8.8%     -4.2%
Drystone                   -0.2%     -0.8%
Coremark                    0.5%      0.2%
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
---
drivers/cpufreq/intel_pstate.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index cd83d47..6ba8cab 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -908,8 +908,6 @@ static inline void intel_pstate_sample(struct cpudata *cpu)
cpu->sample.mperf -= cpu->prev_mperf;
cpu->sample.tsc -= cpu->prev_tsc;
- intel_pstate_calc_busy(cpu);
-
cpu->prev_aperf = aperf;
cpu->prev_mperf = mperf;
cpu->prev_tsc = tsc;
@@ -931,6 +929,12 @@ static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
mod_timer_pinned(&cpu->timer, jiffies + delay);
}
+static inline int32_t get_avg_pstate(struct cpudata *cpu)
+{
+ return div64_u64(cpu->pstate.max_pstate * cpu->sample.aperf,
+ cpu->sample.mperf);
+}
+
static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
{
struct sample *sample = &cpu->sample;
@@ -964,7 +968,7 @@ static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
cpu_load = div64_u64(int_tofp(100) * mperf, sample->tsc);
cpu->sample.busy_scaled = cpu_load;
- return cpu->pstate.current_pstate - pid_calc(&cpu->pid, cpu_load);
+ return get_avg_pstate(cpu) - pid_calc(&cpu->pid, cpu_load);
}
static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
@@ -973,6 +977,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
s64 duration_us;
u32 sample_time;
+ intel_pstate_calc_busy(cpu);
/*
* core_busy is the ratio of actual performance to max
* max_pstate is the max non turbo pstate available
--
Steph
> On Feb 18, 2016, at 12:11 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> (cc'ing pm and scheduler people as the problem could be blamed on either
> subsystem depending on your point of view)
>
> The PID relies on samples of equal time but this does not apply for
> deferrable timers when the CPU is idle. intel_pstate checks if the actual
> duration between samples is large and if so, the "busyness" of the CPU
> is scaled.
>
> This assumes the delay was a deferred timer but a workload may simply have
> been idle for a short time if it's context switching between a server and
> client or waiting very briefly on IO. It's compounded by the problem that
> server/clients migrate between CPUs due to wake-affine trying to maximise
> hot cache usage. In such cases, the cores are not considered busy and the
> frequency is dropped prematurely.
>
> This patch increases the hold-off value before the busyness is scaled. It
> was selected based simply on testing until the desired result was found.
> Tests were conducted with workloads that are either client/server based
> or short-lived IO.
>
> dbench4
> 4.5.0-rc2 4.5.0-rc2
> vanilla sample-v1r1
> Hmean mb/sec-1 309.82 ( 0.00%) 327.01 ( 5.55%)
> Hmean mb/sec-2 594.92 ( 0.00%) 613.02 ( 3.04%)
> Hmean mb/sec-4 669.17 ( 0.00%) 712.27 ( 6.44%)
> Hmean mb/sec-8 700.82 ( 0.00%) 724.04 ( 3.31%)
> Hmean mb/sec-64 425.38 ( 0.00%) 448.02 ( 5.32%)
>
> 4.5.0-rc2 4.5.0-rc2
> vanilla sample-v1r1
> Mean %Busy 27.28 26.81
> Mean CPU%c1 42.50 44.29
> Mean CPU%c3 7.16 7.14
> Mean CPU%c6 23.05 21.76
> Mean CPU%c7 0.00 0.00
> Mean CorWatt 4.60 5.08
> Mean PkgWatt 6.83 7.32
>
> There is a fairly sizable performance boost from the modification and while
> the percentage of time spent in C1 is increased, it is not by a substantial
> amount and the power usage increase is tiny.
>
> iozone for small files and varying block sizes. Format is IOOperation-filesize-recordsize
>
> 4.5.0-rc2 4.5.0-rc2
> vanilla sample-v1r1
> Hmean SeqWrite-200704-1 740152.30 ( 0.00%) 748432.35 ( 1.12%)
> Hmean SeqWrite-200704-2 1052506.25 ( 0.00%) 1169065.30 ( 11.07%)
> Hmean SeqWrite-200704-4 1450716.41 ( 0.00%) 1725335.69 ( 18.93%)
> Hmean SeqWrite-200704-8 1523917.72 ( 0.00%) 1881610.25 ( 23.47%)
> Hmean SeqWrite-200704-16 1572519.89 ( 0.00%) 1750277.07 ( 11.30%)
> Hmean SeqWrite-200704-32 1611078.69 ( 0.00%) 1923796.62 ( 19.41%)
> Hmean SeqWrite-200704-64 1656755.37 ( 0.00%) 1892766.99 ( 14.25%)
> Hmean SeqWrite-200704-128 1641739.24 ( 0.00%) 1952081.27 ( 18.90%)
> Hmean SeqWrite-200704-256 1660046.05 ( 0.00%) 1931237.50 ( 16.34%)
> Hmean SeqWrite-200704-512 1634394.86 ( 0.00%) 1860369.95 ( 13.83%)
> Hmean SeqWrite-200704-1024 1629526.38 ( 0.00%) 1810320.92 ( 11.09%)
> Hmean SeqWrite-401408-1 828943.43 ( 0.00%) 876152.50 ( 5.70%)
> Hmean SeqWrite-401408-2 1231519.20 ( 0.00%) 1368986.18 ( 11.16%)
> Hmean SeqWrite-401408-4 1724109.56 ( 0.00%) 1838265.22 ( 6.62%)
> Hmean SeqWrite-401408-8 1806615.84 ( 0.00%) 1969611.74 ( 9.02%)
> Hmean SeqWrite-401408-16 1859268.96 ( 0.00%) 2003005.51 ( 7.73%)
> Hmean SeqWrite-401408-32 1887759.67 ( 0.00%) 2415913.37 ( 27.98%)
> Hmean SeqWrite-401408-64 1941717.11 ( 0.00%) 1971929.24 ( 1.56%)
> Hmean SeqWrite-401408-128 1919515.58 ( 0.00%) 2127647.53 ( 10.84%)
> Hmean SeqWrite-401408-256 1908766.57 ( 0.00%) 2067473.02 ( 8.31%)
> Hmean SeqWrite-401408-512 1908999.37 ( 0.00%) 2195587.56 ( 15.01%)
> Hmean SeqWrite-401408-1024 1912232.98 ( 0.00%) 2150068.56 ( 12.44%)
> Hmean Rewrite-200704-1 1151067.57 ( 0.00%) 1155309.64 ( 0.37%)
> Hmean Rewrite-200704-2 1786824.53 ( 0.00%) 1837093.18 ( 2.81%)
> Hmean Rewrite-200704-4 2539338.19 ( 0.00%) 2649019.78 ( 4.32%)
> Hmean Rewrite-200704-8 2687411.53 ( 0.00%) 2785202.26 ( 3.64%)
> Hmean Rewrite-200704-16 2709445.97 ( 0.00%) 2805580.76 ( 3.55%)
> Hmean Rewrite-200704-32 2735718.43 ( 0.00%) 2807532.87 ( 2.63%)
> Hmean Rewrite-200704-64 2782754.97 ( 0.00%) 2952024.38 ( 6.08%)
> Hmean Rewrite-200704-128 2791889.73 ( 0.00%) 2805048.02 ( 0.47%)
> Hmean Rewrite-200704-256 2711596.34 ( 0.00%) 2828896.54 ( 4.33%)
> Hmean Rewrite-200704-512 2665066.25 ( 0.00%) 2868058.05 ( 7.62%)
> Hmean Rewrite-200704-1024 2675375.89 ( 0.00%) 2685664.19 ( 0.38%)
> Hmean Rewrite-401408-1 1350713.78 ( 0.00%) 1358762.21 ( 0.60%)
> Hmean Rewrite-401408-2 2079420.61 ( 0.00%) 2097399.02 ( 0.86%)
> Hmean Rewrite-401408-4 2889535.90 ( 0.00%) 2912795.03 ( 0.80%)
> Hmean Rewrite-401408-8 3068155.32 ( 0.00%) 3090915.84 ( 0.74%)
> Hmean Rewrite-401408-16 3103789.43 ( 0.00%) 3162486.65 ( 1.89%)
> Hmean Rewrite-401408-32 3112447.72 ( 0.00%) 3243067.63 ( 4.20%)
> Hmean Rewrite-401408-64 3232651.39 ( 0.00%) 3227701.02 ( -0.15%)
> Hmean Rewrite-401408-128 3149556.47 ( 0.00%) 3165694.24 ( 0.51%)
> Hmean Rewrite-401408-256 3093348.93 ( 0.00%) 3104229.97 ( 0.35%)
> Hmean Rewrite-401408-512 3026305.45 ( 0.00%) 3121151.02 ( 3.13%)
> Hmean Rewrite-401408-1024 3005431.18 ( 0.00%) 3046910.32 ( 1.38%)
>
> 4.5.0-rc2 4.5.0-rc2
> vanilla sample-v1r1
> Mean %Busy 3.10 3.09
> Mean CPU%c1 6.16 5.55
> Mean CPU%c3 0.08 0.10
> Mean CPU%c6 90.65 91.26
> Mean CPU%c7 0.00 0.00
> Mean CorWatt 1.71 1.74
> Mean PkgWatt 3.88 3.91
> Max %Busy 16.51 16.22
> Max CPU%c1 17.03 21.99
> Max CPU%c3 2.57 2.15
> Max CPU%c6 96.39 96.31
> Max CPU%c7 0.00 0.00
> Max CorWatt 5.40 5.42
> Max PkgWatt 7.53 7.56
>
> The other operations are omitted as they showed no performance difference.
> For sequential writes and rewrites there is a massive gain in throughput
> for very small files. The increase in power consumption is negligible.
> It is known that the increase is not universal. Larger core machines see
> a much smaller benefit so the rate of CPU migrations is a factor.
>
> netperf-UDP_STREAM
>
> 4.5.0-rc2 4.5.0-rc2
> vanilla sample-v1r1
> Hmean send-64 233.96 ( 0.00%) 244.76 ( 4.61%)
> Hmean send-128 466.74 ( 0.00%) 479.16 ( 2.66%)
> Hmean send-256 929.12 ( 0.00%) 964.00 ( 3.75%)
> Hmean send-1024 3631.36 ( 0.00%) 3781.89 ( 4.15%)
> Hmean send-2048 6984.60 ( 0.00%) 7169.60 ( 2.65%)
> Hmean send-3312 10792.94 ( 0.00%) 11103.42 ( 2.88%)
> Hmean send-4096 12895.57 ( 0.00%) 13112.58 ( 1.68%)
> Hmean send-8192 23057.34 ( 0.00%) 23443.80 ( 1.68%)
> Hmean send-16384 37871.11 ( 0.00%) 38292.60 ( 1.11%)
> Hmean recv-64 233.89 ( 0.00%) 244.71 ( 4.63%)
> Hmean recv-128 466.63 ( 0.00%) 479.09 ( 2.67%)
> Hmean recv-256 928.88 ( 0.00%) 963.74 ( 3.75%)
> Hmean recv-1024 3630.54 ( 0.00%) 3780.96 ( 4.14%)
> Hmean recv-2048 6983.20 ( 0.00%) 7167.55 ( 2.64%)
> Hmean recv-3312 10790.92 ( 0.00%) 11100.63 ( 2.87%)
> Hmean recv-4096 12891.37 ( 0.00%) 13110.35 ( 1.70%)
> Hmean recv-8192 23054.79 ( 0.00%) 23438.27 ( 1.66%)
> Hmean recv-16384 37866.79 ( 0.00%) 38283.73 ( 1.10%)
>
> 4.5.0-rc2 4.5.0-rc2
> vanilla sample-v1r1
> Mean %Busy 37.30 37.10
> Mean CPU%c1 37.52 37.30
> Mean CPU%c3 0.10 0.10
> Mean CPU%c6 25.08 25.49
> Mean CPU%c7 0.00 0.00
> Mean CorWatt 11.20 11.18
> Mean PkgWatt 13.30 13.28
> Max %Busy 50.64 51.73
> Max CPU%c1 49.80 50.53
> Max CPU%c3 9.14 8.95
> Max CPU%c6 62.46 63.48
> Max CPU%c7 0.00 0.00
> Max CorWatt 16.46 16.44
> Max PkgWatt 18.58 18.55
>
> In this test, the client/server are pinned to cores so the scheduler
> decisions are not a factor. There is still a mild performance boost
> with no impact on power consumption.
>
> cyclictest-pinned
> 4.5.0-rc2 4.5.0-rc2
> vanilla sample-v1r1
> Amean LatAvg 3.00 ( 0.00%) 2.64 ( 11.94%)
> Amean LatMax 156.93 ( 0.00%) 106.89 ( 31.89%)
>
> 4.5.0-rc2 4.5.0-rc2
> vanilla sample-v1r1
> Mean %Busy 99.74 99.73
> Mean CPU%c1 0.02 0.02
> Mean CPU%c3 0.00 0.01
> Mean CPU%c6 0.23 0.24
> Mean CPU%c7 0.00 0.00
> Mean CorWatt 5.06 5.92
> Mean PkgWatt 7.12 7.99
> Max %Busy 100.00 100.00
> Max CPU%c1 3.88 3.50
> Max CPU%c3 0.71 0.99
> Max CPU%c6 41.79 43.17
> Max CPU%c7 0.00 0.00
> Max CorWatt 6.80 8.66
> Max PkgWatt 8.85 10.71
>
> This test measures how quickly a task wakes up after a timeout. The test
> could be defeated by selecting a different timeout value that is outside
> the new hold-off value. Furthermore, a workload that is very sensitive to
> wakeup latencies should use the performance governor. Nevertheless it's
> interesting to note the impact of increasing the hold-off value. There is
> an increase in power usage because the CPU remains active during sleep times.
>
> In all cases, there are some CPU migrations because wakers pull wakees to
> nearby CPUs. It could be argued that the workload should be pinned but this
> puts a burden on the user that may not even be possible in all cases. The
> scheduler could try keeping processes on the same CPUs but that would impact
> cache hotness and cause a different class of issues. It is inevitable that
> there will be some conflict between power management and scheduling decisions
> but there are some gains from delaying idling slightly without a severe impact
> on power consumption.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> drivers/cpufreq/intel_pstate.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index cd83d477e32d..54250084174a 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
> sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
> duration_us = ktime_us_delta(cpu->sample.time,
> cpu->last_sample_time);
> - if (duration_us > sample_time * 3) {
> + if (duration_us > sample_time * 12) {
> sample_ratio = div_fp(int_tofp(sample_time),
> int_tofp(duration_us));
> core_busy = mul_fp(core_busy, sample_ratio);
> --
> 2.6.4
>
[-- Attachment #2: [linux-power-mgmt] [PATCH 1_3] cpufreq: intel_pstate: Use avg_pstate instead of current_pstate --]
[-- Type: application/octet-stream, Size: 6382 bytes --]
* RE: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
2016-02-19 11:11 ` Stephane Gasparini
@ 2016-02-19 16:38 ` Doug Smythies
2016-02-24 16:19 ` Stephane Gasparini
0 siblings, 1 reply; 11+ messages in thread
From: Doug Smythies @ 2016-02-19 16:38 UTC (permalink / raw)
To: 'Stephane Gasparini', 'Mel Gorman'
Cc: 'Rafael Wysocki', 'Ingo Molnar',
'Peter Zijlstra', 'Matt Fleming',
'Mike Galbraith', 'Linux-PM', 'LKML',
'Srinivas Pandruvada'
Hi Steph,
On 2016.02.19 03:12 Stephane Gasparini wrote:
>
> The issue you are reporting looks like one we improved on android by using
> the average pstate instead of using the last requested pstate
>
> We know that this is improving the ffmpeg encoding performance when using the
> load algorithm.
>
> see patch attached
>
> This patch is only applied on get_target_pstate_use_cpu_load however you can give
> it a try on get_target_pstate_use_performance
Yes, that type of patch works on the load based approach.
However, I do not think it works on the performance based approach. Why not?
Well, and if I understand correctly, follow the math and you end up with:
scaled_busy = 100%
scaled_busy = (aperf * 100% / mperf) * (max_pstate / ((aperf * max_pstate) / mperf))
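Spelled out, the cancellation looks like this. This is only a sketch, assuming
scaled_busy = core_busy * max_pstate / pstate with core_busy = aperf * 100% / mperf,
and the average pstate (aperf * max_pstate / mperf) substituted for the current pstate:

```latex
\[
\text{scaled\_busy}
  = \frac{\text{aperf}\cdot 100\%}{\text{mperf}}
    \cdot \frac{\text{max\_pstate}}{\dfrac{\text{aperf}\cdot\text{max\_pstate}}{\text{mperf}}}
  = \frac{\text{aperf}\cdot 100\%}{\text{mperf}}
    \cdot \frac{\text{mperf}}{\text{aperf}}
  = 100\%
\]
```

Under those assumptions aperf, mperf and max_pstate all cancel, so the PID input
would no longer depend on the measured load.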
... Doug
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
2016-02-18 21:09 ` Doug Smythies
2016-02-19 10:49 ` Mel Gorman
@ 2016-02-23 14:04 ` Mel Gorman
1 sibling, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2016-02-23 14:04 UTC (permalink / raw)
To: Doug Smythies
Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
'Ingo Molnar', 'Peter Zijlstra',
'Matt Fleming', 'Mike Galbraith',
'Linux-PM', 'LKML', 'Srinivas Pandruvada'
On Thu, Feb 18, 2016 at 01:09:26PM -0800, Doug Smythies wrote:
> >> +++ b/drivers/cpufreq/intel_pstate.c
> >> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
> >> sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC;
> >> duration_us = ktime_us_delta(cpu->sample.time,
> >> cpu->last_sample_time);
> >> - if (duration_us > sample_time * 3) {
> >> + if (duration_us > sample_time * 12) {
> >> sample_ratio = div_fp(int_tofp(sample_time),
> >> int_tofp(duration_us));
> >> core_busy = mul_fp(core_busy, sample_ratio);
> >> --
>
> The immediately preceding comment needs to be changed also.
> Note that with duration related scaling only coming in at such a high
> ratio it might be worth saving the divide and just setting it to 0.
>
I tried this and FWIW, the performance is generally comparable as is the
power usage as reported by turbostat. On occasion, depending on the
machine, the system CPU usage is noticeably lower.
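
For reference, a minimal stand-alone sketch of the two variants being compared.
It is not the driver code: the function names are hypothetical and plain integer
arithmetic stands in for the kernel's fixed-point helpers (div_fp/mul_fp).

```c
#include <stdint.h>

/*
 * Hypothetical model of the hold-off check in
 * get_target_pstate_use_performance(); names and types are illustrative.
 */

/* Current behaviour: scale busyness by sample_time / duration once the
 * gap between samples exceeds holdoff_mult sample periods. */
int64_t scale_busy(int64_t core_busy, int64_t sample_time_us,
                   int64_t duration_us, int64_t holdoff_mult)
{
    if (duration_us > sample_time_us * holdoff_mult)
        return core_busy * sample_time_us / duration_us;
    return core_busy;
}

/* Suggested variant: skip the divide and treat the CPU as idle. */
int64_t zero_busy(int64_t core_busy, int64_t sample_time_us,
                  int64_t duration_us, int64_t holdoff_mult)
{
    if (duration_us > sample_time_us * holdoff_mult)
        return 0;
    return core_busy;
}
```

With a multiplier of 12 the scaling factor is always below 1/12 when it applies,
so the scaled value is already under about 8% of core_busy, which is presumably
why zeroing it behaves much the same.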
--
Mel Gorman
SUSE Labs
* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
2016-02-19 16:38 ` Doug Smythies
@ 2016-02-24 16:19 ` Stephane Gasparini
2016-02-25 19:51 ` Doug Smythies
0 siblings, 1 reply; 11+ messages in thread
From: Stephane Gasparini @ 2016-02-24 16:19 UTC (permalink / raw)
To: Doug Smythies
Cc: Mel Gorman, Rafael Wysocki, Ingo Molnar, Peter Zijlstra,
Matt Fleming, Mike Galbraith, Linux-PM, LKML,
Srinivas Pandruvada
Hi Doug
> On Feb 19, 2016, at 5:38 PM, Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Steph,
>
> On 2016.02.19 03:12 Stephane Gasparini wrote:
>>
>> The issue you are reporting looks like one we improved on android by using
>> the average pstate instead of using the last requested pstate
>>
>> We know that this is improving the ffmpeg encoding performance when using the
>> load algorithm.
>>
>> see patch attached
>>
>> This patch is only applied on get_target_pstate_use_cpu_load however you can give
>> it a try on get_target_pstate_use_performance
>
> Yes, that type of patch works on the load based approach.
I’m not talking about using the average p-state in the scaled_busy computation.
I’m talking about adding the output of the PID (the number of p-states to add or
subtract) to the average p-state rather than adding it to the current p-state.
The current p-state does not reflect reality in some situations, as it can be
imposed by a "linked CPU". This is the case when a thread migrates to a
"linked CPU" that was not loaded: its current p-state will be low, while its
average p-state will reflect the activity of the "linked CPU".
I will not claim this is a perfect solution, but combined with the topology
awareness of the scheduler it helps to make better decisions.
> However, I do not think it works on the performance based approach. Why not?
> Well, and if I understand correctly, follow the math and you end up with:
>
> scaled_busy = 100%
>
> scaled_busy = (aperf * 100% / mperf) * (max_pstate / ((aperf * max_pstate) / mperf))
>
> ... Doug
>
>
—
Steph
* RE: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
2016-02-24 16:19 ` Stephane Gasparini
@ 2016-02-25 19:51 ` Doug Smythies
0 siblings, 0 replies; 11+ messages in thread
From: Doug Smythies @ 2016-02-25 19:51 UTC (permalink / raw)
To: 'Stephane Gasparini'
Cc: 'Mel Gorman', 'Rafael Wysocki',
'Ingo Molnar', 'Peter Zijlstra',
'Matt Fleming', 'Mike Galbraith',
'Linux-PM', 'LKML', 'Srinivas Pandruvada'
Hi Steph,
On 2016.02.24 08:20 Stephane Gasparini wrote:
>> On Feb 19, 2016, at 5:38 PM, Doug Smythies <dsmythies@telus.net> wrote:
>>> On 2016.02.19 03:12 Stephane Gasparini wrote:
>>>
>>> The issue you are reporting looks like one we improved on android by using
>>> the average pstate instead of using the last requested pstate
>>>
>>> We know that this is improving the ffmpeg encoding performance when using the
>>> load algorithm.
>>>
>>> see patch attached
>>>
>>> This patch is only applied on get_target_pstate_use_cpu_load however you can give
>>> it a try on get_target_pstate_use_performance
>>
>> Yes, that type of patch works on the load based approach.
>
> I’m not talking about using average p-state in the scaled_busy computation.
> I’m talking about adding the output of the PID (the number of pstates to add or subtract)
> to the average pstate rather than adding this to the current p-state.
For the situation we are dealing with here, that would actually make it worse,
wouldn't it?
Let's work through a real very low load example from the Mel V2 patch where
the target pstate is increased whereas it should have been decreased:

Mel patch version 2 (12X hold off added to rjw 3 patch v10 set added to kernel 4.5-rc4):

CPU: 3
Core busy: 105
Scaled busy: 143
Old pstate: 25
New pstate: 34
mperf: 52039
aperf: 55097
tsc: 335265689
freq: 3599750 kHz
Load: 0.02%
Duration (ms): 98.293

New pstate = old pstate + (scaled_busy - setpoint) * p_gain
           = 25 + (143 - 97) * 0.2
           = 34 (as above)

Ave pstate = max_pstate * aperf / mperf
           = 34 * 55097 / 52039
           = 36

Steph average pstate method added to the above:

New pstate = ave pstate + (scaled_busy - setpoint) * p_gain
           = 36 + (143 - 97) * 0.2
           = 45 (before clamping)

Now, just for completeness, show the no Mel patch math:

Scaled busy = Core busy * max_pstate / old pstate * sample time / duration
            = 105 * 34 / 25 * 10 / 98.293
            = 14.53

New pstate = old pstate + (scaled_busy - setpoint) * p_gain
           = 25 + (14.53 - 97) * 0.2
           = 8.5
           = 16 clamped minimum

Regardless, I coded the average pstate method and observe little
difference between it and the Mel V2 patch with limited testing.
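
For anyone wanting to replay the arithmetic above, here is a small sketch.
It models only the proportional term as written in the formulas above (it is
not the driver's fixed-point PID), and all function names are made up for
illustration. The constants (setpoint 97, p_gain 0.2, pstate range 16..34,
10 ms sample time) are the ones used in the example.

```c
/* Hypothetical helpers that mirror the hand calculation; not driver code. */

/* New pstate = base pstate + (scaled_busy - setpoint) * p_gain */
double pid_step(double base_pstate, double scaled_busy,
                double setpoint, double p_gain)
{
    return base_pstate + (scaled_busy - setpoint) * p_gain;
}

/* Ave pstate = max_pstate * aperf / mperf */
double avg_pstate(double max_pstate, double aperf, double mperf)
{
    return max_pstate * aperf / mperf;
}

/* Scaled busy without the 12X hold-off: the duration scaling applies. */
double scaled_busy_unpatched(double core_busy, double max_pstate,
                             double old_pstate, double sample_ms,
                             double duration_ms)
{
    return core_busy * max_pstate / old_pstate * sample_ms / duration_ms;
}

/* Round a positive pstate to the nearest integer. */
int round_pstate(double p)
{
    return (int)(p + 0.5);
}

/* Round, then clamp to the [min_pstate, max_pstate] range. */
int clamp_pstate(double p, int min_pstate, int max_pstate)
{
    int r = round_pstate(p);

    if (r < min_pstate)
        return min_pstate;
    if (r > max_pstate)
        return max_pstate;
    return r;
}
```

Feeding in the sample above: pid_step(25, 143, 97, 0.2) gives the Mel V2 result,
pid_step(avg_pstate(34, 55097, 52039), 143, 97, 0.2) gives the average pstate
variant, and pid_step(25, scaled_busy_unpatched(105, 34, 25, 10, 98.293), 97, 0.2)
passed through clamp_pstate(..., 16, 34) gives the unpatched case.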
... Doug
Thread overview: 11+ messages
2016-02-18 11:11 [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled Mel Gorman
2016-02-18 19:43 ` Rafael J. Wysocki
2016-02-18 21:09 ` Doug Smythies
2016-02-19 10:49 ` Mel Gorman
2016-02-23 14:04 ` Mel Gorman
2016-02-18 23:29 ` Pandruvada, Srinivas
2016-02-18 23:33 ` Rafael J. Wysocki
2016-02-19 11:11 ` Stephane Gasparini
2016-02-19 16:38 ` Doug Smythies
2016-02-24 16:19 ` Stephane Gasparini
2016-02-25 19:51 ` Doug Smythies