All of lore.kernel.org
* [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
@ 2016-02-18 11:11 Mel Gorman
  2016-02-18 19:43 ` Rafael J. Wysocki
  2016-02-19 11:11 ` Stephane Gasparini
  0 siblings, 2 replies; 11+ messages in thread
From: Mel Gorman @ 2016-02-18 11:11 UTC (permalink / raw)
  To: Rafael Wysocki
  Cc: Dirk Brandewie, Ingo Molnar, Peter Zijlstra, Matt Fleming,
	Mike Galbraith, Linux-PM, LKML, Mel Gorman

(cc'ing pm and scheduler people as the problem could be blamed on either
subsystem depending on your point of view)

The PID controller relies on samples of equal duration, but this does not
hold for deferrable timers when the CPU is idle. intel_pstate checks whether
the actual duration between samples is large and, if so, scales the
"busyness" of the CPU down.

This assumes the delay was due to a deferred timer, but a workload may simply
have been idle for a short time, for example because it is context-switching
between a server and client or waiting very briefly on IO. The problem is
compounded by servers and clients migrating between CPUs as wake-affine tries
to maximise hot-cache usage. In such cases, the cores are not considered busy
and the frequency is dropped prematurely.

This patch increases the hold-off value before the busyness is scaled. The
value was selected simply by testing until the desired result was found.
Tests were conducted with workloads that are either client/server based
or short-lived IO.

dbench4
                               4.5.0-rc2             4.5.0-rc2
                                 vanilla           sample-v1r1
Hmean    mb/sec-1       309.82 (  0.00%)      327.01 (  5.55%)
Hmean    mb/sec-2       594.92 (  0.00%)      613.02 (  3.04%)
Hmean    mb/sec-4       669.17 (  0.00%)      712.27 (  6.44%)
Hmean    mb/sec-8       700.82 (  0.00%)      724.04 (  3.31%)
Hmean    mb/sec-64      425.38 (  0.00%)      448.02 (  5.32%)

               4.5.0-rc2   4.5.0-rc2
                 vanilla sample-v1r1
Mean %Busy         27.28       26.81
Mean CPU%c1        42.50       44.29
Mean CPU%c3         7.16        7.14
Mean CPU%c6        23.05       21.76
Mean CPU%c7         0.00        0.00
Mean CorWatt        4.60        5.08
Mean PkgWatt        6.83        7.32

There is a fairly sizable performance boost from the modification and, while
the percentage of time spent in C1 increases, it does not increase by a
substantial amount and the power usage increase is tiny.

iozone for small files and varying block sizes. The format is IOOperation-filesize-recordsize

                                           4.5.0-rc2             4.5.0-rc2
                                             vanilla           sample-v1r1
Hmean    SeqWrite-200704-1       740152.30 (  0.00%)   748432.35 (  1.12%)
Hmean    SeqWrite-200704-2      1052506.25 (  0.00%)  1169065.30 ( 11.07%)
Hmean    SeqWrite-200704-4      1450716.41 (  0.00%)  1725335.69 ( 18.93%)
Hmean    SeqWrite-200704-8      1523917.72 (  0.00%)  1881610.25 ( 23.47%)
Hmean    SeqWrite-200704-16     1572519.89 (  0.00%)  1750277.07 ( 11.30%)
Hmean    SeqWrite-200704-32     1611078.69 (  0.00%)  1923796.62 ( 19.41%)
Hmean    SeqWrite-200704-64     1656755.37 (  0.00%)  1892766.99 ( 14.25%)
Hmean    SeqWrite-200704-128    1641739.24 (  0.00%)  1952081.27 ( 18.90%)
Hmean    SeqWrite-200704-256    1660046.05 (  0.00%)  1931237.50 ( 16.34%)
Hmean    SeqWrite-200704-512    1634394.86 (  0.00%)  1860369.95 ( 13.83%)
Hmean    SeqWrite-200704-1024   1629526.38 (  0.00%)  1810320.92 ( 11.09%)
Hmean    SeqWrite-401408-1       828943.43 (  0.00%)   876152.50 (  5.70%)
Hmean    SeqWrite-401408-2      1231519.20 (  0.00%)  1368986.18 ( 11.16%)
Hmean    SeqWrite-401408-4      1724109.56 (  0.00%)  1838265.22 (  6.62%)
Hmean    SeqWrite-401408-8      1806615.84 (  0.00%)  1969611.74 (  9.02%)
Hmean    SeqWrite-401408-16     1859268.96 (  0.00%)  2003005.51 (  7.73%)
Hmean    SeqWrite-401408-32     1887759.67 (  0.00%)  2415913.37 ( 27.98%)
Hmean    SeqWrite-401408-64     1941717.11 (  0.00%)  1971929.24 (  1.56%)
Hmean    SeqWrite-401408-128    1919515.58 (  0.00%)  2127647.53 ( 10.84%)
Hmean    SeqWrite-401408-256    1908766.57 (  0.00%)  2067473.02 (  8.31%)
Hmean    SeqWrite-401408-512    1908999.37 (  0.00%)  2195587.56 ( 15.01%)
Hmean    SeqWrite-401408-1024   1912232.98 (  0.00%)  2150068.56 ( 12.44%)
Hmean    Rewrite-200704-1       1151067.57 (  0.00%)  1155309.64 (  0.37%)
Hmean    Rewrite-200704-2       1786824.53 (  0.00%)  1837093.18 (  2.81%)
Hmean    Rewrite-200704-4       2539338.19 (  0.00%)  2649019.78 (  4.32%)
Hmean    Rewrite-200704-8       2687411.53 (  0.00%)  2785202.26 (  3.64%)
Hmean    Rewrite-200704-16      2709445.97 (  0.00%)  2805580.76 (  3.55%)
Hmean    Rewrite-200704-32      2735718.43 (  0.00%)  2807532.87 (  2.63%)
Hmean    Rewrite-200704-64      2782754.97 (  0.00%)  2952024.38 (  6.08%)
Hmean    Rewrite-200704-128     2791889.73 (  0.00%)  2805048.02 (  0.47%)
Hmean    Rewrite-200704-256     2711596.34 (  0.00%)  2828896.54 (  4.33%)
Hmean    Rewrite-200704-512     2665066.25 (  0.00%)  2868058.05 (  7.62%)
Hmean    Rewrite-200704-1024    2675375.89 (  0.00%)  2685664.19 (  0.38%)
Hmean    Rewrite-401408-1       1350713.78 (  0.00%)  1358762.21 (  0.60%)
Hmean    Rewrite-401408-2       2079420.61 (  0.00%)  2097399.02 (  0.86%)
Hmean    Rewrite-401408-4       2889535.90 (  0.00%)  2912795.03 (  0.80%)
Hmean    Rewrite-401408-8       3068155.32 (  0.00%)  3090915.84 (  0.74%)
Hmean    Rewrite-401408-16      3103789.43 (  0.00%)  3162486.65 (  1.89%)
Hmean    Rewrite-401408-32      3112447.72 (  0.00%)  3243067.63 (  4.20%)
Hmean    Rewrite-401408-64      3232651.39 (  0.00%)  3227701.02 ( -0.15%)
Hmean    Rewrite-401408-128     3149556.47 (  0.00%)  3165694.24 (  0.51%)
Hmean    Rewrite-401408-256     3093348.93 (  0.00%)  3104229.97 (  0.35%)
Hmean    Rewrite-401408-512     3026305.45 (  0.00%)  3121151.02 (  3.13%)
Hmean    Rewrite-401408-1024    3005431.18 (  0.00%)  3046910.32 (  1.38%)

               4.5.0-rc2   4.5.0-rc2
                 vanilla sample-v1r1
Mean %Busy          3.10        3.09
Mean CPU%c1         6.16        5.55
Mean CPU%c3         0.08        0.10
Mean CPU%c6        90.65       91.26
Mean CPU%c7         0.00        0.00
Mean CorWatt        1.71        1.74
Mean PkgWatt        3.88        3.91
Max  %Busy         16.51       16.22
Max  CPU%c1        17.03       21.99
Max  CPU%c3         2.57        2.15
Max  CPU%c6        96.39       96.31
Max  CPU%c7         0.00        0.00
Max  CorWatt        5.40        5.42
Max  PkgWatt        7.53        7.56

The other operations are omitted as they showed no performance difference.
For sequential writes and rewrites there is a massive gain in throughput
for very small files. The increase in power consumption is negligible.
The gain is known not to be universal: machines with more cores see a much
smaller benefit, so the rate of CPU migration is a factor.

netperf-UDP_STREAM

                                4.5.0-rc2             4.5.0-rc2
                                  vanilla           sample-v1r1
Hmean    send-64         233.96 (  0.00%)      244.76 (  4.61%)
Hmean    send-128        466.74 (  0.00%)      479.16 (  2.66%)
Hmean    send-256        929.12 (  0.00%)      964.00 (  3.75%)
Hmean    send-1024      3631.36 (  0.00%)     3781.89 (  4.15%)
Hmean    send-2048      6984.60 (  0.00%)     7169.60 (  2.65%)
Hmean    send-3312     10792.94 (  0.00%)    11103.42 (  2.88%)
Hmean    send-4096     12895.57 (  0.00%)    13112.58 (  1.68%)
Hmean    send-8192     23057.34 (  0.00%)    23443.80 (  1.68%)
Hmean    send-16384    37871.11 (  0.00%)    38292.60 (  1.11%)
Hmean    recv-64         233.89 (  0.00%)      244.71 (  4.63%)
Hmean    recv-128        466.63 (  0.00%)      479.09 (  2.67%)
Hmean    recv-256        928.88 (  0.00%)      963.74 (  3.75%)
Hmean    recv-1024      3630.54 (  0.00%)     3780.96 (  4.14%)
Hmean    recv-2048      6983.20 (  0.00%)     7167.55 (  2.64%)
Hmean    recv-3312     10790.92 (  0.00%)    11100.63 (  2.87%)
Hmean    recv-4096     12891.37 (  0.00%)    13110.35 (  1.70%)
Hmean    recv-8192     23054.79 (  0.00%)    23438.27 (  1.66%)
Hmean    recv-16384    37866.79 (  0.00%)    38283.73 (  1.10%)

               4.5.0-rc2   4.5.0-rc2
                 vanilla sample-v1r1
Mean %Busy         37.30       37.10
Mean CPU%c1        37.52       37.30
Mean CPU%c3         0.10        0.10
Mean CPU%c6        25.08       25.49
Mean CPU%c7         0.00        0.00
Mean CorWatt       11.20       11.18
Mean PkgWatt       13.30       13.28
Max  %Busy         50.64       51.73
Max  CPU%c1        49.80       50.53
Max  CPU%c3         9.14        8.95
Max  CPU%c6        62.46       63.48
Max  CPU%c7         0.00        0.00
Max  CorWatt       16.46       16.44
Max  PkgWatt       18.58       18.55

In this test, the client and server are pinned to cores so scheduler
decisions are not a factor. There is still a mild performance boost
with no impact on power consumption.

cyclictest-pinned
                            4.5.0-rc2             4.5.0-rc2
                              vanilla           sample-v1r1
Amean    LatAvg        3.00 (  0.00%)        2.64 ( 11.94%)
Amean    LatMax      156.93 (  0.00%)      106.89 ( 31.89%)

               4.5.0-rc2   4.5.0-rc2
                 vanilla sample-v1r1
Mean %Busy         99.74       99.73
Mean CPU%c1         0.02        0.02
Mean CPU%c3         0.00        0.01
Mean CPU%c6         0.23        0.24
Mean CPU%c7         0.00        0.00
Mean CorWatt        5.06        5.92
Mean PkgWatt        7.12        7.99
Max  %Busy        100.00      100.00
Max  CPU%c1         3.88        3.50
Max  CPU%c3         0.71        0.99
Max  CPU%c6        41.79       43.17
Max  CPU%c7         0.00        0.00
Max  CorWatt        6.80        8.66
Max  PkgWatt        8.85       10.71

This test measures how quickly a task wakes up after a timeout. The test
could be defeated by selecting a timeout value that falls outside the new
hold-off period. Furthermore, a workload that is very sensitive to wakeup
latencies should use the performance governor.  Nevertheless, it is
interesting to note the impact of increasing the hold-off value.  There is
an increase in power usage because the CPU remains active during sleep times.

In all cases, there are some CPU migrations because wakers pull wakees to
nearby CPUs. It could be argued that the workload should be pinned, but this
puts a burden on the user and may not even be possible in all cases. The
scheduler could try keeping processes on the same CPUs, but that would impact
cache hotness and cause a different class of issues. Some conflict between
power management and scheduling decisions is inevitable, but there are gains
from delaying the frequency drop slightly without a severe impact on power
consumption.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 drivers/cpufreq/intel_pstate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index cd83d477e32d..54250084174a 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 	sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
 	duration_us = ktime_us_delta(cpu->sample.time,
 				     cpu->last_sample_time);
-	if (duration_us > sample_time * 3) {
+	if (duration_us > sample_time * 12) {
 		sample_ratio = div_fp(int_tofp(sample_time),
 				      int_tofp(duration_us));
 		core_busy = mul_fp(core_busy, sample_ratio);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 11:11 [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled Mel Gorman
@ 2016-02-18 19:43 ` Rafael J. Wysocki
  2016-02-18 21:09   ` Doug Smythies
  2016-02-18 23:29   ` Pandruvada, Srinivas
  2016-02-19 11:11 ` Stephane Gasparini
  1 sibling, 2 replies; 11+ messages in thread
From: Rafael J. Wysocki @ 2016-02-18 19:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rafael Wysocki, Dirk Brandewie, Ingo Molnar, Peter Zijlstra,
	Matt Fleming, Mike Galbraith, Linux-PM, LKML

Hi Mel,

On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman
<mgorman@techsingularity.net> wrote:

[cut]

>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  drivers/cpufreq/intel_pstate.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index cd83d477e32d..54250084174a 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
>         sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
>         duration_us = ktime_us_delta(cpu->sample.time,
>                                      cpu->last_sample_time);
> -       if (duration_us > sample_time * 3) {
> +       if (duration_us > sample_time * 12) {
>                 sample_ratio = div_fp(int_tofp(sample_time),
>                                       int_tofp(duration_us));
>                 core_busy = mul_fp(core_busy, sample_ratio);
> --

I've been considering making a change like this, but I wasn't quite
sure how much greater the multiplier should be, so I've queued this
one up for 4.6.

That said please note that we're planning to make one significant
change to intel_pstate in the 4.6 cycle that's very likely to affect
your results.

It is currently present in linux-next (commit 402c43ed2d74 "cpufreq:
intel_pstate: Replace timers with utilization update callbacks" in the
linux-next branch of the linux-pm.git tree, that depends on commit
fe7034338ba0 "cpufreq: Add mechanism for registering utilization
update callbacks" in the same branch).  Also you can just pull from
the pm-cpufreq-test branch in linux-pm.git, but that contains much
more material.

Thanks,
Rafael

* RE: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 19:43 ` Rafael J. Wysocki
@ 2016-02-18 21:09   ` Doug Smythies
  2016-02-19 10:49     ` Mel Gorman
  2016-02-23 14:04     ` Mel Gorman
  2016-02-18 23:29   ` Pandruvada, Srinivas
  1 sibling, 2 replies; 11+ messages in thread
From: Doug Smythies @ 2016-02-18 21:09 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Mel Gorman'
  Cc: 'Rafael Wysocki', 'Ingo Molnar',
	'Peter Zijlstra', 'Matt Fleming',
	'Mike Galbraith', 'Linux-PM', 'LKML',
	'Srinivas Pandruvada'

On 2016.02.18 Rafael J. Wysocki wrote:
On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman wrote:
>>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> ---
>>  drivers/cpufreq/intel_pstate.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> index cd83d477e32d..54250084174a 100644
>> --- a/drivers/cpufreq/intel_pstate.c
>> +++ b/drivers/cpufreq/intel_pstate.c
>> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
>>         sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
>>         duration_us = ktime_us_delta(cpu->sample.time,
>>                                      cpu->last_sample_time);
>> -       if (duration_us > sample_time * 3) {
>> +       if (duration_us > sample_time * 12) {
>>                 sample_ratio = div_fp(int_tofp(sample_time),
>>                                       int_tofp(duration_us));
>>                 core_busy = mul_fp(core_busy, sample_ratio);
>> --

The immediately preceding comment needs to be changed also.
Note that with duration-related scaling only kicking in at such a high
ratio, it might be worth saving the divide and just setting the busyness to 0.

> I've been considering making a change like this, but I wasn't quite
> sure how much greater the multiplier should be, so I've queued this
> one up for 4.6.

> That said please note that we're planning to make one significant
> change to intel_pstate in the 4.6 cycle that's very likely to affect
> your results.

Rafael:

I started to test Mel's change added to your 3 patch set version 10.

I only have one data point so far. I selected the test from one of Mel's
better results (although there is no reason to expect my computer to show
its best results under the same operating conditions):

Stock kernel 4.5-rc4 just for reference:
Linux s15 4.5.0-040500rc4-generic #201602141731 SMP Sun Feb 14 22:33:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

        Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
        Output is in Kbytes/sec

              KB  reclen   write rewrite
          401408      32 1895293 3035291
_________________________________________________________________

Kernel 4.5-rc4 + rjw 3 patch set version 10  (nominal 3X duration hold-off)
Linux s15 4.5.0-rc4-rjwv10 #167 SMP Mon Feb 15 14:23:10 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

        Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
        Output is in Kbytes/sec

              KB  reclen   write rewrite
          401408      32 2010558 3086354
          401408      32 1945126 3127472
          401408      32 1944807 3110387
          401408      32 1948620 3110002
                     AVE 1962278 3108554

Performance mode, for comparison:

              KB  reclen   write rewrite
          401408      32 2870111 5023311
          401408      32 2869642 5149213
          401408      32 2792053 5100280
          401408      32 2863887 5149229
_________________________________________________________________

Kernel 4.5-rc4 + rjw 3 patch set version 10 + mg 12X duration hold-off
Linux s15 4.5.0-rc4-rjwv10-12 #169 SMP Thu Feb 18 08:15:33 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

        Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
        Output is in Kbytes/sec

              KB  reclen   write rewrite
          401408      32 1989670 3100580
          401408      32 2062291 3112463
          401408      32 2107637 3233567
          401408      32 2111772 3340610
                     AVE 2067843 3196805
          Gain Versus 3X    5.4%    2.8%
_________________________________________________________________

Mel: Did you observe any downside conditions?

For example, here are some trace samples taken from my computer:

Duration kick in = 3X
Core busy = 101
Current pstate = 16
Load = 2.2%
Duration = 43.815 mSec
Scaled busy = 48
Next Pstate = 16 (= minimum for my computer)

If duration kick in = 12X then
Scaled busy = 214
Next pstate = 38 (= Max turbo for my computer)

Note: I do NOT have an operational example where it matters in terms
of energy use or whatever. I am just suggesting that we look.

... Doug

* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 19:43 ` Rafael J. Wysocki
  2016-02-18 21:09   ` Doug Smythies
@ 2016-02-18 23:29   ` Pandruvada, Srinivas
  2016-02-18 23:33     ` Rafael J. Wysocki
  1 sibling, 1 reply; 11+ messages in thread
From: Pandruvada, Srinivas @ 2016-02-18 23:29 UTC (permalink / raw)
  To: mgorman, rafael
  Cc: matt, mingo, peterz, Brandewie, Dirk J, linux-kernel, linux-pm,
	rjw, umgwanakikbuti

On Thu, 2016-02-18 at 20:43 +0100, Rafael J. Wysocki wrote:
> Hi Mel,
> 
> On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman
> <mgorman@techsingularity.net> wrote:
> 
> [cut]
> 
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  drivers/cpufreq/intel_pstate.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index cd83d477e32d..54250084174a 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -999,7 +999,7 @@ static inline int32_t
> > get_target_pstate_use_performance(struct cpudata *cpu)
> >         sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
> >         duration_us = ktime_us_delta(cpu->sample.time,
> >                                      cpu->last_sample_time);
> > -       if (duration_us > sample_time * 3) {
> > +       if (duration_us > sample_time * 12) {
> >                 sample_ratio = div_fp(int_tofp(sample_time),
> >                                       int_tofp(duration_us));
> >                 core_busy = mul_fp(core_busy, sample_ratio);
> > --
> 
> I've been considering making a change like this, but I wasn't quite
> sure how much greater the multiplier should be, so I've queued this
> one up for 4.6.
> 
We need to test power impact on different server workloads. So please
hold on.
We have server folks complaining that we already consume too much
power.

Thanks,
Srinivas

> That said please note that we're planning to make one significant
> change to intel_pstate in the 4.6 cycle that's very likely to affect
> your results.
> 
> It is currently present in linux-next (commit 402c43ed2d74 "cpufreq:
> intel_pstate: Replace timers with utilization update callbacks" in
> the
> linux-next branch of the linux-pm.git tree, that depends on commit
> fe7034338ba0 "cpufreq: Add mechanism for registering utilization
> update callbacks" in the same branch).  Also you can just pull from
> the pm-cpufreq-test branch in linux-pm.git, but that contains much
> more material.
> 
> Thanks,
> Rafael
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 23:29   ` Pandruvada, Srinivas
@ 2016-02-18 23:33     ` Rafael J. Wysocki
  0 siblings, 0 replies; 11+ messages in thread
From: Rafael J. Wysocki @ 2016-02-18 23:33 UTC (permalink / raw)
  To: Pandruvada, Srinivas
  Cc: mgorman, rafael, matt, mingo, peterz, Brandewie, Dirk J,
	linux-kernel, linux-pm, rjw, umgwanakikbuti

On Fri, Feb 19, 2016 at 12:29 AM, Pandruvada, Srinivas
<srinivas.pandruvada@intel.com> wrote:
> On Thu, 2016-02-18 at 20:43 +0100, Rafael J. Wysocki wrote:
>> Hi Mel,
>>
>> On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman
>> <mgorman@techsingularity.net> wrote:
>>
>> [cut]
>>
>> >
>> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> > ---
>> >  drivers/cpufreq/intel_pstate.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/cpufreq/intel_pstate.c
>> > b/drivers/cpufreq/intel_pstate.c
>> > index cd83d477e32d..54250084174a 100644
>> > --- a/drivers/cpufreq/intel_pstate.c
>> > +++ b/drivers/cpufreq/intel_pstate.c
>> > @@ -999,7 +999,7 @@ static inline int32_t
>> > get_target_pstate_use_performance(struct cpudata *cpu)
>> >         sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
>> >         duration_us = ktime_us_delta(cpu->sample.time,
>> >                                      cpu->last_sample_time);
>> > -       if (duration_us > sample_time * 3) {
>> > +       if (duration_us > sample_time * 12) {
>> >                 sample_ratio = div_fp(int_tofp(sample_time),
>> >                                       int_tofp(duration_us));
>> >                 core_busy = mul_fp(core_busy, sample_ratio);
>> > --
>>
>> I've been considering making a change like this, but I wasn't quite
>> sure how much greater the multiplier should be, so I've queued this
>> one up for 4.6.
>>
> We need to test power impact on different server workloads. So please
> hold on.
> We have server folks complaining that we already consume too much
> power.

I'll drop the commit if it turns out to cause too much energy to be consumed.

Thanks,
Rafael

* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 21:09   ` Doug Smythies
@ 2016-02-19 10:49     ` Mel Gorman
  2016-02-23 14:04     ` Mel Gorman
  1 sibling, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2016-02-19 10:49 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra',
	'Matt Fleming', 'Mike Galbraith',
	'Linux-PM', 'LKML', 'Srinivas Pandruvada'

On Thu, Feb 18, 2016 at 01:09:26PM -0800, Doug Smythies wrote:
> On 2016.02.18 Rafael J. Wysocki wrote:
> On Thu, Feb 18, 2016 at 12:11 PM, Mel Gorman wrote:
> >>
> >> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> >> ---
> >>  drivers/cpufreq/intel_pstate.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> >> index cd83d477e32d..54250084174a 100644
> >> --- a/drivers/cpufreq/intel_pstate.c
> >> +++ b/drivers/cpufreq/intel_pstate.c
> >> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
> >>         sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
> >>         duration_us = ktime_us_delta(cpu->sample.time,
> >>                                      cpu->last_sample_time);
> >> -       if (duration_us > sample_time * 3) {
> >> +       if (duration_us > sample_time * 12) {
> >>                 sample_ratio = div_fp(int_tofp(sample_time),
> >>                                       int_tofp(duration_us));
> >>                 core_busy = mul_fp(core_busy, sample_ratio);
> >> --
> 
> The immediately preceding comment needs to be changed also.

Yes, it does. Thanks.

> Note that with duration related scaling only coming in at such a high
> ratio it might be worth saving the divide and just setting it to 0.
> 

That sounds reasonable. I've queued up a test based on this as well as
tests with the linux-next branch from linux-pm to see what falls out.

> > I've been considering making a change like this, but I wasn't quite
> > sure how much greater the multiplier should be, so I've queued this
> > one up for 4.6.
> 
> > That said please note that we're planning to make one significant
> > change to intel_pstate in the 4.6 cycle that's very likely to affect
> > your results.
> 
> Rafael:
> 
> I started to test Mel's change added to your 3 patch set version 10.
> 
> I only have one data point so far, I selected the test I did from one of Mel's
> better results (although there is no reason to expect my computer to have
> best results for the same operating conditions):
> 

It's a reasonable expectation.

> Stock kernel 4.5-rc4 just for reference:
> Linux s15 4.5.0-040500rc4-generic #201602141731 SMP Sun Feb 14 22:33:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> 
>         Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
>         Output is in Kbytes/sec
> 
>               KB  reclen   write rewrite
>           401408      32 1895293 3035291
> _________________________________________________________________
> 
> Kernel 4.5-rc4 + jrw 3 patch set version 10  (nominal 3X duration holdoff)
> Linux s15 4.5.0-rc4-rjwv10 #167 SMP Mon Feb 15 14:23:10 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
> 
>         Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
>         Output is in Kbytes/sec
> 
>               KB  reclen   write rewrite
>           401408      32 2010558 3086354
>           401408      32 1945126 3127472
>           401408      32 1944807 3110387
>           401408      32 1948620 3110002
>                      AVE 1962278 3108554
> 
> Performance mode, for comparison:
> 
>               KB  reclen   write rewrite
>           401408      32 2870111 5023311
>           401408      32 2869642 5149213
>           401408      32 2792053 5100280
>           401408      32 2863887 5149229
> _________________________________________________________________
> 
> Kernel 4.5-rc4 + jrw 3 patch set version 10 + mg 12X duration hold-off
> Linux s15 4.5.0-rc4-rjwv10-12 #169 SMP Thu Feb 18 08:15:33 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
> 
>         Command line used: iozone -s 401408 -r 32 -f bla.bla -i 0
>         Output is in Kbytes/sec
> 
>               KB  reclen   write rewrite
>           401408      32 1989670 3100580
>           401408      32 2062291 3112463
>           401408      32 2107637 3233567
>           401408      32 2111772 3340610
>                      AVE 2067843 3196805
>           Gain Verses 3X    5.4%    2.8%
> _________________________________________________________________
> 
> Mel: Did you observe any downside conditions?
> 

Not so far, but my expectation is that any downside would be related to power
consumption. At worst, I expect the patch to have little or no performance
impact in cases where there are a lot of cores, a lot of migration and the
CPU core is idle for longer than the new hold-off period. For power
consumption, I'm relying entirely on the output of turbostat to tell me if
there are problems, which may or may not be sufficient.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 11:11 [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled Mel Gorman
  2016-02-18 19:43 ` Rafael J. Wysocki
@ 2016-02-19 11:11 ` Stephane Gasparini
  2016-02-19 16:38   ` Doug Smythies
  1 sibling, 1 reply; 11+ messages in thread
From: Stephane Gasparini @ 2016-02-19 11:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rafael Wysocki, Dirk Brandewie, Ingo Molnar, Peter Zijlstra,
	Matt Fleming, Mike Galbraith, Linux-PM, LKML

The issue you are reporting looks like one we improved on Android by using
the average pstate instead of the last requested pstate.

We know that this improves the ffmpeg encoding performance when using the
load algorithm.

See the patch attached.

This patch is only applied to get_target_pstate_use_cpu_load; however, you
can give it a try on get_target_pstate_use_performance.

IPLoad+Avg-Pstate vs IP Load:

Benchmark               ∆Perf    ∆Power
SmartBench-Gaming       -0.1%   -10.4%
SmartBench-Productivity -0.8%   -10.4%
CandyCrush                n/a   -17.4%
AngryBirds                n/a    -5.9%
videoPlayback             n/a   -13.9%
audioPlayback             n/a    -4.9%
IcyRocks-0-0             0.0%    -4.0%
IcyRocks-20-50           0.0%   -38.4%
IcyRocks-40-100          0.1%    -2.8%
IcyRocks-60-150          1.4%    -0.6%
IcyRocks-80-200          2.9%     0.7%
IcyRocks-100-250         1.1%     0.4%
iozone RR               -2.7%    -4.2%
iozone RW               -8.8%    -4.2%
Dhrystone               -0.2%    -0.8%
Coremark                 0.5%     0.2%


Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
---
drivers/cpufreq/intel_pstate.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index cd83d47..6ba8cab 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -908,8 +908,6 @@ static inline void intel_pstate_sample(struct cpudata *cpu)
	cpu->sample.mperf -= cpu->prev_mperf;
	cpu->sample.tsc -= cpu->prev_tsc;

-	intel_pstate_calc_busy(cpu);
-
	cpu->prev_aperf = aperf;
	cpu->prev_mperf = mperf;
	cpu->prev_tsc = tsc;
@@ -931,6 +929,12 @@ static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
	mod_timer_pinned(&cpu->timer, jiffies + delay);
}

+static inline int32_t get_avg_pstate(struct cpudata *cpu)
+{
+	return div64_u64(cpu->pstate.max_pstate * cpu->sample.aperf,
+		cpu->sample.mperf);
+}
+
static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
{
	struct sample *sample = &cpu->sample;
@@ -964,7 +968,7 @@ static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
	cpu_load = div64_u64(int_tofp(100) * mperf, sample->tsc);
	cpu->sample.busy_scaled = cpu_load;

-	return cpu->pstate.current_pstate - pid_calc(&cpu->pid, cpu_load);
+	return get_avg_pstate(cpu) - pid_calc(&cpu->pid, cpu_load);
}

static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
@@ -973,6 +977,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
	s64 duration_us;
	u32 sample_time;

+	intel_pstate_calc_busy(cpu);
	/*
	 * core_busy is the ratio of actual performance to max
	 * max_pstate is the max non turbo pstate available
—
Steph




> On Feb 18, 2016, at 12:11 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> (cc'ing pm and scheduler people as the problem could be blamed on either
> subsystem depending on your point of view)
> 
> The PID relies on samples of equal time but this does not apply for
> deferrable timers when the CPU is idle. intel_pstate checks if the actual
> duration between samples is large and if so, the "busyness" of the CPU
> is scaled.
> 
> This assumes the delay was a deferred timer but a workload may simply have
> been idle for a short time if it's context switching between a server and
> client or waiting very briefly on IO. It's compounded by the problem that
> server/clients migrate between CPUs due to wake-affine trying to maximise
> hot cache usage. In such cases, the cores are not considered busy and the
> frequency is dropped prematurely.
> 
> This patch increases the hold-off value before the busyness is scaled. It
> was selected based simply on testing until the desired result was found.
> Tests were conducted with workloads that are either client/server based
> or short-lived IO.
> 
> dbench4
>                               4.5.0-rc2             4.5.0-rc2
>                                 vanilla           sample-v1r1
> Hmean    mb/sec-1       309.82 (  0.00%)      327.01 (  5.55%)
> Hmean    mb/sec-2       594.92 (  0.00%)      613.02 (  3.04%)
> Hmean    mb/sec-4       669.17 (  0.00%)      712.27 (  6.44%)
> Hmean    mb/sec-8       700.82 (  0.00%)      724.04 (  3.31%)
> Hmean    mb/sec-64      425.38 (  0.00%)      448.02 (  5.32%)
> 
>               4.5.0-rc2   4.5.0-rc2
>                 vanilla sample-v1r1
> Mean %Busy         27.28       26.81
> Mean CPU%c1        42.50       44.29
> Mean CPU%c3         7.16        7.14
> Mean CPU%c6        23.05       21.76
> Mean CPU%c7         0.00        0.00
> Mean CorWatt        4.60        5.08
> Mean PkgWatt        6.83        7.32
> 
> There is fairly sizable performance boost from the modification and while
> the percentage of time spent in C1 is increased, it is not by a substantial
> amount and the power usage increase is tiny.
> 
> iozone for small files and varying block sizes. Format is IOOperation-filessize-recordsize
> 
>                                           4.5.0-rc2             4.5.0-rc2
>                                             vanilla           sample-v1r1
> Hmean    SeqWrite-200704-1       740152.30 (  0.00%)   748432.35 (  1.12%)
> Hmean    SeqWrite-200704-2      1052506.25 (  0.00%)  1169065.30 ( 11.07%)
> Hmean    SeqWrite-200704-4      1450716.41 (  0.00%)  1725335.69 ( 18.93%)
> Hmean    SeqWrite-200704-8      1523917.72 (  0.00%)  1881610.25 ( 23.47%)
> Hmean    SeqWrite-200704-16     1572519.89 (  0.00%)  1750277.07 ( 11.30%)
> Hmean    SeqWrite-200704-32     1611078.69 (  0.00%)  1923796.62 ( 19.41%)
> Hmean    SeqWrite-200704-64     1656755.37 (  0.00%)  1892766.99 ( 14.25%)
> Hmean    SeqWrite-200704-128    1641739.24 (  0.00%)  1952081.27 ( 18.90%)
> Hmean    SeqWrite-200704-256    1660046.05 (  0.00%)  1931237.50 ( 16.34%)
> Hmean    SeqWrite-200704-512    1634394.86 (  0.00%)  1860369.95 ( 13.83%)
> Hmean    SeqWrite-200704-1024   1629526.38 (  0.00%)  1810320.92 ( 11.09%)
> Hmean    SeqWrite-401408-1       828943.43 (  0.00%)   876152.50 (  5.70%)
> Hmean    SeqWrite-401408-2      1231519.20 (  0.00%)  1368986.18 ( 11.16%)
> Hmean    SeqWrite-401408-4      1724109.56 (  0.00%)  1838265.22 (  6.62%)
> Hmean    SeqWrite-401408-8      1806615.84 (  0.00%)  1969611.74 (  9.02%)
> Hmean    SeqWrite-401408-16     1859268.96 (  0.00%)  2003005.51 (  7.73%)
> Hmean    SeqWrite-401408-32     1887759.67 (  0.00%)  2415913.37 ( 27.98%)
> Hmean    SeqWrite-401408-64     1941717.11 (  0.00%)  1971929.24 (  1.56%)
> Hmean    SeqWrite-401408-128    1919515.58 (  0.00%)  2127647.53 ( 10.84%)
> Hmean    SeqWrite-401408-256    1908766.57 (  0.00%)  2067473.02 (  8.31%)
> Hmean    SeqWrite-401408-512    1908999.37 (  0.00%)  2195587.56 ( 15.01%)
> Hmean    SeqWrite-401408-1024   1912232.98 (  0.00%)  2150068.56 ( 12.44%)
> Hmean    Rewrite-200704-1       1151067.57 (  0.00%)  1155309.64 (  0.37%)
> Hmean    Rewrite-200704-2       1786824.53 (  0.00%)  1837093.18 (  2.81%)
> Hmean    Rewrite-200704-4       2539338.19 (  0.00%)  2649019.78 (  4.32%)
> Hmean    Rewrite-200704-8       2687411.53 (  0.00%)  2785202.26 (  3.64%)
> Hmean    Rewrite-200704-16      2709445.97 (  0.00%)  2805580.76 (  3.55%)
> Hmean    Rewrite-200704-32      2735718.43 (  0.00%)  2807532.87 (  2.63%)
> Hmean    Rewrite-200704-64      2782754.97 (  0.00%)  2952024.38 (  6.08%)
> Hmean    Rewrite-200704-128     2791889.73 (  0.00%)  2805048.02 (  0.47%)
> Hmean    Rewrite-200704-256     2711596.34 (  0.00%)  2828896.54 (  4.33%)
> Hmean    Rewrite-200704-512     2665066.25 (  0.00%)  2868058.05 (  7.62%)
> Hmean    Rewrite-200704-1024    2675375.89 (  0.00%)  2685664.19 (  0.38%)
> Hmean    Rewrite-401408-1       1350713.78 (  0.00%)  1358762.21 (  0.60%)
> Hmean    Rewrite-401408-2       2079420.61 (  0.00%)  2097399.02 (  0.86%)
> Hmean    Rewrite-401408-4       2889535.90 (  0.00%)  2912795.03 (  0.80%)
> Hmean    Rewrite-401408-8       3068155.32 (  0.00%)  3090915.84 (  0.74%)
> Hmean    Rewrite-401408-16      3103789.43 (  0.00%)  3162486.65 (  1.89%)
> Hmean    Rewrite-401408-32      3112447.72 (  0.00%)  3243067.63 (  4.20%)
> Hmean    Rewrite-401408-64      3232651.39 (  0.00%)  3227701.02 ( -0.15%)
> Hmean    Rewrite-401408-128     3149556.47 (  0.00%)  3165694.24 (  0.51%)
> Hmean    Rewrite-401408-256     3093348.93 (  0.00%)  3104229.97 (  0.35%)
> Hmean    Rewrite-401408-512     3026305.45 (  0.00%)  3121151.02 (  3.13%)
> Hmean    Rewrite-401408-1024    3005431.18 (  0.00%)  3046910.32 (  1.38%)
> 
>               4.5.0-rc2   4.5.0-rc2
>                 vanilla sample-v1r1
> Mean %Busy          3.10        3.09
> Mean CPU%c1         6.16        5.55
> Mean CPU%c3         0.08        0.10
> Mean CPU%c6        90.65       91.26
> Mean CPU%c7         0.00        0.00
> Mean CorWatt        1.71        1.74
> Mean PkgWatt        3.88        3.91
> Max  %Busy         16.51       16.22
> Max  CPU%c1        17.03       21.99
> Max  CPU%c3         2.57        2.15
> Max  CPU%c6        96.39       96.31
> Max  CPU%c7         0.00        0.00
> Max  CorWatt        5.40        5.42
> Max  PkgWatt        7.53        7.56
> 
> The other operations are omitted as they showed no performance difference.
> For sequential writes and rewrites there is a massive gain in throughput
> for very small files. The increase in power consumption is negligible.
> It is known that the increase is not universal. Larger core machines see
> a much smaller benefit so the rate of CPU migrations are a factor.
> 
> netperf-UDP_STREAM
> 
>                                4.5.0-rc2             4.5.0-rc2
>                                  vanilla           sample-v1r1
> Hmean    send-64         233.96 (  0.00%)      244.76 (  4.61%)
> Hmean    send-128        466.74 (  0.00%)      479.16 (  2.66%)
> Hmean    send-256        929.12 (  0.00%)      964.00 (  3.75%)
> Hmean    send-1024      3631.36 (  0.00%)     3781.89 (  4.15%)
> Hmean    send-2048      6984.60 (  0.00%)     7169.60 (  2.65%)
> Hmean    send-3312     10792.94 (  0.00%)    11103.42 (  2.88%)
> Hmean    send-4096     12895.57 (  0.00%)    13112.58 (  1.68%)
> Hmean    send-8192     23057.34 (  0.00%)    23443.80 (  1.68%)
> Hmean    send-16384    37871.11 (  0.00%)    38292.60 (  1.11%)
> Hmean    recv-64         233.89 (  0.00%)      244.71 (  4.63%)
> Hmean    recv-128        466.63 (  0.00%)      479.09 (  2.67%)
> Hmean    recv-256        928.88 (  0.00%)      963.74 (  3.75%)
> Hmean    recv-1024      3630.54 (  0.00%)     3780.96 (  4.14%)
> Hmean    recv-2048      6983.20 (  0.00%)     7167.55 (  2.64%)
> Hmean    recv-3312     10790.92 (  0.00%)    11100.63 (  2.87%)
> Hmean    recv-4096     12891.37 (  0.00%)    13110.35 (  1.70%)
> Hmean    recv-8192     23054.79 (  0.00%)    23438.27 (  1.66%)
> Hmean    recv-16384    37866.79 (  0.00%)    38283.73 (  1.10%)
> 
>               4.5.0-rc2   4.5.0-rc2
>                 vanilla sample-v1r1
> Mean %Busy         37.30       37.10
> Mean CPU%c1        37.52       37.30
> Mean CPU%c3         0.10        0.10
> Mean CPU%c6        25.08       25.49
> Mean CPU%c7         0.00        0.00
> Mean CorWatt       11.20       11.18
> Mean PkgWatt       13.30       13.28
> Max  %Busy         50.64       51.73
> Max  CPU%c1        49.80       50.53
> Max  CPU%c3         9.14        8.95
> Max  CPU%c6        62.46       63.48
> Max  CPU%c7         0.00        0.00
> Max  CorWatt       16.46       16.44
> Max  PkgWatt       18.58       18.55
> 
> In this test, the client/server are pinned to cores so the scheduler
> decisions are not a factor. There is still a mild performance boost
> with no impact on power consumption.
> 
> cyclictest-pinned
>                            4.5.0-rc2             4.5.0-rc2
>                              vanilla           sample-v1r1
> Amean    LatAvg        3.00 (  0.00%)        2.64 ( 11.94%)
> Amean    LatMax      156.93 (  0.00%)      106.89 ( 31.89%)
> 
>               4.5.0-rc2   4.5.0-rc2
>                 vanilla sample-v1r1
> Mean %Busy         99.74       99.73
> Mean CPU%c1         0.02        0.02
> Mean CPU%c3         0.00        0.01
> Mean CPU%c6         0.23        0.24
> Mean CPU%c7         0.00        0.00
> Mean CorWatt        5.06        5.92
> Mean PkgWatt        7.12        7.99
> Max  %Busy        100.00      100.00
> Max  CPU%c1         3.88        3.50
> Max  CPU%c3         0.71        0.99
> Max  CPU%c6        41.79       43.17
> Max  CPU%c7         0.00        0.00
> Max  CorWatt        6.80        8.66
> Max  PkgWatt        8.85       10.71
> 
> This test measures how quickly a task wakes up after a timeout. The test
> could be defeated by selecting a different timeout value that is outside
> the new hold-off value. Furthermore, a workload that is very sensitive to
> wakeup latencies should use the performance governor.  Nevertheless it's
> interesting to note the impact of increasing the hold-off value.  There is
> an increase in power usage because the CPU remains active during sleep times.
> 
> In all cases, there are some CPU migrations because wakers pull wakees to
> nearby CPUs. It could be argued that the workload should be pinned but this
> puts a burden on the user that may not even be possible in all cases. The
> scheduler could try keeping processes on the same CPUs but that would impact
> cache hotness and cause a different class of issues. It is inevitable that
> there will be some conflict between power management and scheduling decisions
> but there is some gains from delaying idling slightly without a severe impact
> on power consumption.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> drivers/cpufreq/intel_pstate.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index cd83d477e32d..54250084174a 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
> 	sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
> 	duration_us = ktime_us_delta(cpu->sample.time,
> 				     cpu->last_sample_time);
> -	if (duration_us > sample_time * 3) {
> +	if (duration_us > sample_time * 12) {
> 		sample_ratio = div_fp(int_tofp(sample_time),
> 				      int_tofp(duration_us));
> 		core_busy = mul_fp(core_busy, sample_ratio);
> -- 
> 2.6.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



[-- Attachment #2: [linux-power-mgmt] [PATCH 1_3] cpufreq: intel_pstate: Use avg_pstate instead of current_pstate --]
[-- Type: application/octet-stream, Size: 6382 bytes --]

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* RE: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-19 11:11 ` Stephane Gasparini
@ 2016-02-19 16:38   ` Doug Smythies
  2016-02-24 16:19     ` Stephane Gasparini
  0 siblings, 1 reply; 11+ messages in thread
From: Doug Smythies @ 2016-02-19 16:38 UTC (permalink / raw)
  To: 'Stephane Gasparini', 'Mel Gorman'
  Cc: 'Rafael Wysocki', 'Ingo Molnar',
	'Peter Zijlstra', 'Matt Fleming',
	'Mike Galbraith', 'Linux-PM', 'LKML',
	'Srinivas Pandruvada'

Hi Steph,

On 2016.02.19 03:12 Stephane Gasparini wrote:
>
> The issue you are reporting looks like one we improved on android by using 
> the average pstate instead of using the last requested pstate
>
> We know that this is improving the ffmpeg encoding performance when using the
> load algorithm.
>
> see patch attached
>
> This patch is only applied on get_target_pstate_use_cpu_load however you can give
> it a try on get_target_pstate_use_performance

Yes, that type of patch works on the load based approach.

However, I do not think it works on the performance based approach. Why not?
Well, if I understand correctly, follow the math and you end up with:

scaled_busy = (aperf * 100% / mperf) * (max_pstate / ((aperf * max_pstate) / mperf))
            = 100%

... Doug


* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-18 21:09   ` Doug Smythies
  2016-02-19 10:49     ` Mel Gorman
@ 2016-02-23 14:04     ` Mel Gorman
  1 sibling, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2016-02-23 14:04 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra',
	'Matt Fleming', 'Mike Galbraith',
	'Linux-PM', 'LKML', 'Srinivas Pandruvada'

On Thu, Feb 18, 2016 at 01:09:26PM -0800, Doug Smythies wrote:
> >> +++ b/drivers/cpufreq/intel_pstate.c
> >> @@ -999,7 +999,7 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
> >>         sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
> >>         duration_us = ktime_us_delta(cpu->sample.time,
> >>                                      cpu->last_sample_time);
> >> -       if (duration_us > sample_time * 3) {
> >> +       if (duration_us > sample_time * 12) {
> >>                 sample_ratio = div_fp(int_tofp(sample_time),
> >>                                       int_tofp(duration_us));
> >>                 core_busy = mul_fp(core_busy, sample_ratio);
> >> --
> 
> The immediately preceding comment needs to be changed also.
> Note that with duration related scaling only coming in at such a high
> ratio it might be worth saving the divide and just setting it to 0.
> 

I tried this and FWIW, the performance is generally comparable as is the
power usage as reported by turbostat. On occasion, depending on the
machine, the system CPU usage is noticably lower.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-19 16:38   ` Doug Smythies
@ 2016-02-24 16:19     ` Stephane Gasparini
  2016-02-25 19:51       ` Doug Smythies
  0 siblings, 1 reply; 11+ messages in thread
From: Stephane Gasparini @ 2016-02-24 16:19 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Mel Gorman, Rafael Wysocki, Ingo Molnar, Peter Zijlstra,
	Matt Fleming, Mike Galbraith, Linux-PM, LKML,
	Srinivas Pandruvada

Hi Doug


> On Feb 19, 2016, at 5:38 PM, Doug Smythies <dsmythies@telus.net> wrote:
> 
> Hi Steph,
> 
> On 2016.02.19 03:12 Stephane Gasparini wrote:
>> 
>> The issue you are reporting looks like one we improved on android by using 
>> the average pstate instead of using the last requested pstate
>> 
>> We know that this is improving the ffmpeg encoding performance when using the
>> load algorithm.
>> 
>> see patch attached
>> 
>> This patch is only applied on get_target_pstate_use_cpu_load however you can give
>> it a try on get_target_pstate_use_performance
> 
> Yes, that type of patch works on the load based approach.

I’m not talking about using average p-state in the scaled_busy computation.
I’m talking about adding the output of the PID (the number of p-states to add
or subtract) to the average p-state rather than adding it to the current p-state.

The current p-state does not always reflect reality, because it can be imposed
by a "linked CPU". This is the case when a thread migrates onto a "linked CPU"
that was not loaded: its current p-state will be low, while its average p-state
will reflect the activity of the "linked CPU".

I will not claim this is a perfect solution, but this, combined with the
topology awareness of the scheduler, helps make better decisions.

> However, I do not think it works on the performance based approach. Why not?
> Well, if I understand correctly, follow the math and you end up with:
> 
> scaled_busy = (aperf * 100% / mperf) * (max_pstate / ((aperf * max_pstate) / mperf))
>             = 100%
> 
> ... Doug
> 
> 
—
Steph


* RE: [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled
  2016-02-24 16:19     ` Stephane Gasparini
@ 2016-02-25 19:51       ` Doug Smythies
  0 siblings, 0 replies; 11+ messages in thread
From: Doug Smythies @ 2016-02-25 19:51 UTC (permalink / raw)
  To: 'Stephane Gasparini'
  Cc: 'Mel Gorman', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra',
	'Matt Fleming', 'Mike Galbraith',
	'Linux-PM', 'LKML', 'Srinivas Pandruvada'

Hi Steph,

On 2016.02.24 08:20 Stephane Gasparini wrote:
>> On Feb 19, 2016, at 5:38 PM, Doug Smythies <dsmythies@telus.net> wrote: 
>>> On 2016.02.19 03:12 Stephane Gasparini wrote:
>>> 
>>> The issue you are reporting looks like one we improved on android by using 
>>> the average pstate instead of using the last requested pstate
>>> 
>>> We know that this is improving the ffmpeg encoding performance when using the
>>> load algorithm.
>>> 
>>> see patch attached
>>> 
>>> This patch is only applied on get_target_pstate_use_cpu_load however you can give
>>> it a try on get_target_pstate_use_performance
>> 
>> Yes, that type of patch works on the load based approach.
>
> I’m not talking about using average p-state in the scaled_busy computation.
> I’m talking about adding the output of the PID (the number of p-states to add
> or subtract) to the average p-state rather than adding it to the current p-state.

For the situation we are dealing with here, that would actually make it worse,
wouldn't it?

Let's work through a real very low load example from the Mel V2 patch where
the target pstate is increased whereas it should have been decreased:

Mel patch version 2 (12X hold off added to rjw 3 patch v10 set added to kernel 4.5-rc4):

CPU: 3
Core busy: 105
Scaled busy: 143
Old pstate: 25
New pstate: 34
mperf: 52039
aperf: 55097
tsc: 335265689
freq: 3599750 KHz
Load: 0.02%
Duration (mS): 98.293

New pstate = old pstate + (scaled_busy-setpoint) * p_gain
           = 25 + (143 - 97) * 0.2
           = 34 (as above)

Ave pstate = max_pstate * aperf / mperf
           = 34 * 55097 / 52039
           = 36

Steph average pstate method added to the above:
New pstate = ave pstate + (scaled_busy-setpoint) * p_gain
           = 36 + (143 - 97) * 0.2
           = 45 (before clamping)

Now, just for completeness show the no Mel patch math:
Scaled busy = Core busy * max_pstate / old pstate * sample time / duration
            = 105 * 34 / 25 * 10 / 98.293
            = 14.53
New pstate = old pstate + (scaled_busy-setpoint) * p_gain
            = 25 + (14.53 - 97) * .2
            = 8.5
            = 16 clamped minimum

Regardless, I coded the average pstate method and observe little
difference between it and the Mel V2 patch with limited testing.

... Doug


end of thread, other threads:[~2016-02-25 19:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2016-02-18 11:11 [PATCH 1/1] intel_pstate: Increase hold-off time before busyness is scaled Mel Gorman
2016-02-18 19:43 ` Rafael J. Wysocki
2016-02-18 21:09   ` Doug Smythies
2016-02-19 10:49     ` Mel Gorman
2016-02-23 14:04     ` Mel Gorman
2016-02-18 23:29   ` Pandruvada, Srinivas
2016-02-18 23:33     ` Rafael J. Wysocki
2016-02-19 11:11 ` Stephane Gasparini
2016-02-19 16:38   ` Doug Smythies
2016-02-24 16:19     ` Stephane Gasparini
2016-02-25 19:51       ` Doug Smythies
