* Performance of low-cpu utilisation benchmark regressed severely since 4.6
@ 2017-04-10  8:41 Mel Gorman
  2017-04-10 20:51 ` Rafael J. Wysocki
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2017-04-10  8:41 UTC (permalink / raw)
  To: rafael.j.wysocki; +Cc: jrg.otte, linux-kernel, linux-pm

Hi Rafael,

Since kernel 4.6, performance of low CPU intensity workloads has dropped
severely.  netperf UDP_STREAM, which runs at about 15-20% CPU utilisation,
has regressed about 10% relative to 4.4, and TCP_STREAM about 6-9%. sockperf
shows similar utilisation figures but I won't go into these in detail as they
were running loopback and are sensitive to a lot of factors.

It's far more obvious when looking at the git test suite and the length
of time it takes to run. This is a shellscript and git intensive workload
whose CPU utilisation is very low but is less sensitive to multiple
factors than netperf and sockperf.

Bisection indicates that the regression started with commit ffb810563c0c
("intel_pstate: Avoid getting stuck in high P-states when idle").  However,
it's no longer the only relevant commit, as the following results will show.


                                 4.4.0                 4.5.0                 4.6.0            4.11.0-rc5            4.11.0-rc5
                               vanilla               vanilla               vanilla               vanilla           revert-v1r1
User    min          1786.44 (  0.00%)     1613.72 (  9.67%)     3302.19 (-84.85%)     3487.46 (-95.22%)     2701.84 (-51.24%)
User    mean         1788.35 (  0.00%)     1616.47 (  9.61%)     3304.14 (-84.76%)     3488.12 (-95.05%)     2715.80 (-51.86%)
User    stddev          1.43 (  0.00%)        1.75 (-21.84%)        1.12 ( 22.10%)        0.57 ( 60.14%)        7.13 (-397.62%)
User    coeffvar        0.08 (  0.00%)        0.11 (-34.80%)        0.03 ( 57.83%)        0.02 ( 79.56%)        0.26 (-227.68%)
User    max          1790.14 (  0.00%)     1618.73 (  9.58%)     3305.40 (-84.64%)     3489.01 (-94.90%)     2721.66 (-52.04%)
System  min           218.44 (  0.00%)      202.58 (  7.26%)      407.51 (-86.55%)      269.92 (-23.57%)      196.85 (  9.88%)
System  mean          219.05 (  0.00%)      203.62 (  7.04%)      408.38 (-86.43%)      270.83 (-23.64%)      197.99 (  9.61%)
System  stddev          0.60 (  0.00%)        0.64 ( -6.30%)        0.77 (-28.89%)        0.59 (  1.47%)        0.87 (-44.72%)
System  coeffvar        0.27 (  0.00%)        0.31 (-14.35%)        0.19 ( 30.86%)        0.22 ( 20.31%)        0.44 (-60.11%)
System  max           219.92 (  0.00%)      204.36 (  7.08%)      409.81 (-86.35%)      271.56 (-23.48%)      199.07 (  9.48%)
Elapsed min          2017.05 (  0.00%)     1827.70 (  9.39%)     3701.00 (-83.49%)     3749.00 (-85.87%)     2904.36 (-43.99%)
Elapsed mean         2018.83 (  0.00%)     1830.72 (  9.32%)     3703.20 (-83.43%)     3750.20 (-85.76%)     2919.33 (-44.60%)
Elapsed stddev          1.79 (  0.00%)        2.18 (-21.93%)        1.47 ( 17.90%)        0.75 ( 58.20%)        7.66 (-328.12%)
Elapsed coeffvar        0.09 (  0.00%)        0.12 (-34.46%)        0.04 ( 55.24%)        0.02 ( 77.50%)        0.26 (-196.07%)
Elapsed max          2021.41 (  0.00%)     1833.91 (  9.28%)     3705.00 (-83.29%)     3751.00 (-85.56%)     2926.13 (-44.76%)
CPU     min            99.00 (  0.00%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)       99.00 (  0.00%)
CPU     mean           99.00 (  0.00%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)       99.00 (  0.00%)
CPU     stddev          0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
CPU     coeffvar        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
CPU     max            99.00 (  0.00%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)       99.00 (  0.00%)

               4.4.0       4.5.0       4.6.0  4.11.0-rc5  4.11.0-rc5
             vanilla     vanilla     vanilla     vanilla revert-v1r1
User        10819.50     9790.02    19914.22    21021.12    16392.80
System       1327.78     1234.01     2465.45     1635.85     1197.03
Elapsed     12138.54    11008.49    22247.35    22528.79    17543.60

This shows the user and system CPU usage as well as the elapsed time to
run a single iteration of the git test suite, with total times in the bottom
report. Overall the test takes over 3 hours longer moving from 4.4 to 4.11-rc5
and reverting the commit does not fully address the problem. It's doing
a warmup run whose results are discarded and then 5 iterations.

The test shows it took 2018 seconds on average to complete a single iteration
on 4.4 and 3750 seconds to complete on 4.11-rc5. The major drop is between
4.5 and 4.6 where it went from 1830 seconds to 3703 seconds and has not
recovered. A bisection was clean and pointed to the commit mentioned above.

The results show that it's not the only source as a revert (last column)
doesn't fix the damage although it goes from 3750 seconds (4.11-rc5 vanilla)
to 2919 seconds (with a revert).

The machine is a relatively old desktop-class machine with an i7-3770 CPU @
3.40GHz (IvyBridge). It is definitely using intel_pstate:

analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.60 GHz - 3.90 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 1.60 GHz and 3.90 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 1.60 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes
    3700 MHz max turbo 4 active cores
    3800 MHz max turbo 3 active cores
    3900 MHz max turbo 2 active cores
    3900 MHz max turbo 1 active cores

No special boot parameters are specified.
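
For anyone wanting to cross-check a similar machine, the driver and limits
can also be read straight from sysfs (a sketch, assuming the standard
cpufreq/intel_pstate sysfs layout; cpupower frequency-info prints a summary
much like the block above):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver     # intel_pstate here
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # powersave here
grep . /sys/devices/system/cpu/intel_pstate/*_perf_pct \
       /sys/devices/system/cpu/intel_pstate/no_turbo        # global limits
cpupower frequency-info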

I didn't poke around too much as the last time I tried there were too
many conflicting opinions and requirements, so here are the observations:

CPU usage is roughly 10% for the full duration of the test.
Context switch and interrupt activity is not altered by the revert although it has changed substantially since 4.4.
turbostat confirms that busy time is roughly 10% across the whole machine.
turbostat shows that average MHz is roughly halved in 4.11-rc5-vanilla versus 4.4.
turbostat shows that average MHz is slightly higher with the revert applied.
The benchmark in question is doing IO but not a lot, mostly below 100K/sec writes with small bursts of 6000K/sec.
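
The turbostat figures are whole-machine averages; something along these
lines reproduces that kind of sampling (a sketch, where run-one-iteration.sh
stands in for whatever launches a single iteration of the benchmark):

turbostat --interval 10             # periodic whole-machine summary
turbostat ./run-one-iteration.sh    # or wrap one iteration and report averages
# Busy% corresponds to the ~10% utilisation above and Avg_MHz to the average
# frequency that roughly halves between 4.4 and 4.11-rc5 vanilla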

CONFIG_CPU_FREQ_GOV_SCHEDUTIL is *NOT* set. This is deliberate as when
I evaluated schedutil shortly after it was merged, I found that at best
it performed comparably with the old code across a range of workloads
and machines while having higher system CPU usage. I know a lot of
the recent work has been schedutil-focused but I could find no patch in
recent discussions that might be relevant to this problem. I've not looked
at schedutil recently but not everyone will be switching to it so the old
setup is still relevant.
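
For completeness, which governors the running kernel was built with can be
double-checked with something like the following (the config file location
is distro-dependent):

grep CPU_FREQ_GOV /boot/config-$(uname -r)
zgrep CPU_FREQ_GOV /proc/config.gz    # alternative, if CONFIG_IKCONFIG_PROC is enabled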

While I accept the logic that CPUs should not remain at the highest
frequency if completely idle for prolonged periods of time, it appears to
be too aggressive on older CPUs. Low utilisation tasks should still be able
to get to the higher frequencies for the short bursts they are active for.

I hope the data and the bisection are enough to give some ideas on how
it can be addressed without impacting Haswell and Jorg's setup that the
commit was originally intended for.

-- 
Mel Gorman
SUSE Labs


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-10  8:41 Performance of low-cpu utilisation benchmark regressed severely since 4.6 Mel Gorman
@ 2017-04-10 20:51 ` Rafael J. Wysocki
  2017-04-11 10:02   ` Mel Gorman
  2017-04-11 15:41   ` Doug Smythies
  0 siblings, 2 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2017-04-10 20:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rafael Wysocki, Jörg Otte, Linux Kernel Mailing List,
	Linux PM, Srinivas Pandruvada, Doug Smythies

Hi Mel,

On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman
<mgorman@techsingularity.net> wrote:
> Hi Rafael,
>
> Since kernel 4.6, performance of low CPU intensity workloads has dropped
> severely.  netperf UDP_STREAM, which runs at about 15-20% CPU utilisation,
> has regressed about 10% relative to 4.4, and TCP_STREAM about 6-9%. sockperf
> shows similar utilisation figures but I won't go into these in detail as they
> were running loopback and are sensitive to a lot of factors.
>
> It's far more obvious when looking at the git test suite and the length
> of time it takes to run. This is a shellscript and git intensive workload
> whose CPU utilisation is very low but is less sensitive to multiple
> factors than netperf and sockperf.

First, thanks for the data.

Nobody has reported anything similar to these results so far.

> Bisection indicates that the regression started with commit ffb810563c0c
> ("intel_pstate: Avoid getting stuck in high P-states when idle").  However,
> it's no longer the only relevant commit as the following results will show

Well, that was an attempt to salvage the "Core" P-state selection
algorithm which is problematic overall and reverting this now would
reintroduce the issue addressed by it, unfortunately.

>                                  4.4.0                 4.5.0                 4.6.0            4.11.0-rc5            4.11.0-rc5
>                                vanilla               vanilla               vanilla               vanilla           revert-v1r1
> User    min          1786.44 (  0.00%)     1613.72 (  9.67%)     3302.19 (-84.85%)     3487.46 (-95.22%)     2701.84 (-51.24%)
> User    mean         1788.35 (  0.00%)     1616.47 (  9.61%)     3304.14 (-84.76%)     3488.12 (-95.05%)     2715.80 (-51.86%)
> User    stddev          1.43 (  0.00%)        1.75 (-21.84%)        1.12 ( 22.10%)        0.57 ( 60.14%)        7.13 (-397.62%)
> User    coeffvar        0.08 (  0.00%)        0.11 (-34.80%)        0.03 ( 57.83%)        0.02 ( 79.56%)        0.26 (-227.68%)
> User    max          1790.14 (  0.00%)     1618.73 (  9.58%)     3305.40 (-84.64%)     3489.01 (-94.90%)     2721.66 (-52.04%)
> System  min           218.44 (  0.00%)      202.58 (  7.26%)      407.51 (-86.55%)      269.92 (-23.57%)      196.85 (  9.88%)
> System  mean          219.05 (  0.00%)      203.62 (  7.04%)      408.38 (-86.43%)      270.83 (-23.64%)      197.99 (  9.61%)
> System  stddev          0.60 (  0.00%)        0.64 ( -6.30%)        0.77 (-28.89%)        0.59 (  1.47%)        0.87 (-44.72%)
> System  coeffvar        0.27 (  0.00%)        0.31 (-14.35%)        0.19 ( 30.86%)        0.22 ( 20.31%)        0.44 (-60.11%)
> System  max           219.92 (  0.00%)      204.36 (  7.08%)      409.81 (-86.35%)      271.56 (-23.48%)      199.07 (  9.48%)
> Elapsed min          2017.05 (  0.00%)     1827.70 (  9.39%)     3701.00 (-83.49%)     3749.00 (-85.87%)     2904.36 (-43.99%)
> Elapsed mean         2018.83 (  0.00%)     1830.72 (  9.32%)     3703.20 (-83.43%)     3750.20 (-85.76%)     2919.33 (-44.60%)
> Elapsed stddev          1.79 (  0.00%)        2.18 (-21.93%)        1.47 ( 17.90%)        0.75 ( 58.20%)        7.66 (-328.12%)
> Elapsed coeffvar        0.09 (  0.00%)        0.12 (-34.46%)        0.04 ( 55.24%)        0.02 ( 77.50%)        0.26 (-196.07%)
> Elapsed max          2021.41 (  0.00%)     1833.91 (  9.28%)     3705.00 (-83.29%)     3751.00 (-85.56%)     2926.13 (-44.76%)
> CPU     min            99.00 (  0.00%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)       99.00 (  0.00%)
> CPU     mean           99.00 (  0.00%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)       99.00 (  0.00%)
> CPU     stddev          0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> CPU     coeffvar        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> CPU     max            99.00 (  0.00%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)       99.00 (  0.00%)
>
>                4.4.0       4.5.0       4.6.0  4.11.0-rc5  4.11.0-rc5
>              vanilla     vanilla     vanilla     vanilla revert-v1r1
> User        10819.50     9790.02    19914.22    21021.12    16392.80
> System       1327.78     1234.01     2465.45     1635.85     1197.03
> Elapsed     12138.54    11008.49    22247.35    22528.79    17543.60

Well, yes, that doesn't look good. :-/

> This shows the user and system CPU usage as well as the elapsed time to
> run a single iteration of the git test suite, with total times in the bottom
> report. Overall the test takes over 3 hours longer moving from 4.4 to 4.11-rc5
> and reverting the commit does not fully address the problem. It's doing
> a warmup run whose results are discarded and then 5 iterations.
>
> The test shows it took 2018 seconds on average to complete a single iteration
> on 4.4 and 3750 seconds to complete on 4.11-rc5. The major drop is between
> 4.5 and 4.6 where it went from 1830 seconds to 3703 seconds and has not
> recovered. A bisection was clean and pointed to the commit mentioned above.
>
> The results show that it's not the only source as a revert (last column)
> doesn't fix the damage although it goes from 3750 seconds (4.11-rc5 vanilla)
> to 2919 seconds (with a revert).

OK

So if you revert the commit in question on top of 4.6.0, the numbers
go back to the 4.5.0 levels, right?

Anyway, as I said the "Core" P-state selection algorithm is sort of on
the way out and I think that we have a reasonable replacement for it.

Would it be viable to check what happens with
https://patchwork.kernel.org/patch/9640261/ applied?  Depending on the
ACPI system PM profile of the test machine, this is likely to cause it
to use the new algo.

I guess that you have a pstate_snb directory under /sys/kernel/debug/
(if this is where debugfs is mounted)?  It should not be there any
more with the new algo (as that does not use the PID controller any
more).
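
E.g. something like this should show it (a sketch, assuming debugfs is
mounted at /sys/kernel/debug):

ls /sys/kernel/debug/pstate_snb/
# With the old algorithm this typically lists the PID tunables (setpoint,
# deadband, p_gain_pct, i_gain_pct, d_gain_pct, sample_rate_ms); with the
# new algorithm the directory should be gone.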

> The machine is a relatively old desktop-class machine with a i7-3770 CPU @
> 3.40GHz (IvyBridge). It is definitely using intel_pstate
>
> analyzing CPU 0:
>   driver: intel_pstate
>   CPUs which run at the same hardware frequency: 0
>   CPUs which need to have their frequency coordinated by software: 0
>   maximum transition latency:  Cannot determine or is not supported.
>   hardware limits: 1.60 GHz - 3.90 GHz
>   available cpufreq governors: performance powersave
>   current policy: frequency should be within 1.60 GHz and 3.90 GHz.
>                   The governor "powersave" may decide which speed to use
>                   within this range.
>   current CPU frequency: 1.60 GHz (asserted by call to hardware)
>   boost state support:
>     Supported: yes
>     Active: yes
>     3700 MHz max turbo 4 active cores
>     3800 MHz max turbo 3 active cores
>     3900 MHz max turbo 2 active cores
>     3900 MHz max turbo 1 active cores
>
> No special boot parameters are specified.
>
> I didn't poke around too much as the last time I tried, there were too
> many conflicting opinions and requirements so here are the observations.
>
> CPU usage is roughly 10% for the full duration of the test.
> Context switch and interrupt activity is not altered by the revert although it has changed substantially since 4.4.
> turbostat confirms that busy time is roughly 10% across the whole machine.
> turbostat shows that average MHz is roughly halved in 4.11-rc5-vanilla versus 4.4.
> turbostat shows that average MHz is slightly higher with the revert applied.
> The benchmark in question is doing IO but not a lot, mostly below 100K/sec writes with small bursts of 6000K/sec.
>
> CONFIG_CPU_FREQ_GOV_SCHEDUTIL is *NOT* set. This is deliberate as when
> I evaluated schedutil shortly after it was merged, I found that at best
> it performed comparably with the old code across a range of workloads
> and machines while having higher system CPU usage. I know a lot of
> the recent work has been schedutil-focused but I could find no patch in
> recent discussions that might be relevant to this problem. I've not looked
> at schedutil recently but not everyone will be switching to it so the old
> setup is still relevant.

intel_pstate in the active mode (which you are using) is orthogonal to
schedutil.  It has its own P-state selection logic and that evidently
has changed to affect the workload.

[BTW, I have posted a documentation patch for intel_pstate, but it
applies to the code in linux-next ATM
(https://patchwork.kernel.org/patch/9655107/).  It is worth looking at
anyway I think, though.]

At this point I'm not sure what has changed in addition to the commit
you have found and while this is sort of interesting, I'm not sure how
relevant it is.

Unfortunately, the P-state selection algorithm used so far on your
test system is quite fundamentally unstable and tends to converge to
either the highest or the lowest P-state in various conditions.  If
the workload is sufficiently "light", it generally ends up in the
minimum P-state most of the time which probably happens here.

I would really not like to try to "fix" that algorithm as this is
pretty much hopeless and most likely will lead to regressions
elsewhere.  Instead, I'd prefer to migrate away from it altogether and
then tune things so that they work for everybody reasonably well
(which should be doable with the new algorithm).  But let's see how
far we can get with that.

> While I accept the logic that CPUs should not remain at the highest
> frequency if completely idle for prolonged periods of time, it appears to
> be too aggressive on older CPUs. Low utilisation tasks should still be able
> to get to the higher frequencies for the short bursts they are active for.

Totally agreed.

> I hope the data and the bisection is enough to have some ideas on how
> it can be addressed without impacting Haswell and Jorg's setup that the
> commit was originally intended for.

Well, as I said. :-)

Cheers,
Rafael


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-10 20:51 ` Rafael J. Wysocki
@ 2017-04-11 10:02   ` Mel Gorman
  2017-04-21  0:52     ` Rafael J. Wysocki
  2017-04-11 15:41   ` Doug Smythies
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2017-04-11 10:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael Wysocki, Jörg Otte, Linux Kernel Mailing List,
	Linux PM, Srinivas Pandruvada, Doug Smythies

On Mon, Apr 10, 2017 at 10:51:38PM +0200, Rafael J. Wysocki wrote:
> Hi Mel,
> 
> On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman
> <mgorman@techsingularity.net> wrote:
> > Hi Rafael,
> >
> > Since kernel 4.6, performance of low CPU intensity workloads has dropped
> > severely.  netperf UDP_STREAM, which runs at about 15-20% CPU utilisation,
> > has regressed about 10% relative to 4.4, and TCP_STREAM about 6-9%. sockperf
> > shows similar utilisation figures but I won't go into these in detail as they
> > were running loopback and are sensitive to a lot of factors.
> >
> > It's far more obvious when looking at the git test suite and the length
> > of time it takes to run. This is a shellscript and git intensive workload
> > whose CPU utilisation is very low but is less sensitive to multiple
> > factors than netperf and sockperf.
> 
> First, thanks for the data.
> 
> Nobody has reported anything similar to these results so far.
> 

It's possible that it's due to the CPU being IvyBridge or it may be due
to the fact that people don't spot problems with low CPU utilisation
workloads.

> > Bisection indicates that the regression started with commit ffb810563c0c
> > ("intel_pstate: Avoid getting stuck in high P-states when idle").  However,
> > it's no longer the only relevant commit as the following results will show
> 
> Well, that was an attempt to salvage the "Core" P-state selection
> algorithm which is problematic overall and reverting this now would
> reintroduce the issue addressed by it, unfortunately.
> 

I'm not suggesting that we should revert this patch. I accept that it
would reintroduce the regression reported by Jorg, if nothing else.

> > This shows the user and system CPU usage as well as the elapsed time to
> > run a single iteration of the git test suite, with total times in the bottom
> > report. Overall the test takes over 3 hours longer moving from 4.4 to 4.11-rc5
> > and reverting the commit does not fully address the problem. It's doing
> > a warmup run whose results are discarded and then 5 iterations.
> >
> > The test shows it took 2018 seconds on average to complete a single iteration
> > on 4.4 and 3750 seconds to complete on 4.11-rc5. The major drop is between
> > 4.5 and 4.6 where it went from 1830 seconds to 3703 seconds and has not
> > recovered. A bisection was clean and pointed to the commit mentioned above.
> >
> > The results show that it's not the only source as a revert (last column)
> > doesn't fix the damage although it goes from 3750 seconds (4.11-rc5 vanilla)
> > to 2919 seconds (with a revert).
> 
> OK
> 
> So if you revert the commit in question on top of 4.6.0, the numbers
> go back to the 4.5.0 levels, right?
> 

Not quite, it restores a lot of the performance but not all.

> Anyway, as I said the "Core" P-state selection algorithm is sort of on
> the way out and I think that we have a reasonable replacement for it.
> 
> Would it be viable to check what happens with
> https://patchwork.kernel.org/patch/9640261/ applied?  Depending on the
> ACPI system PM profile of the test machine, this is likely to cause it
> to use the new algo.
> 

Yes. The following is a comparison using 4.5 as a baseline, as it is the
best-performing kernel and using it reduces the width of the table.


gitsource
                                 4.5.0                 4.6.0                 4.6.0            4.11.0-rc5            4.11.0-rc5
                               vanilla               vanilla      revert-v4.6-v1r1               vanilla        loadbased-v1r1
User    min          1613.72 (  0.00%)     3302.19 (-104.63%)     1935.46 (-19.94%)     3487.46 (-116.11%)     2296.87 (-42.33%)
User    mean         1616.47 (  0.00%)     3304.14 (-104.40%)     1937.83 (-19.88%)     3488.12 (-115.79%)     2299.33 (-42.24%)
User    stddev          1.75 (  0.00%)        1.12 ( 36.06%)        1.42 ( 18.54%)        0.57 ( 67.28%)        1.79 ( -2.73%)
User    coeffvar        0.11 (  0.00%)        0.03 ( 68.72%)        0.07 ( 32.05%)        0.02 ( 84.84%)        0.08 ( 27.78%)
User    max          1618.73 (  0.00%)     3305.40 (-104.20%)     1939.84 (-19.84%)     3489.01 (-115.54%)     2302.01 (-42.21%)
System  min           202.58 (  0.00%)      407.51 (-101.16%)      244.03 (-20.46%)      269.92 (-33.24%)      203.79 ( -0.60%)
System  mean          203.62 (  0.00%)      408.38 (-100.56%)      245.24 (-20.44%)      270.83 (-33.01%)      205.19 ( -0.77%)
System  stddev          0.64 (  0.00%)        0.77 (-21.25%)        0.97 (-52.52%)        0.59 (  7.31%)        0.75 (-18.12%)
System  coeffvar        0.31 (  0.00%)        0.19 ( 39.54%)        0.40 (-26.64%)        0.22 ( 30.31%)        0.37 (-17.21%)
System  max           204.36 (  0.00%)      409.81 (-100.53%)      246.85 (-20.79%)      271.56 (-32.88%)      206.06 ( -0.83%)
Elapsed min          1827.70 (  0.00%)     3701.00 (-102.49%)     2186.22 (-19.62%)     3749.00 (-105.12%)     2501.05 (-36.84%)
Elapsed mean         1830.72 (  0.00%)     3703.20 (-102.28%)     2190.03 (-19.63%)     3750.20 (-104.85%)     2503.27 (-36.74%)
Elapsed stddev          2.18 (  0.00%)        1.47 ( 32.67%)        2.25 ( -3.23%)        0.75 ( 65.72%)        1.28 ( 41.43%)
Elapsed coeffvar        0.12 (  0.00%)        0.04 ( 66.71%)        0.10 ( 13.71%)        0.02 ( 83.26%)        0.05 ( 57.16%)
Elapsed max          1833.91 (  0.00%)     3705.00 (-102.03%)     2193.26 (-19.59%)     3751.00 (-104.54%)     2504.54 (-36.57%)
CPU     min            99.00 (  0.00%)      100.00 ( -1.01%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)
CPU     mean           99.00 (  0.00%)      100.00 ( -1.01%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)
CPU     stddev          0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
CPU     coeffvar        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
CPU     max            99.00 (  0.00%)      100.00 ( -1.01%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)

               4.5.0       4.6.0             4.6.0  4.11.0-rc5      4.11.0-rc5
             vanilla     vanilla  revert-v4.6-v1r1     vanilla  loadbased-v1r1
User         9790.02    19914.22          11713.58    21021.12        13888.63
System       1234.01     2465.45           1485.99     1635.85         1242.37
Elapsed     11008.49    22247.35          13162.72    22528.79        15044.76

As you can see, 4.6 runs twice as long as 4.5 (3703 seconds to
complete vs 1830 seconds). Reverting (revert-v4.6-v1r1) restores some of
the performance and is 19.63% slower on average. 4.11-rc5 is as bad as
4.6, but with your patch applied it runs for 2503 seconds (36.74% slower).
This is still pretty bad but it's a big step in the right direction.

> I guess that you have a pstate_snb directory under /sys/kernel/debug/
> (if this is where debugfs is mounted)?  It should not be there any
> more with the new algo (as that does not use the PID controller any
> more).
> 

Yes.

> > <SNIP>
> > CONFIG_CPU_FREQ_GOV_SCHEDUTIL is *NOT* set. This is deliberate as when
> > I evaluated schedutil shortly after it was merged, I found that at best
> > it performed comparably with the old code across a range of workloads
> > and machines while having higher system CPU usage. I know a lot of
> > the recent work has been schedutil-focused but I could find no patch in
> > recent discussions that might be relevant to this problem. I've not looked
> > at schedutil recently but not everyone will be switching to it so the old
> > setup is still relevant.
> 
> intel_pstate in the active mode (which you are using) is orthogonal to
> schedutil.  It has its own P-state selection logic and that evidently
> has changed to affect the workload.
> 

Understood.

> [BTW, I have posted a documentation patch for intel_pstate, but it
> applies to the code in linux-next ATM
> (https://patchwork.kernel.org/patch/9655107/).  It is worth looking at
> anyway I think, though.]
> 

Ok, this is helpful for getting a better handle on intel_pstate in
general. Thanks.

> At this point I'm not sure what has changed in addition to the commit
> you have found and while this is sort of interesting, I'm not sure how
> relevant it is.
> 
> Unfortunately, the P-state selection algorithm used so far on your
> test system is quite fundamentally unstable and tends to converge to
> either the highest or the lowest P-state in various conditions.  If
> the workload is sufficiently "light", it generally ends up in the
> minimum P-state most of the time which probably happens here.
> 
> I would really not like to try to "fix" that algorithm as this is
> pretty much hopeless and most likely will lead to regressions
> elsewhere.  Instead, I'd prefer to migrate away from it altogether and
> then tune things so that they work for everybody reasonably well
> (which should be doable with the new algorithm).  But let's see how
> far we can get with that.
> 

Other than altering min_perf_pct, is there a way of tuning intel_pstate
such that it delays entering lower p-states for longer? It would
increase power consumption but at least it would be an option for
low-utilisation workloads and probably beneficial in general for those
that need to reduce the latency of wakeups while still allowing at least the
C1 state.
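
For reference, the min_perf_pct tweak I have in mind is just raising the
global floor, along these lines (the value is illustrative; the default on
this machine works out to roughly 41, i.e. the 1.6 GHz minimum over the
3.9 GHz turbo maximum):

# raise the minimum P-state floor for all CPUs (needs root)
echo 60 > /sys/devices/system/cpu/intel_pstate/min_perf_pct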

-- 
Mel Gorman
SUSE Labs


* RE: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-10 20:51 ` Rafael J. Wysocki
  2017-04-11 10:02   ` Mel Gorman
@ 2017-04-11 15:41   ` Doug Smythies
  2017-04-11 16:42     ` Mel Gorman
  2017-04-14 23:01     ` Doug Smythies
  1 sibling, 2 replies; 21+ messages in thread
From: Doug Smythies @ 2017-04-11 15:41 UTC (permalink / raw)
  To: 'Mel Gorman'
  Cc: 'Rafael Wysocki', 'Jörg Otte',
	'Linux Kernel Mailing List', 'Linux PM',
	'Srinivas Pandruvada', 'Rafael J. Wysocki'

On 2017.04.11 03:03 Mel Gorman wrote:
>On Mon, Apr 10, 2017 at 10:51:38PM +0200, Rafael J. Wysocki wrote:
>> On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman wrote:
>>>
>>> It's far more obvious when looking at the git test suite and the length
>>> of time it takes to run. This is a shellscript and git intensive workload
>>> whose CPU utilisation is very low but is less sensitive to multiple
>>> factors than netperf and sockperf.
>> 

I would like to repeat your tests on my test computer (i7-2600K).
I am not familiar with, and have not been able to find,
"the git test suite" shellscript. Could you point me to it?

... Doug


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-11 15:41   ` Doug Smythies
@ 2017-04-11 16:42     ` Mel Gorman
  2017-04-14 23:01     ` Doug Smythies
  1 sibling, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2017-04-11 16:42 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael Wysocki', 'Jörg Otte',
	'Linux Kernel Mailing List', 'Linux PM',
	'Srinivas Pandruvada', 'Rafael J. Wysocki'

On Tue, Apr 11, 2017 at 08:41:09AM -0700, Doug Smythies wrote:
> On 2017.04.11 03:03 Mel Gorman wrote:
> >On Mon, Apr 10, 2017 at 10:51:38PM +0200, Rafael J. Wysocki wrote:
> >> On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman wrote:
> >>>
> >>> It's far more obvious when looking at the git test suite and the length
> >>> of time it takes to run. This is a shellscript and git intensive workload
> >>> whose CPU utilisation is very low but is less sensitive to multiple
> >>> factors than netperf and sockperf.
> >> 
> 
> I would like to repeat your tests on my test computer (i7-2600K).
> I am not familiar with, and have not been able to find,
> "the git test suite" shellscript. Could you point me to it?
> 

If you want to use git source directly do a checkout from
https://github.com/git/git and build it. The core "benchmark" is make
test and timing it.
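
In other words, roughly the following (a sketch; -j8 and the iteration
count are arbitrary):

git clone https://github.com/git/git
cd git
make -j8
make test > /dev/null 2>&1          # warmup run, results discarded
for i in 1 2 3 4 5; do
	/usr/bin/time -f "%e elapsed %U user %S system" make test > /dev/null
done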

The way I'm doing it is via mmtests so

git clone https://github.com/gormanm/mmtests
cd mmtests
./run-mmtests --no-monitor --config configs/config-global-dhp__workload_shellscripts test-run-1
cd work/log
../../compare-kernels.sh | less

and it'll generate a similar report to what I posted in this email
thread. If you do multiple tests with different kernels then change the
name of "test-run-1" to preserve the old data. compare-kernels.sh will
compare whatever results you have.

Thanks for taking a look.

-- 
Mel Gorman
SUSE Labs


* RE: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-11 15:41   ` Doug Smythies
  2017-04-11 16:42     ` Mel Gorman
@ 2017-04-14 23:01     ` Doug Smythies
  2017-04-19  8:15       ` Mel Gorman
  2017-04-20 14:55       ` Doug Smythies
  1 sibling, 2 replies; 21+ messages in thread
From: Doug Smythies @ 2017-04-14 23:01 UTC (permalink / raw)
  To: 'Mel Gorman'
  Cc: 'Rafael Wysocki', 'Jörg Otte',
	'Linux Kernel Mailing List', 'Linux PM',
	'Srinivas Pandruvada', 'Rafael J. Wysocki',
	Doug Smythies

Hi Mel,

Thanks for the "how to" information.
This is a very interesting use case.
From trace data, I see a lot of minimal durations with
virtually no load on the CPU, typically more consistent
with some type of light duty periodic (~~100 Hz) work flow
(where we would prefer to not ramp up frequencies, or more
accurately keep them ramped up).
My results (further below) are different than yours, sometimes
dramatically, but the trends are similar.
I have nothing to add about the control algorithm over what
Rafael already said.

On 2017.04.11 09:42 Mel Gorman wrote:
> On Tue, Apr 11, 2017 at 08:41:09AM -0700, Doug Smythies wrote:
>> On 2017.04.11 03:03 Mel Gorman wrote:
>>>On Mon, Apr 10, 2017 at 10:51:38PM +0200, Rafael J. Wysocki wrote:
>>>> On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman wrote:
>>>>>
>>>>> It's far more obvious when looking at the git test suite and the length
>>>>> of time it takes to run. This is a shellscript and git intensive workload
>>>>> whose CPU utilisation is very low but is less sensitive to multiple
>>>>> factors than netperf and sockperf.
>>>> 
>> 
>> I would like to repeat your tests on my test computer (i7-2600K).
>> I am not familiar with, and have not been able to find,
>> "the git test suite" shellscript. Could you point me to it?
>>
>
> If you want to use git source directly do a checkout from
> https://github.com/git/git and build it. The core "benchmark" is make
> test and timing it.

Because I had troubles with your method further below, I also did
this method. I did 5 runs, after a throw away run, similar to what
you do (and I could see the need for a throw away pass).

Results (there is something wrong with user and system times and CPU%
in kernel 4.5, so I only calculated Elapsed differences):

Linux s15 4.5.0-stock #232 SMP Tue Apr 11 23:54:49 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
... test_run: start 5 runs ...
327.04user 122.08system 33:57.81elapsed (2037.81 : reference) 22%CPU
... test_run: done ...

Linux s15 4.11.0-rc6-stock #231 SMP Mon Apr 10 08:29:29 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux

intel_pstate - powersave
... test_run: start 5 runs ...
1518.71user 552.87system 39:24.45elapsed (2364.45 : -16.03%) 87%CPU
... test_run: done ...

intel_pstate - performance (fast reference)
... test_run: start 5 runs ...
1160.52user 291.33system 29:36.05elapsed (1776.05 : 12.85%) 81%CPU
... test_run: done ...

intel_cpufreq - powersave (slow reference)
... test_run: start 5 runs ...
2165.72user 1049.18system 57:12.77elapsed (3432.77 : -68.45%) 93%CPU
... test_run: done ...

intel_cpufreq - ondemand
... test_run: start 5 runs ...
1776.79user 808.65system 47:14.74elapsed (2834.74 : -39.11%) 91%CPU

intel_cpufreq - schedutil
... test_run: start 5 runs ...
2049.28user 1028.70system 54:57.82elapsed (3297.82 : -61.83%) 93%CPU
... test_run: done ...

Linux s15 4.11.0-rc6-revert #233 SMP Wed Apr 12 15:30:19 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
... test_run: start 5 runs ...
1295.30user 365.98system 32:50.15elapsed (1970.15 : 3.32%) 84%CPU
... test_run: done ...

> The way I'm doing it is via mmtests so
>
> git clone https://github.com/gormanm/mmtests
> cd mmtests
> ./run-mmtests --no-monitor --config configs/config-global-dhp__workload_shellscripts test-run-1
> cd work/log
> ../../compare-kernels.sh | less
>
> and it'll generate a similar report to what I posted in this email
> thread. If you do multiple tests with different kernels then change the
> name of "test-run-1" to preserve the old data. compare-kernel.sh will
> compare whatever results you have.

         k4.5    k4.11-rc6         k4.11-rc6         k4.11-rc6          k4.11-rc6         k4.11-rc6         k4.11-rc6
                                   performance       pass-ps            pass-od           pass-su           revert
E min    388.71  456.51 (-17.44%)  342.81 ( 11.81%)  668.79 (-72.05%)   552.85 (-42.23%)  646.96 (-66.44%)  375.08 (  3.51%)
E mean   389.74  458.52 (-17.65%)  343.81 ( 11.78%)  669.42 (-71.76%)   553.45 (-42.01%)  647.95 (-66.25%)  375.98 (  3.53%)
E stddev   0.85    1.64 (-92.78%)    0.67 ( 20.83%)    0.41 ( 52.25%)     0.31 ( 64.00%)    0.68 ( 20.35%)    0.46 ( 46.00%)
E coeffvar 0.22    0.36 (-63.86%)    0.20 ( 10.25%)    0.06 ( 72.20%)     0.06 ( 74.65%)    0.10 ( 52.09%)    0.12 ( 44.03%)
E max    390.90  461.47 (-18.05%)  344.83 ( 11.79%)  669.91 (-71.38%)   553.68 (-41.64%)  648.75 (-65.96%)  376.37 (  3.72%)

E = Elapsed (squished in an attempt to prevent line length wrapping when I send)

           k4.5   k4.11-rc6   k4.11-rc6   k4.11-rc6   k4.11-rc6   k4.11-rc6   k4.11-rc6
                            performance     pass-ps     pass-od     pass-su      revert
User     347.26     1801.56     1398.76     2540.67     2106.30     2434.06     1536.80
System   139.01      701.87      366.59     1346.75     1026.67     1322.39      449.81
Elapsed 2346.77     2761.20     2062.12     4017.47     3321.10     3887.19     2268.90

Legend:
blank  = active mode: intel_pstate - powersave
performance = active mode: intel_pstate - performance (fast reference)
pass-ps = passive mode: intel_cpufreq - powersave (slow reference)
pass-od = passive mode: intel_cpufreq - ondemand
pass-su = passive mode: intel_cpufreq - schedutil
revert = active mode: intel_pstate - powersave with commit ffb810563c0c reverted.
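
(For anyone reproducing the passive rows: boot with intel_pstate=passive on
the kernel command line so the intel_cpufreq driver and the generic
governors are used, then pick the governor, roughly like this:)

for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
	echo ondemand > $c              # or powersave / schedutil
done
# equivalently: cpupower frequency-set -g ondemand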

I deleted the user, system, and CPU rows, because they don't make any sense.

I do not know why the tests run overall so much faster on my computer,
I can only assume I have something wrong in my installation of your mmtests.
I do see mmtests looking for some packages which it can not find.

Mel wrote:
> The results show that it's not the only source as a revert (last column)
> doesn't fix the damage although it goes from 3750 seconds (4.11-rc5 vanilla)
> to 2919 seconds (with a revert).

In my case, the reverted code ran faster than the kernel 4.5 code.

The other big difference is between Kernel 4.5 and 4.11-rc5 you got
-102.28% elapsed time, whereas I got -16.03% with method 1 and
-17.65% with method 2 (well, between 4.5 and 4.11-rc6 in my case).
I only get -93.28% and -94.82% difference between my fast and slow reference
tests (albeit on the same kernel).

CPU stuff:
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Min pstate = 16
Max pstate = 38
MSR_TURBO_RATIO_LIMIT: 0x23242526
35 * 100.0 = 3500.0 MHz max turbo 4 active cores
36 * 100.0 = 3600.0 MHz max turbo 3 active cores
37 * 100.0 = 3700.0 MHz max turbo 2 active cores
38 * 100.0 = 3800.0 MHz max turbo 1 active cores
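
(The turbo table is decoded from MSR_TURBO_RATIO_LIMIT (0x1AD); with
msr-tools it can be read back like this:)

modprobe msr
rdmsr -p 0 0x1ad        # prints 23242526 here
# The low byte (0x26 = 38) is the 1-active-core ratio, the next byte
# (0x25 = 37) the 2-core ratio and so on, in units of the 100 MHz bus clock.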

... Doug


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-14 23:01     ` Doug Smythies
@ 2017-04-19  8:15       ` Mel Gorman
  2017-04-21  1:12         ` Rafael J. Wysocki
  2017-04-20 14:55       ` Doug Smythies
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2017-04-19  8:15 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael Wysocki', 'Jörg Otte',
	'Linux Kernel Mailing List', 'Linux PM',
	'Srinivas Pandruvada', 'Rafael J. Wysocki'

On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
> Hi Mel,
> 
> Thanks for the "how to" information.
> This is a very interesting use case.
> From trace data, I see a lot of minimal durations with
> virtually no load on the CPU, typically more consistent
> with some type of light duty periodic (~~100 Hz) work flow
> (where we would prefer to not ramp up frequencies, or more
> accurately keep them ramped up).

This broadly matches my expectations in terms of behaviour. It is a
low duty workload but while I accept that a laptop may not want the
frequencies to ramp up, it's not universally true. Long periods at low
frequency to complete a workload is not necessarily better than using a
high frequency to race to idle. Effectively, a low utilisation test suite
could be considered as a "foreground task of high priority" and not a
"background task of little interest".

> My results (further below) are different than yours, sometimes
> dramatically, but the trends are similar.

It's inevitable there would be some hardware based differences. The
machine I have appears to show an extreme case.

> I have nothing to add about the control algorithm over what
> Rafael already said.
> 
> On 2017.04.11 09:42 Mel Gorman wrote:
> > On Tue, Apr 11, 2017 at 08:41:09AM -0700, Doug Smythies wrote:
> >> On 2017.04.11 03:03 Mel Gorman wrote:
> >>>On Mon, Apr 10, 2017 at 10:51:38PM +0200, Rafael J. Wysocki wrote:
> >>>> On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman wrote:
> >>>>>
> >>>>> It's far more obvious when looking at the git test suite and the length
> >>>>> of time it takes to run. This is a shellscript and git intensive workload
> >>>>> whose CPU utilisation is very low but is less sensitive to multiple
> >>>>> factors than netperf and sockperf.
> >>>> 
> >> 
> >> I would like to repeat your tests on my test computer (i7-2600K).
> >> I am not familiar with, and have not been able to find,
> >> "the git test suite" shellscript. Could you point me to it?
> >>
> >
> > If you want to use git source directly do a checkout from
> > https://github.com/git/git and build it. The core "benchmark" is make
> > test and timing it.
> 
> Because I had troubles with your method further below, I also did
> this method. I did 5 runs, after a throw away run, similar to what
> you do (and I could see the need for a throw away pass).
> 

Yeah, at the very least IO effects should be eliminated.

> Results (there is something wrong with user and system times and CPU%
> in kernel 4.5, so I only calculated Elapsed differences):
> 

In case it matters, the User and System CPU times are reported as standard
for these classes of workload by mmtests even though it's not necessarily
universally interesting. Generally, I consider the elapsed time to
be the most important but often, a major change in system CPU time is
interesting. That's not universally true as there have been changes in how
system CPU is calculated in the past and it's sensitive to Kconfig options
with VIRT_CPU_ACCOUNTING_GEN being a notable source of confusion in the past.

> Linux s15 4.5.0-stock #232 SMP Tue Apr 11 23:54:49 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
> ... test_run: start 5 runs ...
> 327.04user 122.08system 33:57.81elapsed (2037.81 : reference) 22%CPU
> ... test_run: done ...
> 
> Linux s15 4.11.0-rc6-stock #231 SMP Mon Apr 10 08:29:29 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
> 
> intel_pstate - powersave
> ... test_run: start 5 runs ...
> 1518.71user 552.87system 39:24.45elapsed (2364.45 : -16.03%) 87%CPU
> ... test_run: done ...
> 
> intel_pstate - performance (fast reference)
> ... test_run: start 5 runs ...
> 1160.52user 291.33system 29:36.05elapsed (1776.05 : 12.85%) 81%CPU
> ... test_run: done ...
> 
> intel_cpufreq - powersave (slow reference)
> ... test_run: start 5 runs ...
> 2165.72user 1049.18system 57:12.77elapsed (3432.77 : -68.45%) 93%CPU
> ... test_run: done ...
> 
> intel_cpufreq - ondemand
> ... test_run: start 5 runs ...
> 1776.79user 808.65system 47:14.74elapsed (2834.74 : -39.11%) 91%CPU
> 

Nothing overly surprising there. It's been my observation that intel_pstate
is generally better than acpi_cpufreq, which is why it somewhat amuses me
that I still see suggestions to disable intel_pstate entirely, despite that
advice being based on much older kernels.

> intel_cpufreq - schedutil
> ... test_run: start 5 runs ...
> 2049.28user 1028.70system 54:57.82elapsed (3297.82 : -61.83%) 93%CPU
> ... test_run: done ...
> 

I'm mildly surprised at this. I had observed that schedutil is not great
but I don't recall seeing a result this bad.

> Linux s15 4.11.0-rc6-revert #233 SMP Wed Apr 12 15:30:19 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
> ... test_run: start 5 runs ...
> 1295.30user 365.98system 32:50.15elapsed (1970.15 : 3.32%) 84%CPU
> ... test_run: done ...
> 

And the revert does help albeit not being an option for reasons Rafael
covered.

> > The way I'm doing it is via mmtests so
> >
> > git clone https://github.com/gormanm/mmtests
> > cd mmtests
> > ./run-mmtests --no-monitor --config configs/config-global-dhp__workload_shellscripts test-run-1
> > cd work/log
> > ../../compare-kernels.sh | less
> >
> > and it'll generate a similar report to what I posted in this email
> > thread. If you do multiple tests with different kernels then change the
> > name of "test-run-1" to preserve the old data. compare-kernel.sh will
> > compare whatever results you have.
> 
>          k4.5    k4.11-rc6         k4.11-rc6         k4.11-rc6          k4.11-rc6         k4.11-rc6         k4.11-rc6
>                                    performance       pass-ps            pass-od           pass-su           revert
> E min    388.71  456.51 (-17.44%)  342.81 ( 11.81%)  668.79 (-72.05%)   552.85 (-42.23%)  646.96 (-66.44%)  375.08 (  3.51%)
> E mean   389.74  458.52 (-17.65%)  343.81 ( 11.78%)  669.42 (-71.76%)   553.45 (-42.01%)  647.95 (-66.25%)  375.98 (  3.53%)
> E stddev   0.85    1.64 (-92.78%)    0.67 ( 20.83%)    0.41 ( 52.25%)     0.31 ( 64.00%)    0.68 ( 20.35%)    0.46 ( 46.00%)
> E coeffvar 0.22    0.36 (-63.86%)    0.20 ( 10.25%)    0.06 ( 72.20%)     0.06 ( 74.65%)    0.10 ( 52.09%)    0.12 ( 44.03%)
> E max    390.90  461.47 (-18.05%)  344.83 ( 11.79%)  669.91 (-71.38%)   553.68 (-41.64%)  648.75 (-65.96%)  376.37 (  3.72%)
> 
> E = Elapsed (squished in an attempt to prevent line length wrapping when I send)
> 
>            k4.5   k4.11-rc6   k4.11-rc6   k4.11-rc6   k4.11-rc6   k4.11-rc6   k4.11-rc6
>                             performance     pass-ps     pass-od     pass-su      revert
> User     347.26     1801.56     1398.76     2540.67     2106.30     2434.06     1536.80
> System   139.01      701.87      366.59     1346.75     1026.67     1322.39      449.81
> Elapsed 2346.77     2761.20     2062.12     4017.47     3321.10     3887.19     2268.90
> 
> Legend:
> blank  = active mode: intel_pstate - powersave
> performance = active mode: intel_pstate - performance (fast reference)
> pass-ps = passive mode: intel_cpufreq - powersave (slow reference)
> pass-od = passive mode: intel_cpufreq - ondemand
> pass-su = passive mode: intel_cpufreq - schedutil
> revert = active mode: intel_pstate - powersave with commit ffb810563c0c reverted.
> 
> I deleted the user, system, and CPU rows, because they don't make any sense.
> 

User is particularly misleading. System can be very misleading between
kernel versions due to accounting differences so I'm ok with that.

> I do not know why the tests run overall so much faster on my computer,

Differences in CPU I imagine. I know the machine I'm reporting on is a
particularly bad example. I've seen other machines where the effect is
less severe.

> I can only assume I have something wrong in my installation of your mmtests.

No, I've seen results broadly similar to yours on other machines so I
don't think you have a methodology error.

> I do see mmtests looking for some packages which it can not find.
> 

That's not too unusual. The package names are based on opensuse naming
and that doesn't translate to other distributions. If you open
bin/install-depends, you'll see a hashmap near the top that maps some of
the names for redhat-based distributions and debian. It's not actively
maintained. You can either install the packages manually before the
test or update the mappings.

> Mel wrote:
> > The results show that it's not the only source as a revert (last column)
> > doesn't fix the damage although it goes from 3750 seconds (4.11-rc5 vanilla)
> > to 2919 seconds (with a revert).
> 
> In my case, the reverted code ran faster than the kernel 4.5 code.
> 
> The other big difference is between Kernel 4.5 and 4.11-rc5 you got
> -102.28% elapsed time, whereas I got -16.03% with method 1 and
> -17.65% with method 2 (well, between 4.5 and 4.11-rc6 in my case).
> I only get -93.28% and -94.82% difference between my fast and slow reference
> tests (albeit on the same kernel).
> 

I have no reason to believe this is a methodology error; it is more likely
due to a difference in CPU. Consider the following reports

http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/delboy/#gitsource
http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/ivy/#gitsource

The first one (delboy) shows a gain of 1.35% and it's only for 4.11
(the kernel shown is 4.11-rc1 with vmscan-related patches on top that do not
affect this test case) that there is a regression, of -17.51%, which is very
similar to yours. The CPU there is a Xeon E3-1230 v5.

The second report (ivy) is the machine I'm based the original complain
on and shows the large regression in elapsed time.

So, different CPUs have different behaviours which is no surprise at all
considering that at the very least, exit latencies will be different.
While there may not be a universally correct answer to how to do this
automatically, is it possible to tune intel_pstate such that it ramps up
quickly regardless of recent utilisation and reduces relatively slowly?
That would be better from a power consumption perspective than setting the
"performance" governor.

Thanks.

-- 
Mel Gorman
SUSE Labs


* RE: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-14 23:01     ` Doug Smythies
  2017-04-19  8:15       ` Mel Gorman
@ 2017-04-20 14:55       ` Doug Smythies
  2017-04-21  1:17         ` Rafael J. Wysocki
  2017-04-22  6:29         ` Doug Smythies
  1 sibling, 2 replies; 21+ messages in thread
From: Doug Smythies @ 2017-04-20 14:55 UTC (permalink / raw)
  To: 'Mel Gorman'
  Cc: 'Rafael Wysocki', 'Jörg Otte',
	'Linux Kernel Mailing List', 'Linux PM',
	'Srinivas Pandruvada', 'Rafael J. Wysocki',
	Doug Smythies

On 2017.04.19 01:16 Mel Gorman wrote:
> On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
>> Hi Mel,
>> 
>> Thanks for the "how to" information.
>> This is a very interesting use case.
>> From trace data, I see a lot of minimal durations with
>> virtually no load on the CPU, typically more consistent
>> with some type of light duty periodic (~~100 Hz) work flow
>> (where we would prefer to not ramp up frequencies, or more
>> accurately keep them ramped up).
>
> This broadly matches my expectations in terms of behaviour. It is a
> low duty workload but while I accept that a laptop may not want the
> frequencies to ramp up, it's not universally true.

Agreed.

> Long periods at low
> frequency to complete a workload is not necessarily better than using a
> high frequency to race to idle.

Agreed, but it is processor dependent. For example, with my older
i7-2700k processor I get the following package energies for
one loop (after the throw-away loop) of the test (method 1):

intel_cpufreq, powersave (lowest energy reference) 5876 Joules
intel_cpufreq, conservative 5927 Joules
intel_cpufreq, ondemand 6525 Joules
intel_cpufreq, schedutil 6049 Joules
             , performance (highest energy reference) 8105 Joules
intel_pstate, powersave 7044 Joules
intel_pstate, force the load based algorithm 6390 Joules
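
(In case anyone wants to reproduce the energy numbers: one way to read
package energy is the RAPL powercap interface, e.g., with run-one-loop.sh
standing in for a single loop of the test:)

E0=$(cat /sys/class/powercap/intel-rapl:0/energy_uj)
./run-one-loop.sh
E1=$(cat /sys/class/powercap/intel-rapl:0/energy_uj)
echo "$(( (E1 - E0) / 1000000 )) Joules"    # ignores counter wrap-around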

> Effectively, a low utilisation test suite
> could be considered as a "foreground task of high priority" and not a
> "background task of little interest".

I wouldn't know how to make the distinction.

>> My results (further below) are different than yours, sometimes
>> dramatically, but the trends are similar.
>
> It's inevitable there would be some hardware based differences. The
> machine I have appears to show an extreme case.

Agreed.

>> I have nothing to add about the control algorithm over what
>> Rafael already said.
>> 
>> On 2017.04.11 09:42 Mel Gorman wrote:
>>> On Tue, Apr 11, 2017 at 08:41:09AM -0700, Doug Smythies wrote:
>>>> On 2017.04.11 03:03 Mel Gorman wrote:
>>>>>On Mon, Apr 10, 2017 at 10:51:38PM +0200, Rafael J. Wysocki wrote:
>>>>>> On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman wrote:
>>>>>>>
>>>>>>> It's far more obvious when looking at the git test suite and the length
>>>>>>> of time it takes to run. This is a shellscript and git intensive workload
>>>>>>> whose CPU utilisation is very low but is less sensitive to multiple
>>>>>>> factors than netperf and sockperf.
>>>>>> 
>>>> 
>>>> I would like to repeat your tests on my test computer (i7-2600K).
>>>> I am not familiar with, and have not been able to find,
>>>> "the git test suite" shellscript. Could you point me to it?
>>>>
>>>
>>> If you want to use git source directly do a checkout from
>>> https://github.com/git/git and build it. The core "benchmark" is make
>>> test and timing it.
>> 
>> Because I had troubles with your method further below, I also did
>> this method. I did 5 runs, after a throw away run, similar to what
>> you do (and I could see the need for a throw away pass).
>> 
>
> Yeah, at the very least IO effects should be eliminated.
>
>> Results (there is something wrong with user and system times and CPU%
>> in kernel 4.5, so I only calculated Elapsed differences):
>> 
>
> In case it matters, the User and System CPU times are reported as standard
> for these classes of workload by mmtests even though it's not necessarily
> universally interesting. Generally, I consider the elapsed time to
> be the most important but often, a major change in system CPU time is
> interesting. That's not universally true as there have been changes in how
> system CPU is calculated in the past and it's sensitive to Kconfig options
> with VIRT_CPU_ACCOUNTING_GEN being a notable source of confusion in the past.
>
>> Linux s15 4.5.0-stock #232 SMP Tue Apr 11 23:54:49 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>> ... test_run: start 5 runs ...
>> 327.04user 122.08system 33:57.81elapsed (2037.81 : reference) 22%CPU
>> ... test_run: done ...
>> 
>> Linux s15 4.11.0-rc6-stock #231 SMP Mon Apr 10 08:29:29 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>> 
>> intel_pstate - powersave
>> ... test_run: start 5 runs ...
>> 1518.71user 552.87system 39:24.45elapsed (2364.45 : -16.03%) 87%CPU
>> ... test_run: done ...
>> 
>> intel_pstate - performance (fast reference)
>> ... test_run: start 5 runs ...
>> 1160.52user 291.33system 29:36.05elapsed (1776.05 : 12.85%) 81%CPU
>> ... test_run: done ...
>> 
>> intel_cpufreq - powersave (slow reference)
>> ... test_run: start 5 runs ...
>> 2165.72user 1049.18system 57:12.77elapsed (3432.77 : -68.45%) 93%CPU
>> ... test_run: done ...
>> 
>> intel_cpufreq - ondemand
>> ... test_run: start 5 runs ...
>> 1776.79user 808.65system 47:14.74elapsed (2834.74 : -39.11%) 91%CPU
>> 
>
> Nothing overly surprising there. It's been my observation that pstate is
> generally better than acpi_cpufreq which somewhat amuses me when I still
> see suggestions of disabling intel_pstate entirely being used despite the
> advice being based on much older kernels.
>
>> intel_cpufreq - schedutil
>> ... test_run: start 5 runs ...
>> 2049.28user 1028.70system 54:57.82elapsed (3297.82 : -61.83%) 93%CPU
>> ... test_run: done ...
>>
>
> I'm mildly surprised at this. I had observed that schedutil is not great
> but I don't recall seeing a result this bad.
>
>> Linux s15 4.11.0-rc6-revert #233 SMP Wed Apr 12 15:30:19 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>> ... test_run: start 5 runs ...
>> 1295.30user 365.98system 32:50.15elapsed (1970.15 : 3.32%) 84%CPU
>> ... test_run: done ...
>> 
>
> And the revert does help albeit not being an option for reasons Rafael
> covered.

New data point: Kernel 4.11-rc7  intel_pstate, powersave forcing the
load based algorithm: Elapsed 3178 seconds.

If I understand your data correctly, my load based results are the opposite of yours.

Mel: 4.11-rc5 vanilla: Elapsed mean: 3750.20 Seconds
Mel: 4.11-rc5 load based: Elapsed mean: 2503.27 Seconds
Or: 33.25%

Doug: 4.11-rc6 stock: Elapsed total (5 runs): 2364.45 Seconds
Doug: 4.11-rc7 force load based: Elapsed total (5 runs): 3178 Seconds
Or: -34.4%

>>> The way I'm doing it is via mmtests so
>>>
>>> git clone https://github.com/gormanm/mmtests
>>> cd mmtests
>>> ./run-mmtests --no-monitor --config configs/config-global-dhp__workload_shellscripts test-run-1
>>> cd work/log
>>> ../../compare-kernels.sh | less
>>>
>>> and it'll generate a similar report to what I posted in this email
>>> thread. If you do multiple tests with different kernels then change the
>>> name of "test-run-1" to preserve the old data. compare-kernel.sh will
>>> compare whatever results you have.
>> 
>>          k4.5    k4.11-rc6         k4.11-rc6         k4.11-rc6          k4.11-rc6         k4.11-rc6         k4.11-rc6
>>                                    performance       pass-ps            pass-od           pass-su           revert
>> E min    388.71  456.51 (-17.44%)  342.81 ( 11.81%)  668.79 (-72.05%)   552.85 (-42.23%)  646.96 (-66.44%)  375.08 (  3.51%)
>> E mean   389.74  458.52 (-17.65%)  343.81 ( 11.78%)  669.42 (-71.76%)   553.45 (-42.01%)  647.95 (-66.25%)  375.98 (  3.53%)
>> E stddev   0.85    1.64 (-92.78%)    0.67 ( 20.83%)    0.41 ( 52.25%)     0.31 ( 64.00%)    0.68 ( 20.35%)    0.46 ( 46.00%)
>> E coeffvar 0.22    0.36 (-63.86%)    0.20 ( 10.25%)    0.06 ( 72.20%)     0.06 ( 74.65%)    0.10 ( 52.09%)    0.12 ( 44.03%)
>> E max    390.90  461.47 (-18.05%)  344.83 ( 11.79%)  669.91 (-71.38%)   553.68 (-41.64%)  648.75 (-65.96%)  376.37 (  3.72%)
>> 
>> E = Elapsed (squished in an attempt to prevent line length wrapping when I send)
>> 
>>            k4.5   k4.11-rc6   k4.11-rc6   k4.11-rc6   k4.11-rc6   k4.11-rc6   k4.11-rc6
>>                             performance     pass-ps     pass-od     pass-su      revert
>> User     347.26     1801.56     1398.76     2540.67     2106.30     2434.06     1536.80
>> System   139.01      701.87      366.59     1346.75     1026.67     1322.39      449.81
>> Elapsed 2346.77     2761.20     2062.12     4017.47     3321.10     3887.19     2268.90
>> 
>> Legend:
>> blank  = active mode: intel_pstate - powersave
>> performance = active mode: intel_pstate - performance (fast reference)
>> pass-ps = passive mode: intel_cpufreq - powersave (slow reference)
>> pass-od = passive mode: intel_cpufreq - ondemand
>> pass-su = passive mode: intel_cpufreq - schedutil
>> revert = active mode: intel_pstate - powersave with commit ffb810563c0c reverted.
>> 
>> I deleted the user, system, and CPU rows, because they don't make any sense.
>>
>
> User is particularly misleading. System can be very misleading between
> kernel versions due to accounting differences so I'm ok with that.
>
>> I do not know why the tests run overall so much faster on my computer,
>
> Differences in CPU I imagine. I know the machine I'm reporting on is a
> particularly bad example. I've seen other machines where the effect is
> less severe.

No, I meant that my overall run time was on the order of 3/4 of an hour,
whereas your tests were on the order of 3 hours. As far as I could tell,
our CPUs had similar capabilities.

>
>> I can only assume I have something wrong in my installation of your mmtests.
>
> No, I've seen results broadly similar to yours on other machines so I
> don't think you have a methodology error.
>
>> I do see mmtests looking for some packages which it can not find.
>> 
>
> That's not too unusual. The package names are based on opensuse naming
> and that doesn't translate to other distributions. If you open
> bin/install-depends, you'll see a hashmap near the top that maps some of
> the names for redhat-based distributions and debian. It's not actively
> maintained. You can either install the packages manually before the
> test or update the mappings.

>> Mel wrote:
>>> The results show that it's not the only source as a revert (last column)
>>> doesn't fix the damage although it goes from 3750 seconds (4.11-rc5 vanilla)
>>> to 2919 seconds (with a revert).
>> 
>> In my case, the reverted code ran faster than the kernel 4.5 code.
>> 
>> The other big difference is between Kernel 4.5 and 4.11-rc5 you got
>> -102.28% elapsed time, whereas I got -16.03% with method 1 and
>> -17.65% with method 2 (well, between 4.5 and 4.11-rc6 in my case).
>> I only get -93.28% and -94.82% difference between my fast and slow reference
>> tests (albeit on the same kernel).
>> 
>
> I have no reason to believe this is a methodology error; it's due to a
> difference in CPU. Consider the following reports
>
>
> http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/delboy/#gitsource
> http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/ivy/#gitsource
>
> The first one (delboy) shows a gain of 1.35% and it's only for 4.11
> (kernel shown is 4.11-rc1 with vmscan-related patches on top that do not
> affect this test case) of -17.51% which is very similar to yours. The
> CPU there is a Xeon E3-1230 v5.
>
> The second report (ivy) is the machine I based the original complaint
> on and it shows the large regression in elapsed time.
>
> So, different CPUs have different behaviours which is no surprise at all
> considering that at the very least, exit latencies will be different.
> While there may not be a universally correct answer to how to do this
> automatically, is it possible to tune intel_pstate such that it ramps up
> quickly regardless of recent utilisation and reduces relatively slowly?
> That would be better from a power consumption perspective than setting the
> "performance" governor.

As mentioned above, I don't know how to make the distinction in the use
cases.

... Doug


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-11 10:02   ` Mel Gorman
@ 2017-04-21  0:52     ` Rafael J. Wysocki
  0 siblings, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2017-04-21  0:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rafael J. Wysocki, Rafael Wysocki, Jörg Otte,
	Linux Kernel Mailing List, Linux PM, Srinivas Pandruvada,
	Doug Smythies

On Tuesday, April 11, 2017 11:02:34 AM Mel Gorman wrote:
> On Mon, Apr 10, 2017 at 10:51:38PM +0200, Rafael J. Wysocki wrote:
> > Hi Mel,
> > 
> > On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman
> > <mgorman@techsingularity.net> wrote:
> > > Hi Rafael,
> > >
> > > Since kernel 4.6, performance of the low CPU intensity workloads was dropped
> > > severely.  netperf UDP_STREAM has about 15-20% CPU utilisation has regressed
> > > about 10% relative to 4.4 anad about 6-9% running TCP_STREAM. sockperf has
> > > similar utilisation fixes but I won't go into these in detail as they were
> > > running loopback and are sensitive to a lot of factors.
> > >
> > > It's far more obvious when looking at the git test suite and the length
> > > of time it takes to run. This is a shellscript and git intensive workload
> > > whose CPU utilisatiion is very low but is less sensitive to multiple
> > > factors than netperf and sockperf.
> > 
> > First, thanks for the data.
> > 
> > Nobody has reported anything similar to these results so far.
> > 
> 
> It's possible that it's due to the CPU being IvyBridge or it may be due
> to the fact that people don't spot problems with low CPU utilisation
> workloads.

I'm guessing the latter.

> > > Bisection indicates that the regression started with commit ffb810563c0c
> > > ("intel_pstate: Avoid getting stuck in high P-states when idle").  However,
> > > it's no longer the only relevant commit as the following results will show
> > 
> > Well, that was an attempt to salvage the "Core" P-state selection
> > algorithm which is problematic overall and reverting this now would
> > reintroduce the issue addressed by it, unfortunately.
> > 
> 
> I'm not suggesting that we should revert this patch. I accept that it
> would reintroduce the regression reported by Jorg if nothing else

OK

> > > This is showing the user and system CPU usage as well as the elapsed time
> > > to run a single iteration of the git test suite, with total times at the
> > > bottom of the report. Overall, the run takes over 3 hours longer moving from 4.4 to 4.11-rc5
> > > and reverting the commit does not fully address the problem. It's doing
> > > a warmup run whose results are discarded and then 5 iterations.
> > >
> > > The test shows it took 2018 seconds on average to complete a single iteration
> > > on 4.4 and 3750 seconds to complete on 4.11-rc5. The major drop is between
> > > 4.5 and 4.6 where it went from 1830 seconds to 3703 seconds and has not
> > > recovered. A bisection was clean and pointed to the commit mentioned above.
> > >
> > > The results show that it's not the only source as a revert (last column)
> > > doesn't fix the damage although it goes from 3750 seconds (4.11-rc5 vanilla)
> > > to 2919 seconds (with a revert).
> > 
> > OK
> > 
> > So if you revert the commit in question on top of 4.6.0, the numbers
> > go back to the 4.5.0 levels, right?
> > 
> 
> Not quite, it restores a lot of the performance but not all.

I see.

> > Anyway, as I said the "Core" P-state selection algorithm is sort of on
> > the way out and I think that we have a reasonable replacement for it.
> > 
> > Would it be viable to check what happens with
> > https://patchwork.kernel.org/patch/9640261/ applied?  Depending on the
> > ACPI system PM profile of the test machine, this is likely to cause it
> > to use the new algo.
> > 
> 
> Yes. The following is a comparison using 4.5 as a baseline as it is the
> best known kernel and it reduces the width
> 
> 
> gitsource
>                                  4.5.0                 4.6.0                 4.6.0            4.11.0-rc5            4.11.0-rc5
>                                vanilla               vanilla      revert-v4.6-v1r1               vanilla        loadbased-v1r1
> User    min          1613.72 (  0.00%)     3302.19 (-104.63%)     1935.46 (-19.94%)     3487.46 (-116.11%)     2296.87 (-42.33%)
> User    mean         1616.47 (  0.00%)     3304.14 (-104.40%)     1937.83 (-19.88%)     3488.12 (-115.79%)     2299.33 (-42.24%)
> User    stddev          1.75 (  0.00%)        1.12 ( 36.06%)        1.42 ( 18.54%)        0.57 ( 67.28%)        1.79 ( -2.73%)
> User    coeffvar        0.11 (  0.00%)        0.03 ( 68.72%)        0.07 ( 32.05%)        0.02 ( 84.84%)        0.08 ( 27.78%)
> User    max          1618.73 (  0.00%)     3305.40 (-104.20%)     1939.84 (-19.84%)     3489.01 (-115.54%)     2302.01 (-42.21%)
> System  min           202.58 (  0.00%)      407.51 (-101.16%)      244.03 (-20.46%)      269.92 (-33.24%)      203.79 ( -0.60%)
> System  mean          203.62 (  0.00%)      408.38 (-100.56%)      245.24 (-20.44%)      270.83 (-33.01%)      205.19 ( -0.77%)
> System  stddev          0.64 (  0.00%)        0.77 (-21.25%)        0.97 (-52.52%)        0.59 (  7.31%)        0.75 (-18.12%)
> System  coeffvar        0.31 (  0.00%)        0.19 ( 39.54%)        0.40 (-26.64%)        0.22 ( 30.31%)        0.37 (-17.21%)
> System  max           204.36 (  0.00%)      409.81 (-100.53%)      246.85 (-20.79%)      271.56 (-32.88%)      206.06 ( -0.83%)
> Elapsed min          1827.70 (  0.00%)     3701.00 (-102.49%)     2186.22 (-19.62%)     3749.00 (-105.12%)     2501.05 (-36.84%)
> Elapsed mean         1830.72 (  0.00%)     3703.20 (-102.28%)     2190.03 (-19.63%)     3750.20 (-104.85%)     2503.27 (-36.74%)
> Elapsed stddev          2.18 (  0.00%)        1.47 ( 32.67%)        2.25 ( -3.23%)        0.75 ( 65.72%)        1.28 ( 41.43%)
> Elapsed coeffvar        0.12 (  0.00%)        0.04 ( 66.71%)        0.10 ( 13.71%)        0.02 ( 83.26%)        0.05 ( 57.16%)
> Elapsed max          1833.91 (  0.00%)     3705.00 (-102.03%)     2193.26 (-19.59%)     3751.00 (-104.54%)     2504.54 (-36.57%)
> CPU     min            99.00 (  0.00%)      100.00 ( -1.01%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)
> CPU     mean           99.00 (  0.00%)      100.00 ( -1.01%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)
> CPU     stddev          0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> CPU     coeffvar        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> CPU     max            99.00 (  0.00%)      100.00 ( -1.01%)       99.00 (  0.00%)      100.00 ( -1.01%)      100.00 ( -1.01%)
> 
>                4.5.0       4.6.0       4.6.0  4.11.0-rc5  4.11.0-rc5
>              vanilla     vanilla   revert-v4.6-v1r1     vanilla   loadbased-v1r1
> User         9790.02    19914.22    11713.58    21021.12    13888.63
> System       1234.01     2465.45     1485.99     1635.85     1242.37
> Elapsed     11008.49    22247.35    13162.72    22528.79    15044.76
> 
> As you can see, 4.6 is running twice as long as 4.5 (3703 seconds to
> complete vs 1830 seconds). Reverting (revert-v4.6-v1r1) restores some of
> the performance and is 19.63% slower on average. 4.11-rc5 is as bad as
> 4.6 but applying your patch runs for 2503 seconds (36.74% slower). This
> is still pretty bad but it's a big step in the right direction.

OK

Because of the problems with the current default P-state selection algorithm,
to me the way to go is to migrate to the load-based one going forward.
In fact, the patch I asked you to test is now scheduled for 4.12.

The load-based algorithm basically contains what's needed to react to load
changes quickly and avoid going down too fast, but its time granularity may not
be adequate for the workload at hand.

If possible, can you please add my current linux-next branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next

to the comparison table?  It basically is new ACPI and PM material scheduled
for the 4.12 merge window on top of 4.11.0-rc7.  On top of that, it should be
easier to tweak the load-based P-state selection algorithm somewhat.

> > I guess that you have a pstate_snb directory under /sys/kernel/debug/
> > (if this is where debugfs is mounted)?  It should not be there any
> > more with the new algo (as that does not use the PID controller any
> > more).
> > 
> 

[cut]

> > At this point I'm not sure what has changed in addition to the commit
> > you have found and while this is sort of interesting, I'm not sure how
> > relevant it is.
> > 
> > Unfortunately, the P-state selection algorithm used so far on your
> > test system is quite fundamentally unstable and tends to converge to
> > either the highest or the lowest P-state in various conditions.  If
> > the workload is sufficiently "light", it generally ends up in the
> > minimum P-state most of the time which probably happens here.
> > 
> > I would really not like to try to "fix" that algorithm as this is
> > pretty much hopeless and most likely will lead to regressions
> > elsewhere.  Instead, I'd prefer to migrate away from it altogether and
> > then tune things so that they work for everybody reasonably well
> > (which should be doable with the new algorithm).  But let's see how
> > far we can get with that.
> > 
> 
> Other than altering min_perf_pct, is there a way of tuning intel_pstate
> such that it delays entering lower p-states for longer? It would
> increase power consumption but at least it would be an option for
> low-utilisation workloads and probably beneficial in general for those
> that need to reduce the latency of wakeups while still allowing at least the
> C1 state.

The P-state selection algorithm for core processors can be tweaked via
the debugfs interface under /sys/kernel/debug/pstate_snb/, for example
by changing the rate limit.

The load-based P-state selection algorithm has no tunables at this time,
but it should be easy enough to make the sampling interval of it adjustable
at least for debugging purposes.
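
Something along these lines (entirely untested; the directory name, variable
name and default value below are made up for illustration, nothing like this
exists in the driver today) would probably be enough:

#include <linux/debugfs.h>
#include <linux/err.h>

/* Assumed: the interval consumed by the load-based sampling code. */
static u32 load_sampling_interval_ms = 10;

static void intel_pstate_expose_sampling_interval(void)
{
	struct dentry *dir;

	dir = debugfs_create_dir("pstate_load", NULL);
	if (IS_ERR_OR_NULL(dir))
		return;

	/* Writable from user space, read back by the sampling code. */
	debugfs_create_u32("sampling_interval_ms", 0644, dir,
			   &load_sampling_interval_ms);
}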

Thanks,
Rafael


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-19  8:15       ` Mel Gorman
@ 2017-04-21  1:12         ` Rafael J. Wysocki
  0 siblings, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2017-04-21  1:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Doug Smythies, 'Rafael Wysocki', 'Jörg Otte',
	'Linux Kernel Mailing List', 'Linux PM',
	'Srinivas Pandruvada'

On Wednesday, April 19, 2017 09:15:37 AM Mel Gorman wrote:
> On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
> > Hi Mel,
> > 
> > Thanks for the "how to" information.
> > This is a very interesting use case.
> > From trace data, I see a lot of minimal durations with
> > virtually no load on the CPU, typically more consistent
> > with some type of light duty periodic (~~100 Hz) work flow
> > (where we would prefer to not ramp up frequencies, or more
> > accurately keep them ramped up).
> 
> This broadly matches my expectations in terms of behaviour. It is a
> low duty workload but while I accept that a laptop may not want the
> frequencies to ramp up, it's not universally true. Long periods at low
> frequency to complete a workload is not necessarily better than using a
> high frequency to race to idle. Effectively, a low utilisation test suite
> could be considered as a "foreground task of high priority" and not a
> "background task of little interest".

That's fair enough, but somewhat hard to tell from within a scaling governor. :-)

[cut]

> 
> I have no reason to believe this is a methodology error; it's due to a
> difference in CPU. Consider the following reports
> 
> http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/delboy/#gitsource
> http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/ivy/#gitsource
> 
> The first one (delboy) shows a gain of 1.35% and it's only for 4.11
> (kernel shown is 4.11-rc1 with vmscan-related patches on top that do not
> affect this test case) of -17.51% which is very similar to yours. The
> CPU there is a Xeon E3-1230 v5.
> 
> The second report (ivy) is the machine I based the original complaint
> on and it shows the large regression in elapsed time.
> 
> So, different CPUs have different behaviours which is no surprise at all
> considering that at the very least, exit latencies will be different.
> While there may not be a universally correct answer to how to do this
> automatically, is it possible to tune intel_pstate such that it ramps up
> quickly regardless of recent utilisation and reduces relatively slowly?
> That would be better from a power consumption perspective than setting the
> "performance" governor.

It should be, theoretically.

The way the load-based P-state selection algorithm works is based on computing
average utilization periodically and setting the frequency proportional to it with
a couple of twists.  The first twist is that the frequency will be bumped up for
tasks that have waited on I/O ("IO-wait boost").  The second one is that if the
frequency is to be reduced, it will not go down proportionally to the computed
average utilization, but to a frequency between the current (measured) one
and the one proportional to the utilization (so it will go down asymptotically
rather than in one go).

Now, of course, what matters is how often the average utilization is computed,
because if we average several small spikes over a broad sampling window, they
will just almost vanish in the average and the resulting frequency will be small.
If, in turn, the sampling interval is reduced, some intervals will get the spikes
(and for them the average utilization will be greater) and some of them will
get nothing (leading to average utilization close to zero) and now all depends
on the distribution of the spikes along the time axis.

You can actually try to test that on top of my linux-next branch by reducing
INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL (in intel_pstate.c) by, say, 1/2.
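
In pseudo-C, the selection logic described above boils down to something like
the sketch below (plain integers instead of the driver's fixed-point helpers
and made-up parameter names, so an illustration of the idea rather than the
actual code):

/*
 * Simplified model of the load-based target P-state selection:
 * proportional to utilization, boosted after I/O waits, and decaying
 * asymptotically instead of dropping straight to the proportional value.
 */
static int pick_target_pstate(int min_pstate, int max_pstate, int avg_pstate,
			      int busy_pct, int iowait_boost_pct)
{
	int target;

	/* Twist 1: tasks that have waited on I/O bump the utilization up. */
	if (busy_pct < iowait_boost_pct)
		busy_pct = iowait_boost_pct;

	/* Frequency proportional to utilization, with ~25% headroom. */
	target = (max_pstate + (max_pstate >> 2)) * busy_pct / 100;
	if (target < min_pstate)
		target = min_pstate;

	/*
	 * Twist 2: if that is below the measured average P-state, only go
	 * half-way down towards it, so the frequency decays asymptotically
	 * rather than in one go.
	 */
	if (avg_pstate > target)
		target += (avg_pstate - target) >> 1;

	return target;
}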

Thanks,
Rafael


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-20 14:55       ` Doug Smythies
@ 2017-04-21  1:17         ` Rafael J. Wysocki
  2017-04-22  6:29         ` Doug Smythies
  1 sibling, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2017-04-21  1:17 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Mel Gorman', 'Rafael Wysocki',
	'Jörg Otte', 'Linux Kernel Mailing List',
	'Linux PM', 'Srinivas Pandruvada'

On Thursday, April 20, 2017 07:55:57 AM Doug Smythies wrote:
> On 2017.04.19 01:16 Mel Gorman wrote:
> > On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
> >> Hi Mel,

[cut]

> > And the revert does help albeit not being an option for reasons Rafael
> > covered.
> 
> New data point: Kernel 4.11-rc7  intel_pstate, powersave forcing the
> load based algorithm: Elapsed 3178 seconds.
> 
> If I understand your data correctly, my load based results are the opposite of yours.
> 
> Mel: 4.11-rc5 vanilla: Elapsed mean: 3750.20 Seconds
> Mel: 4.11-rc5 load based: Elapsed mean: 2503.27 Seconds
> Or: 33.25%
> 
> Doug: 4.11-rc6 stock: Elapsed total (5 runs): 2364.45 Seconds
> Doug: 4.11-rc7 force load based: Elapsed total (5 runs): 3178 Seconds
> Or: -34.4%

I wonder if you can do the same thing I've just advised Mel to do.  That is,
take my linux-next branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next

(which is new material for 4.12 on top of 4.11-rc7) and reduce
INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL (in intel_pstate.c) in it by 1/2
(force load-based if need be, I'm not sure what PM profile of your test system
is).
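
The change itself is a one-liner; schematically (assuming the default in that
branch is 10 ms expressed in nanoseconds; please double-check the value in
your checkout), it would be:

--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
-#define INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
+#define INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL	(5 * NSEC_PER_MSEC)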

Thanks,
Rafael


* RE: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-20 14:55       ` Doug Smythies
  2017-04-21  1:17         ` Rafael J. Wysocki
@ 2017-04-22  6:29         ` Doug Smythies
  2017-04-22 21:07           ` Rafael J. Wysocki
  2017-04-23 15:31           ` Doug Smythies
  1 sibling, 2 replies; 21+ messages in thread
From: Doug Smythies @ 2017-04-22  6:29 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Mel Gorman', 'Rafael Wysocki',
	'Jörg Otte', 'Linux Kernel Mailing List',
	'Linux PM', 'Srinivas Pandruvada',
	Doug Smythies

On 2017.04.20 18:18 Rafael wrote:
> On Thursday, April 20, 2017 07:55:57 AM Doug Smythies wrote:
>> On 2017.04.19 01:16 Mel Gorman wrote:
>>> On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
>>>> Hi Mel,
>
> [cut]
>
>>> And the revert does help albeit not being an option for reasons Rafael
>>> covered.
>> 
>> New data point: Kernel 4.11-rc7  intel_pstate, powersave forcing the
>> load based algorithm: Elapsed 3178 seconds.
>> 
>> If I understand your data correctly, my load based results are the opposite of yours.
>> 
>> Mel: 4.11-rc5 vanilla: Elapsed mean: 3750.20 Seconds
>> Mel: 4.11-rc5 load based: Elapsed mean: 2503.27 Seconds
>> Or: 33.25%
>> 
>> Doug: 4.11-rc6 stock: Elapsed total (5 runs): 2364.45 Seconds
>> Doug: 4.11-rc7 force load based: Elapsed total (5 runs): 3178 Seconds
>> Or: -34.4%
>
> I wonder if you can do the same thing I've just advised Mel to do.  That is,
> take my linux-next branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
>
> (which is new material for 4.12 on top of 4.11-rc7) and reduce
> INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL (in intel_pstate.c) in it by 1/2
> (force load-based if need be, I'm not sure what PM profile of your test system
> is).

I did not need to force load-based. I do not know how to figure it out from
an acpidump the way Srinivas does. I did a trace and figured out what algorithm
it was using from the data.

Reference test, before changing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL:
3239.4 seconds.

Test after changing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL:
3195.5 seconds.

By far, and with any code, I get the fastest elapsed time, of course next
to performance mode, but not by much, by limiting the test to only use
just 1 cpu: 1814.2 Seconds.
(performance governor, restated from a previous e-mail: 1776.05 seconds)

... Doug


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-22  6:29         ` Doug Smythies
@ 2017-04-22 21:07           ` Rafael J. Wysocki
  2017-04-24 10:01             ` Mel Gorman
  2017-04-23 15:31           ` Doug Smythies
  1 sibling, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2017-04-22 21:07 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Mel Gorman', 'Rafael Wysocki',
	'Jörg Otte', 'Linux Kernel Mailing List',
	'Linux PM', 'Srinivas Pandruvada'

On Friday, April 21, 2017 11:29:06 PM Doug Smythies wrote:
> On 2017.04.20 18:18 Rafael wrote:
> > On Thursday, April 20, 2017 07:55:57 AM Doug Smythies wrote:
> >> On 2017.04.19 01:16 Mel Gorman wrote:
> >>> On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
> >>>> Hi Mel,
> >
> > [cut]
> >
> >>> And the revert does help albeit not being an option for reasons Rafael
> >>> covered.
> >> 
> >> New data point: Kernel 4.11-rc7  intel_pstate, powersave forcing the
> >> load based algorithm: Elapsed 3178 seconds.
> >> 
> >> If I understand your data correctly, my load based results are the opposite of yours.
> >> 
> >> Mel: 4.11-rc5 vanilla: Elapsed mean: 3750.20 Seconds
> >> Mel: 4.11-rc5 load based: Elapsed mean: 2503.27 Seconds
> >> Or: 33.25%
> >> 
> >> Doug: 4.11-rc6 stock: Elapsed total (5 runs): 2364.45 Seconds
> >> Doug: 4.11-rc7 force load based: Elapsed total (5 runs): 3178 Seconds
> >> Or: -34.4%
> >
> > I wonder if you can do the same thing I've just advised Mel to do.  That is,
> > take my linux-next branch:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
> >
> > (which is new material for 4.12 on top of 4.11-rc7) and reduce
> > INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL (in intel_pstate.c) in it by 1/2
> > (force load-based if need be, I'm not sure what PM profile of your test system
> > is).
> 
> I did not need to force load-based. I do not know how to figure it out from
> an acpidump the way Srinivas does. I did a trace and figured out what algorithm
> it was using from the data.
> 
> Reference test, before changing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL:
> 3239.4 seconds.
> 
> Test after changing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL:
> 3195.5 seconds.

So it does have an effect, but relatively small.

I wonder if further reducing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL to 2 ms
will make any difference.

> By far, and with any code, I get the fastest elapsed time, of course next
> to performance mode, but not by much, by limiting the test to only use
> just 1 cpu: 1814.2 Seconds.

Interesting.

It looks like the cost is mostly related to moving the load from one CPU to
another and waiting for the new one to ramp up then.

I guess the workload consists of many small tasks that each start on new CPUs
and cause that ping-pong to happen.

> (performance governor, restated from a previous e-mail: 1776.05 seconds)

But that causes the processor to stay in the maximum sustainable P-state all
the time, which on Sandy Bridge is quite costly energetically.

We can do one more trick I forgot about.  Namely, if we are about to increase
the P-state, we can jump to the average between the target and the max
instead of just the target, like in the appended patch (on top of linux-next).

That will make the P-state selection really aggressive, and so costly
energetically, but it should allow small jumps of the average load above 0 to
cause big jumps of the target P-state.

---
 drivers/cpufreq/intel_pstate.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -1613,7 +1613,7 @@ static inline int32_t get_target_pstate_
 {
 	struct sample *sample = &cpu->sample;
 	int32_t busy_frac, boost;
-	int target, avg_pstate;
+	int max_pstate, target, avg_pstate;
 
 	if (cpu->policy == CPUFREQ_POLICY_PERFORMANCE)
 		return cpu->pstate.turbo_pstate;
@@ -1628,10 +1628,9 @@ static inline int32_t get_target_pstate_
 
 	sample->busy_scaled = busy_frac * 100;
 
-	target = global.no_turbo || global.turbo_disabled ?
+	max_pstate = global.no_turbo || global.turbo_disabled ?
 			cpu->pstate.max_pstate : cpu->pstate.turbo_pstate;
-	target += target >> 2;
-	target = mul_fp(target, busy_frac);
+	target = mul_fp(max_pstate + (max_pstate >> 2), busy_frac);
 	if (target < cpu->pstate.min_pstate)
 		target = cpu->pstate.min_pstate;
 
@@ -1645,6 +1644,8 @@ static inline int32_t get_target_pstate_
 	avg_pstate = get_avg_pstate(cpu);
 	if (avg_pstate > target)
 		target += (avg_pstate - target) >> 1;
+	else if (avg_pstate < target)
+		target = (max_pstate + target) >> 1;
 
 	return target;
 }


* RE: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-22  6:29         ` Doug Smythies
  2017-04-22 21:07           ` Rafael J. Wysocki
@ 2017-04-23 15:31           ` Doug Smythies
  2017-04-24  0:59             ` Rafael J. Wysocki
  1 sibling, 1 reply; 21+ messages in thread
From: Doug Smythies @ 2017-04-23 15:31 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Mel Gorman', 'Rafael Wysocki',
	'Jörg Otte', 'Linux Kernel Mailing List',
	'Linux PM', 'Srinivas Pandruvada',
	Doug Smythies

On 2017.04.22 14:08 Rafael wrote:
> On Friday, April 21, 2017 11:29:06 PM Doug Smythies wrote:
>> On 2017.04.20 18:18 Rafael wrote:
>>> On Thursday, April 20, 2017 07:55:57 AM Doug Smythies wrote:
>>>> On 2017.04.19 01:16 Mel Gorman wrote:
>>>>> On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
>>>>>> Hi Mel,
>>>
>>> [cut]
>>>
>>>>> And the revert does help albeit not being an option for reasons Rafael
>>>>> covered.
>>>> 
>>>> New data point: Kernel 4.11-rc7  intel_pstate, powersave forcing the
>>>> load based algorithm: Elapsed 3178 seconds.
>>>> 
>>>> If I understand your data correctly, my load based results are the opposite of yours.
>>>> 
>>>> Mel: 4.11-rc5 vanilla: Elapsed mean: 3750.20 Seconds
>>>> Mel: 4.11-rc5 load based: Elapsed mean: 2503.27 Seconds
>>>> Or: 33.25%
>>>> 
>>>> Doug: 4.11-rc6 stock: Elapsed total (5 runs): 2364.45 Seconds
>>>> Doug: 4.11-rc7 force load based: Elapsed total (5 runs): 3178 Seconds
>>>> Or: -34.4%
>>>
>>> I wonder if you can do the same thing I've just advised Mel to do.  That is,
>>> take my linux-next branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
>>>
>>> (which is new material for 4.12 on top of 4.11-rc7) and reduce
>>> INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL (in intel_pstate.c) in it by 1/2
>>> (force load-based if need be, I'm not sure what PM profile of your test system
>>> is).
>> 
>> I did not need to force load-based. I do not know how to figure it out from
>> an acpidump the way Srinivas does. I did a trace and figured out what algorithm
>> it was using from the data.
>> 
>> Reference test, before changing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL:
>> 3239.4 seconds.
>> 
>> Test after changing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL:
>> 3195.5 seconds.
>
> So it does have an effect, but relatively small.

I don't know how repeatable the test results are.
i.e. I don't know if the 1.36% change is within experimental
error or not. That being said, the trend does seem consistent.

> I wonder if further reducing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL to 2 ms
> will make any difference.

I went all the way to 1 ms, just for the test:
3123.9 Seconds

>> By far, and with any code, I get the fastest elapsed time, of course next
>> to performance mode, but not by much, by limiting the test to only use
>> just 1 cpu: 1814.2 Seconds.
>
> Interesting.
>
> It looks like the cost is mostly related to moving the load from one CPU to
> another and waiting for the new one to ramp up then.
>
> I guess the workload consists of many small tasks that each start on new CPUs
> and cause that ping-pong to happen.

Yes, and (from trace data) many tasks are very very very small. Also the test
appears to take a few holidays, of up to 1 second, during execution.

>> (performance governor, restated from a previous e-mail: 1776.05 seconds)
>
> But that causes the processor to stay in the maximum sustainable P-state all
> the time, which on Sandy Bridge is quite costly energetically.

Agreed. I only provide these data points as a reference and so that we know
what the boundary conditions (limits) are.

> We can do one more trick I forgot about.  Namely, if we are about to increase
> the P-state, we can jump to the average between the target and the max
> instead of just the target, like in the appended patch (on top of linux-next).
>
> That will make the P-state selection really aggressive, so costly energetically,
> but it shoud small jumps of the average load above 0 to case big jumps of
> the target P-state.

I'm already seeing the energy costs of some of this stuff.
3050.2 Seconds.
Idle power 4.06 Watts.

Idle power for kernel 4.11-rc7 (performance-based): 3.89 Watts.
Idle power for kernel 4.11-rc7, using load-based: 4.01 watts
Idle power for kernel 4.11-rc7 next linux-pm: 3.91 watts 


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-23 15:31           ` Doug Smythies
@ 2017-04-24  0:59             ` Rafael J. Wysocki
  2017-04-24  1:21               ` Srinivas Pandruvada
                                 ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2017-04-24  0:59 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Mel Gorman, Rafael Wysocki, Jörg Otte,
	Linux Kernel Mailing List, Linux PM, Srinivas Pandruvada

On Sun, Apr 23, 2017 at 5:31 PM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2017.04.22 14:08 Rafael wrote:
>> On Friday, April 21, 2017 11:29:06 PM Doug Smythies wrote:
>>> On 2017.04.20 18:18 Rafael wrote:
>>>> On Thursday, April 20, 2017 07:55:57 AM Doug Smythies wrote:
>>>>> On 2017.04.19 01:16 Mel Gorman wrote:
>>>>>> On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
>>>>>>> Hi Mel,
>>>>
>>>> [cut]
>>>>
>>>>>> And the revert does help albeit not being an option for reasons Rafael
>>>>>> covered.
>>>>>
>>>>> New data point: Kernel 4.11-rc7  intel_pstate, powersave forcing the
>>>>> load based algorithm: Elapsed 3178 seconds.
>>>>>
>>>>> If I understand your data correctly, my load based results are the opposite of yours.
>>>>>
>>>>> Mel: 4.11-rc5 vanilla: Elapsed mean: 3750.20 Seconds
>>>>> Mel: 4.11-rc5 load based: Elapsed mean: 2503.27 Seconds
>>>>> Or: 33.25%
>>>>>
>>>>> Doug: 4.11-rc6 stock: Elapsed total (5 runs): 2364.45 Seconds
>>>>> Doug: 4.11-rc7 force load based: Elapsed total (5 runs): 3178 Seconds
>>>>> Or: -34.4%
>>>>
>>>> I wonder if you can do the same thing I've just advised Mel to do.  That is,
>>>> take my linux-next branch:
>>>>
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
>>>>
>>>> (which is new material for 4.12 on top of 4.11-rc7) and reduce
>>>> INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL (in intel_pstate.c) in it by 1/2
>>>> (force load-based if need be, I'm not sure what PM profile of your test system
>>>> is).
>>>
>>> I did not need to force load-based. I do not know how to figure it out from
>>> an acpidump the way Srinivas does. I did a trace and figured out what algorithm
>>> it was using from the data.
>>>
>>> Reference test, before changing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL:
>>> 3239.4 seconds.
>>>
>>> Test after changing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL:
>>> 3195.5 seconds.
>>
>> So it does have an effect, but relatively small.
>
> I don't know how repeatable the tests results are.
> i.e. I don't know if the 1.36% change is within experimental
> error or not. That being said, the trend does seem consistent.
>
>> I wonder if further reducing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL to 2 ms
>> will make any difference.
>
> I went all the way to 1 ms, just for the test:
> 3123.9 Seconds
>
>>> By far, and with any code, I get the fastest elapsed time, of course next
>>> to performance mode, but not by much, by limiting the test to only use
>>> just 1 cpu: 1814.2 Seconds.
>>
>> Interesting.
>>
>> It looks like the cost is mostly related to moving the load from one CPU to
>> another and waiting for the new one to ramp up then.
>>
>> I guess the workload consists of many small tasks that each start on new CPUs
>> and cause that ping-pong to happen.
>
> Yes, and (from trace data) many tasks are very very very small. Also the test
> appears to take a few holidays, of up to 1 second, during execution.
>
>>> (performance governor, restated from a previous e-mail: 1776.05 seconds)
>>
>> But that causes the processor to stay in the maximum sustainable P-state all
>> the time, which on Sandy Bridge is quite costly energetically.
>
> Agreed. I only provide these data points as a reference and so that we know
> what the boundary conditions (limits) are.
>
>> We can do one more trick I forgot about.  Namely, if we are about to increase
>> the P-state, we can jump to the average between the target and the max
>> instead of just the target, like in the appended patch (on top of linux-next).
>>
>> That will make the P-state selection really aggressive, so costly energetically,
>> but it shoud small jumps of the average load above 0 to case big jumps of
>> the target P-state.
>
> I'm already seeing the energy costs of some of this stuff.
> 3050.2 Seconds.

Is this with or without reducing the sampling interval?

> Idle power 4.06 Watts.
>
> Idle power for kernel 4.11-rc7 (performance-based): 3.89 Watts.
> Idle power for kernel 4.11-rc7, using load-based: 4.01 watts
> Idle power for kernel 4.11-rc7 next linux-pm: 3.91 watts

Power draw differences are not dramatic, so this might be a viable
change depending on the influence on the results elsewhere.

Anyway, your results are somewhat counter-intuitive.

Would it be possible to run this workload with the linux-next branch
and the schedutil governor and see if the patch at
https://patchwork.kernel.org/patch/9671829/ makes any difference?

Thanks,
Rafael


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-24  0:59             ` Rafael J. Wysocki
@ 2017-04-24  1:21               ` Srinivas Pandruvada
  2017-04-24 14:24               ` Doug Smythies
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Srinivas Pandruvada @ 2017-04-24  1:21 UTC (permalink / raw)
  To: Rafael J. Wysocki, Doug Smythies
  Cc: Rafael J. Wysocki, Mel Gorman, Rafael Wysocki, Jörg Otte,
	Linux Kernel Mailing List, Linux PM

On Mon, 2017-04-24 at 02:59 +0200, Rafael J. Wysocki wrote:
> On Sun, Apr 23, 2017 at 5:31 PM, Doug Smythies <dsmythies@telus.net>
> wrote:
[...]

> > It looks like the cost is mostly related to moving the load from
> > > one CPU to
> > > another and waiting for the new one to ramp up then.
When we analyzed Mel's results last year, this was the
conclusion. The problem was more apparent on systems with per-core
P-states.

> > > 
> > > I guess the workload consists of many small tasks that each start
> > > on new CPUs
> > > and cause that ping-pong to happen.
> > Yes, and (from trace data) many tasks are very very very small.
> > Also the test
> > appears to take a few holidays, of up to 1 second, during
> > execution.
> > 
> > > 
> > > > 
> > > > (performance governor, restated from a previous e-mail: 1776.05
> > > > seconds)
> > > But that causes the processor to stay in the maximum sustainable
> > > P-state all
> > > the time, which on Sandy Bridge is quite costly energetically.
> > Agreed. I only provide these data points as a reference and so that
> > we know
> > what the boundary conditions (limits) are.
> > 
> > > 
> > > We can do one more trick I forgot about.  Namely, if we are about
> > > to increase
> > > the P-state, we can jump to the average between the target and
> > > the max
> > > instead of just the target, like in the appended patch (on top of
> > > linux-next).
> > > 
> > > That will make the P-state selection really aggressive, so costly
> > > energetically,
> > > but it shoud small jumps of the average load above 0 to case big
> > > jumps of
> > > the target P-state.
> > I'm already seeing the energy costs of some of this stuff.
> > 3050.2 Seconds.
> Is this with or without reducing the sampling interval?
> 
> > 
> > Idle power 4.06 Watts.
> > 
> > Idle power for kernel 4.11-rc7 (performance-based): 3.89 Watts.
> > Idle power for kernel 4.11-rc7, using load-based: 4.01 watts
> > Idle power for kernel 4.11-rc7 next linux-pm: 3.91 watts
> Power draw differences are not dramatic, so this might be a viable
> change depending on the influence on the results elsewhere.
Last time, a solution was proposed to use a higher floor instead of the
min P-state for Atom platforms. But this ended up increasing power
consumption on some Android workloads.

Thanks,
Srinivas


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-22 21:07           ` Rafael J. Wysocki
@ 2017-04-24 10:01             ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2017-04-24 10:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, 'Rafael Wysocki', 'Jörg Otte',
	'Linux Kernel Mailing List', 'Linux PM',
	'Srinivas Pandruvada'

On Sat, Apr 22, 2017 at 11:07:44PM +0200, Rafael J. Wysocki wrote:
> > By far, and with any code, I get the fastest elapsed time, of course next
> > to performance mode, but not by much, by limiting the test to only use
> > just 1 cpu: 1814.2 Seconds.
> 
> Interesting.
> 
> It looks like the cost is mostly related to moving the load from one CPU to
> another and waiting for the new one to ramp up then.
> 

We've had that before although arguably it means this will generally be a
problem on older CPUs or CPUs with high exit latencies. It goes back to the
notion that it should be possible to tune such platforms to optionally ramp
up fast and ramp down slowly without resorting to the performance governor.

> I guess the workload consists of many small tasks that each start on new CPUs
> and cause that ping-pong to happen.
> 

Yes, not unusual in itself.

> > (performance governor, restated from a previous e-mail: 1776.05 seconds)
> 
> But that causes the processor to stay in the maximum sustainable P-state all
> the time, which on Sandy Bridge is quite costly energetically.
> 
> We can do one more trick I forgot about.  Namely, if we are about to increase
> the P-state, we can jump to the average between the target and the max
> instead of just the target, like in the appended patch (on top of linux-next).
> 
> That will make the P-state selection really aggressive, so costly energetically,
> but it shoud small jumps of the average load above 0 to case big jumps of
> the target P-state.
> 

So I took a look at where we currently stand and it's not too bad if you
accept that decisions made for newer CPUs do not always suit old CPUs.
That's inevitable unfortunately.

gitsource
                                 4.5.0            4.11.0-rc7            4.11.0-rc7            4.11.0-rc7            4.11.0-rc7            4.11.0-rc7
                               vanilla               vanilla      pm-next-20170421           revert-v1r1        loadbased-v1r1          bigjump-v1r1
Elapsed min          1827.70 (  0.00%)     3747.00 (-105.01%)     2501.39 (-36.86%)     2908.72 (-59.15%)     2501.01 (-36.84%)     2452.83 (-34.20%)
Elapsed mean         1830.72 (  0.00%)     3748.80 (-104.77%)     2504.02 (-36.78%)     2917.28 (-59.35%)     2503.74 (-36.76%)     2454.15 (-34.05%)
Elapsed stddev          2.18 (  0.00%)        1.33 ( 39.22%)        1.84 ( 15.88%)        5.16 (-136.32%)        1.84 ( 15.69%)        0.91 ( 58.48%)
Elapsed coeffvar        0.12 (  0.00%)        0.04 ( 70.32%)        0.07 ( 38.50%)        0.18 (-48.30%)        0.07 ( 38.36%)        0.04 ( 69.03%)
Elapsed max          1833.91 (  0.00%)     3751.00 (-104.54%)     2506.46 (-36.67%)     2924.93 (-59.49%)     2506.78 (-36.69%)     2455.44 (-33.89%)

At this point, pm-next is better than a plain revert of the patch so
that's great. It's still not as good as 4.5.0 but it's perfectly possible
something else is now at play. Your patch to "jump to the average between
the target and the max" helps a little bit, but given that it
doesn't bring things in line with 4.5.0, I wouldn't worry too much about
it making the merge window. I'll see how things look on a range of
machines after the next merge window.


* RE: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-24  0:59             ` Rafael J. Wysocki
  2017-04-24  1:21               ` Srinivas Pandruvada
@ 2017-04-24 14:24               ` Doug Smythies
  2017-04-25  7:13               ` Doug Smythies
  2017-04-25 21:03               ` Doug Smythies
  3 siblings, 0 replies; 21+ messages in thread
From: Doug Smythies @ 2017-04-24 14:24 UTC (permalink / raw)
  To: 'Srinivas Pandruvada', 'Rafael J. Wysocki'
  Cc: 'Rafael J. Wysocki', 'Mel Gorman',
	'Rafael Wysocki', 'Jörg Otte',
	'Linux Kernel Mailing List', 'Linux PM',
	Doug Smythies

On 2017.04.23 18:23 Srinivas Pandruvada wrote:
> On Mon, 2017-04-24 at 02:59 +0200, Rafael J. Wysocki wrote:
>> On Sun, Apr 23, 2017 at 5:31 PM, Doug Smythies <dsmythies@telus.net> wrote:

>>> It looks like the cost is mostly related to moving the load from
>>> one CPU to
>>> another and waiting for the new one to ramp up then.
> Last time when we analyzed Mel's result last year this was the
> conclusion. The problem was more apparent on systems with per core P-
> state.

?? I have never seen this particular use case before.
Unless I have looked at the wrong thing, Mel's issue last year was a
different use case.

...[cut]...
 
>>>> We can do one more trick I forgot about.  Namely, if we are about
>>>> to increase
>>>> the P-state, we can jump to the average between the target and
>>>> the max
>>>> instead of just the target, like in the appended patch (on top of
>>>> linux-next).
>>>> 
>>>> That will make the P-state selection really aggressive, so costly
>>>> energetically,
>>>> but it shoud small jumps of the average load above 0 to case big
>>>> jumps of
>>>> the target P-state.
>>> I'm already seeing the energy costs of some of this stuff.
>>> 3050.2 Seconds.
>> Is this with or without reducing the sampling interval?

It was without reducing the sample interval.

So, it was the branch you referred us to the other day:

git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next

with your patch (now deleted from this thread) applied.


...[cut]...

>> Anyway, your results are somewhat counter-intuitive.

>> Would it be possible to run this workload with the linux-next branch
>> and the schedutil governor and see if the patch at
>> https://patchwork.kernel.org/patch/9671829/ makes any difference?

git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
Plus that patch is in progress.

... Doug


* RE: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-24  0:59             ` Rafael J. Wysocki
  2017-04-24  1:21               ` Srinivas Pandruvada
  2017-04-24 14:24               ` Doug Smythies
@ 2017-04-25  7:13               ` Doug Smythies
  2017-04-25 21:26                 ` Rafael J. Wysocki
  2017-04-25 21:03               ` Doug Smythies
  3 siblings, 1 reply; 21+ messages in thread
From: Doug Smythies @ 2017-04-25  7:13 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Rafael J. Wysocki', 'Mel Gorman',
	'Rafael Wysocki', 'Jörg Otte',
	'Linux Kernel Mailing List', 'Linux PM',
	'Doug Smythies', 'Srinivas Pandruvada'

On 2017.04.24 07:25 Doug wrote:
> On 2017.04.23 18:23 Srinivas Pandruvada wrote:
>> On Mon, 2017-04-24 at 02:59 +0200, Rafael J. Wysocki wrote:
>>> On Sun, Apr 23, 2017 at 5:31 PM, Doug Smythies <dsmythies@telus.net> wrote:
>
>>>> It looks like the cost is mostly related to moving the load from
>>>> one CPU to
>>>> another and waiting for the new one to ramp up then.
>> Last time when we analyzed Mel's result last year this was the
>> conclusion. The problem was more apparent on systems with per core P-
>> state.
>
> ?? I have never seen this particular use case before.
> Unless I have looked the wrong thing, Mel's issue last year was a
> different use case.
>
> ...[cut]...
> 
>>>>> We can do one more trick I forgot about.  Namely, if we are about
>>>>> to increase
>>>>> the P-state, we can jump to the average between the target and
>>>>> the max
>>>>> instead of just the target, like in the appended patch (on top of
>>>>> linux-next).
>>>>> 
>>>>> That will make the P-state selection really aggressive, so costly
>>>>> energetically,
>>>>> but it shoud small jumps of the average load above 0 to case big
>>>>> jumps of
>>>>> the target P-state.
>>>> I'm already seeing the energy costs of some of this stuff.
>>>> 3050.2 Seconds.
>>> Is this with or without reducing the sampling interval?
>
> It was without reducing the sample interval.
>
> So, it was the branch you referred us to the other day:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
>
> with your patch (now deleted from this thread) applied.
>
>
> ...[cut]...
>
>>> Anyway, your results are somewhat counter-intuitive.
>
>>> Would it be possible to run this workload with the linux-next branch
>>> and the schedutil governor and see if the patch at
>>> https://patchwork.kernel.org/patch/9671829/ makes any difference?
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
> Plus that patch is in progress.

3387.76 Seconds.
Idle power 3.85 watts.

Other potentially interesting information for 2 hour idle test:
Driver called 21209 times. Maximum duration 2396 Seconds. Minimum duration 20 mSec. 
Histogram of target pstates:
16 8
17 3149
18 1436
19 1479
20 196
21 2
22 3087
23 375
24 22
25 4
26 2
27 3736
28 2177
29 13
30 0
31 0
32 2
33 0
34 1533
35 246
36 0
37 4
38 3738

Compared to kernel 4.11-rc7 (passive mode, schedutil governor)
3297.82 (re-stated from a previous e-mail)
Idle power 3.81 watts

Other potentially interesting information for 2 hour idle test:
Driver called 1631 times. Maximum duration 2510 Seconds. Minimum duration 0.587 mSec.
Histogram of target pstates (missing lines mean 0 occurrences):
16 813
24 2
38 816

... Doug


* RE: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-24  0:59             ` Rafael J. Wysocki
                                 ` (2 preceding siblings ...)
  2017-04-25  7:13               ` Doug Smythies
@ 2017-04-25 21:03               ` Doug Smythies
  3 siblings, 0 replies; 21+ messages in thread
From: Doug Smythies @ 2017-04-25 21:03 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Rafael J. Wysocki', 'Mel Gorman',
	'Rafael Wysocki', 'Jörg Otte',
	'Linux Kernel Mailing List', 'Linux PM',
	'Srinivas Pandruvada', 'Doug Smythies'

Hi Rafael,

Apologies, I swapped a couple of data points in my report last night:

On 2017.04.25 00:13 Doug wrote:
> On 2017.04.24 07:25 Doug wrote:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
>> Plus that patch is in progress.
>
>3387.76 Seconds.
> Idle power 3.85 watts.
>
> Other potentially interesting information for 2 hour idle test:
> Driver called 21209 times. Maximum duration 2396 Seconds. Minimum duration 20 mSec. 

Wrong, should read:

Driver called 21209 times. Maximum duration 2510 Seconds. Minimum duration 0.587 mSec.

> Compared to kernel 4.11-rc7 (passive mode, schedutil governor)
> 3297.82 (re-stated from a previous e-mail)
> Idle power 3.81 watts
>
> Other potentially interesting information for 2 hour idle test:
> Driver called 1631 times. Maximum duration 2510 Seconds. Minimum duration 0.587 mSec.

Wrong, should read:

Driver called 1631 times. Maximum duration 2396 Seconds. Minimum duration 20 mSec.


* Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6
  2017-04-25  7:13               ` Doug Smythies
@ 2017-04-25 21:26                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2017-04-25 21:26 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Mel Gorman, Rafael Wysocki,
	Jörg Otte, Linux Kernel Mailing List, Linux PM,
	Srinivas Pandruvada

On Tue, Apr 25, 2017 at 9:13 AM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2017.04.24 07:25 Doug wrote:
>> On 2017.04.23 18:23 Srinivas Pandruvada wrote:
>>> On Mon, 2017-04-24 at 02:59 +0200, Rafael J. Wysocki wrote:
>>>> On Sun, Apr 23, 2017 at 5:31 PM, Doug Smythies <dsmythies@telus.net> wrote:
>>
>>>>> It looks like the cost is mostly related to moving the load from
>>>>> one CPU to
>>>>> another and waiting for the new one to ramp up then.
>>> Last time when we analyzed Mel's result last year this was the
>>> conclusion. The problem was more apparent on systems with per core P-
>>> state.
>>
>> ?? I have never seen this particular use case before.
>> Unless I have looked the wrong thing, Mel's issue last year was a
>> different use case.
>>
>> ...[cut]...
>>
>>>>>> We can do one more trick I forgot about.  Namely, if we are about
>>>>>> to increase
>>>>>> the P-state, we can jump to the average between the target and
>>>>>> the max
>>>>>> instead of just the target, like in the appended patch (on top of
>>>>>> linux-next).
>>>>>>
>>>>>> That will make the P-state selection really aggressive, so costly
>>>>>> energetically,
>>>>>> but it shoud small jumps of the average load above 0 to case big
>>>>>> jumps of
>>>>>> the target P-state.
>>>>> I'm already seeing the energy costs of some of this stuff.
>>>>> 3050.2 Seconds.
>>>> Is this with or without reducing the sampling interval?
>>
>> It was without reducing the sample interval.
>>
>> So, it was the branch you referred us to the other day:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
>>
>> with your patch (now deleted from this thread) applied.
>>
>>
>> ...[cut]...
>>
>>>> Anyway, your results are somewhat counter-intuitive.
>>
>>>> Would it be possible to run this workload with the linux-next branch
>>>> and the schedutil governor and see if the patch at
>>>> https://patchwork.kernel.org/patch/9671829/ makes any difference?
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
>> Plus that patch is in progress.
>
> 3387.76 Seconds.
> Idle power 3.85 watts.
>
> Other potentially interesting information for 2 hour idle test:
> Driver called 21209 times. Maximum duration 2396 Seconds. Minimum duration 20 mSec.
> Histogram of target pstates:
> 16 8
> 17 3149
> 18 1436
> 19 1479
> 20 196
> 21 2
> 22 3087
> 23 375
> 24 22
> 25 4
> 26 2
> 27 3736
> 28 2177
> 29 13
> 30 0
> 31 0
> 32 2
> 33 0
> 34 1533
> 35 246
> 36 0
> 37 4
> 38 3738
>
> Compared to kernel 4.11-rc7 (passive mode, schedutil governor)
> 3297.82 (re-stated from a previous e-mail)
> Idle power 3.81 watts

All right, so it looks like the patch makes the workload run longer
and also use more energy.

Using more energy is quite expected, but slowing things down isn't,
as the patch aggregates the updates that would have been discarded by
taking the maximum utilization over them, which should result in
higher frequencies being used too.  It may be due to the increased
governor overhead, however.
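
For clarity, the aggregation idea is roughly the following (a hand-written
illustration only, not the actual patch; the structure and helper below are
made up):

#include <linux/types.h>

/* Assumed helper that would program the frequency; stubbed out here. */
static void set_frequency_for_util(unsigned int util) { (void)util; }

struct freq_update_state {
	u64		last_update_ns;
	u64		rate_limit_ns;
	unsigned int	max_util;	/* peak seen while rate-limited */
};

static void utilization_update(struct freq_update_state *s,
			       unsigned int util, u64 now_ns)
{
	/* Fold every update into a running maximum instead of dropping it. */
	if (util > s->max_util)
		s->max_util = util;

	/* Still inside the rate-limit window: nothing more to do yet. */
	if (now_ns - s->last_update_ns < s->rate_limit_ns)
		return;

	/*
	 * The next allowed frequency update sees the peak utilization, not
	 * just the most recent sample, so short spikes are no longer lost.
	 */
	set_frequency_for_util(s->max_util);
	s->max_util = 0;
	s->last_update_ns = now_ns;
}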

> Other potentially interesting information for 2 hour idle test:
> Driver called 1631 times. Maximum duration 2510 Seconds. Minimum duration 0.587 mSec.
> Histogram of target pstates (missing lines mean 0 occurrences):
> 16 813
> 24 2
> 38 816

Thanks for the data!

Rafael

