* Bench for testing scheduler
@ 2013-11-07 10:54 Vincent Guittot
  2013-11-07 11:32 ` Catalin Marinas
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Vincent Guittot @ 2013-11-07 10:54 UTC
  To: Morten Rasmussen, Alex Shi, Vincent Guittot, Peter Zijlstra,
	Paul Turner, Ingo Molnar, rjw, Srivatsa S. Bhat, Catalin Marinas,
	Paul Walmsley, Mel Gorman, Juri Lelli, fengguang.wu, markgross,
	Kevin Hilman, Frank.Rowand, Paul McKenney, linux-kernel

Hi,

During the Energy-aware scheduling mini-summit, we spoke about benches
that should be used to evaluate the modifications of the scheduler.
I’d like to propose a bench that uses cyclictest to measure the wake
up latency and the power consumption. The goal of this bench is to
exercise the scheduler with various sleeping periods and get the
average wakeup latency. The range of the sleeping period must cover
all residency times of the idle state table of the platform. I have
run such tests on a tc2 platform with the packing tasks patchset.
I have used the following command:
#cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
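
For reference, the sweep can be scripted along the following lines. This
is only a sketch (not the exact script used for the results below); the
thread count and the parsing of cyclictest's per-thread summary lines
are assumptions to adapt per platform:

#!/usr/bin/env python3
# Sweep the cyclictest base interval across the idle-state residency
# range and average the per-thread "Avg" latencies for each step.
# Needs root, like cyclictest itself.
import re
import statistics
import subprocess

NR_CORES = 5  # assumption: one thread per core; adjust per platform

def run_once(interval_us):
    out = subprocess.run(
        ["cyclictest", "-t", str(NR_CORES), "-q", "-e", "10000000",
         "-i", str(interval_us), "-d", "150", "-l", "2000"],
        capture_output=True, text=True, check=True).stdout
    # One "T: n (...) ... Avg: x ..." summary line per thread.
    return statistics.mean(
        int(m.group(1)) for m in re.finditer(r"Avg:\s*(\d+)", out))

results = {i: run_once(i) for i in range(500, 12501, 200)}
# A single "score": the average over all measured intervals.
print("average wakeup latency score: %.0f us"
      % statistics.mean(results.values()))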

The figures below give the average wakeup latency and power
consumption for default scheduler behavior, packing tasks at cluster
level and packing tasks at core level. We can see both wakeup latency
and power consumption variation. The detailed result is not a simple
single value which makes comparison not so easy but the average of all
measurements should give us a usable “score”.

I know that Ingo would like to add the benches in tools/* but I wonder
if it makes sense to copy cyclictest into this directory when we have an
official git tree here:
git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git

I have put both the final "score" and the detailed results below so everybody
can check the score against the detailed figures:

            |  Default average results                  |  Cluster Packing average results          |  Core Packing average results
            |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
            |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
            |      879                794890    2364175 |      416                879688      12750 |      189                897452      30052

 Cyclictest |  Default                                  |  Packing at Cluster level                 |  Packing at Core level
   Interval |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
       (us) |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
        500         24          1    1147477    2479576         21          1    1136768      11693         22          1    1126062      30138
        700         22          1    1136084    3058419         21          0    1125280      11761         21          1    1109950      23503
        900         22          1    1136017    3036768         21          1    1112542      12017         20          0    1101089      23733
       1100         24          1    1132964    2506132         21          0    1109039      12248         21          1    1091832      23621
       1300         24          1    1123896    2488459         21          0    1099308      12015         21          1    1086301      23264
       1500         24          1    1120842    2488272         21          0    1099811      12685         20          0    1083658      22499
       1700         41         38    1117166    3042091         21          0    1090920      12393         21          1    1080387      23015
       1900        119        182    1120552    2737555         21          0    1087900      11900         21          1    1078711      23177
       2100        167        195    1122425    3210655         22          2    1090420      11900         20          1    1077985      22639
       2300        152        156    1119854    2497773         43         22    1087278      11921         21          1    1075943      26282
       2500        182        163    1120818    2365870         63         29    1089169      11551         21          0    1073717      24290
       2700        439        202    1058952    3058516        107         41    1077955      12122         21          0    1070951      23126
       2900        570        268    1028238    3099162        148         30    1067562      13287         24          1    1064200      24260
       3100        751        137     946512    3158095        178         30    1059395      12236         29          1    1058887      23225
       3300        696        203     964822    3042524        206         28    1041194      13934         36          1    1056656      23941
       3500        728        191     959398    3006066        235         36    1028150      13387         44          3    1045841      23873
       3700        844        138     921780    3033189        245         31    1019065      14582         62          6    1034466      22501
       3900        815        172     925600    2862994        273         33    1001974      12091         80          9    1014650      24444
       4100        870        179     897616    2940444        279         35     996226      12014         88         11    1030588      25461
       4300        979        119     846912    2996911        306         36     980075      12641        100         12    1035173      24832
       4500        891        168     863631    2760879        336         45     955072      12016        126         12     993256      23929
       4700        943        110     836333    2796629        351         39     942390      12902        125         15     996548      24637
       4900        997        118     800205    2743317        391         49     917067      12868        134         23    1011089      25266
       5100       1050        114     789152    2693104        408         53     903123      12033        196         22     894294      25142
       5300       1052        111     769544    2668315        425         54     895006      12264        171         19     933356      25873
       5500       1002        179     794222    2554432        430         45     886025      12007        171         18     938921      24382
       5700       1002        180     786714    2441228        436         46     878043      12258        172         14     944908      30291
       5900       1117         90     742883    2554813        471         53     864134      12471        170         12     957811      25119
       6100       1166         92     734510    2566381        479         68     854384      12579        190         16     926807      25544
       6300       1132        123     738812    2447974        488         57     849740      12968        216         10     882940      26546
       6500       1123        150     743870    2323338        495         52     836256      12472        210         20     896639      25149
       6700       1173        139     724691    2330720        522         70     822678      12949        269         27     800938      28653
       6900       1054        112     725451    2953919        522         69     822682      12184        261         26     785269      28199
       7100       1098        174     731504    2255090        502         87     820909      13072        216         15     870777      25336
       7300       1244        156     702596    2317562        531         88     808677      12770        247         18     813081      28126
       7500       1181        143     694538    2226994        545         90     796698      12368        226         14     862177      26597
       7700       1189        147     681836    2183167        555         87     799215      12499        250         17     797699      26342
       7900       1082        149     694010    1926757        555         90     791777      13137        243         20     824061      26772
       8100       1068        145     678222    2791019        552         80     785043      13071        266         16     781563      26579
       8300       1102        135     690978    1851892        582        136     781035      13067        267         18     782060      26683
       8500       1190        191     653566    2068057        574        127     777348      13139        262         21     800524      27086
       8700       1172        185     666525    2031543        602        104     778754      13364        228         13     884802      25340
       8900       1024        179     685123    1689661        594         98     768617      13753        266         20     801557      26075
       9100       1077        166     658295    1756367        615        101     759656      13297        308         19     739619      25677
       9300       1211        203     618593    2055230        606        111     753652      13231        319         23     743849      26041
       9500       1163        189     627123    1794459        615        125     751993      13174        264         19     865898      25795
       9700       1240        202     589520    1983417        649        157     738596      13473        326         71     742113      25528
       9900       1188        207     612908    1830208        635        125     725890      14240        299         40     770069      24714
      10100       1168        219     596998    1781611        647        132     718260      13834        245         35     905581      24854
      10300       1083        222     615543    1506529        641        130     700636      13108        401         24     643222      26497
      10500       1183        210     573875    1753476        648        169     708408      12756        392         30     636559      28712
      10700       1217        234     526025    2014191        648        165     696542      13092        374         26     675566      28555
      10900       1161        179     594406    1722260        647        194     698681      13715        344         45     682158      26681
      11100       1185        209     578309    1919206        670        166     724562      13408        339         50     743402      28010
      11300       1144        185     609694    1791436        671        136     712555      12769        307         36     762260      26575
      11500       1070        188     617941    1470628        650        151     723367      12596        353         21     659704      28015
      11700       1205        199     570787    1801593        673        168     706260      12568        347         12     689414      29196
      11900       1216        174     563915    1761745        686        135     698164      12840        361         10     663126      27517
      12100       1155        218     568867    1596189        677        159     705873      12759        309         14     774833     290747
      12300       1236        187     543536    1738447        705        177     705564      13028        330         21     745009      28134
      12500       1176        202     545135    1651420        696        148     697624      13280        339         20     724057      26461

Vincent

* Re: Bench for testing scheduler
  2013-11-07 10:54 Bench for testing scheduler Vincent Guittot
@ 2013-11-07 11:32 ` Catalin Marinas
  2013-11-07 13:33 ` Vincent Guittot
  2013-11-07 17:42 ` Morten Rasmussen
  2 siblings, 0 replies; 11+ messages in thread
From: Catalin Marinas @ 2013-11-07 11:32 UTC
  To: Vincent Guittot
  Cc: Morten Rasmussen, Alex Shi, Peter Zijlstra, Paul Turner,
	Ingo Molnar, rjw, Srivatsa S. Bhat, Paul Walmsley, Mel Gorman,
	Juri Lelli, fengguang.wu, markgross, Kevin Hilman, Frank.Rowand,
	Paul McKenney, linux-kernel

Hi Vincent,

(for whatever reason, the text is wrapped and results hard to read)

On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> During the Energy-aware scheduling mini-summit, we spoke about benches
> that should be used to evaluate the modifications of the scheduler.
> I’d like to propose a bench that uses cyclictest to measure the wake
> up latency and the power consumption. The goal of this bench is to
> exercise the scheduler with various sleeping periods and get the
> average wakeup latency. The range of the sleeping period must cover
> all residency times of the idle state table of the platform. I have
> run such tests on a tc2 platform with the packing tasks patchset.
> I have used the following command:
> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000

cyclictest could be a good starting point but we need to improve it to
allow threads of different loads, possibly starting multiple processes
(can be done with a script), randomly varying load threads. These
parameters should be loaded from a file so that we can have multiple
configurations (per SoC and per use-case). But the big risk is that we
try to optimise the scheduler for something which is not realistic.

We are working on describing some basic scenarios (plain English for
now) and one of them could be video playing with threads for audio and
video decoding with random change in the workload.

So I think the first step should be a set of tools/scripts to analyse
the scheduler behaviour, both in terms of latency and power, and these
can use perf sched. We can then run some real life scenarios (e.g.
Android video playback) and build a benchmark that matches such
behaviour as close as possible. We can probably use (or improve) perf
sched replay to also simulate such workload (we may need additional
features like thread dependencies).

> The figures below give the average wakeup latency and power
> consumption for default scheduler behavior, packing tasks at cluster
> level and packing tasks at core level. We can see both wakeup latency
> and power consumption variation. The detailed result is not a simple
> single value which makes comparison not so easy but the average of all
> measurements should give us a usable “score”.

How did you assess the power/energy?

Thanks.

-- 
Catalin

* Re: Bench for testing scheduler
  2013-11-07 10:54 Bench for testing scheduler Vincent Guittot
  2013-11-07 11:32 ` Catalin Marinas
@ 2013-11-07 13:33 ` Vincent Guittot
  2013-11-07 14:04   ` Catalin Marinas
  2013-11-08  0:04   ` Rowand, Frank
  2013-11-07 17:42 ` Morten Rasmussen
  2 siblings, 2 replies; 11+ messages in thread
From: Vincent Guittot @ 2013-11-07 13:33 UTC
  To: catalin.marinas
  Cc: Morten.Rasmussen, alex.shi, peterz, pjt, mingo, rjw,
	srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
	markgross, khilman, Frank.Rowand, paulmck, linux-kernel,
	Vincent Guittot

On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> Hi Vincent,
>
> (for whatever reason, the text is wrapped and results hard to read)

Yes, I have just seen that. It looks like gmail has wrapped the lines.
I have added the results, which should not be wrapped, at the end of this email.

>
>
> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>> During the Energy-aware scheduling mini-summit, we spoke about benches
>> that should be used to evaluate the modifications of the scheduler.
>> I’d like to propose a bench that uses cyclictest to measure the wake
>> up latency and the power consumption. The goal of this bench is to
>> exercise the scheduler with various sleeping periods and get the
>> average wakeup latency. The range of the sleeping period must cover
>> all residency times of the idle state table of the platform. I have
>> run such tests on a tc2 platform with the packing tasks patchset.
>> I have used the following command:
>> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>
> cyclictest could be a good starting point but we need to improve it to
> allow threads of different loads, possibly starting multiple processes
> (can be done with a script), randomly varying load threads. These
> parameters should be loaded from a file so that we can have multiple
> configurations (per SoC and per use-case). But the big risk is that we
> try to optimise the scheduler for something which is not realistic.

The goal of this simple bench is to measure the wake up latency and the best value the scheduler can reach on a platform, not to emulate a "real" use case. In the same way that sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system.

Starting multiple processes and adding some loads can also be useful, but the target will be a bit different from wake up latency. I have one concern with randomness because it prevents us from having repeatable and comparable tests and results.

I agree that we have to test "real" use cases, but it doesn't prevent us from testing the limits of a characteristic on a system.

>
>
> We are working on describing some basic scenarios (plain English for
> now) and one of them could be video playing with threads for audio and
> video decoding with random change in the workload.
>
> So I think the first step should be a set of tools/scripts to analyse
> the scheduler behaviour, both in terms of latency and power, and these
> can use perf sched. We can then run some real life scenarios (e.g.
> Android video playback) and build a benchmark that matches such
> behaviour as close as possible. We can probably use (or improve) perf
> sched replay to also simulate such workload (we may need additional
> features like thread dependencies).
>
>> The figures below give the average wakeup latency and power
>> consumption for default scheduler behavior, packing tasks at cluster
>> level and packing tasks at core level. We can see both wakeup latency
>> and power consumption variation. The detailed result is not a simple
>> single value which makes comparison not so easy but the average of all
>> measurements should give us a usable “score”.
>
> How did you assess the power/energy?

I have used the embedded joule meter of the tc2.

>
> Thanks.
>
> --
> Catalin

            |  Default average results                  |  Cluster Packing average results          |  Core Packing average results
            |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
            |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
            |      879                794890    2364175 |      416                879688      12750 |      189                897452      30052

 Cyclictest |  Default                                  |  Packing at Cluster level                 |  Packing at Core level
   Interval |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
       (us) |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
        500         24          1    1147477    2479576         21          1    1136768      11693         22          1    1126062      30138
        700         22          1    1136084    3058419         21          0    1125280      11761         21          1    1109950      23503
        900         22          1    1136017    3036768         21          1    1112542      12017         20          0    1101089      23733
       1100         24          1    1132964    2506132         21          0    1109039      12248         21          1    1091832      23621
       1300         24          1    1123896    2488459         21          0    1099308      12015         21          1    1086301      23264
       1500         24          1    1120842    2488272         21          0    1099811      12685         20          0    1083658      22499
       1700         41         38    1117166    3042091         21          0    1090920      12393         21          1    1080387      23015
       1900        119        182    1120552    2737555         21          0    1087900      11900         21          1    1078711      23177
       2100        167        195    1122425    3210655         22          2    1090420      11900         20          1    1077985      22639
       2300        152        156    1119854    2497773         43         22    1087278      11921         21          1    1075943      26282
       2500        182        163    1120818    2365870         63         29    1089169      11551         21          0    1073717      24290
       2700        439        202    1058952    3058516        107         41    1077955      12122         21          0    1070951      23126
       2900        570        268    1028238    3099162        148         30    1067562      13287         24          1    1064200      24260
       3100        751        137     946512    3158095        178         30    1059395      12236         29          1    1058887      23225
       3300        696        203     964822    3042524        206         28    1041194      13934         36          1    1056656      23941
       3500        728        191     959398    3006066        235         36    1028150      13387         44          3    1045841      23873
       3700        844        138     921780    3033189        245         31    1019065      14582         62          6    1034466      22501
       3900        815        172     925600    2862994        273         33    1001974      12091         80          9    1014650      24444
       4100        870        179     897616    2940444        279         35     996226      12014         88         11    1030588      25461
       4300        979        119     846912    2996911        306         36     980075      12641        100         12    1035173      24832
       4500        891        168     863631    2760879        336         45     955072      12016        126         12     993256      23929
       4700        943        110     836333    2796629        351         39     942390      12902        125         15     996548      24637
       4900        997        118     800205    2743317        391         49     917067      12868        134         23    1011089      25266
       5100       1050        114     789152    2693104        408         53     903123      12033        196         22     894294      25142
       5300       1052        111     769544    2668315        425         54     895006      12264        171         19     933356      25873
       5500       1002        179     794222    2554432        430         45     886025      12007        171         18     938921      24382
       5700       1002        180     786714    2441228        436         46     878043      12258        172         14     944908      30291
       5900       1117         90     742883    2554813        471         53     864134      12471        170         12     957811      25119
       6100       1166         92     734510    2566381        479         68     854384      12579        190         16     926807      25544
       6300       1132        123     738812    2447974        488         57     849740      12968        216         10     882940      26546
       6500       1123        150     743870    2323338        495         52     836256      12472        210         20     896639      25149
       6700       1173        139     724691    2330720        522         70     822678      12949        269         27     800938      28653
       6900       1054        112     725451    2953919        522         69     822682      12184        261         26     785269      28199
       7100       1098        174     731504    2255090        502         87     820909      13072        216         15     870777      25336
       7300       1244        156     702596    2317562        531         88     808677      12770        247         18     813081      28126
       7500       1181        143     694538    2226994        545         90     796698      12368        226         14     862177      26597
       7700       1189        147     681836    2183167        555         87     799215      12499        250         17     797699      26342
       7900       1082        149     694010    1926757        555         90     791777      13137        243         20     824061      26772
       8100       1068        145     678222    2791019        552         80     785043      13071        266         16     781563      26579
       8300       1102        135     690978    1851892        582        136     781035      13067        267         18     782060      26683
       8500       1190        191     653566    2068057        574        127     777348      13139        262         21     800524      27086
       8700       1172        185     666525    2031543        602        104     778754      13364        228         13     884802      25340
       8900       1024        179     685123    1689661        594         98     768617      13753        266         20     801557      26075
       9100       1077        166     658295    1756367        615        101     759656      13297        308         19     739619      25677
       9300       1211        203     618593    2055230        606        111     753652      13231        319         23     743849      26041
       9500       1163        189     627123    1794459        615        125     751993      13174        264         19     865898      25795
       9700       1240        202     589520    1983417        649        157     738596      13473        326         71     742113      25528
       9900       1188        207     612908    1830208        635        125     725890      14240        299         40     770069      24714
      10100       1168        219     596998    1781611        647        132     718260      13834        245         35     905581      24854
      10300       1083        222     615543    1506529        641        130     700636      13108        401         24     643222      26497
      10500       1183        210     573875    1753476        648        169     708408      12756        392         30     636559      28712
      10700       1217        234     526025    2014191        648        165     696542      13092        374         26     675566      28555
      10900       1161        179     594406    1722260        647        194     698681      13715        344         45     682158      26681
      11100       1185        209     578309    1919206        670        166     724562      13408        339         50     743402      28010
      11300       1144        185     609694    1791436        671        136     712555      12769        307         36     762260      26575
      11500       1070        188     617941    1470628        650        151     723367      12596        353         21     659704      28015
      11700       1205        199     570787    1801593        673        168     706260      12568        347         12     689414      29196
      11900       1216        174     563915    1761745        686        135     698164      12840        361         10     663126      27517
      12100       1155        218     568867    1596189        677        159     705873      12759        309         14     774833     290747
      12300       1236        187     543536    1738447        705        177     705564      13028        330         21     745009      28134
      12500       1176        202     545135    1651420        696        148     697624      13280        339         20     724057      26461

Vincent

* Re: Bench for testing scheduler
  2013-11-07 13:33 ` Vincent Guittot
@ 2013-11-07 14:04   ` Catalin Marinas
  2013-11-08  9:30     ` Vincent Guittot
  2013-11-08  0:04   ` Rowand, Frank
  1 sibling, 1 reply; 11+ messages in thread
From: Catalin Marinas @ 2013-11-07 14:04 UTC
  To: Vincent Guittot
  Cc: Morten Rasmussen, alex.shi, peterz, pjt, mingo, rjw,
	srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
	markgross, khilman, Frank.Rowand, paulmck, linux-kernel

On Thu, Nov 07, 2013 at 01:33:43PM +0000, Vincent Guittot wrote:
> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> >> During the Energy-aware scheduling mini-summit, we spoke about benches
> >> that should be used to evaluate the modifications of the scheduler.
> >> I’d like to propose a bench that uses cyclictest to measure the wake
> >> up latency and the power consumption. The goal of this bench is to
> >> exercise the scheduler with various sleeping periods and get the
> >> average wakeup latency. The range of the sleeping period must cover
> >> all residency times of the idle state table of the platform. I have
> >> run such tests on a tc2 platform with the packing tasks patchset.
> >> I have used the following command:
> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
> >
> > cyclictest could be a good starting point but we need to improve it to
> > allow threads of different loads, possibly starting multiple processes
> > (can be done with a script), randomly varying load threads. These
> > parameters should be loaded from a file so that we can have multiple
> > configurations (per SoC and per use-case). But the big risk is that we
> > try to optimise the scheduler for something which is not realistic.
> 
> The goal of this simple bench is to measure the wake up latency and
> the best value the scheduler can reach on a platform, not to emulate
> a "real" use case. In the same way that sched-pipe tests a specific
> behavior of the scheduler, this bench tests the wake up latency of a
> system.

These figures are indeed useful to make sure we don't have any
regression in terms of latency but I would not use cyclictest (as it is)
to assess power improvements since the test is too artificial.

> Starting multiple processes and adding some loads can also be useful,
> but the target will be a bit different from wake up latency. I have one
> concern with randomness because it prevents us from having repeatable
> and comparable tests and results.

We can avoid randomness but still make it varying by some predictable
function.
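
A trivial sketch of that idea (the names and ranges here are made up):
derive the per-cycle busy time from a fixed-seed PRNG, or any pure
function of the cycle index, so the load varies but every run sees the
identical sequence:

import random

def load_sequence(n_cycles, seed=42, min_us=100, max_us=5000):
    rng = random.Random(seed)  # fixed seed: varying yet repeatable
    return [rng.randint(min_us, max_us) for _ in range(n_cycles)]

assert load_sequence(1000) == load_sequence(1000)  # identical every run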

> I agree that we have to test "real" use cases, but it doesn't prevent
> us from testing the limits of a characteristic on a system.

I agree. My point is not to use this as "the benchmark".

I would prefer to assess the impact on latency (and power) using a tool
independent from benchmarks like cyclictest (e.g. use the reports from
power sched). The reason is that once we have those tools/scripts in the
kernel, a third party can run it on real workloads and provide the
kernel developers with real numbers on performance vs power scheduling,
regressions between kernel versions etc. We can't create a power model
that you can run on an x86 for example and give you an indication of the
power saving on ARM, you need to run the benchmarks on the actual
hardware (that's why I don't think linsched is of much use from a power
perspective).

-- 
Catalin

* Re: Bench for testing scheduler
  2013-11-07 10:54 Bench for testing scheduler Vincent Guittot
  2013-11-07 11:32 ` Catalin Marinas
  2013-11-07 13:33 ` Vincent Guittot
@ 2013-11-07 17:42 ` Morten Rasmussen
  2013-11-09  0:15   ` Rowand, Frank
  2 siblings, 1 reply; 11+ messages in thread
From: Morten Rasmussen @ 2013-11-07 17:42 UTC
  To: Vincent Guittot
  Cc: Alex Shi, Peter Zijlstra, Paul Turner, Ingo Molnar, rjw,
	Srivatsa S. Bhat, Catalin Marinas, Paul Walmsley, Mel Gorman,
	Juri Lelli, fengguang.wu, markgross, Kevin Hilman, Frank.Rowand,
	Paul McKenney, linux-kernel

Hi Vincent,

On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> Hi,
> 
> During the Energy-aware scheduling mini-summit, we spoke about benches
> that should be used to evaluate the modifications of the scheduler.
> I’d like to propose a bench that uses cyclictest to measure the wake
> up latency and the power consumption. The goal of this bench is to
> exercise the scheduler with various sleeping periods and get the
> average wakeup latency. The range of the sleeping period must cover
> all residency times of the idle state table of the platform. I have
> run such tests on a tc2 platform with the packing tasks patchset.
> I have used the following command:
> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000

I think cyclictest is a useful model of small(er) periodic tasks for
benchmarking energy related patches. However, it doesn't have a
good-enough-performance criterion as it is. I think that is a strict
requirement for all energy related benchmarks.

Measuring latency gives us a performance metric while the energy tells
us how energy efficient we are. But without a latency requirement we
can't really say if a patch helps energy-awareness unless it improves
both energy _and_ performance. That is the case for your packing patches
for this particular benchmark with this specific configuration. That is
a really good result. However, in the general case patches may trade a
bit of performance to get better energy, which is also good if
performance still meets the requirement of the application/user. So we
need a performance criterion that tells us when we sacrifice too much
performance when trying to save power. Without it, this is just a
performance benchmark where we measure power.

Coming up with a performance criterion for cyclictest is not so easy as
it doesn't really model any specific application. I guess sacrificing a
bit of latency is acceptable if it comes with significant energy
savings. But a huge performance impact might not be, even if it comes
with massive energy savings. So maybe the criterion would consist of
both a latency requirement (e.g. at most a 10% increase) and a
requirement for improved energy per work.
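
As a sketch of what such a combined criterion could look like (the 10%
bound and the energy-per-work definition here are only illustrative
assumptions, not part of the proposal):

def acceptable(base_lat, new_lat, base_energy, new_energy,
               base_work, new_work, max_lat_increase=0.10):
    # Pass only if latency regresses by at most max_lat_increase AND
    # the energy spent per unit of work improves at the same time.
    latency_ok = new_lat <= base_lat * (1.0 + max_lat_increase)
    energy_ok = (new_energy / new_work) < (base_energy / base_work)
    return latency_ok and energy_ok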

As I see it, that is the only way we can validate the energy efficiency
of patches that trade performance for improved energy.

Morten


* RE: Bench for testing scheduler
  2013-11-07 13:33 ` Vincent Guittot
  2013-11-07 14:04   ` Catalin Marinas
@ 2013-11-08  0:04   ` Rowand, Frank
  2013-11-08  9:28     ` Vincent Guittot
  1 sibling, 1 reply; 11+ messages in thread
From: Rowand, Frank @ 2013-11-08  0:04 UTC
  To: Vincent Guittot, catalin.marinas
  Cc: Morten.Rasmussen, alex.shi, peterz, pjt, mingo, rjw,
	srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
	markgross, khilman, paulmck, linux-kernel

Hi Vincent,

Thanks for creating some benchmark numbers!


On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
> 
> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > Hi Vincent,
> >
> > (for whatever reason, the text is wrapped and results hard to read)
> 
> Yes, I have just seen that. It looks like gmail has wrapped the lines.
> I have added the results, which should not be wrapped, at the end of this email.
> 
> >
> >
> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> >> During the Energy-aware scheduling mini-summit, we spoke about benches
> >> that should be used to evaluate the modifications of the scheduler.
> >> I’d like to propose a bench that uses cyclictest to measure the wake
> >> up latency and the power consumption. The goal of this bench is to
> >> exercise the scheduler with various sleeping periods and get the
> >> average wakeup latency. The range of the sleeping period must cover
> >> all residency times of the idle state table of the platform. I have
> >> run such tests on a tc2 platform with the packing tasks patchset.
> >> I have used the following command:
> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000

The number of loops ("-l 2000") should be much larger to create useful
results.  I don't have a specific number that is large enough; I just
know from experience that 2000 is way too small.  For example, running
cyclictest several times with the same values on my laptop gives values
that are not consistent:

   $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
   # /dev/cpu_dma_latency set to 10000000us
   T: 0 ( 9703) P: 0 I:500 C:   2000 Min:      2 Act:   90 Avg:   77 Max:     243
   T: 1 ( 9704) P: 0 I:650 C:   1557 Min:      2 Act:   58 Avg:   68 Max:     226
   T: 2 ( 9705) P: 0 I:800 C:   1264 Min:      2 Act:   54 Avg:   81 Max:    1017
   T: 3 ( 9706) P: 0 I:950 C:   1065 Min:      2 Act:   11 Avg:   80 Max:     260
   
   $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
   # /dev/cpu_dma_latency set to 10000000us
   T: 0 ( 9709) P: 0 I:500 C:   2000 Min:      2 Act:   45 Avg:   74 Max:     390
   T: 1 ( 9710) P: 0 I:650 C:   1554 Min:      2 Act:   82 Avg:   61 Max:     810
   T: 2 ( 9711) P: 0 I:800 C:   1263 Min:      2 Act:   83 Avg:   74 Max:     287
   T: 3 ( 9712) P: 0 I:950 C:   1064 Min:      2 Act:  103 Avg:   79 Max:     551
   
   $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
   # /dev/cpu_dma_latency set to 10000000us
   T: 0 ( 9716) P: 0 I:500 C:   2000 Min:      2 Act:   82 Avg:   72 Max:     252
   T: 1 ( 9717) P: 0 I:650 C:   1556 Min:      2 Act:  115 Avg:   77 Max:     354
   T: 2 ( 9718) P: 0 I:800 C:   1264 Min:      2 Act:   59 Avg:   78 Max:    1143
   T: 3 ( 9719) P: 0 I:950 C:   1065 Min:      2 Act:  104 Avg:   70 Max:     238
   
   $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
   # /dev/cpu_dma_latency set to 10000000us
   T: 0 ( 9722) P: 0 I:500 C:   2000 Min:      2 Act:   82 Avg:   68 Max:     213
   T: 1 ( 9723) P: 0 I:650 C:   1555 Min:      2 Act:   65 Avg:   65 Max:    1279
   T: 2 ( 9724) P: 0 I:800 C:   1264 Min:      2 Act:   91 Avg:   69 Max:     244
   T: 3 ( 9725) P: 0 I:950 C:   1065 Min:      2 Act:   58 Avg:   76 Max:     242


> >
> > cyclictest could be a good starting point but we need to improve it to
> > allow threads of different loads, possibly starting multiple processes
> > (can be done with a script), randomly varying load threads. These
> > parameters should be loaded from a file so that we can have multiple
> > configurations (per SoC and per use-case). But the big risk is that we
> > try to optimise the scheduler for something which is not realistic.
> 
> The goal of this simple bench is to measure the wake up latency and the best value the scheduler can reach on a platform, not to emulate a "real" use case. In the same way that sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system.
> 
> Starting multiple processes and adding some loads can also be useful, but the target will be a bit different from wake up latency. I have one concern with randomness because it prevents us from having repeatable and comparable tests and results.
> 
> I agree that we have to test "real" use cases, but it doesn't prevent us from testing the limits of a characteristic on a system.
> 
> >
> >
> > We are working on describing some basic scenarios (plain English for
> > now) and one of them could be video playing with threads for audio and
> > video decoding with random change in the workload.
> >
> > So I think the first step should be a set of tools/scripts to analyse
> > the scheduler behaviour, both in terms of latency and power, and these
> > can use perf sched. We can then run some real life scenarios (e.g.
> > Android video playback) and build a benchmark that matches such
> > behaviour as close as possible. We can probably use (or improve) perf
> > sched replay to also simulate such workload (we may need additional
> > features like thread dependencies).
> >
> >> The figures below give the average wakeup latency and power
> >> consumption for default scheduler behavior, packing tasks at cluster
> >> level and packing tasks at core level. We can see both wakeup latency
> >> and power consumption variation. The detailed result is not a simple
> >> single value which makes comparison not so easy but the average of all
> >> measurements should give us a usable “score”.
> >
> > How did you assess the power/energy?
> 
> I have used the embedded joule meter of the tc2.
> 
> >
> > Thanks.
> >
> > --
> > Catalin
> 
>             |  Default average results                  |  Cluster Packing average results          |  Core Packing average results
>             |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
>             |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
>             |      879                794890    2364175 |      416                879688      12750 |      189                897452      30052
> 
>  Cyclictest |  Default                                  |  Packing at Cluster level                 |  Packing at Core level
>    Interval |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
>        (us) |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
>         500         24          1    1147477    2479576         21          1    1136768      11693         22          1    1126062      30138
>         700         22          1    1136084    3058419         21          0    1125280      11761         21          1    1109950      23503

< snip >

Some questions about what these metrics are:

The cyclictest data is reported per thread.  How did you combine the per thread data
to get a single latency and stddev value?

Is "Latency" the average latency?

stddev is not reported by cyclictest.  How did you create this value?  Did you
use the "-v" cyclictest option to report detailed data, then calculate stddev from
the detailed data?

Thanks,

-Frank

* Re: Bench for testing scheduler
  2013-11-08  0:04   ` Rowand, Frank
@ 2013-11-08  9:28     ` Vincent Guittot
  2013-11-08 21:12       ` Rowand, Frank
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2013-11-08  9:28 UTC
  To: Rowand, Frank
  Cc: catalin.marinas, Morten.Rasmussen, alex.shi, peterz, pjt, mingo,
	rjw, srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
	markgross, khilman, paulmck, linux-kernel

On 8 November 2013 01:04, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
> Hi Vincent,
>
> Thanks for creating some benchmark numbers!

you're welcome

>
>
> On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>>
>> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > Hi Vincent,
>> >
>> > (for whatever reason, the text is wrapped and results hard to read)
>>
>> Yes, I have just seen that. It looks like gmail has wrapped the lines.
>> I have added the results, which should not be wrapped, at the end of this email.
>>
>> >
>> >
>> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>> >> During the Energy-aware scheduling mini-summit, we spoke about benches
>> >> that should be used to evaluate the modifications of the scheduler.
>> >> I’d like to propose a bench that uses cyclictest to measure the wake
>> >> up latency and the power consumption. The goal of this bench is to
>> >> exercise the scheduler with various sleeping periods and get the
>> >> average wakeup latency. The range of the sleeping period must cover
>> >> all residency times of the idle state table of the platform. I have
>> >> run such tests on a tc2 platform with the packing tasks patchset.
>> >> I have used the following command:
>> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>
> The number of loops ("-l 2000") should be much larger to create useful
> results.  I don't have a specific number that is large enough; I just
> know from experience that 2000 is way too small.  For example, running
> cyclictest several times with the same values on my laptop gives values
> that are not consistent:

The Avg figures look almost stable IMO. Are you speaking about the Max
value for the inconsistency?

>
>    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
>    # /dev/cpu_dma_latency set to 10000000us
>    T: 0 ( 9703) P: 0 I:500 C:   2000 Min:      2 Act:   90 Avg:   77 Max:     243
>    T: 1 ( 9704) P: 0 I:650 C:   1557 Min:      2 Act:   58 Avg:   68 Max:     226
>    T: 2 ( 9705) P: 0 I:800 C:   1264 Min:      2 Act:   54 Avg:   81 Max:    1017
>    T: 3 ( 9706) P: 0 I:950 C:   1065 Min:      2 Act:   11 Avg:   80 Max:     260
>
>    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
>    # /dev/cpu_dma_latency set to 10000000us
>    T: 0 ( 9709) P: 0 I:500 C:   2000 Min:      2 Act:   45 Avg:   74 Max:     390
>    T: 1 ( 9710) P: 0 I:650 C:   1554 Min:      2 Act:   82 Avg:   61 Max:     810
>    T: 2 ( 9711) P: 0 I:800 C:   1263 Min:      2 Act:   83 Avg:   74 Max:     287
>    T: 3 ( 9712) P: 0 I:950 C:   1064 Min:      2 Act:  103 Avg:   79 Max:     551
>
>    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
>    # /dev/cpu_dma_latency set to 10000000us
>    T: 0 ( 9716) P: 0 I:500 C:   2000 Min:      2 Act:   82 Avg:   72 Max:     252
>    T: 1 ( 9717) P: 0 I:650 C:   1556 Min:      2 Act:  115 Avg:   77 Max:     354
>    T: 2 ( 9718) P: 0 I:800 C:   1264 Min:      2 Act:   59 Avg:   78 Max:    1143
>    T: 3 ( 9719) P: 0 I:950 C:   1065 Min:      2 Act:  104 Avg:   70 Max:     238
>
>    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
>    # /dev/cpu_dma_latency set to 10000000us
>    T: 0 ( 9722) P: 0 I:500 C:   2000 Min:      2 Act:   82 Avg:   68 Max:     213
>    T: 1 ( 9723) P: 0 I:650 C:   1555 Min:      2 Act:   65 Avg:   65 Max:    1279
>    T: 2 ( 9724) P: 0 I:800 C:   1264 Min:      2 Act:   91 Avg:   69 Max:     244
>    T: 3 ( 9725) P: 0 I:950 C:   1065 Min:      2 Act:   58 Avg:   76 Max:     242
>
>
>> >
>> > cyclictest could be a good starting point but we need to improve it to
>> > allow threads of different loads, possibly starting multiple processes
>> > (can be done with a script), randomly varying load threads. These
>> > parameters should be loaded from a file so that we can have multiple
>> > configurations (per SoC and per use-case). But the big risk is that we
>> > try to optimise the scheduler for something which is not realistic.
>>
>> The goal of this simple bench is to measure the wake up latency and the best value the scheduler can reach on a platform, not to emulate a "real" use case. In the same way that sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system.
>>
>> Starting multiple processes and adding some loads can also be useful, but the target will be a bit different from wake up latency. I have one concern with randomness because it prevents us from having repeatable and comparable tests and results.
>>
>> I agree that we have to test "real" use cases, but it doesn't prevent us from testing the limits of a characteristic on a system.
>>
>> >
>> >
>> > We are working on describing some basic scenarios (plain English for
>> > now) and one of them could be video playing with threads for audio and
>> > video decoding with random change in the workload.
>> >
>> > So I think the first step should be a set of tools/scripts to analyse
>> > the scheduler behaviour, both in terms of latency and power, and these
>> > can use perf sched. We can then run some real life scenarios (e.g.
>> > Android video playback) and build a benchmark that matches such
>> > behaviour as close as possible. We can probably use (or improve) perf
>> > sched replay to also simulate such workload (we may need additional
>> > features like thread dependencies).
>> >
>> >> The figures below give the average wakeup latency and power
>> >> consumption for default scheduler behavior, packing tasks at cluster
>> >> level and packing tasks at core level. We can see both wakeup latency
>> >> and power consumption variation. The detailed result is not a simple
>> >> single value which makes comparison not so easy but the average of all
>> >> measurements should give us a usable “score”.
>> >
>> > How did you assess the power/energy?
>>
>> I have used the embedded joule meter of the tc2.
>>
>> >
>> > Thanks.
>> >
>> > --
>> > Catalin
>>
>>             |  Default average results                  |  Cluster Packing average results          |  Core Packing average results
>>             |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
>>             |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
>>             |      879                794890    2364175 |      416                879688      12750 |      189                897452      30052
>>
>>  Cyclictest |  Default                                  |  Packing at Cluster level                 |  Packing at Core level
>>    Interval |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
>>        (us) |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
>>         500         24          1    1147477    2479576         21          1    1136768      11693         22          1    1126062      30138
>>         700         22          1    1136084    3058419         21          0    1125280      11761         21          1    1109950      23503
>
> < snip >
>
> Some questions about what these metrics are:
>
> The cyclictest data is reported per thread.  How did you combine the per thread data
> to get a single latency and stddev value?
>
> Is "Latency" the average latency?

Yes. I have described below the procedure I have followed to get my results:

I run the same test (same parameters) several times (I have tried
between 5 and 10 runs and the results were similar).
For each run, I compute the average of the per-thread average figures
and the stddev between the per-thread results.
The results that I sent are an average of all runs with the same parameters.
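
In other words, a minimal sketch of the aggregation (not the actual
script I used):

import statistics

def aggregate(runs):
    # runs: one list per run, each holding the per-thread "Avg" values (us)
    per_run_avg = [statistics.mean(r) for r in runs]   # average across threads
    per_run_dev = [statistics.stdev(r) for r in runs]  # stddev between threads
    return statistics.mean(per_run_avg), statistics.mean(per_run_dev)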

>
> stddev is not reported by cyclictest.  How did you create this value?  Did you
> use the "-v" cyclictest option to report detailed data, then calculate stddev from
> the detailed data?

No, I haven't used -v because it generates too many spurious wakeups,
which make the results irrelevant.

Vincent
>
> Thanks,
>
> -Frank

* Re: Bench for testing scheduler
  2013-11-07 14:04   ` Catalin Marinas
@ 2013-11-08  9:30     ` Vincent Guittot
  0 siblings, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2013-11-08  9:30 UTC
  To: Catalin Marinas
  Cc: Morten Rasmussen, alex.shi, peterz, pjt, mingo, rjw,
	srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
	markgross, khilman, Frank.Rowand, paulmck, linux-kernel

On 7 November 2013 15:04, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Nov 07, 2013 at 01:33:43PM +0000, Vincent Guittot wrote:
>> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>> >> During the Energy-aware scheduling mini-summit, we spoke about benches
>> >> that should be used to evaluate the modifications of the scheduler.
>> >> I’d like to propose a bench that uses cyclictest to measure the wake
>> >> up latency and the power consumption. The goal of this bench is to
>> >> exercise the scheduler with various sleeping periods and get the
>> >> average wakeup latency. The range of the sleeping period must cover
>> >> all residency times of the idle state table of the platform. I have
>> >> run such tests on a tc2 platform with the packing tasks patchset.
>> >> I have used the following command:
>> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>> >
>> > cyclictest could be a good starting point but we need to improve it to
>> > allow threads of different loads, possibly starting multiple processes
>> > (can be done with a script), randomly varying load threads. These
>> > parameters should be loaded from a file so that we can have multiple
>> > configurations (per SoC and per use-case). But the big risk is that we
>> > try to optimise the scheduler for something which is not realistic.
>>
>> The goal of this simple bench is to measure the wake up latency and
>> the best value the scheduler can reach on a platform, not to emulate
>> a "real" use case. In the same way that sched-pipe tests a specific
>> behavior of the scheduler, this bench tests the wake up latency of a
>> system.
>
> These figures are indeed useful to make sure we don't have any
> regression in terms of latency but I would not use cyclictest (as it is)
> to assess power improvements since the test is too artificial.
>
>> Starting multiple processes and adding some loads can also be useful,
>> but the target will be a bit different from wake up latency. I have one
>> concern with randomness because it prevents us from having repeatable
>> and comparable tests and results.
>
> We can avoid randomness but still make it varying by some predictable
> function.
>
>> I agree that we have to test "real" use cases, but it doesn't prevent
>> us from testing the limits of a characteristic on a system.
>
> I agree. My point is not to use this as "the benchmark".

OK, so I don't plan to make cyclictest "the" benchmark but "a"
benchmark among others, because I'm not sure that we can cover all
needs with only one benchmark.

As an example, cyclictest gives information about the wake up latency
that can't be collected from a trace.

>
> I would prefer to assess the impact on latency (and power) using a tool
> independent from benchmarks like cyclictest (e.g. use the reports from
> power sched). The reason is that once we have those tools/scripts in the
> kernel, a third party can run it on real workloads and provide the
> kernel developers with real numbers on performance vs power scheduling,
> regressions between kernel versions etc. We can't create a power model
> that you can run on an x86 for example and give you an indication of the
> power saving on ARM, you need to run the benchmarks on the actual
> hardware (that's why I don't think linsched is of much use from a power
> perspective).
>
> --
> Catalin

* RE: Bench for testing scheduler
  2013-11-08  9:28     ` Vincent Guittot
@ 2013-11-08 21:12       ` Rowand, Frank
  2013-11-12 10:02         ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Rowand, Frank @ 2013-11-08 21:12 UTC
  To: Vincent Guittot
  Cc: catalin.marinas, Morten.Rasmussen, alex.shi, peterz, pjt, mingo,
	rjw, srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
	markgross, khilman, paulmck, linux-kernel


On Friday, November 08, 2013 1:28 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
> 
> On 8 November 2013 01:04, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
> > Hi Vincent,
> >
> > Thanks for creating some benchmark numbers!
> 
> you're welcome
> 
> >
> >
> > On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
> >>
> >> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >> > Hi Vincent,
> >> >
> >> > (for whatever reason, the text is wrapped and results hard to read)
> >>
> >> Yes, I have just seen that. It looks like gmail has wrapped the lines.
> >> I have added the results, which should not be wrapped, at the end of this email.
> >>
> >> >
> >> >
> >> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> >> >> During the Energy-aware scheduling mini-summit, we spoke about benches
> >> >> that should be used to evaluate the modifications of the scheduler.
> >> >> I’d like to propose a bench that uses cyclictest to measure the wake
> >> >> up latency and the power consumption. The goal of this bench is to
> >> >> exercise the scheduler with various sleeping periods and get the
> >> >> average wakeup latency. The range of the sleeping period must cover
> >> >> all residency times of the idle state table of the platform. I have
> >> >> run such tests on a tc2 platform with the packing tasks patchset.
> >> >> I have use the following command:
> >> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
> >
> > The number of loops ("-l 2000") should be much larger to create useful
> > results.  I don't have a specific number that is large enough; I just
> > know from experience that 2000 is way too small.  For example, running
> > cyclictest several times with the same values on my laptop gives values
> > that are not consistent:
> 
> The Avg figures look almost stable IMO. Are you speaking about the Max
> value for the inconsistency?

The values on my laptop for "-l 2000" are not stable.

If I collapse all of the threads in each of the following tests to a
single value I get the following table.  Note that each thread completes
a different number of cycles, so I calculate the average as:

  total count = T0_count + T1_count + T2_count + T3_count

  avg = ( (T0_count * T0_avg) + (T1_count * T1_avg) + ... + (T3_count * T3_avg) ) / total count

  min is the smallest min for any of the threads

  max is the largest max for any of the threads

            total
test   T    count  min     avg   max
---- --- -------- ---- ------- -----
   1   4     5886    2    76.0  1017
   2   4     5881    2    71.5   810
   3   4     5885    2    74.2  1143
   4   4     5884    2    68.9  1279

Test 1's average is 10% larger than test 4's.

Test 4's maximum is about 58% larger than test 2's.
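
A minimal sketch of that collapsing step, assuming the per-thread
(count, min, avg, max) values have already been parsed from the cyclictest
output; the example data is test 1 from the table above:

    # Collapse per-thread cyclictest results into a single (min, avg, max)
    # triple, weighting each thread's average by its cycle count as in the
    # formulas above.
    def collapse(threads):
        # threads: list of (count, min, avg, max) tuples
        total = sum(c for c, _, _, _ in threads)
        avg = sum(c * a for c, _, a, _ in threads) / total
        return (min(mn for _, mn, _, _ in threads),
                avg,
                max(mx for _, _, _, mx in threads))

    # Test 1 above: prints (2, 76.0..., 1017) for a total count of 5886.
    print(collapse([(2000, 2, 77, 243), (1557, 2, 68, 226),
                    (1264, 2, 81, 1017), (1065, 2, 80, 260)]))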

But all of this is just a minor detail of how to run cyclictest.  The more
important question is whether to use cyclictest results as a valid workload
or metric, so for the moment I won't comment further on the cyclictest
parameters you used to collect the example data you provided.


> 
> >
> >    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> >    # /dev/cpu_dma_latency set to 10000000us
> >    T: 0 ( 9703) P: 0 I:500 C:   2000 Min:      2 Act:   90 Avg:   77 Max:     243
> >    T: 1 ( 9704) P: 0 I:650 C:   1557 Min:      2 Act:   58 Avg:   68 Max:     226
> >    T: 2 ( 9705) P: 0 I:800 C:   1264 Min:      2 Act:   54 Avg:   81 Max:    1017
> >    T: 3 ( 9706) P: 0 I:950 C:   1065 Min:      2 Act:   11 Avg:   80 Max:     260
> >
> >    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> >    # /dev/cpu_dma_latency set to 10000000us
> >    T: 0 ( 9709) P: 0 I:500 C:   2000 Min:      2 Act:   45 Avg:   74 Max:     390
> >    T: 1 ( 9710) P: 0 I:650 C:   1554 Min:      2 Act:   82 Avg:   61 Max:     810
> >    T: 2 ( 9711) P: 0 I:800 C:   1263 Min:      2 Act:   83 Avg:   74 Max:     287
> >    T: 3 ( 9712) P: 0 I:950 C:   1064 Min:      2 Act:  103 Avg:   79 Max:     551
> >
> >    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> >    # /dev/cpu_dma_latency set to 10000000us
> >    T: 0 ( 9716) P: 0 I:500 C:   2000 Min:      2 Act:   82 Avg:   72 Max:     252
> >    T: 1 ( 9717) P: 0 I:650 C:   1556 Min:      2 Act:  115 Avg:   77 Max:     354
> >    T: 2 ( 9718) P: 0 I:800 C:   1264 Min:      2 Act:   59 Avg:   78 Max:    1143
> >    T: 3 ( 9719) P: 0 I:950 C:   1065 Min:      2 Act:  104 Avg:   70 Max:     238
> >
> >    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> >    # /dev/cpu_dma_latency set to 10000000us
> >    T: 0 ( 9722) P: 0 I:500 C:   2000 Min:      2 Act:   82 Avg:   68 Max:     213
> >    T: 1 ( 9723) P: 0 I:650 C:   1555 Min:      2 Act:   65 Avg:   65 Max:    1279
> >    T: 2 ( 9724) P: 0 I:800 C:   1264 Min:      2 Act:   91 Avg:   69 Max:     244
> >    T: 3 ( 9725) P: 0 I:950 C:   1065 Min:      2 Act:   58 Avg:   76 Max:     242
> >
> >
> >> >
> >> > cyclictest could be a good starting point but we need to improve it to
> >> > allow threads of different loads, possibly starting multiple processes
> >> > (can be done with a script), randomly varying load threads. These
> >> > parameters should be loaded from a file so that we can have multiple
> >> > configurations (per SoC and per use-case). But the big risk is that we
> >> > try to optimise the scheduler for something which is not realistic.
> >>
> >> The goal of this simple bench is to measure the wake up latency and the values reachable by the scheduler on a platform, but not to emulate a "real" use case. In the same way that sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system.
> >>
> >> Starting multiple processes and adding some load can also be useful but the target will be a bit different from wake up latency. I have one concern with randomness because it prevents having repeatable and comparable tests and results.
> >>
> >> I agree that we have to test "real" use cases but that doesn't prevent us from testing the limits of a characteristic on a system.
> >>
> >> >
> >> >
> >> > We are working on describing some basic scenarios (plain English for
> >> > now) and one of them could be video playing with threads for audio and
> >> > video decoding with random change in the workload.
> >> >
> >> > So I think the first step should be a set of tools/scripts to analyse
> >> > the scheduler behaviour, both in terms of latency and power, and these
> >> > can use perf sched. We can then run some real life scenarios (e.g.
> >> > Android video playback) and build a benchmark that matches such
> >> > behaviour as close as possible. We can probably use (or improve) perf
> >> > sched replay to also simulate such workload (we may need additional
> >> > features like thread dependencies).
> >> >
> >> >> The figures below give the average wakeup latency and power
> >> >> consumption for default scheduler behavior, packing tasks at cluster
> >> >> level and packing tasks at core level. We can see both wakeup latency
> >> >> and power consumption variation. The detailed result is not a simple
> >> >> single value which makes comparison not so easy but the average of all
> >> >> measurements should give us a usable “score”.
> >> >
> >> > How did you assess the power/energy?
> >>
> >> I have used the embedded joule meter of the tc2.
> >>
> >> >
> >> > Thanks.
> >> >
> >> > --
> >> > Catalin
> >>
> >>             |  Default average results                  |  Cluster Packing average results          |  Core Packing average results
> >>             |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
> >>             |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
> >>             |      879                794890    2364175 |      416                879688      12750 |      189                897452      30052
> >>
> >>  Cyclictest |  Default                                  |  Packing at Cluster level                 |  Packing at Core level
> >>    Interval |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
> >>        (us) |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
> >>         500         24          1    1147477    2479576         21          1    1136768      11693         22          1    1126062      30138
> >>         700         22          1    1136084    3058419         21          0    1125280      11761         21          1    1109950      23503
> >
> > < snip >
> >

Thanks for clarifying how the data was calculated (below).  Again, I don't think
this level of detail is the most important issue at this point, but I'm going
to comment on it while it is still fresh in my mind.

> > Some questions about what these metrics are:
> >
> > The cyclictest data is reported per thread.  How did you combine the per thread data
> > to get a single latency and stddev value?
> >
> > Is "Latency" the average latency?
> 
> Yes. I have described below the procedure I have followed to get my results:
> 
> I run the same test (same parameters) several times (I have tried
> between 5 and 10 runs and the results were similar).
> For each run, I compute the average of the per-thread average figures and I
> compute the stddev between the per-thread results.

So the test run stddev is the standard deviation of the values for average
latency of the 8 (???) cyclictest threads in a test run?

If so, I don't think that the calculated stddev has much actual meaning for
comparing the algorithms (I do find it useful for getting a loose sense of how
consistent multiple test runs with the same parameters are).

> The results that i sent is an average of all runs with the same parameters.

Then the stddev in the table is the average of the stddev in several test runs?

The stddev later on in the table is often in the range of 10%, 20%, 50%, and 100%
of the average latency.  That is rather large.

> 
> >
> > stddev is not reported by cyclictest.  How did you create this value?  Did you
> > use the "-v" cyclictest option to report detailed data, then calculate stddev from
> > the detailed data?
> 
> No, I haven't used -v because it generates too many spurious wake ups,
> which makes the results irrelevant.

Yes, I agree about not using -v.  It was just a wild guess on my part since
I did not know how stddev was calculated.  And I was incorrectly guessing
that stddev was describing the frequency distribution of the latencies
from a single test run.

As a general comment on cyclictest, I don't find average latency
(in isolation) sufficient to compare different runs of cyclictest.
And stddev of the frequency distribution of the latencies (which
can be calculated from the -h data, with fairly low cyclictest
overhead) is usually interesting but should be viewed with a healthy
skepticism since that frequency distribution is often not a normal
distribution.  In addition to average latency, I normally look at
maximum latency and the frequency distribution of latencies (in table
or graph form).

(One side effect of specifying -h is that the -d option is then
ignored.)
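
For what it's worth, a minimal sketch of those statistics, assuming the -h
histogram has already been reduced to (latency_us, count) pairs; the sample
histogram is hypothetical:

    import math

    # Mean, stddev and max of a latency frequency distribution, e.g. one
    # reduced from "cyclictest -h" output.  As noted above, treat the
    # stddev with skepticism: the distribution is often not normal, so
    # also look at the max and the full histogram in table or graph form.
    def hist_stats(hist):
        n = sum(c for _, c in hist)
        mean = sum(l * c for l, c in hist) / n
        var = sum(c * (l - mean) ** 2 for l, c in hist) / n
        return mean, math.sqrt(var), max(l for l, c in hist if c)

    # Hypothetical histogram with a long tail.
    print(hist_stats([(2, 40), (3, 110), (4, 60), (75, 3), (1017, 1)]))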

Thanks,

-Frank

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: Bench for testing scheduler
  2013-11-07 17:42 ` Morten Rasmussen
@ 2013-11-09  0:15   ` Rowand, Frank
  0 siblings, 0 replies; 11+ messages in thread
From: Rowand, Frank @ 2013-11-09  0:15 UTC (permalink / raw)
  To: Morten Rasmussen, Vincent Guittot
  Cc: Alex Shi, Peter Zijlstra, Paul Turner, Ingo Molnar, rjw,
	Srivatsa S. Bhat, Catalin Marinas, Paul Walmsley, Mel Gorman,
	Juri Lelli, fengguang.wu, markgross, Kevin Hilman, Paul McKenney,
	linux-kernel

On Thursday, November 07, 2013 9:42 AM, Morten Rasmussen [morten.rasmussen@arm.com] wrote:
> 
> Hi Vincent,
> 
> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> > Hi,
> >
> > During the Energy-aware scheduling mini-summit, we spoke about benches
> > that should be used to evaluate the modifications of the scheduler.
> > I’d like to propose a bench that uses cyclictest to measure the wake
> > up latency and the power consumption. The goal of this bench is to
> > exercise the scheduler with various sleeping period and get the
> > average wakeup latency. The range of the sleeping period must cover
> > all residency times of the idle state table of the platform. I have
> > run such tests on a tc2 platform with the packing tasks patchset.
> > I have use the following command:
> > #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
> 
> I think cyclictest is a useful model of small(er) periodic tasks for
> benchmarking energy-related patches. However, it doesn't have a
> good-enough-performance criterion as it is. I think that is a strict
> requirement for all energy-related benchmarks.
> 
> Measuring latency gives us a performance metric while the energy tells
> us how energy efficient we are. But without a latency requirement we
> can't really say if a patch helps energy-awareness unless it improves
> both energy _and_ performance. That is the case for your packing patches
> for this particular benchmark with this specific configuration. That is
> a really good result. However, in the general case patches may trade a
> bit of performance to get better energy, which is also good if
> performance still meets the requirement of the application/user. So we
> need a performance criterion to tell us when we sacrifice too much
> performance when trying to save power. Without it, it is just a
> performance benchmark where we measure power.
> 
> Coming up with a performance criterion for cyclictest is not so easy, as
> it doesn't really model any specific application. I guess sacrificing a
> bit of latency is acceptable if it comes with significant energy
> savings. But a huge performance impact might not be, even if it comes
> with massive energy savings. So maybe the criterion would consist of both
> a latency requirement (e.g. allowing up to a 10% increase) and a
> requirement for improved energy per work.
> 
> As I see it, it is the only way we can validate the energy efficiency of
> patches that trade performance for improved energy.

I think those comments capture some of the additional complexity of
the power vs performance tradeoff that needs to be considered.
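
A minimal sketch of how such a combined criterion could look, taking the
10% latency budget and the energy-per-work metric from Morten's example
above (illustrative numbers, not agreed-upon requirements):

    # A patch is acceptable if average latency regresses by at most
    # max_increase (10% here) and energy per unit of work improves.
    def acceptable(base_lat, new_lat, base_epw, new_epw, max_increase=0.10):
        latency_ok = new_lat <= base_lat * (1.0 + max_increase)
        energy_ok = new_epw < base_epw
        return latency_ok and energy_ok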

One thing that is not well-defined is what "performance" is.  The session at
the mini-summit discussed throughput and latency.  I'm not sure if people are
combining two different things under the name of latency.  To me, latency
is wake up latency: the elapsed time from when an event occurred to when
the process handling the event is executing instructions (where I think of
the process typically as user space code, but it could sometimes instead
be kernel space code).  The second thing people might think of as latency
is how long it takes from the triggering event until the work is completed
on behalf of the consumer of the event (where the consumer could be a machine,
but is often a human being, e.g. if a packet from Google arrives, how long
until I see the search result on my screen).  This second thing I call
response time.

Then "wake up latency" is also probably a mis-nomer.  The cyclictest wake up
latency ends when the cyclictest thread is both woken, and then is actually
executing code on the cpu ("running").

Wake up latency is a fine thing to focus on (especially since power management
can have a large impact on wake up latency) but I hope we remember to pay
attention to response time as one of the important performance metrics.

Onward to cyclictest...  Cyclictest is commonly referred to as a benchmark
(which it is), but it is at the core more like instrumentation, providing
a measure of some types of wake up latency.  Cyclictest is normally used
in conjunction with a separate workload.  (Even though cyclictest has
enough tuning knobs that it can also be used as a workload.)  There are some
ways that perf and cyclictest can be compared as sources of performance data:

  -----  cyclictest

  - Measures wake up latency of only cyclictest threads.
  - Captures _entire_ latency, including coming out of low power mode to
    service the (timer) interrupt that results in the task wake up.

  -----  perf sched

  - Measures all processes (this can be sliced and diced in post-processing
    to include any desired set of processes).
  - Captures latency from when a task is _woken_ to when the task is
    _executing code_ on a cpu.

I think both cyclictest and perf sched are valuable tools that can each
contribute to understanding system behavior.
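
As an aside, collecting the perf sched view is cheap to script; a minimal
sketch, assuming perf is installed and run with sufficient privileges:

    import subprocess

    # Record scheduler events for 10 seconds, then print per-task
    # scheduling latency statistics with the standard perf sched
    # subcommands (perf sched record / perf sched latency).
    subprocess.run(["perf", "sched", "record", "--", "sleep", "10"],
                   check=True)
    subprocess.run(["perf", "sched", "latency"], check=True)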

-Frank

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bench for testing scheduler
  2013-11-08 21:12       ` Rowand, Frank
@ 2013-11-12 10:02         ` Vincent Guittot
  0 siblings, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2013-11-12 10:02 UTC (permalink / raw)
  To: Rowand, Frank
  Cc: catalin.marinas, Morten.Rasmussen, alex.shi, peterz, pjt, mingo,
	rjw, srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
	markgross, khilman, paulmck, linux-kernel

On 8 November 2013 22:12, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
>
> On Friday, November 08, 2013 1:28 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>>
>> On 8 November 2013 01:04, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
>>
<snip>
>>
>> The Avg figures look almost stable IMO. Are you speaking about the Max
>> value for the inconsistency?
>
> The values on my laptop for "-l 2000" are not stable.
>
> If I collapse all of the threads in each of the following tests to a
> single value I get the following table.  Note that each thread completes
> a different number of cycles, so I calculate the average as:
>
>   total count = T0_count + T1_count + T2_count + T3_count
>
>   avg = ( (T0_count * T0_avg) + (T1_count * T1_avg) + ... + (T3_count * T3_avg) ) / total count
>
>   min is the smallest min for any of the threads
>
>   max is the largest max for any of the threads
>
>             total
> test   T    count  min     avg   max
> ---- --- -------- ---- ------- -----
>    1   4     5886    2    76.0  1017
>    2   4     5881    2    71.5   810
>    3   4     5885    2    74.2  1143
>    4   4     5884    2    68.9  1279
>
> Test 1's average is 10% larger than test 4's.
>
> Test 4's maximum is about 58% larger than test 2's.
>
> But all of this is just a minor detail of how to run cyclictest.  The more
> important question is whether to use cyclictest results as a valid workload
> or metric, so for the moment I won't comment further on the cyclictest
> parameters you used to collect the example data you provided.
>
>
>>

<snip>

>> >
>
> Thanks for clarifying how the data was calculated (below).  Again, I don't think
> this level of detail is the most important issue at this point, but I'm going
> to comment on it while it is still fresh in my mind.
>
>> > Some questions about what these metrics are:
>> >
>> > The cyclictest data is reported per thread.  How did you combine the per thread data
>> > to get a single latency and stddev value?
>> >
>> > Is "Latency" the average latency?
>>
>> Yes. I have described below the procedure I have followed to get my results:
>>
>> I run the same test (same parameters) several times (I have tried
>> between 5 and 10 runs and the results were similar).
>> For each run, I compute the average of the per-thread average figures and I
>> compute the stddev between the per-thread results.
>
> So the test run stddev is the standard deviation of the values for average
> latency of the 8 (???) cyclictest threads in a test run?

I have used 5 threads for my tests.

>
> If so, I don't think that the calculated stddev has much actual meaning for
> comparing the algorithms (I do find it useful for getting a loose sense of how
> consistent multiple test runs with the same parameters are).
>
>> The results that i sent is an average of all runs with the same parameters.
>
> Then the stddev in the table is the average of the stddev in several test runs?

Yes, it is.

>
> The stddev later on in the table is often in the range of 10%, 20%, 50%, and 100%
> of the average latency.  That is rather large.

Yes, I agree, and it's an interesting figure IMHO because it points out
how the wake up of a core can impact the task scheduling latency, and
how it's possible to reduce it or make it more stable (even if we
still have some large max values which are probably not linked to the
wake up of a core but to other activities, like deferrable timers that
have fired).

>
>>
>> >
>> > stddev is not reported by cyclictest.  How did you create this value?  Did you
>> > use the "-v" cyclictest option to report detailed data, then calculate stddev from
>> > the detailed data?
>>
>> No, I haven't used -v because it generates too many spurious wake ups,
>> which makes the results irrelevant.
>
> Yes, I agree about not using -v.  It was just a wild guess on my part since
> I did not know how stddev was calculated.  And I was incorrectly guessing
> that stddev was describing the frequency distribution of the latencies
> from a single test run.

I haven't been so precise in my computation, mainly because the outputs
were almost coherent, but we probably need more precise statistics in a
final step.

>
> As a general comment on cyclictest, I don't find average latency
> (in isolation) sufficient to compare different runs of cyclictest.
> And stddev of the frequency distribution of the latencies (which
> can be calculated from the -h data, with fairly low cyclictest
> overhead) is usually interesting but should be viewed with a healthy
> skepticism since that frequency distribution is often not a normal
> distribution.  In addition to average latency, I normally look at
> maximum latency and the frequency distribution of latencies (in table
> or graph form).
>
> (One side effect of specifying -h is that the -d option is then
> ignored.)
>

I'm going to have a look at the -h parameter, which can be useful to get a
better view of the frequency distribution as you point out. Having the
distance (-d) set to 0 can be an issue because we could have a
synchronization of the wake ups of the threads, which would finally hide
the real wake up latency. It's interesting to have a distance which
ensures that the threads will wake up in an "asynchronous" manner;
that's why I have chosen 150 (which may not be the best value).
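
For reference, -d staggers the per-thread intervals so that thread n runs
with interval -i + n * -d, which matches the I:500/650/800/950 values in
the logs above; a tiny sketch:

    # Per-thread intervals for "-i 500 -d 150" with 4 threads.
    interval_us, distance_us, nthreads = 500, 150, 4
    for n in range(nthreads):
        print("thread %d: interval %d us" % (n, interval_us + n * distance_us))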

Thanks,
Vincent

> Thanks,
>
> -Frank

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-11-12 10:02 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-07 10:54 Bench for testing scheduler Vincent Guittot
2013-11-07 11:32 ` Catalin Marinas
2013-11-07 13:33 ` Vincent Guittot
2013-11-07 14:04   ` Catalin Marinas
2013-11-08  9:30     ` Vincent Guittot
2013-11-08  0:04   ` Rowand, Frank
2013-11-08  9:28     ` Vincent Guittot
2013-11-08 21:12       ` Rowand, Frank
2013-11-12 10:02         ` Vincent Guittot
2013-11-07 17:42 ` Morten Rasmussen
2013-11-09  0:15   ` Rowand, Frank
