* Bench for testing scheduler
@ 2013-11-07 10:54 Vincent Guittot
From: Vincent Guittot @ 2013-11-07 10:54 UTC (permalink / raw)
To: Morten Rasmussen, Alex Shi, Vincent Guittot, Peter Zijlstra,
Paul Turner, Ingo Molnar, rjw, Srivatsa S. Bhat, Catalin Marinas,
Paul Walmsley, Mel Gorman, Juri Lelli, fengguang.wu, markgross,
Kevin Hilman, Frank.Rowand, Paul McKenney, linux-kernel
Hi,
During the Energy-aware scheduling mini-summit, we spoke about benches
that should be used to evaluate modifications of the scheduler.
I'd like to propose a bench that uses cyclictest to measure the
wakeup latency and the power consumption. The goal of this bench is to
exercise the scheduler with various sleeping periods and get the
average wakeup latency. The range of sleeping periods must cover
all residency times in the idle state table of the platform. I have
run such tests on a TC2 platform with the packing-tasks patchset.
I have used the following command:
#cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
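The sweep over sleeping periods could be driven by a small script along these lines. This is only a sketch: the 500-12500us range and 200us step are read off the results table, and the cyclictest binary from rt-tests is assumed to be on the target.

```python
# Sketch: build one cyclictest invocation per sweep point. The interval
# range/step mirror the results table (500..12500us in 200us steps).
# Running the commands (e.g. via subprocess.run) is left to the target.

def cyclictest_cmd(n_threads, interval_us, distance_us=150, loops=2000,
                   latency_us=10000000):
    """Build the cyclictest argv for one sweep point."""
    return ["cyclictest", "-t", str(n_threads), "-q",
            "-e", str(latency_us), "-i", str(interval_us),
            "-d", str(distance_us), "-l", str(loops)]

def sweep(n_cores, start=500, stop=12500, step=200):
    """Yield one command per interval in the sweep."""
    for interval in range(start, stop + 1, step):
        yield cyclictest_cmd(n_cores, interval)
```

Each yielded command corresponds to one row of the results table below.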
The figures below give the average wakeup latency and power
consumption for default scheduler behavior, packing tasks at cluster
level and packing tasks at core level. We can see both wakeup latency
and power consumption variation. The detailed result is not a simple
single value which makes comparison not so easy but the average of all
measurements should give us a usable “score”.
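The collapsing of per-interval measurements into a single "score" could look like the sketch below. The mail does not say whether the reported averages are plain or weighted means, so this assumes a plain arithmetic mean over all sweep points.

```python
# Sketch: average the per-interval results into one "score" row,
# assuming a plain (unweighted) arithmetic mean over all sweep points.

def score(rows):
    """rows: list of (interval_us, avg_latency_us, a7_energy, a15_energy)."""
    n = len(rows)
    avg_latency = sum(r[1] for r in rows) / n
    a7_energy = sum(r[2] for r in rows) / n
    a15_energy = sum(r[3] for r in rows) / n
    return avg_latency, a7_energy, a15_energy
```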
I know that Ingo would like to add the benches under tools/*, but I
wonder if it makes sense to copy cyclictest into this directory when we
have an official git tree here:
git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
I have put both the final "score" and the detailed results below so
everybody can check the score against the detailed figures:
| Default average results | Cluster Packing average results | Core Packing average results
| Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
| (us) (J) (J) | (us) (J) (J) | (us) (J) (J)
| 879 794890 2364175 | 416 879688 12750 | 189 897452 30052
Cyclictest | Default | Packing at Cluster level | Packing at Core level
Interval | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
(us) | (us) (J) (J) | (us) (J) (J) | (us) (J) (J)
500 24 1 1147477 2479576 21 1 1136768 11693 22 1 1126062 30138
700 22 1 1136084 3058419 21 0 1125280 11761 21 1 1109950 23503
900 22 1 1136017 3036768 21 1 1112542 12017 20 0 1101089 23733
1100 24 1 1132964 2506132 21 0 1109039 12248 21 1 1091832 23621
1300 24 1 1123896 2488459 21 0 1099308 12015 21 1 1086301 23264
1500 24 1 1120842 2488272 21 0 1099811 12685 20 0 1083658 22499
1700 41 38 1117166 3042091 21 0 1090920 12393 21 1 1080387 23015
1900 119 182 1120552 2737555 21 0 1087900 11900 21 1 1078711 23177
2100 167 195 1122425 3210655 22 2 1090420 11900 20 1 1077985 22639
2300 152 156 1119854 2497773 43 22 1087278 11921 21 1 1075943 26282
2500 182 163 1120818 2365870 63 29 1089169 11551 21 0 1073717 24290
2700 439 202 1058952 3058516 107 41 1077955 12122 21 0 1070951 23126
2900 570 268 1028238 3099162 148 30 1067562 13287 24 1 1064200 24260
3100 751 137 946512 3158095 178 30 1059395 12236 29 1 1058887 23225
3300 696 203 964822 3042524 206 28 1041194 13934 36 1 1056656 23941
3500 728 191 959398 3006066 235 36 1028150 13387 44 3 1045841 23873
3700 844 138 921780 3033189 245 31 1019065 14582 62 6 1034466 22501
3900 815 172 925600 2862994 273 33 1001974 12091 80 9 1014650 24444
4100 870 179 897616 2940444 279 35 996226 12014 88 11 1030588 25461
4300 979 119 846912 2996911 306 36 980075 12641 100 12 1035173 24832
4500 891 168 863631 2760879 336 45 955072 12016 126 12 993256 23929
4700 943 110 836333 2796629 351 39 942390 12902 125 15 996548 24637
4900 997 118 800205 2743317 391 49 917067 12868 134 23 1011089 25266
5100 1050 114 789152 2693104 408 53 903123 12033 196 22 894294 25142
5300 1052 111 769544 2668315 425 54 895006 12264 171 19 933356 25873
5500 1002 179 794222 2554432 430 45 886025 12007 171 18 938921 24382
5700 1002 180 786714 2441228 436 46 878043 12258 172 14 944908 30291
5900 1117 90 742883 2554813 471 53 864134 12471 170 12 957811 25119
6100 1166 92 734510 2566381 479 68 854384 12579 190 16 926807 25544
6300 1132 123 738812 2447974 488 57 849740 12968 216 10 882940 26546
6500 1123 150 743870 2323338 495 52 836256 12472 210 20 896639 25149
6700 1173 139 724691 2330720 522 70 822678 12949 269 27 800938 28653
6900 1054 112 725451 2953919 522 69 822682 12184 261 26 785269 28199
7100 1098 174 731504 2255090 502 87 820909 13072 216 15 870777 25336
7300 1244 156 702596 2317562 531 88 808677 12770 247 18 813081 28126
7500 1181 143 694538 2226994 545 90 796698 12368 226 14 862177 26597
7700 1189 147 681836 2183167 555 87 799215 12499 250 17 797699 26342
7900 1082 149 694010 1926757 555 90 791777 13137 243 20 824061 26772
8100 1068 145 678222 2791019 552 80 785043 13071 266 16 781563 26579
8300 1102 135 690978 1851892 582 136 781035 13067 267 18 782060 26683
8500 1190 191 653566 2068057 574 127 777348 13139 262 21 800524 27086
8700 1172 185 666525 2031543 602 104 778754 13364 228 13 884802 25340
8900 1024 179 685123 1689661 594 98 768617 13753 266 20 801557 26075
9100 1077 166 658295 1756367 615 101 759656 13297 308 19 739619 25677
9300 1211 203 618593 2055230 606 111 753652 13231 319 23 743849 26041
9500 1163 189 627123 1794459 615 125 751993 13174 264 19 865898 25795
9700 1240 202 589520 1983417 649 157 738596 13473 326 71 742113 25528
9900 1188 207 612908 1830208 635 125 725890 14240 299 40 770069 24714
10100 1168 219 596998 1781611 647 132 718260 13834 245 35 905581 24854
10300 1083 222 615543 1506529 641 130 700636 13108 401 24 643222 26497
10500 1183 210 573875 1753476 648 169 708408 12756 392 30 636559 28712
10700 1217 234 526025 2014191 648 165 696542 13092 374 26 675566 28555
10900 1161 179 594406 1722260 647 194 698681 13715 344 45 682158 26681
11100 1185 209 578309 1919206 670 166 724562 13408 339 50 743402 28010
11300 1144 185 609694 1791436 671 136 712555 12769 307 36 762260 26575
11500 1070 188 617941 1470628 650 151 723367 12596 353 21 659704 28015
11700 1205 199 570787 1801593 673 168 706260 12568 347 12 689414 29196
11900 1216 174 563915 1761745 686 135 698164 12840 361 10 663126 27517
12100 1155 218 568867 1596189 677 159 705873 12759 309 14 774833 290747
12300 1236 187 543536 1738447 705 177 705564 13028 330 21 745009 28134
12500 1176 202 545135 1651420 696 148 697624 13280 339 20 724057 26461
Vincent
* Re: Bench for testing scheduler
From: Catalin Marinas @ 2013-11-07 11:32 UTC (permalink / raw)
To: Vincent Guittot
Cc: Morten Rasmussen, Alex Shi, Peter Zijlstra, Paul Turner,
Ingo Molnar, rjw, Srivatsa S. Bhat, Paul Walmsley, Mel Gorman,
Juri Lelli, fengguang.wu, markgross, Kevin Hilman, Frank.Rowand,
Paul McKenney, linux-kernel
Hi Vincent,
(for whatever reason, the text is wrapped and results hard to read)
On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> During the Energy-aware scheduling mini-summit, we spoke about benches
> that should be used to evaluate modifications of the scheduler.
> I'd like to propose a bench that uses cyclictest to measure the
> wakeup latency and the power consumption. The goal of this bench is to
> exercise the scheduler with various sleeping periods and get the
> average wakeup latency. The range of sleeping periods must cover
> all residency times in the idle state table of the platform. I have
> run such tests on a TC2 platform with the packing-tasks patchset.
> I have used the following command:
> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
cyclictest could be a good starting point but we need to improve it to
allow threads of different loads, possibly starting multiple processes
(can be done with a script), randomly varying load threads. These
parameters should be loaded from a file so that we can have multiple
configurations (per SoC and per use-case). But the big risk is that we
try to optimise the scheduler for something which is not realistic.
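A sketch of what the configuration file Catalin describes could look like. The format and every field name below are invented for illustration; nothing like this exists in cyclictest today.

```python
# Sketch: a hypothetical per-SoC / per-use-case configuration file and
# loader for a cyclictest-like tool. All field names are made up.
import json

EXAMPLE_CONFIG = """
{
  "soc": "tc2",
  "use_case": "audio-playback",
  "threads": [
    {"interval_us": 500,  "load_pct": 10},
    {"interval_us": 6000, "load_pct": 35}
  ]
}
"""

def load_config(text):
    """Parse a config and fail early on an empty thread description."""
    cfg = json.loads(text)
    assert cfg["threads"], "at least one thread must be described"
    return cfg
```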
We are working on describing some basic scenarios (plain English for
now) and one of them could be video playing with threads for audio and
video decoding with random change in the workload.
So I think the first step should be a set of tools/scripts to analyse
the scheduler behaviour, both in terms of latency and power, and these
can use perf sched. We can then run some real life scenarios (e.g.
Android video playback) and build a benchmark that matches such
behaviour as close as possible. We can probably use (or improve) perf
sched replay to also simulate such workload (we may need additional
features like thread dependencies).
> The figures below give the average wakeup latency and power
> consumption for default scheduler behavior, packing tasks at cluster
> level and packing tasks at core level. We can see both wakeup latency
> and power consumption variation. The detailed result is not a simple
> single value which makes comparison not so easy but the average of all
> measurements should give us a usable “score”.
How did you assess the power/energy?
Thanks.
--
Catalin
* Re: Bench for testing scheduler
From: Vincent Guittot @ 2013-11-07 13:33 UTC (permalink / raw)
To: catalin.marinas
Cc: Morten.Rasmussen, alex.shi, peterz, pjt, mingo, rjw,
srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
markgross, khilman, Frank.Rowand, paulmck, linux-kernel,
Vincent Guittot
On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> Hi Vincent,
>
> (for whatever reason, the text is wrapped and results hard to read)
Yes, I have just seen that. It looks like gmail has wrapped the lines.
I have added the results, which should not be wrapped, at the end of this email.
>
>
> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>> During the Energy-aware scheduling mini-summit, we spoke about benches
>> that should be used to evaluate modifications of the scheduler.
>> I'd like to propose a bench that uses cyclictest to measure the
>> wakeup latency and the power consumption. The goal of this bench is to
>> exercise the scheduler with various sleeping periods and get the
>> average wakeup latency. The range of sleeping periods must cover
>> all residency times in the idle state table of the platform. I have
>> run such tests on a TC2 platform with the packing-tasks patchset.
>> I have used the following command:
>> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>
> cyclictest could be a good starting point but we need to improve it to
> allow threads of different loads, possibly starting multiple processes
> (can be done with a script), randomly varying load threads. These
> parameters should be loaded from a file so that we can have multiple
> configurations (per SoC and per use-case). But the big risk is that we
> try to optimise the scheduler for something which is not realistic.
The goal of this simple bench is to measure the wakeup latency and the limit reachable by the scheduler on a platform, but not to emulate a "real" use case. In the same way that sched-pipe tests a specific behavior of the scheduler, this bench tests the wakeup latency of a system.
Starting multiple processes and adding some load can also be useful, but the target will be a bit different from wakeup latency. I have one concern with randomness because it prevents us from having repeatable and comparable tests and results.
I agree that we have to test "real" use cases, but that doesn't prevent us from testing the limit of a characteristic on a system.
>
>
> We are working on describing some basic scenarios (plain English for
> now) and one of them could be video playing with threads for audio and
> video decoding with random change in the workload.
>
> So I think the first step should be a set of tools/scripts to analyse
> the scheduler behaviour, both in terms of latency and power, and these
> can use perf sched. We can then run some real life scenarios (e.g.
> Android video playback) and build a benchmark that matches such
> behaviour as close as possible. We can probably use (or improve) perf
> sched replay to also simulate such workload (we may need additional
> features like thread dependencies).
>
>> The figures below give the average wakeup latency and power
>> consumption for default scheduler behavior, packing tasks at cluster
>> level and packing tasks at core level. We can see both wakeup latency
>> and power consumption variation. The detailed result is not a simple
>> single value which makes comparison not so easy but the average of all
>> measurements should give us a usable “score”.
>
> How did you assess the power/energy?
I have used the embedded joule meter of the TC2.
>
> Thanks.
>
> --
> Catalin
| Default average results | Cluster Packing average results | Core Packing average results
| Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
| (us) (J) (J) | (us) (J) (J) | (us) (J) (J)
| 879 794890 2364175 | 416 879688 12750 | 189 897452 30052
Cyclictest | Default | Packing at Cluster level | Packing at Core level
Interval | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
(us) | (us) (J) (J) | (us) (J) (J) | (us) (J) (J)
500 24 1 1147477 2479576 21 1 1136768 11693 22 1 1126062 30138
700 22 1 1136084 3058419 21 0 1125280 11761 21 1 1109950 23503
900 22 1 1136017 3036768 21 1 1112542 12017 20 0 1101089 23733
1100 24 1 1132964 2506132 21 0 1109039 12248 21 1 1091832 23621
1300 24 1 1123896 2488459 21 0 1099308 12015 21 1 1086301 23264
1500 24 1 1120842 2488272 21 0 1099811 12685 20 0 1083658 22499
1700 41 38 1117166 3042091 21 0 1090920 12393 21 1 1080387 23015
1900 119 182 1120552 2737555 21 0 1087900 11900 21 1 1078711 23177
2100 167 195 1122425 3210655 22 2 1090420 11900 20 1 1077985 22639
2300 152 156 1119854 2497773 43 22 1087278 11921 21 1 1075943 26282
2500 182 163 1120818 2365870 63 29 1089169 11551 21 0 1073717 24290
2700 439 202 1058952 3058516 107 41 1077955 12122 21 0 1070951 23126
2900 570 268 1028238 3099162 148 30 1067562 13287 24 1 1064200 24260
3100 751 137 946512 3158095 178 30 1059395 12236 29 1 1058887 23225
3300 696 203 964822 3042524 206 28 1041194 13934 36 1 1056656 23941
3500 728 191 959398 3006066 235 36 1028150 13387 44 3 1045841 23873
3700 844 138 921780 3033189 245 31 1019065 14582 62 6 1034466 22501
3900 815 172 925600 2862994 273 33 1001974 12091 80 9 1014650 24444
4100 870 179 897616 2940444 279 35 996226 12014 88 11 1030588 25461
4300 979 119 846912 2996911 306 36 980075 12641 100 12 1035173 24832
4500 891 168 863631 2760879 336 45 955072 12016 126 12 993256 23929
4700 943 110 836333 2796629 351 39 942390 12902 125 15 996548 24637
4900 997 118 800205 2743317 391 49 917067 12868 134 23 1011089 25266
5100 1050 114 789152 2693104 408 53 903123 12033 196 22 894294 25142
5300 1052 111 769544 2668315 425 54 895006 12264 171 19 933356 25873
5500 1002 179 794222 2554432 430 45 886025 12007 171 18 938921 24382
5700 1002 180 786714 2441228 436 46 878043 12258 172 14 944908 30291
5900 1117 90 742883 2554813 471 53 864134 12471 170 12 957811 25119
6100 1166 92 734510 2566381 479 68 854384 12579 190 16 926807 25544
6300 1132 123 738812 2447974 488 57 849740 12968 216 10 882940 26546
6500 1123 150 743870 2323338 495 52 836256 12472 210 20 896639 25149
6700 1173 139 724691 2330720 522 70 822678 12949 269 27 800938 28653
6900 1054 112 725451 2953919 522 69 822682 12184 261 26 785269 28199
7100 1098 174 731504 2255090 502 87 820909 13072 216 15 870777 25336
7300 1244 156 702596 2317562 531 88 808677 12770 247 18 813081 28126
7500 1181 143 694538 2226994 545 90 796698 12368 226 14 862177 26597
7700 1189 147 681836 2183167 555 87 799215 12499 250 17 797699 26342
7900 1082 149 694010 1926757 555 90 791777 13137 243 20 824061 26772
8100 1068 145 678222 2791019 552 80 785043 13071 266 16 781563 26579
8300 1102 135 690978 1851892 582 136 781035 13067 267 18 782060 26683
8500 1190 191 653566 2068057 574 127 777348 13139 262 21 800524 27086
8700 1172 185 666525 2031543 602 104 778754 13364 228 13 884802 25340
8900 1024 179 685123 1689661 594 98 768617 13753 266 20 801557 26075
9100 1077 166 658295 1756367 615 101 759656 13297 308 19 739619 25677
9300 1211 203 618593 2055230 606 111 753652 13231 319 23 743849 26041
9500 1163 189 627123 1794459 615 125 751993 13174 264 19 865898 25795
9700 1240 202 589520 1983417 649 157 738596 13473 326 71 742113 25528
9900 1188 207 612908 1830208 635 125 725890 14240 299 40 770069 24714
10100 1168 219 596998 1781611 647 132 718260 13834 245 35 905581 24854
10300 1083 222 615543 1506529 641 130 700636 13108 401 24 643222 26497
10500 1183 210 573875 1753476 648 169 708408 12756 392 30 636559 28712
10700 1217 234 526025 2014191 648 165 696542 13092 374 26 675566 28555
10900 1161 179 594406 1722260 647 194 698681 13715 344 45 682158 26681
11100 1185 209 578309 1919206 670 166 724562 13408 339 50 743402 28010
11300 1144 185 609694 1791436 671 136 712555 12769 307 36 762260 26575
11500 1070 188 617941 1470628 650 151 723367 12596 353 21 659704 28015
11700 1205 199 570787 1801593 673 168 706260 12568 347 12 689414 29196
11900 1216 174 563915 1761745 686 135 698164 12840 361 10 663126 27517
12100 1155 218 568867 1596189 677 159 705873 12759 309 14 774833 290747
12300 1236 187 543536 1738447 705 177 705564 13028 330 21 745009 28134
12500 1176 202 545135 1651420 696 148 697624 13280 339 20 724057 26461
Vincent
* Re: Bench for testing scheduler
From: Catalin Marinas @ 2013-11-07 14:04 UTC (permalink / raw)
To: Vincent Guittot
Cc: Morten Rasmussen, alex.shi, peterz, pjt, mingo, rjw,
srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
markgross, khilman, Frank.Rowand, paulmck, linux-kernel
On Thu, Nov 07, 2013 at 01:33:43PM +0000, Vincent Guittot wrote:
> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> >> During the Energy-aware scheduling mini-summit, we spoke about benches
> >> that should be used to evaluate modifications of the scheduler.
> >> I'd like to propose a bench that uses cyclictest to measure the
> >> wakeup latency and the power consumption. The goal of this bench is to
> >> exercise the scheduler with various sleeping periods and get the
> >> average wakeup latency. The range of sleeping periods must cover
> >> all residency times in the idle state table of the platform. I have
> >> run such tests on a TC2 platform with the packing-tasks patchset.
> >> I have used the following command:
> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
> >
> > cyclictest could be a good starting point but we need to improve it to
> > allow threads of different loads, possibly starting multiple processes
> > (can be done with a script), randomly varying load threads. These
> > parameters should be loaded from a file so that we can have multiple
> > configurations (per SoC and per use-case). But the big risk is that we
> > try to optimise the scheduler for something which is not realistic.
>
> The goal of this simple bench is to measure the wakeup latency and
> the limit reachable by the scheduler on a platform, but not to emulate
> a "real" use case. In the same way that sched-pipe tests a specific
> behavior of the scheduler, this bench tests the wakeup latency of a
> system.
These figures are indeed useful to make sure we don't have any
regression in terms of latency but I would not use cyclictest (as it is)
to assess power improvements since the test is too artificial.
> Starting multiple processes and adding some load can also be useful,
> but the target will be a bit different from wakeup latency. I have one
> concern with randomness because it prevents us from having repeatable
> and comparable tests and results.
We can avoid randomness but still make it vary by some predictable
function.
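One way to get "varying but predictable" is to derive the load sequence from a fixed-seed PRNG, so every run sees exactly the same sequence and results stay comparable across kernels. A minimal sketch; the function and its parameters are illustrative:

```python
# Sketch: a reproducible "random" load sequence. A fixed seed makes the
# sequence identical on every run, so results remain comparable.
import random

def load_sequence(seed, n, lo_pct=5, hi_pct=50):
    """Return n per-iteration load percentages drawn from a seeded PRNG."""
    rng = random.Random(seed)  # fixed seed => same sequence every run
    return [rng.randint(lo_pct, hi_pct) for _ in range(n)]
```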
> I agree that we have to test "real" use cases, but that doesn't
> prevent us from testing the limit of a characteristic on a system.
I agree. My point is not to use this as "the benchmark".
I would prefer to assess the impact on latency (and power) using a tool
independent from benchmarks like cyclictest (e.g. using the reports from
perf sched). The reason is that once we have those tools/scripts in the
kernel, a third party can run them on real workloads and provide the
kernel developers with real numbers on performance vs power scheduling,
regressions between kernel versions, etc. We can't create a power model
that you can run on x86, for example, to give you an indication of the
power saving on ARM; you need to run the benchmarks on the actual
hardware (that's why I don't think linsched is of much use from a power
perspective).
--
Catalin
* Re: Bench for testing scheduler
From: Morten Rasmussen @ 2013-11-07 17:42 UTC (permalink / raw)
To: Vincent Guittot
Cc: Alex Shi, Peter Zijlstra, Paul Turner, Ingo Molnar, rjw,
Srivatsa S. Bhat, Catalin Marinas, Paul Walmsley, Mel Gorman,
Juri Lelli, fengguang.wu, markgross, Kevin Hilman, Frank.Rowand,
Paul McKenney, linux-kernel
Hi Vincent,
On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> Hi,
>
> During the Energy-aware scheduling mini-summit, we spoke about benches
> that should be used to evaluate modifications of the scheduler.
> I'd like to propose a bench that uses cyclictest to measure the
> wakeup latency and the power consumption. The goal of this bench is to
> exercise the scheduler with various sleeping periods and get the
> average wakeup latency. The range of sleeping periods must cover
> all residency times in the idle state table of the platform. I have
> run such tests on a TC2 platform with the packing-tasks patchset.
> I have used the following command:
> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
I think cyclictest is a useful model of small(er) periodic tasks for
benchmarking energy-related patches. However, it doesn't have a
good-enough-performance criterion as it is. I think that is a strict
requirement for all energy-related benchmarks.
Measuring latency gives us a performance metric, while the energy tells
us how energy efficient we are. But without a latency requirement we
can't really say whether a patch helps energy-awareness unless it
improves both energy _and_ performance. That is the case for your
packing patches for this particular benchmark with this specific
configuration, which is a really good result. However, in the general
case patches may trade a bit of performance to get better energy, which
is also good if performance still meets the requirement of the
application/user. So we need a performance criterion that tells us when
we sacrifice too much performance while trying to save power. Without
one, it is just a performance benchmark where we also measure power.
Coming up with a performance criterion for cyclictest is not so easy,
as it doesn't really model any specific application. I guess
sacrificing a bit of latency is acceptable if it comes with significant
energy savings. But a huge performance impact might not be, even if it
comes with massive energy savings. So maybe the criterion would consist
of both a bound on the latency increase (e.g. at most 10%) and a
requirement for improved energy per unit of work.
As I see it, that is the only way we can validate the energy efficiency
of patches that trade performance for improved energy.
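The combined criterion sketched above (latency increase bounded, plus improved energy per work) could be encoded as a simple check. The 10% bound is only the illustrative number from the example, and the function name is made up:

```python
# Sketch: accept a patch only if latency regresses by at most a bound
# AND energy per unit of work improves. The 10% default is illustrative.

def patch_acceptable(base_latency, new_latency,
                     base_energy_per_work, new_energy_per_work,
                     max_latency_increase=0.10):
    latency_ok = new_latency <= base_latency * (1.0 + max_latency_increase)
    energy_ok = new_energy_per_work < base_energy_per_work
    return latency_ok and energy_ok
```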
Morten
* RE: Bench for testing scheduler
From: Rowand, Frank @ 2013-11-08 0:04 UTC (permalink / raw)
To: Vincent Guittot, catalin.marinas
Cc: Morten.Rasmussen, alex.shi, peterz, pjt, mingo, rjw,
srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
markgross, khilman, paulmck, linux-kernel
Hi Vincent,
Thanks for creating some benchmark numbers!
On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>
> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > Hi Vincent,
> >
> > (for whatever reason, the text is wrapped and results hard to read)
>
> Yes, I have just seen that. It looks like gmail has wrapped the lines.
> I have added the results, which should not be wrapped, at the end of this email.
>
> >
> >
> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> >> During the Energy-aware scheduling mini-summit, we spoke about benches
> >> that should be used to evaluate modifications of the scheduler.
> >> I'd like to propose a bench that uses cyclictest to measure the
> >> wakeup latency and the power consumption. The goal of this bench is to
> >> exercise the scheduler with various sleeping periods and get the
> >> average wakeup latency. The range of sleeping periods must cover
> >> all residency times in the idle state table of the platform. I have
> >> run such tests on a TC2 platform with the packing-tasks patchset.
> >> I have used the following command:
> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
The number of loops ("-l 2000") should be much larger to create useful
results. I don't have a specific number that is large enough, I just
know from experience that 2000 is way too small. For example, running
cyclictest several times with the same values on my laptop gives values
that are not consistent:
$ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
# /dev/cpu_dma_latency set to 10000000us
T: 0 ( 9703) P: 0 I:500 C: 2000 Min: 2 Act: 90 Avg: 77 Max: 243
T: 1 ( 9704) P: 0 I:650 C: 1557 Min: 2 Act: 58 Avg: 68 Max: 226
T: 2 ( 9705) P: 0 I:800 C: 1264 Min: 2 Act: 54 Avg: 81 Max: 1017
T: 3 ( 9706) P: 0 I:950 C: 1065 Min: 2 Act: 11 Avg: 80 Max: 260
$ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
# /dev/cpu_dma_latency set to 10000000us
T: 0 ( 9709) P: 0 I:500 C: 2000 Min: 2 Act: 45 Avg: 74 Max: 390
T: 1 ( 9710) P: 0 I:650 C: 1554 Min: 2 Act: 82 Avg: 61 Max: 810
T: 2 ( 9711) P: 0 I:800 C: 1263 Min: 2 Act: 83 Avg: 74 Max: 287
T: 3 ( 9712) P: 0 I:950 C: 1064 Min: 2 Act: 103 Avg: 79 Max: 551
$ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
# /dev/cpu_dma_latency set to 10000000us
T: 0 ( 9716) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 72 Max: 252
T: 1 ( 9717) P: 0 I:650 C: 1556 Min: 2 Act: 115 Avg: 77 Max: 354
T: 2 ( 9718) P: 0 I:800 C: 1264 Min: 2 Act: 59 Avg: 78 Max: 1143
T: 3 ( 9719) P: 0 I:950 C: 1065 Min: 2 Act: 104 Avg: 70 Max: 238
$ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
# /dev/cpu_dma_latency set to 10000000us
T: 0 ( 9722) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 68 Max: 213
T: 1 ( 9723) P: 0 I:650 C: 1555 Min: 2 Act: 65 Avg: 65 Max: 1279
T: 2 ( 9724) P: 0 I:800 C: 1264 Min: 2 Act: 91 Avg: 69 Max: 244
T: 3 ( 9725) P: 0 I:950 C: 1065 Min: 2 Act: 58 Avg: 76 Max: 242
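The run-to-run instability Frank describes is visible in the T: 0 lines above: the Avg column barely moves across the four runs while Max swings widely. A quick sketch quantifying that spread, with the values copied from the runs above:

```python
# Sketch: relative spread (max - min) / min of a metric across runs.
# Values are the T: 0 Avg and Max columns from the four runs above.

def spread(values):
    lo, hi = min(values), max(values)
    return (hi - lo) / lo

t0_avg = [77, 74, 72, 68]      # Avg: stable run to run
t0_max = [243, 390, 252, 213]  # Max: varies a lot at only 2000 loops
```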
> >
> > cyclictest could be a good starting point but we need to improve it to
> > allow threads of different loads, possibly starting multiple processes
> > (can be done with a script), randomly varying load threads. These
> > parameters should be loaded from a file so that we can have multiple
> > configurations (per SoC and per use-case). But the big risk is that we
> > try to optimise the scheduler for something which is not realistic.
>
> The goal of this simple bench is to measure the wakeup latency and the limit reachable by the scheduler on a platform, but not to emulate a "real" use case. In the same way that sched-pipe tests a specific behavior of the scheduler, this bench tests the wakeup latency of a system.
>
> Starting multiple processes and adding some load can also be useful, but the target will be a bit different from wakeup latency. I have one concern with randomness because it prevents us from having repeatable and comparable tests and results.
>
> I agree that we have to test "real" use cases, but that doesn't prevent us from testing the limit of a characteristic on a system.
>
> >
> >
> > We are working on describing some basic scenarios (plain English for
> > now) and one of them could be video playing with threads for audio and
> > video decoding with random change in the workload.
> >
> > So I think the first step should be a set of tools/scripts to analyse
> > the scheduler behaviour, both in terms of latency and power, and these
> > can use perf sched. We can then run some real life scenarios (e.g.
> > Android video playback) and build a benchmark that matches such
> > behaviour as close as possible. We can probably use (or improve) perf
> > sched replay to also simulate such workload (we may need additional
> > features like thread dependencies).
> >
> >> The figures below give the average wakeup latency and power
> >> consumption for default scheduler behavior, packing tasks at cluster
> >> level and packing tasks at core level. We can see both wakeup latency
> >> and power consumption variation. The detailed result is not a simple
> >> single value which makes comparison not so easy but the average of all
> >> measurements should give us a usable “score”.
> >
> > How did you assess the power/energy?
>
> I have used the embedded joule meter of the TC2.
>
> >
> > Thanks.
> >
> > --
> > Catalin
>
> | Default average results | Cluster Packing average results | Core Packing average results
> | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
> | (us) (J) (J) | (us) (J) (J) | (us) (J) (J)
> | 879 794890 2364175 | 416 879688 12750 | 189 897452 30052
>
> Cyclictest | Default | Packing at Cluster level | Packing at Core level
> Interval | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
> (us) | (us) (J) (J) | (us) (J) (J) | (us) (J) (J)
> 500 24 1 1147477 2479576 21 1 1136768 11693 22 1 1126062 30138
> 700 22 1 1136084 3058419 21 0 1125280 11761 21 1 1109950 23503
< snip >
Some questions about what these metrics are:
The cyclictest data is reported per thread. How did you combine the per thread data
to get a single latency and stddev value?
Is "Latency" the average latency?
stddev is not reported by cyclictest. How did you create this value? Did you
use the "-v" cyclictest option to report detailed data, then calculate stddev from
the detailed data?
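For reference, one standard way to collapse per-thread mean/stddev pairs into a single pair, not necessarily what Vincent did, is sample-count-weighted pooling:

```python
# Sketch: pool per-thread (count, mean, stddev) triples into one
# overall mean and stddev, weighting each thread by its sample count.
import math

def pool(threads):
    """threads: list of (n_samples, mean, stddev) per cyclictest thread."""
    n_total = sum(n for n, _, _ in threads)
    mean = sum(n * m for n, m, _ in threads) / n_total
    # Pool E[X^2] from each thread's mean/stddev, then take the
    # variance of the combined sample.
    ex2 = sum(n * (s * s + m * m) for n, m, s in threads) / n_total
    return mean, math.sqrt(ex2 - mean * mean)
```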
Thanks,
-Frank
* Re: Bench for testing scheduler
From: Vincent Guittot @ 2013-11-08 9:28 UTC (permalink / raw)
To: Rowand, Frank
Cc: catalin.marinas, Morten.Rasmussen, alex.shi, peterz, pjt, mingo,
rjw, srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
markgross, khilman, paulmck, linux-kernel
On 8 November 2013 01:04, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
> Hi Vincent,
>
> Thanks for creating some benchmark numbers!
You're welcome.
>
>
> On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>>
>> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > Hi Vincent,
>> >
>> > (for whatever reason, the text is wrapped and results hard to read)
>>
>> Yes, I have just seen that. It looks like Gmail has wrapped the lines.
>> I have added the results, which should not be wrapped, at the end of this email.
>>
>> >
>> >
>> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>> >> During the Energy-aware scheduling mini-summit, we spoke about benches
>> >> that should be used to evaluate the modifications of the scheduler.
>> >> I’d like to propose a bench that uses cyclictest to measure the wake
>> >> up latency and the power consumption. The goal of this bench is to
>> >> exercise the scheduler with various sleeping periods and get the
>> >> average wakeup latency. The range of sleeping periods must cover
>> >> all residency times of the idle state table of the platform. I have
>> >> run such tests on a TC2 platform with the packing tasks patchset.
>> >> I have used the following command:
>> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>
> The number of loops ("-l 2000") should be much larger to create useful
> results. I don't have a specific number that is large enough, I just
> know from experience that 2000 is way too small. For example, running
> cyclictest several times with the same values on my laptop gives values
> that are not consistent:
The Avg figures look almost stable IMO. Are you speaking about the Max
value for the inconsistency?
>
> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> # /dev/cpu_dma_latency set to 10000000us
> T: 0 ( 9703) P: 0 I:500 C: 2000 Min: 2 Act: 90 Avg: 77 Max: 243
> T: 1 ( 9704) P: 0 I:650 C: 1557 Min: 2 Act: 58 Avg: 68 Max: 226
> T: 2 ( 9705) P: 0 I:800 C: 1264 Min: 2 Act: 54 Avg: 81 Max: 1017
> T: 3 ( 9706) P: 0 I:950 C: 1065 Min: 2 Act: 11 Avg: 80 Max: 260
>
> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> # /dev/cpu_dma_latency set to 10000000us
> T: 0 ( 9709) P: 0 I:500 C: 2000 Min: 2 Act: 45 Avg: 74 Max: 390
> T: 1 ( 9710) P: 0 I:650 C: 1554 Min: 2 Act: 82 Avg: 61 Max: 810
> T: 2 ( 9711) P: 0 I:800 C: 1263 Min: 2 Act: 83 Avg: 74 Max: 287
> T: 3 ( 9712) P: 0 I:950 C: 1064 Min: 2 Act: 103 Avg: 79 Max: 551
>
> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> # /dev/cpu_dma_latency set to 10000000us
> T: 0 ( 9716) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 72 Max: 252
> T: 1 ( 9717) P: 0 I:650 C: 1556 Min: 2 Act: 115 Avg: 77 Max: 354
> T: 2 ( 9718) P: 0 I:800 C: 1264 Min: 2 Act: 59 Avg: 78 Max: 1143
> T: 3 ( 9719) P: 0 I:950 C: 1065 Min: 2 Act: 104 Avg: 70 Max: 238
>
> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> # /dev/cpu_dma_latency set to 10000000us
> T: 0 ( 9722) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 68 Max: 213
> T: 1 ( 9723) P: 0 I:650 C: 1555 Min: 2 Act: 65 Avg: 65 Max: 1279
> T: 2 ( 9724) P: 0 I:800 C: 1264 Min: 2 Act: 91 Avg: 69 Max: 244
> T: 3 ( 9725) P: 0 I:950 C: 1065 Min: 2 Act: 58 Avg: 76 Max: 242
>
>
>> >
>> > cyclictest could be a good starting point but we need to improve it to
>> > allow threads of different loads, possibly starting multiple processes
>> > (can be done with a script), randomly varying load threads. These
>> > parameters should be loaded from a file so that we can have multiple
>> > configurations (per SoC and per use-case). But the big risk is that we
>> > try to optimise the scheduler for something which is not realistic.
>>
>> The goal of this simple bench is to measure the wake up latency achievable by the scheduler on a platform, not to emulate a "real" use case. In the same way as sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system.
>>
>> Starting multiple processes and adding some load can also be useful, but the target will be a bit different from wake up latency. I have one concern with randomness because it prevents having repeatable and comparable tests and results.
>>
>> I agree that we have to test "real" use cases, but that doesn't prevent us from testing the limit of a characteristic of a system.
>>
>> >
>> >
>> > We are working on describing some basic scenarios (plain English for
>> > now) and one of them could be video playing with threads for audio and
>> > video decoding with random change in the workload.
>> >
>> > So I think the first step should be a set of tools/scripts to analyse
>> > the scheduler behaviour, both in terms of latency and power, and these
>> > can use perf sched. We can then run some real life scenarios (e.g.
>> > Android video playback) and build a benchmark that matches such
>> > behaviour as close as possible. We can probably use (or improve) perf
>> > sched replay to also simulate such workload (we may need additional
>> > features like thread dependencies).
>> >
>> >> The figures below give the average wakeup latency and power
>> >> consumption for default scheduler behavior, packing tasks at cluster
>> >> level and packing tasks at core level. We can see both wakeup latency
>> >> and power consumption variation. The detailed result is not a simple
>> >> single value which makes comparison not so easy but the average of all
>> >> measurements should give us a usable “score”.
>> >
>> > How did you assess the power/energy?
>>
>> I have used the embedded joule meter of the TC2.
>>
>> >
>> > Thanks.
>> >
>> > --
>> > Catalin
>>
>> | Default average results | Cluster Packing average results | Core Packing average results
>> | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
>> | (us) (J) (J) | (us) (J) (J) | (us) (J) (J)
>> | 879 794890 2364175 | 416 879688 12750 | 189 897452 30052
>>
>> Cyclictest | Default | Packing at Cluster level | Packing at Core level
>> Interval | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
>> (us) | (us) (J) (J) | (us) (J) (J) | (us) (J) (J)
>> 500 24 1 1147477 2479576 21 1 1136768 11693 22 1 1126062 30138
>> 700 22 1 1136084 3058419 21 0 1125280 11761 21 1 1109950 23503
>
> < snip >
>
> Some questions about what these metrics are:
>
> The cyclictest data is reported per thread. How did you combine the per thread data
> to get a single latency and stddev value?
>
> Is "Latency" the average latency?
Yes. I have described below the procedure I followed to get my results:
I run the same test (same parameters) several times (I have tried
between 5 and 10 runs and the results were similar).
For each run, I compute the average of the per-thread averages and I
compute the stddev across the per-thread results.
The results that I sent are an average of all runs with the same parameters.
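The aggregation described above can be sketched as a short script. The per-thread average latencies below are made up for illustration (the real per-thread figures are not given in this thread); `statistics` is from the Python standard library:

```python
import statistics

# Vincent's procedure (as described above): for each run, average the
# per-thread average latencies and take the stddev across threads; the
# reported score is the mean of those per-run values over all runs.
# The per-thread figures below are hypothetical.
runs = [
    [24, 22, 25, 23, 26],   # per-thread average latency (us), run 1
    [23, 24, 22, 25, 24],   # run 2
    [25, 23, 24, 22, 23],   # run 3
]

per_run_avg = [statistics.mean(r) for r in runs]    # one latency per run
per_run_std = [statistics.stdev(r) for r in runs]   # spread across threads

score_latency = statistics.mean(per_run_avg)
score_stddev = statistics.mean(per_run_std)
print(score_latency, score_stddev)
```

Note that this stddev measures the spread between threads within a run, not the distribution of individual wake-up latencies, which is the distinction Frank raises below.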
>
> stddev is not reported by cyclictest. How did you create this value? Did you
> use the "-v" cyclictest option to report detailed data, then calculate stddev from
> the detailed data?
No, I haven't used -v because it generates too many spurious wake-ups,
which make the results irrelevant.
Vincent
>
> Thanks,
>
> -Frank
* Re: Bench for testing scheduler
2013-11-07 14:04 ` Catalin Marinas
@ 2013-11-08 9:30 ` Vincent Guittot
0 siblings, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2013-11-08 9:30 UTC (permalink / raw)
To: Catalin Marinas
Cc: Morten Rasmussen, alex.shi, peterz, pjt, mingo, rjw,
srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
markgross, khilman, Frank.Rowand, paulmck, linux-kernel
On 7 November 2013 15:04, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Nov 07, 2013 at 01:33:43PM +0000, Vincent Guittot wrote:
>> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>> >> During the Energy-aware scheduling mini-summit, we spoke about benches
>> >> that should be used to evaluate the modifications of the scheduler.
>> >> I’d like to propose a bench that uses cyclictest to measure the wake
>> >> up latency and the power consumption. The goal of this bench is to
>> >> exercise the scheduler with various sleeping periods and get the
>> >> average wakeup latency. The range of sleeping periods must cover
>> >> all residency times of the idle state table of the platform. I have
>> >> run such tests on a TC2 platform with the packing tasks patchset.
>> >> I have used the following command:
>> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>> >
>> > cyclictest could be a good starting point but we need to improve it to
>> > allow threads of different loads, possibly starting multiple processes
>> > (can be done with a script), randomly varying load threads. These
>> > parameters should be loaded from a file so that we can have multiple
>> > configurations (per SoC and per use-case). But the big risk is that we
>> > try to optimise the scheduler for something which is not realistic.
>>
>> The goal of this simple bench is to measure the wake up latency
>> achievable by the scheduler on a platform, not to emulate a "real" use
>> case. In the same way as sched-pipe tests a specific behavior of the
>> scheduler, this bench tests the wake up latency of a system.
>
> These figures are indeed useful to make sure we don't have any
> regression in terms of latency but I would not use cyclictest (as it is)
> to assess power improvements since the test is too artificial.
>
>> Starting multiple processes and adding some load can also be useful,
>> but the target will be a bit different from wake up latency. I have one
>> concern with randomness because it prevents having repeatable and
>> comparable tests and results.
>
> We can avoid randomness but still make it varying by some predictable
> function.
>
>> I agree that we have to test "real" use cases, but that doesn't prevent
>> us from testing the limit of a characteristic of a system.
>
> I agree. My point is not to use this as "the benchmark".
OK, so I don't plan to make cyclictest "the" benchmark but "a"
benchmark among others, because I'm not sure that we can cover all
needs with only one benchmark.
As an example, cyclictest gives information about the wake up latency
that can't be collected from a trace.
>
> I would prefer to assess the impact on latency (and power) using a tool
> independent from benchmarks like cyclictest (e.g. use the reports from
> power sched). The reason is that once we have those tools/scripts in the
> kernel, a third party can run it on real workloads and provide the
> kernel developers with real numbers on performance vs power scheduling,
> regressions between kernel versions etc. We can't create a power model
> that you can run on an x86 for example and give you an indication of the
> power saving on ARM, you need to run the benchmarks on the actual
> hardware (that's why I don't think linsched is of much use from a power
> perspective).
>
> --
> Catalin
* RE: Bench for testing scheduler
2013-11-08 9:28 ` Vincent Guittot
@ 2013-11-08 21:12 ` Rowand, Frank
2013-11-12 10:02 ` Vincent Guittot
0 siblings, 1 reply; 11+ messages in thread
From: Rowand, Frank @ 2013-11-08 21:12 UTC (permalink / raw)
To: Vincent Guittot
Cc: catalin.marinas, Morten.Rasmussen, alex.shi, peterz, pjt, mingo,
rjw, srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
markgross, khilman, paulmck, linux-kernel
On Friday, November 08, 2013 1:28 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>
> On 8 November 2013 01:04, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
> > Hi Vincent,
> >
> > Thanks for creating some benchmark numbers!
>
> You're welcome.
>
> >
> >
> > On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
> >>
> >> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >> > Hi Vincent,
> >> >
> >> > (for whatever reason, the text is wrapped and results hard to read)
> >>
> >> Yes, I have just seen that. It looks like Gmail has wrapped the lines.
> >> I have added the results, which should not be wrapped, at the end of this email.
> >>
> >> >
> >> >
> >> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> >> >> During the Energy-aware scheduling mini-summit, we spoke about benches
> >> >> that should be used to evaluate the modifications of the scheduler.
> >> >> I’d like to propose a bench that uses cyclictest to measure the wake
> >> >> up latency and the power consumption. The goal of this bench is to
> >> >> exercise the scheduler with various sleeping periods and get the
> >> >> average wakeup latency. The range of sleeping periods must cover
> >> >> all residency times of the idle state table of the platform. I have
> >> >> run such tests on a TC2 platform with the packing tasks patchset.
> >> >> I have used the following command:
> >> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
> >
> > The number of loops ("-l 2000") should be much larger to create useful
> > results. I don't have a specific number that is large enough, I just
> > know from experience that 2000 is way too small. For example, running
> > cyclictest several times with the same values on my laptop gives values
> > that are not consistent:
>
> The Avg figures look almost stable IMO. Are you speaking about the Max
> value for the inconsistency?
The values on my laptop for "-l 2000" are not stable.
If I collapse all of the threads in each of the following tests to a
single value I get the following table. Note that each thread completes
a different number of cycles, so I calculate the average as:
total count = T0_count + T1_count + T2_count + T3_count
avg = ( (T0_count * T0_avg) + (T1_count * T1_avg) + ... + (T3_count * T3_avg) ) / total count
min is the smallest min for any of the threads
max is the largest max for any of the threads
total
test T count min avg max
---- --- -------- ---- ------- -----
1 4 5886 2 76.0 1017
2 4 5881 2 71.5 810
3 4 5885 2 74.2 1143
4 4 5884 2 68.9 1279
test 1 average is 10% larger than test 4.
test 4 maximum is 50% larger than test 2.
But all of this is just a minor detail of how to run cyclictest. The more
important question is whether to use cyclictest results as a valid workload
or metric, so for the moment I won't comment further on the cyclictest
parameters you used to collect the example data you provided.
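The combination rule above can be sketched in a few lines. This is a sketch only; the per-thread (count, min, avg, max) tuples are taken from the first cyclictest run quoted later in this message:

```python
# Collapse per-thread cyclictest results into a single (count, min, avg,
# max) line, weighting each thread's average by its cycle count, since
# each thread completes a different number of cycles.
def collapse(threads):
    total = sum(c for c, _, _, _ in threads)
    avg = sum(c * a for c, _, a, _ in threads) / total  # count-weighted
    mn = min(m for _, m, _, _ in threads)               # smallest per-thread min
    mx = max(x for _, _, _, x in threads)               # largest per-thread max
    return total, mn, avg, mx

# (count, min, avg, max) for T0..T3 of the first run quoted below
threads = [
    (2000, 2, 77, 243),
    (1557, 2, 68, 226),
    (1264, 2, 81, 1017),
    (1065, 2, 80, 260),
]

print(collapse(threads))  # matches the "test 1" row: 5886, 2, 76.0, 1017
```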
>
> >
> > $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> > # /dev/cpu_dma_latency set to 10000000us
> > T: 0 ( 9703) P: 0 I:500 C: 2000 Min: 2 Act: 90 Avg: 77 Max: 243
> > T: 1 ( 9704) P: 0 I:650 C: 1557 Min: 2 Act: 58 Avg: 68 Max: 226
> > T: 2 ( 9705) P: 0 I:800 C: 1264 Min: 2 Act: 54 Avg: 81 Max: 1017
> > T: 3 ( 9706) P: 0 I:950 C: 1065 Min: 2 Act: 11 Avg: 80 Max: 260
> >
> > $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> > # /dev/cpu_dma_latency set to 10000000us
> > T: 0 ( 9709) P: 0 I:500 C: 2000 Min: 2 Act: 45 Avg: 74 Max: 390
> > T: 1 ( 9710) P: 0 I:650 C: 1554 Min: 2 Act: 82 Avg: 61 Max: 810
> > T: 2 ( 9711) P: 0 I:800 C: 1263 Min: 2 Act: 83 Avg: 74 Max: 287
> > T: 3 ( 9712) P: 0 I:950 C: 1064 Min: 2 Act: 103 Avg: 79 Max: 551
> >
> > $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> > # /dev/cpu_dma_latency set to 10000000us
> > T: 0 ( 9716) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 72 Max: 252
> > T: 1 ( 9717) P: 0 I:650 C: 1556 Min: 2 Act: 115 Avg: 77 Max: 354
> > T: 2 ( 9718) P: 0 I:800 C: 1264 Min: 2 Act: 59 Avg: 78 Max: 1143
> > T: 3 ( 9719) P: 0 I:950 C: 1065 Min: 2 Act: 104 Avg: 70 Max: 238
> >
> > $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> > # /dev/cpu_dma_latency set to 10000000us
> > T: 0 ( 9722) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 68 Max: 213
> > T: 1 ( 9723) P: 0 I:650 C: 1555 Min: 2 Act: 65 Avg: 65 Max: 1279
> > T: 2 ( 9724) P: 0 I:800 C: 1264 Min: 2 Act: 91 Avg: 69 Max: 244
> > T: 3 ( 9725) P: 0 I:950 C: 1065 Min: 2 Act: 58 Avg: 76 Max: 242
> >
> >
> >> >
> >> > cyclictest could be a good starting point but we need to improve it to
> >> > allow threads of different loads, possibly starting multiple processes
> >> > (can be done with a script), randomly varying load threads. These
> >> > parameters should be loaded from a file so that we can have multiple
> >> > configurations (per SoC and per use-case). But the big risk is that we
> >> > try to optimise the scheduler for something which is not realistic.
> >>
> >> The goal of this simple bench is to measure the wake up latency achievable by the scheduler on a platform, not to emulate a "real" use case. In the same way as sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system.
> >>
> >> Starting multiple processes and adding some load can also be useful, but the target will be a bit different from wake up latency. I have one concern with randomness because it prevents having repeatable and comparable tests and results.
> >>
> >> I agree that we have to test "real" use cases, but that doesn't prevent us from testing the limit of a characteristic of a system.
> >>
> >> >
> >> >
> >> > We are working on describing some basic scenarios (plain English for
> >> > now) and one of them could be video playing with threads for audio and
> >> > video decoding with random change in the workload.
> >> >
> >> > So I think the first step should be a set of tools/scripts to analyse
> >> > the scheduler behaviour, both in terms of latency and power, and these
> >> > can use perf sched. We can then run some real life scenarios (e.g.
> >> > Android video playback) and build a benchmark that matches such
> >> > behaviour as close as possible. We can probably use (or improve) perf
> >> > sched replay to also simulate such workload (we may need additional
> >> > features like thread dependencies).
> >> >
> >> >> The figures below give the average wakeup latency and power
> >> >> consumption for default scheduler behavior, packing tasks at cluster
> >> >> level and packing tasks at core level. We can see both wakeup latency
> >> >> and power consumption variation. The detailed result is not a simple
> >> >> single value which makes comparison not so easy but the average of all
> >> >> measurements should give us a usable “score”.
> >> >
> >> > How did you assess the power/energy?
> >>
> >> I have used the embedded joule meter of the TC2.
> >>
> >> >
> >> > Thanks.
> >> >
> >> > --
> >> > Catalin
> >>
> >> | Default average results | Cluster Packing average results | Core Packing average results
> >> | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
> >> | (us) (J) (J) | (us) (J) (J) | (us) (J) (J)
> >> | 879 794890 2364175 | 416 879688 12750 | 189 897452 30052
> >>
> >> Cyclictest | Default | Packing at Cluster level | Packing at Core level
> >> Interval | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
> >> (us) | (us) (J) (J) | (us) (J) (J) | (us) (J) (J)
> >> 500 24 1 1147477 2479576 21 1 1136768 11693 22 1 1126062 30138
> >> 700 22 1 1136084 3058419 21 0 1125280 11761 21 1 1109950 23503
> >
> > < snip >
> >
Thanks for clarifying how the data was calculated (below). Again, I don't think
this level of detail is the most important issue at this point, but I'm going
to comment on it while it is still fresh in my mind.
> > Some questions about what these metrics are:
> >
> > The cyclictest data is reported per thread. How did you combine the per thread data
> > to get a single latency and stddev value?
> >
> > Is "Latency" the average latency?
>
> Yes. I have described below the procedure I followed to get my results:
>
> I run the same test (same parameters) several times (I have tried
> between 5 and 10 runs and the results were similar).
> For each run, I compute the average of the per-thread averages and I
> compute the stddev across the per-thread results.
So the test run stddev is the standard deviation of the values for average
latency of the 8 (???) cyclictest threads in a test run?
If so, I don't think that the calculated stddev has much actual meaning for
comparing the algorithms (I do find it useful to get a loose sense of how
consistent multiple test runs with the same parameters are).
> The results that I sent are an average of all runs with the same parameters.
Then the stddev in the table is the average of the stddev in several test runs?
The stddev later on in the table is often in the range of 10%, 20%, 50%, and 100%
of the average latency. That is rather large.
>
> >
> > stddev is not reported by cyclictest. How did you create this value? Did you
> > use the "-v" cyclictest option to report detailed data, then calculate stddev from
> > the detailed data?
>
> No, I haven't used -v because it generates too many spurious wake-ups,
> which make the results irrelevant.
Yes, I agree about not using -v. It was just a wild guess on my part since
I did not know how stddev was calculated. And I was incorrectly guessing
that stddev was describing the frequency distribution of the latencies
from a single test run.
As a general comment on cyclictest, I don't find average latency
(in isolation) sufficient to compare different runs of cyclictest.
And stddev of the frequency distribution of the latencies (which
can be calculated from the -h data, with fairly low cyclictest
overhead) is usually interesting but should be viewed with a healthy
skepticism since that frequency distribution is often not a normal
distribution. In addition to average latency, I normally look at
maximum latency and the frequency distribution of latencies (in table
or graph form).
(One side effect of specifying -h is that the -d option is then
ignored.)
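Deriving the mean and stddev from the -h frequency distribution can be sketched as follows. This assumes the histogram has already been parsed into a mapping of latency bucket (us) to sample count; the values here are hypothetical:

```python
import math

# Compute mean and (population) stddev from a latency histogram, avoiding
# the per-sample logging of -v and its spurious wake-ups. The histogram
# maps latency bucket (us) -> number of samples (values made up).
hist = {20: 100, 21: 250, 22: 400, 23: 150, 60: 5, 240: 1}

n = sum(hist.values())
mean = sum(lat * cnt for lat, cnt in hist.items()) / n
var = sum(cnt * (lat - mean) ** 2 for lat, cnt in hist.items()) / n
stddev = math.sqrt(var)

print(n, round(mean, 2), round(stddev, 2))
```

As noted above, the stddev of such a distribution should be read with skepticism, since latency distributions are often heavy-tailed rather than normal; the outlier bucket at 240 us in this example is what the max and the full table/graph would reveal.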
Thanks,
-Frank
* RE: Bench for testing scheduler
2013-11-07 17:42 ` Morten Rasmussen
@ 2013-11-09 0:15 ` Rowand, Frank
0 siblings, 0 replies; 11+ messages in thread
From: Rowand, Frank @ 2013-11-09 0:15 UTC (permalink / raw)
To: Morten Rasmussen, Vincent Guittot
Cc: Alex Shi, Peter Zijlstra, Paul Turner, Ingo Molnar, rjw,
Srivatsa S. Bhat, Catalin Marinas, Paul Walmsley, Mel Gorman,
Juri Lelli, fengguang.wu, markgross, Kevin Hilman, Paul McKenney,
linux-kernel
On Thursday, November 07, 2013 9:42 AM, Morten Rasmussen [morten.rasmussen@arm.com] wrote:
>
> Hi Vincent,
>
> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> > Hi,
> >
> > During the Energy-aware scheduling mini-summit, we spoke about benches
> > that should be used to evaluate the modifications of the scheduler.
> > I’d like to propose a bench that uses cyclictest to measure the wake
> > up latency and the power consumption. The goal of this bench is to
> > exercise the scheduler with various sleeping periods and get the
> > average wakeup latency. The range of sleeping periods must cover
> > all residency times of the idle state table of the platform. I have
> > run such tests on a TC2 platform with the packing tasks patchset.
> > I have used the following command:
> > #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>
> I think cyclictest is a useful model of small(er) periodic tasks for
> benchmarking energy related patches. However, it doesn't have a
> good-enough-performance criteria as it is. I think that is a strict
> requirement for all energy related benchmarks.
>
> Measuring latency gives us a performance metric while the energy tells
> us how energy efficient we are. But without a latency requirement we
> can't really say if a patch helps energy-awareness unless it improves
> both energy _and_ performance. That is the case for your packing patches
> for this particular benchmark with this specific configuration. That is
> a really good result. However, in the general case patches may trade a
> bit of performance to get better energy, which is also good if
> performance still meets the requirement of the application/user. So we
> need a performance criteria to tells us when we sacrifice too much
> performance when trying to save power. Without it it is just a
> performance benchmark where we measure power.
>
> Coming up with a performance criteria for cyclictest is not so easy as
> it doesn't really model any specific application. I guess sacrificing a
> bit of latency is acceptable if it comes with significant energy
> savings. But a huge performance impact might not be, even if it comes
> with massive energy savings. So maybe the criteria would consist of both
> a minimum latency requirement (e.g. up to 10% increase) and a
> requirement for improved energy per work.
>
> As I see it, it is the only way we can validate the energy efficiency of
> patches that trade performance for improved energy.
I think those comments capture some of the additional complexity of
the power vs performance tradeoff that need to be considered.
One thing not well-defined is what "performance" is. The session at the
kernel summit discussed throughput and latency. I'm not sure if people are
combining two different things under the name of latency. To me, latency
is wake up latency; the elapsed time from when an event occurred to when
the process handling the event is executing instructions (where I think of
the process typically as user space code, but it could sometimes instead
be kernel space code). The second thing people might think of as latency
is how long from the triggering event until when work is completed on
behalf of the consumer event (where the consumer could be a machine, but
is often a human being, e.g. if a packet from Google arrives, how long until
I see the search result on my screen). This second thing I call response time.
Then "wake up latency" is also probably a misnomer. The cyclictest wake up
latency ends when the cyclictest thread is both woken, and then is actually
executing code on the cpu ("running").
Wake up latency is a fine thing to focus on (especially since power management
can have a large impact on wake up latency) but I hope we remember to pay
attention to response time as one of the important performance metrics.
Onward to cyclictest... Cyclictest is commonly referred to as a benchmark
(which it is), but it is at the core more like instrumentation, providing
a measure of some types of wake up latency. Cyclictest is normally used
in conjunction with a separate workload. (Even though cyclictest has
enough tuning knobs that it can also be used as a workload.) There are some
ways that perf and cyclictest can be compared as sources of performance data:
----- cyclictest
- Measures wake up latency of only cyclictest threads.
- Captures _entire_ latency, including coming out of low power mode to
service the (timer) interrupt that results in the task wake up.
----- perf sched
- Measures all processes (this can be sliced and diced in post-processing
to include any desired set of processes).
- Captures latency from when task is _woken_ to when task is _executing code_
on a cpu.
I think both cyclictest and perf sched are valuable tools, that can each
contribute to understanding system behavior.
-Frank
* Re: Bench for testing scheduler
2013-11-08 21:12 ` Rowand, Frank
@ 2013-11-12 10:02 ` Vincent Guittot
0 siblings, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2013-11-12 10:02 UTC (permalink / raw)
To: Rowand, Frank
Cc: catalin.marinas, Morten.Rasmussen, alex.shi, peterz, pjt, mingo,
rjw, srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
markgross, khilman, paulmck, linux-kernel
On 8 November 2013 22:12, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
>
> On Friday, November 08, 2013 1:28 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>>
>> On 8 November 2013 01:04, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
>>
<snip>
>>
>> The Avg figures look almost stable IMO. Are you speaking about the Max
>> value for the inconsistency?
>
> The values on my laptop for "-l 2000" are not stable.
>
> If I collapse all of the threads in each of the following tests to a
> single value I get the following table. Note that each thread completes
> a different number of cycles, so I calculate the average as:
>
> total count = T0_count + T1_count + T2_count + T3_count
>
> avg = ( (T0_count * T0_avg) + (T1_count * T1_avg) + ... + (T3_count * T3_avg) ) / total count
>
> min is the smallest min for any of the threads
>
> max is the largest max for any of the threads
>
> total
> test T count min avg max
> ---- --- -------- ---- ------- -----
> 1 4 5886 2 76.0 1017
> 2 4 5881 2 71.5 810
> 3 4 5885 2 74.2 1143
> 4 4 5884 2 68.9 1279
>
> test 1 average is 10% larger than test 4.
>
> test 4 maximum is 50% larger than test 2.
>
> But all of this is just a minor detail of how to run cyclictest. The more
> important question is whether to use cyclictest results as a valid workload
> or metric, so for the moment I won't comment further on the cyclictest
> parameters you used to collect the example data you provided.
>
>
>>
<snip>
>> >
>
> Thanks for clarifying how the data was calculated (below). Again, I don't think
> this level of detail is the most important issue at this point, but I'm going
> to comment on it while it is still fresh in my mind.
>
>> > Some questions about what these metrics are:
>> >
>> > The cyclictest data is reported per thread. How did you combine the per thread data
>> > to get a single latency and stddev value?
>> >
>> > Is "Latency" the average latency?
>>
>> Yes. I have described below the procedure I followed to get my results:
>>
>> I run the same test (same parameters) several times (I have tried
>> between 5 and 10 runs and the results were similar).
>> For each run, I compute the average of the per-thread averages and I
>> compute the stddev across the per-thread results.
>
> So the test run stddev is the standard deviation of the values for average
> latency of the 8 (???) cyclictest threads in a test run?
I have used 5 threads for my tests.
>
> If so, I don't think that the calculated stddev has much actual meaning for
> comparing the algorithms (I do find it useful to get a loose sense of how
> consistent multiple test runs with the same parameters are).
>
>> The results that I sent are an average of all runs with the same parameters.
>
> Then the stddev in the table is the average of the stddev in several test runs?
Yes, it is.
>
> The stddev later on in the table is often in the range of 10%, 20%, 50%, and 100%
> of the average latency. That is rather large.
Yes, I agree, and it's an interesting figure IMHO because it points out
how the wake up of a core can impact the task scheduling latency and
how it's possible to reduce it or make it more stable (even if we
still have some large max values which are probably not linked to the
wake up of a core but to other activities, like deferrable timers that
have fired).
>
>>
>> >
>> > stddev is not reported by cyclictest. How did you create this value? Did you
>> > use the "-v" cyclictest option to report detailed data, then calculate stddev from
>> > the detailed data?
>>
>> No, I haven't used -v because it generates too many spurious wake-ups,
>> which make the results irrelevant.
>
> Yes, I agree about not using -v. It was just a wild guess on my part since
> I did not know how stddev was calculated. And I was incorrectly guessing
> that stddev was describing the frequency distribution of the latencies
> from a single test run.
I haven't been very precise in my computation, mainly because the
outputs were almost coherent, but we probably need more precise
statistics in a final step.
>
> As a general comment on cyclictest, I don't find average latency
> (in isolation) sufficient to compare different runs of cyclictest.
> And stddev of the frequency distribution of the latencies (which
> can be calculated from the -h data, with fairly low cyclictest
> overhead) is usually interesting but should be viewed with a healthy
> skepticism since that frequency distribution is often not a normal
> distribution. In addition to average latency, I normally look at
> maximum latency and the frequency distribution of latence (in table
> or graph form).
>
> (One side effect of specifying -h is that the -d option is then
> ignored.)
>
I'm going to have a look at the -h parameter, which can be useful to
get a better view of the frequency distribution as you point out.
Having the distance set to 0 (-d) can be an issue because we could
have a synchronization of the wake ups of the threads, which would
finally hide the real wake-up latency. It's interesting to have a
distance which ensures that the threads will wake up in an
"asynchronous" manner; that's why I have chosen 150 (which may not be
the best value).
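As a sketch of the kind of post-processing the histogram allows, the snippet below computes mean, stddev, and max latency from (latency, count) pairs such as could be parsed from one thread's column of cyclictest -h output (the bucket values here are made up for illustration):

```python
import math

def histogram_stats(buckets):
    """Mean, stddev, and max latency from (latency_us, count) pairs,
    i.e. the frequency distribution of latencies for one thread."""
    total = sum(c for _, c in buckets)
    mean = sum(lat * c for lat, c in buckets) / total
    var = sum(c * (lat - mean) ** 2 for lat, c in buckets) / total
    return mean, math.sqrt(var), max(lat for lat, c in buckets if c > 0)

# Made-up histogram: most wake ups near 50us, a small tail at 400us.
# The tail dominates the stddev, illustrating why the distribution
# should be inspected rather than trusting a single summary value.
mean, stddev, max_lat = histogram_stats([(50, 900), (60, 80), (400, 20)])
```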
Thanks,
Vincent
> Thanks,
>
> -Frank
Thread overview: 11+ messages
2013-11-07 10:54 Bench for testing scheduler Vincent Guittot
2013-11-07 11:32 ` Catalin Marinas
2013-11-07 13:33 ` Vincent Guittot
2013-11-07 14:04 ` Catalin Marinas
2013-11-08 9:30 ` Vincent Guittot
2013-11-08 0:04 ` Rowand, Frank
2013-11-08 9:28 ` Vincent Guittot
2013-11-08 21:12 ` Rowand, Frank
2013-11-12 10:02 ` Vincent Guittot
2013-11-07 17:42 ` Morten Rasmussen
2013-11-09 0:15 ` Rowand, Frank