* Bench for testing scheduler
From: Vincent Guittot @ 2013-11-07 10:54 UTC (permalink / raw)
To: Morten Rasmussen, Alex Shi, Vincent Guittot, Peter Zijlstra,
    Paul Turner, Ingo Molnar, rjw, Srivatsa S. Bhat, Catalin Marinas,
    Paul Walmsley, Mel Gorman, Juri Lelli, fengguang.wu, markgross,
    Kevin Hilman, Frank.Rowand, Paul McKenney, linux-kernel

Hi,

During the Energy-aware scheduling mini-summit, we spoke about benches
that should be used to evaluate modifications of the scheduler. I'd
like to propose a bench that uses cyclictest to measure the wake up
latency and the power consumption. The goal of this bench is to
exercise the scheduler with various sleeping periods and get the
average wake up latency. The range of sleeping periods must cover all
residency times of the idle state table of the platform. I have run
such tests on a TC2 platform with the packing tasks patchset. I have
used the following command:

#cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
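For reference, a small wrapper along the lines of the sketch below can
sweep the interval range and log one combined figure per interval.
This is only a sketch: the 200 us step matches the tables below, and
the awk parsing assumes the default per-thread summary format that
cyclictest prints with -q (as shown later in this thread); nproc may
need replacing on older userspaces.

#!/bin/sh
# Sketch: sweep the cyclictest sleep interval and print, for each
# interval, the average of the per-thread "Avg:" latencies.
CORES=$(nproc)
for I in $(seq 500 200 12500); do
    cyclictest -t "$CORES" -q -e 10000000 -i "$I" -d 150 -l 2000 |
    awk -v i="$I" '/Avg:/ {
        for (f = 1; f <= NF; f++)
            if ($f == "Avg:") { sum += $(f + 1); n++ }
    } END { if (n) printf "%5d %7.1f\n", i, sum / n }'
done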
The figures below give the average wake up latency and power
consumption for the default scheduler behavior, packing tasks at
cluster level and packing tasks at core level. We can see variation in
both wake up latency and power consumption. The detailed result is not
a simple single value, which makes comparison not so easy, but the
average of all measurements should give us a usable "score".

I know that Ingo would like to add the benches in tools/*, but I
wonder if it makes sense to copy cyclictest into this directory when
we have an official git tree here:
git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git

I have put both the final "score" and the detailed results below so
everybody can check the score vs the detailed figures:

           | Default average results             | Cluster Packing average results     | Core Packing average results
           | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
           | (us)              (J)       (J)     | (us)              (J)       (J)     | (us)              (J)       (J)
           |    879            794890   2364175  |    416            879688     12750  |    189            897452     30052

Cyclictest | Default                             | Packing at Cluster level            | Packing at Core level
Interval   | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
(us)       | (us)              (J)       (J)     | (us)              (J)       (J)     | (us)              (J)       (J)
  500      |     24     1  1147477  2479576 |     21     1  1136768    11693 |     22     1  1126062    30138
  700      |     22     1  1136084  3058419 |     21     0  1125280    11761 |     21     1  1109950    23503
  900      |     22     1  1136017  3036768 |     21     1  1112542    12017 |     20     0  1101089    23733
 1100      |     24     1  1132964  2506132 |     21     0  1109039    12248 |     21     1  1091832    23621
 1300      |     24     1  1123896  2488459 |     21     0  1099308    12015 |     21     1  1086301    23264
 1500      |     24     1  1120842  2488272 |     21     0  1099811    12685 |     20     0  1083658    22499
 1700      |     41    38  1117166  3042091 |     21     0  1090920    12393 |     21     1  1080387    23015
 1900      |    119   182  1120552  2737555 |     21     0  1087900    11900 |     21     1  1078711    23177
 2100      |    167   195  1122425  3210655 |     22     2  1090420    11900 |     20     1  1077985    22639
 2300      |    152   156  1119854  2497773 |     43    22  1087278    11921 |     21     1  1075943    26282
 2500      |    182   163  1120818  2365870 |     63    29  1089169    11551 |     21     0  1073717    24290
 2700      |    439   202  1058952  3058516 |    107    41  1077955    12122 |     21     0  1070951    23126
 2900      |    570   268  1028238  3099162 |    148    30  1067562    13287 |     24     1  1064200    24260
 3100      |    751   137   946512  3158095 |    178    30  1059395    12236 |     29     1  1058887    23225
 3300      |    696   203   964822  3042524 |    206    28  1041194    13934 |     36     1  1056656    23941
 3500      |    728   191   959398  3006066 |    235    36  1028150    13387 |     44     3  1045841    23873
 3700      |    844   138   921780  3033189 |    245    31  1019065    14582 |     62     6  1034466    22501
 3900      |    815   172   925600  2862994 |    273    33  1001974    12091 |     80     9  1014650    24444
 4100      |    870   179   897616  2940444 |    279    35   996226    12014 |     88    11  1030588    25461
 4300      |    979   119   846912  2996911 |    306    36   980075    12641 |    100    12  1035173    24832
 4500      |    891   168   863631  2760879 |    336    45   955072    12016 |    126    12   993256    23929
 4700      |    943   110   836333  2796629 |    351    39   942390    12902 |    125    15   996548    24637
 4900      |    997   118   800205  2743317 |    391    49   917067    12868 |    134    23  1011089    25266
 5100      |   1050   114   789152  2693104 |    408    53   903123    12033 |    196    22   894294    25142
 5300      |   1052   111   769544  2668315 |    425    54   895006    12264 |    171    19   933356    25873
 5500      |   1002   179   794222  2554432 |    430    45   886025    12007 |    171    18   938921    24382
 5700      |   1002   180   786714  2441228 |    436    46   878043    12258 |    172    14   944908    30291
 5900      |   1117    90   742883  2554813 |    471    53   864134    12471 |    170    12   957811    25119
 6100      |   1166    92   734510  2566381 |    479    68   854384    12579 |    190    16   926807    25544
 6300      |   1132   123   738812  2447974 |    488    57   849740    12968 |    216    10   882940    26546
 6500      |   1123   150   743870  2323338 |    495    52   836256    12472 |    210    20   896639    25149
 6700      |   1173   139   724691  2330720 |    522    70   822678    12949 |    269    27   800938    28653
 6900      |   1054   112   725451  2953919 |    522    69   822682    12184 |    261    26   785269    28199
 7100      |   1098   174   731504  2255090 |    502    87   820909    13072 |    216    15   870777    25336
 7300      |   1244   156   702596  2317562 |    531    88   808677    12770 |    247    18   813081    28126
 7500      |   1181   143   694538  2226994 |    545    90   796698    12368 |    226    14   862177    26597
 7700      |   1189   147   681836  2183167 |    555    87   799215    12499 |    250    17   797699    26342
 7900      |   1082   149   694010  1926757 |    555    90   791777    13137 |    243    20   824061    26772
 8100      |   1068   145   678222  2791019 |    552    80   785043    13071 |    266    16   781563    26579
 8300      |   1102   135   690978  1851892 |    582   136   781035    13067 |    267    18   782060    26683
 8500      |   1190   191   653566  2068057 |    574   127   777348    13139 |    262    21   800524    27086
 8700      |   1172   185   666525  2031543 |    602   104   778754    13364 |    228    13   884802    25340
 8900      |   1024   179   685123  1689661 |    594    98   768617    13753 |    266    20   801557    26075
 9100      |   1077   166   658295  1756367 |    615   101   759656    13297 |    308    19   739619    25677
 9300      |   1211   203   618593  2055230 |    606   111   753652    13231 |    319    23   743849    26041
 9500      |   1163   189   627123  1794459 |    615   125   751993    13174 |    264    19   865898    25795
 9700      |   1240   202   589520  1983417 |    649   157   738596    13473 |    326    71   742113    25528
 9900      |   1188   207   612908  1830208 |    635   125   725890    14240 |    299    40   770069    24714
10100      |   1168   219   596998  1781611 |    647   132   718260    13834 |    245    35   905581    24854
10300      |   1083   222   615543  1506529 |    641   130   700636    13108 |    401    24   643222    26497
10500      |   1183   210   573875  1753476 |    648   169   708408    12756 |    392    30   636559    28712
10700      |   1217   234   526025  2014191 |    648   165   696542    13092 |    374    26   675566    28555
10900      |   1161   179   594406  1722260 |    647   194   698681    13715 |    344    45   682158    26681
11100      |   1185   209   578309  1919206 |    670   166   724562    13408 |    339    50   743402    28010
11300      |   1144   185   609694  1791436 |    671   136   712555    12769 |    307    36   762260    26575
11500      |   1070   188   617941  1470628 |    650   151   723367    12596 |    353    21   659704    28015
11700      |   1205   199   570787  1801593 |    673   168   706260    12568 |    347    12   689414    29196
11900      |   1216   174   563915  1761745 |    686   135   698164    12840 |    361    10   663126    27517
12100      |   1155   218   568867  1596189 |    677   159   705873    12759 |    309    14   774833   290747
12300      |   1236   187   543536  1738447 |    705   177   705564    13028 |    330    21   745009    28134
12500      |   1176   202   545135  1651420 |    696   148   697624    13280 |    339    20   724057    26461

Vincent
* Re: Bench for testing scheduler
From: Catalin Marinas @ 2013-11-07 11:32 UTC (permalink / raw)
To: Vincent Guittot
Cc: Morten Rasmussen, Alex Shi, Peter Zijlstra, Paul Turner,
    Ingo Molnar, rjw, Srivatsa S. Bhat, Paul Walmsley, Mel Gorman,
    Juri Lelli, fengguang.wu, markgross, Kevin Hilman, Frank.Rowand,
    Paul McKenney, linux-kernel

Hi Vincent,

(for whatever reason, the text is wrapped and the results are hard to
read)

On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> During the Energy-aware scheduling mini-summit, we spoke about
> benches that should be used to evaluate modifications of the
> scheduler. I'd like to propose a bench that uses cyclictest to
> measure the wake up latency and the power consumption. The goal of
> this bench is to exercise the scheduler with various sleeping periods
> and get the average wake up latency. The range of sleeping periods
> must cover all residency times of the idle state table of the
> platform. I have run such tests on a TC2 platform with the packing
> tasks patchset. I have used the following command:
> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000

cyclictest could be a good starting point but we need to improve it to
allow threads of different loads, possibly starting multiple processes
(can be done with a script), and randomly varying load threads. These
parameters should be loaded from a file so that we can have multiple
configurations (per SoC and per use-case). But the big risk is that we
try to optimise the scheduler for something which is not realistic.

We are working on describing some basic scenarios (plain English for
now) and one of them could be video playing, with threads for audio
and video decoding and random changes in the workload.

So I think the first step should be a set of tools/scripts to analyse
the scheduler behaviour, both in terms of latency and power, and these
can use perf sched. We can then run some real-life scenarios (e.g.
Android video playback) and build a benchmark that matches such
behaviour as closely as possible. We can probably use (or improve)
perf sched replay to also simulate such workloads (we may need
additional features like thread dependencies).
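To make this concrete, the flow could look roughly like the outline
below. The commands are existing perf subcommands; the analysis
scripting around them, and any replay extensions, are the parts that
would need writing:

# Record scheduler events while a workload runs, then report per-task
# wakeup latencies; "perf sched replay" re-creates a similar workload.
perf sched record sleep 10     # record for 10s (or wrap a real workload)
perf sched latency --sort max  # per-task wakeup latency report
perf sched replay              # simulate the recorded workload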
> The figures below give the average wake up latency and power
> consumption for the default scheduler behavior, packing tasks at
> cluster level and packing tasks at core level. We can see variation
> in both wake up latency and power consumption. The detailed result is
> not a simple single value, which makes comparison not so easy, but
> the average of all measurements should give us a usable "score".

How did you assess the power/energy?

Thanks.

--
Catalin

* Re: Bench for testing scheduler
From: Vincent Guittot @ 2013-11-07 13:33 UTC (permalink / raw)
To: catalin.marinas
Cc: Morten.Rasmussen, alex.shi, peterz, pjt, mingo, rjw, srivatsa.bhat,
    paul, mgorman, juri.lelli, fengguang.wu, markgross, khilman,
    Frank.Rowand, paulmck, linux-kernel, Vincent Guittot

On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
> Hi Vincent,
>
> (for whatever reason, the text is wrapped and the results are hard to
> read)

Yes, I have just seen that. It looks like gmail has wrapped the lines.
I have added the results, which should not be wrapped, at the end of
this email.

> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>> During the Energy-aware scheduling mini-summit, we spoke about
>> benches that should be used to evaluate modifications of the
>> scheduler. I'd like to propose a bench that uses cyclictest to
>> measure the wake up latency and the power consumption. The goal of
>> this bench is to exercise the scheduler with various sleeping
>> periods and get the average wake up latency. The range of sleeping
>> periods must cover all residency times of the idle state table of
>> the platform. I have run such tests on a TC2 platform with the
>> packing tasks patchset. I have used the following command:
>> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>
> cyclictest could be a good starting point but we need to improve it
> to allow threads of different loads, possibly starting multiple
> processes (can be done with a script), and randomly varying load
> threads. These parameters should be loaded from a file so that we can
> have multiple configurations (per SoC and per use-case). But the big
> risk is that we try to optimise the scheduler for something which is
> not realistic.

The goal of this simple bench is to measure the wake up latency and
the reachable values of the scheduler on a platform, not to emulate a
"real" use case. In the same way that sched-pipe tests a specific
behavior of the scheduler, this bench tests the wake up latency of a
system.

Starting multiple processes and adding some load can also be useful,
but the target would be a bit different from wake up latency. I have
one concern with randomness, because it prevents tests and results
from being repeatable and comparable.

I agree that we have to test "real" use cases, but that doesn't
prevent testing the limit of one characteristic of a system.

> We are working on describing some basic scenarios (plain English for
> now) and one of them could be video playing, with threads for audio
> and video decoding and random changes in the workload.
>
> So I think the first step should be a set of tools/scripts to analyse
> the scheduler behaviour, both in terms of latency and power, and
> these can use perf sched. We can then run some real-life scenarios
> (e.g. Android video playback) and build a benchmark that matches such
> behaviour as closely as possible. We can probably use (or improve)
> perf sched replay to also simulate such workloads (we may need
> additional features like thread dependencies).
>> The figures below give the average wake up latency and power
>> consumption for the default scheduler behavior, packing tasks at
>> cluster level and packing tasks at core level. We can see variation
>> in both wake up latency and power consumption. The detailed result
>> is not a simple single value, which makes comparison not so easy,
>> but the average of all measurements should give us a usable "score".
>
> How did you assess the power/energy?

I have used the embedded joule meter of the TC2.
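(For anyone wanting to reproduce this, the sketch below shows one way
to sample such a meter around a run. It assumes the platform exposes
the counter as a standard hwmon energy attribute; the sysfs path, the
attribute name and the unit are platform-specific assumptions here,
not something verified in this thread.)

#!/bin/sh
# Sketch: sample an energy counter before and after a benchmark run.
# The hwmon path below is an assumption; it differs per platform.
E=/sys/class/hwmon/hwmon0/energy1_input
BEFORE=$(cat "$E")
cyclictest -t 5 -q -e 10000000 -i 1500 -d 150 -l 2000
AFTER=$(cat "$E")
echo "energy over the run: $((AFTER - BEFORE))"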
> Thanks.
>
> --
> Catalin

           | Default average results             | Cluster Packing average results     | Core Packing average results
           | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
           | (us)              (J)       (J)     | (us)              (J)       (J)     | (us)              (J)       (J)
           |    879            794890   2364175  |    416            879688     12750  |    189            897452     30052

Cyclictest | Default                             | Packing at Cluster level            | Packing at Core level
Interval   | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
(us)       | (us)              (J)       (J)     | (us)              (J)       (J)     | (us)              (J)       (J)
  500      |     24     1  1147477  2479576 |     21     1  1136768    11693 |     22     1  1126062    30138
  700      |     22     1  1136084  3058419 |     21     0  1125280    11761 |     21     1  1109950    23503
  900      |     22     1  1136017  3036768 |     21     1  1112542    12017 |     20     0  1101089    23733
 1100      |     24     1  1132964  2506132 |     21     0  1109039    12248 |     21     1  1091832    23621
 1300      |     24     1  1123896  2488459 |     21     0  1099308    12015 |     21     1  1086301    23264
 1500      |     24     1  1120842  2488272 |     21     0  1099811    12685 |     20     0  1083658    22499
 1700      |     41    38  1117166  3042091 |     21     0  1090920    12393 |     21     1  1080387    23015
 1900      |    119   182  1120552  2737555 |     21     0  1087900    11900 |     21     1  1078711    23177
 2100      |    167   195  1122425  3210655 |     22     2  1090420    11900 |     20     1  1077985    22639
 2300      |    152   156  1119854  2497773 |     43    22  1087278    11921 |     21     1  1075943    26282
 2500      |    182   163  1120818  2365870 |     63    29  1089169    11551 |     21     0  1073717    24290
 2700      |    439   202  1058952  3058516 |    107    41  1077955    12122 |     21     0  1070951    23126
 2900      |    570   268  1028238  3099162 |    148    30  1067562    13287 |     24     1  1064200    24260
 3100      |    751   137   946512  3158095 |    178    30  1059395    12236 |     29     1  1058887    23225
 3300      |    696   203   964822  3042524 |    206    28  1041194    13934 |     36     1  1056656    23941
 3500      |    728   191   959398  3006066 |    235    36  1028150    13387 |     44     3  1045841    23873
 3700      |    844   138   921780  3033189 |    245    31  1019065    14582 |     62     6  1034466    22501
 3900      |    815   172   925600  2862994 |    273    33  1001974    12091 |     80     9  1014650    24444
 4100      |    870   179   897616  2940444 |    279    35   996226    12014 |     88    11  1030588    25461
 4300      |    979   119   846912  2996911 |    306    36   980075    12641 |    100    12  1035173    24832
 4500      |    891   168   863631  2760879 |    336    45   955072    12016 |    126    12   993256    23929
 4700      |    943   110   836333  2796629 |    351    39   942390    12902 |    125    15   996548    24637
 4900      |    997   118   800205  2743317 |    391    49   917067    12868 |    134    23  1011089    25266
 5100      |   1050   114   789152  2693104 |    408    53   903123    12033 |    196    22   894294    25142
 5300      |   1052   111   769544  2668315 |    425    54   895006    12264 |    171    19   933356    25873
 5500      |   1002   179   794222  2554432 |    430    45   886025    12007 |    171    18   938921    24382
 5700      |   1002   180   786714  2441228 |    436    46   878043    12258 |    172    14   944908    30291
 5900      |   1117    90   742883  2554813 |    471    53   864134    12471 |    170    12   957811    25119
 6100      |   1166    92   734510  2566381 |    479    68   854384    12579 |    190    16   926807    25544
 6300      |   1132   123   738812  2447974 |    488    57   849740    12968 |    216    10   882940    26546
 6500      |   1123   150   743870  2323338 |    495    52   836256    12472 |    210    20   896639    25149
 6700      |   1173   139   724691  2330720 |    522    70   822678    12949 |    269    27   800938    28653
 6900      |   1054   112   725451  2953919 |    522    69   822682    12184 |    261    26   785269    28199
 7100      |   1098   174   731504  2255090 |    502    87   820909    13072 |    216    15   870777    25336
 7300      |   1244   156   702596  2317562 |    531    88   808677    12770 |    247    18   813081    28126
 7500      |   1181   143   694538  2226994 |    545    90   796698    12368 |    226    14   862177    26597
 7700      |   1189   147   681836  2183167 |    555    87   799215    12499 |    250    17   797699    26342
 7900      |   1082   149   694010  1926757 |    555    90   791777    13137 |    243    20   824061    26772
 8100      |   1068   145   678222  2791019 |    552    80   785043    13071 |    266    16   781563    26579
 8300      |   1102   135   690978  1851892 |    582   136   781035    13067 |    267    18   782060    26683
 8500      |   1190   191   653566  2068057 |    574   127   777348    13139 |    262    21   800524    27086
 8700      |   1172   185   666525  2031543 |    602   104   778754    13364 |    228    13   884802    25340
 8900      |   1024   179   685123  1689661 |    594    98   768617    13753 |    266    20   801557    26075
 9100      |   1077   166   658295  1756367 |    615   101   759656    13297 |    308    19   739619    25677
 9300      |   1211   203   618593  2055230 |    606   111   753652    13231 |    319    23   743849    26041
 9500      |   1163   189   627123  1794459 |    615   125   751993    13174 |    264    19   865898    25795
 9700      |   1240   202   589520  1983417 |    649   157   738596    13473 |    326    71   742113    25528
 9900      |   1188   207   612908  1830208 |    635   125   725890    14240 |    299    40   770069    24714
10100      |   1168   219   596998  1781611 |    647   132   718260    13834 |    245    35   905581    24854
10300      |   1083   222   615543  1506529 |    641   130   700636    13108 |    401    24   643222    26497
10500      |   1183   210   573875  1753476 |    648   169   708408    12756 |    392    30   636559    28712
10700      |   1217   234   526025  2014191 |    648   165   696542    13092 |    374    26   675566    28555
10900      |   1161   179   594406  1722260 |    647   194   698681    13715 |    344    45   682158    26681
11100      |   1185   209   578309  1919206 |    670   166   724562    13408 |    339    50   743402    28010
11300      |   1144   185   609694  1791436 |    671   136   712555    12769 |    307    36   762260    26575
11500      |   1070   188   617941  1470628 |    650   151   723367    12596 |    353    21   659704    28015
11700      |   1205   199   570787  1801593 |    673   168   706260    12568 |    347    12   689414    29196
11900      |   1216   174   563915  1761745 |    686   135   698164    12840 |    361    10   663126    27517
12100      |   1155   218   568867  1596189 |    677   159   705873    12759 |    309    14   774833   290747
12300      |   1236   187   543536  1738447 |    705   177   705564    13028 |    330    21   745009    28134
12500      |   1176   202   545135  1651420 |    696   148   697624    13280 |    339    20   724057    26461

Vincent
* Re: Bench for testing scheduler
From: Catalin Marinas @ 2013-11-07 14:04 UTC (permalink / raw)
To: Vincent Guittot
Cc: Morten Rasmussen, alex.shi, peterz, pjt, mingo, rjw, srivatsa.bhat,
    paul, mgorman, juri.lelli, fengguang.wu, markgross, khilman,
    Frank.Rowand, paulmck, linux-kernel

On Thu, Nov 07, 2013 at 01:33:43PM +0000, Vincent Guittot wrote:
> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>>> During the Energy-aware scheduling mini-summit, we spoke about
>>> benches that should be used to evaluate modifications of the
>>> scheduler. I'd like to propose a bench that uses cyclictest to
>>> measure the wake up latency and the power consumption. The goal of
>>> this bench is to exercise the scheduler with various sleeping
>>> periods and get the average wake up latency. The range of sleeping
>>> periods must cover all residency times of the idle state table of
>>> the platform. I have run such tests on a TC2 platform with the
>>> packing tasks patchset. I have used the following command:
>>> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>>
>> cyclictest could be a good starting point but we need to improve it
>> to allow threads of different loads, possibly starting multiple
>> processes (can be done with a script), and randomly varying load
>> threads. These parameters should be loaded from a file so that we
>> can have multiple configurations (per SoC and per use-case). But the
>> big risk is that we try to optimise the scheduler for something
>> which is not realistic.
>
> The goal of this simple bench is to measure the wake up latency and
> the reachable values of the scheduler on a platform, not to emulate a
> "real" use case. In the same way that sched-pipe tests a specific
> behavior of the scheduler, this bench tests the wake up latency of a
> system.

These figures are indeed useful to make sure we don't have any
regression in terms of latency, but I would not use cyclictest (as it
is) to assess power improvements, since the test is too artificial.

> Starting multiple processes and adding some load can also be useful,
> but the target would be a bit different from wake up latency. I have
> one concern with randomness, because it prevents tests and results
> from being repeatable and comparable.

We can avoid randomness but still make the load vary by some
predictable function.
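As an illustration only: a fixed seed gives a sequence that varies
within a run but is identical across runs, so results stay comparable
(the exact sequence depends on the awk implementation):

# Sketch: a deterministic pseudo-random sequence of sleep intervals.
# With a fixed seed, every test run sees exactly the same "random"
# variation.
awk 'BEGIN { srand(42); for (n = 0; n < 10; n++)
        print 500 + int(rand() * 11500) }'

Each value could then be fed to a cyclictest (or similar) thread in
turn.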
> I agree that we have to test "real" use cases, but that doesn't
> prevent testing the limit of one characteristic of a system.

I agree. My point is not to use this as "the benchmark".

I would prefer to assess the impact on latency (and power) using a
tool independent from benchmarks like cyclictest (e.g. use the reports
from perf sched). The reason is that once we have those tools/scripts
in the kernel, a third party can run them on real workloads and
provide the kernel developers with real numbers on performance vs
power scheduling, regressions between kernel versions, etc. We can't
create a power model that you can run on x86, for example, and have it
give you an indication of the power savings on ARM; you need to run
the benchmarks on the actual hardware (that's why I don't think
linsched is of much use from a power perspective).

--
Catalin

* Re: Bench for testing scheduler
From: Vincent Guittot @ 2013-11-08  9:30 UTC (permalink / raw)
To: Catalin Marinas
Cc: Morten Rasmussen, alex.shi, peterz, pjt, mingo, rjw, srivatsa.bhat,
    paul, mgorman, juri.lelli, fengguang.wu, markgross, khilman,
    Frank.Rowand, paulmck, linux-kernel

On 7 November 2013 15:04, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Nov 07, 2013 at 01:33:43PM +0000, Vincent Guittot wrote:
>> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>>>> During the Energy-aware scheduling mini-summit, we spoke about
>>>> benches that should be used to evaluate modifications of the
>>>> scheduler. I'd like to propose a bench that uses cyclictest to
>>>> measure the wake up latency and the power consumption. The goal of
>>>> this bench is to exercise the scheduler with various sleeping
>>>> periods and get the average wake up latency. The range of sleeping
>>>> periods must cover all residency times of the idle state table of
>>>> the platform. I have run such tests on a TC2 platform with the
>>>> packing tasks patchset. I have used the following command:
>>>> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>>>
>>> cyclictest could be a good starting point but we need to improve it
>>> to allow threads of different loads, possibly starting multiple
>>> processes (can be done with a script), and randomly varying load
>>> threads. These parameters should be loaded from a file so that we
>>> can have multiple configurations (per SoC and per use-case). But
>>> the big risk is that we try to optimise the scheduler for something
>>> which is not realistic.
>>
>> The goal of this simple bench is to measure the wake up latency and
>> the reachable values of the scheduler on a platform, not to emulate
>> a "real" use case. In the same way that sched-pipe tests a specific
>> behavior of the scheduler, this bench tests the wake up latency of a
>> system.
>
> These figures are indeed useful to make sure we don't have any
> regression in terms of latency, but I would not use cyclictest (as it
> is) to assess power improvements, since the test is too artificial.
>
>> Starting multiple processes and adding some load can also be useful,
>> but the target would be a bit different from wake up latency. I have
>> one concern with randomness, because it prevents tests and results
>> from being repeatable and comparable.
>
> We can avoid randomness but still make the load vary by some
> predictable function.
>
>> I agree that we have to test "real" use cases, but that doesn't
>> prevent testing the limit of one characteristic of a system.
>
> I agree. My point is not to use this as "the benchmark".

OK, so I don't plan to make cyclictest "the" benchmark but "a"
benchmark among others, because I'm not sure that we can cover all
needs with only one benchmark. As an example, cyclictest gives
information about the wake up latency that can't be collected with a
trace.

> I would prefer to assess the impact on latency (and power) using a
> tool independent from benchmarks like cyclictest (e.g. use the
> reports from perf sched).
> The reason is that once we have those tools/scripts in the kernel, a
> third party can run them on real workloads and provide the kernel
> developers with real numbers on performance vs power scheduling,
> regressions between kernel versions, etc. We can't create a power
> model that you can run on x86, for example, and have it give you an
> indication of the power savings on ARM; you need to run the
> benchmarks on the actual hardware (that's why I don't think linsched
> is of much use from a power perspective).
>
> --
> Catalin
* RE: Bench for testing scheduler
From: Rowand, Frank @ 2013-11-08  0:04 UTC (permalink / raw)
To: Vincent Guittot, catalin.marinas
Cc: Morten.Rasmussen, alex.shi, peterz, pjt, mingo, rjw, srivatsa.bhat,
    paul, mgorman, juri.lelli, fengguang.wu, markgross, khilman,
    paulmck, linux-kernel

Hi Vincent,

Thanks for creating some benchmark numbers!

On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>
> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> Hi Vincent,
>>
>> (for whatever reason, the text is wrapped and the results are hard
>> to read)
>
> Yes, I have just seen that. It looks like gmail has wrapped the
> lines. I have added the results, which should not be wrapped, at the
> end of this email.
>
>> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>>> During the Energy-aware scheduling mini-summit, we spoke about
>>> benches that should be used to evaluate modifications of the
>>> scheduler. I'd like to propose a bench that uses cyclictest to
>>> measure the wake up latency and the power consumption. The goal of
>>> this bench is to exercise the scheduler with various sleeping
>>> periods and get the average wake up latency. The range of sleeping
>>> periods must cover all residency times of the idle state table of
>>> the platform. I have run such tests on a TC2 platform with the
>>> packing tasks patchset. I have used the following command:
>>> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000

The number of loops ("-l 2000") should be much larger to create useful
results. I don't have a specific number that is large enough, I just
know from experience that 2000 is way too small.
For example, running cyclictest several times with the same values on
my laptop gives values that are not consistent:

$ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
# /dev/cpu_dma_latency set to 10000000us
T: 0 ( 9703) P: 0 I:500 C: 2000 Min: 2 Act: 90 Avg: 77 Max: 243
T: 1 ( 9704) P: 0 I:650 C: 1557 Min: 2 Act: 58 Avg: 68 Max: 226
T: 2 ( 9705) P: 0 I:800 C: 1264 Min: 2 Act: 54 Avg: 81 Max: 1017
T: 3 ( 9706) P: 0 I:950 C: 1065 Min: 2 Act: 11 Avg: 80 Max: 260

$ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
# /dev/cpu_dma_latency set to 10000000us
T: 0 ( 9709) P: 0 I:500 C: 2000 Min: 2 Act: 45 Avg: 74 Max: 390
T: 1 ( 9710) P: 0 I:650 C: 1554 Min: 2 Act: 82 Avg: 61 Max: 810
T: 2 ( 9711) P: 0 I:800 C: 1263 Min: 2 Act: 83 Avg: 74 Max: 287
T: 3 ( 9712) P: 0 I:950 C: 1064 Min: 2 Act: 103 Avg: 79 Max: 551

$ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
# /dev/cpu_dma_latency set to 10000000us
T: 0 ( 9716) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 72 Max: 252
T: 1 ( 9717) P: 0 I:650 C: 1556 Min: 2 Act: 115 Avg: 77 Max: 354
T: 2 ( 9718) P: 0 I:800 C: 1264 Min: 2 Act: 59 Avg: 78 Max: 1143
T: 3 ( 9719) P: 0 I:950 C: 1065 Min: 2 Act: 104 Avg: 70 Max: 238

$ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
# /dev/cpu_dma_latency set to 10000000us
T: 0 ( 9722) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 68 Max: 213
T: 1 ( 9723) P: 0 I:650 C: 1555 Min: 2 Act: 65 Avg: 65 Max: 1279
T: 2 ( 9724) P: 0 I:800 C: 1264 Min: 2 Act: 91 Avg: 69 Max: 244
T: 3 ( 9725) P: 0 I:950 C: 1065 Min: 2 Act: 58 Avg: 76 Max: 242
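(For what it's worth, those four runs are simply the same command
repeated; scripted, the stability check amounts to something like this
sketch:)

#!/bin/sh
# Sketch: repeat an identical cyclictest run to see how stable the
# per-thread "Avg:" figures are for a given loop count.
for RUN in 1 2 3 4; do
    echo "run $RUN:"
    sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000 | grep 'Avg:'
done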
>> cyclictest could be a good starting point but we need to improve it
>> to allow threads of different loads, possibly starting multiple
>> processes (can be done with a script), and randomly varying load
>> threads. These parameters should be loaded from a file so that we
>> can have multiple configurations (per SoC and per use-case). But the
>> big risk is that we try to optimise the scheduler for something
>> which is not realistic.
>
> The goal of this simple bench is to measure the wake up latency and
> the reachable values of the scheduler on a platform, not to emulate a
> "real" use case. In the same way that sched-pipe tests a specific
> behavior of the scheduler, this bench tests the wake up latency of a
> system.
>
> Starting multiple processes and adding some load can also be useful,
> but the target would be a bit different from wake up latency. I have
> one concern with randomness, because it prevents tests and results
> from being repeatable and comparable.
>
> I agree that we have to test "real" use cases, but that doesn't
> prevent testing the limit of one characteristic of a system.
>
>> We are working on describing some basic scenarios (plain English for
>> now) and one of them could be video playing, with threads for audio
>> and video decoding and random changes in the workload.
>>
>> So I think the first step should be a set of tools/scripts to
>> analyse the scheduler behaviour, both in terms of latency and power,
>> and these can use perf sched. We can then run some real-life
>> scenarios (e.g. Android video playback) and build a benchmark that
>> matches such behaviour as closely as possible. We can probably use
>> (or improve) perf sched replay to also simulate such workloads (we
>> may need additional features like thread dependencies).
>
>>> The figures below give the average wake up latency and power
>>> consumption for the default scheduler behavior, packing tasks at
>>> cluster level and packing tasks at core level. We can see variation
>>> in both wake up latency and power consumption. The detailed result
>>> is not a simple single value, which makes comparison not so easy,
>>> but the average of all measurements should give us a usable
>>> "score".
>>
>> How did you assess the power/energy?
>
> I have used the embedded joule meter of the TC2.
>
>> Thanks.
>>
>> --
>> Catalin
>
>            | Default average results             | Cluster Packing average results     | Core Packing average results
>            | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
>            | (us)              (J)       (J)     | (us)              (J)       (J)     | (us)              (J)       (J)
>            |    879            794890   2364175  |    416            879688     12750  |    189            897452     30052
>
> Cyclictest | Default                             | Packing at Cluster level            | Packing at Core level
> Interval   | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
> (us)       | (us)              (J)       (J)     | (us)              (J)       (J)     | (us)              (J)       (J)
>   500      |     24     1  1147477  2479576 |     21     1  1136768    11693 |     22     1  1126062    30138
>   700      |     22     1  1136084  3058419 |     21     0  1125280    11761 |     21     1  1109950    23503

< snip >

Some questions about what these metrics are:

The cyclictest data is reported per thread. How did you combine the
per-thread data to get a single latency and stddev value?

Is "Latency" the average latency?

stddev is not reported by cyclictest. How did you create this value?
Did you use the "-v" cyclictest option to report detailed data, then
calculate stddev from the detailed data?

Thanks,

-Frank
* Re: Bench for testing scheduler
From: Vincent Guittot @ 2013-11-08  9:28 UTC (permalink / raw)
To: Rowand, Frank
Cc: catalin.marinas, Morten.Rasmussen, alex.shi, peterz, pjt, mingo,
    rjw, srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
    markgross, khilman, paulmck, linux-kernel

On 8 November 2013 01:04, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
> Hi Vincent,
>
> Thanks for creating some benchmark numbers!

You're welcome.

> On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>>
>> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>> Hi Vincent,
>>>
>>> (for whatever reason, the text is wrapped and the results are hard
>>> to read)
>>
>> Yes, I have just seen that. It looks like gmail has wrapped the
>> lines. I have added the results, which should not be wrapped, at the
>> end of this email.
>>
>>> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>>>> During the Energy-aware scheduling mini-summit, we spoke about
>>>> benches that should be used to evaluate modifications of the
>>>> scheduler. I'd like to propose a bench that uses cyclictest to
>>>> measure the wake up latency and the power consumption. The goal of
>>>> this bench is to exercise the scheduler with various sleeping
>>>> periods and get the average wake up latency. The range of sleeping
>>>> periods must cover all residency times of the idle state table of
>>>> the platform. I have run such tests on a TC2 platform with the
>>>> packing tasks patchset. I have used the following command:
>>>> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>
> The number of loops ("-l 2000") should be much larger to create
> useful results. I don't have a specific number that is large enough,
> I just know from experience that 2000 is way too small. For example,
> running cyclictest several times with the same values on my laptop
> gives values that are not consistent:

The Avg figures look almost stable IMO. Are you speaking about the Max
values for the inconsistency?
> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> # /dev/cpu_dma_latency set to 10000000us
> T: 0 ( 9703) P: 0 I:500 C: 2000 Min: 2 Act: 90 Avg: 77 Max: 243
> T: 1 ( 9704) P: 0 I:650 C: 1557 Min: 2 Act: 58 Avg: 68 Max: 226
> T: 2 ( 9705) P: 0 I:800 C: 1264 Min: 2 Act: 54 Avg: 81 Max: 1017
> T: 3 ( 9706) P: 0 I:950 C: 1065 Min: 2 Act: 11 Avg: 80 Max: 260
>
> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> # /dev/cpu_dma_latency set to 10000000us
> T: 0 ( 9709) P: 0 I:500 C: 2000 Min: 2 Act: 45 Avg: 74 Max: 390
> T: 1 ( 9710) P: 0 I:650 C: 1554 Min: 2 Act: 82 Avg: 61 Max: 810
> T: 2 ( 9711) P: 0 I:800 C: 1263 Min: 2 Act: 83 Avg: 74 Max: 287
> T: 3 ( 9712) P: 0 I:950 C: 1064 Min: 2 Act: 103 Avg: 79 Max: 551
>
> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> # /dev/cpu_dma_latency set to 10000000us
> T: 0 ( 9716) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 72 Max: 252
> T: 1 ( 9717) P: 0 I:650 C: 1556 Min: 2 Act: 115 Avg: 77 Max: 354
> T: 2 ( 9718) P: 0 I:800 C: 1264 Min: 2 Act: 59 Avg: 78 Max: 1143
> T: 3 ( 9719) P: 0 I:950 C: 1065 Min: 2 Act: 104 Avg: 70 Max: 238
>
> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> # /dev/cpu_dma_latency set to 10000000us
> T: 0 ( 9722) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 68 Max: 213
> T: 1 ( 9723) P: 0 I:650 C: 1555 Min: 2 Act: 65 Avg: 65 Max: 1279
> T: 2 ( 9724) P: 0 I:800 C: 1264 Min: 2 Act: 91 Avg: 69 Max: 244
> T: 3 ( 9725) P: 0 I:950 C: 1065 Min: 2 Act: 58 Avg: 76 Max: 242
>
>>> cyclictest could be a good starting point but we need to improve it
>>> to allow threads of different loads, possibly starting multiple
>>> processes (can be done with a script), and randomly varying load
>>> threads. These parameters should be loaded from a file so that we
>>> can have multiple configurations (per SoC and per use-case). But
>>> the big risk is that we try to optimise the scheduler for something
>>> which is not realistic.
>>
>> The goal of this simple bench is to measure the wake up latency and
>> the reachable values of the scheduler on a platform, not to emulate
>> a "real" use case. In the same way that sched-pipe tests a specific
>> behavior of the scheduler, this bench tests the wake up latency of a
>> system.
>>
>> Starting multiple processes and adding some load can also be useful,
>> but the target would be a bit different from wake up latency. I have
>> one concern with randomness, because it prevents tests and results
>> from being repeatable and comparable.
>>
>> I agree that we have to test "real" use cases, but that doesn't
>> prevent testing the limit of one characteristic of a system.
>>
>>> We are working on describing some basic scenarios (plain English
>>> for now) and one of them could be video playing, with threads for
>>> audio and video decoding and random changes in the workload.
>>>
>>> So I think the first step should be a set of tools/scripts to
>>> analyse the scheduler behaviour, both in terms of latency and
>>> power, and these can use perf sched. We can then run some real-life
>>> scenarios (e.g. Android video playback) and build a benchmark that
>>> matches such behaviour as closely as possible. We can probably use
>>> (or improve) perf sched replay to also simulate such workloads (we
>>> may need additional features like thread dependencies).
>
>>>> The figures below give the average wake up latency and power
>>>> consumption for the default scheduler behavior, packing tasks at
>>>> cluster level and packing tasks at core level. We can see
>>>> variation in both wake up latency and power consumption. The
>>>> detailed result is not a simple single value, which makes
>>>> comparison not so easy, but the average of all measurements should
>>>> give us a usable "score".
>>>
>>> How did you assess the power/energy?
>>
>> I have used the embedded joule meter of the TC2.
>>
>>> Thanks.
>>>
>>> --
>>> Catalin
>>
>>            | Default average results             | Cluster Packing average results     | Core Packing average results
>>            | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
>>            | (us)              (J)       (J)     | (us)              (J)       (J)     | (us)              (J)       (J)
>>            |    879            794890   2364175  |    416            879688     12750  |    189            897452     30052
>>
>> Cyclictest | Default                             | Packing at Cluster level            | Packing at Core level
>> Interval   | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
>> (us)       | (us)              (J)       (J)     | (us)              (J)       (J)     | (us)              (J)       (J)
>>   500      |     24     1  1147477  2479576 |     21     1  1136768    11693 |     22     1  1126062    30138
>>   700      |     22     1  1136084  3058419 |     21     0  1125280    11761 |     21     1  1109950    23503
>
> < snip >
>
> Some questions about what these metrics are:
>
> The cyclictest data is reported per thread. How did you combine the
> per-thread data to get a single latency and stddev value?
>
> Is "Latency" the average latency?

Yes. Here is the procedure I have followed to get my results:

I run the same test (same parameters) several times (I have tried
between 5 and 10 runs and the results were similar). For each run, I
compute the average of the per-thread average figures, and I compute
the stddev across the per-thread results. The results that I sent are
an average of all runs with the same parameters.

> stddev is not reported by cyclictest. How did you create this value?
> Did you use the "-v" cyclictest option to report detailed data, then
> calculate stddev from the detailed data?

No, I haven't used -v, because it generates too many spurious
wake-ups, which makes the results irrelevant.

Vincent

> Thanks,
>
> -Frank
* RE: Bench for testing scheduler
From: Rowand, Frank @ 2013-11-08 21:12 UTC (permalink / raw)
To: Vincent Guittot
Cc: catalin.marinas, Morten.Rasmussen, alex.shi, peterz, pjt, mingo,
    rjw, srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
    markgross, khilman, paulmck, linux-kernel

On Friday, November 08, 2013 1:28 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>
> On 8 November 2013 01:04, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
>> Hi Vincent,
>>
>> Thanks for creating some benchmark numbers!
>
> You're welcome.
>
>> On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>>>
>>> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>>> Hi Vincent,
>>>>
>>>> (for whatever reason, the text is wrapped and the results are hard
>>>> to read)
>>>
>>> Yes, I have just seen that. It looks like gmail has wrapped the
>>> lines. I have added the results, which should not be wrapped, at
>>> the end of this email.
>>>
>>>> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
>>>>> During the Energy-aware scheduling mini-summit, we spoke about
>>>>> benches that should be used to evaluate modifications of the
>>>>> scheduler. I'd like to propose a bench that uses cyclictest to
>>>>> measure the wake up latency and the power consumption. The goal
>>>>> of this bench is to exercise the scheduler with various sleeping
>>>>> periods and get the average wake up latency. The range of
>>>>> sleeping periods must cover all residency times of the idle state
>>>>> table of the platform. I have run such tests on a TC2 platform
>>>>> with the packing tasks patchset. I have used the following
>>>>> command:
>>>>> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>>
>> The number of loops ("-l 2000") should be much larger to create
>> useful results. I don't have a specific number that is large enough,
>> I just know from experience that 2000 is way too small. For example,
>> running cyclictest several times with the same values on my laptop
>> gives values that are not consistent:
>
> The Avg figures look almost stable IMO. Are you speaking about the
> Max values for the inconsistency?

The values on my laptop for "-l 2000" are not stable.

If I collapse all of the threads in each of the following tests to a
single value, I get the following table. Note that each thread
completes a different number of cycles, so I calculate the average as:

  total count = T0_count + T1_count + T2_count + T3_count

  avg = ( (T0_count * T0_avg) + (T1_count * T1_avg) + ...
        + (T3_count * T3_avg) ) / total count

min is the smallest min for any of the threads.

max is the largest max for any of the threads.

        total
test  T  count  min   avg    max
----  -  -----  ---  -----  -----
   1  4   5886    2   76.0   1017
   2  4   5881    2   71.5    810
   3  4   5885    2   74.2   1143
   4  4   5884    2   68.9   1279

Test 1's average is 10% larger than test 4's.

Test 4's maximum is 50% larger than test 2's.

But all of this is just a minor detail of how to run cyclictest. The
more important question is whether to use cyclictest results as a
valid workload or metric, so for the moment I won't comment further on
the cyclictest parameters you used to collect the example data you
provided.
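(For reference, that collapsing rule can be applied directly to the
per-thread summary lines, as in the rough sketch below; the field
handling assumes the output format shown in the runs that follow.)

# Sketch: collapse cyclictest per-thread summary lines into a single
# cycle-count-weighted average, plus the overall min and max.
sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000 |
awk '/Avg:/ {
    for (f = 1; f <= NF; f++) {
        if ($f == "C:")   c = $(f + 1)
        if ($f == "Min:" && (!n || $(f + 1) < min)) min = $(f + 1)
        if ($f == "Avg:") a = $(f + 1)
        if ($f == "Max:" && $(f + 1) > max)         max = $(f + 1)
    }
    n += c; s += c * a
} END { printf "count %d  min %d  avg %.1f  max %d\n", n, min, s / n, max }'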
>> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
>> # /dev/cpu_dma_latency set to 10000000us
>> T: 0 ( 9703) P: 0 I:500 C: 2000 Min: 2 Act: 90 Avg: 77 Max: 243
>> T: 1 ( 9704) P: 0 I:650 C: 1557 Min: 2 Act: 58 Avg: 68 Max: 226
>> T: 2 ( 9705) P: 0 I:800 C: 1264 Min: 2 Act: 54 Avg: 81 Max: 1017
>> T: 3 ( 9706) P: 0 I:950 C: 1065 Min: 2 Act: 11 Avg: 80 Max: 260
>>
>> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
>> # /dev/cpu_dma_latency set to 10000000us
>> T: 0 ( 9709) P: 0 I:500 C: 2000 Min: 2 Act: 45 Avg: 74 Max: 390
>> T: 1 ( 9710) P: 0 I:650 C: 1554 Min: 2 Act: 82 Avg: 61 Max: 810
>> T: 2 ( 9711) P: 0 I:800 C: 1263 Min: 2 Act: 83 Avg: 74 Max: 287
>> T: 3 ( 9712) P: 0 I:950 C: 1064 Min: 2 Act: 103 Avg: 79 Max: 551
>>
>> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
>> # /dev/cpu_dma_latency set to 10000000us
>> T: 0 ( 9716) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 72 Max: 252
>> T: 1 ( 9717) P: 0 I:650 C: 1556 Min: 2 Act: 115 Avg: 77 Max: 354
>> T: 2 ( 9718) P: 0 I:800 C: 1264 Min: 2 Act: 59 Avg: 78 Max: 1143
>> T: 3 ( 9719) P: 0 I:950 C: 1065 Min: 2 Act: 104 Avg: 70 Max: 238
>>
>> $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
>> # /dev/cpu_dma_latency set to 10000000us
>> T: 0 ( 9722) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 68 Max: 213
>> T: 1 ( 9723) P: 0 I:650 C: 1555 Min: 2 Act: 65 Avg: 65 Max: 1279
>> T: 2 ( 9724) P: 0 I:800 C: 1264 Min: 2 Act: 91 Avg: 69 Max: 244
>> T: 3 ( 9725) P: 0 I:950 C: 1065 Min: 2 Act: 58 Avg: 76 Max: 242
>
>>>> cyclictest could be a good starting point but we need to improve
>>>> it to allow threads of different loads, possibly starting multiple
>>>> processes (can be done with a script), and randomly varying load
>>>> threads. These parameters should be loaded from a file so that we
>>>> can have multiple configurations (per SoC and per use-case). But
>>>> the big risk is that we try to optimise the scheduler for
>>>> something which is not realistic.
>>>
>>> The goal of this simple bench is to measure the wake up latency and
>>> the reachable values of the scheduler on a platform, not to emulate
>>> a "real" use case. In the same way that sched-pipe tests a specific
>>> behavior of the scheduler, this bench tests the wake up latency of
>>> a system.
>>>
>>> Starting multiple processes and adding some load can also be
>>> useful, but the target would be a bit different from wake up
>>> latency. I have one concern with randomness, because it prevents
>>> tests and results from being repeatable and comparable.
>>>
>>> I agree that we have to test "real" use cases, but that doesn't
>>> prevent testing the limit of one characteristic of a system.
>>>
>>>> We are working on describing some basic scenarios (plain English
>>>> for now) and one of them could be video playing, with threads for
>>>> audio and video decoding and random changes in the workload.
>>>>
>>>> So I think the first step should be a set of tools/scripts to
>>>> analyse the scheduler behaviour, both in terms of latency and
>>>> power, and these can use perf sched. We can then run some
>>>> real-life scenarios (e.g. Android video playback) and build a
>>>> benchmark that matches such behaviour as closely as possible. We
>>>> can probably use (or improve) perf sched replay to also simulate
>>>> such workloads (we may need additional features like thread
>>>> dependencies).
>>>>> The figures below give the average wake up latency and power
>>>>> consumption for the default scheduler behavior, packing tasks at
>>>>> cluster level and packing tasks at core level. We can see
>>>>> variation in both wake up latency and power consumption. The
>>>>> detailed result is not a simple single value, which makes
>>>>> comparison not so easy, but the average of all measurements
>>>>> should give us a usable "score".
>>>>
>>>> How did you assess the power/energy?
>>>
>>> I have used the embedded joule meter of the TC2.
>>>
>>>> Thanks.
>>>>
>>>> --
>>>> Catalin
>>>
>>>            | Default average results             | Cluster Packing average results     | Core Packing average results
>>>            | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
>>>            | (us)              (J)       (J)     | (us)              (J)       (J)     | (us)              (J)       (J)
>>>            |    879            794890   2364175  |    416            879688     12750  |    189            897452     30052
>>>
>>> Cyclictest | Default                             | Packing at Cluster level            | Packing at Core level
>>> Interval   | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy
>>> (us)       | (us)              (J)       (J)     | (us)              (J)       (J)     | (us)              (J)       (J)
>>>   500      |     24     1  1147477  2479576 |     21     1  1136768    11693 |     22     1  1126062    30138
>>>   700      |     22     1  1136084  3058419 |     21     0  1125280    11761 |     21     1  1109950    23503
>>
>> < snip >

Thanks for clarifying how the data was calculated (below). Again, I
don't think this level of detail is the most important issue at this
point, but I'm going to comment on it while it is still fresh in my
mind.

>> Some questions about what these metrics are:
>>
>> The cyclictest data is reported per thread. How did you combine the
>> per-thread data to get a single latency and stddev value?
>>
>> Is "Latency" the average latency?
>
> Yes. Here is the procedure I have followed to get my results:
>
> I run the same test (same parameters) several times (I have tried
> between 5 and 10 runs and the results were similar). For each run, I
> compute the average of the per-thread average figures, and I compute
> the stddev across the per-thread results.

So the test run stddev is the standard deviation of the values for
average latency of the 8 (???) cyclictest threads in a test run?

If so, I don't think that the calculated stddev has much actual
meaning for comparing the algorithms (I do find it useful to get a
loose sense of how consistent multiple test runs with the same
parameters are).

> The results that I sent are an average of all runs with the same
> parameters.

Then the stddev in the table is the average of the stddevs in several
test runs?

The stddev later on in the table is often in the range of 10%, 20%,
50%, and 100% of the average latency. That is rather large.

>> stddev is not reported by cyclictest. How did you create this value?
>> Did you use the "-v" cyclictest option to report detailed data, then
>> calculate stddev from the detailed data?
>
> No, I haven't used -v, because it generates too many spurious
> wake-ups, which makes the results irrelevant.

Yes, I agree about not using -v. It was just a wild guess on my part
since I did not know how stddev was calculated. And I was incorrectly
guessing that stddev was describing the frequency distribution of the
latencies from a single test run.

As a general comment on cyclictest, I don't find average latency (in
isolation) sufficient to compare different runs of cyclictest.
And stddev of the frequency distribution of the latencies (which can
be calculated from the -h data, with fairly low cyclictest overhead)
is usually interesting, but it should be viewed with healthy
skepticism, since that frequency distribution is often not a normal
distribution. In addition to average latency, I normally look at
maximum latency and the frequency distribution of latency (in table or
graph form).

(One side effect of specifying -h is that the -d option is then
ignored.)

Thanks,

-Frank
* Re: Bench for testing scheduler
From: Vincent Guittot @ 2013-11-12 10:02 UTC (permalink / raw)
To: Rowand, Frank
Cc: catalin.marinas, Morten.Rasmussen, alex.shi, peterz, pjt, mingo,
    rjw, srivatsa.bhat, paul, mgorman, juri.lelli, fengguang.wu,
    markgross, khilman, paulmck, linux-kernel

On 8 November 2013 22:12, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
> On Friday, November 08, 2013 1:28 AM, Vincent Guittot [vincent.guittot@linaro.org] wrote:
>> On 8 November 2013 01:04, Rowand, Frank <Frank.Rowand@sonymobile.com> wrote:
>>
>> <snip>
>>
>> The Avg figures look almost stable IMO. Are you speaking about the
>> Max values for the inconsistency?
>
> The values on my laptop for "-l 2000" are not stable.
>
> If I collapse all of the threads in each of the following tests to a
> single value, I get the following table. Note that each thread
> completes a different number of cycles, so I calculate the average
> as:
>
>   total count = T0_count + T1_count + T2_count + T3_count
>
>   avg = ( (T0_count * T0_avg) + (T1_count * T1_avg) + ...
>         + (T3_count * T3_avg) ) / total count
>
> min is the smallest min for any of the threads.
>
> max is the largest max for any of the threads.
>
>         total
> test  T  count  min   avg    max
> ----  -  -----  ---  -----  -----
>    1  4   5886    2   76.0   1017
>    2  4   5881    2   71.5    810
>    3  4   5885    2   74.2   1143
>    4  4   5884    2   68.9   1279
>
> Test 1's average is 10% larger than test 4's.
>
> Test 4's maximum is 50% larger than test 2's.
>
> But all of this is just a minor detail of how to run cyclictest. The
> more important question is whether to use cyclictest results as a
> valid workload or metric, so for the moment I won't comment further
> on the cyclictest parameters you used to collect the example data you
> provided.
>
> Thanks for clarifying how the data was calculated (below). Again, I
> don't think this level of detail is the most important issue at this
> point, but I'm going to comment on it while it is still fresh in my
> mind.
>
>>> Some questions about what these metrics are:
>>>
>>> The cyclictest data is reported per thread. How did you combine the
>>> per-thread data to get a single latency and stddev value?
>>>
>>> Is "Latency" the average latency?
>>
>> Yes. Here is the procedure I have followed to get my results:
>>
>> I run the same test (same parameters) several times (I have tried
>> between 5 and 10 runs and the results were similar). For each run, I
>> compute the average of the per-thread average figures, and I compute
>> the stddev across the per-thread results.
>
> So the test run stddev is the standard deviation of the values for
> average latency of the 8 (???) cyclictest threads in a test run?

I have used 5 threads for my tests.

> If so, I don't think that the calculated stddev has much actual
> meaning for comparing the algorithms (I do find it useful to get a
> loose sense of how consistent multiple test runs with the same
> parameters are).
>
>> The results that I sent are an average of all runs with the same
>> parameters.
>
> Then the stddev in the table is the average of the stddevs in several
> test runs?

Yes, it is.

> The stddev later on in the table is often in the range of 10%, 20%,
> 50%, and 100% of the average latency. That is rather large.

Yes, I agree, and it's an interesting figure IMHO because it points
out how the wake up of a core can impact the task scheduling latency,
and how it's possible to reduce it or make it more stable (even if we
still have some large max values, which are probably not linked to the
wake up of a core but to other activities, like deferrable timers that
have fired).

>>> stddev is not reported by cyclictest. How did you create this
>>> value? Did you use the "-v" cyclictest option to report detailed
>>> data, then calculate stddev from the detailed data?
>>
>> No, I haven't used -v, because it generates too many spurious
>> wake-ups, which makes the results irrelevant.
>
> Yes, I agree about not using -v. It was just a wild guess on my part
> since I did not know how stddev was calculated. And I was incorrectly
> guessing that stddev was describing the frequency distribution of the
> latencies from a single test run.

I haven't been so precise in my computation, mainly because the
outputs were almost coherent, but we probably need more precise
statistics in a final step.

> As a general comment on cyclictest, I don't find average latency (in
> isolation) sufficient to compare different runs of cyclictest. And
> stddev of the frequency distribution of the latencies (which can be
> calculated from the -h data, with fairly low cyclictest overhead) is
> usually interesting, but it should be viewed with healthy skepticism,
> since that frequency distribution is often not a normal distribution.
> In addition to average latency, I normally look at maximum latency
> and the frequency distribution of latency (in table or graph form).
>
> (One side effect of specifying -h is that the -d option is then
> ignored.)

I'm going to have a look at the -h parameter, which can be useful to
get a better view of the frequency distribution, as you point out.
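(Something along the lines of the sketch below is what I have in mind.
It assumes -h prints one bucket per line, with the latency value in us
followed by one count per thread; that format assumption may need
adjusting to the exact cyclictest version. -d is dropped since -h
ignores it.)

# Sketch: mean and stddev of the wake up latency, derived from the
# histogram that -h prints after the run.
cyclictest -t 5 -q -e 10000000 -i 1500 -l 2000 -h 400 |
awk '/^[0-9]/ {
    for (f = 2; f <= NF; f++) {
        n += $f; s += $f * $1; ss += $f * $1 * $1
    }
} END { m = s / n
        printf "mean %.1f us  stddev %.2f us\n", m, sqrt(ss / n - m * m) }'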
Having the distance set to 0 (-d) can be an issue, because we could
have a synchronization of the wake up of the threads which would
finally hide the real wake up latency. It's interesting to have a
distance which ensures that the threads wake up in an "asynchronous"
manner; that's why I have chosen 150 (which may not be the best
value).

Thanks,

Vincent

> Thanks,
>
> -Frank
* Re: Bench for testing scheduler
  2013-11-07 10:54 Bench for testing scheduler Vincent Guittot
  2013-11-07 11:32 ` Catalin Marinas
  2013-11-07 13:33 ` Vincent Guittot
@ 2013-11-07 17:42 ` Morten Rasmussen
  2013-11-09  0:15   ` Rowand, Frank
  2 siblings, 1 reply; 11+ messages in thread
From: Morten Rasmussen @ 2013-11-07 17:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Alex Shi, Peter Zijlstra, Paul Turner, Ingo Molnar, rjw,
	Srivatsa S. Bhat, Catalin Marinas, Paul Walmsley, Mel Gorman,
	Juri Lelli, fengguang.wu, markgross, Kevin Hilman, Frank.Rowand,
	Paul McKenney, linux-kernel

Hi Vincent,

On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> Hi,
>
> During the Energy-aware scheduling mini-summit, we spoke about benches
> that should be used to evaluate the modifications of the scheduler.
> I’d like to propose a bench that uses cyclictest to measure the wake
> up latency and the power consumption. The goal of this bench is to
> exercise the scheduler with various sleeping period and get the
> average wakeup latency. The range of the sleeping period must cover
> all residency times of the idle state table of the platform. I have
> run such tests on a tc2 platform with the packing tasks patchset.
> I have use the following command:
> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000

I think cyclictest is a useful model of small(er) periodic tasks for
benchmarking energy related patches. However, it doesn't have a
good-enough-performance criterion as it is. I think that is a strict
requirement for all energy related benchmarks.

Measuring latency gives us a performance metric while the energy tells
us how energy efficient we are. But without a latency requirement we
can't really say if a patch helps energy-awareness unless it improves
both energy _and_ performance. That is the case for your packing patches
for this particular benchmark with this specific configuration. That is
a really good result. However, in the general case patches may trade a
bit of performance to get better energy, which is also good if
performance still meets the requirement of the application/user. So we
need a performance criterion to tell us when we sacrifice too much
performance when trying to save power. Without it, it is just a
performance benchmark where we measure power.

Coming up with a performance criterion for cyclictest is not so easy, as
it doesn't really model any specific application. I guess sacrificing a
bit of latency is acceptable if it comes with significant energy
savings. But a huge performance impact might not be, even if it comes
with massive energy savings. So maybe the criterion would consist of
both a latency requirement (e.g. allowing up to a 10% increase) and a
requirement for improved energy per work.

As I see it, that is the only way we can validate the energy efficiency
of patches that trade performance for improved energy.

Morten
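[Morten's two-part criterion is straightforward to express mechanically.
The sketch below is a hypothetical illustration, not something proposed in
the thread: the 10% threshold is the example figure from the mail, and
since -l fixes the cycle count (and thus the work done), total energy is
used as a stand-in for energy per work.]

def meets_criterion(base_latency_us, new_latency_us,
                    base_energy_j, new_energy_j,
                    max_latency_increase=0.10):
    """Two-part acceptance test: latency may regress by at most
    max_latency_increase (10% here, the example figure above), and
    energy must improve. With a fixed cycle count (-l), total energy
    stands in for energy per work."""
    latency_ok = new_latency_us <= base_latency_us * (1.0 + max_latency_increase)
    energy_ok = new_energy_j < base_energy_j
    return latency_ok and energy_ok

# Hypothetical figures: a 5% latency regression with lower energy passes,
# a 20% regression does not, however large the energy saving.
print(meets_criterion(100.0, 105.0, 50.0, 40.0))  # True
print(meets_criterion(100.0, 120.0, 50.0, 40.0))  # False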
* RE: Bench for testing scheduler
  2013-11-07 17:42 ` Morten Rasmussen
@ 2013-11-09  0:15   ` Rowand, Frank
  0 siblings, 0 replies; 11+ messages in thread
From: Rowand, Frank @ 2013-11-09 0:15 UTC (permalink / raw)
  To: Morten Rasmussen, Vincent Guittot
  Cc: Alex Shi, Peter Zijlstra, Paul Turner, Ingo Molnar, rjw,
	Srivatsa S. Bhat, Catalin Marinas, Paul Walmsley, Mel Gorman,
	Juri Lelli, fengguang.wu, markgross, Kevin Hilman, Paul McKenney,
	linux-kernel

On Thursday, November 07, 2013 9:42 AM, Morten Rasmussen [morten.rasmussen@arm.com] wrote:
>
> Hi Vincent,
>
> On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> > Hi,
> >
> > During the Energy-aware scheduling mini-summit, we spoke about benches
> > that should be used to evaluate the modifications of the scheduler.
> > I’d like to propose a bench that uses cyclictest to measure the wake
> > up latency and the power consumption. The goal of this bench is to
> > exercise the scheduler with various sleeping period and get the
> > average wakeup latency. The range of the sleeping period must cover
> > all residency times of the idle state table of the platform. I have
> > run such tests on a tc2 platform with the packing tasks patchset.
> > I have use the following command:
> > #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
>
> I think cyclictest is a useful model of small(er) periodic tasks for
> benchmarking energy related patches. However, it doesn't have a
> good-enough-performance criterion as it is. I think that is a strict
> requirement for all energy related benchmarks.
>
> Measuring latency gives us a performance metric while the energy tells
> us how energy efficient we are. But without a latency requirement we
> can't really say if a patch helps energy-awareness unless it improves
> both energy _and_ performance. That is the case for your packing patches
> for this particular benchmark with this specific configuration. That is
> a really good result. However, in the general case patches may trade a
> bit of performance to get better energy, which is also good if
> performance still meets the requirement of the application/user. So we
> need a performance criterion to tell us when we sacrifice too much
> performance when trying to save power. Without it, it is just a
> performance benchmark where we measure power.
>
> Coming up with a performance criterion for cyclictest is not so easy, as
> it doesn't really model any specific application. I guess sacrificing a
> bit of latency is acceptable if it comes with significant energy
> savings. But a huge performance impact might not be, even if it comes
> with massive energy savings. So maybe the criterion would consist of
> both a latency requirement (e.g. allowing up to a 10% increase) and a
> requirement for improved energy per work.
>
> As I see it, that is the only way we can validate the energy efficiency
> of patches that trade performance for improved energy.

I think those comments capture some of the additional complexity of the
power vs performance tradeoff that needs to be considered.

One thing not well-defined is what "performance" is. The session at the
kernel summit discussed throughput and latency. I'm not sure whether
people are combining two different things under the name of latency.

To me, latency is wake up latency: the elapsed time from when an event
occurs to when the process handling the event is executing instructions
(where I think of the process typically as user space code, but it could
sometimes instead be kernel space code).
The second thing people might think of as latency is how long it takes
from the triggering event until the work is completed on behalf of the
consumer of the event (where the consumer could be a machine, but is
often a human being; e.g. if a packet from google arrives, how long
until I see the search result on my screen). This second thing I call
response time.

Then "wake up latency" is also probably a misnomer. The cyclictest wake
up latency ends when the cyclictest thread is both woken and then
actually executing code on the cpu ("running").

Wake up latency is a fine thing to focus on (especially since power
management can have a large impact on wake up latency), but I hope we
remember to pay attention to response time as one of the important
performance metrics.

Onward to cyclictest...

Cyclictest is commonly referred to as a benchmark (which it is), but at
its core it is more like instrumentation, providing a measure of some
types of wake up latency. Cyclictest is normally used in conjunction
with a separate workload. (Even though cyclictest has enough tuning
knobs that it can also be used as a workload.)

There are some ways that perf and cyclictest can be compared as sources
of performance data:

----- cyclictest

 - Measures wake up latency of only the cyclictest threads.

 - Captures the _entire_ latency, including coming out of low power mode
   to service the (timer) interrupt that results in the task wake up.

----- perf sched

 - Measures all processes (this can be sliced and diced in
   post-processing to include any desired set of processes).

 - Captures latency from when a task is _woken_ to when the task is
   _executing code_ on a cpu.

I think both cyclictest and perf sched are valuable tools that can each
contribute to understanding system behavior.

-Frank
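[Per Frank's earlier point that the latency frequency distribution can be
derived from the -h data with low overhead, post-processing that output
could look something like the sketch below. The parsing rests on an
assumed histogram format (one line per microsecond bucket: the bucket
value, then one count column per thread; overflow counts, reported
separately, are ignored), so treat it as illustrative only.]

import math
import sys

def histogram_stats(lines):
    """Compute mean, stddev, and max latency from cyclictest -h output.
    The stddev here describes the frequency distribution, which is often
    not normal, so view it with the skepticism recommended above."""
    total, s, ss, max_lat = 0, 0.0, 0.0, 0
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip "# Histogram" header and "# ..." summary lines
        lat = int(fields[0])                      # bucket latency in us
        count = sum(int(f) for f in fields[1:])   # counts summed over threads
        if count:
            total += count
            s += lat * count
            ss += lat * lat * count
            max_lat = max(max_lat, lat)
    if total == 0:
        raise SystemExit("no histogram lines found")
    mean = s / total
    var = ss / total - mean * mean
    return mean, math.sqrt(max(var, 0.0)), max_lat

if __name__ == "__main__":
    mean, stddev, max_lat = histogram_stats(sys.stdin)
    print(f"avg {mean:.1f} us, stddev {stddev:.1f} us, max {max_lat} us")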