On Sun, Aug 29, 2021 at 11:59:41PM +0200, Jiri Olsa wrote: > On Fri, Aug 20, 2021 at 12:53:46PM +0200, Riccardo Mancini wrote: > > Changes in v3: > > - improved separation of threadpool and threadpool_entry method > > - replaced shared workqueue with per-thread workqueue. This should > > improve the performance on big machines (Jiri noticed in his > > experiments a significant performance degradation after 15 threads > > with the shared queue). > > - improved error reporting in both threadpool and workqueue > > - added lazy spinup of threads in workqueue [9/15] > > - added global workqueue [10/15] > > - setup global workqueue in perf record, top and synthesize bench > > [12-14/15] and used in in synthetic events > > > hi, > I ran the test again and there's still the slowdown, > adding the stats below > > I'm doing the review and I noticed few strange things, > but so far nothing that would explain that I used trace compass to show the flow and it shows lot of extra scheduling in the new code, please check attached screenshots the current code takes the quickest approach and distribues 'equal' load for each thread while the lazy thread spin in the new code is nice, I think we should have a way to instruct the new code to do the same thing as the old one, because it's faster in this case I think the work_size setup could help with that > > like I can see for 40 threads only 35 threads spawned, > need to check on that more > > also I'll try run some tests for parallel_for > 1 to cut ugh.. should have been s/parallel_for/work_size/ sorry jirka > down some of the workqueue code.. any tests on that? > > jirka > > > --- > new: old: > ell-r440-01 perf]# ./perf bench internals synthesize -t [root@dell-r440-01 perf]# ./perf bench internals synthesize -t > # Running 'internals/synthesize' benchmark: # Running 'internals/synthesize' benchmark: > Computing performance of multi threaded perf event synthesis by Computing performance of multi threaded perf event synthesis by > synthesizing events on CPU 0: synthesizing events on CPU 0: > Number of synthesis threads: 1 Number of synthesis threads: 1 > Average synthesis took: 13970.400 usec (+- 339.216 usec) Average synthesis took: 13563.700 usec (+- 348.354 usec) > Average num. events: 2349.000 (+- 0.000) Average num. events: 2317.000 (+- 0.000) > Average time per event 5.947 usec Average time per event 5.854 usec > Number of synthesis threads: 2 Number of synthesis threads: 2 > Average synthesis took: 15651.800 usec (+- 1612.798 usec) Average synthesis took: 8433.600 usec (+- 83.725 usec) > Average num. events: 2353.000 (+- 0.000) Average num. events: 2321.600 (+- 0.306) > Average time per event 6.652 usec Average time per event 3.633 usec > Number of synthesis threads: 3 Number of synthesis threads: 3 > Average synthesis took: 12114.100 usec (+- 1208.208 usec) Average synthesis took: 6716.200 usec (+- 16.889 usec) > Average num. events: 2355.000 (+- 0.000) Average num. events: 2325.000 (+- 0.000) > Average time per event 5.144 usec Average time per event 2.889 usec > Number of synthesis threads: 4 Number of synthesis threads: 4 > Average synthesis took: 9812.500 usec (+- 951.284 usec) Average synthesis took: 5981.400 usec (+- 11.102 usec) > Average num. events: 2357.000 (+- 0.000) Average num. events: 2323.000 (+- 0.000) > Average time per event 4.163 usec Average time per event 2.575 usec > Number of synthesis threads: 5 Number of synthesis threads: 5 > Average synthesis took: 7338.300 usec (+- 661.620 usec) Average synthesis took: 5538.800 usec (+- 12.990 usec) > Average num. events: 2359.000 (+- 0.000) Average num. events: 2329.000 (+- 0.000) > Average time per event 3.111 usec Average time per event 2.378 usec > Number of synthesis threads: 6 Number of synthesis threads: 6 > Average synthesis took: 7256.800 usec (+- 680.312 usec) Average synthesis took: 5255.700 usec (+- 7.454 usec) > Average num. events: 2361.000 (+- 0.000) Average num. events: 2331.000 (+- 0.000) > Average time per event 3.074 usec Average time per event 2.255 usec > Number of synthesis threads: 7 Number of synthesis threads: 7 > Average synthesis took: 6119.600 usec (+- 479.409 usec) Average synthesis took: 4836.200 usec (+- 8.132 usec) > Average num. events: 2363.000 (+- 0.000) Average num. events: 2323.000 (+- 0.000) > Average time per event 2.590 usec Average time per event 2.082 usec > Number of synthesis threads: 8 Number of synthesis threads: 8 > Average synthesis took: 5899.600 usec (+- 506.285 usec) Average synthesis took: 4643.000 usec (+- 4.913 usec) > Average num. events: 2365.000 (+- 0.000) Average num. events: 2335.000 (+- 0.000) > Average time per event 2.495 usec Average time per event 1.988 usec > Number of synthesis threads: 9 Number of synthesis threads: 9 > Average synthesis took: 5459.100 usec (+- 431.725 usec) Average synthesis took: 4526.600 usec (+- 5.207 usec) > Average num. events: 2367.000 (+- 0.000) Average num. events: 2337.000 (+- 0.000) > Average time per event 2.306 usec Average time per event 1.937 usec > Number of synthesis threads: 10 Number of synthesis threads: 10 > Average synthesis took: 4977.100 usec (+- 251.378 usec) Average synthesis took: 4128.700 usec (+- 5.911 usec) > Average num. events: 2369.000 (+- 0.000) Average num. events: 2327.800 (+- 0.533) > Average time per event 2.101 usec Average time per event 1.774 usec > Number of synthesis threads: 11 Number of synthesis threads: 11 > Average synthesis took: 5428.700 usec (+- 513.409 usec) Average synthesis took: 3890.800 usec (+- 15.051 usec) > Average num. events: 2371.000 (+- 0.000) Average num. events: 2323.000 (+- 0.000) > Average time per event 2.290 usec Average time per event 1.675 usec > Number of synthesis threads: 12 Number of synthesis threads: 12 > Average synthesis took: 5517.800 usec (+- 508.171 usec) Average synthesis took: 3367.800 usec (+- 14.261 usec) > Average num. events: 2373.000 (+- 0.000) Average num. events: 2343.000 (+- 0.000) > Average time per event 2.325 usec Average time per event 1.437 usec > Number of synthesis threads: 13 Number of synthesis threads: 13 > Average synthesis took: 5279.500 usec (+- 432.819 usec) Average synthesis took: 3974.300 usec (+- 12.437 usec) > Average num. events: 2375.000 (+- 0.000) Average num. events: 2328.200 (+- 1.405) > Average time per event 2.223 usec Average time per event 1.707 usec > Number of synthesis threads: 14 Number of synthesis threads: 14 > Average synthesis took: 4993.100 usec (+- 392.485 usec) Average synthesis took: 4157.100 usec (+- 163.268 usec) > Average num. events: 2377.000 (+- 0.000) Average num. events: 2319.800 (+- 0.533) > Average time per event 2.101 usec Average time per event 1.792 usec > Number of synthesis threads: 15 Number of synthesis threads: 15 > Average synthesis took: 5584.700 usec (+- 379.862 usec) Average synthesis took: 4065.700 usec (+- 25.656 usec) > Average num. events: 2379.000 (+- 0.000) Average num. events: 2322.800 (+- 0.467) > Average time per event 2.347 usec Average time per event 1.750 usec > Number of synthesis threads: 16 Number of synthesis threads: 16 > Average synthesis took: 5009.800 usec (+- 381.018 usec) Average synthesis took: 4580.600 usec (+- 129.218 usec) > Average num. events: 2381.000 (+- 0.000) Average num. events: 2324.800 (+- 0.200) > Average time per event 2.104 usec Average time per event 1.970 usec > Number of synthesis threads: 17 Number of synthesis threads: 17 > Average synthesis took: 5543.300 usec (+- 376.064 usec) Average synthesis took: 4089.700 usec (+- 54.096 usec) > Average num. events: 2383.000 (+- 0.000) Average num. events: 2320.200 (+- 0.611) > Average time per event 2.326 usec Average time per event 1.763 usec > Number of synthesis threads: 18 Number of synthesis threads: 18 > Average synthesis took: 5191.800 usec (+- 342.317 usec) Average synthesis took: 4219.000 usec (+- 61.395 usec) > Average num. events: 2385.000 (+- 0.000) Average num. events: 2323.000 (+- 0.516) > Average time per event 2.177 usec Average time per event 1.816 usec > Number of synthesis threads: 19 Number of synthesis threads: 19 > Average synthesis took: 4647.000 usec (+- 273.303 usec) Average synthesis took: 3998.800 usec (+- 49.221 usec) > Average num. events: 2387.000 (+- 0.000) Average num. events: 2325.200 (+- 0.200) > Average time per event 1.947 usec Average time per event 1.720 usec > Number of synthesis threads: 20 Number of synthesis threads: 20 > Average synthesis took: 4710.600 usec (+- 179.874 usec) Average synthesis took: 3930.300 usec (+- 67.725 usec) > Average num. events: 2389.000 (+- 0.000) Average num. events: 2319.000 (+- 0.000) > Average time per event 1.972 usec Average time per event 1.695 usec > Number of synthesis threads: 21 Number of synthesis threads: 21 > Average synthesis took: 4959.100 usec (+- 318.519 usec) Average synthesis took: 3696.400 usec (+- 30.953 usec) > Average num. events: 2390.800 (+- 0.200) Average num. events: 2319.800 (+- 0.533) > Average time per event 2.074 usec Average time per event 1.593 usec > Number of synthesis threads: 22 Number of synthesis threads: 22 > Average synthesis took: 4422.300 usec (+- 236.998 usec) Average synthesis took: 3394.000 usec (+- 63.254 usec) > Average num. events: 2392.800 (+- 0.200) Average num. events: 2319.000 (+- 0.000) > Average time per event 1.848 usec Average time per event 1.464 usec > Number of synthesis threads: 23 Number of synthesis threads: 23 > Average synthesis took: 4640.800 usec (+- 245.604 usec) Average synthesis took: 4091.100 usec (+- 134.320 usec) > Average num. events: 2394.400 (+- 0.600) Average num. events: 2323.400 (+- 0.267) > Average time per event 1.938 usec Average time per event 1.761 usec > Number of synthesis threads: 24 Number of synthesis threads: 24 > Average synthesis took: 4554.900 usec (+- 201.121 usec) Average synthesis took: 3346.600 usec (+- 78.846 usec) > Average num. events: 2395.800 (+- 0.854) Average num. events: 2321.000 (+- 0.667) > Average time per event 1.901 usec Average time per event 1.442 usec > Number of synthesis threads: 25 Number of synthesis threads: 25 > Average synthesis took: 4668.300 usec (+- 248.254 usec) Average synthesis took: 3794.300 usec (+- 191.158 usec) > Average num. events: 2398.000 (+- 0.803) Average num. events: 2317.900 (+- 6.248) > Average time per event 1.947 usec Average time per event 1.637 usec > Number of synthesis threads: 26 Number of synthesis threads: 26 > Average synthesis took: 4683.300 usec (+- 226.836 usec) Average synthesis took: 3285.700 usec (+- 18.785 usec) > Average num. events: 2399.000 (+- 1.265) Average num. events: 2317.100 (+- 6.198) > Average time per event 1.952 usec Average time per event 1.418 usec > Number of synthesis threads: 27 Number of synthesis threads: 27 > Average synthesis took: 4590.300 usec (+- 158.000 usec) Average synthesis took: 3604.600 usec (+- 35.487 usec) > Average num. events: 2400.200 (+- 1.497) Average num. events: 2319.800 (+- 0.533) > Average time per event 1.912 usec Average time per event 1.554 usec > Number of synthesis threads: 28 Number of synthesis threads: 28 > Average synthesis took: 4683.500 usec (+- 233.543 usec) Average synthesis took: 3594.700 usec (+- 21.267 usec) > Average num. events: 2402.400 (+- 1.688) Average num. events: 2319.200 (+- 0.200) > Average time per event 1.950 usec Average time per event 1.550 usec > Number of synthesis threads: 29 Number of synthesis threads: 29 > Average synthesis took: 4830.700 usec (+- 235.730 usec) Average synthesis took: 3531.700 usec (+- 15.935 usec) > Average num. events: 2405.000 (+- 2.530) Average num. events: 2322.200 (+- 0.800) > Average time per event 2.009 usec Average time per event 1.521 usec > Number of synthesis threads: 30 Number of synthesis threads: 30 > Average synthesis took: 4684.500 usec (+- 210.137 usec) Average synthesis took: 3505.700 usec (+- 58.332 usec) > Average num. events: 2407.600 (+- 2.495) Average num. events: 2315.100 (+- 5.900) > Average time per event 1.946 usec Average time per event 1.514 usec > Number of synthesis threads: 31 Number of synthesis threads: 31 > Average synthesis took: 4823.300 usec (+- 213.480 usec) Average synthesis took: 3431.100 usec (+- 42.022 usec) > Average num. events: 2407.400 (+- 2.647) Average num. events: 2319.000 (+- 0.000) > Average time per event 2.004 usec Average time per event 1.480 usec > Number of synthesis threads: 32 Number of synthesis threads: 32 > Average synthesis took: 4400.800 usec (+- 224.134 usec) Average synthesis took: 3684.900 usec (+- 253.077 usec) > Average num. events: 2407.400 (+- 2.544) Average num. events: 2319.200 (+- 0.200) > Average time per event 1.828 usec Average time per event 1.589 usec > Number of synthesis threads: 33 Number of synthesis threads: 33 > Average synthesis took: 4452.600 usec (+- 231.034 usec) Average synthesis took: 3233.000 usec (+- 24.035 usec) > Average num. events: 2409.300 (+- 3.190) Average num. events: 2316.500 (+- 6.069) > Average time per event 1.848 usec Average time per event 1.396 usec > Number of synthesis threads: 34 Number of synthesis threads: 34 > Average synthesis took: 4770.900 usec (+- 182.325 usec) Average synthesis took: 3016.300 usec (+- 13.343 usec) > Average num. events: 2411.200 (+- 3.032) Average num. events: 2322.800 (+- 0.200) > Average time per event 1.979 usec Average time per event 1.299 usec > Number of synthesis threads: 35 Number of synthesis threads: 35 > Average synthesis took: 4442.800 usec (+- 248.017 usec) Average synthesis took: 3246.700 usec (+- 71.765 usec) > Average num. events: 2412.000 (+- 3.296) Average num. events: 2321.800 (+- 0.611) > Average time per event 1.842 usec Average time per event 1.398 usec > Number of synthesis threads: 36 Number of synthesis threads: 36 > Average synthesis took: 5005.200 usec (+- 235.823 usec) Average synthesis took: 3329.000 usec (+- 122.028 usec) > Average num. events: 2410.400 (+- 2.750) Average num. events: 2310.800 (+- 8.133) > Average time per event 2.077 usec Average time per event 1.441 usec > Number of synthesis threads: 37 Number of synthesis threads: 37 > Average synthesis took: 4654.000 usec (+- 208.838 usec) Average synthesis took: 3011.600 usec (+- 46.026 usec) > Average num. events: 2409.400 (+- 2.473) Average num. events: 2322.200 (+- 0.533) > Average time per event 1.932 usec Average time per event 1.297 usec > Number of synthesis threads: 38 Number of synthesis threads: 38 > Average synthesis took: 4763.700 usec (+- 197.409 usec) Average synthesis took: 3163.500 usec (+- 36.589 usec) > Average num. events: 2406.200 (+- 2.462) Average num. events: 2319.000 (+- 0.000) > Average time per event 1.980 usec Average time per event 1.364 usec > Number of synthesis threads: 39 Number of synthesis threads: 39 > Average synthesis took: 4333.100 usec (+- 194.456 usec) Average synthesis took: 3170.900 usec (+- 30.538 usec) > Average num. events: 2408.600 (+- 3.124) Average num. events: 2319.000 (+- 0.000) > Average time per event 1.799 usec Average time per event 1.367 usec > Number of synthesis threads: 40 Number of synthesis threads: 40 > Average synthesis took: 4520.200 usec (+- 188.901 usec) Average synthesis took: 3111.900 usec (+- 24.287 usec) > Average num. events: 2409.600 (+- 3.184) Average num. events: 2307.600 (+- 7.600) > Average time per event 1.876 usec Average time per event 1.349 usec