* [PATCH 0/6] numa-balancing patches
@ 2018-08-03 6:13 Srikar Dronamraju
From: Srikar Dronamraju @ 2018-08-03 6:13 UTC
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
This patchset, based on current tip/sched/core, provides the patches left
out from the previous series. This version addresses the comments given on
some of the patches. It drops "sched/numa: Restrict migrating in parallel
to the same node" and adds an additional patch from Mel Gorman.
It also provides specjbb2005, dbench, and perf bench numa numbers on a
per-patch basis on 4 node and 2 node systems.
v2: http://lkml.kernel.org/r/1529514181-9842-1-git-send-email-srikar@linux.vnet.ibm.com
v1: http://lkml.kernel.org/r/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 199709 209029 4.66679
1 330830 326585 -1.28314
on 2 Socket/4 Node Power8 (PowerNV)
JVMS Prev Current %Change
1 218946 221299 1.07469
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 180473 195444 8.29542
1 212805 222390 4.50412
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 56941.8 60152.4 5.63839
1 111686 111458 -0.204144
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 12029.8 12124.6 12060.9 34.0076
5 12904.6 12969 12942.6 23.9053 7.3104
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4968.51 5006.62 4981.31 13.4151
5 4984.25 5025.95 5004.5 14.2253 0.46554
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9342.92 9381.44 9363.92 12.8587
5 9277.64 9357.22 9322.07 26.3558 -0.446928
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 143.4 188.892 170.225 16.9929
5 160.632 175.558 168.655 5.26823 -0.922309
perf bench numa / time / lower numbers are better
on 2 Socket/2 Node Intel
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 403.47 420.68 411.27 5.80
numa01.sh Sys: 5.20 8.75 5.96 1.39
numa01.sh User: 14583.44 14832.05 14699.23 83.15
numa02.sh Real: 70.61 73.80 72.53 1.10
numa02.sh Sys: 4.61 5.08 4.80 0.18
numa02.sh User: 2634.28 2690.65 2669.69 20.80
numa03.sh Real: 328.53 374.61 354.53 17.22
numa03.sh Sys: 9.15 11.78 10.34 0.87
numa03.sh User: 14828.93 16646.99 15693.92 758.08
numa04.sh Real: 404.31 424.15 413.53 6.45
numa04.sh Sys: 5.70 7.98 6.33 0.85
numa04.sh User: 14608.86 15002.66 14812.80 156.89
numa05.sh Real: 432.00 449.59 444.57 6.52
numa05.sh Sys: 14.80 16.94 15.67 0.74
numa05.sh User: 15679.60 16048.79 15911.45 133.39
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 392.85 415.77 403.96 8.33 1.80959%
numa01.sh Sys: 6.19 9.81 7.89 1.19 -24.4613%
numa01.sh User: 14219.55 14733.04 14511.33 204.30 1.29485%
numa02.sh Real: 58.77 63.41 60.01 1.74 20.8632%
numa02.sh Sys: 5.28 5.62 5.42 0.11 -11.4391%
numa02.sh User: 2302.26 2454.57 2345.44 55.26 13.8247%
numa03.sh Real: 345.47 401.75 366.51 20.54 -3.26867%
numa03.sh Sys: 8.87 11.94 10.48 1.29 -1.33588%
numa03.sh User: 14709.09 17409.22 15824.09 1010.20 -0.822607%
numa04.sh Real: 392.78 404.64 398.72 4.50 3.71439%
numa04.sh Sys: 6.61 8.30 7.30 0.55 -13.2877%
numa04.sh User: 14324.48 14638.68 14464.01 117.68 2.41143%
numa05.sh Real: 383.94 414.25 396.28 10.61 12.1858%
numa05.sh Sys: 20.20 25.96 24.15 2.11 -35.1139%
numa05.sh User: 14707.57 15251.14 14993.60 185.47 6.12161%
Info on each of the perf bench scripts is available at
http://lkml.kernel.org/r/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com
vmstat data for perf bench numa01
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 125087 130979 4.71032%
numa_hint_faults_local 122058 118544 -2.87896%
numa_hit 472395 562428 19.0588%
numa_huge_pte_updates 121037 126394 4.42592%
numa_interleave 0 0 NA
numa_local 472041 562071 19.0725%
numa_miss 0 0 NA
numa_other 354 357 0.847458%
numa_pages_migrated 977502 1845407 88.7881%
numa_pte_updates 61980575 64723635 4.42568%
pgfault 709665 823092 15.9832%
pgmajfault 443 507 14.447%
pgmigrate_fail 46592 109568 135.165%
pgmigrate_success 977502 1845407 88.7881%
vmstat data for perf bench numa02
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 38495 22190 -42.3562%
numa_hint_faults_local 35757 17037 -52.3534%
numa_hit 303708 284876 -6.20069%
numa_huge_pte_updates 33842 19259 -43.0914%
numa_interleave 0 0 NA
numa_local 303450 284624 -6.20399%
numa_miss 0 0 NA
numa_other 258 252 -2.32558%
numa_pages_migrated 993537 1570214 58.0428%
numa_pte_updates 17452735 9984967 -42.7885%
pgfault 368888 330200 -10.4877%
pgmajfault 420 308 -26.6667%
pgmigrate_fail 0 0 NA
pgmigrate_success 993537 1570214 58.0428%
vmstat data for perf bench numa03
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 62510 63635 1.79971%
numa_hint_faults_local 38696 39109 1.06729%
numa_hit 382967 395061 3.15797%
numa_huge_pte_updates 58145 59089 1.62353%
numa_interleave 0 0 NA
numa_local 382720 394809 3.15871%
numa_miss 0 0 NA
numa_other 247 252 2.02429%
numa_pages_migrated 2035239 2196610 7.92885%
numa_pte_updates 29777304 30261666 1.62661%
pgfault 544043 560325 2.99278%
pgmajfault 355 451 27.0423%
pgmigrate_fail 224256 344576 53.653%
pgmigrate_success 2035239 2196610 7.92885%
vmstat data for perf bench numa04
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 124431 132260 6.29184%
numa_hint_faults_local 121355 119283 -1.70739%
numa_hit 409737 399975 -2.3825%
numa_huge_pte_updates 120253 127886 6.34745%
numa_interleave 0 0 NA
numa_local 409468 399724 -2.37967%
numa_miss 0 0 NA
numa_other 269 251 -6.69145%
numa_pages_migrated 1116395 1659057 48.6084%
numa_pte_updates 61579860 65487765 6.34608%
pgfault 631873 633795 0.304175%
pgmajfault 337 329 -2.37389%
pgmigrate_fail 47616 156160 227.957%
pgmigrate_success 1116395 1659057 48.6084%
vmstat data for perf bench numa05
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 355042 337290 -4.99997%
numa_hint_faults_local 342282 305309 -10.8019%
numa_hit 469069 461415 -1.63174%
numa_huge_pte_updates 348439 330052 -5.27696%
numa_interleave 0 0 NA
numa_local 468821 461168 -1.63239%
numa_miss 0 0 NA
numa_other 248 247 -0.403226%
numa_pages_migrated 3247276 6844700 110.783%
numa_pte_updates 178418683 169004579 -5.27641%
pgfault 899368 878657 -2.30284%
pgmajfault 345 334 -3.18841%
pgmigrate_fail 781312 1424384 82.3067%
pgmigrate_success 3247276 6844700 110.783%
on 2 Socket/4 Node Power8 (PowerNV)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 358.03 476.82 419.73 46.59
numa01.sh Sys: 14.53 20.23 16.47 1.96
numa01.sh User: 43304.06 53938.77 48978.04 4280.53
numa02.sh Real: 52.55 59.28 56.55 2.58
numa02.sh Sys: 7.33 11.74 9.37 1.56
numa02.sh User: 5112.38 5765.50 5535.94 237.97
numa03.sh Real: 486.71 497.22 490.09 3.68
numa03.sh Sys: 12.12 15.21 14.18 1.07
numa03.sh User: 56814.30 59414.01 58412.02 893.79
numa04.sh Real: 322.51 350.93 335.53 9.06
numa04.sh Sys: 14.03 16.90 15.79 1.10
numa04.sh User: 33446.88 36163.03 34128.47 1023.44
numa05.sh Real: 324.11 333.71 330.69 3.37
numa05.sh Sys: 21.25 28.33 23.59 2.55
numa05.sh User: 33017.37 34332.36 33536.43 437.24
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 402.80 475.17 438.14 23.38 -4.20185%
numa01.sh Sys: 15.81 17.30 16.46 0.53 0.0607533%
numa01.sh User: 46324.59 52566.72 49514.24 2327.98 -1.08292%
numa02.sh Real: 49.32 59.99 54.64 3.42 3.49561%
numa02.sh Sys: 5.84 10.32 8.44 1.55 11.019%
numa02.sh User: 4962.98 5674.79 5456.00 255.40 1.46518%
numa03.sh Real: 481.18 492.84 487.49 4.05 -93.4563%
numa03.sh Sys: 12.11 13.93 13.07 0.73 -84.6213%
numa03.sh User: 56056.97 58557.44 57546.61 870.03 -97.3252%
numa04.sh Real: 314.72 399.01 344.72 28.97 42.1705%
numa04.sh Sys: 14.72 20.70 17.05 2.15 -16.8328%
numa04.sh User: 34075.02 42869.67 36528.81 3261.87 59.9067%
numa05.sh Real: 327.70 363.14 343.96 13.71 -2.45087%
numa05.sh Sys: 23.34 29.42 27.00 2.01 -41.5185%
numa05.sh User: 31716.77 36602.35 33670.61 1653.60 1.35982%
vmstat data for perf bench numa01
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 63348 64799 2.29052%
numa_hint_faults_local 27783 28052 0.968218%
numa_hit 288564 274955 -4.71611%
numa_huge_pte_updates 24248 25297 4.32613%
numa_interleave 0 0 NA
numa_local 288524 274914 -4.71711%
numa_miss 0 0 NA
numa_other 40 41 2.5%
numa_pages_migrated 668765 757419 13.2564%
numa_pte_updates 6247173 6516368 4.30907%
pgfault 1026373 982450 -4.27944%
pgmajfault 552 455 -17.5725%
pgmigrate_fail 110871 105728 -4.63872%
pgmigrate_success 668765 757419 13.2564%
vmstat data for perf bench numa02
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 247818 340248 37.2975%
numa_hint_faults_local 165959 197634 19.086%
numa_hit 327750 338501 3.28024%
numa_huge_pte_updates 1302 1786 37.1736%
numa_interleave 0 0 NA
numa_local 327719 338477 3.28269%
numa_miss 0 0 NA
numa_other 31 24 -22.5806%
numa_pages_migrated 184908 212449 14.8944%
numa_pte_updates 601753 817802 35.9033%
pgfault 714529 802873 12.3639%
pgmajfault 384 391 1.82292%
pgmigrate_fail 512 512 0%
pgmigrate_success 184908 212449 14.8944%
vmstat data for perf bench numa03
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 42672 34540 -19.057%
numa_hint_faults_local 15017 11041 -26.4767%
numa_hit 276938 269723 -2.60528%
numa_huge_pte_updates 13998 13968 -0.214316%
numa_interleave 0 0 NA
numa_local 276919 269715 -2.60148%
numa_miss 0 0 NA
numa_other 19 8 -57.8947%
numa_pages_migrated 333225 349934 5.01433%
numa_pte_updates 3610166 3596818 -0.369734%
pgfault 992860 963524 -2.9547%
pgmajfault 522 957 83.3333%
pgmigrate_fail 108288 127744 17.9669%
pgmigrate_success 333225 349934 5.01433%
vmstat data for perf bench numa04
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 80020 87821 9.74881%
numa_hint_faults_local 59008 56421 -4.38415%
numa_hit 233083 235072 0.853344%
numa_huge_pte_updates 35238 35577 0.96203%
numa_interleave 0 0 NA
numa_local 233064 235067 0.859421%
numa_miss 0 0 NA
numa_other 19 5 -73.6842%
numa_pages_migrated 944562 1028140 8.84833%
numa_pte_updates 9065545 9159954 1.0414%
pgfault 847441 851781 0.51213%
pgmajfault 970 421 -56.5979%
pgmigrate_fail 63233 53760 -14.9811%
pgmigrate_success 944562 1028140 8.84833%
vmstat data for perf bench numa05
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 174387 201119 15.3291%
numa_hint_faults_local 133581 146596 9.74315%
numa_hit 249145 257903 3.51522%
numa_huge_pte_updates 73582 85868 16.697%
numa_interleave 0 0 NA
numa_local 249137 257890 3.51333%
numa_miss 0 0 NA
numa_other 8 13 62.5%
numa_pages_migrated 1781374 2077576 16.6277%
numa_pte_updates 18938248 22100144 16.6958%
pgfault 941574 995042 5.67858%
pgmajfault 434 415 -4.37788%
pgmigrate_fail 49180 69889 42.1086%
pgmigrate_success 1781374 2077576 16.6277%
on 2 Socket/2 Node Power9 (PowerNV)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 462.22 591.23 504.51 44.82
numa01.sh Sys: 37.07 54.86 42.05 6.62
numa01.sh User: 72535.19 86297.67 75983.26 5208.86
numa02.sh Real: 82.50 87.37 84.18 1.82
numa02.sh Sys: 20.18 30.04 27.37 3.66
numa02.sh User: 12171.09 12358.11 12242.31 62.27
numa03.sh Real: 595.65 695.32 640.37 31.93
numa03.sh Sys: 31.45 42.00 35.40 3.78
numa03.sh User: 93877.45 109013.40 100676.82 4856.89
numa04.sh Real: 514.19 594.43 548.24 33.76
numa04.sh Sys: 41.25 54.25 46.86 4.89
numa04.sh User: 76298.64 86625.93 80615.33 4748.38
numa05.sh Real: 466.67 513.17 494.73 18.29
numa05.sh Sys: 61.19 70.28 66.83 3.35
numa05.sh User: 72845.76 76191.80 74651.22 1416.76
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 461.27 719.44 552.31 88.06 -8.65456%
numa01.sh Sys: 39.71 67.60 47.18 10.35 -10.8733%
numa01.sh User: 72257.05 112563.52 83735.04 14612.15 -9.25751%
numa02.sh Real: 82.65 84.25 83.41 0.53 0.923151%
numa02.sh Sys: 18.32 28.89 22.97 4.34 19.1554%
numa02.sh User: 12045.55 12215.64 12162.20 62.80 0.65868%
numa03.sh Real: 587.05 660.43 617.39 25.31 3.72212%
numa03.sh Sys: 28.05 36.74 31.86 3.45 11.1111%
numa03.sh User: 92686.08 103166.58 97013.32 3655.37 3.77629%
numa04.sh Real: 464.56 652.41 515.41 70.89 6.36969%
numa04.sh Sys: 38.40 49.26 42.43 4.00 10.4407%
numa04.sh User: 72275.32 88875.96 77174.44 6149.21 4.45859%
numa05.sh Real: 483.10 664.43 562.87 75.72 -12.1058%
numa05.sh Sys: 56.23 73.67 65.27 5.73 2.39007%
numa05.sh User: 73350.15 89813.72 80238.30 6532.10 -6.96311%
vmstat data for perf bench numa01
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 144716 145619 0.623981%
numa_hint_faults_local 99914 91850 -8.07094%
numa_hit 411314 369456 -10.1767%
numa_huge_pte_updates 136260 136154 -0.0777925%
numa_interleave 0 0 NA
numa_local 411279 369421 -10.1775%
numa_miss 0 0 NA
numa_other 35 35 0%
numa_pages_migrated 464612 481645 3.66607%
numa_pte_updates 4368544 4365935 -0.0597224%
pgfault 1296071 1362892 5.15566%
pgmajfault 1412 1270 -10.0567%
pgmigrate_fail 42656 49952 17.1043%
pgmigrate_success 464612 481645 3.66607%
vmstat data for perf bench numa02
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 30562 28965 -5.22544%
numa_hint_faults_local 23479 21704 -7.55995%
numa_hit 176214 159995 -9.20415%
numa_huge_pte_updates 28447 27168 -4.49608%
numa_interleave 0 0 NA
numa_local 176209 159987 -9.20611%
numa_miss 0 0 NA
numa_other 5 8 60%
numa_pages_migrated 201448 204467 1.49865%
numa_pte_updates 936226 894109 -4.49859%
pgfault 493189 481612 -2.34738%
pgmajfault 993 521 -47.5327%
pgmigrate_fail 0 0 NA
pgmigrate_success 201448 204467 1.49865%
vmstat data for perf bench numa03
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 78163 72811 -6.84723%
numa_hint_faults_local 39811 37318 -6.26209%
numa_hit 313487 308119 -1.71235%
numa_huge_pte_updates 69817 65243 -6.55141%
numa_interleave 0 0 NA
numa_local 313460 308104 -1.70867%
numa_miss 0 0 NA
numa_other 27 15 -44.4444%
numa_pages_migrated 184605 172934 -6.32215%
numa_pte_updates 2242167 2094992 -6.56396%
pgfault 1186922 1166080 -1.75597%
pgmajfault 1077 668 -37.9759%
pgmigrate_fail 24544 24416 -0.521512%
pgmigrate_success 184605 172934 -6.32215%
vmstat data for perf bench numa04
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 152710 147880 -3.16286%
numa_hint_faults_local 97716 100272 2.61574%
numa_hit 324966 321659 -1.01764%
numa_huge_pte_updates 144348 139764 -3.17566%
numa_interleave 0 0 NA
numa_local 324939 321640 -1.01527%
numa_miss 0 0 NA
numa_other 27 19 -29.6296%
numa_pages_migrated 512467 485174 -5.32581%
numa_pte_updates 4626888 4479727 -3.18056%
pgfault 1250077 1234721 -1.2284%
pgmajfault 691 575 -16.7873%
pgmigrate_fail 54848 58560 6.76779%
pgmigrate_success 512467 485174 -5.32581%
vmstat data for perf bench numa05
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 295774 305339 3.23389%
numa_hint_faults_local 218320 218793 0.216654%
numa_hit 352096 357790 1.61717%
numa_huge_pte_updates 286148 297923 4.115%
numa_interleave 0 0 NA
numa_local 352075 357780 1.62039%
numa_miss 0 0 NA
numa_other 21 10 -52.381%
numa_pages_migrated 906755 909564 0.309786%
numa_pte_updates 9165346 9541883 4.10827%
pgfault 1407223 1433802 1.88876%
pgmajfault 599 683 14.0234%
pgmigrate_fail 70272 69568 -1.00182%
pgmigrate_success 906755 909564 0.309786%
on 4 Socket/4 Node Power7
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 677.66 913.24 794.88 85.49
numa01.sh Sys: 125.90 205.16 169.35 25.59
numa01.sh User: 56772.52 71741.79 63335.60 5073.86
numa02.sh Real: 65.34 70.28 67.96 1.98
numa02.sh Sys: 12.04 19.41 15.89 2.34
numa02.sh User: 5499.93 5682.07 5586.30 77.00
numa03.sh Real: 774.48 1035.38 893.82 87.76
numa03.sh Sys: 107.67 153.14 129.77 15.10
numa03.sh User: 62802.39 87222.58 73511.43 8260.39
numa04.sh Real: 504.09 733.50 633.03 78.96
numa04.sh Sys: 213.34 351.26 284.11 56.12
numa04.sh User: 38925.57 55954.50 47690.41 5716.68
numa05.sh Real: 402.78 501.75 453.02 37.15
numa05.sh Sys: 146.84 407.64 299.57 97.43
numa05.sh User: 33365.00 39445.00 36050.94 2053.69
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 636.41 913.49 802.07 94.08 -0.89643%
numa01.sh Sys: 169.10 209.84 181.46 15.01 -6.67365%
numa01.sh User: 51910.75 65727.60 60906.34 5019.96 3.98852%
numa02.sh Real: 63.64 70.40 66.18 2.42 2.68963%
numa02.sh Sys: 9.85 21.19 15.05 3.72 5.5814%
numa02.sh User: 5305.35 5702.47 5477.28 132.74 1.9904%
numa03.sh Real: 753.00 932.44 828.11 66.63 7.93494%
numa03.sh Sys: 81.82 132.68 104.12 17.89 24.635%
numa03.sh User: 61249.69 72311.19 65282.71 3998.53 12.6047%
numa04.sh Real: 504.61 655.03 605.21 52.01 4.59675%
numa04.sh Sys: 130.42 330.44 260.76 73.87 8.95459%
numa04.sh User: 37562.67 48382.57 45063.68 3892.89 5.82893%
numa05.sh Real: 462.05 525.61 488.76 21.16 -7.31238%
numa05.sh Sys: 296.76 389.72 345.10 40.73 -13.1933%
numa05.sh User: 35920.56 39112.35 38022.97 1151.19 -5.18642%
vmstat data for perf bench numa01
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 8094646 6939950 -14.2649%
numa_hint_faults_local 4327343 3249221 -24.9142%
numa_hit 1550444 1490388 -3.87347%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 1550404 1490347 -3.87364%
numa_miss 0 0 NA
numa_other 40 41 2.5%
numa_pages_migrated 777894 731760 -5.93063%
numa_pte_updates 8103835 6945270 -14.2965%
pgfault 9504158 8321001 -12.4488%
pgmajfault 277 267 -3.61011%
pgmigrate_fail 7 12 71.4286%
pgmigrate_success 777894 731760 -5.93063%
vmstat data for perf bench numa02
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 698918 733902 5.00545%
numa_hint_faults_local 553784 562257 1.53002%
numa_hit 473790 466220 -1.59775%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 473785 466216 -1.59756%
numa_miss 0 0 NA
numa_other 5 4 -20%
numa_pages_migrated 136492 134423 -1.51584%
numa_pte_updates 714710 749458 4.86183%
pgfault 1186082 1218861 2.76364%
pgmajfault 155 156 0.645161%
pgmigrate_fail 0 0 NA
pgmigrate_success 136492 134423 -1.51584%
vmstat data for perf bench numa03
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 4525956 3520293 -22.2199%
numa_hint_faults_local 1749531 1319966 -24.5532%
numa_hit 914257 778437 -14.8558%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 914232 778416 -14.8557%
numa_miss 0 0 NA
numa_other 25 21 -16%
numa_pages_migrated 428367 315482 -26.3524%
numa_pte_updates 4536083 3524701 -22.2964%
pgfault 5578522 4509129 -19.1698%
pgmajfault 202 197 -2.47525%
pgmigrate_fail 22 19 -13.6364%
pgmigrate_success 428367 315482 -26.3524%
vmstat data for perf bench numa04
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 13634799 13598386 -0.267059%
numa_hint_faults_local 8473822 8800575 3.85603%
numa_hit 2604435 2456795 -5.66879%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 2604411 2456779 -5.66854%
numa_miss 0 0 NA
numa_other 24 16 -33.3333%
numa_pages_migrated 1414750 1280512 -9.48846%
numa_pte_updates 13678739 13627859 -0.371964%
pgfault 15363067 15317090 -0.29927%
pgmajfault 197 237 20.3046%
pgmigrate_fail 30 27 -10%
pgmigrate_success 1414750 1280512 -9.48846%
vmstat data for perf bench numa05
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 23324034 25274343 8.3618%
numa_hint_faults_local 18759362 18625813 -0.711906%
numa_hit 3944010 4235082 7.3801%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 3943994 4235068 7.38018%
numa_miss 0 0 NA
numa_other 16 14 -12.5%
numa_pages_migrated 1785980 2072221 16.0271%
numa_pte_updates 23411591 25325473 8.17493%
pgfault 26024350 28065879 7.84469%
pgmajfault 233 239 2.57511%
pgmigrate_fail 52 82 57.6923%
pgmigrate_success 1785980 2072221 16.0271%
Mel Gorman (1):
sched/numa: Limit the conditions where scan period is reset
Srikar Dronamraju (5):
sched/numa: Stop multiple tasks from moving to the cpu at the same
time
mm/migrate: Use trylock while resetting rate limit
sched/numa: Avoid task migration for small numa improvement
sched/numa: Pass destination cpu as a parameter to migrate_task_rq
sched/numa: Reset scan rate whenever task moves across nodes
kernel/sched/core.c | 2 +-
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 87 ++++++++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 3 +-
mm/migrate.c | 16 ++++++---
5 files changed, 91 insertions(+), 19 deletions(-)
--
1.8.3.1
* [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time
From: Srikar Dronamraju @ 2018-08-03 6:13 UTC
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
Task migration under numa balancing can happen in parallel: more than
one task might choose to migrate to the same cpu at the same time. This
can result in:
- During task swap, choosing a task that was not part of the evaluation.
- During task swap, a task that just moved to its preferred node being
moved to a completely different node.
- During task swap, a task that fails to move to its preferred node having
to wait an extra interval for the next migration opportunity.
- During task movement, multiple simultaneous movements causing load
imbalance.
This problem is more likely if there are more cores per node or more
nodes in the system.
Fix it by using a per run-queue variable to track whether numa balancing
is already active on the run-queue.
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 199709 206350 3.32534
1 330830 319963 -3.28477
on 2 Socket/4 Node Power8 (PowerNV)
JVMS Prev Current %Change
8 89011.9 89627.8 0.69193
1 218946 211338 -3.47483
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 180473 186539 3.36117
1 212805 220344 3.54268
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 56941.8 56836 -0.185804
1 111686 112970 1.14965
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 12029.8 12124.6 12060.9 34.0076
5 13136.1 13170.2 13150.2 14.7482 9.03166
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4968.51 5006.62 4981.31 13.4151
5 4319.79 4998.19 4836.53 261.109 -2.90646
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9342.92 9381.44 9363.92 12.8587
5 9325.56 9402.7 9362.49 25.9638 -0.0152714
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 143.4 188.892 170.225 16.9929
5 132.581 191.072 170.554 21.6444 0.193274
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2->v3:
Add comments as requested by Peter.
kernel/sched/fair.c | 22 ++++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 23 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..5cf921a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1514,6 +1514,21 @@ struct task_numa_env {
static void task_numa_assign(struct task_numa_env *env,
struct task_struct *p, long imp)
{
+ struct rq *rq = cpu_rq(env->dst_cpu);
+
+ /* Bail out if run-queue part of active numa balance. */
+ if (xchg(&rq->numa_migrate_on, 1))
+ return;
+
+ /*
+ * Clear previous best_cpu/rq numa-migrate flag, since task now
+ * found a better cpu to move/swap.
+ */
+ if (env->best_cpu != -1) {
+ rq = cpu_rq(env->best_cpu);
+ WRITE_ONCE(rq->numa_migrate_on, 0);
+ }
+
if (env->best_task)
put_task_struct(env->best_task);
if (p)
@@ -1569,6 +1584,9 @@ static void task_numa_compare(struct task_numa_env *env,
long moveimp = imp;
int dist = env->dist;
+ if (READ_ONCE(dst_rq->numa_migrate_on))
+ return;
+
rcu_read_lock();
cur = task_rcu_dereference(&dst_rq->curr);
if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1710,6 +1728,7 @@ static int task_numa_migrate(struct task_struct *p)
.best_cpu = -1,
};
struct sched_domain *sd;
+ struct rq *best_rq;
unsigned long taskweight, groupweight;
int nid, ret, dist;
long taskimp, groupimp;
@@ -1811,14 +1830,17 @@ static int task_numa_migrate(struct task_struct *p)
*/
p->numa_scan_period = task_scan_start(p);
+ best_rq = cpu_rq(env.best_cpu);
if (env.best_task == NULL) {
ret = migrate_task_to(p, env.best_cpu);
+ WRITE_ONCE(best_rq->numa_migrate_on, 0);
if (ret != 0)
trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
return ret;
}
ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+ WRITE_ONCE(best_rq->numa_migrate_on, 0);
if (ret != 0)
trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8ca..0b91612 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,6 +783,7 @@ struct rq {
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
+ unsigned int numa_migrate_on;
#endif
#define CPU_LOAD_IDX_MAX 5
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
--
1.8.3.1
* [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit
From: Srikar Dronamraju @ 2018-08-03 6:13 UTC
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
Since this spinlock only serializes migrate rate limiting, convert the
spinlock to a trylock. If another task races ahead of this task, then
this task can simply move on.
While here, also correct two abnormalities:
- Avoid the rate-limit window being stretched on every interval.
- Use READ_ONCE/WRITE_ONCE when accessing the next window.
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 206350 200892 -2.64502
1 319963 325766 1.81365
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 186539 190261 1.99529
1 220344 195305 -11.3636
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 56836 57651.1 1.43413
1 112970 111351 -1.43312
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 13136.1 13170.2 13150.2 14.7482
5 12254.7 12331.9 12297.8 28.1846 -6.48203
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4319.79 4998.19 4836.53 261.109
5 4997.83 5030.14 5015.54 12.947 3.70121
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9325.56 9402.7 9362.49 25.9638
5 9331.84 9375.11 9352.04 16.0703 -0.111616
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 132.581 191.072 170.554 21.6444
5 147.55 181.605 168.963 11.3513 -0.932842
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
---
Changelog v1->v2:
Fix the interval stretch pointed out by Peter Zijlstra.
Verified that some of the regression is due to fixing the interval stretch.
mm/migrate.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f..dbc2cb7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1868,16 +1868,24 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
unsigned long nr_pages)
{
+ unsigned long next_window, interval;
+
+ next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
+ interval = msecs_to_jiffies(migrate_interval_millisecs);
+
/*
* Rate-limit the amount of data that is being migrated to a node.
* Optimal placement is no good if the memory bus is saturated and
* all the time is being spent migrating!
*/
- if (time_after(jiffies, next_window)) {
- spin_lock(&pgdat->numabalancing_migrate_lock);
+ if (time_after(jiffies, next_window) &&
+ spin_trylock(&pgdat->numabalancing_migrate_lock)) {
pgdat->numabalancing_migrate_nr_pages = 0;
- pgdat->numabalancing_migrate_next_window = jiffies +
- msecs_to_jiffies(migrate_interval_millisecs);
+ do {
+ next_window += interval;
+ } while (unlikely(time_after(jiffies, next_window)));
+
+ WRITE_ONCE(pgdat->numabalancing_migrate_next_window, next_window);
spin_unlock(&pgdat->numabalancing_migrate_lock);
}
if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
--
1.8.3.1
* [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement
6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03 6:13 UTC
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
If the numa improvement from a task migration is going to be very
minimal, avoid the task migration.
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 200892 210118 4.59252
1 325766 313171 -3.86627
on 2 Socket/4 Node Power8 (PowerNV)
JVMS Prev Current %Change
8 89011.9 91027.5 2.26442
1 211338 216460 2.42361
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 190261 191918 0.870909
1 195305 207043 6.01009
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 57651.1 58462.1 1.40674
1 111351 108334 -2.70945
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 12254.7 12331.9 12297.8 28.1846
5 11851.8 11937.3 11890.9 33.5169 -3.30872
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4997.83 5030.14 5015.54 12.947
5 4791 5016.08 4962.55 85.9625 -1.05652
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9331.84 9375.11 9352.04 16.0703
5 9353.43 9380.49 9369.6 9.04361 0.187767
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 147.55 181.605 168.963 11.3513
5 149.518 215.412 179.083 21.5903 5.98948
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
- Handle trivial changes due to a variable name change (Rik van Riel).
- Drop changes where a subsequently found better cpu was rejected for a
small numa improvement (Rik van Riel).
kernel/sched/fair.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5cf921a..a717870 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1568,6 +1568,13 @@ static bool load_too_imbalanced(long src_load, long dst_load,
}
/*
+ * Maximum numa importance can be 1998 (2*999);
+ * SMALLIMP @ 30 would be close to 1998/64.
+ * Used to deter task migration.
+ */
+#define SMALLIMP 30
+
+/*
* This checks if the overall compute and NUMA accesses of the system would
* be improved if the source tasks was migrated to the target dst_cpu taking
* into account that it might be best if task running on the dst_cpu should
@@ -1600,7 +1607,7 @@ static void task_numa_compare(struct task_numa_env *env,
goto unlock;
if (!cur) {
- if (maymove || imp > env->best_imp)
+ if (maymove && moveimp >= env->best_imp)
goto assign;
else
goto unlock;
@@ -1643,16 +1650,22 @@ static void task_numa_compare(struct task_numa_env *env,
task_weight(cur, env->dst_nid, dist);
}
- if (imp <= env->best_imp)
- goto unlock;
-
if (maymove && moveimp > imp && moveimp > env->best_imp) {
- imp = moveimp - 1;
+ imp = moveimp;
cur = NULL;
goto assign;
}
/*
+ * If the numa importance is less than SMALLIMP,
+ * task migration might only result in ping pong
+ * of tasks and also hurt performance due to cache
+ * misses.
+ */
+ if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
+ goto unlock;
+
+ /*
* In the overloaded case, try and keep the load balanced.
*/
load = task_h_load(env->p) - task_h_load(cur);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
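The SMALLIMP cut-off added by the patch above can be modeled as a small stand-alone predicate. The following is an illustrative user-space sketch, not kernel code; the helper name is invented here, and only the SMALLIMP value and the comparison mirror the patch:

```c
#include <stdbool.h>

/*
 * Stand-alone model of the SMALLIMP cut-off added to
 * task_numa_compare(). Maximum NUMA importance is 1998 (2 * 999),
 * so SMALLIMP @ 30 is close to 1998 / 64.
 */
#define SMALLIMP 30

/*
 * Return true when a candidate migration should be skipped: either
 * the absolute NUMA importance is tiny, or it does not beat the
 * current best importance by more than SMALLIMP / 2.
 */
static bool numa_gain_too_small(long imp, long best_imp)
{
	return imp < SMALLIMP || imp <= best_imp + SMALLIMP / 2;
}
```

With best_imp at 30, for example, a new candidate needs imp above 45 to be considered, which is what deters migrating (and ping-ponging) on marginal gains.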
* [PATCH 4/6] sched/numa: Pass destination cpu as a parameter to migrate_task_rq
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
` (2 preceding siblings ...)
2018-08-03 6:13 ` [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement Srikar Dronamraju
@ 2018-08-03 6:13 ` Srikar Dronamraju
2018-08-03 6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
` (2 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03 6:13 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
This additional parameter (new_cpu) is used later to identify whether
the task migration is across nodes.
No functional change.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
kernel/sched/core.c | 2 +-
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 2 +-
4 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index deafa9f..fdab290 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1167,7 +1167,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
if (task_cpu(p) != new_cpu) {
if (p->sched_class->migrate_task_rq)
- p->sched_class->migrate_task_rq(p);
+ p->sched_class->migrate_task_rq(p, new_cpu);
p->se.nr_migrations++;
rseq_migrate(p);
perf_event_task_migrate(p);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 997ea7b..91e4202 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1607,7 +1607,7 @@ static void yield_task_dl(struct rq *rq)
return cpu;
}
-static void migrate_task_rq_dl(struct task_struct *p)
+static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused)
{
struct rq *rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a717870..a5936ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6308,7 +6308,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous CPU. The caller guarantees p->pi_lock or task_rq(p)->lock is held.
*/
-static void migrate_task_rq_fair(struct task_struct *p)
+static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unused)
{
/*
* As blocked tasks retain absolute vruntime the migration needs to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b91612..455fa33 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1524,7 +1524,7 @@ struct sched_class {
#ifdef CONFIG_SMP
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
- void (*migrate_task_rq)(struct task_struct *p);
+ void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
void (*task_woken)(struct rq *this_rq, struct task_struct *task);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
` (3 preceding siblings ...)
2018-08-03 6:13 ` [PATCH 4/6] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
@ 2018-08-03 6:14 ` Srikar Dronamraju
2018-09-10 8:48 ` Ingo Molnar
2018-08-03 6:14 ` [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset Srikar Dronamraju
2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03 6:14 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
Currently the task scan rate is reset when the NUMA balancer migrates the
task to a different node. If the NUMA balancer initiates a swap, the reset
applies only to the task that initiated the swap. Similarly, no scan rate
reset is done if the task is migrated across nodes by the traditional load
balancer.
Instead, move the scan reset to migrate_task_rq. This ensures that a task
moved out of its preferred node either gets back to its preferred node
quickly or finds a new preferred node. Doing so is fair to all tasks
migrating across nodes.
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 210118 208862 -0.597759
1 313171 307007 -1.96825
on 2 Socket/4 Node Power8 (PowerNV)
JVMS Prev Current %Change
8 91027.5 89911.4 -1.22611
1 216460 216176 -0.131202
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 191918 196078 2.16759
1 207043 214664 3.68088
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 58462.1 60719.2 3.86079
1 108334 112615 3.95167
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 11851.8 11937.3 11890.9 33.5169
5 12511.7 12559.4 12539.5 15.5883 5.45459
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4791 5016.08 4962.55 85.9625
5 4709.28 4979.28 4919.32 105.126 -0.871125
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9353.43 9380.49 9369.6 9.04361
5 9388.38 9406.29 9395.1 5.98959 0.272157
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 149.518 215.412 179.083 21.5903
5 157.71 184.929 174.754 10.7275 -2.41731
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
kernel/sched/fair.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a5936ed..4ea0eff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1837,12 +1837,6 @@ static int task_numa_migrate(struct task_struct *p)
if (env.best_cpu == -1)
return -EAGAIN;
- /*
- * Reset the scan period if the task is being rescheduled on an
- * alternative node to recheck if the tasks is now properly placed.
- */
- p->numa_scan_period = task_scan_start(p);
-
best_rq = cpu_rq(env.best_cpu);
if (env.best_task == NULL) {
ret = migrate_task_to(p, env.best_cpu);
@@ -6361,6 +6355,19 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
/* We have migrated, no longer consider this task hot */
p->se.exec_start = 0;
+
+#ifdef CONFIG_NUMA_BALANCING
+ if (!p->mm || (p->flags & PF_EXITING))
+ return;
+
+ if (p->numa_faults) {
+ int src_nid = cpu_to_node(task_cpu(p));
+ int dst_nid = cpu_to_node(new_cpu);
+
+ if (src_nid != dst_nid)
+ p->numa_scan_period = task_scan_start(p);
+ }
+#endif
}
static void task_dead_fair(struct task_struct *p)
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
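As a sanity check, the cross-node test that this patch adds to migrate_task_rq_fair() can be modeled in isolation. This is a hedged user-space sketch: cpu_to_node() is faked with an assumed two-CPUs-per-node topology, and the helper name is invented:

```c
#include <stdbool.h>

/* Assumed toy topology: CPUs 0-1 on node 0, CPUs 2-3 on node 1, ... */
static int toy_cpu_to_node(int cpu)
{
	return cpu / 2;
}

/*
 * Model of the patch's logic: reset the scan period only for tasks
 * that already have NUMA fault statistics and whose migration
 * actually crosses a node boundary.
 */
static bool should_reset_scan_period(bool has_numa_faults,
				     int cur_cpu, int new_cpu)
{
	if (!has_numa_faults)
		return false;
	return toy_cpu_to_node(cur_cpu) != toy_cpu_to_node(new_cpu);
}
```

An intra-node move (e.g. CPU 0 to CPU 1 above) leaves the scan period alone; only a genuine node change triggers the reset.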
* [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
` (4 preceding siblings ...)
2018-08-03 6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
@ 2018-08-03 6:14 ` Srikar Dronamraju
2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
6 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03 6:14 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
From: Mel Gorman <mgorman@techsingularity.net>
migrate_task_rq_fair resets the scan rate for NUMA balancing on every
cross-node migration. In the event of excessive load balancing due to
saturation, this may result in the scan rate being pegged at maximum and
further overloading the machine.
This patch only resets the scan if NUMA balancing is active, a preferred
node has been selected and the task is being migrated from the preferred
node, as these are the most harmful. For example, a migration to the preferred
node does not justify a faster scan rate. Similarly, a migration between two
nodes that are not preferred is probably bouncing due to over-saturation of
the machine. In that case, scanning faster and trapping more NUMA faults
will further overload the machine.
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 208862 209029 0.0799571
1 307007 326585 6.37705
on 2 Socket/4 Node Power8 (PowerNV)
JVMS Prev Current %Change
8 89911.4 89627.8 -0.315422
1 216176 221299 2.36983
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 196078 195444 -0.323341
1 214664 222390 3.59911
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 60719.2 60152.4 -0.933477
1 112615 111458 -1.02739
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 12511.7 12559.4 12539.5 15.5883
5 12904.6 12969 12942.6 23.9053 3.21464
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4709.28 4979.28 4919.32 105.126
5 4984.25 5025.95 5004.5 14.2253 1.73154
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9388.38 9406.29 9395.1 5.98959
5 9277.64 9357.22 9322.07 26.3558 -0.77732
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 157.71 184.929 174.754 10.7275
5 160.632 175.558 168.655 5.26823 -3.49005
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
kernel/sched/fair.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ea0eff..6e251e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6357,6 +6357,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
p->se.exec_start = 0;
#ifdef CONFIG_NUMA_BALANCING
+ if (!static_branch_likely(&sched_numa_balancing))
+ return;
+
if (!p->mm || (p->flags & PF_EXITING))
return;
@@ -6364,8 +6367,26 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
int src_nid = cpu_to_node(task_cpu(p));
int dst_nid = cpu_to_node(new_cpu);
- if (src_nid != dst_nid)
- p->numa_scan_period = task_scan_start(p);
+ if (src_nid == dst_nid)
+ return;
+
+ /*
+ * Allow resets if faults have been trapped before one scan
+ * has completed. This is most likely due to a new task that
+ * is pulled cross-node due to wakeups or load balancing.
+ */
+ if (p->numa_scan_seq) {
+ /*
+ * Avoid scan adjustments if moving to the preferred
+ * node or if the task was not previously running on
+ * the preferred node.
+ */
+ if (dst_nid == p->numa_preferred_nid ||
+ (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid))
+ return;
+ }
+
+ p->numa_scan_period = task_scan_start(p);
}
#endif
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
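The narrowed reset policy above amounts to a small decision function. Below is an illustrative stand-alone sketch of those conditions; the function and parameter names are invented for this sketch (in the kernel the logic sits inline in migrate_task_rq_fair()):

```c
#include <stdbool.h>

#define NO_PREFERRED_NID (-1)

/*
 * Model of the patch's policy: reset the scan period only for a
 * cross-node migration away from the task's preferred node, or
 * before the first scan has completed (likely a new task pulled
 * cross-node by wakeups or load balancing).
 */
static bool scan_reset_wanted(int numa_scan_seq,
			      int src_nid, int dst_nid,
			      int preferred_nid)
{
	if (src_nid == dst_nid)
		return false;           /* not a cross-node move */
	if (numa_scan_seq == 0)
		return true;            /* first scan not complete */
	if (dst_nid == preferred_nid)
		return false;           /* moving toward the preferred node */
	if (preferred_nid != NO_PREFERRED_NID && src_nid != preferred_nid)
		return false;           /* bouncing between non-preferred nodes */
	return true;
}
```

Only the "migrated away from an established preferred node" case survives, which is exactly the case the changelog calls the most harmful.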
* Re: [PATCH 0/6] numa-balancing patches
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
` (5 preceding siblings ...)
2018-08-03 6:14 ` [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset Srikar Dronamraju
@ 2018-08-21 12:01 ` Srikar Dronamraju
2018-09-06 12:17 ` Peter Zijlstra
6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-21 12:01 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-08-03 11:43:55]:
> This patchset based on current tip/sched/core, provides left out patches
> from the previous series. This version handles the comments given to some of
> the patches. It drops "sched/numa: Restrict migrating in parallel to the same
> node." It adds an additional patch from Mel Gorman.
> It also provides specjbb2005 /dbench/ perf bench numa numbers on a patch
> basis on 4 node and 2 node systems.
>
> v2: http://lkml.kernel.org/r/1529514181-9842-1-git-send-email-srikar@linux.vnet.ibm.com
> v1: http://lkml.kernel.org/r/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com
>
Hi Peter, Ingo
Can you please respond with your comments and suggestions?
Mel and Rik have acked most of the patches.
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit
2018-08-03 6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
@ 2018-09-06 11:48 ` Peter Zijlstra
2018-09-10 8:39 ` Ingo Molnar
1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2018-09-06 11:48 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
On Fri, Aug 03, 2018 at 11:43:57AM +0530, Srikar Dronamraju wrote:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8c0af0f..dbc2cb7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1868,16 +1868,24 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
> static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
> unsigned long nr_pages)
> {
> + unsigned long next_window, interval;
> +
> + next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
> + interval = msecs_to_jiffies(migrate_interval_millisecs);
> +
> /*
> * Rate-limit the amount of data that is being migrated to a node.
> * Optimal placement is no good if the memory bus is saturated and
> * all the time is being spent migrating!
> */
> - if (time_after(jiffies, next_window)) {
> - spin_lock(&pgdat->numabalancing_migrate_lock);
> + if (time_after(jiffies, next_window) &&
> + spin_trylock(&pgdat->numabalancing_migrate_lock)) {
This patch doesn't apply cleanly; also you introduce @next_window with
this patch, so how can it remove a user of it?
I fixed this up, still weird.
^ permalink raw reply [flat|nested] 16+ messages in thread
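The pattern under discussion in the patch quoted above — opportunistically resetting a rate-limit window with a trylock so that racing tasks simply move on — can be sketched in user space. This is an assumption-laden illustration, not the kernel code: jiffies are replaced by a plain tick counter, the spinlock by a GCC atomic test-and-set flag, and the names are invented:

```c
#include <stdbool.h>

struct ratelimit {
	char lock;                  /* 0 = unlocked; stands in for the spinlock */
	unsigned long next_window;  /* end of the current window, in "ticks" */
	unsigned long interval;     /* window length, in "ticks" */
};

/*
 * Return true if the window had expired and this caller won the race
 * to open a new one; losers (and callers inside the window) move on
 * instead of serializing on the lock.
 */
static bool window_expired_try_reset(struct ratelimit *rl, unsigned long now)
{
	unsigned long next = __atomic_load_n(&rl->next_window, __ATOMIC_RELAXED);

	if (now <= next)
		return false;   /* window still open */
	if (__atomic_test_and_set(&rl->lock, __ATOMIC_ACQUIRE))
		return false;   /* another caller is already resetting */
	__atomic_store_n(&rl->next_window, now + rl->interval, __ATOMIC_RELAXED);
	__atomic_clear(&rl->lock, __ATOMIC_RELEASE);
	return true;
}
```

The key property is that contention never blocks: at most one racing caller pays for the reset, matching the changelog's "another task races ahead, this task can simply move on".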
* Re: [PATCH 0/6] numa-balancing patches
2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
@ 2018-09-06 12:17 ` Peter Zijlstra
0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2018-09-06 12:17 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
On Tue, Aug 21, 2018 at 05:01:50AM -0700, Srikar Dronamraju wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-08-03 11:43:55]:
>
> > This patchset based on current tip/sched/core, provides left out patches
> > from the previous series. This version handles the comments given to some of
> > the patches. It drops "sched/numa: Restrict migrating in parallel to the same
> > node." It adds an additional patch from Mel Gorman.
> > It also provides specjbb2005 /dbench/ perf bench numa numbers on a patch
> > basis on 4 node and 2 node systems.
> >
> > v2: http://lkml.kernel.org/r/1529514181-9842-1-git-send-email-srikar@linux.vnet.ibm.com
> > v1: http://lkml.kernel.org/r/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com
> >
>
> Hi Peter, Ingo
>
> Can you please respond with your comments suggestions.
>
> Mel and Rik have acked most of the patches.
The patches do not contain a single ack.
But I picked them up now. Thanks.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit
2018-08-03 6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
2018-09-06 11:48 ` Peter Zijlstra
@ 2018-09-10 8:39 ` Ingo Molnar
1 sibling, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10 8:39 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> Since this spinlock will only serialize migrate rate limiting,
> convert the spinlock to a trylock. If another task races ahead of this task
> then this task can simply move on.
>
> While here, also correct two abnormalities.
> - Avoid time being stretched for every interval.
> - Use READ/WRITE_ONCE with next window.
>
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS Prev Current %Change
> 4 206350 200892 -2.64502
> 1 319963 325766 1.81365
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS Prev Current %Change
> 4 186539 190261 1.99529
> 1 220344 195305 -11.3636
>
>
> on 4 Socket/4 Node Power7
> JVMS Prev Current %Change
> 8 56836 57651.1 1.43413
> 1 112970 111351 -1.43312
>
>
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count Min Max Avg Variance %Change
> 5 13136.1 13170.2 13150.2 14.7482
> 5 12254.7 12331.9 12297.8 28.1846 -6.48203
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> count Min Max Avg Variance %Change
> 5 4319.79 4998.19 4836.53 261.109
> 5 4997.83 5030.14 5015.54 12.947 3.70121
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> count Min Max Avg Variance %Change
> 5 9325.56 9402.7 9362.49 25.9638
> 5 9331.84 9375.11 9352.04 16.0703 -0.111616
>
>
> on 4 Socket/4 Node Power7
> count Min Max Avg Variance %Change
> 5 132.581 191.072 170.554 21.6444
> 5 147.55 181.605 168.963 11.3513 -0.932842
Firstly, *please* always characterize benchmark runs. What did you find? How should we
interpret the result? Are there any tradeoffs?
*Don't* just dump them on us.
Because in this particular case the results are not obvious, at all:
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS Prev Current %Change
> 4 206350 200892 -2.64502
> 1 319963 325766 1.81365
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS Prev Current %Change
> 4 186539 190261 1.99529
> 1 220344 195305 -11.3636
>
>
> on 4 Socket/4 Node Power7
> JVMS Prev Current %Change
> 8 56836 57651.1 1.43413
> 1 112970 111351 -1.43312
Why is this better? The largest drop is 11% which seems significant.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time
2018-08-03 6:13 ` [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
@ 2018-09-10 8:42 ` Ingo Molnar
0 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10 8:42 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> Task migration under numa balancing can happen in parallel. More than
> one task might choose to migrate to the same cpu at the same time. This
> can result in
> - During task swap, choosing a task that was not part of the evaluation.
> - During task swap, task which just got moved into its preferred node,
> moving to a completely different node.
> - During task swap, task failing to move to the preferred node, will have
> to wait an extra interval for the next migrate opportunity.
> - During task movement, multiple task movements can cause load imbalance.
Please capitalize both 'CPU' and 'NUMA' in changelogs and code comments.
> This problem is more likely if there are more cores per node or more
> nodes in the system.
>
> Use a per run-queue variable to check if numa-balance is active on the
> run-queue.
>
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS Prev Current %Change
> 4 199709 206350 3.32534
> 1 330830 319963 -3.28477
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> JVMS Prev Current %Change
> 8 89011.9 89627.8 0.69193
> 1 218946 211338 -3.47483
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS Prev Current %Change
> 4 180473 186539 3.36117
> 1 212805 220344 3.54268
>
>
> on 4 Socket/4 Node Power7
> JVMS Prev Current %Change
> 8 56941.8 56836 -0.185804
> 1 111686 112970 1.14965
>
>
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count Min Max Avg Variance %Change
> 5 12029.8 12124.6 12060.9 34.0076
> 5 13136.1 13170.2 13150.2 14.7482 9.03166
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> count Min Max Avg Variance %Change
> 5 4968.51 5006.62 4981.31 13.4151
> 5 4319.79 4998.19 4836.53 261.109 -2.90646
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> count Min Max Avg Variance %Change
> 5 9342.92 9381.44 9363.92 12.8587
> 5 9325.56 9402.7 9362.49 25.9638 -0.0152714
>
>
> on 4 Socket/4 Node Power7
> count Min Max Avg Variance %Change
> 5 143.4 188.892 170.225 16.9929
> 5 132.581 191.072 170.554 21.6444 0.193274
I have applied this patch, but the zero comments benchmark dump is annoying, as the numbers do
not show unconditional advantages - there's some increases in performance and some regressions.
In particular this:
> dbench / transactions / higher numbers are better
> on 2 Socket/4 Node Power8 (PowerNV)
> count Min Max Avg Variance %Change
> 5 4968.51 5006.62 4981.31 13.4151
> 5 4319.79 4998.19 4836.53 261.109 -2.90646
is concerning: not only did we lose some performance, variance went up by a *lot*. Is this just
a measurement fluke? We cannot know and you didn't comment.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement
2018-08-03 6:13 ` [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement Srikar Dronamraju
@ 2018-09-10 8:46 ` Ingo Molnar
2018-09-12 15:17 ` Srikar Dronamraju
0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10 8:46 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> If the NUMA improvement from a task migration is going to be minimal,
> then avoid the task migration.
>
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS Prev Current %Change
> 4 200892 210118 4.59252
> 1 325766 313171 -3.86627
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> JVMS Prev Current %Change
> 8 89011.9 91027.5 2.26442
> 1 211338 216460 2.42361
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS Prev Current %Change
> 4 190261 191918 0.870909
> 1 195305 207043 6.01009
>
>
> on 4 Socket/4 Node Power7
> JVMS Prev Current %Change
> 8 57651.1 58462.1 1.40674
> 1 111351 108334 -2.70945
>
>
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count Min Max Avg Variance %Change
> 5 12254.7 12331.9 12297.8 28.1846
> 5 11851.8 11937.3 11890.9 33.5169 -3.30872
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> count Min Max Avg Variance %Change
> 5 4997.83 5030.14 5015.54 12.947
> 5 4791 5016.08 4962.55 85.9625 -1.05652
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> count Min Max Avg Variance %Change
> 5 9331.84 9375.11 9352.04 16.0703
> 5 9353.43 9380.49 9369.6 9.04361 0.187767
>
>
> on 4 Socket/4 Node Power7
> count Min Max Avg Variance %Change
> 5 147.55 181.605 168.963 11.3513
> 5 149.518 215.412 179.083 21.5903 5.98948
>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v1->v2:
> - Handle trivial changes due to variable name change. (Rik van Riel)
> - Drop changes where subsequent better cpu find was rejected for
> small numa improvement (Rik van Riel).
>
> kernel/sched/fair.c | 23 ++++++++++++++++++-----
> 1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5cf921a..a717870 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1568,6 +1568,13 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> }
>
> /*
> + * Maximum numa importance can be 1998 (2*999);
> + * SMALLIMP @ 30 would be close to 1998/64.
> + * Used to deter task migration.
> + */
> +#define SMALLIMP 30
> +
> +/*
> * This checks if the overall compute and NUMA accesses of the system would
> * be improved if the source tasks was migrated to the target dst_cpu taking
> * into account that it might be best if task running on the dst_cpu should
> @@ -1600,7 +1607,7 @@ static void task_numa_compare(struct task_numa_env *env,
> goto unlock;
>
> if (!cur) {
> - if (maymove || imp > env->best_imp)
> + if (maymove && moveimp >= env->best_imp)
> goto assign;
> else
> goto unlock;
> @@ -1643,16 +1650,22 @@ static void task_numa_compare(struct task_numa_env *env,
> task_weight(cur, env->dst_nid, dist);
> }
>
> - if (imp <= env->best_imp)
> - goto unlock;
> -
> if (maymove && moveimp > imp && moveimp > env->best_imp) {
> - imp = moveimp - 1;
> + imp = moveimp;
> cur = NULL;
> goto assign;
> }
>
> /*
> + * If the numa importance is less than SMALLIMP,
> + * task migration might only result in ping pong
> + * of tasks and also hurt performance due to cache
> + * misses.
> + */
> + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> + goto unlock;
> +
> + /*
> * In the overloaded case, try and keep the load balanced.
> */
> load = task_h_load(env->p) - task_h_load(cur);
So what is this 'NUMA importance'? Seems just like a random parameter which generally isn't a
good idea.
Also, same review feedback as I gave for the previous patches.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes
2018-08-03 6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
@ 2018-09-10 8:48 ` Ingo Molnar
2018-09-12 15:19 ` Srikar Dronamraju
0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10 8:48 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> Currently the task scan rate is reset when the NUMA balancer migrates
> the task to a different node. If the NUMA balancer initiates a swap,
> the reset applies only to the task that initiated the swap. Similarly,
> no scan rate reset is done if the task is migrated across nodes by the
> traditional load balancer.
>
> Instead, move the scan reset to migrate_task_rq. This ensures that a
> task moved out of its preferred node either gets back to its preferred
> node quickly or finds a new preferred node. Doing so is fair to all
> tasks migrating across nodes.
>
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS Prev Current %Change
> 4 210118 208862 -0.597759
> 1 313171 307007 -1.96825
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> JVMS Prev Current %Change
> 8 91027.5 89911.4 -1.22611
> 1 216460 216176 -0.131202
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS Prev Current %Change
> 4 191918 196078 2.16759
> 1 207043 214664 3.68088
>
>
> on 4 Socket/4 Node Power7
> JVMS Prev Current %Change
> 8 58462.1 60719.2 3.86079
> 1 108334 112615 3.95167
>
>
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count Min Max Avg Variance %Change
> 5 11851.8 11937.3 11890.9 33.5169
> 5 12511.7 12559.4 12539.5 15.5883 5.45459
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> count Min Max Avg Variance %Change
> 5 4791 5016.08 4962.55 85.9625
> 5 4709.28 4979.28 4919.32 105.126 -0.871125
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> count Min Max Avg Variance %Change
> 5 9353.43 9380.49 9369.6 9.04361
> 5 9388.38 9406.29 9395.1 5.98959 0.272157
>
>
> on 4 Socket/4 Node Power7
> count Min Max Avg Variance %Change
> 5 149.518 215.412 179.083 21.5903
> 5 157.71 184.929 174.754 10.7275 -2.41731
>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> kernel/sched/fair.c | 19 +++++++++++++------
> 1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a5936ed..4ea0eff 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1837,12 +1837,6 @@ static int task_numa_migrate(struct task_struct *p)
> if (env.best_cpu == -1)
> return -EAGAIN;
>
> - /*
> - * Reset the scan period if the task is being rescheduled on an
> - * alternative node to recheck if the tasks is now properly placed.
> - */
> - p->numa_scan_period = task_scan_start(p);
> -
> best_rq = cpu_rq(env.best_cpu);
> if (env.best_task == NULL) {
> ret = migrate_task_to(p, env.best_cpu);
> @@ -6361,6 +6355,19 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
>
> /* We have migrated, no longer consider this task hot */
> p->se.exec_start = 0;
> +
> +#ifdef CONFIG_NUMA_BALANCING
> + if (!p->mm || (p->flags & PF_EXITING))
> + return;
> +
> + if (p->numa_faults) {
> + int src_nid = cpu_to_node(task_cpu(p));
> + int dst_nid = cpu_to_node(new_cpu);
> +
> + if (src_nid != dst_nid)
> + p->numa_scan_period = task_scan_start(p);
> + }
> +#endif
Please don't add #ifdeffery inside functions, especially not if they do weird flow control like
a 'return' from the middle of a block.
A properly named inline helper would work I suppose.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement
2018-09-10 8:46 ` Ingo Molnar
@ 2018-09-12 15:17 ` Srikar Dronamraju
0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 15:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
> >
> > /*
> > + * Maximum numa importance can be 1998 (2*999);
> > + * SMALLIMP @ 30 would be close to 1998/64.
> > + * Used to deter task migration.
> > + */
> > +#define SMALLIMP 30
> > +
> > +/*
> >
> > /*
> > + * If the numa importance is less than SMALLIMP,
> > + * task migration might only result in ping pong
> > + * of tasks and also hurt performance due to cache
> > + * misses.
> > + */
> > + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> > + goto unlock;
> > +
> > + /*
> > * In the overloaded case, try and keep the load balanced.
> > */
> > load = task_h_load(env->p) - task_h_load(cur);
>
> So what is this 'NUMA importance'? Seems just like a random parameter which generally isn't a
> good idea.
>
I refer to the weight that is used to compare the suitability of a task to a
node as NUMA importance. It varies between -999 and 1000. This is not
something introduced by this patch; it was introduced as part of NUMA
balancing a couple of years ago. group_imp, task_imp and best_imp all refer
to the NUMA importance. Maybe I am using the wrong term here; maybe imp
stands for something other than importance.
In this patch, we are trying to limit task migration for small NUMA
importance, i.e. if the NUMA importance of moving/swapping tasks is only 10,
should we then drop all cache affinity for NUMA affinity? Maybe we need
to wait for the trend to stabilize.
I have chosen 30 as the weight below which we refuse to consider NUMA
importance. It is based on the maximum NUMA importance / 64.
Please do suggest if you have a better method to limit task migrations for
small NUMA gain.
* Re: [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes
2018-09-10 8:48 ` Ingo Molnar
@ 2018-09-12 15:19 ` Srikar Dronamraju
0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 15:19 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
> > +#ifdef CONFIG_NUMA_BALANCING
> > + if (!p->mm || (p->flags & PF_EXITING))
> > + return;
> > +
> > + if (p->numa_faults) {
> > + int src_nid = cpu_to_node(task_cpu(p));
> > + int dst_nid = cpu_to_node(new_cpu);
> > +
> > + if (src_nid != dst_nid)
> > + p->numa_scan_period = task_scan_start(p);
> > + }
> > +#endif
>
> Please don't add #ifdeffery inside functions, especially not if they do weird flow control like
> a 'return' from the middle of a block.
>
> A properly named inline helper would work I suppose.
>
Okay, will take care.
end of thread, other threads:[~2018-09-12 15:19 UTC | newest]
Thread overview: 16+ messages
-- links below jump to the message on this page --
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
2018-08-03 6:13 ` [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
2018-09-10 8:42 ` Ingo Molnar
2018-08-03 6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
2018-09-06 11:48 ` Peter Zijlstra
2018-09-10 8:39 ` Ingo Molnar
2018-08-03 6:13 ` [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement Srikar Dronamraju
2018-09-10 8:46 ` Ingo Molnar
2018-09-12 15:17 ` Srikar Dronamraju
2018-08-03 6:13 ` [PATCH 4/6] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
2018-08-03 6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
2018-09-10 8:48 ` Ingo Molnar
2018-09-12 15:19 ` Srikar Dronamraju
2018-08-03 6:14 ` [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset Srikar Dronamraju
2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
2018-09-06 12:17 ` Peter Zijlstra