linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] numa-balancing patches
@ 2018-08-03  6:13 Srikar Dronamraju
  2018-08-03  6:13 ` [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03  6:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

This patchset, based on current tip/sched/core, provides the patches left
out from the previous series. This version addresses the comments given on
some of the patches. It drops "sched/numa: Restrict migrating in parallel
to the same node" and adds an additional patch from Mel Gorman.
It also provides specjbb2005, dbench, and perf bench numa numbers on a
per-patch basis for 4 node and 2 node systems.

v2: http://lkml.kernel.org/r/1529514181-9842-1-git-send-email-srikar@linux.vnet.ibm.com
v1: http://lkml.kernel.org/r/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com

specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS  Prev    Current  %Change
4     199709  209029   4.66679
1     330830  326585   -1.28314


on 2 Socket/4 Node Power8 (PowerNV)
JVMS  Prev    Current  %Change
1     218946  221299   1.07469


on 2 Socket/2 Node Power9 (PowerNV)
JVMS  Prev    Current  %Change
4     180473  195444   8.29542
1     212805  222390   4.50412


on 4 Socket/4 Node Power7
JVMS  Prev     Current  %Change
8     56941.8  60152.4  5.63839
1     111686   111458   -0.204144


dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count  Min      Max      Avg      Variance  %Change
5      12029.8  12124.6  12060.9  34.0076
5      12904.6  12969    12942.6  23.9053   7.3104



on 2 Socket/4 Node Power8 (PowerNV)
count  Min      Max      Avg     Variance  %Change
5      4968.51  5006.62  4981.31  13.4151
5      4984.25  5025.95  5004.5   14.2253   0.46554


on 2 Socket/2 Node Power9 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      9342.92  9381.44  9363.92  12.8587
5      9277.64  9357.22  9322.07  26.3558   -0.446928


on 4 Socket/4 Node Power7
count  Min      Max      Avg      Variance  %Change
5      143.4    188.892  170.225  16.9929
5      160.632  175.558  168.655  5.26823   -0.922309

perf bench numa / time / lower numbers are better
on 2 Socket/2 Node Intel
Testcase         Time:  Min       Max       Avg       StdDev
  numa01.sh      Real:  403.47    420.68    411.27    5.80
  numa01.sh      Sys:   5.20      8.75      5.96      1.39
  numa01.sh      User:  14583.44  14832.05  14699.23  83.15
  numa02.sh      Real:  70.61     73.80     72.53     1.10
  numa02.sh      Sys:   4.61      5.08      4.80      0.18
  numa02.sh      User:  2634.28   2690.65   2669.69   20.80
  numa03.sh      Real:  328.53    374.61    354.53    17.22
  numa03.sh      Sys:   9.15      11.78     10.34     0.87
  numa03.sh      User:  14828.93  16646.99  15693.92  758.08
  numa04.sh      Real:  404.31    424.15    413.53    6.45
  numa04.sh      Sys:   5.70      7.98      6.33      0.85
  numa04.sh      User:  14608.86  15002.66  14812.80  156.89
  numa05.sh      Real:  432.00    449.59    444.57    6.52
  numa05.sh      Sys:   14.80     16.94     15.67     0.74
  numa05.sh      User:  15679.60  16048.79  15911.45  133.39
Testcase         Time:  Min       Max       Avg       StdDev   %Change
  numa01.sh      Real:  392.85    415.77    403.96    8.33     1.80959%
  numa01.sh      Sys:   6.19      9.81      7.89      1.19     -24.4613%
  numa01.sh      User:  14219.55  14733.04  14511.33  204.30   1.29485%
  numa02.sh      Real:  58.77     63.41     60.01     1.74     20.8632%
  numa02.sh      Sys:   5.28      5.62      5.42      0.11     -11.4391%
  numa02.sh      User:  2302.26   2454.57   2345.44   55.26    13.8247%
  numa03.sh      Real:  345.47    401.75    366.51    20.54    -3.26867%
  numa03.sh      Sys:   8.87      11.94     10.48     1.29     -1.33588%
  numa03.sh      User:  14709.09  17409.22  15824.09  1010.20  -0.822607%
  numa04.sh      Real:  392.78    404.64    398.72    4.50     3.71439%
  numa04.sh      Sys:   6.61      8.30      7.30      0.55     -13.2877%
  numa04.sh      User:  14324.48  14638.68  14464.01  117.68   2.41143%
  numa05.sh      Real:  383.94    414.25    396.28    10.61    12.1858%
  numa05.sh      Sys:   20.20     25.96     24.15     2.11     -35.1139%
  numa05.sh      User:  14707.57  15251.14  14993.60  185.47   6.12161%

Info on each of the perf bench scripts is available at
http://lkml.kernel.org/r/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com

vmstat data for perf bench numa01
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        125087      130979      4.71032%
numa_hint_faults_local  122058      118544      -2.87896%
numa_hit                472395      562428      19.0588%
numa_huge_pte_updates   121037      126394      4.42592%
numa_interleave         0           0           NA
numa_local              472041      562071      19.0725%
numa_miss               0           0           NA
numa_other              354         357         0.847458%
numa_pages_migrated     977502      1845407     88.7881%
numa_pte_updates        61980575    64723635    4.42568%
pgfault                 709665      823092      15.9832%
pgmajfault              443         507         14.447%
pgmigrate_fail          46592       109568      135.165%
pgmigrate_success       977502      1845407     88.7881%


vmstat data for perf bench numa02
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        38495       22190       -42.3562%
numa_hint_faults_local  35757       17037       -52.3534%
numa_hit                303708      284876      -6.20069%
numa_huge_pte_updates   33842       19259       -43.0914%
numa_interleave         0           0           NA
numa_local              303450      284624      -6.20399%
numa_miss               0           0           NA
numa_other              258         252         -2.32558%
numa_pages_migrated     993537      1570214     58.0428%
numa_pte_updates        17452735    9984967     -42.7885%
pgfault                 368888      330200      -10.4877%
pgmajfault              420         308         -26.6667%
pgmigrate_fail          0           0           NA
pgmigrate_success       993537      1570214     58.0428%


vmstat data for perf bench numa03
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        62510       63635       1.79971%
numa_hint_faults_local  38696       39109       1.06729%
numa_hit                382967      395061      3.15797%
numa_huge_pte_updates   58145       59089       1.62353%
numa_interleave         0           0           NA
numa_local              382720      394809      3.15871%
numa_miss               0           0           NA
numa_other              247         252         2.02429%
numa_pages_migrated     2035239     2196610     7.92885%
numa_pte_updates        29777304    30261666    1.62661%
pgfault                 544043      560325      2.99278%
pgmajfault              355         451         27.0423%
pgmigrate_fail          224256      344576      53.653%
pgmigrate_success       2035239     2196610     7.92885%


vmstat data for perf bench numa04
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        124431      132260      6.29184%
numa_hint_faults_local  121355      119283      -1.70739%
numa_hit                409737      399975      -2.3825%
numa_huge_pte_updates   120253      127886      6.34745%
numa_interleave         0           0           NA
numa_local              409468      399724      -2.37967%
numa_miss               0           0           NA
numa_other              269         251         -6.69145%
numa_pages_migrated     1116395     1659057     48.6084%
numa_pte_updates        61579860    65487765    6.34608%
pgfault                 631873      633795      0.304175%
pgmajfault              337         329         -2.37389%
pgmigrate_fail          47616       156160      227.957%
pgmigrate_success       1116395     1659057     48.6084%


vmstat data for perf bench numa05
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        355042      337290      -4.99997%
numa_hint_faults_local  342282      305309      -10.8019%
numa_hit                469069      461415      -1.63174%
numa_huge_pte_updates   348439      330052      -5.27696%
numa_interleave         0           0           NA
numa_local              468821      461168      -1.63239%
numa_miss               0           0           NA
numa_other              248         247         -0.403226%
numa_pages_migrated     3247276     6844700     110.783%
numa_pte_updates        178418683   169004579   -5.27641%
pgfault                 899368      878657      -2.30284%
pgmajfault              345         334         -3.18841%
pgmigrate_fail          781312      1424384     82.3067%
pgmigrate_success       3247276     6844700     110.783%


on 2 Socket/4 Node Power8 (PowerNV)
Testcase         Time:  Min       Max       Avg       StdDev
  numa01.sh      Real:  358.03    476.82    419.73    46.59
  numa01.sh      Sys:   14.53     20.23     16.47     1.96
  numa01.sh      User:  43304.06  53938.77  48978.04  4280.53
  numa02.sh      Real:  52.55     59.28     56.55     2.58
  numa02.sh      Sys:   7.33      11.74     9.37      1.56
  numa02.sh      User:  5112.38   5765.50   5535.94   237.97
  numa03.sh      Real:  486.71    497.22    490.09    3.68
  numa03.sh      Sys:   12.12     15.21     14.18     1.07
  numa03.sh      User:  56814.30  59414.01  58412.02  893.79
  numa04.sh      Real:  322.51    350.93    335.53    9.06
  numa04.sh      Sys:   14.03     16.90     15.79     1.10
  numa04.sh      User:  33446.88  36163.03  34128.47  1023.44
  numa05.sh      Real:  324.11    333.71    330.69    3.37
  numa05.sh      Sys:   21.25     28.33     23.59     2.55
  numa05.sh      User:  33017.37  34332.36  33536.43  437.24
Testcase         Time:  Min       Max       Avg       StdDev   %Change
  numa01.sh      Real:  402.80    475.17    438.14    23.38    -4.20185%
  numa01.sh      Sys:   15.81     17.30     16.46     0.53     0.0607533%
  numa01.sh      User:  46324.59  52566.72  49514.24  2327.98  -1.08292%
  numa02.sh      Real:  49.32     59.99     54.64     3.42     3.49561%
  numa02.sh      Sys:   5.84      10.32     8.44      1.55     11.019%
  numa02.sh      User:  4962.98   5674.79   5456.00   255.40   1.46518%
  numa03.sh      Real:  481.18    492.84    487.49    4.05     -93.4563%
  numa03.sh      Sys:   12.11     13.93     13.07     0.73     -84.6213%
  numa03.sh      User:  56056.97  58557.44  57546.61  870.03   -97.3252%
  numa04.sh      Real:  314.72    399.01    344.72    28.97    42.1705%
  numa04.sh      Sys:   14.72     20.70     17.05     2.15     -16.8328%
  numa04.sh      User:  34075.02  42869.67  36528.81  3261.87  59.9067%
  numa05.sh      Real:  327.70    363.14    343.96    13.71    -2.45087%
  numa05.sh      Sys:   23.34     29.42     27.00     2.01     -41.5185%
  numa05.sh      User:  31716.77  36602.35  33670.61  1653.60  1.35982%


vmstat data for perf bench numa01
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        63348       64799       2.29052%
numa_hint_faults_local  27783       28052       0.968218%
numa_hit                288564      274955      -4.71611%
numa_huge_pte_updates   24248       25297       4.32613%
numa_interleave         0           0           NA
numa_local              288524      274914      -4.71711%
numa_miss               0           0           NA
numa_other              40          41          2.5%
numa_pages_migrated     668765      757419      13.2564%
numa_pte_updates        6247173     6516368     4.30907%
pgfault                 1026373     982450      -4.27944%
pgmajfault              552         455         -17.5725%
pgmigrate_fail          110871      105728      -4.63872%
pgmigrate_success       668765      757419      13.2564%


vmstat data for perf bench numa02
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        247818      340248      37.2975%
numa_hint_faults_local  165959      197634      19.086%
numa_hit                327750      338501      3.28024%
numa_huge_pte_updates   1302        1786        37.1736%
numa_interleave         0           0           NA
numa_local              327719      338477      3.28269%
numa_miss               0           0           NA
numa_other              31          24          -22.5806%
numa_pages_migrated     184908      212449      14.8944%
numa_pte_updates        601753      817802      35.9033%
pgfault                 714529      802873      12.3639%
pgmajfault              384         391         1.82292%
pgmigrate_fail          512         512         0%
pgmigrate_success       184908      212449      14.8944%


vmstat data for perf bench numa03
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        42672       34540       -19.057%
numa_hint_faults_local  15017       11041       -26.4767%
numa_hit                276938      269723      -2.60528%
numa_huge_pte_updates   13998       13968       -0.214316%
numa_interleave         0           0           NA
numa_local              276919      269715      -2.60148%
numa_miss               0           0           NA
numa_other              19          8           -57.8947%
numa_pages_migrated     333225      349934      5.01433%
numa_pte_updates        3610166     3596818     -0.369734%
pgfault                 992860      963524      -2.9547%
pgmajfault              522         957         83.3333%
pgmigrate_fail          108288      127744      17.9669%
pgmigrate_success       333225      349934      5.01433%


vmstat data for perf bench numa04
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        80020       87821       9.74881%
numa_hint_faults_local  59008       56421       -4.38415%
numa_hit                233083      235072      0.853344%
numa_huge_pte_updates   35238       35577       0.96203%
numa_interleave         0           0           NA
numa_local              233064      235067      0.859421%
numa_miss               0           0           NA
numa_other              19          5           -73.6842%
numa_pages_migrated     944562      1028140     8.84833%
numa_pte_updates        9065545     9159954     1.0414%
pgfault                 847441      851781      0.51213%
pgmajfault              970         421         -56.5979%
pgmigrate_fail          63233       53760       -14.9811%
pgmigrate_success       944562      1028140     8.84833%


vmstat data for perf bench numa05
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        174387      201119      15.3291%
numa_hint_faults_local  133581      146596      9.74315%
numa_hit                249145      257903      3.51522%
numa_huge_pte_updates   73582       85868       16.697%
numa_interleave         0           0           NA
numa_local              249137      257890      3.51333%
numa_miss               0           0           NA
numa_other              8           13          62.5%
numa_pages_migrated     1781374     2077576     16.6277%
numa_pte_updates        18938248    22100144    16.6958%
pgfault                 941574      995042      5.67858%
pgmajfault              434         415         -4.37788%
pgmigrate_fail          49180       69889       42.1086%
pgmigrate_success       1781374     2077576     16.6277%


on 2 Socket/2 Node Power9 (PowerNV)
Testcase         Time:  Min       Max        Avg        StdDev
  numa01.sh      Real:  462.22    591.23     504.51     44.82
  numa01.sh      Sys:   37.07     54.86      42.05      6.62
  numa01.sh      User:  72535.19  86297.67   75983.26   5208.86
  numa02.sh      Real:  82.50     87.37      84.18      1.82
  numa02.sh      Sys:   20.18     30.04      27.37      3.66
  numa02.sh      User:  12171.09  12358.11   12242.31   62.27
  numa03.sh      Real:  595.65    695.32     640.37     31.93
  numa03.sh      Sys:   31.45     42.00      35.40      3.78
  numa03.sh      User:  93877.45  109013.40  100676.82  4856.89
  numa04.sh      Real:  514.19    594.43     548.24     33.76
  numa04.sh      Sys:   41.25     54.25      46.86      4.89
  numa04.sh      User:  76298.64  86625.93   80615.33   4748.38
  numa05.sh      Real:  466.67    513.17     494.73     18.29
  numa05.sh      Sys:   61.19     70.28      66.83      3.35
  numa05.sh      User:  72845.76  76191.80   74651.22   1416.76
Testcase         Time:  Min       Max        Avg        StdDev    %Change
  numa01.sh      Real:  461.27    719.44     552.31     88.06     -8.65456%
  numa01.sh      Sys:   39.71     67.60      47.18      10.35     -10.8733%
  numa01.sh      User:  72257.05  112563.52  83735.04   14612.15  -9.25751%
  numa02.sh      Real:  82.65     84.25      83.41      0.53      0.923151%
  numa02.sh      Sys:   18.32     28.89      22.97      4.34      19.1554%
  numa02.sh      User:  12045.55  12215.64   12162.20   62.80     0.65868%
  numa03.sh      Real:  587.05    660.43     617.39     25.31     3.72212%
  numa03.sh      Sys:   28.05     36.74      31.86      3.45      11.1111%
  numa03.sh      User:  92686.08  103166.58  97013.32   3655.37   3.77629%
  numa04.sh      Real:  464.56    652.41     515.41     70.89     6.36969%
  numa04.sh      Sys:   38.40     49.26      42.43      4.00      10.4407%
  numa04.sh      User:  72275.32  88875.96   77174.44   6149.21   4.45859%
  numa05.sh      Real:  483.10    664.43     562.87     75.72     -12.1058%
  numa05.sh      Sys:   56.23     73.67      65.27      5.73      2.39007%
  numa05.sh      User:  73350.15  89813.72   80238.30   6532.10   -6.96311%


vmstat data for perf bench numa01
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        144716      145619      0.623981%
numa_hint_faults_local  99914       91850       -8.07094%
numa_hit                411314      369456      -10.1767%
numa_huge_pte_updates   136260      136154      -0.0777925%
numa_interleave         0           0           NA
numa_local              411279      369421      -10.1775%
numa_miss               0           0           NA
numa_other              35          35          0%
numa_pages_migrated     464612      481645      3.66607%
numa_pte_updates        4368544     4365935     -0.0597224%
pgfault                 1296071     1362892     5.15566%
pgmajfault              1412        1270        -10.0567%
pgmigrate_fail          42656       49952       17.1043%
pgmigrate_success       464612      481645      3.66607%


vmstat data for perf bench numa02
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        30562       28965       -5.22544%
numa_hint_faults_local  23479       21704       -7.55995%
numa_hit                176214      159995      -9.20415%
numa_huge_pte_updates   28447       27168       -4.49608%
numa_interleave         0           0           NA
numa_local              176209      159987      -9.20611%
numa_miss               0           0           NA
numa_other              5           8           60%
numa_pages_migrated     201448      204467      1.49865%
numa_pte_updates        936226      894109      -4.49859%
pgfault                 493189      481612      -2.34738%
pgmajfault              993         521         -47.5327%
pgmigrate_fail          0           0           NA
pgmigrate_success       201448      204467      1.49865%


vmstat data for perf bench numa03
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        78163       72811       -6.84723%
numa_hint_faults_local  39811       37318       -6.26209%
numa_hit                313487      308119      -1.71235%
numa_huge_pte_updates   69817       65243       -6.55141%
numa_interleave         0           0           NA
numa_local              313460      308104      -1.70867%
numa_miss               0           0           NA
numa_other              27          15          -44.4444%
numa_pages_migrated     184605      172934      -6.32215%
numa_pte_updates        2242167     2094992     -6.56396%
pgfault                 1186922     1166080     -1.75597%
pgmajfault              1077        668         -37.9759%
pgmigrate_fail          24544       24416       -0.521512%
pgmigrate_success       184605      172934      -6.32215%


vmstat data for perf bench numa04
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        152710      147880      -3.16286%
numa_hint_faults_local  97716       100272      2.61574%
numa_hit                324966      321659      -1.01764%
numa_huge_pte_updates   144348      139764      -3.17566%
numa_interleave         0           0           NA
numa_local              324939      321640      -1.01527%
numa_miss               0           0           NA
numa_other              27          19          -29.6296%
numa_pages_migrated     512467      485174      -5.32581%
numa_pte_updates        4626888     4479727     -3.18056%
pgfault                 1250077     1234721     -1.2284%
pgmajfault              691         575         -16.7873%
pgmigrate_fail          54848       58560       6.76779%
pgmigrate_success       512467      485174      -5.32581%


vmstat data for perf bench numa05
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        295774      305339      3.23389%
numa_hint_faults_local  218320      218793      0.216654%
numa_hit                352096      357790      1.61717%
numa_huge_pte_updates   286148      297923      4.115%
numa_interleave         0           0           NA
numa_local              352075      357780      1.62039%
numa_miss               0           0           NA
numa_other              21          10          -52.381%
numa_pages_migrated     906755      909564      0.309786%
numa_pte_updates        9165346     9541883     4.10827%
pgfault                 1407223     1433802     1.88876%
pgmajfault              599         683         14.0234%
pgmigrate_fail          70272       69568       -1.00182%
pgmigrate_success       906755      909564      0.309786%


on 4 Socket/4 Node Power7
Testcase         Time:  Min       Max       Avg       StdDev
  numa01.sh      Real:  677.66    913.24    794.88    85.49
  numa01.sh      Sys:   125.90    205.16    169.35    25.59
  numa01.sh      User:  56772.52  71741.79  63335.60  5073.86
  numa02.sh      Real:  65.34     70.28     67.96     1.98
  numa02.sh      Sys:   12.04     19.41     15.89     2.34
  numa02.sh      User:  5499.93   5682.07   5586.30   77.00
  numa03.sh      Real:  774.48    1035.38   893.82    87.76
  numa03.sh      Sys:   107.67    153.14    129.77    15.10
  numa03.sh      User:  62802.39  87222.58  73511.43  8260.39
  numa04.sh      Real:  504.09    733.50    633.03    78.96
  numa04.sh      Sys:   213.34    351.26    284.11    56.12
  numa04.sh      User:  38925.57  55954.50  47690.41  5716.68
  numa05.sh      Real:  402.78    501.75    453.02    37.15
  numa05.sh      Sys:   146.84    407.64    299.57    97.43
  numa05.sh      User:  33365.00  39445.00  36050.94  2053.69
Testcase         Time:  Min       Max       Avg       StdDev   %Change
  numa01.sh      Real:  636.41    913.49    802.07    94.08    -0.89643%
  numa01.sh      Sys:   169.10    209.84    181.46    15.01    -6.67365%
  numa01.sh      User:  51910.75  65727.60  60906.34  5019.96  3.98852%
  numa02.sh      Real:  63.64     70.40     66.18     2.42     2.68963%
  numa02.sh      Sys:   9.85      21.19     15.05     3.72     5.5814%
  numa02.sh      User:  5305.35   5702.47   5477.28   132.74   1.9904%
  numa03.sh      Real:  753.00    932.44    828.11    66.63    7.93494%
  numa03.sh      Sys:   81.82     132.68    104.12    17.89    24.635%
  numa03.sh      User:  61249.69  72311.19  65282.71  3998.53  12.6047%
  numa04.sh      Real:  504.61    655.03    605.21    52.01    4.59675%
  numa04.sh      Sys:   130.42    330.44    260.76    73.87    8.95459%
  numa04.sh      User:  37562.67  48382.57  45063.68  3892.89  5.82893%
  numa05.sh      Real:  462.05    525.61    488.76    21.16    -7.31238%
  numa05.sh      Sys:   296.76    389.72    345.10    40.73    -13.1933%
  numa05.sh      User:  35920.56  39112.35  38022.97  1151.19  -5.18642%


vmstat data for perf bench numa01
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        8094646     6939950     -14.2649%
numa_hint_faults_local  4327343     3249221     -24.9142%
numa_hit                1550444     1490388     -3.87347%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              1550404     1490347     -3.87364%
numa_miss               0           0           NA
numa_other              40          41          2.5%
numa_pages_migrated     777894      731760      -5.93063%
numa_pte_updates        8103835     6945270     -14.2965%
pgfault                 9504158     8321001     -12.4488%
pgmajfault              277         267         -3.61011%
pgmigrate_fail          7           12          71.4286%
pgmigrate_success       777894      731760      -5.93063%


vmstat data for perf bench numa02
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        698918      733902      5.00545%
numa_hint_faults_local  553784      562257      1.53002%
numa_hit                473790      466220      -1.59775%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              473785      466216      -1.59756%
numa_miss               0           0           NA
numa_other              5           4           -20%
numa_pages_migrated     136492      134423      -1.51584%
numa_pte_updates        714710      749458      4.86183%
pgfault                 1186082     1218861     2.76364%
pgmajfault              155         156         0.645161%
pgmigrate_fail          0           0           NA
pgmigrate_success       136492      134423      -1.51584%


vmstat data for perf bench numa03
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        4525956     3520293     -22.2199%
numa_hint_faults_local  1749531     1319966     -24.5532%
numa_hit                914257      778437      -14.8558%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              914232      778416      -14.8557%
numa_miss               0           0           NA
numa_other              25          21          -16%
numa_pages_migrated     428367      315482      -26.3524%
numa_pte_updates        4536083     3524701     -22.2964%
pgfault                 5578522     4509129     -19.1698%
pgmajfault              202         197         -2.47525%
pgmigrate_fail          22          19          -13.6364%
pgmigrate_success       428367      315482      -26.3524%


vmstat data for perf bench numa04
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        13634799    13598386    -0.267059%
numa_hint_faults_local  8473822     8800575     3.85603%
numa_hit                2604435     2456795     -5.66879%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              2604411     2456779     -5.66854%
numa_miss               0           0           NA
numa_other              24          16          -33.3333%
numa_pages_migrated     1414750     1280512     -9.48846%
numa_pte_updates        13678739    13627859    -0.371964%
pgfault                 15363067    15317090    -0.29927%
pgmajfault              197         237         20.3046%
pgmigrate_fail          30          27          -10%
pgmigrate_success       1414750     1280512     -9.48846%


vmstat data for perf bench numa05
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        23324034    25274343    8.3618%
numa_hint_faults_local  18759362    18625813    -0.711906%
numa_hit                3944010     4235082     7.3801%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              3943994     4235068     7.38018%
numa_miss               0           0           NA
numa_other              16          14          -12.5%
numa_pages_migrated     1785980     2072221     16.0271%
numa_pte_updates        23411591    25325473    8.17493%
pgfault                 26024350    28065879    7.84469%
pgmajfault              233         239         2.57511%
pgmigrate_fail          52          82          57.6923%
pgmigrate_success       1785980     2072221     16.0271%


Mel Gorman (1):
  sched/numa: Limit the conditions where scan period is reset

Srikar Dronamraju (5):
  sched/numa: Stop multiple tasks from moving to the cpu at the same
    time
  mm/migrate: Use trylock while resetting rate limit
  sched/numa: Avoid task migration for small numa improvement
  sched/numa: Pass destination cpu as a parameter to migrate_task_rq
  sched/numa: Reset scan rate whenever task moves across nodes

 kernel/sched/core.c     |  2 +-
 kernel/sched/deadline.c |  2 +-
 kernel/sched/fair.c     | 87 ++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h    |  3 +-
 mm/migrate.c            | 16 ++++++---
 5 files changed, 91 insertions(+), 19 deletions(-)

-- 
1.8.3.1



* [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time
  2018-08-03  6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
@ 2018-08-03  6:13 ` Srikar Dronamraju
  2018-09-10  8:42   ` Ingo Molnar
  2018-08-03  6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03  6:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Task migration under numa balancing can happen in parallel. More than
one task might choose to migrate to the same cpu at the same time. This
can result in:
- During task swap, choosing a task that was not part of the evaluation.
- During task swap, a task which just got moved to its preferred node
  moving to a completely different node.
- During task swap, a task failing to move to its preferred node having
  to wait an extra interval for the next migration opportunity.
- During task movement, multiple simultaneous movements causing load
  imbalance.

This problem is more likely if there are more cores per node or more
nodes in the system.

Use a per-run-queue variable to track whether numa balancing is active
on the run-queue.

specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS  Prev    Current  %Change
4     199709  206350   3.32534
1     330830  319963   -3.28477


on 2 Socket/4 Node Power8 (PowerNV)
JVMS  Prev     Current  %Change
8     89011.9  89627.8  0.69193
1     218946   211338   -3.47483


on 2 Socket/2 Node Power9 (PowerNV)
JVMS  Prev    Current  %Change
4     180473  186539   3.36117
1     212805  220344   3.54268


on 4 Socket/4 Node Power7
JVMS  Prev     Current  %Change
8     56941.8  56836    -0.185804
1     111686   112970   1.14965


dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count  Min      Max      Avg      Variance  %Change
5      12029.8  12124.6  12060.9  34.0076
5      13136.1  13170.2  13150.2  14.7482   9.03166


on 2 Socket/4 Node Power8 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      4968.51  5006.62  4981.31  13.4151
5      4319.79  4998.19  4836.53  261.109   -2.90646


on 2 Socket/2 Node Power9 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      9342.92  9381.44  9363.92  12.8587
5      9325.56  9402.7   9362.49  25.9638   -0.0152714


on 4 Socket/4 Node Power7
count  Min      Max      Avg      Variance  %Change
5      143.4    188.892  170.225  16.9929
5      132.581  191.072  170.554  21.6444   0.193274

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2->v3:
Add comments as requested by Peter.

 kernel/sched/fair.c  | 22 ++++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 23 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..5cf921a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1514,6 +1514,21 @@ struct task_numa_env {
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
+	struct rq *rq = cpu_rq(env->dst_cpu);
+
+	/* Bail out if run-queue part of active numa balance. */
+	if (xchg(&rq->numa_migrate_on, 1))
+		return;
+
+	/*
+	 * Clear previous best_cpu/rq numa-migrate flag, since task now
+	 * found a better cpu to move/swap.
+	 */
+	if (env->best_cpu != -1) {
+		rq = cpu_rq(env->best_cpu);
+		WRITE_ONCE(rq->numa_migrate_on, 0);
+	}
+
 	if (env->best_task)
 		put_task_struct(env->best_task);
 	if (p)
@@ -1569,6 +1584,9 @@ static void task_numa_compare(struct task_numa_env *env,
 	long moveimp = imp;
 	int dist = env->dist;
 
+	if (READ_ONCE(dst_rq->numa_migrate_on))
+		return;
+
 	rcu_read_lock();
 	cur = task_rcu_dereference(&dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1710,6 +1728,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1,
 	};
 	struct sched_domain *sd;
+	struct rq *best_rq;
 	unsigned long taskweight, groupweight;
 	int nid, ret, dist;
 	long taskimp, groupimp;
@@ -1811,14 +1830,17 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	p->numa_scan_period = task_scan_start(p);
 
+	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
 		ret = migrate_task_to(p, env.best_cpu);
+		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
 	}
 
 	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+	WRITE_ONCE(best_rq->numa_migrate_on, 0);
 
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8ca..0b91612 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,6 +783,7 @@ struct rq {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int		nr_numa_running;
 	unsigned int		nr_preferred_running;
+	unsigned int		numa_migrate_on;
 #endif
 	#define CPU_LOAD_IDX_MAX 5
 	unsigned long		cpu_load[CPU_LOAD_IDX_MAX];
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit
  2018-08-03  6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
  2018-08-03  6:13 ` [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
@ 2018-08-03  6:13 ` Srikar Dronamraju
  2018-09-06 11:48   ` Peter Zijlstra
  2018-09-10  8:39   ` Ingo Molnar
  2018-08-03  6:13 ` [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement Srikar Dronamraju
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03  6:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Since this spinlock only serializes migrate rate limiting, convert it
to a trylock. If another task races ahead of this task, this task can
simply move on.

While here, correct two abnormalities:
- Avoid the rate-limit window being stretched on every interval.
- Use READ_ONCE/WRITE_ONCE when accessing the next window.

specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS  Prev    Current  %Change
4     206350  200892   -2.64502
1     319963  325766   1.81365


on 2 Socket/2 Node Power9 (PowerNV)
JVMS  Prev    Current  %Change
4     186539  190261   1.99529
1     220344  195305   -11.3636


on 4 Socket/4 Node Power7
JVMS  Prev    Current  %Change
8     56836   57651.1  1.43413
1     112970  111351   -1.43312


dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count  Min      Max      Avg      Variance  %Change
5      13136.1  13170.2  13150.2  14.7482
5      12254.7  12331.9  12297.8  28.1846   -6.48203


on 2 Socket/4 Node Power8 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      4319.79  4998.19  4836.53  261.109
5      4997.83  5030.14  5015.54  12.947    3.70121


on 2 Socket/2 Node Power9 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      9325.56  9402.7   9362.49  25.9638
5      9331.84  9375.11  9352.04  16.0703   -0.111616


on 4 Socket/4 Node Power7
count  Min      Max      Avg      Variance  %Change
5      132.581  191.072  170.554  21.6444
5      147.55   181.605  168.963  11.3513   -0.932842

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
---
Changelog v1->v2:
Fix stretch every interval pointed by Peter Zijlstra.
Verified that some of the regression is due to fixing interval stretch.

 mm/migrate.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f..dbc2cb7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1868,16 +1868,24 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
 					unsigned long nr_pages)
 {
+	unsigned long next_window, interval;
+
+	next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
+	interval = msecs_to_jiffies(migrate_interval_millisecs);
+
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
 	 */
-	if (time_after(jiffies, next_window)) {
-		spin_lock(&pgdat->numabalancing_migrate_lock);
+	if (time_after(jiffies, next_window) &&
+			spin_trylock(&pgdat->numabalancing_migrate_lock)) {
 		pgdat->numabalancing_migrate_nr_pages = 0;
-		pgdat->numabalancing_migrate_next_window = jiffies +
-			msecs_to_jiffies(migrate_interval_millisecs);
+		do {
+			next_window += interval;
+		} while (unlikely(time_after(jiffies, next_window)));
+
+		WRITE_ONCE(pgdat->numabalancing_migrate_next_window, next_window);
 		spin_unlock(&pgdat->numabalancing_migrate_lock);
 	}
 	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement
  2018-08-03  6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
  2018-08-03  6:13 ` [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
  2018-08-03  6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
@ 2018-08-03  6:13 ` Srikar Dronamraju
  2018-09-10  8:46   ` Ingo Molnar
  2018-08-03  6:13 ` [PATCH 4/6] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03  6:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

If the numa improvement from a task migration is going to be very
minimal, avoid the task migration.

specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS  Prev    Current  %Change
4     200892  210118   4.59252
1     325766  313171   -3.86627


on 2 Socket/4 Node Power8 (PowerNV)
JVMS  Prev     Current  %Change
8     89011.9  91027.5  2.26442
1     211338   216460   2.42361


on 2 Socket/2 Node Power9 (PowerNV)
JVMS  Prev    Current  %Change
4     190261  191918   0.870909
1     195305  207043   6.01009


on 4 Socket/4 Node Power7
JVMS  Prev     Current  %Change
8     57651.1  58462.1  1.40674
1     111351   108334   -2.70945


dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count  Min      Max      Avg      Variance  %Change
5      12254.7  12331.9  12297.8  28.1846
5      11851.8  11937.3  11890.9  33.5169   -3.30872


on 2 Socket/4 Node Power8 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      4997.83  5030.14  5015.54  12.947
5      4791     5016.08  4962.55  85.9625   -1.05652


on 2 Socket/2 Node Power9 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      9331.84  9375.11  9352.04  16.0703
5      9353.43  9380.49  9369.6   9.04361   0.187767


on 4 Socket/4 Node Power7
count  Min      Max      Avg      Variance  %Change
5      147.55   181.605  168.963  11.3513
5      149.518  215.412  179.083  21.5903   5.98948

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
 - Handle trivial changes due to variable name change. (Rik Van Riel)
 - Drop changes where subsequent better cpu find was rejected for
   small numa improvement (Rik Van Riel).

 kernel/sched/fair.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5cf921a..a717870 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1568,6 +1568,13 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 }
 
 /*
+ * Maximum numa importance can be 1998 (2*999);
+ * SMALLIMP @ 30 would be close to 1998/64.
+ * Used to deter task migration.
+ */
+#define SMALLIMP	30
+
+/*
  * This checks if the overall compute and NUMA accesses of the system would
  * be improved if the source tasks was migrated to the target dst_cpu taking
  * into account that it might be best if task running on the dst_cpu should
@@ -1600,7 +1607,7 @@ static void task_numa_compare(struct task_numa_env *env,
 		goto unlock;
 
 	if (!cur) {
-		if (maymove || imp > env->best_imp)
+		if (maymove && moveimp >= env->best_imp)
 			goto assign;
 		else
 			goto unlock;
@@ -1643,16 +1650,22 @@ static void task_numa_compare(struct task_numa_env *env,
 			       task_weight(cur, env->dst_nid, dist);
 	}
 
-	if (imp <= env->best_imp)
-		goto unlock;
-
 	if (maymove && moveimp > imp && moveimp > env->best_imp) {
-		imp = moveimp - 1;
+		imp = moveimp;
 		cur = NULL;
 		goto assign;
 	}
 
 	/*
+	 * If the numa importance is less than SMALLIMP,
+	 * task migration might only result in ping pong
+	 * of tasks and also hurt performance due to cache
+	 * misses.
+	 */
+	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
+		goto unlock;
+
+	/*
 	 * In the overloaded case, try and keep the load balanced.
 	 */
 	load = task_h_load(env->p) - task_h_load(cur);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/6] sched/numa: Pass destination cpu as a parameter to migrate_task_rq
  2018-08-03  6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2018-08-03  6:13 ` [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement Srikar Dronamraju
@ 2018-08-03  6:13 ` Srikar Dronamraju
  2018-08-03  6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03  6:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

The additional parameter (new_cpu) is used by a subsequent patch to
identify whether a task migration is across nodes.

No functional change.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/core.c     | 2 +-
 kernel/sched/deadline.c | 2 +-
 kernel/sched/fair.c     | 2 +-
 kernel/sched/sched.h    | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index deafa9f..fdab290 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1167,7 +1167,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 
 	if (task_cpu(p) != new_cpu) {
 		if (p->sched_class->migrate_task_rq)
-			p->sched_class->migrate_task_rq(p);
+			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		rseq_migrate(p);
 		perf_event_task_migrate(p);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 997ea7b..91e4202 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1607,7 +1607,7 @@ static void yield_task_dl(struct rq *rq)
 	return cpu;
 }
 
-static void migrate_task_rq_dl(struct task_struct *p)
+static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused)
 {
 	struct rq *rq;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a717870..a5936ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6308,7 +6308,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous CPU. The caller guarantees p->pi_lock or task_rq(p)->lock is held.
  */
-static void migrate_task_rq_fair(struct task_struct *p)
+static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unused)
 {
 	/*
 	 * As blocked tasks retain absolute vruntime the migration needs to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b91612..455fa33 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1524,7 +1524,7 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
-	void (*migrate_task_rq)(struct task_struct *p);
+	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
 	void (*task_woken)(struct rq *this_rq, struct task_struct *task);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes
  2018-08-03  6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2018-08-03  6:13 ` [PATCH 4/6] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
@ 2018-08-03  6:14 ` Srikar Dronamraju
  2018-09-10  8:48   ` Ingo Molnar
  2018-08-03  6:14 ` [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset Srikar Dronamraju
  2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
  6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03  6:14 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Currently the task scan rate is reset when the numa balancer migrates
a task to a different node. If the numa balancer initiates a swap, the
reset only applies to the task that initiated the swap. Similarly, no
scan rate reset is done if the task is migrated across nodes by the
traditional load balancer.

Instead, move the scan reset to migrate_task_rq. This ensures a task
moved out of its preferred node either gets back to its preferred node
quickly or finds a new preferred node. Doing so is fair to all tasks
migrating across nodes.

specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS  Prev    Current  %Change
4     210118  208862   -0.597759
1     313171  307007   -1.96825


on 2 Socket/4 Node Power8 (PowerNV)
JVMS  Prev     Current  %Change
8     91027.5  89911.4  -1.22611
1     216460   216176   -0.131202


on 2 Socket/2 Node Power9 (PowerNV)
JVMS  Prev    Current  %Change
4     191918  196078   2.16759
1     207043  214664   3.68088


on 4 Socket/4 Node Power7
JVMS  Prev     Current  %Change
8     58462.1  60719.2  3.86079
1     108334   112615   3.95167


dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count  Min      Max      Avg      Variance  %Change
5      11851.8  11937.3  11890.9  33.5169
5      12511.7  12559.4  12539.5  15.5883   5.45459


on 2 Socket/4 Node Power8 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      4791     5016.08  4962.55  85.9625
5      4709.28  4979.28  4919.32  105.126   -0.871125


on 2 Socket/2 Node Power9 (PowerNV)
count  Min      Max      Avg     Variance  %Change
5      9353.43  9380.49  9369.6  9.04361
5      9388.38  9406.29  9395.1  5.98959   0.272157


on 4 Socket/4 Node Power7
count  Min      Max      Avg      Variance  %Change
5      149.518  215.412  179.083  21.5903
5      157.71   184.929  174.754  10.7275   -2.41731

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a5936ed..4ea0eff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1837,12 +1837,6 @@ static int task_numa_migrate(struct task_struct *p)
 	if (env.best_cpu == -1)
 		return -EAGAIN;
 
-	/*
-	 * Reset the scan period if the task is being rescheduled on an
-	 * alternative node to recheck if the tasks is now properly placed.
-	 */
-	p->numa_scan_period = task_scan_start(p);
-
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
 		ret = migrate_task_to(p, env.best_cpu);
@@ -6361,6 +6355,19 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
 
 	/* We have migrated, no longer consider this task hot */
 	p->se.exec_start = 0;
+
+#ifdef CONFIG_NUMA_BALANCING
+	if (!p->mm || (p->flags & PF_EXITING))
+		return;
+
+	if (p->numa_faults) {
+		int src_nid = cpu_to_node(task_cpu(p));
+		int dst_nid = cpu_to_node(new_cpu);
+
+		if (src_nid != dst_nid)
+			p->numa_scan_period = task_scan_start(p);
+	}
+#endif
 }
 
 static void task_dead_fair(struct task_struct *p)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset
  2018-08-03  6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
                   ` (4 preceding siblings ...)
  2018-08-03  6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
@ 2018-08-03  6:14 ` Srikar Dronamraju
  2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
  6 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03  6:14 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

From: Mel Gorman <mgorman@techsingularity.net>

migrate_task_rq_fair resets the scan rate for NUMA balancing on every
cross-node migration. In the event of excessive load balancing due to
saturation, this may result in the scan rate being pegged at maximum and
further overloading the machine.

This patch only resets the scan if NUMA balancing is active, a preferred
node has been selected, and the task is being migrated from the preferred
node, as these migrations are the most harmful. For example, a migration
to the preferred
node does not justify a faster scan rate. Similarly, a migration between two
nodes that are not preferred is probably bouncing due to over-saturation of
the machine.  In that case, scanning faster and trapping more NUMA faults
will further overload the machine.

specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS  Prev    Current  %Change
4     208862  209029   0.0799571
1     307007  326585   6.37705


on 2 Socket/4 Node Power8 (PowerNV)
JVMS  Prev     Current  %Change
8     89911.4  89627.8  -0.315422
1     216176   221299   2.36983


on 2 Socket/2 Node Power9 (PowerNV)
JVMS  Prev    Current  %Change
4     196078  195444   -0.323341
1     214664  222390   3.59911


on 4 Socket/4 Node Power7
JVMS  Prev     Current  %Change
8     60719.2  60152.4  -0.933477
1     112615   111458   -1.02739


dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count  Min      Max      Avg      Variance  %Change
5      12511.7  12559.4  12539.5  15.5883
5      12904.6  12969    12942.6  23.9053   3.21464


on 2 Socket/4 Node Power8 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      4709.28  4979.28  4919.32  105.126
5      4984.25  5025.95  5004.5   14.2253   1.73154


on 2 Socket/2 Node Power9 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      9388.38  9406.29  9395.1   5.98959
5      9277.64  9357.22  9322.07  26.3558   -0.77732


on 4 Socket/4 Node Power7
count  Min      Max      Avg      Variance  %Change
5      157.71   184.929  174.754  10.7275
5      160.632  175.558  168.655  5.26823   -3.49005


Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ea0eff..6e251e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6357,6 +6357,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
 	p->se.exec_start = 0;
 
 #ifdef CONFIG_NUMA_BALANCING
+	if (!static_branch_likely(&sched_numa_balancing))
+		return;
+
 	if (!p->mm || (p->flags & PF_EXITING))
 		return;
 
@@ -6364,8 +6367,26 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
 		int src_nid = cpu_to_node(task_cpu(p));
 		int dst_nid = cpu_to_node(new_cpu);
 
-		if (src_nid != dst_nid)
-			p->numa_scan_period = task_scan_start(p);
+		if (src_nid == dst_nid)
+			return;
+
+		/*
+		 * Allow resets if faults have been trapped before one scan
+		 * has completed. This is most likely due to a new task that
+		 * is pulled cross-node due to wakeups or load balancing.
+		 */
+		if (p->numa_scan_seq) {
+			/*
+			 * Avoid scan adjustments if moving to the preferred
+			 * node or if the task was not previously running on
+			 * the preferred node.
+			 */
+			if (dst_nid == p->numa_preferred_nid ||
+			    (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid))
+				return;
+		}
+
+		p->numa_scan_period = task_scan_start(p);
 	}
 #endif
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/6] numa-balancing patches
  2018-08-03  6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
                   ` (5 preceding siblings ...)
  2018-08-03  6:14 ` [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset Srikar Dronamraju
@ 2018-08-21 12:01 ` Srikar Dronamraju
  2018-09-06 12:17   ` Peter Zijlstra
  6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-21 12:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-08-03 11:43:55]:

> This patchset based on current tip/sched/core, provides left out patches
> from the previous series. This version handles the comments given to some of
> the patches. It drops "sched/numa: Restrict migrating in parallel to the same
> node." It adds an additional patch from Mel Gorman.
> It also provides specjbb2005 /dbench/ perf bench numa numbers on a patch
> basis on 4 node and 2 node systems.
> 
> v2: http://lkml.kernel.org/r/1529514181-9842-1-git-send-email-srikar@linux.vnet.ibm.com
> v1: http://lkml.kernel.org/r/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com
> 

Hi Peter, Ingo

Can you please respond with your comments and suggestions?

Mel and Rik have acked most of the patches.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit
  2018-08-03  6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
@ 2018-09-06 11:48   ` Peter Zijlstra
  2018-09-10  8:39   ` Ingo Molnar
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2018-09-06 11:48 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Fri, Aug 03, 2018 at 11:43:57AM +0530, Srikar Dronamraju wrote:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8c0af0f..dbc2cb7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1868,16 +1868,24 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
>  static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
>  					unsigned long nr_pages)
>  {
> +	unsigned long next_window, interval;
> +
> +	next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
> +	interval = msecs_to_jiffies(migrate_interval_millisecs);
> +
>  	/*
>  	 * Rate-limit the amount of data that is being migrated to a node.
>  	 * Optimal placement is no good if the memory bus is saturated and
>  	 * all the time is being spent migrating!
>  	 */
> -	if (time_after(jiffies, next_window)) {
> -		spin_lock(&pgdat->numabalancing_migrate_lock);
> +	if (time_after(jiffies, next_window) &&
> +			spin_trylock(&pgdat->numabalancing_migrate_lock)) {

This patch doesn't apply cleanly; also you introduce @next_window with
this patch, so how can it remove a user of it?

I fixed this up, still weird.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/6] numa-balancing patches
  2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
@ 2018-09-06 12:17   ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2018-09-06 12:17 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Tue, Aug 21, 2018 at 05:01:50AM -0700, Srikar Dronamraju wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-08-03 11:43:55]:
> 
> > This patchset based on current tip/sched/core, provides left out patches
> > from the previous series. This version handles the comments given to some of
> > the patches. It drops "sched/numa: Restrict migrating in parallel to the same
> > node." It adds an additional patch from Mel Gorman.
> > It also provides specjbb2005 /dbench/ perf bench numa numbers on a patch
> > basis on 4 node and 2 node systems.
> > 
> > v2: http://lkml.kernel.org/r/1529514181-9842-1-git-send-email-srikar@linux.vnet.ibm.com
> > v1: http://lkml.kernel.org/r/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com
> > 
> 
> Hi Peter, Ingo
> 
> Can you please respond with your comments suggestions.
> 
> Mel and Rik have acked most of the patches.

The patches do not contain a single ack.

But I picked them up now. Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit
  2018-08-03  6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
  2018-09-06 11:48   ` Peter Zijlstra
@ 2018-09-10  8:39   ` Ingo Molnar
  1 sibling, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10  8:39 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> Since this spinlock only serializes migrate rate limiting, convert it
> to a trylock. If another task races ahead of this task, this task can
> simply move on.
> 
> While here, correct two abnormalities:
> - Avoid the rate-limit window being stretched on every interval.
> - Use READ_ONCE/WRITE_ONCE when accessing the next window.
> 
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS  Prev    Current  %Change
> 4     206350  200892   -2.64502
> 1     319963  325766   1.81365
> 
> 
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS  Prev    Current  %Change
> 4     186539  190261   1.99529
> 1     220344  195305   -11.3636
> 
> 
> on 4 Socket/4 Node Power7
> JVMS  Prev    Current  %Change
> 8     56836   57651.1  1.43413
> 1     112970  111351   -1.43312
> 
> 
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count  Min      Max      Avg      Variance  %Change
> 5      13136.1  13170.2  13150.2  14.7482
> 5      12254.7  12331.9  12297.8  28.1846   -6.48203
> 
> 
> on 2 Socket/4 Node Power8 (PowerNV)
> count  Min      Max      Avg      Variance  %Change
> 5      4319.79  4998.19  4836.53  261.109
> 5      4997.83  5030.14  5015.54  12.947    3.70121
> 
> 
> on 2 Socket/2 Node Power9 (PowerNV)
> count  Min      Max      Avg      Variance  %Change
> 5      9325.56  9402.7   9362.49  25.9638
> 5      9331.84  9375.11  9352.04  16.0703   -0.111616
> 
> 
> on 4 Socket/4 Node Power7
> count  Min      Max      Avg      Variance  %Change
> 5      132.581  191.072  170.554  21.6444
> 5      147.55   181.605  168.963  11.3513   -0.932842

Firstly, *please* always characterize benchmark runs. What did you find? How should we 
interpret the result? Are there any tradeoffs?

*Don't* just dump them on us.

Because in this particular case the results are not obvious, at all:

> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS  Prev    Current  %Change
> 4     206350  200892   -2.64502
> 1     319963  325766   1.81365
> 
> 
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS  Prev    Current  %Change
> 4     186539  190261   1.99529
> 1     220344  195305   -11.3636
> 
> 
> on 4 Socket/4 Node Power7
> JVMS  Prev    Current  %Change
> 8     56836   57651.1  1.43413
> 1     112970  111351   -1.43312

Why is this better? The largest drop is 11% which seems significant.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time
  2018-08-03  6:13 ` [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
@ 2018-09-10  8:42   ` Ingo Molnar
  0 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10  8:42 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> Task migration under numa balancing can happen in parallel. More than
> one task might choose to migrate to the same cpu at the same time. This
> can result in
> - During task swap, choosing a task that was not part of the evaluation.
> - During task swap, task which just got moved into its preferred node,
>   moving to a completely different node.
> - During task swap, task failing to move to the preferred node, will have
>   to wait an extra interval for the next migrate opportunity.
> - During task movement, multiple task movements can cause load imbalance.

Please capitalize both 'CPU' and 'NUMA' in changelogs and code comments.

> This problem is more likely if there are more cores per node or more
> nodes in the system.
> 
> Use a per run-queue variable to check if numa-balance is active on the
> run-queue.
> 
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS  Prev    Current  %Change
> 4     199709  206350   3.32534
> 1     330830  319963   -3.28477
> 
> 
> on 2 Socket/4 Node Power8 (PowerNV)
> JVMS  Prev     Current  %Change
> 8     89011.9  89627.8  0.69193
> 1     218946   211338   -3.47483
> 
> 
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS  Prev    Current  %Change
> 4     180473  186539   3.36117
> 1     212805  220344   3.54268
> 
> 
> on 4 Socket/4 Node Power7
> JVMS  Prev     Current  %Change
> 8     56941.8  56836    -0.185804
> 1     111686   112970   1.14965
> 
> 
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count  Min      Max      Avg      Variance  %Change
> 5      12029.8  12124.6  12060.9  34.0076
> 5      13136.1  13170.2  13150.2  14.7482   9.03166
> 
> 
> on 2 Socket/4 Node Power8 (PowerNV)
> count  Min      Max      Avg      Variance  %Change
> 5      4968.51  5006.62  4981.31  13.4151
> 5      4319.79  4998.19  4836.53  261.109   -2.90646
> 
> 
> on 2 Socket/2 Node Power9 (PowerNV)
> count  Min      Max      Avg      Variance  %Change
> 5      9342.92  9381.44  9363.92  12.8587
> 5      9325.56  9402.7   9362.49  25.9638   -0.0152714
> 
> 
> on 4 Socket/4 Node Power7
> count  Min      Max      Avg      Variance  %Change
> 5      143.4    188.892  170.225  16.9929
> 5      132.581  191.072  170.554  21.6444   0.193274

I have applied this patch, but the zero-comment benchmark dump is annoying, as the numbers do 
not show unconditional advantages: there are some increases in performance and some regressions. 

In particular this:

> dbench / transactions / higher numbers are better
> on 2 Socket/4 Node Power8 (PowerNV)
> count  Min      Max      Avg      Variance  %Change
> 5      4968.51  5006.62  4981.31  13.4151
> 5      4319.79  4998.19  4836.53  261.109   -2.90646

is concerning: not only did we lose some performance, but variance also went up by a *lot*. Is 
this just a measurement fluke? We cannot know, and you didn't comment.

Thanks,

	Ingo


* Re: [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement
  2018-08-03  6:13 ` [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement Srikar Dronamraju
@ 2018-09-10  8:46   ` Ingo Molnar
  2018-09-12 15:17     ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10  8:46 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> If numa improvement from the task migration is going to be very
> minimal, then avoid task migration.
> 
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS  Prev    Current  %Change
> 4     200892  210118   4.59252
> 1     325766  313171   -3.86627
> 
> 
> on 2 Socket/4 Node Power8 (PowerNV)
> JVMS  Prev     Current  %Change
> 8     89011.9  91027.5  2.26442
> 1     211338   216460   2.42361
> 
> 
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS  Prev    Current  %Change
> 4     190261  191918   0.870909
> 1     195305  207043   6.01009
> 
> 
> on 4 Socket/4 Node Power7
> JVMS  Prev     Current  %Change
> 8     57651.1  58462.1  1.40674
> 1     111351   108334   -2.70945
> 
> 
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count  Min      Max      Avg      Variance  %Change
> 5      12254.7  12331.9  12297.8  28.1846
> 5      11851.8  11937.3  11890.9  33.5169   -3.30872
> 
> 
> on 2 Socket/4 Node Power8 (PowerNV)
> count  Min      Max      Avg      Variance  %Change
> 5      4997.83  5030.14  5015.54  12.947
> 5      4791     5016.08  4962.55  85.9625   -1.05652
> 
> 
> on 2 Socket/2 Node Power9 (PowerNV)
> count  Min      Max      Avg      Variance  %Change
> 5      9331.84  9375.11  9352.04  16.0703
> 5      9353.43  9380.49  9369.6   9.04361   0.187767
> 
> 
> on 4 Socket/4 Node Power7
> count  Min      Max      Avg      Variance  %Change
> 5      147.55   181.605  168.963  11.3513
> 5      149.518  215.412  179.083  21.5903   5.98948
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v1->v2:
>  - Handle trivial changes due to variable name change. (Rik Van Riel)
>  - Drop changes where subsequent better cpu find was rejected for
>    small numa improvement (Rik Van Riel).
> 
>  kernel/sched/fair.c | 23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5cf921a..a717870 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1568,6 +1568,13 @@ static bool load_too_imbalanced(long src_load, long dst_load,
>  }
>  
>  /*
> + * Maximum numa importance can be 1998 (2*999);
> + * SMALLIMP @ 30 would be close to 1998/64.
> + * Used to deter task migration.
> + */
> +#define SMALLIMP	30
> +
> +/*
>   * This checks if the overall compute and NUMA accesses of the system would
>   * be improved if the source tasks was migrated to the target dst_cpu taking
>   * into account that it might be best if task running on the dst_cpu should
> @@ -1600,7 +1607,7 @@ static void task_numa_compare(struct task_numa_env *env,
>  		goto unlock;
>  
>  	if (!cur) {
> -		if (maymove || imp > env->best_imp)
> +		if (maymove && moveimp >= env->best_imp)
>  			goto assign;
>  		else
>  			goto unlock;
> @@ -1643,16 +1650,22 @@ static void task_numa_compare(struct task_numa_env *env,
>  			       task_weight(cur, env->dst_nid, dist);
>  	}
>  
> -	if (imp <= env->best_imp)
> -		goto unlock;
> -
>  	if (maymove && moveimp > imp && moveimp > env->best_imp) {
> -		imp = moveimp - 1;
> +		imp = moveimp;
>  		cur = NULL;
>  		goto assign;
>  	}
>  
>  	/*
> +	 * If the numa importance is less than SMALLIMP,
> +	 * task migration might only result in ping pong
> +	 * of tasks and also hurt performance due to cache
> +	 * misses.
> +	 */
> +	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> +		goto unlock;
> +
> +	/*
>  	 * In the overloaded case, try and keep the load balanced.
>  	 */
>  	load = task_h_load(env->p) - task_h_load(cur);

So what is this 'NUMA importance'? Seems just like a random parameter which generally isn't a 
good idea.

Also, same review feedback as I gave for the previous patches.

Thanks,

	Ingo


* Re: [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes
  2018-08-03  6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
@ 2018-09-10  8:48   ` Ingo Molnar
  2018-09-12 15:19     ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10  8:48 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> Currently task scan rate is reset when numa balancer migrates the task
> to a different node. If numa balancer initiates a swap, reset is only
> applicable to the task that initiates the swap. Similarly no scan rate
> reset is done if the task is migrated across nodes by traditional load
> balancer.
> 
> Instead move the scan reset to the migrate_task_rq. This ensures the
> task moved out of its preferred node, either gets back to its preferred
> node quickly or finds a new preferred node. Doing so, would be fair to
> all tasks migrating across nodes.
> 
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS  Prev    Current  %Change
> 4     210118  208862   -0.597759
> 1     313171  307007   -1.96825
> 
> 
> on 2 Socket/4 Node Power8 (PowerNV)
> JVMS  Prev     Current  %Change
> 8     91027.5  89911.4  -1.22611
> 1     216460   216176   -0.131202
> 
> 
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS  Prev    Current  %Change
> 4     191918  196078   2.16759
> 1     207043  214664   3.68088
> 
> 
> on 4 Socket/4 Node Power7
> JVMS  Prev     Current  %Change
> 8     58462.1  60719.2  3.86079
> 1     108334   112615   3.95167
> 
> 
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count  Min      Max      Avg      Variance  %Change
> 5      11851.8  11937.3  11890.9  33.5169
> 5      12511.7  12559.4  12539.5  15.5883   5.45459
> 
> 
> on 2 Socket/4 Node Power8 (PowerNV)
> count  Min      Max      Avg      Variance  %Change
> 5      4791     5016.08  4962.55  85.9625
> 5      4709.28  4979.28  4919.32  105.126   -0.871125
> 
> 
> on 2 Socket/2 Node Power9 (PowerNV)
> count  Min      Max      Avg     Variance  %Change
> 5      9353.43  9380.49  9369.6  9.04361
> 5      9388.38  9406.29  9395.1  5.98959   0.272157
> 
> 
> on 4 Socket/4 Node Power7
> count  Min      Max      Avg      Variance  %Change
> 5      149.518  215.412  179.083  21.5903
> 5      157.71   184.929  174.754  10.7275   -2.41731
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a5936ed..4ea0eff 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1837,12 +1837,6 @@ static int task_numa_migrate(struct task_struct *p)
>  	if (env.best_cpu == -1)
>  		return -EAGAIN;
>  
> -	/*
> -	 * Reset the scan period if the task is being rescheduled on an
> -	 * alternative node to recheck if the tasks is now properly placed.
> -	 */
> -	p->numa_scan_period = task_scan_start(p);
> -
>  	best_rq = cpu_rq(env.best_cpu);
>  	if (env.best_task == NULL) {
>  		ret = migrate_task_to(p, env.best_cpu);
> @@ -6361,6 +6355,19 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
>  
>  	/* We have migrated, no longer consider this task hot */
>  	p->se.exec_start = 0;
> +
> +#ifdef CONFIG_NUMA_BALANCING
> +	if (!p->mm || (p->flags & PF_EXITING))
> +		return;
> +
> +	if (p->numa_faults) {
> +		int src_nid = cpu_to_node(task_cpu(p));
> +		int dst_nid = cpu_to_node(new_cpu);
> +
> +		if (src_nid != dst_nid)
> +			p->numa_scan_period = task_scan_start(p);
> +	}
> +#endif

Please don't add #ifdeffery inside functions, especially not if they do weird flow control like 
a 'return' from the middle of a block.

A properly named inline helper would work, I suppose.

Thanks,

	Ingo


* Re: [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement
  2018-09-10  8:46   ` Ingo Molnar
@ 2018-09-12 15:17     ` Srikar Dronamraju
  0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 15:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

> >  
> >  /*
> > + * Maximum numa importance can be 1998 (2*999);
> > + * SMALLIMP @ 30 would be close to 1998/64.
> > + * Used to deter task migration.
> > + */
> > +#define SMALLIMP	30
> > +
> > +/*
> >  
> >  	/*
> > +	 * If the numa importance is less than SMALLIMP,
> > +	 * task migration might only result in ping pong
> > +	 * of tasks and also hurt performance due to cache
> > +	 * misses.
> > +	 */
> > +	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> > +		goto unlock;
> > +
> > +	/*
> >  	 * In the overloaded case, try and keep the load balanced.
> >  	 */
> >  	load = task_h_load(env->p) - task_h_load(cur);
> 
> So what is this 'NUMA importance'? Seems just like a random parameter which generally isn't a 
> good idea.
> 

I refer to the weight that is used to compare the suitability of a task to a node as NUMA
importance. It varies between -999 and 1000. This is not something that was introduced by
this patch; it was introduced as part of NUMA balancing a couple of years ago. group_imp,
task_imp and best_imp all refer to the NUMA importance. Maybe I am using the wrong term
here; maybe imp stands for something other than importance.

In this patch, we are trying to limit task migration when the NUMA importance is small,
i.e. if the NUMA importance of moving/swapping tasks is only 10, should we drop all cache
affinity for NUMA affinity? Maybe we need to wait for the trend to stabilize.

I have chosen 30 as the weight below which we refuse to consider NUMA importance. It's
based on the maximum NUMA importance / 64.
Please do suggest if you have a better method to limit task migrations for small NUMA
gains.




* Re: [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes
  2018-09-10  8:48   ` Ingo Molnar
@ 2018-09-12 15:19     ` Srikar Dronamraju
  0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 15:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

> > +#ifdef CONFIG_NUMA_BALANCING
> > +	if (!p->mm || (p->flags & PF_EXITING))
> > +		return;
> > +
> > +	if (p->numa_faults) {
> > +		int src_nid = cpu_to_node(task_cpu(p));
> > +		int dst_nid = cpu_to_node(new_cpu);
> > +
> > +		if (src_nid != dst_nid)
> > +			p->numa_scan_period = task_scan_start(p);
> > +	}
> > +#endif
> 
> Please don't add #ifdeffery inside functions, especially not if they do weird flow control like 
> a 'return' from the middle of a block.
> 
> A properly named inline helper would work I suppose.
> 

Okay, will take care.



end of thread, other threads:[~2018-09-12 15:19 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-03  6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
2018-08-03  6:13 ` [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
2018-09-10  8:42   ` Ingo Molnar
2018-08-03  6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
2018-09-06 11:48   ` Peter Zijlstra
2018-09-10  8:39   ` Ingo Molnar
2018-08-03  6:13 ` [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement Srikar Dronamraju
2018-09-10  8:46   ` Ingo Molnar
2018-09-12 15:17     ` Srikar Dronamraju
2018-08-03  6:13 ` [PATCH 4/6] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
2018-08-03  6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
2018-09-10  8:48   ` Ingo Molnar
2018-09-12 15:19     ` Srikar Dronamraju
2018-08-03  6:14 ` [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset Srikar Dronamraju
2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
2018-09-06 12:17   ` Peter Zijlstra
