* [RFC PATCH V1 0/1] sched/numa: Enhance vma scanning
@ 2023-01-16  1:35 Raghavendra K T
  2023-01-16  1:35 ` [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic Raghavendra K T
  2023-01-16  2:25 ` [RFC PATCH V1 0/1] sched/numa: Enhance vma scanning Raghavendra K T
  0 siblings, 2 replies; 16+ messages in thread
From: Raghavendra K T @ 2023-01-16  1:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	David Hildenbrand, xu xin, Yu Zhao, Colin Cross, Arnd Bergmann,
	Hugh Dickins, Bharata B Rao, Disha Talreja, Raghavendra K T

 This patchset proposes one of the enhancements to NUMA vma scanning
suggested by Mel.

The existing mechanism derives the scan period from per-thread stats.
Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats at the
per-process level to better capture application behaviour.

During the course of that discussion, Mel proposed several ideas to enhance
the current NUMA balancing. One of the suggestions was:

Track which threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately which
threads access a VMA. Skip VMAs that did not trap a fault. This would
be approximate because of PID collisions, but would reduce scanning of
areas the thread is not interested in. The suggestion intends not to
penalize threads that have no interest in the vma, and thus reduces
scanning overhead.

Approach in the patchset:

1) Track at most 4 threads that recently accessed the vma, and scan only if
the current thread is one of them. (Note: only an unsigned int is used;
experiments showed tracking 8 unique PIDs had more overhead.) A user-space
sketch of this encoding follows the list below.

2) For the first 2 scans, unconditionally allow threads to scan vmas, to
preserve the original intention of scanning.

3) If there are more than 4 threads (i.e. more than the PIDs we could
remember), allow scanning by default, because we might have missed recording
whether the current thread had any interest in the vma.
(A less accurate and debatable heuristic.)
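
As a rough illustration of the encoding, here is a minimal user-space
sketch (not the kernel patch itself; the names are hypothetical, and it
assumes 8-bit tags packed into a 32-bit unsigned int, mirroring the
kernel's LAST__PID_SHIFT/LAST__PID_MASK):

#include <stdbool.h>
#include <stdio.h>

#define PID_TAG_SHIFT	8				/* bits per PID tag */
#define PID_TAG_MASK	((1u << PID_TAG_SHIFT) - 1)	/* 0xff */
#define MAX_PID_TAGS	(32 / PID_TAG_SHIFT)		/* 4 tags per unsigned int */

struct vma_tags {
	unsigned int accessing_pids;	/* 4 packed 8-bit PID tags */
	int next_pid_slot;		/* round-robin slot to overwrite next */
};

/* Fault path: remember the low 8 bits of the faulting task's PID. */
static void record_pid(struct vma_tags *t, unsigned int pid)
{
	unsigned int tag = pid & PID_TAG_MASK;
	int i, slot;

	for (i = 0; i < MAX_PID_TAGS; i++)
		if (((t->accessing_pids >> (i * PID_TAG_SHIFT)) & PID_TAG_MASK) == tag)
			return;		/* already recorded, nothing to do */

	slot = t->next_pid_slot;
	t->next_pid_slot = (slot + 1) % MAX_PID_TAGS;
	t->accessing_pids &= ~(PID_TAG_MASK << (slot * PID_TAG_SHIFT));
	t->accessing_pids |= tag << (slot * PID_TAG_SHIFT);
}

/* Scan path: scan only if this task's tag is present or all slots are full. */
static bool should_scan(const struct vma_tags *t, unsigned int pid)
{
	unsigned int tag = pid & PID_TAG_MASK;
	int i;

	for (i = 0; i < MAX_PID_TAGS; i++) {
		unsigned int cur = (t->accessing_pids >> (i * PID_TAG_SHIFT)) &
				   PID_TAG_MASK;
		if (cur == tag)
			return true;
		if (cur == 0)
			return false;	/* empty slot: tag was never recorded */
	}
	return true;	/* all slots in use: tag may have been evicted, allow */
}

int main(void)
{
	struct vma_tags t = { 0, 0 };

	record_pid(&t, 0x1234);		/* recorded as tag 0x34 */
	printf("scan for 0x1234? %d\n", should_scan(&t, 0x1234));	/* 1 */
	printf("scan for 0x5678? %d\n", should_scan(&t, 0x5678));	/* 0 */
	return 0;
}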

With this patchset we see a considerable reduction in scanning overhead
(AutoNUMA cost), with some benchmarks improving performance and others
showing almost no regression.

Things to ponder over (and future TODOs):
==========================================
- Do we have to consider clearing a PID if the vma is not accessed "recently"?
- The current scan period is not changed in this patchset, so we still see
  frequent attempts to scan. Relaxing the scan period dynamically could
  improve results further.

Results Summary:
================
The results were obtained by running mmtests with the following configs:
config-workload-kernbench-max
config-io-dbench4-async
config-numa-autonumabench
config-hpc-nas-mpi-full
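
For reference, each config above is driven by mmtests' standard scripts
(the run names below are assumptions matching the report headers):

  $ ./run-mmtests.sh --config configs/config-workload-kernbench-max 6_1_base_config
  $ ./run-mmtests.sh --config configs/config-workload-kernbench-max 6_1_final_config
  $ cd work/log && ../../compare-kernels.sh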

There is a significant reduction in AutoNUMA cost.

SUT:
2 socket AMD Milan System
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2

256GB memory per socket amounting to 512GB in total
NPS1 NUMA configuration where each socket is a NUMA node

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 0 size: 257538 MB
node 0 free: 255739 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 1 size: 255978 MB
node 1 free: 249680 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10

Detailed Results:
(Note:
1. All rows with 0.00 and 100% on both sides are removed.
2. Some duplicate run/path info is trimmed.
3. SIS results omitted.)

kernbench
=============
                          6_1_base_config       6_1_final_config
                   workload-kernbench-max workload-kernbench-max
Min       user-256    22120.65 (   0.00%)    22477.15 (  -1.61%)
Min       syst-256     9975.63 (   0.00%)     8880.00 (  10.98%)
Min       elsp-256      161.24 (   0.00%)      157.45 (   2.35%)
Amean     user-256    22179.56 (   0.00%)    22558.57 *  -1.71%*
Amean     syst-256    10034.72 (   0.00%)     8913.83 *  11.17%*
Amean     elsp-256      161.52 (   0.00%)      157.69 *   2.37%*
Stddev    user-256      101.82 (   0.00%)       82.51 (  18.97%)
Stddev    syst-256       87.56 (   0.00%)       54.31 (  37.97%)
Stddev    elsp-256        0.35 (   0.00%)        0.27 (  24.12%)
CoeffVar  user-256        0.46 (   0.00%)        0.37 (  20.33%)
CoeffVar  syst-256        0.87 (   0.00%)        0.61 (  30.17%)
CoeffVar  elsp-256        0.22 (   0.00%)        0.17 (  22.28%)
Max       user-256    22297.13 (   0.00%)    22642.13 (  -1.55%)
Max       syst-256    10135.31 (   0.00%)     8976.48 (  11.43%)
Max       elsp-256      161.92 (   0.00%)      157.98 (   2.43%)
BAmean-50 user-256    22120.65 (   0.00%)    22477.15 (  -1.61%)
BAmean-50 syst-256     9975.63 (   0.00%)     8880.00 (  10.98%)
BAmean-50 elsp-256      161.24 (   0.00%)      157.45 (   2.35%)
BAmean-95 user-256    22120.77 (   0.00%)    22516.79 (  -1.79%)
BAmean-95 syst-256     9984.42 (   0.00%)     8882.51 (  11.04%)
BAmean-95 elsp-256      161.32 (   0.00%)      157.54 (   2.34%)
BAmean-99 user-256    22120.77 (   0.00%)    22516.79 (  -1.79%)
BAmean-99 syst-256     9984.42 (   0.00%)     8882.51 (  11.04%)
BAmean-99 elsp-256      161.32 (   0.00%)      157.54 (   2.34%)

                6_1_base_config         6_1_final_config
                workload-kernbench-max  workload-kernbench-max
Duration User       66548.20    67685.44
Duration System     30118.43    26756.19
Duration Elapsed      506.13      495.18

                                6_1_base_config         6_1_final_config
                                workload-kernbench-max  workload-kernbench-max
Ops Minor Faults                  1883340576.00  1883195171.00
Ops Major Faults                          88.00          42.00
Ops Normal allocs                 1743174614.00  1742916379.00
Ops Sector Reads                      173600.00       11928.00
Ops Sector Writes                   21099684.00    21480472.00
Ops Page migrate success              111429.00       73693.00
Ops Compaction cost                      115.66          76.49
Ops NUMA alloc hit                1738966281.00  1738692743.00
Ops NUMA alloc local              1738966104.00  1738692456.00
Ops NUMA base-page range updates      401910.00      322972.00
Ops NUMA PTE updates                  401910.00      322972.00
Ops NUMA hint faults                  112231.00       76987.00
Ops NUMA hint local faults %             802.00        3294.00
Ops NUMA hint local percent                0.71           4.28
Ops NUMA pages migrated               111429.00       73693.00
Ops AutoNUMA cost                        566.09         388.60

dbench
=========
Runtime options
Run: dbench -D [...]/testdisk/data/6_1_base_config-io-dbench4-async --warmup 0 -t 180 --loadfile [...]/sources/dbench-781852c2b38a-installed/share//client-tiny.txt --show-execute-time 1 2 4 ... 256
dbench4 Loadfile Execution Time

dbench4 Latency
                             6_1_base_config       6_1_final_config
                            io-dbench4-async       io-dbench4-async
Min       latency-1          0.37 (   0.00%)        0.35 (   4.09%)
Min       latency-2          0.38 (   0.00%)        0.40 (  -6.60%)
Min       latency-4          0.51 (   0.00%)        0.49 (   2.97%)
Min       latency-8          0.69 (   0.00%)        0.62 (   9.17%)
Min       latency-16         1.11 (   0.00%)        0.99 (  10.57%)
Min       latency-32         1.96 (   0.00%)        1.98 (  -0.66%)
Min       latency-64         5.73 (   0.00%)       32.03 (-458.86%)
Min       latency-128       17.60 (   0.00%)       17.79 (  -1.09%)
Min       latency-256       24.71 (   0.00%)       24.06 (   2.66%)
Amean     latency-1          2.18 (   0.00%)        0.46 *  78.72%*
Amean     latency-2          4.52 (   0.00%)        0.54 *  88.10%*
Amean     latency-4          9.99 (   0.00%)        0.97 *  90.34%*
Amean     latency-8         13.62 (   0.00%)        1.02 *  92.48%*
Amean     latency-16        14.11 (   0.00%)        4.07 *  71.16%*
Amean     latency-32        21.18 (   0.00%)       45.50 *-114.84%*
Amean     latency-64        61.19 (   0.00%)       58.78 *   3.95%*
Amean     latency-128       56.48 (   0.00%)       54.86 *   2.86%*
Amean     latency-256       81.08 (   0.00%)       80.03 *   1.30%*
Stddev    latency-1          6.85 (   0.00%)        0.17 (  97.56%)
Stddev    latency-2         10.51 (   0.00%)        0.19 (  98.19%)
Stddev    latency-4         14.51 (   0.00%)        1.61 (  88.91%)
Stddev    latency-8         15.71 (   0.00%)        2.04 (  87.00%)
Stddev    latency-16        16.36 (   0.00%)       10.60 (  35.21%)
Stddev    latency-32        19.95 (   0.00%)       25.89 ( -29.78%)
Stddev    latency-64        18.64 (   0.00%)       16.95 (   9.07%)
Stddev    latency-128       17.61 (   0.00%)       19.64 ( -11.52%)
Stddev    latency-256       32.23 (   0.00%)       32.94 (  -2.19%)
CoeffVar  latency-1        313.68 (   0.00%)       35.95 (  88.54%)
CoeffVar  latency-2        232.70 (   0.00%)       35.35 (  84.81%)
CoeffVar  latency-4        145.21 (   0.00%)      166.58 ( -14.72%)
CoeffVar  latency-8        115.34 (   0.00%)      199.35 ( -72.85%)
CoeffVar  latency-16       115.92 (   0.00%)      260.45 (-124.68%)
CoeffVar  latency-32        94.21 (   0.00%)       56.90 (  39.59%)
CoeffVar  latency-64        30.47 (   0.00%)       28.84 (   5.33%)
CoeffVar  latency-128       31.19 (   0.00%)       35.80 ( -14.80%)
CoeffVar  latency-256       39.75 (   0.00%)       41.15 (  -3.53%)
Max       latency-1         34.43 (   0.00%)        2.43 (  92.95%)
Max       latency-2         34.71 (   0.00%)        2.94 (  91.53%)
Max       latency-4         35.80 (   0.00%)       13.44 (  62.47%)
Max       latency-8         36.29 (   0.00%)       23.73 (  34.61%)
Max       latency-16        64.87 (   0.00%)       76.11 ( -17.32%)
Max       latency-32       133.05 (   0.00%)      123.88 (   6.89%)
Max       latency-64       150.48 (   0.00%)      198.68 ( -32.02%)
Max       latency-128      101.89 (   0.00%)      140.66 ( -38.05%)
Max       latency-256      326.53 (   0.00%)      357.54 (  -9.50%)
BAmean-50 latency-1          0.42 (   0.00%)        0.41 (   3.49%)
BAmean-50 latency-2          0.44 (   0.00%)        0.48 (  -9.16%)
BAmean-50 latency-4          0.60 (   0.00%)        0.55 (   8.02%)
BAmean-50 latency-8          0.80 (   0.00%)        0.75 (   6.60%)
BAmean-50 latency-16         1.57 (   0.00%)        1.08 (  31.38%)
BAmean-50 latency-32         3.83 (   0.00%)       25.03 (-553.42%)
BAmean-50 latency-64        47.76 (   0.00%)       47.49 (   0.56%)
BAmean-50 latency-128       42.90 (   0.00%)       40.81 (   4.89%)
BAmean-50 latency-256       62.90 (   0.00%)       64.40 (  -2.39%)
BAmean-95 latency-1          0.67 (   0.00%)        0.44 (  33.70%)
BAmean-95 latency-2          2.93 (   0.00%)        0.52 (  82.40%)
BAmean-95 latency-4          8.67 (   0.00%)        0.65 (  92.46%)
BAmean-95 latency-8         12.45 (   0.00%)        0.79 (  93.62%)
BAmean-95 latency-16        12.42 (   0.00%)        1.93 (  84.44%)
BAmean-95 latency-32        18.63 (   0.00%)       43.09 (-131.27%)
BAmean-95 latency-64        59.09 (   0.00%)       56.46 (   4.44%)
BAmean-95 latency-128       54.56 (   0.00%)       52.33 (   4.08%)
BAmean-95 latency-256       76.21 (   0.00%)       74.98 (   1.61%)
BAmean-99 latency-1          1.82 (   0.00%)        0.45 (  75.36%)
BAmean-99 latency-2          4.18 (   0.00%)        0.52 (  87.49%)
BAmean-99 latency-4          9.70 (   0.00%)        0.84 (  91.36%)
BAmean-99 latency-8         13.37 (   0.00%)        0.81 (  93.92%)
BAmean-99 latency-16        13.54 (   0.00%)        3.36 (  75.21%)
BAmean-99 latency-32        20.20 (   0.00%)       44.75 (-121.58%)
BAmean-99 latency-64        60.48 (   0.00%)       57.76 (   4.49%)
BAmean-99 latency-128       55.97 (   0.00%)       53.94 (   3.63%)
BAmean-99 latency-256       78.60 (   0.00%)       77.39 (   1.54%)

dbench4 Throughput (misleading but traditional)
                     6_1_base_config       6_1_final_config
                    io-dbench4-async       io-dbench4-async
Min       1        813.26 (   0.00%)      808.06 (  -0.64%)
Min       2       1348.40 (   0.00%)     1373.40 (   1.85%)
Min       4       2250.91 (   0.00%)     2349.87 (   4.40%)
Min       8       3419.43 (   0.00%)     3467.44 (   1.40%)
Min       16      5191.34 (   0.00%)     5808.41 (  11.89%)
Min       32      7798.01 (   0.00%)     7461.17 (  -4.32%)
Min       64      6056.72 (   0.00%)     6566.36 (   8.41%)
Min       128     6103.45 (   0.00%)     6455.82 (   5.77%)
Min       256     7409.77 (   0.00%)     7317.59 (  -1.24%)
Hmean     1        820.42 (   0.00%)      829.23 *   1.07%*
Hmean     2       1433.43 (   0.00%)     1429.61 *  -0.27%*
Hmean     4       2281.21 (   0.00%)     2364.06 *   3.63%*
Hmean     8       3493.67 (   0.00%)     3512.71 *   0.54%*
Hmean     16      5354.84 (   0.00%)     6069.95 *  13.35%*
Hmean     32      8060.95 (   0.00%)     8147.90 *   1.08%*
Hmean     64      6579.22 (   0.00%)     7151.69 *   8.70%*
Hmean     128     7340.59 (   0.00%)     7531.74 *   2.60%*
Hmean     256     7685.46 (   0.00%)     7758.75 *   0.95%*
Stddev    1          2.62 (   0.00%)        5.61 (-114.51%)
Stddev    2         28.63 (   0.00%)       12.60 (  55.98%)
Stddev    4          7.58 (   0.00%)       12.24 ( -61.45%)
Stddev    8         36.99 (   0.00%)       76.05 (-105.62%)
Stddev    16       100.33 (   0.00%)      101.59 (  -1.26%)
Stddev    32       216.04 (   0.00%)      599.98 (-177.71%)
Stddev    64      1101.72 (   0.00%)      198.64 (  81.97%)
Stddev    128      233.31 (   0.00%)      191.10 (  18.09%)
Stddev    256      474.44 (   0.00%)      220.54 (  53.52%)
CoeffVar  1          0.32 (   0.00%)        0.68 (-112.23%)
CoeffVar  2          2.00 (   0.00%)        0.88 (  55.84%)
CoeffVar  4          0.33 (   0.00%)        0.52 ( -55.79%)
CoeffVar  8          1.06 (   0.00%)        2.16 (-104.44%)
CoeffVar  16         1.87 (   0.00%)        1.67 (  10.67%)
CoeffVar  32         2.68 (   0.00%)        7.33 (-173.51%)
CoeffVar  64        16.42 (   0.00%)        2.78 (  83.10%)
CoeffVar  128        3.17 (   0.00%)        2.54 (  20.13%)
CoeffVar  256        6.15 (   0.00%)        2.84 (  53.84%)
Max       1        824.86 (   0.00%)      835.60 (   1.30%)
Max       2       1478.26 (   0.00%)     1445.52 (  -2.21%)
Max       4       2300.81 (   0.00%)     2437.53 (   5.94%)
Max       8       3536.60 (   0.00%)     3924.47 (  10.97%)
Max       16      5726.97 (   0.00%)     6188.43 (   8.06%)
Max       32      8589.30 (   0.00%)     9179.96 (   6.88%)
Max       64     10975.99 (   0.00%)     7353.65 ( -33.00%)
Max       128     7585.49 (   0.00%)     7818.70 (   3.07%)
Max       256     9583.72 (   0.00%)     8450.29 ( -11.83%)
BHmean-50 1        822.25 (   0.00%)      834.01 (   1.43%)
BHmean-50 2       1457.81 (   0.00%)     1435.25 (  -1.55%)
BHmean-50 4       2287.80 (   0.00%)     2369.75 (   3.58%)
BHmean-50 8       3524.24 (   0.00%)     3550.48 (   0.74%)
BHmean-50 16      5426.84 (   0.00%)     6154.63 (  13.41%)
BHmean-50 32      8254.63 (   0.00%)     8705.67 (   5.46%)
BHmean-50 64      7058.18 (   0.00%)     7311.16 (   3.58%)
BHmean-50 128     7457.52 (   0.00%)     7605.17 (   1.98%)
BHmean-50 256     7914.05 (   0.00%)     7874.19 (  -0.50%)
BHmean-95 1        820.74 (   0.00%)      830.01 (   1.13%)
BHmean-95 2       1437.22 (   0.00%)     1432.28 (  -0.34%)
BHmean-95 4       2282.12 (   0.00%)     2364.73 (   3.62%)
BHmean-95 8       3497.46 (   0.00%)     3515.13 (   0.51%)
BHmean-95 16      5360.61 (   0.00%)     6080.68 (  13.43%)
BHmean-95 32      8074.70 (   0.00%)     8187.13 (   1.39%)
BHmean-95 64      6608.51 (   0.00%)     7177.60 (   8.61%)
BHmean-95 128     7394.83 (   0.00%)     7576.42 (   2.46%)
BHmean-95 256     7700.43 (   0.00%)     7768.59 (   0.89%)
BHmean-99 1        820.50 (   0.00%)      829.47 (   1.09%)
BHmean-99 2       1434.42 (   0.00%)     1430.26 (  -0.29%)
BHmean-99 4       2281.50 (   0.00%)     2364.22 (   3.63%)
BHmean-99 8       3494.52 (   0.00%)     3513.23 (   0.54%)
BHmean-99 16      5356.37 (   0.00%)     6072.71 (  13.37%)
BHmean-99 32      8064.00 (   0.00%)     8156.38 (   1.15%)
BHmean-99 64      6585.62 (   0.00%)     7158.54 (   8.70%)
BHmean-99 128     7356.65 (   0.00%)     7545.32 (   2.56%)
BHmean-99 256     7688.68 (   0.00%)     7762.38 (   0.96%)

dbench4 Per-VFS Operation Latency

                6_1_base_config   6_1_final_config
                io-dbench4-async  io-dbench4-async
Duration User        1979.58     2097.00
Duration System     19522.29    19246.17
Duration Elapsed     1643.53     1642.83

                                6_1_base_config   6_1_final_config
                                io-dbench4-async  io-dbench4-async
Ops Minor Faults                      410528.00      362832.00
Ops Normal allocs                  418149664.00   440830731.00
Ops Sector Reads                        1132.00          28.00
Ops Sector Writes                  811726764.00   868601700.00
Ops Page migrate success               41282.00       16565.00
Ops Compaction cost                       42.85          17.19
Ops NUMA alloc hit                 407380422.00   428784879.00
Ops NUMA alloc local               407380245.00   428784630.00
Ops NUMA base-page range updates       83253.00       43358.00
Ops NUMA PTE updates                   83253.00       43358.00
Ops NUMA hint faults                   81229.00       30774.00
Ops NUMA hint local faults %           24325.00       13842.00
Ops NUMA hint local percent               29.95          44.98
Ops NUMA pages migrated                41282.00       16565.00
Ops AutoNUMA cost                        407.51         154.49

autonumabench
=============
                                         6_1_base_config       6_1_final_config
                                      numa-autonumabench     numa-autonumabench
Min       syst-NUMA01                  126.65 (   0.00%)       15.45 (  87.80%)
Min       syst-NUMA01_THREADLOCAL        0.19 (   0.00%)        0.14 (  26.32%)
Min       syst-NUMA02                    0.71 (   0.00%)        0.73 (  -2.82%)
Min       syst-NUMA02_SMT                0.57 (   0.00%)        0.59 (  -3.51%)
Min       elsp-NUMA01                  241.41 (   0.00%)      286.88 ( -18.84%)
Min       elsp-NUMA01_THREADLOCAL        1.05 (   0.00%)        1.03 (   1.90%)
Min       elsp-NUMA02                    2.94 (   0.00%)        3.03 (  -3.06%)
Min       elsp-NUMA02_SMT                3.31 (   0.00%)        3.11 (   6.04%)
Amean     syst-NUMA01                  199.02 (   0.00%)       17.55 *  91.18%*
Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.19 *   8.16%*
Amean     syst-NUMA02                    0.85 (   0.00%)        0.79 *   6.59%*
Amean     syst-NUMA02_SMT                0.63 (   0.00%)        0.63 *   0.00%*
Amean     elsp-NUMA01                  257.78 (   0.00%)      309.43 * -20.04%*
Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.04 *   2.42%*
Amean     elsp-NUMA02                    3.24 (   0.00%)        3.12 *   3.62%*
Amean     elsp-NUMA02_SMT                3.58 (   0.00%)        3.44 *   4.15%*
Stddev    syst-NUMA01                   41.05 (   0.00%)        1.68 (  95.92%)
Stddev    syst-NUMA01_THREADLOCAL        0.01 (   0.00%)        0.03 (-148.57%)
Stddev    syst-NUMA02                    0.11 (   0.00%)        0.06 (  44.44%)
Stddev    syst-NUMA02_SMT                0.03 (   0.00%)        0.03 (  -8.16%)
Stddev    elsp-NUMA01                   12.69 (   0.00%)       15.71 ( -23.79%)
Stddev    elsp-NUMA01_THREADLOCAL        0.01 (   0.00%)        0.01 ( -14.02%)
Stddev    elsp-NUMA02                    0.23 (   0.00%)        0.05 (  76.35%)
Stddev    elsp-NUMA02_SMT                0.29 (   0.00%)        0.26 (   9.56%)
CoeffVar  syst-NUMA01                   20.63 (   0.00%)        9.56 (  53.68%)
CoeffVar  syst-NUMA01_THREADLOCAL        5.50 (   0.00%)       14.88 (-170.66%)
CoeffVar  syst-NUMA02                   12.77 (   0.00%)        7.59 (  40.52%)
CoeffVar  syst-NUMA02_SMT                4.98 (   0.00%)        5.39 (  -8.16%)
CoeffVar  elsp-NUMA01                    4.92 (   0.00%)        5.08 (  -3.13%)
CoeffVar  elsp-NUMA01_THREADLOCAL        0.65 (   0.00%)        0.76 ( -16.85%)
CoeffVar  elsp-NUMA02                    7.16 (   0.00%)        1.76 (  75.47%)
CoeffVar  elsp-NUMA02_SMT                7.98 (   0.00%)        7.53 (   5.65%)
Max       syst-NUMA01                  264.54 (   0.00%)       19.81 (  92.51%)
Max       syst-NUMA01_THREADLOCAL        0.22 (   0.00%)        0.23 (  -4.55%)
Max       syst-NUMA02                    0.99 (   0.00%)        0.89 (  10.10%)
Max       syst-NUMA02_SMT                0.67 (   0.00%)        0.69 (  -2.99%)
Max       elsp-NUMA01                  273.51 (   0.00%)      325.28 ( -18.93%)
Max       elsp-NUMA01_THREADLOCAL        1.07 (   0.00%)        1.05 (   1.87%)
Max       elsp-NUMA02                    3.63 (   0.00%)        3.20 (  11.85%)
Max       elsp-NUMA02_SMT                4.12 (   0.00%)        3.73 (   9.47%)
BAmean-50 syst-NUMA01                  167.44 (   0.00%)       15.93 (  90.49%)
BAmean-50 syst-NUMA01_THREADLOCAL        0.20 (   0.00%)        0.17 (  15.00%)
BAmean-50 syst-NUMA02                    0.75 (   0.00%)        0.74 (   0.89%)
BAmean-50 syst-NUMA02_SMT                0.60 (   0.00%)        0.60 (   0.55%)
BAmean-50 elsp-NUMA01                  245.37 (   0.00%)      293.30 ( -19.53%)
BAmean-50 elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.03 (   2.52%)
BAmean-50 elsp-NUMA02                    3.04 (   0.00%)        3.07 (  -0.88%)
BAmean-50 elsp-NUMA02_SMT                3.36 (   0.00%)        3.18 (   5.36%)
BAmean-95 syst-NUMA01                  188.10 (   0.00%)       17.17 (  90.87%)
BAmean-95 syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.19 (  10.40%)
BAmean-95 syst-NUMA02                    0.82 (   0.00%)        0.77 (   5.88%)
BAmean-95 syst-NUMA02_SMT                0.62 (   0.00%)        0.62 (   0.54%)
BAmean-95 elsp-NUMA01                  255.16 (   0.00%)      306.79 ( -20.24%)
BAmean-95 elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.03 (   2.52%)
BAmean-95 elsp-NUMA02                    3.17 (   0.00%)        3.11 (   2.05%)
BAmean-95 elsp-NUMA02_SMT                3.49 (   0.00%)        3.39 (   3.10%)
BAmean-99 syst-NUMA01                  188.10 (   0.00%)       17.17 (  90.87%)
BAmean-99 syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.19 (  10.40%)
BAmean-99 syst-NUMA02                    0.82 (   0.00%)        0.77 (   5.88%)
BAmean-99 syst-NUMA02_SMT                0.62 (   0.00%)        0.62 (   0.54%)
BAmean-99 elsp-NUMA01                  255.16 (   0.00%)      306.79 ( -20.24%)
BAmean-99 elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.03 (   2.52%)
BAmean-99 elsp-NUMA02                    3.17 (   0.00%)        3.11 (   2.05%)
BAmean-99 elsp-NUMA02_SMT                3.49 (   0.00%)        3.39 (   3.10%)

                6_1_base_config     6_1_final_config
                numa-autonumabench  numa-autonumabench
Duration User      313803.42   325078.49
Duration System      1405.42      134.62
Duration Elapsed     1872.08     2226.35

                                6_1_base_config     6_1_final_config
                                numa-autonumabench  numa-autonumabench
Ops Minor Faults                   239730038.00    57780527.00
Ops Major Faults                         195.00         195.00
Ops Normal allocs                   59157241.00    49821996.00
Ops Sector Reads                       31644.00       31900.00
Ops Sector Writes                      49096.00       51068.00
Ops Page migrate success             7552783.00         275.00
Ops Compaction cost                     7839.79           0.29
Ops NUMA alloc hit                  58997125.00    49605390.00
Ops NUMA alloc local                58979133.00    49603280.00
Ops NUMA base-page range updates    97494921.00        1239.00
Ops NUMA PTE updates                97494921.00        1239.00
Ops NUMA hint faults                98707853.00        1061.00
Ops NUMA hint local faults %        78220667.00         782.00
Ops NUMA hint local percent               79.24          73.70
Ops NUMA pages migrated              7552783.00         275.00
Ops AutoNUMA cost                     494365.23           5.32

Runtime options
Run: OpenMPI Environment
Run:  NAS_OPENMPI_VERSION=openmpi
Run:  PATH=/usr/lib64/mpi/gcc/openmpi/bin:$PATH
Run:  LD_LIBRARY_PATH=/usr/lib64/mpi/gcc/openmpi/lib64
Run: mpirun --use-hwthread-cpus --allow-run-as-root -np 256 ./bin/sp.D.256
...
nas-mpi-bt NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       bt.D      240.02 (   0.00%)      240.23 (  -0.09%)
Amean     bt.D      240.23 (   0.00%)      241.24 *  -0.42%*
Stddev    bt.D        0.30 (   0.00%)        1.71 (-463.35%)
CoeffVar  bt.D        0.13 (   0.00%)        0.71 (-461.00%)
Max       bt.D      240.58 (   0.00%)      243.21 (  -1.09%)
BAmean-50 bt.D      240.02 (   0.00%)      240.23 (  -0.09%)
BAmean-95 bt.D      240.06 (   0.00%)      240.25 (  -0.08%)
BAmean-99 bt.D      240.06 (   0.00%)      240.25 (  -0.08%)

nas-mpi-bt Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-bt.D       1627.17 (   0.00%)     1554.96 (   4.44%)
Min       elspd-bt.D      244.09 (   0.00%)      244.22 (  -0.05%)
Amean     sys-bt.D       1636.36 (   0.00%)     1570.90 *   4.00%*
Amean     elspd-bt.D      244.34 (   0.00%)      245.33 *  -0.41%*
Stddev    sys-bt.D          8.10 (   0.00%)       22.38 (-176.31%)
Stddev    elspd-bt.D        0.35 (   0.00%)        1.78 (-405.43%)
CoeffVar  sys-bt.D          0.49 (   0.00%)        1.42 (-187.83%)
CoeffVar  elspd-bt.D        0.14 (   0.00%)        0.73 (-403.39%)
Max       sys-bt.D       1642.44 (   0.00%)     1596.48 (   2.80%)
Max       elspd-bt.D      244.74 (   0.00%)      247.38 (  -1.08%)
BAmean-50 sys-bt.D       1627.17 (   0.00%)     1554.96 (   4.44%)
BAmean-50 elspd-bt.D      244.09 (   0.00%)      244.22 (  -0.05%)
BAmean-95 sys-bt.D       1633.33 (   0.00%)     1558.11 (   4.61%)
BAmean-95 elspd-bt.D      244.13 (   0.00%)      244.30 (  -0.07%)
BAmean-99 sys-bt.D       1633.33 (   0.00%)     1558.11 (   4.61%)
BAmean-99 elspd-bt.D      244.13 (   0.00%)      244.30 (  -0.07%)


                6_1_base_config   6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User      181648.82   182614.64
Duration System      4910.66     4714.35
Duration Elapsed      742.88      745.28

                                6_1_base_config   6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                   380489860.00    75510645.00
Ops Major Faults                         302.00         292.00
Ops Swap Ins                               0.00           0.00
Ops Normal allocs                   26303147.00    26030593.00
Ops Sector Reads                       33988.00       33904.00
Ops Sector Writes                      41932.00       40896.00
Ops Page migrate success              284339.00        5343.00
Ops Page migrate failure                   1.00           0.00
Ops Compaction cost                      295.14           5.55
Ops NUMA alloc hit                  26169335.00    25898920.00
Ops NUMA alloc local                26169335.00    25898920.00
Ops NUMA base-page range updates   354977260.00    49183327.00
Ops NUMA PTE updates               354977260.00    49183327.00
Ops NUMA hint faults               354078577.00    49085315.00
Ops NUMA hint local faults %       353643387.00    49076300.00
Ops NUMA hint local percent               99.88          99.98
Ops NUMA pages migrated               284339.00        5343.00
Ops AutoNUMA cost                    1772883.13      245770.96

nas-mpi-cg NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       cg.D      156.35 (   0.00%)      142.08 (   9.13%)
Amean     cg.D      157.57 (   0.00%)      142.59 *   9.50%*
Stddev    cg.D        1.89 (   0.00%)        0.48 (  74.93%)
CoeffVar  cg.D        1.20 (   0.00%)        0.33 (  72.30%)
Max       cg.D      159.75 (   0.00%)      143.02 (  10.47%)
BAmean-50 cg.D      156.35 (   0.00%)      142.08 (   9.13%)
BAmean-95 cg.D      156.47 (   0.00%)      142.38 (   9.01%)
BAmean-99 cg.D      156.47 (   0.00%)      142.38 (   9.01%)

nas-mpi-cg Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-cg.D       2484.08 (   0.00%)     2262.29 (   8.93%)
Min       elspd-cg.D      164.10 (   0.00%)      149.84 (   8.69%)
Amean     sys-cg.D       2518.20 (   0.00%)     2296.84 *   8.79%*
Amean     elspd-cg.D      165.32 (   0.00%)      150.40 *   9.02%*
Stddev    sys-cg.D         40.64 (   0.00%)       30.83 (  24.14%)
Stddev    elspd-cg.D        1.96 (   0.00%)        0.50 (  74.38%)
CoeffVar  sys-cg.D          1.61 (   0.00%)        1.34 (  16.83%)
CoeffVar  elspd-cg.D        1.18 (   0.00%)        0.33 (  71.84%)
Max       sys-cg.D       2563.17 (   0.00%)     2321.56 (   9.43%)
Max       elspd-cg.D      167.58 (   0.00%)      150.80 (  10.01%)
BAmean-50 sys-cg.D       2484.08 (   0.00%)     2262.29 (   8.93%)
BAmean-50 elspd-cg.D      164.10 (   0.00%)      149.84 (   8.69%)
BAmean-95 sys-cg.D       2495.72 (   0.00%)     2284.47 (   8.46%)
BAmean-95 elspd-cg.D      164.19 (   0.00%)      150.20 (   8.52%)
BAmean-99 sys-cg.D       2495.72 (   0.00%)     2284.47 (   8.46%)
BAmean-99 elspd-cg.D      164.19 (   0.00%)      150.20 (   8.52%)


                6_1_base_config   6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User      118328.59   107542.39
Duration System      7555.79     6891.65
Duration Elapsed      501.43      456.77

                                6_1_base_config   6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                   115149790.00    32917253.00
Ops Major Faults                          25.00          26.00
Ops Normal allocs                   23369629.00    23011001.00
Ops Sector Reads                         436.00         540.00
Ops Sector Writes                      39728.00       39280.00
Ops Page migrate success              353217.00         686.00
Ops Compaction cost                      366.64           0.71
Ops NUMA alloc hit                  23250676.00    22895722.00
Ops NUMA alloc local                23250672.00    22895722.00
Ops NUMA base-page range updates   106418880.00     9823982.00
Ops NUMA PTE updates               106418880.00     9823982.00
Ops NUMA hint faults                91920469.00     9686971.00
Ops NUMA hint local faults %        91283239.00     9685903.00
Ops NUMA hint local percent               99.31          99.99
Ops NUMA pages migrated               353217.00         686.00
Ops AutoNUMA cost                     460353.99       48503.64

nas-mpi-ep NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       ep.D        8.44 (   0.00%)        8.65 (  -2.49%)
Amean     ep.D        8.65 (   0.00%)        9.35 *  -8.10%*
Stddev    ep.D        0.27 (   0.00%)        0.85 (-218.35%)
CoeffVar  ep.D        3.10 (   0.00%)        9.14 (-194.51%)
Max       ep.D        8.95 (   0.00%)       10.30 ( -15.08%)
BAmean-50 ep.D        8.44 (   0.00%)        8.65 (  -2.49%)
BAmean-95 ep.D        8.50 (   0.00%)        8.87 (  -4.41%)
BAmean-99 ep.D        8.50 (   0.00%)        8.87 (  -4.41%)

nas-mpi-ep Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-ep.D         31.16 (   0.00%)       20.49 (  34.24%)
Min       elspd-ep.D        9.91 (   0.00%)       10.16 (  -2.52%)
Amean     sys-ep.D         31.47 (   0.00%)       27.28 *  13.30%*
Amean     elspd-ep.D       10.20 (   0.00%)       10.89 *  -6.73%*
Stddev    sys-ep.D          0.36 (   0.00%)        6.04 (-1591.77%)
Stddev    elspd-ep.D        0.29 (   0.00%)        0.86 (-196.95%)
CoeffVar  sys-ep.D          1.13 (   0.00%)       22.12 (-1851.38%)
CoeffVar  elspd-ep.D        2.84 (   0.00%)        7.91 (-178.23%)
Max       sys-ep.D         31.86 (   0.00%)       32.03 (  -0.53%)
Max       elspd-ep.D       10.49 (   0.00%)       11.84 ( -12.87%)
BAmean-50 sys-ep.D         31.16 (   0.00%)       20.49 (  34.24%)
BAmean-50 elspd-ep.D        9.91 (   0.00%)       10.16 (  -2.52%)
BAmean-95 sys-ep.D         31.27 (   0.00%)       24.91 (  20.35%)
BAmean-95 elspd-ep.D       10.06 (   0.00%)       10.41 (  -3.53%)
BAmean-99 sys-ep.D         31.27 (   0.00%)       24.91 (  20.35%)
BAmean-99 elspd-ep.D       10.06 (   0.00%)       10.41 (  -3.53%)


                6_1_base_config   6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User        6689.32     7231.16
Duration System        95.70       83.00
Duration Elapsed       35.58       37.36

                                6_1_base_config   6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                     3171011.00     2958900.00
Ops Major Faults                          21.00          21.00
Ops Normal allocs                    1735733.00     1725659.00
Ops Sector Reads                          24.00          24.00
Ops Sector Writes                      37008.00       37620.00
Ops NUMA alloc hit                   1620169.00     1618267.00
Ops NUMA alloc local                 1620169.00     1618267.00
Ops NUMA base-page range updates     1343733.00     1061345.00
Ops NUMA PTE updates                 1343733.00     1061345.00
Ops NUMA hint faults                 1158152.00      945501.00
Ops NUMA hint local faults %         1158144.00      945501.00
Ops NUMA pages migrated                    7.00           0.00
Ops AutoNUMA cost                       5800.17        4734.93

nas-mpi-is NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       is.D        5.05 (   0.00%)        5.02 (   0.59%)
Amean     is.D        5.19 (   0.00%)        5.22 *  -0.45%*
Stddev    is.D        0.13 (   0.00%)        0.28 (-122.86%)
CoeffVar  is.D        2.44 (   0.00%)        5.41 (-121.86%)
Max       is.D        5.29 (   0.00%)        5.54 (  -4.73%)
BAmean-50 is.D        5.05 (   0.00%)        5.02 (   0.59%)
BAmean-95 is.D        5.14 (   0.00%)        5.05 (   1.75%)
BAmean-99 is.D        5.14 (   0.00%)        5.05 (   1.75%)

nas-mpi-is Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-is.D        370.08 (   0.00%)      364.28 (   1.57%)
Min       elspd-is.D        9.05 (   0.00%)        8.91 (   1.55%)
Amean     sys-is.D        377.93 (   0.00%)      381.63 *  -0.98%*
Amean     elspd-is.D        9.13 (   0.00%)        9.10 *   0.29%*
Stddev    sys-is.D          6.97 (   0.00%)       23.90 (-243.13%)
Stddev    elspd-is.D        0.07 (   0.00%)        0.24 (-240.87%)
CoeffVar  sys-is.D          1.84 (   0.00%)        6.26 (-239.80%)
CoeffVar  elspd-is.D        0.77 (   0.00%)        2.62 (-241.87%)
Max       sys-is.D        383.37 (   0.00%)      408.90 (  -6.66%)
Max       elspd-is.D        9.18 (   0.00%)        9.37 (  -2.07%)
BAmean-50 sys-is.D        370.08 (   0.00%)      364.28 (   1.57%)
BAmean-50 elspd-is.D        9.05 (   0.00%)        8.91 (   1.55%)
BAmean-95 sys-is.D        375.22 (   0.00%)      368.00 (   1.92%)
BAmean-95 elspd-is.D        9.11 (   0.00%)        8.97 (   1.48%)
BAmean-99 sys-is.D        375.22 (   0.00%)      368.00 (   1.92%)
BAmean-99 elspd-is.D        9.11 (   0.00%)        8.97 (   1.48%)


                6_1_base_config   6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User        4762.89     4747.01
Duration System      1134.94     1146.02
Duration Elapsed       32.11       31.81

                                6_1_base_config   6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                    66496004.00    47159682.00
Ops Major Faults                          21.00          21.00
Ops Normal allocs                   23449077.00    22990132.00
Ops Sector Reads                         384.00         384.00
Ops Sector Writes                      38040.00       35828.00
Ops Page migrate success             1938622.00     1485033.00
Ops Compaction cost                     2012.29        1541.46
Ops NUMA alloc hit                  23337713.00    22880414.00
Ops NUMA alloc local                23337713.00    22880414.00
Ops NUMA base-page range updates    45446261.00    25594500.00
Ops NUMA PTE updates                45446261.00    25594500.00
Ops NUMA hint faults                44834767.00    25501047.00
Ops NUMA hint local faults %        42555580.00    24013161.00
Ops NUMA hint local percent               94.92          94.17
Ops NUMA pages migrated              1938622.00     1485033.00
Ops AutoNUMA cost                     224528.79      127712.61

nas-mpi-lu NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       lu.D      183.79 (   0.00%)      183.22 (   0.31%)
Amean     lu.D      183.96 (   0.00%)      183.43 *   0.29%*
Stddev    lu.D        0.25 (   0.00%)        0.19 (  26.01%)
CoeffVar  lu.D        0.14 (   0.00%)        0.10 (  25.80%)
Max       lu.D      184.25 (   0.00%)      183.55 (   0.38%)
BAmean-50 lu.D      183.79 (   0.00%)      183.22 (   0.31%)
BAmean-95 lu.D      183.82 (   0.00%)      183.38 (   0.24%)
BAmean-99 lu.D      183.82 (   0.00%)      183.38 (   0.24%)

nas-mpi-lu Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-lu.D        755.87 (   0.00%)      636.36 (  15.81%)
Min       elspd-lu.D      187.61 (   0.00%)      187.18 (   0.23%)
Amean     sys-lu.D        766.88 (   0.00%)      649.63 *  15.29%*
Amean     elspd-lu.D      187.86 (   0.00%)      187.28 *   0.31%*
Stddev    sys-lu.D          9.69 (   0.00%)       12.13 ( -25.19%)
Stddev    elspd-lu.D        0.28 (   0.00%)        0.11 (  61.93%)
CoeffVar  sys-lu.D          1.26 (   0.00%)        1.87 ( -47.79%)
CoeffVar  elspd-lu.D        0.15 (   0.00%)        0.06 (  61.81%)
Max       sys-lu.D        774.09 (   0.00%)      660.14 (  14.72%)
Max       elspd-lu.D      188.16 (   0.00%)      187.39 (   0.41%)
BAmean-50 sys-lu.D        755.87 (   0.00%)      636.36 (  15.81%)
BAmean-50 elspd-lu.D      187.61 (   0.00%)      187.18 (   0.23%)
BAmean-95 sys-lu.D        763.28 (   0.00%)      644.38 (  15.58%)
BAmean-95 elspd-lu.D      187.71 (   0.00%)      187.22 (   0.26%)
BAmean-99 sys-lu.D        763.28 (   0.00%)      644.38 (  15.58%)
BAmean-99 elspd-lu.D      187.71 (   0.00%)      187.22 (   0.26%)


                6_1_base_config   6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User      140870.85   140812.35
Duration System      2302.38     1950.50
Duration Elapsed      572.07      570.59

                                6_1_base_config   6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                   159164134.00    36327663.00
Ops Major Faults                          22.00          23.00
Ops Normal allocs                   12766911.00    12701365.00
Ops Sector Reads                         392.00         432.00
Ops Sector Writes                      40336.00       39988.00
Ops Page migrate success               49476.00        5899.00
Ops Compaction cost                       51.36           6.12
Ops NUMA alloc hit                  12641163.00    12586537.00
Ops NUMA alloc local                12641163.00    12586537.00
Ops NUMA base-page range updates   146297502.00    23468561.00
Ops NUMA PTE updates               146297502.00    23468561.00
Ops NUMA hint faults               146195881.00    23366950.00
Ops NUMA hint local faults %       146121976.00    23360847.00
Ops NUMA hint local percent               99.95          99.97
Ops NUMA pages migrated                49476.00        5899.00
Ops AutoNUMA cost                     732004.43      116999.14

nas-mpi-mg NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       mg.D       32.10 (   0.00%)       31.87 (   0.72%)
Amean     mg.D       32.14 (   0.00%)       31.88 *   0.79%*
Stddev    mg.D        0.04 (   0.00%)        0.02 (  34.24%)
CoeffVar  mg.D        0.11 (   0.00%)        0.07 (  33.72%)
Max       mg.D       32.17 (   0.00%)       31.91 (   0.81%)
BAmean-50 mg.D       32.10 (   0.00%)       31.87 (   0.72%)
BAmean-95 mg.D       32.12 (   0.00%)       31.87 (   0.78%)
BAmean-99 mg.D       32.12 (   0.00%)       31.87 (   0.78%)

nas-mpi-mg Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-mg.D        477.89 (   0.00%)      418.76 (  12.37%)
Min       elspd-mg.D       35.88 (   0.00%)       35.62 (   0.72%)
Amean     sys-mg.D        484.38 (   0.00%)      420.00 *  13.29%*
Amean     elspd-mg.D       35.98 (   0.00%)       35.74 *   0.68%*
Stddev    sys-mg.D          5.70 (   0.00%)        1.16 (  79.67%)
Stddev    elspd-mg.D        0.11 (   0.00%)        0.13 ( -18.57%)
CoeffVar  sys-mg.D          1.18 (   0.00%)        0.28 (  76.55%)
CoeffVar  elspd-mg.D        0.31 (   0.00%)        0.37 ( -19.38%)
Max       sys-mg.D        488.60 (   0.00%)      421.06 (  13.82%)
Max       elspd-mg.D       36.10 (   0.00%)       35.88 (   0.61%)
BAmean-50 sys-mg.D        477.89 (   0.00%)      418.76 (  12.37%)
BAmean-50 elspd-mg.D       35.88 (   0.00%)       35.62 (   0.72%)
BAmean-95 sys-mg.D        482.27 (   0.00%)      419.47 (  13.02%)
BAmean-95 elspd-mg.D       35.92 (   0.00%)       35.66 (   0.71%)
BAmean-99 sys-mg.D        482.27 (   0.00%)      419.47 (  13.02%)
BAmean-99 elspd-mg.D       35.92 (   0.00%)       35.66 (   0.71%)


                6_1_base_config   6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User       25080.64    25066.21
Duration System      1454.59     1261.44
Duration Elapsed      115.21      114.28

                                6_1_base_config   6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                   179027668.00    68883081.00
Ops Major Faults                          21.00          21.00
Ops Normal allocs                   23768625.00    23750979.00
Ops Sector Reads                          88.00          88.00
Ops Sector Writes                      38272.00       37544.00
Ops Page migrate success               12059.00        1774.00
Ops Compaction cost                       12.52           1.84
Ops NUMA alloc hit                  23643850.00    23634234.00
Ops NUMA alloc local                23643849.00    23634234.00
Ops NUMA base-page range updates   155287026.00    45123410.00
Ops NUMA PTE updates               155287026.00    45123410.00
Ops NUMA hint faults               155055287.00    44908778.00
Ops NUMA hint local faults %       155037730.00    44906752.00
Ops NUMA hint local percent               99.99         100.00
Ops NUMA pages migrated                12059.00        1774.00
Ops AutoNUMA cost                     776363.67      224859.79

nas-mpi-sp NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sp.D      408.19 (   0.00%)      409.12 (  -0.23%)
Amean     sp.D      408.32 (   0.00%)      409.82 *  -0.37%*
Stddev    sp.D        0.13 (   0.00%)        0.63 (-400.26%)
CoeffVar  sp.D        0.03 (   0.00%)        0.15 (-398.42%)
Max       sp.D      408.44 (   0.00%)      410.33 (  -0.46%)
BAmean-50 sp.D      408.19 (   0.00%)      409.12 (  -0.23%)
BAmean-95 sp.D      408.25 (   0.00%)      409.56 (  -0.32%)
BAmean-99 sp.D      408.25 (   0.00%)      409.56 (  -0.32%)

nas-mpi-sp Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-sp.D       4120.36 (   0.00%)     3820.43 (   7.28%)
Min       elspd-sp.D      412.02 (   0.00%)      413.00 (  -0.24%)
Amean     sys-sp.D       4187.54 (   0.00%)     3970.32 *   5.19%*
Amean     elspd-sp.D      412.20 (   0.00%)      413.64 *  -0.35%*
Stddev    sys-sp.D         65.82 (   0.00%)      132.28 (-100.95%)
Stddev    elspd-sp.D        0.20 (   0.00%)        0.57 (-181.53%)
CoeffVar  sys-sp.D          1.57 (   0.00%)        3.33 (-111.94%)
CoeffVar  elspd-sp.D        0.05 (   0.00%)        0.14 (-180.55%)
Max       sys-sp.D       4251.92 (   0.00%)     4070.68 (   4.26%)
Max       elspd-sp.D      412.42 (   0.00%)      414.08 (  -0.40%)
BAmean-50 sys-sp.D       4120.36 (   0.00%)     3820.43 (   7.28%)
BAmean-50 elspd-sp.D      412.02 (   0.00%)      413.00 (  -0.24%)
BAmean-95 sys-sp.D       4155.35 (   0.00%)     3920.14 (   5.66%)
BAmean-95 elspd-sp.D      412.10 (   0.00%)      413.43 (  -0.32%)
BAmean-99 sys-sp.D       4155.35 (   0.00%)     3920.14 (   5.66%)
BAmean-99 elspd-sp.D      412.10 (   0.00%)      413.43 (  -0.32%)


                6_1_base_config   6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User      302843.74   304665.29
Duration System     12564.12    11912.42
Duration Elapsed     1246.11     1249.45

                                6_1_base_config   6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                   383084930.00    65419151.00
Ops Major Faults                          76.00          21.00
Ops Normal allocs                   22711498.00    22411009.00
Ops Sector Reads                       13799.00         256.00
Ops Sector Writes                      44632.00       42080.00
Ops Page migrate success              299748.00        8317.00
Ops Page migrate failure                   1.00           0.00
Ops Compaction cost                      311.14           8.63
Ops NUMA alloc hit                  22593673.00    22286577.00
Ops NUMA alloc local                22593274.00    22286577.00
Ops NUMA base-page range updates   360559234.00    42879377.00
Ops NUMA PTE updates               360559234.00    42879377.00
Ops NUMA hint faults               360428916.00    42781996.00
Ops NUMA hint local faults %       359952878.00    42767934.00
Ops NUMA hint local percent               99.87          99.97
Ops NUMA pages migrated               299748.00        8317.00
Ops AutoNUMA cost                    1804674.19      214210.29

[1] sched/numa: Process Adaptive autoNUMA 
Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/
Raghavendra K T (1):
  sched/numa: Enhance vma scanning logic

 include/linux/mm_types.h |  2 ++
 kernel/sched/fair.c      | 32 ++++++++++++++++++++++++++++++++
 mm/memory.c              | 21 +++++++++++++++++++++
 3 files changed, 55 insertions(+)

-- 
2.34.1



* [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-16  1:35 [RFC PATCH V1 0/1] sched/numa: Enhance vma scanning Raghavendra K T
@ 2023-01-16  1:35 ` Raghavendra K T
  2023-01-16  2:25   ` Raghavendra K T
                     ` (3 more replies)
  2023-01-16  2:25 ` [RFC PATCH V1 0/1] sched/numa: Enhance vma scanning Raghavendra K T
  1 sibling, 4 replies; 16+ messages in thread
From: Raghavendra K T @ 2023-01-16  1:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	David Hildenbrand, xu xin, Yu Zhao, Colin Cross, Arnd Bergmann,
	Hugh Dickins, Bharata B Rao, Disha Talreja, Raghavendra K T

 During NUMA scanning, make sure only the relevant vmas of the
tasks are scanned.

Logic:
1) For the first two scan rounds, allow unconditional scanning of vmas.
2) Store the 4 most recent unique tasks (last 8 bits of their PIDs) that
  accessed the vma. False positives due to PID collisions should be fine
  here; they only cause extra scanning.
3) If more than 4 PIDs exist, assume the current task did access the vma,
 to avoid false negatives. A worked trace of the tagging arithmetic
 follows below.
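
As a hypothetical worked trace of the tagging arithmetic (assuming
LAST__PID_SHIFT = 8 and LAST__PID_MASK = 0xff, so an unsigned int holds
32 / 8 = 4 tags):

  task A (pid 0x1234) faults: tag = 0x1234 & 0xff = 0x34
          slot 0 -> accessing_pids = 0x00000034, next_pid_slot = 1
  task B (pid 0xabcd) faults: tag = 0xabcd & 0xff = 0xcd
          slot 1 -> accessing_pids = 0x0000cd34, next_pid_slot = 2
  task C (pid 0x5634) faults: tag 0x34 is already present -> skip_update
  (C aliases with A: such a collision only costs an extra scan of the
  vma; it never prevents one.)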

Co-developed-by: Bharata B Rao <bharata@amd.com>
(initial patch to store pid information)

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm_types.h |  2 ++
 kernel/sched/fair.c      | 32 ++++++++++++++++++++++++++++++++
 mm/memory.c              | 21 +++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..07feae37b8e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -506,6 +506,8 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	unsigned int accessing_pids;	/* packed last-8-bit PID tags of recent accessors */
+	int next_pid_slot;		/* round-robin slot for the next PID tag */
 } __randomize_layout;
 
 struct kioctx_table;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..944d2e3b0b3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2916,6 +2916,35 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
+static bool vma_is_accessed(struct vm_area_struct *vma)
+{
+	int i;
+	bool more_pids_exist;
+	unsigned long pid, max_pids;
+	unsigned long current_pid = current->pid & LAST__PID_MASK;
+
+	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
+
+	/* By default we assume >= max_pids exist */
+	more_pids_exist = true;
+
+	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+		return true;
+
+	for (i = 0; i < max_pids; i++) {
+		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
+			LAST__PID_MASK;
+		if (pid == current_pid)
+			return true;
+		if (pid == 0) {
+			more_pids_exist = false;
+			break;
+		}
+	}
+
+	return more_pids_exist;
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3015,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessible(vma))
 			continue;
 
+		if (!vma_is_accessed(vma))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
diff --git a/mm/memory.c b/mm/memory.c
index 8c8420934d60..fafd78d87a51 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4717,7 +4717,28 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	pte_t pte, old_pte;
 	bool was_writable = pte_savedwrite(vmf->orig_pte);
 	int flags = 0;
+	int pid_slot = vma->next_pid_slot;
 
+	int i;
+	unsigned long pid, max_pids;
+	unsigned long current_pid = current->pid & LAST__PID_MASK;
+
+	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
+
+	/* Avoid duplicate PID updates */
+	for (i = 0; i < max_pids; i++) {
+		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
+			LAST__PID_MASK;
+		if (pid == current_pid)
+			goto skip_update;
+	}
+
+	vma->next_pid_slot = (pid_slot + 1) % max_pids;
+	vma->accessing_pids &= ~(LAST__PID_MASK << (pid_slot * LAST__PID_SHIFT));
+	vma->accessing_pids |= ((current_pid) <<
+			(pid_slot * LAST__PID_SHIFT));
+
+skip_update:
 	/*
 	 * The "pte" at this point cannot be used safely without
 	 * validation through pte_unmap_same(). It's of NUMA type but
-- 
2.34.1



* [RFC PATCH V1 0/1] sched/numa: Enhance vma scanning
  2023-01-16  1:35 [RFC PATCH V1 0/1] sched/numa: Enhance vma scanning Raghavendra K T
  2023-01-16  1:35 ` [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic Raghavendra K T
@ 2023-01-16  2:25 ` Raghavendra K T
  1 sibling, 0 replies; 16+ messages in thread
From: Raghavendra K T @ 2023-01-16  2:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	David Hildenbrand, xu xin, Yu Zhao, Colin Cross, Arnd Bergmann,
	Hugh Dickins, Bharata B Rao, Disha Talreja, Raghavendra K T

 This patchset proposes one of the enhancements to NUMA vma scanning
suggested by Mel.

The existing mechanism derives the scan period from per-thread stats.
Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats at the
per-process level to better capture application behaviour.

During the course of that discussion, Mel proposed several ideas to enhance
the current NUMA balancing. One of the suggestions was:

Track which threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately which
threads access a VMA. Skip VMAs that did not trap a fault. This would
be approximate because of PID collisions, but would reduce scanning of
areas the thread is not interested in. The suggestion intends not to
penalize threads that have no interest in the vma, and thus reduces
scanning overhead.

Approach in the patchset:

1) Track at most 4 threads that recently accessed the vma, and scan only if
the current thread is one of them. (Note: only an unsigned int is used;
experiments showed tracking 8 unique PIDs had more overhead.)

2) For the first 2 scans, unconditionally allow threads to scan vmas, to
preserve the original intention of scanning.

3) If there are more than 4 threads (i.e. more than the PIDs we could
remember), allow scanning by default, because we might have missed recording
whether the current thread had any interest in the vma.
(A less accurate and debatable heuristic.)

With this patchset we see a considerable reduction in scanning overhead
(AutoNUMA cost), with some benchmarks improving performance and others
showing almost no regression.

Things to ponder over (and future TODOs):
==========================================
- Do we have to consider clearing a PID if the vma is not accessed "recently"?
- The current scan period is not changed in this patchset, so we still see
  frequent attempts to scan. Relaxing the scan period dynamically could
  improve results further.

Results Summary:
================
The results were obtained by running mmtests with the following configs:
config-workload-kernbench-max
config-io-dbench4-async
config-numa-autonumabench
config-hpc-nas-mpi-full

There is a significant reduction in AutoNUMA cost.

SUT:
2 socket AMD Milan System
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2

256GB memory per socket amounting to 512GB in total
NPS1 NUMA configuration where each socket is a NUMA node

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 0 size: 257538 MB
node 0 free: 255739 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 1 size: 255978 MB
node 1 free: 249680 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10

Detailed Results:
(Note:
1. All rows with 0.00 and 100% on both sides are removed.
2. Some of the duplicate run/path info is trimmed.
3. SIS results are omitted.)

kernbench
=============
                          6_1_base_config       6_1_final_config
                   workload-kernbench-max workload-kernbench-max
Min       user-256    22120.65 (   0.00%)    22477.15 (  -1.61%)
Min       syst-256     9975.63 (   0.00%)     8880.00 (  10.98%)
Min       elsp-256      161.24 (   0.00%)      157.45 (   2.35%)
Amean     user-256    22179.56 (   0.00%)    22558.57 *  -1.71%*
Amean     syst-256    10034.72 (   0.00%)     8913.83 *  11.17%*
Amean     elsp-256      161.52 (   0.00%)      157.69 *   2.37%*
Stddev    user-256      101.82 (   0.00%)       82.51 (  18.97%)
Stddev    syst-256       87.56 (   0.00%)       54.31 (  37.97%)
Stddev    elsp-256        0.35 (   0.00%)        0.27 (  24.12%)
CoeffVar  user-256        0.46 (   0.00%)        0.37 (  20.33%)
CoeffVar  syst-256        0.87 (   0.00%)        0.61 (  30.17%)
CoeffVar  elsp-256        0.22 (   0.00%)        0.17 (  22.28%)
Max       user-256    22297.13 (   0.00%)    22642.13 (  -1.55%)
Max       syst-256    10135.31 (   0.00%)     8976.48 (  11.43%)
Max       elsp-256      161.92 (   0.00%)      157.98 (   2.43%)
BAmean-50 user-256    22120.65 (   0.00%)    22477.15 (  -1.61%)
BAmean-50 syst-256     9975.63 (   0.00%)     8880.00 (  10.98%)
BAmean-50 elsp-256      161.24 (   0.00%)      157.45 (   2.35%)
BAmean-95 user-256    22120.77 (   0.00%)    22516.79 (  -1.79%)
BAmean-95 syst-256     9984.42 (   0.00%)     8882.51 (  11.04%)
BAmean-95 elsp-256      161.32 (   0.00%)      157.54 (   2.34%)
BAmean-99 user-256    22120.77 (   0.00%)    22516.79 (  -1.79%)
BAmean-99 syst-256     9984.42 (   0.00%)     8882.51 (  11.04%)
BAmean-99 elsp-256      161.32 (   0.00%)      157.54 (   2.34%)

                6_1_base_config  6_1_final_config
                workload-kernbench-max  workload-kernbench-max
Duration User       66548.20    67685.44
Duration System     30118.43    26756.19
Duration Elapsed      506.13      495.18

                                6_1_base_config  6_1_final_config
                                workload-kernbench-max  workload-kernbench-max
Ops Minor Faults                  1883340576.00  1883195171.00
Ops Major Faults                          88.00          42.00
Ops Normal allocs                 1743174614.00  1742916379.00
Ops Sector Reads                      173600.00       11928.00
Ops Sector Writes                   21099684.00    21480472.00
Ops Page migrate success              111429.00       73693.00
Ops Compaction cost                      115.66          76.49
Ops NUMA alloc hit                1738966281.00  1738692743.00
Ops NUMA alloc local              1738966104.00  1738692456.00
Ops NUMA base-page range updates      401910.00      322972.00
Ops NUMA PTE updates                  401910.00      322972.00
Ops NUMA hint faults                  112231.00       76987.00
Ops NUMA hint local faults %             802.00        3294.00
Ops NUMA hint local percent                0.71           4.28
Ops NUMA pages migrated               111429.00       73693.00
Ops AutoNUMA cost                        566.09         388.60

dbench
=========
Runtime options
Run: dbench -D [...]/testdisk/data/6_1_base_config-io-dbench4-async --warmup 0 -t 180 --loadfile [...]/sources/dbench-781852c2b38a-installed/share//client-tiny.txt --show-execute-time 1 2 4 ... 256
dbench4 Loadfile Execution Time

dbench4 Latency
                             6_1_base_config       6_1_final_config
                            io-dbench4-async       io-dbench4-async
Min       latency-1          0.37 (   0.00%)        0.35 (   4.09%)
Min       latency-2          0.38 (   0.00%)        0.40 (  -6.60%)
Min       latency-4          0.51 (   0.00%)        0.49 (   2.97%)
Min       latency-8          0.69 (   0.00%)        0.62 (   9.17%)
Min       latency-16         1.11 (   0.00%)        0.99 (  10.57%)
Min       latency-32         1.96 (   0.00%)        1.98 (  -0.66%)
Min       latency-64         5.73 (   0.00%)       32.03 (-458.86%)
Min       latency-128       17.60 (   0.00%)       17.79 (  -1.09%)
Min       latency-256       24.71 (   0.00%)       24.06 (   2.66%)
Amean     latency-1          2.18 (   0.00%)        0.46 *  78.72%*
Amean     latency-2          4.52 (   0.00%)        0.54 *  88.10%*
Amean     latency-4          9.99 (   0.00%)        0.97 *  90.34%*
Amean     latency-8         13.62 (   0.00%)        1.02 *  92.48%*
Amean     latency-16        14.11 (   0.00%)        4.07 *  71.16%*
Amean     latency-32        21.18 (   0.00%)       45.50 *-114.84%*
Amean     latency-64        61.19 (   0.00%)       58.78 *   3.95%*
Amean     latency-128       56.48 (   0.00%)       54.86 *   2.86%*
Amean     latency-256       81.08 (   0.00%)       80.03 *   1.30%*
Stddev    latency-1          6.85 (   0.00%)        0.17 (  97.56%)
Stddev    latency-2         10.51 (   0.00%)        0.19 (  98.19%)
Stddev    latency-4         14.51 (   0.00%)        1.61 (  88.91%)
Stddev    latency-8         15.71 (   0.00%)        2.04 (  87.00%)
Stddev    latency-16        16.36 (   0.00%)       10.60 (  35.21%)
Stddev    latency-32        19.95 (   0.00%)       25.89 ( -29.78%)
Stddev    latency-64        18.64 (   0.00%)       16.95 (   9.07%)
Stddev    latency-128       17.61 (   0.00%)       19.64 ( -11.52%)
Stddev    latency-256       32.23 (   0.00%)       32.94 (  -2.19%)
CoeffVar  latency-1        313.68 (   0.00%)       35.95 (  88.54%)
CoeffVar  latency-2        232.70 (   0.00%)       35.35 (  84.81%)
CoeffVar  latency-4        145.21 (   0.00%)      166.58 ( -14.72%)
CoeffVar  latency-8        115.34 (   0.00%)      199.35 ( -72.85%)
CoeffVar  latency-16       115.92 (   0.00%)      260.45 (-124.68%)
CoeffVar  latency-32        94.21 (   0.00%)       56.90 (  39.59%)
CoeffVar  latency-64        30.47 (   0.00%)       28.84 (   5.33%)
CoeffVar  latency-128       31.19 (   0.00%)       35.80 ( -14.80%)
CoeffVar  latency-256       39.75 (   0.00%)       41.15 (  -3.53%)
Max       latency-1         34.43 (   0.00%)        2.43 (  92.95%)
Max       latency-2         34.71 (   0.00%)        2.94 (  91.53%)
Max       latency-4         35.80 (   0.00%)       13.44 (  62.47%)
Max       latency-8         36.29 (   0.00%)       23.73 (  34.61%)
Max       latency-16        64.87 (   0.00%)       76.11 ( -17.32%)
Max       latency-32       133.05 (   0.00%)      123.88 (   6.89%)
Max       latency-64       150.48 (   0.00%)      198.68 ( -32.02%)
Max       latency-128      101.89 (   0.00%)      140.66 ( -38.05%)
Max       latency-256      326.53 (   0.00%)      357.54 (  -9.50%)
BAmean-50 latency-1          0.42 (   0.00%)        0.41 (   3.49%)
BAmean-50 latency-2          0.44 (   0.00%)        0.48 (  -9.16%)
BAmean-50 latency-4          0.60 (   0.00%)        0.55 (   8.02%)
BAmean-50 latency-8          0.80 (   0.00%)        0.75 (   6.60%)
BAmean-50 latency-16         1.57 (   0.00%)        1.08 (  31.38%)
BAmean-50 latency-32         3.83 (   0.00%)       25.03 (-553.42%)
BAmean-50 latency-64        47.76 (   0.00%)       47.49 (   0.56%)
BAmean-50 latency-128       42.90 (   0.00%)       40.81 (   4.89%)
BAmean-50 latency-256       62.90 (   0.00%)       64.40 (  -2.39%)
BAmean-95 latency-1          0.67 (   0.00%)        0.44 (  33.70%)
BAmean-95 latency-2          2.93 (   0.00%)        0.52 (  82.40%)
BAmean-95 latency-4          8.67 (   0.00%)        0.65 (  92.46%)
BAmean-95 latency-8         12.45 (   0.00%)        0.79 (  93.62%)
BAmean-95 latency-16        12.42 (   0.00%)        1.93 (  84.44%)
BAmean-95 latency-32        18.63 (   0.00%)       43.09 (-131.27%)
BAmean-95 latency-64        59.09 (   0.00%)       56.46 (   4.44%)
BAmean-95 latency-128       54.56 (   0.00%)       52.33 (   4.08%)
BAmean-95 latency-256       76.21 (   0.00%)       74.98 (   1.61%)
BAmean-99 latency-1          1.82 (   0.00%)        0.45 (  75.36%)
BAmean-99 latency-2          4.18 (   0.00%)        0.52 (  87.49%)
BAmean-99 latency-4          9.70 (   0.00%)        0.84 (  91.36%)
BAmean-99 latency-8         13.37 (   0.00%)        0.81 (  93.92%)
BAmean-99 latency-16        13.54 (   0.00%)        3.36 (  75.21%)
BAmean-99 latency-32        20.20 (   0.00%)       44.75 (-121.58%)
BAmean-99 latency-64        60.48 (   0.00%)       57.76 (   4.49%)
BAmean-99 latency-128       55.97 (   0.00%)       53.94 (   3.63%)
BAmean-99 latency-256       78.60 (   0.00%)       77.39 (   1.54%)

dbench4 Throughput (misleading but traditional)
                     6_1_base_config       6_1_final_config
                    io-dbench4-async       io-dbench4-async
Min       1        813.26 (   0.00%)      808.06 (  -0.64%)
Min       2       1348.40 (   0.00%)     1373.40 (   1.85%)
Min       4       2250.91 (   0.00%)     2349.87 (   4.40%)
Min       8       3419.43 (   0.00%)     3467.44 (   1.40%)
Min       16      5191.34 (   0.00%)     5808.41 (  11.89%)
Min       32      7798.01 (   0.00%)     7461.17 (  -4.32%)
Min       64      6056.72 (   0.00%)     6566.36 (   8.41%)
Min       128     6103.45 (   0.00%)     6455.82 (   5.77%)
Min       256     7409.77 (   0.00%)     7317.59 (  -1.24%)
Hmean     1        820.42 (   0.00%)      829.23 *   1.07%*
Hmean     2       1433.43 (   0.00%)     1429.61 *  -0.27%*
Hmean     4       2281.21 (   0.00%)     2364.06 *   3.63%*
Hmean     8       3493.67 (   0.00%)     3512.71 *   0.54%*
Hmean     16      5354.84 (   0.00%)     6069.95 *  13.35%*
Hmean     32      8060.95 (   0.00%)     8147.90 *   1.08%*
Hmean     64      6579.22 (   0.00%)     7151.69 *   8.70%*
Hmean     128     7340.59 (   0.00%)     7531.74 *   2.60%*
Hmean     256     7685.46 (   0.00%)     7758.75 *   0.95%*
Stddev    1          2.62 (   0.00%)        5.61 (-114.51%)
Stddev    2         28.63 (   0.00%)       12.60 (  55.98%)
Stddev    4          7.58 (   0.00%)       12.24 ( -61.45%)
Stddev    8         36.99 (   0.00%)       76.05 (-105.62%)
Stddev    16       100.33 (   0.00%)      101.59 (  -1.26%)
Stddev    32       216.04 (   0.00%)      599.98 (-177.71%)
Stddev    64      1101.72 (   0.00%)      198.64 (  81.97%)
Stddev    128      233.31 (   0.00%)      191.10 (  18.09%)
Stddev    256      474.44 (   0.00%)      220.54 (  53.52%)
CoeffVar  1          0.32 (   0.00%)        0.68 (-112.23%)
CoeffVar  2          2.00 (   0.00%)        0.88 (  55.84%)
CoeffVar  4          0.33 (   0.00%)        0.52 ( -55.79%)
CoeffVar  8          1.06 (   0.00%)        2.16 (-104.44%)
CoeffVar  16         1.87 (   0.00%)        1.67 (  10.67%)
CoeffVar  32         2.68 (   0.00%)        7.33 (-173.51%)
CoeffVar  64        16.42 (   0.00%)        2.78 (  83.10%)
CoeffVar  128        3.17 (   0.00%)        2.54 (  20.13%)
CoeffVar  256        6.15 (   0.00%)        2.84 (  53.84%)
Max       1        824.86 (   0.00%)      835.60 (   1.30%)
Max       2       1478.26 (   0.00%)     1445.52 (  -2.21%)
Max       4       2300.81 (   0.00%)     2437.53 (   5.94%)
Max       8       3536.60 (   0.00%)     3924.47 (  10.97%)
Max       16      5726.97 (   0.00%)     6188.43 (   8.06%)
Max       32      8589.30 (   0.00%)     9179.96 (   6.88%)
Max       64     10975.99 (   0.00%)     7353.65 ( -33.00%)
Max       128     7585.49 (   0.00%)     7818.70 (   3.07%)
Max       256     9583.72 (   0.00%)     8450.29 ( -11.83%)
BHmean-50 1        822.25 (   0.00%)      834.01 (   1.43%)
BHmean-50 2       1457.81 (   0.00%)     1435.25 (  -1.55%)
BHmean-50 4       2287.80 (   0.00%)     2369.75 (   3.58%)
BHmean-50 8       3524.24 (   0.00%)     3550.48 (   0.74%)
BHmean-50 16      5426.84 (   0.00%)     6154.63 (  13.41%)
BHmean-50 32      8254.63 (   0.00%)     8705.67 (   5.46%)
BHmean-50 64      7058.18 (   0.00%)     7311.16 (   3.58%)
BHmean-50 128     7457.52 (   0.00%)     7605.17 (   1.98%)
BHmean-50 256     7914.05 (   0.00%)     7874.19 (  -0.50%)
BHmean-95 1        820.74 (   0.00%)      830.01 (   1.13%)
BHmean-95 2       1437.22 (   0.00%)     1432.28 (  -0.34%)
BHmean-95 4       2282.12 (   0.00%)     2364.73 (   3.62%)
BHmean-95 8       3497.46 (   0.00%)     3515.13 (   0.51%)
BHmean-95 16      5360.61 (   0.00%)     6080.68 (  13.43%)
BHmean-95 32      8074.70 (   0.00%)     8187.13 (   1.39%)
BHmean-95 64      6608.51 (   0.00%)     7177.60 (   8.61%)
BHmean-95 128     7394.83 (   0.00%)     7576.42 (   2.46%)
BHmean-95 256     7700.43 (   0.00%)     7768.59 (   0.89%)
BHmean-99 1        820.50 (   0.00%)      829.47 (   1.09%)
BHmean-99 2       1434.42 (   0.00%)     1430.26 (  -0.29%)
BHmean-99 4       2281.50 (   0.00%)     2364.22 (   3.63%)
BHmean-99 8       3494.52 (   0.00%)     3513.23 (   0.54%)
BHmean-99 16      5356.37 (   0.00%)     6072.71 (  13.37%)
BHmean-99 32      8064.00 (   0.00%)     8156.38 (   1.15%)
BHmean-99 64      6585.62 (   0.00%)     7158.54 (   8.70%)
BHmean-99 128     7356.65 (   0.00%)     7545.32 (   2.56%)
BHmean-99 256     7688.68 (   0.00%)     7762.38 (   0.96%)

dbench4 Per-VFS Operation Latency

                6_1_base_config  6_1_final_config
                io-dbench4-async  io-dbench4-async
Duration User        1979.58     2097.00
Duration System     19522.29    19246.17
Duration Elapsed     1643.53     1642.83

                                6_1_base_config  6_1_final_config
                                io-dbench4-async  io-dbench4-async
Ops Minor Faults                      410528.00      362832.00
Ops Normal allocs                  418149664.00   440830731.00
Ops Sector Reads                        1132.00          28.00
Ops Sector Writes                  811726764.00   868601700.00
Ops Page migrate success               41282.00       16565.00
Ops Compaction cost                       42.85          17.19
Ops NUMA alloc hit                 407380422.00   428784879.00
Ops NUMA alloc local               407380245.00   428784630.00
Ops NUMA base-page range updates       83253.00       43358.00
Ops NUMA PTE updates                   83253.00       43358.00
Ops NUMA hint faults                   81229.00       30774.00
Ops NUMA hint local faults %           24325.00       13842.00
Ops NUMA hint local percent               29.95          44.98
Ops NUMA pages migrated                41282.00       16565.00
Ops AutoNUMA cost                        407.51         154.49

autonumabench
                                         6_1_base_config       6_1_final_config
                                      numa-autonumabench     numa-autonumabench
Min       syst-NUMA01                  126.65 (   0.00%)       15.45 (  87.80%)
Min       syst-NUMA01_THREADLOCAL        0.19 (   0.00%)        0.14 (  26.32%)
Min       syst-NUMA02                    0.71 (   0.00%)        0.73 (  -2.82%)
Min       syst-NUMA02_SMT                0.57 (   0.00%)        0.59 (  -3.51%)
Min       elsp-NUMA01                  241.41 (   0.00%)      286.88 ( -18.84%)
Min       elsp-NUMA01_THREADLOCAL        1.05 (   0.00%)        1.03 (   1.90%)
Min       elsp-NUMA02                    2.94 (   0.00%)        3.03 (  -3.06%)
Min       elsp-NUMA02_SMT                3.31 (   0.00%)        3.11 (   6.04%)
Amean     syst-NUMA01                  199.02 (   0.00%)       17.55 *  91.18%*
Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.19 *   8.16%*
Amean     syst-NUMA02                    0.85 (   0.00%)        0.79 *   6.59%*
Amean     syst-NUMA02_SMT                0.63 (   0.00%)        0.63 *   0.00%*
Amean     elsp-NUMA01                  257.78 (   0.00%)      309.43 * -20.04%*
Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.04 *   2.42%*
Amean     elsp-NUMA02                    3.24 (   0.00%)        3.12 *   3.62%*
Amean     elsp-NUMA02_SMT                3.58 (   0.00%)        3.44 *   4.15%*
Stddev    syst-NUMA01                   41.05 (   0.00%)        1.68 (  95.92%)
Stddev    syst-NUMA01_THREADLOCAL        0.01 (   0.00%)        0.03 (-148.57%)
Stddev    syst-NUMA02                    0.11 (   0.00%)        0.06 (  44.44%)
Stddev    syst-NUMA02_SMT                0.03 (   0.00%)        0.03 (  -8.16%)
Stddev    elsp-NUMA01                   12.69 (   0.00%)       15.71 ( -23.79%)
Stddev    elsp-NUMA01_THREADLOCAL        0.01 (   0.00%)        0.01 ( -14.02%)
Stddev    elsp-NUMA02                    0.23 (   0.00%)        0.05 (  76.35%)
Stddev    elsp-NUMA02_SMT                0.29 (   0.00%)        0.26 (   9.56%)
CoeffVar  syst-NUMA01                   20.63 (   0.00%)        9.56 (  53.68%)
CoeffVar  syst-NUMA01_THREADLOCAL        5.50 (   0.00%)       14.88 (-170.66%)
CoeffVar  syst-NUMA02                   12.77 (   0.00%)        7.59 (  40.52%)
CoeffVar  syst-NUMA02_SMT                4.98 (   0.00%)        5.39 (  -8.16%)
CoeffVar  elsp-NUMA01                    4.92 (   0.00%)        5.08 (  -3.13%)
CoeffVar  elsp-NUMA01_THREADLOCAL        0.65 (   0.00%)        0.76 ( -16.85%)
CoeffVar  elsp-NUMA02                    7.16 (   0.00%)        1.76 (  75.47%)
CoeffVar  elsp-NUMA02_SMT                7.98 (   0.00%)        7.53 (   5.65%)
Max       syst-NUMA01                  264.54 (   0.00%)       19.81 (  92.51%)
Max       syst-NUMA01_THREADLOCAL        0.22 (   0.00%)        0.23 (  -4.55%)
Max       syst-NUMA02                    0.99 (   0.00%)        0.89 (  10.10%)
Max       syst-NUMA02_SMT                0.67 (   0.00%)        0.69 (  -2.99%)
Max       elsp-NUMA01                  273.51 (   0.00%)      325.28 ( -18.93%)
Max       elsp-NUMA01_THREADLOCAL        1.07 (   0.00%)        1.05 (   1.87%)
Max       elsp-NUMA02                    3.63 (   0.00%)        3.20 (  11.85%)
Max       elsp-NUMA02_SMT                4.12 (   0.00%)        3.73 (   9.47%)
BAmean-50 syst-NUMA01                  167.44 (   0.00%)       15.93 (  90.49%)
BAmean-50 syst-NUMA01_THREADLOCAL        0.20 (   0.00%)        0.17 (  15.00%)
BAmean-50 syst-NUMA02                    0.75 (   0.00%)        0.74 (   0.89%)
BAmean-50 syst-NUMA02_SMT                0.60 (   0.00%)        0.60 (   0.55%)
BAmean-50 elsp-NUMA01                  245.37 (   0.00%)      293.30 ( -19.53%)
BAmean-50 elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.03 (   2.52%)
BAmean-50 elsp-NUMA02                    3.04 (   0.00%)        3.07 (  -0.88%)
BAmean-50 elsp-NUMA02_SMT                3.36 (   0.00%)        3.18 (   5.36%)
BAmean-95 syst-NUMA01                  188.10 (   0.00%)       17.17 (  90.87%)
BAmean-95 syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.19 (  10.40%)
BAmean-95 syst-NUMA02                    0.82 (   0.00%)        0.77 (   5.88%)
BAmean-95 syst-NUMA02_SMT                0.62 (   0.00%)        0.62 (   0.54%)
BAmean-95 elsp-NUMA01                  255.16 (   0.00%)      306.79 ( -20.24%)
BAmean-95 elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.03 (   2.52%)
BAmean-95 elsp-NUMA02                    3.17 (   0.00%)        3.11 (   2.05%)
BAmean-95 elsp-NUMA02_SMT                3.49 (   0.00%)        3.39 (   3.10%)
BAmean-99 syst-NUMA01                  188.10 (   0.00%)       17.17 (  90.87%)
BAmean-99 syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.19 (  10.40%)
BAmean-99 syst-NUMA02                    0.82 (   0.00%)        0.77 (   5.88%)
BAmean-99 syst-NUMA02_SMT                0.62 (   0.00%)        0.62 (   0.54%)
BAmean-99 elsp-NUMA01                  255.16 (   0.00%)      306.79 ( -20.24%)
BAmean-99 elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.03 (   2.52%)
BAmean-99 elsp-NUMA02                    3.17 (   0.00%)        3.11 (   2.05%)
BAmean-99 elsp-NUMA02_SMT                3.49 (   0.00%)        3.39 (   3.10%)

                6_1_base_config  6_1_final_config
                numa-autonumabench  numa-autonumabench
Duration User      313803.42   325078.49
Duration System      1405.42      134.62
Duration Elapsed     1872.08     2226.35

                                6_1_base_config  6_1_final_config
                                numa-autonumabench  numa-autonumabench
Ops Minor Faults                   239730038.00    57780527.00
Ops Major Faults                         195.00         195.00
Ops Normal allocs                   59157241.00    49821996.00
Ops Sector Reads                       31644.00       31900.00
Ops Sector Writes                      49096.00       51068.00
Ops Page migrate success             7552783.00         275.00
Ops Compaction cost                     7839.79           0.29
Ops NUMA alloc hit                  58997125.00    49605390.00
Ops NUMA alloc local                58979133.00    49603280.00
Ops NUMA base-page range updates    97494921.00        1239.00
Ops NUMA PTE updates                97494921.00        1239.00
Ops NUMA hint faults                98707853.00        1061.00
Ops NUMA hint local faults %        78220667.00         782.00
Ops NUMA hint local percent               79.24          73.70
Ops NUMA pages migrated              7552783.00         275.00
Ops AutoNUMA cost                     494365.23           5.32

Runtime options
Run: OpenMPI Environment
Run:  NAS_OPENMPI_VERSION=openmpi
Run:  PATH=/usr/lib64/mpi/gcc/openmpi/bin:$PATH
Run:  LD_LIBRARY_PATH=/usr/lib64/mpi/gcc/openmpi/lib64
Run: mpirun --use-hwthread-cpus --allow-run-as-root -np 256 ./bin/sp.D.256
...
nas-mpi-bt NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       bt.D      240.02 (   0.00%)      240.23 (  -0.09%)
Amean     bt.D      240.23 (   0.00%)      241.24 *  -0.42%*
Stddev    bt.D        0.30 (   0.00%)        1.71 (-463.35%)
CoeffVar  bt.D        0.13 (   0.00%)        0.71 (-461.00%)
Max       bt.D      240.58 (   0.00%)      243.21 (  -1.09%)
BAmean-50 bt.D      240.02 (   0.00%)      240.23 (  -0.09%)
BAmean-95 bt.D      240.06 (   0.00%)      240.25 (  -0.08%)
BAmean-99 bt.D      240.06 (   0.00%)      240.25 (  -0.08%)

nas-mpi-bt Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-bt.D       1627.17 (   0.00%)     1554.96 (   4.44%)
Min       elspd-bt.D      244.09 (   0.00%)      244.22 (  -0.05%)
Amean     sys-bt.D       1636.36 (   0.00%)     1570.90 *   4.00%*
Amean     elspd-bt.D      244.34 (   0.00%)      245.33 *  -0.41%*
Stddev    sys-bt.D          8.10 (   0.00%)       22.38 (-176.31%)
Stddev    elspd-bt.D        0.35 (   0.00%)        1.78 (-405.43%)
CoeffVar  sys-bt.D          0.49 (   0.00%)        1.42 (-187.83%)
CoeffVar  elspd-bt.D        0.14 (   0.00%)        0.73 (-403.39%)
Max       sys-bt.D       1642.44 (   0.00%)     1596.48 (   2.80%)
Max       elspd-bt.D      244.74 (   0.00%)      247.38 (  -1.08%)
BAmean-50 sys-bt.D       1627.17 (   0.00%)     1554.96 (   4.44%)
BAmean-50 elspd-bt.D      244.09 (   0.00%)      244.22 (  -0.05%)
BAmean-95 sys-bt.D       1633.33 (   0.00%)     1558.11 (   4.61%)
BAmean-95 elspd-bt.D      244.13 (   0.00%)      244.30 (  -0.07%)
BAmean-99 sys-bt.D       1633.33 (   0.00%)     1558.11 (   4.61%)
BAmean-99 elspd-bt.D      244.13 (   0.00%)      244.30 (  -0.07%)


                6_1_base_config  6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User      181648.82   182614.64
Duration System      4910.66     4714.35
Duration Elapsed      742.88      745.28

                                6_1_base_config  6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                   380489860.00    75510645.00
Ops Major Faults                         302.00         292.00
Ops Swap Ins                               0.00           0.00
Ops Normal allocs                   26303147.00    26030593.00
Ops Sector Reads                       33988.00       33904.00
Ops Sector Writes                      41932.00       40896.00
Ops Page migrate success              284339.00        5343.00
Ops Page migrate failure                   1.00           0.00
Ops Compaction cost                      295.14           5.55
Ops NUMA alloc hit                  26169335.00    25898920.00
Ops NUMA alloc local                26169335.00    25898920.00
Ops NUMA base-page range updates   354977260.00    49183327.00
Ops NUMA PTE updates               354977260.00    49183327.00
Ops NUMA hint faults               354078577.00    49085315.00
Ops NUMA hint local faults %       353643387.00    49076300.00
Ops NUMA hint local percent               99.88          99.98
Ops NUMA pages migrated               284339.00        5343.00
Ops AutoNUMA cost                    1772883.13      245770.96

nas-mpi-cg NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       cg.D      156.35 (   0.00%)      142.08 (   9.13%)
Amean     cg.D      157.57 (   0.00%)      142.59 *   9.50%*
Stddev    cg.D        1.89 (   0.00%)        0.48 (  74.93%)
CoeffVar  cg.D        1.20 (   0.00%)        0.33 (  72.30%)
Max       cg.D      159.75 (   0.00%)      143.02 (  10.47%)
BAmean-50 cg.D      156.35 (   0.00%)      142.08 (   9.13%)
BAmean-95 cg.D      156.47 (   0.00%)      142.38 (   9.01%)
BAmean-99 cg.D      156.47 (   0.00%)      142.38 (   9.01%)

nas-mpi-cg Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-cg.D       2484.08 (   0.00%)     2262.29 (   8.93%)
Min       elspd-cg.D      164.10 (   0.00%)      149.84 (   8.69%)
Amean     sys-cg.D       2518.20 (   0.00%)     2296.84 *   8.79%*
Amean     elspd-cg.D      165.32 (   0.00%)      150.40 *   9.02%*
Stddev    sys-cg.D         40.64 (   0.00%)       30.83 (  24.14%)
Stddev    elspd-cg.D        1.96 (   0.00%)        0.50 (  74.38%)
CoeffVar  sys-cg.D          1.61 (   0.00%)        1.34 (  16.83%)
CoeffVar  elspd-cg.D        1.18 (   0.00%)        0.33 (  71.84%)
Max       sys-cg.D       2563.17 (   0.00%)     2321.56 (   9.43%)
Max       elspd-cg.D      167.58 (   0.00%)      150.80 (  10.01%)
BAmean-50 sys-cg.D       2484.08 (   0.00%)     2262.29 (   8.93%)
BAmean-50 elspd-cg.D      164.10 (   0.00%)      149.84 (   8.69%)
BAmean-95 sys-cg.D       2495.72 (   0.00%)     2284.47 (   8.46%)
BAmean-95 elspd-cg.D      164.19 (   0.00%)      150.20 (   8.52%)
BAmean-99 sys-cg.D       2495.72 (   0.00%)     2284.47 (   8.46%)
BAmean-99 elspd-cg.D      164.19 (   0.00%)      150.20 (   8.52%)


                6_1_base_config  6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User      118328.59   107542.39
Duration System      7555.79     6891.65
Duration Elapsed      501.43      456.77

                                6_1_base_config  6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                   115149790.00    32917253.00
Ops Major Faults                          25.00          26.00
Ops Normal allocs                   23369629.00    23011001.00
Ops Sector Reads                         436.00         540.00
Ops Sector Writes                      39728.00       39280.00
Ops Page migrate success              353217.00         686.00
Ops Compaction cost                      366.64           0.71
Ops NUMA alloc hit                  23250676.00    22895722.00
Ops NUMA alloc local                23250672.00    22895722.00
Ops NUMA base-page range updates   106418880.00     9823982.00
Ops NUMA PTE updates               106418880.00     9823982.00
Ops NUMA hint faults                91920469.00     9686971.00
Ops NUMA hint local faults %        91283239.00     9685903.00
Ops NUMA hint local percent               99.31          99.99
Ops NUMA pages migrated               353217.00         686.00
Ops AutoNUMA cost                     460353.99       48503.64

nas-mpi-ep NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       ep.D        8.44 (   0.00%)        8.65 (  -2.49%)
Amean     ep.D        8.65 (   0.00%)        9.35 *  -8.10%*
Stddev    ep.D        0.27 (   0.00%)        0.85 (-218.35%)
CoeffVar  ep.D        3.10 (   0.00%)        9.14 (-194.51%)
Max       ep.D        8.95 (   0.00%)       10.30 ( -15.08%)
BAmean-50 ep.D        8.44 (   0.00%)        8.65 (  -2.49%)
BAmean-95 ep.D        8.50 (   0.00%)        8.87 (  -4.41%)
BAmean-99 ep.D        8.50 (   0.00%)        8.87 (  -4.41%)

nas-mpi-ep Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-ep.D         31.16 (   0.00%)       20.49 (  34.24%)
Min       elspd-ep.D        9.91 (   0.00%)       10.16 (  -2.52%)
Amean     sys-ep.D         31.47 (   0.00%)       27.28 *  13.30%*
Amean     elspd-ep.D       10.20 (   0.00%)       10.89 *  -6.73%*
Stddev    sys-ep.D          0.36 (   0.00%)        6.04 (-1591.77%)
Stddev    elspd-ep.D        0.29 (   0.00%)        0.86 (-196.95%)
CoeffVar  sys-ep.D          1.13 (   0.00%)       22.12 (-1851.38%)
CoeffVar  elspd-ep.D        2.84 (   0.00%)        7.91 (-178.23%)
Max       sys-ep.D         31.86 (   0.00%)       32.03 (  -0.53%)
Max       elspd-ep.D       10.49 (   0.00%)       11.84 ( -12.87%)
BAmean-50 sys-ep.D         31.16 (   0.00%)       20.49 (  34.24%)
BAmean-50 elspd-ep.D        9.91 (   0.00%)       10.16 (  -2.52%)
BAmean-95 sys-ep.D         31.27 (   0.00%)       24.91 (  20.35%)
BAmean-95 elspd-ep.D       10.06 (   0.00%)       10.41 (  -3.53%)
BAmean-99 sys-ep.D         31.27 (   0.00%)       24.91 (  20.35%)
BAmean-99 elspd-ep.D       10.06 (   0.00%)       10.41 (  -3.53%)


                6_1_base_config  6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User        6689.32     7231.16
Duration System        95.70       83.00
Duration Elapsed       35.58       37.36

                                6_1_base_config  6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                     3171011.00     2958900.00
Ops Major Faults                          21.00          21.00
Ops Normal allocs                    1735733.00     1725659.00
Ops Sector Reads                          24.00          24.00
Ops Sector Writes                      37008.00       37620.00
Ops NUMA alloc hit                   1620169.00     1618267.00
Ops NUMA alloc local                 1620169.00     1618267.00
Ops NUMA base-page range updates     1343733.00     1061345.00
Ops NUMA PTE updates                 1343733.00     1061345.00
Ops NUMA hint faults                 1158152.00      945501.00
Ops NUMA hint local faults %         1158144.00      945501.00
Ops NUMA pages migrated                    7.00           0.00
Ops AutoNUMA cost                       5800.17        4734.93

nas-mpi-is NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       is.D        5.05 (   0.00%)        5.02 (   0.59%)
Amean     is.D        5.19 (   0.00%)        5.22 *  -0.45%*
Stddev    is.D        0.13 (   0.00%)        0.28 (-122.86%)
CoeffVar  is.D        2.44 (   0.00%)        5.41 (-121.86%)
Max       is.D        5.29 (   0.00%)        5.54 (  -4.73%)
BAmean-50 is.D        5.05 (   0.00%)        5.02 (   0.59%)
BAmean-95 is.D        5.14 (   0.00%)        5.05 (   1.75%)
BAmean-99 is.D        5.14 (   0.00%)        5.05 (   1.75%)

nas-mpi-is Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-is.D        370.08 (   0.00%)      364.28 (   1.57%)
Min       elspd-is.D        9.05 (   0.00%)        8.91 (   1.55%)
Amean     sys-is.D        377.93 (   0.00%)      381.63 *  -0.98%*
Amean     elspd-is.D        9.13 (   0.00%)        9.10 *   0.29%*
Stddev    sys-is.D          6.97 (   0.00%)       23.90 (-243.13%)
Stddev    elspd-is.D        0.07 (   0.00%)        0.24 (-240.87%)
CoeffVar  sys-is.D          1.84 (   0.00%)        6.26 (-239.80%)
CoeffVar  elspd-is.D        0.77 (   0.00%)        2.62 (-241.87%)
Max       sys-is.D        383.37 (   0.00%)      408.90 (  -6.66%)
Max       elspd-is.D        9.18 (   0.00%)        9.37 (  -2.07%)
BAmean-50 sys-is.D        370.08 (   0.00%)      364.28 (   1.57%)
BAmean-50 elspd-is.D        9.05 (   0.00%)        8.91 (   1.55%)
BAmean-95 sys-is.D        375.22 (   0.00%)      368.00 (   1.92%)
BAmean-95 elspd-is.D        9.11 (   0.00%)        8.97 (   1.48%)
BAmean-99 sys-is.D        375.22 (   0.00%)      368.00 (   1.92%)
BAmean-99 elspd-is.D        9.11 (   0.00%)        8.97 (   1.48%)


                6_1_base_config  6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User        4762.89     4747.01
Duration System      1134.94     1146.02
Duration Elapsed       32.11       31.81

                                6_1_base_config  6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                    66496004.00    47159682.00
Ops Major Faults                          21.00          21.00
Ops Normal allocs                   23449077.00    22990132.00
Ops Sector Reads                         384.00         384.00
Ops Sector Writes                      38040.00       35828.00
Ops Page migrate success             1938622.00     1485033.00
Ops Compaction cost                     2012.29        1541.46
Ops NUMA alloc hit                  23337713.00    22880414.00
Ops NUMA alloc local                23337713.00    22880414.00
Ops NUMA base-page range updates    45446261.00    25594500.00
Ops NUMA PTE updates                45446261.00    25594500.00
Ops NUMA hint faults                44834767.00    25501047.00
Ops NUMA hint local faults %        42555580.00    24013161.00
Ops NUMA hint local percent               94.92          94.17
Ops NUMA pages migrated              1938622.00     1485033.00
Ops AutoNUMA cost                     224528.79      127712.61

nas-mpi-lu NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       lu.D      183.79 (   0.00%)      183.22 (   0.31%)
Amean     lu.D      183.96 (   0.00%)      183.43 *   0.29%*
Stddev    lu.D        0.25 (   0.00%)        0.19 (  26.01%)
CoeffVar  lu.D        0.14 (   0.00%)        0.10 (  25.80%)
Max       lu.D      184.25 (   0.00%)      183.55 (   0.38%)
BAmean-50 lu.D      183.79 (   0.00%)      183.22 (   0.31%)
BAmean-95 lu.D      183.82 (   0.00%)      183.38 (   0.24%)
BAmean-99 lu.D      183.82 (   0.00%)      183.38 (   0.24%)

nas-mpi-lu Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-lu.D        755.87 (   0.00%)      636.36 (  15.81%)
Min       elspd-lu.D      187.61 (   0.00%)      187.18 (   0.23%)
Amean     sys-lu.D        766.88 (   0.00%)      649.63 *  15.29%*
Amean     elspd-lu.D      187.86 (   0.00%)      187.28 *   0.31%*
Stddev    sys-lu.D          9.69 (   0.00%)       12.13 ( -25.19%)
Stddev    elspd-lu.D        0.28 (   0.00%)        0.11 (  61.93%)
CoeffVar  sys-lu.D          1.26 (   0.00%)        1.87 ( -47.79%)
CoeffVar  elspd-lu.D        0.15 (   0.00%)        0.06 (  61.81%)
Max       sys-lu.D        774.09 (   0.00%)      660.14 (  14.72%)
Max       elspd-lu.D      188.16 (   0.00%)      187.39 (   0.41%)
BAmean-50 sys-lu.D        755.87 (   0.00%)      636.36 (  15.81%)
BAmean-50 elspd-lu.D      187.61 (   0.00%)      187.18 (   0.23%)
BAmean-95 sys-lu.D        763.28 (   0.00%)      644.38 (  15.58%)
BAmean-95 elspd-lu.D      187.71 (   0.00%)      187.22 (   0.26%)
BAmean-99 sys-lu.D        763.28 (   0.00%)      644.38 (  15.58%)
BAmean-99 elspd-lu.D      187.71 (   0.00%)      187.22 (   0.26%)


                6_1_base_config  6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User      140870.85   140812.35
Duration System      2302.38     1950.50
Duration Elapsed      572.07      570.59

                                6_1_base_config  6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                   159164134.00    36327663.00
Ops Major Faults                          22.00          23.00
Ops Normal allocs                   12766911.00    12701365.00
Ops Sector Reads                         392.00         432.00
Ops Sector Writes                      40336.00       39988.00
Ops Page migrate success               49476.00        5899.00
Ops Compaction cost                       51.36           6.12
Ops NUMA alloc hit                  12641163.00    12586537.00
Ops NUMA alloc local                12641163.00    12586537.00
Ops NUMA base-page range updates   146297502.00    23468561.00
Ops NUMA PTE updates               146297502.00    23468561.00
Ops NUMA hint faults               146195881.00    23366950.00
Ops NUMA hint local faults %       146121976.00    23360847.00
Ops NUMA hint local percent               99.95          99.97
Ops NUMA pages migrated                49476.00        5899.00
Ops AutoNUMA cost                     732004.43      116999.14

nas-mpi-mg NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       mg.D       32.10 (   0.00%)       31.87 (   0.72%)
Amean     mg.D       32.14 (   0.00%)       31.88 *   0.79%*
Stddev    mg.D        0.04 (   0.00%)        0.02 (  34.24%)
CoeffVar  mg.D        0.11 (   0.00%)        0.07 (  33.72%)
Max       mg.D       32.17 (   0.00%)       31.91 (   0.81%)
BAmean-50 mg.D       32.10 (   0.00%)       31.87 (   0.72%)
BAmean-95 mg.D       32.12 (   0.00%)       31.87 (   0.78%)
BAmean-99 mg.D       32.12 (   0.00%)       31.87 (   0.78%)

nas-mpi-mg Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-mg.D        477.89 (   0.00%)      418.76 (  12.37%)
Min       elspd-mg.D       35.88 (   0.00%)       35.62 (   0.72%)
Amean     sys-mg.D        484.38 (   0.00%)      420.00 *  13.29%*
Amean     elspd-mg.D       35.98 (   0.00%)       35.74 *   0.68%*
Stddev    sys-mg.D          5.70 (   0.00%)        1.16 (  79.67%)
Stddev    elspd-mg.D        0.11 (   0.00%)        0.13 ( -18.57%)
CoeffVar  sys-mg.D          1.18 (   0.00%)        0.28 (  76.55%)
CoeffVar  elspd-mg.D        0.31 (   0.00%)        0.37 ( -19.38%)
Max       sys-mg.D        488.60 (   0.00%)      421.06 (  13.82%)
Max       elspd-mg.D       36.10 (   0.00%)       35.88 (   0.61%)
BAmean-50 sys-mg.D        477.89 (   0.00%)      418.76 (  12.37%)
BAmean-50 elspd-mg.D       35.88 (   0.00%)       35.62 (   0.72%)
BAmean-95 sys-mg.D        482.27 (   0.00%)      419.47 (  13.02%)
BAmean-95 elspd-mg.D       35.92 (   0.00%)       35.66 (   0.71%)
BAmean-99 sys-mg.D        482.27 (   0.00%)      419.47 (  13.02%)
BAmean-99 elspd-mg.D       35.92 (   0.00%)       35.66 (   0.71%)


                6_1_base_config  6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User       25080.64    25066.21
Duration System      1454.59     1261.44
Duration Elapsed      115.21      114.28

                                6_1_base_config  6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                   179027668.00    68883081.00
Ops Major Faults                          21.00          21.00
Ops Normal allocs                   23768625.00    23750979.00
Ops Sector Reads                          88.00          88.00
Ops Sector Writes                      38272.00       37544.00
Ops Page migrate success               12059.00        1774.00
Ops Compaction cost                       12.52           1.84
Ops NUMA alloc hit                  23643850.00    23634234.00
Ops NUMA alloc local                23643849.00    23634234.00
Ops NUMA base-page range updates   155287026.00    45123410.00
Ops NUMA PTE updates               155287026.00    45123410.00
Ops NUMA hint faults               155055287.00    44908778.00
Ops NUMA hint local faults %       155037730.00    44906752.00
Ops NUMA hint local percent               99.99         100.00
Ops NUMA pages migrated                12059.00        1774.00
Ops AutoNUMA cost                     776363.67      224859.79

nas-mpi-sp NAS Time
                      6_1_base_config       6_1_final_config
                     hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sp.D      408.19 (   0.00%)      409.12 (  -0.23%)
Amean     sp.D      408.32 (   0.00%)      409.82 *  -0.37%*
Stddev    sp.D        0.13 (   0.00%)        0.63 (-400.26%)
CoeffVar  sp.D        0.03 (   0.00%)        0.15 (-398.42%)
Max       sp.D      408.44 (   0.00%)      410.33 (  -0.46%)
BAmean-50 sp.D      408.19 (   0.00%)      409.12 (  -0.23%)
BAmean-95 sp.D      408.25 (   0.00%)      409.56 (  -0.32%)
BAmean-99 sp.D      408.25 (   0.00%)      409.56 (  -0.32%)

nas-mpi-sp Wall Time
                            6_1_base_config       6_1_final_config
                           hpc-nas-mpi-full       hpc-nas-mpi-full
Min       sys-sp.D       4120.36 (   0.00%)     3820.43 (   7.28%)
Min       elspd-sp.D      412.02 (   0.00%)      413.00 (  -0.24%)
Amean     sys-sp.D       4187.54 (   0.00%)     3970.32 *   5.19%*
Amean     elspd-sp.D      412.20 (   0.00%)      413.64 *  -0.35%*
Stddev    sys-sp.D         65.82 (   0.00%)      132.28 (-100.95%)
Stddev    elspd-sp.D        0.20 (   0.00%)        0.57 (-181.53%)
CoeffVar  sys-sp.D          1.57 (   0.00%)        3.33 (-111.94%)
CoeffVar  elspd-sp.D        0.05 (   0.00%)        0.14 (-180.55%)
Max       sys-sp.D       4251.92 (   0.00%)     4070.68 (   4.26%)
Max       elspd-sp.D      412.42 (   0.00%)      414.08 (  -0.40%)
BAmean-50 sys-sp.D       4120.36 (   0.00%)     3820.43 (   7.28%)
BAmean-50 elspd-sp.D      412.02 (   0.00%)      413.00 (  -0.24%)
BAmean-95 sys-sp.D       4155.35 (   0.00%)     3920.14 (   5.66%)
BAmean-95 elspd-sp.D      412.10 (   0.00%)      413.43 (  -0.32%)
BAmean-99 sys-sp.D       4155.35 (   0.00%)     3920.14 (   5.66%)
BAmean-99 elspd-sp.D      412.10 (   0.00%)      413.43 (  -0.32%)


                6_1_base_config  6_1_final_config
                hpc-nas-mpi-full  hpc-nas-mpi-full
Duration User      302843.74   304665.29
Duration System     12564.12    11912.42
Duration Elapsed     1246.11     1249.45

                                6_1_base_config  6_1_final_config
                                hpc-nas-mpi-full  hpc-nas-mpi-full
Ops Minor Faults                   383084930.00    65419151.00
Ops Major Faults                          76.00          21.00
Ops Normal allocs                   22711498.00    22411009.00
Ops Sector Reads                       13799.00         256.00
Ops Sector Writes                      44632.00       42080.00
Ops Page migrate success              299748.00        8317.00
Ops Page migrate failure                   1.00           0.00
Ops Compaction cost                      311.14           8.63
Ops NUMA alloc hit                  22593673.00    22286577.00
Ops NUMA alloc local                22593274.00    22286577.00
Ops NUMA base-page range updates   360559234.00    42879377.00
Ops NUMA PTE updates               360559234.00    42879377.00
Ops NUMA hint faults               360428916.00    42781996.00
Ops NUMA hint local faults %       359952878.00    42767934.00
Ops NUMA hint local percent               99.87          99.97
Ops NUMA pages migrated               299748.00        8317.00
Ops AutoNUMA cost                    1804674.19      214210.29

[1] sched/numa: Process Adaptive autoNUMA 
Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/
Raghavendra K T (1):
  sched/numa: Enhance vma scanning logic

 include/linux/mm_types.h |  2 ++
 kernel/sched/fair.c      | 32 ++++++++++++++++++++++++++++++++
 mm/memory.c              | 21 +++++++++++++++++++++
 3 files changed, 55 insertions(+)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-16  1:35 ` [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic Raghavendra K T
@ 2023-01-16  2:25   ` Raghavendra K T
  2023-01-17 11:14   ` David Hildenbrand
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Raghavendra K T @ 2023-01-16  2:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	David Hildenbrand, xu xin, Yu Zhao, Colin Cross, Arnd Bergmann,
	Hugh Dickins, Bharata B Rao, Disha Talreja, Raghavendra K T

 During NUMA scanning, make sure only the relevant vmas of the
tasks are scanned.

Logic:
1) For the first two scan iterations, allow unconditional scanning of vmas
2) Store the recent 4 unique tasks (last 8 bits of their PIDs) that
   accessed the vma. False negatives in case of collision should be fine
   here.
3) If more than 4 pids exist, assume the task indeed accessed the vma
   to avoid false negatives

Co-developed-by: Bharata B Rao <bharata@amd.com>
(initial patch to store pid information)

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm_types.h |  2 ++
 kernel/sched/fair.c      | 32 ++++++++++++++++++++++++++++++++
 mm/memory.c              | 21 +++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..07feae37b8e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -506,6 +506,8 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	unsigned int accessing_pids;
+	int next_pid_slot;
 } __randomize_layout;
 
 struct kioctx_table;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..944d2e3b0b3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2916,6 +2916,35 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
+static bool vma_is_accessed(struct vm_area_struct *vma)
+{
+	int i;
+	bool more_pids_exist;
+	unsigned long pid, max_pids;
+	unsigned long current_pid = current->pid & LAST__PID_MASK;
+
+	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
+
+	/* By default we assume >= max_pids exist */
+	more_pids_exist = true;
+
+	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+		return true;
+
+	for (i = 0; i < max_pids; i++) {
+		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
+			LAST__PID_MASK;
+		if (pid == current_pid)
+			return true;
+		if (pid == 0) {
+			more_pids_exist = false;
+			break;
+		}
+	}
+
+	return more_pids_exist;
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3015,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessible(vma))
 			continue;
 
+		if (!vma_is_accessed(vma))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
diff --git a/mm/memory.c b/mm/memory.c
index 8c8420934d60..fafd78d87a51 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4717,7 +4717,28 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	pte_t pte, old_pte;
 	bool was_writable = pte_savedwrite(vmf->orig_pte);
 	int flags = 0;
+	int pid_slot = vma->next_pid_slot;
 
+	int i;
+	unsigned long pid, max_pids;
+	unsigned long current_pid = current->pid & LAST__PID_MASK;
+
+	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
+
+	/* Avoid recording a duplicate PID */
+	for (i = 0; i < max_pids; i++) {
+		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
+			LAST__PID_MASK;
+		if (pid == current_pid)
+			goto skip_update;
+	}
+
+	vma->next_pid_slot = (pid_slot + 1) % max_pids;
+	vma->accessing_pids &= ~(LAST__PID_MASK << (pid_slot * LAST__PID_SHIFT));
+	vma->accessing_pids |= ((current_pid) <<
+			(pid_slot * LAST__PID_SHIFT));
+
+skip_update:
 	/*
 	 * The "pte" at this point cannot be used safely without
 	 * validation through pte_unmap_same(). It's of NUMA type but
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-16  1:35 ` [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic Raghavendra K T
  2023-01-16  2:25   ` Raghavendra K T
@ 2023-01-17 11:14   ` David Hildenbrand
  2023-01-17 13:09     ` Raghavendra K T
  2023-01-17 14:59   ` Mel Gorman
  2023-01-19  9:39   ` Mike Rapoport
  3 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2023-01-17 11:14 UTC (permalink / raw)
  To: Raghavendra K T, linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	xu xin, Yu Zhao, Colin Cross, Arnd Bergmann, Hugh Dickins,
	Bharata B Rao, Disha Talreja

On 16.01.23 03:25, Raghavendra K T wrote:
>   During NUMA scanning, make sure only the relevant vmas of the
> tasks are scanned.
> 
> Logic:
> 1) For the first two scan iterations, allow unconditional scanning of vmas
> 2) Store the recent 4 unique tasks (last 8 bits of their PIDs) that
>    accessed the vma. False negatives in case of collision should be fine
>    here.
> 3) If more than 4 pids exist, assume the task indeed accessed the vma
>    to avoid false negatives
> 
> Co-developed-by: Bharata B Rao <bharata@amd.com>
> (initial patch to store pid information)
> 
> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>   include/linux/mm_types.h |  2 ++
>   kernel/sched/fair.c      | 32 ++++++++++++++++++++++++++++++++
>   mm/memory.c              | 21 +++++++++++++++++++++
>   3 files changed, 55 insertions(+)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..07feae37b8e6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -506,6 +506,8 @@ struct vm_area_struct {
>   	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>   #endif
>   	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +	unsigned int accessing_pids;
> +	int next_pid_slot;
>   } __randomize_layout;

What immediately jumps at me is the unconditional growth of a VMA by 8
bytes. A process with 64k mappings consumes 512 KiB more of memory,
possibly completely unnecessarily.

This at least needs to be fenced by CONFIG_NUMA_BALANCING.
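
A minimal sketch of what that fencing could look like (illustrative only,
reusing the field names from the patch):

	#ifdef CONFIG_NUMA_BALANCING
		unsigned int accessing_pids;	/* 4 x 8-bit PID hashes */
		int next_pid_slot;		/* round-robin write index */
	#endif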

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-17 11:14   ` David Hildenbrand
@ 2023-01-17 13:09     ` Raghavendra K T
  0 siblings, 0 replies; 16+ messages in thread
From: Raghavendra K T @ 2023-01-17 13:09 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	xu xin, Yu Zhao, Colin Cross, Arnd Bergmann, Hugh Dickins,
	Bharata B Rao, Disha Talreja

On 1/17/2023 4:44 PM, David Hildenbrand wrote:
> On 16.01.23 03:25, Raghavendra K T wrote:
>>   During NUMA scanning, make sure only the relevant vmas of the
>> tasks are scanned.
>>
>> Logic:
>> 1) For the first two scan iterations, allow unconditional scanning of vmas
>> 2) Store the recent 4 unique tasks (last 8 bits of their PIDs) that
>>    accessed the vma. False negatives in case of collision should be fine
>>    here.
>> 3) If more than 4 pids exist, assume the task indeed accessed the vma
>>    to avoid false negatives
>>
>> Co-developed-by: Bharata B Rao <bharata@amd.com>
>> (initial patch to store pid information)
>>
>> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Bharata B Rao <bharata@amd.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>>   include/linux/mm_types.h |  2 ++
>>   kernel/sched/fair.c      | 32 ++++++++++++++++++++++++++++++++
>>   mm/memory.c              | 21 +++++++++++++++++++++
>>   3 files changed, 55 insertions(+)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 500e536796ca..07feae37b8e6 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -506,6 +506,8 @@ struct vm_area_struct {
>>       struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
>>   #endif
>>       struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>> +    unsigned int accessing_pids;
>> +    int next_pid_slot;
>>   } __randomize_layout;
> 
> What immediately jumps at me is the unconditional growth of a VMA by 8
> bytes. A process with 64k mappings consumes 512 KiB more of memory,
> possibly completely unnecessarily.
> 
> This at least needs to be fenced by CONFIG_NUMA_BALANCING.
> 

Thanks for the review, David. Good point.. I do agree. I see I will have
to fence further only in memory.c, since fair.c is already taken care of.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-16  1:35 ` [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic Raghavendra K T
  2023-01-16  2:25   ` Raghavendra K T
  2023-01-17 11:14   ` David Hildenbrand
@ 2023-01-17 14:59   ` Mel Gorman
  2023-01-17 17:45     ` Raghavendra K T
  2023-01-18  4:43     ` Bharata B Rao
  2023-01-19  9:39   ` Mike Rapoport
  3 siblings, 2 replies; 16+ messages in thread
From: Mel Gorman @ 2023-01-17 14:59 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: =,
	linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	David Hildenbrand, xu xin, Yu Zhao, Colin Cross, Arnd Bergmann,
	Hugh Dickins, Bharata B Rao, Disha Talreja

Note that the cc list is excessive for the topic.

On Mon, Jan 16, 2023 at 07:05:34AM +0530, Raghavendra K T wrote:
>  During NUMA scanning, make sure only the relevant vmas of the
> tasks are scanned.
> 
> Logic:
> 1) For the first two scan iterations, allow unconditional scanning of vmas
> 2) Store the recent 4 unique tasks (last 8 bits of their PIDs) that
>    accessed the vma. False negatives in case of collision should be fine
>    here.
> 3) If more than 4 pids exist, assume the task indeed accessed the vma
>    to avoid false negatives
> 
> Co-developed-by: Bharata B Rao <bharata@amd.com>
> (initial patch to store pid information)
> 
> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>  include/linux/mm_types.h |  2 ++
>  kernel/sched/fair.c      | 32 ++++++++++++++++++++++++++++++++
>  mm/memory.c              | 21 +++++++++++++++++++++
>  3 files changed, 55 insertions(+)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..07feae37b8e6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -506,6 +506,8 @@ struct vm_area_struct {
>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +	unsigned int accessing_pids;
> +	int next_pid_slot;
>  } __randomize_layout;
>  

This should be behind CONFIG_NUMA_BALANCING but per-vma state should also be
tracked in its own struct and allocated on demand iff the state is required.

>  struct kioctx_table;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e4a0b8bd941c..944d2e3b0b3c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2916,6 +2916,35 @@ static void reset_ptenuma_scan(struct task_struct *p)
>  	p->mm->numa_scan_offset = 0;
>  }
>  
> +static bool vma_is_accessed(struct vm_area_struct *vma)
> +{
> +	int i;
> +	bool more_pids_exist;
> +	unsigned long pid, max_pids;
> +	unsigned long current_pid = current->pid & LAST__PID_MASK;
> +
> +	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
> +
> +	/* By default we assume >= max_pids exist */
> +	more_pids_exist = true;
> +
> +	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
> +		return true;
> +
> +	for (i = 0; i < max_pids; i++) {
> +		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
> +			LAST__PID_MASK;
> +		if (pid == current_pid)
> +			return true;
> +		if (pid == 0) {
> +			more_pids_exist = false;
> +			break;
> +		}
> +	}
> +
> +	return more_pids_exist;
> +}

I get the intent is to avoid PIDs scanning VMAs that it has never faulted
within but it seems unnecessarily complex to search on every fault to track
just 4 pids with no recent access information. The pid modulo BITS_PER_WORD
could be used to set a bit on an unsigned long to track approximate recent
accesses and skip VMAs that do not have the bit set. That would allow more
recent PIDs to be tracked although false positives would still exist. It
would be necessary to reset the mask periodically.

Even tracking 4 pids, a reset is periodically needed. Otherwise it'll
be vulnerable to changes in phase behaviour causing all pids to scan all
VMAs again.
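
(As the follow-up messages below work out, the scheme would look roughly
like the following; the field and helper names here are hypothetical.)

/* Fault path: hash the faulting task's PID onto one bit of a mask. */
static inline void vma_mark_accessed(struct vm_area_struct *vma)
{
	vma->access_mask |= 1UL << (current->pid % BITS_PER_LONG);
}

/* Scan path: skip VMAs whose bit for the current task is clear. */
static inline bool vma_was_accessed(struct vm_area_struct *vma)
{
	return vma->access_mask & (1UL << (current->pid % BITS_PER_LONG));
}

PID collisions still give false positives, but the test is O(1) per VMA
and the whole mask can be cleared in a single store when the periodic
reset fires.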

> @@ -3015,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
>  		if (!vma_is_accessible(vma))
>  			continue;
>  
> +		if (!vma_is_accessed(vma))
> +			continue;
> +
>  		do {
>  			start = max(start, vma->vm_start);
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
> diff --git a/mm/memory.c b/mm/memory.c
> index 8c8420934d60..fafd78d87a51 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4717,7 +4717,28 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	pte_t pte, old_pte;
>  	bool was_writable = pte_savedwrite(vmf->orig_pte);
>  	int flags = 0;
> +	int pid_slot = vma->next_pid_slot;
>  
> +	int i;
> +	unsigned long pid, max_pids;
> +	unsigned long current_pid = current->pid & LAST__PID_MASK;
> +
> +	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
> +

Won't build on defconfig

> +	/* Avoid duplicate PID updation */
> +	for (i = 0; i < max_pids; i++) {
> +		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
> +			LAST__PID_MASK;
> +		if (pid == current_pid)
> +			goto skip_update;
> +	}
> +
> +	vma->next_pid_slot = (++pid_slot) % max_pids;
> +	vma->accessing_pids &= ~(LAST__PID_MASK << (pid_slot * LAST__PID_SHIFT));
> +	vma->accessing_pids |= ((current_pid) <<
> +			(pid_slot * LAST__PID_SHIFT));
> +

The PID tracking and clearing should probably be split out but that aside,
what about do_huge_pmd_numa_page?

First off though, expanding VMA size by more than a word for NUMA balancing
is probably a no-go.

This is a build-tested only prototype to illustrate how VMA could track
NUMA balancing state. It starts with applying the scan delay to every VMA
instead of every task to avoid scanning new or very short-lived VMAs. I
went back to my old notes on how I hoped to reduce excessive scanning in
NUMA balancing and it happened to be second on my list and straight-forward
to prototype in a few minutes.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..3cebda5cc8a7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -620,6 +620,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_NUMA_BALANCING
+	vma->numab = NULL;
+#endif
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..3c0cfdde33e0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -526,6 +526,10 @@ struct anon_vma_name {
 	char name[];
 };
 
+struct vma_numab {
+	unsigned long next_scan;
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -593,6 +597,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	struct vma_numab *numab;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..2d34c484553d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,6 +481,9 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 void vm_area_free(struct vm_area_struct *vma)
 {
+#ifdef CONFIG_NUMA_BALANCING
+	kfree(vma->numab);
+#endif
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c36aa54ae071..6a1cffdfc76b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3027,6 +3027,23 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessible(vma))
 			continue;
 
+		/* Initialise new per-VMA NUMAB state. */
+		if (!vma->numab) {
+			vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
+			if (!vma->numab)
+				continue;
+
+			vma->numab->next_scan = now +
+				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		}
+
+		/*
+		 * After the first scan is complete, delay the balancing scan
+		 * for new VMAs.
+		 */
+		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);

-- 
Mel Gorman
SUSE Labs

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-17 14:59   ` Mel Gorman
@ 2023-01-17 17:45     ` Raghavendra K T
  2023-01-18  5:47       ` Raghavendra K T
  2023-01-24 19:18       ` Raghavendra K T
  2023-01-18  4:43     ` Bharata B Rao
  1 sibling, 2 replies; 16+ messages in thread
From: Raghavendra K T @ 2023-01-17 17:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	David Hildenbrand, xu xin, Yu Zhao, Colin Cross, Arnd Bergmann,
	Hugh Dickins, Bharata B Rao, Disha Talreja

On 1/17/2023 8:29 PM, Mel Gorman wrote:
> Note that the cc list is excessive for the topic.
> 

Thank you Mel for the review. Sorry for the long list (generated by
get_maintainer). Will trim the list for V2.

> On Mon, Jan 16, 2023 at 07:05:34AM +0530, Raghavendra K T wrote:
>>   During the Numa scanning make sure only relevant vmas of the
>> tasks are scanned.
>>
>> Logic:
>> 1) For the first two time allow unconditional scanning of vmas
>> 2) Store recent 4 unique tasks (last 8bits of PIDs) accessed the vma.
>>    False negetives in case of collison should be fine here.
>> 3) If more than 4 pids exist assume task indeed accessed vma to
>>   to avoid false negetives
>>
>> Co-developed-by: Bharata B Rao <bharata@amd.com>
>> (initial patch to store pid information)
>>
>> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Bharata B Rao <bharata@amd.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>>   include/linux/mm_types.h |  2 ++
>>   kernel/sched/fair.c      | 32 ++++++++++++++++++++++++++++++++
>>   mm/memory.c              | 21 +++++++++++++++++++++
>>   3 files changed, 55 insertions(+)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 500e536796ca..07feae37b8e6 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -506,6 +506,8 @@ struct vm_area_struct {
>>   	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>>   #endif
>>   	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>> +	unsigned int accessing_pids;
>> +	int next_pid_slot;
>>   } __randomize_layout;
>>   
> 
> This should be behind CONFIG_NUMA_BALANCING but per-vma state should also be
> tracked in its own struct and allocated on demand iff the state is required.
> 

Agree, as David also pointed out. I will take your patch below as the
base to develop the per-VMA struct on its own.

>>   struct kioctx_table;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e4a0b8bd941c..944d2e3b0b3c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2916,6 +2916,35 @@ static void reset_ptenuma_scan(struct task_struct *p)
>>   	p->mm->numa_scan_offset = 0;
>>   }
>>   
>> +static bool vma_is_accessed(struct vm_area_struct *vma)
>> +{
>> +	int i;
>> +	bool more_pids_exist;
>> +	unsigned long pid, max_pids;
>> +	unsigned long current_pid = current->pid & LAST__PID_MASK;
>> +
>> +	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
>> +
>> +	/* By default we assume >= max_pids exist */
>> +	more_pids_exist = true;
>> +
>> +	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
>> +		return true;
>> +
>> +	for (i = 0; i < max_pids; i++) {
>> +		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
>> +			LAST__PID_MASK;
>> +		if (pid == current_pid)
>> +			return true;
>> +		if (pid == 0) {
>> +			more_pids_exist = false;
>> +			break;
>> +		}
>> +	}
>> +
>> +	return more_pids_exist;
>> +}
> 
> I get the intent is to avoid PIDs scanning VMAs that it has never faulted
> within but it seems unnecessarily complex to search on every fault to track
> just 4 pids with no recent access information. The pid modulo BITS_PER_WORD
> couls be used to set a bit on an unsigned long to track approximate recent
> acceses and skip VMAs that do not have the bit set. That would allow more
> recent PIDs to be tracked although false positives would still exist. It
> would be necessary to reset the mask periodically.

Got the idea, but I lost you on pid modulo BITS_PER_WORD (is it
extracting the last 5 or 8 bits of the PID?). Or do you intend to say
we can just do

vma->accessing_pids |= current_pid;

so that later we can just check

if ((vma->accessing_pids | current_pid) == vma->accessing_pids)

then it is a hit? This becomes simple and we avoid iteration,
duplicate tracking, etc.

> 
> Even tracking 4 pids, a reset is periodically needed. Otherwise it'll
> be vulnerable to changes in phase behaviour causing all pids to scan all
> VMAs again.
> 

Agree. Yes, this will be the key thing to do. On a related note, I saw
a huge increase in numa_scan_seq because we visit the scanning path
much more frequently after the patch.

>> @@ -3015,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
>>   		if (!vma_is_accessible(vma))
>>   			continue;
>>   
>> +		if (!vma_is_accessed(vma))
>> +			continue;
>> +
>>   		do {
>>   			start = max(start, vma->vm_start);
>>   			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 8c8420934d60..fafd78d87a51 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4717,7 +4717,28 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>   	pte_t pte, old_pte;
>>   	bool was_writable = pte_savedwrite(vmf->orig_pte);
>>   	int flags = 0;
>> +	int pid_slot = vma->next_pid_slot;
>>   
>> +	int i;
>> +	unsigned long pid, max_pids;
>> +	unsigned long current_pid = current->pid & LAST__PID_MASK;
>> +
>> +	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
>> +
> 
> Won't build on defconfig
> 

Oops! Sorry. Ideally this should also have gone behind
CONFIG_NUMA_BALANCING.

>> +	/* Avoid duplicate PID updation */
>> +	for (i = 0; i < max_pids; i++) {
>> +		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
>> +			LAST__PID_MASK;
>> +		if (pid == current_pid)
>> +			goto skip_update;
>> +	}
>> +
>> +	vma->next_pid_slot = (++pid_slot) % max_pids;
>> +	vma->accessing_pids &= ~(LAST__PID_MASK << (pid_slot * LAST__PID_SHIFT));
>> +	vma->accessing_pids |= ((current_pid) <<
>> +			(pid_slot * LAST__PID_SHIFT));
>> +
> 
> The PID tracking and clearing should probably be split out but that aside,

Sure will do.

> what about do_huge_pmd_numa_page?

Will target this eventually (ASAP if it is less complicated) :)
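
For illustration, the split-out helper might look roughly like this (the
name is assumed and this is only a sketch of the factoring, not a tested
patch); both do_numa_page() and do_huge_pmd_numa_page() could then call it:

/*
 * Hypothetical helper factoring the PID tracking out of do_numa_page()
 * so the huge-PMD NUMA fault path can share it. Note pid_slot is
 * wrapped before it is used as a shift index.
 */
static void vma_track_accessing_pid(struct vm_area_struct *vma)
{
	int i, pid_slot;
	unsigned long pid, max_pids;
	unsigned long current_pid = current->pid & LAST__PID_MASK;

	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;

	/* Avoid recording a PID that is already present. */
	for (i = 0; i < max_pids; i++) {
		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
			LAST__PID_MASK;
		if (pid == current_pid)
			return;
	}

	pid_slot = (vma->next_pid_slot + 1) % max_pids;
	vma->next_pid_slot = pid_slot;
	vma->accessing_pids &= ~(LAST__PID_MASK << (pid_slot * LAST__PID_SHIFT));
	vma->accessing_pids |= current_pid << (pid_slot * LAST__PID_SHIFT);
}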

> 
> First off though, expanding VMA size by more than a word for NUMA balancing
> is probably a no-go.
> 
Agree

> This is a build-tested only prototype to illustrate how VMA could track
> NUMA balancing state. It starts with applying the scan delay to every VMA
> instead of every task to avoid scanning new or very short-lived VMAs. I
> went back to my old notes on how I hoped to reduce excessive scanning in
> NUMA balancing and it happened to be second on my list and straight-forward
> to prototype in a few minutes.
> 

Nice idea, thanks again. I will take this as the base patch for expansion.

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3f196e4d66d..3cebda5cc8a7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -620,6 +620,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>   	vma->vm_mm = mm;
>   	vma->vm_ops = &dummy_vm_ops;
>   	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_NUMA_BALANCING
> +	vma->numab = NULL;
> +#endif
>   }
>   
>   static inline void vma_set_anonymous(struct vm_area_struct *vma)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3b8475007734..3c0cfdde33e0 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -526,6 +526,10 @@ struct anon_vma_name {
>   	char name[];
>   };
>   
> +struct vma_numab {
> +	unsigned long next_scan;
> +};
> +
>   /*
>    * This struct describes a virtual memory area. There is one of these
>    * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -593,6 +597,9 @@ struct vm_area_struct {
>   #endif
>   #ifdef CONFIG_NUMA
>   	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
> +#endif
> +#ifdef CONFIG_NUMA_BALANCING
> +	struct vma_numab *numab;	/* NUMA Balancing state */
>   #endif
>   	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>   } __randomize_layout;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9f7fe3541897..2d34c484553d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -481,6 +481,9 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>   
>   void vm_area_free(struct vm_area_struct *vma)
>   {
> +#ifdef CONFIG_NUMA_BALANCING
> +	kfree(vma->numab);
> +#endif
>   	free_anon_vma_name(vma);
>   	kmem_cache_free(vm_area_cachep, vma);
>   }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c36aa54ae071..6a1cffdfc76b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3027,6 +3027,23 @@ static void task_numa_work(struct callback_head *work)
>   		if (!vma_is_accessible(vma))
>   			continue;
>   
> +		/* Initialise new per-VMA NUMAB state. */
> +		if (!vma->numab) {
> +			vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
> +			if (!vma->numab)
> +				continue;
> +
> +			vma->numab->next_scan = now +
> +				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> +		}
> +
> +		/*
> +		 * After the first scan is complete, delay the balancing scan
> +		 * for new VMAs.
> +		 */
> +		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
> +			continue;
> +
>   		do {
>   			start = max(start, vma->vm_start);
>   			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
> 

Thanks
- Raghu

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-17 14:59   ` Mel Gorman
  2023-01-17 17:45     ` Raghavendra K T
@ 2023-01-18  4:43     ` Bharata B Rao
  2023-02-21  0:38       ` Kalra, Ashish
  1 sibling, 1 reply; 16+ messages in thread
From: Bharata B Rao @ 2023-01-18  4:43 UTC (permalink / raw)
  To: Mel Gorman, Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	David Hildenbrand, xu xin, Yu Zhao, Colin Cross, Arnd Bergmann,
	Hugh Dickins, Disha Talreja, Sean Christopherson, jhubbard,
	ligang.bdlg, Kalra, Ashish

On 1/17/2023 8:29 PM, Mel Gorman wrote:
> Note that the cc list is excessive for the topic.

(Wasn't sure about pruning the CC list mid-thread, hence continuing with it)

<snip>

> 
> This is a build-tested only prototype to illustrate how VMA could track
> NUMA balancing state. It starts with applying the scan delay to every VMA
> instead of every task to avoid scanning new or very short-lived VMAs. I
> went back to my old notes on how I hoped to reduce excessive scanning in
> NUMA balancing and it happened to be second on my list and straight-forward
> to prototype in a few minutes.

While on the topic of improving NUMA balancer scanning relevancy, the following
additional points may be worth noting:

Recently there have been reports about NUMA balancing induced scanning and
subsequent MMU notifier invalidations causing problems in different scenarios.

1. Currently NUMA balancing doesn't check at scan time whether a page (or a VMA)
is non-migratable because the page (or the address range) is pinned. It goes ahead
with MMU invalidation notifications and changes the PTE protection to PAGE_NONE,
only to realize later that the pinned pages can't be migrated, before reinstalling
the original PTE.

This was found to cause issues to SEV guests whose pages are completely pinned.
This was discussed here - https://lore.kernel.org/all/20220927000729.498292-1-Ashish.Kalra@amd.com/

We could probably use page_maybe_dma_pinned() to determine if the page is long
term pinned and avoid MMU invalidation and protection change for such a page.
However then we would have to do per-page invalidations (as against one time
PMD range invalidation that is done currently) which is probably not desirable.

Also, MMU invalidations are expected to be issued under a sleepable context (mostly,
except in the OOM notification, which uses the nonblocking version, AFAICS). This makes it
difficult to check the pinned state of the page prior to MMU invalidation. Some of
this is discussed here: https://lore.kernel.org/linux-arm-kernel/YuEMkKY2RU%2F2KiZW@monolith.localdoman/

This current patchset, where we attempt to restrict scanning to relevant VMAs, may
help the above case partially, but any ideas on addressing this issue
comprehensively? It would be ideal if we could clearly identify such non-migratable
(long-term pinned) pages and exclude them entirely from scanning and protection
change.
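
As a very rough illustration only (not a proposal), the per-page variant
would amount to something like this inside the protection-change loop,
which is exactly where the per-page invalidation cost would come from:

	/*
	 * Hypothetical: skip pages that appear long-term pinned. Checking
	 * per page would force per-page (rather than one PMD-range) MMU
	 * invalidations, which is the concern raised above.
	 */
	if (page_maybe_dma_pinned(page))
		continue;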

2. Applications that run on GPUs may like to avoid the NUMA balancing activity
completely and they benefit from per-process enabling/disabling of NUMA balancing.
The patchset (which has a different use case for per-process control) that helps
this is here - https://lore.kernel.org/all/49ed07b1-e167-7f94-9986-8e86fb60bb09@nvidia.com/

Improvements to increase the relevant scanning can help this case to an extent
but per-process NUMA balancing control should be a useful control to have.
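
Purely as an illustration of such a gate (the flag name below is made up
and is not the interface from the linked patchset), the scan work could
bail out early per mm:

	/* Hypothetical per-mm opt-out checked early in task_numa_work(). */
	if (test_bit(MMF_NUMAB_DISABLED, &mm->flags))
		return;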

Regards,
Bharata.


* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-17 17:45     ` Raghavendra K T
@ 2023-01-18  5:47       ` Raghavendra K T
  2023-01-24 19:18       ` Raghavendra K T
  1 sibling, 0 replies; 16+ messages in thread
From: Raghavendra K T @ 2023-01-18  5:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	David Hildenbrand, xu xin, Yu Zhao, Colin Cross, Arnd Bergmann,
	Hugh Dickins, Bharata B Rao, Disha Talreja

On 1/17/2023 11:15 PM, Raghavendra K T wrote:
> On 1/17/2023 8:29 PM, Mel Gorman wrote:
>> Note that the cc list is excessive for the topic.
>>
[...]
> 
>>>   struct kioctx_table;
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e4a0b8bd941c..944d2e3b0b3c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2916,6 +2916,35 @@ static void reset_ptenuma_scan(struct task_struct *p)
>>>       p->mm->numa_scan_offset = 0;
>>>   }
>>> +static bool vma_is_accessed(struct vm_area_struct *vma)
>>> +{
>>> +    int i;
>>> +    bool more_pids_exist;
>>> +    unsigned long pid, max_pids;
>>> +    unsigned long current_pid = current->pid & LAST__PID_MASK;
>>> +
>>> +    max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
>>> +
>>> +    /* By default we assume >= max_pids exist */
>>> +    more_pids_exist = true;
>>> +
>>> +    if (READ_ONCE(current->mm->numa_scan_seq) < 2)
>>> +        return true;
>>> +
>>> +    for (i = 0; i < max_pids; i++) {
>>> +        pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
>>> +            LAST__PID_MASK;
>>> +        if (pid == current_pid)
>>> +            return true;
>>> +        if (pid == 0) {
>>> +            more_pids_exist = false;
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    return more_pids_exist;
>>> +}
>>
>> I get the intent is to avoid PIDs scanning VMAs that it has never faulted
>> within but it seems unnecessarily complex to search on every fault to
>> track just 4 pids with no recent access information. The pid modulo
>> BITS_PER_WORD could be used to set a bit on an unsigned long to track
>> approximate recent accesses and skip VMAs that do not have the bit set.
>> That would allow more recent PIDs to be tracked although false positives
>> would still exist. It would be necessary to reset the mask periodically.
> 
> Got the idea, but I lost you on pid modulo BITS_PER_WORD (is it
> extracting the last 5 or 8 bits of the PID?). Or do you intend to say
> we can just do
> 
> vma->accessing_pids |= current_pid;
> 
> so that later we can just check
> 
> if ((vma->accessing_pids | current_pid) == vma->accessing_pids)
> 
> then it is a hit? This becomes simple and we avoid iteration,
> duplicate tracking, etc.
> 

After more brainstorming/thought on this, I see that you meant:

active_bit = (current_pid % BITS_PER_LONG);
accessing_pids |= (1UL << active_bit);

In scan path:
active_bit = (current_pid % BITS_PER_LONG);
if (!(accessing_pids & (1UL << active_bit)))
         goto skip_scanning;

My approach above would perhaps give more false positives; this seems
the better thing to do.
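
For the periodic reset Mel asked for, one simple possibility (the period
constant and the placement are assumptions) is to clear the mask at
scan-sequence boundaries:

	/* Hypothetical reset so a phase change does not leave stale bits
	 * suppressing scans forever; VMA_PID_RESET_PERIOD is assumed. */
	if (!(READ_ONCE(mm->numa_scan_seq) % VMA_PID_RESET_PERIOD))
		vma->accessing_pids = 0;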

Thanks, will come up with numbers for this patch plus your VMA scan
delay patch.

>>
>> Even tracking 4 pids, a reset is periodically needed. Otherwise it'll
>> be vulnerable to changes in phase behaviour causing all pids to scan all
>> VMAs again.
>>
> 
> Agree. Yes, this will be the key thing to do. On a related note, I saw
> a huge increase in numa_scan_seq because we visit the scanning path
> much more frequently after the patch.
> 
[...]

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-16  1:35 ` [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic Raghavendra K T
                     ` (2 preceding siblings ...)
  2023-01-17 14:59   ` Mel Gorman
@ 2023-01-19  9:39   ` Mike Rapoport
  2023-01-19 10:24     ` Raghavendra K T
  3 siblings, 1 reply; 16+ messages in thread
From: Mike Rapoport @ 2023-01-19  9:39 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: =,
	linux-mm, --cc=Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Andrew Morton, Matthew Wilcox, Vlastimil Babka, Liam R . Howlett,
	Peter Xu, David Hildenbrand, xu xin, Yu Zhao, Colin Cross,
	Arnd Bergmann, Hugh Dickins, Bharata B Rao, Disha Talreja

Hi,

On Mon, Jan 16, 2023 at 07:05:34AM +0530, Raghavendra K T wrote:
>  During the Numa scanning make sure only relevant vmas of the
> tasks are scanned.

Please add a more detailed description of the issues with the current
scanning that this patch aims to solve.

> Logic:
> 1) For the first two time allow unconditional scanning of vmas
> 2) Store recent 4 unique tasks (last 8bits of PIDs) accessed the vma.
>   False negetives in case of collison should be fine here.

         ^ negatives

> 3) If more than 4 pids exist assume task indeed accessed vma to
>  to avoid false negetives
> 
> Co-developed-by: Bharata B Rao <bharata@amd.com>
> (initial patch to store pid information)
> 
> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>  include/linux/mm_types.h |  2 ++
>  kernel/sched/fair.c      | 32 ++++++++++++++++++++++++++++++++
>  mm/memory.c              | 21 +++++++++++++++++++++
>  3 files changed, 55 insertions(+)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..07feae37b8e6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -506,6 +506,8 @@ struct vm_area_struct {
>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +	unsigned int accessing_pids;
> +	int next_pid_slot;
>  } __randomize_layout;
>  
>  struct kioctx_table;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e4a0b8bd941c..944d2e3b0b3c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2916,6 +2916,35 @@ static void reset_ptenuma_scan(struct task_struct *p)
>  	p->mm->numa_scan_offset = 0;
>  }
>  
> +static bool vma_is_accessed(struct vm_area_struct *vma)
> +{
> +	int i;
> +	bool more_pids_exist;
> +	unsigned long pid, max_pids;
> +	unsigned long current_pid = current->pid & LAST__PID_MASK;
> +
> +	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
> +
> +	/* By default we assume >= max_pids exist */
> +	more_pids_exist = true;
> +
> +	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
> +		return true;
> +
> +	for (i = 0; i < max_pids; i++) {
> +		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
> +			LAST__PID_MASK;
> +		if (pid == current_pid)
> +			return true;
> +		if (pid == 0) {
> +			more_pids_exist = false;
> +			break;
> +		}
> +	}
> +
> +	return more_pids_exist;
> +}
> +
>  /*
>   * The expensive part of numa migration is done from task_work context.
>   * Triggered from task_tick_numa().
> @@ -3015,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
>  		if (!vma_is_accessible(vma))
>  			continue;
>  
> +		if (!vma_is_accessed(vma))
> +			continue;
> +
>  		do {
>  			start = max(start, vma->vm_start);
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
> diff --git a/mm/memory.c b/mm/memory.c
> index 8c8420934d60..fafd78d87a51 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4717,7 +4717,28 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	pte_t pte, old_pte;
>  	bool was_writable = pte_savedwrite(vmf->orig_pte);
>  	int flags = 0;
> +	int pid_slot = vma->next_pid_slot;
>  
> +	int i;
> +	unsigned long pid, max_pids;
> +	unsigned long current_pid = current->pid & LAST__PID_MASK;
> +
> +	max_pids = sizeof(unsigned int) * BITS_PER_BYTE / LAST__PID_SHIFT;
> +
> +	/* Avoid duplicate PID updation */
> +	for (i = 0; i < max_pids; i++) {
> +		pid = (vma->accessing_pids >> i * LAST__PID_SHIFT) &
> +			LAST__PID_MASK;
> +		if (pid == current_pid)
> +			goto skip_update;
> +	}
> +
> +	vma->next_pid_slot = (++pid_slot) % max_pids;
> +	vma->accessing_pids &= ~(LAST__PID_MASK << (pid_slot * LAST__PID_SHIFT));
> +	vma->accessing_pids |= ((current_pid) <<
> +			(pid_slot * LAST__PID_SHIFT));
> +
> +skip_update:
>  	/*
>  	 * The "pte" at this point cannot be used safely without
>  	 * validation through pte_unmap_same(). It's of NUMA type but
> -- 
> 2.34.1
> 
> 

-- 
Sincerely yours,
Mike.

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-19  9:39   ` Mike Rapoport
@ 2023-01-19 10:24     ` Raghavendra K T
  0 siblings, 0 replies; 16+ messages in thread
From: Raghavendra K T @ 2023-01-19 10:24 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Andrew Morton, Matthew Wilcox, Vlastimil Babka, Liam R . Howlett,
	Peter Xu, David Hildenbrand, xu xin, Yu Zhao, Colin Cross,
	Arnd Bergmann, Hugh Dickins, Bharata B Rao, Disha Talreja

On 1/19/2023 3:09 PM, Mike Rapoport wrote:
> Hi,
> 
> On Mon, Jan 16, 2023 at 07:05:34AM +0530, Raghavendra K T wrote:
>>   During the Numa scanning make sure only relevant vmas of the
>> tasks are scanned.
> 
> Please add a more detailed description of the issues with the current
> scanning that this patch aims to solve.

Thank you for the review, Mike. Sure, will add more detail in the
patch commit message in V2.

> 
>> Logic:
>> 1) For the first two time allow unconditional scanning of vmas
>> 2) Store recent 4 unique tasks (last 8bits of PIDs) accessed the vma.
>>    False negetives in case of collison should be fine here.
> 
>           ^ negatives

Will take care of this one and the one below.

>> 3) If more than 4 pids exist assume task indeed accessed vma to
>>   to avoid false negetives
>>
>> Co-developed-by: Bharata B Rao <bharata@amd.com>
>> (initial patch to store pid information)
>>
>> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Bharata B Rao <bharata@amd.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
[...]

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-17 17:45     ` Raghavendra K T
  2023-01-18  5:47       ` Raghavendra K T
@ 2023-01-24 19:18       ` Raghavendra K T
  2023-01-27 10:17         ` Mel Gorman
  1 sibling, 1 reply; 16+ messages in thread
From: Raghavendra K T @ 2023-01-24 19:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Bharata B Rao, Disha Talreja,
	Mike Rapoport

On 1/17/2023 11:15 PM, Raghavendra K T wrote:
> On 1/17/2023 8:29 PM, Mel Gorman wrote:
>> Note that the cc list is excessive for the topic.
>>
> 
> Thank you Mel for the review. Sorry for the long list (generated by
> get_maintainer). Will trim the list for V2.
>
(trimming the list early)
[...]
> 
> Nice idea, thanks again. I will take this as the base patch for expansion.
> 
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index f3f196e4d66d..3cebda5cc8a7 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -620,6 +620,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>>       vma->vm_mm = mm;
>>       vma->vm_ops = &dummy_vm_ops;
>>       INIT_LIST_HEAD(&vma->anon_vma_chain);
>> +#ifdef CONFIG_NUMA_BALANCING
>> +    vma->numab = NULL;
>> +#endif
>>   }
>>   static inline void vma_set_anonymous(struct vm_area_struct *vma)
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 3b8475007734..3c0cfdde33e0 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -526,6 +526,10 @@ struct anon_vma_name {
>>       char name[];
>>   };
>> +struct vma_numab {
>> +    unsigned long next_scan;
>> +};
>> +
>>   /*
>>    * This struct describes a virtual memory area. There is one of these
>>    * per VM-area/task. A VM area is any part of the process virtual memory
>> @@ -593,6 +597,9 @@ struct vm_area_struct {
>>   #endif
>>   #ifdef CONFIG_NUMA
>>       struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
>> +#endif
>> +#ifdef CONFIG_NUMA_BALANCING
>> +    struct vma_numab *numab;    /* NUMA Balancing state */
>>   #endif
>>       struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>>   } __randomize_layout;
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 9f7fe3541897..2d34c484553d 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -481,6 +481,9 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>>   void vm_area_free(struct vm_area_struct *vma)
>>   {
>> +#ifdef CONFIG_NUMA_BALANCING
>> +    kfree(vma->numab);
>> +#endif
>>       free_anon_vma_name(vma);
>>       kmem_cache_free(vm_area_cachep, vma);
>>   }

While running mmtests kernbench on a 256-pCPU system, I hit a BUG()
(not reproducible otherwise in the normal boot flow):

[  716.825398] kernel BUG at mm/slub.c:419!
[  716.825736] invalid opcode: 0000 [#146] PREEMPT SMP NOPTI
[  716.826042] CPU: 232 PID: 364844 Comm: cc1 Tainted: G      D W    6.1.0-test-snp-host-a7065246cf78+ #44
[  716.826345] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.6.6 01/13/2022
[  716.826645] RIP: 0010:__kmem_cache_free+0x2a4/0x2c0
[  716.826941] Code: ff e9 32 ff ff ff 49 8b 47 08 f0 48 83 28 01 0f 85 9b fe ff ff 49 8b 47 08 4c 89 ff 48 8b 40 08 e8 a1 c5 cc 00 e9 86 fe ff ff <0f> 0b 48 8b 15 63 d6 4d 01 e9 85 fd ff ff 66 66 2e 0f 1f 84 00 00
[  716.827550] RSP: 0018:ffffb0b070547c28 EFLAGS: 00010246
[  716.827865] RAX: ffff990fa6bf1310 RBX: ffff990fa6bf1310 RCX: ffff990fa6bf1310
[  716.828180] RDX: 00000000001000e8 RSI: 0000000000000000 RDI: ffff98d000044200
[  716.828503] RBP: ffffb0b070547c50 R08: ffff98d030f222e0 R09: 0000000000000001
[  716.828821] R10: ffff990ff6d298b0 R11: ffff98d030f226a0 R12: ffff98d000044200
[  716.829139] R13: ffffd605c29afc40 R14: ffffffff9e89c20f R15: ffffb0b070547d58
[  716.829458] FS:  00007f05f4cebac0(0000) GS:ffff994e00800000(0000) knlGS:0000000000000000
[  716.829781] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  716.830105] CR2: 00007f05e9cbc002 CR3: 00000040eea7c005 CR4: 0000000000770ee0
[  716.830432] PKRU: 55555554
[  716.830749] Call Trace:
[  716.831057]  <TASK>
[  716.831360]  kfree+0x79/0x120
[  716.831664]  vm_area_free+0x1f/0x50
[  716.831970]  vma_expand+0x311/0x3e0
[  716.832274]  mmap_region+0x772/0x900
[  716.832571]  do_mmap+0x3c0/0x5e0
[  716.832866]  ? __this_cpu_preempt_check+0x13/0x20
[  716.833165]  ? security_mmap_file+0xa1/0xc0
[  716.833458]  vm_mmap_pgoff+0xd5/0x170
[  716.833745]  ksys_mmap_pgoff+0x46/0x210
[  716.834022]  __x64_sys_mmap+0x33/0x50
[  716.834291]  do_syscall_64+0x3b/0x90
[  716.834549]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  716.834806] RIP: 0033:0x7f05f471ebd7
[  716.835054] Code: 00 00 00 89 ef e8 59 ae ff ff eb e4 e8 62 7b 01 00 66 90 f3 0f 1e fa 41 89 ca 41 f7 c1 ff 0f 00 00 75 10 b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 21 c3 48 8b 05 29 a2 0f 00 64 c7 00 16 00 00
[  716.835567] RSP: 002b:00007fff24c27ae8 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
[  716.835826] RAX: ffffffffffffffda RBX: 0000000000200000 RCX: 00007f05f471ebd7
[  716.836077] RDX: 0000000000000003 RSI: 0000000000200000 RDI: 0000000000000000
[  716.836323] RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000000
[  716.836567] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000038
[  716.836808] R13: 0000000000001fff R14: 0000000000000044 R15: 0000000000000048
[  716.837049]  </TASK>
[  716.837285] Modules linked in: tls ipmi_ssif binfmt_misc nls_iso8859_1 joydev input_leds intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd hid_generic kvm_amd dell_smbios dcdbas wmi_bmof dell_wmi_descriptor kvm usbhid hid ccp k10temp wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul i2c_algo_bit crc32_pclmul drm_shmem_helper ghash_clmulni_intel nvme aesni_intel crypto_simd cryptd tg3 drm nvme_core megaraid_sas ahci xhci_pci i2c_piix4 xhci_pci_renesas libahci
[  716.839185] ---[ end trace 0000000000000000 ]---

Looks like we additionally have to handle numab initialization in the
vm_area_dup() code path. Something like the below fixed it (copy-pasted
from tty):

diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..f5b2e41296c7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -475,12 +475,18 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
                 *new = data_race(*orig);
                 INIT_LIST_HEAD(&new->anon_vma_chain);
                 dup_anon_vma_name(orig, new);
+#ifdef CONFIG_NUMA_BALANCING
+               new->numab = NULL;
+#endif
         }
         return new;
  }

Does this look okay? If so, I will fold it into the V2 spin (in the
vma scan delay patch), hoping you are okay with this change and do not
see any other changes required.

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c36aa54ae071..6a1cffdfc76b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3027,6 +3027,23 @@ static void task_numa_work(struct callback_head *work)
>>           if (!vma_is_accessible(vma))
>>               continue;
>> +        /* Initialise new per-VMA NUMAB state. */
>> +        if (!vma->numab) {
>> +            vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
>> +            if (!vma->numab)
>> +                continue;
>> +
>> +            vma->numab->next_scan = now +
>> +                msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
>> +        }
>> +
>> +        /*
>> +         * After the first scan is complete, delay the balancing scan
>> +         * for new VMAs.
>> +         */
>> +        if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
>> +            continue;
>> +
>>           do {
>>               start = max(start, vma->vm_start);
>>               end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>>
> 

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-24 19:18       ` Raghavendra K T
@ 2023-01-27 10:17         ` Mel Gorman
  2023-01-27 15:27           ` Raghavendra K T
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2023-01-27 10:17 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Bharata B Rao, Disha Talreja,
	Mike Rapoport

On Wed, Jan 25, 2023 at 12:48:16AM +0530, Raghavendra K T wrote:
> looks like we have to additionally handle numab initialization in
> vm_area_dup() code path. something like below fixed it (copied pasted
> from tty):
> 

Yep, it wasn't even boot tested. A better approach is something like this,
still not actually tested:

 include/linux/mm.h       |  9 +++++++++
 include/linux/mm_types.h |  7 +++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 17 +++++++++++++++++
 4 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f857163ac89..481f90dc1983 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -612,6 +612,14 @@ struct vm_operations_struct {
 					  unsigned long addr);
 };
 
+#ifdef CONFIG_NUMA_BALANCING
+#define vma_numab_init(vma) do { (vma)->numab = NULL; } while (0)
+#define vma_numab_free(vma) do { kfree((vma)->numab); } while (0)
+#else
+static inline void vma_numab_init(struct vm_area_struct *vma) {}
+static inline void vma_numab_free(struct vm_area_struct *vma) {}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	static const struct vm_operations_struct dummy_vm_ops = {};
@@ -620,6 +628,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+	vma_numab_init(vma);
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9757067c3053..43ce363d5124 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -526,6 +526,10 @@ struct anon_vma_name {
 	char name[];
 };
 
+struct vma_numab {
+	unsigned long next_scan;
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -593,6 +597,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	struct vma_numab *numab;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..5a2e8c3cc410 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		 */
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
+		vma_numab_init(new);
 		dup_anon_vma_name(orig, new);
 	}
 	return new;
@@ -481,6 +482,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 void vm_area_free(struct vm_area_struct *vma)
 {
+	vma_numab_free(vma);
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c36aa54ae071..6a1cffdfc76b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3027,6 +3027,23 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessible(vma))
 			continue;
 
+		/* Initialise new per-VMA NUMAB state. */
+		if (!vma->numab) {
+			vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
+			if (!vma->numab)
+				continue;
+
+			vma->numab->next_scan = now +
+				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		}
+
+		/*
+		 * After the first scan is complete, delay the balancing scan
+		 * for new VMAs.
+		 */
+		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-27 10:17         ` Mel Gorman
@ 2023-01-27 15:27           ` Raghavendra K T
  0 siblings, 0 replies; 16+ messages in thread
From: Raghavendra K T @ 2023-01-27 15:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Bharata B Rao, Disha Talreja,
	Mike Rapoport

On 1/27/2023 3:47 PM, Mel Gorman wrote:
> On Wed, Jan 25, 2023 at 12:48:16AM +0530, Raghavendra K T wrote:
>> looks like we have to additionally handle numab initialization in
>> vm_area_dup() code path. something like below fixed it (copied pasted
>> from tty):
>>
> 
> Yep, it wasn't even boot tested. Better approach is something like this,
> still not actually tested
> 
>   include/linux/mm.h       |  9 +++++++++
>   include/linux/mm_types.h |  7 +++++++
>   kernel/fork.c            |  2 ++
>   kernel/sched/fair.c      | 17 +++++++++++++++++
>   4 files changed, 35 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8f857163ac89..481f90dc1983 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -612,6 +612,14 @@ struct vm_operations_struct {
>   					  unsigned long addr);
>   };
>   
> +#ifdef CONFIG_NUMA_BALANCING
> +#define vma_numab_init(vma) do { (vma)->numab = NULL; } while (0)
> +#define vma_numab_free(vma) do { kfree((vma)->numab); } while (0)
> +#else
> +static inline void vma_numab_init(struct vm_area_struct *vma) {}
> +static inline void vma_numab_free(struct vm_area_struct *vma) {}
> +#endif /* CONFIG_NUMA_BALANCING */
> +
>   static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>   {
>   	static const struct vm_operations_struct dummy_vm_ops = {};
> @@ -620,6 +628,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>   	vma->vm_mm = mm;
>   	vma->vm_ops = &dummy_vm_ops;
>   	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +	vma_numab_init(vma);
>   }
>   
>   static inline void vma_set_anonymous(struct vm_area_struct *vma)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 9757067c3053..43ce363d5124 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -526,6 +526,10 @@ struct anon_vma_name {
>   	char name[];
>   };
>   
> +struct vma_numab {
> +	unsigned long next_scan;
> +};
> +
>   /*
>    * This struct describes a virtual memory area. There is one of these
>    * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -593,6 +597,9 @@ struct vm_area_struct {
>   #endif
>   #ifdef CONFIG_NUMA
>   	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
> +#endif
> +#ifdef CONFIG_NUMA_BALANCING
> +	struct vma_numab *numab;	/* NUMA Balancing state */
>   #endif
>   	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>   } __randomize_layout;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9f7fe3541897..5a2e8c3cc410 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>   		 */
>   		*new = data_race(*orig);
>   		INIT_LIST_HEAD(&new->anon_vma_chain);
> +		vma_numab_init(new);
>   		dup_anon_vma_name(orig, new);
>   	}
>   	return new;
> @@ -481,6 +482,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>   
>   void vm_area_free(struct vm_area_struct *vma)
>   {
> +	vma_numab_free(vma);
>   	free_anon_vma_name(vma);
>   	kmem_cache_free(vm_area_cachep, vma);
>   }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c36aa54ae071..6a1cffdfc76b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3027,6 +3027,23 @@ static void task_numa_work(struct callback_head *work)
>   		if (!vma_is_accessible(vma))
>   			continue;
>   
> +		/* Initialise new per-VMA NUMAB state. */
> +		if (!vma->numab) {
> +			vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
> +			if (!vma->numab)
> +				continue;
> +
> +			vma->numab->next_scan = now +
> +				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> +		}
> +
> +		/*
> +		 * After the first scan is complete, delay the balancing scan
> +		 * for new VMAs.
> +		 */
> +		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
> +			continue;
> +
>   		do {
>   			start = max(start, vma->vm_start);
>   			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);


Thank you Mel. This looks better now.
Yes, we would have moved this to mm.h eventually to avoid #ifdef clutter.

Also, for PATCH 2, the function common to memory.c and huge_memory.c
would need the same treatment to handle hugetlb VMAs, as suggested by you.
Working on gathering numbers and the PID-clearing logic now; will post V2 soon.

Thanks
- Raghu

* Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic
  2023-01-18  4:43     ` Bharata B Rao
@ 2023-02-21  0:38       ` Kalra, Ashish
  0 siblings, 0 replies; 16+ messages in thread
From: Kalra, Ashish @ 2023-02-21  0:38 UTC (permalink / raw)
  To: Bharata B Rao, Mel Gorman, Raghavendra K T, mizhang
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Valentin Schneider, Andrew Morton,
	Matthew Wilcox, Vlastimil Babka, Liam R . Howlett, Peter Xu,
	David Hildenbrand, xu xin, Yu Zhao, Colin Cross, Arnd Bergmann,
	Hugh Dickins, Disha Talreja, Sean Christopherson, jhubbard,
	ligang.bdlg

Hello Mingwei, Sean,

Looking forward to your thoughts/feedback on the MMU invalidation
notifier issues with SEV guests mentioned below.

Thanks,
Ashish

On 1/17/2023 10:43 PM, Bharata B Rao wrote:
> On 1/17/2023 8:29 PM, Mel Gorman wrote:
>> Note that the cc list is excessive for the topic.
> 
> (Wasn't sure about pruning the CC list mid-thread, hence continuing with it)
> 
> <snip>
> 
>>
>> This is a build-tested only prototype to illustrate how VMA could track
>> NUMA balancing state. It starts with applying the scan delay to every VMA
>> instead of every task to avoid scanning new or very short-lived VMAs. I
>> went back to my old notes on how I hoped to reduce excessive scanning in
>> NUMA balancing and it happened to be second on my list and straight-forward
>> to prototype in a few minutes.
> 
> While on the topic of improving NUMA balancer scanning relevancy, the following
> additional points may be worth noting:
> 
> Recently there have been reports about NUMA balancing induced scanning and
> subsequent MMU notifier invalidations causing problems in different scenarios.
> 
> 1. Currently NUMA balancing doesn't check at scan time whether a page (or a VMA)
> is non-migratable because the page (or the address range) is pinned. It goes ahead
> with MMU invalidation notifications and changes the PTE protection to PAGE_NONE,
> only to realize later that the pinned pages can't be migrated, before reinstalling
> the original PTE.
> 
> This was found to cause issues to SEV guests whose pages are completely pinned.
> This was discussed here - https://lore.kernel.org/all/20220927000729.498292-1-Ashish.Kalra@amd.com/
> 
> We could probably use page_maybe_dma_pinned() to determine if the page is long
> term pinned and avoid MMU invalidation and protection change for such a page.
> However then we would have to do per-page invalidations (as against one time
> PMD range invalidation that is done currently) which is probably not desirable.
> 
> Also, MMU invalidations are expected to be issued under a sleepable context (mostly,
> except in the OOM notification, which uses the nonblocking version, AFAICS). This makes it
> difficult to check the pinned state of the page prior to MMU invalidation. Some of
> this is discussed here: https://lore.kernel.org/linux-arm-kernel/YuEMkKY2RU%2F2KiZW@monolith.localdoman/
> 
> This current patchset, where we attempt to restrict scanning to relevant VMAs, may
> help the above case partially, but any ideas on addressing this issue
> comprehensively? It would be ideal if we could clearly identify such non-migratable
> (long-term pinned) pages and exclude them entirely from scanning and protection
> change.
> 
> 2. Applications that run on GPUs may like to avoid the NUMA balancing activity
> completely and they benefit from per-process enabling/disabling of NUMA balancing.
> The patchset (which has a different use case for per-process control) that helps
> this is here - https://lore.kernel.org/all/49ed07b1-e167-7f94-9986-8e86fb60bb09@nvidia.com/
> 
> Improvements to increase the relevant scanning can help this case to an extent
> but per-process NUMA balancing control should be a useful control to have.
> 
> Regards,
> Bharata.
> 

end of thread [~2023-02-21  0:38 UTC | newest]

Thread overview: 16+ messages
2023-01-16  1:35 [RFC PATCH V1 0/1] sched/numa: Enhance vma scanning Raghavendra K T
2023-01-16  1:35 ` [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic Raghavendra K T
2023-01-16  2:25   ` Raghavendra K T
2023-01-17 11:14   ` David Hildenbrand
2023-01-17 13:09     ` Raghavendra K T
2023-01-17 14:59   ` Mel Gorman
2023-01-17 17:45     ` Raghavendra K T
2023-01-18  5:47       ` Raghavendra K T
2023-01-24 19:18       ` Raghavendra K T
2023-01-27 10:17         ` Mel Gorman
2023-01-27 15:27           ` Raghavendra K T
2023-01-18  4:43     ` Bharata B Rao
2023-02-21  0:38       ` Kalra, Ashish
2023-01-19  9:39   ` Mike Rapoport
2023-01-19 10:24     ` Raghavendra K T
2023-01-16  2:25 ` [RFC PATCH V1 0/1] sched/numa: Enhance vma scanning Raghavendra K T
