LKML Archive on lore.kernel.org
* [RFC PATCH] mm, page_alloc: double zone's batchsize
@ 2018-07-11  5:58 Aaron Lu
  2018-07-11 21:35 ` Andrew Morton
  2018-07-12 12:54 ` Michal Hocko
  0 siblings, 2 replies; 7+ messages in thread
From: Aaron Lu @ 2018-07-11  5:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
	Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman

To improve the page allocator's performance for order-0 pages, each CPU has
a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page is needed, the
PCP is checked first before asking the Buddy allocator for pages. When the
PCP runs empty, a batch of pages is fetched from Buddy to refill it, and
the size of that batch can affect performance.
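
For readers not familiar with the PCP layer, the order-0 fast path looks
roughly like the sketch below. This is illustrative C only, not the real
mm/page_alloc.c code; pcp_sketch, this_cpu_pcp(), pcp_refill_from_buddy()
and pcp_pop_page() are made-up names used only for the explanation.

  /* Illustrative sketch of the PCP fast path, not the real implementation. */
  struct pcp_sketch {
          int count;              /* order-0 pages currently cached on this CPU */
          int batch;              /* how many pages one refill pulls from Buddy */
          int high;               /* flush back to Buddy once count exceeds this */
          struct list_head pages; /* the cached pages */
  };

  static struct page *alloc_order0_page(struct zone *zone)
  {
          struct pcp_sketch *pcp = this_cpu_pcp(zone);   /* made-up helper */

          if (!pcp->count)
                  /* PCP empty: take zone->lock once, pull pcp->batch pages. */
                  pcp_refill_from_buddy(zone, pcp, pcp->batch);
          if (!pcp->count)
                  return NULL;    /* Buddy is empty too, go to the slowpath */
          pcp->count--;
          return pcp_pop_page(pcp);       /* no zone->lock taken here */
  }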

The zone's batch size was last doubled by commit ba56e91c9401 ("mm:
page_alloc: increase size of per-cpu-pages") over ten years ago. Since
then, CPUs have evolved a lot and their cache sizes have also increased.

Dave Hansen is concerned that the current batch size doesn't fit well with
modern hardware and suggested I do two things: first, use a page-allocator
intensive benchmark, e.g. will-it-scale/page_fault1, to find out how
performance changes with different batch sizes on various machines and
then choose a new default batch size; second, see how this new batch size
works with other workloads.

From the first test, we saw performance gains on high-core-count systems
and little to no effect on older systems with more modest core counts.
From this phase's test data, two candidates, 63 and 127, were chosen.

In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
and more will-it-scale sub-tests were run to see how these two candidates
behave under those workloads, and a new default was decided according to
the results.

Most test results are flat. will-it-scale/page_fault2 process mode shows a
10%-18% performance increase on 4-socket Skylake and Broadwell.
vm-scalability/lru-file-mmap-read shows a 17%-47% performance increase on
4-socket servers, while on 2-socket servers it caused a 3%-8% performance
drop. Further analysis showed that, with a larger pcp->batch and thus a
larger pcp->high (the relationship pcp->high = 6 * pcp->batch is maintained
in this patch), zone->lock contention shifted to contention on the LRU add
side lock, and that caused the performance drop. This drop might be
mitigated by others' ongoing work on optimizing the LRU lock.
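
The batch/high relationship mentioned above is kept unchanged by this
patch. As a minimal sketch (paraphrased, not copied verbatim from the
source), the tuning helper does essentially:

  /* Sketch: pcp->high always follows pcp->batch with a factor of 6. */
  static void pcp_set_batch(struct per_cpu_pages *pcp, unsigned long batch)
  {
          pcp->high  = 6 * batch;         /* flush-back threshold */
          pcp->batch = max(1UL, batch);   /* refill/flush chunk size */
  }

So going from batch=31 to batch=63 also raises the per-CPU, per-zone cache
limit from 186 to 378 pages.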

Another downside of increasing pcp->batch is that when the PCP is used up
and a batch of pages needs to be fetched from Buddy, that refill takes
longer than before because the batch is larger. My understanding is that
this doesn't affect the slowpath, where direct reclaim and compaction
dominate. For the fastpath, throughput is a win (according to
will-it-scale/page_fault1) but worst-case latency can be larger now.
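
The worst-case latency concern comes from the refill loop: all batch pages
are moved out of Buddy under a single zone->lock hold, so a doubled batch
roughly doubles the longest hold time on that path. Below is an
illustrative sketch only (buddy_remove_one() and pcp_push_page() are
made-up helpers); the real code is rmqueue_bulk().

  /* Illustrative sketch of a batched PCP refill under zone->lock. */
  static int pcp_refill_from_buddy(struct zone *zone, struct pcp_sketch *pcp,
                                   int batch)
  {
          int allocated = 0;

          spin_lock(&zone->lock);
          while (allocated < batch) {
                  struct page *page = buddy_remove_one(zone);     /* made up */

                  if (!page)
                          break;
                  pcp_push_page(pcp, page);                       /* made up */
                  allocated++;
          }
          spin_unlock(&zone->lock);

          pcp->count += allocated;
          return allocated;
  }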

Overall, I think doubling the batch size from 31 to 63 is relatively safe
and provides a good performance boost for high-core-count systems.
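
For reference, here is how 31 and 63 fall out of zone_batchsize() on x86
with 4KB pages, assuming the power-of-two-minus-one clamp later in that
function stays as it is (back-of-the-envelope arithmetic, not code added
by this patch):

  /*
   *  old cap:  512KB / 4KB = 128 pages -> /4 = 32 -> clamp to 2^n-1 -> 31
   *  new cap: 1024KB / 4KB = 256 pages -> /4 = 64 -> clamp to 2^n-1 -> 63
   */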

The test results for both phases are listed below (all tests were done
with THP disabled).

Phase one (will-it-scale/page_fault1) test results:

Skylake-EX: the increased batch size has a good effect on zone->lock
contention, though LRU contention rises at the same time and limits the
final performance increase.

batch   score     change   zone_contention   lru_contention   total_contention
 31   15345900    +0.00%       64%                 8%           72%
 53   17903847   +16.67%       32%                38%           70%
 63   17992886   +17.25%       24%                45%           69%
 73   18022825   +17.44%       10%                61%           71%
119   18023401   +17.45%        4%                66%           70%
127   18029012   +17.48%        3%                66%           69%
137   18036075   +17.53%        4%                66%           70%
165   18035964   +17.53%        2%                67%           69%
188   18101105   +17.95%        2%                67%           69%
223   18130951   +18.15%        2%                67%           69%
255   18118898   +18.07%        2%                67%           69%
267   18101559   +17.96%        2%                67%           69%
299   18160468   +18.34%        2%                68%           70%
320   18139845   +18.21%        2%                67%           69%
393   18160869   +18.34%        2%                68%           70%
424   18170999   +18.41%        2%                68%           70%
458   18144868   +18.24%        2%                68%           70%
467   18142366   +18.22%        2%                68%           70%
498   18154549   +18.30%        1%                68%           69%
511   18134525   +18.17%        1%                69%           70%

Broadwell-EX: similar pattern to Skylake-EX.

batch   score     change   zone_contention   lru_contention   total_contention
 31   16703983    +0.00%       67%                 7%           74%
 53   18195393    +8.93%       43%                28%           71%
 63   18288885    +9.49%       38%                33%           71%
 73   18344329    +9.82%       35%                37%           72%
119   18535529   +10.96%       24%                46%           70%
127   18513596   +10.83%       23%                48%           71%
137   18514327   +10.84%       23%                48%           71%
165   18511840   +10.82%       22%                49%           71%
188   18593478   +11.31%       17%                53%           70%
223   18601667   +11.36%       17%                52%           69%
255   18774825   +12.40%       12%                58%           70%
267   18754781   +12.28%        9%                60%           69%
299   18892265   +13.10%        7%                63%           70%
320   18873812   +12.99%        8%                62%           70%
393   18891174   +13.09%        6%                64%           70%
424   18975108   +13.60%        6%                64%           70%
458   18932364   +13.34%        8%                62%           70%
467   18960891   +13.51%        5%                65%           70%
498   18944526   +13.41%        5%                64%           69%
511   18960839   +13.51%        5%                64%           69%

Skylake-EP: although the increased batch reduced zone->lock contention,
the effect is not as good as on EX: zone->lock contention is still as high
as 20% with a very high batch value, instead of 1% on Skylake-EX or 5% on
Broadwell-EX. Also, total_contention actually decreased with a higher
batch, but that doesn't translate into a performance increase.

batch   score    change   zone_contention   lru_contention   total_contention
 31   9554867    +0.00%       66%                 3%           69%
 53   9855486    +3.15%       63%                 3%           66%
 63   9980145    +4.45%       62%                 4%           66%
 73   10092774   +5.63%       62%                 5%           67%
119   10310061   +7.90%       45%                19%           64%
127   10342019   +8.24%       42%                19%           61%
137   10358182   +8.41%       42%                21%           63%
165   10397060   +8.81%       37%                24%           61%
188   10341808   +8.24%       34%                26%           60%
223   10349135   +8.31%       31%                27%           58%
255   10327189   +8.08%       28%                29%           57%
267   10344204   +8.26%       27%                29%           56%
299   10325043   +8.06%       25%                30%           55%
320   10310325   +7.91%       25%                31%           56%
393   10293274   +7.73%       21%                31%           52%
424   10311099   +7.91%       21%                32%           53%
458   10321375   +8.02%       21%                32%           53%
467   10303881   +7.84%       21%                32%           53%
498   10332462   +8.14%       20%                33%           53%
511   10325016   +8.06%       20%                32%           52%

Broadwell-EP: zone->lock and lru lock had an agreement to make sure
performance doesn't increase and they successfully managed to keep
total contention at 70%.

batch   score    change   zone_contention   lru_contention   total_contention
 31   10121178   +0.00%       19%                50%           69%
 53   10142366   +0.21%        6%                63%           69%
 63   10117984   -0.03%       11%                58%           69%
 73   10123330   +0.02%        7%                63%           70%
119   10108791   -0.12%        2%                67%           69%
127   10166074   +0.44%        3%                66%           69%
137   10141574   +0.20%        3%                66%           69%
165   10154499   +0.33%        2%                68%           70%
188   10124921   +0.04%        2%                67%           69%
223   10137399   +0.16%        2%                67%           69%
255   10143289   +0.22%        0%                68%           68%
267   10123535   +0.02%        1%                68%           69%
299   10140952   +0.20%        0%                68%           68%
320   10163170   +0.41%        0%                68%           68%
393   10000633   -1.19%        0%                69%           69%
424   10087998   -0.33%        0%                69%           69%
458   10187116   +0.65%        0%                69%           69%
467   10146790   +0.25%        0%                69%           69%
498   10197958   +0.76%        0%                69%           69%
511   10152326   +0.31%        0%                69%           69%

Haswell-EP: similar to Broadwell-EP.

batch   score   change   zone_contention   lru_contention   total_contention
 31   10442205   +0.00%       14%                48%           62%
 53   10442255   +0.00%        5%                57%           62%
 63   10452059   +0.09%        6%                57%           63%
 73   10482349   +0.38%        5%                59%           64%
119   10454644   +0.12%        3%                60%           63%
127   10431514   -0.10%        3%                59%           62%
137   10423785   -0.18%        3%                60%           63%
165   10481216   +0.37%        2%                61%           63%
188   10448755   +0.06%        2%                61%           63%
223   10467144   +0.24%        2%                61%           63%
255   10480215   +0.36%        2%                61%           63%
267   10484279   +0.40%        2%                61%           63%
299   10466450   +0.23%        2%                61%           63%
320   10452578   +0.10%        2%                61%           63%
393   10499678   +0.55%        1%                62%           63%
424   10481454   +0.38%        1%                62%           63%
458   10473562   +0.30%        1%                62%           63%
467   10484269   +0.40%        0%                62%           62%
498   10505599   +0.61%        0%                62%           62%
511   10483395   +0.39%        0%                62%           62%

Westmere-EP: contention is pretty small, so not interesting. Note that too
high a batch value could hurt performance.

batch   score   change   zone_contention   lru_contention   total_contention
 31   4831523   +0.00%        2%                 3%            5%
 53   4834086   +0.05%        2%                 4%            6%
 63   4834262   +0.06%        2%                 3%            5%
 73   4832851   +0.03%        2%                 4%            6%
119   4830534   -0.02%        1%                 3%            4%
127   4827461   -0.08%        1%                 4%            5%
137   4827459   -0.08%        1%                 3%            4%
165   4820534   -0.23%        0%                 4%            4%
188   4817947   -0.28%        0%                 3%            3%
223   4809671   -0.45%        0%                 3%            3%
255   4802463   -0.60%        0%                 4%            4%
267   4801634   -0.62%        0%                 3%            3%
299   4798047   -0.69%        0%                 3%            3%
320   4793084   -0.80%        0%                 3%            3%
393   4785877   -0.94%        0%                 3%            3%
424   4782911   -1.01%        0%                 3%            3%
458   4779346   -1.08%        0%                 3%            3%
467   4780306   -1.06%        0%                 3%            3%
498   4780589   -1.05%        0%                 3%            3%
511   4773724   -1.20%        0%                 3%            3%

Skylake-Desktop: similar to Westmere-EP, nothing interesting.

batch   score   change   zone_contention   lru_contention   total_contention
 31   3906608   +0.00%        2%                 3%            5%
 53   3940164   +0.86%        2%                 3%            5%
 63   3937289   +0.79%        2%                 3%            5%
 73   3940201   +0.86%        2%                 3%            5%
119   3933240   +0.68%        2%                 3%            5%
127   3930514   +0.61%        2%                 4%            6%
137   3938639   +0.82%        0%                 3%            3%
165   3908755   +0.05%        0%                 3%            3%
188   3905621   -0.03%        0%                 3%            3%
223   3903015   -0.09%        0%                 4%            4%
255   3889480   -0.44%        0%                 3%            3%
267   3891669   -0.38%        0%                 4%            4%
299   3898728   -0.20%        0%                 4%            4%
320   3894547   -0.31%        0%                 4%            4%
393   3875137   -0.81%        0%                 4%            4%
424   3874521   -0.82%        0%                 3%            3%
458   3880432   -0.67%        0%                 4%            4%
467   3888715   -0.46%        0%                 3%            3%
498   3888633   -0.46%        0%                 4%            4%
511   3875305   -0.80%        0%                 5%            5%

Haswell-Desktop: zone->lock contention is pretty low, as on other desktops,
though LRU contention is higher than on the other desktops.

batch   score   change   zone_contention   lru_contention   total_contention
 31   3511158   +0.00%        2%                 5%            7%
 53   3555445   +1.26%        2%                 6%            8%
 63   3561082   +1.42%        2%                 6%            8%
 73   3547218   +1.03%        2%                 6%            8%
119   3571319   +1.71%        1%                 7%            8%
127   3549375   +1.09%        0%                 6%            6%
137   3560233   +1.40%        0%                 6%            6%
165   3555176   +1.25%        2%                 6%            8%
188   3551501   +1.15%        0%                 8%            8%
223   3531462   +0.58%        0%                 7%            7%
255   3570400   +1.69%        0%                 7%            7%
267   3532235   +0.60%        1%                 8%            9%
299   3562326   +1.46%        0%                 6%            6%
320   3553569   +1.21%        0%                 8%            8%
393   3539519   +0.81%        0%                 7%            7%
424   3549271   +1.09%        0%                 8%            8%
458   3528885   +0.50%        0%                 8%            8%
467   3526554   +0.44%        0%                 7%            7%
498   3525302   +0.40%        0%                 9%            9%
511   3527556   +0.47%        0%                 8%            8%

Sandybridge-Desktop: the 0% contention figures aren't accurate; they are
caused by the fractional part being dropped. Since the individual
contention paths are all under 1% here, once they are added up the final
deviation could be as large as 3%.

batch   score   change   zone_contention   lru_contention   total_contention
 31   1744495   +0.00%        0%                 0%            0%
 53   1755341   +0.62%        0%                 0%            0%
 63   1758469   +0.80%        0%                 0%            0%
 73   1759626   +0.87%        0%                 0%            0%
119   1770417   +1.49%        0%                 0%            0%
127   1768252   +1.36%        0%                 0%            0%
137   1767848   +1.34%        0%                 0%            0%
165   1765088   +1.18%        0%                 0%            0%
188   1766918   +1.29%        0%                 0%            0%
223   1767866   +1.34%        0%                 0%            0%
255   1768074   +1.35%        0%                 0%            0%
267   1763187   +1.07%        0%                 0%            0%
299   1765620   +1.21%        0%                 0%            0%
320   1767603   +1.32%        0%                 0%            0%
393   1764612   +1.15%        0%                 0%            0%
424   1758476   +0.80%        0%                 0%            0%
458   1758593   +0.81%        0%                 0%            0%
467   1757915   +0.77%        0%                 0%            0%
498   1753363   +0.51%        0%                 0%            0%
511   1755548   +0.63%        0%                 0%            0%

Phase two test results:
Note: all percentage changes are relative to the base (batch=31).

ebizzy.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
lkp-sb02          98937          98937    +0.0%       99030 +0.1%

kbuild.buildtime (less is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%


netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1% 
lkp-bdw-ep2      365764        365933  +0.0%        365951 +0.1%
lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
lkp-sb02          29990         30137  +0.5%         30103 +0.4%

oltp.transactions (higher is better)

machine         batch=31      batch=63             batch=127
lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%

pigz.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%


will-it-scale.malloc1.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
lkp-sb02          589286       589284 -0.0%         588101 -0.2%

will-it-scale.malloc1.threads (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%

will-it-scale.malloc2.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%

will-it-scale.malloc2.threads (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%

will-it-scale.page_fault2.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
lkp-sb02          842616        837844  -0.6%         833921  -1.0%

will-it-scale.page_fault2.threads (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
lkp-sb02          750626        748615    -0.3%      746621    -0.5%

will-it-scale.page_fault3.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%

will-it-scale.page_fault3.threads (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%

will-it-scale.read1.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
lkp-skl-d01     10969487      10972650     +0.0%    no data
lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%

will-it-scale.read1.threads (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%

will-it-scale.write1.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%

will-it-scale.write1.threads (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
lkp-skl-d01       8346174       8235695     -1.3%     no data
lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%


vm-scalability.anon-r-rand.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
lkp-sb02          570572        567854     -0.5%     568449     -0.4%

vm-scalability.anon-r-rand-mt.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
lkp-sb02          567137        567346     +0.0%     566144     -0.2%

vm-scalability.lru-file-mmap-read.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%

vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
lkp-sb02          258939        258787  -0.1%        258199  -0.3%

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/page_alloc.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07b3c23762ad..27780310e5e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5566,13 +5566,12 @@ static int zone_batchsize(struct zone *zone)
 
 	/*
 	 * The per-cpu-pages pools are set to around 1000th of the
-	 * size of the zone.  But no more than 1/2 of a meg.
-	 *
-	 * OK, so we don't know how big the cache is.  So guess.
+	 * size of the zone.
 	 */
 	batch = zone->managed_pages / 1024;
-	if (batch * PAGE_SIZE > 512 * 1024)
-		batch = (512 * 1024) / PAGE_SIZE;
+	/* But no more than a meg. */
+	if (batch * PAGE_SIZE > 1024 * 1024)
+		batch = (1024 * 1024) / PAGE_SIZE;
 	batch /= 4;		/* We effectively *= 4 below */
 	if (batch < 1)
 		batch = 1;
-- 
2.17.1



* Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
  2018-07-11  5:58 [RFC PATCH] mm, page_alloc: double zone's batchsize Aaron Lu
@ 2018-07-11 21:35 ` Andrew Morton
  2018-07-12  1:40   ` Lu, Aaron
  2018-07-12 12:54 ` Michal Hocko
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2018-07-11 21:35 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, Huang Ying, Dave Hansen, Kemi Wang,
	Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman

On Wed, 11 Jul 2018 13:58:55 +0800 Aaron Lu <aaron.lu@intel.com> wrote:

> [550 lines of changelog]

OK, I'm convinced ;)  That was a lot of work - thanks for being exhaustive.

Of course, not all the world is x86 but I think we can be confident
that other architectures are unlikely to be harmed by the change, at least.



* Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
  2018-07-11 21:35 ` Andrew Morton
@ 2018-07-12  1:40   ` Lu, Aaron
  2018-07-12  1:46     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Lu, Aaron @ 2018-07-12  1:40 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, tim.c.chen, linux-mm, ak, vbabka, Wang, Kemi,
	mhocko, Hansen, Dave, mgorman, Huang, Ying

On Wed, 2018-07-11 at 14:35 -0700, Andrew Morton wrote:
> On Wed, 11 Jul 2018 13:58:55 +0800 Aaron Lu <aaron.lu@intel.com> wrote:
> 
> > [550 lines of changelog]
> 
> OK, I'm convinced ;)  That was a lot of work - thanks for being exhaustive.

Thanks Andrew.
I think the credit goes to Dave Hansen since he has been very careful
about this change and wanted me to do all those 2nd-phase tests to make
sure we didn't get any surprises after doubling the batch size :-)

I think the LKP robot will run even more tests to capture possible
regressions, if any.


* Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
  2018-07-12  1:40   ` Lu, Aaron
@ 2018-07-12  1:46     ` Andrew Morton
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2018-07-12  1:46 UTC (permalink / raw)
  To: Lu, Aaron
  Cc: linux-kernel, tim.c.chen, linux-mm, ak, vbabka, Wang, Kemi,
	mhocko, Hansen, Dave, mgorman, Huang, Ying

On Thu, 12 Jul 2018 01:40:41 +0000 "Lu, Aaron" <aaron.lu@intel.com> wrote:

> Thanks Andrew.
> I think the credit goes to Dave Hansen

Oh.  In that case, I take it all back.  The patch sucks!


* Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
  2018-07-11  5:58 [RFC PATCH] mm, page_alloc: double zone's batchsize Aaron Lu
  2018-07-11 21:35 ` Andrew Morton
@ 2018-07-12 12:54 ` Michal Hocko
  2018-07-12 13:55   ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2018-07-12 12:54 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
	Kemi Wang, Tim Chen, Andi Kleen, Vlastimil Babka, Mel Gorman,
	Jesper Dangaard Brouer

[CC Jesper - I remember he was really concerned about the worst-case
 latencies for high-speed network workloads.]

Sorry for top posting, but I do not want to torture anybody by making them
scroll through the long changelog, which I want to preserve for Jesper.

I personally do not mind this change. I usually find performance
improvements based solely on microbenchmarks, without any real-world
workload numbers, a bit dubious, especially when the numbers come from a
single CPU vendor (even though the number of measurements here is quite
impressive and much better than what we can usually get). On the other
hand, this change is really simple. Should we ever find a regression, it
will be trivial to reconsider/revert.

One could argue that the size of the batch should scale with the number
of CPUs or even the uarch, but considering how much we suck in that
regard, and that the differences are not that large, I agree that simply
bumping up the number is the most viable way forward.

That being said, feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks. The whole patch including the changelog follows below.

On Wed 11-07-18 13:58:55, Aaron Lu wrote:
> To improve page allocator's performance for order-0 pages, each CPU has
> a Per-CPU-Pageset(PCP) per zone. Whenever an order-0 page is needed,
> PCP will be checked first before asking pages from Buddy. When PCP is
> used up, a batch of pages will be fetched from Buddy to improve
> performance and the size of batch can affect performance.
> 
> zone's batch size gets doubled last time by commit ba56e91c9401("mm:
> page_alloc: increase size of per-cpu-pages") over ten years ago. Since
> then, CPU has envolved a lot and CPU's cache sizes also increased.
> 
> Dave Hansen is concerned the current batch size doesn't fit well with
> modern hardware and suggested me to do two things: first, use a page
> allocator intensive benchmark, e.g. will-it-scale/page_fault1 to find
> out how performance changes with different batch sizes on various
> machines and then choose a new default batch size; second, see how
> this new batch size work with other workloads.
> 
> >From the first test, we saw performance gains on high-core-count systems
> and little to no effect on older systems with more modest core counts.
> >From this phase's test data, two candidates: 63 and 127 are chosen.
> 
> In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
> and more will-it-scale sub-tests are tested to see how these two
> candidates work with these workloads and decides a new default
> according to their results.
> 
> Most test results are flat. will-it-scale/page_fault2 process mode has
> 10%-18% performance increase on 4-sockets Skylake and Broadwell.
> vm-scalability/lru-file-mmap-read has 17%-47% performance increase for
> 4-sockets servers while for 2-sockets servers, it caused 3%-8%
> performance drop. Further analysis showed that, with a larger pcp->batch
> and thus larger pcp->high(the relationship of pcp->high=6 * pcp->batch
> is maintained in this patch), zone lock contention shifted to LRU add
> side lock contention and that caused performance drop. This performance
> drop might be mitigated by others' work on optimizing LRU lock.
> 
> Another downside of increasing pcp->batch is, when PCP is used up and
> need to fetch a batch of pages from Buddy, since batch is increased,
> that time can be longer than before. My understanding is, this doesn't
> affect slowpath where direct reclaim and compaction dominates. For
> fastpath, throughput is a win(according to will-it-scale/page_fault1)
> but worst latency can be larger now.
> 
> Overall, I think double the batch size from 31 to 63 is relatively
> safe and provide good performance boost for high-core-count systems.
> 
> The two phase's test results are listed below(all tests are done with
> THP disabled).
> 
> Phase one(will-it-scale/page_fault1) test results:
> 
> Skylake-EX: increased batch size has a good effect on zone->lock
> contention, though LRU contention will rise at the same time and
> limited the final performance increase.
> 
> batch   score     change   zone_contention   lru_contention   total_contention
>  31   15345900    +0.00%       64%                 8%           72%
>  53   17903847   +16.67%       32%                38%           70%
>  63   17992886   +17.25%       24%                45%           69%
>  73   18022825   +17.44%       10%                61%           71%
> 119   18023401   +17.45%        4%                66%           70%
> 127   18029012   +17.48%        3%                66%           69%
> 137   18036075   +17.53%        4%                66%           70%
> 165   18035964   +17.53%        2%                67%           69%
> 188   18101105   +17.95%        2%                67%           69%
> 223   18130951   +18.15%        2%                67%           69%
> 255   18118898   +18.07%        2%                67%           69%
> 267   18101559   +17.96%        2%                67%           69%
> 299   18160468   +18.34%        2%                68%           70%
> 320   18139845   +18.21%        2%                67%           69%
> 393   18160869   +18.34%        2%                68%           70%
> 424   18170999   +18.41%        2%                68%           70%
> 458   18144868   +18.24%        2%                68%           70%
> 467   18142366   +18.22%        2%                68%           70%
> 498   18154549   +18.30%        1%                68%           69%
> 511   18134525   +18.17%        1%                69%           70%
> 
> Broadwell-EX: similar pattern as Skylake-EX.
> 
> batch   score     change   zone_contention   lru_contention   total_contention
>  31   16703983    +0.00%       67%                 7%           74%
>  53   18195393    +8.93%       43%                28%           71%
>  63   18288885    +9.49%       38%                33%           71%
>  73   18344329    +9.82%       35%                37%           72%
> 119   18535529   +10.96%       24%                46%           70%
> 127   18513596   +10.83%       23%                48%           71%
> 137   18514327   +10.84%       23%                48%           71%
> 165   18511840   +10.82%       22%                49%           71%
> 188   18593478   +11.31%       17%                53%           70%
> 223   18601667   +11.36%       17%                52%           69%
> 255   18774825   +12.40%       12%                58%           70%
> 267   18754781   +12.28%        9%                60%           69%
> 299   18892265   +13.10%        7%                63%           70%
> 320   18873812   +12.99%        8%                62%           70%
> 393   18891174   +13.09%        6%                64%           70%
> 424   18975108   +13.60%        6%                64%           70%
> 458   18932364   +13.34%        8%                62%           70%
> 467   18960891   +13.51%        5%                65%           70%
> 498   18944526   +13.41%        5%                64%           69%
> 511   18960839   +13.51%        5%                64%           69%
> 
> Skylake-EP: although increased batch reduced zone->lock contention, but
>  the effect is not as good as EX: zone->lock contention is still as
> high as 20% with a very high batch value instead of 1% on Skylake-EX or
> 5% on Broadwell-EX. Also, total_contention actually decreased with a
> higher batch but that doesn't translate to performance increase.
> 
> batch   score    change   zone_contention   lru_contention   total_contention
>  31   9554867    +0.00%       66%                 3%           69%
>  53   9855486    +3.15%       63%                 3%           66%
>  63   9980145    +4.45%       62%                 4%           66%
>  73   10092774   +5.63%       62%                 5%           67%
> 119   10310061   +7.90%       45%                19%           64%
> 127   10342019   +8.24%       42%                19%           61%
> 137   10358182   +8.41%       42%                21%           63%
> 165   10397060   +8.81%       37%                24%           61%
> 188   10341808   +8.24%       34%                26%           60%
> 223   10349135   +8.31%       31%                27%           58%
> 255   10327189   +8.08%       28%                29%           57%
> 267   10344204   +8.26%       27%                29%           56%
> 299   10325043   +8.06%       25%                30%           55%
> 320   10310325   +7.91%       25%                31%           56%
> 393   10293274   +7.73%       21%                31%           52%
> 424   10311099   +7.91%       21%                32%           53%
> 458   10321375   +8.02%       21%                32%           53%
> 467   10303881   +7.84%       21%                32%           53%
> 498   10332462   +8.14%       20%                33%           53%
> 511   10325016   +8.06%       20%                32%           52%
> 
> Broadwell-EP: zone->lock and lru lock had an agreement to make sure
> performance doesn't increase and they successfully managed to keep
> total contention at 70%.
> 
> batch   score    change   zone_contention   lru_contention   total_contention
>  31   10121178   +0.00%       19%                50%           69%
>  53   10142366   +0.21%        6%                63%           69%
>  63   10117984   -0.03%       11%                58%           69%
>  73   10123330   +0.02%        7%                63%           70%
> 119   10108791   -0.12%        2%                67%           69%
> 127   10166074   +0.44%        3%                66%           69%
> 137   10141574   +0.20%        3%                66%           69%
> 165   10154499   +0.33%        2%                68%           70%
> 188   10124921   +0.04%        2%                67%           69%
> 223   10137399   +0.16%        2%                67%           69%
> 255   10143289   +0.22%        0%                68%           68%
> 267   10123535   +0.02%        1%                68%           69%
> 299   10140952   +0.20%        0%                68%           68%
> 320   10163170   +0.41%        0%                68%           68%
> 393   10000633   -1.19%        0%                69%           69%
> 424   10087998   -0.33%        0%                69%           69%
> 458   10187116   +0.65%        0%                69%           69%
> 467   10146790   +0.25%        0%                69%           69%
> 498   10197958   +0.76%        0%                69%           69%
> 511   10152326   +0.31%        0%                69%           69%
> 
> Haswell-EP: similar to Broadwell-EP.
> 
> batch   score   change   zone_contention   lru_contention   total_contention
>  31   10442205   +0.00%       14%                48%           62%
>  53   10442255   +0.00%        5%                57%           62%
>  63   10452059   +0.09%        6%                57%           63%
>  73   10482349   +0.38%        5%                59%           64%
> 119   10454644   +0.12%        3%                60%           63%
> 127   10431514   -0.10%        3%                59%           62%
> 137   10423785   -0.18%        3%                60%           63%
> 165   10481216   +0.37%        2%                61%           63%
> 188   10448755   +0.06%        2%                61%           63%
> 223   10467144   +0.24%        2%                61%           63%
> 255   10480215   +0.36%        2%                61%           63%
> 267   10484279   +0.40%        2%                61%           63%
> 299   10466450   +0.23%        2%                61%           63%
> 320   10452578   +0.10%        2%                61%           63%
> 393   10499678   +0.55%        1%                62%           63%
> 424   10481454   +0.38%        1%                62%           63%
> 458   10473562   +0.30%        1%                62%           63%
> 467   10484269   +0.40%        0%                62%           62%
> 498   10505599   +0.61%        0%                62%           62%
> 511   10483395   +0.39%        0%                62%           62%
> 
> Westmere-EP: contention is pretty small so not interesting. Note too
> high a batch value could hurt performance.
> 
> batch   score   change   zone_contention   lru_contention   total_contention
>  31   4831523   +0.00%        2%                 3%            5%
>  53   4834086   +0.05%        2%                 4%            6%
>  63   4834262   +0.06%        2%                 3%            5%
>  73   4832851   +0.03%        2%                 4%            6%
> 119   4830534   -0.02%        1%                 3%            4%
> 127   4827461   -0.08%        1%                 4%            5%
> 137   4827459   -0.08%        1%                 3%            4%
> 165   4820534   -0.23%        0%                 4%            4%
> 188   4817947   -0.28%        0%                 3%            3%
> 223   4809671   -0.45%        0%                 3%            3%
> 255   4802463   -0.60%        0%                 4%            4%
> 267   4801634   -0.62%        0%                 3%            3%
> 299   4798047   -0.69%        0%                 3%            3%
> 320   4793084   -0.80%        0%                 3%            3%
> 393   4785877   -0.94%        0%                 3%            3%
> 424   4782911   -1.01%        0%                 3%            3%
> 458   4779346   -1.08%        0%                 3%            3%
> 467   4780306   -1.06%        0%                 3%            3%
> 498   4780589   -1.05%        0%                 3%            3%
> 511   4773724   -1.20%        0%                 3%            3%
> 
> Skylake-Desktop: similar to Westmere-EP, nothing interesting.
> 
> batch   score   change   zone_contention   lru_contention   total_contention
>  31   3906608   +0.00%        2%                 3%            5%
>  53   3940164   +0.86%        2%                 3%            5%
>  63   3937289   +0.79%        2%                 3%            5%
>  73   3940201   +0.86%        2%                 3%            5%
> 119   3933240   +0.68%        2%                 3%            5%
> 127   3930514   +0.61%        2%                 4%            6%
> 137   3938639   +0.82%        0%                 3%            3%
> 165   3908755   +0.05%        0%                 3%            3%
> 188   3905621   -0.03%        0%                 3%            3%
> 223   3903015   -0.09%        0%                 4%            4%
> 255   3889480   -0.44%        0%                 3%            3%
> 267   3891669   -0.38%        0%                 4%            4%
> 299   3898728   -0.20%        0%                 4%            4%
> 320   3894547   -0.31%        0%                 4%            4%
> 393   3875137   -0.81%        0%                 4%            4%
> 424   3874521   -0.82%        0%                 3%            3%
> 458   3880432   -0.67%        0%                 4%            4%
> 467   3888715   -0.46%        0%                 3%            3%
> 498   3888633   -0.46%        0%                 4%            4%
> 511   3875305   -0.80%        0%                 5%            5%
> 
> Haswell-Desktop: zone->lock is pretty low as other desktops, though lru
> contention is higher than other desktops.
> 
> batch   score   change   zone_contention   lru_contention   total_contention
>  31   3511158   +0.00%        2%                 5%            7%
>  53   3555445   +1.26%        2%                 6%            8%
>  63   3561082   +1.42%        2%                 6%            8%
>  73   3547218   +1.03%        2%                 6%            8%
> 119   3571319   +1.71%        1%                 7%            8%
> 127   3549375   +1.09%        0%                 6%            6%
> 137   3560233   +1.40%        0%                 6%            6%
> 165   3555176   +1.25%        2%                 6%            8%
> 188   3551501   +1.15%        0%                 8%            8%
> 223   3531462   +0.58%        0%                 7%            7%
> 255   3570400   +1.69%        0%                 7%            7%
> 267   3532235   +0.60%        1%                 8%            9%
> 299   3562326   +1.46%        0%                 6%            6%
> 320   3553569   +1.21%        0%                 8%            8%
> 393   3539519   +0.81%        0%                 7%            7%
> 424   3549271   +1.09%        0%                 8%            8%
> 458   3528885   +0.50%        0%                 8%            8%
> 467   3526554   +0.44%        0%                 7%            7%
> 498   3525302   +0.40%        0%                 9%            9%
> 511   3527556   +0.47%        0%                 8%            8%
> 
> Sandybridge-Desktop: the 0% contention isn't accurate but caused by
> dropped fractional part. Since multiple contention path's contentions
> are all under 1% here, with some arithmetic operations like add, the
> final deviation could be as large as 3%.
> 
> batch   score   change   zone_contention   lru_contention   total_contention
>  31   1744495   +0.00%        0%                 0%            0%
>  53   1755341   +0.62%        0%                 0%            0%
>  63   1758469   +0.80%        0%                 0%            0%
>  73   1759626   +0.87%        0%                 0%            0%
> 119   1770417   +1.49%        0%                 0%            0%
> 127   1768252   +1.36%        0%                 0%            0%
> 137   1767848   +1.34%        0%                 0%            0%
> 165   1765088   +1.18%        0%                 0%            0%
> 188   1766918   +1.29%        0%                 0%            0%
> 223   1767866   +1.34%        0%                 0%            0%
> 255   1768074   +1.35%        0%                 0%            0%
> 267   1763187   +1.07%        0%                 0%            0%
> 299   1765620   +1.21%        0%                 0%            0%
> 320   1767603   +1.32%        0%                 0%            0%
> 393   1764612   +1.15%        0%                 0%            0%
> 424   1758476   +0.80%        0%                 0%            0%
> 458   1758593   +0.81%        0%                 0%            0%
> 467   1757915   +0.77%        0%                 0%            0%
> 498   1753363   +0.51%        0%                 0%            0%
> 511   1755548   +0.63%        0%                 0%            0%
> 
> Phase two test results:
> Note: all percent change is against base(batch=31).
> 
> ebizzy.throughput (higer is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
> lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
> lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
> lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
> lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
> lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
> lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
> lkp-sb02          98937          98937    +0.0%       99030 +0.1%
> 
> kbuild.buildtime (less is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
> lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
> lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
> lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
> lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
> lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
> lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%
> 
> 
> netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
> lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
> lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1% 
> lk-bdw-ep2       365764        365933  +0.0%        365951 +0.1%
> lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
> lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
> lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
> lkp-sb02          29990         30137  +0.5%         30103 +0.4%
> 
> oltp.transactions (higer is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
> lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
> lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
> lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
> lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
> lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
> lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%
> 
> pigz.throughput (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
> lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
> lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
> lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
> lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
> lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
> lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
> lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%
> 
> 
> will-it-scale.malloc1.processes (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
> lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
> lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
> lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
> lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
> lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
> lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
> lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
> lkp-sb02          589286       589284 -0.0%         588101 -0.2%
> 
> will-it-scale.malloc1.threads (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
> lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
> lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
> lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
> lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
> lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
> lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
> lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
> lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%
> 
> will-it-scale.malloc2.processes (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
> lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
> lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
> lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
> lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
> lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
> lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
> lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
> lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%
> 
> will-it-scale.malloc2.threads (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
> lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
> lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
> lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
> lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
> lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
> lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
> lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
> lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%
> 
> will-it-scale.page_fault2.processes (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
> lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
> lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
> lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
> lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
> lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
> lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
> lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
> lkp-sb02          842616        837844  -0.6%         833921  -1.0%
> 
> will-it-scale.page_fault2.threads (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
> lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
> lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
> lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
> lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
> lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
> lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
> lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
> lkp-sb02          750626        748615    -0.3%      746621    -0.5%
> 
> will-it-scale.page_fault3.processes (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
> lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
> lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
> lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
> lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
> lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
> lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
> lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
> lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%
> 
> will-it-scale.page_fault3.threads (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
> lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
> lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
> lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
> lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
> lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
> lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
> lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
> lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%
> 
> will-it-scale.read1.processes (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
> lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
> lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
> lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
> lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
> lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
> lkp-skl-d01     10969487      10972650     +0.0%    no data
> lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
> lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%
> 
> will-it-scale.read1.threads (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
> lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
> lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
> lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
> lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
> lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
> lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
> lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
> lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%
> 
> will-it-scale.write1.processes (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
> lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
> lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
> lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
> lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
> lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
> lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
> lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
> lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%
> 
> will-it-scale.write1.threads (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
> lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
> lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
> lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
> lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
> lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
> lkp-skl-d01       8346174       8235695     -1.3%     no data
> lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
> lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%
> 
> 
> vm-scalability.anon-r-rand.throughput (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
> lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
> lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
> lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
> lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
> lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
> lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
> lkp-sb02          570572        567854     -0.5%     568449     -0.4%
> 
> vm-scalability.anon-r-rand-mt.throughput (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
> lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
> lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
> lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
> lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
> lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
> lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
> lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
> lkp-sb02          567137        567346     +0.0%     566144     -0.2%
> 
> vm-scalability.lru-file-mmap-read.throughput (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
> lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
> lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
> lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
> lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
> lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
> lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
> lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
> lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%
> 
> vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)
> 
> machine         batch=31      batch=63             batch=127
> lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
> lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
> lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
> lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
> lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
> lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
> lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
> lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
> lkp-sb02          258939        258787  -0.1%        258199  -0.3%
> 
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  mm/page_alloc.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 07b3c23762ad..27780310e5e7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5566,13 +5566,12 @@ static int zone_batchsize(struct zone *zone)
>  
>  	/*
>  	 * The per-cpu-pages pools are set to around 1000th of the
> -	 * size of the zone.  But no more than 1/2 of a meg.
> -	 *
> -	 * OK, so we don't know how big the cache is.  So guess.
> +	 * size of the zone.
>  	 */
>  	batch = zone->managed_pages / 1024;
> -	if (batch * PAGE_SIZE > 512 * 1024)
> -		batch = (512 * 1024) / PAGE_SIZE;
> +	/* But no more than a meg. */
> +	if (batch * PAGE_SIZE > 1024 * 1024)
> +		batch = (1024 * 1024) / PAGE_SIZE;
>  	batch /= 4;		/* We effectively *= 4 below */
>  	if (batch < 1)
>  		batch = 1;
> -- 
> 2.17.1
> 
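
For a concrete sense of why the new cap lands on a batch of 63 (and the
old one on 31), here is a small userspace sketch of the calculation in
the hunk above. It assumes 4 KB pages and that the existing clamp of the
result to a 2^n - 1 value later in zone_batchsize() (not shown in the
hunk) stays as it is; this is illustrative only, not the kernel function
itself.

#include <stdio.h>

static unsigned long rounddown_pow_of_two(unsigned long n)
{
	while (n & (n - 1))
		n &= n - 1;		/* clear the lowest set bit */
	return n;
}

static unsigned long batch_sketch(unsigned long managed_pages,
				  unsigned long page_size)
{
	unsigned long batch = managed_pages / 1024;

	if (batch * page_size > 1024 * 1024)	/* new cap: 1 MB, was 512 KB */
		batch = (1024 * 1024) / page_size;
	batch /= 4;		/* pcp->high is a multiple of batch */
	if (batch < 1)
		batch = 1;
	/* clamp to a 2^n - 1 value, as the unmodified tail of the function does */
	return rounddown_pow_of_two(batch + batch / 2) - 1;
}

int main(void)
{
	printf("16 GB zone:  %lu\n", batch_sketch(4UL << 20, 4096));   /* 63 */
	printf("512 MB zone: %lu\n", batch_sketch(128UL << 10, 4096)); /* 31 */
	return 0;
}

With 4 KB pages, any zone of about 1 GB or more hits the new cap and ends
up with a batch of 63; zones below roughly 512 MB never hit either cap and
are unaffected by this change.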

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
  2018-07-12 12:54 ` Michal Hocko
@ 2018-07-12 13:55   ` Jesper Dangaard Brouer
  2018-07-12 15:01     ` Tariq Toukan
  0 siblings, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2018-07-12 13:55 UTC (permalink / raw)
  To: Michal Hocko, Tariq Toukan
  Cc: Aaron Lu, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
	Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Vlastimil Babka,
	Mel Gorman, brouer, Saeed Mahameed

On Thu, 12 Jul 2018 14:54:08 +0200
Michal Hocko <mhocko@kernel.org> wrote:

> [CC Jesper - I remember he was really concerned about the worst case
>  latencies for highspeed network workloads.]

Cc. Tariq as he has hit some networking benchmarks (around 100Gbit/s)
where we are contending on the page allocator lock, in a CPU scaling
netperf test AFAIK.  I also have some special-case micro-benchmarks
where I can hit it, but that is just a micro-benchmark...

> Sorry for top posting but I do not want to torture anybody to scroll
> down the long changelog which I want to preserve for Jesper.

Thanks! - It looks like a very detailed and exhaustive test for a very
small change, I'm impressed.

> I personally do not mind this change. I usually find performance
> improvements solely based on microbenchmarks without any real-world
> workload numbers a bit dubious. Especially when the numbers are
> single-CPU-vendor based (even though the number of measurements is quite
> impressive here and much better than what we can usually get). On
> the other hand this change is really simple. Should we ever find a
> regression it will be trivial to reconsider/revert.

I do think it is a good idea to increase this batch size.

In network micro-benchmarks it is usually hard to hit the page_alloc
lock, because drivers have all kinds of page recycle schemes to avoid
talking to the page allocator.  Most of these schemes depend on pages
getting recycled fast enough, which is true for these micro-benchmarks,
but for real-world workloads, where packets/pages are "outstanding"
longer, these page-recycle schemes break down, and drivers start to
talk to the page allocator.  Thus, this change might help networking
for real workloads (but will not show up in network micro-benchmarks).
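
To make that recycle pattern concrete, below is a minimal sketch of the
idea, assuming a tiny single-threaded ring with no locking; it is not
taken from any real driver.

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * While pages come back quickly the ring absorbs them and the page
 * allocator is never touched; once pages stay "outstanding" too long
 * the cache runs dry and every allocation falls through to the
 * PCP/buddy path that this patch affects.
 */
#define RECYCLE_RING_SIZE 256

struct page_recycle_cache {
	struct page *ring[RECYCLE_RING_SIZE];
	unsigned int count;
};

static struct page *recycle_get_page(struct page_recycle_cache *c)
{
	if (c->count)
		return c->ring[--c->count];	/* fast path: reuse a cached page */
	return alloc_page(GFP_ATOMIC);		/* slow path: ask the page allocator */
}

static void recycle_put_page(struct page_recycle_cache *c, struct page *page)
{
	if (c->count < RECYCLE_RING_SIZE) {
		c->ring[c->count++] = page;	/* keep it for the next RX buffer */
		return;
	}
	put_page(page);				/* cache full: release to buddy/PCP */
}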


> One could argue that the size of the batch should scale with the number
> of CPUs or even the uarch, but considering how much we suck in that
> regard, and that the differences are not that large, I agree that simply
> bumping up the number is the most viable way forward.
> 
> That being said, feel free to add
> Acked-by: Michal Hocko <mhocko@suse.com>

I will also happily ACK this patch, you can add:

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
  2018-07-12 13:55   ` Jesper Dangaard Brouer
@ 2018-07-12 15:01     ` Tariq Toukan
  0 siblings, 0 replies; 7+ messages in thread
From: Tariq Toukan @ 2018-07-12 15:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Michal Hocko, Tariq Toukan
  Cc: Aaron Lu, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
	Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Vlastimil Babka,
	Mel Gorman, Saeed Mahameed



On 12/07/2018 4:55 PM, Jesper Dangaard Brouer wrote:
> On Thu, 12 Jul 2018 14:54:08 +0200
> Michal Hocko <mhocko@kernel.org> wrote:
> 
>> [CC Jesper - I remember he was really concerned about the worst case
>>   latencies for highspeed network workloads.]
> 
> Cc. Tariq as he has hit some networking benchmarks (around 100Gbit/s)
> where we are contending on the page allocator lock, in a CPU scaling
> netperf test AFAIK.  I also have some special-case micro-benchmarks
> where I can hit it, but that is just a micro-benchmark...
> 

Thanks! Looks good.

Indeed, I simulated the page allocation rate of a 200Gbps NIC, and hit a
major PCP/buddy bottleneck, where spinning on the zone lock took up to
80% CPU, with dramatic BW degradation.

The test ran a relatively small number of TCP streams (4-16) with an
unpinned application (iperf).

Larger batching reduces the contention on the zone lock and improves CPU
utilization. I also considered increasing percpu_pagelist_fraction to a
larger value (I thought of 512, see the patch below), which also affects
the batch size (in pageset_set_high_and_batch).
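
For a rough idea of what fraction = 512 means, the sketch below mirrors
the pcp->high/batch arithmetic described by the vm.txt text in the patch
further down: high is the zone's managed pages divided by the fraction
(per CPU), and batch is high/4 capped at PAGE_SHIFT * 8. The 16 GB zone
in the comment is just an illustrative assumption.

#define PAGE_SHIFT 12	/* assume 4 KB pages */

/*
 * Sketch only. Example: a 16 GB zone has 4,194,304 4K pages; with
 * fraction = 512 that gives pcp->high = 8192 pages (32 MB) per CPU and
 * batch capped at PAGE_SHIFT * 8 = 96, versus 31 (or 63 with this RFC)
 * from the default zone_batchsize() path.
 */
static void high_and_batch_sketch(unsigned long managed_pages, int fraction,
				  unsigned long *high, unsigned long *batch)
{
	*high = managed_pages / fraction;	/* per-CPU high watermark */
	*batch = *high / 4;			/* batch follows high */
	if (*batch < 1)
		*batch = 1;
	if (*batch > PAGE_SHIFT * 8)		/* upper limit from vm.txt */
		*batch = PAGE_SHIFT * 8;
}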

As far as I see it, to totally solve the page allocation bottleneck for
the increasing networking speeds, the following is still required:
1) optimize order-0 allocations (even at the cost of higher-order
allocations).
2) a bulking API for page allocations (a rough sketch follows after this
list).
3) SKB remote-release (on the originating core).
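
On point 2, a rough sketch of what such a bulking call could look like.
The name and signature are invented for illustration (no such API exists
in the kernel today), and the body is only a naive stand-in; a real
implementation would take zone->lock once and pull the whole burst of
order-0 pages in one go.

#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical bulk order-0 allocation helper (sketch only). */
static int page_alloc_bulk_sketch(gfp_t gfp, unsigned int count,
				  struct page **pages)
{
	unsigned int i;

	for (i = 0; i < count; i++) {
		pages[i] = alloc_page(gfp);	/* a real version would batch this */
		if (!pages[i])
			break;
	}
	return i;	/* number of pages actually provided */
}

A driver RX refill path could then ask for, say, 64 pages per call
instead of taking the allocator fast path 64 times.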

Regards,
Tariq

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 697ef8c225df..88763bd716a5 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -741,9 +741,9 @@ of hot per cpu pagelists.  User can specify a number like 100 to allocate
 The batch value of each per cpu pagelist is also updated as a result.  It is
 set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)
 
-The initial value is zero.  Kernel does not use this value at boot time to set
+The initial value is 512.  Kernel uses this value at boot time to set
 the high water marks for each per cpu page list.  If the user writes '0' to this
-sysctl, it will revert to this default behavior.
+sysctl, it will revert to a behavior based on batchsize calculation.
 
 ==============================================================
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..c88e8eb50bcb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -129,7 +129,7 @@
 unsigned long totalreserve_pages __read_mostly;
 unsigned long totalcma_pages __read_mostly;
 
-int percpu_pagelist_fraction;
+int percpu_pagelist_fraction = 512;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
 /*


Thread overview: 7+ messages
2018-07-11  5:58 [RFC PATCH] mm, page_alloc: double zone's batchsize Aaron Lu
2018-07-11 21:35 ` Andrew Morton
2018-07-12  1:40   ` Lu, Aaron
2018-07-12  1:46     ` Andrew Morton
2018-07-12 12:54 ` Michal Hocko
2018-07-12 13:55   ` Jesper Dangaard Brouer
2018-07-12 15:01     ` Tariq Toukan
