[RFC PATCH 0/7] Configurable fair allocation zone policy v2r6
From: Mel Gorman @ 2013-12-13 14:10 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman

Kicked this around a bit more today. It's still a bit half-baked but it restores
the historical performance and leaves the door open at the end for playing
nice with distributing file pages between nodes. Finishing this series
depends on whether we are going to make the remote node behaviour of the
fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
favour of the configurable option because the default can be redefined and
tested while giving users a "compat" mode if we discover the new default
behaviour sucks for some workload.

Changelog since v1
o Fix a lot of brain damage in the configurable policy patch
o Yoink a page cache annotation patch
o Only account batch pages against allocations eligible for the fair policy
o Add patch that distributes file pages on remote nodes by default

Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
bug whereby new pages could be reclaimed before old pages because of how
the page allocator and kswapd interacted on the per-zone LRU lists.
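
As a reminder of how the fair policy works: each zone in the zonelist gets
an allocation batch proportional to its size, allocations are steered away
from zones whose batch is exhausted, and the batches are reset once every
zone is depleted. A minimal user-space sketch of that round-robin batching
(illustrative zone sizes and batch ratio, not the mainline NR_ALLOC_BATCH
code):

#include <stdio.h>

struct zone {
        const char *name;
        long managed_pages;
        long alloc_batch;       /* stand-in for NR_ALLOC_BATCH */
};

static struct zone zones[] = {
        { "Normal", 1000, 0 },
        { "DMA32",   500, 0 },
};
#define NR_ZONES (sizeof(zones) / sizeof(zones[0]))

static void reset_batches(void)
{
        for (unsigned int i = 0; i < NR_ZONES; i++)
                zones[i].alloc_batch = zones[i].managed_pages / 10;
}

/* Walk the zonelist in order, skipping batch-depleted zones */
static struct zone *alloc_page_fair(void)
{
        for (unsigned int i = 0; i < NR_ZONES; i++) {
                if (zones[i].alloc_batch > 0) {
                        zones[i].alloc_batch--;
                        return &zones[i];
                }
        }
        reset_batches();        /* every zone depleted: start a new round */
        return alloc_page_fair();
}

int main(void)
{
        long hits[NR_ZONES] = { 0 };

        reset_batches();
        for (int i = 0; i < 3000; i++)
                hits[alloc_page_fair() - zones]++;

        /* Allocations land in each zone proportionally to its size */
        for (unsigned int i = 0; i < NR_ZONES; i++)
                printf("%-8s %ld allocations\n", zones[i].name, hits[i]);
        return 0;
}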

Unfortunately a side-effect missed during review was that it's now very
easy to allocate remote memory on NUMA machines. The problem is that
it is not a simple case of just restoring local allocation policies as
there are genuine reasons why global page aging may be preferable. It's
still a major change to default behaviour so this patch makes the policy
configurable and sets what I think is a sensible default.
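
As a sketch of what flipping that policy from userspace might look like,
assuming a hypothetical /proc/sys/vm/zone_fair_policy knob (the real knob
name and values are whatever the final patch in the series exposes):

#include <stdio.h>

int main(void)
{
        /*
         * "zone_fair_policy" is a hypothetical name for illustration;
         * assume 0 = interleave local zones only, 1 = include remote
         * nodes, matching the configurable behaviour described above.
         */
        FILE *f = fopen("/proc/sys/vm/zone_fair_policy", "w");

        if (!f) {
                perror("fopen");
                return 1;
        }
        fputs("1\n", f);
        fclose(f);
        return 0;
}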

The patches are on top of some NUMA balancing patches currently in -mm.
The first patch in the series is a patch posted by Johannes that must be
taken into account before any of my patches on top. The last patch of the
series is what alters default behaviour and makes the fair zone allocator
policy configurable.

Sniff test results based on the following kernels:

vanilla		 3.13-rc3 stock
instrument-v2r1  NUMA balancing patches just to rule out any conflicts there
lruslabonly-v2r1 Patch 1 only
local-v2r6	 Patches 1-5 to restore local memory allocations
acct-v2r6	 Patches 1-6 to include an accounting adjustment
remotefile-v2r6  Patches 1-7, which break MPOL_LOCAL by interleaving file pages

kernbench
                          3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                             vanilla       instrument-v2r1      lruslabonly-v2r1            local-v2r6             acct-v2r6       remotefile-v2r6
User    min        1417.32 (  0.00%)     1408.52 (  0.62%)     1414.92 (  0.17%)     1403.37 (  0.98%)     1410.55 (  0.48%)     1405.85 (  0.81%)
User    mean       1419.10 (  0.00%)     1415.39 (  0.26%)     1417.31 (  0.13%)     1409.89 (  0.65%)     1411.40 (  0.54%)     1410.78 (  0.59%)
User    stddev        2.25 (  0.00%)        4.51 (-100.33%)        2.44 ( -8.29%)        3.98 (-76.92%)        0.74 ( 66.98%)        2.94 (-30.81%)
User    max        1422.92 (  0.00%)     1421.05 (  0.13%)     1421.90 (  0.07%)     1415.39 (  0.53%)     1412.55 (  0.73%)     1413.99 (  0.63%)
User    range         5.60 (  0.00%)       12.53 (-123.75%)        6.98 (-24.64%)       12.02 (-114.64%)        2.00 ( 64.29%)        8.14 (-45.36%)
System  min         114.83 (  0.00%)      114.09 (  0.64%)      114.50 (  0.29%)      110.16 (  4.07%)      110.44 (  3.82%)      110.49 (  3.78%)
System  mean        115.89 (  0.00%)      115.01 (  0.76%)      115.12 (  0.67%)      110.73 (  4.46%)      111.20 (  4.05%)      111.17 (  4.08%)
System  stddev        0.63 (  0.00%)        0.57 ( 10.42%)        0.40 ( 37.04%)        0.48 ( 24.87%)        0.51 ( 19.41%)        0.43 ( 32.60%)
System  max         116.81 (  0.00%)      115.87 (  0.80%)      115.52 (  1.10%)      111.47 (  4.57%)      111.98 (  4.13%)      111.63 (  4.43%)
System  range         1.98 (  0.00%)        1.78 ( 10.10%)        1.02 ( 48.48%)        1.31 ( 33.84%)        1.54 ( 22.22%)        1.14 ( 42.42%)
Elapsed min          42.90 (  0.00%)       43.96 ( -2.47%)       42.85 (  0.12%)       43.02 ( -0.28%)       42.55 (  0.82%)       42.75 (  0.35%)
Elapsed mean         43.58 (  0.00%)       44.16 ( -1.34%)       43.88 ( -0.69%)       43.87 ( -0.67%)       43.58 ( -0.00%)       43.80 ( -0.50%)
Elapsed stddev        0.74 (  0.00%)        0.17 ( 77.41%)        0.61 ( 17.23%)        1.00 (-35.26%)        0.67 (  9.46%)        0.82 ( -9.88%)
Elapsed max          44.52 (  0.00%)       44.45 (  0.16%)       44.55 ( -0.07%)       45.72 ( -2.70%)       44.24 (  0.63%)       45.09 ( -1.28%)
Elapsed range         1.62 (  0.00%)        0.49 ( 69.75%)        1.70 ( -4.94%)        2.70 (-66.67%)        1.69 ( -4.32%)        2.34 (-44.44%)
CPU     min        3451.00 (  0.00%)     3455.00 ( -0.12%)     3434.00 (  0.49%)     3311.00 (  4.06%)     3439.00 (  0.35%)     3377.00 (  2.14%)
CPU     mean       3522.40 (  0.00%)     3464.60 (  1.64%)     3492.40 (  0.85%)     3467.40 (  1.56%)     3493.80 (  0.81%)     3475.40 (  1.33%)
CPU     stddev       54.34 (  0.00%)        9.05 ( 83.35%)       54.80 ( -0.85%)       86.04 (-58.33%)       54.99 ( -1.18%)       67.75 (-24.68%)
CPU     max        3570.00 (  0.00%)     3480.00 (  2.52%)     3587.00 ( -0.48%)     3545.00 (  0.70%)     3578.00 ( -0.22%)     3568.00 (  0.06%)
CPU     range       119.00 (  0.00%)       25.00 ( 78.99%)      153.00 (-28.57%)      234.00 (-96.64%)      139.00 (-16.81%)      191.00 (-60.50%)

          3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
             vanilla  instrument-v2r1  lruslabonly-v2r1  local-v2r6  acct-v2r6  remotefile-v2r6
User         8540.49     8516.04     8524.28     8487.25     8488.89     8487.40
System        706.31      701.72      701.20      674.29      675.81      676.52
Elapsed       307.58      311.31      309.72      309.51      308.32      310.36

The kernbench figures themselves are not that compelling, but the system CPU
cost is down a lot. System time is such a small percentage of the overall
workload that it doesn't really matter, and the processes are short-lived anyway.

                            3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
             vanilla  instrument-v2r1  lruslabonly-v2r1  local-v2r6  acct-v2r6  remotefile-v2r6
NUMA alloc hit                73783951    73086669    73385508    93373651    93326068    93321444
NUMA alloc miss               20013534    20247750    19958857         102         118        2129
NUMA interleave hit                  0           0           0           0           0           0
NUMA alloc local              73783935    73086658    73385501    93373644    93326059    93321436

NUMA miss rates are reduced by using the local policy, although they really
should have been zero. I suspect it's the __GFP_PAGECACHE annotation patch
and how it's treated, but I have not proven it. The miss stats go up for the
final patch as page cache pages get distributed between nodes again.
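
For reference, these counters can be read per node from sysfs; a minimal
sketch that dumps them, assuming the standard /sys/devices/system/node
layout (the same numbers numastat(8) reports):

#include <stdio.h>

int main(void)
{
        char path[64], line[128];

        for (int node = 0; node < 64; node++) {
                snprintf(path, sizeof(path),
                         "/sys/devices/system/node/node%d/numastat", node);
                FILE *f = fopen(path, "r");

                if (!f)
                        break;  /* no more nodes */
                /* Lines are "<counter> <value>": numa_hit, numa_miss, ... */
                printf("node%d:\n", node);
                while (fgets(line, sizeof(line), f))
                        printf("  %s", line);
                fclose(f);
        }
        return 0;
}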

vmr-stream
                                3.13.0-rc3                  3.13.0-rc3                  3.13.0-rc3                  3.13.0-rc3                  3.13.0-rc3                  3.13.0-rc3
                                   vanilla             instrument-v2r1            lruslabonly-v2r1                  local-v2r6                   acct-v2r6             remotefile-v2r6
Add      5M        3809.80 (  0.00%)     3783.21 ( -0.70%)     3790.61 ( -0.50%)     3970.34 (  4.21%)     3975.29 (  4.34%)     3992.15 (  4.79%)
Copy     5M        3360.75 (  0.00%)     3345.59 ( -0.45%)     3351.99 ( -0.26%)     3474.69 (  3.39%)     3472.97 (  3.34%)     3474.32 (  3.38%)
Scale    5M        3160.39 (  0.00%)     3163.43 (  0.10%)     3159.88 ( -0.02%)     3393.56 (  7.38%)     3391.85 (  7.32%)     3393.76 (  7.38%)
Triad    5M        3533.04 (  0.00%)     3517.67 ( -0.43%)     3526.18 ( -0.19%)     3856.20 (  9.15%)     3851.39 (  9.01%)     3855.89 (  9.14%)
Add      7M        3789.82 (  0.00%)     3789.03 ( -0.02%)     3779.30 ( -0.28%)     4049.53 (  6.85%)     4001.74 (  5.59%)     3968.84 (  4.72%)
Copy     7M        3345.85 (  0.00%)     3355.75 (  0.30%)     3354.56 (  0.26%)     3484.62 (  4.15%)     3477.23 (  3.93%)     3474.17 (  3.84%)
Scale    7M        3176.00 (  0.00%)     3156.09 ( -0.63%)     3152.84 ( -0.73%)     3401.53 (  7.10%)     3393.55 (  6.85%)     3392.46 (  6.82%)
Triad    7M        3528.85 (  0.00%)     3521.99 ( -0.19%)     3515.20 ( -0.39%)     3861.55 (  9.43%)     3853.51 (  9.20%)     3853.30 (  9.19%)
Add      8M        3801.60 (  0.00%)     3781.66 ( -0.52%)     3788.19 ( -0.35%)     3957.73 (  4.11%)     4002.30 (  5.28%)     4006.69 (  5.39%)
Copy     8M        3364.64 (  0.00%)     3346.31 ( -0.54%)     3353.71 ( -0.32%)     3469.62 (  3.12%)     3476.25 (  3.32%)     3473.67 (  3.24%)
Scale    8M        3169.34 (  0.00%)     3163.10 ( -0.20%)     3157.99 ( -0.36%)     3391.61 (  7.01%)     3395.76 (  7.14%)     3393.20 (  7.06%)
Triad    8M        3531.38 (  0.00%)     3514.83 ( -0.47%)     3518.55 ( -0.36%)     3850.45 (  9.04%)     3853.39 (  9.12%)     3849.50 (  9.01%)
Add      10M       3807.95 (  0.00%)     3791.80 ( -0.42%)     3781.86 ( -0.69%)     3977.13 (  4.44%)     4005.95 (  5.20%)     3983.31 (  4.61%)
Copy     10M       3365.64 (  0.00%)     3361.59 ( -0.12%)     3352.03 ( -0.40%)     3473.78 (  3.21%)     3479.54 (  3.38%)     3471.70 (  3.15%)
Scale    10M       3172.71 (  0.00%)     3157.52 ( -0.48%)     3149.26 ( -0.74%)     3395.59 (  7.02%)     3397.28 (  7.08%)     3394.50 (  6.99%)
Triad    10M       3536.15 (  0.00%)     3524.46 ( -0.33%)     3517.36 ( -0.53%)     3854.88 (  9.01%)     3857.55 (  9.09%)     3853.00 (  8.96%)
Add      14M       3787.56 (  0.00%)     3789.36 (  0.05%)     3780.55 ( -0.19%)     4009.14 (  5.85%)     4019.90 (  6.13%)     3966.93 (  4.74%)
Copy     14M       3345.19 (  0.00%)     3361.79 (  0.50%)     3338.99 ( -0.19%)     3483.34 (  4.13%)     3480.38 (  4.04%)     3470.79 (  3.75%)
Scale    14M       3154.55 (  0.00%)     3155.60 (  0.03%)     3154.74 (  0.01%)     3398.70 (  7.74%)     3396.31 (  7.66%)     3392.50 (  7.54%)
Triad    14M       3522.09 (  0.00%)     3517.21 ( -0.14%)     3514.90 ( -0.20%)     3861.09 (  9.62%)     3854.76 (  9.45%)     3852.52 (  9.38%)
Add      17M       3806.34 (  0.00%)     3770.18 ( -0.95%)     3774.21 ( -0.84%)     3982.37 (  4.62%)     4015.73 (  5.50%)     3979.61 (  4.55%)
Copy     17M       3368.39 (  0.00%)     3334.84 ( -1.00%)     3349.84 ( -0.55%)     3480.15 (  3.32%)     3481.29 (  3.35%)     3470.75 (  3.04%)
Scale    17M       3169.18 (  0.00%)     3164.25 ( -0.16%)     3148.23 ( -0.66%)     3398.11 (  7.22%)     3398.69 (  7.24%)     3389.32 (  6.95%)
Triad    17M       3535.05 (  0.00%)     3510.90 ( -0.68%)     3511.84 ( -0.66%)     3860.14 (  9.20%)     3859.64 (  9.18%)     3848.12 (  8.86%)
Add      21M       3795.31 (  0.00%)     3804.70 (  0.25%)     3795.15 ( -0.00%)     4017.03 (  5.84%)     4029.35 (  6.17%)     3988.21 (  5.08%)
Copy     21M       3353.43 (  0.00%)     3365.89 (  0.37%)     3351.05 ( -0.07%)     3482.88 (  3.86%)     3478.62 (  3.73%)     3479.29 (  3.75%)
Scale    21M       3160.96 (  0.00%)     3170.91 (  0.31%)     3167.45 (  0.21%)     3398.76 (  7.52%)     3394.56 (  7.39%)     3397.91 (  7.50%)
Triad    21M       3530.45 (  0.00%)     3533.62 (  0.09%)     3529.35 ( -0.03%)     3862.25 (  9.40%)     3855.95 (  9.22%)     3859.16 (  9.31%)
Add      28M       3803.11 (  0.00%)     3789.09 ( -0.37%)     3799.69 ( -0.09%)     4016.56 (  5.61%)     3975.01 (  4.52%)     3993.88 (  5.02%)
Copy     28M       3361.16 (  0.00%)     3365.71 (  0.14%)     3368.81 (  0.23%)     3483.91 (  3.65%)     3472.65 (  3.32%)     3475.83 (  3.41%)
Scale    28M       3160.43 (  0.00%)     3151.15 ( -0.29%)     3168.12 (  0.24%)     3399.14 (  7.55%)     3395.77 (  7.45%)     3397.73 (  7.51%)
Triad    28M       3533.66 (  0.00%)     3518.97 ( -0.42%)     3528.59 ( -0.14%)     3861.47 (  9.28%)     3855.76 (  9.12%)     3858.01 (  9.18%)
Add      35M       3792.86 (  0.00%)     3802.89 (  0.26%)     3783.36 ( -0.25%)     3997.11 (  5.39%)     4043.66 (  6.61%)     3962.60 (  4.48%)
Copy     35M       3344.24 (  0.00%)     3356.43 (  0.36%)     3351.61 (  0.22%)     3478.14 (  4.00%)     3486.84 (  4.26%)     3468.70 (  3.72%)
Scale    35M       3160.14 (  0.00%)     3149.58 ( -0.33%)     3159.57 ( -0.02%)     3394.63 (  7.42%)     3401.18 (  7.63%)     3392.57 (  7.36%)
Triad    35M       3531.94 (  0.00%)     3530.90 ( -0.03%)     3517.90 ( -0.40%)     3856.80 (  9.20%)     3862.04 (  9.35%)     3846.73 (  8.91%)
Add      42M       3803.39 (  0.00%)     3789.28 ( -0.37%)     3773.81 ( -0.78%)     4025.00 (  5.83%)     4007.98 (  5.38%)     3944.45 (  3.71%)
Copy     42M       3360.64 (  0.00%)     3355.86 ( -0.14%)     3339.54 ( -0.63%)     3483.81 (  3.67%)     3481.01 (  3.58%)     3464.28 (  3.08%)
Scale    42M       3158.64 (  0.00%)     3168.47 (  0.31%)     3161.82 (  0.10%)     3397.41 (  7.56%)     3397.71 (  7.57%)     3388.43 (  7.27%)
Triad    42M       3529.99 (  0.00%)     3522.03 ( -0.23%)     3512.07 ( -0.51%)     3859.19 (  9.33%)     3859.30 (  9.33%)     3843.50 (  8.88%)
Add      56M       3778.07 (  0.00%)     3802.38 (  0.64%)     3786.95 (  0.23%)     4008.71 (  6.10%)     4001.39 (  5.91%)     3980.85 (  5.37%)
Copy     56M       3348.68 (  0.00%)     3354.81 (  0.18%)     3363.94 (  0.46%)     3481.10 (  3.95%)     3482.10 (  3.98%)     3478.62 (  3.88%)
Scale    56M       3169.25 (  0.00%)     3173.21 (  0.13%)     3160.15 ( -0.29%)     3399.41 (  7.26%)     3399.35 (  7.26%)     3396.19 (  7.16%)
Triad    56M       3517.62 (  0.00%)     3532.08 (  0.41%)     3519.91 (  0.07%)     3861.34 (  9.77%)     3860.40 (  9.74%)     3859.61 (  9.72%)
Add      71M       3811.71 (  0.00%)     3790.78 ( -0.55%)     3792.30 ( -0.51%)     4005.76 (  5.09%)     3996.73 (  4.85%)     4021.00 (  5.49%)
Copy     71M       3370.59 (  0.00%)     3360.98 ( -0.29%)     3357.42 ( -0.39%)     3478.74 (  3.21%)     3472.59 (  3.03%)     3481.72 (  3.30%)
Scale    71M       3168.70 (  0.00%)     3170.94 (  0.07%)     3150.83 ( -0.56%)     3394.36 (  7.12%)     3390.88 (  7.01%)     3397.04 (  7.21%)
Triad    71M       3536.14 (  0.00%)     3525.38 ( -0.30%)     3521.01 ( -0.43%)     3855.90 (  9.04%)     3850.99 (  8.90%)     3859.34 (  9.14%)
Add      85M       3805.94 (  0.00%)     3792.84 ( -0.34%)     3796.44 ( -0.25%)     4004.15 (  5.21%)     4003.69 (  5.20%)     3990.20 (  4.84%)
Copy     85M       3354.76 (  0.00%)     3357.55 (  0.08%)     3360.68 (  0.18%)     3477.66 (  3.66%)     3480.74 (  3.76%)     3471.36 (  3.48%)
Scale    85M       3162.20 (  0.00%)     3156.40 ( -0.18%)     3164.00 (  0.06%)     3396.25 (  7.40%)     3398.16 (  7.46%)     3390.12 (  7.21%)
Triad    85M       3538.76 (  0.00%)     3522.94 ( -0.45%)     3533.03 ( -0.16%)     3854.39 (  8.92%)     3861.37 (  9.12%)     3848.60 (  8.76%)
Add      113M      3803.66 (  0.00%)     3785.42 ( -0.48%)     3804.21 (  0.01%)     3997.16 (  5.09%)     4029.74 (  5.94%)     3987.10 (  4.82%)
Copy     113M      3348.32 (  0.00%)     3359.18 (  0.32%)     3362.06 (  0.41%)     3479.75 (  3.93%)     3488.98 (  4.20%)     3476.86 (  3.84%)
Scale    113M      3177.09 (  0.00%)     3148.61 ( -0.90%)     3147.95 ( -0.92%)     3396.00 (  6.89%)     3404.06 (  7.14%)     3395.97 (  6.89%)
Triad    113M      3536.06 (  0.00%)     3513.51 ( -0.64%)     3531.90 ( -0.12%)     3854.44 (  9.00%)     3869.05 (  9.42%)     3857.86 (  9.10%)
Add      142M      3814.65 (  0.00%)     3779.76 ( -0.91%)     3796.14 ( -0.49%)     3989.97 (  4.60%)     3982.66 (  4.40%)     3944.66 (  3.41%)
Copy     142M      3353.31 (  0.00%)     3347.29 ( -0.18%)     3360.60 (  0.22%)     3477.55 (  3.70%)     3471.80 (  3.53%)     3465.60 (  3.35%)
Scale    142M      3186.05 (  0.00%)     3161.07 ( -0.78%)     3154.54 ( -0.99%)     3397.67 (  6.64%)     3394.53 (  6.54%)     3386.56 (  6.29%)
Triad    142M      3545.41 (  0.00%)     3518.27 ( -0.77%)     3527.15 ( -0.52%)     3858.25 (  8.82%)     3851.34 (  8.63%)     3841.65 (  8.36%)
Add      170M      3787.71 (  0.00%)     3805.45 (  0.47%)     3781.99 ( -0.15%)     3990.15 (  5.34%)     3990.16 (  5.34%)     3997.08 (  5.53%)
Copy     170M      3351.50 (  0.00%)     3362.22 (  0.32%)     3345.90 ( -0.17%)     3478.71 (  3.80%)     3483.70 (  3.94%)     3479.19 (  3.81%)
Scale    170M      3158.38 (  0.00%)     3175.47 (  0.54%)     3151.34 ( -0.22%)     3398.22 (  7.59%)     3400.09 (  7.65%)     3396.11 (  7.53%)
Triad    170M      3521.84 (  0.00%)     3534.01 (  0.35%)     3513.94 ( -0.22%)     3857.99 (  9.54%)     3863.00 (  9.69%)     3856.79 (  9.51%)
Add      227M      3794.46 (  0.00%)     3799.80 (  0.14%)     3789.75 ( -0.12%)     4001.21 (  5.45%)     3982.66 (  4.96%)     3991.65 (  5.20%)
Copy     227M      3368.15 (  0.00%)     3361.29 ( -0.20%)     3357.70 ( -0.31%)     3482.76 (  3.40%)     3473.54 (  3.13%)     3480.61 (  3.34%)
Scale    227M      3160.18 (  0.00%)     3164.94 (  0.15%)     3155.77 ( -0.14%)     3402.44 (  7.67%)     3390.24 (  7.28%)     3397.39 (  7.51%)
Triad    227M      3525.39 (  0.00%)     3523.04 ( -0.07%)     3524.31 ( -0.03%)     3865.12 (  9.64%)     3851.41 (  9.25%)     3859.91 (  9.49%)
Add      284M      3804.29 (  0.00%)     3799.06 ( -0.14%)     3805.86 (  0.04%)     4007.77 (  5.35%)     3986.91 (  4.80%)     3996.16 (  5.04%)
Copy     284M      3366.21 (  0.00%)     3349.03 ( -0.51%)     3369.99 (  0.11%)     3482.10 (  3.44%)     3469.08 (  3.06%)     3475.51 (  3.25%)
Scale    284M      3174.61 (  0.00%)     3173.80 ( -0.03%)     3147.99 ( -0.84%)     3402.22 (  7.17%)     3386.58 (  6.68%)     3395.61 (  6.96%)
Triad    284M      3538.50 (  0.00%)     3538.46 ( -0.00%)     3529.69 ( -0.25%)     3860.86 (  9.11%)     3843.72 (  8.63%)     3853.96 (  8.92%)
Add      341M      3805.26 (  0.00%)     3764.38 ( -1.07%)     3789.55 ( -0.41%)     3989.04 (  4.83%)     3977.50 (  4.53%)     4023.64 (  5.74%)
Copy     341M      3366.98 (  0.00%)     3341.40 ( -0.76%)     3362.85 ( -0.12%)     3476.89 (  3.26%)     3474.40 (  3.19%)     3489.58 (  3.64%)
Scale    341M      3159.11 (  0.00%)     3168.92 (  0.31%)     3177.39 (  0.58%)     3398.01 (  7.56%)     3393.30 (  7.41%)     3405.15 (  7.79%)
Triad    341M      3530.80 (  0.00%)     3506.03 ( -0.70%)     3528.16 ( -0.07%)     3858.85 (  9.29%)     3851.56 (  9.08%)     3868.18 (  9.56%)
Add      455M      3791.15 (  0.00%)     3794.39 (  0.09%)     3807.19 (  0.42%)     4029.29 (  6.28%)     3985.30 (  5.12%)     3988.07 (  5.19%)
Copy     455M      3353.30 (  0.00%)     3365.90 (  0.38%)     3358.94 (  0.17%)     3486.16 (  3.96%)     3475.41 (  3.64%)     3474.43 (  3.61%)
Scale    455M      3161.21 (  0.00%)     3166.60 (  0.17%)     3160.11 ( -0.03%)     3401.81 (  7.61%)     3396.29 (  7.44%)     3395.46 (  7.41%)
Triad    455M      3527.90 (  0.00%)     3525.16 ( -0.08%)     3536.99 (  0.26%)     3864.91 (  9.55%)     3858.19 (  9.36%)     3855.59 (  9.29%)
Add      568M      3779.79 (  0.00%)     3801.70 (  0.58%)     3782.09 (  0.06%)     3985.25 (  5.44%)     4026.56 (  6.53%)     3926.30 (  3.88%)
Copy     568M      3349.93 (  0.00%)     3366.10 (  0.48%)     3336.55 ( -0.40%)     3472.59 (  3.66%)     3485.34 (  4.04%)     3460.49 (  3.30%)
Scale    568M      3163.69 (  0.00%)     3170.00 (  0.20%)     3159.05 ( -0.15%)     3393.16 (  7.25%)     3400.62 (  7.49%)     3382.99 (  6.93%)
Triad    568M      3518.65 (  0.00%)     3535.79 (  0.49%)     3517.04 ( -0.05%)     3850.19 (  9.42%)     3863.35 (  9.80%)     3839.40 (  9.12%)
Add      682M      3801.06 (  0.00%)     3805.79 (  0.12%)     3786.90 ( -0.37%)     3977.83 (  4.65%)     3956.61 (  4.09%)     4001.91 (  5.28%)
Copy     682M      3363.64 (  0.00%)     3357.79 ( -0.17%)     3353.57 ( -0.30%)     3474.04 (  3.28%)     3469.78 (  3.16%)     3475.62 (  3.33%)
Scale    682M      3151.89 (  0.00%)     3169.57 (  0.56%)     3159.20 (  0.23%)     3395.81 (  7.74%)     3392.14 (  7.62%)     3393.91 (  7.68%)
Triad    682M      3528.97 (  0.00%)     3538.12 (  0.26%)     3519.04 ( -0.28%)     3854.44 (  9.22%)     3849.45 (  9.08%)     3853.38 (  9.19%)
Add      910M      3778.97 (  0.00%)     3785.79 (  0.18%)     3799.23 (  0.54%)     4043.50 (  7.00%)     4005.92 (  6.01%)     4014.66 (  6.24%)
Copy     910M      3345.09 (  0.00%)     3355.05 (  0.30%)     3353.56 (  0.25%)     3487.47 (  4.26%)     3473.79 (  3.85%)     3489.55 (  4.32%)
Scale    910M      3164.46 (  0.00%)     3157.34 ( -0.23%)     3167.60 (  0.10%)     3399.70 (  7.43%)     3390.43 (  7.14%)     3404.38 (  7.58%)
Triad    910M      3516.19 (  0.00%)     3520.82 (  0.13%)     3534.78 (  0.53%)     3861.71 (  9.83%)     3850.59 (  9.51%)     3867.83 ( 10.00%)
Add      1137M     3812.17 (  0.00%)     3795.34 ( -0.44%)     3799.71 ( -0.33%)     4022.75 (  5.52%)     3985.00 (  4.53%)     3997.57 (  4.86%)
Copy     1137M     3367.52 (  0.00%)     3364.07 ( -0.10%)     3367.26 ( -0.01%)     3480.58 (  3.36%)     3468.42 (  3.00%)     3473.41 (  3.14%)
Scale    1137M     3158.62 (  0.00%)     3155.05 ( -0.11%)     3164.45 (  0.18%)     3397.03 (  7.55%)     3386.94 (  7.23%)     3392.39 (  7.40%)
Triad    1137M     3536.97 (  0.00%)     3526.00 ( -0.31%)     3529.99 ( -0.20%)     3858.44 (  9.09%)     3845.78 (  8.73%)     3850.80 (  8.87%)
Add      1365M     3806.51 (  0.00%)     3791.63 ( -0.39%)     3786.57 ( -0.52%)     3962.59 (  4.10%)     4029.60 (  5.86%)     3990.23 (  4.83%)
Copy     1365M     3360.43 (  0.00%)     3363.15 (  0.08%)     3347.19 ( -0.39%)     3474.10 (  3.38%)     3488.82 (  3.82%)     3478.98 (  3.53%)
Scale    1365M     3155.95 (  0.00%)     3160.77 (  0.15%)     3164.41 (  0.27%)     3394.90 (  7.57%)     3405.19 (  7.90%)     3396.64 (  7.63%)
Triad    1365M     3534.18 (  0.00%)     3521.12 ( -0.37%)     3519.49 ( -0.42%)     3856.06 (  9.11%)     3865.20 (  9.37%)     3857.96 (  9.16%)
Add      1820M     3797.86 (  0.00%)     3795.51 ( -0.06%)     3800.31 (  0.06%)     4023.79 (  5.95%)     3955.34 (  4.15%)     4003.20 (  5.41%)
Copy     1820M     3362.09 (  0.00%)     3361.06 ( -0.03%)     3359.74 ( -0.07%)     3482.46 (  3.58%)     3468.46 (  3.16%)     3474.92 (  3.36%)
Scale    1820M     3170.20 (  0.00%)     3160.70 ( -0.30%)     3166.72 ( -0.11%)     3396.61 (  7.14%)     3391.98 (  7.00%)     3393.97 (  7.06%)
Triad    1820M     3531.00 (  0.00%)     3527.31 ( -0.10%)     3530.65 ( -0.01%)     3858.18 (  9.27%)     3849.65 (  9.02%)     3854.65 (  9.17%)
Add      2275M     3810.31 (  0.00%)     3792.47 ( -0.47%)     3767.11 ( -1.13%)     3982.71 (  4.52%)     3987.02 (  4.64%)     3977.99 (  4.40%)
Copy     2275M     3373.60 (  0.00%)     3358.29 ( -0.45%)     3335.43 ( -1.13%)     3478.34 (  3.10%)     3476.07 (  3.04%)     3475.55 (  3.02%)
Scale    2275M     3174.64 (  0.00%)     3159.58 ( -0.47%)     3158.94 ( -0.49%)     3398.12 (  7.04%)     3395.41 (  6.95%)     3395.88 (  6.97%)
Triad    2275M     3537.57 (  0.00%)     3527.90 ( -0.27%)     3508.53 ( -0.82%)     3860.60 (  9.13%)     3856.96 (  9.03%)     3856.09 (  9.00%)
Add      2730M     3801.09 (  0.00%)     3812.05 (  0.29%)     3802.64 (  0.04%)     3981.20 (  4.74%)     4017.01 (  5.68%)     3938.62 (  3.62%)
Copy     2730M     3357.18 (  0.00%)     3365.37 (  0.24%)     3361.64 (  0.13%)     3477.74 (  3.59%)     3475.85 (  3.53%)     3464.04 (  3.18%)
Scale    2730M     3177.66 (  0.00%)     3168.10 ( -0.30%)     3161.30 ( -0.51%)     3397.39 (  6.91%)     3393.51 (  6.79%)     3386.47 (  6.57%)
Triad    2730M     3539.59 (  0.00%)     3543.83 (  0.12%)     3528.50 ( -0.31%)     3861.50 (  9.09%)     3854.09 (  8.89%)     3845.27 (  8.64%)
Add      3640M     3816.88 (  0.00%)     3791.01 ( -0.68%)     3779.35 ( -0.98%)     3976.53 (  4.18%)     4050.84 (  6.13%)     3991.81 (  4.58%)
Copy     3640M     3375.91 (  0.00%)     3349.60 ( -0.78%)     3347.88 ( -0.83%)     3472.83 (  2.87%)     3485.96 (  3.26%)     3474.40 (  2.92%)
Scale    3640M     3167.22 (  0.00%)     3168.24 (  0.03%)     3157.93 ( -0.29%)     3395.00 (  7.19%)     3400.17 (  7.36%)     3395.70 (  7.21%)
Triad    3640M     3546.45 (  0.00%)     3528.90 ( -0.49%)     3517.90 ( -0.81%)     3855.08 (  8.70%)     3860.11 (  8.84%)     3854.39 (  8.68%)
Add      4551M     3799.05 (  0.00%)     3805.03 (  0.16%)     3806.14 (  0.19%)     4028.14 (  6.03%)     4026.96 (  6.00%)     4021.84 (  5.86%)
Copy     4551M     3355.66 (  0.00%)     3358.64 (  0.09%)     3356.91 (  0.04%)     3487.50 (  3.93%)     3485.92 (  3.88%)     3481.72 (  3.76%)
Scale    4551M     3171.91 (  0.00%)     3174.92 (  0.09%)     3163.54 ( -0.26%)     3402.45 (  7.27%)     3401.04 (  7.22%)     3396.90 (  7.09%)
Triad    4551M     3531.61 (  0.00%)     3535.95 (  0.12%)     3536.00 (  0.12%)     3864.84 (  9.44%)     3865.01 (  9.44%)     3857.47 (  9.23%)
Add      5461M     3801.60 (  0.00%)     3774.49 ( -0.71%)     3779.16 ( -0.59%)     4010.68 (  5.50%)     3958.91 (  4.14%)     4011.94 (  5.53%)
Copy     5461M     3360.29 (  0.00%)     3347.56 ( -0.38%)     3351.31 ( -0.27%)     3483.90 (  3.68%)     3467.72 (  3.20%)     3480.64 (  3.58%)
Scale    5461M     3161.18 (  0.00%)     3154.56 ( -0.21%)     3149.71 ( -0.36%)     3399.26 (  7.53%)     3391.35 (  7.28%)     3396.95 (  7.46%)
Triad    5461M     3532.35 (  0.00%)     3510.19 ( -0.63%)     3512.62 ( -0.56%)     3862.91 (  9.36%)     3849.95 (  8.99%)     3858.71 (  9.24%)
Add      7281M     3800.80 (  0.00%)     3789.71 ( -0.29%)     3779.60 ( -0.56%)     4023.89 (  5.87%)     4000.63 (  5.26%)     3974.68 (  4.57%)
Copy     7281M     3359.99 (  0.00%)     3349.71 ( -0.31%)     3346.82 ( -0.39%)     3482.20 (  3.64%)     3481.97 (  3.63%)     3471.59 (  3.32%)
Scale    7281M     3168.68 (  0.00%)     3167.95 ( -0.02%)     3154.70 ( -0.44%)     3399.98 (  7.30%)     3400.46 (  7.31%)     3392.10 (  7.05%)
Triad    7281M     3533.59 (  0.00%)     3524.63 ( -0.25%)     3514.25 ( -0.55%)     3861.39 (  9.28%)     3861.70 (  9.29%)     3853.31 (  9.05%)
Add      9102M     3790.67 (  0.00%)     3791.28 (  0.02%)     3790.38 ( -0.01%)     4015.48 (  5.93%)     4013.46 (  5.88%)     4014.66 (  5.91%)
Copy     9102M     3345.80 (  0.00%)     3365.09 (  0.58%)     3353.79 (  0.24%)     3480.51 (  4.03%)     3479.74 (  4.00%)     3481.55 (  4.06%)
Scale    9102M     3174.65 (  0.00%)     3149.82 ( -0.78%)     3166.84 ( -0.25%)     3398.75 (  7.06%)     3398.27 (  7.04%)     3399.20 (  7.07%)
Triad    9102M     3529.51 (  0.00%)     3523.03 ( -0.18%)     3524.38 ( -0.15%)     3861.12 (  9.40%)     3858.35 (  9.32%)     3860.55 (  9.38%)
Add      10922M     3807.96 (  0.00%)     3784.18 ( -0.62%)     3779.45 ( -0.75%)     4021.53 (  5.61%)     3984.89 (  4.65%)     4005.11 (  5.18%)
Copy     10922M     3350.99 (  0.00%)     3351.97 (  0.03%)     3353.08 (  0.06%)     3490.40 (  4.16%)     3472.32 (  3.62%)     3473.98 (  3.67%)
Scale    10922M     3164.74 (  0.00%)     3167.46 (  0.09%)     3154.60 ( -0.32%)     3402.35 (  7.51%)     3392.56 (  7.20%)     3392.16 (  7.19%)
Triad    10922M     3536.69 (  0.00%)     3524.27 ( -0.35%)     3516.30 ( -0.58%)     3865.21 (  9.29%)     3850.74 (  8.88%)     3849.32 (  8.84%)
Add      14563M     3786.28 (  0.00%)     3793.09 (  0.18%)     3787.76 (  0.04%)     3976.82 (  5.03%)     3987.54 (  5.32%)     3988.31 (  5.34%)
Copy     14563M     3352.51 (  0.00%)     3355.74 (  0.10%)     3357.05 (  0.14%)     3472.63 (  3.58%)     3475.97 (  3.68%)     3470.44 (  3.52%)
Scale    14563M     3171.95 (  0.00%)     3168.28 ( -0.12%)     3158.17 ( -0.43%)     3393.54 (  6.99%)     3399.68 (  7.18%)     3390.82 (  6.90%)
Triad    14563M     3522.50 (  0.00%)     3526.12 (  0.10%)     3519.97 ( -0.07%)     3853.92 (  9.41%)     3856.89 (  9.49%)     3847.38 (  9.22%)
Add      18204M     3809.56 (  0.00%)     3772.64 ( -0.97%)     3795.07 ( -0.38%)     4014.65 (  5.38%)     3976.18 (  4.37%)     3963.55 (  4.04%)
Copy     18204M     3365.06 (  0.00%)     3350.49 ( -0.43%)     3359.32 ( -0.17%)     3483.40 (  3.52%)     3473.21 (  3.21%)     3467.66 (  3.05%)
Scale    18204M     3171.25 (  0.00%)     3151.05 ( -0.64%)     3163.69 ( -0.24%)     3400.05 (  7.21%)     3393.76 (  7.02%)     3388.64 (  6.85%)
Triad    18204M     3539.90 (  0.00%)     3508.60 ( -0.88%)     3532.25 ( -0.22%)     3860.99 (  9.07%)     3853.56 (  8.86%)     3847.01 (  8.68%)
Add      21845M     3798.46 (  0.00%)     3800.35 (  0.05%)     3791.21 ( -0.19%)     3995.49 (  5.19%)     3990.65 (  5.06%)     3969.12 (  4.49%)
Copy     21845M     3362.14 (  0.00%)     3363.46 (  0.04%)     3355.34 ( -0.20%)     3477.61 (  3.43%)     3478.33 (  3.46%)     3472.19 (  3.27%)
Scale    21845M     3170.99 (  0.00%)     3164.60 ( -0.20%)     3162.31 ( -0.27%)     3398.14 (  7.16%)     3396.25 (  7.10%)     3393.58 (  7.02%)
Triad    21845M     3534.49 (  0.00%)     3527.34 ( -0.20%)     3522.95 ( -0.33%)     3858.35 (  9.16%)     3856.52 (  9.11%)     3854.98 (  9.07%)
Add      29127M     3819.69 (  0.00%)     3783.38 ( -0.95%)     3786.06 ( -0.88%)     4007.04 (  4.90%)     4005.91 (  4.88%)     4000.99 (  4.75%)
Copy     29127M     3384.67 (  0.00%)     3345.60 ( -1.15%)     3339.55 ( -1.33%)     3480.54 (  2.83%)     3479.91 (  2.81%)     3475.18 (  2.67%)
Scale    29127M     3158.68 (  0.00%)     3166.06 (  0.23%)     3151.78 ( -0.22%)     3399.73 (  7.63%)     3395.21 (  7.49%)     3393.50 (  7.43%)
Triad    29127M     3538.17 (  0.00%)     3520.17 ( -0.51%)     3523.09 ( -0.43%)     3862.24 (  9.16%)     3858.60 (  9.06%)     3851.85 (  8.87%)
Add      36408M     3806.95 (  0.00%)     3793.61 ( -0.35%)     3777.70 ( -0.77%)     4016.66 (  5.51%)     3994.64 (  4.93%)     3991.57 (  4.85%)
Copy     36408M     3361.11 (  0.00%)     3347.61 ( -0.40%)     3353.38 ( -0.23%)     3483.09 (  3.63%)     3476.44 (  3.43%)     3473.26 (  3.34%)
Scale    36408M     3165.87 (  0.00%)     3173.95 (  0.26%)     3171.11 (  0.17%)     3398.81 (  7.36%)     3394.38 (  7.22%)     3393.16 (  7.18%)
Triad    36408M     3536.86 (  0.00%)     3533.81 ( -0.09%)     3513.64 ( -0.66%)     3860.60 (  9.15%)     3855.77 (  9.02%)     3853.09 (  8.94%)
Add      43690M     3799.39 (  0.00%)     3795.90 ( -0.09%)     3803.79 (  0.12%)     3996.57 (  5.19%)     4006.70 (  5.46%)     3981.15 (  4.78%)
Copy     43690M     3359.26 (  0.00%)     3360.94 (  0.05%)     3371.10 (  0.35%)     3479.62 (  3.58%)     3481.69 (  3.64%)     3478.45 (  3.55%)
Scale    43690M     3175.35 (  0.00%)     3163.95 ( -0.36%)     3147.34 ( -0.88%)     3396.36 (  6.96%)     3399.45 (  7.06%)     3398.88 (  7.04%)
Triad    43690M     3535.26 (  0.00%)     3526.88 ( -0.24%)     3528.38 ( -0.19%)     3857.30 (  9.11%)     3858.89 (  9.15%)     3858.38 (  9.14%)
Add      58254M     3799.66 (  0.00%)     3772.37 ( -0.72%)     3768.33 ( -0.82%)     4016.47 (  5.71%)     4014.25 (  5.65%)     3968.79 (  4.45%)
Copy     58254M     3355.12 (  0.00%)     3337.75 ( -0.52%)     3337.41 ( -0.53%)     3481.56 (  3.77%)     3481.28 (  3.76%)     3465.39 (  3.29%)
Scale    58254M     3170.94 (  0.00%)     3159.81 ( -0.35%)     3164.09 ( -0.22%)     3398.35 (  7.17%)     3396.30 (  7.11%)     3388.58 (  6.86%)
Triad    58254M     3537.26 (  0.00%)     3511.62 ( -0.72%)     3507.54 ( -0.84%)     3860.59 (  9.14%)     3858.62 (  9.09%)     3847.30 (  8.76%)
Add      72817M     3815.26 (  0.00%)     3812.73 ( -0.07%)     3787.86 ( -0.72%)     3968.21 (  4.01%)     4030.38 (  5.64%)     3956.57 (  3.70%)
Copy     72817M     3362.18 (  0.00%)     3371.41 (  0.27%)     3345.64 ( -0.49%)     3474.38 (  3.34%)     3482.00 (  3.56%)     3469.46 (  3.19%)
Scale    72817M     3175.73 (  0.00%)     3170.64 ( -0.16%)     3154.28 ( -0.68%)     3394.65 (  6.89%)     3396.69 (  6.96%)     3390.78 (  6.77%)
Triad    72817M     3546.44 (  0.00%)     3537.21 ( -0.26%)     3520.46 ( -0.73%)     3855.50 (  8.71%)     3855.34 (  8.71%)     3849.10 (  8.53%)
Add      87381M     3519.93 (  0.00%)     3501.24 ( -0.53%)     3500.84 ( -0.54%)     3833.20 (  8.90%)     3833.26 (  8.90%)     3840.72 (  9.11%)
Copy     87381M     3175.29 (  0.00%)     3166.11 ( -0.29%)     3163.97 ( -0.36%)     3263.09 (  2.77%)     3264.10 (  2.80%)     3266.85 (  2.88%)
Scale    87381M     2848.76 (  0.00%)     2835.15 ( -0.48%)     2832.37 ( -0.58%)     3177.70 ( 11.55%)     3172.81 ( 11.38%)     3180.05 ( 11.63%)
Triad    87381M     3465.19 (  0.00%)     3453.66 ( -0.33%)     3456.03 ( -0.26%)     3777.01 (  9.00%)     3774.30 (  8.92%)     3783.31 (  9.18%)

Remote access costs are quite visible in this memory streaming benchmark.
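
For context, the streaming kernels measured here are STREAM-style loops;
the triad case is essentially the following sketch (illustrative array
size), and with first-touch placement the arrays live on whichever node
faulted them in, so a remote spill is paid for on every iteration:

#include <stdio.h>
#include <stdlib.h>

#define N (8 * 1024 * 1024)     /* 64MB per array, illustrative */

int main(void)
{
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        const double scalar = 3.0;

        if (!a || !b || !c)
                return 1;

        /* First touch faults the pages in and decides their node */
        for (long i = 0; i < N; i++) {
                b[i] = 1.0;
                c[i] = 2.0;
        }

        /* Triad: every iteration streams b and c and writes a */
        for (long i = 0; i < N; i++)
                a[i] = b[i] + scalar * c[i];

        printf("a[0] = %f\n", a[0]);    /* keep the loops live */
        free(a); free(b); free(c);
        return 0;
}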

          3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
             vanilla  instrument-v2r1  lruslabonly-v2r1  local-v2r6  acct-v2r6  remotefile-v2r6
User         1144.35     1154.81     1156.38     1075.31     1083.70     1087.08
System         55.28       56.07       56.35       49.00       49.06       48.84
Elapsed      1207.64     1220.14     1222.13     1132.20     1141.91     1145.08

pft
                        3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                           vanilla       instrument-v2r1      lruslabonly-v2r1            local-v2r6             acct-v2r6       remotefile-v2r6
User       1       0.6980 (  0.00%)       0.6900 (  1.15%)       0.7050 ( -1.00%)       0.6500 (  6.88%)       0.6550 (  6.16%)       0.6750 (  3.30%)
User       2       0.7040 (  0.00%)       0.6990 (  0.71%)       0.7000 (  0.57%)       0.6980 (  0.85%)       0.7150 ( -1.56%)       0.7040 (  0.00%)
User       3       0.6910 (  0.00%)       0.6930 ( -0.29%)       0.7230 ( -4.63%)       0.7390 ( -6.95%)       0.7180 ( -3.91%)       0.7120 ( -3.04%)
User       4       0.7250 (  0.00%)       0.7580 ( -4.55%)       0.7310 ( -0.83%)       0.7220 (  0.41%)       0.7520 ( -3.72%)       0.7250 (  0.00%)
User       5       0.7590 (  0.00%)       0.7490 (  1.32%)       0.7910 ( -4.22%)       0.7730 ( -1.84%)       0.7480 (  1.45%)       0.7690 ( -1.32%)
User       6       0.8130 (  0.00%)       0.8010 (  1.48%)       0.7940 (  2.34%)       0.7770 (  4.43%)       0.7790 (  4.18%)       0.7700 (  5.29%)
User       7       0.8210 (  0.00%)       0.8380 ( -2.07%)       0.8260 ( -0.61%)       0.7950 (  3.17%)       0.8230 ( -0.24%)       0.7760 (  5.48%)
User       8       0.8390 (  0.00%)       0.8200 (  2.26%)       0.8160 (  2.74%)       0.7840 (  6.56%)       0.7830 (  6.67%)       0.8400 ( -0.12%)
System     1       9.1230 (  0.00%)       9.1120 (  0.12%)       9.0810 (  0.46%)       8.2560 (  9.50%)       8.2760 (  9.28%)       8.2260 (  9.83%)
System     2       9.3990 (  0.00%)       9.3340 (  0.69%)       9.4050 ( -0.06%)       8.4630 (  9.96%)       8.4230 ( 10.38%)       8.4420 ( 10.18%)
System     3       9.1460 (  0.00%)       9.0890 (  0.62%)       9.1380 (  0.09%)       8.5660 (  6.34%)       8.5640 (  6.36%)       8.5290 (  6.75%)
System     4       8.9160 (  0.00%)       8.8840 (  0.36%)       8.9260 ( -0.11%)       8.6760 (  2.69%)       8.6330 (  3.17%)       8.6790 (  2.66%)
System     5       9.5900 (  0.00%)       9.5240 (  0.69%)       9.5230 (  0.70%)       8.9390 (  6.79%)       8.8920 (  7.28%)       8.9410 (  6.77%)
System     6       9.8640 (  0.00%)       9.7120 (  1.54%)       9.8740 ( -0.10%)       9.1460 (  7.28%)       9.1310 (  7.43%)       9.1400 (  7.34%)
System     7       9.9860 (  0.00%)       9.9290 (  0.57%)      10.0030 ( -0.17%)       9.3360 (  6.51%)       9.2430 (  7.44%)       9.2860 (  7.01%)
System     8       9.8570 (  0.00%)       9.8510 (  0.06%)       9.9980 ( -1.43%)       9.3050 (  5.60%)       9.2410 (  6.25%)       9.4170 (  4.46%)
Elapsed    1       9.8240 (  0.00%)       9.8050 (  0.19%)       9.7910 (  0.34%)       8.9080 (  9.32%)       8.9320 (  9.08%)       8.9080 (  9.32%)
Elapsed    2       5.0870 (  0.00%)       5.0500 (  0.73%)       5.0710 (  0.31%)       4.6020 (  9.53%)       4.5860 (  9.85%)       4.5940 (  9.69%)
Elapsed    3       3.3220 (  0.00%)       3.2990 (  0.69%)       3.3210 (  0.03%)       3.1170 (  6.17%)       3.1150 (  6.23%)       3.0950 (  6.83%)
Elapsed    4       2.4440 (  0.00%)       2.4440 (  0.00%)       2.4410 (  0.12%)       2.3930 (  2.09%)       2.3780 (  2.70%)       2.3710 (  2.99%)
Elapsed    5       2.1500 (  0.00%)       2.1410 (  0.42%)       2.1400 (  0.47%)       2.0020 (  6.88%)       1.9830 (  7.77%)       2.0030 (  6.84%)
Elapsed    6       1.8290 (  0.00%)       1.7970 (  1.75%)       1.8260 (  0.16%)       1.6960 (  7.27%)       1.6980 (  7.16%)       1.6930 (  7.44%)
Elapsed    7       1.5760 (  0.00%)       1.5610 (  0.95%)       1.5860 ( -0.63%)       1.4830 (  5.90%)       1.4740 (  6.47%)       1.4730 (  6.54%)
Elapsed    8       1.3660 (  0.00%)       1.3490 (  1.24%)       1.3660 ( -0.00%)       1.2820 (  6.15%)       1.2660 (  7.32%)       1.3030 (  4.61%)
Faults/cpu 1  336505.5875 (  0.00%)  337163.8429 (  0.20%)  337713.8261 (  0.36%)  371079.7726 ( 10.27%)  370090.7928 (  9.98%)  371199.3702 ( 10.31%)
Faults/cpu 2  327139.2186 (  0.00%)  329451.3249 (  0.71%)  326974.9735 ( -0.05%)  360766.3203 ( 10.28%)  361595.0312 ( 10.53%)  361389.4583 ( 10.47%)
Faults/cpu 3  336004.1324 (  0.00%)  337826.9136 (  0.54%)  335004.8869 ( -0.30%)  355249.2266 (  5.73%)  356016.6570 (  5.96%)  357584.5258 (  6.42%)
Faults/cpu 4  342824.1564 (  0.00%)  342825.3087 (  0.00%)  342285.3156 ( -0.16%)  351758.5702 (  2.61%)  352312.8339 (  2.77%)  351503.0837 (  2.53%)
Faults/cpu 5  319553.7707 (  0.00%)  321799.3129 (  0.70%)  320521.1950 (  0.30%)  340315.3807 (  6.50%)  342890.6018 (  7.30%)  340381.5220 (  6.52%)
Faults/cpu 6  309614.5554 (  0.00%)  314330.1834 (  1.52%)  309882.5231 (  0.09%)  333075.2546 (  7.58%)  333637.6404 (  7.76%)  333706.0587 (  7.78%)
Faults/cpu 7  306159.2969 (  0.00%)  307277.9428 (  0.37%)  305306.4748 ( -0.28%)  326309.2165 (  6.58%)  328327.9627 (  7.24%)  328590.8507 (  7.33%)
Faults/cpu 8  309077.4966 (  0.00%)  309849.8370 (  0.25%)  305865.6953 ( -1.04%)  327958.3107 (  6.11%)  329731.7933 (  6.68%)  322280.8870 (  4.27%)
Faults/sec 1  336364.5575 (  0.00%)  336993.1010 (  0.19%)  337563.4257 (  0.36%)  370916.0228 ( 10.27%)  369955.7605 (  9.99%)  370971.4836 ( 10.29%)
Faults/sec 2  649713.2290 (  0.00%)  654448.6622 (  0.73%)  651706.3799 (  0.31%)  717987.1734 ( 10.51%)  720641.9249 ( 10.92%)  719435.7495 ( 10.73%)
Faults/sec 3  994812.3119 (  0.00%) 1001443.9434 (  0.67%)  995205.6607 (  0.04%) 1060228.7843 (  6.58%) 1060484.8602 (  6.60%) 1067127.5522 (  7.27%)
Faults/sec 4 1352137.4832 (  0.00%) 1352463.8578 (  0.02%) 1354323.6163 (  0.16%) 1382325.4091 (  2.23%) 1390344.3320 (  2.83%) 1393760.7116 (  3.08%)
Faults/sec 5 1538115.0421 (  0.00%) 1544331.3978 (  0.40%) 1544368.0159 (  0.41%) 1651247.2902 (  7.36%) 1666751.7259 (  8.36%) 1651371.8632 (  7.36%)
Faults/sec 6 1807211.7324 (  0.00%) 1840430.0157 (  1.84%) 1809763.9743 (  0.14%) 1947049.8237 (  7.74%) 1946986.6396 (  7.73%) 1953384.4599 (  8.09%)
Faults/sec 7 2101840.1872 (  0.00%) 2120169.4773 (  0.87%) 2082926.2675 ( -0.90%) 2233207.9026 (  6.25%) 2241803.5953 (  6.66%) 2242647.3545 (  6.70%)
Faults/sec 8 2421813.7208 (  0.00%) 2453320.5034 (  1.30%) 2419371.3924 ( -0.10%) 2582755.9228 (  6.65%) 2612638.2836 (  7.88%) 2537575.0399 (  4.78%)

          3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
             vanilla  instrument-v2r1  lruslabonly-v2r1  local-v2r6  acct-v2r6  remotefile-v2r6
User           60.57       61.53       61.96       59.47       60.74       60.78
System        868.16      862.80      868.89      805.82      802.76      806.01
Elapsed       336.19      336.02      339.19      311.33      313.18      313.58

The page fault microbenchmark also sees a benefit, probably because the
zeroing of pages no longer incurs a remote access penalty, which the lower
system CPU usage also reflects.
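
pft hammers the anonymous fault path along these lines (a simplified
single-threaded sketch; pft itself is threaded and more careful):

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        const size_t size = 512UL << 20;        /* 512MB anonymous mapping */
        long page = sysconf(_SC_PAGESIZE);
        struct timespec start, end;
        char *mem;

        mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED)
                return 1;

        clock_gettime(CLOCK_MONOTONIC, &start);
        /* Each touch takes a minor fault and zeroes a fresh page */
        for (size_t off = 0; off < size; off += page)
                mem[off] = 1;
        clock_gettime(CLOCK_MONOTONIC, &end);

        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%.0f faults/sec\n", (double)(size / page) / secs);

        munmap(mem, size);
        return 0;
}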
 

ebizzy
                       3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                          vanilla       instrument-v2r1      lruslabonly-v2r1            local-v2r6             acct-v2r6       remotefile-v2r6
Mean     1      3213.33 (  0.00%)     3161.67 ( -1.61%)     3177.00 ( -1.13%)     3234.33 (  0.65%)     3224.00 (  0.33%)     3198.33 ( -0.47%)
Mean     2      2291.33 (  0.00%)     2316.67 (  1.11%)     2309.67 (  0.80%)     2348.67 (  2.50%)     2330.00 (  1.69%)     2332.00 (  1.77%)
Mean     3      2234.67 (  0.00%)     2298.67 (  2.86%)     2252.00 (  0.78%)     2280.67 (  2.06%)     2292.67 (  2.60%)     2270.00 (  1.58%)
Mean     4      2224.33 (  0.00%)     2279.00 (  2.46%)     2250.67 (  1.18%)     2282.33 (  2.61%)     2256.00 (  1.42%)     2256.33 (  1.44%)
Mean     5      2256.33 (  0.00%)     2280.33 (  1.06%)     2265.00 (  0.38%)     2280.67 (  1.08%)     2268.33 (  0.53%)     2276.67 (  0.90%)
Mean     6      2233.00 (  0.00%)     2257.33 (  1.09%)     2200.00 ( -1.48%)     2292.33 (  2.66%)     2274.33 (  1.85%)     2250.33 (  0.78%)
Mean     7      2212.33 (  0.00%)     2229.00 (  0.75%)     2201.67 ( -0.48%)     2279.00 (  3.01%)     2265.00 (  2.38%)     2251.67 (  1.78%)
Mean     8      2224.67 (  0.00%)     2226.33 (  0.07%)     2225.67 (  0.04%)     2255.00 (  1.36%)     2280.67 (  2.52%)     2238.67 (  0.63%)
Mean     12     2213.33 (  0.00%)     2240.00 (  1.20%)     2264.67 (  2.32%)     2249.33 (  1.63%)     2257.67 (  2.00%)     2238.00 (  1.11%)
Mean     16     2221.00 (  0.00%)     2226.33 (  0.24%)     2268.00 (  2.12%)     2266.33 (  2.04%)     2241.00 (  0.90%)     2258.33 (  1.68%)
Mean     20     2215.00 (  0.00%)     2256.00 (  1.85%)     2278.33 (  2.86%)     2238.67 (  1.07%)     2271.67 (  2.56%)     2291.00 (  3.43%)
Mean     24     2175.00 (  0.00%)     2181.00 (  0.28%)     2166.67 ( -0.38%)     2211.00 (  1.66%)     2231.00 (  2.57%)     2247.67 (  3.34%)
Mean     28     2110.00 (  0.00%)     2136.00 (  1.23%)     2123.33 (  0.63%)     2157.00 (  2.23%)     2163.00 (  2.51%)     2164.67 (  2.59%)
Mean     32     2077.67 (  0.00%)     2095.33 (  0.85%)     2091.33 (  0.66%)     2110.67 (  1.59%)     2113.33 (  1.72%)     2110.33 (  1.57%)
Mean     36     2016.33 (  0.00%)     2024.67 (  0.41%)     2039.33 (  1.14%)     2066.33 (  2.48%)     2068.00 (  2.56%)     2069.00 (  2.61%)
Mean     40     1984.00 (  0.00%)     1987.00 (  0.15%)     1993.33 (  0.47%)     2037.00 (  2.67%)     2035.00 (  2.57%)     2042.00 (  2.92%)
Mean     44     1943.33 (  0.00%)     1954.33 (  0.57%)     1961.00 (  0.91%)     2004.33 (  3.14%)     2009.67 (  3.41%)     2018.00 (  3.84%)
Mean     48     1925.00 (  0.00%)     1939.33 (  0.74%)     1929.00 (  0.21%)     1990.67 (  3.41%)     1996.33 (  3.71%)     2007.67 (  4.29%)
Stddev   1        25.42 (  0.00%)       46.78 (-84.02%)       32.75 (-28.84%)       18.62 ( 26.73%)       21.95 ( 13.64%)       30.18 (-18.72%)
Stddev   2        29.68 (  0.00%)        1.70 ( 94.27%)       13.77 ( 53.61%)       12.50 ( 57.89%)       15.51 ( 47.73%)       13.88 ( 53.23%)
Stddev   3        18.15 (  0.00%)       27.48 (-51.35%)        4.32 ( 76.20%)       13.57 ( 25.23%)       15.52 ( 14.50%)       11.78 ( 35.13%)
Stddev   4        41.28 (  0.00%)       13.64 ( 66.96%)        6.94 ( 83.18%)       24.51 ( 40.62%)       24.04 ( 41.76%)        7.41 ( 82.05%)
Stddev   5        27.18 (  0.00%)        9.03 ( 66.78%)        4.97 ( 81.73%)        8.50 ( 68.74%)       17.25 ( 36.54%)       15.80 ( 41.88%)
Stddev   6        10.80 (  0.00%)       17.97 (-66.36%)        9.27 ( 14.14%)        6.60 ( 38.90%)       16.01 (-48.20%)       19.36 (-79.26%)
Stddev   7        23.10 (  0.00%)       17.91 ( 22.48%)       29.58 (-28.05%)       15.94 ( 31.00%)        5.72 ( 75.26%)       12.76 ( 44.75%)
Stddev   8         3.68 (  0.00%)       41.52 (-1027.82%)       26.74 (-626.21%)        4.32 (-17.35%)       33.81 (-818.21%)       12.50 (-239.48%)
Stddev   12       23.84 (  0.00%)        6.48 ( 72.81%)       14.66 ( 38.50%)       13.47 ( 43.47%)       11.79 ( 50.56%)       18.71 ( 21.52%)
Stddev   16       20.22 (  0.00%)       17.13 ( 15.25%)       28.99 (-43.43%)        2.36 ( 88.34%)        2.16 ( 89.31%)       16.13 ( 20.20%)
Stddev   20        3.74 (  0.00%)        6.53 (-74.57%)       45.02 (-1103.24%)       22.54 (-502.51%)        8.18 (-118.58%)       26.28 (-602.38%)
Stddev   24       18.18 (  0.00%)       19.30 ( -6.16%)       23.81 (-30.93%)        9.42 ( 48.22%)       16.99 (  6.57%)        8.18 ( 55.02%)
Stddev   28       11.78 (  0.00%)        7.79 ( 33.86%)       15.92 (-35.22%)       12.96 (-10.07%)       12.83 ( -8.97%)       17.78 (-51.01%)
Stddev   32        9.74 (  0.00%)        2.05 ( 78.91%)        8.81 (  9.59%)        6.55 ( 32.77%)        3.09 ( 68.27%)        1.70 ( 82.55%)
Stddev   36        3.86 (  0.00%)        5.44 (-40.89%)        2.36 ( 38.92%)       13.22 (-242.73%)       11.78 (-205.18%)       16.87 (-337.26%)
Stddev   40       14.17 (  0.00%)        7.48 ( 47.17%)        5.56 ( 60.77%)        5.89 ( 58.44%)        2.16 ( 84.75%)        2.45 ( 82.71%)
Stddev   44        7.54 (  0.00%)        3.40 ( 54.93%)        2.94 ( 60.97%)        7.54 (  0.00%)        3.68 ( 51.19%)        1.63 ( 78.35%)
Stddev   48        2.94 (  0.00%)        5.56 (-88.79%)        3.56 (-20.89%)        6.24 (-111.83%)        1.70 ( 42.26%)       17.25 (-485.95%)

I ran ebizzy because it doubles up as a page allocation microbenchmark that
hits page faults differently from pft. It looks like an ok gain, but the
stddev is high and would need to be stabilised before drawing a solid
conclusion.

None of these benchmarks do *anything* related to what commit 81c0a2bb was
supposed to fix. I just wanted to get the point across that our current
default behaviour sucks and we should revisit that decision.

My position is that by default we should only round-robin zones local to
the allocating process and that node round-robin is something that should
only be explicitly enabled.

I'm less sure about the round robin treatment of slab but am erring on
the side of historical behaviour until it is proven otherwise.

 Documentation/sysctl/vm.txt |  32 +++++++++
 include/linux/gfp.h         |   4 +-
 include/linux/mmzone.h      |   2 +
 include/linux/pagemap.h     |   2 +-
 include/linux/swap.h        |   2 +
 kernel/sysctl.c             |   8 +++
 mm/filemap.c                |   2 +
 mm/page_alloc.c             | 153 +++++++++++++++++++++++++++++++++++++-------
 8 files changed, 180 insertions(+), 25 deletions(-)

-- 
1.8.4


Add      56M       3778.07 (  0.00%)     3802.38 (  0.64%)     3786.95 (  0.23%)     4008.71 (  6.10%)     4001.39 (  5.91%)     3980.85 (  5.37%)
Copy     56M       3348.68 (  0.00%)     3354.81 (  0.18%)     3363.94 (  0.46%)     3481.10 (  3.95%)     3482.10 (  3.98%)     3478.62 (  3.88%)
Scale    56M       3169.25 (  0.00%)     3173.21 (  0.13%)     3160.15 ( -0.29%)     3399.41 (  7.26%)     3399.35 (  7.26%)     3396.19 (  7.16%)
Triad    56M       3517.62 (  0.00%)     3532.08 (  0.41%)     3519.91 (  0.07%)     3861.34 (  9.77%)     3860.40 (  9.74%)     3859.61 (  9.72%)
Add      71M       3811.71 (  0.00%)     3790.78 ( -0.55%)     3792.30 ( -0.51%)     4005.76 (  5.09%)     3996.73 (  4.85%)     4021.00 (  5.49%)
Copy     71M       3370.59 (  0.00%)     3360.98 ( -0.29%)     3357.42 ( -0.39%)     3478.74 (  3.21%)     3472.59 (  3.03%)     3481.72 (  3.30%)
Scale    71M       3168.70 (  0.00%)     3170.94 (  0.07%)     3150.83 ( -0.56%)     3394.36 (  7.12%)     3390.88 (  7.01%)     3397.04 (  7.21%)
Triad    71M       3536.14 (  0.00%)     3525.38 ( -0.30%)     3521.01 ( -0.43%)     3855.90 (  9.04%)     3850.99 (  8.90%)     3859.34 (  9.14%)
Add      85M       3805.94 (  0.00%)     3792.84 ( -0.34%)     3796.44 ( -0.25%)     4004.15 (  5.21%)     4003.69 (  5.20%)     3990.20 (  4.84%)
Copy     85M       3354.76 (  0.00%)     3357.55 (  0.08%)     3360.68 (  0.18%)     3477.66 (  3.66%)     3480.74 (  3.76%)     3471.36 (  3.48%)
Scale    85M       3162.20 (  0.00%)     3156.40 ( -0.18%)     3164.00 (  0.06%)     3396.25 (  7.40%)     3398.16 (  7.46%)     3390.12 (  7.21%)
Triad    85M       3538.76 (  0.00%)     3522.94 ( -0.45%)     3533.03 ( -0.16%)     3854.39 (  8.92%)     3861.37 (  9.12%)     3848.60 (  8.76%)
Add      113M      3803.66 (  0.00%)     3785.42 ( -0.48%)     3804.21 (  0.01%)     3997.16 (  5.09%)     4029.74 (  5.94%)     3987.10 (  4.82%)
Copy     113M      3348.32 (  0.00%)     3359.18 (  0.32%)     3362.06 (  0.41%)     3479.75 (  3.93%)     3488.98 (  4.20%)     3476.86 (  3.84%)
Scale    113M      3177.09 (  0.00%)     3148.61 ( -0.90%)     3147.95 ( -0.92%)     3396.00 (  6.89%)     3404.06 (  7.14%)     3395.97 (  6.89%)
Triad    113M      3536.06 (  0.00%)     3513.51 ( -0.64%)     3531.90 ( -0.12%)     3854.44 (  9.00%)     3869.05 (  9.42%)     3857.86 (  9.10%)
Add      142M      3814.65 (  0.00%)     3779.76 ( -0.91%)     3796.14 ( -0.49%)     3989.97 (  4.60%)     3982.66 (  4.40%)     3944.66 (  3.41%)
Copy     142M      3353.31 (  0.00%)     3347.29 ( -0.18%)     3360.60 (  0.22%)     3477.55 (  3.70%)     3471.80 (  3.53%)     3465.60 (  3.35%)
Scale    142M      3186.05 (  0.00%)     3161.07 ( -0.78%)     3154.54 ( -0.99%)     3397.67 (  6.64%)     3394.53 (  6.54%)     3386.56 (  6.29%)
Triad    142M      3545.41 (  0.00%)     3518.27 ( -0.77%)     3527.15 ( -0.52%)     3858.25 (  8.82%)     3851.34 (  8.63%)     3841.65 (  8.36%)
Add      170M      3787.71 (  0.00%)     3805.45 (  0.47%)     3781.99 ( -0.15%)     3990.15 (  5.34%)     3990.16 (  5.34%)     3997.08 (  5.53%)
Copy     170M      3351.50 (  0.00%)     3362.22 (  0.32%)     3345.90 ( -0.17%)     3478.71 (  3.80%)     3483.70 (  3.94%)     3479.19 (  3.81%)
Scale    170M      3158.38 (  0.00%)     3175.47 (  0.54%)     3151.34 ( -0.22%)     3398.22 (  7.59%)     3400.09 (  7.65%)     3396.11 (  7.53%)
Triad    170M      3521.84 (  0.00%)     3534.01 (  0.35%)     3513.94 ( -0.22%)     3857.99 (  9.54%)     3863.00 (  9.69%)     3856.79 (  9.51%)
Add      227M      3794.46 (  0.00%)     3799.80 (  0.14%)     3789.75 ( -0.12%)     4001.21 (  5.45%)     3982.66 (  4.96%)     3991.65 (  5.20%)
Copy     227M      3368.15 (  0.00%)     3361.29 ( -0.20%)     3357.70 ( -0.31%)     3482.76 (  3.40%)     3473.54 (  3.13%)     3480.61 (  3.34%)
Scale    227M      3160.18 (  0.00%)     3164.94 (  0.15%)     3155.77 ( -0.14%)     3402.44 (  7.67%)     3390.24 (  7.28%)     3397.39 (  7.51%)
Triad    227M      3525.39 (  0.00%)     3523.04 ( -0.07%)     3524.31 ( -0.03%)     3865.12 (  9.64%)     3851.41 (  9.25%)     3859.91 (  9.49%)
Add      284M      3804.29 (  0.00%)     3799.06 ( -0.14%)     3805.86 (  0.04%)     4007.77 (  5.35%)     3986.91 (  4.80%)     3996.16 (  5.04%)
Copy     284M      3366.21 (  0.00%)     3349.03 ( -0.51%)     3369.99 (  0.11%)     3482.10 (  3.44%)     3469.08 (  3.06%)     3475.51 (  3.25%)
Scale    284M      3174.61 (  0.00%)     3173.80 ( -0.03%)     3147.99 ( -0.84%)     3402.22 (  7.17%)     3386.58 (  6.68%)     3395.61 (  6.96%)
Triad    284M      3538.50 (  0.00%)     3538.46 ( -0.00%)     3529.69 ( -0.25%)     3860.86 (  9.11%)     3843.72 (  8.63%)     3853.96 (  8.92%)
Add      341M      3805.26 (  0.00%)     3764.38 ( -1.07%)     3789.55 ( -0.41%)     3989.04 (  4.83%)     3977.50 (  4.53%)     4023.64 (  5.74%)
Copy     341M      3366.98 (  0.00%)     3341.40 ( -0.76%)     3362.85 ( -0.12%)     3476.89 (  3.26%)     3474.40 (  3.19%)     3489.58 (  3.64%)
Scale    341M      3159.11 (  0.00%)     3168.92 (  0.31%)     3177.39 (  0.58%)     3398.01 (  7.56%)     3393.30 (  7.41%)     3405.15 (  7.79%)
Triad    341M      3530.80 (  0.00%)     3506.03 ( -0.70%)     3528.16 ( -0.07%)     3858.85 (  9.29%)     3851.56 (  9.08%)     3868.18 (  9.56%)
Add      455M      3791.15 (  0.00%)     3794.39 (  0.09%)     3807.19 (  0.42%)     4029.29 (  6.28%)     3985.30 (  5.12%)     3988.07 (  5.19%)
Copy     455M      3353.30 (  0.00%)     3365.90 (  0.38%)     3358.94 (  0.17%)     3486.16 (  3.96%)     3475.41 (  3.64%)     3474.43 (  3.61%)
Scale    455M      3161.21 (  0.00%)     3166.60 (  0.17%)     3160.11 ( -0.03%)     3401.81 (  7.61%)     3396.29 (  7.44%)     3395.46 (  7.41%)
Triad    455M      3527.90 (  0.00%)     3525.16 ( -0.08%)     3536.99 (  0.26%)     3864.91 (  9.55%)     3858.19 (  9.36%)     3855.59 (  9.29%)
Add      568M      3779.79 (  0.00%)     3801.70 (  0.58%)     3782.09 (  0.06%)     3985.25 (  5.44%)     4026.56 (  6.53%)     3926.30 (  3.88%)
Copy     568M      3349.93 (  0.00%)     3366.10 (  0.48%)     3336.55 ( -0.40%)     3472.59 (  3.66%)     3485.34 (  4.04%)     3460.49 (  3.30%)
Scale    568M      3163.69 (  0.00%)     3170.00 (  0.20%)     3159.05 ( -0.15%)     3393.16 (  7.25%)     3400.62 (  7.49%)     3382.99 (  6.93%)
Triad    568M      3518.65 (  0.00%)     3535.79 (  0.49%)     3517.04 ( -0.05%)     3850.19 (  9.42%)     3863.35 (  9.80%)     3839.40 (  9.12%)
Add      682M      3801.06 (  0.00%)     3805.79 (  0.12%)     3786.90 ( -0.37%)     3977.83 (  4.65%)     3956.61 (  4.09%)     4001.91 (  5.28%)
Copy     682M      3363.64 (  0.00%)     3357.79 ( -0.17%)     3353.57 ( -0.30%)     3474.04 (  3.28%)     3469.78 (  3.16%)     3475.62 (  3.33%)
Scale    682M      3151.89 (  0.00%)     3169.57 (  0.56%)     3159.20 (  0.23%)     3395.81 (  7.74%)     3392.14 (  7.62%)     3393.91 (  7.68%)
Triad    682M      3528.97 (  0.00%)     3538.12 (  0.26%)     3519.04 ( -0.28%)     3854.44 (  9.22%)     3849.45 (  9.08%)     3853.38 (  9.19%)
Add      910M      3778.97 (  0.00%)     3785.79 (  0.18%)     3799.23 (  0.54%)     4043.50 (  7.00%)     4005.92 (  6.01%)     4014.66 (  6.24%)
Copy     910M      3345.09 (  0.00%)     3355.05 (  0.30%)     3353.56 (  0.25%)     3487.47 (  4.26%)     3473.79 (  3.85%)     3489.55 (  4.32%)
Scale    910M      3164.46 (  0.00%)     3157.34 ( -0.23%)     3167.60 (  0.10%)     3399.70 (  7.43%)     3390.43 (  7.14%)     3404.38 (  7.58%)
Triad    910M      3516.19 (  0.00%)     3520.82 (  0.13%)     3534.78 (  0.53%)     3861.71 (  9.83%)     3850.59 (  9.51%)     3867.83 ( 10.00%)
Add      1137M     3812.17 (  0.00%)     3795.34 ( -0.44%)     3799.71 ( -0.33%)     4022.75 (  5.52%)     3985.00 (  4.53%)     3997.57 (  4.86%)
Copy     1137M     3367.52 (  0.00%)     3364.07 ( -0.10%)     3367.26 ( -0.01%)     3480.58 (  3.36%)     3468.42 (  3.00%)     3473.41 (  3.14%)
Scale    1137M     3158.62 (  0.00%)     3155.05 ( -0.11%)     3164.45 (  0.18%)     3397.03 (  7.55%)     3386.94 (  7.23%)     3392.39 (  7.40%)
Triad    1137M     3536.97 (  0.00%)     3526.00 ( -0.31%)     3529.99 ( -0.20%)     3858.44 (  9.09%)     3845.78 (  8.73%)     3850.80 (  8.87%)
Add      1365M     3806.51 (  0.00%)     3791.63 ( -0.39%)     3786.57 ( -0.52%)     3962.59 (  4.10%)     4029.60 (  5.86%)     3990.23 (  4.83%)
Copy     1365M     3360.43 (  0.00%)     3363.15 (  0.08%)     3347.19 ( -0.39%)     3474.10 (  3.38%)     3488.82 (  3.82%)     3478.98 (  3.53%)
Scale    1365M     3155.95 (  0.00%)     3160.77 (  0.15%)     3164.41 (  0.27%)     3394.90 (  7.57%)     3405.19 (  7.90%)     3396.64 (  7.63%)
Triad    1365M     3534.18 (  0.00%)     3521.12 ( -0.37%)     3519.49 ( -0.42%)     3856.06 (  9.11%)     3865.20 (  9.37%)     3857.96 (  9.16%)
Add      1820M     3797.86 (  0.00%)     3795.51 ( -0.06%)     3800.31 (  0.06%)     4023.79 (  5.95%)     3955.34 (  4.15%)     4003.20 (  5.41%)
Copy     1820M     3362.09 (  0.00%)     3361.06 ( -0.03%)     3359.74 ( -0.07%)     3482.46 (  3.58%)     3468.46 (  3.16%)     3474.92 (  3.36%)
Scale    1820M     3170.20 (  0.00%)     3160.70 ( -0.30%)     3166.72 ( -0.11%)     3396.61 (  7.14%)     3391.98 (  7.00%)     3393.97 (  7.06%)
Triad    1820M     3531.00 (  0.00%)     3527.31 ( -0.10%)     3530.65 ( -0.01%)     3858.18 (  9.27%)     3849.65 (  9.02%)     3854.65 (  9.17%)
Add      2275M     3810.31 (  0.00%)     3792.47 ( -0.47%)     3767.11 ( -1.13%)     3982.71 (  4.52%)     3987.02 (  4.64%)     3977.99 (  4.40%)
Copy     2275M     3373.60 (  0.00%)     3358.29 ( -0.45%)     3335.43 ( -1.13%)     3478.34 (  3.10%)     3476.07 (  3.04%)     3475.55 (  3.02%)
Scale    2275M     3174.64 (  0.00%)     3159.58 ( -0.47%)     3158.94 ( -0.49%)     3398.12 (  7.04%)     3395.41 (  6.95%)     3395.88 (  6.97%)
Triad    2275M     3537.57 (  0.00%)     3527.90 ( -0.27%)     3508.53 ( -0.82%)     3860.60 (  9.13%)     3856.96 (  9.03%)     3856.09 (  9.00%)
Add      2730M     3801.09 (  0.00%)     3812.05 (  0.29%)     3802.64 (  0.04%)     3981.20 (  4.74%)     4017.01 (  5.68%)     3938.62 (  3.62%)
Copy     2730M     3357.18 (  0.00%)     3365.37 (  0.24%)     3361.64 (  0.13%)     3477.74 (  3.59%)     3475.85 (  3.53%)     3464.04 (  3.18%)
Scale    2730M     3177.66 (  0.00%)     3168.10 ( -0.30%)     3161.30 ( -0.51%)     3397.39 (  6.91%)     3393.51 (  6.79%)     3386.47 (  6.57%)
Triad    2730M     3539.59 (  0.00%)     3543.83 (  0.12%)     3528.50 ( -0.31%)     3861.50 (  9.09%)     3854.09 (  8.89%)     3845.27 (  8.64%)
Add      3640M     3816.88 (  0.00%)     3791.01 ( -0.68%)     3779.35 ( -0.98%)     3976.53 (  4.18%)     4050.84 (  6.13%)     3991.81 (  4.58%)
Copy     3640M     3375.91 (  0.00%)     3349.60 ( -0.78%)     3347.88 ( -0.83%)     3472.83 (  2.87%)     3485.96 (  3.26%)     3474.40 (  2.92%)
Scale    3640M     3167.22 (  0.00%)     3168.24 (  0.03%)     3157.93 ( -0.29%)     3395.00 (  7.19%)     3400.17 (  7.36%)     3395.70 (  7.21%)
Triad    3640M     3546.45 (  0.00%)     3528.90 ( -0.49%)     3517.90 ( -0.81%)     3855.08 (  8.70%)     3860.11 (  8.84%)     3854.39 (  8.68%)
Add      4551M     3799.05 (  0.00%)     3805.03 (  0.16%)     3806.14 (  0.19%)     4028.14 (  6.03%)     4026.96 (  6.00%)     4021.84 (  5.86%)
Copy     4551M     3355.66 (  0.00%)     3358.64 (  0.09%)     3356.91 (  0.04%)     3487.50 (  3.93%)     3485.92 (  3.88%)     3481.72 (  3.76%)
Scale    4551M     3171.91 (  0.00%)     3174.92 (  0.09%)     3163.54 ( -0.26%)     3402.45 (  7.27%)     3401.04 (  7.22%)     3396.90 (  7.09%)
Triad    4551M     3531.61 (  0.00%)     3535.95 (  0.12%)     3536.00 (  0.12%)     3864.84 (  9.44%)     3865.01 (  9.44%)     3857.47 (  9.23%)
Add      5461M     3801.60 (  0.00%)     3774.49 ( -0.71%)     3779.16 ( -0.59%)     4010.68 (  5.50%)     3958.91 (  4.14%)     4011.94 (  5.53%)
Copy     5461M     3360.29 (  0.00%)     3347.56 ( -0.38%)     3351.31 ( -0.27%)     3483.90 (  3.68%)     3467.72 (  3.20%)     3480.64 (  3.58%)
Scale    5461M     3161.18 (  0.00%)     3154.56 ( -0.21%)     3149.71 ( -0.36%)     3399.26 (  7.53%)     3391.35 (  7.28%)     3396.95 (  7.46%)
Triad    5461M     3532.35 (  0.00%)     3510.19 ( -0.63%)     3512.62 ( -0.56%)     3862.91 (  9.36%)     3849.95 (  8.99%)     3858.71 (  9.24%)
Add      7281M     3800.80 (  0.00%)     3789.71 ( -0.29%)     3779.60 ( -0.56%)     4023.89 (  5.87%)     4000.63 (  5.26%)     3974.68 (  4.57%)
Copy     7281M     3359.99 (  0.00%)     3349.71 ( -0.31%)     3346.82 ( -0.39%)     3482.20 (  3.64%)     3481.97 (  3.63%)     3471.59 (  3.32%)
Scale    7281M     3168.68 (  0.00%)     3167.95 ( -0.02%)     3154.70 ( -0.44%)     3399.98 (  7.30%)     3400.46 (  7.31%)     3392.10 (  7.05%)
Triad    7281M     3533.59 (  0.00%)     3524.63 ( -0.25%)     3514.25 ( -0.55%)     3861.39 (  9.28%)     3861.70 (  9.29%)     3853.31 (  9.05%)
Add      9102M     3790.67 (  0.00%)     3791.28 (  0.02%)     3790.38 ( -0.01%)     4015.48 (  5.93%)     4013.46 (  5.88%)     4014.66 (  5.91%)
Copy     9102M     3345.80 (  0.00%)     3365.09 (  0.58%)     3353.79 (  0.24%)     3480.51 (  4.03%)     3479.74 (  4.00%)     3481.55 (  4.06%)
Scale    9102M     3174.65 (  0.00%)     3149.82 ( -0.78%)     3166.84 ( -0.25%)     3398.75 (  7.06%)     3398.27 (  7.04%)     3399.20 (  7.07%)
Triad    9102M     3529.51 (  0.00%)     3523.03 ( -0.18%)     3524.38 ( -0.15%)     3861.12 (  9.40%)     3858.35 (  9.32%)     3860.55 (  9.38%)
Add      10922M     3807.96 (  0.00%)     3784.18 ( -0.62%)     3779.45 ( -0.75%)     4021.53 (  5.61%)     3984.89 (  4.65%)     4005.11 (  5.18%)
Copy     10922M     3350.99 (  0.00%)     3351.97 (  0.03%)     3353.08 (  0.06%)     3490.40 (  4.16%)     3472.32 (  3.62%)     3473.98 (  3.67%)
Scale    10922M     3164.74 (  0.00%)     3167.46 (  0.09%)     3154.60 ( -0.32%)     3402.35 (  7.51%)     3392.56 (  7.20%)     3392.16 (  7.19%)
Triad    10922M     3536.69 (  0.00%)     3524.27 ( -0.35%)     3516.30 ( -0.58%)     3865.21 (  9.29%)     3850.74 (  8.88%)     3849.32 (  8.84%)
Add      14563M     3786.28 (  0.00%)     3793.09 (  0.18%)     3787.76 (  0.04%)     3976.82 (  5.03%)     3987.54 (  5.32%)     3988.31 (  5.34%)
Copy     14563M     3352.51 (  0.00%)     3355.74 (  0.10%)     3357.05 (  0.14%)     3472.63 (  3.58%)     3475.97 (  3.68%)     3470.44 (  3.52%)
Scale    14563M     3171.95 (  0.00%)     3168.28 ( -0.12%)     3158.17 ( -0.43%)     3393.54 (  6.99%)     3399.68 (  7.18%)     3390.82 (  6.90%)
Triad    14563M     3522.50 (  0.00%)     3526.12 (  0.10%)     3519.97 ( -0.07%)     3853.92 (  9.41%)     3856.89 (  9.49%)     3847.38 (  9.22%)
Add      18204M     3809.56 (  0.00%)     3772.64 ( -0.97%)     3795.07 ( -0.38%)     4014.65 (  5.38%)     3976.18 (  4.37%)     3963.55 (  4.04%)
Copy     18204M     3365.06 (  0.00%)     3350.49 ( -0.43%)     3359.32 ( -0.17%)     3483.40 (  3.52%)     3473.21 (  3.21%)     3467.66 (  3.05%)
Scale    18204M     3171.25 (  0.00%)     3151.05 ( -0.64%)     3163.69 ( -0.24%)     3400.05 (  7.21%)     3393.76 (  7.02%)     3388.64 (  6.85%)
Triad    18204M     3539.90 (  0.00%)     3508.60 ( -0.88%)     3532.25 ( -0.22%)     3860.99 (  9.07%)     3853.56 (  8.86%)     3847.01 (  8.68%)
Add      21845M     3798.46 (  0.00%)     3800.35 (  0.05%)     3791.21 ( -0.19%)     3995.49 (  5.19%)     3990.65 (  5.06%)     3969.12 (  4.49%)
Copy     21845M     3362.14 (  0.00%)     3363.46 (  0.04%)     3355.34 ( -0.20%)     3477.61 (  3.43%)     3478.33 (  3.46%)     3472.19 (  3.27%)
Scale    21845M     3170.99 (  0.00%)     3164.60 ( -0.20%)     3162.31 ( -0.27%)     3398.14 (  7.16%)     3396.25 (  7.10%)     3393.58 (  7.02%)
Triad    21845M     3534.49 (  0.00%)     3527.34 ( -0.20%)     3522.95 ( -0.33%)     3858.35 (  9.16%)     3856.52 (  9.11%)     3854.98 (  9.07%)
Add      29127M     3819.69 (  0.00%)     3783.38 ( -0.95%)     3786.06 ( -0.88%)     4007.04 (  4.90%)     4005.91 (  4.88%)     4000.99 (  4.75%)
Copy     29127M     3384.67 (  0.00%)     3345.60 ( -1.15%)     3339.55 ( -1.33%)     3480.54 (  2.83%)     3479.91 (  2.81%)     3475.18 (  2.67%)
Scale    29127M     3158.68 (  0.00%)     3166.06 (  0.23%)     3151.78 ( -0.22%)     3399.73 (  7.63%)     3395.21 (  7.49%)     3393.50 (  7.43%)
Triad    29127M     3538.17 (  0.00%)     3520.17 ( -0.51%)     3523.09 ( -0.43%)     3862.24 (  9.16%)     3858.60 (  9.06%)     3851.85 (  8.87%)
Add      36408M     3806.95 (  0.00%)     3793.61 ( -0.35%)     3777.70 ( -0.77%)     4016.66 (  5.51%)     3994.64 (  4.93%)     3991.57 (  4.85%)
Copy     36408M     3361.11 (  0.00%)     3347.61 ( -0.40%)     3353.38 ( -0.23%)     3483.09 (  3.63%)     3476.44 (  3.43%)     3473.26 (  3.34%)
Scale    36408M     3165.87 (  0.00%)     3173.95 (  0.26%)     3171.11 (  0.17%)     3398.81 (  7.36%)     3394.38 (  7.22%)     3393.16 (  7.18%)
Triad    36408M     3536.86 (  0.00%)     3533.81 ( -0.09%)     3513.64 ( -0.66%)     3860.60 (  9.15%)     3855.77 (  9.02%)     3853.09 (  8.94%)
Add      43690M     3799.39 (  0.00%)     3795.90 ( -0.09%)     3803.79 (  0.12%)     3996.57 (  5.19%)     4006.70 (  5.46%)     3981.15 (  4.78%)
Copy     43690M     3359.26 (  0.00%)     3360.94 (  0.05%)     3371.10 (  0.35%)     3479.62 (  3.58%)     3481.69 (  3.64%)     3478.45 (  3.55%)
Scale    43690M     3175.35 (  0.00%)     3163.95 ( -0.36%)     3147.34 ( -0.88%)     3396.36 (  6.96%)     3399.45 (  7.06%)     3398.88 (  7.04%)
Triad    43690M     3535.26 (  0.00%)     3526.88 ( -0.24%)     3528.38 ( -0.19%)     3857.30 (  9.11%)     3858.89 (  9.15%)     3858.38 (  9.14%)
Add      58254M     3799.66 (  0.00%)     3772.37 ( -0.72%)     3768.33 ( -0.82%)     4016.47 (  5.71%)     4014.25 (  5.65%)     3968.79 (  4.45%)
Copy     58254M     3355.12 (  0.00%)     3337.75 ( -0.52%)     3337.41 ( -0.53%)     3481.56 (  3.77%)     3481.28 (  3.76%)     3465.39 (  3.29%)
Scale    58254M     3170.94 (  0.00%)     3159.81 ( -0.35%)     3164.09 ( -0.22%)     3398.35 (  7.17%)     3396.30 (  7.11%)     3388.58 (  6.86%)
Triad    58254M     3537.26 (  0.00%)     3511.62 ( -0.72%)     3507.54 ( -0.84%)     3860.59 (  9.14%)     3858.62 (  9.09%)     3847.30 (  8.76%)
Add      72817M     3815.26 (  0.00%)     3812.73 ( -0.07%)     3787.86 ( -0.72%)     3968.21 (  4.01%)     4030.38 (  5.64%)     3956.57 (  3.70%)
Copy     72817M     3362.18 (  0.00%)     3371.41 (  0.27%)     3345.64 ( -0.49%)     3474.38 (  3.34%)     3482.00 (  3.56%)     3469.46 (  3.19%)
Scale    72817M     3175.73 (  0.00%)     3170.64 ( -0.16%)     3154.28 ( -0.68%)     3394.65 (  6.89%)     3396.69 (  6.96%)     3390.78 (  6.77%)
Triad    72817M     3546.44 (  0.00%)     3537.21 ( -0.26%)     3520.46 ( -0.73%)     3855.50 (  8.71%)     3855.34 (  8.71%)     3849.10 (  8.53%)
Add      87381M     3519.93 (  0.00%)     3501.24 ( -0.53%)     3500.84 ( -0.54%)     3833.20 (  8.90%)     3833.26 (  8.90%)     3840.72 (  9.11%)
Copy     87381M     3175.29 (  0.00%)     3166.11 ( -0.29%)     3163.97 ( -0.36%)     3263.09 (  2.77%)     3264.10 (  2.80%)     3266.85 (  2.88%)
Scale    87381M     2848.76 (  0.00%)     2835.15 ( -0.48%)     2832.37 ( -0.58%)     3177.70 ( 11.55%)     3172.81 ( 11.38%)     3180.05 ( 11.63%)
Triad    87381M     3465.19 (  0.00%)     3453.66 ( -0.33%)     3456.03 ( -0.26%)     3777.01 (  9.00%)     3774.30 (  8.92%)     3783.31 (  9.18%)

Remote access costs are quite visible in this memory streaming benchmark.

               3.13.0-rc3       3.13.0-rc3       3.13.0-rc3       3.13.0-rc3       3.13.0-rc3       3.13.0-rc3
                  vanilla  instrument-v2r1 lruslabonly-v2r1       local-v2r6        acct-v2r6  remotefile-v2r6
User              1144.35          1154.81          1156.38          1075.31          1083.70          1087.08
System              55.28            56.07            56.35            49.00            49.06            48.84
Elapsed           1207.64          1220.14          1222.13          1132.20          1141.91          1145.08
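
For anyone unfamiliar with the benchmark, the four operations it reports are
the standard STREAM kernels; roughly the following, with array and scalar
names arbitrary (sketch only):

	/* The four STREAM-style loops the benchmark times */
	for (i = 0; i < n; i++) c[i] = a[i];			/* Copy  */
	for (i = 0; i < n; i++) b[i] = s * c[i];		/* Scale */
	for (i = 0; i < n; i++) c[i] = a[i] + b[i];		/* Add   */
	for (i = 0; i < n; i++) a[i] = b[i] + s * c[i];		/* Triad */

The arrays stay on whatever node they were faulted in on, so if that node is
remote under the fair policy then every iteration pays the remote access
cost, which would be consistent with the 3-9% gains seen once allocations
are local again.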

pft
                        3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                           vanilla       instrument-v2r1      lruslabonly-v2r1            local-v2r6             acct-v2r6       remotefile-v2r6
User       1       0.6980 (  0.00%)       0.6900 (  1.15%)       0.7050 ( -1.00%)       0.6500 (  6.88%)       0.6550 (  6.16%)       0.6750 (  3.30%)
User       2       0.7040 (  0.00%)       0.6990 (  0.71%)       0.7000 (  0.57%)       0.6980 (  0.85%)       0.7150 ( -1.56%)       0.7040 (  0.00%)
User       3       0.6910 (  0.00%)       0.6930 ( -0.29%)       0.7230 ( -4.63%)       0.7390 ( -6.95%)       0.7180 ( -3.91%)       0.7120 ( -3.04%)
User       4       0.7250 (  0.00%)       0.7580 ( -4.55%)       0.7310 ( -0.83%)       0.7220 (  0.41%)       0.7520 ( -3.72%)       0.7250 (  0.00%)
User       5       0.7590 (  0.00%)       0.7490 (  1.32%)       0.7910 ( -4.22%)       0.7730 ( -1.84%)       0.7480 (  1.45%)       0.7690 ( -1.32%)
User       6       0.8130 (  0.00%)       0.8010 (  1.48%)       0.7940 (  2.34%)       0.7770 (  4.43%)       0.7790 (  4.18%)       0.7700 (  5.29%)
User       7       0.8210 (  0.00%)       0.8380 ( -2.07%)       0.8260 ( -0.61%)       0.7950 (  3.17%)       0.8230 ( -0.24%)       0.7760 (  5.48%)
User       8       0.8390 (  0.00%)       0.8200 (  2.26%)       0.8160 (  2.74%)       0.7840 (  6.56%)       0.7830 (  6.67%)       0.8400 ( -0.12%)
System     1       9.1230 (  0.00%)       9.1120 (  0.12%)       9.0810 (  0.46%)       8.2560 (  9.50%)       8.2760 (  9.28%)       8.2260 (  9.83%)
System     2       9.3990 (  0.00%)       9.3340 (  0.69%)       9.4050 ( -0.06%)       8.4630 (  9.96%)       8.4230 ( 10.38%)       8.4420 ( 10.18%)
System     3       9.1460 (  0.00%)       9.0890 (  0.62%)       9.1380 (  0.09%)       8.5660 (  6.34%)       8.5640 (  6.36%)       8.5290 (  6.75%)
System     4       8.9160 (  0.00%)       8.8840 (  0.36%)       8.9260 ( -0.11%)       8.6760 (  2.69%)       8.6330 (  3.17%)       8.6790 (  2.66%)
System     5       9.5900 (  0.00%)       9.5240 (  0.69%)       9.5230 (  0.70%)       8.9390 (  6.79%)       8.8920 (  7.28%)       8.9410 (  6.77%)
System     6       9.8640 (  0.00%)       9.7120 (  1.54%)       9.8740 ( -0.10%)       9.1460 (  7.28%)       9.1310 (  7.43%)       9.1400 (  7.34%)
System     7       9.9860 (  0.00%)       9.9290 (  0.57%)      10.0030 ( -0.17%)       9.3360 (  6.51%)       9.2430 (  7.44%)       9.2860 (  7.01%)
System     8       9.8570 (  0.00%)       9.8510 (  0.06%)       9.9980 ( -1.43%)       9.3050 (  5.60%)       9.2410 (  6.25%)       9.4170 (  4.46%)
Elapsed    1       9.8240 (  0.00%)       9.8050 (  0.19%)       9.7910 (  0.34%)       8.9080 (  9.32%)       8.9320 (  9.08%)       8.9080 (  9.32%)
Elapsed    2       5.0870 (  0.00%)       5.0500 (  0.73%)       5.0710 (  0.31%)       4.6020 (  9.53%)       4.5860 (  9.85%)       4.5940 (  9.69%)
Elapsed    3       3.3220 (  0.00%)       3.2990 (  0.69%)       3.3210 (  0.03%)       3.1170 (  6.17%)       3.1150 (  6.23%)       3.0950 (  6.83%)
Elapsed    4       2.4440 (  0.00%)       2.4440 (  0.00%)       2.4410 (  0.12%)       2.3930 (  2.09%)       2.3780 (  2.70%)       2.3710 (  2.99%)
Elapsed    5       2.1500 (  0.00%)       2.1410 (  0.42%)       2.1400 (  0.47%)       2.0020 (  6.88%)       1.9830 (  7.77%)       2.0030 (  6.84%)
Elapsed    6       1.8290 (  0.00%)       1.7970 (  1.75%)       1.8260 (  0.16%)       1.6960 (  7.27%)       1.6980 (  7.16%)       1.6930 (  7.44%)
Elapsed    7       1.5760 (  0.00%)       1.5610 (  0.95%)       1.5860 ( -0.63%)       1.4830 (  5.90%)       1.4740 (  6.47%)       1.4730 (  6.54%)
Elapsed    8       1.3660 (  0.00%)       1.3490 (  1.24%)       1.3660 ( -0.00%)       1.2820 (  6.15%)       1.2660 (  7.32%)       1.3030 (  4.61%)
Faults/cpu 1  336505.5875 (  0.00%)  337163.8429 (  0.20%)  337713.8261 (  0.36%)  371079.7726 ( 10.27%)  370090.7928 (  9.98%)  371199.3702 ( 10.31%)
Faults/cpu 2  327139.2186 (  0.00%)  329451.3249 (  0.71%)  326974.9735 ( -0.05%)  360766.3203 ( 10.28%)  361595.0312 ( 10.53%)  361389.4583 ( 10.47%)
Faults/cpu 3  336004.1324 (  0.00%)  337826.9136 (  0.54%)  335004.8869 ( -0.30%)  355249.2266 (  5.73%)  356016.6570 (  5.96%)  357584.5258 (  6.42%)
Faults/cpu 4  342824.1564 (  0.00%)  342825.3087 (  0.00%)  342285.3156 ( -0.16%)  351758.5702 (  2.61%)  352312.8339 (  2.77%)  351503.0837 (  2.53%)
Faults/cpu 5  319553.7707 (  0.00%)  321799.3129 (  0.70%)  320521.1950 (  0.30%)  340315.3807 (  6.50%)  342890.6018 (  7.30%)  340381.5220 (  6.52%)
Faults/cpu 6  309614.5554 (  0.00%)  314330.1834 (  1.52%)  309882.5231 (  0.09%)  333075.2546 (  7.58%)  333637.6404 (  7.76%)  333706.0587 (  7.78%)
Faults/cpu 7  306159.2969 (  0.00%)  307277.9428 (  0.37%)  305306.4748 ( -0.28%)  326309.2165 (  6.58%)  328327.9627 (  7.24%)  328590.8507 (  7.33%)
Faults/cpu 8  309077.4966 (  0.00%)  309849.8370 (  0.25%)  305865.6953 ( -1.04%)  327958.3107 (  6.11%)  329731.7933 (  6.68%)  322280.8870 (  4.27%)
Faults/sec 1  336364.5575 (  0.00%)  336993.1010 (  0.19%)  337563.4257 (  0.36%)  370916.0228 ( 10.27%)  369955.7605 (  9.99%)  370971.4836 ( 10.29%)
Faults/sec 2  649713.2290 (  0.00%)  654448.6622 (  0.73%)  651706.3799 (  0.31%)  717987.1734 ( 10.51%)  720641.9249 ( 10.92%)  719435.7495 ( 10.73%)
Faults/sec 3  994812.3119 (  0.00%) 1001443.9434 (  0.67%)  995205.6607 (  0.04%) 1060228.7843 (  6.58%) 1060484.8602 (  6.60%) 1067127.5522 (  7.27%)
Faults/sec 4 1352137.4832 (  0.00%) 1352463.8578 (  0.02%) 1354323.6163 (  0.16%) 1382325.4091 (  2.23%) 1390344.3320 (  2.83%) 1393760.7116 (  3.08%)
Faults/sec 5 1538115.0421 (  0.00%) 1544331.3978 (  0.40%) 1544368.0159 (  0.41%) 1651247.2902 (  7.36%) 1666751.7259 (  8.36%) 1651371.8632 (  7.36%)
Faults/sec 6 1807211.7324 (  0.00%) 1840430.0157 (  1.84%) 1809763.9743 (  0.14%) 1947049.8237 (  7.74%) 1946986.6396 (  7.73%) 1953384.4599 (  8.09%)
Faults/sec 7 2101840.1872 (  0.00%) 2120169.4773 (  0.87%) 2082926.2675 ( -0.90%) 2233207.9026 (  6.25%) 2241803.5953 (  6.66%) 2242647.3545 (  6.70%)
Faults/sec 8 2421813.7208 (  0.00%) 2453320.5034 (  1.30%) 2419371.3924 ( -0.10%) 2582755.9228 (  6.65%) 2612638.2836 (  7.88%) 2537575.0399 (  4.78%)

               3.13.0-rc3       3.13.0-rc3       3.13.0-rc3       3.13.0-rc3       3.13.0-rc3       3.13.0-rc3
                  vanilla  instrument-v2r1 lruslabonly-v2r1       local-v2r6        acct-v2r6  remotefile-v2r6
User                60.57            61.53            61.96            59.47            60.74            60.78
System             868.16           862.80           868.89           805.82           802.76           806.01
Elapsed            336.19           336.02           339.19           311.33           313.18           313.58

And the page fault microbenchmark also sees a benefit, probably because the
zeroing of pages no longer incurs a remote access penalty; it shows up as
lower system CPU usage and elapsed time.
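
For context, the core of what a page fault microbenchmark like pft exercises
is along these lines (a minimal sketch, not the actual harness):

	#include <sys/mman.h>
	#include <unistd.h>

	/* Map anonymous memory and touch each page so it gets faulted in
	 * and zeroed; pft times this across varying numbers of clients. */
	static void fault_in(size_t size)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return;
		for (size_t off = 0; off < size; off += page_size)
			p[off] = 0;
		munmap(p, size);
	}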

ebizzy
                       3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                          vanilla       instrument-v2r1      lruslabonly-v2r1            local-v2r6             acct-v2r6       remotefile-v2r6
Mean     1      3213.33 (  0.00%)     3161.67 ( -1.61%)     3177.00 ( -1.13%)     3234.33 (  0.65%)     3224.00 (  0.33%)     3198.33 ( -0.47%)
Mean     2      2291.33 (  0.00%)     2316.67 (  1.11%)     2309.67 (  0.80%)     2348.67 (  2.50%)     2330.00 (  1.69%)     2332.00 (  1.77%)
Mean     3      2234.67 (  0.00%)     2298.67 (  2.86%)     2252.00 (  0.78%)     2280.67 (  2.06%)     2292.67 (  2.60%)     2270.00 (  1.58%)
Mean     4      2224.33 (  0.00%)     2279.00 (  2.46%)     2250.67 (  1.18%)     2282.33 (  2.61%)     2256.00 (  1.42%)     2256.33 (  1.44%)
Mean     5      2256.33 (  0.00%)     2280.33 (  1.06%)     2265.00 (  0.38%)     2280.67 (  1.08%)     2268.33 (  0.53%)     2276.67 (  0.90%)
Mean     6      2233.00 (  0.00%)     2257.33 (  1.09%)     2200.00 ( -1.48%)     2292.33 (  2.66%)     2274.33 (  1.85%)     2250.33 (  0.78%)
Mean     7      2212.33 (  0.00%)     2229.00 (  0.75%)     2201.67 ( -0.48%)     2279.00 (  3.01%)     2265.00 (  2.38%)     2251.67 (  1.78%)
Mean     8      2224.67 (  0.00%)     2226.33 (  0.07%)     2225.67 (  0.04%)     2255.00 (  1.36%)     2280.67 (  2.52%)     2238.67 (  0.63%)
Mean     12     2213.33 (  0.00%)     2240.00 (  1.20%)     2264.67 (  2.32%)     2249.33 (  1.63%)     2257.67 (  2.00%)     2238.00 (  1.11%)
Mean     16     2221.00 (  0.00%)     2226.33 (  0.24%)     2268.00 (  2.12%)     2266.33 (  2.04%)     2241.00 (  0.90%)     2258.33 (  1.68%)
Mean     20     2215.00 (  0.00%)     2256.00 (  1.85%)     2278.33 (  2.86%)     2238.67 (  1.07%)     2271.67 (  2.56%)     2291.00 (  3.43%)
Mean     24     2175.00 (  0.00%)     2181.00 (  0.28%)     2166.67 ( -0.38%)     2211.00 (  1.66%)     2231.00 (  2.57%)     2247.67 (  3.34%)
Mean     28     2110.00 (  0.00%)     2136.00 (  1.23%)     2123.33 (  0.63%)     2157.00 (  2.23%)     2163.00 (  2.51%)     2164.67 (  2.59%)
Mean     32     2077.67 (  0.00%)     2095.33 (  0.85%)     2091.33 (  0.66%)     2110.67 (  1.59%)     2113.33 (  1.72%)     2110.33 (  1.57%)
Mean     36     2016.33 (  0.00%)     2024.67 (  0.41%)     2039.33 (  1.14%)     2066.33 (  2.48%)     2068.00 (  2.56%)     2069.00 (  2.61%)
Mean     40     1984.00 (  0.00%)     1987.00 (  0.15%)     1993.33 (  0.47%)     2037.00 (  2.67%)     2035.00 (  2.57%)     2042.00 (  2.92%)
Mean     44     1943.33 (  0.00%)     1954.33 (  0.57%)     1961.00 (  0.91%)     2004.33 (  3.14%)     2009.67 (  3.41%)     2018.00 (  3.84%)
Mean     48     1925.00 (  0.00%)     1939.33 (  0.74%)     1929.00 (  0.21%)     1990.67 (  3.41%)     1996.33 (  3.71%)     2007.67 (  4.29%)
Stddev   1        25.42 (  0.00%)       46.78 (-84.02%)       32.75 (-28.84%)       18.62 ( 26.73%)       21.95 ( 13.64%)       30.18 (-18.72%)
Stddev   2        29.68 (  0.00%)        1.70 ( 94.27%)       13.77 ( 53.61%)       12.50 ( 57.89%)       15.51 ( 47.73%)       13.88 ( 53.23%)
Stddev   3        18.15 (  0.00%)       27.48 (-51.35%)        4.32 ( 76.20%)       13.57 ( 25.23%)       15.52 ( 14.50%)       11.78 ( 35.13%)
Stddev   4        41.28 (  0.00%)       13.64 ( 66.96%)        6.94 ( 83.18%)       24.51 ( 40.62%)       24.04 ( 41.76%)        7.41 ( 82.05%)
Stddev   5        27.18 (  0.00%)        9.03 ( 66.78%)        4.97 ( 81.73%)        8.50 ( 68.74%)       17.25 ( 36.54%)       15.80 ( 41.88%)
Stddev   6        10.80 (  0.00%)       17.97 (-66.36%)        9.27 ( 14.14%)        6.60 ( 38.90%)       16.01 (-48.20%)       19.36 (-79.26%)
Stddev   7        23.10 (  0.00%)       17.91 ( 22.48%)       29.58 (-28.05%)       15.94 ( 31.00%)        5.72 ( 75.26%)       12.76 ( 44.75%)
Stddev   8         3.68 (  0.00%)       41.52 (-1027.82%)       26.74 (-626.21%)        4.32 (-17.35%)       33.81 (-818.21%)       12.50 (-239.48%)
Stddev   12       23.84 (  0.00%)        6.48 ( 72.81%)       14.66 ( 38.50%)       13.47 ( 43.47%)       11.79 ( 50.56%)       18.71 ( 21.52%)
Stddev   16       20.22 (  0.00%)       17.13 ( 15.25%)       28.99 (-43.43%)        2.36 ( 88.34%)        2.16 ( 89.31%)       16.13 ( 20.20%)
Stddev   20        3.74 (  0.00%)        6.53 (-74.57%)       45.02 (-1103.24%)       22.54 (-502.51%)        8.18 (-118.58%)       26.28 (-602.38%)
Stddev   24       18.18 (  0.00%)       19.30 ( -6.16%)       23.81 (-30.93%)        9.42 ( 48.22%)       16.99 (  6.57%)        8.18 ( 55.02%)
Stddev   28       11.78 (  0.00%)        7.79 ( 33.86%)       15.92 (-35.22%)       12.96 (-10.07%)       12.83 ( -8.97%)       17.78 (-51.01%)
Stddev   32        9.74 (  0.00%)        2.05 ( 78.91%)        8.81 (  9.59%)        6.55 ( 32.77%)        3.09 ( 68.27%)        1.70 ( 82.55%)
Stddev   36        3.86 (  0.00%)        5.44 (-40.89%)        2.36 ( 38.92%)       13.22 (-242.73%)       11.78 (-205.18%)       16.87 (-337.26%)
Stddev   40       14.17 (  0.00%)        7.48 ( 47.17%)        5.56 ( 60.77%)        5.89 ( 58.44%)        2.16 ( 84.75%)        2.45 ( 82.71%)
Stddev   44        7.54 (  0.00%)        3.40 ( 54.93%)        2.94 ( 60.97%)        7.54 (  0.00%)        3.68 ( 51.19%)        1.63 ( 78.35%)
Stddev   48        2.94 (  0.00%)        5.56 (-88.79%)        3.56 (-20.89%)        6.24 (-111.83%)        1.70 ( 42.26%)       17.25 (-485.95%)

Ran ebizzy because it doubles up as a page allocation microbenchmark that
hits page faults differently to PFT. It looks like an ok gain but the stddev
is high and would need to be stabilised before drawing a solid conclusion.
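
My loose understanding of the pattern ebizzy exercises, as a sketch of the
idea only and not ebizzy's actual source:

	#include <stdlib.h>
	#include <string.h>

	/* Each worker repeatedly allocates a chunk, faults it in, searches
	 * it and frees it, so the allocator is hit through malloc rather
	 * than through a pure fault stream as with pft. */
	static void worker(size_t chunk_size, unsigned int key, int iterations)
	{
		for (int iter = 0; iter < iterations; iter++) {
			unsigned int *chunk = malloc(chunk_size);

			if (!chunk)
				break;
			memset(chunk, 0, chunk_size);	/* fault the pages in */
			for (size_t i = 0; i < chunk_size / sizeof(*chunk); i++)
				if (chunk[i] == key)	/* stand-in for record lookups */
					break;
			free(chunk);			/* pages are returned quickly */
		}
	}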

None of these benchmarks do *anything* related to what commit 81c0a2bb was
supposed to fix. I just wanted to get the point across that our current
default behaviour sucks and we should revisit that decision.

My position is that by default we should only round-robin zones local to
the allocating process and that node round-robin is something that should
only be explicitly enabled.

I'm less sure about the round-robin treatment of slab but am erring on
the side of historical behaviour until it is proven otherwise.

 Documentation/sysctl/vm.txt |  32 +++++++++
 include/linux/gfp.h         |   4 +-
 include/linux/mmzone.h      |   2 +
 include/linux/pagemap.h     |   2 +-
 include/linux/swap.h        |   2 +
 kernel/sysctl.c             |   8 +++
 mm/filemap.c                |   2 +
 mm/page_alloc.c             | 153 +++++++++++++++++++++++++++++++++++++-------
 8 files changed, 180 insertions(+), 25 deletions(-)

-- 
1.8.4

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 1/7] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
  2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10   ` Mel Gorman
  0 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman

From: Johannes Weiner <hannes@cmpxchg.org>

Dave Hansen noted a regression in a microbenchmark that loops around
open() and close() on an 8-node NUMA machine and bisected it down to
81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
change forces the slab allocations of the file descriptor to spread
out to all 8 nodes, causing remote references in the page allocator
and slab.

The round-robin policy is only there to provide fairness among memory
allocations that are reclaimed involuntarily based on pressure in each
zone.  It does not make sense to apply it to unreclaimable kernel
allocations that are freed manually, in this case instantly after the
allocation, and incur the remote reference costs twice for no reason.

Only round-robin allocations that are usually freed through page
reclaim or slab shrinking.
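
As an illustration of the predicate the hunk below adds: GFP_MOVABLE_MASK
is (__GFP_RECLAIMABLE | __GFP_MOVABLE), so the fairness policy keeps
covering allocations likely to end up on the LRU or in reclaimable slab
(my reading of the mask, worth double-checking):

	/* Illustration only: which allocations still participate in the
	 * fairness round-robin after this change */
	static bool fairness_applies(gfp_t gfp_mask)
	{
		return !!(gfp_mask & GFP_MOVABLE_MASK);
	}
	/* fairness_applies(GFP_HIGHUSER_MOVABLE) -> true: anon and page cache
	 * fairness_applies(GFP_KERNEL)           -> false: fd tables and other
	 *                                           unreclaimable allocations */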

Cc: <stable@kernel.org>
Bisected-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f0..f861d02 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1920,7 +1920,8 @@ zonelist_scan:
 		 * back to remote zones that do not partake in the
 		 * fairness round-robin cycle of this zonelist.
 		 */
-		if (alloc_flags & ALLOC_WMARK_LOW) {
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    (gfp_mask & GFP_MOVABLE_MASK)) {
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
 			if (zone_reclaim_mode &&
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 2/7] mm: page_alloc: Break out zone page aging distribution into its own helper
  2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10   ` Mel Gorman
  0 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman

This patch moves the decision on whether to round-robin allocations between
zones and nodes into its own helper function. It'll make some later patches
easier to understand and the helper will be inlined automatically.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 63 ++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f861d02..64020eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1872,6 +1872,42 @@ static inline void init_zone_allows_reclaim(int nid)
 #endif	/* CONFIG_NUMA */
 
 /*
+ * Distribute pages in proportion to the individual zone size to ensure fair
+ * page aging.  The zone a page was allocated in should have no effect on the
+ * time the page has in memory before being reclaimed.
+ * 
+ * Returns true if this zone should be skipped to spread the page ages to
+ * other zones.
+ */
+static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
+				struct zone *zone, int alloc_flags)
+{
+	/* Only round robin in the allocator fast path */
+	if (!(alloc_flags & ALLOC_WMARK_LOW))
+		return false;
+
+	/* Only round robin pages likely to be LRU or reclaimable slab */
+	if (!(gfp_mask & GFP_MOVABLE_MASK))
+		return false;
+
+	/* Distribute to the next zone if this zone has exhausted its batch */
+	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
+		return true;
+
+	/*
+	 * When zone_reclaim_mode is enabled, try to stay in local zones in the
+	 * fastpath.  If that fails, the slowpath is entered, which will do
+	 * another pass starting with the local zones, but ultimately fall
+	 * back to remote zones that do not partake in the fairness round-robin
+	 * cycle of this zonelist.
+	 */
+	if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+		return true;
+
+	return false;
+}
+
+/*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
  */
@@ -1907,27 +1943,12 @@ zonelist_scan:
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
 			goto try_this_zone;
-		/*
-		 * Distribute pages in proportion to the individual
-		 * zone size to ensure fair page aging.  The zone a
-		 * page was allocated in should have no effect on the
-		 * time the page has in memory before being reclaimed.
-		 *
-		 * When zone_reclaim_mode is enabled, try to stay in
-		 * local zones in the fastpath.  If that fails, the
-		 * slowpath is entered, which will do another pass
-		 * starting with the local zones, but ultimately fall
-		 * back to remote zones that do not partake in the
-		 * fairness round-robin cycle of this zonelist.
-		 */
-		if ((alloc_flags & ALLOC_WMARK_LOW) &&
-		    (gfp_mask & GFP_MOVABLE_MASK)) {
-			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
-				continue;
-			if (zone_reclaim_mode &&
-			    !zone_local(preferred_zone, zone))
-				continue;
-		}
+
+		/* Distribute pages to ensure fair page aging */
+		if (zone_distribute_age(gfp_mask, preferred_zone, zone,
+					alloc_flags))
+			continue;
+
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10   ` Mel Gorman
  0 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman

zone_local() uses node_distance(), which is a more expensive call than
necessary. On x86, it's another function call in the allocator fast path
and increases cache footprint. This patch makes the assumption that zones
on the local node share the same node ID. The necessary information should
already be cache hot.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 64020eb..fd9677e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
 
 static bool zone_local(struct zone *local_zone, struct zone *zone)
 {
-	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+	return zone_to_nid(zone) == numa_node_id();
 }
 
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 4/7] mm: Annotate page cache allocations
  2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10   ` Mel Gorman
  0 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman

Annotations will be used for the fair zone allocation policy. The patch is
mostly taken from a link posted by Johannes on IRC. It's not perfect because
not all callers of these paths are guaranteed to be allocating pages for the
page cache. However, it's probably close enough to cover all cases that
matter with minimal distortion.

Not-signed-off
---
 include/linux/gfp.h     | 4 +++-
 include/linux/pagemap.h | 2 +-
 mm/filemap.c            | 2 ++
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd49..f69e4cb 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
+#define ___GFP_PAGECACHE	0x2000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -92,6 +93,7 @@ struct vm_area_struct;
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 #define __GFP_KMEMCG	((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
+#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE)   /* Page cache allocation */
 
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
@@ -99,7 +101,7 @@ struct vm_area_struct;
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..bda4845 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
 #else
 static inline struct page *__page_cache_alloc(gfp_t gfp)
 {
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp | __GFP_PAGECACHE, 0);
 }
 #endif
 
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..5bb9225 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	int n;
 	struct page *page;
 
+	gfp |= __GFP_PAGECACHE;
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10   ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman

Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
bug whereby new pages could be reclaimed before old pages because of
how the page allocator and kswapd interacted on the per-zone LRU lists.
Unfortunately, it was missed during review that a consequence is that
we also round-robin between NUMA nodes. This is bad for two reasons:

1. It alters the semantics of MPOL_LOCAL without telling anyone
2. It incurs an immediate remote memory performance hit in exchange
   for a potential performance gain when memory needs to be reclaimed
   later

No cookies for the reviewers on this one.

This patch makes the behaviour of the fair zone allocator policy
configurable.  By default it will only distribute pages that are going
to exist on the LRU between zones local to the allocating process. This
preserves the historical semantics of MPOL_LOCAL.

By default, slab pages are not distributed between nodes after this patch
is applied. It can be argued that they should get similar treatment but
they have different lifecycles from LRU pages, the shrinkers are not
zone-aware and the interaction between the page allocator and kswapd is
different for slabs. If it turns out to be an almost universal win, we can
change the default.
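
As an illustration of how an administrator combines the mode bits (a
minimal userspace sketch, not part of this patch; the numeric values are
the ones documented in Documentation/sysctl/vm.txt below):

	#include <stdio.h>

	/* Values mirror the zone_distribute_mode documentation */
	#define DISTRIBUTE_LOCAL_ANON	(1U << 0)
	#define DISTRIBUTE_LOCAL_FILE	(1U << 1)
	#define DISTRIBUTE_LOCAL_SLAB	(1U << 2)

	int main(void)
	{
		/* 1|2|4 == 7, the default set by this patch */
		unsigned int mode = DISTRIBUTE_LOCAL_ANON |
				    DISTRIBUTE_LOCAL_FILE |
				    DISTRIBUTE_LOCAL_SLAB;
		FILE *f = fopen("/proc/sys/vm/zone_distribute_mode", "w");

		if (!f)
			return 1;
		fprintf(f, "%u\n", mode);
		return fclose(f) ? 1 : 0;
	}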

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/vm.txt |  32 ++++++++++++++
 include/linux/mmzone.h      |   2 +
 include/linux/swap.h        |   2 +
 kernel/sysctl.c             |   8 ++++
 mm/page_alloc.c             | 102 ++++++++++++++++++++++++++++++++++++++------
 5 files changed, 134 insertions(+), 12 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 1fbd4eb..8eaa562 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- zone_distribute_mode
 - zone_reclaim_mode
 
 ==============================================================
@@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
 
 ==============================================================
 
+zone_distribute_mode
+
+Page allocation and reclaim are managed on a per-zone basis. When the
+system needs to reclaim memory, candidate pages are selected from these
+per-zone lists.  Historically, a potential consequence was that recently
+allocated pages were considered reclaim candidates. From a zone-local
+perspective, page aging was preserved but from a system-wide perspective
+there was an age inversion problem.
+
+A similar problem occurs at the node level where young pages may be reclaimed
+from the local node instead of allocating remote memory. Unfortunately, the
+cost of accessing remote nodes is higher, so by default the system must
+choose between favouring page aging and node locality. zone_distribute_mode
+controls how the system distributes page ages between zones.
+
+0	= Never round-robin based on age
+
+Otherwise, the following values are ORed together:
+
+1	= Distribute anon pages between zones local to the allocating node
+2	= Distribute file pages between zones local to the allocating node
+4	= Distribute slab pages between zones local to the allocating node
+
+The following three flags effectively alter MPOL_DEFAULT, so be careful.
+
+8	= Distribute anon pages between zones remote to the allocating node
+16	= Distribute file pages between zones remote to the allocating node
+32	= Distribute slab pages between zones remote to the allocating node
+
+==============================================================
+
 zone_reclaim_mode:
 
 Zone_reclaim_mode allows someone to set more or less aggressive approaches to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b835d3f..20a75e3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -897,6 +897,8 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
+int sysctl_zone_distribute_mode_handler(struct ctl_table *, int,
+			void __user *, size_t *, loff_t *);
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6..44329b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,6 +318,8 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
+extern unsigned __bitwise__ zone_distribute_mode;
+
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 34a6047..b75c08f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1349,6 +1349,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
+	{
+		.procname	= "zone_distribute_mode",
+		.data		= &zone_distribute_mode,
+		.maxlen		= sizeof(zone_distribute_mode),
+		.mode		= 0644,
+		.proc_handler	= sysctl_zone_distribute_mode_handler,
+		.extra1		= &zero,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "zone_reclaim_mode",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd9677e..c2a2229 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1871,6 +1871,49 @@ static inline void init_zone_allows_reclaim(int nid)
 }
 #endif	/* CONFIG_NUMA */
 
+/* Controls how page ages are distributed across zones automatically */
+unsigned __bitwise__ zone_distribute_mode __read_mostly;
+
+/* See zone_distribute_mode documentation in Documentation/sysctl/vm.txt */
+#define DISTRIBUTE_DISABLE	(0)
+#define DISTRIBUTE_LOCAL_ANON	(1UL << 0)
+#define DISTRIBUTE_LOCAL_FILE	(1UL << 1)
+#define DISTRIBUTE_LOCAL_SLAB	(1UL << 2)
+#define DISTRIBUTE_REMOTE_ANON	(1UL << 3)
+#define DISTRIBUTE_REMOTE_FILE	(1UL << 4)
+#define DISTRIBUTE_REMOTE_SLAB	(1UL << 5)
+
+#define DISTRIBUTE_STUPID_ANON	(DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_REMOTE_ANON)
+#define DISTRIBUTE_STUPID_FILE	(DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_REMOTE_FILE)
+#define DISTRIBUTE_STUPID_SLAB	(DISTRIBUTE_LOCAL_SLAB|DISTRIBUTE_REMOTE_SLAB)
+#define DISTRIBUTE_DEFAULT	(DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB)
+
+/* Only these GFP flags are affected by the fair zone allocation policy */
+#define DISTRIBUTE_GFP_MASK	((GFP_MOVABLE_MASK|__GFP_PAGECACHE))
+
+int sysctl_zone_distribute_mode_handler(ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	/* If you are an admin reading this comment, what were you thinking? */
+	if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_ANON) ==
+							DISTRIBUTE_STUPID_ANON))
+		zone_distribute_mode &= ~DISTRIBUTE_REMOTE_ANON;
+	if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_FILE) ==
+							DISTRIBUTE_STUPID_FILE))
+		zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE;
+	if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_SLAB) ==
+							DISTRIBUTE_STUPID_SLAB))
+		zone_distribute_mode &= ~DISTRIBUTE_REMOTE_SLAB;
+
+	return 0;
+}
+
 /*
  * Distribute pages in proportion to the individual zone size to ensure fair
  * page aging.  The zone a page was allocated in should have no effect on the
@@ -1882,26 +1925,60 @@ static inline void init_zone_allows_reclaim(int nid)
 static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
 				struct zone *zone, int alloc_flags)
 {
+	bool zone_is_local;
+	bool is_file, is_slab, is_anon;
+
 	/* Only round robin in the allocator fast path */
 	if (!(alloc_flags & ALLOC_WMARK_LOW))
 		return false;
 
-	/* Only round robin pages likely to be LRU or reclaimable slab */
-	if (!(gfp_mask & GFP_MOVABLE_MASK))
+	/* Only a subset of GFP flags are considered for fair zone policy */
+	if (!(gfp_mask & DISTRIBUTE_GFP_MASK))
 		return false;
 
-	/* Distribute to the next zone if this zone has exhausted its batch */
-	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
-		return true;
-
 	/*
-	 * When zone_reclaim_mode is enabled, try to stay in local zones in the
-	 * fastpath.  If that fails, the slowpath is entered, which will do
-	 * another pass starting with the local zones, but ultimately fall back
-	 * back to remote zones that do not partake in the fairness round-robin
-	 * cycle of this zonelist.
+	 * Classify the type of allocation. From this point on, the fair zone
+	 * allocation policy is being applied. If the allocation does not meet
+	 * the criteria the zone must be skipped.
 	 */
-	if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+	is_file = gfp_mask & __GFP_PAGECACHE;
+	is_slab = gfp_mask & __GFP_RECLAIMABLE;
+	is_anon = (!is_file && !is_slab);
+	WARN_ON_ONCE(is_slab && is_file);
+
+	zone_is_local = zone_local(preferred_zone, zone);
+	if (zone_is_local) {
+		/* Distribute between zones local to the node if requested */
+		if (is_anon && (zone_distribute_mode & DISTRIBUTE_LOCAL_ANON))
+			goto check_batch;
+		if (is_file && (zone_distribute_mode & DISTRIBUTE_LOCAL_FILE))
+			goto check_batch;
+		if (is_slab && (zone_distribute_mode & DISTRIBUTE_LOCAL_SLAB))
+			goto check_batch;
+	} else {
+		/*
+		 * When zone_reclaim_mode is enabled, stick to local zones. If
+		 * that fails, the slowpath is entered, which will do another
+		 * pass starting with the local zones, but ultimately fall
+		 * back to remote zones that do not partake in the fairness
+		 * round-robin cycle of this zonelist.
+		 */
+		if (zone_reclaim_mode)
+			return false;
+
+		if (is_anon && (zone_distribute_mode & DISTRIBUTE_REMOTE_ANON))
+			goto check_batch;
+		if (is_file && (zone_distribute_mode & DISTRIBUTE_REMOTE_FILE))
+			goto check_batch;
+		if (is_slab && (zone_distribute_mode & DISTRIBUTE_REMOTE_SLAB))
+			goto check_batch;
+	}
+
+	return true;
+
+check_batch:
+	/* Distribute to the next zone if this zone has exhausted its batch */
+	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 		return true;
 
 	return false;
@@ -3797,6 +3874,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 		__build_all_zonelists(NULL);
 		mminit_verify_zonelist();
 		cpuset_init_current_mems_allowed();
+		zone_distribute_mode = DISTRIBUTE_DEFAULT;
 	} else {
 #ifdef CONFIG_MEMORY_HOTPLUG
 		if (zone)
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 6/7] mm: page_alloc: Only account batch allocation requests that are eligible
  2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10   ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman

Not signed off. Johannes, was the intent really to decrement the batch
counts regardless of whether the policy was being enforced or not?
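
As a standalone illustration of the intended semantics (a simplified
model, not the kernel code in the diff below): the per-zone allocation
batch should only be consumed when the fair zone policy was actually
applied to the allocation attempt.

	#include <stdbool.h>

	struct zone_model { long nr_alloc_batch; };

	/* Previously the decrement was unconditional in the allocator
	 * fast path; with this patch it is gated on policy eligibility. */
	static void charge_batch(struct zone_model *z, unsigned int order,
				 bool distrib_eligible)
	{
		if (distrib_eligible)
			z->nr_alloc_batch -= 1L << order;
	}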

---
 mm/page_alloc.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2a2229..bf49918 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1547,7 +1547,6 @@ again:
 					  get_pageblock_migratetype(page));
 	}
 
-	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
@@ -1923,7 +1922,8 @@ int sysctl_zone_distribute_mode_handler(ctl_table *table, int write,
  * other zones.
  */
 static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
-				struct zone *zone, int alloc_flags)
+				struct zone *zone, int alloc_flags,
+				bool *distrib_eligible)
 {
 	bool zone_is_local;
 	bool is_file, is_slab, is_anon;
@@ -1977,6 +1977,8 @@ static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
 	return true;
 
 check_batch:
+	*distrib_eligible = true;
+
 	/* Distribute to the next zone if this zone has exhausted its batch */
 	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 		return true;
@@ -2000,6 +2002,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	bool distrib_eligible = false;
 
 	classzone_idx = zone_idx(preferred_zone);
 zonelist_scan:
@@ -2023,7 +2026,7 @@ zonelist_scan:
 
 		/* Distribute pages to ensure fair page aging */
 		if (zone_distribute_age(gfp_mask, preferred_zone, zone,
-					alloc_flags))
+				alloc_flags, &distrib_eligible))
 			continue;
 
 		/*
@@ -2119,8 +2122,11 @@ zonelist_scan:
 try_this_zone:
 		page = buffered_rmqueue(preferred_zone, zone, order,
 						gfp_mask, migratetype);
-		if (page)
+		if (page) {
+			if (distrib_eligible)
+				__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
 			break;
+		}
 this_zone_full:
 		if (IS_ENABLED(CONFIG_NUMA))
 			zlc_mark_zone_full(zonelist, z);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
  2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10   ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman

Indications from Johannes are that he wanted this. It needs some data and/or
justification for why thrash protection needs it, plus docs describing how
MPOL_LOCAL is now different, before it should be considered finished. I do
not necessarily agree this patch is necessary but it's worth punting it out
there for discussion and testing.
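
For reference, the mask arithmetic (an illustration only; the flag values
are those defined by patch 5 and documented in vm.txt):

	/* The value an admin would read back from
	 * /proc/sys/vm/zone_distribute_mode changes from 7 to 23. */
	enum {
		LOCAL_ANON  = 1 << 0,					/*  1 */
		LOCAL_FILE  = 1 << 1,					/*  2 */
		LOCAL_SLAB  = 1 << 2,					/*  4 */
		REMOTE_FILE = 1 << 4,					/* 16 */
		OLD_DEFAULT = LOCAL_ANON | LOCAL_FILE | LOCAL_SLAB,	/*  7 */
		NEW_DEFAULT = OLD_DEFAULT | REMOTE_FILE,		/* 23 */
	};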

Not signed off
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf49918..bce40c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1885,7 +1885,8 @@ unsigned __bitwise__ zone_distribute_mode __read_mostly;
 #define DISTRIBUTE_STUPID_ANON	(DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_REMOTE_ANON)
 #define DISTRIBUTE_STUPID_FILE	(DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_REMOTE_FILE)
 #define DISTRIBUTE_STUPID_SLAB	(DISTRIBUTE_LOCAL_SLAB|DISTRIBUTE_REMOTE_SLAB)
-#define DISTRIBUTE_DEFAULT	(DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB)
+#define DISTRIBUTE_DEFAULT	(DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB| \
+				 DISTRIBUTE_REMOTE_FILE)
 
 /* Only these GFP flags are affected by the fair zone allocation policy */
 #define DISTRIBUTE_GFP_MASK	((GFP_MOVABLE_MASK|__GFP_PAGECACHE))
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH 1/7] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-13 15:45     ` Rik van Riel
  -1 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-13 15:45 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML

On 12/13/2013 09:10 AM, Mel Gorman wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> Dave Hansen noted a regression in a microbenchmark that loops around
> open() and close() on an 8-node NUMA machine and bisected it down to
> 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
> change forces the slab allocations of the file descriptor to spread
> out to all 8 nodes, causing remote references in the page allocator
> and slab.
> 
> The round-robin policy is only there to provide fairness among memory
> allocations that are reclaimed involuntarily based on pressure in each
> zone.  It does not make sense to apply it to unreclaimable kernel
> allocations that are freed manually, in this case instantly after the
> allocation, and incur the remote reference costs twice for no reason.
> 
> Only round-robin allocations that are usually freed through page
> reclaim or slab shrinking.
> 
> Cc: <stable@kernel.org>
> Bisected-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 2/7] mm: page_alloc: Break out zone page aging distribution into its own helper
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-13 15:46     ` Rik van Riel
  -1 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-13 15:46 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML

On 12/13/2013 09:10 AM, Mel Gorman wrote:
> This patch moves the decision on whether to round-robin allocations between
> zones and nodes into its own helper functions. It'll make some later patches
> easier to understand and it will be automatically inlined.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-13 17:04     ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-13 17:04 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Fri, Dec 13, 2013 at 02:10:07PM +0000, Mel Gorman wrote:
> Indications from Johannes that he wanted this. Needs some data and/or justification why
> thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
> it should be considered finished. I do not necessarily agree this patch is necessary
> but it's worth punting it out there for discussion and testing.

I demonstrated enormous gains in the original submission of the fair
allocation patch and your tests haven't really shown downsides to the
cache-over-nodes portion of it.  So I don't see why we should revert
the cache-over-nodes fairness without any supporting data.

Reverting cross-node fairness for anon and slab is a good idea.  It
was always about cache and the original patch was too broad-stroked,
but it doesn't invalidate everything it was about.

I can see, however, that we might want to make this configurable, but
I'm not eager to export user interfaces unless we have to.  As the
node-local fairness was never questioned by anybody, is it necessary
to make it configurable?  Shouldn't we be okay with just a single
vm.pagecache_interleave (name by Rik) sysctl that defaults to 1 but
allows users to go back to pagecache obeying mempolicy?

> Not signed off
> ---
>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bf49918..bce40c0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1885,7 +1885,8 @@ unsigned __bitwise__ zone_distribute_mode __read_mostly;
>  #define DISTRIBUTE_STUPID_ANON	(DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_REMOTE_ANON)
>  #define DISTRIBUTE_STUPID_FILE	(DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_REMOTE_FILE)
>  #define DISTRIBUTE_STUPID_SLAB	(DISTRIBUTE_LOCAL_SLAB|DISTRIBUTE_REMOTE_SLAB)
> -#define DISTRIBUTE_DEFAULT	(DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB)
> +#define DISTRIBUTE_DEFAULT	(DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB| \
> +				 DISTRIBUTE_REMOTE_FILE)
>  
>  /* Only these GFP flags are affected by the fair zone allocation policy */
>  #define DISTRIBUTE_GFP_MASK	((GFP_MOVABLE_MASK|__GFP_PAGECACHE))
> -- 
> 1.8.4
> 

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
  2013-12-13 17:04     ` Johannes Weiner
@ 2013-12-13 19:20       ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 19:20 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Fri, Dec 13, 2013 at 12:04:43PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 02:10:07PM +0000, Mel Gorman wrote:
> > Indications from Johannes that he wanted this. Needs some data and/or justification why
> > thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
> > it should be considered finished. I do not necessarily agree this patch is necessary
> > but it's worth punting it out there for discussion and testing.
> 
> I demonstrated enormous gains in the original submission of the fair
> allocation patch and

And the same test missed that it broke MPOL_DEFAULT and regressed any workload
that does not hit reclaim by incurring remote accesses unnecessarily. With
this patch applied, MPOL_DEFAULT again does not act as documented by
Documentation/vm/numa_memory_policy.txt and that file has been around a
long time. It also does not match the documented behaviour of mbind
where it says

	The  system-wide  default  policy allocates  pages  on	the node of
	the CPU that triggers the allocation.  For MPOL_DEFAULT, the nodemask
	and maxnode arguments must specify the empty set of nodes.

That said, that documentation is also strictly wrong as MPOL_DEFAULT *may*
allocate on remote nodes.

> your tests haven't really shown downsides to the
> cache-over-nodes portion of it.  So I don't see why we should revert
> the cache-over-nodes fairness without any supporting data.
> 

It breaks MPOL_LOCAL for file-backed mappings in a manner that cannot be
overridden by policies and it is not even documented.  The same effect
could have been achieved for workloads repeatedly reading files by running the
processes with the MPOL_INTERLEAVE policy.  There was also no convenient
way for a user to override that behaviour. Hard-binding to a node would
work but tough luck if the process needs more than one node of memory.

What I will admit is that I doubt anyone cares that file-backed pages
are not node-local as documented as the cost of the IO itself probably
dominates but just because something does not make sense does not mean
someone is depending on the behaviour.

That alone is pretty heavy justification even in the absence of supporting
data showing a workload that depends on file pages being node-local that
is not hidden by the cost of the IO itself.

> Reverting cross-node fairness for anon and slab is a good idea.  It
> was always about cache and the original patch was too broad stroked,
> but it doesn't invalidate everything it was about.
> 

No it doesn't, but it should at least have been documented.

> I can see, however, that we might want to make this configurable, but
> I'm not eager on exporting user interfaces unless we have to.  As the
> node-local fairness was never questioned by anybody, is it necessary
> to make it configurable? 

It's only been there since 3.12 and it takes a long time for people to notice
NUMA regressions, especially ones that would just be within a few percent
like this was unless they were specifically looking for it.

> Shouldn't we be okay with just a single
> vm.pagecache_interleave (name by Rik) sysctl that defaults to 1 but
> allows users to go back to pagecache obeying mempolicy?
> 

That can be done. I can put together a patch that defaults it to 0 and
sets the DISTRIBUTE_REMOTE_FILE  flag if someone writes to it. That's a
crude hack but many people will be ok with it.
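
A rough sketch of that crude hack (illustrative and untested; the handler
name and the sysctl_pagecache_interleave variable are hypothetical, while
zone_distribute_mode and DISTRIBUTE_REMOTE_FILE are from patch 5):

	int sysctl_pagecache_interleave;	/* defaults to 0 */

	int sysctl_pagecache_interleave_handler(struct ctl_table *table,
		int write, void __user *buffer, size_t *length, loff_t *ppos)
	{
		int rc;

		rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
		if (rc || !write)
			return rc;

		/* Writing 1 opts file pages back into cross-node fairness */
		if (sysctl_pagecache_interleave)
			zone_distribute_mode |= DISTRIBUTE_REMOTE_FILE;
		else
			zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE;
		return 0;
	}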

Making it the default would require more work though.
Create an MPOL_DISTRIB_PAGECACHE memory policy (named that because it
is not strictly interleave). Abstract MPOL_DEFAULT to be either
MPOL_LOCAL or MPOL_DISTRIB_PAGECACHE depending on the value of
vm.pagecache_interleave. Update manual pages and Documentation/, then set
the default of vm.pagecache_interleave to 1.

That would allow more sane defaults and also allow users to override it
on a per task and per VMA basis as they can for any other type of memory
policy.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
  2013-12-13 19:20       ` Mel Gorman
@ 2013-12-13 22:15         ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-13 22:15 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Fri, Dec 13, 2013 at 07:20:14PM +0000, Mel Gorman wrote:
> On Fri, Dec 13, 2013 at 12:04:43PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 13, 2013 at 02:10:07PM +0000, Mel Gorman wrote:
> > > Indications from Johannes that he wanted this. Needs some data and/or justification why
> > > thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
> > > it should be considered finished. I do not necessarily agree this patch is necessary
> > > but it's worth punting it out there for discussion and testing.
> > 
> > I demonstrated enormous gains in the original submission of the fair
> > allocation patch and
> 
> And the same test missed that it broke MPOL_DEFAULT and regressed any workload
> that does not hit reclaim by incurring remote accesses unnecessarily.

And none of this was nice, agreed, but it does not invalidate the
gains, it only changes what we are comparing them to.

> With this patch applied, MPOL_DEFAULT again does not act as
> documented by Documentation/vm/numa_memory_policy.txt and that file
> has been around a long time. It also does not match the documented
> behaviour of mbind where it says
> 
> 	The  system-wide  default  policy allocates  pages  on	the node of
> 	the CPU that triggers the allocation.  For MPOL_DEFAULT, the nodemask
> 	and maxnode arguments must specify the empty set of nodes.
> 
> That said, that documentation is also strictly wrong as MPOL_DEFAULT *may*
> allocate on remote nodes.
>
> > your tests haven't really shown downsides to the
> > cache-over-nodes portion of it. 
> > the cache-over-nodes fairness without any supporting data.
> > 
> 
> It breaks MPOL_LOCAL for file-backed mappings in a manner that cannot be
> overridden by policies and it is not even documented.  The same effect
> could have been achieved for the repeatedly reading files by running the
> processes with the MPOL_INTERLEAVE policy.  There was also no convenient
> way for a user to override that behaviour. Hard-binding to a node would
> work but tough luck if the process needs more than one node of memory.

Hardbinding or enabling zone_reclaim_mode, yes.  But agreed, let's fix
these problems.

> What I will admit is that I doubt anyone cares that file-backed pages
> are not node-local as documented as the cost of the IO itself probably
> dominates but just because something does not make sense does not mean
> someone is depending on the behaviour.

And that's why I very much agree that we need a way for people to
revert to the old behavior in case we are wrong about this.

But it's also a very strong argument for what the new default should
be, given that we allow people to revert our decision in the field.

> That alone is pretty heavy justification even in the absence of supporting
> data showing a workload that depends on file pages being node-local that
> is not hidden by the cost of the IO itself.

Even if we anticipate that nobody will care about it and we provide a
way to revert the behavior in the field in case we are wrong?

I disagree.

We should definitely allow the user to override our decision, but the
default should be what we anticipate will benefit most users.

And I'm really not trying to be ignorant of long-standing documented
behavior that users may have come to expect.  The bug reports will
land on my desk just as well.  But it looks like the current behavior
does not make much sense and is unlikely to be missed.

> > Reverting cross-node fairness for anon and slab is a good idea.  It
> > was always about cache and the original patch was too broad stroked,
> > but it doesn't invalidate everything it was about.
> > 
> 
> No it doesn't, but it should at least have been documented.

Yes, no argument there.

> > I can see, however, that we might want to make this configurable, but
> > I'm not eager on exporting user interfaces unless we have to.  As the
> > node-local fairness was never questioned by anybody, is it necessary
> > to make it configurable? 
> 
> It's only there since 3.12 and it takes a long time for people to notice
> NUMA regressions, especially ones that would just be within a few percent
> like this was unless they were specifically looking for it.

No, I meant only the case where we distribute memory fairly among the
zones WITHIN a given node.  This does not affect NUMA placement.  I
wouldn't want to make this configurable unless you think people might
want to disable this.  I can't think of a reason, anyway.

> > Shouldn't we be okay with just a single
> > vm.pagecache_interleave (name suggested by Rik) sysctl that defaults to 1 but
> > allows users to go back to pagecache obeying mempolicy?
> > 
> 
> That can be done. I can put together a patch that defaults it to 0 and
> sets the DISTRIBUTE_REMOTE_FILE flag if someone writes to it. That's a
> crude hack but many people will be ok with it.
> 
> Making it the default, though, would require more work:
> create an MPOL_DISTRIB_PAGECACHE memory policy (so named because it
> is not strictly interleave), abstract MPOL_DEFAULT to be either
> MPOL_LOCAL or MPOL_DISTRIB_PAGECACHE depending on the value of
> vm.pagecache_interleave, update the manual pages and Documentation/,
> then set the default of vm.pagecache_interleave to 1.
> 
> That would allow more sane defaults and also allow users to override it
> on a per-task and per-VMA basis as they can for any other type of memory
> policy.
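
A rough sketch of the "crude hack" described above; all names are
hypothetical except DISTRIBUTE_REMOTE_FILE, which is the flag named in
the series, so treat this as illustrative rather than the actual patch:

	/* vm.pagecache_interleave defaults to 0; writing a non-zero
	 * value enables remote distribution of file pages. */
	int sysctl_pagecache_interleave;

	int pagecache_interleave_handler(struct ctl_table *table, int write,
			void __user *buffer, size_t *lenp, loff_t *ppos)
	{
		int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);

		if (!ret && write) {
			if (sysctl_pagecache_interleave)
				zone_distribute_mode |= DISTRIBUTE_REMOTE_FILE;
			else
				zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE;
		}
		return ret;
	}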

Not using round-robin placement for cache creates weird artifacts in
our LRU aging decisions.  By not aging all pages in a workingset
equally, we may end up activating barely used pages on a remote node
and creating pressure on its active list for no reason.

This has little to do with the thrash detection patches, either; they
will just potentially trigger a few more nonsensical activations, but
for the same reason: the aging is skewed.

Because of that I really don't want to implement round-robin cache
placement as just another possible mempolicy when other parts of the
VM rely on it to be there.

It would make more sense to me to ignore mempolicies for cache by
default and provide a single sysctl to honor them, for the sole reason
that we have been honoring them for a very long time.  And document
the whole thing properly, of course.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-16 13:20     ` Rik van Riel
  -1 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-16 13:20 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML

On 12/13/2013 09:10 AM, Mel Gorman wrote:
> zone_local is using node_distance which is a more expensive call than
> necessary. On x86, it's another function call in the allocator fast path
> and increases cache footprint. This patch makes the assumption zones on a
> local node will share the same node ID. The necessary information should
> already be cache hot.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 4/7] mm: Annotate page cache allocations
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-16 15:20     ` Rik van Riel
  -1 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-16 15:20 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML

On 12/13/2013 09:10 AM, Mel Gorman wrote:
> Annotations will be used for fair zone allocation policy. Patch is mostly
> taken from a link posted by Johannes on IRC. It's not perfect because all
> callers of these paths are not guaranteed to be allocating pages for page
> cache. However, it's probably close enough to cover all cases that matter
> with minimal distortion.
> 
> Not-signed-off

Whenever you and Johannes sign it off, you can add my

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-16 19:25     ` Rik van Riel
  -1 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-16 19:25 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML

On 12/13/2013 09:10 AM, Mel Gorman wrote:
> Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> bug whereby new pages could be reclaimed before old pages because of
> how the page allocator and kswapd interacted on the per-zone LRU lists.
> Unfortunately it was missed during review that a consequence is that
> we also round-robin between NUMA nodes. This is bad for two reasons
> 
> 1. It alters the semantics of MPOL_LOCAL without telling anyone
> 2. It incurs an immediate remote memory performance hit in exchange
>    for a potential performance gain when memory needs to be reclaimed
>    later
> 
> No cookies for the reviewers on this one.
> 
> This patch makes the behaviour of the fair zone allocator policy
> configurable.  By default it will only distribute pages that are going
> to exist on the LRU between zones local to the allocating process. This
> preserves the historical semantics of MPOL_LOCAL.
> 
> By default, slab pages are not distributed between zones after this patch is
> applied. It can be argued that they should get similar treatment but they
> have different lifecycles to LRU pages, the shrinkers are not zone-aware
> and the interaction between the page allocator and kswapd is different
> for slabs. If it turns out to be an almost universal win, we can change
> the default.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-16 19:26     ` Rik van Riel
  -1 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-16 19:26 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML

On 12/13/2013 09:10 AM, Mel Gorman wrote:
> Indications from Johannes that he wanted this. Needs some data and/or justification why
> thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
> it should be considered finished. I do not necessarily agree this patch is necessary
> but it's worth punting it out there for discussion and testing.

This seems like a sane default to me.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 2/7] mm: page_alloc: Break out zone page aging distribution into its own helper
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-16 20:16     ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-16 20:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Fri, Dec 13, 2013 at 02:10:02PM +0000, Mel Gorman wrote:
> This patch moves the decision on whether to round-robin allocations between
> zones and nodes into its own helper functions. It'll make some later patches
> easier to understand and it will be automatically inlined.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-16 20:25     ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-16 20:25 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> zone_local is using node_distance which is a more expensive call than
> necessary. On x86, it's another function call in the allocator fast path
> and increases cache footprint. This patch makes the assumption zones on a
> local node will share the same node ID. The necessary information should
> already be cache hot.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 64020eb..fd9677e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
>  
>  static bool zone_local(struct zone *local_zone, struct zone *zone)
>  {
> -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> +	return zone_to_nid(zone) == numa_node_id();

Why numa_node_id()?  We pass in the preferred zone as @local_zone:

return zone_to_nid(local_zone) == zone_to_nid(zone)

Or even just compare the ->zone_pgdat pointers?
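
For concreteness, the two alternatives suggested here would look roughly
like this; a sketch, not a tested patch:

	/* Alternative 1: compare the node IDs of the two zones */
	static bool zone_local(struct zone *local_zone, struct zone *zone)
	{
		return zone_to_nid(local_zone) == zone_to_nid(zone);
	}

	/* Alternative 2: zones on the same node share a pgdat */
	static bool zone_local(struct zone *local_zone, struct zone *zone)
	{
		return local_zone->zone_pgdat == zone->zone_pgdat;
	}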

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-16 20:42     ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-16 20:42 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> bug whereby new pages could be reclaimed before old pages because of
> how the page allocator and kswapd interacted on the per-zone LRU lists.
> Unfortunately it was missed during review that a consequence is that
> we also round-robin between NUMA nodes. This is bad for two reasons
> 
> 1. It alters the semantics of MPOL_LOCAL without telling anyone
> 2. It incurs an immediate remote memory performance hit in exchange
>    for a potential performance gain when memory needs to be reclaimed
>    later
> 
> No cookies for the reviewers on this one.
> 
> This patch makes the behaviour of the fair zone allocator policy
> configurable.  By default it will only distribute pages that are going
> to exist on the LRU between zones local to the allocating process. This
> preserves the historical semantics of MPOL_LOCAL.
> 
> By default, slab pages are not distributed between zones after this patch is
> applied. It can be argued that they should get similar treatment but they
> have different lifecycles to LRU pages, the shrinkers are not zone-aware
> and the interaction between the page allocator and kswapd is different
> for slabs. If it turns out to be an almost universal win, we can change
> the default.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  Documentation/sysctl/vm.txt |  32 ++++++++++++++
>  include/linux/mmzone.h      |   2 +
>  include/linux/swap.h        |   2 +
>  kernel/sysctl.c             |   8 ++++
>  mm/page_alloc.c             | 102 ++++++++++++++++++++++++++++++++++++++------
>  5 files changed, 134 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 1fbd4eb..8eaa562 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
>  - swappiness
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- zone_distribute_mode
>  - zone_reclaim_mode
>  
>  ==============================================================
> @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
>  
>  ==============================================================
>  
> +zone_distribute_mode
> +
> > +Page allocation and reclaim are managed on a per-zone basis. When the
> +system needs to reclaim memory, candidate pages are selected from these
> +per-zone lists.  Historically, a potential consequence was that recently
> +allocated pages were considered reclaim candidates. From a zone-local
> +perspective, page aging was preserved but from a system-wide perspective
> +there was an age inversion problem.
> +
> +A similar problem occurs on a node level where young pages may be reclaimed
> > +from the local node instead of allocating remote memory. Unfortunately, the
> +cost of accessing remote nodes is higher so the system must choose by default
> +between favouring page aging or node locality. zone_distribute_mode controls
> +how the system will distribute page ages between zones.
> +
> +0	= Never round-robin based on age

I think we should be very conservative with the userspace interface we
export on a mechanism we are obviously just figuring out.

> +Otherwise the values are ORed together
> +
> +1	= Distribute anon pages between zones local to the allocating node
> +2	= Distribute file pages between zones local to the allocating node
> +4	= Distribute slab pages between zones local to the allocating node

Zone fairness within a node does not affect mempolicy or remote
reference costs.  Is there a reason to have this configurable?

> +The following three flags effectively alter MPOL_DEFAULT, be careful.
> +
> +8	= Distribute anon pages between zones remote to the allocating node
> +16	= Distribute file pages between zones remote to the allocating node
> +32	= Distribute slab pages between zones remote to the allocating node

Yes, it's conceivable that somebody might want to disable remote
distribution because of the extra references.

But at this point, I'd much rather back out anon and slab distribution
entirely, it was a mistake to include them.

That would leave us with a single knob to disable remote page cache
placement.
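
For orientation, the bit values quoted from the patch could be spelled
as flags along these lines; the names are hypothetical apart from
DISTRIBUTE_REMOTE_FILE, which appears elsewhere in the thread:

	#define DISTRIBUTE_LOCAL_ANON	0x01	/*  1 */
	#define DISTRIBUTE_LOCAL_FILE	0x02	/*  2 */
	#define DISTRIBUTE_LOCAL_SLAB	0x04	/*  4 */
	#define DISTRIBUTE_REMOTE_ANON	0x08	/*  8 */
	#define DISTRIBUTE_REMOTE_FILE	0x10	/* 16 */
	#define DISTRIBUTE_REMOTE_SLAB	0x20	/* 32 */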

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 6/7] mm: page_alloc: Only account batch allocations requests that are eligible
  2013-12-13 14:10   ` Mel Gorman
@ 2013-12-16 20:52     ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-16 20:52 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Fri, Dec 13, 2013 at 02:10:06PM +0000, Mel Gorman wrote:
> Not signed off. Johannes, was the intent really to decrement the batch
> counts regardless of whether the policy was being enforced or not?

Yes.  Bursts of allocations for which the policy does not get enforced
will still create memory pressure and affect cache aging on a given
node.  So even if we only distribute page cache, we want to distribute
it in a way that all allocations on the eligible zones equal out.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-16 20:25     ` Johannes Weiner
@ 2013-12-17 11:13       ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 11:13 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > zone_local is using node_distance which is a more expensive call than
> > necessary. On x86, it's another function call in the allocator fast path
> > and increases cache footprint. This patch makes the assumption zones on a
> > local node will share the same node ID. The necessary information should
> > already be cache hot.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/page_alloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 64020eb..fd9677e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> >  
> >  static bool zone_local(struct zone *local_zone, struct zone *zone)
> >  {
> > -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > +	return zone_to_nid(zone) == numa_node_id();
> 
> Why numa_node_id()?  We pass in the preferred zone as @local_zone:
> 

Initially because I was thinking "local node", and numa_node_id() is a
per-cpu variable that should be cheap to access and in some cases
cache-hot, as the top-level gfp API calls numa_node_id().

Thinking about it more though, it still makes sense because the preferred
zone is not necessarily local. If the allocation request requires ZONE_DMA32
and the local node does not have that zone, then the preferred zone is on a
remote node.
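
A worked example of that case, with an assumed node layout:

	/*
	 * Two-node machine where only node 0 has ZONE_DMA32.  A
	 * GFP_DMA32 allocation from a CPU on node 1 gets node 0's
	 * DMA32 zone as its preferred zone, so:
	 *
	 *	zone_to_nid(preferred_zone) == 0
	 *	numa_node_id()              == 1
	 *
	 * Comparing against the preferred zone would then treat node 0
	 * as "local" even though the allocating CPU sits on node 1.
	 */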

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 6/7] mm: page_alloc: Only account batch allocations requests that are eligible
  2013-12-16 20:52     ` Johannes Weiner
@ 2013-12-17 11:20       ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 11:20 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Mon, Dec 16, 2013 at 03:52:37PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 02:10:06PM +0000, Mel Gorman wrote:
> > Not signed off. Johannes, was the intent really to decrement the batch
> > counts regardless of whether the policy was being enforced or not?
> 
> Yes.  Bursts of allocations for which the policy does not get enforced
> will still create memory pressure and affect cache aging on a given
> node.  So even if we only distribute page cache, we want to distribute
> it in a way that all allocations on the eligible zones equal out.

This means that allocations for page table pages affect the distribution of
page cache pages. An adverse workload could time when it faults anonymous
pages (to allocate anon and page table pages) in batch sequences and then
access files to force page cache pages to be allocated from a single node.

I think I know what your response will be. It will be that the utilisation of
the zone for page table pages and anon pages means that you want more page
cache pages to be allocated from the other zones so the reclaim pressure
is still more or less even. If that is the case, or there is another reason,
then it could have done with a comment because it's a subtle detail.
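
A minimal sketch of the accounting behaviour under discussion, using the
NR_ALLOC_BATCH counter from the fair policy; the helper name is
hypothetical:

	/* Charge every allocation against the zone's fairness batch,
	 * whether or not placement is enforced for this allocation. */
	static void fair_batch_account(struct zone *zone, unsigned int order)
	{
		__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
	}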

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6
  2013-12-13 14:10 ` Mel Gorman
@ 2013-12-17 15:07   ` Zlatko Calusic
  -1 siblings, 0 replies; 84+ messages in thread
From: Zlatko Calusic @ 2013-12-17 15:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Rik van Riel,
	Linux-MM, LKML

On 13.12.2013 15:10, Mel Gorman wrote:
> Kicked this another bit today. It's still a bit half-baked but it restores
> the historical performance and leaves the door open at the end for playing
> nice with distributing file pages between nodes. Finishing this series
> depends on whether we are going to make the remote node behaviour of the
> fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
> favour of the configurable option because the default can be redefined and
> tested while giving users a "compat" mode if we discover the new default
> behaviour sucks for some workload.
>

I'll start a 5-day test of this patchset in a few hours, unless you can 
send an updated one in the meantime. I intend to test it on a rather 
boring 4GB x86_64 machine that, before Johannes' work, had lots of trouble
balancing zones. Would you recommend using the default settings, i.e.
don't mess with tunables at this point?

Regards,
-- 
Zlatko


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-16 20:42     ` Johannes Weiner
@ 2013-12-17 15:29       ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 15:29 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Mon, Dec 16, 2013 at 03:42:15PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > bug whereby new pages could be reclaimed before old pages because of
> > how the page allocator and kswapd interacted on the per-zone LRU lists.
> > Unfortunately it was missed during review that a consequence is that
> > we also round-robin between NUMA nodes. This is bad for two reasons
> > 
> > 1. It alters the semantics of MPOL_LOCAL without telling anyone
> > 2. It incurs an immediate remote memory performance hit in exchange
> >    for a potential performance gain when memory needs to be reclaimed
> >    later
> > 
> > No cookies for the reviewers on this one.
> > 
> > This patch makes the behaviour of the fair zone allocator policy
> > configurable.  By default it will only distribute pages that are going
> > to exist on the LRU between zones local to the allocating process. This
> > preserves the historical semantics of MPOL_LOCAL.
> > 
> > By default, slab pages are not distributed between zones after this patch is
> > applied. It can be argued that they should get similar treatment but they
> > have different lifecycles to LRU pages, the shrinkers are not zone-aware
> > and the interaction between the page allocator and kswapd is different
> > for slabs. If it turns out to be an almost universal win, we can change
> > the default.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  Documentation/sysctl/vm.txt |  32 ++++++++++++++
> >  include/linux/mmzone.h      |   2 +
> >  include/linux/swap.h        |   2 +
> >  kernel/sysctl.c             |   8 ++++
> >  mm/page_alloc.c             | 102 ++++++++++++++++++++++++++++++++++++++------
> >  5 files changed, 134 insertions(+), 12 deletions(-)
> > 
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index 1fbd4eb..8eaa562 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> >  - swappiness
> >  - user_reserve_kbytes
> >  - vfs_cache_pressure
> > +- zone_distribute_mode
> >  - zone_reclaim_mode
> >  
> >  ==============================================================
> > @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
> >  
> >  ==============================================================
> >  
> > +zone_distribute_mode
> > +
> > > +Page allocation and reclaim are managed on a per-zone basis. When the
> > +system needs to reclaim memory, candidate pages are selected from these
> > +per-zone lists.  Historically, a potential consequence was that recently
> > +allocated pages were considered reclaim candidates. From a zone-local
> > +perspective, page aging was preserved but from a system-wide perspective
> > +there was an age inversion problem.
> > +
> > +A similar problem occurs on a node level where young pages may be reclaimed
> > > +from the local node instead of allocating remote memory. Unfortunately, the
> > +cost of accessing remote nodes is higher so the system must choose by default
> > +between favouring page aging or node locality. zone_distribute_mode controls
> > +how the system will distribute page ages between zones.
> > +
> > +0	= Never round-robin based on age
> 
> I think we should be very conservative with the userspace interface we
> export on a mechanism we are obviously just figuring out.
> 

And we have a proposal on how to limit this. I'll be layering another
patch on top that removes this interface again. That will allow us to
roll back one patch and still have a usable interface if necessary.

> > +Otherwise the values are ORed together
> > +
> > +1	= Distribute anon pages between zones local to the allocating node
> > +2	= Distribute file pages between zones local to the allocating node
> > +4	= Distribute slab pages between zones local to the allocating node
> 
> Zone fairness within a node does not affect mempolicy or remote
> reference costs.  Is there a reason to have this configurable?
> 

Symmetry

> > +The following three flags effectively alter MPOL_DEFAULT, be careful.
> > +
> > +8	= Distribute anon pages between zones remote to the allocating node
> > +16	= Distribute file pages between zones remote to the allocating node
> > +32	= Distribute slab pages between zones remote to the allocating node
> 
> Yes, it's conceivable that somebody might want to disable remote
> distribution because of the extra references.
> 
> But at this point, I'd much rather back out anon and slab distribution
> entirely, it was a mistake to include them.
> 
> That would leave us with a single knob to disable remote page cache
> placement.
> 

When looking at this more closely I found that sysv is a weird exception. It's
file-backed as far as most of the VM is concerned but looks anonymous to
most applications that care. That and MAP_SHARED anonymous pages should
not be treated like files but we still want tmpfs to be treated as
files. Details will be in the changelog of the next series.
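
To make that distinction concrete, a hypothetical predicate; both helper
names here are assumptions, not existing kernel functions:

	/* Should this mapping's pages be distributed as page cache? */
	static bool distribute_as_pagecache(struct vm_area_struct *vma)
	{
		if (!vma->vm_file)
			return false;	/* plain anonymous memory */
		if (is_sysv_shm_file(vma->vm_file))	/* assumed helper */
			return false;	/* sysv shm: anonymous to userspace */
		if (is_shared_anon(vma))		/* assumed helper */
			return false;	/* MAP_SHARED anonymous */
		return true;		/* regular files, including tmpfs */
	}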

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-17 11:13       ` Mel Gorman
@ 2013-12-17 15:38         ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 15:38 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 11:13:52AM +0000, Mel Gorman wrote:
> On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > > zone_local is using node_distance which is a more expensive call than
> > > necessary. On x86, it's another function call in the allocator fast path
> > > and increases cache footprint. This patch makes the assumption zones on a
> > > local node will share the same node ID. The necessary information should
> > > already be cache hot.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > >  mm/page_alloc.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 64020eb..fd9677e 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > >  
> > >  static bool zone_local(struct zone *local_zone, struct zone *zone)
> > >  {
> > > -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > +	return zone_to_nid(zone) == numa_node_id();
> > 
> > Why numa_node_id()?  We pass in the preferred zone as @local_zone:
> > 
> 
> Initially because I was thinking "local node" and numa_node_id() is a
> per-cpu variable that should be cheap to access and in some cases
> cache-hot as the top-level gfp API calls numa_node_id().
> 
> Thinking about it more though it still makes sense because the preferred
> zone is not necessarily local. If the allocation request requires ZONE_DMA32
> and the local node does not have that zone then preferred zone is on a
> remote node.

Don't we treat everything in relation to the preferred zone?
zone_reclaim_mode itself does not compare with numa_node_id() but with
whatever is the preferred zone.

I could see some value in changing that to numa_node_id(), but then
zone_local() and zone_allows_reclaim() should probably both switch.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 6/7] mm: page_alloc: Only account batch allocations requests that are eligible
  2013-12-17 11:20       ` Mel Gorman
@ 2013-12-17 15:43         ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 15:43 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 11:20:07AM +0000, Mel Gorman wrote:
> On Mon, Dec 16, 2013 at 03:52:37PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 13, 2013 at 02:10:06PM +0000, Mel Gorman wrote:
> > > Not signed off. Johannes, was the intent really to decrement the batch
> > > counts regardless of whether the policy was being enforced or not?
> > 
> > Yes.  Bursts of allocations for which the policy does not get enforced
> > will still create memory pressure and affect cache aging on a given
> > node.  So even if we only distribute page cache, we want to distribute
> > it in a way that all allocations on the eligible zones equal out.
> 
> This means that allocations for page table pages affect the distribution of
> page cache pages. An adverse workload could time when it faults anonymous
> pages (to allocate anon and page table pages) in batch sequences and then
> access files to force page cache pages to be allocated from a single node.
> 
> I think I know what your response will be. It will be that the utilisation of
> the zone for page table pages and anon pages means that you want more page
> cache pages to be allocated from the other zones so the reclaim pressure
> is still more or less even. If this is the case or there is another reason
> then it could have done with a comment because it's a subtle detail.

Yes, that was the idea, that the cache placement compensates for pages
that still are always allocated on the preferred zone first, so that
the end result is approximately as if round-robin had been applied to
everybody.
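
The fastpath in question looks roughly like this (a simplified sketch of
the 3.12-era get_page_from_freelist(), not the exact code):

for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) {
	/*
	 * Every allocation on the low-watermark fastpath consumes the
	 * zone's batch, whether or not the allocation itself is
	 * eligible for round-robin placement.
	 */
	if ((alloc_flags & ALLOC_WMARK_LOW) &&
	    zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
		continue;	/* zone used its fair share, move on */
	/* ... watermark checks and rmqueue follow ... */
}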

This should be documented as part of the patch that first diverges
between the allocations that are counted and the allocations that are
round-robined:

  mm: page_alloc: exclude unreclaimable allocations from zone fairness policy

I'm updating my tree.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-17 15:29       ` Mel Gorman
@ 2013-12-17 15:54         ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 15:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 03:29:54PM +0000, Mel Gorman wrote:
> On Mon, Dec 16, 2013 at 03:42:15PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > bug whereby new pages could be reclaimed before old pages because of
> > > how the page allocator and kswapd interacted on the per-zone LRU lists.
> > > Unfortunately it was missed during review that a consequence is that
> > > we also round-robin between NUMA nodes. This is bad for two reasons
> > > 
> > > 1. It alters the semantics of MPOL_LOCAL without telling anyone
> > > 2. It incurs an immediate remote memory performance hit in exchange
> > >    for a potential performance gain when memory needs to be reclaimed
> > >    later
> > > 
> > > No cookies for the reviewers on this one.
> > > 
> > > This patch makes the behaviour of the fair zone allocator policy
> > > configurable.  By default it will only distribute pages that are going
> > > to exist on the LRU between zones local to the allocating process. This
> > > preserves the historical semantics of MPOL_LOCAL.
> > > 
> > > By default, slab pages are not distributed between zones after this patch is
> > > applied. It can be argued that they should get similar treatment but they
> > > have different lifecycles to LRU pages, the shrinkers are not zone-aware
> > > and the interaction between the page allocator and kswapd is different
> > > for slabs. If it turns out to be an almost universal win, we can change
> > > the default.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > >  Documentation/sysctl/vm.txt |  32 ++++++++++++++
> > >  include/linux/mmzone.h      |   2 +
> > >  include/linux/swap.h        |   2 +
> > >  kernel/sysctl.c             |   8 ++++
> > >  mm/page_alloc.c             | 102 ++++++++++++++++++++++++++++++++++++++------
> > >  5 files changed, 134 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > > index 1fbd4eb..8eaa562 100644
> > > --- a/Documentation/sysctl/vm.txt
> > > +++ b/Documentation/sysctl/vm.txt
> > > @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> > >  - swappiness
> > >  - user_reserve_kbytes
> > >  - vfs_cache_pressure
> > > +- zone_distribute_mode
> > >  - zone_reclaim_mode
> > >  
> > >  ==============================================================
> > > @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
> > >  
> > >  ==============================================================
> > >  
> > > +zone_distribute_mode
> > > +
> > > +Page allocation and reclaim are managed on a per-zone basis. When the
> > > +system needs to reclaim memory, candidate pages are selected from these
> > > +per-zone lists.  Historically, a potential consequence was that recently
> > > +allocated pages were considered reclaim candidates. From a zone-local
> > > +perspective, page aging was preserved but from a system-wide perspective
> > > +there was an age inversion problem.
> > > +
> > > +A similar problem occurs on a node level where young pages may be reclaimed
> > > +from the local node instead of allocating remote memory. Unfortunately, the
> > > +cost of accessing remote nodes is higher so the system must choose by default
> > > +between favouring page aging or node locality. zone_distribute_mode controls
> > > +how the system will distribute page ages between zones.
> > > +
> > > +0	= Never round-robin based on age
> > 
> > I think we should be very conservative with the userspace interface we
> > export on a mechanism we are obviously just figuring out.
> > 
> 
> And we have a proposal on how to limit this. I'll be layering another
> patch on top that removes this interface again. That will allow us to
> roll back one patch and still have a usable interface if necessary.
> 
> > > +Otherwise the values are ORed together
> > > +
> > > +1	= Distribute anon pages between zones local to the allocating node
> > > +2	= Distribute file pages between zones local to the allocating node
> > > +4	= Distribute slab pages between zones local to the allocating node
> > 
> > Zone fairness within a node does not affect mempolicy or remote
> > reference costs.  Is there a reason to have this configurable?
> > 
> 
> Symmetry
> 
> > > +The following three flags effectively alter MPOL_DEFAULT, be careful.
> > > +
> > > +8	= Distribute anon pages between zones remote to the allocating node
> > > +16	= Distribute file pages between zones remote to the allocating node
> > > +32	= Distribute slab pages between zones remote to the allocating node
> > 
> > Yes, it's conceivable that somebody might want to disable remote
> > distribution because of the extra references.
> > 
> > But at this point, I'd much rather back out anon and slab distribution
> > entirely, it was a mistake to include them.
> > 
> > That would leave us with a single knob to disable remote page cache
> > placement.
> > 
> 
> Looking at this more closely, I found that sysv is a weird exception. It's
> file-backed as far as most of the VM is concerned but looks anonymous to
> most applications that care. That and MAP_SHARED anonymous pages should
> not be treated like files but we still want tmpfs to be treated as
> files. Details will be in the changelog of the next series.

In what sense is it seen as file-backed?  The pages are swapbacked and
they sit on the anon LRUs, so at least as far as aging and reclaim
goes (what this series is concerned with) they are anon, not file.
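
The LRU side makes that distinction with a single flag test; this is
roughly the 3.13-era helper from include/linux/mm_inline.h:

static inline int page_is_file_cache(struct page *page)
{
	/* shmem/sysv pages are PageSwapBacked, so they age on the
	 * anon LRU despite having a mapping */
	return !PageSwapBacked(page);
}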

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
  2013-12-13 22:15         ` Johannes Weiner
@ 2013-12-17 16:04           ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 16:04 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Fri, Dec 13, 2013 at 05:15:41PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 07:20:14PM +0000, Mel Gorman wrote:
> > On Fri, Dec 13, 2013 at 12:04:43PM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 13, 2013 at 02:10:07PM +0000, Mel Gorman wrote:
> > > > Indications from Johannes that he wanted this. Needs some data and/or justification why
> > > > thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
> > > > it should be considered finished. I do not necessarily agree this patch is necessary
> > > > but it's worth punting it out there for discussion and testing.
> > > 
> > > I demonstrated enormous gains in the original submission of the fair
> > > allocation patch and
> > 
> > And the same test missed that it broke MPOL_DEFAULT and regressed any workload
> > that does not hit reclaim by incurring remote accesses unnecessarily.
> 
> And none of this was nice, agreed, but it does not invalidate the
> gains, it only changes what we are comparing them to.
> 

Notifying that we're changing existing interfaces is important. Again, I
need to be clear that I'm not against the change per se. I'm annoyed with
myself more than anything that I missed some of the major implications
of the change the first time around and want to get back some of the
performance we lost due to remote memory usage.

> > With this patch applied, MPOL_DEFAULT again does not act as
> > documented by Documentation/vm/numa_memory_policy.txt and that file
> > has been around a long time. It also does not match the documented
> > behaviour of mbind where it says
> > 
> > 	The  system-wide  default  policy allocates  pages  on	the node of
> > 	the CPU that triggers the allocation.  For MPOL_DEFAULT, the nodemask
> > 	and maxnode arguments must specify the empty set of nodes.
> > 
> > That said, that documentation is also strictly wrong as MPOL_DEFAULT *may*
> > allocate on remote nodes.
> >
> > > your tests haven't really shown downsides to the
> > > cache-over-nodes portion of it. 
> > > the cache-over-nodes fairness without any supporting data.
> > > 
> > 
> > It breaks MPOL_LOCAL for file-backed mappings in a manner that cannot be
> > overridden by policies and it is not even documented.  The same effect
> > could have been achieved for the processes repeatedly reading files by
> > running them with the MPOL_INTERLEAVE policy.  There was also no convenient
> > way for a user to override that behaviour. Hard-binding to a node would
> > work but tough luck if the process needs more than one node of memory.
> 
> Hardbinding or enabling zone_reclaim_mode, yes.  But agreed, let's fix
> these problems.
> 

I would very much hate to recommend zone_reclaim_mode to work around
this. That thing is a disaster for a lot of workloads and can cause massive
allocation latencies in an effort to keep memory local. I've dealt with
a fairly sizable number of bugs over the last three years related to
that setting.
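
The MPOL_INTERLEAVE route mentioned in the quote above is at least easy
for a process to opt into. A minimal userspace sketch, assuming a
two-node machine (link with -lnuma for the set_mempolicy(2) wrapper):

#include <numaif.h>
#include <stdio.h>

int main(void)
{
	/* interleave this task's future allocations over nodes 0-1 */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8))
		perror("set_mempolicy");
	return 0;
}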

> > What I will admit is that I doubt anyone cares that file-backed pages
> > are not node-local as documented as the cost of the IO itself probably
> > dominates but just because something does not make sense does not mean
> > someone is depending on the behaviour.
> 
> And that's why I very much agree that we need a way for people to
> revert to the old behavior in case we are wrong about this.
> 
> But it's also a very strong argument for what the new default should
> be, given that we allow people to revert our decision in the field.
> 

We still need to update the docs at the same time as the default is changed
or at least have the man pages patch in flight to Michael Kerrisk.

> > That alone is pretty heavy justification even in the absense of supporting
> > data showing a workload that depends on file pages being node-local that
> > is not hidden by the cost of the IO itself.
> 
> Even if we anticipate that nobody will care about it and we provide a
> way to revert the behavior in the field in case we are wrong?
> 
> I disagree.
> 

There will be people that care, they just haven't shown up yet. We missed
one important example. After the fair allocation policy we are interleaving
sysv shared memory between nodes. I bet you a shiny penny that heavy users
of sysv shared memory (databases) are depending on the local allocation
policy for those areas and we broke that. They'd be hit even if they were
using direct IO. It could be a long time before some user of those databases
notices a performance regression of a few percent and finds this change.

We may have missed other examples which is why I would prefer that a
change in the default would be accompanied by an update of Documentation/
and of the manual pages. At least that way we can claim it's behaving as
designed and users will have a chance of discovering the change without
having to post to linux-mm.

> We should definitely allow the user to override our decision, but the
> default should be what we anticipate will benefit most users.
> 
> And I'm really not trying to be ignorant of long-standing documented
> behavior that users may have come to expect.  The bug reports will
> land on my desk just as well.  But it looks like the current behavior
> does not make much sense and is unlikely to be missed.
> 

I think the treatment of sysv shared memory is an important exception.
However, I should cover that in the next series although the hack used may
cause people to throw rocks at me. That's assuming the hack even works,
I have not booted it yet.

> > > Reverting cross-node fairness for anon and slab is a good idea.  It
> > > was always about cache and the original patch was too broad stroked,
> > > but it doesn't invalidate everything it was about.
> > > 
> > 
> > No it doesn't, but it should at least have been documented.
> 
> Yes, no argument there.
> 
> > > I can see, however, that we might want to make this configurable, but
> > > I'm not eager on exporting user interfaces unless we have to.  As the
> > > node-local fairness was never questioned by anybody, is it necessary
> > > to make it configurable? 
> > 
> > It's only there since 3.12 and it takes a long time for people to notice
> > NUMA regressions, especially ones that would just be within a few percent
> > like this was unless they were specifically looking for it.
> 
> No, I meant only the case where we distribute memory fairly among the
> zones WITHIN a given node.  This does not affect NUMA placement.  I
> wouldn't want to make this configurable unless you think people might
> want to disable this.  I can't think of a reason, anyway.
> 

Oh right. That thing was just about API symmetry and for experimentation. I
could not think of a good reason why someone would use it other than to
demonstrate the impact of the fair allocation policy on UMA machines with
a small highest zone. It's the type of thing that Zlatko Calusic's
testing would be sensitive to.

In my current series I replaced it with the knob suggested by Rik and
yourself. The internal details are still the same but the user-visible
knob controls just page cache with special casing of MAP_SHARED anonymous
and sysv memory.

> > > Shouldn't we be okay with just a single
> > > vm.pagecache_interleave (name by Rik) sysctl that defaults to 1 but
> > > allows users to go back to pagecache obeying mempolicy?
> > > 
> > 
> > That can be done. I can put together a patch that defaults it to 0 and
> > sets the DISTRIBUTE_REMOTE_FILE  flag if someone writes to it. That's a
> > crude hack but many people will be ok with it.
> > 
> > To make it the default, though, would require more work.
> > Create an MPOL_DISTRIB_PAGECACHE memory policy (name because it
> > is not strictly interleave). Abstract MPOL_DEFAULT to be either
> > MPOL_LOCAL or MPOL_DISTRIB_PAGECACHE depending on the value of
> > vm.pagecache_interleave. Update manual pages, and Documentation/ then set
> > the default of vm.pagecache_interleave to 1.
> > 
> > That would allow more sane defaults and also allow users to override it
> > on a per task and per VMA basis as they can for any other type of memory
> > policy.
> 
> Not using round-robin placement for cache creates weird artifacts in
> our LRU aging decisions.  By not aging all pages in a workingset
> equally, we may end up activating barely used pages on a remote node
> and creating pressure on its active list for no reason.
> 

I fully appreciate the positive aspects of the patch and want to see it
happen. If I didn't, I would be trying to revert the patch and ignoring
any arguments to the contrary. I would just prefer we did it in a way
that generated less paperwork in the future.
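
As a sketch of the crude vm.pagecache_interleave hack discussed above
(the handler and variable names are assumptions, not a posted patch):

int sysctl_pagecache_interleave __read_mostly;

int pagecache_interleave_handler(struct ctl_table *table, int write,
		void __user *buffer, size_t *length, loff_t *ppos)
{
	int ret = proc_dointvec_minmax(table, write, buffer, length, ppos);

	/* map the single user-visible knob onto the internal flag */
	if (!ret && write) {
		if (sysctl_pagecache_interleave)
			zone_distribute_mode |= DISTRIBUTE_REMOTE_FILE;
		else
			zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE;
	}
	return ret;
}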

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 6/7] mm: page_alloc: Only account batch allocations requests that are eligible
  2013-12-17 15:43         ` Johannes Weiner
@ 2013-12-17 16:06           ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 16:06 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 10:43:51AM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 11:20:07AM +0000, Mel Gorman wrote:
> > On Mon, Dec 16, 2013 at 03:52:37PM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 13, 2013 at 02:10:06PM +0000, Mel Gorman wrote:
> > > > Not signed off. Johannes, was the intent really to decrement the batch
> > > > counts regardless of whether the policy was being enforced or not?
> > > 
> > > Yes.  Bursts of allocations for which the policy does not get enforced
> > > will still create memory pressure and affect cache aging on a given
> > > node.  So even if we only distribute page cache, we want to distribute
> > > it in a way that all allocations on the eligible zones equal out.
> > 
> > This means that allocations for page table pages affect the distribution of
> > page cache pages. An adverse workload could time when it faults anonymous
> > pages (to allocate anon and page table pages) in batch sequences and then
> > access files to force page cache pages to be allocated from a single node.
> > 
> > I think I know what your response will be. It will be that the utilisation of
> > the zone for page table pages and anon pages means that you want more page
> > cache pages to be allocated from the other zones so the reclaim pressure
> > is still more or less even. If this is the case or there is another reason
> > then it could have done with a comment because it's a subtle detail.
> 
> Yes, that was the idea, that the cache placement compensates for pages
> that still are always allocated on the preferred zone first, so that
> the end result is approximately as if round-robin had been applied to
> everybody.
> 

Ok, understood. I wanted to be sure that was the thinking behind it.

> This should be documented as part of the patch that first diverges
> between the allocations that are counted and the allocations that are
> round-robined:
> 
>   mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
> 
> I'm updating my tree.

I'll leave it alone in mine then. We'll figure out how to sync up later.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-17 15:38         ` Johannes Weiner
@ 2013-12-17 16:08           ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 16:08 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 10:38:29AM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 11:13:52AM +0000, Mel Gorman wrote:
> > On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > > > zone_local is using node_distance which is a more expensive call than
> > > > necessary. On x86, it's another function call in the allocator fast path
> > > > and increases cache footprint. This patch makes the assumption zones on a
> > > > local node will share the same node ID. The necessary information should
> > > > already be cache hot.
> > > > 
> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > ---
> > > >  mm/page_alloc.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 64020eb..fd9677e 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > > >  
> > > >  static bool zone_local(struct zone *local_zone, struct zone *zone)
> > > >  {
> > > > -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > > +	return zone_to_nid(zone) == numa_node_id();
> > > 
> > > Why numa_node_id()?  We pass in the preferred zone as @local_zone:
> > > 
> > 
> > Initially because I was thinking "local node" and numa_node_id() is a
> > per-cpu variable that should be cheap to access and in some cases
> > cache-hot as the top-level gfp API calls numa_node_id().
> > 
> > Thinking about it more though it still makes sense because the preferred
> > zone is not necessarily local. If the allocation request requires ZONE_DMA32
> > and the local node does not have that zone then preferred zone is on a
> > remote node.
> 
> Don't we treat everything in relation to the preferred zone?

Usually yes, but this time we really care about whether the memory is
local or remote. It makes sense to me as it is and I struggle to see an
advantage of expressing it in terms of the preferred zone. Minimally
zone_local would need to be renamed if it could return true for a remote
zone and I see no advantage in doing that.
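
For comparison, the preferred-zone formulation would look like this (a
sketch, not a posted patch):

static bool zone_local(struct zone *local_zone, struct zone *zone)
{
	/* fairness relative to the preferred zone's node rather than
	 * the allocating CPU's node */
	return zone_to_nid(zone) == zone_to_nid(local_zone);
}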

I might be stuck in a "la la la, everything is fine" rut.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-17 15:54         ` Johannes Weiner
@ 2013-12-17 16:14           ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 16:14 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 10:54:35AM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 03:29:54PM +0000, Mel Gorman wrote:
> > On Mon, Dec 16, 2013 at 03:42:15PM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > > bug whereby new pages could be reclaimed before old pages because of
> > > > how the page allocator and kswapd interacted on the per-zone LRU lists.
> > > > Unfortunately it was missed during review that a consequence is that
> > > > we also round-robin between NUMA nodes. This is bad for two reasons
> > > > 
> > > > 1. It alters the semantics of MPOL_LOCAL without telling anyone
> > > > 2. It incurs an immediate remote memory performance hit in exchange
> > > >    for a potential performance gain when memory needs to be reclaimed
> > > >    later
> > > > 
> > > > No cookies for the reviewers on this one.
> > > > 
> > > > This patch makes the behaviour of the fair zone allocator policy
> > > > configurable.  By default it will only distribute pages that are going
> > > > to exist on the LRU between zones local to the allocating process. This
> > > > preserves the historical semantics of MPOL_LOCAL.
> > > > 
> > > > By default, slab pages are not distributed between zones after this patch is
> > > > applied. It can be argued that they should get similar treatment but they
> > > > have different lifecycles to LRU pages, the shrinkers are not zone-aware
> > > > and the interaction between the page allocator and kswapd is different
> > > > for slabs. If it turns out to be an almost universal win, we can change
> > > > the default.
> > > > 
> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > ---
> > > >  Documentation/sysctl/vm.txt |  32 ++++++++++++++
> > > >  include/linux/mmzone.h      |   2 +
> > > >  include/linux/swap.h        |   2 +
> > > >  kernel/sysctl.c             |   8 ++++
> > > >  mm/page_alloc.c             | 102 ++++++++++++++++++++++++++++++++++++++------
> > > >  5 files changed, 134 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > > > index 1fbd4eb..8eaa562 100644
> > > > --- a/Documentation/sysctl/vm.txt
> > > > +++ b/Documentation/sysctl/vm.txt
> > > > @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> > > >  - swappiness
> > > >  - user_reserve_kbytes
> > > >  - vfs_cache_pressure
> > > > +- zone_distribute_mode
> > > >  - zone_reclaim_mode
> > > >  
> > > >  ==============================================================
> > > > @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
> > > >  
> > > >  ==============================================================
> > > >  
> > > > +zone_distribute_mode
> > > > +
> > > > +Page allocation and reclaim are managed on a per-zone basis. When the
> > > > +system needs to reclaim memory, candidate pages are selected from these
> > > > +per-zone lists.  Historically, a potential consequence was that recently
> > > > +allocated pages were considered reclaim candidates. From a zone-local
> > > > +perspective, page aging was preserved but from a system-wide perspective
> > > > +there was an age inversion problem.
> > > > +
> > > > +A similar problem occurs on a node level where young pages may be reclaimed
> > > > +from the local node instead of allocating remote memory. Unfortunately, the
> > > > +cost of accessing remote nodes is higher so the system must choose by default
> > > > +between favouring page aging or node locality. zone_distribute_mode controls
> > > > +how the system will distribute page ages between zones.
> > > > +
> > > > +0	= Never round-robin based on age
> > > 
> > > I think we should be very conservative with the userspace interface we
> > > export on a mechanism we are obviously just figuring out.
> > > 
> > 
> > And we have a proposal on how to limit this. I'll be layering another
> > patch on top that removes this interface again. That will allow us to
> > roll back one patch and still have a usable interface if necessary.
> > 
> > > > +Otherwise the values are ORed together
> > > > +
> > > > +1	= Distribute anon pages between zones local to the allocating node
> > > > +2	= Distribute file pages between zones local to the allocating node
> > > > +4	= Distribute slab pages between zones local to the allocating node
> > > 
> > > Zone fairness within a node does not affect mempolicy or remote
> > > reference costs.  Is there a reason to have this configurable?
> > > 
> > 
> > Symmetry
> > 
> > > > +The following three flags effectively alter MPOL_DEFAULT, be careful.
> > > > +
> > > > +8	= Distribute anon pages between zones remote to the allocating node
> > > > +16	= Distribute file pages between zones remote to the allocating node
> > > > +32	= Distribute slab pages between zones remote to the allocating node
> > > 
> > > Yes, it's conceivable that somebody might want to disable remote
> > > distribution because of the extra references.
> > > 
> > > But at this point, I'd much rather back out anon and slab distribution
> > > entirely, it was a mistake to include them.
> > > 
> > > That would leave us with a single knob to disable remote page cache
> > > placement.
> > > 
> > 
> > When looking at this closer I found that sysv is a weird exception. It's
> > file-backed as far as most of the VM is concerned but looks anonymous to
> > most applications that care. That and MAP_SHARED anonymous pages should
> > not be treated like files but we still want tmpfs to be treated as
> > files. Details will be in the changelog of the next series.
> 
> In what sense is it seen as file-backed?

sysv and anonymous pages are backed by an internal shmem mount point. In
lots of respects, it looks like a file and quacks like a file, but I expect
developers think of it as anonymous, and chunks of the VM treat it like
it's anonymous. tmpfs uses the same paths and gets treated by the VM
similarly to anon, but users may think that tmpfs should be subject to the
fair allocation zone policy "because they're files." It's a sufficiently
weird case that any action we take there should be deliberate. It'll be
a bit clearer when I post the patch that special-cases this.
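
For reference, the line much of the VM already draws internally is the
swap-backed page flag: anon, sysv shm and tmpfs pages all carry it while
on-disk file cache does not, which is exactly why these pages look
anonymous to reclaim even though they quack like files elsewhere. The
existing helper (include/linux/mm_inline.h in trees of this vintage) is
simply:

	static inline int page_is_file_cache(struct page *page)
	{
		return !PageSwapBacked(page);
	}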

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-17 16:14           ` Mel Gorman
@ 2013-12-17 17:43             ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 17:43 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 04:14:20PM +0000, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 10:54:35AM -0500, Johannes Weiner wrote:
> > On Tue, Dec 17, 2013 at 03:29:54PM +0000, Mel Gorman wrote:
> > > On Mon, Dec 16, 2013 at 03:42:15PM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> > > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > > > bug whereby new pages could be reclaimed before old pages because of
> > > > > how the page allocator and kswapd interacted on the per-zone LRU lists.
> > > > > Unfortunately it was missed during review that a consequence is that
> > > > > we also round-robin between NUMA nodes. This is bad for two reasons
> > > > > 
> > > > > 1. It alters the semantics of MPOL_LOCAL without telling anyone
> > > > > 2. It incurs an immediate remote memory performance hit in exchange
> > > > >    for a potential performance gain when memory needs to be reclaimed
> > > > >    later
> > > > > 
> > > > > No cookies for the reviewers on this one.
> > > > > 
> > > > > This patch makes the behaviour of the fair zone allocator policy
> > > > > configurable.  By default it will only distribute pages that are going
> > > > > to exist on the LRU between zones local to the allocating process. This
> > > > > preserves the historical semantics of MPOL_LOCAL.
> > > > > 
> > > > > By default, slab pages are not distributed between zones after this patch is
> > > > > applied. It can be argued that they should get similar treatment but they
> > > > > have different lifecycles to LRU pages, the shrinkers are not zone-aware
> > > > > and the interaction between the page allocator and kswapd is different
> > > > > for slabs. If it turns out to be an almost universal win, we can change
> > > > > the default.
> > > > > 
> > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > > ---
> > > > >  Documentation/sysctl/vm.txt |  32 ++++++++++++++
> > > > >  include/linux/mmzone.h      |   2 +
> > > > >  include/linux/swap.h        |   2 +
> > > > >  kernel/sysctl.c             |   8 ++++
> > > > >  mm/page_alloc.c             | 102 ++++++++++++++++++++++++++++++++++++++------
> > > > >  5 files changed, 134 insertions(+), 12 deletions(-)
> > > > > 
> > > > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > > > > index 1fbd4eb..8eaa562 100644
> > > > > --- a/Documentation/sysctl/vm.txt
> > > > > +++ b/Documentation/sysctl/vm.txt
> > > > > @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> > > > >  - swappiness
> > > > >  - user_reserve_kbytes
> > > > >  - vfs_cache_pressure
> > > > > +- zone_distribute_mode
> > > > >  - zone_reclaim_mode
> > > > >  
> > > > >  ==============================================================
> > > > > @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
> > > > >  
> > > > >  ==============================================================
> > > > >  
> > > > > +zone_distribute_mode
> > > > > +
> > > > > +Page allocation and reclaim are managed on a per-zone basis. When the
> > > > > +system needs to reclaim memory, candidate pages are selected from these
> > > > > +per-zone lists.  Historically, a potential consequence was that recently
> > > > > +allocated pages were considered reclaim candidates. From a zone-local
> > > > > +perspective, page aging was preserved but from a system-wide perspective
> > > > > +there was an age inversion problem.
> > > > > +
> > > > > +A similar problem occurs on a node level where young pages may be reclaimed
> > > > > +from the local node instead of allocating remote memory. Unfortunately, the
> > > > > +cost of accessing remote nodes is higher so the system must choose by default
> > > > > +between favouring page aging or node locality. zone_distribute_mode controls
> > > > > +how the system will distribute page ages between zones.
> > > > > +
> > > > > +0	= Never round-robin based on age
> > > > 
> > > > I think we should be very conservative with the userspace interface we
> > > > export on a mechanism we are obviously just figuring out.
> > > > 
> > > 
> > > And we have a proposal on how to limit this. I'll be layering another
> > > patch on top that removes this interface again. That will allow us to
> > > roll back one patch and still have a usable interface if necessary.
> > > 
> > > > > +Otherwise the values are ORed together
> > > > > +
> > > > > +1	= Distribute anon pages between zones local to the allocating node
> > > > > +2	= Distribute file pages between zones local to the allocating node
> > > > > +4	= Distribute slab pages between zones local to the allocating node
> > > > 
> > > > Zone fairness within a node does not affect mempolicy or remote
> > > > reference costs.  Is there a reason to have this configurable?
> > > > 
> > > 
> > > Symmetry
> > > 
> > > > > +The following three flags effectively alter MPOL_DEFAULT; be careful.
> > > > > +
> > > > > +8	= Distribute anon pages between zones remote to the allocating node
> > > > > +16	= Distribute file pages between zones remote to the allocating node
> > > > > +32	= Distribute slab pages between zones remote to the allocating node
> > > > 
> > > > Yes, it's conceivable that somebody might want to disable remote
> > > > distribution because of the extra references.
> > > > 
> > > > But at this point, I'd much rather back out anon and slab distribution
> > > > entirely, it was a mistake to include them.
> > > > 
> > > > That would leave us with a single knob to disable remote page cache
> > > > placement.
> > > > 
> > > 
> > > When looking at this closer I found that sysv is a weird exception. It's
> > > file-backed as far as most of the VM is concerned but looks anonymous to
> > > most applications that care. That and MAP_SHARED anonymous pages should
> > > not be treated like files but we still want tmpfs to be treated as
> > > files. Details will be in the changelog of the next series.
> > 
> > In what sense is it seen as file-backed?
> 
> sysv and anonymous pages are backed by an internal shmem mount point. In
> lots of respects, it looks like a file and quacks like a file, but I expect
> developers think of it as anonymous, and chunks of the VM treat it like
> it's anonymous. tmpfs uses the same paths and gets treated by the VM
> similarly to anon, but users may think that tmpfs should be subject to the
> fair allocation zone policy "because they're files." It's a sufficiently
> weird case that any action we take there should be deliberate. It'll be
> a bit clearer when I post the patch that special-cases this.

The line I see here is mostly derived from performance expectations.

People and programs expect anon, shmem/tmpfs etc. to be fast and avoid
their reclaim at great costs, so they size this part of their workload
according to memory size and locality.  Filesystem cache (on-disk) on
the other hand is expected to be slow on the first fault and after it
has been displaced by other data, but the kernel is mostly expected to
maximize the caching effects in a predictable manner.

The round-robin policy makes the displacement predictable (think of
the aging artifacts here where random pages do not get displaced
reliably because they ended up on remote nodes) and it avoids IO by
maximizing memory utilization.

I.e. it improves behavior associated with a cache, but I don't expect
shmem/tmpfs to be typically used as a disk cache.  I could be wrong
about that, but I figure if you need named shared memory that is
bigger than your memory capacity (the point where your tmpfs would
actually turn into a disk cache), you'd be better off using a more
efficient on-disk filesystem.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-17 16:08           ` Mel Gorman
@ 2013-12-17 20:11             ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 20:11 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 04:08:08PM +0000, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 10:38:29AM -0500, Johannes Weiner wrote:
> > On Tue, Dec 17, 2013 at 11:13:52AM +0000, Mel Gorman wrote:
> > > On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > > > > zone_local is using node_distance which is a more expensive call than
> > > > > necessary. On x86, it's another function call in the allocator fast path
> > > > > and increases cache footprint. This patch makes the assumption zones on a
> > > > > local node will share the same node ID. The necessary information should
> > > > > already be cache hot.
> > > > > 
> > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > > ---
> > > > >  mm/page_alloc.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 64020eb..fd9677e 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > > > >  
> > > > >  static bool zone_local(struct zone *local_zone, struct zone *zone)
> > > > >  {
> > > > > -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > > > +	return zone_to_nid(zone) == numa_node_id();
> > > > 
> > > > Why numa_node_id()?  We pass in the preferred zone as @local_zone:
> > > > 
> > > 
> > > Initially because I was thinking "local node" and numa_node_id() is a
> > > per-cpu variable that should be cheap to access and in some cases
> > > cache-hot as the top-level gfp API calls numa_node_id().
> > > 
> > > Thinking about it more though it still makes sense because the preferred
> > > zone is not necessarily local. If the allocation request requires ZONE_DMA32
> > > and the local node does not have that zone then preferred zone is on a
> > > remote node.
> > 
> > Don't we treat everything in relation to the preferred zone?
> 
> Usually yes, but this time we really care about whether the memory is
> local or remote. It makes sense to me as it is, and I struggle to see an
> advantage of expressing it in terms of the preferred zone. Minimally,
> zone_local would need to be renamed if it could return true for a remote
> zone, and I see no advantage in doing that.

What the function tests for is whether any given zone is close
enough/local to the given preferred zone such that we can allocate
from it without having to invoke zone_reclaim_mode.

In your example, if the preferred DMA32 zone were to be on a remote
node and eligible for allocation but full, a DMA zone on the same node
should be fine as well and would not impose a higher remote reference
burden on the allocator than allocating from the preferred DMA32 zone.

So it's really not about the locality of the allocating task but about
the locality of the given preferred zone.

In my tree, I replaced the function body with

	return local_zone->node == zone->node;
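
Spelled out with the signature from the quoted patch, the whole helper
then reads:

	static bool zone_local(struct zone *local_zone, struct zone *zone)
	{
		return local_zone->node == zone->node;
	}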

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-17 20:11             ` Johannes Weiner
@ 2013-12-17 21:03               ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 21:03 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 03:11:47PM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 04:08:08PM +0000, Mel Gorman wrote:
> > On Tue, Dec 17, 2013 at 10:38:29AM -0500, Johannes Weiner wrote:
> > > On Tue, Dec 17, 2013 at 11:13:52AM +0000, Mel Gorman wrote:
> > > > On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> > > > > On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > > > > > zone_local is using node_distance which is a more expensive call than
> > > > > > necessary. On x86, it's another function call in the allocator fast path
> > > > > > and increases cache footprint. This patch makes the assumption zones on a
> > > > > > local node will share the same node ID. The necessary information should
> > > > > > already be cache hot.
> > > > > > 
> > > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > > > ---
> > > > > >  mm/page_alloc.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > index 64020eb..fd9677e 100644
> > > > > > --- a/mm/page_alloc.c
> > > > > > +++ b/mm/page_alloc.c
> > > > > > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > > > > >  
> > > > > >  static bool zone_local(struct zone *local_zone, struct zone *zone)
> > > > > >  {
> > > > > > -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > > > > +	return zone_to_nid(zone) == numa_node_id();
> > > > > 
> > > > > Why numa_node_id()?  We pass in the preferred zone as @local_zone:
> > > > > 
> > > > 
> > > > Initially because I was thinking "local node" and numa_node_id() is a
> > > > per-cpu variable that should be cheap to access and in some cases
> > > > cache-hot as the top-level gfp API calls numa_node_id().
> > > > 
> > > > Thinking about it more though it still makes sense because the preferred
> > > > zone is not necessarily local. If the allocation request requires ZONE_DMA32
> > > > and the local node does not have that zone then preferred zone is on a
> > > > remote node.
> > > 
> > > Don't we treat everything in relation to the preferred zone?
> > 
> > Usually yes, but this time we really care about whether the memory is
> > local or remote. It makes sense to me as it is, and I struggle to see an
> > advantage of expressing it in terms of the preferred zone. Minimally,
> > zone_local would need to be renamed if it could return true for a remote
> > zone, and I see no advantage in doing that.
> 
> What the function tests for is whether any given zone is close
> enough/local to the given preferred zone such that we can allocate
> from it without having to invoke zone_reclaim_mode.
> 

Fine. The helper should then be renamed to zone_preferred_node because
it's no longer about being local.
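
As a sketch, with the body you already have in your tree (final name to
taste):

	static bool zone_preferred_node(struct zone *preferred_zone,
					struct zone *zone)
	{
		return preferred_zone->node == zone->node;
	}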

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-17 17:43             ` Johannes Weiner
@ 2013-12-17 21:22               ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 21:22 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 12:43:02PM -0500, Johannes Weiner wrote:
> > > > When looking at this closer I found that sysv is a weird exception. It's
> > > > file-backed as far as most of the VM is concerned but looks anonymous to
> > > > most applications that care. That and MAP_SHARED anonymous pages should
> > > > not be treated like files but we still want tmpfs to be treated as
> > > > files. Details will be in the changelog of the next series.
> > > 
> > > In what sense is it seen as file-backed?
> > 
> > sysv and anonymous pages are backed by an internal shmem mount point. In
> > lots of respects, it looks like a file and quacks like a file, but I expect
> > developers think of it as anonymous, and chunks of the VM treat it like
> > it's anonymous. tmpfs uses the same paths and gets treated by the VM
> > similarly to anon, but users may think that tmpfs should be subject to the
> > fair allocation zone policy "because they're files." It's a sufficiently
> > weird case that any action we take there should be deliberate. It'll be
> > a bit clearer when I post the patch that special-cases this.
> 
> The line I see here is mostly derived from performance expectations.
> 
> People and programs expect anon, shmem/tmpfs etc. to be fast and avoid
> their reclaim at great costs, so they size this part of their workload
> according to memory size and locality.  Filesystem cache (on-disk) on
> the other hand is expected to be slow on the first fault and after it
> has been displaced by other data, but the kernel is mostly expected to
> maximize the caching effects in a predictable manner.
> 

Part of their performance expectations is that memory referenced from the
local node will be allocated locally. Consider NUMA-aware applications that
partition their data usage appropriately and share that data between tasks
using processes and shared memory (as some MPI implementations do). They
have an expectation that the memory will be local and a further expectation
that it will not be reclaimed because they sized it appropriately.
Automatically interleaving such memory by default will be surprising to
NUMA-aware applications even if NUMA-oblivious applications benefit.
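
To make the stakes concrete: today such applications get local placement
from MPOL_DEFAULT for free. If the default round-robined shared pages
across nodes, keeping the old guarantee would mean an explicit policy on
every shared segment. A minimal sketch of what that explicit binding
looks like (node 0 is arbitrary, mbind() comes from numaif.h, error
handling trimmed):

	#include <numaif.h>
	#include <sys/mman.h>
	#include <stdio.h>

	int main(void)
	{
		size_t len = 1UL << 20;
		unsigned long nodemask = 1UL;	/* bit 0 => node 0 */
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		/* pin the shared segment to the chosen node explicitly */
		if (mbind(buf, len, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, 0)) {
			perror("mbind");
			return 1;
		}
		return 0;
	}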

Similarly, the pagecache sysctl is documented to affect files, at least
that's how I wrote it. It's inconsistent to explain that as "the sysctl
controls files, except for tmpfs ones because ...... whatever".

> The round-robin policy makes the displacement predictable (think of
> the aging artifacts here where random pages do not get displaced
> reliably because they ended up on remote nodes) and it avoids IO by
> maximizing memory utilization.
> 
> I.e. it improves behavior associated with a cache, but I don't expect
> shmem/tmpfs to be typically used as a disk cache.  I could be wrong
> about that, but I figure if you need named shared memory that is
> bigger than your memory capacity (the point where your tmpfs would
> actually turn into a disk cache), you'd be better off using a more
> efficient on-disk filesystem.

I am concerned with semantics like "all files except tmpfs files" or
alternatively regressing performance of NUMA-aware applications and their
use of MAP_SHARED and sysv.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6
  2013-12-17 15:07   ` Zlatko Calusic
@ 2013-12-17 21:23     ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 21:23 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Rik van Riel,
	Linux-MM, LKML

On Tue, Dec 17, 2013 at 04:07:35PM +0100, Zlatko Calusic wrote:
> On 13.12.2013 15:10, Mel Gorman wrote:
> >Kicked this another bit today. It's still a bit half-baked but it restores
> >the historical performance and leaves the door open at the end for playing
> >nice with distributing file pages between nodes. Finishing this series
> >depends on whether we are going to make the remote node behaviour of the
> >fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
> >favour of the configurable option because the default can be redefined and
> >tested while giving users a "compat" mode if we discover the new default
> >behaviour sucks for some workload.
> >
> 
> I'll start a 5-day test of this patchset in a few hours, unless you
> can send an updated one in the meantime. I intend to test it on a
> rather boring 4GB x86_64 machine that before Johannes' work had lots
> of trouble balancing zones. Would you recommend using the default
> settings, i.e. not messing with the tunables at this point?
> 

For me at least, I would prefer you tested v3 of the series with the
default settings, i.e. not interleaving file-backed pages on remote
nodes. Johannes might request testing with that knob enabled if the
machine is NUMA, although I doubt it is with 4G of RAM.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-17 21:03               ` Mel Gorman
@ 2013-12-17 22:31                 ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 22:31 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 09:03:40PM +0000, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 03:11:47PM -0500, Johannes Weiner wrote:
> > On Tue, Dec 17, 2013 at 04:08:08PM +0000, Mel Gorman wrote:
> > > On Tue, Dec 17, 2013 at 10:38:29AM -0500, Johannes Weiner wrote:
> > > > On Tue, Dec 17, 2013 at 11:13:52AM +0000, Mel Gorman wrote:
> > > > > On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> > > > > > On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > > > > > > zone_local is using node_distance which is a more expensive call than
> > > > > > > necessary. On x86, it's another function call in the allocator fast path
> > > > > > > and increases cache footprint. This patch makes the assumption zones on a
> > > > > > > local node will share the same node ID. The necessary information should
> > > > > > > already be cache hot.
> > > > > > > 
> > > > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > > > > ---
> > > > > > >  mm/page_alloc.c | 2 +-
> > > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > > index 64020eb..fd9677e 100644
> > > > > > > --- a/mm/page_alloc.c
> > > > > > > +++ b/mm/page_alloc.c
> > > > > > > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > > > > > >  
> > > > > > >  static bool zone_local(struct zone *local_zone, struct zone *zone)
> > > > > > >  {
> > > > > > > -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > > > > > +	return zone_to_nid(zone) == numa_node_id();
> > > > > > 
> > > > > > Why numa_node_id()?  We pass in the preferred zone as @local_zone:
> > > > > > 
> > > > > 
> > > > > Initially because I was thinking "local node" and numa_node_id() is a
> > > > > per-cpu variable that should be cheap to access and in some cases
> > > > > cache-hot as the top-level gfp API calls numa_node_id().
> > > > > 
> > > > > Thinking about it more though it still makes sense because the preferred
> > > > > zone is not necessarily local. If the allocation request requires ZONE_DMA32
> > > > > and the local node does not have that zone then preferred zone is on a
> > > > > remote node.
> > > > 
> > > > Don't we treat everything in relation to the preferred zone?
> > > 
> > > Usually yes, but this time we really care about whether the memory is
> > > local or remote. It makes sense to me as it is, and I struggle to see an
> > > advantage of expressing it in terms of the preferred zone. Minimally,
> > > zone_local would need to be renamed if it could return true for a remote
> > > zone, and I see no advantage in doing that.
> > 
> > What the function tests for is whether any given zone is close
> > enough/local to the given preferred zone such that we can allocate
> > from it without having to invoke zone_reclaim_mode.
> > 
> 
> Fine. The helper should then be renamed to zone_preferred_node because
> it's no longer about being local.

Fair enough!

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-17 21:22               ` Mel Gorman
@ 2013-12-17 22:57                 ` Johannes Weiner
  -1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 22:57 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 09:22:16PM +0000, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 12:43:02PM -0500, Johannes Weiner wrote:
> > > > > When looking at this closer I found that sysv is a weird exception. It's
> > > > > file-backed as far as most of the VM is concerned but looks anonymous to
> > > > > most applications that care. That and MAP_SHARED anonymous pages should
> > > > > not be treated like files but we still want tmpfs to be treated as
> > > > > files. Details will be in the changelog of the next series.
> > > > 
> > > > In what sense is it seen as file-backed?
> > > 
> > > sysv and anonymous pages are backed by an internal shmem mount point. In
> > > > lots of respects, it looks like a file and quacks like a file, but I expect
> > > > developers think of it as anonymous, and chunks of the VM treat it as if
> > > > it were anonymous. tmpfs uses the same paths and is treated by the VM
> > > > much like anon, but users may think that tmpfs should be subject to the
> > > > fair allocation zone policy "because they're files." It's a sufficiently
> > > > weird case that any action we take there should be deliberate. It'll be
> > > a bit clearer when I post the patch that special cases this.
> > 
> > The line I see here is mostly derived from performance expectations.
> > 
> > People and programs expect anon, shmem/tmpfs etc. to be fast and avoid
> > their reclaim at great cost, so they size this part of their workload
> > according to memory size and locality.  Filesystem cache (on-disk) on
> > the other hand is expected to be slow on the first fault and after it
> > has been displaced by other data, but the kernel is mostly expected to
> > maximize the caching effects in a predictable manner.
> > 
> 
> Part of their performance expectations is that memory referenced from the
> local node will be allocated locally. Consider NUMA-aware applications that
> partition their data usage appropriately and share that data between threads
> using processes and shared memory (some MPI implementations). They have
> an expectation that the memory will be local and a further expectation
> that it will not be reclaimed because they sized it appropriately.
> Automatically interleaving such memory by default will be surprising to
> NUMA-aware applications, even if NUMA-oblivious applications benefit.

That's exactly why I want to exclude any type of data that is
typically sized to memory capacity.  Are we talking past each other?

> Similarly, the pagecache sysctl is documented to affect files, at least
> that's how I wrote it. It's inconsistent to explain that as "the sysctl
> controls files, except for tmpfs ones because ...... whatever".

I documented it as affecting secondary storage cache.

> > The round-robin policy makes the displacement predictable (think of
> > the aging artifacts here where random pages do not get displaced
> > reliably because they ended up on remote nodes) and it avoids IO by
> > maximizing memory utilization.
> > 
> > I.e. it improves behavior associated with a cache, but I don't expect
> > shmem/tmpfs to be typically used as a disk cache.  I could be wrong
> > about that, but I figure if you need named shared memory that is
> > bigger than your memory capacity (the point where your tmpfs would
> > actually turn into a disk cache), you'd be better off using a more
> > efficient on-disk filesystem.
> 
> I am concerned with semantics like "all files except tmpfs files" or
> alternatively regressing performance of NUMA-aware applications and their
> use of MAP_SHARED and sysv.

I'm really not following.  MAP_SHARED, sysv, shmem, tmpfs, whatever is
entirely unaffected by my proposal.  I never claimed "all files except
tmpfs".  It's about what backs the data, which what makes a difference
in people's performance expectation, which makes a difference in how
they size the workloads.

Tmpfs files that may overflow into swap on heavy memory pressure have
an entirely different trade-off than actual cache that is continuously
replaced as part of its size management, and in that sense they are
much closer to anon and sysv shared memory.  I don't believe that the
difference between virtual in-core filesystems and actual secondary
storage filesystems is so obscure to users that this behavioral
difference would violate expectations of the term "file".

Is that what you are saying or am I missing something?

* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-17 22:57                 ` Johannes Weiner
@ 2013-12-17 23:24                   ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 23:24 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 05:57:16PM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 09:22:16PM +0000, Mel Gorman wrote:
> > On Tue, Dec 17, 2013 at 12:43:02PM -0500, Johannes Weiner wrote:
> > > > > > When looking at this closer I found that sysv is a weird exception. It's
> > > > > > file-backed as far as most of the VM is concerned but looks anonymous to
> > > > > > most applications that care. That and MAP_SHARED anonymous pages should
> > > > > > not be treated like files but we still want tmpfs to be treated as
> > > > > > files. Details will be in the changelog of the next series.
> > > > > 
> > > > > In what sense is it seen as file-backed?
> > > > 
> > > > sysv and anonymous pages are backed by an internal shmem mount point. In
> > > > lots of respects, it looks like a file and quacks like a file, but I expect
> > > > developers think of it as anonymous, and chunks of the VM treat it as if
> > > > it were anonymous. tmpfs uses the same paths and is treated by the VM
> > > > much like anon, but users may think that tmpfs should be subject to the
> > > > fair allocation zone policy "because they're files." It's a sufficiently
> > > > weird case that any action we take there should be deliberate. It'll be
> > > > a bit clearer when I post the patch that special cases this.
> > > 
> > > The line I see here is mostly derived from performance expectations.
> > > 
> > > People and programs expect anon, shmem/tmpfs etc. to be fast and avoid
> > > their reclaim at great cost, so they size this part of their workload
> > > according to memory size and locality.  Filesystem cache (on-disk) on
> > > the other hand is expected to be slow on the first fault and after it
> > > has been displaced by other data, but the kernel is mostly expected to
> > > maximize the caching effects in a predictable manner.
> > > 
> > 
> > Part of their performance expectations is that memory referenced from the
> > local node will be allocated locally. Consider NUMA-aware applications that
> > partition their data usage appropriately and share that data between threads
> > using processes and shared memory (some MPI implementations). They have
> > an expectation that the memory will be local and a further expectation
> > that it will not be reclaimed because they sized it appropriately.
> > Automatically interleaving such memory by default will be surprising to
> > NUMA-aware applications, even if NUMA-oblivious applications benefit.
> 
> That's exactly why I want to exclude any type of data that is
> typically sized to memory capacity.  Are we talking past each other?
> 

No, we're not, but I'm concerned that your treatment of shmem ends up
being inconsistent. Your proposal, as I see it, leaves two choices (see
the sketch below):

a) leave it alone. We get proper behaviour for MAP_SHARED anonymous and
   sysv, but tmpfs is different to every other filesystem
b) interleave shmem. tmpfs is consistent with other filesystems, but
   MAP_SHARED anonymous and sysv are surprising
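
To make the distinction concrete, a rough sketch of where (a) and (b)
would diverge. The flag name __GFP_PAGECACHE is assumed here as a
placeholder for the page cache annotation in the spirit of patch 4; it
is my invention, not a flag from the posted series:

/*
 * Hypothetical: which allocations are eligible for remote interleaving
 * under the fair zone policy.  Under (a), shmem allocations are never
 * annotated as page cache, so tmpfs stays local like anon.  Under (b),
 * shmem allocations are annotated too, so tmpfs interleaves like any
 * other filesystem, but MAP_SHARED anonymous and sysv, which share the
 * same allocation paths, get dragged along with it.
 */
static bool fair_policy_eligible(gfp_t gfp_mask)
{
	return (gfp_mask & __GFP_PAGECACHE) != 0;
}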

> > Similarly, the pagecache sysctl is documented to affect files, at least
> > that's how I wrote it. It's inconsistent to explain that as "the sysctl
> > controls files, except for tmpfs ones because ...... whatever".
> 
> I documented it as affecting secondary storage cache.
> 

That is very subtle and a bit weird to me. Arguably, tmpfs is also driven
by secondary storage where the storage happens to be swap. It's still "files
except for tmpfs files because they're special". That's why I'm
uncomfortable with it.

> > > The round-robin policy makes the displacement predictable (think of
> > > the aging artifacts here where random pages do not get displaced
> > > reliably because they ended up on remote nodes) and it avoids IO by
> > > maximizing memory utilization.
> > > 
> > > I.e. it improves behavior associated with a cache, but I don't expect
> > > shmem/tmpfs to be typically used as a disk cache.  I could be wrong
> > > about that, but I figure if you need named shared memory that is
> > > bigger than your memory capacity (the point where your tmpfs would
> > > actually turn into a disk cache), you'd be better off using a more
> > > efficient on-disk filesystem.
> > 
> > I am concerned with semantics like "all files except tmpfs files" or
> > alternatively regressing performance of NUMA-aware applications and their
> > use of MAP_SHARED and sysv.
> 
> I'm really not following.  MAP_SHARED, sysv, shmem, tmpfs, whatever is
> entirely unaffected by my proposal. 

I understand; it's tmpfs being different to every other filesystem that
I'm not happy with. From a VM perspective it makes some sense, but from
a user perspective it just looks weird.

> I never claimed "all files except
> tmpfs".  It's about what backs the data, which what makes a difference
> in people's performance expectation, which makes a difference in how
> they size the workloads.
> 

So potentially applications have to stat the file they are mapping if they
want to understand what memory policy applies.
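
For illustration, a minimal userspace sketch of that kind of probing,
using statfs(2) and TMPFS_MAGIC from linux/magic.h. This is my example
of the check being described, not an interface from the series:

#include <sys/vfs.h>
#include <linux/magic.h>

/*
 * Report whether a path is tmpfs-backed, i.e. whether it would behave
 * like anon or like ordinary page cache under the policies being
 * debated.  Returns 1 for tmpfs, 0 otherwise, -1 on error.
 */
static int is_tmpfs_backed(const char *path)
{
	struct statfs st;

	if (statfs(path, &st) != 0)
		return -1;
	return st.f_type == TMPFS_MAGIC;
}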

> Tmpfs files that may overflow into swap on heavy memory pressure have
> an entirely different trade-off than actual cache that is continuously
> replaced as part of its size management, and in that sense they are
> much closer to anon and sysv shared memory. 

Again, from a VM perspective I understand what you're suggesting, but
from the perspective of an application that is mapping files, it's a
tricky interface.

In terms of restoring historical behaviour in 3.12 and for 3.13, I think
my approach is the more conservative and the least surprising to users. We
can bash out whether to default to remote interleaving or special-case
tmpfs in 3.14.

-- 
Mel Gorman
SUSE Labs

* Re: [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6
  2013-12-17 21:23     ` Mel Gorman
@ 2013-12-21 16:03       ` Zlatko Calusic
  -1 siblings, 0 replies; 84+ messages in thread
From: Zlatko Calusic @ 2013-12-21 16:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Rik van Riel,
	Linux-MM, LKML

On 17.12.2013 22:23, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 04:07:35PM +0100, Zlatko Calusic wrote:
>> On 13.12.2013 15:10, Mel Gorman wrote:
>>> Kicked this another bit today. It's still a bit half-baked but it restores
>>> the historical performance and leaves the door open at the end for playing
>>> nice with distributing file pages between nodes. Finishing this series
>>> depends on whether we are going to make the remote node behaviour of the
>>> fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
>>> favour of the configurable option because the default can be redefined and
>>> tested while giving users a "compat" mode if we discover the new default
>>> behaviour sucks for some workload.
>>>
>>
>> I'll start a 5-day test of this patchset in a few hours, unless you
>> can send an updated one in the meantime. I intend to test it on a
>> rather boring 4GB x86_64 machine that before Johannes' work had lots
>> of trouble balancing zones. Would you recommend to use the default
>> settings, i.e. don't mess with tunables at this point?
>>
>
> For me at least I would prefer you tested v3 of the series with the
> default settings of not interleaving file-backed pages on remote nodes
> by default. Johannes might request testing with that knob enabled if the
> machine is NUMA although I doubt it is with 4G of RAM.
>

Tested v3 on a UMA machine, with default settings. I see no regression, no
issues whatsoever. From what I understand, this whole series is about
fixing issues noticed on NUMA, so I wish you good luck with that (no
such hardware here). Just be extra careful not to disturb the finally
very well balanced MM on more common machines (and especially those
equipped with 4GB RAM). And once again, thank you Johannes for your
work, you did a great job.

Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
-- 
Zlatko


* Re: [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6
  2013-12-21 16:03       ` Zlatko Calusic
@ 2013-12-23 10:26         ` Mel Gorman
  -1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-23 10:26 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Rik van Riel,
	Linux-MM, LKML

On Sat, Dec 21, 2013 at 05:03:43PM +0100, Zlatko Calusic wrote:
> On 17.12.2013 22:23, Mel Gorman wrote:
> >On Tue, Dec 17, 2013 at 04:07:35PM +0100, Zlatko Calusic wrote:
> >>On 13.12.2013 15:10, Mel Gorman wrote:
> >>>Kicked this another bit today. It's still a bit half-baked but it restores
> >>>the historical performance and leaves the door open at the end for playing
> >>>nice with distributing file pages between nodes. Finishing this series
> >>>depends on whether we are going to make the remote node behaviour of the
> >>>fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
> >>>favour of the configurable option because the default can be redefined and
> >>>tested while giving users a "compat" mode if we discover the new default
> >>>behaviour sucks for some workload.
> >>>
> >>
> >>I'll start a 5-day test of this patchset in a few hours, unless you
> >>can send an updated one in the meantime. I intend to test it on a
> >>rather boring 4GB x86_64 machine that before Johannes' work had lots
> >>of trouble balancing zones. Would you recommend to use the default
> >>settings, i.e. don't mess with tunables at this point?
> >>
> >
> >For me at least I would prefer you tested v3 of the series with the
> >default settings of not interleaving file-backed pages on remote nodes
> >by default. Johannes might request testing with that knob enabled if the
> >machine is NUMA although I doubt it is with 4G of RAM.
> >
> 
> Tested v3 on a UMA machine, with default settings. I see no regression,
> no issues whatsoever. From what I understand, this whole series is
> about fixing issues noticed on NUMA, so I wish you good luck with
> that (no such hardware here). Just be extra careful not to disturb
> the finally very well balanced MM on more common machines (and
> especially those equipped with 4GB RAM). And once again, thank you
> Johannes for your work, you did a great job.
> 
> Tested-by: Zlatko Calusic <zcalusic@bitsync.net>

Thanks for testing. Even though this series is about NUMA, it preserves
the fair zone allocation policy on UMA that your workload depends upon.

-- 
Mel Gorman
SUSE Labs

Thread overview: 84+ messages
2013-12-13 14:10 [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6 Mel Gorman
2013-12-13 14:10 ` [PATCH 1/7] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
2013-12-13 15:45   ` Rik van Riel
2013-12-13 14:10 ` [PATCH 2/7] mm: page_alloc: Break out zone page aging distribution into its own helper Mel Gorman
2013-12-13 15:46   ` Rik van Riel
2013-12-16 20:16   ` Johannes Weiner
2013-12-13 14:10 ` [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality Mel Gorman
2013-12-16 13:20   ` Rik van Riel
2013-12-16 20:25   ` Johannes Weiner
2013-12-17 11:13     ` Mel Gorman
2013-12-17 15:38       ` Johannes Weiner
2013-12-17 16:08         ` Mel Gorman
2013-12-17 20:11           ` Johannes Weiner
2013-12-17 21:03             ` Mel Gorman
2013-12-17 22:31               ` Johannes Weiner
2013-12-13 14:10 ` [PATCH 4/7] mm: Annotate page cache allocations Mel Gorman
2013-12-16 15:20   ` Rik van Riel
2013-12-13 14:10 ` [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable Mel Gorman
2013-12-16 19:25   ` Rik van Riel
2013-12-16 20:42   ` Johannes Weiner
2013-12-17 15:29     ` Mel Gorman
2013-12-17 15:54       ` Johannes Weiner
2013-12-17 16:14         ` Mel Gorman
2013-12-17 17:43           ` Johannes Weiner
2013-12-17 21:22             ` Mel Gorman
2013-12-17 22:57               ` Johannes Weiner
2013-12-17 23:24                 ` Mel Gorman
2013-12-13 14:10 ` [PATCH 6/7] mm: page_alloc: Only account batch allocations requests that are eligible Mel Gorman
2013-12-16 20:52   ` Johannes Weiner
2013-12-17 11:20     ` Mel Gorman
2013-12-17 15:43       ` Johannes Weiner
2013-12-17 16:06         ` Mel Gorman
2013-12-13 14:10 ` [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy Mel Gorman
2013-12-13 17:04   ` Johannes Weiner
2013-12-13 19:20     ` Mel Gorman
2013-12-13 22:15       ` Johannes Weiner
2013-12-17 16:04         ` Mel Gorman
2013-12-16 19:26   ` Rik van Riel
2013-12-17 15:07 ` [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6 Zlatko Calusic
2013-12-17 21:23   ` Mel Gorman
2013-12-21 16:03     ` Zlatko Calusic
2013-12-23 10:26       ` Mel Gorman
