* [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6
From: Mel Gorman @ 2013-12-13 14:10 UTC
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman
Kicked this around a bit more today. It's still a bit half-baked but it restores
the historical performance and leaves the door open at the end for playing
nice with distributing file pages between nodes. Finishing this series
depends on whether we are going to make the remote node behaviour of the
fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
favour of the configurable option because the default can be redefined and
tested while giving users a "compat" mode if we discover the new default
behaviour sucks for some workload.
Changelog since v1
o Fix a lot of brain damage in the configurable policy patch
o Yoink a page cache annotation patch
o Only account batch pages against allocations eligible for the fair policy
o Add patch that distributes file pages on remote nodes by default
Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
bug whereby new pages could be reclaimed before old pages because of how
the page allocator and kswapd interacted on the per-zone LRU lists.
Unfortunately a side-effect missed during review was that it's now very
easy to allocate remote memory on NUMA machines. The problem is that
it is not a simple case of just restoring local allocation policies as
there are genuine reasons why global page aging may be preferable. It's
still a major change to default behaviour so the final patch makes the
policy configurable and sets what I think is a sensible default.
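For readers unfamiliar with what the fair zone policy does, here is a toy
model of the batching scheme (plain Python, not kernel code; the zone
names and sizes are invented for illustration): each zone gets an
allocation batch proportional to its size and allocations rotate across
zones as batches drain, which keeps relative LRU ages balanced -- but on
NUMA the zonelist includes remote zones, so they receive a share too.

```python
# Toy model of the fair zone allocator policy: each zone in the zonelist
# gets an allocation "batch" proportional to its size, and the allocator
# moves to the next zone when the current zone's batch is exhausted.
# Zone names and sizes below are made up for illustration.
def fair_allocate(zones, nr_pages):
    """zones: list of (name, managed_pages); returns per-zone page counts."""
    total = sum(size for _, size in zones)
    batch = {name: 0 for name, _ in zones}
    counts = {name: 0 for name, _ in zones}
    allocated = 0
    while allocated < nr_pages:
        # Refill every batch once all zones' batches are exhausted.
        if all(batch[name] == 0 for name, _ in zones):
            for name, size in zones:
                batch[name] = max(1, size * 100 // total)
        for name, _ in zones:
            if batch[name] > 0 and allocated < nr_pages:
                batch[name] -= 1
                counts[name] += 1
                allocated += 1
    return counts

# Two zones sized 3:1 receive allocations in a 3:1 ratio, which keeps
# their LRU ages balanced (the bug fix) but also spills allocations to
# the remote zone (the side-effect).
print(fair_allocate([("node0-Normal", 3000), ("node1-Normal", 1000)], 4000))
# -> {'node0-Normal': 3000, 'node1-Normal': 1000}
```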
The patches are on top of some NUMA balancing patches currently in -mm.
The first patch in the series is a patch posted by Johannes that must be
taken into account before any of my patches on top. The last patch of the
series is what alters default behaviour and makes the fair zone allocator
policy configurable.
Sniff test results are based on the following kernels:
vanilla 3.13-rc3 stock
instrument-v2r1 NUMA balancing patches just to rule out any conflicts there
lruslabonly-v1r2 Patch 1 only
local-v2r6 Patches 1-5 to restore local memory allocations
acct-v2r6 Patches 1-6 to include an accounting adjustment
remotefile-v2r6 Patches 1-7 that break MPOL_LOCAL by interleaving file pages
kernbench
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
User min 1417.32 ( 0.00%) 1408.52 ( 0.62%) 1414.92 ( 0.17%) 1403.37 ( 0.98%) 1410.55 ( 0.48%) 1405.85 ( 0.81%)
User mean 1419.10 ( 0.00%) 1415.39 ( 0.26%) 1417.31 ( 0.13%) 1409.89 ( 0.65%) 1411.40 ( 0.54%) 1410.78 ( 0.59%)
User stddev 2.25 ( 0.00%) 4.51 (-100.33%) 2.44 ( -8.29%) 3.98 (-76.92%) 0.74 ( 66.98%) 2.94 (-30.81%)
User max 1422.92 ( 0.00%) 1421.05 ( 0.13%) 1421.90 ( 0.07%) 1415.39 ( 0.53%) 1412.55 ( 0.73%) 1413.99 ( 0.63%)
User range 5.60 ( 0.00%) 12.53 (-123.75%) 6.98 (-24.64%) 12.02 (-114.64%) 2.00 ( 64.29%) 8.14 (-45.36%)
System min 114.83 ( 0.00%) 114.09 ( 0.64%) 114.50 ( 0.29%) 110.16 ( 4.07%) 110.44 ( 3.82%) 110.49 ( 3.78%)
System mean 115.89 ( 0.00%) 115.01 ( 0.76%) 115.12 ( 0.67%) 110.73 ( 4.46%) 111.20 ( 4.05%) 111.17 ( 4.08%)
System stddev 0.63 ( 0.00%) 0.57 ( 10.42%) 0.40 ( 37.04%) 0.48 ( 24.87%) 0.51 ( 19.41%) 0.43 ( 32.60%)
System max 116.81 ( 0.00%) 115.87 ( 0.80%) 115.52 ( 1.10%) 111.47 ( 4.57%) 111.98 ( 4.13%) 111.63 ( 4.43%)
System range 1.98 ( 0.00%) 1.78 ( 10.10%) 1.02 ( 48.48%) 1.31 ( 33.84%) 1.54 ( 22.22%) 1.14 ( 42.42%)
Elapsed min 42.90 ( 0.00%) 43.96 ( -2.47%) 42.85 ( 0.12%) 43.02 ( -0.28%) 42.55 ( 0.82%) 42.75 ( 0.35%)
Elapsed mean 43.58 ( 0.00%) 44.16 ( -1.34%) 43.88 ( -0.69%) 43.87 ( -0.67%) 43.58 ( -0.00%) 43.80 ( -0.50%)
Elapsed stddev 0.74 ( 0.00%) 0.17 ( 77.41%) 0.61 ( 17.23%) 1.00 (-35.26%) 0.67 ( 9.46%) 0.82 ( -9.88%)
Elapsed max 44.52 ( 0.00%) 44.45 ( 0.16%) 44.55 ( -0.07%) 45.72 ( -2.70%) 44.24 ( 0.63%) 45.09 ( -1.28%)
Elapsed range 1.62 ( 0.00%) 0.49 ( 69.75%) 1.70 ( -4.94%) 2.70 (-66.67%) 1.69 ( -4.32%) 2.34 (-44.44%)
CPU min 3451.00 ( 0.00%) 3455.00 ( -0.12%) 3434.00 ( 0.49%) 3311.00 ( 4.06%) 3439.00 ( 0.35%) 3377.00 ( 2.14%)
CPU mean 3522.40 ( 0.00%) 3464.60 ( 1.64%) 3492.40 ( 0.85%) 3467.40 ( 1.56%) 3493.80 ( 0.81%) 3475.40 ( 1.33%)
CPU stddev 54.34 ( 0.00%) 9.05 ( 83.35%) 54.80 ( -0.85%) 86.04 (-58.33%) 54.99 ( -1.18%) 67.75 (-24.68%)
CPU max 3570.00 ( 0.00%) 3480.00 ( 2.52%) 3587.00 ( -0.48%) 3545.00 ( 0.70%) 3578.00 ( -0.22%) 3568.00 ( 0.06%)
CPU range 119.00 ( 0.00%) 25.00 ( 78.99%) 153.00 (-28.57%) 234.00 (-96.64%) 139.00 (-16.81%) 191.00 (-60.50%)
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
              vanilla  instrument-v2r1  lruslabonly-v2r1  local-v2r6  acct-v2r6  remotefile-v2r6
User 8540.49 8516.04 8524.28 8487.25 8488.89 8487.40
System 706.31 701.72 701.20 674.29 675.81 676.52
Elapsed 307.58 311.31 309.72 309.51 308.32 310.36
The kernbench figures themselves are not that compelling but the system
CPU cost is down a lot. System time is such a small percentage of the
overall workload that it doesn't really matter, and the processes are
short-lived anyway.
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
              vanilla  instrument-v2r1  lruslabonly-v2r1  local-v2r6  acct-v2r6  remotefile-v2r6
NUMA alloc hit 73783951 73086669 73385508 93373651 93326068 93321444
NUMA alloc miss 20013534 20247750 19958857 102 118 2129
NUMA interleave hit 0 0 0 0 0 0
NUMA alloc local 73783935 73086658 73385501 93373644 93326059 93321436
NUMA miss rates are reduced by using the local policy although they
really should have been zero. I suspect it's the __GFP_PAGECACHE
annotation patch and how it's treated but I have not proven it. The
miss stats go up again for the final patch as page cache pages get
distributed between nodes.
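As a quick sanity check, the local allocation percentage implied by
those counters can be computed directly (figures lifted from the table
above; the counters correspond to numa_hit/numa_miss/numa_local in
/proc/vmstat):

```python
# Local allocation percentage implied by the NUMA counters above:
# numa_local as a share of all satisfied allocations (hit + miss).
def local_pct(hit, miss, local):
    return 100.0 * local / (hit + miss)

# vanilla vs local-v2r6 figures from the table
print(f"vanilla:    {local_pct(73783951, 20013534, 73783935):.1f}% local")
print(f"local-v2r6: {local_pct(93373651, 102, 93373644):.4f}% local")
```

Roughly 79% of vanilla's allocations were node-local versus effectively
100% with the local policy restored.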
vmr-stream
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
Add 5M 3809.80 ( 0.00%) 3783.21 ( -0.70%) 3790.61 ( -0.50%) 3970.34 ( 4.21%) 3975.29 ( 4.34%) 3992.15 ( 4.79%)
Copy 5M 3360.75 ( 0.00%) 3345.59 ( -0.45%) 3351.99 ( -0.26%) 3474.69 ( 3.39%) 3472.97 ( 3.34%) 3474.32 ( 3.38%)
Scale 5M 3160.39 ( 0.00%) 3163.43 ( 0.10%) 3159.88 ( -0.02%) 3393.56 ( 7.38%) 3391.85 ( 7.32%) 3393.76 ( 7.38%)
Triad 5M 3533.04 ( 0.00%) 3517.67 ( -0.43%) 3526.18 ( -0.19%) 3856.20 ( 9.15%) 3851.39 ( 9.01%) 3855.89 ( 9.14%)
Add 7M 3789.82 ( 0.00%) 3789.03 ( -0.02%) 3779.30 ( -0.28%) 4049.53 ( 6.85%) 4001.74 ( 5.59%) 3968.84 ( 4.72%)
Copy 7M 3345.85 ( 0.00%) 3355.75 ( 0.30%) 3354.56 ( 0.26%) 3484.62 ( 4.15%) 3477.23 ( 3.93%) 3474.17 ( 3.84%)
Scale 7M 3176.00 ( 0.00%) 3156.09 ( -0.63%) 3152.84 ( -0.73%) 3401.53 ( 7.10%) 3393.55 ( 6.85%) 3392.46 ( 6.82%)
Triad 7M 3528.85 ( 0.00%) 3521.99 ( -0.19%) 3515.20 ( -0.39%) 3861.55 ( 9.43%) 3853.51 ( 9.20%) 3853.30 ( 9.19%)
Add 8M 3801.60 ( 0.00%) 3781.66 ( -0.52%) 3788.19 ( -0.35%) 3957.73 ( 4.11%) 4002.30 ( 5.28%) 4006.69 ( 5.39%)
Copy 8M 3364.64 ( 0.00%) 3346.31 ( -0.54%) 3353.71 ( -0.32%) 3469.62 ( 3.12%) 3476.25 ( 3.32%) 3473.67 ( 3.24%)
Scale 8M 3169.34 ( 0.00%) 3163.10 ( -0.20%) 3157.99 ( -0.36%) 3391.61 ( 7.01%) 3395.76 ( 7.14%) 3393.20 ( 7.06%)
Triad 8M 3531.38 ( 0.00%) 3514.83 ( -0.47%) 3518.55 ( -0.36%) 3850.45 ( 9.04%) 3853.39 ( 9.12%) 3849.50 ( 9.01%)
Add 10M 3807.95 ( 0.00%) 3791.80 ( -0.42%) 3781.86 ( -0.69%) 3977.13 ( 4.44%) 4005.95 ( 5.20%) 3983.31 ( 4.61%)
Copy 10M 3365.64 ( 0.00%) 3361.59 ( -0.12%) 3352.03 ( -0.40%) 3473.78 ( 3.21%) 3479.54 ( 3.38%) 3471.70 ( 3.15%)
Scale 10M 3172.71 ( 0.00%) 3157.52 ( -0.48%) 3149.26 ( -0.74%) 3395.59 ( 7.02%) 3397.28 ( 7.08%) 3394.50 ( 6.99%)
Triad 10M 3536.15 ( 0.00%) 3524.46 ( -0.33%) 3517.36 ( -0.53%) 3854.88 ( 9.01%) 3857.55 ( 9.09%) 3853.00 ( 8.96%)
Add 14M 3787.56 ( 0.00%) 3789.36 ( 0.05%) 3780.55 ( -0.19%) 4009.14 ( 5.85%) 4019.90 ( 6.13%) 3966.93 ( 4.74%)
Copy 14M 3345.19 ( 0.00%) 3361.79 ( 0.50%) 3338.99 ( -0.19%) 3483.34 ( 4.13%) 3480.38 ( 4.04%) 3470.79 ( 3.75%)
Scale 14M 3154.55 ( 0.00%) 3155.60 ( 0.03%) 3154.74 ( 0.01%) 3398.70 ( 7.74%) 3396.31 ( 7.66%) 3392.50 ( 7.54%)
Triad 14M 3522.09 ( 0.00%) 3517.21 ( -0.14%) 3514.90 ( -0.20%) 3861.09 ( 9.62%) 3854.76 ( 9.45%) 3852.52 ( 9.38%)
Add 17M 3806.34 ( 0.00%) 3770.18 ( -0.95%) 3774.21 ( -0.84%) 3982.37 ( 4.62%) 4015.73 ( 5.50%) 3979.61 ( 4.55%)
Copy 17M 3368.39 ( 0.00%) 3334.84 ( -1.00%) 3349.84 ( -0.55%) 3480.15 ( 3.32%) 3481.29 ( 3.35%) 3470.75 ( 3.04%)
Scale 17M 3169.18 ( 0.00%) 3164.25 ( -0.16%) 3148.23 ( -0.66%) 3398.11 ( 7.22%) 3398.69 ( 7.24%) 3389.32 ( 6.95%)
Triad 17M 3535.05 ( 0.00%) 3510.90 ( -0.68%) 3511.84 ( -0.66%) 3860.14 ( 9.20%) 3859.64 ( 9.18%) 3848.12 ( 8.86%)
Add 21M 3795.31 ( 0.00%) 3804.70 ( 0.25%) 3795.15 ( -0.00%) 4017.03 ( 5.84%) 4029.35 ( 6.17%) 3988.21 ( 5.08%)
Copy 21M 3353.43 ( 0.00%) 3365.89 ( 0.37%) 3351.05 ( -0.07%) 3482.88 ( 3.86%) 3478.62 ( 3.73%) 3479.29 ( 3.75%)
Scale 21M 3160.96 ( 0.00%) 3170.91 ( 0.31%) 3167.45 ( 0.21%) 3398.76 ( 7.52%) 3394.56 ( 7.39%) 3397.91 ( 7.50%)
Triad 21M 3530.45 ( 0.00%) 3533.62 ( 0.09%) 3529.35 ( -0.03%) 3862.25 ( 9.40%) 3855.95 ( 9.22%) 3859.16 ( 9.31%)
Add 28M 3803.11 ( 0.00%) 3789.09 ( -0.37%) 3799.69 ( -0.09%) 4016.56 ( 5.61%) 3975.01 ( 4.52%) 3993.88 ( 5.02%)
Copy 28M 3361.16 ( 0.00%) 3365.71 ( 0.14%) 3368.81 ( 0.23%) 3483.91 ( 3.65%) 3472.65 ( 3.32%) 3475.83 ( 3.41%)
Scale 28M 3160.43 ( 0.00%) 3151.15 ( -0.29%) 3168.12 ( 0.24%) 3399.14 ( 7.55%) 3395.77 ( 7.45%) 3397.73 ( 7.51%)
Triad 28M 3533.66 ( 0.00%) 3518.97 ( -0.42%) 3528.59 ( -0.14%) 3861.47 ( 9.28%) 3855.76 ( 9.12%) 3858.01 ( 9.18%)
Add 35M 3792.86 ( 0.00%) 3802.89 ( 0.26%) 3783.36 ( -0.25%) 3997.11 ( 5.39%) 4043.66 ( 6.61%) 3962.60 ( 4.48%)
Copy 35M 3344.24 ( 0.00%) 3356.43 ( 0.36%) 3351.61 ( 0.22%) 3478.14 ( 4.00%) 3486.84 ( 4.26%) 3468.70 ( 3.72%)
Scale 35M 3160.14 ( 0.00%) 3149.58 ( -0.33%) 3159.57 ( -0.02%) 3394.63 ( 7.42%) 3401.18 ( 7.63%) 3392.57 ( 7.36%)
Triad 35M 3531.94 ( 0.00%) 3530.90 ( -0.03%) 3517.90 ( -0.40%) 3856.80 ( 9.20%) 3862.04 ( 9.35%) 3846.73 ( 8.91%)
Add 42M 3803.39 ( 0.00%) 3789.28 ( -0.37%) 3773.81 ( -0.78%) 4025.00 ( 5.83%) 4007.98 ( 5.38%) 3944.45 ( 3.71%)
Copy 42M 3360.64 ( 0.00%) 3355.86 ( -0.14%) 3339.54 ( -0.63%) 3483.81 ( 3.67%) 3481.01 ( 3.58%) 3464.28 ( 3.08%)
Scale 42M 3158.64 ( 0.00%) 3168.47 ( 0.31%) 3161.82 ( 0.10%) 3397.41 ( 7.56%) 3397.71 ( 7.57%) 3388.43 ( 7.27%)
Triad 42M 3529.99 ( 0.00%) 3522.03 ( -0.23%) 3512.07 ( -0.51%) 3859.19 ( 9.33%) 3859.30 ( 9.33%) 3843.50 ( 8.88%)
Add 56M 3778.07 ( 0.00%) 3802.38 ( 0.64%) 3786.95 ( 0.23%) 4008.71 ( 6.10%) 4001.39 ( 5.91%) 3980.85 ( 5.37%)
Copy 56M 3348.68 ( 0.00%) 3354.81 ( 0.18%) 3363.94 ( 0.46%) 3481.10 ( 3.95%) 3482.10 ( 3.98%) 3478.62 ( 3.88%)
Scale 56M 3169.25 ( 0.00%) 3173.21 ( 0.13%) 3160.15 ( -0.29%) 3399.41 ( 7.26%) 3399.35 ( 7.26%) 3396.19 ( 7.16%)
Triad 56M 3517.62 ( 0.00%) 3532.08 ( 0.41%) 3519.91 ( 0.07%) 3861.34 ( 9.77%) 3860.40 ( 9.74%) 3859.61 ( 9.72%)
Add 71M 3811.71 ( 0.00%) 3790.78 ( -0.55%) 3792.30 ( -0.51%) 4005.76 ( 5.09%) 3996.73 ( 4.85%) 4021.00 ( 5.49%)
Copy 71M 3370.59 ( 0.00%) 3360.98 ( -0.29%) 3357.42 ( -0.39%) 3478.74 ( 3.21%) 3472.59 ( 3.03%) 3481.72 ( 3.30%)
Scale 71M 3168.70 ( 0.00%) 3170.94 ( 0.07%) 3150.83 ( -0.56%) 3394.36 ( 7.12%) 3390.88 ( 7.01%) 3397.04 ( 7.21%)
Triad 71M 3536.14 ( 0.00%) 3525.38 ( -0.30%) 3521.01 ( -0.43%) 3855.90 ( 9.04%) 3850.99 ( 8.90%) 3859.34 ( 9.14%)
Add 85M 3805.94 ( 0.00%) 3792.84 ( -0.34%) 3796.44 ( -0.25%) 4004.15 ( 5.21%) 4003.69 ( 5.20%) 3990.20 ( 4.84%)
Copy 85M 3354.76 ( 0.00%) 3357.55 ( 0.08%) 3360.68 ( 0.18%) 3477.66 ( 3.66%) 3480.74 ( 3.76%) 3471.36 ( 3.48%)
Scale 85M 3162.20 ( 0.00%) 3156.40 ( -0.18%) 3164.00 ( 0.06%) 3396.25 ( 7.40%) 3398.16 ( 7.46%) 3390.12 ( 7.21%)
Triad 85M 3538.76 ( 0.00%) 3522.94 ( -0.45%) 3533.03 ( -0.16%) 3854.39 ( 8.92%) 3861.37 ( 9.12%) 3848.60 ( 8.76%)
Add 113M 3803.66 ( 0.00%) 3785.42 ( -0.48%) 3804.21 ( 0.01%) 3997.16 ( 5.09%) 4029.74 ( 5.94%) 3987.10 ( 4.82%)
Copy 113M 3348.32 ( 0.00%) 3359.18 ( 0.32%) 3362.06 ( 0.41%) 3479.75 ( 3.93%) 3488.98 ( 4.20%) 3476.86 ( 3.84%)
Scale 113M 3177.09 ( 0.00%) 3148.61 ( -0.90%) 3147.95 ( -0.92%) 3396.00 ( 6.89%) 3404.06 ( 7.14%) 3395.97 ( 6.89%)
Triad 113M 3536.06 ( 0.00%) 3513.51 ( -0.64%) 3531.90 ( -0.12%) 3854.44 ( 9.00%) 3869.05 ( 9.42%) 3857.86 ( 9.10%)
Add 142M 3814.65 ( 0.00%) 3779.76 ( -0.91%) 3796.14 ( -0.49%) 3989.97 ( 4.60%) 3982.66 ( 4.40%) 3944.66 ( 3.41%)
Copy 142M 3353.31 ( 0.00%) 3347.29 ( -0.18%) 3360.60 ( 0.22%) 3477.55 ( 3.70%) 3471.80 ( 3.53%) 3465.60 ( 3.35%)
Scale 142M 3186.05 ( 0.00%) 3161.07 ( -0.78%) 3154.54 ( -0.99%) 3397.67 ( 6.64%) 3394.53 ( 6.54%) 3386.56 ( 6.29%)
Triad 142M 3545.41 ( 0.00%) 3518.27 ( -0.77%) 3527.15 ( -0.52%) 3858.25 ( 8.82%) 3851.34 ( 8.63%) 3841.65 ( 8.36%)
Add 170M 3787.71 ( 0.00%) 3805.45 ( 0.47%) 3781.99 ( -0.15%) 3990.15 ( 5.34%) 3990.16 ( 5.34%) 3997.08 ( 5.53%)
Copy 170M 3351.50 ( 0.00%) 3362.22 ( 0.32%) 3345.90 ( -0.17%) 3478.71 ( 3.80%) 3483.70 ( 3.94%) 3479.19 ( 3.81%)
Scale 170M 3158.38 ( 0.00%) 3175.47 ( 0.54%) 3151.34 ( -0.22%) 3398.22 ( 7.59%) 3400.09 ( 7.65%) 3396.11 ( 7.53%)
Triad 170M 3521.84 ( 0.00%) 3534.01 ( 0.35%) 3513.94 ( -0.22%) 3857.99 ( 9.54%) 3863.00 ( 9.69%) 3856.79 ( 9.51%)
Add 227M 3794.46 ( 0.00%) 3799.80 ( 0.14%) 3789.75 ( -0.12%) 4001.21 ( 5.45%) 3982.66 ( 4.96%) 3991.65 ( 5.20%)
Copy 227M 3368.15 ( 0.00%) 3361.29 ( -0.20%) 3357.70 ( -0.31%) 3482.76 ( 3.40%) 3473.54 ( 3.13%) 3480.61 ( 3.34%)
Scale 227M 3160.18 ( 0.00%) 3164.94 ( 0.15%) 3155.77 ( -0.14%) 3402.44 ( 7.67%) 3390.24 ( 7.28%) 3397.39 ( 7.51%)
Triad 227M 3525.39 ( 0.00%) 3523.04 ( -0.07%) 3524.31 ( -0.03%) 3865.12 ( 9.64%) 3851.41 ( 9.25%) 3859.91 ( 9.49%)
Add 284M 3804.29 ( 0.00%) 3799.06 ( -0.14%) 3805.86 ( 0.04%) 4007.77 ( 5.35%) 3986.91 ( 4.80%) 3996.16 ( 5.04%)
Copy 284M 3366.21 ( 0.00%) 3349.03 ( -0.51%) 3369.99 ( 0.11%) 3482.10 ( 3.44%) 3469.08 ( 3.06%) 3475.51 ( 3.25%)
Scale 284M 3174.61 ( 0.00%) 3173.80 ( -0.03%) 3147.99 ( -0.84%) 3402.22 ( 7.17%) 3386.58 ( 6.68%) 3395.61 ( 6.96%)
Triad 284M 3538.50 ( 0.00%) 3538.46 ( -0.00%) 3529.69 ( -0.25%) 3860.86 ( 9.11%) 3843.72 ( 8.63%) 3853.96 ( 8.92%)
Add 341M 3805.26 ( 0.00%) 3764.38 ( -1.07%) 3789.55 ( -0.41%) 3989.04 ( 4.83%) 3977.50 ( 4.53%) 4023.64 ( 5.74%)
Copy 341M 3366.98 ( 0.00%) 3341.40 ( -0.76%) 3362.85 ( -0.12%) 3476.89 ( 3.26%) 3474.40 ( 3.19%) 3489.58 ( 3.64%)
Scale 341M 3159.11 ( 0.00%) 3168.92 ( 0.31%) 3177.39 ( 0.58%) 3398.01 ( 7.56%) 3393.30 ( 7.41%) 3405.15 ( 7.79%)
Triad 341M 3530.80 ( 0.00%) 3506.03 ( -0.70%) 3528.16 ( -0.07%) 3858.85 ( 9.29%) 3851.56 ( 9.08%) 3868.18 ( 9.56%)
Add 455M 3791.15 ( 0.00%) 3794.39 ( 0.09%) 3807.19 ( 0.42%) 4029.29 ( 6.28%) 3985.30 ( 5.12%) 3988.07 ( 5.19%)
Copy 455M 3353.30 ( 0.00%) 3365.90 ( 0.38%) 3358.94 ( 0.17%) 3486.16 ( 3.96%) 3475.41 ( 3.64%) 3474.43 ( 3.61%)
Scale 455M 3161.21 ( 0.00%) 3166.60 ( 0.17%) 3160.11 ( -0.03%) 3401.81 ( 7.61%) 3396.29 ( 7.44%) 3395.46 ( 7.41%)
Triad 455M 3527.90 ( 0.00%) 3525.16 ( -0.08%) 3536.99 ( 0.26%) 3864.91 ( 9.55%) 3858.19 ( 9.36%) 3855.59 ( 9.29%)
Add 568M 3779.79 ( 0.00%) 3801.70 ( 0.58%) 3782.09 ( 0.06%) 3985.25 ( 5.44%) 4026.56 ( 6.53%) 3926.30 ( 3.88%)
Copy 568M 3349.93 ( 0.00%) 3366.10 ( 0.48%) 3336.55 ( -0.40%) 3472.59 ( 3.66%) 3485.34 ( 4.04%) 3460.49 ( 3.30%)
Scale 568M 3163.69 ( 0.00%) 3170.00 ( 0.20%) 3159.05 ( -0.15%) 3393.16 ( 7.25%) 3400.62 ( 7.49%) 3382.99 ( 6.93%)
Triad 568M 3518.65 ( 0.00%) 3535.79 ( 0.49%) 3517.04 ( -0.05%) 3850.19 ( 9.42%) 3863.35 ( 9.80%) 3839.40 ( 9.12%)
Add 682M 3801.06 ( 0.00%) 3805.79 ( 0.12%) 3786.90 ( -0.37%) 3977.83 ( 4.65%) 3956.61 ( 4.09%) 4001.91 ( 5.28%)
Copy 682M 3363.64 ( 0.00%) 3357.79 ( -0.17%) 3353.57 ( -0.30%) 3474.04 ( 3.28%) 3469.78 ( 3.16%) 3475.62 ( 3.33%)
Scale 682M 3151.89 ( 0.00%) 3169.57 ( 0.56%) 3159.20 ( 0.23%) 3395.81 ( 7.74%) 3392.14 ( 7.62%) 3393.91 ( 7.68%)
Triad 682M 3528.97 ( 0.00%) 3538.12 ( 0.26%) 3519.04 ( -0.28%) 3854.44 ( 9.22%) 3849.45 ( 9.08%) 3853.38 ( 9.19%)
Add 910M 3778.97 ( 0.00%) 3785.79 ( 0.18%) 3799.23 ( 0.54%) 4043.50 ( 7.00%) 4005.92 ( 6.01%) 4014.66 ( 6.24%)
Copy 910M 3345.09 ( 0.00%) 3355.05 ( 0.30%) 3353.56 ( 0.25%) 3487.47 ( 4.26%) 3473.79 ( 3.85%) 3489.55 ( 4.32%)
Scale 910M 3164.46 ( 0.00%) 3157.34 ( -0.23%) 3167.60 ( 0.10%) 3399.70 ( 7.43%) 3390.43 ( 7.14%) 3404.38 ( 7.58%)
Triad 910M 3516.19 ( 0.00%) 3520.82 ( 0.13%) 3534.78 ( 0.53%) 3861.71 ( 9.83%) 3850.59 ( 9.51%) 3867.83 ( 10.00%)
Add 1137M 3812.17 ( 0.00%) 3795.34 ( -0.44%) 3799.71 ( -0.33%) 4022.75 ( 5.52%) 3985.00 ( 4.53%) 3997.57 ( 4.86%)
Copy 1137M 3367.52 ( 0.00%) 3364.07 ( -0.10%) 3367.26 ( -0.01%) 3480.58 ( 3.36%) 3468.42 ( 3.00%) 3473.41 ( 3.14%)
Scale 1137M 3158.62 ( 0.00%) 3155.05 ( -0.11%) 3164.45 ( 0.18%) 3397.03 ( 7.55%) 3386.94 ( 7.23%) 3392.39 ( 7.40%)
Triad 1137M 3536.97 ( 0.00%) 3526.00 ( -0.31%) 3529.99 ( -0.20%) 3858.44 ( 9.09%) 3845.78 ( 8.73%) 3850.80 ( 8.87%)
Add 1365M 3806.51 ( 0.00%) 3791.63 ( -0.39%) 3786.57 ( -0.52%) 3962.59 ( 4.10%) 4029.60 ( 5.86%) 3990.23 ( 4.83%)
Copy 1365M 3360.43 ( 0.00%) 3363.15 ( 0.08%) 3347.19 ( -0.39%) 3474.10 ( 3.38%) 3488.82 ( 3.82%) 3478.98 ( 3.53%)
Scale 1365M 3155.95 ( 0.00%) 3160.77 ( 0.15%) 3164.41 ( 0.27%) 3394.90 ( 7.57%) 3405.19 ( 7.90%) 3396.64 ( 7.63%)
Triad 1365M 3534.18 ( 0.00%) 3521.12 ( -0.37%) 3519.49 ( -0.42%) 3856.06 ( 9.11%) 3865.20 ( 9.37%) 3857.96 ( 9.16%)
Add 1820M 3797.86 ( 0.00%) 3795.51 ( -0.06%) 3800.31 ( 0.06%) 4023.79 ( 5.95%) 3955.34 ( 4.15%) 4003.20 ( 5.41%)
Copy 1820M 3362.09 ( 0.00%) 3361.06 ( -0.03%) 3359.74 ( -0.07%) 3482.46 ( 3.58%) 3468.46 ( 3.16%) 3474.92 ( 3.36%)
Scale 1820M 3170.20 ( 0.00%) 3160.70 ( -0.30%) 3166.72 ( -0.11%) 3396.61 ( 7.14%) 3391.98 ( 7.00%) 3393.97 ( 7.06%)
Triad 1820M 3531.00 ( 0.00%) 3527.31 ( -0.10%) 3530.65 ( -0.01%) 3858.18 ( 9.27%) 3849.65 ( 9.02%) 3854.65 ( 9.17%)
Add 2275M 3810.31 ( 0.00%) 3792.47 ( -0.47%) 3767.11 ( -1.13%) 3982.71 ( 4.52%) 3987.02 ( 4.64%) 3977.99 ( 4.40%)
Copy 2275M 3373.60 ( 0.00%) 3358.29 ( -0.45%) 3335.43 ( -1.13%) 3478.34 ( 3.10%) 3476.07 ( 3.04%) 3475.55 ( 3.02%)
Scale 2275M 3174.64 ( 0.00%) 3159.58 ( -0.47%) 3158.94 ( -0.49%) 3398.12 ( 7.04%) 3395.41 ( 6.95%) 3395.88 ( 6.97%)
Triad 2275M 3537.57 ( 0.00%) 3527.90 ( -0.27%) 3508.53 ( -0.82%) 3860.60 ( 9.13%) 3856.96 ( 9.03%) 3856.09 ( 9.00%)
Add 2730M 3801.09 ( 0.00%) 3812.05 ( 0.29%) 3802.64 ( 0.04%) 3981.20 ( 4.74%) 4017.01 ( 5.68%) 3938.62 ( 3.62%)
Copy 2730M 3357.18 ( 0.00%) 3365.37 ( 0.24%) 3361.64 ( 0.13%) 3477.74 ( 3.59%) 3475.85 ( 3.53%) 3464.04 ( 3.18%)
Scale 2730M 3177.66 ( 0.00%) 3168.10 ( -0.30%) 3161.30 ( -0.51%) 3397.39 ( 6.91%) 3393.51 ( 6.79%) 3386.47 ( 6.57%)
Triad 2730M 3539.59 ( 0.00%) 3543.83 ( 0.12%) 3528.50 ( -0.31%) 3861.50 ( 9.09%) 3854.09 ( 8.89%) 3845.27 ( 8.64%)
Add 3640M 3816.88 ( 0.00%) 3791.01 ( -0.68%) 3779.35 ( -0.98%) 3976.53 ( 4.18%) 4050.84 ( 6.13%) 3991.81 ( 4.58%)
Copy 3640M 3375.91 ( 0.00%) 3349.60 ( -0.78%) 3347.88 ( -0.83%) 3472.83 ( 2.87%) 3485.96 ( 3.26%) 3474.40 ( 2.92%)
Scale 3640M 3167.22 ( 0.00%) 3168.24 ( 0.03%) 3157.93 ( -0.29%) 3395.00 ( 7.19%) 3400.17 ( 7.36%) 3395.70 ( 7.21%)
Triad 3640M 3546.45 ( 0.00%) 3528.90 ( -0.49%) 3517.90 ( -0.81%) 3855.08 ( 8.70%) 3860.11 ( 8.84%) 3854.39 ( 8.68%)
Add 4551M 3799.05 ( 0.00%) 3805.03 ( 0.16%) 3806.14 ( 0.19%) 4028.14 ( 6.03%) 4026.96 ( 6.00%) 4021.84 ( 5.86%)
Copy 4551M 3355.66 ( 0.00%) 3358.64 ( 0.09%) 3356.91 ( 0.04%) 3487.50 ( 3.93%) 3485.92 ( 3.88%) 3481.72 ( 3.76%)
Scale 4551M 3171.91 ( 0.00%) 3174.92 ( 0.09%) 3163.54 ( -0.26%) 3402.45 ( 7.27%) 3401.04 ( 7.22%) 3396.90 ( 7.09%)
Triad 4551M 3531.61 ( 0.00%) 3535.95 ( 0.12%) 3536.00 ( 0.12%) 3864.84 ( 9.44%) 3865.01 ( 9.44%) 3857.47 ( 9.23%)
Add 5461M 3801.60 ( 0.00%) 3774.49 ( -0.71%) 3779.16 ( -0.59%) 4010.68 ( 5.50%) 3958.91 ( 4.14%) 4011.94 ( 5.53%)
Copy 5461M 3360.29 ( 0.00%) 3347.56 ( -0.38%) 3351.31 ( -0.27%) 3483.90 ( 3.68%) 3467.72 ( 3.20%) 3480.64 ( 3.58%)
Scale 5461M 3161.18 ( 0.00%) 3154.56 ( -0.21%) 3149.71 ( -0.36%) 3399.26 ( 7.53%) 3391.35 ( 7.28%) 3396.95 ( 7.46%)
Triad 5461M 3532.35 ( 0.00%) 3510.19 ( -0.63%) 3512.62 ( -0.56%) 3862.91 ( 9.36%) 3849.95 ( 8.99%) 3858.71 ( 9.24%)
Add 7281M 3800.80 ( 0.00%) 3789.71 ( -0.29%) 3779.60 ( -0.56%) 4023.89 ( 5.87%) 4000.63 ( 5.26%) 3974.68 ( 4.57%)
Copy 7281M 3359.99 ( 0.00%) 3349.71 ( -0.31%) 3346.82 ( -0.39%) 3482.20 ( 3.64%) 3481.97 ( 3.63%) 3471.59 ( 3.32%)
Scale 7281M 3168.68 ( 0.00%) 3167.95 ( -0.02%) 3154.70 ( -0.44%) 3399.98 ( 7.30%) 3400.46 ( 7.31%) 3392.10 ( 7.05%)
Triad 7281M 3533.59 ( 0.00%) 3524.63 ( -0.25%) 3514.25 ( -0.55%) 3861.39 ( 9.28%) 3861.70 ( 9.29%) 3853.31 ( 9.05%)
Add 9102M 3790.67 ( 0.00%) 3791.28 ( 0.02%) 3790.38 ( -0.01%) 4015.48 ( 5.93%) 4013.46 ( 5.88%) 4014.66 ( 5.91%)
Copy 9102M 3345.80 ( 0.00%) 3365.09 ( 0.58%) 3353.79 ( 0.24%) 3480.51 ( 4.03%) 3479.74 ( 4.00%) 3481.55 ( 4.06%)
Scale 9102M 3174.65 ( 0.00%) 3149.82 ( -0.78%) 3166.84 ( -0.25%) 3398.75 ( 7.06%) 3398.27 ( 7.04%) 3399.20 ( 7.07%)
Triad 9102M 3529.51 ( 0.00%) 3523.03 ( -0.18%) 3524.38 ( -0.15%) 3861.12 ( 9.40%) 3858.35 ( 9.32%) 3860.55 ( 9.38%)
Add 10922M 3807.96 ( 0.00%) 3784.18 ( -0.62%) 3779.45 ( -0.75%) 4021.53 ( 5.61%) 3984.89 ( 4.65%) 4005.11 ( 5.18%)
Copy 10922M 3350.99 ( 0.00%) 3351.97 ( 0.03%) 3353.08 ( 0.06%) 3490.40 ( 4.16%) 3472.32 ( 3.62%) 3473.98 ( 3.67%)
Scale 10922M 3164.74 ( 0.00%) 3167.46 ( 0.09%) 3154.60 ( -0.32%) 3402.35 ( 7.51%) 3392.56 ( 7.20%) 3392.16 ( 7.19%)
Triad 10922M 3536.69 ( 0.00%) 3524.27 ( -0.35%) 3516.30 ( -0.58%) 3865.21 ( 9.29%) 3850.74 ( 8.88%) 3849.32 ( 8.84%)
Add 14563M 3786.28 ( 0.00%) 3793.09 ( 0.18%) 3787.76 ( 0.04%) 3976.82 ( 5.03%) 3987.54 ( 5.32%) 3988.31 ( 5.34%)
Copy 14563M 3352.51 ( 0.00%) 3355.74 ( 0.10%) 3357.05 ( 0.14%) 3472.63 ( 3.58%) 3475.97 ( 3.68%) 3470.44 ( 3.52%)
Scale 14563M 3171.95 ( 0.00%) 3168.28 ( -0.12%) 3158.17 ( -0.43%) 3393.54 ( 6.99%) 3399.68 ( 7.18%) 3390.82 ( 6.90%)
Triad 14563M 3522.50 ( 0.00%) 3526.12 ( 0.10%) 3519.97 ( -0.07%) 3853.92 ( 9.41%) 3856.89 ( 9.49%) 3847.38 ( 9.22%)
Add 18204M 3809.56 ( 0.00%) 3772.64 ( -0.97%) 3795.07 ( -0.38%) 4014.65 ( 5.38%) 3976.18 ( 4.37%) 3963.55 ( 4.04%)
Copy 18204M 3365.06 ( 0.00%) 3350.49 ( -0.43%) 3359.32 ( -0.17%) 3483.40 ( 3.52%) 3473.21 ( 3.21%) 3467.66 ( 3.05%)
Scale 18204M 3171.25 ( 0.00%) 3151.05 ( -0.64%) 3163.69 ( -0.24%) 3400.05 ( 7.21%) 3393.76 ( 7.02%) 3388.64 ( 6.85%)
Triad 18204M 3539.90 ( 0.00%) 3508.60 ( -0.88%) 3532.25 ( -0.22%) 3860.99 ( 9.07%) 3853.56 ( 8.86%) 3847.01 ( 8.68%)
Add 21845M 3798.46 ( 0.00%) 3800.35 ( 0.05%) 3791.21 ( -0.19%) 3995.49 ( 5.19%) 3990.65 ( 5.06%) 3969.12 ( 4.49%)
Copy 21845M 3362.14 ( 0.00%) 3363.46 ( 0.04%) 3355.34 ( -0.20%) 3477.61 ( 3.43%) 3478.33 ( 3.46%) 3472.19 ( 3.27%)
Scale 21845M 3170.99 ( 0.00%) 3164.60 ( -0.20%) 3162.31 ( -0.27%) 3398.14 ( 7.16%) 3396.25 ( 7.10%) 3393.58 ( 7.02%)
Triad 21845M 3534.49 ( 0.00%) 3527.34 ( -0.20%) 3522.95 ( -0.33%) 3858.35 ( 9.16%) 3856.52 ( 9.11%) 3854.98 ( 9.07%)
Add 29127M 3819.69 ( 0.00%) 3783.38 ( -0.95%) 3786.06 ( -0.88%) 4007.04 ( 4.90%) 4005.91 ( 4.88%) 4000.99 ( 4.75%)
Copy 29127M 3384.67 ( 0.00%) 3345.60 ( -1.15%) 3339.55 ( -1.33%) 3480.54 ( 2.83%) 3479.91 ( 2.81%) 3475.18 ( 2.67%)
Scale 29127M 3158.68 ( 0.00%) 3166.06 ( 0.23%) 3151.78 ( -0.22%) 3399.73 ( 7.63%) 3395.21 ( 7.49%) 3393.50 ( 7.43%)
Triad 29127M 3538.17 ( 0.00%) 3520.17 ( -0.51%) 3523.09 ( -0.43%) 3862.24 ( 9.16%) 3858.60 ( 9.06%) 3851.85 ( 8.87%)
Add 36408M 3806.95 ( 0.00%) 3793.61 ( -0.35%) 3777.70 ( -0.77%) 4016.66 ( 5.51%) 3994.64 ( 4.93%) 3991.57 ( 4.85%)
Copy 36408M 3361.11 ( 0.00%) 3347.61 ( -0.40%) 3353.38 ( -0.23%) 3483.09 ( 3.63%) 3476.44 ( 3.43%) 3473.26 ( 3.34%)
Scale 36408M 3165.87 ( 0.00%) 3173.95 ( 0.26%) 3171.11 ( 0.17%) 3398.81 ( 7.36%) 3394.38 ( 7.22%) 3393.16 ( 7.18%)
Triad 36408M 3536.86 ( 0.00%) 3533.81 ( -0.09%) 3513.64 ( -0.66%) 3860.60 ( 9.15%) 3855.77 ( 9.02%) 3853.09 ( 8.94%)
Add 43690M 3799.39 ( 0.00%) 3795.90 ( -0.09%) 3803.79 ( 0.12%) 3996.57 ( 5.19%) 4006.70 ( 5.46%) 3981.15 ( 4.78%)
Copy 43690M 3359.26 ( 0.00%) 3360.94 ( 0.05%) 3371.10 ( 0.35%) 3479.62 ( 3.58%) 3481.69 ( 3.64%) 3478.45 ( 3.55%)
Scale 43690M 3175.35 ( 0.00%) 3163.95 ( -0.36%) 3147.34 ( -0.88%) 3396.36 ( 6.96%) 3399.45 ( 7.06%) 3398.88 ( 7.04%)
Triad 43690M 3535.26 ( 0.00%) 3526.88 ( -0.24%) 3528.38 ( -0.19%) 3857.30 ( 9.11%) 3858.89 ( 9.15%) 3858.38 ( 9.14%)
Add 58254M 3799.66 ( 0.00%) 3772.37 ( -0.72%) 3768.33 ( -0.82%) 4016.47 ( 5.71%) 4014.25 ( 5.65%) 3968.79 ( 4.45%)
Copy 58254M 3355.12 ( 0.00%) 3337.75 ( -0.52%) 3337.41 ( -0.53%) 3481.56 ( 3.77%) 3481.28 ( 3.76%) 3465.39 ( 3.29%)
Scale 58254M 3170.94 ( 0.00%) 3159.81 ( -0.35%) 3164.09 ( -0.22%) 3398.35 ( 7.17%) 3396.30 ( 7.11%) 3388.58 ( 6.86%)
Triad 58254M 3537.26 ( 0.00%) 3511.62 ( -0.72%) 3507.54 ( -0.84%) 3860.59 ( 9.14%) 3858.62 ( 9.09%) 3847.30 ( 8.76%)
Add 72817M 3815.26 ( 0.00%) 3812.73 ( -0.07%) 3787.86 ( -0.72%) 3968.21 ( 4.01%) 4030.38 ( 5.64%) 3956.57 ( 3.70%)
Copy 72817M 3362.18 ( 0.00%) 3371.41 ( 0.27%) 3345.64 ( -0.49%) 3474.38 ( 3.34%) 3482.00 ( 3.56%) 3469.46 ( 3.19%)
Scale 72817M 3175.73 ( 0.00%) 3170.64 ( -0.16%) 3154.28 ( -0.68%) 3394.65 ( 6.89%) 3396.69 ( 6.96%) 3390.78 ( 6.77%)
Triad 72817M 3546.44 ( 0.00%) 3537.21 ( -0.26%) 3520.46 ( -0.73%) 3855.50 ( 8.71%) 3855.34 ( 8.71%) 3849.10 ( 8.53%)
Add 87381M 3519.93 ( 0.00%) 3501.24 ( -0.53%) 3500.84 ( -0.54%) 3833.20 ( 8.90%) 3833.26 ( 8.90%) 3840.72 ( 9.11%)
Copy 87381M 3175.29 ( 0.00%) 3166.11 ( -0.29%) 3163.97 ( -0.36%) 3263.09 ( 2.77%) 3264.10 ( 2.80%) 3266.85 ( 2.88%)
Scale 87381M 2848.76 ( 0.00%) 2835.15 ( -0.48%) 2832.37 ( -0.58%) 3177.70 ( 11.55%) 3172.81 ( 11.38%) 3180.05 ( 11.63%)
Triad 87381M 3465.19 ( 0.00%) 3453.66 ( -0.33%) 3456.03 ( -0.26%) 3777.01 ( 9.00%) 3774.30 ( 8.92%) 3783.31 ( 9.18%)
Remote access costs are quite visible in this memory streaming benchmark.
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
              vanilla  instrument-v2r1  lruslabonly-v2r1  local-v2r6  acct-v2r6  remotefile-v2r6
User 1144.35 1154.81 1156.38 1075.31 1083.70 1087.08
System 55.28 56.07 56.35 49.00 49.06 48.84
Elapsed 1207.64 1220.14 1222.13 1132.20 1141.91 1145.08
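For reference, the percentages in parentheses in these throughput
tables are plain relative deltas against the vanilla baseline; for
example, the Triad 5M gain for local-v2r6 works out as:

```python
# Relative gain versus the vanilla baseline for a throughput figure
# (MB/s, higher is better), matching the tables' convention.
def gain_pct(baseline, new):
    return 100.0 * (new - baseline) / baseline

# Triad 5M: vanilla 3533.04 vs local-v2r6 3856.20
print(f"{gain_pct(3533.04, 3856.20):.2f}%")  # matches the table's 9.15%
```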
pft
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
User 1 0.6980 ( 0.00%) 0.6900 ( 1.15%) 0.7050 ( -1.00%) 0.6500 ( 6.88%) 0.6550 ( 6.16%) 0.6750 ( 3.30%)
User 2 0.7040 ( 0.00%) 0.6990 ( 0.71%) 0.7000 ( 0.57%) 0.6980 ( 0.85%) 0.7150 ( -1.56%) 0.7040 ( 0.00%)
User 3 0.6910 ( 0.00%) 0.6930 ( -0.29%) 0.7230 ( -4.63%) 0.7390 ( -6.95%) 0.7180 ( -3.91%) 0.7120 ( -3.04%)
User 4 0.7250 ( 0.00%) 0.7580 ( -4.55%) 0.7310 ( -0.83%) 0.7220 ( 0.41%) 0.7520 ( -3.72%) 0.7250 ( 0.00%)
User 5 0.7590 ( 0.00%) 0.7490 ( 1.32%) 0.7910 ( -4.22%) 0.7730 ( -1.84%) 0.7480 ( 1.45%) 0.7690 ( -1.32%)
User 6 0.8130 ( 0.00%) 0.8010 ( 1.48%) 0.7940 ( 2.34%) 0.7770 ( 4.43%) 0.7790 ( 4.18%) 0.7700 ( 5.29%)
User 7 0.8210 ( 0.00%) 0.8380 ( -2.07%) 0.8260 ( -0.61%) 0.7950 ( 3.17%) 0.8230 ( -0.24%) 0.7760 ( 5.48%)
User 8 0.8390 ( 0.00%) 0.8200 ( 2.26%) 0.8160 ( 2.74%) 0.7840 ( 6.56%) 0.7830 ( 6.67%) 0.8400 ( -0.12%)
System 1 9.1230 ( 0.00%) 9.1120 ( 0.12%) 9.0810 ( 0.46%) 8.2560 ( 9.50%) 8.2760 ( 9.28%) 8.2260 ( 9.83%)
System 2 9.3990 ( 0.00%) 9.3340 ( 0.69%) 9.4050 ( -0.06%) 8.4630 ( 9.96%) 8.4230 ( 10.38%) 8.4420 ( 10.18%)
System 3 9.1460 ( 0.00%) 9.0890 ( 0.62%) 9.1380 ( 0.09%) 8.5660 ( 6.34%) 8.5640 ( 6.36%) 8.5290 ( 6.75%)
System 4 8.9160 ( 0.00%) 8.8840 ( 0.36%) 8.9260 ( -0.11%) 8.6760 ( 2.69%) 8.6330 ( 3.17%) 8.6790 ( 2.66%)
System 5 9.5900 ( 0.00%) 9.5240 ( 0.69%) 9.5230 ( 0.70%) 8.9390 ( 6.79%) 8.8920 ( 7.28%) 8.9410 ( 6.77%)
System 6 9.8640 ( 0.00%) 9.7120 ( 1.54%) 9.8740 ( -0.10%) 9.1460 ( 7.28%) 9.1310 ( 7.43%) 9.1400 ( 7.34%)
System 7 9.9860 ( 0.00%) 9.9290 ( 0.57%) 10.0030 ( -0.17%) 9.3360 ( 6.51%) 9.2430 ( 7.44%) 9.2860 ( 7.01%)
System 8 9.8570 ( 0.00%) 9.8510 ( 0.06%) 9.9980 ( -1.43%) 9.3050 ( 5.60%) 9.2410 ( 6.25%) 9.4170 ( 4.46%)
Elapsed 1 9.8240 ( 0.00%) 9.8050 ( 0.19%) 9.7910 ( 0.34%) 8.9080 ( 9.32%) 8.9320 ( 9.08%) 8.9080 ( 9.32%)
Elapsed 2 5.0870 ( 0.00%) 5.0500 ( 0.73%) 5.0710 ( 0.31%) 4.6020 ( 9.53%) 4.5860 ( 9.85%) 4.5940 ( 9.69%)
Elapsed 3 3.3220 ( 0.00%) 3.2990 ( 0.69%) 3.3210 ( 0.03%) 3.1170 ( 6.17%) 3.1150 ( 6.23%) 3.0950 ( 6.83%)
Elapsed 4 2.4440 ( 0.00%) 2.4440 ( 0.00%) 2.4410 ( 0.12%) 2.3930 ( 2.09%) 2.3780 ( 2.70%) 2.3710 ( 2.99%)
Elapsed 5 2.1500 ( 0.00%) 2.1410 ( 0.42%) 2.1400 ( 0.47%) 2.0020 ( 6.88%) 1.9830 ( 7.77%) 2.0030 ( 6.84%)
Elapsed 6 1.8290 ( 0.00%) 1.7970 ( 1.75%) 1.8260 ( 0.16%) 1.6960 ( 7.27%) 1.6980 ( 7.16%) 1.6930 ( 7.44%)
Elapsed 7 1.5760 ( 0.00%) 1.5610 ( 0.95%) 1.5860 ( -0.63%) 1.4830 ( 5.90%) 1.4740 ( 6.47%) 1.4730 ( 6.54%)
Elapsed 8 1.3660 ( 0.00%) 1.3490 ( 1.24%) 1.3660 ( -0.00%) 1.2820 ( 6.15%) 1.2660 ( 7.32%) 1.3030 ( 4.61%)
Faults/cpu 1 336505.5875 ( 0.00%) 337163.8429 ( 0.20%) 337713.8261 ( 0.36%) 371079.7726 ( 10.27%) 370090.7928 ( 9.98%) 371199.3702 ( 10.31%)
Faults/cpu 2 327139.2186 ( 0.00%) 329451.3249 ( 0.71%) 326974.9735 ( -0.05%) 360766.3203 ( 10.28%) 361595.0312 ( 10.53%) 361389.4583 ( 10.47%)
Faults/cpu 3 336004.1324 ( 0.00%) 337826.9136 ( 0.54%) 335004.8869 ( -0.30%) 355249.2266 ( 5.73%) 356016.6570 ( 5.96%) 357584.5258 ( 6.42%)
Faults/cpu 4 342824.1564 ( 0.00%) 342825.3087 ( 0.00%) 342285.3156 ( -0.16%) 351758.5702 ( 2.61%) 352312.8339 ( 2.77%) 351503.0837 ( 2.53%)
Faults/cpu 5 319553.7707 ( 0.00%) 321799.3129 ( 0.70%) 320521.1950 ( 0.30%) 340315.3807 ( 6.50%) 342890.6018 ( 7.30%) 340381.5220 ( 6.52%)
Faults/cpu 6 309614.5554 ( 0.00%) 314330.1834 ( 1.52%) 309882.5231 ( 0.09%) 333075.2546 ( 7.58%) 333637.6404 ( 7.76%) 333706.0587 ( 7.78%)
Faults/cpu 7 306159.2969 ( 0.00%) 307277.9428 ( 0.37%) 305306.4748 ( -0.28%) 326309.2165 ( 6.58%) 328327.9627 ( 7.24%) 328590.8507 ( 7.33%)
Faults/cpu 8 309077.4966 ( 0.00%) 309849.8370 ( 0.25%) 305865.6953 ( -1.04%) 327958.3107 ( 6.11%) 329731.7933 ( 6.68%) 322280.8870 ( 4.27%)
Faults/sec 1 336364.5575 ( 0.00%) 336993.1010 ( 0.19%) 337563.4257 ( 0.36%) 370916.0228 ( 10.27%) 369955.7605 ( 9.99%) 370971.4836 ( 10.29%)
Faults/sec 2 649713.2290 ( 0.00%) 654448.6622 ( 0.73%) 651706.3799 ( 0.31%) 717987.1734 ( 10.51%) 720641.9249 ( 10.92%) 719435.7495 ( 10.73%)
Faults/sec 3 994812.3119 ( 0.00%) 1001443.9434 ( 0.67%) 995205.6607 ( 0.04%) 1060228.7843 ( 6.58%) 1060484.8602 ( 6.60%) 1067127.5522 ( 7.27%)
Faults/sec 4 1352137.4832 ( 0.00%) 1352463.8578 ( 0.02%) 1354323.6163 ( 0.16%) 1382325.4091 ( 2.23%) 1390344.3320 ( 2.83%) 1393760.7116 ( 3.08%)
Faults/sec 5 1538115.0421 ( 0.00%) 1544331.3978 ( 0.40%) 1544368.0159 ( 0.41%) 1651247.2902 ( 7.36%) 1666751.7259 ( 8.36%) 1651371.8632 ( 7.36%)
Faults/sec 6 1807211.7324 ( 0.00%) 1840430.0157 ( 1.84%) 1809763.9743 ( 0.14%) 1947049.8237 ( 7.74%) 1946986.6396 ( 7.73%) 1953384.4599 ( 8.09%)
Faults/sec 7 2101840.1872 ( 0.00%) 2120169.4773 ( 0.87%) 2082926.2675 ( -0.90%) 2233207.9026 ( 6.25%) 2241803.5953 ( 6.66%) 2242647.3545 ( 6.70%)
Faults/sec 8 2421813.7208 ( 0.00%) 2453320.5034 ( 1.30%) 2419371.3924 ( -0.10%) 2582755.9228 ( 6.65%) 2612638.2836 ( 7.88%) 2537575.0399 ( 4.78%)
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
              vanilla  instrument-v2r1  lruslabonly-v2r1  local-v2r6  acct-v2r6  remotefile-v2r6
User 60.57 61.53 61.96 59.47 60.74 60.78
System 868.16 862.80 868.89 805.82 802.76 806.01
Elapsed 336.19 336.02 339.19 311.33 313.18 313.58
The page fault microbenchmark also sees a benefit, probably because the
zeroing of pages no longer incurs a remote access penalty, which shows
up as lower system CPU usage.
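To put a number on the zeroing theory: each fault in pft zeroes one
page, so the single-client Faults/sec figures above imply a zeroing
bandwidth (assuming 4KiB pages, the x86 default):

```python
# Back-of-envelope zeroing bandwidth implied by the single-client
# Faults/sec figures above: each fault zeroes one page, assumed 4KiB.
PAGE_SIZE = 4096

def zeroing_gib_per_sec(faults_per_sec):
    return faults_per_sec * PAGE_SIZE / (1 << 30)

print(f"vanilla:    ~{zeroing_gib_per_sec(336364.5575):.2f} GiB/s zeroed")
print(f"local-v2r6: ~{zeroing_gib_per_sec(370916.0228):.2f} GiB/s zeroed")
```

Over a GiB/s of memory is being zeroed at a single client, so whether
that traffic hits local or remote memory plausibly dominates the result.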
ebizzy
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
Mean 1 3213.33 ( 0.00%) 3161.67 ( -1.61%) 3177.00 ( -1.13%) 3234.33 ( 0.65%) 3224.00 ( 0.33%) 3198.33 ( -0.47%)
Mean 2 2291.33 ( 0.00%) 2316.67 ( 1.11%) 2309.67 ( 0.80%) 2348.67 ( 2.50%) 2330.00 ( 1.69%) 2332.00 ( 1.77%)
Mean 3 2234.67 ( 0.00%) 2298.67 ( 2.86%) 2252.00 ( 0.78%) 2280.67 ( 2.06%) 2292.67 ( 2.60%) 2270.00 ( 1.58%)
Mean 4 2224.33 ( 0.00%) 2279.00 ( 2.46%) 2250.67 ( 1.18%) 2282.33 ( 2.61%) 2256.00 ( 1.42%) 2256.33 ( 1.44%)
Mean 5 2256.33 ( 0.00%) 2280.33 ( 1.06%) 2265.00 ( 0.38%) 2280.67 ( 1.08%) 2268.33 ( 0.53%) 2276.67 ( 0.90%)
Mean 6 2233.00 ( 0.00%) 2257.33 ( 1.09%) 2200.00 ( -1.48%) 2292.33 ( 2.66%) 2274.33 ( 1.85%) 2250.33 ( 0.78%)
Mean 7 2212.33 ( 0.00%) 2229.00 ( 0.75%) 2201.67 ( -0.48%) 2279.00 ( 3.01%) 2265.00 ( 2.38%) 2251.67 ( 1.78%)
Mean 8 2224.67 ( 0.00%) 2226.33 ( 0.07%) 2225.67 ( 0.04%) 2255.00 ( 1.36%) 2280.67 ( 2.52%) 2238.67 ( 0.63%)
Mean 12 2213.33 ( 0.00%) 2240.00 ( 1.20%) 2264.67 ( 2.32%) 2249.33 ( 1.63%) 2257.67 ( 2.00%) 2238.00 ( 1.11%)
Mean 16 2221.00 ( 0.00%) 2226.33 ( 0.24%) 2268.00 ( 2.12%) 2266.33 ( 2.04%) 2241.00 ( 0.90%) 2258.33 ( 1.68%)
Mean 20 2215.00 ( 0.00%) 2256.00 ( 1.85%) 2278.33 ( 2.86%) 2238.67 ( 1.07%) 2271.67 ( 2.56%) 2291.00 ( 3.43%)
Mean 24 2175.00 ( 0.00%) 2181.00 ( 0.28%) 2166.67 ( -0.38%) 2211.00 ( 1.66%) 2231.00 ( 2.57%) 2247.67 ( 3.34%)
Mean 28 2110.00 ( 0.00%) 2136.00 ( 1.23%) 2123.33 ( 0.63%) 2157.00 ( 2.23%) 2163.00 ( 2.51%) 2164.67 ( 2.59%)
Mean 32 2077.67 ( 0.00%) 2095.33 ( 0.85%) 2091.33 ( 0.66%) 2110.67 ( 1.59%) 2113.33 ( 1.72%) 2110.33 ( 1.57%)
Mean 36 2016.33 ( 0.00%) 2024.67 ( 0.41%) 2039.33 ( 1.14%) 2066.33 ( 2.48%) 2068.00 ( 2.56%) 2069.00 ( 2.61%)
Mean 40 1984.00 ( 0.00%) 1987.00 ( 0.15%) 1993.33 ( 0.47%) 2037.00 ( 2.67%) 2035.00 ( 2.57%) 2042.00 ( 2.92%)
Mean 44 1943.33 ( 0.00%) 1954.33 ( 0.57%) 1961.00 ( 0.91%) 2004.33 ( 3.14%) 2009.67 ( 3.41%) 2018.00 ( 3.84%)
Mean 48 1925.00 ( 0.00%) 1939.33 ( 0.74%) 1929.00 ( 0.21%) 1990.67 ( 3.41%) 1996.33 ( 3.71%) 2007.67 ( 4.29%)
Stddev 1 25.42 ( 0.00%) 46.78 (-84.02%) 32.75 (-28.84%) 18.62 ( 26.73%) 21.95 ( 13.64%) 30.18 (-18.72%)
Stddev 2 29.68 ( 0.00%) 1.70 ( 94.27%) 13.77 ( 53.61%) 12.50 ( 57.89%) 15.51 ( 47.73%) 13.88 ( 53.23%)
Stddev 3 18.15 ( 0.00%) 27.48 (-51.35%) 4.32 ( 76.20%) 13.57 ( 25.23%) 15.52 ( 14.50%) 11.78 ( 35.13%)
Stddev 4 41.28 ( 0.00%) 13.64 ( 66.96%) 6.94 ( 83.18%) 24.51 ( 40.62%) 24.04 ( 41.76%) 7.41 ( 82.05%)
Stddev 5 27.18 ( 0.00%) 9.03 ( 66.78%) 4.97 ( 81.73%) 8.50 ( 68.74%) 17.25 ( 36.54%) 15.80 ( 41.88%)
Stddev 6 10.80 ( 0.00%) 17.97 (-66.36%) 9.27 ( 14.14%) 6.60 ( 38.90%) 16.01 (-48.20%) 19.36 (-79.26%)
Stddev 7 23.10 ( 0.00%) 17.91 ( 22.48%) 29.58 (-28.05%) 15.94 ( 31.00%) 5.72 ( 75.26%) 12.76 ( 44.75%)
Stddev 8 3.68 ( 0.00%) 41.52 (-1027.82%) 26.74 (-626.21%) 4.32 (-17.35%) 33.81 (-818.21%) 12.50 (-239.48%)
Stddev 12 23.84 ( 0.00%) 6.48 ( 72.81%) 14.66 ( 38.50%) 13.47 ( 43.47%) 11.79 ( 50.56%) 18.71 ( 21.52%)
Stddev 16 20.22 ( 0.00%) 17.13 ( 15.25%) 28.99 (-43.43%) 2.36 ( 88.34%) 2.16 ( 89.31%) 16.13 ( 20.20%)
Stddev 20 3.74 ( 0.00%) 6.53 (-74.57%) 45.02 (-1103.24%) 22.54 (-502.51%) 8.18 (-118.58%) 26.28 (-602.38%)
Stddev 24 18.18 ( 0.00%) 19.30 ( -6.16%) 23.81 (-30.93%) 9.42 ( 48.22%) 16.99 ( 6.57%) 8.18 ( 55.02%)
Stddev 28 11.78 ( 0.00%) 7.79 ( 33.86%) 15.92 (-35.22%) 12.96 (-10.07%) 12.83 ( -8.97%) 17.78 (-51.01%)
Stddev 32 9.74 ( 0.00%) 2.05 ( 78.91%) 8.81 ( 9.59%) 6.55 ( 32.77%) 3.09 ( 68.27%) 1.70 ( 82.55%)
Stddev 36 3.86 ( 0.00%) 5.44 (-40.89%) 2.36 ( 38.92%) 13.22 (-242.73%) 11.78 (-205.18%) 16.87 (-337.26%)
Stddev 40 14.17 ( 0.00%) 7.48 ( 47.17%) 5.56 ( 60.77%) 5.89 ( 58.44%) 2.16 ( 84.75%) 2.45 ( 82.71%)
Stddev 44 7.54 ( 0.00%) 3.40 ( 54.93%) 2.94 ( 60.97%) 7.54 ( 0.00%) 3.68 ( 51.19%) 1.63 ( 78.35%)
Stddev 48 2.94 ( 0.00%) 5.56 (-88.79%) 3.56 (-20.89%) 6.24 (-111.83%) 1.70 ( 42.26%) 17.25 (-485.95%)
I ran ebizzy because it doubles up as a page allocation microbenchmark that
hits page faults differently from PFT. It looks like a reasonable gain, but the
stddev is high and would need to be stabilised before drawing a solid conclusion.
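For reading the tables: the bracketed percentages are relative gains against the vanilla baseline, with the sign flipped for lower-is-better figures such as Stddev, so a rise in variance prints as a negative number. A small sketch of that convention (my reading of the report format, not the actual reporting scripts):

```python
def gain_pct(baseline, value, higher_is_better=True):
    """Relative change versus the vanilla column, as printed in the
    tables: positive means an improvement, negative a regression."""
    if baseline == 0:
        return 0.0
    delta = (value - baseline) if higher_is_better else (baseline - value)
    return 100.0 * delta / baseline

# ebizzy Mean 48: 1925.00 vanilla vs 2007.67 remotefile-v2r6
print(round(gain_pct(1925.00, 2007.67), 2))                       # 4.29
# ebizzy Stddev 2: 29.68 vanilla vs 1.70 instrument (lower is better)
print(round(gain_pct(29.68, 1.70, higher_is_better=False), 2))    # 94.27
```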
None of these benchmarks do *anything* related to what commit 81c0a2bb was
supposed to fix. I just wanted to get the point across that our current
default behaviour sucks and we should revisit that decision.
My position is that by default we should only round-robin zones local to
the allocating process and that node round-robin is something that should
only be explicitly enabled.
I'm less sure about the round robin treatment of slab but am erring on
the side of historical behaviour until it is proven otherwise.
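The local-only round-robin can be illustrated with a toy userspace model (not the kernel code; the zone names and the nr_alloc_batch field are borrowed from mm/page_alloc.c, everything else is made up for illustration): each zone local to the allocating process carries a batch credit, allocations go to the first local zone with credit left, and when every local zone is exhausted the credits reset, so page ages stay roughly uniform across local zones without spilling onto remote nodes.

```python
class Zone:
    def __init__(self, name, batch):
        self.name = name
        self.batch = batch           # credit granted per fairness round
        self.nr_alloc_batch = batch  # credit left in the current round

def fair_pick_zone(local_zones):
    """Return the zone that should satisfy the next allocation."""
    for _ in range(2):  # second pass runs after a reset
        for zone in local_zones:
            if zone.nr_alloc_batch > 0:
                zone.nr_alloc_batch -= 1
                return zone
        # All local zones exhausted: start a new fairness round.
        for zone in local_zones:
            zone.nr_alloc_batch = zone.batch

zones = [Zone("Normal", batch=2), Zone("DMA32", batch=1)]
picks = [fair_pick_zone(zones).name for _ in range(4)]
print(picks)  # Normal twice, DMA32 once, then a reset brings Normal back
```

Node round-robin would simply extend local_zones with the remote nodes' zones, which is exactly the behaviour being made opt-in here.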
Documentation/sysctl/vm.txt | 32 +++++++++
include/linux/gfp.h | 4 +-
include/linux/mmzone.h | 2 +
include/linux/pagemap.h | 2 +-
include/linux/swap.h | 2 +
kernel/sysctl.c | 8 +++
mm/filemap.c | 2 +
mm/page_alloc.c | 153 +++++++++++++++++++++++++++++++++++++-------
8 files changed, 180 insertions(+), 25 deletions(-)
--
1.8.4
Sniff test results based on the following kernels
vanilla 3.13-rc3 stock
instrument-v2r1 NUMA balancing patches just to rule out any conflicts there
lruslabonly-v2r1 Patch 1 only
local-v2r6 Patches 1-5 to restore local memory allocations
acct-v2r6 Patches 1-6 to include an accounting adjustment
remotefile-v2r6 Patches 1-7 that break MPOL_LOCAL by interleaving file pages
kernbench
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
User min 1417.32 ( 0.00%) 1408.52 ( 0.62%) 1414.92 ( 0.17%) 1403.37 ( 0.98%) 1410.55 ( 0.48%) 1405.85 ( 0.81%)
User mean 1419.10 ( 0.00%) 1415.39 ( 0.26%) 1417.31 ( 0.13%) 1409.89 ( 0.65%) 1411.40 ( 0.54%) 1410.78 ( 0.59%)
User stddev 2.25 ( 0.00%) 4.51 (-100.33%) 2.44 ( -8.29%) 3.98 (-76.92%) 0.74 ( 66.98%) 2.94 (-30.81%)
User max 1422.92 ( 0.00%) 1421.05 ( 0.13%) 1421.90 ( 0.07%) 1415.39 ( 0.53%) 1412.55 ( 0.73%) 1413.99 ( 0.63%)
User range 5.60 ( 0.00%) 12.53 (-123.75%) 6.98 (-24.64%) 12.02 (-114.64%) 2.00 ( 64.29%) 8.14 (-45.36%)
System min 114.83 ( 0.00%) 114.09 ( 0.64%) 114.50 ( 0.29%) 110.16 ( 4.07%) 110.44 ( 3.82%) 110.49 ( 3.78%)
System mean 115.89 ( 0.00%) 115.01 ( 0.76%) 115.12 ( 0.67%) 110.73 ( 4.46%) 111.20 ( 4.05%) 111.17 ( 4.08%)
System stddev 0.63 ( 0.00%) 0.57 ( 10.42%) 0.40 ( 37.04%) 0.48 ( 24.87%) 0.51 ( 19.41%) 0.43 ( 32.60%)
System max 116.81 ( 0.00%) 115.87 ( 0.80%) 115.52 ( 1.10%) 111.47 ( 4.57%) 111.98 ( 4.13%) 111.63 ( 4.43%)
System range 1.98 ( 0.00%) 1.78 ( 10.10%) 1.02 ( 48.48%) 1.31 ( 33.84%) 1.54 ( 22.22%) 1.14 ( 42.42%)
Elapsed min 42.90 ( 0.00%) 43.96 ( -2.47%) 42.85 ( 0.12%) 43.02 ( -0.28%) 42.55 ( 0.82%) 42.75 ( 0.35%)
Elapsed mean 43.58 ( 0.00%) 44.16 ( -1.34%) 43.88 ( -0.69%) 43.87 ( -0.67%) 43.58 ( -0.00%) 43.80 ( -0.50%)
Elapsed stddev 0.74 ( 0.00%) 0.17 ( 77.41%) 0.61 ( 17.23%) 1.00 (-35.26%) 0.67 ( 9.46%) 0.82 ( -9.88%)
Elapsed max 44.52 ( 0.00%) 44.45 ( 0.16%) 44.55 ( -0.07%) 45.72 ( -2.70%) 44.24 ( 0.63%) 45.09 ( -1.28%)
Elapsed range 1.62 ( 0.00%) 0.49 ( 69.75%) 1.70 ( -4.94%) 2.70 (-66.67%) 1.69 ( -4.32%) 2.34 (-44.44%)
CPU min 3451.00 ( 0.00%) 3455.00 ( -0.12%) 3434.00 ( 0.49%) 3311.00 ( 4.06%) 3439.00 ( 0.35%) 3377.00 ( 2.14%)
CPU mean 3522.40 ( 0.00%) 3464.60 ( 1.64%) 3492.40 ( 0.85%) 3467.40 ( 1.56%) 3493.80 ( 0.81%) 3475.40 ( 1.33%)
CPU stddev 54.34 ( 0.00%) 9.05 ( 83.35%) 54.80 ( -0.85%) 86.04 (-58.33%) 54.99 ( -1.18%) 67.75 (-24.68%)
CPU max 3570.00 ( 0.00%) 3480.00 ( 2.52%) 3587.00 ( -0.48%) 3545.00 ( 0.70%) 3578.00 ( -0.22%) 3568.00 ( 0.06%)
CPU range 119.00 ( 0.00%) 25.00 ( 78.99%) 153.00 (-28.57%) 234.00 (-96.64%) 139.00 (-16.81%) 191.00 (-60.50%)
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
User 8540.49 8516.04 8524.28 8487.25 8488.89 8487.40
System 706.31 701.72 701.20 674.29 675.81 676.52
Elapsed 307.58 311.31 309.72 309.51 308.32 310.36
The kernbench figures themselves are not that compelling but the system CPU
cost is down a lot. System time is just such a small percentage of the overall
workload that it does not really matter, and the processes are short-lived anyway.
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
NUMA alloc hit 73783951 73086669 73385508 93373651 93326068 93321444
NUMA alloc miss 20013534 20247750 19958857 102 118 2129
NUMA interleave hit 0 0 0 0 0 0
NUMA alloc local 73783935 73086658 73385501 93373644 93326059 93321436
NUMA miss rates are reduced by using the local policy although they really
should have been zero. I suspect it is the __GFP_PAGECACHE annotation patch
and how it is treated, but I have not proven it. The miss stats go up again
for the final patch as page cache pages get distributed between nodes.
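The miss rates quoted come straight from those counters; a quick sketch of the arithmetic (the function name is mine, the counter values are the ones from the table above):

```python
def numa_miss_rate(alloc_hit, alloc_miss):
    """Fraction of page allocations that landed on a remote node,
    computed from the NUMA alloc hit/miss counters."""
    total = alloc_hit + alloc_miss
    return alloc_miss / total if total else 0.0

# vanilla: 20013534 misses against 73783951 hits -> roughly 21% remote
print(round(100 * numa_miss_rate(73783951, 20013534), 1))
# local-v2r6: 102 misses against 93373651 hits -> effectively zero
print(round(100 * numa_miss_rate(93373651, 102), 4))
```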
vmr-stream
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
Add 5M 3809.80 ( 0.00%) 3783.21 ( -0.70%) 3790.61 ( -0.50%) 3970.34 ( 4.21%) 3975.29 ( 4.34%) 3992.15 ( 4.79%)
Copy 5M 3360.75 ( 0.00%) 3345.59 ( -0.45%) 3351.99 ( -0.26%) 3474.69 ( 3.39%) 3472.97 ( 3.34%) 3474.32 ( 3.38%)
Scale 5M 3160.39 ( 0.00%) 3163.43 ( 0.10%) 3159.88 ( -0.02%) 3393.56 ( 7.38%) 3391.85 ( 7.32%) 3393.76 ( 7.38%)
Triad 5M 3533.04 ( 0.00%) 3517.67 ( -0.43%) 3526.18 ( -0.19%) 3856.20 ( 9.15%) 3851.39 ( 9.01%) 3855.89 ( 9.14%)
Add 7M 3789.82 ( 0.00%) 3789.03 ( -0.02%) 3779.30 ( -0.28%) 4049.53 ( 6.85%) 4001.74 ( 5.59%) 3968.84 ( 4.72%)
Copy 7M 3345.85 ( 0.00%) 3355.75 ( 0.30%) 3354.56 ( 0.26%) 3484.62 ( 4.15%) 3477.23 ( 3.93%) 3474.17 ( 3.84%)
Scale 7M 3176.00 ( 0.00%) 3156.09 ( -0.63%) 3152.84 ( -0.73%) 3401.53 ( 7.10%) 3393.55 ( 6.85%) 3392.46 ( 6.82%)
Triad 7M 3528.85 ( 0.00%) 3521.99 ( -0.19%) 3515.20 ( -0.39%) 3861.55 ( 9.43%) 3853.51 ( 9.20%) 3853.30 ( 9.19%)
Add 8M 3801.60 ( 0.00%) 3781.66 ( -0.52%) 3788.19 ( -0.35%) 3957.73 ( 4.11%) 4002.30 ( 5.28%) 4006.69 ( 5.39%)
Copy 8M 3364.64 ( 0.00%) 3346.31 ( -0.54%) 3353.71 ( -0.32%) 3469.62 ( 3.12%) 3476.25 ( 3.32%) 3473.67 ( 3.24%)
Scale 8M 3169.34 ( 0.00%) 3163.10 ( -0.20%) 3157.99 ( -0.36%) 3391.61 ( 7.01%) 3395.76 ( 7.14%) 3393.20 ( 7.06%)
Triad 8M 3531.38 ( 0.00%) 3514.83 ( -0.47%) 3518.55 ( -0.36%) 3850.45 ( 9.04%) 3853.39 ( 9.12%) 3849.50 ( 9.01%)
Add 10M 3807.95 ( 0.00%) 3791.80 ( -0.42%) 3781.86 ( -0.69%) 3977.13 ( 4.44%) 4005.95 ( 5.20%) 3983.31 ( 4.61%)
Copy 10M 3365.64 ( 0.00%) 3361.59 ( -0.12%) 3352.03 ( -0.40%) 3473.78 ( 3.21%) 3479.54 ( 3.38%) 3471.70 ( 3.15%)
Scale 10M 3172.71 ( 0.00%) 3157.52 ( -0.48%) 3149.26 ( -0.74%) 3395.59 ( 7.02%) 3397.28 ( 7.08%) 3394.50 ( 6.99%)
Triad 10M 3536.15 ( 0.00%) 3524.46 ( -0.33%) 3517.36 ( -0.53%) 3854.88 ( 9.01%) 3857.55 ( 9.09%) 3853.00 ( 8.96%)
Add 14M 3787.56 ( 0.00%) 3789.36 ( 0.05%) 3780.55 ( -0.19%) 4009.14 ( 5.85%) 4019.90 ( 6.13%) 3966.93 ( 4.74%)
Copy 14M 3345.19 ( 0.00%) 3361.79 ( 0.50%) 3338.99 ( -0.19%) 3483.34 ( 4.13%) 3480.38 ( 4.04%) 3470.79 ( 3.75%)
Scale 14M 3154.55 ( 0.00%) 3155.60 ( 0.03%) 3154.74 ( 0.01%) 3398.70 ( 7.74%) 3396.31 ( 7.66%) 3392.50 ( 7.54%)
Triad 14M 3522.09 ( 0.00%) 3517.21 ( -0.14%) 3514.90 ( -0.20%) 3861.09 ( 9.62%) 3854.76 ( 9.45%) 3852.52 ( 9.38%)
Add 17M 3806.34 ( 0.00%) 3770.18 ( -0.95%) 3774.21 ( -0.84%) 3982.37 ( 4.62%) 4015.73 ( 5.50%) 3979.61 ( 4.55%)
Copy 17M 3368.39 ( 0.00%) 3334.84 ( -1.00%) 3349.84 ( -0.55%) 3480.15 ( 3.32%) 3481.29 ( 3.35%) 3470.75 ( 3.04%)
Scale 17M 3169.18 ( 0.00%) 3164.25 ( -0.16%) 3148.23 ( -0.66%) 3398.11 ( 7.22%) 3398.69 ( 7.24%) 3389.32 ( 6.95%)
Triad 17M 3535.05 ( 0.00%) 3510.90 ( -0.68%) 3511.84 ( -0.66%) 3860.14 ( 9.20%) 3859.64 ( 9.18%) 3848.12 ( 8.86%)
Add 21M 3795.31 ( 0.00%) 3804.70 ( 0.25%) 3795.15 ( -0.00%) 4017.03 ( 5.84%) 4029.35 ( 6.17%) 3988.21 ( 5.08%)
Copy 21M 3353.43 ( 0.00%) 3365.89 ( 0.37%) 3351.05 ( -0.07%) 3482.88 ( 3.86%) 3478.62 ( 3.73%) 3479.29 ( 3.75%)
Scale 21M 3160.96 ( 0.00%) 3170.91 ( 0.31%) 3167.45 ( 0.21%) 3398.76 ( 7.52%) 3394.56 ( 7.39%) 3397.91 ( 7.50%)
Triad 21M 3530.45 ( 0.00%) 3533.62 ( 0.09%) 3529.35 ( -0.03%) 3862.25 ( 9.40%) 3855.95 ( 9.22%) 3859.16 ( 9.31%)
Add 28M 3803.11 ( 0.00%) 3789.09 ( -0.37%) 3799.69 ( -0.09%) 4016.56 ( 5.61%) 3975.01 ( 4.52%) 3993.88 ( 5.02%)
Copy 28M 3361.16 ( 0.00%) 3365.71 ( 0.14%) 3368.81 ( 0.23%) 3483.91 ( 3.65%) 3472.65 ( 3.32%) 3475.83 ( 3.41%)
Scale 28M 3160.43 ( 0.00%) 3151.15 ( -0.29%) 3168.12 ( 0.24%) 3399.14 ( 7.55%) 3395.77 ( 7.45%) 3397.73 ( 7.51%)
Triad 28M 3533.66 ( 0.00%) 3518.97 ( -0.42%) 3528.59 ( -0.14%) 3861.47 ( 9.28%) 3855.76 ( 9.12%) 3858.01 ( 9.18%)
Add 35M 3792.86 ( 0.00%) 3802.89 ( 0.26%) 3783.36 ( -0.25%) 3997.11 ( 5.39%) 4043.66 ( 6.61%) 3962.60 ( 4.48%)
Copy 35M 3344.24 ( 0.00%) 3356.43 ( 0.36%) 3351.61 ( 0.22%) 3478.14 ( 4.00%) 3486.84 ( 4.26%) 3468.70 ( 3.72%)
Scale 35M 3160.14 ( 0.00%) 3149.58 ( -0.33%) 3159.57 ( -0.02%) 3394.63 ( 7.42%) 3401.18 ( 7.63%) 3392.57 ( 7.36%)
Triad 35M 3531.94 ( 0.00%) 3530.90 ( -0.03%) 3517.90 ( -0.40%) 3856.80 ( 9.20%) 3862.04 ( 9.35%) 3846.73 ( 8.91%)
Add 42M 3803.39 ( 0.00%) 3789.28 ( -0.37%) 3773.81 ( -0.78%) 4025.00 ( 5.83%) 4007.98 ( 5.38%) 3944.45 ( 3.71%)
Copy 42M 3360.64 ( 0.00%) 3355.86 ( -0.14%) 3339.54 ( -0.63%) 3483.81 ( 3.67%) 3481.01 ( 3.58%) 3464.28 ( 3.08%)
Scale 42M 3158.64 ( 0.00%) 3168.47 ( 0.31%) 3161.82 ( 0.10%) 3397.41 ( 7.56%) 3397.71 ( 7.57%) 3388.43 ( 7.27%)
Triad 42M 3529.99 ( 0.00%) 3522.03 ( -0.23%) 3512.07 ( -0.51%) 3859.19 ( 9.33%) 3859.30 ( 9.33%) 3843.50 ( 8.88%)
Add 56M 3778.07 ( 0.00%) 3802.38 ( 0.64%) 3786.95 ( 0.23%) 4008.71 ( 6.10%) 4001.39 ( 5.91%) 3980.85 ( 5.37%)
Copy 56M 3348.68 ( 0.00%) 3354.81 ( 0.18%) 3363.94 ( 0.46%) 3481.10 ( 3.95%) 3482.10 ( 3.98%) 3478.62 ( 3.88%)
Scale 56M 3169.25 ( 0.00%) 3173.21 ( 0.13%) 3160.15 ( -0.29%) 3399.41 ( 7.26%) 3399.35 ( 7.26%) 3396.19 ( 7.16%)
Triad 56M 3517.62 ( 0.00%) 3532.08 ( 0.41%) 3519.91 ( 0.07%) 3861.34 ( 9.77%) 3860.40 ( 9.74%) 3859.61 ( 9.72%)
Add 71M 3811.71 ( 0.00%) 3790.78 ( -0.55%) 3792.30 ( -0.51%) 4005.76 ( 5.09%) 3996.73 ( 4.85%) 4021.00 ( 5.49%)
Copy 71M 3370.59 ( 0.00%) 3360.98 ( -0.29%) 3357.42 ( -0.39%) 3478.74 ( 3.21%) 3472.59 ( 3.03%) 3481.72 ( 3.30%)
Scale 71M 3168.70 ( 0.00%) 3170.94 ( 0.07%) 3150.83 ( -0.56%) 3394.36 ( 7.12%) 3390.88 ( 7.01%) 3397.04 ( 7.21%)
Triad 71M 3536.14 ( 0.00%) 3525.38 ( -0.30%) 3521.01 ( -0.43%) 3855.90 ( 9.04%) 3850.99 ( 8.90%) 3859.34 ( 9.14%)
Add 85M 3805.94 ( 0.00%) 3792.84 ( -0.34%) 3796.44 ( -0.25%) 4004.15 ( 5.21%) 4003.69 ( 5.20%) 3990.20 ( 4.84%)
Copy 85M 3354.76 ( 0.00%) 3357.55 ( 0.08%) 3360.68 ( 0.18%) 3477.66 ( 3.66%) 3480.74 ( 3.76%) 3471.36 ( 3.48%)
Scale 85M 3162.20 ( 0.00%) 3156.40 ( -0.18%) 3164.00 ( 0.06%) 3396.25 ( 7.40%) 3398.16 ( 7.46%) 3390.12 ( 7.21%)
Triad 85M 3538.76 ( 0.00%) 3522.94 ( -0.45%) 3533.03 ( -0.16%) 3854.39 ( 8.92%) 3861.37 ( 9.12%) 3848.60 ( 8.76%)
Add 113M 3803.66 ( 0.00%) 3785.42 ( -0.48%) 3804.21 ( 0.01%) 3997.16 ( 5.09%) 4029.74 ( 5.94%) 3987.10 ( 4.82%)
Copy 113M 3348.32 ( 0.00%) 3359.18 ( 0.32%) 3362.06 ( 0.41%) 3479.75 ( 3.93%) 3488.98 ( 4.20%) 3476.86 ( 3.84%)
Scale 113M 3177.09 ( 0.00%) 3148.61 ( -0.90%) 3147.95 ( -0.92%) 3396.00 ( 6.89%) 3404.06 ( 7.14%) 3395.97 ( 6.89%)
Triad 113M 3536.06 ( 0.00%) 3513.51 ( -0.64%) 3531.90 ( -0.12%) 3854.44 ( 9.00%) 3869.05 ( 9.42%) 3857.86 ( 9.10%)
Add 142M 3814.65 ( 0.00%) 3779.76 ( -0.91%) 3796.14 ( -0.49%) 3989.97 ( 4.60%) 3982.66 ( 4.40%) 3944.66 ( 3.41%)
Copy 142M 3353.31 ( 0.00%) 3347.29 ( -0.18%) 3360.60 ( 0.22%) 3477.55 ( 3.70%) 3471.80 ( 3.53%) 3465.60 ( 3.35%)
Scale 142M 3186.05 ( 0.00%) 3161.07 ( -0.78%) 3154.54 ( -0.99%) 3397.67 ( 6.64%) 3394.53 ( 6.54%) 3386.56 ( 6.29%)
Triad 142M 3545.41 ( 0.00%) 3518.27 ( -0.77%) 3527.15 ( -0.52%) 3858.25 ( 8.82%) 3851.34 ( 8.63%) 3841.65 ( 8.36%)
Add 170M 3787.71 ( 0.00%) 3805.45 ( 0.47%) 3781.99 ( -0.15%) 3990.15 ( 5.34%) 3990.16 ( 5.34%) 3997.08 ( 5.53%)
Copy 170M 3351.50 ( 0.00%) 3362.22 ( 0.32%) 3345.90 ( -0.17%) 3478.71 ( 3.80%) 3483.70 ( 3.94%) 3479.19 ( 3.81%)
Scale 170M 3158.38 ( 0.00%) 3175.47 ( 0.54%) 3151.34 ( -0.22%) 3398.22 ( 7.59%) 3400.09 ( 7.65%) 3396.11 ( 7.53%)
Triad 170M 3521.84 ( 0.00%) 3534.01 ( 0.35%) 3513.94 ( -0.22%) 3857.99 ( 9.54%) 3863.00 ( 9.69%) 3856.79 ( 9.51%)
Add 227M 3794.46 ( 0.00%) 3799.80 ( 0.14%) 3789.75 ( -0.12%) 4001.21 ( 5.45%) 3982.66 ( 4.96%) 3991.65 ( 5.20%)
Copy 227M 3368.15 ( 0.00%) 3361.29 ( -0.20%) 3357.70 ( -0.31%) 3482.76 ( 3.40%) 3473.54 ( 3.13%) 3480.61 ( 3.34%)
Scale 227M 3160.18 ( 0.00%) 3164.94 ( 0.15%) 3155.77 ( -0.14%) 3402.44 ( 7.67%) 3390.24 ( 7.28%) 3397.39 ( 7.51%)
Triad 227M 3525.39 ( 0.00%) 3523.04 ( -0.07%) 3524.31 ( -0.03%) 3865.12 ( 9.64%) 3851.41 ( 9.25%) 3859.91 ( 9.49%)
Add 284M 3804.29 ( 0.00%) 3799.06 ( -0.14%) 3805.86 ( 0.04%) 4007.77 ( 5.35%) 3986.91 ( 4.80%) 3996.16 ( 5.04%)
Copy 284M 3366.21 ( 0.00%) 3349.03 ( -0.51%) 3369.99 ( 0.11%) 3482.10 ( 3.44%) 3469.08 ( 3.06%) 3475.51 ( 3.25%)
Scale 284M 3174.61 ( 0.00%) 3173.80 ( -0.03%) 3147.99 ( -0.84%) 3402.22 ( 7.17%) 3386.58 ( 6.68%) 3395.61 ( 6.96%)
Triad 284M 3538.50 ( 0.00%) 3538.46 ( -0.00%) 3529.69 ( -0.25%) 3860.86 ( 9.11%) 3843.72 ( 8.63%) 3853.96 ( 8.92%)
Add 341M 3805.26 ( 0.00%) 3764.38 ( -1.07%) 3789.55 ( -0.41%) 3989.04 ( 4.83%) 3977.50 ( 4.53%) 4023.64 ( 5.74%)
Copy 341M 3366.98 ( 0.00%) 3341.40 ( -0.76%) 3362.85 ( -0.12%) 3476.89 ( 3.26%) 3474.40 ( 3.19%) 3489.58 ( 3.64%)
Scale 341M 3159.11 ( 0.00%) 3168.92 ( 0.31%) 3177.39 ( 0.58%) 3398.01 ( 7.56%) 3393.30 ( 7.41%) 3405.15 ( 7.79%)
Triad 341M 3530.80 ( 0.00%) 3506.03 ( -0.70%) 3528.16 ( -0.07%) 3858.85 ( 9.29%) 3851.56 ( 9.08%) 3868.18 ( 9.56%)
Add 455M 3791.15 ( 0.00%) 3794.39 ( 0.09%) 3807.19 ( 0.42%) 4029.29 ( 6.28%) 3985.30 ( 5.12%) 3988.07 ( 5.19%)
Copy 455M 3353.30 ( 0.00%) 3365.90 ( 0.38%) 3358.94 ( 0.17%) 3486.16 ( 3.96%) 3475.41 ( 3.64%) 3474.43 ( 3.61%)
Scale 455M 3161.21 ( 0.00%) 3166.60 ( 0.17%) 3160.11 ( -0.03%) 3401.81 ( 7.61%) 3396.29 ( 7.44%) 3395.46 ( 7.41%)
Triad 455M 3527.90 ( 0.00%) 3525.16 ( -0.08%) 3536.99 ( 0.26%) 3864.91 ( 9.55%) 3858.19 ( 9.36%) 3855.59 ( 9.29%)
Add 568M 3779.79 ( 0.00%) 3801.70 ( 0.58%) 3782.09 ( 0.06%) 3985.25 ( 5.44%) 4026.56 ( 6.53%) 3926.30 ( 3.88%)
Copy 568M 3349.93 ( 0.00%) 3366.10 ( 0.48%) 3336.55 ( -0.40%) 3472.59 ( 3.66%) 3485.34 ( 4.04%) 3460.49 ( 3.30%)
Scale 568M 3163.69 ( 0.00%) 3170.00 ( 0.20%) 3159.05 ( -0.15%) 3393.16 ( 7.25%) 3400.62 ( 7.49%) 3382.99 ( 6.93%)
Triad 568M 3518.65 ( 0.00%) 3535.79 ( 0.49%) 3517.04 ( -0.05%) 3850.19 ( 9.42%) 3863.35 ( 9.80%) 3839.40 ( 9.12%)
Add 682M 3801.06 ( 0.00%) 3805.79 ( 0.12%) 3786.90 ( -0.37%) 3977.83 ( 4.65%) 3956.61 ( 4.09%) 4001.91 ( 5.28%)
Copy 682M 3363.64 ( 0.00%) 3357.79 ( -0.17%) 3353.57 ( -0.30%) 3474.04 ( 3.28%) 3469.78 ( 3.16%) 3475.62 ( 3.33%)
Scale 682M 3151.89 ( 0.00%) 3169.57 ( 0.56%) 3159.20 ( 0.23%) 3395.81 ( 7.74%) 3392.14 ( 7.62%) 3393.91 ( 7.68%)
Triad 682M 3528.97 ( 0.00%) 3538.12 ( 0.26%) 3519.04 ( -0.28%) 3854.44 ( 9.22%) 3849.45 ( 9.08%) 3853.38 ( 9.19%)
Add 910M 3778.97 ( 0.00%) 3785.79 ( 0.18%) 3799.23 ( 0.54%) 4043.50 ( 7.00%) 4005.92 ( 6.01%) 4014.66 ( 6.24%)
Copy 910M 3345.09 ( 0.00%) 3355.05 ( 0.30%) 3353.56 ( 0.25%) 3487.47 ( 4.26%) 3473.79 ( 3.85%) 3489.55 ( 4.32%)
Scale 910M 3164.46 ( 0.00%) 3157.34 ( -0.23%) 3167.60 ( 0.10%) 3399.70 ( 7.43%) 3390.43 ( 7.14%) 3404.38 ( 7.58%)
Triad 910M 3516.19 ( 0.00%) 3520.82 ( 0.13%) 3534.78 ( 0.53%) 3861.71 ( 9.83%) 3850.59 ( 9.51%) 3867.83 ( 10.00%)
Add 1137M 3812.17 ( 0.00%) 3795.34 ( -0.44%) 3799.71 ( -0.33%) 4022.75 ( 5.52%) 3985.00 ( 4.53%) 3997.57 ( 4.86%)
Copy 1137M 3367.52 ( 0.00%) 3364.07 ( -0.10%) 3367.26 ( -0.01%) 3480.58 ( 3.36%) 3468.42 ( 3.00%) 3473.41 ( 3.14%)
Scale 1137M 3158.62 ( 0.00%) 3155.05 ( -0.11%) 3164.45 ( 0.18%) 3397.03 ( 7.55%) 3386.94 ( 7.23%) 3392.39 ( 7.40%)
Triad 1137M 3536.97 ( 0.00%) 3526.00 ( -0.31%) 3529.99 ( -0.20%) 3858.44 ( 9.09%) 3845.78 ( 8.73%) 3850.80 ( 8.87%)
Add 1365M 3806.51 ( 0.00%) 3791.63 ( -0.39%) 3786.57 ( -0.52%) 3962.59 ( 4.10%) 4029.60 ( 5.86%) 3990.23 ( 4.83%)
Copy 1365M 3360.43 ( 0.00%) 3363.15 ( 0.08%) 3347.19 ( -0.39%) 3474.10 ( 3.38%) 3488.82 ( 3.82%) 3478.98 ( 3.53%)
Scale 1365M 3155.95 ( 0.00%) 3160.77 ( 0.15%) 3164.41 ( 0.27%) 3394.90 ( 7.57%) 3405.19 ( 7.90%) 3396.64 ( 7.63%)
Triad 1365M 3534.18 ( 0.00%) 3521.12 ( -0.37%) 3519.49 ( -0.42%) 3856.06 ( 9.11%) 3865.20 ( 9.37%) 3857.96 ( 9.16%)
Add 1820M 3797.86 ( 0.00%) 3795.51 ( -0.06%) 3800.31 ( 0.06%) 4023.79 ( 5.95%) 3955.34 ( 4.15%) 4003.20 ( 5.41%)
Copy 1820M 3362.09 ( 0.00%) 3361.06 ( -0.03%) 3359.74 ( -0.07%) 3482.46 ( 3.58%) 3468.46 ( 3.16%) 3474.92 ( 3.36%)
Scale 1820M 3170.20 ( 0.00%) 3160.70 ( -0.30%) 3166.72 ( -0.11%) 3396.61 ( 7.14%) 3391.98 ( 7.00%) 3393.97 ( 7.06%)
Triad 1820M 3531.00 ( 0.00%) 3527.31 ( -0.10%) 3530.65 ( -0.01%) 3858.18 ( 9.27%) 3849.65 ( 9.02%) 3854.65 ( 9.17%)
Add 2275M 3810.31 ( 0.00%) 3792.47 ( -0.47%) 3767.11 ( -1.13%) 3982.71 ( 4.52%) 3987.02 ( 4.64%) 3977.99 ( 4.40%)
Copy 2275M 3373.60 ( 0.00%) 3358.29 ( -0.45%) 3335.43 ( -1.13%) 3478.34 ( 3.10%) 3476.07 ( 3.04%) 3475.55 ( 3.02%)
Scale 2275M 3174.64 ( 0.00%) 3159.58 ( -0.47%) 3158.94 ( -0.49%) 3398.12 ( 7.04%) 3395.41 ( 6.95%) 3395.88 ( 6.97%)
Triad 2275M 3537.57 ( 0.00%) 3527.90 ( -0.27%) 3508.53 ( -0.82%) 3860.60 ( 9.13%) 3856.96 ( 9.03%) 3856.09 ( 9.00%)
Add 2730M 3801.09 ( 0.00%) 3812.05 ( 0.29%) 3802.64 ( 0.04%) 3981.20 ( 4.74%) 4017.01 ( 5.68%) 3938.62 ( 3.62%)
Copy 2730M 3357.18 ( 0.00%) 3365.37 ( 0.24%) 3361.64 ( 0.13%) 3477.74 ( 3.59%) 3475.85 ( 3.53%) 3464.04 ( 3.18%)
Scale 2730M 3177.66 ( 0.00%) 3168.10 ( -0.30%) 3161.30 ( -0.51%) 3397.39 ( 6.91%) 3393.51 ( 6.79%) 3386.47 ( 6.57%)
Triad 2730M 3539.59 ( 0.00%) 3543.83 ( 0.12%) 3528.50 ( -0.31%) 3861.50 ( 9.09%) 3854.09 ( 8.89%) 3845.27 ( 8.64%)
Add 3640M 3816.88 ( 0.00%) 3791.01 ( -0.68%) 3779.35 ( -0.98%) 3976.53 ( 4.18%) 4050.84 ( 6.13%) 3991.81 ( 4.58%)
Copy 3640M 3375.91 ( 0.00%) 3349.60 ( -0.78%) 3347.88 ( -0.83%) 3472.83 ( 2.87%) 3485.96 ( 3.26%) 3474.40 ( 2.92%)
Scale 3640M 3167.22 ( 0.00%) 3168.24 ( 0.03%) 3157.93 ( -0.29%) 3395.00 ( 7.19%) 3400.17 ( 7.36%) 3395.70 ( 7.21%)
Triad 3640M 3546.45 ( 0.00%) 3528.90 ( -0.49%) 3517.90 ( -0.81%) 3855.08 ( 8.70%) 3860.11 ( 8.84%) 3854.39 ( 8.68%)
Add 4551M 3799.05 ( 0.00%) 3805.03 ( 0.16%) 3806.14 ( 0.19%) 4028.14 ( 6.03%) 4026.96 ( 6.00%) 4021.84 ( 5.86%)
Copy 4551M 3355.66 ( 0.00%) 3358.64 ( 0.09%) 3356.91 ( 0.04%) 3487.50 ( 3.93%) 3485.92 ( 3.88%) 3481.72 ( 3.76%)
Scale 4551M 3171.91 ( 0.00%) 3174.92 ( 0.09%) 3163.54 ( -0.26%) 3402.45 ( 7.27%) 3401.04 ( 7.22%) 3396.90 ( 7.09%)
Triad 4551M 3531.61 ( 0.00%) 3535.95 ( 0.12%) 3536.00 ( 0.12%) 3864.84 ( 9.44%) 3865.01 ( 9.44%) 3857.47 ( 9.23%)
Add 5461M 3801.60 ( 0.00%) 3774.49 ( -0.71%) 3779.16 ( -0.59%) 4010.68 ( 5.50%) 3958.91 ( 4.14%) 4011.94 ( 5.53%)
Copy 5461M 3360.29 ( 0.00%) 3347.56 ( -0.38%) 3351.31 ( -0.27%) 3483.90 ( 3.68%) 3467.72 ( 3.20%) 3480.64 ( 3.58%)
Scale 5461M 3161.18 ( 0.00%) 3154.56 ( -0.21%) 3149.71 ( -0.36%) 3399.26 ( 7.53%) 3391.35 ( 7.28%) 3396.95 ( 7.46%)
Triad 5461M 3532.35 ( 0.00%) 3510.19 ( -0.63%) 3512.62 ( -0.56%) 3862.91 ( 9.36%) 3849.95 ( 8.99%) 3858.71 ( 9.24%)
Add 7281M 3800.80 ( 0.00%) 3789.71 ( -0.29%) 3779.60 ( -0.56%) 4023.89 ( 5.87%) 4000.63 ( 5.26%) 3974.68 ( 4.57%)
Copy 7281M 3359.99 ( 0.00%) 3349.71 ( -0.31%) 3346.82 ( -0.39%) 3482.20 ( 3.64%) 3481.97 ( 3.63%) 3471.59 ( 3.32%)
Scale 7281M 3168.68 ( 0.00%) 3167.95 ( -0.02%) 3154.70 ( -0.44%) 3399.98 ( 7.30%) 3400.46 ( 7.31%) 3392.10 ( 7.05%)
Triad 7281M 3533.59 ( 0.00%) 3524.63 ( -0.25%) 3514.25 ( -0.55%) 3861.39 ( 9.28%) 3861.70 ( 9.29%) 3853.31 ( 9.05%)
Add 9102M 3790.67 ( 0.00%) 3791.28 ( 0.02%) 3790.38 ( -0.01%) 4015.48 ( 5.93%) 4013.46 ( 5.88%) 4014.66 ( 5.91%)
Copy 9102M 3345.80 ( 0.00%) 3365.09 ( 0.58%) 3353.79 ( 0.24%) 3480.51 ( 4.03%) 3479.74 ( 4.00%) 3481.55 ( 4.06%)
Scale 9102M 3174.65 ( 0.00%) 3149.82 ( -0.78%) 3166.84 ( -0.25%) 3398.75 ( 7.06%) 3398.27 ( 7.04%) 3399.20 ( 7.07%)
Triad 9102M 3529.51 ( 0.00%) 3523.03 ( -0.18%) 3524.38 ( -0.15%) 3861.12 ( 9.40%) 3858.35 ( 9.32%) 3860.55 ( 9.38%)
Add 10922M 3807.96 ( 0.00%) 3784.18 ( -0.62%) 3779.45 ( -0.75%) 4021.53 ( 5.61%) 3984.89 ( 4.65%) 4005.11 ( 5.18%)
Copy 10922M 3350.99 ( 0.00%) 3351.97 ( 0.03%) 3353.08 ( 0.06%) 3490.40 ( 4.16%) 3472.32 ( 3.62%) 3473.98 ( 3.67%)
Scale 10922M 3164.74 ( 0.00%) 3167.46 ( 0.09%) 3154.60 ( -0.32%) 3402.35 ( 7.51%) 3392.56 ( 7.20%) 3392.16 ( 7.19%)
Triad 10922M 3536.69 ( 0.00%) 3524.27 ( -0.35%) 3516.30 ( -0.58%) 3865.21 ( 9.29%) 3850.74 ( 8.88%) 3849.32 ( 8.84%)
Add 14563M 3786.28 ( 0.00%) 3793.09 ( 0.18%) 3787.76 ( 0.04%) 3976.82 ( 5.03%) 3987.54 ( 5.32%) 3988.31 ( 5.34%)
Copy 14563M 3352.51 ( 0.00%) 3355.74 ( 0.10%) 3357.05 ( 0.14%) 3472.63 ( 3.58%) 3475.97 ( 3.68%) 3470.44 ( 3.52%)
Scale 14563M 3171.95 ( 0.00%) 3168.28 ( -0.12%) 3158.17 ( -0.43%) 3393.54 ( 6.99%) 3399.68 ( 7.18%) 3390.82 ( 6.90%)
Triad 14563M 3522.50 ( 0.00%) 3526.12 ( 0.10%) 3519.97 ( -0.07%) 3853.92 ( 9.41%) 3856.89 ( 9.49%) 3847.38 ( 9.22%)
Add 18204M 3809.56 ( 0.00%) 3772.64 ( -0.97%) 3795.07 ( -0.38%) 4014.65 ( 5.38%) 3976.18 ( 4.37%) 3963.55 ( 4.04%)
Copy 18204M 3365.06 ( 0.00%) 3350.49 ( -0.43%) 3359.32 ( -0.17%) 3483.40 ( 3.52%) 3473.21 ( 3.21%) 3467.66 ( 3.05%)
Scale 18204M 3171.25 ( 0.00%) 3151.05 ( -0.64%) 3163.69 ( -0.24%) 3400.05 ( 7.21%) 3393.76 ( 7.02%) 3388.64 ( 6.85%)
Triad 18204M 3539.90 ( 0.00%) 3508.60 ( -0.88%) 3532.25 ( -0.22%) 3860.99 ( 9.07%) 3853.56 ( 8.86%) 3847.01 ( 8.68%)
Add 21845M 3798.46 ( 0.00%) 3800.35 ( 0.05%) 3791.21 ( -0.19%) 3995.49 ( 5.19%) 3990.65 ( 5.06%) 3969.12 ( 4.49%)
Copy 21845M 3362.14 ( 0.00%) 3363.46 ( 0.04%) 3355.34 ( -0.20%) 3477.61 ( 3.43%) 3478.33 ( 3.46%) 3472.19 ( 3.27%)
Scale 21845M 3170.99 ( 0.00%) 3164.60 ( -0.20%) 3162.31 ( -0.27%) 3398.14 ( 7.16%) 3396.25 ( 7.10%) 3393.58 ( 7.02%)
Triad 21845M 3534.49 ( 0.00%) 3527.34 ( -0.20%) 3522.95 ( -0.33%) 3858.35 ( 9.16%) 3856.52 ( 9.11%) 3854.98 ( 9.07%)
Add 29127M 3819.69 ( 0.00%) 3783.38 ( -0.95%) 3786.06 ( -0.88%) 4007.04 ( 4.90%) 4005.91 ( 4.88%) 4000.99 ( 4.75%)
Copy 29127M 3384.67 ( 0.00%) 3345.60 ( -1.15%) 3339.55 ( -1.33%) 3480.54 ( 2.83%) 3479.91 ( 2.81%) 3475.18 ( 2.67%)
Scale 29127M 3158.68 ( 0.00%) 3166.06 ( 0.23%) 3151.78 ( -0.22%) 3399.73 ( 7.63%) 3395.21 ( 7.49%) 3393.50 ( 7.43%)
Triad 29127M 3538.17 ( 0.00%) 3520.17 ( -0.51%) 3523.09 ( -0.43%) 3862.24 ( 9.16%) 3858.60 ( 9.06%) 3851.85 ( 8.87%)
Add 36408M 3806.95 ( 0.00%) 3793.61 ( -0.35%) 3777.70 ( -0.77%) 4016.66 ( 5.51%) 3994.64 ( 4.93%) 3991.57 ( 4.85%)
Copy 36408M 3361.11 ( 0.00%) 3347.61 ( -0.40%) 3353.38 ( -0.23%) 3483.09 ( 3.63%) 3476.44 ( 3.43%) 3473.26 ( 3.34%)
Scale 36408M 3165.87 ( 0.00%) 3173.95 ( 0.26%) 3171.11 ( 0.17%) 3398.81 ( 7.36%) 3394.38 ( 7.22%) 3393.16 ( 7.18%)
Triad 36408M 3536.86 ( 0.00%) 3533.81 ( -0.09%) 3513.64 ( -0.66%) 3860.60 ( 9.15%) 3855.77 ( 9.02%) 3853.09 ( 8.94%)
Add 43690M 3799.39 ( 0.00%) 3795.90 ( -0.09%) 3803.79 ( 0.12%) 3996.57 ( 5.19%) 4006.70 ( 5.46%) 3981.15 ( 4.78%)
Copy 43690M 3359.26 ( 0.00%) 3360.94 ( 0.05%) 3371.10 ( 0.35%) 3479.62 ( 3.58%) 3481.69 ( 3.64%) 3478.45 ( 3.55%)
Scale 43690M 3175.35 ( 0.00%) 3163.95 ( -0.36%) 3147.34 ( -0.88%) 3396.36 ( 6.96%) 3399.45 ( 7.06%) 3398.88 ( 7.04%)
Triad 43690M 3535.26 ( 0.00%) 3526.88 ( -0.24%) 3528.38 ( -0.19%) 3857.30 ( 9.11%) 3858.89 ( 9.15%) 3858.38 ( 9.14%)
Add 58254M 3799.66 ( 0.00%) 3772.37 ( -0.72%) 3768.33 ( -0.82%) 4016.47 ( 5.71%) 4014.25 ( 5.65%) 3968.79 ( 4.45%)
Copy 58254M 3355.12 ( 0.00%) 3337.75 ( -0.52%) 3337.41 ( -0.53%) 3481.56 ( 3.77%) 3481.28 ( 3.76%) 3465.39 ( 3.29%)
Scale 58254M 3170.94 ( 0.00%) 3159.81 ( -0.35%) 3164.09 ( -0.22%) 3398.35 ( 7.17%) 3396.30 ( 7.11%) 3388.58 ( 6.86%)
Triad 58254M 3537.26 ( 0.00%) 3511.62 ( -0.72%) 3507.54 ( -0.84%) 3860.59 ( 9.14%) 3858.62 ( 9.09%) 3847.30 ( 8.76%)
Add 72817M 3815.26 ( 0.00%) 3812.73 ( -0.07%) 3787.86 ( -0.72%) 3968.21 ( 4.01%) 4030.38 ( 5.64%) 3956.57 ( 3.70%)
Copy 72817M 3362.18 ( 0.00%) 3371.41 ( 0.27%) 3345.64 ( -0.49%) 3474.38 ( 3.34%) 3482.00 ( 3.56%) 3469.46 ( 3.19%)
Scale 72817M 3175.73 ( 0.00%) 3170.64 ( -0.16%) 3154.28 ( -0.68%) 3394.65 ( 6.89%) 3396.69 ( 6.96%) 3390.78 ( 6.77%)
Triad 72817M 3546.44 ( 0.00%) 3537.21 ( -0.26%) 3520.46 ( -0.73%) 3855.50 ( 8.71%) 3855.34 ( 8.71%) 3849.10 ( 8.53%)
Add 87381M 3519.93 ( 0.00%) 3501.24 ( -0.53%) 3500.84 ( -0.54%) 3833.20 ( 8.90%) 3833.26 ( 8.90%) 3840.72 ( 9.11%)
Copy 87381M 3175.29 ( 0.00%) 3166.11 ( -0.29%) 3163.97 ( -0.36%) 3263.09 ( 2.77%) 3264.10 ( 2.80%) 3266.85 ( 2.88%)
Scale 87381M 2848.76 ( 0.00%) 2835.15 ( -0.48%) 2832.37 ( -0.58%) 3177.70 ( 11.55%) 3172.81 ( 11.38%) 3180.05 ( 11.63%)
Triad 87381M 3465.19 ( 0.00%) 3453.66 ( -0.33%) 3456.03 ( -0.26%) 3777.01 ( 9.00%) 3774.30 ( 8.92%) 3783.31 ( 9.18%)
Remote access costs are quite visible in this memory streaming benchmark.
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
User 1144.35 1154.81 1156.38 1075.31 1083.70 1087.08
System 55.28 56.07 56.35 49.00 49.06 48.84
Elapsed 1207.64 1220.14 1222.13 1132.20 1141.91 1145.08
pft
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
User 1 0.6980 ( 0.00%) 0.6900 ( 1.15%) 0.7050 ( -1.00%) 0.6500 ( 6.88%) 0.6550 ( 6.16%) 0.6750 ( 3.30%)
User 2 0.7040 ( 0.00%) 0.6990 ( 0.71%) 0.7000 ( 0.57%) 0.6980 ( 0.85%) 0.7150 ( -1.56%) 0.7040 ( 0.00%)
User 3 0.6910 ( 0.00%) 0.6930 ( -0.29%) 0.7230 ( -4.63%) 0.7390 ( -6.95%) 0.7180 ( -3.91%) 0.7120 ( -3.04%)
User 4 0.7250 ( 0.00%) 0.7580 ( -4.55%) 0.7310 ( -0.83%) 0.7220 ( 0.41%) 0.7520 ( -3.72%) 0.7250 ( 0.00%)
User 5 0.7590 ( 0.00%) 0.7490 ( 1.32%) 0.7910 ( -4.22%) 0.7730 ( -1.84%) 0.7480 ( 1.45%) 0.7690 ( -1.32%)
User 6 0.8130 ( 0.00%) 0.8010 ( 1.48%) 0.7940 ( 2.34%) 0.7770 ( 4.43%) 0.7790 ( 4.18%) 0.7700 ( 5.29%)
User 7 0.8210 ( 0.00%) 0.8380 ( -2.07%) 0.8260 ( -0.61%) 0.7950 ( 3.17%) 0.8230 ( -0.24%) 0.7760 ( 5.48%)
User 8 0.8390 ( 0.00%) 0.8200 ( 2.26%) 0.8160 ( 2.74%) 0.7840 ( 6.56%) 0.7830 ( 6.67%) 0.8400 ( -0.12%)
System 1 9.1230 ( 0.00%) 9.1120 ( 0.12%) 9.0810 ( 0.46%) 8.2560 ( 9.50%) 8.2760 ( 9.28%) 8.2260 ( 9.83%)
System 2 9.3990 ( 0.00%) 9.3340 ( 0.69%) 9.4050 ( -0.06%) 8.4630 ( 9.96%) 8.4230 ( 10.38%) 8.4420 ( 10.18%)
System 3 9.1460 ( 0.00%) 9.0890 ( 0.62%) 9.1380 ( 0.09%) 8.5660 ( 6.34%) 8.5640 ( 6.36%) 8.5290 ( 6.75%)
System 4 8.9160 ( 0.00%) 8.8840 ( 0.36%) 8.9260 ( -0.11%) 8.6760 ( 2.69%) 8.6330 ( 3.17%) 8.6790 ( 2.66%)
System 5 9.5900 ( 0.00%) 9.5240 ( 0.69%) 9.5230 ( 0.70%) 8.9390 ( 6.79%) 8.8920 ( 7.28%) 8.9410 ( 6.77%)
System 6 9.8640 ( 0.00%) 9.7120 ( 1.54%) 9.8740 ( -0.10%) 9.1460 ( 7.28%) 9.1310 ( 7.43%) 9.1400 ( 7.34%)
System 7 9.9860 ( 0.00%) 9.9290 ( 0.57%) 10.0030 ( -0.17%) 9.3360 ( 6.51%) 9.2430 ( 7.44%) 9.2860 ( 7.01%)
System 8 9.8570 ( 0.00%) 9.8510 ( 0.06%) 9.9980 ( -1.43%) 9.3050 ( 5.60%) 9.2410 ( 6.25%) 9.4170 ( 4.46%)
Elapsed 1 9.8240 ( 0.00%) 9.8050 ( 0.19%) 9.7910 ( 0.34%) 8.9080 ( 9.32%) 8.9320 ( 9.08%) 8.9080 ( 9.32%)
Elapsed 2 5.0870 ( 0.00%) 5.0500 ( 0.73%) 5.0710 ( 0.31%) 4.6020 ( 9.53%) 4.5860 ( 9.85%) 4.5940 ( 9.69%)
Elapsed 3 3.3220 ( 0.00%) 3.2990 ( 0.69%) 3.3210 ( 0.03%) 3.1170 ( 6.17%) 3.1150 ( 6.23%) 3.0950 ( 6.83%)
Elapsed 4 2.4440 ( 0.00%) 2.4440 ( 0.00%) 2.4410 ( 0.12%) 2.3930 ( 2.09%) 2.3780 ( 2.70%) 2.3710 ( 2.99%)
Elapsed 5 2.1500 ( 0.00%) 2.1410 ( 0.42%) 2.1400 ( 0.47%) 2.0020 ( 6.88%) 1.9830 ( 7.77%) 2.0030 ( 6.84%)
Elapsed 6 1.8290 ( 0.00%) 1.7970 ( 1.75%) 1.8260 ( 0.16%) 1.6960 ( 7.27%) 1.6980 ( 7.16%) 1.6930 ( 7.44%)
Elapsed 7 1.5760 ( 0.00%) 1.5610 ( 0.95%) 1.5860 ( -0.63%) 1.4830 ( 5.90%) 1.4740 ( 6.47%) 1.4730 ( 6.54%)
Elapsed 8 1.3660 ( 0.00%) 1.3490 ( 1.24%) 1.3660 ( -0.00%) 1.2820 ( 6.15%) 1.2660 ( 7.32%) 1.3030 ( 4.61%)
Faults/cpu 1 336505.5875 ( 0.00%) 337163.8429 ( 0.20%) 337713.8261 ( 0.36%) 371079.7726 ( 10.27%) 370090.7928 ( 9.98%) 371199.3702 ( 10.31%)
Faults/cpu 2 327139.2186 ( 0.00%) 329451.3249 ( 0.71%) 326974.9735 ( -0.05%) 360766.3203 ( 10.28%) 361595.0312 ( 10.53%) 361389.4583 ( 10.47%)
Faults/cpu 3 336004.1324 ( 0.00%) 337826.9136 ( 0.54%) 335004.8869 ( -0.30%) 355249.2266 ( 5.73%) 356016.6570 ( 5.96%) 357584.5258 ( 6.42%)
Faults/cpu 4 342824.1564 ( 0.00%) 342825.3087 ( 0.00%) 342285.3156 ( -0.16%) 351758.5702 ( 2.61%) 352312.8339 ( 2.77%) 351503.0837 ( 2.53%)
Faults/cpu 5 319553.7707 ( 0.00%) 321799.3129 ( 0.70%) 320521.1950 ( 0.30%) 340315.3807 ( 6.50%) 342890.6018 ( 7.30%) 340381.5220 ( 6.52%)
Faults/cpu 6 309614.5554 ( 0.00%) 314330.1834 ( 1.52%) 309882.5231 ( 0.09%) 333075.2546 ( 7.58%) 333637.6404 ( 7.76%) 333706.0587 ( 7.78%)
Faults/cpu 7 306159.2969 ( 0.00%) 307277.9428 ( 0.37%) 305306.4748 ( -0.28%) 326309.2165 ( 6.58%) 328327.9627 ( 7.24%) 328590.8507 ( 7.33%)
Faults/cpu 8 309077.4966 ( 0.00%) 309849.8370 ( 0.25%) 305865.6953 ( -1.04%) 327958.3107 ( 6.11%) 329731.7933 ( 6.68%) 322280.8870 ( 4.27%)
Faults/sec 1 336364.5575 ( 0.00%) 336993.1010 ( 0.19%) 337563.4257 ( 0.36%) 370916.0228 ( 10.27%) 369955.7605 ( 9.99%) 370971.4836 ( 10.29%)
Faults/sec 2 649713.2290 ( 0.00%) 654448.6622 ( 0.73%) 651706.3799 ( 0.31%) 717987.1734 ( 10.51%) 720641.9249 ( 10.92%) 719435.7495 ( 10.73%)
Faults/sec 3 994812.3119 ( 0.00%) 1001443.9434 ( 0.67%) 995205.6607 ( 0.04%) 1060228.7843 ( 6.58%) 1060484.8602 ( 6.60%) 1067127.5522 ( 7.27%)
Faults/sec 4 1352137.4832 ( 0.00%) 1352463.8578 ( 0.02%) 1354323.6163 ( 0.16%) 1382325.4091 ( 2.23%) 1390344.3320 ( 2.83%) 1393760.7116 ( 3.08%)
Faults/sec 5 1538115.0421 ( 0.00%) 1544331.3978 ( 0.40%) 1544368.0159 ( 0.41%) 1651247.2902 ( 7.36%) 1666751.7259 ( 8.36%) 1651371.8632 ( 7.36%)
Faults/sec 6 1807211.7324 ( 0.00%) 1840430.0157 ( 1.84%) 1809763.9743 ( 0.14%) 1947049.8237 ( 7.74%) 1946986.6396 ( 7.73%) 1953384.4599 ( 8.09%)
Faults/sec 7 2101840.1872 ( 0.00%) 2120169.4773 ( 0.87%) 2082926.2675 ( -0.90%) 2233207.9026 ( 6.25%) 2241803.5953 ( 6.66%) 2242647.3545 ( 6.70%)
Faults/sec 8 2421813.7208 ( 0.00%) 2453320.5034 ( 1.30%) 2419371.3924 ( -0.10%) 2582755.9228 ( 6.65%) 2612638.2836 ( 7.88%) 2537575.0399 ( 4.78%)
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
User 60.57 61.53 61.96 59.47 60.74 60.78
System 868.16 862.80 868.89 805.82 802.76 806.01
Elapsed 336.19 336.02 339.19 311.33 313.18 313.58
And page fault microbenchmarks also see a benefit, probably because the
zeroing of pages no longer incurs a remote access penalty. It shows up as lower system CPU usage.
ebizzy
3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3 3.13.0-rc3
vanilla instrument-v2r1 lruslabonly-v2r1 local-v2r6 acct-v2r6 remotefile-v2r6
Mean 1 3213.33 ( 0.00%) 3161.67 ( -1.61%) 3177.00 ( -1.13%) 3234.33 ( 0.65%) 3224.00 ( 0.33%) 3198.33 ( -0.47%)
Mean 2 2291.33 ( 0.00%) 2316.67 ( 1.11%) 2309.67 ( 0.80%) 2348.67 ( 2.50%) 2330.00 ( 1.69%) 2332.00 ( 1.77%)
Mean 3 2234.67 ( 0.00%) 2298.67 ( 2.86%) 2252.00 ( 0.78%) 2280.67 ( 2.06%) 2292.67 ( 2.60%) 2270.00 ( 1.58%)
Mean 4 2224.33 ( 0.00%) 2279.00 ( 2.46%) 2250.67 ( 1.18%) 2282.33 ( 2.61%) 2256.00 ( 1.42%) 2256.33 ( 1.44%)
Mean 5 2256.33 ( 0.00%) 2280.33 ( 1.06%) 2265.00 ( 0.38%) 2280.67 ( 1.08%) 2268.33 ( 0.53%) 2276.67 ( 0.90%)
Mean 6 2233.00 ( 0.00%) 2257.33 ( 1.09%) 2200.00 ( -1.48%) 2292.33 ( 2.66%) 2274.33 ( 1.85%) 2250.33 ( 0.78%)
Mean 7 2212.33 ( 0.00%) 2229.00 ( 0.75%) 2201.67 ( -0.48%) 2279.00 ( 3.01%) 2265.00 ( 2.38%) 2251.67 ( 1.78%)
Mean 8 2224.67 ( 0.00%) 2226.33 ( 0.07%) 2225.67 ( 0.04%) 2255.00 ( 1.36%) 2280.67 ( 2.52%) 2238.67 ( 0.63%)
Mean 12 2213.33 ( 0.00%) 2240.00 ( 1.20%) 2264.67 ( 2.32%) 2249.33 ( 1.63%) 2257.67 ( 2.00%) 2238.00 ( 1.11%)
Mean 16 2221.00 ( 0.00%) 2226.33 ( 0.24%) 2268.00 ( 2.12%) 2266.33 ( 2.04%) 2241.00 ( 0.90%) 2258.33 ( 1.68%)
Mean 20 2215.00 ( 0.00%) 2256.00 ( 1.85%) 2278.33 ( 2.86%) 2238.67 ( 1.07%) 2271.67 ( 2.56%) 2291.00 ( 3.43%)
Mean 24 2175.00 ( 0.00%) 2181.00 ( 0.28%) 2166.67 ( -0.38%) 2211.00 ( 1.66%) 2231.00 ( 2.57%) 2247.67 ( 3.34%)
Mean 28 2110.00 ( 0.00%) 2136.00 ( 1.23%) 2123.33 ( 0.63%) 2157.00 ( 2.23%) 2163.00 ( 2.51%) 2164.67 ( 2.59%)
Mean 32 2077.67 ( 0.00%) 2095.33 ( 0.85%) 2091.33 ( 0.66%) 2110.67 ( 1.59%) 2113.33 ( 1.72%) 2110.33 ( 1.57%)
Mean 36 2016.33 ( 0.00%) 2024.67 ( 0.41%) 2039.33 ( 1.14%) 2066.33 ( 2.48%) 2068.00 ( 2.56%) 2069.00 ( 2.61%)
Mean 40 1984.00 ( 0.00%) 1987.00 ( 0.15%) 1993.33 ( 0.47%) 2037.00 ( 2.67%) 2035.00 ( 2.57%) 2042.00 ( 2.92%)
Mean 44 1943.33 ( 0.00%) 1954.33 ( 0.57%) 1961.00 ( 0.91%) 2004.33 ( 3.14%) 2009.67 ( 3.41%) 2018.00 ( 3.84%)
Mean 48 1925.00 ( 0.00%) 1939.33 ( 0.74%) 1929.00 ( 0.21%) 1990.67 ( 3.41%) 1996.33 ( 3.71%) 2007.67 ( 4.29%)
Stddev 1 25.42 ( 0.00%) 46.78 (-84.02%) 32.75 (-28.84%) 18.62 ( 26.73%) 21.95 ( 13.64%) 30.18 (-18.72%)
Stddev 2 29.68 ( 0.00%) 1.70 ( 94.27%) 13.77 ( 53.61%) 12.50 ( 57.89%) 15.51 ( 47.73%) 13.88 ( 53.23%)
Stddev 3 18.15 ( 0.00%) 27.48 (-51.35%) 4.32 ( 76.20%) 13.57 ( 25.23%) 15.52 ( 14.50%) 11.78 ( 35.13%)
Stddev 4 41.28 ( 0.00%) 13.64 ( 66.96%) 6.94 ( 83.18%) 24.51 ( 40.62%) 24.04 ( 41.76%) 7.41 ( 82.05%)
Stddev 5 27.18 ( 0.00%) 9.03 ( 66.78%) 4.97 ( 81.73%) 8.50 ( 68.74%) 17.25 ( 36.54%) 15.80 ( 41.88%)
Stddev 6 10.80 ( 0.00%) 17.97 (-66.36%) 9.27 ( 14.14%) 6.60 ( 38.90%) 16.01 (-48.20%) 19.36 (-79.26%)
Stddev 7 23.10 ( 0.00%) 17.91 ( 22.48%) 29.58 (-28.05%) 15.94 ( 31.00%) 5.72 ( 75.26%) 12.76 ( 44.75%)
Stddev 8 3.68 ( 0.00%) 41.52 (-1027.82%) 26.74 (-626.21%) 4.32 (-17.35%) 33.81 (-818.21%) 12.50 (-239.48%)
Stddev 12 23.84 ( 0.00%) 6.48 ( 72.81%) 14.66 ( 38.50%) 13.47 ( 43.47%) 11.79 ( 50.56%) 18.71 ( 21.52%)
Stddev 16 20.22 ( 0.00%) 17.13 ( 15.25%) 28.99 (-43.43%) 2.36 ( 88.34%) 2.16 ( 89.31%) 16.13 ( 20.20%)
Stddev 20 3.74 ( 0.00%) 6.53 (-74.57%) 45.02 (-1103.24%) 22.54 (-502.51%) 8.18 (-118.58%) 26.28 (-602.38%)
Stddev 24 18.18 ( 0.00%) 19.30 ( -6.16%) 23.81 (-30.93%) 9.42 ( 48.22%) 16.99 ( 6.57%) 8.18 ( 55.02%)
Stddev 28 11.78 ( 0.00%) 7.79 ( 33.86%) 15.92 (-35.22%) 12.96 (-10.07%) 12.83 ( -8.97%) 17.78 (-51.01%)
Stddev 32 9.74 ( 0.00%) 2.05 ( 78.91%) 8.81 ( 9.59%) 6.55 ( 32.77%) 3.09 ( 68.27%) 1.70 ( 82.55%)
Stddev 36 3.86 ( 0.00%) 5.44 (-40.89%) 2.36 ( 38.92%) 13.22 (-242.73%) 11.78 (-205.18%) 16.87 (-337.26%)
Stddev 40 14.17 ( 0.00%) 7.48 ( 47.17%) 5.56 ( 60.77%) 5.89 ( 58.44%) 2.16 ( 84.75%) 2.45 ( 82.71%)
Stddev 44 7.54 ( 0.00%) 3.40 ( 54.93%) 2.94 ( 60.97%) 7.54 ( 0.00%) 3.68 ( 51.19%) 1.63 ( 78.35%)
Stddev 48 2.94 ( 0.00%) 5.56 (-88.79%) 3.56 (-20.89%) 6.24 (-111.83%) 1.70 ( 42.26%) 17.25 (-485.95%)
Ran ebizzy because it doubles up as a page allocation microbenchmark that
exercises page faults differently to PFT. It looks like a reasonable gain but the
stddev is high and would need to be stabilised before drawing a solid conclusion.
None of these benchmarks do *anything* related to what commit 81c0a2bb was
supposed to fix. I just wanted to get the point across that our current
default behaviour sucks and we should revisit that decision.
My position is that by default we should only round-robin zones local to
the allocating process and that node round-robin is something that should
only be explicitly enabled.
I'm less sure about the round robin treatment of slab but am erring on
the side of historical behaviour until it is proven otherwise.
Documentation/sysctl/vm.txt | 32 +++++++++
include/linux/gfp.h | 4 +-
include/linux/mmzone.h | 2 +
include/linux/pagemap.h | 2 +-
include/linux/swap.h | 2 +
kernel/sysctl.c | 8 +++
mm/filemap.c | 2 +
mm/page_alloc.c | 153 +++++++++++++++++++++++++++++++++++++-------
8 files changed, 180 insertions(+), 25 deletions(-)
--
1.8.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* [PATCH 1/7] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10 ` Mel Gorman
1 sibling, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman
From: Johannes Weiner <hannes@cmpxchg.org>
Dave Hansen noted a regression in a microbenchmark that loops around
open() and close() on an 8-node NUMA machine and bisected it down to
81c0a2bb515f ("mm: page_alloc: fair zone allocator policy"). That
change forces the slab allocations of the file descriptor to spread
out to all 8 nodes, causing remote references in the page allocator
and slab.
The round-robin policy is only there to provide fairness among memory
allocations that are reclaimed involuntarily based on pressure in each
zone. It does not make sense to apply it to unreclaimable kernel
allocations that are freed manually, in this case instantly after the
allocation, and incur the remote reference costs twice for no reason.
Only round-robin allocations that are usually freed through page
reclaim or slab shrinking.
Cc: <stable@kernel.org>
Bisected-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f0..f861d02 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1920,7 +1920,8 @@ zonelist_scan:
* back to remote zones that do not partake in the
* fairness round-robin cycle of this zonelist.
*/
- if (alloc_flags & ALLOC_WMARK_LOW) {
+ if ((alloc_flags & ALLOC_WMARK_LOW) &&
+ (gfp_mask & GFP_MOVABLE_MASK)) {
if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
continue;
if (zone_reclaim_mode &&
--
1.8.4
* [PATCH 2/7] mm: page_alloc: Break out zone page aging distribution into its own helper
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10 ` Mel Gorman
1 sibling, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman
This patch moves the decision on whether to round-robin allocations between
zones and nodes into its own helper function. It'll make some later patches
easier to understand and it will be automatically inlined.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/page_alloc.c | 63 ++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 42 insertions(+), 21 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f861d02..64020eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1872,6 +1872,42 @@ static inline void init_zone_allows_reclaim(int nid)
#endif /* CONFIG_NUMA */
/*
+ * Distribute pages in proportion to the individual zone size to ensure fair
+ * page aging. The zone a page was allocated in should have no effect on the
+ * time the page has in memory before being reclaimed.
+ *
+ * Returns true if this zone should be skipped to spread the page ages to
+ * other zones.
+ */
+static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
+ struct zone *zone, int alloc_flags)
+{
+ /* Only round robin in the allocator fast path */
+ if (!(alloc_flags & ALLOC_WMARK_LOW))
+ return false;
+
+ /* Only round robin pages likely to be LRU or reclaimable slab */
+ if (!(gfp_mask & GFP_MOVABLE_MASK))
+ return false;
+
+ /* Distribute to the next zone if this zone has exhausted its batch */
+ if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
+ return true;
+
+ /*
+ * When zone_reclaim_mode is enabled, try to stay in local zones in the
+ * fastpath. If that fails, the slowpath is entered, which will do
+ * another pass starting with the local zones, but ultimately fall
+ * back to remote zones that do not partake in the fairness round-robin
+ * cycle of this zonelist.
+ */
+ if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+ return true;
+
+ return false;
+}
+
+/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
*/
@@ -1907,27 +1943,12 @@ zonelist_scan:
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
goto try_this_zone;
- /*
- * Distribute pages in proportion to the individual
- * zone size to ensure fair page aging. The zone a
- * page was allocated in should have no effect on the
- * time the page has in memory before being reclaimed.
- *
- * When zone_reclaim_mode is enabled, try to stay in
- * local zones in the fastpath. If that fails, the
- * slowpath is entered, which will do another pass
- * starting with the local zones, but ultimately fall
- * back to remote zones that do not partake in the
- * fairness round-robin cycle of this zonelist.
- */
- if ((alloc_flags & ALLOC_WMARK_LOW) &&
- (gfp_mask & GFP_MOVABLE_MASK)) {
- if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
- continue;
- if (zone_reclaim_mode &&
- !zone_local(preferred_zone, zone))
- continue;
- }
+
+ /* Distribute pages to ensure fair page aging */
+ if (zone_distribute_age(gfp_mask, preferred_zone, zone,
+ alloc_flags))
+ continue;
+
/*
* When allocating a page cache page for writing, we
* want to get it from a zone that is within its dirty
--
1.8.4
* [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10 ` Mel Gorman
1 sibling, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman
zone_local is using node_distance which is a more expensive call than
necessary. On x86, it's another function call in the allocator fast path
and increases cache footprint. This patch makes the assumption that zones on a
local node share the same node ID. The necessary information should
already be cache hot.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 64020eb..fd9677e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
static bool zone_local(struct zone *local_zone, struct zone *zone)
{
- return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+ return zone_to_nid(zone) == numa_node_id();
}
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
--
1.8.4
* [PATCH 4/7] mm: Annotate page cache allocations
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10 ` Mel Gorman
1 sibling, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman
Annotations will be used for the fair zone allocation policy. The patch is
mostly taken from a link posted by Johannes on IRC. It's not perfect because
not all callers of these paths are guaranteed to be allocating pages for page
cache. However, it's probably close enough to cover all cases that matter
with minimal distortion.
Not-signed-off
---
include/linux/gfp.h | 4 +++-
include/linux/pagemap.h | 2 +-
mm/filemap.c | 2 ++
3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd49..f69e4cb 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
+#define ___GFP_PAGECACHE 0x2000000u
/* If the above are modified, __GFP_BITS_SHIFT may need updating */
/*
@@ -92,6 +93,7 @@ struct vm_area_struct;
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
+#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */
/*
* This may seem redundant, but it's a way of annotating false positives vs.
@@ -99,7 +101,7 @@ struct vm_area_struct;
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
-#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..bda4845 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
#else
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp | __GFP_PAGECACHE, 0);
}
#endif
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..5bb9225 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
int n;
struct page *page;
+ gfp |= __GFP_PAGECACHE;
+
if (cpuset_do_page_mem_spread()) {
unsigned int cpuset_mems_cookie;
do {
--
1.8.4
* [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10 ` Mel Gorman
1 sibling, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman
Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
bug whereby new pages could be reclaimed before old pages because of
how the page allocator and kswapd interacted on the per-zone LRU lists.
Unfortunately it was missed during review that a consequence is that
we also round-robin between NUMA nodes. This is bad for two reasons:
1. It alters the semantics of MPOL_LOCAL without telling anyone
2. It incurs an immediate remote memory performance hit in exchange
for a potential performance gain when memory needs to be reclaimed
later
No cookies for the reviewers on this one.
This patch makes the behaviour of the fair zone allocator policy
configurable. By default it will only distribute pages that are going
to exist on the LRU between zones local to the allocating process. This
preserves the historical semantics of MPOL_LOCAL.
By default, slab pages are not distributed between zones after this patch is
applied. It can be argued that they should get similar treatment but they
have different lifecycles to LRU pages, the shrinkers are not zone-aware
and the interaction between the page allocator and kswapd is different
for slabs. If it turns out to be an almost universal win, we can change
the default.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
Documentation/sysctl/vm.txt | 32 ++++++++++++++
include/linux/mmzone.h | 2 +
include/linux/swap.h | 2 +
kernel/sysctl.c | 8 ++++
mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------
5 files changed, 134 insertions(+), 12 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 1fbd4eb..8eaa562 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
- swappiness
- user_reserve_kbytes
- vfs_cache_pressure
+- zone_distribute_mode
- zone_reclaim_mode
==============================================================
@@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
==============================================================
+zone_distribute_mode
+
+Page allocation and reclaim are managed on a per-zone basis. When the
+system needs to reclaim memory, candidate pages are selected from these
+per-zone lists. Historically, a potential consequence was that recently
+allocated pages were considered reclaim candidates. From a zone-local
+perspective, page aging was preserved but from a system-wide perspective
+there was an age inversion problem.
+
+A similar problem occurs at the node level where young pages may be reclaimed
+from the local node instead of allocating remote memory. Unfortunately, the
+cost of accessing remote nodes is higher, so by default the system must choose
+between favouring page aging and node locality. zone_distribute_mode controls
+how the system will distribute page ages between zones.
+
+0 = Never round-robin based on age
+
+Otherwise the values are ORed together:
+
+1 = Distribute anon pages between zones local to the allocating node
+2 = Distribute file pages between zones local to the allocating node
+4 = Distribute slab pages between zones local to the allocating node
+
+The following three flags effectively alter MPOL_DEFAULT; use them with care.
+
+8 = Distribute anon pages between zones remote to the allocating node
+16 = Distribute file pages between zones remote to the allocating node
+32 = Distribute slab pages between zones remote to the allocating node
+
+==============================================================
+
zone_reclaim_mode:
Zone_reclaim_mode allows someone to set more or less aggressive approaches to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b835d3f..20a75e3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -897,6 +897,8 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+int sysctl_zone_distribute_mode_handler(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
extern int numa_zonelist_order_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6..44329b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,6 +318,8 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;
+extern unsigned __bitwise__ zone_distribute_mode;
+
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 34a6047..b75c08f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1349,6 +1349,14 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
#endif
+ {
+ .procname = "zone_distribute_mode",
+ .data = &zone_distribute_mode,
+ .maxlen = sizeof(zone_distribute_mode),
+ .mode = 0644,
+ .proc_handler = sysctl_zone_distribute_mode_handler,
+ .extra1 = &zero,
+ },
#ifdef CONFIG_NUMA
{
.procname = "zone_reclaim_mode",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd9677e..c2a2229 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1871,6 +1871,49 @@ static inline void init_zone_allows_reclaim(int nid)
}
#endif /* CONFIG_NUMA */
+/* Controls how page ages are distributed across zones automatically */
+unsigned __bitwise__ zone_distribute_mode __read_mostly;
+
+/* See zone_distribute_mode documentation in Documentation/sysctl/vm.txt */
+#define DISTRIBUTE_DISABLE (0)
+#define DISTRIBUTE_LOCAL_ANON (1UL << 0)
+#define DISTRIBUTE_LOCAL_FILE (1UL << 1)
+#define DISTRIBUTE_LOCAL_SLAB (1UL << 2)
+#define DISTRIBUTE_REMOTE_ANON (1UL << 3)
+#define DISTRIBUTE_REMOTE_FILE (1UL << 4)
+#define DISTRIBUTE_REMOTE_SLAB (1UL << 5)
+
+#define DISTRIBUTE_STUPID_ANON (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_REMOTE_ANON)
+#define DISTRIBUTE_STUPID_FILE (DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_REMOTE_FILE)
+#define DISTRIBUTE_STUPID_SLAB (DISTRIBUTE_LOCAL_SLAB|DISTRIBUTE_REMOTE_SLAB)
+#define DISTRIBUTE_DEFAULT (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB)
+
+/* Only these GFP flags are affected by the fair zone allocation policy */
+#define DISTRIBUTE_GFP_MASK ((GFP_MOVABLE_MASK|__GFP_PAGECACHE))
+
+int sysctl_zone_distribute_mode_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int rc;
+
+ rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (rc)
+ return rc;
+
+ /* If you are an admin reading this comment, what were you thinking? */
+ if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_ANON) ==
+ DISTRIBUTE_STUPID_ANON))
+ zone_distribute_mode &= ~DISTRIBUTE_REMOTE_ANON;
+ if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_FILE) ==
+ DISTRIBUTE_STUPID_FILE))
+ zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE;
+ if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_SLAB) ==
+ DISTRIBUTE_STUPID_SLAB))
+ zone_distribute_mode &= ~DISTRIBUTE_REMOTE_SLAB;
+
+ return 0;
+}
+
/*
* Distribute pages in proportion to the individual zone size to ensure fair
* page aging. The zone a page was allocated in should have no effect on the
@@ -1882,26 +1925,60 @@ static inline void init_zone_allows_reclaim(int nid)
static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
struct zone *zone, int alloc_flags)
{
+ bool zone_is_local;
+ bool is_file, is_slab, is_anon;
+
/* Only round robin in the allocator fast path */
if (!(alloc_flags & ALLOC_WMARK_LOW))
return false;
- /* Only round robin pages likely to be LRU or reclaimable slab */
- if (!(gfp_mask & GFP_MOVABLE_MASK))
+ /* Only a subset of GFP flags are considered for fair zone policy */
+ if (!(gfp_mask & DISTRIBUTE_GFP_MASK))
return false;
- /* Distribute to the next zone if this zone has exhausted its batch */
- if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
- return true;
-
/*
- * When zone_reclaim_mode is enabled, try to stay in local zones in the
- * fastpath. If that fails, the slowpath is entered, which will do
- * another pass starting with the local zones, but ultimately fall back
- * back to remote zones that do not partake in the fairness round-robin
- * cycle of this zonelist.
+ * Classify the type of allocation. From this point on, the fair zone
+ * allocation policy is being applied. If the allocation does not meet
+ * the criteria the zone must be skipped.
*/
- if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+ is_file = gfp_mask & __GFP_PAGECACHE;
+ is_slab = gfp_mask & __GFP_RECLAIMABLE;
+ is_anon = (!is_file && !is_slab);
+ WARN_ON_ONCE(is_slab && is_file);
+
+ zone_is_local = zone_local(preferred_zone, zone);
+ if (zone_is_local) {
+ /* Distribute between zones local to the node if requested */
+ if (is_anon && (zone_distribute_mode & DISTRIBUTE_LOCAL_ANON))
+ goto check_batch;
+ if (is_file && (zone_distribute_mode & DISTRIBUTE_LOCAL_FILE))
+ goto check_batch;
+ if (is_slab && (zone_distribute_mode & DISTRIBUTE_LOCAL_SLAB))
+ goto check_batch;
+ } else {
+ /*
+ * When zone_reclaim_mode is enabled, stick to local zones. If
+ * that fails, the slowpath is entered, which will do another
+ * pass starting with the local zones, but ultimately fall
+ * back to remote zones that do not partake in the fairness
+ * round-robin cycle of this zonelist.
+ */
+ if (zone_reclaim_mode)
+ return false;
+
+ if (is_anon && (zone_distribute_mode & DISTRIBUTE_REMOTE_ANON))
+ goto check_batch;
+ if (is_file && (zone_distribute_mode & DISTRIBUTE_REMOTE_FILE))
+ goto check_batch;
+ if (is_slab && (zone_distribute_mode & DISTRIBUTE_REMOTE_SLAB))
+ goto check_batch;
+ }
+
+ return true;
+
+check_batch:
+ /* Distribute to the next zone if this zone has exhausted its batch */
+ if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
return true;
return false;
@@ -3797,6 +3874,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
__build_all_zonelists(NULL);
mminit_verify_zonelist();
cpuset_init_current_mems_allowed();
+ zone_distribute_mode = DISTRIBUTE_DEFAULT;
} else {
#ifdef CONFIG_MEMORY_HOTPLUG
if (zone)
--
1.8.4
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH 6/7] mm: page_alloc: Only account batch allocation requests that are eligible
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman
Not signed off. Johannes, was the intent really to decrement the batch
counts regardless of whether the policy was being enforced or not?
---
mm/page_alloc.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2a2229..bf49918 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1547,7 +1547,6 @@ again:
get_pageblock_migratetype(page));
}
- __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
__count_zone_vm_events(PGALLOC, zone, 1 << order);
zone_statistics(preferred_zone, zone, gfp_flags);
local_irq_restore(flags);
@@ -1923,7 +1922,8 @@ int sysctl_zone_distribute_mode_handler(struct ctl_table *table, int write,
* other zones.
*/
static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
- struct zone *zone, int alloc_flags)
+ struct zone *zone, int alloc_flags,
+ bool *distrib_eligible)
{
bool zone_is_local;
bool is_file, is_slab, is_anon;
@@ -1977,6 +1977,8 @@ static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
return true;
check_batch:
+ *distrib_eligible = true;
+
/* Distribute to the next zone if this zone has exhausted its batch */
if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
return true;
@@ -2000,6 +2002,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
+ bool distrib_eligible = false;
classzone_idx = zone_idx(preferred_zone);
zonelist_scan:
@@ -2023,7 +2026,7 @@ zonelist_scan:
/* Distribute pages to ensure fair page aging */
if (zone_distribute_age(gfp_mask, preferred_zone, zone,
- alloc_flags))
+ alloc_flags, &distrib_eligible))
continue;
/*
@@ -2119,8 +2122,11 @@ zonelist_scan:
try_this_zone:
page = buffered_rmqueue(preferred_zone, zone, order,
gfp_mask, migratetype);
- if (page)
+ if (page) {
+ if (distrib_eligible)
+ __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
break;
+ }
this_zone_full:
if (IS_ENABLED(CONFIG_NUMA))
zlc_mark_zone_full(zonelist, z);
--
1.8.4
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 14:10 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 14:10 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML, Mel Gorman
Indications from Johannes that he wanted this. Needs some data and/or justification why
thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
it should be considered finished. I do not necessarily agree this patch is necessary
but it's worth punting it out there for discussion and testing.
Not signed off
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf49918..bce40c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1885,7 +1885,8 @@ unsigned __bitwise__ zone_distribute_mode __read_mostly;
#define DISTRIBUTE_STUPID_ANON (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_REMOTE_ANON)
#define DISTRIBUTE_STUPID_FILE (DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_REMOTE_FILE)
#define DISTRIBUTE_STUPID_SLAB (DISTRIBUTE_LOCAL_SLAB|DISTRIBUTE_REMOTE_SLAB)
-#define DISTRIBUTE_DEFAULT (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB)
+#define DISTRIBUTE_DEFAULT (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB| \
+ DISTRIBUTE_REMOTE_FILE)
/* Only these GFP flags are affected by the fair zone allocation policy */
#define DISTRIBUTE_GFP_MASK ((GFP_MOVABLE_MASK|__GFP_PAGECACHE))
--
1.8.4
^ permalink raw reply related [flat|nested] 84+ messages in thread
* Re: [PATCH 1/7] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 15:45 ` Rik van Riel
-1 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-13 15:45 UTC (permalink / raw)
To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML
On 12/13/2013 09:10 AM, Mel Gorman wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
>
> Dave Hansen noted a regression in a microbenchmark that loops around
> open() and close() on an 8-node NUMA machine and bisected it down to
> 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy"). That
> change forces the slab allocations of the file descriptor to spread
> out to all 8 nodes, causing remote references in the page allocator
> and slab.
>
> The round-robin policy is only there to provide fairness among memory
> allocations that are reclaimed involuntarily based on pressure in each
> zone. It does not make sense to apply it to unreclaimable kernel
> allocations that are freed manually, in this case instantly after the
> allocation, and incur the remote reference costs twice for no reason.
>
> Only round-robin allocations that are usually freed through page
> reclaim or slab shrinking.
>
> Cc: <stable@kernel.org>
> Bisected-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 2/7] mm: page_alloc: Break out zone page aging distribution into its own helper
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 15:46 ` Rik van Riel
-1 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-13 15:46 UTC (permalink / raw)
To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML
On 12/13/2013 09:10 AM, Mel Gorman wrote:
> This patch moves the decision on whether to round-robin allocations between
> zones and nodes into its own helper functions. It'll make some later patches
> easier to understand and it will be automatically inlined.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-13 17:04 ` Johannes Weiner
-1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-13 17:04 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Fri, Dec 13, 2013 at 02:10:07PM +0000, Mel Gorman wrote:
> Indications from Johannes that he wanted this. Needs some data and/or justification why
> thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
> it should be considered finished. I do not necessarily agree this patch is necessary
> but it's worth punting it out there for discussion and testing.
I demonstrated enormous gains in the original submission of the fair
allocation patch and your tests haven't really shown downsides to the
cache-over-nodes portion of it. So I don't see why we should revert
the cache-over-nodes fairness without any supporting data.
Reverting cross-node fairness for anon and slab is a good idea. It
was always about cache and the original patch was too broad stroked,
but it doesn't invalidate everything it was about.
I can see, however, that we might want to make this configurable, but
I'm not eager to export user interfaces unless we have to. As the
node-local fairness was never questioned by anybody, is it necessary
to make it configurable? Shouldn't we be okay with just a single
vm.pagecache_interleave (name by Rik) sysctl that defaults to 1 but
allows users to go back to pagecache obeying mempolicy?
> Not signed off
> ---
> mm/page_alloc.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bf49918..bce40c0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1885,7 +1885,8 @@ unsigned __bitwise__ zone_distribute_mode __read_mostly;
> #define DISTRIBUTE_STUPID_ANON (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_REMOTE_ANON)
> #define DISTRIBUTE_STUPID_FILE (DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_REMOTE_FILE)
> #define DISTRIBUTE_STUPID_SLAB (DISTRIBUTE_LOCAL_SLAB|DISTRIBUTE_REMOTE_SLAB)
> -#define DISTRIBUTE_DEFAULT (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB)
> +#define DISTRIBUTE_DEFAULT (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB| \
> + DISTRIBUTE_REMOTE_FILE)
>
> /* Only these GFP flags are affected by the fair zone allocation policy */
> #define DISTRIBUTE_GFP_MASK ((GFP_MOVABLE_MASK|__GFP_PAGECACHE))
> --
> 1.8.4
>
* Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
2013-12-13 17:04 ` Johannes Weiner
@ 2013-12-13 19:20 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-13 19:20 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Fri, Dec 13, 2013 at 12:04:43PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 02:10:07PM +0000, Mel Gorman wrote:
> > Indications from Johannes that he wanted this. Needs some data and/or justification why
> > thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
> > it should be considered finished. I do not necessarily agree this patch is necessary
> > but it's worth punting it out there for discussion and testing.
>
> I demonstrated enormous gains in the original submission of the fair
> allocation patch and
And the same test missed that it broke MPOL_DEFAULT and regressed any workload
that does not hit reclaim by incurring remote accesses unnecessarily. With
this patch applied, MPOL_DEFAULT again does not act as documented by
Documentation/vm/numa_memory_policy.txt and that file has been around a
long time. It also does not match the documented behaviour of mbind
where it says
The system-wide default policy allocates pages on the node of
the CPU that triggers the allocation. For MPOL_DEFAULT, the nodemask
and maxnode arguments must be specify the empty set of nodes.
That said, that documentation is also strictly wrong as MPOL_DEFAULT *may*
allocate on remote nodes.
> your tests haven't really shown downsides to the
> cache-over-nodes portion of it. So I don't see why we should revert
> the cache-over-nodes fairness without any supporting data.
>
It breaks MPOL_LOCAL for file-backed mappings in a manner that cannot be
overridden by policies and it is not even documented. The same effect
could have been achieved for the repeatedly reading files by running the
processes with the MPOL_INTERLEAVE policy. There was also no convenient
way for a user to override that behaviour. Hard-binding to a node would
work but tough luck if the process needs more than one node of memory.
What I will admit is that I doubt anyone cares that file-backed pages
are not node-local as documented, as the cost of the IO itself probably
dominates, but just because something does not make sense does not mean
no one is depending on the behaviour.
That alone is pretty heavy justification even in the absence of supporting
data showing a workload that depends on file pages being node-local that
is not hidden by the cost of the IO itself.
> Reverting cross-node fairness for anon and slab is a good idea. It
> was always about cache and the original patch was too broad stroked,
> but it doesn't invalidate everything it was about.
>
No it doesn't, but it should at least have been documented.
> I can see, however, that we might want to make this configurable, but
> I'm not eager to export user interfaces unless we have to. As the
> node-local fairness was never questioned by anybody, is it necessary
> to make it configurable?
It's only there since 3.12 and it takes a long time for people to notice
NUMA regressions, especially ones that, like this, would be within just a
few percent unless someone was specifically looking for it.
> Shouldn't we be okay with just a single
> vm.pagecache_interleave (name by Rik) sysctl that defaults to 1 but
> allows users to go back to pagecache obeying mempolicy?
>
That can be done. I can put together a patch that defaults it to 0 and
sets the DISTRIBUTE_REMOTE_FILE flag if someone writes to it. That's a
crude hack but many people will be ok with it.
Making it the default would require more work though.
Create an MPOL_DISTRIB_PAGECACHE memory policy (so named because it
is not strictly interleave). Abstract MPOL_DEFAULT to be either
MPOL_LOCAL or MPOL_DISTRIB_PAGECACHE depending on the value of
vm.pagecache_interleave. Update manual pages and Documentation/, then set
the default of vm.pagecache_interleave to 1.
That would allow more sane defaults and also allow users to override it
on a per task and per VMA basis as they can for any other type of memory
policy.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
2013-12-13 19:20 ` Mel Gorman
@ 2013-12-13 22:15 ` Johannes Weiner
-1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-13 22:15 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Fri, Dec 13, 2013 at 07:20:14PM +0000, Mel Gorman wrote:
> On Fri, Dec 13, 2013 at 12:04:43PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 13, 2013 at 02:10:07PM +0000, Mel Gorman wrote:
> > > Indications from Johannes that he wanted this. Needs some data and/or justification why
> > > thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
> > > it should be considered finished. I do not necessarily agree this patch is necessary
> > > but it's worth punting it out there for discussion and testing.
> >
> > I demonstrated enormous gains in the original submission of the fair
> > allocation patch and
>
> And the same test missed that it broke MPOL_DEFAULT and regressed any workload
> that does not hit reclaim by incurring remote accesses unnecessarily.
And none of this was nice, agreed, but it does not invalidate the
gains, it only changes what we are comparing them to.
> With this patch applied, MPOL_DEFAULT again does not act as
> documented by Documentation/vm/numa_memory_policy.txt and that file
> has been around a long time. It also does not match the documented
> behaviour of mbind where it says
>
> The system-wide default policy allocates pages on the node of
> the CPU that triggers the allocation. For MPOL_DEFAULT, the nodemask
> and maxnode arguments must be specify the empty set of nodes.
>
> That said, that documentation is also strictly wrong as MPOL_DEFAULT *may*
> allocate on remote nodes.
>
> > your tests haven't really shown downsides to the
> > cache-over-nodes portion of it. So I don't see why we should revert
> > the cache-over-nodes fairness without any supporting data.
> >
>
> It breaks MPOL_LOCAL for file-backed mappings in a manner that cannot be
> overridden by policies and it is not even documented. The same effect
> could have been achieved for workloads that repeatedly read files by running the
> processes with the MPOL_INTERLEAVE policy. There was also no convenient
> way for a user to override that behaviour. Hard-binding to a node would
> work but tough luck if the process needs more than one node of memory.
Hardbinding or enabling zone_reclaim_mode, yes. But agreed, let's fix
these problems.
> What I will admit is that I doubt anyone cares that file-backed pages
> are not node-local as documented, as the cost of the IO itself probably
> dominates, but just because something does not make sense does not mean
> no one is depending on the behaviour.
And that's why I very much agree that we need a way for people to
revert to the old behavior in case we are wrong about this.
But it's also a very strong argument for what the new default should
be, given that we allow people to revert our decision in the field.
> That alone is pretty heavy justification even in the absence of supporting
> data showing a workload that depends on file pages being node-local that
> is not hidden by the cost of the IO itself.
Even if we anticipate that nobody will care about it and we provide a
way to revert the behavior in the field in case we are wrong?
I disagree.
We should definitely allow the user to override our decision, but the
default should be what we anticipate will benefit most users.
And I'm really not trying to be ignorant of long-standing documented
behavior that users may have come to expect. The bug reports will
land on my desk just as well. But it looks like the current behavior
does not make much sense and is unlikely to be missed.
> > Reverting cross-node fairness for anon and slab is a good idea. It
> > was always about cache and the original patch was too broad stroked,
> > but it doesn't invalidate everything it was about.
> >
>
> No it doesn't, but it should at least have been documented.
Yes, no argument there.
> > I can see, however, that we might want to make this configurable, but
> I'm not eager to export user interfaces unless we have to. As the
> > node-local fairness was never questioned by anybody, is it necessary
> > to make it configurable?
>
> It's only there since 3.12 and it takes a long time for people to notice
> NUMA regressions, especially ones that, like this, would be within just a
> few percent unless someone was specifically looking for it.
No, I meant only the case where we distribute memory fairly among the
zones WITHIN a given node. This does not affect NUMA placement. I
wouldn't want to make this configurable unless you think people might
want to disable this. I can't think of a reason, anyway.
> > Shouldn't we be okay with just a single
> > vm.pagecache_interleave (name by Rik) sysctl that defaults to 1 but
> > allows users to go back to pagecache obeying mempolicy?
> >
>
> That can be done. I can put together a patch that defaults it to 0 and
> sets the DISTRIBUTE_REMOTE_FILE flag if someone writes to it. That's a
> crude hack but many people will be ok with it.
>
> Making it the default would require more work though.
> Create an MPOL_DISTRIB_PAGECACHE memory policy (so named because it
> is not strictly interleave). Abstract MPOL_DEFAULT to be either
> MPOL_LOCAL or MPOL_DISTRIB_PAGECACHE depending on the value of
> vm.pagecache_interleave. Update manual pages and Documentation/, then set
> the default of vm.pagecache_interleave to 1.
>
> That would allow more sane defaults and also allow users to override it
> on a per task and per VMA basis as they can for any other type of memory
> policy.
Not using round-robin placement for cache creates weird artifacts in
our LRU aging decisions. By not aging all pages in a workingset
equally, we may end up activating barely used pages on a remote node
and creating pressure on its active list for no reason.
This has little to do with the thrash detection patches either; they
will just potentially trigger a few more nonsensical activations, but
for the same reason that the aging is skewed.
Because of that I really don't want to implement round-robin cache
placement as just another possible mempolicy when other parts of the
VM rely on it to be there.
It would make more sense to me to ignore mempolicies for cache by
default and provide a single sysctl to honor them, for the sole reason
that we have been honoring them for a very long time. And document
the whole thing properly of course.
* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-16 13:20 ` Rik van Riel
-1 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-16 13:20 UTC (permalink / raw)
To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML
On 12/13/2013 09:10 AM, Mel Gorman wrote:
> zone_local is using node_distance which is a more expensive call than
> necessary. On x86, it's another function call in the allocator fast path
> and increases cache footprint. This patch makes the assumption zones on a
> local node will share the same node ID. The necessary information should
> already be cache hot.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH 4/7] mm: Annotate page cache allocations
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-16 15:20 ` Rik van Riel
-1 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-16 15:20 UTC (permalink / raw)
To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML
On 12/13/2013 09:10 AM, Mel Gorman wrote:
> Annotations will be used for fair zone allocation policy. Patch is mostly
> taken from a link posted by Johannes on IRC. It's not perfect because all
> callers of these paths are not guaranteed to be allocating pages for page
> cache. However, it's probably close enough to cover all cases that matter
> with minimal distortion.
>
> Not-signed-off
Whenever you and Johannes sign it off, you can add my
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-16 19:25 ` Rik van Riel
0 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-16 19:25 UTC (permalink / raw)
To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML
On 12/13/2013 09:10 AM, Mel Gorman wrote:
> Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> bug whereby new pages could be reclaimed before old pages because of
> how the page allocator and kswapd interacted on the per-zone LRU lists.
> Unfortunately it was missed during review that a consequence is that
> we also round-robin between NUMA nodes. This is bad for two reasons
>
> 1. It alters the semantics of MPOL_LOCAL without telling anyone
> 2. It incurs an immediate remote memory performance hit in exchange
> for a potential performance gain when memory needs to be reclaimed
> later
>
> No cookies for the reviewers on this one.
>
> This patch makes the behaviour of the fair zone allocator policy
> configurable. By default it will only distribute pages that are going
> to exist on the LRU between zones local to the allocating process. This
> preserves the historical semantics of MPOL_LOCAL.
>
> By default, slab pages are not distributed between zones after this patch is
> applied. It can be argued that they should get similar treatment but they
> have different lifecycles to LRU pages, the shrinkers are not zone-aware
> and the interaction between the page allocator and kswapd is different
> for slabs. If it turns out to be an almost universal win, we can change
> the default.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-16 19:26 ` Rik van Riel
0 siblings, 0 replies; 84+ messages in thread
From: Rik van Riel @ 2013-12-16 19:26 UTC (permalink / raw)
To: Mel Gorman; +Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Linux-MM, LKML
On 12/13/2013 09:10 AM, Mel Gorman wrote:
> Indications from Johannes that he wanted this. Needs some data and/or justification for why
> thrash protection needs it, plus docs describing how MPOL_LOCAL is now different, before
> it should be considered finished. I do not necessarily agree this patch is necessary
> but it's worth punting it out there for discussion and testing.
This seems like a sane default to me.
--
All rights reversed
* Re: [PATCH 2/7] mm: page_alloc: Break out zone page aging distribution into its own helper
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-16 20:16 ` Johannes Weiner
0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-16 20:16 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Fri, Dec 13, 2013 at 02:10:02PM +0000, Mel Gorman wrote:
> This patch moves the decision on whether to round-robin allocations between
> zones and nodes into its own helper functions. It'll make some later patches
> easier to understand and they will be automatically inlined.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-16 20:25 ` Johannes Weiner
0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-16 20:25 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> zone_local is using node_distance which is a more expensive call than
> necessary. On x86, it's another function call in the allocator fast path
> and increases cache footprint. This patch makes the assumption zones on a
> local node will share the same node ID. The necessary information should
> already be cache hot.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
> mm/page_alloc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 64020eb..fd9677e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
>
> static bool zone_local(struct zone *local_zone, struct zone *zone)
> {
> - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> + return zone_to_nid(zone) == numa_node_id();
Why numa_node_id()? We pass in the preferred zone as @local_zone:
return zone_to_nid(local_zone) == zone_to_nid(zone)
Or even just compare the ->zone_pgdat pointers?
* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-16 20:42 ` Johannes Weiner
0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-16 20:42 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> bug whereby new pages could be reclaimed before old pages because of
> how the page allocator and kswapd interacted on the per-zone LRU lists.
> Unfortunately it was missed during review that a consequence is that
> we also round-robin between NUMA nodes. This is bad for two reasons
>
> 1. It alters the semantics of MPOL_LOCAL without telling anyone
> 2. It incurs an immediate remote memory performance hit in exchange
> for a potential performance gain when memory needs to be reclaimed
> later
>
> No cookies for the reviewers on this one.
>
> This patch makes the behaviour of the fair zone allocator policy
> configurable. By default it will only distribute pages that are going
> to exist on the LRU between zones local to the allocating process. This
> preserves the historical semantics of MPOL_LOCAL.
>
> By default, slab pages are not distributed between zones after this patch is
> applied. It can be argued that they should get similar treatment but they
> have different lifecycles to LRU pages, the shrinkers are not zone-aware
> and the interaction between the page allocator and kswapd is different
> for slabs. If it turns out to be an almost universal win, we can change
> the default.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
> Documentation/sysctl/vm.txt | 32 ++++++++++++++
> include/linux/mmzone.h | 2 +
> include/linux/swap.h | 2 +
> kernel/sysctl.c | 8 ++++
> mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------
> 5 files changed, 134 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 1fbd4eb..8eaa562 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> - swappiness
> - user_reserve_kbytes
> - vfs_cache_pressure
> +- zone_distribute_mode
> - zone_reclaim_mode
>
> ==============================================================
> @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
>
> ==============================================================
>
> +zone_distribute_mode
> +
> +Page allocation and reclaim are managed on a per-zone basis. When the
> +system needs to reclaim memory, candidate pages are selected from these
> +per-zone lists. Historically, a potential consequence was that recently
> +allocated pages were considered reclaim candidates. From a zone-local
> +perspective, page aging was preserved but from a system-wide perspective
> +there was an age inversion problem.
> +
> +A similar problem occurs on a node level where young pages may be reclaimed
> +from the local node instead of allocating remote memory. Unfortunately, the
> +cost of accessing remote nodes is higher so the system must choose by default
> +between favouring page aging and node locality. zone_distribute_mode controls
> +how the system will distribute page ages between zones.
> +
> +0 = Never round-robin based on age
I think we should be very conservative with the userspace interface we
export on a mechanism we are obviously just figuring out.
> +Otherwise the values are ORed together
> +
> +1 = Distribute anon pages between zones local to the allocating node
> +2 = Distribute file pages between zones local to the allocating node
> +4 = Distribute slab pages between zones local to the allocating node
Zone fairness within a node does not affect mempolicy or remote
reference costs. Is there a reason to have this configurable?
> +The following three flags effectively alter MPOL_DEFAULT, be careful.
> +
> +8 = Distribute anon pages between zones remote to the allocating node
> +16 = Distribute file pages between zones remote to the allocating node
> +32 = Distribute slab pages between zones remote to the allocating node
Yes, it's conceivable that somebody might want to disable remote
distribution because of the extra references.
But at this point, I'd much rather back out anon and slab distribution
entirely, it was a mistake to include them.
That would leave us with a single knob to disable remote page cache
placement.
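As a side note on how the quoted flags compose: they OR together into a single
vm.zone_distribute_mode value. A minimal stand-alone sketch of that semantics
(the numeric values come from the patch's documentation; the macro names here
are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Numeric values taken from the proposed vm.txt documentation;
 * the macro names below are hypothetical, for illustration only. */
#define DISTRIBUTE_LOCAL_ANON    1
#define DISTRIBUTE_LOCAL_FILE    2
#define DISTRIBUTE_LOCAL_SLAB    4
#define DISTRIBUTE_REMOTE_ANON   8
#define DISTRIBUTE_REMOTE_FILE  16
#define DISTRIBUTE_REMOTE_SLAB  32

/* The default the patch describes: round-robin LRU (anon + file)
 * pages between zones local to the allocating node only. */
#define DISTRIBUTE_DEFAULT (DISTRIBUTE_LOCAL_ANON | DISTRIBUTE_LOCAL_FILE)

/* An allocation type is distributed only if its flag is set in the mode. */
static bool distribute_allowed(unsigned int mode, unsigned int flag)
{
	return (mode & flag) != 0;
}
```

With the default described above the mode value would be 3: page cache stays
node-local, writing 0 disables age-based round-robin entirely, and setting
bit 16 would opt page cache into remote distribution.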
* Re: [PATCH 6/7] mm: page_alloc: Only account batch allocations requests that are eligible
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-16 20:52 ` Johannes Weiner
0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-16 20:52 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Fri, Dec 13, 2013 at 02:10:06PM +0000, Mel Gorman wrote:
> Not signed off. Johannes, was the intent really to decrement the batch
> counts regardless of whether the policy was being enforced or not?
Yes. Bursts of allocations for which the policy does not get enforced
will still create memory pressure and affect cache aging on a given
node. So even if we only distribute page cache, we want to distribute
it in a way that all allocations on the eligible zones equal out.
* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
2013-12-16 20:25 ` Johannes Weiner
@ 2013-12-17 11:13 ` Mel Gorman
0 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 11:13 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > zone_local is using node_distance which is a more expensive call than
> > necessary. On x86, it's another function call in the allocator fast path
> > and increases cache footprint. This patch makes the assumption zones on a
> > local node will share the same node ID. The necessary information should
> > already be cache hot.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> > mm/page_alloc.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 64020eb..fd9677e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> >
> > static bool zone_local(struct zone *local_zone, struct zone *zone)
> > {
> > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > + return zone_to_nid(zone) == numa_node_id();
>
> Why numa_node_id()? We pass in the preferred zone as @local_zone:
>
Initially because I was thinking "local node". numa_node_id() is a
per-cpu variable that should be cheap to access and, in some cases,
cache-hot because the top-level gfp API calls numa_node_id().
Thinking about it more, though, it still makes sense because the preferred
zone is not necessarily local. If the allocation request requires ZONE_DMA32
and the local node does not have that zone, then the preferred zone is on a
remote node.
--
Mel Gorman
SUSE Labs
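Mel's point can be seen with a small self-contained model (a sketch only; the
struct and helpers are simplified stand-ins, not kernel code): when a GFP
constraint such as ZONE_DMA32 forces the preferred zone onto another node,
comparing against the preferred zone's node and comparing against the
allocating CPU's node give different answers.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's struct zone; illustration only. */
struct zone { int node; };

/* Johannes' suggestion: locality relative to the preferred zone. */
static bool zone_local_preferred(const struct zone *local_zone,
				 const struct zone *zone)
{
	return local_zone->node == zone->node;
}

/* Mel's version: locality relative to the allocating CPU's node,
 * with numa_node_id() modelled here as a plain parameter. */
static bool zone_local_cpu(const struct zone *zone, int cpu_node)
{
	return zone->node == cpu_node;
}
```

If the CPU sits on node 0 but only node 1 provides the required zone, the
preferred zone lives on node 1: a node-1 zone is then "local" by the first
test but remote by the second, which matches the behaviour Mel argues for.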
* Re: [PATCH 6/7] mm: page_alloc: Only account batch allocations requests that are eligible
2013-12-16 20:52 ` Johannes Weiner
@ 2013-12-17 11:20 ` Mel Gorman
0 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 11:20 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Mon, Dec 16, 2013 at 03:52:37PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 02:10:06PM +0000, Mel Gorman wrote:
> > Not signed off. Johannes, was the intent really to decrement the batch
> > counts regardless of whether the policy was being enforced or not?
>
> Yes. Bursts of allocations for which the policy does not get enforced
> will still create memory pressure and affect cache aging on a given
> node. So even if we only distribute page cache, we want to distribute
> it in a way that all allocations on the eligible zones equal out.
This means that allocations for page table pages affect the distribution of
page cache pages. An adverse workload could time when it faults anonymous
pages (to allocate anon and page table pages) in batch sequences and then
access files to force page cache pages to be allocated from a single node.
I think I know what your response will be. It will be that the utilisation of
the zone for page table pages and anon pages means that you want more page
cache pages to be allocated from the other zones so the reclaim pressure
is still more or less even. If this is the case or there is another reason
then it could have done with a comment because it's a subtle detail.
--
Mel Gorman
SUSE Labs
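Johannes' rationale can be modelled in a few lines: every eligible allocation
draws down a zone's batch, whether or not the fair policy redirected that
particular request, so bursts of non-distributed allocations still steer later
page cache placement toward other zones. A toy model (plain functions standing
in for the kernel's per-zone NR_ALLOC_BATCH accounting; names invented):

```c
#include <assert.h>

/* Every eligible allocation draws down the zone's batch, whether or
 * not the fair policy actually redirected that particular request. */
static long account_alloc(long batch, long pages)
{
	return batch - pages;
}

/* Once a zone's batch is exhausted, new placement should prefer the
 * other eligible zones until the batches are replenished. */
static int batch_exhausted(long batch)
{
	return batch <= 0;
}
```

In this model a burst of, say, page table allocations exhausts the local
zone's batch and pushes subsequent page cache allocations elsewhere, which is
the equalising effect Johannes describes.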
* Re: [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6
2013-12-13 14:10 ` Mel Gorman
@ 2013-12-17 15:07 ` Zlatko Calusic
0 siblings, 0 replies; 84+ messages in thread
From: Zlatko Calusic @ 2013-12-17 15:07 UTC (permalink / raw)
To: Mel Gorman
Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Rik van Riel,
Linux-MM, LKML
On 13.12.2013 15:10, Mel Gorman wrote:
> Kicked this another bit today. It's still a bit half-baked but it restores
> the historical performance and leaves the door open at the end for playing
> nice with distributing file pages between nodes. Finishing this series
> depends on whether we are going to make the remote node behaviour of the
> fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
> favour of the configurable option because the default can be redefined and
> tested while giving users a "compat" mode if we discover the new default
> behaviour sucks for some workload.
>
I'll start a 5-day test of this patchset in a few hours, unless you can
send an updated one in the meantime. I intend to test it on a rather
boring 4GB x86_64 machine that before Johannes' work had lots of trouble
balancing zones. Would you recommend to use the default settings, i.e.
don't mess with tunables at this point?
Regards,
--
Zlatko
* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-16 20:42 ` Johannes Weiner
@ 2013-12-17 15:29 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 15:29 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Mon, Dec 16, 2013 at 03:42:15PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > bug whereby new pages could be reclaimed before old pages because of
> > how the page allocator and kswapd interacted on the per-zone LRU lists.
> > Unfortunately it was missed during review that a consequence is that
> > we also round-robin between NUMA nodes. This is bad for two reasons
> >
> > 1. It alters the semantics of MPOL_LOCAL without telling anyone
> > 2. It incurs an immediate remote memory performance hit in exchange
> > for a potential performance gain when memory needs to be reclaimed
> > later
> >
> > No cookies for the reviewers on this one.
> >
> > This patch makes the behaviour of the fair zone allocator policy
> > configurable. By default it will only distribute pages that are going
> > to exist on the LRU between zones local to the allocating process. This
> > preserves the historical semantics of MPOL_LOCAL.
> >
> > By default, slab pages are not distributed between zones after this patch is
> > applied. It can be argued that they should get similar treatment but they
> > have different lifecycles to LRU pages, the shrinkers are not zone-aware
> > and the interaction between the page allocator and kswapd is different
> > for slabs. If it turns out to be an almost universal win, we can change
> > the default.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> > Documentation/sysctl/vm.txt | 32 ++++++++++++++
> > include/linux/mmzone.h | 2 +
> > include/linux/swap.h | 2 +
> > kernel/sysctl.c | 8 ++++
> > mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------
> > 5 files changed, 134 insertions(+), 12 deletions(-)
> >
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index 1fbd4eb..8eaa562 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> > - swappiness
> > - user_reserve_kbytes
> > - vfs_cache_pressure
> > +- zone_distribute_mode
> > - zone_reclaim_mode
> >
> > ==============================================================
> > @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
> >
> > ==============================================================
> >
> > +zone_distribute_mode
> > +
> > +Page allocation and reclaim are managed on a per-zone basis. When the
> > +system needs to reclaim memory, candidate pages are selected from these
> > +per-zone lists. Historically, a potential consequence was that recently
> > +allocated pages were considered reclaim candidates. From a zone-local
> > +perspective, page aging was preserved but from a system-wide perspective
> > +there was an age inversion problem.
> > +
> > +A similar problem occurs on a node level where young pages may be reclaimed
> > +from the local node instead of allocating remote memory. Unfortunately, the
> > +cost of accessing remote nodes is higher so the system must choose by default
> > +between favouring page aging or node locality. zone_distribute_mode controls
> > +how the system will distribute page ages between zones.
> > +
> > +0 = Never round-robin based on age
>
> I think we should be very conservative with the userspace interface we
> export on a mechanism we are obviously just figuring out.
>
And we have a proposal on how to limit this. I'll be layering another
patch on top that removes this interface again. That will allow us to
roll back one patch and still have a usable interface if necessary.
> > +Otherwise the values are ORed together
> > +
> > +1 = Distribute anon pages between zones local to the allocating node
> > +2 = Distribute file pages between zones local to the allocating node
> > +4 = Distribute slab pages between zones local to the allocating node
>
> Zone fairness within a node does not affect mempolicy or remote
> reference costs. Is there a reason to have this configurable?
>
Symmetry
> > +The following three flags effectively alter MPOL_DEFAULT, be careful.
> > +
> > +8 = Distribute anon pages between zones remote to the allocating node
> > +16 = Distribute file pages between zones remote to the allocating node
> > +32 = Distribute slab pages between zones remote to the allocating node
>
> Yes, it's conceivable that somebody might want to disable remote
> distribution because of the extra references.
>
> But at this point, I'd much rather back out anon and slab distribution
> entirely, it was a mistake to include them.
>
> That would leave us with a single knob to disable remote page cache
> placement.
>
When looking at this closer I found that sysv is a weird exception. It's
file-backed as far as most of the VM is concerned but looks anonymous to
most applications that care. That and MAP_SHARED anonymous pages should
not be treated like files but we still want tmpfs to be treated as
files. Details will be in the changelog of the next series.
--
Mel Gorman
SUSE Labs
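The OR-able flag scheme quoted in the sysctl documentation above can be
sketched as a bitmask check. This is a hypothetical illustration, not the
actual patch; the macro and function names are invented:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag names mirroring the documented sysctl values. */
#define DISTRIBUTE_LOCAL_ANON    1
#define DISTRIBUTE_LOCAL_FILE    2
#define DISTRIBUTE_LOCAL_SLAB    4
#define DISTRIBUTE_REMOTE_ANON   8
#define DISTRIBUTE_REMOTE_FILE  16
#define DISTRIBUTE_REMOTE_SLAB  32

/* Default described in the changelog: distribute only LRU pages
 * (anon + file) between zones local to the allocating node. */
static int zone_distribute_mode =
	DISTRIBUTE_LOCAL_ANON | DISTRIBUTE_LOCAL_FILE;

/* Return true if an allocation of the given type should participate
 * in fair round-robin placement on a local or remote zone. */
static bool distribute_eligible(int local_flag, int remote_flag,
				bool zone_is_local)
{
	int flag = zone_is_local ? local_flag : remote_flag;

	return (zone_distribute_mode & flag) != 0;
}
```

With the default mode, local file and anon placement is fair while slab
and all remote placement falls through to the historical behaviour.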
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
2013-12-17 11:13 ` Mel Gorman
@ 2013-12-17 15:38 ` Johannes Weiner
-1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 15:38 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 11:13:52AM +0000, Mel Gorman wrote:
> On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > > zone_local is using node_distance which is a more expensive call than
> > > necessary. On x86, it's another function call in the allocator fast path
> > > and increases cache footprint. This patch makes the assumption zones on a
> > > local node will share the same node ID. The necessary information should
> > > already be cache hot.
> > >
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > > mm/page_alloc.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 64020eb..fd9677e 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > >
> > > static bool zone_local(struct zone *local_zone, struct zone *zone)
> > > {
> > > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > + return zone_to_nid(zone) == numa_node_id();
> >
> > Why numa_node_id()? We pass in the preferred zone as @local_zone:
> >
>
> Initially because I was thinking "local node" and numa_node_id() is a
> per-cpu variable that should be cheap to access and in some cases
> cache-hot as the top-level gfp API calls numa_node_id().
>
> Thinking about it more though it still makes sense because the preferred
> zone is not necessarily local. If the allocation request requires ZONE_DMA32
> and the local node does not have that zone then preferred zone is on a
> remote node.
Don't we treat everything in relation to the preferred zone?
zone_reclaim_mode itself does not compare with numa_node_id() but with
whatever is the preferred zone.
I could see some value in changing that to numa_node_id(), but then
zone_local() and zone_allows_reclaim() should probably both switch.
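The zone_local() change and the question of which reference node to compare
against can be illustrated with a toy model. Structure and function names
loosely follow mm/page_alloc.c, but this is a sketch, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

struct zone { int node; };

#define LOCAL_DISTANCE 10

/* Stand-in for node_distance(): identical nodes are LOCAL_DISTANCE
 * apart, everything else is farther. In a real kernel this is a table
 * lookup behind a function call, which is the fast-path cost the
 * patch avoids. */
static int node_distance(int a, int b)
{
	return (a == b) ? LOCAL_DISTANCE : 20;
}

/* Old form. */
static bool zone_local_old(struct zone *local_zone, struct zone *zone)
{
	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
}

/* New form: compare node IDs directly. The open question in the thread
 * is whether the reference should be the preferred zone's node (as
 * here) or the CPU's node via numa_node_id(); the two differ when,
 * e.g., a ZONE_DMA32 request falls back to a remote node. */
static bool zone_local_new(struct zone *preferred, struct zone *zone)
{
	return zone->node == preferred->node;
}
```

Both forms agree whenever the preferred zone is on the allocating CPU's
node; they can only diverge in the fallback case Mel describes.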
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 6/7] mm: page_alloc: Only account batch allocations requests that are eligible
2013-12-17 11:20 ` Mel Gorman
@ 2013-12-17 15:43 ` Johannes Weiner
-1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 15:43 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 11:20:07AM +0000, Mel Gorman wrote:
> On Mon, Dec 16, 2013 at 03:52:37PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 13, 2013 at 02:10:06PM +0000, Mel Gorman wrote:
> > > Not signed off. Johannes, was the intent really to decrement the batch
> > > counts regardless of whether the policy was being enforced or not?
> >
> > Yes. Bursts of allocations for which the policy does not get enforced
> > will still create memory pressure and affect cache aging on a given
> > node. So even if we only distribute page cache, we want to distribute
> > it in a way that all allocations on the eligible zones equal out.
>
> This means that allocations for page table pages affect the distribution of
> page cache pages. An adverse workload could time when it faults anonymous
> pages (to allocate anon and page table pages) in batch sequences and then
> access files to force page cache pages to be allocated from a single node.
>
> I think I know what your response will be. It will be that the utilisation of
> the zone for page table pages and anon pages means that you want more page
> cache pages to be allocated from the other zones so the reclaim pressure
> is still more or less even. If this is the case or there is another reason
> then it could have done with a comment because it's a subtle detail.
Yes, that was the idea: the cache placement compensates for pages
that are still always allocated on the preferred zone first, so that
the end result is approximately as if round-robin had been applied to
everybody.
This should be documented as part of the patch that first diverges
between the allocations that are counted and the allocations that are
round-robined:
mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
I'm updating my tree.
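The accounting/enforcement split being discussed can be sketched as follows:
every allocation consumes the preferred zone's fairness batch, but only
"eligible" allocations (page cache, in the proposed default) are actually
diverted once that batch runs out. This is a toy illustration with invented
names, not the real allocator:

```c
#include <assert.h>
#include <stdbool.h>

struct zone { int batch; };

static struct zone *alloc_from(struct zone *preferred,
			       struct zone *fallback, bool eligible)
{
	struct zone *z = preferred;

	/* Enforcement: only eligible allocations round-robin away
	 * from the preferred zone. */
	if (eligible && preferred->batch <= 0)
		z = fallback;

	/* Accounting: every allocation decrements the batch of the
	 * zone it lands on, so bursts of ineligible allocations
	 * (page tables, anon) still push later page cache placement
	 * toward the other zones. */
	z->batch--;
	return z;
}
```

A burst of ineligible allocations exhausts the preferred zone's batch, so
the next eligible (page cache) allocation is placed on the fallback zone,
which is exactly the compensation effect Johannes describes.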
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-17 15:29 ` Mel Gorman
@ 2013-12-17 15:54 ` Johannes Weiner
-1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 15:54 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 03:29:54PM +0000, Mel Gorman wrote:
> On Mon, Dec 16, 2013 at 03:42:15PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > bug whereby new pages could be reclaimed before old pages because of
> > > how the page allocator and kswapd interacted on the per-zone LRU lists.
> > > Unfortunately it was missed during review that a consequence is that
> > > we also round-robin between NUMA nodes. This is bad for two reasons
> > >
> > > 1. It alters the semantics of MPOL_LOCAL without telling anyone
> > > 2. It incurs an immediate remote memory performance hit in exchange
> > > for a potential performance gain when memory needs to be reclaimed
> > > later
> > >
> > > No cookies for the reviewers on this one.
> > >
> > > This patch makes the behaviour of the fair zone allocator policy
> > > configurable. By default it will only distribute pages that are going
> > > to exist on the LRU between zones local to the allocating process. This
> > > preserves the historical semantics of MPOL_LOCAL.
> > >
> > > By default, slab pages are not distributed between zones after this patch is
> > > applied. It can be argued that they should get similar treatment but they
> > > have different lifecycles to LRU pages, the shrinkers are not zone-aware
> > > and the interaction between the page allocator and kswapd is different
> > > for slabs. If it turns out to be an almost universal win, we can change
> > > the default.
> > >
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > > Documentation/sysctl/vm.txt | 32 ++++++++++++++
> > > include/linux/mmzone.h | 2 +
> > > include/linux/swap.h | 2 +
> > > kernel/sysctl.c | 8 ++++
> > > mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------
> > > 5 files changed, 134 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > > index 1fbd4eb..8eaa562 100644
> > > --- a/Documentation/sysctl/vm.txt
> > > +++ b/Documentation/sysctl/vm.txt
> > > @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> > > - swappiness
> > > - user_reserve_kbytes
> > > - vfs_cache_pressure
> > > +- zone_distribute_mode
> > > - zone_reclaim_mode
> > >
> > > ==============================================================
> > > @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
> > >
> > > ==============================================================
> > >
> > > +zone_distribute_mode
> > > +
> > > +Page allocation and reclaim are managed on a per-zone basis. When the
> > > +system needs to reclaim memory, candidate pages are selected from these
> > > +per-zone lists. Historically, a potential consequence was that recently
> > > +allocated pages were considered reclaim candidates. From a zone-local
> > > +perspective, page aging was preserved but from a system-wide perspective
> > > +there was an age inversion problem.
> > > +
> > > +A similar problem occurs on a node level where young pages may be reclaimed
> > > +from the local node instead of allocating remote memory. Unfortunately, the
> > > +cost of accessing remote nodes is higher so the system must choose by default
> > > +between favouring page aging or node locality. zone_distribute_mode controls
> > > +how the system will distribute page ages between zones.
> > > +
> > > +0 = Never round-robin based on age
> >
> > I think we should be very conservative with the userspace interface we
> > export on a mechanism we are obviously just figuring out.
> >
>
> And we have a proposal on how to limit this. I'll be layering another
> patch on top that removes this interface again. That will allow us to
> roll back one patch and still have a usable interface if necessary.
>
> > > +Otherwise the values are ORed together
> > > +
> > > +1 = Distribute anon pages between zones local to the allocating node
> > > +2 = Distribute file pages between zones local to the allocating node
> > > +4 = Distribute slab pages between zones local to the allocating node
> >
> > Zone fairness within a node does not affect mempolicy or remote
> > reference costs. Is there a reason to have this configurable?
> >
>
> Symmetry
>
> > > +The following three flags effectively alter MPOL_DEFAULT, be careful.
> > > +
> > > +8 = Distribute anon pages between zones remote to the allocating node
> > > +16 = Distribute file pages between zones remote to the allocating node
> > > +32 = Distribute slab pages between zones remote to the allocating node
> >
> > Yes, it's conceivable that somebody might want to disable remote
> > distribution because of the extra references.
> >
> > But at this point, I'd much rather back out anon and slab distribution
> > entirely, it was a mistake to include them.
> >
> > That would leave us with a single knob to disable remote page cache
> > placement.
> >
>
> When looking at this closer I found that sysv is a weird exception. It's
> file-backed as far as most of the VM is concerned but looks anonymous to
> most applications that care. That and MAP_SHARED anonymous pages should
> not be treated like files but we still want tmpfs to be treated as
> files. Details will be in the changelog of the next series.
In what sense is it seen as file-backed? The pages are swapbacked and
they sit on the anon LRUs, so at least as far as aging and reclaim
goes (what this series is concerned with) they are anon, not file.
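The sysv/tmpfs tension in this exchange can be made concrete with a toy
classification model: reclaim and aging classify pages by swap-backedness
(which LRU they sit on), while a naive "is it a file?" view classifies by
whether a file mapping exists. Field and function names here are invented
for illustration:

```c
#include <assert.h>
#include <stdbool.h>

struct page_model {
	bool has_file_mapping;	/* tmpfs, sysv shm, MAP_SHARED anon */
	bool swap_backed;	/* anon, tmpfs, sysv shm */
};

/* The view this series is concerned with: swap-backed pages age on
 * the anon LRU regardless of any file mapping. */
static bool on_anon_lru(const struct page_model *p)
{
	return p->swap_backed;
}

/* The view much of the VM's file paths (and applications) take. */
static bool looks_file_backed(const struct page_model *p)
{
	return p->has_file_mapping;
}
```

A sysv shm page is true under both predicates, which is precisely why it
is awkward to sort into a file-only distribution policy.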
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
@ 2013-12-17 15:54 ` Johannes Weiner
0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 15:54 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 03:29:54PM +0000, Mel Gorman wrote:
> On Mon, Dec 16, 2013 at 03:42:15PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > bug whereby new pages could be reclaimed before old pages because of
> > > how the page allocator and kswapd interacted on the per-zone LRU lists.
> > > Unfortunately it was missed during review that a consequence is that
> > > we also round-robin between NUMA nodes. This is bad for two reasons
> > >
> > > 1. It alters the semantics of MPOL_LOCAL without telling anyone
> > > 2. It incurs an immediate remote memory performance hit in exchange
> > > for a potential performance gain when memory needs to be reclaimed
> > > later
> > >
> > > No cookies for the reviewers on this one.
> > >
> > > This patch makes the behaviour of the fair zone allocator policy
> > > configurable. By default it will only distribute pages that are going
> > > to exist on the LRU between zones local to the allocating process. This
> > > preserves the historical semantics of MPOL_LOCAL.
> > >
> > > By default, slab pages are not distributed between zones after this patch is
> > > applied. It can be argued that they should get similar treatment but they
> > > have different lifecycles to LRU pages, the shrinkers are not zone-aware
> > > and the interaction between the page allocator and kswapd is different
> > > for slabs. If it turns out to be an almost universal win, we can change
> > > the default.
> > >
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > > Documentation/sysctl/vm.txt | 32 ++++++++++++++
> > > include/linux/mmzone.h | 2 +
> > > include/linux/swap.h | 2 +
> > > kernel/sysctl.c | 8 ++++
> > > mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------
> > > 5 files changed, 134 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > > index 1fbd4eb..8eaa562 100644
> > > --- a/Documentation/sysctl/vm.txt
> > > +++ b/Documentation/sysctl/vm.txt
> > > @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> > > - swappiness
> > > - user_reserve_kbytes
> > > - vfs_cache_pressure
> > > +- zone_distribute_mode
> > > - zone_reclaim_mode
> > >
> > > ==============================================================
> > > @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
> > >
> > > ==============================================================
> > >
> > > +zone_distribute_mode
> > > +
> > > +Pages allocation and reclaim are managed on a per-zone basis. When the
> > > +system needs to reclaim memory, candidate pages are selected from these
> > > +per-zone lists. Historically, a potential consequence was that recently
> > > +allocated pages were considered reclaim candidates. From a zone-local
> > > +perspective, page aging was preserved but from a system-wide perspective
> > > +there was an age inversion problem.
> > > +
> > > +A similar problem occurs on a node level where young pages may be reclaimed
> > > +from the local node instead of allocating remote memory. Unforuntately, the
> > > +cost of accessing remote nodes is higher so the system must choose by default
> > > +between favouring page aging or node locality. zone_distribute_mode controls
> > > +how the system will distribute page ages between zones.
> > > +
> > > +0 = Never round-robin based on age
> >
> > I think we should be very conservative with the userspace interface we
> > export on a mechanism we are obviously just figuring out.
> >
>
> And we have a proposal on how to limit this. I'll be layering another
> patch on top and removes this interface again. That will allows us to
> rollback one patch and still have a usable interface if necessary.
>
> > > +Otherwise the values are ORed together
> > > +
> > > +1 = Distribute anon pages between zones local to the allocating node
> > > +2 = Distribute file pages between zones local to the allocating node
> > > +4 = Distribute slab pages between zones local to the allocating node
> >
> > Zone fairness within a node does not affect mempolicy or remote
> > reference costs. Is there a reason to have this configurable?
> >
>
> Symmetry
>
> > > +The following three flags effectively alter MPOL_DEFAULT, be careful.
> > > +
> > > +8 = Distribute anon pages between zones remote to the allocating node
> > > +16 = Distribute file pages between zones remote to the allocating node
> > > +32 = Distribute slab pages between zones remote to the allocating node
> >
> > Yes, it's conceivable that somebody might want to disable remote
> > distribution because of the extra references.
> >
> > But at this point, I'd much rather back out anon and slab distribution
> > entirely, it was a mistake to include them.
> >
> > That would leave us with a single knob to disable remote page cache
> > placement.
> >
>
> When looking at this closer I found that sysv is a weird exception. It's
> file-backed as far as most of the VM is concerned but looks anonymous to
> most applications that care. That and MAP_SHARED anonymous pages should
> not be treated like files but we still want tmpfs to be treated as
> files. Details will be in the changelog of the next series.
In what sense is it seen as file-backed? The pages are swapbacked and
they sit on the anon LRUs, so at least as far as aging and reclaim
goes (what this series is concerned with) they are anon, not file.
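The distinction Johannes draws here hinges on the swap-backed test: tmpfs and sysv shm pages have a file mapping but are swap-backed, so reclaim ages them on the anon LRUs. A simplified userspace sketch of that classification rule (`struct fake_page` is a stand-in; the real kernel test of this era is `page_is_file_cache()`, which is `!PageSwapBacked(page)`):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the one page flag this classification depends on. */
struct fake_page {
	bool swap_backed;	/* set for anon, tmpfs and sysv shm pages */
};

/*
 * tmpfs/sysv pages look file-backed to applications, but because they
 * are swap-backed the reclaim code treats them as anon, not file.
 */
static bool page_is_file_cache(const struct fake_page *page)
{
	return !page->swap_backed;
}
```

This is why a knob that only distributes "file" pages in the LRU sense would not, by itself, interleave sysv shared memory.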
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy
2013-12-13 22:15 ` Johannes Weiner
@ 2013-12-17 16:04 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 16:04 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Fri, Dec 13, 2013 at 05:15:41PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 07:20:14PM +0000, Mel Gorman wrote:
> > On Fri, Dec 13, 2013 at 12:04:43PM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 13, 2013 at 02:10:07PM +0000, Mel Gorman wrote:
> > > > Indications from Johannes that he wanted this. Needs some data and/or justification why
> > > > thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
> > > > it should be considered finished. I do not necessarily agree this patch is necessary
> > > > but it's worth punting it out there for discussion and testing.
> > >
> > > I demonstrated enormous gains in the original submission of the fair
> > > allocation patch and
> >
> > And the same test missed that it broke MPOL_DEFAULT and regressed any workload
> > that does not hit reclaim by incurring remote accesses unnecessarily.
>
> And none of this was nice, agreed, but it does not invalidate the
> gains, it only changes what we are comparing them to.
>
Notifying users that we're changing existing interfaces is important. Again, I
need to be clear that I'm not against the change per se. I'm annoyed with
myself more than anything that I missed some of the major implications
of the change the first time around and want to get back some of the
performance we lost due to remote memory usage.
> > With this patch applied, MPOL_DEFAULT again does not act as
> > documented by Documentation/vm/numa_memory_policy.txt and that file
> > has been around a long time. It also does not match the documented
> > behaviour of mbind where it says
> >
> > The system-wide default policy allocates pages on the node of
> > the CPU that triggers the allocation. For MPOL_DEFAULT, the nodemask
> > and maxnode arguments must be specify the empty set of nodes.
> >
> > That said, that documentation is also strictly wrong as MPOL_DEFAULT *may*
> > allocate on remote nodes.
> >
> > > your tests haven't really shown downsides to the
> > > cache-over-nodes portion of it. I don't want to revert
> > > the cache-over-nodes fairness without any supporting data.
> > >
> >
> > It breaks MPOL_LOCAL for file-backed mappings in a manner that cannot be
> > overridden by policies and it is not even documented. The same effect
> > could have been achieved for the processes repeatedly reading files by
> > running them with the MPOL_INTERLEAVE policy. There was also no convenient
> > way for a user to override that behaviour. Hard-binding to a node would
> > work but tough luck if the process needs more than one node of memory.
>
> Hardbinding or enabling zone_reclaim_mode, yes. But agreed, let's fix
> these problems.
>
I would very much hate to recommend zone_reclaim_mode to work around
this. That thing is a disaster for a lot of workloads and can cause massive
allocation latencies in an effort to keep memory local. I've dealt with
a fairly sizable number of bugs over the last three years related to
that setting.
> > What I will admit is that I doubt anyone cares that file-backed pages
> > are not node-local as documented as the cost of the IO itself probably
> > dominates but just because something does not make sense does not mean
> > someone is depending on the behaviour.
>
> And that's why I very much agree that we need a way for people to
> revert to the old behavior in case we are wrong about this.
>
> But it's also a very strong argument for what the new default should
> be, given that we allow people to revert our decision in the field.
>
We still need to update the docs at the same time as the default is changed
or at least have the man pages patch in flight to Michael Kerrisk.
> > That alone is pretty heavy justification even in the absence of supporting
> > data showing a workload that depends on file pages being node-local that
> > is not hidden by the cost of the IO itself.
>
> Even if we anticipate that nobody will care about it and we provide a
> way to revert the behavior in the field in case we are wrong?
>
> I disagree.
>
There will be people that care, they just haven't shown up yet. We missed
one important example. After the fair allocation policy we are interleaving
sysv shared memory between nodes. I bet you a shiny penny that heavy users
of sysv shared memory (databases) are depending on the local allocation
policy for those areas and we broke that. They'd be hit even if they were
using direct IO. It could be a long time before some user of those databases
notices a performance regression of a few percent and finds this change.
We may have missed other examples which is why I would prefer that a
change in the default would be accompanied by an update of Documentation/
and of the manual pages. At least that way we can claim it's behaving as
designed and users will have a chance of discovering the change without
having to post to linux-mm.
> We should definitely allow the user to override our decision, but the
> default should be what we anticipate will benefit most users.
>
> And I'm really not trying to be ignorant of long-standing documented
> behavior that users may have come to expect. The bug reports will
> land on my desk just as well. But it looks like the current behavior
> does not make much sense and is unlikely to be missed.
>
I think the treatment of sysv shared memory is an important exception.
However, I cover that in the next series, although the hack used may
cause people to throw rocks at me. That's assuming the hack even works;
I have not booted it yet.
> > > Reverting cross-node fairness for anon and slab is a good idea. It
> > > was always about cache and the original patch was too broad stroked,
> > > but it doesn't invalidate everything it was about.
> > >
> >
> > No it doesn't, but it should at least have been documented.
>
> Yes, no argument there.
>
> > > I can see, however, that we might want to make this configurable, but
> > > I'm not eager on exporting user interfaces unless we have to. As the
> > > node-local fairness was never questioned by anybody, is it necessary
> > > to make it configurable?
> >
> > It's only there since 3.12 and it takes a long time for people to notice
> > NUMA regressions, especially ones that would just be within a few percent
> > like this was unless they were specifically looking for it.
>
> No, I meant only the case where we distribute memory fairly among the
> zones WITHIN a given node. This does not affect NUMA placement. I
> wouldn't want to make this configurable unless you think people might
> want to disable this. I can't think of a reason, anyway.
>
Oh right. That thing was just about API symmetry and for experimentation. I
could not think of a good reason why someone would use it other than to
demonstrate the impact of the fair allocation policy on UMA machines with
a small highest zone. It's the type of thing that Zlatko Calusic's
testing would be sensitive to.
In my current series I replaced it with the knob suggested by Rik and
yourself. The internal details are still the same, but the user-visible
knob controls just page cache, with special casing of MAP_SHARED anonymous
and sysv memory.
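A minimal sketch of how a boolean vm.pagecache_interleave knob could map onto an internal flag word, per the crude approach discussed earlier in the thread (the `ZDM_FILE_REMOTE` name and value are illustrative assumptions, not the patch's actual identifiers):

```c
#include <assert.h>
#include <stdbool.h>

#define ZDM_FILE_REMOTE 16	/* hypothetical internal flag value */

static unsigned int zone_distribute_mode;

/*
 * Writing the sysctl toggles only the remote-file-distribution bit;
 * the other internal distribution bits are left untouched.
 */
static void set_pagecache_interleave(bool enabled)
{
	if (enabled)
		zone_distribute_mode |= ZDM_FILE_REMOTE;
	else
		zone_distribute_mode &= ~ZDM_FILE_REMOTE;
}
```

The appeal of this shape is that the user-visible interface stays a single boolean even if the internal flag layout later changes.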
> > > Shouldn't we be okay with just a single
> > > vm.pagecache_interleave (name by Rik) sysctl that defaults to 1 but
> > > allows users to go back to pagecache obeying mempolicy?
> > >
> >
> > That can be done. I can put together a patch that defaults it to 0 and
> > sets the DISTRIBUTE_REMOTE_FILE flag if someone writes to it. That's a
> > crude hack but many people will be ok with it.
> >
> > Making it the default, though, would require more work.
> > Create an MPOL_DISTRIB_PAGECACHE memory policy (named so because it
> > is not strictly interleave). Abstract MPOL_DEFAULT to be either
> > MPOL_LOCAL or MPOL_DISTRIB_PAGECACHE depending on the value of
> > vm.pagecache_interleave. Update the manual pages and Documentation/,
> > then set the default of vm.pagecache_interleave to 1.
> >
> > That would allow more sane defaults and also allow users to override it
> > on a per task and per VMA basis as they can for any other type of memory
> > policy.
>
> Not using round-robin placement for cache creates weird artifacts in
> our LRU aging decisions. By not aging all pages in a workingset
> equally, we may end up activating barely used pages on a remote node
> and creating pressure on its active list for no reason.
>
I fully appreciate the positive aspects of the patch and want to see it
happen. If I didn't, I would be trying to revert the patch and ignoring
any arguments to the contrary. I would just prefer we did it in a way
that generated less paperwork in the future.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 6/7] mm: page_alloc: Only account batch allocations requests that are eligible
2013-12-17 15:43 ` Johannes Weiner
@ 2013-12-17 16:06 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 16:06 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 10:43:51AM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 11:20:07AM +0000, Mel Gorman wrote:
> > On Mon, Dec 16, 2013 at 03:52:37PM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 13, 2013 at 02:10:06PM +0000, Mel Gorman wrote:
> > > > Not signed off. Johannes, was the intent really to decrement the batch
> > > > counts regardless of whether the policy was being enforced or not?
> > >
> > > Yes. Bursts of allocations for which the policy does not get enforced
> > > will still create memory pressure and affect cache aging on a given
> > > node. So even if we only distribute page cache, we want to distribute
> > > it in a way that all allocations on the eligible zones equal out.
> >
> > This means that allocations for page table pages affect the distribution of
> > page cache pages. An adverse workload could time when it faults anonymous
> > pages (to allocate anon and page table pages) in batch sequences and then
> > access files to force page cache pages to be allocated from a single node.
> >
> > I think I know what your response will be. It will be that the utilisation of
> > the zone for page table pages and anon pages means that you want more page
> > cache pages to be allocated from the other zones so the reclaim pressure
> > is still more or less even. If this is the case or there is another reason
> > then it could have done with a comment because it's a subtle detail.
>
> Yes, that was the idea, that the cache placement compensates for pages
> that still are always allocated on the preferred zone first, so that
> the end result is approximately as if round-robin had been applied to
> everybody.
>
Ok, understood. I wanted to be sure that was the thinking behind it.
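The compensation idea agreed on here can be sketched as follows: every allocation consumes the zone's fairness batch, but only fairness-eligible allocations (page cache) get pushed to the next zone when the batch is exhausted. This is a simplified userspace model, not the kernel's actual `NR_ALLOC_BATCH` code:

```c
#include <assert.h>
#include <stdbool.h>

struct fake_zone {
	int alloc_batch;	/* stand-in for NR_ALLOC_BATCH */
};

/*
 * All allocations decrement the batch, so bursts of anon or page table
 * allocations on the preferred zone still push subsequent page-cache
 * allocations to the other zones, keeping reclaim pressure roughly even.
 */
static bool zone_takes_alloc(struct fake_zone *zone, bool fair_eligible)
{
	if (fair_eligible && zone->alloc_batch <= 0)
		return false;		/* round-robin to the next zone */
	zone->alloc_batch--;		/* counted for every allocation */
	return true;
}
```

Note the asymmetry: a non-eligible allocation is never diverted, but it still drains the batch, which is exactly the subtlety Mel asked to have documented.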
> This should be documented as part of the patch that first diverges
> between the allocations that are counted and the allocations that are
> round-robined:
>
> mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
>
> I'm updating my tree.
I'll leave it alone in mine then. We'll figure out how to sync up later.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
2013-12-17 15:38 ` Johannes Weiner
@ 2013-12-17 16:08 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 16:08 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 10:38:29AM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 11:13:52AM +0000, Mel Gorman wrote:
> > On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > > > zone_local is using node_distance which is a more expensive call than
> > > > necessary. On x86, it's another function call in the allocator fast path
> > > > and increases cache footprint. This patch makes the assumption zones on a
> > > > local node will share the same node ID. The necessary information should
> > > > already be cache hot.
> > > >
> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > ---
> > > > mm/page_alloc.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 64020eb..fd9677e 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > > >
> > > > static bool zone_local(struct zone *local_zone, struct zone *zone)
> > > > {
> > > > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > > + return zone_to_nid(zone) == numa_node_id();
> > >
> > > Why numa_node_id()? We pass in the preferred zone as @local_zone:
> > >
> >
> > Initially because I was thinking "local node" and numa_node_id() is a
> > per-cpu variable that should be cheap to access and in some cases
> > cache-hot as the top-level gfp API calls numa_node_id().
> >
> > Thinking about it more though it still makes sense because the preferred
> > zone is not necessarily local. If the allocation request requires ZONE_DMA32
> > and the local node does not have that zone then preferred zone is on a
> > remote node.
>
> Don't we treat everything in relation to the preferred zone?
Usually yes, but this time we really care about whether the memory is
local or remote. It makes sense to me as it is, and I struggle to see an
advantage of expressing it in terms of the preferred zone. Minimally,
zone_local would need to be renamed if it could return true for a remote
zone and I see no advantage in doing that.
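The two interpretations being debated can be sketched side by side. With a ZONE_DMA32-restricted allocation on a node that lacks that zone, the preferred zone itself sits on a remote node, so the two tests diverge (this is a simplified model with the CPU's node passed explicitly, not the kernel's per-cpu `numa_node_id()`):

```c
#include <assert.h>
#include <stdbool.h>

struct fake_zone {
	int nid;	/* node this zone belongs to */
};

/* Mel's version: local means "on the allocating CPU's node". */
static bool zone_local_to_cpu(int cpu_nid, const struct fake_zone *zone)
{
	return zone->nid == cpu_nid;
}

/* Johannes's reading: local means "on the preferred zone's node". */
static bool zone_local_to_preferred(const struct fake_zone *preferred,
				    const struct fake_zone *zone)
{
	return zone->nid == preferred->nid;
}
```

When the preferred zone is on node 1 but the allocating CPU is on node 0 (the ZONE_DMA32 fallback case), a zone on node 1 is "local" by the second test but remote by the first, which is the distinction Mel is defending.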
I might be stuck in a "la la la, everything is fine" rut.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-17 15:54 ` Johannes Weiner
@ 2013-12-17 16:14 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 16:14 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 10:54:35AM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 03:29:54PM +0000, Mel Gorman wrote:
> > On Mon, Dec 16, 2013 at 03:42:15PM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > > bug whereby new pages could be reclaimed before old pages because of
> > > > how the page allocator and kswapd interacted on the per-zone LRU lists.
> > > > Unfortunately it was missed during review that a consequence is that
> > > > we also round-robin between NUMA nodes. This is bad for two reasons
> > > >
> > > > 1. It alters the semantics of MPOL_LOCAL without telling anyone
> > > > 2. It incurs an immediate remote memory performance hit in exchange
> > > > for a potential performance gain when memory needs to be reclaimed
> > > > later
> > > >
> > > > No cookies for the reviewers on this one.
> > > >
> > > > This patch makes the behaviour of the fair zone allocator policy
> > > > configurable. By default it will only distribute pages that are going
> > > > to exist on the LRU between zones local to the allocating process. This
> > > > preserves the historical semantics of MPOL_LOCAL.
> > > >
> > > > By default, slab pages are not distributed between zones after this patch is
> > > > applied. It can be argued that they should get similar treatment but they
> > > > have different lifecycles to LRU pages, the shrinkers are not zone-aware
> > > > and the interaction between the page allocator and kswapd is different
> > > > for slabs. If it turns out to be an almost universal win, we can change
> > > > the default.
> > > >
> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > ---
> > > > Documentation/sysctl/vm.txt | 32 ++++++++++++++
> > > > include/linux/mmzone.h | 2 +
> > > > include/linux/swap.h | 2 +
> > > > kernel/sysctl.c | 8 ++++
> > > > mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------
> > > > 5 files changed, 134 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > > > index 1fbd4eb..8eaa562 100644
> > > > --- a/Documentation/sysctl/vm.txt
> > > > +++ b/Documentation/sysctl/vm.txt
> > > > @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> > > > - swappiness
> > > > - user_reserve_kbytes
> > > > - vfs_cache_pressure
> > > > +- zone_distribute_mode
> > > > - zone_reclaim_mode
> > > >
> > > > ==============================================================
> > > > @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
> > > >
> > > > ==============================================================
> > > >
> > > > +zone_distribute_mode
> > > > +
> > > > +Page allocation and reclaim are managed on a per-zone basis. When the
> > > > +system needs to reclaim memory, candidate pages are selected from these
> > > > +per-zone lists. Historically, a potential consequence was that recently
> > > > +allocated pages were considered reclaim candidates. From a zone-local
> > > > +perspective, page aging was preserved but from a system-wide perspective
> > > > +there was an age inversion problem.
> > > > +
> > > > +A similar problem occurs on a node level where young pages may be reclaimed
> > > > +from the local node instead of allocating remote memory. Unfortunately, the
> > > > +cost of accessing remote nodes is higher so the system must choose by default
> > > > +between favouring page aging or node locality. zone_distribute_mode controls
> > > > +how the system will distribute page ages between zones.
> > > > +
> > > > +0 = Never round-robin based on age
> > >
> > > I think we should be very conservative with the userspace interface we
> > > export on a mechanism we are obviously just figuring out.
> > >
> >
> > And we have a proposal on how to limit this. I'll be layering another
> > patch on top that removes this interface again. That will allow us to
> > roll back one patch and still have a usable interface if necessary.
> >
> > > > +Otherwise the values are ORed together:
> > > > +
> > > > +1 = Distribute anon pages between zones local to the allocating node
> > > > +2 = Distribute file pages between zones local to the allocating node
> > > > +4 = Distribute slab pages between zones local to the allocating node
> > >
> > > Zone fairness within a node does not affect mempolicy or remote
> > > reference costs. Is there a reason to have this configurable?
> > >
> >
> > Symmetry
> >
> > > > +The following three flags effectively alter MPOL_DEFAULT; be careful.
> > > > +
> > > > +8 = Distribute anon pages between zones remote to the allocating node
> > > > +16 = Distribute file pages between zones remote to the allocating node
> > > > +32 = Distribute slab pages between zones remote to the allocating node
> > >
> > > Yes, it's conceivable that somebody might want to disable remote
> > > distribution because of the extra references.
> > >
> > > But at this point, I'd much rather back out anon and slab distribution
> > > entirely, it was a mistake to include them.
> > >
> > > That would leave us with a single knob to disable remote page cache
> > > placement.
> > >
> >
> > When looking at this closer I found that sysv is a weird exception. It's
> > file-backed as far as most of the VM is concerned but looks anonymous to
> > most applications that care. That and MAP_SHARED anonymous pages should
> > not be treated like files but we still want tmpfs to be treated as
> > files. Details will be in the changelog of the next series.
>
> In what sense is it seen as file-backed?
sysv and anonymous pages are backed by an internal shmem mount point. In
lots of respects it looks like a file and quacks like a file, but I expect
developers think of it as anonymous, and chunks of the VM treat it as
anonymous. tmpfs uses the same paths and is treated by the VM similarly
to anon, but users may think that tmpfs should be subject to the
fair allocation zone policy "because they're files." It's a sufficiently
weird case that any action we take there should be deliberate. It'll be
a bit clearer when I post the patch that special-cases this.
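The "looks like a file, behaves like anonymous memory" duality can be seen
from userspace. A minimal sketch, not from the thread: it uses Python's
multiprocessing.shared_memory, which is backed by the /dev/shm tmpfs mount
on Linux, so it is an analogous illustration of shmem backing rather than
the sysv path itself, and it assumes a Linux system with /dev/shm mounted:

```python
import os
from multiprocessing import shared_memory

# Create a shared memory segment; to the application it behaves
# like anonymous memory that can be written to directly.
shm = shared_memory.SharedMemory(create=True, size=4096)
shm.buf[:5] = b"hello"

# Under the hood, the segment is a file on a tmpfs mount, so the
# VM manages it through the same shmem/file paths discussed above.
backing_path = "/dev/shm/" + shm.name
backed_by_file = os.path.exists(backing_path)
print(backed_by_file)

shm.close()
shm.unlink()
```

The application never opened a file, yet the segment is visible as one on
the tmpfs mount, which is exactly the ambiguity being discussed.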
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-17 16:14 ` Mel Gorman
@ 2013-12-17 17:43 ` Johannes Weiner
0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 17:43 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 04:14:20PM +0000, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 10:54:35AM -0500, Johannes Weiner wrote:
> > On Tue, Dec 17, 2013 at 03:29:54PM +0000, Mel Gorman wrote:
> > > On Mon, Dec 16, 2013 at 03:42:15PM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 13, 2013 at 02:10:05PM +0000, Mel Gorman wrote:
> > > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > > > bug whereby new pages could be reclaimed before old pages because of
> > > > > how the page allocator and kswapd interacted on the per-zone LRU lists.
> > > > > Unfortunately it was missed during review that a consequence is that
> > > > > we also round-robin between NUMA nodes. This is bad for two reasons
> > > > >
> > > > > 1. It alters the semantics of MPOL_LOCAL without telling anyone
> > > > > 2. It incurs an immediate remote memory performance hit in exchange
> > > > > for a potential performance gain when memory needs to be reclaimed
> > > > > later
> > > > >
> > > > > No cookies for the reviewers on this one.
> > > > >
> > > > > This patch makes the behaviour of the fair zone allocator policy
> > > > > configurable. By default it will only distribute pages that are going
> > > > > to exist on the LRU between zones local to the allocating process. This
> > > > > preserves the historical semantics of MPOL_LOCAL.
> > > > >
> > > > > By default, slab pages are not distributed between zones after this patch is
> > > > > applied. It can be argued that they should get similar treatment but they
> > > > > have different lifecycles to LRU pages, the shrinkers are not zone-aware
> > > > > and the interaction between the page allocator and kswapd is different
> > > > > for slabs. If it turns out to be an almost universal win, we can change
> > > > > the default.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > > ---
> > > > > Documentation/sysctl/vm.txt | 32 ++++++++++++++
> > > > > include/linux/mmzone.h | 2 +
> > > > > include/linux/swap.h | 2 +
> > > > > kernel/sysctl.c | 8 ++++
> > > > > mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------
> > > > > 5 files changed, 134 insertions(+), 12 deletions(-)
> > > > >
> > > > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > > > > index 1fbd4eb..8eaa562 100644
> > > > > --- a/Documentation/sysctl/vm.txt
> > > > > +++ b/Documentation/sysctl/vm.txt
> > > > > @@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
> > > > > - swappiness
> > > > > - user_reserve_kbytes
> > > > > - vfs_cache_pressure
> > > > > +- zone_distribute_mode
> > > > > - zone_reclaim_mode
> > > > >
> > > > > ==============================================================
> > > > > @@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
> > > > >
> > > > > ==============================================================
> > > > >
> > > > > +zone_distribute_mode
> > > > > +
> > > > > +Page allocation and reclaim are managed on a per-zone basis. When the
> > > > > +system needs to reclaim memory, candidate pages are selected from these
> > > > > +per-zone lists. Historically, a potential consequence was that recently
> > > > > +allocated pages were considered reclaim candidates. From a zone-local
> > > > > +perspective, page aging was preserved but from a system-wide perspective
> > > > > +there was an age inversion problem.
> > > > > +
> > > > > +A similar problem occurs on a node level where young pages may be reclaimed
> > > > > +from the local node instead of allocating remote memory. Unfortunately, the
> > > > > +cost of accessing remote nodes is higher so the system must choose by default
> > > > > +between favouring page aging or node locality. zone_distribute_mode controls
> > > > > +how the system will distribute page ages between zones.
> > > > > +
> > > > > +0 = Never round-robin based on age
> > > >
> > > > I think we should be very conservative with the userspace interface we
> > > > export on a mechanism we are obviously just figuring out.
> > > >
> > >
> > > And we have a proposal on how to limit this. I'll be layering another
> > > patch on top that removes this interface again. That will allow us to
> > > roll back one patch and still have a usable interface if necessary.
> > >
> > > > > +Otherwise the values are ORed together:
> > > > > +
> > > > > +1 = Distribute anon pages between zones local to the allocating node
> > > > > +2 = Distribute file pages between zones local to the allocating node
> > > > > +4 = Distribute slab pages between zones local to the allocating node
> > > >
> > > > Zone fairness within a node does not affect mempolicy or remote
> > > > reference costs. Is there a reason to have this configurable?
> > > >
> > >
> > > Symmetry
> > >
> > > > > +The following three flags effectively alter MPOL_DEFAULT; be careful.
> > > > > +
> > > > > +8 = Distribute anon pages between zones remote to the allocating node
> > > > > +16 = Distribute file pages between zones remote to the allocating node
> > > > > +32 = Distribute slab pages between zones remote to the allocating node
> > > >
> > > > Yes, it's conceivable that somebody might want to disable remote
> > > > distribution because of the extra references.
> > > >
> > > > But at this point, I'd much rather back out anon and slab distribution
> > > > entirely, it was a mistake to include them.
> > > >
> > > > That would leave us with a single knob to disable remote page cache
> > > > placement.
> > > >
> > >
> > > When looking at this closer I found that sysv is a weird exception. It's
> > > file-backed as far as most of the VM is concerned but looks anonymous to
> > > most applications that care. That and MAP_SHARED anonymous pages should
> > > not be treated like files but we still want tmpfs to be treated as
> > > files. Details will be in the changelog of the next series.
> >
> > In what sense is it seen as file-backed?
>
> sysv and anonymous pages are backed by an internal shmem mount point. In
> lots of respects it looks like a file and quacks like a file, but I expect
> developers think of it as anonymous, and chunks of the VM treat it as
> anonymous. tmpfs uses the same paths and is treated by the VM similarly
> to anon, but users may think that tmpfs should be subject to the
> fair allocation zone policy "because they're files." It's a sufficiently
> weird case that any action we take there should be deliberate. It'll be
> a bit clearer when I post the patch that special-cases this.
The line I see here is mostly derived from performance expectations.
People and programs expect anon, shmem/tmpfs etc. to be fast and avoid
their reclaim at great cost, so they size this part of their workload
according to memory size and locality. Filesystem cache (on-disk) on
the other hand is expected to be slow on the first fault and after it
has been displaced by other data, but the kernel is mostly expected to
maximize the caching effects in a predictable manner.
The round-robin policy makes the displacement predictable (think of
the aging artifacts here where random pages do not get displaced
reliably because they ended up on remote nodes) and it avoids IO by
maximizing memory utilization.
I.e. it improves behavior associated with a cache, but I don't expect
shmem/tmpfs to be typically used as a disk cache. I could be wrong
about that, but I figure if you need named shared memory that is
bigger than your memory capacity (the point where your tmpfs would
actually turn into a disk cache), you'd be better off using a more
efficient on-disk filesystem.
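The age inversion at issue, and why round-robin placement makes displacement
track allocation order, can be shown with a toy model. This is an editorial
sketch under stated assumptions, not code from the thread: two zones, each
with a FIFO list standing in for a per-zone LRU, where reclaim evicts the
oldest page of whichever zone the new allocation targets once it is full:

```python
from collections import deque

def simulate(round_robin, zone_size=4, allocs=12):
    """Toy model of two zones with per-zone FIFO 'LRU' lists.
    Returns the ages of evicted pages, in eviction order."""
    zones = [deque(), deque()]
    evicted = []
    for age in range(allocs):
        if round_robin:
            z = age % 2                                # fair zone policy
        else:
            z = 0 if len(zones[0]) < zone_size else 1  # fill zone 0 first
        if len(zones[z]) == zone_size:
            evicted.append(zones[z].popleft())         # per-zone reclaim
        zones[z].append(age)
    return evicted

# Fill-first: young pages (ages 4-7) are reclaimed while the oldest
# pages (0-3) sit untouched in zone 0 -- the age inversion.
print(simulate(round_robin=False))  # -> [4, 5, 6, 7]

# Round-robin: the globally oldest pages are displaced first, so
# eviction order is predictable from allocation order.
print(simulate(round_robin=True))   # -> [0, 1, 2, 3]
```

This is the "predictable displacement" property: with round-robin, what gets
reclaimed is determined by global age, not by which zone a page landed in.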
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
2013-12-17 16:08 ` Mel Gorman
@ 2013-12-17 20:11 ` Johannes Weiner
0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 20:11 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 04:08:08PM +0000, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 10:38:29AM -0500, Johannes Weiner wrote:
> > On Tue, Dec 17, 2013 at 11:13:52AM +0000, Mel Gorman wrote:
> > > On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > > > > zone_local is using node_distance which is a more expensive call than
> > > > > necessary. On x86, it's another function call in the allocator fast path
> > > > > and increases cache footprint. This patch makes the assumption zones on a
> > > > > local node will share the same node ID. The necessary information should
> > > > > already be cache hot.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > > ---
> > > > > mm/page_alloc.c | 2 +-
> > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 64020eb..fd9677e 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > > > >
> > > > > static bool zone_local(struct zone *local_zone, struct zone *zone)
> > > > > {
> > > > > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > > > + return zone_to_nid(zone) == numa_node_id();
> > > >
> > > > Why numa_node_id()? We pass in the preferred zone as @local_zone:
> > > >
> > >
> > > Initially because I was thinking "local node" and numa_node_id() is a
> > > per-cpu variable that should be cheap to access and in some cases
> > > cache-hot as the top-level gfp API calls numa_node_id().
> > >
> > > Thinking about it more though it still makes sense because the preferred
> > > zone is not necessarily local. If the allocation request requires ZONE_DMA32
> > > and the local node does not have that zone then preferred zone is on a
> > > remote node.
> >
> > Don't we treat everything in relation to the preferred zone?
>
> Usually yes, but this time we really care about whether the memory is
> local or remote. It makes sense to me as it is, and I struggle to see an
> advantage of expressing it in terms of the preferred zone. Minimally
> zone_local would need to be renamed if it could return true for a remote
> zone and I see no advantage in doing that.
What the function tests for is whether any given zone is close
enough/local to the given preferred zone such that we can allocate
from it without having to invoke zone_reclaim_mode.
In your example, if the preferred DMA32 zone were to be on a remote
node and eligible for allocation but full, a DMA zone on the same node
should be fine as well and would not impose a higher remote reference
burden on the allocator than allocating from the preferred DMA32 zone.
So it's really not about the locality of the allocating task but about
the locality of the given preferred zone.
In my tree, I replaced the function body with
return local_zone->node == zone->node;
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
2013-12-17 20:11 ` Johannes Weiner
@ 2013-12-17 21:03 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 21:03 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 03:11:47PM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 04:08:08PM +0000, Mel Gorman wrote:
> > On Tue, Dec 17, 2013 at 10:38:29AM -0500, Johannes Weiner wrote:
> > > On Tue, Dec 17, 2013 at 11:13:52AM +0000, Mel Gorman wrote:
> > > > On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> > > > > On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > > > > > zone_local is using node_distance, which is a more expensive call than
> > > > > > necessary. On x86, it's another function call in the allocator fast path
> > > > > > and increases cache footprint. This patch makes the assumption that zones
> > > > > > on a local node will share the same node ID. The necessary information
> > > > > > should already be cache hot.
> > > > > >
> > > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > > > ---
> > > > > > mm/page_alloc.c | 2 +-
> > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > index 64020eb..fd9677e 100644
> > > > > > --- a/mm/page_alloc.c
> > > > > > +++ b/mm/page_alloc.c
> > > > > > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > > > > >
> > > > > > static bool zone_local(struct zone *local_zone, struct zone *zone)
> > > > > > {
> > > > > > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > > > > + return zone_to_nid(zone) == numa_node_id();
> > > > >
> > > > > Why numa_node_id()? We pass in the preferred zone as @local_zone:
> > > > >
> > > >
> > > > Initially because I was thinking "local node" and numa_node_id() is a
> > > > per-cpu variable that should be cheap to access and in some cases
> > > > cache-hot as the top-level gfp API calls numa_node_id().
> > > >
> > > > Thinking about it more though it still makes sense because the preferred
> > > > zone is not necessarily local. If the allocation request requires ZONE_DMA32
> > > > and the local node does not have that zone then the preferred zone is on a
> > > > remote node.
> > >
> > > Don't we treat everything in relation to the preferred zone?
> >
> > Usually yes, but this time we really care about whether the memory is
> > local or remote. It makes sense to me as it is, and I struggle to see an
> > advantage of expressing it in terms of the preferred zone. Minimally
> > zone_local would need to be renamed if it could return true for a remote
> > zone and I see no advantage in doing that.
>
> What the function tests for is whether any given zone is close
> enough/local to the given preferred zone such that we can allocate
> from it without having to invoke zone_reclaim_mode.
>
Fine. The helper should then be renamed to zone_preferred_node because
it's no longer about being local.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-17 17:43 ` Johannes Weiner
@ 2013-12-17 21:22 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 21:22 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 12:43:02PM -0500, Johannes Weiner wrote:
> > > > When looking at this closer I found that sysv is a weird exception. It's
> > > > file-backed as far as most of the VM is concerned but looks anonymous to
> > > > most applications that care. That and MAP_SHARED anonymous pages should
> > > > not be treated like files but we still want tmpfs to be treated as
> > > > files. Details will be in the changelog of the next series.
> > >
> > > In what sense is it seen as file-backed?
> >
> > sysv and anonymous pages are backed by an internal shmem mount point. In
> > lots of respects, it looks like a file and quacks like a file, but I expect
> > developers think of it as being anonymous, and chunks of the VM treat it as
> > if it were anonymous. tmpfs uses the same paths and gets treated similarly
> > by the VM as anon, but users may think that tmpfs should be subject to the
> > fair allocation zone policy "because they're files." It's a sufficiently
> > weird case that any action we take there should be deliberate. It'll be
> > a bit clearer when I post the patch that special cases this.
>
> The line I see here is mostly derived from performance expectations.
>
> People and programs expect anon, shmem/tmpfs etc. to be fast and avoid
> their reclaim at great costs, so they size this part of their workload
> according to memory size and locality. Filesystem cache (on-disk) on
> the other hand is expected to be slow on the first fault and after it
> has been displaced by other data, but the kernel is mostly expected to
> maximize the caching effects in a predictable manner.
>
Part of their performance expectations is that memory referenced from the
local node will be allocated locally. Consider NUMA-aware applications that
partition their data appropriately and share it between cooperating
processes using shared memory (as some MPI implementations do). They have
an expectation that the memory will be local and a further expectation
that it will not be reclaimed because they sized it appropriately.
Automatically interleaving such memory by default will be surprising to
NUMA-aware applications even if NUMA-oblivious applications benefit.
Similarly, the pagecache sysctl is documented to affect files, at least
that's how I wrote it. It's inconsistent to explain that as "the sysctl
controls files, except for tmpfs ones because ...... whatever".
> The round-robin policy makes the displacement predictable (think of
> the aging artifacts here where random pages do not get displaced
> reliably because they ended up on remote nodes) and it avoids IO by
> maximizing memory utilization.
>
> I.e. it improves behavior associated with a cache, but I don't expect
> shmem/tmpfs to be typically used as a disk cache. I could be wrong
> about that, but I figure if you need named shared memory that is
> bigger than your memory capacity (the point where your tmpfs would
> > actually turn into a disk cache), you'd be better off using a more
> efficient on-disk filesystem.
I am concerned about semantics like "all files except tmpfs files", or
alternatively about regressing the performance of NUMA-aware applications
and their use of MAP_SHARED and sysv.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6
2013-12-17 15:07 ` Zlatko Calusic
@ 2013-12-17 21:23 ` Mel Gorman
-1 siblings, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 21:23 UTC (permalink / raw)
To: Zlatko Calusic
Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Rik van Riel,
Linux-MM, LKML
On Tue, Dec 17, 2013 at 04:07:35PM +0100, Zlatko Calusic wrote:
> On 13.12.2013 15:10, Mel Gorman wrote:
> >Kicked this another bit today. It's still a bit half-baked but it restores
> >the historical performance and leaves the door open at the end for playing
> >nice with distributing file pages between nodes. Finishing this series
> >depends on whether we are going to make the remote node behaviour of the
> >fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
> >favour of the configurable option because the default can be redefined and
> >tested while giving users a "compat" mode if we discover the new default
> >behaviour sucks for some workload.
> >
>
> I'll start a 5-day test of this patchset in a few hours, unless you
> can send an updated one in the meantime. I intend to test it on a
> rather boring 4GB x86_64 machine that before Johannes' work had lots
> of trouble balancing zones. Would you recommend to use the default
> settings, i.e. don't mess with tunables at this point?
>
For me at least, I would prefer you tested v3 of the series with the
default settings, i.e. not interleaving file-backed pages on remote
nodes. Johannes might request testing with that knob enabled if the
machine is NUMA, although I doubt it is with 4GB of RAM.
Thanks.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality
2013-12-17 21:03 ` Mel Gorman
@ 2013-12-17 22:31 ` Johannes Weiner
-1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 22:31 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 09:03:40PM +0000, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 03:11:47PM -0500, Johannes Weiner wrote:
> > On Tue, Dec 17, 2013 at 04:08:08PM +0000, Mel Gorman wrote:
> > > On Tue, Dec 17, 2013 at 10:38:29AM -0500, Johannes Weiner wrote:
> > > > On Tue, Dec 17, 2013 at 11:13:52AM +0000, Mel Gorman wrote:
> > > > > On Mon, Dec 16, 2013 at 03:25:07PM -0500, Johannes Weiner wrote:
> > > > > > On Fri, Dec 13, 2013 at 02:10:03PM +0000, Mel Gorman wrote:
> > > > > > > zone_local is using node_distance, which is a more expensive call than
> > > > > > > necessary. On x86, it's another function call in the allocator fast path
> > > > > > > and increases cache footprint. This patch makes the assumption that zones
> > > > > > > on a local node will share the same node ID. The necessary information
> > > > > > > should already be cache hot.
> > > > > > >
> > > > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > > > > ---
> > > > > > > mm/page_alloc.c | 2 +-
> > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > > index 64020eb..fd9677e 100644
> > > > > > > --- a/mm/page_alloc.c
> > > > > > > +++ b/mm/page_alloc.c
> > > > > > > @@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > > > > > >
> > > > > > > static bool zone_local(struct zone *local_zone, struct zone *zone)
> > > > > > > {
> > > > > > > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > > > > > + return zone_to_nid(zone) == numa_node_id();
> > > > > >
> > > > > > Why numa_node_id()? We pass in the preferred zone as @local_zone:
> > > > > >
> > > > >
> > > > > Initially because I was thinking "local node" and numa_node_id() is a
> > > > > per-cpu variable that should be cheap to access and in some cases
> > > > > cache-hot as the top-level gfp API calls numa_node_id().
> > > > >
> > > > > Thinking about it more though it still makes sense because the preferred
> > > > > zone is not necessarily local. If the allocation request requires ZONE_DMA32
> > > > > and the local node does not have that zone then the preferred zone is on a
> > > > > remote node.
> > > >
> > > > Don't we treat everything in relation to the preferred zone?
> > >
> > > Usually yes, but this time we really care about whether the memory is
> > > local or remote. It makes sense to me as it is, and I struggle to see an
> > > advantage of expressing it in terms of the preferred zone. Minimally
> > > zone_local would need to be renamed if it could return true for a remote
> > > zone and I see no advantage in doing that.
> >
> > What the function tests for is whether any given zone is close
> > enough/local to the given preferred zone such that we can allocate
> > from it without having to invoke zone_reclaim_mode.
> >
>
> Fine. The helper should then be renamed to zone_preferred_node because
> it's no longer about being local.
Fair enough!
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-17 21:22 ` Mel Gorman
@ 2013-12-17 22:57 ` Johannes Weiner
-1 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2013-12-17 22:57 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 09:22:16PM +0000, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 12:43:02PM -0500, Johannes Weiner wrote:
> > > > > When looking at this closer I found that sysv is a weird exception. It's
> > > > > file-backed as far as most of the VM is concerned but looks anonymous to
> > > > > most applications that care. That and MAP_SHARED anonymous pages should
> > > > > not be treated like files but we still want tmpfs to be treated as
> > > > > files. Details will be in the changelog of the next series.
> > > >
> > > > In what sense is it seen as file-backed?
> > >
> > > sysv and anonymous pages are backed by an internal shmem mount point. In
> > > lots of respects, it looks like a file and quacks like a file, but I expect
> > > developers think of it as being anonymous, and chunks of the VM treat it as
> > > if it were anonymous. tmpfs uses the same paths and gets treated similarly
> > > by the VM as anon, but users may think that tmpfs should be subject to the
> > > fair allocation zone policy "because they're files." It's a sufficiently
> > > weird case that any action we take there should be deliberate. It'll be
> > > a bit clearer when I post the patch that special cases this.
> >
> > The line I see here is mostly derived from performance expectations.
> >
> > People and programs expect anon, shmem/tmpfs etc. to be fast and avoid
> > their reclaim at great costs, so they size this part of their workload
> > according to memory size and locality. Filesystem cache (on-disk) on
> > the other hand is expected to be slow on the first fault and after it
> > has been displaced by other data, but the kernel is mostly expected to
> > maximize the caching effects in a predictable manner.
> >
>
> Part of their performance expectations is that memory referenced from the
> local node will be allocated locally. Consider NUMA-aware applications that
> partition their data usage appropriately and share that data between threads
> using processes and shared memory (some MPI implementations). They have
> an expectation that the memory will be local and a further expectation
> that it will not be reclaimed because they sized it appropriately.
> Automatically interleaving such memory by default will be surprising to
> NUMA-aware applications even if NUMA-oblivious applications benefit.
That's exactly why I want to exclude any type of data that is
typically sized to memory capacity. Are we talking past each other?
> Similarly, the pagecache sysctl is documented to affect files, at least
> that's how I wrote it. It's inconsistent to explain that as "the sysctl
> controls files, except for tmpfs ones because ...... whatever".
I documented it as affecting secondary storage cache.
> > The round-robin policy makes the displacement predictable (think of
> > the aging artifacts here where random pages do not get displaced
> > reliably because they ended up on remote nodes) and it avoids IO by
> > maximizing memory utilization.
> >
> > I.e. it improves behavior associated with a cache, but I don't expect
> > shmem/tmpfs to be typically used as a disk cache. I could be wrong
> > about that, but I figure if you need named shared memory that is
> > bigger than your memory capacity (the point where your tmpfs would
> > actually turn into a disk cache), you'd be better off using a more
> > efficient on-disk filesystem.
>
> I am concerned with semantics like "all files except tmpfs files" or
> alternatively regressing performance of NUMA-aware applications and their
> use of MAP_SHARED and sysv.
I'm really not following. MAP_SHARED, sysv, shmem, tmpfs, whatever, is
entirely unaffected by my proposal. I never claimed "all files except
tmpfs". It's about what backs the data, which is what makes a difference
in people's performance expectations, which makes a difference in how
they size the workloads.
Tmpfs files that may overflow into swap on heavy memory pressure have
an entirely different trade-off than actual cache that is continuously
replaced as part of its size management, and in that sense they are
much closer to anon and sysv shared memory. I don't believe that the
difference between virtual in-core filesystems and actual secondary
storage filesystems is so obscure to users that this behavioral
difference would violate expectations of the term "file".
Is that what you are saying or am I missing something?
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-17 22:57 ` Johannes Weiner
@ 2013-12-17 23:24 ` Mel Gorman
1 sibling, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-17 23:24 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 05:57:16PM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 09:22:16PM +0000, Mel Gorman wrote:
> > On Tue, Dec 17, 2013 at 12:43:02PM -0500, Johannes Weiner wrote:
> > > > > > When looking at this closer I found that sysv is a weird exception. It's
> > > > > > file-backed as far as most of the VM is concerned but looks anonymous to
> > > > > > most applications that care. That and MAP_SHARED anonymous pages should
> > > > > > not be treated like files but we still want tmpfs to be treated as
> > > > > > files. Details will be in the changelog of the next series.
> > > > >
> > > > > In what sense is it seen as file-backed?
> > > >
> > > > sysv and anonymous pages are backed by an internal shmem mount point. In
> > > > lots of respects, it looks like a file and quacks like a file, but I expect
> > > > developers think of it as anonymous, and chunks of the VM treat it like
> > > > it's anonymous. tmpfs uses the same paths and is treated by the VM much
> > > > like anon, but users may think that tmpfs should be subject to the
> > > > fair allocation zone policy "because they're files." It's a sufficiently
> > > > weird case that any action we take there should be deliberate. It'll be
> > > > a bit clearer when I post the patch that special cases this.
> > >
> > > The line I see here is mostly derived from performance expectations.
> > >
> > > People and programs expect anon, shmem/tmpfs etc. to be fast and avoid
> > > their reclaim at great costs, so they size this part of their workload
> > > according to memory size and locality. Filesystem cache (on-disk) on
> > > the other hand is expected to be slow on the first fault and after it
> > > has been displaced by other data, but the kernel is mostly expected to
> > > maximize the caching effects in a predictable manner.
> > >
> >
> > Part of their performance expectations is that memory referenced from the
> > local node will be allocated locally. Consider NUMA-aware applications that
> > partition their data usage appropriately and share that data between threads
> > using processes and shared memory (some MPI implementations). They have
> > an expectation that the memory will be local and a further expectation
> > that it will not be reclaimed because they sized it appropriately.
> > Automatically interleaving such memory by default will be surprising to
> > NUMA aware applications even if NUMA-oblivious applications benefit.
>
> That's exactly why I want to exclude any type of data that is
> typically sized to memory capacity. Are we talking past each other?
>
No, we're not, but I'm concerned that your treatment of shmem ends up
being inconsistent. As I see it, your proposal leaves two choices:
a) leave it alone. We get proper behaviour for MAP_SHARED anonymous and
   sysv, but tmpfs is different from every other filesystem
b) interleave shmem. tmpfs is consistent with other filesystems, but
   MAP_SHARED anonymous and sysv are surprising
> > Similarly, the pagecache sysctl is documented to affect files, at least
> > that's how I wrote it. It's inconsistent to explain that as "the sysctl
> > controls files, except for tmpfs ones because ...... whatever".
>
> I documented it as affecting the secondary storage cache.
>
That is very subtle and a bit weird to me. Arguably tmpfs is also backed
by secondary storage, where the storage happens to be swap. It's still "files
except for tmpfs files because they're special". That's why I'm
uncomfortable with it.
> > > The round-robin policy makes the displacement predictable (think of
> > > the aging artifacts here where random pages do not get displaced
> > > reliably because they ended up on remote nodes) and it avoids IO by
> > > maximizing memory utilization.
> > >
> > > I.e. it improves behavior associated with a cache, but I don't expect
> > > shmem/tmpfs to be typically used as a disk cache. I could be wrong
> > > about that, but I figure if you need named shared memory that is
> > > bigger than your memory capacity (the point where your tmpfs would
> > > actually turn into a disk cache), you'd be better off using a more
> > > efficient on-disk filesystem.
> >
> > I am concerned with semantics like "all files except tmpfs files" or
> > alternatively regressing performance of NUMA-aware applications and their
> > use of MAP_SHARED and sysv.
>
> I'm really not following. MAP_SHARED, sysv, shmem, tmpfs, whatever is
> entirely unaffected by my proposal.
I understand; it's tmpfs being different from every other filesystem that
I'm not happy with. From a VM perspective it makes some sense, but from a
user perspective it just looks weird.
> I never claimed "all files except
> tmpfs". It's about what backs the data, which is what makes a difference
> in people's performance expectation, which makes a difference in how
> they size the workloads.
>
So potentially applications have to stat the file they are mapping if they
want to understand what memory policy applies.
> Tmpfs files that may overflow into swap on heavy memory pressure have
> an entirely different trade-off than actual cache that is continuously
> replaced as part of its size management, and in that sense they are
> much closer to anon and sysv shared memory.
Again, from a VM perspective I understand what you're suggesting, but from
the perspective of an application that is mapping files, it's a tricky
interface.
In terms of restoring historical behaviour in 3.12 and 3.13, I think my
approach is the more conservative and the least surprising to users. We can
bash out whether to default to remote interleaving or to special-case tmpfs
in 3.14.
--
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6
2013-12-17 21:23 ` Mel Gorman
@ 2013-12-21 16:03 ` Zlatko Calusic
1 sibling, 0 replies; 84+ messages in thread
From: Zlatko Calusic @ 2013-12-21 16:03 UTC (permalink / raw)
To: Mel Gorman
Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Rik van Riel,
Linux-MM, LKML
On 17.12.2013 22:23, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 04:07:35PM +0100, Zlatko Calusic wrote:
>> On 13.12.2013 15:10, Mel Gorman wrote:
>>> Kicked this another bit today. It's still a bit half-baked but it restores
>>> the historical performance and leaves the door open at the end for playing
>>> nice with distributing file pages between nodes. Finishing this series
>>> depends on whether we are going to make the remote node behaviour of the
>>> fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
>>> favour of the configurable option because the default can be redefined and
>>> tested while giving users a "compat" mode if we discover the new default
>>> behaviour sucks for some workload.
>>>
>>
>> I'll start a 5-day test of this patchset in a few hours, unless you
>> can send an updated one in the meantime. I intend to test it on a
>> rather boring 4GB x86_64 machine that before Johannes' work had lots
>> of trouble balancing zones. Would you recommend to use the default
>> settings, i.e. don't mess with tunables at this point?
>>
>
> For me at least I would prefer you tested v3 of the series with the
> default settings of not interleaving file-backed pages on remote nodes
> by default. Johannes might request testing with that knob enabled if the
> machine is NUMA although I doubt it is with 4G of RAM.
>
Tested v3 on a UMA machine with the default settings. I see no regressions,
no issues whatsoever. From what I understand, this whole series is about
fixing issues noticed on NUMA, so I wish you good luck with that (no such
hardware here). Just be extra careful not to disturb the finally very well
balanced MM on more common machines (especially those equipped with 4GB of
RAM). And once again, thank you Johannes for your work; you did a great job.
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
--
Zlatko
* Re: [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6
2013-12-21 16:03 ` Zlatko Calusic
@ 2013-12-23 10:26 ` Mel Gorman
1 sibling, 0 replies; 84+ messages in thread
From: Mel Gorman @ 2013-12-23 10:26 UTC (permalink / raw)
To: Zlatko Calusic
Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Rik van Riel,
Linux-MM, LKML
On Sat, Dec 21, 2013 at 05:03:43PM +0100, Zlatko Calusic wrote:
> On 17.12.2013 22:23, Mel Gorman wrote:
> >On Tue, Dec 17, 2013 at 04:07:35PM +0100, Zlatko Calusic wrote:
> >>On 13.12.2013 15:10, Mel Gorman wrote:
> >>>Kicked this another bit today. It's still a bit half-baked but it restores
> >>>the historical performance and leaves the door open at the end for playing
> >>>nice with distributing file pages between nodes. Finishing this series
> >>>depends on whether we are going to make the remote node behaviour of the
> >>>fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
> >>>favour of the configurable option because the default can be redefined and
> >>>tested while giving users a "compat" mode if we discover the new default
> >>>behaviour sucks for some workload.
> >>>
> >>
> >>I'll start a 5-day test of this patchset in a few hours, unless you
> >>can send an updated one in the meantime. I intend to test it on a
> >>rather boring 4GB x86_64 machine that before Johannes' work had lots
> >>of trouble balancing zones. Would you recommend to use the default
> >>settings, i.e. don't mess with tunables at this point?
> >>
> >
> >For me at least I would prefer you tested v3 of the series with the
> >default settings of not interleaving file-backed pages on remote nodes
> >by default. Johannes might request testing with that knob enabled if the
> >machine is NUMA although I doubt it is with 4G of RAM.
> >
>
> Tested v3 on UMA machine, with default setting. I see no regression,
> no issues whatsoever. From what I understand, this whole series is
> about fixing issues noticed on NUMA, so I wish you good luck with
> that (no such hardware here). Just be extra careful not to disturb
> finally very well balanced MM on more common machines (and
> especially those equipped with 4GB RAM). And once again thank you
> Johannes for your work, you did a great job.
>
> Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Thanks for testing. Even though this patch is about NUMA, it preserves
the fair zone allocation policy on UMA that your workload depends upon.
--
Mel Gorman
SUSE Labs
end of thread, other threads:[~2013-12-23 10:26 UTC | newest]
Thread overview: 84+ messages
2013-12-13 14:10 [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6 Mel Gorman
2013-12-13 14:10 ` [PATCH 1/7] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
2013-12-13 15:45 ` Rik van Riel
2013-12-13 14:10 ` [PATCH 2/7] mm: page_alloc: Break out zone page aging distribution into its own helper Mel Gorman
2013-12-13 15:46 ` Rik van Riel
2013-12-16 20:16 ` Johannes Weiner
2013-12-13 14:10 ` [PATCH 3/7] mm: page_alloc: Use zone node IDs to approximate locality Mel Gorman
2013-12-16 13:20 ` Rik van Riel
2013-12-16 20:25 ` Johannes Weiner
2013-12-17 11:13 ` Mel Gorman
2013-12-17 15:38 ` Johannes Weiner
2013-12-17 16:08 ` Mel Gorman
2013-12-17 20:11 ` Johannes Weiner
2013-12-17 21:03 ` Mel Gorman
2013-12-17 22:31 ` Johannes Weiner
2013-12-13 14:10 ` [PATCH 4/7] mm: Annotate page cache allocations Mel Gorman
2013-12-16 15:20 ` Rik van Riel
2013-12-13 14:10 ` [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable Mel Gorman
2013-12-16 19:25 ` Rik van Riel
2013-12-16 20:42 ` Johannes Weiner
2013-12-17 15:29 ` Mel Gorman
2013-12-17 15:54 ` Johannes Weiner
2013-12-17 16:14 ` Mel Gorman
2013-12-17 17:43 ` Johannes Weiner
2013-12-17 21:22 ` Mel Gorman
2013-12-17 22:57 ` Johannes Weiner
2013-12-17 23:24 ` Mel Gorman
2013-12-13 14:10 ` [PATCH 6/7] mm: page_alloc: Only account batch allocations requests that are eligible Mel Gorman
2013-12-16 20:52 ` Johannes Weiner
2013-12-17 11:20 ` Mel Gorman
2013-12-17 15:43 ` Johannes Weiner
2013-12-17 16:06 ` Mel Gorman
2013-12-13 14:10 ` [PATCH 7/7] mm: page_alloc: Default allow file pages to use remote nodes for fair allocation policy Mel Gorman
2013-12-13 17:04 ` Johannes Weiner
2013-12-13 19:20 ` Mel Gorman
2013-12-13 22:15 ` Johannes Weiner
2013-12-17 16:04 ` Mel Gorman
2013-12-16 19:26 ` Rik van Riel
2013-12-17 15:07 ` [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6 Zlatko Calusic
2013-12-17 21:23 ` Mel Gorman
2013-12-21 16:03 ` Zlatko Calusic
2013-12-23 10:26 ` Mel Gorman