From: SeongJae Park <sjpark@amazon.de>

DAMON can be used as a primitive for data access-aware memory management optimizations. That said, users who want such optimizations should run DAMON, read the monitoring results, analyze them, plan a new memory management scheme, and apply the new scheme by themselves. Such efforts will be inevitable for some complicated optimizations. However, in many other cases, the users could simply want the system to apply a memory management action to a memory region of a specific size having a specific access frequency for a specific time. For example, "page out a memory region larger than 100 MiB that has kept only rare accesses for more than 2 minutes", or "do not use THP for a memory region larger than 2 MiB that has been rarely accessed for more than 1 second".

This RFC patchset makes DAMON handle such data access monitoring-based operation schemes. With this change, users can do data access-aware optimizations by simply specifying their schemes to DAMON.

Evaluations
===========

Efficient THP
-------------

The Transparent Huge Pages (THP) subsystem can waste memory space in some cases because it aggressively promotes regular pages to huge pages. For this reason, use of THP is prohibited by a number of memory-intensive programs such as Redis[1] and MongoDB[2]. The two simple data access monitoring-based operation schemes below might be helpful for the problem:

    # format: <min/max size> <min/max frequency (0-100)> <min/max age> <action>

    # If a memory region larger than 2 MiB is showing an access rate higher
    # than 5%, apply MADV_HUGEPAGE to the region.
    2M null 5 null null null hugepage

    # If a memory region larger than 2 MiB is showing an access rate lower
    # than 5% for more than 1 second, apply MADV_NOHUGEPAGE to the region.
    2M null null 5 1s null nohugepage

We can expect the schemes to reduce the memory space overhead while preserving some of the performance benefit of THP. I call these schemes Efficient THP (ETHP).
Please note that these schemes are neither highly tuned nor meant for general use cases. They are made with my straightforward instinct, only for a demonstration of DAMOS.

Setup
-----

On my personal QEMU/KVM based virtual machine on an Intel i7 host machine running Ubuntu 18.04, I measure the runtime and consumed memory space of various realistic workloads with several configurations. I use 13 and 12 workloads in the PARSEC3[3] and SPLASH-2X[4] benchmark suites, respectively. I use wrapper scripts[5] for setup and run of the workloads. For the measurement of the amount of consumed memory in system global scope, I drop caches before starting each of the workloads and monitor 'MemFree' in the '/proc/meminfo' file.

The configurations I use are as below:

    orig: Linux v5.5 with 'madvise' THP policy
    thp:  Linux v5.5 with 'always' THP policy
    ethp: Linux v5.5 applying the above schemes

To minimize measurement errors, I repeat each run 5 times and average the results. You can find the stdev, min, and max of the numbers among the repeated runs in the appendix below.

[1] "Redis latency problems troubleshooting", https://redis.io/topics/latency
[2] "Disable Transparent Huge Pages (THP)", https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
[3] "The PARSEC Benchmark Suite", https://parsec.cs.princeton.edu/index.htm
[4] "SPLASH-2x", https://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x
[5] "parsec3_on_ubuntu", https://github.com/sjp38/parsec3_on_ubuntu

Results
-------

TL;DR: 'ethp' removes 97.61% of 'thp' memory space overhead while preserving 25.40% (up to 88.36%) of 'thp' performance improvement in total.

The following sections show the results of the measurements with the raw numbers and the 'orig'-relative overheads (percent) of each configuration.

Memory Space Overheads
~~~~~~~~~~~~~~~~~~~~~~

Below are the measured memory space overheads. Raw numbers are in KiB, and the overheads in parentheses are in percent.
For example, 'parsec3/blackscholes' consumes about 1.819 GiB and 1.824 GiB with the 'orig' and 'thp' configurations, respectively. The overhead of 'thp' compared to 'orig' for the workload is 0.3%.

    workloads                       orig            thp (overhead)          ethp (overhead)
    parsec3/blackscholes     1819486.000    1824921.400 (  0.30)    1829070.600 (  0.53)
    parsec3/bodytrack        1417885.800    1417077.600 ( -0.06)    1427560.800 (  0.68)
    parsec3/canneal          1043876.800    1039773.000 ( -0.39)    1048445.200 (  0.44)
    parsec3/dedup            2400000.400    2434625.600 (  1.44)    2417374.400 (  0.72)
    parsec3/facesim           540206.400     542422.400 (  0.41)     551485.400 (  2.09)
    parsec3/ferret            320480.200     320157.000 ( -0.10)     331470.400 (  3.43)
    parsec3/fluidanimate      573961.400     572329.600 ( -0.28)     581836.000 (  1.37)
    parsec3/freqmine          983981.200     994839.600 (  1.10)     996124.600 (  1.23)
    parsec3/raytrace         1745175.200    1742756.400 ( -0.14)    1751706.000 (  0.37)
    parsec3/streamcluster     120558.800     120309.800 ( -0.21)     131997.800 (  9.49)
    parsec3/swaptions          14820.400      23388.800 ( 57.81)      24698.000 ( 66.65)
    parsec3/vips             2956319.200    2955803.600 ( -0.02)    2977506.200 (  0.72)
    parsec3/x264             3187699.000    3184944.000 ( -0.09)    3198462.800 (  0.34)
    splash2x/barnes          1212774.800    1221892.400 (  0.75)    1212100.800 ( -0.06)
    splash2x/fft             9364725.000    9267074.000 ( -1.04)    8997901.200 ( -3.92)
    splash2x/lu_cb            515242.400     519881.400 (  0.90)     526621.600 (  2.21)
    splash2x/lu_ncb           517308.000     520396.400 (  0.60)     521732.400 (  0.86)
    splash2x/ocean_cp        3348189.400    3380799.400 (  0.97)    3328473.400 ( -0.59)
    splash2x/ocean_ncp       3908599.800    7072076.800 ( 80.94)    4449410.400 ( 13.84)
    splash2x/radiosity       1469087.800    1482244.400 (  0.90)    1471781.000 (  0.18)
    splash2x/radix           1712487.400    1385972.800 (-19.07)    1420461.800 (-17.05)
    splash2x/raytrace          45030.600      50946.600 ( 13.14)      58586.200 ( 30.10)
    splash2x/volrend          151037.800     151188.000 (  0.10)     163213.600 (  8.06)
    splash2x/water_nsquared    47442.400      47257.000 ( -0.39)      59285.800 ( 24.96)
    splash2x/water_spatial    667355.200     666824.400 ( -0.08)     673274.400 (  0.89)
    total                   40083800.000   42939900.000 (  7.13)   40150600.000 (  0.17)

In total, 'thp' shows 7.13% memory space overhead while 'ethp' shows only 0.17% overhead. In other words, 'ethp' removes 97.61% of the 'thp' memory space overhead.

For almost every workload, 'ethp' consistently shows about 10-15 MiB of memory space overhead, mainly due to the python wrapper I used for convenient test runs. Using DAMON's raw interface would remove this overhead.

In the case of 'parsec3/swaptions' and 'splash2x/raytrace', 'ethp' shows even higher memory space overhead. This is mainly due to the small size of these workloads combined with the constant memory overhead of 'ethp', which comes from the python wrapper. The workloads consume only about 14 MiB and 45 MiB, respectively. Because the constant memory consumption of the python wrapper (about 10-15 MiB) is large relative to such small working sets, the relative overhead becomes high. Nonetheless, such small workloads are not appropriate targets of 'ethp', and the overhead can be removed by avoiding use of the wrapper.

Runtime Overheads
~~~~~~~~~~~~~~~~~

Below are the measured runtimes, shown in a similar way. The raw numbers are in seconds, and the overheads are in percent. Negative runtime overheads mean speedups.
    runtime                     orig         thp (overhead)      ethp (overhead)
    parsec3/blackscholes     107.003     106.468 ( -0.50)     107.260 (  0.24)
    parsec3/bodytrack         78.854      78.757 ( -0.12)      79.261 (  0.52)
    parsec3/canneal          137.520     120.854 (-12.12)     132.427 ( -3.70)
    parsec3/dedup             11.873      11.665 ( -1.76)      11.883 (  0.09)
    parsec3/facesim          207.895     204.215 ( -1.77)     206.170 ( -0.83)
    parsec3/ferret           190.507     189.972 ( -0.28)     190.818 (  0.16)
    parsec3/fluidanimate     211.064     208.862 ( -1.04)     211.874 (  0.38)
    parsec3/freqmine         290.157     288.831 ( -0.46)     292.495 (  0.81)
    parsec3/raytrace         118.460     118.741 (  0.24)     119.808 (  1.14)
    parsec3/streamcluster    324.524     283.709 (-12.58)     307.209 ( -5.34)
    parsec3/swaptions        154.458     154.894 (  0.28)     155.307 (  0.55)
    parsec3/vips              58.588      58.622 (  0.06)      59.037 (  0.77)
    parsec3/x264              66.493      66.604 (  0.17)      67.051 (  0.84)
    splash2x/barnes           79.769      73.886 ( -7.38)      78.737 ( -1.29)
    splash2x/fft              32.857      22.960 (-30.12)      25.808 (-21.45)
    splash2x/lu_cb            85.113      84.939 ( -0.20)      85.344 (  0.27)
    splash2x/lu_ncb           92.408      90.103 ( -2.49)      93.585 (  1.27)
    splash2x/ocean_cp         44.374      42.876 ( -3.37)      43.613 ( -1.71)
    splash2x/ocean_ncp        80.710      51.831 (-35.78)      71.498 (-11.41)
    splash2x/radiosity        90.626      90.398 ( -0.25)      91.238 (  0.68)
    splash2x/radix            30.875      25.226 (-18.30)      25.882 (-16.17)
    splash2x/raytrace         84.114      82.602 ( -1.80)      85.124 (  1.20)
    splash2x/volrend          86.796      86.347 ( -0.52)      88.223 (  1.64)
    splash2x/water_nsquared  230.781     220.667 ( -4.38)     232.664 (  0.82)
    splash2x/water_spatial    88.719      90.187 (  1.65)      89.228 (  0.57)
    total                   2984.530    2854.220 ( -4.37)    2951.540 ( -1.11)

In total, 'thp' shows 4.37% speedup while 'ethp' shows 1.11% speedup. In other words, 'ethp' preserves about 25.40% of the THP performance benefit. In the best case (splash2x/radix), 'ethp' preserves 88.36% of the benefit.

If we narrow down to the workloads showing high THP performance benefits (splash2x/fft, splash2x/ocean_ncp, and splash2x/radix), 'thp' and 'ethp' show 30.75% and 14.71% speedup in total, respectively. In other words, 'ethp' preserves about 47.83% of the benefit.
Even in the worst case (splash2x/volrend), 'ethp' incurs only 1.64% runtime overhead, which is similar to that of 'thp' (1.65%, for 'splash2x/water_spatial').

Sequence Of Patches
===================

The patches are based on v5.5 plus the v5 DAMON patchset[1] and Minchan's ``madvise()`` factor-out patch[2]. Minchan's patch was necessary for the reuse of the ``madvise()`` code in DAMON. You can also clone the complete git tree:

    $ git clone git://github.com/sjp38/linux -b damos/rfc/v4

The web is also available:
https://github.com/sjp38/linux/releases/tag/damos/rfc/v4

[1] https://lore.kernel.org/linux-mm/20200217103110.30817-1-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-mm/20200128001641.5086-2-minchan@kernel.org/

The first patch allows DAMON to reuse the ``madvise()`` code for the actions. The second patch accounts the age of each region. The third patch implements the handling of the schemes in DAMON and exports a kernel space programming interface for it. The fourth patch implements a debugfs interface for privileged users and programs. The fifth and sixth patches add kunit tests and selftests for these changes, respectively. Finally, the seventh patch modifies the user space tool for DAMON to support describing and applying schemes in a human friendly way.
Patch History
=============

Changes from RFC v3
(https://lore.kernel.org/linux-mm/20200225102300.23895-1-sjpark@amazon.com/)
 - Add Reviewed-by from Brendan Higgins
 - Code cleanup: Modularize madvise() call
 - Fix a trivial bug in the wrapper python script
 - Add more stable and detailed evaluation results with updated ETHP scheme

Changes from RFC v2
(https://lore.kernel.org/linux-mm/20200218085309.18346-1-sjpark@amazon.com/)
 - Fix the aging mechanism for better 'old region' selection
 - Add more kunit tests and kselftests for this patchset
 - Support more human friendly description and application of 'schemes'

Changes from RFC v1
(https://lore.kernel.org/linux-mm/20200210150921.32482-1-sjpark@amazon.com/)
 - Properly adjust age accounting related properties after splitting, merging, and applying actions

SeongJae Park (7):
  mm/madvise: Export madvise_common() to mm internal code
  mm/damon: Account age of target regions
  mm/damon: Implement data access monitoring-based operation schemes
  mm/damon/schemes: Implement a debugfs interface
  mm/damon-test: Add kunit test case for regions age accounting
  mm/damon/selftests: Add 'schemes' debugfs tests
  damon/tools: Support more human friendly 'schemes' control

 include/linux/damon.h                         |  29 ++
 mm/damon-test.h                               |   5 +
 mm/damon.c                                    | 391 +++++++++++++++++-
 mm/internal.h                                 |   4 +
 mm/madvise.c                                  |   3 +-
 tools/damon/_convert_damos.py                 | 125 ++++++
 tools/damon/_damon.py                         | 143 +++++++
 tools/damon/damo                              |   7 +
 tools/damon/record.py                         | 135 +-----
 tools/damon/schemes.py                        | 105 +++++
 .../testing/selftests/damon/debugfs_attrs.sh  |  29 ++
 11 files changed, 845 insertions(+), 131 deletions(-)
 create mode 100755 tools/damon/_convert_damos.py
 create mode 100644 tools/damon/_damon.py
 create mode 100644 tools/damon/schemes.py

-- 
2.17.1

==================================== >8 =======================================
Appendix: Stdev / min / max numbers among the repeated runs
===========================================================

Below are the stdev/min/max of each number in
the 5 repeated runs.

    runtime_stdev               orig        thp       ethp
    parsec3/blackscholes       0.884      0.932      0.693
    parsec3/bodytrack          0.672      0.501      0.470
    parsec3/canneal            3.434      1.278      4.112
    parsec3/dedup              0.074      0.032      0.070
    parsec3/facesim            1.079      0.572      0.688
    parsec3/ferret             1.674      0.498      0.801
    parsec3/fluidanimate       1.422      1.804      1.273
    parsec3/freqmine           2.285      2.735      3.852
    parsec3/raytrace           1.240      0.821      1.407
    parsec3/streamcluster      2.226      2.221      2.778
    parsec3/swaptions          1.760      2.164      1.650
    parsec3/vips               0.071      0.113      0.433
    parsec3/x264               4.972      4.732      5.464
    splash2x/barnes            0.149      0.434      0.944
    splash2x/fft               0.186      0.074      2.053
    splash2x/lu_cb             0.358      0.674      0.054
    splash2x/lu_ncb            0.694      0.586      0.301
    splash2x/ocean_cp          0.214      0.181      0.163
    splash2x/ocean_ncp         0.738      0.574      5.860
    splash2x/radiosity         0.447      0.786      0.493
    splash2x/radix             0.183      0.195      0.250
    splash2x/raytrace          0.869      1.248      1.071
    splash2x/volrend           0.896      0.801      0.759
    splash2x/water_nsquared    3.050      3.032      1.750
    splash2x/water_spatial     0.497      1.607      0.665

    memused.avg_stdev           orig         thp        ethp
    parsec3/blackscholes     6837.158    4942.183    5531.310
    parsec3/bodytrack        5591.783    5771.259    3959.415
    parsec3/canneal          4034.250    5205.223    3294.782
    parsec3/dedup           56582.594   12462.196   49390.950
    parsec3/facesim          1879.070    3572.512    2407.374
    parsec3/ferret           1686.811    4110.648    3050.263
    parsec3/fluidanimate     5252.273    3550.694    3577.428
    parsec3/freqmine         2634.481   12225.383    2220.963
    parsec3/raytrace         5652.660    5615.677    4645.947
    parsec3/streamcluster    2296.864    1906.081    2189.578
    parsec3/swaptions        1100.155   18202.456    1689.923
    parsec3/vips             5260.607    9104.494    2508.632
    parsec3/x264            14892.433   18097.263   16853.532
    splash2x/barnes          3055.563    2552.379    3749.773
    splash2x/fft           115636.847   18058.645  193864.925
    splash2x/lu_cb           2266.989    2495.620    9615.377
    splash2x/lu_ncb          4816.990    3106.290    3406.873
    splash2x/ocean_cp        5597.264    2189.592   40420.686
    splash2x/ocean_ncp       6962.524    5038.039  352254.041
    splash2x/radiosity       6151.433    1561.840    6976.647
    splash2x/radix          12938.174    4141.470   64272.890
    splash2x/raytrace         912.177    1473.169    1812.460
    splash2x/volrend         1866.708    1527.107    1881.400
    splash2x/water_nsquared  2126.581    4481.707    2471.129
    splash2x/water_spatial   1495.886    3564.505    3182.864

    runtime_min                 orig        thp       ethp
    parsec3/blackscholes     106.073    105.724    106.799
    parsec3/bodytrack         78.361     78.327     78.994
    parsec3/canneal          130.735    118.456    125.902
    parsec3/dedup             11.816     11.631     11.781
    parsec3/facesim          206.358    203.462    205.526
    parsec3/ferret           189.118    189.461    190.130
    parsec3/fluidanimate     209.879    207.381    210.656
    parsec3/freqmine         287.349    285.988    288.519
    parsec3/raytrace         117.320    118.014    118.021
    parsec3/streamcluster    322.404    280.907    304.489
    parsec3/swaptions        153.017    153.133    154.307
    parsec3/vips              58.480     58.518     58.496
    parsec3/x264              61.569     61.987     62.333
    splash2x/barnes           79.595     73.170     77.782
    splash2x/fft              32.588     22.838     24.391
    splash2x/lu_cb            84.897     84.229     85.300
    splash2x/lu_ncb           91.640     89.480     93.192
    splash2x/ocean_cp         44.216     42.661     43.403
    splash2x/ocean_ncp        79.912     50.717     63.298
    splash2x/radiosity        90.332     89.911     90.786
    splash2x/radix            30.617     25.012     25.569
    splash2x/raytrace         82.972     81.291     83.608
    splash2x/volrend          86.205     85.414     86.772
    splash2x/water_nsquared  228.749    216.488    230.019
    splash2x/water_spatial    88.326     88.636     88.469

    memused.avg_min             orig           thp          ethp
    parsec3/blackscholes    1809578.000   1815893.000   1821555.000
    parsec3/bodytrack       1407270.000   1408774.000   1422950.000
    parsec3/canneal         1037996.000   1029491.000   1042278.000
    parsec3/dedup           2290578.000   2419128.000   2322004.000
    parsec3/facesim          536908.000    539368.000    548194.000
    parsec3/ferret           317173.000    313275.000    325452.000
    parsec3/fluidanimate     566148.000    566925.000    578031.000
    parsec3/freqmine         979565.000    985279.000    992844.000
    parsec3/raytrace        1737270.000   1735498.000   1745751.000
    parsec3/streamcluster    117213.000    118264.000    127825.000
    parsec3/swaptions         13012.000     10753.000     21858.000
    parsec3/vips            2946474.000   2941690.000   2975157.000
    parsec3/x264            3171581.000   3170872.000   3184577.000
    splash2x/barnes         1208476.000   1218535.000   1205510.000
    splash2x/fft            9160132.000   9250818.000   8835513.000
    splash2x/lu_cb           511850.000    515668.000    519205.000
    splash2x/lu_ncb          512127.000    514471.000    518500.000
    splash2x/ocean_cp       3342506.000   3377932.000   3290066.000
    splash2x/ocean_ncp      3901749.000   7063386.000   3962171.000
    splash2x/radiosity      1457419.000   1479232.000   1467156.000
    splash2x/radix          1690840.000   1380921.000   1344838.000
    splash2x/raytrace         43518.000     48571.000     55468.000
    splash2x/volrend         147356.000    148650.000    159562.000
    splash2x/water_nsquared   43685.000     38495.000     54409.000
    splash2x/water_spatial   665912.000    660742.000    669843.000

    runtime_max                 orig        thp       ethp
    parsec3/blackscholes     108.322    108.141    108.641
    parsec3/bodytrack         80.166     79.687     80.200
    parsec3/canneal          140.219    122.073    137.615
    parsec3/dedup             12.014     11.723     12.000
    parsec3/facesim          209.291    205.234    207.192
    parsec3/ferret           193.589    190.830    192.235
    parsec3/fluidanimate     213.730    212.390    213.867
    parsec3/freqmine         293.634    292.283    299.323
    parsec3/raytrace         120.096    120.346    121.437
    parsec3/streamcluster    327.827    287.094    311.657
    parsec3/swaptions        157.661    158.341    158.589
    parsec3/vips              58.648     58.815     59.611
    parsec3/x264              73.389     73.856     75.369
    splash2x/barnes           79.975     74.413     80.244
    splash2x/fft              33.168     23.043     29.852
    splash2x/lu_cb            85.825     85.914     85.446
    splash2x/lu_ncb           93.717     91.074     93.902
    splash2x/ocean_cp         44.789     43.190     43.882
    splash2x/ocean_ncp        81.981     52.296     80.782
    splash2x/radiosity        91.509     91.966     92.180
    splash2x/radix            31.130     25.546     26.299
    splash2x/raytrace         85.347     84.163     86.881
    splash2x/volrend          88.575     87.389     88.957
    splash2x/water_nsquared  236.851    224.982    235.537
    splash2x/water_spatial    89.689     92.978     90.276

    memused.avg_max             orig           thp          ethp
    parsec3/blackscholes    1827350.000   1830922.000   1836584.000
    parsec3/bodytrack       1423070.000   1422588.000   1434832.000
    parsec3/canneal         1048155.000   1043151.000   1051713.000
    parsec3/dedup           2446661.000   2452237.000   2459532.000
    parsec3/facesim          542340.000    547457.000    554321.000
    parsec3/ferret           321678.000    325083.000    333474.000
    parsec3/fluidanimate     579067.000    576389.000    587029.000
    parsec3/freqmine         986759.000   1018980.000    998800.000
    parsec3/raytrace        1750980.000   1749291.000   1757761.000
    parsec3/streamcluster    123761.000    122647.000    133602.000
    parsec3/swaptions         16305.000     59605.000     26835.000
    parsec3/vips            2961299.000   2964746.000   2982101.000
    parsec3/x264            3209871.000   3219818.000   3230036.000
    splash2x/barnes         1217047.000   1224832.000   1215995.000
    splash2x/fft            9505048.000   9302095.000   9378025.000
    splash2x/lu_cb           518393.000    522739.000    545540.000
    splash2x/lu_ncb          526380.000    522996.000    528341.000
    splash2x/ocean_cp       3358820.000   3384581.000   3383533.000
    splash2x/ocean_ncp      3920669.000   7079011.000   4937246.000
    splash2x/radiosity      1474991.000   1483739.000   1485635.000
    splash2x/radix          1731625.000   1393183.000   1498907.000
    splash2x/raytrace         46122.000     52292.000     61116.000
    splash2x/volrend         152488.000    153180.000    164793.000
    splash2x/water_nsquared   49449.000     50555.000     60859.000
    splash2x/water_spatial   669943.000    669815.000    679012.000
From: SeongJae Park <sjpark@amazon.de>

This commit exports ``madvise_common()`` to ``mm/`` internal code for future reuse.

Signed-off-by: SeongJae Park <sjpark@amazon.de>
---
 mm/internal.h | 4 ++++
 mm/madvise.c  | 3 ++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3cf20ab3ca01..dcdfe00e02ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -576,4 +576,8 @@ static inline bool is_migrate_highatomic_page(struct page *page)
 void setup_zone_pageset(struct zone *zone);
 extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
+
+
+int madvise_common(struct task_struct *task, struct mm_struct *mm,
+                unsigned long start, size_t len_in, int behavior);
 #endif  /* __MM_INTERNAL_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 0c901de531e4..4fa9dfc770bc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1005,7 +1005,7 @@ madvise_behavior_valid(int behavior)
  * @task could be a zombie leader if it calls sys_exit so accessing mm_struct
  * via task->mm is prohibited. Please use @mm instead of task->mm.
  */
-static int madvise_common(struct task_struct *task, struct mm_struct *mm,
+int madvise_common(struct task_struct *task, struct mm_struct *mm,
                 unsigned long start, size_t len_in, int behavior)
 {
         unsigned long end, tmp;
@@ -1103,6 +1103,7 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm,
         return error;
 }
+EXPORT_SYMBOL_GPL(madvise_common);
 
 /*
  * The madvise(2) system call.
-- 
2.17.1
From: SeongJae Park <sjpark@amazon.de>

DAMON can be used as a primitive for data access pattern aware memory management optimizations. However, users who want such optimizations should run DAMON, read the monitoring results, analyze them, plan a new memory management scheme, and apply the new scheme by themselves. This would not be too hard, but still requires some level of effort. For complicated optimizations, this effort is inevitable.

That said, in many cases, users would simply want to apply an action to a memory region of a specific size having a specific access frequency for a specific time. For example, "page out a memory region larger than 100 MiB but having a low access frequency for more than 10 minutes", or "use THP for a memory region larger than 2 MiB having a high access frequency for more than 2 seconds". For such optimizations, users would need to first account the age of each region themselves. To reduce this effort, this commit implements simple age accounting for each region in DAMON.

For each aggregation step, DAMON compares the access frequency and the start/end addresses of each region with those from the last aggregation, and resets the age of the region if the change is significant. Otherwise, the age is incremented.
Signed-off-by: SeongJae Park <sjpark@amazon.de>
---
 include/linux/damon.h |  5 +++
 mm/damon.c            | 80 ++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 78785cb88d42..50fbe308590e 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -22,6 +22,11 @@ struct damon_region {
         unsigned long sampling_addr;
         unsigned int nr_accesses;
         struct list_head list;
+
+        unsigned int age;
+        unsigned long last_vm_start;
+        unsigned long last_vm_end;
+        unsigned int last_nr_accesses;
 };
 
 /* Represents a monitoring target task */
diff --git a/mm/damon.c b/mm/damon.c
index ff150ae7532a..c292ddd36c86 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -87,6 +87,10 @@ static struct damon_region *damon_new_region(struct damon_ctx *ctx,
         ret->sampling_addr = damon_rand(ctx, vm_start, vm_end);
         INIT_LIST_HEAD(&ret->list);
 
+        ret->age = 0;
+        ret->last_vm_start = vm_start;
+        ret->last_vm_end = vm_end;
+
         return ret;
 }
 
@@ -600,11 +604,44 @@ static void kdamond_flush_aggregated(struct damon_ctx *c)
                         damon_write_rbuf(c, &r->vm_end, sizeof(r->vm_end));
                         damon_write_rbuf(c, &r->nr_accesses,
                                         sizeof(r->nr_accesses));
+                        r->last_nr_accesses = r->nr_accesses;
                         r->nr_accesses = 0;
                 }
         }
 }
 
+#define diff_of(a, b) (a > b ? a - b : b - a)
+
+/*
+ * Increase or reset the age of the given monitoring target region
+ *
+ * If the area or '->nr_accesses' has changed significantly, reset the '->age'.
+ * Else, increase the age.
+ */
+static void damon_do_count_age(struct damon_region *r, unsigned int threshold)
+{
+        unsigned long sz_threshold = (r->vm_end - r->vm_start) / 5;
+
+        if (diff_of(r->vm_start, r->last_vm_start) +
+                        diff_of(r->vm_end, r->last_vm_end) > sz_threshold)
+                r->age = 0;
+        else if (diff_of(r->nr_accesses, r->last_nr_accesses) > threshold)
+                r->age = 0;
+        else
+                r->age++;
+}
+
+static void kdamond_count_age(struct damon_ctx *c, unsigned int threshold)
+{
+        struct damon_task *t;
+        struct damon_region *r;
+
+        damon_for_each_task(c, t) {
+                damon_for_each_region(r, t)
+                        damon_do_count_age(r, threshold);
+        }
+}
+
 #define sz_damon_region(r) (r->vm_end - r->vm_start)
 
 /*
@@ -613,15 +650,15 @@ static void kdamond_flush_aggregated(struct damon_ctx *c)
 static void damon_merge_two_regions(struct damon_region *l,
                 struct damon_region *r)
 {
-        l->nr_accesses = (l->nr_accesses * sz_damon_region(l) +
-                        r->nr_accesses * sz_damon_region(r)) /
-                        (sz_damon_region(l) + sz_damon_region(r));
+        unsigned long sz_l = sz_damon_region(l), sz_r = sz_damon_region(r);
+
+        l->nr_accesses = (l->nr_accesses * sz_l + r->nr_accesses * sz_r) /
+                        (sz_l + sz_r);
+        l->age = (l->age * sz_l + r->age * sz_r) / (sz_l + sz_r);
         l->vm_end = r->vm_end;
         damon_destroy_region(r);
 }
 
-#define diff_of(a, b) (a > b ? a - b : b - a)
-
 /*
  * Merge adjacent regions having similar access frequencies
  *
@@ -631,17 +668,43 @@ static void damon_merge_two_regions(struct damon_region *l,
 static void damon_merge_regions_of(struct damon_task *t, unsigned int thres)
 {
         struct damon_region *r, *prev = NULL, *next;
+        unsigned long sz_subregion, last_last_vm = 0;
+        unsigned long sz_biggest = 0;   /* size of the biggest subregion */
+        struct region last_biggest;     /* last region of the biggest sub */
 
         damon_for_each_region_safe(r, next, t) {
                 if (!prev || prev->vm_end != r->vm_start)
                         goto next;
                 if (diff_of(prev->nr_accesses, r->nr_accesses) > thres)
                         goto next;
+                if (!sz_biggest) {
+                        sz_biggest = sz_damon_region(prev);
+                        last_biggest.start = prev->last_vm_start;
+                        last_biggest.end = prev->last_vm_end;
+                }
+                if (last_last_vm != r->last_vm_start)
+                        sz_subregion = 0;
+                sz_subregion += sz_damon_region(r);
+                last_last_vm = r->last_vm_start;
+                if (sz_subregion > sz_biggest) {
+                        sz_biggest = sz_subregion;
+                        last_biggest.start = r->last_vm_start;
+                        last_biggest.end = r->last_vm_end;
+                }
                 damon_merge_two_regions(prev, r);
                 continue;
 next:
+                if (sz_biggest) {
+                        sz_biggest = 0;
+                        prev->last_vm_start = last_biggest.start;
+                        prev->last_vm_end = last_biggest.end;
+                }
                 prev = r;
         }
+        if (sz_biggest) {
+                prev->last_vm_start = last_biggest.start;
+                prev->last_vm_end = last_biggest.end;
+        }
 }
 
@@ -674,6 +737,12 @@ static void damon_split_region_at(struct damon_ctx *ctx,
         struct damon_region *new;
 
         new = damon_new_region(ctx, r->vm_start + sz_r, r->vm_end);
+        new->age = r->age;
+        new->last_vm_start = r->vm_start;
+        new->last_nr_accesses = r->last_nr_accesses;
+
+        r->last_vm_start = r->vm_start;
+        r->last_vm_end = r->vm_end;
         r->vm_end = new->vm_start;
 
         damon_add_region(new, r, damon_next_region(r));
@@ -865,6 +934,7 @@ static int kdamond_fn(void *data)
                 if (kdamond_aggregate_interval_passed(ctx)) {
                         kdamond_merge_regions(ctx, max_nr_accesses / 10);
+                        kdamond_count_age(ctx, max_nr_accesses / 10);
                         if (ctx->aggregate_cb)
                                 ctx->aggregate_cb(ctx);
                         kdamond_flush_aggregated(ctx);
-- 
2.17.1
From: SeongJae Park <sjpark@amazon.de>

In many cases, users might use DAMON for simple data access aware memory management optimizations such as applying an operation scheme to a memory region of a specific size having a specific access frequency for a specific time. For example, "page out a memory region larger than 100 MiB but having a low access frequency for more than 10 minutes", or "Use THP for a memory region larger than 2 MiB having a high access frequency for more than 2 seconds".

To save users from spending their time on implementing such simple data access monitoring-based operation schemes, this commit makes DAMON handle such schemes directly. With this commit, users can simply specify their desired schemes to DAMON. Each of the schemes is composed of conditions for filtering the target memory regions and the desired memory management action for the target. Specifically, the format is::

    <min/max size> <min/max access frequency> <min/max age> <action>

The filtering conditions are the size of the memory region, the number of accesses to the region monitored by DAMON, and the age of the region. The age of a region is incremented periodically, but reset when its addresses or access frequency have changed significantly, or when the action of a scheme has been applied. For the action, the current implementation supports only a few madvise() hints: ``MADV_WILLNEED``, ``MADV_COLD``, ``MADV_PAGEOUT``, ``MADV_HUGEPAGE``, and ``MADV_NOHUGEPAGE``.
Signed-off-by: SeongJae Park <sjpark@amazon.de> --- include/linux/damon.h | 24 ++++++++ mm/damon.c | 140 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 164 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index 50fbe308590e..8cb2452579ee 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -36,6 +36,27 @@ struct damon_task { struct list_head list; }; +/* Data Access Monitoring-based Operation Scheme */ +enum damos_action { + DAMOS_WILLNEED, + DAMOS_COLD, + DAMOS_PAGEOUT, + DAMOS_HUGEPAGE, + DAMOS_NOHUGEPAGE, + DAMOS_ACTION_LEN, +}; + +struct damos { + unsigned int min_sz_region; + unsigned int max_sz_region; + unsigned int min_nr_accesses; + unsigned int max_nr_accesses; + unsigned int min_age_region; + unsigned int max_age_region; + enum damos_action action; + struct list_head list; +}; + struct damon_ctx { unsigned long sample_interval; unsigned long aggr_interval; @@ -58,6 +79,7 @@ struct damon_ctx { struct rnd_state rndseed; struct list_head tasks_list; /* 'damon_task' objects */ + struct list_head schemes_list; /* 'damos' objects */ /* callbacks */ void (*sample_cb)(struct damon_ctx *context); @@ -66,6 +88,8 @@ struct damon_ctx { int damon_set_pids(struct damon_ctx *ctx, unsigned long *pids, ssize_t nr_pids); +int damon_set_schemes(struct damon_ctx *ctx, + struct damos **schemes, ssize_t nr_schemes); int damon_set_recording(struct damon_ctx *ctx, unsigned int rbuf_len, char *rfile_path); int damon_set_attrs(struct damon_ctx *ctx, unsigned long s, unsigned long a, diff --git a/mm/damon.c b/mm/damon.c index c292ddd36c86..338e7ea76c7f 100644 --- a/mm/damon.c +++ b/mm/damon.c @@ -11,6 +11,7 @@ #define CREATE_TRACE_POINTS +#include <asm-generic/mman-common.h> #include <linux/damon.h> #include <linux/debugfs.h> #include <linux/delay.h> @@ -24,6 +25,8 @@ #include <linux/slab.h> #include <trace/events/damon.h> +#include "internal.h" + #define damon_get_task_struct(t) \ (get_pid_task(find_vpid(t->pid), PIDTYPE_PID)) @@ 
-45,6 +48,12 @@ #define damon_for_each_task_safe(ctx, t, next) \ list_for_each_entry_safe(t, next, &(ctx)->tasks_list, list) +#define damon_for_each_schemes(ctx, r) \ + list_for_each_entry(r, &(ctx)->schemes_list, list) + +#define damon_for_each_schemes_safe(ctx, s, next) \ + list_for_each_entry_safe(s, next, &(ctx)->schemes_list, list) + #define MAX_RFILE_PATH_LEN 256 /* Get a random number in [l, r) */ @@ -190,6 +199,27 @@ static void damon_destroy_task(struct damon_task *t) damon_free_task(t); } +static void damon_add_scheme(struct damon_ctx *ctx, struct damos *s) +{ + list_add_tail(&s->list, &ctx->schemes_list); +} + +static void damon_del_scheme(struct damos *s) +{ + list_del(&s->list); +} + +static void damon_free_scheme(struct damos *s) +{ + kfree(s); +} + +static void damon_destroy_scheme(struct damos *s) +{ + damon_del_scheme(s); + damon_free_scheme(s); +} + /* * Returns number of monitoring target tasks */ @@ -642,6 +672,93 @@ static void kdamond_count_age(struct damon_ctx *c, unsigned int threshold) } } +static int damos_madvise(struct damon_task *task, struct damon_region *r, + int behavior) +{ + struct task_struct *t; + struct mm_struct *mm; + int ret = -ENOMEM; + + t = damon_get_task_struct(task); + if (!t) + goto out; + mm = damon_get_mm(task); + if (!mm) + goto put_task_out; + + ret = madvise_common(t, mm, PAGE_ALIGN(r->vm_start), + PAGE_ALIGN(r->vm_end - r->vm_start), behavior); + mmput(mm); +put_task_out: + put_task_struct(t); +out: + return ret; +} + +static int damos_do_action(struct damon_task *task, struct damon_region *r, + enum damos_action action) +{ + int madv_action; + + switch (action) { + case DAMOS_WILLNEED: + madv_action = MADV_WILLNEED; + break; + case DAMOS_COLD: + madv_action = MADV_COLD; + break; + case DAMOS_PAGEOUT: + madv_action = MADV_PAGEOUT; + break; + case DAMOS_HUGEPAGE: + madv_action = MADV_HUGEPAGE; + break; + case DAMOS_NOHUGEPAGE: + madv_action = MADV_NOHUGEPAGE; + break; + default: + pr_warn("Wrong action %d\n", 
action); + return -EINVAL; + } + + return damos_madvise(task, r, madv_action); +} + +static void damon_do_apply_schemes(struct damon_ctx *c, struct damon_task *t, + struct damon_region *r) +{ + struct damos *s; + unsigned long sz; + + damon_for_each_schemes(c, s) { + sz = r->vm_end - r->vm_start; + if ((s->min_sz_region && sz < s->min_sz_region) || + (s->max_sz_region && s->max_sz_region < sz)) + continue; + if ((s->min_nr_accesses && r->nr_accesses < s->min_nr_accesses) + || (s->max_nr_accesses && + s->max_nr_accesses < r->nr_accesses)) + continue; + if ((s->min_age_region && r->age < s->min_age_region) || + (s->max_age_region && + s->max_age_region < r->age)) + continue; + damos_do_action(t, r, s->action); + r->age = 0; + } +} + +static void kdamond_apply_schemes(struct damon_ctx *c) +{ + struct damon_task *t; + struct damon_region *r; + + damon_for_each_task(c, t) { + damon_for_each_region(r, t) + damon_do_apply_schemes(c, t, r); + } +} + #define sz_damon_region(r) (r->vm_end - r->vm_start) /* @@ -937,6 +1054,7 @@ static int kdamond_fn(void *data) kdamond_count_age(ctx, max_nr_accesses / 10); if (ctx->aggregate_cb) ctx->aggregate_cb(ctx); + kdamond_apply_schemes(ctx); kdamond_flush_aggregated(ctx); kdamond_split_regions(ctx); } @@ -1011,6 +1129,27 @@ int damon_stop(struct damon_ctx *ctx) return damon_turn_kdamond(ctx, false); } +/* + * Set the data access monitoring oriented schemes + * + * NOTE: This function should not be called while the kdamond of the context is + * running. + * + * Returns 0 if success, or negative error code otherwise. + */ +int damon_set_schemes(struct damon_ctx *ctx, struct damos **schemes, + ssize_t nr_schemes) +{ + struct damos *s, *next; + ssize_t i; + + damon_for_each_schemes_safe(ctx, s, next) + damon_destroy_scheme(s); + for (i = 0; i < nr_schemes; i++) + damon_add_scheme(ctx, schemes[i]); + return 0; +} + /* * This function should not be called while the kdamond is running. 
*/ @@ -1456,6 +1595,7 @@ static int __init damon_init_user_ctx(void) prandom_seed_state(&ctx->rndseed, 42); INIT_LIST_HEAD(&ctx->tasks_list); + INIT_LIST_HEAD(&ctx->schemes_list); ctx->sample_cb = NULL; ctx->aggregate_cb = NULL; -- 2.17.1
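The region-matching semantics that damon_do_apply_schemes() introduces above — a bound of 0 (written as 'null' in the user input) means "no limit", and a region's age is reset whenever a scheme's action is applied to it — can be sketched in plain Python. This is only an illustrative model of the kernel logic; the dict keys and helper names below are made up for the sketch and are not part of the kernel API:

```python
def region_matches(scheme, region):
    """Mimic damon_do_apply_schemes(): a bound of 0 means 'no limit'."""
    sz = region['vm_end'] - region['vm_start']
    if scheme['min_sz'] and sz < scheme['min_sz']:
        return False
    if scheme['max_sz'] and sz > scheme['max_sz']:
        return False
    if scheme['min_nr_accesses'] and region['nr_accesses'] < scheme['min_nr_accesses']:
        return False
    if scheme['max_nr_accesses'] and region['nr_accesses'] > scheme['max_nr_accesses']:
        return False
    if scheme['min_age'] and region['age'] < scheme['min_age']:
        return False
    if scheme['max_age'] and region['age'] > scheme['max_age']:
        return False
    return True

def apply_schemes(schemes, region, do_action):
    """Apply every matching scheme's action; the kernel then resets the age."""
    for s in schemes:
        if region_matches(s, region):
            do_action(region, s['action'])
            region['age'] = 0
```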
From: SeongJae Park <sjpark@amazon.de> This commit implements a debugfs interface for the data access monitoring oriented memory management schemes. It is supposed to be used by administrators and/or privileged user space programs. Users can read and update the rules using ``<debugfs>/damon/schemes`` file. The format is:: <min/max size> <min/max access frequency> <min/max age> <action> Signed-off-by: SeongJae Park <sjpark@amazon.de> --- mm/damon.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 169 insertions(+), 2 deletions(-) diff --git a/mm/damon.c b/mm/damon.c index 338e7ea76c7f..c573a0290234 100644 --- a/mm/damon.c +++ b/mm/damon.c @@ -199,6 +199,29 @@ static void damon_destroy_task(struct damon_task *t) damon_free_task(t); } +static struct damos *damon_new_scheme( + unsigned int min_sz_region, unsigned int max_sz_region, + unsigned int min_nr_accesses, unsigned int max_nr_accesses, + unsigned int min_age_region, unsigned int max_age_region, + enum damos_action action) +{ + struct damos *ret; + + ret = kmalloc(sizeof(struct damos), GFP_KERNEL); + if (!ret) + return NULL; + ret->min_sz_region = min_sz_region; + ret->max_sz_region = max_sz_region; + ret->min_nr_accesses = min_nr_accesses; + ret->max_nr_accesses = max_nr_accesses; + ret->min_age_region = min_age_region; + ret->max_age_region = max_age_region; + ret->action = action; + INIT_LIST_HEAD(&ret->list); + + return ret; +} + static void damon_add_scheme(struct damon_ctx *ctx, struct damos *s) { list_add_tail(&s->list, &ctx->schemes_list); @@ -1306,6 +1329,144 @@ static ssize_t debugfs_monitor_on_write(struct file *file, return ret; } +static ssize_t sprint_schemes(struct damon_ctx *c, char *buf, ssize_t len) +{ + struct damos *s; + int written = 0; + int rc; + + damon_for_each_schemes(c, s) { + rc = snprintf(&buf[written], len - written, + "%u %u %u %u %u %u %d\n", + s->min_sz_region, s->max_sz_region, + s->min_nr_accesses, s->max_nr_accesses, + s->min_age_region, 
s->max_age_region, + s->action); + if (!rc) + return -ENOMEM; + written += rc; + } + return written; +} + +static ssize_t debugfs_schemes_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct damon_ctx *ctx = &damon_user_ctx; + char *kbuf; + ssize_t ret; + + kbuf = kmalloc(count, GFP_KERNEL); + if (!kbuf) + return -ENOMEM; + + ret = sprint_schemes(ctx, kbuf, count); + if (ret < 0) + goto out; + ret = simple_read_from_buffer(buf, count, ppos, kbuf, ret); + +out: + kfree(kbuf); + return ret; +} + +static void free_schemes_arr(struct damos **schemes, ssize_t nr_schemes) +{ + ssize_t i; + + for (i = 0; i < nr_schemes; i++) + kfree(schemes[i]); + kfree(schemes); +} + +/* + * Converts a string into an array of struct damos pointers + * + * Returns an array of struct damos pointers that converted if the conversion + * success, or NULL otherwise. + */ +static struct damos **str_to_schemes(const char *str, ssize_t len, + ssize_t *nr_schemes) +{ + struct damos *scheme, **schemes; + const int max_nr_schemes = 256; + int pos = 0, parsed, ret; + unsigned int min_sz, max_sz, min_nr_a, max_nr_a, min_age, max_age; + int action; + + schemes = kmalloc_array(max_nr_schemes, sizeof(struct damos *), + GFP_KERNEL); + if (!schemes) + return NULL; + + *nr_schemes = 0; + while (pos < len && *nr_schemes < max_nr_schemes) { + ret = sscanf(&str[pos], "%u %u %u %u %u %u %d%n", + &min_sz, &max_sz, &min_nr_a, &max_nr_a, + &min_age, &max_age, &action, &parsed); + pos += parsed; + if (ret != 7) + break; + if (action >= DAMOS_ACTION_LEN) { + pr_err("wrong action %d\n", action); + goto fail; + } + + scheme = damon_new_scheme(min_sz, max_sz, min_nr_a, max_nr_a, + min_age, max_age, action); + if (!scheme) + goto fail; + + schemes[*nr_schemes] = scheme; + *nr_schemes += 1; + } + if (!*nr_schemes) + goto fail; + return schemes; +fail: + free_schemes_arr(schemes, *nr_schemes); + return NULL; +} + +static ssize_t debugfs_schemes_write(struct file *file, const char __user 
*buf, + size_t count, loff_t *ppos) +{ + struct damon_ctx *ctx = &damon_user_ctx; + char *kbuf; + struct damos **schemes; + ssize_t nr_schemes = 0, ret; + + if (*ppos) + return -EINVAL; + + kbuf = kmalloc_array(count, sizeof(char), GFP_KERNEL); + if (!kbuf) + return -ENOMEM; + + ret = simple_write_to_buffer(kbuf, count, ppos, buf, count); + if (ret < 0) + goto out; + + schemes = str_to_schemes(kbuf, ret, &nr_schemes); + + spin_lock(&ctx->kdamond_lock); + if (ctx->kdamond) + goto monitor_running; + + damon_set_schemes(ctx, schemes, nr_schemes); + spin_unlock(&ctx->kdamond_lock); + goto out; + +monitor_running: + spin_unlock(&ctx->kdamond_lock); + pr_err("%s: kdamond is running. Turn it off first.\n", __func__); + ret = -EINVAL; + free_schemes_arr(schemes, nr_schemes); +out: + kfree(kbuf); + return ret; +} + static ssize_t damon_sprint_pids(struct damon_ctx *ctx, char *buf, ssize_t len) { struct damon_task *t; @@ -1536,6 +1697,12 @@ static const struct file_operations pids_fops = { .write = debugfs_pids_write, }; +static const struct file_operations schemes_fops = { + .owner = THIS_MODULE, + .read = debugfs_schemes_read, + .write = debugfs_schemes_write, +}; + static const struct file_operations record_fops = { .owner = THIS_MODULE, .read = debugfs_record_read, @@ -1552,10 +1719,10 @@ static struct dentry *debugfs_root; static int __init debugfs_init(void) { - const char * const file_names[] = {"attrs", "record", + const char * const file_names[] = {"attrs", "record", "schemes", "pids", "monitor_on"}; const struct file_operations *fops[] = {&attrs_fops, &record_fops, - &pids_fops, &monitor_on_fops}; + &schemes_fops, &pids_fops, &monitor_on_fops}; int i; debugfs_root = debugfs_create_dir("damon", NULL); -- 2.17.1
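As the debugfs patch above shows, the 'schemes' file exchanges each scheme as seven whitespace-separated integers: six unsigned bounds (min/max size, min/max nr_accesses, min/max age) followed by the numeric action. A small user-space sketch of that wire format, mirroring sprint_schemes() and the action validation in str_to_schemes() — the helper names here are illustrative, not part of the kernel interface:

```python
# Action indices match the damos_action enum order in the first patch.
DAMOS_ACTIONS = ['willneed', 'cold', 'pageout', 'hugepage', 'nohugepage']

def format_scheme(bounds, action):
    """Serialize one scheme the way sprint_schemes() prints it:
    <min/max size> <min/max nr_accesses> <min/max age> <action index>."""
    min_sz, max_sz, min_acc, max_acc, min_age, max_age = bounds
    return '%d %d %d %d %d %d %d' % (min_sz, max_sz, min_acc, max_acc,
                                     min_age, max_age,
                                     DAMOS_ACTIONS.index(action))

def parse_scheme(line):
    """Parse a line back, rejecting out-of-range actions like the kernel."""
    fields = [int(f) for f in line.split()]
    if len(fields) != 7:
        raise ValueError('expected 7 fields, got %d' % len(fields))
    if not 0 <= fields[6] < len(DAMOS_ACTIONS):
        raise ValueError('wrong action %d' % fields[6])
    return fields[:6], DAMOS_ACTIONS[fields[6]]
```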
From: SeongJae Park <sjpark@amazon.de> After merges of regions, each region should properly know its last shape, so that the changes since the last modification can be measured and the age can be reset if the changes are significant. This commit adds kunit test cases checking whether regions properly keep their last shape after merges. Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> --- mm/damon-test.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/damon-test.h b/mm/damon-test.h index c7dc21325c77..2ba757357211 100644 --- a/mm/damon-test.h +++ b/mm/damon-test.h @@ -540,6 +540,8 @@ static void damon_test_merge_regions_of(struct kunit *test) unsigned long saddrs[] = {0, 114, 130, 156, 170}; unsigned long eaddrs[] = {112, 130, 156, 170, 230}; + unsigned long lsa[] = {0, 114, 130, 156, 184}; + unsigned long lea[] = {100, 122, 156, 170, 230}; int i; t = damon_new_task(42); @@ -556,6 +558,9 @@ static void damon_test_merge_regions_of(struct kunit *test) r = damon_nth_region_of(t, i); KUNIT_EXPECT_EQ(test, r->vm_start, saddrs[i]); KUNIT_EXPECT_EQ(test, r->vm_end, eaddrs[i]); + KUNIT_EXPECT_EQ(test, r->last_vm_start, lsa[i]); + KUNIT_EXPECT_EQ(test, r->last_vm_end, lea[i]); + } damon_free_task(t); } -- 2.17.1

From: SeongJae Park <sjpark@amazon.de> This commit adds simple selftests for the 'schemes' debugfs file of DAMON. Signed-off-by: SeongJae Park <sjpark@amazon.de> --- .../testing/selftests/damon/debugfs_attrs.sh | 29 +++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/tools/testing/selftests/damon/debugfs_attrs.sh b/tools/testing/selftests/damon/debugfs_attrs.sh index d5188b0f71b1..82a98c81975b 100755 --- a/tools/testing/selftests/damon/debugfs_attrs.sh +++ b/tools/testing/selftests/damon/debugfs_attrs.sh @@ -97,6 +97,35 @@ fi echo $ORIG_CONTENT > $file +# Test schemes file +file="$DBGFS/schemes" + +ORIG_CONTENT=$(cat $file) +echo "1 2 3 4 5 6 3" > $file +if [ $? -ne 0 ] +then + echo "$file write fail" + echo $ORIG_CONTENT > $file + exit 1 +fi + +echo "1 2 +3 4 5 6 3" > $file +if [ $? -eq 0 ] +then + echo "$file split write success (expected fail)" + echo $ORIG_CONTENT > $file + exit 1 +fi + +echo > $file +if [ $? -ne 0 ] +then + echo "$file empty string writing fail" + echo $ORIG_CONTENT > $file + exit 1 +fi + # Test pids file file="$DBGFS/pids" -- 2.17.1
From: SeongJae Park <sjpark@amazon.de> This commit implements 'schemes' subcommand of the damon userspace tool. It can be used to describe and apply the data access monitoring-based operation schemes in more human friendly fashion. Signed-off-by: SeongJae Park <sjpark@amazon.de> --- tools/damon/_convert_damos.py | 125 +++++++++++++++++++++++++++++ tools/damon/_damon.py | 143 ++++++++++++++++++++++++++++++++++ tools/damon/damo | 7 ++ tools/damon/record.py | 135 +++----------------------------- tools/damon/schemes.py | 105 +++++++++++++++++++++++++ 5 files changed, 392 insertions(+), 123 deletions(-) create mode 100755 tools/damon/_convert_damos.py create mode 100644 tools/damon/_damon.py create mode 100644 tools/damon/schemes.py diff --git a/tools/damon/_convert_damos.py b/tools/damon/_convert_damos.py new file mode 100755 index 000000000000..0f1e7e3d4ccc --- /dev/null +++ b/tools/damon/_convert_damos.py @@ -0,0 +1,125 @@ +#!/usr/bin/env python3 + +""" +Change human readable data access monitoring-based operation schemes to the low +level input for the '<debugfs>/damon/schemes' file. Below is an example of the +schemes written in the human readable format: + +# format is: <min/max size> <min/max frequency (0-100)> <min/max age> <action> +# lines starts with '#' or blank are ignored. +# B/K/M/G/T for Bytes/KiB/MiB/GiB/TiB +# us/ms/s/m/h/d for micro-seconds/milli-seconds/seconds/minutes/hours/days +# 'null' means zero, which passes the check + +# if a region (no matter of its size) keeps a high access frequency for more +# than 100ms, put the region on the head of the LRU list (call madvise() with +# MADV_WILLNEED). +null null 80 null 100ms null willneed + +# if a region keeps a low access frequency for more than 100ms, put the +# region on the tail of the LRU list (call madvise() with MADV_COLD). +0B 0B 10 20 200ms 1h cold + +# if a region keeps a very low access frequency for more than 100ms, swap +# out the region immediately (call madvise() with MADV_PAGEOUT). 
+0B null 0 10 100ms 2h pageout + +# if a region of a size bigger than 2MiB keeps a very high access frequency +# for more than 100ms, let the region to use huge pages (call madvise() +# with MADV_HUGEPAGE). +2M null 90 99 100ms 2h hugepage + +# If a regions of a size bigger than 2MiB keeps no high access frequency +# for more than 100ms, avoid the region from using huge pages (call +# madvise() with MADV_NOHUGEPAGE). +2M null 0 25 100ms 2h nohugepage +""" + +import argparse + +unit_to_bytes = {'B': 1, 'K': 1024, 'M': 1024 * 1024, 'G': 1024 * 1024 * 1024, + 'T': 1024 * 1024 * 1024 * 1024} + +def text_to_bytes(txt): + if txt == 'null': + return 0 + unit = txt[-1] + number = int(txt[:-1]) + return number * unit_to_bytes[unit] + +unit_to_usecs = {'us': 1, 'ms': 1000, 's': 1000 * 1000, 'm': 60 * 1000 * 1000, + 'h': 60 * 60 * 1000 * 1000, 'd': 24 * 60 * 60 * 1000 * 1000} + +def text_to_us(txt): + if txt == 'null': + return 0 + unit = txt[-2:] + if unit in ['us', 'ms']: + number = int(txt[:-2]) + else: + unit = txt[-1] + number = int(txt[:-1]) + return number * unit_to_usecs[unit] + +damos_action_to_int = {'DAMOS_WILLNEED': 0, 'DAMOS_COLD': 1, + 'DAMOS_PAGEOUT': 2, 'DAMOS_HUGEPAGE': 3, 'DAMOS_NOHUGEPAGE': 4} + +def text_to_damos_action(txt): + return damos_action_to_int['DAMOS_' + txt.upper()] + +def text_to_nr_accesses(txt, max_nr_accesses): + if txt == 'null': + return 0 + return int(int(txt) * max_nr_accesses / 100) + +def debugfs_scheme(line, sample_interval, aggr_interval): + fields = line.split() + if len(fields) != 7: + print('wrong input line: %s' % line) + exit(1) + + limit_nr_accesses = aggr_interval / sample_interval + try: + min_sz = text_to_bytes(fields[0]) + max_sz = text_to_bytes(fields[1]) + min_nr_accesses = text_to_nr_accesses(fields[2], limit_nr_accesses) + max_nr_accesses = text_to_nr_accesses(fields[3], limit_nr_accesses) + min_age = text_to_us(fields[4]) / aggr_interval + max_age = text_to_us(fields[5]) / aggr_interval + action = 
text_to_damos_action(fields[6]) + except: + print('wrong input field') + raise + return '%d\t%d\t%d\t%d\t%d\t%d\t%d' % (min_sz, max_sz, min_nr_accesses, + max_nr_accesses, min_age, max_age, action) + +def convert(schemes_file, sample_interval, aggr_interval): + lines = [] + with open(schemes_file, 'r') as f: + for line in f: + if line.startswith('#'): + continue + line = line.strip() + if line == '': + continue + lines.append(debugfs_scheme(line, sample_interval, aggr_interval)) + return '\n'.join(lines) + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument('input', metavar='<file>', + help='input file describing the schemes') + parser.add_argument('-s', '--sample', metavar='<interval>', type=int, + default=5000, help='sampling interval (us)') + parser.add_argument('-a', '--aggr', metavar='<interval>', type=int, + default=100000, help='aggregation interval (us)') + args = parser.parse_args() + + schemes_file = args.input + sample_interval = args.sample + aggr_interval = args.aggr + + print(convert(schemes_file, sample_interval, aggr_interval)) + +if __name__ == '__main__': + main() diff --git a/tools/damon/_damon.py b/tools/damon/_damon.py new file mode 100644 index 000000000000..0a703ec7471a --- /dev/null +++ b/tools/damon/_damon.py @@ -0,0 +1,143 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 + +""" +Contains core functions for DAMON debugfs control. 
+""" + +import os +import subprocess + +debugfs_attrs = None +debugfs_record = None +debugfs_schemes = None +debugfs_pids = None +debugfs_monitor_on = None + +def set_target_pid(pid): + return subprocess.call('echo %s > %s' % (pid, debugfs_pids), shell=True, + executable='/bin/bash') + +def turn_damon(on_off): + return subprocess.call("echo %s > %s" % (on_off, debugfs_monitor_on), + shell=True, executable="/bin/bash") + +def is_damon_running(): + with open(debugfs_monitor_on, 'r') as f: + return f.read().strip() == 'on' + +class Attrs: + sample_interval = None + aggr_interval = None + regions_update_interval = None + min_nr_regions = None + max_nr_regions = None + rbuf_len = None + rfile_path = None + schemes = None + + def __init__(self, s, a, r, n, x, l, f, c): + self.sample_interval = s + self.aggr_interval = a + self.regions_update_interval = r + self.min_nr_regions = n + self.max_nr_regions = x + self.rbuf_len = l + self.rfile_path = f + self.schemes = c + + def __str__(self): + return "%s %s %s %s %s %s %s\n%s" % (self.sample_interval, + self.aggr_interval, self.regions_update_interval, + self.min_nr_regions, self.max_nr_regions, self.rbuf_len, + self.rfile_path, self.schemes) + + def attr_str(self): + return "%s %s %s %s %s " % (self.sample_interval, self.aggr_interval, + self.regions_update_interval, self.min_nr_regions, + self.max_nr_regions) + + def record_str(self): + return '%s %s ' % (self.rbuf_len, self.rfile_path) + + def apply(self): + ret = subprocess.call('echo %s > %s' % (self.attr_str(), debugfs_attrs), + shell=True, executable='/bin/bash') + if ret: + return ret + ret = subprocess.call('echo %s > %s' % (self.record_str(), + debugfs_record), shell=True, executable='/bin/bash') + if ret: + return ret + return subprocess.call('echo %s > %s' % ( + self.schemes.replace('\n', ' '), debugfs_schemes), shell=True, + executable='/bin/bash') + +def current_attrs(): + with open(debugfs_attrs, 'r') as f: + attrs = f.read().split() + attrs = [int(x) for x in 
attrs] + + with open(debugfs_record, 'r') as f: + rattrs = f.read().split() + attrs.append(int(rattrs[0])) + attrs.append(rattrs[1]) + + with open(debugfs_schemes, 'r') as f: + schemes = f.read() + attrs.append(schemes) + + return Attrs(*attrs) + +def chk_update_debugfs(debugfs): + global debugfs_attrs + global debugfs_record + global debugfs_schemes + global debugfs_pids + global debugfs_monitor_on + + debugfs_damon = os.path.join(debugfs, 'damon') + debugfs_attrs = os.path.join(debugfs_damon, 'attrs') + debugfs_record = os.path.join(debugfs_damon, 'record') + debugfs_schemes = os.path.join(debugfs_damon, 'schemes') + debugfs_pids = os.path.join(debugfs_damon, 'pids') + debugfs_monitor_on = os.path.join(debugfs_damon, 'monitor_on') + + if not os.path.isdir(debugfs_damon): + print("damon debugfs dir (%s) not found", debugfs_damon) + exit(1) + + for f in [debugfs_attrs, debugfs_record, debugfs_schemes, debugfs_pids, + debugfs_monitor_on]: + if not os.path.isfile(f): + print("damon debugfs file (%s) not found" % f) + exit(1) + +def cmd_args_to_attrs(args): + "Generate attributes with specified arguments" + sample_interval = args.sample + aggr_interval = args.aggr + regions_update_interval = args.updr + min_nr_regions = args.minr + max_nr_regions = args.maxr + rbuf_len = args.rbuf + if not os.path.isabs(args.out): + args.out = os.path.join(os.getcwd(), args.out) + rfile_path = args.out + schemes = args.schemes + return Attrs(sample_interval, aggr_interval, regions_update_interval, + min_nr_regions, max_nr_regions, rbuf_len, rfile_path, schemes) + +def set_attrs_argparser(parser): + parser.add_argument('-d', '--debugfs', metavar='<debugfs>', type=str, + default='/sys/kernel/debug', help='debugfs mounted path') + parser.add_argument('-s', '--sample', metavar='<interval>', type=int, + default=5000, help='sampling interval') + parser.add_argument('-a', '--aggr', metavar='<interval>', type=int, + default=100000, help='aggregate interval') + parser.add_argument('-u', 
'--updr', metavar='<interval>', type=int, + default=1000000, help='regions update interval') + parser.add_argument('-n', '--minr', metavar='<# regions>', type=int, + default=10, help='minimal number of regions') + parser.add_argument('-m', '--maxr', metavar='<# regions>', type=int, + default=1000, help='maximum number of regions') diff --git a/tools/damon/damo b/tools/damon/damo index 58e1099ae5fc..ce7180069bef 100755 --- a/tools/damon/damo +++ b/tools/damon/damo @@ -5,6 +5,7 @@ import argparse import record import report +import schemes class SubCmdHelpFormatter(argparse.RawDescriptionHelpFormatter): def _format_action(self, action): @@ -25,6 +26,10 @@ parser_record = subparser.add_parser('record', help='record data accesses of the given target processes') record.set_argparser(parser_record) +parser_schemes = subparser.add_parser('schemes', + help='apply operation schemes to the given target process') +schemes.set_argparser(parser_schemes) + parser_report = subparser.add_parser('report', help='report the recorded data accesses in the specified form') report.set_argparser(parser_report) @@ -33,5 +38,7 @@ args = parser.parse_args() if args.command == 'record': record.main(args) +elif args.command == 'schemes': + schemes.main(args) elif args.command == 'report': report.main(args) diff --git a/tools/damon/record.py b/tools/damon/record.py index a547d479a103..3bbf7b8359da 100644 --- a/tools/damon/record.py +++ b/tools/damon/record.py @@ -6,28 +6,12 @@ Record data access patterns of the target process. 
""" import argparse -import copy import os import signal import subprocess import time -debugfs_attrs = None -debugfs_record = None -debugfs_pids = None -debugfs_monitor_on = None - -def set_target_pid(pid): - return subprocess.call('echo %s > %s' % (pid, debugfs_pids), shell=True, - executable='/bin/bash') - -def turn_damon(on_off): - return subprocess.call("echo %s > %s" % (on_off, debugfs_monitor_on), - shell=True, executable="/bin/bash") - -def is_damon_running(): - with open(debugfs_monitor_on, 'r') as f: - return f.read().strip() == 'on' +import _damon def do_record(target, is_target_cmd, attrs, old_attrs): if os.path.isfile(attrs.rfile_path): @@ -36,93 +20,29 @@ def do_record(target, is_target_cmd, attrs, old_attrs): if attrs.apply(): print('attributes (%s) failed to be applied' % attrs) cleanup_exit(old_attrs, -1) - print('# damon attrs: %s' % attrs) + print('# damon attrs: %s %s' % (attrs.attr_str(), attrs.record_str())) if is_target_cmd: p = subprocess.Popen(target, shell=True, executable='/bin/bash') target = p.pid - if set_target_pid(target): + if _damon.set_target_pid(target): print('pid setting (%s) failed' % target) cleanup_exit(old_attrs, -2) - if turn_damon('on'): + if _damon.turn_damon('on'): print('could not turn on damon' % target) cleanup_exit(old_attrs, -3) if is_target_cmd: p.wait() while True: # damon will turn it off by itself if the target tasks are terminated. 
- if not is_damon_running(): + if not _damon.is_damon_running(): break time.sleep(1) cleanup_exit(old_attrs, 0) -class Attrs: - sample_interval = None - aggr_interval = None - regions_update_interval = None - min_nr_regions = None - max_nr_regions = None - rbuf_len = None - rfile_path = None - - def __init__(self, s, a, r, n, x, l, f): - self.sample_interval = s - self.aggr_interval = a - self.regions_update_interval = r - self.min_nr_regions = n - self.max_nr_regions = x - self.rbuf_len = l - self.rfile_path = f - - def __str__(self): - return "%s %s %s %s %s %s %s" % (self.sample_interval, self.aggr_interval, - self.regions_update_interval, self.min_nr_regions, - self.max_nr_regions, self.rbuf_len, self.rfile_path) - - def attr_str(self): - return "%s %s %s %s %s " % (self.sample_interval, self.aggr_interval, - self.regions_update_interval, self.min_nr_regions, - self.max_nr_regions) - - def record_str(self): - return '%s %s ' % (self.rbuf_len, self.rfile_path) - - def apply(self): - ret = subprocess.call('echo %s > %s' % (self.attr_str(), debugfs_attrs), - shell=True, executable='/bin/bash') - if ret: - return ret - return subprocess.call('echo %s > %s' % (self.record_str(), - debugfs_record), shell=True, executable='/bin/bash') - -def current_attrs(): - with open(debugfs_attrs, 'r') as f: - attrs = f.read().split() - attrs = [int(x) for x in attrs] - - with open(debugfs_record, 'r') as f: - rattrs = f.read().split() - attrs.append(int(rattrs[0])) - attrs.append(rattrs[1]) - return Attrs(*attrs) - -def cmd_args_to_attrs(args): - "Generate attributes with specified arguments" - sample_interval = args.sample - aggr_interval = args.aggr - regions_update_interval = args.updr - min_nr_regions = args.minr - max_nr_regions = args.maxr - rbuf_len = args.rbuf - if not os.path.isabs(args.out): - args.out = os.path.join(os.getcwd(), args.out) - rfile_path = args.out - return Attrs(sample_interval, aggr_interval, regions_update_interval, - min_nr_regions, max_nr_regions, 
rbuf_len, rfile_path) - def cleanup_exit(orig_attrs, exit_code): - if is_damon_running(): - if turn_damon('off'): + if _damon.is_damon_running(): + if _damon.turn_damon('off'): print('failed to turn damon off!') if orig_attrs: if orig_attrs.apply(): @@ -133,51 +53,19 @@ def sighandler(signum, frame): print('\nsignal %s received' % signum) cleanup_exit(orig_attrs, signum) -def chk_update_debugfs(debugfs): - global debugfs_attrs - global debugfs_record - global debugfs_pids - global debugfs_monitor_on - - debugfs_damon = os.path.join(debugfs, 'damon') - debugfs_attrs = os.path.join(debugfs_damon, 'attrs') - debugfs_record = os.path.join(debugfs_damon, 'record') - debugfs_pids = os.path.join(debugfs_damon, 'pids') - debugfs_monitor_on = os.path.join(debugfs_damon, 'monitor_on') - - if not os.path.isdir(debugfs_damon): - print("damon debugfs dir (%s) not found", debugfs_damon) - exit(1) - - for f in [debugfs_attrs, debugfs_record, debugfs_pids, debugfs_monitor_on]: - if not os.path.isfile(f): - print("damon debugfs file (%s) not found" % f) - exit(1) - def chk_permission(): if os.geteuid() != 0: print("Run as root") exit(1) def set_argparser(parser): + _damon.set_attrs_argparser(parser) parser.add_argument('target', type=str, metavar='<target>', help='the target command or the pid to record') - parser.add_argument('-s', '--sample', metavar='<interval>', type=int, - default=5000, help='sampling interval') - parser.add_argument('-a', '--aggr', metavar='<interval>', type=int, - default=100000, help='aggregate interval') - parser.add_argument('-u', '--updr', metavar='<interval>', type=int, - default=1000000, help='regions update interval') - parser.add_argument('-n', '--minr', metavar='<# regions>', type=int, - default=10, help='minimal number of regions') - parser.add_argument('-m', '--maxr', metavar='<# regions>', type=int, - default=1000, help='maximum number of regions') parser.add_argument('-l', '--rbuf', metavar='<len>', type=int, default=1024*1024, help='length of 
record result buffer') parser.add_argument('-o', '--out', metavar='<file path>', type=str, default='damon.data', help='output file path') - parser.add_argument('-d', '--debugfs', metavar='<debugfs>', type=str, - default='/sys/kernel/debug', help='debugfs mounted path') def main(args=None): global orig_attrs @@ -187,13 +75,14 @@ def main(args=None): args = parser.parse_args() chk_permission() - chk_update_debugfs(args.debugfs) + _damon.chk_update_debugfs(args.debugfs) signal.signal(signal.SIGINT, sighandler) signal.signal(signal.SIGTERM, sighandler) - orig_attrs = current_attrs() + orig_attrs = _damon.current_attrs() - new_attrs = cmd_args_to_attrs(args) + args.schemes = '' + new_attrs = _damon.cmd_args_to_attrs(args) target = args.target target_fields = target.split() diff --git a/tools/damon/schemes.py b/tools/damon/schemes.py new file mode 100644 index 000000000000..408a73813234 --- /dev/null +++ b/tools/damon/schemes.py @@ -0,0 +1,105 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 + +""" +Apply given operation schemes to the target process. 
+""" + +import argparse +import os +import signal +import subprocess +import time + +import _convert_damos +import _damon + +def run_damon(target, is_target_cmd, attrs, old_attrs): + if os.path.isfile(attrs.rfile_path): + os.rename(attrs.rfile_path, attrs.rfile_path + '.old') + + if attrs.apply(): + print('attributes (%s) failed to be applied' % attrs) + cleanup_exit(old_attrs, -1) + print('# damon attrs: %s %s' % (attrs.attr_str(), attrs.record_str())) + for line in attrs.schemes.split('\n'): + print('# scheme: %s' % line) + if is_target_cmd: + p = subprocess.Popen(target, shell=True, executable='/bin/bash') + target = p.pid + if _damon.set_target_pid(target): + print('pid setting (%s) failed' % target) + cleanup_exit(old_attrs, -2) + if _damon.turn_damon('on'): + print('could not turn on damon' % target) + cleanup_exit(old_attrs, -3) + if is_target_cmd: + p.wait() + while True: + # damon will turn it off by itself if the target tasks are terminated. + if not _damon.is_damon_running(): + break + time.sleep(1) + + cleanup_exit(old_attrs, 0) + +def cleanup_exit(orig_attrs, exit_code): + if _damon.is_damon_running(): + if turn_damon('off'): + print('failed to turn damon off!') + if orig_attrs: + if orig_attrs.apply(): + print('original attributes (%s) restoration failed!' 
% orig_attrs) + exit(exit_code) + +def sighandler(signum, frame): + print('\nsignal %s received' % signum) + cleanup_exit(orig_attrs, signum) + +def chk_permission(): + if os.geteuid() != 0: + print("Run as root") + exit(1) + +def set_argparser(parser): + _damon.set_attrs_argparser(parser) + parser.add_argument('target', type=str, metavar='<target>', + help='the target command or the pid to record') + parser.add_argument('-c', '--schemes', metavar='<file>', type=str, + default='damon.schemes', + help='data access monitoring-based operation schemes') + +def main(args=None): + global orig_attrs + if not args: + parser = argparse.ArgumentParser() + set_argparser(parser) + args = parser.parse_args() + + chk_permission() + _damon.chk_update_debugfs(args.debugfs) + + signal.signal(signal.SIGINT, sighandler) + signal.signal(signal.SIGTERM, sighandler) + orig_attrs = _damon.current_attrs() + + args.rbuf = 0 + args.out = 'null' + args.schemes = _convert_damos.convert(args.schemes, args.sample, args.aggr) + new_attrs = _damon.cmd_args_to_attrs(args) + target = args.target + + target_fields = target.split() + if not subprocess.call('which %s > /dev/null' % target_fields[0], + shell=True, executable='/bin/bash'): + run_damon(target, True, new_attrs, orig_attrs) + else: + try: + pid = int(target) + except: + print('target \'%s\' is neither a command, nor a pid' % target) + exit(1) + run_damon(target, False, new_attrs, orig_attrs) + +if __name__ == '__main__': + main() -- 2.17.1
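The core of _convert_damos.py above is unit conversion: the human-readable sizes and times in a schemes file become the raw byte sizes and aggregation-interval counts that the debugfs interface expects. A condensed sketch of that conversion, mirroring text_to_bytes() and text_to_us() from the patch ('null' maps to 0, which the kernel treats as "no limit"; integer division is used here where the patch divides and later truncates via '%d'):

```python
UNIT_TO_BYTES = {'B': 1, 'K': 1 << 10, 'M': 1 << 20, 'G': 1 << 30, 'T': 1 << 40}
UNIT_TO_USECS = {'us': 1, 'ms': 10**3, 's': 10**6, 'm': 60 * 10**6,
                 'h': 3600 * 10**6, 'd': 24 * 3600 * 10**6}

def text_to_bytes(txt):
    """'2M' -> 2 MiB in bytes; 'null' -> 0 (no limit)."""
    if txt == 'null':
        return 0
    return int(txt[:-1]) * UNIT_TO_BYTES[txt[-1]]

def text_to_us(txt):
    """'100ms' -> 100000 us; two-letter units (us/ms) are checked first."""
    if txt == 'null':
        return 0
    unit = txt[-2:] if txt[-2:] in ('us', 'ms') else txt[-1]
    return int(txt[:-len(unit)]) * UNIT_TO_USECS[unit]

def age_in_aggr_intervals(txt, aggr_interval_us):
    # Ages are handed to the kernel as a count of aggregation intervals.
    return text_to_us(txt) // aggr_interval_us
```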
On Tue, 2020-03-03 at 13:14 +0100, SeongJae Park wrote: > From: SeongJae Park <sjpark@amazon.de> > --- a/mm/damon.c > +++ b/mm/damon.c > @@ -87,6 +87,10 @@ static struct damon_region > *damon_new_region(struct damon_ctx *ctx, > ret->sampling_addr = damon_rand(ctx, vm_start, vm_end); > INIT_LIST_HEAD(&ret->list); > > + ret->age = 0; > + ret->last_vm_start = vm_start; > + ret->last_vm_end = vm_end; Wait, what tree is this supposed to apply against? I see no mm/damon.c file in current Linus upstream. -- All Rights Reversed.
Hello Rik, Thank you for the question :) On Wed, 04 Mar 2020 10:21:29 -0500 Rik van Riel <riel@surriel.com> wrote: > On Tue, 2020-03-03 at 13:14 +0100, SeongJae Park wrote: > > From: SeongJae Park <sjpark@amazon.de> > > > --- a/mm/damon.c > > +++ b/mm/damon.c > > @@ -87,6 +87,10 @@ static struct damon_region > > *damon_new_region(struct damon_ctx *ctx, > > ret->sampling_addr = damon_rand(ctx, vm_start, vm_end); > > INIT_LIST_HEAD(&ret->list); > > > > + ret->age = 0; > > + ret->last_vm_start = vm_start; > > + ret->last_vm_end = vm_end; > > Wait, what tree is this supposed to apply against? > > I see no mm/damon.c file in current Linus upstream. This patchset is supposed to apply against v5.5 plus the DAMON patchset[1] plus a patch from Minchan. You can get the tree this patchset is applied to via: $ git clone git://github.com/sjp38/linux -b damos/rfc/v4 Or, browse it on the web: https://github.com/sjp38/linux/releases/tag/damos/rfc/v4 I am posting this as a separate RFC patchset because 1) this patchset is based on a tree other than Linus' or other maintainers' upstream trees, 2) I want to keep the size of the original patchset small for the convenience of reviewers, 3) this patchset was made relatively recently and thus might be unstable compared to the DAMON patchset[1], and 4) I want to share my plan and get as much early feedback as possible. Sorry if this made you confused. Also, if you have any opinions regarding these separated postings, please let me know. [1] https://lore.kernel.org/linux-mm/20200224123047.32506-1-sjpark@amazon.com Thanks, SeongJae Park > > -- > All Rights Reversed.