[PATCH v5 0/8] Remove dependency on congestion_wait in mm/

From: Mel Gorman @ 2021-10-22 14:46 UTC
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
    Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
    Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
    Linux-fsdevel, LKML, Mel Gorman

This series replaces the v4 version in mmotm as the changes caused
excessive conflicts. This series is also available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v5r4

Changelog since v4
o Cosmetic changes (neilb)
o Correct number of writeback throttled tasks (neilb)
o Use wake_up (neilb)

Changelog since v3
o Count writeback completions for NR_THROTTLED_WRITTEN only
o Use IRQ-safe inc_node_page_state
o Remove redundant throttling

This series removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but congestion_wait
has been broken for a long time
(https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).

Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one possible
reason that reclaim may be slow, there is also the problem of too many
pages being isolated and reclaim failing for other reasons (elevated
references, too many pages isolated, excessive LRU contention etc).

This series replaces the "congestion" throttling with 3 different types.
o If there are too many dirty/writeback pages, sleep until a timeout
  or enough pages get cleaned
o If too many pages are isolated, sleep until enough isolated pages
  are either reclaimed or put back on the LRU
o If no progress is being made, direct reclaim tasks sleep until
  another task makes progress with acceptable efficiency.

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may time out.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increases. It has
four types of worker.

o One "anon latency" worker creates small mappings with mmap() and
  times how long it takes to fault the mapping reading it 4K at a time
o X file writers which is fio randomly writing X files where the total
  size of the files adds up to the allowed dirty_ratio. fio is allowed
  to run for a warmup period to allow some file-backed pages to
  accumulate. The duration of the warmup is based on the best-case
  linear write speed of the storage.
o Y file readers which is fio randomly reading small files
o Z anon memory hogs which continually map (100-dirty_ratio)% of memory
o Total estimated WSS = (100+dirty_ratio) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero
readers.
The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.

First, the results of the "anon latency" worker

stutterp
                           5.15.0-rc1                5.15.0-rc1
                              vanilla    mm-reclaimcongest-v5r4
Amean  mmap-4       31.4003 (  0.00%)    2661.0198 ( -8374.52%)
Amean  mmap-7       38.1641 (  0.00%)     149.2891 (  -291.18%)
Amean  mmap-12      60.0981 (  0.00%)     187.8105 (  -212.51%)
Amean  mmap-21     161.2699 (  0.00%)     213.9107 (   -32.64%)
Amean  mmap-30     174.5589 (  0.00%)     377.7548 (  -116.41%)
Amean  mmap-48    8106.8160 (  0.00%)    1070.5616 (    86.79%)
Stddev mmap-4       41.3455 (  0.00%)   27573.9676 (-66591.66%)
Stddev mmap-7       53.5556 (  0.00%)    4608.5860 ( -8505.23%)
Stddev mmap-12     171.3897 (  0.00%)    5559.4542 ( -3143.75%)
Stddev mmap-21    1506.6752 (  0.00%)    5746.2507 (  -281.39%)
Stddev mmap-30     557.5806 (  0.00%)    7678.1624 ( -1277.05%)
Stddev mmap-48   61681.5718 (  0.00%)   14507.2830 (    76.48%)
Max-90 mmap-4       31.4243 (  0.00%)      83.1457 (  -164.59%)
Max-90 mmap-7       41.0410 (  0.00%)      41.0720 (    -0.08%)
Max-90 mmap-12      66.5255 (  0.00%)      53.9073 (    18.97%)
Max-90 mmap-21     146.7479 (  0.00%)     105.9540 (    27.80%)
Max-90 mmap-30     193.9513 (  0.00%)      64.3067 (    66.84%)
Max-90 mmap-48     277.9137 (  0.00%)     591.0594 (  -112.68%)
Max    mmap-4     1913.8009 (  0.00%)  299623.9695 (-15555.96%)
Max    mmap-7     2423.9665 (  0.00%)  204453.1708 ( -8334.65%)
Max    mmap-12    6845.6573 (  0.00%)  221090.3366 ( -3129.64%)
Max    mmap-21   56278.6508 (  0.00%)  213877.3496 (  -280.03%)
Max    mmap-30   19716.2990 (  0.00%)  216287.6229 (  -997.00%)
Max    mmap-48  477923.9400 (  0.00%)  245414.8238 (    48.65%)

For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout, increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each task's rate of page allocation
versus progress of reclaim. The variance is also impacted for high
worker counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers. Max-90
shows that 90% of the stalls are comparable but the Max results show the
massive outliers which are increased due to stalling.

It is expected that this will be very machine dependent. Due to the test
design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes whereas fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable. For example,
these are the latencies on a single-socket machine that had more memory

Amean  mmap-4       42.2287 (  0.00%)      49.6838 * -17.65%*
Amean  mmap-7      216.4326 (  0.00%)      47.4451 *  78.08%*
Amean  mmap-12    2412.0588 (  0.00%)      51.7497 (  97.85%)
Amean  mmap-21    5546.2548 (  0.00%)      51.8862 (  99.06%)
Amean  mmap-30    1085.3121 (  0.00%)      72.1004 (  93.36%)

The overall system CPU usage and elapsed time is as follows

                    5.15.0-rc3              5.15.0-rc3
                       vanilla  mm-reclaimcongest-v5r4
Duration User          6989.03                  983.42
Duration System        7308.12                  799.68
Duration Elapsed       2277.67                 2092.98

The patches reduce system CPU usage by 89% as the vanilla kernel is
rarely stalling.
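The percentages in the tables above appear to be the relative change
against the vanilla column, with positive values meaning the patched
kernel is lower (better). That reading is inferred from the figures
rather than stated anywhere in this mail. A minimal sketch of the
arithmetic follows; the helper name pct_gain_lower_is_better() is made
up for illustration and is not part of mmtests.

```c
#include <assert.h>

/*
 * Relative improvement of "patched" over "baseline" for a metric where
 * lower is better (latency, CPU time). A positive result means the
 * patched value is lower. Hypothetical helper, not an mmtests function.
 */
double pct_gain_lower_is_better(double baseline, double patched)
{
	assert(baseline != 0.0);	/* baseline columns above are never zero */
	return (baseline - patched) / baseline * 100.0;
}
```

Feeding it the Duration System pair (7308.12, 799.68) gives the roughly
89% reduction quoted above, and the Amean mmap-48 pair from the first
table (8106.8160, 1070.5616) gives the 86.79% shown there.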
The high-level /proc/vmstat counters show

                                   5.15.0-rc1              5.15.0-rc1
                                      vanilla  mm-reclaimcongest-v5r2
Ops Direct pages scanned        1056608451.00            503594991.00
Ops Kswapd pages scanned         109795048.00            147289810.00
Ops Kswapd pages reclaimed        63269243.00             31036005.00
Ops Direct pages reclaimed        10803973.00              6328887.00
Ops Kswapd efficiency %                 57.62                   21.07
Ops Kswapd velocity                  48204.98                57572.86
Ops Direct efficiency %                  1.02                    1.26
Ops Direct velocity                 463898.83               196845.97

Kswapd scanned more pages with the series applied (147289810 versus
109795048) and the detailed pattern is also different. The vanilla
kernel scans slowly over time whereas the patched kernel exhibits bursts
of scan activity. Direct reclaim scanning is reduced by 52% due to
stalling.

The pattern for stealing pages is also slightly different. Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time whereas the patches tend to reclaim in
spikes. The difference is that vanilla is not throttling and instead
scanning constantly finding some pages over time whereas the patched
kernel throttles and reclaims in spikes.

Ops Percentage direct scans             90.59                   77.37

With the vanilla kernel, 90.59% of scanned pages were scanned by direct
reclaim whereas with the patches, 77.37% were, due to throttling.

Ops Page writes by reclaim         2613590.00              1687131.00

Page writes from reclaim context are reduced.

Ops Page writes anon               2932752.00              1917048.00

And there is less swapping.

Ops Page reclaim immediate       996248528.00            107664764.00

The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.

Ops Slabs scanned                   164284.00               153608.00

Slab scan activity is similar.
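The derived "Ops" rows above can be reproduced from the raw counters. The
sketch below shows one plausible set of definitions. The function names
are invented for illustration, and the definitions themselves are
inferred from the numbers, not taken from the mmtests source. In
particular, treating velocity as pages scanned per second of elapsed time
is an assumption that happens to reproduce the vanilla column when
combined with the Duration Elapsed figure reported earlier.

```c
#include <assert.h>

/* Percentage of scanned pages that ended up being reclaimed. */
double reclaim_efficiency_pct(double reclaimed, double scanned)
{
	assert(scanned > 0.0);
	return reclaimed / scanned * 100.0;
}

/* Share of all page scanning done by direct reclaim rather than kswapd. */
double direct_scan_share_pct(double direct_scanned, double kswapd_scanned)
{
	assert(direct_scanned + kswapd_scanned > 0.0);
	return direct_scanned / (direct_scanned + kswapd_scanned) * 100.0;
}

/* Pages scanned per second of elapsed time (assumed definition). */
double scan_velocity(double scanned, double elapsed_seconds)
{
	assert(elapsed_seconds > 0.0);
	return scanned / elapsed_seconds;
}
```

For the vanilla column, reclaim_efficiency_pct(10803973, 1056608451)
lands on the 1.02 reported as "Direct efficiency %", and
direct_scan_share_pct(1056608451, 109795048) lands on the 90.59 reported
as "Percentage direct scans".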
ftrace was used to gather stall activity

Vanilla
-------
      1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
      2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
      8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
     29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
  82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The vast majority of wait_iff_congested calls do not stall at all. What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).

      1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
      2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
      4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
    380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
    778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait, if called, always exceeds the timeout as there is no
trigger to wake it up.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

      1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
      4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
     94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
    112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period.
Of the 244 kswapd events listed above, 11 (roughly 4.5%) reached the
full timeout. For direct reclaim, the number of stalls for each reason
was

   6624 reason=VMSCAN_THROTTLE_ISOLATED
  93246 reason=VMSCAN_THROTTLE_NOPROGRESS
  96934 reason=VMSCAN_THROTTLE_WRITEBACK

The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward progress. A relatively small number were due to too many pages
isolated from the LRU by parallel threads.

For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

      9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
     12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
     83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
   6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

Most did not stall at all. A small number reached the timeout.

For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls was all over the
map

      1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
      6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
     11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
     13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
     13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
     16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
     18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
     21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
     23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
     23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
     25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
     25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
     26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
     27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
     28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
     29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
     30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
     30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
     31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
     32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
     33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
     35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
     35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
     36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
     36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
     37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
     38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
     40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
     43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
     55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
     56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
     58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
     59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
     61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
     71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
     71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
     79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
     82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
     82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
     85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
     85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
     88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
     90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
     90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
     94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
    118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
    119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
    126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
    146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
    148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
    148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
    159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
    178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
    183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
    237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
    266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
    313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
    347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
    470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
    559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
    964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
   2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
   2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
   7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
  22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
  51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

The full timeout is often hit but a large number also do not stall at
all. The remainder slept a little allowing other reclaim tasks to make
progress.

While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

      1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
      2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
      3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
      7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
     12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
     16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
     24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
     28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
     30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
     30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
     32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
     42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
     77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
     99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
    137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
    190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
    339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
    518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
    852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
   3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
   7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
  83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all. This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.

Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage. A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.

 include/linux/backing-dev.h      |   1 -
 include/linux/mmzone.h           |  15 +++
 include/trace/events/vmscan.h    |  38 ++++++++
 include/trace/events/writeback.h |   7 --
 mm/backing-dev.c                 |  48 ----------
 mm/compaction.c                  |  10 +-
 mm/filemap.c                     |   1 +
 mm/internal.h                    |  21 +++++
 mm/memcontrol.c                  |  10 +-
 mm/page-writeback.c              |  11 ++-
 mm/page_alloc.c                  |  26 ++----
 mm/vmscan.c                      | 151 ++++++++++++++++++++++++++++---
 mm/vmstat.c                      |   1 +
 13 files changed, 237 insertions(+), 103 deletions(-)

-- 
2.31.1

[PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested

From: Mel Gorman @ 2021-10-22 14:46 UTC
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
    Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
    Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
    Linux-fsdevel, LKML, Mel Gorman

Page reclaim throttles on wait_iff_congested under the following
conditions

o kswapd is encountering pages under writeback and marked for immediate
  reclaim implying that pages are cycling through the LRU faster than
  pages can be cleaned.
o Direct reclaim will stall if all dirty pages are backed by congested
  inodes.

wait_iff_congested is almost completely broken with few exceptions. This
patch adds a new node-based wait queue and tracks the number of
throttled tasks and pages written back since throttling started. If
enough pages belonging to the node are written back then the throttled
tasks will wake early. If not, the throttled tasks sleep until the
timeout expires.
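The accounting described above can be modelled in userspace to see when
the early wakeup fires. What follows is only a single-threaded sketch
under stated assumptions: struct node_throttle and the three function
names are invented for illustration, the counters are plain integers
instead of atomics and per-cpu vmstat deltas, and nothing actually
sleeps. The one piece taken from the patch is the wake condition, which
compares pages written since throttling started against
SWAP_CLUSTER_MAX * nr_throttled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hard-coded here; the kernel defines SWAP_CLUSTER_MAX as 32UL. */
#define SWAP_CLUSTER_MAX 32UL

/* Userspace stand-in for the per-node state the patch adds (hypothetical). */
struct node_throttle {
	unsigned long nr_throttled_written;	/* models NR_THROTTLED_WRITTEN */
	unsigned long nr_reclaim_start;		/* snapshot when throttling began */
	int nr_writeback_throttled;		/* tasks currently throttled */
};

/* The first task to throttle records the writeback count at that moment. */
void throttle_begin(struct node_throttle *n)
{
	if (++n->nr_writeback_throttled == 1)
		n->nr_reclaim_start = n->nr_throttled_written;
}

/* A throttled task stops waiting (woken early or timed out). */
void throttle_end(struct node_throttle *n)
{
	assert(n->nr_writeback_throttled > 0);
	n->nr_writeback_throttled--;
}

/*
 * Account one page finishing writeback. Returns true when enough pages
 * have been cleaned that the throttled tasks should be woken early.
 */
bool writeback_completed(struct node_throttle *n)
{
	unsigned long nr_written;

	if (n->nr_writeback_throttled == 0)
		return false;	/* nobody throttled, nothing to account */

	n->nr_throttled_written++;
	nr_written = n->nr_throttled_written - n->nr_reclaim_start;

	return nr_written >
	       SWAP_CLUSTER_MAX * (unsigned long)n->nr_writeback_throttled;
}
```

In this model, with two tasks throttled the threshold is 64 pages, so
the 65th completed page is the first one to request a wakeup. If fewer
pages complete, the tasks in the real patch simply run out their timeout.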
[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/backing-dev.h      |  1 -
 include/linux/mmzone.h           | 13 +++++
 include/trace/events/vmscan.h    | 34 +++++++++++++
 include/trace/events/writeback.h |  7 ---
 mm/backing-dev.c                 | 48 -------------------
 mm/filemap.c                     |  1 +
 mm/internal.h                    | 11 +++++
 mm/page_alloc.c                  |  5 ++
 mm/vmscan.c                      | 82 +++++++++++++++++++++++++++-----
 mm/vmstat.c                      |  1 +
 10 files changed, 135 insertions(+), 68 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ac7f231b8825..9fb1f0ae273c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -154,7 +154,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 }
 
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(int sync, long timeout);
 
 static inline bool mapping_can_writeback(struct address_space *mapping)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d84675..419304093610 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -199,6 +199,7 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_THROTTLED_WRITTEN,	/* NR_WRITTEN while reclaim throttled */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
@@ -272,6 +273,11 @@ enum lru_list {
 	NR_LRU_LISTS
 };
 
+enum vmscan_throttle_state {
+	VMSCAN_THROTTLE_WRITEBACK,
+	NR_VMSCAN_THROTTLE,
+};
+
 #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
 
 #define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
@@ -841,6 +847,13 @@ typedef struct pglist_data {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
+
+	/* workqueues for throttling reclaim for different reasons. */
+	wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
+
+	atomic_t nr_writeback_throttled;/* nr of writeback-throttled tasks */
+	unsigned long nr_reclaim_start;	/* nr pages written while throttled
+					 * when throttling started. */
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
 	int kswapd_order;
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 88faf2400ec2..c317f9fe0d17 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -27,6 +27,14 @@
 		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
 		) : "RECLAIM_WB_NONE"
 
+#define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
+
+#define show_throttle_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",					\
+		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"}	\
+		) : "VMSCAN_THROTTLE_NONE"
+
+
 #define trace_reclaim_flags(file) ( \
 	(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
 	(RECLAIM_WB_ASYNC) \
@@ -454,6 +462,32 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_node_reclaim_end,
 	TP_ARGS(nr_reclaimed)
 );
 
+TRACE_EVENT(mm_vmscan_throttled,
+
+	TP_PROTO(int nid, int usec_timeout, int usec_delayed, int reason),
+
+	TP_ARGS(nid, usec_timeout, usec_delayed, reason),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, usec_timeout)
+		__field(int, usec_delayed)
+		__field(int, reason)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->usec_timeout = usec_timeout;
+		__entry->usec_delayed = usec_delayed;
+		__entry->reason = 1U << reason;
+	),
+
+	TP_printk("nid=%d usec_timeout=%d usect_delayed=%d reason=%s",
+		__entry->nid,
+		__entry->usec_timeout,
+		__entry->usec_delayed,
+		show_throttle_flags(__entry->reason))
+);
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 840d1ba84cf5..3bc759b81897 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -763,13 +763,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
-	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-	TP_ARGS(usec_timeout, usec_delayed)
-);
-
 DECLARE_EVENT_CLASS(writeback_single_inode_template,
 
 	TP_PROTO(struct inode *inode,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4a9d4e27d0d9..0ea1a105eae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1041,51 +1041,3 @@ long congestion_wait(int sync, long timeout)
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-
-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) this waits
- * for up to @timeout jiffies for either a BDI to exit congestion of the
- * given @sync queue or a write to complete.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(int sync, long timeout)
-{
-	long ret;
-	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
-
-	/*
-	 * If there is no congestion, yield if necessary instead
-	 * of sleeping on the congestion queue
-	 */
-	if (atomic_read(&nr_wb_congested[sync]) == 0) {
-		cond_resched();
-
-		/* In case we scheduled, work out time remaining */
-		ret = timeout - (jiffies - start);
-		if (ret < 0)
-			ret = 0;
-
-		goto out;
-	}
-
-	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
-
-out:
-	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
-					jiffies_to_usecs(jiffies - start));
-
-	return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..59187787fbfc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1605,6 +1605,7 @@ void end_page_writeback(struct page *page)
 
 	smp_mb__after_atomic();
 	wake_up_page(page, PG_writeback);
+	acct_reclaim_writeback(page);
 	put_page(page);
 }
 EXPORT_SYMBOL(end_page_writeback);
diff --git a/mm/internal.h b/mm/internal.h
index cf3cb933eba3..b495e60c955d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -34,6 +34,17 @@
 
 void page_writeback_init(void);
 
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+						int nr_throttled);
+static inline void acct_reclaim_writeback(struct page *page)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+	int nr_throttled = atomic_read(&pgdat->nr_writeback_throttled);
+
+	if (nr_throttled)
+		__acct_reclaim_writeback(pgdat, page, nr_throttled);
+}
+
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..78e538067651 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7389,6 +7389,8 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
 
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
+	int i;
+
 	pgdat_resize_init(pgdat);
 
 	pgdat_init_split_queue(pgdat);
@@ -7397,6 +7399,9 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
+	for (i = 0; i < NR_VMSCAN_THROTTLE; i++)
+		init_waitqueue_head(&pgdat->reclaim_wait[i]);
+
 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2d1fed..0c1595065f49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,6 +1006,64 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
+static void
+reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+							long timeout)
+{
+	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
+	long ret;
+	DEFINE_WAIT(wait);
+
+	/*
+	 * Do not throttle IO workers, kthreads other than kswapd or
+	 * workqueues. They may be required for reclaim to make
+	 * forward progress (e.g. journalling workqueues or kthreads).
+	 */
+	if (!current_is_kswapd() &&
+	    current->flags & (PF_IO_WORKER|PF_KTHREAD))
+		return;
+
+	if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
+		WRITE_ONCE(pgdat->nr_reclaim_start,
+			node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+	}
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	atomic_dec(&pgdat->nr_writeback_throttled);
+
+	trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
+				jiffies_to_usecs(timeout - ret),
+				reason);
+}
+
+/*
+ * Account for pages written if tasks are throttled waiting on dirty
+ * pages to clean. If enough pages have been cleaned since throttling
+ * started then wakeup the throttled tasks.
+ */
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+							int nr_throttled)
+{
+	unsigned long nr_written;
+
+	inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+
+	/*
+	 * This is an inaccurate read as the per-cpu deltas may not
+	 * be synchronised. However, given that the system is
+	 * writeback throttled, it is not worth taking the penalty
+	 * of getting an accurate count. At worst, the throttle
+	 * timeout guarantees forward progress.
+	 */
+	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
+		READ_ONCE(pgdat->nr_reclaim_start);
+
+	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
+		wake_up(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
+}
+
 /* possible outcome of pageout() */
 typedef enum {
 	/* failed to write page out, page is locked */
@@ -1412,9 +1470,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 
 		/*
 		 * The number of dirty pages determines if a node is marked
-		 * reclaim_congested which affects wait_iff_congested. kswapd
-		 * will stall and start writing pages if the tail of the LRU
-		 * is all dirty unqueued pages.
+		 * reclaim_congested. kswapd will stall and start writing
+		 * pages if the tail of the LRU is all dirty unqueued pages.
*/ page_check_dirty_writeback(page, &dirty, &writeback); if (dirty || writeback) @@ -3180,19 +3237,19 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) * If kswapd scans pages marked for immediate * reclaim and under writeback (nr_immediate), it * implies that pages are cycling through the LRU - * faster than they are written so also forcibly stall. + * faster than they are written so forcibly stall + * until some pages complete writeback. */ if (sc->nr.immediate) - congestion_wait(BLK_RW_ASYNC, HZ/10); + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10); } /* - * Tag a node/memcg as congested if all the dirty pages - * scanned were backed by a congested BDI and - * wait_iff_congested will stall. + * Tag a node/memcg as congested if all the dirty pages were marked + * for writeback and immediate reclaim (counted in nr.congested). * * Legacy memcg will stall in page writeback so avoid forcibly - * stalling in wait_iff_congested(). + * stalling in reclaim_throttle(). */ if ((current_is_kswapd() || (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) && @@ -3200,15 +3257,15 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) set_bit(LRUVEC_CONGESTED, &target_lruvec->flags); /* - * Stall direct reclaim for IO completions if underlying BDIs - * and node is congested. Allow kswapd to continue until it + * Stall direct reclaim for IO completions if the lruvec is + * node is congested. Allow kswapd to continue until it * starts encountering unqueued dirty pages or cycling through * the LRU too quickly. 
*/ if (!current_is_kswapd() && current_may_throttle() && !sc->hibernation_mode && test_bit(LRUVEC_CONGESTED, &target_lruvec->flags)) - wait_iff_congested(BLK_RW_ASYNC, HZ/10); + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10); if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, sc)) @@ -4286,6 +4343,7 @@ static int kswapd(void *p) WRITE_ONCE(pgdat->kswapd_order, 0); WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES); + atomic_set(&pgdat->nr_writeback_throttled, 0); for ( ; ; ) { bool ret; diff --git a/mm/vmstat.c b/mm/vmstat.c index 8ce2620344b2..9b2bc9d61d4b 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1225,6 +1225,7 @@ const char * const vmstat_text[] = { "nr_vmscan_immediate_reclaim", "nr_dirtied", "nr_written", + "nr_throttled_written", "nr_kernel_misc_reclaimable", "nr_foll_pin_acquired", "nr_foll_pin_released", -- 2.31.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
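The wake-up rule in __acct_reclaim_writeback() above can be modelled in userspace. What follows is only a minimal sketch under stated assumptions: the sketch_* names and struct sketch_node are hypothetical stand-ins for the pgdat fields, the per-cpu counter is replaced by a plain integer, and the value 32 for SWAP_CLUSTER_MAX is assumed from mainline rather than taken from this patch.

```c
#include <stdbool.h>

/* Assumption: SWAP_CLUSTER_MAX is 32 in mainline; not stated in the patch. */
#define SKETCH_SWAP_CLUSTER_MAX 32UL

/* Hypothetical stand-in for the per-node fields the patch relies on. */
struct sketch_node {
	unsigned long nr_throttled_written; /* models NR_THROTTLED_WRITTEN */
	unsigned long nr_reclaim_start;     /* snapshot taken by the first throttler */
	int nr_writeback_throttled;         /* tasks currently throttled */
};

/* Models the accounting done before sleeping in reclaim_throttle(). */
static void sketch_throttle_enter(struct sketch_node *n)
{
	/* Only the first throttled task records the starting point. */
	if (++n->nr_writeback_throttled == 1)
		n->nr_reclaim_start = n->nr_throttled_written;
}

/* Models the accounting done after the sleep ends. */
static void sketch_throttle_exit(struct sketch_node *n)
{
	n->nr_writeback_throttled--;
}

/*
 * Models acct_reclaim_writeback() for one completed page. Returns true
 * when the kernel code would call wake_up() on the writeback queue.
 */
static bool sketch_writeback_done(struct sketch_node *n)
{
	int nr_throttled = n->nr_writeback_throttled;
	unsigned long nr_written;

	/* Completions are only counted while some task is throttled. */
	if (!nr_throttled)
		return false;

	n->nr_throttled_written++;
	nr_written = n->nr_throttled_written - n->nr_reclaim_start;

	/* Each throttled task must be "paid for" by more than one cluster. */
	return nr_written > SKETCH_SWAP_CLUSTER_MAX * (unsigned long)nr_throttled;
}
```

With one throttled task the model reports a wake-up on the 33rd completion; with two throttled tasks the threshold doubles. In the kernel the sleeper is also released by the timeout, which this sketch does not model.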
* [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated
  2021-10-22 14:46 [PATCH v5 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
  2021-10-22 14:46 ` [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
@ 2021-10-22 14:46 ` Mel Gorman
  2021-10-22 14:46 ` [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
  ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2021-10-22 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML, Mel Gorman

Page reclaim throttles on congestion if too many parallel reclaim
instances have isolated too many pages. This makes no sense; excessive
parallelisation has nothing to do with writeback or congestion.

This patch creates an additional wait queue to sleep on when too many
pages are isolated. The throttled tasks are woken when the number of
isolated pages is reduced or a timeout occurs. There may be some false
positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will
throttle again if necessary.

[shy828301@gmail.com: Wake up from compaction context] [vbabka@suse.cz: Account number of throttled tasks only for writeback] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- include/linux/mmzone.h | 1 + include/trace/events/vmscan.h | 4 +++- mm/compaction.c | 10 ++++++++-- mm/internal.h | 11 +++++++++++ mm/vmscan.c | 22 ++++++++++++++++------ 5 files changed, 39 insertions(+), 9 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 419304093610..9ccd8d95291b 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -275,6 +275,7 @@ enum lru_list { enum vmscan_throttle_state { VMSCAN_THROTTLE_WRITEBACK, + VMSCAN_THROTTLE_ISOLATED, NR_VMSCAN_THROTTLE, }; diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index c317f9fe0d17..d4905bd9e9c4 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -28,10 +28,12 @@ ) : "RECLAIM_WB_NONE" #define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK) +#define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED) #define show_throttle_flags(flags) \ (flags) ? 
__print_flags(flags, "|", \ - {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"} \ + {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \ + {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \ ) : "VMSCAN_THROTTLE_NONE" diff --git a/mm/compaction.c b/mm/compaction.c index bfc93da1c2c7..7359093d8ac0 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -761,6 +761,8 @@ isolate_freepages_range(struct compact_control *cc, /* Similar to reclaim, but different enough that they don't share logic */ static bool too_many_isolated(pg_data_t *pgdat) { + bool too_many; + unsigned long active, inactive, isolated; inactive = node_page_state(pgdat, NR_INACTIVE_FILE) + @@ -770,7 +772,11 @@ static bool too_many_isolated(pg_data_t *pgdat) isolated = node_page_state(pgdat, NR_ISOLATED_FILE) + node_page_state(pgdat, NR_ISOLATED_ANON); - return isolated > (inactive + active) / 2; + too_many = isolated > (inactive + active) / 2; + if (!too_many) + wake_throttle_isolated(pgdat); + + return too_many; } /** @@ -822,7 +828,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (cc->mode == MIGRATE_ASYNC) return -EAGAIN; - congestion_wait(BLK_RW_ASYNC, HZ/10); + reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10); if (fatal_signal_pending(current)) return -EINTR; diff --git a/mm/internal.h b/mm/internal.h index b495e60c955d..c72d3383ef34 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -45,6 +45,15 @@ static inline void acct_reclaim_writeback(struct page *page) __acct_reclaim_writeback(pgdat, page, nr_throttled); } +static inline void wake_throttle_isolated(pg_data_t *pgdat) +{ + wait_queue_head_t *wqh; + + wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED]; + if (waitqueue_active(wqh)) + wake_up(wqh); +} + vm_fault_t do_swap_page(struct vm_fault *vmf); void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, @@ -120,6 +129,8 @@ extern unsigned long highest_memmap_pfn; */ extern int isolate_lru_page(struct page 
*page); extern void putback_lru_page(struct page *page); +extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason, + long timeout); /* * in mm/rmap.c: diff --git a/mm/vmscan.c b/mm/vmscan.c index 0c1595065f49..1e54e636b927 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1006,12 +1006,12 @@ static void handle_write_error(struct address_space *mapping, unlock_page(page); } -static void -reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason, +void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason, long timeout) { wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason]; long ret; + bool acct_writeback = (reason == VMSCAN_THROTTLE_WRITEBACK); DEFINE_WAIT(wait); /* @@ -1023,7 +1023,8 @@ reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason, current->flags & (PF_IO_WORKER|PF_KTHREAD)) return; - if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) { + if (acct_writeback && + atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) { WRITE_ONCE(pgdat->nr_reclaim_start, node_page_state(pgdat, NR_THROTTLED_WRITTEN)); } @@ -1031,7 +1032,9 @@ reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason, prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); ret = schedule_timeout(timeout); finish_wait(wqh, &wait); - atomic_dec(&pgdat->nr_writeback_throttled); + + if (acct_writeback) + atomic_dec(&pgdat->nr_writeback_throttled); trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout), jiffies_to_usecs(timeout - ret), @@ -2176,6 +2179,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, struct scan_control *sc) { unsigned long inactive, isolated; + bool too_many; if (current_is_kswapd()) return 0; @@ -2199,7 +2203,13 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) inactive >>= 3; - return isolated > inactive; + too_many = isolated > inactive; + + /* Wake up tasks throttled due to 
too_many_isolated. */ + if (!too_many) + wake_throttle_isolated(pgdat); + + return too_many; } /* @@ -2308,8 +2318,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, return 0; /* wait a bit for the reclaimer. */ - msleep(100); stalled = true; + reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10); /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) -- 2.31.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
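The two too_many_isolated() checks touched by the patch above can be compared side by side in a small userspace model. This is a hedged sketch, not the kernel code: the sketch_* function names and the may_io_and_fs flag are hypothetical, and the page counts are passed in directly instead of being read with node_page_state().

```c
#include <stdbool.h>

/*
 * Reclaim-side check, modelled on too_many_isolated() in mm/vmscan.c.
 * may_io_and_fs stands in for "gfp_mask has both __GFP_IO and __GFP_FS";
 * such callers get a threshold eight times stricter.
 */
static bool sketch_reclaim_too_many_isolated(unsigned long inactive,
					     unsigned long isolated,
					     bool may_io_and_fs)
{
	if (may_io_and_fs)
		inactive >>= 3;

	return isolated > inactive;
}

/* Compaction-side check, modelled on too_many_isolated() in mm/compaction.c. */
static bool sketch_compact_too_many_isolated(unsigned long inactive,
					     unsigned long active,
					     unsigned long isolated)
{
	return isolated > (inactive + active) / 2;
}
```

The asymmetry is one way to read the "false positive wakeups" remark in the changelog: with 800 inactive and 200 isolated pages, a GFP_NOFS caller sees the check pass and wakes the queue, while a woken GFP_KERNEL task still exceeds its own limit of 100 and throttles again.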
* [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-10-22 14:46 [PATCH v5 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
  2021-10-22 14:46 ` [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
  2021-10-22 14:46 ` [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated Mel Gorman
@ 2021-10-22 14:46 ` Mel Gorman
  2021-11-24 1:19 ` Darrick J. Wong
  2021-10-22 14:46 ` [PATCH 4/8] mm/writeback: Throttle based on page writeback instead of congestion Mel Gorman
  ` (4 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2021-10-22 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML, Mel Gorman

Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense, it might be due to writeback or a host of
other factors.

For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
in the page allocator if it is failing to make progress. Kswapd
throttles if too many pages are under writeback and marked for
immediate reclaim.

This patch explicitly throttles if reclaim is failing to make progress.

[vbabka@suse.cz: Remove redundant code] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- include/linux/mmzone.h | 1 + include/trace/events/vmscan.h | 4 +++- mm/memcontrol.c | 10 +--------- mm/vmscan.c | 28 ++++++++++++++++++++++++++++ 4 files changed, 33 insertions(+), 10 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 9ccd8d95291b..00e305cfb3ec 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -276,6 +276,7 @@ enum lru_list { enum vmscan_throttle_state { VMSCAN_THROTTLE_WRITEBACK, VMSCAN_THROTTLE_ISOLATED, + VMSCAN_THROTTLE_NOPROGRESS, NR_VMSCAN_THROTTLE, }; diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index d4905bd9e9c4..f25a6149d3ba 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -29,11 +29,13 @@ #define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK) #define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED) +#define _VMSCAN_THROTTLE_NOPROGRESS (1 << VMSCAN_THROTTLE_NOPROGRESS) #define show_throttle_flags(flags) \ (flags) ? 
__print_flags(flags, "|", \ {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \ - {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \ + {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"}, \ + {_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"} \ ) : "VMSCAN_THROTTLE_NONE" diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6da5020a8656..8b33152c9b85 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3465,19 +3465,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) /* try to free all pages in this cgroup */ while (nr_retries && page_counter_read(&memcg->memory)) { - int progress; - if (signal_pending(current)) return -EINTR; - progress = try_to_free_mem_cgroup_pages(memcg, 1, - GFP_KERNEL, true); - if (!progress) { + if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true)) nr_retries--; - /* maybe some writeback is necessary */ - congestion_wait(BLK_RW_ASYNC, HZ/10); - } - } return 0; diff --git a/mm/vmscan.c b/mm/vmscan.c index 1e54e636b927..0450f6867d61 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3323,6 +3323,33 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx); } +static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc) +{ + /* If reclaim is making progress, wake any throttled tasks. */ + if (sc->nr_reclaimed) { + wait_queue_head_t *wqh; + + wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS]; + if (waitqueue_active(wqh)) + wake_up(wqh); + + return; + } + + /* + * Do not throttle kswapd on NOPROGRESS as it will throttle on + * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under + * writeback and marked for immediate reclaim at the tail of + * the LRU. + */ + if (current_is_kswapd()) + return; + + /* Throttle if making no progress at high prioities. 
*/ + if (sc->priority < DEF_PRIORITY - 2) + reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10); +} + /* * This is the direct reclaim path, for page-allocating processes. We only * try to reclaim pages from zones which will satisfy the caller's allocation @@ -3407,6 +3434,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) continue; last_pgdat = zone->zone_pgdat; shrink_node(zone->zone_pgdat, sc); + consider_reclaim_throttle(zone->zone_pgdat, sc); } /* -- 2.31.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
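The decision made by consider_reclaim_throttle() in the patch above reduces to three ordered checks. The sketch below models only that ordering; the sketch_* names and the enum are hypothetical, and the value 12 for DEF_PRIORITY is an assumption taken from mainline, not from this patch. Note that reclaim priority counts down from DEF_PRIORITY, so "priority < DEF_PRIORITY - 2" means the scan has already escalated several times.

```c
#include <stdbool.h>

/* Assumption: DEF_PRIORITY is 12 in mainline; not stated in the patch. */
#define SKETCH_DEF_PRIORITY 12

/* Hypothetical outcome labels for the three branches of the function. */
enum sketch_action {
	SKETCH_WAKE_WAITERS, /* progress made: wake NOPROGRESS sleepers */
	SKETCH_NOTHING,      /* neither wake nor sleep */
	SKETCH_THROTTLE      /* sleep on the NOPROGRESS wait queue */
};

static enum sketch_action sketch_consider_throttle(unsigned long nr_reclaimed,
						   bool is_kswapd,
						   int priority)
{
	/* Any reclaimed page counts as progress. */
	if (nr_reclaimed)
		return SKETCH_WAKE_WAITERS;

	/* kswapd throttles on writeback instead, never on NOPROGRESS. */
	if (is_kswapd)
		return SKETCH_NOTHING;

	/* Only throttle once the scan priority has dropped far enough. */
	if (priority < SKETCH_DEF_PRIORITY - 2)
		return SKETCH_THROTTLE;

	return SKETCH_NOTHING;
}
```

Under the assumed DEF_PRIORITY, a direct reclaimer that frees nothing is left alone at priorities 12, 11 and 10 and is throttled from priority 9 downwards. The kernel code additionally checks waitqueue_active() before waking, which the sketch omits.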
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-10-22 14:46 ` [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
@ 2021-11-24 1:19 ` Darrick J. Wong
  2021-11-24 1:49 ` Darrick J. Wong
  2021-11-24 10:32 ` Mel Gorman
  0 siblings, 2 replies; 23+ messages in thread
From: Darrick J. Wong @ 2021-11-24 1:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML

On Fri, Oct 22, 2021 at 03:46:46PM +0100, Mel Gorman wrote:
> Memcg reclaim throttles on congestion if no reclaim progress is made.
> This makes little sense, it might be due to writeback or a host of
> other factors.
>
> For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
> in the page allocator if it is failing to make progress. Kswapd
> throttles if too many pages are under writeback and marked for
> immediate reclaim.
>
> This patch explicitly throttles if reclaim is failing to make progress.

Hi Mel,

Ever since Christoph broke swapfiles, I've been carrying around a little
fstest in my dev tree[1] that tries to exercise paging things in and out
of a swapfile. Sadly I've been trapped in about three dozen customer
escalations for over a month, which means I haven't been able to do much
upstream in weeks. Like submit this test upstream. :(

Now that I've finally gotten around to trying out a 5.16-rc2 build, I
notice that the runtime of this test has gone from ~5s to 2 hours.
Among other things that it does, the test sets up a cgroup with a memory
controller limiting the memory usage to 25MB, then runs a program that
tries to dirty 50MB of memory. There's 2GB of memory in the VM, so
we're not running reclaim globally, but the cgroup gets throttled very
severely.

AFAICT the system is mostly idle, but it's difficult to tell because ps
and top also get stuck waiting for this cgroup for whatever reason. My
uninformed speculation is that usemem_and_swapoff takes a page fault
while dirtying the 50MB memory buffer, prepares to pull a page in from
swap, tries to evict another page to stay under the memcg limit, but
that decides that it's making no progress and calls
reclaim_throttle(..., VMSCAN_THROTTLE_NOPROGRESS).

The sleep is uninterruptible, so I can't even kill -9 fstests to shut it
down. Eventually we either finish the test or (for the mlock part) the
OOM killer actually kills the process, but this takes a very long time.

Any thoughts? For now I can just hack around this by skipping
reclaim_throttle if cgroup_reclaim() == true, but that's probably not
the correct fix. :)

--D

[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/commit/?h=test-swapfile-io&id=0d0ad843cea366d0ab0a7d8d984e5cd1deba5b43

>
> [vbabka@suse.cz: Remove redundant code]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/mmzone.h        |  1 +
>  include/trace/events/vmscan.h |  4 +++-
>  mm/memcontrol.c               | 10 +---------
>  mm/vmscan.c                   | 28 ++++++++++++++++++++++++++++
>  4 files changed, 33 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9ccd8d95291b..00e305cfb3ec 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -276,6 +276,7 @@ enum lru_list {
>  enum vmscan_throttle_state {
>  	VMSCAN_THROTTLE_WRITEBACK,
>  	VMSCAN_THROTTLE_ISOLATED,
> +	VMSCAN_THROTTLE_NOPROGRESS,
>  	NR_VMSCAN_THROTTLE,
>  };
>
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index d4905bd9e9c4..f25a6149d3ba 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -29,11 +29,13 @@
>
>  #define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
>  #define
_VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED) > +#define _VMSCAN_THROTTLE_NOPROGRESS (1 << VMSCAN_THROTTLE_NOPROGRESS) > > #define show_throttle_flags(flags) \ > (flags) ? __print_flags(flags, "|", \ > {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \ > - {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \ > + {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"}, \ > + {_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"} \ > ) : "VMSCAN_THROTTLE_NONE" > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 6da5020a8656..8b33152c9b85 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3465,19 +3465,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) > > /* try to free all pages in this cgroup */ > while (nr_retries && page_counter_read(&memcg->memory)) { > - int progress; > - > if (signal_pending(current)) > return -EINTR; > > - progress = try_to_free_mem_cgroup_pages(memcg, 1, > - GFP_KERNEL, true); > - if (!progress) { > + if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true)) > nr_retries--; > - /* maybe some writeback is necessary */ > - congestion_wait(BLK_RW_ASYNC, HZ/10); > - } > - > } > > return 0; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 1e54e636b927..0450f6867d61 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -3323,6 +3323,33 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) > return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx); > } > > +static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc) > +{ > + /* If reclaim is making progress, wake any throttled tasks. 
*/ > + if (sc->nr_reclaimed) { > + wait_queue_head_t *wqh; > + > + wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS]; > + if (waitqueue_active(wqh)) > + wake_up(wqh); > + > + return; > + } > + > + /* > + * Do not throttle kswapd on NOPROGRESS as it will throttle on > + * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under > + * writeback and marked for immediate reclaim at the tail of > + * the LRU. > + */ > + if (current_is_kswapd()) > + return; > + > + /* Throttle if making no progress at high prioities. */ > + if (sc->priority < DEF_PRIORITY - 2) > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10); > +} > + > /* > * This is the direct reclaim path, for page-allocating processes. We only > * try to reclaim pages from zones which will satisfy the caller's allocation > @@ -3407,6 +3434,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > continue; > last_pgdat = zone->zone_pgdat; > shrink_node(zone->zone_pgdat, sc); > + consider_reclaim_throttle(zone->zone_pgdat, sc); > } > > /* > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made 2021-11-24 1:19 ` Darrick J. Wong @ 2021-11-24 1:49 ` Darrick J. Wong 2021-11-24 14:35 ` Mel Gorman 2021-11-24 10:32 ` Mel Gorman 1 sibling, 1 reply; 23+ messages in thread From: Darrick J. Wong @ 2021-11-24 1:49 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, NeilBrown, Theodore Ts'o, Andreas Dilger, Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM, Linux-fsdevel, LKML On Tue, Nov 23, 2021 at 05:19:12PM -0800, Darrick J. Wong wrote: > On Fri, Oct 22, 2021 at 03:46:46PM +0100, Mel Gorman wrote: > > Memcg reclaim throttles on congestion if no reclaim progress is made. > > This makes little sense, it might be due to writeback or a host of > > other factors. > > > > For !memcg reclaim, it's messy. Direct reclaim primarily is throttled > > in the page allocator if it is failing to make progress. Kswapd > > throttles if too many pages are under writeback and marked for > > immediate reclaim. > > > > This patch explicitly throttles if reclaim is failing to make progress. > > Hi Mel, > > Ever since Christoph broke swapfiles, I've been carrying around a little > fstest in my dev tree[1] that tries to exercise paging things in and out > of a swapfile. Sadly I've been trapped in about three dozen customer > escalations for over a month, which means I haven't been able to do much > upstream in weeks. Like submit this test upstream. :( > > Now that I've finally gotten around to trying out a 5.16-rc2 build, I > notice that the runtime of this test has gone from ~5s to 2 hours. > Among other things that it does, the test sets up a cgroup with a memory > controller limiting the memory usage to 25MB, then runs a program that > tries to dirty 50MB of memory. There's 2GB of memory in the VM, so > we're not running reclaim globally, but the cgroup gets throttled very > severely. 
>
> AFAICT the system is mostly idle, but it's difficult to tell because ps
> and top also get stuck waiting for this cgroup for whatever reason. My
> uninformed speculation is that usemem_and_swapoff takes a page fault
> while dirtying the 50MB memory buffer, prepares to pull a page in from
> swap, tries to evict another page to stay under the memcg limit, but
> that decides that it's making no progress and calls
> reclaim_throttle(..., VMSCAN_THROTTLE_NOPROGRESS).
>
> The sleep is uninterruptible, so I can't even kill -9 fstests to shut it
> down. Eventually we either finish the test or (for the mlock part) the
> OOM killer actually kills the process, but this takes a very long time.
>
> Any thoughts? For now I can just hack around this by skipping
> reclaim_throttle if cgroup_reclaim() == true, but that's probably not
> the correct fix. :)

Update: after adding timing information to usemem_and_swapoff, it looks
like dirtying the 50MB buffer takes ~22s (up from 0.06s on 5.15). The
mlock call stalls for ~280s until the OOM killer kills it (up from
nearly instantaneous on 5.15), and the swapon/swapoff variant takes
20 minutes to hours depending on the run.

--D > --D > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/commit/?h=test-swapfile-io&id=0d0ad843cea366d0ab0a7d8d984e5cd1deba5b43 > > > > > [vbabka@suse.cz: Remove redundant code] > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > > Acked-by: Vlastimil Babka <vbabka@suse.cz> > > --- > > include/linux/mmzone.h | 1 + > > include/trace/events/vmscan.h | 4 +++- > > mm/memcontrol.c | 10 +--------- > > mm/vmscan.c | 28 ++++++++++++++++++++++++++++ > > 4 files changed, 33 insertions(+), 10 deletions(-) > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 9ccd8d95291b..00e305cfb3ec 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -276,6 +276,7 @@ enum lru_list { > > enum vmscan_throttle_state { > > VMSCAN_THROTTLE_WRITEBACK, > > VMSCAN_THROTTLE_ISOLATED, > > + VMSCAN_THROTTLE_NOPROGRESS, > > NR_VMSCAN_THROTTLE, > > }; > > > > diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h > > index d4905bd9e9c4..f25a6149d3ba 100644 > > --- a/include/trace/events/vmscan.h > > +++ b/include/trace/events/vmscan.h > > @@ -29,11 +29,13 @@ > > > > #define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK) > > #define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED) > > +#define _VMSCAN_THROTTLE_NOPROGRESS (1 << VMSCAN_THROTTLE_NOPROGRESS) > > > > #define show_throttle_flags(flags) \ > > (flags) ? 
__print_flags(flags, "|", \ > > {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \ > > - {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \ > > + {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"}, \ > > + {_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"} \ > > ) : "VMSCAN_THROTTLE_NONE" > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 6da5020a8656..8b33152c9b85 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -3465,19 +3465,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) > > > > /* try to free all pages in this cgroup */ > > while (nr_retries && page_counter_read(&memcg->memory)) { > > - int progress; > > - > > if (signal_pending(current)) > > return -EINTR; > > > > - progress = try_to_free_mem_cgroup_pages(memcg, 1, > > - GFP_KERNEL, true); > > - if (!progress) { > > + if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true)) > > nr_retries--; > > - /* maybe some writeback is necessary */ > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > - } > > - > > } > > > > return 0; > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 1e54e636b927..0450f6867d61 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -3323,6 +3323,33 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) > > return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx); > > } > > > > +static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc) > > +{ > > + /* If reclaim is making progress, wake any throttled tasks. 
*/ > > + if (sc->nr_reclaimed) { > > + wait_queue_head_t *wqh; > > + > > + wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS]; > > + if (waitqueue_active(wqh)) > > + wake_up(wqh); > > + > > + return; > > + } > > + > > + /* > > + * Do not throttle kswapd on NOPROGRESS as it will throttle on > > + * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under > > + * writeback and marked for immediate reclaim at the tail of > > + * the LRU. > > + */ > > + if (current_is_kswapd()) > > + return; > > + > > + /* Throttle if making no progress at high prioities. */ > > + if (sc->priority < DEF_PRIORITY - 2) > > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10); > > +} > > + > > /* > > * This is the direct reclaim path, for page-allocating processes. We only > > * try to reclaim pages from zones which will satisfy the caller's allocation > > @@ -3407,6 +3434,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > > continue; > > last_pgdat = zone->zone_pgdat; > > shrink_node(zone->zone_pgdat, sc); > > + consider_reclaim_throttle(zone->zone_pgdat, sc); > > } > > > > /* > > -- > > 2.31.1 > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-11-24 1:49 ` Darrick J. Wong
@ 2021-11-24 14:35 ` Mel Gorman
  2021-11-24 18:02 ` Darrick J. Wong
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2021-11-24 14:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrew Morton, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML

On Tue, Nov 23, 2021 at 05:49:14PM -0800, Darrick J. Wong wrote:
> > Ever since Christoph broke swapfiles, I've been carrying around a little
> > fstest in my dev tree[1] that tries to exercise paging things in and out
> > of a swapfile. Sadly I've been trapped in about three dozen customer
> > escalations for over a month, which means I haven't been able to do much
> > upstream in weeks. Like submit this test upstream. :(
> >
> > Now that I've finally gotten around to trying out a 5.16-rc2 build, I
> > notice that the runtime of this test has gone from ~5s to 2 hours.
> > Among other things that it does, the test sets up a cgroup with a memory
> > controller limiting the memory usage to 25MB, then runs a program that
> > tries to dirty 50MB of memory. There's 2GB of memory in the VM, so
> > we're not running reclaim globally, but the cgroup gets throttled very
> > severely.
> >
> > AFAICT the system is mostly idle, but it's difficult to tell because ps
> > and top also get stuck waiting for this cgroup for whatever reason. My
> > uninformed speculation is that usemem_and_swapoff takes a page fault
> > while dirtying the 50MB memory buffer, prepares to pull a page in from
> > swap, tries to evict another page to stay under the memcg limit, but
> > that decides that it's making no progress and calls
> > reclaim_throttle(..., VMSCAN_THROTTLE_NOPROGRESS).
> >
> > The sleep is uninterruptible, so I can't even kill -9 fstests to shut it
> > down.
Eventually we either finish the test or (for the mlock part) the > > OOM killer actually kills the process, but this takes a very long time. > > > > Any thoughts? For now I can just hack around this by skipping > > reclaim_throttle if cgroup_reclaim() == true, but that's probably not > > the correct fix. :) > > Update: after adding timing information to usemem_and_swapoff, it looks > like dirtying the 50MB buffer takes ~22s (up from 0.06s on 5.15). The > mlock call stalls for ~280s until the OOM killer kills it (up from > nearly instantaneous on 5.15), and the swapon/swapoff variant takes > 20 minutes to hours depending on the run. > Can you try the patch below please? I think I'm running the test correctly and it finishes for me in 16 seconds with this applied diff --git a/mm/vmscan.c b/mm/vmscan.c index 07db03883062..d9166e94eb95 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1057,7 +1057,17 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason) break; case VMSCAN_THROTTLE_NOPROGRESS: - timeout = HZ/2; + timeout = 1; + + /* + * If kswapd is disabled, reschedule if necessary but do not + * throttle as the system is likely near OOM. + */ + if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) { + cond_resched(); + return; + } + break; case VMSCAN_THROTTLE_ISOLATED: timeout = HZ/50; @@ -3395,7 +3405,7 @@ static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc) return; /* Throttle if making no progress at high prioities. 
*/ - if (sc->priority < DEF_PRIORITY - 2) + if (sc->priority < DEF_PRIORITY - 2 && !sc->nr_reclaimed) reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS); } @@ -3415,6 +3425,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) unsigned long nr_soft_scanned; gfp_t orig_mask; pg_data_t *last_pgdat = NULL; + pg_data_t *first_pgdat = NULL; /* * If the number of buffer_heads in the machine exceeds the maximum @@ -3478,14 +3489,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) /* need some check for avoid more shrink_zone() */ } + if (!first_pgdat) + first_pgdat = zone->zone_pgdat; + /* See comment about same check for global reclaim above */ if (zone->zone_pgdat == last_pgdat) continue; last_pgdat = zone->zone_pgdat; shrink_node(zone->zone_pgdat, sc); - consider_reclaim_throttle(zone->zone_pgdat, sc); } + consider_reclaim_throttle(first_pgdat, sc); + /* * Restore to original mask to avoid the impact on the caller if we * promoted it to __GFP_HIGHMEM. ^ permalink raw reply related [flat|nested] 23+ messages in thread
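Editorial sketch, not part of the posted patch: the NOPROGRESS decision that the diff above appears to encode can be modelled in plain userspace C. Every name here (struct reclaim_model, enum throttle_action, noprogress_action()) is a hypothetical stand-in, the two constants are hard-coded assumptions, and the early-return checks are copied from the patch 3/8 hunk quoted earlier in the thread, which other patches in the series may have changed.

```c
#include <stdbool.h>

/* Assumed values, hard-coded for illustration only. */
#define MODEL_DEF_PRIORITY        12
#define MODEL_MAX_RECLAIM_RETRIES 16

/* Hypothetical stand-in for the few fields the patch reads. */
struct reclaim_model {
	int priority;               /* lower value means more aggressive reclaim */
	unsigned long nr_reclaimed; /* pages reclaimed so far in this pass */
	int kswapd_failures;        /* consecutive kswapd failures on the node */
	bool is_kswapd;             /* true when the caller is kswapd */
};

enum throttle_action {
	ACTION_NONE,            /* carry on without sleeping */
	ACTION_WAKE_WAITERS,    /* progress was made: wake NOPROGRESS sleepers */
	ACTION_RESCHED_ONLY,    /* kswapd has given up: cond_resched(), no sleep */
	ACTION_SLEEP_ONE_JIFFY, /* sleep for at most one jiffy or until woken */
};

static enum throttle_action noprogress_action(const struct reclaim_model *rm)
{
	/* Progress was made, so nobody needs to sleep on NOPROGRESS. */
	if (rm->nr_reclaimed)
		return ACTION_WAKE_WAITERS;

	/* kswapd is throttled on writeback elsewhere, never here. */
	if (rm->is_kswapd)
		return ACTION_NONE;

	/* Only throttle once the scan priority has dropped far enough. */
	if (rm->priority >= MODEL_DEF_PRIORITY - 2)
		return ACTION_NONE;

	/* Likely near OOM: yield the CPU but do not delay the OOM decision. */
	if (rm->kswapd_failures >= MODEL_MAX_RECLAIM_RETRIES)
		return ACTION_RESCHED_ONLY;

	return ACTION_SLEEP_ONE_JIFFY;
}
```

Under these assumptions a direct reclaimer at priority 9 that reclaimed nothing sleeps for one jiffy, and the same reclaimer on a node where kswapd has already failed 16 times only reschedules. The sketch leaves out the checks at the top of reclaim_throttle() itself.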
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-11-24 14:35 ` Mel Gorman
@ 2021-11-24 18:02 ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2021-11-24 18:02 UTC
  To: Mel Gorman
  Cc: Andrew Morton, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML

On Wed, Nov 24, 2021 at 02:35:59PM +0000, Mel Gorman wrote:
> On Tue, Nov 23, 2021 at 05:49:14PM -0800, Darrick J. Wong wrote:
> > > Ever since Christoph broke swapfiles, I've been carrying around a little
> > > fstest in my dev tree[1] that tries to exercise paging things in and out
> > > of a swapfile. Sadly I've been trapped in about three dozen customer
> > > escalations for over a month, which means I haven't been able to do much
> > > upstream in weeks. Like submit this test upstream. :(
> > >
> > > Now that I've finally gotten around to trying out a 5.16-rc2 build, I
> > > notice that the runtime of this test has gone from ~5s to 2 hours.
> > > Among other things that it does, the test sets up a cgroup with a memory
> > > controller limiting the memory usage to 25MB, then runs a program that
> > > tries to dirty 50MB of memory. There's 2GB of memory in the VM, so
> > > we're not running reclaim globally, but the cgroup gets throttled very
> > > severely.
> > >
> > > AFAICT the system is mostly idle, but it's difficult to tell because ps
> > > and top also get stuck waiting for this cgroup for whatever reason. My
> > > uninformed speculation is that usemem_and_swapoff takes a page fault
> > > while dirtying the 50MB memory buffer, prepares to pull a page in from
> > > swap, tries to evict another page to stay under the memcg limit, but
> > > that decides that it's making no progress and calls
> > > reclaim_throttle(..., VMSCAN_THROTTLE_NOPROGRESS).
> > >
> > > The sleep is uninterruptible, so I can't even kill -9 fstests to shut it
> > > down. Eventually we either finish the test or (for the mlock part) the
> > > OOM killer actually kills the process, but this takes a very long time.
> > >
> > > Any thoughts? For now I can just hack around this by skipping
> > > reclaim_throttle if cgroup_reclaim() == true, but that's probably not
> > > the correct fix. :)
> >
> > Update: after adding timing information to usemem_and_swapoff, it looks
> > like dirtying the 50MB buffer takes ~22s (up from 0.06s on 5.15). The
> > mlock call stalls for ~280s until the OOM killer kills it (up from
> > nearly instantaneous on 5.15), and the swapon/swapoff variant takes
> > 20 minutes to hours depending on the run.
> >
>
> Can you try the patch below please? I think I'm running the test
> correctly and it finishes for me in 16 seconds with this applied

20 seconds here, but this /does/ fix the problem. Thank you!

Tested-by: Darrick J. Wong <djwong@kernel.org>

--D

>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 07db03883062..d9166e94eb95 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1057,7 +1057,17 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
>
>  		break;
>  	case VMSCAN_THROTTLE_NOPROGRESS:
> -		timeout = HZ/2;
> +		timeout = 1;
> +
> +		/*
> +		 * If kswapd is disabled, reschedule if necessary but do not
> +		 * throttle as the system is likely near OOM.
> +		 */
> +		if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) {
> +			cond_resched();
> +			return;
> +		}
> +
>  		break;
>  	case VMSCAN_THROTTLE_ISOLATED:
>  		timeout = HZ/50;
> @@ -3395,7 +3405,7 @@ static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
>  		return;
>
>  	/* Throttle if making no progress at high prioities. */
> -	if (sc->priority < DEF_PRIORITY - 2)
> +	if (sc->priority < DEF_PRIORITY - 2 && !sc->nr_reclaimed)
>  		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
>  }
>
> @@ -3415,6 +3425,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  	unsigned long nr_soft_scanned;
>  	gfp_t orig_mask;
>  	pg_data_t *last_pgdat = NULL;
> +	pg_data_t *first_pgdat = NULL;
>
>  	/*
>  	 * If the number of buffer_heads in the machine exceeds the maximum
> @@ -3478,14 +3489,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  			/* need some check for avoid more shrink_zone() */
>  		}
>
> +		if (!first_pgdat)
> +			first_pgdat = zone->zone_pgdat;
> +
>  		/* See comment about same check for global reclaim above */
>  		if (zone->zone_pgdat == last_pgdat)
>  			continue;
>  		last_pgdat = zone->zone_pgdat;
>  		shrink_node(zone->zone_pgdat, sc);
> -		consider_reclaim_throttle(zone->zone_pgdat, sc);
>  	}
>
> +	consider_reclaim_throttle(first_pgdat, sc);
> +
>  	/*
>  	 * Restore to original mask to avoid the impact on the caller if we
>  	 * promoted it to __GFP_HIGHMEM.
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-11-24  1:19 ` Darrick J. Wong
  2021-11-24  1:49 ` Darrick J. Wong
@ 2021-11-24 10:32 ` Mel Gorman
  2021-11-24 10:43 ` Vlastimil Babka
  2021-11-24 17:24 ` Mike Galbraith
  1 sibling, 2 replies; 23+ messages in thread
From: Mel Gorman @ 2021-11-24 10:32 UTC
  To: Darrick J. Wong
  Cc: Andrew Morton, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML

On Tue, Nov 23, 2021 at 05:19:12PM -0800, Darrick J. Wong wrote:
> On Fri, Oct 22, 2021 at 03:46:46PM +0100, Mel Gorman wrote:
> > Memcg reclaim throttles on congestion if no reclaim progress is made.
> > This makes little sense, it might be due to writeback or a host of
> > other factors.
> >
> > For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
> > in the page allocator if it is failing to make progress. Kswapd
> > throttles if too many pages are under writeback and marked for
> > immediate reclaim.
> >
> > This patch explicitly throttles if reclaim is failing to make progress.
>
> Hi Mel,
>
> Ever since Christoph broke swapfiles, I've been carrying around a little
> fstest in my dev tree[1] that tries to exercise paging things in and out
> of a swapfile. Sadly I've been trapped in about three dozen customer
> escalations for over a month, which means I haven't been able to do much
> upstream in weeks. Like submit this test upstream. :(
>
> Now that I've finally gotten around to trying out a 5.16-rc2 build, I
> notice that the runtime of this test has gone from ~5s to 2 hours.
> Among other things that it does, the test sets up a cgroup with a memory
> controller limiting the memory usage to 25MB, then runs a program that
> tries to dirty 50MB of memory. There's 2GB of memory in the VM, so
> we're not running reclaim globally, but the cgroup gets throttled very
> severely.
>

Ok, so this test cannot make progress until some of the cgroup pages get
cleaned. What is the expectation for the test? Should it OOM or do you
expect it to have spin-like behaviour until some writeback completes?
I'm guessing you'd prefer it to spin and right now it's sleeping far
too much.

> AFAICT the system is mostly idle, but it's difficult to tell because ps
> and top also get stuck waiting for this cgroup for whatever reason.

But this is surprising because I expect that ps and top are not running
within the cgroup. Was /proc/PID/stack readable?

> My
> uninformed speculation is that usemem_and_swapoff takes a page fault
> while dirtying the 50MB memory buffer, prepares to pull a page in from
> swap, tries to evict another page to stay under the memcg limit, but
> that decides that it's making no progress and calls
> reclaim_throttle(..., VMSCAN_THROTTLE_NOPROGRESS).
>
> The sleep is uninterruptible, so I can't even kill -9 fstests to shut it
> down. Eventually we either finish the test or (for the mlock part) the
> OOM killer actually kills the process, but this takes a very long time.
>

The sleep can be interruptible.

> Any thoughts? For now I can just hack around this by skipping
> reclaim_throttle if cgroup_reclaim() == true, but that's probably not
> the correct fix. :)
>

No, it wouldn't be but a possibility is throttling for only 1 jiffy if
reclaiming within a memcg and the zone is balanced overall.

The interruptible part should just be the patch below. I need to poke at
the cgroup limit part a bit

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..07db03883062 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1068,7 +1068,7 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 		break;
 	}
 
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
 	ret = schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
 
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-11-24 10:32 ` Mel Gorman
@ 2021-11-24 10:43 ` Vlastimil Babka
  2021-11-24 10:53 ` Mel Gorman
  2021-11-24 17:24 ` Mike Galbraith
  1 sibling, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2021-11-24 10:43 UTC
  To: Mel Gorman, Darrick J. Wong
  Cc: Andrew Morton, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Johannes Weiner, Jonathan Corbet, Linux-MM, Linux-fsdevel, LKML

On 11/24/21 11:32, Mel Gorman wrote:
> On Tue, Nov 23, 2021 at 05:19:12PM -0800, Darrick J. Wong wrote:
>> On Fri, Oct 22, 2021 at 03:46:46PM +0100, Mel Gorman wrote:
>> > Memcg reclaim throttles on congestion if no reclaim progress is made.
>> > This makes little sense, it might be due to writeback or a host of
>> > other factors.
>> >
>> > For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
>> > in the page allocator if it is failing to make progress. Kswapd
>> > throttles if too many pages are under writeback and marked for
>> > immediate reclaim.
>> >
>> > This patch explicitly throttles if reclaim is failing to make progress.
>>
>> Hi Mel,
>>
>> Ever since Christoph broke swapfiles, I've been carrying around a little
>> fstest in my dev tree[1] that tries to exercise paging things in and out
>> of a swapfile. Sadly I've been trapped in about three dozen customer
>> escalations for over a month, which means I haven't been able to do much
>> upstream in weeks. Like submit this test upstream. :(
>>
>> Now that I've finally gotten around to trying out a 5.16-rc2 build, I
>> notice that the runtime of this test has gone from ~5s to 2 hours.
>> Among other things that it does, the test sets up a cgroup with a memory
>> controller limiting the memory usage to 25MB, then runs a program that
>> tries to dirty 50MB of memory. There's 2GB of memory in the VM, so
>> we're not running reclaim globally, but the cgroup gets throttled very
>> severely.
>>
>
> Ok, so this test cannot make progress until some of the cgroup pages get
> cleaned. What is the expectation for the test? Should it OOM or do you
> expect it to have spin-like behaviour until some writeback completes?
> I'm guessing you'd prefer it to spin and right now it's sleeping far
> too much.
>
>> AFAICT the system is mostly idle, but it's difficult to tell because ps
>> and top also get stuck waiting for this cgroup for whatever reason.
>
> But this is surprising because I expect that ps and top are not running
> within the cgroup. Was /proc/PID/stack readable?
>
>> My
>> uninformed speculation is that usemem_and_swapoff takes a page fault
>> while dirtying the 50MB memory buffer, prepares to pull a page in from
>> swap, tries to evict another page to stay under the memcg limit, but
>> that decides that it's making no progress and calls
>> reclaim_throttle(..., VMSCAN_THROTTLE_NOPROGRESS).
>>
>> The sleep is uninterruptible, so I can't even kill -9 fstests to shut it
>> down. Eventually we either finish the test or (for the mlock part) the
>> OOM killer actually kills the process, but this takes a very long time.
>>
>
> The sleep can be interruptible.
>
>> Any thoughts? For now I can just hack around this by skipping
>> reclaim_throttle if cgroup_reclaim() == true, but that's probably not
>> the correct fix. :)
>>
>
> No, it wouldn't be but a possibility is throttling for only 1 jiffy if
> reclaiming within a memcg and the zone is balanced overall.
>
> The interruptible part should just be the patch below. I need to poke at
> the cgroup limit part a bit

As the throttle timeout is short anyway, will the TASK_UNINTERRUPTIBLE vs
TASK_INTERRUPTIBLE make a difference for the ability to kill? AFAIU
typically this inability to kill is because of a loop that doesn't check for
fatal_signal_pending().

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fb9584641ac7..07db03883062 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1068,7 +1068,7 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
>  		break;
>  	}
>
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
>  	ret = schedule_timeout(timeout);
>  	finish_wait(wqh, &wait);
>
>
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-11-24 10:43 ` Vlastimil Babka
@ 2021-11-24 10:53 ` Mel Gorman
  0 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2021-11-24 10:53 UTC
  To: Vlastimil Babka
  Cc: Darrick J. Wong, Andrew Morton, NeilBrown, Theodore Ts'o,
	Andreas Dilger, Matthew Wilcox, Michal Hocko, Dave Chinner,
	Rik van Riel, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML

On Wed, Nov 24, 2021 at 11:43:05AM +0100, Vlastimil Babka wrote:
> >> Any thoughts? For now I can just hack around this by skipping
> >> reclaim_throttle if cgroup_reclaim() == true, but that's probably not
> >> the correct fix. :)
> >>
> >
> > No, it wouldn't be but a possibility is throttling for only 1 jiffy if
> > reclaiming within a memcg and the zone is balanced overall.
> >
> > The interruptible part should just be the patch below. I need to poke at
> > the cgroup limit part a bit
>
> As the throttle timeout is short anyway, will the TASK_UNINTERRUPTIBLE vs
> TASK_INTERRUPTIBLE make a difference for the ability to kill? AFAIU
> typically this inability to kill is because of a loop that doesn't check for
> fatal_signal_pending().
>

Yep, and the fatal_signal_pending() is lacking within reclaim in general
but I'm undecided on how much that should change in the context of reclaim
throttling but at minimum, I don't want the signal delivery to be masked
or delayed.

--
Mel Gorman
SUSE Labs
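Editorial sketch, not from the thread: a toy userspace model of the point raised above, that making a short sleep interruptible does little unless the surrounding retry loop also checks for a fatal signal. The function attempts_until_exit() and its parameters are hypothetical; nothing here is kernel code.

```c
#include <stdbool.h>

/*
 * Toy model: counts how many reclaim attempts a task makes before it stops.
 * signal_at is the attempt index at which a fatal signal becomes pending
 * (negative for never). check_signal says whether the loop tests for it,
 * standing in for a fatal_signal_pending() check.
 */
static int attempts_until_exit(int max_attempts, int signal_at, bool check_signal)
{
	int attempts = 0;

	for (int i = 0; i < max_attempts; i++) {
		attempts++;

		/*
		 * A short throttle sleep would go here. Making it
		 * interruptible only shortens this one sleep; it does
		 * not end the loop.
		 */

		if (check_signal && signal_at >= 0 && i >= signal_at)
			break;
	}

	return attempts;
}
```

In this model, a loop of 100 attempts with a signal arriving at attempt index 3 runs all 100 attempts when it never checks, and stops after 4 when it does.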
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-11-24 10:32 ` Mel Gorman
  2021-11-24 10:43 ` Vlastimil Babka
@ 2021-11-24 17:24 ` Mike Galbraith
  1 sibling, 0 replies; 23+ messages in thread
From: Mike Galbraith @ 2021-11-24 17:24 UTC
  To: Mel Gorman, Darrick J. Wong
  Cc: Andrew Morton, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML

On Wed, 2021-11-24 at 10:32 +0000, Mel Gorman wrote:
> On Tue, Nov 23, 2021 at 05:19:12PM -0800, Darrick J. Wong wrote:
>
> > AFAICT the system is mostly idle, but it's difficult to tell because ps
> > and top also get stuck waiting for this cgroup for whatever reason.
>
> But this is surprising because I expect that ps and top are not running
> within the cgroup. Was /proc/PID/stack readable?

Probably this.

crash> ps | grep UN
   4418   4417   4  ffff8881cae66e40  UN   0.0    7620    980  memcg_test_1 <== the bad guy
   4419   4417   6  ffff8881cae62f40  UN   0.0    7620    980  memcg_test_1
   4420   4417   5  ffff8881cae65e80  UN   0.0    7620    980  memcg_test_1
   4421   4417   7  ffff8881cae63f00  UN   0.0    7620    980  memcg_test_1
   4422   4417   4  ffff8881cae60000  UN   0.0    7620    980  memcg_test_1
   4423   4417   3  ffff888128985e80  UN   0.0    7620    980  memcg_test_1
   4424   4417   7  ffff888117f79f80  UN   0.0    7620    980  memcg_test_1
   4425   4417   2  ffff888117f7af40  UN   0.0    7620    980  memcg_test_1
   4428   2791   6  ffff8881a8253f00  UN   0.0   38868   3568  ps
   4429   2808   4  ffff888100c90000  UN   0.0   38868   3600  ps
crash> bt -sx 4429
PID: 4429   TASK: ffff888100c90000   CPU: 4   COMMAND: "ps"
 #0 [ffff8881af1c3ce0] __schedule+0x285 at ffffffff817ae6c5
 #1 [ffff8881af1c3d68] schedule+0x3a at ffffffff817aed4a
 #2 [ffff8881af1c3d78] rwsem_down_read_slowpath+0x197 at ffffffff817b11a7
 #3 [ffff8881af1c3e08] down_read_killable+0x5c at ffffffff817b142c
 #4 [ffff8881af1c3e18] down_read_killable+0x5c at ffffffff817b142c
 #5 [ffff8881af1c3e28] __access_remote_vm+0x3f at ffffffff8120131f
 #6 [ffff8881af1c3e90] proc_pid_cmdline_read+0x148 at ffffffff812fc9a8
 #7 [ffff8881af1c3ee8] vfs_read+0x92 at ffffffff8126a302
 #8 [ffff8881af1c3f00] ksys_read+0x7d at ffffffff8126a72d
 #9 [ffff8881af1c3f38] do_syscall_64+0x37 at ffffffff817a3f57
#10 [ffff8881af1c3f50] entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8180007c
    RIP: 00007f4b50fe8b5e  RSP: 00007ffdd7f6fe38  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00007f4b5186a010  RCX: 00007f4b50fe8b5e
    RDX: 0000000000020000  RSI: 00007f4b5186a010  RDI: 0000000000000006
    RBP: 0000000000020000   R8: 0000000000000007   R9: 00000000ffffffff
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007f4b5186a010
    R13: 0000000000000000  R14: 0000000000000006  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
crash> mm_struct -x ffff8881021b4800
struct mm_struct {
  {
    mmap = 0xffff8881ccfe6a80,
    mm_rb = {
      rb_node = 0xffff8881ccfe61a0
    },
...
    mmap_lock = {
      count = {
        counter = 0x3
      },
      owner = {
        counter = 0xffff8881cae66e40
...
crash> bt 0xffff8881cae66e40
PID: 4418   TASK: ffff8881cae66e40   CPU: 4   COMMAND: "memcg_test_1"
 #0 [ffff888154097a88] __schedule at ffffffff817ae6c5
 #1 [ffff888154097b10] schedule at ffffffff817aed4a
 #2 [ffff888154097b20] schedule_timeout at ffffffff817b311f
 #3 [ffff888154097b90] reclaim_throttle at ffffffff811d802b
 #4 [ffff888154097bf0] do_try_to_free_pages at ffffffff811da206
 #5 [ffff888154097c40] try_to_free_mem_cgroup_pages at ffffffff811db522
 #6 [ffff888154097cd0] try_charge_memcg at ffffffff81256440
 #7 [ffff888154097d60] obj_cgroup_charge_pages at ffffffff81256c97
 #8 [ffff888154097d88] obj_cgroup_charge at ffffffff8125898c
 #9 [ffff888154097da8] kmem_cache_alloc at ffffffff81242099
#10 [ffff888154097de0] vm_area_alloc at ffffffff8106c87a
#11 [ffff888154097df0] mmap_region at ffffffff812082b2
#12 [ffff888154097e58] do_mmap at ffffffff81208922
#13 [ffff888154097eb0] vm_mmap_pgoff at ffffffff811e259f
#14 [ffff888154097f38] do_syscall_64 at ffffffff817a3f57
#15 [ffff888154097f50] entry_SYSCALL_64_after_hwframe at ffffffff8180007c
    RIP: 00007f211c36b743  RSP: 00007ffeaac1bd58  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f211c36b743
    RDX: 0000000000000003  RSI: 0000000000001000  RDI: 0000000000000000
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000002022  R11: 0000000000000246  R12: 0000000000000003
    R13: 0000000000001000  R14: 0000000000002022  R15: 0000000000000000
    ORIG_RAX: 0000000000000009  CS: 0033  SS: 002b
crash>
* [PATCH 4/8] mm/writeback: Throttle based on page writeback instead of congestion
  2021-10-22 14:46 [PATCH v5 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
  ` (2 preceding siblings ...)
  2021-10-22 14:46 ` [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
@ 2021-10-22 14:46 ` Mel Gorman
  2021-10-22 14:46 ` [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator Mel Gorman
  ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2021-10-22 14:46 UTC
  To: Andrew Morton
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML, Mel Gorman

do_writepages throttles on congestion if the writepages() fails due to a
lack of memory but congestion_wait() is partially broken as the congestion
state is not updated for all BDIs.

This patch stalls waiting for a number of pages to complete writeback
that are located on the local node. The main weakness is that there is no
correlation between the location of the inode's pages and locality but
that is still better than congestion_wait.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page-writeback.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4812a17b288c..f34f54fcd5b4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2366,8 +2366,15 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 			ret = generic_writepages(mapping, wbc);
 		if ((ret != -ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
 			break;
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+
+		/*
+		 * Lacking an allocation context or the locality or writeback
+		 * state of any of the inode's pages, throttle based on
+		 * writeback activity on the local node. It's as good a
+		 * guess as any.
+		 */
+		reclaim_throttle(NODE_DATA(numa_node_id()),
+			VMSCAN_THROTTLE_WRITEBACK, HZ/50);
 	}
 	/*
 	 * Usually few pages are written by now from those we've just submitted
--
2.31.1
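Editorial sketch, not part of the patch: the retry loop touched by the hunk above, modelled in userspace C. The callback types, struct fake_mapping and both fake_* helpers are hypothetical test doubles standing in for the writepages call and for reclaim_throttle(); only the loop's exit condition is taken from the diff.

```c
#include <errno.h>

/* Local mirror of the two sync modes the exit condition distinguishes. */
enum model_sync_mode { MODEL_WB_SYNC_NONE, MODEL_WB_SYNC_ALL };

/* Hypothetical callbacks standing in for writepages and the throttle. */
typedef int (*writepages_fn)(void *ctx);
typedef void (*throttle_fn)(void *ctx);

/*
 * Retry the write only when it failed for lack of memory and the caller
 * needs data integrity (sync-all). Throttle between attempts.
 */
static int writepages_with_throttle(writepages_fn write, throttle_fn throttle,
				    void *ctx, enum model_sync_mode mode)
{
	int ret;

	for (;;) {
		ret = write(ctx);
		if (ret != -ENOMEM || mode != MODEL_WB_SYNC_ALL)
			break;
		throttle(ctx);
	}

	return ret;
}

/* Test double: fails with -ENOMEM a fixed number of times, then succeeds. */
struct fake_mapping {
	int enomem_left; /* remaining simulated allocation failures */
	int throttled;   /* how many times the throttle callback ran */
};

static int fake_writepages(void *ctx)
{
	struct fake_mapping *m = ctx;

	if (m->enomem_left > 0) {
		m->enomem_left--;
		return -ENOMEM;
	}
	return 0;
}

static void fake_throttle(void *ctx)
{
	struct fake_mapping *m = ctx;

	m->throttled++;
}
```

With three simulated failures, a sync-all write in this model throttles three times and then returns 0, while a non-integrity write returns -ENOMEM at once without throttling.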
* [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator
  2021-10-22 14:46 [PATCH v5 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
  ` (3 preceding siblings ...)
  2021-10-22 14:46 ` [PATCH 4/8] mm/writeback: Throttle based on page writeback instead of congestion Mel Gorman
@ 2021-10-22 14:46 ` Mel Gorman
  2021-10-25 10:07 ` Vlastimil Babka
  2021-10-22 14:46 ` [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle Mel Gorman
  ` (2 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2021-10-22 14:46 UTC
  To: Andrew Morton
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML, Mel Gorman

The page allocator stalls based on the number of pages that are
waiting for writeback to start but this should now be redundant.
shrink_inactive_list() will wake flusher threads if the LRU tail are
unqueued dirty pages so the flusher should be active. If it fails to make
progress due to pages under writeback not being completed quickly then
it should stall on VMSCAN_THROTTLE_WRITEBACK.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78e538067651..8fa0109ff417 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4795,30 +4795,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		trace_reclaim_retry_zone(z, order, reclaimable,
 				available, min_wmark, *no_progress_loops, wmark);
 		if (wmark) {
-			/*
-			 * If we didn't make any progress and have a lot of
-			 * dirty + writeback pages then we should wait for
-			 * an IO to complete to slow down the reclaim and
-			 * prevent from pre mature OOM
-			 */
-			if (!did_some_progress) {
-				unsigned long write_pending;
-
-				write_pending = zone_page_state_snapshot(zone,
-							NR_ZONE_WRITE_PENDING);
-
-				if (2 * write_pending > reclaimable) {
-					congestion_wait(BLK_RW_ASYNC, HZ/10);
-					return true;
-				}
-			}
-
 			ret = true;
-			goto out;
+			break;
 		}
 	}
 
-out:
 	/*
 	 * Memory allocation/reclaim might be called from a WQ context and the
 	 * current implementation of the WQ concurrency control doesn't
--
2.31.1
* Re: [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator
  2021-10-22 14:46 ` [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator Mel Gorman
@ 2021-10-25 10:07 ` Vlastimil Babka
  0 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2021-10-25 10:07 UTC
  To: Mel Gorman, Andrew Morton
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Johannes Weiner, Jonathan Corbet, Linux-MM, Linux-fsdevel, LKML

On 10/22/21 16:46, Mel Gorman wrote:
> The page allocator stalls based on the number of pages that are
> waiting for writeback to start but this should now be redundant.
> shrink_inactive_list() will wake flusher threads if the LRU tail are
> unqueued dirty pages so the flusher should be active. If it fails to make
> progress due to pages under writeback not being completed quickly then
> it should stall on VMSCAN_THROTTLE_WRITEBACK.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 21 +--------------------
>  1 file changed, 1 insertion(+), 20 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 78e538067651..8fa0109ff417 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4795,30 +4795,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		trace_reclaim_retry_zone(z, order, reclaimable,
>  				available, min_wmark, *no_progress_loops, wmark);
>  		if (wmark) {
> -			/*
> -			 * If we didn't make any progress and have a lot of
> -			 * dirty + writeback pages then we should wait for
> -			 * an IO to complete to slow down the reclaim and
> -			 * prevent from pre mature OOM
> -			 */
> -			if (!did_some_progress) {
> -				unsigned long write_pending;
> -
> -				write_pending = zone_page_state_snapshot(zone,
> -							NR_ZONE_WRITE_PENDING);
> -
> -				if (2 * write_pending > reclaimable) {
> -					congestion_wait(BLK_RW_ASYNC, HZ/10);
> -					return true;
> -				}
> -			}
> -
>  			ret = true;
> -			goto out;
> +			break;
>  		}
>  	}
>
> -out:
>  	/*
>  	 * Memory allocation/reclaim might be called from a WQ context and the
>  	 * current implementation of the WQ concurrency control doesn't
>
* [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle 2021-10-22 14:46 [PATCH v5 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman ` (4 preceding siblings ...) 2021-10-22 14:46 ` [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator Mel Gorman @ 2021-10-22 14:46 ` Mel Gorman 2021-10-22 14:46 ` [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress Mel Gorman 2021-10-22 14:46 ` [PATCH 8/8] mm/vmscan: Delay waking of tasks throttled on NOPROGRESS Mel Gorman 7 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2021-10-22 14:46 UTC (permalink / raw) To: Andrew Morton Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM, Linux-fsdevel, LKML, Mel Gorman Neil Brown raised concerns about callers of reclaim_throttle specifying a timeout value. The original timeout values to congestion_wait() were probably pulled out of thin air or copy&pasted from somewhere else. This patch centralises the timeout values and selects a timeout based on the reason for reclaim throttling. These figures are also pulled out of the same thin air but better values may be derived Running a workload that is throttling for inappropriate periods and tracing mm_vmscan_throttled can be used to pick a more appropriate value. Excessive throttling would pick a lower timeout where as excessive CPU usage in reclaim context would select a larger timeout. Ideally a large value would always be used and the wakeups would occur before a timeout but that requires careful testing. 
Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- mm/compaction.c | 2 +- mm/internal.h | 3 +-- mm/page-writeback.c | 2 +- mm/vmscan.c | 50 +++++++++++++++++++++++++++++++++------------ 4 files changed, 40 insertions(+), 17 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 7359093d8ac0..151b04c4dab3 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -828,7 +828,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (cc->mode == MIGRATE_ASYNC) return -EAGAIN; - reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10); + reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED); if (fatal_signal_pending(current)) return -EINTR; diff --git a/mm/internal.h b/mm/internal.h index c72d3383ef34..383d9b7e7991 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -129,8 +129,7 @@ extern unsigned long highest_memmap_pfn; */ extern int isolate_lru_page(struct page *page); extern void putback_lru_page(struct page *page); -extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason, - long timeout); +extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason); /* * in mm/rmap.c: diff --git a/mm/page-writeback.c b/mm/page-writeback.c index f34f54fcd5b4..4b01a6872f9e 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2374,7 +2374,7 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) * guess as any. 
*/ reclaim_throttle(NODE_DATA(numa_node_id()), - VMSCAN_THROTTLE_WRITEBACK, HZ/50); + VMSCAN_THROTTLE_WRITEBACK); } /* * Usually few pages are written by now from those we've just submitted diff --git a/mm/vmscan.c b/mm/vmscan.c index 0450f6867d61..66da45084af4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1006,12 +1006,10 @@ static void handle_write_error(struct address_space *mapping, unlock_page(page); } -void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason, - long timeout) +void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason) { wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason]; - long ret; - bool acct_writeback = (reason == VMSCAN_THROTTLE_WRITEBACK); + long timeout, ret; DEFINE_WAIT(wait); /* @@ -1023,17 +1021,43 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason, current->flags & (PF_IO_WORKER|PF_KTHREAD)) return; - if (acct_writeback && - atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) { - WRITE_ONCE(pgdat->nr_reclaim_start, - node_page_state(pgdat, NR_THROTTLED_WRITTEN)); + /* + * These figures are pulled out of thin air. + * VMSCAN_THROTTLE_ISOLATED is a transient condition based on too many + * parallel reclaimers which is a short-lived event so the timeout is + * short. Failing to make progress or waiting on writeback are + * potentially long-lived events so use a longer timeout. This is shaky + * logic as a failure to make progress could be due to anything from + * writeback to a slow device to excessive references pages at the tail + * of the inactive LRU. 
+ */ + switch(reason) { + case VMSCAN_THROTTLE_WRITEBACK: + timeout = HZ/10; + + if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) { + WRITE_ONCE(pgdat->nr_reclaim_start, + node_page_state(pgdat, NR_THROTTLED_WRITTEN)); + } + + break; + case VMSCAN_THROTTLE_NOPROGRESS: + timeout = HZ/10; + break; + case VMSCAN_THROTTLE_ISOLATED: + timeout = HZ/50; + break; + default: + WARN_ON_ONCE(1); + timeout = HZ; + break; } prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); ret = schedule_timeout(timeout); finish_wait(wqh, &wait); - if (acct_writeback) + if (reason == VMSCAN_THROTTLE_WRITEBACK) atomic_dec(&pgdat->nr_writeback_throttled); trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout), @@ -2319,7 +2343,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, /* wait a bit for the reclaimer. */ stalled = true; - reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10); + reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED); /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) @@ -3251,7 +3275,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) * until some pages complete writeback. */ if (sc->nr.immediate) - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10); + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); } /* @@ -3275,7 +3299,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) if (!current_is_kswapd() && current_may_throttle() && !sc->hibernation_mode && test_bit(LRUVEC_CONGESTED, &target_lruvec->flags)) - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10); + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, sc)) @@ -3347,7 +3371,7 @@ static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc) /* Throttle if making no progress at high prioities. 
*/ if (sc->priority < DEF_PRIORITY - 2) - reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10); + reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS); } /* -- 2.31.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
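The timeout selection that this patch moves into reclaim_throttle() can be modelled outside the kernel. What follows is a minimal user-space sketch, not kernel code: `hz` is passed in explicitly as a stand-in for the kernel's HZ constant, and `throttle_timeout_jiffies()` and `jiffies_to_usecs_model()` are hypothetical helper names used only for illustration. The values mirror the switch statement in the diff above.

```c
/* Mirrors enum vmscan_throttle_state at this point in the series. */
enum vmscan_throttle_state {
	VMSCAN_THROTTLE_WRITEBACK,
	VMSCAN_THROTTLE_ISOLATED,
	VMSCAN_THROTTLE_NOPROGRESS,
	NR_VMSCAN_THROTTLE,
};

/*
 * Hypothetical helper: the timeout in jiffies that patch 6/8 selects for
 * each throttle reason. "hz" stands in for the kernel's HZ.
 */
static long throttle_timeout_jiffies(enum vmscan_throttle_state reason, long hz)
{
	switch (reason) {
	case VMSCAN_THROTTLE_WRITEBACK:
	case VMSCAN_THROTTLE_NOPROGRESS:
		return hz / 10;	/* potentially long-lived conditions */
	case VMSCAN_THROTTLE_ISOLATED:
		return hz / 50;	/* transient condition, short timeout */
	default:
		return hz;	/* unknown reason; the patch also warns here */
	}
}

/* Simplified conversion; assumes hz divides 1000000 evenly. */
static long jiffies_to_usecs_model(long jiffies, long hz)
{
	return jiffies * (1000000L / hz);
}
```

For an hz value where the divisions are exact, such as 250 or 1000, the ISOLATED timeout converts to 20000 usec and the WRITEBACK timeout to 100000 usec.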
* [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress
  2021-10-22 14:46 [PATCH v5 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
                   ` (5 preceding siblings ...)
  2021-10-22 14:46 ` [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle Mel Gorman
@ 2021-10-22 14:46 ` Mel Gorman
  2021-10-22 14:46 ` [PATCH 8/8] mm/vmscan: Delay waking of tasks throttled on NOPROGRESS Mel Gorman
  7 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2021-10-22 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML, Mel Gorman

Tracing of the stutterp workload showed the following delays

      1 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=536000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=544000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=556000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=624000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=716000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=772000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usect_delayed=512000 reason=VMSCAN_THROTTLE_NOPROGRESS
     16 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
     53 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
    116 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
   5907 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
  71741 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS

All the throttling hit the full timeout and then there were wakeup delays,
meaning that the wakeups are premature as no other reclaimer such as kswapd
has made progress. This patch increases the maximum timeout.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 66da45084af4..35b6ccaa01c3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1042,7 +1042,7 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason) break; case VMSCAN_THROTTLE_NOPROGRESS: - timeout = HZ/10; + timeout = HZ/2; break; case VMSCAN_THROTTLE_ISOLATED: timeout = HZ/50; -- 2.31.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 8/8] mm/vmscan: Delay waking of tasks throttled on NOPROGRESS 2021-10-22 14:46 [PATCH v5 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman ` (6 preceding siblings ...) 2021-10-22 14:46 ` [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress Mel Gorman @ 2021-10-22 14:46 ` Mel Gorman 7 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2021-10-22 14:46 UTC (permalink / raw) To: Andrew Morton Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM, Linux-fsdevel, LKML, Mel Gorman Tracing indicates that tasks throttled on NOPROGRESS are woken prematurely resulting in occasional massive spikes in direct reclaim activity. This patch wakes tasks throttled on NOPROGRESS if reclaim efficiency is at least 12%. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- mm/vmscan.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 35b6ccaa01c3..812d4697d50d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3349,8 +3349,11 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc) { - /* If reclaim is making progress, wake any throttled tasks. */ - if (sc->nr_reclaimed) { + /* + * If reclaim is making progress greater than 12% efficiency then + * wake all the NOPROGRESS throttled tasks. + */ + if (sc->nr_reclaimed > (sc->nr_scanned >> 3)) { wait_queue_head_t *wqh; wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS]; -- 2.31.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
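The new wake-up condition is computed with a shift, so it is worth spelling out what it accepts. The following is a minimal user-space sketch that assumes only the comparison shown in the diff above; `should_wake_noprogress()` is a hypothetical name, not a kernel function. Shifting right by 3 divides by 8 with truncation, so the cut-off is one eighth (12.5%) of the pages scanned, rounded down.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the check added to consider_reclaim_throttle():
 * wake NOPROGRESS waiters only if more than nr_scanned / 8 pages were
 * reclaimed (integer division, so the threshold rounds down).
 */
static bool should_wake_noprogress(unsigned long nr_reclaimed,
				   unsigned long nr_scanned)
{
	return nr_reclaimed > (nr_scanned >> 3);
}
```

For 100 pages scanned the shift yields 12, so 13 reclaimed pages satisfy the condition and 12 do not. With fewer than 8 pages scanned the threshold is 0 and any reclaimed page is enough.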
* [PATCH v4 0/8] Remove dependency on congestion_wait in mm/
@ 2021-10-19  9:01 Mel Gorman
  2021-10-19  9:01 ` [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2021-10-19  9:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML, Mel Gorman

Changelog since v3
o Count writeback completions for NR_THROTTLED_WRITTEN only
o Use IRQ-safe inc_node_page_state
o Remove redundant throttling

This series is also available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v4r2

This series removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but congestion_wait
has been broken for a long time
(https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).

Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one possible
reason that reclaim may be slow, there is also the problem of too many
pages being isolated and reclaim failing for other reasons (elevated
references, too many pages isolated, excessive LRU contention etc).

This series replaces the "congestion" throttling with 3 different types.

o If there are too many dirty/writeback pages, sleep until a timeout
  or enough pages get cleaned
o If too many pages are isolated, sleep until enough isolated pages
  are either reclaimed or put back on the LRU
o If no progress is being made, direct reclaim tasks sleep until
  another task makes progress with acceptable efficiency.

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work.
A new test case was created called "stutterp"
(pagereclaim-stutterp-noreaders in mmtests) using a freshly created XFS
filesystem. Note that it may be necessary to increase the timeout of ssh
if executing remotely as ssh itself can get throttled and the connection
may time out.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increases. It has
four types of worker.

o One "anon latency" worker creates small mappings with mmap() and
  times how long it takes to fault the mapping reading it 4K at a time
o X file writers which is fio randomly writing X files where the total
  size of the files adds up to the allowed dirty_ratio. fio is allowed
  to run for a warmup period to allow some file-backed pages to
  accumulate. The duration of the warmup is based on the best-case
  linear write speed of the storage.
o Y file readers which is fio randomly reading small files
o Z anon memory hogs which continually map (100-dirty_ratio)% of memory
o Total estimated WSS = (100+dirty_ratio) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero readers.

The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.

First, the results for the "anon latency" worker

stutterp
                              5.15.0-rc1                5.15.0-rc1
                                 vanilla    mm-reclaimcongest-v4r2
Amean  mmap-4       31.4003 (     0.00%)     176.8729 (  -463.28%)
Amean  mmap-7       38.1641 (     0.00%)     605.8866 ( -1487.58%)
Amean  mmap-12      60.0981 (     0.00%)    2866.8561 ( -4670.29%)
Amean  mmap-21     161.2699 (     0.00%)     118.6587 (    26.42%)
Amean  mmap-30     174.5589 (     0.00%)    3729.2263 ( -2036.37%)
Amean  mmap-48    8106.8160 (     0.00%)    1463.7815 (    81.94%)
Stddev mmap-4       41.3455 (     0.00%)    5847.5425 (-14043.13%)
Stddev mmap-7       53.5556 (     0.00%)   12091.9011 (-22478.20%)
Stddev mmap-12     171.3897 (     0.00%)   28785.9881 (-16695.63%)
Stddev mmap-21    1506.6752 (     0.00%)    1609.0361 (    -6.79%)
Stddev mmap-30     557.5806 (     0.00%)   32712.2440 ( -5766.82%)
Stddev mmap-48   61681.5718 (     0.00%)   15971.4654 (    74.11%)
Max-90 mmap-4       31.4243 (     0.00%)      30.1957 (     3.91%)
Max-90 mmap-7       41.0410 (     0.00%)      36.7782 (    10.39%)
Max-90 mmap-12      66.5255 (     0.00%)     121.8574 (   -83.17%)
Max-90 mmap-21     146.7479 (     0.00%)     132.2327 (     9.89%)
Max-90 mmap-30     193.9513 (     0.00%)      61.6135 (    68.23%)
Max-90 mmap-48     277.9137 (     0.00%)     593.7413 (  -113.64%)
Max    mmap-4     1913.8009 (     0.00%)  239690.5578 (-12424.32%)
Max    mmap-7     2423.9665 (     0.00%)  270122.1751 (-11043.81%)
Max    mmap-12    6845.6573 (     0.00%)  308761.7416 ( -4410.33%)
Max    mmap-21   56278.6508 (     0.00%)   79286.8553 (   -40.88%)
Max    mmap-30   19716.2990 (     0.00%)  306793.2333 ( -1456.04%)
Max    mmap-48  477923.9400 (     0.00%)  229791.8793 (    51.92%)

For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout, increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each task's rate of page allocation versus
progress of reclaim. The variance is also impacted for high worker counts
but in all cases, the differences in latency are not statistically
significant due to very large maximum outliers.
Max-90 shows that 90% of the stalls are comparable but the Max results
show the massive outliers which are increased due to stalling.

It is expected that this will be very machine dependent. Due to the test
design, reclaim is difficult so allocations stall and there are variances
depending on whether THPs can be allocated or not. The amount of memory
will affect exactly how bad the corner cases are and how often they
trigger. The warmup period calculation is not ideal as it's based on
linear writes whereas fio is randomly writing multiple files from multiple
tasks so the start state of the test is variable. For example, these are
the latencies on a single-socket machine that had more memory

Amean  mmap-4       20.5437 (     0.00%)      19.1772 *     6.65%*
Amean  mmap-6       39.2860 (     0.00%)      69.4987 (   -76.90%)
Amean  mmap-8     2476.1950 (     0.00%)     151.7673 (    93.87%)
Amean  mmap-12     178.0936 (     0.00%)     209.1427 (   -17.43%)
Amean  mmap-18    3238.9125 (     0.00%)     262.5806 (    91.89%)
Amean  mmap-24    7922.7016 (     0.00%)     322.9738 (    95.92%)
Amean  mmap-30    1766.8392 (     0.00%)     405.8898 (    77.03%)
Amean  mmap-32    7542.2844 (     0.00%)     555.6236 (    92.63%)
Amean  mmap-32    7542.2844 (     0.00%)     512.1812 (    93.21%)

The overall system CPU usage and elapsed time is as follows

                      5.15.0-rc3                5.15.0-rc3
                         vanilla    mm-reclaimcongest-v4r2
Duration User            6989.03                   2368.70
Duration System          7308.12                    843.35
Duration Elapsed         2277.67                   2131.77

The patches reduce system CPU usage by 88% as the vanilla kernel is rarely
stalling.

The high-level /proc/vmstat statistics show

                                      5.15.0-rc1                5.15.0-rc1
                                         vanilla    mm-reclaimcongest-v4r2
Ops Direct pages scanned           1056608451.00               76886196.00
Ops Kswapd pages scanned            109795048.00               82179688.00
Ops Kswapd pages reclaimed           63269243.00               27410157.00
Ops Direct pages reclaimed           10803973.00                8016444.00
Ops Kswapd efficiency %                    57.62                     33.35
Ops Kswapd velocity                     48204.98                  38549.98
Ops Direct efficiency %                     1.02                     10.43
Ops Direct velocity                    463898.83                  36066.83

Kswapd scanned fewer pages but the detailed pattern is different.
The vanilla kernel scans slowly over time whereas the patched kernel
exhibits burst patterns of scan activity. Direct reclaim scanning is
reduced by 92% due to stalling.

The pattern for stealing pages is also slightly different. Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time whereas the patches tend to reclaim in
spikes. The difference is that vanilla is not throttling and instead
scanning constantly, finding some pages over time, whereas the patched
kernel throttles and reclaims in spikes.

Ops Percentage direct scans                90.59                     48.34

With the vanilla kernel, 90.59% of the pages scanned were scanned by
direct reclaim whereas with the patches, 48.34% were scanned by direct
reclaim due to throttling.

Ops Page writes by reclaim            2613590.00                1882533.00

Page writes from reclaim context are reduced.

Ops Page writes anon                  2932752.00                2266749.00

And there is slightly less swapping.

Ops Page reclaim immediate          996248528.00               29230920.00

The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 97%.

Ops Slabs scanned                      164284.00                 166646.00

Slab scan activity is similar.

ftrace was used to gather stall activity

Vanilla
-------
      1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
      2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
      8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
     29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
  82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The vast majority of wait_iff_congested calls do not stall at all. What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).

      1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
      2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
      4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
    380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
    778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait, if called, always exceeds the timeout as there is no
trigger to wake it up.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

      1 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
      1 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
      1 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
      1 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
      1 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
      2 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
      8 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
     23 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
     52 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
     61 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout. For direct reclaim, the number
of stalls for each reason was

  13594 reason=VMSCAN_THROTTLE_ISOLATED
  72247 reason=VMSCAN_THROTTLE_WRITEBACK
  77203 reason=VMSCAN_THROTTLE_NOPROGRESS

The most common reason to stall was a failure to make forward progress,
followed closely by excessive pages tagged for immediate reclaim at the
tail of the LRU.
A relatively small number were due to too many pages isolated from the LRU by parallel threads For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was 3 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED 8 usec_timeout=20000 usect_delayed=8000 reason=VMSCAN_THROTTLE_ISOLATED 9 usec_timeout=20000 usect_delayed=12000 reason=VMSCAN_THROTTLE_ISOLATED 18 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED 69 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED 1946 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED Most did not stall at all or for a short period. A small number reached the timeout. For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over the map 1 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=276000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=388000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=236000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=396000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 
usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS 3 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS 3 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS 3 usec_timeout=500000 usect_delayed=336000 reason=VMSCAN_THROTTLE_NOPROGRESS 3 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS 6 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS 6 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS 6 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS 
7 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS 8 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS 8 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS 8 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS 9 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS 9 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS 9 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS 10 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS 10 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS 11 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS 11 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS 11 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS 13 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS 14 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS 14 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS 15 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS 17 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS 18 usec_timeout=500000 usect_delayed=112000 
reason=VMSCAN_THROTTLE_NOPROGRESS 20 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS 20 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS 21 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS 24 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS 24 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS 27 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS 27 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS 28 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS 29 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS 30 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS 30 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS 30 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS 31 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS 32 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS 32 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS 38 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS 38 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS 39 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS 47 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS 47 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS 52 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS 54 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS 55 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS 61 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS 63 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS 68 usec_timeout=500000 
usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS 75 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS 80 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS 83 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS 88 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS 97 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS 99 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS 99 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS 102 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS 149 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS 187 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS 196 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS 245 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS 322 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS 406 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS 588 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS 843 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS 1299 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS 2839 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS 10111 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS 21492 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS 36441 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS The full timeout is often hit but a large number also do not stall at all. The remainder slept a little allowing other reclaim tasks to make progress. 
While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

      5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
      7 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
      8 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
      9 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
     12 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
     12 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
     13 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
     14 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
     14 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
     16 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
     21 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
     24 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
     25 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
     26 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
     32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
     45 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
     50 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
     60 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
     74 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
    122 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
    134 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
    310 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
    568 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
   2038 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
   7061 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
  61547 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although a sizable
number did not stall at all. This is very different to kswapd where only
a tiny percentage of stalls due to writeback reached the timeout.

Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on the
workload, memory size and the speed of the storage. A better approach to
improve the series further would be to prioritise tasks based on their
rate of allocation with the caveat that it may be very expensive to track.

-- 
2.31.1

^ permalink raw reply	[flat|nested] 23+ messages in thread
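The derived figures quoted in the cover letter above (efficiency, percentage of direct scans, and the "reduced by N%" statements) can be reproduced from the raw counters. The following is a minimal sketch: the formulas are an assumption that happens to match the reported numbers, not something confirmed to be how mmtests computes them, and all three function names are hypothetical.

```c
/* Pages reclaimed per page scanned, as a percentage. */
static double efficiency_pct(double reclaimed, double scanned)
{
	return scanned > 0.0 ? 100.0 * reclaimed / scanned : 0.0;
}

/* Share of all page scanning that was done by direct reclaim. */
static double direct_scan_pct(double direct_scanned, double kswapd_scanned)
{
	double total = direct_scanned + kswapd_scanned;

	return total > 0.0 ? 100.0 * direct_scanned / total : 0.0;
}

/* Relative reduction from "before" to "after", as a percentage. */
static double reduction_pct(double before, double after)
{
	return before > 0.0 ? 100.0 * (before - after) / before : 0.0;
}
```

For example, efficiency_pct(10803973, 1056608451) corresponds to the vanilla "Direct efficiency %" row, and reduction_pct(7308.12, 843.35) corresponds to the reduction in system CPU time.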
* [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-10-19  9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
@ 2021-10-19  9:01 ` Mel Gorman
  0 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2021-10-19  9:01 UTC
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
	Linux-fsdevel, LKML, Mel Gorman

Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense, it might be due to writeback or a host of
other factors.

For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
in the page allocator if it is failing to make progress. Kswapd
throttles if too many pages are under writeback and marked for
immediate reclaim.

This patch explicitly throttles if reclaim is failing to make progress.

[vbabka@suse.cz: Remove redundant code]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h        |  1 +
 include/trace/events/vmscan.h |  4 +++-
 mm/memcontrol.c               | 10 +---------
 mm/vmscan.c                   | 28 ++++++++++++++++++++++++++++
 4 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 58a25d42c31c..2ffcf2410b66 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,7 @@ enum lru_list {
 enum vmscan_throttle_state {
 	VMSCAN_THROTTLE_WRITEBACK,
 	VMSCAN_THROTTLE_ISOLATED,
+	VMSCAN_THROTTLE_NOPROGRESS,
 	NR_VMSCAN_THROTTLE,
 };
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d4905bd9e9c4..f25a6149d3ba 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -29,11 +29,13 @@
 
 #define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
 #define _VMSCAN_THROTTLE_ISOLATED	(1 << VMSCAN_THROTTLE_ISOLATED)
+#define _VMSCAN_THROTTLE_NOPROGRESS	(1 << VMSCAN_THROTTLE_NOPROGRESS)
 
 #define show_throttle_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",					\
 		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"},	\
-		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"}	\
+		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"},	\
+		{_VMSCAN_THROTTLE_NOPROGRESS,	"VMSCAN_THROTTLE_NOPROGRESS"}	\
 		) : "VMSCAN_THROTTLE_NONE"
 
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..8b33152c9b85 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3465,19 +3465,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 
 	/* try to free all pages in this cgroup */
 	while (nr_retries && page_counter_read(&memcg->memory)) {
-		int progress;
-
 		if (signal_pending(current))
 			return -EINTR;
 
-		progress = try_to_free_mem_cgroup_pages(memcg, 1,
-							GFP_KERNEL, true);
-		if (!progress) {
+		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true))
 			nr_retries--;
-			/* maybe some writeback is necessary */
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
-		}
-
 	}
 
 	return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 29434d4fc1c7..14127bbf2c3b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3323,6 +3323,33 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 	return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
 }
 
+static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
+{
+	/* If reclaim is making progress, wake any throttled tasks. */
+	if (sc->nr_reclaimed) {
+		wait_queue_head_t *wqh;
+
+		wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS];
+		if (waitqueue_active(wqh))
+			wake_up_all(wqh);
+
+		return;
+	}
+
+	/*
+	 * Do not throttle kswapd on NOPROGRESS as it will throttle on
+	 * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under
+	 * writeback and marked for immediate reclaim at the tail of
+	 * the LRU.
+	 */
+	if (current_is_kswapd())
+		return;
+
+	/* Throttle if making no progress at high priorities. */
+	if (sc->priority < DEF_PRIORITY - 2)
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+}
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -3407,6 +3434,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			continue;
 		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc);
+		consider_reclaim_throttle(zone->zone_pgdat, sc);
 	}
 
 	/*
-- 
2.31.1
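The decision taken by consider_reclaim_throttle() can be modelled in
userspace. What follows is a minimal sketch rather than kernel code:
struct scan_state, enum throttle_action and decide_throttle() are
hypothetical names used only here, DEF_PRIORITY is assumed to be 12 as in
mainline, and the waitqueue wakeup and the sleep in reclaim_throttle() are
reduced to a returned action.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match the kernel's DEF_PRIORITY; priority counts down to 0. */
#define DEF_PRIORITY 12

/* Hypothetical names for this sketch only, not kernel API. */
enum throttle_action {
	ACTION_NONE,		/* neither wake nor sleep */
	ACTION_WAKE,		/* progress made: wake NOPROGRESS waiters, if any */
	ACTION_THROTTLE,	/* no progress: sleep on the NOPROGRESS waitqueue */
};

struct scan_state {
	unsigned long nr_reclaimed;	/* pages reclaimed so far */
	int priority;			/* DEF_PRIORITY (gentle) .. 0 (aggressive) */
	bool is_kswapd;			/* stands in for current_is_kswapd() */
};

static enum throttle_action decide_throttle(const struct scan_state *sc)
{
	assert(sc->priority >= 0 && sc->priority <= DEF_PRIORITY);

	/* Any progress at all means waiters should be woken, never throttled. */
	if (sc->nr_reclaimed)
		return ACTION_WAKE;

	/* kswapd throttles on VMSCAN_THROTTLE_WRITEBACK elsewhere instead. */
	if (sc->is_kswapd)
		return ACTION_NONE;

	/* Only throttle once the scan priority has dropped below 10. */
	if (sc->priority < DEF_PRIORITY - 2)
		return ACTION_THROTTLE;

	return ACTION_NONE;
}
```

In this model a direct reclaimer at priority 9 that has reclaimed nothing
gets ACTION_THROTTLE, while kswapd in the same state gets ACTION_NONE and a
direct reclaimer still at priority 10 or above is left alone.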
* [PATCH v3 0/8] Remove dependency on congestion_wait in mm/
@ 2021-10-08 13:53 Mel Gorman
  2021-10-08 13:53 ` [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2021-10-08 13:53 UTC
To: Linux-MM
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

This series is also available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v3r9

This series removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but congestion_wait
has been broken for a long time
(https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).

Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one possibility
that reclaim may be slow, there is also the problem of too many pages
being isolated and reclaim failing for other reasons (elevated references,
too many pages isolated, excessive LRU contention etc).

This series replaces the "congestion" throttling with 3 different types.

o If there are too many dirty/writeback pages, sleep until a timeout
  or enough pages get cleaned
o If too many pages are isolated, sleep until enough isolated pages
  are either reclaimed or put back on the LRU
o If no progress is being made, direct reclaim tasks sleep until
  another task makes progress with acceptable efficiency.

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may time out.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increases. It has
four types of worker.

o One "anon latency" worker creates small mappings with mmap() and
  times how long it takes to fault the mapping reading it 4K at a time
o X file writers which is fio randomly writing X files where the total
  size of the files add up to the allowed dirty_ratio. fio is allowed
  to run for a warmup period to allow some file-backed pages to
  accumulate. The duration of the warmup is based on the best-case
  linear write speed of the storage.
o Y file readers which is fio randomly reading small files
o Z anon memory hogs which continually map (100-dirty_ratio)% of memory
o Total estimated WSS = (100+dirty_ratio) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero readers.

The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.
First the results of the "anon latency" latency

stutterp
                              5.15.0-rc1             5.15.0-rc1
                                 vanilla mm-reclaimcongest-v3r9
Amean     mmap-4        31.4003 (   0.00%)    3502.0437 (-11052.92%)
Amean     mmap-7        38.1641 (   0.00%)     118.7176 (  -211.07%)
Amean     mmap-12       60.0981 (   0.00%)     544.4736 (  -805.97%)
Amean     mmap-21      161.2699 (   0.00%)     246.8211 (   -53.05%)
Amean     mmap-30      174.5589 (   0.00%)     511.8941 (  -193.25%)
Amean     mmap-48     8106.8160 (   0.00%)    5181.3920 (    36.09%)
Stddev    mmap-4        41.3455 (   0.00%)   35007.1657 (-84569.93%)
Stddev    mmap-7        53.5556 (   0.00%)    3880.7480 ( -7146.20%)
Stddev    mmap-12      171.3897 (   0.00%)   11157.8419 ( -6410.22%)
Stddev    mmap-21     1506.6752 (   0.00%)    6117.6842 (  -306.04%)
Stddev    mmap-30      557.5806 (   0.00%)    9030.5131 ( -1519.59%)
Stddev    mmap-48    61681.5718 (   0.00%)   35232.3288 (    42.88%)
Max-90    mmap-4        31.4243 (   0.00%)      79.4364 (  -152.79%)
Max-90    mmap-7        41.0410 (   0.00%)      38.8362 (     5.37%)
Max-90    mmap-12       66.5255 (   0.00%)      34.0194 (    48.86%)
Max-90    mmap-21      146.7479 (   0.00%)      79.2514 (    45.99%)
Max-90    mmap-30      193.9513 (   0.00%)      85.9060 (    55.71%)
Max-90    mmap-48      277.9137 (   0.00%)    1063.9764 (  -282.84%)
Max       mmap-4      1913.8009 (   0.00%)  362207.4705 (-18826.08%)
Max       mmap-7      2423.9665 (   0.00%)  192136.1715 ( -7826.52%)
Max       mmap-12     6845.6573 (   0.00%)  262738.5257 ( -3738.03%)
Max       mmap-21    56278.6508 (   0.00%)  212263.3098 (  -277.16%)
Max       mmap-30    19716.2990 (   0.00%)  218858.2147 ( -1010.04%)
Max       mmap-48   477923.9400 (   0.00%)  271100.1667 (    43.28%)

For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each task's rate of page allocation versus
progress of reclaim. The variance is also impacted for high worker counts
but in all cases, the differences in latency are not statistically
significant due to very large maximum outliers.
Max-90 shows that 90% of the stalls are comparable but the Max results
show the massive outliers which are increased due to stalling.

It is expected that this will be very machine dependent. Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes whereas fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable. For example,
these are the latencies on a single-socket machine that had more memory

Amean     mmap-4        20.5437 (   0.00%)      17.2818 *  15.88%*
Amean     mmap-6        39.2860 (   0.00%)      75.5750 * -92.37%*
Amean     mmap-8      2476.1950 (   0.00%)     184.9578 (  92.53%)
Amean     mmap-12      178.0936 (   0.00%)     198.2362 ( -11.31%)
Amean     mmap-18     3238.9125 (   0.00%)     168.2480 (  94.81%)
Amean     mmap-24     7922.7016 (   0.00%)     290.8845 (  96.33%)
Amean     mmap-30     1766.8392 (   0.00%)     460.1266 (  73.96%)
Amean     mmap-32     7542.2844 (   0.00%)     512.1812 (  93.21%)

The overall system CPU usage and elapsed time is as follows

                    5.15.0-rc3             5.15.0-rc3
                       vanilla mm-reclaimcongest-v3r9
Duration User          6989.03                 717.92
Duration System        7308.12                 774.12
Duration Elapsed       2277.67                2159.98

The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling. The differences in elapsed time are due to the possibility that
the test controller can also get throttled and miss the timeout.

The high-level /proc/vmstats show

                                  5.15.0-rc1             5.15.0-rc1
                                     vanilla mm-reclaimcongest-v3r9
Ops Direct pages scanned       1056608451.00           154109543.00
Ops Kswapd pages scanned        109795048.00           108898253.00
Ops Kswapd pages reclaimed       63269243.00            22029757.00
Ops Direct pages reclaimed       10803973.00             9135952.00
Ops Kswapd velocity                 48204.98               50416.32
Ops Direct velocity                463898.83               71347.67

Kswapd scanned a similar number of pages but the detailed pattern is
different.
The vanilla kernel scans slowly over time whereas the patches exhibit
burst patterns of scan activity. Direct reclaim scanning is reduced by
85% due to stalling. Generally, there are some spikes in reclaim activity
(both direct and kswapd) but crucially, the number of pages reclaimed is
relatively consistent. In other words, with this workload, reclaim rate
remains relatively constant but there are large variations in scan
activity representing useless scanning.

Ops Percentage direct scans             90.59                  58.60

For direct reclaim, vanilla scanned 90.59% of pages whereas with the
patches, 58.60% were direct reclaim due to throttling.

Ops Page writes by reclaim         2613590.00             2320847.00

Page writes from reclaim context are somewhat consistent.

Ops Page writes anon               2932752.00             2567954.00

Swap activity remains somewhat consistent.

Ops Page reclaim immediate       996248528.00            64076505.00

The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 94%.

Ops Slabs scanned                   164284.00              170222.00

Slab scan activity is similar.

ftrace was used to gather stall activity

Vanilla
-------
      1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
      2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
      8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
     29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
  82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The vast majority of wait_iff_congested calls do not stall at all. What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).
      1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
      2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
      4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
    380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
    778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

      1 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
      2 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
      2 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
      4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
      7 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
     13 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
    119 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
    131 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period.
A small number stalled for the entire timeout.

For direct reclaim, the number of times stalled for each reason were

   2053 reason=VMSCAN_THROTTLE_ISOLATED
 100704 reason=VMSCAN_THROTTLE_WRITEBACK
 106825 reason=VMSCAN_THROTTLE_NOPROGRESS

The most common reason to stall was due to a failure to make forward
progress followed closely by excessive pages tagged for immediate reclaim
at the tail of the LRU.
A relatively small number were due to too many pages isolated from the LRU by parallel threads For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was 3 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED 8 usec_timeout=20000 usect_delayed=8000 reason=VMSCAN_THROTTLE_ISOLATED 9 usec_timeout=20000 usect_delayed=12000 reason=VMSCAN_THROTTLE_ISOLATED 18 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED 69 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED 1946 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED Most did not stall at all or for a short period. A small percentage reached the timeout. For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over the map 1 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=276000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS 1 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=336000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS 2 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS 3 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 
usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=236000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS 4 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS 5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS 6 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS 6 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS 6 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS 6 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS 7 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS 8 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS 8 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS 9 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS 9 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS 9 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS 9 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS 10 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS 10 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS 10 usec_timeout=500000 usect_delayed=216000 
reason=VMSCAN_THROTTLE_NOPROGRESS 10 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS 11 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=344000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS 12 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS 13 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS 13 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS 13 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS 14 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS 14 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS 14 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS 14 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS 15 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS 16 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS 16 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS 17 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS 18 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS 19 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS 19 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS 21 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS 21 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS 22 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS 25 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS 28 
usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS 28 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS 28 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS 29 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS 29 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS 29 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS 29 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS 32 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS 32 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS 33 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS 34 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS 35 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS 39 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS 39 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS 44 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS 44 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS 45 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS 46 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS 46 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS 49 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS 54 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS 57 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS 58 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS 59 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS 60 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS 66 usec_timeout=500000 usect_delayed=72000 
reason=VMSCAN_THROTTLE_NOPROGRESS 75 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS 91 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS 96 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS 97 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS 139 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS 160 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS 160 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS 171 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS 175 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS 181 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS 203 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS 235 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS 267 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS 295 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS 395 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS 471 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS 548 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS 972 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS 1129 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS 1507 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS 3308 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS 14459 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS 34811 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS 45229 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS The full timeout is often hit but a large number also do not stall at all. 
The remainder slept a little allowing other reclaim tasks to make
progress.

While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

     17 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
     18 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
     19 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
     22 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
     38 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
     41 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
     43 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
     51 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
     51 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
     56 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
     64 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
     74 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
     76 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
     94 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
     99 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
    110 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
    112 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
    152 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
    154 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
    386 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
    617 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
   1052 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
   1621 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
   8406 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
  20317 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
  67014 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although a sizable
number did not stall at all. This is very different to kswapd where only
a tiny percentage of stalls due to writeback reached the timeout.

Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting timeouts
but it's likely futile as the worst-case scenarios depend on the workload,
memory size and the speed of the storage. A better approach to improve the
series further would be to prioritise tasks based on their rate of
allocation with the caveat that it may be very expensive to track.

-- 
2.31.1
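The percentages quoted alongside the duration and vmstat tables in the v3
cover letter can be re-derived from the raw counters. A minimal sketch in
plain C follows; the helper names are made up for illustration, and
treating "velocity" as pages scanned divided by elapsed seconds is an
assumption that happens to agree with the reported figures.

```c
#include <assert.h>

/* Percentage reduction when a counter drops from 'before' to 'after'. */
static double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (1.0 - after / before);
}

/* Share of all LRU scanning that was done by direct reclaim, in percent. */
static double pct_direct_scans(double direct, double kswapd)
{
	assert(direct + kswapd > 0.0);
	return 100.0 * direct / (direct + kswapd);
}

/* Pages scanned per second of elapsed test time (assumed definition). */
static double scan_velocity(double pages_scanned, double elapsed_sec)
{
	assert(elapsed_sec > 0.0);
	return pages_scanned / elapsed_sec;
}
```

Feeding in the table values, pct_reduction(7308.12, 774.12) gives roughly
the 89% drop in system CPU time, pct_reduction(1056608451, 154109543) the
85% drop in direct reclaim scanning, and pct_direct_scans(1056608451,
109795048) the 90.59% direct scan share reported for the vanilla kernel.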
* [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-10-08 13:53 [PATCH v3 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
@ 2021-10-08 13:53 ` Mel Gorman
  2021-10-14 12:31   ` Vlastimil Babka
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2021-10-08 13:53 UTC
To: Linux-MM
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense, it might be due to writeback or a host of
other factors.

For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
in the page allocator if it is failing to make progress. Kswapd
throttles if too many pages are under writeback and marked for
immediate reclaim.

This patch explicitly throttles if reclaim is failing to make progress.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h        |  1 +
 include/trace/events/vmscan.h |  4 +++-
 mm/memcontrol.c               | 10 +--------
 mm/vmscan.c                   | 38 +++++++++++++++++++++++++++++++++++
 4 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca65d6a64bdd..7c08cc91d526 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,7 @@ enum lru_list {
 enum vmscan_throttle_state {
 	VMSCAN_THROTTLE_WRITEBACK,
 	VMSCAN_THROTTLE_ISOLATED,
+	VMSCAN_THROTTLE_NOPROGRESS,
 	NR_VMSCAN_THROTTLE,
 };
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d4905bd9e9c4..f25a6149d3ba 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -29,11 +29,13 @@
 
 #define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
 #define _VMSCAN_THROTTLE_ISOLATED	(1 << VMSCAN_THROTTLE_ISOLATED)
+#define _VMSCAN_THROTTLE_NOPROGRESS	(1 << VMSCAN_THROTTLE_NOPROGRESS)
 
 #define show_throttle_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",					\
 		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"},	\
-		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"}	\
+		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"},	\
+		{_VMSCAN_THROTTLE_NOPROGRESS,	"VMSCAN_THROTTLE_NOPROGRESS"}	\
 		) : "VMSCAN_THROTTLE_NONE"
 
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..8b33152c9b85 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3465,19 +3465,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 
 	/* try to free all pages in this cgroup */
 	while (nr_retries && page_counter_read(&memcg->memory)) {
-		int progress;
-
 		if (signal_pending(current))
 			return -EINTR;
 
-		progress = try_to_free_mem_cgroup_pages(memcg, 1,
-							GFP_KERNEL, true);
-		if (!progress) {
+		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true))
 			nr_retries--;
-			/* maybe some writeback is necessary */
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
-		}
-
 	}
 
 	return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9ce4195d4123..cdebfc618179 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3311,6 +3311,33 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 	return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
 }
 
+static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
+{
+	/* If reclaim is making progress, wake any throttled tasks. */
+	if (sc->nr_reclaimed) {
+		wait_queue_head_t *wqh;
+
+		wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS];
+		if (waitqueue_active(wqh))
+			wake_up_all(wqh);
+
+		return;
+	}
+
+	/*
+	 * Do not throttle kswapd on NOPROGRESS as it will throttle on
+	 * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under
+	 * writeback and marked for immediate reclaim at the tail of
+	 * the LRU.
+	 */
+	if (current_is_kswapd())
+		return;
+
+	/* Throttle if making no progress at high priorities. */
+	if (sc->priority < DEF_PRIORITY - 2)
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+}
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -3395,6 +3422,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			continue;
 		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc);
+		consider_reclaim_throttle(zone->zone_pgdat, sc);
 	}
 
 	/*
@@ -3769,6 +3797,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 	set_task_reclaim_state(current, NULL);
 
+	if (!nr_reclaimed) {
+		struct zoneref *z;
+		pg_data_t *pgdat;
+
+		z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
+		pgdat = zonelist_zone(z)->zone_pgdat;
+
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+	}
+
 	return nr_reclaimed;
 }
 #endif
-- 
2.31.1
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-10-08 13:53 ` [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
@ 2021-10-14 12:31   ` Vlastimil Babka
  2021-10-14 13:03     ` Mel Gorman
  0 siblings, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2021-10-14 12:31 UTC
To: Mel Gorman, Linux-MM
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Johannes Weiner, Jonathan Corbet, Linux-fsdevel, LKML

On 10/8/21 15:53, Mel Gorman wrote:
> Memcg reclaim throttles on congestion if no reclaim progress is made.
> This makes little sense, it might be due to writeback or a host of
> other factors.
>
> For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
> in the page allocator if it is failing to make progress. Kswapd
> throttles if too many pages are under writeback and marked for
> immediate reclaim.
>
> This patch explicitly throttles if reclaim is failing to make progress.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

...

> @@ -3769,6 +3797,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>  	set_task_reclaim_state(current, NULL);
>
> +	if (!nr_reclaimed) {
> +		struct zoneref *z;
> +		pg_data_t *pgdat;
> +
> +		z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
> +		pgdat = zonelist_zone(z)->zone_pgdat;
> +
> +		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
> +	}

Is this necessary? AFAICS here we just returned from:

do_try_to_free_pages()
  shrink_zones()
    for_each_zone()...
      consider_reclaim_throttle()

Which already throttles when needed and using the appropriate pgdat, while
here we have to somewhat awkwardly assume the preferred one.

> +
>  	return nr_reclaimed;
>  }
>  #endif
>
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-10-14 12:31   ` Vlastimil Babka
@ 2021-10-14 13:03     ` Mel Gorman
  2021-10-14 15:45       ` Vlastimil Babka
From: Mel Gorman @ 2021-10-14 13:03 UTC
To: Vlastimil Babka
Cc: Linux-MM, NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Johannes Weiner, Jonathan Corbet, Linux-fsdevel, LKML

On Thu, Oct 14, 2021 at 02:31:17PM +0200, Vlastimil Babka wrote:
> On 10/8/21 15:53, Mel Gorman wrote:
> > Memcg reclaim throttles on congestion if no reclaim progress is made.
> > This makes little sense, it might be due to writeback or a host of
> > other factors.
> > 
> > For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
> > in the page allocator if it is failing to make progress. Kswapd
> > throttles if too many pages are under writeback and marked for
> > immediate reclaim.
> > 
> > This patch explicitly throttles if reclaim is failing to make progress.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ...
> > @@ -3769,6 +3797,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >  	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
> >  	set_task_reclaim_state(current, NULL);
> >  
> > +	if (!nr_reclaimed) {
> > +		struct zoneref *z;
> > +		pg_data_t *pgdat;
> > +
> > +		z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
> > +		pgdat = zonelist_zone(z)->zone_pgdat;
> > +
> > +		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
> > +	}
> 
> Is this necessary? AFAICS here we just returned from:
> 
> do_try_to_free_pages()
>   shrink_zones()
>    for_each_zone()...
>      consider_reclaim_throttle()
> 
> Which already throttles when needed and using the appropriate pgdat, while
> here we have to somewhat awkwardly assume the preferred one.
> 

Yes, you're right, consider_reclaim_throttle not only throttles on the
appropriate pgdat but takes priority into account.

Well spotted!

-- 
Mel Gorman
SUSE Labs
* Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
  2021-10-14 13:03     ` Mel Gorman
@ 2021-10-14 15:45       ` Vlastimil Babka
From: Vlastimil Babka @ 2021-10-14 15:45 UTC
To: Mel Gorman
Cc: Linux-MM, NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Johannes Weiner, Jonathan Corbet, Linux-fsdevel, LKML

On 10/14/21 15:03, Mel Gorman wrote:
> On Thu, Oct 14, 2021 at 02:31:17PM +0200, Vlastimil Babka wrote:
>> On 10/8/21 15:53, Mel Gorman wrote:
>> > Memcg reclaim throttles on congestion if no reclaim progress is made.
>> > This makes little sense, it might be due to writeback or a host of
>> > other factors.
>> > 
>> > For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
>> > in the page allocator if it is failing to make progress. Kswapd
>> > throttles if too many pages are under writeback and marked for
>> > immediate reclaim.
>> > 
>> > This patch explicitly throttles if reclaim is failing to make progress.
>> > 
>> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> ...
>> > @@ -3769,6 +3797,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>> >  	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>> >  	set_task_reclaim_state(current, NULL);
>> >  
>> > +	if (!nr_reclaimed) {
>> > +		struct zoneref *z;
>> > +		pg_data_t *pgdat;
>> > +
>> > +		z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
>> > +		pgdat = zonelist_zone(z)->zone_pgdat;
>> > +
>> > +		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
>> > +	}
>> 
>> Is this necessary? AFAICS here we just returned from:
>> 
>> do_try_to_free_pages()
>>   shrink_zones()
>>    for_each_zone()...
>>      consider_reclaim_throttle()
>> 
>> Which already throttles when needed and using the appropriate pgdat, while
>> here we have to somewhat awkwardly assume the preferred one.
>> 
> 
> Yes, you're right, consider_reclaim_throttle not only throttles on the
> appropriate pgdat but takes priority into account.
> 
> Well spotted!

So with that part removed

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!
end of thread, other threads: [~2021-11-24 18:02 UTC]

Thread overview: 23+ messages
2021-10-22 14:46 [PATCH v5 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
2021-10-22 14:46 ` [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
2021-10-22 14:46 ` [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated Mel Gorman
2021-10-22 14:46 ` [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
2021-11-24  1:19 ` Darrick J. Wong
2021-11-24  1:49 ` Darrick J. Wong
2021-11-24 14:35 ` Mel Gorman
2021-11-24 18:02 ` Darrick J. Wong
2021-11-24 10:32 ` Mel Gorman
2021-11-24 10:43 ` Vlastimil Babka
2021-11-24 10:53 ` Mel Gorman
2021-11-24 17:24 ` Mike Galbraith
2021-10-22 14:46 ` [PATCH 4/8] mm/writeback: Throttle based on page writeback instead of congestion Mel Gorman
2021-10-22 14:46 ` [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator Mel Gorman
2021-10-25 10:07 ` Vlastimil Babka
2021-10-22 14:46 ` [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle Mel Gorman
2021-10-22 14:46 ` [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress Mel Gorman
2021-10-22 14:46 ` [PATCH 8/8] mm/vmscan: Delay waking of tasks throttled on NOPROGRESS Mel Gorman
-- strict thread matches above, loose matches on Subject: below --
2021-10-19  9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
2021-10-19  9:01 ` [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
2021-10-08 13:53 [PATCH v3 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
2021-10-08 13:53 ` [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
2021-10-14 12:31 ` Vlastimil Babka
2021-10-14 13:03 ` Mel Gorman
2021-10-14 15:45 ` Vlastimil Babka