* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis
From: Hillf Danton @ 2016-07-04 10:08 UTC
To: Mel Gorman; +Cc: linux-kernel, linux-mm

> @@ -2561,17 +2580,23 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> 	 * highmem pages could be pinning lowmem pages storing buffer_heads
> 	 */
> 	orig_mask = sc->gfp_mask;
> -	if (buffer_heads_over_limit)
> +	if (buffer_heads_over_limit) {
> 		sc->gfp_mask |= __GFP_HIGHMEM;
> +		sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask);
> +	}
>
Do we need to push/pop ->reclaim_idx the same way ->gfp_mask is handled?

thanks
Hillf
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis
From: Mel Gorman @ 2016-07-04 10:33 UTC
To: Hillf Danton; +Cc: linux-kernel, linux-mm

On Mon, Jul 04, 2016 at 06:08:27PM +0800, Hillf Danton wrote:
> > @@ -2561,17 +2580,23 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > 	 * highmem pages could be pinning lowmem pages storing buffer_heads
> > 	 */
> > 	orig_mask = sc->gfp_mask;
> > -	if (buffer_heads_over_limit)
> > +	if (buffer_heads_over_limit) {
> > 		sc->gfp_mask |= __GFP_HIGHMEM;
> > +		sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask);
> > +	}
> >
> We need to push/pop ->reclaim_idx as ->gfp_mask handled?
>
I saw no harm in having one full reclaim attempt reclaiming from all zones
if buffer_heads_over_limit was triggered. If it fails, the page allocator
will loop again and reset the reclaim_idx.

-- 
Mel Gorman
SUSE Labs
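For reference, the push/pop Hillf asks about would simply mirror how
orig_mask is already saved and restored around the zonelist walk in
shrink_zones(). The sketch below is illustrative only and is not what the
posted patch does; as Mel notes, the series relies on the page allocator
resetting reclaim_idx on the next reclaim attempt instead.

	enum zone_type orig_reclaim_idx;

	/* Save, as shrink_zones() already does for the gfp mask */
	orig_mask = sc->gfp_mask;
	orig_reclaim_idx = sc->reclaim_idx;
	if (buffer_heads_over_limit) {
		sc->gfp_mask |= __GFP_HIGHMEM;
		sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask);
	}

	/* ... for_each_zone_zonelist_nodemask() walk calling shrink_node() ... */

	/*
	 * Restore the original values. shrink_zones() already restores
	 * gfp_mask at this point; reclaim_idx would be popped the same way.
	 */
	sc->gfp_mask = orig_mask;
	sc->reclaim_idx = orig_reclaim_idx;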
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis
From: Hillf Danton @ 2016-07-05 3:17 UTC
To: Mel Gorman; +Cc: linux-kernel, linux-mm, Andrew Morton

>
> This patch makes reclaim decisions on a per-node basis. A reclaimer knows
> what zone is required by the allocation request and skips pages from
> higher zones. In many cases this will be ok because it's a GFP_HIGHMEM
> request of some description. On 64-bit, ZONE_DMA32 requests will cause
> some problems but 32-bit devices on 64-bit platforms are increasingly
> rare. Historically it would have been a major problem on 32-bit with big
> Highmem:Lowmem ratios but such configurations are also now rare and even
> where they exist, they are not encouraged. If it really becomes a
> problem, it'll manifest as very low reclaim efficiencies.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/vmscan.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 55 insertions(+), 24 deletions(-)
* [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
From: Mel Gorman @ 2016-07-01 20:01 UTC
To: Andrew Morton, Linux-MM
Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

(Sorry for the resend, I accidentally sent the branch that still had the
Signed-off-bys from mmotm applied, which is incorrect.)

Previous releases double accounted LRU stats on the zone and the node
because it was required by should_reclaim_retry. The last patch in the
series removes the double accounting. It's not integrated with the series
as reviewers may not like the solution. If not, it can be safely dropped
without a major impact to the results.

Changelog since v7
o Rebase onto current mmots
o Avoid double accounting of stats in node and zone
o Kswapd will avoid more reclaim if an eligible zone is available
o Remove some duplications of sc->reclaim_idx and classzone_idx
o Print per-node stats in zoneinfo

Changelog since v6
o Correct reclaim_idx when direct reclaiming for memcg
o Also account LRU pages per zone for compaction/reclaim
o Add page_pgdat helper with more efficient lookup
o Init pgdat LRU lock only once
o Slight optimisation to wake_all_kswapds
o Always wake kcompactd when kswapd is going to sleep
o Rebase to mmotm as of June 15th, 2016

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node and is based upon 4.7-rc4 with Andrew's tree applied. While this
is a current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details. Some
of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from. This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages in
   that LRU. Kswapd, on the other hand, starts reclaim on the highest
   unbalanced zone. A difference in the distribution of file/anon pages
   due to when they were allocated can result in a difference in aging.
   While the fair zone allocation policy mitigates some of the problems
   here, the page reclaim results on a multi-zone node will always be
   different to those on a single-zone node.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other. This mitigates the page allocator
   using pages that were allocated very recently in the ideal case, but
   it's sensitive to timing. When kswapd is allocating from lower zones
   then it's great, but during the rebalancing of the highest zone, the
   page allocator and kswapd interfere with each other. It's worse if the
   highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the
   exact relationship between slab reclaim and LRU reclaim.
The reason we have zone-based reclaim is that we used to have large highmem
zones in common configurations and it was necessary to quickly find
ZONE_NORMAL pages for reclaim. Today, this is much less of a concern as
machines with lots of memory will (or should) use 64-bit kernels.
Combinations of 32-bit hardware and 64-bit hardware are rare. Machines that
do use highmem should have lower highmem:lowmem ratios than we worried
about in the past.

Conceptually, moving to node LRUs should be easier to understand. The page
allocator plays fewer tricks to game reclaim and reclaim behaves similarly
on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48 core
NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested up to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

                                 4.7.0-rc4            4.7.0-rc4
                            mmotm-20160623           nodelru-v8
Min total-odr0-1          490.00 (  0.00%)     463.00 (  5.51%)
Min total-odr0-2          349.00 (  0.00%)     325.00 (  6.88%)
Min total-odr0-4          288.00 (  0.00%)     272.00 (  5.56%)
Min total-odr0-8          250.00 (  0.00%)     235.00 (  6.00%)
Min total-odr0-16         234.00 (  0.00%)     222.00 (  5.13%)
Min total-odr0-32         223.00 (  0.00%)     205.00 (  8.07%)
Min total-odr0-64         217.00 (  0.00%)     202.00 (  6.91%)
Min total-odr0-128        214.00 (  0.00%)     207.00 (  3.27%)
Min total-odr0-256        242.00 (  0.00%)     242.00 (  0.00%)
Min total-odr0-512        272.00 (  0.00%)     265.00 (  2.57%)
Min total-odr0-1024       290.00 (  0.00%)     283.00 (  2.41%)
Min total-odr0-2048       302.00 (  0.00%)     296.00 (  1.99%)
Min total-odr0-4096       311.00 (  0.00%)     306.00 (  1.61%)
Min total-odr0-8192       314.00 (  0.00%)     309.00 (  1.59%)
Min total-odr0-16384      315.00 (  0.00%)     309.00 (  1.90%)
Min total-odr1-1          741.00 (  0.00%)     716.00 (  3.37%)
Min total-odr1-2          565.00 (  0.00%)     524.00 (  7.26%)
Min total-odr1-4          457.00 (  0.00%)     427.00 (  6.56%)
Min total-odr1-8          408.00 (  0.00%)     371.00 (  9.07%)
Min total-odr1-16         383.00 (  0.00%)     344.00 ( 10.18%)
Min total-odr1-32         378.00 (  0.00%)     334.00 ( 11.64%)
Min total-odr1-64         383.00 (  0.00%)     334.00 ( 12.79%)
Min total-odr1-128        376.00 (  0.00%)     342.00 (  9.04%)
Min total-odr1-256        381.00 (  0.00%)     343.00 (  9.97%)
Min total-odr1-512        388.00 (  0.00%)     349.00 ( 10.05%)
Min total-odr1-1024       386.00 (  0.00%)     356.00 (  7.77%)
Min total-odr1-2048       389.00 (  0.00%)     362.00 (  6.94%)
Min total-odr1-4096       389.00 (  0.00%)     362.00 (  6.94%)
Min total-odr1-8192       389.00 (  0.00%)     362.00 (  6.94%)

This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;

                   4.7.0-rc4        4.7.0-rc4
              mmotm-20160623       nodelru-v8
User              191.39           191.61
System           2651.24          2504.48
Elapsed          2904.40          2757.01

The vmstats also showed that the fair zone allocation policy was
definitely removed as can be seen here;

                     4.7.0-rc3        4.7.0-rc3
                mmotm-20160623       nodelru-v8
DMA32 allocs       28794771816                0
Normal allocs      48432582848      77227356392
Movable allocs               0                0

tiobench on ext4
----------------

tiobench is a benchmark that artificially benefits if old pages remain
resident while new pages get reclaimed. The fair zone allocation policy
mitigates this problem so pages age fairly. While the benchmark has
problems, it is important that tiobench performance remains constant as it
implies that page aging problems that the fair zone allocation policy
fixes are not re-introduced.
                                 4.7.0-rc4            4.7.0-rc4
                            mmotm-20160623           nodelru-v8
Min PotentialReadSpeed     89.65 (  0.00%)      90.34 (  0.77%)
Min SeqRead-MB/sec-1       82.68 (  0.00%)      83.13 (  0.54%)
Min SeqRead-MB/sec-2       72.76 (  0.00%)      72.15 ( -0.84%)
Min SeqRead-MB/sec-4       75.13 (  0.00%)      74.23 ( -1.20%)
Min SeqRead-MB/sec-8       64.91 (  0.00%)      65.25 (  0.52%)
Min SeqRead-MB/sec-16      62.24 (  0.00%)      62.76 (  0.84%)
Min RandRead-MB/sec-1       0.88 (  0.00%)       0.95 (  7.95%)
Min RandRead-MB/sec-2       0.95 (  0.00%)       0.94 ( -1.05%)
Min RandRead-MB/sec-4       1.43 (  0.00%)       1.46 (  2.10%)
Min RandRead-MB/sec-8       1.61 (  0.00%)       1.58 ( -1.86%)
Min RandRead-MB/sec-16      1.80 (  0.00%)       1.93 (  7.22%)
Min SeqWrite-MB/sec-1      76.41 (  0.00%)      78.84 (  3.18%)
Min SeqWrite-MB/sec-2      74.11 (  0.00%)      73.35 ( -1.03%)
Min SeqWrite-MB/sec-4      80.05 (  0.00%)      78.69 ( -1.70%)
Min SeqWrite-MB/sec-8      72.88 (  0.00%)      71.38 ( -2.06%)
Min SeqWrite-MB/sec-16     75.91 (  0.00%)      75.81 ( -0.13%)
Min RandWrite-MB/sec-1      1.18 (  0.00%)       1.12 ( -5.08%)
Min RandWrite-MB/sec-2      1.02 (  0.00%)       1.02 (  0.00%)
Min RandWrite-MB/sec-4      1.05 (  0.00%)       0.99 ( -5.71%)
Min RandWrite-MB/sec-8      0.89 (  0.00%)       0.92 (  3.37%)
Min RandWrite-MB/sec-16     0.92 (  0.00%)       0.89 ( -3.26%)

This shows that the series has little or no impact on tiobench, which is
desirable. It indicates that the fair zone allocation policy was removed
in a manner that didn't reintroduce one class of page aging bug. There
were only minor differences in overall reclaim activity

                              4.7.0-rc4        4.7.0-rc4
                         mmotm-20160623       nodelru-v8
Minor Faults                     645838           644036
Major Faults                        573              593
Swap Ins                              0                0
Swap Outs                             0                0
Allocation stalls                    24                0
DMA allocs                            0                0
DMA32 allocs                   46041453         44154171
Normal allocs                  78053072         79865782
Movable allocs                        0                0
Direct pages scanned              10969            54504
Kswapd pages scanned           93375144         93250583
Kswapd pages reclaimed         93372243         93247714
Direct pages reclaimed            10969            54504
Kswapd efficiency                   99%              99%
Kswapd velocity               13741.015        13711.950
Direct efficiency                  100%             100%
Direct velocity                   1.614            8.014
Percentage direct scans              0%               0%
Zone normal velocity           8641.875        13719.964
Zone dma32 velocity            5100.754            0.000
Zone dma velocity                 0.000            0.000
Page writes by reclaim            0.000            0.000
Page writes file                      0                0
Page writes anon                      0                0
Page reclaim immediate               37               54

kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 8 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe.

pgbench Transactions
                          4.7.0-rc4            4.7.0-rc4
                     mmotm-20160623           nodelru-v8
Hmean 1             188.26 (  0.00%)     189.78 (  0.81%)
Hmean 5             330.66 (  0.00%)     328.69 ( -0.59%)
Hmean 12            370.32 (  0.00%)     380.72 (  2.81%)
Hmean 21            368.89 (  0.00%)     369.00 (  0.03%)
Hmean 30            382.14 (  0.00%)     360.89 ( -5.56%)
Hmean 32            428.87 (  0.00%)     432.96 (  0.95%)

Negligible differences again. As with tiobench, overall reclaim activity
was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim
stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts
of data from disk.
                          4.7.0-rc3            4.7.0-rc3
                     mmotm-20160615        nodelru-v7r17
Amean Elapsd-1      181.57 (  0.00%)     179.63 (  1.07%)
Amean Elapsd-3      188.29 (  0.00%)     183.68 (  2.45%)
Amean Elapsd-5      188.02 (  0.00%)     181.73 (  3.35%)
Amean Elapsd-7      186.07 (  0.00%)     184.11 (  1.05%)
Amean Elapsd-12     188.16 (  0.00%)     183.51 (  2.47%)
Amean Elapsd-16     189.03 (  0.00%)     181.27 (  4.10%)

                   4.7.0-rc3        4.7.0-rc3
              mmotm-20160615    nodelru-v7r17
User             1439.23          1433.37
System           8332.31          8216.01
Elapsed          3619.80          3532.69

There is a slight gain in performance, some of which is from the reduced
system CPU usage. There are minor differences in reclaim activity but
nothing significant

                              4.7.0-rc3        4.7.0-rc3
                         mmotm-20160615    nodelru-v7r17
Minor Faults                     362486           358215
Major Faults                       1143             1113
Swap Ins                             26                0
Swap Outs                          2920              482
DMA allocs                            0                0
DMA32 allocs                   31568814         28598887
Normal allocs                  46539922         49514444
Movable allocs                        0                0
Allocation stalls                     0                0
Direct pages scanned                  0                0
Kswapd pages scanned           40886878         40849710
Kswapd pages reclaimed         40869923         40835207
Direct pages reclaimed                0                0
Kswapd efficiency                   99%              99%
Kswapd velocity               11295.342        11563.344
Direct efficiency                  100%             100%
Direct velocity                   0.000            0.000
Slabs scanned                    131673           126099
Direct inode steals                  57               60
Kswapd inode steals                 762               18

It basically shows that kswapd was active at roughly the same rate in both
kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in numbers of inodes reclaimed but the workload has few active inodes and
is likely a timing artifact. It's interesting to note that the node-lru
did not swap in any pages but given the low swap activity, it's unlikely
to be significant.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
                               4.7.0-rc4            4.7.0-rc4
                          mmotm-20160623           nodelru-v8
Min mmap                 16.6283 (  0.00%)    16.1394 (  2.94%)
1st-qrtle mmap           54.7570 (  0.00%)    55.2975 ( -0.99%)
2nd-qrtle mmap           57.3163 (  0.00%)    57.5230 ( -0.36%)
3rd-qrtle mmap           58.9976 (  0.00%)    58.0537 (  1.60%)
Max-90% mmap             59.7433 (  0.00%)    58.3910 (  2.26%)
Max-93% mmap             60.1298 (  0.00%)    58.4801 (  2.74%)
Max-95% mmap             73.4112 (  0.00%)    58.5537 ( 20.24%)
Max-99% mmap             92.8542 (  0.00%)    58.9673 ( 36.49%)
Max mmap               1440.6569 (  0.00%)   137.6875 ( 90.44%)
Mean mmap                59.3493 (  0.00%)    55.5153 (  6.46%)
Best99%Mean mmap         57.2121 (  0.00%)    55.4194 (  3.13%)
Best95%Mean mmap         55.9113 (  0.00%)    55.2813 (  1.13%)
Best90%Mean mmap         55.6199 (  0.00%)    55.1044 (  0.93%)
Best50%Mean mmap         53.2183 (  0.00%)    52.8330 (  0.72%)
Best10%Mean mmap         45.9842 (  0.00%)    42.3740 (  7.85%)
Best5%Mean mmap          43.2256 (  0.00%)    38.8660 ( 10.09%)
Best1%Mean mmap          32.9388 (  0.00%)    27.7577 ( 15.73%)

This shows a number of improvements with the worst-case outlier greatly
improved.
Some of the vmstats are interesting

                              4.7.0-rc4        4.7.0-rc4
                         mmotm-20160623       nodelru-v8
Swap Ins                            163              239
Swap Outs                             0                0
Allocation stalls                  2603                0
DMA allocs                            0                0
DMA32 allocs                  618719206       1303037965
Normal allocs                 891235743        229914091
Movable allocs                        0                0
Direct pages scanned             216787             3173
Kswapd pages scanned           50719775         41732250
Kswapd pages reclaimed         41541765         41731168
Direct pages reclaimed           209159             3173
Kswapd efficiency                   81%              99%
Kswapd velocity               16859.554        14231.043
Direct efficiency                   96%             100%
Direct velocity                  72.061            1.082
Percentage direct scans              0%               0%
Zone normal velocity           8431.777        14232.125
Zone dma32 velocity            8499.838            0.000
Zone dma velocity                 0.000            0.000
Page writes by reclaim      6215049.000            0.000
Page writes file                6215049                0
Page writes anon                      0                0
Page reclaim immediate            70673              143
Sector Reads                   81940800         81489388
Sector Writes                 100158984         99161860
Page rescued immediate                0                0
Slabs scanned                   1366954            21196

While this is not guaranteed in all cases, this particular test showed a
large reduction in direct reclaim activity. It's also worth noting that no
page writes were issued from reclaim context.

This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
those areas.

1. Reclaim/compaction is going to be affected because the amount of
   reclaim is no longer targeted at a specific zone. Compaction works on a
   per-zone basis so there is no guarantee that reclaiming a few THPs'
   worth of pages will have a positive impact on compaction success rates.

2. The Slab/LRU reclaim ratio is affected because the frequency the
   shrinkers are called is now different. This may or may not be a problem
   but if it is, it'll be because shrinkers are not called enough and some
   balancing is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied
   are distributed between zones and the fair zone allocation policy used
   to do something very similar for anon. The distribution is now
   different, but not necessarily in a way that matters; it's still worth
   bearing in mind.
Mel Gorman (31):
  mm, vmstat: add infrastructure for per-node vmstats
  mm, vmscan: move lru_lock to the node
  mm, vmscan: move LRU lists to node
  mm, vmscan: begin reclaiming pages on a per-node basis
  mm, vmscan: have kswapd only scan based on the highest requested zone
  mm, vmscan: make kswapd reclaim in terms of nodes
  mm, vmscan: remove balance gap
  mm, vmscan: simplify the logic deciding whether kswapd sleeps
  mm, vmscan: by default have direct reclaim only shrink once per node
  mm, vmscan: remove duplicate logic clearing node congestion and dirty state
  mm: vmscan: do not reclaim from kswapd if there is any eligible zone
  mm, vmscan: make shrink_node decisions more node-centric
  mm, memcg: move memcg limit enforcement from zones to nodes
  mm, workingset: make working set detection node-aware
  mm, page_alloc: consider dirtyable memory in terms of nodes
  mm: move page mapped accounting to the node
  mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  mm: move most file-based accounting to the node
  mm: move vmscan writes and file write accounting to the node
  mm, vmscan: only wakeup kswapd once per node for the requested classzone
  mm, page_alloc: Wake kswapd based on the highest eligible zone
  mm: convert zone_reclaim to node_reclaim
  mm, vmscan: Avoid passing in classzone_idx unnecessarily to shrink_node
  mm, vmscan: Avoid passing in classzone_idx unnecessarily to compaction_ready
  mm, vmscan: add classzone information to tracepoints
  mm, page_alloc: remove fair zone allocation policy
  mm: page_alloc: cache the last node whose dirty limit is reached
  mm: vmstat: replace __count_zone_vm_events with a zone id equivalent
  mm: vmstat: account per-zone stalls and pages skipped during reclaim
  mm, vmstat: print node-based stats in zoneinfo file
  mm, vmstat: Remove zone and node double accounting by approximating retries

 Documentation/cgroup-v1/memcg_test.txt        |   4 +-
 Documentation/cgroup-v1/memory.txt            |   4 +-
 arch/s390/appldata/appldata_mem.c             |   2 +-
 arch/tile/mm/pgtable.c                        |  18 +-
 drivers/base/node.c                           |  77 ++-
 drivers/staging/android/lowmemorykiller.c     |  12 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c |   6 +-
 fs/fs-writeback.c                             |   4 +-
 fs/fuse/file.c                                |   8 +-
 fs/nfs/internal.h                             |   2 +-
 fs/nfs/write.c                                |   2 +-
 fs/proc/meminfo.c                             |  20 +-
 include/linux/backing-dev.h                   |   2 +-
 include/linux/memcontrol.h                    |  61 +-
 include/linux/mm.h                            |   5 +
 include/linux/mm_inline.h                     |  35 +-
 include/linux/mm_types.h                      |   2 +-
 include/linux/mmzone.h                        | 155 +++--
 include/linux/swap.h                          |  24 +-
 include/linux/topology.h                      |   2 +-
 include/linux/vm_event_item.h                 |  14 +-
 include/linux/vmstat.h                        | 111 +++-
 include/linux/writeback.h                     |   2 +-
 include/trace/events/vmscan.h                 |  63 +-
 include/trace/events/writeback.h              |  10 +-
 kernel/power/snapshot.c                       |  10 +-
 kernel/sysctl.c                               |   4 +-
 mm/backing-dev.c                              |  15 +-
 mm/compaction.c                               |  50 +-
 mm/filemap.c                                  |  16 +-
 mm/huge_memory.c                              |  12 +-
 mm/internal.h                                 |  11 +-
 mm/khugepaged.c                               |  14 +-
 mm/memcontrol.c                               | 215 +++----
 mm/memory-failure.c                           |   4 +-
 mm/memory_hotplug.c                           |   7 +-
 mm/mempolicy.c                                |   2 +-
 mm/migrate.c                                  |  35 +-
 mm/mlock.c                                    |  12 +-
 mm/page-writeback.c                           | 123 ++--
 mm/page_alloc.c                               | 371 +++++------
 mm/page_idle.c                                |   4 +-
 mm/rmap.c                                     |  26 +-
 mm/shmem.c                                    |  14 +-
 mm/swap.c                                     |  64 +-
 mm/swap_state.c                               |   4 +-
 mm/util.c                                     |   4 +-
 mm/vmscan.c                                   | 879 +++++++++++++-------------
 mm/vmstat.c                                   | 398 +++++++++---
 mm/workingset.c                               |  54 +-
 50 files changed, 1674 insertions(+), 1319 deletions(-)

-- 
2.6.4
* [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis 2016-07-01 20:01 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman @ 2016-07-01 20:01 ` Mel Gorman 2016-07-07 1:12 ` Joonsoo Kim 0 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2016-07-01 20:01 UTC (permalink / raw) To: Andrew Morton, Linux-MM Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman This patch makes reclaim decisions on a per-node basis. A reclaimer knows what zone is required by the allocation request and skips pages from higher zones. In many cases this will be ok because it's a GFP_HIGHMEM request of some description. On 64-bit, ZONE_DMA32 requests will cause some problems but 32-bit devices on 64-bit platforms are increasingly rare. Historically it would have been a major problem on 32-bit with big Highmem:Lowmem ratios but such configurations are also now rare and even where they exist, they are not encouraged. If it really becomes a problem, it'll manifest as very low reclaim efficiencies. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- mm/vmscan.c | 79 ++++++++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 55 insertions(+), 24 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 86a523a761c9..766b36bec829 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -84,6 +84,9 @@ struct scan_control { /* Scan (total_size >> priority) pages at once */ int priority; + /* The highest zone to isolate pages for reclaim from */ + enum zone_type reclaim_idx; + unsigned int may_writepage:1; /* Can mapped pages be reclaimed? */ @@ -1392,6 +1395,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, unsigned long nr_taken = 0; unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 }; unsigned long scan, nr_pages; + LIST_HEAD(pages_skipped); for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan && !list_empty(src); scan++) { @@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, VM_BUG_ON_PAGE(!PageLRU(page), page); + if (page_zonenum(page) > sc->reclaim_idx) { + list_move(&page->lru, &pages_skipped); + continue; + } + switch (__isolate_lru_page(page, mode)) { case 0: nr_pages = hpage_nr_pages(page); @@ -1420,6 +1429,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, } } + /* + * Splice any skipped pages to the start of the LRU list. Note that + * this disrupts the LRU order when reclaiming for lower zones but + * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX + * scanning would soon rescan the same pages to skip and put the + * system at risk of premature OOM. + */ + if (!list_empty(&pages_skipped)) + list_splice(&pages_skipped, src); *nr_scanned = scan; trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan, nr_taken, mode, is_file_lru(lru)); @@ -1589,7 +1607,7 @@ static int current_may_throttle(void) } /* - * shrink_inactive_list() is a helper for shrink_zone(). It returns the number + * shrink_inactive_list() is a helper for shrink_node(). 
It returns the number * of reclaimed pages */ static noinline_for_stack unsigned long @@ -2401,12 +2419,13 @@ static inline bool should_continue_reclaim(struct zone *zone, } } -static bool shrink_zone(struct zone *zone, struct scan_control *sc, - bool is_classzone) +static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc, + enum zone_type classzone_idx) { struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long nr_reclaimed, nr_scanned; bool reclaimable = false; + struct zone *zone = &pgdat->node_zones[classzone_idx]; do { struct mem_cgroup *root = sc->target_mem_cgroup; @@ -2438,7 +2457,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, shrink_zone_memcg(zone, memcg, sc, &lru_pages); zone_lru_pages += lru_pages; - if (memcg && is_classzone) + if (!global_reclaim(sc)) shrink_slab(sc->gfp_mask, zone_to_nid(zone), memcg, sc->nr_scanned - scanned, lru_pages); @@ -2469,7 +2488,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, * Shrink the slab caches in the same proportion that * the eligible LRU pages were scanned. */ - if (global_reclaim(sc) && is_classzone) + if (global_reclaim(sc)) shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL, sc->nr_scanned - nr_scanned, zone_lru_pages); @@ -2553,7 +2572,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; gfp_t orig_mask; - enum zone_type requested_highidx = gfp_zone(sc->gfp_mask); + enum zone_type classzone_idx; /* * If the number of buffer_heads in the machine exceeds the maximum @@ -2561,17 +2580,23 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) * highmem pages could be pinning lowmem pages storing buffer_heads */ orig_mask = sc->gfp_mask; - if (buffer_heads_over_limit) + if (buffer_heads_over_limit) { sc->gfp_mask |= __GFP_HIGHMEM; + sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask); + } for_each_zone_zonelist_nodemask(zone, z, zonelist, - gfp_zone(sc->gfp_mask), sc->nodemask) { - enum zone_type classzone_idx; - + sc->reclaim_idx, sc->nodemask) { if (!populated_zone(zone)) continue; - classzone_idx = requested_highidx; + /* + * Note that reclaim_idx does not change as it is the highest + * zone reclaimed from which for empty zones is a no-op but + * classzone_idx is used by shrink_node to test if the slabs + * should be shrunk on a given node. 
+ */ + classzone_idx = sc->reclaim_idx; while (!populated_zone(zone->zone_pgdat->node_zones + classzone_idx)) classzone_idx--; @@ -2600,8 +2625,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) */ if (IS_ENABLED(CONFIG_COMPACTION) && sc->order > PAGE_ALLOC_COSTLY_ORDER && - zonelist_zone_idx(z) <= requested_highidx && - compaction_ready(zone, sc->order, requested_highidx)) { + zonelist_zone_idx(z) <= classzone_idx && + compaction_ready(zone, sc->order, classzone_idx)) { sc->compaction_ready = true; continue; } @@ -2621,7 +2646,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) /* need some check for avoid more shrink_zone() */ } - shrink_zone(zone, sc, zone_idx(zone) == classzone_idx); + shrink_node(zone->zone_pgdat, sc, classzone_idx); } /* @@ -2847,6 +2872,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, struct scan_control sc = { .nr_to_reclaim = SWAP_CLUSTER_MAX, .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)), + .reclaim_idx = gfp_zone(gfp_mask), .order = order, .nodemask = nodemask, .priority = DEF_PRIORITY, @@ -2886,6 +2912,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg, .target_mem_cgroup = memcg, .may_writepage = !laptop_mode, .may_unmap = 1, + .reclaim_idx = MAX_NR_ZONES - 1, .may_swap = !noswap, }; unsigned long lru_pages; @@ -2924,6 +2951,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK), + .reclaim_idx = MAX_NR_ZONES - 1, .target_mem_cgroup = memcg, .priority = DEF_PRIORITY, .may_writepage = !laptop_mode, @@ -3118,7 +3146,7 @@ static bool kswapd_shrink_zone(struct zone *zone, balance_gap, classzone_idx)) return true; - shrink_zone(zone, sc, zone_idx(zone) == classzone_idx); + shrink_node(zone->zone_pgdat, sc, classzone_idx); /* TODO: ANOMALY */ clear_bit(PGDAT_WRITEBACK, &pgdat->flags); @@ -3167,6 +3195,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) unsigned long nr_soft_scanned; struct scan_control sc = { .gfp_mask = GFP_KERNEL, + .reclaim_idx = MAX_NR_ZONES - 1, .order = order, .priority = DEF_PRIORITY, .may_writepage = !laptop_mode, @@ -3237,15 +3266,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) sc.may_writepage = 1; /* - * Now scan the zone in the dma->highmem direction, stopping - * at the last zone which needs scanning. - * - * We do this because the page allocator works in the opposite - * direction. This prevents the page allocator from allocating - * pages behind kswapd's direction of progress, which would - * cause too much scanning of the lower zones. + * Continue scanning in the highmem->dma direction stopping at + * the last zone which needs scanning. This may reclaim lowmem + * pages that are not necessary for zone balancing but it + * preserves LRU ordering. It is assumed that the bulk of + * allocation requests can use arbitrary zones with the + * possible exception of big highmem:lowmem configurations. 
*/ - for (i = 0; i <= end_zone; i++) { + for (i = end_zone; i >= 0; i--) { struct zone *zone = pgdat->node_zones + i; if (!populated_zone(zone)) @@ -3256,6 +3284,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) continue; sc.nr_scanned = 0; + sc.reclaim_idx = i; nr_soft_scanned = 0; /* @@ -3513,6 +3542,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim) struct scan_control sc = { .nr_to_reclaim = nr_to_reclaim, .gfp_mask = GFP_HIGHUSER_MOVABLE, + .reclaim_idx = MAX_NR_ZONES - 1, .priority = DEF_PRIORITY, .may_writepage = 1, .may_unmap = 1, @@ -3704,6 +3734,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), .may_unmap = !!(zone_reclaim_mode & RECLAIM_UNMAP), .may_swap = 1, + .reclaim_idx = zone_idx(zone), }; cond_resched(); @@ -3723,7 +3754,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) * priorities until we have enough memory freed. */ do { - shrink_zone(zone, &sc, true); + shrink_node(zone->zone_pgdat, &sc, zone_idx(zone)); } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0); } -- 2.6.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
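As a reader's aid, the comment below gathers in one place where the hunks
above initialise sc->reclaim_idx for each reclaim entry point. It only
restates what the patch does and adds no new behaviour:

	/*
	 * sc->reclaim_idx initialisation in this patch:
	 *
	 *   try_to_free_pages()            gfp_zone(gfp_mask)        (direct reclaim)
	 *   mem_cgroup_shrink_node_zone()  MAX_NR_ZONES - 1
	 *   try_to_free_mem_cgroup_pages() MAX_NR_ZONES - 1
	 *   balance_pgdat()                MAX_NR_ZONES - 1, then set to i
	 *                                  as each zone is scanned in the
	 *                                  highmem->dma direction
	 *   shrink_all_memory()            MAX_NR_ZONES - 1          (hibernation)
	 *   __zone_reclaim()               zone_idx(zone)            (zone_reclaim_mode)
	 *   shrink_zones()                 gfp_zone(sc->gfp_mask) when
	 *                                  buffer_heads_over_limit adds __GFP_HIGHMEM
	 */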
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis 2016-07-01 20:01 ` [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis Mel Gorman @ 2016-07-07 1:12 ` Joonsoo Kim 2016-07-07 9:48 ` Mel Gorman 0 siblings, 1 reply; 16+ messages in thread From: Joonsoo Kim @ 2016-07-07 1:12 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Fri, Jul 01, 2016 at 09:01:12PM +0100, Mel Gorman wrote: > This patch makes reclaim decisions on a per-node basis. A reclaimer knows > what zone is required by the allocation request and skips pages from > higher zones. In many cases this will be ok because it's a GFP_HIGHMEM > request of some description. On 64-bit, ZONE_DMA32 requests will cause > some problems but 32-bit devices on 64-bit platforms are increasingly > rare. Historically it would have been a major problem on 32-bit with big > Highmem:Lowmem ratios but such configurations are also now rare and even > where they exist, they are not encouraged. If it really becomes a > problem, it'll manifest as very low reclaim efficiencies. > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > --- > mm/vmscan.c | 79 ++++++++++++++++++++++++++++++++++++++++++------------------- > 1 file changed, 55 insertions(+), 24 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 86a523a761c9..766b36bec829 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -84,6 +84,9 @@ struct scan_control { > /* Scan (total_size >> priority) pages at once */ > int priority; > > + /* The highest zone to isolate pages for reclaim from */ > + enum zone_type reclaim_idx; > + > unsigned int may_writepage:1; > > /* Can mapped pages be reclaimed? */ > @@ -1392,6 +1395,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, > unsigned long nr_taken = 0; > unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 }; > unsigned long scan, nr_pages; > + LIST_HEAD(pages_skipped); > > for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan && > !list_empty(src); scan++) { > @@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, > > VM_BUG_ON_PAGE(!PageLRU(page), page); > > + if (page_zonenum(page) > sc->reclaim_idx) { > + list_move(&page->lru, &pages_skipped); > + continue; > + } > + Hello, Mel. I think that we don't need to skip LRU pages in active list. What we'd like to do is just skipping actual reclaim since it doesn't make freepage that we need. It's unrelated to skip the page in active list. And, I have a concern that if inactive LRU is full with higher zone's LRU pages, reclaim with low reclaim_idx could be stuck. This would be easily possible if fair zone allocation policy is removed because we will allocate the page on higher zone first. Thanks. > switch (__isolate_lru_page(page, mode)) { > case 0: > nr_pages = hpage_nr_pages(page); > @@ -1420,6 +1429,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, > } > } > > + /* > + * Splice any skipped pages to the start of the LRU list. Note that > + * this disrupts the LRU order when reclaiming for lower zones but > + * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX > + * scanning would soon rescan the same pages to skip and put the > + * system at risk of premature OOM. 
> + */ > + if (!list_empty(&pages_skipped)) > + list_splice(&pages_skipped, src); > *nr_scanned = scan; > trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan, > nr_taken, mode, is_file_lru(lru)); > @@ -1589,7 +1607,7 @@ static int current_may_throttle(void) > } > > /* > - * shrink_inactive_list() is a helper for shrink_zone(). It returns the number > + * shrink_inactive_list() is a helper for shrink_node(). It returns the number > * of reclaimed pages > */ > static noinline_for_stack unsigned long > @@ -2401,12 +2419,13 @@ static inline bool should_continue_reclaim(struct zone *zone, > } > } > > -static bool shrink_zone(struct zone *zone, struct scan_control *sc, > - bool is_classzone) > +static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc, > + enum zone_type classzone_idx) > { > struct reclaim_state *reclaim_state = current->reclaim_state; > unsigned long nr_reclaimed, nr_scanned; > bool reclaimable = false; > + struct zone *zone = &pgdat->node_zones[classzone_idx]; > > do { > struct mem_cgroup *root = sc->target_mem_cgroup; > @@ -2438,7 +2457,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > shrink_zone_memcg(zone, memcg, sc, &lru_pages); > zone_lru_pages += lru_pages; > > - if (memcg && is_classzone) > + if (!global_reclaim(sc)) > shrink_slab(sc->gfp_mask, zone_to_nid(zone), > memcg, sc->nr_scanned - scanned, > lru_pages); > @@ -2469,7 +2488,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > * Shrink the slab caches in the same proportion that > * the eligible LRU pages were scanned. > */ > - if (global_reclaim(sc) && is_classzone) > + if (global_reclaim(sc)) > shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL, > sc->nr_scanned - nr_scanned, > zone_lru_pages); > @@ -2553,7 +2572,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > unsigned long nr_soft_reclaimed; > unsigned long nr_soft_scanned; > gfp_t orig_mask; > - enum zone_type requested_highidx = gfp_zone(sc->gfp_mask); > + enum zone_type classzone_idx; > > /* > * If the number of buffer_heads in the machine exceeds the maximum > @@ -2561,17 +2580,23 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > * highmem pages could be pinning lowmem pages storing buffer_heads > */ > orig_mask = sc->gfp_mask; > - if (buffer_heads_over_limit) > + if (buffer_heads_over_limit) { > sc->gfp_mask |= __GFP_HIGHMEM; > + sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask); > + } > > for_each_zone_zonelist_nodemask(zone, z, zonelist, > - gfp_zone(sc->gfp_mask), sc->nodemask) { > - enum zone_type classzone_idx; > - > + sc->reclaim_idx, sc->nodemask) { > if (!populated_zone(zone)) > continue; > > - classzone_idx = requested_highidx; > + /* > + * Note that reclaim_idx does not change as it is the highest > + * zone reclaimed from which for empty zones is a no-op but > + * classzone_idx is used by shrink_node to test if the slabs > + * should be shrunk on a given node. 
> + */ > + classzone_idx = sc->reclaim_idx; > while (!populated_zone(zone->zone_pgdat->node_zones + > classzone_idx)) > classzone_idx--; > @@ -2600,8 +2625,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > */ > if (IS_ENABLED(CONFIG_COMPACTION) && > sc->order > PAGE_ALLOC_COSTLY_ORDER && > - zonelist_zone_idx(z) <= requested_highidx && > - compaction_ready(zone, sc->order, requested_highidx)) { > + zonelist_zone_idx(z) <= classzone_idx && > + compaction_ready(zone, sc->order, classzone_idx)) { > sc->compaction_ready = true; > continue; > } > @@ -2621,7 +2646,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > /* need some check for avoid more shrink_zone() */ > } > > - shrink_zone(zone, sc, zone_idx(zone) == classzone_idx); > + shrink_node(zone->zone_pgdat, sc, classzone_idx); > } > > /* > @@ -2847,6 +2872,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > struct scan_control sc = { > .nr_to_reclaim = SWAP_CLUSTER_MAX, > .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)), > + .reclaim_idx = gfp_zone(gfp_mask), > .order = order, > .nodemask = nodemask, > .priority = DEF_PRIORITY, > @@ -2886,6 +2912,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg, > .target_mem_cgroup = memcg, > .may_writepage = !laptop_mode, > .may_unmap = 1, > + .reclaim_idx = MAX_NR_ZONES - 1, > .may_swap = !noswap, > }; > unsigned long lru_pages; > @@ -2924,6 +2951,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), > .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | > (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK), > + .reclaim_idx = MAX_NR_ZONES - 1, > .target_mem_cgroup = memcg, > .priority = DEF_PRIORITY, > .may_writepage = !laptop_mode, > @@ -3118,7 +3146,7 @@ static bool kswapd_shrink_zone(struct zone *zone, > balance_gap, classzone_idx)) > return true; > > - shrink_zone(zone, sc, zone_idx(zone) == classzone_idx); > + shrink_node(zone->zone_pgdat, sc, classzone_idx); > > /* TODO: ANOMALY */ > clear_bit(PGDAT_WRITEBACK, &pgdat->flags); > @@ -3167,6 +3195,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) > unsigned long nr_soft_scanned; > struct scan_control sc = { > .gfp_mask = GFP_KERNEL, > + .reclaim_idx = MAX_NR_ZONES - 1, > .order = order, > .priority = DEF_PRIORITY, > .may_writepage = !laptop_mode, > @@ -3237,15 +3266,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) > sc.may_writepage = 1; > > /* > - * Now scan the zone in the dma->highmem direction, stopping > - * at the last zone which needs scanning. > - * > - * We do this because the page allocator works in the opposite > - * direction. This prevents the page allocator from allocating > - * pages behind kswapd's direction of progress, which would > - * cause too much scanning of the lower zones. > + * Continue scanning in the highmem->dma direction stopping at > + * the last zone which needs scanning. This may reclaim lowmem > + * pages that are not necessary for zone balancing but it > + * preserves LRU ordering. It is assumed that the bulk of > + * allocation requests can use arbitrary zones with the > + * possible exception of big highmem:lowmem configurations. 
> */ > - for (i = 0; i <= end_zone; i++) { > + for (i = end_zone; i >= 0; i--) { > struct zone *zone = pgdat->node_zones + i; > > if (!populated_zone(zone)) > @@ -3256,6 +3284,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) > continue; > > sc.nr_scanned = 0; > + sc.reclaim_idx = i; > > nr_soft_scanned = 0; > /* > @@ -3513,6 +3542,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim) > struct scan_control sc = { > .nr_to_reclaim = nr_to_reclaim, > .gfp_mask = GFP_HIGHUSER_MOVABLE, > + .reclaim_idx = MAX_NR_ZONES - 1, > .priority = DEF_PRIORITY, > .may_writepage = 1, > .may_unmap = 1, > @@ -3704,6 +3734,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) > .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), > .may_unmap = !!(zone_reclaim_mode & RECLAIM_UNMAP), > .may_swap = 1, > + .reclaim_idx = zone_idx(zone), > }; > > cond_resched(); > @@ -3723,7 +3754,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) > * priorities until we have enough memory freed. > */ > do { > - shrink_zone(zone, &sc, true); > + shrink_node(zone->zone_pgdat, &sc, zone_idx(zone)); > } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0); > } > > -- > 2.6.4 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis
From: Mel Gorman @ 2016-07-07 9:48 UTC
To: Joonsoo Kim
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

On Thu, Jul 07, 2016 at 10:12:12AM +0900, Joonsoo Kim wrote:
> > @@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >
> >  	VM_BUG_ON_PAGE(!PageLRU(page), page);
> >
> > +	if (page_zonenum(page) > sc->reclaim_idx) {
> > +		list_move(&page->lru, &pages_skipped);
> > +		continue;
> > +	}
> > +
>
> I think that we don't need to skip LRU pages in active list. What we'd
> like to do is just skipping actual reclaim since it doesn't make
> freepage that we need. It's unrelated to skip the page in active list.
>

Why?

The active aging is sometimes about simply aging the LRU list. Aging the
active list based on the timing of when a zone-constrained allocation
arrives potentially introduces the same zone-balancing problems we
currently have and applies them to node-lru.

> And, I have a concern that if inactive LRU is full with higher zone's
> LRU pages, reclaim with low reclaim_idx could be stuck.

That is an outside possibility but unlikely, given that it would require
that all outstanding allocation requests are zone-constrained. If it
happens that a premature OOM is encountered while the active list is
large, then inactive_list_is_low could take scan_control as a parameter
and use a different ratio for zone-constrained allocations if the scan
priority is elevated. It would be preferred to have an actual test case
for this so the altered ratio can be tested instead of introducing code
that may be useless or dead.

-- 
Mel Gorman
SUSE Labs
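To make the suggestion above concrete, here is a rough sketch of what
passing scan_control into inactive_list_is_low() could look like. It is
illustrative only and not part of the posted series: the priority
threshold, the 2:1 ratio and the use of a lruvec_lru_size()-style helper
are assumptions made purely for the example.

	static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
					 struct scan_control *sc)
	{
		unsigned long inactive, active;

		/* Deactivating anon pages is pointless without swap */
		if (!file && !total_swap_pages)
			return false;

		inactive = lruvec_lru_size(lruvec,
				file ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
		active = lruvec_lru_size(lruvec,
				file ? LRU_ACTIVE_FILE : LRU_ACTIVE_ANON);

		/*
		 * If reclaim is constrained to a subset of zones and is
		 * struggling (elevated priority), demand a larger inactive
		 * list so that eligible pages keep arriving on it. The
		 * ratio here is made up for illustration.
		 */
		if (sc && sc->reclaim_idx < MAX_NR_ZONES - 1 &&
		    sc->priority < DEF_PRIORITY - 2)
			return inactive < active * 2;

		return inactive < active;
	}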
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis
From: Joonsoo Kim @ 2016-07-08 2:28 UTC
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

On Thu, Jul 07, 2016 at 10:48:08AM +0100, Mel Gorman wrote:
> On Thu, Jul 07, 2016 at 10:12:12AM +0900, Joonsoo Kim wrote:
> > > @@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >
> > >  	VM_BUG_ON_PAGE(!PageLRU(page), page);
> > >
> > > +	if (page_zonenum(page) > sc->reclaim_idx) {
> > > +		list_move(&page->lru, &pages_skipped);
> > > +		continue;
> > > +	}
> > > +
> >
> > I think that we don't need to skip LRU pages in active list. What we'd
> > like to do is just skipping actual reclaim since it doesn't make
> > freepage that we need. It's unrelated to skip the page in active list.
> >
>
> Why?
>
> The active aging is sometimes about simply aging the LRU list. Aging the
> active list based on the timing of when a zone-constrained allocation arrives
> potentially introduces the same zone-balancing problems we currently have
> and applying them to node-lru.

Could you explain more? I don't understand why aging the active list
based on the timing of when a zone-constrained allocation arrives
introduces the zone-balancing problem again.

I think that if the above logic is applied to both the active and
inactive lists, it could cause a zone-balancing problem: LRU pages in
lower zones get more chance to stay resident in memory. What we want to
do with node-lru is to age all the LRU pages as equally as possible, so,
basically, we need to age the active/inactive lists regardless of the
allocation type. But there is a possibility that a zone-constrained
allocation would reclaim too many LRU pages unnecessarily to satisfy the
zone-constrained allocation, so we need to implement skipping such pages.
That can be done by skipping the page in the inactive list only.

> > And, I have a concern that if inactive LRU is full with higher zone's
> > LRU pages, reclaim with low reclaim_idx could be stuck.
>
> That is an outside possibility but unlikely given that it would require
> that all outstanding allocation requests are zone-contrained. If it happens

I'm not sure that it is an outside possibility. It can also happen if
there is a zone-constrained allocation requestor and a parallel memory
hogger. In this case, memory would be reclaimed by the memory hogger, but
the memory hogger would consume it again, so the inactive LRU stays
continually full of higher zones' LRU pages and the zone-constrained
allocation requestor cannot make progress.

> that a premature OOM is encountered while the active list is large then
> inactive_list_is_low could take scan_control as a parameter and use a
> different ratio for zone-contrained allocations if scan priority is elevated.

It would work.

> It would be preferred to have an actual test case for this so the
> altered ratio can be tested instead of introducing code that may be
> useless or dead.

Yes, an actual test case would be preferred. I will try to implement
an artificial test case myself but I'm not sure when I can do it.

Thanks.
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis 2016-07-08 2:28 ` Joonsoo Kim @ 2016-07-08 10:05 ` Mel Gorman 2016-07-14 6:28 ` Joonsoo Kim 0 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2016-07-08 10:05 UTC (permalink / raw) To: Joonsoo Kim Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Fri, Jul 08, 2016 at 11:28:52AM +0900, Joonsoo Kim wrote: > On Thu, Jul 07, 2016 at 10:48:08AM +0100, Mel Gorman wrote: > > On Thu, Jul 07, 2016 at 10:12:12AM +0900, Joonsoo Kim wrote: > > > > @@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, > > > > > > > > VM_BUG_ON_PAGE(!PageLRU(page), page); > > > > > > > > + if (page_zonenum(page) > sc->reclaim_idx) { > > > > + list_move(&page->lru, &pages_skipped); > > > > + continue; > > > > + } > > > > + > > > > > > I think that we don't need to skip LRU pages in active list. What we'd > > > like to do is just skipping actual reclaim since it doesn't make > > > freepage that we need. It's unrelated to skip the page in active list. > > > > > > > Why? > > > > The active aging is sometimes about simply aging the LRU list. Aging the > > active list based on the timing of when a zone-constrained allocation arrives > > potentially introduces the same zone-balancing problems we currently have > > and applying them to node-lru. > > Could you explain more? I don't understand why aging the active list > based on the timing of when a zone-constrained allocation arrives > introduces the zone-balancing problem again. > I mispoke. Avoid rotation of the active list based on the timing of a zone-constrained allocation is what I think potentially introduces problems. If there are zone-constrained allocations aging the active list then I worry that pages would be artificially preserved on the active list. No matter what we do, there is distortion of the aging for zone-constrained allocation because right now, it may deactivate high zone pages sooner than expected. > I think that if above logic is applied to both the active/inactive > list, it could cause zone-balancing problem. LRU pages on lower zone > can be resident on memory with more chance. If anything, with node-based LRU, it's high zone pages that can be resident on memory for longer but only if there are zone-constrained allocations. If we always reclaim based on age regardless of allocation requirements then there is a risk that high zones are reclaimed far earlier than expected. Basically, whether we skip pages in the active list or not there are distortions with page aging and the impact is workload dependent. Right now, I see no clear advantage to special casing active aging. If we suspect this is a problem in the future, it would be a simple matter of adding an additional bool parameter to isolate_lru_pages. > > > And, I have a concern that if inactive LRU is full with higher zone's > > > LRU pages, reclaim with low reclaim_idx could be stuck. > > > > That is an outside possibility but unlikely given that it would require > > that all outstanding allocation requests are zone-contrained. If it happens > > I'm not sure that it is outside possibility. It can also happens if there > is zone-contrained allocation requestor and parallel memory hogger. In > this case, memory would be reclaimed by memory hogger but memory hogger would > consume them again so inactive LRU is continually full with higher > zone's LRU pages and zone-contrained allocation requestor cannot > progress. 
> The same memory hogger will also be reclaiming the highmem pages and reallocating highmem pages. > > It would be preferred to have an actual test case for this so the > > altered ratio can be tested instead of introducing code that may be > > useless or dead. > > Yes, actual test case would be preferred. I will try to implement > an artificial test case by myself but I'm not sure when I can do it. > That would be appreciated. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
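For completeness, a rough sketch of the "additional bool parameter to
isolate_lru_pages" mentioned above, so that only inactive-list isolation
skips ineligible pages while active-list aging keeps its normal LRU order.
This is illustrative only and not part of the posted series; the parameter
name and the caller split are assumptions.

	/*
	 * New last parameter to isolate_lru_pages(): skip_ineligible.
	 * shrink_inactive_list() would pass true, shrink_active_list()
	 * would pass false so that aging is unaffected.
	 *
	 * Inside the isolation loop, the skip added by this patch would
	 * then become conditional:
	 */
	if (skip_ineligible && page_zonenum(page) > sc->reclaim_idx) {
		list_move(&page->lru, &pages_skipped);
		continue;
	}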
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis 2016-07-08 10:05 ` Mel Gorman @ 2016-07-14 6:28 ` Joonsoo Kim 2016-07-14 7:48 ` Vlastimil Babka 2016-07-18 12:11 ` Mel Gorman 0 siblings, 2 replies; 16+ messages in thread From: Joonsoo Kim @ 2016-07-14 6:28 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Fri, Jul 08, 2016 at 11:05:32AM +0100, Mel Gorman wrote: > On Fri, Jul 08, 2016 at 11:28:52AM +0900, Joonsoo Kim wrote: > > On Thu, Jul 07, 2016 at 10:48:08AM +0100, Mel Gorman wrote: > > > On Thu, Jul 07, 2016 at 10:12:12AM +0900, Joonsoo Kim wrote: > > > > > @@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, > > > > > > > > > > VM_BUG_ON_PAGE(!PageLRU(page), page); > > > > > > > > > > + if (page_zonenum(page) > sc->reclaim_idx) { > > > > > + list_move(&page->lru, &pages_skipped); > > > > > + continue; > > > > > + } > > > > > + > > > > > > > > I think that we don't need to skip LRU pages in active list. What we'd > > > > like to do is just skipping actual reclaim since it doesn't make > > > > freepage that we need. It's unrelated to skip the page in active list. > > > > > > > > > > Why? > > > > > > The active aging is sometimes about simply aging the LRU list. Aging the > > > active list based on the timing of when a zone-constrained allocation arrives > > > potentially introduces the same zone-balancing problems we currently have > > > and applying them to node-lru. > > > > Could you explain more? I don't understand why aging the active list > > based on the timing of when a zone-constrained allocation arrives > > introduces the zone-balancing problem again. > > > > I mispoke. Avoid rotation of the active list based on the timing of a > zone-constrained allocation is what I think potentially introduces problems. > If there are zone-constrained allocations aging the active list then I worry > that pages would be artificially preserved on the active list. No matter > what we do, there is distortion of the aging for zone-constrained allocation > because right now, it may deactivate high zone pages sooner than expected. > > > I think that if above logic is applied to both the active/inactive > > list, it could cause zone-balancing problem. LRU pages on lower zone > > can be resident on memory with more chance. > > If anything, with node-based LRU, it's high zone pages that can be resident > on memory for longer but only if there are zone-constrained allocations. > If we always reclaim based on age regardless of allocation requirements > then there is a risk that high zones are reclaimed far earlier than expected. > > Basically, whether we skip pages in the active list or not there are > distortions with page aging and the impact is workload dependent. Right now, > I see no clear advantage to special casing active aging. > > If we suspect this is a problem in the future, it would be a simple matter > of adding an additional bool parameter to isolate_lru_pages. Okay. I agree that it would be a simple matter. > > > > > And, I have a concern that if inactive LRU is full with higher zone's > > > > LRU pages, reclaim with low reclaim_idx could be stuck. > > > > > > That is an outside possibility but unlikely given that it would require > > > that all outstanding allocation requests are zone-contrained. If it happens > > > > I'm not sure that it is outside possibility. 
It can also happens if there > > is zone-contrained allocation requestor and parallel memory hogger. In > > this case, memory would be reclaimed by memory hogger but memory hogger would > > consume them again so inactive LRU is continually full with higher > > zone's LRU pages and zone-contrained allocation requestor cannot > > progress. > > > > The same memory hogger will also be reclaiming the highmem pages and > reallocating highmem pages. > > > > It would be preferred to have an actual test case for this so the > > > altered ratio can be tested instead of introducing code that may be > > > useless or dead. > > > > Yes, actual test case would be preferred. I will try to implement > > an artificial test case by myself but I'm not sure when I can do it. > > > > That would be appreciated. I make an artificial test case and test this series by using next tree (next-20160713) and found a regression. My test setup is: memory: 2048 mb movablecore: 1500 mb (imitates highmem system to test effect of skip logic) swapoff forever repeat: sequential read file (1500 mb) (using mmap) by 2 threads 3000 processes fork lowmem is roughly 500 mb and it is enough to keep 3000 processes. I test this artificial scenario with v4.7-rc5 and find no problem. But, with next-20160713, OOM kill is triggered as below. -------- oops ------- fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0 fork cpuset=/ mems_allowed=0 CPU: 0 PID: 10478 Comm: fork Not tainted 4.7.0-rc7-next-20160713 #646 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 0000000000000000 ffff880014273b18 ffffffff8142b8c3 ffff880014273d20 ffff88001c44a500 ffff880014273b90 ffffffff81240b6e ffffffff81e6f0e0 ffff880014273b40 ffffffff810de08d ffff880014273b60 0000000000000206 Call Trace: [<ffffffff8142b8c3>] dump_stack+0x85/0xc2 [<ffffffff81240b6e>] dump_header+0x5c/0x22e [<ffffffff810de08d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff811b3381>] oom_kill_process+0x221/0x3f0 [<ffffffff810901b7>] ? has_capability_noaudit+0x17/0x20 [<ffffffff811b3acf>] out_of_memory+0x52f/0x560 [<ffffffff811b377c>] ? out_of_memory+0x1dc/0x560 [<ffffffff811ba004>] __alloc_pages_nodemask+0x1154/0x11b0 [<ffffffff810813a1>] ? copy_process.part.30+0x121/0x1bf0 [<ffffffff810813a1>] copy_process.part.30+0x121/0x1bf0 [<ffffffff811ebb06>] ? handle_mm_fault+0xb36/0x13d0 [<ffffffff810fb60d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [<ffffffff81083066>] _do_fork+0xe6/0x6a0 [<ffffffff810836c9>] SyS_clone+0x19/0x20 [<ffffffff81003e13>] do_syscall_64+0x73/0x1e0 [<ffffffff81858ec3>] entry_SYSCALL64_slow_path+0x25/0x25 Mem-Info: active_anon:19756 inactive_anon:18 isolated_anon:0 active_file:142480 inactive_file:266065 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 slab_reclaimable:6777 slab_unreclaimable:19127 mapped:389778 shmem:95 pagetables:17512 bounce:0 free:9533 free_pcp:80 free_cma:0 Node 0 active_anon:79024kB inactive_anon:72kB active_file:569920kB inactive_file:1064260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1559112kB dirty:0kB writeback:0kB shmem:0kB shmem_thp : 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
yes Node 0 DMA free:2172kB min:204kB low:252kB high:300kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:2272kB kernel_stack:1216kB pagetables:2436kB bounce:0kB free_pcp:0k B local_pcp:0kB free_cma:0kB node_pages_scanned:15639736 lowmem_reserve[]: 0 493 493 1955 Node 0 DMA32 free:6372kB min:6492kB low:8112kB high:9732kB present:2080632kB managed:508600kB mlocked:0kB slab_reclaimable:27108kB slab_unreclaimable:74236kB kernel_stack:32752kB pagetables:67612kB bounce: 0kB free_pcp:112kB local_pcp:12kB free_cma:0kB node_pages_scanned:16302012 lowmem_reserve[]: 0 0 0 1462 Node 0 Normal free:0kB min:0kB low:0kB high:0kB present:18446744073708015752kB managed:0kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB lo cal_pcp:0kB free_cma:0kB node_pages_scanned:17033632 lowmem_reserve[]: 0 0 0 11698 Node 0 Movable free:29588kB min:19256kB low:24068kB high:28880kB present:1535864kB managed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_ pcp:208kB local_pcp:112kB free_cma:0kB node_pages_scanned:17725436 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (M) 1*8kB (U) 1*16kB (M) 1*32kB (M) 1*64kB (M) 2*128kB (UM) 1*256kB (M) 1*512kB (U) 1*1024kB (U) 0*2048kB 0*4096kB = 2172kB Node 0 DMA32: 60*4kB (ME) 45*8kB (UME) 24*16kB (ME) 13*32kB (UM) 12*64kB (UM) 6*128kB (UM) 6*256kB (M) 4*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 6520kB Node 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 Movable: 1*4kB (M) 130*8kB (M) 68*16kB (M) 30*32kB (M) 13*64kB (M) 9*128kB (M) 4*256kB (M) 0*512kB 1*1024kB (M) 1*2048kB (M) 5*4096kB (M) = 29652kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 408717 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 524156 pages RAM 0 pages HighMem/MovableOnly 17788 pages reserved 0 pages cma reserved 0 pages hwpoisoned -------- another one ------- fork invoked oom-killer: gfp_mask=0x25080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0 fork cpuset=/ mems_allowed=0 CPU: 3 PID: 7538 Comm: fork Not tainted 4.7.0-rc7-next-20160713 #646 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 0000000000000000 ffff8800141eb960 ffffffff8142b8c3 ffff8800141ebb68 ffff88001c46a500 ffff8800141eb9d8 ffffffff81240b6e ffffffff81e6f0e0 ffff8800141eb988 ffffffff810de08d ffff8800141eb9a8 0000000000000206 Call Trace: [<ffffffff8142b8c3>] dump_stack+0x85/0xc2 [<ffffffff81240b6e>] dump_header+0x5c/0x22e [<ffffffff810de08d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff811b3381>] oom_kill_process+0x221/0x3f0 [<ffffffff810901b7>] ? has_capability_noaudit+0x17/0x20 [<ffffffff811b3acf>] out_of_memory+0x52f/0x560 [<ffffffff811b377c>] ? out_of_memory+0x1dc/0x560 [<ffffffff811ba004>] __alloc_pages_nodemask+0x1154/0x11b0 [<ffffffff8120ed61>] ? alloc_pages_current+0xa1/0x1f0 [<ffffffff8120ed61>] alloc_pages_current+0xa1/0x1f0 [<ffffffff811eae37>] ? __pmd_alloc+0x37/0x1d0 [<ffffffff811eae37>] __pmd_alloc+0x37/0x1d0 [<ffffffff811ed627>] copy_page_range+0x947/0xa50 [<ffffffff811f9386>] ? anon_vma_fork+0xd6/0x150 [<ffffffff81432bd2>] ? 
__rb_insert_augmented+0x132/0x210 [<ffffffff81082035>] copy_process.part.30+0xdb5/0x1bf0 [<ffffffff81083066>] _do_fork+0xe6/0x6a0 [<ffffffff810836c9>] SyS_clone+0x19/0x20 [<ffffffff81003e13>] do_syscall_64+0x73/0x1e0 [<ffffffff81858ec3>] entry_SYSCALL64_slow_path+0x25/0x25 Mem-Info: active_anon:18779 inactive_anon:18 isolated_anon:0 active_file:91577 inactive_file:320615 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 slab_reclaimable:6741 slab_unreclaimable:18124 mapped:389774 shmem:95 pagetables:18332 bounce:0 free:8194 free_pcp:140 free_cma:0 Node 0 active_anon:75116kB inactive_anon:72kB active_file:366308kB inactive_file:1282460kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1559096kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes Node 0 DMA free:2172kB min:204kB low:252kB high:300kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:2380kB kernel_stack:1632kB pagetables:3632kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB node_pages_scanned:13673372 lowmem_reserve[]: 0 493 493 1955 Node 0 DMA32 free:6444kB min:6492kB low:8112kB high:9732kB present:2080632kB managed:508600kB mlocked:0kB slab_reclaimable:26964kB slab_unreclaimable:70116kB kernel_stack:30496kB pagetables:69696kB bounce:0kB free_pcp:316kB local_pcp:100kB free_cma:0kB node_pages_scanned:13673372 lowmem_reserve[]: 0 0 0 1462 Node 0 Normal free:0kB min:0kB low:0kB high:0kB present:18446744073708015752kB managed:0kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB node_pages_scanned:13673832 lowmem_reserve[]: 0 0 0 11698 Node 0 Movable free:24200kB min:19256kB low:24068kB high:28880kB present:1535864kB managed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:956kB local_pcp:100kB free_cma:0kB node_pages_scanned:1504 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 2*4kB (M) 0*8kB 1*16kB (M) 0*32kB 1*64kB (M) 0*128kB 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 0*4096kB = 2136kB Node 0 DMA32: 58*4kB (ME) 40*8kB (UME) 27*16kB (UME) 15*32kB (ME) 8*64kB (UM) 5*128kB (M) 10*256kB (UM) 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 6712kB Node 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 Movable: 40*4kB (M) 8*8kB (M) 3*16kB (M) 6*32kB (M) 7*64kB (M) 2*128kB (M) 1*256kB (M) 2*512kB (M) 2*1024kB (M) 1*2048kB (M) 5*4096kB (M) = 27024kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 411446 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 524156 pages RAM 0 pages HighMem/MovableOnly 17788 pages reserved 0 pages cma reserved Size of active/inactive_file is larger than size of movable zone so I guess there is reclaimable pages on DMA32 and it would mean that there is some problems related to skip logic. Could you help how to check it? Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
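For reference, a userspace sketch of the reproducer described above. The boot parameters (2048 MB of memory, movablecore=1500M) and swapoff are outside the program; the file path, the use of a sparse file and the exact loop structure are assumptions, and filling the file with real data beforehand (for example with dd) would be closer to the original test. Build with -pthread.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define FILE_SIZE	(1500UL << 20)	/* 1500 MB, as in the description */
#define NR_READERS	2
#define NR_FORKS	3000

static const char *path = "/tmp/readfile";	/* assumed location */

static void *reader(void *arg)
{
	(void)arg;
	for (;;) {
		int fd = open(path, O_RDONLY);
		unsigned char *map;
		unsigned long off;
		volatile unsigned char sum = 0;

		if (fd < 0) {
			perror("open");
			return NULL;
		}
		map = mmap(NULL, FILE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
		if (map == MAP_FAILED) {
			perror("mmap");
			close(fd);
			return NULL;
		}
		for (off = 0; off < FILE_SIZE; off += 4096)
			sum += map[off];	/* sequential mapped read */
		munmap(map, FILE_SIZE);
		close(fd);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_READERS];
	int i;
	int fd = open(path, O_RDWR | O_CREAT, 0644);

	/* a sparse file is enough page cache pressure for this sketch */
	if (fd < 0 || ftruncate(fd, FILE_SIZE) < 0) {
		perror("setup");
		return 1;
	}
	close(fd);

	for (i = 0; i < NR_READERS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);

	/* keep roughly NR_FORKS children alive; process limits may need
	 * raising for this many tasks */
	for (i = 0; i < NR_FORKS; i++) {
		pid_t pid = fork();

		if (pid == 0) {
			pause();
			_exit(0);
		}
		if (pid < 0)
			perror("fork");
	}
	for (;;)
		wait(NULL);
	return 0;
}

The forked children are what generate the zone-constrained order-2 GFP_KERNEL_ACCOUNT allocations seen in the OOM report (kernel stacks and page tables), while the mapped readers keep the LRU full of movable-zone page cache.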
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis 2016-07-14 6:28 ` Joonsoo Kim @ 2016-07-14 7:48 ` Vlastimil Babka 2016-07-18 4:52 ` Joonsoo Kim 2016-07-18 12:11 ` Mel Gorman 1 sibling, 1 reply; 16+ messages in thread From: Vlastimil Babka @ 2016-07-14 7:48 UTC (permalink / raw) To: Joonsoo Kim, Mel Gorman Cc: Andrew Morton, Linux-MM, Rik van Riel, Johannes Weiner, LKML On 07/14/2016 08:28 AM, Joonsoo Kim wrote: > On Fri, Jul 08, 2016 at 11:05:32AM +0100, Mel Gorman wrote: >> On Fri, Jul 08, 2016 at 11:28:52AM +0900, Joonsoo Kim wrote: >>> On Thu, Jul 07, 2016 at 10:48:08AM +0100, Mel Gorman wrote: >>>> On Thu, Jul 07, 2016 at 10:12:12AM +0900, Joonsoo Kim wrote: >>>>>> @@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, >>>>>> >>>>>> VM_BUG_ON_PAGE(!PageLRU(page), page); >>>>>> >>>>>> + if (page_zonenum(page) > sc->reclaim_idx) { >>>>>> + list_move(&page->lru, &pages_skipped); >>>>>> + continue; >>>>>> + } >>>>>> + >>>>> >>>>> I think that we don't need to skip LRU pages in active list. What we'd >>>>> like to do is just skipping actual reclaim since it doesn't make >>>>> freepage that we need. It's unrelated to skip the page in active list. >>>>> >>>> >>>> Why? >>>> >>>> The active aging is sometimes about simply aging the LRU list. Aging the >>>> active list based on the timing of when a zone-constrained allocation arrives >>>> potentially introduces the same zone-balancing problems we currently have >>>> and applying them to node-lru. >>> >>> Could you explain more? I don't understand why aging the active list >>> based on the timing of when a zone-constrained allocation arrives >>> introduces the zone-balancing problem again. >>> >> >> I mispoke. Avoid rotation of the active list based on the timing of a >> zone-constrained allocation is what I think potentially introduces problems. >> If there are zone-constrained allocations aging the active list then I worry >> that pages would be artificially preserved on the active list. No matter >> what we do, there is distortion of the aging for zone-constrained allocation >> because right now, it may deactivate high zone pages sooner than expected. >> >>> I think that if above logic is applied to both the active/inactive >>> list, it could cause zone-balancing problem. LRU pages on lower zone >>> can be resident on memory with more chance. >> >> If anything, with node-based LRU, it's high zone pages that can be resident >> on memory for longer but only if there are zone-constrained allocations. >> If we always reclaim based on age regardless of allocation requirements >> then there is a risk that high zones are reclaimed far earlier than expected. >> >> Basically, whether we skip pages in the active list or not there are >> distortions with page aging and the impact is workload dependent. Right now, >> I see no clear advantage to special casing active aging. >> >> If we suspect this is a problem in the future, it would be a simple matter >> of adding an additional bool parameter to isolate_lru_pages. > > Okay. I agree that it would be a simple matter. > >> >>>>> And, I have a concern that if inactive LRU is full with higher zone's >>>>> LRU pages, reclaim with low reclaim_idx could be stuck. >>>> >>>> That is an outside possibility but unlikely given that it would require >>>> that all outstanding allocation requests are zone-contrained. If it happens >>> >>> I'm not sure that it is outside possibility. 
It can also happens if there >>> is zone-contrained allocation requestor and parallel memory hogger. In >>> this case, memory would be reclaimed by memory hogger but memory hogger would >>> consume them again so inactive LRU is continually full with higher >>> zone's LRU pages and zone-contrained allocation requestor cannot >>> progress. >>> >> >> The same memory hogger will also be reclaiming the highmem pages and >> reallocating highmem pages. >> >>>> It would be preferred to have an actual test case for this so the >>>> altered ratio can be tested instead of introducing code that may be >>>> useless or dead. >>> >>> Yes, actual test case would be preferred. I will try to implement >>> an artificial test case by myself but I'm not sure when I can do it. >>> >> >> That would be appreciated. > > I make an artificial test case and test this series by using next tree > (next-20160713) and found a regression. > [...] > Mem-Info: > active_anon:18779 inactive_anon:18 isolated_anon:0 > active_file:91577 inactive_file:320615 isolated_file:0 > unevictable:0 dirty:0 writeback:0 unstable:0 > slab_reclaimable:6741 slab_unreclaimable:18124 > mapped:389774 shmem:95 pagetables:18332 bounce:0 > free:8194 free_pcp:140 free_cma:0 > Node 0 active_anon:75116kB inactive_anon:72kB active_file:366308kB inactive_file:1282460kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1559096kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes > Node 0 DMA free:2172kB min:204kB low:252kB high:300kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:2380kB kernel_stack:1632kB pagetables:3632kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB node_pages_scanned:13673372 > lowmem_reserve[]: 0 493 493 1955 > Node 0 DMA32 free:6444kB min:6492kB low:8112kB high:9732kB present:2080632kB managed:508600kB mlocked:0kB slab_reclaimable:26964kB slab_unreclaimable:70116kB kernel_stack:30496kB pagetables:69696kB bounce:0kB free_pcp:316kB local_pcp:100kB free_cma:0kB node_pages_scanned:13673372 > lowmem_reserve[]: 0 0 0 1462 > Node 0 Normal free:0kB min:0kB low:0kB high:0kB present:18446744073708015752kB managed:0kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB node_pages_scanned:13673832 present:18446744073708015752kB Although unlikely related to your report, that itself doesn't look right. Any idea if that's due to your configuration and would be printed also in the mainline kernel in case of OOM (or if /proc/zoneinfo has similarly bogus value), or is something caused by a patch in mmotm? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
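One plausible reading of that number: 18446744073708015752kB is exactly 0kB minus the Movable zone's present:1535864kB, wrapped around as an unsigned 64-bit value, as if the Normal zone's present count went negative by the amount handed over to ZONE_MOVABLE. That points at an initialisation or accounting bug rather than anything in this series, but it is only an inference from the printed values. The arithmetic, assuming an LP64 unsigned long:

#include <stdio.h>

int main(void)
{
	/* zone present/spanned counters are unsigned long in the kernel */
	unsigned long normal_kb = 0;
	unsigned long movable_kb = 1535864;	/* Movable present:1535864kB from the report */

	normal_kb -= movable_kb;		/* wraps below zero */
	printf("present:%lukB\n", normal_kb);	/* prints 18446744073708015752kB on LP64 */
	return 0;
}

In the kernel the counter is kept in pages and converted to kB for the printout, but the wrapped value comes out the same modulo 2^64. Joonsoo's follow-up later in the thread, that the bogus present count is a bug when enabling the movable zone and is present on v4.7-rc5 too, is consistent with this reading.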
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis 2016-07-14 7:48 ` Vlastimil Babka @ 2016-07-18 4:52 ` Joonsoo Kim 0 siblings, 0 replies; 16+ messages in thread From: Joonsoo Kim @ 2016-07-18 4:52 UTC (permalink / raw) To: Vlastimil Babka Cc: Mel Gorman, Andrew Morton, Linux-MM, Rik van Riel, Johannes Weiner, LKML On Thu, Jul 14, 2016 at 09:48:41AM +0200, Vlastimil Babka wrote: > On 07/14/2016 08:28 AM, Joonsoo Kim wrote: > >On Fri, Jul 08, 2016 at 11:05:32AM +0100, Mel Gorman wrote: > >>On Fri, Jul 08, 2016 at 11:28:52AM +0900, Joonsoo Kim wrote: > >>>On Thu, Jul 07, 2016 at 10:48:08AM +0100, Mel Gorman wrote: > >>>>On Thu, Jul 07, 2016 at 10:12:12AM +0900, Joonsoo Kim wrote: > >>>>>>@@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, > >>>>>> > >>>>>> VM_BUG_ON_PAGE(!PageLRU(page), page); > >>>>>> > >>>>>>+ if (page_zonenum(page) > sc->reclaim_idx) { > >>>>>>+ list_move(&page->lru, &pages_skipped); > >>>>>>+ continue; > >>>>>>+ } > >>>>>>+ > >>>>> > >>>>>I think that we don't need to skip LRU pages in active list. What we'd > >>>>>like to do is just skipping actual reclaim since it doesn't make > >>>>>freepage that we need. It's unrelated to skip the page in active list. > >>>>> > >>>> > >>>>Why? > >>>> > >>>>The active aging is sometimes about simply aging the LRU list. Aging the > >>>>active list based on the timing of when a zone-constrained allocation arrives > >>>>potentially introduces the same zone-balancing problems we currently have > >>>>and applying them to node-lru. > >>> > >>>Could you explain more? I don't understand why aging the active list > >>>based on the timing of when a zone-constrained allocation arrives > >>>introduces the zone-balancing problem again. > >>> > >> > >>I mispoke. Avoid rotation of the active list based on the timing of a > >>zone-constrained allocation is what I think potentially introduces problems. > >>If there are zone-constrained allocations aging the active list then I worry > >>that pages would be artificially preserved on the active list. No matter > >>what we do, there is distortion of the aging for zone-constrained allocation > >>because right now, it may deactivate high zone pages sooner than expected. > >> > >>>I think that if above logic is applied to both the active/inactive > >>>list, it could cause zone-balancing problem. LRU pages on lower zone > >>>can be resident on memory with more chance. > >> > >>If anything, with node-based LRU, it's high zone pages that can be resident > >>on memory for longer but only if there are zone-constrained allocations. > >>If we always reclaim based on age regardless of allocation requirements > >>then there is a risk that high zones are reclaimed far earlier than expected. > >> > >>Basically, whether we skip pages in the active list or not there are > >>distortions with page aging and the impact is workload dependent. Right now, > >>I see no clear advantage to special casing active aging. > >> > >>If we suspect this is a problem in the future, it would be a simple matter > >>of adding an additional bool parameter to isolate_lru_pages. > > > >Okay. I agree that it would be a simple matter. > > > >> > >>>>>And, I have a concern that if inactive LRU is full with higher zone's > >>>>>LRU pages, reclaim with low reclaim_idx could be stuck. > >>>> > >>>>That is an outside possibility but unlikely given that it would require > >>>>that all outstanding allocation requests are zone-contrained. 
If it happens > >>> > >>>I'm not sure that it is outside possibility. It can also happens if there > >>>is zone-contrained allocation requestor and parallel memory hogger. In > >>>this case, memory would be reclaimed by memory hogger but memory hogger would > >>>consume them again so inactive LRU is continually full with higher > >>>zone's LRU pages and zone-contrained allocation requestor cannot > >>>progress. > >>> > >> > >>The same memory hogger will also be reclaiming the highmem pages and > >>reallocating highmem pages. > >> > >>>>It would be preferred to have an actual test case for this so the > >>>>altered ratio can be tested instead of introducing code that may be > >>>>useless or dead. > >>> > >>>Yes, actual test case would be preferred. I will try to implement > >>>an artificial test case by myself but I'm not sure when I can do it. > >>> > >> > >>That would be appreciated. > > > >I make an artificial test case and test this series by using next tree > >(next-20160713) and found a regression. > > > > [...] > > >Mem-Info: > >active_anon:18779 inactive_anon:18 isolated_anon:0 > > active_file:91577 inactive_file:320615 isolated_file:0 > > unevictable:0 dirty:0 writeback:0 unstable:0 > > slab_reclaimable:6741 slab_unreclaimable:18124 > > mapped:389774 shmem:95 pagetables:18332 bounce:0 > > free:8194 free_pcp:140 free_cma:0 > >Node 0 active_anon:75116kB inactive_anon:72kB active_file:366308kB inactive_file:1282460kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1559096kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes > >Node 0 DMA free:2172kB min:204kB low:252kB high:300kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:2380kB kernel_stack:1632kB pagetables:3632kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB node_pages_scanned:13673372 > >lowmem_reserve[]: 0 493 493 1955 > >Node 0 DMA32 free:6444kB min:6492kB low:8112kB high:9732kB present:2080632kB managed:508600kB mlocked:0kB slab_reclaimable:26964kB slab_unreclaimable:70116kB kernel_stack:30496kB pagetables:69696kB bounce:0kB free_pcp:316kB local_pcp:100kB free_cma:0kB node_pages_scanned:13673372 > >lowmem_reserve[]: 0 0 0 1462 > >Node 0 Normal free:0kB min:0kB low:0kB high:0kB present:18446744073708015752kB managed:0kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB node_pages_scanned:13673832 > > present:18446744073708015752kB > > Although unlikely related to your report, that itself doesn't look > right. Any idea if that's due to your configuration and would be > printed also in the mainline kernel in case of OOM (or if > /proc/zoneinfo has similarly bogus value), or is something caused by > a patch in mmotm? Wrong present count is due to a bug when enabling MOVABLE_ZONE. v4.7-rc5 also has the same problems. I testes above tests with work-around of this present count bug and find that result is the same. v4.7-rc5 is okay but next-20160713 isn't okay. As I said before, this setup just imitate highmem system and problem would also exist on highmem system. In addition, on above setup, I measured hackbench performance while there is a concurrent file reader and found that hackbench slow down roughly 10% with nodelru. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis 2016-07-14 6:28 ` Joonsoo Kim 2016-07-14 7:48 ` Vlastimil Babka @ 2016-07-18 12:11 ` Mel Gorman 2016-07-18 14:27 ` Mel Gorman 1 sibling, 1 reply; 16+ messages in thread From: Mel Gorman @ 2016-07-18 12:11 UTC (permalink / raw) To: Joonsoo Kim Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Thu, Jul 14, 2016 at 03:28:37PM +0900, Joonsoo Kim wrote: > > That would be appreciated. > > I make an artificial test case and test this series by using next tree > (next-20160713) and found a regression. > > My test setup is: > > memory: 2048 mb > movablecore: 1500 mb (imitates highmem system to test effect of skip logic) This is not an equivalent test to highmem. Movable cannot store page table pages and the highmem:lowmem ratio with this configuration is higher than it should be. The OOM is still odd but the differences are worth highlighting. > fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0 > fork cpuset=/ mems_allowed=0 Ok, high-order allocation failure for an allocation request that can enter direct reclaim. > Node 0 active_anon:79024kB inactive_anon:72kB active_file:569920kB inactive_file:1064260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1559112kB dirty:0kB writeback:0kB shmem:0kB shmem_thp > : 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes > Node 0 DMA free:2172kB min:204kB low:252kB high:300kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:2272kB kernel_stack:1216kB pagetables:2436kB bounce:0kB free_pcp:0k > B local_pcp:0kB free_cma:0kB node_pages_scanned:15639736 > lowmem_reserve[]: 0 493 493 1955 > Node 0 DMA32 free:6372kB min:6492kB low:8112kB high:9732kB present:2080632kB managed:508600kB mlocked:0kB slab_reclaimable:27108kB slab_unreclaimable:74236kB kernel_stack:32752kB pagetables:67612kB bounce: > 0kB free_pcp:112kB local_pcp:12kB free_cma:0kB node_pages_scanned:16302012 > lowmem_reserve[]: 0 0 0 1462 > Node 0 Normal free:0kB min:0kB low:0kB high:0kB present:18446744073708015752kB managed:0kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB lo > cal_pcp:0kB free_cma:0kB node_pages_scanned:17033632 > lowmem_reserve[]: 0 0 0 11698 > Node 0 Movable free:29588kB min:19256kB low:24068kB high:28880kB present:1535864kB managed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_ > pcp:208kB local_pcp:112kB free_cma:0kB node_pages_scanned:17725436 Present is corrupt but it's also interesting to note that all_unreclaimable is true. > lowmem_reserve[]: 0 0 0 0 > Node 0 DMA: 1*4kB (M) 1*8kB (U) 1*16kB (M) 1*32kB (M) 1*64kB (M) 2*128kB (UM) 1*256kB (M) 1*512kB (U) 1*1024kB (U) 0*2048kB 0*4096kB = 2172kB > Node 0 DMA32: 60*4kB (ME) 45*8kB (UME) 24*16kB (ME) 13*32kB (UM) 12*64kB (UM) 6*128kB (UM) 6*256kB (M) 4*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 6520kB > Node 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB > Node 0 Movable: 1*4kB (M) 130*8kB (M) 68*16kB (M) 30*32kB (M) 13*64kB (M) 9*128kB (M) 4*256kB (M) 0*512kB 1*1024kB (M) 1*2048kB (M) 5*4096kB (M) = 29652kB > Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB And it's true even though enough free pages are actually free so it's not even trying to do the allocation. 
The all_unreclaimable logic is related to the number of pages scanned but currently pages skipped contributes to pages scanned. That is one possibility. The other is that if all pages scanned are skipped then the OOM killer can believe there is zero progress. Try this to start with; diff --git a/mm/vmscan.c b/mm/vmscan.c index 3f06a7a0d135..c3e509c693bf 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1408,7 +1408,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, isolate_mode_t mode, enum lru_list lru) { struct list_head *src = &lruvec->lists[lru]; - unsigned long nr_taken = 0; + unsigned long nr_taken = 0, total_skipped = 0; unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 }; unsigned long nr_skipped[MAX_NR_ZONES] = { 0, }; unsigned long scan, nr_pages; @@ -1462,10 +1462,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, if (!nr_skipped[zid]) continue; + total_skipped += nr_skipped[zid]; __count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]); } } - *nr_scanned = scan; + *nr_scanned = scan - total_skipped; trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan, nr_taken, mode, is_file_lru(lru)); update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken); -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
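Mel's first point above, that skipped pages currently count as scanned, matters because the reclaimable/unreclaimable heuristic compares the node's cumulative pages-scanned counter against a multiple of its reclaimable pages (roughly a factor of six in kernels of this era; treat the exact constant as an assumption here). A back-of-the-envelope model using approximately Joonsoo's 500 MB lowmem / 1500 MB movable split:

#include <stdio.h>

#define ELIGIBLE_PAGES		125000UL	/* ~500MB of lowmem in 4K pages */
#define INELIGIBLE_PAGES	375000UL	/* ~1500MB of movable pages */
#define BATCH			32UL		/* SWAP_CLUSTER_MAX-sized isolation batches */

int main(void)
{
	/* share of each batch skipped when the LRU is dominated by pages
	 * from zones above reclaim_idx */
	unsigned long skipped_per_batch = BATCH * INELIGIBLE_PAGES /
					  (ELIGIBLE_PAGES + INELIGIBLE_PAGES);
	unsigned long taken_per_batch = BATCH - skipped_per_batch;
	unsigned long threshold = 6 * ELIGIBLE_PAGES;	/* "unreclaimable" cutoff (assumed factor) */
	unsigned long with_skips = 0, without_skips = 0, batches = 0;

	while (with_skips < threshold) {
		with_skips += BATCH;			/* skips counted as scanned */
		without_skips += taken_per_batch;	/* skips not counted */
		batches++;
	}
	printf("batches until 'scanned' crosses 6 * eligible: %lu\n", batches);
	printf("  counting skips:     scanned = %lu\n", with_skips);
	printf("  not counting skips: scanned = %lu (%.0f%% of the cutoff)\n",
	       without_skips, 100.0 * without_skips / threshold);
	return 0;
}

With skips counted, a zone-constrained reclaimer reaches the give-up threshold after doing only about a quarter of the work the counter suggests, which fits the all_unreclaimable? yes seen in Joonsoo's first report despite plenty of free Movable memory.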
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis 2016-07-18 12:11 ` Mel Gorman @ 2016-07-18 14:27 ` Mel Gorman 2016-07-19 8:30 ` Joonsoo Kim 0 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2016-07-18 14:27 UTC (permalink / raw) To: Joonsoo Kim Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Mon, Jul 18, 2016 at 01:11:22PM +0100, Mel Gorman wrote: > The all_unreclaimable logic is related to the number of pages scanned > but currently pages skipped contributes to pages scanned. That is one > possibility. The other is that if all pages scanned are skipped then the > OOM killer can believe there is zero progress. > > Try this to start with; > And if that fails, try this heavier handed version that will scan the full LRU potentially to isolate at least a single page if it's available for zone-constrained allocations. It's compile-tested only diff --git a/mm/vmscan.c b/mm/vmscan.c index a6f31617a08c..6a35691c8b94 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1408,14 +1408,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, isolate_mode_t mode, enum lru_list lru) { struct list_head *src = &lruvec->lists[lru]; - unsigned long nr_taken = 0; + unsigned long nr_taken = 0, total_skipped = 0; unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 }; unsigned long nr_skipped[MAX_NR_ZONES] = { 0, }; unsigned long scan, nr_pages; LIST_HEAD(pages_skipped); for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan && - !list_empty(src); scan++) { + !list_empty(src) && scan == total_skipped; scan++) { struct page *page; page = lru_to_page(src); @@ -1426,6 +1426,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, if (page_zonenum(page) > sc->reclaim_idx) { list_move(&page->lru, &pages_skipped); nr_skipped[page_zonenum(page)]++; + total_skipped++; continue; } @@ -1465,7 +1466,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, __count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]); } } - *nr_scanned = scan; + *nr_scanned = scan - total_skipped; trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan, nr_taken, mode, is_file_lru(lru)); update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken); -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis 2016-07-18 14:27 ` Mel Gorman @ 2016-07-19 8:30 ` Joonsoo Kim 2016-07-19 14:25 ` Mel Gorman 0 siblings, 1 reply; 16+ messages in thread From: Joonsoo Kim @ 2016-07-19 8:30 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Mon, Jul 18, 2016 at 03:27:14PM +0100, Mel Gorman wrote: > On Mon, Jul 18, 2016 at 01:11:22PM +0100, Mel Gorman wrote: > > The all_unreclaimable logic is related to the number of pages scanned > > but currently pages skipped contributes to pages scanned. That is one > > possibility. The other is that if all pages scanned are skipped then the > > OOM killer can believe there is zero progress. > > > > Try this to start with; > > > > And if that fails, try this heavier handed version that will scan the full > LRU potentially to isolate at least a single page if it's available for > zone-constrained allocations. It's compile-tested only I tested both patches but they don't work for me. Notable difference is that all_unreclaimable is now "no". Just attach the oops log from heavier version. Thanks. fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0 fork cpuset=/ mems_allowed=0 CPU: 1 PID: 7484 Comm: fork Not tainted 4.7.0-rc7-next-20160713+ #657 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 0000000000000000 ffff880019f6bb18 ffffffff8142b8d3 ffff880019f6bd20 ffff88001c2c2500 ffff880019f6bb90 ffffffff81240b7e ffffffff81e6f0e0 ffff880019f6bb40 ffffffff810de08d ffff880019f6bb60 0000000000000206 Call Trace: [<ffffffff8142b8d3>] dump_stack+0x85/0xc2 [<ffffffff81240b7e>] dump_header+0x5c/0x22e [<ffffffff810de08d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff811b3381>] oom_kill_process+0x221/0x3f0 [<ffffffff810901b7>] ? has_capability_noaudit+0x17/0x20 [<ffffffff811b3acf>] out_of_memory+0x52f/0x560 [<ffffffff811b377c>] ? out_of_memory+0x1dc/0x560 [<ffffffff811ba004>] __alloc_pages_nodemask+0x1154/0x11b0 [<ffffffff810813a1>] ? copy_process.part.30+0x121/0x1bf0 [<ffffffff810813a1>] copy_process.part.30+0x121/0x1bf0 [<ffffffff811ebb16>] ? handle_mm_fault+0xb36/0x13d0 [<ffffffff810fb60d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [<ffffffff81083066>] _do_fork+0xe6/0x6a0 [<ffffffff810836c9>] SyS_clone+0x19/0x20 [<ffffffff81003e13>] do_syscall_64+0x73/0x1e0 [<ffffffff81858ec3>] entry_SYSCALL64_slow_path+0x25/0x25 Mem-Info: active_anon:23909 inactive_anon:18 isolated_anon:0 active_file:289985 inactive_file:101445 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 slab_reclaimable:6696 slab_unreclaimable:22083 mapped:381662 shmem:95 pagetables:21600 bounce:0 free:8378 free_pcp:227 free_cma:0 Node 0 active_anon:95676kB inactive_anon:72kB active_file:1160056kB inactive_file:405792kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1526812kB dirty:4kB writeback:0kB shmem:0kB shmem_thp : 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
no Node 0 DMA free:2176kB min:204kB low:252kB high:300kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:2328kB kernel_stack:1472kB pagetables:2940kB bounce:0kB free_pcp:0k B local_pcp:0kB free_cma:0kB node_pages_scanned:1668 lowmem_reserve[]: 0 493 493 1955 Node 0 DMA32 free:8188kB min:6492kB low:8112kB high:9732kB present:2080632kB managed:508600kB mlocked:0kB slab_reclaimable:26784kB slab_unreclaimable:86004kB kernel_stack:40704kB pagetables:83460kB bounce: 0kB free_pcp:208kB local_pcp:0kB free_cma:0kB node_pages_scanned:12000 lowmem_reserve[]: 0 0 0 1462 Node 0 Movable free:23648kB min:19256kB low:24068kB high:28880kB present:1535864kB managed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_ pcp:748kB local_pcp:0kB free_cma:0kB node_pages_scanned:12000 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 2*4kB (M) 0*8kB 2*16kB (UM) 2*32kB (UM) 0*64kB 2*128kB (UM) 1*256kB (U) 1*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 2152kB Node 0 DMA32: 21*4kB (EH) 14*8kB (UMEH) 14*16kB (UMEH) 17*32kB (UM) 11*64kB (ME) 13*128kB (UME) 14*256kB (UME) 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 8452kB Node 0 Movable: 87*4kB (M) 106*8kB (M) 82*16kB (M) 39*32kB (M) 11*64kB (M) 4*128kB (M) 0*256kB 1*512kB (M) 0*1024kB 1*2048kB (M) 4*4096kB (M) = 23916kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 391491 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 908122 pages RAM 0 pages HighMem/MovableOnly 401754 pages reserved 0 pages cma reserved 0 pages hwpoisoned -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis 2016-07-19 8:30 ` Joonsoo Kim @ 2016-07-19 14:25 ` Mel Gorman 0 siblings, 0 replies; 16+ messages in thread From: Mel Gorman @ 2016-07-19 14:25 UTC (permalink / raw) To: Joonsoo Kim Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Tue, Jul 19, 2016 at 05:30:31PM +0900, Joonsoo Kim wrote: > On Mon, Jul 18, 2016 at 03:27:14PM +0100, Mel Gorman wrote: > > On Mon, Jul 18, 2016 at 01:11:22PM +0100, Mel Gorman wrote: > > > The all_unreclaimable logic is related to the number of pages scanned > > > but currently pages skipped contributes to pages scanned. That is one > > > possibility. The other is that if all pages scanned are skipped then the > > > OOM killer can believe there is zero progress. > > > > > > Try this to start with; > > > > > > > And if that fails, try this heavier handed version that will scan the full > > LRU potentially to isolate at least a single page if it's available for > > zone-constrained allocations. It's compile-tested only > > I tested both patches but they don't work for me. Notable difference > is that all_unreclaimable is now "no". > Ok, that's good to know at least. It at least indicates that skips accounted as scans are a contributory factor. > Just attach the oops log from heavier version. > Apparently, isolating at least one page is not enough. Please try the following. If it fails, please post the test script you're using. I can simulate what you describe (mapped reads combined with lots of forks) but no guarantee I'll get it exactly right. I think it's ok to not account skips as scans because the skips are already accounted for. diff --git a/mm/vmscan.c b/mm/vmscan.c index a6f31617a08c..0dc443b52228 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1415,7 +1415,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, LIST_HEAD(pages_skipped); for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan && - !list_empty(src); scan++) { + !list_empty(src);) { struct page *page; page = lru_to_page(src); @@ -1428,6 +1428,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, nr_skipped[page_zonenum(page)]++; continue; } +` + /* Pages skipped do not contribute to scan */ + scan++; switch (__isolate_lru_page(page, mode)) { case 0: -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
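The three accounting schemes that have now appeared in this thread differ mainly in what the caller sees. Below is a rough userspace simulation of the isolate loop over an LRU where only one page in eight is eligible; the ratio, list length and batch size are invented. Variant 0 models the next-20160713 behaviour, variant 1 the first patch that subtracts total_skipped from nr_scanned, and variant 2 the patch above where skipped pages never increment scan.

#include <stdio.h>

#define LRU_PAGES	4096UL
#define NR_TO_SCAN	32UL

/* one page in eight is from a zone at or below reclaim_idx */
static int eligible(unsigned long idx)
{
	return (idx % 8) == 0;
}

/*
 * variant 0: skipped pages count as scanned
 * variant 1: loop unchanged, total_skipped subtracted from nr_scanned
 * variant 2: skipped pages never increment scan, so the walk continues
 */
static void simulate(int variant)
{
	unsigned long idx = 0, scan = 0, taken = 0, skipped = 0, walked = 0;
	unsigned long nr_scanned;

	while (scan < NR_TO_SCAN && taken < NR_TO_SCAN && idx < LRU_PAGES) {
		walked++;
		if (!eligible(idx++)) {
			skipped++;
			if (variant != 2)
				scan++;
			continue;
		}
		taken++;
		scan++;
	}
	nr_scanned = (variant == 1) ? scan - skipped : scan;
	printf("variant %d: walked=%3lu taken=%2lu skipped=%3lu nr_scanned=%2lu\n",
	       variant, walked, taken, skipped, nr_scanned);
}

int main(void)
{
	int v;

	for (v = 0; v < 3; v++)
		simulate(v);
	return 0;
}

In this toy, variant 0 reports a full batch of progress while isolating almost nothing, variant 1 reports honestly but still isolates very little per call, and variant 2 isolates a full batch at the cost of walking much further along the LRU, which is the trade-off being probed in this exchange.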
* [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 @ 2016-07-01 15:37 Mel Gorman 2016-07-01 15:37 ` [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis Mel Gorman 0 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2016-07-01 15:37 UTC (permalink / raw) To: Andrew Morton, Linux-MM Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman Previous releases double accounted LRU stats on the zone and the node because it was required by should_reclaim_retry. The last patch in the series removes the double accounting. It's not integrated with the series as reviewers may not like the solution. If not, it can be safely dropped without a major impact to the results. Changelog since v7 o Rebase onto current mmots o Avoid double accounting of stats in node and zone o Kswapd will avoid more reclaim if an eligible zone is available o Remove some duplications of sc->reclaim_idx and classzone_idx o Print per-node stats in zoneinfo Changelog since v6 o Correct reclaim_idx when direct reclaiming for memcg o Also account LRU pages per zone for compaction/reclaim o Add page_pgdat helper with more efficient lookup o Init pgdat LRU lock only once o Slight optimisation to wake_all_kswapds o Always wake kcompactd when kswapd is going to sleep o Rebase to mmotm as of June 15th, 2016 Changelog since v5 o Rebase and adjust to changes Changelog since v4 o Rebase on top of v3 of page allocator optimisation series Changelog since v3 o Rebase on top of the page allocator optimisation series o Remove RFC tag This is the latest version of a series that moves LRUs from the zones to the node that is based upon 4.7-rc4 with Andrew's tree applied. While this is a current rebase, the test results were based on mmotm as of June 23rd. Conceptually, this series is simple but there are a lot of details. Some of the broad motivations for this are; 1. The residency of a page partially depends on what zone the page was allocated from. This is partially combatted by the fair zone allocation policy but that is a partial solution that introduces overhead in the page allocator paths. 2. Currently, reclaim on node 0 behaves slightly different to node 1. For example, direct reclaim scans in zonelist order and reclaims even if the zone is over the high watermark regardless of the age of pages in that LRU. Kswapd on the other hand starts reclaim on the highest unbalanced zone. A difference in distribution of file/anon pages due to when they were allocated results can result in a difference in again. While the fair zone allocation policy mitigates some of the problems here, the page reclaim results on a multi-zone node will always be different to a single-zone node. it was scheduled on as a result. 3. kswapd and the page allocator scan zones in the opposite order to avoid interfering with each other but it's sensitive to timing. This mitigates the page allocator using pages that were allocated very recently in the ideal case but it's sensitive to timing. When kswapd is allocating from lower zones then it's great but during the rebalancing of the highest zone, the page allocator and kswapd interfere with each other. It's worse if the highest zone is small and difficult to balance. 4. slab shrinkers are node-based which makes it harder to identify the exact relationship between slab reclaim and LRU reclaim. The reason we have zone-based reclaim is that we used to have large highmem zones in common configurations and it was necessary to quickly find ZONE_NORMAL pages for reclaim. 
Today, this is much less of a concern as machines with lots of memory will (or should) use 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are rare. Machines that do use highmem should have relatively low highmem:lowmem ratios than we worried about in the past. Conceptually, moving to node LRUs should be easier to understand. The page allocator plays fewer tricks to game reclaim and reclaim behaves similarly on all nodes. The series has been tested on a 16 core UMA machine and a 2-socket 48 core NUMA machine. The UMA results are presented in most cases as the NUMA machine behaved similarly. pagealloc --------- This is a microbenchmark that shows the benefit of removing the fair zone allocation policy. It was tested uip to order-4 but only orders 0 and 1 are shown as the other orders were comparable. 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Min total-odr0-1 490.00 ( 0.00%) 463.00 ( 5.51%) Min total-odr0-2 349.00 ( 0.00%) 325.00 ( 6.88%) Min total-odr0-4 288.00 ( 0.00%) 272.00 ( 5.56%) Min total-odr0-8 250.00 ( 0.00%) 235.00 ( 6.00%) Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%) Min total-odr0-32 223.00 ( 0.00%) 205.00 ( 8.07%) Min total-odr0-64 217.00 ( 0.00%) 202.00 ( 6.91%) Min total-odr0-128 214.00 ( 0.00%) 207.00 ( 3.27%) Min total-odr0-256 242.00 ( 0.00%) 242.00 ( 0.00%) Min total-odr0-512 272.00 ( 0.00%) 265.00 ( 2.57%) Min total-odr0-1024 290.00 ( 0.00%) 283.00 ( 2.41%) Min total-odr0-2048 302.00 ( 0.00%) 296.00 ( 1.99%) Min total-odr0-4096 311.00 ( 0.00%) 306.00 ( 1.61%) Min total-odr0-8192 314.00 ( 0.00%) 309.00 ( 1.59%) Min total-odr0-16384 315.00 ( 0.00%) 309.00 ( 1.90%) Min total-odr1-1 741.00 ( 0.00%) 716.00 ( 3.37%) Min total-odr1-2 565.00 ( 0.00%) 524.00 ( 7.26%) Min total-odr1-4 457.00 ( 0.00%) 427.00 ( 6.56%) Min total-odr1-8 408.00 ( 0.00%) 371.00 ( 9.07%) Min total-odr1-16 383.00 ( 0.00%) 344.00 ( 10.18%) Min total-odr1-32 378.00 ( 0.00%) 334.00 ( 11.64%) Min total-odr1-64 383.00 ( 0.00%) 334.00 ( 12.79%) Min total-odr1-128 376.00 ( 0.00%) 342.00 ( 9.04%) Min total-odr1-256 381.00 ( 0.00%) 343.00 ( 9.97%) Min total-odr1-512 388.00 ( 0.00%) 349.00 ( 10.05%) Min total-odr1-1024 386.00 ( 0.00%) 356.00 ( 7.77%) Min total-odr1-2048 389.00 ( 0.00%) 362.00 ( 6.94%) Min total-odr1-4096 389.00 ( 0.00%) 362.00 ( 6.94%) Min total-odr1-8192 389.00 ( 0.00%) 362.00 ( 6.94%) This shows a steady improvement throughout. The primary benefit is from reduced system CPU usage which is obvious from the overall times; 4.7.0-rc4 4.7.0-rc4 mmotm-20160623nodelru-v8 User 191.39 191.61 System 2651.24 2504.48 Elapsed 2904.40 2757.01 The vmstats also showed that the fair zone allocation policy was definitely removed as can be seen here; 4.7.0-rc3 4.7.0-rc3 mmotm-20160623 nodelru-v8 DMA32 allocs 28794771816 0 Normal allocs 48432582848 77227356392 Movable allocs 0 0 tiobench on ext4 ---------------- tiobench is a benchmark that artifically benefits if old pages remain resident while new pages get reclaimed. The fair zone allocation policy mitigates this problem so pages age fairly. While the benchmark has problems, it is important that tiobench performance remains constant as it implies that page aging problems that the fair zone allocation policy fixes are not re-introduced. 
4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Min PotentialReadSpeed 89.65 ( 0.00%) 90.34 ( 0.77%) Min SeqRead-MB/sec-1 82.68 ( 0.00%) 83.13 ( 0.54%) Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.15 ( -0.84%) Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.23 ( -1.20%) Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.25 ( 0.52%) Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.76 ( 0.84%) Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.95 ( 7.95%) Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.94 ( -1.05%) Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.46 ( 2.10%) Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.58 ( -1.86%) Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.93 ( 7.22%) Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 78.84 ( 3.18%) Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.35 ( -1.03%) Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 78.69 ( -1.70%) Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 71.38 ( -2.06%) Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 75.81 ( -0.13%) Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.12 ( -5.08%) Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.02 ( 0.00%) Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.99 ( -5.71%) Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%) Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.89 ( -3.26%) This shows that the series has little or not impact on tiobench which is desirable. It indicates that the fair zone allocation policy was removed in a manner that didn't reintroduce one class of page aging bug. There were only minor differences in overall reclaim activity 4.7.0-rc4 4.7.0-rc4 mmotm-20160623nodelru-v8 Minor Faults 645838 644036 Major Faults 573 593 Swap Ins 0 0 Swap Outs 0 0 Allocation stalls 24 0 DMA allocs 0 0 DMA32 allocs 46041453 44154171 Normal allocs 78053072 79865782 Movable allocs 0 0 Direct pages scanned 10969 54504 Kswapd pages scanned 93375144 93250583 Kswapd pages reclaimed 93372243 93247714 Direct pages reclaimed 10969 54504 Kswapd efficiency 99% 99% Kswapd velocity 13741.015 13711.950 Direct efficiency 100% 100% Direct velocity 1.614 8.014 Percentage direct scans 0% 0% Zone normal velocity 8641.875 13719.964 Zone dma32 velocity 5100.754 0.000 Zone dma velocity 0.000 0.000 Page writes by reclaim 0.000 0.000 Page writes file 0 0 Page writes anon 0 0 Page reclaim immediate 37 54 kswapd activity was roughly comparable. There were differences in direct reclaim activity but negligible in the context of the overall workload (velocity of 8 pages per second with the patches applied, 1.6 pages per second in the baseline kernel). pgbench read-only large configuration on ext4 --------------------------------------------- pgbench is a database benchmark that can be sensitive to page reclaim decisions. This also checks if removing the fair zone allocation policy is safe pgbench Transactions 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%) Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%) Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%) Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%) Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%) Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%) Negligible differences again. As with tiobench, overall reclaim activity was comparable. bonnie++ on ext4 ---------------- No interesting performance difference, negligible differences on reclaim stats. paralleldd on ext4 ------------------ This workload uses varying numbers of dd instances to read large amounts of data from disk. 
4.7.0-rc3 4.7.0-rc3 mmotm-20160615 nodelru-v7r17 Amean Elapsd-1 181.57 ( 0.00%) 179.63 ( 1.07%) Amean Elapsd-3 188.29 ( 0.00%) 183.68 ( 2.45%) Amean Elapsd-5 188.02 ( 0.00%) 181.73 ( 3.35%) Amean Elapsd-7 186.07 ( 0.00%) 184.11 ( 1.05%) Amean Elapsd-12 188.16 ( 0.00%) 183.51 ( 2.47%) Amean Elapsd-16 189.03 ( 0.00%) 181.27 ( 4.10%) 4.7.0-rc3 4.7.0-rc3 mmotm-20160615nodelru-v7r17 User 1439.23 1433.37 System 8332.31 8216.01 Elapsed 3619.80 3532.69 There is a slight gain in performance, some of which is from the reduced system CPU usage. There areminor differences in reclaim activity but nothing significant 4.7.0-rc3 4.7.0-rc3 mmotm-20160615nodelru-v7r17 Minor Faults 362486 358215 Major Faults 1143 1113 Swap Ins 26 0 Swap Outs 2920 482 DMA allocs 0 0 DMA32 allocs 31568814 28598887 Normal allocs 46539922 49514444 Movable allocs 0 0 Allocation stalls 0 0 Direct pages scanned 0 0 Kswapd pages scanned 40886878 40849710 Kswapd pages reclaimed 40869923 40835207 Direct pages reclaimed 0 0 Kswapd efficiency 99% 99% Kswapd velocity 11295.342 11563.344 Direct efficiency 100% 100% Direct velocity 0.000 0.000 Slabs scanned 131673 126099 Direct inode steals 57 60 Kswapd inode steals 762 18 It basically shows that kswapd was active at roughly the same rate in both kernels. There was also comparable slab scanning activity and direct reclaim was avoided in both cases. There appears to be a large difference in numbers of inodes reclaimed but the workload has few active inodes and is likely a timing artifact. It's interesting to note that the node-lru did not swap in any pages but given the low swap activity, it's unlikely to be significant. stutter ------- stutter simulates a simple workload. One part uses a lot of anonymous memory, a second measures mmap latency and a third copies a large file. The primary metric is checking for mmap latency. stutter 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Min mmap 16.6283 ( 0.00%) 16.1394 ( 2.94%) 1st-qrtle mmap 54.7570 ( 0.00%) 55.2975 ( -0.99%) 2nd-qrtle mmap 57.3163 ( 0.00%) 57.5230 ( -0.36%) 3rd-qrtle mmap 58.9976 ( 0.00%) 58.0537 ( 1.60%) Max-90% mmap 59.7433 ( 0.00%) 58.3910 ( 2.26%) Max-93% mmap 60.1298 ( 0.00%) 58.4801 ( 2.74%) Max-95% mmap 73.4112 ( 0.00%) 58.5537 ( 20.24%) Max-99% mmap 92.8542 ( 0.00%) 58.9673 ( 36.49%) Max mmap 1440.6569 ( 0.00%) 137.6875 ( 90.44%) Mean mmap 59.3493 ( 0.00%) 55.5153 ( 6.46%) Best99%Mean mmap 57.2121 ( 0.00%) 55.4194 ( 3.13%) Best95%Mean mmap 55.9113 ( 0.00%) 55.2813 ( 1.13%) Best90%Mean mmap 55.6199 ( 0.00%) 55.1044 ( 0.93%) Best50%Mean mmap 53.2183 ( 0.00%) 52.8330 ( 0.72%) Best10%Mean mmap 45.9842 ( 0.00%) 42.3740 ( 7.85%) Best5%Mean mmap 43.2256 ( 0.00%) 38.8660 ( 10.09%) Best1%Mean mmap 32.9388 ( 0.00%) 27.7577 ( 15.73%) This shows a number of improvements with the worst-case outlier greatly improved. 
Some of the vmstats are interesting 4.7.0-rc4 4.7.0-rc4 mmotm-20160623nodelru-v8 Swap Ins 163 239 Swap Outs 0 0 Allocation stalls 2603 0 DMA allocs 0 0 DMA32 allocs 618719206 1303037965 Normal allocs 891235743 229914091 Movable allocs 0 0 Direct pages scanned 216787 3173 Kswapd pages scanned 50719775 41732250 Kswapd pages reclaimed 41541765 41731168 Direct pages reclaimed 209159 3173 Kswapd efficiency 81% 99% Kswapd velocity 16859.554 14231.043 Direct efficiency 96% 100% Direct velocity 72.061 1.082 Percentage direct scans 0% 0% Zone normal velocity 8431.777 14232.125 Zone dma32 velocity 8499.838 0.000 Zone dma velocity 0.000 0.000 Page writes by reclaim 6215049.000 0.000 Page writes file 6215049 0 Page writes anon 0 0 Page reclaim immediate 70673 143 Sector Reads 81940800 81489388 Sector Writes 100158984 99161860 Page rescued immediate 0 0 Slabs scanned 1366954 21196 While this is not guaranteed in all cases, this particular test showed a large reduction in direct reclaim activity. It's also worth noting that no page writes were issued from reclaim context. This series is not without its hazards. There are at least three areas that I'm concerned with even though I could not reproduce any problems in that area. 1. Reclaim/compaction is going to be affected because the amount of reclaim is no longer targetted at a specific zone. Compaction works on a per-zone basis so there is no guarantee that reclaiming a few THP's worth page pages will have a positive impact on compaction success rates. 2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers are called is now different. This may or may not be a problem but if it is, it'll be because shrinkers are not called enough and some balancing is required. 3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are distributed between zones and the fair zone allocation policy used to do something very similar for anon. The distribution is now different but not necessarily in any way that matters but it's still worth bearing in mind. 
Documentation/cgroup-v1/memcg_test.txt | 4 +- Documentation/cgroup-v1/memory.txt | 4 +- arch/s390/appldata/appldata_mem.c | 2 +- arch/tile/mm/pgtable.c | 18 +- drivers/base/node.c | 77 ++- drivers/staging/android/lowmemorykiller.c | 12 +- drivers/staging/lustre/lustre/osc/osc_cache.c | 6 +- fs/fs-writeback.c | 4 +- fs/fuse/file.c | 8 +- fs/nfs/internal.h | 2 +- fs/nfs/write.c | 2 +- fs/proc/meminfo.c | 20 +- include/linux/backing-dev.h | 2 +- include/linux/memcontrol.h | 61 +- include/linux/mm.h | 5 + include/linux/mm_inline.h | 35 +- include/linux/mm_types.h | 2 +- include/linux/mmzone.h | 155 +++-- include/linux/swap.h | 24 +- include/linux/topology.h | 2 +- include/linux/vm_event_item.h | 14 +- include/linux/vmstat.h | 111 +++- include/linux/writeback.h | 2 +- include/trace/events/vmscan.h | 63 +- include/trace/events/writeback.h | 10 +- kernel/power/snapshot.c | 10 +- kernel/sysctl.c | 4 +- mm/backing-dev.c | 15 +- mm/compaction.c | 50 +- mm/filemap.c | 16 +- mm/huge_memory.c | 12 +- mm/internal.h | 11 +- mm/khugepaged.c | 14 +- mm/memcontrol.c | 215 +++---- mm/memory-failure.c | 4 +- mm/memory_hotplug.c | 7 +- mm/mempolicy.c | 2 +- mm/migrate.c | 35 +- mm/mlock.c | 12 +- mm/page-writeback.c | 123 ++-- mm/page_alloc.c | 371 +++++------ mm/page_idle.c | 4 +- mm/rmap.c | 26 +- mm/shmem.c | 14 +- mm/swap.c | 64 +- mm/swap_state.c | 4 +- mm/util.c | 4 +- mm/vmscan.c | 879 +++++++++++++------------- mm/vmstat.c | 398 +++++++++--- mm/workingset.c | 54 +- 50 files changed, 1674 insertions(+), 1319 deletions(-) -- 2.6.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
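The patch that follows keys the new behaviour off sc->reclaim_idx: direct reclaim derives it from the allocation's gfp mask via gfp_zone(), while kswapd and memcg reclaim use MAX_NR_ZONES - 1. As a crude userspace stand-in for that mapping (this is not the kernel's gfp_zone() implementation; an x86-64-like zone layout and simplified flags are assumed):

#include <stdio.h>

/* simplified zone layout, roughly what a 64-bit x86 machine has */
enum toy_zone { TOY_DMA, TOY_DMA32, TOY_NORMAL, TOY_MOVABLE, TOY_MAX_NR_ZONES };

#define TGFP_DMA	0x01u
#define TGFP_DMA32	0x02u
#define TGFP_HIGHMEM	0x04u
#define TGFP_MOVABLE	0x08u

/*
 * Crude stand-in for gfp_zone(): the highest zone the request is allowed
 * to use, which the patch stores in sc->reclaim_idx so that reclaim skips
 * pages from zones above it.
 */
static enum toy_zone toy_gfp_zone(unsigned int flags)
{
	if (flags & TGFP_DMA)
		return TOY_DMA;
	if (flags & TGFP_DMA32)
		return TOY_DMA32;
	if ((flags & (TGFP_HIGHMEM | TGFP_MOVABLE)) == (TGFP_HIGHMEM | TGFP_MOVABLE))
		return TOY_MOVABLE;
	/* __GFP_HIGHMEM alone falls back to NORMAL with no highmem zone */
	return TOY_NORMAL;
}

int main(void)
{
	printf("GFP_KERNEL-like request   -> reclaim_idx %d\n", toy_gfp_zone(0));
	printf("GFP_DMA32-like request    -> reclaim_idx %d\n", toy_gfp_zone(TGFP_DMA32));
	printf("GFP_HIGHUSER_MOVABLE-like -> reclaim_idx %d\n",
	       toy_gfp_zone(TGFP_HIGHMEM | TGFP_MOVABLE));
	printf("kswapd / memcg reclaim    -> reclaim_idx %d (MAX_NR_ZONES - 1)\n",
	       TOY_MAX_NR_ZONES - 1);
	return 0;
}

The corresponding hunks in the patch below are the .reclaim_idx = gfp_zone(gfp_mask) initialiser in try_to_free_pages() and the .reclaim_idx = MAX_NR_ZONES - 1 initialisers for kswapd and memcg reclaim.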
* [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis
  2016-07-01 15:37 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman
@ 2016-07-01 15:37 ` Mel Gorman
  0 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2016-07-01 15:37 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

This patch makes reclaim decisions on a per-node basis. A reclaimer knows
what zone is required by the allocation request and skips pages from
higher zones. In many cases this will be ok because it's a GFP_HIGHMEM
request of some description. On 64-bit, ZONE_DMA32 requests will cause
some problems but 32-bit devices on 64-bit platforms are increasingly
rare. Historically it would have been a major problem on 32-bit with big
Highmem:Lowmem ratios but such configurations are also now rare and even
where they exist, they are not encouraged. If it really becomes a
problem, it'll manifest as very low reclaim efficiencies.

Link: http://lkml.kernel.org/r/1466518566-30034-5-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/vmscan.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 55 insertions(+), 24 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 86a523a761c9..766b36bec829 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -84,6 +84,9 @@ struct scan_control {
 	/* Scan (total_size >> priority) pages at once */
 	int priority;
 
+	/* The highest zone to isolate pages for reclaim from */
+	enum zone_type reclaim_idx;
+
 	unsigned int may_writepage:1;
 
 	/* Can mapped pages be reclaimed? */
@@ -1392,6 +1395,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long nr_taken = 0;
 	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
 	unsigned long scan, nr_pages;
+	LIST_HEAD(pages_skipped);
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
 					!list_empty(src); scan++) {
@@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 
+		if (page_zonenum(page) > sc->reclaim_idx) {
+			list_move(&page->lru, &pages_skipped);
+			continue;
+		}
+
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
 			nr_pages = hpage_nr_pages(page);
@@ -1420,6 +1429,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		}
 	}
 
+	/*
+	 * Splice any skipped pages to the start of the LRU list. Note that
+	 * this disrupts the LRU order when reclaiming for lower zones but
+	 * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
+	 * scanning would soon rescan the same pages to skip and put the
+	 * system at risk of premature OOM.
+	 */
+	if (!list_empty(&pages_skipped))
+		list_splice(&pages_skipped, src);
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
@@ -1589,7 +1607,7 @@ static int current_may_throttle(void)
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
+ * shrink_inactive_list() is a helper for shrink_node(). It returns the number
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
@@ -2401,12 +2419,13 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static bool shrink_zone(struct zone *zone, struct scan_control *sc,
-			bool is_classzone)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
+			enum zone_type classzone_idx)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
+	struct zone *zone = &pgdat->node_zones[classzone_idx];
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2438,7 +2457,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			shrink_zone_memcg(zone, memcg, sc, &lru_pages);
 			zone_lru_pages += lru_pages;
 
-			if (memcg && is_classzone)
+			if (!global_reclaim(sc))
 				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
 					    memcg, sc->nr_scanned - scanned,
 					    lru_pages);
@@ -2469,7 +2488,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 		 * Shrink the slab caches in the same proportion that
 		 * the eligible LRU pages were scanned.
 		 */
-		if (global_reclaim(sc) && is_classzone)
+		if (global_reclaim(sc))
 			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
 				    sc->nr_scanned - nr_scanned,
 				    zone_lru_pages);
@@ -2553,7 +2572,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
-	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
+	enum zone_type classzone_idx;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2561,17 +2580,23 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	 * highmem pages could be pinning lowmem pages storing buffer_heads
 	 */
 	orig_mask = sc->gfp_mask;
-	if (buffer_heads_over_limit)
+	if (buffer_heads_over_limit) {
 		sc->gfp_mask |= __GFP_HIGHMEM;
+		sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask);
+	}
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
-					gfp_zone(sc->gfp_mask), sc->nodemask) {
-		enum zone_type classzone_idx;
-
+					sc->reclaim_idx, sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 
-		classzone_idx = requested_highidx;
+		/*
+		 * Note that reclaim_idx does not change as it is the highest
+		 * zone reclaimed from which for empty zones is a no-op but
+		 * classzone_idx is used by shrink_node to test if the slabs
+		 * should be shrunk on a given node.
+		 */
+		classzone_idx = sc->reclaim_idx;
 		while (!populated_zone(zone->zone_pgdat->node_zones +
 					classzone_idx))
 			classzone_idx--;
@@ -2600,8 +2625,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			 */
 			if (IS_ENABLED(CONFIG_COMPACTION) &&
 			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
-			    zonelist_zone_idx(z) <= requested_highidx &&
-			    compaction_ready(zone, sc->order, requested_highidx)) {
+			    zonelist_zone_idx(z) <= classzone_idx &&
+			    compaction_ready(zone, sc->order, classzone_idx)) {
 				sc->compaction_ready = true;
 				continue;
 			}
@@ -2621,7 +2646,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+		shrink_node(zone->zone_pgdat, sc, classzone_idx);
 	}
 
 	/*
@@ -2847,6 +2872,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
@@ -2886,6 +2912,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 		.target_mem_cgroup = memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.may_swap = !noswap,
 	};
 	unsigned long lru_pages;
@@ -2924,6 +2951,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
@@ -3118,7 +3146,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 						balance_gap, classzone_idx))
 		return true;
 
-	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+	shrink_node(zone->zone_pgdat, sc, classzone_idx);
 
 	/* TODO: ANOMALY */
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
@@ -3167,6 +3195,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.order = order,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
@@ -3237,15 +3266,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.may_writepage = 1;
 		/*
-		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction. This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
+		 * Continue scanning in the highmem->dma direction stopping at
+		 * the last zone which needs scanning. This may reclaim lowmem
+		 * pages that are not necessary for zone balancing but it
+		 * preserves LRU ordering. It is assumed that the bulk of
+		 * allocation requests can use arbitrary zones with the
+		 * possible exception of big highmem:lowmem configurations.
 		 */
-		for (i = 0; i <= end_zone; i++) {
+		for (i = end_zone; i >= 0; i--) {
 			struct zone *zone = pgdat->node_zones + i;
 
 			if (!populated_zone(zone))
 				continue;
@@ -3256,6 +3284,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				continue;
 
 			sc.nr_scanned = 0;
+			sc.reclaim_idx = i;
 			nr_soft_scanned = 0;
 
 			/*
@@ -3513,6 +3542,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 	struct scan_control sc = {
 		.nr_to_reclaim = nr_to_reclaim,
 		.gfp_mask = GFP_HIGHUSER_MOVABLE,
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.priority = DEF_PRIORITY,
 		.may_writepage = 1,
 		.may_unmap = 1,
@@ -3704,6 +3734,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
+		.reclaim_idx = zone_idx(zone),
 	};
 
 	cond_resched();
@@ -3723,7 +3754,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_zone(zone, &sc, true);
+			shrink_node(zone->zone_pgdat, &sc, zone_idx(zone));
 		} while (sc.nr_reclaimed < nr_pages &&
 					--sc.priority >= 0);
 	}
-- 
2.6.4
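[Editorial sketch, not part of the posted patch.] For readers who want the gist of the
isolation change without walking the whole diff, below is a minimal userspace sketch of
the skip-and-splice step the patch adds to isolate_lru_pages(). The page struct, the
isolate_pages() helper, the singly linked "LRU" and the sample zone numbers are
illustrative stand-ins rather than kernel code or APIs; only the control flow mirrors
the patch.

/*
 * Userspace sketch (not the kernel code) of the skip-and-splice logic this
 * patch adds to isolate_lru_pages(): pages from zones above reclaim_idx are
 * moved to a side list and spliced back to the head of the LRU afterwards,
 * so one reclaim pass never takes pages the caller cannot use.
 */
#include <stdio.h>

struct page {
	int zone;               /* stand-in for page_zonenum() */
	struct page *next;      /* singly linked "LRU" for brevity */
};

/* Pop pages from *lru; take those with zone <= reclaim_idx, skip the rest. */
static int isolate_pages(struct page **lru, int nr_to_scan, int reclaim_idx,
			 struct page **taken)
{
	struct page *skipped = NULL, *skipped_tail = NULL;
	int nr_taken = 0, scan;

	for (scan = 0; scan < nr_to_scan && *lru; scan++) {
		struct page *page = *lru;
		*lru = page->next;

		if (page->zone > reclaim_idx) {
			/* park it on the skipped list, preserving order */
			page->next = NULL;
			if (skipped_tail)
				skipped_tail->next = page;
			else
				skipped = page;
			skipped_tail = page;
			continue;
		}

		page->next = *taken;
		*taken = page;
		nr_taken++;
	}

	/*
	 * Splice skipped pages back to the head of the LRU, as the patch
	 * does: splicing to the tail would make the next scan re-skip the
	 * same pages and risk a premature OOM.
	 */
	if (skipped_tail) {
		skipped_tail->next = *lru;
		*lru = skipped;
	}
	return nr_taken;
}

int main(void)
{
	struct page pages[5] = {
		{ .zone = 0 }, { .zone = 2 }, { .zone = 1 },
		{ .zone = 3 }, { .zone = 0 },
	};
	struct page *lru = NULL, *taken = NULL;
	int i, nr;

	/* build the LRU so pages[0] sits at the head */
	for (i = 4; i >= 0; i--) {
		pages[i].next = lru;
		lru = &pages[i];
	}

	nr = isolate_pages(&lru, 5, 1, &taken);  /* reclaim_idx = 1 */
	printf("isolated %d pages\n", nr);       /* expect 3: zones 0, 1, 0 */
	return 0;
}

The head splice in the sketch follows the reasoning in the comment the patch adds: a
tail splice would make the next batch rescan and re-skip the same unusable pages.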
end of thread, other threads:[~2016-07-19 14:25 UTC | newest]

Thread overview: 16+ messages
     [not found] <009e01d1d5d8$fcf06440$f6d12cc0$@alibaba-inc.com>
2016-07-04 10:08 ` [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis Hillf Danton
2016-07-04 10:33   ` Mel Gorman
2016-07-05  3:17   ` Hillf Danton
2016-07-01 20:01 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman
2016-07-01 20:01 ` [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis Mel Gorman
2016-07-07  1:12   ` Joonsoo Kim
2016-07-07  9:48     ` Mel Gorman
2016-07-08  2:28       ` Joonsoo Kim
2016-07-08 10:05         ` Mel Gorman
2016-07-14  6:28           ` Joonsoo Kim
2016-07-14  7:48             ` Vlastimil Babka
2016-07-18  4:52               ` Joonsoo Kim
2016-07-18 12:11             ` Mel Gorman
2016-07-18 14:27               ` Mel Gorman
2016-07-19  8:30                 ` Joonsoo Kim
2016-07-19 14:25                   ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2016-07-01 15:37 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman
2016-07-01 15:37 ` [PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis Mel Gorman