* Re: [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes [not found] <071801d1cc5c$245087d0$6cf19770$@alibaba-inc.com> @ 2016-06-22 8:42 ` Hillf Danton 2016-06-23 11:31 ` Mel Gorman 0 siblings, 1 reply; 9+ messages in thread From: Hillf Danton @ 2016-06-22 8:42 UTC (permalink / raw) To: Mel Gorman; +Cc: Johannes Weiner, Vlastimil Babka, linux-kernel, linux-mm > /* > - * kswapd shrinks the zone by the number of pages required to reach > - * the high watermark. > + * kswapd shrinks a node of pages that are at or below the highest usable > + * zone that is currently unbalanced. > * > * Returns true if kswapd scanned at least the requested number of pages to > * reclaim or if the lack of progress was due to pages under writeback. > * This is used to determine if the scanning priority needs to be raised. > */ > -static bool kswapd_shrink_zone(struct zone *zone, > +static bool kswapd_shrink_node(pg_data_t *pgdat, > int classzone_idx, > struct scan_control *sc) > { > - unsigned long balance_gap; > - bool lowmem_pressure; > - struct pglist_data *pgdat = zone->zone_pgdat; > + struct zone *zone; > + int z; > > - /* Reclaim above the high watermark. */ > - sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)); > + /* Reclaim a number of pages proportional to the number of zones */ > + sc->nr_to_reclaim = 0; > + for (z = 0; z <= classzone_idx; z++) { > + zone = pgdat->node_zones + z; > + if (!populated_zone(zone)) > + continue; > > - /* > - * We put equal pressure on every zone, unless one zone has way too > - * many pages free already. The "too many pages" is defined as the > - * high wmark plus a "gap" where the gap is either the low > - * watermark or 1% of the zone, whichever is smaller. > - */ > - balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP( > - zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO)); > + sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX); > + } > > /* > - * If there is no low memory pressure or the zone is balanced then no > - * reclaim is necessary > + * Historically care was taken to put equal pressure on all zones but > + * now pressure is applied based on node LRU order. > */ > - lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone)); > - if (!lowmem_pressure && zone_balanced(zone, sc->order, false, > - balance_gap, classzone_idx)) > - return true; > - > - shrink_node(zone->zone_pgdat, sc, classzone_idx); > - > - /* TODO: ANOMALY */ > - clear_bit(PGDAT_WRITEBACK, &pgdat->flags); > + shrink_node(pgdat, sc, classzone_idx); > > /* > - * If a zone reaches its high watermark, consider it to be no longer > - * congested. It's possible there are dirty pages backed by congested > - * BDIs but as pressure is relieved, speculatively avoid congestion > - * waits. > + * Fragmentation may mean that the system cannot be rebalanced for > + * high-order allocations. If twice the allocation size has been > + * reclaimed then recheck watermarks only at order-0 to prevent > + * excessive reclaim. Assume that a process requested a high-order > + * can direct reclaim/compact. > */ > - if (pgdat_reclaimable(zone->zone_pgdat) && > - zone_balanced(zone, sc->order, false, 0, classzone_idx)) { > - clear_bit(PGDAT_CONGESTED, &pgdat->flags); > - clear_bit(PGDAT_DIRTY, &pgdat->flags); > - } > + if (sc->order && sc->nr_reclaimed >= 2UL << sc->order) > + sc->order = 0; > Reclaim order is changed here. Btw, I find no such change in current code. 
> return sc->nr_scanned >= sc->nr_to_reclaim; > } > > /* > - * For kswapd, balance_pgdat() will work across all this node's zones until > - * they are all at high_wmark_pages(zone). > - * > - * Returns the highest zone idx kswapd was reclaiming at > + * For kswapd, balance_pgdat() will reclaim pages across a node from zones > + * that are eligible for use by the caller until at least one zone is > + * balanced. > * > - * There is special handling here for zones which are full of pinned pages. > - * This can happen if the pages are all mlocked, or if they are all used by > - * device drivers (say, ZONE_DMA). Or if they are all in use by hugetlb. > - * What we do is to detect the case where all pages in the zone have been > - * scanned twice and there has been zero successful reclaim. Mark the zone as > - * dead and from now on, only perform a short scan. Basically we're polling > - * the zone for when the problem goes away. > + * Returns the order kswapd finished reclaiming at. > * > * kswapd scans the zones in the highmem->normal->dma direction. It skips > * zones which have free_pages > high_wmark_pages(zone), but once a zone is > - * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the > - * lower zones regardless of the number of free pages in the lower zones. This > - * interoperates with the page allocator fallback scheme to ensure that aging > - * of pages is balanced across the zones. > + * found to have free_pages <= high_wmark_pages(zone), any page is that zone > + * or lower is eligible for reclaim until at least one usable zone is > + * balanced. > */ > static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) > { > int i; > - int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ > unsigned long nr_soft_reclaimed; > unsigned long nr_soft_scanned; > + struct zone *zone; > struct scan_control sc = { > .gfp_mask = GFP_KERNEL, > - .reclaim_idx = MAX_NR_ZONES - 1, > .order = order, > .priority = DEF_PRIORITY, > .may_writepage = !laptop_mode, > .may_unmap = 1, > .may_swap = 1, > + .reclaim_idx = classzone_idx, > }; > count_vm_event(PAGEOUTRUN); > > @@ -3203,21 +3125,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) > > /* Scan from the highest requested zone to dma */ > for (i = classzone_idx; i >= 0; i--) { > - struct zone *zone = pgdat->node_zones + i; > - > + zone = pgdat->node_zones + i; > if (!populated_zone(zone)) > continue; > > - if (sc.priority != DEF_PRIORITY && > - !pgdat_reclaimable(zone->zone_pgdat)) > - continue; > - > - /* > - * Do some background aging of the anon list, to give > - * pages a chance to be referenced before reclaiming. > - */ > - age_active_anon(zone, &sc); > - > /* > * If the number of buffer_heads in the machine > * exceeds the maximum allowed level and this node > @@ -3225,19 +3136,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) > * it to relieve lowmem pressure. > */ > if (buffer_heads_over_limit && is_highmem_idx(i)) { > - end_zone = i; > + classzone_idx = i; > break; > } > > - if (!zone_balanced(zone, order, false, 0, 0)) { > - end_zone = i; > + if (!zone_balanced(zone, order, 0, 0)) { We need to sync order with the above change? > + classzone_idx = i; > break; > } else { > /* > - * If balanced, clear the dirty and congested > - * flags > - * > - * TODO: ANOMALY > + * If any eligible zone is balanced then the > + * node is not considered congested or dirty. 
> */ > clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags); > clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags); > @@ -3248,51 +3157,34 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) > goto out; > > /* > + * Do some background aging of the anon list, to give > + * pages a chance to be referenced before reclaiming. All > + * pages are rotated regardless of classzone as this is > + * about consistent aging. > + */ > + age_active_anon(pgdat, &pgdat->node_zones[MAX_NR_ZONES - 1], &sc); > + > + /* > * If we're getting trouble reclaiming, start doing writepage > * even in laptop mode. > */ > - if (sc.priority < DEF_PRIORITY - 2) > + if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat)) > sc.may_writepage = 1; > > + /* Call soft limit reclaim before calling shrink_node. */ > + sc.nr_scanned = 0; > + nr_soft_scanned = 0; > + nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, sc.order, > + sc.gfp_mask, &nr_soft_scanned); > + sc.nr_reclaimed += nr_soft_reclaimed; > + > /* > - * Continue scanning in the highmem->dma direction stopping at > - * the last zone which needs scanning. This may reclaim lowmem > - * pages that are not necessary for zone balancing but it > - * preserves LRU ordering. It is assumed that the bulk of > - * allocation requests can use arbitrary zones with the > - * possible exception of big highmem:lowmem configurations. > + * There should be no need to raise the scanning priority if > + * enough pages are already being scanned that that high > + * watermark would be met at 100% efficiency. > */ > - for (i = end_zone; i >= 0; i--) { > - struct zone *zone = pgdat->node_zones + i; > - > - if (!populated_zone(zone)) > - continue; > - > - if (sc.priority != DEF_PRIORITY && > - !pgdat_reclaimable(zone->zone_pgdat)) > - continue; > - > - sc.nr_scanned = 0; > - sc.reclaim_idx = i; > - > - nr_soft_scanned = 0; > - /* > - * Call soft limit reclaim before calling shrink_zone. > - */ > - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, > - order, sc.gfp_mask, > - &nr_soft_scanned); > - sc.nr_reclaimed += nr_soft_reclaimed; > - > - /* > - * There should be no need to raise the scanning > - * priority if enough pages are already being scanned > - * that that high watermark would be met at 100% > - * efficiency. > - */ > - if (kswapd_shrink_zone(zone, end_zone, &sc)) > - raise_priority = false; > - } > + if (kswapd_shrink_node(pgdat, classzone_idx, &sc)) > + raise_priority = false; > > /* > * If the low watermark is met there is no need for processes > @@ -3308,20 +3200,37 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) > break; > > /* > + * Stop reclaiming if any eligible zone is balanced and clear > + * node writeback or congested. > + */ > + for (i = 0; i <= classzone_idx; i++) { > + zone = pgdat->node_zones + i; > + if (!populated_zone(zone)) > + continue; > + > + if (zone_balanced(zone, sc.order, 0, classzone_idx)) { > + clear_bit(PGDAT_CONGESTED, &pgdat->flags); > + clear_bit(PGDAT_DIRTY, &pgdat->flags); > + goto out; > + } > + } > + > + /* > * Raise priority if scanning rate is too low or there was no > * progress in reclaiming pages > */ > if (raise_priority || !sc.nr_reclaimed) > sc.priority--; > - } while (sc.priority >= 1 && > - !pgdat_balanced(pgdat, order, classzone_idx)); > + } while (sc.priority >= 1); > > out: > /* > - * Return the highest zone idx we were reclaiming at so > - * prepare_kswapd_sleep() makes the same decisions as here. 
> + * Return the order kswapd stopped reclaiming at as > + * prepare_kswapd_sleep() takes it into account. If another caller > + * entered the allocator slow path while kswapd was awake, order will > + * remain at the higher level. > */ > - return end_zone; > + return sc.order; > } > ^ permalink raw reply [flat|nested] 9+ messages in thread
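As an aside for readers following the hunk quoted above: the reclaim target that kswapd_shrink_node() builds up can be modelled outside the kernel. The sketch below is illustrative only; the zone names, sizes and watermarks are invented, populated_zone() is approximated by a non-zero managed_pages field, and SWAP_CLUSTER_MAX is the kernel's 32-page reclaim batch.

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL	/* kernel's reclaim batch size */

struct zone_model {
	const char *name;
	unsigned long managed_pages;	/* 0 stands in for !populated_zone() */
	unsigned long high_wmark;	/* high watermark in pages */
};

int main(void)
{
	/* Hypothetical pgdat; all values are invented for illustration. */
	struct zone_model zones[] = {
		{ "DMA",    4000,      45 },
		{ "DMA32",  0,         0 },	/* not populated, skipped */
		{ "Normal", 1UL << 20, 12000 },
	};
	int classzone_idx = 2;	/* reclaiming on behalf of ZONE_NORMAL */
	unsigned long nr_to_reclaim = 0;

	for (int z = 0; z <= classzone_idx; z++) {
		if (!zones[z].managed_pages)
			continue;
		/* max(high_wmark_pages(zone), SWAP_CLUSTER_MAX) per zone */
		nr_to_reclaim += zones[z].high_wmark > SWAP_CLUSTER_MAX ?
				 zones[z].high_wmark : SWAP_CLUSTER_MAX;
	}
	printf("sc->nr_to_reclaim = %lu pages\n", nr_to_reclaim);
	return 0;
}

Run as-is, this prints sc->nr_to_reclaim = 12045: the 45-page DMA high watermark plus the 12000-page Normal one, with the unpopulated DMA32 zone skipped. The point of the loop is that the target scales with the number of eligible zones, rather than being recomputed per zone as the old kswapd_shrink_zone() did.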
* Re: [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes
  2016-06-22  8:42 ` [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes Hillf Danton
@ 2016-06-23 11:31   ` Mel Gorman
  0 siblings, 0 replies; 9+ messages in thread
From: Mel Gorman @ 2016-06-23 11:31 UTC (permalink / raw)
To: Hillf Danton; +Cc: Johannes Weiner, Vlastimil Babka, linux-kernel, linux-mm

On Wed, Jun 22, 2016 at 04:42:06PM +0800, Hillf Danton wrote:
> > /*
> > - * If a zone reaches its high watermark, consider it to be no longer
> > - * congested. It's possible there are dirty pages backed by congested
> > - * BDIs but as pressure is relieved, speculatively avoid congestion
> > - * waits.
> > + * Fragmentation may mean that the system cannot be rebalanced for
> > + * high-order allocations. If twice the allocation size has been
> > + * reclaimed then recheck watermarks only at order-0 to prevent
> > + * excessive reclaim. Assume that a process requested a high-order
> > + * can direct reclaim/compact.
> > */
> > - if (pgdat_reclaimable(zone->zone_pgdat) &&
> > - zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
> > - clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> > - clear_bit(PGDAT_DIRTY, &pgdat->flags);
> > - }
> > + if (sc->order && sc->nr_reclaimed >= 2UL << sc->order)
> > + sc->order = 0;
> >
>
> Reclaim order is changed here.
> Btw, I find no such change in current code.
>

It is reintroducing a check removed by commit accf62422b3a ("mm, kswapd:
replace kswapd compaction with waking up kcompactd"). That patch had kswapd
always check at order-0 in pgdat_balanced() once kswapd was awake, but it
would still take at least one pass through reclaim so that kcompactd could
potentially make progress. This patch removes pgdat_balanced() entirely and
zone_balanced() checks the order it is asked for, as it used to. Hence, it
is necessary to reset sc->order once progress is made or kswapd potentially
stays awake reclaiming pages until a high-order page is freed.

--
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 9+ messages in thread
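To make the arithmetic in the check concrete: 2UL << sc->order equals 2 * (1UL << sc->order), i.e. twice the requested allocation size expressed in base pages. A small standalone sketch (the order range shown is arbitrary):

#include <stdio.h>

int main(void)
{
	/* 2UL << order is twice the allocation size in base pages. */
	for (unsigned int order = 1; order <= 9; order++) {
		unsigned long alloc_pages = 1UL << order;
		unsigned long threshold = 2UL << order;

		printf("order-%u: allocation = %3lu pages, "
		       "sc->order resets after %4lu pages reclaimed\n",
		       order, alloc_pages, threshold);
	}
	return 0;
}

For an order-9 request (THP-sized on x86-64, 512 pages), kswapd therefore drops back to order-0 watermark checks once 1024 pages have been reclaimed, on the assumption that the requesting process can direct reclaim/compact and that kcompactd has been given room to work.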
* [PATCH 00/27] Move LRU page reclaim from zones to nodes v7
@ 2016-06-21 14:15 Mel Gorman
  2016-06-21 14:15 ` [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2016-06-21 14:15 UTC (permalink / raw)
To: Andrew Morton, Linux-MM
Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

(sorry for resend, the previous attempt didn't go through fully for some reason)

The bulk of the updates are in response to review from Vlastimil Babka and
this version received a lot more testing than v6.

Changelog since v6
o Correct reclaim_idx when direct reclaiming for memcg
o Also account LRU pages per zone for compaction/reclaim
o Add page_pgdat helper with more efficient lookup
o Init pgdat LRU lock only once
o Slight optimisation to wake_all_kswapds
o Always wake kcompactd when kswapd is going to sleep
o Rebase to mmotm as of June 15th, 2016

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node. It is based upon 4.6-rc3 plus the page allocator optimisation
series. Conceptually, this is simple but there are a lot of details. Some
of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from. This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly differently to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if the
   zone is over the high watermark regardless of the age of pages in that
   LRU. Kswapd on the other hand starts reclaim on the highest unbalanced
   zone. A difference in the distribution of file/anon pages due to when
   they were allocated can result in a difference in aging. While the fair
   zone allocation policy mitigates some of the problems here, the page
   reclaim results on a multi-zone node will always be different to a
   single-zone node.

3. kswapd and the page allocator scan zones in the opposite order to avoid
   interfering with each other but it's sensitive to timing. This mitigates
   the page allocator using pages that were allocated very recently in the
   ideal case. When kswapd is allocating from lower zones then it's great
   but during the rebalancing of the highest zone, the page allocator and
   kswapd interfere with each other. It's worse if the highest zone is
   small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the
   exact relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have large highmem
zones in common configurations and it was necessary to quickly find
ZONE_NORMAL pages for reclaim. Today, this is much less of a concern as
machines with lots of memory will (or should) use 64-bit kernels.
Combinations of 32-bit hardware and 64-bit hardware are rare. Machines that
do use highmem should have relatively lower highmem:lowmem ratios than we
worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The page
allocator plays fewer tricks to game reclaim and reclaim behaves similarly
on all nodes.
The series has been tested on a 16 core UMA machine and a 2-socket 48 core
NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested up to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7r17
Min total-odr0-1 485.00 ( 0.00%) 462.00 ( 4.74%)
Min total-odr0-2 354.00 ( 0.00%) 341.00 ( 3.67%)
Min total-odr0-4 285.00 ( 0.00%) 277.00 ( 2.81%)
Min total-odr0-8 249.00 ( 0.00%) 240.00 ( 3.61%)
Min total-odr0-16 230.00 ( 0.00%) 224.00 ( 2.61%)
Min total-odr0-32 222.00 ( 0.00%) 215.00 ( 3.15%)
Min total-odr0-64 216.00 ( 0.00%) 210.00 ( 2.78%)
Min total-odr0-128 214.00 ( 0.00%) 208.00 ( 2.80%)
Min total-odr0-256 248.00 ( 0.00%) 233.00 ( 6.05%)
Min total-odr0-512 277.00 ( 0.00%) 270.00 ( 2.53%)
Min total-odr0-1024 294.00 ( 0.00%) 284.00 ( 3.40%)
Min total-odr0-2048 308.00 ( 0.00%) 298.00 ( 3.25%)
Min total-odr0-4096 318.00 ( 0.00%) 307.00 ( 3.46%)
Min total-odr0-8192 322.00 ( 0.00%) 308.00 ( 4.35%)
Min total-odr0-16384 324.00 ( 0.00%) 309.00 ( 4.63%)
Min total-odr1-1 729.00 ( 0.00%) 686.00 ( 5.90%)
Min total-odr1-2 533.00 ( 0.00%) 520.00 ( 2.44%)
Min total-odr1-4 434.00 ( 0.00%) 415.00 ( 4.38%)
Min total-odr1-8 390.00 ( 0.00%) 364.00 ( 6.67%)
Min total-odr1-16 359.00 ( 0.00%) 335.00 ( 6.69%)
Min total-odr1-32 356.00 ( 0.00%) 327.00 ( 8.15%)
Min total-odr1-64 356.00 ( 0.00%) 321.00 ( 9.83%)
Min total-odr1-128 356.00 ( 0.00%) 333.00 ( 6.46%)
Min total-odr1-256 354.00 ( 0.00%) 337.00 ( 4.80%)
Min total-odr1-512 366.00 ( 0.00%) 340.00 ( 7.10%)
Min total-odr1-1024 373.00 ( 0.00%) 354.00 ( 5.09%)
Min total-odr1-2048 375.00 ( 0.00%) 354.00 ( 5.60%)
Min total-odr1-4096 374.00 ( 0.00%) 354.00 ( 5.35%)
Min total-odr1-8192 370.00 ( 0.00%) 355.00 ( 4.05%)

This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;

4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7
User 174.06 174.58
System 2656.78 2485.21
Elapsed 2885.07 2713.67

The vmstats also showed that the fair zone allocation policy was definitely
removed as can be seen here;

4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7r17
DMA32 allocs 28794408561 0
Normal allocs 48431969998 77226313470
Movable allocs 0 0

tiobench on ext4
----------------

tiobench is a benchmark that artificially benefits if old pages remain
resident while new pages get reclaimed. The fair zone allocation policy
mitigates this problem so pages age fairly. While the benchmark has
problems, it is important that tiobench performance remains constant as it
implies that page aging problems that the fair zone allocation policy fixes
are not re-introduced.
4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7r17
Min PotentialReadSpeed 90.24 ( 0.00%) 90.14 ( -0.11%)
Min SeqRead-MB/sec-1 80.63 ( 0.00%) 83.09 ( 3.05%)
Min SeqRead-MB/sec-2 71.91 ( 0.00%) 72.44 ( 0.74%)
Min SeqRead-MB/sec-4 75.20 ( 0.00%) 74.32 ( -1.17%)
Min SeqRead-MB/sec-8 65.30 ( 0.00%) 65.21 ( -0.14%)
Min SeqRead-MB/sec-16 62.62 ( 0.00%) 62.12 ( -0.80%)
Min RandRead-MB/sec-1 0.90 ( 0.00%) 0.94 ( 4.44%)
Min RandRead-MB/sec-2 0.96 ( 0.00%) 0.97 ( 1.04%)
Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.41 ( -1.40%)
Min RandRead-MB/sec-8 1.67 ( 0.00%) 1.72 ( 2.99%)
Min RandRead-MB/sec-16 1.77 ( 0.00%) 1.86 ( 5.08%)
Min SeqWrite-MB/sec-1 78.12 ( 0.00%) 79.78 ( 2.12%)
Min SeqWrite-MB/sec-2 72.74 ( 0.00%) 73.23 ( 0.67%)
Min SeqWrite-MB/sec-4 79.40 ( 0.00%) 78.32 ( -1.36%)
Min SeqWrite-MB/sec-8 73.18 ( 0.00%) 71.40 ( -2.43%)
Min SeqWrite-MB/sec-16 75.82 ( 0.00%) 75.24 ( -0.76%)
Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.15 ( -2.54%)
Min RandWrite-MB/sec-2 1.05 ( 0.00%) 0.99 ( -5.71%)
Min RandWrite-MB/sec-4 1.00 ( 0.00%) 0.96 ( -4.00%)
Min RandWrite-MB/sec-8 0.91 ( 0.00%) 0.92 ( 1.10%)
Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.92 ( 0.00%)

This shows that the series has little or no impact on tiobench which is
desirable. It indicates that the fair zone allocation policy was removed in
a manner that didn't reintroduce one class of page aging bug. There were
only minor differences in overall reclaim activity

4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7r17
Minor Faults 640992 640721
Major Faults 728 623
Swap Ins 0 0
Swap Outs 0 0
DMA allocs 0 0
DMA32 allocs 46174282 44219717
Normal allocs 77949344 79858024
Movable allocs 0 0
Allocation stalls 38 76
Direct pages scanned 17463 34865
Kswapd pages scanned 93331163 93302388
Kswapd pages reclaimed 93328173 93299677
Direct pages reclaimed 17463 34865
Kswapd efficiency 99% 99%
Kswapd velocity 13729.855 13755.612
Direct efficiency 100% 100%
Direct velocity 2.569 5.140
Percentage direct scans 0% 0%
Page writes by reclaim 0 0
Page writes file 0 0
Page writes anon 0 0
Page reclaim immediate 54 36

kswapd activity was roughly comparable. There were slight differences in
direct reclaim activity but negligible in the context of the overall
workload (velocity of 5 pages per second with the patches applied, 2 pages
per second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy is
safe.

pgbench Transactions
4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7r17
Hmean 1 191.00 ( 0.00%) 193.67 ( 1.40%)
Hmean 5 338.59 ( 0.00%) 336.99 ( -0.47%)
Hmean 12 374.03 ( 0.00%) 386.15 ( 3.24%)
Hmean 21 372.24 ( 0.00%) 372.02 ( -0.06%)
Hmean 30 383.98 ( 0.00%) 370.69 ( -3.46%)
Hmean 32 431.01 ( 0.00%) 438.47 ( 1.73%)

Negligible differences again. As with tiobench, overall reclaim activity
was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim
stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts of
data from disk.
paralleldd
4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7r17
Amean Elapsd-1 181.57 ( 0.00%) 179.63 ( 1.07%)
Amean Elapsd-3 188.29 ( 0.00%) 183.68 ( 2.45%)
Amean Elapsd-5 188.02 ( 0.00%) 181.73 ( 3.35%)
Amean Elapsd-7 186.07 ( 0.00%) 184.11 ( 1.05%)
Amean Elapsd-12 188.16 ( 0.00%) 183.51 ( 2.47%)
Amean Elapsd-16 189.03 ( 0.00%) 181.27 ( 4.10%)

4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7r17
User 1439.23 1433.37
System 8332.31 8216.01
Elapsed 3619.80 3532.69

There is a slight gain in performance, some of which is from the reduced
system CPU usage. There are minor differences in reclaim activity but
nothing significant

4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7r17
Minor Faults 362486 358215
Major Faults 1143 1113
Swap Ins 26 0
Swap Outs 2920 482
DMA allocs 0 0
DMA32 allocs 31568814 28598887
Normal allocs 46539922 49514444
Movable allocs 0 0
Allocation stalls 0 0
Direct pages scanned 0 0
Kswapd pages scanned 40886878 40849710
Kswapd pages reclaimed 40869923 40835207
Direct pages reclaimed 0 0
Kswapd efficiency 99% 99%
Kswapd velocity 11295.342 11563.344
Direct efficiency 100% 100%
Direct velocity 0.000 0.000
Slabs scanned 131673 126099
Direct inode steals 57 60
Kswapd inode steals 762 18

It basically shows that kswapd was active at roughly the same rate in both
kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in the number of inodes reclaimed but the workload has few active inodes so
the difference is likely a timing artifact. It's interesting to note that
the node-lru did not swap in any pages but given the low swap activity,
it's unlikely to be significant.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file. The
primary metric is checking for mmap latency.

stutter
4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7r17
Min mmap 16.8422 ( 0.00%) 15.9821 ( 5.11%)
1st-qrtle mmap 57.8709 ( 0.00%) 58.0794 ( -0.36%)
2nd-qrtle mmap 58.4335 ( 0.00%) 59.4286 ( -1.70%)
3rd-qrtle mmap 58.6957 ( 0.00%) 59.6862 ( -1.69%)
Max-90% mmap 58.9388 ( 0.00%) 59.8759 ( -1.59%)
Max-93% mmap 59.0505 ( 0.00%) 59.9333 ( -1.50%)
Max-95% mmap 59.1877 ( 0.00%) 59.9844 ( -1.35%)
Max-99% mmap 60.3237 ( 0.00%) 60.2337 ( 0.15%)
Max mmap 285.6454 ( 0.00%) 135.6006 ( 52.53%)
Mean mmap 57.8366 ( 0.00%) 58.4884 ( -1.13%)

This shows that there is a slight impact on mmap latency but that the
worst-case outlier is much improved. As the problem with this benchmark
used to be that the kernel stalled for minutes, this difference is
negligible. Some of the vmstats are interesting

4.7.0-rc3 4.7.0-rc3
mmotm-20160615 nodelru-v7r17
Swap Ins 58 42
Swap Outs 0 0
Allocation stalls 16 0
Direct pages scanned 1374 0
Kswapd pages scanned 42454910 41782544
Kswapd pages reclaimed 41571035 41781833
Direct pages reclaimed 1167 0
Kswapd efficiency 97% 99%
Kswapd velocity 14774.479 14223.796
Direct efficiency 84% 100%
Direct velocity 0.478 0.000
Percentage direct scans 0% 0%
Page writes by reclaim 696918 0
Page writes file 696918 0
Page writes anon 0 0
Page reclaim immediate 2940 137
Sector Reads 81644424 81699544
Sector Writes 99193620 98862160
Page rescued immediate 0 0
Slabs scanned 1279838 22640

kswapd and direct reclaim activity are similar but the node LRU series did
not attempt to trigger any page writes from reclaim context. This series is
not without its hazards.
There are at least three areas that I'm concerned with even though I could
not reproduce any problems in those areas.

1. Reclaim/compaction is going to be affected because the amount of reclaim
   is no longer targeted at a specific zone. Compaction works on a per-zone
   basis so there is no guarantee that reclaiming a few THPs' worth of
   pages will have a positive impact on compaction success rates.

2. The Slab/LRU reclaim ratio is affected because the frequency with which
   the shrinkers are called is now different. This may or may not be a
   problem but if it is, it'll be because shrinkers are not called enough
   and some balancing is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied
   are distributed between zones and the fair zone allocation policy used
   to do something very similar for anon. The distribution is now different
   but not necessarily in any way that matters; it's still worth bearing in
   mind.

 Documentation/cgroup-v1/memcg_test.txt | 4 +-
 Documentation/cgroup-v1/memory.txt | 4 +-
 arch/s390/appldata/appldata_mem.c | 2 +-
 arch/tile/mm/pgtable.c | 18 +-
 drivers/base/node.c | 73 +--
 drivers/staging/android/lowmemorykiller.c | 12 +-
 fs/fs-writeback.c | 4 +-
 fs/fuse/file.c | 8 +-
 fs/nfs/internal.h | 2 +-
 fs/nfs/write.c | 2 +-
 fs/proc/meminfo.c | 14 +-
 include/linux/backing-dev.h | 2 +-
 include/linux/memcontrol.h | 32 +-
 include/linux/mm.h | 5 +
 include/linux/mm_inline.h | 21 +-
 include/linux/mm_types.h | 2 +-
 include/linux/mmzone.h | 158 +++---
 include/linux/swap.h | 23 +-
 include/linux/topology.h | 2 +-
 include/linux/vm_event_item.h | 14 +-
 include/linux/vmstat.h | 111 +++-
 include/linux/writeback.h | 2 +-
 include/trace/events/vmscan.h | 63 ++-
 include/trace/events/writeback.h | 10 +-
 kernel/power/snapshot.c | 10 +-
 kernel/sysctl.c | 4 +-
 mm/backing-dev.c | 15 +-
 mm/compaction.c | 28 +-
 mm/filemap.c | 14 +-
 mm/huge_memory.c | 33 +-
 mm/internal.h | 11 +-
 mm/memcontrol.c | 246 ++++----
 mm/memory-failure.c | 4 +-
 mm/memory_hotplug.c | 7 +-
 mm/mempolicy.c | 2 +-
 mm/migrate.c | 35 +-
 mm/mlock.c | 12 +-
 mm/page-writeback.c | 124 ++--
 mm/page_alloc.c | 268 ++++-----
 mm/page_idle.c | 4 +-
 mm/rmap.c | 14 +-
 mm/shmem.c | 12 +-
 mm/swap.c | 66 +--
 mm/swap_state.c | 4 +-
 mm/util.c | 4 +-
 mm/vmscan.c | 901 +++++++++++++++---------
 mm/vmstat.c | 376 ++++++++++---
 mm/workingset.c | 54 +-
 48 files changed, 1573 insertions(+), 1263 deletions(-)

--
2.6.4

^ permalink raw reply [flat|nested] 9+ messages in thread
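A note on the derived figures quoted in the tables above: "efficiency" and "velocity" are not raw vmstat counters. Assuming the usual mmtests definitions, pages reclaimed over pages scanned as a percentage and pages scanned per second of elapsed time, the paralleldd numbers can be reproduced:

#include <stdio.h>

int main(void)
{
	/* paralleldd figures for nodelru-v7r17, taken from the report. */
	unsigned long kswapd_scanned   = 40849710;
	unsigned long kswapd_reclaimed = 40835207;
	double elapsed_secs = 3532.69;	/* reported Elapsed time */

	printf("Kswapd efficiency: %lu%%\n",
	       100 * kswapd_reclaimed / kswapd_scanned);
	printf("Kswapd velocity:   %.3f pages/sec\n",
	       kswapd_scanned / elapsed_secs);
	return 0;
}

This prints 99% and 11563.344 pages/sec, matching the table, so the assumed definitions appear to be the ones the report uses.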
* [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes 2016-06-21 14:15 [PATCH 00/27] Move LRU page reclaim from zones to nodes v7 Mel Gorman @ 2016-06-21 14:15 ` Mel Gorman 0 siblings, 0 replies; 9+ messages in thread From: Mel Gorman @ 2016-06-21 14:15 UTC (permalink / raw) To: Andrew Morton, Linux-MM Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started thinking of reclaim in terms of nodes but kswapd is still zone-centric. This patch gets rid of many of the node-based versus zone-based decisions. o A node is considered balanced when any eligible lower zone is balanced. This eliminates one class of age-inversion problem because we avoid reclaiming a newer page just because it's in the wrong zone o pgdat_balanced disappears because we now only care about one zone being balanced. o Some anomalies related to writeback and congestion tracking being based on zones disappear. o kswapd no longer has to take care to reclaim zones in the reverse order that the page allocator uses. o Most importantly of all, reclaim from node 0 with multiple zones will have similar aging and reclaiming characteristics as every other node. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- mm/vmscan.c | 292 +++++++++++++++++++++--------------------------------------- 1 file changed, 101 insertions(+), 191 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index b5b355db97cb..5873f5003078 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2972,7 +2972,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, } #endif -static void age_active_anon(struct zone *zone, struct scan_control *sc) +static void age_active_anon(struct pglist_data *pgdat, + struct zone *zone, struct scan_control *sc) { struct mem_cgroup *memcg; @@ -2991,85 +2992,15 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc) } while (memcg); } -static bool zone_balanced(struct zone *zone, int order, bool highorder, +static bool zone_balanced(struct zone *zone, int order, unsigned long balance_gap, int classzone_idx) { unsigned long mark = high_wmark_pages(zone) + balance_gap; - /* - * When checking from pgdat_balanced(), kswapd should stop and sleep - * when it reaches the high order-0 watermark and let kcompactd take - * over. Other callers such as wakeup_kswapd() want to determine the - * true high-order watermark. - */ - if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) { - mark += (1UL << order); - order = 0; - } - return zone_watermark_ok_safe(zone, order, mark, classzone_idx); } /* - * pgdat_balanced() is used when checking if a node is balanced. - * - * For order-0, all zones must be balanced! - * - * For high-order allocations only zones that meet watermarks and are in a - * zone allowed by the callers classzone_idx are added to balanced_pages. The - * total of balanced pages must be at least 25% of the zones allowed by - * classzone_idx for the node to be considered balanced. Forcing all zones to - * be balanced for high orders can cause excessive reclaim when there are - * imbalanced zones. - * The choice of 25% is due to - * o a 16M DMA zone that is balanced will not balance a zone on any - * reasonable sized machine - * o On all other machines, the top zone must be at least a reasonable - * percentage of the middle zones. 
For example, on 32-bit x86, highmem - * would need to be at least 256M for it to be balance a whole node. - * Similarly, on x86-64 the Normal zone would need to be at least 1G - * to balance a node on its own. These seemed like reasonable ratios. - */ -static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) -{ - unsigned long managed_pages = 0; - unsigned long balanced_pages = 0; - int i; - - /* Check the watermark levels */ - for (i = 0; i <= classzone_idx; i++) { - struct zone *zone = pgdat->node_zones + i; - - if (!populated_zone(zone)) - continue; - - managed_pages += zone->managed_pages; - - /* - * A special case here: - * - * balance_pgdat() skips over all_unreclaimable after - * DEF_PRIORITY. Effectively, it considers them balanced so - * they must be considered balanced here as well! - */ - if (!pgdat_reclaimable(zone->zone_pgdat)) { - balanced_pages += zone->managed_pages; - continue; - } - - if (zone_balanced(zone, order, false, 0, i)) - balanced_pages += zone->managed_pages; - else if (!order) - return false; - } - - if (order) - return balanced_pages >= (managed_pages >> 2); - else - return true; -} - -/* * Prepare kswapd for sleeping. This verifies that there are no processes * waiting in throttle_direct_reclaim() and that watermarks have been met. * @@ -3078,6 +3009,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, int classzone_idx) { + int i; + /* If a direct reclaimer woke kswapd within HZ/10, it's premature */ if (remaining) return false; @@ -3098,101 +3031,90 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, if (waitqueue_active(&pgdat->pfmemalloc_wait)) wake_up_all(&pgdat->pfmemalloc_wait); - return pgdat_balanced(pgdat, order, classzone_idx); + for (i = 0; i <= classzone_idx; i++) { + struct zone *zone = pgdat->node_zones + i; + + if (!populated_zone(zone)) + continue; + + if (zone_balanced(zone, order, 0, classzone_idx)) + return true; + } + + return false; } /* - * kswapd shrinks the zone by the number of pages required to reach - * the high watermark. + * kswapd shrinks a node of pages that are at or below the highest usable + * zone that is currently unbalanced. * * Returns true if kswapd scanned at least the requested number of pages to * reclaim or if the lack of progress was due to pages under writeback. * This is used to determine if the scanning priority needs to be raised. */ -static bool kswapd_shrink_zone(struct zone *zone, +static bool kswapd_shrink_node(pg_data_t *pgdat, int classzone_idx, struct scan_control *sc) { - unsigned long balance_gap; - bool lowmem_pressure; - struct pglist_data *pgdat = zone->zone_pgdat; + struct zone *zone; + int z; - /* Reclaim above the high watermark. */ - sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)); + /* Reclaim a number of pages proportional to the number of zones */ + sc->nr_to_reclaim = 0; + for (z = 0; z <= classzone_idx; z++) { + zone = pgdat->node_zones + z; + if (!populated_zone(zone)) + continue; - /* - * We put equal pressure on every zone, unless one zone has way too - * many pages free already. The "too many pages" is defined as the - * high wmark plus a "gap" where the gap is either the low - * watermark or 1% of the zone, whichever is smaller. 
- */ - balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP( - zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO)); + sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX); + } /* - * If there is no low memory pressure or the zone is balanced then no - * reclaim is necessary + * Historically care was taken to put equal pressure on all zones but + * now pressure is applied based on node LRU order. */ - lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone)); - if (!lowmem_pressure && zone_balanced(zone, sc->order, false, - balance_gap, classzone_idx)) - return true; - - shrink_node(zone->zone_pgdat, sc, classzone_idx); - - /* TODO: ANOMALY */ - clear_bit(PGDAT_WRITEBACK, &pgdat->flags); + shrink_node(pgdat, sc, classzone_idx); /* - * If a zone reaches its high watermark, consider it to be no longer - * congested. It's possible there are dirty pages backed by congested - * BDIs but as pressure is relieved, speculatively avoid congestion - * waits. + * Fragmentation may mean that the system cannot be rebalanced for + * high-order allocations. If twice the allocation size has been + * reclaimed then recheck watermarks only at order-0 to prevent + * excessive reclaim. Assume that a process requested a high-order + * can direct reclaim/compact. */ - if (pgdat_reclaimable(zone->zone_pgdat) && - zone_balanced(zone, sc->order, false, 0, classzone_idx)) { - clear_bit(PGDAT_CONGESTED, &pgdat->flags); - clear_bit(PGDAT_DIRTY, &pgdat->flags); - } + if (sc->order && sc->nr_reclaimed >= 2UL << sc->order) + sc->order = 0; return sc->nr_scanned >= sc->nr_to_reclaim; } /* - * For kswapd, balance_pgdat() will work across all this node's zones until - * they are all at high_wmark_pages(zone). - * - * Returns the highest zone idx kswapd was reclaiming at + * For kswapd, balance_pgdat() will reclaim pages across a node from zones + * that are eligible for use by the caller until at least one zone is + * balanced. * - * There is special handling here for zones which are full of pinned pages. - * This can happen if the pages are all mlocked, or if they are all used by - * device drivers (say, ZONE_DMA). Or if they are all in use by hugetlb. - * What we do is to detect the case where all pages in the zone have been - * scanned twice and there has been zero successful reclaim. Mark the zone as - * dead and from now on, only perform a short scan. Basically we're polling - * the zone for when the problem goes away. + * Returns the order kswapd finished reclaiming at. * * kswapd scans the zones in the highmem->normal->dma direction. It skips * zones which have free_pages > high_wmark_pages(zone), but once a zone is - * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the - * lower zones regardless of the number of free pages in the lower zones. This - * interoperates with the page allocator fallback scheme to ensure that aging - * of pages is balanced across the zones. + * found to have free_pages <= high_wmark_pages(zone), any page is that zone + * or lower is eligible for reclaim until at least one usable zone is + * balanced. */ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) { int i; - int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; + struct zone *zone; struct scan_control sc = { .gfp_mask = GFP_KERNEL, - .reclaim_idx = MAX_NR_ZONES - 1, .order = order, .priority = DEF_PRIORITY, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = 1, + .reclaim_idx = classzone_idx, }; count_vm_event(PAGEOUTRUN); @@ -3203,21 +3125,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) /* Scan from the highest requested zone to dma */ for (i = classzone_idx; i >= 0; i--) { - struct zone *zone = pgdat->node_zones + i; - + zone = pgdat->node_zones + i; if (!populated_zone(zone)) continue; - if (sc.priority != DEF_PRIORITY && - !pgdat_reclaimable(zone->zone_pgdat)) - continue; - - /* - * Do some background aging of the anon list, to give - * pages a chance to be referenced before reclaiming. - */ - age_active_anon(zone, &sc); - /* * If the number of buffer_heads in the machine * exceeds the maximum allowed level and this node @@ -3225,19 +3136,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) * it to relieve lowmem pressure. */ if (buffer_heads_over_limit && is_highmem_idx(i)) { - end_zone = i; + classzone_idx = i; break; } - if (!zone_balanced(zone, order, false, 0, 0)) { - end_zone = i; + if (!zone_balanced(zone, order, 0, 0)) { + classzone_idx = i; break; } else { /* - * If balanced, clear the dirty and congested - * flags - * - * TODO: ANOMALY + * If any eligible zone is balanced then the + * node is not considered congested or dirty. */ clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags); clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags); @@ -3248,51 +3157,34 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) goto out; /* + * Do some background aging of the anon list, to give + * pages a chance to be referenced before reclaiming. All + * pages are rotated regardless of classzone as this is + * about consistent aging. + */ + age_active_anon(pgdat, &pgdat->node_zones[MAX_NR_ZONES - 1], &sc); + + /* * If we're getting trouble reclaiming, start doing writepage * even in laptop mode. */ - if (sc.priority < DEF_PRIORITY - 2) + if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat)) sc.may_writepage = 1; + /* Call soft limit reclaim before calling shrink_node. */ + sc.nr_scanned = 0; + nr_soft_scanned = 0; + nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, sc.order, + sc.gfp_mask, &nr_soft_scanned); + sc.nr_reclaimed += nr_soft_reclaimed; + /* - * Continue scanning in the highmem->dma direction stopping at - * the last zone which needs scanning. This may reclaim lowmem - * pages that are not necessary for zone balancing but it - * preserves LRU ordering. It is assumed that the bulk of - * allocation requests can use arbitrary zones with the - * possible exception of big highmem:lowmem configurations. + * There should be no need to raise the scanning priority if + * enough pages are already being scanned that that high + * watermark would be met at 100% efficiency. */ - for (i = end_zone; i >= 0; i--) { - struct zone *zone = pgdat->node_zones + i; - - if (!populated_zone(zone)) - continue; - - if (sc.priority != DEF_PRIORITY && - !pgdat_reclaimable(zone->zone_pgdat)) - continue; - - sc.nr_scanned = 0; - sc.reclaim_idx = i; - - nr_soft_scanned = 0; - /* - * Call soft limit reclaim before calling shrink_zone. 
- */ - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - order, sc.gfp_mask, - &nr_soft_scanned); - sc.nr_reclaimed += nr_soft_reclaimed; - - /* - * There should be no need to raise the scanning - * priority if enough pages are already being scanned - * that that high watermark would be met at 100% - * efficiency. - */ - if (kswapd_shrink_zone(zone, end_zone, &sc)) - raise_priority = false; - } + if (kswapd_shrink_node(pgdat, classzone_idx, &sc)) + raise_priority = false; /* * If the low watermark is met there is no need for processes @@ -3308,20 +3200,37 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) break; /* + * Stop reclaiming if any eligible zone is balanced and clear + * node writeback or congested. + */ + for (i = 0; i <= classzone_idx; i++) { + zone = pgdat->node_zones + i; + if (!populated_zone(zone)) + continue; + + if (zone_balanced(zone, sc.order, 0, classzone_idx)) { + clear_bit(PGDAT_CONGESTED, &pgdat->flags); + clear_bit(PGDAT_DIRTY, &pgdat->flags); + goto out; + } + } + + /* * Raise priority if scanning rate is too low or there was no * progress in reclaiming pages */ if (raise_priority || !sc.nr_reclaimed) sc.priority--; - } while (sc.priority >= 1 && - !pgdat_balanced(pgdat, order, classzone_idx)); + } while (sc.priority >= 1); out: /* - * Return the highest zone idx we were reclaiming at so - * prepare_kswapd_sleep() makes the same decisions as here. + * Return the order kswapd stopped reclaiming at as + * prepare_kswapd_sleep() takes it into account. If another caller + * entered the allocator slow path while kswapd was awake, order will + * remain at the higher level. */ - return end_zone; + return sc.order; } static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, @@ -3478,8 +3387,9 @@ static int kswapd(void *p) */ if (!ret) { trace_mm_vmscan_kswapd_wake(pgdat->node_id, order); - balanced_classzone_idx = balance_pgdat(pgdat, order, - classzone_idx); + + /* return value ignored until next patch */ + balance_pgdat(pgdat, order, classzone_idx); } } @@ -3509,7 +3419,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx) } if (!waitqueue_active(&pgdat->kswapd_wait)) return; - if (zone_balanced(zone, order, true, 0, 0)) + if (zone_balanced(zone, order, 0, 0)) return; trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); -- 2.6.4 ^ permalink raw reply related [flat|nested] 9+ messages in thread
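The behavioural core of the patch is the new node-balance rule: the node counts as balanced, and kswapd may sleep, as soon as any populated zone at or below classzone_idx meets its high watermark. A userspace model of the rule follows; the zone values are invented and zone_balanced() is reduced to a bare free-versus-watermark comparison, where the real function goes through zone_watermark_ok_safe():

#include <stdbool.h>
#include <stdio.h>

struct zone_model {
	const char *name;
	unsigned long managed_pages;	/* 0 stands in for !populated_zone() */
	unsigned long free_pages;
	unsigned long high_wmark;
};

/* Simplified stand-in for zone_balanced(). */
static bool zone_balanced(const struct zone_model *z)
{
	return z->free_pages > z->high_wmark;
}

int main(void)
{
	struct zone_model zones[] = {
		{ "DMA",    4000,      50,   45 },	/* balanced */
		{ "DMA32",  0,         0,    0 },	/* skipped */
		{ "Normal", 1UL << 20, 9000, 12000 },	/* unbalanced */
	};
	int classzone_idx = 2;
	bool node_balanced = false;

	/* Mirrors the loop in prepare_kswapd_sleep(): one balanced
	 * eligible zone is enough for the whole node.
	 */
	for (int i = 0; i <= classzone_idx; i++) {
		if (!zones[i].managed_pages)
			continue;
		if (zone_balanced(&zones[i])) {
			printf("%s is balanced -> node is balanced\n",
			       zones[i].name);
			node_balanced = true;
			break;
		}
	}
	printf("kswapd %s sleep\n", node_balanced ? "may" : "must not");
	return 0;
}

Here the small DMA zone being over its watermark is enough to declare the node balanced even though Normal is still short; the changelog argues this is acceptable because it avoids reclaiming newer pages just because they sit in the wrong zone.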
* Re: [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes [not found] <02fe01d1c48b$c44e9e80$4cebdb80$@alibaba-inc.com> @ 2016-06-12 9:33 ` Hillf Danton 2016-06-14 14:52 ` Mel Gorman 0 siblings, 1 reply; 9+ messages in thread From: Hillf Danton @ 2016-06-12 9:33 UTC (permalink / raw) To: Mel Gorman; +Cc: linux-kernel, linux-mm > > /* > - * kswapd shrinks the zone by the number of pages required to reach > - * the high watermark. > + * kswapd shrinks a node of pages that are at or below the highest usable > + * zone that is currently unbalanced. > * > * Returns true if kswapd scanned at least the requested number of pages to > * reclaim or if the lack of progress was due to pages under writeback. > * This is used to determine if the scanning priority needs to be raised. > */ > -static bool kswapd_shrink_zone(struct zone *zone, > +static bool kswapd_shrink_node(pg_data_t *pgdat, > int classzone_idx, > struct scan_control *sc) > { > - unsigned long balance_gap; > - bool lowmem_pressure; > - struct pglist_data *pgdat = zone->zone_pgdat; > + struct zone *zone; > + unsigned long nr_to_reclaim = 0; > + int z; > > - /* Reclaim above the high watermark. */ > - sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)); > + /* Reclaim a number of pages proportional to the number of zones */ > + for (z = 0; z <= classzone_idx; z++) { > + zone = pgdat->node_zones + z; > + if (!populated_zone(zone)) > + continue; > > - /* > - * We put equal pressure on every zone, unless one zone has way too > - * many pages free already. The "too many pages" is defined as the > - * high wmark plus a "gap" where the gap is either the low > - * watermark or 1% of the zone, whichever is smaller. > - */ > - balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP( > - zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO)); > + nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX); > + } Missing sc->nr_to_reclaim = nr_to_reclaim; ? > > /* > - * If there is no low memory pressure or the zone is balanced then no > - * reclaim is necessary > + * Historically care was taken to put equal pressure on all zones but > + * now pressure is applied based on node LRU order. > */ > - lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone)); > - if (!lowmem_pressure && zone_balanced(zone, sc->order, false, > - balance_gap, classzone_idx)) > - return true; > - > - shrink_node(zone->zone_pgdat, sc, classzone_idx); > - > - /* TODO: ANOMALY */ > - clear_bit(PGDAT_WRITEBACK, &pgdat->flags); > + shrink_node(pgdat, sc, classzone_idx); > > /* > - * If a zone reaches its high watermark, consider it to be no longer > - * congested. It's possible there are dirty pages backed by congested > - * BDIs but as pressure is relieved, speculatively avoid congestion > - * waits. > + * Fragmentation may mean that the system cannot be rebalanced for > + * high-order allocations. If twice the allocation size has been > + * reclaimed then recheck watermarks only at order-0 to prevent > + * excessive reclaim. Assume that a process requested a high-order > + * can direct reclaim/compact. > */ > - if (pgdat_reclaimable(zone->zone_pgdat) && > - zone_balanced(zone, sc->order, false, 0, classzone_idx)) { > - clear_bit(PGDAT_CONGESTED, &pgdat->flags); > - clear_bit(PGDAT_DIRTY, &pgdat->flags); > - } > + if (sc->order && sc->nr_reclaimed >= 2UL << sc->order) > + sc->order = 0; > > return sc->nr_scanned >= sc->nr_to_reclaim; > } ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes
  2016-06-12  9:33 ` Hillf Danton
@ 2016-06-14 14:52   ` Mel Gorman
  0 siblings, 0 replies; 9+ messages in thread
From: Mel Gorman @ 2016-06-14 14:52 UTC (permalink / raw)
To: Hillf Danton; +Cc: linux-kernel, linux-mm

On Sun, Jun 12, 2016 at 05:33:24PM +0800, Hillf Danton wrote:
> > - /*
> > - * We put equal pressure on every zone, unless one zone has way too
> > - * many pages free already. The "too many pages" is defined as the
> > - * high wmark plus a "gap" where the gap is either the low
> > - * watermark or 1% of the zone, whichever is smaller.
> > - */
> > - balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
> > - zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
> > + nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
> > + }
>
> Missing sc->nr_to_reclaim = nr_to_reclaim; ?
>

Yes. It may explain why I saw lower than expected kswapd activity in more
detailed tests recently.

Thanks.

--
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 9+ messages in thread
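To spell out the bug being acknowledged here: in the v6 hunk quoted above, the per-zone targets are summed into a local nr_to_reclaim that is never copied back into the scan_control, so shrink_node() runs against whatever stale sc->nr_to_reclaim was already there. The v7 posting earlier in this thread accumulates into sc->nr_to_reclaim directly. A self-contained illustration of the two shapes, with invented watermarks:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

struct scan_control { unsigned long nr_to_reclaim; };

/* High watermarks of three hypothetical zones; 0 means unpopulated. */
static unsigned long wmarks[] = { 45, 0, 12000 };

/* v6 shape: accumulates into a local that is never stored back. */
static void v6_buggy(struct scan_control *sc, int classzone_idx)
{
	unsigned long nr_to_reclaim = 0;

	for (int z = 0; z <= classzone_idx; z++) {
		if (!wmarks[z])
			continue;
		nr_to_reclaim += wmarks[z] > SWAP_CLUSTER_MAX ?
				 wmarks[z] : SWAP_CLUSTER_MAX;
	}
	/* Missing: sc->nr_to_reclaim = nr_to_reclaim; */
	(void)nr_to_reclaim;
	(void)sc;
}

/* v7 shape: accumulates into the scan_control directly. */
static void v7_fixed(struct scan_control *sc, int classzone_idx)
{
	sc->nr_to_reclaim = 0;
	for (int z = 0; z <= classzone_idx; z++) {
		if (!wmarks[z])
			continue;
		sc->nr_to_reclaim += wmarks[z] > SWAP_CLUSTER_MAX ?
				     wmarks[z] : SWAP_CLUSTER_MAX;
	}
}

int main(void)
{
	struct scan_control sc = { .nr_to_reclaim = 0 };	/* stale */

	v6_buggy(&sc, 2);
	printf("v6: sc->nr_to_reclaim = %lu (stale)\n", sc.nr_to_reclaim);
	v7_fixed(&sc, 2);
	printf("v7: sc->nr_to_reclaim = %lu\n", sc.nr_to_reclaim);
	return 0;
}

With the stale value at zero, the v6 shape leaves the target at 0 while the v7 shape computes 12045, consistent with the lower-than-expected kswapd activity mentioned in the reply.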
* [PATCH 00/27] Move LRU page reclaim from zones to nodes v6
@ 2016-06-09 18:04 Mel Gorman
  2016-06-09 18:04 ` [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2016-06-09 18:04 UTC (permalink / raw)
To: Andrew Morton, Linux-MM
Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

This is only lightly tested as I've had stability problems during boot that
have nothing to do with the series. It's based on mmots as of June 6th.
Very little has changed with the big exception of "mm, vmscan: Move LRU
lists to node" because it had to adapt to per-zone changes in
should_reclaim_retry and compaction_zonelist_suitable.

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node. It is based upon 4.6-rc3 plus the page allocator optimisation
series. Conceptually, this is simple but there are a lot of details. Some
of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from. This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly differently to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if the
   zone is over the high watermark regardless of the age of pages in that
   LRU. Kswapd on the other hand starts reclaim on the highest unbalanced
   zone. A difference in the distribution of file/anon pages due to when
   they were allocated can result in a difference in aging. While the fair
   zone allocation policy mitigates some of the problems here, the page
   reclaim results on a multi-zone node will always be different to a
   single-zone node.

3. kswapd and the page allocator scan zones in the opposite order to avoid
   interfering with each other but it's sensitive to timing. This mitigates
   the page allocator using pages that were allocated very recently in the
   ideal case. When kswapd is allocating from lower zones then it's great
   but during the rebalancing of the highest zone, the page allocator and
   kswapd interfere with each other. It's worse if the highest zone is
   small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the
   exact relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have large highmem
zones in common configurations and it was necessary to quickly find
ZONE_NORMAL pages for reclaim. Today, this is much less of a concern as
machines with lots of memory will (or should) use 64-bit kernels.
Combinations of 32-bit hardware and 64-bit hardware are rare. Machines that
do use highmem should have relatively lower highmem:lowmem ratios than we
worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The page
allocator plays fewer tricks to game reclaim and reclaim behaves similarly
on all nodes.

The series got basic testing this time on a UMA machine.
The page allocator microbenchmark highlights the gain from removing the
fair zone allocation policy

4.7.0-rc2 4.7.0-rc2
mmotm-20160606 nodelru-v6r2
Min total-odr0-1 500.00 ( 0.00%) 475.00 ( 5.00%)
Min total-odr0-2 358.00 ( 0.00%) 343.00 ( 4.19%)
Min total-odr0-4 292.00 ( 0.00%) 279.00 ( 4.45%)
Min total-odr0-8 253.00 ( 0.00%) 242.00 ( 4.35%)
Min total-odr0-16 275.00 ( 0.00%) 226.00 ( 17.82%)
Min total-odr0-32 225.00 ( 0.00%) 215.00 ( 4.44%)
Min total-odr0-64 219.00 ( 0.00%) 210.00 ( 4.11%)
Min total-odr0-128 216.00 ( 0.00%) 207.00 ( 4.17%)
Min total-odr0-256 243.00 ( 0.00%) 246.00 ( -1.23%)
Min total-odr0-512 276.00 ( 0.00%) 265.00 ( 3.99%)
Min total-odr0-1024 290.00 ( 0.00%) 287.00 ( 1.03%)
Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%)
Min total-odr0-4096 312.00 ( 0.00%) 310.00 ( 0.64%)
Min total-odr0-8192 320.00 ( 0.00%) 308.00 ( 3.75%)
Min total-odr0-16384 320.00 ( 0.00%) 308.00 ( 3.75%)
Min total-odr1-1 737.00 ( 0.00%) 707.00 ( 4.07%)
Min total-odr1-2 547.00 ( 0.00%) 521.00 ( 4.75%)
Min total-odr1-4 620.00 ( 0.00%) 418.00 ( 32.58%)
Min total-odr1-8 386.00 ( 0.00%) 367.00 ( 4.92%)
Min total-odr1-16 361.00 ( 0.00%) 340.00 ( 5.82%)
Min total-odr1-32 352.00 ( 0.00%) 328.00 ( 6.82%)
Min total-odr1-64 345.00 ( 0.00%) 324.00 ( 6.09%)
Min total-odr1-128 347.00 ( 0.00%) 328.00 ( 5.48%)
Min total-odr1-256 347.00 ( 0.00%) 329.00 ( 5.19%)
Min total-odr1-512 354.00 ( 0.00%) 332.00 ( 6.21%)
Min total-odr1-1024 355.00 ( 0.00%) 337.00 ( 5.07%)
Min total-odr1-2048 358.00 ( 0.00%) 345.00 ( 3.63%)
Min total-odr1-4096 360.00 ( 0.00%) 346.00 ( 3.89%)
Min total-odr1-8192 360.00 ( 0.00%) 347.00 ( 3.61%)

A basic IO benchmark based on varying numbers of dd running in parallel
showed nothing interesting other than differences in what zones were
scanned due to the fair zone allocation policy being removed.

This series is not without its hazards. There are at least three areas that
I'm concerned with even though I could not reproduce any problems in those
areas.

1. Reclaim/compaction is going to be affected because the amount of reclaim
   is no longer targeted at a specific zone. Compaction works on a per-zone
   basis so there is no guarantee that reclaiming a few THPs' worth of
   pages will have a positive impact on compaction success rates.

2. The Slab/LRU reclaim ratio is affected because the frequency with which
   the shrinkers are called is now different. This may or may not be a
   problem but if it is, it'll be because shrinkers are not called enough
   and some balancing is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied
   are distributed between zones and the fair zone allocation policy used
   to do something very similar for anon. The distribution is now different
   but not necessarily in any way that matters; it's still worth bearing in
   mind.
 Documentation/cgroup-v1/memcg_test.txt | 4 +-
 Documentation/cgroup-v1/memory.txt | 4 +-
 arch/s390/appldata/appldata_mem.c | 2 +-
 arch/tile/mm/pgtable.c | 18 +-
 drivers/base/node.c | 73 +--
 drivers/staging/android/lowmemorykiller.c | 12 +-
 fs/fs-writeback.c | 4 +-
 fs/fuse/file.c | 8 +-
 fs/nfs/internal.h | 2 +-
 fs/nfs/write.c | 2 +-
 fs/proc/meminfo.c | 14 +-
 include/linux/backing-dev.h | 2 +-
 include/linux/memcontrol.h | 30 +-
 include/linux/mm_inline.h | 2 +-
 include/linux/mm_types.h | 2 +-
 include/linux/mmzone.h | 157 +++--
 include/linux/swap.h | 15 +-
 include/linux/topology.h | 2 +-
 include/linux/vm_event_item.h | 14 +-
 include/linux/vmstat.h | 111 +++-
 include/linux/writeback.h | 2 +-
 include/trace/events/vmscan.h | 40 +-
 include/trace/events/writeback.h | 10 +-
 kernel/power/snapshot.c | 10 +-
 kernel/sysctl.c | 4 +-
 mm/backing-dev.c | 15 +-
 mm/compaction.c | 39 +-
 mm/filemap.c | 14 +-
 mm/huge_memory.c | 33 +-
 mm/internal.h | 11 +-
 mm/memcontrol.c | 235 ++++----
 mm/memory-failure.c | 4 +-
 mm/memory_hotplug.c | 7 +-
 mm/mempolicy.c | 2 +-
 mm/migrate.c | 35 +-
 mm/mlock.c | 12 +-
 mm/page-writeback.c | 124 +++--
 mm/page_alloc.c | 271 +++++-----
 mm/page_idle.c | 4 +-
 mm/rmap.c | 15 +-
 mm/shmem.c | 12 +-
 mm/swap.c | 66 +--
 mm/swap_state.c | 4 +-
 mm/util.c | 4 +-
 mm/vmscan.c | 829 +++++++++++++++---------
 mm/vmstat.c | 374 +++++++++++---
 mm/workingset.c | 52 +-
 47 files changed, 1489 insertions(+), 1217 deletions(-)

--
2.6.4

^ permalink raw reply [flat|nested] 9+ messages in thread
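For contrast with the new any-eligible-zone rule, the pgdat_balanced() function that the patch below deletes applied a 25% rule to high-order requests: zones holding at least a quarter of the managed pages up to classzone_idx had to be balanced, while order-0 required every populated zone to pass. A userspace model of that heuristic, with invented zone sizes and the special-casing of unreclaimable zones omitted:

#include <stdbool.h>
#include <stdio.h>

struct zone_model {
	const char *name;
	unsigned long managed_pages;	/* 0 means unpopulated */
	bool balanced;			/* outcome of zone_balanced() */
};

/* Model of the deleted pgdat_balanced(): order-0 requires every zone
 * to be balanced; higher orders require balanced zones to cover at
 * least 25% of the managed pages up to classzone_idx.
 */
static bool pgdat_balanced(const struct zone_model *zones,
			   int order, int classzone_idx)
{
	unsigned long managed_pages = 0, balanced_pages = 0;

	for (int i = 0; i <= classzone_idx; i++) {
		if (!zones[i].managed_pages)
			continue;
		managed_pages += zones[i].managed_pages;
		if (zones[i].balanced)
			balanced_pages += zones[i].managed_pages;
		else if (!order)
			return false;
	}
	return order ? balanced_pages >= (managed_pages >> 2) : true;
}

int main(void)
{
	/* A small unbalanced DMA zone beside a large balanced Normal zone. */
	struct zone_model zones[] = {
		{ "DMA",    4000,      false },
		{ "Normal", 1UL << 20, true  },
	};

	printf("order-0: node balanced? %s\n",
	       pgdat_balanced(zones, 0, 1) ? "yes" : "no");
	printf("order-3: node balanced? %s\n",
	       pgdat_balanced(zones, 3, 1) ? "yes" : "no");
	return 0;
}

The node fails the order-0 test but passes the order-3 one, the kind of per-order asymmetry that the node-based check in this series does away with.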
* [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes
  2016-06-09 18:04 [PATCH 00/27] Move LRU page reclaim from zones to nodes v6 Mel Gorman
@ 2016-06-09 18:04 ` Mel Gorman
  2016-06-15 14:23   ` Vlastimil Babka
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2016-06-09 18:04 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
thinking of reclaim in terms of nodes but kswapd is still zone-centric.
This patch gets rid of many of the node-based versus zone-based decisions.

o A node is considered balanced when any eligible lower zone is balanced.
  This eliminates one class of age-inversion problem because we avoid
  reclaiming a newer page just because it's in the wrong zone
o pgdat_balanced disappears because we now only care about one zone being
  balanced.
o Some anomalies related to writeback and congestion tracking being based
  on zones disappear.
o kswapd no longer has to take care to reclaim zones in the reverse order
  that the page allocator uses.
o Most importantly of all, reclaim from node 0 with multiple zones will
  have similar aging and reclaiming characteristics as every other node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 292 +++++++++++++++++++++---------------------------------------
 1 file changed, 101 insertions(+), 191 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0a619241c576..9368af4cfb06 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2942,7 +2942,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 }
 #endif
 
-static void age_active_anon(struct zone *zone, struct scan_control *sc)
+static void age_active_anon(struct pglist_data *pgdat,
+				struct zone *zone, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 
@@ -2961,85 +2962,15 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order, bool highorder,
+static bool zone_balanced(struct zone *zone, int order,
 			unsigned long balance_gap, int classzone_idx)
 {
 	unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-	/*
-	 * When checking from pgdat_balanced(), kswapd should stop and sleep
-	 * when it reaches the high order-0 watermark and let kcompactd take
-	 * over. Other callers such as wakeup_kswapd() want to determine the
-	 * true high-order watermark.
-	 */
-	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
-		mark += (1UL << order);
-		order = 0;
-	}
-
 	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
 
 /*
- * pgdat_balanced() is used when checking if a node is balanced.
- *
- * For order-0, all zones must be balanced!
- *
- * For high-order allocations only zones that meet watermarks and are in a
- * zone allowed by the callers classzone_idx are added to balanced_pages. The
- * total of balanced pages must be at least 25% of the zones allowed by
- * classzone_idx for the node to be considered balanced. Forcing all zones to
- * be balanced for high orders can cause excessive reclaim when there are
- * imbalanced zones.
- * The choice of 25% is due to
- *   o a 16M DMA zone that is balanced will not balance a zone on any
- *     reasonable sized machine
- *   o On all other machines, the top zone must be at least a reasonable
- *     percentage of the middle zones. For example, on 32-bit x86, highmem
- *     would need to be at least 256M for it to be balance a whole node.
- *     Similarly, on x86-64 the Normal zone would need to be at least 1G
- *     to balance a node on its own. These seemed like reasonable ratios.
- */
-static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
-{
-	unsigned long managed_pages = 0;
-	unsigned long balanced_pages = 0;
-	int i;
-
-	/* Check the watermark levels */
-	for (i = 0; i <= classzone_idx; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		if (!populated_zone(zone))
-			continue;
-
-		managed_pages += zone->managed_pages;
-
-		/*
-		 * A special case here:
-		 *
-		 * balance_pgdat() skips over all_unreclaimable after
-		 * DEF_PRIORITY. Effectively, it considers them balanced so
-		 * they must be considered balanced here as well!
-		 */
-		if (!pgdat_reclaimable(zone->zone_pgdat)) {
-			balanced_pages += zone->managed_pages;
-			continue;
-		}
-
-		if (zone_balanced(zone, order, false, 0, i))
-			balanced_pages += zone->managed_pages;
-		else if (!order)
-			return false;
-	}
-
-	if (order)
-		return balanced_pages >= (managed_pages >> 2);
-	else
-		return true;
-}
-
-/*
  * Prepare kswapd for sleeping. This verifies that there are no processes
  * waiting in throttle_direct_reclaim() and that watermarks have been met.
  *
@@ -3048,6 +2979,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
 {
+	int i;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return false;
@@ -3068,101 +3001,90 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
-	return pgdat_balanced(pgdat, order, classzone_idx);
+	for (i = 0; i <= classzone_idx; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_balanced(zone, order, 0, classzone_idx))
+			return true;
+	}
+
+	return false;
 }
 
 /*
- * kswapd shrinks the zone by the number of pages required to reach
- * the high watermark.
+ * kswapd shrinks a node of pages that are at or below the highest usable
+ * zone that is currently unbalanced.
  *
  * Returns true if kswapd scanned at least the requested number of pages to
  * reclaim or if the lack of progress was due to pages under writeback.
  * This is used to determine if the scanning priority needs to be raised.
  */
-static bool kswapd_shrink_zone(struct zone *zone,
+static bool kswapd_shrink_node(pg_data_t *pgdat,
 			       int classzone_idx,
 			       struct scan_control *sc)
 {
-	unsigned long balance_gap;
-	bool lowmem_pressure;
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct zone *zone;
+	unsigned long nr_to_reclaim = 0;
+	int z;
 
-	/* Reclaim above the high watermark. */
-	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+	/* Reclaim a number of pages proportional to the number of zones */
+	for (z = 0; z <= classzone_idx; z++) {
+		zone = pgdat->node_zones + z;
+		if (!populated_zone(zone))
+			continue;
 
-	/*
-	 * We put equal pressure on every zone, unless one zone has way too
-	 * many pages free already. The "too many pages" is defined as the
-	 * high wmark plus a "gap" where the gap is either the low
-	 * watermark or 1% of the zone, whichever is smaller.
-	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
+		nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
+	}
 
 	/*
-	 * If there is no low memory pressure or the zone is balanced then no
-	 * reclaim is necessary
+	 * Historically care was taken to put equal pressure on all zones but
+	 * now pressure is applied based on node LRU order.
 	 */
-	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, sc->order, false,
-						balance_gap, classzone_idx))
-		return true;
-
-	shrink_node(zone->zone_pgdat, sc, classzone_idx);
-
-	/* TODO: ANOMALY */
-	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
+	shrink_node(pgdat, sc, classzone_idx);
 
 	/*
-	 * If a zone reaches its high watermark, consider it to be no longer
-	 * congested. It's possible there are dirty pages backed by congested
-	 * BDIs but as pressure is relieved, speculatively avoid congestion
-	 * waits.
+	 * Fragmentation may mean that the system cannot be rebalanced for
+	 * high-order allocations. If twice the allocation size has been
+	 * reclaimed then recheck watermarks only at order-0 to prevent
+	 * excessive reclaim. Assume that a process requested a high-order
+	 * can direct reclaim/compact.
 	 */
-	if (pgdat_reclaimable(zone->zone_pgdat) &&
-	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
-		clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-		clear_bit(PGDAT_DIRTY, &pgdat->flags);
-	}
+	if (sc->order && sc->nr_reclaimed >= 2UL << sc->order)
+		sc->order = 0;
 
 	return sc->nr_scanned >= sc->nr_to_reclaim;
 }
 
 /*
- * For kswapd, balance_pgdat() will work across all this node's zones until
- * they are all at high_wmark_pages(zone).
- *
- * Returns the highest zone idx kswapd was reclaiming at
+ * For kswapd, balance_pgdat() will reclaim pages across a node from zones
+ * that are eligible for use by the caller until at least one zone is
+ * balanced.
  *
- * There is special handling here for zones which are full of pinned pages.
- * This can happen if the pages are all mlocked, or if they are all used by
- * device drivers (say, ZONE_DMA). Or if they are all in use by hugetlb.
- * What we do is to detect the case where all pages in the zone have been
- * scanned twice and there has been zero successful reclaim. Mark the zone as
- * dead and from now on, only perform a short scan. Basically we're polling
- * the zone for when the problem goes away.
+ * Returns the order kswapd finished reclaiming at.
  *
  * kswapd scans the zones in the highmem->normal->dma direction. It skips
  * zones which have free_pages > high_wmark_pages(zone), but once a zone is
- * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the
- * lower zones regardless of the number of free pages in the lower zones. This
- * interoperates with the page allocator fallback scheme to ensure that aging
- * of pages is balanced across the zones.
+ * found to have free_pages <= high_wmark_pages(zone), any page is that zone
+ * or lower is eligible for reclaim until at least one usable zone is
+ * balanced.
 */
 static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
-	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
-		.reclaim_idx = MAX_NR_ZONES - 1,
 		.order = order,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.reclaim_idx = classzone_idx,
 	};
 	count_vm_event(PAGEOUTRUN);
 
@@ -3173,21 +3095,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		/* Scan from the highest requested zone to dma */
 		for (i = classzone_idx; i >= 0; i--) {
-			struct zone *zone = pgdat->node_zones + i;
-
+			zone = pgdat->node_zones + i;
 			if (!populated_zone(zone))
 				continue;
 
-			if (sc.priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;
-
-			/*
-			 * Do some background aging of the anon list, to give
-			 * pages a chance to be referenced before reclaiming.
-			 */
-			age_active_anon(zone, &sc);
-
 			/*
 			 * If the number of buffer_heads in the machine
 			 * exceeds the maximum allowed level and this node
@@ -3195,19 +3106,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			 * it to relieve lowmem pressure.
 			 */
 			if (buffer_heads_over_limit && is_highmem_idx(i)) {
-				end_zone = i;
+				classzone_idx = i;
 				break;
 			}
 
-			if (!zone_balanced(zone, order, false, 0, 0)) {
-				end_zone = i;
+			if (!zone_balanced(zone, order, 0, 0)) {
+				classzone_idx = i;
 				break;
 			} else {
 				/*
-				 * If balanced, clear the dirty and congested
-				 * flags
-				 *
-				 * TODO: ANOMALY
+				 * If any eligible zone is balanced then the
+				 * node is not considered congested or dirty.
 				 */
 				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
 				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
@@ -3218,51 +3127,34 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			goto out;
 
 		/*
+		 * Do some background aging of the anon list, to give
+		 * pages a chance to be referenced before reclaiming. All
+		 * pages are rotated regardless of classzone as this is
+		 * about consistent aging.
+		 */
+		age_active_anon(pgdat, &pgdat->node_zones[MAX_NR_ZONES - 1], &sc);
+
+		/*
 		 * If we're getting trouble reclaiming, start doing writepage
 		 * even in laptop mode.
 		 */
-		if (sc.priority < DEF_PRIORITY - 2)
+		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
 			sc.may_writepage = 1;
 
+		/* Call soft limit reclaim before calling shrink_node. */
+		sc.nr_scanned = 0;
+		nr_soft_scanned = 0;
+		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, sc.order,
+						sc.gfp_mask, &nr_soft_scanned);
+		sc.nr_reclaimed += nr_soft_reclaimed;
+
 		/*
-		 * Continue scanning in the highmem->dma direction stopping at
-		 * the last zone which needs scanning. This may reclaim lowmem
-		 * pages that are not necessary for zone balancing but it
-		 * preserves LRU ordering. It is assumed that the bulk of
-		 * allocation requests can use arbitrary zones with the
-		 * possible exception of big highmem:lowmem configurations.
+		 * There should be no need to raise the scanning priority if
+		 * enough pages are already being scanned that that high
+		 * watermark would be met at 100% efficiency.
 		 */
-		for (i = end_zone; i >= end_zone; i--) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			if (!populated_zone(zone))
-				continue;
-
-			if (sc.priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;
-
-			sc.nr_scanned = 0;
-			sc.reclaim_idx = i;
-
-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-
-			/*
-			 * There should be no need to raise the scanning
-			 * priority if enough pages are already being scanned
-			 * that that high watermark would be met at 100%
-			 * efficiency.
-			 */
-			if (kswapd_shrink_zone(zone, end_zone, &sc))
-				raise_priority = false;
-		}
+		if (kswapd_shrink_node(pgdat, classzone_idx, &sc))
+			raise_priority = false;
 
 		/*
 		 * If the low watermark is met there is no need for processes
@@ -3278,20 +3170,37 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			break;
 
 		/*
+		 * Stop reclaiming if any eligible zone is balanced and clear
+		 * node writeback or congested.
+		 */
+		for (i = 0; i <= classzone_idx; i++) {
+			zone = pgdat->node_zones + i;
+			if (!populated_zone(zone))
+				continue;
+
+			if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
+				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+				clear_bit(PGDAT_DIRTY, &pgdat->flags);
+				goto out;
+			}
+		}
+
+		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
		if (raise_priority || !sc.nr_reclaimed)
 			sc.priority--;
-	} while (sc.priority >= 1 &&
-			!pgdat_balanced(pgdat, order, classzone_idx));
+	} while (sc.priority >= 1);
 
 out:
 	/*
-	 * Return the highest zone idx we were reclaiming at so
-	 * prepare_kswapd_sleep() makes the same decisions as here.
+	 * Return the order kswapd stopped reclaiming at as
+	 * prepare_kswapd_sleep() takes it into account. If another caller
+	 * entered the allocator slow path while kswapd was awake, order will
+	 * remain at the higher level.
 	 */
-	return end_zone;
+	return sc.order;
 }
 
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
@@ -3448,8 +3357,9 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balanced_classzone_idx = balance_pgdat(pgdat, order,
-								classzone_idx);
+
+			/* return value ignored until next patch */
+			balance_pgdat(pgdat, order, classzone_idx);
 		}
 	}
 
@@ -3479,7 +3389,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, true, 0, 0))
+	if (zone_balanced(zone, order, 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.6.4
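The behavioural core of the patch above is the new sleep/balance test: a
node now counts as balanced as soon as any one eligible zone meets its
high watermark, where the deleted pgdat_balanced() required 25% of managed
pages to sit in balanced zones for high-order requests. Below is a
minimal, self-contained C sketch of the new rule; the struct fields and
the zone_is_balanced() helper are simplified stand-ins for the kernel's
pg_data_t/zone and zone_balanced()/zone_watermark_ok_safe(), not kernel
interfaces:

#include <stdbool.h>
#include <stdio.h>

/* Pared-down stand-ins for the kernel structures used by the patch. */
struct zone {
	bool populated;
	unsigned long free_pages;
	unsigned long high_wmark;
};

struct pgdat {
	struct zone node_zones[4];	/* e.g. DMA, DMA32, NORMAL, MOVABLE */
};

/* Stand-in for zone_balanced(): high watermark met for this zone. */
static bool zone_is_balanced(const struct zone *zone)
{
	return zone->free_pages > zone->high_wmark;
}

/*
 * New rule: the node may sleep as soon as ANY eligible zone at or below
 * classzone_idx is balanced. The old pgdat_balanced() instead summed
 * managed pages and required 25% of them to sit in balanced zones.
 */
static bool node_balanced(const struct pgdat *pgdat, int classzone_idx)
{
	int i;

	for (i = 0; i <= classzone_idx; i++) {
		const struct zone *zone = &pgdat->node_zones[i];

		if (!zone->populated)
			continue;
		if (zone_is_balanced(zone))
			return true;
	}
	return false;
}

int main(void)
{
	struct pgdat node = { .node_zones = {
		{ .populated = true, .free_pages = 10,  .high_wmark = 100 },
		{ .populated = true, .free_pages = 500, .high_wmark = 100 },
		{ .populated = false },
		{ .populated = false },
	} };

	/* Zone 1 is balanced, so the node as a whole counts as balanced. */
	printf("%s\n", node_balanced(&node, 1) ? "balanced" : "unbalanced");
	return 0;
}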
* Re: [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes
  2016-06-09 18:04 ` [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes Mel Gorman
@ 2016-06-15 14:23   ` Vlastimil Babka
  0 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2016-06-15 14:23 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, LKML

On 06/09/2016 08:04 PM, Mel Gorman wrote:
> Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
> thinking of reclaim in terms of nodes but kswapd is still zone-centric.
> This patch gets rid of many of the node-based versus zone-based decisions.
>
> o A node is considered balanced when any eligible lower zone is balanced.
>   This eliminates one class of age-inversion problem because we avoid
>   reclaiming a newer page just because it's in the wrong zone
> o pgdat_balanced disappears because we now only care about one zone being
>   balanced.
> o Some anomalies related to writeback and congestion tracking being based
>   on zones disappear.
> o kswapd no longer has to take care to reclaim zones in the reverse order
>   that the page allocator uses.
> o Most importantly of all, reclaim from node 0 with multiple zones will
>   have similar aging and reclaiming characteristics as every other node.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
* [PATCH 00/27] Move LRU page reclaim from zones to nodes v5
@ 2016-04-15  9:13 Mel Gorman
  2016-04-15  9:13 ` [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2016-04-15  9:13 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner,
	Jesper Dangaard Brouer, LKML, Mel Gorman

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node. It is based upon 4.6-rc3 plus the page allocator optimisation
series. Conceptually, this is simple but there are a lot of details. Some
of the broad motivations for this are:

1. The residency of a page partially depends on what zone the page was
   allocated from. This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly differently to node 1.
   For example, direct reclaim scans in zonelist order and reclaims even
   if the zone is over the high watermark regardless of the age of pages
   in that LRU. Kswapd, on the other hand, starts reclaim on the highest
   unbalanced zone. A difference in the distribution of file/anon pages,
   due to when they were allocated, can result in a difference in aging.
   While the fair zone allocation policy mitigates some of the problems
   here, the page reclaim results on a multi-zone node will always differ
   from a single-zone node, so a workload's reclaim behaviour partly
   depends on which node it was scheduled on.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other but it's sensitive to timing. In the
   ideal case this stops the page allocator reusing pages that kswapd
   freed very recently. When kswapd is reclaiming from lower zones it
   works well, but during rebalancing of the highest zone the page
   allocator and kswapd interfere with each other. It's worse if the
   highest zone is small and difficult to balance.

4. slab shrinkers are node-based, which makes it harder to identify the
   exact relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have large
highmem zones in common configurations and it was necessary to quickly
find ZONE_NORMAL pages for reclaim. Today, this is much less of a concern
as machines with lots of memory will (or should) use 64-bit kernels.
Combinations of 32-bit hardware and 64-bit hardware are rare. Machines
that do use highmem should have lower highmem:lowmem ratios than we
worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.

It was tested on a UMA (16 cores single socket) and a NUMA machine (48
cores, 2 sockets). In most cases, only the UMA results are presented as
the NUMA machine takes an excessive amount of time to complete tests.

There may be an obvious difference in the number of allocations from each
zone as the fair zone allocation policy is removed towards the end of the
series. In cases where the working set exceeds memory, the differences
will be small, but on small workloads it'll be very obvious. For example,
these are the allocation stats on a workload that is doing small amounts
of dd:
                          4.6.0-rc1    4.6.0-rc1
                            vanilla   nodelru-v3
DMA allocs                        0            0
DMA32 allocs                1961196            0
Normal allocs               3355799      5247180
Movable allocs                    0            0

The key reason why this is not a problem is that kswapd will sleep if any
applicable zone for a classzone is balanced. If it tried to balance all
zones then there would be excessive reclaim.

bonnie
------
This was configured to do an IO test with a working set 2*RAM using the
ext4 filesystem. For both machines, there was no significant performance
difference between them but this is the result for the UMA machine.

bonnie
                                   4.6.0-rc1            4.6.0-rc1
                                     vanilla        nodelru-v3r10
Hmean    SeqOut Char      53306.32 (  0.00%)   79027.86 ( 48.25%)
Hmean    SeqOut Block     87796.15 (  0.00%)   87881.69 (  0.10%)
Hmean    SeqOut Rewrite   35996.31 (  0.00%)   36355.59 (  1.00%)
Hmean    SeqIn  Char      38789.17 (  0.00%)   76356.20 ( 96.85%)
Hmean    SeqIn  Block    105315.39 (  0.00%)  105514.07 (  0.19%)
Hmean    Random seeks       329.80 (  0.00%)     334.36 (  1.38%)
Hmean    SeqCreate ops        4.62 (  0.00%)       4.62 (  0.00%)
Hmean    SeqCreate read       4.62 (  0.00%)       4.62 (  0.00%)
Hmean    SeqCreate del      599.29 (  0.00%)    1580.23 (163.68%)
Hmean    RandCreate ops       5.00 (  0.00%)       5.00 (  0.00%)
Hmean    RandCreate read      5.00 (  0.00%)       4.62 ( -7.69%)
Hmean    RandCreate del     629.51 (  0.00%)    1634.55 (159.66%)

          4.6.0-rc1      4.6.0-rc1
            vanilla  nodelru-v3r10
User        2049.02        1078.82
System       294.25         181.00
Elapsed     6960.58        6021.58

Note that the massive gains shown here are possibly an anomaly. It has
been noted that in some cases, bonnie gets an artificial boost due to
dumb reclaim luck. There is no guarantee this result would be
reproducible on the same machine, let alone any other machine. That said,
the overall VM stats are interesting:

                           4.5.0-rc3      4.5.0-rc3
                      mmotm-20160209     nodelru-v2
Swap Ins                          14              0
Swap Outs                        873              0
DMA allocs                         0              0
DMA32 allocs                38259888       36320496
Normal allocs               64762073       66488556
Movable allocs                     0              0
Allocation stalls               3584              0
Direct pages scanned          736769              0
Kswapd pages scanned        77818637       78836064
Kswapd pages reclaimed      77782378       78812260
Direct pages reclaimed        736548              0
Kswapd efficiency                99%            99%
Kswapd velocity            11179.907      13092.256
Direct efficiency                99%           100%
Direct velocity              105.849          0.000

The series does not swap the workload and it never stalls on direct
reclaim. There is a slight increase in kswapd scans but it's offset by
the elimination of direct scans and the overall scanning velocity is not
noticeably higher. While it's not reported here, the overall IO stats and
CPU usage over time are very similar. kswapd CPU usage is slightly
elevated (roughly 0.5% usage to 1.2% usage over time) but that is
acceptable given the lack of direct reclaim.

tiobench
--------
tiobench is a flawed benchmark but it's very important in this case.
tiobench benefited from a bug prior to the fair zone allocation policy
that allowed old pages to be artificially preserved. The visible impact
was that performance exceeded the physical capabilities of the disk.
With this patch applied the results are:

tiobench Throughput
                                    4.6.0-rc1           4.6.0-rc1
                                      vanilla          nodelru-v3
Hmean    PotentialReadSpeed   85.84 (  0.00%)     86.20 (  0.42%)
Hmean    SeqRead-MB/sec-1     84.48 (  0.00%)     84.60 (  0.14%)
Hmean    SeqRead-MB/sec-2     75.69 (  0.00%)     75.44 ( -0.34%)
Hmean    SeqRead-MB/sec-4     77.35 (  0.00%)     77.62 (  0.35%)
Hmean    SeqRead-MB/sec-8     68.29 (  0.00%)     68.58 (  0.43%)
Hmean    SeqRead-MB/sec-16    62.82 (  0.00%)     62.72 ( -0.15%)
Hmean    RandRead-MB/sec-1     0.93 (  0.00%)      0.88 ( -4.69%)
Hmean    RandRead-MB/sec-2     1.11 (  0.00%)      1.08 ( -3.20%)
Hmean    RandRead-MB/sec-4     1.52 (  0.00%)      1.48 ( -2.86%)
Hmean    RandRead-MB/sec-8     1.70 (  0.00%)      1.70 ( -0.26%)
Hmean    RandRead-MB/sec-16    1.96 (  0.00%)      1.91 ( -2.49%)
Hmean    SeqWrite-MB/sec-1    83.01 (  0.00%)     83.07 (  0.07%)
Hmean    SeqWrite-MB/sec-2    77.80 (  0.00%)     78.20 (  0.52%)
Hmean    SeqWrite-MB/sec-4    81.68 (  0.00%)     81.72 (  0.05%)
Hmean    SeqWrite-MB/sec-8    78.17 (  0.00%)     78.41 (  0.31%)
Hmean    SeqWrite-MB/sec-16   80.08 (  0.00%)     80.08 (  0.01%)
Hmean    RandWrite-MB/sec-1    1.17 (  0.00%)      1.17 ( -0.03%)
Hmean    RandWrite-MB/sec-2    1.02 (  0.00%)      1.06 (  4.21%)
Hmean    RandWrite-MB/sec-4    1.02 (  0.00%)      1.04 (  2.32%)
Hmean    RandWrite-MB/sec-8    0.95 (  0.00%)      0.97 (  1.75%)
Hmean    RandWrite-MB/sec-16   0.95 (  0.00%)      0.96 (  0.97%)

Note that the performance is almost identical, allowing us to conclude
that the correct reclaim behaviour granted by the fair zone allocation
policy is preserved.

stutter
-------
stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
                               4.6.0-rc1              4.6.0-rc1
                                 vanilla             nodelru-v3
Min         mmap     13.4442 (  0.00%)     13.6705 ( -1.68%)
1st-qrtle   mmap     38.0442 (  0.00%)     37.7842 (  0.68%)
2nd-qrtle   mmap     78.5109 (  0.00%)     40.3648 ( 48.59%)
3rd-qrtle   mmap     86.7806 (  0.00%)     46.2499 ( 46.70%)
Max-90%     mmap     89.7028 (  0.00%)     86.5790 (  3.48%)
Max-93%     mmap     90.6776 (  0.00%)     89.5367 (  1.26%)
Max-95%     mmap     91.1678 (  0.00%)     90.3138 (  0.94%)
Max-99%     mmap     92.0036 (  0.00%)     93.2003 ( -1.30%)
Max         mmap    167.0073 (  0.00%)     94.5935 ( 43.36%)
Mean        mmap     68.7672 (  0.00%)     48.9853 ( 28.77%)
Best99%Mean mmap     68.5246 (  0.00%)     48.5354 ( 29.17%)
Best95%Mean mmap     67.5540 (  0.00%)     46.7102 ( 30.86%)
Best90%Mean mmap     66.2798 (  0.00%)     44.3547 ( 33.08%)
Best50%Mean mmap     50.7730 (  0.00%)     37.1298 ( 26.87%)
Best10%Mean mmap     35.8311 (  0.00%)     33.6910 (  5.97%)
Best5%Mean  mmap     34.0159 (  0.00%)     31.4259 (  7.61%)
Best1%Mean  mmap     22.1306 (  0.00%)     24.8851 (-12.45%)

          4.6.0-rc1      4.6.0-rc1
            vanilla  nodelru-v3r10
User           1.51           0.97
System       138.03         122.58
Elapsed     2420.90        2394.80

The VM stats in this case were not that interesting and are very roughly
comparable.
Page allocator intensive workloads showed few differences as the cost of
the fair zone allocation policy does not dominate from a userspace
perspective, but a microbench of just the allocator shows a difference:

                          4.6.0-rc1            4.6.0-rc1
                            vanilla           nodelru-v3
Min  total-odr0-1      725.00 (  0.00%)   697.00 (  3.86%)
Min  total-odr0-2      559.00 (  0.00%)   527.00 (  5.72%)
Min  total-odr0-4      459.00 (  0.00%)   436.00 (  5.01%)
Min  total-odr0-8      403.00 (  0.00%)   391.00 (  2.98%)
Min  total-odr0-16     329.00 (  0.00%)   366.00 (-11.25%)
Min  total-odr0-32     365.00 (  0.00%)   355.00 (  2.74%)
Min  total-odr0-64     297.00 (  0.00%)   348.00 (-17.17%)
Min  total-odr0-128    752.00 (  0.00%)   344.00 ( 54.26%)
Min  total-odr0-256    385.00 (  0.00%)   379.00 (  1.56%)
Min  total-odr0-512    899.00 (  0.00%)   414.00 ( 53.95%)
Min  total-odr0-1024   763.00 (  0.00%)   530.00 ( 30.54%)
Min  total-odr0-2048   982.00 (  0.00%)   469.00 ( 52.24%)
Min  total-odr0-4096   928.00 (  0.00%)   526.00 ( 43.32%)
Min  total-odr0-8192  1007.00 (  0.00%)   768.00 ( 23.73%)
Min  total-odr0-16384  375.00 (  0.00%)   366.00 (  2.40%)

This series is not without its hazards. There are at least three areas
that I'm concerned with, even though I could not reproduce any problems
in those areas.

1. Reclaim/compaction is going to be affected because the amount of
   reclaim is no longer targeted at a specific zone. Compaction works on
   a per-zone basis so there is no guarantee that reclaiming a few THPs'
   worth of pages will have a positive impact on compaction success
   rates.

2. The slab/LRU reclaim ratio is affected because the frequency with
   which the shrinkers are called is now different. This may or may not
   be a problem, but if it is, it'll be because shrinkers are not called
   enough and some balancing is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied
   are distributed between zones and the fair zone allocation policy used
   to do something very similar for anon. The distribution is now
   different, but not necessarily in any way that matters; it's still
   worth bearing in mind.

 Documentation/cgroup-v1/memcg_test.txt    |   4 +-
 Documentation/cgroup-v1/memory.txt        |   4 +-
 arch/s390/appldata/appldata_mem.c         |   2 +-
 arch/tile/mm/pgtable.c                    |  18 +-
 drivers/base/node.c                       |  73 +--
 drivers/staging/android/lowmemorykiller.c |  12 +-
 fs/fs-writeback.c                         |   4 +-
 fs/fuse/file.c                            |   8 +-
 fs/nfs/internal.h                         |   2 +-
 fs/nfs/write.c                            |   2 +-
 fs/proc/meminfo.c                         |  14 +-
 include/linux/backing-dev.h               |   2 +-
 include/linux/memcontrol.h                |  30 +-
 include/linux/mm_inline.h                 |   4 +-
 include/linux/mm_types.h                  |   2 +-
 include/linux/mmzone.h                    | 156 +++---
 include/linux/swap.h                      |  13 +-
 include/linux/topology.h                  |   2 +-
 include/linux/vm_event_item.h             |  14 +-
 include/linux/vmstat.h                    | 111 +++-
 include/linux/writeback.h                 |   2 +-
 include/trace/events/vmscan.h             |  40 +-
 include/trace/events/writeback.h          |  10 +-
 kernel/power/snapshot.c                   |  10 +-
 kernel/sysctl.c                           |   4 +-
 mm/backing-dev.c                          |  14 +-
 mm/compaction.c                           |  24 +-
 mm/filemap.c                              |  14 +-
 mm/huge_memory.c                          |  14 +-
 mm/internal.h                             |  11 +-
 mm/memcontrol.c                           | 235 ++++-----
 mm/memory-failure.c                       |   4 +-
 mm/memory_hotplug.c                       |   7 +-
 mm/mempolicy.c                            |   2 +-
 mm/migrate.c                              |  35 +-
 mm/mlock.c                                |  12 +-
 mm/page-writeback.c                       | 119 ++---
 mm/page_alloc.c                           | 289 +++++-----
 mm/page_idle.c                            |   4 +-
 mm/rmap.c                                 |  15 +-
 mm/shmem.c                                |  12 +-
 mm/swap.c                                 |  66 +--
 mm/swap_state.c                           |   4 +-
 mm/util.c                                 |   4 +-
 mm/vmscan.c                               | 847 ++++++++++++++----------
 mm/vmstat.c                               | 369 ++++++++++---
 mm/workingset.c                           |  53 +-
 47 files changed, 1476 insertions(+), 1221 deletions(-)

-- 
2.6.4
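For reference, the derived rows in the VM stats tables above (efficiency
and velocity) follow directly from the raw counters: efficiency is pages
reclaimed as a percentage of pages scanned, and velocity is pages scanned
per second of elapsed time. A small stand-alone sketch of that arithmetic,
using figures taken from the bonnie tables above; the struct is
illustrative only, not a kernel interface:

#include <stdio.h>

/* Raw counters as reported in the tables; field names mirror the rows. */
struct reclaim_stats {
	unsigned long kswapd_scanned;
	unsigned long kswapd_reclaimed;
	unsigned long direct_scanned;
	unsigned long direct_reclaimed;
	double elapsed_secs;
};

static void report(const struct reclaim_stats *s)
{
	/* Efficiency: pages reclaimed per page scanned, as a percentage. */
	double kswapd_eff = s->kswapd_scanned ?
		100.0 * s->kswapd_reclaimed / s->kswapd_scanned : 100.0;
	/* Velocity: pages scanned per second of elapsed (wall clock) time. */
	double kswapd_vel = s->kswapd_scanned / s->elapsed_secs;
	double direct_vel = s->direct_scanned / s->elapsed_secs;

	printf("Kswapd efficiency %.0f%%\n", kswapd_eff);
	printf("Kswapd velocity   %.3f\n", kswapd_vel);
	printf("Direct velocity   %.3f\n", direct_vel);
}

int main(void)
{
	/* Figures from the nodelru-v2 column of the bonnie VM stats above. */
	struct reclaim_stats s = {
		.kswapd_scanned = 78836064,
		.kswapd_reclaimed = 78812260,
		.direct_scanned = 0,
		.direct_reclaimed = 0,
		.elapsed_secs = 6021.58,
	};

	report(&s);	/* prints 99%, 13092.256 and 0.000 as in the table */
	return 0;
}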
* [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes
  2016-04-15  9:13 [PATCH 00/27] Move LRU page reclaim from zones to nodes v5 Mel Gorman
@ 2016-04-15  9:13 ` Mel Gorman
  2016-04-28  8:36   ` Vlastimil Babka
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2016-04-15  9:13 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner,
	Jesper Dangaard Brouer, LKML, Mel Gorman

Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
thinking of reclaim in terms of nodes but kswapd is still zone-centric.
This patch gets rid of many of the node-based versus zone-based decisions.

o A node is considered balanced when any eligible lower zone is balanced.
  This eliminates one class of age-inversion problem because we avoid
  reclaiming a newer page just because it's in the wrong zone
o pgdat_balanced disappears because we now only care about one zone being
  balanced.
o Some anomalies related to writeback and congestion tracking being based
  on zones disappear.
o kswapd no longer has to take care to reclaim zones in the reverse order
  that the page allocator uses.
o Most importantly of all, reclaim from node 0 with multiple zones will
  have similar aging and reclaiming characteristics as every other node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 292 +++++++++++++++++++++---------------------------------------
 1 file changed, 101 insertions(+), 191 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f2534e8f8527..c23d8f9722ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2979,7 +2979,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 }
 #endif
 
-static void age_active_anon(struct zone *zone, struct scan_control *sc)
+static void age_active_anon(struct pglist_data *pgdat,
+				struct zone *zone, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 
@@ -2998,85 +2999,15 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order, bool highorder,
+static bool zone_balanced(struct zone *zone, int order,
 			unsigned long balance_gap, int classzone_idx)
 {
 	unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-	/*
-	 * When checking from pgdat_balanced(), kswapd should stop and sleep
-	 * when it reaches the high order-0 watermark and let kcompactd take
-	 * over. Other callers such as wakeup_kswapd() want to determine the
-	 * true high-order watermark.
-	 */
-	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
-		mark += (1UL << order);
-		order = 0;
-	}
-
 	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
 
 /*
- * pgdat_balanced() is used when checking if a node is balanced.
- *
- * For order-0, all zones must be balanced!
- *
- * For high-order allocations only zones that meet watermarks and are in a
- * zone allowed by the callers classzone_idx are added to balanced_pages. The
- * total of balanced pages must be at least 25% of the zones allowed by
- * classzone_idx for the node to be considered balanced. Forcing all zones to
- * be balanced for high orders can cause excessive reclaim when there are
- * imbalanced zones.
- * The choice of 25% is due to
- *   o a 16M DMA zone that is balanced will not balance a zone on any
- *     reasonable sized machine
- *   o On all other machines, the top zone must be at least a reasonable
- *     percentage of the middle zones. For example, on 32-bit x86, highmem
- *     would need to be at least 256M for it to be balance a whole node.
- *     Similarly, on x86-64 the Normal zone would need to be at least 1G
- *     to balance a node on its own. These seemed like reasonable ratios.
- */
-static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
-{
-	unsigned long managed_pages = 0;
-	unsigned long balanced_pages = 0;
-	int i;
-
-	/* Check the watermark levels */
-	for (i = 0; i <= classzone_idx; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		if (!populated_zone(zone))
-			continue;
-
-		managed_pages += zone->managed_pages;
-
-		/*
-		 * A special case here:
-		 *
-		 * balance_pgdat() skips over all_unreclaimable after
-		 * DEF_PRIORITY. Effectively, it considers them balanced so
-		 * they must be considered balanced here as well!
-		 */
-		if (!pgdat_reclaimable(zone->zone_pgdat)) {
-			balanced_pages += zone->managed_pages;
-			continue;
-		}
-
-		if (zone_balanced(zone, order, false, 0, i))
-			balanced_pages += zone->managed_pages;
-		else if (!order)
-			return false;
-	}
-
-	if (order)
-		return balanced_pages >= (managed_pages >> 2);
-	else
-		return true;
-}
-
-/*
  * Prepare kswapd for sleeping. This verifies that there are no processes
  * waiting in throttle_direct_reclaim() and that watermarks have been met.
  *
@@ -3085,6 +3016,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
 {
+	int i;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return false;
@@ -3105,101 +3038,90 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
-	return pgdat_balanced(pgdat, order, classzone_idx);
+	for (i = 0; i <= classzone_idx; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_balanced(zone, order, 0, classzone_idx))
+			return true;
+	}
+
+	return false;
 }
 
 /*
- * kswapd shrinks the zone by the number of pages required to reach
- * the high watermark.
+ * kswapd shrinks a node of pages that are at or below the highest usable
+ * zone that is currently unbalanced.
  *
  * Returns true if kswapd scanned at least the requested number of pages to
  * reclaim or if the lack of progress was due to pages under writeback.
  * This is used to determine if the scanning priority needs to be raised.
  */
-static bool kswapd_shrink_zone(struct zone *zone,
+static bool kswapd_shrink_node(pg_data_t *pgdat,
 			       int classzone_idx,
 			       struct scan_control *sc)
 {
-	unsigned long balance_gap;
-	bool lowmem_pressure;
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct zone *zone;
+	unsigned long nr_to_reclaim = 0;
+	int z;
 
-	/* Reclaim above the high watermark. */
-	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+	/* Reclaim a number of pages proportional to the number of zones */
+	for (z = 0; z <= classzone_idx; z++) {
+		zone = pgdat->node_zones + z;
+		if (!populated_zone(zone))
+			continue;
 
-	/*
-	 * We put equal pressure on every zone, unless one zone has way too
-	 * many pages free already. The "too many pages" is defined as the
-	 * high wmark plus a "gap" where the gap is either the low
-	 * watermark or 1% of the zone, whichever is smaller.
-	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
+		nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
+	}
 
 	/*
-	 * If there is no low memory pressure or the zone is balanced then no
-	 * reclaim is necessary
+	 * Historically care was taken to put equal pressure on all zones but
+	 * now pressure is applied based on node LRU order.
 	 */
-	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, sc->order, false,
-						balance_gap, classzone_idx))
-		return true;
-
-	shrink_node(zone->zone_pgdat, sc, classzone_idx);
-
-	/* TODO: ANOMALY */
-	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
+	shrink_node(pgdat, sc, classzone_idx);
 
 	/*
-	 * If a zone reaches its high watermark, consider it to be no longer
-	 * congested. It's possible there are dirty pages backed by congested
-	 * BDIs but as pressure is relieved, speculatively avoid congestion
-	 * waits.
+	 * Fragmentation may mean that the system cannot be rebalanced for
+	 * high-order allocations. If twice the allocation size has been
+	 * reclaimed then recheck watermarks only at order-0 to prevent
+	 * excessive reclaim. Assume that a process requested a high-order
+	 * can direct reclaim/compact.
 	 */
-	if (pgdat_reclaimable(zone->zone_pgdat) &&
-	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
-		clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-		clear_bit(PGDAT_DIRTY, &pgdat->flags);
-	}
+	if (sc->order && sc->nr_reclaimed >= 2UL << sc->order)
+		sc->order = 0;
 
 	return sc->nr_scanned >= sc->nr_to_reclaim;
 }
 
 /*
- * For kswapd, balance_pgdat() will work across all this node's zones until
- * they are all at high_wmark_pages(zone).
- *
- * Returns the highest zone idx kswapd was reclaiming at
+ * For kswapd, balance_pgdat() will reclaim pages across a node from zones
+ * that are eligible for use by the caller until at least one zone is
+ * balanced.
  *
- * There is special handling here for zones which are full of pinned pages.
- * This can happen if the pages are all mlocked, or if they are all used by
- * device drivers (say, ZONE_DMA). Or if they are all in use by hugetlb.
- * What we do is to detect the case where all pages in the zone have been
- * scanned twice and there has been zero successful reclaim. Mark the zone as
- * dead and from now on, only perform a short scan. Basically we're polling
- * the zone for when the problem goes away.
+ * Returns the order kswapd finished reclaiming at.
 *
 * kswapd scans the zones in the highmem->normal->dma direction. It skips
 * zones which have free_pages > high_wmark_pages(zone), but once a zone is
- * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the
- * lower zones regardless of the number of free pages in the lower zones. This
- * interoperates with the page allocator fallback scheme to ensure that aging
- * of pages is balanced across the zones.
+ * found to have free_pages <= high_wmark_pages(zone), any page is that zone
+ * or lower is eligible for reclaim until at least one usable zone is
+ * balanced.
 */
 static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
-	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
-		.reclaim_idx = MAX_NR_ZONES - 1,
 		.order = order,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.reclaim_idx = classzone_idx,
 	};
 	count_vm_event(PAGEOUTRUN);
 
@@ -3210,21 +3132,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		/* Scan from the highest requested zone to dma */
 		for (i = classzone_idx; i >= 0; i--) {
-			struct zone *zone = pgdat->node_zones + i;
-
+			zone = pgdat->node_zones + i;
 			if (!populated_zone(zone))
 				continue;
 
-			if (sc.priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;
-
-			/*
-			 * Do some background aging of the anon list, to give
-			 * pages a chance to be referenced before reclaiming.
-			 */
-			age_active_anon(zone, &sc);
-
 			/*
 			 * If the number of buffer_heads in the machine
 			 * exceeds the maximum allowed level and this node
@@ -3232,19 +3143,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			 * it to relieve lowmem pressure.
 			 */
 			if (buffer_heads_over_limit && is_highmem_idx(i)) {
-				end_zone = i;
+				classzone_idx = i;
 				break;
 			}
 
-			if (!zone_balanced(zone, order, false, 0, 0)) {
-				end_zone = i;
+			if (!zone_balanced(zone, order, 0, 0)) {
+				classzone_idx = i;
 				break;
 			} else {
 				/*
-				 * If balanced, clear the dirty and congested
-				 * flags
-				 *
-				 * TODO: ANOMALY
+				 * If any eligible zone is balanced then the
+				 * node is not considered congested or dirty.
 				 */
 				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
 				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
@@ -3255,51 +3164,34 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			goto out;
 
 		/*
+		 * Do some background aging of the anon list, to give
+		 * pages a chance to be referenced before reclaiming. All
+		 * pages are rotated regardless of classzone as this is
+		 * about consistent aging.
+		 */
+		age_active_anon(pgdat, &pgdat->node_zones[MAX_NR_ZONES - 1], &sc);
+
+		/*
 		 * If we're getting trouble reclaiming, start doing writepage
 		 * even in laptop mode.
 		 */
-		if (sc.priority < DEF_PRIORITY - 2)
+		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
 			sc.may_writepage = 1;
 
+		/* Call soft limit reclaim before calling shrink_node. */
+		sc.nr_scanned = 0;
+		nr_soft_scanned = 0;
+		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, sc.order,
+						sc.gfp_mask, &nr_soft_scanned);
+		sc.nr_reclaimed += nr_soft_reclaimed;
+
 		/*
-		 * Continue scanning in the highmem->dma direction stopping at
-		 * the last zone which needs scanning. This may reclaim lowmem
-		 * pages that are not necessary for zone balancing but it
-		 * preserves LRU ordering. It is assumed that the bulk of
-		 * allocation requests can use arbitrary zones with the
-		 * possible exception of big highmem:lowmem configurations.
+		 * There should be no need to raise the scanning priority if
+		 * enough pages are already being scanned that that high
+		 * watermark would be met at 100% efficiency.
 		 */
-		for (i = end_zone; i >= end_zone; i--) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			if (!populated_zone(zone))
-				continue;
-
-			if (sc.priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;
-
-			sc.nr_scanned = 0;
-			sc.reclaim_idx = i;
-
-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-
-			/*
-			 * There should be no need to raise the scanning
-			 * priority if enough pages are already being scanned
-			 * that that high watermark would be met at 100%
-			 * efficiency.
-			 */
-			if (kswapd_shrink_zone(zone, end_zone, &sc))
-				raise_priority = false;
-		}
+		if (kswapd_shrink_node(pgdat, classzone_idx, &sc))
+			raise_priority = false;
 
 		/*
 		 * If the low watermark is met there is no need for processes
@@ -3315,20 +3207,37 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			break;
 
 		/*
+		 * Stop reclaiming if any eligible zone is balanced and clear
+		 * node writeback or congested.
+		 */
+		for (i = 0; i <= classzone_idx; i++) {
+			zone = pgdat->node_zones + i;
+			if (!populated_zone(zone))
+				continue;
+
+			if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
+				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+				clear_bit(PGDAT_DIRTY, &pgdat->flags);
+				goto out;
+			}
+		}
+
+		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
 		if (raise_priority || !sc.nr_reclaimed)
 			sc.priority--;
-	} while (sc.priority >= 1 &&
-			!pgdat_balanced(pgdat, order, classzone_idx));
+	} while (sc.priority >= 1);
 
 out:
 	/*
-	 * Return the highest zone idx we were reclaiming at so
-	 * prepare_kswapd_sleep() makes the same decisions as here.
+	 * Return the order kswapd stopped reclaiming at as
+	 * prepare_kswapd_sleep() takes it into account. If another caller
+	 * entered the allocator slow path while kswapd was awake, order will
+	 * remain at the higher level.
	 */
-	return end_zone;
+	return sc.order;
 }
 
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
@@ -3485,8 +3394,9 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balanced_classzone_idx = balance_pgdat(pgdat, order,
-								classzone_idx);
+
+			/* return value ignored until next patch */
+			balance_pgdat(pgdat, order, classzone_idx);
 		}
 	}
 
@@ -3516,7 +3426,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, true, 0, 0))
+	if (zone_balanced(zone, order, 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.6.4
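A second behavioural detail worth calling out in kswapd_shrink_node()
above is the reclaim target: it becomes the sum of the high watermarks of
all eligible zones, with a per-zone floor of SWAP_CLUSTER_MAX, rather
than a single zone's watermark. A simplified stand-alone sketch of that
accounting follows; the zone fields and constants are pared-down
stand-ins, not kernel definitions:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL
#define MAX_ZONES 4

/* Pared-down stand-in for struct zone. */
struct zone {
	int populated;
	unsigned long high_wmark;
};

/*
 * Node-wide reclaim target: sum of per-zone high watermarks across all
 * eligible zones, flooring each zone's contribution at SWAP_CLUSTER_MAX,
 * simplified from kswapd_shrink_node() in the patch above.
 */
static unsigned long node_reclaim_target(const struct zone zones[MAX_ZONES],
					 int classzone_idx)
{
	unsigned long nr_to_reclaim = 0;
	int z;

	for (z = 0; z <= classzone_idx; z++) {
		if (!zones[z].populated)
			continue;
		nr_to_reclaim += zones[z].high_wmark > SWAP_CLUSTER_MAX ?
				 zones[z].high_wmark : SWAP_CLUSTER_MAX;
	}
	return nr_to_reclaim;
}

int main(void)
{
	struct zone zones[MAX_ZONES] = {
		{ 1, 128 },	/* e.g. a DMA32-sized zone */
		{ 1, 16 },	/* tiny zone: floored to SWAP_CLUSTER_MAX */
		{ 0, 0 },
		{ 0, 0 },
	};

	/* 128 + 32 = 160 pages for a classzone covering the first two zones */
	printf("%lu\n", node_reclaim_target(zones, 1));
	return 0;
}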
* Re: [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes
  2016-04-15  9:13 ` [PATCH 06/27] mm, vmscan: Make kswapd reclaim in terms of nodes Mel Gorman
@ 2016-04-28  8:36   ` Vlastimil Babka
  0 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2016-04-28  8:36 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Jesper Dangaard Brouer, LKML

On 04/15/2016 11:13 AM, Mel Gorman wrote:
> 	/*
> -	 * If a zone reaches its high watermark, consider it to be no longer
> -	 * congested. It's possible there are dirty pages backed by congested
> -	 * BDIs but as pressure is relieved, speculatively avoid congestion
> -	 * waits.
> +	 * Fragmentation may mean that the system cannot be rebalanced for
> +	 * high-order allocations. If twice the allocation size has been
> +	 * reclaimed then recheck watermarks only at order-0 to prevent
> +	 * excessive reclaim. Assume that a process requested a high-order
> +	 * can direct reclaim/compact.

Also kcompactd is woken up in this case...

> 	 */
> -	if (pgdat_reclaimable(zone->zone_pgdat) &&
> -	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
> -		clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> -		clear_bit(PGDAT_DIRTY, &pgdat->flags);
> -	}
> +	if (sc->order && sc->nr_reclaimed >= 2UL << sc->order)
> +		sc->order = 0;
>
> 	return sc->nr_scanned >= sc->nr_to_reclaim;

This looks indeed simpler than my earlier zone_balanced() modification
you removed. However, I think there's still a potential for overreclaim
due to a stream of kswapd wakeups where each will have to reclaim
2UL << sc->order pages, regardless of watermarks. Could be some
high-order wakeups from GFP_ATOMIC context that have order-0 fallbacks
but will cause kswapd to keep reclaiming when kcompactd can't keep up
due to fragmentation...
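The fragment under discussion is small enough to read in isolation: once
kswapd has reclaimed twice the requested allocation size, sc->order is
dropped to 0 so that further watermark checks do not force more
high-order reclaim. A stand-alone sketch of the check follows; sc_order
and nr_reclaimed stand in for the scan_control fields, and the kcompactd
comment reflects the review note above rather than anything in this
sketch:

#include <stdio.h>

/*
 * Order-dropping heuristic from kswapd_shrink_node(). Once twice the
 * requested allocation size (2 << order pages) has been reclaimed,
 * watermark rechecks fall back to order-0; a high-order requester is
 * assumed to be able to direct reclaim/compact, and kcompactd handles
 * the high-order part.
 */
static int maybe_drop_order(int sc_order, unsigned long nr_reclaimed)
{
	if (sc_order && nr_reclaimed >= 2UL << sc_order)
		return 0;
	return sc_order;
}

int main(void)
{
	/* An order-3 request: the threshold is 2UL << 3 = 16 pages. */
	printf("%d\n", maybe_drop_order(3, 15));	/* still 3 */
	printf("%d\n", maybe_drop_order(3, 16));	/* dropped to 0 */
	return 0;
}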