From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Morton
Subject: [patch 113/131] mm: balance LRU lists based on relative thrashing
Date: Wed, 03 Jun 2020 16:03:03 -0700
Message-ID: <20200603230303.kSkT62Lb5%akpm@linux-foundation.org>
References: <20200603155549.e041363450869eaae4c7f05b@linux-foundation.org>
In-Reply-To: <20200603155549.e041363450869eaae4c7f05b@linux-foundation.org>
Reply-To: linux-kernel@vger.kernel.org
Sender: mm-commits-owner@vger.kernel.org
List-Id: mm-commits@vger.kernel.org
To: akpm@linux-foundation.org, hannes@cmpxchg.org, iamjoonsoo.kim@lge.com,
 linux-mm@kvack.org, mhocko@suse.com, minchan@kernel.org,
 mm-commits@vger.kernel.org, riel@surriel.com,
 torvalds@linux-foundation.org

From: Johannes Weiner
Subject: mm: balance LRU lists based on relative thrashing

Since the LRUs were split into anon and file lists, the VM has been
balancing between page cache and anonymous pages based on per-list ratios
of scanned vs. rotated pages. In most cases that tips page reclaim towards
the list that is easier to reclaim and has the fewest actively used pages,
but there are a few problems with it:

1. Refaults and LRU rotations are weighted the same way, even though
   one costs IO and the other costs a bit of CPU.

2. The less we scan an LRU list based on already observed rotations,
   the more we increase the sampling interval for new references, and
   rotations become even more likely on that list. This can enter a
   death spiral in which we stop looking at one list completely until
   the other one is all but annihilated by page reclaim.

Since commit a528910e12ec ("mm: thrash detection-based file cache sizing")
we have refault detection for the page cache. Along with swapin events,
they are good indicators of when the file or anon list, respectively, is
too small for its workingset and needs to grow.

For example, if the page cache is thrashing, the cache pages need more
time in memory, while there may be colder pages on the anonymous list.
Likewise, if swapped pages are faulting back in, it indicates that we
reclaim anonymous pages too aggressively and should back off.

Replace LRU rotations with refaults and swapins as the basis for relative
reclaim cost of the two LRUs. This will have the VM target list balances
that incur the least amount of IO on aggregate.

Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Signed-off-by: Andrew Morton
---

 include/linux/swap.h |    3 +--
 mm/swap.c            |   11 +++++++----
 mm/swap_state.c      |    5 +++++
 mm/vmscan.c          |   39 ++++++++++-----------------------------
 mm/workingset.c      |    4 ++++
 5 files changed, 27 insertions(+), 35 deletions(-)

--- a/include/linux/swap.h~mm-balance-lru-lists-based-on-relative-thrashing
+++ a/include/linux/swap.h
@@ -334,8 +334,7 @@ extern unsigned long nr_free_pagecache_p
 
 
 /* linux/mm/swap.c */
-extern void lru_note_cost(struct lruvec *lruvec, bool file,
-			  unsigned int nr_pages);
+extern void lru_note_cost(struct page *);
 extern void lru_cache_add(struct page *);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
--- a/mm/swap.c~mm-balance-lru-lists-based-on-relative-thrashing
+++ a/mm/swap.c
@@ -278,12 +278,15 @@ void rotate_reclaimable_page(struct page
 	}
 }
 
-void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
+void lru_note_cost(struct page *page)
 {
-	if (file)
-		lruvec->file_cost += nr_pages;
+	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+	/* Record new data point */
+	if (page_is_file_lru(page))
+		lruvec->file_cost++;
 	else
-		lruvec->anon_cost += nr_pages;
+		lruvec->anon_cost++;
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
--- a/mm/swap_state.c~mm-balance-lru-lists-based-on-relative-thrashing
+++ a/mm/swap_state.c
@@ -440,6 +440,11 @@ struct page *__read_swap_cache_async(swp
 		goto fail_unlock;
 	}
 
+	/* XXX: Move to lru_cache_add() when it supports new vs putback */
+	spin_lock_irq(&page_pgdat(page)->lru_lock);
+	lru_note_cost(page);
+	spin_unlock_irq(&page_pgdat(page)->lru_lock);
+
 	/* Caller will initiate read into locked page */
 	SetPageWorkingset(page);
 	lru_cache_add(page);
--- a/mm/vmscan.c~mm-balance-lru-lists-based-on-relative-thrashing
+++ a/mm/vmscan.c
@@ -1958,12 +1958,6 @@ shrink_inactive_list(unsigned long nr_to
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	/*
-	 * Rotating pages costs CPU without actually
-	 * progressing toward the reclaim goal.
-	 */
-	lru_note_cost(lruvec, 0, stat.nr_activate[0]);
-	lru_note_cost(lruvec, 1, stat.nr_activate[1]);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
@@ -2079,11 +2073,6 @@ static void shrink_active_list(unsigned
 	 * Move pages back to the lru list.
 	 */
 	spin_lock_irq(&pgdat->lru_lock);
-	/*
-	 * Rotating pages costs CPU without actually
-	 * progressing toward the reclaim goal.
-	 */
-	lru_note_cost(lruvec, file, nr_rotated);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2298,22 +2287,23 @@ static void get_scan_count(struct lruvec
 	scan_balance = SCAN_FRACT;
 
 	/*
-	 * With swappiness at 100, anonymous and file have the same priority.
-	 * This scanning priority is essentially the inverse of IO cost.
+	 * Calculate the pressure balance between anon and file pages.
+	 *
+	 * The amount of pressure we put on each LRU is inversely
+	 * proportional to the cost of reclaiming each list, as
+	 * determined by the share of pages that are refaulting, times
+	 * the relative IO cost of bringing back a swapped out
+	 * anonymous page vs reloading a filesystem page (swappiness).
+	 *
+	 * With swappiness at 100, anon and file have equal IO cost.
 	 */
 	anon_prio = swappiness;
 	file_prio = 200 - anon_prio;
 
 	/*
-	 * OK, so we have swap space and a fair amount of page cache
-	 * pages. We use the recently rotated / recently scanned
-	 * ratios to determine how valuable each cache is.
-	 *
 	 * Because workloads change over time (and to avoid overflow)
 	 * we keep these statistics as a floating average, which ends
-	 * up weighing recent references more than old ones.
-	 *
-	 * anon in [0], file in [1]
+	 * up weighing recent refaults more than old ones.
 	 */
 
 	anon = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, MAX_NR_ZONES) +
@@ -2328,15 +2318,6 @@ static void get_scan_count(struct lruvec
 		lruvec->file_cost /= 2;
 		totalcost /= 2;
 	}
-
-	/*
-	 * The amount of pressure on anon vs file pages is inversely
-	 * proportional to the assumed cost of reclaiming each list,
-	 * as determined by the share of pages that are likely going
-	 * to refault or rotate on each list (recently referenced),
-	 * times the relative IO cost of bringing back a swapped out
-	 * anonymous page vs reloading a filesystem page (swappiness).
-	 */
 	ap = anon_prio * (totalcost + 1);
 	ap /= lruvec->anon_cost + 1;
 
--- a/mm/workingset.c~mm-balance-lru-lists-based-on-relative-thrashing
+++ a/mm/workingset.c
@@ -365,6 +365,10 @@ void workingset_refault(struct page *pag
 	/* Page was active prior to eviction */
 	if (workingset) {
 		SetPageWorkingset(page);
+		/* XXX: Move to lru_cache_add() when it supports new vs putback */
+		spin_lock_irq(&page_pgdat(page)->lru_lock);
+		lru_note_cost(page);
+		spin_unlock_irq(&page_pgdat(page)->lru_lock);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 	}
 out:
_
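
For readers who want to see the arithmetic in isolation, here is a minimal
userspace sketch of the cost accounting described in the changelog. It is
not kernel code and makes several assumptions: struct lru_cost, note_cost(),
decay_cost() and scan_pressure() are hypothetical names; the decay threshold
is passed in by the caller because the hunk only shows the halving; and the
file-side calculation is assumed to mirror the "ap" lines in get_scan_count().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the anon_cost/file_cost counters that the
 * patch keeps in struct lruvec. Not a kernel type.
 */
struct lru_cost {
	uint64_t anon_cost;	/* swapins charged to the anon list */
	uint64_t file_cost;	/* refaults charged to the file list */
};

/* Same shape as the patched lru_note_cost(): one data point per page. */
static void note_cost(struct lru_cost *c, int file)
{
	if (file)
		c->file_cost++;
	else
		c->anon_cost++;
}

/*
 * Floating average: halve both counters once their sum exceeds @limit,
 * so that recent events outweigh old ones. The caller-supplied @limit
 * is an assumption of this sketch.
 */
static void decay_cost(struct lru_cost *c, uint64_t limit)
{
	if (c->anon_cost + c->file_cost > limit) {
		c->anon_cost /= 2;
		c->file_cost /= 2;
	}
}

/*
 * Pressure on each list is inversely proportional to its recorded cost,
 * weighted by swappiness. The anon half follows the "ap" lines in the
 * patch; the file half is assumed to be symmetric.
 */
static void scan_pressure(const struct lru_cost *c, unsigned int swappiness,
			  uint64_t *ap, uint64_t *fp)
{
	uint64_t totalcost = c->anon_cost + c->file_cost;

	/* file_prio = 200 - anon_prio must not underflow */
	assert(swappiness <= 200);

	*ap = (uint64_t)swappiness * (totalcost + 1) / (c->anon_cost + 1);
	*fp = (uint64_t)(200 - swappiness) * (totalcost + 1) / (c->file_cost + 1);
}
```

In this sketch, with swappiness at 100, nine recorded file refaults and no
swapins give ap = 1000 and fp = 100, so the anon list would see roughly ten
times the scan pressure of the thrashing file list. The +1 terms keep both
divisions defined when a counter is still zero.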