From: Huang Ying
To: linux-mm@kvack.org
Cc: Andrew Morton, linux-kernel@vger.kernel.org, Huang Ying, Yu Zhao, Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi
Subject: [RFC] mm: activate access-more-than-once page via NUMA balancing
Date: Wed, 24 Mar 2021 16:32:09 +0800
Message-Id: <20210324083209.527427-1-ying.huang@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
One idea behind the LRU page reclaiming algorithm is to put the
access-once pages in the inactive list and the access-more-than-once
pages in the active list.  This works for the file pages that are
accessed via syscalls (read()/write(), etc.), but not for the pages
accessed via the page tables: at present those can only be activated
by the page reclaim scanning.  This may cause some problems.  For
example, even if the inactive list contains only hot file pages that
are accessed via the page tables, we may incorrectly enable the cache
trim mode and scan only these hot file pages instead of the cold anon
pages.

This can be improved via NUMA balancing, where the page tables of all
processes are scanned gradually to trap the page accesses.  With that,
we can identify whether a page in the inactive list has been accessed
at least twice, and if so, activate it, so that only the access-once
pages are left in the inactive list.  This patch implements that.

It may sound like overkill to enable NUMA balancing only to activate
some pages.  But firstly, if you use NUMA balancing already, the added
overhead is negligible.  Secondly, this patch is only the first step
toward taking advantage of NUMA balancing to optimize page reclaiming.
We may improve the page reclaim further with the help of NUMA
balancing.  For example, we have implemented a way to measure page
hotness/coldness via NUMA balancing in

https://lore.kernel.org/linux-mm/20210311081821.138467-5-ying.huang@intel.com/

That may help to improve the LRU algorithm.  For example, instead of
migrating from PMEM to DRAM, the hot pages could be put at the head of
the active list (or a separate hot page list) to make it easier to
reclaim the cold pages at the tail of the LRU.

This patch is inspired by the work done by Yu Zhao in the
multigenerational LRU patchset,

https://lore.kernel.org/linux-mm/20210313075747.3781593-1-yuzhao@google.com/

It may be possible to combine some ideas from the multigenerational
LRU patchset with the NUMA balancing page table scanning to improve
the LRU page reclaiming algorithm.  Compared with the page table
scanning method used in the multigenerational LRU patchset, the page
tables can be scanned much more slowly via NUMA balancing, because
page faults instead of the Accessed bit are used to trap the page
accesses.  This reduces the peak overhead of scanning.

To show the effect of the patch, we designed a test as follows.

On a system with 128 GB DRAM and 2 NVMe disks as swap,

* Run workload A with about 60 GB of hot anon pages.

* After 100 seconds, run workload B with about 58 GB of cold anon
  pages (accessed once).

* After another 200 seconds, run workload C with about 57 GB of hot
  anon pages.

It's desirable that the 58 GB of cold pages of workload B are swapped
out to accommodate the 57 GB of memory of workload C.
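The actual test programs are not included here; the following is only
a minimal sketch of how such an anon-page workload can be
approximated, with the size and access pattern ("hot" = touched
repeatedly, "cold" = touched once) passed on the command line.  All
details below are illustrative assumptions, not the exact programs
used for the numbers that follow.

/*
 * Minimal sketch of an anon-page workload: map anonymous memory of
 * the given size, then either keep touching it ("hot") or touch it
 * once and idle ("cold").  Illustrative only; not the actual test
 * program used for the measurements in this cover letter.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	size_t gb, size, i, page_size = sysconf(_SC_PAGESIZE);
	char *buf;
	int hot;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <size-in-GB> <hot|cold>\n", argv[0]);
		return 1;
	}
	gb = strtoul(argv[1], NULL, 0);
	size = gb << 30;
	hot = !strcmp(argv[2], "hot");

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Populate the anonymous pages by touching each of them once. */
	for (i = 0; i < size; i += page_size)
		buf[i] = 1;

	if (hot) {
		/* Hot workload: keep re-touching all the pages. */
		for (;;)
			for (i = 0; i < size; i += page_size)
				buf[i]++;
	}

	/* Cold workload: accessed once, then just hold the memory. */
	pause();
	return 0;
}

With such a program, workload A above would correspond roughly to
running it as "60 hot", workload B to "58 cold", and workload C to
"57 hot".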
The test results are as follows,

                                      base    patched
Pages swapped in (GB)                  2.3        0.0
Pages swapped out (GB)                59.0       55.9
Pages scanned (GB)                   296.7      172.5
Avg length of active list (GB)        18.1       58.4
Avg length of inactive list (GB)      89.1       48.4

Because the size of the cold workload B (58 GB) is larger than that of
workload C, it's desirable that the accessed-once pages of workload B
are reclaimed to accommodate workload C, so that no pages need to be
swapped in.

But in the base kernel, because the pages of workload A are scanned
before those of workload B, some hot pages (~2.3 GB) of workload A are
wrongly swapped out and later swapped back in.  In the patched kernel,
the pages of workload A are activated to the active list beforehand,
so the amount swapped in drops greatly (to ~14.2 MB).  And because the
inactive list is much shorter in the patched kernel, far fewer pages
need to be scanned to reclaim memory for workload C (172.5 GB vs.
296.7 GB).

As always, the VM subsystem is complex and any change may cause some
regressions; we have observed some for this patch too.  The
fundamental effect of the patch is to shrink the inactive list, which
reduces the scanning overhead and improves scanning correctness.  But
in some situations the long inactive list of the base (unpatched)
kernel can help performance, because it takes longer before a (not so)
hot page is scanned twice, which makes it easier to distinguish hot
from cold pages.  In general, though, I don't think it is a good idea
to improve performance purely by increasing the system overhead.

Signed-off-by: "Huang, Ying"
Inspired-by: Yu Zhao
Cc: Hillf Danton
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Matthew Wilcox
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Roman Gushchin
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Yang Shi
---
 mm/memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 5efa07fb6cdc..b44b6fd577a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4165,6 +4165,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 				&flags);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	if (target_nid == NUMA_NO_NODE) {
+		if (!PageActive(page) && page_evictable(page) &&
+		    (!PageSwapBacked(page) || total_swap_pages)) {
+			if (pte_young(old_pte) && !PageReferenced(page))
+				SetPageReferenced(page);
+			if (PageReferenced(page))
+				mark_page_accessed(page);
+		}
 		put_page(page);
 		goto out;
 	}
-- 
2.30.2