From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-arm-kernel@lists.infradead.org
Cc: catalin.marinas@arm.com, huzhanyuan@oppo.com, lipeifeng@oppo.com,
	zhangshiming@oppo.com, guojian@oppo.com,
	Barry Song <v-songbaohua@oppo.com>, Yu Zhao <yuzhao@google.com>,
	Will Deacon <will@kernel.org>,
	Alex Van Brunt <avanbrunt@nvidia.com>,
	Shaohua Li <shli@kernel.org>
Subject: [PATCH] mm: rmap: Don't flush TLB after checking PTE young for page reference
Date: Wed,  6 Jul 2022 23:20:41 +1200
Message-ID: <20220706112041.3831-1-21cnbao@gmail.com>

From: Barry Song <v-songbaohua@oppo.com>

Whether it is done in hardware or software, TLB flushing is
usually extremely expensive. Since a page can be mapped by many
processes at the same time, in folio_referenced_one() every
mapping whose PTE is young issues its own TLB broadcast, so the
flush overhead is multiplied by the number of mappings.

Some platforms have tried to remove this overhead by implementing
their own ptep_clear_flush_young(), in which the flush is either
dropped (x86, s390, powerpc, riscv) or deferred (arm64). This
obviously breaks the semantics of the API, since the function is
named "flush"; dropping the flush inside a function named "flush"
isn't cool.
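
For reference, the x86 version simply clears the accessed bit and
skips the invalidation entirely. A simplified sketch of
arch/x86/mm/pgtable.c (paraphrased, not quoted verbatim from any
particular kernel version):

 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
 	/*
 	 * On x86, clearing the accessed bit without flushing the TLB
 	 * cannot corrupt data; at worst a stale young bit lets a hot
 	 * page be aged (and possibly reclaimed) incorrectly, which is
 	 * treated as an acceptable trade-off, so no flush is issued.
 	 */
 	return ptep_test_and_clear_young(vma, address, ptep);
 }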

On ARM64, ptep_clear_flush_young() instead uses
flush_tlb_page_nosync() as a cheaper alternative to the more
expensive synchronous TLB broadcast plus DSB, deferring completion
of the invalidation. But the cost of even this nosync alternative
has probably been underestimated.
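
A simplified sketch of the current arm64 implementation
(paraphrased from arch/arm64/include/asm/pgtable.h; comments
abridged, details may differ between kernel versions):

 static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 					 unsigned long address, pte_t *ptep)
 {
 	int young = ptep_test_and_clear_young(vma, address, ptep);

 	if (young) {
 		/*
 		 * The trailing DSB is elided: the worst case is that a
 		 * CPU keeps using the young entry from its TLB and the
 		 * page is mistakenly reclaimed. The window is bounded
 		 * by the next context switch, which provides a DSB to
 		 * complete the TLB invalidation.
 		 */
 		flush_tlb_page_nosync(vma, address);
 	}

 	return young;
 }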

Profiling was done by running a program under high memory pressure
on a ROCK 3A board with an RK3568 SoC (64-bit quad-core Cortex-A55)
and 4GB of memory, using zRAM as the swap device. In the program,
eight processes access one shared memory region, as below:
 
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/mman.h>

 int main(void)
 {
 #define MB (1024 * 1024)
 	/* 4GB of shared anonymous memory: MB (1M) pages of 4KB each */
 	volatile unsigned char *p = mmap(NULL, 4096UL * MB, PROT_READ | PROT_WRITE,
 			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 	if (p == MAP_FAILED)
 		return 1;

 	memset((void *)p, 0x11, 4096UL * MB);

 	/* simulate memory mapped by multiple processes, like libs, .text sections, shmem */
 	fork(); fork(); fork();

 	while (1) {
 		int i;
 		/* randomly pick a page offset, then touch 1024 pages */
 		unsigned long offset = rand() % MB;
 		if (offset + 1024 > MB)
 			offset = MB - 1024;

 		for (i = 0; i < 1024; i++)
 			(void)p[(offset + i) * 4096];

 		usleep(1000);
 	}
 }

After changing flush_tlb_page_nosync() from "inline" to "noinline", as below,
 <static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
 >static noinline void flush_tlb_page_nosync(struct vm_area_struct *vma,
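
(The original report does not give the exact perf command; a
profile like the one below can be captured with something along
the lines of "perf record -g -p <pid of kswapd0>" while the test
program runs, followed by "perf report".)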

the perf result for kswapd is quite surprising:
    19.63%  kswapd0  [kernel.kallsyms]  [k] page_vma_mapped_walk
    10.69%  kswapd0  [kernel.kallsyms]  [k] flush_tlb_page_nosync
     6.73%  kswapd0  [kernel.kallsyms]  [k] folio_referenced_one
     5.92%  kswapd0  [kernel.kallsyms]  [k] zram_bvec_rw.constprop.0.isra.0
     4.55%  kswapd0  [kernel.kallsyms]  [k] ptep_clear_flush
     3.66%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
     2.87%  kswapd0  [kernel.kallsyms]  [k] rmap_walk_file
     2.72%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap_one
     2.03%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_iter_next
     1.86%  kswapd0  [kernel.kallsyms]  [k] shrink_page_list
     1.86%  kswapd0  [kernel.kallsyms]  [k] isolate_lru_pages
     1.78%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_unlock
     1.23%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_subtree_search
     1.15%  kswapd0  [kernel.kallsyms]  [k] PageHuge
     1.02%  kswapd0  [kernel.kallsyms]  [k] check_pte

If flush_tlb_page_nosync() were inlined, its overhead would be
attributed to its callers; that is why "inline" was removed for
this profiling run.

The 10.69% overhead demonstrates that on ARM64 we still need to
move to ptep_clear_young_notify(), even though the nosync tlbi is
already in use.
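
For context, the only difference between the two helpers touched by
this patch is whether the architecture's flushing variant is called
before secondary MMUs are notified. A simplified sketch of the
wrappers in include/linux/mmu_notifier.h (paraphrased; the real
macros also cache their arguments in local variables):

 /* clear pte young, flush the TLB, then notify secondary MMUs */
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)	\
 ({									\
 	int __young = ptep_clear_flush_young(__vma, __address, __ptep);\
 	__young |= mmu_notifier_clear_flush_young((__vma)->vm_mm,	\
 			(__address), (__address) + PAGE_SIZE);		\
 	__young;							\
 })

 /* clear pte young with no TLB flush, then notify secondary MMUs */
 #define ptep_clear_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young = ptep_test_and_clear_young(__vma, __address, __ptep); \
 	__young |= mmu_notifier_clear_young((__vma)->vm_mm,		\
 			(__address), (__address) + PAGE_SIZE);		\
 	__young;							\
 })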

In addition to the commits that removed the flush on platforms
such as riscv, x86 and powerpc, Yu Zhao listed further evidence
supporting the move to ptep_clear_young_notify() in vmscan during
the MGLRU discussion:
 * The fundamental hardware limitation in terms of TLB scalability[1]
 * Alexander's benchmark[2]
 * The TLB doesn't cache stale young PTEs most of the time, so flushing
   the TLB just for the sake of the A-bit isn't necessary[3]

This patch solves the problem at the source, vmscan, so platforms
which haven't dropped the flush can probably benefit directly. On
the other hand, ARM64, whose tlbi is already lightweight, can
eventually remove the overhead of the nosync TLB flush as well.

Last but not least, MGLRU performs no flush in look_around after
clearing pte young, so this patch also makes vmscan broadly
consistent with the MGLRU approach.

 [1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
 [2] https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
 [3] https://lore.kernel.org/lkml/CAOUHufbOwPSbBwd7TG0QFt4YJvBp93Q9nUJEDvMpUA6PqjYMUQ@mail.gmail.com/

Cc: Yu Zhao <yuzhao@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alex Van Brunt <avanbrunt@nvidia.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 -v1 differences from the RFC
 * refined the commit log
 * investigated arm64's flush_tlb_page_nosync()
   under memory pressure

 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..7ce6f0b6c330 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
 		}
 
 		if (pvmw.pte) {
-			if (ptep_clear_flush_young_notify(vma, address,
+			if (ptep_clear_young_notify(vma, address,
 						pvmw.pte)) {
 				/*
 				 * Don't treat a reference through
-- 
2.25.1


