From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-arm-kernel@lists.infradead.org
Cc: catalin.marinas@arm.com, huzhanyuan@oppo.com, lipeifeng@oppo.com,
	zhangshiming@oppo.com, guojian@oppo.com,
	Barry Song <v-songbaohua@oppo.com>, Yu Zhao <yuzhao@google.com>,
	Will Deacon <will@kernel.org>,
	Alex Van Brunt <avanbrunt@nvidia.com>,
	Shaohua Li <shli@kernel.org>
Subject: [PATCH] mm: rmap: Don't flush TLB after checking PTE young for page reference
Date: Wed,  6 Jul 2022 23:20:41 +1200
Message-ID: <20220706112041.3831-1-21cnbao@gmail.com>

From: Barry Song <v-songbaohua@oppo.com>

Whether it is done in hardware or software, TLB flushing is
usually extremely expensive. Since a page can be mapped by many
processes at the same time, in folio_referenced_one() every
mapping whose PTE is young issues its own TLB broadcast, so the
flush overhead is multiplied by the number of mappings.

Some platforms have tried to remove this overhead by implementing
their own ptep_clear_flush_young(), in which the flush is either
dropped (x86, s390, powerpc, riscv) or deferred (arm64). This
obviously breaks the semantics of the API, since the function is
named "flush"; dropping the flush inside a function named "flush"
isn't cool.
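
For reference, the x86 version simply clears the accessed bit and
skips the invalidation entirely. A simplified sketch of
arch/x86/mm/pgtable.c (paraphrased, not quoted verbatim from any
particular kernel version):

 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
 	/*
 	 * On x86, clearing the accessed bit without flushing the TLB
 	 * cannot corrupt data; at worst a stale young bit lets a hot
 	 * page be aged (and possibly reclaimed) incorrectly, which is
 	 * treated as an acceptable trade-off, so no flush is issued.
 	 */
 	return ptep_test_and_clear_young(vma, address, ptep);
 }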

On ARM64, ptep_clear_flush_young() instead uses
flush_tlb_page_nosync() as a cheaper alternative to the more
expensive synchronous TLB broadcast plus DSB, deferring completion
of the invalidation. But the cost of even this nosync alternative
has probably been underestimated.
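
A simplified sketch of the current arm64 implementation
(paraphrased from arch/arm64/include/asm/pgtable.h; comments
abridged, details may differ between kernel versions):

 static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 					 unsigned long address, pte_t *ptep)
 {
 	int young = ptep_test_and_clear_young(vma, address, ptep);

 	if (young) {
 		/*
 		 * The trailing DSB is elided: the worst case is that a
 		 * CPU keeps using the young entry from its TLB and the
 		 * page is mistakenly reclaimed. The window is bounded
 		 * by the next context switch, which provides a DSB to
 		 * complete the TLB invalidation.
 		 */
 		flush_tlb_page_nosync(vma, address);
 	}

 	return young;
 }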

Profiling was done by running a program under high memory pressure
on a ROCK 3A board with an RK3568 SoC (64-bit quad-core Cortex-A55)
and 4GB of memory, using zRAM as the swap device. In the program,
eight processes access one shared memory region, as below:
 
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/mman.h>

 int main(void)
 {
 #define MB (1024 * 1024)
 	/* 4GB of shared anonymous memory: MB (1M) pages of 4KB each */
 	volatile unsigned char *p = mmap(NULL, 4096UL * MB, PROT_READ | PROT_WRITE,
 			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 	if (p == MAP_FAILED)
 		return 1;

 	memset((void *)p, 0x11, 4096UL * MB);

 	/* simulate memory mapped by multiple processes, like libs, .text sections, shmem */
 	fork(); fork(); fork();

 	while (1) {
 		int i;
 		/* randomly pick a page offset, then touch 1024 pages */
 		unsigned long offset = rand() % MB;
 		if (offset + 1024 > MB)
 			offset = MB - 1024;

 		for (i = 0; i < 1024; i++)
 			(void)p[(offset + i) * 4096];

 		usleep(1000);
 	}
 }

After changing flush_tlb_page_nosync() from "inline" to "noinline", as below,
 <static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
 >static noinline void flush_tlb_page_nosync(struct vm_area_struct *vma,
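
(The original report does not give the exact perf command; a
profile like the one below can be captured with something along
the lines of "perf record -g -p <pid of kswapd0>" while the test
program runs, followed by "perf report".)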

the perf result for kswapd is quite surprising:
    19.63%  kswapd0  [kernel.kallsyms]  [k] page_vma_mapped_walk
    10.69%  kswapd0  [kernel.kallsyms]  [k] flush_tlb_page_nosync
     6.73%  kswapd0  [kernel.kallsyms]  [k] folio_referenced_one
     5.92%  kswapd0  [kernel.kallsyms]  [k] zram_bvec_rw.constprop.0.isra.0
     4.55%  kswapd0  [kernel.kallsyms]  [k] ptep_clear_flush
     3.66%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
     2.87%  kswapd0  [kernel.kallsyms]  [k] rmap_walk_file
     2.72%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap_one
     2.03%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_iter_next
     1.86%  kswapd0  [kernel.kallsyms]  [k] shrink_page_list
     1.86%  kswapd0  [kernel.kallsyms]  [k] isolate_lru_pages
     1.78%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_unlock
     1.23%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_subtree_search
     1.15%  kswapd0  [kernel.kallsyms]  [k] PageHuge
     1.02%  kswapd0  [kernel.kallsyms]  [k] check_pte

If flush_tlb_page_nosync() were inlined, its overhead would be
attributed to its callers; that is why "inline" was removed for
this profiling run.

The 10.69% overhead demonstrates that on ARM64 we still need to
move to ptep_clear_young_notify(), even though the nosync tlbi is
already in use.
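
For context, the only difference between the two helpers touched by
this patch is whether the architecture's flushing variant is called
before secondary MMUs are notified. A simplified sketch of the
wrappers in include/linux/mmu_notifier.h (paraphrased; the real
macros also cache their arguments in local variables):

 /* clear pte young, flush the TLB, then notify secondary MMUs */
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)	\
 ({									\
 	int __young = ptep_clear_flush_young(__vma, __address, __ptep);\
 	__young |= mmu_notifier_clear_flush_young((__vma)->vm_mm,	\
 			(__address), (__address) + PAGE_SIZE);		\
 	__young;							\
 })

 /* clear pte young with no TLB flush, then notify secondary MMUs */
 #define ptep_clear_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young = ptep_test_and_clear_young(__vma, __address, __ptep); \
 	__young |= mmu_notifier_clear_young((__vma)->vm_mm,		\
 			(__address), (__address) + PAGE_SIZE);		\
 	__young;							\
 })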

In addition to the commits that removed the flush on platforms
such as riscv, x86 and powerpc, Yu Zhao listed further evidence
supporting the move to ptep_clear_young_notify() in vmscan during
the MGLRU discussion:
 * The fundamental hardware limitation in terms of TLB scalability[1]
 * Alexander's benchmark[2]
 * The TLB doesn't cache stale young PTEs most of the time, so flushing
   the TLB just for the sake of the A-bit isn't necessary[3]

This patch solves the problem at the source, vmscan, so platforms
which haven't dropped the flush can probably benefit directly. On
the other hand, ARM64, whose tlbi is already lightweight, can
eventually remove the overhead of the nosync TLB flush as well.

Last but not least, MGLRU performs no flush in look_around after
clearing pte young, so this patch also makes vmscan broadly
consistent with the MGLRU approach.

 [1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
 [2] https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
 [3] https://lore.kernel.org/lkml/CAOUHufbOwPSbBwd7TG0QFt4YJvBp93Q9nUJEDvMpUA6PqjYMUQ@mail.gmail.com/

Cc: Yu Zhao <yuzhao@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alex Van Brunt <avanbrunt@nvidia.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 -v1 differences from the RFC
 * refined the commit log
 * investigated arm64's flush_tlb_page_nosync()
   under memory pressure

 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..7ce6f0b6c330 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
 		}
 
 		if (pvmw.pte) {
-			if (ptep_clear_flush_young_notify(vma, address,
+			if (ptep_clear_young_notify(vma, address,
 						pvmw.pte)) {
 				/*
 				 * Don't treat a reference through
-- 
2.25.1


