From: Ankur Arora <ankur.a.arora@oracle.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: torvalds@linux-foundation.org, akpm@linux-foundation.org,
mike.kravetz@oracle.com, mingo@kernel.org, luto@kernel.org,
tglx@linutronix.de, bp@alien8.de, peterz@infradead.org,
ak@linux.intel.com, arnd@arndb.de, jgg@nvidia.com,
jon.grimm@amd.com, boris.ostrovsky@oracle.com,
konrad.wilk@oracle.com, joao.m.martins@oracle.com,
ankur.a.arora@oracle.com
Subject: [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake
Date: Mon, 6 Jun 2022 20:37:25 +0000
Message-ID: <20220606203725.1313715-17-ankur.a.arora@oracle.com>
In-Reply-To: <20220606202109.1306034-1-ankur.a.arora@oracle.com>
System: Oracle X8-2 (2 nodes * 26 cores/node * 2 threads/core)
Processor: Intel Xeon Platinum 8270CL (Skylakex, 6:85:7)
Memory: 3TB evenly split between nodes
Microcode: 0x5002f01
scaling_governor: performance
LLC size: 36MB for each node
intel_pstate/no_turbo: 1
$ for i in 2 8 32 128 512; do
      perf bench mem memset -f x86-64-movnt -s ${i}MB
  done
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 2MB bytes ...
6.361971 GB/sec
# Copying 8MB bytes ...
6.300403 GB/sec
# Copying 32MB bytes ...
6.288992 GB/sec
# Copying 128MB bytes ...
6.328793 GB/sec
# Copying 512MB bytes ...
6.324471 GB/sec
# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
# (X86_FEATURE_ERMS) and x86-64-movnt:
              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)     speedup
              -----------------------   -----------------------   -------
     size       BW        (   pstdev)     BW        (   pstdev)
     16MB     20.38 GB/s  ( +- 2.58%)    6.25 GB/s  ( +- 0.41%)   -69.28%
    128MB      6.52 GB/s  ( +- 0.14%)    6.31 GB/s  ( +- 0.47%)    -3.22%
   1024MB      6.48 GB/s  ( +- 0.31%)    6.24 GB/s  ( +- 0.00%)    -3.70%
   4096MB      6.51 GB/s  ( +- 0.01%)    6.27 GB/s  ( +- 0.42%)    -3.68%
Comparing perf stats for size=4096MB:
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
# Running 'mem/memset' benchmark:
# function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
6.516972 GB/sec (+- 0.01%)
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):
3,357,373,317 cpu-cycles # 1.133 GHz ( +- 0.01% ) (29.38%)
165,063,710 instructions # 0.05 insn per cycle ( +- 1.54% ) (35.29%)
358,997 cache-references # 0.121 M/sec ( +- 0.89% ) (35.32%)
205,420 cache-misses # 57.221 % of all cache refs ( +- 3.61% ) (35.36%)
6,117,673 branch-instructions # 2.065 M/sec ( +- 1.48% ) (35.38%)
58,309 branch-misses # 0.95% of all branches ( +- 1.30% ) (35.39%)
31,329,466 bus-cycles # 10.575 M/sec ( +- 0.03% ) (23.56%)
68,543,766 L1-dcache-load-misses # 157.03% of all L1-dcache accesses ( +- 0.02% ) (23.53%)
43,648,909 L1-dcache-loads # 14.734 M/sec ( +- 0.50% ) (23.50%)
137,498 LLC-loads # 0.046 M/sec ( +- 0.21% ) (23.49%)
12,308 LLC-load-misses # 8.95% of all LL-cache accesses ( +- 2.52% ) (23.49%)
26,335 LLC-stores # 0.009 M/sec ( +- 5.65% ) (11.75%)
25,008 LLC-store-misses # 0.008 M/sec ( +- 3.42% ) (11.75%)
2.962842 +- 0.000162 seconds time elapsed ( +- 0.01% )
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
6.283420 GB/sec (+- 0.01%)
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):
4,462,272,094 cpu-cycles # 1.322 GHz ( +- 0.30% ) (29.38%)
1,633,675,881 instructions # 0.37 insn per cycle ( +- 0.21% ) (35.28%)
283,627 cache-references # 0.084 M/sec ( +- 0.58% ) (35.31%)
28,824 cache-misses # 10.163 % of all cache refs ( +- 20.67% ) (35.34%)
139,719,697 branch-instructions # 41.407 M/sec ( +- 0.16% ) (35.35%)
58,062 branch-misses # 0.04% of all branches ( +- 1.49% ) (35.36%)
41,760,350 bus-cycles # 12.376 M/sec ( +- 0.05% ) (23.55%)
303,300 L1-dcache-load-misses # 0.69% of all L1-dcache accesses ( +- 2.08% ) (23.53%)
43,769,498 L1-dcache-loads # 12.972 M/sec ( +- 0.54% ) (23.52%)
99,570 LLC-loads # 0.030 M/sec ( +- 1.06% ) (23.52%)
1,966 LLC-load-misses # 1.97% of all LL-cache accesses ( +- 6.17% ) (23.52%)
129 LLC-stores # 0.038 K/sec ( +- 27.85% ) (11.75%)
7 LLC-store-misses # 0.002 K/sec ( +- 47.82% ) (11.75%)
3.37465 +- 0.00474 seconds time elapsed ( +- 0.14% )
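It's unclear whether using MOVNT is a net negative on Skylake. At 16MB,
which fits in the 36MB LLC, REP;STOSB is far faster (20.38 GB/s vs
6.25 GB/s). For bulk stores that exceed the LLC (128MB and up), MOVNT is
only about 3-4% slower than REP;STOSB, and the L1-dcache-load-misses
stats (L1D.REPLACEMENT) drop from 68,543,766 to 303,300 at 4096MB. So
MOVNT does elide the write-allocate and thus helps with cache efficiency.
However, err on the side of caution and set X86_FEATURE_MOVNT_SLOW
on Skylake.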
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/kernel/cpu/bugs.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 16e293654d34..ee7206f03d15 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -97,7 +97,21 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
 void check_movnt_quirks(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_X86_64
-
+	if (c->x86_vendor == X86_VENDOR_INTEL) {
+		if (c->x86 == 6) {
+			switch (c->x86_model) {
+			case INTEL_FAM6_SKYLAKE_L:
+				fallthrough;
+			case INTEL_FAM6_SKYLAKE:
+				fallthrough;
+			case INTEL_FAM6_SKYLAKE_X:
+				set_cpu_cap(c, X86_FEATURE_MOVNT_SLOW);
+				break;
+			default:
+				break;
+			}
+		}
+	}
 #endif
 }
--
2.31.1
Thread overview: 35+ messages
2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
2022-06-06 20:20 ` [PATCH v3 01/21] mm, huge-page: reorder arguments to process_huge_page() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 02/21] mm, huge-page: refactor process_subpage() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 03/21] clear_page: add generic clear_user_pages() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 04/21] mm, clear_huge_page: support clear_user_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 05/21] mm/huge_page: generalize process_huge_page() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 06/21] x86/clear_page: add clear_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 07/21] x86/asm: add memset_movnti() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 08/21] perf bench: " Ankur Arora
2022-06-06 20:37 ` [PATCH v3 09/21] x86/asm: add clear_pages_movnt() Ankur Arora
2022-06-10 22:11 ` Noah Goldstein
2022-06-10 22:15 ` Noah Goldstein
2022-06-12 11:18 ` Ankur Arora
2022-06-06 20:37 ` [PATCH v3 10/21] x86/asm: add clear_pages_clzero() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
2022-06-06 20:37 ` [PATCH v3 12/21] sparse: add address_space __incoherent Ankur Arora
2022-06-06 20:37 ` [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent() Ankur Arora
2022-06-08 0:01 ` Luc Van Oostenryck
2022-06-12 11:19 ` Ankur Arora
2022-06-06 20:37 ` [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 17/21] clear_huge_page: use non-cached clearing Ankur Arora
2022-06-06 20:37 ` [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING Ankur Arora
2022-06-06 20:37 ` [PATCH v3 19/21] gup: hint non-caching if clearing large regions Ankur Arora
2022-06-06 20:37 ` [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
2022-06-06 20:37 ` Ankur Arora [this message]
2022-06-06 21:53 ` [PATCH v3 00/21] huge page clearing optimizations Linus Torvalds
2022-06-07 15:08 ` Ankur Arora
2022-06-07 17:56 ` Linus Torvalds
2022-06-08 19:24 ` Ankur Arora
2022-06-08 19:39 ` Linus Torvalds
2022-06-08 20:21 ` Ankur Arora
2022-06-08 19:49 ` Matthew Wilcox
2022-06-08 19:51 ` Matthew Wilcox