From: Ankur Arora <ankur.a.arora@oracle.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: torvalds@linux-foundation.org, akpm@linux-foundation.org,
mike.kravetz@oracle.com, mingo@kernel.org, luto@kernel.org,
tglx@linutronix.de, bp@alien8.de, peterz@infradead.org,
ak@linux.intel.com, arnd@arndb.de, jgg@nvidia.com,
jon.grimm@amd.com, boris.ostrovsky@oracle.com,
konrad.wilk@oracle.com, joao.m.martins@oracle.com,
ankur.a.arora@oracle.com
Subject: [PATCH v3 19/21] gup: hint non-caching if clearing large regions
Date: Mon, 6 Jun 2022 20:37:23 +0000 [thread overview]
Message-ID: <20220606203725.1313715-15-ankur.a.arora@oracle.com> (raw)
In-Reply-To: <20220606202109.1306034-1-ankur.a.arora@oracle.com>
When clearing a large region, or when the user explicitly hints
via FOLL_HINT_BULK that a call to get_user_pages() is part of a larger
region being gup'd, take the non-caching path.
One notable limitation is that the non-caching path is only taken when the
underlying pages are huge or gigantic, even if a large region composed of
PAGE_SIZE pages is being cleared. This is because non-caching stores are
generally weakly ordered and need some kind of store fence -- at PTE write
granularity -- to avoid data leakage. At 4KB granularity that fence is
expensive enough to negate any performance advantage.
Performance
==
System: Oracle X9-2c (2 nodes * 32 cores * 2 threads)
Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
Memory: 1024 GB evenly split between nodes
LLC-size: 48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0002c1, scaling-governor: performance
System: Oracle E4-2c (2 nodes * 8 CCXes * 8 cores * 2 threads)
Processor: AMD EPYC 7J13 64-Core Processor (Milan, 25:1:1)
Memory: 512 GB evenly split between nodes
LLC-size: 32MB for each CCX (8-cores * 2-threads)
boost: 1, Microcode: 0xa00115d, scaling-governor: performance
Two workloads: qemu VM creation, where page clearing is the exclusive load,
and -- to probe how these changes affect cache interference with unrelated
processes -- a kbuild run alongside a background page-clearing workload.
Workload: create a 192GB qemu-VM (backed by preallocated 2MB
pages on the local node)
==
Icelakex
--
Time (s) Delta (%)
clear_pages_erms() 16.49 ( +- 0.06s ) # 12.50 bytes/ns
clear_pages_movnt() 9.42 ( +- 0.20s ) -42.87% # 21.88 bytes/ns
The source of the improvement is clear: with non-caching stores the CPU
does not need to issue any RFOs, ending up with far fewer
L1-dcache-load-misses:
- 407,619,058 L1-dcache-loads # 24.746 M/sec ( +- 0.17% ) (69.20%)
- 3,245,399,461 L1-dcache-load-misses # 801.49% of all L1-dcache accesses ( +- 0.01% ) (69.22%)
+ 393,160,148 L1-dcache-loads # 41.786 M/sec ( +- 0.80% ) (69.22%)
+ 5,790,543 L1-dcache-load-misses # 1.50% of all L1-dcache accesses ( +- 1.55% ) (69.26%)
(Fuller perf stat output, at [1], [2].)
Milan
--
Time (s) Delta
clear_pages_erms() 11.83 ( +- 0.08s ) # 17.42 bytes/ns
clear_pages_clzero() 4.91 ( +- 0.27s ) -58.49% # 41.98 bytes/ns
Milan also issues significantly fewer RFOs:
- 6,882,968,897 L1-dcache-loads # 582.960 M/sec ( +- 0.03% ) (33.38%)
- 3,267,546,914 L1-dcache-load-misses # 47.45% of all L1-dcache accesses ( +- 0.02% ) (33.37%)
+ 418,489,450 L1-dcache-loads # 85.611 M/sec ( +- 1.19% ) (33.46%)
+ 5,406,557 L1-dcache-load-misses # 1.35% of all L1-dcache accesses ( +- 1.07% ) (33.45%)
(Fuller perf stat output, at [3], [4].)
Workload: Kbuild with background clear_huge_page()
==
Probe the cache-pollution aspect of this commit with a kbuild
(make -j 32 bzImage) alongside a background process doing
clear_huge_page() via mmap(length=64GB, flags=MAP_POPULATE|MAP_HUGE_2MB)
in a loop.
The expectation -- assuming kbuild performance is partly cache
limited -- is that a background clear_huge_page() -> clear_pages_erms()
load would slow the kbuild down more than a
clear_huge_page() -> clear_pages_movnt() load would. The kbuild itself
does not use THP or similar, so any performance changes are due to the
background load.
Icelakex
--
# kbuild: 16 cores, 32 threads
# clear_huge_page() load: single thread bound to the same CPUset
# taskset -c 16-31,80-95 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage
- 8,226,884,900,694 instructions # 1.09 insn per cycle ( +- 0.02% ) (47.27%)
+ 8,223,413,950,371 instructions # 1.12 insn per cycle ( +- 0.03% ) (47.31%)
- 20,016,410,480,886 slots # 6.565 G/sec ( +- 0.01% ) (69.84%)
- 1,310,070,777,023 topdown-be-bound # 6.1% backend bound ( +- 0.28% ) (69.84%)
+ 19,328,950,611,944 slots # 6.494 G/sec ( +- 0.02% ) (69.87%)
+ 1,043,408,291,623 topdown-be-bound # 5.0% backend bound ( +- 0.33% ) (69.87%)
- 10,747,834,729 LLC-loads # 3.525 M/sec ( +- 0.05% ) (69.68%)
- 4,841,355,743 LLC-load-misses # 45.02% of all LL-cache accesses ( +- 0.06% ) (69.70%)
+ 10,466,865,056 LLC-loads # 3.517 M/sec ( +- 0.08% ) (69.68%)
+ 4,206,944,783 LLC-load-misses # 40.21% of all LL-cache accesses ( +- 0.06% ) (69.71%)
The LLC-load-misses show a significant improvement (-13.11%), which is
borne out in the (-20.35%) reduction in topdown-be-bound and a (+2.75%)
improvement in IPC.
- 7,521,157,276,899 cycles # 2.467 GHz ( +- 0.02% ) (39.65%)
+ 7,348,971,235,549 cycles # 2.469 GHz ( +- 0.04% ) (39.68%)
This ends up with an overall improvement in cycles of (-2.28%).
(Fuller perf stat output, at [5], [6].)
Milan
--
# kbuild: 2 CCXes, 16 cores, 32 threads
# clear_huge_page() load: single thread bound to the same CPUset
# taskset -c 16-31,144-159 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage
- 302,739,130,717 stalled-cycles-backend # 3.82% backend cycles idle ( +- 0.10% ) (41.11%)
+ 287,703,667,307 stalled-cycles-backend # 3.74% backend cycles idle ( +- 0.04% ) (41.11%)
- 8,981,403,534,446 instructions # 1.13 insn per cycle
+ 8,969,062,192,998 instructions # 1.16 insn per cycle
Milan sees a (-4.96%) improvement in stalled-cycles-backend and
a (+2.65%) improvement in IPC.
- 7,930,842,057,103 cycles # 2.338 GHz ( +- 0.04% ) (41.09%)
+ 7,705,812,395,365 cycles # 2.339 GHz ( +- 0.01% ) (41.11%)
This ends up with an overall improvement in cycles of (-2.83%).
(Fuller perf stat output, at [7], [8].)
[1] Icelakex, clear_pages_erms()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh
Performance counter stats for './qemu.sh' (5 runs):
16,329.41 msec task-clock # 0.990 CPUs utilized ( +- 0.42% )
143 context-switches # 8.681 /sec ( +- 0.93% )
1 cpu-migrations # 0.061 /sec ( +- 63.25% )
118 page-faults # 7.164 /sec ( +- 0.27% )
41,735,523,673 cycles # 2.534 GHz ( +- 0.42% ) (38.46%)
1,454,116,543 instructions # 0.03 insn per cycle ( +- 0.49% ) (46.16%)
266,749,920 branches # 16.194 M/sec ( +- 0.41% ) (53.86%)
928,726 branch-misses # 0.35% of all branches ( +- 0.38% ) (61.54%)
208,805,754,709 slots # 12.676 G/sec ( +- 0.41% ) (69.23%)
5,355,889,366 topdown-retiring # 2.5% retiring ( +- 0.50% ) (69.23%)
12,720,749,784 topdown-bad-spec # 6.1% bad speculation ( +- 1.38% ) (69.23%)
998,710,552 topdown-fe-bound # 0.5% frontend bound ( +- 0.85% ) (69.23%)
192,653,197,875 topdown-be-bound # 90.9% backend bound ( +- 0.38% ) (69.23%)
407,619,058 L1-dcache-loads # 24.746 M/sec ( +- 0.17% ) (69.20%)
3,245,399,461 L1-dcache-load-misses # 801.49% of all L1-dcache accesses ( +- 0.01% ) (69.22%)
10,805,747 LLC-loads # 656.009 K/sec ( +- 0.37% ) (69.25%)
804,475 LLC-load-misses # 7.44% of all LL-cache accesses ( +- 2.73% ) (69.26%)
<not supported> L1-icache-loads
18,134,527 L1-icache-load-misses ( +- 1.24% ) (30.80%)
435,474,462 dTLB-loads # 26.437 M/sec ( +- 0.28% ) (30.80%)
41,187 dTLB-load-misses # 0.01% of all dTLB cache accesses ( +- 4.06% ) (30.79%)
<not supported> iTLB-loads
440,135 iTLB-load-misses ( +- 1.07% ) (30.78%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
16.4906 +- 0.0676 seconds time elapsed ( +- 0.41% )
[2] Icelakex, clear_pages_movnt()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh
Performance counter stats for './qemu.sh' (5 runs):
9,896.77 msec task-clock # 1.050 CPUs utilized ( +- 2.08% )
135 context-switches # 14.348 /sec ( +- 0.74% )
0 cpu-migrations # 0.000 /sec
116 page-faults # 12.329 /sec ( +- 0.50% )
25,239,642,558 cycles # 2.683 GHz ( +- 2.11% ) (38.43%)
36,791,658,500 instructions # 1.54 insn per cycle ( +- 0.06% ) (46.12%)
3,475,279,229 branches # 369.361 M/sec ( +- 0.09% ) (53.82%)
1,987,098 branch-misses # 0.06% of all branches ( +- 0.71% ) (61.51%)
126,256,220,768 slots # 13.419 G/sec ( +- 2.10% ) (69.21%)
57,705,186,453 topdown-retiring # 47.8% retiring ( +- 0.28% ) (69.21%)
5,934,729,245 topdown-bad-spec # 4.3% bad speculation ( +- 5.91% ) (69.21%)
4,089,990,217 topdown-fe-bound # 3.1% frontend bound ( +- 2.11% ) (69.21%)
60,298,426,167 topdown-be-bound # 44.8% backend bound ( +- 4.21% ) (69.21%)
393,160,148 L1-dcache-loads # 41.786 M/sec ( +- 0.80% ) (69.22%)
5,790,543 L1-dcache-load-misses # 1.50% of all L1-dcache accesses ( +- 1.55% ) (69.26%)
1,069,049 LLC-loads # 113.621 K/sec ( +- 1.25% ) (69.27%)
728,260 LLC-load-misses # 70.65% of all LL-cache accesses ( +- 2.63% ) (69.30%)
<not supported> L1-icache-loads
14,620,549 L1-icache-load-misses ( +- 1.27% ) (30.80%)
404,962,421 dTLB-loads # 43.040 M/sec ( +- 1.13% ) (30.80%)
31,916 dTLB-load-misses # 0.01% of all dTLB cache accesses ( +- 4.61% ) (30.77%)
<not supported> iTLB-loads
396,984 iTLB-load-misses ( +- 2.23% ) (30.74%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
9.428 +- 0.206 seconds time elapsed ( +- 2.18% )
[3] Milan, clear_pages_erms()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh
Performance counter stats for './qemu.sh' (5 runs):
11,676.79 msec task-clock # 0.987 CPUs utilized ( +- 0.68% )
96 context-switches # 8.131 /sec ( +- 0.78% )
2 cpu-migrations # 0.169 /sec ( +- 18.71% )
106 page-faults # 8.978 /sec ( +- 0.23% )
28,161,726,414 cycles # 2.385 GHz ( +- 0.69% ) (33.33%)
141,032,827 stalled-cycles-frontend # 0.50% frontend cycles idle ( +- 52.44% ) (33.35%)
796,792,139 stalled-cycles-backend # 2.80% backend cycles idle ( +- 23.73% ) (33.35%)
1,140,172,646 instructions # 0.04 insn per cycle
# 0.50 stalled cycles per insn ( +- 0.89% ) (33.35%)
219,864,061 branches # 18.622 M/sec ( +- 1.06% ) (33.36%)
1,407,446 branch-misses # 0.63% of all branches ( +- 10.66% ) (33.40%)
6,882,968,897 L1-dcache-loads # 582.960 M/sec ( +- 0.03% ) (33.38%)
3,267,546,914 L1-dcache-load-misses # 47.45% of all L1-dcache accesses ( +- 0.02% ) (33.37%)
<not supported> LLC-loads
<not supported> LLC-load-misses
146,901,513 L1-icache-loads # 12.442 M/sec ( +- 0.78% ) (33.36%)
1,462,155 L1-icache-load-misses # 0.99% of all L1-icache accesses ( +- 0.83% ) (33.34%)
2,055,805 dTLB-loads # 174.118 K/sec ( +- 22.56% ) (33.33%)
136,260 dTLB-load-misses # 4.69% of all dTLB cache accesses ( +- 23.13% ) (33.35%)
941 iTLB-loads # 79.699 /sec ( +- 5.54% ) (33.35%)
115,444 iTLB-load-misses # 14051.12% of all iTLB cache accesses ( +- 21.17% ) (33.34%)
95,438,373 L1-dcache-prefetches # 8.083 M/sec ( +- 19.99% ) (33.34%)
<not supported> L1-dcache-prefetch-misses
11.8296 +- 0.0805 seconds time elapsed ( +- 0.68% )
[4] Milan, clear_pages_clzero()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh
Performance counter stats for './qemu.sh' (5 runs):
4,599.00 msec task-clock # 0.937 CPUs utilized ( +- 5.93% )
91 context-switches # 18.616 /sec ( +- 0.92% )
0 cpu-migrations # 0.000 /sec
107 page-faults # 21.889 /sec ( +- 0.19% )
10,975,453,059 cycles # 2.245 GHz ( +- 6.02% ) (33.28%)
14,193,355 stalled-cycles-frontend # 0.12% frontend cycles idle ( +- 1.90% ) (33.35%)
38,969,144 stalled-cycles-backend # 0.33% backend cycles idle ( +- 23.92% ) (33.34%)
13,951,880,530 instructions # 1.20 insn per cycle
# 0.00 stalled cycles per insn ( +- 0.11% ) (33.33%)
3,426,708,418 branches # 701.003 M/sec ( +- 0.06% ) (33.36%)
2,350,619 branch-misses # 0.07% of all branches ( +- 0.61% ) (33.45%)
418,489,450 L1-dcache-loads # 85.611 M/sec ( +- 1.19% ) (33.46%)
5,406,557 L1-dcache-load-misses # 1.35% of all L1-dcache accesses ( +- 1.07% ) (33.45%)
<not supported> LLC-loads
<not supported> LLC-load-misses
90,088,059 L1-icache-loads # 18.429 M/sec ( +- 0.36% ) (33.44%)
1,081,035 L1-icache-load-misses # 1.20% of all L1-icache accesses ( +- 3.67% ) (33.42%)
4,017,464 dTLB-loads # 821.854 K/sec ( +- 1.02% ) (33.40%)
204,096 dTLB-load-misses # 5.22% of all dTLB cache accesses ( +- 9.77% ) (33.36%)
770 iTLB-loads # 157.519 /sec ( +- 5.12% ) (33.36%)
209,834 iTLB-load-misses # 29479.35% of all iTLB cache accesses ( +- 0.17% ) (33.34%)
1,596,265 L1-dcache-prefetches # 326.548 K/sec ( +- 1.55% ) (33.31%)
<not supported> L1-dcache-prefetch-misses
4.908 +- 0.272 seconds time elapsed ( +- 5.54% )
[5] Icelakex, kbuild + bg:clear_pages_erms() load.
# taskset -c 16-31,80-95 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage
Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):
3,047,329.07 msec task-clock # 19.520 CPUs utilized ( +- 0.02% )
1,675,061 context-switches # 549.415 /sec ( +- 0.43% )
89,232 cpu-migrations # 29.268 /sec ( +- 2.34% )
85,752,972 page-faults # 28.127 K/sec ( +- 0.00% )
7,521,157,276,899 cycles # 2.467 GHz ( +- 0.02% ) (39.65%)
8,226,884,900,694 instructions # 1.09 insn per cycle ( +- 0.02% ) (47.27%)
1,744,557,848,503 branches # 572.209 M/sec ( +- 0.02% ) (54.83%)
36,252,079,075 branch-misses # 2.08% of all branches ( +- 0.02% ) (62.35%)
20,016,410,480,886 slots # 6.565 G/sec ( +- 0.01% ) (69.84%)
6,518,990,385,998 topdown-retiring # 30.5% retiring ( +- 0.02% ) (69.84%)
7,821,817,193,732 topdown-bad-spec # 36.7% bad speculation ( +- 0.29% ) (69.84%)
5,714,082,318,274 topdown-fe-bound # 26.7% frontend bound ( +- 0.10% ) (69.84%)
1,310,070,777,023 topdown-be-bound # 6.1% backend bound ( +- 0.28% ) (69.84%)
2,270,017,283,501 L1-dcache-loads # 744.558 M/sec ( +- 0.02% ) (69.60%)
103,295,556,544 L1-dcache-load-misses # 4.55% of all L1-dcache accesses ( +- 0.02% ) (69.64%)
10,747,834,729 LLC-loads # 3.525 M/sec ( +- 0.05% ) (69.68%)
4,841,355,743 LLC-load-misses # 45.02% of all LL-cache accesses ( +- 0.06% ) (69.70%)
<not supported> L1-icache-loads
180,672,238,145 L1-icache-load-misses ( +- 0.03% ) (31.18%)
2,216,149,664,522 dTLB-loads # 726.890 M/sec ( +- 0.03% ) (31.83%)
2,000,781,326 dTLB-load-misses # 0.09% of all dTLB cache accesses ( +- 0.08% ) (31.79%)
<not supported> iTLB-loads
1,938,124,234 iTLB-load-misses ( +- 0.04% ) (31.76%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
156.1136 +- 0.0785 seconds time elapsed ( +- 0.05% )
[6] Icelakex, kbuild + bg:clear_pages_movnt() load.
# taskset -c 16-31,80-95 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage
Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):
2,978,535.47 msec task-clock # 19.471 CPUs utilized ( +- 0.05% )
1,637,295 context-switches # 550.105 /sec ( +- 0.89% )
91,635 cpu-migrations # 30.788 /sec ( +- 1.88% )
85,754,138 page-faults # 28.812 K/sec ( +- 0.00% )
7,348,971,235,549 cycles # 2.469 GHz ( +- 0.04% ) (39.68%)
8,223,413,950,371 instructions # 1.12 insn per cycle ( +- 0.03% ) (47.31%)
1,743,914,970,674 branches # 585.928 M/sec ( +- 0.01% ) (54.87%)
36,188,623,655 branch-misses # 2.07% of all branches ( +- 0.05% ) (62.39%)
19,328,950,611,944 slots # 6.494 G/sec ( +- 0.02% ) (69.87%)
6,508,801,041,075 topdown-retiring # 31.7% retiring ( +- 0.35% ) (69.87%)
7,581,383,615,462 topdown-bad-spec # 36.4% bad speculation ( +- 0.43% ) (69.87%)
5,521,686,808,149 topdown-fe-bound # 26.8% frontend bound ( +- 0.14% ) (69.87%)
1,043,408,291,623 topdown-be-bound # 5.0% backend bound ( +- 0.33% ) (69.87%)
2,269,475,492,575 L1-dcache-loads # 762.507 M/sec ( +- 0.03% ) (69.63%)
101,544,979,642 L1-dcache-load-misses # 4.47% of all L1-dcache accesses ( +- 0.05% ) (69.66%)
10,466,865,056 LLC-loads # 3.517 M/sec ( +- 0.08% ) (69.68%)
4,206,944,783 LLC-load-misses # 40.21% of all LL-cache accesses ( +- 0.06% ) (69.71%)
<not supported> L1-icache-loads
180,267,126,923 L1-icache-load-misses ( +- 0.07% ) (31.17%)
2,216,010,317,050 dTLB-loads # 744.544 M/sec ( +- 0.03% ) (31.82%)
1,979,801,744 dTLB-load-misses # 0.09% of all dTLB cache accesses ( +- 0.10% ) (31.79%)
<not supported> iTLB-loads
1,925,390,304 iTLB-load-misses ( +- 0.08% ) (31.77%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
152.972 +- 0.309 seconds time elapsed ( +- 0.20% )
[7] Milan, clear_pages_erms()
# taskset -c 16-31,144-159 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage
Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):
3,390,130.53 msec task-clock # 18.241 CPUs utilized ( +- 0.04% )
1,720,283 context-switches # 507.160 /sec ( +- 0.27% )
96,694 cpu-migrations # 28.507 /sec ( +- 1.41% )
75,872,994 page-faults # 22.368 K/sec ( +- 0.00% )
7,930,842,057,103 cycles # 2.338 GHz ( +- 0.04% ) (41.09%)
39,974,518,172 stalled-cycles-frontend # 0.50% frontend cycles idle ( +- 0.05% ) (41.10%)
302,739,130,717 stalled-cycles-backend # 3.82% backend cycles idle ( +- 0.10% ) (41.11%)
8,981,403,534,446 instructions # 1.13 insn per cycle
# 0.03 stalled cycles per insn ( +- 0.03% ) (41.10%)
1,909,303,327,220 branches # 562.886 M/sec ( +- 0.02% ) (41.10%)
50,324,935,298 branch-misses # 2.64% of all branches ( +- 0.02% ) (41.09%)
3,563,297,595,796 L1-dcache-loads # 1.051 G/sec ( +- 0.03% ) (41.08%)
129,901,339,258 L1-dcache-load-misses # 3.65% of all L1-dcache accesses ( +- 0.10% ) (41.07%)
<not supported> LLC-loads
<not supported> LLC-load-misses
809,770,606,566 L1-icache-loads # 238.730 M/sec ( +- 0.03% ) (41.07%)
12,403,758,671 L1-icache-load-misses # 1.53% of all L1-icache accesses ( +- 0.08% ) (41.07%)
60,010,026,089 dTLB-loads # 17.692 M/sec ( +- 0.04% ) (41.07%)
3,254,066,681 dTLB-load-misses # 5.42% of all dTLB cache accesses ( +- 0.09% ) (41.07%)
5,195,070,952 iTLB-loads # 1.532 M/sec ( +- 0.03% ) (41.08%)
489,196,395 iTLB-load-misses # 9.42% of all iTLB cache accesses ( +- 0.10% ) (41.09%)
39,920,161,716 L1-dcache-prefetches # 11.769 M/sec ( +- 0.03% ) (41.09%)
<not supported> L1-dcache-prefetch-misses
185.852 +- 0.501 seconds time elapsed ( +- 0.27% )
[8] Milan, clear_pages_clzero()
# taskset -c 16-31,144-159 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage
Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):
3,296,677.12 msec task-clock # 18.051 CPUs utilized ( +- 0.02% )
1,713,645 context-switches # 520.062 /sec ( +- 0.26% )
91,883 cpu-migrations # 27.885 /sec ( +- 0.83% )
75,877,740 page-faults # 23.028 K/sec ( +- 0.00% )
7,705,812,395,365 cycles # 2.339 GHz ( +- 0.01% ) (41.11%)
38,866,265,031 stalled-cycles-frontend # 0.50% frontend cycles idle ( +- 0.09% ) (41.10%)
287,703,667,307 stalled-cycles-backend # 3.74% backend cycles idle ( +- 0.04% ) (41.11%)
8,969,062,192,998 instructions # 1.16 insn per cycle
# 0.03 stalled cycles per insn ( +- 0.01% ) (41.11%)
1,906,857,866,689 branches # 578.699 M/sec ( +- 0.01% ) (41.10%)
50,155,411,444 branch-misses # 2.63% of all branches ( +- 0.03% ) (41.11%)
3,552,652,190,906 L1-dcache-loads # 1.078 G/sec ( +- 0.01% ) (41.13%)
127,238,478,917 L1-dcache-load-misses # 3.58% of all L1-dcache accesses ( +- 0.04% ) (41.13%)
<not supported> LLC-loads
<not supported> LLC-load-misses
808,024,730,682 L1-icache-loads # 245.222 M/sec ( +- 0.03% ) (41.13%)
7,773,178,107 L1-icache-load-misses # 0.96% of all L1-icache accesses ( +- 0.11% ) (41.13%)
59,684,355,294 dTLB-loads # 18.113 M/sec ( +- 0.04% ) (41.12%)
3,247,521,154 dTLB-load-misses # 5.44% of all dTLB cache accesses ( +- 0.04% ) (41.12%)
5,064,547,530 iTLB-loads # 1.537 M/sec ( +- 0.09% ) (41.12%)
462,977,175 iTLB-load-misses # 9.13% of all iTLB cache accesses ( +- 0.07% ) (41.12%)
39,307,810,241 L1-dcache-prefetches # 11.929 M/sec ( +- 0.06% ) (41.11%)
<not supported> L1-dcache-prefetch-misses
182.630 +- 0.365 seconds time elapsed ( +- 0.20% )
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
Notes:
Not sure if this wall of perf-stats (or indeed the whole kbuild test) is
warranted here.
To my eyes, there's no non-obvious information in the performance results
(reducing cache usage should and does lead to other processes getting a small
bump in performance), so is there any value in keeping this in the commit
message?
fs/hugetlbfs/inode.c | 7 ++++++-
mm/gup.c | 18 ++++++++++++++++++
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 9 ++++++++-
4 files changed, 33 insertions(+), 3 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 62408047e8d7..993bb7227a2f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -650,6 +650,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
loff_t hpage_size = huge_page_size(h);
unsigned long hpage_shift = huge_page_shift(h);
pgoff_t start, index, end;
+ bool hint_non_caching;
int error;
u32 hash;
@@ -667,6 +668,9 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
start = offset >> hpage_shift;
end = (offset + len + hpage_size - 1) >> hpage_shift;
+ /* Don't pollute the cache if we are fallocate'ing a large region. */
+ hint_non_caching = clear_page_prefer_non_caching((end - start) << hpage_shift);
+
inode_lock(inode);
/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -745,7 +749,8 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
error = PTR_ERR(page);
goto out;
}
- clear_huge_page(page, addr, pages_per_huge_page(h));
+ clear_huge_page(page, addr, pages_per_huge_page(h),
+ hint_non_caching);
__SetPageUptodate(page);
error = huge_add_to_page_cache(page, mapping, index);
if (unlikely(error)) {
diff --git a/mm/gup.c b/mm/gup.c
index 551264407624..bceb6ff64687 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -944,6 +944,13 @@ static int faultin_page(struct vm_area_struct *vma,
*/
fault_flags |= FAULT_FLAG_TRIED;
}
+ if (*flags & FOLL_HINT_BULK) {
+ /*
+ * This page is part of a large region being faulted-in
+ * so attempt to minimize cache-pollution.
+ */
+ fault_flags |= FAULT_FLAG_NON_CACHING;
+ }
if (unshare) {
fault_flags |= FAULT_FLAG_UNSHARE;
/* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
@@ -1116,6 +1123,17 @@ static long __get_user_pages(struct mm_struct *mm,
if (!(gup_flags & FOLL_FORCE))
gup_flags |= FOLL_NUMA;
+ /*
+ * Non-cached page clearing is generally faster when clearing regions
+ * larger than O(LLC-size). So hint the non-caching path based on
+ * clear_page_prefer_non_caching().
+ *
+ * Note, however this check is optimistic -- nr_pages is the upper
+ * limit and we might be clearing less than that.
+ */
+ if (clear_page_prefer_non_caching(nr_pages * PAGE_SIZE))
+ gup_flags |= FOLL_HINT_BULK;
+
do {
struct page *page;
unsigned int foll_flags = gup_flags;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73654db77a1c..c7294cffc384 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -594,7 +594,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
pgtable_t pgtable;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
vm_fault_t ret = 0;
- bool non_cached = false;
+ bool non_cached = vmf->flags & FAULT_FLAG_NON_CACHING;
VM_BUG_ON_PAGE(!PageCompound(page), page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0c4a31b5c1e9..d906c6558b15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5481,7 +5481,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
- bool non_cached = false;
+ bool non_cached = flags & FAULT_FLAG_NON_CACHING;
/*
* Currently, we are forced to kill the process in the event the
@@ -6182,6 +6182,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
fault_flags |= FAULT_FLAG_TRIED;
}
+ if (flags & FOLL_HINT_BULK) {
+ /*
+ * From the user hint, we might be faulting-in
+ * a large region so minimize cache-pollution.
+ */
+ fault_flags |= FAULT_FLAG_NON_CACHING;
+ }
ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
if (ret & VM_FAULT_ERROR) {
err = vm_fault_to_errno(ret, flags);
--
2.31.1