From: Andrew Murray <andrew.murray@arm.com>
To: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will.deacon@arm.com>,
Peter Zijlstra <peterz@infradead.org>,
Ard.Biesheuvel@arm.com
Cc: Mark Rutland <mark.rutland@arm.com>,
Boqun Feng <boqun.feng@gmail.com>,
linux-arm-kernel@lists.infradead.org
Subject: [PATCH v3 0/5] arm64: avoid out-of-line ll/sc atomics
Date: Mon, 12 Aug 2019 15:36:20 +0100 [thread overview]
Message-ID: <20190812143625.42745-1-andrew.murray@arm.com> (raw)
When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
or toolchain doesn't support it, the existing code will fall back to ll/sc
atomics. It achieves this by branching from inline assembly to a function
that is built with special compile flags. Further, this results in the
clobbering of registers even when the fallback isn't used, increasing
register pressure.
Let's improve this by providing inline implementations of both LSE and
ll/sc atomics, and using a static key to select between them. This allows the
compiler to generate better atomics code.
Whilst the performance impact may be difficult to quantify, we gain
improved code readability, the ability to build with Clang, and improved
backtrace reliability.
Build and boot tested, along with atomic_64_test.
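In userspace terms, the dispatch described above can be modelled roughly as
follows. This is a hedged sketch using C11 atomics, not the kernel code:
`have_lse` is a plain runtime flag standing in for the kernel's static key
(which is patched to a nop at boot), the compare-and-swap loop stands in for
the ldxr/stxr retry loop, and the single fetch-add stands in for the LSE
stadd instruction:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the kernel static key; the real key is patched
 * into the instruction stream rather than tested at runtime. */
static bool have_lse = true;

/* Fallback path: a CAS loop models the LL/SC ldxr/stxr retry
 * loop shown in the disassembly above. */
static void llsc_add(int i, _Atomic int *v)
{
	int old = atomic_load_explicit(v, memory_order_relaxed);

	/* On failure, 'old' is refreshed and the add is retried,
	 * just as cbnz loops back to ldxr. */
	while (!atomic_compare_exchange_weak_explicit(v, &old, old + i,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
}

/* Fast path: a single atomic RMW models the LSE stadd. */
static void lse_add(int i, _Atomic int *v)
{
	atomic_fetch_add_explicit(v, i, memory_order_relaxed);
}

/* Both implementations are inline in the caller; once the static
 * key is patched, only one path's instructions are ever executed. */
static void my_atomic_add(int i, _Atomic int *v)
{
	if (have_lse)
		lse_add(i, v);
	else
		llsc_add(i, v);
}
```

Either path leaves the same value in `*v`; the point of the series is that
the compiler sees both bodies and can allocate registers for them directly,
rather than conservatively clobbering registers for an out-of-line call.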
Following is the assembly of a function that has three consecutive
atomic_add calls when built with LSE and this patchset:
Dump of assembler code for function atomics_test:
0xffff000010084338 <+0>: b 0xffff000010084388 <atomics_test+80>
0xffff00001008433c <+4>: b 0xffff000010084388 <atomics_test+80>
0xffff000010084340 <+8>: adrp x0, 0xffff0000118d5000 <reset_devices>
0xffff000010084344 <+12>: add x2, x0, #0x0
0xffff000010084348 <+16>: mov w1, #0x1 // #1
0xffff00001008434c <+20>: add x3, x2, #0x28
0xffff000010084350 <+24>: stadd w1, [x3]
0xffff000010084354 <+28>: b 0xffff00001008439c <atomics_test+100>
0xffff000010084358 <+32>: b 0xffff00001008439c <atomics_test+100>
0xffff00001008435c <+36>: add x1, x0, #0x0
0xffff000010084360 <+40>: mov w2, #0x1 // #1
0xffff000010084364 <+44>: add x3, x1, #0x28
0xffff000010084368 <+48>: stadd w2, [x3]
0xffff00001008436c <+52>: b 0xffff0000100843ac <atomics_test+116>
0xffff000010084370 <+56>: b 0xffff0000100843ac <atomics_test+116>
0xffff000010084374 <+60>: add x0, x0, #0x0
0xffff000010084378 <+64>: mov w1, #0x1 // #1
0xffff00001008437c <+68>: add x2, x0, #0x28
0xffff000010084380 <+72>: stadd w1, [x2]
0xffff000010084384 <+76>: ret
0xffff000010084388 <+80>: adrp x0, 0xffff0000118d5000 <reset_devices>
0xffff00001008438c <+84>: add x1, x0, #0x0
0xffff000010084390 <+88>: add x1, x1, #0x28
0xffff000010084394 <+92>: b 0xffff000010084570
0xffff000010084398 <+96>: b 0xffff000010084354 <atomics_test+28>
0xffff00001008439c <+100>: add x1, x0, #0x0
0xffff0000100843a0 <+104>: add x1, x1, #0x28
0xffff0000100843a4 <+108>: b 0xffff000010084588
0xffff0000100843a8 <+112>: b 0xffff00001008436c <atomics_test+52>
0xffff0000100843ac <+116>: add x0, x0, #0x0
0xffff0000100843b0 <+120>: add x0, x0, #0x28
0xffff0000100843b4 <+124>: b 0xffff0000100845a0
0xffff0000100843b8 <+128>: ret
End of assembler dump.
ffff000010084570: f9800031 prfm pstl1strm, [x1]
ffff000010084574: 885f7c22 ldxr w2, [x1]
ffff000010084578: 11000442 add w2, w2, #0x1
ffff00001008457c: 88037c22 stxr w3, w2, [x1]
ffff000010084580: 35ffffa3 cbnz w3, ffff000010084574 <do_one_initcall+0x1b4>
ffff000010084584: 17ffff85 b ffff000010084398 <atomics_test+0x60>
ffff000010084588: f9800031 prfm pstl1strm, [x1]
ffff00001008458c: 885f7c22 ldxr w2, [x1]
ffff000010084590: 11000442 add w2, w2, #0x1
ffff000010084594: 88037c22 stxr w3, w2, [x1]
ffff000010084598: 35ffffa3 cbnz w3, ffff00001008458c <do_one_initcall+0x1cc>
ffff00001008459c: 17ffff83 b ffff0000100843a8 <atomics_test+0x70>
ffff0000100845a0: f9800011 prfm pstl1strm, [x0]
ffff0000100845a4: 885f7c01 ldxr w1, [x0]
ffff0000100845a8: 11000421 add w1, w1, #0x1
ffff0000100845ac: 88027c01 stxr w2, w1, [x0]
ffff0000100845b0: 35ffffa2 cbnz w2, ffff0000100845a4 <do_one_initcall+0x1e4>
ffff0000100845b4: 17ffff81 b ffff0000100843b8 <atomics_test+0x80>
The two branches before each section of atomics relate to the two static
keys, which both become nops when LSE is available. When LSE isn't
available, the branches are used to run the slowpath fallback LL/SC atomics.
In v1 of this series, due to the use of likely/unlikely for the LSE code,
the fallback code ended up in one place at the end of the function. Since v2
of this series, the fallback code is moved into its own subsection, which
places the fallback atomics at the end of each compilation unit. It is felt
that this may improve icache performance for both LSE and LL/SC.
Where CONFIG_ARM64_LSE_ATOMICS isn't enabled, the same function is as
follows:
Dump of assembler code for function atomics_test:
0xffff000010084338 <+0>: adrp x0, 0xffff000011865000 <reset_devices>
0xffff00001008433c <+4>: add x0, x0, #0x0
0xffff000010084340 <+8>: add x3, x0, #0x28
0xffff000010084344 <+12>: prfm pstl1strm, [x3]
0xffff000010084348 <+16>: ldxr w1, [x3]
0xffff00001008434c <+20>: add w1, w1, #0x1
0xffff000010084350 <+24>: stxr w2, w1, [x3]
0xffff000010084354 <+28>: cbnz w2, 0xffff000010084348 <atomics_test+16>
0xffff000010084358 <+32>: prfm pstl1strm, [x3]
0xffff00001008435c <+36>: ldxr w1, [x3]
0xffff000010084360 <+40>: add w1, w1, #0x1
0xffff000010084364 <+44>: stxr w2, w1, [x3]
0xffff000010084368 <+48>: cbnz w2, 0xffff00001008435c <atomics_test+36>
0xffff00001008436c <+52>: prfm pstl1strm, [x3]
0xffff000010084370 <+56>: ldxr w1, [x3]
0xffff000010084374 <+60>: add w1, w1, #0x1
0xffff000010084378 <+64>: stxr w2, w1, [x3]
0xffff00001008437c <+68>: cbnz w2, 0xffff000010084370 <atomics_test+56>
0xffff000010084380 <+72>: ret
End of assembler dump.
These changes add some bloat on defconfig according to bloat-o-meter:
For LSE build (text):
add/remove: 4/109 grow/shrink: 3398/67 up/down: 151556/-4940
Total: Before=12759457, After=12906073, chg +1.15%
For LL/SC only build (text):
add/remove: 2/2 grow/shrink: 1423/57 up/down: 12224/-564 (11660)
Total: Before=12836417, After=12848077, chg +0.09%
The bloat for LSE is due to the LL/SC fallback atomics now being inlined
at each call site rather than out of line.
The bloat for LL/SC seems to be due to patch 2, which changes some assembly
constraints (i.e. moving an immediate to a register).
When comparing the number of data transfer instructions (those starting or
ending with ld or st) in vmlinux, we see a reduction from 30.8% to 30.6% when
applying this series, and no change when CONFIG_ARM64_LSE_ATOMICS isn't
enabled (30.9%). This was a crude attempt to measure register spilling.
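The cover letter does not show how the ld/st ratio was computed; one way to
measure it might look like the following sketch. The awk filter and the
`count_mem_ops` helper are assumptions for illustration; real use would pipe
`aarch64-linux-gnu-objdump -d vmlinux` into the helper instead of the toy
listing:

```shell
# Count instructions whose mnemonic starts or ends with ld/st
# (ldxr, stxr, stadd, ...) in an objdump-style listing.
count_mem_ops() {
  awk '/^[[:space:]]*[0-9a-f]+:/ {
         total++
         if ($3 ~ /^(ld|st)/ || $3 ~ /(ld|st)$/) mem++
       }
       END { printf "%d of %d instructions are loads/stores\n", mem, total }'
}

# Toy listing taken from the LL/SC fallback dump above:
printf '%s\n' \
  '   0: f9800031  prfm pstl1strm, [x1]' \
  '   4: 885f7c22  ldxr w2, [x1]' \
  '   8: 11000442  add  w2, w2, #0x1' \
  '   c: 88037c22  stxr w3, w2, [x1]' | count_mem_ops
```

Here ldxr and stxr are counted, prfm and add are not, so the toy input
reports 2 of 4.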
Changes since v2:
- Ensure _{relaxed,acquire,release} qualifiers are used
- Rebased onto arm64/for-next/fixes (v5.3-rc3)
Changes since v1:
- Move LL/SC atomics to a subsection when being used as a fallback
- Rebased onto arm64/for-next/fixes
Andrew Murray (5):
jump_label: Don't warn on __exit jump entries
arm64: Use correct ll/sc atomic constraints
arm64: atomics: avoid out-of-line ll/sc atomics
arm64: avoid using hard-coded registers for LSE atomics
arm64: atomics: remove atomic_ll_sc compilation unit
arch/arm64/include/asm/atomic.h | 11 +-
arch/arm64/include/asm/atomic_arch.h | 154 ++++++++++
arch/arm64/include/asm/atomic_ll_sc.h | 200 ++++++-------
arch/arm64/include/asm/atomic_lse.h | 395 +++++++++-----------------
arch/arm64/include/asm/cmpxchg.h | 2 +-
arch/arm64/include/asm/lse.h | 11 -
arch/arm64/lib/Makefile | 19 --
arch/arm64/lib/atomic_ll_sc.c | 3 -
kernel/jump_label.c | 4 +-
9 files changed, 398 insertions(+), 401 deletions(-)
create mode 100644 arch/arm64/include/asm/atomic_arch.h
delete mode 100644 arch/arm64/lib/atomic_ll_sc.c
--
2.21.0