* [PATCH v2 0/5] arm64: avoid out-of-line ll/sc atomics
@ 2019-07-31 16:12 Andrew Murray
  2019-07-31 16:12 ` [PATCH v2 1/5] jump_label: Don't warn on __exit jump entries Andrew Murray
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Andrew Murray @ 2019-07-31 16:12 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Peter Zijlstra, Ard.Biesheuvel
  Cc: Mark Rutland, Boqun Feng, linux-arm-kernel

When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
or toolchain doesn't support it the existing code will fall back to ll/sc
atomics. It achieves this by branching from inline assembly to a function
that is built with special compile flags. Furthermore, this results in
registers being clobbered even when the fallback isn't used, which
increases register pressure.

Let's improve this by providing inline implementations of both LSE and
ll/sc atomics and using a static key to select between them. This allows
the compiler to generate better atomics code.
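
The selection boils down to the following (condensed from the new
atomic_arch.h introduced in patch 3; the real header also wraps the
return, fetch, cmpxchg and 64-bit variants in the same way):

#define __lse_ll_sc_body(op, ...)					\
({									\
	system_uses_lse_atomics() ?					\
		__lse_##op(__VA_ARGS__) :				\
		__ll_sc_##op(__VA_ARGS__);				\
})

/* e.g. the plain atomic ops become trivial dispatchers */
static inline void arch_atomic_add(int i, atomic_t *v)
{
	__lse_ll_sc_body(atomic_add, i, v);
}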

Whilst the performance impact may be difficult to quantify, we gain
improved code readability, the ability to use Clang, and improved
backtrace reliability.

Build and boot tested, along with atomic_64_test.
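
For reference, the test function disassembled below is assumed to look
roughly like the sketch here (the exact source isn't part of this
posting and the counter name is made up):

static atomic_t test_counter = ATOMIC_INIT(0);

void atomics_test(void)
{
	atomic_add(1, &test_counter);
	atomic_add(1, &test_counter);
	atomic_add(1, &test_counter);
}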

Following is the assembly of a function that has three consecutive
atomic_add calls when built with LSE and this patchset:

Dump of assembler code for function atomics_test:
   0xffff000010084338 <+0>:     b       0xffff000010084388 <atomics_test+80>
   0xffff00001008433c <+4>:     b       0xffff000010084388 <atomics_test+80>
   0xffff000010084340 <+8>:     adrp    x0, 0xffff0000118d5000 <reset_devices>
   0xffff000010084344 <+12>:    add     x2, x0, #0x0
   0xffff000010084348 <+16>:    mov     w1, #0x1                        // #1
   0xffff00001008434c <+20>:    add     x3, x2, #0x28
   0xffff000010084350 <+24>:    stadd   w1, [x3]
   0xffff000010084354 <+28>:    b       0xffff00001008439c <atomics_test+100>
   0xffff000010084358 <+32>:    b       0xffff00001008439c <atomics_test+100>
   0xffff00001008435c <+36>:    add     x1, x0, #0x0
   0xffff000010084360 <+40>:    mov     w2, #0x1                        // #1
   0xffff000010084364 <+44>:    add     x3, x1, #0x28
   0xffff000010084368 <+48>:    stadd   w2, [x3]
   0xffff00001008436c <+52>:    b       0xffff0000100843ac <atomics_test+116>
   0xffff000010084370 <+56>:    b       0xffff0000100843ac <atomics_test+116>
   0xffff000010084374 <+60>:    add     x0, x0, #0x0
   0xffff000010084378 <+64>:    mov     w1, #0x1                        // #1
   0xffff00001008437c <+68>:    add     x2, x0, #0x28
   0xffff000010084380 <+72>:    stadd   w1, [x2]
   0xffff000010084384 <+76>:    ret
   0xffff000010084388 <+80>:    adrp    x0, 0xffff0000118d5000 <reset_devices>
   0xffff00001008438c <+84>:    add     x1, x0, #0x0
   0xffff000010084390 <+88>:    add     x1, x1, #0x28
   0xffff000010084394 <+92>:    b       0xffff000010084570
   0xffff000010084398 <+96>:    b       0xffff000010084354 <atomics_test+28>
   0xffff00001008439c <+100>:   add     x1, x0, #0x0
   0xffff0000100843a0 <+104>:   add     x1, x1, #0x28
   0xffff0000100843a4 <+108>:   b       0xffff000010084588
   0xffff0000100843a8 <+112>:   b       0xffff00001008436c <atomics_test+52>
   0xffff0000100843ac <+116>:   add     x0, x0, #0x0
   0xffff0000100843b0 <+120>:   add     x0, x0, #0x28
   0xffff0000100843b4 <+124>:   b       0xffff0000100845a0
   0xffff0000100843b8 <+128>:   ret
End of assembler dump.

ffff000010084570:       f9800031        prfm    pstl1strm, [x1]
ffff000010084574:       885f7c22        ldxr    w2, [x1]
ffff000010084578:       11000442        add     w2, w2, #0x1
ffff00001008457c:       88037c22        stxr    w3, w2, [x1]
ffff000010084580:       35ffffa3        cbnz    w3, ffff000010084574 <do_one_initcall+0x1b4>
ffff000010084584:       17ffff85        b       ffff000010084398 <atomics_test+0x60>
ffff000010084588:       f9800031        prfm    pstl1strm, [x1]
ffff00001008458c:       885f7c22        ldxr    w2, [x1]
ffff000010084590:       11000442        add     w2, w2, #0x1
ffff000010084594:       88037c22        stxr    w3, w2, [x1]
ffff000010084598:       35ffffa3        cbnz    w3, ffff00001008458c <do_one_initcall+0x1cc>
ffff00001008459c:       17ffff83        b       ffff0000100843a8 <atomics_test+0x70>
ffff0000100845a0:       f9800011        prfm    pstl1strm, [x0]
ffff0000100845a4:       885f7c01        ldxr    w1, [x0]
ffff0000100845a8:       11000421        add     w1, w1, #0x1
ffff0000100845ac:       88027c01        stxr    w2, w1, [x0]
ffff0000100845b0:       35ffffa2        cbnz    w2, ffff0000100845a4 <do_one_initcall+0x1e4>
ffff0000100845b4:       17ffff81        b       ffff0000100843b8 <atomics_test+0x80>

The two branches before each section of atomics relate to the two static
keys, which both become NOPs when LSE is available. When LSE isn't
available, the branches are used to run the slowpath LL/SC fallback
atomics.
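
For reference, the static key test from patch 3 that generates those two
branches is:

static inline bool system_uses_lse_atomics(void)
{
	return (IS_ENABLED(CONFIG_ARM64_LSE_ATOMICS) &&
		IS_ENABLED(CONFIG_AS_LSE) &&
		static_branch_likely(&arm64_const_caps_ready)) &&
		static_branch_likely(&cpu_hwcap_keys[ARM64_HAS_LSE_ATOMICS]);
}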

In v1 of this series, due to the use of likely/unlikely for the LSE code,
the fallback code ended up in one place at the end of the function. In this
v2 patchset we move the fallback code into its own subsection, which moves
the fallback atomics code to the end of each compilation unit. It is felt
that this may improve icache performance for both LSE and LL/SC.
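
The subsection placement is implemented by wrapping each LL/SC asm body,
as introduced in patch 3:

#if IS_ENABLED(CONFIG_ARM64_LSE_ATOMICS) && IS_ENABLED(CONFIG_AS_LSE)
#define __LL_SC_FALLBACK(asm_ops)					\
"	b	3f\n"							\
"	.subsection	1\n"						\
"3:\n"									\
asm_ops "\n"								\
"	b	4f\n"							\
"	.previous\n"							\
"4:\n"
#else
#define __LL_SC_FALLBACK(asm_ops) asm_ops
#endif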

Where CONFIG_ARM64_LSE_ATOMICS isn't enabled, the same function is as
follows:

Dump of assembler code for function atomics_test:
   0xffff000010084338 <+0>:     adrp    x0, 0xffff000011865000 <reset_devices>
   0xffff00001008433c <+4>:     add     x0, x0, #0x0
   0xffff000010084340 <+8>:     add     x3, x0, #0x28
   0xffff000010084344 <+12>:    prfm    pstl1strm, [x3]
   0xffff000010084348 <+16>:    ldxr    w1, [x3]
   0xffff00001008434c <+20>:    add     w1, w1, #0x1
   0xffff000010084350 <+24>:    stxr    w2, w1, [x3]
   0xffff000010084354 <+28>:    cbnz    w2, 0xffff000010084348 <atomics_test+16>
   0xffff000010084358 <+32>:    prfm    pstl1strm, [x3]
   0xffff00001008435c <+36>:    ldxr    w1, [x3]
   0xffff000010084360 <+40>:    add     w1, w1, #0x1
   0xffff000010084364 <+44>:    stxr    w2, w1, [x3]
   0xffff000010084368 <+48>:    cbnz    w2, 0xffff00001008435c <atomics_test+36>
   0xffff00001008436c <+52>:    prfm    pstl1strm, [x3]
   0xffff000010084370 <+56>:    ldxr    w1, [x3]
   0xffff000010084374 <+60>:    add     w1, w1, #0x1
   0xffff000010084378 <+64>:    stxr    w2, w1, [x3]
   0xffff00001008437c <+68>:    cbnz    w2, 0xffff000010084370 <atomics_test+56>
   0xffff000010084380 <+72>:    ret
End of assembler dump.

These changes add some bloat on defconfig according to bloat-o-meter:

For LSE build (text):
  add/remove: 4/109 grow/shrink: 3398/67 up/down: 151556/-4940
  Total: Before=12759457, After=12906073, chg +1.15%

For LL/SC only build (text):
  add/remove: 2/2 grow/shrink: 1423/57 up/down: 12224/-564 (11660)
  Total: Before=12836417, After=12848077, chg +0.09%

The bloat for LSE is due to the LL/SC fallback atomics now being inlined
at each call site rather than provided out of line.

The bloat for LL/SC seems to be due to patch 2, which changes some assembly
constraints (i.e. moving an immediate to a register).

When comparing the number of data transfer instructions (those starting or
ending with ld or st) in vmlinux, we see a reduction from 30.8% to 30.6%
when applying this series, and no change when CONFIG_ARM64_LSE_ATOMICS isn't
enabled (30.9%). This was a feeble attempt to measure register spilling.

Changes since v1:

 - Move LL/SC atomics to a subsection when being used as a fallback

 - Rebased onto arm64/for-next/fixes


Andrew Murray (5):
  jump_label: Don't warn on __exit jump entries
  arm64: Use correct ll/sc atomic constraints
  arm64: atomics: avoid out-of-line ll/sc atomics
  arm64: avoid using hard-coded registers for LSE atomics
  arm64: atomics: remove atomic_ll_sc compilation unit

 arch/arm64/include/asm/atomic.h       |  11 +-
 arch/arm64/include/asm/atomic_arch.h  | 154 ++++++++++
 arch/arm64/include/asm/atomic_ll_sc.h | 200 ++++++-------
 arch/arm64/include/asm/atomic_lse.h   | 395 +++++++++-----------------
 arch/arm64/include/asm/cmpxchg.h      |   2 +-
 arch/arm64/include/asm/lse.h          |  11 -
 arch/arm64/lib/Makefile               |  19 --
 arch/arm64/lib/atomic_ll_sc.c         |   3 -
 kernel/jump_label.c                   |   4 +-
 9 files changed, 398 insertions(+), 401 deletions(-)
 create mode 100644 arch/arm64/include/asm/atomic_arch.h
 delete mode 100644 arch/arm64/lib/atomic_ll_sc.c

-- 
2.21.0




* [PATCH v2 1/5] jump_label: Don't warn on __exit jump entries
  2019-07-31 16:12 [PATCH v2 0/5] arm64: avoid out-of-line ll/sc atomics Andrew Murray
@ 2019-07-31 16:12 ` Andrew Murray
  2019-07-31 16:41   ` Peter Zijlstra
  2019-07-31 16:12 ` [PATCH v2 2/5] arm64: Use correct ll/sc atomic constraints Andrew Murray
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Andrew Murray @ 2019-07-31 16:12 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Peter Zijlstra, Ard.Biesheuvel
  Cc: Mark Rutland, Boqun Feng, linux-arm-kernel

On architectures that discard .exit.* sections at runtime, a
warning is printed for each jump label that is used within an
in-kernel __exit annotated function:

can't patch jump_label at ehci_hcd_cleanup+0x8/0x3c
WARNING: CPU: 0 PID: 1 at kernel/jump_label.c:410 __jump_label_update+0x12c/0x138

As these functions will never be executed (they are freed along with the
rest of initmem), we do not need to patch them and should not display
any warnings.

The warning is displayed because the test required to satisfy
jump_entry_is_init is based on init_section_contains (__init_begin to
__init_end), whereas the test in __jump_label_update is based on
init_kernel_text (_sinittext to _einittext) via kernel_text_address.

In addition to fixing this, we also remove an out-of-date comment
and use a WARN instead of a WARN_ONCE.

Fixes: 19483677684b ("jump_label: Annotate entries that operate on __init code earlier")
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
---
 kernel/jump_label.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index df3008419a1d..c984078a5c28 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -407,7 +407,9 @@ static bool jump_label_can_update(struct jump_entry *entry, bool init)
 		return false;
 
 	if (!kernel_text_address(jump_entry_code(entry))) {
-		WARN_ONCE(1, "can't patch jump_label at %pS", (void *)jump_entry_code(entry));
+		if (!jump_entry_is_init(entry))
+			WARN_ONCE(1, "can't patch jump_label at %pS",
+				  (void *)jump_entry_code(entry));
 		return false;
 	}
 
-- 
2.21.0




* [PATCH v2 2/5] arm64: Use correct ll/sc atomic constraints
  2019-07-31 16:12 [PATCH v2 0/5] arm64: avoid out-of-line ll/sc atomics Andrew Murray
  2019-07-31 16:12 ` [PATCH v2 1/5] jump_label: Don't warn on __exit jump entries Andrew Murray
@ 2019-07-31 16:12 ` Andrew Murray
  2019-07-31 16:12 ` [PATCH v2 3/5] arm64: atomics: avoid out-of-line ll/sc atomics Andrew Murray
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Andrew Murray @ 2019-07-31 16:12 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Peter Zijlstra, Ard.Biesheuvel
  Cc: Mark Rutland, Boqun Feng, linux-arm-kernel

For many of the ll/sc atomic operations we use the 'I' machine constraint
regardless of the instruction used - this may not be optimal.

Let's add an additional parameter to the ATOMIC_xx macros that allows the
caller to specify an appropriate machine constraint.

Let's also improve __CMPXCHG_CASE by replacing the 'K' constraint with a
caller provided constraint. Please note that whilst we would like to use
the 'K' constraint on 32 bit operations, we choose not to provide any
constraint to avoid a GCC bug which results in a build error.

Earlier versions of GCC (no later than 8.1.0) appear to incorrectly handle
the 'K' constraint for the value 4294967295.
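
For illustration, with the new parameter ATOMIC_OP(add, add, I) expands to
roughly the following (simplified here - the real macro keeps the
__LL_SC_* wrappers). The #constraint token is stringified and pasted next
to "r", giving an "Ir" input operand:

static inline void arch_atomic_add(int i, atomic_t *v)
{
	unsigned long tmp;
	int result;

	asm volatile("// atomic_add\n"
"	prfm	pstl1strm, %2\n"
"1:	ldxr	%w0, %2\n"
"	add	%w0, %w0, %w3\n"
"	stxr	%w1, %w0, %2\n"
"	cbnz	%w1, 1b"
	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
	: "Ir" (i));	/* 'I': valid add immediate, otherwise a register */
}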

Signed-off-by: Andrew Murray <andrew.murray@arm.com>
---
 arch/arm64/include/asm/atomic_ll_sc.h | 89 ++++++++++++++-------------
 1 file changed, 47 insertions(+), 42 deletions(-)

diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index c8c850bc3dfb..4ebff769f3ed 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -26,7 +26,7 @@
  * (the optimize attribute silently ignores these options).
  */
 
-#define ATOMIC_OP(op, asm_op)						\
+#define ATOMIC_OP(op, asm_op, constraint)				\
 __LL_SC_INLINE void							\
 __LL_SC_PREFIX(arch_atomic_##op(int i, atomic_t *v))			\
 {									\
@@ -40,11 +40,11 @@ __LL_SC_PREFIX(arch_atomic_##op(int i, atomic_t *v))			\
 "	stxr	%w1, %w0, %2\n"						\
 "	cbnz	%w1, 1b"						\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
-	: "Ir" (i));							\
+	: #constraint "r" (i));						\
 }									\
 __LL_SC_EXPORT(arch_atomic_##op);
 
-#define ATOMIC_OP_RETURN(name, mb, acq, rel, cl, op, asm_op)		\
+#define ATOMIC_OP_RETURN(name, mb, acq, rel, cl, op, asm_op, constraint)\
 __LL_SC_INLINE int							\
 __LL_SC_PREFIX(arch_atomic_##op##_return##name(int i, atomic_t *v))	\
 {									\
@@ -59,14 +59,14 @@ __LL_SC_PREFIX(arch_atomic_##op##_return##name(int i, atomic_t *v))	\
 "	cbnz	%w1, 1b\n"						\
 "	" #mb								\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
-	: "Ir" (i)							\
+	: #constraint "r" (i)						\
 	: cl);								\
 									\
 	return result;							\
 }									\
 __LL_SC_EXPORT(arch_atomic_##op##_return##name);
 
-#define ATOMIC_FETCH_OP(name, mb, acq, rel, cl, op, asm_op)		\
+#define ATOMIC_FETCH_OP(name, mb, acq, rel, cl, op, asm_op, constraint)	\
 __LL_SC_INLINE int							\
 __LL_SC_PREFIX(arch_atomic_fetch_##op##name(int i, atomic_t *v))	\
 {									\
@@ -81,7 +81,7 @@ __LL_SC_PREFIX(arch_atomic_fetch_##op##name(int i, atomic_t *v))	\
 "	cbnz	%w2, 1b\n"						\
 "	" #mb								\
 	: "=&r" (result), "=&r" (val), "=&r" (tmp), "+Q" (v->counter)	\
-	: "Ir" (i)							\
+	: #constraint "r" (i)						\
 	: cl);								\
 									\
 	return result;							\
@@ -99,8 +99,8 @@ __LL_SC_EXPORT(arch_atomic_fetch_##op##name);
 	ATOMIC_FETCH_OP (_acquire,        , a,  , "memory", __VA_ARGS__)\
 	ATOMIC_FETCH_OP (_release,        ,  , l, "memory", __VA_ARGS__)
 
-ATOMIC_OPS(add, add)
-ATOMIC_OPS(sub, sub)
+ATOMIC_OPS(add, add, I)
+ATOMIC_OPS(sub, sub, J)
 
 #undef ATOMIC_OPS
 #define ATOMIC_OPS(...)							\
@@ -110,17 +110,17 @@ ATOMIC_OPS(sub, sub)
 	ATOMIC_FETCH_OP (_acquire,        , a,  , "memory", __VA_ARGS__)\
 	ATOMIC_FETCH_OP (_release,        ,  , l, "memory", __VA_ARGS__)
 
-ATOMIC_OPS(and, and)
-ATOMIC_OPS(andnot, bic)
-ATOMIC_OPS(or, orr)
-ATOMIC_OPS(xor, eor)
+ATOMIC_OPS(and, and, K)
+ATOMIC_OPS(andnot, bic, )
+ATOMIC_OPS(or, orr, K)
+ATOMIC_OPS(xor, eor, K)
 
 #undef ATOMIC_OPS
 #undef ATOMIC_FETCH_OP
 #undef ATOMIC_OP_RETURN
 #undef ATOMIC_OP
 
-#define ATOMIC64_OP(op, asm_op)						\
+#define ATOMIC64_OP(op, asm_op, constraint)				\
 __LL_SC_INLINE void							\
 __LL_SC_PREFIX(arch_atomic64_##op(s64 i, atomic64_t *v))		\
 {									\
@@ -134,11 +134,11 @@ __LL_SC_PREFIX(arch_atomic64_##op(s64 i, atomic64_t *v))		\
 "	stxr	%w1, %0, %2\n"						\
 "	cbnz	%w1, 1b"						\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
-	: "Ir" (i));							\
+	: #constraint "r" (i));						\
 }									\
 __LL_SC_EXPORT(arch_atomic64_##op);
 
-#define ATOMIC64_OP_RETURN(name, mb, acq, rel, cl, op, asm_op)		\
+#define ATOMIC64_OP_RETURN(name, mb, acq, rel, cl, op, asm_op, constraint)\
 __LL_SC_INLINE s64							\
 __LL_SC_PREFIX(arch_atomic64_##op##_return##name(s64 i, atomic64_t *v))\
 {									\
@@ -153,14 +153,14 @@ __LL_SC_PREFIX(arch_atomic64_##op##_return##name(s64 i, atomic64_t *v))\
 "	cbnz	%w1, 1b\n"						\
 "	" #mb								\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
-	: "Ir" (i)							\
+	: #constraint "r" (i)						\
 	: cl);								\
 									\
 	return result;							\
 }									\
 __LL_SC_EXPORT(arch_atomic64_##op##_return##name);
 
-#define ATOMIC64_FETCH_OP(name, mb, acq, rel, cl, op, asm_op)		\
+#define ATOMIC64_FETCH_OP(name, mb, acq, rel, cl, op, asm_op, constraint)\
 __LL_SC_INLINE s64							\
 __LL_SC_PREFIX(arch_atomic64_fetch_##op##name(s64 i, atomic64_t *v))	\
 {									\
@@ -175,7 +175,7 @@ __LL_SC_PREFIX(arch_atomic64_fetch_##op##name(s64 i, atomic64_t *v))	\
 "	cbnz	%w2, 1b\n"						\
 "	" #mb								\
 	: "=&r" (result), "=&r" (val), "=&r" (tmp), "+Q" (v->counter)	\
-	: "Ir" (i)							\
+	: #constraint "r" (i)						\
 	: cl);								\
 									\
 	return result;							\
@@ -193,8 +193,8 @@ __LL_SC_EXPORT(arch_atomic64_fetch_##op##name);
 	ATOMIC64_FETCH_OP (_acquire,, a,  , "memory", __VA_ARGS__)	\
 	ATOMIC64_FETCH_OP (_release,,  , l, "memory", __VA_ARGS__)
 
-ATOMIC64_OPS(add, add)
-ATOMIC64_OPS(sub, sub)
+ATOMIC64_OPS(add, add, I)
+ATOMIC64_OPS(sub, sub, J)
 
 #undef ATOMIC64_OPS
 #define ATOMIC64_OPS(...)						\
@@ -204,10 +204,10 @@ ATOMIC64_OPS(sub, sub)
 	ATOMIC64_FETCH_OP (_acquire,, a,  , "memory", __VA_ARGS__)	\
 	ATOMIC64_FETCH_OP (_release,,  , l, "memory", __VA_ARGS__)
 
-ATOMIC64_OPS(and, and)
-ATOMIC64_OPS(andnot, bic)
-ATOMIC64_OPS(or, orr)
-ATOMIC64_OPS(xor, eor)
+ATOMIC64_OPS(and, and, K)
+ATOMIC64_OPS(andnot, bic, )
+ATOMIC64_OPS(or, orr, K)
+ATOMIC64_OPS(xor, eor, K)
 
 #undef ATOMIC64_OPS
 #undef ATOMIC64_FETCH_OP
@@ -237,7 +237,7 @@ __LL_SC_PREFIX(arch_atomic64_dec_if_positive(atomic64_t *v))
 }
 __LL_SC_EXPORT(arch_atomic64_dec_if_positive);
 
-#define __CMPXCHG_CASE(w, sfx, name, sz, mb, acq, rel, cl)		\
+#define __CMPXCHG_CASE(w, sfx, name, sz, mb, acq, rel, cl, constraint)	\
 __LL_SC_INLINE u##sz							\
 __LL_SC_PREFIX(__cmpxchg_case_##name##sz(volatile void *ptr,		\
 					 unsigned long old,		\
@@ -265,29 +265,34 @@ __LL_SC_PREFIX(__cmpxchg_case_##name##sz(volatile void *ptr,		\
 	"2:"								\
 	: [tmp] "=&r" (tmp), [oldval] "=&r" (oldval),			\
 	  [v] "+Q" (*(u##sz *)ptr)					\
-	: [old] "Kr" (old), [new] "r" (new)				\
+	: [old] #constraint "r" (old), [new] "r" (new)			\
 	: cl);								\
 									\
 	return oldval;							\
 }									\
 __LL_SC_EXPORT(__cmpxchg_case_##name##sz);
 
-__CMPXCHG_CASE(w, b,     ,  8,        ,  ,  ,         )
-__CMPXCHG_CASE(w, h,     , 16,        ,  ,  ,         )
-__CMPXCHG_CASE(w,  ,     , 32,        ,  ,  ,         )
-__CMPXCHG_CASE( ,  ,     , 64,        ,  ,  ,         )
-__CMPXCHG_CASE(w, b, acq_,  8,        , a,  , "memory")
-__CMPXCHG_CASE(w, h, acq_, 16,        , a,  , "memory")
-__CMPXCHG_CASE(w,  , acq_, 32,        , a,  , "memory")
-__CMPXCHG_CASE( ,  , acq_, 64,        , a,  , "memory")
-__CMPXCHG_CASE(w, b, rel_,  8,        ,  , l, "memory")
-__CMPXCHG_CASE(w, h, rel_, 16,        ,  , l, "memory")
-__CMPXCHG_CASE(w,  , rel_, 32,        ,  , l, "memory")
-__CMPXCHG_CASE( ,  , rel_, 64,        ,  , l, "memory")
-__CMPXCHG_CASE(w, b,  mb_,  8, dmb ish,  , l, "memory")
-__CMPXCHG_CASE(w, h,  mb_, 16, dmb ish,  , l, "memory")
-__CMPXCHG_CASE(w,  ,  mb_, 32, dmb ish,  , l, "memory")
-__CMPXCHG_CASE( ,  ,  mb_, 64, dmb ish,  , l, "memory")
+/*
+ * Earlier versions of GCC (no later than 8.1.0) appear to incorrectly
+ * handle the 'K' constraint for the value 4294967295 - thus we use no
+ * constraint for 32 bit operations.
+ */
+__CMPXCHG_CASE(w, b,     ,  8,        ,  ,  ,         , )
+__CMPXCHG_CASE(w, h,     , 16,        ,  ,  ,         , )
+__CMPXCHG_CASE(w,  ,     , 32,        ,  ,  ,         , )
+__CMPXCHG_CASE( ,  ,     , 64,        ,  ,  ,         , L)
+__CMPXCHG_CASE(w, b, acq_,  8,        , a,  , "memory", )
+__CMPXCHG_CASE(w, h, acq_, 16,        , a,  , "memory", )
+__CMPXCHG_CASE(w,  , acq_, 32,        , a,  , "memory", )
+__CMPXCHG_CASE( ,  , acq_, 64,        , a,  , "memory", L)
+__CMPXCHG_CASE(w, b, rel_,  8,        ,  , l, "memory", )
+__CMPXCHG_CASE(w, h, rel_, 16,        ,  , l, "memory", )
+__CMPXCHG_CASE(w,  , rel_, 32,        ,  , l, "memory", )
+__CMPXCHG_CASE( ,  , rel_, 64,        ,  , l, "memory", L)
+__CMPXCHG_CASE(w, b,  mb_,  8, dmb ish,  , l, "memory", )
+__CMPXCHG_CASE(w, h,  mb_, 16, dmb ish,  , l, "memory", )
+__CMPXCHG_CASE(w,  ,  mb_, 32, dmb ish,  , l, "memory", )
+__CMPXCHG_CASE( ,  ,  mb_, 64, dmb ish,  , l, "memory", L)
 
 #undef __CMPXCHG_CASE
 
-- 
2.21.0




* [PATCH v2 3/5] arm64: atomics: avoid out-of-line ll/sc atomics
  2019-07-31 16:12 [PATCH v2 0/5] arm64: avoid out-of-line ll/sc atomics Andrew Murray
  2019-07-31 16:12 ` [PATCH v2 1/5] jump_label: Don't warn on __exit jump entries Andrew Murray
  2019-07-31 16:12 ` [PATCH v2 2/5] arm64: Use correct ll/sc atomic constraints Andrew Murray
@ 2019-07-31 16:12 ` Andrew Murray
  2019-08-01  3:10   ` Boqun Feng
  2019-07-31 16:12 ` [PATCH v2 4/5] arm64: avoid using hard-coded registers for LSE atomics Andrew Murray
  2019-07-31 16:12 ` [PATCH v2 5/5] arm64: atomics: remove atomic_ll_sc compilation unit Andrew Murray
  4 siblings, 1 reply; 10+ messages in thread
From: Andrew Murray @ 2019-07-31 16:12 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Peter Zijlstra, Ard.Biesheuvel
  Cc: Mark Rutland, Boqun Feng, linux-arm-kernel

When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
or toolchain doesn't support it the existing code will fall back to ll/sc
atomics. It achieves this by branching from inline assembly to a function
that is built with special compile flags. Furthermore, this results in
registers being clobbered even when the fallback isn't used, which
increases register pressure.

Let's improve this by providing inline implementations of both LSE and
ll/sc atomics and using a static key to select between them. This allows
the compiler to generate better atomics code.

To improve icache performance for the LL/SC fallback atomics, we put them
in their own subsection.

Please note that as atomic_arch.h is included indirectly by kernel.h
(via bitops.h), we cannot depend on features provided later in the
kernel.h file. This prevents us from placing the system_uses_lse_atomics
function in cpufeature.h due to its dependencies.

Signed-off-by: Andrew Murray <andrew.murray@arm.com>
---
 arch/arm64/include/asm/atomic.h       |  11 +-
 arch/arm64/include/asm/atomic_arch.h  | 154 +++++++++++
 arch/arm64/include/asm/atomic_ll_sc.h | 113 ++++----
 arch/arm64/include/asm/atomic_lse.h   | 365 ++++++++------------------
 arch/arm64/include/asm/cmpxchg.h      |   2 +-
 arch/arm64/include/asm/lse.h          |  11 -
 6 files changed, 328 insertions(+), 328 deletions(-)
 create mode 100644 arch/arm64/include/asm/atomic_arch.h

diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h
index 657b0457d83c..c70d3f389d29 100644
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -17,16 +17,7 @@
 
 #ifdef __KERNEL__
 
-#define __ARM64_IN_ATOMIC_IMPL
-
-#if defined(CONFIG_ARM64_LSE_ATOMICS) && defined(CONFIG_AS_LSE)
-#include <asm/atomic_lse.h>
-#else
-#include <asm/atomic_ll_sc.h>
-#endif
-
-#undef __ARM64_IN_ATOMIC_IMPL
-
+#include <asm/atomic_arch.h>
 #include <asm/cmpxchg.h>
 
 #define ATOMIC_INIT(i)	{ (i) }
diff --git a/arch/arm64/include/asm/atomic_arch.h b/arch/arm64/include/asm/atomic_arch.h
new file mode 100644
index 000000000000..4955dcf3634c
--- /dev/null
+++ b/arch/arm64/include/asm/atomic_arch.h
@@ -0,0 +1,154 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Selection between LSE and LL/SC atomics.
+ *
+ * Copyright (C) 2018 ARM Ltd.
+ * Author: Andrew Murray <andrew.murray@arm.com>
+ */
+
+#ifndef __ASM_ATOMIC_ARCH_H
+#define __ASM_ATOMIC_ARCH_H
+
+#include <asm/atomic_lse.h>
+#include <asm/atomic_ll_sc.h>
+
+#include <linux/jump_label.h>
+#include <asm/cpucaps.h>
+
+extern struct static_key_false cpu_hwcap_keys[ARM64_NCAPS];
+extern struct static_key_false arm64_const_caps_ready;
+
+static inline bool system_uses_lse_atomics(void)
+{
+	return (IS_ENABLED(CONFIG_ARM64_LSE_ATOMICS) &&
+		IS_ENABLED(CONFIG_AS_LSE) &&
+		static_branch_likely(&arm64_const_caps_ready)) &&
+		static_branch_likely(&cpu_hwcap_keys[ARM64_HAS_LSE_ATOMICS]);
+}
+
+#define __lse_ll_sc_body(op, ...)					\
+({									\
+	system_uses_lse_atomics() ?					\
+		__lse_##op(__VA_ARGS__) :				\
+		__ll_sc_##op(__VA_ARGS__);				\
+})
+
+#define ATOMIC_OP(op)							\
+static inline void arch_##op(int i, atomic_t *v)			\
+{									\
+	__lse_ll_sc_body(op, i, v);					\
+}
+
+ATOMIC_OP(atomic_andnot)
+ATOMIC_OP(atomic_or)
+ATOMIC_OP(atomic_xor)
+ATOMIC_OP(atomic_add)
+ATOMIC_OP(atomic_and)
+ATOMIC_OP(atomic_sub)
+
+
+#define ATOMIC_FETCH_OP(name, op)					\
+static inline int arch_##op##name(int i, atomic_t *v)			\
+{									\
+	return __lse_ll_sc_body(op, i, v);				\
+}
+
+#define ATOMIC_FETCH_OPS(op)						\
+	ATOMIC_FETCH_OP(_relaxed, op)					\
+	ATOMIC_FETCH_OP(_acquire, op)					\
+	ATOMIC_FETCH_OP(_release, op)					\
+	ATOMIC_FETCH_OP(        , op)
+
+ATOMIC_FETCH_OPS(atomic_fetch_andnot)
+ATOMIC_FETCH_OPS(atomic_fetch_or)
+ATOMIC_FETCH_OPS(atomic_fetch_xor)
+ATOMIC_FETCH_OPS(atomic_fetch_add)
+ATOMIC_FETCH_OPS(atomic_fetch_and)
+ATOMIC_FETCH_OPS(atomic_fetch_sub)
+ATOMIC_FETCH_OPS(atomic_add_return)
+ATOMIC_FETCH_OPS(atomic_sub_return)
+
+
+#define ATOMIC64_OP(op)							\
+static inline void arch_##op(long i, atomic64_t *v)			\
+{									\
+	__lse_ll_sc_body(op, i, v);					\
+}
+
+ATOMIC64_OP(atomic64_andnot)
+ATOMIC64_OP(atomic64_or)
+ATOMIC64_OP(atomic64_xor)
+ATOMIC64_OP(atomic64_add)
+ATOMIC64_OP(atomic64_and)
+ATOMIC64_OP(atomic64_sub)
+
+
+#define ATOMIC64_FETCH_OP(name, op)					\
+static inline long arch_##op##name(long i, atomic64_t *v)		\
+{									\
+	return __lse_ll_sc_body(op, i, v);				\
+}
+
+#define ATOMIC64_FETCH_OPS(op)						\
+	ATOMIC64_FETCH_OP(_relaxed, op)					\
+	ATOMIC64_FETCH_OP(_acquire, op)					\
+	ATOMIC64_FETCH_OP(_release, op)					\
+	ATOMIC64_FETCH_OP(        , op)
+
+ATOMIC64_FETCH_OPS(atomic64_fetch_andnot)
+ATOMIC64_FETCH_OPS(atomic64_fetch_or)
+ATOMIC64_FETCH_OPS(atomic64_fetch_xor)
+ATOMIC64_FETCH_OPS(atomic64_fetch_add)
+ATOMIC64_FETCH_OPS(atomic64_fetch_and)
+ATOMIC64_FETCH_OPS(atomic64_fetch_sub)
+ATOMIC64_FETCH_OPS(atomic64_add_return)
+ATOMIC64_FETCH_OPS(atomic64_sub_return)
+
+
+static inline long arch_atomic64_dec_if_positive(atomic64_t *v)
+{
+	return __lse_ll_sc_body(atomic64_dec_if_positive, v);
+}
+
+#define __CMPXCHG_CASE(name, sz)			\
+static inline u##sz __cmpxchg_case_##name##sz(volatile void *ptr,	\
+					      u##sz old,		\
+					      u##sz new)		\
+{									\
+	return __lse_ll_sc_body(_cmpxchg_case_##name##sz,		\
+				ptr, old, new);				\
+}
+
+__CMPXCHG_CASE(    ,  8)
+__CMPXCHG_CASE(    , 16)
+__CMPXCHG_CASE(    , 32)
+__CMPXCHG_CASE(    , 64)
+__CMPXCHG_CASE(acq_,  8)
+__CMPXCHG_CASE(acq_, 16)
+__CMPXCHG_CASE(acq_, 32)
+__CMPXCHG_CASE(acq_, 64)
+__CMPXCHG_CASE(rel_,  8)
+__CMPXCHG_CASE(rel_, 16)
+__CMPXCHG_CASE(rel_, 32)
+__CMPXCHG_CASE(rel_, 64)
+__CMPXCHG_CASE(mb_,  8)
+__CMPXCHG_CASE(mb_, 16)
+__CMPXCHG_CASE(mb_, 32)
+__CMPXCHG_CASE(mb_, 64)
+
+
+#define __CMPXCHG_DBL(name)						\
+static inline long __cmpxchg_double##name(unsigned long old1,		\
+					 unsigned long old2,		\
+					 unsigned long new1,		\
+					 unsigned long new2,		\
+					 volatile void *ptr)		\
+{									\
+	return __lse_ll_sc_body(_cmpxchg_double##name, 			\
+				old1, old2, new1, new2, ptr);		\
+}
+
+__CMPXCHG_DBL(   )
+__CMPXCHG_DBL(_mb)
+
+#endif	/* __ASM_ATOMIC_LSE_H */
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index 4ebff769f3ed..d721d0fc49a5 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -10,83 +10,86 @@
 #ifndef __ASM_ATOMIC_LL_SC_H
 #define __ASM_ATOMIC_LL_SC_H
 
-#ifndef __ARM64_IN_ATOMIC_IMPL
-#error "please don't include this file directly"
+#if IS_ENABLED(CONFIG_ARM64_LSE_ATOMICS) && IS_ENABLED(CONFIG_AS_LSE)
+#define __LL_SC_FALLBACK(asm_ops)					\
+"	b	3f\n"							\
+"	.subsection	1\n"						\
+"3:\n"									\
+asm_ops "\n"								\
+"	b	4f\n"							\
+"	.previous\n"							\
+"4:\n"
+#else
+#define __LL_SC_FALLBACK(asm_ops) asm_ops
 #endif
 
 /*
  * AArch64 UP and SMP safe atomic ops.  We use load exclusive and
  * store exclusive to ensure that these are atomic.  We may loop
  * to ensure that the update happens.
- *
- * NOTE: these functions do *not* follow the PCS and must explicitly
- * save any clobbered registers other than x0 (regardless of return
- * value).  This is achieved through -fcall-saved-* compiler flags for
- * this file, which unfortunately don't work on a per-function basis
- * (the optimize attribute silently ignores these options).
  */
 
 #define ATOMIC_OP(op, asm_op, constraint)				\
-__LL_SC_INLINE void							\
-__LL_SC_PREFIX(arch_atomic_##op(int i, atomic_t *v))			\
+static inline void							\
+__ll_sc_atomic_##op(int i, atomic_t *v)					\
 {									\
 	unsigned long tmp;						\
 	int result;							\
 									\
 	asm volatile("// atomic_" #op "\n"				\
+	__LL_SC_FALLBACK(						\
 "	prfm	pstl1strm, %2\n"					\
 "1:	ldxr	%w0, %2\n"						\
 "	" #asm_op "	%w0, %w0, %w3\n"				\
 "	stxr	%w1, %w0, %2\n"						\
-"	cbnz	%w1, 1b"						\
+"	cbnz	%w1, 1b\n")						\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
 	: #constraint "r" (i));						\
-}									\
-__LL_SC_EXPORT(arch_atomic_##op);
+}
 
 #define ATOMIC_OP_RETURN(name, mb, acq, rel, cl, op, asm_op, constraint)\
-__LL_SC_INLINE int							\
-__LL_SC_PREFIX(arch_atomic_##op##_return##name(int i, atomic_t *v))	\
+static inline int							\
+__ll_sc_atomic_##op##_return##name(int i, atomic_t *v)			\
 {									\
 	unsigned long tmp;						\
 	int result;							\
 									\
 	asm volatile("// atomic_" #op "_return" #name "\n"		\
+	__LL_SC_FALLBACK(						\
 "	prfm	pstl1strm, %2\n"					\
 "1:	ld" #acq "xr	%w0, %2\n"					\
 "	" #asm_op "	%w0, %w0, %w3\n"				\
 "	st" #rel "xr	%w1, %w0, %2\n"					\
 "	cbnz	%w1, 1b\n"						\
-"	" #mb								\
+"	" #mb )								\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
 	: #constraint "r" (i)						\
 	: cl);								\
 									\
 	return result;							\
-}									\
-__LL_SC_EXPORT(arch_atomic_##op##_return##name);
+}
 
-#define ATOMIC_FETCH_OP(name, mb, acq, rel, cl, op, asm_op, constraint)	\
-__LL_SC_INLINE int							\
-__LL_SC_PREFIX(arch_atomic_fetch_##op##name(int i, atomic_t *v))	\
+#define ATOMIC_FETCH_OP(name, mb, acq, rel, cl, op, asm_op, constraint) \
+static inline int							\
+__ll_sc_atomic_fetch_##op##name(int i, atomic_t *v)			\
 {									\
 	unsigned long tmp;						\
 	int val, result;						\
 									\
 	asm volatile("// atomic_fetch_" #op #name "\n"			\
+	__LL_SC_FALLBACK(						\
 "	prfm	pstl1strm, %3\n"					\
 "1:	ld" #acq "xr	%w0, %3\n"					\
 "	" #asm_op "	%w1, %w0, %w4\n"				\
 "	st" #rel "xr	%w2, %w1, %3\n"					\
 "	cbnz	%w2, 1b\n"						\
-"	" #mb								\
+"	" #mb )								\
 	: "=&r" (result), "=&r" (val), "=&r" (tmp), "+Q" (v->counter)	\
 	: #constraint "r" (i)						\
 	: cl);								\
 									\
 	return result;							\
-}									\
-__LL_SC_EXPORT(arch_atomic_fetch_##op##name);
+}
 
 #define ATOMIC_OPS(...)							\
 	ATOMIC_OP(__VA_ARGS__)						\
@@ -121,66 +124,66 @@ ATOMIC_OPS(xor, eor, K)
 #undef ATOMIC_OP
 
 #define ATOMIC64_OP(op, asm_op, constraint)				\
-__LL_SC_INLINE void							\
-__LL_SC_PREFIX(arch_atomic64_##op(s64 i, atomic64_t *v))		\
+static inline void							\
+__ll_sc_atomic64_##op(s64 i, atomic64_t *v)				\
 {									\
 	s64 result;							\
 	unsigned long tmp;						\
 									\
 	asm volatile("// atomic64_" #op "\n"				\
+	__LL_SC_FALLBACK(						\
 "	prfm	pstl1strm, %2\n"					\
 "1:	ldxr	%0, %2\n"						\
 "	" #asm_op "	%0, %0, %3\n"					\
 "	stxr	%w1, %0, %2\n"						\
-"	cbnz	%w1, 1b"						\
+"	cbnz	%w1, 1b")						\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
 	: #constraint "r" (i));						\
-}									\
-__LL_SC_EXPORT(arch_atomic64_##op);
+}
 
 #define ATOMIC64_OP_RETURN(name, mb, acq, rel, cl, op, asm_op, constraint)\
-__LL_SC_INLINE s64							\
-__LL_SC_PREFIX(arch_atomic64_##op##_return##name(s64 i, atomic64_t *v))\
+static inline long							\
+__ll_sc_atomic64_##op##_return##name(s64 i, atomic64_t *v)		\
 {									\
 	s64 result;							\
 	unsigned long tmp;						\
 									\
 	asm volatile("// atomic64_" #op "_return" #name "\n"		\
+	__LL_SC_FALLBACK(						\
 "	prfm	pstl1strm, %2\n"					\
 "1:	ld" #acq "xr	%0, %2\n"					\
 "	" #asm_op "	%0, %0, %3\n"					\
 "	st" #rel "xr	%w1, %0, %2\n"					\
 "	cbnz	%w1, 1b\n"						\
-"	" #mb								\
+"	" #mb )								\
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)		\
 	: #constraint "r" (i)						\
 	: cl);								\
 									\
 	return result;							\
-}									\
-__LL_SC_EXPORT(arch_atomic64_##op##_return##name);
+}
 
 #define ATOMIC64_FETCH_OP(name, mb, acq, rel, cl, op, asm_op, constraint)\
-__LL_SC_INLINE s64							\
-__LL_SC_PREFIX(arch_atomic64_fetch_##op##name(s64 i, atomic64_t *v))	\
+static inline long							\
+__ll_sc_atomic64_fetch_##op##name(s64 i, atomic64_t *v)		\
 {									\
 	s64 result, val;						\
 	unsigned long tmp;						\
 									\
 	asm volatile("// atomic64_fetch_" #op #name "\n"		\
+	__LL_SC_FALLBACK(						\
 "	prfm	pstl1strm, %3\n"					\
 "1:	ld" #acq "xr	%0, %3\n"					\
 "	" #asm_op "	%1, %0, %4\n"					\
 "	st" #rel "xr	%w2, %1, %3\n"					\
 "	cbnz	%w2, 1b\n"						\
-"	" #mb								\
+"	" #mb )								\
 	: "=&r" (result), "=&r" (val), "=&r" (tmp), "+Q" (v->counter)	\
 	: #constraint "r" (i)						\
 	: cl);								\
 									\
 	return result;							\
-}									\
-__LL_SC_EXPORT(arch_atomic64_fetch_##op##name);
+}
 
 #define ATOMIC64_OPS(...)						\
 	ATOMIC64_OP(__VA_ARGS__)					\
@@ -214,13 +217,14 @@ ATOMIC64_OPS(xor, eor, K)
 #undef ATOMIC64_OP_RETURN
 #undef ATOMIC64_OP
 
-__LL_SC_INLINE s64
-__LL_SC_PREFIX(arch_atomic64_dec_if_positive(atomic64_t *v))
+static inline s64
+__ll_sc_atomic64_dec_if_positive(atomic64_t *v)
 {
 	s64 result;
 	unsigned long tmp;
 
 	asm volatile("// atomic64_dec_if_positive\n"
+	__LL_SC_FALLBACK(
 "	prfm	pstl1strm, %2\n"
 "1:	ldxr	%0, %2\n"
 "	subs	%0, %0, #1\n"
@@ -228,20 +232,19 @@ __LL_SC_PREFIX(arch_atomic64_dec_if_positive(atomic64_t *v))
 "	stlxr	%w1, %0, %2\n"
 "	cbnz	%w1, 1b\n"
 "	dmb	ish\n"
-"2:"
+"2:")
 	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
 	:
 	: "cc", "memory");
 
 	return result;
 }
-__LL_SC_EXPORT(arch_atomic64_dec_if_positive);
 
 #define __CMPXCHG_CASE(w, sfx, name, sz, mb, acq, rel, cl, constraint)	\
-__LL_SC_INLINE u##sz							\
-__LL_SC_PREFIX(__cmpxchg_case_##name##sz(volatile void *ptr,		\
+static inline u##sz							\
+__ll_sc__cmpxchg_case_##name##sz(volatile void *ptr,			\
 					 unsigned long old,		\
-					 u##sz new))			\
+					 u##sz new)			\
 {									\
 	unsigned long tmp;						\
 	u##sz oldval;							\
@@ -255,6 +258,7 @@ __LL_SC_PREFIX(__cmpxchg_case_##name##sz(volatile void *ptr,		\
 		old = (u##sz)old;					\
 									\
 	asm volatile(							\
+	__LL_SC_FALLBACK(						\
 	"	prfm	pstl1strm, %[v]\n"				\
 	"1:	ld" #acq "xr" #sfx "\t%" #w "[oldval], %[v]\n"		\
 	"	eor	%" #w "[tmp], %" #w "[oldval], %" #w "[old]\n"	\
@@ -262,15 +266,14 @@ __LL_SC_PREFIX(__cmpxchg_case_##name##sz(volatile void *ptr,		\
 	"	st" #rel "xr" #sfx "\t%w[tmp], %" #w "[new], %[v]\n"	\
 	"	cbnz	%w[tmp], 1b\n"					\
 	"	" #mb "\n"						\
-	"2:"								\
+	"2:")								\
 	: [tmp] "=&r" (tmp), [oldval] "=&r" (oldval),			\
 	  [v] "+Q" (*(u##sz *)ptr)					\
 	: [old] #constraint "r" (old), [new] "r" (new)			\
 	: cl);								\
 									\
 	return oldval;							\
-}									\
-__LL_SC_EXPORT(__cmpxchg_case_##name##sz);
+}
 
 /*
  * Earlier versions of GCC (no later than 8.1.0) appear to incorrectly
@@ -297,16 +300,17 @@ __CMPXCHG_CASE( ,  ,  mb_, 64, dmb ish,  , l, "memory", L)
 #undef __CMPXCHG_CASE
 
 #define __CMPXCHG_DBL(name, mb, rel, cl)				\
-__LL_SC_INLINE long							\
-__LL_SC_PREFIX(__cmpxchg_double##name(unsigned long old1,		\
+static inline long							\
+__ll_sc__cmpxchg_double##name(unsigned long old1,			\
 				      unsigned long old2,		\
 				      unsigned long new1,		\
 				      unsigned long new2,		\
-				      volatile void *ptr))		\
+				      volatile void *ptr)		\
 {									\
 	unsigned long tmp, ret;						\
 									\
 	asm volatile("// __cmpxchg_double" #name "\n"			\
+	__LL_SC_FALLBACK(						\
 	"	prfm	pstl1strm, %2\n"				\
 	"1:	ldxp	%0, %1, %2\n"					\
 	"	eor	%0, %0, %3\n"					\
@@ -316,14 +320,13 @@ __LL_SC_PREFIX(__cmpxchg_double##name(unsigned long old1,		\
 	"	st" #rel "xp	%w0, %5, %6, %2\n"			\
 	"	cbnz	%w0, 1b\n"					\
 	"	" #mb "\n"						\
-	"2:"								\
+	"2:")								\
 	: "=&r" (tmp), "=&r" (ret), "+Q" (*(unsigned long *)ptr)	\
 	: "r" (old1), "r" (old2), "r" (new1), "r" (new2)		\
 	: cl);								\
 									\
 	return ret;							\
-}									\
-__LL_SC_EXPORT(__cmpxchg_double##name);
+}
 
 __CMPXCHG_DBL(   ,        ,  ,         )
 __CMPXCHG_DBL(_mb, dmb ish, l, "memory")
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index 69acb1c19a15..7dce5e1f074e 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -10,22 +10,13 @@
 #ifndef __ASM_ATOMIC_LSE_H
 #define __ASM_ATOMIC_LSE_H
 
-#ifndef __ARM64_IN_ATOMIC_IMPL
-#error "please don't include this file directly"
-#endif
-
-#define __LL_SC_ATOMIC(op)	__LL_SC_CALL(arch_atomic_##op)
 #define ATOMIC_OP(op, asm_op)						\
-static inline void arch_atomic_##op(int i, atomic_t *v)			\
+static inline void __lse_atomic_##op(int i, atomic_t *v)			\
 {									\
-	register int w0 asm ("w0") = i;					\
-	register atomic_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC(op),		\
-"	" #asm_op "	%w[i], %[v]\n")					\
-	: [i] "+r" (w0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS);						\
+	asm volatile(							\
+"	" #asm_op "	%w[i], %[v]\n"					\
+	: [i] "+r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v));							\
 }
 
 ATOMIC_OP(andnot, stclr)
@@ -36,21 +27,15 @@ ATOMIC_OP(add, stadd)
 #undef ATOMIC_OP
 
 #define ATOMIC_FETCH_OP(name, mb, op, asm_op, cl...)			\
-static inline int arch_atomic_fetch_##op##name(int i, atomic_t *v)	\
+static inline int __lse_atomic_fetch_##op##name(int i, atomic_t *v)	\
 {									\
-	register int w0 asm ("w0") = i;					\
-	register atomic_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_ATOMIC(fetch_##op##name),				\
-	/* LSE atomics */						\
-"	" #asm_op #mb "	%w[i], %w[i], %[v]")				\
-	: [i] "+r" (w0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	asm volatile(							\
+"	" #asm_op #mb "	%w[i], %w[i], %[v]"				\
+	: [i] "+r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v)							\
+	: cl);								\
 									\
-	return w0;							\
+	return i;							\
 }
 
 #define ATOMIC_FETCH_OPS(op, asm_op)					\
@@ -68,23 +53,16 @@ ATOMIC_FETCH_OPS(add, ldadd)
 #undef ATOMIC_FETCH_OPS
 
 #define ATOMIC_OP_ADD_RETURN(name, mb, cl...)				\
-static inline int arch_atomic_add_return##name(int i, atomic_t *v)	\
+static inline int __lse_atomic_add_return##name(int i, atomic_t *v)	\
 {									\
-	register int w0 asm ("w0") = i;					\
-	register atomic_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_ATOMIC(add_return##name)				\
-	__nops(1),							\
-	/* LSE atomics */						\
+	asm volatile(							\
 	"	ldadd" #mb "	%w[i], w30, %[v]\n"			\
-	"	add	%w[i], %w[i], w30")				\
-	: [i] "+r" (w0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	"	add	%w[i], %w[i], w30"				\
+	: [i] "+r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v)							\
+	: "x30", ##cl);							\
 									\
-	return w0;							\
+	return i;							\
 }
 
 ATOMIC_OP_ADD_RETURN(_relaxed,   )
@@ -94,41 +72,26 @@ ATOMIC_OP_ADD_RETURN(        , al, "memory")
 
 #undef ATOMIC_OP_ADD_RETURN
 
-static inline void arch_atomic_and(int i, atomic_t *v)
+static inline void __lse_atomic_and(int i, atomic_t *v)
 {
-	register int w0 asm ("w0") = i;
-	register atomic_t *x1 asm ("x1") = v;
-
-	asm volatile(ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-	__LL_SC_ATOMIC(and)
-	__nops(1),
-	/* LSE atomics */
+	asm volatile(
 	"	mvn	%w[i], %w[i]\n"
-	"	stclr	%w[i], %[v]")
-	: [i] "+&r" (w0), [v] "+Q" (v->counter)
-	: "r" (x1)
-	: __LL_SC_CLOBBERS);
+	"	stclr	%w[i], %[v]"
+	: [i] "+&r" (i), [v] "+Q" (v->counter)
+	: "r" (v));
 }
 
 #define ATOMIC_FETCH_OP_AND(name, mb, cl...)				\
-static inline int arch_atomic_fetch_and##name(int i, atomic_t *v)	\
+static inline int __lse_atomic_fetch_and##name(int i, atomic_t *v)	\
 {									\
-	register int w0 asm ("w0") = i;					\
-	register atomic_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_ATOMIC(fetch_and##name)					\
-	__nops(1),							\
-	/* LSE atomics */						\
+	asm volatile(							\
 	"	mvn	%w[i], %w[i]\n"					\
-	"	ldclr" #mb "	%w[i], %w[i], %[v]")			\
-	: [i] "+&r" (w0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	"	ldclr" #mb "	%w[i], %w[i], %[v]"			\
+	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v)							\
+	: cl);								\
 									\
-	return w0;							\
+	return i;							\
 }
 
 ATOMIC_FETCH_OP_AND(_relaxed,   )
@@ -138,42 +101,27 @@ ATOMIC_FETCH_OP_AND(        , al, "memory")
 
 #undef ATOMIC_FETCH_OP_AND
 
-static inline void arch_atomic_sub(int i, atomic_t *v)
+static inline void __lse_atomic_sub(int i, atomic_t *v)
 {
-	register int w0 asm ("w0") = i;
-	register atomic_t *x1 asm ("x1") = v;
-
-	asm volatile(ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-	__LL_SC_ATOMIC(sub)
-	__nops(1),
-	/* LSE atomics */
+	asm volatile(
 	"	neg	%w[i], %w[i]\n"
-	"	stadd	%w[i], %[v]")
-	: [i] "+&r" (w0), [v] "+Q" (v->counter)
-	: "r" (x1)
-	: __LL_SC_CLOBBERS);
+	"	stadd	%w[i], %[v]"
+	: [i] "+&r" (i), [v] "+Q" (v->counter)
+	: "r" (v));
 }
 
 #define ATOMIC_OP_SUB_RETURN(name, mb, cl...)				\
-static inline int arch_atomic_sub_return##name(int i, atomic_t *v)	\
+static inline int __lse_atomic_sub_return##name(int i, atomic_t *v)	\
 {									\
-	register int w0 asm ("w0") = i;					\
-	register atomic_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_ATOMIC(sub_return##name)				\
-	__nops(2),							\
-	/* LSE atomics */						\
+	asm volatile(							\
 	"	neg	%w[i], %w[i]\n"					\
 	"	ldadd" #mb "	%w[i], w30, %[v]\n"			\
-	"	add	%w[i], %w[i], w30")				\
-	: [i] "+&r" (w0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS , ##cl);					\
+	"	add	%w[i], %w[i], w30"				\
+	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v)							\
+	: "x30", ##cl);							\
 									\
-	return w0;							\
+	return i;							\
 }
 
 ATOMIC_OP_SUB_RETURN(_relaxed,   )
@@ -184,23 +132,16 @@ ATOMIC_OP_SUB_RETURN(        , al, "memory")
 #undef ATOMIC_OP_SUB_RETURN
 
 #define ATOMIC_FETCH_OP_SUB(name, mb, cl...)				\
-static inline int arch_atomic_fetch_sub##name(int i, atomic_t *v)	\
+static inline int __lse_atomic_fetch_sub##name(int i, atomic_t *v)	\
 {									\
-	register int w0 asm ("w0") = i;					\
-	register atomic_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_ATOMIC(fetch_sub##name)					\
-	__nops(1),							\
-	/* LSE atomics */						\
+	asm volatile(							\
 	"	neg	%w[i], %w[i]\n"					\
-	"	ldadd" #mb "	%w[i], %w[i], %[v]")			\
-	: [i] "+&r" (w0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	"	ldadd" #mb "	%w[i], %w[i], %[v]"			\
+	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v)							\
+	: cl);								\
 									\
-	return w0;							\
+	return i;							\
 }
 
 ATOMIC_FETCH_OP_SUB(_relaxed,   )
@@ -209,20 +150,14 @@ ATOMIC_FETCH_OP_SUB(_release,  l, "memory")
 ATOMIC_FETCH_OP_SUB(        , al, "memory")
 
 #undef ATOMIC_FETCH_OP_SUB
-#undef __LL_SC_ATOMIC
 
-#define __LL_SC_ATOMIC64(op)	__LL_SC_CALL(arch_atomic64_##op)
 #define ATOMIC64_OP(op, asm_op)						\
-static inline void arch_atomic64_##op(s64 i, atomic64_t *v)		\
+static inline void __lse_atomic64_##op(s64 i, atomic64_t *v)		\
 {									\
-	register s64 x0 asm ("x0") = i;					\
-	register atomic64_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC64(op),	\
-"	" #asm_op "	%[i], %[v]\n")					\
-	: [i] "+r" (x0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS);						\
+	asm volatile(							\
+"	" #asm_op "	%[i], %[v]\n"					\
+	: [i] "+r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v));							\
 }
 
 ATOMIC64_OP(andnot, stclr)
@@ -233,21 +168,15 @@ ATOMIC64_OP(add, stadd)
 #undef ATOMIC64_OP
 
 #define ATOMIC64_FETCH_OP(name, mb, op, asm_op, cl...)			\
-static inline s64 arch_atomic64_fetch_##op##name(s64 i, atomic64_t *v)	\
+static inline long __lse_atomic64_fetch_##op##name(s64 i, atomic64_t *v)\
 {									\
-	register s64 x0 asm ("x0") = i;					\
-	register atomic64_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_ATOMIC64(fetch_##op##name),				\
-	/* LSE atomics */						\
-"	" #asm_op #mb "	%[i], %[i], %[v]")				\
-	: [i] "+r" (x0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	asm volatile(							\
+"	" #asm_op #mb "	%[i], %[i], %[v]"				\
+	: [i] "+r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v)							\
+	: cl);								\
 									\
-	return x0;							\
+	return i;							\
 }
 
 #define ATOMIC64_FETCH_OPS(op, asm_op)					\
@@ -265,23 +194,16 @@ ATOMIC64_FETCH_OPS(add, ldadd)
 #undef ATOMIC64_FETCH_OPS
 
 #define ATOMIC64_OP_ADD_RETURN(name, mb, cl...)				\
-static inline s64 arch_atomic64_add_return##name(s64 i, atomic64_t *v)	\
+static inline long __lse_atomic64_add_return##name(s64 i, atomic64_t *v)\
 {									\
-	register s64 x0 asm ("x0") = i;					\
-	register atomic64_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_ATOMIC64(add_return##name)				\
-	__nops(1),							\
-	/* LSE atomics */						\
+	asm volatile(							\
 	"	ldadd" #mb "	%[i], x30, %[v]\n"			\
-	"	add	%[i], %[i], x30")				\
-	: [i] "+r" (x0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	"	add	%[i], %[i], x30"				\
+	: [i] "+r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v)							\
+	: "x30", ##cl);							\
 									\
-	return x0;							\
+	return i;							\
 }
 
 ATOMIC64_OP_ADD_RETURN(_relaxed,   )
@@ -291,41 +213,26 @@ ATOMIC64_OP_ADD_RETURN(        , al, "memory")
 
 #undef ATOMIC64_OP_ADD_RETURN
 
-static inline void arch_atomic64_and(s64 i, atomic64_t *v)
+static inline void __lse_atomic64_and(s64 i, atomic64_t *v)
 {
-	register s64 x0 asm ("x0") = i;
-	register atomic64_t *x1 asm ("x1") = v;
-
-	asm volatile(ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-	__LL_SC_ATOMIC64(and)
-	__nops(1),
-	/* LSE atomics */
+	asm volatile(
 	"	mvn	%[i], %[i]\n"
-	"	stclr	%[i], %[v]")
-	: [i] "+&r" (x0), [v] "+Q" (v->counter)
-	: "r" (x1)
-	: __LL_SC_CLOBBERS);
+	"	stclr	%[i], %[v]"
+	: [i] "+&r" (i), [v] "+Q" (v->counter)
+	: "r" (v));
 }
 
 #define ATOMIC64_FETCH_OP_AND(name, mb, cl...)				\
-static inline s64 arch_atomic64_fetch_and##name(s64 i, atomic64_t *v)	\
+static inline long __lse_atomic64_fetch_and##name(s64 i, atomic64_t *v)	\
 {									\
-	register s64 x0 asm ("x0") = i;					\
-	register atomic64_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_ATOMIC64(fetch_and##name)				\
-	__nops(1),							\
-	/* LSE atomics */						\
+	asm volatile(							\
 	"	mvn	%[i], %[i]\n"					\
-	"	ldclr" #mb "	%[i], %[i], %[v]")			\
-	: [i] "+&r" (x0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	"	ldclr" #mb "	%[i], %[i], %[v]"			\
+	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v)							\
+	: cl);								\
 									\
-	return x0;							\
+	return i;							\
 }
 
 ATOMIC64_FETCH_OP_AND(_relaxed,   )
@@ -335,42 +242,27 @@ ATOMIC64_FETCH_OP_AND(        , al, "memory")
 
 #undef ATOMIC64_FETCH_OP_AND
 
-static inline void arch_atomic64_sub(s64 i, atomic64_t *v)
+static inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
 {
-	register s64 x0 asm ("x0") = i;
-	register atomic64_t *x1 asm ("x1") = v;
-
-	asm volatile(ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-	__LL_SC_ATOMIC64(sub)
-	__nops(1),
-	/* LSE atomics */
+	asm volatile(
 	"	neg	%[i], %[i]\n"
-	"	stadd	%[i], %[v]")
-	: [i] "+&r" (x0), [v] "+Q" (v->counter)
-	: "r" (x1)
-	: __LL_SC_CLOBBERS);
+	"	stadd	%[i], %[v]"
+	: [i] "+&r" (i), [v] "+Q" (v->counter)
+	: "r" (v));
 }
 
 #define ATOMIC64_OP_SUB_RETURN(name, mb, cl...)				\
-static inline s64 arch_atomic64_sub_return##name(s64 i, atomic64_t *v)	\
+static inline long __lse_atomic64_sub_return##name(s64 i, atomic64_t *v)	\
 {									\
-	register s64 x0 asm ("x0") = i;					\
-	register atomic64_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_ATOMIC64(sub_return##name)				\
-	__nops(2),							\
-	/* LSE atomics */						\
+	asm volatile(							\
 	"	neg	%[i], %[i]\n"					\
 	"	ldadd" #mb "	%[i], x30, %[v]\n"			\
-	"	add	%[i], %[i], x30")				\
-	: [i] "+&r" (x0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	"	add	%[i], %[i], x30"				\
+	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v)							\
+	: "x30", ##cl);							\
 									\
-	return x0;							\
+	return i;							\
 }
 
 ATOMIC64_OP_SUB_RETURN(_relaxed,   )
@@ -381,23 +273,16 @@ ATOMIC64_OP_SUB_RETURN(        , al, "memory")
 #undef ATOMIC64_OP_SUB_RETURN
 
 #define ATOMIC64_FETCH_OP_SUB(name, mb, cl...)				\
-static inline s64 arch_atomic64_fetch_sub##name(s64 i, atomic64_t *v)	\
+static inline long __lse_atomic64_fetch_sub##name(s64 i, atomic64_t *v)	\
 {									\
-	register s64 x0 asm ("x0") = i;					\
-	register atomic64_t *x1 asm ("x1") = v;				\
-									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_ATOMIC64(fetch_sub##name)				\
-	__nops(1),							\
-	/* LSE atomics */						\
+	asm volatile(							\
 	"	neg	%[i], %[i]\n"					\
-	"	ldadd" #mb "	%[i], %[i], %[v]")			\
-	: [i] "+&r" (x0), [v] "+Q" (v->counter)				\
-	: "r" (x1)							\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	"	ldadd" #mb "	%[i], %[i], %[v]"			\
+	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
+	: "r" (v)							\
+	: cl);								\
 									\
-	return x0;							\
+	return i;							\
 }
 
 ATOMIC64_FETCH_OP_SUB(_relaxed,   )
@@ -407,15 +292,9 @@ ATOMIC64_FETCH_OP_SUB(        , al, "memory")
 
 #undef ATOMIC64_FETCH_OP_SUB
 
-static inline s64 arch_atomic64_dec_if_positive(atomic64_t *v)
+static inline s64 __lse_atomic64_dec_if_positive(atomic64_t *v)
 {
-	register long x0 asm ("x0") = (long)v;
-
-	asm volatile(ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-	__LL_SC_ATOMIC64(dec_if_positive)
-	__nops(6),
-	/* LSE atomics */
+	asm volatile(
 	"1:	ldr	x30, %[v]\n"
 	"	subs	%[ret], x30, #1\n"
 	"	b.lt	2f\n"
@@ -423,20 +302,16 @@ static inline s64 arch_atomic64_dec_if_positive(atomic64_t *v)
 	"	sub	x30, x30, #1\n"
 	"	sub	x30, x30, %[ret]\n"
 	"	cbnz	x30, 1b\n"
-	"2:")
-	: [ret] "+&r" (x0), [v] "+Q" (v->counter)
+	"2:"
+	: [ret] "+&r" (v), [v] "+Q" (v->counter)
 	:
-	: __LL_SC_CLOBBERS, "cc", "memory");
+	: "x30", "cc", "memory");
 
-	return x0;
+	return (long)v;
 }
 
-#undef __LL_SC_ATOMIC64
-
-#define __LL_SC_CMPXCHG(op)	__LL_SC_CALL(__cmpxchg_case_##op)
-
 #define __CMPXCHG_CASE(w, sfx, name, sz, mb, cl...)			\
-static inline u##sz __cmpxchg_case_##name##sz(volatile void *ptr,	\
+static inline u##sz __lse__cmpxchg_case_##name##sz(volatile void *ptr,	\
 					      u##sz old,		\
 					      u##sz new)		\
 {									\
@@ -444,17 +319,13 @@ static inline u##sz __cmpxchg_case_##name##sz(volatile void *ptr,	\
 	register u##sz x1 asm ("x1") = old;				\
 	register u##sz x2 asm ("x2") = new;				\
 									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_CMPXCHG(name##sz)					\
-	__nops(2),							\
-	/* LSE atomics */						\
+	asm volatile(							\
 	"	mov	" #w "30, %" #w "[old]\n"			\
 	"	cas" #mb #sfx "\t" #w "30, %" #w "[new], %[v]\n"	\
-	"	mov	%" #w "[ret], " #w "30")			\
+	"	mov	%" #w "[ret], " #w "30"				\
 	: [ret] "+r" (x0), [v] "+Q" (*(unsigned long *)ptr)		\
 	: [old] "r" (x1), [new] "r" (x2)				\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	: "x30", ##cl);							\
 									\
 	return x0;							\
 }
@@ -476,13 +347,10 @@ __CMPXCHG_CASE(w, h,  mb_, 16, al, "memory")
 __CMPXCHG_CASE(w,  ,  mb_, 32, al, "memory")
 __CMPXCHG_CASE(x,  ,  mb_, 64, al, "memory")
 
-#undef __LL_SC_CMPXCHG
 #undef __CMPXCHG_CASE
 
-#define __LL_SC_CMPXCHG_DBL(op)	__LL_SC_CALL(__cmpxchg_double##op)
-
 #define __CMPXCHG_DBL(name, mb, cl...)					\
-static inline long __cmpxchg_double##name(unsigned long old1,		\
+static inline long __lse__cmpxchg_double##name(unsigned long old1,	\
 					 unsigned long old2,		\
 					 unsigned long new1,		\
 					 unsigned long new2,		\
@@ -496,20 +364,16 @@ static inline long __cmpxchg_double##name(unsigned long old1,		\
 	register unsigned long x3 asm ("x3") = new2;			\
 	register unsigned long x4 asm ("x4") = (unsigned long)ptr;	\
 									\
-	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
-	/* LL/SC */							\
-	__LL_SC_CMPXCHG_DBL(name)					\
-	__nops(3),							\
-	/* LSE atomics */						\
+	asm volatile(							\
 	"	casp" #mb "\t%[old1], %[old2], %[new1], %[new2], %[v]\n"\
 	"	eor	%[old1], %[old1], %[oldval1]\n"			\
 	"	eor	%[old2], %[old2], %[oldval2]\n"			\
-	"	orr	%[old1], %[old1], %[old2]")			\
+	"	orr	%[old1], %[old1], %[old2]"			\
 	: [old1] "+&r" (x0), [old2] "+&r" (x1),				\
 	  [v] "+Q" (*(unsigned long *)ptr)				\
 	: [new1] "r" (x2), [new2] "r" (x3), [ptr] "r" (x4),		\
 	  [oldval1] "r" (oldval1), [oldval2] "r" (oldval2)		\
-	: __LL_SC_CLOBBERS, ##cl);					\
+	: cl);								\
 									\
 	return x0;							\
 }
@@ -517,7 +381,6 @@ static inline long __cmpxchg_double##name(unsigned long old1,		\
 __CMPXCHG_DBL(   ,   )
 __CMPXCHG_DBL(_mb, al, "memory")
 
-#undef __LL_SC_CMPXCHG_DBL
 #undef __CMPXCHG_DBL
 
 #endif	/* __ASM_ATOMIC_LSE_H */
diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index 7a299a20f6dc..e5fff8cd4904 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -10,7 +10,7 @@
 #include <linux/build_bug.h>
 #include <linux/compiler.h>
 
-#include <asm/atomic.h>
+#include <asm/atomic_arch.h>
 #include <asm/barrier.h>
 #include <asm/lse.h>
 
diff --git a/arch/arm64/include/asm/lse.h b/arch/arm64/include/asm/lse.h
index 8262325e2fc6..52b80846d1b7 100644
--- a/arch/arm64/include/asm/lse.h
+++ b/arch/arm64/include/asm/lse.h
@@ -22,14 +22,6 @@
 
 __asm__(".arch_extension	lse");
 
-/* Move the ll/sc atomics out-of-line */
-#define __LL_SC_INLINE		notrace
-#define __LL_SC_PREFIX(x)	__ll_sc_##x
-#define __LL_SC_EXPORT(x)	EXPORT_SYMBOL(__LL_SC_PREFIX(x))
-
-/* Macro for constructing calls to out-of-line ll/sc atomics */
-#define __LL_SC_CALL(op)	"bl\t" __stringify(__LL_SC_PREFIX(op)) "\n"
-#define __LL_SC_CLOBBERS	"x16", "x17", "x30"
 
 /* In-line patching at runtime */
 #define ARM64_LSE_ATOMIC_INSN(llsc, lse)				\
@@ -46,9 +38,6 @@ __asm__(".arch_extension	lse");
 
 #else	/* __ASSEMBLER__ */
 
-#define __LL_SC_INLINE		static inline
-#define __LL_SC_PREFIX(x)	x
-#define __LL_SC_EXPORT(x)
 
 #define ARM64_LSE_ATOMIC_INSN(llsc, lse)	llsc
 
-- 
2.21.0




* [PATCH v2 4/5] arm64: avoid using hard-coded registers for LSE atomics
  2019-07-31 16:12 [PATCH v2 0/5] arm64: avoid out-of-line ll/sc atomics Andrew Murray
                   ` (2 preceding siblings ...)
  2019-07-31 16:12 ` [PATCH v2 3/5] arm64: atomics: avoid out-of-line ll/sc atomics Andrew Murray
@ 2019-07-31 16:12 ` Andrew Murray
  2019-07-31 16:12 ` [PATCH v2 5/5] arm64: atomics: remove atomic_ll_sc compilation unit Andrew Murray
  4 siblings, 0 replies; 10+ messages in thread
From: Andrew Murray @ 2019-07-31 16:12 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Peter Zijlstra, Ard.Biesheuvel
  Cc: Mark Rutland, Boqun Feng, linux-arm-kernel

Now that we have removed the out-of-line ll/sc atomics, we can give
the compiler the freedom to choose its own register allocation. Let's
remove the hard-coded use of x30.
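
As a rough sketch of the pattern (add_return case only, barrier suffix
and extra clobbers dropped; the function names and tmp below are
illustrative rather than taken verbatim from the patch):

/* Old: scratch value forced into w30, so x30 is always clobbered */
static inline int sketch_add_return_old(int i, atomic_t *v)
{
	asm volatile(
	"	ldadd	%w[i], w30, %[v]\n"
	"	add	%w[i], %w[i], w30"
	: [i] "+r" (i), [v] "+Q" (v->counter)
	: "r" (v)
	: "x30");

	return i;
}

/* New: the compiler picks any free register for the temporary */
static inline int sketch_add_return_new(int i, atomic_t *v)
{
	u32 tmp;

	asm volatile(
	"	ldadd	%w[i], %w[tmp], %[v]\n"
	"	add	%w[i], %w[i], %w[tmp]"
	: [i] "+r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)
	: "r" (v));

	return i;
}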

Signed-off-by: Andrew Murray <andrew.murray@arm.com>
---
 arch/arm64/include/asm/atomic_lse.h | 70 +++++++++++++++++------------
 1 file changed, 41 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index 7dce5e1f074e..c6bd87d2915b 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -55,12 +55,14 @@ ATOMIC_FETCH_OPS(add, ldadd)
 #define ATOMIC_OP_ADD_RETURN(name, mb, cl...)				\
 static inline int __lse_atomic_add_return##name(int i, atomic_t *v)	\
 {									\
+	u32 tmp;							\
+									\
 	asm volatile(							\
-	"	ldadd" #mb "	%w[i], w30, %[v]\n"			\
-	"	add	%w[i], %w[i], w30"				\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
+	"	ldadd" #mb "	%w[i], %w[tmp], %[v]\n"			\
+	"	add	%w[i], %w[i], %w[tmp]"				\
+	: [i] "+r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)	\
 	: "r" (v)							\
-	: "x30", ##cl);							\
+	: cl);								\
 									\
 	return i;							\
 }
@@ -113,13 +115,15 @@ static inline void __lse_atomic_sub(int i, atomic_t *v)
 #define ATOMIC_OP_SUB_RETURN(name, mb, cl...)				\
 static inline int __lse_atomic_sub_return##name(int i, atomic_t *v)	\
 {									\
+	u32 tmp;							\
+									\
 	asm volatile(							\
 	"	neg	%w[i], %w[i]\n"					\
-	"	ldadd" #mb "	%w[i], w30, %[v]\n"			\
-	"	add	%w[i], %w[i], w30"				\
-	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
+	"	ldadd" #mb "	%w[i], %w[tmp], %[v]\n"			\
+	"	add	%w[i], %w[i], %w[tmp]"				\
+	: [i] "+&r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)	\
 	: "r" (v)							\
-	: "x30", ##cl);							\
+	: cl);							\
 									\
 	return i;							\
 }
@@ -196,12 +200,14 @@ ATOMIC64_FETCH_OPS(add, ldadd)
 #define ATOMIC64_OP_ADD_RETURN(name, mb, cl...)				\
 static inline long __lse_atomic64_add_return##name(s64 i, atomic64_t *v)\
 {									\
+	unsigned long tmp;						\
+									\
 	asm volatile(							\
-	"	ldadd" #mb "	%[i], x30, %[v]\n"			\
-	"	add	%[i], %[i], x30"				\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
+	"	ldadd" #mb "	%[i], %x[tmp], %[v]\n"			\
+	"	add	%[i], %[i], %x[tmp]"				\
+	: [i] "+r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)	\
 	: "r" (v)							\
-	: "x30", ##cl);							\
+	: cl);								\
 									\
 	return i;							\
 }
@@ -254,13 +260,15 @@ static inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
 #define ATOMIC64_OP_SUB_RETURN(name, mb, cl...)				\
 static inline long __lse_atomic64_sub_return##name(s64 i, atomic64_t *v)	\
 {									\
+	unsigned long tmp;						\
+									\
 	asm volatile(							\
 	"	neg	%[i], %[i]\n"					\
-	"	ldadd" #mb "	%[i], x30, %[v]\n"			\
-	"	add	%[i], %[i], x30"				\
-	: [i] "+&r" (i), [v] "+Q" (v->counter)				\
+	"	ldadd" #mb "	%[i], %x[tmp], %[v]\n"			\
+	"	add	%[i], %[i], %x[tmp]"				\
+	: [i] "+&r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)	\
 	: "r" (v)							\
-	: "x30", ##cl);							\
+	: cl);								\
 									\
 	return i;							\
 }
@@ -294,18 +302,20 @@ ATOMIC64_FETCH_OP_SUB(        , al, "memory")
 
 static inline s64 __lse_atomic64_dec_if_positive(atomic64_t *v)
 {
+	unsigned long tmp;
+
 	asm volatile(
-	"1:	ldr	x30, %[v]\n"
-	"	subs	%[ret], x30, #1\n"
+	"1:	ldr	%x[tmp], %[v]\n"
+	"	subs	%[ret], %x[tmp], #1\n"
 	"	b.lt	2f\n"
-	"	casal	x30, %[ret], %[v]\n"
-	"	sub	x30, x30, #1\n"
-	"	sub	x30, x30, %[ret]\n"
-	"	cbnz	x30, 1b\n"
+	"	casal	%x[tmp], %[ret], %[v]\n"
+	"	sub	%x[tmp], %x[tmp], #1\n"
+	"	sub	%x[tmp], %x[tmp], %[ret]\n"
+	"	cbnz	%x[tmp], 1b\n"
 	"2:"
-	: [ret] "+&r" (v), [v] "+Q" (v->counter)
+	: [ret] "+&r" (v), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)
 	:
-	: "x30", "cc", "memory");
+	: "cc", "memory");
 
 	return (long)v;
 }
@@ -318,14 +328,16 @@ static inline u##sz __lse__cmpxchg_case_##name##sz(volatile void *ptr,	\
 	register unsigned long x0 asm ("x0") = (unsigned long)ptr;	\
 	register u##sz x1 asm ("x1") = old;				\
 	register u##sz x2 asm ("x2") = new;				\
+	unsigned long tmp;						\
 									\
 	asm volatile(							\
-	"	mov	" #w "30, %" #w "[old]\n"			\
-	"	cas" #mb #sfx "\t" #w "30, %" #w "[new], %[v]\n"	\
-	"	mov	%" #w "[ret], " #w "30"				\
-	: [ret] "+r" (x0), [v] "+Q" (*(unsigned long *)ptr)		\
+	"	mov	%" #w "[tmp], %" #w "[old]\n"			\
+	"	cas" #mb #sfx "\t%" #w "[tmp], %" #w "[new], %[v]\n"	\
+	"	mov	%" #w "[ret], %" #w "[tmp]"			\
+	: [ret] "+r" (x0), [v] "+Q" (*(unsigned long *)ptr),		\
+	  [tmp] "=&r" (tmp)						\
 	: [old] "r" (x1), [new] "r" (x2)				\
-	: "x30", ##cl);							\
+	: cl);								\
 									\
 	return x0;							\
 }
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 5/5] arm64: atomics: remove atomic_ll_sc compilation unit
  2019-07-31 16:12 [PATCH v2 0/5] arm64: avoid out-of-line ll/sc atomics Andrew Murray
                   ` (3 preceding siblings ...)
  2019-07-31 16:12 ` [PATCH v2 4/5] arm64: avoid using hard-coded registers for LSE atomics Andrew Murray
@ 2019-07-31 16:12 ` Andrew Murray
  4 siblings, 0 replies; 10+ messages in thread
From: Andrew Murray @ 2019-07-31 16:12 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Peter Zijlstra, Ard.Biesheuvel
  Cc: Mark Rutland, Boqun Feng, linux-arm-kernel

We no longer fall back to out-of-line atomics on systems with
CONFIG_ARM64_LSE_ATOMICS where ARM64_HAS_LSE_ATOMICS is not set. Let's
remove the now unused compilation unit which provided these symbols.
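
For reference, the inline dispatch added in patch 3 (reproduced here in
simplified form from the hunks quoted later in this thread) selects an
inline body on either side of the static key, so nothing remains that
references the old out-of-line __ll_sc_* symbols:

#define __lse_ll_sc_body(op, ...)					\
({									\
	system_uses_lse_atomics() ?					\
		__lse_##op(__VA_ARGS__) :				\
		__ll_sc_##op(__VA_ARGS__);				\
})

/* e.g. the plain add wrapper is fully inline on both paths */
static inline void arch_atomic_add(int i, atomic_t *v)
{
	__lse_ll_sc_body(atomic_add, i, v);
}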

Signed-off-by: Andrew Murray <andrew.murray@arm.com>
---
 arch/arm64/lib/Makefile       | 19 -------------------
 arch/arm64/lib/atomic_ll_sc.c |  3 ---
 2 files changed, 22 deletions(-)
 delete mode 100644 arch/arm64/lib/atomic_ll_sc.c

diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 33c2a4abda04..f10809ef1690 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -11,25 +11,6 @@ CFLAGS_REMOVE_xor-neon.o	+= -mgeneral-regs-only
 CFLAGS_xor-neon.o		+= -ffreestanding
 endif
 
-# Tell the compiler to treat all general purpose registers (with the
-# exception of the IP registers, which are already handled by the caller
-# in case of a PLT) as callee-saved, which allows for efficient runtime
-# patching of the bl instruction in the caller with an atomic instruction
-# when supported by the CPU. Result and argument registers are handled
-# correctly, based on the function prototype.
-lib-$(CONFIG_ARM64_LSE_ATOMICS) += atomic_ll_sc.o
-CFLAGS_atomic_ll_sc.o	:= -ffixed-x1 -ffixed-x2        		\
-		   -ffixed-x3 -ffixed-x4 -ffixed-x5 -ffixed-x6		\
-		   -ffixed-x7 -fcall-saved-x8 -fcall-saved-x9		\
-		   -fcall-saved-x10 -fcall-saved-x11 -fcall-saved-x12	\
-		   -fcall-saved-x13 -fcall-saved-x14 -fcall-saved-x15	\
-		   -fcall-saved-x18 -fomit-frame-pointer
-CFLAGS_REMOVE_atomic_ll_sc.o := $(CC_FLAGS_FTRACE)
-GCOV_PROFILE_atomic_ll_sc.o	:= n
-KASAN_SANITIZE_atomic_ll_sc.o	:= n
-KCOV_INSTRUMENT_atomic_ll_sc.o	:= n
-UBSAN_SANITIZE_atomic_ll_sc.o	:= n
-
 lib-$(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) += uaccess_flushcache.o
 
 obj-$(CONFIG_CRC32) += crc32.o
diff --git a/arch/arm64/lib/atomic_ll_sc.c b/arch/arm64/lib/atomic_ll_sc.c
deleted file mode 100644
index b0c538b0da28..000000000000
--- a/arch/arm64/lib/atomic_ll_sc.c
+++ /dev/null
@@ -1,3 +0,0 @@
-#include <asm/atomic.h>
-#define __ARM64_IN_ATOMIC_IMPL
-#include <asm/atomic_ll_sc.h>
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/5] jump_label: Don't warn on __exit jump entries
  2019-07-31 16:12 ` [PATCH v2 1/5] jump_label: Don't warn on __exit jump entries Andrew Murray
@ 2019-07-31 16:41   ` Peter Zijlstra
  2019-08-02  8:22     ` Andrew Murray
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2019-07-31 16:41 UTC (permalink / raw)
  To: Andrew Murray
  Cc: Mark Rutland, Catalin Marinas, Boqun Feng, Will Deacon,
	Ard.Biesheuvel, linux-arm-kernel

On Wed, Jul 31, 2019 at 05:12:52PM +0100, Andrew Murray wrote:
> On architectures that discard .exit.* sections at runtime, a
> warning is printed for each jump label that is used within an
> in-kernel __exit annotated function:
> 
> can't patch jump_label at ehci_hcd_cleanup+0x8/0x3c
> WARNING: CPU: 0 PID: 1 at kernel/jump_label.c:410 __jump_label_update+0x12c/0x138
> 
> As these functions will never get executed (they are free'd along
> with the rest of initmem) - we do not need to patch them and should
> not display any warnings.
> 
> The warning is displayed because the test required to satisfy
> jump_entry_is_init is based on init_section_contains (__init_begin to
> __init_end) whereas the test in __jump_label_update is based on
> init_kernel_text (_sinittext to _einittext) via kernel_text_address.
> 
> In addition to fixing this, we also remove an out-of-date comment
> and use a WARN instead of a WARN_ONCE.
> 
> Fixes: 19483677684b ("jump_label: Annotate entries that operate on __init code earlier")
> Signed-off-by: Andrew Murray <andrew.murray@arm.com>
> ---
>  kernel/jump_label.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/jump_label.c b/kernel/jump_label.c
> index df3008419a1d..c984078a5c28 100644
> --- a/kernel/jump_label.c
> +++ b/kernel/jump_label.c
> @@ -407,7 +407,9 @@ static bool jump_label_can_update(struct jump_entry *entry, bool init)
>  		return false;
>  
>  	if (!kernel_text_address(jump_entry_code(entry))) {
> -		WARN_ONCE(1, "can't patch jump_label at %pS", (void *)jump_entry_code(entry));
> +		if (!jump_entry_is_init(entry))
> +			WARN_ONCE(1, "can't patch jump_label at %pS",
> +				  (void *)jump_entry_code(entry));

It seems to me we can write that as:

		WARN_ONCE(!jump_entry_is_init(entry),
			  "can't patch jump_label at %pS",
			  (void *)jump_entry_code(entry));

>  		return false;
>  	}

Other than that,

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 3/5] arm64: atomics: avoid out-of-line ll/sc atomics
  2019-07-31 16:12 ` [PATCH v2 3/5] arm64: atomics: avoid out-of-line ll/sc atomics Andrew Murray
@ 2019-08-01  3:10   ` Boqun Feng
  2019-08-01  8:12     ` Andrew Murray
  0 siblings, 1 reply; 10+ messages in thread
From: Boqun Feng @ 2019-08-01  3:10 UTC (permalink / raw)
  To: Andrew Murray
  Cc: Mark Rutland, Peter Zijlstra, Catalin Marinas, Will Deacon,
	Ard.Biesheuvel, linux-arm-kernel


Hi Andrew,

On Wed, Jul 31, 2019 at 05:12:54PM +0100, Andrew Murray wrote:
[...]
> +
> +#define __lse_ll_sc_body(op, ...)					\
> +({									\
> +	system_uses_lse_atomics() ?					\
> +		__lse_##op(__VA_ARGS__) :				\
> +		__ll_sc_##op(__VA_ARGS__);				\
> +})
> +
> +#define ATOMIC_OP(op)							\
> +static inline void arch_##op(int i, atomic_t *v)			\
> +{									\
> +	__lse_ll_sc_body(op, i, v);					\
> +}
> +
> +ATOMIC_OP(atomic_andnot)
> +ATOMIC_OP(atomic_or)
> +ATOMIC_OP(atomic_xor)
> +ATOMIC_OP(atomic_add)
> +ATOMIC_OP(atomic_and)
> +ATOMIC_OP(atomic_sub)
> +
> +
> +#define ATOMIC_FETCH_OP(name, op)					\
> +static inline int arch_##op##name(int i, atomic_t *v)			\
> +{									\
> +	return __lse_ll_sc_body(op, i, v);				\

Color me blind if I'm wrong, but shouldn't this be:

	return __lse_ll_sc_body(op##name, i, v);				\

? Otherwise all variants will use the fully-ordered implementation.
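
A hand expansion of one variant makes the difference concrete (sketch
only, statement-expression wrapper dropped, names taken from the quoted
hunks):

/* As posted, ATOMIC_FETCH_OP(_acquire, atomic_fetch_add) gives: */
static inline int arch_atomic_fetch_add_acquire(int i, atomic_t *v)
{
	return system_uses_lse_atomics() ?
		__lse_atomic_fetch_add(i, v) :		/* fully ordered */
		__ll_sc_atomic_fetch_add(i, v);		/* fully ordered */
}

/*
 * With op##name the same wrapper would instead resolve to
 * __lse_atomic_fetch_add_acquire()/__ll_sc_atomic_fetch_add_acquire(),
 * i.e. the _acquire bodies.
 */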

> +}
> +
> +#define ATOMIC_FETCH_OPS(op)						\
> +	ATOMIC_FETCH_OP(_relaxed, op)					\
> +	ATOMIC_FETCH_OP(_acquire, op)					\
> +	ATOMIC_FETCH_OP(_release, op)					\
> +	ATOMIC_FETCH_OP(        , op)
> +
> +ATOMIC_FETCH_OPS(atomic_fetch_andnot)
> +ATOMIC_FETCH_OPS(atomic_fetch_or)
> +ATOMIC_FETCH_OPS(atomic_fetch_xor)
> +ATOMIC_FETCH_OPS(atomic_fetch_add)
> +ATOMIC_FETCH_OPS(atomic_fetch_and)
> +ATOMIC_FETCH_OPS(atomic_fetch_sub)
> +ATOMIC_FETCH_OPS(atomic_add_return)
> +ATOMIC_FETCH_OPS(atomic_sub_return)
> +
> +
> +#define ATOMIC64_OP(op)							\
> +static inline void arch_##op(long i, atomic64_t *v)			\
> +{									\
> +	__lse_ll_sc_body(op, i, v);					\
> +}
> +
> +ATOMIC64_OP(atomic64_andnot)
> +ATOMIC64_OP(atomic64_or)
> +ATOMIC64_OP(atomic64_xor)
> +ATOMIC64_OP(atomic64_add)
> +ATOMIC64_OP(atomic64_and)
> +ATOMIC64_OP(atomic64_sub)
> +
> +
> +#define ATOMIC64_FETCH_OP(name, op)					\
> +static inline long arch_##op##name(long i, atomic64_t *v)		\
> +{									\
> +	return __lse_ll_sc_body(op, i, v);				\

Ditto.

Regards,
Boqun

> +}
> +
> +#define ATOMIC64_FETCH_OPS(op)						\
> +	ATOMIC64_FETCH_OP(_relaxed, op)					\
> +	ATOMIC64_FETCH_OP(_acquire, op)					\
> +	ATOMIC64_FETCH_OP(_release, op)					\
> +	ATOMIC64_FETCH_OP(        , op)
> +
> +ATOMIC64_FETCH_OPS(atomic64_fetch_andnot)
> +ATOMIC64_FETCH_OPS(atomic64_fetch_or)
> +ATOMIC64_FETCH_OPS(atomic64_fetch_xor)
> +ATOMIC64_FETCH_OPS(atomic64_fetch_add)
> +ATOMIC64_FETCH_OPS(atomic64_fetch_and)
> +ATOMIC64_FETCH_OPS(atomic64_fetch_sub)
> +ATOMIC64_FETCH_OPS(atomic64_add_return)
> +ATOMIC64_FETCH_OPS(atomic64_sub_return)
> +
> +
[...]


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 3/5] arm64: atomics: avoid out-of-line ll/sc atomics
  2019-08-01  3:10   ` Boqun Feng
@ 2019-08-01  8:12     ` Andrew Murray
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Murray @ 2019-08-01  8:12 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Mark Rutland, Peter Zijlstra, Catalin Marinas, Will Deacon,
	Ard.Biesheuvel, linux-arm-kernel

On Thu, Aug 01, 2019 at 11:10:04AM +0800, Boqun Feng wrote:
> Hi Andrew,
> 
> On Wed, Jul 31, 2019 at 05:12:54PM +0100, Andrew Murray wrote:
> [...]
> > +
> > +#define __lse_ll_sc_body(op, ...)					\
> > +({									\
> > +	system_uses_lse_atomics() ?					\
> > +		__lse_##op(__VA_ARGS__) :				\
> > +		__ll_sc_##op(__VA_ARGS__);				\
> > +})
> > +
> > +#define ATOMIC_OP(op)							\
> > +static inline void arch_##op(int i, atomic_t *v)			\
> > +{									\
> > +	__lse_ll_sc_body(op, i, v);					\
> > +}
> > +
> > +ATOMIC_OP(atomic_andnot)
> > +ATOMIC_OP(atomic_or)
> > +ATOMIC_OP(atomic_xor)
> > +ATOMIC_OP(atomic_add)
> > +ATOMIC_OP(atomic_and)
> > +ATOMIC_OP(atomic_sub)
> > +
> > +
> > +#define ATOMIC_FETCH_OP(name, op)					\
> > +static inline int arch_##op##name(int i, atomic_t *v)			\
> > +{									\
> > +	return __lse_ll_sc_body(op, i, v);				\
> 
> Color me blind if I'm wrong, but shouldn't this be:
> 
> 	return __lse_ll_sc_body(op##name, i, v);				\
> 
> ? Otherwise all variants will use the fully-ordered implementation.

Yes you're correct, thanks for spotting this (and below)!

Thanks,

Andrew Murray

> 
> > +}
> > +
> > +#define ATOMIC_FETCH_OPS(op)						\
> > +	ATOMIC_FETCH_OP(_relaxed, op)					\
> > +	ATOMIC_FETCH_OP(_acquire, op)					\
> > +	ATOMIC_FETCH_OP(_release, op)					\
> > +	ATOMIC_FETCH_OP(        , op)
> > +
> > +ATOMIC_FETCH_OPS(atomic_fetch_andnot)
> > +ATOMIC_FETCH_OPS(atomic_fetch_or)
> > +ATOMIC_FETCH_OPS(atomic_fetch_xor)
> > +ATOMIC_FETCH_OPS(atomic_fetch_add)
> > +ATOMIC_FETCH_OPS(atomic_fetch_and)
> > +ATOMIC_FETCH_OPS(atomic_fetch_sub)
> > +ATOMIC_FETCH_OPS(atomic_add_return)
> > +ATOMIC_FETCH_OPS(atomic_sub_return)
> > +
> > +
> > +#define ATOMIC64_OP(op)							\
> > +static inline void arch_##op(long i, atomic64_t *v)			\
> > +{									\
> > +	__lse_ll_sc_body(op, i, v);					\
> > +}
> > +
> > +ATOMIC64_OP(atomic64_andnot)
> > +ATOMIC64_OP(atomic64_or)
> > +ATOMIC64_OP(atomic64_xor)
> > +ATOMIC64_OP(atomic64_add)
> > +ATOMIC64_OP(atomic64_and)
> > +ATOMIC64_OP(atomic64_sub)
> > +
> > +
> > +#define ATOMIC64_FETCH_OP(name, op)					\
> > +static inline long arch_##op##name(long i, atomic64_t *v)		\
> > +{									\
> > +	return __lse_ll_sc_body(op, i, v);				\
> 
> Ditto.
> 
> Regards,
> Boqun
> 
> > +}
> > +
> > +#define ATOMIC64_FETCH_OPS(op)						\
> > +	ATOMIC64_FETCH_OP(_relaxed, op)					\
> > +	ATOMIC64_FETCH_OP(_acquire, op)					\
> > +	ATOMIC64_FETCH_OP(_release, op)					\
> > +	ATOMIC64_FETCH_OP(        , op)
> > +
> > +ATOMIC64_FETCH_OPS(atomic64_fetch_andnot)
> > +ATOMIC64_FETCH_OPS(atomic64_fetch_or)
> > +ATOMIC64_FETCH_OPS(atomic64_fetch_xor)
> > +ATOMIC64_FETCH_OPS(atomic64_fetch_add)
> > +ATOMIC64_FETCH_OPS(atomic64_fetch_and)
> > +ATOMIC64_FETCH_OPS(atomic64_fetch_sub)
> > +ATOMIC64_FETCH_OPS(atomic64_add_return)
> > +ATOMIC64_FETCH_OPS(atomic64_sub_return)
> > +
> > +
> [...]



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/5] jump_label: Don't warn on __exit jump entries
  2019-07-31 16:41   ` Peter Zijlstra
@ 2019-08-02  8:22     ` Andrew Murray
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Murray @ 2019-08-02  8:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mark Rutland, Catalin Marinas, Boqun Feng, Will Deacon,
	Ard.Biesheuvel, linux-arm-kernel

On Wed, Jul 31, 2019 at 06:41:56PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 31, 2019 at 05:12:52PM +0100, Andrew Murray wrote:
> > On architectures that discard .exit.* sections at runtime, a
> > warning is printed for each jump label that is used within an
> > in-kernel __exit annotated function:
> > 
> > can't patch jump_label at ehci_hcd_cleanup+0x8/0x3c
> > WARNING: CPU: 0 PID: 1 at kernel/jump_label.c:410 __jump_label_update+0x12c/0x138
> > 
> > As these functions will never get executed (they are free'd along
> > with the rest of initmem) - we do not need to patch them and should
> > not display any warnings.
> > 
> > The warning is displayed because the test required to satisfy
> > jump_entry_is_init is based on init_section_contains (__init_begin to
> > __init_end) whereas the test in __jump_label_update is based on
> > init_kernel_text (_sinittext to _einittext) via kernel_text_address.
> > 
> > In addition to fixing this, we also remove an out-of-date comment
> > and use a WARN instead of a WARN_ONCE.
> > 
> > Fixes: 19483677684b ("jump_label: Annotate entries that operate on __init code earlier")
> > Signed-off-by: Andrew Murray <andrew.murray@arm.com>
> > ---
> >  kernel/jump_label.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/jump_label.c b/kernel/jump_label.c
> > index df3008419a1d..c984078a5c28 100644
> > --- a/kernel/jump_label.c
> > +++ b/kernel/jump_label.c
> > @@ -407,7 +407,9 @@ static bool jump_label_can_update(struct jump_entry *entry, bool init)
> >  		return false;
> >  
> >  	if (!kernel_text_address(jump_entry_code(entry))) {
> > -		WARN_ONCE(1, "can't patch jump_label at %pS", (void *)jump_entry_code(entry));
> > +		if (!jump_entry_is_init(entry))
> > +			WARN_ONCE(1, "can't patch jump_label at %pS",
> > +				  (void *)jump_entry_code(entry));
> 
> It seems to me we can write that as:
> 
> 		WARN_ONCE(!jump_entry_is_init(entry),
> 			  "can't patch jump_label at %pS",
> 			  (void *)jump_entry_code(entry));
> 
> >  		return false;
> >  	}
> 
> Other than that,
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Ah of course!

I'll make this change on my respin and add your ack. Thanks.

Thanks,

Andrew Murray

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-08-02  8:23 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-31 16:12 [PATCH v2 0/5] arm64: avoid out-of-line ll/sc atomics Andrew Murray
2019-07-31 16:12 ` [PATCH v2 1/5] jump_label: Don't warn on __exit jump entries Andrew Murray
2019-07-31 16:41   ` Peter Zijlstra
2019-08-02  8:22     ` Andrew Murray
2019-07-31 16:12 ` [PATCH v2 2/5] arm64: Use correct ll/sc atomic constraints Andrew Murray
2019-07-31 16:12 ` [PATCH v2 3/5] arm64: atomics: avoid out-of-line ll/sc atomics Andrew Murray
2019-08-01  3:10   ` Boqun Feng
2019-08-01  8:12     ` Andrew Murray
2019-07-31 16:12 ` [PATCH v2 4/5] arm64: avoid using hard-coded registers for LSE atomics Andrew Murray
2019-07-31 16:12 ` [PATCH v2 5/5] arm64: atomics: remove atomic_ll_sc compilation unit Andrew Murray
