From: Andrew Murray <andrew.murray@arm.com>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	Will Deacon <will.deacon@arm.com>,
	Ard.Biesheuvel@arm.com,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v1 0/5] arm64: avoid out-of-line ll/sc atomics
Date: Wed, 22 May 2019 16:36:27 +0100
Message-ID: <20190522153627.GE8268@e119886-lin.cambridge.arm.com>
In-Reply-To: <CAKv+Gu9sWBwCisYPd7eAH7YBC4RfeQNvGh2Tt_f2iXZ5UUbmsw@mail.gmail.com>

On Wed, May 22, 2019 at 12:44:35PM +0100, Ard Biesheuvel wrote:
> On Wed, 22 May 2019 at 11:45, Andrew Murray <andrew.murray@arm.com> wrote:
> >
> > On Fri, May 17, 2019 at 12:29:54PM +0200, Ard Biesheuvel wrote:
> > > On Fri, 17 May 2019 at 12:08, Andrew Murray <andrew.murray@arm.com> wrote:
> > > >
> > > > On Fri, May 17, 2019 at 09:24:01AM +0200, Peter Zijlstra wrote:
> > > > > On Thu, May 16, 2019 at 04:53:39PM +0100, Andrew Murray wrote:
> > > > > > > When building for LSE atomics (CONFIG_ARM64_LSE_ATOMICS), if the hardware
> > > > > > > or toolchain doesn't support them the existing code will fall back to
> > > > > > > ll/sc atomics. It achieves this by branching from inline assembly to a
> > > > > > > function that is built with special compile flags. Further, this results
> > > > > > > in the clobbering of registers even when the fallback isn't used,
> > > > > > > increasing register pressure.
> > > > > >
> > > > > > > Let's improve this by providing inline implementations of both LSE and
> > > > > > > ll/sc and using a static key to select between them. This allows the
> > > > > > > compiler to generate better atomics code.
> > > > >
> > > > > Don't you guys have alternatives? That would avoid having both versions
> > > > > in the code, and thus significantly cuts back on the bloat.
> > > >
> > > > Yes we do.
> > > >
> > > > Prior to patch 3 of this series, the ARM64_LSE_ATOMIC_INSN macro used
> > > > ALTERNATIVE to either bl to a fallback ll/sc function (and nops) - or execute
> > > > some LSE instructions.
> > > >
> > > > But this approach limits the compiler's ability to optimise the code, both
> > > > because the asm clobber list is the superset of the ll/sc and LSE clobbers
> > > > and because of the gcc compiler flags used on the ll/sc functions.
> > > >
> > > > I think the alternative solution (excuse the pun) that you are suggesting
> > > > is to put the body of the ll/sc or LSE code in the ALTERNATIVE oldinstr/newinstr
> > > > blocks (i.e. drop the fallback branches). However this still gives us some
> > > > bloat (but less than my current solution) because we're still now inlining the
> > > > larger fallback ll/sc whereas previously they were non-inline'd functions. We
> > > > still end up with potentially unnecessary clobbers for LSE code with this
> > > > approach.
> > > >
> > > > Approach prior to this series:
> > > >
> > > >    BL 1 or NOP <- single alternative instruction
> > > >    LSE
> > > >    LSE
> > > >    ...
> > > >
> > > > 1: LL/SC <- LL/SC fallback not inlined so reused
> > > >    LL/SC
> > > >    LL/SC
> > > >    LL/SC
> > > >
> > > > Approach proposed by this series:
> > > >
> > > >    BL 1 or NOP <- single alternative instruction
> > > >    LSE
> > > >    LSE
> > > >    BL 2
> > > > 1: LL/SC <- inlined LL/SC and thus duplicated
> > > >    LL/SC
> > > >    LL/SC
> > > >    LL/SC
> > > > 2: ..
> > > >
> > > > Approach using alternative without branches:
> > > >
> > > >    LSE
> > > >    LSE
> > > >    NOP
> > > >    NOP
> > > >
> > > > or
> > > >
> > > >    LL/SC <- inlined LL/SC and thus duplicated
> > > >    LL/SC
> > > >    LL/SC
> > > >    LL/SC
> > > >
> > > > I guess there is a balance here between bloat and code optimisation.
> > > >
> > >
> > >
> > > So there are two separate questions here:
> > > 1) whether or not we should merge the inline asm blocks so that the
> > > compiler sees a single set of constraints and operands
> > > 2) whether the LL/SC sequence should be inlined and/or duplicated.
> > >
> > > This approach appears to be based on the assumption that reserving one
> > > or sometimes two additional registers for the LL/SC fallback has a
> > > more severe impact on performance than the unconditional branch.
> > > However, it seems to me that any call site that uses the atomics has
> > > to deal with the possibility of either version being invoked, and so
> > > the additional registers need to be freed up in any case. Or am I
> > > missing something?
> >
> > Yes, at compile time the compiler doesn't know which atomics path will
> > be taken, so code has to be generated for both (thus optimisation is
> > limited). However, with this approach we no longer use hard-coded
> > registers or restrict which/how registers can be used, and therefore the
> > compiler ought to have greater freedom to optimise.
> >
> 
> Yes, I agree that is an improvement. But that doesn't require the
> LL/SC and LSE asm sequences to be distinct.
> 
> > >
> > > As for the duplication: a while ago, I suggested an approach [0] using
> > > alternatives and asm subsections, which moved the duplicated LL/SC
> > > fallbacks out of the hot path. This does not remove the bloat, but it
> > > does mitigate its impact on I-cache efficiency when running on
> > > hardware that does not require the fallbacks.
> >
> > I've seen this. I guess it's possible to incorporate subsections into the
> > inline assembly in the __ll_sc_* functions of this series. If we wanted
> > the ll/sc fallbacks not to be inlined, then I suppose we can put these
> > functions in their own section to achieve the same goal.
> >
> > My toolchain knowledge is limited here - but in order to use subsections
> > you require a branch - in this case does the compiler optimise across the
> > subsections? If not then I guess there is no benefit to inlining the code,
> > in which case you may as well have a branch to a function (in its own
> > section) and then you get both the icache gain and also avoid bloat. Does
> > that make any sense?
> >
> 
> 
> Not entirely. A function call requires an additional register to be
> preserved, and the bl and ret instructions are both indirect branches,
> while subsections use direct unconditional branches only.
> 
> Another reason we want to get rid of the current approach (and the
> reason I looked into it in the first place) is that we are introducing
> hidden branches, which affects the reliability of backtraces and this
> is an issue for livepatch.

I guess we don't have enough information to determine the performance effect
of this.

I think I'll spend some time comparing the effect of some of these factors
on typical code with objdump to get a better feel for the likely effect
on performance and post my findings.
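
For reference, the general shape of what this series does can be sketched
in userspace C as below. This is only an illustration under stated
assumptions, not the kernel code: use_lse, add_fast, add_fallback and
sketch_atomic_add are hypothetical names, a plain bool stands in for the
static key, and C11 atomics stand in for the LSE and ll/sc inline asm.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the static key. The kernel would patch a
 * branch at runtime rather than load and test a flag like this.
 */
static bool use_lse;

/* Stand-in for the LSE path: one atomic read-modify-write operation. */
static inline void add_fast(atomic_int *v, int i)
{
	atomic_fetch_add_explicit(v, i, memory_order_relaxed);
}

/* Stand-in for the ll/sc path: a compare-exchange retry loop. */
static inline void add_fallback(atomic_int *v, int i)
{
	int old = atomic_load_explicit(v, memory_order_relaxed);

	/* On failure 'old' is refreshed with the current value, so retry. */
	while (!atomic_compare_exchange_weak_explicit(v, &old, old + i,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
}

/*
 * Both bodies are inlined at every call site. The compiler sees each
 * one's real operands, but the fallback is duplicated per caller.
 */
static inline void sketch_atomic_add(atomic_int *v, int i)
{
	if (use_lse)
		add_fast(v, i);
	else
		add_fallback(v, i);
}
```

Calling sketch_atomic_add(&v, 1) from several functions and looking at the
result with objdump shows the trade-off discussed above: no out-of-line
call, at the cost of both paths appearing in each caller.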

Thanks for the feedback.

Thanks,

Andrew Murray

> 
> > >
> > >
> > > [0] https://lore.kernel.org/linux-arm-kernel/20181113233923.20098-1-ard.biesheuvel@linaro.org/
> > >
> > >
> > >
> > > > >
> > > > > > These changes add a small amount of bloat on defconfig according to
> > > > > > bloat-o-meter:
> > > > > >
> > > > > > text:
> > > > > >   add/remove: 1/108 grow/shrink: 3448/20 up/down: 272768/-4320 (268448)
> > > > > >   Total: Before=12363112, After=12631560, chg +2.17%
> > > > >
> > > > > I'd say 2% is quite significant bloat.
> > > >
> > > > Thanks,
> > > >
> > > > Andrew Murray
> > > >
> > > > _______________________________________________
> > > > linux-arm-kernel mailing list
> > > > linux-arm-kernel@lists.infradead.org
> > > > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
