From: Mark Rutland <mark.rutland@arm.com>
To: Björn Töpel <bjorn@kernel.org>
Cc: Puranjay Mohan <puranjay12@gmail.com>,
Paul Walmsley <paul.walmsley@sifive.com>,
Palmer Dabbelt <palmer@dabbelt.com>,
Albert Ou <aou@eecs.berkeley.edu>,
Steven Rostedt <rostedt@goodmis.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Sami Tolvanen <samitolvanen@google.com>,
Guo Ren <guoren@kernel.org>,
Ley Foon Tan <leyfoon.tan@starfivetech.com>,
Deepak Gupta <debug@rivosinc.com>,
Sia Jee Heng <jeeheng.sia@starfivetech.com>,
Björn Töpel <bjorn@rivosinc.com>,
Song Shuai <suagrfillet@gmail.com>,
Clément Léger <cleger@rivosinc.com>,
Al Viro <viro@zeniv.linux.org.uk>,
Jisheng Zhang <jszhang@kernel.org>,
linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org,
linux-trace-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
Date: Tue, 12 Mar 2024 13:42:28 +0000 [thread overview]
Message-ID: <ZfBbxPDd0rz6FN2T@FVFF77S0Q05N> (raw)
In-Reply-To: <87ttlhdeqb.fsf@all.your.base.are.belong.to.us>
Hi Bjorn
(apologies, my corporate mail server has butchered your name here).
There's a big info dump below; I realise this sounds like a sales pitch for
CALL_OPS, but my intent is more to say "here are some dragons you may not have
spotted".
On Thu, Mar 07, 2024 at 08:27:40PM +0100, Björn Töpel wrote:
> Puranjay!
>
> Puranjay Mohan <puranjay12@gmail.com> writes:
>
> > This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
> > This allows each ftrace callsite to provide an ftrace_ops to the common
> > ftrace trampoline, allowing each callsite to invoke distinct tracer
> > functions without the need to fall back to list processing or to
> > allocate custom trampolines for each callsite. This significantly speeds
> > up cases where multiple distinct trace functions are used and callsites
> > are mostly traced by a single tracer.
> >
> > The idea and most of the implementation is taken from the ARM64's
> > implementation of the same feature. The idea is to place a pointer to
> > the ftrace_ops as a literal at a fixed offset from the function entry
> > point, which can be recovered by the common ftrace trampoline.
>
> Not really a review, but some more background; Another rationale (on-top
> of the improved per-call performance!) for CALL_OPS was to use it to
> build ftrace direct call support (which BPF uses a lot!). Mark, please
> correct me if I'm lying here!
Yep; it gives us the ability to do a number of per-callsite things, including
direct calls.
> On Arm64, CALL_OPS makes it possible to implement direct calls, while
> only patching one BL instruction -- nice!
The key thing here isn't that we patch a single instruction (since we have to
patch the ops pointer too!); it's that we can safely patch either the ops
pointer or the BL/NOP at any time while threads are concurrently executing.
If you have a multi-instruction sequence, then threads can be preempted
mid-sequence, and it's very painful/complex to handle all of the races that
entails.
For example, if your callsites use a sequence:
AUIPC <tmp>, <funcptr>
JALR <tmp2>, <funcptr>(<tmp>)
Using stop_machine() won't allow you to patch that safely as some threads
could be stuck mid-sequence, e.g.
AUIPC <tmp>, <funcptr>
[ preempted here ]
JALR <tmp2>, <funcptr>(<tmp>)
... and you can't update the JALR to use a new funcptr immediate until those
have completed the sequence.
There are ways around that, but they're complicated and/or expensive, e.g.
* Use a sequence of multiple patches, starting with replacing the JALR with an
exception-generating instruction with a fixup handler, which is sort-of what
x86 does with UD2. This may require multiple passes with
synchronize_rcu_tasks() to make sure all threads have seen the latest
instructions, and that cannot be done under stop_machine(), so if you need
stop_machine() for CMODx reasons, you may need to use that several times with
intervening calls to synchronize_rcu_tasks().
* Have the patching logic manually go over each thread and fix up the pt_regs
for the interrupted thread. This is pretty horrid since you could have nested
exceptions and a task could have several pt_regs which might require
updating.
The CALL_OPS approach is a bit easier to deal with as we can patch the
per-callsite pointer atomically, then we can (possibly) enable/disable the
callsite's branch, then wait for threads to drain once.
As a heads-up, there are some latent/generic issues with DYNAMIC_FTRACE
generally in this area (CALL_OPS happens to side-step those, but trampoline
usage is currently affected):
https://lore.kernel.org/lkml/Zenx_Q0UiwMbSAdP@FVFF77S0Q05N/
... I'm looking into fixing that at the moment, and it looks like that's likely
to require some per-architecture changes.
> On RISC-V we cannot use the same ideas as Arm64 straight off,
> because the range of jal (compare to BL) is simply too short (+/-1M).
> So, on RISC-V we need to use a full auipc/jal pair (the text patching
> story is another chapter, but let's leave that aside for now). Since we
> have to patch multiple instructions, the cmodx situation doesn't really
> improve with CALL_OPS.
The branch range thing is annoying, but I think this boils down to the same
problem as arm64 has with needing a "MOV <tmp>, LR" instruction that we have to
patch in once at boot time. You could do the same and patch in the AUIPC once,
e.g. have
| NOP
| NOP
| func:
| AUIPC <tmp>, <common_ftrace_caller>
| JALR <tmp2>, <common_ftrace_caller>(<tmp>) // patched with NOP
... which'd look very similar to arm64's sequence:
| NOP
| NOP
| func:
| MOV X9, LR
| BL ftrace_caller // patched with NOP
... which I think means it *might* be better from a cmodx perspective?
> Let's say that we continue building on your patch and implement direct
> calls on CALL_OPS for RISC-V as well.
>
> From Florent's commit message for direct calls:
>
> | There are a few cases to distinguish:
> | - If a direct call ops is the only one tracing a function:
> | - If the direct called trampoline is within the reach of a BL
> | instruction
> | -> the ftrace patchsite jumps to the trampoline
> | - Else
> | -> the ftrace patchsite jumps to the ftrace_caller trampoline which
> | reads the ops pointer in the patchsite and jumps to the direct
> | call address stored in the ops
> | - Else
> | -> the ftrace patchsite jumps to the ftrace_caller trampoline and its
> | ops literal points to ftrace_list_ops so it iterates over all
> | registered ftrace ops, including the direct call ops and calls its
> | call_direct_funcs handler which stores the direct called
> | trampoline's address in the ftrace_regs and the ftrace_caller
> | trampoline will return to that address instead of returning to the
> | traced function
>
> On RISC-V, where auipc/jalr is used, the direct called trampoline would
> always be reachable, and then the first Else-clause would never be entered.
> This means the performance for direct calls would be the same as the
> one we have today (i.e. no regression!).
>
> RISC-V does like x86 does (-ish) -- patch multiple instructions, long
> reach.
>
> Arm64 uses CALL_OPS and patch one instruction BL.
>
> Now, with this background in mind, compared to what we have today,
> CALL_OPS would give us (again assuming we're using it for direct calls):
>
> * Better performance for tracer per-call (faster ops lookup) GOOD
> * Larger text size (function alignment + extra nops) BAD
> * Same direct call performance NEUTRAL
> * Same complicated text patching required NEUTRAL
Is your current sequence safe for preemptible kernels (i.e. with PREEMPT_FULL=y
or PREEMPT_DYNAMIC=y + "preempt=full" on the kernel cmdline) ?
Looking at v6.8, IIUC you have:
// unpatched    // patched
NOP             AUIPC
NOP             JALR
What prevents a thread being preempted mid-sequence such that it executes:
NOP
[ preempted ]
[ stop_machine() used to patch text ]
[ restored ]
JALR
... ?
... or when changing the call:
AUIPC // old funcptr
[ preempted ]
[ stop_machine() used to patch text ]
[ restored ]
JALR // new funcptr
... ?
I suspect those might both be broken, but it's difficult to hit the race and so
testing hasn't revealed that so far.
> It would be interesting to see how the per-call performance would
> improve on x86 with CALL_OPS! ;-)
Heh. ;)
> I'm trying to wrap my head if it makes sense to have it on RISC-V, given
> that we're a bit different from Arm64. Does the scale tip to the GOOD
> side?
>
> Oh, and we really need to see performance numbers on real HW! I have a
> VF2 that I could try this series on.
I'll have to leave those two to you. I suspect the preempt/stop_machine()
might just tip that from NEUTRAL to GOOD.
Mark.
Thread overview: 28+ messages
2024-03-06 16:59 [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS Puranjay Mohan
2024-03-06 20:35 ` Alexandre Ghiti
2024-03-06 20:38 ` Alexandre Ghiti
2024-03-07 0:17 ` Puranjay Mohan
2024-03-08 8:48 ` Andy Chiu
2024-03-11 13:56 ` [DMARC Error] " Evgenii Shatokhin
2024-03-07 19:27 ` Björn Töpel
2024-03-07 19:51 ` Puranjay Mohan
2024-03-08 9:18 ` Andy Chiu
2024-03-08 14:13 ` Puranjay Mohan
2024-03-10 1:37 ` Guo Ren
2024-03-08 10:16 ` Björn Töpel
2024-03-08 14:22 ` Puranjay Mohan
2024-03-08 15:15 ` Björn Töpel
2024-03-12 13:42 ` Mark Rutland [this message]
2024-03-13 11:23 ` Björn Töpel
2024-03-14 14:16 ` Puranjay Mohan
2024-03-14 15:07 ` Björn Töpel
2024-03-14 20:50 ` Björn Töpel
2024-03-20 16:41 ` Andy Chiu
2024-03-21 8:48 ` Björn Töpel
2024-03-21 17:39 ` Andy Chiu
2024-03-21 18:10 ` Björn Töpel
2024-03-25 12:50 ` Robbin Ehn
2024-03-20 18:03 ` Mark Rutland
2024-03-21 8:58 ` Björn Töpel
2024-03-21 14:49 ` Steven Rostedt
2024-03-21 20:02 ` Steven Rostedt