From: Guo Ren <guoren@kernel.org>
To: Mark Rutland <mark.rutland@arm.com>
Cc: David Laight <david.laight@aculab.com>,
	Evgenii Shatokhin <e.shatokhin@yadro.com>,
	"suagrfillet@gmail.com" <suagrfillet@gmail.com>,
	"andy.chiu@sifive.com" <andy.chiu@sifive.com>,
	"linux-riscv@lists.infradead.org" <linux-riscv@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Guo Ren <guoren@linux.alibaba.com>,
	"anup@brainfault.org" <anup@brainfault.org>,
	"paul.walmsley@sifive.com" <paul.walmsley@sifive.com>,
	"palmer@dabbelt.com" <palmer@dabbelt.com>,
	"conor.dooley@microchip.com" <conor.dooley@microchip.com>,
	"heiko@sntech.de" <heiko@sntech.de>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	"mhiramat@kernel.org" <mhiramat@kernel.org>,
	"jolsa@redhat.com" <jolsa@redhat.com>,
	"bp@suse.de" <bp@suse.de>,
	"jpoimboe@kernel.org" <jpoimboe@kernel.org>,
	"linux@yadro.com" <linux@yadro.com>
Subject: Re: [PATCH -next V7 0/7] riscv: Optimize function trace
Date: Fri, 10 Feb 2023 10:21:01 +0800
Message-ID: <CAJF2gTSL0c2GH0Jy+hrhFsswq3BsoqyC81harC_K=9TspS8eaQ@mail.gmail.com>
In-Reply-To: <Y+TC037Erd+bsrB7@FVFF77S0Q05N>

On Thu, Feb 9, 2023 at 5:54 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Thu, Feb 09, 2023 at 09:59:33AM +0800, Guo Ren wrote:
> > On Thu, Feb 9, 2023 at 9:51 AM Guo Ren <guoren@kernel.org> wrote:
> > >
> > > On Thu, Feb 9, 2023 at 6:29 AM David Laight <David.Laight@aculab.com> wrote:
> > > >
> > > > > > # Note: aligned to 8 bytes
> > > > > > addr-08 // Literal (first 32-bits) // patched to ops ptr
> > > > > > addr-04 // Literal (last 32-bits)  // patched to ops ptr
> > > > > > addr+00 func: mv t0, ra
> > > > > We needn't "mv t0, ra" here because our "jalr" could work with t0 and
> > > > > won't affect ra. Let's do it in the trampoline code, and then we can
> > > > > save another word here.
> > > > > > addr+04 auipc t1, ftrace_caller
> > > > > > addr+08 jalr  ftrace_caller(t1)
> > > >
> > > > Is that some kind of 'load high' and 'add offset' pair?
> > > Yes.
> > >
> > > > I guess 64bit kernels guarantee to put all module code
> > > > within +-2G of the main kernel?
> > > Yes, 32-bit is enough. So we only need one 32-bit literal size for the
> > > current rv64, just like CONFIG_32BIT.
> > We need kernel_addr_base + this 32-bit literal.
> >
> > @Mark Rutland
> > What do you think about the idea of shaving one more 32-bit word off
> > the call-site? (It should also work for arm64.)
>
> The literal pointer is for a struct ftrace_ops, which is data, not code.
>
> An ftrace_ops can be allocated from anywhere (e.g. core kernel data, module
> data, linear map, vmalloc space), and so is not guaranteed to be within 2GiB of
> all code. The literal needs to be able to address the entire kernel address
> range, and since it can be modified concurrently (with PREEMPT and not using
> stop_machine()) it needs to be possible to read/write atomically. So
> practically speaking it needs to be the native pointer size (i.e. 64-bit on a
> 64-bit kernel).

Got it, thanks. Let's use an absolute pointer as the beginning.

> Other schemes for compressing that (e.g. using an integer index into an array
> of pointers) are possible, but they use more memory and get more complicated
> for concurrent manipulation, so I would strongly recommend keeping this simple
> and using a native pointer size here.
>
> > > > > Here is the call-site:
> > > > > # Note: aligned to 8 bytes
> > > > > addr-08 // Literal (first 32-bits) // patched to ops ptr
> > > > > addr-04 // Literal (last 32-bits)  // patched to ops ptr
> > > > > addr+00 auipc t0, ftrace_caller
> > > > > addr+04 jalr  ftrace_caller(t0)
> > > >
> > > > Could you even do something like:
> > > >     addr-n    call ftrace-function
> > > >     addr-n+x  literals
> > > >     addr+0    nop or jmp addr-n
> > > >     addr+4    function_code
> > > Yours costs one more instruction, right?
> > > addr-12 auipc
> > > addr-8  jalr
> > > addr-4  // Literal (32-bits)
> > > addr+0  nop or jmp addr-n // one more?
> > > addr+4  function_code
>
> Placing instructions before the entry point is going to confuse symbol
> resolution and unwind code, so I would not recommend that. It also means the
> trampoline will need to re-adjust the return address back into the function,
> but that is relatively simple.
>
> I also think that this is micro-optimizing. The AUIPC *should* be cheap, so
> executing that unconditionally should be fine. I think the form that Guo
> suggested with AUIPC + {JALR || NOP} in the function (and 64 bits reserved
> immediately before the function) is the way to go, so long as that does the
> right thing with ra.
>
> Thanks,
> Mark.

--
Best Regards
 Guo Ren