Re: [PATCH RFC v2 riscv/for-next 0/5] Enable ftrace with kernel preemption for RISC-V

From: Evgenii Shatokhin <e.shatokhin@yadro.com>
To: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Chiu <andy.chiu@sifive.com>, <palmer@dabbelt.com>,
	<paul.walmsley@sifive.com>, <aou@eecs.berkeley.edu>,
	<rostedt@goodmis.org>, <mingo@redhat.com>, <peterz@infradead.org>,
	<jpoimboe@kernel.org>, <jbaron@akamai.com>, <ardb@kernel.org>,
	<greentime.hu@sifive.com>, <zong.li@sifive.com>,
	<guoren@kernel.org>, Jessica Clarke <jrtc27@jrtc27.com>,
	<kernel@esmil.dk>, <linux-riscv@lists.infradead.org>,
	<linux@yadro.com>
Subject: Re: [PATCH RFC v2 riscv/for-next 0/5] Enable ftrace with kernel preemption for RISC-V
Date: Thu, 7 Mar 2024 11:35:58 +0300	[thread overview]
Message-ID: <0ad4166a-44ac-4b7e-a6f8-e4509a305c8e@yadro.com> (raw)
In-Reply-To: <bd7c2263-4e81-4be6-af4d-22a0842ba2e4@ghiti.fr>

Hi Alexandre,

On 06.03.2024 23:57, Alexandre Ghiti wrote:
> Hi Evgenii,
> 
> On 21/02/2024 17:55, Evgenii Shatokhin wrote:
>> On 21.02.2024 08:27, Andy Chiu wrote:
>>> «Внимание! Данное письмо от внешнего адресата!»
>>>
>>> On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin
>>> <e.shatokhin@yadro.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 13.09.2022 12:42, Andy Chiu wrote:
>>>>> This patch removes dependency of dynamic ftrace from calling
>>>>> stop_machine(), and makes it compatiable with kernel preemption.
>>>>> Originally, we ran into stack corruptions, or execution of partially
>>>>> updated instructions when starting or stopping ftrace on a fully
>>>>> preemptible kernel configuration. The reason is that kernel
>>>>> periodically
>>>>> calls rcu_momentary_dyntick_idle() on cores waiting for the
>>>>> code-patching
>>>>> core running in ftrace. Though rcu_momentary_dyntick_idle() itself is
>>>>> marked as notrace, it would call a bunch of tracable functions if we
>>>>> configured the kernel as preemptible. For example, these are some
>>>>> functions
>>>>> that happened to have a symbol and have not been marked as notrace
>>>>> on a
>>>>> RISC-V preemptible kernel compiled with GCC-11:
>>>>>    - __rcu_report_exp_rnp()
>>>>>    - rcu_report_exp_cpu_mult()
>>>>>    - rcu_preempt_deferred_qs()
>>>>>    - rcu_preempt_need_deferred_qs()
>>>>>    - rcu_preempt_deferred_qs_irqrestore()
>>>>>
>>>>> Thus, this make it not ideal for us to rely on stop_machine() and
>>>>> handly marked "notrace"s to perform runtime code patching. To remove
>>>>> such dependency, we must make updates of code seemed atomic on running
>>>>> cores. This might not be obvious for RISC-V since it usaually uses
>>>>> a pair
>>>>> of AUIPC + JALR to perform a long jump, which cannot be modified and
>>>>> executed concurrently if we consider preemptions. As such, this patch
>>>>> proposed a way to make it possible. It embeds a 32-bit rel-address
>>>>> data
>>>>> into instructions of each ftrace prologue and jumps indirectly. In
>>>>> this
>>>>> way, we could store and load the address atomically so that the code
>>>>> patching core could run simutaneously with the rest of running cores.
>>>>>
>>>>> After applying the patchset, we compiled a preemptible kernel with all
>>>>> tracers and ftrace-selftest enabled, and booted it on a 2-core QEMU
>>>>> virt
>>>>> machine. The kernel could boot up successfully, passing all ftrace
>>>>> testsuits. Besides, we ran a script that randomly pick a tracer on
>>>>> every
>>>>> 0~5 seconds. The kernel has sustained over 20K rounds of the test. In
>>>>> contrast, a preemptible kernel without our patch would panic in few
>>>>> rounds on the same machine.
>>>>>
>>>>> Though we ran into errors when using hwlat or irqsoff tracers together
>>>>> with cpu-online stressor from stress-ng on a preemptible kernel. We
>>>>> believe the reason may be that  percpu workers of the tracers are
>>>>> being
>>>>> queued into unbounded workqueue when cpu get offlined and patches
>>>>> will go
>>>>> through tracing tree.
>>>>>
>>>>> Additionally, we found patching of tracepoints unsafe since the
>>>>> instructions being patched are not naturally aligned. This may
>>>>> result in
>>>>> 2 half-word stores, which breaks atomicity, during the code patching.
>>>>>
>>>>> changes in patch v2:
>>>>>    - Enforce alignments on all functions with a compiler workaround.
>>>>>    - Support 64bit addressing for ftrace targets if xlen == 64
>>>>>    - Initialize ftrace target addresses to avoid calling bad
>>>>> address in a
>>>>>      hypothesized case.
>>>>>    - Use LGPTR instead of SZPTR since .align is log-scaled for
>>>>>      mcount-dyn.S
>>>>>    - Require the nop instruction of all jump_labels aligns
>>>>> naturally on
>>>>>      4B.
>>>>>
>>>>> Andy Chiu (5):
>>>>>     riscv: align ftrace to 4 Byte boundary and increase ftrace
>>>>> prologue
>>>>>       size
>>>>>     riscv: export patch_insn_write
>>>>>     riscv: ftrace: use indirect jump to work with kernel preemption
>>>>>     riscv: ftrace: do not use stop_machine to update code
>>>>>     riscv: align arch_static_branch function
>>>>>
>>>>>    arch/riscv/Makefile                 |   2 +-
>>>>>    arch/riscv/include/asm/ftrace.h     |  24 ----
>>>>>    arch/riscv/include/asm/jump_label.h |   2 +
>>>>>    arch/riscv/include/asm/patch.h      |   1 +
>>>>>    arch/riscv/kernel/ftrace.c          | 179
>>>>> ++++++++++++++++++++--------
>>>>>    arch/riscv/kernel/mcount-dyn.S      |  69 ++++++++---
>>>>>    arch/riscv/kernel/patch.c           |   4 +-
>>>>>    7 files changed, 188 insertions(+), 93 deletions(-)
>>>>>
>>>>
>>>> First of all, thank you for working on making dynamic Ftrace robust in
>>>> preemptible kernels on RISC-V.
>>>> It is an important use case but, for now, dynamic Ftrace and related
>>>> tracers cannot be safely used with such kernels.
>>>>
>>>> Are there any updates on this series?
>>>> It needs a rebase, of course, but it looks doable.
>>>>
>>>> If I understand the discussion correctly, the only blocker was that
>>>> using "-falign-functions" was not enough to properly align cold
>>>> functions and "-fno-guess-branch-probability" would likely have a
>>>> performance cost.
>>>>
>>>> It seems, GCC developers have recently provided a workaround for that
>>>> (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326,
>>>>
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c24).
>>>>
>>>> "-fmin-function-alignment" should help but, I do not know, which GCC
>>>> versions have got that patch already. In the meantime, one could
>>>> probably check if "-fmin-function-alignment" is supported by the
>>>> compiler and use it, if it is.
>>>>
>>>> Thoughts?
>>>
>>> Hi Evgenii,
>>>
>>> Thanks for the update. Indeed, it is essential to this patch for
>>> toolchain to provide forced alignment. We can test this flag in the
>>> Makefile to sort out if toolchain supports it or not. Meanwhile, I had
>>> figured out a way for this to work on any 2-B align addresses but
>>> hadn't implemented it out yet. Basically it would require more
>>> patching space for us to do software alignment. I would opt for a
>>> special toolchain flag if the toolchain just supports it.
>>>
>>> Let me take some time to look and get back to you soon.
>>
>> Thank you! Looking forward to it.
>>
>> In case it helps, here is what I have checked so far.
>>
>> 1.
>> I added the patch
>> https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326
>> to the current revision of GCC 13.2.0 from RISC-V toolchain.
>>
>> Rebased your patchset on top of Linux 6.8-rc4 (mostly - context
>> changes, SYM_FUNC_START/SYM_FUNC_END for asm symbols, etc.).
>>
>> Reverted 8547649981e6 ("riscv: ftrace: Fixup panic by disabling
>> preemption").
>>
>> Switched from -falign-functions=4 to -fmin-function-alignment=4:
>> ------------------
>> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
>> index b33b787c8b07..dcd0adeebaae 100644
>> --- a/arch/riscv/Makefile
>> +++ b/arch/riscv/Makefile
>> @@ -15,9 +15,9 @@ ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>>      LDFLAGS_vmlinux += --no-relax
>>      KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
>>  ifeq ($(CONFIG_RISCV_ISA_C),y)
>> -    CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -falign-functions=4
>> +    CC_FLAGS_FTRACE := -fpatchable-function-entry=12
>> -fmin-function-alignment=4
>>  else
>> -    CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -falign-functions=4
>> +    CC_FLAGS_FTRACE := -fpatchable-function-entry=6
>> -fmin-function-alignment=4
>>  endif
>>  endif
>>
>> ------------------
>>
>> As far as I can see from objdump, the functions that were not aligned
>> at 4-byte boundary with -falign-functions=4, are now aligned correctly
>> with -fmin-function-alignment=4.
>>
>> 2.
>> I tried the kernel in a QEMU VM with 2 CPUs and "-machine virt".
>>
>> The boottime tests for Ftrace had passed, except the tests for
>> function_graph. I described the failure and the possible fix here:
>> https://lore.kernel.org/all/dcc5976d-635a-4710-92df-94a99653314e@yadro.com/
>>
>>
>> 3.
>> There were also boottime warnings about "RCU not on for:
>> arch_cpu_idle+0x0/0x2c". These are probably not related to your
>> patchset, but rather to the fact that Ftrace is enabled in a
>> preemptble kernel where RCU does different things.
>>
>> As a workaround, I disabled tracing of arch_cpu_idle() for now:
>> ------------------
>> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
>> index 92922dbd5b5c..6abeecbfc51d 100644
>> --- a/arch/riscv/kernel/process.c
>> +++ b/arch/riscv/kernel/process.c
>> @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
>>
>>  extern asmlinkage void ret_from_fork(void);
>>
>> -void arch_cpu_idle(void)
>> +void noinstr arch_cpu_idle(void)
>>  {
>>      cpu_do_idle();
>>  }
> 
> 
> I came up with the same fix for this, based on a similar fix for s390. I
> have a patch ready and will send it soon since to me, it is a fix, not a
> workaround.
> 
> Thanks,
> 
> Alex

Great! Thank you. That is very good news.

By the way, have you tried switching dynamic tracers like "function", 
"function_graph", etc. while the system is under pressure, on a kernel 
with this patchset?

I am using 'stress-ng --hrtimers 1' and memory corruption still happens 
within a few minutes each time. I described the issue earlier.

It seems as if the address of ftrace_caller is sometimes written to a 
wrong location when enabling "function" or "function_graph" tracer. 
Perhaps, a barrier is missing somewhere, or something.

> 
> 
>>
>> ------------------
>>
>> 4.
>> Stress-testing revealed an issue though, which I do not understand yet.
>>
>> Probably similar to what you did earlier, I ran a script that switched
>> the current tracer to "function", "function_graph", "nop", "blk" each
>> 1-5 seconds. In another shell, "stress-ng --hrtimers 1" was running.
>>
>> The kernel usually crashed within a few minutes, in seemingly random
>> locations, but often in one of two ways:
>>
>> (a) Invalid instruction, because the address of ftrace_caller function
>> was somehow written to the body of the traced function rather than
>> just to the Ftrace prologue.
>>
>> In the following example, the crash happened at 0xffffffff800d3398.
>> "b0 d7" is actually not part of the code here, but rather the lower
>> bytes of 0xffffffff8000d7b0, the address of ftrace_caller() in this
>> kernel.
>>
>> (gdb) disas /r 0xffffffff800d3382,+0x20
>> Dump of assembler code from 0xffffffff800d3382 to 0xffffffff800d33a2:
>> ...
>>    0xffffffff800d3394 <clockevents_program_event+144>:  ba 87   mv a5,a4
>>    0xffffffff800d3396 <clockevents_program_event+146>:  c1 bf   j
>> 0xffffffff800d3366 <clockevents_program_event+98>
>>    0xffffffff800d3398 <clockevents_program_event+148>:  b0 d7   sw
>> a2,104(a5) // 0xffffffff8000d7b0, the address of ftrace_caller().
>>    0xffffffff800d339a <clockevents_program_event+150>:  00 80   .2byte
>> 0x8000
>>    0xffffffff800d339c <clockevents_program_event+152>:  ff ff   .2byte
>> 0xffff
>>    0xffffffff800d339e <clockevents_program_event+154>:  ff ff   .2byte
>> 0xffff
>>    0xffffffff800d33a0 <clockevents_program_event+156>:  d5 bf   j
>> 0xffffffff800d3394 <clockevents_program_event+144
>>
>> The backtrace usually contains one or more occurrences of
>> return_to_handler() in this case.
>>
>> [  260.520394] [<ffffffff800d3398>] clockevents_program_event+0xac/0x100
>> [  260.521195] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>> [  260.521843] [<ffffffff800c50ba>] hrtimer_interrupt+0x122/0x20c
>> [  260.522492] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>> [  260.523132] [<ffffffff8009785e>] handle_percpu_devid_irq+0x9e/0x1ec
>> [  260.523788] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>> [  260.524437] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>> [  260.525080] [<ffffffff80a8acfa>] handle_riscv_irq+0x4a/0x74
>> [  260.525726] [<ffffffff80a97b9a>] call_on_irq_stack+0x32/0x40
>> ----------------------
>>
>> (b) Jump to an invalid location, e.g. to the middle of a valid 4-byte
>> instruction. %ra usually points right after the last instruction,
>> "jalr   a2", in return_to_handler() in such cases, so the jump was
>> likely made from there.
>>
>> The problem is reproducible, although I have not found what causes it
>> yet.
>>
>> Any help is appreciated, of course.
>>
>>>
>>>>
>>>> Regards,
>>>> Evgenii
>>>
>>> Regards,
>>> Andy
>>

Regards,
Evgenii

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv