From mboxrd@z Thu Jan 1 00:00:00 1970
References: <87mtti3325.fsf@xenomai.org> <87k0om32jb.fsf@xenomai.org> <87bl9w2pkm.fsf@xenomai.org> <875z042odb.fsf@xenomai.org> <8735v82jmd.fsf@xenomai.org> <87zgx3pzp9.fsf@xenomai.org>
From: Philippe Gerum
Subject: Re: gdb test failure debug status update
In-reply-to: <87zgx3pzp9.fsf@xenomai.org>
Date: Sat, 15 May 2021 17:55:46 +0200
Message-ID: <87fsyoout9.fsf@xenomai.org>
MIME-Version: 1.0
Content-Type: text/plain
List-Id: Discussions about the Xenomai project
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: "Chen, Hongzhan"
Cc: "xenomai@xenomai.org"

Philippe Gerum writes:

> Chen, Hongzhan writes:
>
>>>>
>>>>>>-----Original Message-----
>>>>>>From: Xenomai On Behalf Of Chen, Hongzhan via Xenomai
>>>>>>Sent: Friday, April 30, 2021 4:07 PM
>>>>>>To: Philippe Gerum
>>>>>>Cc: xenomai@xenomai.org
>>>>>>Subject: RE: gdb test failure debug status update
>>>>>>
>>>>>>
>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Philippe Gerum
>>>>>>>Sent: Friday, April 30, 2021 4:01 PM
>>>>>>>To: Chen, Hongzhan
>>>>>>>Cc: xenomai@xenomai.org
>>>>>>>Subject: Re: gdb test failure debug status update
>>>>>>>
>>>>>>>
>>>>>>>Philippe Gerum writes:
>>>>>>>
>>>>>>>> Chen, Hongzhan writes:
>>>>>>>>
>>>>>>>>> The final xnthread_relaxed call path is like this asm_sysv_apic_timer_interrupt ->handle_irq_pipelined_finish
>>>>>>>>> ->dovetail_call_mayday ->handle_oob_mayday>xnthread_relax.
>>>>>>>>> That means that handle_irq_pipelined_finish is called under OOB condition of arch_pipeline_entry in
>>>>>>>>> arch/x86/kernel/irq_pipeline.c. Does that means that kernel entry/exit code is never called after return from
>>>>>>>>> xnthread_relax to handle_irq_pipelined_finish then to asm_sysv_apic_timer_interrupt? Even I enforce to
>>>>>>>>> call dovetail_request_ucall before calling final xnthread_relax system would not try to switch back to primary mode
>>>>>>>>> because kernel exit code is never called in this case?
>>>>>>>>>
>>>>>>>>
>>>>>>>> The IRQ frame is indeed kept from unwinding until the preempted task
>>>>>>>> context returns from irq_exit_pipeline(), which branches to the Cobalt
>>>>>>>> rescheduling procedure. From the Dovetail interface POV,
>>>>>>>> irq_exit_pipeline() is called by handle_irq_pipelined_finish() to allow
>>>>>>>> the companion core to perform post-IRQ chores, such as running its own
>>>>>>>> rescheduling procedure.
>>>>>>>>
>>>>>>>> IOW, if task @foo is preempted by an IRQ, then suspended in oob context
>>>>>>>> as a result of what the interrupt handler just did for it (e.g. raising
>>>>>>>> XNDBGSTOP, XNRELAX, XNPEND, XNSUSP in its state), then
>>>>>>>> handle_irq_pipelined_finish()->irq_exit_pipeline()->xnsched_run() will
>>>>>>>> cause the @foo context to switch away, effectively preventing
>>>>>>>> handle_irq_pipelined_finish() to return, until @foo resumes execution
>>>>>>> eventually.
>>>>>
>>>>> ln handle_irq_pipelined_finish, irq_exit_pipeline would at first be executed and it
>>>>> handle dovetail_call_mayday in the end. But issue happen after run dovetail_call_mayday
>>>>> because it call final xnthread_relax before gdb test failue.
>>>>>
>>>>
>>>>Can you add WARN_ON(1) to dovetail_call_mayday() and report about the
>>>>output? TIA,
>>>>
>>>>--
>>>>Philippe.
>>>>
>>>
>>>Please check following output.
>>
>> Hi Philippe,
>>
>> Please let me know if you have new patch or other thing to let me try.
>>
>
> I spent hours of this issue, and there may be a wrong basic assumption
> done in the smokey/gdb test. Specifically, handle_sigwake_event()
> un-stops the debuggee (lifting XNDBGSTOP), then sends a mayday notice to
> make sure that debuggee re-enters the kernel asap for leaving the oob
> stage. What might happen between these two events might not be as
> well-defined as this test expects (e.g. what if the debugger might be
> able to run more user code before the mayday trap is enforced?).
>
> I'll keep on debugging that stuff and let you know.

This one was nailed down eventually. As you found out, the "retuser"
event was missed and left pending. This could happen when a task resumes
to user, unwinding an IRQ frame, while being demoted from oob to in-band
context in the process. This is typically the case of the gdb test:

[hi-pri-task]
    (...timed sleep...)
[lo-spin-task]
    (...spinning...)
    TIMER-IRQENTRY (wakeup hi-pri-task)
[hi-pri-task]
    (breakpoint)
[gdb]
    tkill(lo-spin-task, SIGTRAP)
        handle_sigwake_event(lo-spin-task)
            notify_mayday(lo-spin-task)
[lo-spin-task]
    handle_mayday_event(lo-spin-task)
        switch_inband()
    TIMER-IRQEXIT
    ** missing check for pending _TIF_WORK|_TIF_RETUSER **

Which explains why lo-spin-task could run un-preempted by hi-pri-task
for a while, until the Cobalt core hits a rescheduling point eventually.

Only x86 is affected, ARM and arm64 have a different way out of IRQ
context which does not exhibit such issue.

This bug is now fixed by [1] for v5.10-dovetail. As I was at it, I added
a missing u-call request to the leave_oob helper. Thanks for your help
in digging into this.

[1] https://git.evlproject.org/linux-evl.git/commit/?h=dovetail/v5.10&id=cfcab38909d870d1ef484cd401fa00e52e86a8d0

-- 
Philippe.