Re: Possible to annotate ARM64 IRQ handling to help gdb?

From: Doug Anderson <dianders@chromium.org>
To: Dave Martin <Dave.Martin@arm.com>
Cc: Caroline Tice <cmtice@chromium.org>,
	kgdb-bugreport@lists.sourceforge.net,
	Will Deacon <will.deacon@arm.com>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>,
	Stephen Boyd <swboyd@chromium.org>
Subject: Re: Possible to annotate ARM64 IRQ handling to help gdb?
Date: Wed, 13 Feb 2019 13:19:17 -0800
Message-ID: <CAD=FV=V6mPHUBGn6u5v4=D_DBxrtj_MAyfHk+TOnbCRNNokYJQ@mail.gmail.com>
In-Reply-To: <20190211195750.GI3567@e103592.cambridge.arm.com>

Hi,

On Mon, Feb 11, 2019 at 11:58 AM Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Mon, Feb 11, 2019 at 09:27:11AM -0800, Doug Anderson wrote:
> > Hi,
> >
> > On Mon, Feb 4, 2019 at 4:31 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > >
> > > On Fri, Feb 01, 2019 at 01:38:05PM -0800, Doug Anderson wrote:
> > > > Hi,
> > > >
> > > > I was wondering if anyone out there has given any thought to
> > > > annotating the ARM64 IRQ handling in such a way that we could stack
> > > > crawl past el1_irq() when in gdb.
> > > >
> > > > I spent a bit of time on this a few months ago and documented all my
> > > > findings in:
> > > >
> > > > https://bugs.chromium.org/p/chromium/issues/detail?id=908721
> > > >
> > > > I can copy and paste all the discussion from that bug here, but since
> > > > it's public hopefully folks can read the discussion / investigation
> > > > there.  To put it briefly, though: I can stack crawl past "el1_irq"
> > > > with the normal linux stack crawl (which is what kdb uses) but I can't
> > > > crawl past "el1_irq" in gdb.  After talking to some of our tools
> > > > guys here I'm fairly certain that we could solve this with the right
> > > > CFI directives, but when I poked at it I wasn't able to figure out the
> > > > magic.
> > > >
> > > >
> > > > Anyway, I figured I'd check to see if anyone here happens to know the
> > > > right magic.
> > >
> > > The kernel (appears to) generate a valid frame record for el1_irq:
> > >
> > >    0xffffff8008082b94 <+84>:    mrs     x22, elr_el1
> > >
> > >         [...]
> > >
> > >    0xffffff8008082ba0 <+96>:    stp     x29, x22, [sp, #304]
> > >    0xffffff8008082ba4 <+100>:   add     x29, sp, #0x130
> > >
> > > (I note that 0x130 == 304.  Yay binutils.)
> >
> > Right, this is how the kernel is able to do the crawl.  It's also why
> > I was able to manually do the crawl in the bug by chaining together
> > frame pointers.
> >
> >
> > > From the bug report, I don't see any real investigation into what
> > > precisely causes gdb to choke on this frame.
> >
> > Right.  I just don't know gdb well enough.  :(  I've had it on my list
> > to dig into it, but I need to find time.  ;-)
> >
> >
> > > Do you have evidence that CFI annotations help in this case?  And can
> > > you explain _why_ they help (i.e., precisely how is gdb relying on the
> > > annotations)?
> >
> > I spent a tiny bit of time playing around with CFI annotations.
> > Mostly it was stumbling around in the dark since I had a hard time
> > finding good arm/arm64 examples and the documentation was a little
> > hard for me to parse.
>
> You could try compiling a few simple C functions with gcc -S
> -fexceptions and see what the compiler spits out.

Thanks, this definitely helped!

> > ...but from my experience with gdb, my guess is that gdb wants more
> > than just the simple frame pointers.  It wants to know where _all_ the
> > registers are stored on the stack and the only way it's going to get
> > that from assembly code (especially assembly code that barfed the
> > registers onto the stack somewhere that's not between FUNC and
> > ENDFUNC) is with some type of annotation.  My guess is that it doesn't
> > fall back to just looking at frame pointer chains.  Specifically as
> > you move up the stack frame in gdb and you type "info reg", the set of
> > registers changes to be those registers that are correct for the stack
> > frame you're on.  Here's a quick example showing how gdb behaves with
> > a random register that was barfed, $x22:
> >
> > (gdb) frame 3
> > #3  0xffffff800846a088 in __handle_sysrq (key=103,
> > check_mask=<optimized out>) at .../drivers/tty/sysrq.c:620
> > 620                             op_p->handler(key);
> >
> > (gdb) disass
> > Dump of assembler code for function __handle_sysrq:
> >    0xffffff8008469f64 <+0>:     str     x23, [sp, #-64]!
> >    0xffffff8008469f68 <+4>:     stp     x22, x21, [sp, #16]
> >    0xffffff8008469f6c <+8>:     stp     x20, x19, [sp, #32]
> >    0xffffff8008469f70 <+12>:    stp     x29, x30, [sp, #48]
> >    0xffffff8008469f74 <+16>:    add     x29, sp, #0x30
> >
> > (gdb) print /x $x22
> > $13 = 0xffffff8009035000
> >
> > (gdb) print /x *(void**)($x29 - 0x30 + 16)
> > $14 = 0x8000100
> >
> > (gdb) up
> > #4  0xffffff800846a0dc in handle_sysrq (key=103) at .../drivers/tty/sysrq.c:649
> > 649                     __handle_sysrq(key, true);
> >
> > (gdb) print /x $x22
> > $15 = 0x8000100
>
>
> Indeed, but this requires full DWARF or .eh_frame info, which is not
> generally available in the kernel.

Yup, but I have it for gdb and right now the problem I'm trying to
solve is being able to crawl in gdb since the kernel seems to be OK.
I guess I was thinking that perhaps the DWARF info could be confusing
gdb?

> Except for code built with -fomit-frame-pointer, you should at least
> be able to see a list of frames though: this doesn't require all the
> registers of ancestor frames to be recovered, just x29 and lr (which is
> what the frame records on the stack contain -- so no other magic info
> is required in order to recover these).
>
> gdb tries various methods to unwind a frame, and ought to fall back to
> this approach if all else fails.  Frame chains that appear to loop
> are a problem though, with no straightforward solution.
>
> My hunch is that gdb sees the frame chain attempt to loop backwards
> after el1_irq and bails out.  Is your task stack at a lower address than
> the IRQ stack?

Here's what I've got (the task stack is not at a lower address):

#16 0xffffff8008082bf0 in el1_irq () at
/mnt/host/source/src/third_party/kernel/v4.19/arch/arm64/kernel/entry.S:622
622             irq_handler

(gdb) print /x $sp
$11 = 0xffffff8008004000
(gdb) print /x $x29
$12 = 0xffffff8009003e90
(gdb) print /x ((void**)$x29)[0]
$13 = 0xffffff8009003ed0
(gdb) print /x (*(void***)$x29)[0]
$14 = 0xffffff8009003ee0

...but then I poked a bit more and found that one really big problem
is that "irq_stack_entry" swaps the stack before calling
gic_handle_irq(), and this seems to be confusing gdb.  Specifically,
the value of "sp" when I point gdb at the "el1_irq" frame is actually
"irq_stack_ptr", AKA 0xffffff8008004000.
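
To make the frame-record fallback and the "chain goes backwards" hunch
concrete, here is a rough sketch in plain C.  It is not gdb code: the
mem_entry table is a hypothetical stand-in for target memory, and the
"strict" check is only my assumption about how an unwinder might reject
a chain that hops to a lower address.  Whether gdb does exactly this is
not confirmed anywhere in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AArch64 frame record: x29 points at a {caller's x29, saved lr} pair. */
struct frame_record {
	uint64_t fp;	/* caller's x29 */
	uint64_t lr;	/* return address */
};

/* Hypothetical stand-in for target memory: one frame record at 'addr'. */
struct mem_entry {
	uint64_t addr;
	struct frame_record rec;
};

/* Look up the frame record stored at 'addr', or NULL if there is none. */
static const struct frame_record *read_record(const struct mem_entry *mem,
					      size_t n, uint64_t addr)
{
	assert(mem != NULL);
	for (size_t i = 0; i < n; i++)
		if (mem[i].addr == addr)
			return &mem[i].rec;
	return NULL;
}

/*
 * Follow the x29 chain starting at 'fp' and return how many records were
 * read.  With 'strict' set, stop as soon as the next record would not be
 * at a higher address than the current one (assumed unwinder sanity check).
 */
static size_t walk_frames(const struct mem_entry *mem, size_t n,
			  uint64_t fp, int strict, size_t max_frames)
{
	size_t count = 0;

	while (fp != 0 && count < max_frames) {
		const struct frame_record *rec = read_record(mem, n, fp);

		if (!rec)
			break;
		count++;
		if (strict && rec->fp != 0 && rec->fp <= fp)
			break;	/* chain went backwards: give up */
		fp = rec->fp;
	}
	return count;
}
```

With a made-up layout where the second record links to a lower address
(as it would if the interrupted stack sat below the IRQ stack), the
plain walk follows every record while the strict walk stops at the
stack switch.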

I've been fighting a bit with trying to figure out how to make .cfi
directives do what I want and I managed a stupid/ugly hack that at
least seems to get my stack pointer to be correct in el1_irq now:

---

 static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
 {
        u32 irqnr;
+       asm volatile (".cfi_register 31, 19");

---

...when I do that, my stack pointer is sane when I point gdb at el1_irq
(it matches x19), but I still can't get a trace.  I also haven't yet
been able to figure out how to accomplish that without hacking it into
gic_handle_irq().
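
For anyone unfamiliar with what ".cfi_register 31, 19" asks the unwinder
to do: it says the caller's value of DWARF register 31 (sp on AArch64)
currently lives in register 19 (x19).  Below is a toy model of that rule
in plain C.  The types and helper names are all made up for
illustration, and it is a simplification: a real unwinder normally
derives the caller's sp from the CFA, while this model only knows
"same value" and "in another register".

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 32	/* DWARF regs 0-30 are x0-x30; 31 is sp on AArch64 */
#define DWARF_SP 31

enum rule_kind { RULE_SAME_VALUE, RULE_REGISTER };

/* How to recover one caller register from the callee's register file. */
struct reg_rule {
	enum rule_kind kind;
	unsigned int src;	/* only used by RULE_REGISTER */
};

struct reg_file {
	uint64_t r[NUM_REGS];
};

/* Start with every register marked as unchanged by the callee. */
static void rules_init(struct reg_rule rules[NUM_REGS])
{
	for (unsigned int i = 0; i < NUM_REGS; i++) {
		rules[i].kind = RULE_SAME_VALUE;
		rules[i].src = i;
	}
}

/* Model of ".cfi_register reg, src": caller's 'reg' is held in 'src'. */
static void cfi_register(struct reg_rule rules[NUM_REGS],
			 unsigned int reg, unsigned int src)
{
	assert(reg < NUM_REGS && src < NUM_REGS);
	rules[reg].kind = RULE_REGISTER;
	rules[reg].src = src;
}

/* Apply the rules to the callee's registers to get the caller's view. */
static struct reg_file unwind_regs(const struct reg_file *callee,
				   const struct reg_rule rules[NUM_REGS])
{
	struct reg_file caller;

	for (unsigned int i = 0; i < NUM_REGS; i++) {
		if (rules[i].kind == RULE_REGISTER)
			caller.r[i] = callee->r[rules[i].src];
		else
			caller.r[i] = callee->r[i];
	}
	return caller;
}
```

Calling cfi_register(rules, DWARF_SP, 19) before unwind_regs() makes the
caller's sp come out equal to the callee's x19, which is the effect the
hack above has on the el1_irq frame.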

While it would be nice to get all this solved, it's probably not high
priority right now, so I might have to punt unless there's some other
obvious / low hanging fruit to try.

-Doug

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel