Re: [PATCH 7/7] DWARF: add the config option

From: Josh Poimboeuf <jpoimboe@redhat.com>
To: Andy Lutomirski <luto@kernel.org>
Cc: "H. J. Lu" <hjl.tools@gmail.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Jiri Slaby <jslaby@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	live-patching@vger.kernel.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,
	the arch/x86 maintainers <x86@kernel.org>,
	Jiri Kosina <jikos@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 7/7] DWARF: add the config option
Date: Sat, 20 May 2017 11:20:34 -0500	[thread overview]
Message-ID: <20170520162034.fcciinh3nw5mvad5@treble> (raw)
In-Reply-To: <CALCETrWxwhKvm2jYG+d2xupd6uuEDAeBkafPonYNhxF6O=+qVA@mail.gmail.com>

On Fri, May 19, 2017 at 10:23:53PM -0700, Andy Lutomirski wrote:
> On Fri, May 19, 2017 at 2:35 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, May 19, 2017 at 04:29:13PM -0500, Josh Poimboeuf wrote:
> >> > How are you handling control flow?
> >>
> >> Control flow of what?
> >>
> >> > > Here's the struct in its current state:
> >> > >
> >> > >   #define UNDWARF_REG_UNDEFINED           0
> >> > >   #define UNDWARF_REG_CFA                 1
> >> > >   #define UNDWARF_REG_SP                  2
> >> > >   #define UNDWARF_REG_FP                  3
> >> > >   #define UNDWARF_REG_SP_INDIRECT         4
> >> > >   #define UNDWARF_REG_FP_INDIRECT         5
> >> > >   #define UNDWARF_REG_R10                 6
> >> > >   #define UNDWARF_REG_DI                  7
> >> > >   #define UNDWARF_REG_DX                  8
> >> > >
> >> >
> >> > Why only those registers?  Also, if you have the option I would really
> >> > suggest using the actual x86 register numbers (ax, ex, dx, bx, sp, bp,
> >> > si, di, r8-r15 in that order.)
> >>
> >> Those are the only registers which are ever needed as the base for
> >> finding the previous stack frame.  99% of the time it's sp or bp, the
> >> other registers are needed for aligned stacks and entry code.
> >>
> >> Using the actual register numbers isn't an option because I don't need
> >> them all and they need to fit in a small number of bits.
> >>
> >> This construct might be useful for other arches, which is why I called
> >> it "FP" instead of "BP".  But then I ruined that with the last 3 :-)
> >
> > BTW, here's the link to the unwinder code if you're interested:
> >
> >   https://github.com/jpoimboe/linux/blob/undwarf/arch/x86/kernel/unwind_undwarf.c
> 
> At the risk of potentially overcomplicating matters, here's a
> suggestion.  As far as I know, all (or most all?) unwinders
> effectively do the following in a loop:
> 
> 1. Look up the IP to figure out how to unwind from that IP.
> 2. Use the results of that lookup to compute the previous frame state.
> 
> The results of step 1 could perhaps be expressed like this:
> 
> struct reg_formula {
>   unsigned int source_reg :4;
>   long offset;
>   bool dereference;  /* true: *(reg + offset); false: (reg + offset) */
>   /* For DWARF, I think this can be considerably more complicated, but
> I doubt it's useful. */
> };
> 
> struct unwind_step {
>   u16 available_regs;  /* mask of the caller frame regs that we are
> able to recover */
>   struct reg_formula[16];
> };

Ok, so I assume you mean we would need to have an in-kernel DWARF reader
which reads .eh_frame and converts it to the above, which is called for
every step of the unwind phase.

> The CFA computation is just reg_formula[UNWIND_REG_SP] (or that plus
> or minus sizeof(unsigned long) or whatever -- I can never remember
> exactly what CFA refers to.)  For a frame pointer-based unwinder, the
> entire unwind_step is a foregone conclusion independent of IP: SP = BP
> + 8 (or whatever), BP = *(BP + whatever), all other regs unknown.
> 
> Could it make sense to actually structure the code this way?  I can
> see a few advantages.  It would make the actual meat of the unwind
> loop totally independent of the unwinding algorithm in use,

Yes, this part of it is an interesting idea, separating the debuginfo
reading step from the unwinding step.  And for people who don't want to
carry around megs of DWARF data, get_dwarf_step() could just be a fake
lookup which always returns the frame pointer version.  

> it would
> make the meat be dead simple (and thus easy to verify for
> non-crashiness), and I think it just might open the door for a real
> in-kernel DWARF unwinder that Linus would be okay with.  Specifically,
> write a function like:
> 
> bool get_dwarf_step(struct unwind_step *step, unsigned long ip);

If we keep the frame pointer encoding thing for non-DWARF kernels then
we may also need to pass in bp as well.

Or maybe we could force frame pointer users to at least have DWARF data
for the entry code.  Then even frame pointer kernels could detect entry
regs and we could get rid of that nasty frame pointer encoding thing.

> Put this function in its own file and make it buildable as kernel code
> or as user code.  Write a test case that runs it on every single
> address on the kernel image (in user mode!) with address-sanitizer
> enabled (or in Valgrind or both) and make sure that (a) it doesn't
> blow up and (b) that the results are credible (e.g. by comparing to
> objtool output).  Heck, you could even fuzz-test it where the fuzzer
> is allowed to corrupt the actual DWARF data.  You could do the same
> thing with whatever crazy super-compacted undwarf scheme someone comes
> up with down the road, too.

I think your proposal can be separated into two ideas, which can each be
considered on their own merit:

1) Put .eh_frame in the kernel, along with an in-kernel DWARF unwinder.
   Manage the complexity of the unwinder by validating the output of the
   complex part of the algorithm in user space.

   That's a lot of hoops to jump through.  The only real advantage I can
   see is that it would allow us to use the toolchain's DWARF data.  But
   that data is going to have issues anyway because of inline asm,
   special sections (e.g., exception tables), generated code (e.g.,
   bpf), hand-coded asm with missing/incorrect annotations, etc.

   Now, we could use objtool to find such issues and warn about them.
   Then those issues could be corrected, either by hand or
   programmatically.

   But then, if we're going that far, why not just have objtool reformat
   the data into something much simpler?  It already has the knowledge
   to do so.  Then we don't have to jump through all those hoops to
   justify jumping through more hoops in the kernel (i.e., having a
   complex DWARF state machine).  With a simple debuginfo format, the
   kernel unwinder is simple enough that we don't need to validate its
   functionality in a simulator.

2) Make a unified unwinder which uses get_dwarf_step() to abstract out
   the differences between frame pointers and DWARF (or undwarf or
   whatever else).  This could an interesting.  Though I'm not sure how
   it would integrate with our "guess" unwinder for kernels which don't
   have frame pointers or DWARF/undwarf.  I guess we could still keep
   that one separate.

> I personally like the idea of using real DWARF annotations in the
> entry code because it makes gdb work better (not kgdb -- real gdb
> attached to KVM).
>
> I bet that we could get entry asm annotations into
> good shape if we really wanted to.  OTOH, getting DWARF to work well
> for inline asm is really nasty IIRC.
> 
> (H.J., could we get a binutils feature that allows is to do:
> 
> pushq %whatever
> .cfi_adjust_sp -8
> ...
> popq %whatever
> .cfi_adjust_sp 8
> 
> that will emit the right DWARF instructions regardless of whether
> there's a frame pointer or not?  .cfi_adjust_cfa_offset is not
> particularly helpful here because it's totally wrong if the CFA is
> currently being computed based on BP.)

I agree that entry_64.o should have DWARF data, regardless of what the
in-kernel representation of the data looks like.  And the same for all
the other asm .o files.

But how we achieve that is debatable.  In the past, having all the
manual .cfi annotations everywhere for every push/pop was an
unmaintainable disaster.

If we could have binutils do automatic adjustments for pushes and pops
and sp adds/subtracts without the .cfi annotations, that would be great.

Otherwise, objtool can do it.  And it can write both undwarf and DWARF,
if needed.

> Also, you read the stack like this:
> 
> *val = *(unsigned long *)addr;
> 
> how about probe_kernel_read() instead?

It depends on how paranoid you want to be.  The undwarf code has the
level of paranoia the frame pointer code has always had.

In other words, it ensures each dereference is within the bounds of the
current stack.  It relies on task->stack and the percpu exception/irq
stack pointers, as well as the previous stack pointers at the edge of
each stack.

I don't think we've ever had a problem with that level of paranoia, so I
think staying with the status quo might be fine.

If we did use probe_kernel_read() we'd have to use some other form of it
because it does some kasan and hardened usercopy checks which wouldn't
always be appropriate.

-- 
Josh