* Re: [BUG] perf: bogus correlation of kernel symbols
       [not found] <1305292059.1949.0.camel@dan>
@ 2011-05-13 13:29 ` Dan Rosenberg
  2011-05-16 15:35   ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Dan Rosenberg @ 2011-05-13 13:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: davej, kees.cook, davem, eranian, mingo, torvalds, adobriyan, penberg

Hi all,

I would have appreciated a CC on this one, as the author of the feature
that got disabled.

> * Dave Jones <davej@redhat.com> wrote:
> 
> > On Thu, May 12, 2011 at 11:50:23PM +0200, Ingo Molnar wrote:
> >  
> >  > Dunno, i would not couple them necessarily - certain users might still have 
> >  > access to kernel symbols via some other channel - for example the System.map.
> >  
> > That always made this security by obscurity feature seem pointless for the bulk
> > of users to me. Given the majority are going to be running distro kernels,
> > anyone can find those addresses easily no matter how hard we hide them on the
> > running system.
> 
> I certainly agree and made that argument as well, in the original thread(s) 
> about /proc/kallsyms.
> 

I agree about the fact that kptr_restrict is an incomplete security
feature.  However, I disagree that it lacks usefulness entirely.
Virtually every public kernel exploit in the past year
leverages /proc/kallsyms or other kernel address leakage to target an
attack.  I'm not ignorant of the fact that it's trivial to fingerprint
distribution kernels in the absence of this information, but the reality
is, a huge portion of real life exploit attempts leverage pre-fabricated
exploits and are conducted by people who lack the ability to adjust
exploits to target a specific running kernel.  Even though this is
trivial to sidestep if you know what you're doing, this extra little
step may mean some script kiddie can't root some poor sysadmin's
machine, and that's a win.  In addition, when more powerful
randomization is hopefully introduced, blocking access to these pointers
will be more essential in preserving the lack of knowledge of the
location of kernel internals.

But this is all just for the record I suppose, since it seems that ship
has sailed.

> > Unless we somehow introduced randomness into where we unpack the kernel 
> > each boot, and used System.map as a table of offsets instead of absolute 
> > addresses.
> 
> Correct. This security feature is IMO only solving a tiny fraction of the 
> problem and is thus in fact hindering the implementation of a *real* layer
> of protection of kernel absolute addresses:
> 
> The x86 kernel is relocatable, so slightly randomizing the position of the 
> kernel would be feasible with no overhead on the vast majority of existing 
> distro installs, with just an updated kernel.
> 
> When exposing randomized RIPs to user-space we could recalculate all RIPs back 
> to the 0xffffffff80000000 base, so oopses would have the usual non-randomized 
> form:
> 
> [   32.946003] IP: [<ffffffff80222521>] get_cur_val+0xcc/0x106
> [   32.946003] PGD 0 
> [   32.946003] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
> [   32.946003] last sysfs file: 
> [   32.946003] CPU 1 
> [   32.946003] Pid: 1, comm: swapper Tainted: G        W  2.6.29-rc1-00190-g37a76bd #10
> [   32.946003] RIP: 0010:[<ffffffff80222521>]  [<ffffffff80222521>] get_cur_val+0xcc/0x106
> [   32.946003] RSP: 0018:ffff88003f977b80  EFLAGS: 00010202
> [   32.946003] RAX: 0000000000000001 RBX: ffff8800029c8c80 RCX: 0000000000000008
> [   32.946003] RDX: 0000000000000000 RSI: ffffffff80ce0100 RDI: 0000000000000000
> [   32.946003] RBP: ffff88003f977bd0 R08: 0000000000000004 R09: 0000000000000040
> [   32.946003] R10: 0000000000000060 R11: 0000000081363fa8 R12: ffffffff81c4f0e0
> [   32.946003] R13: ffffffff80ce0100 R14: ffff88003c888a00 R15: 0000000000000000
> [   32.946003] FS:  0000000000000000(0000) GS:ffff88003f802c00(0000) knlGS:0000000000000000
> [   32.946003] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [   32.946003] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> [   32.946003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   32.946003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [   32.946003] Process swapper (pid: 1, threadinfo ffff88003f976000, task ffff88003f978000)
> [   32.946003] Stack:
> 
> Likewise, /proc/kallsyms could pass these addresses as well and the perf 
> call-chain code and other places that sample RIPs could easily convert them to 
> the constant address as well.
> 
> We'd still leak some information like the relative position of symbols from 
> each other (this can be useful to certain classes of attacks), but we could 
> pretty effectively hide the absolute location of the kernel - which is the most 
> valuable piece of information -.
> 
> Then the random base has to be protected: i.e. all information leaks of raw 
> kernel RIPs have to be plugged. The nice thing is that this will happen as 
> *bugfixes*: randomized RIPs will not be useful for anything, so any 
> tools/people who rely on them will notice it immediately.
> 
> I think *that* would be a maintainable and complete security feature to truly 
> hide the exact location of the kernel image. kptr_restrict is not.
> 

I want this feature, as I think it is far more useful and important.
This has been mentioned before, but no one has stepped up to actually do
it.  Unfortunately, I lack the necessary knowledge of the relevant code
to do it properly.  What's the best way to make this feature a reality?

Regards,
Dan


> Thanks,
> 
> 	Ingo




* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-13 13:29 ` [BUG] perf: bogus correlation of kernel symbols Dan Rosenberg
@ 2011-05-16 15:35   ` Ingo Molnar
  2011-05-16 16:14     ` Dan Rosenberg
  2011-05-20  0:56     ` Dan Rosenberg
  0 siblings, 2 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-16 15:35 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: linux-kernel, davej, kees.cook, davem, eranian, torvalds,
	adobriyan, penberg


* Dan Rosenberg <drosenberg@vsecurity.com> wrote:

> Hi all,
> 
> I would have appreciated a CC on this one, as the author of the feature
> that got disabled.

That's true and sorry about it: i could have sworn the author was Cc:-ed but 
confused you with Kees ...

> > * Dave Jones <davej@redhat.com> wrote:
> > 
> > > On Thu, May 12, 2011 at 11:50:23PM +0200, Ingo Molnar wrote:
> > >  
> > >  > Dunno, i would not couple them necessarily - certain users might still have 
> > >  > access to kernel symbols via some other channel - for example the System.map.
> > >  
> > > That always made this security by obscurity feature seem pointless for the bulk
> > > of users to me. Given the majority are going to be running distro kernels,
> > > anyone can find those addresses easily no matter how hard we hide them on the
> > > running system.
> > 
> > I certainly agree and made that argument as well, in the original thread(s) 
> > about /proc/kallsyms.
> 
> I agree about the fact that kptr_restrict is an incomplete security feature.  
> However, I disagree that it lacks usefulness entirely. Virtually every public 
> kernel exploit in the past year leverages /proc/kallsyms or other kernel 
> address leakage to target an attack.  I'm not ignorant of the fact that it's 
> trivial to fingerprint distribution kernels in the absence of this 
> information, but the reality is, a huge portion of real life exploit attempts 
> leverage pre-fabricated exploits and are conducted by people who lack the 
> ability to adjust exploits to target a specific running kernel.  Even though 
> this is trivial to sidestep if you know what you're doing, this extra little 
> step may mean some script kiddie can't root some poor sysadmin's machine, and 
> that's a win.  In addition, when more powerful randomization is hopefully 
> introduced, blocking access to these pointers will be more essential in 
> preserving the lack of knowledge of the location of kernel internals.

Well, but let's think it through further: what happens when we make such a change? 

 - Script kiddies get thwarted for a few weeks.

 - Script authors will laugh and will update their scripts to query rpmfind.net 
   or other package servers for symbol info.

 - After that transition all the exploits will continue to work. They might in 
   fact be more robust because they can specifically target only package 
   versions that are known to be exploitable.

 - *Useful* tools that do not try to harm the system will stay less useful 
   forever and that's permanent collateral damage.

I.e. we would have driven the development of *attack* tools to be even more 
harmful and will have hurt *useful* tools. Is this really what we want?

> But this is all just for the record I suppose, since it seems that ship has 
> sailed.

We can still revert the revert as well although indeed it is not very common.

> > > Unless we somehow introduced randomness into where we unpack the kernel 
> > > each boot, and used System.map as a table of offsets instead of absolute 
> > > addresses.
> > 
> > Correct. This security feature is IMO only solving a tiny fraction of the 
> > problem and is thus in fact hindering the implementation of a *real* layer
> > of protection of kernel absolute addresses:
> > 
> > The x86 kernel is relocatable, so slightly randomizing the position of the 
> > kernel would be feasible with no overhead on the vast majority of existing 
> > distro installs, with just an updated kernel.
> > 
> > When exposing randomized RIPs to user-space we could recalculate all RIPs back 
> > to the 0xffffffff80000000 base, so oopses would have the usual non-randomized 
> > form:
> > 
> > [   32.946003] IP: [<ffffffff80222521>] get_cur_val+0xcc/0x106
> > [   32.946003] PGD 0 
> > [   32.946003] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
> > [   32.946003] last sysfs file: 
> > [   32.946003] CPU 1 
> > [   32.946003] Pid: 1, comm: swapper Tainted: G        W  2.6.29-rc1-00190-g37a76bd #10
> > [   32.946003] RIP: 0010:[<ffffffff80222521>]  [<ffffffff80222521>] get_cur_val+0xcc/0x106
> > [   32.946003] RSP: 0018:ffff88003f977b80  EFLAGS: 00010202
> > [   32.946003] RAX: 0000000000000001 RBX: ffff8800029c8c80 RCX: 0000000000000008
> > [   32.946003] RDX: 0000000000000000 RSI: ffffffff80ce0100 RDI: 0000000000000000
> > [   32.946003] RBP: ffff88003f977bd0 R08: 0000000000000004 R09: 0000000000000040
> > [   32.946003] R10: 0000000000000060 R11: 0000000081363fa8 R12: ffffffff81c4f0e0
> > [   32.946003] R13: ffffffff80ce0100 R14: ffff88003c888a00 R15: 0000000000000000
> > [   32.946003] FS:  0000000000000000(0000) GS:ffff88003f802c00(0000) knlGS:0000000000000000
> > [   32.946003] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > [   32.946003] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> > [   32.946003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [   32.946003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [   32.946003] Process swapper (pid: 1, threadinfo ffff88003f976000, task ffff88003f978000)
> > [   32.946003] Stack:
> > 
> > Likewise, /proc/kallsyms could pass these addresses as well and the perf 
> > call-chain code and other places that sample RIPs could easily convert them to 
> > the constant address as well.
> > 
> > We'd still leak some information like the relative position of symbols from 
> > each other (this can be useful to certain classes of attacks), but we could 
> > pretty effectively hide the absolute location of the kernel - which is the most 
> > valuable piece of information -.
> > 
> > Then the random base has to be protected: i.e. all information leaks of raw 
> > kernel RIPs have to be plugged. The nice thing is that this will happen as 
> > *bugfixes*: randomized RIPs will not be useful for anything, so any 
> > tools/people who rely on them will notice it immediately.
> > 
> > I think *that* would be a maintainable and complete security feature to truly 
> > hide the exact location of the kernel image. kptr_restrict is not.
> > 
> 
> I want this feature, as I think it is far more useful and important. This has 
> been mentioned before, but no one has stepped up to actually do it.  
> Unfortunately, I lack the necessary knowledge of the relevant code to do it 
> properly.  What's the best way to make this feature a reality?

Agreed, it would be a very useful feature.

I'd suggest to implement it along the lines of:

 - First check whether grsecurity or PAX has this implemented already via the
   relocation facility - they are pretty good at being paranoid so i'd be 
   surprised if they didnt think of this already! :-)

 - If not then have a look at CONFIG_RELOCATABLE and try to relocate the kernel 
   binary intentionally via a hardcoded parameter. Just see whether you can do 
   it and whether it works as you expect it. Check /proc/kallsyms changing 
   after your patch. Enjoy the kernel still working ;-)

 - Then promote it to a boot parameter - this way you'll be able to tell 
   whether there's any hidden build-time assumptions about relocation position. 
   (there really shouldnt be any given that kexec works just fine - but i'd 
    suggest this step just in case.)

 - Then promote that hack to be a randomized parameter. Marvel at a different,
   randomized /proc/kallsyms output at every bootup and enjoy the still working 
   kernel!

 - Then look at all RIP outputs (thanks to your prior efforts they are now 
   mostly concentrated in the vsprintf code!) and reverse apply the random 
   offset before it's exported into user-space. wchan, etc. Marvel at the 
   constant /proc/kallsyms output, fully knowing that the *real* addresses
   are randomized.

 - Please do not forget to transfer perf RIPs and callchains and marvel at the
   well working 'perf top' output.

At that point the feature will be highly useful already IMO. Remaining work 
will be to think through and close down all remaining avenues of RIP leakage. 

At this point kptr_restrict will be a lot less relevant - the symbols will 
expose offsets (so it's not totally unhelpful to attackers) but not the real 
absolute addresses.

Unless i'm missing some particularly difficult roadblock, which is possible.

If you try this then please keep us posted at every step above, even if your 
patches are not fully working and useful yet. Maybe some other 
details/ideas/suggestions will arise at that point.

Thanks,

	Ingo


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-16 15:35   ` Ingo Molnar
@ 2011-05-16 16:14     ` Dan Rosenberg
  2011-05-20  0:56     ` Dan Rosenberg
  1 sibling, 0 replies; 50+ messages in thread
From: Dan Rosenberg @ 2011-05-16 16:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, davej, kees.cook, davem, eranian, torvalds,
	adobriyan, penberg

On Mon, 2011-05-16 at 17:35 +0200, Ingo Molnar wrote:

> Agreed, it would be a very useful feature.
> 
> I'd suggest to implement it along the lines of:
> 
>  - First check whether grsecurity or PAX has this implemented already via the
>    relocation facility - they are pretty good at being paranoid so i'd be 
>    surprised if they didnt think of this already! :-)
> 
>  - If not then have a look at CONFIG_RELOCATABLE and try to relocate the kernel 
>    binary intentionally via a hardcoded parameter. Just see whether you can do 
>    it and whether it works as you expect it. Check /proc/kallsyms changing 
>    after your patch. Enjoy the kernel still working ;-)
> 
>  - Then promote it to a boot parameter - this way you'll be able to tell 
>    whether there's any hidden build-time assumptions about relocation position. 
>    (there really shouldnt be any given that kexec works just fine - but i'd 
>     suggest this step just in case.)
> 
>  - Then promote that hack to be a randomized parameter. Marvel at a different,
>    randomized /proc/kallsyms output at every bootup and enjoy the still working 
>    kernel!
> 
>  - Then look at all RIP outputs (thanks to your prior efforts they are now 
>    mostly concentrated in the vsprintf code!) and reverse apply the random 
>    offset before it's exported into user-space. wchan, etc. Marvel at the 
>    constant /proc/kallsyms output, fully knowing that the *real* addresses
>    are randomized.
> 
>  - Please do not forget to transfer perf RIPs and callchains and marvel at the
>    well working 'perf top' output.
> 
> At that point the feature will be highly useful already IMO. Remaining work 
> will be to think through and close down all remaining avenues of RIP leakage. 
> 
> At this point kptr_restrict will be a lot less relevant - the symbols will 
> expose offsets (so it's not totally unhelpful to attackers) but not the real 
> absolute addresses.
> 
> Unless i'm missing some particularly difficult roadblock, which is possible.
> 
> If you try this then please keep us posted at every step above, even if your 
> patches are not fully working and useful yet. Maybe some other 
> details/ideas/suggestions will arise at that point.
> 

Thanks for the detailed response.  I will attempt to go down this road,
and will keep people posted with my progress.

-Dan

> Thanks,
> 
> 	Ingo




* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-16 15:35   ` Ingo Molnar
  2011-05-16 16:14     ` Dan Rosenberg
@ 2011-05-20  0:56     ` Dan Rosenberg
  2011-05-20 12:07       ` Ingo Molnar
  1 sibling, 1 reply; 50+ messages in thread
From: Dan Rosenberg @ 2011-05-20  0:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, davej, kees.cook, davem, eranian, torvalds,
	adobriyan, penberg


> > > > Unless we somehow introduced randomness into where we unpack the kernel 
> > > > each boot, and used System.map as a table of offsets instead of absolute 
> > > > addresses.
> > > 
> > > Correct. This security feature is IMO only solving a tiny fraction of the 
> > > problem and is thus in fact hindering the implementation of a *real* layer
> > > of protection of kernel absolute addresses:
> > > 
> > > The x86 kernel is relocatable, so slightly randomizing the position of the 
> > > kernel would be feasible with no overhead on the vast majority of existing 
> > > distro installs, with just an updated kernel.
> > > 
> > > When exposing randomized RIPs to user-space we could recalculate all RIPs back 
> > > to the 0xffffffff80000000 base, so oopses would have the usual non-randomized 
> > > form:
> > > 
> > > [   32.946003] IP: [<ffffffff80222521>] get_cur_val+0xcc/0x106
> > > [   32.946003] PGD 0 
> > > [   32.946003] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
> > > [   32.946003] last sysfs file: 
> > > [   32.946003] CPU 1 
> > > [   32.946003] Pid: 1, comm: swapper Tainted: G        W  2.6.29-rc1-00190-g37a76bd #10
> > > [   32.946003] RIP: 0010:[<ffffffff80222521>]  [<ffffffff80222521>] get_cur_val+0xcc/0x106
> > > [   32.946003] RSP: 0018:ffff88003f977b80  EFLAGS: 00010202
> > > [   32.946003] RAX: 0000000000000001 RBX: ffff8800029c8c80 RCX: 0000000000000008
> > > [   32.946003] RDX: 0000000000000000 RSI: ffffffff80ce0100 RDI: 0000000000000000
> > > [   32.946003] RBP: ffff88003f977bd0 R08: 0000000000000004 R09: 0000000000000040
> > > [   32.946003] R10: 0000000000000060 R11: 0000000081363fa8 R12: ffffffff81c4f0e0
> > > [   32.946003] R13: ffffffff80ce0100 R14: ffff88003c888a00 R15: 0000000000000000
> > > [   32.946003] FS:  0000000000000000(0000) GS:ffff88003f802c00(0000) knlGS:0000000000000000
> > > [   32.946003] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > > [   32.946003] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> > > [   32.946003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [   32.946003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > [   32.946003] Process swapper (pid: 1, threadinfo ffff88003f976000, task ffff88003f978000)
> > > [   32.946003] Stack:
> > > 
> > > Likewise, /proc/kallsyms could pass these addresses as well and the perf 
> > > call-chain code and other places that sample RIPs could easily convert them to 
> > > the constant address as well.
> > > 
> > > We'd still leak some information like the relative position of symbols from 
> > > each other (this can be useful to certain classes of attacks), but we could 
> > > pretty effectively hide the absolute location of the kernel - which is the most 
> > > valuable piece of information -.
> > > 
> > > Then the random base has to be protected: i.e. all information leaks of raw 
> > > kernel RIPs have to be plugged. The nice thing is that this will happen as 
> > > *bugfixes*: randomized RIPs will not be useful for anything, so any 
> > > tools/people who rely on them will notice it immediately.
> > > 
> > > I think *that* would be a maintainable and complete security feature to truly 
> > > hide the exact location of the kernel image. kptr_restrict is not.
> > > 
> > 
> > I want this feature, as I think it is far more useful and important. This has 
> > been mentioned before, but no one has stepped up to actually do it.  
> > Unfortunately, I lack the necessary knowledge of the relevant code to do it 
> > properly.  What's the best way to make this feature a reality?
> 
> Agreed, it would be a very useful feature.
> 
> I'd suggest to implement it along the lines of:
> 
>  - First check whether grsecurity or PAX has this implemented already via the
>    relocation facility - they are pretty good at being paranoid so i'd be 
>    surprised if they didnt think of this already! :-)
> 
>  - If not then have a look at CONFIG_RELOCATABLE and try to relocate the kernel 
>    binary intentionally via a hardcoded parameter. Just see whether you can do 
>    it and whether it works as you expect it. Check /proc/kallsyms changing 
>    after your patch. Enjoy the kernel still working ;-)
> 
>  - Then promote it to a boot parameter - this way you'll be able to tell 
>    whether there's any hidden build-time assumptions about relocation position. 
>    (there really shouldnt be any given that kexec works just fine - but i'd 
>     suggest this step just in case.)
> 
>  - Then promote that hack to be a randomized parameter. Marvel at a different,
>    randomized /proc/kallsyms output at every bootup and enjoy the still working 
>    kernel!
> 
>  - Then look at all RIP outputs (thanks to your prior efforts they are now 
>    mostly concentrated in the vsprintf code!) and reverse apply the random 
>    offset before it's exported into user-space. wchan, etc. Marvel at the 
>    constant /proc/kallsyms output, fully knowing that the *real* addresses
>    are randomized.
> 
>  - Please do not forget to transfer perf RIPs and callchains and marvel at the
>    well working 'perf top' output.
> 
> At that point the feature will be highly useful already IMO. Remaining work 
> will be to think through and close down all remaining avenues of RIP leakage. 
> 
> At this point kptr_restrict will be a lot less relevant - the symbols will 
> expose offsets (so it's not totally unhelpful to attackers) but not the real 
> absolute addresses.
> 
> Unless i'm missing some particularly difficult roadblock, which is possible.
> 
> If you try this then please keep us posted at every step above, even if your 
> patches are not fully working and useful yet. Maybe some other 
> details/ideas/suggestions will arise at that point.
> 

I was able to boot a relocatable kernel with the decompression location
at a hard-coded offset without too much trouble.  Everything seems to
work fine.

However, it occurred to me that even if the kernel image's base address
were randomized at boot, assuming a binary distro kernel it would still
be possible to sidt the address of the IDT and calculate symbol offsets
relative to that.  Any thoughts on how to avoid that?  Seems difficult.
Another hurdle will be to find a reasonable source of entropy that early
in the boot process.

-Dan


> Thanks,
> 
> 	Ingo




* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20  0:56     ` Dan Rosenberg
@ 2011-05-20 12:07       ` Ingo Molnar
  2011-05-20 12:54         ` Dan Rosenberg
  0 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2011-05-20 12:07 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: linux-kernel, davej, kees.cook, davem, eranian, torvalds,
	adobriyan, penberg


* Dan Rosenberg <drosenberg@vsecurity.com> wrote:

> I was able to boot a relocatable kernel with the decompression location at a 
> hard-coded offset without too much trouble.  Everything seems to work fine.

Nice!

> However, it occurred to me that even if the kernel image's base address were 
> randomized at boot, assuming a binary distro kernel it would still be 
> possible to sidt the address of the IDT and calculate symbol offsets relative 
> to that.  Any thoughts on how to avoid that?  Seems difficult. Another hurdle 
> will be to find a reasonable source of entropy that early in the boot 
> process.

I do not think it's an issue.

If an attacker can execute arbitrary privileged instructions like SIDT then 
it's game over. There's plenty of CPU state, the IDT, GDT, various MSRs that 
would tell roughly where the kernel is, etc.

The attack randomization protects against is when the attacker has a limited 
amount of control over a stack return address (due to a buffer overflow for 
example) and can redirect kernel execution to some 'interesting' place that 
allows more control. With SMEP and kernel image randomization this would be 
rather difficult to pull off: the kernel wont jump to a pre-prepared user-space 
shellcode buffer (due to SMEP) while the location of already existing, 
executable, supervisor-privileged pages is randomized.

So when you have implemented this i'd suggest enabling CONFIG_X86_PTDUMP=y to 
get access to a dump of all pagetables, in the /debug/kernel_page_tables file. 
There you can check every single executable, kernel-privileged mapping on a 
live system and make sure it's not easily discovered.

Thanks,

	Ingo


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 12:07       ` Ingo Molnar
@ 2011-05-20 12:54         ` Dan Rosenberg
  2011-05-20 13:11           ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Dan Rosenberg @ 2011-05-20 12:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, davej, kees.cook, davem, eranian, torvalds,
	adobriyan, penberg

On Fri, 2011-05-20 at 14:07 +0200, Ingo Molnar wrote:
> * Dan Rosenberg <drosenberg@vsecurity.com> wrote:
> 
> > I was able to boot a relocatable kernel with the decompression location at a 
> > hard-coded offset without too much trouble.  Everything seems to work fine.
> 
> Nice!
> 
> > However, it occurred to me that even if the kernel image's base address were 
> > randomized at boot, assuming a binary distro kernel it would still be 
> > possible to sidt the address of the IDT and calculate symbol offsets relative 
> > to that.  Any thoughts on how to avoid that?  Seems difficult. Another hurdle 
> > will be to find a reasonable source of entropy that early in the boot 
> > process.
> 
> I do not think it's an issue.
> 
> If an attacker can execute arbitrary privileged instructions like SIDT then 
> it's game over. There's plenty of CPU state, the IDT, GDT, various MSRs that 
> would tell roughly where the kernel is, etc.
> 

Except that SIDT isn't a privileged instruction; it's accessible from
ring 3.

> The attack randomization protects against is when the attacker has a limited 
> amount of control over a stack return address (due to a buffer overflow for 
> example) and can redirect kernel execution to some 'interesting' place that 
> allows more control. With SMEP and kernel image randomization this would be 
> rather difficult to pull off: the kernel wont jump to a pre-prepared user-space 
> shellcode buffer (due to SMEP) while the location of already existing, 
> executable, supervisor-privileged pages is randomized.
> 

Yes, all true, except are you specifically considering remote-only
attack vectors?

> So when you have implemented this i'd suggest enabling CONFIG_X86_PTDUMP=y to 
> get access to a dump of all pagetables, in the /debug/kernel_page_tables file. 
> There you can check every single executable, kernel-privileged mapping on a 
> live system and make sure it's not easily discovered.
> 

I'll do this too, but first I'd like to address the above.

Thanks,
Dan

> Thanks,
> 
> 	Ingo




* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 12:54         ` Dan Rosenberg
@ 2011-05-20 13:11           ` Ingo Molnar
  2011-05-20 17:41             ` Dan Rosenberg
  2011-05-22 18:45             ` Dan Rosenberg
  0 siblings, 2 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-20 13:11 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: linux-kernel, davej, kees.cook, davem, eranian, torvalds,
	adobriyan, penberg


* Dan Rosenberg <drosenberg@vsecurity.com> wrote:

> On Fri, 2011-05-20 at 14:07 +0200, Ingo Molnar wrote:
> > * Dan Rosenberg <drosenberg@vsecurity.com> wrote:
> > 
> > > I was able to boot a relocatable kernel with the decompression location at a 
> > > hard-coded offset without too much trouble.  Everything seems to work fine.
> > 
> > Nice!
> > 
> > > However, it occurred to me that even if the kernel image's base address were 
> > > randomized at boot, assuming a binary distro kernel it would still be 
> > > possible to sidt the address of the IDT and calculate symbol offsets relative 
> > > to that.  Any thoughts on how to avoid that?  Seems difficult. Another hurdle 
> > > will be to find a reasonable source of entropy that early in the boot 
> > > process.
> > 
> > I do not think it's an issue.
> > 
> > If an attacker can execute arbitrary privileged instructions like SIDT then 
> > it's game over. There's plenty of CPU state, the IDT, GDT, various MSRs 
> > that would tell roughly where the kernel is, etc.
> 
> Except that SIDT isn't a privileged instruction; it's accessible from ring 3.

Oops, stupid me :-/

We need to allocate the IDT dynamically: just kmalloc() it, update idt_descr 
and do a load_idt(). Double check places that modify idt_descr or use 
idt_table.

Note, you could do this as a side effect of a nice performance optimization: 
would you be interested in allocating it in the percpu area, using 
alloc_percpu()? That way the IDT is distributed between CPUs - this has 
scalability advantages on NUMA systems and maybe even on SMP.

> > The attack randomization protects against is when the attacker has a 
> > limited amount of control over a stack return address (due to a buffer 
> > overflow for example) and can redirect kernel execution to some 
> > 'interesting' place that allows more control. With SMEP and kernel image 
> > randomization this would be rather difficult to pull off: the kernel wont 
> > jump to a pre-prepared user-space shellcode buffer (due to SMEP) while the 
> > location of already existing, executable, supervisor-privileged pages is 
> > randomized.
> 
> Yes, all true, except are you specifically considering remote-only attack 
> vectors?

No, unprivileged local user, so yes, the IDT address has to be protected.

Thanks,

	Ingo


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 13:11           ` Ingo Molnar
@ 2011-05-20 17:41             ` Dan Rosenberg
  2011-05-20 18:14               ` Linus Torvalds
  2011-05-20 18:35               ` Ingo Molnar
  2011-05-22 18:45             ` Dan Rosenberg
  1 sibling, 2 replies; 50+ messages in thread
From: Dan Rosenberg @ 2011-05-20 17:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, davej, kees.cook, davem, eranian, torvalds,
	adobriyan, penberg

On Fri, 2011-05-20 at 15:11 +0200, Ingo Molnar wrote:

> We need to allocate the IDT dynamically: just kmalloc() it, update idt_descr 
> and do a load_idt(). Double check places that modify idt_descr or use 
> idt_table.
> 
> Note, you could do this as a side effect of a nice performance optimization: 
> would you be interested in allocating it in the percpu area, using 
> percpu_alloc()? That way the IDT is distributed between CPUs - this has 
> scalability advantages on NUMA systems and maybe even on SMP.
> 

Any suggestions on when this allocation should take place?  I'm hesitant
to touch anything in arch/x86/kernel/head_32.S, where the IDT is set up
and lidt idt_descr is called (on x86-32 anyway).  That means at some
point I'd have to copy the table into a region allocated with
alloc_percpu() and set up a new descriptor.  Seems like this should
happen before IRQs are enabled, but I'm not sure about the best place.

Also, I'd still welcome suggestions on generating entropy so early in
the boot process as to randomize the location at which the kernel is
decompressed.

On a related note, would there be obstacles to marking the IDT as
read-only?

Thanks,
Dan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 17:41             ` Dan Rosenberg
@ 2011-05-20 18:14               ` Linus Torvalds
  2011-05-20 18:27                 ` Kees Cook
                                   ` (2 more replies)
  2011-05-20 18:35               ` Ingo Molnar
  1 sibling, 3 replies; 50+ messages in thread
From: Linus Torvalds @ 2011-05-20 18:14 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: Ingo Molnar, linux-kernel, davej, kees.cook, davem, eranian,
	adobriyan, penberg

On Fri, May 20, 2011 at 10:41 AM, Dan Rosenberg
<drosenberg@vsecurity.com> wrote:
>
> Also, I'd still welcome suggestions on generating entropy so early in
> the boot process as to randomize the location at which the kernel is
> decompressed.

The fundamental problem with the whole kernel address randomization is
sadly totally unrelated to any of the small details.

There's a *big* detail that makes it hard: there's only a few bits of
randomness we can add to the address. The kernel base address ends up
having various fundamental limitations (cacheline alignment for the
code, and we have several segments that require page alignment), so
you really can't realistically do more than something like 8-12 bits
of address randomization.

Which means that once you have a vmlinux image (say, because it's a
standard distro kernel), you only need to try your exploit a few
hundred times. That can be done quickly enough that no MIS person will
ever have time to react to the attack.

Sure, it will likely leave some hints around (oopses etc), but still..

> On a related note, would there be obstacles to marking the IDT as
> read-only?

We do that for the F00F bug workaround. But while the linear address
is read-only, the IDT can still be accessed read-write through the
physical address through the normal 1:1 mapping.

Regardless, the virtual mapping trick (independently of whether it's
read-only or not) can be used to avoid exposing the *actual* address
of the IDT of the kernel, and would hide the kernel load address
details. However, it does make traps slightly slower, if they cannot
use the 1:1 mapping with large pages for the IDT access and thus cause
more TLB pressure. Of course, in many situations we probably end up
not having large pages for the kernel anyway, so..

As a result, we do that F00F bug workaround _only_ if we're actually
running on a CPU with the F00F bug.

                   Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 18:14               ` Linus Torvalds
@ 2011-05-20 18:27                 ` Kees Cook
  2011-05-20 18:34                   ` Dan Rosenberg
  2011-05-20 18:28                 ` Ingo Molnar
  2011-05-22  6:11                 ` david
  2 siblings, 1 reply; 50+ messages in thread
From: Kees Cook @ 2011-05-20 18:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Rosenberg, Ingo Molnar, linux-kernel, davej, davem, eranian,
	adobriyan, penberg

On Fri, May 20, 2011 at 11:14:09AM -0700, Linus Torvalds wrote:
> On Fri, May 20, 2011 at 10:41 AM, Dan Rosenberg
> <drosenberg@vsecurity.com> wrote:
> >
> > Also, I'd still welcome suggestions on generating entropy so early in
> > the boot process as to randomize the location at which the kernel is
> > decompressed.
> 
> The fundamental problem with the whole kernel address randomization is
> sadly totally unrelated to any of the small details.
> 
> There's a *big* detail that makes it hard: there's only a few bits of
> randomness we can add to the address. The kernel base address ends up
> having various fundamental limitations (cacheline alignment for the
> code, and we have several segments that require page alignment), so
> you really can't realistically do more than something like 8-12 bits
> of address randomization.
> 
> Which means that once you have a vmlinux image (say, because it's a
> standard distro kernel), you only need to try your exploit a few
> hundred times. That can be done quickly enough that no MIS person will
> ever have time to react to the attack.
> 
> Sure, it will likely leave some hints around (oopses etc), but still..

Certain flaws will present that way, yes. Others will take the entire
system down on the first missed address guess. Many times, ASLR will
give a statistical advantage to the defender. As a result it has value,
even if it's not perfect.

-Kees

-- 
Kees Cook
Ubuntu Security Team

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 18:14               ` Linus Torvalds
  2011-05-20 18:27                 ` Kees Cook
@ 2011-05-20 18:28                 ` Ingo Molnar
  2011-05-22  6:11                 ` david
  2 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-20 18:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Rosenberg, linux-kernel, davej, kees.cook, davem, eranian,
	adobriyan, penberg


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> There's a *big* detail that makes it hard: there's only a few bits of 
> randomness we can add to the address. The kernel base address ends up having 
> various fundamental limitations (cacheline alignment for the code, and we 
> have several segments that require page alignment), so you really can't 
> realistically do more than something like 8-12 bits of address randomization.

Yeah, i tried to address this issue in my first mail: basically just a few bits 
would already make a big difference in practice: even a *single* bit of 
randomness makes an exploit crash 50% of the time - at which point the attack 
stops being stealthy.

8 bits would be a lot.

So i think this is really realistic, even if a brute force, networked attack 
can successfully attack 1 out of 256, 512 or 1024 boxes. Even for the worm case 
the networked attack would not scale very well.
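
The numbers traded above can be made concrete with a small C sketch. It is
not from the thread: the function names are made up for illustration, and it
assumes every one of the 2^bits load slots is equally likely and that an
attacker who retries never repeats a slot.

```c
#include <assert.h>
#include <stdint.h>

/* Number of equally likely load slots for `bits` bits of randomization. */
uint64_t slot_count(unsigned bits)
{
	assert(bits < 64);
	return (uint64_t)1 << bits;
}

/* Chance that one blind guess lands on the right slot. */
double single_guess_success(unsigned bits)
{
	return 1.0 / (double)slot_count(bits);
}

/*
 * Average number of tries for an attacker who walks the slots one by one
 * without repeating: (N + 1) / 2 for N slots.
 */
double expected_attempts(unsigned bits)
{
	return ((double)slot_count(bits) + 1.0) / 2.0;
}
```

Under these assumptions a single bit gives a 50% miss rate per guess, and
8 bits mean roughly 128 tries on average for a brute-force attacker, each
miss being a potential oops.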

> Regardless, the virtual mapping trick (independently of whether it's 
> read-only or not) can be used to avoid exposing the *actual* address of the 
> IDT of the kernel, and would hide the kernel load address details. However, 
> it does make traps slightly slower, if they cannot use the 1:1 mapping with 
> large pages for the IDT access and thus cause more TLB pressure. Of course, 
> in many situations we probably end up not having large pages for the kernel 
> anyway, so..

We could put per CPU IDTs into the percpu area if that improves performance.

This might help on NUMA: on NUMA only one node has the IDT local, the others 
will take a remote DRAM access every time they miss the IDT - and the IDT could 
easily be missed if there are no IRQs or traps for a long time (say CPU-bound 
user-space processing).

There may also be cases where an implicit locked access is generated to the 
IDT?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 18:27                 ` Kees Cook
@ 2011-05-20 18:34                   ` Dan Rosenberg
  2011-05-20 18:42                     ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Dan Rosenberg @ 2011-05-20 18:34 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, Ingo Molnar, linux-kernel, davej, davem, eranian,
	adobriyan, penberg

On Fri, 2011-05-20 at 11:27 -0700, Kees Cook wrote:
> On Fri, May 20, 2011 at 11:14:09AM -0700, Linus Torvalds wrote:
> > On Fri, May 20, 2011 at 10:41 AM, Dan Rosenberg
> > <drosenberg@vsecurity.com> wrote:
> > >
> > > Also, I'd still welcome suggestions on generating entropy so early in
> > > the boot process as to randomize the location at which the kernel is
> > > decompressed.
> > 
> > The fundamental problem with the whole kernel address randomization is
> > sadly totally unrelated to any of the small details.
> > 
> > There's a *big* detail that makes it hard: there's only a few bits of
> > randomness we can add to the address. The kernel base address ends up
> > having various fundamental limitations (cacheline alignment for the
> > code, and we have several segments that require page alignment), so
> > you really can't realistically do more than something like 8-12 bits
> > of address randomization.
> > 
> > Which means that once you have a vmlinux image (say, because it's a
> > standard distro kernel), you only need to try your exploit a few
> > hundred times. That can be done quickly enough that no MIS person will
> > ever have time to react to the attack.
> > 
> > Sure, it will likely leave some hints around (oopses etc), but still..
> 
> Certain flaws will present that way, yes. Others will take the entire
> system down on the first missed address guess. Many times, ASLR will
> give a statistical advantage to the defender. As a result it has value,
> even if it's not perfect.
> 

At least one distro (Red Hat) ships with panic_on_oops enabled by
default, so attackers don't get more than one chance.  Likewise,
vulnerabilities in interrupt context will only have one chance, as will
any issue where failed exploitation prevents subsequent attempts, as is
frequently the case due to failures to clean up locking primitives on an
OOPS.

-Dan

> -Kees
> 
> -- 
> Kees Cook
> Ubuntu Security Team



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 17:41             ` Dan Rosenberg
  2011-05-20 18:14               ` Linus Torvalds
@ 2011-05-20 18:35               ` Ingo Molnar
  1 sibling, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-20 18:35 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: linux-kernel, davej, kees.cook, davem, eranian, torvalds,
	adobriyan, penberg


* Dan Rosenberg <drosenberg@vsecurity.com> wrote:

> On Fri, 2011-05-20 at 15:11 +0200, Ingo Molnar wrote:
> 
> > We need to allocate the IDT dynamically: just kmalloc() it, update idt_descr 
> > and do a load_idt(). Double check places that modify idt_descr or use 
> > idt_table.
> > 
> > Note, you could do this as a side effect of a nice performance optimization: 
> > would you be interested in allocating it in the percpu area, using 
> > percpu_alloc()? That way the IDT is distributed between CPUs - this has 
> > scalability advantages on NUMA systems and maybe even on SMP.
> > 
> 
> Any suggestions on when this allocation should take place?  I'm hesitant to 
> touch anything in arch/x86/kernel/head_32.S, where the IDT is setup and lidt 
> idt_descr is called (on x86-32 anyway).  That means at some point I'd have to 
> copy the table into a region allocated with alloc_percpu() and set up a new 
> descriptor.  Seems like this should happen before IRQ is enabled, but I'm not 
> sure about the best place.

I think there's a static percpu area that can be used pretty early on.

The boot IDT can be marked __initdata so its space won't be wasted.

The thing is, until SMP is initialized the boot IDT can be kept. So i'd 
suggest allocating per CPU IDTs after memory has initialized. For that a pretty 
good place is trap_init(): there we already have the page allocator initialized 
and probably the percpu allocator too. IDT allocation is also pretty naturally 
done in trap_init().

> Also, I'd still welcome suggestions on generating entropy so early in the 
> boot process as to randomize the location at which the kernel is 
> decompressed.
> 
> On a related note, would there be obstacles to marking the IDT as read-only?

The cost is that its access TLB may change from a 2MB TLB to a 4K TLB. We 
generally try to keep critical data structures in 2MB mapped areas.

But this is really hard to measure (you'd have to have a borderline workload 
where the loss of a single 4K TLB is measurable) so i'd suggest splitting this 
from the randomization step.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 18:34                   ` Dan Rosenberg
@ 2011-05-20 18:42                     ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-20 18:42 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: Kees Cook, Linus Torvalds, linux-kernel, davej, davem, eranian,
	adobriyan, penberg


* Dan Rosenberg <drosenberg@vsecurity.com> wrote:

> At least one distro (Red Hat) ships with panic_on_oops enabled by default, so 
> attackers don't get more than one chance.  Likewise, vulnerabilities in 
> interrupt context will only have one chance, as will any issue where failed 
> exploitation prevents subsequent attempts, as is frequently the case due to 
> failures to clean up locking primitives on an OOPS.

So it's basically a last line of defense: the attacker has to assume the risk 
of the attack being detected.

That has a chilling effect on some types of attacks: especially those where the 
attacker goes against a high value target with a zero day kernel exploit. 
Risking a crash does not just mean possibly alerting the target, but also means 
possibly losing the zero-day exploit - if that oops log gets to a kernel 
developer who starts wondering about the weird backtrace.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 18:14               ` Linus Torvalds
  2011-05-20 18:27                 ` Kees Cook
  2011-05-20 18:28                 ` Ingo Molnar
@ 2011-05-22  6:11                 ` david
  2 siblings, 0 replies; 50+ messages in thread
From: david @ 2011-05-22  6:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Rosenberg, Ingo Molnar, linux-kernel, davej, kees.cook,
	davem, eranian, adobriyan, penberg

On Fri, 20 May 2011, Linus Torvalds wrote:

> Sure, it will likely leave some hints around (oopses etc), but still..
>

as an admin running a large site, I'd love to have evidence around that 
strange things were happening, especially prior to the exploit running.

no, I probably can't take manual action fast enough to prevent the initial 
exploit, but I can alarm on the failed attacks and potentially react fast 
enough to prevent the attacker from going further into my network.

David Lang

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-20 13:11           ` Ingo Molnar
  2011-05-20 17:41             ` Dan Rosenberg
@ 2011-05-22 18:45             ` Dan Rosenberg
       [not found]               ` <BANLkTik1SK_kWVvGsKk0SqdByQ5-0b5nFg@mail.gmail.com>
  1 sibling, 1 reply; 50+ messages in thread
From: Dan Rosenberg @ 2011-05-22 18:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, davej, kees.cook, davem, eranian, torvalds,
	adobriyan, penberg, hpa

On Fri, 2011-05-20 at 15:11 +0200, Ingo Molnar wrote:
> * Dan Rosenberg <drosenberg@vsecurity.com> wrote:
> 
> > On Fri, 2011-05-20 at 14:07 +0200, Ingo Molnar wrote:
> > > * Dan Rosenberg <drosenberg@vsecurity.com> wrote:
> > > 
> > > > I was able to boot a relocatable kernel with the decompression location at a 
> > > > hard-coded offset without too much trouble.  Everything seems to work fine.
> > > 
> > > Nice!
> > > 

A progress update, and a number of questions.

The randomization itself is working fine with the following hack in
arch/x86/boot/compressed/head_32.S:

#ifdef CONFIG_RELOCATABLE
	movl	%ebp, %ebx
#ifdef CONFIG_RANDOMIZE_BASE
	rdtsc
	shll	$0x8, %eax
	andl	$0x3ffffff, %eax
	addl	%eax, %ebx
#endif
	movl	BP_kernel_alignment(%esi), %eax
	decl	%eax
	addl	%eax, %ebx
	notl	%eax
	andl	%eax, %ebx
#endif
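
For readers who prefer C, the arithmetic of the assembly above can be
modelled roughly as follows. This is only a sketch: the function names are
invented here, `tsc_low` stands for the low 32 bits that rdtsc leaves in
%eax, and `align` stands for the value read from BP_kernel_alignment.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round addr up to the next multiple of align (a power of two).
 * Models the decl / addl / notl / andl sequence.
 */
uint32_t align_up(uint32_t addr, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	uint32_t mask = align - 1;
	return (addr + mask) & ~mask;
}

/*
 * Models the CONFIG_RANDOMIZE_BASE hunk: shift the cycle counter left by
 * 8, keep 26 bits (an offset below 64 MiB), add it to the load address,
 * then align the result.
 */
uint32_t randomized_base(uint32_t load_addr, uint32_t tsc_low, uint32_t align)
{
	uint32_t offset = (tsc_low << 8) & 0x3ffffff;
	return align_up(load_addr + offset, align);
}
```

In this model, with an already aligned load address and 16 MiB alignment,
the result can only land on one of five 16 MiB boundaries (and the first
only when the offset is exactly zero), whereas 1 MiB alignment would leave
about 64 usable positions.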

That brings me to my first two questions:

1. Is it ok to assume the existence of rdtsc?  If not, what are other
ways of gathering entropy early in the boot process?  If this is the
approach that's going to be taken, system uptime potentially becomes
useful for attackers.  Any thoughts on how to address this?

2. The current default physical alignment is 16mb as a result of this
patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ceefccc93932b920a8ec6f35f596db05202a12fe

Having 16mb alignment greatly restricts the amount of usable entropy for
randomization.  It seems an alternate solution to the problem the above
patch addresses is reverting back to 1mb alignment (or 2/4 mb if that
has performance benefits) and enforcing a 16mb minimum physical start
for relocatable kernels by bumping it up in the boot code if necessary.
Would this be possible?  I'd like to avoid requiring distros to touch
CONFIG_PHYSICAL_ALIGN (and risk breaking things) in order to have more
useful randomization.

A few more questions arose during my efforts:

3. The current hack I'm using to determine the offset to reverse apply
to symbol output looks something like this:

unsigned long kptr_adjust = ((unsigned long)_text &
~(CONFIG_PHYSICAL_ALIGN-1)) - (PAGE_OFFSET + CONFIG_PHYSICAL_START);

Is it safe to assume that kernel .text is the first thing in a
decompressed kernel image?  If not, any other suggestions?  It seemed
easier to compute this in the decompressed kernel at runtime rather than
try to figure out a way to pass the actual decompression address from
the boot stage to the main kernel.
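
As a rough illustration of that computation, here is a standalone C sketch
with the kernel macros turned into plain parameters. The function names are
hypothetical, and the sketch bakes in the very assumption being asked about:
that _text lies within the first CONFIG_PHYSICAL_ALIGN bytes of the image.

```c
#include <assert.h>

/*
 * Offset between where the kernel text actually is and where it would be
 * without randomization. Parameters stand in for _text, PAGE_OFFSET,
 * CONFIG_PHYSICAL_START and CONFIG_PHYSICAL_ALIGN.
 */
unsigned long kptr_adjust_for(unsigned long text_addr,
			      unsigned long page_offset,
			      unsigned long phys_start,
			      unsigned long phys_align)
{
	assert(phys_align != 0 && (phys_align & (phys_align - 1)) == 0);
	unsigned long aligned_text = text_addr & ~(phys_align - 1);
	return aligned_text - (page_offset + phys_start);
}

/* Reverse-apply the offset before printing a kernel text address. */
unsigned long unrandomize(unsigned long addr, unsigned long adjust)
{
	return addr - adjust;
}
```

With illustrative x86-32 style values (page offset 0xC0000000, physical
start and alignment both 0x1000000), a kernel whose text ended up at
0xC3000000 yields an adjustment of 0x2000000, and an unrandomized kernel
yields zero.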

4. What kind of behavior do people want with %pK and kptr_restrict?  If
possible, I'd like to find a way that perf users can have the benefits
of this feature and still have usable symbol support.  However, module
symbols are a bit tricky, since they're not being relocated with the
rest of the kernel, and it doesn't seem meaningful to reverse-apply the
same offset in module symbol output.  Perhaps a separate format
specifier should be introduced to differentiate symbols that need to be
offset?

Basically, we've got kernel .text symbols, module symbols, and dynamic
kernel pointers, and I'm not sure with what granularity people are
interested in hiding them.  It seems perf at least wants more than "all
or nothing".

5. I'd like some more opinions on moving the IDT.  So far, the two
options are using a fixed mapping similar to the F00F bug fix, and
allocating it percpu at runtime.

Looking forward to feedback, criticism, disgust, etc.

Regards,
Dan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
       [not found]               ` <BANLkTik1SK_kWVvGsKk0SqdByQ5-0b5nFg@mail.gmail.com>
@ 2011-05-23  0:25                 ` Dan Rosenberg
  2011-05-23  0:37                   ` H. Peter Anvin
  2011-05-23 10:49                   ` Ingo Molnar
  0 siblings, 2 replies; 50+ messages in thread
From: Dan Rosenberg @ 2011-05-23  0:25 UTC (permalink / raw)
  To: Tony Luck
  Cc: Ingo Molnar, linux-kernel, davej, kees.cook, davem, eranian,
	torvalds, adobriyan, penberg, hpa

On Sun, 2011-05-22 at 16:49 -0700, Tony Luck wrote:
> 
> 
> On Sun, May 22, 2011 at 11:45 AM, Dan Rosenberg
> <drosenberg@vsecurity.com> wrote:
>         1. Is it ok to assume the existence of rdtsc?  If not, what
>         are other
>         ways of gathering entropy early in the boot process?  If this
>         is the
>         approach that's going to be taken, system uptime potentially
>         becomes
>         useful for attackers.  Any thoughts on how to address this?
>         
> There is a cpuid bit to tell you whether the processor supports rdtsc.
> 

This might be worth checking, but it also might just be easier to
clearly document in the Kconfig that the feature requires rdtsc.  I'll
play with it a bit.

> In the cold boot case, I'd worry about whether rdtsc was all that
> random,
> it counts from zero when the processors come out of cold reset, and
> things should be quite deterministic up to when the kernel loads;
> especially if you have a solid state boot drive rather than the old
> spinning kind.
> 

The question really becomes whether it's "good enough".  It's certainly
not cryptographic randomness, but it will produce different results on
different boots depending on a variety of factors, including
out-of-order instruction execution, multi-CPU systems, and differences
in CPU models.  Even though a single machine may be slightly more likely
to produce certain offsets more often, it seems to me that it would
still accomplish the goal, because it's highly unlikely that an attacker
would be able to perform the statistical analysis necessary to figure
that out.

> Sometime soon you'll have "rdrand" available (check a different cpuid
> bit).
> 

This of course is highly preferable, but I'd rather implement a solution
that's widely supported in hardware today.  Perhaps further down the
road it could be switched.

> -Tony
> 
> 

Thanks for the feedback.

-Dan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-23  0:25                 ` Dan Rosenberg
@ 2011-05-23  0:37                   ` H. Peter Anvin
  2011-05-23 10:49                   ` Ingo Molnar
  1 sibling, 0 replies; 50+ messages in thread
From: H. Peter Anvin @ 2011-05-23  0:37 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: Tony Luck, Ingo Molnar, linux-kernel, davej, kees.cook, davem,
	eranian, torvalds, adobriyan, penberg

On 05/22/2011 05:25 PM, Dan Rosenberg wrote:
> 
>> Sometime soon you'll have "rdrand" available (check a different cpuid
>> bit).
> 
> This of course is highly preferable, but I'd rather implement a solution
> that's widely supported in hardware today.  Perhaps further down the
> road it could be switched.
> 

Use the better one that is actually available.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-23  0:25                 ` Dan Rosenberg
  2011-05-23  0:37                   ` H. Peter Anvin
@ 2011-05-23 10:49                   ` Ingo Molnar
  2011-05-23 19:02                     ` Ray Lee
  2011-05-24  1:59                     ` Valdis.Kletnieks
  1 sibling, 2 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-23 10:49 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: Tony Luck, linux-kernel, davej, kees.cook, davem, eranian,
	torvalds, adobriyan, penberg, hpa, Arjan van de Ven,
	Andrew Morton


* Dan Rosenberg <drosenberg@vsecurity.com> wrote:

> On Sun, 2011-05-22 at 16:49 -0700, Tony Luck wrote:
> > 
> > 
> > On Sun, May 22, 2011 at 11:45 AM, Dan Rosenberg
> > <drosenberg@vsecurity.com> wrote:
> >         1. Is it ok to assume the existence of rdtsc?  If not, what
> >         are other
> >         ways of gathering entropy early in the boot process?  If this
> >         is the
> >         approach that's going to be taken, system uptime potentially
> >         becomes
> >         useful for attackers.  Any thoughts on how to address this?
> >         
> > There is a cpuid bit to tell you whether the processor supports rdtsc.
> > 
> 
> This might be worth checking, but it also might just be easier to clearly 
> document in the Kconfig that the feature requires rdtsc.  I'll play with it a 
> bit.

All modern x86 boxes have an RDTSC so this is not really a practical concern - 
other than not crashing old boxes that do not have an RDTSC.

In theory there's other sources of entropy: we could read out the RTC CMOS as 
well. We could also read the current CPU frequency and fan speeds - those have 
components of thermal noise as well. Obviously it's hard to read this out early 
during bootup, when system devices are not enumerated yet.

What would be very useful is to print out the bootup RDTSC value for several 
bootups. I've done this on a testbox here, with real full reboots in between, 
putting the RDTSC printout very close to where you'd sample it for kernel image 
randomization:

[    0.000000] RDTSC: 26615467048
[    0.000000] RDTSC: 26527108278
[    0.000000] RDTSC: 26464738628
[    0.000000] RDTSC: 26554778708
[    0.000000] RDTSC: 26441165788
[    0.000000] RDTSC: 26555252088
[    0.000000] RDTSC: 26431986988
[    0.000000] RDTSC: 26521303608
[    0.000000] RDTSC: 26497878018
[    0.000000] RDTSC: 26455546968
[    0.000000] RDTSC: 26467673718
[    0.000000] RDTSC: 26460404758
[    0.000000] RDTSC: 26496175038

Even visually there's *plenty* of real randomness in the bootup value of the 
cycle counter (look at the above pattern of numbers and squint), even during 
early bootup, on real hardware.

While the last few bits are non-random there's at least 10-20 bits of 
randomness on this (very simple) testbox - possibly more.

I've done a few tests on virtual hardware as well, KVM based:

[    0.000000] RDTSC: 208122033
[    0.000000] RDTSC: 200455104
[    0.000000] RDTSC: 207258945

As expected there's even more randomness on virtual hardware than on real 
hardware: virtual hardware will boot in a complex host CPU state and so is 
exposed to the non-deterministic CPU cache state.

So the cycle counter is plenty good for this.

[ I'd only worry about the cycle counter if a hardware platform is so 
  minimalistic that it boots up very quickly without exposing itself to natural 
  sources of entropy. Eventually we'll get such systems, but right now the 
  cycle counter is good enough. ]

Btw, we already rely on the cycle counter for early bootup randomness: 
boot_init_stack_canary() relies on it.

> > Sometime soon you'll have "rdrand" available (check a different cpuid bit).
> 
> This of course is highly preferable, but I'd rather implement a solution 
> that's widely supported in hardware today.  Perhaps further down the road it 
> could be switched.

Well, since entropy does not get reduced on addition of independent variables 
the right sequence is (pseudocode):

	rnd  = entropy_cycles();
	rnd += entropy_rdrand();
	rnd += entropy_RTC();
	rnd += entropy_system();

That way systems that do not have RDRAND will still have the other ones as 
fallback.

Of course in your prototype you can use RDTSC as a first step, just make it 
easy to add other sources.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-23 10:49                   ` Ingo Molnar
@ 2011-05-23 19:02                     ` Ray Lee
  2011-05-23 19:35                       ` Ingo Molnar
  2011-05-24  1:59                     ` Valdis.Kletnieks
  1 sibling, 1 reply; 50+ messages in thread
From: Ray Lee @ 2011-05-23 19:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dan Rosenberg, Tony Luck, linux-kernel, davej, kees.cook, davem,
	eranian, torvalds, adobriyan, penberg, hpa, Arjan van de Ven,
	Andrew Morton

On Mon, May 23, 2011 at 3:49 AM, Ingo Molnar <mingo@elte.hu> wrote:
> Well, since entropy does not get reduced on addition of independent variables
> the right sequence is (pseudocode):
>
>        rnd  = entropy_cycles();
>        rnd += entropy_rdrand();
>        rnd += entropy_RTC();
>        rnd += entropy_system();

I think you mean concatenation rather than addition? Or perhaps XOR,
or a hash? It's pretty easy to show that the addition of n random
variables evenly distributed between [0, 1] converges to 1/2 n  +-
1/sqrt(n) (or numbers to that effect), which gives an attacker better
chances than they would otherwise if they target the center of the
distribution.
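
A toy enumeration makes the point tangible. The C sketch below is
illustrative only: it takes two independent sources, each uniform over
4 bits (an assumption chosen to keep the table small), and counts how many
of the 256 input pairs map to the single most likely output under plain
addition, XOR and concatenation.

```c
#include <string.h>

enum { BITS = 4, N = 1 << BITS };	/* each source is uniform over 0..15 */

typedef unsigned (*combine_fn)(unsigned a, unsigned b);

unsigned combine_add(unsigned a, unsigned b)    { return a + b; }
unsigned combine_xor(unsigned a, unsigned b)    { return a ^ b; }
unsigned combine_concat(unsigned a, unsigned b) { return (a << BITS) | b; }

/*
 * Number of (a, b) pairs, out of N * N, that produce the most common
 * output value. A larger count means an easier best guess.
 */
unsigned most_likely_count(combine_fn combine)
{
	unsigned hist[N * N];
	unsigned best = 0;

	memset(hist, 0, sizeof hist);
	for (unsigned a = 0; a < N; a++) {
		for (unsigned b = 0; b < N; b++) {
			unsigned hits = ++hist[combine(a, b)];
			if (hits > best)
				best = hits;
		}
	}
	return best;
}
```

In this toy setting the sum peaks in the middle (16 of 256 pairs add up to
15, while only one pair adds up to 0), XOR is flat but spans only 16 values,
and concatenation keeps all 256 outcomes equally likely.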

But none of this is to detract from your main point, which still
holds. Structuring it such that other sources of randomness can be
included as available keeps options open.

~r.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-23 19:02                     ` Ray Lee
@ 2011-05-23 19:35                       ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-23 19:35 UTC (permalink / raw)
  To: Ray Lee
  Cc: Dan Rosenberg, Tony Luck, linux-kernel, davej, kees.cook, davem,
	eranian, torvalds, adobriyan, penberg, hpa, Arjan van de Ven,
	Andrew Morton


* Ray Lee <ray-lk@madrabbit.org> wrote:

> On Mon, May 23, 2011 at 3:49 AM, Ingo Molnar <mingo@elte.hu> wrote:
> > Well, since entropy does not get reduced on addition of independent variables
> > the right sequence is (pseudocode):
> >
> >        rnd  = entropy_cycles();
> >        rnd += entropy_rdrand();
> >        rnd += entropy_RTC();
> >        rnd += entropy_system();
> 
> I think you mean concatenation rather than addition? Or perhaps XOR, or a 
> hash? [...]

Yeah.

In this special case probably concatenation works the best: the above 4 random 
variables have total randomness probably less than 32 bits, so we want to 
create 4 tight random numbers and concatenate them.

[ XOR would destroy some fair amount of entropy because most of these random 
  variables have their randomness in their low bits, and a hash would probably
  lose about 2 bits and would also be slower. A hash would probably be safer
  and more robust though, if we mis-identify any of the random variables. ]
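
One possible reading of "create 4 tight random numbers and concatenate
them" is sketched below in C. The equal 8-bit split and the function names
are assumptions made for illustration; nothing in the thread fixes how many
bits each source should contribute.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low `bits` bits of v, where most of the noise sits. */
uint32_t low_bits(uint32_t v, unsigned bits)
{
	assert(bits >= 1 && bits <= 32);
	return bits == 32 ? v : v & ((UINT32_C(1) << bits) - 1);
}

/*
 * Pack one 8-bit slice per source into a 32-bit word so that no source
 * can cancel or overlap another. A source that is unavailable can simply
 * be passed as 0.
 */
uint32_t concat_entropy(uint32_t cycles, uint32_t rdrand,
			uint32_t rtc, uint32_t system)
{
	return (low_bits(cycles, 8) << 24) |
	       (low_bits(rdrand, 8) << 16) |
	       (low_bits(rtc, 8) << 8) |
	       low_bits(system, 8);
}
```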

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-23 10:49                   ` Ingo Molnar
  2011-05-23 19:02                     ` Ray Lee
@ 2011-05-24  1:59                     ` Valdis.Kletnieks
  2011-05-24  4:06                       ` Ingo Molnar
  1 sibling, 1 reply; 50+ messages in thread
From: Valdis.Kletnieks @ 2011-05-24  1:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dan Rosenberg, Tony Luck, linux-kernel, davej, kees.cook, davem,
	eranian, torvalds, adobriyan, penberg, hpa, Arjan van de Ven,
	Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1215 bytes --]

On Mon, 23 May 2011 12:49:02 +0200, Ingo Molnar said:
> Well, since entropy does not get reduced on addition of independent variables 
> the right sequence is (pseudocode):
>
> 	rnd  = entropy_cycles();
> 	rnd += entropy_rdrand();
> 	rnd += entropy_RTC();
> 	rnd += entropy_system();

I'm having trouble convincing myself that RTC and cycles are truly independent
variables.... ;)

Consider the case of a fixed-frequency CPU - if you know the time since boot,
and the current RTC, and the current cycle count, you can work backwards to
find the RTC and cycle count at boot.  I'm not sure that a variable clockspeed
helps all that much - an attacker can perhaps find a way to force the highest/
lowest CPU speed - or the system may even helpfully do it for the attacker -
I've seen plenty of misconfigured laptops that force lowest supported CPU
clockspeed on battery rather than race-to-idle.

Having said that, the 13 bootup rdtsc values you list *seem* to have on the
order of 24-28 bits of entropy, and only the lowest-order bit seems to be
non-random (the low-order byte of the 13 values are 28, b6, 44, 54, dc, 78, 2c,
38, 02, 58, 76, 16, and be).  So rdtsc appears to be good enough for what we
want here...
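
The low-bit observation is easy to re-check mechanically. A minimal userspace
sketch, assuming only the 13 low-order bytes listed above (the names low_bytes
and count_bit_set are made up for illustration): it counts how often a given
bit is set across the samples. A bit that is set in none or in all of them
contributes no entropy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Low-order bytes of the 13 bootup rdtsc samples quoted above. */
static const uint8_t low_bytes[13] = {
	0x28, 0xb6, 0x44, 0x54, 0xdc, 0x78, 0x2c,
	0x38, 0x02, 0x58, 0x76, 0x16, 0xbe,
};

/* Count how many of the n samples have the given bit (0..7) set. */
static size_t count_bit_set(const uint8_t *v, size_t n, unsigned int bit)
{
	size_t i, count = 0;

	assert(bit < 8);
	for (i = 0; i < n; i++)
		count += (v[i] >> bit) & 1;
	return count;
}
```

count_bit_set(low_bytes, 13, 0) returns 0 for this data: every sample is even,
which is the stuck lowest-order bit.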



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-24  1:59                     ` Valdis.Kletnieks
@ 2011-05-24  4:06                       ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-24  4:06 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Dan Rosenberg, Tony Luck, linux-kernel, davej, kees.cook, davem,
	eranian, torvalds, adobriyan, penberg, hpa, Arjan van de Ven,
	Andrew Morton


* Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> wrote:

> On Mon, 23 May 2011 12:49:02 +0200, Ingo Molnar said:
> > Well, since entropy does not get reduced on addition of independent variables 
> > the right sequence is (pseudocode):
> >
> > 	rnd  = entropy_cycles();
> > 	rnd += entropy_rdrand();
> > 	rnd += entropy_RTC();
> > 	rnd += entropy_system();
> 
> I'm having trouble convincing myself that RTC and cycles are truly independent
> variables.... ;)

Generally the RTC stores absolute time in seconds (it stores the date), while 
the cycle counter starts anew when the CPU is reset.

So they are independent.

The question i think you are asking is whether the fact that we can observe 
current values of them after bootup can be used to figure out their value:

> Consider the case of a fixed-frequency CPU - if you know the time since boot, 
> and the current RTC, and the current cycle count, you can work backwards to 
> find the RTC and cycle count at boot. [...]

Yes, you are correct, if you are local then guessing the RTC to the second 
is probably possible.

Guessing the cycle counter's value will be hard: see the natural noise it has 
at a fixed instruction after bootup in the same-bzImage test i performed - with 
no IRQs having executed at all yet ...

The RTC is still reasonably noisy to external attackers though.

> [...] I'm not sure that a variable clockspeed helps all that much - an 
> attacker can perhaps find a way to force the highest/ lowest CPU speed - or 
> the system may even helpfully do it for the attacker - I've seen plenty of 
> misconfigured laptops that force lowest supported CPU clockspeed on battery 
> rather than race-to-idle.

The tests i performed were on a fixed frequency system - the cycle counter was 
still largely random during early bootup.

Others should try it too - i've attached a simple patch. Maybe my system has 
more bootup noise than others.

> Having said that, the 13 bootup rdtsc values you list *seem* to have on the 
> order of 24-28 bits of entropy, and only the lowest-order bit seems to be 
> non-random (the low-order byte of the 13 values are 28, b6, 44, 54, dc, 78, 
> 2c, 38, 02, 58, 76, 16, and be).  So rdtsc appears to be good enough for what 
> we want here...

Yeah. And for cases where the rdtsc might be predictable for some weird reason 
(say it would be 0 on an old system with no RDTSC), the RTC would give some 
minimal fallback seed to make the canary at least not remotely guessable.
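
A minimal sketch of that fallback, assuming the additive combination from the
pseudocode quoted above (mix_seed is a hypothetical name and the inputs are
plain integers, not real kernel interfaces): if the cycle counter reads 0, the
result still varies with the RTC.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine two independent sources by wrapping
 * (mod 2^64) addition, following the "rnd += ..." pseudocode.
 * With cycles == 0 the seed degrades to the RTC value alone.
 */
static uint64_t mix_seed(uint64_t cycles, uint64_t rtc_seconds)
{
	return cycles + rtc_seconds;	/* unsigned overflow wraps */
}
```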

Thanks,

	Ingo

---
 init/main.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c
+++ linux/init/main.c
@@ -472,6 +472,12 @@ asmlinkage void __init start_kernel(void
 	 */
 	boot_init_stack_canary();
 
+	{
+		u64 cycles = get_cycles();
+
+		printk("RDTSC: %Ld / %08Lx\n", cycles, cycles);
+	}
+
 	cgroup_init_early();
 
 	local_irq_disable();

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-13 16:23                 ` Andi Kleen
@ 2011-05-17 12:17                   ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-17 12:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Jones, Stephane Eranian, Linus Torvalds,
	Arnaldo Carvalho de Melo, LKML, H. Peter Anvin, Thomas Gleixner,
	Arjan van de Ven


* Andi Kleen <andi@firstfloor.org> wrote:

> > The x86 kernel is relocatable, so slightly randomizing the position of the 
> > kernel would be feasible with no overhead on the vast majority of existing 
> > distro installs, with just an updated kernel.
> 
> Problem is that we don't have a source of secure randomness early on when the 
> relocation would need to happen.

That's indeed a problem but not a fundamental one: we can read out the current 
time (the RTC CMOS is available on most systems), mix it with the current 
cycle counter value and run the result through a PRNG.

It could only be recovered if the attacker is local to that box, guesses the 
precise cycle count on that specific hardware (and hardware has small thermal 
variations) and knows the precise boot time to the second as present in the 
RTC.

Note that the amount of randomization would be small to begin with: if we have 
only 3 bits of randomization and can thus make ~90% of kernel exploit attempts 
crash statistically then we have most of the advantages already.
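
The ~90% figure follows from simple counting, assuming an exploit hard-codes
one address and each of the 2^bits kernel positions is equally likely (an
idealisation; miss_permille is a made-up helper name): a blind attempt hits
the right position once in 2^bits boots.

```c
#include <assert.h>

/*
 * Share of blind, fixed-address exploit attempts that miss, in parts per
 * thousand, when the kernel base has 'bits' bits of uniform randomization.
 */
static unsigned int miss_permille(unsigned int bits)
{
	assert(bits < 16);	/* keep the shift well inside int range */
	return 1000 - 1000 / (1u << bits);
}
```

For 3 bits this gives 875, i.e. 87.5% of such attempts miss, which is roughly
the 90% mentioned above.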

[ For the really paranoid we could add a new flag to the boot protocol and 
  embed a random seed in the bzImage. This could be re-set upon installation 
  of a new kernel package, so on the next reboot the system gains a unique 
  seed.

  Or we could add a boot parameter to seed things and cut this particular boot 
  parameter from all output like /proc/cmdline or the syslog command line 
  parameters printout. /etc/grub.conf is already inaccessible to unprivileged 
  userspace on most distros so the parameter is hidden. ]

So it's a solvable problem.

> You could either pass it as an option, but that option would be right now too 
> exposed, or just use kexec and boot twice.
> 
> But all of this has drawbacks.
> 
> > When exposing randomized RIPs to user-space we could recalculate all RIPs back 
> > to the 0xffffffff80000000 base, so oopses would have the usual non-randomized 
> > form:
> 
> This would be very confusing because the register and stack contents 
> would have the non relocated addresses.

Well, kernel crashes can expose security relevant details anyway, so they had 
better be hidden from unprivileged attackers; the important thing is to 
properly decode the symbols. We can keep the rest of the oops in its raw form 
(and thus expose the seed to a privileged user - which we'd do anyway), being 
dependable is important for oopses.

> I bet it would cause a lot of similar problems as the current %kP madness, 
> just more subtle ones.

Well, did you expect me to react to your claim of 'subtle issues'?

If yes (which i assume) then why didn't you outline what you meant with that in 
more detail, why are you forcing me to ask you what you mean precisely?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-13  8:57               ` Ingo Molnar
@ 2011-05-13 16:23                 ` Andi Kleen
  2011-05-17 12:17                   ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2011-05-13 16:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Jones, Stephane Eranian, Linus Torvalds,
	Arnaldo Carvalho de Melo, LKML, H. Peter Anvin, Thomas Gleixner,
	Arjan van de Ven

Ingo Molnar <mingo@elte.hu> writes:

I agree that the current %kP default is really a catastrophe, clearly
on the trajectory of "the system is only secure when nothing works anymore"

> The x86 kernel is relocatable, so slightly randomizing the position of the 
> kernel would be feasible with no overhead on the vast majority of existing 
> distro installs, with just an updated kernel.

Problem is that we don't have a source of secure randomness early on
when the relocation would need to happen.

You could either pass it as an option, but that option would be right
now too exposed, or just use kexec and boot twice.

But all of this has drawbacks.

> When exposing randomized RIPs to user-space we could recalculate all RIPs back 
> to the 0xffffffff80000000 base, so oopses would have the usual non-randomized 
> form:

This would be very confusing because the register and stack contents 
would have the non relocated addresses.

I bet it would cause a lot of similar problems as the current %kP
madness, just more subtle ones.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 22:15               ` Stephane Eranian
@ 2011-05-13  9:01                 ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-13  9:01 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Dave Jones, Linus Torvalds, Arnaldo Carvalho de Melo, LKML


* Stephane Eranian <eranian@google.com> wrote:

> Good point about System.map! Even if /proc/kallsyms contains zero addresses, 
> I can still get them from /boot/System.map which is readable by everyone, I 
> think. It does not contain the modules addresses, but you have the core 
> functions, unless I am somehow mistaken.

Yes. I pointed out this and some other details a couple of months ago, for a 
similarly motivated /proc/kallsyms obfuscation patch:

  http://lkml.org/lkml/2010/11/4/113
  http://lkml.org/lkml/2010/11/4/145

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 22:07             ` Dave Jones
  2011-05-12 22:15               ` Stephane Eranian
@ 2011-05-13  8:57               ` Ingo Molnar
  2011-05-13 16:23                 ` Andi Kleen
  1 sibling, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2011-05-13  8:57 UTC (permalink / raw)
  To: Dave Jones, Stephane Eranian, Linus Torvalds,
	Arnaldo Carvalho de Melo, LKML
  Cc: H. Peter Anvin, Thomas Gleixner, Arjan van de Ven


* Dave Jones <davej@redhat.com> wrote:

> On Thu, May 12, 2011 at 11:50:23PM +0200, Ingo Molnar wrote:
>  
>  > Dunno, i would not couple them necessarily - certain users might still have 
>  > access to kernel symbols via some other channel - for example the System.map.
>  
> That always made this security by obscurity feature seem pointless for the bulk
> of users to me. Given the majority are going to be running distro kernels,
> anyone can find those addresses easily no matter how hard we hide them on the
> running system.

I certainly agree and made that argument as well, in the original thread(s) 
about /proc/kallsyms.

> Unless we somehow introduced randomness into where we unpack the kernel 
> each boot, and used System.map as a table of offsets instead of absolute 
> addresses.

Correct. This security feature is IMO only solving a tiny fraction of the 
problem and is thus in fact hindering the implementation of a *real* layer
of protection of kernel absolute addresses:

The x86 kernel is relocatable, so slightly randomizing the position of the 
kernel would be feasible with no overhead on the vast majority of existing 
distro installs, with just an updated kernel.

When exposing randomized RIPs to user-space we could recalculate all RIPs back 
to the 0xffffffff80000000 base, so oopses would have the usual non-randomized 
form:

[   32.946003] IP: [<ffffffff80222521>] get_cur_val+0xcc/0x106
[   32.946003] PGD 0 
[   32.946003] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[   32.946003] last sysfs file: 
[   32.946003] CPU 1 
[   32.946003] Pid: 1, comm: swapper Tainted: G        W  2.6.29-rc1-00190-g37a76bd #10
[   32.946003] RIP: 0010:[<ffffffff80222521>]  [<ffffffff80222521>] get_cur_val+0xcc/0x106
[   32.946003] RSP: 0018:ffff88003f977b80  EFLAGS: 00010202
[   32.946003] RAX: 0000000000000001 RBX: ffff8800029c8c80 RCX: 0000000000000008
[   32.946003] RDX: 0000000000000000 RSI: ffffffff80ce0100 RDI: 0000000000000000
[   32.946003] RBP: ffff88003f977bd0 R08: 0000000000000004 R09: 0000000000000040
[   32.946003] R10: 0000000000000060 R11: 0000000081363fa8 R12: ffffffff81c4f0e0
[   32.946003] R13: ffffffff80ce0100 R14: ffff88003c888a00 R15: 0000000000000000
[   32.946003] FS:  0000000000000000(0000) GS:ffff88003f802c00(0000) knlGS:0000000000000000
[   32.946003] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[   32.946003] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
[   32.946003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   32.946003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   32.946003] Process swapper (pid: 1, threadinfo ffff88003f976000, task ffff88003f978000)
[   32.946003] Stack:

Likewise, /proc/kallsyms could pass these addresses as well and the perf 
call-chain code and other places that sample RIPs could easily convert them to 
the constant address as well.
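
A minimal sketch of that conversion, assuming the kernel knows its randomized
load base (the names LINK_BASE and unrandomize_rip are hypothetical, not
existing kernel symbols): the offset of the RIP within the image is kept and
re-based onto the constant link-time address.

```c
#include <assert.h>
#include <stdint.h>

/* The usual non-randomized x86-64 kernel text base mentioned above. */
#define LINK_BASE 0xffffffff80000000ULL

/* Map a raw RIP from a kernel loaded at random_base back to LINK_BASE. */
static uint64_t unrandomize_rip(uint64_t rip, uint64_t random_base)
{
	assert(rip >= random_base);	/* RIP must lie inside the image */
	return rip - random_base + LINK_BASE;
}
```

With a kernel shifted up by 0x200000, a raw RIP of 0xffffffff80422521 maps back
to ffffffff80222521, the get_cur_val+0xcc address shown in the oops above.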

We'd still leak some information like the relative position of symbols from 
each other (this can be useful to certain classes of attacks), but we could 
pretty effectively hide the absolute location of the kernel, which is the most 
valuable piece of information.

Then the random base has to be protected: i.e. all information leaks of raw 
kernel RIPs have to be plugged. The nice thing is that this will happen as 
*bugfixes*: randomized RIPs will not be useful for anything, so any 
tools/people who rely on them will notice it immediately.

I think *that* would be a maintainable and complete security feature to truly 
hide the exact location of the kernel image. kptr_restrict is not.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-13  6:12         ` Kees Cook
@ 2011-05-13  6:24           ` Pekka Enberg
  0 siblings, 0 replies; 50+ messages in thread
From: Pekka Enberg @ 2011-05-13  6:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: David Miller, davej, eranian, acme, linux-kernel, Linus Torvalds,
	Ingo Molnar

Hi Kees,

On Fri, May 13, 2011 at 9:12 AM, Kees Cook <kees.cook@canonical.com> wrote:
> Hi Pekka,
>
> On Thu, May 12, 2011 at 10:58:53PM +0300, Pekka Enberg wrote:
>> On Thu, May 12, 2011 at 10:01 PM, David Miller <davem@davemloft.net> wrote:
>> > From: Dave Jones <davej@redhat.com>
>> > Date: Thu, 12 May 2011 14:37:41 -0400
>> >
>> >> On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
>> >>  > I hate this too, and I think it's absolutely ridiculous.
>> >>  >
>> >>  > Also, like you, I lost an entire afternoon trying to figure out why
>> >>  > this started happening.
>> >>  >
>> >>  > I wish we could revert this change.
>> >>
>> >> At least it can be permanently disabled..
>> >>
>> >> echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf
>> >
>> > Regardless, what to do about all of the "perf is broken" reports?
>>
>> Let's revert the commit 9f36e2c448007b54851e7e4fa48da97d1477a175
>> ("printk: use %pK for /proc/kallsyms and /proc/modules"), please! I
>> too have been wondering what's going on with perf reporting insane
>> symbols and this should definitely not be enabled by default.
>
> No, reverting that is not the answer. If perf has a problem with the
> kptr_restrict feature, it should just disable it in /proc/sys when it
> runs and restore it when finished. Since our defaults should be secure
> for the average user (who does not use perf), it's fine the way it
> is. Anyone using perf can adjust this for their use-case (that is why
> there is a /proc/sys tunable).

No, it's the other way around. See commit
411f05f123cbd7f8aa1edcae86970755a6e2a9d9 ("vsprintf: Turn
kptr_restrict off by default") in Linus' tree for details.

                        Pekka

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 19:58       ` Pekka Enberg
@ 2011-05-13  6:12         ` Kees Cook
  2011-05-13  6:24           ` Pekka Enberg
  0 siblings, 1 reply; 50+ messages in thread
From: Kees Cook @ 2011-05-13  6:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Miller, davej, eranian, acme, linux-kernel, Linus Torvalds,
	Ingo Molnar

Hi Pekka,

On Thu, May 12, 2011 at 10:58:53PM +0300, Pekka Enberg wrote:
> On Thu, May 12, 2011 at 10:01 PM, David Miller <davem@davemloft.net> wrote:
> > From: Dave Jones <davej@redhat.com>
> > Date: Thu, 12 May 2011 14:37:41 -0400
> >
> >> On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
> >>  > I hate this too, and I think it's absolutely ridiculous.
> >>  >
> >>  > Also, like you, I lost an entire afternoon trying to figure out why
> >>  > this started happening.
> >>  >
> >>  > I wish we could revert this change.
> >>
> >> At least it can be permanently disabled..
> >>
> >> echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf
> >
> > Regardless, what to do about all of the "perf is broken" reports?
> 
> Let's revert the commit 9f36e2c448007b54851e7e4fa48da97d1477a175
> ("printk: use %pK for /proc/kallsyms and /proc/modules"), please! I
> too have been wondering what's going on with perf reporting insane
> symbols and this should definitely not be enabled by default.

No, reverting that is not the answer. If perf has a problem with the
kptr_restrict feature, it should just disable it in /proc/sys when it
runs and restore it when finished. Since our defaults should be secure
for the average user (who does not use perf), it's fine the way it
is. Anyone using perf can adjust this for their use-case (that is why
there is a /proc/sys tunable).

-Kees

-- 
Kees Cook
Ubuntu Security Team

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 22:07             ` Dave Jones
@ 2011-05-12 22:15               ` Stephane Eranian
  2011-05-13  9:01                 ` Ingo Molnar
  2011-05-13  8:57               ` Ingo Molnar
  1 sibling, 1 reply; 50+ messages in thread
From: Stephane Eranian @ 2011-05-12 22:15 UTC (permalink / raw)
  To: Dave Jones, Ingo Molnar, Stephane Eranian, Linus Torvalds,
	Arnaldo Carvalho de Melo, LKML

On Fri, May 13, 2011 at 12:07 AM, Dave Jones <davej@redhat.com> wrote:
> On Thu, May 12, 2011 at 11:50:23PM +0200, Ingo Molnar wrote:
>
>  > Dunno, i would not couple them necessarily - certain users might still have
>  > access to kernel symbols via some other channel - for example the System.map.
>
> That always made this security by obscurity feature seem pointless for the bulk
> of users to me. Given the majority are going to be running distro kernels,
> anyone can find those addresses easily no matter how hard we hide them on the
> running system.

> Unless we somehow introduced randomness into where we unpack the kernel
> each boot, and used System.map as a table of offsets instead of absolute addresses.
>
Good point about System.map! Even if /proc/kallsyms contains zero addresses,
I can still get them from /boot/System.map which is readable by everyone, I
think. It does not contain the modules addresses, but you have the core
functions, unless I am somehow mistaken.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 21:50           ` Ingo Molnar
  2011-05-12 21:56             ` Stephane Eranian
@ 2011-05-12 22:07             ` Dave Jones
  2011-05-12 22:15               ` Stephane Eranian
  2011-05-13  8:57               ` Ingo Molnar
  1 sibling, 2 replies; 50+ messages in thread
From: Dave Jones @ 2011-05-12 22:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephane Eranian, Linus Torvalds, Arnaldo Carvalho de Melo, LKML

On Thu, May 12, 2011 at 11:50:23PM +0200, Ingo Molnar wrote:
 
 > Dunno, i would not couple them necessarily - certain users might still have 
 > access to kernel symbols via some other channel - for example the System.map.
 
That always made this security by obscurity feature seem pointless for the bulk
of users to me. Given the majority are going to be running distro kernels,
anyone can find those addresses easily no matter how hard we hide them on the
running system.

Unless we somehow introduced randomness into where we unpack the kernel
each boot, and used System.map as a table of offsets instead of absolute addresses.

	Dave


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 21:56             ` Stephane Eranian
@ 2011-05-12 22:00               ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-12 22:00 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Linus Torvalds, Arnaldo Carvalho de Melo, LKML


* Stephane Eranian <eranian@google.com> wrote:

> On Thu, May 12, 2011 at 11:50 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Stephane Eranian <eranian@google.com> wrote:
> >
> >> On Thu, May 12, 2011 at 11:35 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >> >
> >> > * Stephane Eranian <eranian@google.com> wrote:
> >> >
> >> >> The other contradiction, I see, is that you have perf_event paranoia level
> >> >> and this new kptr masquerading feature which conflict with each
> >> >> other.
> >> >>
> >> >> You can be allowed to monitor at the kernel level (paranoid=1, default)
> >> >> but you cannot correlate symbols:
> >> >>
> >> >> $ perf record -e cycles:k foo
> >> >>
> >> >> I suspect if you have this kptr thing turned on, then you need to disallow
> >> >> monitoring at the kernel level too.
> >> >
> >> > The better (and consistent) solution would be to turn the kptr_restrict thing
> >> > off - see the patch i sent.
> >>
> >> I saw that. But I think that when someone turns it back on, then you need to
> >> increase the perf_events paranoia level to disallow kernel monitoring to
> >> regular users such that you maintain consistency across the board.
> >
> > Dunno, i would not couple them necessarily - certain users might still have
> > access to kernel symbols via some other channel - for example the System.map.
>
> Ok, that's true, but then you'd need to have perf print a message or refuse to
> use /proc/kallsyms and suggest that the user provides a System.map.

Correct - the right approach would be to just use what we had in earlier 
versions, an 'unknown symbol' kind of catch-all entry that shows how much
stuff we did not recognize.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 21:50           ` Ingo Molnar
@ 2011-05-12 21:56             ` Stephane Eranian
  2011-05-12 22:00               ` Ingo Molnar
  2011-05-12 22:07             ` Dave Jones
  1 sibling, 1 reply; 50+ messages in thread
From: Stephane Eranian @ 2011-05-12 21:56 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Arnaldo Carvalho de Melo, LKML

On Thu, May 12, 2011 at 11:50 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
>> On Thu, May 12, 2011 at 11:35 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> >
>> > * Stephane Eranian <eranian@google.com> wrote:
>> >
>> >> The other contradiction, I see, is that you have perf_event paranoia level
>> >> and this new kptr masquerading feature which conflict with each
>> >> other.
>> >>
>> >> You can be allowed to monitor at the kernel level (paranoid=1, default)
>> >> but you cannot correlate symbols:
>> >>
>> >> $ perf record -e cycles:k foo
>> >>
>> >> I suspect if you have this kptr thing turned on, then you need to disallow
>> >> monitoring at the kernel level too.
>> >
>> > The better (and consistent) solution would be to turn the kptr_restrict thing
>> > off - see the patch i sent.
>>
>> I saw that. But I think that when someone turns it back on, then you need to
>> increase the perf_events paranoia level to disallow kernel monitoring to
>> regular users such that you maintain consistency across the board.
>
> Dunno, i would not couple them necessarily - certain users might still have
> access to kernel symbols via some other channel - for example the System.map.
>
Ok, that's true, but then you'd need to have perf print a message or refuse to
use /proc/kallsyms and suggest that the user provides a System.map.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 21:41       ` Stephane Eranian
@ 2011-05-12 21:54         ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-12 21:54 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Linus Torvalds, Arnaldo Carvalho de Melo, LKML


* Stephane Eranian <eranian@google.com> wrote:

> >> [...] So much for having perf in the kernel source tree to keep things in 
> >> sync...
> >
> > What do you mean?
>
> I meant that when this kptr feature was added, people should have scanned the 
> entire tree (including tools/perf) to look for potential impact on programs 
> relying on /proc/kallsyms. Having perf in the tree should have made this 
> easier to catch. That's all.

It was noticed in another case when there was kallsyms twiddling going on, so it 
depends. What wasn't noticed here was how the present but zero-valued symbols:

 0000000000000000 D irq_stack_union
 0000000000000000 D __per_cpu_start
 0000000000000000 D gdt_page
 0000000000000000 d exception_stacks
 0000000000000000 d tlb_vector_offset
 0000000000000000 d shared_msrs
 0000000000000000 d cpu_tsc_khz

caused the symbol code of perf to consider them non-existent.

Perf being in-tree won't magically avoid all bugs, so you should not expect that 
magical effect from tool integration into the kernel tree.
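
One way a tool could notice this case instead of mis-resolving symbols is to
check whether every address it parsed is zero. A minimal sketch, assuming the
usual "address type name" line layout of /proc/kallsyms
(kallsyms_looks_restricted is a made-up name, not a perf function):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if every parseable "address type name" line in buf has a zero
 * address, which is what a restricted /proc/kallsyms looks like; return 0
 * otherwise (including when no line parses at all).
 */
static int kallsyms_looks_restricted(const char *buf)
{
	unsigned long long addr;
	char type, name[256];
	int parsed = 0, consumed;

	assert(buf != NULL);
	while (sscanf(buf, " %llx %c %255s%n",
		      &addr, &type, name, &consumed) == 3) {
		if (addr != 0)
			return 0;
		parsed++;
		buf += consumed;
		buf += strcspn(buf, "\n");	/* skip a trailing "[module]" */
	}
	return parsed > 0;
}
```

A caller that gets 1 back could print a warning and fall back to raw addresses
or an 'unknown symbol' bucket rather than picking an arbitrary symbol.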

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 21:38         ` Stephane Eranian
@ 2011-05-12 21:50           ` Ingo Molnar
  2011-05-12 21:56             ` Stephane Eranian
  2011-05-12 22:07             ` Dave Jones
  0 siblings, 2 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-12 21:50 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Linus Torvalds, Arnaldo Carvalho de Melo, LKML


* Stephane Eranian <eranian@google.com> wrote:

> On Thu, May 12, 2011 at 11:35 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Stephane Eranian <eranian@google.com> wrote:
> >
> >> The other contradiction, I see, is that you have perf_event paranoia level
> >> and this new kptr masquerading feature which conflict with each
> >> other.
> >>
> >> You can be allowed to monitor at the kernel level (paranoid=1, default)
> >> but you cannot correlate symbols:
> >>
> >> $ perf record -e cycles:k foo
> >>
> >> I suspect if you have this kptr thing turned on, then you need to disallow
> >> monitoring at the kernel level too.
> >
> > The better (and consistent) solution would be to turn the kptr_restrict thing
> > off - see the patch i sent.
>
> I saw that. But I think that when someone turns it back on, then you need to 
> increase the perf_events paranoia level to disallow kernel monitoring to 
> regular users such that you maintain consistency across the board.

Dunno, i would not couple them necessarily - certain users might still have 
access to kernel symbols via some other channel - for example the System.map.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 21:36     ` Ingo Molnar
@ 2011-05-12 21:41       ` Stephane Eranian
  2011-05-12 21:54         ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Stephane Eranian @ 2011-05-12 21:41 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Arnaldo Carvalho de Melo, LKML

On Thu, May 12, 2011 at 11:36 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
>> > The bug is that perf doesn't say "I can't match kernel symbols", but
>> > instead does some crazy matching and gives total crap module information (I
>> > think it just picks the one that shows up last in /proc/kallsyms).
>>
>> But I agree perf must not silently return bogus information. It should print
>> a big warning message and/or fallback to printing the raw addresses. [...]
>
> Yes, agreed, this is a bug in perf. I found out about this about two weeks ago
> and reported it to Arnaldo, but he is away right now - he might be able to fix
> it next week at the earliest.
>
>> [...] So much for having perf in the kernel source tree to keep things in
>> sync...
>
> What do you mean?
>
I meant that when this kptr feature was added, people should have scanned the
entire tree (including tools/perf) to look for potential impact on programs
relying on /proc/kallsyms. Having perf in the tree should have made this easier
to catch. That's all.



> Thanks,
>
>        Ingo
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 21:35       ` Ingo Molnar
@ 2011-05-12 21:38         ` Stephane Eranian
  2011-05-12 21:50           ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Stephane Eranian @ 2011-05-12 21:38 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Arnaldo Carvalho de Melo, LKML

On Thu, May 12, 2011 at 11:35 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
>> The other contradiction, I see, is that you have perf_event paranoia level
>> and this new kptr masquerading feature which conflict with each
>> other.
>>
>> You can be allowed to monitor at the kernel level (paranoid=1, default)
>> but you cannot correlate symbols:
>>
>> $ perf record -e cycles:k foo
>>
>> I suspect if you have this kptr thing turned on, then you need to disallow
>> monitoring at the kernel level too.
>
> The better (and consistent) solution would be to turn the kptr_restrict thing
> off - see the patch i sent.
>
I saw that. But I think that when someone turns it back on, then you need
to increase the perf_events paranoia level to disallow kernel monitoring to
regular users such that you maintain consistency across the board.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 21:07   ` Stephane Eranian
  2011-05-12 21:30     ` Stephane Eranian
@ 2011-05-12 21:36     ` Ingo Molnar
  2011-05-12 21:41       ` Stephane Eranian
  1 sibling, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2011-05-12 21:36 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Linus Torvalds, Arnaldo Carvalho de Melo, LKML


* Stephane Eranian <eranian@google.com> wrote:

> > The bug is that perf doesn't say "I can't match kernel symbols", but 
> > instead does some crazy matching and gives total crap module information (I 
> > think it just picks the one that shows up last in /proc/kallsyms).
>
> But I agree perf must not silently return bogus information. It should print 
> a big warning message and/or fallback to printing the raw addresses. [...]

Yes, agreed, this is a bug in perf. I found out about this about two weeks ago 
and reported it to Arnaldo, but he is away right now - he might be able to fix 
it next week at the earliest.

> [...] So much for having perf in the kernel source tree to keep things in 
> sync...

What do you mean?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 21:30     ` Stephane Eranian
@ 2011-05-12 21:35       ` Ingo Molnar
  2011-05-12 21:38         ` Stephane Eranian
  0 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2011-05-12 21:35 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Linus Torvalds, Arnaldo Carvalho de Melo, LKML


* Stephane Eranian <eranian@google.com> wrote:

> The other contradiction I see is that the perf_event paranoia level and
> this new kptr masquerading feature conflict with each other.
> 
> You can be allowed to monitor at the kernel level (paranoid=1, default)
> but you cannot correlate symbols:
> 
> $ perf record -e cycles:k foo
> 
> I suspect if you have this kptr thing turned on, then you need to disallow
> monitoring at the kernel level too.

The better (and consistent) solution would be to turn the kptr_restrict thing 
off - see the patch i sent.

Thanks,

	Ingo


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 21:07   ` Stephane Eranian
@ 2011-05-12 21:30     ` Stephane Eranian
  2011-05-12 21:35       ` Ingo Molnar
  2011-05-12 21:36     ` Ingo Molnar
  1 sibling, 1 reply; 50+ messages in thread
From: Stephane Eranian @ 2011-05-12 21:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Arnaldo Carvalho de Melo, LKML, Ingo Molnar

The other contradiction I see is that the perf_event paranoia level and
this new kptr masquerading feature conflict with each other.

You can be allowed to monitor at the kernel level (paranoid=1, default)
but you cannot correlate symbols:

$ perf record -e cycles:k foo

I suspect if you have this kptr thing turned on, then you need to disallow
monitoring at the kernel level too.
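
The conflict described above can be written down as a small predicate. This
is only a sketch: the function names are hypothetical, and the thresholds are
assumptions (unprivileged kernel-level profiling permitted while the paranoia
level is below 2, and real addresses visible to unprivileged readers only
while kptr_restrict is 0).

```python
def kernel_profiling_allowed(perf_event_paranoid: int) -> bool:
    """Assumption: unprivileged kernel-level profiling is permitted below level 2."""
    return perf_event_paranoid < 2


def kernel_symbols_visible(kptr_restrict: int) -> bool:
    """Assumption: unprivileged readers see real addresses only when the value is 0."""
    return kptr_restrict == 0


def settings_conflict(perf_event_paranoid: int, kptr_restrict: int) -> bool:
    """True when kernel samples can be collected but cannot be symbolized."""
    return (kernel_profiling_allowed(perf_event_paranoid)
            and not kernel_symbols_visible(kptr_restrict))
```

With paranoid=1 and kptr_restrict=1, the combination reported in this thread,
settings_conflict() returns True; raising the paranoia level or clearing
kptr_restrict makes the two settings agree again.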



On Thu, May 12, 2011 at 11:07 PM, Stephane Eranian <eranian@google.com> wrote:
> On Thu, May 12, 2011 at 10:31 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> On Thu, May 12, 2011 at 7:48 AM, Stephane Eranian <eranian@google.com> wrote:
>> >
>> > I think there is a serious problem with kernel symbol correlation
>> > with the latest perf in 2.6.39-rc7-tip.
>>
>> Yeah. It's annoying. It's a "perf" bug, though - triggered by
>> /proc/sys/kernel/kptr_restrict being set to 1.
>>
> I did not know about this new masquerading of pointers in /proc/kallsyms.
> That certainly explains the problem.
>
>>
>> The bug is that perf doesn't say "I can't match kernel symbols", but
>> instead does some crazy matching and gives total crap module
>> information (I think it just picks the one that shows up last in
>> /proc/kallsyms).
>>
> But I agree perf must not silently return bogus information. It
> should print a big warning message and/or fall back to printing the raw
> addresses. So much for having perf in the kernel source tree to
> keep things in sync...
>
>>
>> That said, I have considered just reverting the thing that makes
>> kptr_restrict be 1 by default. I do like the security implications of
>> restricting visibility into kernel pointers, but I also think that
>> security rules that make the system less usable are dubious. So I
>> dunno.
>>
> I am not clear as to what people could actually do with the addresses
> taken out of /proc/kallsyms. Looks to me like we've lost functionality
> for the vast majority of users. So maybe the default should be inverted.
>
> I know of a somewhat similar issue with the file descriptor limit which
> people are hitting frequently these days when monitoring apps with lots
> of threads or lots of events in one run on large smp systems.
> That can easily be corrected, but again it requires root privilege to regain
> the functionality.
>


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 20:31 ` Linus Torvalds
  2011-05-12 20:43   ` David Miller
@ 2011-05-12 21:07   ` Stephane Eranian
  2011-05-12 21:30     ` Stephane Eranian
  2011-05-12 21:36     ` Ingo Molnar
  1 sibling, 2 replies; 50+ messages in thread
From: Stephane Eranian @ 2011-05-12 21:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Arnaldo Carvalho de Melo, LKML, Ingo Molnar

On Thu, May 12, 2011 at 10:31 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, May 12, 2011 at 7:48 AM, Stephane Eranian <eranian@google.com> wrote:
> >
> > I think there is a serious problem with kernel symbol correlation
> > with the latest perf in 2.6.39-rc7-tip.
>
> Yeah. It's annoying. It's a "perf" bug, though - triggered by
> /proc/sys/kernel/kptr_restrict being set to 1.
>
I did not know about this new masquerading of pointers in /proc/kallsyms.
That certainly explains the problem.

>
> The bug is that perf doesn't say "I can't match kernel symbols", but
> instead does some crazy matching and gives total crap module
> information (I think it just picks the one that shows up last in
> /proc/kallsyms).
>
But I agree perf must not silently return bogus information. It
should print a big warning message and/or fall back to printing the raw
addresses. So much for having perf in the kernel source tree to
keep things in sync...

>
> That said, I have considered just reverting the thing that makes
> kptr_restrict be 1 by default. I do like the security implications of
> restricting visibility into kernel pointers, but I also think that
> security rules that make the system less usable are dubious. So I
> dunno.
>
I am not clear as to what people could actually do with the addresses
taken out of /proc/kallsyms. Looks to me like we've lost functionality
for the vast majority of users. So maybe the default should be inverted.

I know of a somewhat similar issue with the file descriptor limit which
people are hitting frequently these days when monitoring apps with lots
of threads or lots of events in one run on large smp systems.
That can easily be corrected, but again it requires root privilege to regain
the functionality.


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 18:06 ` David Miller
  2011-05-12 18:37   ` Dave Jones
@ 2011-05-12 21:06   ` Ingo Molnar
  1 sibling, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2011-05-12 21:06 UTC (permalink / raw)
  To: David Miller; +Cc: eranian, acme, linux-kernel, Linus Torvalds


* David Miller <davem@davemloft.net> wrote:

> From: Stephane Eranian <eranian@google.com>
> Date: Thu, 12 May 2011 16:48:46 +0200
> 
> > I think there is a serious problem with kernel symbol correlation
> > with the latest perf in 2.6.39-rc7-tip.
> 
> The behavior seems to be intentional, so that we don't expose internal
> kernel addresses to userspace.
> 
> I hate this too, and I think it's absolutely ridiculous.
> 
> Also, like you, I lost an entire afternoon trying to figure out why
> this started happening.

I lost about an hour with Arnaldo on IRC to help me until we figured out that 
/proc/kallsyms started having zero value entries ... I too am running perf as an
unprivileged user.

Zero is a valid symbol address so nothing within perf tripped up explicitly, 
but perf report and perf top results were nonsensical.
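
Since zero is a valid address, a single zero entry proves nothing, but a table
in which every address is zero is a strong hint that the pointers were masked.
A minimal sketch of such a check follows; the helper names are hypothetical
and the symbol names in the usage note are made up. It assumes the usual
"address type name" layout of /proc/kallsyms lines.

```python
def parse_kallsyms(text):
    """Parse '/proc/kallsyms'-style lines into (address, type, name) tuples."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(fields[0], 16), fields[1], fields[2]))
    return symbols


def looks_restricted(symbols):
    """True when the table is non-empty and every address is zero."""
    return bool(symbols) and all(addr == 0 for addr, _type, _name in symbols)
```

For example, looks_restricted(parse_kallsyms("0000000000000000 T sym_a\n"))
is True, while a table containing at least one non-zero address is treated as
usable.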

There was another problem with it: perf is caching and storing known kernel 
buildid addresses in ~/.debug, under the (previously correct) assumption that 
kernel symbols do not change for one given kernel build. But with kptr_restrict 
it would cache the zero values - which were cached even after kptr_restrict was 
set back to 0.
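
One way to avoid poisoning the cache is to refuse to store a table that is
obviously masked. The sketch below uses a plain dict as a stand-in for the
on-disk ~/.debug cache; cache_symbols() and its arguments are hypothetical and
do not reflect perf's actual code.

```python
def cache_symbols(cache, build_id, symbols):
    """Store symbols under build_id unless the table is empty or all-zero.

    symbols is a list of (address, name) pairs.
    Returns True if the table was cached, False if it was rejected.
    """
    if not symbols or all(addr == 0 for addr, _name in symbols):
        return False  # masked or empty table: caching it would poison later runs
    cache[build_id] = list(symbols)
    return True
```

With this guard, a run made while kptr_restrict is set leaves the cache
untouched, so a later run with kptr_restrict back at 0 can populate it with
real addresses.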

Thanks,

	Ingo


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 20:31 ` Linus Torvalds
@ 2011-05-12 20:43   ` David Miller
  2011-05-12 21:07   ` Stephane Eranian
  1 sibling, 0 replies; 50+ messages in thread
From: David Miller @ 2011-05-12 20:43 UTC (permalink / raw)
  To: torvalds; +Cc: eranian, acme, linux-kernel, mingo

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 12 May 2011 13:31:37 -0700

> That said, I have considered just reverting the thing that makes
> kptr_restrict be 1 by default. I do like the security implications of
> restricting visibility into kernel pointers, but I also think that
> security rules that make the system less usable are dubious. So I
> dunno.

We don't have any firewalling or SELINUX rules installed by default,
even if those features are enabled in the kernel.  Userspace asks for
it.

Many people would claim that use of such things is "essential" these
days.

I don't see a good reason to handle kptr_restrict any differently.


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 14:48 Stephane Eranian
  2011-05-12 18:06 ` David Miller
@ 2011-05-12 20:31 ` Linus Torvalds
  2011-05-12 20:43   ` David Miller
  2011-05-12 21:07   ` Stephane Eranian
  1 sibling, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2011-05-12 20:31 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Arnaldo Carvalho de Melo, LKML, Ingo Molnar

On Thu, May 12, 2011 at 7:48 AM, Stephane Eranian <eranian@google.com> wrote:
>
> I think there is a serious problem with kernel symbol correlation
> with the latest perf in 2.6.39-rc7-tip.

Yeah. It's annoying. It's a "perf" bug, though - triggered by
/proc/sys/kernel/kptr_restrict being set to 1.

The bug is that perf doesn't say "I can't match kernel symbols", but
instead does some crazy matching and gives total crap module
information (I think it just picks the one that shows up last in
/proc/kallsyms).
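
To illustrate why an all-zero table could make a lookup land on the last
entry, here is a sketch of a generic "nearest start address at or below the
sample" lookup. It is not perf's actual algorithm, which the message above
only guesses at; naive_lookup() and the symbol names are hypothetical.

```python
import bisect


def naive_lookup(symbols, addr):
    """Return the name of the last symbol whose start address is <= addr.

    symbols: list of (start_address, name), sorted by start_address.
    """
    starts = [start for start, _name in symbols]
    index = bisect.bisect_right(starts, addr) - 1
    return symbols[index][1] if index >= 0 else None
```

When every start address is 0, any non-zero sample address sorts after all of
them, so the lookup returns whichever symbol happens to be listed last.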

That said, I have considered just reverting the thing that makes
kptr_restrict be 1 by default. I do like the security implications of
restricting visibility into kernel pointers, but I also think that
security rules that make the system less usable are dubious. So I
dunno.

                           Linus


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 19:01     ` David Miller
  2011-05-12 19:58       ` Pekka Enberg
@ 2011-05-12 20:24       ` Alexey Dobriyan
  1 sibling, 0 replies; 50+ messages in thread
From: Alexey Dobriyan @ 2011-05-12 20:24 UTC (permalink / raw)
  To: David Miller; +Cc: davej, eranian, acme, linux-kernel

On Thu, May 12, 2011 at 03:01:32PM -0400, David Miller wrote:
> From: Dave Jones <davej@redhat.com>
> Date: Thu, 12 May 2011 14:37:41 -0400
> 
> > On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
> >  > I hate this too, and I think it's absolutely ridiculous.
> >  > 
> >  > Also, like you, I lost an entire afternoon trying to figure out why
> >  > this started happening.
> >  > 
> >  > I wish we could revert this change.
> > 
> > At least it can be permanently disabled..
> > 
> > echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf
> 
> Regardless, what to do about all of the "perf is broken" reports?

The problem is that they turned it on by default.

	int kptr_restrict = 1;


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 19:01     ` David Miller
@ 2011-05-12 19:58       ` Pekka Enberg
  2011-05-13  6:12         ` Kees Cook
  2011-05-12 20:24       ` Alexey Dobriyan
  1 sibling, 1 reply; 50+ messages in thread
From: Pekka Enberg @ 2011-05-12 19:58 UTC (permalink / raw)
  To: David Miller
  Cc: davej, eranian, acme, linux-kernel, Kees Cook, Linus Torvalds,
	Ingo Molnar

On Thu, May 12, 2011 at 10:01 PM, David Miller <davem@davemloft.net> wrote:
> From: Dave Jones <davej@redhat.com>
> Date: Thu, 12 May 2011 14:37:41 -0400
>
>> On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
>>  > I hate this too, and I think it's absolutely ridiculous.
>>  >
>>  > Also, like you, I lost an entire afternoon trying to figure out why
>>  > this started happening.
>>  >
>>  > I wish we could revert this change.
>>
>> At least it can be permanently disabled..
>>
>> echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf
>
> Regardless, what to do about all of the "perf is broken" reports?

Let's revert commit 9f36e2c448007b54851e7e4fa48da97d1477a175
("printk: use %pK for /proc/kallsyms and /proc/modules"), please! I
too have been wondering what's going on with perf reporting insane
symbols and this should definitely not be enabled by default.

                        Pekka


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 18:37   ` Dave Jones
@ 2011-05-12 19:01     ` David Miller
  2011-05-12 19:58       ` Pekka Enberg
  2011-05-12 20:24       ` Alexey Dobriyan
  0 siblings, 2 replies; 50+ messages in thread
From: David Miller @ 2011-05-12 19:01 UTC (permalink / raw)
  To: davej; +Cc: eranian, acme, linux-kernel

From: Dave Jones <davej@redhat.com>
Date: Thu, 12 May 2011 14:37:41 -0400

> On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
>  > I hate this too, and I think it's absolutely ridiculous.
>  > 
>  > Also, like you, I lost an entire afternoon trying to figure out why
>  > this started happening.
>  > 
>  > I wish we could revert this change.
> 
> At least it can be permanently disabled..
> 
> echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf

Regardless, what to do about all of the "perf is broken" reports?

First off, perf can find out whether this madness exists, and it
should by default print out a warning in this situation instead of
knowingly emitting garbage kernel event information.
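
A minimal sketch of such a check, reading the sysctl file named earlier in
this thread. The function names and the warning text are hypothetical, and
the privilege handling is a simplification (a privileged caller is assumed to
see real addresses).

```python
import sys


def read_kptr_restrict(path="/proc/sys/kernel/kptr_restrict"):
    """Return the sysctl value as an int, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def warn_if_restricted(value, privileged, stream=sys.stderr):
    """Print a warning and return True when kernel symbols will be masked."""
    if value is None or value == 0 or privileged:
        return False
    print("warning: kptr_restrict=%d hides kernel addresses; "
          "kernel samples cannot be resolved to symbols" % value, file=stream)
    return True
```

A tool could call warn_if_restricted(read_kptr_restrict(), privileged=False)
once at startup and then report raw addresses instead of guessing symbols.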

"I'm going to knowingly give you bad data, and I'm not even going to
 let you know about it."

It's really crazy that we give people these incredibly powerful tools
and they don't even work properly by default.

We've been exposing kernel pointers for 20 years, nobody's grandmother
died because of it.

This is very "Animal Farm" the way we're gradually losing little bits
of functionality, time and time again, over this "kernel pointer
exposure" issue.

Are we going to be like animals and just accept the totality of this,
or are we going to be outraged enough to push back on stuff like perf
actually working properly?


* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 18:06 ` David Miller
@ 2011-05-12 18:37   ` Dave Jones
  2011-05-12 19:01     ` David Miller
  2011-05-12 21:06   ` Ingo Molnar
  1 sibling, 1 reply; 50+ messages in thread
From: Dave Jones @ 2011-05-12 18:37 UTC (permalink / raw)
  To: David Miller; +Cc: eranian, acme, linux-kernel

On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
 > From: Stephane Eranian <eranian@google.com>
 > Date: Thu, 12 May 2011 16:48:46 +0200
 > 
 > > I think there is a serious problem with kernel symbol correlation
 > > with the latest perf in 2.6.39-rc7-tip.
 > 
 > The behavior seems to be intentional, so that we don't expose internal
 > kernel addresses to userspace.

Sounds like commit 9f36e2c448007b54851e7e4fa48da97d1477a175

 > I hate this too, and I think it's absolutely ridiculous.
 > 
 > Also, like you, I lost an entire afternoon trying to figure out why
 > this started happening.
 > 
 > I wish we could revert this change.

At least it can be permanently disabled..

echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf

	Dave



* Re: [BUG] perf: bogus correlation of kernel symbols
  2011-05-12 14:48 Stephane Eranian
@ 2011-05-12 18:06 ` David Miller
  2011-05-12 18:37   ` Dave Jones
  2011-05-12 21:06   ` Ingo Molnar
  2011-05-12 20:31 ` Linus Torvalds
  1 sibling, 2 replies; 50+ messages in thread
From: David Miller @ 2011-05-12 18:06 UTC (permalink / raw)
  To: eranian; +Cc: acme, linux-kernel

From: Stephane Eranian <eranian@google.com>
Date: Thu, 12 May 2011 16:48:46 +0200

> I think there is a serious problem with kernel symbol correlation
> with the latest perf in 2.6.39-rc7-tip.

The behavior seems to be intentional, so that we don't expose internal
kernel addresses to userspace.

I hate this too, and I think it's absolutely ridiculous.

Also, like you, I lost an entire afternoon trying to figure out why
this started happening.

I wish we could revert this change.


* [BUG] perf: bogus correlation of kernel symbols
@ 2011-05-12 14:48 Stephane Eranian
  2011-05-12 18:06 ` David Miller
  2011-05-12 20:31 ` Linus Torvalds
  0 siblings, 2 replies; 50+ messages in thread
From: Stephane Eranian @ 2011-05-12 14:48 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: LKML

Hi,

I think there is a serious problem with kernel symbol correlation
with the latest perf in 2.6.39-rc7-tip.

Here is a simple example with a stupid program that only
does open()/close() on /dev/null:

$ perf record -e cycles:k openclose
$ perf report --stdio

# Events: 2K cycles
#
# Overhead    Command     Shared Object           Symbol
# ........  .........  ................  ...............
#
    99.76%  openclose  [binfmt_misc]     [k] 0xffffffff81010fe6
     0.13%  openclose  libc-2.12.1.so    [.] __open_nocancel
     0.09%  openclose  libc-2.12.1.so    [.] __GI_close

The DSO (binfmt_misc) is bogus. That's not where time is spent.

But if I run the same test as root:

$ sudo perf record -e cycles:k openclose
$ sudo perf report --stdio

# Events: 2K cycles
#
# Overhead    Command      Shared Object                         Symbol
# ........  .........  .................  .............................
#
    17.13%  openclose  [kernel.kallsyms]  [k] __lock_acquire
    11.77%  openclose  [kernel.kallsyms]  [k] native_sched_clock
     7.36%  openclose  [kernel.kallsyms]  [k] sched_clock_local
     5.99%  openclose  [kernel.kallsyms]  [k] lock_release
     5.38%  openclose  [kernel.kallsyms]  [k] local_clock
     4.43%  openclose  [kernel.kallsyms]  [k] lock_acquired
     4.05%  openclose  [kernel.kallsyms]  [k] lock_acquire
     3.95%  openclose  [kernel.kallsyms]  [k] lock_is_held
     3.51%  openclose  [kernel.kallsyms]  [k] sched_clock_cpu
     3.24%  openclose  [kernel.kallsyms]  [k] trace_hardirqs_off_caller

This is much more meaningful.

This is not related to the paranoid level (1 for me).

Looking at perf report -D, the same kernel address is associated with a
different module depending on my permission level.

first perf.data:
416749738927 0x4210 [0x28]: PERF_RECORD_SAMPLE(IP, 1): 4886/4886:
0xffffffff8107c1d8 period: 2262681
 ... thread: openclose:4886
 ...... dso: /lib/modules/2.6.39-rc7-tip/kernel/fs/binfmt_misc.ko

second perf.data:
436879910722 0xc950 [0x28]: PERF_RECORD_SAMPLE(IP, 1): 4894/4894:
0xffffffff8107c1d8 period: 2280253
 ... thread: openclose:4894
 ...... dso: vmlinux

Same address, different mapping!

Every component of my path to vmlinux is accessible to me.

If there were permission problems, I would expect perf record or perf report
to tell me and not fall back to some bogus mappings.
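
The expected behaviour could look roughly like the sketch below: resolve a
sample only when the symbol table is usable, and otherwise print the raw
address. symbolize() is a hypothetical helper, not perf code, and it treats a
table with no non-zero address as unusable.

```python
def symbolize(symbols, addr):
    """Resolve addr to a symbol name, or to its raw hex form if that is unsafe.

    symbols: list of (start_address, name) sorted by start_address.
    """
    usable = any(start != 0 for start, _name in symbols)
    if usable:
        best = None
        for start, name in symbols:
            if start > addr:
                break  # later symbols start beyond the sample address
            best = name
        if best is not None:
            return best
    return "0x%016x" % addr  # unusable table or no match: show the raw address
```

With a masked table, the sample from the report above is printed as
0xffffffff8107c1d8 rather than being attributed to an arbitrary module.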


end of thread, other threads:[~2011-05-24  4:06 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
     [not found] <1305292059.1949.0.camel@dan>
2011-05-13 13:29 ` [BUG] perf: bogus correlation of kernel symbols Dan Rosenberg
2011-05-16 15:35   ` Ingo Molnar
2011-05-16 16:14     ` Dan Rosenberg
2011-05-20  0:56     ` Dan Rosenberg
2011-05-20 12:07       ` Ingo Molnar
2011-05-20 12:54         ` Dan Rosenberg
2011-05-20 13:11           ` Ingo Molnar
2011-05-20 17:41             ` Dan Rosenberg
2011-05-20 18:14               ` Linus Torvalds
2011-05-20 18:27                 ` Kees Cook
2011-05-20 18:34                   ` Dan Rosenberg
2011-05-20 18:42                     ` Ingo Molnar
2011-05-20 18:28                 ` Ingo Molnar
2011-05-22  6:11                 ` david
2011-05-20 18:35               ` Ingo Molnar
2011-05-22 18:45             ` Dan Rosenberg
     [not found]               ` <BANLkTik1SK_kWVvGsKk0SqdByQ5-0b5nFg@mail.gmail.com>
2011-05-23  0:25                 ` Dan Rosenberg
2011-05-23  0:37                   ` H. Peter Anvin
2011-05-23 10:49                   ` Ingo Molnar
2011-05-23 19:02                     ` Ray Lee
2011-05-23 19:35                       ` Ingo Molnar
2011-05-24  1:59                     ` Valdis.Kletnieks
2011-05-24  4:06                       ` Ingo Molnar
2011-05-12 14:48 Stephane Eranian
2011-05-12 18:06 ` David Miller
2011-05-12 18:37   ` Dave Jones
2011-05-12 19:01     ` David Miller
2011-05-12 19:58       ` Pekka Enberg
2011-05-13  6:12         ` Kees Cook
2011-05-13  6:24           ` Pekka Enberg
2011-05-12 20:24       ` Alexey Dobriyan
2011-05-12 21:06   ` Ingo Molnar
2011-05-12 20:31 ` Linus Torvalds
2011-05-12 20:43   ` David Miller
2011-05-12 21:07   ` Stephane Eranian
2011-05-12 21:30     ` Stephane Eranian
2011-05-12 21:35       ` Ingo Molnar
2011-05-12 21:38         ` Stephane Eranian
2011-05-12 21:50           ` Ingo Molnar
2011-05-12 21:56             ` Stephane Eranian
2011-05-12 22:00               ` Ingo Molnar
2011-05-12 22:07             ` Dave Jones
2011-05-12 22:15               ` Stephane Eranian
2011-05-13  9:01                 ` Ingo Molnar
2011-05-13  8:57               ` Ingo Molnar
2011-05-13 16:23                 ` Andi Kleen
2011-05-17 12:17                   ` Ingo Molnar
2011-05-12 21:36     ` Ingo Molnar
2011-05-12 21:41       ` Stephane Eranian
2011-05-12 21:54         ` Ingo Molnar
