linux-kernel.vger.kernel.org archive mirror
* RE: possible CPU bug and request for Intel contacts
@ 2005-01-26  1:38 Seth, Rohit
  2005-02-13 20:10 ` Ingo Molnar
  0 siblings, 1 reply; 11+ messages in thread
From: Seth, Rohit @ 2005-01-26  1:38 UTC
  To: Kirill Korotaev
  Cc: Linus Torvalds, Ingo Molnar, Saxena, Sunil, Pallipadi, Venkatesh,
	Andrey Savochkin, linux-kernel

Kirill Korotaev <mailto:dev@sw.ru> wrote on Tuesday, January 25, 2005
6:12 AM:

> Hello Rohit,
> 
>>> BTW, can you explain why making pages non-global is the cure? Is it
>>> safe workaround for this bug?

>> There is a boundary condition that can have non-global pages
>> containing the CR3 load to also hit this issue on affected PIII. 
>> Though for this to happen, mov to cr3 has to be the very last
>> instruction on a page. And the page following that page (containing
>> CR3 load) has to have different mapping between user and kernel
>> spaces. 

> but in our case "mov %edx, %cr3" is not the last instruction on a
> page. 
> It is in the middle of it.

So, in this scenario (where the trampoline code is mapped by a
non-global page), we will not hit this issue.

> Well, another remark is that after cr3 load there are only few
> instructions before the "call system_call_table(%edx)" which
> references 
> the page with different user and kernel mappings.
> 
> also, this bug can be cured via inserting about 20 simple operations
> between cr3 load and call to the page with overlapping mappings.
> 

This is not a recommended solution.

> I'm just trying to understand is it the bug referenced in E80 or not
> and 
> is it safe to use non-global mappings as a cure.

Our analysis has shown that this is the E80 issue.  In this 4G-4G kernel
context, it is safe to use non-global mappings as a workaround for this
issue. (Or we can use any of the other recommendations given in the spec
update except rdtsc with global pages).

We have also seen that inserting an rdtsc instruction is not a
workaround.  We will update the spec update with this information.

On a slightly different note, while running the 4G-4G kernel on our
machine, we saw occasional hangs.  We root-caused those to the fact
that this kernel was first changing the stack pointer from the virtual
stack to the kernel stack and then changing CR3 to that of the kernel.
Any interrupt between these two instructions will result in a hang, as
the interrupt handler will execute with the user's CR3 (the kernel
thinks that it is already in kernel mode because of the value of esp).
Swapping the order, i.e. first loading CR3 with the kernel value and
then switching the stack to the kernel stack, fixes this issue.  Venki
will generate that patch and send it to lkml.
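
To make the ordering argument concrete, here is a minimal user-space
model in C. It is only a sketch: the struct, the enum and the function
names are hypothetical, and the handler logic (decide "already in
kernel" from esp alone, otherwise do the full switch) is an assumption
taken from the description above, not code from any 4G-4G patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two pieces of state that matter here. */
struct cpu_state {
    bool esp_is_kernel;  /* esp points at the real kernel stack */
    bool cr3_is_kernel;  /* CR3 holds the kernel page table */
};

enum entry_order { ESP_THEN_CR3, CR3_THEN_ESP };

/*
 * Assumed handler behaviour: it looks only at esp to decide whether the
 * CPU is already in kernel context. If esp looks like a user/virtual
 * stack, it performs the full switch itself. Returns true if the handler
 * body ends up running on the kernel page table.
 */
bool interrupt_runs_on_kernel_cr3(struct cpu_state s)
{
    if (!s.esp_is_kernel) {
        s.cr3_is_kernel = true;
        s.esp_is_kernel = true;
    }
    return s.cr3_is_kernel;
}

/*
 * Take an interrupt exactly between the two entry instructions and
 * report whether it is handled on the kernel page table.
 */
bool entry_is_safe(enum entry_order order)
{
    struct cpu_state s = { false, false };

    assert(order == ESP_THEN_CR3 || order == CR3_THEN_ESP);
    if (order == ESP_THEN_CR3)
        s.esp_is_kernel = true;   /* first instruction done, CR3 still user */
    else
        s.cr3_is_kernel = true;   /* first instruction done, esp still virtual */

    return interrupt_runs_on_kernel_cr3(s);
}
```

In this model entry_is_safe(ESP_THEN_CR3) is false (the handler skips
the switch and runs on the user's CR3), while entry_is_safe(CR3_THEN_ESP)
is true.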

Thanks, rohit



* Re: possible CPU bug and request for Intel contacts
  2005-01-26  1:38 possible CPU bug and request for Intel contacts Seth, Rohit
@ 2005-02-13 20:10 ` Ingo Molnar
  0 siblings, 0 replies; 11+ messages in thread
From: Ingo Molnar @ 2005-02-13 20:10 UTC
  To: Seth, Rohit
  Cc: Kirill Korotaev, Linus Torvalds, Saxena, Sunil, Pallipadi,
	Venkatesh, Andrey Savochkin, linux-kernel


* Seth, Rohit <rohit.seth@intel.com> wrote:

> On a little different note, while running the 4G-4G kernel on our
> machine, we saw occasional hangs.  Those are root caused to the fact
> that this kernel was first changing the stack pointer from virtual
> stack to kernel and then changing the CR3 to that of kernel.  Any
> interrupt between these two instructions will result in those hangs as
> the interruption handler will execute with user's CR3(as the kernel
> thinks that it is already in kernel because of the value of esp). 
> Swapping the order, first loading the CR3 with kernel and then
> switching the stack to kernel fixes this issue.  Venki will generate
> that patch and send to lkml.

i'm not sure what you mean. Here's the relevant 4:4 code from Fedora:

#define __SWITCH_KERNELSPACE                            \
...
        movl %edx, %cr3;                                \
        movl %ebx, %esp;                                \

i.e. we _first_ load cr3 with the kernel pagetable value, and only then
do we switch esp to the real kernel stack.

	Ingo


* RE: possible CPU bug and request for Intel contacts
@ 2005-02-14 16:55 Pallipadi, Venkatesh
  0 siblings, 0 replies; 11+ messages in thread
From: Pallipadi, Venkatesh @ 2005-02-14 16:55 UTC
  To: Ingo Molnar, Seth, Rohit
  Cc: Kirill Korotaev, Linus Torvalds, Saxena, Sunil, Andrey Savochkin,
	linux-kernel


>-----Original Message-----
>From: Ingo Molnar [mailto:mingo@elte.hu] 
>Sent: Sunday, February 13, 2005 12:10 PM
>To: Seth, Rohit
>Cc: Kirill Korotaev; Linus Torvalds; Saxena, Sunil; Pallipadi, 
>Venkatesh; Andrey Savochkin; linux-kernel@vger.kernel.org
>Subject: Re: possible CPU bug and request for Intel contacts
>
>
>* Seth, Rohit <rohit.seth@intel.com> wrote:
>
>> On a little different note, while running the 4G-4G kernel on our
>> machine, we saw occasional hangs.  Those are root caused to the fact
>> that this kernel was first changing the stack pointer from virtual
>> stack to kernel and then changing the CR3 to that of kernel.  Any
>> interrupt between these two instructions will result in those hangs as
>> the interruption handler will execute with user's CR3(as the kernel
>> thinks that it is already in kernel because of the value of esp). 
>> Swapping the order, first loading the CR3 with kernel and then
>> switching the stack to kernel fixes this issue.  Venki will generate
>> that patch and send to lkml.
>
>i'm not sure what you mean. Here's the relevant 4:4 code from Fedora:
>
>#define __SWITCH_KERNELSPACE                            \
>...
>        movl %edx, %cr3;                                \
>        movl %ebx, %esp;                                \
>
>i.e. we _first_ load cr3 with the kernel pagetable value, and only then
>do we switch esp to the real kernel stack.
>

Yes. I verified that, and that's the reason I didn't send any patch. But
the kernel we were using in our testing of this bug came from some
earlier version of the 4:4 code and had the cr3 switch and the esp
switch in reverse order.

With the latest kernels there is no issue related to this.

Thanks,
Venki


* Re: possible CPU bug and request for Intel contacts
  2005-01-25  7:22 Seth, Rohit
@ 2005-01-25 14:12 ` Kirill Korotaev
  0 siblings, 0 replies; 11+ messages in thread
From: Kirill Korotaev @ 2005-01-25 14:12 UTC
  To: Seth, Rohit
  Cc: Linus Torvalds, Ingo Molnar, Saxena, Sunil, Pallipadi, Venkatesh,
	Andrey Savochkin, linux-kernel

Hello Rohit,

>>BTW, can you explain why making pages non-global is the cure? Is it
>> safe workaround for this bug?
> There is a boundary condition that can have non-global pages containing
> the CR3 load to also hit this issue on affected PIII.  Though for this
> to happen, mov to cr3 has to be the very last instruction on a page.
> And the page following that page (containing CR3 load) has to have
> different mapping between user and kernel spaces.
but in our case "mov %edx, %cr3" is not the last instruction on a page. 
It is in the middle of it.
Well, another remark is that after cr3 load there are only few 
instructions before the "call system_call_table(%edx)" which references 
the page with different user and kernel mappings.

also, this bug can be cured via inserting about 20 simple operations 
between cr3 load and call to the page with overlapping mappings.

I'm just trying to understand is it the bug referenced in E80 or not and 
is it safe to use non-global mappings as a cure.

Kirill



* RE: possible CPU bug and request for Intel contacts
@ 2005-01-25  7:22 Seth, Rohit
  2005-01-25 14:12 ` Kirill Korotaev
  0 siblings, 1 reply; 11+ messages in thread
From: Seth, Rohit @ 2005-01-25  7:22 UTC
  To: Kirill Korotaev
  Cc: Linus Torvalds, Ingo Molnar, Saxena, Sunil, Pallipadi, Venkatesh,
	Andrey Savochkin, linux-kernel

Kirill Korotaev <mailto:dev@sw.ru> wrote on Monday, January 24, 2005
1:51 AM:

> Hello Rohit,
> 
>> Thanks for sending the detailed information. Based on our experiments
>> and analysis, we believe at this point that this is a known E80 issue
>> mentioned in the PIII spec update at this location
>> (http://www.intel.com/design/pentiumiii/specupdt/24445351.pdf)
> 
>> Could you please try one of the suggested work arounds for this
>> issue. 
> Yes, double cr3 reload and cpuid helps us. But rdtsc doesn't.
> 

I will have to get back to you about rdtsc.


> BTW, can you explain why making pages non-global is the cure? Is it
>   safe workaround for this bug?

There is a boundary condition that can have non-global pages containing
the CR3 load to also hit this issue on affected PIII.  Though for this
to happen, mov to cr3 has to be the very last instruction on a page.
And the page following that page (containing CR3 load) has to have
different mapping between user and kernel spaces.
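
The first half of that boundary condition (the mov to cr3 being the
very last instruction on a page) can be checked mechanically. The C
sketch below is only an illustration: the helper name and the constants
are hypothetical, it assumes 4 KiB pages, and it does not model the
second half of the condition (the following page having different user
and kernel mappings).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BYTES 4096u      /* assumption: 4 KiB pages */

/* "mov %edx, %cr3" encodes as 0f 22 da, i.e. 3 bytes. */
#define MOV_TO_CR3_LEN 3u

/*
 * True if an instruction of `len` bytes starting at linear address
 * `addr` ends exactly on a page boundary, i.e. it is the last
 * instruction on its page.
 */
bool is_last_insn_on_page(uint32_t addr, uint32_t len)
{
    assert(len > 0 && len <= 15);   /* x86 instructions are at most 15 bytes */
    return ((addr + len) % PAGE_BYTES) == 0;
}
```

For example, a 3-byte mov to cr3 at an address ending in 0xffd is the
last instruction on its page, while one in the middle of a page (as in
Kirill's case) is not.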

> Double cr3 reload looks a bit unsafe to me, since interrupts or NMI
> can 
> occur between the reloads and probably reuse stale iTLB mappings...

Interruptions will ensure that stale mappings don't exist in the ITLB
fill buffer.  So, you are okay with double CR3 loads.

Thanks, rohit



* RE: possible CPU bug and request for Intel contacts
@ 2005-01-25  7:15 Seth, Rohit
  0 siblings, 0 replies; 11+ messages in thread
From: Seth, Rohit @ 2005-01-25  7:15 UTC
  To: Pavel Machek, Kirill Korotaev
  Cc: Linus Torvalds, Ingo Molnar, Saxena, Sunil, Pallipadi, Venkatesh,
	Andrey Savochkin, linux-kernel

Pavel Machek <mailto:pavel@ucw.cz> wrote on Saturday, January 22, 2005
2:03 AM:

> Hi!
> 
>> Here are the details about CPU bug I mentioned in my previous post.
>> Though it turned out later that it happens on P-III systems only I
>> still hope it can be of interest.
> 
> What about Pentium-M? They are based on P-III and are certainly *very*
> interesting.
> 								Pavel

This issue does not happen on Pentium-M.  This issue happens only on
some PIII steppings.  Information about the affected stepping is
provided in the spec update link.

http://www.intel.com/design/pentiumiii/specupdt/24445351.pdf


* Re: possible CPU bug and request for Intel contacts
  2005-01-22  3:02 Seth, Rohit
@ 2005-01-24  9:51 ` Kirill Korotaev
  0 siblings, 0 replies; 11+ messages in thread
From: Kirill Korotaev @ 2005-01-24  9:51 UTC
  To: Seth, Rohit
  Cc: Linus Torvalds, Ingo Molnar, Saxena, Sunil, Pallipadi, Venkatesh,
	Andrey Savochkin, linux-kernel

Hello Rohit,

> Thanks for sending the detailed information. Based on our experiments
> and analysis, we believe at this point that this is a known E80 issue
> mentioned in the PIII spec update at this location
> (http://www.intel.com/design/pentiumiii/specupdt/24445351.pdf)

> Could you please try one of the suggested work arounds for this issue.  
Yes, double cr3 reload and cpuid help us. But rdtsc doesn't.

BTW, can you explain why making pages non-global is the cure? Is it a
safe workaround for this bug?
Double cr3 reload looks a bit unsafe to me, since interrupts or NMIs can
occur between the reloads and probably reuse stale iTLB mappings... But
I'm not sure about this since it is much harder to catch and I have no
access to CPU internals. What is your opinion about this?

Sorry for taking your time, I should have checked ERRATAs more 
attentively myself.

Thank you,
Kirill

>>Hello,
>>
>>Here are the details about CPU bug I mentioned in my previous post.
>>Though it turned out later that it happens on P-III systems only I
>>still 
>>hope it can be of interest.
>>
>>Brief description
>>~~~~~~~~~~~~~~~~~
>>
>>This issue was found by Vasily Averin (vvs@sw.ru) when playing
>>with uselib security exploit on kernels with my 4gb split patch.
>>
>>This bug results in strange effects such as calltraces below,
>>reboots, impossible call traces and so on.
>>
>>I started to resolve the bug, narrowed down uselib exploit and
>>got a simple testcase for the bug, which can be found in attach.
>>This testcase does a simple thing - it maps pages at low addresses
>>from 0x04000000 downto 0x00000000, page by page and touches them
>>for write. Sometimes when running this exploit I got oopses,
>>sometimes reboots and I found that this is sensitive to the page
>>addresses which exploit maps.
>>
>>Why it crashes? I think this is due to virtual addresses of
>>kernel code and mapped user space pages overlap. I was able even to
>>reboot machine if mapped user space pages were filled with some
>>appropriate asm code.
>>
>>I found that Ingo Molnar 4gb split is not vulnerable, and after
>>investigations I found that Ingo patch doesn't map kernel entry code
>>(trampoline) as _PAGE_GLOBAL. This was the answer.
>>
>>I tested it on 4 different P-III machines - all of them were
>>vulnerable. 
>>But lately I tested it on Celeron 2.4Ghz and P4 systems - it doesn't
>>happen, so this bug can be of low interest to Intel people :(
>>
>>Below you can find the way how to reproduce the bug, call traces
>>and why I think it's a hardware bug.
>>
>>How to reproduce a bug
>>~~~~~~~~~~~~~~~~~~~~~~
>>
>>- take any FedoraCore kernel with Ingo Molnar 4gb split patch
>>   or mainstream kernel and apply 4GB split patch
>>- apply attached diff-arch-4gb-global patch to make
>>   trampoline code to be GLOBAL
>>- compile kernel with turned on 4gb split, i.e. CONFIG_X86_4GB=y
>>- boot the kernel and run the attached testcase:
>>
>># while true; do ./4gbtest; done;
>>
>>or
>>
>># ./elflbl -l ./lib -a 0x4000000  (where elflbl is uselib exploit)
>>
>>During each 4-5 test runs I get the following oops:
>>
>>Jan 21 12:15:17 ts Unable to handle kernel NULL pointer dereference at virtual address 000000c0
>>Jan 21 12:15:17 ts  printing eip:
>>Jan 21 12:15:17 ts 02114450
>>Jan 21 12:15:17 ts *pde = 00000000
>>Jan 21 12:15:17 ts Oops: 0002
>>Jan 21 12:15:17 ts SMP
>>Jan 21 12:15:17 ts Modules linked in:
>>Jan 21 12:15:17 ts CPU:    0
>>Jan 21 12:15:17 ts EIP:    0060:[<02114450>]    Not tainted
>>Jan 21 12:15:17 ts EFLAGS: 00010246   (2.6.8-dev)
>>Jan 21 12:15:17 ts EIP is at sys_mmap2+0x0/0xb0
>>Jan 21 12:15:17 ts eax: 000000c0   ebx: 31524fc4   ecx: 00001000   edx: 004ec000
>>Jan 21 12:15:17 ts esi: 00000032   edi: 00000000   ebp: 31524000   esp: 31524fc0
>>Jan 21 12:15:17 ts ds: 007b   es: 007b   ss: 0068
>>Jan 21 12:15:17 ts Process test (pid: 25, threadinfo=31524000 task=31f680c0)
>>Jan 21 12:15:17 ts Stack: fffec200 01a2a000 00001000 00000003 00000032 00000000 00000000 000000c0
>>Jan 21 12:15:17 ts        0000007b 0000007b 000000c0 08048541 00000073 00000282 bffffdcc 0000007b
>>Jan 21 12:15:17 ts Call Trace:
>>Jan 21 12:15:17 ts Code: 55 bd f7 ff ff ff 57 31 ff 56 53 83 ec 18 8b 44 24 38 89 c6
>>
>>  Unable to handle kernel NULL pointer dereference at virtual address 000000c0
>>  02114450
>>  *pde = 00000000
>>  Oops: 0002
>>  CPU:    0
>>  EIP:    0060:[<02114450>]    Not tainted
>>  EFLAGS: 00010246   (2.6.8-dev)
>>  eax: 000000c0   ebx: 31524fc4   ecx: 00001000   edx: 004ec000
>>  esi: 00000032   edi: 00000000   ebp: 31524000   esp: 31524fc0
>>  ds: 007b   es: 007b   ss: 0068
>>  Stack: fffec200 01a2a000 00001000 00000003 00000032 00000000 00000000 000000c0
>>         0000007b 0000007b 000000c0 08048541 00000073 00000282 bffffdcc 0000007b
>>  Call Trace:
>>  Code: 55 bd f7 ff ff ff 57 31 ff 56 53 83 ec 18 8b 44 24 38 89 c6
>>
>>
>> >>EIP; 02114450 <sys_mmap2+0/b0>   <=====
>>
>> >>ebx; 31524fc4 <pg0+2eff8fc4/fdac0000>
>> >>ebp; 31524000 <pg0+2eff8000/fdac0000>
>> >>esp; 31524fc0 <pg0+2eff8fc0/fdac0000>
>>
>>Code;  02114450 <sys_mmap2+0/b0>
>>00000000 <_EIP>:
>>Code;  02114450 <sys_mmap2+0/b0>   <=====
>>    0:   55                        push   %ebp   <=====
>>Code;  02114451 <sys_mmap2+1/b0>
>>    1:   bd f7 ff ff ff            mov    $0xfffffff7,%ebp
>>Code;  02114456 <sys_mmap2+6/b0>
>>    6:   57                        push   %edi
>>Code;  02114457 <sys_mmap2+7/b0>
>>    7:   31 ff                     xor    %edi,%edi
>>Code;  02114459 <sys_mmap2+9/b0>
>>    9:   56                        push   %esi
>>Code;  0211445a <sys_mmap2+a/b0>
>>    a:   53                        push   %ebx
>>Code;  0211445b <sys_mmap2+b/b0>
>>    b:   83 ec 18                  sub    $0x18,%esp
>>Code;  0211445e <sys_mmap2+e/b0>
>>    e:   8b 44 24 38               mov    0x38(%esp,1),%eax
>>Code;  02114462 <sys_mmap2+12/b0>
>>   12:   89 c6                     mov    %eax,%esi
>>
>>Why CPU is unable to handle paging request at 0x000000c0? There is no
>>access to
>>this addr in executing code! What has "push %ebp" to do with 0xc0?
>>The answer is that %eax contains 0xc0 and the pages touched in user
>>space contain 4092 zero bytes. And 0x0000 is the opcode for "addb %al,
>>(%eax)".
>>So we see the situation when CPU is executing code from user space
>>pages though we are in kernel space already and data peeks from these
>>addresses
>>shows us the correct code (code in call trace is correct!).
>>I checked it and if these pages are filled with some other values,
>>not zeroes, than it's possible to make CPU execute this code.
>>
>>And why this happens on sys_mmap2+0? Because entry code (system_call)
>>is mapped at high addresses (> 0xffc00000) and is the same both in
>>kernel 
>>and user spaces, so entry.S code works ok.
>>
>>So we found 2 ways of curing this bug:
>>- make trampoline code to be non-GLOBAL
>>- another observation was that PAE turned ON helps as well.
>>
>>Hypothesis
>>~~~~~~~~~~
>>I think that the problem is in code prefetch queue or somewhere in
>>CPU. 
>>It looks like CPU doesn't flush code prefetch queue after %cr3 reload
>>(to kernel space) in entry.S and continues to execute prefetched code
>>from user space pages.
>>
>>Why making entry code non-global helps the problem?
>>I think that if the code at %eip is flushed on %cr3 reload than the
>>_whole_ prefetch queue is flushed and when entry code is global than
>>it is 
>>not flushed on %cr3 reload and prefetch queue (including call to
>>flushed sys_mmap2 code) is not flushed.
>>
>>Kirill
>>
>>
>>
>>>Hi Kirill,
>>>
>>>I appreciate you bringing this issue up.  Could you please send us
>>>the information on how you are able to reproduce this issue (System
>>>config, Linux kernel version and any test case).  We would like to
>>>root cause the failure here at Intel. 
>>>
>>>Appreciate your help,
>>>Thanks,
>>>-rohit
>>>
>>>Kirill Korotaev <> wrote on Wednesday, January 19, 2005 8:08 AM:
>>>
>>>
>>>
>>>>Hello Linus,
>>>>
>>>>Linus, Ingo, I've got one strange CPU bug leading to oopses, reboots
>>>>and so on. This bug can be reproduced with a little bit modified 4gb
>>>>split and is probably related to CPU speculative execution. I'll
>>>>post more information about this bug later, but I would like to ask
>>>>you for Intel guys contacts who maybe interested in this
>>>>information, so I could CC them as well. 
>>>>
>>>>Thank you,
>>>>Kirill
>>>>
>>>>-
>>>>To unsubscribe from this list: send the line "unsubscribe
>>>>linux-kernel" in the body of a message to majordomo@vger.kernel.org
>>>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>Please read the FAQ at  http://www.tux.org/lkml/
> 
> 




* Re: possible CPU bug and request for Intel contacts
  2005-01-21 12:46 ` Kirill Korotaev
@ 2005-01-22 10:03   ` Pavel Machek
  0 siblings, 0 replies; 11+ messages in thread
From: Pavel Machek @ 2005-01-22 10:03 UTC
  To: Kirill Korotaev
  Cc: Seth, Rohit, Linus Torvalds, Ingo Molnar, Saxena, Sunil,
	Pallipadi, Venkatesh, Andrey Savochkin, linux-kernel

Hi!

> Here are the details about CPU bug I mentioned in my previous post. 
> Though it turned out later that it happens on P-III systems only I still 
> hope it can be of interest.

What about Pentium-M? They are based on P-III and are certainly *very*
interesting.
								Pavel

-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!


* RE: possible CPU bug and request for Intel contacts
@ 2005-01-22  3:02 Seth, Rohit
  2005-01-24  9:51 ` Kirill Korotaev
  0 siblings, 1 reply; 11+ messages in thread
From: Seth, Rohit @ 2005-01-22  3:02 UTC
  To: Kirill Korotaev
  Cc: Linus Torvalds, Ingo Molnar, Saxena, Sunil, Pallipadi, Venkatesh,
	Andrey Savochkin, linux-kernel

Hello Kirill,

Thanks for sending the detailed information. Based on our experiments
and analysis, we believe at this point that this is a known E80 issue
mentioned in the PIII spec update at this location
(http://www.intel.com/design/pentiumiii/specupdt/24445351.pdf)

Could you please try one of the suggested work arounds for this issue.  

Thanks, rohit

Kirill Korotaev <mailto:dev@sw.ru> wrote on Friday, January 21, 2005
4:47 AM:

> Hello,
> 
> Here are the details about CPU bug I mentioned in my previous post.
> Though it turned out later that it happens on P-III systems only I
> still 
> hope it can be of interest.
> 
> Brief description
> ~~~~~~~~~~~~~~~~~
> 
> This issue was found by Vasily Averin (vvs@sw.ru) when playing
> with uselib security exploit on kernels with my 4gb split patch.
> 
> This bug results in strange effects such as calltraces below,
> reboots, impossible call traces and so on.
> 
> I started to resolve the bug, narrowed down uselib exploit and
> got a simple testcase for the bug, which can be found in attach.
> This testcase does a simple thing - it maps pages at low addresses
> from 0x04000000 downto 0x00000000, page by page and touches them
> for write. Sometimes when running this exploit I got oopses,
> sometimes reboots and I found that this is sensitive to the page
> addresses which exploit maps.
> 
> Why it crashes? I think this is due to virtual addresses of
> kernel code and mapped user space pages overlap. I was able even to
> reboot machine if mapped user space pages were filled with some
> appropriate asm code.
> 
> I found that Ingo Molnar 4gb split is not vulnerable, and after
> investigations I found that Ingo patch doesn't map kernel entry code
> (trampoline) as _PAGE_GLOBAL. This was the answer.
> 
> I tested it on 4 different P-III machines - all of them were
> vulnerable. 
> But lately I tested it on Celeron 2.4Ghz and P4 systems - it doesn't
> happen, so this bug can be of low interest to Intel people :(
> 
> Below you can find the way how to reproduce the bug, call traces
> and why I think it's a hardware bug.
> 
> How to reproduce a bug
> ~~~~~~~~~~~~~~~~~~~~~~
> 
> - take any FedoraCore kernel with Ingo Molnar 4gb split patch
>    or mainstream kernel and apply 4GB split patch
> - apply attached diff-arch-4gb-global patch to make
>    trampoline code to be GLOBAL
> - compile kernel with turned on 4gb split, i.e. CONFIG_X86_4GB=y
> - boot the kernel and run the attached testcase:
> 
> # while true; do ./4gbtest; done;
> 
> or
> 
> # ./elflbl -l ./lib -a 0x4000000  (where elflbl is uselib exploit)
> 
> During each 4-5 test runs I get the following oops:
> 
> Jan 21 12:15:17 ts Unable to handle kernel NULL pointer dereference at virtual address 000000c0
> Jan 21 12:15:17 ts  printing eip:
> Jan 21 12:15:17 ts 02114450
> Jan 21 12:15:17 ts *pde = 00000000
> Jan 21 12:15:17 ts Oops: 0002
> Jan 21 12:15:17 ts SMP
> Jan 21 12:15:17 ts Modules linked in:
> Jan 21 12:15:17 ts CPU:    0
> Jan 21 12:15:17 ts EIP:    0060:[<02114450>]    Not tainted
> Jan 21 12:15:17 ts EFLAGS: 00010246   (2.6.8-dev)
> Jan 21 12:15:17 ts EIP is at sys_mmap2+0x0/0xb0
> Jan 21 12:15:17 ts eax: 000000c0   ebx: 31524fc4   ecx: 00001000   edx: 004ec000
> Jan 21 12:15:17 ts esi: 00000032   edi: 00000000   ebp: 31524000   esp: 31524fc0
> Jan 21 12:15:17 ts ds: 007b   es: 007b   ss: 0068
> Jan 21 12:15:17 ts Process test (pid: 25, threadinfo=31524000 task=31f680c0)
> Jan 21 12:15:17 ts Stack: fffec200 01a2a000 00001000 00000003 00000032 00000000 00000000 000000c0
> Jan 21 12:15:17 ts        0000007b 0000007b 000000c0 08048541 00000073 00000282 bffffdcc 0000007b
> Jan 21 12:15:17 ts Call Trace:
> Jan 21 12:15:17 ts Code: 55 bd f7 ff ff ff 57 31 ff 56 53 83 ec 18 8b 44 24 38 89 c6
> 
>   Unable to handle kernel NULL pointer dereference at virtual address 000000c0
>   02114450
>   *pde = 00000000
>   Oops: 0002
>   CPU:    0
>   EIP:    0060:[<02114450>]    Not tainted
>   EFLAGS: 00010246   (2.6.8-dev)
>   eax: 000000c0   ebx: 31524fc4   ecx: 00001000   edx: 004ec000
>   esi: 00000032   edi: 00000000   ebp: 31524000   esp: 31524fc0
>   ds: 007b   es: 007b   ss: 0068
>   Stack: fffec200 01a2a000 00001000 00000003 00000032 00000000 00000000 000000c0
>          0000007b 0000007b 000000c0 08048541 00000073 00000282 bffffdcc 0000007b
>   Call Trace:
>   Code: 55 bd f7 ff ff ff 57 31 ff 56 53 83 ec 18 8b 44 24 38 89 c6
> 
> 
>  >>EIP; 02114450 <sys_mmap2+0/b0>   <=====
> 
>  >>ebx; 31524fc4 <pg0+2eff8fc4/fdac0000>
>  >>ebp; 31524000 <pg0+2eff8000/fdac0000>
>  >>esp; 31524fc0 <pg0+2eff8fc0/fdac0000>
> 
> Code;  02114450 <sys_mmap2+0/b0>
> 00000000 <_EIP>:
> Code;  02114450 <sys_mmap2+0/b0>   <=====
>     0:   55                        push   %ebp   <=====
> Code;  02114451 <sys_mmap2+1/b0>
>     1:   bd f7 ff ff ff            mov    $0xfffffff7,%ebp
> Code;  02114456 <sys_mmap2+6/b0>
>     6:   57                        push   %edi
> Code;  02114457 <sys_mmap2+7/b0>
>     7:   31 ff                     xor    %edi,%edi
> Code;  02114459 <sys_mmap2+9/b0>
>     9:   56                        push   %esi
> Code;  0211445a <sys_mmap2+a/b0>
>     a:   53                        push   %ebx
> Code;  0211445b <sys_mmap2+b/b0>
>     b:   83 ec 18                  sub    $0x18,%esp
> Code;  0211445e <sys_mmap2+e/b0>
>     e:   8b 44 24 38               mov    0x38(%esp,1),%eax
> Code;  02114462 <sys_mmap2+12/b0>
>    12:   89 c6                     mov    %eax,%esi
> 
> Why CPU is unable to handle paging request at 0x000000c0? There is no
> access to
> this addr in executing code! What has "push %ebp" to do with 0xc0?
> The answer is that %eax contains 0xc0 and the pages touched in user
> space contain 4092 zero bytes. And 0x0000 is the opcode for "addb %al,
> (%eax)".
> So we see the situation when CPU is executing code from user space
> pages though we are in kernel space already and data peeks from these
> addresses
> shows us the correct code (code in call trace is correct!).
> I checked it and if these pages are filled with some other values,
> not zeroes, than it's possible to make CPU execute this code.
> 
> And why this happens on sys_mmap2+0? Because entry code (system_call)
> is mapped at high addresses (> 0xffc00000) and is the same both in
> kernel 
> and user spaces, so entry.S code works ok.
> 
> So we found 2 ways of curing this bug:
> - make trampoline code to be non-GLOBAL
> - another observation was that PAE turned ON helps as well.
> 
> Hypothesis
> ~~~~~~~~~~
> I think that the problem is in code prefetch queue or somewhere in
> CPU. 
> It looks like CPU doesn't flush code prefetch queue after %cr3 reload
> (to kernel space) in entry.S and continues to execute prefetched code
> from user space pages.
> 
> Why making entry code non-global helps the problem?
> I think that if the code at %eip is flushed on %cr3 reload than the
> _whole_ prefetch queue is flushed and when entry code is global than
> it is 
> not flushed on %cr3 reload and prefetch queue (including call to
> flushed sys_mmap2 code) is not flushed.
> 
> Kirill
> 
> 
>> Hi Kirill,
>> 
>> I appreciate you bringing this issue up.  Could you please send us
>> the information on how you are able to reproduce this issue (System
>> config, Linux kernel version and any test case).  We would like to
>> root cause the failure here at Intel. 
>> 
>> Appreciate your help,
>> Thanks,
>> -rohit
>> 
>> Kirill Korotaev <> wrote on Wednesday, January 19, 2005 8:08 AM:
>> 
>> 
>>> Hello Linus,
>>> 
>>> Linus, Ingo, I've got one strange CPU bug leading to oopses, reboots
>>> and so on. This bug can be reproduced with a little bit modified 4gb
>>> split and is probably related to CPU speculative execution. I'll
>>> post more information about this bug later, but I would like to ask
>>> you for Intel guys contacts who maybe interested in this
>>> information, so I could CC them as well. 
>>> 
>>> Thank you,
>>> Kirill
>>> 



* Re: possible CPU bug and request for Intel contacts
       [not found] <01EF044AAEE12F4BAAD955CB7506494302DFE109@scsmsx401.amr.corp.intel.com>
@ 2005-01-21 12:46 ` Kirill Korotaev
  2005-01-22 10:03   ` Pavel Machek
  0 siblings, 1 reply; 11+ messages in thread
From: Kirill Korotaev @ 2005-01-21 12:46 UTC
  To: Seth, Rohit
  Cc: Linus Torvalds, Ingo Molnar, Saxena, Sunil, Pallipadi, Venkatesh,
	Andrey Savochkin, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7394 bytes --]

Hello,

Here are the details about CPU bug I mentioned in my previous post. 
Though it turned out later that it happens on P-III systems only I still 
hope it can be of interest.

Brief description
~~~~~~~~~~~~~~~~~

This issue was found by Vasily Averin (vvs@sw.ru) when playing
with uselib security exploit on kernels with my 4gb split patch.

This bug results in strange effects such as calltraces below,
reboots, impossible call traces and so on.

I started to investigate the bug, narrowed down the uselib exploit and
got a simple testcase for the bug, which is attached.
This testcase does a simple thing - it maps pages at low addresses
from 0x04000000 down to 0x00000000, page by page, and touches them
for write. Sometimes when running this exploit I got oopses,
sometimes reboots, and I found that this is sensitive to the page
addresses which the exploit maps.
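
The mapping loop described above can be sketched in C as follows. This
is NOT the attached testcase, only a minimal illustration of the idea;
the function names and the page-size constant are hypothetical. As a
deliberate safety choice it passes each address to mmap() as a hint
only (no MAP_FIXED), so it never replaces an existing mapping; pages
the kernel places elsewhere are unmapped and skipped.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define TEST_PAGE_BYTES 4096u   /* assumption: 4 KiB pages */

/* Number of pages in [lo, hi); both bounds must be page-aligned. */
size_t pages_in_range(uintptr_t lo, uintptr_t hi)
{
    assert(lo <= hi);
    assert(lo % TEST_PAGE_BYTES == 0 && hi % TEST_PAGE_BYTES == 0);
    return (hi - lo) / TEST_PAGE_BYTES;
}

/*
 * Walk downward from hi to lo one page at a time, request an anonymous
 * page at each address and write one byte to it. Returns the number of
 * pages that were mapped at the requested address and touched.
 */
size_t map_and_touch_downward(uintptr_t lo, uintptr_t hi)
{
    size_t touched = 0;
    size_t n = pages_in_range(lo, hi);

    for (size_t i = 0; i < n; i++) {
        uintptr_t want = hi - (uintptr_t)(i + 1) * TEST_PAGE_BYTES;
        void *p = mmap((void *)want, TEST_PAGE_BYTES,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            continue;                          /* address not available */
        if ((uintptr_t)p == want) {
            *(volatile unsigned char *)p = 0;  /* touch the page for write */
            touched++;
        } else {
            munmap(p, TEST_PAGE_BYTES);        /* hint was not honoured */
        }
    }
    return touched;
}
```

A call such as map_and_touch_downward(0x00000000, 0x04000000) walks the
same 16384 page addresses as the description; how many of them can
actually be mapped depends on the kernel and its settings.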

Why does it crash? I think this is because the virtual addresses of
kernel code and the mapped user space pages overlap. I was even able to
reboot the machine if the mapped user space pages were filled with some
appropriate asm code.

I found that Ingo Molnar's 4gb split is not vulnerable, and after
investigation I found that Ingo's patch doesn't map the kernel entry
code (trampoline) as _PAGE_GLOBAL. This was the answer.

I tested it on 4 different P-III machines - all of them were vulnerable.
But later I tested it on Celeron 2.4GHz and P4 systems - it doesn't
happen there, so this bug may be of low interest to Intel people :(

Below you can find the way how to reproduce the bug, call traces
and why I think it's a hardware bug.

How to reproduce a bug
~~~~~~~~~~~~~~~~~~~~~~

- take any FedoraCore kernel with Ingo Molnar 4gb split patch
   or mainstream kernel and apply 4GB split patch
- apply the attached diff-arch-4gb-global patch to make
   the trampoline code GLOBAL
- compile kernel with turned on 4gb split, i.e. CONFIG_X86_4GB=y
- boot the kernel and run the attached testcase:

# while true; do ./4gbtest; done;

or

# ./elflbl -l ./lib -a 0x4000000  (where elflbl is uselib exploit)

About once in every 4-5 test runs I get the following oops:

Jan 21 12:15:17 ts Unable to handle kernel NULL pointer dereference at virtual address 000000c0
Jan 21 12:15:17 ts  printing eip:
Jan 21 12:15:17 ts 02114450
Jan 21 12:15:17 ts *pde = 00000000
Jan 21 12:15:17 ts Oops: 0002
Jan 21 12:15:17 ts SMP
Jan 21 12:15:17 ts Modules linked in:
Jan 21 12:15:17 ts CPU:    0
Jan 21 12:15:17 ts EIP:    0060:[<02114450>]    Not tainted
Jan 21 12:15:17 ts EFLAGS: 00010246   (2.6.8-dev)
Jan 21 12:15:17 ts EIP is at sys_mmap2+0x0/0xb0
Jan 21 12:15:17 ts eax: 000000c0   ebx: 31524fc4   ecx: 00001000   edx: 004ec000
Jan 21 12:15:17 ts esi: 00000032   edi: 00000000   ebp: 31524000   esp: 31524fc0
Jan 21 12:15:17 ts ds: 007b   es: 007b   ss: 0068
Jan 21 12:15:17 ts Process test (pid: 25, threadinfo=31524000 task=31f680c0)
Jan 21 12:15:17 ts Stack: fffec200 01a2a000 00001000 00000003 00000032 00000000 00000000 000000c0
Jan 21 12:15:17 ts        0000007b 0000007b 000000c0 08048541 00000073 00000282 bffffdcc 0000007b
Jan 21 12:15:17 ts Call Trace:
Jan 21 12:15:17 ts Code: 55 bd f7 ff ff ff 57 31 ff 56 53 83 ec 18 8b 44 24 38 89 c6

  Unable to handle kernel NULL pointer dereference at virtual address 000000c0
  02114450
  *pde = 00000000
  Oops: 0002
  CPU:    0
  EIP:    0060:[<02114450>]    Not tainted
  EFLAGS: 00010246   (2.6.8-dev)
  eax: 000000c0   ebx: 31524fc4   ecx: 00001000   edx: 004ec000
  esi: 00000032   edi: 00000000   ebp: 31524000   esp: 31524fc0
  ds: 007b   es: 007b   ss: 0068
  Stack: fffec200 01a2a000 00001000 00000003 00000032 00000000 00000000 000000c0
         0000007b 0000007b 000000c0 08048541 00000073 00000282 bffffdcc 0000007b
  Call Trace:
  Code: 55 bd f7 ff ff ff 57 31 ff 56 53 83 ec 18 8b 44 24 38 89 c6


 >>EIP; 02114450 <sys_mmap2+0/b0>   <=====

 >>ebx; 31524fc4 <pg0+2eff8fc4/fdac0000>
 >>ebp; 31524000 <pg0+2eff8000/fdac0000>
 >>esp; 31524fc0 <pg0+2eff8fc0/fdac0000>

Code;  02114450 <sys_mmap2+0/b0>
00000000 <_EIP>:
Code;  02114450 <sys_mmap2+0/b0>   <=====
    0:   55                        push   %ebp   <=====
Code;  02114451 <sys_mmap2+1/b0>
    1:   bd f7 ff ff ff            mov    $0xfffffff7,%ebp
Code;  02114456 <sys_mmap2+6/b0>
    6:   57                        push   %edi
Code;  02114457 <sys_mmap2+7/b0>
    7:   31 ff                     xor    %edi,%edi
Code;  02114459 <sys_mmap2+9/b0>
    9:   56                        push   %esi
Code;  0211445a <sys_mmap2+a/b0>
    a:   53                        push   %ebx
Code;  0211445b <sys_mmap2+b/b0>
    b:   83 ec 18                  sub    $0x18,%esp
Code;  0211445e <sys_mmap2+e/b0>
    e:   8b 44 24 38               mov    0x38(%esp,1),%eax
Code;  02114462 <sys_mmap2+12/b0>
   12:   89 c6                     mov    %eax,%esi

Why is the CPU unable to handle a paging request at 0x000000c0? There is
no access to this address in the executing code! What has "push %ebp" to
do with 0xc0?
The answer is that %eax contains 0xc0 and the pages touched in user space
contain 4092 zero bytes. And the byte pair 0x00 0x00 is the opcode for
"addb %al, (%eax)", a one-byte write to the address held in %eax.
So we see a situation where the CPU is executing code from user space
pages though we are in kernel space already, while data reads from these
addresses show us the correct code (the code in the call trace is correct!).
I checked it, and if these pages are filled with some other values,
not zeroes, then it's possible to make the CPU execute that code.

And why does this happen at sys_mmap2+0? Because the entry code
(system_call) is mapped at high addresses (> 0xffc00000) and is the same
in both kernel and user spaces, so the entry.S code works ok. sys_mmap2 is
the first code executed at an address where the two mappings differ.

So we found 2 ways of curing this bug:
- make the trampoline code non-GLOBAL
- another observation was that turning PAE ON helps as well.

Hypothesis
~~~~~~~~~~
I think that the problem is in the code prefetch queue or somewhere else
in the CPU. It looks like the CPU doesn't flush the code prefetch queue
after the %cr3 reload (to kernel space) in entry.S and continues to
execute prefetched code from user space pages.

Why does making the entry code non-global help?
I think that if the mapping of the code at %eip is flushed on %cr3 reload,
then the _whole_ prefetch queue is flushed as well. When the entry code is
global, its mapping is not flushed on %cr3 reload, and so the prefetch
queue (including the call into sys_mmap2, whose mapping was flushed) is
not flushed either.

Kirill


> Hi Kirill,
> 
> I appreciate you bringing this issue up.  Could you please send us the
> information on how you are able to reproduce this issue (System config,
> Linux kernel version and any test case).  We would like to root cause
> the failure here at Intel.
> 
> Appreciate your help,
> Thanks, 
> -rohit
> 
> Kirill Korotaev <> wrote on Wednesday, January 19, 2005 8:08 AM:
> 
> 
>>Hello Linus,
>>
>>Linus, Ingo, I've got a strange CPU bug leading to oopses, reboots
>>and so on. This bug can be reproduced with a slightly modified 4gb
>>split and is probably related to CPU speculative execution. I'll post
>>more information about this bug later, but I would like to ask you
>>for contacts of Intel guys who may be interested in this information,
>>so I could CC them as well.
>>
>>Thank you,
>>Kirill
>>
> 
> 


[-- Attachment #2: 4gbtest.c --]
[-- Type: text/plain, Size: 2298 bytes --]

/*
 *	binfmt_elf uselib VMA insert race vulnerability
 *	v1.09
 *	tested only on 2.4.x and gcc 2.96
 *
 *	gcc -O2 -fomit-frame-pointer elflbl.c -o elflbl
 *
 *	Copyright (c) 2004  iSEC Security Research. All Rights Reserved.
 *
 *	THIS PROGRAM IS FOR EDUCATIONAL PURPOSES *ONLY* IT IS PROVIDED "AS IS"
 *	AND WITHOUT ANY WARRANTY. COPYING, PRINTING, DISTRIBUTION, MODIFICATION
 *	WITHOUT PERMISSION OF THE AUTHOR IS STRICTLY PROHIBITED.
 *
 */


#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <syscall.h>
#include <signal.h>

#include <sys/types.h>
#include <sys/mman.h>
#include <asm/page.h>

static int
	map_base=0x4000000,
	map_addr;

#define __NR_sys_mmap2		__NR_mmap2
inline _syscall6(int, sys_mmap2, int, a, int, b, int, c, int, d, int, e, int, f);

void fatal(const char *message)
{
	int sig = SIGKILL;

	if(!errno) {
		fprintf(stdout, "\n[-] FAILED: %s ", message);
	} else {
		fprintf(stdout, "\n[-] FAILED: %s (%s) ", message,
			(char*) (strerror(errno)) );
	}
	printf("\n");
	fflush(stdout);

	for(;;) kill(0, sig);
}

void mmap_one_page()
{
	int *r, i;

	map_addr -= PAGE_SIZE;

	r = (void*)sys_mmap2((unsigned)map_addr, PAGE_SIZE, PROT_READ|PROT_WRITE,
			     MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 0);
	if(MAP_FAILED == r) {
		fatal("mmap2 failed");
	}

	/* TOUCH THE PAGE! THIS IS IMPORTANT! */
	*r = map_addr;
//	for (i = 0; i < 1024; i++)
//		*(r+i) = 0x128b128b;
//	memset(r, 0x11, PAGE_SIZE);
}



//	map pages from map_base down to address 0, touching each one
void exploitme()
{
	int pages;

	map_addr = map_base;
	pages = map_addr/PAGE_SIZE;

//	map_addr = 0x2150000;
//	pages = 0x35;
	printf("mmaping 0x%08x downto 0x%08x...\n",
			map_addr, map_addr - pages * PAGE_SIZE);
	while(pages) {
		mmap_one_page();
		pages--;
	}
}

void usage(char *n)
{
	printf("\nUsage: %s\t\n", n);
	printf("\t\t-a alternate addr hex\n");
	printf("\n");
	_exit(1);
}


//	-a sets an alternate start address (hex), -h prints usage
int main(int ac, char **av)
{
	int r;

	while(ac) {
		r = getopt(ac, av, "a:h");
		if(r<0) break;

		switch(r) {

		case 'a' :
			if(1!=sscanf(optarg, "%x", &map_base))
				fatal("bad addr value");
			break;

		case 'h' :
		default:
			usage(av[0]);
			break;
		}
	}

	exploitme();

	return 0;
}

[-- Attachment #3: diff-arch-4gb-global --]
[-- Type: text/plain, Size: 937 bytes --]

--- ./arch/i386/kernel/entry_trampoline.c.4gbglb	2005-01-19 11:01:17.000000000 +0300
+++ ./arch/i386/kernel/entry_trampoline.c	2005-01-19 11:01:28.275121416 +0300
@@ -24,14 +24,16 @@ void __init init_entry_mappings(void)
 
 	void *tramp;
 	int p;
+	pgprot_t prot;
 
 	/*
 	 * We need a high IDT and GDT for the 4G/4G split:
 	 */
 	trap_init_virtual_IDT();
 
-	__set_fixmap(FIX_ENTRY_TRAMPOLINE_0, __pa((unsigned long)&__entry_tramp_start), PAGE_KERNEL_EXEC);
-	__set_fixmap(FIX_ENTRY_TRAMPOLINE_1, __pa((unsigned long)&__entry_tramp_start) + PAGE_SIZE, PAGE_KERNEL_EXEC);
+	prot = __pgprot(pgprot_val(PAGE_KERNEL_EXEC) | _PAGE_GLOBAL);
+	__set_fixmap(FIX_ENTRY_TRAMPOLINE_0, __pa((unsigned long)&__entry_tramp_start), prot);
+	__set_fixmap(FIX_ENTRY_TRAMPOLINE_1, __pa((unsigned long)&__entry_tramp_start) + PAGE_SIZE, prot);
 	tramp = (void *)fix_to_virt(FIX_ENTRY_TRAMPOLINE_0);
 
 	printk("mapped 4G/4G trampoline to %p.\n", tramp);


* possible CPU bug and request for Intel contacts
@ 2005-01-19 16:07 Kirill Korotaev
  0 siblings, 0 replies; 11+ messages in thread
From: Kirill Korotaev @ 2005-01-19 16:07 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar, linux-kernel

Hello Linus,

Linus, Ingo, I've got a strange CPU bug leading to oopses, reboots and
so on. This bug can be reproduced with a slightly modified 4gb split
and is probably related to CPU speculative execution. I'll post more
information about this bug later, but I would like to ask you for
contacts of Intel guys who may be interested in this information, so I
could CC them as well.

Thank you,
Kirill



end of thread, other threads:[~2005-02-14 16:56 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-01-26  1:38 possible CPU bug and request for Intel contacts Seth, Rohit
2005-02-13 20:10 ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2005-02-14 16:55 Pallipadi, Venkatesh
2005-01-25  7:22 Seth, Rohit
2005-01-25 14:12 ` Kirill Korotaev
2005-01-25  7:15 Seth, Rohit
2005-01-22  3:02 Seth, Rohit
2005-01-24  9:51 ` Kirill Korotaev
     [not found] <01EF044AAEE12F4BAAD955CB7506494302DFE109@scsmsx401.amr.corp.intel.com>
2005-01-21 12:46 ` Kirill Korotaev
2005-01-22 10:03   ` Pavel Machek
2005-01-19 16:07 Kirill Korotaev
