linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
@ 2005-04-05  6:55 Ingo Molnar
  2005-04-05  7:03 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05  6:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, stsp, Andrew Morton


the crashes below happen when PAGEALLOC is enabled. It's this 
instruction:

        movb OLDSS(%esp), %ah

OLDSS is 0x38, esp is f4f83fc8, OLDSS(%esp) is thus f4f84000, which 
correctly creates the PAGEALLOC pagefault. esp is off by 4 bytes?

it could be the ESP-16-bit-corruption patch causing this, or it could be 
an already existing latent bug getting triggered now: normally only iret 
accesses the OLDSS, and we fix any iret faults up, but now that we 
explicitly access %esp the esp bug shows up.

so it would be nice to understand why this triggers. It seems to be a 
sporadic event - first it hit hotplug, then input.agent. If i disable 
PAGEALLOC the system boots up fine. In any case, the ESP-corruption 
patch is not safe until this bug is understood, as it right now may read 
a random byte off the next page, and possibly doing bogus calls to the 
16-bit-fixup code.

	Ingo

-------------

BUG: Unable to handle kernel paging request at virtual address f4f84000
 printing eip:
c010287c
*pde = 00527067
*pte = 34f84000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU:    0
EIP:    0060:[<c010287c>]    Not tainted VLI
EFLAGS: 00010046   (2.6.12-rc2-RT-V0.7.43-09) 
EIP is at restore_all+0x4/0x18
eax: 00000206   ebx: 00000000   ecx: 00000000   edx: 00000001
esi: 00000000   edi: 009b63f9   ebp: f4f82000   esp: f4f83fc8
ds: 007b   es: 007b   ss: 0068   preempt: 00000001
Process 10-udev.hotplug (pid: 1264, threadinfo=f4f82000 task=f5034a10)
Stack: 00000000 bfa71dd0 009c0ffc 00000000 009b63f9 bfa71d44 000000c5 0000007b 
       0000007b ffffffef c01027ba 00000060 00000206 0000007b 
Call Trace:
 [<c01036ac>] show_stack+0x7a/0x90 (32)
 [<c0103835>] show_registers+0x15a/0x1d2 (56)
 [<c0103a30>] die+0xf4/0x17e (68)
 [<c010f444>] do_page_fault+0x3de/0x60a (212)
 [<c01032eb>] error_code+0x4f/0x54 (-8076)

---------------------

BUG: Unable to handle kernel paging request at virtual address f57bc000
 printing eip:
c010287c
*pde = 00529067
*pte = 357bc000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU:    0
EIP:    0060:[<c010287c>]    Not tainted VLI
EFLAGS: 00010046   (2.6.12-rc2-RT-V0.7.43-09) 
EIP is at restore_all+0x4/0x18
eax: 00000206   ebx: b7f11000   ecx: 00000000   edx: 00000000
esi: 080e4f28   edi: 00000000   ebp: f57ba000   esp: f57bbfc8
ds: 007b   es: 007b   ss: 0068   preempt: 00000001
Process input.agent (pid: 1131, threadinfo=f57ba000 task=f57b9a10)
Stack: b7f11000 00001000 009c0ffc 080e4f28 00000000 bfc112c0 0000005b 0000007b 
       0000007b ffffff00 c01027ba 00000060 00000206 0000007b 
Call Trace:
 [<c01036ac>] show_stack+0x7a/0x90 (32)
 [<c0103835>] show_registers+0x15a/0x1d2 (56)
 [<c0103a30>] die+0xf4/0x17e (68)
 [<c010f474>] do_page_fault+0x3de/0x60a (212)
 [<c01032eb>] error_code+0x4f/0x54 (-8076)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05  6:55 crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC Ingo Molnar
@ 2005-04-05  7:03 ` Andrew Morton
  2005-04-05  7:07   ` Ingo Molnar
  2005-04-05  7:16   ` Ingo Molnar
  2005-04-05  7:05 ` Ingo Molnar
  2005-04-05 19:11 ` Stas Sergeev
  2 siblings, 2 replies; 31+ messages in thread
From: Andrew Morton @ 2005-04-05  7:03 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, torvalds, stsp

Ingo Molnar <mingo@elte.hu> wrote:
>
> the crashes below happen when PAGEALLOC is enabled. It's this 
>  instruction:
> 
>          movb OLDSS(%esp), %ah
> 
>  OLDSS is 0x38, esp is f4f83fc8, OLDSS(%esp) is thus f4f84000, which 
>  correctly creates the PAGEALLOC pagefault. esp is off by 4 bytes?
> 
>  it could be the ESP-16-bit-corruption patch causing this,

Do you have nmis enabled?



From: Akinobu Mita <amgta@yacht.ocn.ne.jp>

With nmi_watchdog=1, I got random Oopses (Unable to handle kernel paging
request, not by the NMI oopser) from many processes.  It is not happend
with -rc1.

The following change fixes this problem.

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/arch/i386/kernel/entry.S |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/i386/kernel/entry.S~nmi_stack_correct-fix arch/i386/kernel/entry.S
--- 25/arch/i386/kernel/entry.S~nmi_stack_correct-fix	2005-04-05 00:02:48.000000000 -0700
+++ 25-akpm/arch/i386/kernel/entry.S	2005-04-05 00:02:48.000000000 -0700
@@ -550,7 +550,7 @@ nmi_stack_correct:
 	xorl %edx,%edx		# zero error code
 	movl %esp,%eax		# pt_regs pointer
 	call do_nmi
-	jmp restore_all
+	jmp restore_nocheck
 
 nmi_stack_fixup:
 	FIX_STACK(12,nmi_stack_correct, 1)
_


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05  6:55 crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC Ingo Molnar
  2005-04-05  7:03 ` Andrew Morton
@ 2005-04-05  7:05 ` Ingo Molnar
  2005-04-05 19:11 ` Stas Sergeev
  2 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05  7:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, stsp, Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> it could be the ESP-16-bit-corruption patch causing this, or it could 
> be an already existing latent bug getting triggered now: normally only 
> iret accesses the OLDSS, and we fix any iret faults up, but now that 
> we explicitly access %esp the esp bug shows up.

update: it's a latent condition being exposed by the ESP-corruption-fix 
patch. I reverted the ESP patch from 2.6.12-rc2, and applied the patch 
below, and it produces the same sort of crashes (attached further 
below).

	Ingo

--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -123,6 +123,7 @@ VM_MASK		= 0x00020000
 
 
 #define RESTORE_ALL	\
+	movl OLDSS(%esp), %eax; \
 	RESTORE_REGS	\
 	addl $4, %esp;	\
 1:	iret;		\

------------->
Freeing unused kernel memory: 188k freed
Unable to handle kernel paging request at virtual address f5bda000
 printing eip:
c0102874
*pde = 00517067
*pte = 35bda000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU:    0
EIP:    0060:[<c0102874>]    Not tainted VLI
EFLAGS: 00010046   (2.6.12-rc2) 
EIP is at restore_all+0x0/0x14
eax: 00000260   ebx: 00000000   ecx: 00000000   edx: 00000001
esi: 00000000   edi: 009c0ffc   ebp: f5bd8000   esp: f5bd9fc8
ds: 007b   es: 007b   ss: 0068
Process default.hotplug (pid: 1124, threadinfo=f5bd8000 task=f5999b10)
Stack: 00000000 00000003 00000000 00000000 009c0ffc bf96dd58 000000dd 0000007b 
       0000007b ffffff00 c01027ba 00000060 00000246 0000007b 
Call Trace:
 [<c01035dc>] show_stack+0x7a/0x90
 [<c0103758>] show_registers+0x14d/0x1c5
 [<c0103956>] die+0xf4/0x186
 [<c010f15d>] do_page_fault+0x430/0x649
 [<c0103283>] error_code+0x2b/0x30
Code: 45 08 81 01 75 7d 3d 21 01 00 00 0f 83 da 00 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 80 00 00 00 <8b> 44 24 38 5b 59 5a 5e 5f 5d 58 1f 07 83 c4 04 cf 8d 76 00 f6 
 <1>Unable to handle kernel paging request at virtual address f50a8000
 printing eip:
c0102874
*pde = 00515067
*pte = 350a8000
Oops: 0000 [#2]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU:    0
EIP:    0060:[<c0102874>]    Not tainted VLI
EFLAGS: 00010046   (2.6.12-rc2) 
EIP is at restore_all+0x0/0x14
eax: 00000260   ebx: 00000003   ecx: 00000000   edx: 00000001
esi: 00000008   edi: 009c0ffc   ebp: f50a6000   esp: f50a7fc8
ds: 007b   es: 007b   ss: 0068
Process default.hotplug (pid: 1181, threadinfo=f50a6000 task=f55f9b10)
Stack: 00000003 bff9500c bff94f7c 00000008 009c0ffc bff94f60 000000ae 0000007b 
       0000007b ffffff00 c01027ba 00000060 00000286 0000007b 
Call Trace:
 [<c01035dc>] show_stack+0x7a/0x90
 [<c0103758>] show_registers+0x14d/0x1c5
 [<c0103956>] die+0xf4/0x186
 [<c010f15d>] do_page_fault+0x430/0x649
 [<c0103283>] error_code+0x2b/0x30
Code: 45 08 81 01 75 7d 3d 21 01 00 00 0f 83 da 00 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 80 00 00 00 <8b> 44 24 38 5b 59 5a 5e 5f 5d 58 1f 07 83 c4 04 cf 8d 76 00 f6 
 <1>Unable to handle kernel paging request at virtual address f553e000
 printing eip:
c0102874
*pde = 00516067
*pte = 3553e000
Oops: 0000 [#3]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU:    0
EIP:    0060:[<c0102874>]    Not tainted VLI
EFLAGS: 00010046   (2.6.12-rc2) 
EIP is at restore_all+0x0/0x14
eax: 00000260   ebx: 00000000   ecx: 00000000   edx: 00000001
esi: 00000008   edi: 009c0ffc   ebp: f553c000   esp: f553dfc8
ds: 007b   es: 007b   ss: 0068
Process default.hotplug (pid: 1113, threadinfo=f553c000 task=f5a75b10)
Stack: 00000000 00000000 080e1f3c 00000008 009c0ffc bf856de0 000000af 0000007b 
       0000007b ffffffef c01027ba 00000060 00000246 0000007b 
Call Trace:
 [<c01035dc>] show_stack+0x7a/0x90
 [<c0103758>] show_registers+0x14d/0x1c5
 [<c0103956>] die+0xf4/0x186
 [<c010f15d>] do_page_fault+0x430/0x649
 [<c0103283>] error_code+0x2b/0x30
Code: 45 08 81 01 75 7d 3d 21 01 00 00 0f 83 da 00 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 80 00 00 00 <8b> 44 24 38 5b 59 5a 5e 5f 5d 58 1f 07 83 c4 04 cf 8d 76 00 f6 
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05  7:03 ` Andrew Morton
@ 2005-04-05  7:07   ` Ingo Molnar
  2005-04-05  7:16   ` Ingo Molnar
  1 sibling, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05  7:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, stsp


* Andrew Morton <akpm@osdl.org> wrote:

> Do you have nmis enabled?

yeah ...

>  	call do_nmi
> -	jmp restore_all
> +	jmp restore_nocheck

and i was about to take a closer look at the NMI path :-)

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05  7:03 ` Andrew Morton
  2005-04-05  7:07   ` Ingo Molnar
@ 2005-04-05  7:16   ` Ingo Molnar
  2005-04-05  7:29     ` Ingo Molnar
  1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05  7:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, stsp


* Andrew Morton <akpm@osdl.org> wrote:

> From: Akinobu Mita <amgta@yacht.ocn.ne.jp>
> 
> With nmi_watchdog=1, I got random Oopses (Unable to handle kernel 
> paging request, not by the NMI oopser) from many processes.  It is not 
> happend with -rc1.
> 
> The following change fixes this problem.

this fixed my crashes too.

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05  7:16   ` Ingo Molnar
@ 2005-04-05  7:29     ` Ingo Molnar
  2005-04-05  7:40       ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05  7:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, stsp


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > From: Akinobu Mita <amgta@yacht.ocn.ne.jp>
> > 
> > With nmi_watchdog=1, I got random Oopses (Unable to handle kernel 
> > paging request, not by the NMI oopser) from many processes.  It is not 
> > happend with -rc1.
> > 
> > The following change fixes this problem.
> 

> this fixed my crashes too.

spoke too soon - they still trigger even with the patch applied.

	Ingo

Unable to handle kernel paging request at virtual address f543c000
 printing eip:
c010287c
*pde = 00516067
*pte = 3543c000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU:    0
EIP:    0060:[<c010287c>]    Not tainted VLI
EFLAGS: 00010046   (2.6.12-rc2) 
EIP is at restore_all+0x4/0x18
eax: 00000202   ebx: 009b63f9   ecx: 00000000   edx: 00000001
esi: 009b63f9   edi: 00000001   ebp: f543a000   esp: f543bfc8
ds: 007b   es: 007b   ss: 0068
Process udevd (pid: 1044, threadinfo=f543a000 task=f5412b10)
Stack: 009b63f9 00000000 000001b6 009b63f9 00000001 bf8c503c 00000005 0000007b 
       0000007b ffffff00 c01027ba 00000060 00000202 0000007b 
Call Trace:
 [<c010368c>] show_stack+0x7a/0x90
 [<c0103808>] show_registers+0x14d/0x1c5
 [<c0103a06>] die+0xf4/0x186
 [<c010f2ad>] do_page_fault+0x430/0x649
 [<c01032eb>] error_code+0x4f/0x54
Code: 00 00 3d 21 01 00 00 0f 83 1a 01 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 c0 00 00 00 8b 44 24 30 <8a> 64 24 38 8a 44 24 2c 25 03 04 02 00 3d 03 04 00 00 74 0d 5b 
 printing eip:
*EIP:    0060:[<c010287c>]    Not tainted VLI
EFLAGS: 00010046   (2.6.12-rc2) 
EIP is at restore_all+0x4/0x18
eax: 00000206   ebx: 080d1502   ecx: 00000000   edx: 00000001
esi: 080ec058   edi: 080e8b6e   ebp: f5e38000   esp: f5e39fc8
ds: 007b   es: 007b   ss: 0068
Process rc.sysinit (pid: 3032, threadinfo=f5e38000 task=f5913b10)
Stack: 080d1502 bff73ff0 009c0ffc 080ec058 080e8b6e bff73fb4 000000c3 0000007b 
       0000007b ffffff00 c01027ba 00000060 00000206 0000007b 
Call Trace:
 [<c010368c>] show_stack+0x7a/0x90
 [<c0103808>] show_registers+0x14d/0x1c5
 [<c0103a06>] die+0xf4/0x186
 [<c010f2ad>] do_page_fault+0x430/0x649
 [<c01032eb>] error_code+0x4f/0x54
Code: 00 00 3d 21 01 00 00 0f 83 1a 01 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 c0 00 00 00 8b 44 24 30 <8a> 64 24 38 8a 44 24 2c 25 03 04 02 00 3d 03 04 00 00 74 0d 5b 
 <6>EXT3 FS on hdb1, internal journal
cdrom: open failed.
Adding 1052216k swap on /dev/hdb2.  Priority:-1 extents:1
eth1: link up, 100Mbps, full-duplex, lpa 0x45E1

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05  7:29     ` Ingo Molnar
@ 2005-04-05  7:40       ` Ingo Molnar
  2005-04-05  9:51         ` Mikael Pettersson
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05  7:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, stsp


* Ingo Molnar <mingo@elte.hu> wrote:

> > this fixed my crashes too.
> 
> spoke too soon - they still trigger even with the patch applied.

the patch below fixes the crash, it was related to CONFIG_PREEMPT.

	Ingo

--
fix entry.S crash with PREEMPT+PAGEALLOC

Signed-off-by: Ingo Molnar <mingo@elte.hu>

--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -165,9 +165,9 @@ ENTRY(resume_kernel)
 need_resched:
 	movl TI_flags(%ebp), %ecx	# need_resched set ?
 	testb $_TIF_NEED_RESCHED, %cl
-	jz restore_all
+	jz restore_nocheck
 	testl $IF_MASK,EFLAGS(%esp)     # interrupts off (exception path) ?
-	jz restore_all
+	jz restore_nocheck
 	call preempt_schedule_irq
 	jmp need_resched
 #endif

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05  7:40       ` Ingo Molnar
@ 2005-04-05  9:51         ` Mikael Pettersson
  2005-04-05 18:09           ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Mikael Pettersson @ 2005-04-05  9:51 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, torvalds, stsp

Ingo Molnar writes:
 > 
 > * Ingo Molnar <mingo@elte.hu> wrote:
 > 
 > > > this fixed my crashes too.
 > > 
 > > spoke too soon - they still trigger even with the patch applied.
 > 
 > the patch below fixes the crash, it was related to CONFIG_PREEMPT.
 > 
 > 	Ingo
 > 
 > --
 > fix entry.S crash with PREEMPT+PAGEALLOC
 > 
 > Signed-off-by: Ingo Molnar <mingo@elte.hu>
 > 
 > --- linux/arch/i386/kernel/entry.S.orig
 > +++ linux/arch/i386/kernel/entry.S
 > @@ -165,9 +165,9 @@ ENTRY(resume_kernel)
 >  need_resched:
 >  	movl TI_flags(%ebp), %ecx	# need_resched set ?
 >  	testb $_TIF_NEED_RESCHED, %cl
 > -	jz restore_all
 > +	jz restore_nocheck
 >  	testl $IF_MASK,EFLAGS(%esp)     # interrupts off (exception path) ?
 > -	jz restore_all
 > +	jz restore_nocheck
 >  	call preempt_schedule_irq
 >  	jmp need_resched
 >  #endif

Is this sufficient or do we also need the s/restore_all/restore_nocheck/
at around line 553 which was in the first posted patch?

/Mikael

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05  9:51         ` Mikael Pettersson
@ 2005-04-05 18:09           ` Ingo Molnar
  0 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05 18:09 UTC (permalink / raw)
  To: Mikael Pettersson; +Cc: Andrew Morton, linux-kernel, torvalds, stsp


* Mikael Pettersson <mikpe@csd.uu.se> wrote:

>  > -	jz restore_all
>  > +	jz restore_nocheck
>  >  	testl $IF_MASK,EFLAGS(%esp)     # interrupts off (exception path) ?
>  > -	jz restore_all
>  > +	jz restore_nocheck
>  >  	call preempt_schedule_irq
>  >  	jmp need_resched
>  >  #endif
> 
> Is this sufficient or do we also need the 
> s/restore_all/restore_nocheck/ at around line 553 which was in the 
> first posted patch?

all 3 restore_all->restore_nocheck changes are needed to make the bug 
not trigger. I dont think it's a real fix because whatever is being 
preempted involuntarily, it ought to have a proper kernel stack to begin 
with. I'll do some more debugging.

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05  6:55 crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC Ingo Molnar
  2005-04-05  7:03 ` Andrew Morton
  2005-04-05  7:05 ` Ingo Molnar
@ 2005-04-05 19:11 ` Stas Sergeev
  2005-04-05 19:19   ` Linus Torvalds
  2 siblings, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-05 19:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 518 bytes --]

Hi Ingo et all.

Ingo Molnar wrote:
> the crashes below happen when PAGEALLOC is enabled. It's this 
> instruction:
>  movb OLDSS(%esp), %ah
I am really sorry about that screwup :(
I can't do too much right now as I am
reading the mail in a batch mode, and
the next time I'll be reading it will
be 24 hours from now.

Attached is a quick fix, which I'll be
testing to death tomorrow at work.
I had DEBUG_PAGEALLOC disabled, so I
haven't noticed that stupid bug while
optimizing my checks...
Let me know how it goes. 


[-- Attachment #2: entry.S.diff --]
[-- Type: text/x-patch, Size: 596 bytes --]

--- entry.S.old	2005-04-05 20:08:07.000000000 +0400
+++ entry.S	2005-04-05 22:54:43.000000000 +0400
@@ -244,11 +244,12 @@
 	jne syscall_exit_work
 
 restore_all:
-	movl EFLAGS(%esp), %eax		# mix EFLAGS, SS and CS
-	movb OLDSS(%esp), %ah
-	movb CS(%esp), %al
-	andl $(VM_MASK | (4 << 8) | 3), %eax
-	cmpl $((4 << 8) | 3), %eax
+	testl $3, CS(%esp)
+	jz restore_nocheck		# return to kernel or v86
+	movl EFLAGS(%esp), %eax		# mix EFLAGS and SS
+	movb OLDSS(%esp), %al
+	andl $(VM_MASK | 4), %eax
+	cmpl $4, %eax
 	je ldt_ss			# returning to user-space with LDT SS
 restore_nocheck:
 	RESTORE_REGS


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05 19:11 ` Stas Sergeev
@ 2005-04-05 19:19   ` Linus Torvalds
  2005-04-05 19:41     ` Stas Sergeev
  0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2005-04-05 19:19 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: Ingo Molnar, linux-kernel, Andrew Morton



On Tue, 5 Apr 2005, Stas Sergeev wrote:
> 
> Attached is a quick fix, which I'll be
> testing to death tomorrow at work.

This one can pass through vm86 mode stuff without the high-16-bit fixup, 
as far as I can tell.

Also, I think your optimization to optimistically load SS is valid per se,
but we need to find out how some kernel thread gets zero stack associated
with it. They should all have the full "struct pt_reg" as far as I could
see, which means that we should never be _so_ high up the stack that 
SS/ESP would be on the next page.

So I'd actually prefer to get that mystery explained..

		Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05 19:19   ` Linus Torvalds
@ 2005-04-05 19:41     ` Stas Sergeev
  2005-04-05 19:53       ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-05 19:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec

Hi Linus,

Linus Torvalds wrote:
> This one can pass through vm86 mode stuff without the high-16-bit fixup,
> as far as I can tell.
Yes, but according to Petr, vm86 is not
affected by the bug at all. I did some
rough tests in the past that seem to
confirm that. Also, in any case, the
dependance of vm86 code on the higher word
of %esp would be very, very obscure.

> So I'd actually prefer to get that mystery explained..
IIRC if the interrupt doesn't do the CPL
switch, the interrupt gate doesn't save
the stack, and so there may not be the
full "struct pt_regs" when the kernel
thread is interrupted.
Does this sound any realistic?

So while it would be excellent to hear
that my patch was not guilty at all, I
think it is not the case. 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05 19:41     ` Stas Sergeev
@ 2005-04-05 19:53       ` Linus Torvalds
  2005-04-05 20:44         ` Ingo Molnar
  2005-04-06 15:44         ` Stas Sergeev
  0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2005-04-05 19:53 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec



On Tue, 5 Apr 2005, Stas Sergeev wrote:
> 
> > So I'd actually prefer to get that mystery explained..
> IIRC if the interrupt doesn't do the CPL
> switch, the interrupt gate doesn't save
> the stack, and so there may not be the
> full "struct pt_regs" when the kernel
> thread is interrupted.

Yes. But how do you have _such_ an empty stack when the interrupt comes
in? See what I mean? IOW, that requires that the kernel stack would have
only two words on it when the interrupt happens. How?

It may be a 4kB stack issue or something. Does this happen only with
CONFIG_4KSTACKS, and just after a stack switch to an irq stack, for 
example?

So I definitely think the "bug" is in your optimization, I just think it 
should be a valid optimization and we should just make sure our kernel 
stack is never _so_ empty that "struct pt_regs" is not safe to 
dereference.

		Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05 19:53       ` Linus Torvalds
@ 2005-04-05 20:44         ` Ingo Molnar
  2005-04-05 21:04           ` Linus Torvalds
  2005-04-06 15:44         ` Stas Sergeev
  1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05 20:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stas Sergeev, linux-kernel, Andrew Morton, Petr Vandrovec


* Linus Torvalds <torvalds@osdl.org> wrote:

> > > So I'd actually prefer to get that mystery explained..
> > IIRC if the interrupt doesn't do the CPL
> > switch, the interrupt gate doesn't save
> > the stack, and so there may not be the
> > full "struct pt_regs" when the kernel
> > thread is interrupted.
> 
> Yes. But how do you have _such_ an empty stack when the interrupt 
> comes in? See what I mean? IOW, that requires that the kernel stack 
> would have only two words on it when the interrupt happens. How?
> 
> It may be a 4kB stack issue or something. Does this happen only with 
> CONFIG_4KSTACKS, and just after a stack switch to an irq stack, for 
> example?

i didnt have 4K stacks set. In all crashes, esp had the same pattern:

 esi: 009b63f9   edi: 00000001   ebp: f543a000   esp: f543bfc8

i.e. esp & 0xfff was 0xfc8 - while i think it should normally be 0xfc4 
(page boundary minus size of pt_regs == 0 - 0x3c == 0xfc4). So somewhere 
we lost 4 bytes of esp? An extra popl, or an addl $4, %esp? But why dont 
we crash in that case - it ought to shift all the pt_regs structure by a 
word, making it a completely senseless return frame. Any task in such a 
situation ought to at least segfault. So if this is during thread 
startup, i dont know how it survives the first execution.

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05 20:44         ` Ingo Molnar
@ 2005-04-05 21:04           ` Linus Torvalds
  0 siblings, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2005-04-05 21:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Stas Sergeev, linux-kernel, Andrew Morton, Petr Vandrovec



On Tue, 5 Apr 2005, Ingo Molnar wrote:
> 
>  esi: 009b63f9   edi: 00000001   ebp: f543a000   esp: f543bfc8
> 
> i.e. esp & 0xfff was 0xfc8 - while i think it should normally be 0xfc4 
> (page boundary minus size of pt_regs == 0 - 0x3c == 0xfc4). So somewhere 
> we lost 4 bytes of esp? An extra popl, or an addl $4, %esp? But why dont 
> we crash in that case

Normally, esp will be immediately reset by any user-land stuff: we'll 
forget the old kernel stack entirely, and always re-load esp from the esp0 
thing in the TSS.

And as long as we stay in kernel land, we'll obviously never touch the
esp/xss fields of pt_regs (except in this special case of doing the
speculative load of xss), so...

		Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-05 19:53       ` Linus Torvalds
  2005-04-05 20:44         ` Ingo Molnar
@ 2005-04-06 15:44         ` Stas Sergeev
  2005-04-07  8:00           ` Ingo Molnar
  1 sibling, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-06 15:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec

Hi,

Linus Torvalds wrote:
> Yes. But how do you have _such_ an empty stack when the interrupt comes
> in? See what I mean?
Yes, I hope so.

> IOW, that requires that the kernel stack would have
> only two words on it when the interrupt happens. How?
Well, you can simply do something like this:

--- entry.S.old1	2005-04-05 22:54:43.000000000 +0400
+++ entry.S	2005-04-06 19:35:14.000000000 +0400
@@ -179,9 +179,9 @@
 ENTRY(sysenter_entry)
 	movl TSS_sysenter_esp0(%esp),%esp
 sysenter_past_esp:
-	sti
 	pushl $(__USER_DS)
 	pushl %ebp
+	sti
 	pushfl
 	pushl $(__USER_CS)
 	pushl $SYSENTER_RETURN

And this will "elimenate" the problem
(modulo NMI and there could be other places
too, but for me it elimenates it completely).
So I don't think this is something strange.

> So I definitely think the "bug" is in your optimization,
Yes, and I think the patch I posted, can
just work, or are there the problems with
the taken forward jump on a fast path?

> I just think it 
> should be a valid optimization
But it is totally bogus, why not should it
crash? It is probably even very good that
it does:)

> and we should just make sure our kernel 
> stack is never _so_ empty that "struct pt_regs" is not safe to 
> dereference.
I guess you'll just need to adjust the tss.esp0
then, but do you really want this? Accesing
the registers that are simply not there, doesn't
sound too good I think.
Or am I still missing your point? 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-06 15:44         ` Stas Sergeev
@ 2005-04-07  8:00           ` Ingo Molnar
  2005-04-07 11:10             ` Andrew Morton
  2005-04-07 16:11             ` Stas Sergeev
  0 siblings, 2 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-07  8:00 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: Linus Torvalds, linux-kernel, Andrew Morton, Petr Vandrovec


* Stas Sergeev <stsp@aknet.ru> wrote:

> ENTRY(sysenter_entry)
> 	movl TSS_sysenter_esp0(%esp),%esp
> sysenter_past_esp:
> -	sti
> 	pushl $(__USER_DS)
> 	pushl %ebp
> +	sti

ah, yes, sysenter. SYSENTER creates a degenerate 'small' stackframe with 
an esp0 that is missing the 5 entry words relative to the normal entry 
(int80 or irq) esp0 stackframe. These 5 words are: xss, esp, eflags, 
xcs, eip. The sysenter code sets them up manually.

now if an interrupt hits at this point, it will set up a 'same privilege 
level' stackframe, which has eip/xcs/eflags, i.e. no esp/xss. If upon 
irq-return we then examine the stack due to your patch, it will be an 
incorrect stackframe -> kaboom.

your patch doesnt remove the condition, it only removes the crash, 
because it adds the 2 words space that is needed - but the information 
relied on by your irq-return test is still bogus. At this point i'd 
suggest to remove the ESP patch altogether.

the correct solution is to always let the sysenter path set up a full 
and correct stackframe, before allowing preemption (see the attached 
patch). This was a nasty bug in the waiting. (I have not made this 
conditional on CONFIG_PREEMPT, to keep it simple and because the impact 
to irq latency is small and predictable. There's no runtime overhead.)

so i think with the help of Stas the mystery has been fully explained 
and solved. Linus?

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -179,12 +179,17 @@ need_resched:
 ENTRY(sysenter_entry)
 	movl TSS_sysenter_esp0(%esp),%esp
 sysenter_past_esp:
-	sti
+	#
+	# irqs are disabled: set up an entry stackframe without
+	# allowing irqs to potentially preempt us with an
+	# incomplete entry frame!
+	#
 	pushl $(__USER_DS)
 	pushl %ebp
 	pushfl
 	pushl $(__USER_CS)
 	pushl $SYSENTER_RETURN
+	sti
 
 /*
  * Load the potential sixth argument from user stack.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07  8:00           ` Ingo Molnar
@ 2005-04-07 11:10             ` Andrew Morton
  2005-04-07 14:47               ` Linus Torvalds
  2005-04-07 16:11             ` Stas Sergeev
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2005-04-07 11:10 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: stsp, torvalds, linux-kernel, VANDROVE

Ingo Molnar <mingo@elte.hu> wrote:
>
> --- linux/arch/i386/kernel/entry.S.orig
>  +++ linux/arch/i386/kernel/entry.S
>  @@ -179,12 +179,17 @@ need_resched:
>   ENTRY(sysenter_entry)
>   	movl TSS_sysenter_esp0(%esp),%esp
>   sysenter_past_esp:
>  -	sti
>  +	#
>  +	# irqs are disabled: set up an entry stackframe without
>  +	# allowing irqs to potentially preempt us with an
>  +	# incomplete entry frame!
>  +	#
>   	pushl $(__USER_DS)
>   	pushl %ebp
>   	pushfl
>   	pushl $(__USER_CS)
>   	pushl $SYSENTER_RETURN
>  +	sti
>   

Well that bites the big one.


Program received signal SIGTRAP, Trace/breakpoint trap.
0xc015c4ba in __find_get_block (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1371
1371            BUG_ON(irqs_disabled());
(gdb) t
[Current thread is 0 (Thread 32768)]
(gdb) bt
#0  0xc015c4ba in __find_get_block (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1371
#1  0xc015c57b in __getblk (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1487
#2  0xc0199978 in ext3_getblk (handle=0x0, inode=0xcff2a2b4, block=1, create=0, errp=0xcdf61dbc) at include/linux/buffer_head.h:260
#3  0xc019d5a6 in ext3_find_entry (dentry=0xcdaef8d4, res_dir=0xcdf61dfc) at fs/ext3/namei.c:868
#4  0xc019d9ca in ext3_lookup (dir=0x1, dentry=0xcdaef8d4, nd=0xcdf61f58) at fs/ext3/namei.c:988
#5  0xc0166b2d in real_lookup (parent=0xcdaef8d4, name=0xcdf61e8c, nd=0xcdf61f58) at fs/namei.c:416
#6  0xc0166df3 in do_lookup (nd=0xcdf61f58, name=0xcdf61e8c, path=0xcdf61e84) at fs/namei.c:670
#7  0xc0167667 in __link_path_walk (name=0xcdd24011 "", nd=0xcdf61f58) at fs/namei.c:833
#8  0xc0167c28 in link_path_walk (name=0xcdd24000 "/lib/libpcre.so.0", nd=0xcdf61f58) at fs/namei.c:911
#9  0xc0168056 in path_lookup (name=0xcdd24000 "/lib/libpcre.so.0", flags=0, nd=0xcdf61f58) at fs/namei.c:1026
#10 0xc016877d in open_namei (pathname=0xcdd24000 "/lib/libpcre.so.0", flag=1, mode=-1208907480, nd=0xcdf61f58) at fs/namei.c:1420
#11 0xc0159256 in filp_open (filename=0xcdd24000 "/lib/libpcre.so.0", flags=0, mode=-1208907480) at fs/open.c:764
#12 0xc01595e6 in sys_open (filename=0x1 <Address 0x1 out of bounds>, flags=1, mode=1) at fs/open.c:946
#13 0xc0102e91 in syscall_call () at arch/i386/kernel/semaphore.c:177
#14 0xb7f22c8a in ?? ()

Is someone doing a syscall with irqs disabled, or what?

This fixes it...



--- 25/arch/i386/kernel/entry.S~sysenter-irq-atomicity-fix-fix	2005-04-07 04:03:07.000000000 -0700
+++ 25-akpm/arch/i386/kernel/entry.S	2005-04-07 04:05:47.000000000 -0700
@@ -187,6 +187,7 @@ sysenter_past_esp:
 	pushl $(__USER_DS)
 	pushl %ebp
 	pushfl
+	orl   $0x200,0(%esp)
 	pushl $(__USER_CS)
 	pushl $SYSENTER_RETURN
 	sti
_


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07 11:10             ` Andrew Morton
@ 2005-04-07 14:47               ` Linus Torvalds
  2005-04-07 14:51                 ` Ingo Molnar
  2005-04-07 16:47                 ` Dave Jones
  0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2005-04-07 14:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, stsp, linux-kernel, VANDROVE



On Thu, 7 Apr 2005, Andrew Morton wrote:
>
> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > --- linux/arch/i386/kernel/entry.S.orig
> >  +++ linux/arch/i386/kernel/entry.S
> >  @@ -179,12 +179,17 @@ need_resched:
> >   ENTRY(sysenter_entry)
> >   	movl TSS_sysenter_esp0(%esp),%esp
> >   sysenter_past_esp:
> >  -	sti
> >  +	#
> >  +	# irqs are disabled: set up an entry stackframe without
> >  +	# allowing irqs to potentially preempt us with an
> >  +	# incomplete entry frame!
> >  +	#
> >   	pushl $(__USER_DS)
> >   	pushl %ebp
> >   	pushfl
> >   	pushl $(__USER_CS)
> >   	pushl $SYSENTER_RETURN
> >  +	sti
> >   
> 
> Well that bites the big one.
>> 
> Program received signal SIGTRAP, Trace/breakpoint trap.
> 0xc015c4ba in __find_get_block (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1371
> 1371            BUG_ON(irqs_disabled());

Yes. With the change in place "entry.S" will always save the wrogn EIP, so 
we'll return with interrupts disabled.

Your suggested patch is pretty horrid, though. It's sufficient to enable
interrupts after the two first pushes, since that has already now set up
enough of a kernel stack that any subsequent interrupt will always have at
least a "fake" SS/ESP (ie some values there, although not anything
relevant).

So the sysenter sequence might as well look like

	pushl $(__USER_DS)	
	pushl %ebp
	sti
	pushfl
	..

which actually does three protected pushes thanks to the one-instruction 
"interrupt shadow" after an sti.

Another alternative (and to some degree a maybe a better one) is to not
use "pushfl" at all in the sysenter sequence, but instead just push a
fixed default value. At that point, the "sti" can stay where it was.

After all, I very strongly suspect that we don't actually really _care_ if
eflags stays the same over a system call, and I could see that some
dynamic CPU's might prefer not having to push an eflags value that just
got changed by the "sti", so it _might_ save several cycles to avoid that
dependency, and also obviously avoid a subtle dependency on a sw level
that the previous patch clearly introduced.

Anybody willing to time it? ;)

		Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07 14:47               ` Linus Torvalds
@ 2005-04-07 14:51                 ` Ingo Molnar
  2005-04-07 16:47                 ` Dave Jones
  1 sibling, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-07 14:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, stsp, linux-kernel, VANDROVE


* Linus Torvalds <torvalds@osdl.org> wrote:

> > >   	pushfl

> After all, I very strongly suspect that we don't actually really 
> _care_ if eflags stays the same over a system call, and I could see 
> that some dynamic CPU's might prefer not having to push an eflags 
> value that just got changed by the "sti", so it _might_ save several 
> cycles to avoid that dependency, and also obviously avoid a subtle 
> dependency on a sw level that the previous patch clearly introduced.
> 
> Anybody willing to time it? ;)

i can tell you without any measurement that pushfl is slower by a couple 
of cycles than a simple pushl $0x00010046, on basically all x86 CPUs.  
And since we only support SYENTER from 32-bit mode and we dont guarantee 
flags to be saved, it isnt all that incorrect to do?

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07  8:00           ` Ingo Molnar
  2005-04-07 11:10             ` Andrew Morton
@ 2005-04-07 16:11             ` Stas Sergeev
  2005-04-07 16:35               ` Linus Torvalds
  1 sibling, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-07 16:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel, Andrew Morton, Petr Vandrovec

Hello.

Ingo Molnar wrote:
> now if an interrupt hits at this point, it will set up a 'same privilege 
> level' stackframe, which has eip/xcs/eflags, i.e. no esp/xss.
Yes, that's something I tried to say
when talking about the interrupt gates
(sorry if I wasn't clear).

> If upon 
> irq-return we then examine the stack due to your patch, it will be an 
> incorrect stackframe -> kaboom.
Yes, and that's where I think my patch is
at fault, i.e. it just shouldn't do this.
Another option is to make it always possible
to access OLDSS(%esp), but I think it is just
my patch have to be fixed to not do this at all.

> your patch doesnt remove the condition, it only removes the crash, 
No, that wasn't my point at all. That example
with moving "sti" was *only* to answer Linus's
question of where we have an empty stack.
And I guess I wasn't clear enough also here,
I was in a hurry :(
The real patch I meant, is this one:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0504.0/1287.html
This, I am sure, fixes a real bug. But there
can be the other approaches too.

> because it adds the 2 words space that is needed - but the information 
> relied on by your irq-return test is still bogus.
But as an example for demonstrating the problem,
I thought, it could do:)

> At this point i'd 
> suggest to remove the ESP patch altogether.
That's probably too heavy-handed. The fix is
really simple, we can either store the right
values by hands (as you propose), or fix my
patch (as I propose).

> the correct solution is to always let the sysenter path set up a full 
> and correct stackframe,
But what will this solve? If I understand you
correctly, you will push the %ss/%esp of the
user-space process that did sysenter. Then
you enable the interrupts and get pre-empted.
Now what we have: OLDSS/OLDESP are of the
user-space process, but the EFLAGS/CS/EIP
are of the kernel (where it got pre-empted
on a sysenter path). This will avoid the crash,
but the information on stack is still wrong.
Or am I missing something?

> before allowing preemption (see the attached 
> patch).
Hmm, will it work also for NMIs? You move
the sti, you can't be pre-empted, but the
NMI uses the restore_all too, no?
Also, it seems that Linus wants only the
*some* values available on stack, just to
make it not to crash. I think we can simply
adjust the tss.esp0 to point 8 bytes below
the real stack top, and so we are always safe.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07 16:11             ` Stas Sergeev
@ 2005-04-07 16:35               ` Linus Torvalds
  2005-04-07 16:46                 ` Stas Sergeev
  0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2005-04-07 16:35 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec



On Thu, 7 Apr 2005, Stas Sergeev wrote:
> 
> > because it adds the 2 words space that is needed - but the information 
> > relied on by your irq-return test is still bogus.
>
> But as an example for demonstrating the problem,
> I thought, it could do:)

Ingo: the information is bogus, but you're wrong: the code doesn't "rely"  
on it.

The fact is, bogus information is _fine_. That's what speculative work is 
all about: working with bogus information, with the assumption that some 
later test will ignore it if it's not relevant.

And the later test _will_ ignore it if it isn't relevant. Look for 
yourself:

        cmpl $((4 << 8) | 3), %eax
        je ldt_ss                       # returning to user-space with LDT SS

notice how the "cmpl" _only_ triggers if the old CS had the low three bits 
set and if EFLAGS_VM is clear. So if we return to kernel mode (or vm86) 
mode, we _know_ that the SS is bogus, but we don't care. We've tested for 
the proper thing, and we only do the special user-space LDT SS case if
 - the LDT bit was set in SS (possibly bogus)
AND
 - old CS was user space
AND
 - old eflags wasn't vm86 mode.

Ie the two second checks are what validates the first (possibly bogus) 
one.

So I really think that the _correct_ fix is literally to move the "cli" 
in the sysenter path down two lines. It doesn't just "hide" the bug, it 
literally fixes is.

		Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07 16:35               ` Linus Torvalds
@ 2005-04-07 16:46                 ` Stas Sergeev
  2005-04-07 16:55                   ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-07 16:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec

Hello.

Linus Torvalds wrote:
> So I really think that the _correct_ fix is literally to move the "cli" 
> in the sysenter path down two lines. It doesn't just "hide" the bug, it 
> literally fixes is.
OK, Linus, I find it amazing that you
like all my patches, even though I myself
think they are wrong:)
I still have those two questions however:
1. Does the "later sti" fixes the problem
also in case of an NMI? I mean, why can't
you just be NMI'ed before you did sti?
NMI uses the restore_all too IIRC.
2. How can one be sure there are no more
of the like places where the stack is left
empty?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07 14:47               ` Linus Torvalds
  2005-04-07 14:51                 ` Ingo Molnar
@ 2005-04-07 16:47                 ` Dave Jones
  2005-04-07 17:17                   ` Richard B. Johnson
  2005-04-07 17:23                   ` Linus Torvalds
  1 sibling, 2 replies; 31+ messages in thread
From: Dave Jones @ 2005-04-07 16:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Ingo Molnar, stsp, linux-kernel, VANDROVE

On Thu, Apr 07, 2005 at 07:47:41AM -0700, Linus Torvalds wrote:

 > So the sysenter sequence might as well look like
 > 
 > 	pushl $(__USER_DS)	
 > 	pushl %ebp
 > 	sti
 > 	pushfl
 > 	..
 > 
 > which actually does three protected pushes thanks to the one-instruction 
 > "interrupt shadow" after an sti.

Is this guaranteed on every x86 variant (or rather, every one
that has SEP). ?

		Dave


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07 16:46                 ` Stas Sergeev
@ 2005-04-07 16:55                   ` Linus Torvalds
  2005-04-07 18:10                     ` Stas Sergeev
  2005-04-10 13:20                     ` Stas Sergeev
  0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2005-04-07 16:55 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec



On Thu, 7 Apr 2005, Stas Sergeev wrote:
>
> 1. Does the "later sti" fixes the problem
> also in case of an NMI? I mean, why can't
> you just be NMI'ed before you did sti?
> NMI uses the restore_all too IIRC.

The NMI code had better be really careful, and yeah, I suspect it needs 
fixing.

> 2. How can one be sure there are no more
> of the like places where the stack is left
> empty?

That's a good argument, and may be the strongest reason for _not_ doing 
the speculation. However, I don't think it really can happen anywhere 
else. 

If people feel very nervous about the speculation, we can certainly just 
make it do the two branches. It _is_ a hotspot, though, and the 
optimization of the speculatiom may well be worth it.


		Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07 16:47                 ` Dave Jones
@ 2005-04-07 17:17                   ` Richard B. Johnson
  2005-04-07 17:23                   ` Linus Torvalds
  1 sibling, 0 replies; 31+ messages in thread
From: Richard B. Johnson @ 2005-04-07 17:17 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, stsp, linux-kernel, VANDROVE

On Thu, 7 Apr 2005, Dave Jones wrote:

> On Thu, Apr 07, 2005 at 07:47:41AM -0700, Linus Torvalds wrote:
>
> > So the sysenter sequence might as well look like
> >
> > 	pushl $(__USER_DS)
> > 	pushl %ebp
> > 	sti
> > 	pushfl
> > 	..
> >
> > which actually does three protected pushes thanks to the one-instruction
> > "interrupt shadow" after an sti.
>
> Is this guaranteed on every x86 variant (or rather, every one
> that has SEP). ?
>
> 		Dave

The i486 book says that if the next instruction lets the IF
flag remain enabled, then any interrupts pending occur after
the next instruction. If the next instruction is CLI, no
interrupts will occur. There is a note:

"In case of an NM1, trap, or fault following ST1, the interrupt
will be taken before executing the next sequential instruction
in the code." The use of "NM1" and "ST1" is (sic). I don't
know what NM1 is, it can't be NMI because, by definition NMI
is not maskable so CLI and STI could not affect this pin.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by Dictator Bush.
                  98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07 16:47                 ` Dave Jones
  2005-04-07 17:17                   ` Richard B. Johnson
@ 2005-04-07 17:23                   ` Linus Torvalds
  1 sibling, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2005-04-07 17:23 UTC (permalink / raw)
  To: Dave Jones; +Cc: Andrew Morton, Ingo Molnar, stsp, linux-kernel, VANDROVE



On Thu, 7 Apr 2005, Dave Jones wrote:
>
> On Thu, Apr 07, 2005 at 07:47:41AM -0700, Linus Torvalds wrote:
> 
>  > So the sysenter sequence might as well look like
>  > 
>  > 	pushl $(__USER_DS)	
>  > 	pushl %ebp
>  > 	sti
>  > 	pushfl
>  > 	..
>  > 
>  > which actually does three protected pushes thanks to the one-instruction 
>  > "interrupt shadow" after an sti.
> 
> Is this guaranteed on every x86 variant (or rather, every one
> that has SEP). ?

Well, since we only need two in this case, we don't care, but yes, it's 
supposed to be guaranteed by anything that calls itself an x86.

In fact, we _do_ depend on it in a few other sequences. Notably

	sti ; hlt

depends on the fact that an interrupt will always finish _after_ the hlt, 
and we'll never halt before the hlt (and then re-execute the hlt after the 
interrupt), and in

	sti ; iret

where we depend on the fact that we don't get recursive interrupt stacks 
(since we at that point have re-enabled the interrupt that happened).

Of course, if some future x86 decides that the interrupt shadow only
matters for special instructions (ie it's not so much a general interrupt
shadow as a "instruction combination"), I don't think Linux would care. I
really think there are only a very few valid sti-combinations, and I
suspect the above two are pretty much it.

(The other "magic" x86 behaviour is loading into the SS register, which
creates a one-cycle black hole after it. Linux shouldn't care, and in fact
nothing should care about it outside of old 16-bit non-protected-mode
programs, so I think that's another one that could be retired eventually)

		Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07 16:55                   ` Linus Torvalds
@ 2005-04-07 18:10                     ` Stas Sergeev
  2005-04-10 13:20                     ` Stas Sergeev
  1 sibling, 0 replies; 31+ messages in thread
From: Stas Sergeev @ 2005-04-07 18:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec

Hi Linus.

Linus Torvalds wrote:
> The NMI code had better be really careful, and yeah, I suspect it needs 
> fixing.
And since the NMI return will need a
ESP fixup too, it will require the
"branched" version of the restore_all
checks I suppose.

>> 2. How can one be sure there are no more
>> of the like places where the stack is left
>> empty?
> That's a good argument, and may be the strongest reason for _not_ doing 
> the speculation.
I haven't said that explicitly, only
implied:) My another idea was to adjust
the tss.esp0 to always point 8 bytes below
the real top of the stack, so that even in
case of an NMI we are still safe. And in
case of another such instance - too.
This may look more like a hack than shifting
the "sti", but it is probably more reliable.
At least something to consider as soon as we
are at it.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-07 16:55                   ` Linus Torvalds
  2005-04-07 18:10                     ` Stas Sergeev
@ 2005-04-10 13:20                     ` Stas Sergeev
  2005-04-10 22:32                       ` Andrew Morton
  1 sibling, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-10 13:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec

[-- Attachment #1: Type: text/plain, Size: 642 bytes --]

Hello.

Linus Torvalds wrote:
>> 2. How can one be sure there are no more
>> of the like places where the stack is left
>> empty?
> That's a good argument, and may be the strongest reason for _not_ doing 
> the speculation. However, I don't think it really can happen anywhere 
> else. 
OK, so how do you feel about the attached
patch? I understand that from some point
of view it may look like a hack, but at
the same time it:
1. Allows to preserve the valueable optimization
2. Works for NMIs
3. Doesn't care whether or not there are more
of the like instances where the stack is left
empty.
4. Seems to work for me without the crashes:) 


[-- Attachment #2: esp0.diff --]
[-- Type: text/x-patch, Size: 399 bytes --]

--- linux/arch/i386/kernel/process.c.old	2005-03-20 14:12:18.000000000 +0300
+++ linux/arch/i386/kernel/process.c	2005-04-10 16:54:39.000000000 +0400
@@ -394,7 +394,7 @@
 	childregs->esp = esp;
 
 	p->thread.esp = (unsigned long) childregs;
-	p->thread.esp0 = (unsigned long) (childregs+1);
+	p->thread.esp0 = (unsigned long) (childregs+1) - 8;
 
 	p->thread.eip = (unsigned long) ret_from_fork;
 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-10 13:20                     ` Stas Sergeev
@ 2005-04-10 22:32                       ` Andrew Morton
  2005-04-11 17:15                         ` Stas Sergeev
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2005-04-10 22:32 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: torvalds, mingo, linux-kernel, VANDROVE

Stas Sergeev <stsp@aknet.ru> wrote:
>
> -	p->thread.esp0 = (unsigned long) (childregs+1);
>  +	p->thread.esp0 = (unsigned long) (childregs+1) - 8;

This is utterly obscure - it needs a comment so that readers know what that
"- 8" is doing there.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
  2005-04-10 22:32                       ` Andrew Morton
@ 2005-04-11 17:15                         ` Stas Sergeev
  0 siblings, 0 replies; 31+ messages in thread
From: Stas Sergeev @ 2005-04-11 17:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, mingo, linux-kernel, VANDROVE

[-- Attachment #1: Type: text/plain, Size: 424 bytes --]

Hello.

Andrew Morton wrote:
> This is utterly obscure - it needs a comment so that readers know what that
> "- 8" is doing there.
Yes, that was only an RFC thing.
And now since there were not too much
of an FC, I prepared the "polished"
version. But apparently you already
released -mm3:)

Well, at least you can still apply the
comments if you feel like that. Here
they are.

Signed-off-by: Stas Sergeev <stsp@aknet.ru> 


[-- Attachment #2: esp0fix.diff --]
[-- Type: text/x-patch, Size: 1307 bytes --]

--- linux-2.6.12-rc2/arch/i386/kernel/entry.S	2005-04-06 09:34:35.000000000 +0400
+++ linux/arch/i386/kernel/entry.S	2005-04-11 10:49:28.000000000 +0400
@@ -245,6 +245,9 @@
 
 restore_all:
 	movl EFLAGS(%esp), %eax		# mix EFLAGS, SS and CS
+	# Warning: OLDSS(%esp) contains the wrong/random values if we
+	# are returning to the kernel.
+	# See comments in process.c:copy_thread() for details.
 	movb OLDSS(%esp), %ah
 	movb CS(%esp), %al
 	andl $(VM_MASK | (4 << 8) | 3), %eax
--- linux-2.6.12-rc2/arch/i386/kernel/process.c	2005-04-06 09:34:35.000000000 +0400
+++ linux/arch/i386/kernel/process.c	2005-04-11 10:30:39.000000000 +0400
@@ -394,6 +394,16 @@
 	childregs->esp = esp;
 
 	p->thread.esp = (unsigned long) childregs;
+	/*
+	 * The below -8 is to reserve 8 bytes on top of the ring0 stack.
+	 * This is necessary to guarantee that the entire "struct pt_regs"
+	 * is accessable even if the CPU haven't stored the SS/ESP registers
+	 * on the stack (interrupt gate does not save these registers
+	 * when switching to the same priv ring).
+	 * Therefore beware: accessing the xss/esp fields of the
+	 * "struct pt_regs" is possible, but they may contain the
+	 * completely wrong values.
+	 */
 	p->thread.esp0 = (unsigned long) (childregs+1) - 8;
 
 	p->thread.eip = (unsigned long) ret_from_fork;

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2005-04-11 17:15 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-04-05  6:55 crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC Ingo Molnar
2005-04-05  7:03 ` Andrew Morton
2005-04-05  7:07   ` Ingo Molnar
2005-04-05  7:16   ` Ingo Molnar
2005-04-05  7:29     ` Ingo Molnar
2005-04-05  7:40       ` Ingo Molnar
2005-04-05  9:51         ` Mikael Pettersson
2005-04-05 18:09           ` Ingo Molnar
2005-04-05  7:05 ` Ingo Molnar
2005-04-05 19:11 ` Stas Sergeev
2005-04-05 19:19   ` Linus Torvalds
2005-04-05 19:41     ` Stas Sergeev
2005-04-05 19:53       ` Linus Torvalds
2005-04-05 20:44         ` Ingo Molnar
2005-04-05 21:04           ` Linus Torvalds
2005-04-06 15:44         ` Stas Sergeev
2005-04-07  8:00           ` Ingo Molnar
2005-04-07 11:10             ` Andrew Morton
2005-04-07 14:47               ` Linus Torvalds
2005-04-07 14:51                 ` Ingo Molnar
2005-04-07 16:47                 ` Dave Jones
2005-04-07 17:17                   ` Richard B. Johnson
2005-04-07 17:23                   ` Linus Torvalds
2005-04-07 16:11             ` Stas Sergeev
2005-04-07 16:35               ` Linus Torvalds
2005-04-07 16:46                 ` Stas Sergeev
2005-04-07 16:55                   ` Linus Torvalds
2005-04-07 18:10                     ` Stas Sergeev
2005-04-10 13:20                     ` Stas Sergeev
2005-04-10 22:32                       ` Andrew Morton
2005-04-11 17:15                         ` Stas Sergeev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).