* crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
@ 2005-04-05 6:55 Ingo Molnar
2005-04-05 7:03 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05 6:55 UTC (permalink / raw)
To: linux-kernel; +Cc: Linus Torvalds, stsp, Andrew Morton
the crashes below happen when PAGEALLOC is enabled. It's this
instruction:
movb OLDSS(%esp), %ah
OLDSS is 0x38, esp is f4f83fc8, OLDSS(%esp) is thus f4f84000, which
correctly creates the PAGEALLOC pagefault. esp is off by 4 bytes?
it could be the ESP-16-bit-corruption patch causing this, or it could be
an already existing latent bug getting triggered now: normally only iret
accesses the OLDSS, and we fix any iret faults up, but now that we
explicitly access %esp the esp bug shows up.
so it would be nice to understand why this triggers. It seems to be a
sporadic event - first it hit hotplug, then input.agent. If i disable
PAGEALLOC the system boots up fine. In any case, the ESP-corruption
patch is not safe until this bug is understood, as it right now may read
a random byte off the next page, and possibly doing bogus calls to the
16-bit-fixup code.
Ingo
-------------
BUG: Unable to handle kernel paging request at virtual address f4f84000
printing eip:
c010287c
*pde = 00527067
*pte = 34f84000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c010287c>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2-RT-V0.7.43-09)
EIP is at restore_all+0x4/0x18
eax: 00000206 ebx: 00000000 ecx: 00000000 edx: 00000001
esi: 00000000 edi: 009b63f9 ebp: f4f82000 esp: f4f83fc8
ds: 007b es: 007b ss: 0068 preempt: 00000001
Process 10-udev.hotplug (pid: 1264, threadinfo=f4f82000 task=f5034a10)
Stack: 00000000 bfa71dd0 009c0ffc 00000000 009b63f9 bfa71d44 000000c5 0000007b
0000007b ffffffef c01027ba 00000060 00000206 0000007b
Call Trace:
[<c01036ac>] show_stack+0x7a/0x90 (32)
[<c0103835>] show_registers+0x15a/0x1d2 (56)
[<c0103a30>] die+0xf4/0x17e (68)
[<c010f444>] do_page_fault+0x3de/0x60a (212)
[<c01032eb>] error_code+0x4f/0x54 (-8076)
---------------------
BUG: Unable to handle kernel paging request at virtual address f57bc000
printing eip:
c010287c
*pde = 00529067
*pte = 357bc000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c010287c>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2-RT-V0.7.43-09)
EIP is at restore_all+0x4/0x18
eax: 00000206 ebx: b7f11000 ecx: 00000000 edx: 00000000
esi: 080e4f28 edi: 00000000 ebp: f57ba000 esp: f57bbfc8
ds: 007b es: 007b ss: 0068 preempt: 00000001
Process input.agent (pid: 1131, threadinfo=f57ba000 task=f57b9a10)
Stack: b7f11000 00001000 009c0ffc 080e4f28 00000000 bfc112c0 0000005b 0000007b
0000007b ffffff00 c01027ba 00000060 00000206 0000007b
Call Trace:
[<c01036ac>] show_stack+0x7a/0x90 (32)
[<c0103835>] show_registers+0x15a/0x1d2 (56)
[<c0103a30>] die+0xf4/0x17e (68)
[<c010f474>] do_page_fault+0x3de/0x60a (212)
[<c01032eb>] error_code+0x4f/0x54 (-8076)
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 6:55 crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC Ingo Molnar
@ 2005-04-05 7:03 ` Andrew Morton
2005-04-05 7:07 ` Ingo Molnar
2005-04-05 7:16 ` Ingo Molnar
2005-04-05 7:05 ` Ingo Molnar
2005-04-05 19:11 ` Stas Sergeev
2 siblings, 2 replies; 31+ messages in thread
From: Andrew Morton @ 2005-04-05 7:03 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, torvalds, stsp
Ingo Molnar <mingo@elte.hu> wrote:
>
> the crashes below happen when PAGEALLOC is enabled. It's this
> instruction:
>
> movb OLDSS(%esp), %ah
>
> OLDSS is 0x38, esp is f4f83fc8, OLDSS(%esp) is thus f4f84000, which
> correctly creates the PAGEALLOC pagefault. esp is off by 4 bytes?
>
> it could be the ESP-16-bit-corruption patch causing this,
Do you have nmis enabled?
From: Akinobu Mita <amgta@yacht.ocn.ne.jp>
With nmi_watchdog=1, I got random Oopses (Unable to handle kernel paging
request, not by the NMI oopser) from many processes. It is not happend
with -rc1.
The following change fixes this problem.
Signed-off-by: Andrew Morton <akpm@osdl.org>
---
25-akpm/arch/i386/kernel/entry.S | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)
diff -puN arch/i386/kernel/entry.S~nmi_stack_correct-fix arch/i386/kernel/entry.S
--- 25/arch/i386/kernel/entry.S~nmi_stack_correct-fix 2005-04-05 00:02:48.000000000 -0700
+++ 25-akpm/arch/i386/kernel/entry.S 2005-04-05 00:02:48.000000000 -0700
@@ -550,7 +550,7 @@ nmi_stack_correct:
xorl %edx,%edx # zero error code
movl %esp,%eax # pt_regs pointer
call do_nmi
- jmp restore_all
+ jmp restore_nocheck
nmi_stack_fixup:
FIX_STACK(12,nmi_stack_correct, 1)
_
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 6:55 crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC Ingo Molnar
2005-04-05 7:03 ` Andrew Morton
@ 2005-04-05 7:05 ` Ingo Molnar
2005-04-05 19:11 ` Stas Sergeev
2 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05 7:05 UTC (permalink / raw)
To: linux-kernel; +Cc: Linus Torvalds, stsp, Andrew Morton
* Ingo Molnar <mingo@elte.hu> wrote:
> it could be the ESP-16-bit-corruption patch causing this, or it could
> be an already existing latent bug getting triggered now: normally only
> iret accesses the OLDSS, and we fix any iret faults up, but now that
> we explicitly access %esp the esp bug shows up.
update: it's a latent condition being exposed by the ESP-corruption-fix
patch. I reverted the ESP patch from 2.6.12-rc2, and applied the patch
below, and it produces the same sort of crashes (attached further
below).
Ingo
--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -123,6 +123,7 @@ VM_MASK = 0x00020000
#define RESTORE_ALL \
+ movl OLDSS(%esp), %eax; \
RESTORE_REGS \
addl $4, %esp; \
1: iret; \
------------->
Freeing unused kernel memory: 188k freed
Unable to handle kernel paging request at virtual address f5bda000
printing eip:
c0102874
*pde = 00517067
*pte = 35bda000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c0102874>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2)
EIP is at restore_all+0x0/0x14
eax: 00000260 ebx: 00000000 ecx: 00000000 edx: 00000001
esi: 00000000 edi: 009c0ffc ebp: f5bd8000 esp: f5bd9fc8
ds: 007b es: 007b ss: 0068
Process default.hotplug (pid: 1124, threadinfo=f5bd8000 task=f5999b10)
Stack: 00000000 00000003 00000000 00000000 009c0ffc bf96dd58 000000dd 0000007b
0000007b ffffff00 c01027ba 00000060 00000246 0000007b
Call Trace:
[<c01035dc>] show_stack+0x7a/0x90
[<c0103758>] show_registers+0x14d/0x1c5
[<c0103956>] die+0xf4/0x186
[<c010f15d>] do_page_fault+0x430/0x649
[<c0103283>] error_code+0x2b/0x30
Code: 45 08 81 01 75 7d 3d 21 01 00 00 0f 83 da 00 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 80 00 00 00 <8b> 44 24 38 5b 59 5a 5e 5f 5d 58 1f 07 83 c4 04 cf 8d 76 00 f6
<1>Unable to handle kernel paging request at virtual address f50a8000
printing eip:
c0102874
*pde = 00515067
*pte = 350a8000
Oops: 0000 [#2]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c0102874>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2)
EIP is at restore_all+0x0/0x14
eax: 00000260 ebx: 00000003 ecx: 00000000 edx: 00000001
esi: 00000008 edi: 009c0ffc ebp: f50a6000 esp: f50a7fc8
ds: 007b es: 007b ss: 0068
Process default.hotplug (pid: 1181, threadinfo=f50a6000 task=f55f9b10)
Stack: 00000003 bff9500c bff94f7c 00000008 009c0ffc bff94f60 000000ae 0000007b
0000007b ffffff00 c01027ba 00000060 00000286 0000007b
Call Trace:
[<c01035dc>] show_stack+0x7a/0x90
[<c0103758>] show_registers+0x14d/0x1c5
[<c0103956>] die+0xf4/0x186
[<c010f15d>] do_page_fault+0x430/0x649
[<c0103283>] error_code+0x2b/0x30
Code: 45 08 81 01 75 7d 3d 21 01 00 00 0f 83 da 00 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 80 00 00 00 <8b> 44 24 38 5b 59 5a 5e 5f 5d 58 1f 07 83 c4 04 cf 8d 76 00 f6
<1>Unable to handle kernel paging request at virtual address f553e000
printing eip:
c0102874
*pde = 00516067
*pte = 3553e000
Oops: 0000 [#3]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c0102874>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2)
EIP is at restore_all+0x0/0x14
eax: 00000260 ebx: 00000000 ecx: 00000000 edx: 00000001
esi: 00000008 edi: 009c0ffc ebp: f553c000 esp: f553dfc8
ds: 007b es: 007b ss: 0068
Process default.hotplug (pid: 1113, threadinfo=f553c000 task=f5a75b10)
Stack: 00000000 00000000 080e1f3c 00000008 009c0ffc bf856de0 000000af 0000007b
0000007b ffffffef c01027ba 00000060 00000246 0000007b
Call Trace:
[<c01035dc>] show_stack+0x7a/0x90
[<c0103758>] show_registers+0x14d/0x1c5
[<c0103956>] die+0xf4/0x186
[<c010f15d>] do_page_fault+0x430/0x649
[<c0103283>] error_code+0x2b/0x30
Code: 45 08 81 01 75 7d 3d 21 01 00 00 0f 83 da 00 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 80 00 00 00 <8b> 44 24 38 5b 59 5a 5e 5f 5d 58 1f 07 83 c4 04 cf 8d 76 00 f6
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 7:03 ` Andrew Morton
@ 2005-04-05 7:07 ` Ingo Molnar
2005-04-05 7:16 ` Ingo Molnar
1 sibling, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05 7:07 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, torvalds, stsp
* Andrew Morton <akpm@osdl.org> wrote:
> Do you have nmis enabled?
yeah ...
> call do_nmi
> - jmp restore_all
> + jmp restore_nocheck
and i was about to take a closer look at the NMI path :-)
Ingo
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 7:03 ` Andrew Morton
2005-04-05 7:07 ` Ingo Molnar
@ 2005-04-05 7:16 ` Ingo Molnar
2005-04-05 7:29 ` Ingo Molnar
1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05 7:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, torvalds, stsp
* Andrew Morton <akpm@osdl.org> wrote:
> From: Akinobu Mita <amgta@yacht.ocn.ne.jp>
>
> With nmi_watchdog=1, I got random Oopses (Unable to handle kernel
> paging request, not by the NMI oopser) from many processes. It is not
> happend with -rc1.
>
> The following change fixes this problem.
this fixed my crashes too.
Ingo
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 7:16 ` Ingo Molnar
@ 2005-04-05 7:29 ` Ingo Molnar
2005-04-05 7:40 ` Ingo Molnar
0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05 7:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, torvalds, stsp
* Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andrew Morton <akpm@osdl.org> wrote:
>
> > From: Akinobu Mita <amgta@yacht.ocn.ne.jp>
> >
> > With nmi_watchdog=1, I got random Oopses (Unable to handle kernel
> > paging request, not by the NMI oopser) from many processes. It is not
> > happend with -rc1.
> >
> > The following change fixes this problem.
>
> this fixed my crashes too.
spoke too soon - they still trigger even with the patch applied.
Ingo
Unable to handle kernel paging request at virtual address f543c000
printing eip:
c010287c
*pde = 00516067
*pte = 3543c000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c010287c>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2)
EIP is at restore_all+0x4/0x18
eax: 00000202 ebx: 009b63f9 ecx: 00000000 edx: 00000001
esi: 009b63f9 edi: 00000001 ebp: f543a000 esp: f543bfc8
ds: 007b es: 007b ss: 0068
Process udevd (pid: 1044, threadinfo=f543a000 task=f5412b10)
Stack: 009b63f9 00000000 000001b6 009b63f9 00000001 bf8c503c 00000005 0000007b
0000007b ffffff00 c01027ba 00000060 00000202 0000007b
Call Trace:
[<c010368c>] show_stack+0x7a/0x90
[<c0103808>] show_registers+0x14d/0x1c5
[<c0103a06>] die+0xf4/0x186
[<c010f2ad>] do_page_fault+0x430/0x649
[<c01032eb>] error_code+0x4f/0x54
Code: 00 00 3d 21 01 00 00 0f 83 1a 01 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 c0 00 00 00 8b 44 24 30 <8a> 64 24 38 8a 44 24 2c 25 03 04 02 00 3d 03 04 00 00 74 0d 5b
printing eip:
*EIP: 0060:[<c010287c>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2)
EIP is at restore_all+0x4/0x18
eax: 00000206 ebx: 080d1502 ecx: 00000000 edx: 00000001
esi: 080ec058 edi: 080e8b6e ebp: f5e38000 esp: f5e39fc8
ds: 007b es: 007b ss: 0068
Process rc.sysinit (pid: 3032, threadinfo=f5e38000 task=f5913b10)
Stack: 080d1502 bff73ff0 009c0ffc 080ec058 080e8b6e bff73fb4 000000c3 0000007b
0000007b ffffff00 c01027ba 00000060 00000206 0000007b
Call Trace:
[<c010368c>] show_stack+0x7a/0x90
[<c0103808>] show_registers+0x14d/0x1c5
[<c0103a06>] die+0xf4/0x186
[<c010f2ad>] do_page_fault+0x430/0x649
[<c01032eb>] error_code+0x4f/0x54
Code: 00 00 3d 21 01 00 00 0f 83 1a 01 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 c0 00 00 00 8b 44 24 30 <8a> 64 24 38 8a 44 24 2c 25 03 04 02 00 3d 03 04 00 00 74 0d 5b
<6>EXT3 FS on hdb1, internal journal
cdrom: open failed.
Adding 1052216k swap on /dev/hdb2. Priority:-1 extents:1
eth1: link up, 100Mbps, full-duplex, lpa 0x45E1
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 7:29 ` Ingo Molnar
@ 2005-04-05 7:40 ` Ingo Molnar
2005-04-05 9:51 ` Mikael Pettersson
0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05 7:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, torvalds, stsp
* Ingo Molnar <mingo@elte.hu> wrote:
> > this fixed my crashes too.
>
> spoke too soon - they still trigger even with the patch applied.
the patch below fixes the crash, it was related to CONFIG_PREEMPT.
Ingo
--
fix entry.S crash with PREEMPT+PAGEALLOC
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -165,9 +165,9 @@ ENTRY(resume_kernel)
need_resched:
movl TI_flags(%ebp), %ecx # need_resched set ?
testb $_TIF_NEED_RESCHED, %cl
- jz restore_all
+ jz restore_nocheck
testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ?
- jz restore_all
+ jz restore_nocheck
call preempt_schedule_irq
jmp need_resched
#endif
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 7:40 ` Ingo Molnar
@ 2005-04-05 9:51 ` Mikael Pettersson
2005-04-05 18:09 ` Ingo Molnar
0 siblings, 1 reply; 31+ messages in thread
From: Mikael Pettersson @ 2005-04-05 9:51 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, torvalds, stsp
Ingo Molnar writes:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > > this fixed my crashes too.
> >
> > spoke too soon - they still trigger even with the patch applied.
>
> the patch below fixes the crash, it was related to CONFIG_PREEMPT.
>
> Ingo
>
> --
> fix entry.S crash with PREEMPT+PAGEALLOC
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
>
> --- linux/arch/i386/kernel/entry.S.orig
> +++ linux/arch/i386/kernel/entry.S
> @@ -165,9 +165,9 @@ ENTRY(resume_kernel)
> need_resched:
> movl TI_flags(%ebp), %ecx # need_resched set ?
> testb $_TIF_NEED_RESCHED, %cl
> - jz restore_all
> + jz restore_nocheck
> testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ?
> - jz restore_all
> + jz restore_nocheck
> call preempt_schedule_irq
> jmp need_resched
> #endif
Is this sufficient or do we also need the s/restore_all/restore_nocheck/
at around line 553 which was in the first posted patch?
/Mikael
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 9:51 ` Mikael Pettersson
@ 2005-04-05 18:09 ` Ingo Molnar
0 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05 18:09 UTC (permalink / raw)
To: Mikael Pettersson; +Cc: Andrew Morton, linux-kernel, torvalds, stsp
* Mikael Pettersson <mikpe@csd.uu.se> wrote:
> > - jz restore_all
> > + jz restore_nocheck
> > testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ?
> > - jz restore_all
> > + jz restore_nocheck
> > call preempt_schedule_irq
> > jmp need_resched
> > #endif
>
> Is this sufficient or do we also need the
> s/restore_all/restore_nocheck/ at around line 553 which was in the
> first posted patch?
all 3 restore_all->restore_nocheck changes are needed to make the bug
not trigger. I dont think it's a real fix because whatever is being
preempted involuntarily, it ought to have a proper kernel stack to begin
with. I'll do some more debugging.
Ingo
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 6:55 crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC Ingo Molnar
2005-04-05 7:03 ` Andrew Morton
2005-04-05 7:05 ` Ingo Molnar
@ 2005-04-05 19:11 ` Stas Sergeev
2005-04-05 19:19 ` Linus Torvalds
2 siblings, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-05 19:11 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Linus Torvalds, Andrew Morton
[-- Attachment #1: Type: text/plain, Size: 518 bytes --]
Hi Ingo et all.
Ingo Molnar wrote:
> the crashes below happen when PAGEALLOC is enabled. It's this
> instruction:
> movb OLDSS(%esp), %ah
I am really sorry about that screwup :(
I can't do too much right now as I am
reading the mail in a batch mode, and
the next time I'll be reading it will
be 24 hours from now.
Attached is a quick fix, which I'll be
testing to death tomorrow at work.
I had DEBUG_PAGEALLOC disabled, so I
haven't noticed that stupid bug while
optimizing my checks...
Let me know how it goes.
[-- Attachment #2: entry.S.diff --]
[-- Type: text/x-patch, Size: 596 bytes --]
--- entry.S.old 2005-04-05 20:08:07.000000000 +0400
+++ entry.S 2005-04-05 22:54:43.000000000 +0400
@@ -244,11 +244,12 @@
jne syscall_exit_work
restore_all:
- movl EFLAGS(%esp), %eax # mix EFLAGS, SS and CS
- movb OLDSS(%esp), %ah
- movb CS(%esp), %al
- andl $(VM_MASK | (4 << 8) | 3), %eax
- cmpl $((4 << 8) | 3), %eax
+ testl $3, CS(%esp)
+ jz restore_nocheck # return to kernel or v86
+ movl EFLAGS(%esp), %eax # mix EFLAGS and SS
+ movb OLDSS(%esp), %al
+ andl $(VM_MASK | 4), %eax
+ cmpl $4, %eax
je ldt_ss # returning to user-space with LDT SS
restore_nocheck:
RESTORE_REGS
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 19:11 ` Stas Sergeev
@ 2005-04-05 19:19 ` Linus Torvalds
2005-04-05 19:41 ` Stas Sergeev
0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2005-04-05 19:19 UTC (permalink / raw)
To: Stas Sergeev; +Cc: Ingo Molnar, linux-kernel, Andrew Morton
On Tue, 5 Apr 2005, Stas Sergeev wrote:
>
> Attached is a quick fix, which I'll be
> testing to death tomorrow at work.
This one can pass through vm86 mode stuff without the high-16-bit fixup,
as far as I can tell.
Also, I think your optimization to optimistically load SS is valid per se,
but we need to find out how some kernel thread gets zero stack associated
with it. They should all have the full "struct pt_reg" as far as I could
see, which means that we should never be _so_ high up the stack that
SS/ESP would be on the next page.
So I'd actually prefer to get that mystery explained..
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 19:19 ` Linus Torvalds
@ 2005-04-05 19:41 ` Stas Sergeev
2005-04-05 19:53 ` Linus Torvalds
0 siblings, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-05 19:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec
Hi Linus,
Linus Torvalds wrote:
> This one can pass through vm86 mode stuff without the high-16-bit fixup,
> as far as I can tell.
Yes, but according to Petr, vm86 is not
affected by the bug at all. I did some
rough tests in the past that seem to
confirm that. Also, in any case, the
dependance of vm86 code on the higher word
of %esp would be very, very obscure.
> So I'd actually prefer to get that mystery explained..
IIRC if the interrupt doesn't do the CPL
switch, the interrupt gate doesn't save
the stack, and so there may not be the
full "struct pt_regs" when the kernel
thread is interrupted.
Does this sound any realistic?
So while it would be excellent to hear
that my patch was not guilty at all, I
think it is not the case.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 19:41 ` Stas Sergeev
@ 2005-04-05 19:53 ` Linus Torvalds
2005-04-05 20:44 ` Ingo Molnar
2005-04-06 15:44 ` Stas Sergeev
0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2005-04-05 19:53 UTC (permalink / raw)
To: Stas Sergeev; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec
On Tue, 5 Apr 2005, Stas Sergeev wrote:
>
> > So I'd actually prefer to get that mystery explained..
> IIRC if the interrupt doesn't do the CPL
> switch, the interrupt gate doesn't save
> the stack, and so there may not be the
> full "struct pt_regs" when the kernel
> thread is interrupted.
Yes. But how do you have _such_ an empty stack when the interrupt comes
in? See what I mean? IOW, that requires that the kernel stack would have
only two words on it when the interrupt happens. How?
It may be a 4kB stack issue or something. Does this happen only with
CONFIG_4KSTACKS, and just after a stack switch to an irq stack, for
example?
So I definitely think the "bug" is in your optimization, I just think it
should be a valid optimization and we should just make sure our kernel
stack is never _so_ empty that "struct pt_regs" is not safe to
dereference.
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 19:53 ` Linus Torvalds
@ 2005-04-05 20:44 ` Ingo Molnar
2005-04-05 21:04 ` Linus Torvalds
2005-04-06 15:44 ` Stas Sergeev
1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2005-04-05 20:44 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Stas Sergeev, linux-kernel, Andrew Morton, Petr Vandrovec
* Linus Torvalds <torvalds@osdl.org> wrote:
> > > So I'd actually prefer to get that mystery explained..
> > IIRC if the interrupt doesn't do the CPL
> > switch, the interrupt gate doesn't save
> > the stack, and so there may not be the
> > full "struct pt_regs" when the kernel
> > thread is interrupted.
>
> Yes. But how do you have _such_ an empty stack when the interrupt
> comes in? See what I mean? IOW, that requires that the kernel stack
> would have only two words on it when the interrupt happens. How?
>
> It may be a 4kB stack issue or something. Does this happen only with
> CONFIG_4KSTACKS, and just after a stack switch to an irq stack, for
> example?
i didnt have 4K stacks set. In all crashes, esp had the same pattern:
esi: 009b63f9 edi: 00000001 ebp: f543a000 esp: f543bfc8
i.e. esp & 0xfff was 0xfc8 - while i think it should normally be 0xfc4
(page boundary minus size of pt_regs == 0 - 0x3c == 0xfc4). So somewhere
we lost 4 bytes of esp? An extra popl, or an addl $4, %esp? But why dont
we crash in that case - it ought to shift all the pt_regs structure by a
word, making it a completely senseless return frame. Any task in such a
situation ought to at least segfault. So if this is during thread
startup, i dont know how it survives the first execution.
Ingo
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 20:44 ` Ingo Molnar
@ 2005-04-05 21:04 ` Linus Torvalds
0 siblings, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2005-04-05 21:04 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Stas Sergeev, linux-kernel, Andrew Morton, Petr Vandrovec
On Tue, 5 Apr 2005, Ingo Molnar wrote:
>
> esi: 009b63f9 edi: 00000001 ebp: f543a000 esp: f543bfc8
>
> i.e. esp & 0xfff was 0xfc8 - while i think it should normally be 0xfc4
> (page boundary minus size of pt_regs == 0 - 0x3c == 0xfc4). So somewhere
> we lost 4 bytes of esp? An extra popl, or an addl $4, %esp? But why dont
> we crash in that case
Normally, esp will be immediately reset by any user-land stuff: we'll
forget the old kernel stack entirely, and always re-load esp from the esp0
thing in the TSS.
And as long as we stay in kernel land, we'll obviously never touch the
esp/xss fields of pt_regs (except in this special case of doing the
speculative load of xss), so...
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-05 19:53 ` Linus Torvalds
2005-04-05 20:44 ` Ingo Molnar
@ 2005-04-06 15:44 ` Stas Sergeev
2005-04-07 8:00 ` Ingo Molnar
1 sibling, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-06 15:44 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec
Hi,
Linus Torvalds wrote:
> Yes. But how do you have _such_ an empty stack when the interrupt comes
> in? See what I mean?
Yes, I hope so.
> IOW, that requires that the kernel stack would have
> only two words on it when the interrupt happens. How?
Well, you can simply do something like this:
--- entry.S.old1 2005-04-05 22:54:43.000000000 +0400
+++ entry.S 2005-04-06 19:35:14.000000000 +0400
@@ -179,9 +179,9 @@
ENTRY(sysenter_entry)
movl TSS_sysenter_esp0(%esp),%esp
sysenter_past_esp:
- sti
pushl $(__USER_DS)
pushl %ebp
+ sti
pushfl
pushl $(__USER_CS)
pushl $SYSENTER_RETURN
And this will "elimenate" the problem
(modulo NMI and there could be other places
too, but for me it elimenates it completely).
So I don't think this is something strange.
> So I definitely think the "bug" is in your optimization,
Yes, and I think the patch I posted, can
just work, or are there the problems with
the taken forward jump on a fast path?
> I just think it
> should be a valid optimization
But it is totally bogus, why not should it
crash? It is probably even very good that
it does:)
> and we should just make sure our kernel
> stack is never _so_ empty that "struct pt_regs" is not safe to
> dereference.
I guess you'll just need to adjust the tss.esp0
then, but do you really want this? Accesing
the registers that are simply not there, doesn't
sound too good I think.
Or am I still missing your point?
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-06 15:44 ` Stas Sergeev
@ 2005-04-07 8:00 ` Ingo Molnar
2005-04-07 11:10 ` Andrew Morton
2005-04-07 16:11 ` Stas Sergeev
0 siblings, 2 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-07 8:00 UTC (permalink / raw)
To: Stas Sergeev; +Cc: Linus Torvalds, linux-kernel, Andrew Morton, Petr Vandrovec
* Stas Sergeev <stsp@aknet.ru> wrote:
> ENTRY(sysenter_entry)
> movl TSS_sysenter_esp0(%esp),%esp
> sysenter_past_esp:
> - sti
> pushl $(__USER_DS)
> pushl %ebp
> + sti
ah, yes, sysenter. SYSENTER creates a degenerate 'small' stackframe with
an esp0 that is missing the 5 entry words relative to the normal entry
(int80 or irq) esp0 stackframe. These 5 words are: xss, esp, eflags,
xcs, eip. The sysenter code sets them up manually.
now if an interrupt hits at this point, it will set up a 'same privilege
level' stackframe, which has eip/xcs/eflags, i.e. no esp/xss. If upon
irq-return we then examine the stack due to your patch, it will be an
incorrect stackframe -> kaboom.
your patch doesnt remove the condition, it only removes the crash,
because it adds the 2 words space that is needed - but the information
relied on by your irq-return test is still bogus. At this point i'd
suggest to remove the ESP patch altogether.
the correct solution is to always let the sysenter path set up a full
and correct stackframe, before allowing preemption (see the attached
patch). This was a nasty bug in the waiting. (I have not made this
conditional on CONFIG_PREEMPT, to keep it simple and because the impact
to irq latency is small and predictable. There's no runtime overhead.)
so i think with the help of Stas the mystery has been fully explained
and solved. Linus?
Ingo
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -179,12 +179,17 @@ need_resched:
ENTRY(sysenter_entry)
movl TSS_sysenter_esp0(%esp),%esp
sysenter_past_esp:
- sti
+ #
+ # irqs are disabled: set up an entry stackframe without
+ # allowing irqs to potentially preempt us with an
+ # incomplete entry frame!
+ #
pushl $(__USER_DS)
pushl %ebp
pushfl
pushl $(__USER_CS)
pushl $SYSENTER_RETURN
+ sti
/*
* Load the potential sixth argument from user stack.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 8:00 ` Ingo Molnar
@ 2005-04-07 11:10 ` Andrew Morton
2005-04-07 14:47 ` Linus Torvalds
2005-04-07 16:11 ` Stas Sergeev
1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2005-04-07 11:10 UTC (permalink / raw)
To: Ingo Molnar; +Cc: stsp, torvalds, linux-kernel, VANDROVE
Ingo Molnar <mingo@elte.hu> wrote:
>
> --- linux/arch/i386/kernel/entry.S.orig
> +++ linux/arch/i386/kernel/entry.S
> @@ -179,12 +179,17 @@ need_resched:
> ENTRY(sysenter_entry)
> movl TSS_sysenter_esp0(%esp),%esp
> sysenter_past_esp:
> - sti
> + #
> + # irqs are disabled: set up an entry stackframe without
> + # allowing irqs to potentially preempt us with an
> + # incomplete entry frame!
> + #
> pushl $(__USER_DS)
> pushl %ebp
> pushfl
> pushl $(__USER_CS)
> pushl $SYSENTER_RETURN
> + sti
>
Well that bites the big one.
Program received signal SIGTRAP, Trace/breakpoint trap.
0xc015c4ba in __find_get_block (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1371
1371 BUG_ON(irqs_disabled());
(gdb) t
[Current thread is 0 (Thread 32768)]
(gdb) bt
#0 0xc015c4ba in __find_get_block (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1371
#1 0xc015c57b in __getblk (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1487
#2 0xc0199978 in ext3_getblk (handle=0x0, inode=0xcff2a2b4, block=1, create=0, errp=0xcdf61dbc) at include/linux/buffer_head.h:260
#3 0xc019d5a6 in ext3_find_entry (dentry=0xcdaef8d4, res_dir=0xcdf61dfc) at fs/ext3/namei.c:868
#4 0xc019d9ca in ext3_lookup (dir=0x1, dentry=0xcdaef8d4, nd=0xcdf61f58) at fs/ext3/namei.c:988
#5 0xc0166b2d in real_lookup (parent=0xcdaef8d4, name=0xcdf61e8c, nd=0xcdf61f58) at fs/namei.c:416
#6 0xc0166df3 in do_lookup (nd=0xcdf61f58, name=0xcdf61e8c, path=0xcdf61e84) at fs/namei.c:670
#7 0xc0167667 in __link_path_walk (name=0xcdd24011 "", nd=0xcdf61f58) at fs/namei.c:833
#8 0xc0167c28 in link_path_walk (name=0xcdd24000 "/lib/libpcre.so.0", nd=0xcdf61f58) at fs/namei.c:911
#9 0xc0168056 in path_lookup (name=0xcdd24000 "/lib/libpcre.so.0", flags=0, nd=0xcdf61f58) at fs/namei.c:1026
#10 0xc016877d in open_namei (pathname=0xcdd24000 "/lib/libpcre.so.0", flag=1, mode=-1208907480, nd=0xcdf61f58) at fs/namei.c:1420
#11 0xc0159256 in filp_open (filename=0xcdd24000 "/lib/libpcre.so.0", flags=0, mode=-1208907480) at fs/open.c:764
#12 0xc01595e6 in sys_open (filename=0x1 <Address 0x1 out of bounds>, flags=1, mode=1) at fs/open.c:946
#13 0xc0102e91 in syscall_call () at arch/i386/kernel/semaphore.c:177
#14 0xb7f22c8a in ?? ()
Is someone doing a syscall with irqs disabled, or what?
This fixes it...
--- 25/arch/i386/kernel/entry.S~sysenter-irq-atomicity-fix-fix 2005-04-07 04:03:07.000000000 -0700
+++ 25-akpm/arch/i386/kernel/entry.S 2005-04-07 04:05:47.000000000 -0700
@@ -187,6 +187,7 @@ sysenter_past_esp:
pushl $(__USER_DS)
pushl %ebp
pushfl
+ orl $0x200,0(%esp)
pushl $(__USER_CS)
pushl $SYSENTER_RETURN
sti
_
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 11:10 ` Andrew Morton
@ 2005-04-07 14:47 ` Linus Torvalds
2005-04-07 14:51 ` Ingo Molnar
2005-04-07 16:47 ` Dave Jones
0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2005-04-07 14:47 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ingo Molnar, stsp, linux-kernel, VANDROVE
On Thu, 7 Apr 2005, Andrew Morton wrote:
>
> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > --- linux/arch/i386/kernel/entry.S.orig
> > +++ linux/arch/i386/kernel/entry.S
> > @@ -179,12 +179,17 @@ need_resched:
> > ENTRY(sysenter_entry)
> > movl TSS_sysenter_esp0(%esp),%esp
> > sysenter_past_esp:
> > - sti
> > + #
> > + # irqs are disabled: set up an entry stackframe without
> > + # allowing irqs to potentially preempt us with an
> > + # incomplete entry frame!
> > + #
> > pushl $(__USER_DS)
> > pushl %ebp
> > pushfl
> > pushl $(__USER_CS)
> > pushl $SYSENTER_RETURN
> > + sti
> >
>
> Well that bites the big one.
>>
> Program received signal SIGTRAP, Trace/breakpoint trap.
> 0xc015c4ba in __find_get_block (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1371
> 1371 BUG_ON(irqs_disabled());
Yes. With the change in place "entry.S" will always save the wrogn EIP, so
we'll return with interrupts disabled.
Your suggested patch is pretty horrid, though. It's sufficient to enable
interrupts after the two first pushes, since that has already now set up
enough of a kernel stack that any subsequent interrupt will always have at
least a "fake" SS/ESP (ie some values there, although not anything
relevant).
So the sysenter sequence might as well look like
pushl $(__USER_DS)
pushl %ebp
sti
pushfl
..
which actually does three protected pushes thanks to the one-instruction
"interrupt shadow" after an sti.
Another alternative (and to some degree a maybe a better one) is to not
use "pushfl" at all in the sysenter sequence, but instead just push a
fixed default value. At that point, the "sti" can stay where it was.
After all, I very strongly suspect that we don't actually really _care_ if
eflags stays the same over a system call, and I could see that some
dynamic CPU's might prefer not having to push an eflags value that just
got changed by the "sti", so it _might_ save several cycles to avoid that
dependency, and also obviously avoid a subtle dependency on a sw level
that the previous patch clearly introduced.
Anybody willing to time it? ;)
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 14:47 ` Linus Torvalds
@ 2005-04-07 14:51 ` Ingo Molnar
2005-04-07 16:47 ` Dave Jones
1 sibling, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2005-04-07 14:51 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, stsp, linux-kernel, VANDROVE
* Linus Torvalds <torvalds@osdl.org> wrote:
> > > pushfl
> After all, I very strongly suspect that we don't actually really
> _care_ if eflags stays the same over a system call, and I could see
> that some dynamic CPU's might prefer not having to push an eflags
> value that just got changed by the "sti", so it _might_ save several
> cycles to avoid that dependency, and also obviously avoid a subtle
> dependency on a sw level that the previous patch clearly introduced.
>
> Anybody willing to time it? ;)
i can tell you without any measurement that pushfl is slower by a couple
of cycles than a simple pushl $0x00010046, on basically all x86 CPUs.
And since we only support SYENTER from 32-bit mode and we dont guarantee
flags to be saved, it isnt all that incorrect to do?
Ingo
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 8:00 ` Ingo Molnar
2005-04-07 11:10 ` Andrew Morton
@ 2005-04-07 16:11 ` Stas Sergeev
2005-04-07 16:35 ` Linus Torvalds
1 sibling, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-07 16:11 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel, Andrew Morton, Petr Vandrovec
Hello.
Ingo Molnar wrote:
> now if an interrupt hits at this point, it will set up a 'same privilege
> level' stackframe, which has eip/xcs/eflags, i.e. no esp/xss.
Yes, that's something I tried to say
when talking about the interrupt gates
(sorry if I wasn't clear).
> If upon
> irq-return we then examine the stack due to your patch, it will be an
> incorrect stackframe -> kaboom.
Yes, and that's where I think my patch is
at fault, i.e. it just shouldn't do this.
Another option is to make it always possible
to access OLDSS(%esp), but I think it is just
my patch have to be fixed to not do this at all.
> your patch doesnt remove the condition, it only removes the crash,
No, that wasn't my point at all. That example
with moving "sti" was *only* to answer Linus's
question of where we have an empty stack.
And I guess I wasn't clear enough also here,
I was in a hurry :(
The real patch I meant, is this one:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0504.0/1287.html
This, I am sure, fixes a real bug. But there
can be the other approaches too.
> because it adds the 2 words space that is needed - but the information
> relied on by your irq-return test is still bogus.
But as an example for demonstrating the problem,
I thought, it could do:)
> At this point i'd
> suggest to remove the ESP patch altogether.
That's probably too heavy-handed. The fix is
really simple, we can either store the right
values by hands (as you propose), or fix my
patch (as I propose).
> the correct solution is to always let the sysenter path set up a full
> and correct stackframe,
But what will this solve? If I understand you
correctly, you will push the %ss/%esp of the
user-space process that did sysenter. Then
you enable the interrupts and get pre-empted.
Now what we have: OLDSS/OLDESP are of the
user-space process, but the EFLAGS/CS/EIP
are of the kernel (where it got pre-empted
on a sysenter path). This will avoid the crash,
but the information on stack is still wrong.
Or am I missing something?
> before allowing preemption (see the attached
> patch).
Hmm, will it work also for NMIs? You move
the sti, you can't be pre-empted, but the
NMI uses the restore_all too, no?
Also, it seems that Linus wants only the
*some* values available on stack, just to
make it not to crash. I think we can simply
adjust the tss.esp0 to point 8 bytes below
the real stack top, and so we are always safe.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 16:11 ` Stas Sergeev
@ 2005-04-07 16:35 ` Linus Torvalds
2005-04-07 16:46 ` Stas Sergeev
0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2005-04-07 16:35 UTC (permalink / raw)
To: Stas Sergeev; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec
On Thu, 7 Apr 2005, Stas Sergeev wrote:
>
> > because it adds the 2 words space that is needed - but the information
> > relied on by your irq-return test is still bogus.
>
> But as an example for demonstrating the problem,
> I thought, it could do:)
Ingo: the information is bogus, but you're wrong: the code doesn't "rely"
on it.
The fact is, bogus information is _fine_. That's what speculative work is
all about: working with bogus information, with the assumption that some
later test will ignore it if it's not relevant.
And the later test _will_ ignore it if it isn't relevant. Look for
yourself:
cmpl $((4 << 8) | 3), %eax
je ldt_ss # returning to user-space with LDT SS
notice how the "cmpl" _only_ triggers if the old CS had the low three bits
set and if EFLAGS_VM is clear. So if we return to kernel mode (or vm86)
mode, we _know_ that the SS is bogus, but we don't care. We've tested for
the proper thing, and we only do the special user-space LDT SS case if
- the LDT bit was set in SS (possibly bogus)
AND
- old CS was user space
AND
- old eflags wasn't vm86 mode.
Ie the two second checks are what validates the first (possibly bogus)
one.
So I really think that the _correct_ fix is literally to move the "cli"
in the sysenter path down two lines. It doesn't just "hide" the bug, it
literally fixes is.
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 16:35 ` Linus Torvalds
@ 2005-04-07 16:46 ` Stas Sergeev
2005-04-07 16:55 ` Linus Torvalds
0 siblings, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-07 16:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec
Hello.
Linus Torvalds wrote:
> So I really think that the _correct_ fix is literally to move the "cli"
> in the sysenter path down two lines. It doesn't just "hide" the bug, it
> literally fixes is.
OK, Linus, I find it amazing that you
like all my patches, even though I myself
think they are wrong:)
I still have those two questions however:
1. Does the "later sti" fixes the problem
also in case of an NMI? I mean, why can't
you just be NMI'ed before you did sti?
NMI uses the restore_all too IIRC.
2. How can one be sure there are no more
of the like places where the stack is left
empty?
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 14:47 ` Linus Torvalds
2005-04-07 14:51 ` Ingo Molnar
@ 2005-04-07 16:47 ` Dave Jones
2005-04-07 17:17 ` Richard B. Johnson
2005-04-07 17:23 ` Linus Torvalds
1 sibling, 2 replies; 31+ messages in thread
From: Dave Jones @ 2005-04-07 16:47 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Ingo Molnar, stsp, linux-kernel, VANDROVE
On Thu, Apr 07, 2005 at 07:47:41AM -0700, Linus Torvalds wrote:
> So the sysenter sequence might as well look like
>
> pushl $(__USER_DS)
> pushl %ebp
> sti
> pushfl
> ..
>
> which actually does three protected pushes thanks to the one-instruction
> "interrupt shadow" after an sti.
Is this guaranteed on every x86 variant (or rather, every one
that has SEP). ?
Dave
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 16:46 ` Stas Sergeev
@ 2005-04-07 16:55 ` Linus Torvalds
2005-04-07 18:10 ` Stas Sergeev
2005-04-10 13:20 ` Stas Sergeev
0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2005-04-07 16:55 UTC (permalink / raw)
To: Stas Sergeev; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec
On Thu, 7 Apr 2005, Stas Sergeev wrote:
>
> 1. Does the "later sti" fixes the problem
> also in case of an NMI? I mean, why can't
> you just be NMI'ed before you did sti?
> NMI uses the restore_all too IIRC.
The NMI code had better be really careful, and yeah, I suspect it needs
fixing.
> 2. How can one be sure there are no more
> of the like places where the stack is left
> empty?
That's a good argument, and may be the strongest reason for _not_ doing
the speculation. However, I don't think it really can happen anywhere
else.
If people feel very nervous about the speculation, we can certainly just
make it do the two branches. It _is_ a hotspot, though, and the
optimization of the speculatiom may well be worth it.
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 16:47 ` Dave Jones
@ 2005-04-07 17:17 ` Richard B. Johnson
2005-04-07 17:23 ` Linus Torvalds
1 sibling, 0 replies; 31+ messages in thread
From: Richard B. Johnson @ 2005-04-07 17:17 UTC (permalink / raw)
To: Dave Jones
Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, stsp, linux-kernel, VANDROVE
On Thu, 7 Apr 2005, Dave Jones wrote:
> On Thu, Apr 07, 2005 at 07:47:41AM -0700, Linus Torvalds wrote:
>
> > So the sysenter sequence might as well look like
> >
> > pushl $(__USER_DS)
> > pushl %ebp
> > sti
> > pushfl
> > ..
> >
> > which actually does three protected pushes thanks to the one-instruction
> > "interrupt shadow" after an sti.
>
> Is this guaranteed on every x86 variant (or rather, every one
> that has SEP). ?
>
> Dave
The i486 book says that if the next instruction lets the IF
flag remain enabled, then any interrupts pending occur after
the next instruction. If the next instruction is CLI, no
interrupts will occur. There is a note:
"In case of an NM1, trap, or fault following ST1, the interrupt
will be taken before executing the next sequential instruction
in the code." The use of "NM1" and "ST1" is (sic). I don't
know what NM1 is, it can't be NMI because, by definition NMI
is not maskable so CLI and STI could not affect this pin.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 16:47 ` Dave Jones
2005-04-07 17:17 ` Richard B. Johnson
@ 2005-04-07 17:23 ` Linus Torvalds
1 sibling, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2005-04-07 17:23 UTC (permalink / raw)
To: Dave Jones; +Cc: Andrew Morton, Ingo Molnar, stsp, linux-kernel, VANDROVE
On Thu, 7 Apr 2005, Dave Jones wrote:
>
> On Thu, Apr 07, 2005 at 07:47:41AM -0700, Linus Torvalds wrote:
>
> > So the sysenter sequence might as well look like
> >
> > pushl $(__USER_DS)
> > pushl %ebp
> > sti
> > pushfl
> > ..
> >
> > which actually does three protected pushes thanks to the one-instruction
> > "interrupt shadow" after an sti.
>
> Is this guaranteed on every x86 variant (or rather, every one
> that has SEP). ?
Well, since we only need two in this case, we don't care, but yes, it's
supposed to be guaranteed by anything that calls itself an x86.
In fact, we _do_ depend on it in a few other sequences. Notably
sti ; hlt
depends on the fact that an interrupt will always finish _after_ the hlt,
and we'll never halt before the hlt (and then re-execute the hlt after the
interrupt), and in
sti ; iret
where we depend on the fact that we don't get recursive interrupt stacks
(since we at that point have re-enabled the interrupt that happened).
Of course, if some future x86 decides that the interrupt shadow only
matters for special instructions (ie it's not so much a general interrupt
shadow as a "instruction combination"), I don't think Linux would care. I
really think there are only a very few valid sti-combinations, and I
suspect the above two are pretty much it.
(The other "magic" x86 behaviour is loading into the SS register, which
creates a one-cycle black hole after it. Linux shouldn't care, and in fact
nothing should care about it outside of old 16-bit non-protected-mode
programs, so I think that's another one that could be retired eventually)
Linus
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 16:55 ` Linus Torvalds
@ 2005-04-07 18:10 ` Stas Sergeev
2005-04-10 13:20 ` Stas Sergeev
1 sibling, 0 replies; 31+ messages in thread
From: Stas Sergeev @ 2005-04-07 18:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec
Hi Linus.
Linus Torvalds wrote:
> The NMI code had better be really careful, and yeah, I suspect it needs
> fixing.
And since the NMI return will need a
ESP fixup too, it will require the
"branched" version of the restore_all
checks I suppose.
>> 2. How can one be sure there are no more
>> of the like places where the stack is left
>> empty?
> That's a good argument, and may be the strongest reason for _not_ doing
> the speculation.
I haven't said that explicitly, only
implied:) My another idea was to adjust
the tss.esp0 to always point 8 bytes below
the real top of the stack, so that even in
case of an NMI we are still safe. And in
case of another such instance - too.
This may look more like a hack than shifting
the "sti", but it is probably more reliable.
At least something to consider as soon as we
are at it.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-07 16:55 ` Linus Torvalds
2005-04-07 18:10 ` Stas Sergeev
@ 2005-04-10 13:20 ` Stas Sergeev
2005-04-10 22:32 ` Andrew Morton
1 sibling, 1 reply; 31+ messages in thread
From: Stas Sergeev @ 2005-04-10 13:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Petr Vandrovec
[-- Attachment #1: Type: text/plain, Size: 642 bytes --]
Hello.
Linus Torvalds wrote:
>> 2. How can one be sure there are no more
>> of the like places where the stack is left
>> empty?
> That's a good argument, and may be the strongest reason for _not_ doing
> the speculation. However, I don't think it really can happen anywhere
> else.
OK, so how do you feel about the attached
patch? I understand that from some point
of view it may look like a hack, but at
the same time it:
1. Allows to preserve the valueable optimization
2. Works for NMIs
3. Doesn't care whether or not there are more
of the like instances where the stack is left
empty.
4. Seems to work for me without the crashes:)
[-- Attachment #2: esp0.diff --]
[-- Type: text/x-patch, Size: 399 bytes --]
--- linux/arch/i386/kernel/process.c.old 2005-03-20 14:12:18.000000000 +0300
+++ linux/arch/i386/kernel/process.c 2005-04-10 16:54:39.000000000 +0400
@@ -394,7 +394,7 @@
childregs->esp = esp;
p->thread.esp = (unsigned long) childregs;
- p->thread.esp0 = (unsigned long) (childregs+1);
+ p->thread.esp0 = (unsigned long) (childregs+1) - 8;
p->thread.eip = (unsigned long) ret_from_fork;
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-10 13:20 ` Stas Sergeev
@ 2005-04-10 22:32 ` Andrew Morton
2005-04-11 17:15 ` Stas Sergeev
0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2005-04-10 22:32 UTC (permalink / raw)
To: Stas Sergeev; +Cc: torvalds, mingo, linux-kernel, VANDROVE
Stas Sergeev <stsp@aknet.ru> wrote:
>
> - p->thread.esp0 = (unsigned long) (childregs+1);
> + p->thread.esp0 = (unsigned long) (childregs+1) - 8;
This is utterly obscure - it needs a comment so that readers know what that
"- 8" is doing there.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC
2005-04-10 22:32 ` Andrew Morton
@ 2005-04-11 17:15 ` Stas Sergeev
0 siblings, 0 replies; 31+ messages in thread
From: Stas Sergeev @ 2005-04-11 17:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: torvalds, mingo, linux-kernel, VANDROVE
[-- Attachment #1: Type: text/plain, Size: 424 bytes --]
Hello.
Andrew Morton wrote:
> This is utterly obscure - it needs a comment so that readers know what that
> "- 8" is doing there.
Yes, that was only an RFC thing.
And now since there were not too much
of an FC, I prepared the "polished"
version. But apparently you already
released -mm3:)
Well, at least you can still apply the
comments if you feel like that. Here
they are.
Signed-off-by: Stas Sergeev <stsp@aknet.ru>
[-- Attachment #2: esp0fix.diff --]
[-- Type: text/x-patch, Size: 1307 bytes --]
--- linux-2.6.12-rc2/arch/i386/kernel/entry.S 2005-04-06 09:34:35.000000000 +0400
+++ linux/arch/i386/kernel/entry.S 2005-04-11 10:49:28.000000000 +0400
@@ -245,6 +245,9 @@
restore_all:
movl EFLAGS(%esp), %eax # mix EFLAGS, SS and CS
+ # Warning: OLDSS(%esp) contains the wrong/random values if we
+ # are returning to the kernel.
+ # See comments in process.c:copy_thread() for details.
movb OLDSS(%esp), %ah
movb CS(%esp), %al
andl $(VM_MASK | (4 << 8) | 3), %eax
--- linux-2.6.12-rc2/arch/i386/kernel/process.c 2005-04-06 09:34:35.000000000 +0400
+++ linux/arch/i386/kernel/process.c 2005-04-11 10:30:39.000000000 +0400
@@ -394,6 +394,16 @@
childregs->esp = esp;
p->thread.esp = (unsigned long) childregs;
+ /*
+ * The below -8 is to reserve 8 bytes on top of the ring0 stack.
+ * This is necessary to guarantee that the entire "struct pt_regs"
+ * is accessable even if the CPU haven't stored the SS/ESP registers
+ * on the stack (interrupt gate does not save these registers
+ * when switching to the same priv ring).
+ * Therefore beware: accessing the xss/esp fields of the
+ * "struct pt_regs" is possible, but they may contain the
+ * completely wrong values.
+ */
p->thread.esp0 = (unsigned long) (childregs+1) - 8;
p->thread.eip = (unsigned long) ret_from_fork;
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2005-04-11 17:15 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-04-05 6:55 crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC Ingo Molnar
2005-04-05 7:03 ` Andrew Morton
2005-04-05 7:07 ` Ingo Molnar
2005-04-05 7:16 ` Ingo Molnar
2005-04-05 7:29 ` Ingo Molnar
2005-04-05 7:40 ` Ingo Molnar
2005-04-05 9:51 ` Mikael Pettersson
2005-04-05 18:09 ` Ingo Molnar
2005-04-05 7:05 ` Ingo Molnar
2005-04-05 19:11 ` Stas Sergeev
2005-04-05 19:19 ` Linus Torvalds
2005-04-05 19:41 ` Stas Sergeev
2005-04-05 19:53 ` Linus Torvalds
2005-04-05 20:44 ` Ingo Molnar
2005-04-05 21:04 ` Linus Torvalds
2005-04-06 15:44 ` Stas Sergeev
2005-04-07 8:00 ` Ingo Molnar
2005-04-07 11:10 ` Andrew Morton
2005-04-07 14:47 ` Linus Torvalds
2005-04-07 14:51 ` Ingo Molnar
2005-04-07 16:47 ` Dave Jones
2005-04-07 17:17 ` Richard B. Johnson
2005-04-07 17:23 ` Linus Torvalds
2005-04-07 16:11 ` Stas Sergeev
2005-04-07 16:35 ` Linus Torvalds
2005-04-07 16:46 ` Stas Sergeev
2005-04-07 16:55 ` Linus Torvalds
2005-04-07 18:10 ` Stas Sergeev
2005-04-10 13:20 ` Stas Sergeev
2005-04-10 22:32 ` Andrew Morton
2005-04-11 17:15 ` Stas Sergeev
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).