All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
@ 2008-12-17 22:46 Helge Deller
  2008-12-18  0:05 ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling John David Anglin
                   ` (3 more replies)
  0 siblings, 4 replies; 56+ messages in thread
From: Helge Deller @ 2008-12-17 22:46 UTC (permalink / raw)
  To: linux-parisc

[-- Attachment #1: Type: text/plain, Size: 2371 bytes --]

The Debian bugzilla has a long thread about kernel crashes when
compiling ruby1.9 on hppa. This kernel bug led even to discussions if
hppa should be dropped for lenny.
See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=478717 for details.

It's really easy to reproduce the bug, and it generates this backtrace
(interestingly two backtraces):

     < Your System ate a SPARC! Gah! >
      -------------------------------
             \   ^__^
              \  (xx)\_______
                 (__)\       )\/\
                  U  ||----w |
                     ||     ||
miniruby (pid 15221): Protection id trap (code 27)

     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00000000000001000000000000001111 Not tainted
r00-03  0004000f 102a9800 101a141c 8210c388
r04-07  00000000 0020fd08 0020fd10 00000001
r08-11  00000000 8210c388 fffffff2 8210c0c8
r12-15  fb0d04c8 402cc3d8 00001000 40007000
r16-19  002120a0 00000010 0020fd90 00000001
r20-23  8210c000 00000000 0020fd0e 8210c39e
r24-27  00000000 00000001 8e7c5660 105e7a90
r28-31  0000000f 00190834 8210c500 101a12b8
sr00-03  00000000 00000000 00000000 00000847
sr04-07  00000000 00000000 00000000 00000000

IASQ: 00000000 00000000 IAOQ: 101a147c 101a1480
 IIR: 0ed5d240    ISR: 00000847  IOR: 0020fd0e
 CPU:        0   CR30: 8210c000 CR31: d22344f0
 ORIG_R28: 00001000
 IAOQ[0]: do_sys_poll+0x1ac/0x1b8
 IAOQ[1]: do_sys_poll+0x1b0/0x1b8
 RP(r2): do_sys_poll+0x14c/0x1b8
Backtrace:
 [<101a1574>] sys_poll+0x84/0xec
 [<10114078>] syscall_exit+0x0/0x28

Backtrace:
 [<1010fdb8>] die_if_kernel+0xe8/0x1ac
 [<10110584>] handle_interruption+0x2fc/0x598
 [<10113078>] intr_check_sig+0x0/0x34


The bug (sometimes but not always!) happens in fs/select.c:do_sys_poll()
when calling __put_user() and writing back fds[0].revents to userspace.
What I quite don't understand yet is, why does copy_from_user() [called
a few lines above the __put_user()] succeeds, and __put_user() sometimes
suddenly fails with a protection id fault.

The attached patch simply adds the lookup for a fixup handler when trap
#27 (protection id trap) happens in kernel space. This was missing in
the code path for trap #27 which is why the kernel then called
die_if_kernel() and crashed.

Even with this patch ruby1.9 may fail to compile, but at least the
kernel crashes are gone.

Any feedback welcome.

Helge

Signed-off-by: Helge Deller <deller@gmx.de>


[-- Attachment #2: data_protection_id_failure.diff --]
[-- Type: text/x-patch, Size: 1693 bytes --]

diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
index 4c771cd..70eabfe 100644
--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -43,6 +43,8 @@
 
 #include "../math-emu/math-emu.h"	/* for handle_fpe() */
 
+DECLARE_PER_CPU(struct exception_data, exception_data);
+
 #define PRINT_USER_FAULTS /* (turn this on if you want user faults to be */
 			  /*  dumped to the console via printk)          */
 
@@ -745,6 +747,41 @@ void handle_interruption(int code, struct pt_regs *regs)
 		/* Fall Through */
 	case 27: 
 		/* Data memory protection ID trap */
+		if (code == 27 && !user_mode(regs)) {
+			const struct exception_table_entry *fix;
+
+			/* mostly copied from:
+ 			   arch/parisc/mm/fault.c:do_page_fault()
+			 */
+			fix = search_exception_tables(regs->iaoq[0]);
+			printk(KERN_CRIT "BUG: Kernel Data memory protection ID"
+				" trap at %p (%pS), fix=%p\n",
+				(void*)regs->iaoq[0], (void*)regs->iaoq[0], fix);
+			if (fix) {
+				struct exception_data *d;
+
+				d = &__get_cpu_var(exception_data);
+				d->fault_ip = regs->iaoq[0];
+				d->fault_space = regs->isr;
+				d->fault_addr = regs->ior;
+
+				regs->iaoq[0] = ((fix->fixup) & ~3);
+
+				/*
+				 * NOTE: In some cases the faulting instruction
+				 * may be in the delay slot of a branch. We
+				 * don't want to take the branch, so we don't
+				 * increment iaoq[1], instead we set it to be
+				 * iaoq[0]+4, and clear the B bit in the PSW
+				 */
+
+				regs->iaoq[1] = regs->iaoq[0] + 4;
+				regs->gr[0] &= ~PSW_B; /* IPSW in gr[0] */
+
+				return;
+			}
+		}
+
 		die_if_kernel("Protection id trap", regs, code);
 		si.si_code = SEGV_MAPERR;
 		si.si_signo = SIGSEGV;

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling
  2008-12-17 22:46 [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Helge Deller
@ 2008-12-18  0:05 ` John David Anglin
  2008-12-18  0:43   ` Kyle McMartin
  2008-12-18  1:46 ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Carlos O'Donell
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 56+ messages in thread
From: John David Anglin @ 2008-12-18  0:05 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc

> r20-23  8210c000 00000000 0020fd0e 8210c39e
> r24-27  00000000 00000001 8e7c5660 105e7a90
> r28-31  0000000f 00190834 8210c500 101a12b8
> sr00-03  00000000 00000000 00000000 00000847
> sr04-07  00000000 00000000 00000000 00000000
> 
> IASQ: 00000000 00000000 IAOQ: 101a147c 101a1480
>  IIR: 0ed5d240    ISR: 00000847  IOR: 0020fd0e
>  CPU:        0   CR30: 8210c000 CR31: d22344f0
>  ORIG_R28: 00001000
>  IAOQ[0]: do_sys_poll+0x1ac/0x1b8
>  IAOQ[1]: do_sys_poll+0x1b0/0x1b8
>  RP(r2): do_sys_poll+0x14c/0x1b8

The faulting instruction is:

dave@hiauly6:~/opt/gnu/bin$ disasm 0x0ed5d240
   0:	0e d5 d2 40 	sth r21,0(sr3,r22)

r22 is 0.  r22 being 0 is a bit wierd as revents has on offset of 6.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling
  2008-12-18  0:05 ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling John David Anglin
@ 2008-12-18  0:43   ` Kyle McMartin
  2008-12-18  3:02     ` [PATCH] [RFC] fix kernel crash (protection id trap) when John David Anglin
  0 siblings, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-18  0:43 UTC (permalink / raw)
  To: John David Anglin; +Cc: Helge Deller, linux-parisc

On Wed, Dec 17, 2008 at 07:05:26PM -0500, John David Anglin wrote:
> > r20-23  8210c000 00000000 0020fd0e 8210c39e
> > r24-27  00000000 00000001 8e7c5660 105e7a90
> > r28-31  0000000f 00190834 8210c500 101a12b8
> > sr00-03  00000000 00000000 00000000 00000847
> > sr04-07  00000000 00000000 00000000 00000000
> > 
> > IASQ: 00000000 00000000 IAOQ: 101a147c 101a1480
> >  IIR: 0ed5d240    ISR: 00000847  IOR: 0020fd0e
> >  CPU:        0   CR30: 8210c000 CR31: d22344f0
> >  ORIG_R28: 00001000
> >  IAOQ[0]: do_sys_poll+0x1ac/0x1b8
> >  IAOQ[1]: do_sys_poll+0x1b0/0x1b8
> >  RP(r2): do_sys_poll+0x14c/0x1b8
> 
> The faulting instruction is:
> 
> dave@hiauly6:~/opt/gnu/bin$ disasm 0x0ed5d240
>    0:	0e d5 d2 40 	sth r21,0(sr3,r22)
> 
> r22 is 0.  r22 being 0 is a bit wierd as revents has on offset of 6.
> 

Eh? r22 is 0020fd0e... although it's easier to spot by looking at IOR
> >  IIR: 0ed5d240    ISR: 00000847  IOR: 0020fd0e

Weird.

Although, I don't think we're supposed to ever be taking a P. Id trap on
a userspace process as the kernel...

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-17 22:46 [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Helge Deller
  2008-12-18  0:05 ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling John David Anglin
@ 2008-12-18  1:46 ` Carlos O'Donell
  2008-12-18  2:02   ` Carlos O'Donell
  2008-12-18  2:36   ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Kyle McMartin
  2008-12-18  7:03 ` Kyle McMartin
  2008-12-19 21:29 ` Kyle McMartin
  3 siblings, 2 replies; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-18  1:46 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc

On Wed, Dec 17, 2008 at 5:46 PM, Helge Deller <deller@gmx.de> wrote:
> The attached patch simply adds the lookup for a fixup handler when trap
> #27 (protection id trap) happens in kernel space. This was missing in
> the code path for trap #27 which is why the kernel then called
> die_if_kernel() and crashed.
>
> Even with this patch ruby1.9 may fail to compile, but at least the
> kernel crashes are gone.
>
> Any feedback welcome.

Can we avoid copying code from do_page_fault and just call do_page_fault?

Shouldn't a data protection id trap on a userspace address cause
do_page_fault to be called which triggers the lookup and transfer to
the exception table location?

Why aren't we doing this in the data protection id trap case?

I remember testing this for the compare-and-swap kernel LWS, in
particular calling with an invalid address, where the kernel must load
a userspace address, possibly fault, and jump to the exception
location to return EFAULT as the error for the compare-and-swap.

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-18  1:46 ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Carlos O'Donell
@ 2008-12-18  2:02   ` Carlos O'Donell
  2008-12-18 13:13     ` Helge Deller
  2008-12-18  2:36   ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Kyle McMartin
  1 sibling, 1 reply; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-18  2:02 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc

On Wed, Dec 17, 2008 at 8:46 PM, Carlos O'Donell
<carlos@systemhalted.org> wrote:
> On Wed, Dec 17, 2008 at 5:46 PM, Helge Deller <deller@gmx.de> wrote:
>> The attached patch simply adds the lookup for a fixup handler when trap
>> #27 (protection id trap) happens in kernel space. This was missing in
>> the code path for trap #27 which is why the kernel then called
>> die_if_kernel() and crashed.
>>
>> Even with this patch ruby1.9 may fail to compile, but at least the
>> kernel crashes are gone.
>>
>> Any feedback welcome.
>
> Can we avoid copying code from do_page_fault and just call do_page_fault?
>
> Shouldn't a data protection id trap on a userspace address cause
> do_page_fault to be called which triggers the lookup and transfer to
> the exception table location?
>
> Why aren't we doing this in the data protection id trap case?

I went all the way back to 2.6.0 and the code is still the same.

We must never have generated a protection id trap against user
addresses while in the kernel, otherwise we would have noticed this
sooner.

I think your code is correct and necessary.

However! I think you should not duplicate code. Please split out this
"bit" of code into a function in mm/fault.c and call it from
handle_interruption.

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-18  1:46 ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Carlos O'Donell
  2008-12-18  2:02   ` Carlos O'Donell
@ 2008-12-18  2:36   ` Kyle McMartin
  2008-12-18 12:21     ` Carlos O'Donell
  1 sibling, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-18  2:36 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Helge Deller, linux-parisc

On Wed, Dec 17, 2008 at 08:46:30PM -0500, Carlos O'Donell wrote:
> On Wed, Dec 17, 2008 at 5:46 PM, Helge Deller <deller@gmx.de> wrote:
> > The attached patch simply adds the lookup for a fixup handler when trap
> > #27 (protection id trap) happens in kernel space. This was missing in
> > the code path for trap #27 which is why the kernel then called
> > die_if_kernel() and crashed.
> >
> > Even with this patch ruby1.9 may fail to compile, but at least the
> > kernel crashes are gone.
> >
> > Any feedback welcome.
> 
> Can we avoid copying code from do_page_fault and just call do_page_fault?
> 
> Shouldn't a data protection id trap on a userspace address cause
> do_page_fault to be called which triggers the lookup and transfer to
> the exception table location?
> 

Yes, it could just fall through in this case, but I don't think that's
necessarily correct either.

> Why aren't we doing this in the data protection id trap case?
> 
> I remember testing this for the compare-and-swap kernel LWS, in
> particular calling with an invalid address, where the kernel must load
> a userspace address, possibly fault, and jump to the exception
> location to return EFAULT as the error for the compare-and-swap.
> 

Indeed, there's something more sinister going on.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-18  0:43   ` Kyle McMartin
@ 2008-12-18  3:02     ` John David Anglin
  2008-12-18  3:05       ` Kyle McMartin
  2008-12-18  4:04       ` Kyle McMartin
  0 siblings, 2 replies; 56+ messages in thread
From: John David Anglin @ 2008-12-18  3:02 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: deller, linux-parisc

> Eh? r22 is 0020fd0e... although it's easier to spot by looking at IOR
> > >  IIR: 0ed5d240    ISR: 00000847  IOR: 0020fd0e

Obviously, I can't count...

> Although, I don't think we're supposed to ever be taking a P. Id trap on
> a userspace process as the kernel...

Ok, should P and/or D be on?

Assuming the protection id isn't wrong, then the WD bit must be on.
A problem with the WD bit might explain Helge's comment.

As a start, dumping CR 8, 9, 12 and 13 might help debug the problem.

There might be a problem restoring access IDs after a cache purge or
TLB insertion.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-18  3:02     ` [PATCH] [RFC] fix kernel crash (protection id trap) when John David Anglin
@ 2008-12-18  3:05       ` Kyle McMartin
  2008-12-18  4:04       ` Kyle McMartin
  1 sibling, 0 replies; 56+ messages in thread
From: Kyle McMartin @ 2008-12-18  3:05 UTC (permalink / raw)
  To: John David Anglin; +Cc: deller, linux-parisc

On Wed, Dec 17, 2008 at 10:02:13PM -0500, John David Anglin wrote:
> > Although, I don't think we're supposed to ever be taking a P. Id trap on
> > a userspace process as the kernel...
> 
> Ok, should P and/or D be on?
> 
> Assuming the protection id isn't wrong, then the WD bit must be on.
> A problem with the WD bit might explain Helge's comment.
> 
> As a start, dumping CR 8, 9, 12 and 13 might help debug the problem.
> 
> There might be a problem restoring access IDs after a cache purge or
> TLB insertion.
> 

I'm starting a build of ruby now, with some debugging gunk added
to dump the pid %crs, we'll see how it goes in a few hours...

re, Kyle

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-18  3:02     ` [PATCH] [RFC] fix kernel crash (protection id trap) when John David Anglin
  2008-12-18  3:05       ` Kyle McMartin
@ 2008-12-18  4:04       ` Kyle McMartin
  2008-12-18 16:16         ` Carlos O'Donell
  2008-12-18 18:28         ` Kyle McMartin
  1 sibling, 2 replies; 56+ messages in thread
From: Kyle McMartin @ 2008-12-18  4:04 UTC (permalink / raw)
  To: John David Anglin; +Cc: Kyle McMartin, deller, linux-parisc

On Wed, Dec 17, 2008 at 10:02:13PM -0500, John David Anglin wrote:
> > Eh? r22 is 0020fd0e... although it's easier to spot by looking at IOR
> > > >  IIR: 0ed5d240    ISR: 00000847  IOR: 0020fd0e
> 
> Obviously, I can't count...
> 
> > Although, I don't think we're supposed to ever be taking a P. Id trap on
> > a userspace process as the kernel...
> 
> Ok, should P and/or D be on?
> 

Huh... Possibly the P-bit shouldn't be on for kernel mode... the kernel
after all can scribble on whomevers pages it so chooses...

> Assuming the protection id isn't wrong, then the WD bit must be on.
> A problem with the WD bit might explain Helge's comment.
> 
> As a start, dumping CR 8, 9, 12 and 13 might help debug the problem.
> 

Building a patch to do this now, will attach it here if it builds.

> There might be a problem restoring access IDs after a cache purge or
> TLB insertion.
> 

I honestly have no idea how the protection id registers are loaded... I
thought it had something do with with space registers, but I can't
remember off the top of my head.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-17 22:46 [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Helge Deller
  2008-12-18  0:05 ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling John David Anglin
  2008-12-18  1:46 ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Carlos O'Donell
@ 2008-12-18  7:03 ` Kyle McMartin
  2008-12-18 13:09   ` Helge Deller
  2008-12-19 21:29 ` Kyle McMartin
  3 siblings, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-18  7:03 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc

On Wed, Dec 17, 2008 at 11:46:05PM +0100, Helge Deller wrote:
> The Debian bugzilla has a long thread about kernel crashes when
> compiling ruby1.9 on hppa. This kernel bug led even to discussions if
> hppa should be dropped for lenny.
> See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=478717 for details.
> 
> It's really easy to reproduce the bug, and it generates this backtrace
> (interestingly two backtraces):
> 

When does it die? My ruby1.9 is building through to the tests (where
it's hanging without crashing...)

r, Kyle

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-18  2:36   ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Kyle McMartin
@ 2008-12-18 12:21     ` Carlos O'Donell
  0 siblings, 0 replies; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-18 12:21 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: Helge Deller, linux-parisc

On Wed, Dec 17, 2008 at 9:36 PM, Kyle McMartin <kyle@infradead.org> wrote:
> On Wed, Dec 17, 2008 at 08:46:30PM -0500, Carlos O'Donell wrote:
>> On Wed, Dec 17, 2008 at 5:46 PM, Helge Deller <deller@gmx.de> wrote:
>> > The attached patch simply adds the lookup for a fixup handler when trap
>> > #27 (protection id trap) happens in kernel space. This was missing in
>> > the code path for trap #27 which is why the kernel then called
>> > die_if_kernel() and crashed.
>> >
>> > Even with this patch ruby1.9 may fail to compile, but at least the
>> > kernel crashes are gone.
>> >
>> > Any feedback welcome.
>>
>> Can we avoid copying code from do_page_fault and just call do_page_fault?
>>
>> Shouldn't a data protection id trap on a userspace address cause
>> do_page_fault to be called which triggers the lookup and transfer to
>> the exception table location?
>>
>
> Yes, it could just fall through in this case, but I don't think that's
> necessarily correct either.

Shouldn't it always be wrong to call die_if_kernel on protection id
trap without searching the exception tables? This is exactly what
happens in the case of trap 27.

>> I remember testing this for the compare-and-swap kernel LWS, in
>> particular calling with an invalid address, where the kernel must load
>> a userspace address, possibly fault, and jump to the exception
>> location to return EFAULT as the error for the compare-and-swap.
>
> Indeed, there's something more sinister going on.

It might be that a different fault is generated?

I ran my test-lws program again this morning and it correctly returns
EFAULT which means the exception handling should be working, *but* I
don't know which trap was generated.

The ldw could have generated a data tlb miss fault/data page fault
trap, and that is correctly handled.

A data address memory protection id trap only happens if:
- PSW P-bit is 1.
- Prot. ID check fails? (What memory has prot id checks on it?)
- One of load/Store/Ldcw/cache purge references the memory.

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-18  7:03 ` Kyle McMartin
@ 2008-12-18 13:09   ` Helge Deller
  2008-12-18 15:05     ` Kyle McMartin
  0 siblings, 1 reply; 56+ messages in thread
From: Helge Deller @ 2008-12-18 13:09 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: linux-parisc

-------- Original-Nachricht --------
> On Wed, Dec 17, 2008 at 11:46:05PM +0100, Helge Deller wrote:
> > The Debian bugzilla has a long thread about kernel crashes when
> > compiling ruby1.9 on hppa. This kernel bug led even to discussions =
if
> > hppa should be dropped for lenny.
> > See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=3D478717 for d=
etails.
> >=20
> > It's really easy to reproduce the bug, and it generates this backtr=
ace
> > (interestingly two backtraces):
>=20
> When does it die? My ruby1.9 is building through to the tests (where
> it's hanging without crashing...)

If it does not happen directly for you, just try a few times/cycles, e.=
g. make, make clean in the build directory.
Even killing the hanging miniruby processes can make the kernel die (wh=
en my patch wasn't applied). With the patch it stays hanging.

Helge
--=20
Psssst! Schon vom neuen GMX MultiMessenger geh=F6rt? Der kann`s mit all=
en: http://www.gmx.net/de/go/multimessenger
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc"=
 in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-18  2:02   ` Carlos O'Donell
@ 2008-12-18 13:13     ` Helge Deller
  2008-12-18 15:28       ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling John David Anglin
  0 siblings, 1 reply; 56+ messages in thread
From: Helge Deller @ 2008-12-18 13:13 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: linux-parisc

> On Wed, Dec 17, 2008 at 8:46 PM, Carlos O'Donell
> > Can we avoid copying code from do_page_fault and just call
> do_page_fault?
=2E..
> However! I think you should not duplicate code. Please split out this
> "bit" of code into a function in mm/fault.c and call it from
> handle_interruption.

Sure. This patch was just meant to be short and easy understandable for=
 review of the problem itself and without other noise like splitting ou=
t a function...

Helge
--=20
Sensationsangebot verl=E4ngert: GMX FreeDSL - Telefonanschluss + DSL=20
f=FCr nur 16,37 Euro/mtl.!* http://dsl.gmx.de/?ac=3DOM.AD.PD003K1308T45=
69a
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc"=
 in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-18 13:09   ` Helge Deller
@ 2008-12-18 15:05     ` Kyle McMartin
  0 siblings, 0 replies; 56+ messages in thread
From: Kyle McMartin @ 2008-12-18 15:05 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc

On Thu, Dec 18, 2008 at 02:09:40PM +0100, Helge Deller wrote:
> -------- Original-Nachricht --------
> > On Wed, Dec 17, 2008 at 11:46:05PM +0100, Helge Deller wrote:
> > > The Debian bugzilla has a long thread about kernel crashes when
> > > compiling ruby1.9 on hppa. This kernel bug led even to discussions if
> > > hppa should be dropped for lenny.
> > > See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=478717 for details.
> > > 
> > > It's really easy to reproduce the bug, and it generates this backtrace
> > > (interestingly two backtraces):
> > 
> > When does it die? My ruby1.9 is building through to the tests (where
> > it's hanging without crashing...)
> 
> If it does not happen directly for you, just try a few times/cycles, e.g. make, make clean in the build directory.
> Even killing the hanging miniruby processes can make the kernel die (when my patch wasn't applied). With the patch it stays hanging.
>

Aside from the testsuite hanging it seems to be building fine for me no
matter how many times I let it go...

making ruby1.9
make[1]: Entering directory `/home/kyle/src/ruby1.9-1.9.0.2'
cc -fno-strict-aliasing -g -g -O2 -O2 -g -Wall -Wno-parentheses  -fPIC
-L.  -rdynamic -Wl,-export-dynamic   main.o  -lruby1.9 -lpthread -lrt
-ldl -lcrypt -lm   -o ruby1.9
make[1]: Leaving directory `/home/kyle/src/ruby1.9-1.9.0.2'

Let me debootstrap on another machine and try it there, sigh.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling
  2008-12-18 13:13     ` Helge Deller
@ 2008-12-18 15:28       ` John David Anglin
  2008-12-18 16:09         ` Carlos O'Donell
  0 siblings, 1 reply; 56+ messages in thread
From: John David Anglin @ 2008-12-18 15:28 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc, carlos

> > On Wed, Dec 17, 2008 at 8:46 PM, Carlos O'Donell
> > > Can we avoid copying code from do_page_fault and just call
> > do_page_fault?
> ...
> > However! I think you should not duplicate code. Please split out this
> > "bit" of code into a function in mm/fault.c and call it from
> > handle_interruption.
> 
> Sure. This patch was just meant to be short and easy understandable for review of the problem itself and without other noise like splitting out a function...

>From the documentation, it appears to me that a protection id trap should
be treated differently from a data TLB miss/data page fault.  The latter can
potentially be fixed with TLB insertion.  I don't think a protection ID trap
can be fixed...

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling
  2008-12-18 15:28       ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling John David Anglin
@ 2008-12-18 16:09         ` Carlos O'Donell
  0 siblings, 0 replies; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-18 16:09 UTC (permalink / raw)
  To: John David Anglin; +Cc: Helge Deller, linux-parisc

On Thu, Dec 18, 2008 at 10:28 AM, John David Anglin
<dave@hiauly1.hia.nrc.ca> wrote:
> From the documentation, it appears to me that a protection id trap should
> be treated differently from a data TLB miss/data page fault.  The latter can
> potentially be fixed with TLB insertion.  I don't think a protection ID trap
> can be fixed...

Yes, I think that is also true.

What if, after a context switch we have stale TLB entry with an access
ID from another process? When the kernel tries to write to the user
address the current protection ids don't match the page's access id
and we generate a protection id trap.

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-18  4:04       ` Kyle McMartin
@ 2008-12-18 16:16         ` Carlos O'Donell
  2008-12-18 18:28         ` Kyle McMartin
  1 sibling, 0 replies; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-18 16:16 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, deller, linux-parisc

On Wed, Dec 17, 2008 at 11:04 PM, Kyle McMartin <kyle@infradead.org> wrote:
> On Wed, Dec 17, 2008 at 10:02:13PM -0500, John David Anglin wrote:
>> > Eh? r22 is 0020fd0e... although it's easier to spot by looking at IOR
>> > > >  IIR: 0ed5d240    ISR: 00000847  IOR: 0020fd0e
>>
>> Obviously, I can't count...
>>
>> > Although, I don't think we're supposed to ever be taking a P. Id trap on
>> > a userspace process as the kernel...
>>
>> Ok, should P and/or D be on?
>>
>
> Huh... Possibly the P-bit shouldn't be on for kernel mode... the kernel
> after all can scribble on whomevers pages it so chooses...

IMO the kernel should run with the P-Bit on.

Disabling the P-Bit could potentially mask stale TLB entry bugs.

Imagine this scenario:
1. Context switch to process A, setup protection ids, setup TLB entries.
2. Context switch to process B, setup protection ids, incorrectly
setup TLB entries (entries from process A remain).
3. Kernel tries to access process B's page but triggers TLB entry for
process A and gets a data protection id trap.

The trap in #3 is correct since we used a stale TLB entry.

What I don't know is this: Are all of process A's tlb entries purged
at context switch? Do they have to be?

Do we implement lazy tlb flushing? Do process A's tlb entries only get
flushed when they trigger a data protection id trap?

>> There might be a problem restoring access IDs after a cache purge or
>> TLB insertion.
>>
>
> I honestly have no idea how the protection id registers are loaded... I
> thought it had something do with with space registers, but I can't
> remember off the top of my head.

I would imagine protection id registers are loaded at every context switch?

Do threads have the same protection id register values (and the same
space register) such that they can access the same pages?

c.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-18  4:04       ` Kyle McMartin
  2008-12-18 16:16         ` Carlos O'Donell
@ 2008-12-18 18:28         ` Kyle McMartin
  2008-12-19 15:25           ` Carlos O'Donell
  1 sibling, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-18 18:28 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, deller, linux-parisc

On Wed, Dec 17, 2008 at 11:04:00PM -0500, Kyle McMartin wrote:
> > There might be a problem restoring access IDs after a cache purge or
> > TLB insertion.
> > 
> 
> I honestly have no idea how the protection id registers are loaded... I
> thought it had something do with with space registers, but I can't
> remember off the top of my head.

For future reference, in mmu_context.h. Bloody tricky mtctl() macro
doesn't take a prefix, just the numeric cr #.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-18 18:28         ` Kyle McMartin
@ 2008-12-19 15:25           ` Carlos O'Donell
  2008-12-19 16:13             ` John David Anglin
  0 siblings, 1 reply; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-19 15:25 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, deller, linux-parisc

On Thu, Dec 18, 2008 at 1:28 PM, Kyle McMartin <kyle@infradead.org> wrote:
> On Wed, Dec 17, 2008 at 11:04:00PM -0500, Kyle McMartin wrote:
>> > There might be a problem restoring access IDs after a cache purge or
>> > TLB insertion.
>> >
>>
>> I honestly have no idea how the protection id registers are loaded... I
>> thought it had something do with with space registers, but I can't
>> remember off the top of my head.
>
> For future reference, in mmu_context.h. Bloody tricky mtctl() macro
> doesn't take a prefix, just the numeric cr #.

Evil.

Helge,

Can you cleanup your patch and resubmit. I think your patch is a good
belt-and-suspenders patch. The kernel should not fault on a protection
id trap taken while writing to a userspace address (even if such a
scenario indicates a possible bug in the mm context switching code).

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 15:25           ` Carlos O'Donell
@ 2008-12-19 16:13             ` John David Anglin
  2008-12-19 16:23               ` Carlos O'Donell
  0 siblings, 1 reply; 56+ messages in thread
From: John David Anglin @ 2008-12-19 16:13 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: kyle, deller, linux-parisc

> > For future reference, in mmu_context.h. Bloody tricky mtctl() macro
> > doesn't take a prefix, just the numeric cr #.
> 
> Evil.

If this is a problem with the WD bit, the problem is likely here:

        mtctl(context >> (SPACEID_SHIFT - 1),8);

It's possible this might set the WD bit depending on the alignment
of context and the value of SPACEID_SHIFT on your machine.  If you
can duplicate, maybe add a BUG_ON.  On hppa64, all structs are passed
by value, so the alignment of context  may only be BIGGEST_ALIGNMENT.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 16:13             ` John David Anglin
@ 2008-12-19 16:23               ` Carlos O'Donell
  2008-12-19 16:28                 ` Kyle McMartin
  0 siblings, 1 reply; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-19 16:23 UTC (permalink / raw)
  To: John David Anglin; +Cc: kyle, deller, linux-parisc

On Fri, Dec 19, 2008 at 11:13 AM, John David Anglin
<dave@hiauly1.hia.nrc.ca> wrote:
>> > For future reference, in mmu_context.h. Bloody tricky mtctl() macro
>> > doesn't take a prefix, just the numeric cr #.
>>
>> Evil.
>
> If this is a problem with the WD bit, the problem is likely here:
>
>        mtctl(context >> (SPACEID_SHIFT - 1),8);
>
> It's possible this might set the WD bit depending on the alignment
> of context and the value of SPACEID_SHIFT on your machine.  If you
> can duplicate, maybe add a BUG_ON.  On hppa64, all structs are passed
> by value, so the alignment of context  may only be BIGGEST_ALIGNMENT.

The value of context is a space id (the value returned by alloc_sid())
and is not related to any alignment?

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 16:23               ` Carlos O'Donell
@ 2008-12-19 16:28                 ` Kyle McMartin
  2008-12-19 16:35                   ` Carlos O'Donell
  0 siblings, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-19 16:28 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: John David Anglin, kyle, deller, linux-parisc

On Fri, Dec 19, 2008 at 11:23:39AM -0500, Carlos O'Donell wrote:
> On Fri, Dec 19, 2008 at 11:13 AM, John David Anglin
> <dave@hiauly1.hia.nrc.ca> wrote:
> >> > For future reference, in mmu_context.h. Bloody tricky mtctl() macro
> >> > doesn't take a prefix, just the numeric cr #.
> >>
> >> Evil.
> >
> > If this is a problem with the WD bit, the problem is likely here:
> >
> >        mtctl(context >> (SPACEID_SHIFT - 1),8);
> >
> > It's possible this might set the WD bit depending on the alignment
> > of context and the value of SPACEID_SHIFT on your machine.  If you
> > can duplicate, maybe add a BUG_ON.  On hppa64, all structs are passed
> > by value, so the alignment of context  may only be BIGGEST_ALIGNMENT.
> 
> The value of context is a space id (the value returned by alloc_sid())
> and is not related to any alignment?

It's a typedef to an unsigned long. I'm not sure we've seen this on a
64-bit kernel (which is the SPACEID_SHIFT one.)

On a 32-bit kernel (where I've been able to reproduce this) we do a
leftshift by one, so the WD bit is guaranteed to be 0.

cheers, Kyle

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 16:28                 ` Kyle McMartin
@ 2008-12-19 16:35                   ` Carlos O'Donell
  2008-12-19 16:36                     ` Carlos O'Donell
  0 siblings, 1 reply; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-19 16:35 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, deller, linux-parisc

On Fri, Dec 19, 2008 at 11:28 AM, Kyle McMartin <kyle@infradead.org> wrote:
> On Fri, Dec 19, 2008 at 11:23:39AM -0500, Carlos O'Donell wrote:
>> On Fri, Dec 19, 2008 at 11:13 AM, John David Anglin
>> <dave@hiauly1.hia.nrc.ca> wrote:
>> >> > For future reference, in mmu_context.h. Bloody tricky mtctl() macro
>> >> > doesn't take a prefix, just the numeric cr #.
>> >>
>> >> Evil.
>> >
>> > If this is a problem with the WD bit, the problem is likely here:
>> >
>> >        mtctl(context >> (SPACEID_SHIFT - 1),8);
>> >
>> > It's possible this might set the WD bit depending on the alignment
>> > of context and the value of SPACEID_SHIFT on your machine.  If you
>> > can duplicate, maybe add a BUG_ON.  On hppa64, all structs are passed
>> > by value, so the alignment of context  may only be BIGGEST_ALIGNMENT.
>>
>> The value of context is a space id (the value returned by alloc_sid())
>> and is not related to any alignment?
>
> It's a typedef to an unsigned long. I'm not sure we've seen this on a
> 64-bit kernel (which is the SPACEID_SHIFT one.)
>
> On a 32-bit kernel (where I've been able to reproduce this) we do a
> leftshift by one, so the WD bit is guaranteed to be 0.

John has a good point though, we might as well assert that?

c.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 16:35                   ` Carlos O'Donell
@ 2008-12-19 16:36                     ` Carlos O'Donell
  2008-12-19 16:44                       ` Kyle McMartin
  0 siblings, 1 reply; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-19 16:36 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, deller, linux-parisc

On Fri, Dec 19, 2008 at 11:35 AM, Carlos O'Donell
<carlos@systemhalted.org> wrote:
> On Fri, Dec 19, 2008 at 11:28 AM, Kyle McMartin <kyle@infradead.org> wrote:
>> On Fri, Dec 19, 2008 at 11:23:39AM -0500, Carlos O'Donell wrote:
>>> On Fri, Dec 19, 2008 at 11:13 AM, John David Anglin
>>> <dave@hiauly1.hia.nrc.ca> wrote:
>>> >> > For future reference, in mmu_context.h. Bloody tricky mtctl() macro
>>> >> > doesn't take a prefix, just the numeric cr #.
>>> >>
>>> >> Evil.
>>> >
>>> > If this is a problem with the WD bit, the problem is likely here:
>>> >
>>> >        mtctl(context >> (SPACEID_SHIFT - 1),8);
>>> >
>>> > It's possible this might set the WD bit depending on the alignment
>>> > of context and the value of SPACEID_SHIFT on your machine.  If you
>>> > can duplicate, maybe add a BUG_ON.  On hppa64, all structs are passed
>>> > by value, so the alignment of context  may only be BIGGEST_ALIGNMENT.
>>>
>>> The value of context is a space id (the value returned by alloc_sid())
>>> and is not related to any alignment?
>>
>> It's a typedef to an unsigned long. I'm not sure we've seen this on a
>> 64-bit kernel (which is the SPACEID_SHIFT one.)
>>
>> On a 32-bit kernel (where I've been able to reproduce this) we do a
>> leftshift by one, so the WD bit is guaranteed to be 0.
>
> John has a good point though, we might as well assert that?

In alloc_sid() when we allocate the space id?

c.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 16:36                     ` Carlos O'Donell
@ 2008-12-19 16:44                       ` Kyle McMartin
  2008-12-19 17:28                         ` Carlos O'Donell
  0 siblings, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-19 16:44 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Kyle McMartin, John David Anglin, deller, linux-parisc

On Fri, Dec 19, 2008 at 11:36:33AM -0500, Carlos O'Donell wrote:
> On Fri, Dec 19, 2008 at 11:35 AM, Carlos O'Donell
> <carlos@systemhalted.org> wrote:
> > On Fri, Dec 19, 2008 at 11:28 AM, Kyle McMartin <kyle@infradead.org> wrote:
> >> On Fri, Dec 19, 2008 at 11:23:39AM -0500, Carlos O'Donell wrote:
> >>> On Fri, Dec 19, 2008 at 11:13 AM, John David Anglin
> >>> <dave@hiauly1.hia.nrc.ca> wrote:
> >>> >> > For future reference, in mmu_context.h. Bloody tricky mtctl() macro
> >>> >> > doesn't take a prefix, just the numeric cr #.
> >>> >>
> >>> >> Evil.
> >>> >
> >>> > If this is a problem with the WD bit, the problem is likely here:
> >>> >
> >>> >        mtctl(context >> (SPACEID_SHIFT - 1),8);
> >>> >
> >>> > It's possible this might set the WD bit depending on the alignment
> >>> > of context and the value of SPACEID_SHIFT on your machine.  If you
> >>> > can duplicate, maybe add a BUG_ON.  On hppa64, all structs are passed
> >>> > by value, so the alignment of context  may only be BIGGEST_ALIGNMENT.
> >>>
> >>> The value of context is a space id (the value returned by alloc_sid())
> >>> and is not related to any alignment?
> >>
> >> It's a typedef to an unsigned long. I'm not sure we've seen this on a
> >> 64-bit kernel (which is the SPACEID_SHIFT one.)
> >>
> >> On a 32-bit kernel (where I've been able to reproduce this) we do a
> >> leftshift by one, so the WD bit is guaranteed to be 0.
> >
> > John has a good point though, we might as well assert that?
> 
> In alloc_sid() when we allocate the space id?
> 

Meh. > >>> >        mtctl(context >> (SPACEID_SHIFT - 1),8);

SPACEID_SHIFT is the number of spaceid bits, which is 32 on 64-bit.

I suspect what this is trying to do, is put it in the rightmost half of
the register, so as long as context is shifted left already, (which it
is.)

When you cancel out the SPACEID_SHIFT in alloc_sid, and here in
load_context, you end up with the same mtctl(context << 1, 8) as on
32-bit.

re, Kyle

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 16:44                       ` Kyle McMartin
@ 2008-12-19 17:28                         ` Carlos O'Donell
  2008-12-19 17:33                           ` Kyle McMartin
  0 siblings, 1 reply; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-19 17:28 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, deller, linux-parisc

On Fri, Dec 19, 2008 at 11:44 AM, Kyle McMartin <kyle@infradead.org> wrote:
> SPACEID_SHIFT is the number of spaceid bits, which is 32 on 64-bit.

On 64-bit SPACEID_SHIFT is 11 bits. Verified by hand and by gcc -E/-dM
and test program.

On 32-bit SPACEID_SHIFT is 0 bits. Verified by hand.

> I suspect what this is trying to do, is put it in the rightmost half of
> the register, so as long as context is shifted left already, (which it
> is.)
>
> When you cancel out the SPACEID_SHIFT in alloc_sid, and here in
> load_context, you end up with the same mtctl(context << 1, 8) as on
> 32-bit.

Yes, and that keeps the WD bit clear.

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 17:28                         ` Carlos O'Donell
@ 2008-12-19 17:33                           ` Kyle McMartin
  2008-12-19 17:36                             ` Carlos O'Donell
  0 siblings, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-19 17:33 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Kyle McMartin, John David Anglin, deller, linux-parisc

On Fri, Dec 19, 2008 at 12:28:40PM -0500, Carlos O'Donell wrote:
> On Fri, Dec 19, 2008 at 11:44 AM, Kyle McMartin <kyle@infradead.org> wrote:
> > SPACEID_SHIFT is the number of spaceid bits, which is 32 on 64-bit.
> 
> On 64-bit SPACEID_SHIFT is 11 bits. Verified by hand and by gcc -E/-dM
> and test program.
> 
> On 32-bit SPACEID_SHIFT is 0 bits. Verified by hand.
> 

I just assumed MAX_ADDRBITS was 64. Anyway the point is moot.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 17:33                           ` Kyle McMartin
@ 2008-12-19 17:36                             ` Carlos O'Donell
  2008-12-19 17:39                               ` Kyle McMartin
  0 siblings, 1 reply; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-19 17:36 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, deller, linux-parisc

On Fri, Dec 19, 2008 at 12:33 PM, Kyle McMartin <kyle@infradead.org> wrote:
> On Fri, Dec 19, 2008 at 12:28:40PM -0500, Carlos O'Donell wrote:
>> On Fri, Dec 19, 2008 at 11:44 AM, Kyle McMartin <kyle@infradead.org> wrote:
>> > SPACEID_SHIFT is the number of spaceid bits, which is 32 on 64-bit.
>>
>> On 64-bit SPACEID_SHIFT is 11 bits. Verified by hand and by gcc -E/-dM
>> and test program.
>>
>> On 32-bit SPACEID_SHIFT is 0 bits. Verified by hand.
>>
>
> I just assumed MAX_ADDRBITS was 64. Anyway the point is moot.
>

I was just crossing the T's and dotting the I's. In case we need to
remember what SPACEID_SHIFT value was if we see something sketchy.

On 64-bit if you are not using 4kb pages, then SPACEID_SHIFT is 0 bits
like a 32-bit kernel.

c.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 17:36                             ` Carlos O'Donell
@ 2008-12-19 17:39                               ` Kyle McMartin
  2008-12-19 17:42                                 ` Kyle McMartin
  0 siblings, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-19 17:39 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Kyle McMartin, John David Anglin, deller, linux-parisc

On Fri, Dec 19, 2008 at 12:36:15PM -0500, Carlos O'Donell wrote:
> On Fri, Dec 19, 2008 at 12:33 PM, Kyle McMartin <kyle@infradead.org> wrote:
> > On Fri, Dec 19, 2008 at 12:28:40PM -0500, Carlos O'Donell wrote:
> >> On Fri, Dec 19, 2008 at 11:44 AM, Kyle McMartin <kyle@infradead.org> wrote:
> >> > SPACEID_SHIFT is the number of spaceid bits, which is 32 on 64-bit.
> >>
> >> On 64-bit SPACEID_SHIFT is 11 bits. Verified by hand and by gcc -E/-dM
> >> and test program.
> >>
> >> On 32-bit SPACEID_SHIFT is 0 bits. Verified by hand.
> >>
> >
> > I just assumed MAX_ADDRBITS was 64. Anyway the point is moot.
> >
> 
> I was just crossing the T's and dotting the I's. In case we need to
> remember what SPACEID_SHIFT value was if we see something sketchy.
> 
> On 64-bit if you are not using 4kb pages, then SPACEID_SHIFT is 0 bits
> like a 32-bit kernel.
> 

That's a bug, and explains one of the reasons why !4K pages are broken.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 17:39                               ` Kyle McMartin
@ 2008-12-19 17:42                                 ` Kyle McMartin
  2008-12-19 18:43                                   ` Carlos O'Donell
  0 siblings, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-19 17:42 UTC (permalink / raw)
  To: Kyle McMartin
  Cc: Carlos O'Donell, John David Anglin, deller, linux-parisc

On Fri, Dec 19, 2008 at 12:39:31PM -0500, Kyle McMartin wrote:
> On Fri, Dec 19, 2008 at 12:36:15PM -0500, Carlos O'Donell wrote:
> > On Fri, Dec 19, 2008 at 12:33 PM, Kyle McMartin <kyle@infradead.org> wrote:
> > > On Fri, Dec 19, 2008 at 12:28:40PM -0500, Carlos O'Donell wrote:
> > >> On Fri, Dec 19, 2008 at 11:44 AM, Kyle McMartin <kyle@infradead.org> wrote:
> > >> > SPACEID_SHIFT is the number of spaceid bits, which is 32 on 64-bit.
> > >>
> > >> On 64-bit SPACEID_SHIFT is 11 bits. Verified by hand and by gcc -E/-dM
> > >> and test program.
> > >>
> > >> On 32-bit SPACEID_SHIFT is 0 bits. Verified by hand.
> > >>
> > >
> > > I just assumed MAX_ADDRBITS was 64. Anyway the point is moot.
> > >
> > 
> > I was just crossing the T's and dotting the I's. In case we need to
> > remember what SPACEID_SHIFT value was if we see something sketchy.
> > 
> > On 64-bit if you are not using 4kb pages, then SPACEID_SHIFT is 0 bits
> > like a 32-bit kernel.
> > 
> 
> That's a bug, and explains one of the reasons why !4K pages are broken.
> 

Or rather not, my thinking being that context << 1 might end up being
33-bits, but I forgot we clip spaceids to the width of the protection
registers, so this can't happen.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 17:42                                 ` Kyle McMartin
@ 2008-12-19 18:43                                   ` Carlos O'Donell
  0 siblings, 0 replies; 56+ messages in thread
From: Carlos O'Donell @ 2008-12-19 18:43 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, deller, linux-parisc

On Fri, Dec 19, 2008 at 12:42 PM, Kyle McMartin <kyle@infradead.org> wrote:
> Or rather not, my thinking being that context << 1 might end up being
> 33-bits, but I forgot we clip spaceids to the width of the protection
> registers, so this can't happen.

Yes, that is correct, we only use 17-bit space ids on PA 2.0 (18-bits
implemented) and 14-bit space ids on PA 1.1 (15-bits implemented on
average). I assume we reserve the last bit for the rightmost WD-bit.
Therefore context << 1 yields at most 18-bits.

c.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-17 22:46 [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Helge Deller
                   ` (2 preceding siblings ...)
  2008-12-18  7:03 ` Kyle McMartin
@ 2008-12-19 21:29 ` Kyle McMartin
  2008-12-19 22:59   ` Helge Deller
  2008-12-21 15:20   ` John David Anglin
  3 siblings, 2 replies; 56+ messages in thread
From: Kyle McMartin @ 2008-12-19 21:29 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc

On Wed, Dec 17, 2008 at 11:46:05PM +0100, Helge Deller wrote:
>

Honestly, I can't decide whether to apply this. It really should never
happen in the kernel, since the kernel can guarantee it won't get the
access rights failure (highest privilege level, and can set %sr and
%protid to whatever it wants.)

It really genuinely is a bug that probably should panic the kernel. The
only precedent I can easily see is x86 fixing up a bad iret with a
general protection fault, which is more or less analogous to code 27
here.

On the other hand, taking the exception on a userspace access really
isn't all that critical, and there's fundamentally little reason for the
kernel not to SIGSEGV the process, and continue...

Argh.

(btw, I've instrumented my do_sys_poll with a pile of assertions that
 %cr8 << 1 == %sr3 == current->mm.context... let's see if where we're
 getting corrupted is deterministic, though, I would guess that it won't
 be.)

diff --git a/arch/parisc/include/asm/uaccess.h b/arch/parisc/include/asm/uaccess.h
index 4878b95..1c6dbb6 100644
--- a/arch/parisc/include/asm/uaccess.h
+++ b/arch/parisc/include/asm/uaccess.h
@@ -241,4 +241,6 @@ unsigned long copy_in_user(void __user *dst, const void __user *src, unsigned lo
 #define __copy_to_user_inatomic __copy_to_user
 #define __copy_from_user_inatomic __copy_from_user
 
+int fixup_exception(struct pt_regs *regs);
+
 #endif /* __PARISC_UACCESS_H */
diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
index 4c771cd..548ba0c 100644
--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -745,6 +745,10 @@ void handle_interruption(int code, struct pt_regs *regs)
 		/* Fall Through */
 	case 27: 
 		/* Data memory protection ID trap */
+		if (code == 27 && !user_mode(regs) &&
+			fixup_exception(regs))
+			return;
+
 		die_if_kernel("Protection id trap", regs, code);
 		si.si_code = SEGV_MAPERR;
 		si.si_signo = SIGSEGV;
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index b2e3e9a..92c7fa4 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -139,13 +139,41 @@ parisc_acctyp(unsigned long code, unsigned int inst)
 			}
 #endif
 
+int fixup_exception(struct pt_regs *regs)
+{
+	const struct exception_table_entry *fix;
+
+	fix = search_exception_tables(regs->iaoq[0]);
+	if (fix) {
+		struct exception_data *d;
+		d = &__get_cpu_var(exception_data);
+		d->fault_ip = regs->iaoq[0];
+		d->fault_space = regs->isr;
+		d->fault_addr = regs->ior;
+
+		regs->iaoq[0] = ((fix->fixup) & ~3);
+		/*
+		 * NOTE: In some cases the faulting instruction
+		 * may be in the delay slot of a branch. We
+		 * don't want to take the branch, so we don't
+		 * increment iaoq[1], instead we set it to be
+		 * iaoq[0]+4, and clear the B bit in the PSW
+		 */
+		regs->iaoq[1] = regs->iaoq[0] + 4;
+		regs->gr[0] &= ~PSW_B; /* IPSW in gr[0] */
+
+		return 1;
+	}
+
+	return 0;
+}
+
 void do_page_fault(struct pt_regs *regs, unsigned long code,
 			      unsigned long address)
 {
 	struct vm_area_struct *vma, *prev_vma;
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
-	const struct exception_table_entry *fix;
 	unsigned long acc_type;
 	int fault;
 
@@ -229,32 +257,8 @@ bad_area:
 
 no_context:
 
-	if (!user_mode(regs)) {
-		fix = search_exception_tables(regs->iaoq[0]);
-
-		if (fix) {
-			struct exception_data *d;
-
-			d = &__get_cpu_var(exception_data);
-			d->fault_ip = regs->iaoq[0];
-			d->fault_space = regs->isr;
-			d->fault_addr = regs->ior;
-
-			regs->iaoq[0] = ((fix->fixup) & ~3);
-
-			/*
-			 * NOTE: In some cases the faulting instruction
-			 * may be in the delay slot of a branch. We
-			 * don't want to take the branch, so we don't
-			 * increment iaoq[1], instead we set it to be
-			 * iaoq[0]+4, and clear the B bit in the PSW
-			 */
-
-			regs->iaoq[1] = regs->iaoq[0] + 4;
-			regs->gr[0] &= ~PSW_B; /* IPSW in gr[0] */

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-19 21:29 ` Kyle McMartin
@ 2008-12-19 22:59   ` Helge Deller
  2008-12-19 23:34     ` Kyle McMartin
  2008-12-21 15:20   ` John David Anglin
  1 sibling, 1 reply; 56+ messages in thread
From: Helge Deller @ 2008-12-19 22:59 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: linux-parisc

Kyle McMartin wrote:
> Honestly, I can't decide whether to apply this. It really should never
> happen in the kernel, since the kernel can guarantee it won't get the
> access rights failure (highest privilege level, and can set %sr and
> %protid to whatever it wants.)

Yes, that's the problem.
We catch a bug in the kernel which shouldn't be there (and which we
can't find yet).
If it wouldn't be there, we wouldn't need this patch..
Chicken and egg problem.

> It really genuinely is a bug that probably should panic the kernel. The
> only precedent I can easily see is x86 fixing up a bad iret with a
> general protection fault, which is more or less analogous to code 27
> here.
> 
> On the other hand, taking the exception on a userspace access really
> isn't all that critical, and there's fundamentally little reason for the
> kernel not to SIGSEGV the process, and continue...
> 
> Argh.

Argh. Argh :-)


> (btw, I've instrumented my do_sys_poll with a pile of assertions that
>  %cr8 << 1 == %sr3 == current->mm.context... let's see if where we're
>  getting corrupted is deterministic, though, I would guess that it won't
>  be.)

I'll continue to debug as well. I have some more ideas left to try...

Thanks for cleaning up my patch and splitting out the fixup_exception()
code.
Your patch is fine and I think you should apply it (with one minor
addition - see below).
The reason I think why you should apply it is, that we shouldn't let the
kernel crash on (correct) userspace apps.
Maybe adding a comment that this check in trap #27 should be removed
when the bug is fixed would be good though.

> diff --git a/arch/parisc/include/asm/uaccess.h b/arch/parisc/include/asm/uaccess.h
> index 4878b95..1c6dbb6 100644
> --- a/arch/parisc/include/asm/uaccess.h
> +++ b/arch/parisc/include/asm/uaccess.h
> @@ -745,6 +745,10 @@ void handle_interruption(int code, struct pt_regs *regs)
>  		/* Fall Through */
>  	case 27: 
>  		/* Data memory protection ID trap */
> +		if (code == 27 && !user_mode(regs) &&
> +			fixup_exception(regs))
> +                   return;

I think you should add a self-explaining printk(KERN_CRIT,..) here, so
that we can track when this bug appeared in userspace.
Something like:
printk("strange data protection ID trap happened to <procname> PID: xxx.
(this is a bug in the kernel)\n");
(you probably find a better wording)

Having a correctly-coded userspace process which suddenly just return
failures in syscalls is not good and would make people wonder and
unnecessarily try to debug their (correct) userspace apps...

Helge

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9
  2008-12-19 22:59   ` Helge Deller
@ 2008-12-19 23:34     ` Kyle McMartin
  2008-12-20 17:07       ` [PATCH] [RFC] fix kernel crash (protection id trap) when John David Anglin
  0 siblings, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-19 23:34 UTC (permalink / raw)
  To: Helge Deller; +Cc: Kyle McMartin, linux-parisc

On Fri, Dec 19, 2008 at 11:59:11PM +0100, Helge Deller wrote:
> Thanks for cleaning up my patch and splitting out the fixup_exception()
> code.
> Your patch is fine and I think you should apply it (with one minor
> addition - see below).
> The reason I think why you should apply it is, that we shouldn't let the
> kernel crash on (correct) userspace apps.
> Maybe adding a comment that this check in trap #27 should be removed
> when the bug is fixed would be good though.
> 

Hmm. You raise a good point that userspace is DTRT...
Can you test to see if this patch fixes things? My c3000 with an
unpatched kernel has suddenly decided it doesn't want to crash 
anymore...

diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
index 4c771cd..73058e8 100644
--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -745,6 +745,19 @@ void handle_interruption(int code, struct pt_regs *regs)
 		/* Fall Through */
 	case 27: 
 		/* Data memory protection ID trap */
+#define HANDLE_KERNEL_PROTECTION_TRAPS
+#ifdef HANDLE_KERNEL_PROTECTION_TRAPS
+		/* XXX very very evil */
+		if (code == 27 && !user_mode(regs)) {
+			/* bang the spaceid back into sr3 */
+			regs->sr[3] = (unsigned long)current->mm.context;
+			/* TODO? reprogram %cr8? load_context will get it. */
+			/* TODO? thread_struct.faulted count to catch
+			 *  taking the same fault on the same insn again?
+			 */
+			return;
+		}
+#endif /*HANDLE_KERNEL_PROTECTION_TRAPS*/
 		die_if_kernel("Protection id trap", regs, code);
 		si.si_code = SEGV_MAPERR;
 		si.si_signo = SIGSEGV;

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 23:34     ` Kyle McMartin
@ 2008-12-20 17:07       ` John David Anglin
  0 siblings, 0 replies; 56+ messages in thread
From: John David Anglin @ 2008-12-20 17:07 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: deller, kyle, linux-parisc

>  	case 27: 
>  		/* Data memory protection ID trap */
> +#define HANDLE_KERNEL_PROTECTION_TRAPS
> +#ifdef HANDLE_KERNEL_PROTECTION_TRAPS
> +		/* XXX very very evil */
> +		if (code == 27 && !user_mode(regs)) {

Should a check be added for regs->sr[3] != (unsigned long)current->mm.context?

> +			/* bang the spaceid back into sr3 */
> +			regs->sr[3] = (unsigned long)current->mm.context;

I seem to recall that James thought there was a problem with sr3 handling
a year or two ago.

> +			/* TODO? reprogram %cr8? load_context will get it. */
> +			/* TODO? thread_struct.faulted count to catch
> +			 *  taking the same fault on the same insn again?

Is this a general problem with parisc exception support?  The reason
I ask is it seems possible for an infinite loop to occur when an application
catches exceptions like bash.  It catches terminating signals so that it
can write the history file.  When terminate_immediately is zero, the
signal handler (termsig_sighandler) just records the terminating signal
and returns.  It doesn't reset the default handler for the signal or
specify SA_RESETHAND as far as I can tell.

It then appears the kernel restarts the application at the point of the
exception generating another exception.  This causes the kernel log files
to fill in short order.

I would say bash's behavior in trying to continue after a SEGV is questionable
but I also think the kernel shouldn't retry a faulting instruction unless
there's a chance the fault has been fixed.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-19 21:29 ` Kyle McMartin
  2008-12-19 22:59   ` Helge Deller
@ 2008-12-21 15:20   ` John David Anglin
  2008-12-21 17:27     ` John David Anglin
                       ` (2 more replies)
  1 sibling, 3 replies; 56+ messages in thread
From: John David Anglin @ 2008-12-21 15:20 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: deller, linux-parisc

> Honestly, I can't decide whether to apply this. It really should never
> happen in the kernel, since the kernel can guarantee it won't get the
> access rights failure (highest privilege level, and can set %sr and
> %protid to whatever it wants.)

I believe this change should be applied because we enable P and the
hardware can generate the exception.  Yes, it should never happen, but
the coding to prevent this occurring is very subtle.

> It really genuinely is a bug that probably should panic the kernel. The
> only precedent I can easily see is x86 fixing up a bad iret with a
> general protection fault, which is more or less analogous to code 27
> here.

I don't think we have enough information here.  A panic doesn't tell
us what we need to know...

> On the other hand, taking the exception on a userspace access really
> isn't all that critical, and there's fundamentally little reason for the
> kernel not to SIGSEGV the process, and continue...

Agreed, especially in a __put_user operation.

> (btw, I've instrumented my do_sys_poll with a pile of assertions that
>  %cr8 << 1 == %sr3 == current->mm.context... let's see if where we're
>  getting corrupted is deterministic, though, I would guess that it won't
>  be.)

Having looked at the code a bit, I have come to the conclusion that the
problem is not with the sr3 value.  We have a one-to-one mapping between
space register values and protection IDs.  TLB entries get inserted using
the space register value for the protection ID (see entry.S).  So, I think
we have to have inconsistent values in cr8 and sr3.

I think the bug may be in flush_user_cache_page_non_current.  It hijacks
sr3 temporarily and I don't think cr8 is updated when this is done.  The
switch may need to be atomic.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 15:20   ` John David Anglin
@ 2008-12-21 17:27     ` John David Anglin
  2008-12-21 21:33       ` Kyle McMartin
  2008-12-21 22:52       ` Helge Deller
  2008-12-21 22:20     ` Kyle McMartin
  2008-12-21 22:59     ` Helge Deller
  2 siblings, 2 replies; 56+ messages in thread
From: John David Anglin @ 2008-12-21 17:27 UTC (permalink / raw)
  To: John David Anglin; +Cc: kyle, deller, linux-parisc

> I think the bug may be in flush_user_cache_page_non_current.  It hijacks
> sr3 temporarily and I don't think cr8 is updated when this is done.  The
> switch may need to be atomic.

The following might fix the protection ID bug.

--- cache.c.orig	2008-07-17 21:24:46.000000000 -0400
+++ cache.c	2008-12-21 11:53:54.000000000 -0500
@@ -312,14 +312,14 @@
 
 	/* make us current */
 	mtctl(__pa(vma->vm_mm->pgd), 25);
-	mtsp(vma->vm_mm->context, 3);
+	load_context(vma->vm_mm->context);
 
 	flush_user_dcache_page(vmaddr);
 	if(vma->vm_flags & VM_EXEC)
 		flush_user_icache_page(vmaddr);
 
 	/* put the old current process back */
-	mtsp(space, 3);
+	load_context(space);
 	mtctl(pgd, 25);
 	preempt_enable();
 }

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 17:27     ` John David Anglin
@ 2008-12-21 21:33       ` Kyle McMartin
  2008-12-21 22:02         ` Kyle McMartin
  2008-12-21 22:11         ` John David Anglin
  2008-12-21 22:52       ` Helge Deller
  1 sibling, 2 replies; 56+ messages in thread
From: Kyle McMartin @ 2008-12-21 21:33 UTC (permalink / raw)
  To: John David Anglin; +Cc: kyle, deller, linux-parisc

On Sun, Dec 21, 2008 at 12:27:42PM -0500, John David Anglin wrote:
> > I think the bug may be in flush_user_cache_page_non_current.  It hijacks
> > sr3 temporarily and I don't think cr8 is updated when this is done.  The
> > switch may need to be atomic.
> 
> The following might fix the protection ID bug.
> 

I'm not convinced... this doesn't explain why we don't see this on
64-bit...

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 21:33       ` Kyle McMartin
@ 2008-12-21 22:02         ` Kyle McMartin
  2008-12-21 22:52           ` John David Anglin
  2008-12-21 22:11         ` John David Anglin
  1 sibling, 1 reply; 56+ messages in thread
From: Kyle McMartin @ 2008-12-21 22:02 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, deller, linux-parisc

On Sun, Dec 21, 2008 at 04:33:03PM -0500, Kyle McMartin wrote:
> On Sun, Dec 21, 2008 at 12:27:42PM -0500, John David Anglin wrote:
> > > I think the bug may be in flush_user_cache_page_non_current.  It hijacks
> > > sr3 temporarily and I don't think cr8 is updated when this is done.  The
> > > switch may need to be atomic.
> > 
> > The following might fix the protection ID bug.
> > 
> 
> I'm not convinced... this doesn't explain why we don't see this on
> 64-bit...
> 

Nope, no love... :\

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 21:33       ` Kyle McMartin
  2008-12-21 22:02         ` Kyle McMartin
@ 2008-12-21 22:11         ` John David Anglin
  1 sibling, 0 replies; 56+ messages in thread
From: John David Anglin @ 2008-12-21 22:11 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: kyle, deller, linux-parisc

> On Sun, Dec 21, 2008 at 12:27:42PM -0500, John David Anglin wrote:
> > > I think the bug may be in flush_user_cache_page_non_current.  It hijacks
> > > sr3 temporarily and I don't think cr8 is updated when this is done.  The
> > > switch may need to be atomic.
> > 
> > The following might fix the protection ID bug.
> > 
> 
> I'm not convinced... this doesn't explain why we don't see this on
> 64-bit...

Agreed.  However, we do have tlb issues on 64-bit and this is just about
the only place where sr3 is messed with and cr8 isn't correctly set.
I'm sure the bug is somewhat timing dependent.

There has to be something else that causes a userspace access
with inconsistent space and protection IDs.  The cache flush can cause
non access tlb misses, and there is some difference in tlb hardware,
but cr8 isn't involved in the tlb miss handling as far as I can see.

I'm running with the patch on 2.6.28-rc8 (SMP) and 2.6.22.19 (UP).

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 15:20   ` John David Anglin
  2008-12-21 17:27     ` John David Anglin
@ 2008-12-21 22:20     ` Kyle McMartin
  2008-12-21 22:48       ` John David Anglin
  2008-12-22 22:46       ` James Bottomley
  2008-12-21 22:59     ` Helge Deller
  2 siblings, 2 replies; 56+ messages in thread
From: Kyle McMartin @ 2008-12-21 22:20 UTC (permalink / raw)
  To: John David Anglin; +Cc: Kyle McMartin, deller, linux-parisc

On Sun, Dec 21, 2008 at 10:20:16AM -0500, John David Anglin wrote:
> > Honestly, I can't decide whether to apply this. It really should never
> > happen in the kernel, since the kernel can guarantee it won't get the
> > access rights failure (highest privilege level, and can set %sr and
> > %protid to whatever it wants.)
> 
> I believe this change should be applied because we enable P and the
> hardware can generate the exception.  Yes, it should never happen, but
> the coding to prevent this occurring is very subtle.
> 

If we're going to paper over a bug like this, we might as well just turn
the P-bit off...

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 22:20     ` Kyle McMartin
@ 2008-12-21 22:48       ` John David Anglin
  2008-12-22 22:46       ` James Bottomley
  1 sibling, 0 replies; 56+ messages in thread
From: John David Anglin @ 2008-12-21 22:48 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: kyle, deller, linux-parisc

> 
> On Sun, Dec 21, 2008 at 10:20:16AM -0500, John David Anglin wrote:
> > > Honestly, I can't decide whether to apply this. It really should never
> > > happen in the kernel, since the kernel can guarantee it won't get the
> > > access rights failure (highest privilege level, and can set %sr and
> > > %protid to whatever it wants.)
> > 
> > I believe this change should be applied because we enable P and the
> > hardware can generate the exception.  Yes, it should never happen, but
> > the coding to prevent this occurring is very subtle.
> > 
> 
> If we're going to paper over a bug like this, we might as well just turn
> the P-bit off...

I just want a segv instead of a panic as I don't think we know whether
the kernel is bad or not.

This bug has been around for for some time.  I believe that it was reported
in 2004, or earlier.

Even HP has problems.  Saw this today:

Release Date: Nov 19 2008
Description :   PHKL_39110: ( QX:QXCR1000827621 ) The system can panic with
 the following panic string and stack trace. Panic string : Data memory
 protection/access rights/alignment fault FUNC panic+0x6c
 report_trap_or_int_and_panic+0x94 interrupt+0x4e4 ihandler+0x940
+-------------  TRAP  ----------------------------
|  Trap type 18 in KERNEL mode at 0x38f50 (lbcopy+0x1f0)
|  p struct save_state 0.0x189cc60
+-------------  TRAP  ----------------

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 17:27     ` John David Anglin
  2008-12-21 21:33       ` Kyle McMartin
@ 2008-12-21 22:52       ` Helge Deller
  2008-12-21 22:58         ` John David Anglin
  2008-12-22 22:43         ` James Bottomley
  1 sibling, 2 replies; 56+ messages in thread
From: Helge Deller @ 2008-12-21 22:52 UTC (permalink / raw)
  To: John David Anglin; +Cc: kyle, linux-parisc

John David Anglin wrote:
>> I think the bug may be in flush_user_cache_page_non_current.  It hijacks
>> sr3 temporarily and I don't think cr8 is updated when this is done.  The
>> switch may need to be atomic.
> 
> The following might fix the protection ID bug.
> 
> --- cache.c.orig	2008-07-17 21:24:46.000000000 -0400
> +++ cache.c	2008-12-21 11:53:54.000000000 -0500
> @@ -312,14 +312,14 @@
>  
>  	/* make us current */
>  	mtctl(__pa(vma->vm_mm->pgd), 25);
> -	mtsp(vma->vm_mm->context, 3);
> +	load_context(vma->vm_mm->context);
>  
>  	flush_user_dcache_page(vmaddr);
>  	if(vma->vm_flags & VM_EXEC)
>  		flush_user_icache_page(vmaddr);
>  
>  	/* put the old current process back */
> -	mtsp(space, 3);
> +	load_context(space);

I came to the similar conclusion and tried exactly this patch earlier
today. It didn't fixed the problem (although I had the feeling that the
bug didn't appeared as often then).

Helge


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 22:02         ` Kyle McMartin
@ 2008-12-21 22:52           ` John David Anglin
  0 siblings, 0 replies; 56+ messages in thread
From: John David Anglin @ 2008-12-21 22:52 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: kyle, deller, linux-parisc

> Nope, no love... :\

Sigh, I knew it.

Did any of your instrumentation trigger?

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 22:52       ` Helge Deller
@ 2008-12-21 22:58         ` John David Anglin
  2008-12-21 23:08           ` Helge Deller
  2008-12-22 22:43         ` James Bottomley
  1 sibling, 1 reply; 56+ messages in thread
From: John David Anglin @ 2008-12-21 22:58 UTC (permalink / raw)
  To: Helge Deller; +Cc: kyle, linux-parisc

> > -	mtsp(space, 3);
> > +	load_context(space);
> 
> I came to the similar conclusion and tried exactly this patch earlier
> today. It didn't fixed the problem (although I had the feeling that the
> bug didn't appeared as often then).

Ok, then maybe load_context needs to be atomic.  This is a bit tricky
because we may have to ensure that no tlb misses are triggered (relied
upon translation) during the update.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 15:20   ` John David Anglin
  2008-12-21 17:27     ` John David Anglin
  2008-12-21 22:20     ` Kyle McMartin
@ 2008-12-21 22:59     ` Helge Deller
  2 siblings, 0 replies; 56+ messages in thread
From: Helge Deller @ 2008-12-21 22:59 UTC (permalink / raw)
  To: John David Anglin; +Cc: Kyle McMartin, linux-parisc

John David Anglin wrote:
> Having looked at the code a bit, I have come to the conclusion that the
> problem is not with the sr3 value.  

In all debugging I added I could _always_ see that sr3 had the wrong
values. cr8 and mm->context were always correct.
Interestingly it was often just a few numbers off in the lower 4 bits,
e.g. sr3 should have been 0x1a8 but was 0x1a2 (just an example).

> We have a one-to-one mapping between
> space register values and protection IDs.  TLB entries get inserted using
> the space register value for the protection ID (see entry.S).  So, I think
> we have to have inconsistent values in cr8 and sr3.

Again, only sr3 is wrong. cr8 is correct.

> I think the bug may be in flush_user_cache_page_non_current.  It hijacks
> sr3 temporarily and I don't think cr8 is updated when this is done.  The
> switch may need to be atomic.

I tried that as well, e.g. adding local_irq_[en|dis]able didn't worked
(it crashed IIRC).

Helge

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 22:58         ` John David Anglin
@ 2008-12-21 23:08           ` Helge Deller
  2008-12-22  0:07             ` John David Anglin
  0 siblings, 1 reply; 56+ messages in thread
From: Helge Deller @ 2008-12-21 23:08 UTC (permalink / raw)
  To: John David Anglin; +Cc: kyle, linux-parisc

John David Anglin wrote:
>>> -	mtsp(space, 3);
>>> +	load_context(space);
>> I came to the similar conclusion and tried exactly this patch earlier
>> today. It didn't fixed the problem (although I had the feeling that the
>> bug didn't appeared as often then).
> 
> Ok, then maybe load_context needs to be atomic.  This is a bit tricky
> because we may have to ensure that no tlb misses are triggered (relied
> upon translation) during the update.

I'll try tomorrow.

What makes me wondering:

a) the bug always triggers AFAICS with applications which uses threads
(for the ruby1.9 problem it's always the miniruby process). Maybe the
problem happens to something being wrong in the signal handler with
threadened applications, e.g. arch/parisc/kernel/signal.c:648 ?

b) maybe stupid question: In the case it's a generic processor problem,
would e.g. changing the kernel to use sr4 instead of sr3 for
userspace-accesses change something? What does HPUX uses? At least one
could try...?


Helge

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 23:08           ` Helge Deller
@ 2008-12-22  0:07             ` John David Anglin
  0 siblings, 0 replies; 56+ messages in thread
From: John David Anglin @ 2008-12-22  0:07 UTC (permalink / raw)
  To: Helge Deller; +Cc: kyle, linux-parisc

> John David Anglin wrote:
> >>> -	mtsp(space, 3);
> >>> +	load_context(space);
> >> I came to the similar conclusion and tried exactly this patch earlier
> >> today. It didn't fixed the problem (although I had the feeling that the
> >> bug didn't appeared as often then).
> > 
> > Ok, then maybe load_context needs to be atomic.  This is a bit tricky
> > because we may have to ensure that no tlb misses are triggered (relied
> > upon translation) during the update.
> 
> I'll try tomorrow.

Another thing that I'm wondering about is the following.  The tlb miss
handlers assume the following:

	* cr24 contains a pointer to the kernel address space
	* page directory.
	*
	* cr25 contains a pointer to the current user address
	* space page directory.
	*
	* sr3 will contain the space id of the user address space
	* of the current running thread while that thread is
	* running in the kernel.

Possibly, load_context needs to update cr25 as well.  Assume cr24
never changes.

> What makes me wondering:
> 
> a) the bug always triggers AFAICS with applications which uses threads
> (for the ruby1.9 problem it's always the miniruby process). Maybe the
> problem happens to something being wrong in the signal handler with
> threadened applications, e.g. arch/parisc/kernel/signal.c:648 ?

In my gcc builds, it's bash and make that experience the majority of
unexplained segvs.  There is an issue with the signal handler and bash
that causes a loop, however I think the initial fault was caused by a
tlb issue.

> b) maybe stupid question: In the case it's a generic processor problem,
> would e.g. changing the kernel to use sr4 instead of sr3 for
> userspace-accesses change something? What does HPUX uses? At least one
> could try...?

Personally, it's not clear to me that this is just a problem with kernel
userspace accesses.  If sr3 is corrupt in the kernel, sr7 will be corrupt
in userspace.  Think the only thing special about sr3 is that the kernel
changes it for cache flushes, forks, etc.

Your comment that it's sr3 that's wrong suggests a problem with context
switches, particularly since the corrupt value is close to the correct
value.  If sr3 and cr8 are still inconsistent after the patch to
flush_user_cache_page_non_current, we must be missing a mechanism that
updates sr3.  One thought is to load cr8 before sr3 in load_context and
see what happens.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 22:52       ` Helge Deller
  2008-12-21 22:58         ` John David Anglin
@ 2008-12-22 22:43         ` James Bottomley
  2008-12-22 22:46           ` Kyle McMartin
  2008-12-23  2:31           ` John David Anglin
  1 sibling, 2 replies; 56+ messages in thread
From: James Bottomley @ 2008-12-22 22:43 UTC (permalink / raw)
  To: Helge Deller; +Cc: John David Anglin, kyle, linux-parisc

On Sun, 2008-12-21 at 23:52 +0100, Helge Deller wrote:
> John David Anglin wrote:
> >> I think the bug may be in flush_user_cache_page_non_current.  It hijacks
> >> sr3 temporarily and I don't think cr8 is updated when this is done.  The
> >> switch may need to be atomic.
> > 
> > The following might fix the protection ID bug.
> > 
> > --- cache.c.orig	2008-07-17 21:24:46.000000000 -0400
> > +++ cache.c	2008-12-21 11:53:54.000000000 -0500
> > @@ -312,14 +312,14 @@
> >  
> >  	/* make us current */
> >  	mtctl(__pa(vma->vm_mm->pgd), 25);
> > -	mtsp(vma->vm_mm->context, 3);
> > +	load_context(vma->vm_mm->context);
> >  
> >  	flush_user_dcache_page(vmaddr);
> >  	if(vma->vm_flags & VM_EXEC)
> >  		flush_user_icache_page(vmaddr);
> >  
> >  	/* put the old current process back */
> > -	mtsp(space, 3);
> > +	load_context(space);
> 
> I came to the similar conclusion and tried exactly this patch earlier
> today. It didn't fixed the problem (although I had the feeling that the
> bug didn't appeared as often then).

Actually, it should bug more often.  This function:
flush_user_cache_page_non_current() is very rarely called (which is
hopefully why you don't see an increase in bugs).  However, this is a
kernel function ... if you call load_context() here, you'll get the user
protection IDs in the register and it will immediately fault when it
returns to the kernel.  All it should be doing (which is what it
currently does) is to set up sr3 to allow the kernel to poke into a user
address space, which is the design of the function.

James



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-21 22:20     ` Kyle McMartin
  2008-12-21 22:48       ` John David Anglin
@ 2008-12-22 22:46       ` James Bottomley
  2008-12-22 22:47         ` Kyle McMartin
  1 sibling, 1 reply; 56+ messages in thread
From: James Bottomley @ 2008-12-22 22:46 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, deller, linux-parisc

On Sun, 2008-12-21 at 17:20 -0500, Kyle McMartin wrote:
> On Sun, Dec 21, 2008 at 10:20:16AM -0500, John David Anglin wrote:
> > > Honestly, I can't decide whether to apply this. It really should never
> > > happen in the kernel, since the kernel can guarantee it won't get the
> > > access rights failure (highest privilege level, and can set %sr and
> > > %protid to whatever it wants.)
> > 
> > I believe this change should be applied because we enable P and the
> > hardware can generate the exception.  Yes, it should never happen, but
> > the coding to prevent this occurring is very subtle.
> > 
> 
> If we're going to paper over a bug like this, we might as well just turn
> the P-bit off...

I'm afraid you can't.  The invisibility of the kernel space from user
space relies on this, as does quite a lot of the page protections.

James



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-22 22:43         ` James Bottomley
@ 2008-12-22 22:46           ` Kyle McMartin
  2008-12-23  2:31           ` John David Anglin
  1 sibling, 0 replies; 56+ messages in thread
From: Kyle McMartin @ 2008-12-22 22:46 UTC (permalink / raw)
  To: James Bottomley; +Cc: Helge Deller, John David Anglin, kyle, linux-parisc

On Mon, Dec 22, 2008 at 04:43:36PM -0600, James Bottomley wrote:
> Actually, it should bug more often.  This function:
> flush_user_cache_page_non_current() is very rarely called (which is
> hopefully why you don't see an increase in bugs).  However, this is a
> kernel function ... if you call load_context() here, you'll get the user
> protection IDs in the register and it will immediately fault when it
> returns to the kernel.  All it should be doing (which is what it
> currently does) is to set up sr3 to allow the kernel to poke into a user
> address space, which is the design of the function.
> 

Well, it will work by accident, since the other 3 (or 7) prot id
registers will still be zeroed, since we only ever touch %cr8...

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-22 22:46       ` James Bottomley
@ 2008-12-22 22:47         ` Kyle McMartin
  0 siblings, 0 replies; 56+ messages in thread
From: Kyle McMartin @ 2008-12-22 22:47 UTC (permalink / raw)
  To: James Bottomley; +Cc: Kyle McMartin, John David Anglin, deller, linux-parisc

On Mon, Dec 22, 2008 at 04:46:30PM -0600, James Bottomley wrote:
> On Sun, 2008-12-21 at 17:20 -0500, Kyle McMartin wrote:
> > On Sun, Dec 21, 2008 at 10:20:16AM -0500, John David Anglin wrote:
> > > > Honestly, I can't decide whether to apply this. It really should never
> > > > happen in the kernel, since the kernel can guarantee it won't get the
> > > > access rights failure (highest privilege level, and can set %sr and
> > > > %protid to whatever it wants.)
> > > 
> > > I believe this change should be applied because we enable P and the
> > > hardware can generate the exception.  Yes, it should never happen, but
> > > the coding to prevent this occurring is very subtle.
> > > 
> > 
> > If we're going to paper over a bug like this, we might as well just turn
> > the P-bit off...
> 
> I'm afraid you can't.  The invisibility of the kernel space from user
> space relies on this, as does quite a lot of the page protections.
> 

Access right bits are still enforced, just the access id checking is
disabled when the P-bit is off.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-22 22:43         ` James Bottomley
  2008-12-22 22:46           ` Kyle McMartin
@ 2008-12-23  2:31           ` John David Anglin
  2008-12-23  2:54             ` Kyle McMartin
  1 sibling, 1 reply; 56+ messages in thread
From: John David Anglin @ 2008-12-23  2:31 UTC (permalink / raw)
  To: James Bottomley; +Cc: deller, kyle, linux-parisc

> Actually, it should bug more often.  This function:
> flush_user_cache_page_non_current() is very rarely called (which is
> hopefully why you don't see an increase in bugs).  However, this is a
> kernel function ... if you call load_context() here, you'll get the user
> protection IDs in the register and it will immediately fault when it
> returns to the kernel.  All it should be doing (which is what it
> currently does) is to set up sr3 to allow the kernel to poke into a user
> address space, which is the design of the function.

As things stand, the call is fully inlined.  Nothing can be poked into
the user address space without the protection ID register being correctly
set.  The cache flushes can trigger a non-access tlb miss fault.  At
the moment, I can't see how this function could cause a protection ID fault.
However, it still seems good practice to set cr8 consistently when the
user context is changed.

The kernel uses space register values of zero, so it shouldn't fault
because of the change to sr3 and cr8 for non current.  The values are
restored before flush_user_cache_page_non_current() exits.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-23  2:31           ` John David Anglin
@ 2008-12-23  2:54             ` Kyle McMartin
  2008-12-23  3:15               ` John David Anglin
  2008-12-23 13:13               ` John David Anglin
  0 siblings, 2 replies; 56+ messages in thread
From: Kyle McMartin @ 2008-12-23  2:54 UTC (permalink / raw)
  To: John David Anglin; +Cc: James Bottomley, deller, kyle, linux-parisc

On Mon, Dec 22, 2008 at 09:31:10PM -0500, John David Anglin wrote:
> > Actually, it should bug more often.  This function:
> > flush_user_cache_page_non_current() is very rarely called (which is
> > hopefully why you don't see an increase in bugs).  However, this is a
> > kernel function ... if you call load_context() here, you'll get the user
> > protection IDs in the register and it will immediately fault when it
> > returns to the kernel.  All it should be doing (which is what it
> > currently does) is to set up sr3 to allow the kernel to poke into a user
> > address space, which is the design of the function.
> 
> As things stand, the call is fully inlined.  Nothing can be poked into
> the user address space without the protection ID register being correctly
> set.  The cache flushes can trigger a non-access tlb miss fault.  At
> the moment, I can't see how this function could cause a protection ID fault.
> However, it still seems good practice to set cr8 consistently when the
> user context is changed.
> 
> The kernel uses space register values of zero, so it shouldn't fault
> because of the change to sr3 and cr8 for non current.  The values are
> restored before flush_user_cache_page_non_current() exits.
> 

just fyi, jejb sleuthed out that it was flush_tlb_mm's non-SMP path
(which "flushes" the tlb by allocating a new spaceid) that was causing
the problem, and helge verified that it fixes the bug.

i'm confirming now with a make && make clean loop building ruby, if it
survives the night, i'll commit the patch tomorrow and push it to stable
until we have a clean(er) fix.

diff --git a/arch/parisc/include/asm/tlbflush.h b/arch/parisc/include/asm/tlbflush.h
index b72ec66..1f6fd4f 100644
--- a/arch/parisc/include/asm/tlbflush.h
+++ b/arch/parisc/include/asm/tlbflush.h
@@ -44,9 +44,12 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 {
 	BUG_ON(mm == &init_mm); /* Should never happen */
 
-#ifdef CONFIG_SMP
+#if 1 || defined(CONFIG_SMP)
 	flush_tlb_all();
 #else
+	/* FIXME: currently broken, causing space id and protection ids
+	 *  to go out of sync, resulting in faults on userspace accesses.
+	 */
 	if (mm) {
 		if (mm->context != 0)
 			free_sid(mm->context);

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-23  2:54             ` Kyle McMartin
@ 2008-12-23  3:15               ` John David Anglin
  2008-12-23 13:13               ` John David Anglin
  1 sibling, 0 replies; 56+ messages in thread
From: John David Anglin @ 2008-12-23  3:15 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: James.Bottomley, deller, kyle, linux-parisc

> just fyi, jejb sleuthed out that it was flush_tlb_mm's non-SMP path
> (which "flushes" the tlb by allocating a new spaceid) that was causing
> the problem, and helge verified that it fixes the bug.

That's great news!  I'd noticed this code tonight and wondered why it
wasn't loading cr25.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [RFC] fix kernel crash (protection id trap) when
  2008-12-23  2:54             ` Kyle McMartin
  2008-12-23  3:15               ` John David Anglin
@ 2008-12-23 13:13               ` John David Anglin
  1 sibling, 0 replies; 56+ messages in thread
From: John David Anglin @ 2008-12-23 13:13 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: James.Bottomley, deller, kyle, linux-parisc

> just fyi, jejb sleuthed out that it was flush_tlb_mm's non-SMP path
> (which "flushes" the tlb by allocating a new spaceid) that was causing
> the problem, and helge verified that it fixes the bug.

It seems this code can be called from drivers.  Maybe the context
update needs to be surrounded by task_lock(current)...task_unlock(current).

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2008-12-23 13:13 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-17 22:46 [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Helge Deller
2008-12-18  0:05 ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling John David Anglin
2008-12-18  0:43   ` Kyle McMartin
2008-12-18  3:02     ` [PATCH] [RFC] fix kernel crash (protection id trap) when John David Anglin
2008-12-18  3:05       ` Kyle McMartin
2008-12-18  4:04       ` Kyle McMartin
2008-12-18 16:16         ` Carlos O'Donell
2008-12-18 18:28         ` Kyle McMartin
2008-12-19 15:25           ` Carlos O'Donell
2008-12-19 16:13             ` John David Anglin
2008-12-19 16:23               ` Carlos O'Donell
2008-12-19 16:28                 ` Kyle McMartin
2008-12-19 16:35                   ` Carlos O'Donell
2008-12-19 16:36                     ` Carlos O'Donell
2008-12-19 16:44                       ` Kyle McMartin
2008-12-19 17:28                         ` Carlos O'Donell
2008-12-19 17:33                           ` Kyle McMartin
2008-12-19 17:36                             ` Carlos O'Donell
2008-12-19 17:39                               ` Kyle McMartin
2008-12-19 17:42                                 ` Kyle McMartin
2008-12-19 18:43                                   ` Carlos O'Donell
2008-12-18  1:46 ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Carlos O'Donell
2008-12-18  2:02   ` Carlos O'Donell
2008-12-18 13:13     ` Helge Deller
2008-12-18 15:28       ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling John David Anglin
2008-12-18 16:09         ` Carlos O'Donell
2008-12-18  2:36   ` [PATCH] [RFC] fix kernel crash (protection id trap) when compiling ruby1.9 Kyle McMartin
2008-12-18 12:21     ` Carlos O'Donell
2008-12-18  7:03 ` Kyle McMartin
2008-12-18 13:09   ` Helge Deller
2008-12-18 15:05     ` Kyle McMartin
2008-12-19 21:29 ` Kyle McMartin
2008-12-19 22:59   ` Helge Deller
2008-12-19 23:34     ` Kyle McMartin
2008-12-20 17:07       ` [PATCH] [RFC] fix kernel crash (protection id trap) when John David Anglin
2008-12-21 15:20   ` John David Anglin
2008-12-21 17:27     ` John David Anglin
2008-12-21 21:33       ` Kyle McMartin
2008-12-21 22:02         ` Kyle McMartin
2008-12-21 22:52           ` John David Anglin
2008-12-21 22:11         ` John David Anglin
2008-12-21 22:52       ` Helge Deller
2008-12-21 22:58         ` John David Anglin
2008-12-21 23:08           ` Helge Deller
2008-12-22  0:07             ` John David Anglin
2008-12-22 22:43         ` James Bottomley
2008-12-22 22:46           ` Kyle McMartin
2008-12-23  2:31           ` John David Anglin
2008-12-23  2:54             ` Kyle McMartin
2008-12-23  3:15               ` John David Anglin
2008-12-23 13:13               ` John David Anglin
2008-12-21 22:20     ` Kyle McMartin
2008-12-21 22:48       ` John David Anglin
2008-12-22 22:46       ` James Bottomley
2008-12-22 22:47         ` Kyle McMartin
2008-12-21 22:59     ` Helge Deller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.