* Re: [parisc-linux] Debugging 64-bit kernel crashes involving
       [not found] <200702240039.l1O0dCxG017471@hiauly1.hia.nrc.ca>
@ 2007-02-24  0:45 ` James Bottomley
       [not found] ` <1172277920.3424.83.camel@mulgrave.il.steeleye.com>
  1 sibling, 0 replies; 9+ messages in thread
From: James Bottomley @ 2007-02-24  0:45 UTC
  To: John David Anglin; +Cc: parisc-linux

On Fri, 2007-02-23 at 19:39 -0500, John David Anglin wrote:
> Just wondering why a protection id trap causes the system to eat
> sparcs...

Because we're in the kernel ... that's what die_if_kernel() does.

> Is this a combined TLB issue?

It's clearly an Access ID mismatch ... but why I don't know; it's either
the TLB access ID or %cr8, which is what I asked Carlos to debug.

James


_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux


* Re: [parisc-linux] Debugging 64-bit kernel crashes involving
       [not found]   ` <119aab440702251819u1a070512j2cb8ca2eeadf0962@mail.gmail.com>
@ 2007-02-26  3:53     ` James Bottomley
       [not found]     ` <1172461983.3423.34.camel@mulgrave.il.steeleye.com>
  1 sibling, 0 replies; 9+ messages in thread
From: James Bottomley @ 2007-02-26  3:53 UTC
  To: Carlos O'Donell; +Cc: John David Anglin, parisc-linux

On Sun, 2007-02-25 at 21:19 -0500, Carlos O'Donell wrote:
> current->comm, and cr8 included in the dump.

Thanks!

> sr00-03  0000000000077000 0000000000000000 0000000000000000 0000000000077000
> sr04-07  0000000000077000 0000000000077000 0000000000077000 0000000000077000
> 
> IASQ: 0000000000077000 0000000000077000 IAOQ: 0000000040723f2c 0000000040723f2f
>  IIR: 43ffff40    ISR: 0000000000000000  IOR: 0000000000000000
>  CPU:        0   CR30: 000000009af10000 CR31: 0000000040848000
>  ORIG_R28: 0000000042c774c8
>  CR8: 00000000000001e0

Perfect, 0x1e0 on a 64 bit kernel (where SPACEID_SHIFT is 11) is
0x78000, but the space registers all hold 0x77000.  This means that the
protection and the space are indeed mismatched ... we just have to find
out how, sigh!
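
To make the arithmetic explicit, here is a minimal sketch of the
conversion between a space ID and the value loaded into %cr8.  It
assumes that %cr8 carries the protection ID shifted left by one bit
(the low bit being the write-disable bit), so the net shift is
SPACEID_SHIFT - 1.  That assumption is inferred from the numbers in the
dump (CR8 of 0x1e0 against a space of 0x78000); the function names are
made up for illustration and are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Value quoted in this thread for a 64-bit kernel. */
#define SPACEID_SHIFT 11

/* ASSUMPTION: %cr8 = protection ID << 1, so the net shift between a
 * space ID and its %cr8 value is SPACEID_SHIFT - 1. */
static uint64_t space_to_prot(uint64_t space)
{
	assert(SPACEID_SHIFT >= 1);
	return space >> (SPACEID_SHIFT - 1);
}

/* Inverse mapping: recover the space ID that a %cr8 value protects. */
static uint64_t prot_to_space(uint64_t cr8)
{
	assert(SPACEID_SHIFT >= 1);
	return cr8 << (SPACEID_SHIFT - 1);
}
```

With these helpers, prot_to_space(0x1e0) gives 0x78000, while the space
registers in the dump hold 0x77000, whose matching %cr8 value would be
0x1dc.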

>  current->comm: 000000009af253e8

Actually, current->comm should be printed as a %s: it's the name of the
faulting process.

>  IAOQ[0]: cuda_images+0xc14/0x3180
>  IAOQ[1]: cuda_images+0xc14/0x3180
>  RP(r2): 0x40bf0e1c

James



* Re: [parisc-linux] Debugging 64-bit kernel crashes involving
       [not found]     ` <1172461983.3423.34.camel@mulgrave.il.steeleye.com>
@ 2007-02-26  5:19       ` James Bottomley
       [not found]       ` <1172467172.3423.39.camel@mulgrave.il.steeleye.com>
  1 sibling, 0 replies; 9+ messages in thread
From: James Bottomley @ 2007-02-26  5:19 UTC
  To: Carlos O'Donell; +Cc: John David Anglin, parisc-linux

On Sun, 2007-02-25 at 21:53 -0600, James Bottomley wrote:
> On Sun, 2007-02-25 at 21:19 -0500, Carlos O'Donell wrote:
> > current->comm, and cr8 included in the dump.
> 
> Thanks!
> 
> > sr00-03  0000000000077000 0000000000000000 0000000000000000 0000000000077000
> > sr04-07  0000000000077000 0000000000077000 0000000000077000 0000000000077000
> > 
> > IASQ: 0000000000077000 0000000000077000 IAOQ: 0000000040723f2c 0000000040723f2f
> >  IIR: 43ffff40    ISR: 0000000000000000  IOR: 0000000000000000
> >  CPU:        0   CR30: 000000009af10000 CR31: 0000000040848000
> >  ORIG_R28: 0000000042c774c8
> >  CR8: 00000000000001e0
> 
> Perfect, 0x1e0 on a 64 bit kernel (where SPACEID_SHIFT is 11) is
> 0x78000.  This means that the protection and the space are indeed
> mismatched ... we just have to find out how, sigh!

OK, I have a theory.  It has to do with the way we do flush_tlb_mm by
incrementing the spaceid.  This works in a single space per process
model.  However, a process with multiple threads has more than one
scheduling context, and they all share one space.  So, the theory goes
that when we fork from a thread, we execute flush_tlb_mm, which bumps
the context (space).  Then we schedule another thread in the same
process.  However, this picks up its space registers from the task
rather than from mm->context, so it runs in the old space.  Now, the
load context has updated %cr8, the protection ID.  However, %cr8 isn't
part of the task context, so we end up executing in the old space with
the protection of the new one ... resulting in a protection ID trap.
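
The sequence can be modelled in a few lines of userspace C.  This is
only a sketch of the theory, not kernel code: the structs, the
function names and the 0x1000 space increment are all hypothetical
(the increment is chosen to match the 0x77000 and 0x78000 values in
the dump), and the %cr8 encoding assumes a net shift of
SPACEID_SHIFT - 1.

```c
#include <assert.h>
#include <stdint.h>

#define SPACEID_SHIFT 11
#define SPACE_STEP    0x1000ULL	/* hypothetical increment per flush */

struct mm_model   { uint64_t context; };	/* stands in for mm->context */
struct task_model { struct mm_model *mm; uint64_t saved_sr3; };
struct cpu_model  { uint64_t sr3; uint64_t cr8; };

/* Model of flush_tlb_mm: invalidate by moving the mm to a new space. */
static void flush_tlb_mm_model(struct mm_model *mm)
{
	mm->context += SPACE_STEP;
}

/* Model of a context switch: space registers come from the task,
 * the protection ID comes from the shared mm. */
static void switch_in(struct cpu_model *cpu, const struct task_model *t)
{
	assert(t->mm != 0);
	cpu->sr3 = t->saved_sr3;
	cpu->cr8 = t->mm->context >> (SPACEID_SHIFT - 1);
}

/* Nonzero when the protection ID matches the space being executed in. */
static int prot_matches(const struct cpu_model *cpu)
{
	return (cpu->sr3 >> (SPACEID_SHIFT - 1)) == cpu->cr8;
}
```

Switching in a thread before the flush leaves sr3 and %cr8 in
agreement.  After flush_tlb_mm_model() runs on the shared mm, switching
in a sibling thread whose saved_sr3 is still 0x77000 yields a %cr8 of
0x1e0 against a space of 0x77000, which is the mismatch in the dump.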

James



* Re: [parisc-linux] Debugging 64-bit kernel crashes involving
       [not found]       ` <1172467172.3423.39.camel@mulgrave.il.steeleye.com>
@ 2007-02-28  0:02         ` James Bottomley
       [not found]         ` <1172620930.3408.46.camel@mulgrave.il.steeleye.com>
  1 sibling, 0 replies; 9+ messages in thread
From: James Bottomley @ 2007-02-28  0:02 UTC
  To: Carlos O'Donell; +Cc: John David Anglin, parisc-linux

On Sun, 2007-02-25 at 23:19 -0600, James Bottomley wrote:
> OK, I have a theory.  It has to do with the way we do flush_tlb_mm by
> incrementing the spaceid.  This works in a single space per process
> model.  However, a process with multiple threads has >1 scheduling
> context which share spaces.  So, the theory goes that when we fork from
> a thread, we execute flush_tlb_mm which bumps the context (space).  Then
> we schedule another thread in the same process.  However, this picks up
> its space registers from the task rather than the mm->context, so it
> uses the old mm.  Now, the load context has updated %cr8, the protection
> ID.  However %cr8 isn't part of the task context, so we end up executing
> in the old context with the protection of the new one ... resulting in a
> protection ID trap.

Based on the theory, I managed to reproduce the problem on ioz (you just
have to increase N to be much greater than the number of CPUs you have)
and tried a little fix, which seems to work for ioz.  Could you try this
out on your a500?

Thanks,

James
Index: BUILD-2.6/arch/parisc/kernel/process.c
===================================================================
--- BUILD-2.6.orig/arch/parisc/kernel/process.c	2007-02-27 15:52:54.000000000 -0800
+++ BUILD-2.6/arch/parisc/kernel/process.c	2007-02-27 15:57:24.000000000 -0800
@@ -395,3 +395,30 @@ get_wchan(struct task_struct *p)
 	} while (count++ < 16);
 	return 0;
 }
+
+struct task_struct *__switch_to(struct task_struct *prev,
+			       struct task_struct *next)
+{
+	unsigned long sr3;
+	unsigned long newsr3 = mfsp(3);
+	struct pt_regs *regs = &next->thread.regs;
+
+	/* need to be executing in user context */
+	if (regs->iasq[0] != 0 || regs->iasq[1] != 0) {
+		sr3 = regs->sr[7];
+
+		/* need our current space to be different from our
+		 * new one.  Note, this trips a lot if we're in a
+		 * syscall not an interrupt from userspace, but in the
+		 * syscall case, this is a nop since the space is
+		 * explicitly reconstructed on return from syscall */
+		if (unlikely(sr3 != 0 && sr3 != newsr3)) {
+			int i;
+
+			for (i = 0; i < 8; i++)
+				if (regs->sr[i] == sr3)
+					regs->sr[i] = newsr3;
+		}
+	}
+	return _switch_to(prev, next);
+}
Index: BUILD-2.6/include/asm-parisc/system.h
===================================================================
--- BUILD-2.6.orig/include/asm-parisc/system.h	2007-02-27 15:53:12.000000000 -0800
+++ BUILD-2.6/include/asm-parisc/system.h	2007-02-27 15:54:33.000000000 -0800
@@ -43,9 +43,10 @@ struct pa_psw {
 struct task_struct;
 
 extern struct task_struct *_switch_to(struct task_struct *, struct task_struct *);
+extern struct task_struct *__switch_to(struct task_struct *, struct task_struct *);
 
 #define switch_to(prev, next, last) do {			\
-	(last) = _switch_to(prev, next);			\
+	(last) = __switch_to(prev, next);			\
 } while(0)
 
 /*
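
For anyone who wants to poke at the register rewrite outside the
kernel, the core of __switch_to() above can be lifted into a
standalone sketch.  The struct and function names here are
hypothetical stand-ins (this is not pt_regs and does not call
_switch_to); only the comparison logic follows the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SR 8

/* Hypothetical stand-in for the saved registers of the next task. */
struct regs_model {
	uint64_t iasq[2];
	uint64_t sr[NUM_SR];
};

/* Rewrite every saved space register that still holds the task's old
 * user space so that it holds the CPU's current space.  Returns the
 * number of registers changed. */
static int remap_stale_space(struct regs_model *regs, uint64_t newsr3)
{
	uint64_t oldsr;
	int i, changed = 0;

	assert(regs != 0);

	/* Only tasks that will resume in user context are touched. */
	if (regs->iasq[0] == 0 && regs->iasq[1] == 0)
		return 0;

	/* As in the patch, the old user space is read from sr[7]. */
	oldsr = regs->sr[7];
	if (oldsr == 0 || oldsr == newsr3)
		return 0;

	for (i = 0; i < NUM_SR; i++) {
		if (regs->sr[i] == oldsr) {
			regs->sr[i] = newsr3;
			changed++;
		}
	}
	return changed;
}
```

Fed the space registers from Carlos's dump (0x77000 everywhere except
sr1 and sr2) with a current space of 0x78000, this rewrites six
registers and leaves the two zero entries alone; a second call is then
a no-op.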




* Re: [parisc-linux] Debugging 64-bit kernel crashes involving
       [not found]         ` <1172620930.3408.46.camel@mulgrave.il.steeleye.com>
@ 2007-03-04  1:05           ` John David Anglin
  2007-03-08 19:31           ` Kyle McMartin
       [not found]           ` <20070308193119.GA3553@athena.road.mcmartin.ca>
  2 siblings, 0 replies; 9+ messages in thread
From: John David Anglin @ 2007-03-04  1:05 UTC
  To: James Bottomley; +Cc: parisc-linux

James,

> Based on the theory, I managed to reproduce the problem on ioz (you just
> have to increase N to be much greater than the number of CPUs you have)
> and tried a little fix, which seems to work for ioz.  Could you try this
> out on your a500?

I tried the change on my c3750.  It's survived four full GCC builds
and checks with no hung process in the libjava testsuite.  While a
clean run like this could have happened before the change, the
probability was low.  So, I think this change is a good one ;)

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

* Re: [parisc-linux] Debugging 64-bit kernel crashes involving
       [not found]         ` <1172620930.3408.46.camel@mulgrave.il.steeleye.com>
  2007-03-04  1:05           ` John David Anglin
@ 2007-03-08 19:31           ` Kyle McMartin
       [not found]           ` <20070308193119.GA3553@athena.road.mcmartin.ca>
  2 siblings, 0 replies; 9+ messages in thread
From: Kyle McMartin @ 2007-03-08 19:31 UTC
  To: James Bottomley; +Cc: John David Anglin, parisc-linux

On Tue, Feb 27, 2007 at 06:02:10PM -0600, James Bottomley wrote:
> +	/* need to be executing in user context */
> +	if (regs->iasq[0] != 0 || regs->iasq[1] != 0) {

Is there a reason we can't use user_space() from ptrace.h here? You may have
just spotted a subtle bug elsewhere since user_space only checks iasq[1]...
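
To show where the two tests diverge, here is a small sketch.  It takes
the statement above at face value (user_space() looks only at
iasq[1]); the struct and both function names are hypothetical and are
not the definitions in ptrace.h.

```c
#include <assert.h>

/* Hypothetical stand-in holding just the instruction space queue. */
struct iasq_model {
	unsigned long iasq[2];
};

/* Models a check that looks only at the back of the queue. */
static int user_space_model(const struct iasq_model *regs)
{
	assert(regs != 0);
	return regs->iasq[1] != 0;
}

/* Models the check in the patch: either queue entry in user space. */
static int any_queue_user(const struct iasq_model *regs)
{
	assert(regs != 0);
	return regs->iasq[0] != 0 || regs->iasq[1] != 0;
}
```

The two agree whenever both queue entries are in the same space.  They
differ only when iasq[0] is a user space and iasq[1] is zero, in which
case the patch's test reports user context and the iasq[1]-only test
does not.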

* Re: [parisc-linux] Debugging 64-bit kernel crashes involving
       [not found]             ` <119aab440703111328x5738d60es118519743e6000a3@mail.gmail.com>
@ 2007-03-11 20:39               ` James Bottomley
       [not found]               ` <1173645565.3420.22.camel@mulgrave.il.steeleye.com>
  1 sibling, 0 replies; 9+ messages in thread
From: James Bottomley @ 2007-03-11 20:39 UTC
  To: Carlos O'Donell; +Cc: John David Anglin, parisc-linux

On Sun, 2007-03-11 at 16:28 -0400, Carlos O'Donell wrote:
> Are we considering James' patch for inclusion into
> git.parisc-linux.org?

No ... it's failing on more complex tests.  The theory is sound ... we
have a huge cockup in our shared space handling code (which is why
threads don't work very well).  The problem is I'm not entirely sure how
to fix it.

James



* Re: [parisc-linux] Debugging 64-bit kernel crashes involving
       [not found]                 ` <119aab440703111513x796e0f34t87507ebd8be2c33d@mail.gmail.com>
@ 2007-03-11 23:23                   ` James Bottomley
  0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2007-03-11 23:23 UTC
  To: Carlos O'Donell; +Cc: John David Anglin, parisc-linux

On Sun, 2007-03-11 at 18:13 -0400, Carlos O'Donell wrote:
> Have you posted a more complex test somewhere?

Well, I told you what it is, yes ... it's your fork test with N jacked
up to very high values.

James



* Re: [parisc-linux] Debugging 64-bit kernel crashes involving
       [not found] <1172265809.3424.75.camel@mulgrave.il.steeleye.com>
@ 2007-02-24  0:39 ` John David Anglin
  0 siblings, 0 replies; 9+ messages in thread
From: John David Anglin @ 2007-02-24  0:39 UTC
  To: James Bottomley; +Cc: parisc-linux

> >  IIR: 0ec25033    ISR: 00000000000e2800  IOR: 00000000000aac8c
>                          ^^^^^^^^^^^^^^^^
> This clearly identifies the faulting space

This is for
	ldb,ma 1(sr1,r22),r19

> >  CPU:        0   CR30: 000000009ad38000 CR31: 0000000040848000
> >  ORIG_R28: 00000000407f6c00
> >  IAOQ[0]: pa_memcpy+0x114/0x2d0
> >  IAOQ[1]: pa_memcpy+0x118/0x2d0
> >  RP(r2): copy_from_user+0x34/0x40
> > Backtrace:
> 
> Which means that somehow a TLB entry got inserted for address
> 00000000000aac8c in space 00000000000e2800 which didn't have the correct
> Access ID (which for us should have been the space 00000000000e2800).

Just wondering why a protection id trap causes the system to eat sparcs...
Is this a combined TLB issue?

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)
