16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors


* 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
@ 2014-08-26  9:27 Joshua Kinard
  2014-08-26 10:20 ` Ralf Baechle
  0 siblings, 1 reply; 15+ messages in thread
From: Joshua Kinard @ 2014-08-26  9:27 UTC (permalink / raw)
  To: Linux MIPS List

Okay, so from the "make kmap cache coloring aware" thread, I've been playing
with larger PAGE_SIZE values on the Octane and O2 for the last few hours.
In the past, 16k and 64k never got far after init (they usually died *at*
init).  That appears to have changed now.  Most programs seem to
JustWork(), but very randomly, I am getting a signal -4, illegal instruction
(SIGILL), on the Octane.  Both systems are running kernels w/ 64k PAGE_SIZE
at the moment.

I cannot reproduce it on demand, so I'm not really sure what the cause could
be.  PAGE_SIZE should be largely transparent to userland these days, so I am
wondering if this might be more oddities w/ an R14000 CPU.

Ideas?

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26  9:27 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors Joshua Kinard
@ 2014-08-26 10:20 ` Ralf Baechle
  2014-08-26 10:42   ` Maciej W. Rozycki
  2014-08-26 11:06   ` Joshua Kinard
  0 siblings, 2 replies; 15+ messages in thread
From: Ralf Baechle @ 2014-08-26 10:20 UTC (permalink / raw)
  To: Joshua Kinard; +Cc: Linux MIPS List

On Tue, Aug 26, 2014 at 05:27:28AM -0400, Joshua Kinard wrote:

> Okay, so from the "make kmap cache coloring aware" thread, I've been playing
> with larger PAGE_SIZE values on the Octane and O2 for the last few hours.
> 16k and 64k used to, in the past, never get far after init (usually died
> *at* init)  That appears to have changed now.  Most programs seem to
> JustWork(), but very randomly, I am getting a signal -4, illegal instruction
> (SIGILL) on the Octane.  Both systems are running kernels w/ 64k PAGE_SIZE
> at the moment.
> 
> I cannot reproduce it on demand, so I'm not really sure what the cause could
> be.  PAGE_SIZE should be largely transparent to userland these days, so I am
> wondering if this might be more oddities w/ an R14000 CPU.

This sounds very unlikely as the CPU was primarily designed to run IRIX and
SGI's systems were using 16k or even 64k page size.

What userland are you running and how old is it?  Are you seeing different
results for 16k and 64k?

  Ralf


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 10:20 ` Ralf Baechle
@ 2014-08-26 10:42   ` Maciej W. Rozycki
  2014-08-26 10:49     ` Maciej W. Rozycki
  2014-08-26 11:49     ` Ralf Baechle
  2014-08-26 11:06   ` Joshua Kinard
  1 sibling, 2 replies; 15+ messages in thread
From: Maciej W. Rozycki @ 2014-08-26 10:42 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Joshua Kinard, Linux MIPS List

On Tue, 26 Aug 2014, Ralf Baechle wrote:

> > I cannot reproduce it on demand, so I'm not really sure what the cause could
> > be.  PAGE_SIZE should be largely transparent to userland these days, so I am
> > wondering if this might be more oddities w/ an R14000 CPU.
> 
> This sounds very unlikely as the CPU was primarily designed to run IRIX and
> SGI's systems were using 16k or even 64k page size.
> 
> What userland are you running and how old is it?  Are you seeing different
> results for 16k and 64k?

 FWIW, I've always used the 16k page size exclusively with my 64-bit
userland on my SWARM board using the SB-1/BCM1250 processor (with either
endianness) and never had issues, even with stuff as intensive as native
GCC bootstrapping (with all the languages enabled, such as Ada and Java) or
glibc builds.  It's been about 8 years now, and quite recent kernels, e.g.
from two months ago, gave me no trouble either.  So it must be something
specific to the configuration; my first candidates to look at would be the
generated TLB and cache handlers, which are system-specific.

  Maciej


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 10:42   ` Maciej W. Rozycki
@ 2014-08-26 10:49     ` Maciej W. Rozycki
  2014-08-26 11:49     ` Ralf Baechle
  1 sibling, 0 replies; 15+ messages in thread
From: Maciej W. Rozycki @ 2014-08-26 10:49 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Joshua Kinard, Linux MIPS List

On Tue, 26 Aug 2014, Maciej W. Rozycki wrote:

>  FWIW, I've been always using the 16k page size exclusively with my 64-bit 
> userland and my SWARM board using the SB-1/BCM1250 processor (with either 
> endianness) and never had issues even with stuff as intensive as native 
> GCC bootstrapping (with all the languages enabled such as Ada and Java) or 
> glibc builds.  It's been like 8 years now and quite recent kernels like 
> from two months ago gave me no trouble either.  So it must be something 
> specific to the configuration, my first candidates to look at would be the 
> generated TLB and cache handlers, that are system-specific.

 Ah, and no issues with the 16k page size on my R4400SC DECstation with
the same 64-bit userland either; I booted it recently just fine, though
little-endian only of course.  As with the SWARM, I have stuck to using
that page size exclusively with 64-bit userland there.

  Maciej


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 10:20 ` Ralf Baechle
  2014-08-26 10:42   ` Maciej W. Rozycki
@ 2014-08-26 11:06   ` Joshua Kinard
  2014-08-26 11:50     ` Maciej W. Rozycki
  2014-08-26 12:03     ` Ralf Baechle
  1 sibling, 2 replies; 15+ messages in thread
From: Joshua Kinard @ 2014-08-26 11:06 UTC (permalink / raw)
  To: linux-mips

On 08/26/2014 06:20, Ralf Baechle wrote:
> On Tue, Aug 26, 2014 at 05:27:28AM -0400, Joshua Kinard wrote:
> 
>> Okay, so from the "make kmap cache coloring aware" thread, I've been playing
>> with larger PAGE_SIZE values on the Octane and O2 for the last few hours.
>> 16k and 64k used to, in the past, never get far after init (usually died
>> *at* init)  That appears to have changed now.  Most programs seem to
>> JustWork(), but very randomly, I am getting a signal -4, illegal instruction
>> (SIGILL) on the Octane.  Both systems are running kernels w/ 64k PAGE_SIZE
>> at the moment.
>>
>> I cannot reproduce it on demand, so I'm not really sure what the cause could
>> be.  PAGE_SIZE should be largely transparent to userland these days, so I am
>> wondering if this might be more oddities w/ an R14000 CPU.
> 
> This sounds very unlikely as the CPU was primarily designed to run IRIX and
> SGI's systems were using 16k or even 64k page size.
> 
> What userland are you running and how old is it?  Are you seeing different
> results for 16k and 64k?

o32 userland is the primary on both systems.  However, the last SIGILL was
under the 64k PAGE_SIZE kernel inside of an n32 chroot compiling the 'boost'
package on the Octane; I restarted that and it hasn't complained since.
Also got SIGILL on the 16k PAGE_SIZE kernel when I booted 16k PAGE_SIZE the
first time and ran 'ps'.  Subsequent runs of 'ps' didn't reproduce the
error.  Also saw SIGILLs in the bootlog of the 16k PAGE_SIZE kernel when
"rm" was run once (couldn't reproduce) and when mdadm tried to put one of
the arrays back together.  Subsequent runs using similar argument lines
didn't reproduce it once I got to a root shell.

Being it's a Gentoo install...the o32 userland is pretty fresh.  Especially
on the Octane, where I literally rebuilt the old userland over 2-3 times
just to make sure all the old 5-year cruft was gone.  The n32 userland
chroot is brand-spanking new.  gcc-4.7.x only for now on both, because of
PR61538 in gcc.  Latest binutils.

The O2 is chugging away happily so far in updating a bunch of packages.  So
I am leaning towards this being another quirk I have to hunt down in the
Octane's code again.  There isn't much in the Octane-specific code that
deals with memory, though -- it seems the higher-level MIPS memory code
handles most things just fine.

-- 
Joshua Kinard


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 10:42   ` Maciej W. Rozycki
  2014-08-26 10:49     ` Maciej W. Rozycki
@ 2014-08-26 11:49     ` Ralf Baechle
  2014-08-26 12:03       ` Joshua Kinard
  1 sibling, 1 reply; 15+ messages in thread
From: Ralf Baechle @ 2014-08-26 11:49 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Joshua Kinard, Linux MIPS List

On Tue, Aug 26, 2014 at 11:42:30AM +0100, Maciej W. Rozycki wrote:

> > > I cannot reproduce it on demand, so I'm not really sure what the cause could
> > > be.  PAGE_SIZE should be largely transparent to userland these days, so I am
> > > wondering if this might be more oddities w/ an R14000 CPU.
> > 
> > This sounds very unlikely as the CPU was primarily designed to run IRIX and
> > SGI's systems were using 16k or even 64k page size.
> > 
> > What userland are you running and how old is it?  Are you seeing different
> > results for 16k and 64k?
> 
>  FWIW, I've been always using the 16k page size exclusively with my 64-bit 
> userland and my SWARM board using the SB-1/BCM1250 processor (with either 
> endianness) and never had issues even with stuff as intensive as native 
> GCC bootstrapping (with all the languages enabled such as Ada and Java) or 
> glibc builds.  It's been like 8 years now and quite recent kernels like 
> from two months ago gave me no trouble either.  So it must be something 
> specific to the configuration, my first candidates to look at would be the 
> generated TLB and cache handlers, that are system-specific.

Generally the R10000 architecture is such that there is much less potential
for software bugs as well.  The TLB is nice: it cleans up conflicting entries,
so no TLB shutdown or similar horrors are possible.  And the caches, while they
suffer from cache aliases, will clean up those aliases transparently to
software; that is, an OS can treat them as non-aliasing.  R10000 systems,
with the notable exception of the SGI O2 and Indigo² R10000, have fully
coherent I/O.  Basically the only thing that needs to be done in software
is I-cache coherency.  The I-cache snoops stores by remote CPUs but not
by the local CPU itself, so in a sense SMP is even a simpler case than UP.

  Ralf


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 11:06   ` Joshua Kinard
@ 2014-08-26 11:50     ` Maciej W. Rozycki
  2014-08-26 12:03     ` Ralf Baechle
  1 sibling, 0 replies; 15+ messages in thread
From: Maciej W. Rozycki @ 2014-08-26 11:50 UTC (permalink / raw)
  To: Joshua Kinard; +Cc: linux-mips

On Tue, 26 Aug 2014, Joshua Kinard wrote:

> > This sounds very unlikely as the CPU was primarily designed to run IRIX and
> > SGI's systems were using 16k or even 64k page size.
> > 
> > What userland are you running and how old is it?  Are you seeing different
> > results for 16k and 64k?
> 
> o32 userland is the primary on both systems.  However, the last SIGILL was
> under the 64k PAGE_SIZE kernel inside of an n32 chroot compiling the 'boost'
> package on the Octane, which I restarted that and it's not complained since.
>  Also got SIGILL on the 16k PAGE_SIZE kernel when I booted 16k PAGE_SIZE the
> first time and ran 'ps'.  Subsequent runs of 'ps' didn't reproduce the
> error.  Also saw SIGILLs in the bootlog of the 16k PAGE_SIZE kernel when
> "rm" was run once (couldn't reproduce) and when mdadm tried to put one of
> the arrays back together.  Subsequent runs using similar argument lines
> don't reproduce once I got to a root shell.

 Such intermittent failures look to me remarkably like cache coherency
problems, e.g. D$ vs I$.  You can try making cache maintenance more
aggressive, e.g. tweak all the writeback and invalidation calls so that they
perform their operation on the whole cache rather than the requested range
only, and see if that makes things better.  Alternatively, you may tweak
just the suspected calling site, of course.
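
 As an illustration only, the D$ vs I$ hazard and the "operate on the
whole cache" experiment can be sketched with a toy model.  This is not
kernel code: ToyCpu, flush_range, flush_all and run are hypothetical names,
and the too-short flush range is an invented bug used to show the effect,
not a diagnosis of the Octane problem.

```python
class ToyCpu:
    """Toy model: write-back D-cache, I-cache that does not see local stores."""

    def __init__(self):
        self.memory = {}   # address -> word (backing store)
        self.dcache = {}   # address -> word (dirty data not yet in memory)
        self.icache = {}   # address -> word (may hold stale copies)

    def store(self, addr, word):
        # A store lands in the D-cache only; memory and I-cache are untouched.
        self.dcache[addr] = word

    def fetch(self, addr):
        # Instruction fetch uses the I-cache, filling it from memory on a miss.
        if addr not in self.icache:
            self.icache[addr] = self.memory.get(addr, 0)
        return self.icache[addr]

    def flush_range(self, start, end):
        # Write back D-cache and invalidate I-cache entries in [start, end).
        for addr in [a for a in self.dcache if start <= a < end]:
            self.memory[addr] = self.dcache.pop(addr)
        for addr in [a for a in self.icache if start <= a < end]:
            del self.icache[addr]

    def flush_all(self):
        # The more aggressive variant: ignore the range, handle everything.
        self.memory.update(self.dcache)
        self.dcache.clear()
        self.icache.clear()


def run(flush_whole_cache):
    """Write 'code' into one 16k page, flush, then fetch from its middle."""
    cpu = ToyCpu()
    page_base, page_size = 0x10000, 0x4000
    for addr in range(page_base, page_base + page_size, 4):
        cpu.store(addr, 0x24020001)  # arbitrary nonzero stand-in for an opcode
    if flush_whole_cache:
        cpu.flush_all()
    else:
        # Hypothetical bug: only the first 4k of the 16k page is flushed.
        cpu.flush_range(page_base, page_base + 0x1000)
    return cpu.fetch(page_base + 0x2000)
```

 In this model run(False) fetches a word that was never written back (the
analogue of executing garbage), while run(True) fetches the stored word.
If widening the real flushes makes the SIGILLs disappear, that would point
at a range or call-site problem rather than at the CPU.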

  Maciej


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 11:06   ` Joshua Kinard
  2014-08-26 11:50     ` Maciej W. Rozycki
@ 2014-08-26 12:03     ` Ralf Baechle
  2014-08-26 13:16       ` Joshua Kinard
  1 sibling, 1 reply; 15+ messages in thread
From: Ralf Baechle @ 2014-08-26 12:03 UTC (permalink / raw)
  To: Joshua Kinard; +Cc: linux-mips

On Tue, Aug 26, 2014 at 07:06:56AM -0400, Joshua Kinard wrote:

> o32 userland is the primary on both systems.  However, the last SIGILL was
> under the 64k PAGE_SIZE kernel inside of an n32 chroot compiling the 'boost'
> package on the Octane, which I restarted that and it's not complained since.
>  Also got SIGILL on the 16k PAGE_SIZE kernel when I booted 16k PAGE_SIZE the
> first time and ran 'ps'.  Subsequent runs of 'ps' didn't reproduce the
> error.  Also saw SIGILLs in the bootlog of the 16k PAGE_SIZE kernel when
> "rm" was run once (couldn't reproduce) and when mdadm tried to put one of
> the arrays back together.  Subsequent runs using similar argument lines
> don't reproduce once I got to a root shell.
> 
> Being it's a Gentoo install...the o32 userland is pretty fresh.  Especially
> on the Octane, where I literally rebuilt the old userland over 2-3 times
> just to make sure all the old 5-year cruft was gone.  The n32 userland
> chroot is brand-spanking new.  gcc-4.7.x only for now on both, because of
> PR61538 in gcc.  Latest binutils.
> 
> The O2 is chugging away happily so far in updating a bunch of packages.  So
> I am leaning towards this being another quirk I have to hunt down in the
> Octane's code again.  There isn't much in the Octane-specific code that
> deals with memory, though -- it seems the higher-level MIPS memory code
> handles most things just fine.

Can you enable core dumps?  I'm wondering about the EPC of the crashed
process.  If it's at a function entry or the beginning of a page that
might indicate there is an issue with flushing caches after the containing
page got loaded.  Also interesting to know if this possibly happened in a
signal trampoline or VDSO.

These are just the usual suspects - nothing indicates this case is actually
related.
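
The "beginning of a page" check is simple arithmetic on the EPC.  A minimal
sketch, assuming the EPC has already been read out of the core file;
page_offset and describe_epc are hypothetical helper names and the address
in the usage note is made up, not taken from any core dump in this thread:

```python
# Page sizes discussed in this thread, in bytes.
PAGE_SIZES = {"4k": 4 * 1024, "16k": 16 * 1024, "64k": 64 * 1024}


def page_offset(epc, page_size):
    """Return the byte offset of epc within its page (power-of-two size)."""
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page_size must be a power of two")
    return epc & (page_size - 1)


def describe_epc(epc):
    """Map each page-size label to the offset of epc inside such a page."""
    return {label: page_offset(epc, size) for label, size in PAGE_SIZES.items()}
```

For example, describe_epc(0x00414000) gives offset 0 for 4k and 16k pages
but 0x4000 for 64k pages, so whether an address counts as a "page start"
depends on the PAGE_SIZE the kernel was built with.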

  Ralf


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 11:49     ` Ralf Baechle
@ 2014-08-26 12:03       ` Joshua Kinard
  2014-08-26 12:11         ` Ralf Baechle
  0 siblings, 1 reply; 15+ messages in thread
From: Joshua Kinard @ 2014-08-26 12:03 UTC (permalink / raw)
  To: linux-mips

On 08/26/2014 07:49, Ralf Baechle wrote:
> On Tue, Aug 26, 2014 at 11:42:30AM +0100, Maciej W. Rozycki wrote:
> 
>>>> I cannot reproduce it on demand, so I'm not really sure what the cause could
>>>> be.  PAGE_SIZE should be largely transparent to userland these days, so I am
>>>> wondering if this might be more oddities w/ an R14000 CPU.
>>>
>>> This sounds very unlikely as the CPU was primarily designed to run IRIX and
>>> SGI's systems were using 16k or even 64k page size.
>>>
>>> What userland are you running and how old is it?  Are you seeing different
>>> results for 16k and 64k?
>>
>>  FWIW, I've been always using the 16k page size exclusively with my 64-bit 
>> userland and my SWARM board using the SB-1/BCM1250 processor (with either 
>> endianness) and never had issues even with stuff as intensive as native 
>> GCC bootstrapping (with all the languages enabled such as Ada and Java) or 
>> glibc builds.  It's been like 8 years now and quite recent kernels like 
>> from two months ago gave me no trouble either.  So it must be something 
>> specific to the configuration, my first candidates to look at would be the 
>> generated TLB and cache handlers, that are system-specific.
> 
> Generally the R10000 architecture is such that there is much less potential
> for software bugs as well.  The TLB is nice, cleans up conflicting entries
> so no TLB shutdown or similar horrors possible.  And the caches while they
> suffer from cache aliases, will cleanup those aliases transparently to
> software, that is an OS can treat them as non-aliasing.  R10000 systems
> with the notable exception of the SGI O2 and Indigo² R10000 have fully
> coherent I/O.  Basically the only thing that needs to be done in software
> is I-cache coherency.  The I-cache snoops stores by remote CPUs but not
> by the local CPU itself so in a sense SMP is a simpler case than UP even.

Yeah, coherency shouldn't be a problem for the Octane.  It's hardware-coherent,
like IP27.

The icache snooping fix is already enabled in
arch/mips/include/asm/mach-ip30/cpu-feature-overrides.h:

#define cpu_icache_snoops_remote_store 1

SMP is not working yet on IP30, though.  I gave up on that for now, because
I can't get the second CPU to start ticking properly.

-- 
Joshua Kinard


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 12:03       ` Joshua Kinard
@ 2014-08-26 12:11         ` Ralf Baechle
  0 siblings, 0 replies; 15+ messages in thread
From: Ralf Baechle @ 2014-08-26 12:11 UTC (permalink / raw)
  To: Joshua Kinard; +Cc: linux-mips

On Tue, Aug 26, 2014 at 08:03:28AM -0400, Joshua Kinard wrote:

> Yeah, coherency shouldn't be a problem for the Octane.  hardware-coherent
> like IP27.
> 
> The icache snooping fix is already enabled in
> arch/mips/include/asm/mach-ip30/cpu-feature-overrides.h:
> 
> #define cpu_icache_snoops_remote_store 1
> 
> SMP is not working yet on IP30, though.  I gave up on that for now, because
> I can't get the second CPU to start ticking properly.

Are you running a preemptible kernel?

You could also change the definition of cpu_icache_snoops_remote_store to 0
for testing.  That should make things just a bit slower but otherwise have
no impact.  It would be interesting to see if that makes the SIGILLs go away.

  Ralf


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 12:03     ` Ralf Baechle
@ 2014-08-26 13:16       ` Joshua Kinard
  2014-08-26 14:02         ` Ralf Baechle
  2014-08-26 14:03         ` Ralf Baechle
  0 siblings, 2 replies; 15+ messages in thread
From: Joshua Kinard @ 2014-08-26 13:16 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: linux-mips

On 08/26/2014 08:03, Ralf Baechle wrote:
> On Tue, Aug 26, 2014 at 07:06:56AM -0400, Joshua Kinard wrote:
> 
>> o32 userland is the primary on both systems.  However, the last SIGILL was
>> under the 64k PAGE_SIZE kernel inside of an n32 chroot compiling the 'boost'
>> package on the Octane, which I restarted that and it's not complained since.
>>  Also got SIGILL on the 16k PAGE_SIZE kernel when I booted 16k PAGE_SIZE the
>> first time and ran 'ps'.  Subsequent runs of 'ps' didn't reproduce the
>> error.  Also saw SIGILLs in the bootlog of the 16k PAGE_SIZE kernel when
>> "rm" was run once (couldn't reproduce) and when mdadm tried to put one of
>> the arrays back together.  Subsequent runs using similar argument lines
>> don't reproduce once I got to a root shell.
>>
>> Being it's a Gentoo install...the o32 userland is pretty fresh.  Especially
>> on the Octane, where I literally rebuilt the old userland over 2-3 times
>> just to make sure all the old 5-year cruft was gone.  The n32 userland
>> chroot is brand-spanking new.  gcc-4.7.x only for now on both, because of
>> PR61538 in gcc.  Latest binutils.
>>
>> The O2 is chugging away happily so far in updating a bunch of packages.  So
>> I am leaning towards this being another quirk I have to hunt down in the
>> Octane's code again.  There isn't much in the Octane-specific code that
>> deals with memory, though -- it seems the higher-level MIPS memory code
>> handles most things just fine.
> 
> Can you enable core dumps?  I'm wondering about the EPC of the crashed
> process.  If it's at a function entry or the beginning of a page that
> might indicate there is an issue with flushing caches after the containing
> page got loaded.  Also interesting to know if this possibly happened in a
> signal trampoline or VDSO.
> 
> These are just the usual suspects - nothing indicates this case is actually
> related.

(Missed the reply all on the last one)

Enabled coredumps and got the 'shash' program to fail a second time (first
program to do so)...so I'll rebuild that with debugging symbols and try to
trip it up again later on.

Is a core file from a binary w/o debugging of any value?

-- 
Joshua Kinard


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 13:16       ` Joshua Kinard
@ 2014-08-26 14:02         ` Ralf Baechle
  2014-08-27  0:53           ` Joshua Kinard
  2014-08-26 14:03         ` Ralf Baechle
  1 sibling, 1 reply; 15+ messages in thread
From: Ralf Baechle @ 2014-08-26 14:02 UTC (permalink / raw)
  To: Joshua Kinard; +Cc: linux-mips

On Tue, Aug 26, 2014 at 09:16:56AM -0400, Joshua Kinard wrote:

> On 08/26/2014 08:03, Ralf Baechle wrote:
> > On Tue, Aug 26, 2014 at 07:06:56AM -0400, Joshua Kinard wrote:
> > 
> >> o32 userland is the primary on both systems.  However, the last SIGILL was
> >> under the 64k PAGE_SIZE kernel inside of an n32 chroot compiling the 'boost'
> >> package on the Octane, which I restarted that and it's not complained since.
> >>  Also got SIGILL on the 16k PAGE_SIZE kernel when I booted 16k PAGE_SIZE the
> >> first time and ran 'ps'.  Subsequent runs of 'ps' didn't reproduce the
> >> error.  Also saw SIGILLs in the bootlog of the 16k PAGE_SIZE kernel when
> >> "rm" was run once (couldn't reproduce) and when mdadm tried to put one of
> >> the arrays back together.  Subsequent runs using similar argument lines
> >> don't reproduce once I got to a root shell.
> >>
> >> Being it's a Gentoo install...the o32 userland is pretty fresh.  Especially
> >> on the Octane, where I literally rebuilt the old userland over 2-3 times
> >> just to make sure all the old 5-year cruft was gone.  The n32 userland
> >> chroot is brand-spanking new.  gcc-4.7.x only for now on both, because of
> >> PR61538 in gcc.  Latest binutils.
> >>
> >> The O2 is chugging away happily so far in updating a bunch of packages.  So
> >> I am leaning towards this being another quirk I have to hunt down in the
> >> Octane's code again.  There isn't much in the Octane-specific code that
> >> deals with memory, though -- it seems the higher-level MIPS memory code
> >> handles most things just fine.
> > 
> > Can you enable core dumps?  I'm wondering about the EPC of the crashed
> > process.  If it's at a function entry or the beginning of a page that
> > might indicate there is an issue with flushing caches after the containing
> > page got loaded.  Also interesting to know if this possibly happened in a
> > signal trampoline or VDSO.
> > 
> > These are just the usual suspects - nothing indicates this case is actually
> > related.
> 
> (Missed the reply all on the last one)
> 
> Enabled coredumps and got the 'shash' program to fail a second time (first
> program to do so)...so I'll rebuild that with debugging symbols and try to
> trip it up again later on.
> 
> Is a core file from a binary w/o debugging of any value?

Yes - it will contain registers etc.  Just what really matters in this case.
We don't need the debug info because we're not interested in debugging the
application.

  Ralf


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 13:16       ` Joshua Kinard
  2014-08-26 14:02         ` Ralf Baechle
@ 2014-08-26 14:03         ` Ralf Baechle
  1 sibling, 0 replies; 15+ messages in thread
From: Ralf Baechle @ 2014-08-26 14:03 UTC (permalink / raw)
  To: Joshua Kinard; +Cc: linux-mips

On Tue, Aug 26, 2014 at 09:16:56AM -0400, Joshua Kinard wrote:

> (Missed the reply all on the last one)

If that was meant to say you replied to me privately only - I haven't
received that mail yet.

I hope no more email debugging - I'm already getting nightmares ...

  Ralf


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-26 14:02         ` Ralf Baechle
@ 2014-08-27  0:53           ` Joshua Kinard
  2014-09-04  3:35             ` Joshua Kinard
  0 siblings, 1 reply; 15+ messages in thread
From: Joshua Kinard @ 2014-08-27  0:53 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: linux-mips

[-- Attachment #1: Type: text/plain, Size: 2981 bytes --]

On 08/26/2014 10:02, Ralf Baechle wrote:
> On Tue, Aug 26, 2014 at 09:16:56AM -0400, Joshua Kinard wrote:
> 
>> On 08/26/2014 08:03, Ralf Baechle wrote:
>>> On Tue, Aug 26, 2014 at 07:06:56AM -0400, Joshua Kinard wrote:
>>>
>>>> o32 userland is the primary on both systems.  However, the last SIGILL was
>>>> under the 64k PAGE_SIZE kernel inside of an n32 chroot compiling the 'boost'
>>>> package on the Octane, which I restarted that and it's not complained since.
>>>>  Also got SIGILL on the 16k PAGE_SIZE kernel when I booted 16k PAGE_SIZE the
>>>> first time and ran 'ps'.  Subsequent runs of 'ps' didn't reproduce the
>>>> error.  Also saw SIGILLs in the bootlog of the 16k PAGE_SIZE kernel when
>>>> "rm" was run once (couldn't reproduce) and when mdadm tried to put one of
>>>> the arrays back together.  Subsequent runs using similar argument lines
>>>> don't reproduce once I got to a root shell.
>>>>
>>>> Being it's a Gentoo install...the o32 userland is pretty fresh.  Especially
>>>> on the Octane, where I literally rebuilt the old userland over 2-3 times
>>>> just to make sure all the old 5-year cruft was gone.  The n32 userland
>>>> chroot is brand-spanking new.  gcc-4.7.x only for now on both, because of
>>>> PR61538 in gcc.  Latest binutils.
>>>>
>>>> The O2 is chugging away happily so far in updating a bunch of packages.  So
>>>> I am leaning towards this being another quirk I have to hunt down in the
>>>> Octane's code again.  There isn't much in the Octane-specific code that
>>>> deals with memory, though -- it seems the higher-level MIPS memory code
>>>> handles most things just fine.
>>>
>>> Can you enable core dumps?  I'm wondering about the EPC of the crashed
>>> process.  If it's at a function entry or the beginning of a page that
>>> might indicate there is an issue with flushing caches after the containing
>>> page got loaded.  Also interesting to know if this possibly happened in a
>>> signal trampoline or VDSO.
>>>
>>> These are just the usual suspects - nothing indicates this case is actually
>>> related.
>>
>> (Missed the reply all on the last one)
>>
>> Enabled coredumps and got the 'shash' program to fail a second time (first
>> program to do so)...so I'll rebuild that with debugging symbols and try to
>> trip it up again later on.
>>
>> Is a core file from a binary w/o debugging of any value?
> 
> Yes - it will contain registers etc.  Just what really matters in this case.
> We don't need the debug info because we're not interested in debugging the
> application.
> 
>   Ralf

Attached.  I assume readelf and objdump are used to extract the register
information?  Most searches on Google keep pointing me to GDB as if I want
to debug the program.
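
For reference, the register state in a core file lives in its PT_NOTE
segment, in the notes of type NT_PRSTATUS.  A minimal sketch that only
locates those notes, assuming a standard ELF core with 4-byte-aligned
notes; list_core_notes is a hypothetical helper, and decoding the
registers themselves (including EPC) is ABI-specific and not attempted
here:

```python
import struct

PT_NOTE = 4       # program header type for note segments
NT_PRSTATUS = 1   # note type carrying per-thread status and registers


def list_core_notes(data):
    """Return (name, type, desc_offset, desc_size) for each note in an ELF core."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is64 = data[4] == 2                  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    end = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little, 2 = big endian
    if is64:
        e_phoff, = struct.unpack_from(end + "Q", data, 32)
        e_phentsize, e_phnum = struct.unpack_from(end + "HH", data, 54)
    else:
        e_phoff, = struct.unpack_from(end + "I", data, 28)
        e_phentsize, e_phnum = struct.unpack_from(end + "HH", data, 42)
    notes = []
    for i in range(e_phnum):
        ph = e_phoff + i * e_phentsize
        p_type, = struct.unpack_from(end + "I", data, ph)
        if p_type != PT_NOTE:
            continue
        if is64:
            p_offset, = struct.unpack_from(end + "Q", data, ph + 8)
            p_filesz, = struct.unpack_from(end + "Q", data, ph + 32)
        else:
            p_offset, = struct.unpack_from(end + "I", data, ph + 4)
            p_filesz, = struct.unpack_from(end + "I", data, ph + 16)
        pos, stop = p_offset, p_offset + p_filesz
        while pos + 12 <= stop:
            # Each note: namesz, descsz, type, then padded name and descriptor.
            namesz, descsz, ntype = struct.unpack_from(end + "III", data, pos)
            name_off = pos + 12
            desc_off = name_off + ((namesz + 3) & ~3)
            raw_name = data[name_off:name_off + namesz].rstrip(b"\0")
            notes.append((raw_name.decode("ascii", "replace"), ntype,
                          desc_off, descsz))
            pos = desc_off + ((descsz + 3) & ~3)
    return notes
```

After decompressing the core, list_core_notes(open(path, "rb").read()) and
a filter on type NT_PRSTATUS give the file offsets of the register blocks.
In practice "readelf -n" on the core lists the same notes, and loading the
binary plus core into GDB and running "info registers" prints the saved
registers without any debug info being needed.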

-- 
Joshua Kinard

[-- Attachment #2: core-shash-11-0-0-1479-1409058599.xz --]
[-- Type: application/octet-stream, Size: 167396 bytes --]


* Re: 16k or 64k PAGE_SIZE and "illegal instruction" (signal -4) errors
  2014-08-27  0:53           ` Joshua Kinard
@ 2014-09-04  3:35             ` Joshua Kinard
  0 siblings, 0 replies; 15+ messages in thread
From: Joshua Kinard @ 2014-09-04  3:35 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: linux-mips

On 08/26/2014 20:53, Joshua Kinard wrote:
> On 08/26/2014 10:02, Ralf Baechle wrote:
>> On Tue, Aug 26, 2014 at 09:16:56AM -0400, Joshua Kinard wrote:
>>
>>> On 08/26/2014 08:03, Ralf Baechle wrote:
>>>> On Tue, Aug 26, 2014 at 07:06:56AM -0400, Joshua Kinard wrote:
>>>>
>>>>> o32 userland is the primary on both systems.  However, the last SIGILL was
>>>>> under the 64k PAGE_SIZE kernel inside of an n32 chroot compiling the 'boost'
>>>>> package on the Octane, which I restarted that and it's not complained since.
>>>>>  Also got SIGILL on the 16k PAGE_SIZE kernel when I booted 16k PAGE_SIZE the
>>>>> first time and ran 'ps'.  Subsequent runs of 'ps' didn't reproduce the
>>>>> error.  Also saw SIGILLs in the bootlog of the 16k PAGE_SIZE kernel when
>>>>> "rm" was run once (couldn't reproduce) and when mdadm tried to put one of
>>>>> the arrays back together.  Subsequent runs using similar argument lines
>>>>> don't reproduce once I got to a root shell.
>>>>>
>>>>> Being it's a Gentoo install...the o32 userland is pretty fresh.  Especially
>>>>> on the Octane, where I literally rebuilt the old userland over 2-3 times
>>>>> just to make sure all the old 5-year cruft was gone.  The n32 userland
>>>>> chroot is brand-spanking new.  gcc-4.7.x only for now on both, because of
>>>>> PR61538 in gcc.  Latest binutils.
>>>>>
>>>>> The O2 is chugging away happily so far in updating a bunch of packages.  So
>>>>> I am leaning towards this being another quirk I have to hunt down in the
>>>>> Octane's code again.  There isn't much in the Octane-specific code that
>>>>> deals with memory, though -- it seems the higher-level MIPS memory code
>>>>> handles most things just fine.
>>>>
>>>> Can you enable core dumps?  I'm wondering about the EPC of the crashed
>>>> process.  If it's at a function entry or the beginning of a page that
>>>> might indicate there is an issue with flushing caches after the containing
>>>> page got loaded.  Also interesting to know if this possibly happened in a
>>>> signal trampoline or VDSO.
>>>>
>>>> These are just the usual suspects - nothing indicates this case is actually
>>>> related.
>>>
>>> (Missed the reply all on the last one)
>>>
>>> Enabled coredumps and got the 'shash' program to fail a second time (first
>>> program to do so)...so I'll rebuild that with debugging symbols and try to
>>> trip it up again later on.
>>>
>>> Is a core file from a binary w/o debugging of any value?
>>
>> Yes - it will contain registers etc.  Just what really matters in this case.
>> We don't need the debug info because we're not interested in debugging the
>> application.
>>
>>   Ralf
> 
> Attached.  I assume readelf and objdump are used to extract the register
> information?  Most searches on Google keep pointing me to GDB as if I want
> to debug the program.

Was anyone able to take a look at the core dump and see if there is anything
out of the ordinary?

-- 
Joshua Kinard

