linux-kernel.vger.kernel.org archive mirror
* Kernel Panic with Rawtherapee
@ 2012-03-13 22:48 Adalbert Dawid
  2012-03-14 14:53 ` Kernel Panic with Rawtherapee (mce related) Srivatsa S. Bhat
  0 siblings, 1 reply; 6+ messages in thread
From: Adalbert Dawid @ 2012-03-13 22:48 UTC
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 631 bytes --]

Hello,

in the last few weeks I've been having several crashes when processing
raw photo files with rawtherapee. AFAICT this always happened under
(high) cpu load (e.g. when exporting a file to jpg), however I am not
able to narrow it down to a certain operation.

I am running debian testing on a Core 2 Duo E8400 (not overclocked or
anything) with 8GB of ram on an Asus board.

$ uname -a
Linux erde 3.2.0-1-amd64 #1 SMP Fri Feb 17 05:17:36 UTC 2012 x86_64 GNU/Linux

Is anyone able to tell from the stack trace ("screenshot" attached) what
might be going on?

(Please CC me, as I am not subscribed to the list.)

Thank you
Adalbert

[-- Attachment #2: kernel-panic.jpg --]
[-- Type: image/jpeg, Size: 207417 bytes --]


* Re: Kernel Panic with Rawtherapee (mce related)
  2012-03-13 22:48 Kernel Panic with Rawtherapee Adalbert Dawid
@ 2012-03-14 14:53 ` Srivatsa S. Bhat
  2012-03-14 15:59   ` Borislav Petkov
  0 siblings, 1 reply; 6+ messages in thread
From: Srivatsa S. Bhat @ 2012-03-14 14:53 UTC
  To: Adalbert Dawid; +Cc: linux-kernel, Borislav Petkov, Tony Luck, mingo, x86

[-- Attachment #1: Type: text/plain, Size: 738 bytes --]

[Adding some Cc's]

On 03/14/2012 04:18 AM, Adalbert Dawid wrote:

> Hello,
> 
> in the last few weeks I've been having several crashes when processing
> raw photo files with rawtherapee. AFAICT this always happened under
> (high) cpu load (e.g. when exporting a file to jpg), however I am not
> able to narrow it down to a certain operation.
> 
> I am running debian testing on a Core 2 Duo E8400 (not overclocked or
> anything) with 8GB of ram on an Asus board.
> 
> $ uname -a
> Linux erde 3.2.0-1-amd64 #1 SMP Fri Feb 17 05:17:36 UTC 2012 x86_64 GNU/Linux
> 
> Is anyone able to tell from the stack trace ("screenshot" attached) what
> might be going on?
> 
> (Please CC me, as I am not subscribed to the list.)
> 
> Thank you
> Adalbert

[-- Attachment #2: kernel-panic.jpg --]
[-- Type: image/jpeg, Size: 207417 bytes --]


* Re: Kernel Panic with Rawtherapee (mce related)
  2012-03-14 14:53 ` Kernel Panic with Rawtherapee (mce related) Srivatsa S. Bhat
@ 2012-03-14 15:59   ` Borislav Petkov
  2012-03-14 17:51     ` Luck, Tony
  0 siblings, 1 reply; 6+ messages in thread
From: Borislav Petkov @ 2012-03-14 15:59 UTC
  To: Srivatsa S. Bhat
  Cc: Adalbert Dawid, linux-kernel, Borislav Petkov, Tony Luck, mingo, x86

On Wed, Mar 14, 2012 at 08:23:23PM +0530, Srivatsa S. Bhat wrote:
> [Adding some Cc's]
> 
> On 03/14/2012 04:18 AM, Adalbert Dawid wrote:
> 
> > Hello,
> > 
> > in the last few weeks I've been having several crashes when processing
> > raw photo files with rawtherapee. AFAICT this always happened under
> > (high) cpu load (e.g. when exporting a file to jpg), however I am not
> > able to narrow it down to a certain operation.
> > 
> > I am running debian testing on a Core 2 Duo E8400 (not overclocked or
> > anything) with 8GB of ram on an Asus board.
> > 
> > $ uname -a
> > Linux erde 3.2.0-1-amd64 #1 SMP Fri Feb 17 05:17:36 UTC 2012 x86_64 GNU/Linux
> > 
> > Is anyone able to tell from the stack trace ("screenshot" attached) what
> > might be going on?

You're getting a bunch of machine checks, the last one of them being
fatal (Process Context Corrupt bit is set) causing the machine to panic.

Tony will probably be able to help you further in decoding what exactly
those MC0_STATUS and MC5_STATUS values mean. <tease>Well, if we had MCE
decoding in the kernel, that wouldn't have been an issue... :-)</tease>

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551


* RE: Kernel Panic with Rawtherapee (mce related)
  2012-03-14 15:59   ` Borislav Petkov
@ 2012-03-14 17:51     ` Luck, Tony
  2012-03-14 20:18       ` Adalbert Dawid
  0 siblings, 1 reply; 6+ messages in thread
From: Luck, Tony @ 2012-03-14 17:51 UTC
  To: Borislav Petkov, Srivatsa S. Bhat
  Cc: Adalbert Dawid, linux-kernel, mingo, x86

> You're getting a bunch of machine checks, the last one of them being
> fatal (Process Context Corrupt bit is set) causing the machine to panic.

PCC is set in all of them
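
For readers who want to check this flag themselves: on x86, each IA32_MCi_STATUS register is 64 bits wide and PCC (Processor Context Corrupt) is bit 57, next to VAL (63), OVER (62), UC (61), EN (60), MISCV (59) and ADDRV (58). The sketch below tests those bits in plain POSIX sh. The helper name `mci_flags` is hypothetical, and the status value in the usage line is made up for illustration; it is not the value from the screenshot.

```sh
#!/bin/sh
# mci_flags STATUS_HEX
# Print the names of the flag bits set in the top byte of an
# IA32_MCi_STATUS value. Only the upper 32 bits are converted to a
# number, so the arithmetic stays in range even in shells whose
# integers are signed 64-bit (a real status always has bit 63 set).
mci_flags() {
    hex=${1#0x}                                  # drop an optional 0x prefix
    hex=$(printf '%16s' "$hex" | tr ' ' '0')     # left-pad to 16 hex digits
    hi=$((0x${hex%????????}))                    # bits 63..32 of the status
    out=""
    # Bit numbers are relative to the upper word: bit 63 -> 31, bit 57 -> 25.
    for pair in 31:VAL 30:OVER 29:UC 28:EN 27:MISCV 26:ADDRV 25:PCC; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( ($hi >> $bit) & 1 )) -eq 1 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}

# Illustrative value only (assumption, not taken from the thread):
mci_flags 0xB200000000000800    # prints: VAL UC EN PCC
```

Splitting the value into two halves is a deliberate choice here: it avoids relying on how a particular shell handles hex constants that do not fit into a signed 64-bit integer.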

> Tony will probably be able to help you further in decoding what exactly
> those MC0_STATUS and MC5_STATUS values mean

Bank 5 ends in 0400 - which means "Internal timer error". Bank 0 has 0800
which is a bus/interconnect error where this processor was the source of
a memory transaction.
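
The decoding above can be sketched in a few lines of shell. This is a rough illustration based on the architectural MCA error code layout as I understand it from the Intel SDM (low 16 bits of MCi_STATUS, with bit 12 ignored), not a replacement for a real decoder such as mcelog. The function name `decode_mca_code` is hypothetical, and only the bus/interconnect class is broken down into its participation (PP) and transaction type (II) fields.

```sh
#!/bin/sh
# decode_mca_code CODE_HEX
# CODE_HEX is the low 16 bits of an MCi_STATUS value, e.g. 0400 or 0x0800.
decode_mca_code() {
    hex=${1#0x}                          # drop an optional 0x prefix
    code=$(( 0x$hex & 0xEFFF ))          # keep 16 bits, ignore bit 12
    if [ "$code" -eq $((0x0400)) ]; then
        echo "internal timer error"
    elif [ $(( $code & 0xFC00 )) -eq $((0x0400)) ]; then
        echo "internal unclassified error"
    elif [ $(( $code & 0xF800 )) -eq $((0x0800)) ]; then
        pp=$(( ($code >> 9) & 3 ))       # who took part in the transaction
        ii=$(( ($code >> 2) & 3 ))       # what kind of transaction it was
        case $pp in
            0) who="local processor originated the request" ;;
            1) who="local processor responded to the request" ;;
            2) who="local processor observed the error as a third party" ;;
            3) who="generic participation" ;;
        esac
        case $ii in
            0) what="memory access" ;;
            1) what="reserved" ;;
            2) what="I/O" ;;
            3) what="other transaction" ;;
        esac
        echo "bus/interconnect error: $who, $what"
    elif [ $(( $code & 0xFF00 )) -eq $((0x0100)) ]; then
        echo "cache hierarchy error"
    elif [ $(( $code & 0xFF80 )) -eq $((0x0080)) ]; then
        echo "memory controller error"
    elif [ $(( $code & 0xFFF0 )) -eq $((0x0010)) ]; then
        echo "TLB error"
    elif [ $(( $code & 0xFFFC )) -eq $((0x000C)) ]; then
        echo "generic cache hierarchy error"
    else
        echo "simple or unrecognised error code"
    fi
}

decode_mca_code 0400    # the bank 5 code from this thread
decode_mca_code 0800    # the bank 0 code from this thread
```

With the two codes quoted in this message, the sketch reports an internal timer error for bank 5 and, for bank 0, a bus/interconnect error where the local processor originated a memory access.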

That's where the facts end - speculation begins here ...

Since this is repeatable under load - it's possible that a page table got
corrupted and you are trying to access some non-existent memory location?
Do all traces for this panic involve *_tlb_* functions?

Or perhaps you have a cooling problem - and when stressed your cpu or
memory is getting too hot?

-Tony



* RE: Kernel Panic with Rawtherapee (mce related)
  2012-03-14 17:51     ` Luck, Tony
@ 2012-03-14 20:18       ` Adalbert Dawid
  2012-03-14 21:53         ` Luck, Tony
  0 siblings, 1 reply; 6+ messages in thread
From: Adalbert Dawid @ 2012-03-14 20:18 UTC
  To: Luck, Tony; +Cc: Borislav Petkov, Srivatsa S. Bhat, linux-kernel, mingo, x86

Thank you for the quick reply.

On Wed, 2012-03-14 at 17:51 +0000, Luck, Tony wrote:
> > You're getting a bunch of machine checks, the last one of them being
> > fatal (Process Context Corrupt bit is set) causing the machine to panic.
> 
> PCC is set in all of them
> 
> > Tony will probably be able to help you further in decoding what exactly
> > those MC0_STATUS and MC5_STATUS values mean
> 
> Bank 5 ends in 0400 - which means "Internal timer error". Bank 0 has 0800
> which is a bus/interconnect error where this processor was the source of
> a memory transaction.
> 
> That's where the facts end - speculation begins here ...
> 
> Since this is repeatable under load - it's possible that a page table got
> corrupted and you are trying to access some non-existent memory location?
> Do all traces for this panic involve *_tlb_* functions?

Since the screenshot I posted is the only one I have been able to
capture, I don't know. I will try to provoke the crash by putting the
machine under load with rawtherapee and will post the results if I
succeed. Cpuburn did not manage to crash the machine in a (shortish)
test I did a few days ago.

It would be very helpful to disable the "reboot in 30 seconds" timeout.
Is that possible somehow?

> Or perhaps you have a cooling problem - and when stressed your cpu or
> memory is getting too hot?

I do not believe this is the case, as the cpu fan plus two case fans are
running fine and the sensors display cpu temperatures <60°C, even under
load.

Up to now, it has always been rawtherapee that crashed the machine. This
is why I thought it might possibly be some special cpu feature (an SSE
command or something) that happens to be broken in my cpu and that is
triggered only by rawtherapee and not by any other software. What is
your opinion on this theory? 

> -Tony
> 




* RE: Kernel Panic with Rawtherapee (mce related)
  2012-03-14 20:18       ` Adalbert Dawid
@ 2012-03-14 21:53         ` Luck, Tony
  0 siblings, 0 replies; 6+ messages in thread
From: Luck, Tony @ 2012-03-14 21:53 UTC
  To: Adalbert Dawid
  Cc: Borislav Petkov, Srivatsa S. Bhat, linux-kernel, mingo, x86

[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 935 bytes --]

> It would be very helpful to disable the "reboot in 30 seconds" timeout.
> Is that possible somehow?

# echo 0 > /proc/sys/kernel/panic

[A non-zero value is the number of seconds before rebooting. Zero means don't reboot.]
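
As a small illustration of that setting, the sketch below reports what a given `kernel.panic` value means. The helper name `describe_panic_timeout` is hypothetical. The handling of negative values follows the kernel's sysctl documentation as I understand it (reboot immediately) and goes beyond what is stated in this message.

```sh
#!/bin/sh
# describe_panic_timeout [VALUE]
# Without an argument, the value is read from the running kernel.
describe_panic_timeout() {
    value=${1:-$(cat /proc/sys/kernel/panic)}
    if [ "$value" -eq 0 ]; then
        echo "panic=0: no automatic reboot, the panic message stays on screen"
    elif [ "$value" -gt 0 ]; then
        echo "panic=$value: reboot $value seconds after a panic"
    else
        echo "panic=$value: reboot immediately after a panic"
    fi
}

describe_panic_timeout 30    # the timeout mentioned in this thread
describe_panic_timeout 0     # the value set by the echo command above

# On a live system (assumption: /proc is mounted), call it without arguments:
#   describe_panic_timeout
#
# To keep the setting across reboots, one option (run as root) is:
#   echo 'kernel.panic = 0' >> /etc/sysctl.conf
#   sysctl -p
```

The same value can also be given on the kernel command line as `panic=0`, which helps if the crash happens before the sysctl setting has been applied.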

> Up to now, it has always been rawtherapee that crashed the machine. This
> is why I thought it might possibly be some special cpu feature (an SSE
> command or something) that happens to be broken in my cpu and that is
> triggered only by rawtherapee and not by any other software. What is
> your opinion on this theory?

Can you set up a serial console for this system?  Seeing the kernel
log leading up to the machine checks might also give some clues (e.g.
perhaps rawtherapee is running the system out of memory right before
the crash).
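
A minimal sketch of the kernel side of such a setup follows. It only builds the `console=` arguments for the kernel command line. The port `ttyS0` and the speed `115200` are assumptions and must match the actual hardware and the machine capturing the output; the helper name `serial_console_args` is hypothetical.

```sh
#!/bin/sh
# serial_console_args [PORT] [BAUD]
# Print kernel command-line arguments that keep the normal VGA console
# and additionally send kernel messages to a serial port
# (8 data bits, no parity).
serial_console_args() {
    port=${1:-ttyS0}      # assumption: first on-board serial port
    baud=${2:-115200}     # assumption: both ends are set to this speed
    echo "console=tty0 console=${port},${baud}n8"
}

serial_console_args       # prints: console=tty0 console=ttyS0,115200n8

# On Debian, these arguments would typically be appended to
# GRUB_CMDLINE_LINUX in /etc/default/grub, followed by running
# update-grub as root and rebooting.
# On the second machine, connected with a null-modem cable, a terminal
# program can then record the output, for example:
#   screen -L /dev/ttyS0 115200
```

Kernel messages are sent to every console listed, so the panic output still appears on the screen while a complete copy, including the lines that scrolled away before the machine checks, arrives on the other machine.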

-Tony

