* Debugging system hangs
@ 2021-12-14 15:54 Wols Lists
  2021-12-14 17:46 ` Roman Mamedov
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Wols Lists @ 2021-12-14 15:54 UTC
  To: linux-raid

Don't know if this is off-topic or not, seeing as my system is very much 
reliant on raid ...

But basically I'm seeing the system just stop responding. Typically it's 
in screensaver mode, I've got a blank screen, and it won't wake up. (I 
used to think it was something to do with Thunderbird, it mostly 
happened while TB was hammering the system, but no ...)

Today, I had it happen while the system was idle but not in screensaver, 
I run xosview, and everything was clearly frozen - including xosview.

As you might know, my stack is ext4 over lvm (over raid over 
dm-integrity for /home) over spinning rust.

And I run gentoo/systemd - currently on the latest stable kernel afaik, 
5.10.76-gentoo-r1 SMP x86_64.

Any advice on how to debug a hang - basically I need something that'll 
just sit there so when it crashes (and I press the reset button to 
recover) I'll have some sort of trace. It would be nice to prove it's 
not the disk stack at fault ...

Obviously, "set these options in the kernel" won't faze me ...

Cheers,
Wol


* Re: Debugging system hangs
  2021-12-14 15:54 Debugging system hangs Wols Lists
@ 2021-12-14 17:46 ` Roman Mamedov
  2021-12-14 21:59   ` Phil Turmel
  2021-12-15 12:07 ` o1bigtenor
  2021-12-15 16:45 ` Roger Heflin
  2 siblings, 1 reply; 16+ messages in thread
From: Roman Mamedov @ 2021-12-14 17:46 UTC
  To: Wols Lists; +Cc: linux-raid

On Tue, 14 Dec 2021 15:54:50 +0000
Wols Lists <antlists@youngman.org.uk> wrote:

> Don't know if this is off-topic or not, seeing as my system is very much 
> reliant on raid ...
> 
> But basically I'm seeing the system just stop responding. Typically it's 
> in screensaver mode, I've got a blank screen, and it won't wake up. (I 
> used to think it was something to do with Thunderbird, it mostly 
> happened while TB was hammering the system, but no ...)
> 
> Today, I had it happen while the system was idle but not in screensaver, 
> I run xosview, and everything was clearly frozen - including xosview.
> 
> As you might know, my stack is ext4 over lvm (over raid over 
> dm-integrity for /home) over spinning rust.
> 
> And I run gentoo/systemd - currently on the latest stable kernel afaik, 
> 5.10.76-gentoo-r1 SMP x86_64.
> 
> Any advice on how to debug a hang - basically I need something that'll 
> just sit there so when it crashes (and I press the reset button to 
> recover) I'll have some sort of trace. It would be nice to prove it's 
> not the disk stack at fault ...
> 
> Obviously, "set these options in the kernel" won't faze me ...

Set up "netconsole":
https://www.kernel.org/doc/html/latest/networking/netconsole.html
https://wiki.ubuntu.com/Kernel/Netconsole
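
A minimal sketch of what that setup can look like, assuming netconsole is
built as a module and a second machine on the same LAN is available to
listen. The helper name netconsole_param and every port, address,
interface, and MAC value shown are placeholders, not values from this
thread.

```shell
#!/bin/sh
# Build the value of the netconsole= parameter in the documented layout:
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# netconsole_param is a hypothetical helper name used only in this sketch.
netconsole_param() {
    # $1 source port   $2 source IP   $3 network device
    # $4 target port   $5 target IP   $6 target MAC address
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# On the machine that hangs (as root), with the placeholders replaced:
#   dmesg -n 8    # raise the console log level so every message is sent
#   modprobe netconsole "netconsole=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)"
#
# On the listening machine (netcat option syntax differs between variants):
#   nc -u -l 6666 | tee netconsole.log
```

Because the messages leave the box over UDP as they are printed, whatever
the kernel manages to say just before the hang should end up in the log
on the listening machine.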

-- 
With respect,
Roman


* Re: Debugging system hangs
  2021-12-14 17:46 ` Roman Mamedov
@ 2021-12-14 21:59   ` Phil Turmel
  2021-12-15  1:08     ` John Stoffel
  0 siblings, 1 reply; 16+ messages in thread
From: Phil Turmel @ 2021-12-14 21:59 UTC
  To: Roman Mamedov, Wols Lists; +Cc: linux-raid

On 12/14/21 12:46 PM, Roman Mamedov wrote:
> On Tue, 14 Dec 2021 15:54:50 +0000
> Wols Lists <antlists@youngman.org.uk> wrote:
> 

>> Any advice on how to debug a hang - basically I need something that'll
>> just sit there so when it crashes (and I press the reset button to
>> recover) I'll have some sort of trace. It would be nice to prove it's
>> not the disk stack at fault ...
>>
>> Obviously, "set these options in the kernel" won't faze me ...
> 
> Set up "netconsole":
> https://www.kernel.org/doc/html/latest/networking/netconsole.html
> https://wiki.ubuntu.com/Kernel/Netconsole
> 

+1 for netconsole.  Also enable the SysRq key for thread dumps, so you 
have the dump before you reset.
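
A rough sketch of checking and enabling SysRq. The helper name
sysrq_dumps_enabled is hypothetical; the bit value 8 ("debugging dumps
of processes") is taken from the kernel's SysRq documentation. The
kernel.sysrq setting only gates the keyboard combinations; writes to
/proc/sysrq-trigger by root are always allowed.

```shell
#!/bin/sh
# Succeed if the given kernel.sysrq value permits keyboard-triggered
# task dumps: 1 enables every function, otherwise bit 0x8 must be set.
# sysrq_dumps_enabled is a hypothetical helper name for this sketch.
sysrq_dumps_enabled() {
    case "$1" in
        1) return 0 ;;
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input
    esac
    [ $(( $1 & 8 )) -ne 0 ]
}

# Usage (as root):
#   sysrq_dumps_enabled "$(cat /proc/sys/kernel/sysrq)" || sysctl -w kernel.sysrq=1
#   echo t > /proc/sysrq-trigger   # dump all tasks to the kernel log
#   echo w > /proc/sysrq-trigger   # dump tasks in uninterruptible (blocked) state
# On a hung machine no shell is available, so use the keyboard instead:
#   Alt+SysRq+t or Alt+SysRq+w
```

The dumps go to the kernel log, so they are only useful after a hang if
the console output is also being captured somewhere else.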


* Re: Debugging system hangs
  2021-12-14 21:59   ` Phil Turmel
@ 2021-12-15  1:08     ` John Stoffel
  2021-12-15 21:45       ` Wol
  0 siblings, 1 reply; 16+ messages in thread
From: John Stoffel @ 2021-12-15  1:08 UTC
  To: Phil Turmel; +Cc: Roman Mamedov, Wols Lists, linux-raid

>>>>> "Phil" == Phil Turmel <philip@turmel.org> writes:

Phil> On 12/14/21 12:46 PM, Roman Mamedov wrote:
>> On Tue, 14 Dec 2021 15:54:50 +0000
>> Wols Lists <antlists@youngman.org.uk> wrote:
>> 

>>> Any advice on how to debug a hang - basically I need something that'll
>>> just sit there so when it crashes (and I press the reset button to
>>> recover) I'll have some sort of trace. It would be nice to prove it's
>>> not the disk stack at fault ...
>>> 
>>> Obviously, "set these options in the kernel" won't faze me ...
>> 
>> Set up "netconsole":
>> https://www.kernel.org/doc/html/latest/networking/netconsole.html
>> https://wiki.ubuntu.com/Kernel/Netconsole
>> 

Phil> +1 for netconsole.  Also enable the SysRq key for thread dumps.
Phil> So you have the dump before you reset.

Can you explain more about your hardware?  Is it new?  Have you made
any changes recently, in either hardware or software?  Is there any
overclocking?  What daemons are you running?  Anything that changed
deserves a long hard look.

If you have another system, remote syslog might also be something to
add into the mix.  I also really like serial consoles, so you can
capture all the output onto another system easily.

It might also be useful to try going back to an older kernel, but I
haven't a clue how hard that is with gentoo.  Does the distro still
expect you to compile everything and bootstrap yourself these days?

Good luck!
John



* Re: Debugging system hangs
  2021-12-14 15:54 Debugging system hangs Wols Lists
  2021-12-14 17:46 ` Roman Mamedov
@ 2021-12-15 12:07 ` o1bigtenor
  2021-12-15 16:45 ` Roger Heflin
  2 siblings, 0 replies; 16+ messages in thread
From: o1bigtenor @ 2021-12-15 12:07 UTC
  To: Wols Lists; +Cc: linux-raid

On Wed, Dec 15, 2021 at 3:51 AM Wols Lists <antlists@youngman.org.uk> wrote:
>
> Don't know if this is off-topic or not, seeing as my system is very much
> reliant on raid ...
>
> But basically I'm seeing the system just stop responding. Typically it's
> in screensaver mode, I've got a blank screen, and it won't wake up. (I
> used to think it was something to do with Thunderbird, it mostly
> happened while TB was hammering the system, but no ...)
>
> Today, I had it happen while the system was idle but not in screensaver,
> I run xosview, and everything was clearly frozen - including xosview.
>
I have issues similar to yours, but mine are related to using nouveau
drivers on nvidia gpus. I'm running a complicated graphics system, so I
don't know whether that's the root of your issue.

If you are running nouveau, well - - - then I have some ideas for your
delectation.

Please advise.

Regards


* Re: Debugging system hangs
  2021-12-14 15:54 Debugging system hangs Wols Lists
  2021-12-14 17:46 ` Roman Mamedov
  2021-12-15 12:07 ` o1bigtenor
@ 2021-12-15 16:45 ` Roger Heflin
  2021-12-15 21:53   ` Wol
  2 siblings, 1 reply; 16+ messages in thread
From: Roger Heflin @ 2021-12-15 16:45 UTC
  To: Wols Lists; +Cc: linux-raid

If you cannot log in to the machine via ssh, also try pinging it.  If
ping works but ssh does not, either the ssh daemon died or the machine
is paging so heavily that user space cannot respond in a reasonable time.
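
The check can be scripted from a second machine. This is a sketch only:
classify_hang and probe_host are hypothetical helper names, and the
reading of the "ping fails" case is an assumption added here, not
something stated above.

```shell
#!/bin/sh
# Interpret the two probe results; each argument is "yes" or "no".
# classify_hang is a hypothetical helper name for this sketch.
classify_hang() {
    case "$1/$2" in
        yes/yes) echo "responsive" ;;
        yes/no)  echo "kernel answers ping; sshd dead or user space starved" ;;
        no/*)    echo "no ping reply; kernel or network path is down" ;;  # assumption
        *)       echo "unknown"; return 1 ;;
    esac
}

# Probe a host with ping and a non-interactive ssh login, then classify.
probe_host() {
    host=$1
    if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then p=yes; else p=no; fi
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1
    then s=yes; else s=no; fi
    classify_hang "$p" "$s"
}

# Usage, from another machine on the same network:
#   probe_host thewolery
```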

If the disk were the issue, there should be messages about something in
the disk layer timing out, but it sounds like there aren't any of
those sorts of messages.  A controller, PCI slot, or other hardware
fault will in some cases cause an immediate power cycle and reboot.

You might also configure kdump; there should be docs somewhere on
configuring it for your distribution.  Once it is configured, test it
with "echo c > /proc/sysrq-trigger", which should crash the machine
and leave you with a kernel core dump plus dmesg from the time of the
crash.  If kdump is configured and working, the machine will dump
memory and typically boot back up automatically.
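
A sketch of sanity checks to run before the test crash. It assumes the
usual kdump arrangement in which memory is reserved with a crashkernel=
boot parameter; has_crashkernel is a hypothetical helper name. Note that
kdump only fires on a panic, so for a silent hang the lockup detectors
would also have to be told to panic (an assumption about what this
particular hang needs, and it requires the detectors to be built in).

```shell
#!/bin/sh
# Succeed if a kernel command line reserves memory for a crash kernel.
# has_crashkernel is a hypothetical helper name for this sketch.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (as root):
#   has_crashkernel "$(cat /proc/cmdline)" && echo "crash memory reserved"
#   cat /sys/kernel/kexec_crash_loaded    # 1 once a crash kernel is loaded
#   sysctl -w kernel.softlockup_panic=1   # turn a detected soft lockup into a panic
#   sysctl -w kernel.hardlockup_panic=1   # same for hard lockups (NMI watchdog)
#   echo c > /proc/sysrq-trigger          # deliberate test crash
```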



* Re: Debugging system hangs
  2021-12-15  1:08     ` John Stoffel
@ 2021-12-15 21:45       ` Wol
  0 siblings, 0 replies; 16+ messages in thread
From: Wol @ 2021-12-15 21:45 UTC
  To: John Stoffel, Phil Turmel; +Cc: Roman Mamedov, linux-raid

On 15/12/2021 01:08, John Stoffel wrote:
> Can you explain more about your hardware?  Is it new?  Have you made
> any changes recently?  Is there any over clocking?  As for changes,
> I'm talking both hardware and software.  What daemons are you running?
> Anything that changed is a good thing to take a long hard look at.

Well, it's a new system that's about three years old ... it's a 
home-built, most of which I bought about three years ago, and I'm 
finally finishing off commissioning it.

The hard drives are new Seagate Ironwolves x2 and an old Barracuda, 32GB 
of new DDR4 ram, and a Radeon 4350 video card which is also old but 
hardly used.
> 
> If you have another system, remote syslog might also be something to
> add into the mix.  I also really like serial consoles, so you can
> capture all the output onto another system easily.
> 
Hardware bought for this system is being re-purposed into a raid 
test-bed so I will hopefully have one available soon (if anyone in SE 
London has any 1TB drives they don't want, they'll be welcome for 
this... :-)

> It might also be useful to try going back to an older kernel, but I
> haven't a clue how hard that is with gentoo.  Does the distro still
> expect you to compile everything and bootstrap yourself these days?

Pretty much. There are utilities that make it un-necessary, but it's 
easy enough and encouraged ... so my trouble COULD be as simple as a 
misconfigured kernel ...

But all this will be a learning experience :-)

Cheers,
Wol


* Re: Debugging system hangs
  2021-12-15 16:45 ` Roger Heflin
@ 2021-12-15 21:53   ` Wol
  2021-12-15 22:05     ` Roger Heflin
  0 siblings, 1 reply; 16+ messages in thread
From: Wol @ 2021-12-15 21:53 UTC
  To: Roger Heflin; +Cc: linux-raid

On 15/12/2021 16:45, Roger Heflin wrote:
> If you cannot login to the machine via ssh, also try pinging it.  If
> ping works but ssh does not either ssh died, or the machine is paging
> so heavily that user space cannot respond in a reasonable time.

"Unable to resolve host name 'thewolery'"

Paging is EXTREMELY unlikely with 32GB ram ... :-)
> 
> If the disk were an issue there should be messages about something in
> the disk layer timing out, but it sounds like there aren't any of
> those sorts of messages.  If it was a controller hardware/pci slot/hw
> issue that will in some cases cause an immediate power cycle and boot
> back up.

Where do I look for those after a reboot? The system basically is 
completely unresponsive - so no it's not a reset or anything, the system 
just stops...
> 
> You might also configure kdump, there should be doc's someplace on
> configuring it for your distribution, once configured then test it
> with "echo c > /proc/sysrq-trigger" and that should crash the machine
> and leave you with a kernel core dump + dmesg from the time of the
> crash.   Also if kdump is configured and working it will crash/dump
> memory and typically boot back up automatically.

I'll have to try it, although an autoreboot might not be a particularly 
good idea ...


* Re: Debugging system hangs
  2021-12-15 21:53   ` Wol
@ 2021-12-15 22:05     ` Roger Heflin
  2021-12-16 21:43       ` John Stoffel
  2021-12-19 17:52       ` Wols Lists
  0 siblings, 2 replies; 16+ messages in thread
From: Roger Heflin @ 2021-12-15 22:05 UTC
  To: Wol; +Cc: linux-raid

There would be various messages.
 grep -E 'ATA| sd |ata[0-9]' /var/log/messages
might get you details.  It will also show when the disks are first
showing up and being reported.

Timeouts look kind of like this:
ata5: SError: { Handshk }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/40:58:40:e8:88/00:00:e8:00:00/40 tag 11 ncq dma 32768 out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/18:60:48:ea:88/00:00:e8:00:00/40 tag 12 ncq dma 12288 out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:68:00:eb:88/00:00:e8:00:00/40 tag 13 ncq dma 4096 out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:78:60:ea:88/00:00:e8:00:00/40 tag 15 ncq dma 4096 out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:c8:f8:e5:88/02:00:e8:00:00/40 tag 25 ncq dma 266240 out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/40:d0:00:e8:88/00:00:e8:00:00/40 tag 26 ncq dma 32768 out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/c8:d8:80:e8:88/01:00:e8:00:00/40 tag 27 ncq dma 233472 out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:f8:90:eb:88/00:00:e8:00:00/40 tag 31 ncq dma 4096 out#012         res 40/00:6c:00:eb:88/00:00:e8:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata5.00: configured for UDMA/133
ata5: EH complete
[4544065.390549] ata4.00: exception Emask 0x10 SAct 0xc000 SErr 0x400000 action 0x6 frozen
[4544065.392582] ata4.00: irq_stat 0x08000000, interface fatal error
[4544065.394543] ata4: SError: { Handshk }
[4544065.396595] ata4.00: failed command: WRITE FPDMA QUEUED
[4544065.398523] ata4.00: cmd 61/40:70:98:2d:ea/00:00:85:00:00/40 tag 14 ncq dma 32768 out
[4544065.398523]          res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10 (ATA bus error)
[4544065.402441] ata4.00: status: { DRDY }
[4544065.404753] ata4.00: failed command: WRITE FPDMA QUEUED
[4544065.406946] ata4.00: cmd 61/40:78:18:2e:ea/00:00:85:00:00/40 tag 15 ncq dma 32768 out
[4544065.406946]          res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10 (ATA bus error)
[4544065.410850] ata4.00: status: { DRDY }
[4544065.412787] ata4: hard resetting link
[4544065.877609] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[4544065.880880] ata4.00: configured for UDMA/133
[4544065.882816] ata4: EH complete
ata4.00: exception Emask 0x10 SAct 0xc000 SErr 0x400000 action 0x6 frozen
ata4.00: irq_stat 0x08000000, interface fatal error
ata4: SError: { Handshk }
ata4.00: failed command: WRITE FPDMA QUEUED
ata4.00: cmd 61/40:70:98:2d:ea/00:00:85:00:00/40 tag 14 ncq dma 32768 out#012         res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10 (ATA bus error)
ata4.00: status: { DRDY }
ata4.00: failed command: WRITE FPDMA QUEUED
ata4.00: cmd 61/40:78:18:2e:ea/00:00:85:00:00/40 tag 15 ncq dma 32768 out#012         res 40/00:7c:18:2e:ea/00:00:85:00:00/40 Emask 0x10 (ATA bus error)
ata4.00: status: { DRDY }
ata4: hard resetting link
ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete
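
When a log contains many of these, a per-device tally makes it easier to
see which link is misbehaving. A small sketch; count_ata_failures is a
hypothetical helper name, and it only counts the "failed command" lines
in the format shown above, with or without a leading timestamp.

```shell
#!/bin/sh
# Read kernel log text on stdin and print "<device> <count>" for each
# ATA device that logged a "failed command" line.
# count_ata_failures is a hypothetical helper name for this sketch.
count_ata_failures() {
    awk '/failed command:/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+\.[0-9]+:$/) {
                     dev = $i
                     sub(/:$/, "", dev)   # drop the trailing colon
                     n[dev]++
                 }
         }
         END { for (d in n) print d, n[d] }' | sort
}

# Usage:
#   dmesg | count_ata_failures
```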



The autoreboot only happens after the machine has already 'crashed'
and would have been otherwise unresponsive anyway.



* Re: Debugging system hangs
  2021-12-15 22:05     ` Roger Heflin
@ 2021-12-16 21:43       ` John Stoffel
  2021-12-16 21:50         ` Roger Heflin
  2021-12-16 22:30         ` Wols Lists
  2021-12-19 17:52       ` Wols Lists
  1 sibling, 2 replies; 16+ messages in thread
From: John Stoffel @ 2021-12-16 21:43 UTC
  To: Roger Heflin; +Cc: Wol, linux-raid


Another thing that struck me is maybe it's time to boot into a small
stress testing image and see if it's more of a hardware issue.  It
might also be a power supply issue, where as the load goes up, your
power supply can't keep the voltage up and the system fails that way?

There's the 'stress-ng' package for beating on systems.  And I think
I've used 'sysrecue' in the past to boot up systems and run stress
tests.
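
A possible starting point for such a run, as a sketch. The helper name
stress_cmd, the worker counts, and the example duration and size are
assumptions; the flags themselves are standard stress-ng options. The
function only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Print a stress-ng command line that loads all CPUs and some memory.
# stress_cmd is a hypothetical helper name for this sketch.
stress_cmd() {
    # $1 = run time (e.g. 10m)   $2 = size given to the vm stressor (e.g. 1G)
    # --cpu 0 starts one CPU worker per online CPU.
    printf 'stress-ng --cpu 0 --vm 2 --vm-bytes %s --timeout %s --metrics-brief\n' \
        "$2" "$1"
}

# Usage: review the printed command, then run it by hand:
#   stress_cmd 10m 1G
```

If the box survives a long run of this but still locks up when idle,
load-related causes such as power delivery under load look less likely.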

Getting the regular OS out of the way with something lower level and
simpler to stress test the hardware is a good idea.

https://www.stresslinux.org/sl/

Might be another good option.

Good luck!
John


* Re: Debugging system hangs
  2021-12-16 21:43       ` John Stoffel
@ 2021-12-16 21:50         ` Roger Heflin
  2021-12-16 22:08           ` John Stoffel
  2021-12-16 22:30         ` Wols Lists
  1 sibling, 1 reply; 16+ messages in thread
From: Roger Heflin @ 2021-12-16 21:50 UTC
  To: John Stoffel; +Cc: Wol, linux-raid

If the power supply cannot handle it the node will flat out crash.

There is no mechanism for the cpu/memory/mb to deal with a power
supply unable to supply enough.

The load going up will likely be something else, I have never seen hw
show as that.



* Re: Debugging system hangs
  2021-12-16 21:50         ` Roger Heflin
@ 2021-12-16 22:08           ` John Stoffel
  0 siblings, 0 replies; 16+ messages in thread
From: John Stoffel @ 2021-12-16 22:08 UTC
  To: Roger Heflin; +Cc: John Stoffel, Wol, linux-raid

>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:

Roger> If the power supply cannot handle it the node will flat out crash.

Maybe... it might just have the voltage drop enough to cause the
system to hang, instead of quite crashing.  It's hard to debug though...

Roger> There is no mechanism for the cpu/memory/mb to deal with a
Roger> power supply unable to supply enough.

Roger> The load going up will likely be something else, I have never
Roger> seen hw show as that.



* Re: Debugging system hangs
  2021-12-16 21:43       ` John Stoffel
  2021-12-16 21:50         ` Roger Heflin
@ 2021-12-16 22:30         ` Wols Lists
  2021-12-17 10:07           ` Jani Partanen
  1 sibling, 1 reply; 16+ messages in thread
From: Wols Lists @ 2021-12-16 22:30 UTC
  To: John Stoffel, Roger Heflin; +Cc: linux-raid

On 16/12/2021 21:43, John Stoffel wrote:
> Another thing that struck me is maybe it's time to boot into a small
> stress testing image and see if it's more of a hardware issue.  It
> might also be a power supply issue, where as the load goes up, your
> power supply can't keep the voltage up and the system fails that way?

Unlikely, but never say never ...

It doesn't seem to be crashing under load, it's more like getting stuck 
in idle, but I've debugged enough problems to know that it's rarely what 
you initially think it is.

The PSU is a 550W Corsair unit, so unless it's faulty, power load 
certainly won't be an issue...

I'll need to do a lot of playing over Christmas, if I get the chance ...

Cheers,
Wol


* Re: Debugging system hangs
  2021-12-16 22:30         ` Wols Lists
@ 2021-12-17 10:07           ` Jani Partanen
  0 siblings, 0 replies; 16+ messages in thread
From: Jani Partanen @ 2021-12-17 10:07 UTC
  To: Wols Lists, John Stoffel, Roger Heflin; +Cc: linux-raid

Wols Lists wrote on 17/12/2021 at 0.30:
> On 16/12/2021 21:43, John Stoffel wrote:
>> Another thing that struck me is maybe it's time to boot into a small
>> stress testing image and see if it's more of a hardware issue. It
>> might also be a power supply issue, where as the load goes up, your
>> power supply can't keep the voltage up and the system fails that way?
>
> Unlikely, but never say never ...
>
> It doesn't seem to be crashing under load, it's more like getting 
> stuck in idle, but I've debugged enough problems to know that it's 
> rarely what you initially think it is.
>
> The PSU is a 550W Corsair unit, so unless it's faulty power load 
> certainly won't be an issue...
>
> I'll need to do a lot of playing over Christmas, if I get the chance ...
>
> Cheers,
> Wol

I have a couple of old CPUs, an i5 4670K and an i7 4790K, and both have 
idle lockup issues. I solved mine by raising the voltages a little in 
the BIOS. The i7 came from a friend, and I know it has been heavily 
overclocked; it is so bad that with the stock settings the BIOS picks I 
cannot even install an OS, because it just locks up or crashes during 
the install.
But when I raise the voltages, it runs stable. Another source of lockups 
I have had is memory. I was trying to add 16GB of memory, but the sticks 
are not the same brand and speed, and they just don't play nicely 
together even when I run them at stock settings.


// JiiPee


* Re: Debugging system hangs
  2021-12-15 22:05     ` Roger Heflin
  2021-12-16 21:43       ` John Stoffel
@ 2021-12-19 17:52       ` Wols Lists
       [not found]         ` <CAAMCDecF2PoAtAb1w6reU=RYocaDPb0ZVQ20S44QOrh3fVEXNw@mail.gmail.com>
  1 sibling, 1 reply; 16+ messages in thread
From: Wols Lists @ 2021-12-19 17:52 UTC
  To: Roger Heflin; +Cc: linux-raid

On 15/12/2021 22:05, Roger Heflin wrote:
> There would be various messages.
>   grep -E 'ATA| sd |ata[0-9]' /var/log/messages
> might get you details.  It will also show when the disks are first
> showing up and being reported.

thewolery /home/anthony # tail /var/log/messages
tail: cannot open '/var/log/messages' for reading: No such file or directory

Cheers,
Wol


* Re: Debugging system hangs
       [not found]         ` <CAAMCDecF2PoAtAb1w6reU=RYocaDPb0ZVQ20S44QOrh3fVEXNw@mail.gmail.com>
@ 2021-12-19 19:27           ` Wols Lists
  0 siblings, 0 replies; 16+ messages in thread
From: Wols Lists @ 2021-12-19 19:27 UTC
  To: Roger Heflin; +Cc: linux-raid

On 19/12/2021 18:46, Roger Heflin wrote:
> You will need to figure out on your distribution where messages get saved.
> 
> On Sun, Dec 19, 2021, 11:52 AM Wols Lists <antlists@youngman.org.uk 
> <mailto:antlists@youngman.org.uk>> wrote:
> 
>     On 15/12/2021 22:05, Roger Heflin wrote:
>      > There would be various messages.
>      >   grep -E 'ATA| sd |ata[0-9]' /var/log/messages
>      > might get you details.  It will also show when the disks are first
>      > showing up and being reported.
> 
>     thewolery /home/anthony # tail /var/log/messages
>     tail: cannot open '/var/log/messages' for reading: No such file or
>     directory
> 
I found it - journalctl. As luck would have it, the system hung, I think 
at 18:35


thewolery /home/anthony # journalctl --since=18:20:00
-- Journal begins at Mon 2021-06-21 14:57:58 BST, ends at Sun 2021-12-19 
19:21:36 GMT. --
Dec 19 18:29:05 thewolery dmeventd[1226]: dmeventd was idle for 3600 
second(s), exiting.
Dec 19 18:29:05 thewolery dmeventd[1226]: dmeventd shutting down.
Dec 19 18:32:55 thewolery plasmashell[1682]: libkcups: 
Create-Printer-Subscriptions last error: 1282 Bad file descriptor
Dec 19 18:32:55 thewolery plasmashell[1682]: libkcups: Request failed 
1282 -1
Dec 19 18:34:53 thewolery kded5[1655]: ktp-kded-module: "auto-away" 
presence change request: "away" ""
Dec 19 18:34:53 thewolery kded5[1655]: ktp-kded-module: plugin queue 
activation: "away" ""
Dec 19 18:34:53 thewolery kscreenlocker_greet[4420]: Qt: Session 
management error: networkIdsList argument is NULL
Dec 19 18:34:54 thewolery kded5[1655]: ktp-kded-module: 
"screen-saver-away" presence change request: "away" ""
Dec 19 18:34:54 thewolery kded5[1655]: ktp-kded-module: plugin queue 
activation: "away" ""
Dec 19 18:34:54 thewolery kscreenlocker_greet[4420]: 
file:///usr/share/plasma/look-and-feel/org.kde.breeze.desktop/contents/components/UserList.qml:41:9: 
Unable to assign [undefined] to bool
Dec 19 18:34:54 thewolery kscreenlocker_greet[4420]: 
file:///usr/share/plasma/look-and-feel/org.kde.breeze.desktop/contents/components/UserList.qml:41:9: 
Unable to assign [undefined] to bool
Dec 19 18:40:54 thewolery kded5[1655]: ktp-kded-module: "auto-away" 
state change: TelepathyKDEDModulePlugin::Enabled
Dec 19 18:40:54 thewolery kded5[1655]: ktp-kded-module: plugin queue 
activation: "away" ""
Dec 19 18:40:59 thewolery kwin_x11[1659]: qt.qpa.xcb: QXcbConnection: 
XCB error: 3 (BadWindow), sequence: 39263, resource id: 65011771, major 
code: 18 (ChangeProperty), minor code: 0
Dec 19 18:40:59 thewolery kded5[1655]: ktp-kded-module: 
"screen-saver-away" state change: TelepathyKDEDModulePlugin::Enabled
Dec 19 18:40:59 thewolery kded5[1655]: ktp-kded-module: plugin queue 
activation: "unset" ""

So it looks to me as though either nothing gets logged or, quite 
likely, the hang was so abrupt the system never got a chance to log it ...

So it looks like netconsole is it ...
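
For completeness, a sketch of narrowing the journal to kernel messages
from the boot that hung, filtered with the disk pattern suggested
earlier in the thread. It assumes the journal is persistent (it appears
to be here, since it reaches back to June); disk_lines is a hypothetical
helper name.

```shell
#!/bin/sh
# Keep only disk-layer lines from log text on stdin, using the grep
# pattern quoted earlier in this thread.
# disk_lines is a hypothetical helper name for this sketch.
disk_lines() {
    grep -E 'ATA| sd |ata[0-9]'
}

# Usage:
#   journalctl --list-boots                         # find the boot that hung
#   journalctl -k -b -1 --no-pager | disk_lines     # kernel messages, previous boot
```

If that prints nothing for the boot that hung, the disk layer logged no
complaints before the freeze, or none reached the disk.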

And today I discovered that the shop that messed up fixing my computer 
has messed up even further - I've lost the cpu cooler fixing bars ... it 
might not be them but it wouldn't surprise me ... :-( so that's another 
one ordered from Amazon ...

Cheers,
Wol



Thread overview: 16+ messages
2021-12-14 15:54 Debugging system hangs Wols Lists
2021-12-14 17:46 ` Roman Mamedov
2021-12-14 21:59   ` Phil Turmel
2021-12-15  1:08     ` John Stoffel
2021-12-15 21:45       ` Wol
2021-12-15 12:07 ` o1bigtenor
2021-12-15 16:45 ` Roger Heflin
2021-12-15 21:53   ` Wol
2021-12-15 22:05     ` Roger Heflin
2021-12-16 21:43       ` John Stoffel
2021-12-16 21:50         ` Roger Heflin
2021-12-16 22:08           ` John Stoffel
2021-12-16 22:30         ` Wols Lists
2021-12-17 10:07           ` Jani Partanen
2021-12-19 17:52       ` Wols Lists
     [not found]         ` <CAAMCDecF2PoAtAb1w6reU=RYocaDPb0ZVQ20S44QOrh3fVEXNw@mail.gmail.com>
2021-12-19 19:27           ` Wols Lists
