* [Bug 207359] New: MegaRAID SAS 9361 controller hang/reset
From: bugzilla-daemon @ 2020-04-19 18:25 UTC
  To: linuxppc-dev

https://bugzilla.kernel.org/show_bug.cgi?id=207359

            Bug ID: 207359
           Summary: MegaRAID SAS 9361 controller hang/reset
           Product: Platform Specific/Hardware
           Version: 2.5
    Kernel Version: >=v5.4
          Hardware: PPC-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: PPC-64
          Assignee: platform_ppc-64@kernel-bugs.osdl.org
          Reporter: cam@neo-zeon.de
        Regression: No

Created attachment 288623
  --> https://bugzilla.kernel.org/attachment.cgi?id=288623&action=edit
dmesg output for controller hang

On a dual-socket Talos II POWER9 box (36 cores, 144 threads in total), a
MegaRAID SAS 9361-16i PCIe controller hangs fairly consistently under heavy IO
on kernel versions newer than 5.3.18.
I am unable to reproduce this on a 16/32 core/thread amd64 box with a MegaRAID
SAS 9361-16i PCIe controller running the exact same firmware revision.

The box also has a Microsemi SAS HBA, which seems unaffected by this.

System details:
Talos II motherboard
2x POWER9 processors (36 cores / 144 threads in total)
512GB memory
4k page size
MegaRAID SAS 9361-16i PCIe controller (4-disk RAID10 volume, megaraid_sas
driver)
Microsemi HBA with 4x SSDs
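
Since the page size turns out to matter later in this thread, it can help to record it together with the kernel release and architecture when comparing machines. The following is a minimal Python sketch, not something from the original report; the function name `system_summary` is hypothetical, and it assumes a Unix-like host where `os.sysconf("SC_PAGE_SIZE")` is available.

```python
import os
import platform


def system_summary():
    """Collect kernel release, architecture and page size of this host."""
    # SC_PAGE_SIZE is reported in bytes (4096 on a 4k-page kernel).
    page_size = os.sysconf("SC_PAGE_SIZE")
    return {
        "kernel": platform.release(),
        "arch": platform.machine(),  # e.g. "ppc64le" or "x86_64"
        "page_size_kib": page_size // 1024,
    }


if __name__ == "__main__":
    for key, value in system_summary().items():
        print(f"{key}: {value}")
```

Running this on both the POWER9 box and the amd64 comparison box would give directly comparable lines to paste into a report.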

The relevant dmesg messages are attached.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


* [Bug 207359] MegaRAID SAS 9361 controller hang/reset
From: bugzilla-daemon @ 2020-04-19 20:24 UTC
  To: linuxppc-dev

https://bugzilla.kernel.org/show_bug.cgi?id=207359

gyakovlev@gentoo.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gyakovlev@gentoo.org

--- Comment #1 from gyakovlev@gentoo.org ---
In my case I see a similar problem on the same motherboard, but with the
aacraid driver (the Microsemi one):

https://bugzilla.kernel.org/show_bug.cgi?id=206123


* [Bug 207359] MegaRAID SAS 9361 controller hang/reset
From: bugzilla-daemon @ 2020-04-19 20:55 UTC
  To: linuxppc-dev

https://bugzilla.kernel.org/show_bug.cgi?id=207359

--- Comment #2 from Cameron (cam@neo-zeon.de) ---
Looking at bug 206123 above, it's worth noting that the amd64 box I'm using for
comparison has SATA disks, though this is probably still a PPC-specific issue.


* [Bug 207359] MegaRAID SAS 9361 controller hang/reset
From: bugzilla-daemon @ 2020-05-10  3:02 UTC
  To: linuxppc-dev

https://bugzilla.kernel.org/show_bug.cgi?id=207359

--- Comment #3 from Cameron (cam@neo-zeon.de) ---
Created attachment 289041
  --> https://bugzilla.kernel.org/attachment.cgi?id=289041&action=edit
5.6.11 megaraid POWER hang

Still happens with 5.6.11. There seems to be a bit more output this time, and
I've also included the output from shutting down in case it's useful.


* [Bug 207359] MegaRAID SAS 9361 controller hang/reset
From: bugzilla-daemon @ 2020-08-06 17:56 UTC
  To: linuxppc-dev

https://bugzilla.kernel.org/show_bug.cgi?id=207359

--- Comment #4 from Cameron (cam@neo-zeon.de) ---
I converted the box's filesystems from BTRFS to XFS and switched the page size
from 4k to 64k. The problem appears to be entirely gone now. I can run 5.7.13
without issue, and I had verified that 5.7.13 showed the megaraid_sas
controller hang while still running my previous BTRFS+4k page configuration.

Unfortunately, the conversion took a great deal of time, and I wasn't able to
keep the box down any longer to test whether converting to XFS or switching to
64k pages alone resolves the issue. All I can say for certain is that either
switching to XFS, switching to a 64k page size, or both has fixed the problem
for me.
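
The gap in the testing described above can be laid out as a small 2x2 matrix. This is only an illustrative Python sketch: the two outcomes are the ones reported in this comment for the MegaRAID RAID10 volume on 5.7.13, and the names `observed`, `test_matrix` and `untested` are hypothetical.

```python
from itertools import product

# Outcomes stated in this comment for the RAID10 volume on kernel 5.7.13.
# Every other filesystem/page-size combination was never tested there.
observed = {
    ("BTRFS", "4k"): "hang",
    ("XFS", "64k"): "ok",
}


def test_matrix():
    """Return one (filesystem, page size, outcome) row per combination."""
    rows = []
    for fs, page in product(("BTRFS", "XFS"), ("4k", "64k")):
        rows.append((fs, page, observed.get((fs, page), "untested")))
    return rows


def untested():
    """List the combinations that would isolate the actual trigger."""
    return [(fs, page) for fs, page, outcome in test_matrix()
            if outcome == "untested"]


if __name__ == "__main__":
    for fs, page, outcome in test_matrix():
        print(f"{fs:5} {page:3} {outcome}")
```

The two untested cells, BTRFS with 64k pages and XFS with 4k pages, are the ones that would tell the filesystem and the page size apart.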

The backup volume is a single SATA disk that is still using BTRFS (for
snapshotting), and it is not giving me any trouble. But if this is related to
https://bugzilla.kernel.org/show_bug.cgi?id=206123, that may not be
conclusive, because SATA disks may not trigger the issue at all. The single
disk also can't push as much IO as the RAID10 volume, so that may be another
reason.

My quasi-educated, non-kernel-dev guess is that this is probably a bug related
to the 4k page size. It is possible that the regular behavior of BTRFS
exacerbates it (making it easier to reproduce), but I don't know.

Hopefully someone else encountering this issue will find this helpful.
