linux-scsi.vger.kernel.org archive mirror
* [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
@ 2020-01-08  5:59 bugzilla-daemon
  2020-01-08  6:00 ` [Bug 206123] " bugzilla-daemon
                   ` (16 more replies)
  0 siblings, 17 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-01-08  5:59 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

            Bug ID: 206123
           Summary: aacraid ( PM8068) and iommu=nobypass Frozen PHB error
                    on ppc64
           Product: SCSI Drivers
           Version: 2.5
    Kernel Version: 5.4.8
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: AACRAID
          Assignee: scsi_drivers-aacraid@kernel-bugs.osdl.org
          Reporter: gyakovlev@gentoo.org
        Regression: No

Device Talos Raptor T2P9D01 REV 1.01 SAS
System chassis: SC747TQ-R1620B
All disks are attached via SAS backplane on the chassis.

Problem description: I'm running linux 5.4.8 (LE, 64k pages) and using
iommu=nobypass kernel option to catch and prevent illegal DMA access.

Since installing SAS drives (4x HUH721008AL4200), I'm seeing
errors in dmesg and all IO to the controller stops.
The controller tries to reset itself and reports a successful reset, but after
that any IO to ANY disk on the controller hangs.
Detailed errors are in the attached file.

I've been using SATA (both SSD and HDD) devices just fine before,
accessing those SAS drives seems not to trigger the error.

I've tried patching the kernel with the latest aacraid patches from 5.5
https://github.com/torvalds/linux/commits/master/drivers/scsi/aacraid
^ all commits from Oct 15, but the errors are still there.

Steps to reproduce:
I have 4 disks and create a filesystem on each (ext4, but the type is irrelevant).
Then I copy a 5 GB file to tmpfs.
Then I copy that 5 GB file to each disk in parallel a couple of times.
After 1-4 attempts the error happens and all IO to the controller hangs.
The only way to recover is a hard reboot.
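
The parallel-copy step above can be sketched roughly as follows. This is only an illustration of the workload, not the exact commands used for the report: the function name `copy_to_all`, the file naming scheme, and the default of four rounds are assumptions made for the example.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Sequence


def copy_to_all(source: Path, mount_points: Sequence[Path], rounds: int = 4) -> int:
    """Copy `source` to every mount point in parallel, `rounds` times over.

    Returns the number of copies that completed.
    """
    completed = 0
    for round_no in range(rounds):
        targets = [mp / f"copy-{round_no}.bin" for mp in mount_points]
        # One worker per disk so that writes to all disks are in flight together.
        with ThreadPoolExecutor(max_workers=len(targets)) as pool:
            for _ in pool.map(lambda dst: shutil.copyfile(source, dst), targets):
                completed += 1
    return completed


# Hypothetical usage: a large file on tmpfs and four mounted test filesystems.
# copy_to_all(Path("/dev/shm/testfile.bin"),
#             [Path(f"/mnt/sas{i}") for i in range(4)])
```

The paths in the usage comment are placeholders; any large file on tmpfs and any set of filesystems on the affected controller should exercise the same pattern.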

If I boot WITHOUT iommu=nobypass, everything works just fine.

Some key messages from the attached dmesg:

[    0.000000] PowerNV: IOMMU bypass window disabled.
^ here the system shows that the bypass is disabled

[13860.157868] EEH: Frozen PHB#2-PE#fd detected
[13860.157876] EEH: PE location: UOPWR.A100034-Node0-Builtin SAS
^ hang triggered

After that, EEH asks the driver to reset and the block/filesystem layers
start to report errors.

After that the controller reports that it has been reset, but it is not
functional. Any IO to any disk on the controller will hang forever.

In the attached dmesg I have a ZFS filesystem, but the problem is reproducible
with a simple single partition with ext4 on top. With a single partition
IO also never recovers, so please don't focus on ZFS.



Any help appreciated.
I hope it's a driver bug and not a HW bug in PM8068 itself.


full dmesg below

[    0.000000] PowerNV: IOMMU bypass window disabled.
...
[13428.236656] logitech-hidpp-device 0003:046D:4069.0006: multiplier = 8
[13860.157868] EEH: Frozen PHB#2-PE#fd detected
[13860.157876] EEH: PE location: UOPWR.A100034-Node0-Builtin SAS, PHB location:
N/A
[13860.157877] EEH: Frozen PHB#2-PE#fd detected
[13860.157878] EEH: Call Trace:
[13860.157885] EEH: [000000009c57f2e8] __eeh_send_failure_event+0x60/0x110
[13860.157888] EEH: [0000000006b53b28] eeh_dev_check_failure+0x360/0x5f0
[13860.157890] EEH: [000000001947df59] eeh_check_failure+0x98/0x100
[13860.157894] EEH: [0000000066f23435] aac_src_check_health+0x8c/0xc0
[13860.157898] EEH: [00000000361f4dbd] aac_command_thread+0x718/0x930
[13860.157902] EEH: [00000000b5fb52e2] kthread+0x180/0x190
[13860.157906] EEH: [000000005791e370] ret_from_kernel_thread+0x5c/0x74
[13860.157908] EEH: This PCI device has failed 1 times in the last hour and
will be permanently disabled after 5 failures.
[13860.157908] EEH: Notify device drivers to shutdown
[13860.157910] EEH: Beginning: 'error_detected(IO frozen)'
[13860.157914] PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->error_detected(IO
frozen)
[13860.157918] aacraid 0002:01:00.0: aacraid: PCI error detected 2
[13860.158142] sd 0:2:5:0: [sde] tag#788 UNKNOWN(0x2003) Result: hostbyte=0x01
driverbyte=0x00
[13860.158150] sd 0:2:5:0: [sde] tag#788 CDB: opcode=0x2a 2a 00 08 4c a9 05 00
00 40 00
[13860.158154] blk_update_request: I/O error, dev sde, sector 1113933864 op
0x1:(WRITE) flags 0x4700 phys_seg 1 prio class 0
[13860.158168] sd 0:2:5:0: [sde] tag#789 UNKNOWN(0x2003) Result: hostbyte=0x01
driverbyte=0x00
[13860.158171] sd 0:2:5:0: [sde] tag#789 CDB: opcode=0x2a 2a 00 08 4c a9 45 00
00 40 00
[13860.158174] blk_update_request: I/O error, dev sde, sector 1113934376 op
0x1:(WRITE) flags 0x4700 phys_seg 1 prio class 0
[13860.158179] sd 0:2:5:0: [sde] tag#790 UNKNOWN(0x2003) Result: hostbyte=0x01
driverbyte=0x00
[13860.158182] sd 0:2:5:0: [sde] tag#790 CDB: opcode=0x2a 2a 00 08 4c a9 85 00
00 40 00
[13860.158185] blk_update_request: I/O error, dev sde, sector 1113934888 op
0x1:(WRITE) flags 0x4700 phys_seg 1 prio class 0
[13860.158190] sd 0:2:5:0: [sde] tag#791 UNKNOWN(0x2003) Result: hostbyte=0x01
driverbyte=0x00
[13860.158193] sd 0:2:5:0: [sde] tag#791 CDB: opcode=0x2a 2a 00 08 4c a9 c5 00
00 20 00
[13860.158196] blk_update_request: I/O error, dev sde, sector 1113935400 op
0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[13860.158200] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca2565ab744-part1
error=5 type=2 offset=570325749760 size=917504 flags=40080c80
[13860.158416] blk_update_request: I/O error, dev sdf, sector 1448660344 op
0x1:(WRITE) flags 0x4700 phys_seg 1 prio class 0
[13860.158419] sd 0:2:6:0: [sdf] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x01
driverbyte=0x00
[13860.158424] blk_update_request: I/O error, dev sdf, sector 1448660088 op
0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[13860.158426] sd 0:2:4:0: [sdd] tag#35 UNKNOWN(0x2003) Result: hostbyte=0x01
driverbyte=0x00
[13860.158429] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca266d5c270-part1
error=5 type=2 offset=741705707520 size=1048576 flags=40080c80
[13860.158431] sd 0:2:6:0: [sdf] tag#1 CDB: opcode=0x2a 2a 00 0a cb bf cf 00 00
40 00
[13860.158433] sd 0:2:4:0: [sdd] tag#35 CDB: opcode=0x2a 2a 00 08 4c a4 c5 00
00 40 00
[13860.158436] blk_update_request: I/O error, dev sdf, sector 1449000568 op
0x1:(WRITE) flags 0x4700 phys_seg 1 prio class 0
[13860.158440] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca266d5c270-part1
error=5 type=2 offset=741705576448 size=131072 flags=180880
[13860.158445] blk_update_request: I/O error, dev sdd, sector 1113925160 op
0x1:(WRITE) flags 0x4700 phys_seg 1 prio class 0
[13860.158449] blk_update_request: I/O error, dev sdd, sector 1113927208 op
0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[13860.158451] sd 0:2:6:0: [sdf] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x01
driverbyte=0x00
[13860.158455] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca2666c6d98-part1
error=5 type=2 offset=570322341888 size=262144 flags=40080c80
[13860.158456] sd 0:2:6:0: [sdf] tag#2 CDB: opcode=0x2a 2a 00 0a cb c0 0f 00 00
40 00
[13860.158458] sd 0:2:7:0: [sdg] tag#32 UNKNOWN(0x2003) Result: hostbyte=0x01
driverbyte=0x00
[13860.158462] blk_update_request: I/O error, dev sdf, sector 1449001080 op
0x1:(WRITE) flags 0x4700 phys_seg 1 prio class 0
[13860.158464] sd 0:2:4:0: [sdd] tag#36 UNKNOWN(0x2003) Result: hostbyte=0x01
driverbyte=0x00
[13860.158466] sd 0:2:7:0: [sdg] tag#32 CDB: opcode=0x2a 2a 00 0a cb e4 4f 00
00 20 00
[13860.158470] sd 0:2:4:0: [sdd] tag#36 CDB: opcode=0x2a 2a 00 08 4c a5 05 00
00 40 00
[13860.158473] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca26b5708c8-part1
error=5 type=2 offset=741918175232 size=131072 flags=180880
[13860.158480] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca266d5c270-part1
error=5 type=2 offset=741879902208 size=1048576 flags=40080c80
[13860.158483] sd 0:2:4:0: [sdd] tag#38 UNKNOWN(0x2003) Result: hostbyte=0x01
driverbyte=0x00
[13860.158486] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca2565ab744-part1
error=5 type=2 offset=570325094400 size=131072 flags=180880
[13860.158488] sd 0:2:4:0: [sdd] tag#38 CDB: opcode=0x2a 2a 00 08 4c a5 45 00
00 40 00
[13860.158497] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca2666c6d98-part1
error=5 type=2 offset=570321293312 size=1048576 flags=40080c80
[13860.158500] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca266d5c270-part1
error=5 type=2 offset=741918437376 size=1048576 flags=40080c80
[13860.158542] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca2565ab744-part1
error=5 type=2 offset=570324701184 size=131072 flags=180880
[13860.158545] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca2565ab744-part1
error=5 type=2 offset=570323783680 size=131072 flags=180880
[13860.158575] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca2666c6d98-part1
error=5 type=2 offset=570320244736 size=1048576 flags=40080c80
[13860.158583] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca266d5c270-part1
error=5 type=2 offset=741740703744 size=655360 flags=40080c80
[13860.158586] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca2565ab744-part1
error=5 type=2 offset=570325225472 size=524288 flags=40080c80
[13860.158606] zio pool=zdata vdev=/dev/disk/by-id/wwn-0x5000cca26b5708c8-part1
error=5 type=2 offset=741921583104 size=1048576
...
[13860.158939] PCI 0002:01:00.0#00fd: EEH: aacraid driver reports: 'need reset'
[13860.158941] EEH: Finished:'error_detected(IO frozen)' with aggregate
recovery state:'need reset'
[13860.158945] EEH: Collect temporary log
[13860.158970] EEH: of node=0002:01:00.0
[13860.158972] EEH: PCI device/vendor: 028d9005
[13860.158975] EEH: PCI cmd/status register: 00100146
[13860.158976] EEH: PCI-E capabilities and status follow:
[13860.158987] EEH: PCI-E 00: 00020010 000081a2 00002950 00437083 
[13860.158996] EEH: PCI-E 10: 10820000 00000000 00000000 00000000 
[13860.158997] EEH: PCI-E 20: 00000000 
[13860.158998] EEH: PCI-E AER capability register set follows:
[13860.159009] EEH: PCI-E AER 00: 30020001 00000000 00400000 00462030 
[13860.159017] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000 
[13860.159026] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
[13860.159029] EEH: PCI-E AER 30: 00000000 00000000 
[13860.159031] PHB4 PHB#2 Diag-data (Version: 1)
[13860.159032] brdgCtl:    00000002
[13860.159033] RootSts:    00060040 00402000 e0820008 00100107 00000800
[13860.159035] PhbSts:     0000001c00000000 0000001c00000000
[13860.159036] Lem:        0000000100000080 0000000000000000 0000000000000080
[13860.159038] PhbErr:     0000028000000000 0000020000000000 2148000098000240
a008400000000000
[13860.159039] RxeTceErr:  6000000000000000 2000000000000000 c0000000000000fd
0000000000000000
[13860.159041] PblErr:     0000000000020000 0000000000020000 0000000000000000
0000000000000000
[13860.159042] RegbErr:    0000004000000000 0000004000000000 8800000400000000
0000000000000000
[13860.159044] PE[0fd] A/B: 8300b03800000000 8000000000000000
[13860.159046] EEH: Reset without hotplug activity
[13865.217467] aacraid 0002:01:00.0: enabling device (0140 -> 0142)
[13865.224325] EEH: Beginning: 'slot_reset'
[13865.224334] PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->slot_reset()
[13865.224337] aacraid 0002:01:00.0: aacraid: PCI error - slot reset
[13865.224401] PCI 0002:01:00.0#00fd: EEH: aacraid driver reports: 'recovered'
[13865.224402] EEH: Finished:'slot_reset' with aggregate recovery
state:'recovered'
[13865.224403] EEH: Notify device driver to resume
[13865.224404] EEH: Beginning: 'resume'
[13865.224406] PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->resume()




^ at this point all IO to any of the disks on the controller hangs forever.
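
As a reading aid for the failed writes in the log above, here is a small sketch that decodes the WRITE(10) CDBs (opcode 0x2a) and relates them to the sector numbers printed by blk_update_request. The helper name `decode_write10` is made up for this example, and the 4096-byte logical block size is an assumption; it is consistent with the log, since LBA * 8 matches the 512-byte sector numbers reported by the block layer.

```python
def decode_write10(cdb_hex: str, block_size: int = 4096) -> dict:
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    block_size is the assumed logical block size of the disk in bytes.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {
        "lba": lba,
        "blocks": blocks,
        # The block layer reports positions in 512-byte sectors.
        "sector_512": lba * block_size // 512,
        "bytes": blocks * block_size,
    }


# CDB of tag#788 on sde, taken from the log above.
info = decode_write10("2a 00 08 4c a9 05 00 00 40 00")
print(info["sector_512"], info["bytes"])  # prints: 1113933864 262144
```

The decoded start sector equals the "sector 1113933864" reported for the first failed write on sde, so each of these commands is an ordinary 256 KiB write that was in flight when the PE was frozen.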

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
@ 2020-01-08  6:00 ` bugzilla-daemon
  2020-01-08  6:05 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-01-08  6:00 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #1 from gyakovlev@gentoo.org ---
Created attachment 286681
  --> https://bugzilla.kernel.org/attachment.cgi?id=286681&action=edit
full dmesg


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
  2020-01-08  6:00 ` [Bug 206123] " bugzilla-daemon
@ 2020-01-08  6:05 ` bugzilla-daemon
  2020-01-08  6:25 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-01-08  6:05 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #2 from gyakovlev@gentoo.org ---
also I have to add that attaching disks directly to controller ports (bypassing
backplane) makes no difference.


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
  2020-01-08  6:00 ` [Bug 206123] " bugzilla-daemon
  2020-01-08  6:05 ` bugzilla-daemon
@ 2020-01-08  6:25 ` bugzilla-daemon
  2020-04-20  3:18 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-01-08  6:25 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

Timothy Pearson (tpearson@raptorengineering.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tpearson@raptorengineering.
                   |                            |com

--- Comment #3 from Timothy Pearson (tpearson@raptorengineering.com) ---
If I'm decoding this right, the EEH is caused by a PCIe configuration space
write, triggering a correctable error in the PCIe core.  I have no way of
knowing if the address reported is valid (I suspect it is not) but would be
0x0.


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
                   ` (2 preceding siblings ...)
  2020-01-08  6:25 ` bugzilla-daemon
@ 2020-04-20  3:18 ` bugzilla-daemon
  2020-04-20 18:41 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-04-20  3:18 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

Oliver O'Halloran (oohall@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |oohall@gmail.com

--- Comment #4 from Oliver O'Halloran (oohall@gmail.com) ---
(In reply to Timothy Pearson from comment #3)
> If I'm decoding this right, the EEH is caused by a PCIe configuration space
> write, triggering a correctable error in the PCIe core.  I have no way of
> knowing if the address reported is valid (I suspect it is not) but would be
> 0x0.

$ pest 8300b03800000000 8000000000000000
Transaction type: DMA Read Response
Invalid MMIO Address
TCE Page Fault
TCE Access Fault
LEM Bit Number 56
Requestor 00:0.0
MSI Data 0x0000
Fault Address = 0x0000000000000000

A TCE fault makes more sense given that it doesn't happen when bypass is
enabled. I'm leaning towards this being a driver bug, but it could be a powerpc
IOMMU specific issue. I'll investigate.
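
For anyone who wants to repeat the decode on their own logs, the two words passed to pest are the A/B values from the "PE[...] A/B:" line of the PHB diag-data dump. A minimal sketch for pulling them out of a dmesg line follows; the regular expression and the function name `pest_words` are assumptions for this example, and it deliberately does not interpret any bits itself.

```python
import re

# Matches e.g. "PE[0fd] A/B: 8300b03800000000 8000000000000000"
PE_LINE = re.compile(
    r"PE\[(?P<pe>[0-9a-f]+)\] A/B: (?P<a>[0-9a-f]{16}) (?P<b>[0-9a-f]{16})"
)


def pest_words(dmesg_line: str):
    """Return (pe_number, word_a, word_b) as integers, or None if no match."""
    m = PE_LINE.search(dmesg_line)
    if m is None:
        return None
    return int(m["pe"], 16), int(m["a"], 16), int(m["b"], 16)


line = "[13860.159044] PE[0fd] A/B: 8300b03800000000 8000000000000000"
pe, a, b = pest_words(line)
# Build the command line used above.
print(f"pest {a:016x} {b:016x}")  # prints: pest 8300b03800000000 8000000000000000
```

The PE number returned here (0xfd) is the same PE#fd named in the "Frozen PHB#2-PE#fd detected" message.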


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
                   ` (3 preceding siblings ...)
  2020-04-20  3:18 ` bugzilla-daemon
@ 2020-04-20 18:41 ` bugzilla-daemon
  2020-05-06  8:21 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-04-20 18:41 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

Cameron (cam@neo-zeon.de) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |cam@neo-zeon.de

--- Comment #5 from Cameron (cam@neo-zeon.de) ---
Bug 207359 may potentially be a duplicate of this one so perhaps some of the
info there could be useful.

(In reply to Oliver O'Halloran from comment #4)
> A TCE fault makes more sense given that it doesn't happen when bypass is
> enabled. I'm leaning towards this being a driver bug, but it could be a
> powerpc IOMMU specific issue. I'll investigate.


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
                   ` (4 preceding siblings ...)
  2020-04-20 18:41 ` bugzilla-daemon
@ 2020-05-06  8:21 ` bugzilla-daemon
  2020-09-09 19:07 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-05-06  8:21 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #6 from gyakovlev@gentoo.org ---
Tried Linux 5.6.10. The error now happens right at boot, but at least the
controller reset seems to work; previously a reboot was needed to access the
disks again.

[May 6 01:10] PowerNV: IOMMU bypass window disabled.
...
[   24.609683] Adaptec aacraid driver 1.2.1[50983]-custom
[   24.609784] aacraid 0002:01:00.0: enabling device (0140 -> 0142)
[   24.628036] aacraid: Comm Interface type3 enabled
...
[   25.661962] EEH: Recovering PHB#2-PE#fd
[   25.662010] EEH: PE location: UOPWR.A100034-Node0-Builtin SAS, PHB location:
N/A
[   25.662097] EEH: Frozen PHB#2-PE#fd detected
[   25.662145] EEH: Call Trace:
[   25.662186] EEH: [(____ptrval____)] __eeh_send_failure_event+0x60/0x110
[   25.662282] EEH: [(____ptrval____)] eeh_dev_check_failure+0x360/0x5f0
[   25.662373] EEH: [(____ptrval____)] eeh_check_failure+0x98/0x100
[   25.666794] EEH: [(____ptrval____)] aac_src_check_health+0x8c/0xc0
[   25.669770] EEH: [(____ptrval____)] aac_command_thread+0x718/0x930
[   25.672745] EEH: [(____ptrval____)] kthread+0x180/0x190
[   25.675719] EEH: [(____ptrval____)] ret_from_kernel_thread+0x5c/0x6c
[   25.678722] EEH: This PCI device has failed 1 times in the last hour and
will be permanently disabled after 5 failures.
[   25.681822] EEH: Notify device drivers to shutdown
[   25.684910] EEH: Beginning: 'error_detected(IO frozen)'
[   25.688007] PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->error_detected(IO
frozen)
[   25.688011] aacraid 0002:01:00.0: aacraid: PCI error detected 2
[   25.695317] PCI 0002:01:00.0#00fd: EEH: aacraid driver reports: 'need reset'
[   25.695320] EEH: Finished:'error_detected(IO frozen)' with aggregate
recovery state:'need reset'
[   25.695325] EEH: Collect temporary log
[   25.695354] EEH: of node=0002:01:00.0
[   25.695358] EEH: PCI device/vendor: 028d9005
[   25.695361] EEH: PCI cmd/status register: 00100146
[   25.695362] EEH: PCI-E capabilities and status follow:
[   25.695376] EEH: PCI-E 00: 00020010 000081a2 00002950 00437083
[   25.695387] EEH: PCI-E 10: 10820000 00000000 00000000 00000000
[   25.695389] EEH: PCI-E 20: 00000000
[   25.695391] EEH: PCI-E AER capability register set follows:
[   25.695404] EEH: PCI-E AER 00: 30020001 00000000 00400000 00462030
[   25.695415] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000
[   25.695426] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[   25.695430] EEH: PCI-E AER 30: 00000000 00000000
[   25.695432] PHB4 PHB#2 Diag-data (Version: 1)
[   25.695434] brdgCtl:    00000002
[   25.695436] RootSts:    00000040 00402000 e0820008 00100107 00000000
[   25.695438] PhbSts:     0000001c00000000 0000001c00000000
[   25.695440] Lem:        0000000000000080 0000000000000000 0000000000000080
[   25.695443] PhbErr:     0000020000000000 0000020000000000 2148000098000240
a008400000000000
[   25.695445] RxeTceErr:  6000000000000000 2000000000000000 40000000000000fd
0000000000000000
[   25.695450] PE[0fd] A/B: 8000b03800000000 8000000000000000
[   25.695453] EEH: Reset without hotplug activity
...
aacraid 0002:01:00.0: enabling device (0140 -> 0142)
[ 1392.284584276,3] PHB#0002[0:2]:                  brdgCtl = 00000002
[ 1392.284685636,3] PHB#0002[0:2]:             deviceStatus = 00000040
[ 1392.284739080,3] PHB#0002[0:2]:               slotStatus = 00402000
[ 1392.284804382,3] PHB#0002[0:2]:               linkStatus = e0820008
[ 1392.284857805,3] PHB#0002[0:2]:             devCmdStatus = 00100107
[ 1392.284899389,3] PHB#0002[0:2]:             devSecStatus = 00000000
[ 1392.284948786,3] PHB#0002[0:2]:          rootErrorStatus = 00000000
[ 1392.285006352,3] PHB#0002[0:2]:          corrErrorStatus = 00000000
[ 1392.285055882,3] PHB#0002[0:2]:        uncorrErrorStatus = 00000000
[ 1392.285113499,3] PHB#0002[0:2]:                   devctl = 00000040
[ 1392.285162880,3] PHB#0002[0:2]:                  devStat = 00000000
[ 1392.285224300,3] PHB#0002[0:2]:                  tlpHdr1 = 00000000
[ 1392.285285888,3] PHB#0002[0:2]:                  tlpHdr2 = 00000000
[ 1392.285355027,3] PHB#0002[0:2]:                  tlpHdr3 = 00000000
[ 1392.285404499,3] PHB#0002[0:2]:                  tlpHdr4 = 00000000
[ 1392.285473783,3] PHB#0002[0:2]:                 sourceId = 00000000
[ 1392.285523293,3] PHB#0002[0:2]:                     nFir = 0000000000000000
[ 1392.285599065,3] PHB#0002[0:2]:                 nFirMask = 0030001c00000000
[ 1392.285658870,3] PHB#0002[0:2]:                  nFirWOF = 0000000000000000
[ 1392.285718721,3] PHB#0002[0:2]:                 phbPlssr = 0000001c00000000
[ 1392.285778426,3] PHB#0002[0:2]:                   phbCsr = 0000001c00000000
[ 1392.285834260,3] PHB#0002[0:2]:                   lemFir = 0000000000000080
[ 1392.285894227,3] PHB#0002[0:2]:             lemErrorMask = 0000000000000000
[ 1392.285954146,3] PHB#0002[0:2]:                   lemWOF = 0000000000000080
[ 1392.286017988,3] PHB#0002[0:2]:           phbErrorStatus = 0000020000000000
[ 1392.286085562,3] PHB#0002[0:2]:      phbFirstErrorStatus = 0000020000000000
[ 1392.286145499,3] PHB#0002[0:2]:             phbErrorLog0 = 2148000098000240
[ 1392.286205500,3] PHB#0002[0:2]:             phbErrorLog1 = a008400000000000
[ 1392.286265282,3] PHB#0002[0:2]:        phbTxeErrorStatus = 0000000000000000
[ 1392.286328808,3] PHB#0002[0:2]:   phbTxeFirstErrorStatus = 0000000000000000
[ 1392.286388242,3] PHB#0002[0:2]:          phbTxeErrorLog0 = 0000000000000000
[ 1392.286448308,3] PHB#0002[0:2]:          phbTxeErrorLog1 = 0000000000000000
[ 1392.286508132,3] PHB#0002[0:2]:     phbRxeArbErrorStatus = 0000000000000000
[ 1392.286568068,3] PHB#0002[0:2]: phbRxeArbFrstErrorStatus = 0000000000000000
[ 1392.286623656,3] PHB#0002[0:2]:       phbRxeArbErrorLog0 = 0000000000000000
[ 1392.286683206,3] PHB#0002[0:2]:       phbRxeArbErrorLog1 = 0000000000000000
[ 1392.286743009,3] PHB#0002[0:2]:     phbRxeMrgErrorStatus = 0000000000000000
[ 1392.286802898,3] PHB#0002[0:2]: phbRxeMrgFrstErrorStatus = 0000000000000000
[ 1392.286862689,3] PHB#0002[0:2]:       phbRxeMrgErrorLog0 = 0000000000000000
[ 1392.286922435,3] PHB#0002[0:2]:       phbRxeMrgErrorLog1 = 0000000000000000
[ 1392.286982236,3] PHB#0002[0:2]:     phbRxeTceErrorStatus = 6000000000000000
[ 1392.287042233,3] PHB#0002[0:2]: phbRxeTceFrstErrorStatus = 2000000000000000
[ 1392.287101957,3] PHB#0002[0:2]:       phbRxeTceErrorLog0 = 40000000000000fd
[ 1392.287161569,3] PHB#0002[0:2]:       phbRxeTceErrorLog1 = 0000000000000000
[ 1392.287221038,3] PHB#0002[0:2]:        phbPblErrorStatus = 0000000000000000
[ 1392.287280741,3] PHB#0002[0:2]:   phbPblFirstErrorStatus = 0000000000000000
[ 1392.287336316,3] PHB#0002[0:2]:          phbPblErrorLog0 = 0000000000000000
[ 1392.287407731,3] PHB#0002[0:2]:          phbPblErrorLog1 = 0000000000000000
[ 1392.287479365,3] PHB#0002[0:2]:      phbPcieDlpErrorLog1 = 0000000000000000
[ 1392.287550878,3] PHB#0002[0:2]:      phbPcieDlpErrorLog2 = 0000000000000000
[ 1392.287622331,3] PHB#0002[0:2]:    phbPcieDlpErrorStatus = 0000000000000000
[ 1392.287682208,3] PHB#0002[0:2]:       phbRegbErrorStatus = 0040000000000000
[ 1392.287741819,3] PHB#0002[0:2]:  phbRegbFirstErrorStatus = 0000000000000000
[ 1392.287801590,3] PHB#0002[0:2]:         phbRegbErrorLog0 = 4800003c00000000
[ 1392.287861285,3] PHB#0002[0:2]:         phbRegbErrorLog1 = 0000000000000200
[ 1392.287921850,3] PHB#0002[0:2]:                PEST[0fd] = 8000b03800000000
8000000000000000
EEH: Beginning: 'slot_reset'
PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->slot_reset()
aacraid 0002:01:00.0: aacraid: PCI error - slot reset
PCI 0002:01:00.0#00fd: EEH: aacraid driver reports: 'recovered'
EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
EEH: Notify device driver to resume
EEH: Beginning: 'resume'
PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->resume()


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
                   ` (5 preceding siblings ...)
  2020-05-06  8:21 ` bugzilla-daemon
@ 2020-09-09 19:07 ` bugzilla-daemon
  2020-09-09 23:07 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-09-09 19:07 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

Sagar (sagar.biradar@microchip.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sagar.biradar@microchip.com

--- Comment #7 from Sagar (sagar.biradar@microchip.com) ---
Hi @gyakovlev@gentoo.org,
Is this issue still observed?
I tried to reproduce it, but so far no luck; I haven't run into this issue.
Could you please confirm whether it still persists?

Thanks
Sagar


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
                   ` (6 preceding siblings ...)
  2020-09-09 19:07 ` bugzilla-daemon
@ 2020-09-09 23:07 ` bugzilla-daemon
  2020-09-09 23:14 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-09-09 23:07 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #8 from Oliver O'Halloran (oohall@gmail.com) ---
Can you see if this patch fixes it?

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
                   ` (7 preceding siblings ...)
  2020-09-09 23:07 ` bugzilla-daemon
@ 2020-09-09 23:14 ` bugzilla-daemon
  2020-09-10  1:32 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-09-09 23:14 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #9 from gyakovlev@gentoo.org ---
Hi!
Thanks.

Yes, I will test again sometime this week, sooner rather than later.


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
                   ` (8 preceding siblings ...)
  2020-09-09 23:14 ` bugzilla-daemon
@ 2020-09-10  1:32 ` bugzilla-daemon
  2020-09-10  1:46 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-09-10  1:32 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #10 from gyakovlev@gentoo.org ---
(In reply to Oliver O'Halloran from comment #8)
> Can you see if this patch fixes it?
> 
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/

ok I applied this patch to linux-5.4.63

looks very good so far, kernel booted with 'iommu=nobypass' and I don't see any
problems with aacraid yet, it works. I can write to all 8 SAS disks in
parallel, and can't trigger the error.


I'll try to generate torture/heavy random IO on the disks a bit later.

also I may give linux-5.8.8 a try.


* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
  2020-01-08  5:59 [Bug 206123] New: aacraid ( PM8068) and iommu=nobypass Frozen PHB error on ppc64 bugzilla-daemon
                   ` (9 preceding siblings ...)
  2020-09-10  1:32 ` bugzilla-daemon
@ 2020-09-10  1:46 ` bugzilla-daemon
  2020-09-11  3:38 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bugzilla-daemon @ 2020-09-10  1:46 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #11 from Sagar (sagar.biradar@microchip.com) ---
(In reply to gyakovlev from comment #10)
> (In reply to Oliver O'Halloran from comment #8)
> > Can you see if this patch fixes it?
> > 
> > https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/
> 
> ok I applied this patch to linux-5.4.63
> 
> looks very good so far, kernel booted with 'iommu=nobypass' and I don't see
> any problems with aacraid yet, it works. I can write to all 8 SAS disks in
> parallel, and can't trigger the error.
> 
> 
> I'll try to generate torture/heavy random IO on the disks a bit later.
> 
> also I may give linux-5.8.8 a try.

Hi,
Could you please post your findings under heavy IO load?

Also, thanks to Oliver for adding the reference to the potential patch.
Appreciate it.

Thanks
Sagar

* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
From: bugzilla-daemon @ 2020-09-11  3:38 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #12 from gyakovlev@gentoo.org ---
Applied the patch to linux-5.4.64, booted with iommu=nobypass, and ran some
stress-ng tests across all drives.
Looks good so far: wrote close to 1TB of test data with no sign of a problem.
Performance is excellent.
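
For reference, a parallel-write check of the kind described above could be sketched as follows. This is not the stress-ng workload that was actually run; it is a minimal Python illustration, and the function names (`write_and_verify`, `parallel_write_test`), the test file name, and any mount-point paths passed in are hypothetical.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def write_and_verify(directory, size_mib=8, chunk_mib=1):
    """Write random data into `directory`, fsync it, read it back, and
    return True if the SHA-256 of the data read matches what was written."""
    chunk_size = chunk_mib * 1024 * 1024
    # Hypothetical scratch file name; removed again at the end.
    path = os.path.join(directory, "parallel_write_test.bin")

    written = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mib // chunk_mib):
            chunk = os.urandom(chunk_size)
            written.update(chunk)
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data out to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            read_back.update(chunk)

    os.remove(path)
    return written.hexdigest() == read_back.hexdigest()


def parallel_write_test(directories, size_mib=8):
    """Run write_and_verify on every directory at once, one thread each,
    and return a {directory: passed} mapping."""
    with ThreadPoolExecutor(max_workers=len(directories)) as pool:
        results = list(
            pool.map(lambda d: write_and_verify(d, size_mib), directories)
        )
    return dict(zip(directories, results))
```

For example, `parallel_write_test(["/mnt/sas0", "/mnt/sas1"], size_mib=5120)` would write 5 GiB to each of two (assumed) mount points in parallel. The read-back may be served from the page cache, so this mainly exercises the write path.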


Also booted linux-5.8.8 (without the patch): just a tiny bit of IO triggers the
error, the controller does not recover after reset, and everything hangs.
I have not tested a patched 5.8.8.


Conclusion: the patch Oliver linked definitely helps, and the system is stable
and performant.

Hopefully it'll make it into 5.4 and 5.8; I'll be patching manually in the
meantime.


Tested-by: Georgy Yakovlev <gyakovlev@gentoo.org>

* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
From: bugzilla-daemon @ 2020-09-11  6:16 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #13 from Sagar (sagar.biradar@microchip.com) ---
(In reply to gyakovlev from comment #12)
> applied patch to linux-5.4.64, booted with iommu=nobypass and ran some
> stress-ng tests across all drives.
> looks good so far.
> wrote close to 1TB of test data, not a sign of a problem.
> performance is excellent.
> 
> 
> also booted linux-5.8.8 (without patch), just a tiny bit of IO triggers the
> error, controller does not recover after reset, everything hangs.
> I have not tested patched 5.8.8.
> 
> 
> Conclusion is - patch Oliver linked definitely helps and system is stable
> and performant.
> 
> Hopefully it'll make into 5.4 and 5.8, I'll be patching manually meanwhile.
> 
> 
> Tested-by: Georgy Yakovlev <gyakovlev@gentoo.org>

Thanks for the prompt response, Georgy. Appreciate it.
Does that mean you will apply the patch once more on 5.8.8 and, based on the
result, we can consider closing this BZ?

Sagar

* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
From: bugzilla-daemon @ 2020-09-11  6:22 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #14 from gyakovlev@gentoo.org ---
Hi Sagar,

Testing on 5.8 is a bit problematic for me, because some things on that system
require a 5.4 kernel.

The patch applies to 5.8 and I assume it'll work just fine, but I have no plans
to test it. What I meant is that I'll keep patching my kernels until the fix is
backported to release versions of Linux.

Thanks, all, for your help and attention. The bug can definitely be closed from
my point of view.

* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
From: bugzilla-daemon @ 2020-09-11 18:30 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #15 from Sagar (sagar.biradar@microchip.com) ---
Hi Georgy,
Thanks for your response and efforts on this.

Also, thanks to Oliver for pointing to the right patch.
I am closing this one since we are no longer seeing this issue.

Sagar

* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
From: bugzilla-daemon @ 2020-09-11 18:39 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

--- Comment #16 from Sagar (sagar.biradar@microchip.com) ---
Hi Georgy,
I cannot resolve this BZ and mark it "CLOSED" since it is not assigned to me.
Could you please mark it closed, since you are the reporter?

Thanks
Sagar

* [Bug 206123] aacraid ( PM8068) and iommu=nobypass Frozen PHB error  on ppc64
From: bugzilla-daemon @ 2020-09-11 18:48 UTC
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=206123

gyakovlev@gentoo.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                URL|                            |https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/
         Resolution|---                         |PATCH_ALREADY_AVAILABLE

--- Comment #17 from gyakovlev@gentoo.org ---
Sure, closing.
Thanks again.
