* MD Array Unexpected Kernel Hang
@ 2022-06-08 21:36 Alan Braithwaite
  2022-06-08 22:56 ` o1bigtenor
  0 siblings, 1 reply; 5+ messages in thread
From: Alan Braithwaite @ 2022-06-08 21:36 UTC
  To: linux-raid

Forgive me if this is the wrong place to ask for help on this issue.  I've scoured the internet for more tools to debug these arrays and came up short.

I'm afraid I may have done something wrong and gotten my Linux md RAID array into a weird state (I can't re-add a failed device).  It started with a disk failure.

First, some initial context: the array is a non-root partition, 5-device raid6 array in a VM, using drives passed into the VM via virtio.  The host drives are a JBOD connected to an LSI HBA.  For simplicity's sake, I do not pass the LSI controller PCI device through to the VM.  I'm not sure if that's relevant, but the point is that there's another layer of indirection, if it matters.

I observed the disk failure via kernel tasks getting hung reading/writing to disk, despite all disks being reported as healthy via smartmontools on both the host and the guest.  Smartmon tests were fine also (as far as I can tell, a few slow reads but nothing crazy).

I couldn't unmount the disk or stop the array properly with `mdadm --stop /dev/md127`, as both commands would hang indefinitely and the md_raid6 kernel task was also in the `D` (uninterruptible sleep) state.  No threads on the host machine were blocked on IO during this time.  I tried a normal shutdown of the VM, but when that also hung indefinitely (I waited an hour), I forced the machine off.

I brought it back up briefly to verify that there was still an issue and had the same results.  I was able to start and mount the array (ext4 fs on top).  However I was unable to read it or unmount it.   I then forcefully brought the machine back down again.

I took a wild guess and replaced the first drive in the array, which also showed the highest count of read errors corrected by ECC according to smartmon (approx 300k over 470TiB processed).  Fortunately, the array came online and was mountable/readable again.  I replaced it with a spare drive I had, by adding a new disk in KVM and attaching it at the same device name, `/dev/vdb`.  I copied the partition table over from another member with `sfdisk -d /dev/vdc | sfdisk /dev/vdb`, then added the new partition to the array with `mdadm --manage /dev/md127 --add /dev/vdb1`.

After the sync finished, that drive worked for a few days, until I went to reboot the machine for updates again.  At that point the system hung at shutdown again, failing to unmount the array.

So I went through the same process to replace the drive with yet another drive.  This time I tried rebooting immediately after the sync/recovery finished for the newly added drive.  Lo and behold, this one hung too.  What are the chances that 3 drives failed at almost the same exact time?

Assembling the array with only the 4 disks that are not in that first position works, at least for mounting it read-only (I have not tried read/write yet).

How do I go about debugging this?  The `mdadm --examine` command reports `State : clean` for every drive.  Checksums are correct.  Everything looks fine from the tools I can see.  There were no useful md logs in dmesg, aside from the hung task backtrace (below).

I know that detecting disk failures can be tricky (particularly if they're not throwing errors), but I also don't expect md arrays to completely hang when trying to safely abort usage.

Below are the details of my setup.

Please let me know how I can go about debugging this and recovering the full array.  What should I do from here?  It feels like a software issue, since I'm not able to see any other errors via other tools and the "faulty" disks read fine from the host (I haven't tried writing to them).

Please let me know if there's any more information I can provide.

Thanks,
- Alan

## Host Details

Hardware Details:

```
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 3.3
        Release Date: 07/19/2018
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 12 MB
        BIOS Revision: 4.6

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: Supermicro
        Product Name: X9DRT
        Version: 0123456789
        Serial Number: 0123456789
        UUID: 00000000-0000-0000-0000-002590a29e98
        Wake-up Type: Power Switch
        SKU Number: To be filled by O.E.M.
        Family: To be filled by O.E.M.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
        Manufacturer: Supermicro
        Product Name: X9DRT
        Version: 1.21
        Serial Number: ZM2AU38792
        Asset Tag: To be filled by O.E.M.
        Features:
                Board is a hosting board
                Board is replaceable
        Location In Chassis: To be filled by O.E.M.
        Chassis Handle: 0x0003
        Type: Motherboard
        Contained Object Handles: 0

Handle 0x0004, DMI type 4, 42 bytes
Processor Information
        Socket Designation: SOCKET 0
        Type: Central Processor
        Family: Xeon
        Manufacturer: Intel
        ID: E4 06 03 00 FF FB EB BF
        Signature: Type 0, Family 6, Model 62, Stepping 4
```

Host HBA device:

```
LSI 9200-8e 6Gbps 8-lane external SAS HBA P20 IT Mode
```

Arch Linux on Host

```
[abraithwaite@ceres ~]$ uname -a
Linux ceres 5.17.9-arch1-1 #1 SMP PREEMPT Wed, 18 May 2022 17:30:11 +0000 x86_64 GNU/Linux
[abraithwaite@ceres ~]$ pacman -Q | grep -i libvirt
libvirt 1:8.3.0-1
[abraithwaite@ceres ~]$ pacman -Q | grep -i qemu-common
qemu-common 7.0.0-10
```

Host storage details

```
sudo dmesg | grep -i lsi
[    1.862556] mpt2sas_cm0: LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(00.00.00.00)
[    3.419509] scsi 7:0:9:0: Enclosure         LSI CORP SAS2X36          0718 PQ: 0 ANSI: 5
[abraithwaite@ceres ~]$ sudo dmesg | grep -E '(lsi|sas|scsi)'
[    3.379175] mpt2sas_cm0: hba_port entry: 00000000e975d737, port: 255 is added to hba_port list
[    3.384029] mpt2sas_cm0: host_add: handle(0x0001), sas_addr(0x500605b0044145d0), phys(8)
[    3.386017] mpt2sas_cm0: expander_add: handle(0x0009), parent(0x0001), sas_addr(0x50030480003954bf), phys(38)
[    3.386589]  expander-7:0: add: handle(0x0009), sas_addr(0x50030480003954bf)
[    3.397307] mpt2sas_cm0: handle(0xa) sas_address(0x5000cca06d2807be) port_type(0x1)
[    3.398416] scsi 7:0:0:0: Direct-Access     HGST     HUS724020ALS640  A1C4 PQ: 0 ANSI: 6
[    3.398427] scsi 7:0:0:0: SSP: handle(0x000a), sas_addr(0x5000cca06d2807be), phy(8), device_name(0x5000cca06d2807be)
[    3.398431] scsi 7:0:0:0: enclosure logical id (0x500304800039543f), slot(0)
[    3.398438] scsi 7:0:0:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[    3.398676] scsi 7:0:0:0: Power-on or device reset occurred
[    3.399879]  end_device-7:0:0: add: handle(0x000a), sas_addr(0x5000cca06d2807be)
[    3.400117] mpt2sas_cm0: handle(0xb) sas_address(0x5000cca06d2927fe) port_type(0x1)
[    3.400901] scsi 7:0:1:0: Direct-Access     HGST     HUS724020ALS640  A1C4 PQ: 0 ANSI: 6
[    3.400906] scsi 7:0:1:0: SSP: handle(0x000b), sas_addr(0x5000cca06d2927fe), phy(9), device_name(0x5000cca06d2927fe)
[    3.400907] scsi 7:0:1:0: enclosure logical id (0x500304800039543f), slot(1)
[    3.400911] scsi 7:0:1:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[    3.401085] scsi 7:0:1:0: Power-on or device reset occurred
[    3.402223]  end_device-7:0:1: add: handle(0x000b), sas_addr(0x5000cca06d2927fe)
[    3.402454] mpt2sas_cm0: handle(0xc) sas_address(0x5000cca06d2924f2) port_type(0x1)
[    3.403399] scsi 7:0:2:0: Direct-Access     HGST     HUS724020ALS640  A1C4 PQ: 0 ANSI: 6
[    3.403404] scsi 7:0:2:0: SSP: handle(0x000c), sas_addr(0x5000cca06d2924f2), phy(10), device_name(0x5000cca06d2924f2)
[    3.403406] scsi 7:0:2:0: enclosure logical id (0x500304800039543f), slot(2)
[    3.403410] scsi 7:0:2:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[    3.403586] scsi 7:0:2:0: Power-on or device reset occurred
[    3.404672]  end_device-7:0:2: add: handle(0x000c), sas_addr(0x5000cca06d2924f2)
[    3.404915] mpt2sas_cm0: handle(0xd) sas_address(0x5000cca06d283b0a) port_type(0x1)
[    3.405667] scsi 7:0:3:0: Direct-Access     HGST     HUS724020ALS640  A1C4 PQ: 0 ANSI: 6
[    3.405672] scsi 7:0:3:0: SSP: handle(0x000d), sas_addr(0x5000cca06d283b0a), phy(11), device_name(0x5000cca06d283b0a)
[    3.405674] scsi 7:0:3:0: enclosure logical id (0x500304800039543f), slot(3)
[    3.405678] scsi 7:0:3:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[    3.405855] scsi 7:0:3:0: Power-on or device reset occurred
[    3.406944]  end_device-7:0:3: add: handle(0x000d), sas_addr(0x5000cca06d283b0a)
[    3.407182] mpt2sas_cm0: handle(0xe) sas_address(0x5000cca0284dc06a) port_type(0x1)
[    3.407901] scsi 7:0:4:0: Direct-Access     HGST     HUS724020ALS640  A2C0 PQ: 0 ANSI: 6
[    3.407906] scsi 7:0:4:0: SSP: handle(0x000e), sas_addr(0x5000cca0284dc06a), phy(12), device_name(0x5000cca0284dc06a)
[    3.407908] scsi 7:0:4:0: enclosure logical id (0x500304800039543f), slot(4)
[    3.407912] scsi 7:0:4:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[    3.408088] scsi 7:0:4:0: Power-on or device reset occurred
[    3.409221]  end_device-7:0:4: add: handle(0x000e), sas_addr(0x5000cca0284dc06a)
[    3.409465] mpt2sas_cm0: handle(0xf) sas_address(0x5000cca06d287592) port_type(0x1)
[    3.410156] scsi 7:0:5:0: Direct-Access     HGST     HUS724020ALS640  A1C4 PQ: 0 ANSI: 6
[    3.410161] scsi 7:0:5:0: SSP: handle(0x000f), sas_addr(0x5000cca06d287592), phy(14), device_name(0x5000cca06d287592)
[    3.410162] scsi 7:0:5:0: enclosure logical id (0x500304800039543f), slot(6)
[    3.410166] scsi 7:0:5:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[    3.410342] scsi 7:0:5:0: Power-on or device reset occurred
[    3.411402]  end_device-7:0:5: add: handle(0x000f), sas_addr(0x5000cca06d287592)
[    3.411643] mpt2sas_cm0: handle(0x10) sas_address(0x5000cca06d3da1be) port_type(0x1)
[    3.412464] scsi 7:0:6:0: Direct-Access     HGST     HUS724020ALS640  A2C0 PQ: 0 ANSI: 6
[    3.412469] scsi 7:0:6:0: SSP: handle(0x0010), sas_addr(0x5000cca06d3da1be), phy(15), device_name(0x5000cca06d3da1be)
[    3.412471] scsi 7:0:6:0: enclosure logical id (0x500304800039543f), slot(7)
[    3.412474] scsi 7:0:6:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[    3.412669] scsi 7:0:6:0: Power-on or device reset occurred
[    3.413767]  end_device-7:0:6: add: handle(0x0010), sas_addr(0x5000cca06d3da1be)
[    3.413997] mpt2sas_cm0: handle(0x11) sas_address(0x5000cca06d2919ee) port_type(0x1)
[    3.414669] scsi 7:0:7:0: Direct-Access     HGST     HUS724020ALS640  A1C4 PQ: 0 ANSI: 6
[    3.414674] scsi 7:0:7:0: SSP: handle(0x0011), sas_addr(0x5000cca06d2919ee), phy(18), device_name(0x5000cca06d2919ee)
[    3.414676] scsi 7:0:7:0: enclosure logical id (0x500304800039543f), slot(10)
[    3.414679] scsi 7:0:7:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[    3.414855] scsi 7:0:7:0: Power-on or device reset occurred
[    3.415859]  end_device-7:0:7: add: handle(0x0011), sas_addr(0x5000cca06d2919ee)
[    3.416099] mpt2sas_cm0: handle(0x12) sas_address(0x5000cca06d28092a) port_type(0x1)
[    3.416734] scsi 7:0:8:0: Direct-Access     HGST     HUS724020ALS640  A1C4 PQ: 0 ANSI: 6
[    3.416739] scsi 7:0:8:0: SSP: handle(0x0012), sas_addr(0x5000cca06d28092a), phy(19), device_name(0x5000cca06d28092a)
[    3.416741] scsi 7:0:8:0: enclosure logical id (0x500304800039543f), slot(11)
[    3.416744] scsi 7:0:8:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[    3.416921] scsi 7:0:8:0: Power-on or device reset occurred
[    3.418104]  end_device-7:0:8: add: handle(0x0012), sas_addr(0x5000cca06d28092a)
[    3.418734] mpt2sas_cm0: handle(0x13) sas_address(0x50030480003954bd) port_type(0x1)
[    3.419509] scsi 7:0:9:0: Enclosure         LSI CORP SAS2X36          0718 PQ: 0 ANSI: 5
[    3.419514] scsi 7:0:9:0: set ignore_delay_remove for handle(0x0013)
[    3.419516] scsi 7:0:9:0: SES: handle(0x0013), sas_addr(0x50030480003954bd), phy(36), device_name(0x50030480003954bd)
[    3.419518] scsi 7:0:9:0: enclosure logical id (0x500304800039543f), slot(28)
[    3.419521] scsi 7:0:9:0: qdepth(254), tagged(1), scsi_level(6), cmd_que(1)
[    3.419642] scsi 7:0:9:0: Power-on or device reset occurred
[    3.421412]  end_device-7:0:9: add: handle(0x0013), sas_addr(0x50030480003954bd)
[    9.644938] mpt2sas_cm0: port enable: SUCCESS
```

Passing through SAS drives to guest in KVM via this kind of libvirt configuration:

```
<disk type="block" device="disk">
  <driver name="qemu" type="raw" cache="none" io="native"/>
  <source dev="/dev/sdh1" index="7"/>
  <backingStore/>
  <target dev="vdb" bus="virtio"/>
  <alias name="virtio-disk1"/>
  <address type="pci" domain="0x0000" bus="0x08" slot="0x00" function="0x0"/>
</disk>
```
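Because the SMART output is collected on the host (`sdX`) while md only sees the guest names (`vdX`), it can help to tabulate which host device backs which guest disk. The following is a minimal Python sketch, assuming the domain XML has the same shape as the fragment above (in practice it might come from something like `virsh dumpxml`); the function name `host_to_guest_disks` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The <disk> fragment quoted above; a full domain XML holds several of these.
DISK_XML = """
<disk type="block" device="disk">
  <driver name="qemu" type="raw" cache="none" io="native"/>
  <source dev="/dev/sdh1" index="7"/>
  <backingStore/>
  <target dev="vdb" bus="virtio"/>
</disk>
"""

def host_to_guest_disks(xml_text):
    """Return (host_device, guest_device, cache_mode) for every <disk> element."""
    root = ET.fromstring(xml_text)
    mapping = []
    # iter() also yields the root element itself, so a lone <disk> works too.
    for disk in root.iter("disk"):
        source = disk.find("source")
        target = disk.find("target")
        driver = disk.find("driver")
        if source is None or target is None:
            continue  # skip disks without a backing source, e.g. an empty drive
        cache = driver.get("cache") if driver is not None else None
        mapping.append((source.get("dev"), target.get("dev"), cache))
    return mapping

for host, guest, cache in host_to_guest_disks(DISK_XML):
    print(f"host {host} -> guest /dev/{guest} (cache={cache})")
```

With one row per disk, a guest-side name such as `vdb` can be matched to the host device whose `smartctl` output should be read.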

Host disks are formatted as such:

```
sudo fdisk -l /dev/sdb
Disk /dev/sdb: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: HUS724020ALS640
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xaa654211

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdb1        2048 3907029167 3907027120  1.8T 83 Linux
```

## Guest Details

Arch Linux Guest, recently updated the kernel (saw some patches in 5.18.2 that made me hopeful, but alas I'm still having the issue).

```
[root@arch arch]# uname -a
Linux arch 5.18.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 06 Jun 2022 19:58:58 +0000 x86_64 GNU/Linux
```

Bad disk (currently out of the array):

```
mdadm --examine /dev/vdb1
/dev/vdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 238d6359:a2f569cb:7cc48d36:43ba9632
           Name : arch:jbod  (local to host arch)
  Creation Time : Sat Dec  4 18:34:57 2021
     Raid Level : raid6
   Raid Devices : 5

Avail Dev Size : 3906760847 sectors (1862.89 GiB 2000.26 GB)
     Array Size : 5860141056 KiB (5.46 TiB 6.00 TB)
  Used Dev Size : 3906760704 sectors (1862.89 GiB 2000.26 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=143 sectors
          State : clean
    Device UUID : 5a1d61a1:a95d4a05:0e5a9337:d6f76dbb

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Jun  8 12:36:41 2022
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 3eab5336 - correct
         Events : 19946

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : Active device 0
   Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)
```
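As a sanity check on the superblock fields, the reported `Array Size` follows from `Used Dev Size` and the RAID level: raid6 spends two devices' worth of space on parity, so a 5-device array holds three devices' worth of data. A small Python sketch of that arithmetic, using the values above (the function name is illustrative; mdadm reports these sizes in 512-byte sectors):

```python
def raid6_array_size_kib(raid_devices, used_dev_sectors, sector_bytes=512):
    """Usable raid6 capacity in KiB: (N - 2) data devices times the per-device size."""
    data_devices = raid_devices - 2  # two devices' worth of space hold P and Q parity
    return data_devices * used_dev_sectors * sector_bytes // 1024

# Values copied from the `mdadm --examine` output above.
size_kib = raid6_array_size_kib(raid_devices=5, used_dev_sectors=3906760704)
print(size_kib)                  # compare with "Array Size : 5860141056 KiB"
print(round(size_kib / 2**30, 2), "TiB")
```

The result agrees with the `Array Size` line, so the size fields in this superblock are at least internally consistent.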

Good disk:

```
mdadm --examine /dev/vdc1
/dev/vdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 238d6359:a2f569cb:7cc48d36:43ba9632
           Name : arch:jbod  (local to host arch)
  Creation Time : Sat Dec  4 18:34:57 2021
     Raid Level : raid6
   Raid Devices : 5

Avail Dev Size : 3906760847 sectors (1862.89 GiB 2000.26 GB)
     Array Size : 5860141056 KiB (5.46 TiB 6.00 TB)
  Used Dev Size : 3906760704 sectors (1862.89 GiB 2000.26 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=143 sectors
          State : clean
    Device UUID : d2857c0f:dcd8b710:101dfb14:6b3f7ff5

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Jun  8 14:08:50 2022
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : cfae5712 - correct
         Events : 19954

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : Active device 1
   Array State : .AAAA ('A' == active, '.' == missing, 'R' == replacing)
```
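The two superblocks above differ mainly in `Update Time`, `Events`, and `Array State`. To compare those fields across all members without reading each dump by eye, a rough Python sketch like the following could be used. It assumes the `Key : Value` layout shown above, and the helper name `parse_examine` is hypothetical.

```python
def parse_examine(text):
    """Collect 'Key : Value' pairs from `mdadm --examine` output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first " : " only, so timestamps and UUIDs keep their colons.
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

# Excerpts copied from the two dumps above (vdb1 = removed disk, vdc1 = active disk).
VDB1 = """\
    Update Time : Wed Jun  8 12:36:41 2022
         Events : 19946
   Device Role : Active device 0
   Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)
"""
VDC1 = """\
    Update Time : Wed Jun  8 14:08:50 2022
         Events : 19954
   Device Role : Active device 1
   Array State : .AAAA ('A' == active, '.' == missing, 'R' == replacing)
"""

removed, active = parse_examine(VDB1), parse_examine(VDC1)
lag = int(active["Events"]) - int(removed["Events"])
print(f"removed member is {lag} events behind")
print("removed member sees:", removed["Array State"].split()[0])
print("active member sees: ", active["Array State"].split()[0])
```

Here the removed disk is 8 events behind and still records all five slots as active, which is what one would expect from a superblock that stopped being updated once the device left the array.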

Dmesg hang:

```
Jun 05 13:24:17 arch kernel: INFO: task kworker/0:2:258 blocked for more than 122 seconds.
Jun 05 13:24:17 arch kernel:       Not tainted 5.18.1-arch1-1 #1
Jun 05 13:24:17 arch kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 05 13:24:17 arch kernel: task:kworker/0:2     state:D stack:    0 pid:  258 ppid:     2 flags:0x00004000
Jun 05 13:24:17 arch kernel: Workqueue: md submit_flushes [md_mod]
Jun 05 13:24:17 arch kernel: Call Trace:
Jun 05 13:24:17 arch kernel:  <TASK>
Jun 05 13:24:17 arch kernel:  ? wbt_done+0xb0/0xb0
Jun 05 13:24:17 arch kernel:  __schedule+0x37c/0x11f0
Jun 05 13:24:17 arch kernel:  ? check_preempt_wakeup+0x28b/0x2a0
Jun 05 13:24:17 arch kernel:  ? rwb_trace_step+0x80/0x80
Jun 05 13:24:17 arch kernel:  ? wbt_done+0xb0/0xb0
Jun 05 13:24:17 arch kernel:  schedule+0x4f/0xb0
Jun 05 13:24:17 arch kernel:  io_schedule+0x46/0x70
Jun 05 13:24:17 arch kernel:  rq_qos_wait+0xc0/0x130
Jun 05 13:24:17 arch kernel:  ? karma_partition+0x280/0x280
Jun 05 13:24:17 arch kernel:  ? rwb_trace_step+0x80/0x80
Jun 05 13:24:17 arch kernel:  wbt_wait+0xa6/0x110
Jun 05 13:24:17 arch kernel:  __rq_qos_throttle+0x27/0x40
Jun 05 13:24:17 arch kernel:  blk_mq_submit_bio+0x3c4/0x640
Jun 05 13:24:17 arch kernel:  __submit_bio+0xf2/0x180
Jun 05 13:24:17 arch kernel:  submit_bio_noacct_nocheck+0x20b/0x2c0
Jun 05 13:24:17 arch kernel:  submit_flushes+0xd8/0x150 [md_mod 728f525e20ac2cfceb893dc85f8d68d92d31c960]
Jun 05 13:24:17 arch kernel:  process_one_work+0x1c7/0x380
Jun 05 13:24:17 arch kernel:  worker_thread+0x51/0x380
Jun 05 13:24:17 arch kernel:  ? rescuer_thread+0x3a0/0x3a0
Jun 05 13:24:17 arch kernel:  kthread+0xde/0x110
Jun 05 13:24:17 arch kernel:  ? kthread_complete_and_exit+0x20/0x20
Jun 05 13:24:17 arch kernel:  ret_from_fork+0x22/0x30
Jun 05 13:24:17 arch kernel:  </TASK>
```

5.18.2 dmesg changes slightly:

```
Jun 08 12:48:50 arch kernel: INFO: task md127_raid6:420 blocked for more than 614 seconds.
Jun 08 12:48:50 arch kernel:       Tainted: P           OE     5.18.2-arch1-1 #1
Jun 08 12:48:50 arch kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 08 12:48:50 arch kernel: task:md127_raid6     state:D stack:    0 pid:  420 ppid:     2 flags:0x00004000
Jun 08 12:48:50 arch kernel: Call Trace:
Jun 08 12:48:50 arch kernel:  <TASK>
Jun 08 12:48:50 arch kernel:  __schedule+0x37c/0x11f0
Jun 08 12:48:50 arch kernel:  ? __slab_free+0xe0/0x310
Jun 08 12:48:50 arch kernel:  schedule+0x4f/0xb0
Jun 08 12:48:50 arch kernel:  md_super_wait+0x9f/0xd0 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  ? cpuacct_percpu_seq_show+0x20/0x20
Jun 08 12:48:50 arch kernel:  md_bitmap_daemon_work+0x213/0x3a0 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  md_check_recovery+0x47/0x5a0 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  raid5d+0x8d/0x680 [raid456 a9c4e3a091d6fc6eff1c917206c669e086d59fa9]
Jun 08 12:48:50 arch kernel:  ? lock_timer_base+0x61/0x80
Jun 08 12:48:50 arch kernel:  ? md_set_read_only+0x90/0x90 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  ? del_timer_sync+0x73/0xb0
Jun 08 12:48:50 arch kernel:  ? md_set_read_only+0x90/0x90 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  md_thread+0xad/0x190 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  ? cpuacct_percpu_seq_show+0x20/0x20
Jun 08 12:48:50 arch kernel:  kthread+0xde/0x110
Jun 08 12:48:50 arch kernel:  ? kthread_complete_and_exit+0x20/0x20
Jun 08 12:48:50 arch kernel:  ret_from_fork+0x22/0x30
Jun 08 12:48:50 arch kernel:  </TASK>
```
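To see at a glance where each hung task is waiting, the reports can be reduced to the task name, the blocked time, and the first reliable stack frame that belongs to a module. This is only a sketch in Python, assuming the journal line format shown above; frames prefixed with `?` are skipped because the kernel marks them as unreliable, and both helper names are made up.

```python
import re

HUNG = re.compile(r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds")
FRAME = re.compile(r"kernel:\s+(?P<guess>\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+))?")

def summarize_hung_tasks(log):
    """Group reliable stack frames under the hung-task report that precedes them."""
    reports = []
    for line in log.splitlines():
        hung = HUNG.search(line)
        if hung:
            reports.append({"task": hung["name"], "pid": int(hung["pid"]),
                            "secs": int(hung["secs"]), "frames": []})
            continue
        frame = FRAME.search(line)
        if frame and reports and not frame["guess"]:
            reports[-1]["frames"].append((frame["func"], frame["mod"]))
    return reports

def first_module_frame(report):
    """First reliable frame that lives in a loadable module (e.g. md_mod), or None."""
    return next(((fn, mod) for fn, mod in report["frames"] if mod), None)

# A few lines copied from the two traces above.
LOG = """\
Jun 05 13:24:17 arch kernel: INFO: task kworker/0:2:258 blocked for more than 122 seconds.
Jun 05 13:24:17 arch kernel:  ? wbt_done+0xb0/0xb0
Jun 05 13:24:17 arch kernel:  __schedule+0x37c/0x11f0
Jun 05 13:24:17 arch kernel:  wbt_wait+0xa6/0x110
Jun 05 13:24:17 arch kernel:  submit_flushes+0xd8/0x150 [md_mod 728f525e20ac2cfceb893dc85f8d68d92d31c960]
Jun 08 12:48:50 arch kernel: INFO: task md127_raid6:420 blocked for more than 614 seconds.
Jun 08 12:48:50 arch kernel:  __schedule+0x37c/0x11f0
Jun 08 12:48:50 arch kernel:  md_super_wait+0x9f/0xd0 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  md_bitmap_daemon_work+0x213/0x3a0 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
"""

reports = summarize_hung_tasks(LOG)
for r in reports:
    print(r["task"], r["pid"], f"{r['secs']}s", first_module_frame(r))
```

For the two traces above this points at `submit_flushes` (reached through `wbt_wait`) on 5.18.1 and at `md_super_wait` on 5.18.2, i.e. in both cases an md path waiting on I/O submitted to a member device.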

Now tainted because I installed zfs drivers, since that's my next option.


* Re: MD Array Unexpected Kernel Hang
  2022-06-08 21:36 MD Array Unexpected Kernel Hang Alan Braithwaite
@ 2022-06-08 22:56 ` o1bigtenor
  2022-06-09  0:25   ` Alan Braithwaite
  0 siblings, 1 reply; 5+ messages in thread
From: o1bigtenor @ 2022-06-08 22:56 UTC
  To: alan; +Cc: Linux-RAID

On Wed, Jun 8, 2022 at 5:22 PM Alan Braithwaite <alan@braithwaite.dev> wrote:
>
Just someone learning who asked questions in the past here
>
> Please let me know if there's any more information I can provide.
>

smartmontools output is very likely something that will be asked for,
as it gives more information about the drives.
If it is not already installed you might want to get it and then look at
what it says about each drive. That information is often asked
for when the low down and dirty work gets a happening.

HTH


* Re: MD Array Unexpected Kernel Hang
  2022-06-08 22:56 ` o1bigtenor
@ 2022-06-09  0:25   ` Alan Braithwaite
  2022-06-09  7:01     ` Wols Lists
  0 siblings, 1 reply; 5+ messages in thread
From: Alan Braithwaite @ 2022-06-09  0:25 UTC
  To: o1bigtenor; +Cc: Linux-RAID

Appreciated.  Only reason I didn't include it initially is because it's a giant wall of text (which the other debugging info was anyway, so I should have just been proactive).

Anyway, it can now be found below, annotated with the failing drives first.

Thanks,
- Alan



First 2 (sdg, sdh) are the "bad" drives which cause the md hangs when added to the array.  sdh looks particularly healthy to my reading of the output, but was still causing the issues. 
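To put the error counter logs below side by side, the corrected read errors can be normalized by the amount of data each drive has processed. This is a rough Python sketch that assumes the column order printed by `smartctl -x` for SAS drives as it appears in this message; the `read:` rows are copied from the sdg, sdh and sdc outputs below, and the helper names are illustrative.

```python
def parse_counter_row(row):
    """Split one row of the SAS 'Error counter log' into named columns."""
    op, fast, delayed, rewrites, corrected, invocations, gigabytes, uncorrected = row.split()
    return {
        "op": op.rstrip(":"),
        "corrected": int(corrected),       # "Total errors corrected"
        "gigabytes": float(gigabytes),     # "Gigabytes processed [10^9 bytes]"
        "uncorrected": int(uncorrected),   # "Total uncorrected errors"
    }

def corrected_per_tb(row):
    """Corrected errors per 10^12 bytes processed."""
    counters = parse_counter_row(row)
    return counters["corrected"] / (counters["gigabytes"] / 1000.0)

# "read:" rows copied from the smartctl output in this message.
READ_ROWS = {
    "sdg (bad)":  "read:     110826       19         0    110845    1515087     460659.950           0",
    "sdh (bad)":  "read:        525        2         0       527        222       4203.269           0",
    "sdc (good)": "read:     125617        4         0    125621    1511670     470203.046           0",
}

rates = {name: corrected_per_tb(row) for name, row in READ_ROWS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} corrected read errors per TB")
```

By this measure the "good" sdc comes out slightly higher (roughly 267 per TB) than the "bad" sdg (roughly 241 per TB), and sdh is lower than both (roughly 125 per TB), so the corrected-error counts alone do not separate the drives that trigger the hang from the ones that do not.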


[root@ceres tmp]# sudo smartctl -x /dev/sdg
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.9-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724020ALS640
Revision:             A1C4
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca06d287590
Serial number:        P5GR7TBV
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Wed Jun  8 17:22:20 2022 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     42 C
Drive Trip Temperature:        85 C

Manufactured in week 21 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  32
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1530
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 12276642395193344

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:     110826       19         0    110845    1515087     460659.950           0
write:         0        0         0         0     972273      44976.018           0
verify:        0        0         0         0      58029          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   39609                 - [-   -    -]

Long (extended) Self-test duration: 22236 seconds [6.2 hours]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 39704:24 [2382264 minutes]
    Number of background scans performed: 62,  scan progress: 0.00%
    Number of background medium scans performed: 62


Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca06d287591
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca06d287592
    attached SAS address = 0x50030480003954bf
    attached phy identifier = 14
    Invalid DWORD count = 10
    Running disparity error count = 10
    Loss of DWORD synchronization count = 2
    Phy reset problem count = 0

[root@ceres tmp]# sudo smartctl -x /dev/sdh
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.9-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724020ALS640
Revision:             A2C0
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca06d3da1bc
Serial number:        P5H2WSDV
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Wed Jun  8 17:22:25 2022 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     41 C
Drive Trip Temperature:        85 C

Manufactured in week 28 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  23
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1512
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 386113231388672

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:        525        2         0       527        222       4203.269           0
write:         0        0         0         0       4103      13423.670           0
verify:        0        0         0         0      11426          0.000           0

Non-medium error count:        1

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   39221                 - [-   -    -]
# 2  Background short  Completed                   -   34658                 - [-   -    -]
# 3  Background short  Completed                   -   32405                 - [-   -    -]
# 4  Background short  Completed                   -   32380                 - [-   -    -]
# 5  Background short  Completed                   -   32357                 - [-   -    -]
# 6  Background short  Completed                   -   32332                 - [-   -    -]
# 7  Background short  Completed                   -   32308                 - [-   -    -]
# 8  Background short  Completed                   -   32285                 - [-   -    -]
# 9  Background short  Completed                   -   32260                 - [-   -    -]
#10  Background short  Completed                   -   32237                 - [-   -    -]
#11  Background short  Completed                   -   32213                 - [-   -    -]
#12  Background short  Completed                   -   32189                 - [-   -    -]
#13  Background short  Completed                   -   32165                 - [-   -    -]
#14  Background short  Completed                   -   32141                 - [-   -    -]
#15  Background short  Completed                   -   32117                 - [-   -    -]
#16  Background short  Completed                   -   32093                 - [-   -    -]
#17  Background short  Completed                   -   32068                 - [-   -    -]
#18  Background short  Completed                   -   32045                 - [-   -    -]
#19  Background short  Completed                   -   32020                 - [-   -    -]
#20  Background short  Completed                   -   31997                 - [-   -    -]

Long (extended) Self-test duration: 22650 seconds [6.3 hours]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 39315:59 [2358959 minutes]
    Number of background scans performed: 243,  scan progress: 0.00%
    Number of background medium scans performed: 243


Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca06d3da1bd
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca06d3da1be
    attached SAS address = 0x50030480003954bf
    attached phy identifier = 15
    Invalid DWORD count = 18
    Running disparity error count = 16
    Loss of DWORD synchronization count = 3
    Phy reset problem count = 0



# --- Good Drives Start Here ---

[root@ceres tmp]# sudo smartctl -x /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.9-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724020ALS640
Revision:             A1C4
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca06d2927fc
Serial number:        P5GRMNRV
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Wed Jun  8 17:22:31 2022 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     44 C
Drive Trip Temperature:        85 C

Manufactured in week 21 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  33
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2019
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 12460064392609792

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:     125617        4         0    125621    1511670     470203.046           0
write:         0        0         0         0     311583      46916.011           0
verify:        0        0         0         0       1226          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   39776                 - [-   -    -]
# 2  Background short  Completed                   -   39480                 - [-   -    -]

Long (extended) Self-test duration: 22236 seconds [6.2 hours]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 39871:02 [2392262 minutes]
    Number of background scans performed: 65,  scan progress: 0.00%
    Number of background medium scans performed: 65


Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca06d2927fd
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca06d2927fe
    attached SAS address = 0x50030480003954bf
    attached phy identifier = 9
    Invalid DWORD count = 11
    Running disparity error count = 10
    Loss of DWORD synchronization count = 2
    Phy reset problem count = 0

[root@ceres tmp]# sudo smartctl -x /dev/sdd
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.9-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724020ALS640
Revision:             A1C4
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca06d2924f0
Serial number:        P5GRMGEV
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Wed Jun  8 17:22:32 2022 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     44 C
Drive Trip Temperature:        85 C

Manufactured in week 21 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  33
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2031
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 12182041999179776

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:     284059        0         0    284059    1507556     472213.429           0
write:         0        0         0         0     378639      45073.915           0
verify:        0        0         0         0        319          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   39775                 - [-   -    -]
# 2  Background short  Completed                   -   39479                 - [-   -    -]

Long (extended) Self-test duration: 22236 seconds [6.2 hours]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 39870:03 [2392203 minutes]
    Number of background scans performed: 63,  scan progress: 0.00%
    Number of background medium scans performed: 63


Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca06d2924f1
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca06d2924f2
    attached SAS address = 0x50030480003954bf
    attached phy identifier = 10
    Invalid DWORD count = 10
    Running disparity error count = 10
    Loss of DWORD synchronization count = 2
    Phy reset problem count = 0

[root@ceres tmp]# sudo smartctl -x /dev/sde
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.9-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724020ALS640
Revision:             A1C4
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca06d283b08
Serial number:        P5GR3WJV
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Wed Jun  8 17:22:34 2022 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     41 C
Drive Trip Temperature:        85 C

Manufactured in week 21 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  33
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2024
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 12291100328853504

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      72338        1         0     72339    1454480     468885.996           0
write:         0        0         0         0     283314      44839.275           0
verify:        0        0         0         0        754          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   39775                 - [-   -    -]
# 2  Background short  Completed                   -   39479                 - [-   -    -]

Long (extended) Self-test duration: 22236 seconds [6.2 hours]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 39870:31 [2392231 minutes]
    Number of background scans performed: 62,  scan progress: 0.00%
    Number of background medium scans performed: 62


Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca06d283b09
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca06d283b0a
    attached SAS address = 0x50030480003954bf
    attached phy identifier = 11
    Invalid DWORD count = 9
    Running disparity error count = 9
    Loss of DWORD synchronization count = 2
    Phy reset problem count = 0

[root@ceres tmp]# sudo smartctl -x /dev/sdf
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.9-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724020ALS640
Revision:             A2C0
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca0284dc068
Serial number:        P6HBSKZV
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Wed Jun  8 17:22:36 2022 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     37 C
Drive Trip Temperature:        85 C

Manufactured in week 06 of year 2014
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  47
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1871
Elements in grown defect list: 5

Vendor (Seagate Cache) information
  Blocks sent to initiator = 498103379558400

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      44471        1         0     44472      36583      12783.706           0
write:         0        0         0         0      11015      19676.167           0
verify:        0        0         0         0       6221          0.000           0

Non-medium error count:        1

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   46820                 - [-   -    -]
# 2  Background short  Completed                   -   42138                 - [-   -    -]
# 3  Background short  Completed                   -   42111                 - [-   -    -]
# 4  Background short  Completed                   -   42103                 - [-   -    -]
# 5  Background short  Completed                   -   42099                 - [-   -    -]
# 6  Background short  Completed                   -   41365                 - [-   -    -]
# 7  Background short  Completed                   -   41341                 - [-   -    -]
# 8  Background short  Completed                   -   41317                 - [-   -    -]
# 9  Background short  Completed                   -   41293                 - [-   -    -]
#10  Background short  Completed                   -   41269                 - [-   -    -]
#11  Background short  Completed                   -   41245                 - [-   -    -]
#12  Background short  Completed                   -   41221                 - [-   -    -]
#13  Background short  Completed                   -   41197                 - [-   -    -]
#14  Background short  Completed                   -   41172                 - [-   -    -]
#15  Background short  Completed                   -   41149                 - [-   -    -]
#16  Background short  Completed                   -   41125                 - [-   -    -]
#17  Background short  Completed                   -   41101                 - [-   -    -]
#18  Background short  Completed                   -   41077                 - [-   -    -]
#19  Background short  Completed                   -   41053                 - [-   -    -]
#20  Background short  Completed                   -   41029                 - [-   -    -]

Long (extended) Self-test duration: 22650 seconds [6.3 hours]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 46915:04 [2814904 minutes]
    Number of background scans performed: 294,  scan progress: 0.00%
    Number of background medium scans performed: 294

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 42149:02  00000000becb3840  [3,11,0]   Recovered via rewrite in-place


Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca0284dc069
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca0284dc06a
    attached SAS address = 0x50030480003954bf
    attached phy identifier = 12
    Invalid DWORD count = 10
    Running disparity error count = 9
    Loss of DWORD synchronization count = 2
    Phy reset problem count = 0
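
To compare drives on something other than raw counts, the "read:" rows of the error counter logs above can be normalized by the amount of data read. What follows is a rough Python sketch, not part of smartmontools: the function name corrected_reads_per_tb and the regular expression are assumptions made for illustration, and the sample rows are copied from the output above. Corrected ECC reads are generally considered routine on drives like these, so the ratio is only a way to spot an outlier, not a health verdict.

```python
import re

# Matches the "read:" row of the SCSI "Error counter log" table as printed
# by smartctl above (seven numeric columns). The pattern is an assumption
# based on this output, not a documented format.
READ_ROW = re.compile(
    r"^read:\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+(\d+)\s*$",
    re.MULTILINE,
)


def corrected_reads_per_tb(smartctl_text):
    """Return total corrected read errors per 10^12 bytes read, or None."""
    match = READ_ROW.search(smartctl_text)
    if match is None:
        return None
    total_corrected = int(match.group(4))   # "Total errors corrected" column
    gigabytes = float(match.group(6))       # "Gigabytes processed" column
    if gigabytes == 0:
        return None
    return total_corrected / (gigabytes / 1000.0)


# "read:" rows copied from the four drives listed above.
SAMPLE_ROWS = {
    "sdc": "read:     125617        4         0    125621    1511670     470203.046           0",
    "sdd": "read:     284059        0         0    284059    1507556     472213.429           0",
    "sde": "read:      72338        1         0     72339    1454480     468885.996           0",
    "sdf": "read:      44471        1         0     44472      36583      12783.706           0",
}

if __name__ == "__main__":
    for name, row in SAMPLE_ROWS.items():
        print(name, round(corrected_reads_per_tb(row), 1))
```

Feeding it the full text of each smartctl -x dump works the same way, since only the "read:" row is matched.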

On Wed, Jun 8, 2022, at 15:56, o1bigtenor wrote:
> On Wed, Jun 8, 2022 at 5:22 PM Alan Braithwaite <alan@braithwaite.dev> wrote:
>>
> Just someone learning who asked questions in the past here
>>
>> Please let me know if there's any more information I can provide.
>>
>
> smartmontools is very likely something that will give some more
> information that (likely) will be asked for.
> If not previous installed you might want to get and then look at
> what it says about each drive. Often said information is asked
> for when the low down and dirty work gets a happening.
>
> HTH


* Re: MD Array Unexpected Kernel Hang
  2022-06-09  0:25   ` Alan Braithwaite
@ 2022-06-09  7:01     ` Wols Lists
  2022-06-09 15:10       ` Alan Braithwaite
  0 siblings, 1 reply; 5+ messages in thread
From: Wols Lists @ 2022-06-09  7:01 UTC (permalink / raw)
  To: Alan Braithwaite, o1bigtenor; +Cc: Linux-RAID

On 09/06/2022 01:25, Alan Braithwaite wrote:
> Appreciated.  Only reason I didn't include it initially is because it's a giant wall of text (which the other debugging info was anyway, so I should have just been proactive).
> 
> Anyway, it can now be found below, annotated with the failing drives first.

The thing that jumps out at me is that, while SMART is available and
enabled, the output should list a bunch of SMART settings, including
SCT/ERC, and I can't see that. That's worrying.

A quick google also says these are old drives, which may or may not be a
concern. That also possibly explains the lack of SCT/ERC.

Given that you say three drives all failed in the first slot, my money
would actually be on this having nothing to do with raid, but on a dodgy
cable or motherboard connector. I don't think they're rated for being
swapped over that many times ...
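
One way to look for link-level trouble is to tabulate the per-phy counters that smartctl -x already prints in the "Protocol Specific port log page for SAS SSP" section. The Python below is only a sketch under assumptions: the function name expander_phy_errors and the regular expressions are made up for illustration and are modelled on the output quoted in this thread, not on any documented format. A single snapshot says little on its own; comparing two snapshots taken some time apart is likely more telling.

```python
import re

# Patterns modelled on the smartctl -x output quoted in this thread.
PORT_SPLIT = re.compile(r"^relative target port id = \d+[ \t]*$", re.MULTILINE)
PHY_ID = re.compile(r"^[ \t]*attached phy identifier = (\d+)[ \t]*$", re.MULTILINE)
COUNTER = re.compile(
    r"^[ \t]*(Invalid DWORD count|Running disparity error count|"
    r"Loss of DWORD synchronization count|Phy reset problem count)"
    r" = (\d+)[ \t]*$",
    re.MULTILINE,
)


def expander_phy_errors(smartctl_text):
    """Map each attached expander phy id to its link error counters.

    Ports reported as "no device attached" are skipped, because their
    counters are all zero and their phy identifier is a placeholder.
    """
    result = {}
    # Everything before the first port header is drive information.
    for block in PORT_SPLIT.split(smartctl_text)[1:]:
        if "no device attached" in block:
            continue
        phy = PHY_ID.search(block)
        if phy is None:
            continue
        counters = {name: int(value) for name, value in COUNTER.findall(block)}
        result[int(phy.group(1))] = counters
    return result
```

Calling expander_phy_errors() on the concatenated dumps gives one entry per expander phy, which makes it easy to see whether one bay stands out from the rest.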

I don't know how much help this website will be for you, but take a look...

https://raid.wiki.kernel.org/index.php/Linux_Raid

You've already done some of what this asks for, and I don't actually 
know that the rest of this will be much use ...
https://raid.wiki.kernel.org/index.php/Asking_for_help

Cheers,
Wol


* Re: MD Array Unexpected Kernel Hang
  2022-06-09  7:01     ` Wols Lists
@ 2022-06-09 15:10       ` Alan Braithwaite
  0 siblings, 0 replies; 5+ messages in thread
From: Alan Braithwaite @ 2022-06-09 15:10 UTC (permalink / raw)
  To: Wols Lists, o1bigtenor; +Cc: Linux-RAID

Hey all,

Thanks for the responses!

> A quick google also says these are old drives, which may or may not be a
> concern. That also possibly explains the lack of SCT/ERC.

These are old SAS drives, correct.  They're in an external JBOD chassis with an HBA expander to the host machine.  I acquired them used with the hopes of extending their life in a raid array.

> Given that you say three drives all failed in the first slot, my money
> would actually be on this having nothing to do with raid, but on a dodgy
> cable or motherboard connector. I don't think they're rated for being
> swapped over that many times ...

First position in the virtual array, not physical.  The actual drives are plugged into different drive bays and I'm swapping them into the VM by updating the disk's libvirt XML configuration.

> I don't know how much help this website will be for you, but take a look...
>
> https://raid.wiki.kernel.org/index.php/Linux_Raid

Thanks!  I've spent a lot of time on that wiki, it's very helpful!  It's also why I've come to the mailing list for help, since I've been using md arrays since 2015 and haven't encountered an issue like this before.

> I would question why he is passing the disks in via virtio, rather than doing the raid outside the VM and passing in the raid6 block device.

I do it for the convenience of managing VMs.  It's for a homelab, so I won't always have remote hands to replace a disk and I like to have the spares free to use for other projects instead of tying them up as a hot-spare full-time.

> I don't believe that using MD-raid with devices virtualized and passed through via a virtio device is going to be valid/supportable.

I realize it's a bit wonky, but I'm surprised that this would be the case.  I wouldn't expect it to be that uncommon these days.  I guess I can always go back to zfs if you all feel like this is too crazy. :-P

If anybody has any other tips for debugging, it would be much appreciated.  I feel that eBPF might be helpful here, but I haven't yet been able to figure out the right way to inspect the situation with the BCC tools.
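
As a starting point that needs no eBPF at all, the tasks stuck in the `D` state can be listed straight from procfs. This is a minimal Python sketch under assumptions: the function name tasks_in_d_state and the proc_root parameter are hypothetical, and it only scans the top-level /proc/<pid> entries (kernel threads such as the md raid thread appear there; additional userspace threads live under /proc/<pid>/task and are not scanned).

```python
import os


def tasks_in_d_state(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for tasks whose state is 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                stat = handle.read()
        except OSError:
            continue  # the task exited while we were scanning
        # Format is "pid (comm) state ...". comm may contain spaces or ')',
        # so locate it between the first '(' and the last ')'.
        open_paren = stat.find("(")
        close_paren = stat.rfind(")")
        if open_paren == -1 or close_paren == -1:
            continue
        comm = stat[open_paren + 1:close_paren]
        fields = stat[close_paren + 1:].split()
        if fields and fields[0] == "D":
            found.append((int(entry), comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in tasks_in_d_state():
        print(pid, comm)
```

For each pid it reports, reading /proc/<pid>/stack as root (where the kernel provides that file) should show where the task is blocked, and `echo w > /proc/sysrq-trigger` asks the kernel to log all blocked tasks to dmesg.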

Thanks everyone,
- Alan


Thread overview: 5+ messages
2022-06-08 21:36 MD Array Unexpected Kernel Hang Alan Braithwaite
2022-06-08 22:56 ` o1bigtenor
2022-06-09  0:25   ` Alan Braithwaite
2022-06-09  7:01     ` Wols Lists
2022-06-09 15:10       ` Alan Braithwaite
