* do i need to give up on this setup
From: Daniel Sanabria @ 2020-10-05 13:10 UTC
To: Linux-RAID

Hi all,

Scrubbing (# echo check > /sys/devices/virtual/block/md1/md/sync_action)
is killing my array :(

I'm attaching details of the array and disks (bloody WD Greens) as well as
journalctl errors providing some details about the issue.

If you have any pointers on what might be the cause of this, as well as any
recommendations on how to improve things, please let me know. Thank you in
advance ...

I have backups of the data, so I'm happy to move this to a different setup
you might recommend (apps will mostly be reading from the array via NFS,
since most of the content will be media).

My suspicion is that a timer service is kicking in and disrupting the
scrubbing somehow, but I can't pinpoint what causes this.

Thanks again,

Dan

PS. Apologies for the verbosity of the logs, but I wasn't really sure if you
guys accept links from paste services.

[dan@lamachine ~]$ sudo mdadm --detail /dev/md1
[sudo] password for dan:
/dev/md1:
           Version : 1.2
     Creation Time : Fri Feb 15 12:26:56 2019
        Raid Level : raid5
        Array Size : 4194039808 (3.91 TiB 4.29 TB)
     Used Dev Size : 2097019904 (1999.87 GiB 2147.35 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Oct 5 11:35:31 2020
             State : clean, degraded
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       -       0        0        2      removed

       3       8       65        -      faulty   /dev/sde1
[dan@lamachine ~]$
[dan@lamachine ~]$ sudo hdparm -I /dev/sdc
[sudo] password for dan:

/dev/sdc:

ATA device, with non-removable media
    Model Number:       WDC WD30EZRX-00D8PB0
    Serial Number:      WD-WCC4NCWT13RF
    Firmware Revision:  80.00A80
    Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Supported: 9 8 7 6 5
    Likely used: 9
Configuration:
    Logical         max     current
    cylinders       16383   16383
    heads           16      16
    sectors/track   63      63
    --
    CHS current addressable sectors:    16514064
    LBA    user addressable sectors:   268435455
    LBA48  user addressable sectors:  5860531055
    Logical  Sector size:                   512 bytes
    Physical Sector size:                  4096 bytes
    device size with M = 1024*1024:     2861587 MBytes
    device size with M = 1000*1000:     3000591 MBytes (3000 GB)
    cache/buffer size  = unknown
    Nominal Media Rotation Rate: 5400
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, with device specific minimum
    R/W multiple sector transfer: Max = 16  Current = 0
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    Host Protected Area feature set
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    NOP cmd
       *    DOWNLOAD_MICROCODE
            Power-Up In Standby feature set
       *    SET_FEATURES required to spinup after power up
            SET_MAX security extension
       *    48-bit Address feature set
       *    Device Configuration Overlay feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
       *    General Purpose Logging feature set
       *    64-bit World wide name
       *    WRITE_UNCORRECTABLE_EXT command
       *    {READ,WRITE}_DMA_EXT_GPL commands
       *    Segmented DOWNLOAD_MICROCODE
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Host-initiated interface power management
       *    Phy event counters
       *    NCQ priority information
       *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
       *    DMA Setup Auto-Activate optimization
            Device-initiated interface power management
       *    Software settings preservation
       *    SMART Command Transport (SCT) feature set
       *    SCT Write Same (AC2)
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
            unknown 206[12] (vendor specific)
            unknown 206[13] (vendor specific)
            unknown 206[14] (vendor specific)
Security:
    Master password revision code = 65534
            supported
    not     enabled
    not     locked
    not     frozen
    not     expired: security count
            supported: enhanced erase
    414min for SECURITY ERASE UNIT. 414min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee25fc9e460
    NAA             : 5
    IEEE OUI        : 0014ee
    Unique ID       : 25fc9e460
Checksum: correct
[dan@lamachine ~]$ sudo hdparm -I /dev/sde

/dev/sde:

ATA device, with non-removable media
    Model Number:       WDC WD30EZRX-00D8PB0
    Serial Number:      WD-WCC4N1294906
    Firmware Revision:  80.00A80
    Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Supported: 9 8 7 6 5
    Likely used: 9
Configuration:
    Logical         max     current
    cylinders       16383   16383
    heads           16      16
    sectors/track   63      63
    --
    CHS current addressable sectors:    16514064
    LBA    user addressable sectors:   268435455
    LBA48  user addressable sectors:  5860531055
    Logical  Sector size:                   512 bytes
    Physical Sector size:                  4096 bytes
    device size with M = 1024*1024:     2861587 MBytes
    device size with M = 1000*1000:     3000591 MBytes (3000 GB)
    cache/buffer size  = unknown
    Nominal Media Rotation Rate: 5400
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, with device specific minimum
    R/W multiple sector transfer: Max = 16  Current = 0
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    Host Protected Area feature set
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    NOP cmd
       *    DOWNLOAD_MICROCODE
            Power-Up In Standby feature set
       *    SET_FEATURES required to spinup after power up
            SET_MAX security extension
       *    48-bit Address feature set
       *    Device Configuration Overlay feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
       *    General Purpose Logging feature set
       *    64-bit World wide name
       *    WRITE_UNCORRECTABLE_EXT command
       *    {READ,WRITE}_DMA_EXT_GPL commands
       *    Segmented DOWNLOAD_MICROCODE
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Host-initiated interface power management
       *    Phy event counters
       *    NCQ priority information
       *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
       *    DMA Setup Auto-Activate optimization
            Device-initiated interface power management
       *    Software settings preservation
       *    SMART Command Transport (SCT) feature set
       *    SCT Write Same (AC2)
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
            unknown 206[12] (vendor specific)
            unknown 206[13] (vendor specific)
            unknown 206[14] (vendor specific)
Security:
    Master password revision code = 65534
            supported
    not     enabled
    not     locked
    not     frozen
    not     expired: security count
            supported: enhanced erase
    458min for SECURITY ERASE UNIT. 458min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee25f968120
    NAA             : 5
    IEEE OUI        : 0014ee
    Unique ID       : 25f968120
Checksum: correct
[dan@lamachine ~]$ sudo hdparm -I /dev/sdd

/dev/sdd:

ATA device, with non-removable media
    Model Number:       WDC WD30EZRX-00D8PB0
    Serial Number:      WD-WCC4NPRDD6D7
    Firmware Revision:  80.00A80
    Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Supported: 9 8 7 6 5
    Likely used: 9
Configuration:
    Logical         max     current
    cylinders       16383   16383
    heads           16      16
    sectors/track   63      63
    --
    CHS current addressable sectors:    16514064
    LBA    user addressable sectors:   268435455
    LBA48  user addressable sectors:  5860533168
    Logical  Sector size:                   512 bytes
    Physical Sector size:                  4096 bytes
    device size with M = 1024*1024:     2861588 MBytes
    device size with M = 1000*1000:     3000592 MBytes (3000 GB)
    cache/buffer size  = unknown
    Nominal Media Rotation Rate: 5400
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, with device specific minimum
    R/W multiple sector transfer: Max = 16  Current = 0
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    Host Protected Area feature set
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    NOP cmd
       *    DOWNLOAD_MICROCODE
            Power-Up In Standby feature set
       *    SET_FEATURES required to spinup after power up
            SET_MAX security extension
       *    48-bit Address feature set
       *    Device Configuration Overlay feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
       *    General Purpose Logging feature set
       *    64-bit World wide name
       *    WRITE_UNCORRECTABLE_EXT command
       *    {READ,WRITE}_DMA_EXT_GPL commands
       *    Segmented DOWNLOAD_MICROCODE
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Host-initiated interface power management
       *    Phy event counters
       *    NCQ priority information
       *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
       *    DMA Setup Auto-Activate optimization
            Device-initiated interface power management
       *    Software settings preservation
       *    SMART Command Transport (SCT) feature set
       *    SCT Write Same (AC2)
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
            unknown 206[12] (vendor specific)
            unknown 206[13] (vendor specific)
            unknown 206[14] (vendor specific)
Security:
    Master password revision code = 65534
            supported
    not     enabled
    not     locked
    not     frozen
    not     expired: security count
            supported: enhanced erase
    414min for SECURITY ERASE UNIT. 414min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee25fca27b1
    NAA             : 5
    IEEE OUI        : 0014ee
    Unique ID       : 25fca27b1
Checksum: correct
[dan@lamachine ~]$

truncated journalctl logs:

Oct 05 10:57:11 lamachine systemd-logind[1571]: Session 8 logged out. Waiting for processes to exit.
Oct 05 10:57:11 lamachine systemd-logind[1571]: Removed session 8.
Oct 05 11:00:35 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 05 11:00:35 lamachine smartd[1480]: Device: /dev/sdc [SAT], failed to read SMART Attribute Data
Oct 05 11:00:35 lamachine kernel: ata7.00: configured for UDMA/133
Oct 05 11:00:35 lamachine smartd[1480]: Sending warning via /usr/libexec/smartmontools/smartdnotify to root ...
Oct 05 11:00:35 lamachine smartd[1480]: Warning via /usr/libexec/smartmontools/smartdnotify to root: successful
Oct 05 11:00:35 lamachine postfix/pickup[2347]: EFF87608EF11: uid=0 from=<root>
Oct 05 11:00:35 lamachine postfix/cleanup[4225]: EFF87608EF11: message-id=<20201005100035.EFF87608EF11@lamachine.localdomain>
Oct 05 11:00:36 lamachine postfix/qmgr[2080]: EFF87608EF11: from=<root@lamachine.localdomain>, size=524, nrcpt=1 (queue active)
Oct 05 11:00:36 lamachine postfix/local[4228]: EFF87608EF11: to=<root@lamachine.localdomain>, orig_to=<root>, relay=local, delay=0.15, delays=0.09/0.02/0/0.04, dsn=2.0.0, status=sent (delivered to mailbox)
Oct 05 11:00:36 lamachine postfix/qmgr[2080]: EFF87608EF11: removed
Oct 05 11:08:36 lamachine sshd[3936]: Timeout, client not responding from user dan 192.168.1.113 port 54226
Oct 05 11:08:36 lamachine sshd[3933]: pam_unix(sshd:session): session closed for user dan
Oct 05 11:08:36 lamachine systemd-logind[1571]: Session 7 logged out. Waiting for processes to exit.
Oct 05 11:08:36 lamachine sudo[3990]: pam_unix(sudo:session): session closed for user root
Oct 05 11:08:36 lamachine systemd-logind[1571]: Removed session 7.
Oct 05 11:29:33 lamachine smartd[1480]: Device: /dev/sdc [SAT], read SMART Attribute Data worked again, warning condition reset after 1 email
Oct 05 11:30:43 lamachine kernel: ata9: link is slow to respond, please be patient (ready=0)
Oct 05 11:30:47 lamachine kernel: ata9: COMRESET failed (errno=-16)
Oct 05 11:30:53 lamachine kernel: ata9: link is slow to respond, please be patient (ready=0)
Oct 05 11:30:57 lamachine kernel: ata9: COMRESET failed (errno=-16)
Oct 05 11:31:03 lamachine kernel: ata9: link is slow to respond, please be patient (ready=0)
Oct 05 11:31:32 lamachine kernel: ata9: COMRESET failed (errno=-16)
Oct 05 11:31:32 lamachine kernel: ata9: limiting SATA link speed to 3.0 Gbps
Oct 05 11:31:37 lamachine smartd[1480]: Device: /dev/sde [SAT], failed to read SMART Attribute Data
Oct 05 11:31:37 lamachine smartd[1480]: Sending warning via /usr/libexec/smartmontools/smartdnotify to root ...
Oct 05 11:31:37 lamachine kernel: ata9: COMRESET failed (errno=-16)
Oct 05 11:31:37 lamachine kernel: ata9: reset failed, giving up
Oct 05 11:31:37 lamachine kernel: ata9.00: disabled
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#7 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#6 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=124s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#6 CDB: Read(16) 88 00 00 00 00 00 45 29 f4 d8 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#7 CDB: Read(16) 88 00 00 00 00 00 45 2a 78 18 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev sde, sector 1160411160 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev sde, sector 1160377560 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#9 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#9 CDB: Read(16) 88 00 00 00 00 00 45 29 fa 18 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#16 CDB: Read(16) 88 00 00 00 00 00 45 2a 82 98 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev sde, sector 1160413848 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev sde, sector 1160378904 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#10 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#10 CDB: Read(16) 88 00 00 00 00 00 45 29 ff 58 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#18 CDB: Read(16) 88 00 00 00 00 00 45 2a 8d 18 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev sde, sector 1160416536 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev sde, sector 1160380248 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#11 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#31 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#11 CDB: Read(16) 88 00 00 00 00 00 45 2a 04 98 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#31 CDB: Read(16) 88 00 00 00 00 00 45 2a 97 98 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev sde, sector 1160419224 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev sde, sector 1160381592 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#18 CDB: Read(16) 88 00 00 00 00 00 45 2a b7 18 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev sde, sector 1160427288 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#12 CDB: Read(16) 88 00 00 00 00 00 45 2a 09 d8 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev sde, sector 1160382936 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:31:37 lamachine smartd[1480]: Warning via /usr/libexec/smartmontools/smartdnotify to root: successful
Oct 05 11:31:38 lamachine postfix/pickup[4381]: 00E50608EF11: uid=0 from=<root>
Oct 05 11:31:38 lamachine postfix/cleanup[4522]: 00E50608EF11: message-id=<20201005103138.00E50608EF11@lamachine.localdomain>
Oct 05 11:31:38 lamachine postfix/qmgr[2080]: 00E50608EF11: from=<root@lamachine.localdomain>, size=524, nrcpt=1 (queue active)
Oct 05 11:31:38 lamachine postfix/local[4524]: 00E50608EF11: to=<root@lamachine.localdomain>, orig_to=<root>, relay=local, delay=0.11, delays=0.08/0.01/0/0.03, dsn=2.0.0, status=sent (delivered to mailbox)
Oct 05 11:31:38 lamachine postfix/qmgr[2080]: 00E50608EF11: removed
Oct 05 11:31:47 lamachine kernel: INFO: task md1_resync:3091 blocked for more than 120 seconds.
Oct 05 11:31:47 lamachine kernel: Not tainted 4.18.0-193.14.2.el8_2.x86_64 #1
Oct 05 11:31:47 lamachine kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 05 11:31:47 lamachine kernel: md1_resync D 0 3091 2 0x80004080
Oct 05 11:31:47 lamachine kernel: Call Trace:
Oct 05 11:31:47 lamachine kernel: ? __schedule+0x24f/0x650
Oct 05 11:31:47 lamachine kernel: schedule+0x2f/0xa0
Oct 05 11:31:47 lamachine kernel: raid5_get_active_stripe+0x469/0x5f0 [raid456]
Oct 05 11:31:47 lamachine kernel: ? finish_wait+0x80/0x80
Oct 05 11:31:47 lamachine kernel: raid5_sync_request+0x387/0x3b0 [raid456]
Oct 05 11:31:47 lamachine kernel: ? cpumask_next+0x17/0x20
Oct 05 11:31:47 lamachine kernel: ? is_mddev_idle+0xcc/0x12a
Oct 05 11:31:47 lamachine kernel: md_do_sync.cold.83+0x424/0x953
Oct 05 11:31:47 lamachine kernel: ? xfrm_user_net_init+0x90/0xa0
Oct 05 11:31:47 lamachine kernel: ? __switch_to_asm+0x41/0x70
Oct 05 11:31:47 lamachine kernel: ? finish_wait+0x80/0x80
Oct 05 11:31:47 lamachine kernel: ? md_register_thread+0xd0/0xd0
Oct 05 11:31:47 lamachine kernel: md_thread+0x94/0x150
Oct 05 11:31:47 lamachine kernel: kthread+0x112/0x130
Oct 05 11:31:47 lamachine kernel: ? kthread_flush_work_fn+0x10/0x10
Oct 05 11:31:47 lamachine kernel: ret_from_fork+0x35/0x40
Oct 05 11:32:41 lamachine kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 05 11:32:46 lamachine kernel: ata10.00: qc timeout (cmd 0xec)
Oct 05 11:32:47 lamachine kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 05 11:32:47 lamachine kernel: ata10.00: revalidation failed (errno=-5)
Oct 05 11:32:48 lamachine kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 05 11:32:58 lamachine kernel: ata10.00: qc timeout (cmd 0xec)
Oct 05 11:32:59 lamachine kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 05 11:32:59 lamachine kernel: ata10.00: revalidation failed (errno=-5)
Oct 05 11:32:59 lamachine kernel: ata10: limiting SATA link speed to 1.5 Gbps
Oct 05 11:32:59 lamachine kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
Oct 05 11:33:30 lamachine kernel: ata10.00: qc timeout (cmd 0xec)
Oct 05 11:33:30 lamachine kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 05 11:33:30 lamachine kernel: ata10.00: revalidation failed (errno=-5)
Oct 05 11:33:30 lamachine kernel: ata10.00: disabled
Oct 05 11:33:32 lamachine kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
Oct 05 11:33:50 lamachine kernel: INFO: task md1_raid5:1304 blocked for more than 120 seconds.
Oct 05 11:33:50 lamachine kernel: Not tainted 4.18.0-193.14.2.el8_2.x86_64 #1
Oct 05 11:33:50 lamachine kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 05 11:33:50 lamachine kernel: md1_raid5 D 0 1304 2 0x80004000
Oct 05 11:33:50 lamachine kernel: Call Trace:
Oct 05 11:33:50 lamachine kernel: ? __schedule+0x24f/0x650
Oct 05 11:33:50 lamachine kernel: schedule+0x2f/0xa0
Oct 05 11:33:50 lamachine kernel: io_schedule+0x12/0x40
Oct 05 11:33:50 lamachine kernel: blk_mq_get_tag+0x119/0x250
Oct 05 11:33:50 lamachine kernel: ? finish_wait+0x80/0x80
Oct 05 11:33:50 lamachine kernel: blk_mq_get_request+0xb7/0x3c0
Oct 05 11:33:50 lamachine kernel: blk_mq_make_request+0x134/0x5a0
Oct 05 11:33:50 lamachine kernel: generic_make_request+0xcf/0x310
Oct 05 11:33:50 lamachine kernel: ops_run_io+0x881/0xd30 [raid456]
Oct 05 11:33:50 lamachine kernel: ? ops_complete_check+0x50/0x50 [raid456]
Oct 05 11:33:50 lamachine kernel: handle_stripe+0xc47/0x1f80 [raid456]
Oct 05 11:33:50 lamachine kernel: ? __wake_up_common+0x7a/0x190
Oct 05 11:33:50 lamachine kernel: handle_active_stripes.isra.73+0x3e7/0x5c0 [raid456]
Oct 05 11:33:50 lamachine kernel: raid5d+0x392/0x5b0 [raid456]
Oct 05 11:33:50 lamachine kernel: ? schedule_timeout+0x20d/0x310
Oct 05 11:33:50 lamachine kernel: ? _raw_spin_unlock_irqrestore+0x11/0x20
Oct 05 11:33:50 lamachine kernel: ? md_register_thread+0xd0/0xd0
Oct 05 11:33:50 lamachine kernel: md_thread+0x94/0x150
Oct 05 11:33:50 lamachine kernel: ? finish_wait+0x80/0x80
Oct 05 11:33:50 lamachine kernel: kthread+0x112/0x130
Oct 05 11:33:50 lamachine kernel: ? kthread_flush_work_fn+0x10/0x10
Oct 05 11:33:50 lamachine kernel: ret_from_fork+0x35/0x40
Oct 05 11:33:50 lamachine kernel: INFO: task md1_resync:3091 blocked for more than 120 seconds.
Oct 05 11:33:50 lamachine kernel: Not tainted 4.18.0-193.14.2.el8_2.x86_64 #1
Oct 05 11:33:50 lamachine kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 05 11:33:50 lamachine kernel: md1_resync D 0 3091 2 0x80004080
Oct 05 11:33:50 lamachine kernel: Call Trace:
Oct 05 11:33:50 lamachine kernel: ? __schedule+0x24f/0x650
Oct 05 11:33:50 lamachine kernel: schedule+0x2f/0xa0
Oct 05 11:33:50 lamachine kernel: raid5_get_active_stripe+0x469/0x5f0 [raid456]
Oct 05 11:33:50 lamachine kernel: ? finish_wait+0x80/0x80
Oct 05 11:33:50 lamachine kernel: raid5_sync_request+0x387/0x3b0 [raid456]
Oct 05 11:33:50 lamachine kernel: ? cpumask_next+0x17/0x20
Oct 05 11:33:50 lamachine kernel: ? is_mddev_idle+0xcc/0x12a
Oct 05 11:33:50 lamachine kernel: md_do_sync.cold.83+0x424/0x953
Oct 05 11:33:50 lamachine kernel: ? xfrm_user_net_init+0x90/0xa0
Oct 05 11:33:50 lamachine kernel: ? __switch_to_asm+0x41/0x70
Oct 05 11:33:50 lamachine kernel: ? finish_wait+0x80/0x80
Oct 05 11:33:50 lamachine kernel: ? md_register_thread+0xd0/0xd0
Oct 05 11:33:50 lamachine kernel: md_thread+0x94/0x150
Oct 05 11:33:50 lamachine kernel: kthread+0x112/0x130
Oct 05 11:33:50 lamachine kernel: ? kthread_flush_work_fn+0x10/0x10
Oct 05 11:33:50 lamachine kernel: ret_from_fork+0x35/0x40
Oct 05 11:34:39 lamachine kernel: ata8.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x6 frozen
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:00:d8:2f:2b/05:00:45:00:00/40 tag 0 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:08:18:35:2b/05:00:45:00:00/40 tag 1 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:10:58:3a:2b/05:00:45:00:00/40 tag 2 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:18:98:3f:2b/05:00:45:00:00/40 tag 3 ncq dma 688128 in res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:20:d8:44:2b/05:00:45:00:00/40 tag 4 ncq dma 688128 in res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:28:18:4a:2b/05:00:45:00:00/40 tag 5 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:30:58:4f:2b/05:00:45:00:00/40 tag 6 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:38:98:54:2b/05:00:45:00:00/40 tag 7 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:40:d8:59:2b/05:00:45:00:00/40 tag 8 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:48:18:5f:2b/05:00:45:00:00/40 tag 9 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:50:58:64:2b/05:00:45:00:00/40 tag 10 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:58:98:69:2b/05:00:45:00:00/40 tag 11 ncq dma 688128 in res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:60:d8:6e:2b/05:00:45:00:00/40 tag 12 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:68:18:74:2b/05:00:45:00:00/40 tag 13 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:70:58:79:2b/05:00:45:00:00/40 tag 14 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:78:98:7e:2b/05:00:45:00:00/40 tag 15 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:80:18:b3:2b/05:00:45:00:00/40 tag 16 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:88:58:b8:2b/05:00:45:00:00/40 tag 17 ncq dma 688128 in res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:90:98:bd:2b/05:00:45:00:00/40 tag 18 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:98:d8:c2:2b/05:00:45:00:00/40 tag 19 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:a0:18:c8:2b/05:00:45:00:00/40 tag 20 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:a8:58:cd:2b/05:00:45:00:00/40 tag 21 ncq dma 688128 in res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:b0:d8:83:2b/05:00:45:00:00/40 tag 22 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:b8:18:89:2b/05:00:45:00:00/40 tag 23 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:c0:58:8e:2b/05:00:45:00:00/40 tag 24 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:c8:98:93:2b/05:00:45:00:00/40 tag 25 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:d0:d8:98:2b/05:00:45:00:00/40 tag 26 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:d8:18:9e:2b/05:00:45:00:00/40 tag 27 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:e0:58:a3:2b/05:00:45:00:00/40 tag 28 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:e8:98:a8:2b/05:00:45:00:00/40 tag 29 ncq dma 688128 in res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:f0:d8:ad:2b/05:00:45:00:00/40 tag 30 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:f8:98:d2:2b/05:00:45:00:00/40 tag 31 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8: hard resetting link
Oct 05 11:34:39 lamachine kernel: ata7.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x6 frozen
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:00:d8:98:2b/05:00:45:00:00/40 tag 0 ncq dma 688128 in res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:08:18:9e:2b/05:00:45:00:00/40 tag 1 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:10:58:a3:2b/05:00:45:00:00/40 tag 2 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:18:d8:2f:2b/05:00:45:00:00/40 tag 3 ncq dma 688128 in res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:20:18:35:2b/05:00:45:00:00/40 tag 4 ncq dma 688128 in res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:28:58:3a:2b/05:00:45:00:00/40 tag 5 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:30:98:3f:2b/05:00:45:00:00/40 tag 6 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:38:d8:44:2b/05:00:45:00:00/40 tag 7 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:40:18:4a:2b/05:00:45:00:00/40 tag 8 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:48:58:4f:2b/05:00:45:00:00/40 tag 9 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:50:98:54:2b/05:00:45:00:00/40 tag 10 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: {
DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:58:d8:59:2b/05:00:45:00:00/40 tag 11 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:60:18:5f:2b/05:00:45:00:00/40 tag 12 ncq dma 688128 in res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:68:58:64:2b/05:00:45:00:00/40 tag 13 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:70:98:69:2b/05:00:45:00:00/40 tag 14 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:78:d8:6e:2b/05:00:45:00:00/40 tag 15 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:80:98:a8:2b/05:00:45:00:00/40 tag 16 ncq dma 688128 in res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:88:d8:ad:2b/05:00:45:00:00/40 
tag 17 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:90:18:74:2b/05:00:45:00:00/40 tag 18 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:98:58:79:2b/05:00:45:00:00/40 tag 19 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:a0:98:7e:2b/05:00:45:00:00/40 tag 20 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:a8:d8:83:2b/05:00:45:00:00/40 tag 21 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:b0:18:89:2b/05:00:45:00:00/40 tag 22 ncq dma 688128 in res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:b8:58:8e:2b/05:00:45:00:00/40 tag 23 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 
lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:c0:18:b3:2b/05:00:45:00:00/40 tag 24 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:c8:58:b8:2b/05:00:45:00:00/40 tag 25 ncq dma 688128 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:d0:98:bd:2b/05:00:45:00:00/40 tag 26 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:d8:d8:c2:2b/05:00:45:00:00/40 tag 27 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:e0:18:c8:2b/05:00:45:00:00/40 tag 28 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:e8:58:cd:2b/05:00:45:00:00/40 tag 29 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:f0:98:d2:2b/05:00:45:00:00/40 tag 30 ncq dma 688128 in 
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED Oct 05 11:34:39 lamachine kernel: ata7.00: cmd 60/40:f8:98:93:2b/05:00:45:00:00/40 tag 31 ncq dma 688128 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY } Oct 05 11:34:39 lamachine kernel: ata7: hard resetting link Oct 05 11:34:40 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Oct 05 11:34:40 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Oct 05 11:34:45 lamachine kernel: ata7.00: qc timeout (cmd 0xec) Oct 05 11:34:45 lamachine kernel: ata8.00: qc timeout (cmd 0xec) Oct 05 11:34:46 lamachine kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4) Oct 05 11:34:46 lamachine kernel: ata7.00: revalidation failed (errno=-5) Oct 05 11:34:46 lamachine kernel: ata7: hard resetting link Oct 05 11:34:46 lamachine kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4) Oct 05 11:34:46 lamachine kernel: ata8.00: revalidation failed (errno=-5) Oct 05 11:34:46 lamachine kernel: ata8: hard resetting link Oct 05 11:34:46 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Oct 05 11:34:46 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Oct 05 11:34:57 lamachine kernel: ata7.00: qc timeout (cmd 0xec) Oct 05 11:34:57 lamachine kernel: ata8.00: qc timeout (cmd 0xec) Oct 05 11:34:57 lamachine kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4) Oct 05 11:34:57 lamachine kernel: ata8.00: revalidation failed (errno=-5) Oct 05 11:34:57 lamachine kernel: ata8: limiting SATA link speed to 3.0 Gbps Oct 05 11:34:57 lamachine kernel: ata8: hard resetting link Oct 05 11:34:57 lamachine kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4) Oct 05 11:34:57 lamachine kernel: ata7.00: revalidation failed 
(errno=-5) Oct 05 11:34:57 lamachine kernel: ata7: limiting SATA link speed to 3.0 Gbps Oct 05 11:34:57 lamachine kernel: ata7: hard resetting link Oct 05 11:34:58 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 320) Oct 05 11:34:58 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 320) Oct 05 11:35:29 lamachine kernel: ata7.00: qc timeout (cmd 0xec) Oct 05 11:35:29 lamachine kernel: ata8.00: qc timeout (cmd 0xec) Oct 05 11:35:29 lamachine kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4) Oct 05 11:35:29 lamachine kernel: ata8.00: revalidation failed (errno=-5) Oct 05 11:35:29 lamachine kernel: ata8.00: disabled Oct 05 11:35:29 lamachine kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4) Oct 05 11:35:29 lamachine kernel: ata7.00: revalidation failed (errno=-5) Oct 05 11:35:29 lamachine kernel: ata7.00: disabled Oct 05 11:35:30 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 320) Oct 05 11:35:30 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 320) Oct 05 11:35:31 lamachine kernel: ata8: EH complete Oct 05 11:35:31 lamachine kernel: ata7: EH complete Oct 05 11:35:31 lamachine kernel: scsi_io_completion_action: 115 callbacks suppressed Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#12 CDB: Read(16) 88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: print_req_error: 115 callbacks suppressed Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdd, sector 1160501208 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#18 CDB: Read(16) 88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00 Oct 05 11:35:31 
lamachine kernel: blk_update_request: I/O error, dev sdc, sector 1160501208 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#13 CDB: Read(16) 88 00 00 00 00 00 45 2b dd 18 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdd, sector 1160502552 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#19 CDB: Read(16) 88 00 00 00 00 00 45 2b dd 18 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdc, sector 1160502552 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#14 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#14 CDB: Read(16) 88 00 00 00 00 00 45 2b e2 58 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdd, sector 1160503896 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#20 CDB: Read(16) 88 00 00 00 00 00 45 2b e2 58 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdc, sector 1160503896 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#15 CDB: Read(16) 88 00 00 00 00 00 45 2b e7 98 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: 
blk_update_request: I/O error, dev sdd, sector 1160505240 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#21 CDB: Read(16) 88 00 00 00 00 00 45 2b e7 98 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdc, sector 1160505240 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#16 CDB: Read(16) 88 00 00 00 00 00 45 2b ec d8 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdd, sector 1160506584 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#22 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#22 CDB: Read(16) 88 00 00 00 00 00 45 2b ec d8 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdc, sector 1160506584 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: md/raid:md1: 23277 read_errors > 23276 stripes Oct 05 11:35:31 lamachine kernel: md/raid:md1: Too many read errors, failing device sde1. Oct 05 11:35:31 lamachine kernel: md/raid:md1: Disk failure on sde1, disabling device. md/raid:md1: Operation continuing on 2 devices. Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not correctable (sector 1160433448 on sde1). Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not correctable (sector 1160433456 on sde1). Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not correctable (sector 1160433336 on sde1). 
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not correctable (sector 1160433464 on sde1). Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not correctable (sector 1160433472 on sde1). Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not correctable (sector 1160433344 on sde1). Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not correctable (sector 1160433480 on sde1). Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not correctable (sector 1160433352 on sde1). Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not correctable (sector 1160433488 on sde1). Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not correctable (sector 1160433360 on sde1). Oct 05 11:35:31 lamachine kernel: md: super_written gets error=10 Oct 05 11:35:31 lamachine kernel: md: super_written gets error=10 Oct 05 11:35:31 lamachine kernel: md: super_written gets error=10 Oct 05 11:35:31 lamachine kernel: md: super_written gets error=10 Oct 05 11:35:31 lamachine kernel: md: super_written gets error=10 ^ permalink raw reply [flat|nested] 16+ messages in thread
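As a cross-check on the journal excerpt above, the `CDB: Read(16)` bytes can be decoded by hand to confirm that they agree with the `sector` and `ncq dma 688128` figures the kernel prints. The sketch below assumes the standard READ(16) layout (opcode 0x88, 8-byte big-endian LBA at bytes 2-9, 4-byte transfer length at bytes 10-13) and the 512-byte logical sector size that hdparm reports for these drives. The function name `decode_read16` is made up for this illustration.

```python
def decode_read16(cdb_hex):
    """Return (starting LBA, block count) from a READ(16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return lba, blocks


# CDB taken from the "[sdd] tag#12" entry in the log above.
lba, blocks = decode_read16("88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00")
print(lba, blocks, blocks * 512)  # prints: 1160501208 1344 688128
```

The decoded LBA matches the `sector 1160501208` in the following `blk_update_request` line, and 1344 blocks of 512 bytes give the 688128-byte transfer size shown for each failed `READ FPDMA QUEUED` command.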
* Re: do i need to give up on this setup
From: Reindl Harald @ 2020-10-05 13:17 UTC
To: Daniel Sanabria, Linux-RAID

On 05.10.20 at 15:10, Daniel Sanabria wrote:
> Scrubbing ( # echo check >
> /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
>
> I'm attaching details of the array and disks (bloody wd greens) as
> well as journalctl errors providing some details about the issue.
>
> If you have any pointers on what might be the cause of this as well as
> any recommendations on how to improve things please let me thank you
> in advance ...
>
>        3       8       65        -      faulty   /dev/sde1

Why would you scrub an array when you have *clearly* lost a whole disk, instead of first replacing that disk and rebuilding the array?
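One way to act on the point above is to look at the array state before writing `check` to `sync_action`. The following is only a sketch: it parses text in the shape of the `mdadm --detail /dev/md1` output posted at the start of the thread, and the helper name `array_health` and the returned dictionary are made up for this illustration rather than taken from any tool.

```python
import re


def array_health(detail_text):
    """Extract the State flags and any faulty members from `mdadm --detail` text."""
    state = re.search(r"^\s*State\s*:\s*(.+)$", detail_text, re.MULTILINE)
    flags = [f.strip() for f in state.group(1).split(",")] if state else []
    # Member rows end in "<state words>   /dev/xxx"; pick out the faulty ones.
    faulty = re.findall(r"\bfaulty\s+(/dev/\S+)", detail_text)
    return {"state": flags, "degraded": "degraded" in flags, "faulty": faulty}


# Abbreviated from the output posted in the first message.
sample = """\
             State : clean, degraded
    Active Devices : 2
    Failed Devices : 1

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       -       0        0        2      removed
       3       8       65        -      faulty   /dev/sde1
"""

print(array_health(sample))
```

A wrapper script could refuse to start a scrub whenever `degraded` is true or `faulty` is non-empty.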
* Re: do i need to give up on this setup
From: Roman Mamedov @ 2020-10-05 13:44 UTC
To: Daniel Sanabria; +Cc: Linux-RAID

On Mon, 5 Oct 2020 14:10:25 +0100
Daniel Sanabria <sanabria.d@gmail.com> wrote:

> Hi all,
>
> Scrubbing ( # echo check >
> /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
>
> I'm attaching details of the array and disks (bloody wd greens) as
> well as journalctl errors providing some details about the issue.
>
> If you have any pointers on what might be the cause of this as well as
> any recommendations on how to improve things please let me thank you
> in advance ...
>
> I have backups of the data so happy to move this to a different setup
> you might recommend (apps will be mostly reading from the array via
> NFS since most of the content will be media).
>
> My suspicion is that a timer service is kicking in and disrupting the
> scrubbing somehow but can't pinpoint what causes this.

It looks like a drive is dropping off the bus and then failing to reidentify,
could be bad cabling/controller/PSU, or just a bad drive. You should post
"smartctl -a" of all drives as well.

--
With respect,
Roman
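To see which ports follow the pattern described above (failed commands, link resets, IDENTIFY failures, then the device being disabled), the journal can be tallied per ATA port. This is a rough sketch, assuming one kernel message per line as printed by `journalctl -k`. The event names and the function `tally_ata_events` are made up for this illustration, and the sample lines are copied from the log in the first message.

```python
import re
from collections import Counter, defaultdict

# Message fragments as they appear in the libata lines of the posted journal.
EVENTS = {
    "failed_command": re.compile(r"failed command:"),
    "qc_timeout": re.compile(r"qc timeout"),
    "hard_reset": re.compile(r"hard resetting link"),
    "identify_failed": re.compile(r"failed to IDENTIFY"),
    "disabled": re.compile(r"ata\d+\.\d+: disabled"),
}
PORT = re.compile(r"kernel: (ata\d+)(?:\.\d+)?:")


def tally_ata_events(journal_text):
    """Count selected libata events per port (ata7, ata8, ...)."""
    counts = defaultdict(Counter)
    for line in journal_text.splitlines():
        port = PORT.search(line)
        if not port:
            continue
        for name, pattern in EVENTS.items():
            if pattern.search(line):
                counts[port.group(1)][name] += 1
    return {p: dict(c) for p, c in counts.items()}


sample = """\
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8: hard resetting link
Oct 05 11:34:45 lamachine kernel: ata8.00: qc timeout (cmd 0xec)
Oct 05 11:34:46 lamachine kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 05 11:34:46 lamachine kernel: ata8: hard resetting link
Oct 05 11:35:29 lamachine kernel: ata8.00: disabled
Oct 05 11:35:29 lamachine kernel: ata7.00: disabled
"""

for port, events in sorted(tally_ata_events(sample).items()):
    print(port, events)
```

Run against the full journal, a table like this shows at a glance whether one port or several went through the same reset sequence, which helps when separating a single bad drive from shared cabling, controller, or PSU trouble.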
* Re: do i need to give up on this setup
From: Roman Mamedov @ 2020-10-05 14:04 UTC
To: Daniel Sanabria, Linux-RAID

On Mon, 5 Oct 2020 14:59:35 +0100
Daniel Sanabria <sanabria.d@gmail.com> wrote:

> > It looks like a drive is dropping off the bus and then failing to reidentify,
> > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > "smartctl -a" of all drives as well.

I meant not to me personally, but to the mailing list.

The drives seem OK though, even sde.

> [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
> [sudo] password for dan:
> smartctl 6.6 2017-11-05 r4594
> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Green
> Device Model:     WDC WD30EZRX-00D8PB0
> Serial Number:    WD-WCC4NCWT13RF
> LU WWN Device Id: 5 0014ee 25fc9e460
> Firmware Version: 80.00A80
> User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    5400 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2 (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Mon Oct 5 14:58:34 2020 BST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                (38940) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        ( 391) minutes.
> Conveyance self-test routine
> recommended polling time:        (   5) minutes.
> SCT capabilities:              (0x7035) SCT Status supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   178   165   021    Pre-fail  Always       -       6075
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18577
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
> 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
> 194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       28
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%     17479         -
> # 2  Short offline       Completed without error       00%     15531         -
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
> smartctl 6.6 2017-11-05 r4594
> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Green
> Device Model:     WDC WD30EZRX-00D8PB0
> Serial Number:    WD-WCC4NPRDD6D7
> LU WWN Device Id: 5 0014ee 25fca27b1
> Firmware Version: 80.00A80
> User Capacity:    3,000,592,982,016 bytes [3.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    5400 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2 (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Mon Oct 5 14:58:54 2020 BST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                (39060) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        ( 392) minutes.
> Conveyance self-test routine
> recommended polling time:        (   5) minutes.
> SCT capabilities:              (0x7035) SCT Status supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail  Always       -       6100
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18580
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> 193 Load_Cycle_Count        0x0032   136   136   000    Old_age   Always       -       192427
> 194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       29
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%     17481         -
> # 2  Short offline       Completed without error       00%     15534         -
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> [dan@lamachine ~]$ sudo smartctl -a /dev/sde
> smartctl 6.6 2017-11-05 r4594
> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Green
> Device Model:     WDC WD30EZRX-00D8PB0
> Serial Number:    WD-WCC4N1294906
> LU WWN Device Id: 5 0014ee 25f968120
> Firmware Version: 80.00A80
> User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    5400 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2 (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Mon Oct 5 14:58:57 2020 BST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                (43200) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
> Supports SMART auto save timer. > Error logging capability: (0x01) Error logging supported. > General Purpose Logging supported. > Short self-test routine > recommended polling time: ( 2) minutes. > Extended self-test routine > recommended polling time: ( 433) minutes. > Conveyance self-test routine > recommended polling time: ( 5) minutes. > SCT capabilities: (0x7035) SCT Status supported. > SCT Feature Control supported. > SCT Data Table supported. > > SMART Attributes Data Structure revision number: 16 > Vendor Specific SMART Attributes with Thresholds: > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE > UPDATED WHEN_FAILED RAW_VALUE > 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail > Always - 0 > 3 Spin_Up_Time 0x0027 176 166 021 Pre-fail > Always - 6158 > 4 Start_Stop_Count 0x0032 100 100 000 Old_age > Always - 80 > 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail > Always - 0 > 7 Seek_Error_Rate 0x002e 200 200 000 Old_age > Always - 0 > 9 Power_On_Hours 0x0032 075 075 000 Old_age > Always - 18465 > 10 Spin_Retry_Count 0x0032 100 253 000 Old_age > Always - 0 > 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age > Always - 0 > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age > Always - 80 > 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age > Always - 53 > 193 Load_Cycle_Count 0x0032 142 142 000 Old_age > Always - 174015 > 194 Temperature_Celsius 0x0022 121 107 000 Old_age > Always - 29 > 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age > Always - 0 > 197 Current_Pending_Sector 0x0032 200 200 000 Old_age > Always - 0 > 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age > Offline - 0 > 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age > Always - 0 > 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age > Offline - 0 > > SMART Error Log Version: 1 > No Errors Logged > > SMART Self-test log structure revision number 1 > Num Test_Description Status Remaining > LifeTime(hours) LBA_of_first_error > # 1 Extended offline Completed without error 00% 
17347 - > # 2 Short offline Completed without error 00% 15414 - > > SMART Selective self-test log data structure revision number 1 > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS > 1 0 0 Not_testing > 2 0 0 Not_testing > 3 0 0 Not_testing > 4 0 0 Not_testing > 5 0 0 Not_testing > Selective self-test flags (0x0): > After scanning selected spans, do NOT read-scan remainder of disk. > If Selective self-test is pending on power-up, resume after 0 minute delay. > > [dan@lamachine ~]$ > > > On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote: > > > > On Mon, 5 Oct 2020 14:10:25 +0100 > > Daniel Sanabria <sanabria.d@gmail.com> wrote: > > > > > Hi all, > > > > > > Scrubbing ( # echo check > > > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :( > > > > > > I'm attaching details of the array and disks (bloody wd greens) as > > > well as journalctl errors providing some details about the issue. > > > > > > If you have any pointers on what might be the cause of this as well as > > > any recommendations on how to improve things please let me thank you > > > in advance ... > > > > > > I have backups of the data so happy to move this to a different setup > > > you might recommend (apps will be mostly reading from the array via > > > NFS since most of the content will be media). > > > > > > My suspicion is that a timer service is kicking in and disrupting the > > > scrubbing somehow but can't pinpoint what causes this. > > > > It looks like a drive is dropping off the bus and then failing to reidentify, > > could be bad cabling/controller/PSU, or just a bad drive. You should post > > "smartctl -a" of all drives as well. > > > > -- > > With respect, > > Roman -- With respect, Roman ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: do i need to give up on this setup
  2020-10-05 14:04 ` Roman Mamedov
@ 2020-10-05 14:10 ` Reindl Harald
  2020-10-05 14:28 ` Daniel Sanabria
  1 sibling, 0 replies; 16+ messages in thread
From: Reindl Harald @ 2020-10-05 14:10 UTC
To: Roman Mamedov, Daniel Sanabria, Linux-RAID

On 05.10.20 at 16:04, Roman Mamedov wrote:
> On Mon, 5 Oct 2020 14:59:35 +0100
> Daniel Sanabria <sanabria.d@gmail.com> wrote:
>
>>> It looks like a drive is dropping off the bus and then failing to reidentify,
>>> could be bad cabling/controller/PSU, or just a bad drive. You should post
>>> "smartctl -a" of all drives as well.
>
> I meant not to me personally, but to the mailing list. The drives seem OK
> though, even sde.

You have a hardware problem, and it's not uncommon, when one of your disks goes crazy under load, that the resulting controller reset also resets a second disk on the same bus.

Either one of your disks is faulty, the controller is faulty, or you have an issue with your cables.

40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:48:18:5f:2b/05:00:45:00:00/40 tag 9 ncq dma 688128 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:50:58:64:2b/05:00:45:00:00/40 tag 10 ncq dma 688128 in
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:58:98:69:2b/05:00:45:00:00/40 tag 11 ncq dma 688128 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:35:29 lamachine
kernel: ata7.00: qc timeout (cmd 0xec) Oct 05 11:35:29 lamachine kernel: ata8.00: qc timeout (cmd 0xec) Oct 05 11:35:29 lamachine kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4) Oct 05 11:35:29 lamachine kernel: ata8.00: revalidation failed (errno=-5) Oct 05 11:35:29 lamachine kernel: ata8.00: disabled Oct 05 11:35:29 lamachine kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4) Oct 05 11:35:29 lamachine kernel: ata7.00: revalidation failed (errno=-5) Oct 05 11:35:29 lamachine kernel: ata7.00: disabled Oct 05 11:35:30 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 320) Oct 05 11:35:30 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 320) Oct 05 11:35:31 lamachine kernel: ata8: EH complete Oct 05 11:35:31 lamachine kernel: ata7: EH complete Oct 05 11:35:31 lamachine kernel: scsi_io_completion_action: 115 callbacks suppressed Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#12 CDB: Read(16) 88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: print_req_error: 115 callbacks suppressed Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdd, sector 1160501208 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0 Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#18 CDB: Read(16) 88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00 Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdc, sector 1160501208 op 0x0:(READ) flags 0x4000 phys_seg 168 prio >> [dan@lamachine ~]$ sudo smartctl -a /dev/sdc >> [sudo] password for dan: >> smartctl 6.6 2017-11-05 r4594 >> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build) >> Copyright (C) 2002-17, Bruce Allen, Christian Franke, 
www.smartmontools.org >> >> === START OF INFORMATION SECTION === >> Model Family: Western Digital Green >> Device Model: WDC WD30EZRX-00D8PB0 >> Serial Number: WD-WCC4NCWT13RF >> LU WWN Device Id: 5 0014ee 25fc9e460 >> Firmware Version: 80.00A80 >> User Capacity: 3,000,591,900,160 bytes [3.00 TB] >> Sector Sizes: 512 bytes logical, 4096 bytes physical >> Rotation Rate: 5400 rpm >> Device is: In smartctl database [for details use: -P show] >> ATA Version is: ACS-2 (minor revision not indicated) >> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) >> Local Time is: Mon Oct 5 14:58:34 2020 BST >> SMART support is: Available - device has SMART capability. >> SMART support is: Enabled >> >> === START OF READ SMART DATA SECTION === >> SMART overall-health self-assessment test result: PASSED >> >> General SMART Values: >> Offline data collection status: (0x82) Offline data collection activity >> was completed without error. >> Auto Offline Data Collection: Enabled. >> Self-test execution status: ( 0) The previous self-test routine completed >> without error or no self-test has ever >> been run. >> Total time to complete Offline >> data collection: (38940) seconds. >> Offline data collection >> capabilities: (0x7b) SMART execute Offline immediate. >> Auto Offline data collection on/off support. >> Suspend Offline collection upon new >> command. >> Offline surface scan supported. >> Self-test supported. >> Conveyance Self-test supported. >> Selective Self-test supported. >> SMART capabilities: (0x0003) Saves SMART data before entering >> power-saving mode. >> Supports SMART auto save timer. >> Error logging capability: (0x01) Error logging supported. >> General Purpose Logging supported. >> Short self-test routine >> recommended polling time: ( 2) minutes. >> Extended self-test routine >> recommended polling time: ( 391) minutes. >> Conveyance self-test routine >> recommended polling time: ( 5) minutes. >> SCT capabilities: (0x7035) SCT Status supported. 
>> SCT Feature Control supported. >> SCT Data Table supported. >> >> SMART Attributes Data Structure revision number: 16 >> Vendor Specific SMART Attributes with Thresholds: >> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE >> UPDATED WHEN_FAILED RAW_VALUE >> 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail >> Always - 0 >> 3 Spin_Up_Time 0x0027 178 165 021 Pre-fail >> Always - 6075 >> 4 Start_Stop_Count 0x0032 100 100 000 Old_age >> Always - 81 >> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail >> Always - 0 >> 7 Seek_Error_Rate 0x002e 100 253 000 Old_age >> Always - 0 >> 9 Power_On_Hours 0x0032 075 075 000 Old_age >> Always - 18577 >> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age >> Always - 0 >> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age >> Always - 0 >> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age >> Always - 81 >> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age >> Always - 46 >> 193 Load_Cycle_Count 0x0032 142 142 000 Old_age >> Always - 176661 >> 194 Temperature_Celsius 0x0022 122 109 000 Old_age >> Always - 28 >> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age >> Always - 0 >> 197 Current_Pending_Sector 0x0032 200 200 000 Old_age >> Always - 0 >> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age >> Offline - 0 >> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age >> Always - 0 >> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age >> Offline - 0 >> >> SMART Error Log Version: 1 >> No Errors Logged >> >> SMART Self-test log structure revision number 1 >> Num Test_Description Status Remaining >> LifeTime(hours) LBA_of_first_error >> # 1 Extended offline Completed without error 00% 17479 - >> # 2 Short offline Completed without error 00% 15531 - >> >> SMART Selective self-test log data structure revision number 1 >> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS >> 1 0 0 Not_testing >> 2 0 0 Not_testing >> 3 0 0 Not_testing >> 4 0 0 Not_testing >> 5 0 0 Not_testing >> Selective self-test flags (0x0): >> After scanning selected 
spans, do NOT read-scan remainder of disk. >> If Selective self-test is pending on power-up, resume after 0 minute delay. >> >> [dan@lamachine ~]$ sudo smartctl -a /dev/sdd >> smartctl 6.6 2017-11-05 r4594 >> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build) >> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org >> >> === START OF INFORMATION SECTION === >> Model Family: Western Digital Green >> Device Model: WDC WD30EZRX-00D8PB0 >> Serial Number: WD-WCC4NPRDD6D7 >> LU WWN Device Id: 5 0014ee 25fca27b1 >> Firmware Version: 80.00A80 >> User Capacity: 3,000,592,982,016 bytes [3.00 TB] >> Sector Sizes: 512 bytes logical, 4096 bytes physical >> Rotation Rate: 5400 rpm >> Device is: In smartctl database [for details use: -P show] >> ATA Version is: ACS-2 (minor revision not indicated) >> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) >> Local Time is: Mon Oct 5 14:58:54 2020 BST >> SMART support is: Available - device has SMART capability. >> SMART support is: Enabled >> >> === START OF READ SMART DATA SECTION === >> SMART overall-health self-assessment test result: PASSED >> >> General SMART Values: >> Offline data collection status: (0x82) Offline data collection activity >> was completed without error. >> Auto Offline Data Collection: Enabled. >> Self-test execution status: ( 0) The previous self-test routine completed >> without error or no self-test has ever >> been run. >> Total time to complete Offline >> data collection: (39060) seconds. >> Offline data collection >> capabilities: (0x7b) SMART execute Offline immediate. >> Auto Offline data collection on/off support. >> Suspend Offline collection upon new >> command. >> Offline surface scan supported. >> Self-test supported. >> Conveyance Self-test supported. >> Selective Self-test supported. >> SMART capabilities: (0x0003) Saves SMART data before entering >> power-saving mode. >> Supports SMART auto save timer. >> Error logging capability: (0x01) Error logging supported. 
>> General Purpose Logging supported. >> Short self-test routine >> recommended polling time: ( 2) minutes. >> Extended self-test routine >> recommended polling time: ( 392) minutes. >> Conveyance self-test routine >> recommended polling time: ( 5) minutes. >> SCT capabilities: (0x7035) SCT Status supported. >> SCT Feature Control supported. >> SCT Data Table supported. >> >> SMART Attributes Data Structure revision number: 16 >> Vendor Specific SMART Attributes with Thresholds: >> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE >> UPDATED WHEN_FAILED RAW_VALUE >> 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail >> Always - 0 >> 3 Spin_Up_Time 0x0027 178 164 021 Pre-fail >> Always - 6100 >> 4 Start_Stop_Count 0x0032 100 100 000 Old_age >> Always - 81 >> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail >> Always - 0 >> 7 Seek_Error_Rate 0x002e 100 253 000 Old_age >> Always - 0 >> 9 Power_On_Hours 0x0032 075 075 000 Old_age >> Always - 18580 >> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age >> Always - 0 >> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age >> Always - 0 >> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age >> Always - 81 >> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age >> Always - 53 >> 193 Load_Cycle_Count 0x0032 136 136 000 Old_age >> Always - 192427 >> 194 Temperature_Celsius 0x0022 121 108 000 Old_age >> Always - 29 >> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age >> Always - 0 >> 197 Current_Pending_Sector 0x0032 200 200 000 Old_age >> Always - 0 >> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age >> Offline - 0 >> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age >> Always - 0 >> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age >> Offline - 0 >> >> SMART Error Log Version: 1 >> No Errors Logged >> >> SMART Self-test log structure revision number 1 >> Num Test_Description Status Remaining >> LifeTime(hours) LBA_of_first_error >> # 1 Extended offline Completed without error 00% 17481 - >> # 2 Short offline 
Completed without error 00% 15534 - >> >> SMART Selective self-test log data structure revision number 1 >> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS >> 1 0 0 Not_testing >> 2 0 0 Not_testing >> 3 0 0 Not_testing >> 4 0 0 Not_testing >> 5 0 0 Not_testing >> Selective self-test flags (0x0): >> After scanning selected spans, do NOT read-scan remainder of disk. >> If Selective self-test is pending on power-up, resume after 0 minute delay. >> >> [dan@lamachine ~]$ sudo smartctl -a /dev/sde >> smartctl 6.6 2017-11-05 r4594 >> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build) >> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org >> >> === START OF INFORMATION SECTION === >> Model Family: Western Digital Green >> Device Model: WDC WD30EZRX-00D8PB0 >> Serial Number: WD-WCC4N1294906 >> LU WWN Device Id: 5 0014ee 25f968120 >> Firmware Version: 80.00A80 >> User Capacity: 3,000,591,900,160 bytes [3.00 TB] >> Sector Sizes: 512 bytes logical, 4096 bytes physical >> Rotation Rate: 5400 rpm >> Device is: In smartctl database [for details use: -P show] >> ATA Version is: ACS-2 (minor revision not indicated) >> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) >> Local Time is: Mon Oct 5 14:58:57 2020 BST >> SMART support is: Available - device has SMART capability. >> SMART support is: Enabled >> >> === START OF READ SMART DATA SECTION === >> SMART overall-health self-assessment test result: PASSED >> >> General SMART Values: >> Offline data collection status: (0x82) Offline data collection activity >> was completed without error. >> Auto Offline Data Collection: Enabled. >> Self-test execution status: ( 0) The previous self-test routine completed >> without error or no self-test has ever >> been run. >> Total time to complete Offline >> data collection: (43200) seconds. >> Offline data collection >> capabilities: (0x7b) SMART execute Offline immediate. >> Auto Offline data collection on/off support. 
>> Suspend Offline collection upon new >> command. >> Offline surface scan supported. >> Self-test supported. >> Conveyance Self-test supported. >> Selective Self-test supported. >> SMART capabilities: (0x0003) Saves SMART data before entering >> power-saving mode. >> Supports SMART auto save timer. >> Error logging capability: (0x01) Error logging supported. >> General Purpose Logging supported. >> Short self-test routine >> recommended polling time: ( 2) minutes. >> Extended self-test routine >> recommended polling time: ( 433) minutes. >> Conveyance self-test routine >> recommended polling time: ( 5) minutes. >> SCT capabilities: (0x7035) SCT Status supported. >> SCT Feature Control supported. >> SCT Data Table supported. >> >> SMART Attributes Data Structure revision number: 16 >> Vendor Specific SMART Attributes with Thresholds: >> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE >> UPDATED WHEN_FAILED RAW_VALUE >> 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail >> Always - 0 >> 3 Spin_Up_Time 0x0027 176 166 021 Pre-fail >> Always - 6158 >> 4 Start_Stop_Count 0x0032 100 100 000 Old_age >> Always - 80 >> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail >> Always - 0 >> 7 Seek_Error_Rate 0x002e 200 200 000 Old_age >> Always - 0 >> 9 Power_On_Hours 0x0032 075 075 000 Old_age >> Always - 18465 >> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age >> Always - 0 >> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age >> Always - 0 >> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age >> Always - 80 >> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age >> Always - 53 >> 193 Load_Cycle_Count 0x0032 142 142 000 Old_age >> Always - 174015 >> 194 Temperature_Celsius 0x0022 121 107 000 Old_age >> Always - 29 >> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age >> Always - 0 >> 197 Current_Pending_Sector 0x0032 200 200 000 Old_age >> Always - 0 >> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age >> Offline - 0 >> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 
Old_age >> Always - 0 >> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age >> Offline - 0 >> >> SMART Error Log Version: 1 >> No Errors Logged >> >> SMART Self-test log structure revision number 1 >> Num Test_Description Status Remaining >> LifeTime(hours) LBA_of_first_error >> # 1 Extended offline Completed without error 00% 17347 - >> # 2 Short offline Completed without error 00% 15414 - >> >> SMART Selective self-test log data structure revision number 1 >> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS >> 1 0 0 Not_testing >> 2 0 0 Not_testing >> 3 0 0 Not_testing >> 4 0 0 Not_testing >> 5 0 0 Not_testing >> Selective self-test flags (0x0): >> After scanning selected spans, do NOT read-scan remainder of disk. >> If Selective self-test is pending on power-up, resume after 0 minute delay. >> >> [dan@lamachine ~]$ >> >> >> On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote: >>> >>> On Mon, 5 Oct 2020 14:10:25 +0100 >>> Daniel Sanabria <sanabria.d@gmail.com> wrote: >>> >>>> Hi all, >>>> >>>> Scrubbing ( # echo check > >>>> /sys/devices/virtual/block/md1/md/sync_action) is killing my array :( >>>> >>>> I'm attaching details of the array and disks (bloody wd greens) as >>>> well as journalctl errors providing some details about the issue. >>>> >>>> If you have any pointers on what might be the cause of this as well as >>>> any recommendations on how to improve things please let me thank you >>>> in advance ... >>>> >>>> I have backups of the data so happy to move this to a different setup >>>> you might recommend (apps will be mostly reading from the array via >>>> NFS since most of the content will be media). >>>> >>>> My suspicion is that a timer service is kicking in and disrupting the >>>> scrubbing somehow but can't pinpoint what causes this. >>> >>> It looks like a drive is dropping off the bus and then failing to reidentify, >>> could be bad cabling/controller/PSU, or just a bad drive. You should post >>> "smartctl -a" of all drives as well. 
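
The journal excerpt in the message above shows ata7 and ata8 timing out and being disabled within the same second, which fits the "second disk on the same bus gets reset too" explanation. As a rough aid for checking that pattern in a longer log, here is a minimal Python sketch that tallies a few libata messages per port. The function name `summarize_ata_events`, the `EVENTS` table and the `SAMPLE` list are hypothetical names chosen for illustration; the sample lines are copied from the log in this thread, and the set of message fragments is an assumption, not a complete list of what the kernel can print.

```python
import re
from collections import Counter, defaultdict

# Matches the "ataN" or "ataN.00" prefix seen on the kernel lines in this thread.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")

# Message fragments of interest (assumption: only the ones visible in the
# posted log; other kernels or drivers may word these differently).
EVENTS = {
    "qc timeout": "timeout",
    "failed to IDENTIFY": "identify_failed",
    "disabled": "disabled",
}


def summarize_ata_events(lines):
    """Count selected error events per ATA port from journal/dmesg lines."""
    summary = defaultdict(Counter)
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue  # not a libata message
        port, message = match.groups()
        for fragment, label in EVENTS.items():
            if fragment in message:
                summary[port][label] += 1
    return summary


# Lines copied from the journalctl output quoted in this thread.
SAMPLE = [
    "Oct 05 11:35:29 lamachine kernel: ata7.00: qc timeout (cmd 0xec)",
    "Oct 05 11:35:29 lamachine kernel: ata8.00: qc timeout (cmd 0xec)",
    "Oct 05 11:35:29 lamachine kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "Oct 05 11:35:29 lamachine kernel: ata8.00: revalidation failed (errno=-5)",
    "Oct 05 11:35:29 lamachine kernel: ata8.00: disabled",
    "Oct 05 11:35:29 lamachine kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "Oct 05 11:35:29 lamachine kernel: ata7.00: revalidation failed (errno=-5)",
    "Oct 05 11:35:29 lamachine kernel: ata7.00: disabled",
    "Oct 05 11:35:30 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 320)",
]

if __name__ == "__main__":
    for port, counts in sorted(summarize_ata_events(SAMPLE).items()):
        print(port, dict(counts))
```

To use it on a real log, the lines could be read from a saved `journalctl -k` dump instead of `SAMPLE`. If two ports always show the same events at the same timestamps, that points more towards something they share (controller, cabling, power) than towards two drives failing independently, though it does not prove it.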
* Re: do i need to give up on this setup 2020-10-05 14:04 ` Roman Mamedov 2020-10-05 14:10 ` Reindl Harald @ 2020-10-05 14:28 ` Daniel Sanabria 2020-10-05 15:58 ` Roger Heflin 1 sibling, 1 reply; 16+ messages in thread From: Daniel Sanabria @ 2020-10-05 14:28 UTC (permalink / raw) To: Roman Mamedov; +Cc: Linux-RAID > I meant not to me personally, but to the mailing list. The drives seem OK > though, even sde. Sorry missed the reply-all button On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@romanrm.net> wrote: > > On Mon, 5 Oct 2020 14:59:35 +0100 > Daniel Sanabria <sanabria.d@gmail.com> wrote: > > > > It looks like a drive is dropping off the bus and then failing to reidentify, > > > could be bad cabling/controller/PSU, or just a bad drive. You should post > > > "smartctl -a" of all drives as well. > > I meant not to me personally, but to the mailing list. The drives seem OK > though, even sde. > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdc > > [sudo] password for dan: > > smartctl 6.6 2017-11-05 r4594 > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build) > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org > > > > === START OF INFORMATION SECTION === > > Model Family: Western Digital Green > > Device Model: WDC WD30EZRX-00D8PB0 > > Serial Number: WD-WCC4NCWT13RF > > LU WWN Device Id: 5 0014ee 25fc9e460 > > Firmware Version: 80.00A80 > > User Capacity: 3,000,591,900,160 bytes [3.00 TB] > > Sector Sizes: 512 bytes logical, 4096 bytes physical > > Rotation Rate: 5400 rpm > > Device is: In smartctl database [for details use: -P show] > > ATA Version is: ACS-2 (minor revision not indicated) > > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) > > Local Time is: Mon Oct 5 14:58:34 2020 BST > > SMART support is: Available - device has SMART capability. 
> > SMART support is: Enabled > > > > === START OF READ SMART DATA SECTION === > > SMART overall-health self-assessment test result: PASSED > > > > General SMART Values: > > Offline data collection status: (0x82) Offline data collection activity > > was completed without error. > > Auto Offline Data Collection: Enabled. > > Self-test execution status: ( 0) The previous self-test routine completed > > without error or no self-test has ever > > been run. > > Total time to complete Offline > > data collection: (38940) seconds. > > Offline data collection > > capabilities: (0x7b) SMART execute Offline immediate. > > Auto Offline data collection on/off support. > > Suspend Offline collection upon new > > command. > > Offline surface scan supported. > > Self-test supported. > > Conveyance Self-test supported. > > Selective Self-test supported. > > SMART capabilities: (0x0003) Saves SMART data before entering > > power-saving mode. > > Supports SMART auto save timer. > > Error logging capability: (0x01) Error logging supported. > > General Purpose Logging supported. > > Short self-test routine > > recommended polling time: ( 2) minutes. > > Extended self-test routine > > recommended polling time: ( 391) minutes. > > Conveyance self-test routine > > recommended polling time: ( 5) minutes. > > SCT capabilities: (0x7035) SCT Status supported. > > SCT Feature Control supported. > > SCT Data Table supported. 
> > > > SMART Attributes Data Structure revision number: 16 > > Vendor Specific SMART Attributes with Thresholds: > > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE > > UPDATED WHEN_FAILED RAW_VALUE > > 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail > > Always - 0 > > 3 Spin_Up_Time 0x0027 178 165 021 Pre-fail > > Always - 6075 > > 4 Start_Stop_Count 0x0032 100 100 000 Old_age > > Always - 81 > > 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail > > Always - 0 > > 7 Seek_Error_Rate 0x002e 100 253 000 Old_age > > Always - 0 > > 9 Power_On_Hours 0x0032 075 075 000 Old_age > > Always - 18577 > > 10 Spin_Retry_Count 0x0032 100 253 000 Old_age > > Always - 0 > > 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age > > Always - 0 > > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age > > Always - 81 > > 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age > > Always - 46 > > 193 Load_Cycle_Count 0x0032 142 142 000 Old_age > > Always - 176661 > > 194 Temperature_Celsius 0x0022 122 109 000 Old_age > > Always - 28 > > 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age > > Always - 0 > > 197 Current_Pending_Sector 0x0032 200 200 000 Old_age > > Always - 0 > > 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age > > Offline - 0 > > 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age > > Always - 0 > > 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age > > Offline - 0 > > > > SMART Error Log Version: 1 > > No Errors Logged > > > > SMART Self-test log structure revision number 1 > > Num Test_Description Status Remaining > > LifeTime(hours) LBA_of_first_error > > # 1 Extended offline Completed without error 00% 17479 - > > # 2 Short offline Completed without error 00% 15531 - > > > > SMART Selective self-test log data structure revision number 1 > > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS > > 1 0 0 Not_testing > > 2 0 0 Not_testing > > 3 0 0 Not_testing > > 4 0 0 Not_testing > > 5 0 0 Not_testing > > Selective self-test flags (0x0): > > After scanning selected 
spans, do NOT read-scan remainder of disk. > > If Selective self-test is pending on power-up, resume after 0 minute delay. > > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdd > > smartctl 6.6 2017-11-05 r4594 > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build) > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org > > > > === START OF INFORMATION SECTION === > > Model Family: Western Digital Green > > Device Model: WDC WD30EZRX-00D8PB0 > > Serial Number: WD-WCC4NPRDD6D7 > > LU WWN Device Id: 5 0014ee 25fca27b1 > > Firmware Version: 80.00A80 > > User Capacity: 3,000,592,982,016 bytes [3.00 TB] > > Sector Sizes: 512 bytes logical, 4096 bytes physical > > Rotation Rate: 5400 rpm > > Device is: In smartctl database [for details use: -P show] > > ATA Version is: ACS-2 (minor revision not indicated) > > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) > > Local Time is: Mon Oct 5 14:58:54 2020 BST > > SMART support is: Available - device has SMART capability. > > SMART support is: Enabled > > > > === START OF READ SMART DATA SECTION === > > SMART overall-health self-assessment test result: PASSED > > > > General SMART Values: > > Offline data collection status: (0x82) Offline data collection activity > > was completed without error. > > Auto Offline Data Collection: Enabled. > > Self-test execution status: ( 0) The previous self-test routine completed > > without error or no self-test has ever > > been run. > > Total time to complete Offline > > data collection: (39060) seconds. > > Offline data collection > > capabilities: (0x7b) SMART execute Offline immediate. > > Auto Offline data collection on/off support. > > Suspend Offline collection upon new > > command. > > Offline surface scan supported. > > Self-test supported. > > Conveyance Self-test supported. > > Selective Self-test supported. > > SMART capabilities: (0x0003) Saves SMART data before entering > > power-saving mode. > > Supports SMART auto save timer. 
> > Error logging capability:        (0x01) Error logging supported.
> >                                         General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time:        (   2) minutes.
> > Extended self-test routine
> > recommended polling time:        ( 392) minutes.
> > Conveyance self-test routine
> > recommended polling time:        (   5) minutes.
> > SCT capabilities:              (0x7035) SCT Status supported.
> >                                         SCT Feature Control supported.
> >                                         SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> >   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail  Always       -       6100
> >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
> >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
> >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18580
> >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> > 193 Load_Cycle_Count        0x0032   136   136   000    Old_age   Always       -       192427
> > 194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       29
> > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> >
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > SMART Self-test log structure revision number 1
> > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > # 1  Extended offline    Completed without error       00%     17481         -
> > # 2  Short offline       Completed without error       00%     15534         -
> >
> > SMART Selective self-test log data structure revision number 1
> >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> >     1        0        0  Not_testing
> >     2        0        0  Not_testing
> >     3        0        0  Not_testing
> >     4        0        0  Not_testing
> >     5        0        0  Not_testing
> > Selective self-test flags (0x0):
> >   After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > [dan@lamachine ~]$ sudo smartctl -a /dev/sde
> > smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Model Family:     Western Digital Green
> > Device Model:     WDC WD30EZRX-00D8PB0
> > Serial Number:    WD-WCC4N1294906
> > LU WWN Device Id: 5 0014ee 25f968120
> > Firmware Version: 80.00A80
> > User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > Rotation Rate:    5400 rpm
> > Device is:        In smartctl database [for details use: -P show]
> > ATA Version is:   ACS-2 (minor revision not indicated)
> > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > Local Time is:    Mon Oct 5 14:58:57 2020 BST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status:  (0x82) Offline data collection activity
> >                                         was completed without error.
> >                                         Auto Offline Data Collection: Enabled.
> > Self-test execution status:      (   0) The previous self-test routine completed
> >                                         without error or no self-test has ever
> >                                         been run.
> > Total time to complete Offline
> > data collection:                (43200) seconds.
> > Offline data collection
> > capabilities:                    (0x7b) SMART execute Offline immediate.
> >                                         Auto Offline data collection on/off support.
> >                                         Suspend Offline collection upon new
> >                                         command.
> >                                         Offline surface scan supported.
> >                                         Self-test supported.
> >                                         Conveyance Self-test supported.
> >                                         Selective Self-test supported.
> > SMART capabilities:            (0x0003) Saves SMART data before entering
> >                                         power-saving mode.
> >                                         Supports SMART auto save timer.
> > Error logging capability:        (0x01) Error logging supported.
> >                                         General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time:        (   2) minutes.
> > Extended self-test routine
> > recommended polling time:        ( 433) minutes.
> > Conveyance self-test routine
> > recommended polling time:        (   5) minutes.
> > SCT capabilities:              (0x7035) SCT Status supported.
> >                                         SCT Feature Control supported.
> >                                         SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> >   3 Spin_Up_Time            0x0027   176   166   021    Pre-fail  Always       -       6158
> >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       80
> >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> >   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
> >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18465
> >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       80
> > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       174015
> > 194 Temperature_Celsius     0x0022   121   107   000    Old_age   Always       -       29
> > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> >
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > SMART Self-test log structure revision number 1
> > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > # 1  Extended offline    Completed without error       00%     17347         -
> > # 2  Short offline       Completed without error       00%     15414         -
> >
> > SMART Selective self-test log data structure revision number 1
> >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> >     1        0        0  Not_testing
> >     2        0        0  Not_testing
> >     3        0        0  Not_testing
> >     4        0        0  Not_testing
> >     5        0        0  Not_testing
> > Selective self-test flags (0x0):
> >   After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > [dan@lamachine ~]$
> >
> >
> > On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote:
> > >
> > > On Mon, 5 Oct 2020 14:10:25 +0100
> > > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Scrubbing ( # echo check >
> > > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> > > >
> > > > I'm attaching details of the array and disks (bloody wd greens) as
> > > > well as journalctl errors providing some details about the issue.
> > > >
> > > > If you have any pointers on what might be the cause of this as well as
> > > > any recommendations on how to improve things please let me thank you
> > > > in advance ...
> > > >
> > > > I have backups of the data so happy to move this to a different setup
> > > > you might recommend (apps will be mostly reading from the array via
> > > > NFS since most of the content will be media).
> > > >
> > > > My suspicion is that a timer service is kicking in and disrupting the
> > > > scrubbing somehow but can't pinpoint what causes this.
> > >
> > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > "smartctl -a" of all drives as well.
> > >
> > > --
> > > With respect,
> > > Roman
> >
> --
> With respect,
> Roman

^ permalink raw reply [flat|nested] 16+ messages in thread
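The SMART reports quoted above are long, and only a handful of counters bear on the "bad drive versus bad cabling/controller/PSU" question raised in the thread. A minimal Python sketch of pulling those counters out of "smartctl -a" text follows. The sample rows are copied from the /dev/sde report above; the function names and the WATCHED set are illustrative choices, not anything proposed by the posters. Load_Cycle_Count is included only because it is the one large raw value in these reports; the thread does not establish that it is related to the failure.

```python
# Sketch: extract a few health-related attributes from "smartctl -a" text.
# Function names and the WATCHED set are assumptions made for illustration.

WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
    "Load_Cycle_Count",
}

# Rows copied from the /dev/sde report in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       174015
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {attribute_name: raw_value} for rows of the attribute table."""
    values = {}
    for line in text.splitlines():
        # Drop leading whitespace and any "> > " email quoting.
        fields = line.lstrip("> ").split()
        # A table row has at least 10 columns and starts with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            name, raw = fields[1], fields[9]
            if raw.isdigit():
                values[name] = int(raw)
    return values


def watched_nonzero(values):
    """Keep only watched attributes whose raw value is not zero."""
    return {k: v for k, v in values.items() if k in WATCHED and v != 0}


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    for name, raw in sorted(watched_nonzero(attrs).items()):
        print(f"{name}: {raw}")
```

On the rows above, the reallocation, pending-sector, and CRC counters all parse as zero, so only Load_Cycle_Count is reported.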
* Re: do i need to give up on this setup
  2020-10-05 14:28 ` Daniel Sanabria
@ 2020-10-05 15:58 ` Roger Heflin
  2020-10-06  7:56 ` Daniel Sanabria
  0 siblings, 1 reply; 16+ messages in thread
From: Roger Heflin @ 2020-10-05 15:58 UTC (permalink / raw)
  To: Daniel Sanabria; +Cc: Roman Mamedov, Linux-RAID

As the others said, you have a hardware problem.

It could be any of the things previously mentioned, and it could also be
the power supply being unable to provide a stable 12V for the disks.

You should give the list more specifics on your hardware setup. Of
particular interest are what kind of SATA/SAS ports you are using and how
the disks are cabled in.

Note that a number of controllers aren't the most reliable, and some of
them will stop responding for all disks connected to them when something
goes wrong.

I have also seen badly designed motherboards with built-in
(non-AMD/non-Intel chip) SATA ports that don't work under any load that
uses more than a single disk at a time, and/or act badly when given SMART
commands.

On Mon, Oct 5, 2020 at 9:30 AM Daniel Sanabria <sanabria.d@gmail.com> wrote:
>
> > I meant not to me personally, but to the mailing list. The drives seem OK
> > though, even sde.
>
> Sorry missed the reply-all button
>
> On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@romanrm.net> wrote:
> >
> > On Mon, 5 Oct 2020 14:59:35 +0100
> > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> >
> > > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > > "smartctl -a" of all drives as well.
> >
> > I meant not to me personally, but to the mailing list. The drives seem OK
> > though, even sde.
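Roger's request for port and cabling details can be partly answered from sysfs, where the resolved path of /sys/block/<disk> normally includes the PCI address of the controller and the ATA port the disk hangs off. The Python sketch below parses such a path. EXAMPLE_PATH is a hypothetical path shown only to illustrate the usual layout; it is not taken from the poster's machine, and the function names are assumptions made for this example.

```python
import os
import re

# Hypothetical example of what os.path.realpath("/sys/block/sdc") can look
# like on a SATA system; the real path on the poster's machine is unknown.
EXAMPLE_PATH = ("/sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0/"
                "ata7/host6/target6:0:0/6:0:0:0/block/sdc")

PCI_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")
ATA_RE = re.compile(r"^ata\d+$")


def describe(sysfs_path):
    """Return (pci_address, ata_port) parsed from a resolved sysfs path."""
    parts = sysfs_path.split("/")
    pci = [p for p in parts if PCI_RE.match(p)]
    ata = [p for p in parts if ATA_RE.match(p)]
    # The last PCI element is the controller itself; earlier ones are bridges.
    return (pci[-1] if pci else None, ata[0] if ata else None)


def disks_by_controller(sys_block="/sys/block"):
    """Group sd* devices by the PCI address of their controller."""
    groups = {}
    if not os.path.isdir(sys_block):
        return groups
    for name in sorted(os.listdir(sys_block)):
        if name.startswith("sd"):
            resolved = os.path.realpath(os.path.join(sys_block, name))
            pci, ata = describe(resolved)
            groups.setdefault(pci, []).append((name, ata))
    return groups


if __name__ == "__main__":
    print(describe(EXAMPLE_PATH))
    for pci, disks in disks_by_controller().items():
        print(pci, disks)
```

The PCI addresses it prints can be compared against "lspci" output to see which disks share a controller, which is the kind of detail asked for above.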
* Re: do i need to give up on this setup
  2020-10-05 15:58 ` Roger Heflin
@ 2020-10-06  7:56 ` Daniel Sanabria
  2020-10-06  8:24 ` Reindl Harald
  2020-10-06 10:53 ` Roger Heflin
  0 siblings, 2 replies; 16+ messages in thread
From: Daniel Sanabria @ 2020-10-06 7:56 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Roman Mamedov, Linux-RAID

Yeah, it is quite possible that the hardware can't support the setup,
because this array was an afterthought, added as an upgrade to the system.

For the record, here are some more details about the setup:

Motherboard: ASROCK EP2C602-4L/D16
PSU: 850W Corsair RM Series

I have 6 drives connected to the motherboard. The 3 drives forming the
array are Western Digital Green WDC WD30EZRX-00D8PB0 and are connected to
3 ports of the Marvell 9230 SATA controller. The other 3 drives are
Western Digital Caviar Blue (SATA) WDC WD5000AAKS-00A7B2; they are
connected to the spare Marvell port and the 2 available SATA/SAS
motherboard ports.

On Mon, 5 Oct 2020 at 16:58, Roger Heflin <rogerheflin@gmail.com> wrote:
>
> what they said you have a hardware problem.
>
> it could be about anything previously mentioned and could also be the
> power supply being unable to provide a stable 12V for the disks.
>
> You should provide the list more specifics on your hw setup, of
> interest are what kind of SATA/SAS ports you are using and how the
> disk are cabled in.
>
> Note that there are a number of controllers that aren't the most
> reliable and some of those controllers when something happens will
> stop responding for all disks connected to it.
>
> I have also seen badly designed motherboards have
> build-in(non-AMD/non-Intel chips) sata ports that don't work under any
> load that uses more than a single disk at a time, and/or acts badly
> when given smart commands.
>
> On Mon, Oct 5, 2020 at 9:30 AM Daniel Sanabria <sanabria.d@gmail.com> wrote:
> >
> > > I meant not to me personally, but to the mailing list. The drives seem OK
> > > though, even sde.
> > > > SMART capabilities: (0x0003) Saves SMART data before entering > > > > power-saving mode. > > > > Supports SMART auto save timer. > > > > Error logging capability: (0x01) Error logging supported. > > > > General Purpose Logging supported. > > > > Short self-test routine > > > > recommended polling time: ( 2) minutes. > > > > Extended self-test routine > > > > recommended polling time: ( 433) minutes. > > > > Conveyance self-test routine > > > > recommended polling time: ( 5) minutes. > > > > SCT capabilities: (0x7035) SCT Status supported. > > > > SCT Feature Control supported. > > > > SCT Data Table supported. > > > > > > > > SMART Attributes Data Structure revision number: 16 > > > > Vendor Specific SMART Attributes with Thresholds: > > > > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE > > > > UPDATED WHEN_FAILED RAW_VALUE > > > > 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail > > > > Always - 0 > > > > 3 Spin_Up_Time 0x0027 176 166 021 Pre-fail > > > > Always - 6158 > > > > 4 Start_Stop_Count 0x0032 100 100 000 Old_age > > > > Always - 80 > > > > 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail > > > > Always - 0 > > > > 7 Seek_Error_Rate 0x002e 200 200 000 Old_age > > > > Always - 0 > > > > 9 Power_On_Hours 0x0032 075 075 000 Old_age > > > > Always - 18465 > > > > 10 Spin_Retry_Count 0x0032 100 253 000 Old_age > > > > Always - 0 > > > > 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age > > > > Always - 0 > > > > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age > > > > Always - 80 > > > > 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age > > > > Always - 53 > > > > 193 Load_Cycle_Count 0x0032 142 142 000 Old_age > > > > Always - 174015 > > > > 194 Temperature_Celsius 0x0022 121 107 000 Old_age > > > > Always - 29 > > > > 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age > > > > Always - 0 > > > > 197 Current_Pending_Sector 0x0032 200 200 000 Old_age > > > > Always - 0 > > > > 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age > > > 
> Offline - 0 > > > > 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age > > > > Always - 0 > > > > 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age > > > > Offline - 0 > > > > > > > > SMART Error Log Version: 1 > > > > No Errors Logged > > > > > > > > SMART Self-test log structure revision number 1 > > > > Num Test_Description Status Remaining > > > > LifeTime(hours) LBA_of_first_error > > > > # 1 Extended offline Completed without error 00% 17347 - > > > > # 2 Short offline Completed without error 00% 15414 - > > > > > > > > SMART Selective self-test log data structure revision number 1 > > > > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS > > > > 1 0 0 Not_testing > > > > 2 0 0 Not_testing > > > > 3 0 0 Not_testing > > > > 4 0 0 Not_testing > > > > 5 0 0 Not_testing > > > > Selective self-test flags (0x0): > > > > After scanning selected spans, do NOT read-scan remainder of disk. > > > > If Selective self-test is pending on power-up, resume after 0 minute delay. > > > > > > > > [dan@lamachine ~]$ > > > > > > > > > > > > On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote: > > > > > > > > > > On Mon, 5 Oct 2020 14:10:25 +0100 > > > > > Daniel Sanabria <sanabria.d@gmail.com> wrote: > > > > > > > > > > > Hi all, > > > > > > > > > > > > Scrubbing ( # echo check > > > > > > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :( > > > > > > > > > > > > I'm attaching details of the array and disks (bloody wd greens) as > > > > > > well as journalctl errors providing some details about the issue. > > > > > > > > > > > > If you have any pointers on what might be the cause of this as well as > > > > > > any recommendations on how to improve things please let me thank you > > > > > > in advance ... > > > > > > > > > > > > I have backups of the data so happy to move this to a different setup > > > > > > you might recommend (apps will be mostly reading from the array via > > > > > > NFS since most of the content will be media). 
> > > > > > > > > > > > My suspicion is that a timer service is kicking in and disrupting the > > > > > > scrubbing somehow but can't pinpoint what causes this. > > > > > > > > > > It looks like a drive is dropping off the bus and then failing to reidentify, > > > > > could be bad cabling/controller/PSU, or just a bad drive. You should post > > > > > "smartctl -a" of all drives as well. > > > > > > > > > > -- > > > > > With respect, > > > > > Roman > > > > > > > > > -- > > > With respect, > > > Roman
* Re: do i need to give up on this setup
From: Reindl Harald @ 2020-10-06 8:24 UTC
To: Daniel Sanabria, Roger Heflin; +Cc: Roman Mamedov, Linux-RAID

On 06.10.20 at 09:56, Daniel Sanabria wrote:
> Yeah it is quite possible that the hardware can't support the setup
> because this array was an afterthought and considered an upgrade to
> the system.
>
> For the record here are some more details about the setup:
>
> Motherboard: ASROCK EP2C602-4L/D16
> PSU: 850W Corsair RM Series
> I have 6 drives connected to the motherboard. The 3 drives forming the
> array are Western Digital Green WDC WD30EZRX-00D8PB0 and are connected
> to 3 ports of the Marvell 9230 SATA controller. The other 3 drives are
> Western Digital Caviar Blue (SATA) WDC WD5000AAKS-00A7B2 are connected
> to the spare Marvell port and the 2 SATA/SAS Motherboard ports
> available.

Yeah, unreliable desktop disks, without even increasing the timeouts. Currently the only WD disks suitable for a RAID setup are the "WD Gold", after they lost their brain and started using SMR on "WD Red".

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

-------------------------

[root@srv-rhsoft:~]$ cat /etc/systemd/system/disk-timeout.service
[Unit]
Description=SCSI Timeouts

[Service]
Type=oneshot
ExecStart=/usr/local/bin/disk-timeout.sh

[Install]
WantedBy=multi-user.target

-------------------------

[root@srv-rhsoft:~]$ cat /usr/local/bin/disk-timeout.sh
#!/usr/bin/dash
echo 180 > "/sys/block/sda/device/timeout"
echo 180 > "/sys/block/sdb/device/timeout"
echo 180 > "/sys/block/sdc/device/timeout"
echo 180 > "/sys/block/sdd/device/timeout"

-------------------------

> On Mon, 5 Oct 2020 at 16:58, Roger Heflin <rogerheflin@gmail.com> wrote:
>>
>> what they said you have a hardware problem.
>> >> it could be about anything previously mentioned and could also be the >> power supply being unable to provide a stable 12V for the disks. >> >> You should provide the list more specifics on your hw setup, of >> interest are what kind of SATA/SAS ports you are using and how the >> disk are cabled in. >> >> Note that there are a number of controllers that aren't the most >> reliable and some of those controllers when something happens will >> stop responding for all disks connected to it. >> >> I have also seen badly designed motherboards have >> build-in(non-AMD/non-Intel chips) sata ports that don't work under any >> load that uses more than a single disk at a time, and/or acts badly >> when given smart commands. >> >> On Mon, Oct 5, 2020 at 9:30 AM Daniel Sanabria <sanabria.d@gmail.com> wrote: >>> >>>> I meant not to me personally, but to the mailing list. The drives seem OK >>>> though, even sde. >>> >>> Sorry missed the reply-all button >>> >>> On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@romanrm.net> wrote: >>>> >>>> On Mon, 5 Oct 2020 14:59:35 +0100 >>>> Daniel Sanabria <sanabria.d@gmail.com> wrote: >>>> >>>>>> It looks like a drive is dropping off the bus and then failing to reidentify, >>>>>> could be bad cabling/controller/PSU, or just a bad drive. You should post >>>>>> "smartctl -a" of all drives as well. >>>> >>>> I meant not to me personally, but to the mailing list. The drives seem OK >>>> though, even sde. 
>>>>>> >>>>>> It looks like a drive is dropping off the bus and then failing to reidentify, >>>>>> could be bad cabling/controller/PSU, or just a bad drive. You should post >>>>>> "smartctl -a" of all drives as well.
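The per-disk echo lines in the disk-timeout.sh script quoted in this message can be folded into a small loop. What follows is only a sketch, not the script from the message: the function name set_disk_timeouts and its SYSFS_ROOT argument are assumptions made for illustration (the argument is there so the function can be pointed at a scratch directory for a dry run). The 180-second value is the one used in the message.

```shell
#!/bin/sh
# Sketch: raise the kernel's SCSI command timeout for a list of disks.
# Assumption: SYSFS_ROOT is normally /sys; the parameter exists only so the
# function can be tried against a scratch directory first.
# Usage: set_disk_timeouts SYSFS_ROOT SECONDS DISK...
set_disk_timeouts() {
    sysfs_root=$1
    seconds=$2
    shift 2
    for disk in "$@"; do
        timeout_file="$sysfs_root/block/$disk/device/timeout"
        if [ -w "$timeout_file" ]; then
            echo "$seconds" > "$timeout_file"
        else
            # Disk missing or file not writable (e.g. not root): report, go on.
            echo "skipping $disk: cannot write $timeout_file" >&2
        fi
    done
}
```

On a live system it would be run as root, for example as `set_disk_timeouts /sys 180 sdc sdd sde` for the three array members shown in the mdadm --detail output. The value does not survive a reboot, which is why the message wraps its script in a oneshot systemd unit.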
* Re: do i need to give up on this setup
From: Roger Heflin @ 2020-10-06 10:53 UTC
To: Daniel Sanabria; +Cc: Roman Mamedov, Linux-RAID

Ok. That is kind of what I was fearing based on the behavior.

I know first hand that the Marvell 9230 SATA card is a POS. When I was running it I noted that if you ran SMART commands against it, it would go offline quicker. If you don't run SMART commands at all then it is more stable, but it will still have serious issues sometimes during RAID syncs and/or rebuilds. When the card has an issue, all of the ports seem to stop responding to commands. I am guessing the firmware on the card somehow crashes or gets into some sort of endless loop.

I reported it to Marvell; they blamed the OS's AHCI drivers, even though the AHCI drivers worked perfectly fine with the built-in AMD ports. Gotta love engineers and support people that have absolutely no idea what they are doing, nor how to validate that a design works. I have been burned by Marvell cards enough that I will not buy any Marvell product, as I know they have zero idea how to validate designs.

I moved to a used LSI SAS card (4 internal ports, 4 external ports also; it needs the non-RAID BIOS installed to be a dumb card). Outside of the enterprise-type cards, I have yet to find a stable PCIe card, and in one of my backup machines I still use a SATA Sil (old PCI) card because, while it is slow (all 4 ports limited to a total of about 120MB; real is 90MB), it does consistently work right.

On Tue, Oct 6, 2020 at 2:56 AM Daniel Sanabria <sanabria.d@gmail.com> wrote:
>
> Yeah it is quite possible that the hardware can't support the setup
> because this array was an afterthought and considered an upgrade to
> the system.
> > For the record here are some more details about the setup: > > Motherboard: ASROCK EP2C602-4L/D16 > PSU: 850W Corsair RM Series > I have 6 drives connected to the motherboard. The 3 drives forming the > array are Western Digital Green WDC WD30EZRX-00D8PB0 and are connected > to 3 ports of the Marvell 9230 SATA controller. The other 3 drives are > Western Digital Caviar Blue (SATA) WDC WD5000AAKS-00A7B2 are connected > to the spare Marvell port and the 2 SATA/SAS Motherboard ports > available. > > On Mon, 5 Oct 2020 at 16:58, Roger Heflin <rogerheflin@gmail.com> wrote: > > > > what they said you have a hardware problem. > > > > it could be about anything previously mentioned and could also be the > > power supply being unable to provide a stable 12V for the disks. > > > > You should provide the list more specifics on your hw setup, of > > interest are what kind of SATA/SAS ports you are using and how the > > disk are cabled in. > > > > Note that there are a number of controllers that aren't the most > > reliable and some of those controllers when something happens will > > stop responding for all disks connected to it. > > > > I have also seen badly designed motherboards have > > build-in(non-AMD/non-Intel chips) sata ports that don't work under any > > load that uses more than a single disk at a time, and/or acts badly > > when given smart commands. > > > > On Mon, Oct 5, 2020 at 9:30 AM Daniel Sanabria <sanabria.d@gmail.com> wrote: > > > > > > > I meant not to me personally, but to the mailing list. The drives seem OK > > > > though, even sde. > > > > > > Sorry missed the reply-all button > > > > > > On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@romanrm.net> wrote: > > > > > > > > On Mon, 5 Oct 2020 14:59:35 +0100 > > > > Daniel Sanabria <sanabria.d@gmail.com> wrote: > > > > > > > > > > It looks like a drive is dropping off the bus and then failing to reidentify, > > > > > > could be bad cabling/controller/PSU, or just a bad drive. 
You should post > > > > > > "smartctl -a" of all drives as well. > > > > > > > > I meant not to me personally, but to the mailing list. The drives seem OK > > > > though, even sde. > > > > > > > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdc > > > > > [sudo] password for dan: > > > > > smartctl 6.6 2017-11-05 r4594 > > > > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build) > > > > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org > > > > > > > > > > === START OF INFORMATION SECTION === > > > > > Model Family: Western Digital Green > > > > > Device Model: WDC WD30EZRX-00D8PB0 > > > > > Serial Number: WD-WCC4NCWT13RF > > > > > LU WWN Device Id: 5 0014ee 25fc9e460 > > > > > Firmware Version: 80.00A80 > > > > > User Capacity: 3,000,591,900,160 bytes [3.00 TB] > > > > > Sector Sizes: 512 bytes logical, 4096 bytes physical > > > > > Rotation Rate: 5400 rpm > > > > > Device is: In smartctl database [for details use: -P show] > > > > > ATA Version is: ACS-2 (minor revision not indicated) > > > > > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) > > > > > Local Time is: Mon Oct 5 14:58:34 2020 BST > > > > > SMART support is: Available - device has SMART capability. > > > > > SMART support is: Enabled > > > > > > > > > > === START OF READ SMART DATA SECTION === > > > > > SMART overall-health self-assessment test result: PASSED > > > > > > > > > > General SMART Values: > > > > > Offline data collection status: (0x82) Offline data collection activity > > > > > was completed without error. > > > > > Auto Offline Data Collection: Enabled. > > > > > Self-test execution status: ( 0) The previous self-test routine completed > > > > > without error or no self-test has ever > > > > > been run. > > > > > Total time to complete Offline > > > > > data collection: (38940) seconds. > > > > > Offline data collection > > > > > capabilities: (0x7b) SMART execute Offline immediate. > > > > > Auto Offline data collection on/off support. 
> > > > > Suspend Offline collection upon new > > > > > command. > > > > > Offline surface scan supported. > > > > > Self-test supported. > > > > > Conveyance Self-test supported. > > > > > Selective Self-test supported. > > > > > SMART capabilities: (0x0003) Saves SMART data before entering > > > > > power-saving mode. > > > > > Supports SMART auto save timer. > > > > > Error logging capability: (0x01) Error logging supported. > > > > > General Purpose Logging supported. > > > > > Short self-test routine > > > > > recommended polling time: ( 2) minutes. > > > > > Extended self-test routine > > > > > recommended polling time: ( 391) minutes. > > > > > Conveyance self-test routine > > > > > recommended polling time: ( 5) minutes. > > > > > SCT capabilities: (0x7035) SCT Status supported. > > > > > SCT Feature Control supported. > > > > > SCT Data Table supported. > > > > > > > > > > SMART Attributes Data Structure revision number: 16 > > > > > Vendor Specific SMART Attributes with Thresholds: > > > > > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE > > > > > UPDATED WHEN_FAILED RAW_VALUE > > > > > 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail > > > > > Always - 0 > > > > > 3 Spin_Up_Time 0x0027 178 165 021 Pre-fail > > > > > Always - 6075 > > > > > 4 Start_Stop_Count 0x0032 100 100 000 Old_age > > > > > Always - 81 > > > > > 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail > > > > > Always - 0 > > > > > 7 Seek_Error_Rate 0x002e 100 253 000 Old_age > > > > > Always - 0 > > > > > 9 Power_On_Hours 0x0032 075 075 000 Old_age > > > > > Always - 18577 > > > > > 10 Spin_Retry_Count 0x0032 100 253 000 Old_age > > > > > Always - 0 > > > > > 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age > > > > > Always - 0 > > > > > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age > > > > > Always - 81 > > > > > 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age > > > > > Always - 46 > > > > > 193 Load_Cycle_Count 0x0032 142 142 000 Old_age > > > > > Always - 176661 > > > 
> > 194 Temperature_Celsius 0x0022 122 109 000 Old_age > > > > > Always - 28 > > > > > 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age > > > > > Always - 0 > > > > > 197 Current_Pending_Sector 0x0032 200 200 000 Old_age > > > > > Always - 0 > > > > > 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age > > > > > Offline - 0 > > > > > 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age > > > > > Always - 0 > > > > > 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age > > > > > Offline - 0 > > > > > > > > > > SMART Error Log Version: 1 > > > > > No Errors Logged > > > > > > > > > > SMART Self-test log structure revision number 1 > > > > > Num Test_Description Status Remaining > > > > > LifeTime(hours) LBA_of_first_error > > > > > # 1 Extended offline Completed without error 00% 17479 - > > > > > # 2 Short offline Completed without error 00% 15531 - > > > > > > > > > > SMART Selective self-test log data structure revision number 1 > > > > > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS > > > > > 1 0 0 Not_testing > > > > > 2 0 0 Not_testing > > > > > 3 0 0 Not_testing > > > > > 4 0 0 Not_testing > > > > > 5 0 0 Not_testing > > > > > Selective self-test flags (0x0): > > > > > After scanning selected spans, do NOT read-scan remainder of disk. > > > > > If Selective self-test is pending on power-up, resume after 0 minute delay. 
> > > > > > > > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdd > > > > > smartctl 6.6 2017-11-05 r4594 > > > > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build) > > > > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org > > > > > > > > > > === START OF INFORMATION SECTION === > > > > > Model Family: Western Digital Green > > > > > Device Model: WDC WD30EZRX-00D8PB0 > > > > > Serial Number: WD-WCC4NPRDD6D7 > > > > > LU WWN Device Id: 5 0014ee 25fca27b1 > > > > > Firmware Version: 80.00A80 > > > > > User Capacity: 3,000,592,982,016 bytes [3.00 TB] > > > > > Sector Sizes: 512 bytes logical, 4096 bytes physical > > > > > Rotation Rate: 5400 rpm > > > > > Device is: In smartctl database [for details use: -P show] > > > > > ATA Version is: ACS-2 (minor revision not indicated) > > > > > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) > > > > > Local Time is: Mon Oct 5 14:58:54 2020 BST > > > > > SMART support is: Available - device has SMART capability. > > > > > SMART support is: Enabled > > > > > > > > > > === START OF READ SMART DATA SECTION === > > > > > SMART overall-health self-assessment test result: PASSED > > > > > > > > > > General SMART Values: > > > > > Offline data collection status: (0x82) Offline data collection activity > > > > > was completed without error. > > > > > Auto Offline Data Collection: Enabled. > > > > > Self-test execution status: ( 0) The previous self-test routine completed > > > > > without error or no self-test has ever > > > > > been run. > > > > > Total time to complete Offline > > > > > data collection: (39060) seconds. > > > > > Offline data collection > > > > > capabilities: (0x7b) SMART execute Offline immediate. > > > > > Auto Offline data collection on/off support. > > > > > Suspend Offline collection upon new > > > > > command. > > > > > Offline surface scan supported. > > > > > Self-test supported. > > > > > Conveyance Self-test supported. > > > > > Selective Self-test supported. 
> > > > > SMART capabilities: (0x0003) Saves SMART data before entering > > > > > power-saving mode. > > > > > Supports SMART auto save timer. > > > > > Error logging capability: (0x01) Error logging supported. > > > > > General Purpose Logging supported. > > > > > Short self-test routine > > > > > recommended polling time: ( 2) minutes. > > > > > Extended self-test routine > > > > > recommended polling time: ( 392) minutes. > > > > > Conveyance self-test routine > > > > > recommended polling time: ( 5) minutes. > > > > > SCT capabilities: (0x7035) SCT Status supported. > > > > > SCT Feature Control supported. > > > > > SCT Data Table supported. > > > > > > > > > > SMART Attributes Data Structure revision number: 16 > > > > > Vendor Specific SMART Attributes with Thresholds: > > > > > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE > > > > > UPDATED WHEN_FAILED RAW_VALUE > > > > > 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail > > > > > Always - 0 > > > > > 3 Spin_Up_Time 0x0027 178 164 021 Pre-fail > > > > > Always - 6100 > > > > > 4 Start_Stop_Count 0x0032 100 100 000 Old_age > > > > > Always - 81 > > > > > 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail > > > > > Always - 0 > > > > > 7 Seek_Error_Rate 0x002e 100 253 000 Old_age > > > > > Always - 0 > > > > > 9 Power_On_Hours 0x0032 075 075 000 Old_age > > > > > Always - 18580 > > > > > 10 Spin_Retry_Count 0x0032 100 253 000 Old_age > > > > > Always - 0 > > > > > 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age > > > > > Always - 0 > > > > > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age > > > > > Always - 81 > > > > > 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age > > > > > Always - 53 > > > > > 193 Load_Cycle_Count 0x0032 136 136 000 Old_age > > > > > Always - 192427 > > > > > 194 Temperature_Celsius 0x0022 121 108 000 Old_age > > > > > Always - 29 > > > > > 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age > > > > > Always - 0 > > > > > 197 Current_Pending_Sector 0x0032 200 200 000 
Old_age > > > > > Always - 0 > > > > > 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age > > > > > Offline - 0 > > > > > 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age > > > > > Always - 0 > > > > > 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age > > > > > Offline - 0 > > > > > > > > > > SMART Error Log Version: 1 > > > > > No Errors Logged > > > > > > > > > > SMART Self-test log structure revision number 1 > > > > > Num Test_Description Status Remaining > > > > > LifeTime(hours) LBA_of_first_error > > > > > # 1 Extended offline Completed without error 00% 17481 - > > > > > # 2 Short offline Completed without error 00% 15534 - > > > > > > > > > > SMART Selective self-test log data structure revision number 1 > > > > > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS > > > > > 1 0 0 Not_testing > > > > > 2 0 0 Not_testing > > > > > 3 0 0 Not_testing > > > > > 4 0 0 Not_testing > > > > > 5 0 0 Not_testing > > > > > Selective self-test flags (0x0): > > > > > After scanning selected spans, do NOT read-scan remainder of disk. > > > > > If Selective self-test is pending on power-up, resume after 0 minute delay. 
> > > > > > > > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sde > > > > > smartctl 6.6 2017-11-05 r4594 > > > > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build) > > > > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org > > > > > > > > > > === START OF INFORMATION SECTION === > > > > > Model Family: Western Digital Green > > > > > Device Model: WDC WD30EZRX-00D8PB0 > > > > > Serial Number: WD-WCC4N1294906 > > > > > LU WWN Device Id: 5 0014ee 25f968120 > > > > > Firmware Version: 80.00A80 > > > > > User Capacity: 3,000,591,900,160 bytes [3.00 TB] > > > > > Sector Sizes: 512 bytes logical, 4096 bytes physical > > > > > Rotation Rate: 5400 rpm > > > > > Device is: In smartctl database [for details use: -P show] > > > > > ATA Version is: ACS-2 (minor revision not indicated) > > > > > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) > > > > > Local Time is: Mon Oct 5 14:58:57 2020 BST > > > > > SMART support is: Available - device has SMART capability. > > > > > SMART support is: Enabled > > > > > > > > > > === START OF READ SMART DATA SECTION === > > > > > SMART overall-health self-assessment test result: PASSED > > > > > > > > > > General SMART Values: > > > > > Offline data collection status: (0x82) Offline data collection activity > > > > > was completed without error. > > > > > Auto Offline Data Collection: Enabled. > > > > > Self-test execution status: ( 0) The previous self-test routine completed > > > > > without error or no self-test has ever > > > > > been run. > > > > > Total time to complete Offline > > > > > data collection: (43200) seconds. > > > > > Offline data collection > > > > > capabilities: (0x7b) SMART execute Offline immediate. > > > > > Auto Offline data collection on/off support. > > > > > Suspend Offline collection upon new > > > > > command. > > > > > Offline surface scan supported. > > > > > Self-test supported. > > > > > Conveyance Self-test supported. > > > > > Selective Self-test supported. 
> > > > > SMART capabilities: (0x0003) Saves SMART data before entering > > > > > power-saving mode. > > > > > Supports SMART auto save timer. > > > > > Error logging capability: (0x01) Error logging supported. > > > > > General Purpose Logging supported. > > > > > Short self-test routine > > > > > recommended polling time: ( 2) minutes. > > > > > Extended self-test routine > > > > > recommended polling time: ( 433) minutes. > > > > > Conveyance self-test routine > > > > > recommended polling time: ( 5) minutes. > > > > > SCT capabilities: (0x7035) SCT Status supported. > > > > > SCT Feature Control supported. > > > > > SCT Data Table supported. > > > > > > > > > > SMART Attributes Data Structure revision number: 16 > > > > > Vendor Specific SMART Attributes with Thresholds: > > > > > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE > > > > > UPDATED WHEN_FAILED RAW_VALUE > > > > > 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail > > > > > Always - 0 > > > > > 3 Spin_Up_Time 0x0027 176 166 021 Pre-fail > > > > > Always - 6158 > > > > > 4 Start_Stop_Count 0x0032 100 100 000 Old_age > > > > > Always - 80 > > > > > 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail > > > > > Always - 0 > > > > > 7 Seek_Error_Rate 0x002e 200 200 000 Old_age > > > > > Always - 0 > > > > > 9 Power_On_Hours 0x0032 075 075 000 Old_age > > > > > Always - 18465 > > > > > 10 Spin_Retry_Count 0x0032 100 253 000 Old_age > > > > > Always - 0 > > > > > 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age > > > > > Always - 0 > > > > > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age > > > > > Always - 80 > > > > > 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age > > > > > Always - 53 > > > > > 193 Load_Cycle_Count 0x0032 142 142 000 Old_age > > > > > Always - 174015 > > > > > 194 Temperature_Celsius 0x0022 121 107 000 Old_age > > > > > Always - 29 > > > > > 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age > > > > > Always - 0 > > > > > 197 Current_Pending_Sector 0x0032 200 200 000 
Old_age > > > > > Always - 0 > > > > > 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age > > > > > Offline - 0 > > > > > 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age > > > > > Always - 0 > > > > > 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age > > > > > Offline - 0 > > > > > > > > > > SMART Error Log Version: 1 > > > > > No Errors Logged > > > > > > > > > > SMART Self-test log structure revision number 1 > > > > > Num Test_Description Status Remaining > > > > > LifeTime(hours) LBA_of_first_error > > > > > # 1 Extended offline Completed without error 00% 17347 - > > > > > # 2 Short offline Completed without error 00% 15414 - > > > > > > > > > > SMART Selective self-test log data structure revision number 1 > > > > > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS > > > > > 1 0 0 Not_testing > > > > > 2 0 0 Not_testing > > > > > 3 0 0 Not_testing > > > > > 4 0 0 Not_testing > > > > > 5 0 0 Not_testing > > > > > Selective self-test flags (0x0): > > > > > After scanning selected spans, do NOT read-scan remainder of disk. > > > > > If Selective self-test is pending on power-up, resume after 0 minute delay. > > > > > > > > > > [dan@lamachine ~]$ > > > > > > > > > > > > > > > On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote: > > > > > > > > > > > > On Mon, 5 Oct 2020 14:10:25 +0100 > > > > > > Daniel Sanabria <sanabria.d@gmail.com> wrote: > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > Scrubbing ( # echo check > > > > > > > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :( > > > > > > > > > > > > > > I'm attaching details of the array and disks (bloody wd greens) as > > > > > > > well as journalctl errors providing some details about the issue. > > > > > > > > > > > > > > If you have any pointers on what might be the cause of this as well as > > > > > > > any recommendations on how to improve things please let me thank you > > > > > > > in advance ... 
> > > > > > > > > > > > > > I have backups of the data so happy to move this to a different setup > > > > > > > you might recommend (apps will be mostly reading from the array via > > > > > > > NFS since most of the content will be media). > > > > > > > > > > > > > > My suspicion is that a timer service is kicking in and disrupting the > > > > > > > scrubbing somehow but can't pinpoint what causes this. > > > > > > > > > > > > It looks like a drive is dropping off the bus and then failing to reidentify, > > > > > > could be bad cabling/controller/PSU, or just a bad drive. You should post > > > > > > "smartctl -a" of all drives as well. > > > > > > > > > > > > -- > > > > > > With respect, > > > > > > Roman > > > > > > > > > > > > -- > > > > With respect, > > > > Roman ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: do i need to give up on this setup 2020-10-06 10:53 ` Roger Heflin @ 2020-10-06 11:29 ` antlists 2020-10-06 14:59 ` Roger Heflin 2020-10-06 15:03 ` Tim Small 1 sibling, 1 reply; 16+ messages in thread
From: antlists @ 2020-10-06 11:29 UTC
To: Roger Heflin, Daniel Sanabria; +Cc: Roman Mamedov, Linux-RAID

On 06/10/2020 11:53, Roger Heflin wrote:
> When the card has an
> issue all of the ports seem to stop responding to commands. I am
> guessing the firmware on the card somehow crashes or gets into some
> sort of endless loop. I reported it to Marvell, they blamed the OS's
> AHCI drivers, even though the AHCI drivers worked perfectly fine with
> the built in AMD ports.

So we've got the crap drives on the crap controllers ... would it make any difference if you put the Greens on the motherboard, and the Caviars on the Marvell? Caviars I believe are good quality drives that might take enough load off the Marvell to enable it to work sort-of okay ...

Oh - and replace the Greens pretty soon - I don't know how they compare against other drives quality-wise, but they are optimised in a way that is a RAID anti-pattern.

Cheers,
Wol
* Re: do i need to give up on this setup 2020-10-06 11:29 ` antlists @ 2020-10-06 14:59 ` Roger Heflin 2020-10-09 1:00 ` John Stoffel 0 siblings, 1 reply; 16+ messages in thread
From: Roger Heflin @ 2020-10-06 14:59 UTC
To: antlists; +Cc: Daniel Sanabria, Roman Mamedov, Linux-RAID

The controller is crap, and is expected to have serious issues no matter what drives are on the controller. Given the Greens don't seem to have any reallocated blocks, I am guessing the controller is 90%+ of the problem, and right now may be all of the problem. If you lose all of the disks on the Marvell controller at roughly the same time, that is the controller bug and not a disk issue. It also does not seem to be triggered by a disk issue: the controller seems to have a race condition, so that when multiple operations are being done at the same time it just "crashes" and stops responding to all drives on that controller.

On Tue, Oct 6, 2020 at 6:29 AM antlists <antlists@youngman.org.uk> wrote:
>
> On 06/10/2020 11:53, Roger Heflin wrote:
> > When the card has an
> > issue all of the ports seem to stop responding to commands. I am
> > guessing the firmware on the card somehow crashes or gets into some
> > sort of endless loop. I reported it to Marvell, they blamed the OS's
> > AHCI drivers, even though the AHCI drivers worked perfectly fine with
> > the built in AMD ports.
>
> So we've got the crap drives on the crap controllers ... would it make
> any difference if you put the Greens on the motherboard, and the Caviars
> on the Marvell? Caviars I believe are good quality drives that might
> take enough load off the Marvell to enable it to work sort-of okay ...
>
> Oh - and replace the Greens pretty soon - I don't know how they compare
> against other drives quality-wise, but they are optimised in a raid
> anti-pattern.
>
> Cheers,
> Wol
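Roger's test (all disks on the Marvell controller dropping at roughly the same time) is easier to apply once it is clear which drives sit behind which controller. The following is only a sketch of one way to check that from sysfs: the function names `controller_of_path` and `controllers_of_drives` are made up for illustration, and it assumes a Linux system where `/sys/block/<drive>` resolves to a path containing PCI addresses and a `grep` that supports `-o`.

```shell
# Sketch: print the PCI address of the controller a sysfs device path
# hangs off, i.e. the last domain:bus:device.function component in it.
controller_of_path() {
    printf '%s\n' "$1" |
        grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' |
        tail -n 1
}

# Print "drive controller-address" for each named drive, resolving the
# /sys/block symlink to the full device path first.
controllers_of_drives() {
    for drive in "$@"; do
        echo "$drive $(controller_of_path "$(readlink -f "/sys/block/$drive")")"
    done
}
```

For the array in this thread that would be `controllers_of_drives sdc sdd sde`; passing one of the printed addresses to `lspci -s` then names the chip. Drives that share the address lspci reports as the Marvell part are the ones expected to disappear together.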
* Re: do i need to give up on this setup 2020-10-06 14:59 ` Roger Heflin @ 2020-10-09 1:00 ` John Stoffel 0 siblings, 0 replies; 16+ messages in thread
From: John Stoffel @ 2020-10-09 1:00 UTC
To: Roger Heflin; +Cc: antlists, Daniel Sanabria, Roman Mamedov, Linux-RAID

>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:

Roger> The controller is crap, and is expected to have serious issues no
Roger> matter what drives are on the controller.

I can't say enough good things about the LSI SATA RAID controllers. You can usually get them pretty cheap on eBay, and just flash them with the JBOD firmware and they do great.

LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT S)

It's been a while, but I think it was an IBM branded card at the time. 8 ports, easy setup, works well. And looking on eBay, they're cheap now, around $50, though you might have to pay for the 1-to-4 cables to go to SATA drives. 9211, 9341, stuff like that. Here's an eBay listing with cables:

https://www.ebay.com/p/1404809612?iid=133501363959&_trkparms=aid%3D555018%26algo%3DPL.SIM%26ao%3D1%26asc%3D20170810093926%26meid%3De20ed20fb9634e328a31bca5c9e2063c%26pid%3D100854%26rk%3D1%26rkt%3D1%26itm%3D133501363959%26pmt%3D1%26noa%3D0%26pg%3D2322090%26algv%3DSimplAMLSeedlessV2&_trksid=p2322090.c100854.m4779

In my mind, the other advantage of these cards is that you can get two of them, and split your data across two controllers. But it also gets your data disks off the internal controllers, which means you don't run into nearly as many problems where the system tries to boot off your data drives, or you have to put boot blocks on them, etc.

This controller would be more tolerant of your existing drives. Since you have the 850W power supply, it's almost certainly not a power problem either.

John
* Re: do i need to give up on this setup 2020-10-06 10:53 ` Roger Heflin 2020-10-06 11:29 ` antlists @ 2020-10-06 15:03 ` Tim Small 2020-10-06 16:01 ` Daniel Sanabria 1 sibling, 1 reply; 16+ messages in thread
From: Tim Small @ 2020-10-06 15:03 UTC
To: Roger Heflin, Daniel Sanabria; +Cc: Roman Mamedov, Linux-RAID

On 06/10/2020 11:53, Roger Heflin wrote:
> Outside of the enterprise type cards, I have yet to find a stable PCIe card
>

I've also had numerous problems with Marvell SATA controllers.

I've generally found the ASMedia ASM1083 / ASM1085 AHCI controllers stable and reliable. ASMedia is part of Asus, and from Wikipedia:

"[ASMedia] produces designs for USB, PCI Express and SATA controllers. Excluding the X570 chipset, all of the AM4 chipsets for AMD's Zen micro-architecture were designed by ASMedia"

The ASM108x are only PCIe 2.0 x1 <-> 2 SATA port cards, however there are designs (e.g. "SA3008" - around $45 online) which incorporate multiple ASM108x behind an ASMedia PCIe 2.0 bridge if the number of available PCIe slots is an issue.

Tim.
* Re: do i need to give up on this setup 2020-10-06 15:03 ` Tim Small @ 2020-10-06 16:01 ` Daniel Sanabria 2020-10-07 7:26 ` Tim Small 0 siblings, 1 reply; 16+ messages in thread
From: Daniel Sanabria @ 2020-10-06 16:01 UTC
To: Tim Small; +Cc: Roger Heflin, Roman Mamedov, Linux-RAID

Thank you very much, guys. This is one of the best email lists I'm subscribed to, so thanks to you all!

I've decided to dissolve this array and will use the disks as standalone drives. I have another array (raid0) using a pair of the WD Caviar Blues and the pair of non-Marvell ports, and haven't had any issues in years, so I will keep that one.

Thanks again,
Dan
* Re: do i need to give up on this setup 2020-10-06 16:01 ` Daniel Sanabria @ 2020-10-07 7:26 ` Tim Small 0 siblings, 0 replies; 16+ messages in thread
From: Tim Small @ 2020-10-07 7:26 UTC
To: Daniel Sanabria; +Cc: Roger Heflin, Roman Mamedov, Linux-RAID

If you want to keep the Marvell controller in use, then I found them a lot more stable with Native Command Queuing (NCQ) disabled, which is done by setting the queue depth to 1:

echo 1 > /sys/block/sdX/device/queue_depth

Otherwise you might also look for a firmware update for the WD Green drives which you are having problems with (they might be hidden on vendor support sites like those of Dell and Lenovo, if you can find particular PC models which shipped with the same models of WD drives that you have).

Also, if possible, consider switching to the ASMedia controllers - the two-port versions are available for under €10.

Tim.
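The queue_depth write in Tim's message applies to one drive at a time. A minimal sketch of doing the same for several drives follows; the function name `set_queue_depth` and its sysfs-root argument are assumptions made for illustration (the root is a parameter only so the function can be tried against a scratch directory before touching `/sys/block`).

```shell
# Sketch: write a queue depth to each named drive's sysfs entry.
# Usage: set_queue_depth DEPTH SYSFS_ROOT DRIVE...
# SYSFS_ROOT is normally /sys/block.
set_queue_depth() {
    depth=$1
    root=$2
    shift 2
    for drive in "$@"; do
        file="$root/$drive/device/queue_depth"
        if [ -w "$file" ]; then
            echo "$depth" > "$file"
            # Read the value back so the effective setting is visible.
            echo "$drive: queue_depth=$(cat "$file")"
        else
            echo "$drive: cannot write $file" >&2
        fi
    done
}
```

Run as root, the three array members from this thread would be covered by `set_queue_depth 1 /sys/block sdc sdd sde`. The value is not kept across reboots, so it would have to be reapplied at boot if it turns out to help.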
end of thread, other threads:[~2020-10-09 1:07 UTC | newest]

Thread overview: 16+ messages
2020-10-05 13:10 do i need to give up on this setup Daniel Sanabria
2020-10-05 13:17 ` Reindl Harald
2020-10-05 13:44 ` Roman Mamedov
[not found] ` <CAHscji0pNezf6xCpjWto5-21ayoCeLWm34GTYh5TSgxkOw90mw@mail.gmail.com>
2020-10-05 14:04 ` Roman Mamedov
2020-10-05 14:10 ` Reindl Harald
2020-10-05 14:28 ` Daniel Sanabria
2020-10-05 15:58 ` Roger Heflin
2020-10-06 7:56 ` Daniel Sanabria
2020-10-06 8:24 ` Reindl Harald
2020-10-06 10:53 ` Roger Heflin
2020-10-06 11:29 ` antlists
2020-10-06 14:59 ` Roger Heflin
2020-10-09 1:00 ` John Stoffel
2020-10-06 15:03 ` Tim Small
2020-10-06 16:01 ` Daniel Sanabria
2020-10-07 7:26 ` Tim Small