Linux-Raid Archives on lore.kernel.org
* do i need to give up on this setup
@ 2020-10-05 13:10 Daniel Sanabria
From: Daniel Sanabria @ 2020-10-05 13:10 UTC
  To: Linux-RAID

Hi all,

Scrubbing (# echo check > /sys/devices/virtual/block/md1/md/sync_action)
is killing my array :(
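
For anyone reproducing this, the state of a running check can be read from the same sysfs directory. A minimal sketch, assuming the standard md attributes sync_action, sync_completed, mismatch_cnt and degraded are present. md_scrub_status is a hypothetical helper name, and the default path assumes /sys/block/md1/md resolves to the directory used in the command above.

```shell
# Print the scrub-related sysfs attributes of one md array.
# md_scrub_status is a hypothetical helper, not part of mdadm.
# $1: the array's md sysfs directory (assumed default: /sys/block/md1/md)
md_scrub_status() {
    md_dir="${1:-/sys/block/md1/md}"
    for attr in sync_action sync_completed mismatch_cnt degraded; do
        if [ -r "$md_dir/$attr" ]; then
            # Each attribute is a small text file with a single value.
            printf '%s: %s\n' "$attr" "$(cat "$md_dir/$attr")"
        else
            printf '%s: unavailable\n' "$attr"
        fi
    done
}
```

Calling `md_scrub_status` with no argument reads /sys/block/md1/md. Writing `idle` to sync_action as root should stop a check that is in progress.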

I'm attaching details of the array and disks (bloody WD Greens) as
well as journalctl errors that give some details about the issue.

If you have any pointers on what might be causing this, or any
recommendations on how to improve things, please let me know. Thank you
in advance ...

I have backups of the data, so I'm happy to move this to a different
setup you might recommend (apps will mostly be reading from the array
via NFS, since most of the content will be media).

My suspicion is that a timer service is kicking in and disrupting the
scrubbing somehow, but I can't pinpoint what causes this.
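
One way to test the timer theory is to line up the first kernel message for each ATA port against the times at which systemd timers last fired. A minimal sketch, assuming the journal lines are unwrapped (one entry per line, unlike the mail-wrapped excerpt further down). first_ata_events is a hypothetical helper name.

```shell
# Read journal lines on stdin and print the first kernel message seen
# for each ATA port (ata7, ata9, ...), keeping its timestamp.
# first_ata_events is a hypothetical helper, not an existing tool.
first_ata_events() {
    awk '/kernel: ata[0-9]+/ {
        port = $0
        sub(/.*kernel: /, "", port)   # drop the timestamp and host prefix
        sub(/[.:].*/, "", port)       # keep only the port name, e.g. ata9
        if (!(port in seen)) { seen[port] = 1; print }
    }'
}
```

A possible use is `journalctl -k -b --no-pager | first_ata_events`, then comparing the printed timestamps with the LAST column of `systemctl list-timers --all`.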

Thanks again,

Dan

PS. Apologies for the verbosity of the logs, but I wasn't really sure
whether you accept links from paste services.


[dan@lamachine ~]$ sudo mdadm --detail /dev/md1
[sudo] password for dan:
/dev/md1:
           Version : 1.2
     Creation Time : Fri Feb 15 12:26:56 2019
        Raid Level : raid5
        Array Size : 4194039808 (3.91 TiB 4.29 TB)
     Used Dev Size : 2097019904 (1999.87 GiB 2147.35 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Oct  5 11:35:31 2020
             State : clean, degraded
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       -       0        0        2      removed

       3       8       65        -      faulty   /dev/sde1
[dan@lamachine ~]$
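
As a cross-check of the sizes in the mdadm output above: RAID5 gives up one member's worth of space to parity, so the usable capacity is (Raid Devices - 1) x Used Dev Size. A minimal sketch of that arithmetic, assuming mdadm reports both sizes in KiB and that the shell does 64-bit arithmetic. raid5_capacity_kib is a hypothetical helper name.

```shell
# Usable RAID5 capacity in KiB: one member's worth of space holds parity.
# raid5_capacity_kib is a hypothetical helper, not an mdadm command.
# $1: number of raid devices, $2: used device size in KiB
raid5_capacity_kib() {
    raid_devices=$1
    used_dev_kib=$2
    echo $(( (raid_devices - 1) * used_dev_kib ))
}

# Values copied from the mdadm --detail output for /dev/md1.
raid5_capacity_kib 3 2097019904   # prints 4194039808, the reported Array Size
```

The result matches the reported Array Size even with one member missing, because the size follows from Raid Devices rather than Active Devices.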

[dan@lamachine ~]$ sudo hdparm -I /dev/sdc
[sudo] password for dan:

/dev/sdc:

ATA device, with non-removable media
Model Number:       WDC WD30EZRX-00D8PB0
Serial Number:      WD-WCC4NCWT13RF
Firmware Revision:  80.00A80
Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev
2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Supported: 9 8 7 6 5
Likely used: 9
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors:    16514064
LBA    user addressable sectors:   268435455
LBA48  user addressable sectors:  5860531055
Logical  Sector size:                   512 bytes
Physical Sector size:                  4096 bytes
device size with M = 1024*1024:     2861587 MBytes
device size with M = 1000*1000:     3000591 MBytes (3000 GB)
cache/buffer size  = unknown
Nominal Media Rotation Rate: 5400
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
     Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   * SMART feature set
    Security Mode feature set
   * Power Management feature set
   * Write cache
   * Look-ahead
   * Host Protected Area feature set
   * WRITE_BUFFER command
   * READ_BUFFER command
   * NOP cmd
   * DOWNLOAD_MICROCODE
    Power-Up In Standby feature set
   * SET_FEATURES required to spinup after power up
    SET_MAX security extension
   * 48-bit Address feature set
   * Device Configuration Overlay feature set
   * Mandatory FLUSH_CACHE
   * FLUSH_CACHE_EXT
   * SMART error logging
   * SMART self-test
   * General Purpose Logging feature set
   * 64-bit World wide name
   * WRITE_UNCORRECTABLE_EXT command
   * {READ,WRITE}_DMA_EXT_GPL commands
   * Segmented DOWNLOAD_MICROCODE
   * Gen1 signaling speed (1.5Gb/s)
   * Gen2 signaling speed (3.0Gb/s)
   * Gen3 signaling speed (6.0Gb/s)
   * Native Command Queueing (NCQ)
   * Host-initiated interface power management
   * Phy event counters
   * NCQ priority information
   * READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
   * DMA Setup Auto-Activate optimization
    Device-initiated interface power management
   * Software settings preservation
   * SMART Command Transport (SCT) feature set
   * SCT Write Same (AC2)
   * SCT Features Control (AC4)
   * SCT Data Tables (AC5)
    unknown 206[12] (vendor specific)
    unknown 206[13] (vendor specific)
    unknown 206[14] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
414min for SECURITY ERASE UNIT. 414min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee25fc9e460
NAA : 5
IEEE OUI : 0014ee
Unique ID : 25fc9e460
Checksum: correct
[dan@lamachine ~]$ sudo hdparm -I /dev/sde

/dev/sde:

ATA device, with non-removable media
Model Number:       WDC WD30EZRX-00D8PB0
Serial Number:      WD-WCC4N1294906
Firmware Revision:  80.00A80
Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev
2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Supported: 9 8 7 6 5
Likely used: 9
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors:    16514064
LBA    user addressable sectors:   268435455
LBA48  user addressable sectors:  5860531055
Logical  Sector size:                   512 bytes
Physical Sector size:                  4096 bytes
device size with M = 1024*1024:     2861587 MBytes
device size with M = 1000*1000:     3000591 MBytes (3000 GB)
cache/buffer size  = unknown
Nominal Media Rotation Rate: 5400
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
     Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   * SMART feature set
    Security Mode feature set
   * Power Management feature set
   * Write cache
   * Look-ahead
   * Host Protected Area feature set
   * WRITE_BUFFER command
   * READ_BUFFER command
   * NOP cmd
   * DOWNLOAD_MICROCODE
    Power-Up In Standby feature set
   * SET_FEATURES required to spinup after power up
    SET_MAX security extension
   * 48-bit Address feature set
   * Device Configuration Overlay feature set
   * Mandatory FLUSH_CACHE
   * FLUSH_CACHE_EXT
   * SMART error logging
   * SMART self-test
   * General Purpose Logging feature set
   * 64-bit World wide name
   * WRITE_UNCORRECTABLE_EXT command
   * {READ,WRITE}_DMA_EXT_GPL commands
   * Segmented DOWNLOAD_MICROCODE
   * Gen1 signaling speed (1.5Gb/s)
   * Gen2 signaling speed (3.0Gb/s)
   * Gen3 signaling speed (6.0Gb/s)
   * Native Command Queueing (NCQ)
   * Host-initiated interface power management
   * Phy event counters
   * NCQ priority information
   * READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
   * DMA Setup Auto-Activate optimization
    Device-initiated interface power management
   * Software settings preservation
   * SMART Command Transport (SCT) feature set
   * SCT Write Same (AC2)
   * SCT Features Control (AC4)
   * SCT Data Tables (AC5)
    unknown 206[12] (vendor specific)
    unknown 206[13] (vendor specific)
    unknown 206[14] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
458min for SECURITY ERASE UNIT. 458min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee25f968120
NAA : 5
IEEE OUI : 0014ee
Unique ID : 25f968120
Checksum: correct
[dan@lamachine ~]$ sudo hdparm -I /dev/sdd

/dev/sdd:

ATA device, with non-removable media
Model Number:       WDC WD30EZRX-00D8PB0
Serial Number:      WD-WCC4NPRDD6D7
Firmware Revision:  80.00A80
Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev
2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Supported: 9 8 7 6 5
Likely used: 9
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors:    16514064
LBA    user addressable sectors:   268435455
LBA48  user addressable sectors:  5860533168
Logical  Sector size:                   512 bytes
Physical Sector size:                  4096 bytes
device size with M = 1024*1024:     2861588 MBytes
device size with M = 1000*1000:     3000592 MBytes (3000 GB)
cache/buffer size  = unknown
Nominal Media Rotation Rate: 5400
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
     Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   * SMART feature set
    Security Mode feature set
   * Power Management feature set
   * Write cache
   * Look-ahead
   * Host Protected Area feature set
   * WRITE_BUFFER command
   * READ_BUFFER command
   * NOP cmd
   * DOWNLOAD_MICROCODE
    Power-Up In Standby feature set
   * SET_FEATURES required to spinup after power up
    SET_MAX security extension
   * 48-bit Address feature set
   * Device Configuration Overlay feature set
   * Mandatory FLUSH_CACHE
   * FLUSH_CACHE_EXT
   * SMART error logging
   * SMART self-test
   * General Purpose Logging feature set
   * 64-bit World wide name
   * WRITE_UNCORRECTABLE_EXT command
   * {READ,WRITE}_DMA_EXT_GPL commands
   * Segmented DOWNLOAD_MICROCODE
   * Gen1 signaling speed (1.5Gb/s)
   * Gen2 signaling speed (3.0Gb/s)
   * Gen3 signaling speed (6.0Gb/s)
   * Native Command Queueing (NCQ)
   * Host-initiated interface power management
   * Phy event counters
   * NCQ priority information
   * READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
   * DMA Setup Auto-Activate optimization
    Device-initiated interface power management
   * Software settings preservation
   * SMART Command Transport (SCT) feature set
   * SCT Write Same (AC2)
   * SCT Features Control (AC4)
   * SCT Data Tables (AC5)
    unknown 206[12] (vendor specific)
    unknown 206[13] (vendor specific)
    unknown 206[14] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
414min for SECURITY ERASE UNIT. 414min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee25fca27b1
NAA : 5
IEEE OUI : 0014ee
Unique ID : 25fca27b1
Checksum: correct
[dan@lamachine ~]$
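
The 512-byte logical sector size reported above is also the unit for reading the kernel errors in the journal excerpt: each failed Read(16) CDB carries its starting LBA in bytes 3 to 10 and its transfer length in bytes 11 to 14, both big-endian. A minimal sketch of the decoding, with hypothetical helper names cdb16_lba and cdb16_blocks, assuming the CDB bytes are passed as separate hex arguments.

```shell
# Decode fields of a 16-byte SCSI READ(16) CDB given as 16 hex bytes.
# cdb16_lba and cdb16_blocks are hypothetical helper names.

# Starting LBA: bytes 3-10 (1-based), big-endian.
cdb16_lba() {
    shift 2
    printf '%d\n' "0x$1$2$3$4$5$6$7$8"
}

# Transfer length in logical blocks: bytes 11-14 (1-based), big-endian.
cdb16_blocks() {
    shift 10
    printf '%d\n' "0x$1$2$3$4"
}

# One of the failed sde commands from the journal excerpt.
cdb16_lba    88 00 00 00 00 00 45 29 f4 d8 00 00 05 40 00 00   # prints 1160377560
cdb16_blocks 88 00 00 00 00 00 45 29 f4 d8 00 00 05 40 00 00   # prints 1344
echo $(( 1344 * 512 ))                                          # prints 688128
```

The decoded LBA agrees with one of the "I/O error, dev sde, sector 1160377560" lines, and 1344 blocks of 512 bytes is 688128 bytes, the same request size shown on the READ FPDMA QUEUED lines.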


Truncated journalctl logs:

Oct 05 10:57:11 lamachine systemd-logind[1571]: Session 8 logged out.
Waiting for processes to exit.
Oct 05 10:57:11 lamachine systemd-logind[1571]: Removed session 8.
Oct 05 11:00:35 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus
133 SControl 300)
Oct 05 11:00:35 lamachine smartd[1480]: Device: /dev/sdc [SAT], failed
to read SMART Attribute Data
Oct 05 11:00:35 lamachine kernel: ata7.00: configured for UDMA/133
Oct 05 11:00:35 lamachine smartd[1480]: Sending warning via
/usr/libexec/smartmontools/smartdnotify to root ...
Oct 05 11:00:35 lamachine smartd[1480]: Warning via
/usr/libexec/smartmontools/smartdnotify to root: successful
Oct 05 11:00:35 lamachine postfix/pickup[2347]: EFF87608EF11: uid=0 from=<root>
Oct 05 11:00:35 lamachine postfix/cleanup[4225]: EFF87608EF11:
message-id=<20201005100035.EFF87608EF11@lamachine.localdomain>
Oct 05 11:00:36 lamachine postfix/qmgr[2080]: EFF87608EF11:
from=<root@lamachine.localdomain>, size=524, nrcpt=1 (queue active)
Oct 05 11:00:36 lamachine postfix/local[4228]: EFF87608EF11:
to=<root@lamachine.localdomain>, orig_to=<root>, relay=local,
delay=0.15, delays=0.09/0.02/0/0.04, dsn=2.0.0, status=sent (delivered
to mailbox)
Oct 05 11:00:36 lamachine postfix/qmgr[2080]: EFF87608EF11: removed
Oct 05 11:08:36 lamachine sshd[3936]: Timeout, client not responding
from user dan 192.168.1.113 port 54226
Oct 05 11:08:36 lamachine sshd[3933]: pam_unix(sshd:session): session
closed for user dan
Oct 05 11:08:36 lamachine systemd-logind[1571]: Session 7 logged out.
Waiting for processes to exit.
Oct 05 11:08:36 lamachine sudo[3990]: pam_unix(sudo:session): session
closed for user root
Oct 05 11:08:36 lamachine systemd-logind[1571]: Removed session 7.
Oct 05 11:29:33 lamachine smartd[1480]: Device: /dev/sdc [SAT], read
SMART Attribute Data worked again, warning condition reset after 1
email
Oct 05 11:30:43 lamachine kernel: ata9: link is slow to respond,
please be patient (ready=0)
Oct 05 11:30:47 lamachine kernel: ata9: COMRESET failed (errno=-16)
Oct 05 11:30:53 lamachine kernel: ata9: link is slow to respond,
please be patient (ready=0)
Oct 05 11:30:57 lamachine kernel: ata9: COMRESET failed (errno=-16)
Oct 05 11:31:03 lamachine kernel: ata9: link is slow to respond,
please be patient (ready=0)
Oct 05 11:31:32 lamachine kernel: ata9: COMRESET failed (errno=-16)
Oct 05 11:31:32 lamachine kernel: ata9: limiting SATA link speed to 3.0 Gbps
Oct 05 11:31:37 lamachine smartd[1480]: Device: /dev/sde [SAT], failed
to read SMART Attribute Data
Oct 05 11:31:37 lamachine smartd[1480]: Sending warning via
/usr/libexec/smartmontools/smartdnotify to root ...
Oct 05 11:31:37 lamachine kernel: ata9: COMRESET failed (errno=-16)
Oct 05 11:31:37 lamachine kernel: ata9: reset failed, giving up
Oct 05 11:31:37 lamachine kernel: ata9.00: disabled
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#7 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#6 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=124s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#6 CDB:
Read(16) 88 00 00 00 00 00 45 29 f4 d8 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#7 CDB:
Read(16) 88 00 00 00 00 00 45 2a 78 18 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev
sde, sector 1160411160 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev
sde, sector 1160377560 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#9 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#16 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#9 CDB:
Read(16) 88 00 00 00 00 00 45 29 fa 18 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#16 CDB:
Read(16) 88 00 00 00 00 00 45 2a 82 98 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev
sde, sector 1160413848 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev
sde, sector 1160378904 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#10 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#18 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#10 CDB:
Read(16) 88 00 00 00 00 00 45 29 ff 58 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#18 CDB:
Read(16) 88 00 00 00 00 00 45 2a 8d 18 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev
sde, sector 1160416536 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev
sde, sector 1160380248 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#11 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#31 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#11 CDB:
Read(16) 88 00 00 00 00 00 45 2a 04 98 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#31 CDB:
Read(16) 88 00 00 00 00 00 45 2a 97 98 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev
sde, sector 1160419224 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev
sde, sector 1160381592 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#18 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#18 CDB:
Read(16) 88 00 00 00 00 00 45 2a b7 18 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev
sde, sector 1160427288 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#12 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:31:37 lamachine kernel: sd 8:0:0:0: [sde] tag#12 CDB:
Read(16) 88 00 00 00 00 00 45 2a 09 d8 00 00 05 40 00 00
Oct 05 11:31:37 lamachine kernel: blk_update_request: I/O error, dev
sde, sector 1160382936 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:31:37 lamachine smartd[1480]: Warning via
/usr/libexec/smartmontools/smartdnotify to root: successful
Oct 05 11:31:38 lamachine postfix/pickup[4381]: 00E50608EF11: uid=0 from=<root>
Oct 05 11:31:38 lamachine postfix/cleanup[4522]: 00E50608EF11:
message-id=<20201005103138.00E50608EF11@lamachine.localdomain>
Oct 05 11:31:38 lamachine postfix/qmgr[2080]: 00E50608EF11:
from=<root@lamachine.localdomain>, size=524, nrcpt=1 (queue active)
Oct 05 11:31:38 lamachine postfix/local[4524]: 00E50608EF11:
to=<root@lamachine.localdomain>, orig_to=<root>, relay=local,
delay=0.11, delays=0.08/0.01/0/0.03, dsn=2.0.0, status=sent (delivered
to mailbox)
Oct 05 11:31:38 lamachine postfix/qmgr[2080]: 00E50608EF11: removed
Oct 05 11:31:47 lamachine kernel: INFO: task md1_resync:3091 blocked
for more than 120 seconds.
Oct 05 11:31:47 lamachine kernel:       Not tainted
4.18.0-193.14.2.el8_2.x86_64 #1
Oct 05 11:31:47 lamachine kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 05 11:31:47 lamachine kernel: md1_resync      D    0  3091      2 0x80004080
Oct 05 11:31:47 lamachine kernel: Call Trace:
Oct 05 11:31:47 lamachine kernel:  ? __schedule+0x24f/0x650
Oct 05 11:31:47 lamachine kernel:  schedule+0x2f/0xa0
Oct 05 11:31:47 lamachine kernel:  raid5_get_active_stripe+0x469/0x5f0 [raid456]
Oct 05 11:31:47 lamachine kernel:  ? finish_wait+0x80/0x80
Oct 05 11:31:47 lamachine kernel:  raid5_sync_request+0x387/0x3b0 [raid456]
Oct 05 11:31:47 lamachine kernel:  ? cpumask_next+0x17/0x20
Oct 05 11:31:47 lamachine kernel:  ? is_mddev_idle+0xcc/0x12a
Oct 05 11:31:47 lamachine kernel:  md_do_sync.cold.83+0x424/0x953
Oct 05 11:31:47 lamachine kernel:  ? xfrm_user_net_init+0x90/0xa0
Oct 05 11:31:47 lamachine kernel:  ? __switch_to_asm+0x41/0x70
Oct 05 11:31:47 lamachine kernel:  ? finish_wait+0x80/0x80
Oct 05 11:31:47 lamachine kernel:  ? md_register_thread+0xd0/0xd0
Oct 05 11:31:47 lamachine kernel:  md_thread+0x94/0x150
Oct 05 11:31:47 lamachine kernel:  kthread+0x112/0x130
Oct 05 11:31:47 lamachine kernel:  ? kthread_flush_work_fn+0x10/0x10
Oct 05 11:31:47 lamachine kernel:  ret_from_fork+0x35/0x40
Oct 05 11:32:41 lamachine kernel: ata10: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Oct 05 11:32:46 lamachine kernel: ata10.00: qc timeout (cmd 0xec)
Oct 05 11:32:47 lamachine kernel: ata10.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:32:47 lamachine kernel: ata10.00: revalidation failed (errno=-5)
Oct 05 11:32:48 lamachine kernel: ata10: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Oct 05 11:32:58 lamachine kernel: ata10.00: qc timeout (cmd 0xec)
Oct 05 11:32:59 lamachine kernel: ata10.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:32:59 lamachine kernel: ata10.00: revalidation failed (errno=-5)
Oct 05 11:32:59 lamachine kernel: ata10: limiting SATA link speed to 1.5 Gbps
Oct 05 11:32:59 lamachine kernel: ata10: SATA link up 3.0 Gbps
(SStatus 123 SControl 310)
Oct 05 11:33:30 lamachine kernel: ata10.00: qc timeout (cmd 0xec)
Oct 05 11:33:30 lamachine kernel: ata10.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:33:30 lamachine kernel: ata10.00: revalidation failed (errno=-5)
Oct 05 11:33:30 lamachine kernel: ata10.00: disabled
Oct 05 11:33:32 lamachine kernel: ata10: SATA link up 3.0 Gbps
(SStatus 123 SControl 310)
Oct 05 11:33:50 lamachine kernel: INFO: task md1_raid5:1304 blocked
for more than 120 seconds.
Oct 05 11:33:50 lamachine kernel:       Not tainted
4.18.0-193.14.2.el8_2.x86_64 #1
Oct 05 11:33:50 lamachine kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 05 11:33:50 lamachine kernel: md1_raid5       D    0  1304      2 0x80004000
Oct 05 11:33:50 lamachine kernel: Call Trace:
Oct 05 11:33:50 lamachine kernel:  ? __schedule+0x24f/0x650
Oct 05 11:33:50 lamachine kernel:  schedule+0x2f/0xa0
Oct 05 11:33:50 lamachine kernel:  io_schedule+0x12/0x40
Oct 05 11:33:50 lamachine kernel:  blk_mq_get_tag+0x119/0x250
Oct 05 11:33:50 lamachine kernel:  ? finish_wait+0x80/0x80
Oct 05 11:33:50 lamachine kernel:  blk_mq_get_request+0xb7/0x3c0
Oct 05 11:33:50 lamachine kernel:  blk_mq_make_request+0x134/0x5a0
Oct 05 11:33:50 lamachine kernel:  generic_make_request+0xcf/0x310
Oct 05 11:33:50 lamachine kernel:  ops_run_io+0x881/0xd30 [raid456]
Oct 05 11:33:50 lamachine kernel:  ? ops_complete_check+0x50/0x50 [raid456]
Oct 05 11:33:50 lamachine kernel:  handle_stripe+0xc47/0x1f80 [raid456]
Oct 05 11:33:50 lamachine kernel:  ? __wake_up_common+0x7a/0x190
Oct 05 11:33:50 lamachine kernel:
handle_active_stripes.isra.73+0x3e7/0x5c0 [raid456]
Oct 05 11:33:50 lamachine kernel:  raid5d+0x392/0x5b0 [raid456]
Oct 05 11:33:50 lamachine kernel:  ? schedule_timeout+0x20d/0x310
Oct 05 11:33:50 lamachine kernel:  ? _raw_spin_unlock_irqrestore+0x11/0x20
Oct 05 11:33:50 lamachine kernel:  ? md_register_thread+0xd0/0xd0
Oct 05 11:33:50 lamachine kernel:  md_thread+0x94/0x150
Oct 05 11:33:50 lamachine kernel:  ? finish_wait+0x80/0x80
Oct 05 11:33:50 lamachine kernel:  kthread+0x112/0x130
Oct 05 11:33:50 lamachine kernel:  ? kthread_flush_work_fn+0x10/0x10
Oct 05 11:33:50 lamachine kernel:  ret_from_fork+0x35/0x40
Oct 05 11:33:50 lamachine kernel: INFO: task md1_resync:3091 blocked
for more than 120 seconds.
Oct 05 11:33:50 lamachine kernel:       Not tainted
4.18.0-193.14.2.el8_2.x86_64 #1
Oct 05 11:33:50 lamachine kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 05 11:33:50 lamachine kernel: md1_resync      D    0  3091      2 0x80004080
Oct 05 11:33:50 lamachine kernel: Call Trace:
Oct 05 11:33:50 lamachine kernel:  ? __schedule+0x24f/0x650
Oct 05 11:33:50 lamachine kernel:  schedule+0x2f/0xa0
Oct 05 11:33:50 lamachine kernel:  raid5_get_active_stripe+0x469/0x5f0 [raid456]
Oct 05 11:33:50 lamachine kernel:  ? finish_wait+0x80/0x80
Oct 05 11:33:50 lamachine kernel:  raid5_sync_request+0x387/0x3b0 [raid456]
Oct 05 11:33:50 lamachine kernel:  ? cpumask_next+0x17/0x20
Oct 05 11:33:50 lamachine kernel:  ? is_mddev_idle+0xcc/0x12a
Oct 05 11:33:50 lamachine kernel:  md_do_sync.cold.83+0x424/0x953
Oct 05 11:33:50 lamachine kernel:  ? xfrm_user_net_init+0x90/0xa0
Oct 05 11:33:50 lamachine kernel:  ? __switch_to_asm+0x41/0x70
Oct 05 11:33:50 lamachine kernel:  ? finish_wait+0x80/0x80
Oct 05 11:33:50 lamachine kernel:  ? md_register_thread+0xd0/0xd0
Oct 05 11:33:50 lamachine kernel:  md_thread+0x94/0x150
Oct 05 11:33:50 lamachine kernel:  kthread+0x112/0x130
Oct 05 11:33:50 lamachine kernel:  ? kthread_flush_work_fn+0x10/0x10
Oct 05 11:33:50 lamachine kernel:  ret_from_fork+0x35/0x40
Oct 05 11:34:39 lamachine kernel: ata8.00: exception Emask 0x0 SAct
0xffffffff SErr 0x0 action 0x6 frozen
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:00:d8:2f:2b/05:00:45:00:00/40 tag 0 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:08:18:35:2b/05:00:45:00:00/40 tag 1 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:10:58:3a:2b/05:00:45:00:00/40 tag 2 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:18:98:3f:2b/05:00:45:00:00/40 tag 3 ncq dma 688128 in
                                           res
40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:20:d8:44:2b/05:00:45:00:00/40 tag 4 ncq dma 688128 in
                                           res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:28:18:4a:2b/05:00:45:00:00/40 tag 5 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:30:58:4f:2b/05:00:45:00:00/40 tag 6 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:38:98:54:2b/05:00:45:00:00/40 tag 7 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:40:d8:59:2b/05:00:45:00:00/40 tag 8 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:48:18:5f:2b/05:00:45:00:00/40 tag 9 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:50:58:64:2b/05:00:45:00:00/40 tag 10 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:58:98:69:2b/05:00:45:00:00/40 tag 11 ncq dma 688128 in
                                           res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:60:d8:6e:2b/05:00:45:00:00/40 tag 12 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:68:18:74:2b/05:00:45:00:00/40 tag 13 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:70:58:79:2b/05:00:45:00:00/40 tag 14 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:78:98:7e:2b/05:00:45:00:00/40 tag 15 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:80:18:b3:2b/05:00:45:00:00/40 tag 16 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:88:58:b8:2b/05:00:45:00:00/40 tag 17 ncq dma 688128 in
                                           res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:90:98:bd:2b/05:00:45:00:00/40 tag 18 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:98:d8:c2:2b/05:00:45:00:00/40 tag 19 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:a0:18:c8:2b/05:00:45:00:00/40 tag 20 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:a8:58:cd:2b/05:00:45:00:00/40 tag 21 ncq dma 688128 in
                                           res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:b0:d8:83:2b/05:00:45:00:00/40 tag 22 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:b8:18:89:2b/05:00:45:00:00/40 tag 23 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:c0:58:8e:2b/05:00:45:00:00/40 tag 24 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:c8:98:93:2b/05:00:45:00:00/40 tag 25 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:d0:d8:98:2b/05:00:45:00:00/40 tag 26 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:d8:18:9e:2b/05:00:45:00:00/40 tag 27 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:e0:58:a3:2b/05:00:45:00:00/40 tag 28 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:e8:98:a8:2b/05:00:45:00:00/40 tag 29 ncq dma 688128 in
                                           res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:f0:d8:ad:2b/05:00:45:00:00/40 tag 30 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:f8:98:d2:2b/05:00:45:00:00/40 tag 31 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8: hard resetting link
Oct 05 11:34:39 lamachine kernel: ata7.00: exception Emask 0x0 SAct
0xffffffff SErr 0x0 action 0x6 frozen
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:00:d8:98:2b/05:00:45:00:00/40 tag 0 ncq dma 688128 in
                                           res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:08:18:9e:2b/05:00:45:00:00/40 tag 1 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:10:58:a3:2b/05:00:45:00:00/40 tag 2 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:18:d8:2f:2b/05:00:45:00:00/40 tag 3 ncq dma 688128 in
                                           res
40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:20:18:35:2b/05:00:45:00:00/40 tag 4 ncq dma 688128 in
                                           res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:28:58:3a:2b/05:00:45:00:00/40 tag 5 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:30:98:3f:2b/05:00:45:00:00/40 tag 6 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:38:d8:44:2b/05:00:45:00:00/40 tag 7 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:40:18:4a:2b/05:00:45:00:00/40 tag 8 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:48:58:4f:2b/05:00:45:00:00/40 tag 9 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:50:98:54:2b/05:00:45:00:00/40 tag 10 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:58:d8:59:2b/05:00:45:00:00/40 tag 11 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:60:18:5f:2b/05:00:45:00:00/40 tag 12 ncq dma 688128 in
                                           res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:68:58:64:2b/05:00:45:00:00/40 tag 13 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:70:98:69:2b/05:00:45:00:00/40 tag 14 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:78:d8:6e:2b/05:00:45:00:00/40 tag 15 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:80:98:a8:2b/05:00:45:00:00/40 tag 16 ncq dma 688128 in
                                           res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:88:d8:ad:2b/05:00:45:00:00/40 tag 17 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:90:18:74:2b/05:00:45:00:00/40 tag 18 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:98:58:79:2b/05:00:45:00:00/40 tag 19 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:a0:98:7e:2b/05:00:45:00:00/40 tag 20 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:a8:d8:83:2b/05:00:45:00:00/40 tag 21 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:b0:18:89:2b/05:00:45:00:00/40 tag 22 ncq dma 688128 in
                                           res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:b8:58:8e:2b/05:00:45:00:00/40 tag 23 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:c0:18:b3:2b/05:00:45:00:00/40 tag 24 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:c8:58:b8:2b/05:00:45:00:00/40 tag 25 ncq dma 688128 in
                                           res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:d0:98:bd:2b/05:00:45:00:00/40 tag 26 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:d8:d8:c2:2b/05:00:45:00:00/40 tag 27 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:e0:18:c8:2b/05:00:45:00:00/40 tag 28 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:e8:58:cd:2b/05:00:45:00:00/40 tag 29 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:f0:98:d2:2b/05:00:45:00:00/40 tag 30 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata7.00: cmd
60/40:f8:98:93:2b/05:00:45:00:00/40 tag 31 ncq dma 688128 in
                                           res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata7.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata7: hard resetting link
Oct 05 11:34:40 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus
133 SControl 300)
Oct 05 11:34:40 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus
133 SControl 300)
Oct 05 11:34:45 lamachine kernel: ata7.00: qc timeout (cmd 0xec)
Oct 05 11:34:45 lamachine kernel: ata8.00: qc timeout (cmd 0xec)
Oct 05 11:34:46 lamachine kernel: ata7.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:34:46 lamachine kernel: ata7.00: revalidation failed (errno=-5)
Oct 05 11:34:46 lamachine kernel: ata7: hard resetting link
Oct 05 11:34:46 lamachine kernel: ata8.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:34:46 lamachine kernel: ata8.00: revalidation failed (errno=-5)
Oct 05 11:34:46 lamachine kernel: ata8: hard resetting link
Oct 05 11:34:46 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus
133 SControl 300)
Oct 05 11:34:46 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus
133 SControl 300)
Oct 05 11:34:57 lamachine kernel: ata7.00: qc timeout (cmd 0xec)
Oct 05 11:34:57 lamachine kernel: ata8.00: qc timeout (cmd 0xec)
Oct 05 11:34:57 lamachine kernel: ata8.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:34:57 lamachine kernel: ata8.00: revalidation failed (errno=-5)
Oct 05 11:34:57 lamachine kernel: ata8: limiting SATA link speed to 3.0 Gbps
Oct 05 11:34:57 lamachine kernel: ata8: hard resetting link
Oct 05 11:34:57 lamachine kernel: ata7.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:34:57 lamachine kernel: ata7.00: revalidation failed (errno=-5)
Oct 05 11:34:57 lamachine kernel: ata7: limiting SATA link speed to 3.0 Gbps
Oct 05 11:34:57 lamachine kernel: ata7: hard resetting link
Oct 05 11:34:58 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus
133 SControl 320)
Oct 05 11:34:58 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus
133 SControl 320)
Oct 05 11:35:29 lamachine kernel: ata7.00: qc timeout (cmd 0xec)
Oct 05 11:35:29 lamachine kernel: ata8.00: qc timeout (cmd 0xec)
Oct 05 11:35:29 lamachine kernel: ata8.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:35:29 lamachine kernel: ata8.00: revalidation failed (errno=-5)
Oct 05 11:35:29 lamachine kernel: ata8.00: disabled
Oct 05 11:35:29 lamachine kernel: ata7.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:35:29 lamachine kernel: ata7.00: revalidation failed (errno=-5)
Oct 05 11:35:29 lamachine kernel: ata7.00: disabled
Oct 05 11:35:30 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus
133 SControl 320)
Oct 05 11:35:30 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus
133 SControl 320)
Oct 05 11:35:31 lamachine kernel: ata8: EH complete
Oct 05 11:35:31 lamachine kernel: ata7: EH complete
Oct 05 11:35:31 lamachine kernel: scsi_io_completion_action: 115
callbacks suppressed
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#12 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#12 CDB:
Read(16) 88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: print_req_error: 115 callbacks suppressed
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdd, sector 1160501208 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#18 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#18 CDB:
Read(16) 88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdc, sector 1160501208 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#13 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#13 CDB:
Read(16) 88 00 00 00 00 00 45 2b dd 18 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdd, sector 1160502552 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#19 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#19 CDB:
Read(16) 88 00 00 00 00 00 45 2b dd 18 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdc, sector 1160502552 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#14 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#14 CDB:
Read(16) 88 00 00 00 00 00 45 2b e2 58 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdd, sector 1160503896 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#20 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#20 CDB:
Read(16) 88 00 00 00 00 00 45 2b e2 58 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdc, sector 1160503896 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#15 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#15 CDB:
Read(16) 88 00 00 00 00 00 45 2b e7 98 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdd, sector 1160505240 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#21 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#21 CDB:
Read(16) 88 00 00 00 00 00 45 2b e7 98 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdc, sector 1160505240 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#16 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#16 CDB:
Read(16) 88 00 00 00 00 00 45 2b ec d8 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdd, sector 1160506584 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#22 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#22 CDB:
Read(16) 88 00 00 00 00 00 45 2b ec d8 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdc, sector 1160506584 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: md/raid:md1: 23277 read_errors > 23276 stripes
Oct 05 11:35:31 lamachine kernel: md/raid:md1: Too many read errors,
failing device sde1.
Oct 05 11:35:31 lamachine kernel: md/raid:md1: Disk failure on sde1,
disabling device.
                                  md/raid:md1: Operation continuing on
2 devices.
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not
correctable (sector 1160433448 on sde1).
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not
correctable (sector 1160433456 on sde1).
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not
correctable (sector 1160433336 on sde1).
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not
correctable (sector 1160433464 on sde1).
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not
correctable (sector 1160433472 on sde1).
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not
correctable (sector 1160433344 on sde1).
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not
correctable (sector 1160433480 on sde1).
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not
correctable (sector 1160433352 on sde1).
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not
correctable (sector 1160433488 on sde1).
Oct 05 11:35:31 lamachine kernel: md/raid:md1: read error not
correctable (sector 1160433360 on sde1).
Oct 05 11:35:31 lamachine kernel: md: super_written gets error=10
Oct 05 11:35:31 lamachine kernel: md: super_written gets error=10
Oct 05 11:35:31 lamachine kernel: md: super_written gets error=10
Oct 05 11:35:31 lamachine kernel: md: super_written gets error=10
Oct 05 11:35:31 lamachine kernel: md: super_written gets error=10
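
A note on reading the log above: the Read(16) CDB bytes, the
blk_update_request sector numbers and the "ncq dma 688128" size all describe
the same requests. The Python sketch below decodes the first CDB from the log.
It assumes the standard READ(16) layout (opcode 0x88, 8-byte big-endian LBA in
bytes 2-9, 4-byte transfer length in bytes 10-13) and 512-byte logical
sectors. The function name decode_read16 is made up for illustration.

```python
def decode_read16(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, sector_count) for a SCSI READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is accepted
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")     # bytes 2-9: logical block address
    count = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, count


# CDB copied from the "[sdd] tag#12" entry in the log.
lba, count = decode_read16("88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00")
print(lba, count, count * 512)  # prints "1160501208 1344 688128"
```

The decoded LBA matches "sector 1160501208" and the byte count matches the
688128-byte NCQ reads, so these entries are the scrub's own reads being failed
after the ports were disabled, not a separate workload.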


* Re: do i need to give up on this setup
From: Reindl Harald @ 2020-10-05 13:17 UTC (permalink / raw)
  To: Daniel Sanabria, Linux-RAID



On 05.10.20 at 15:10, Daniel Sanabria wrote:
> Scrubbing ( # echo check >
> /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> 
> I'm attaching details of the array and disks (bloody wd greens) as
> well as journalctl errors providing some details about the issue.
> 
> If you have any pointers on what might be the cause of this as well as
> any recommendations on how to improve things please let me thank you
> in advance ...
> 
>        3       8       65        -      faulty   /dev/sde1
Why would you scrub an array when you have *clearly* lost a whole disk,
instead of first replacing that disk and rebuilding the array?
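
The point above (do not scrub a degraded array) can be turned into a small
guard. This is only a sketch: it assumes the md sysfs directory exposes a
"degraded" counter next to the "sync_action" file used in the original post,
and the function name start_check_if_healthy is hypothetical. The demo runs
against a fake directory so it never touches a real array.

```python
import tempfile
from pathlib import Path

# Directory taken from the original post; adjust for other arrays.
MD_DIR = Path("/sys/devices/virtual/block/md1/md")


def start_check_if_healthy(md_dir: Path) -> bool:
    """Request a 'check' scrub only when the array is not degraded."""
    # Assumption: 'degraded' holds the number of missing member devices.
    degraded = int((md_dir / "degraded").read_text().strip())
    if degraded > 0:
        return False  # replace the failed disk and rebuild first
    (md_dir / "sync_action").write_text("check\n")
    return True


if __name__ == "__main__":
    # Dry run against a fake sysfs directory with one missing member.
    with tempfile.TemporaryDirectory() as tmp:
        fake = Path(tmp)
        (fake / "degraded").write_text("1\n")
        (fake / "sync_action").write_text("idle\n")
        print(start_check_if_healthy(fake))
```

On the real machine this would be called as start_check_if_healthy(MD_DIR)
with root privileges.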




* Re: do i need to give up on this setup
From: Roman Mamedov @ 2020-10-05 13:44 UTC (permalink / raw)
  To: Daniel Sanabria; +Cc: Linux-RAID

On Mon, 5 Oct 2020 14:10:25 +0100
Daniel Sanabria <sanabria.d@gmail.com> wrote:

> Hi all,
> 
> Scrubbing ( # echo check >
> /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> 
> I'm attaching details of the array and disks (bloody wd greens) as
> well as journalctl errors providing some details about the issue.
> 
> If you have any pointers on what might be the cause of this as well as
> any recommendations on how to improve things please let me thank you
> in advance ...
> 
> I have backups of the data so happy to move this to a different setup
> you might recommend (apps will be mostly reading from the array via
> NFS since most of the content will be media).
> 
> My suspicion is that a timer service is kicking in and disrupting the
> scrubbing somehow but can't pinpoint what causes this.

It looks like a drive is dropping off the bus and then failing to re-identify.
This could be bad cabling, a bad controller or PSU, or just a bad drive. You
should post the "smartctl -a" output of all drives as well.
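
To see which ports are dropping and in what way, the journal can be
summarized per ATA port. The following Python sketch counts a few message
types that appear in the posted log; the function name summarize and the
counter labels are made up for illustration, and it only looks at the first
line of each (mail-wrapped) kernel message.

```python
import re
from collections import Counter, defaultdict

# Matches e.g. "kernel: ata8.00: failed command: ..." or "kernel: ata8: hard resetting link"
PORT_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def summarize(lines):
    """Count failed commands, link resets, IDENTIFY failures and disables per port."""
    summary = defaultdict(Counter)
    for line in lines:
        m = PORT_RE.search(line.strip())
        if not m:
            continue
        port, msg = m.groups()
        if msg.startswith("failed command"):
            summary[port]["failed_commands"] += 1
        elif msg.startswith("hard resetting link"):
            summary[port]["hard_resets"] += 1
        elif msg.startswith("failed to IDENTIFY"):
            summary[port]["identify_failures"] += 1
        elif msg == "disabled":
            summary[port]["disabled"] += 1
    return {port: dict(counts) for port, counts in summary.items()}


if __name__ == "__main__":
    sample = [
        "Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED",
        "Oct 05 11:34:39 lamachine kernel: ata8: hard resetting link",
        "Oct 05 11:34:46 lamachine kernel: ata7.00: failed to IDENTIFY (I/O",
        "Oct 05 11:35:29 lamachine kernel: ata7.00: disabled",
    ]
    print(summarize(sample))
```

If more than one port shows the same pattern at the same timestamps, that
points more towards something shared (controller, cabling, PSU) than towards
a single drive.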

-- 
With respect,
Roman


* Re: do i need to give up on this setup
From: Roman Mamedov @ 2020-10-05 14:04 UTC (permalink / raw)
  To: Daniel Sanabria, Linux-RAID

On Mon, 5 Oct 2020 14:59:35 +0100
Daniel Sanabria <sanabria.d@gmail.com> wrote:

> > It looks like a drive is dropping off the bus and then failing to reidentify,
> > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > "smartctl -a" of all drives as well.

I meant that you should post it to the mailing list, not to me personally.
The drives seem OK though, even sde.
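
One way to make "the drives seem OK" checkable is to flag a handful of SMART
attributes whose raw value is non-zero. The sketch below is an assumption
about which attributes are worth watching (IDs 5, 197, 198 and 199), not a
statement of what was actually looked at in this thread. It expects unwrapped
"smartctl -A" attribute lines with ten columns, and the function name
suspicious_attributes is hypothetical.

```python
# Attribute IDs chosen for illustration; names as printed by smartctl.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def suspicious_attributes(smart_text: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, raw = int(fields[0]), fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[fields[1]] = int(raw)
    return flagged


if __name__ == "__main__":
    # Two rows from the sdc output quoted in this message, unwrapped.
    sdc = (
        "  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0\n"
        "199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0\n"
    )
    print(suspicious_attributes(sdc))  # prints "{}"
```

An empty result for every drive, as with the sdc rows here, is consistent
with the problem being outside the disks themselves.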

> [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
> [sudo] password for dan:
> smartctl 6.6 2017-11-05 r4594
> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Green
> Device Model:     WDC WD30EZRX-00D8PB0
> Serial Number:    WD-WCC4NCWT13RF
> LU WWN Device Id: 5 0014ee 25fc9e460
> Firmware Version: 80.00A80
> User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    5400 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2 (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Mon Oct  5 14:58:34 2020 BST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (38940) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: (   2) minutes.
> Extended self-test routine
> recommended polling time: ( 391) minutes.
> Conveyance self-test routine
> recommended polling time: (   5) minutes.
> SCT capabilities:        (0x7035) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   178   165   021    Pre-fail  Always       -       6075
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18577
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
> 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
> 194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       28
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> 
> SMART Error Log Version: 1
> No Errors Logged
> 
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%     17479         -
> # 2  Short offline       Completed without error       00%     15531         -
> 
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 
> [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
> smartctl 6.6 2017-11-05 r4594
> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Green
> Device Model:     WDC WD30EZRX-00D8PB0
> Serial Number:    WD-WCC4NPRDD6D7
> LU WWN Device Id: 5 0014ee 25fca27b1
> Firmware Version: 80.00A80
> User Capacity:    3,000,592,982,016 bytes [3.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    5400 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2 (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Mon Oct  5 14:58:54 2020 BST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (39060) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: (   2) minutes.
> Extended self-test routine
> recommended polling time: ( 392) minutes.
> Conveyance self-test routine
> recommended polling time: (   5) minutes.
> SCT capabilities:        (0x7035) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail  Always       -       6100
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18580
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> 193 Load_Cycle_Count        0x0032   136   136   000    Old_age   Always       -       192427
> 194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       29
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> 
> SMART Error Log Version: 1
> No Errors Logged
> 
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%     17481         -
> # 2  Short offline       Completed without error       00%     15534         -
> 
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 
> [dan@lamachine ~]$ sudo smartctl -a /dev/sde
> smartctl 6.6 2017-11-05 r4594
> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Green
> Device Model:     WDC WD30EZRX-00D8PB0
> Serial Number:    WD-WCC4N1294906
> LU WWN Device Id: 5 0014ee 25f968120
> Firmware Version: 80.00A80
> User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    5400 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2 (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Mon Oct  5 14:58:57 2020 BST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (43200) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: (   2) minutes.
> Extended self-test routine
> recommended polling time: ( 433) minutes.
> Conveyance self-test routine
> recommended polling time: (   5) minutes.
> SCT capabilities:        (0x7035) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   176   166   021    Pre-fail  Always       -       6158
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       80
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18465
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       80
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       174015
> 194 Temperature_Celsius     0x0022   121   107   000    Old_age   Always       -       29
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> 
> SMART Error Log Version: 1
> No Errors Logged
> 
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%     17347         -
> # 2  Short offline       Completed without error       00%     15414         -
> 
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 
> [dan@lamachine ~]$
> 
> 
> On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote:
> >
> > On Mon, 5 Oct 2020 14:10:25 +0100
> > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Scrubbing ( # echo check >
> > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> > >
> > > I'm attaching details of the array and disks (bloody wd greens) as
> > > well as journalctl errors providing some details about the issue.
> > >
> > > If you have any pointers on what might be the cause of this as well as
> > > any recommendations on how to improve things please let me thank you
> > > in advance ...
> > >
> > > I have backups of the data so happy to move this to a different setup
> > > you might recommend (apps will be mostly reading from the array via
> > > NFS since most of the content will be media).
> > >
> > > My suspicion is that a timer service is kicking in and disrupting the
> > > scrubbing somehow but can't pinpoint what causes this.
> >
> > It looks like a drive is dropping off the bus and then failing to reidentify,
> > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > "smartctl -a" of all drives as well.
> >
> > --
> > With respect,
> > Roman


-- 
With respect,
Roman


* Re: do i need to give up on this setup
  2020-10-05 14:04     ` Roman Mamedov
@ 2020-10-05 14:10       ` Reindl Harald
  2020-10-05 14:28       ` Daniel Sanabria
  1 sibling, 0 replies; 16+ messages in thread
From: Reindl Harald @ 2020-10-05 14:10 UTC (permalink / raw)
  To: Roman Mamedov, Daniel Sanabria, Linux-RAID



On 05.10.20 at 16:04, Roman Mamedov wrote:
> On Mon, 5 Oct 2020 14:59:35 +0100
> Daniel Sanabria <sanabria.d@gmail.com> wrote:
> 
>>> It looks like a drive is dropping off the bus and then failing to reidentify,
>>> could be bad cabling/controller/PSU, or just a bad drive. You should post
>>> "smartctl -a" of all drives as well.
> 
> I meant not to me personally, but to the mailing list. The drives seem OK
> though, even sde.

you have a hardware problem and it's not uncommon, when one of your disks
is going crazy under load, that due to the reset of the controller a second
one on the same bus is also reset

either one of your disks is faulty, the controller is faulty or you have
an issue with your cables
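
One way to look for the "second disk on the same bus" pattern in a journal is to group kernel ATA messages by timestamp and list the seconds in which more than one port reported something. The sketch below is only an illustration: the helper names (`ports_by_second`, `shared_seconds`) and the regular expression are assumptions, and the sample lines are a small subset of the journal excerpt in this message.

```python
import re
from collections import defaultdict

# A few lines copied from the journal excerpt in this thread (subset only).
LOG = """\
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:35:29 lamachine kernel: ata7.00: qc timeout (cmd 0xec)
Oct 05 11:35:29 lamachine kernel: ata8.00: qc timeout (cmd 0xec)
Oct 05 11:35:29 lamachine kernel: ata8.00: disabled
Oct 05 11:35:29 lamachine kernel: ata7.00: disabled
"""

# Assumed journalctl short format: "Mon DD HH:MM:SS host kernel: ataN[.MM]: message"
LINE_RE = re.compile(r"^(\w{3} \d{2} [\d:]{8}) \S+ kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def ports_by_second(log_text):
    """Map each timestamp to the set of ATA ports that logged a message in that second."""
    seen = defaultdict(set)
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp, port, _message = match.groups()
            seen[stamp].add(port)
    return dict(seen)


def shared_seconds(log_text):
    """Return only the timestamps at which more than one port logged a message."""
    return {
        stamp: sorted(ports)
        for stamp, ports in ports_by_second(log_text).items()
        if len(ports) > 1
    }


if __name__ == "__main__":
    for stamp, ports in shared_seconds(LOG).items():
        print(stamp, ports)
```

On the sample lines this reports ata7 and ata8 together at 11:35:29. That fits a shared cause (controller, cabling, power) but does not prove one, since the sketch counts every ATA message and not only errors.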

40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:48:18:5f:2b/05:00:45:00:00/40 tag 9 ncq dma 688128 in
                                           res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:50:58:64:2b/05:00:45:00:00/40 tag 10 ncq dma 688128 in
                                           res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd 60/40:58:98:69:2b/05:00:45:00:00/40 tag 11 ncq dma 688128 in
                                           res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

Oct 05 11:35:29 lamachine kernel: ata7.00: qc timeout (cmd 0xec)
Oct 05 11:35:29 lamachine kernel: ata8.00: qc timeout (cmd 0xec)
Oct 05 11:35:29 lamachine kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 05 11:35:29 lamachine kernel: ata8.00: revalidation failed (errno=-5)
Oct 05 11:35:29 lamachine kernel: ata8.00: disabled
Oct 05 11:35:29 lamachine kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 05 11:35:29 lamachine kernel: ata7.00: revalidation failed (errno=-5)
Oct 05 11:35:29 lamachine kernel: ata7.00: disabled
Oct 05 11:35:30 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Oct 05 11:35:30 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Oct 05 11:35:31 lamachine kernel: ata8: EH complete
Oct 05 11:35:31 lamachine kernel: ata7: EH complete
Oct 05 11:35:31 lamachine kernel: scsi_io_completion_action: 115 callbacks suppressed
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#12 CDB: Read(16) 88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: print_req_error: 115 callbacks suppressed
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdd, sector 1160501208 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 0
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#18 CDB: Read(16) 88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev sdc, sector 1160501208 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
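
The two Read(16) CDBs in the log above are byte-for-byte identical for sdc and sdd. A minimal sketch for decoding them, assuming the standard READ(16) layout (opcode 0x88, bytes 2-9 the big-endian LBA, bytes 10-13 the transfer length in blocks) and the 512-byte logical sector size reported by smartctl. The function name `decode_read16` is made up for this illustration.

```python
def decode_read16(cdb_hex):
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes.

    Returns (lba, blocks). Raises ValueError if the input is not a
    16-byte CDB with opcode 0x88.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return lba, blocks


# CDB copied from the sdc and sdd log lines (identical on both disks).
CDB = "88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00"
LOGICAL_SECTOR_BYTES = 512  # "512 bytes logical" in the smartctl output

lba, blocks = decode_read16(CDB)
print(lba, blocks, blocks * LOGICAL_SECTOR_BYTES)  # 1160501208 1344 688128
```

The decoded LBA matches the "sector 1160501208" reported by blk_update_request, and 1344 blocks of 512 bytes is the same 688128-byte request size shown for the NCQ reads on ata8. Both member disks failed the same read at the same offset, as you would expect if the scrub was reading that region from both disks when they were disabled, so the failing LBA alone does not single out one disk's media.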




* Re: do i need to give up on this setup
  2020-10-05 14:04     ` Roman Mamedov
  2020-10-05 14:10       ` Reindl Harald
@ 2020-10-05 14:28       ` Daniel Sanabria
  2020-10-05 15:58         ` Roger Heflin
  1 sibling, 1 reply; 16+ messages in thread
From: Daniel Sanabria @ 2020-10-05 14:28 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Linux-RAID

> I meant not to me personally, but to the mailing list. The drives seem OK
> though, even sde.

Sorry, I missed the reply-all button.

On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@romanrm.net> wrote:
>
> On Mon, 5 Oct 2020 14:59:35 +0100
> Daniel Sanabria <sanabria.d@gmail.com> wrote:
>
> > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > "smartctl -a" of all drives as well.
>
> I meant not to me personally, but to the mailing list. The drives seem OK
> though, even sde.
>
> > [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
> > [sudo] password for dan:
> > smartctl 6.6 2017-11-05 r4594
> > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Model Family:     Western Digital Green
> > Device Model:     WDC WD30EZRX-00D8PB0
> > Serial Number:    WD-WCC4NCWT13RF
> > LU WWN Device Id: 5 0014ee 25fc9e460
> > Firmware Version: 80.00A80
> > User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > Rotation Rate:    5400 rpm
> > Device is:        In smartctl database [for details use: -P show]
> > ATA Version is:   ACS-2 (minor revision not indicated)
> > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > Local Time is:    Mon Oct  5 14:58:34 2020 BST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status:  (0x82) Offline data collection activity
> > was completed without error.
> > Auto Offline Data Collection: Enabled.
> > Self-test execution status:      (   0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: (38940) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities:            (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability:        (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: (   2) minutes.
> > Extended self-test routine
> > recommended polling time: ( 391) minutes.
> > Conveyance self-test routine
> > recommended polling time: (   5) minutes.
> > SCT capabilities:        (0x7035) SCT Status supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> >   3 Spin_Up_Time            0x0027   178   165   021    Pre-fail  Always       -       6075
> >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
> >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
> >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18577
> >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
> > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
> > 194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       28
> > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> >
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > SMART Self-test log structure revision number 1
> > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > # 1  Extended offline    Completed without error       00%     17479         -
> > # 2  Short offline       Completed without error       00%     15531         -
> >
> > SMART Selective self-test log data structure revision number 1
> >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> >     1        0        0  Not_testing
> >     2        0        0  Not_testing
> >     3        0        0  Not_testing
> >     4        0        0  Not_testing
> >     5        0        0  Not_testing
> > Selective self-test flags (0x0):
> >   After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
> > smartctl 6.6 2017-11-05 r4594
> > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Model Family:     Western Digital Green
> > Device Model:     WDC WD30EZRX-00D8PB0
> > Serial Number:    WD-WCC4NPRDD6D7
> > LU WWN Device Id: 5 0014ee 25fca27b1
> > Firmware Version: 80.00A80
> > User Capacity:    3,000,592,982,016 bytes [3.00 TB]
> > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > Rotation Rate:    5400 rpm
> > Device is:        In smartctl database [for details use: -P show]
> > ATA Version is:   ACS-2 (minor revision not indicated)
> > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > Local Time is:    Mon Oct  5 14:58:54 2020 BST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status:  (0x82) Offline data collection activity
> > was completed without error.
> > Auto Offline Data Collection: Enabled.
> > Self-test execution status:      (   0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: (39060) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities:            (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability:        (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: (   2) minutes.
> > Extended self-test routine
> > recommended polling time: ( 392) minutes.
> > Conveyance self-test routine
> > recommended polling time: (   5) minutes.
> > SCT capabilities:        (0x7035) SCT Status supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> >   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail  Always       -       6100
> >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
> >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
> >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18580
> >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> > 193 Load_Cycle_Count        0x0032   136   136   000    Old_age   Always       -       192427
> > 194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       29
> > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> >
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > SMART Self-test log structure revision number 1
> > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > # 1  Extended offline    Completed without error       00%     17481         -
> > # 2  Short offline       Completed without error       00%     15534         -
> >
> > SMART Selective self-test log data structure revision number 1
> >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> >     1        0        0  Not_testing
> >     2        0        0  Not_testing
> >     3        0        0  Not_testing
> >     4        0        0  Not_testing
> >     5        0        0  Not_testing
> > Selective self-test flags (0x0):
> >   After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > [dan@lamachine ~]$ sudo smartctl -a /dev/sde
> > smartctl 6.6 2017-11-05 r4594
> > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Model Family:     Western Digital Green
> > Device Model:     WDC WD30EZRX-00D8PB0
> > Serial Number:    WD-WCC4N1294906
> > LU WWN Device Id: 5 0014ee 25f968120
> > Firmware Version: 80.00A80
> > User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > Rotation Rate:    5400 rpm
> > Device is:        In smartctl database [for details use: -P show]
> > ATA Version is:   ACS-2 (minor revision not indicated)
> > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > Local Time is:    Mon Oct  5 14:58:57 2020 BST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status:  (0x82) Offline data collection activity
> > was completed without error.
> > Auto Offline Data Collection: Enabled.
> > Self-test execution status:      (   0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: (43200) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities:            (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability:        (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: (   2) minutes.
> > Extended self-test routine
> > recommended polling time: ( 433) minutes.
> > Conveyance self-test routine
> > recommended polling time: (   5) minutes.
> > SCT capabilities:        (0x7035) SCT Status supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> >   3 Spin_Up_Time            0x0027   176   166   021    Pre-fail  Always       -       6158
> >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       80
> >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> >   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
> >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18465
> >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       80
> > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       174015
> > 194 Temperature_Celsius     0x0022   121   107   000    Old_age   Always       -       29
> > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> >
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > SMART Self-test log structure revision number 1
> > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > # 1  Extended offline    Completed without error       00%     17347         -
> > # 2  Short offline       Completed without error       00%     15414         -
> >
> > SMART Selective self-test log data structure revision number 1
> >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> >     1        0        0  Not_testing
> >     2        0        0  Not_testing
> >     3        0        0  Not_testing
> >     4        0        0  Not_testing
> >     5        0        0  Not_testing
> > Selective self-test flags (0x0):
> >   After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > [dan@lamachine ~]$
> >
> >
> > On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote:
> > >
> > > On Mon, 5 Oct 2020 14:10:25 +0100
> > > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Scrubbing ( # echo check >
> > > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> > > >
> > > > I'm attaching details of the array and disks (bloody wd greens) as
> > > > well as journalctl errors providing some details about the issue.
> > > >
> > > > If you have any pointers on what might be the cause of this as well as
> > > > any recommendations on how to improve things please let me thank you
> > > > in advance ...
> > > >
> > > > I have backups of the data so happy to move this to a different setup
> > > > you might recommend (apps will be mostly reading from the array via
> > > > NFS since most of the content will be media).
> > > >
> > > > My suspicion is that a timer service is kicking in and disrupting the
> > > > scrubbing somehow but can't pinpoint what causes this.
> > >
> > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > "smartctl -a" of all drives as well.
> > >
> > > --
> > > With respect,
> > > Roman
>
>
> --
> With respect,
> Roman
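
The "drives seem OK" reading of the smartctl output above can be sketched
as a small check. This is only an illustration, assuming Python 3 and
unquoted, unwrapped `smartctl -A` text as input; the function names and
the WATCHED list of attributes are choices made for this example, not
anything smartmontools itself defines.

```python
# Sketch only: flag SMART attributes whose raw value is non-zero.
# WATCHED is an illustrative selection, not an official list.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def parse_attributes(text):
    """Return {id: (name, raw_value)} from unwrapped `smartctl -A` text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def suspicious(text):
    """List watched attributes that report a non-zero raw value."""
    found = []
    for attr_id, (name, raw) in sorted(parse_attributes(text).items()):
        if attr_id in WATCHED and raw.isdigit() and int(raw) != 0:
            found.append((attr_id, name, int(raw)))
    return found


# Three rows taken from the /dev/sdc output in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(suspicious(SAMPLE))  # → []
```

An empty list here only means none of the watched counters has moved. It
does not rule out the cabling, controller or PSU causes suggested earlier
in the thread.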


* Re: do i need to give up on this setup
  2020-10-05 14:28       ` Daniel Sanabria
@ 2020-10-05 15:58         ` Roger Heflin
  2020-10-06  7:56           ` Daniel Sanabria
  0 siblings, 1 reply; 16+ messages in thread
From: Roger Heflin @ 2020-10-05 15:58 UTC (permalink / raw)
  To: Daniel Sanabria; +Cc: Roman Mamedov, Linux-RAID

As the others said, you have a hardware problem.

It could be any of the things already mentioned, and it could also be the
power supply being unable to provide a stable 12V to the disks.

You should give the list more specifics on your hardware setup; of
particular interest are what kind of SATA/SAS ports you are using and how
the disks are cabled in.

Note that a number of controllers aren't very reliable, and some of
them, when something goes wrong, will stop responding for all disks
connected to them.

I have also seen badly designed motherboards whose built-in
(non-AMD/non-Intel chip) SATA ports don't work under any load that uses
more than a single disk at a time, and/or act badly when given SMART
commands.
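
One way to gather the port and controller details asked for above is to
resolve each disk's sysfs path and read the controller's PCI address and
the libata port out of it. The following is a rough sketch, assuming
Python 3 and the usual libata sysfs layout; `controller_of` and the
EXAMPLE path are hypothetical and not taken from the machine in this
thread.

```python
import re

# A PCI address such as 0000:00:1f.2 and a libata port such as ata3.
PCI_RE = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")
ATA_RE = re.compile(r"^ata\d+$")


def controller_of(sysfs_path):
    """Return (pci_address, ata_port) for a resolved /sys/block/<disk> path."""
    parts = sysfs_path.split("/")
    pci = [p for p in parts if PCI_RE.fullmatch(p)]
    ata = [p for p in parts if ATA_RE.match(p)]
    # The last PCI component is the controller itself; earlier ones are bridges.
    return (pci[-1] if pci else None, ata[0] if ata else None)


# Hypothetical resolved path, for illustration only.
EXAMPLE = "/sys/devices/pci0000:00/0000:00:1f.2/ata3/host2/target2:0:0/2:0:0:0/block/sdc"

print(controller_of(EXAMPLE))  # → ('0000:00:1f.2', 'ata3')
```

On a live system the input would come from something like
`os.path.realpath("/sys/block/sdc")`, and `lspci -s <pci_address>` then
names the chip behind that address. Disks that are not attached through
libata (USB, some SAS HBAs) may have no `ataN` component, in which case
the second value is None.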

On Mon, Oct 5, 2020 at 9:30 AM Daniel Sanabria <sanabria.d@gmail.com> wrote:
>
> > I meant not to me personally, but to the mailing list. The drives seem OK
> > though, even sde.
>
> Sorry missed the reply-all button
>
> On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@romanrm.net> wrote:
> >
> > On Mon, 5 Oct 2020 14:59:35 +0100
> > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> >
> > > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > > "smartctl -a" of all drives as well.
> >
> > I meant not to me personally, but to the mailing list. The drives seem OK
> > though, even sde.
> >
> > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
> > > [sudo] password for dan:
> > > smartctl 6.6 2017-11-05 r4594
> > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > >
> > > === START OF INFORMATION SECTION ===
> > > Model Family:     Western Digital Green
> > > Device Model:     WDC WD30EZRX-00D8PB0
> > > Serial Number:    WD-WCC4NCWT13RF
> > > LU WWN Device Id: 5 0014ee 25fc9e460
> > > Firmware Version: 80.00A80
> > > User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> > > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > > Rotation Rate:    5400 rpm
> > > Device is:        In smartctl database [for details use: -P show]
> > > ATA Version is:   ACS-2 (minor revision not indicated)
> > > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > Local Time is:    Mon Oct  5 14:58:34 2020 BST
> > > SMART support is: Available - device has SMART capability.
> > > SMART support is: Enabled
> > >
> > > === START OF READ SMART DATA SECTION ===
> > > SMART overall-health self-assessment test result: PASSED
> > >
> > > General SMART Values:
> > > Offline data collection status:  (0x82) Offline data collection activity
> > > was completed without error.
> > > Auto Offline Data Collection: Enabled.
> > > Self-test execution status:      (   0) The previous self-test routine completed
> > > without error or no self-test has ever
> > > been run.
> > > Total time to complete Offline
> > > data collection: (38940) seconds.
> > > Offline data collection
> > > capabilities: (0x7b) SMART execute Offline immediate.
> > > Auto Offline data collection on/off support.
> > > Suspend Offline collection upon new
> > > command.
> > > Offline surface scan supported.
> > > Self-test supported.
> > > Conveyance Self-test supported.
> > > Selective Self-test supported.
> > > SMART capabilities:            (0x0003) Saves SMART data before entering
> > > power-saving mode.
> > > Supports SMART auto save timer.
> > > Error logging capability:        (0x01) Error logging supported.
> > > General Purpose Logging supported.
> > > Short self-test routine
> > > recommended polling time: (   2) minutes.
> > > Extended self-test routine
> > > recommended polling time: ( 391) minutes.
> > > Conveyance self-test routine
> > > recommended polling time: (   5) minutes.
> > > SCT capabilities:        (0x7035) SCT Status supported.
> > > SCT Feature Control supported.
> > > SCT Data Table supported.
> > >
> > > SMART Attributes Data Structure revision number: 16
> > > Vendor Specific SMART Attributes with Thresholds:
> > > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> > >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> > >   3 Spin_Up_Time            0x0027   178   165   021    Pre-fail  Always       -       6075
> > >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
> > >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> > >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
> > >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18577
> > >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> > >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> > >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> > > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
> > > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
> > > 194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       28
> > > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> > >
> > > SMART Error Log Version: 1
> > > No Errors Logged
> > >
> > > SMART Self-test log structure revision number 1
> > > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > > # 1  Extended offline    Completed without error       00%     17479         -
> > > # 2  Short offline       Completed without error       00%     15531         -
> > >
> > > SMART Selective self-test log data structure revision number 1
> > >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> > >     1        0        0  Not_testing
> > >     2        0        0  Not_testing
> > >     3        0        0  Not_testing
> > >     4        0        0  Not_testing
> > >     5        0        0  Not_testing
> > > Selective self-test flags (0x0):
> > >   After scanning selected spans, do NOT read-scan remainder of disk.
> > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > >
> > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
> > > smartctl 6.6 2017-11-05 r4594
> > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > >
> > > === START OF INFORMATION SECTION ===
> > > Model Family:     Western Digital Green
> > > Device Model:     WDC WD30EZRX-00D8PB0
> > > Serial Number:    WD-WCC4NPRDD6D7
> > > LU WWN Device Id: 5 0014ee 25fca27b1
> > > Firmware Version: 80.00A80
> > > User Capacity:    3,000,592,982,016 bytes [3.00 TB]
> > > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > > Rotation Rate:    5400 rpm
> > > Device is:        In smartctl database [for details use: -P show]
> > > ATA Version is:   ACS-2 (minor revision not indicated)
> > > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > Local Time is:    Mon Oct  5 14:58:54 2020 BST
> > > SMART support is: Available - device has SMART capability.
> > > SMART support is: Enabled
> > >
> > > === START OF READ SMART DATA SECTION ===
> > > SMART overall-health self-assessment test result: PASSED
> > >
> > > General SMART Values:
> > > Offline data collection status:  (0x82) Offline data collection activity
> > > was completed without error.
> > > Auto Offline Data Collection: Enabled.
> > > Self-test execution status:      (   0) The previous self-test routine completed
> > > without error or no self-test has ever
> > > been run.
> > > Total time to complete Offline
> > > data collection: (39060) seconds.
> > > Offline data collection
> > > capabilities: (0x7b) SMART execute Offline immediate.
> > > Auto Offline data collection on/off support.
> > > Suspend Offline collection upon new
> > > command.
> > > Offline surface scan supported.
> > > Self-test supported.
> > > Conveyance Self-test supported.
> > > Selective Self-test supported.
> > > SMART capabilities:            (0x0003) Saves SMART data before entering
> > > power-saving mode.
> > > Supports SMART auto save timer.
> > > Error logging capability:        (0x01) Error logging supported.
> > > General Purpose Logging supported.
> > > Short self-test routine
> > > recommended polling time: (   2) minutes.
> > > Extended self-test routine
> > > recommended polling time: ( 392) minutes.
> > > Conveyance self-test routine
> > > recommended polling time: (   5) minutes.
> > > SCT capabilities:        (0x7035) SCT Status supported.
> > > SCT Feature Control supported.
> > > SCT Data Table supported.
> > >
> > > SMART Attributes Data Structure revision number: 16
> > > Vendor Specific SMART Attributes with Thresholds:
> > > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> > >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> > >   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail  Always       -       6100
> > >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
> > >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> > >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
> > >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18580
> > >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> > >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> > >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> > > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> > > 193 Load_Cycle_Count        0x0032   136   136   000    Old_age   Always       -       192427
> > > 194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       29
> > > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> > >
> > > SMART Error Log Version: 1
> > > No Errors Logged
> > >
> > > SMART Self-test log structure revision number 1
> > > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > > # 1  Extended offline    Completed without error       00%     17481         -
> > > # 2  Short offline       Completed without error       00%     15534         -
> > >
> > > SMART Selective self-test log data structure revision number 1
> > >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> > >     1        0        0  Not_testing
> > >     2        0        0  Not_testing
> > >     3        0        0  Not_testing
> > >     4        0        0  Not_testing
> > >     5        0        0  Not_testing
> > > Selective self-test flags (0x0):
> > >   After scanning selected spans, do NOT read-scan remainder of disk.
> > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > >
> > > [dan@lamachine ~]$ sudo smartctl -a /dev/sde
> > > smartctl 6.6 2017-11-05 r4594
> > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > >
> > > === START OF INFORMATION SECTION ===
> > > Model Family:     Western Digital Green
> > > Device Model:     WDC WD30EZRX-00D8PB0
> > > Serial Number:    WD-WCC4N1294906
> > > LU WWN Device Id: 5 0014ee 25f968120
> > > Firmware Version: 80.00A80
> > > User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> > > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > > Rotation Rate:    5400 rpm
> > > Device is:        In smartctl database [for details use: -P show]
> > > ATA Version is:   ACS-2 (minor revision not indicated)
> > > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > Local Time is:    Mon Oct  5 14:58:57 2020 BST
> > > SMART support is: Available - device has SMART capability.
> > > SMART support is: Enabled
> > >
> > > === START OF READ SMART DATA SECTION ===
> > > SMART overall-health self-assessment test result: PASSED
> > >
> > > General SMART Values:
> > > Offline data collection status:  (0x82) Offline data collection activity
> > > was completed without error.
> > > Auto Offline Data Collection: Enabled.
> > > Self-test execution status:      (   0) The previous self-test routine completed
> > > without error or no self-test has ever
> > > been run.
> > > Total time to complete Offline
> > > data collection: (43200) seconds.
> > > Offline data collection
> > > capabilities: (0x7b) SMART execute Offline immediate.
> > > Auto Offline data collection on/off support.
> > > Suspend Offline collection upon new
> > > command.
> > > Offline surface scan supported.
> > > Self-test supported.
> > > Conveyance Self-test supported.
> > > Selective Self-test supported.
> > > SMART capabilities:            (0x0003) Saves SMART data before entering
> > > power-saving mode.
> > > Supports SMART auto save timer.
> > > Error logging capability:        (0x01) Error logging supported.
> > > General Purpose Logging supported.
> > > Short self-test routine
> > > recommended polling time: (   2) minutes.
> > > Extended self-test routine
> > > recommended polling time: ( 433) minutes.
> > > Conveyance self-test routine
> > > recommended polling time: (   5) minutes.
> > > SCT capabilities:        (0x7035) SCT Status supported.
> > > SCT Feature Control supported.
> > > SCT Data Table supported.
> > >
> > > SMART Attributes Data Structure revision number: 16
> > > Vendor Specific SMART Attributes with Thresholds:
> > > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> > >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> > >   3 Spin_Up_Time            0x0027   176   166   021    Pre-fail  Always       -       6158
> > >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       80
> > >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> > >   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
> > >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18465
> > >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> > >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> > >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       80
> > > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> > > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       174015
> > > 194 Temperature_Celsius     0x0022   121   107   000    Old_age   Always       -       29
> > > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> > >
> > > SMART Error Log Version: 1
> > > No Errors Logged
> > >
> > > SMART Self-test log structure revision number 1
> > > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > > # 1  Extended offline    Completed without error       00%     17347         -
> > > # 2  Short offline       Completed without error       00%     15414         -
> > >
> > > SMART Selective self-test log data structure revision number 1
> > >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> > >     1        0        0  Not_testing
> > >     2        0        0  Not_testing
> > >     3        0        0  Not_testing
> > >     4        0        0  Not_testing
> > >     5        0        0  Not_testing
> > > Selective self-test flags (0x0):
> > >   After scanning selected spans, do NOT read-scan remainder of disk.
> > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > >
> > > [dan@lamachine ~]$
> > >
> > >
> > > On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote:
> > > >
> > > > On Mon, 5 Oct 2020 14:10:25 +0100
> > > > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Scrubbing ( # echo check >
> > > > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> > > > >
> > > > > I'm attaching details of the array and disks (bloody wd greens) as
> > > > > well as journalctl errors providing some details about the issue.
> > > > >
> > > > > If you have any pointers on what might be the cause of this as well as
> > > > > any recommendations on how to improve things please let me thank you
> > > > > in advance ...
> > > > >
> > > > > I have backups of the data so happy to move this to a different setup
> > > > > you might recommend (apps will be mostly reading from the array via
> > > > > NFS since most of the content will be media).
> > > > >
> > > > > My suspicion is that a timer service is kicking in and disrupting the
> > > > > scrubbing somehow but can't pinpoint what causes this.
> > > >
> > > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > > "smartctl -a" of all drives as well.
> > > >
> > > > --
> > > > With respect,
> > > > Roman
> >
> >
> > --
> > With respect,
> > Roman


* Re: do i need to give up on this setup
  2020-10-05 15:58         ` Roger Heflin
@ 2020-10-06  7:56           ` Daniel Sanabria
  2020-10-06  8:24             ` Reindl Harald
  2020-10-06 10:53             ` Roger Heflin
  0 siblings, 2 replies; 16+ messages in thread
From: Daniel Sanabria @ 2020-10-06  7:56 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Roman Mamedov, Linux-RAID

Yeah it is quite possible that the hardware can't support the setup
because this array was an afterthought and considered an upgrade to
the system.

For the record here are some more details about the setup:

Motherboard: ASROCK EP2C602-4L/D16
PSU: 850W Corsair RM Series
I have 6 drives connected to the motherboard. The 3 drives forming the
array are Western Digital Green WDC WD30EZRX-00D8PB0 and are connected
to 3 ports of the Marvell 9230 SATA controller. The other 3 drives are
Western Digital Caviar Blue (SATA) WDC WD5000AAKS-00A7B2, connected
to the spare Marvell port and the 2 SATA/SAS Motherboard ports
available.

On Mon, 5 Oct 2020 at 16:58, Roger Heflin <rogerheflin@gmail.com> wrote:
>
> what they said you have a hardware problem.
>
> it could be about anything previously mentioned and could also be the
> power supply being unable to provide a stable 12V for the disks.
>
> You should provide the list more specifics on your hw setup, of
> interest are what kind of SATA/SAS ports you are using and how the
> disk are cabled in.
>
> Note that there are a number of controllers that aren't the most
> reliable and some of those controllers when something happens will
> stop responding for all disks connected to it.
>
> I have also seen badly designed motherboards have
> build-in(non-AMD/non-Intel chips) sata ports that don't work under any
> load that uses more than a single disk at a time, and/or acts badly
> when given smart commands.
>
> On Mon, Oct 5, 2020 at 9:30 AM Daniel Sanabria <sanabria.d@gmail.com> wrote:
> >
> > > I meant not to me personally, but to the mailing list. The drives seem OK
> > > though, even sde.
> >
> > Sorry missed the reply-all button
> >
> > On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@romanrm.net> wrote:
> > >
> > > On Mon, 5 Oct 2020 14:59:35 +0100
> > > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> > >
> > > > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > > > "smartctl -a" of all drives as well.
> > >
> > > I meant not to me personally, but to the mailing list. The drives seem OK
> > > though, even sde.
> > >
> > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
> > > > [sudo] password for dan:
> > > > smartctl 6.6 2017-11-05 r4594
> > > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > > >
> > > > === START OF INFORMATION SECTION ===
> > > > Model Family:     Western Digital Green
> > > > Device Model:     WDC WD30EZRX-00D8PB0
> > > > Serial Number:    WD-WCC4NCWT13RF
> > > > LU WWN Device Id: 5 0014ee 25fc9e460
> > > > Firmware Version: 80.00A80
> > > > User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> > > > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > > > Rotation Rate:    5400 rpm
> > > > Device is:        In smartctl database [for details use: -P show]
> > > > ATA Version is:   ACS-2 (minor revision not indicated)
> > > > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > > Local Time is:    Mon Oct  5 14:58:34 2020 BST
> > > > SMART support is: Available - device has SMART capability.
> > > > SMART support is: Enabled
> > > >
> > > > === START OF READ SMART DATA SECTION ===
> > > > SMART overall-health self-assessment test result: PASSED
> > > >
> > > > General SMART Values:
> > > > Offline data collection status:  (0x82) Offline data collection activity
> > > > was completed without error.
> > > > Auto Offline Data Collection: Enabled.
> > > > Self-test execution status:      (   0) The previous self-test routine completed
> > > > without error or no self-test has ever
> > > > been run.
> > > > Total time to complete Offline
> > > > data collection: (38940) seconds.
> > > > Offline data collection
> > > > capabilities: (0x7b) SMART execute Offline immediate.
> > > > Auto Offline data collection on/off support.
> > > > Suspend Offline collection upon new
> > > > command.
> > > > Offline surface scan supported.
> > > > Self-test supported.
> > > > Conveyance Self-test supported.
> > > > Selective Self-test supported.
> > > > SMART capabilities:            (0x0003) Saves SMART data before entering
> > > > power-saving mode.
> > > > Supports SMART auto save timer.
> > > > Error logging capability:        (0x01) Error logging supported.
> > > > General Purpose Logging supported.
> > > > Short self-test routine
> > > > recommended polling time: (   2) minutes.
> > > > Extended self-test routine
> > > > recommended polling time: ( 391) minutes.
> > > > Conveyance self-test routine
> > > > recommended polling time: (   5) minutes.
> > > > SCT capabilities:        (0x7035) SCT Status supported.
> > > > SCT Feature Control supported.
> > > > SCT Data Table supported.
> > > >
> > > > SMART Attributes Data Structure revision number: 16
> > > > Vendor Specific SMART Attributes with Thresholds:
> > > > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> > > >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> > > >   3 Spin_Up_Time            0x0027   178   165   021    Pre-fail  Always       -       6075
> > > >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
> > > >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> > > >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
> > > >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18577
> > > >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> > > >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> > > >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> > > > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
> > > > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
> > > > 194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       28
> > > > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > > > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > > > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > > > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > > > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> > > >
> > > > SMART Error Log Version: 1
> > > > No Errors Logged
> > > >
> > > > SMART Self-test log structure revision number 1
> > > > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > > > # 1  Extended offline    Completed without error       00%     17479         -
> > > > # 2  Short offline       Completed without error       00%     15531         -
> > > >
> > > > SMART Selective self-test log data structure revision number 1
> > > >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> > > >     1        0        0  Not_testing
> > > >     2        0        0  Not_testing
> > > >     3        0        0  Not_testing
> > > >     4        0        0  Not_testing
> > > >     5        0        0  Not_testing
> > > > Selective self-test flags (0x0):
> > > >   After scanning selected spans, do NOT read-scan remainder of disk.
> > > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > > >
> > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
> > > > smartctl 6.6 2017-11-05 r4594
> > > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > > >
> > > > === START OF INFORMATION SECTION ===
> > > > Model Family:     Western Digital Green
> > > > Device Model:     WDC WD30EZRX-00D8PB0
> > > > Serial Number:    WD-WCC4NPRDD6D7
> > > > LU WWN Device Id: 5 0014ee 25fca27b1
> > > > Firmware Version: 80.00A80
> > > > User Capacity:    3,000,592,982,016 bytes [3.00 TB]
> > > > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > > > Rotation Rate:    5400 rpm
> > > > Device is:        In smartctl database [for details use: -P show]
> > > > ATA Version is:   ACS-2 (minor revision not indicated)
> > > > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > > Local Time is:    Mon Oct  5 14:58:54 2020 BST
> > > > SMART support is: Available - device has SMART capability.
> > > > SMART support is: Enabled
> > > >
> > > > === START OF READ SMART DATA SECTION ===
> > > > SMART overall-health self-assessment test result: PASSED
> > > >
> > > > General SMART Values:
> > > > Offline data collection status:  (0x82) Offline data collection activity
> > > > was completed without error.
> > > > Auto Offline Data Collection: Enabled.
> > > > Self-test execution status:      (   0) The previous self-test routine completed
> > > > without error or no self-test has ever
> > > > been run.
> > > > Total time to complete Offline
> > > > data collection: (39060) seconds.
> > > > Offline data collection
> > > > capabilities: (0x7b) SMART execute Offline immediate.
> > > > Auto Offline data collection on/off support.
> > > > Suspend Offline collection upon new
> > > > command.
> > > > Offline surface scan supported.
> > > > Self-test supported.
> > > > Conveyance Self-test supported.
> > > > Selective Self-test supported.
> > > > SMART capabilities:            (0x0003) Saves SMART data before entering
> > > > power-saving mode.
> > > > Supports SMART auto save timer.
> > > > Error logging capability:        (0x01) Error logging supported.
> > > > General Purpose Logging supported.
> > > > Short self-test routine
> > > > recommended polling time: (   2) minutes.
> > > > Extended self-test routine
> > > > recommended polling time: ( 392) minutes.
> > > > Conveyance self-test routine
> > > > recommended polling time: (   5) minutes.
> > > > SCT capabilities:        (0x7035) SCT Status supported.
> > > > SCT Feature Control supported.
> > > > SCT Data Table supported.
> > > >
> > > > SMART Attributes Data Structure revision number: 16
> > > > Vendor Specific SMART Attributes with Thresholds:
> > > > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> > > >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> > > >   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail  Always       -       6100
> > > >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
> > > >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> > > >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
> > > >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18580
> > > >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> > > >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> > > >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> > > > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> > > > 193 Load_Cycle_Count        0x0032   136   136   000    Old_age   Always       -       192427
> > > > 194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       29
> > > > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > > > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > > > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > > > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > > > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> > > >
> > > > SMART Error Log Version: 1
> > > > No Errors Logged
> > > >
> > > > SMART Self-test log structure revision number 1
> > > > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > > > # 1  Extended offline    Completed without error       00%     17481         -
> > > > # 2  Short offline       Completed without error       00%     15534         -
> > > >
> > > > SMART Selective self-test log data structure revision number 1
> > > >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> > > >     1        0        0  Not_testing
> > > >     2        0        0  Not_testing
> > > >     3        0        0  Not_testing
> > > >     4        0        0  Not_testing
> > > >     5        0        0  Not_testing
> > > > Selective self-test flags (0x0):
> > > >   After scanning selected spans, do NOT read-scan remainder of disk.
> > > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > > >
> > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sde
> > > > smartctl 6.6 2017-11-05 r4594
> > > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > > >
> > > > === START OF INFORMATION SECTION ===
> > > > Model Family:     Western Digital Green
> > > > Device Model:     WDC WD30EZRX-00D8PB0
> > > > Serial Number:    WD-WCC4N1294906
> > > > LU WWN Device Id: 5 0014ee 25f968120
> > > > Firmware Version: 80.00A80
> > > > User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> > > > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > > > Rotation Rate:    5400 rpm
> > > > Device is:        In smartctl database [for details use: -P show]
> > > > ATA Version is:   ACS-2 (minor revision not indicated)
> > > > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > > Local Time is:    Mon Oct  5 14:58:57 2020 BST
> > > > SMART support is: Available - device has SMART capability.
> > > > SMART support is: Enabled
> > > >
> > > > === START OF READ SMART DATA SECTION ===
> > > > SMART overall-health self-assessment test result: PASSED
> > > >
> > > > General SMART Values:
> > > > Offline data collection status:  (0x82) Offline data collection activity
> > > > was completed without error.
> > > > Auto Offline Data Collection: Enabled.
> > > > Self-test execution status:      (   0) The previous self-test routine completed
> > > > without error or no self-test has ever
> > > > been run.
> > > > Total time to complete Offline
> > > > data collection: (43200) seconds.
> > > > Offline data collection
> > > > capabilities: (0x7b) SMART execute Offline immediate.
> > > > Auto Offline data collection on/off support.
> > > > Suspend Offline collection upon new
> > > > command.
> > > > Offline surface scan supported.
> > > > Self-test supported.
> > > > Conveyance Self-test supported.
> > > > Selective Self-test supported.
> > > > SMART capabilities:            (0x0003) Saves SMART data before entering
> > > > power-saving mode.
> > > > Supports SMART auto save timer.
> > > > Error logging capability:        (0x01) Error logging supported.
> > > > General Purpose Logging supported.
> > > > Short self-test routine
> > > > recommended polling time: (   2) minutes.
> > > > Extended self-test routine
> > > > recommended polling time: ( 433) minutes.
> > > > Conveyance self-test routine
> > > > recommended polling time: (   5) minutes.
> > > > SCT capabilities:        (0x7035) SCT Status supported.
> > > > SCT Feature Control supported.
> > > > SCT Data Table supported.
> > > >
> > > > SMART Attributes Data Structure revision number: 16
> > > > Vendor Specific SMART Attributes with Thresholds:
> > > > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> > > >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> > > >   3 Spin_Up_Time            0x0027   176   166   021    Pre-fail  Always       -       6158
> > > >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       80
> > > >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> > > >   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
> > > >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18465
> > > >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> > > >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> > > >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       80
> > > > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> > > > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       174015
> > > > 194 Temperature_Celsius     0x0022   121   107   000    Old_age   Always       -       29
> > > > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > > > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > > > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > > > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > > > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> > > >
> > > > SMART Error Log Version: 1
> > > > No Errors Logged
> > > >
> > > > SMART Self-test log structure revision number 1
> > > > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > > > # 1  Extended offline    Completed without error       00%     17347         -
> > > > # 2  Short offline       Completed without error       00%     15414         -
> > > >
> > > > SMART Selective self-test log data structure revision number 1
> > > >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> > > >     1        0        0  Not_testing
> > > >     2        0        0  Not_testing
> > > >     3        0        0  Not_testing
> > > >     4        0        0  Not_testing
> > > >     5        0        0  Not_testing
> > > > Selective self-test flags (0x0):
> > > >   After scanning selected spans, do NOT read-scan remainder of disk.
> > > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > > >
> > > > [dan@lamachine ~]$
> > > >
> > > >
> > > > On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote:
> > > > >
> > > > > On Mon, 5 Oct 2020 14:10:25 +0100
> > > > > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Scrubbing ( # echo check >
> > > > > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> > > > > >
> > > > > > I'm attaching details of the array and disks (bloody wd greens) as
> > > > > > well as journalctl errors providing some details about the issue.
> > > > > >
> > > > > > If you have any pointers on what might be the cause of this as well as
> > > > > > any recommendations on how to improve things please let me thank you
> > > > > > in advance ...
> > > > > >
> > > > > > I have backups of the data so happy to move this to a different setup
> > > > > > you might recommend (apps will be mostly reading from the array via
> > > > > > NFS since most of the content will be media).
> > > > > >
> > > > > > My suspicion is that a timer service is kicking in and disrupting the
> > > > > > scrubbing somehow but can't pinpoint what causes this.
> > > > >
> > > > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > > > "smartctl -a" of all drives as well.
> > > > >
> > > > > --
> > > > > With respect,
> > > > > Roman
> > >
> > >
> > > --
> > > With respect,
> > > Roman


* Re: do i need to give up on this setup
  2020-10-06  7:56           ` Daniel Sanabria
@ 2020-10-06  8:24             ` Reindl Harald
  2020-10-06 10:53             ` Roger Heflin
  1 sibling, 0 replies; 16+ messages in thread
From: Reindl Harald @ 2020-10-06  8:24 UTC (permalink / raw)
  To: Daniel Sanabria, Roger Heflin; +Cc: Roman Mamedov, Linux-RAID



On 06.10.20 at 09:56, Daniel Sanabria wrote:
> Yeah it is quite possible that the hardware can't support the setup
> because this array was an afterthought and considered an upgrade to
> the system.
> 
> For the record here are some more details about the setup:
> 
> Motherboard: ASROCK EP2C602-4L/D16
> PSU: 850W Corsair RM Series
> I have 6 drives connected to the motherboard. The 3 drives forming the
> array are Western Digital Green WDC WD30EZRX-00D8PB0 and are connected
> to 3 ports of the Marvell 9230 SATA controller. The other 3 drives are
> Western Digital Caviar Blue (SATA) WDC WD5000AAKS-00A7B2 are connected
> to the spare Marvell port and the 2 SATA/SAS Motherboard ports
> available.

yeah, unreliable desktop disks, and without at least increasing the
timeouts - currently the only WD disks suitable for a RAID setup are the
"WD Gold", after they lost their minds and started shipping SMR on "WD Red"

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
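
A rough way to tell which case applies to a given disk is to look for the
SCT Error Recovery Control line in the smartctl capabilities output. The
dumps quoted in this thread list only SCT Status, Feature Control and Data
Table, which suggests these Greens have no configurable ERC and that raising
the kernel timeout is the remaining option. The sketch below is only an
illustration: the helper name has_scterc is made up, it inspects text on
stdin, and it assumes smartctl reports the capability with the string
"SCT Error Recovery Control supported".

```sh
#!/bin/sh
# has_scterc (hypothetical helper): succeeds if the smartctl text on stdin
# advertises SCT Error Recovery Control. Assumption: capable drives show the
# line "SCT Error Recovery Control supported" in the capabilities section.
has_scterc() {
    grep -q 'SCT Error Recovery Control supported'
}

# Example use (needs root and smartmontools), left commented out on purpose:
# if smartctl -c /dev/sdc | has_scterc; then
#     echo "sdc: ERC available, could try: smartctl -l scterc,70,70 /dev/sdc"
# else
#     echo "sdc: no ERC, raise /sys/block/sdc/device/timeout instead"
# fi
```

Reading from stdin keeps the check usable on saved output, such as the
smartctl dumps pasted into a mail.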

-------------------------
[root@srv-rhsoft:~]$ cat /etc/systemd/system/disk-timeout.service
[Unit]
Description=SCSI Timeouts

[Service]
Type=oneshot
ExecStart=/usr/local/bin/disk-timeout.sh

[Install]
WantedBy=multi-user.target

-------------------------

[root@srv-rhsoft:~]$ cat /usr/local/bin/disk-timeout.sh
#!/usr/bin/dash

echo 180 > "/sys/block/sda/device/timeout"
echo 180 > "/sys/block/sdb/device/timeout"
echo 180 > "/sys/block/sdc/device/timeout"
echo 180 > "/sys/block/sdd/device/timeout"

-------------------------
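
The script above lists the four disks of that particular machine. On the
setup discussed in this thread the array members are sdc, sdd and sde, so a
loop over whatever sd* devices exist may be easier to carry over. A minimal
sketch, assuming the same sysfs layout as above; the function name
set_disk_timeouts and its optional second argument (an alternate directory,
there only so the loop can be tried on a scratch tree) are illustrative and
not part of the original script.

```sh
#!/bin/sh
# set_disk_timeouts SECONDS [BLOCK_DIR]  (hypothetical helper)
# Writes SECONDS into device/timeout of every sd* entry under BLOCK_DIR
# (default: /sys/block). Entries without a writable timeout file are skipped.
set_disk_timeouts() {
    timeout_s=$1
    block_dir=${2:-/sys/block}
    for f in "$block_dir"/sd*/device/timeout; do
        # An unmatched glob stays literal and fails this check, so an empty
        # directory is handled without error.
        [ -w "$f" ] || continue
        echo "$timeout_s" > "$f"
    done
}
```

Calling "set_disk_timeouts 180" as root at the end of the script would then
behave like the four echo lines above while also covering any further sd*
disks.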

> On Mon, 5 Oct 2020 at 16:58, Roger Heflin <rogerheflin@gmail.com> wrote:
>>
>> what they said you have a hardware problem.
>>
>> it could be about anything previously mentioned and could also be the
>> power supply being unable to provide a stable 12V for the disks.
>>
>> You should provide the list more specifics on your hw setup, of
>> interest are what kind of SATA/SAS ports you are using and how the
>> disk are cabled in.
>>
>> Note that there are a number of controllers that aren't the most
>> reliable and some of those controllers when something happens will
>> stop responding for all disks connected to it.
>>
>> I have also seen badly designed motherboards have
>> build-in(non-AMD/non-Intel chips) sata ports that don't work under any
>> load that uses more than a single disk at a time, and/or acts badly
>> when given smart commands.
>>
>> On Mon, Oct 5, 2020 at 9:30 AM Daniel Sanabria <sanabria.d@gmail.com> wrote:
>>>
>>>> I meant not to me personally, but to the mailing list. The drives seem OK
>>>> though, even sde.
>>>
>>> Sorry missed the reply-all button
>>>
>>> On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@romanrm.net> wrote:
>>>>
>>>> On Mon, 5 Oct 2020 14:59:35 +0100
>>>> Daniel Sanabria <sanabria.d@gmail.com> wrote:
>>>>
>>>>>> It looks like a drive is dropping off the bus and then failing to reidentify,
>>>>>> could be bad cabling/controller/PSU, or just a bad drive. You should post
>>>>>> "smartctl -a" of all drives as well.
>>>>
>>>> I meant not to me personally, but to the mailing list. The drives seem OK
>>>> though, even sde.
>>>>
>>>>> [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
>>>>> [sudo] password for dan:
>>>>> smartctl 6.6 2017-11-05 r4594
>>>>> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
>>>>> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>>>>>
>>>>> === START OF INFORMATION SECTION ===
>>>>> Model Family:     Western Digital Green
>>>>> Device Model:     WDC WD30EZRX-00D8PB0
>>>>> Serial Number:    WD-WCC4NCWT13RF
>>>>> LU WWN Device Id: 5 0014ee 25fc9e460
>>>>> Firmware Version: 80.00A80
>>>>> User Capacity:    3,000,591,900,160 bytes [3.00 TB]
>>>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>>>> Rotation Rate:    5400 rpm
>>>>> Device is:        In smartctl database [for details use: -P show]
>>>>> ATA Version is:   ACS-2 (minor revision not indicated)
>>>>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>>>>> Local Time is:    Mon Oct  5 14:58:34 2020 BST
>>>>> SMART support is: Available - device has SMART capability.
>>>>> SMART support is: Enabled
>>>>>
>>>>> === START OF READ SMART DATA SECTION ===
>>>>> SMART overall-health self-assessment test result: PASSED
>>>>>
>>>>> General SMART Values:
>>>>> Offline data collection status:  (0x82) Offline data collection activity
>>>>> was completed without error.
>>>>> Auto Offline Data Collection: Enabled.
>>>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>>> without error or no self-test has ever
>>>>> been run.
>>>>> Total time to complete Offline
>>>>> data collection: (38940) seconds.
>>>>> Offline data collection
>>>>> capabilities: (0x7b) SMART execute Offline immediate.
>>>>> Auto Offline data collection on/off support.
>>>>> Suspend Offline collection upon new
>>>>> command.
>>>>> Offline surface scan supported.
>>>>> Self-test supported.
>>>>> Conveyance Self-test supported.
>>>>> Selective Self-test supported.
>>>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>>> power-saving mode.
>>>>> Supports SMART auto save timer.
>>>>> Error logging capability:        (0x01) Error logging supported.
>>>>> General Purpose Logging supported.
>>>>> Short self-test routine
>>>>> recommended polling time: (   2) minutes.
>>>>> Extended self-test routine
>>>>> recommended polling time: ( 391) minutes.
>>>>> Conveyance self-test routine
>>>>> recommended polling time: (   5) minutes.
>>>>> SCT capabilities:        (0x7035) SCT Status supported.
>>>>> SCT Feature Control supported.
>>>>> SCT Data Table supported.
>>>>>
>>>>> SMART Attributes Data Structure revision number: 16
>>>>> Vendor Specific SMART Attributes with Thresholds:
>>>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>>>>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>>>>>   3 Spin_Up_Time            0x0027   178   165   021    Pre-fail  Always       -       6075
>>>>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
>>>>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>>>>>   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
>>>>>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18577
>>>>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>>>>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>>>>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
>>>>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
>>>>> 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
>>>>> 194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       28
>>>>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
>>>>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
>>>>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
>>>>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
>>>>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>>>>>
>>>>> SMART Error Log Version: 1
>>>>> No Errors Logged
>>>>>
>>>>> SMART Self-test log structure revision number 1
>>>>> Num  Test_Description    Status                  Remaining
>>>>> LifeTime(hours)  LBA_of_first_error
>>>>> # 1  Extended offline    Completed without error       00%     17479         -
>>>>> # 2  Short offline       Completed without error       00%     15531         -
>>>>>
>>>>> SMART Selective self-test log data structure revision number 1
>>>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>>>>     1        0        0  Not_testing
>>>>>     2        0        0  Not_testing
>>>>>     3        0        0  Not_testing
>>>>>     4        0        0  Not_testing
>>>>>     5        0        0  Not_testing
>>>>> Selective self-test flags (0x0):
>>>>>   After scanning selected spans, do NOT read-scan remainder of disk.
>>>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>>>>
>>>>> [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
>>>>> smartctl 6.6 2017-11-05 r4594
>>>>> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
>>>>> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>>>>>
>>>>> === START OF INFORMATION SECTION ===
>>>>> Model Family:     Western Digital Green
>>>>> Device Model:     WDC WD30EZRX-00D8PB0
>>>>> Serial Number:    WD-WCC4NPRDD6D7
>>>>> LU WWN Device Id: 5 0014ee 25fca27b1
>>>>> Firmware Version: 80.00A80
>>>>> User Capacity:    3,000,592,982,016 bytes [3.00 TB]
>>>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>>>> Rotation Rate:    5400 rpm
>>>>> Device is:        In smartctl database [for details use: -P show]
>>>>> ATA Version is:   ACS-2 (minor revision not indicated)
>>>>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>>>>> Local Time is:    Mon Oct  5 14:58:54 2020 BST
>>>>> SMART support is: Available - device has SMART capability.
>>>>> SMART support is: Enabled
>>>>>
>>>>> === START OF READ SMART DATA SECTION ===
>>>>> SMART overall-health self-assessment test result: PASSED
>>>>>
>>>>> General SMART Values:
>>>>> Offline data collection status:  (0x82) Offline data collection activity
>>>>> was completed without error.
>>>>> Auto Offline Data Collection: Enabled.
>>>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>>> without error or no self-test has ever
>>>>> been run.
>>>>> Total time to complete Offline
>>>>> data collection: (39060) seconds.
>>>>> Offline data collection
>>>>> capabilities: (0x7b) SMART execute Offline immediate.
>>>>> Auto Offline data collection on/off support.
>>>>> Suspend Offline collection upon new
>>>>> command.
>>>>> Offline surface scan supported.
>>>>> Self-test supported.
>>>>> Conveyance Self-test supported.
>>>>> Selective Self-test supported.
>>>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>>> power-saving mode.
>>>>> Supports SMART auto save timer.
>>>>> Error logging capability:        (0x01) Error logging supported.
>>>>> General Purpose Logging supported.
>>>>> Short self-test routine
>>>>> recommended polling time: (   2) minutes.
>>>>> Extended self-test routine
>>>>> recommended polling time: ( 392) minutes.
>>>>> Conveyance self-test routine
>>>>> recommended polling time: (   5) minutes.
>>>>> SCT capabilities:        (0x7035) SCT Status supported.
>>>>> SCT Feature Control supported.
>>>>> SCT Data Table supported.
>>>>>
>>>>> SMART Attributes Data Structure revision number: 16
>>>>> Vendor Specific SMART Attributes with Thresholds:
>>>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
>>>>> UPDATED  WHEN_FAILED RAW_VALUE
>>>>>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail
>>>>> Always       -       0
>>>>>   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail
>>>>> Always       -       6100
>>>>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age
>>>>> Always       -       81
>>>>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail
>>>>> Always       -       0
>>>>>   7 Seek_Error_Rate         0x002e   100   253   000    Old_age
>>>>> Always       -       0
>>>>>   9 Power_On_Hours          0x0032   075   075   000    Old_age
>>>>> Always       -       18580
>>>>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age
>>>>> Always       -       0
>>>>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age
>>>>> Always       -       0
>>>>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age
>>>>> Always       -       81
>>>>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age
>>>>> Always       -       53
>>>>> 193 Load_Cycle_Count        0x0032   136   136   000    Old_age
>>>>> Always       -       192427
>>>>> 194 Temperature_Celsius     0x0022   121   108   000    Old_age
>>>>> Always       -       29
>>>>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age
>>>>> Always       -       0
>>>>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
>>>>> Always       -       0
>>>>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age
>>>>> Offline      -       0
>>>>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age
>>>>> Always       -       0
>>>>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age
>>>>> Offline      -       0
>>>>>
>>>>> SMART Error Log Version: 1
>>>>> No Errors Logged
>>>>>
>>>>> SMART Self-test log structure revision number 1
>>>>> Num  Test_Description    Status                  Remaining
>>>>> LifeTime(hours)  LBA_of_first_error
>>>>> # 1  Extended offline    Completed without error       00%     17481         -
>>>>> # 2  Short offline       Completed without error       00%     15534         -
>>>>>
>>>>> SMART Selective self-test log data structure revision number 1
>>>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>>>>     1        0        0  Not_testing
>>>>>     2        0        0  Not_testing
>>>>>     3        0        0  Not_testing
>>>>>     4        0        0  Not_testing
>>>>>     5        0        0  Not_testing
>>>>> Selective self-test flags (0x0):
>>>>>   After scanning selected spans, do NOT read-scan remainder of disk.
>>>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>>>>
>>>>> [dan@lamachine ~]$ sudo smartctl -a /dev/sde
>>>>> smartctl 6.6 2017-11-05 r4594
>>>>> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
>>>>> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>>>>>
>>>>> === START OF INFORMATION SECTION ===
>>>>> Model Family:     Western Digital Green
>>>>> Device Model:     WDC WD30EZRX-00D8PB0
>>>>> Serial Number:    WD-WCC4N1294906
>>>>> LU WWN Device Id: 5 0014ee 25f968120
>>>>> Firmware Version: 80.00A80
>>>>> User Capacity:    3,000,591,900,160 bytes [3.00 TB]
>>>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>>>> Rotation Rate:    5400 rpm
>>>>> Device is:        In smartctl database [for details use: -P show]
>>>>> ATA Version is:   ACS-2 (minor revision not indicated)
>>>>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>>>>> Local Time is:    Mon Oct  5 14:58:57 2020 BST
>>>>> SMART support is: Available - device has SMART capability.
>>>>> SMART support is: Enabled
>>>>>
>>>>> === START OF READ SMART DATA SECTION ===
>>>>> SMART overall-health self-assessment test result: PASSED
>>>>>
>>>>> General SMART Values:
>>>>> Offline data collection status:  (0x82) Offline data collection activity
>>>>> was completed without error.
>>>>> Auto Offline Data Collection: Enabled.
>>>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>>> without error or no self-test has ever
>>>>> been run.
>>>>> Total time to complete Offline
>>>>> data collection: (43200) seconds.
>>>>> Offline data collection
>>>>> capabilities: (0x7b) SMART execute Offline immediate.
>>>>> Auto Offline data collection on/off support.
>>>>> Suspend Offline collection upon new
>>>>> command.
>>>>> Offline surface scan supported.
>>>>> Self-test supported.
>>>>> Conveyance Self-test supported.
>>>>> Selective Self-test supported.
>>>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>>> power-saving mode.
>>>>> Supports SMART auto save timer.
>>>>> Error logging capability:        (0x01) Error logging supported.
>>>>> General Purpose Logging supported.
>>>>> Short self-test routine
>>>>> recommended polling time: (   2) minutes.
>>>>> Extended self-test routine
>>>>> recommended polling time: ( 433) minutes.
>>>>> Conveyance self-test routine
>>>>> recommended polling time: (   5) minutes.
>>>>> SCT capabilities:        (0x7035) SCT Status supported.
>>>>> SCT Feature Control supported.
>>>>> SCT Data Table supported.
>>>>>
>>>>> SMART Attributes Data Structure revision number: 16
>>>>> Vendor Specific SMART Attributes with Thresholds:
>>>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
>>>>> UPDATED  WHEN_FAILED RAW_VALUE
>>>>>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail
>>>>> Always       -       0
>>>>>   3 Spin_Up_Time            0x0027   176   166   021    Pre-fail
>>>>> Always       -       6158
>>>>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age
>>>>> Always       -       80
>>>>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail
>>>>> Always       -       0
>>>>>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age
>>>>> Always       -       0
>>>>>   9 Power_On_Hours          0x0032   075   075   000    Old_age
>>>>> Always       -       18465
>>>>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age
>>>>> Always       -       0
>>>>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age
>>>>> Always       -       0
>>>>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age
>>>>> Always       -       80
>>>>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age
>>>>> Always       -       53
>>>>> 193 Load_Cycle_Count        0x0032   142   142   000    Old_age
>>>>> Always       -       174015
>>>>> 194 Temperature_Celsius     0x0022   121   107   000    Old_age
>>>>> Always       -       29
>>>>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age
>>>>> Always       -       0
>>>>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
>>>>> Always       -       0
>>>>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age
>>>>> Offline      -       0
>>>>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age
>>>>> Always       -       0
>>>>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age
>>>>> Offline      -       0
>>>>>
>>>>> SMART Error Log Version: 1
>>>>> No Errors Logged
>>>>>
>>>>> SMART Self-test log structure revision number 1
>>>>> Num  Test_Description    Status                  Remaining
>>>>> LifeTime(hours)  LBA_of_first_error
>>>>> # 1  Extended offline    Completed without error       00%     17347         -
>>>>> # 2  Short offline       Completed without error       00%     15414         -
>>>>>
>>>>> SMART Selective self-test log data structure revision number 1
>>>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>>>>     1        0        0  Not_testing
>>>>>     2        0        0  Not_testing
>>>>>     3        0        0  Not_testing
>>>>>     4        0        0  Not_testing
>>>>>     5        0        0  Not_testing
>>>>> Selective self-test flags (0x0):
>>>>>   After scanning selected spans, do NOT read-scan remainder of disk.
>>>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>>>>
>>>>> [dan@lamachine ~]$
>>>>>
>>>>>
>>>>> On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote:
>>>>>>
>>>>>> On Mon, 5 Oct 2020 14:10:25 +0100
>>>>>> Daniel Sanabria <sanabria.d@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Scrubbing ( # echo check >
>>>>>>> /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
>>>>>>>
>>>>>>> I'm attaching details of the array and disks (bloody wd greens) as
>>>>>>> well as journalctl errors providing some details about the issue.
>>>>>>>
>>>>>>> If you have any pointers on what might be the cause of this as well as
>>>>>>> any recommendations on how to improve things please let me thank you
>>>>>>> in advance ...
>>>>>>>
>>>>>>> I have backups of the data so happy to move this to a different setup
>>>>>>> you might recommend (apps will be mostly reading from the array via
>>>>>>> NFS since most of the content will be media).
>>>>>>>
>>>>>>> My suspicion is that a timer service is kicking in and disrupting the
>>>>>>> scrubbing somehow but can't pinpoint what causes this.
>>>>>>
>>>>>> It looks like a drive is dropping off the bus and then failing to reidentify,
>>>>>> could be bad cabling/controller/PSU, or just a bad drive. You should post
>>>>>> "smartctl -a" of all drives as well.



* Re: do i need to give up on this setup
  2020-10-06  7:56           ` Daniel Sanabria
  2020-10-06  8:24             ` Reindl Harald
@ 2020-10-06 10:53             ` Roger Heflin
  2020-10-06 11:29               ` antlists
  2020-10-06 15:03               ` Tim Small
  1 sibling, 2 replies; 16+ messages in thread
From: Roger Heflin @ 2020-10-06 10:53 UTC (permalink / raw)
  To: Daniel Sanabria; +Cc: Roman Mamedov, Linux-RAID

Ok.  That is kind of what I was fearing based on the behavior. I know
first hand that the Marvell 9230 SATA card is a POS.

When I was running it I noted that if you ran SMART commands against
it, it would go offline quicker.  If you don't run SMART commands at
all then it is more stable, but it will still have serious issues
sometimes during raid syncs and/or rebuilds.  When the card has an
issue all of the ports seem to stop responding to commands.  I am
guessing the firmware on the card somehow crashes or gets into some
sort of endless loop.  I reported it to Marvell; they blamed the OS's
AHCI drivers, even though the AHCI drivers worked perfectly fine with
the built-in AMD ports.  Gotta love engineers and support people that
have absolutely no idea what they are doing, nor how to validate that
a design works.  I have been burned by Marvell cards enough that I
will not buy any Marvell product, as I know they have zero idea how to
validate designs.

I moved to a used LSI SAS card (4 internal ports and 4 external ports;
it needs the non-raid BIOS installed to be a dumb card).  Outside of
the enterprise type cards, I have yet to find a stable PCIe card, and
in one of my backup machines I still use a SATA Sil (old PCI) card:
while it is slow (all 4 ports are limited to a total of about 120MB/s,
and in practice it is about 90MB/s), it does consistently work right.
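
If it helps to confirm which controller each disk actually hangs off
before recabling, something like the rough sketch below should do it.
It assumes the usual sysfs layout, where /sys/block/sdX resolves to a
path containing the PCI address of the controller, and it assumes
lspci (pciutils) is installed.  The function names are made up for
illustration.

```shell
#!/bin/sh
# Print the PCI address of the controller a resolved sysfs path hangs off.
# Input is a path like the output of: readlink -f /sys/block/sdc
pci_addr_of_path() {
    # PCI addresses look like 0000:03:00.0. The last one in the path is
    # the controller itself; earlier ones are bridges in front of it.
    printf '%s\n' "$1" \
        | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' \
        | tail -n 1
}

# Print "disk <TAB> controller description" for every sd* block device.
list_disk_controllers() {
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue
        addr=$(pci_addr_of_path "$(readlink -f "$dev")")
        [ -n "$addr" ] || continue
        printf '%s\t%s\n' "${dev##*/}" "$(lspci -s "$addr" 2>/dev/null)"
    done
}
```

Running list_disk_controllers after sourcing this should show at a
glance which of sdc/sdd/sde sit on the Marvell ports and which sit on
the chipset ports.  It only reads sysfs and sends no SMART commands to
the disks.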


On Tue, Oct 6, 2020 at 2:56 AM Daniel Sanabria <sanabria.d@gmail.com> wrote:
>
> Yeah it is quite possible that the hardware can't support the setup
> because this array was an afterthought and considered an upgrade to
> the system.
>
> For the record here are some more details about the setup:
>
> Motherboard: ASROCK EP2C602-4L/D16
> PSU: 850W Corsair RM Series
> I have 6 drives connected to the motherboard. The 3 drives forming the
> array are Western Digital Green WDC WD30EZRX-00D8PB0 and are connected
> to 3 ports of the Marvell 9230 SATA controller. The other 3 drives are
> Western Digital Caviar Blue (SATA) WDC WD5000AAKS-00A7B2 are connected
> to the spare Marvell port and the 2 SATA/SAS Motherboard ports
> available.
>
> On Mon, 5 Oct 2020 at 16:58, Roger Heflin <rogerheflin@gmail.com> wrote:
> >
> > What they said: you have a hardware problem.
> >
> > it could be about anything previously mentioned and could also be the
> > power supply being unable to provide a stable 12V for the disks.
> >
> > You should provide the list more specifics on your hw setup, of
> > interest are what kind of SATA/SAS ports you are using and how the
> > disk are cabled in.
> >
> > Note that there are a number of controllers that aren't the most
> > reliable and some of those controllers when something happens will
> > stop responding for all disks connected to it.
> >
> > I have also seen badly designed motherboards have
> > built-in (non-AMD/non-Intel chip) SATA ports that don't work under any
> > load that uses more than a single disk at a time, and/or act badly
> > when given SMART commands.
> >
> > On Mon, Oct 5, 2020 at 9:30 AM Daniel Sanabria <sanabria.d@gmail.com> wrote:
> > >
> > > > I meant not to me personally, but to the mailing list. The drives seem OK
> > > > though, even sde.
> > >
> > > Sorry missed the reply-all button
> > >
> > > On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@romanrm.net> wrote:
> > > >
> > > > On Mon, 5 Oct 2020 14:59:35 +0100
> > > > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> > > >
> > > > > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > > > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > > > > "smartctl -a" of all drives as well.
> > > >
> > > > I meant not to me personally, but to the mailing list. The drives seem OK
> > > > though, even sde.
> > > >
> > > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
> > > > > [sudo] password for dan:
> > > > > smartctl 6.6 2017-11-05 r4594
> > > > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > > > >
> > > > > === START OF INFORMATION SECTION ===
> > > > > Model Family:     Western Digital Green
> > > > > Device Model:     WDC WD30EZRX-00D8PB0
> > > > > Serial Number:    WD-WCC4NCWT13RF
> > > > > LU WWN Device Id: 5 0014ee 25fc9e460
> > > > > Firmware Version: 80.00A80
> > > > > User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> > > > > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > > > > Rotation Rate:    5400 rpm
> > > > > Device is:        In smartctl database [for details use: -P show]
> > > > > ATA Version is:   ACS-2 (minor revision not indicated)
> > > > > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > > > Local Time is:    Mon Oct  5 14:58:34 2020 BST
> > > > > SMART support is: Available - device has SMART capability.
> > > > > SMART support is: Enabled
> > > > >
> > > > > === START OF READ SMART DATA SECTION ===
> > > > > SMART overall-health self-assessment test result: PASSED
> > > > >
> > > > > General SMART Values:
> > > > > Offline data collection status:  (0x82) Offline data collection activity
> > > > > was completed without error.
> > > > > Auto Offline Data Collection: Enabled.
> > > > > Self-test execution status:      (   0) The previous self-test routine completed
> > > > > without error or no self-test has ever
> > > > > been run.
> > > > > Total time to complete Offline
> > > > > data collection: (38940) seconds.
> > > > > Offline data collection
> > > > > capabilities: (0x7b) SMART execute Offline immediate.
> > > > > Auto Offline data collection on/off support.
> > > > > Suspend Offline collection upon new
> > > > > command.
> > > > > Offline surface scan supported.
> > > > > Self-test supported.
> > > > > Conveyance Self-test supported.
> > > > > Selective Self-test supported.
> > > > > SMART capabilities:            (0x0003) Saves SMART data before entering
> > > > > power-saving mode.
> > > > > Supports SMART auto save timer.
> > > > > Error logging capability:        (0x01) Error logging supported.
> > > > > General Purpose Logging supported.
> > > > > Short self-test routine
> > > > > recommended polling time: (   2) minutes.
> > > > > Extended self-test routine
> > > > > recommended polling time: ( 391) minutes.
> > > > > Conveyance self-test routine
> > > > > recommended polling time: (   5) minutes.
> > > > > SCT capabilities:        (0x7035) SCT Status supported.
> > > > > SCT Feature Control supported.
> > > > > SCT Data Table supported.
> > > > >
> > > > > SMART Attributes Data Structure revision number: 16
> > > > > Vendor Specific SMART Attributes with Thresholds:
> > > > > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
> > > > > UPDATED  WHEN_FAILED RAW_VALUE
> > > > >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail
> > > > > Always       -       0
> > > > >   3 Spin_Up_Time            0x0027   178   165   021    Pre-fail
> > > > > Always       -       6075
> > > > >   4 Start_Stop_Count        0x0032   100   100   000    Old_age
> > > > > Always       -       81
> > > > >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail
> > > > > Always       -       0
> > > > >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age
> > > > > Always       -       0
> > > > >   9 Power_On_Hours          0x0032   075   075   000    Old_age
> > > > > Always       -       18577
> > > > >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age
> > > > > Always       -       0
> > > > >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age
> > > > > Always       -       0
> > > > >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age
> > > > > Always       -       81
> > > > > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age
> > > > > Always       -       46
> > > > > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age
> > > > > Always       -       176661
> > > > > 194 Temperature_Celsius     0x0022   122   109   000    Old_age
> > > > > Always       -       28
> > > > > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age
> > > > > Always       -       0
> > > > > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
> > > > > Always       -       0
> > > > > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age
> > > > > Offline      -       0
> > > > > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age
> > > > > Always       -       0
> > > > > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age
> > > > > Offline      -       0
> > > > >
> > > > > SMART Error Log Version: 1
> > > > > No Errors Logged
> > > > >
> > > > > SMART Self-test log structure revision number 1
> > > > > Num  Test_Description    Status                  Remaining
> > > > > LifeTime(hours)  LBA_of_first_error
> > > > > # 1  Extended offline    Completed without error       00%     17479         -
> > > > > # 2  Short offline       Completed without error       00%     15531         -
> > > > >
> > > > > SMART Selective self-test log data structure revision number 1
> > > > >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> > > > >     1        0        0  Not_testing
> > > > >     2        0        0  Not_testing
> > > > >     3        0        0  Not_testing
> > > > >     4        0        0  Not_testing
> > > > >     5        0        0  Not_testing
> > > > > Selective self-test flags (0x0):
> > > > >   After scanning selected spans, do NOT read-scan remainder of disk.
> > > > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > > > >
> > > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
> > > > > smartctl 6.6 2017-11-05 r4594
> > > > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > > > >
> > > > > === START OF INFORMATION SECTION ===
> > > > > Model Family:     Western Digital Green
> > > > > Device Model:     WDC WD30EZRX-00D8PB0
> > > > > Serial Number:    WD-WCC4NPRDD6D7
> > > > > LU WWN Device Id: 5 0014ee 25fca27b1
> > > > > Firmware Version: 80.00A80
> > > > > User Capacity:    3,000,592,982,016 bytes [3.00 TB]
> > > > > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > > > > Rotation Rate:    5400 rpm
> > > > > Device is:        In smartctl database [for details use: -P show]
> > > > > ATA Version is:   ACS-2 (minor revision not indicated)
> > > > > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > > > Local Time is:    Mon Oct  5 14:58:54 2020 BST
> > > > > SMART support is: Available - device has SMART capability.
> > > > > SMART support is: Enabled
> > > > >
> > > > > === START OF READ SMART DATA SECTION ===
> > > > > SMART overall-health self-assessment test result: PASSED
> > > > >
> > > > > General SMART Values:
> > > > > Offline data collection status:  (0x82) Offline data collection activity
> > > > > was completed without error.
> > > > > Auto Offline Data Collection: Enabled.
> > > > > Self-test execution status:      (   0) The previous self-test routine completed
> > > > > without error or no self-test has ever
> > > > > been run.
> > > > > Total time to complete Offline
> > > > > data collection: (39060) seconds.
> > > > > Offline data collection
> > > > > capabilities: (0x7b) SMART execute Offline immediate.
> > > > > Auto Offline data collection on/off support.
> > > > > Suspend Offline collection upon new
> > > > > command.
> > > > > Offline surface scan supported.
> > > > > Self-test supported.
> > > > > Conveyance Self-test supported.
> > > > > Selective Self-test supported.
> > > > > SMART capabilities:            (0x0003) Saves SMART data before entering
> > > > > power-saving mode.
> > > > > Supports SMART auto save timer.
> > > > > Error logging capability:        (0x01) Error logging supported.
> > > > > General Purpose Logging supported.
> > > > > Short self-test routine
> > > > > recommended polling time: (   2) minutes.
> > > > > Extended self-test routine
> > > > > recommended polling time: ( 392) minutes.
> > > > > Conveyance self-test routine
> > > > > recommended polling time: (   5) minutes.
> > > > > SCT capabilities:        (0x7035) SCT Status supported.
> > > > > SCT Feature Control supported.
> > > > > SCT Data Table supported.
> > > > >
> > > > > SMART Attributes Data Structure revision number: 16
> > > > > Vendor Specific SMART Attributes with Thresholds:
> > > > > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
> > > > > UPDATED  WHEN_FAILED RAW_VALUE
> > > > >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail
> > > > > Always       -       0
> > > > >   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail
> > > > > Always       -       6100
> > > > >   4 Start_Stop_Count        0x0032   100   100   000    Old_age
> > > > > Always       -       81
> > > > >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail
> > > > > Always       -       0
> > > > >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age
> > > > > Always       -       0
> > > > >   9 Power_On_Hours          0x0032   075   075   000    Old_age
> > > > > Always       -       18580
> > > > >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age
> > > > > Always       -       0
> > > > >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age
> > > > > Always       -       0
> > > > >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age
> > > > > Always       -       81
> > > > > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age
> > > > > Always       -       53
> > > > > 193 Load_Cycle_Count        0x0032   136   136   000    Old_age
> > > > > Always       -       192427
> > > > > 194 Temperature_Celsius     0x0022   121   108   000    Old_age
> > > > > Always       -       29
> > > > > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age
> > > > > Always       -       0
> > > > > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
> > > > > Always       -       0
> > > > > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age
> > > > > Offline      -       0
> > > > > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age
> > > > > Always       -       0
> > > > > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age
> > > > > Offline      -       0
> > > > >
> > > > > SMART Error Log Version: 1
> > > > > No Errors Logged
> > > > >
> > > > > SMART Self-test log structure revision number 1
> > > > > Num  Test_Description    Status                  Remaining
> > > > > LifeTime(hours)  LBA_of_first_error
> > > > > # 1  Extended offline    Completed without error       00%     17481         -
> > > > > # 2  Short offline       Completed without error       00%     15534         -
> > > > >
> > > > > SMART Selective self-test log data structure revision number 1
> > > > >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> > > > >     1        0        0  Not_testing
> > > > >     2        0        0  Not_testing
> > > > >     3        0        0  Not_testing
> > > > >     4        0        0  Not_testing
> > > > >     5        0        0  Not_testing
> > > > > Selective self-test flags (0x0):
> > > > >   After scanning selected spans, do NOT read-scan remainder of disk.
> > > > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > > > >
> > > > > [dan@lamachine ~]$ sudo smartctl -a /dev/sde
> > > > > smartctl 6.6 2017-11-05 r4594
> > > > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > > > >
> > > > > === START OF INFORMATION SECTION ===
> > > > > Model Family:     Western Digital Green
> > > > > Device Model:     WDC WD30EZRX-00D8PB0
> > > > > Serial Number:    WD-WCC4N1294906
> > > > > LU WWN Device Id: 5 0014ee 25f968120
> > > > > Firmware Version: 80.00A80
> > > > > User Capacity:    3,000,591,900,160 bytes [3.00 TB]
> > > > > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > > > > Rotation Rate:    5400 rpm
> > > > > Device is:        In smartctl database [for details use: -P show]
> > > > > ATA Version is:   ACS-2 (minor revision not indicated)
> > > > > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > > > Local Time is:    Mon Oct  5 14:58:57 2020 BST
> > > > > SMART support is: Available - device has SMART capability.
> > > > > SMART support is: Enabled
> > > > >
> > > > > === START OF READ SMART DATA SECTION ===
> > > > > SMART overall-health self-assessment test result: PASSED
> > > > >
> > > > > General SMART Values:
> > > > > Offline data collection status:  (0x82) Offline data collection activity
> > > > > was completed without error.
> > > > > Auto Offline Data Collection: Enabled.
> > > > > Self-test execution status:      (   0) The previous self-test routine completed
> > > > > without error or no self-test has ever
> > > > > been run.
> > > > > Total time to complete Offline
> > > > > data collection: (43200) seconds.
> > > > > Offline data collection
> > > > > capabilities: (0x7b) SMART execute Offline immediate.
> > > > > Auto Offline data collection on/off support.
> > > > > Suspend Offline collection upon new
> > > > > command.
> > > > > Offline surface scan supported.
> > > > > Self-test supported.
> > > > > Conveyance Self-test supported.
> > > > > Selective Self-test supported.
> > > > > SMART capabilities:            (0x0003) Saves SMART data before entering
> > > > > power-saving mode.
> > > > > Supports SMART auto save timer.
> > > > > Error logging capability:        (0x01) Error logging supported.
> > > > > General Purpose Logging supported.
> > > > > Short self-test routine
> > > > > recommended polling time: (   2) minutes.
> > > > > Extended self-test routine
> > > > > recommended polling time: ( 433) minutes.
> > > > > Conveyance self-test routine
> > > > > recommended polling time: (   5) minutes.
> > > > > SCT capabilities:        (0x7035) SCT Status supported.
> > > > > SCT Feature Control supported.
> > > > > SCT Data Table supported.
> > > > >
> > > > > SMART Attributes Data Structure revision number: 16
> > > > > Vendor Specific SMART Attributes with Thresholds:
> > > > > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
> > > > > UPDATED  WHEN_FAILED RAW_VALUE
> > > > >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail
> > > > > Always       -       0
> > > > >   3 Spin_Up_Time            0x0027   176   166   021    Pre-fail
> > > > > Always       -       6158
> > > > >   4 Start_Stop_Count        0x0032   100   100   000    Old_age
> > > > > Always       -       80
> > > > >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail
> > > > > Always       -       0
> > > > >   7 Seek_Error_Rate         0x002e   200   200   000    Old_age
> > > > > Always       -       0
> > > > >   9 Power_On_Hours          0x0032   075   075   000    Old_age
> > > > > Always       -       18465
> > > > >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age
> > > > > Always       -       0
> > > > >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age
> > > > > Always       -       0
> > > > >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age
> > > > > Always       -       80
> > > > > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age
> > > > > Always       -       53
> > > > > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age
> > > > > Always       -       174015
> > > > > 194 Temperature_Celsius     0x0022   121   107   000    Old_age
> > > > > Always       -       29
> > > > > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age
> > > > > Always       -       0
> > > > > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
> > > > > Always       -       0
> > > > > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age
> > > > > Offline      -       0
> > > > > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age
> > > > > Always       -       0
> > > > > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age
> > > > > Offline      -       0
> > > > >
> > > > > SMART Error Log Version: 1
> > > > > No Errors Logged
> > > > >
> > > > > SMART Self-test log structure revision number 1
> > > > > Num  Test_Description    Status                  Remaining
> > > > > LifeTime(hours)  LBA_of_first_error
> > > > > # 1  Extended offline    Completed without error       00%     17347         -
> > > > > # 2  Short offline       Completed without error       00%     15414         -
> > > > >
> > > > > SMART Selective self-test log data structure revision number 1
> > > > >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> > > > >     1        0        0  Not_testing
> > > > >     2        0        0  Not_testing
> > > > >     3        0        0  Not_testing
> > > > >     4        0        0  Not_testing
> > > > >     5        0        0  Not_testing
> > > > > Selective self-test flags (0x0):
> > > > >   After scanning selected spans, do NOT read-scan remainder of disk.
> > > > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > > > >
> > > > > [dan@lamachine ~]$
> > > > >
> > > > >
> > > > > On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote:
> > > > > >
> > > > > > On Mon, 5 Oct 2020 14:10:25 +0100
> > > > > > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Scrubbing ( # echo check >
> > > > > > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> > > > > > >
> > > > > > > I'm attaching details of the array and disks (bloody wd greens) as
> > > > > > > well as journalctl errors providing some details about the issue.
> > > > > > >
> > > > > > > If you have any pointers on what might be the cause of this as well as
> > > > > > > any recommendations on how to improve things please let me thank you
> > > > > > > in advance ...
> > > > > > >
> > > > > > > I have backups of the data so happy to move this to a different setup
> > > > > > > you might recommend (apps will be mostly reading from the array via
> > > > > > > NFS since most of the content will be media).
> > > > > > >
> > > > > > > My suspicion is that a timer service is kicking in and disrupting the
> > > > > > > scrubbing somehow but can't pinpoint what causes this.
> > > > > >
> > > > > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > > > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > > > > "smartctl -a" of all drives as well.
> > > > > >
> > > > > > --
> > > > > > With respect,
> > > > > > Roman
> > > >
> > > >
> > > > --
> > > > With respect,
> > > > Roman

* Re: do i need to give up on this setup
  2020-10-06 10:53             ` Roger Heflin
@ 2020-10-06 11:29               ` antlists
  2020-10-06 14:59                 ` Roger Heflin
  2020-10-06 15:03               ` Tim Small
  1 sibling, 1 reply; 16+ messages in thread
From: antlists @ 2020-10-06 11:29 UTC (permalink / raw)
  To: Roger Heflin, Daniel Sanabria; +Cc: Roman Mamedov, Linux-RAID

On 06/10/2020 11:53, Roger Heflin wrote:
> When the card has an
> issue all of the ports seem to stop responding to commands.  I am
> guessing the firmware on the card somehow crashes or gets into some
> sort of endless loop.  I reported it to Marvell, they blamed the OS's
> AHCI drivers, even though the AHCI drivers worked perfectly fine with
> the built in AMD ports.

So we've got the crap drives on the crap controllers ... would it make 
any difference if you put the Greens on the motherboard, and the Caviars 
on the Marvell? Caviars I believe are good quality drives that might 
take enough load off the Marvell to enable it to work sort-of okay ...

Oh - and replace the Greens pretty soon - I don't know how they compare
against other drives quality-wise, but they are optimised in ways that
are an anti-pattern for RAID.
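
As a rough illustration (this reading is an assumption, it is not spelled out in the thread): Greens are known for parking their heads very aggressively, and the smartctl output quoted earlier for the WD30EZRX shows Load_Cycle_Count 174015 against Power_On_Hours 18465. A small sketch of the arithmetic, where cycles_per_hour is just a hypothetical helper name:

```shell
#!/bin/sh
# Values copied from the quoted "smartctl -a" output for the WD30EZRX.
load_cycles=174015     # attribute 193, Load_Cycle_Count
power_on_hours=18465   # attribute 9, Power_On_Hours

# cycles_per_hour CYCLES HOURS: average head load cycles per powered-on hour.
cycles_per_hour() {
    awk -v c="$1" -v h="$2" 'BEGIN { printf "%.1f\n", c / h }'
}

cycles_per_hour "$load_cycles" "$power_on_hours"   # prints 9.4
```

That is roughly nine to ten head parks for every hour the drive has been powered on, which fits a drive tuned for idle desktop use rather than for an array that is meant to stay busy.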

Cheers,
Wol

* Re: do i need to give up on this setup
  2020-10-06 11:29               ` antlists
@ 2020-10-06 14:59                 ` Roger Heflin
  2020-10-09  1:00                   ` John Stoffel
  0 siblings, 1 reply; 16+ messages in thread
From: Roger Heflin @ 2020-10-06 14:59 UTC (permalink / raw)
  To: antlists; +Cc: Daniel Sanabria, Roman Mamedov, Linux-RAID

The controller is crap, and is expected to have serious issues no
matter what drives are on the controller.

Given the Greens don't seem to have any reallocated blocks, I am
guessing the controller is 90%+ of the problem, and right now may be
all of the problem.  If you lose all of the disks on the Marvell
controller at roughly the same time, that is the controller bug and
not a disk issue.  It also does not seem to be triggered by a disk
issue: the controller just seems to have a race condition, so when
multiple operations are being done at the same time it "crashes"
and stops responding to all drives on that controller.
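
To check whether the drives that vanish together really do share the Marvell controller, one option is to map each disk to the PCI device it hangs off. This is only a minimal sketch: it assumes the disks show up as sd* under /sys/block, that the resolved sysfs path contains PCI addresses in the usual dddd:bb:dd.f form, and that grep supports -o; pci_of_path is a hypothetical helper name.

```shell
#!/bin/sh
# pci_of_path PATH: print the last PCI address (dddd:bb:dd.f) found in a
# resolved sysfs path, i.e. the device closest to the disk itself.
pci_of_path() {
    printf '%s\n' "$1" \
        | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' \
        | tail -n 1
}

# List every sd* disk next to the PCI address of its controller.
for d in /sys/block/sd*; do
    [ -e "$d" ] || continue          # glob did not match anything
    printf '%s %s\n' "${d##*/}" "$(pci_of_path "$(readlink -f "$d")")"
done
```

Feeding one of the printed addresses to "lspci -s" should then show which controller it is; if all the disks that dropped share one address, that points at the controller rather than at any single drive.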

On Tue, Oct 6, 2020 at 6:29 AM antlists <antlists@youngman.org.uk> wrote:
>
> On 06/10/2020 11:53, Roger Heflin wrote:
> > When the card has an
> > issue all of the ports seem to stop responding to commands.  I am
> > guessing the firmware on the card somehow crashes or gets into some
> > sort of endless loop.  I reported it to Marvell, they blamed the OS's
> > AHCI drivers, even though the AHCI drivers worked perfectly fine with
> > the built in AMD ports.
>
> So we've got the crap drives on the crap controllers ... would it make
> any difference if you put the Greens on the motherboard, and the Caviars
> on the Marvell? Caviars I believe are good quality drives that might
> take enough load off the Marvell to enable it to work sort-of okay ...
>
> Oh - and replace the Greens pretty soon - I don't know how they compare
> against other drives quality-wise, but they are optimised in ways that
> are an anti-pattern for RAID.
>
> Cheers,
> Wol

* Re: do i need to give up on this setup
  2020-10-06 10:53             ` Roger Heflin
  2020-10-06 11:29               ` antlists
@ 2020-10-06 15:03               ` Tim Small
  2020-10-06 16:01                 ` Daniel Sanabria
  1 sibling, 1 reply; 16+ messages in thread
From: Tim Small @ 2020-10-06 15:03 UTC (permalink / raw)
  To: Roger Heflin, Daniel Sanabria; +Cc: Roman Mamedov, Linux-RAID

On 06/10/2020 11:53, Roger Heflin wrote:
> Outside of the enterprise type cards, I have yet to find a stable PCIe card
>

I've also had numerous problems with Marvell SATA controllers.

I've generally found the ASMedia ASM1083 / ASM1085 AHCI controllers
stable and reliable.

ASMedia is part of Asus, and from Wikipedia:

"[ASMedia] produces designs for USB, PCI Express and SATA controllers.
Excluding the X570 chipset, all of the AM4 chipsets for AMD's Zen
micro-architecture were designed by ASMedia"

The ASM108x are only PCIe 2.0 x1 <-> 2 SATA port cards; however, there
are designs (e.g. "SA3008" - around $45 online) which incorporate
multiple ASM108x behind an ASMedia PCIe 2.0 bridge if the number of
available PCIe slots is an issue.

Tim.


* Re: do i need to give up on this setup
  2020-10-06 15:03               ` Tim Small
@ 2020-10-06 16:01                 ` Daniel Sanabria
  2020-10-07  7:26                   ` Tim Small
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Sanabria @ 2020-10-06 16:01 UTC (permalink / raw)
  To: Tim Small; +Cc: Roger Heflin, Roman Mamedov, Linux-RAID

Thank you very much, guys. This is one of the best email lists I'm
subscribed to, so thanks to you all!

I've decided to dissolve this array and will use the disks as
standalone drives. I have another array (raid0) using a pair of the WD
Caviar Blues and the pair of non-Marvell ports and haven't had any
issues in years, so will keep that one.

Thanks again,

Dan

* Re: do i need to give up on this setup
  2020-10-06 16:01                 ` Daniel Sanabria
@ 2020-10-07  7:26                   ` Tim Small
  0 siblings, 0 replies; 16+ messages in thread
From: Tim Small @ 2020-10-07  7:26 UTC (permalink / raw)
  To: Daniel Sanabria; +Cc: Roger Heflin, Roman Mamedov, Linux-RAID

If you want to keep the Marvell controller in use, then I found them a
lot more stable with command queueing (NCQ) disabled, by setting the
queue depth to 1:

echo 1 > /sys/block/sdX/device/queue_depth

Otherwise you might also look for a firmware update for the WD Green
drives which you are having problems with (they might be hidden on
vendor support sites like those of Dell and Lenovo, if you can find
particular PC models which shipped with the same models of WD drives
that you have).

Also, if possible, consider switching to the ASMedia controllers - the
two port versions are available for under €10.
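
The queue_depth setting above has to be written once per drive and does not survive a reboot, so a small loop may help. A minimal sketch only: set_queue_depth is a hypothetical helper name, the device names in the usage line are placeholders for whichever disks sit on the Marvell ports, and the sysfs root is passed as an argument purely so the function can be tried against a scratch directory first.

```shell
#!/bin/sh
# set_queue_depth ROOT DEPTH DEV...
# Write DEPTH to ROOT/block/DEV/device/queue_depth for each DEV,
# printing the old and new value. ROOT is normally /sys.
set_queue_depth() {
    root=$1
    depth=$2
    shift 2
    for dev in "$@"; do
        f="$root/block/$dev/device/queue_depth"
        if [ -w "$f" ]; then
            printf '%s: %s -> %s\n' "$dev" "$(cat "$f")" "$depth"
            echo "$depth" > "$f"
        else
            printf '%s: %s not writable, skipping\n' "$dev" "$f" >&2
        fi
    done
}

# Usage (as root; sdX and sdY are placeholders):
#   set_queue_depth /sys 1 sdX sdY
```

To make it stick across reboots it would need to be run again at boot, for example from a udev rule or a startup script; which of those fits is left open here.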

Tim.


* Re: do i need to give up on this setup
  2020-10-06 14:59                 ` Roger Heflin
@ 2020-10-09  1:00                   ` John Stoffel
  0 siblings, 0 replies; 16+ messages in thread
From: John Stoffel @ 2020-10-09  1:00 UTC (permalink / raw)
  To: Roger Heflin; +Cc: antlists, Daniel Sanabria, Roman Mamedov, Linux-RAID

>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:

Roger> The controller is crap, and is expected to have serious issues no
Roger> matter what drives are on the controller.

I can't say enough good things about the LSI SATA RAID controllers.
You can usually get them pretty cheap on eBay, and just flash them
with the JBOD firmware and they do great.

  LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT S)

It's been a while, but I think it was an IBM branded card at the
time.  8 ports, easy setup, works well.  And looking on eBay, they're
cheap now, around $50, though you might have to pay for the 1-to-4
cables to go to SATA drives.  9211, 9341, stuff like that.  Here's an
eBay listing with cables:

https://www.ebay.com/p/1404809612?iid=133501363959&_trkparms=aid%3D555018%26algo%3DPL.SIM%26ao%3D1%26asc%3D20170810093926%26meid%3De20ed20fb9634e328a31bca5c9e2063c%26pid%3D100854%26rk%3D1%26rkt%3D1%26itm%3D133501363959%26pmt%3D1%26noa%3D0%26pg%3D2322090%26algv%3DSimplAMLSeedlessV2&_trksid=p2322090.c100854.m4779

In my mind, the other advantage of these cards is that you can get two
of them, and split your data across two controllers.  But it also gets
your data disks off the internal controllers, which means you don't
run into nearly as many problems where the system tries to boot off
your data drives, or you have to put boot blocks on them, etc.  

This controller would be more tolerant of your existing drives.  And
since you have the 850 W power supply, it's almost certainly not a
power problem either.

John

Thread overview: 16+ messages
2020-10-05 13:10 do i need to give up on this setup Daniel Sanabria
2020-10-05 13:17 ` Reindl Harald
2020-10-05 13:44 ` Roman Mamedov
     [not found]   ` <CAHscji0pNezf6xCpjWto5-21ayoCeLWm34GTYh5TSgxkOw90mw@mail.gmail.com>
2020-10-05 14:04     ` Roman Mamedov
2020-10-05 14:10       ` Reindl Harald
2020-10-05 14:28       ` Daniel Sanabria
2020-10-05 15:58         ` Roger Heflin
2020-10-06  7:56           ` Daniel Sanabria
2020-10-06  8:24             ` Reindl Harald
2020-10-06 10:53             ` Roger Heflin
2020-10-06 11:29               ` antlists
2020-10-06 14:59                 ` Roger Heflin
2020-10-09  1:00                   ` John Stoffel
2020-10-06 15:03               ` Tim Small
2020-10-06 16:01                 ` Daniel Sanabria
2020-10-07  7:26                   ` Tim Small
