From: Reindl Harald <h.reindl@thelounge.net>
To: Roman Mamedov <rm@romanrm.net>,
Daniel Sanabria <sanabria.d@gmail.com>,
Linux-RAID <linux-raid@vger.kernel.org>
Subject: Re: do i need to give up on this setup
Date: Mon, 5 Oct 2020 16:10:35 +0200 [thread overview]
Message-ID: <d60e92fc-a9aa-a48d-a3ac-dc3613d4686b@thelounge.net> (raw)
In-Reply-To: <20201005190421.4ecd8f1b@natsu>
Am 05.10.20 um 16:04 schrieb Roman Mamedov:
> On Mon, 5 Oct 2020 14:59:35 +0100
> Daniel Sanabria <sanabria.d@gmail.com> wrote:
>
>>> It looks like a drive is dropping off the bus and then failing to reidentify,
>>> could be bad cabling/controller/PSU, or just a bad drive. You should post
>>> "smartctl -a" of all drives as well.
>
> I meant not to me personally, but to the mailing list. The drives seem OK
> though, even sde.
you have a hardware problem and it#s no uncommon when one of your disks
is going crazy under laod that due the reset of the crontroller a second
one on the same bus is also reset
either one of your disks is faulty, the controller is faulty or you have
an issue with your cables
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:48:18:5f:2b/05:00:45:00:00/40 tag 9 ncq dma 688128 in
res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:50:58:64:2b/05:00:45:00:00/40 tag 10 ncq dma 688128 in
res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:34:39 lamachine kernel: ata8.00: status: { DRDY }
Oct 05 11:34:39 lamachine kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 05 11:34:39 lamachine kernel: ata8.00: cmd
60/40:58:98:69:2b/05:00:45:00:00/40 tag 11 ncq dma 688128 in
res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 05 11:35:29 lamachine kernel: ata7.00: qc timeout (cmd 0xec)
Oct 05 11:35:29 lamachine kernel: ata8.00: qc timeout (cmd 0xec)
Oct 05 11:35:29 lamachine kernel: ata8.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:35:29 lamachine kernel: ata8.00: revalidation failed (errno=-5)
Oct 05 11:35:29 lamachine kernel: ata8.00: disabled
Oct 05 11:35:29 lamachine kernel: ata7.00: failed to IDENTIFY (I/O
error, err_mask=0x4)
Oct 05 11:35:29 lamachine kernel: ata7.00: revalidation failed (errno=-5)
Oct 05 11:35:29 lamachine kernel: ata7.00: disabled
Oct 05 11:35:30 lamachine kernel: ata7: SATA link up 6.0 Gbps (SStatus
133 SControl 320)
Oct 05 11:35:30 lamachine kernel: ata8: SATA link up 6.0 Gbps (SStatus
133 SControl 320)
Oct 05 11:35:31 lamachine kernel: ata8: EH complete
Oct 05 11:35:31 lamachine kernel: ata7: EH complete
Oct 05 11:35:31 lamachine kernel: scsi_io_completion_action: 115
callbacks suppressed
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#12 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 7:0:0:0: [sdd] tag#12 CDB:
Read(16) 88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: print_req_error: 115 callbacks suppressed
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdd, sector 1160501208 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
class 0
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#18 FAILED
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Oct 05 11:35:31 lamachine kernel: sd 6:0:0:0: [sdc] tag#18 CDB:
Read(16) 88 00 00 00 00 00 45 2b d7 d8 00 00 05 40 00 00
Oct 05 11:35:31 lamachine kernel: blk_update_request: I/O error, dev
sdc, sector 1160501208 op 0x0:(READ) flags 0x4000 phys_seg 168 prio
>> [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
>> [sudo] password for dan:
>> smartctl 6.6 2017-11-05 r4594
>> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
>> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>>
>> === START OF INFORMATION SECTION ===
>> Model Family: Western Digital Green
>> Device Model: WDC WD30EZRX-00D8PB0
>> Serial Number: WD-WCC4NCWT13RF
>> LU WWN Device Id: 5 0014ee 25fc9e460
>> Firmware Version: 80.00A80
>> User Capacity: 3,000,591,900,160 bytes [3.00 TB]
>> Sector Sizes: 512 bytes logical, 4096 bytes physical
>> Rotation Rate: 5400 rpm
>> Device is: In smartctl database [for details use: -P show]
>> ATA Version is: ACS-2 (minor revision not indicated)
>> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>> Local Time is: Mon Oct 5 14:58:34 2020 BST
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status: (0x82) Offline data collection activity
>> was completed without error.
>> Auto Offline Data Collection: Enabled.
>> Self-test execution status: ( 0) The previous self-test routine completed
>> without error or no self-test has ever
>> been run.
>> Total time to complete Offline
>> data collection: (38940) seconds.
>> Offline data collection
>> capabilities: (0x7b) SMART execute Offline immediate.
>> Auto Offline data collection on/off support.
>> Suspend Offline collection upon new
>> command.
>> Offline surface scan supported.
>> Self-test supported.
>> Conveyance Self-test supported.
>> Selective Self-test supported.
>> SMART capabilities: (0x0003) Saves SMART data before entering
>> power-saving mode.
>> Supports SMART auto save timer.
>> Error logging capability: (0x01) Error logging supported.
>> General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time: ( 2) minutes.
>> Extended self-test routine
>> recommended polling time: ( 391) minutes.
>> Conveyance self-test routine
>> recommended polling time: ( 5) minutes.
>> SCT capabilities: (0x7035) SCT Status supported.
>> SCT Feature Control supported.
>> SCT Data Table supported.
>>
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
>> UPDATED WHEN_FAILED RAW_VALUE
>> 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail
>> Always - 0
>> 3 Spin_Up_Time 0x0027 178 165 021 Pre-fail
>> Always - 6075
>> 4 Start_Stop_Count 0x0032 100 100 000 Old_age
>> Always - 81
>> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail
>> Always - 0
>> 7 Seek_Error_Rate 0x002e 100 253 000 Old_age
>> Always - 0
>> 9 Power_On_Hours 0x0032 075 075 000 Old_age
>> Always - 18577
>> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age
>> Always - 0
>> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age
>> Always - 0
>> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age
>> Always - 81
>> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age
>> Always - 46
>> 193 Load_Cycle_Count 0x0032 142 142 000 Old_age
>> Always - 176661
>> 194 Temperature_Celsius 0x0022 122 109 000 Old_age
>> Always - 28
>> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age
>> Always - 0
>> 197 Current_Pending_Sector 0x0032 200 200 000 Old_age
>> Always - 0
>> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age
>> Offline - 0
>> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age
>> Always - 0
>> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age
>> Offline - 0
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> Num Test_Description Status Remaining
>> LifeTime(hours) LBA_of_first_error
>> # 1 Extended offline Completed without error 00% 17479 -
>> # 2 Short offline Completed without error 00% 15531 -
>>
>> SMART Selective self-test log data structure revision number 1
>> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
>> 1 0 0 Not_testing
>> 2 0 0 Not_testing
>> 3 0 0 Not_testing
>> 4 0 0 Not_testing
>> 5 0 0 Not_testing
>> Selective self-test flags (0x0):
>> After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>
>> [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
>> smartctl 6.6 2017-11-05 r4594
>> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
>> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>>
>> === START OF INFORMATION SECTION ===
>> Model Family: Western Digital Green
>> Device Model: WDC WD30EZRX-00D8PB0
>> Serial Number: WD-WCC4NPRDD6D7
>> LU WWN Device Id: 5 0014ee 25fca27b1
>> Firmware Version: 80.00A80
>> User Capacity: 3,000,592,982,016 bytes [3.00 TB]
>> Sector Sizes: 512 bytes logical, 4096 bytes physical
>> Rotation Rate: 5400 rpm
>> Device is: In smartctl database [for details use: -P show]
>> ATA Version is: ACS-2 (minor revision not indicated)
>> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>> Local Time is: Mon Oct 5 14:58:54 2020 BST
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status: (0x82) Offline data collection activity
>> was completed without error.
>> Auto Offline Data Collection: Enabled.
>> Self-test execution status: ( 0) The previous self-test routine completed
>> without error or no self-test has ever
>> been run.
>> Total time to complete Offline
>> data collection: (39060) seconds.
>> Offline data collection
>> capabilities: (0x7b) SMART execute Offline immediate.
>> Auto Offline data collection on/off support.
>> Suspend Offline collection upon new
>> command.
>> Offline surface scan supported.
>> Self-test supported.
>> Conveyance Self-test supported.
>> Selective Self-test supported.
>> SMART capabilities: (0x0003) Saves SMART data before entering
>> power-saving mode.
>> Supports SMART auto save timer.
>> Error logging capability: (0x01) Error logging supported.
>> General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time: ( 2) minutes.
>> Extended self-test routine
>> recommended polling time: ( 392) minutes.
>> Conveyance self-test routine
>> recommended polling time: ( 5) minutes.
>> SCT capabilities: (0x7035) SCT Status supported.
>> SCT Feature Control supported.
>> SCT Data Table supported.
>>
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
>> UPDATED WHEN_FAILED RAW_VALUE
>> 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail
>> Always - 0
>> 3 Spin_Up_Time 0x0027 178 164 021 Pre-fail
>> Always - 6100
>> 4 Start_Stop_Count 0x0032 100 100 000 Old_age
>> Always - 81
>> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail
>> Always - 0
>> 7 Seek_Error_Rate 0x002e 100 253 000 Old_age
>> Always - 0
>> 9 Power_On_Hours 0x0032 075 075 000 Old_age
>> Always - 18580
>> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age
>> Always - 0
>> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age
>> Always - 0
>> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age
>> Always - 81
>> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age
>> Always - 53
>> 193 Load_Cycle_Count 0x0032 136 136 000 Old_age
>> Always - 192427
>> 194 Temperature_Celsius 0x0022 121 108 000 Old_age
>> Always - 29
>> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age
>> Always - 0
>> 197 Current_Pending_Sector 0x0032 200 200 000 Old_age
>> Always - 0
>> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age
>> Offline - 0
>> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age
>> Always - 0
>> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age
>> Offline - 0
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> Num Test_Description Status Remaining
>> LifeTime(hours) LBA_of_first_error
>> # 1 Extended offline Completed without error 00% 17481 -
>> # 2 Short offline Completed without error 00% 15534 -
>>
>> SMART Selective self-test log data structure revision number 1
>> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
>> 1 0 0 Not_testing
>> 2 0 0 Not_testing
>> 3 0 0 Not_testing
>> 4 0 0 Not_testing
>> 5 0 0 Not_testing
>> Selective self-test flags (0x0):
>> After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>
>> [dan@lamachine ~]$ sudo smartctl -a /dev/sde
>> smartctl 6.6 2017-11-05 r4594
>> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
>> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>>
>> === START OF INFORMATION SECTION ===
>> Model Family: Western Digital Green
>> Device Model: WDC WD30EZRX-00D8PB0
>> Serial Number: WD-WCC4N1294906
>> LU WWN Device Id: 5 0014ee 25f968120
>> Firmware Version: 80.00A80
>> User Capacity: 3,000,591,900,160 bytes [3.00 TB]
>> Sector Sizes: 512 bytes logical, 4096 bytes physical
>> Rotation Rate: 5400 rpm
>> Device is: In smartctl database [for details use: -P show]
>> ATA Version is: ACS-2 (minor revision not indicated)
>> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>> Local Time is: Mon Oct 5 14:58:57 2020 BST
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status: (0x82) Offline data collection activity
>> was completed without error.
>> Auto Offline Data Collection: Enabled.
>> Self-test execution status: ( 0) The previous self-test routine completed
>> without error or no self-test has ever
>> been run.
>> Total time to complete Offline
>> data collection: (43200) seconds.
>> Offline data collection
>> capabilities: (0x7b) SMART execute Offline immediate.
>> Auto Offline data collection on/off support.
>> Suspend Offline collection upon new
>> command.
>> Offline surface scan supported.
>> Self-test supported.
>> Conveyance Self-test supported.
>> Selective Self-test supported.
>> SMART capabilities: (0x0003) Saves SMART data before entering
>> power-saving mode.
>> Supports SMART auto save timer.
>> Error logging capability: (0x01) Error logging supported.
>> General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time: ( 2) minutes.
>> Extended self-test routine
>> recommended polling time: ( 433) minutes.
>> Conveyance self-test routine
>> recommended polling time: ( 5) minutes.
>> SCT capabilities: (0x7035) SCT Status supported.
>> SCT Feature Control supported.
>> SCT Data Table supported.
>>
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
>> UPDATED WHEN_FAILED RAW_VALUE
>> 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail
>> Always - 0
>> 3 Spin_Up_Time 0x0027 176 166 021 Pre-fail
>> Always - 6158
>> 4 Start_Stop_Count 0x0032 100 100 000 Old_age
>> Always - 80
>> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail
>> Always - 0
>> 7 Seek_Error_Rate 0x002e 200 200 000 Old_age
>> Always - 0
>> 9 Power_On_Hours 0x0032 075 075 000 Old_age
>> Always - 18465
>> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age
>> Always - 0
>> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age
>> Always - 0
>> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age
>> Always - 80
>> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age
>> Always - 53
>> 193 Load_Cycle_Count 0x0032 142 142 000 Old_age
>> Always - 174015
>> 194 Temperature_Celsius 0x0022 121 107 000 Old_age
>> Always - 29
>> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age
>> Always - 0
>> 197 Current_Pending_Sector 0x0032 200 200 000 Old_age
>> Always - 0
>> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age
>> Offline - 0
>> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age
>> Always - 0
>> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age
>> Offline - 0
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> Num Test_Description Status Remaining
>> LifeTime(hours) LBA_of_first_error
>> # 1 Extended offline Completed without error 00% 17347 -
>> # 2 Short offline Completed without error 00% 15414 -
>>
>> SMART Selective self-test log data structure revision number 1
>> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
>> 1 0 0 Not_testing
>> 2 0 0 Not_testing
>> 3 0 0 Not_testing
>> 4 0 0 Not_testing
>> 5 0 0 Not_testing
>> Selective self-test flags (0x0):
>> After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>
>> [dan@lamachine ~]$
>>
>>
>> On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote:
>>>
>>> On Mon, 5 Oct 2020 14:10:25 +0100
>>> Daniel Sanabria <sanabria.d@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Scrubbing ( # echo check >
>>>> /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
>>>>
>>>> I'm attaching details of the array and disks (bloody wd greens) as
>>>> well as journalctl errors providing some details about the issue.
>>>>
>>>> If you have any pointers on what might be the cause of this as well as
>>>> any recommendations on how to improve things please let me thank you
>>>> in advance ...
>>>>
>>>> I have backups of the data so happy to move this to a different setup
>>>> you might recommend (apps will be mostly reading from the array via
>>>> NFS since most of the content will be media).
>>>>
>>>> My suspicion is that a timer service is kicking in and disrupting the
>>>> scrubbing somehow but can't pinpoint what causes this.
>>>
>>> It looks like a drive is dropping off the bus and then failing to reidentify,
>>> could be bad cabling/controller/PSU, or just a bad drive. You should post
>>> "smartctl -a" of all drives as well.
next prev parent reply other threads:[~2020-10-05 14:10 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-10-05 13:10 do i need to give up on this setup Daniel Sanabria
2020-10-05 13:17 ` Reindl Harald
2020-10-05 13:44 ` Roman Mamedov
[not found] ` <CAHscji0pNezf6xCpjWto5-21ayoCeLWm34GTYh5TSgxkOw90mw@mail.gmail.com>
2020-10-05 14:04 ` Roman Mamedov
2020-10-05 14:10 ` Reindl Harald [this message]
2020-10-05 14:28 ` Daniel Sanabria
2020-10-05 15:58 ` Roger Heflin
2020-10-06 7:56 ` Daniel Sanabria
2020-10-06 8:24 ` Reindl Harald
2020-10-06 10:53 ` Roger Heflin
2020-10-06 11:29 ` antlists
2020-10-06 14:59 ` Roger Heflin
2020-10-09 1:00 ` John Stoffel
2020-10-06 15:03 ` Tim Small
2020-10-06 16:01 ` Daniel Sanabria
2020-10-07 7:26 ` Tim Small
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d60e92fc-a9aa-a48d-a3ac-dc3613d4686b@thelounge.net \
--to=h.reindl@thelounge.net \
--cc=linux-raid@vger.kernel.org \
--cc=rm@romanrm.net \
--cc=sanabria.d@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).