From: Roman Mamedov <rm@romanrm.net>
To: Daniel Sanabria <sanabria.d@gmail.com>,
Linux-RAID <linux-raid@vger.kernel.org>
Subject: Re: do i need to give up on this setup
Date: Mon, 5 Oct 2020 19:04:21 +0500
Message-ID: <20201005190421.4ecd8f1b@natsu>
In-Reply-To: <CAHscji0pNezf6xCpjWto5-21ayoCeLWm34GTYh5TSgxkOw90mw@mail.gmail.com>
On Mon, 5 Oct 2020 14:59:35 +0100
Daniel Sanabria <sanabria.d@gmail.com> wrote:
> > It looks like a drive is dropping off the bus and then failing to reidentify,
> > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > "smartctl -a" of all drives as well.
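I meant you should post them to the mailing list, not to me personally. The
drives look OK though, even sde: all three report zero reallocated and pending
sectors, zero UDMA CRC errors, and no logged errors.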
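
As a rough illustration of how the attribute tables can be checked, here is a
minimal Python sketch that scans "smartctl -A" style attribute rows and reports
any watched attribute with a non-zero raw value. The watch list, the function
name nonzero_watched, and the assumption that each attribute sits on one
unwrapped 10-column line are illustrative choices, not part of smartctl. The
sample rows are copied from the sdc output quoted below.

```python
# Attributes whose non-zero raw value often points at media or link trouble.
# This watch list is an assumption made for illustration.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes whose raw value is not 0."""
    flagged = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Expected columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged


# Rows taken from the sdc report in this thread (unwrapped onto single lines).
SDC_SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(SDC_SAMPLE))
```

An empty result only says that these counters are clean. It does not rule out
cabling, controller or PSU trouble.
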
> [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
> [sudo] password for dan:
> smartctl 6.6 2017-11-05 r4594
> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital Green
> Device Model: WDC WD30EZRX-00D8PB0
> Serial Number: WD-WCC4NCWT13RF
> LU WWN Device Id: 5 0014ee 25fc9e460
> Firmware Version: 80.00A80
> User Capacity: 3,000,591,900,160 bytes [3.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 5400 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ACS-2 (minor revision not indicated)
> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is: Mon Oct 5 14:58:34 2020 BST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (38940) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 391) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
> SCT capabilities: (0x7035) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   178   165   021    Pre-fail  Always       -       6075
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18577
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
> 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
> 194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       28
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error        00%            17479  -
> # 2  Short offline       Completed without error        00%            15531  -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
> smartctl 6.6 2017-11-05 r4594
> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital Green
> Device Model: WDC WD30EZRX-00D8PB0
> Serial Number: WD-WCC4NPRDD6D7
> LU WWN Device Id: 5 0014ee 25fca27b1
> Firmware Version: 80.00A80
> User Capacity: 3,000,592,982,016 bytes [3.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 5400 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ACS-2 (minor revision not indicated)
> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is: Mon Oct 5 14:58:54 2020 BST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (39060) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 392) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
> SCT capabilities: (0x7035) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail  Always       -       6100
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18580
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> 193 Load_Cycle_Count        0x0032   136   136   000    Old_age   Always       -       192427
> 194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       29
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error        00%            17481  -
> # 2  Short offline       Completed without error        00%            15534  -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> [dan@lamachine ~]$ sudo smartctl -a /dev/sde
> smartctl 6.6 2017-11-05 r4594
> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital Green
> Device Model: WDC WD30EZRX-00D8PB0
> Serial Number: WD-WCC4N1294906
> LU WWN Device Id: 5 0014ee 25f968120
> Firmware Version: 80.00A80
> User Capacity: 3,000,591,900,160 bytes [3.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 5400 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ACS-2 (minor revision not indicated)
> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is: Mon Oct 5 14:58:57 2020 BST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (43200) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 433) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
> SCT capabilities: (0x7035) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   176   166   021    Pre-fail  Always       -       6158
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       80
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18465
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       80
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       174015
> 194 Temperature_Celsius     0x0022   121   107   000    Old_age   Always       -       29
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error        00%            17347  -
> # 2  Short offline       Completed without error        00%            15414  -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> [dan@lamachine ~]$
>
>
> On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote:
> >
> > On Mon, 5 Oct 2020 14:10:25 +0100
> > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Scrubbing ( # echo check >
> > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> > >
> > > I'm attaching details of the array and disks (bloody wd greens) as
> > > well as journalctl errors providing some details about the issue.
> > >
> > > If you have any pointers on what might be the cause of this as well as
> > > any recommendations on how to improve things please let me thank you
> > > in advance ...
> > >
> > > I have backups of the data so happy to move this to a different setup
> > > you might recommend (apps will be mostly reading from the array via
> > > NFS since most of the content will be media).
> > >
> > > My suspicion is that a timer service is kicking in and disrupting the
> > > scrubbing somehow but can't pinpoint what causes this.
> >
> > It looks like a drive is dropping off the bus and then failing to reidentify,
> > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > "smartctl -a" of all drives as well.
> >
> > --
> > With respect,
> > Roman
--
With respect,
Roman