From: Daniel Sanabria <sanabria.d@gmail.com>
To: Roman Mamedov <rm@romanrm.net>
Cc: Linux-RAID <linux-raid@vger.kernel.org>
Subject: Re: do i need to give up on this setup
Date: Mon, 5 Oct 2020 15:28:20 +0100
Message-ID: <CAHscji1VrccTOaQc4GdWof4E+Bzs5KL0-tJJj0ZUM9Db=QBriw@mail.gmail.com>
In-Reply-To: <20201005190421.4ecd8f1b@natsu>
> I meant not to me personally, but to the mailing list. The drives seem OK
> though, even sde.
Sorry, I missed the reply-all button.
On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@romanrm.net> wrote:
>
> On Mon, 5 Oct 2020 14:59:35 +0100
> Daniel Sanabria <sanabria.d@gmail.com> wrote:
>
> > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > "smartctl -a" of all drives as well.
>
> I meant not to me personally, but to the mailing list. The drives seem OK
> though, even sde.
>
> > [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
> > [sudo] password for dan:
> > smartctl 6.6 2017-11-05 r4594
> > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Model Family: Western Digital Green
> > Device Model: WDC WD30EZRX-00D8PB0
> > Serial Number: WD-WCC4NCWT13RF
> > LU WWN Device Id: 5 0014ee 25fc9e460
> > Firmware Version: 80.00A80
> > User Capacity: 3,000,591,900,160 bytes [3.00 TB]
> > Sector Sizes: 512 bytes logical, 4096 bytes physical
> > Rotation Rate: 5400 rpm
> > Device is: In smartctl database [for details use: -P show]
> > ATA Version is: ACS-2 (minor revision not indicated)
> > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > Local Time is: Mon Oct 5 14:58:34 2020 BST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status: (0x82) Offline data collection activity
> > was completed without error.
> > Auto Offline Data Collection: Enabled.
> > Self-test execution status: ( 0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: (38940) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities: (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability: (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: ( 2) minutes.
> > Extended self-test routine
> > recommended polling time: ( 391) minutes.
> > Conveyance self-test routine
> > recommended polling time: ( 5) minutes.
> > SCT capabilities: (0x7035) SCT Status supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> >   3 Spin_Up_Time            0x0027   178   165   021    Pre-fail  Always       -       6075
> >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
> >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
> >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18577
> >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
> > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
> > 194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       28
> > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> >
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > SMART Self-test log structure revision number 1
> > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > # 1  Extended offline    Completed without error       00%     17479         -
> > # 2  Short offline       Completed without error       00%     15531         -
> >
> > SMART Selective self-test log data structure revision number 1
> > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> > 1 0 0 Not_testing
> > 2 0 0 Not_testing
> > 3 0 0 Not_testing
> > 4 0 0 Not_testing
> > 5 0 0 Not_testing
> > Selective self-test flags (0x0):
> > After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
> > smartctl 6.6 2017-11-05 r4594
> > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Model Family: Western Digital Green
> > Device Model: WDC WD30EZRX-00D8PB0
> > Serial Number: WD-WCC4NPRDD6D7
> > LU WWN Device Id: 5 0014ee 25fca27b1
> > Firmware Version: 80.00A80
> > User Capacity: 3,000,592,982,016 bytes [3.00 TB]
> > Sector Sizes: 512 bytes logical, 4096 bytes physical
> > Rotation Rate: 5400 rpm
> > Device is: In smartctl database [for details use: -P show]
> > ATA Version is: ACS-2 (minor revision not indicated)
> > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > Local Time is: Mon Oct 5 14:58:54 2020 BST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status: (0x82) Offline data collection activity
> > was completed without error.
> > Auto Offline Data Collection: Enabled.
> > Self-test execution status: ( 0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: (39060) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities: (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability: (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: ( 2) minutes.
> > Extended self-test routine
> > recommended polling time: ( 392) minutes.
> > Conveyance self-test routine
> > recommended polling time: ( 5) minutes.
> > SCT capabilities: (0x7035) SCT Status supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> >   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail  Always       -       6100
> >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
> >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> >   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
> >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18580
> >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
> > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> > 193 Load_Cycle_Count        0x0032   136   136   000    Old_age   Always       -       192427
> > 194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       29
> > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> >
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > SMART Self-test log structure revision number 1
> > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > # 1  Extended offline    Completed without error       00%     17481         -
> > # 2  Short offline       Completed without error       00%     15534         -
> >
> > SMART Selective self-test log data structure revision number 1
> > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> > 1 0 0 Not_testing
> > 2 0 0 Not_testing
> > 3 0 0 Not_testing
> > 4 0 0 Not_testing
> > 5 0 0 Not_testing
> > Selective self-test flags (0x0):
> > After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > [dan@lamachine ~]$ sudo smartctl -a /dev/sde
> > smartctl 6.6 2017-11-05 r4594
> > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Model Family: Western Digital Green
> > Device Model: WDC WD30EZRX-00D8PB0
> > Serial Number: WD-WCC4N1294906
> > LU WWN Device Id: 5 0014ee 25f968120
> > Firmware Version: 80.00A80
> > User Capacity: 3,000,591,900,160 bytes [3.00 TB]
> > Sector Sizes: 512 bytes logical, 4096 bytes physical
> > Rotation Rate: 5400 rpm
> > Device is: In smartctl database [for details use: -P show]
> > ATA Version is: ACS-2 (minor revision not indicated)
> > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > Local Time is: Mon Oct 5 14:58:57 2020 BST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status: (0x82) Offline data collection activity
> > was completed without error.
> > Auto Offline Data Collection: Enabled.
> > Self-test execution status: ( 0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: (43200) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities: (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability: (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: ( 2) minutes.
> > Extended self-test routine
> > recommended polling time: ( 433) minutes.
> > Conveyance self-test routine
> > recommended polling time: ( 5) minutes.
> > SCT capabilities: (0x7035) SCT Status supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> >   3 Spin_Up_Time            0x0027   176   166   021    Pre-fail  Always       -       6158
> >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       80
> >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> >   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
> >   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18465
> >  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> >  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       80
> > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
> > 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       174015
> > 194 Temperature_Celsius     0x0022   121   107   000    Old_age   Always       -       29
> > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> > 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> > 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> >
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > SMART Self-test log structure revision number 1
> > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > # 1  Extended offline    Completed without error       00%     17347         -
> > # 2  Short offline       Completed without error       00%     15414         -
> >
> > SMART Selective self-test log data structure revision number 1
> > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> > 1 0 0 Not_testing
> > 2 0 0 Not_testing
> > 3 0 0 Not_testing
> > 4 0 0 Not_testing
> > 5 0 0 Not_testing
> > Selective self-test flags (0x0):
> > After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > [dan@lamachine ~]$
> >
> >
> > On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@romanrm.net> wrote:
> > >
> > > On Mon, 5 Oct 2020 14:10:25 +0100
> > > Daniel Sanabria <sanabria.d@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Scrubbing ( # echo check >
> > > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> > > >
> > > > I'm attaching details of the array and disks (bloody wd greens) as
> > > > well as journalctl errors providing some details about the issue.
> > > >
> > > > If you have any pointers on what might be the cause of this, as well as
> > > > any recommendations on how to improve things, please let me know. Thank
> > > > you in advance ...
> > > >
> > > > I have backups of the data so happy to move this to a different setup
> > > > you might recommend (apps will be mostly reading from the array via
> > > > NFS since most of the content will be media).
> > > >
> > > > My suspicion is that a timer service is kicking in and disrupting the
> > > > scrubbing somehow but can't pinpoint what causes this.
> > >
> > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > "smartctl -a" of all drives as well.
> > >
> > > --
> > > With respect,
> > > Roman
>
>
> --
> With respect,
> Roman
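
For anyone who wants to check the same counters without reading each
dump by eye, here is a minimal Python sketch. It is not part of the
original thread. It assumes the attribute table is fed in as smartctl
prints it, one attribute per line, i.e. not re-wrapped by a mail client.
The names WATCHED, parse_attributes and suspicious are hypothetical, and
the set of watched attributes is an assumption chosen for illustration,
not an official smartmontools list.

```python
import re

# Attributes whose non-zero raw value often hints at media or link trouble.
# This selection is an assumption made for this sketch.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# One row of the "Vendor Specific SMART Attributes" table:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: raw_value} for every table row found in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            # Group 2 is the attribute name, group 9 the leading raw number.
            attrs[m.group(2)] = int(m.group(9))
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {name: raw for name, raw in attrs.items()
            if name in WATCHED and raw != 0}


# Three rows taken from the /dev/sdc output quoted in this message.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(suspicious(parse_attributes(SAMPLE)))  # prints {}
```

An empty result only says these counters are zero. It does not rule out
the cabling, controller or PSU causes suggested earlier in the thread.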