* request for help on IMSM-metadata RAID-5 array
@ 2023-09-23 10:54 Joel Parthemore
2023-09-23 11:24 ` Roman Mamedov
2023-09-25 9:44 ` Mariusz Tkaczyk
0 siblings, 2 replies; 14+ messages in thread
From: Joel Parthemore @ 2023-09-23 10:54 UTC (permalink / raw)
To: linux-raid
Apologies in advance for the long email, but I wanted to include
everything that is asked for on the "asking for help" page associated
with the mailing list. The output from some of the requested commands is
pretty lengthy.
My home directory is on a three-disk RAID-5 array that, for whatever
reason (it seemed like a good idea at the time?), I built using the
hooks from the UEFI BIOS (or at least that's my understanding of what I did). That is to
say, it's a "real" software-based RAID array in Linux that's built on a
"fake" RAID array in the UEFI BIOS. Mostly nothing important is stored
on the /home partition, but I forgot to back up a few important things
that are (or, at least, were). So I'd like to get the RAID array back if
I can, or know if I can't; and I will be extremely grateful to anyone
who can tell me one way or the other.
All was well for some number of years until a few days ago. After I
installed the latest KDE updates, the RAID array would lock up entirely
when I tried to log in to a new KDE Wayland session. It all came down to
one process that refused to die, running startplasma-wayland. Because
the process refused to die, the RAID array could not be stopped cleanly
and rebooting the computer therefore caused the RAID array to go out of
sync. After that, any attempt whatsoever to access the RAID array would
cause the RAID array to lock up again.
The first few times this happened, I was able to start the computer
without starting the RAID array, reassemble the RAID array using the
command mdadm --assemble --run --force /dev/md126 /dev/sda /dev/sde
/dev/sdc and have it working fine -- I could fix any filestore problems
with e2fsck, mount /home, log in to my home directory, do pretty much
whatever I wanted -- until I tried logging into a new KDE Wayland
session again. This happened several times while I was trying to
troubleshoot the problem with startplasma-wayland.
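(The full sequence each time was roughly the one below; the device letters are
placeholders here, since they tended to move around between boots:)
mdadm --assemble --run --force /dev/md126 /dev/sdX /dev/sdY /dev/sdZ  # the three member disks
e2fsck -f /dev/md126                                                  # repair the ext4 filestore
mount /dev/md126 /home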
Unfortunately, one time this didn't work. I was still able to start the
computer without starting the RAID array, reassemble it and reboot with
the RAID array looking seemingly okay (according to mdadm -D) BUT this
time, any attempt to access the RAID array or even just stop the array
(mdadm --stop /dev/md126, mdadm --stop /dev/md127) once it was started
would cause the RAID array to lock up. That means (I think) that I can't
create an image of the array contents using dd, which is what -- of
course -- I should have done in the first place. (I could assemble the
RAID array read-only, but the RAID array is out of sync because it
didn't shut down properly.)
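(For reference, the imaging step I have in mind, if read-only assembly behaves,
would look roughly like the following; the destination is just a placeholder for
somewhere with ~2 TB free, and the exact flags may need adjusting for an IMSM
container:)
mdadm --assemble --scan --readonly                  # assemble container and member read-only
cat /proc/mdstat                                    # confirm md126/md127 came up
dd if=/dev/md126 of=/path/to/md126.img bs=1M conv=noerror,sync status=progress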
I'm guessing that the contents of the filestore on the RAID array are
probably still there. Does anyone have suggestions on getting the RAID
array working properly again and accessing them? I have avoided doing
anything further myself because, of course, if the contents of the
filestore are still there, I don't want to do anything to jeopardize
them. You may tell me that I've done too much already. :-)
uname -a
Linux bjoernbaer 6.5.3-gentoo #7 SMP PREEMPT_DYNAMIC Wed Sep 20 22:46:24
CEST 2023 x86_64 Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz GenuineIntel
GNU/Linux
mdadm --version
mdadm - v4.2 - 2021-12-30
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md126 : active raid5 sda[2] sde[1] sdd[0]
1953513472 blocks super external:/md127/0 level 5, 128k chunk,
algorithm 0 [3/3] [UUU]
md127 : inactive sdd[2](S) sda[1](S) sde[0](S)
15603 blocks super external:imsm
unused devices: <none>
mdadm -D /dev/md126
/dev/md126:
Container : /dev/md/imsm0, member 0
Raid Level : raid5
Array Size : 1953513472 (1863.02 GiB 2000.40 GB)
Used Dev Size : 976756736 (931.51 GiB 1000.20 GB)
Raid Devices : 3
Total Devices : 3
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Layout : left-asymmetric
Chunk Size : 128K
Consistency Policy : resync
UUID : aa989aae:6526b858:7b1edb5f:4f4b2686
Number Major Minor RaidDevice State
2 8 0 0 active sync /dev/sda
1 8 64 1 active sync /dev/sde
0 8 48 2 active sync /dev/sdd
mdadm -D /dev/md127
/dev/md127:
Version : imsm
Raid Level : container
Total Devices : 3
Working Devices : 3
UUID : 01ce9128:9a9d46c6:efc9650f:28fe9662
Member Arrays : /dev/md/vol0
Number Major Minor RaidDevice
- 8 64 - /dev/sde
- 8 0 - /dev/sda
- 8 48 - /dev/sdd
The output of smartctl --xall /dev/sdX and lsdrv follows.
smartctl --xall /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.3-gentoo] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue
Device Model: WDC WD10EZEX-75WN4A1
Serial Number: WD-WCC6Y3LCXY73
LU WWN Device Id: 5 0014ee 2bcc2d9cf
Firmware Version: 13057113
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5533
ATA Version is: ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Sep 23 09:33:42 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test
routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (11580) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 120) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
3 Spin_Up_Time POS--K 179 170 021 - 2050
4 Start_Stop_Count -O--CK 099 099 000 - 1171
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 079 079 000 - 15674
10 Spin_Retry_Count -O--CK 100 100 000 - 0
11 Calibration_Retry_Count -O--CK 100 100 000 - 0
12 Power_Cycle_Count -O--CK 099 099 000 - 1171
192 Power-Off_Retract_Count -O--CK 200 200 000 - 192
193 Load_Cycle_Count -O--CK 138 138 000 - 187769
194 Temperature_Celsius -O---K 104 084 000 - 39
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 200 200 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
240 Head_Flying_Hours -O--CK 082 082 000 - 13596
241 Total_LBAs_Written -O--CK 200 200 000 - 11248575141
242 Total_LBAs_Read -O--CK 200 200 000 - 148189090777
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 6 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
0xa8-0xb6 GPL,SL VS 1 Device vendor specific log
0xb7 GPL,SL VS 48 Device vendor specific log
0xbd GPL,SL VS 1 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL VS 93 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 2
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 [1] occurred at disk power-on lifetime: 13994 hours (583 days +
2 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 18 a5 f5 80 40 00 Error: UNC at LBA =
0x18a5f580 = 413529472
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 00 08 00 c0 00 00 18 a5 f9 48 40 08 1d+13:49:57.480 READ FPDMA
QUEUED
60 00 08 00 b8 00 00 18 a5 f9 50 40 08 1d+13:49:57.480 READ FPDMA
QUEUED
60 00 08 00 b0 00 00 18 a5 f9 58 40 08 1d+13:49:57.480 READ FPDMA
QUEUED
60 00 20 00 a8 00 00 18 a7 eb e0 40 08 1d+13:49:57.480 READ FPDMA
QUEUED
60 00 08 00 20 00 00 18 a5 f5 80 40 08 1d+13:49:57.448 READ FPDMA
QUEUED
Error 1 [0] occurred at disk power-on lifetime: 13994 hours (583 days +
2 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 18 a5 f5 80 40 00 Error: UNC at LBA =
0x18a5f580 = 413529472
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time
Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- ---------------
--------------------
60 05 40 00 18 00 00 18 a6 91 a0 40 08 1d+13:49:52.955 READ FPDMA
QUEUED
60 05 40 00 10 00 00 18 a6 8c 60 40 08 1d+13:49:52.951 READ FPDMA
QUEUED
60 05 40 00 08 00 00 18 a6 87 20 40 08 1d+13:49:52.946 READ FPDMA
QUEUED
60 05 40 00 00 00 00 18 a6 81 e0 40 08 1d+13:49:52.943 READ FPDMA
QUEUED
60 05 40 00 f8 00 00 18 a6 7c a0 40 08 1d+13:49:52.940 READ FPDMA
QUEUED
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
Device State: Active (0)
Current Temperature: 39 Celsius
Power Cycle Min/Max Temperature: 23/44 Celsius
Lifetime Min/Max Temperature: 17/58 Celsius
Under/Over Temperature Limit Count: 0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (358)
Index Estimated Time Temperature Celsius
359 2023-09-23 01:36 39 ********************
... ..( 23 skipped). .. ********************
383 2023-09-23 02:00 39 ********************
384 2023-09-23 02:01 38 *******************
385 2023-09-23 02:02 39 ********************
386 2023-09-23 02:03 39 ********************
387 2023-09-23 02:04 39 ********************
388 2023-09-23 02:05 38 *******************
389 2023-09-23 02:06 39 ********************
390 2023-09-23 02:07 38 *******************
... ..(144 skipped). .. *******************
57 2023-09-23 04:32 38 *******************
58 2023-09-23 04:33 39 ********************
... ..( 18 skipped). .. ********************
77 2023-09-23 04:52 39 ********************
78 2023-09-23 04:53 38 *******************
... ..( 4 skipped). .. *******************
83 2023-09-23 04:58 38 *******************
84 2023-09-23 04:59 39 ********************
... ..( 17 skipped). .. ********************
102 2023-09-23 05:17 39 ********************
103 2023-09-23 05:18 40 *********************
... ..( 30 skipped). .. *********************
134 2023-09-23 05:49 40 *********************
135 2023-09-23 05:50 39 ********************
... ..(222 skipped). .. ********************
358 2023-09-23 09:33 39 ********************
SCT Error Recovery Control command not supported
Device Statistics (GP/SMART Log 0x04) not supported
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 2 38 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 40 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x8000 4 223809 Vendor specific
smartctl --xall /dev/sde
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.3-gentoo] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue
Device Model: WDC WD10EZEX-75WN4A1
Serial Number: WD-WCC6Y5KP9A25
LU WWN Device Id: 5 0014ee 21217afbc
Firmware Version: 13057113
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5533
ATA Version is: ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Sep 23 09:33:46 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test
routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (11100) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 115) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
3 Spin_Up_Time POS--K 175 166 021 - 2208
4 Start_Stop_Count -O--CK 099 099 000 - 1171
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 079 079 000 - 15673
10 Spin_Retry_Count -O--CK 100 100 000 - 0
11 Calibration_Retry_Count -O--CK 100 100 000 - 0
12 Power_Cycle_Count -O--CK 099 099 000 - 1171
192 Power-Off_Retract_Count -O--CK 200 200 000 - 192
193 Load_Cycle_Count -O--CK 136 136 000 - 192695
194 Temperature_Celsius -O---K 106 087 000 - 37
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 200 200 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 1
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
240 Head_Flying_Hours -O--CK 082 082 000 - 13573
241 Total_LBAs_Written -O--CK 200 200 000 - 11125049534
242 Total_LBAs_Read -O--CK 200 200 000 - 148752593578
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 6 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
0xa8-0xb6 GPL,SL VS 1 Device vendor specific log
0xb7 GPL,SL VS 48 Device vendor specific log
0xbd GPL,SL VS 1 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL VS 93 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
Device State: Active (0)
Current Temperature: 37 Celsius
Power Cycle Min/Max Temperature: 23/41 Celsius
Lifetime Min/Max Temperature: 17/55 Celsius
Under/Over Temperature Limit Count: 0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (278)
Index Estimated Time Temperature Celsius
279 2023-09-23 01:36 37 ******************
... ..(220 skipped). .. ******************
22 2023-09-23 05:17 37 ******************
23 2023-09-23 05:18 38 *******************
... ..( 86 skipped). .. *******************
110 2023-09-23 06:45 38 *******************
111 2023-09-23 06:46 37 ******************
112 2023-09-23 06:47 38 *******************
113 2023-09-23 06:48 37 ******************
114 2023-09-23 06:49 37 ******************
115 2023-09-23 06:50 38 *******************
116 2023-09-23 06:51 37 ******************
... ..( 4 skipped). .. ******************
121 2023-09-23 06:56 37 ******************
122 2023-09-23 06:57 38 *******************
123 2023-09-23 06:58 37 ******************
... ..( 5 skipped). .. ******************
129 2023-09-23 07:04 37 ******************
130 2023-09-23 07:05 38 *******************
131 2023-09-23 07:06 37 ******************
... ..(146 skipped). .. ******************
278 2023-09-23 09:33 37 ******************
SCT Error Recovery Control command not supported
Device Statistics (GP/SMART Log 0x04) not supported
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 2 46 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 48 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x8000 4 223816 Vendor specific
smartctl --xall /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.3-gentoo] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue
Device Model: WDC WD10EZEX-75WN4A1
Serial Number: WD-WCC6Y3NF1PD2
LU WWN Device Id: 5 0014ee 2676cfff8
Firmware Version: 13057113
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5533
ATA Version is: ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Sep 23 09:33:49 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test
routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (10740) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 112) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
3 Spin_Up_Time POS--K 178 169 021 - 2075
4 Start_Stop_Count -O--CK 099 099 000 - 1171
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 079 079 000 - 15674
10 Spin_Retry_Count -O--CK 100 100 000 - 0
11 Calibration_Retry_Count -O--CK 100 100 000 - 0
12 Power_Cycle_Count -O--CK 099 099 000 - 1171
192 Power-Off_Retract_Count -O--CK 200 200 000 - 193
193 Load_Cycle_Count -O--CK 139 139 000 - 184449
194 Temperature_Celsius -O---K 104 085 000 - 39
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 200 200 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 1
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
240 Head_Flying_Hours -O--CK 082 082 000 - 13577
241 Total_LBAs_Written -O--CK 200 200 000 - 10965137767
242 Total_LBAs_Read -O--CK 200 200 000 - 149454382073
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 6 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
0xa8-0xb6 GPL,SL VS 1 Device vendor specific log
0xb7 GPL,SL VS 48 Device vendor specific log
0xbd GPL,SL VS 1 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL VS 93 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
Device State: Active (0)
Current Temperature: 39 Celsius
Power Cycle Min/Max Temperature: 23/45 Celsius
Lifetime Min/Max Temperature: 17/58 Celsius
Under/Over Temperature Limit Count: 0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (333)
Index Estimated Time Temperature Celsius
334 2023-09-23 01:36 39 ********************
... ..(220 skipped). .. ********************
77 2023-09-23 05:17 39 ********************
78 2023-09-23 05:18 41 **********************
79 2023-09-23 05:19 40 *********************
80 2023-09-23 05:20 40 *********************
81 2023-09-23 05:21 41 **********************
82 2023-09-23 05:22 41 **********************
83 2023-09-23 05:23 40 *********************
84 2023-09-23 05:24 41 **********************
85 2023-09-23 05:25 40 *********************
... ..( 2 skipped). .. *********************
88 2023-09-23 05:28 40 *********************
89 2023-09-23 05:29 41 **********************
90 2023-09-23 05:30 40 *********************
... ..(182 skipped). .. *********************
273 2023-09-23 08:33 40 *********************
274 2023-09-23 08:34 39 ********************
275 2023-09-23 08:35 40 *********************
276 2023-09-23 08:36 39 ********************
... ..( 56 skipped). .. ********************
333 2023-09-23 09:33 39 ********************
SCT Error Recovery Control command not supported
Device Statistics (GP/SMART Log 0x04) not supported
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 2 46 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 48 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x8000 4 223818 Vendor specific
lsdrv
PCI [nvme] 03:00.0 Non-Volatile memory controller: Intel Corporation SSD
660P Series (rev 03)
└nvme nvme0 INTEL SSDPEKNW010T8 {BTNH94442DT11P0B}
└nvme0n1 953.87g [259:0] ext4 'work'
{86c116a2-351a-4b71-b54a-e92e8adbb9c7}
└Mounted as /dev/nvme0n1 @ /work
PCI [nvme] 09:00.0 Non-Volatile memory controller: Intel Corporation SSD
660P Series (rev 03)
└nvme nvme1 INTEL SSDPEKNW010T8 {BTNH94141GH01P0B}
└nvme1n1 953.87g [259:1] ext4 {8a83519a-9bec-4d5b-9303-f8ce71cfe36c}
PCI [ahci] 00:17.0 RAID bus controller: Intel Corporation Comet Lake
PCH-H RAID
├scsi 0:0:0:0 ATA WDC WD10EZEX-75W {WD-WCC6Y3LCXY73}
│└sda 931.51g [8:0] isw_raid_member
│ ├md126 1.82t [9:126] MD vexternal:/md127/0 raid5 (3) write-pending,
128k Chunk {00000000:-0000-00:00-0000-:000000000000}
│ │ ext4 'HOME' {2f44f848-3665-4663-a0d7-70e19ad99b30}
│ └md127 0.00k [9:127] MD vexternal:imsm () inactive, None (None) None
{00000000:-0000-00:00-0000-:000000000000}
│ Empty/Unknown
├scsi 1:0:0:0 TSSTcorp DVDWBD SH-B123L {R84A6GDC2003Y7}
│└sr0 1.00g [11:0] Empty/Unknown
├scsi 2:0:0:0 ATA Corsair Neutron {12437904000021010029}
│└sdb 447.13g [8:16] Partitioned (gpt)
│ ├sdb1 92.00m [8:17] vfat 'EFI' {AD3D-F277}
│ │└Mounted as /dev/sdb1 @ /boot/efi
│ ├sdb2 977.00m [8:18] swap 'swap' {f5fafe65-ed0f-48fe-91e1-e9244d372f1e}
│ ├sdb3 976.00m [8:19] ext4 'boot' {763fb069-ad45-4f2d-b887-d6673cdbe74f}
│ │└Mounted as /dev/sdb3 @ /boot
│ └sdb4 445.10g [8:20] ext4 'root' {6bd121cd-69a4-41eb-b469-cfbb2e91022f}
│ └Mounted as /dev/root @ /
├scsi 3:0:0:0 ATA ST2000DM008-2UB1 {ZK30CAMT}
│└sdc 1.82t [8:32] ext4 'HOME' {2f44f848-3665-4663-a0d7-70e19ad99b30}
├scsi 4:0:0:0 ATA WDC WD10EZEX-75W {WD-WCC6Y3NF1PD2}
│└sdd 931.51g [8:48] isw_raid_member
│ ├md126 1.82t [9:126] MD vexternal:/md127/0 raid5 (3) write-pending,
128k Chunk {00000000:-0000-00:00-0000-:000000000000}
│ │ ext4 'HOME' {2f44f848-3665-4663-a0d7-70e19ad99b30}
└scsi 5:0:0:0 ATA WDC WD10EZEX-75W {WD-WCC6Y5KP9A25}
└sde 931.51g [8:64] isw_raid_member
├md126 1.82t [9:126] MD vexternal:/md127/0 raid5 (3) write-pending,
128k Chunk {00000000:-0000-00:00-0000-:000000000000}
│ ext4 'HOME' {2f44f848-3665-4663-a0d7-70e19ad99b30}
PCI [ahci] 05:00.0 SATA controller: ASMedia Technology Inc. ASM1062
Serial ATA Controller (rev 02)
└scsi 6:x:x:x [Empty]
USB [usb-storage] Bus 005 Device 002: ID 1058:1100 Western Digital
Technologies, Inc. My Book Essential Edition 2.0 (WDH1U)
{57442D574D41535530313138313235}
└scsi 8:0:0:0 WD 5000AAC External {WD-WMASU0118125}
└sdf 465.76g [8:80] Partitioned (mac)
USB [usb-storage] Bus 001 Device 016: ID 0781:5595 SanDisk Corp. SanDisk
3.2Gen1 {00010320101721124020}
└scsi 9:0:0:0 USB SanDisk 3.2Gen1 {00010320101721124020}
└sdg 114.56g [8:96] Partitioned (dos)
└sdg1 114.56g [8:97] vfat 'UUI' {FBA7-BB71}
└Mounted as /dev/sdg1 @ /mnt/tmp
USB [usb-storage] Bus 001 Device 015: ID 174c:55aa ASMedia Technology
Inc. ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153
SATA 3Gb/s bridge, ASM1153E SATA 6Gb/s bridge {0000000002F8}
└scsi 10:0:0:0 ST1000DM 003-9YN162 {S1D254AM}
└sdh 931.51g [8:112] Partitioned (dos)
└sdh1 931.51g [8:113] ext4 'BACKUP'
{161c4a60-6abd-458e-ab2b-00c6365095a7}
└Mounted as /dev/sdh1 @ /backup
Other Block Devices
├loop0 0.00k [7:0] Empty/Unknown
├loop1 0.00k [7:1] Empty/Unknown
├loop2 0.00k [7:2] Empty/Unknown
├loop3 0.00k [7:3] Empty/Unknown
├loop4 0.00k [7:4] Empty/Unknown
├loop5 0.00k [7:5] Empty/Unknown
├loop6 0.00k [7:6] Empty/Unknown
└loop7 0.00k [7:7] Empty/Unknown
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-23 10:54 request for help on IMSM-metadata RAID-5 array Joel Parthemore
@ 2023-09-23 11:24 ` Roman Mamedov
2023-09-23 15:18 ` Joel Parthemore
2023-09-25 9:44 ` Mariusz Tkaczyk
1 sibling, 1 reply; 14+ messages in thread
From: Roman Mamedov @ 2023-09-23 11:24 UTC (permalink / raw)
To: Joel Parthemore; +Cc: linux-raid
On Sat, 23 Sep 2023 12:54:52 +0200
Joel Parthemore <joel@parthemores.com> wrote:
> the RAID array looking seemingly okay (according to mdadm -D) BUT this
> time, any attempt to access the RAID array or even just stop the array
> (mdadm --stop /dev/md126, mdadm --stop /dev/md127) once it was started
> would cause the RAID array to lock up. That means (I think) that I can't
> create an image of the array contents using dd, which is what -- of
> course -- I should have done in the first place. (I could assemble the
> RAID array read-only, but the RAID array is out of sync because it
> didn't shut down properly.)
Does accessing the array also lock up when it's assembled read-only?
The out-of-sync should not be a great issue, because only a tiny portion that
was being written at the time of crash would be out of sync. The rest should be
fine, and you could try backing up your important data.
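For the backup itself, something along these lines may be enough (a sketch only;
the mount point and destination paths are assumptions, and the exact assemble
flags may need adjusting for an IMSM container):
mdadm --assemble --scan --readonly         # bring the container and member up read-only
mount -o ro,noload /dev/md126 /mnt/home    # noload skips ext4 journal replay
rsync -a /mnt/home/important/ /backup/important/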
If the answer to the first question is yes, then try a different distro with a newer kernel version.
--
With respect,
Roman
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-23 11:24 ` Roman Mamedov
@ 2023-09-23 15:18 ` Joel Parthemore
2023-09-23 15:35 ` Roman Mamedov
0 siblings, 1 reply; 14+ messages in thread
From: Joel Parthemore @ 2023-09-23 15:18 UTC (permalink / raw)
To: Roman Mamedov; +Cc: linux-raid
On 2023-09-23 at 13:24, Roman Mamedov wrote:
> On Sat, 23 Sep 2023 12:54:52 +0200
> Joel Parthemore <joel@parthemores.com> wrote:
>
>> the RAID array looking seemingly okay (according to mdadm -D) BUT this
>> time, any attempt to access the RAID array or even just stop the array
>> (mdadm --stop /dev/md126, mdadm --stop /dev/md127) once it was started
>> would cause the RAID array to lock up. That means (I think) that I can't
>> create an image of the array contents using dd, which is what -- of
>> course -- I should have done in the first place. (I could assemble the
>> RAID array read-only, but the RAID array is out of sync because it
>> didn't shut down properly.)
> Does accessing the array also lock up when it's assembled read-only?
I didn't want to try that again until I had confirmation that the
out-of-sync wouldn't (or shouldn't) be an issue. (I had tried it once
before, but the system had somehow swapped /dev/md126 and /dev/md127 so
that /dev/md126 became the container and /dev/md127 the RAID-5 array,
which confused me. So I stopped experimenting further until I had a
chance to write to the list.)
The array is assembled read only, and this time both /dev/md126 and
/dev/md127 are looking like I expect them to. I started dd to make a
backup image using dd if=/dev/md126 of=/dev/sdc bs=64K
conv=noerror,sync. (The EXT4 file store on the 2TB RAID-5 array is about
900GB full.) At first, it was running most of the time and just
occasionally in uninterruptible sleep, but the periods of
uninterruptible sleep quickly started getting longer. Now it seems to be
spending most but not quite all of its time in uninterruptible sleep. Is
this some kind of race condition? Anyway, I'll leave it running
overnight to see if it completes.
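(In case it is useful: GNU dd prints its progress so far when sent SIGUSR1, and a
future run can be asked to report continuously with status=progress:)
kill -USR1 "$(pgrep -x dd)"   # ask the running dd to report bytes copied so far
dd if=/dev/md126 of=/dev/sdc bs=64K conv=noerror,sync status=progress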
Accessing the RAID array definitely isn't locking things up this time. I
can go in and look at the partition table, for example, no problem.
Access is awfully slow, but I assume that's because of whatever dd is or
isn't doing.
By the way, I'm using kernel 6.5.3, which isn't the latest (that would
be 6.5.5) but is close.
Thanks for the help,
Joel Parthemore
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-23 15:18 ` Joel Parthemore
@ 2023-09-23 15:35 ` Roman Mamedov
2023-09-23 15:45 ` Joel Parthemore
2023-09-23 18:49 ` Joel Parthemore
0 siblings, 2 replies; 14+ messages in thread
From: Roman Mamedov @ 2023-09-23 15:35 UTC (permalink / raw)
To: Joel Parthemore; +Cc: linux-raid
On Sat, 23 Sep 2023 17:18:00 +0200
Joel Parthemore <joel@parthemores.com> wrote:
> I didn't want to try that again until I had confirmation that the
> out-of-sync wouldn't (or shouldn't) be an issue. (I had tried it once
> before, but the system had somehow swapped /dev/md126 and /dev/md127 so
> that /dev/md126 became the container and /dev/md127 the RAID-5 array,
> which confused me. So I stopped experimenting further until I had a
> chance to write to the list.)
>
> The array is assembled read only, and this time both /dev/md126 and
> /dev/md127 are looking like I expect them to. I started dd to make a
> backup image using dd if=/dev/md126 of=/dev/sdc bs=64K
> conv=noerror,sync. (The EXT4 file store on the 2TB RAID-5 array is about
> 900GB full.) At first, it was running most of the time and just
> occasionally in uninterruptible sleep, but the periods of
> uninterruptible sleep quickly started getting longer. Now it seems to be
> spending most but not quite all of its time in uninterruptible sleep. Is
> this some kind of race condition? Anyway, I'll leave it running
> overnight to see if it completes.
>
> Accessing the RAID array definitely isn't locking things up this time. I
> can go in and look at the partition table, for example, no problem.
> Access is awfully slow, but I assume that's because of whatever dd is or
> isn't doing.
>
> By the way, I'm using kernel 6.5.3, which isn't the latest (that would
> be 6.5.5) but is close.
Maybe it's an HDD issue: one of them did have some unreadable sectors in the
past, although the firmware has not decided to do anything about that, such
as reallocating them and recording it in SMART.
Check whether one of the drives is holding things up, with a command like
iostat -x 2 /dev/sd?
If you see 100% next to one of the drives, and much less for the others, that
one might be the culprit.
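To watch just the device names and their utilisation, something like this can
help (a sketch; %util is the last column in recent sysstat versions, so check
the header on yours first):
iostat -x 2 /dev/sda /dev/sdd /dev/sde | awk '/^sd/ {print $1, $NF}'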
--
With respect,
Roman
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-23 15:35 ` Roman Mamedov
@ 2023-09-23 15:45 ` Joel Parthemore
2023-09-23 18:49 ` Joel Parthemore
1 sibling, 0 replies; 14+ messages in thread
From: Joel Parthemore @ 2023-09-23 15:45 UTC (permalink / raw)
To: Roman Mamedov; +Cc: linux-raid
I have been wondering about HDD issues all along, of course, though I
didn't see any smoking gun.
I ran iostat -x 2 /dev/sdX on all three drives. All show an idle rate of
just under 90%. So I don't think that's the problem.
Joel
On 2023-09-23 at 17:35, Roman Mamedov wrote:
> On Sat, 23 Sep 2023 17:18:00 +0200
> Joel Parthemore <joel@parthemores.com> wrote:
>
>> I didn't want to try that again until I had confirmation that the
>> out-of-sync wouldn't (or shouldn't) be an issue. (I had tried it once
>> before, but the system had somehow swapped /dev/md126 and /dev/md127 so
>> that /dev/md126 became the container and /dev/md127 the RAID-5 array,
>> which confused me. So I stopped experimenting further until I had a
>> chance to write to the list.)
>>
>> The array is assembled read only, and this time both /dev/md126 and
>> /dev/md127 are looking like I expect them to. I started dd to make a
>> backup image using dd if=/dev/md126 of=/dev/sdc bs=64K
>> conv=noerror,sync. (The EXT4 file store on the 2TB RAID-5 array is about
>> 900GB full.) At first, it was running most of the time and just
>> occasionally in uninterruptible sleep, but the periods of
>> uninterruptible sleep quickly started getting longer. Now it seems to be
>> spending most but not quite all of its time in uninterruptible sleep. Is
>> this some kind of race condition? Anyway, I'll leave it running
>> overnight to see if it completes.
>>
>> Accessing the RAID array definitely isn't locking things up this time. I
>> can go in and look at the partition table, for example, no problem.
>> Access is awfully slow, but I assume that's because of whatever dd is or
>> isn't doing.
>>
>> By the way, I'm using kernel 6.5.3, which isn't the latest (that would
>> be 6.5.5) but is close.
> Maybe it's an HDD issue, one of them did have some unreadable sectors in the
> past, although the firmware has not decided to do anything about that, such
> as reallocating them and recording that in SMART.
>
> Check if one of the drives is holding up things, with a command like
>
> iostat -x 2 /dev/sd?
>
> If you see 100% next to one of the drives, and much less for others, that one
> might be culprit.
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-23 15:35 ` Roman Mamedov
2023-09-23 15:45 ` Joel Parthemore
@ 2023-09-23 18:49 ` Joel Parthemore
2023-09-25 1:43 ` Yu Kuai
1 sibling, 1 reply; 14+ messages in thread
From: Joel Parthemore @ 2023-09-23 18:49 UTC (permalink / raw)
To: Roman Mamedov; +Cc: linux-raid
So, dd finally sped up and finished. It appears that I have lost none of
my data. I am a very happy man. A question: is there anything useful I
am likely to discover from keeping the RAID array as it is a bit longer
before I recreate it and copy the data back?
Joel
-----------------------------------------------------------------------------
I have been wondering about HDD issues all along, of course, though I
didn't see any smoking gun.
I ran iostat -x 2 /dev/sdX on all three drives. All show an idle rate of
just under 90%. So I don't think that's the problem.
Joel
On 2023-09-23 at 17:35, Roman Mamedov wrote:
> On Sat, 23 Sep 2023 17:18:00 +0200
> Joel Parthemore <joel@parthemores.com> wrote:
>
>> I didn't want to try that again until I had confirmation that the
>> out-of-sync wouldn't (or shouldn't) be an issue. (I had tried it once
>> before, but the system had somehow swapped /dev/md126 and /dev/md127 so
>> that /dev/md126 became the container and /dev/md127 the RAID-5 array,
>> which confused me. So I stopped experimenting further until I had a
>> chance to write to the list.)
>>
>> The array is assembled read only, and this time both /dev/md126 and
>> /dev/md127 are looking like I expect them to. I started dd to make a
>> backup image using dd if=/dev/md126 of=/dev/sdc bs=64K
>> conv=noerror,sync. (The EXT4 file store on the 2TB RAID-5 array is about
>> 900GB full.) At first, it was running most of the time and just
>> occasionally in uninterruptible sleep, but the periods of
>> uninterruptible sleep quickly started getting longer. Now it seems to be
>> spending most but not quite all of its time in uninterruptible sleep. Is
>> this some kind of race condition? Anyway, I'll leave it running
>> overnight to see if it completes.
>>
>> Accessing the RAID array definitely isn't locking things up this time. I
>> can go in and look at the partition table, for example, no problem.
>> Access is awfully slow, but I assume that's because of whatever dd is or
>> isn't doing.
>>
>> By the way, I'm using kernel 6.5.3, which isn't the latest (that would
>> be 6.5.5) but is close.
> Maybe it's an HDD issue, one of them did have some unreadable sectors
> in the
> past, although the firmware has not decided to do anything about that,
> such
> as reallocating them and recording that in SMART.
>
> Check if one of the drives is holding up things, with a command like
>
> iostat -x 2 /dev/sd?
>
> If you see 100% next to one of the drives, and much less for others,
> that one
> might be culprit.
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-23 18:49 ` Joel Parthemore
@ 2023-09-25 1:43 ` Yu Kuai
2023-09-25 15:57 ` Joel Parthemore
0 siblings, 1 reply; 14+ messages in thread
From: Yu Kuai @ 2023-09-25 1:43 UTC (permalink / raw)
To: Joel Parthemore, Roman Mamedov; +Cc: linux-raid, yukuai (C)
Hi,
On 2023/09/24 2:49, Joel Parthemore wrote:
> So, dd finally sped up and finished. It appears that I have lost none of
> my data. I am a very happy man. A question: is there anything useful I
> am likely to discover from keeping the RAID array as it is a bit longer
> before I recreate it and copy the data back?
It would be very helpful for developers if you could collect the kernel stacks of
all stuck threads (and it would be even better to use addr2line).
Thanks,
Kuai
>
> Joel
>
> -----------------------------------------------------------------------------
>
>
> I have been wondering about HDD issues all along, of course, though I
> didn't see any smoking gun.
>
>
> I ran iostat -x 2 /dev/sdX on all three drives. All show an idle rate of
> just under 90%. So I don't think that's the problem.
>
> Joel
>
> On 2023-09-23 at 17:35, Roman Mamedov wrote:
>> On Sat, 23 Sep 2023 17:18:00 +0200
>> Joel Parthemore <joel@parthemores.com> wrote:
>>
>>> I didn't want to try that again until I had confirmation that the
>>> out-of-sync wouldn't (or shouldn't) be an issue. (I had tried it once
>>> before, but the system had somehow swapped /dev/md126 and /dev/md127 so
>>> that /dev/md126 became the container and /dev/md127 the RAID-5 array,
>>> which confused me. So I stopped experimenting further until I had a
>>> chance to write to the list.)
>>>
>>> The array is assembled read only, and this time both /dev/md126 and
>>> /dev/md127 are looking like I expect them to. I started dd to make a
>>> backup image using dd if=/dev/md126 of=/dev/sdc bs=64K
>>> conv=noerror,sync. (The EXT4 file store on the 2TB RAID-5 array is about
>>> 900GB full.) At first, it was running most of the time and just
>>> occasionally in uninterruptible sleep, but the periods of
>>> uninterruptible sleep quickly started getting longer. Now it seems to be
>>> spending most but not quite all of its time in uninterruptible sleep. Is
>>> this some kind of race condition? Anyway, I'll leave it running
>>> overnight to see if it completes.
>>>
>>> Accessing the RAID array definitely isn't locking things up this time. I
>>> can go in and look at the partition table, for example, no problem.
>>> Access is awfully slow, but I assume that's because of whatever dd is or
>>> isn't doing.
>>>
>>> By the way, I'm using kernel 6.5.3, which isn't the latest (that would
>>> be 6.5.5) but is close.
>> Maybe it's an HDD issue, one of them did have some unreadable sectors
>> in the
>> past, although the firmware has not decided to do anything about that,
>> such
>> as reallocating them and recording that in SMART.
>>
>> Check if one of the drives is holding up things, with a command like
>>
>> iostat -x 2 /dev/sd?
>>
>> If you see 100% next to one of the drives, and much less for others,
>> that one
>> might be culprit.
> .
>
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-23 10:54 request for help on IMSM-metadata RAID-5 array Joel Parthemore
2023-09-23 11:24 ` Roman Mamedov
@ 2023-09-25 9:44 ` Mariusz Tkaczyk
2023-09-25 15:52 ` Joel Parthemore
1 sibling, 1 reply; 14+ messages in thread
From: Mariusz Tkaczyk @ 2023-09-25 9:44 UTC (permalink / raw)
To: Joel Parthemore; +Cc: linux-raid
On Sat, 23 Sep 2023 12:54:52 +0200
Joel Parthemore <joel@parthemores.com> wrote:
> Apologies in advance for the long email, but I wanted to include
> everything that is asked for on the "asking for help" page associated
> with the mailing list. The output from some of the requested commands is
> pretty lengthy.
>
> My home directory is on a three-disk RAID-5 array that, for whatever
> reason (it seemed like a good idea at the time?), I built using the
> hooks from the UEFI BIOS (or so I understand what I did). That is to
> say, it's a "real" software-based RAID array in Linux that's built on a
> "fake" RAID array in the UEFI BIOS. Mostly nothing important is stored
> on the /home partition, but I forgot to back up a few important things
> that are (or, at least, were). So I'd like to get the RAID array back if
> I can, or know if I can't; and I will be extremely grateful to anyone
> who can tell me one way or the other.
>
> All was well for some number of years until a few days ago. After I
> installed the latest KDE updates, the RAID array would lock up entirely
> when I tried to log in to a new KDE Wayland session. It all came down to
> one process that refused to die, running startplasma-wayland. Because
> the process refused to die, the RAID array could not be stopped cleanly
> and rebooting the computer therefore caused the RAID array to go out of
> sync. After that, any attempt whatsoever to access the RAID array would
> cause the RAID array to lock up again.
>
> The first few times this happened, I was able to start the computer
> without starting the RAID array, reassemble the RAID array using the
> command mdadm --assemble --run --force /dev/md126 /dev/sda /dev/sde
> /dev/sdc and have it working fine -- I could fix any filestore problems
> with e2fsck, mount /home, log in to my home directory, do pretty much
> whatever I wanted -- until I tried logging into a new KDE Wayland
> session again. This happened several times while I was trying to
> troubleshoot the problem with startplasma-wayland.
>
> Unfortunately, one time this didn't work. I was still able to start the
> computer without starting the RAID array, reassemble it and reboot with
> the RAID array looking seemingly okay (according to mdadm -D) BUT this
> time, any attempt to access the RAID array or even just stop the array
> (mdadm --stop /dev/md126, mdadm --stop /dev/md127) once it was started
> would cause the RAID array to lock up. That means (I think) that I can't
> create an image of the array contents using dd, which is what -- of
> course -- I should have done in the first place. (I could assemble the
> RAID array read-only, but the RAID array is out of sync because it
> didn't shut down properly.)
>
> I'm guessing that the contents of the filestore on the RAID array are
> probably still there. Does anyone have suggestions on getting the RAID
> array working properly again and accessing them? I have avoided doing
> anything further myself because, of course, if the contents of the
> filestore are still there, I don't want to do anything to jeopardize
> them. You may tell me that I've done too much already. :-)
Hi Joel,
sorry for the late response; I see that you were able to recover the data!
I was off for a few days.
I think the metadata manager is down or broken for some reason.
#systemctl status mdmon@md127.service
If you get the problem again, please try (but do not abuse it; use it only as a
last resort!!):
#systemctl restart mdmon@md127.service
We know that there was a change in systemd which meant that our userspace
metadata manager was not responsive, because it could not be restarted after
switch root. The issue is fixed upstream:
https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/commit/?id=723d1df4946eb40337bf494f9b2549500c1399b2
I didn't read the whole thread, but the issue matches for me.
Hopefully, you will find it useful.
Thanks,
Mariusz
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-25 9:44 ` Mariusz Tkaczyk
@ 2023-09-25 15:52 ` Joel Parthemore
2023-09-25 16:43 ` Mariusz Tkaczyk
0 siblings, 1 reply; 14+ messages in thread
From: Joel Parthemore @ 2023-09-25 15:52 UTC (permalink / raw)
To: Mariusz Tkaczyk; +Cc: linux-raid
On 2023-09-25 at 11:44, Mariusz Tkaczyk wrote:
> Hi Joel,
> sorry for late response, I see that you were able to recover the data!
Yes. I just wanted to proceed as slowly and carefully as possible so
that I would not make an awkward situation worse. ;-) The advice I got
from the list gave me the assurance to go ahead.
> I think that metadata manager is down or broken from some reasons.
> #systemctl status mdmon@md127.service
I forgot to say in my earlier post that I'm not running systemd but
rather openrc. Gentoo supports both, and I have my reasons for
preferring not to make the switch to systemd. Therefore...
> I you will get the problem again, please try (but do not abuse- use it as last
> resort!!):
> #systemctl restart mdmon@md127.service
...That solution won't work. ;-)
If I do have the problem again, maybe I can come to understand better
how/why it's happening.
Joel
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-25 1:43 ` Yu Kuai
@ 2023-09-25 15:57 ` Joel Parthemore
2023-09-26 1:10 ` Yu Kuai
0 siblings, 1 reply; 14+ messages in thread
From: Joel Parthemore @ 2023-09-25 15:57 UTC (permalink / raw)
To: Yu Kuai, Roman Mamedov; +Cc: linux-raid, yukuai (C)
On 2023-09-25 at 03:43, Yu Kuai wrote:
> Hi,
>
> On 2023/09/24 2:49, Joel Parthemore wrote:
>> So, dd finally sped up and finished. It appears that I have lost none
>> of my data. I am a very happy man. A question: is there anything
>> useful I am likely to discover from keeping the RAID array as it is a
>> bit longer before I recreate it and copy the data back?
>
> It'll be much helper for developers to collect kernel stack for all
> stuck thread(and it'll be much better to use add2line).
Presuming I can re-create the problem, let me know what I should do to
collect that information for you. I'm very much a newbie in that area.
Joel
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-25 15:52 ` Joel Parthemore
@ 2023-09-25 16:43 ` Mariusz Tkaczyk
0 siblings, 0 replies; 14+ messages in thread
From: Mariusz Tkaczyk @ 2023-09-25 16:43 UTC (permalink / raw)
To: Joel Parthemore; +Cc: linux-raid
On Mon, 25 Sep 2023 17:52:41 +0200
Joel Parthemore <joel@parthemores.com> wrote:
> On 2023-09-25 at 11:44, Mariusz Tkaczyk wrote:
>
> > Hi Joel,
> > sorry for late response, I see that you were able to recover the data!
>
>
> Yes. I just wanted to proceed as slowly and carefully as possible so
> that I would not make an awkward situation worse. ;-) The advice I got
> from the list gave me the assurance to go ahead.
>
>
> > I think that metadata manager is down or broken from some reasons.
> > #systemctl status mdmon@md127.service
>
>
> I forgot to say with my earlier post that I'm not running systemd but
> rather openrc. Gentoo supports both, and I have my reasons for
> preferring not to make the switch to systemd. Therefore...
>
>
> > I you will get the problem again, please try (but do not abuse- use it as
> > last resort!!):
> > #systemctl restart mdmon@md127.service
>
>
> ...That solution won't work. ;-)
>
> If I do have the problem again, maybe I can come to understand better
> how/why it's happening.
>
Ohh, ok. Please note that the VROC solution is not validated with openrc, and
since we have a userspace metadata manager it requires care even with systemd.
In this case native metadata is the safer option, because the metadata management
is done by the kernel. But... you used this for years with no issues, so it seems
to be not so bad. Anyway, you can always try # mdmon -at to gently ask the
metadata daemon to fork and restart for every IMSM array in your system.
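Without systemd you can still check by hand whether a manager is alive, and
restart it, with something like this (a sketch; the long form should match the
'mdmon -at' above):
pgrep -a mdmon            # is an mdmon instance running for the container?
mdmon --takeover --all    # re-fork mdmon for every IMSM container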
Mariusz
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-25 15:57 ` Joel Parthemore
@ 2023-09-26 1:10 ` Yu Kuai
2023-09-29 19:44 ` Joel Parthemore
0 siblings, 1 reply; 14+ messages in thread
From: Yu Kuai @ 2023-09-26 1:10 UTC (permalink / raw)
To: Joel Parthemore, Yu Kuai, Roman Mamedov; +Cc: linux-raid, yukuai (C)
Hi,
On 2023/09/25 23:57, Joel Parthemore wrote:
>
> On 2023-09-25 at 03:43, Yu Kuai wrote:
>> Hi,
>>
>> On 2023/09/24 2:49, Joel Parthemore wrote:
>>> So, dd finally sped up and finished. It appears that I have lost none
>>> of my data. I am a very happy man. A question: is there anything
>>> useful I am likely to discover from keeping the RAID array as it is a
>>> bit longer before I recreate it and copy the data back?
>>
>> It'll be much helper for developers to collect kernel stack for all
>> stuck thread(and it'll be much better to use add2line).
>
>
> Presuming I can re-create the problem, let me know what I should do to
> collect that information for you. I'm very much a newbie in that area.
You can use the following command:
for pid in `ps -elf | grep " D " | awk '{print $4}'`; do ps $pid; cat
/proc/$pid/stack; done
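(A slightly more robust variant that keys off the process state column directly;
just a sketch, and it needs root to read /proc/<pid>/stack:)
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    ps -o pid=,stat=,comm= -p "$pid"
    cat "/proc/$pid/stack"
done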
Thanks,
Kuai
>
> Joel
>
> .
>
* Re: request for help on IMSM-metadata RAID-5 array
2023-09-26 1:10 ` Yu Kuai
@ 2023-09-29 19:44 ` Joel Parthemore
[not found] ` <a0b8a693-5d9c-d354-5afc-4500b78a983e@huaweicloud.com>
0 siblings, 1 reply; 14+ messages in thread
From: Joel Parthemore @ 2023-09-29 19:44 UTC (permalink / raw)
To: Yu Kuai, Roman Mamedov; +Cc: linux-raid, yukuai (C)
On 2023-09-26 at 03:10, Yu Kuai wrote:
>>> It'll be much helper for developers to collect kernel stack for all
>>> stuck thread(and it'll be much better to use add2line).
>>
>>
>> Presuming I can re-create the problem, let me know what I should do
>> to collect that information for you. I'm very much a newbie in that
>> area.
>
> You can use following cmd:
>
> for pid in `ps -elf | grep " D " | awk '{print $4}'`; do ps $pid; cat
> /proc/$pid/stack; done
>
> Thanks,
> Kuai
for pid in `ps -elf | grep " D " | awk '{print $4}'`; do ps $pid; cat
/proc/$pid/stack; done
PID TTY STAT TIME COMMAND
4017 tty1 D+ 0:00 e2fsck /dev/md126
[<0>] md_write_start+0x191/0x310
[<0>] raid5_make_request+0x90/0x1230 [raid456]
[<0>] md_handle_request+0x129/0x210
[<0>] __submit_bio+0x7f/0x110
[<0>] submit_bio_noacct_nocheck+0x150/0x360
[<0>] __block_write_full_folio+0x1d8/0x3f0
[<0>] writepage_cb+0x11/0x70
[<0>] write_cache_pages+0x13a/0x390
[<0>] do_writepages+0x15b/0x1e0
[<0>] filemap_fdatawrite_wbc+0x5a/0x80
[<0>] __filemap_fdatawrite_range+0x53/0x80
[<0>] filemap_write_and_wait_range+0x3a/0xa0
[<0>] blkdev_put+0x116/0x1c0
[<0>] blkdev_release+0x22/0x30
[<0>] __fput+0xe5/0x290
[<0>] task_work_run+0x51/0x80
[<0>] exit_to_user_mode_prepare+0x132/0x140
[<0>] syscall_exit_to_user_mode+0x1d/0x50
[<0>] do_syscall_64+0x46/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
PID TTY STAT TIME COMMAND
cat: /proc/5266/stack: No such file or directory
--
---------------------------------------------------------------
- Joel Parthemore - philosopher - academic proofreader --------
- English teacher - technology instructor - editor ------------
- "We are all in the gutter, but some of us are looking at the-
- "stars." - Oscar Wilde --------------------------------------
---------------------------------------------------------------
* Re: request for help on IMSM-metadata RAID-5 array
[not found] ` <a0b8a693-5d9c-d354-5afc-4500b78a983e@huaweicloud.com>
@ 2023-10-05 7:28 ` Joel Parthemore
0 siblings, 0 replies; 14+ messages in thread
From: Joel Parthemore @ 2023-10-05 7:28 UTC (permalink / raw)
To: Yu Kuai, Roman Mamedov; +Cc: linux-raid, yukuai (C)
On 2023-10-05 at 04:50, Yu Kuai wrote:
> Hi,
>
>> On 2023/09/30 3:44, Joel Parthemore wrote:
>> On 2023-09-26 at 03:10, Yu Kuai wrote:
>>
>>
>>>>> It'll be much helper for developers to collect kernel stack for
>>>>> all stuck thread(and it'll be much better to use add2line).
>>>>
>>>>
>>>> Presuming I can re-create the problem, let me know what I should do
>>>> to collect that information for you. I'm very much a newbie in that
>>>> area.
>>>
>>> You can use following cmd:
>>>
>>> for pid in `ps -elf | grep " D " | awk '{print $4}'`; do ps $pid;
>>> cat /proc/$pid/stack; done
>>>
>>> Thanks,
>>> Kuai
>>
>>
>> for pid in `ps -elf | grep " D " | awk '{print $4}'`; do ps $pid; cat
>> /proc/$pid/stack; done
>> PID TTY STAT TIME COMMAND
>> 4017 tty1 D+ 0:00 e2fsck /dev/md126
>
> Did you check that this thread was hung and that e2fsck was not making
> progress?
I wasn't trying to run e2fsck at the time. Instead I had run mdadm
--stop /dev/md126, which was hanging. (Sorry; looks like I forgot to
include that!) mdadm -D /dev/md126 showed no re-syncing happening at the time.
Unfortunately, I had not heard back from you, and I felt like I couldn't
keep the problematic RAID array around any longer, so I wiped it and
replaced it with a fully native-Linux RAID5 array instead of one using
IMSM metadata.
Joel