Subject: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
From: Jove
Date: 2023-04-23 19:09 UTC
To: linux-raid

Hi,

I've added two drives to my raid5 array and tried to migrate
it to raid6 with the following command:

mdadm --grow /dev/md0 --raid-devices 4 --level 6 \
  --backup-file=/root/mdadm_raid6_backup.md

This may have been my first mistake, as there are only 5
drives; it should have been --raid-devices 3, I think.

As soon as I started this grow, the filesystems became
unavailable: all processes trying to access files on them hung.
I searched the web, which suggested that a reboot during a
rebuild is not problematic if things shut down cleanly, so I
rebooted. The reboot hung too. The drive activity continued,
so I let it run overnight. I woke up to a rebooted system in
emergency mode, as it could not mount all the partitions on
the raid array.

The OS tried to reassemble the array and succeeded.
However, the udev processes that try to create the /dev
entries hang.

I went back to Google and found out how I could reboot
my system without this automatic assembly.
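
(For anyone following along: on a dracut-based system like this
CentOS install, that boils down to booting with md auto-assembly
disabled in the initramfs, e.g. by adding the kernel command line
parameter rd.md=0, as described in dracut.cmdline(7). I am not
certain this is exactly what I used.)
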
I tried reassembling the array with:

mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0 /dev/md0

This failed with:
No backup metadata on mdadm_raid6_backup.md0
Failed to find final backup of critical section.
Failed to restore critical section for reshape, sorry.

I tried again with:

mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0 \
  --invalid-backup /dev/md0

This said, in addition to the lines above:

continuing without restoring backup

This seemed to succeed in reassembling the array, but it
also hangs indefinitely.
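
If it would help, I can also read the reshape state from sysfs
while the array is stuck (attribute names as documented in
Documentation/admin-guide/md.rst):

# cat /sys/block/md0/md/sync_action
# cat /sys/block/md0/md/reshape_position
# cat /sys/block/md0/md/suspend_lo /sys/block/md0/md/suspend_hi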

/proc/mdstat now shows:

md0 : active (read-only) raid6 sdc1[0] sde[4](S) sdf[5] sdd1[3] sdg1[1]
      7813771264 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
      bitmap: 1/30 pages [4KB], 65536KB chunk

Again, the udev processes trying to access this device hung indefinitely.

Eventually, the kernel dumps this in my journal:

Apr 23 19:17:22 atom kernel: task:systemd-udevd   state:D stack:    0 pid: 8121 ppid:   706 flags:0x00000006
Apr 23 19:17:22 atom kernel: Call Trace:
Apr 23 19:17:22 atom kernel:  <TASK>
Apr 23 19:17:22 atom kernel:  __schedule+0x20a/0x550
Apr 23 19:17:22 atom kernel:  schedule+0x5a/0xc0
Apr 23 19:17:22 atom kernel:  schedule_timeout+0x11f/0x160
Apr 23 19:17:22 atom kernel:  ? make_stripe_request+0x284/0x490 [raid456]
Apr 23 19:17:22 atom kernel:  wait_woken+0x50/0x70
Apr 23 19:17:22 atom kernel:  raid5_make_request+0x2cb/0x3e0 [raid456]
Apr 23 19:17:22 atom kernel:  ? sched_show_numa+0xf0/0xf0
Apr 23 19:17:22 atom kernel:  md_handle_request+0x132/0x1e0
Apr 23 19:17:22 atom kernel:  ? do_mpage_readpage+0x282/0x6b0
Apr 23 19:17:22 atom kernel:  __submit_bio+0x86/0x130
Apr 23 19:17:22 atom kernel:  __submit_bio_noacct+0x81/0x1f0
Apr 23 19:17:22 atom kernel:  mpage_readahead+0x15c/0x1d0
Apr 23 19:17:22 atom kernel:  ? blkdev_write_begin+0x20/0x20
Apr 23 19:17:22 atom kernel:  read_pages+0x58/0x2f0
Apr 23 19:17:22 atom kernel:  page_cache_ra_unbounded+0x137/0x180
Apr 23 19:17:22 atom kernel:  force_page_cache_ra+0xc5/0xf0
Apr 23 19:17:22 atom kernel:  filemap_get_pages+0xe4/0x350
Apr 23 19:17:22 atom kernel:  filemap_read+0xbe/0x3c0
Apr 23 19:17:22 atom kernel:  ? make_kgid+0x13/0x20
Apr 23 19:17:22 atom kernel:  ? deactivate_locked_super+0x90/0xa0
Apr 23 19:17:22 atom kernel:  blkdev_read_iter+0xaf/0x170
Apr 23 19:17:22 atom kernel:  new_sync_read+0xf9/0x180
Apr 23 19:17:22 atom kernel:  vfs_read+0x13c/0x190
Apr 23 19:17:22 atom kernel:  ksys_read+0x5f/0xe0
Apr 23 19:17:22 atom kernel:  do_syscall_64+0x59/0x90
Apr 23 19:17:22 atom kernel:  ? do_user_addr_fault+0x1dd/0x6b0
Apr 23 19:17:22 atom kernel:  ? do_syscall_64+0x69/0x90
Apr 23 19:17:22 atom kernel:  ? exc_page_fault+0x62/0x150
Apr 23 19:17:22 atom kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
Apr 23 19:17:22 atom kernel: RIP: 0033:0x7fb20653eaf2
Apr 23 19:17:22 atom kernel: RSP: 002b:00007ffe1e3e8d28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Apr 23 19:17:22 atom kernel: RAX: ffffffffffffffda RBX: 0000555888b0e0b8 RCX: 00007fb20653eaf2
Apr 23 19:17:22 atom kernel: RDX: 0000000000000040 RSI: 0000555888b0e0c8 RDI: 000000000000000d
Apr 23 19:17:22 atom kernel: RBP: 0000555888ad64e0 R08: 0000000000000000 R09: 0000000000000000
Apr 23 19:17:22 atom kernel: R10: 0000000000000010 R11: 0000000000000246 R12: 00000746f2bf0000
Apr 23 19:17:22 atom kernel: R13: 0000000000000040 R14: 0000555888b0e0a0 R15: 0000555888ad6530
Apr 23 19:17:22 atom kernel:  </TASK>

Any help to recover the data on my array would be much appreciated.

Additional system and drive information below.

Thank you for your attention,

    Johan

This is the kernel stack of the hung mdadm command:
# cat /proc/8110/stack
[<0>] mddev_suspend+0x14f/0x180
[<0>] suspend_lo_store+0x60/0xb0
[<0>] md_attr_store+0x80/0xf0
[<0>] kernfs_fop_write_iter+0x121/0x1b0
[<0>] new_sync_write+0xfc/0x190
[<0>] vfs_write+0x1ef/0x280
[<0>] ksys_write+0x5f/0xe0
[<0>] do_syscall_64+0x59/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

# cat /etc/centos-release
CentOS Stream release 9

# uname -a
Linux atom 5.14.0-299.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 13
10:08:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

# mdadm --version
mdadm - v4.2 - 2021-12-30 - 8

# mdadm -D /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sat Oct 21 01:57:20 2017
        Raid Level : raid6
        Array Size : 7813771264 (7.28 TiB 8.00 TB)
     Used Dev Size : 3906885632 (3.64 TiB 4.00 TB)
      Raid Devices : 4
     Total Devices : 5
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun Apr 23 10:32:01 2023
             State : clean, degraded
    Active Devices : 3
   Working Devices : 5
    Failed Devices : 0
     Spare Devices : 2

            Layout : left-symmetric-6
        Chunk Size : 512K

Consistency Policy : bitmap

        New Layout : left-symmetric

              Name : atom:0  (local to host atom)
              UUID : 8c56384e:ba1a3cec:aaf34c17:d0cd9318
            Events : 669453

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       97        1      active sync   /dev/sdg1
       3       8       49        2      active sync   /dev/sdd1
       5       8       80        3      spare rebuilding   /dev/sdf

       4       8       64        -      spare   /dev/sde

# mdadm --examine /dev/sdc1
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x5
     Array UUID : 8c56384e:ba1a3cec:aaf34c17:d0cd9318
           Name : atom:0  (local to host atom)
  Creation Time : Sat Oct 21 01:57:20 2017
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 7813771264 sectors (3.64 TiB 4.00 TB)
     Array Size : 7813771264 KiB (7.28 TiB 8.00 TB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : e6cbce38:ce3a1997:254cd445:65a67d5d

Internal Bitmap : 8 sectors from superblock
  Reshape pos'n : 3473357824 (3.23 TiB 3.56 TB)
     New Layout : left-symmetric

    Update Time : Sun Apr 23 10:32:01 2023
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : f3ffb20c - correct
         Events : 669453

         Layout : left-symmetric-6
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

# mdadm --examine /dev/sdg1
/dev/sdg1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x5
     Array UUID : 8c56384e:ba1a3cec:aaf34c17:d0cd9318
           Name : atom:0  (local to host atom)
  Creation Time : Sat Oct 21 01:57:20 2017
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 7813771264 sectors (3.64 TiB 4.00 TB)
     Array Size : 7813771264 KiB (7.28 TiB 8.00 TB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : 9c130a77:d12da8fa:ca8a2e59:4778168e

Internal Bitmap : 8 sectors from superblock
  Reshape pos'n : 3473357824 (3.23 TiB 3.56 TB)
     New Layout : left-symmetric

    Update Time : Sun Apr 23 10:32:01 2023
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : d9bcfd4e - correct
         Events : 669453

         Layout : left-symmetric-6
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

# mdadm --examine /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x5
     Array UUID : 8c56384e:ba1a3cec:aaf34c17:d0cd9318
           Name : atom:0  (local to host atom)
  Creation Time : Sat Oct 21 01:57:20 2017
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 7813771264 sectors (3.64 TiB 4.00 TB)
     Array Size : 7813771264 KiB (7.28 TiB 8.00 TB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : c298e079:1f616f66:3e4c5df6:cb942253

Internal Bitmap : 8 sectors from superblock
  Reshape pos'n : 3473357824 (3.23 TiB 3.56 TB)
     New Layout : left-symmetric

    Update Time : Sun Apr 23 10:32:01 2023
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : aa9593c4 - correct
         Events : 669453

         Layout : left-symmetric-6
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

# mdadm --examine /dev/sdf
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x7
     Array UUID : 8c56384e:ba1a3cec:aaf34c17:d0cd9318
           Name : atom:0  (local to host atom)
  Creation Time : Sat Oct 21 01:57:20 2017
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 7813775024 sectors (3.64 TiB 4.00 TB)
     Array Size : 7813771264 KiB (7.28 TiB 8.00 TB)
  Used Dev Size : 7813771264 sectors (3.64 TiB 4.00 TB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
Recovery Offset : 3473357824 sectors
   Unused Space : before=262064 sectors, after=3760 sectors
          State : clean
    Device UUID : 277110b0:d174c17a:3bac9963:405bf18e

Internal Bitmap : 8 sectors from superblock
  Reshape pos'n : 3473357824 (3.23 TiB 3.56 TB)
     New Layout : left-symmetric

    Update Time : Sun Apr 23 10:32:01 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : 6d29f0ca - correct
         Events : 669453

         Layout : left-symmetric-6
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

# mdadm --examine /dev/sde
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x5
     Array UUID : 8c56384e:ba1a3cec:aaf34c17:d0cd9318
           Name : atom:0  (local to host atom)
  Creation Time : Sat Oct 21 01:57:20 2017
     Raid Level : raid6
   Raid Devices : 4

 Avail Dev Size : 7813775024 sectors (3.64 TiB 4.00 TB)
     Array Size : 7813771264 KiB (7.28 TiB 8.00 TB)
  Used Dev Size : 7813771264 sectors (3.64 TiB 4.00 TB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=3760 sectors
          State : clean
    Device UUID : 000ceb71:ab7291e6:5721b832:5003c849

Internal Bitmap : 8 sectors from superblock
  Reshape pos'n : 3473357824 (3.23 TiB 3.56 TB)
     New Layout : left-symmetric

    Update Time : Sun Apr 23 10:32:01 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : 55c26aa5 - correct
         Events : 669453

         Layout : left-symmetric-6
     Chunk Size : 512K

   Device Role : spare
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

# smartctl --xall /dev/sdc1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.14.0-299.el9.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K6DNPVFP
LU WWN Device Id: 5 0014ee 20f133383
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Apr 23 20:48:22 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (43740) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 464) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x303d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   192   162   021    -    5375
  4 Start_Stop_Count        -O--CK   100   100   000    -    83
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   041   041   000    -    43289
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    83
192 Power-Off_Retract_Count -O--CK   200   200   000    -    65
193 Load_Cycle_Count        -O--CK   200   200   000    -    193
194 Temperature_Celsius     -O---K   115   096   000    -    35
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    34
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      56  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     43107         -
# 2  Short offline       Completed without error       00%     42939         -
# 3  Short offline       Completed without error       00%     42771         -
# 4  Extended offline    Completed without error       00%     42760         -
# 5  Short offline       Completed without error       00%     42437         -
# 6  Short offline       Completed without error       00%     42269         -
# 7  Short offline       Completed without error       00%     42101         -
# 8  Extended offline    Completed without error       00%     42017         -
# 9  Short offline       Completed without error       00%     41933         -
#10  Short offline       Completed without error       00%     41765         -
#11  Short offline       Completed without error       00%     41598         -
#12  Short offline       Completed without error       00%     41430         -
#13  Extended offline    Completed without error       00%     41346         -
#14  Short offline       Completed without error       00%     41262         -
#15  Short offline       Completed without error       00%     41094         -
#16  Short offline       Completed without error       00%     40926         -
#17  Short offline       Completed without error       00%     40759         -
#18  Extended offline    Completed without error       00%     40602         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    35 Celsius
Power Cycle Min/Max Temperature:     35/38 Celsius
Lifetime    Min/Max Temperature:     20/54 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (233)

Index    Estimated Time   Temperature Celsius
 234    2023-04-23 12:51    36  *****************
 ...    ..( 20 skipped).    ..  *****************
 255    2023-04-23 13:12    36  *****************
 256    2023-04-23 13:13    35  ****************
 ...    ..( 20 skipped).    ..  ****************
 277    2023-04-23 13:34    35  ****************
 278    2023-04-23 13:35    36  *****************
 ...    ..( 22 skipped).    ..  *****************
 301    2023-04-23 13:58    36  *****************
 302    2023-04-23 13:59    37  ******************
 ...    ..( 10 skipped).    ..  ******************
 313    2023-04-23 14:10    37  ******************
 314    2023-04-23 14:11    36  *****************
 315    2023-04-23 14:12    37  ******************
 ...    ..( 44 skipped).    ..  ******************
 360    2023-04-23 14:57    37  ******************
 361    2023-04-23 14:58    38  *******************
 ...    ..( 11 skipped).    ..  *******************
 373    2023-04-23 15:10    38  *******************
 374    2023-04-23 15:11    37  ******************
 ...    ..( 54 skipped).    ..  ******************
 429    2023-04-23 16:06    37  ******************
 430    2023-04-23 16:07    36  *****************
 ...    ..( 38 skipped).    ..  *****************
 469    2023-04-23 16:46    36  *****************
 470    2023-04-23 16:47    35  ****************
 ...    ..( 50 skipped).    ..  ****************
  43    2023-04-23 17:38    35  ****************
  44    2023-04-23 17:39    38  *******************
 ...    ..( 71 skipped).    ..  *******************
 116    2023-04-23 18:51    38  *******************
 117    2023-04-23 18:52    39  ********************
 ...    ..( 48 skipped).    ..  ********************
 166    2023-04-23 19:41    39  ********************
 167    2023-04-23 19:42    38  *******************
 ...    ..(  4 skipped).    ..  *******************
 172    2023-04-23 19:47    38  *******************
 173    2023-04-23 19:48    37  ******************
 ...    ..(  5 skipped).    ..  ******************
 179    2023-04-23 19:54    37  ******************
 180    2023-04-23 19:55    36  *****************
 ...    ..(  5 skipped).    ..  *****************
 186    2023-04-23 20:01    36  *****************
 187    2023-04-23 20:02     ?  -
 188    2023-04-23 20:03    33  **************
 ...    ..(  4 skipped).    ..  **************
 193    2023-04-23 20:08    33  **************
 194    2023-04-23 20:09     ?  -
 195    2023-04-23 20:10    34  ***************
 ...    ..(  8 skipped).    ..  ***************
 204    2023-04-23 20:19    34  ***************
 205    2023-04-23 20:20    35  ****************
 ...    ..( 15 skipped).    ..  ****************
 221    2023-04-23 20:36    35  ****************
 222    2023-04-23 20:37     ?  -
 223    2023-04-23 20:38    35  ****************
 224    2023-04-23 20:39     ?  -
 225    2023-04-23 20:40    35  ****************
 226    2023-04-23 20:41    35  ****************
 227    2023-04-23 20:42     ?  -
 228    2023-04-23 20:43    36  *****************
 ...    ..(  4 skipped).    ..  *****************
 233    2023-04-23 20:48    36  *****************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              83  ---  Lifetime Power-On Resets
0x01  0x010  4           43289  ---  Power-on Hours
0x01  0x018  6     15614407059  ---  Logical Sectors Written
0x01  0x020  6       599311580  ---  Number of Write Commands
0x01  0x028  6    908162628478  ---  Logical Sectors Read
0x01  0x030  6      2826279430  ---  Number of Read Commands
0x01  0x038  6      1221577344  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4           47885  ---  Spindle Motor Power-on Hours
0x03  0x010  4           47296  ---  Head Flying Hours
0x03  0x018  4             259  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              65  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               1  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              36  ---  Current Temperature
0x05  0x010  1              37  ---  Average Short Term Temperature
0x05  0x018  1              34  ---  Average Long Term Temperature
0x05  0x020  1              54  ---  Highest Temperature
0x05  0x028  1              23  ---  Lowest Temperature
0x05  0x030  1              52  ---  Highest Average Short Term Temperature
0x05  0x038  1              27  ---  Lowest Average Short Term Temperature
0x05  0x040  1              44  ---  Highest Average Long Term Temperature
0x05  0x048  1              31  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             809  ---  Number of Hardware Resets
0x06  0x010  4             374  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           88  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           82  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        18432  Vendor specific

# smartctl --xall /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.14.0-299.el9.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K3EXJ3S7
LU WWN Device Id: 5 0014ee 264687983
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Apr 23 20:48:54 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (43440) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 462) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x303d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   185   156   021    -    5716
  4 Start_Stop_Count        -O--CK   100   100   000    -    83
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   043   043   000    -    42100
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    83
192 Power-Off_Retract_Count -O--CK   200   200   000    -    65
193 Load_Cycle_Count        -O--CK   200   200   000    -    199
194 Temperature_Celsius     -O---K   117   102   000    -    33
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      56  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     41918         -
# 2  Short offline       Completed without error       00%     41750         -
# 3  Short offline       Completed without error       00%     41582         -
# 4  Extended offline    Completed without error       00%     41570         -
# 5  Short offline       Completed without error       00%     41248         -
# 6  Short offline       Completed without error       00%     41080         -
# 7  Short offline       Completed without error       00%     40912         -
# 8  Extended offline    Completed without error       00%     40828         -
# 9  Short offline       Completed without error       00%     40744         -
#10  Short offline       Completed without error       00%     40576         -
#11  Short offline       Completed without error       00%     40408         -
#12  Short offline       Completed without error       00%     40241         -
#13  Extended offline    Completed without error       00%     40157         -
#14  Short offline       Completed without error       00%     40073         -
#15  Short offline       Completed without error       00%     41098         -
#16  Short offline       Completed without error       00%     40930         -
#17  Short offline       Completed without error       00%     40762         -
#18  Extended offline    Completed without error       00%     40606         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    33 Celsius
Power Cycle Min/Max Temperature:     33/36 Celsius
Lifetime    Min/Max Temperature:     19/48 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (453)

Index    Estimated Time   Temperature Celsius
 454    2023-04-23 12:51    35  ****************
 ...    ..( 11 skipped).    ..  ****************
 466    2023-04-23 13:03    35  ****************
 467    2023-04-23 13:04    34  ***************
 ...    ..( 32 skipped).    ..  ***************
  22    2023-04-23 13:37    34  ***************
  23    2023-04-23 13:38    35  ****************
 ...    ..( 60 skipped).    ..  ****************
  84    2023-04-23 14:39    35  ****************
  85    2023-04-23 14:40    36  *****************
 ...    ..( 48 skipped).    ..  *****************
 134    2023-04-23 15:29    36  *****************
 135    2023-04-23 15:30    35  ****************
 ...    ..( 44 skipped).    ..  ****************
 180    2023-04-23 16:15    35  ****************
 181    2023-04-23 16:16    34  ***************
 ...    ..( 69 skipped).    ..  ***************
 251    2023-04-23 17:26    34  ***************
 252    2023-04-23 17:27    33  **************
 ...    ..( 10 skipped).    ..  **************
 263    2023-04-23 17:38    33  **************
 264    2023-04-23 17:39    36  *****************
 ...    ..(140 skipped).    ..  *****************
 405    2023-04-23 20:00    36  *****************
 406    2023-04-23 20:01     ?  -
 407    2023-04-23 20:02    31  ************
 ...    ..(  4 skipped).    ..  ************
 412    2023-04-23 20:07    31  ************
 413    2023-04-23 20:08     ?  -
 414    2023-04-23 20:09    32  *************
 ...    ..(  3 skipped).    ..  *************
 418    2023-04-23 20:13    32  *************
 419    2023-04-23 20:14    33  **************
 ...    ..(  8 skipped).    ..  **************
 428    2023-04-23 20:23    33  **************
 429    2023-04-23 20:24    34  ***************
 ...    ..( 10 skipped).    ..  ***************
 440    2023-04-23 20:35    34  ***************
 441    2023-04-23 20:36     ?  -
 442    2023-04-23 20:37    34  ***************
 443    2023-04-23 20:38     ?  -
 444    2023-04-23 20:39    34  ***************
 445    2023-04-23 20:40    34  ***************
 446    2023-04-23 20:41     ?  -
 447    2023-04-23 20:42    35  ****************
 ...    ..(  5 skipped).    ..  ****************
 453    2023-04-23 20:48    35  ****************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              83  ---  Lifetime Power-On Resets
0x01  0x010  4           42100  ---  Power-on Hours
0x01  0x018  6     15407958681  ---  Logical Sectors Written
0x01  0x020  6       595027021  ---  Number of Write Commands
0x01  0x028  6    908203824645  ---  Logical Sectors Read
0x01  0x030  6      2834811358  ---  Number of Read Commands
0x01  0x038  6      1236144640  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4           47888  ---  Spindle Motor Power-on Hours
0x03  0x010  4           47298  ---  Head Flying Hours
0x03  0x018  4             265  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4              11  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              65  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              34  ---  Current Temperature
0x05  0x010  1              35  ---  Average Short Term Temperature
0x05  0x018  1              30  ---  Average Long Term Temperature
0x05  0x020  1              48  ---  Highest Temperature
0x05  0x028  1              22  ---  Lowest Temperature
0x05  0x030  1              46  ---  Highest Average Short Term Temperature
0x05  0x038  1              24  ---  Lowest Average Short Term Temperature
0x05  0x040  1              39  ---  Highest Average Long Term Temperature
0x05  0x048  1              26  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             793  ---  Number of Hardware Resets
0x06  0x010  4             332  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           88  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           82  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        18464  Vendor specific

# smartctl --xall /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.14.0-299.el9.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E0645620
LU WWN Device Id: 5 0014ee 2b438eb78
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Apr 23 20:49:30 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (55560) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 555) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    12
  3 Spin_Up_Time            POS--K   190   177   021    -    7458
  4 Start_Stop_Count        -O--CK   097   097   000    -    3213
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   014   014   000    -    62899
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    111
192 Power-Off_Retract_Count -O--CK   200   200   000    -    78
193 Load_Cycle_Count        -O--CK   198   198   000    -    6721
194 Temperature_Celsius     -O---K   117   097   000    -    35
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 14
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 14 [13] occurred at disk power-on lifetime: 21182 hours (882 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 b7 3e 3a b8 40 00  Error: UNC at LBA = 0x1b73e3ab8 = 7369276088

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 d8 00 01 b7 3e 3a f0 40 08  8d+09:09:02.085  READ FPDMA QUEUED
  60 00 08 00 d0 00 01 b7 3e 3b 10 40 08  8d+09:09:02.085  READ FPDMA QUEUED
  60 00 08 00 c8 00 01 b7 3e 3a e8 40 08  8d+09:09:02.085  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 b7 3e 3a e0 40 08  8d+09:09:02.085  READ FPDMA QUEUED
  60 00 08 00 b8 00 01 b7 3e 3a d8 40 08  8d+09:09:02.084  READ FPDMA QUEUED

Error 13 [12] occurred at disk power-on lifetime: 21182 hours (882 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 b7 3e 3a b8 40 00  Error: UNC at LBA = 0x1b73e3ab8 = 7369276088

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 d8 00 01 b7 3e 3b a8 40 08  8d+09:08:58.566  READ FPDMA QUEUED
  60 00 08 00 d0 00 01 b7 3e 3b a0 40 08  8d+09:08:58.566  READ FPDMA QUEUED
  60 00 08 00 c8 00 01 b7 3e 3b 98 40 08  8d+09:08:58.566  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 b7 3e 3b 90 40 08  8d+09:08:58.566  READ FPDMA QUEUED
  60 00 08 00 b8 00 01 b7 3e 3b 88 40 08  8d+09:08:58.566  READ FPDMA QUEUED

Error 12 [11] occurred at disk power-on lifetime: 21182 hours (882 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 b7 3e 3a b8 40 00  Error: UNC at LBA = 0x1b73e3ab8 = 7369276088

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 d8 00 01 b7 3e 3a f0 40 08  8d+09:08:55.047  READ FPDMA QUEUED
  60 00 08 00 d0 00 01 b7 3e 3b 10 40 08  8d+09:08:55.047  READ FPDMA QUEUED
  60 00 08 00 c8 00 01 b7 3e 3a e8 40 08  8d+09:08:55.047  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 b7 3e 3a e0 40 08  8d+09:08:55.047  READ FPDMA QUEUED
  60 00 08 00 b8 00 01 b7 3e 3a d8 40 08  8d+09:08:55.047  READ FPDMA QUEUED

Error 11 [10] occurred at disk power-on lifetime: 21182 hours (882 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 b7 3e 3a b8 40 00  Error: UNC at LBA = 0x1b73e3ab8 = 7369276088

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 d8 00 01 b7 3e 3b a8 40 08  8d+09:08:51.528  READ FPDMA QUEUED
  60 00 08 00 d0 00 01 b7 3e 3b a0 40 08  8d+09:08:51.528  READ FPDMA QUEUED
  60 00 08 00 c8 00 01 b7 3e 3b 98 40 08  8d+09:08:51.528  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 b7 3e 3b 90 40 08  8d+09:08:51.528  READ FPDMA QUEUED
  60 00 08 00 b8 00 01 b7 3e 3b 88 40 08  8d+09:08:51.528  READ FPDMA QUEUED

Error 10 [9] occurred at disk power-on lifetime: 21182 hours (882 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 b7 3e 3a b8 40 00  Error: UNC at LBA = 0x1b73e3ab8 = 7369276088

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 d8 00 01 b7 3e 3a f0 40 08  8d+09:08:48.010  READ FPDMA QUEUED
  60 00 08 00 d0 00 01 b7 3e 3b 10 40 08  8d+09:08:48.010  READ FPDMA QUEUED
  60 00 08 00 c8 00 01 b7 3e 3a e8 40 08  8d+09:08:48.010  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 b7 3e 3a e0 40 08  8d+09:08:48.010  READ FPDMA QUEUED
  60 00 08 00 b8 00 01 b7 3e 3a d8 40 08  8d+09:08:48.010  READ FPDMA QUEUED

Error 9 [8] occurred at disk power-on lifetime: 21182 hours (882 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 b7 3e 3a b8 40 00  Error: UNC at LBA = 0x1b73e3ab8 = 7369276088

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 d8 00 01 b7 3e 3b a8 40 08  8d+09:08:44.517  READ FPDMA QUEUED
  60 00 08 00 d0 00 01 b7 3e 3b a0 40 08  8d+09:08:44.516  READ FPDMA QUEUED
  60 00 08 00 c8 00 01 b7 3e 3b 98 40 08  8d+09:08:44.509  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 b7 3e 3b 90 40 08  8d+09:08:44.502  READ FPDMA QUEUED
  60 00 08 00 b8 00 01 b7 3e 3b 88 40 08  8d+09:08:44.495  READ FPDMA QUEUED

Error 8 [7] occurred at disk power-on lifetime: 21182 hours (882 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 b7 3e 3a b8 40 00  Error: UNC at LBA = 0x1b73e3ab8 = 7369276088

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 90 00 01 b7 3e 3e 08 40 08  8d+09:08:40.925  READ FPDMA QUEUED
  60 04 00 00 88 00 01 b7 3e 42 08 40 08  8d+09:08:40.925  READ FPDMA QUEUED
  60 04 00 00 80 00 01 b7 3e 46 08 40 08  8d+09:08:40.925  READ FPDMA QUEUED
  60 04 00 00 78 00 01 b7 3e 62 08 40 08  8d+09:08:40.925  READ FPDMA QUEUED
  60 04 00 00 70 00 01 b7 3e 66 08 40 08  8d+09:08:40.925  READ FPDMA QUEUED

Error 7 [6] occurred at disk power-on lifetime: 21182 hours (882 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 b7 3e 3a b8 40 00  Error: UNC at LBA = 0x1b73e3ab8 = 7369276088

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 18 00 01 b7 3e 62 08 40 08  8d+09:08:37.429  READ FPDMA QUEUED
  60 04 00 00 10 00 01 b7 3e 46 08 40 08  8d+09:08:37.429  READ FPDMA QUEUED
  60 04 00 00 08 00 01 b7 3e 42 08 40 08  8d+09:08:37.429  READ FPDMA QUEUED
  60 04 00 00 00 00 01 b7 3e 3e 08 40 08  8d+09:08:37.429  READ FPDMA QUEUED
  60 04 00 00 f0 00 01 b7 3e 3a 08 40 08  8d+09:08:37.429  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     62717         -
# 2  Short offline       Completed without error       00%     62549         -
# 3  Short offline       Completed without error       00%     62382         -
# 4  Extended offline    Completed without error       00%     62373         -
# 5  Short offline       Completed without error       00%     62047         -
# 6  Short offline       Completed without error       00%     61879         -
# 7  Short offline       Completed without error       00%     61711         -
# 8  Extended offline    Completed without error       00%     61631         -
# 9  Short offline       Completed without error       00%     61544         -
#10  Short offline       Completed without error       00%     61376         -
#11  Short offline       Completed without error       00%     61208         -
#12  Short offline       Completed without error       00%     61040         -
#13  Extended offline    Completed without error       00%     60959         -
#14  Short offline       Completed without error       00%     60872         -
#15  Short offline       Completed without error       00%     61897         -
#16  Short offline       Completed without error       00%     61730         -
#17  Short offline       Completed without error       00%     61562         -
#18  Extended offline    Completed without error       00%     61409         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    35 Celsius
Power Cycle Min/Max Temperature:     35/38 Celsius
Lifetime    Min/Max Temperature:     16/55 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (211)

Index    Estimated Time   Temperature Celsius
 212    2023-04-23 12:52    36  *****************
 ...    ..( 17 skipped).    ..  *****************
 230    2023-04-23 13:10    36  *****************
 231    2023-04-23 13:11    37  ******************
 ...    ..( 61 skipped).    ..  ******************
 293    2023-04-23 14:13    37  ******************
 294    2023-04-23 14:14    38  *******************
 ...    ..( 22 skipped).    ..  *******************
 317    2023-04-23 14:37    38  *******************
 318    2023-04-23 14:38    37  ******************
 319    2023-04-23 14:39    37  ******************
 320    2023-04-23 14:40    38  *******************
 ...    ..(  9 skipped).    ..  *******************
 330    2023-04-23 14:50    38  *******************
 331    2023-04-23 14:51    37  ******************
 ...    ..( 52 skipped).    ..  ******************
 384    2023-04-23 15:44    37  ******************
 385    2023-04-23 15:45    36  *****************
 ...    ..( 57 skipped).    ..  *****************
 443    2023-04-23 16:43    36  *****************
 444    2023-04-23 16:44    35  ****************
 ...    ..( 20 skipped).    ..  ****************
 465    2023-04-23 17:05    35  ****************
 466    2023-04-23 17:06    38  *******************
 ...    ..(109 skipped).    ..  *******************
  98    2023-04-23 18:56    38  *******************
  99    2023-04-23 18:57    39  ********************
 ...    ..( 28 skipped).    ..  ********************
 128    2023-04-23 19:26    39  ********************
 129    2023-04-23 19:27    38  *******************
 130    2023-04-23 19:28     ?  -
 131    2023-04-23 19:29    33  **************
 ...    ..(  2 skipped).    ..  **************
 134    2023-04-23 19:32    33  **************
 135    2023-04-23 19:33     ?  -
 136    2023-04-23 19:34    34  ***************
 ...    ..(  3 skipped).    ..  ***************
 140    2023-04-23 19:38    34  ***************
 141    2023-04-23 19:39    35  ****************
 ...    ..(  8 skipped).    ..  ****************
 150    2023-04-23 19:48    35  ****************
 151    2023-04-23 19:49    36  *****************
 ...    ..( 12 skipped).    ..  *****************
 164    2023-04-23 20:02    36  *****************
 165    2023-04-23 20:03     ?  -
 166    2023-04-23 20:04    36  *****************
 167    2023-04-23 20:05     ?  -
 168    2023-04-23 20:06    36  *****************
 169    2023-04-23 20:07    36  *****************
 170    2023-04-23 20:08     ?  -
 171    2023-04-23 20:09    37  ******************
 172    2023-04-23 20:10    36  *****************
 ...    ..( 29 skipped).    ..  *****************
 202    2023-04-23 20:40    36  *****************
 203    2023-04-23 20:41    35  ****************
 ...    ..(  2 skipped).    ..  ****************
 206    2023-04-23 20:44    35  ****************
 207    2023-04-23 20:45    36  *****************
 ...    ..(  3 skipped).    ..  *****************
 211    2023-04-23 20:49    36  *****************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           87  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           73  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        18495  Vendor specific

# smartctl --xall /dev/sde
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.14.0-299.el9.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40EFPX-68C6CN0
Serial Number:    WD-WXK2AA2HCDY2
LU WWN Device Id: 5 0014ee 26ada4de8
Firmware Version: 81.00A81
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Apr 23 20:51:17 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine
                    completed without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (42000) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 437) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x3039)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   253   051    -    0
  3 Spin_Up_Time            POS--K   207   207   021    -    2625
  4 Start_Stop_Count        -O--CK   100   100   000    -    7
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    22
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    6
192 Power-Off_Retract_Count -O--CK   200   200   000    -    4
193 Load_Cycle_Count        -O--CK   200   200   000    -    15
194 Temperature_Celsius     -O---K   120   111   000    -    27
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x24       GPL     R/O    307  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      78  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    27 Celsius
Power Cycle Min/Max Temperature:     27/29 Celsius
Lifetime    Min/Max Temperature:     20/29 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (361)

Index    Estimated Time   Temperature Celsius
 362    2023-04-23 12:54    26  *******
 ...    ..(146 skipped).    ..  *******
  31    2023-04-23 15:21    26  *******
  32    2023-04-23 15:22     ?  -
  33    2023-04-23 15:23    25  ******
  34    2023-04-23 15:24    24  *****
  35    2023-04-23 15:25    24  *****
  36    2023-04-23 15:26    25  ******
 ...    ..(  3 skipped).    ..  ******
  40    2023-04-23 15:30    25  ******
  41    2023-04-23 15:31     ?  -
  42    2023-04-23 15:32    26  *******
 ...    ..(  4 skipped).    ..  *******
  47    2023-04-23 15:37    26  *******
  48    2023-04-23 15:38    27  ********
 ...    ..(  5 skipped).    ..  ********
  54    2023-04-23 15:44    27  ********
  55    2023-04-23 15:45     ?  -
  56    2023-04-23 15:46    27  ********
  57    2023-04-23 15:47     ?  -
  58    2023-04-23 15:48    27  ********
 ...    ..(  4 skipped).    ..  ********
  63    2023-04-23 15:53    27  ********
  64    2023-04-23 15:54     ?  -
  65    2023-04-23 15:55    28  *********
  66    2023-04-23 15:56    27  ********
 ...    ..(  2 skipped).    ..  ********
  69    2023-04-23 15:59    27  ********
  70    2023-04-23 16:00    28  *********
 ...    ..( 13 skipped).    ..  *********
  84    2023-04-23 16:14    28  *********
  85    2023-04-23 16:15    27  ********
 ...    ..( 19 skipped).    ..  ********
 105    2023-04-23 16:35    27  ********
 106    2023-04-23 16:36    28  *********
 ...    ..(  3 skipped).    ..  *********
 110    2023-04-23 16:40    28  *********
 111    2023-04-23 16:41    27  ********
 112    2023-04-23 16:42    28  *********
 113    2023-04-23 16:43    27  ********
 ...    ..( 14 skipped).    ..  ********
 128    2023-04-23 16:58    27  ********
 129    2023-04-23 16:59    28  *********
 ...    ..(  3 skipped).    ..  *********
 133    2023-04-23 17:03    28  *********
 134    2023-04-23 17:04    27  ********
 135    2023-04-23 17:05    27  ********
 136    2023-04-23 17:06    27  ********
 137    2023-04-23 17:07    28  *********
 ...    ..( 16 skipped).    ..  *********
 154    2023-04-23 17:24    28  *********
 155    2023-04-23 17:25    27  ********
 ...    ..(  4 skipped).    ..  ********
 160    2023-04-23 17:30    27  ********
 161    2023-04-23 17:31    28  *********
 ...    ..( 15 skipped).    ..  *********
 177    2023-04-23 17:47    28  *********
 178    2023-04-23 17:48    29  **********
 179    2023-04-23 17:49    29  **********
 180    2023-04-23 17:50    28  *********
 ...    ..(  5 skipped).    ..  *********
 186    2023-04-23 17:56    28  *********
 187    2023-04-23 17:57    29  **********
 188    2023-04-23 17:58    28  *********
 ...    ..(  4 skipped).    ..  *********
 193    2023-04-23 18:03    28  *********
 194    2023-04-23 18:04    29  **********
 195    2023-04-23 18:05    29  **********
 196    2023-04-23 18:06    28  *********
 ...    ..(  3 skipped).    ..  *********
 200    2023-04-23 18:10    28  *********
 201    2023-04-23 18:11    29  **********
 ...    ..(  6 skipped).    ..  **********
 208    2023-04-23 18:18    29  **********
 209    2023-04-23 18:19    28  *********
 ...    ..(  9 skipped).    ..  *********
 219    2023-04-23 18:29    28  *********
 220    2023-04-23 18:30    29  **********
 221    2023-04-23 18:31    29  **********
 222    2023-04-23 18:32    29  **********
 223    2023-04-23 18:33    28  *********
 ...    ..( 21 skipped).    ..  *********
 245    2023-04-23 18:55    28  *********
 246    2023-04-23 18:56    29  **********
 247    2023-04-23 18:57    28  *********
 ...    ..( 31 skipped).    ..  *********
 279    2023-04-23 19:29    28  *********
 280    2023-04-23 19:30    27  ********
 281    2023-04-23 19:31    28  *********
 282    2023-04-23 19:32    28  *********
 283    2023-04-23 19:33    27  ********
 284    2023-04-23 19:34    28  *********
 285    2023-04-23 19:35    27  ********
 ...    ..( 75 skipped).    ..  ********
 361    2023-04-23 20:51    27  ********

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 3) ==
0x01  0x008  4               6  ---  Lifetime Power-On Resets
0x01  0x010  4              22  ---  Power-on Hours
0x01  0x018  6            5438  ---  Logical Sectors Written
0x01  0x020  6            5429  ---  Number of Write Commands
0x01  0x028  6           25138  ---  Logical Sectors Read
0x01  0x030  6            1710  ---  Number of Read Commands
0x01  0x038  6        79200000  ---  Date and Time TimeStamp
0x02  =====  =               =  ===  == Free-Fall Statistics (rev 1) ==
0x02  0x010  4               0  ---  Overlimit Shock Events
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4              21  ---  Spindle Motor Power-on Hours
0x03  0x010  4              18  ---  Head Flying Hours
0x03  0x018  4              20  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4               4  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              27  ---  Current Temperature
0x05  0x010  1               -  ---  Average Short Term Temperature
0x05  0x018  1               -  ---  Average Long Term Temperature
0x05  0x020  1              29  ---  Highest Temperature
0x05  0x028  1              24  ---  Lowest Temperature
0x05  0x030  1               -  ---  Highest Average Short Term Temperature
0x05  0x038  1               -  ---  Lowest Average Short Term Temperature
0x05  0x040  1               -  ---  Highest Average Long Term Temperature
0x05  0x048  1               -  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             130  ---  Number of Hardware Resets
0x06  0x010  4              63  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x008  7               0  ---  Vendor Specific
0xff  0x010  7               0  ---  Vendor Specific
0xff  0x018  7               0  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           88  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           89  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        18606  Vendor specific

# smartctl --xall /dev/sdf
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.14.0-299.el9.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40EFPX-68C6CN0
Serial Number:    WD-WX42A92A31RX
LU WWN Device Id: 5 0014ee 215825736
Firmware Version: 81.00A81
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Apr 23 20:51:46 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine
                    completed without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (39060) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 407) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x3039)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   253   051    -    0
  3 Spin_Up_Time            POS--K   206   206   021    -    2658
  4 Start_Stop_Count        -O--CK   100   100   000    -    7
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    22
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    6
192 Power-Off_Retract_Count -O--CK   200   200   000    -    4
193 Load_Cycle_Count        -O--CK   200   200   000    -    19
194 Temperature_Celsius     -O---K   118   113   000    -    29
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x24       GPL     R/O    307  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      78  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    29 Celsius
Power Cycle Min/Max Temperature:     29/31 Celsius
Lifetime    Min/Max Temperature:     20/31 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (376)

Index    Estimated Time   Temperature Celsius
 377    2023-04-23 12:54    29  **********
 ...    ..(132 skipped).    ..  **********
  32    2023-04-23 15:07    29  **********
  33    2023-04-23 15:08     ?  -
  34    2023-04-23 15:09    26  *******
 ...    ..(  2 skipped).    ..  *******
  37    2023-04-23 15:12    26  *******
  38    2023-04-23 15:13    27  ********
 ...    ..(  2 skipped).    ..  ********
  41    2023-04-23 15:16    27  ********
  42    2023-04-23 15:17     ?  -
  43    2023-04-23 15:18    28  *********
 ...    ..(  4 skipped).    ..  *********
  48    2023-04-23 15:23    28  *********
  49    2023-04-23 15:24    29  **********
 ...    ..(  7 skipped).    ..  **********
  57    2023-04-23 15:32    29  **********
  58    2023-04-23 15:33    30  ***********
 ...    ..(  9 skipped).    ..  ***********
  68    2023-04-23 15:43    30  ***********
  69    2023-04-23 15:44    29  **********
  70    2023-04-23 15:45     ?  -
  71    2023-04-23 15:46    29  **********
  72    2023-04-23 15:47     ?  -
  73    2023-04-23 15:48    29  **********
 ...    ..(  2 skipped).    ..  **********
  76    2023-04-23 15:51    29  **********
  77    2023-04-23 15:52    30  ***********
  78    2023-04-23 15:53    30  ***********
  79    2023-04-23 15:54     ?  -
  80    2023-04-23 15:55    30  ***********
 ...    ..(  5 skipped).    ..  ***********
  86    2023-04-23 16:01    30  ***********
  87    2023-04-23 16:02    31  ************
  88    2023-04-23 16:03    31  ************
  89    2023-04-23 16:04    30  ***********
 ...    ..( 15 skipped).    ..  ***********
 105    2023-04-23 16:20    30  ***********
 106    2023-04-23 16:21    29  **********
 107    2023-04-23 16:22    30  ***********
 108    2023-04-23 16:23    29  **********
 ...    ..( 11 skipped).    ..  **********
 120    2023-04-23 16:35    29  **********
 121    2023-04-23 16:36    30  ***********
 ...    ..(  6 skipped).    ..  ***********
 128    2023-04-23 16:43    30  ***********
 129    2023-04-23 16:44    29  **********
 ...    ..(  4 skipped).    ..  **********
 134    2023-04-23 16:49    29  **********
 135    2023-04-23 16:50    30  ***********
 ...    ..(  2 skipped).    ..  ***********
 138    2023-04-23 16:53    30  ***********
 139    2023-04-23 16:54    29  **********
 140    2023-04-23 16:55    29  **********
 141    2023-04-23 16:56    30  ***********
 ...    ..(  6 skipped).    ..  ***********
 148    2023-04-23 17:03    30  ***********
 149    2023-04-23 17:04    29  **********
 150    2023-04-23 17:05    29  **********
 151    2023-04-23 17:06    30  ***********
 ...    ..( 14 skipped).    ..  ***********
 166    2023-04-23 17:21    30  ***********
 167    2023-04-23 17:22    29  **********
 ...    ..(  3 skipped).    ..  **********
 171    2023-04-23 17:26    29  **********
 172    2023-04-23 17:27    30  ***********
 ...    ..( 19 skipped).    ..  ***********
 192    2023-04-23 17:47    30  ***********
 193    2023-04-23 17:48    31  ************
 194    2023-04-23 17:49    31  ************
 195    2023-04-23 17:50    31  ************
 196    2023-04-23 17:51    30  ***********
 ...    ..(  4 skipped).    ..  ***********
 201    2023-04-23 17:56    30  ***********
 202    2023-04-23 17:57    31  ************
 203    2023-04-23 17:58    30  ***********
 ...    ..(  4 skipped).    ..  ***********
 208    2023-04-23 18:03    30  ***********
 209    2023-04-23 18:04    31  ************
 210    2023-04-23 18:05    31  ************
 211    2023-04-23 18:06    30  ***********
 ...    ..(  4 skipped).    ..  ***********
 216    2023-04-23 18:11    30  ***********
 217    2023-04-23 18:12    31  ************
 ...    ..(  5 skipped).    ..  ************
 223    2023-04-23 18:18    31  ************
 224    2023-04-23 18:19    30  ***********
 ...    ..(  9 skipped).    ..  ***********
 234    2023-04-23 18:29    30  ***********
 235    2023-04-23 18:30    31  ************
 ...    ..(  2 skipped).    ..  ************
 238    2023-04-23 18:33    31  ************
 239    2023-04-23 18:34    30  ***********
 ...    ..( 12 skipped).    ..  ***********
 252    2023-04-23 18:47    30  ***********
 253    2023-04-23 18:48    29  **********
 254    2023-04-23 18:49    29  **********
 255    2023-04-23 18:50    30  ***********
 ...    ..( 30 skipped).    ..  ***********
 286    2023-04-23 19:21    30  ***********
 287    2023-04-23 19:22    29  **********
 ...    ..( 25 skipped).    ..  **********
 313    2023-04-23 19:48    29  **********
 314    2023-04-23 19:49    30  ***********
 ...    ..(  2 skipped).    ..  ***********
 317    2023-04-23 19:52    30  ***********
 318    2023-04-23 19:53    29  **********
 ...    ..( 57 skipped).    ..  **********
 376    2023-04-23 20:51    29  **********

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 3) ==
0x01  0x008  4               6  ---  Lifetime Power-On Resets
0x01  0x010  4              22  ---  Power-on Hours
0x01  0x018  6      3474015532  ---  Logical Sectors Written
0x01  0x020  6        29684071  ---  Number of Write Commands
0x01  0x028  6           25060  ---  Logical Sectors Read
0x01  0x030  6            1639  ---  Number of Read Commands
0x01  0x038  6        79200000  ---  Date and Time TimeStamp
0x02  =====  =               =  ===  == Free-Fall Statistics (rev 1) ==
0x02  0x010  4               0  ---  Overlimit Shock Events
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4              22  ---  Spindle Motor Power-on Hours
0x03  0x010  4              18  ---  Head Flying Hours
0x03  0x018  4              24  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4               4  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              29  ---  Current Temperature
0x05  0x010  1               -  ---  Average Short Term Temperature
0x05  0x018  1               -  ---  Average Long Term Temperature
0x05  0x020  1              31  ---  Highest Temperature
0x05  0x028  1              25  ---  Lowest Temperature
0x05  0x030  1               -  ---  Highest Average Short Term Temperature
0x05  0x038  1               -  ---  Lowest Average Short Term Temperature
0x05  0x040  1               -  ---  Highest Average Long Term Temperature
0x05  0x048  1               -  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             130  ---  Number of Hardware Resets
0x06  0x010  4              66  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x008  7               0  ---  Vendor Specific
0xff  0x010  7               0  ---  Vendor Specific
0xff  0x018  7               0  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           88  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           89  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        18634  Vendor specific

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-04-23 19:09 Raid5 to raid6 grow interrupted, mdadm hangs on assemble command Jove
@ 2023-04-23 19:19 ` Reindl Harald
  2023-04-23 19:32   ` Jove
  2023-04-24  7:41 ` Wols Lists
  2023-05-04 11:41 ` Yu Kuai
  2 siblings, 1 reply; 21+ messages in thread
From: Reindl Harald @ 2023-04-23 19:19 UTC (permalink / raw)
  To: Jove, linux-raid



On 23.04.23 at 21:09, Jove wrote:
> I've added two drives to my raid5 array and tried to migrate
> it to raid6 with the following command:
> 
> mdadm --grow /dev/md0 --raid-devices 4 --level 6
> --backup-file=/root/mdadm_raid6_backup.md
> 
> This may have been my first mistake, as there are only 5
> drives. it should have been --raid-devices 3, I think.

how do you come to the conclusion of 3 when there are 5 drives? you tell
it how many drives there are, and I'm pretty sure that after "mdadm --add"
you can skip "--raid-devices" entirely because it knows how many drives
there are

https://raid.wiki.kernel.org/index.php/Growing
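
(As an illustration only, with placeholder device names: with three
original drives plus two newly added ones, the target count is five, so
the grow would look something like

  mdadm --grow /dev/md0 --level 6 --raid-devices 5 \
        --backup-file=/root/mdadm_raid6_backup.md

since --raid-devices gives the total number of active devices after the
reshape, not the number of drives being added.)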

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-04-23 19:19 ` Reindl Harald
@ 2023-04-23 19:32   ` Jove
  2023-04-24  7:02     ` Jove
  0 siblings, 1 reply; 21+ messages in thread
From: Jove @ 2023-04-23 19:32 UTC (permalink / raw)
  To: Reindl Harald; +Cc: linux-raid

That comment was because I misunderstood the actual function
of the argument. It should have been 5, not 4 or 3 :).

I do doubt this is the cause of my problems though.

On Sun, Apr 23, 2023 at 9:19 PM Reindl Harald <h.reindl@thelounge.net> wrote:
>
>
>
> On 23.04.23 at 21:09, Jove wrote:
> > I've added two drives to my raid5 array and tried to migrate
> > it to raid6 with the following command:
> >
> > mdadm --grow /dev/md0 --raid-devices 4 --level 6
> > --backup-file=/root/mdadm_raid6_backup.md
> >
> > This may have been my first mistake, as there are only 5
> > drives. it should have been --raid-devices 3, I think.
>
> how do you come to the conclusion of 3 when there are 5 drives? you tell
> it how many drives there are, and I'm pretty sure that after "mdadm --add"
> you can skip "--raid-devices" entirely because it knows how many drives
> there are
>
> https://raid.wiki.kernel.org/index.php/Growing

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-04-23 19:32   ` Jove
@ 2023-04-24  7:02     ` Jove
  2023-04-24  7:30       ` Wols Lists
  0 siblings, 1 reply; 21+ messages in thread
From: Jove @ 2023-04-24  7:02 UTC (permalink / raw)
  To: Reindl Harald; +Cc: linux-raid

> I do doubt this is the cause of my problems though.

Just to clarify, migrating an array from a 3 disk raid5 to a 4 disk
raid6 should be fine?

On Sun, Apr 23, 2023 at 9:32 PM Jove <jovetoo@gmail.com> wrote:
>
> That comment was because I misunderstood the actual function
> of the argument. It should have been 5, not 4 or 3 :).
>
> I do doubt this is the cause of my problems though.
>
> On Sun, Apr 23, 2023 at 9:19 PM Reindl Harald <h.reindl@thelounge.net> wrote:
> >
> >
> >
> > On 23.04.23 at 21:09, Jove wrote:
> > > I've added two drives to my raid5 array and tried to migrate
> > > it to raid6 with the following command:
> > >
> > > mdadm --grow /dev/md0 --raid-devices 4 --level 6
> > > --backup-file=/root/mdadm_raid6_backup.md
> > >
> > > This may have been my first mistake, as there are only 5
> > > drives. it should have been --raid-devices 3, I think.
> >
> > how do you come to the conclusion of 3 when there are 5 drives? you tell
> > it how many drives there are, and I'm pretty sure that after "mdadm --add"
> > you can skip "--raid-devices" entirely because it knows how many drives
> > there are
> >
> > https://raid.wiki.kernel.org/index.php/Growing

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-04-24  7:02     ` Jove
@ 2023-04-24  7:30       ` Wols Lists
  0 siblings, 0 replies; 21+ messages in thread
From: Wols Lists @ 2023-04-24  7:30 UTC (permalink / raw)
  To: Jove; +Cc: linux-raid

On 24/04/2023 08:02, Jove wrote:
>> I do doubt this is the cause of my problems though.
> Just to clarify, migrating an array from a 3 disk raid5 to a 4 disk
> raid6 should be fine?

Yup. This should not have been a problem.
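
(The arithmetic backs this up: with 4 TB members, a 3-disk RAID5 gives
(3-1) x 4 TB = 8 TB usable and a 4-disk RAID6 gives (4-2) x 4 TB = 8 TB,
so the conversion adds a second disk's worth of parity without changing
capacity, which matches the 8.00 TB array size quoted further down.)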

I notice you have WD Reds ... are the new drives new Reds? Not a wise 
move...

At what percent is the conversion hung? If a status says 0% complete, 
then a data recovery should be fine. Snag is, this doesn't at first 
glance sound like that.

And you shouldn't have needed a backup file - again I'll have to dig 
deeper ...

Give me a chance, I'll dig deeper.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-04-23 19:09 Raid5 to raid6 grow interrupted, mdadm hangs on assemble command Jove
  2023-04-23 19:19 ` Reindl Harald
@ 2023-04-24  7:41 ` Wols Lists
  2023-04-24 13:31   ` Jove
  2023-05-04 11:41 ` Yu Kuai
  2 siblings, 1 reply; 21+ messages in thread
From: Wols Lists @ 2023-04-24  7:41 UTC (permalink / raw)
  To: Jove, linux-raid; +Cc: Phil Turmel, NeilBrown

On 23/04/2023 20:09, Jove wrote:
> # mdadm --version
> mdadm - v4.2 - 2021-12-30 - 8
> 
> # mdadm -D /dev/md0
> /dev/md0:
>             Version : 1.2
>       Creation Time : Sat Oct 21 01:57:20 2017
>          Raid Level : raid6
>          Array Size : 7813771264 (7.28 TiB 8.00 TB)
>       Used Dev Size : 3906885632 (3.64 TiB 4.00 TB)
>        Raid Devices : 4
>       Total Devices : 5
>         Persistence : Superblock is persistent
> 
>       Intent Bitmap : Internal
> 
>         Update Time : Sun Apr 23 10:32:01 2023
>               State : clean, degraded
>      Active Devices : 3
>     Working Devices : 5
>      Failed Devices : 0
>       Spare Devices : 2
> 
>              Layout : left-symmetric-6
>          Chunk Size : 512K
> 
> Consistency Policy : bitmap
> 
>          New Layout : left-symmetric
> 
>                Name : atom:0  (local to host atom)
>                UUID : 8c56384e:ba1a3cec:aaf34c17:d0cd9318
>              Events : 669453
> 
>      Number   Major   Minor   RaidDevice State
>         0       8       33        0      active sync   /dev/sdc1
>         1       8       97        1      active sync   /dev/sdg1
>         3       8       49        2      active sync   /dev/sdd1
>         5       8       80        3      spare rebuilding   /dev/sdf
> 
>         4       8       64        -      spare   /dev/sde

This bit looks good. You have three active drives, so I'm HOPEFUL your 
data hasn't actually been damaged.

I've cc'd two people more experienced than me who I hope can help.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-04-24  7:41 ` Wols Lists
@ 2023-04-24 13:31   ` Jove
  2023-04-24 21:29     ` Jove
  0 siblings, 1 reply; 21+ messages in thread
From: Jove @ 2023-04-24 13:31 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid, Phil Turmel, NeilBrown

Any data that can be retrieved would be a plus. There is much data on
this array that I don't mind being trashed.

The older drives are WD Red; they are pre-SMR. Since then I have made
sure to use WD Red Plus and WD Red Pro drives. From what I found
online, they should be CMR too. Unless they quietly changed those too.

No, the conversion definitely did not stop at 0%. It ran for several
hours. It stopped during the night, so I can't tell you more.

I am worried that the processes are hung, though. Is that normal?

Thank you for your time!

On Mon, Apr 24, 2023 at 9:41 AM Wols Lists <antlists@youngman.org.uk> wrote:
>
> On 23/04/2023 20:09, Jove wrote:
> > # mdadm --version
> > mdadm - v4.2 - 2021-12-30 - 8
> >
> > # mdadm -D /dev/md0
> > /dev/md0:
> >             Version : 1.2
> >       Creation Time : Sat Oct 21 01:57:20 2017
> >          Raid Level : raid6
> >          Array Size : 7813771264 (7.28 TiB 8.00 TB)
> >       Used Dev Size : 3906885632 (3.64 TiB 4.00 TB)
> >        Raid Devices : 4
> >       Total Devices : 5
> >         Persistence : Superblock is persistent
> >
> >       Intent Bitmap : Internal
> >
> >         Update Time : Sun Apr 23 10:32:01 2023
> >               State : clean, degraded
> >      Active Devices : 3
> >     Working Devices : 5
> >      Failed Devices : 0
> >       Spare Devices : 2
> >
> >              Layout : left-symmetric-6
> >          Chunk Size : 512K
> >
> > Consistency Policy : bitmap
> >
> >          New Layout : left-symmetric
> >
> >                Name : atom:0  (local to host atom)
> >                UUID : 8c56384e:ba1a3cec:aaf34c17:d0cd9318
> >              Events : 669453
> >
> >      Number   Major   Minor   RaidDevice State
> >         0       8       33        0      active sync   /dev/sdc1
> >         1       8       97        1      active sync   /dev/sdg1
> >         3       8       49        2      active sync   /dev/sdd1
> >         5       8       80        3      spare rebuilding   /dev/sdf
> >
> >         4       8       64        -      spare   /dev/sde
>
> This bit looks good. You have three active drives, so I'm HOPEFUL your
> data hasn't actually been damaged.
>
> I've cc'd two people more experienced than me who I hope can help.
>
> Cheers,
> Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-04-24 13:31   ` Jove
@ 2023-04-24 21:29     ` Jove
  0 siblings, 0 replies; 21+ messages in thread
From: Jove @ 2023-04-24 21:29 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid, Phil Turmel, NeilBrown

> There is much data on this array that I don't mind being trashed.

There is about 200GB I would very much like to have back. Email archive,
travel pictures, openhab configuration, ... It is all in a huge LVM with
different logical volumes.

On Mon, Apr 24, 2023 at 3:31 PM Jove <jovetoo@gmail.com> wrote:
>
> Any data that can be retrieved would be a plus. There is much data on
> this array that I don't mind being trashed.
>
> The older drives are WD Red; they are pre-SMR. Since then I have made
> sure to use WD Red Plus and WD Red Pro drives. From what I found
> online, they should be CMR too. Unless they quietly changed those too.
>
> No, the conversion definitely did not stop at 0%. It ran for several
> hours. It stopped during the night, so I can't tell you more.
>
> I am worried that the processes are hung, though. Is that normal?
>
> Thank you for your time!
>
> On Mon, Apr 24, 2023 at 9:41 AM Wols Lists <antlists@youngman.org.uk> wrote:
> >
> > On 23/04/2023 20:09, Jove wrote:
> > > # mdadm --version
> > > mdadm - v4.2 - 2021-12-30 - 8
> > >
> > > # mdadm -D /dev/md0
> > > /dev/md0:
> > >             Version : 1.2
> > >       Creation Time : Sat Oct 21 01:57:20 2017
> > >          Raid Level : raid6
> > >          Array Size : 7813771264 (7.28 TiB 8.00 TB)
> > >       Used Dev Size : 3906885632 (3.64 TiB 4.00 TB)
> > >        Raid Devices : 4
> > >       Total Devices : 5
> > >         Persistence : Superblock is persistent
> > >
> > >       Intent Bitmap : Internal
> > >
> > >         Update Time : Sun Apr 23 10:32:01 2023
> > >               State : clean, degraded
> > >      Active Devices : 3
> > >     Working Devices : 5
> > >      Failed Devices : 0
> > >       Spare Devices : 2
> > >
> > >              Layout : left-symmetric-6
> > >          Chunk Size : 512K
> > >
> > > Consistency Policy : bitmap
> > >
> > >          New Layout : left-symmetric
> > >
> > >                Name : atom:0  (local to host atom)
> > >                UUID : 8c56384e:ba1a3cec:aaf34c17:d0cd9318
> > >              Events : 669453
> > >
> > >      Number   Major   Minor   RaidDevice State
> > >         0       8       33        0      active sync   /dev/sdc1
> > >         1       8       97        1      active sync   /dev/sdg1
> > >         3       8       49        2      active sync   /dev/sdd1
> > >         5       8       80        3      spare rebuilding   /dev/sdf
> > >
> > >         4       8       64        -      spare   /dev/sde
> >
> > This bit looks good. You have three active drives, so I'm HOPEFUL your
> > data hasn't actually been damaged.
> >
> > I've cc'd two people more experienced than me who I hope can help.
> >
> > Cheers,
> > Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-04-23 19:09 Raid5 to raid6 grow interrupted, mdadm hangs on assemble command Jove
  2023-04-23 19:19 ` Reindl Harald
  2023-04-24  7:41 ` Wols Lists
@ 2023-05-04 11:41 ` Yu Kuai
  2023-05-04 18:02   ` Jove
  2 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2023-05-04 11:41 UTC (permalink / raw)
  To: Jove, linux-raid; +Cc: yukuai (C)

Hi,

On 2023/04/24 3:09, Jove wrote:
> Hi,
> 
> I've added two drives to my raid5 array and tried to migrate
> it to raid6 with the following command:
> 
> mdadm --grow /dev/md0 --raid-devices 4 --level 6
> --backup-file=/root/mdadm_raid6_backup.md
> 
> This may have been my first mistake, as there are only 5
> drives. it should have been --raid-devices 3, I think.
> 
> As soon as I started this grow, the filesystems went
> unavailable. All processes trying to access files on it hung.
> I searched the web which said a reboot during a rebuild
> was not problematic if things shut down cleanly, so I
> rebooted. The reboot hung too. The drive activity
> continued so I let it run overnight. I did wake up to a
> rebooted system in emergency mode as it could not
> mount all the partitions on the raid array.
> 
> The OS tried to reassemble the array and succeeded.
> However the udev processes that try to create the dev
> entries hang.
> 
> I went back to Google and found out how i could reboot
> my system without this automatic assemble.
> I tried reassembling the array with:
> 
> mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0 /dev/md0
> 
> This failed with:
> No backup metadata on mdadm_raid6_backup.md0
> Failed to find final backup of critical section.
> Failed to restore critical section for reshape, sorry.
> 
>   I tried again wtih:
> 
> mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0
> --invalid-backup /dev/md0
> 
> Rhis said in addition to the lines above:
> 
> continuying without restoring backup
> 
> This seemed to have succeeded in reassembling the
> array but it also hangs indefinitely.
> 
> /proc/mdstat now shows:
> 
> md0 : active (read-only) raid6 sdc1[0] sde[4](S) sdf[5] sdd1[3] sdg1[1]
>        7813771264 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
>        bitmap: 1/30 pages [4KB], 65536KB chunk

A read-only array can't continue the reshape progress; see the details
in md_check_recovery(): reshape can only start if md_is_rdwr(mddev)
passes. Do you know why this array is read-only?
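
(A sketch of how that could be checked, assuming the usual md sysfs
layout rather than anything posted in this thread:

  cat /sys/block/md0/md/array_state    # e.g. "readonly" or "read-auto"
  mdadm --readwrite /dev/md0           # switches the array to read/write

whether flipping it to read/write is safe here is exactly the open
question.)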

> 
> Again the udev processes trying to access this device hung indefinitely
> 
> Eventually, the kernel dumps this in my journal:
> 
> Apr 23 19:17:22 atom kernel: task:systemd-udevd   state:D stack:    0
> pid: 8121 ppid:   706 flags:0x00000006
> Apr 23 19:17:22 atom kernel: Call Trace:
> Apr 23 19:17:22 atom kernel:  <TASK>
> Apr 23 19:17:22 atom kernel:  __schedule+0x20a/0x550
> Apr 23 19:17:22 atom kernel:  schedule+0x5a/0xc0
> Apr 23 19:17:22 atom kernel:  schedule_timeout+0x11f/0x160
> Apr 23 19:17:22 atom kernel:  ? make_stripe_request+0x284/0x490 [raid456]
> Apr 23 19:17:22 atom kernel:  wait_woken+0x50/0x70

Looks like this normal io is waiting for the reshape to be done; that's
why it hung indefinitely.

This really is a kernel bug. Perhaps it can be bypassed if the reshape
can be completed, hopefully automatically once this array is read/write.
Note: never echo reshape to sync_action; that would corrupt data in your
case.
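
(Illustrative only: the reshape state can be inspected without changing
anything via sysfs, e.g.

  cat /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/reshape_position

reading these attributes is harmless; it is writing "reshape" into
sync_action that must be avoided here.)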

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-04 11:41 ` Yu Kuai
@ 2023-05-04 18:02   ` Jove
  2023-05-05  1:34     ` Yu Kuai
  0 siblings, 1 reply; 21+ messages in thread
From: Jove @ 2023-05-04 18:02 UTC (permalink / raw)
  To: Yu Kuai; +Cc: linux-raid, yukuai (C)

Hi Kuai,

the mdadm --assemble command also hangs in the kernel. It never completes.

root         142     112  1 19:01 tty1     00:00:00 mdadm --assemble
/dev/md0 /dev/ubdb /dev/ubdc /dev/ubdd /dev/ubde --backup-file
mdadm_raid6_backup.md0 --invalid-backup
root         145       2  0 19:01 ?        00:00:00 [md0_raid6]

[root@LXCNAME ~]# cat /proc/142/stack
[<0>] __switch_to+0x50/0x7f
[<0>] __schedule+0x39c/0x3dd
[<0>] schedule+0x78/0xb9
[<0>] mddev_suspend+0x10b/0x1e8
[<0>] suspend_lo_store+0x72/0xbb
[<0>] md_attr_store+0x6c/0x8d
[<0>] sysfs_kf_write+0x34/0x37
[<0>] kernfs_fop_write_iter+0x167/0x1d0
[<0>] new_sync_write+0x68/0xd8
[<0>] vfs_write+0xe7/0x12b
[<0>] ksys_write+0x6d/0xa6
[<0>] sys_write+0x10/0x12
[<0>] handle_syscall+0x81/0xb1
[<0>] userspace+0x3db/0x598
[<0>] fork_handler+0x94/0x96

[root@LXCNAME ~]# cat /proc/145/stack
[<0>] __switch_to+0x50/0x7f
[<0>] __schedule+0x39c/0x3dd
[<0>] schedule+0x78/0xb9
[<0>] schedule_timeout+0xd2/0xfb
[<0>] md_thread+0x12c/0x18a
[<0>] kthread+0x11d/0x122
[<0>] new_thread_handler+0x81/0xb2

I have had one case in which mdadm didn't hang and in which the
reshape continued. Sadly, I was using sparse overlay files and the
filesystem could not handle the full 4x 4TB. I had to terminate the
reshape.
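
(For reference, a sketch of the usual overlay setup from the raid wiki,
with hypothetical names rather than the exact commands used here:

  ovl=/overlays/sdX1.ovl
  truncate -s 4T "$ovl"                    # sparse backing file
  loop=$(losetup -f --show -- "$ovl")      # attach it to a loop device
  size=$(blockdev --getsz /dev/sdX1)       # member size in 512B sectors
  dmsetup create sdX1-ovl \
      --table "0 $size snapshot /dev/sdX1 $loop P 8"

every write then lands in the sparse overlay instead of the real member,
so the filesystem holding the overlays must be able to absorb the full
write volume of the reshape, which is what went wrong for me.)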

Best regards,

    Johan

On Thu, May 4, 2023 at 1:41 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2023/04/24 3:09, Jove wrote:
> > Hi,
> >
> > I've added two drives to my raid5 array and tried to migrate
> > it to raid6 with the following command:
> >
> > mdadm --grow /dev/md0 --raid-devices 4 --level 6
> > --backup-file=/root/mdadm_raid6_backup.md
> >
> > This may have been my first mistake, as there are only 5
> > drives. it should have been --raid-devices 3, I think.
> >
> > As soon as I started this grow, the filesystems went
> > unavailable. All processes trying to access files on it hung.
> > I searched the web which said a reboot during a rebuild
> > was not problematic if things shut down cleanly, so I
> > rebooted. The reboot hung too. The drive activity
> > continued so I let it run overnight. I did wake up to a
> > rebooted system in emergency mode as it could not
> > mount all the partitions on the raid array.
> >
> > The OS tried to reassemble the array and succeeded.
> > However the udev processes that try to create the dev
> > entries hang.
> >
> > I went back to Google and found out how I could reboot
> > my system without this automatic assemble.
> > I tried reassembling the array with:
> >
> > mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0 /dev/md0
> >
> > This failed with:
> > No backup metadata on mdadm_raid6_backup.md0
> > Failed to find final backup of critical section.
> > Failed to restore critical section for reshape, sorry.
> >
> >   I tried again with:
> >
> > mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0
> > --invalid-backup /dev/md0
> >
> > This said, in addition to the lines above:
> >
> > continuing without restoring backup
> >
> > This seemed to have succeeded in reassembling the
> > array but it also hangs indefinitely.
> >
> > /proc/mdstat now shows:
> >
> > md0 : active (read-only) raid6 sdc1[0] sde[4](S) sdf[5] sdd1[3] sdg1[1]
> >        7813771264 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
> >        bitmap: 1/30 pages [4KB], 65536KB chunk
>
> A read-only array can't continue reshape progress; see details in
> md_check_recovery(): reshape can only start if md_is_rdwr(mddev) passes.
> Do you know why this array is read-only?
>
> >
> > Again the udev processes trying to access this device hung indefinitely
> >
> > Eventually, the kernel dumps this in my journal:
> >
> > Apr 23 19:17:22 atom kernel: task:systemd-udevd   state:D stack:    0
> > pid: 8121 ppid:   706 flags:0x00000006
> > Apr 23 19:17:22 atom kernel: Call Trace:
> > Apr 23 19:17:22 atom kernel:  <TASK>
> > Apr 23 19:17:22 atom kernel:  __schedule+0x20a/0x550
> > Apr 23 19:17:22 atom kernel:  schedule+0x5a/0xc0
> > Apr 23 19:17:22 atom kernel:  schedule_timeout+0x11f/0x160
> > Apr 23 19:17:22 atom kernel:  ? make_stripe_request+0x284/0x490 [raid456]
> > Apr 23 19:17:22 atom kernel:  wait_woken+0x50/0x70
>
> Looks like this normal io is waiting for the reshape to be done; that's
> why it hangs indefinitely.
>
> This really is a kernel bug. Perhaps it can be bypassed if the reshape
> can complete, hopefully automatically once this array is read/write. Note:
> never echo reshape to sync_action; that will corrupt data in your case.
>
> Thanks,
> Kuai
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-04 18:02   ` Jove
@ 2023-05-05  1:34     ` Yu Kuai
  2023-05-05  6:58       ` Wol
  0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2023-05-05  1:34 UTC (permalink / raw)
  To: Jove, Yu Kuai; +Cc: linux-raid, yukuai (C)

Hi,

On 2023/05/05 2:02, Jove wrote:
> Hi Kuai,
> 
> the mdadm --assemble command also hangs in the kernel. It never completes.
> 
> root         142     112  1 19:01 tty1     00:00:00 mdadm --assemble
> /dev/md0 /dev/ubdb /dev/ubdc /dev/ubdd /dev/ubde --backup-file
> mdadm_raid6_backup.md0 --invalid-backup
> root         145       2  0 19:01 ?        00:00:00 [md0_raid6]
> 
> [root@LXCNAME ~]# cat /proc/142/stack
> [<0>] __switch_to+0x50/0x7f
> [<0>] __schedule+0x39c/0x3dd
> [<0>] schedule+0x78/0xb9
> [<0>] mddev_suspend+0x10b/0x1e8
mddev_suspend() is waiting for the read io to be done, while the read io
is waiting for the reshape to make progress.

So whether this happens just depends on whether there is a read io beyond
the reshape position while mdadm is executed.
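
To make the circular wait concrete, here is a toy userspace model (my
own illustration with pthreads; the names only mirror the md concepts,
none of this is kernel code). The "io" thread stands in for the udev
read parked in wait_woken(), the "suspend" thread for mdadm parked in
mddev_suspend(); run it and both threads hang forever, like the stacks
above:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static unsigned long long reshape_progress = 1024; /* sectors reshaped */
static int active_io = 1;            /* the udev read holds a reference */

static void *io_thread(void *arg)    /* like raid5_make_request() */
{
	unsigned long long sector = 4096; /* beyond reshape_progress */

	(void)arg;
	pthread_mutex_lock(&lock);
	while (sector >= reshape_progress)   /* waiting for the window to pass */
		pthread_cond_wait(&cond, &lock); /* never woken: reshape is gated */
	active_io--;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void *suspend_thread(void *arg) /* like mddev_suspend() */
{
	(void)arg;
	pthread_mutex_lock(&lock);
	/* reshape is gated on suspend finishing, but suspend itself
	 * waits for the blocked io: the circular wait */
	while (active_io > 0)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, io_thread, NULL);
	pthread_create(&b, NULL, suspend_thread, NULL);
	puts("both threads now wait on each other; Ctrl-C to exit");
	pthread_join(a, NULL); /* never returns */
	pthread_join(b, NULL);
	return 0;
}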

> [<0>] suspend_lo_store+0x72/0xbb
> [<0>] md_attr_store+0x6c/0x8d
> [<0>] sysfs_kf_write+0x34/0x37
> [<0>] kernfs_fop_write_iter+0x167/0x1d0
> [<0>] new_sync_write+0x68/0xd8
> [<0>] vfs_write+0xe7/0x12b
> [<0>] ksys_write+0x6d/0xa6
> [<0>] sys_write+0x10/0x12
> [<0>] handle_syscall+0x81/0xb1
> [<0>] userspace+0x3db/0x598
> [<0>] fork_handler+0x94/0x96
> 
> [root@LXCNAME ~]# cat /proc/145/stack
> [<0>] __switch_to+0x50/0x7f
> [<0>] __schedule+0x39c/0x3dd
> [<0>] schedule+0x78/0xb9
> [<0>] schedule_timeout+0xd2/0xfb
> [<0>] md_thread+0x12c/0x18a
> [<0>] kthread+0x11d/0x122
> [<0>] new_thread_handler+0x81/0xb2
> 
> I have had one case in which mdadm didn't hang and in which the
> reshape continued. Sadly, I was using sparse overlay files and the
> filesystem could not handle the full 4x 4TB. I had to terminate the
> reshape.

This sounds like a dead end for now; normal io beyond the reshape position
must wait:

raid5_make_request
  make_stripe_request
   ahead_of_reshape
    wait_woken

Thanks,
Kuai
> 
> Best regards,
> 
>      Johan
> 
> On Thu, May 4, 2023 at 1:41 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2023/04/24 3:09, Jove wrote:
>>> Hi,
>>>
>>> I've added two drives to my raid5 array and tried to migrate
>>> it to raid6 with the following command:
>>>
>>> mdadm --grow /dev/md0 --raid-devices 4 --level 6
>>> --backup-file=/root/mdadm_raid6_backup.md
>>>
>>> This may have been my first mistake, as there are only 5
>>> drives. It should have been --raid-devices 3, I think.
>>>
>>> As soon as I started this grow, the filesystems went
>>> unavailable. All processes trying to access files on it hung.
>>> I searched the web which said a reboot during a rebuild
>>> was not problematic if things shut down cleanly, so I
>>> rebooted. The reboot hung too. The drive activity
>>> continued so I let it run overnight. I did wake up to a
>>> rebooted system in emergency mode as it could not
>>> mount all the partitions on the raid array.
>>>
>>> The OS tried to reassemble the array and succeeded.
>>> However the udev processes that try to create the dev
>>> entries hang.
>>>
>>> I went back to Google and found out how I could reboot
>>> my system without this automatic assemble.
>>> I tried reassembling the array with:
>>>
>>> mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0 /dev/md0
>>>
>>> This failed with:
>>> No backup metadata on mdadm_raid6_backup.md0
>>> Failed to find final backup of critical section.
>>> Failed to restore critical section for reshape, sorry.
>>>
>>>    I tried again with:
>>>
>>> mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0
>>> --invalid-backup /dev/md0
>>>
>>> This said, in addition to the lines above:
>>>
>>> continuing without restoring backup
>>>
>>> This seemed to have succeeded in reassembling the
>>> array but it also hangs indefinitely.
>>>
>>> /proc/mdstat now shows:
>>>
>>> md0 : active (read-only) raid6 sdc1[0] sde[4](S) sdf[5] sdd1[3] sdg1[1]
>>>         7813771264 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
>>>         bitmap: 1/30 pages [4KB], 65536KB chunk
>>
>> A read-only array can't continue reshape progress; see details in
>> md_check_recovery(): reshape can only start if md_is_rdwr(mddev) passes.
>> Do you know why this array is read-only?
>>
>>>
>>> Again the udev processes trying to access this device hung indefinitely
>>>
>>> Eventually, the kernel dumps this in my journal:
>>>
>>> Apr 23 19:17:22 atom kernel: task:systemd-udevd   state:D stack:    0
>>> pid: 8121 ppid:   706 flags:0x00000006
>>> Apr 23 19:17:22 atom kernel: Call Trace:
>>> Apr 23 19:17:22 atom kernel:  <TASK>
>>> Apr 23 19:17:22 atom kernel:  __schedule+0x20a/0x550
>>> Apr 23 19:17:22 atom kernel:  schedule+0x5a/0xc0
>>> Apr 23 19:17:22 atom kernel:  schedule_timeout+0x11f/0x160
>>> Apr 23 19:17:22 atom kernel:  ? make_stripe_request+0x284/0x490 [raid456]
>>> Apr 23 19:17:22 atom kernel:  wait_woken+0x50/0x70
>>
>> Looks like this normal io is waiting for the reshape to be done; that's
>> why it hangs indefinitely.
>>
>> This really is a kernel bug. Perhaps it can be bypassed if the reshape
>> can complete, hopefully automatically once this array is read/write. Note:
>> never echo reshape to sync_action; that will corrupt data in your case.
>>
>> Thanks,
>> Kuai
>>
> .
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-05  1:34     ` Yu Kuai
@ 2023-05-05  6:58       ` Wol
  2023-05-05  8:02         ` Yu Kuai
  0 siblings, 1 reply; 21+ messages in thread
From: Wol @ 2023-05-05  6:58 UTC (permalink / raw)
  To: Yu Kuai, Jove; +Cc: linux-raid, yukuai (C)

On 05/05/2023 02:34, Yu Kuai wrote:
>> I have had one case in which mdadm didn't hang and in which the
>> reshape continued. Sadly, I was using sparse overlay files and the
>> filesystem could not handle the full 4x 4TB. I had to terminate the
>> reshape.
> 
> This sounds like a dead end for now; normal io beyond the reshape position
> must wait:
> 
> raid5_make_request
>   make_stripe_request
>    ahead_of_reshape
>     wait_woken

Not sure if I've got the wrong end of the stick, but if I've understood 
correctly, that shouldn't be the case.

Reshape takes place in a window. All io *beyond* the window is allowed 
to proceed normally - that part of the array has not been reshaped so 
the old parameters are used.

All io *in front* of the window is allowed to proceed normally - that 
part of the array has been reshaped so the new parameters are used.

io *IN* the window is paused until the window has passed. This 
interruption should be short and sweet.
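
As a toy illustration of that window check (a userspace sketch based on
my reading of raid5.c's ahead_of_reshape(); the sample numbers and the
printed labels are mine):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long long sector_t;

/* For a forward (growing) reshape: true if the sector is at or beyond
 * the given reshape position, i.e. not yet reshaped relative to it. */
static bool ahead_of_reshape(sector_t sector, sector_t reshape_pos,
			     bool reshape_backwards)
{
	return reshape_backwards ? sector < reshape_pos
				 : sector >= reshape_pos;
}

int main(void)
{
	sector_t reshape_safe = 1000000;     /* window start */
	sector_t reshape_progress = 1004096; /* window end */
	sector_t samples[] = { 500000, 1002048, 2000000 };

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		sector_t s = samples[i];

		if (!ahead_of_reshape(s, reshape_safe, false))
			printf("%llu: already reshaped, new layout, proceed\n", s);
		else if (ahead_of_reshape(s, reshape_progress, false))
			printf("%llu: not yet reshaped, old layout, proceed\n", s);
		else
			printf("%llu: IN the window, wait for it to pass\n", s);
	}
	return 0;
}

The hang in this thread is the third case degenerating: the window
never passes because the reshape itself is not running.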

Cheers,
Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-05  6:58       ` Wol
@ 2023-05-05  8:02         ` Yu Kuai
  2023-05-05 15:47           ` Jove
  0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2023-05-05  8:02 UTC (permalink / raw)
  To: Wol, Yu Kuai, Jove; +Cc: linux-raid, yukuai (C)

Hi,

On 2023/05/05 14:58, Wol wrote:
> On 05/05/2023 02:34, Yu Kuai wrote:
>>> I have had one case in which mdadm didn't hang and in which the
>>> reshape continued. Sadly, I was using sparse overlay files and the
>>> filesystem could not handle the full 4x 4TB. I had to terminate the
>>> reshape.
>>
>> This sounds like a dead end for now; normal io beyond the reshape position
>> must wait:
>>
>> raid5_make_request
>>   make_stripe_request
>>    ahead_of_reshape
>>     wait_woken
> 
> Not sure if I've got the wrong end of the stick, but if I've understood 
> correctly, that shouldn't be the case.
> 
> Reshape takes place in a window. All io *beyond* the window is allowed 
> to proceed normally - that part of the array has not been reshaped so 
> the old parameters are used.
> 
> All io *in front* of the window is allowed to proceed normally - that 
> part of the array has been reshaped so the new parameters are used.
> 
> io *IN* the window is paused until the window has passed. This 
> interruption should be short and sweet.

Yes, it's correct, and in this case reshape_safe should be the same as
reshape_progress, and I guess the io is stuck because
stripe_ahead_of_reshape() returns true.

So this deadlock happens when io is blocked because of the reshape, and
mddev_suspend() is waiting for this io to be done; in the meantime the
reshape can't start until mddev_suspend() returns.

Jove, as I understand this, if mdadm makes progress without a blocked
io, and the reshape continues, it seems you can use this array without
problems.

Thanks,
Kuai
> 
> Cheers,
> Wol
> 
> .
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-05  8:02         ` Yu Kuai
@ 2023-05-05 15:47           ` Jove
  2023-05-06  1:33             ` Yu Kuai
  0 siblings, 1 reply; 21+ messages in thread
From: Jove @ 2023-05-05 15:47 UTC (permalink / raw)
  To: Yu Kuai; +Cc: Wol, linux-raid, yukuai (C)

Hi Kuai.

> Jove, as I understand this, if mdadm makes progress without a blocked
> io, and the reshape continues, it seems you can use this array without
> problems

I've had to do some sleuthing to figure out who was doing that array
access; I was already running a minimal Fedora Core image. I've
discovered that the culprit is the systemd-udevd daemon. I do not know
why it accesses the array, but if I stop it and rename that executable
(it gets started automatically when the array is assembled) then the
reshape continues.

Now it is just a matter of time until the reshape is finished and I
can discover just how much data I still have :)

Thank you all for your help, I will send a last mail when I know more.

Best regards,

      Johan



On Fri, May 5, 2023 at 10:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2023/05/05 14:58, Wol wrote:
> > On 05/05/2023 02:34, Yu Kuai wrote:
> >>> I have had one case in which mdadm didn't hang and in which the
> >>> reshape continued. Sadly, I was using sparse overlay files and the
> >>> filesystem could not handle the full 4x 4TB. I had to terminate the
> >>> reshape.
> >>
> >> This sounds like a dead end for now; normal io beyond the reshape position
> >> must wait:
> >>
> >> raid5_make_request
> >>   make_stripe_request
> >>    ahead_of_reshape
> >>     wait_woken
> >
> > Not sure if I've got the wrong end of the stick, but if I've understood
> > correctly, that shouldn't be the case.
> >
> > Reshape takes place in a window. All io *beyond* the window is allowed
> > to proceed normally - that part of the array has not been reshaped so
> > the old parameters are used.
> >
> > All io *in front* of the window is allowed to proceed normally - that
> > part of the array has been reshaped so the new parameters are used.
> >
> > io *IN* the window is paused until the window has passed. This
> > interruption should be short and sweet.
>
> Yes, it's correct, and in this case reshape_safe should be the same as
> reshape_progress, and I guess the io is stuck because
> stripe_ahead_of_reshape() returns true.
>
> So this deadlock happens when io is blocked because of the reshape, and
> mddev_suspend() is waiting for this io to be done; in the meantime the
> reshape can't start until mddev_suspend() returns.
>
> Jove, as I understand this, if mdadm makes progress without a blocked
> io, and the reshape continues, it seems you can use this array without
> problems.
>
> Thanks,
> Kuai
> >
> > Cheers,
> > Wol
> >
> > .
> >
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-05 15:47           ` Jove
@ 2023-05-06  1:33             ` Yu Kuai
  2023-05-06 13:07               ` Jove
  0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2023-05-06  1:33 UTC (permalink / raw)
  To: Jove, Yu Kuai; +Cc: Wol, linux-raid, yukuai (C)

Hi,

On 2023/05/05 23:47, Jove wrote:
> Hi Kuai.
> 
>> Jove, as I understand this, if mdadm makes progress without a blocked
>> io, and the reshape continues, it seems you can use this array without
>> problems
> 
> I've had to do some sleuthing to figure out who was doing that array
> access; I was already running a minimal Fedora Core image. I've
> discovered that the culprit is the systemd-udevd daemon. I do not know
> why it accesses the array, but if I stop it and rename that executable
> (it gets started automatically when the array is assembled) then the
> reshape continues.

Thanks for confirming this; however, I have no idea why systemd-udevd is
accessing the array.

In the meantime, I'll try to fix this deadlock; I hope you don't mind a
Reported-by tag.

Thanks,
Kuai
> 
> Now it is just a matter of time until the reshape is finished and I
> can discover just how much data I still have :)
> 
> Thank you all for your help, I will send a last mail when I know more.
> 
> Best regards,
> 
>        Johan
> 
> 
> 
> On Fri, May 5, 2023 at 10:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2023/05/05 14:58, Wol wrote:
>>> On 05/05/2023 02:34, Yu Kuai wrote:
>>>>> I have had one case in which mdadm didn't hang and in which the
>>>>> reshape continued. Sadly, I was using sparse overlay files and the
>>>>> filesystem could not handle the full 4x 4TB. I had to terminate the
>>>>> reshape.
>>>>
>>>> This sounds like a dead end for now; normal io beyond the reshape position
>>>> must wait:
>>>>
>>>> raid5_make_request
>>>>    make_stripe_request
>>>>     ahead_of_reshape
>>>>      wait_woken
>>>
>>> Not sure if I've got the wrong end of the stick, but if I've understood
>>> correctly, that shouldn't be the case.
>>>
>>> Reshape takes place in a window. All io *beyond* the window is allowed
>>> to proceed normally - that part of the array has not been reshaped so
>>> the old parameters are used.
>>>
>>> All io *in front* of the window is allowed to proceed normally - that
>>> part of the array has been reshaped so the new parameters are used.
>>>
>>> io *IN* the window is paused until the window has passed. This
>>> interruption should be short and sweet.
>>
>> Yes, it's correct, and in this case reshape_safe should be the same as
>> reshape_progress, and I guess the io is stuck because
>> stripe_ahead_of_reshape() returns true.
>>
>> So this deadlock happens when io is blocked because of the reshape, and
>> mddev_suspend() is waiting for this io to be done; in the meantime the
>> reshape can't start until mddev_suspend() returns.
>>
>> Jove, as I understand this, if mdadm makes progress without a blocked
>> io, and the reshape continues, it seems you can use this array without
>> problems.
>>
>> Thanks,
>> Kuai
>>>
>>> Cheers,
>>> Wol
>>>
>>> .
>>>
>>
> .
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-06  1:33             ` Yu Kuai
@ 2023-05-06 13:07               ` Jove
  2023-05-06 21:59                 ` Wol
  2023-05-09  2:10                 ` Yu Kuai
  0 siblings, 2 replies; 21+ messages in thread
From: Jove @ 2023-05-06 13:07 UTC (permalink / raw)
  To: Yu Kuai; +Cc: Wol, linux-raid, yukuai (C)

Hi Kuai,

Just to confirm, the array seems fine after the reshape. Copying files now.

Would it be best if I scrap this array and create a new one or is this
array safe to use in the long term? I had to use the --invalid-backup
flag to get it to reshape, so there might be corruption before that
resume point?

I have to do a reshape anyway, to 5 raid devices.

> In the meantime, I'll try to fix this deadlock; I hope you don't mind a
> Reported-by tag.

I would not, thank you.

I still have the backup images of the drive in reshape. If you wish I
can test any fix you create.

> I have no idea why systemd-udevd is accessing the array.

My guess is that it accesses this array because it checks it for the
LVM layout so it can automatically create the /dev/mapper entries.
With systemd-udevd disabled, these entries do not automatically
appear.
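
For what it's worth, nothing LVM-specific is needed to trigger the
hang; a probe as small as this sketch is enough once its read lands in
the stalled reshape window (my illustration, not udev's actual code;
it assumes the array is md0 and the offset is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	off_t off = 1024LL * 1024 * 1024; /* arbitrary 1 GiB offset */
	int fd = open("/dev/md0", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/md0");
		return 1;
	}
	/* blocks in D state if this lands in the stalled reshape window */
	ssize_t n = pread(fd, buf, sizeof(buf), off);
	printf("read %zd bytes at offset %lld\n", n, (long long)off);
	close(fd);
	return 0;
}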

And thank you again for getting me my data back.

Best regards,

   Johan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-06 13:07               ` Jove
@ 2023-05-06 21:59                 ` Wol
  2023-05-07 11:30                   ` Jove
  2023-05-09  2:10                 ` Yu Kuai
  1 sibling, 1 reply; 21+ messages in thread
From: Wol @ 2023-05-06 21:59 UTC (permalink / raw)
  To: Jove, Yu Kuai; +Cc: linux-raid, yukuai (C)

On 06/05/2023 14:07, Jove wrote:
> Hi Kuai,
> 
> Just to confirm, the array seems fine after the reshape. Copying files now.
> 
> Would it be best if I scrap this array and create a new one or is this
> array safe to use in the long term? I had to use the --invalid-backup
> flag to get it to reshape, so there might be corruption before that
> resume point?
> 
> I have to do a reshape anyway, to 5 raid devices.
> 
I wouldn't think it necessary to scrap the array, but if you've backed 
it up and are happier doing so ...

AIUI it was an external program squeezing in where it shouldn't that 
(quite literally) threw a spanner in the works and jammed things up. The 
array itself should be perfectly okay.

As for the "invalid backup" problem, you should never have given it a 
backup in the first place, and (while I don't know the code) I very much 
expect it ignored the option completely. You have superblock 1.2, which 
has a chunk of space "reserved for internal use", one use of which is to
provide this backup.

The only real good reason I can think of for scrapping and recreating 
the array is that it will give you a clean array, with ALL THE CURRENT 
DEFAULTS. This is important if anything goes wrong in future, if you 
have an array with a known creation date, that has not been "messed 
about" with since, it's easier to recover if you're really stupid and 
damage it and lose your records of the layout. Once an array goes 
through reshapes, it can be a lot harder to work out the layout if you 
have to rescue the array by recreating it.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-06 21:59                 ` Wol
@ 2023-05-07 11:30                   ` Jove
  0 siblings, 0 replies; 21+ messages in thread
From: Jove @ 2023-05-07 11:30 UTC (permalink / raw)
  To: Wol; +Cc: Yu Kuai, linux-raid, yukuai (C)

Hi Wol,

> I wouldn't think it necessary to scrap the array, but if you've backed
> it up and are happier doing so ...

Not particularly. I have taken backups and I am reshaping it to 5 raid
devices and if it works, I'll keep it.

> As for the "invalid backup" problem, you should never have given it a
> backup in the first place, and (while I don't know the code) I very much
> expect it ignored the option completely.

I don't know, Wol. I added the option because the wiki recommended it.
All I know is that when I tried to resume the reshape without the
backup file or without the --invalid-backup option, mdadm complained it
could not restore the critical section and refused to assemble the
array.

> Once an array goes through reshapes, it can be a lot harder to work
> out the layout if you have to rescue the array by recreating it.

I am no longer going to rely on the array alone to keep my data safe.
Should this array ever fail again, there will be backups to recover
from.

Thanks,

    Johan



On Sat, May 6, 2023 at 11:59 PM Wol <antlists@youngman.org.uk> wrote:
>
> On 06/05/2023 14:07, Jove wrote:
> > Hi Kuai,
> >
> > Just to confirm, the array seems fine after the reshape. Copying files now.
> >
> > Would it be best if I scrap this array and create a new one or is this
> > array safe to use in the long term? I had to use the --invalid-backup
> > flag to get it to reshape, so there might be corruption before that
> > resume point?
> >
> > I have to do a reshape anyway, to 5 raid devices.
> >
> I wouldn't think it necessary to scrap the array, but if you've backed
> it up and are happier doing so ...
>
> AIUI it was an external program squeezing in where it shouldn't that
> (quite literally) threw a spanner in the works and jammed things up. The
> array itself should be perfectly okay.
>
> As for the "invalid backup" problem, you should never have given it a
> backup in the first place, and (while I don't know the code) I very much
> expect it ignored the option completely. You have superblock 1.2, which
> has a chunk of space "reserved for internal use", one use of which is to
> provide this backup.
>
> The only real good reason I can think of for scrapping and recreating
> the array is that it will give you a clean array, with ALL THE CURRENT
> DEFAULTS. This is important if anything goes wrong in future, if you
> have an array with a known creation date, that has not been "messed
> about" with since, it's easier to recover if you're really stupid and
> damage it and lose your records of the layout. Once an array goes
> through reshapes, it can be a lot harder to work out the layout if you
> have to rescue the array by recreating it.
>
> Cheers,
> Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-06 13:07               ` Jove
  2023-05-06 21:59                 ` Wol
@ 2023-05-09  2:10                 ` Yu Kuai
  2023-05-09 20:18                   ` Johan Verrept
  1 sibling, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2023-05-09  2:10 UTC (permalink / raw)
  To: Jove, Yu Kuai
  Cc: Wol, linux-raid, yukuai (C), songliubraving, Logan Gunthorpe

[-- Attachment #1: Type: text/plain, Size: 1278 bytes --]

Hi, Jove

On 2023/05/06 21:07, Jove wrote:
> Hi Kuai,
> 
> Just to confirm, the array seems fine after the reshape. Copying files now.
> 
> Would it be best if I scrap this array and create a new one or is this
> array safe to use in the long term? I had to use the --invalid-backup
> flag to get it to reshape, so there might be corruption before that
> resume point?
> 
> I have to do a reshape anyway, to 5 raid devices.
> 
>> In the meantime, I'll try to fix this deadlock, hope you don't mind a
>> reported-by tag.
> 
> I would not, thank you.
> 
> I still have the backup images of the drive in reshape. If you wish I
> can test any fix you create.

Here is the first version of the fix patch: I fail the io that is
waiting for the reshape while the reshape can't make progress. I tested
it in my VM and it works as I expected. Can you give it a try to see if
mdadm can still assemble?

Thanks,
Kuai
> 
>> I have no idea why systemd-udevd is accessing the array.
> 
> My guess is that it accesses this array because it checks it for the
> LVM layout so it can automatically create the /dev/mapper entries.
> With systemd-udevd disabled, these entries do not automatically
> appear.
> 
> And thank you again for getting me my data back.
> 
> Best regards,
> 
>     Johan
> .
> 

[-- Attachment #2: 0001-md-fix-raid456-deadlock.patch --]
[-- Type: text/plain, Size: 5758 bytes --]

From 159ea7c8d591882dfbbdf30938c1c1d5bc9d4931 Mon Sep 17 00:00:00 2001
From: Yu Kuai <yukuai3@huawei.com>
Date: Tue, 9 May 2023 09:28:36 +0800
Subject: [PATCH] md: fix raid456 deadlock

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.c    | 20 ++++----------------
 drivers/md/md.h    | 18 ++++++++++++++++++
 drivers/md/raid5.c | 32 +++++++++++++++++++++++++++++++-
 3 files changed, 53 insertions(+), 17 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8e344b4b3444..462529e47f19 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -93,18 +93,6 @@ static int remove_and_add_spares(struct mddev *mddev,
 				 struct md_rdev *this);
 static void mddev_detach(struct mddev *mddev);
 
-enum md_ro_state {
-	MD_RDWR,
-	MD_RDONLY,
-	MD_AUTO_READ,
-	MD_MAX_STATE
-};
-
-static bool md_is_rdwr(struct mddev *mddev)
-{
-	return (mddev->ro == MD_RDWR);
-}
-
 /*
  * Default number of read corrections we'll attempt on an rdev
  * before ejecting it from the array. We divide the read error
@@ -360,10 +348,6 @@ EXPORT_SYMBOL_GPL(md_new_event);
 static LIST_HEAD(all_mddevs);
 static DEFINE_SPINLOCK(all_mddevs_lock);
 
-static bool is_md_suspended(struct mddev *mddev)
-{
-	return percpu_ref_is_dying(&mddev->active_io);
-}
 /* Rather than calling directly into the personality make_request function,
  * IO requests come here first so that we can check if the device is
  * being suspended pending a reconfiguration.
@@ -464,6 +448,10 @@ void mddev_suspend(struct mddev *mddev)
 	wake_up(&mddev->sb_wait);
 	set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
 	percpu_ref_kill(&mddev->active_io);
+
+	if (mddev->pers->prepare_suspend)
+		mddev->pers->prepare_suspend(mddev);
+
 	wait_event(mddev->sb_wait, percpu_ref_is_zero(&mddev->active_io));
 	mddev->pers->quiesce(mddev, 1);
 	clear_bit_unlock(MD_ALLOW_SB_UPDATE, &mddev->flags);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index fd8f260ed5f8..292b96a15890 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -536,6 +536,23 @@ struct mddev {
 	bool	serialize_policy:1;
 };
 
+enum md_ro_state {
+	MD_RDWR,
+	MD_RDONLY,
+	MD_AUTO_READ,
+	MD_MAX_STATE
+};
+
+static inline bool md_is_rdwr(struct mddev *mddev)
+{
+	return (mddev->ro == MD_RDWR);
+}
+
+static inline bool is_md_suspended(struct mddev *mddev)
+{
+	return percpu_ref_is_dying(&mddev->active_io);
+}
+
 enum recovery_flags {
 	/*
 	 * If neither SYNC or RESHAPE are set, then it is a recovery.
@@ -614,6 +631,7 @@ struct md_personality
 	int (*start_reshape) (struct mddev *mddev);
 	void (*finish_reshape) (struct mddev *mddev);
 	void (*update_reshape_pos) (struct mddev *mddev);
+	void (*prepare_suspend) (struct mddev *mddev);
 	/* quiesce suspends or resumes internal processing.
 	 * 1 - stop new actions and wait for action io to complete
 	 * 0 - return to normal behaviour
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 812a12e3e41a..5a24935c113d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -761,6 +761,7 @@ enum stripe_result {
 	STRIPE_RETRY,
 	STRIPE_SCHEDULE_AND_RETRY,
 	STRIPE_FAIL,
+	STRIPE_FAIL_AND_RETRY,
 };
 
 struct stripe_request_ctx {
@@ -5997,7 +5998,8 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 			if (ahead_of_reshape(mddev, logical_sector,
 					     conf->reshape_safe)) {
 				spin_unlock_irq(&conf->device_lock);
-				return STRIPE_SCHEDULE_AND_RETRY;
+				ret = STRIPE_SCHEDULE_AND_RETRY;
+				goto out;
 			}
 		}
 		spin_unlock_irq(&conf->device_lock);
@@ -6076,6 +6078,18 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 
 out_release:
 	raid5_release_stripe(sh);
+out:
+	/*
+	 * There is no point waiting for the reshape, because it can't make
+	 * progress if the array is suspended or is not read-write.
+	 */
+	if (ret == STRIPE_SCHEDULE_AND_RETRY &&
+	    (is_md_suspended(mddev) || !md_is_rdwr(mddev))) {
+		bi->bi_status = BLK_STS_IOERR;
+		ret = STRIPE_FAIL;
+		pr_err("md/raid456:%s: array is suspended or not read write, io across reshape position failed, please try again after reshape.\n",
+		       mdname(mddev));
+	}
 	return ret;
 }
 
@@ -8654,6 +8668,19 @@ static void raid5_finish_reshape(struct mddev *mddev)
 	}
 }
 
+static void raid5_prepare_suspend(struct mddev *mddev)
+{
+	struct r5conf *conf = mddev->private;
+
+	/*
+	 * Before waiting for active_io to be done, fail all the io that is
+	 * waiting for reshape because it can never complete after suspend.
+	 *
+	 * Perhaps it's better to let that io wait for resume than to fail it.
+	 */
+	wake_up(&conf->wait_for_overlap);
+}
+
 static void raid5_quiesce(struct mddev *mddev, int quiesce)
 {
 	struct r5conf *conf = mddev->private;
@@ -9020,6 +9047,7 @@ static struct md_personality raid6_personality =
 	.check_reshape	= raid6_check_reshape,
 	.start_reshape  = raid5_start_reshape,
 	.finish_reshape = raid5_finish_reshape,
+	.prepare_suspend = raid5_prepare_suspend,
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid6_takeover,
 	.change_consistency_policy = raid5_change_consistency_policy,
@@ -9044,6 +9072,7 @@ static struct md_personality raid5_personality =
 	.check_reshape	= raid5_check_reshape,
 	.start_reshape  = raid5_start_reshape,
 	.finish_reshape = raid5_finish_reshape,
+	.prepare_suspend = raid5_prepare_suspend,
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid5_takeover,
 	.change_consistency_policy = raid5_change_consistency_policy,
@@ -9069,6 +9098,7 @@ static struct md_personality raid4_personality =
 	.check_reshape	= raid5_check_reshape,
 	.start_reshape  = raid5_start_reshape,
 	.finish_reshape = raid5_finish_reshape,
+	.prepare_suspend = raid5_prepare_suspend,
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid4_takeover,
 	.change_consistency_policy = raid5_change_consistency_policy,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-09  2:10                 ` Yu Kuai
@ 2023-05-09 20:18                   ` Johan Verrept
  2023-05-10  1:13                     ` Yu Kuai
  0 siblings, 1 reply; 21+ messages in thread
From: Johan Verrept @ 2023-05-09 20:18 UTC (permalink / raw)
  To: Yu Kuai, Jove
  Cc: Wol, linux-raid, yukuai (C), songliubraving, Logan Gunthorpe


Hi Kuai,

> Here is the first version of the fix patch: I fail the io that is
> waiting for the reshape while the reshape can't make progress. I tested
> it in my VM and it works as I expected. Can you give it a try to see if
> mdadm can still assemble?

Assemble seems to work fine and the reshape resumed.

I see this error appearing:

     md/raid456:md0: array is suspended or not read write, io across
reshape position failed, please try again after reshape.

From what I can see in your patch, this is what is expected.

Best regards,

     Johan


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command
  2023-05-09 20:18                   ` Johan Verrept
@ 2023-05-10  1:13                     ` Yu Kuai
  0 siblings, 0 replies; 21+ messages in thread
From: Yu Kuai @ 2023-05-10  1:13 UTC (permalink / raw)
  To: Johan Verrept, Yu Kuai, Jove
  Cc: Wol, linux-raid, songliubraving, Logan Gunthorpe, David Gilmour,
	yukuai (C)

Hi, Johan

On 2023/05/10 4:18, Johan Verrept wrote:
> 
> Hi Kuai,
> 
>> Here is the first version of the fix patch: I fail the io that is
>> waiting for the reshape while the reshape can't make progress. I tested
>> it in my VM and it works as I expected. Can you give it a try to see if
>> mdadm can still assemble?
> 
> Assemble seems to work fine and the reshape resumed.
> 

That's great, thanks for testing.

David, you can try this patch as well; your case is different, but
I think this patch will work.

Thanks,
Kuai
> I see this error appearing:
> 
>      md/raid456:md0: array is suspended or not read write, io across
> reshape position failed, please try again after reshape.
> 
>  From what I can see in your patch, this is what is expected.
> 
> Best regards,
> 
>      Johan
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-05-10  1:13 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-23 19:09 Raid5 to raid6 grow interrupted, mdadm hangs on assemble command Jove
2023-04-23 19:19 ` Reindl Harald
2023-04-23 19:32   ` Jove
2023-04-24  7:02     ` Jove
2023-04-24  7:30       ` Wols Lists
2023-04-24  7:41 ` Wols Lists
2023-04-24 13:31   ` Jove
2023-04-24 21:29     ` Jove
2023-05-04 11:41 ` Yu Kuai
2023-05-04 18:02   ` Jove
2023-05-05  1:34     ` Yu Kuai
2023-05-05  6:58       ` Wol
2023-05-05  8:02         ` Yu Kuai
2023-05-05 15:47           ` Jove
2023-05-06  1:33             ` Yu Kuai
2023-05-06 13:07               ` Jove
2023-05-06 21:59                 ` Wol
2023-05-07 11:30                   ` Jove
2023-05-09  2:10                 ` Yu Kuai
2023-05-09 20:18                   ` Johan Verrept
2023-05-10  1:13                     ` Yu Kuai
