* Disk "failed" while doing scrub
@ 2015-07-13 6:26 Dāvis Mosāns
2015-07-13 8:12 ` Duncan
2015-08-21 4:16 ` Dāvis Mosāns
0 siblings, 2 replies; 5+ messages in thread
From: Dāvis Mosāns @ 2015-07-13 6:26 UTC (permalink / raw)
To: linux-btrfs
Hello,
Short version: while running a scrub on a 5-disk btrfs filesystem, /dev/sdd
"failed", and there was also an error on another disk (/dev/sdh).
Because the filesystem still mounts, I assume I should run "btrfs device
delete /dev/sdd /mntpoint" and then restore the damaged files from backup.
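For reference, here is roughly what I have in mind — a sketch only, not yet
run against this filesystem; /dev/sdX stands for a hypothetical replacement
disk, the other names are the ones from this report:

```shell
# Sketch of the planned removal -- untested; /dev/sdX is a hypothetical
# replacement disk, everything else is from this report.
MNT=/mntpoint
BAD=/dev/sdd

# Option 1: if a replacement disk is attached, "btrfs replace" copies
# device-to-device and is usually much faster than delete+add:
#   btrfs replace start $BAD /dev/sdX $MNT
#   btrfs replace status $MNT

# Option 2: "btrfs device delete" restripes sdd's data onto the remaining
# four disks, so it needs enough free space on them:
#   btrfs device delete $BAD $MNT

# Either way, verify the device list afterwards:
#   btrfs filesystem show $MNT
echo "would remove $BAD from $MNT"
```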
Are all affected files listed in the journal? There are messages about "x
callbacks suppressed", so I'm not sure; and if they aren't all listed, how
can I get a full list of the damaged files?
I also wonder whether there are any tools to recover partial file fragments
and reconstruct files (with the missing fragments filled with nulls)?
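I couldn't find an existing tool for this, so to illustrate what I mean,
here is a rough sketch I wrote myself (a hypothetical helper, not part of
btrfs-progs): copy a file block by block, and wherever a block fails with
EIO (as btrfs returns on a checksum mismatch), write zeros instead, so the
readable fragments survive at their correct offsets:

```python
import os

BLOCK = 4096  # btrfs data checksums typically cover 4 KiB blocks


def salvage(src_path, dst_path, block=BLOCK):
    """Copy src to dst, replacing unreadable blocks with zeros.

    Returns the number of blocks that could not be read."""
    bad = 0
    size = os.path.getsize(src_path)
    src = os.open(src_path, os.O_RDONLY)
    try:
        with open(dst_path, "wb") as dst:
            for offset in range(0, size, block):
                want = min(block, size - offset)
                try:
                    # pread reads at an absolute offset, so one failed
                    # block does not derail the rest of the copy
                    data = os.pread(src, want, offset)
                except OSError:
                    data = b"\0" * want  # fill the lost fragment with nulls
                    bad += 1
                dst.write(data)
    finally:
        os.close(src)
    return bad
```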
I assume there's no point in running "btrfs check --check-data-csum",
because scrub already performs that check?
From the journal:
kernel: drivers/scsi/mvsas/mv_sas.c 1863:Release slot [1] tag[1], task
[ffff88007efb8800]:
kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000002, slot [1].
kernel: sas: sas_ata_task_done: SAS error 8a
kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
kernel: sas: ata9: end_device-7:2: cmd error handler
kernel: sas: ata7: end_device-7:0: dev error handler
kernel: sas: ata14: end_device-7:7: dev error handler
kernel: ata9.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
kernel: ata9.00: failed command: READ FPDMA QUEUED
kernel: ata9.00: cmd 60/00:00:00:3d:a1/04:00:ab:00:00/40 tag 11 ncq 524288 in
res
41/40:00:48:40:a1/00:04:ab:00:00/00 Emask 0x409 (media error) <F>
kernel: ata9.00: status: { DRDY ERR }
kernel: ata9.00: error: { UNC }
kernel: ata9.00: configured for UDMA/133
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00
driverbyte=0x08
kernel: sd 7:0:2:0: [sdd] tag#0 Sense Key : 0x3 [current] [descriptor]
kernel: sd 7:0:2:0: [sdd] tag#0 ASC=0x11 ASCQ=0x4
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 3d 00 00 04 00 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879471688
kernel: ata9: EH complete
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
kernel: drivers/scsi/mvsas/mv_sas.c 1863:Release slot [1] tag[1], task
[ffff88007efb9a00]:
kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000003, slot [1].
kernel: sas: sas_ata_task_done: SAS error 8a
kernel: sas: Enter sas_scsi_recover_host busy: 2 failed: 2
kernel: sas: trying to find task 0xffff8801e0cadb00
kernel: sas: sas_scsi_find_task: aborting task 0xffff8801e0cadb00
kernel: sas: sas_scsi_find_task: task 0xffff8801e0cadb00 is aborted
kernel: sas: sas_eh_handle_sas_errors: task 0xffff8801e0cadb00 is aborted
kernel: sas: ata9: end_device-7:2: cmd error handler
kernel: sas: ata8: end_device-7:1: cmd error handler
kernel: sas: ata7: end_device-7:0: dev error handler
kernel: sas: ata8: end_device-7:1: dev error handler
kernel: ata8.00: exception Emask 0x0 SAct 0x40000 SErr 0x0 action 0x6 frozen
kernel: ata8.00: failed command: READ FPDMA QUEUED
kernel: ata8.00: cmd 60/00:00:00:1b:36/04:00:bf:00:00/40 tag 18 ncq 524288 in
res
40/00:08:00:58:11/00:00:a6:00:00/40 Emask 0x4 (timeout)
kernel: ata8.00: status: { DRDY }
kernel: ata8: hard resetting link
kernel: sas: ata9: end_device-7:2: dev error handler
kernel: sas: ata14: end_device-7:7: dev error handler
kernel: ata9: log page 10h reported inactive tag 26
kernel: ata9.00: exception Emask 0x1 SAct 0x400000 SErr 0x0 action 0x6
kernel: ata9.00: failed command: READ FPDMA QUEUED
kernel: ata9.00: cmd 60/08:00:48:40:a1/00:00:ab:00:00/40 tag 22 ncq 4096 in
res
01/04:a8:40:40:a1/00:00:ab:00:00/40 Emask 0x3 (HSM violation)
kernel: ata9.00: status: { ERR }
kernel: ata9.00: error: { ABRT }
kernel: ata9: hard resetting link
kernel: sas: sas_form_port: phy1 belongs to port1 already(1)!
kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
kernel: ata9.00: revalidation failed (errno=-2)
kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[1]:rc= 0
kernel: ata8.00: configured for UDMA/133
kernel: ata8.00: device reported invalid CHS sector 0
kernel: ata8: EH complete
kernel: ata9: hard resetting link
kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
kernel: ata9.00: revalidation failed (errno=-2)
kernel: ata9: hard resetting link
kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
kernel: ata9.00: revalidation failed (errno=-2)
kernel: ata9.00: disabled
kernel: ata9: EH complete
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 40 48 00 00 08 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879471688
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 45 00 00 06 00 00
kernel: BTRFS: unable to fixup (regular) error at logical
7390602616832 on dev /dev/sdd
kernel: BTRFS: unable to fixup (regular) error at logical
7390602891264 on dev /dev/sdd
kernel: scsi_io_completion: 186117 callbacks suppressed
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x2a 2a 00 00 14 78 c0 00 00 20 00
kernel: blk_update_request: 186156 callbacks suppressed
kernel: blk_update_request: I/O error, dev sdd, sector 1341632
kernel: sd 7:0:2:0: [sdd] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#1 CDB: opcode=0x2a 2a 00 00 14 7a 80 00 00 20 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879472896
kernel: BTRFS: i/o error at logical 7386235424768 on dev /dev/sdd,
sector 2891849768, root 3034, inode 5633529, offset 11878400, length
4096, links 1 (path: [...])
kernel: BTRFS: i/o error at logical 7386235039744 on dev /dev/sdd,
sector 2891849016, root 3034, inode 5633529, offset 11493376, length
4096, links 1 (path: [...])
kernel: btrfs_dev_stat_print_on_error: 78908 callbacks suppressed
kernel: BTRFS: bdev /dev/sdd errs: wr 347, rd 1644871, flush 0, corrupt 0, gen 0
kernel: BTRFS: bdev /dev/sdd errs: wr 356, rd 1644871, flush 0, corrupt 0, gen 0
kernel: BTRFS: error (device sdh) in write_all_supers:3454: errno=-5
IO failure (errors while submitting device barriers.)
kernel: BTRFS info (device sdh): forced readonly
kernel: BTRFS warning (device sdh): Skipping commit of aborted transaction.
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 5 PID: 3756 at fs/btrfs/super.c:260
__btrfs_abort_transaction+0x54/0x130 [btrfs]()
kernel: BTRFS: Transaction aborted (error -5)
kernel: Modules linked in: nf_conntrack_netbios_ns
nf_conntrack_broadcast xt_tcpudp ip6t_rpfilter ip6t_REJECT [...]
kernel: nvidia(PO) tda8290 tuner aes_x86_64 lrw saa7134
snd_hda_codec_realtek gf128mul edac_core glue_helper [...]
kernel:
kernel: CPU: 5 PID: 3756 Comm: btrfs-transacti Tainted: P O
4.0.7-2-ARCH #1
kernel: Hardware name: Gigabyte Technology Co., Ltd.
GA-990FXA-UD3/GA-990FXA-UD3, BIOS FFe 11/08/2013
kernel: 0000000000000000 000000005f5d9ca7 ffff88006090fc18 ffffffff81574ec3
kernel: 0000000000000000 ffff88006090fc70 ffff88006090fc58 ffffffff81074e7a
kernel: 0000000000000000 ffff8800ce8e6c60 00000000fffffffb ffff8800bbaa4800
kernel: Call Trace:
kernel: [<ffffffff81574ec3>] dump_stack+0x4c/0x6e
kernel: [<ffffffff81074e7a>] warn_slowpath_common+0x8a/0xc0
kernel: [<ffffffff81074f05>] warn_slowpath_fmt+0x55/0x70
kernel: [<ffffffffa0253bb4>] __btrfs_abort_transaction+0x54/0x130 [btrfs]
kernel: [<ffffffffa0282ceb>] cleanup_transaction+0x7b/0x300 [btrfs]
kernel: [<ffffffff810b6ce0>] ? wake_atomic_t_function+0x60/0x60
kernel: [<ffffffffa0284162>] btrfs_commit_transaction+0x932/0xc10 [btrfs]
kernel: [<ffffffffa027f3a5>] transaction_kthread+0x1d5/0x240 [btrfs]
kernel: [<ffffffffa027f1d0>] ? btrfs_cleanup_transaction+0x5a0/0x5a0 [btrfs]
kernel: [<ffffffff810934b8>] kthread+0xd8/0xf0
kernel: [<ffffffff810933e0>] ? kthread_worker_fn+0x170/0x170
kernel: [<ffffffff8157a718>] ret_from_fork+0x58/0x90
kernel: [<ffffffff810933e0>] ? kthread_worker_fn+0x170/0x170
kernel: ---[ end trace 8ecc49ef203bd88c ]---
kernel: BTRFS: error (device sdh) in cleanup_transaction:1686:
errno=-5 IO failure
kernel: BTRFS info (device sdh): delayed_refs has NO entry
kernel: scrub_handle_errored_block: 92600 callbacks suppressed
kernel: BTRFS: i/o error at logical 7390928568320 on dev /dev/sdd,
sector 2892627456, root 3034, inode 5637106, offset 614400, length
4096, links 1 (path: [...])
kernel: BTRFS: i/o error at logical 7390928175104 on dev /dev/sdd,
sector 2892626688, root 3034, inode 5637106, offset 483328, length
4096, links 1 (path: [...])
kernel: scrub_handle_errored_block: 77404 callbacks suppressed
kernel: BTRFS: unable to fixup (regular) error at logical
7390928568320 on dev /dev/sdd
kernel: BTRFS: unable to fixup (regular) error at logical
7390928175104 on dev /dev/sdd
smartd[723]: Device: /dev/sdd [SAT], not capable of SMART self-check
smartd[723]: Device: /dev/sdd [SAT], failed to read SMART Attribute Data
smartd[723]: Device: /dev/sdd [SAT], Read SMART Self Test Log Failed
smartd[723]: Device: /dev/sdd [SAT], Read Summary SMART Error Log failed
kernel: scsi_io_completion: 8110 callbacks suppressed
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 e8 e0 88 00 00 00 08 00
kernel: blk_update_request: 8115 callbacks suppressed
kernel: blk_update_request: I/O error, dev sdd, sector 3907028992
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 e8 e0 88 00 00 00 08 00
kernel: blk_update_request: I/O error, dev sdd, sector 3907028992
kernel: Buffer I/O error on dev sdd, logical block 488378624, async page read
Long story:
I had a Seagate disk that died, but it was still covered by warranty, so I
got a replacement. Only, the disk they returned wasn't new but repaired,
and it seems it won't last long, because it has already developed
uncorrectable sectors even though I haven't used it much.
When I received it, I ran a full SMART test and checked all sectors;
everything passed and seemed fine. But now, after copying my data onto it
and using it for a while, I found:
smartd[592]: Device: /dev/sdd [SAT], 16 Currently unreadable (pending) sectors
smartd[592]: Device: /dev/sdd [SAT], 16 Offline uncorrectable sectors
Then I ran a scrub:
scrub status for 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
scrub started at Sun Jul 12 13:36:11 2015 and was aborted after 02:43:21
total bytes scrubbed: 6.24TiB with 1648151 errors
error details: read=1648151
corrected errors: 704, uncorrectable errors: 1647447,
unverified errors: 0
It caused the drive to become unrecognizable by Linux, and it seems it also
triggered an error on a different disk (/dev/sdh), which made the
filesystem go read-only and fail to mount:
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 00 00 00 80 00 00 08 00
kernel: blk_update_request: I/O error, dev sdd, sector 128
kernel: BTRFS info (device sdh): enabling auto defrag
kernel: BTRFS info (device sdh): disk space caching is enabled
kernel: BTRFS: has skinny extents
kernel: BTRFS: failed to read chunk tree on sdh
mount[17625]: mount: wrong fs type, bad option, bad superblock on /dev/sdh,
mount[17625]: missing codepage or helper program, or other error
mount[17625]: In some cases useful info is found in syslog - try
mount[17625]: dmesg | tail or so.
kernel: BTRFS: open_ctree failed
kernel: sd 7:0:2:0: [sdd] Synchronizing SCSI cache
kernel: sd 7:0:2:0: [sdd] Synchronize Cache(10) failed: Result:
hostbyte=0x04 driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] Stopping disk
kernel: sd 7:0:2:0: [sdd] Start/Stop Unit failed: Result:
hostbyte=0x04 driverbyte=0x00
I pulled out the /dev/sdd drive and plugged it back in:
kernel: mvsas 0000:07:00.0: Phy2 : No sig fis
kernel: sas: phy-7:2 added to port-7:2, phy_mask:0x4 ( 200000000000000)
kernel: sas: DOING DISCOVERY on port 2, pid:16744
kernel: sas: DONE DISCOVERY on port 2, pid:16744, result:0
kernel: sas: Enter sas_scsi_recover_host busy: 0 failed: 0
kernel: ata20.00: ATA-8: ST2000DM001-9YN164, CC9F, max UDMA/133
kernel: ata20.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32)
kernel: ata20.00: configured for UDMA/133
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
kernel: scsi 7:0:8:0: Direct-Access ATA ST2000DM001-9YN1 CC9F
PQ: 0 ANSI: 5
kernel: sd 7:0:8:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
kernel: sd 7:0:8:0: [sdd] 4096-byte physical blocks
kernel: sd 7:0:8:0: [sdd] Write Protect is off
kernel: sd 7:0:8:0: [sdd] Mode Sense: 00 3a 00 00
kernel: sd 7:0:8:0: [sdd] Write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
kernel: sd 7:0:8:0: [sdd] Attached SCSI disk
smartd[723]: Device: /dev/sdd [SAT], SMART Usage Attribute: 187
Reported_Uncorrect changed from 100 to 98
smartd[723]: Device: /dev/sdd [SAT], previous self-test completed with
error (read test element)
smartd[723]: Device: /dev/sdd [SAT], Self-Test Log error count
increased from 0 to 2
smartd[723]: Device: /dev/sdd [SAT], ATA error count increased from 0 to 2
Everything seemed "ok" again, so I ran a short SMART self-test, which now
failed for the first time (though the disk's overall SMART status still
says PASSED). Then I resumed the scrub and it completed:
scrub status for 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
scrub device /dev/sdc (id 1) history
scrub resumed at Sun Jul 12 18:07:06 2015 and finished after 04:34:02
total bytes scrubbed: 2.35TiB with 0 errors
scrub device /dev/sdd (id 2) history
scrub resumed at Sun Jul 12 18:07:06 2015 and finished after 02:56:23
total bytes scrubbed: 1.44TiB with 1648151 errors
error details: read=1648151
corrected errors: 704, uncorrectable errors: 1647447,
unverified errors: 0
scrub device /dev/sde (id 3) history
scrub started at Sun Jul 12 13:36:11 2015 and finished after 02:35:46
total bytes scrubbed: 1.43TiB with 0 errors
scrub device /dev/sdg (id 4) history
scrub started at Sun Jul 12 13:36:11 2015 and finished after 02:40:01
total bytes scrubbed: 1.44TiB with 0 errors
scrub device /dev/sdh (id 5) history
scrub started at Sun Jul 12 13:36:11 2015 and finished after 01:14:34
total bytes scrubbed: 537.82GiB with 0 errors
btrfs device stats doesn't show any errors:
[/dev/sdc].write_io_errs 0
[/dev/sdc].read_io_errs 0
[/dev/sdc].flush_io_errs 0
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0
[/dev/sdd].write_io_errs 0
[/dev/sdd].read_io_errs 0
[/dev/sdd].flush_io_errs 0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
[/dev/sde].write_io_errs 0
[/dev/sde].read_io_errs 0
[/dev/sde].flush_io_errs 0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0
[/dev/sdg].write_io_errs 0
[/dev/sdg].read_io_errs 0
[/dev/sdg].flush_io_errs 0
[/dev/sdg].corruption_errs 0
[/dev/sdg].generation_errs 0
[/dev/sdh].write_io_errs 0
[/dev/sdh].read_io_errs 0
[/dev/sdh].flush_io_errs 0
[/dev/sdh].corruption_errs 0
[/dev/sdh].generation_errs 0
The other disk, /dev/sdh, doesn't show any signs of going bad, so most
likely it was the controller's fault when sdd threw errors.
When scrub reports error counts, what exactly counts as one error, a file
fragment?
Also, is there an easy way to locate those unreadable sectors and rewrite
them so that the HDD relocates them?
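What I'm considering trying, based on the LBA that the SMART self-test log
below reports ("LBA_of_first_error") — a sketch only, and destructive,
since overwriting a sector wipes whatever btrfs stored in it:

```shell
# Sketch only, and DESTRUCTIVE: overwriting a sector destroys the data
# btrfs stored there, so only do this for files you'll restore from backup.
# The LBA comes from the SMART self-test log ("LBA_of_first_error").
DEV=/dev/sdd
LBA=2879471688

# 1. Confirm the sector really is unreadable (should report an I/O error):
#      hdparm --read-sector $LBA $DEV
# 2. Overwrite it so the firmware remaps it from the spare pool:
#      dd if=/dev/zero of=$DEV bs=512 count=1 seek=$LBA oflag=direct
#    (hdparm --write-sector $LBA $DEV does the same, behind an extra
#    --yes-i-know-what-i-am-doing safety flag.)
# 3. Re-run a short self-test to find the next bad LBA and repeat until
#    Current_Pending_Sector reaches 0.

# Byte offset of that sector, for tools that take offsets instead of LBAs:
echo $((LBA * 512))
```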
Thanks :)
Here's the full SMART info for /dev/sdd:
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-9YN164
Serial Number: W2404VST
LU WWN Device Id: 5 000c50 044a7a68a
Firmware Version: CC9F
User Capacity: 2 000 398 934 016 bytes [2,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Jul 13 07:40:14 2015 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM level is: 128 (minimum power consumption without standby)
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: ( 592) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection
on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 254) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3081) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 117 100 006 - 166724616
3 Spin_Up_Time PO---- 092 092 000 - 0
4 Start_Stop_Count -O--CK 100 100 020 - 626
5 Reallocated_Sector_Ct PO--CK 100 100 036 - 0
7 Seek_Error_Rate POSR-- 060 060 030 - 1306645
9 Power_On_Hours -O--CK 097 097 000 - 3154
10 Spin_Retry_Count PO--C- 100 100 097 - 0
12 Power_Cycle_Count -O--CK 100 100 020 - 433
183 Runtime_Bad_Block -O--CK 100 100 000 - 0
184 End-to-End_Error -O--CK 100 100 099 - 0
187 Reported_Uncorrect -O--CK 098 098 000 - 2
188 Command_Timeout -O--CK 100 099 000 - 4 4 4
189 High_Fly_Writes -O-RCK 100 100 000 - 0
190 Airflow_Temperature_Cel -O---K 070 058 045 - 30 (0 1 34 29 0)
191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
192 Power-Off_Retract_Count -O--CK 100 100 000 - 335
193 Load_Cycle_Count -O--CK 096 096 000 - 9566
194 Temperature_Celsius -O---K 030 042 000 - 30 (128 0 0 0 0)
197 Current_Pending_Sector -O--C- 100 100 000 - 16
198 Offline_Uncorrectable ----C- 100 100 000 - 16
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
240 Head_Flying_Hours ------ 100 253 000 - 367h+26m+14.504s
241 Total_LBAs_Written ------ 100 253 000 - 38608136381115
242 Total_LBAs_Read ------ 100 253 000 - 7979572945843
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 5 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 SATA NCQ Queued Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x21 GPL R/O 1 Write stream error log
0x22 GPL R/O 1 Read stream error log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa1 GPL,SL VS 20 Device vendor specific log
0xa2 GPL VS 4496 Device vendor specific log
0xa8 GPL,SL VS 20 Device vendor specific log
0xa9 GPL,SL VS 1 Device vendor specific log
0xab GPL VS 1 Device vendor specific log
0xb0 GPL VS 5067 Device vendor specific log
0xbd GPL VS 512 Device vendor specific log
0xbe-0xbf GPL VS 65535 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 2
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 [1] occurred at disk power-on lifetime: 3139 hours (130 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 ab a1 40 48 00 00 Error: UNC at LBA =
0xaba14048 = 2879471688
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 00 00 08 00 00 ab a1 40 48 40 00 02:54:39.784 READ FPDMA QUEUED
60 00 00 00 08 00 00 ab a1 40 40 40 00 02:54:39.783 READ FPDMA QUEUED
60 00 00 00 08 00 00 ab a1 40 38 40 00 02:54:39.783 READ FPDMA QUEUED
60 00 00 00 08 00 00 ab a1 40 30 40 00 02:54:39.782 READ FPDMA QUEUED
60 00 00 00 08 00 00 ab a1 40 28 40 00 02:54:39.782 READ FPDMA QUEUED
Error 1 [0] occurred at disk power-on lifetime: 3139 hours (130 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 ab a1 40 48 00 00 Error: UNC at LBA =
0xaba14048 = 2879471688
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 00 04 00 00 00 ab a0 14 00 40 00 02:54:36.512 READ FPDMA QUEUED
60 00 00 04 00 00 00 ab a0 10 00 40 00 02:54:36.500 READ FPDMA QUEUED
60 00 00 04 00 00 00 ab a0 0c 00 40 00 02:54:36.498 READ FPDMA QUEUED
60 00 00 04 00 00 00 ab a0 08 00 40 00 02:54:36.497 READ FPDMA QUEUED
60 00 00 04 00 00 00 ab 9f f9 00 40 00 02:54:36.402 READ FPDMA QUEUED
SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 occurred at disk power-on lifetime: 3139 hours (130 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 08 ff ff ff 4f 00 02:54:39.784 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 02:54:39.783 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 02:54:39.783 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 02:54:39.782 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 02:54:39.782 READ FPDMA QUEUED
Error 1 occurred at disk power-on lifetime: 3139 hours (130 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 02:54:36.512 READ FPDMA QUEUED
60 00 00 ff ff ff 4f 00 02:54:36.500 READ FPDMA QUEUED
60 00 00 ff ff ff 4f 00 02:54:36.498 READ FPDMA QUEUED
60 00 00 ff ff ff 4f 00 02:54:36.497 READ FPDMA QUEUED
60 00 00 ff ff ff 4f 00 02:54:36.402 READ FPDMA QUEUED
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining
LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 3139
2879471688
# 2 Short offline Completed: read failure 90% 3139
2879471688
# 3 Short offline Completed without error 00% 3049 -
# 4 Conveyance offline Completed without error 00% 2996 -
# 5 Short offline Completed without error 00% 2239 -
# 6 Extended offline Completed without error 00% 2238 -
# 7 Short offline Completed without error 00% 1550 -
# 8 Short offline Completed without error 00% 1550 -
# 9 Short offline Completed without error 00% 69 -
#10 Short offline Completed without error 00% 9 -
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining
LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 3139
2879471688
# 2 Short offline Completed: read failure 90% 3139
2879471688
# 3 Short offline Completed without error 00% 3049 -
# 4 Conveyance offline Completed without error 00% 2996 -
# 5 Short offline Completed without error 00% 2239 -
# 6 Extended offline Completed without error 00% 2238 -
# 7 Short offline Completed without error 00% 1550 -
# 8 Short offline Completed without error 00% 1550 -
# 9 Short offline Completed without error 00% 69 -
#10 Short offline Completed without error 00% 9 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 30 Celsius
Power Cycle Min/Max Temperature: 29/34 Celsius
Lifetime Min/Max Temperature: 9/42 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Data Table command not supported
SCT Error Recovery Control command not supported
Device Statistics (GP/SMART Log 0x04) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x000a 2 1 Device-to-host register FISes sent due to a COMRESET
0x0001 2 0 Command failed due to ICRC error
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
And here's the SMART info for /dev/sdh:
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F3
Device Model: SAMSUNG HD103SJ
Serial Number: S246JDWZ113593
LU WWN Device Id: 5 0024e9 002bf43c5
Firmware Version: 1AJ100E4
User Capacity: 1 000 204 886 016 bytes [1,00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Mon Jul 13 07:53:49 2015 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Disabled
APM feature is: Disabled
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 9420) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection
on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 157) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 100 100 051 - 1
2 Throughput_Performance -OS--K 055 055 000 - 8621
3 Spin_Up_Time PO---K 073 071 025 - 8314
4 Start_Stop_Count -O--CK 091 091 000 - 9745
5 Reallocated_Sector_Ct PO--CK 252 252 010 - 0
7 Seek_Error_Rate -OSR-K 252 252 051 - 0
8 Seek_Time_Performance --S--K 252 252 015 - 0
9 Power_On_Hours -O--CK 100 100 000 - 20675
10 Spin_Retry_Count -O--CK 252 252 051 - 0
11 Calibration_Retry_Count -O--CK 252 252 000 - 0
12 Power_Cycle_Count -O--CK 097 097 000 - 3297
191 G-Sense_Error_Rate -O---K 100 100 000 - 42
192 Power-Off_Retract_Count -O---K 252 252 000 - 0
194 Temperature_Celsius -O---- 064 043 000 - 32 (Min/Max 4/57)
195 Hardware_ECC_Recovered -O-RCK 100 100 000 - 0
196 Reallocated_Event_Count -O--CK 252 252 000 - 0
197 Current_Pending_Sector -O--CK 252 252 000 - 0
198 Offline_Uncorrectable ----CK 252 252 000 - 0
199 UDMA_CRC_Error_Count -OS-CK 100 100 000 - 2
200 Multi_Zone_Error_Rate -O-R-K 100 100 000 - 101
223 Load_Retry_Count -O--CK 252 252 000 - 0
225 Load_Cycle_Count -O--CK 100 100 000 - 9897
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 2 Comprehensive SMART error log
0x03 GPL R/O 2 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 2 Extended self-test log
0x08 GPL R/O 2 Power Conditions log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 SATA NCQ Queued Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xbb GPL VS 4 Device vendor specific log
0xbc GPL VS 2 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
Device Error Count: 2
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 [1] occurred at disk power-on lifetime: 4244 hours (176 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
84 -- 51 93 e8 00 00 00 00 00 00 e0 00 Error: ICRC, ABRT 37864
sectors at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
35 00 00 01 00 00 00 61 18 92 e8 e0 08 00:00:01.927 WRITE DMA EXT
25 00 00 01 00 00 00 1b ce e8 60 e0 08 00:00:01.927 READ DMA EXT
25 00 00 01 00 00 00 1b ce e7 60 e0 08 00:00:01.927 READ DMA EXT
25 00 00 01 00 00 00 1b ce e6 60 e0 08 00:00:01.927 READ DMA EXT
25 00 00 01 00 00 00 1b ce e5 60 e0 08 00:00:01.927 READ DMA EXT
Error 1 [0] occurred at disk power-on lifetime: 2234 hours (93 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
84 -- 51 e5 ee 00 00 00 00 00 00 e0 00 Error: ICRC, ABRT 58862
sectors at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
35 00 00 00 06 00 00 00 35 e5 e8 e0 08 00:00:17.173 WRITE DMA EXT
35 00 00 00 08 00 00 06 d5 77 10 e0 08 00:00:17.173 WRITE DMA EXT
35 00 00 00 03 00 00 00 82 12 48 e0 08 00:00:17.173 WRITE DMA EXT
35 00 00 00 07 00 00 06 d5 77 10 e0 08 00:00:17.171 WRITE DMA EXT
35 00 00 00 03 00 00 00 82 12 48 e0 08 00:00:17.171 WRITE DMA EXT
SMART Error Log Version: 1
No Errors Logged
SMART Extended Self-test Log Version: 1 (2 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 20661 -
# 2 Extended offline Completed without error 00% 19724 -
# 3 Short offline Completed without error 00% 19721 -
# 4 Short offline Aborted by host 90% 19404 -
# 5 Short offline Completed without error 00% 18910 -
# 6 Short offline Completed without error 00% 15792 -
# 7 Short offline Completed without error 00% 15792 -
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 20661 -
# 2 Extended offline Completed without error 00% 19724 -
# 3 Short offline Completed without error 00% 19721 -
# 4 Short offline Aborted by host 90% 19404 -
# 5 Short offline Completed without error 00% 18910 -
# 6 Short offline Completed without error 00% 15792 -
# 7 Short offline Completed without error 00% 15792 -
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has
ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed [00% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 2
SCT Version (vendor specific): 256 (0x0100)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 32 Celsius
Power Cycle Min/Max Temperature: 24/38 Celsius
Lifetime Min/Max Temperature: 7/57 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 5 minutes
Temperature Logging Interval: 5 minutes
Min/Max recommended Temperature: -5/80 Celsius
Min/Max Temperature Limit: -10/85 Celsius
Temperature History Size (Index): 128 (106)
Index Estimated Time Temperature Celsius
107 2015-07-12 21:15 35 ****************
108 2015-07-12 21:20 34 ***************
105 2015-07-13 07:45 33 **************
106 2015-07-13 07:50 32 *************
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
Device Statistics (GP/SMART Log 0x04) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 4 0 Command failed due to ICRC error
0x0002 4 0 R_ERR response for data FIS
0x0003 4 0 R_ERR response for device-to-host data FIS
0x0004 4 0 R_ERR response for host-to-device data FIS
0x0005 4 0 R_ERR response for non-data FIS
0x0006 4 0 R_ERR response for device-to-host non-data FIS
0x0007 4 0 R_ERR response for host-to-device non-data FIS
0x0008 4 0 Device-to-host non-data FIS retries
0x0009 4 1 Transition from drive PhyRdy to drive PhyNRdy
0x000a 4 2 Device-to-host register FISes sent due to a COMRESET
0x000b 4 0 CRC errors within host-to-device FIS
0x000d 4 0 Non-CRC errors within host-to-device FIS
0x000f 4 0 R_ERR response for host-to-device data FIS, CRC
0x0010 4 0 R_ERR response for host-to-device data FIS, non-CRC
0x0012 4 0 R_ERR response for host-to-device non-data FIS, CRC
0x0013 4 0 R_ERR response for host-to-device non-data FIS, non-CRC
0x8e00 4 0 Vendor specific
0x8e01 4 0 Vendor specific
0x8e02 4 0 Vendor specific
0x8e03 4 0 Vendor specific
0x8e04 4 0 Vendor specific
0x8e05 4 0 Vendor specific
0x8e06 4 0 Vendor specific
0x8e07 4 0 Vendor specific
0x8e08 4 0 Vendor specific
0x8e09 4 0 Vendor specific
0x8e0a 4 0 Vendor specific
0x8e0b 4 0 Vendor specific
0x8e0c 4 0 Vendor specific
0x8e0d 4 0 Vendor specific
0x8e0e 4 0 Vendor specific
0x8e0f 4 0 Vendor specific
0x8e10 4 0 Vendor specific
0x8e11 4 0 Vendor specific
* Re: Disk "failed" while doing scrub
2015-07-13 6:26 Disk "failed" while doing scrub Dāvis Mosāns
@ 2015-07-13 8:12 ` Duncan
2015-07-14 1:54 ` Dāvis Mosāns
2015-08-21 4:16 ` Dāvis Mosāns
1 sibling, 1 reply; 5+ messages in thread
From: Duncan @ 2015-07-13 8:12 UTC (permalink / raw)
To: linux-btrfs
Dāvis Mosāns posted on Mon, 13 Jul 2015 09:26:05 +0300 as excerpted:
> Short version: while doing scrub on 5 disk btrfs filesystem, /dev/sdd
> "failed" and also had some error on other disk (/dev/sdh)
You say five disks, but nowhere in your post do you mention what raid mode
you were using, nor do you post btrfs filesystem show and btrfs
filesystem df output, as suggested on the wiki, which list that information.
FWIW, btrfs defaults for a multi-device filesystem are raid1 metadata,
raid0 data. If you didn't specify raid level at mkfs time, it's very
likely that's what you're using. The scrub results seem to support this
as if the data had been raid1 or raid10, nearly all the errors should
have been correctable by pulling from the second copy. And raid5/6
should have been able to recover from parity, tho this mode is new enough
it's still not recommended as the chances of bugs and thus failure to
work properly are much higher.
So you really should have been using raid1/10 if you wanted device-failure
tolerance, but you didn't say, and if you're using the defaults, as seems
reasonably likely, your data was raid0, and thus it's likely many/most
files are either gone or damaged beyond repair.
(As it happens I have a number of btrfs raid1 data/metadata on a pair of
partitioned ssds, with each btrfs on a corresponding partition on both of
them, with one of the ssds developing bad sectors and basically slowly
failing. But the other member of the raid1 pair is solid and I have
backups, as well as a spare I can replace the failing one with when I
decide it's time, so I've been letting the bad one stick around due as
much as anything to morbid curiosity, watching it slowly fail. So I know
exactly how scrub on btrfs raid1 behaves in a bad-sector case, pulling
the copy from the good device to overwrite the bad copy with, triggering
the device's sector remapping in the process. Despite all the read
errors, they've all been correctable, because I'm using raid1 for both
data and metadata.)
> Because filesystem still mounts, I assume I should do "btrfs device
> delete /dev/sdd /mntpoint" and then restore damaged files from backup.
You can try a replace, but with a failing drive still connected, people
report mixed results. It's likely to fail as it can't read certain
blocks to transfer them to the new device.
With raid1 or better, physically disconnecting the failing device and
doing a device delete missing can work (or replace missing, but AFAIK
that doesn't work with released versions and I'm not sure it's even in
integration yet, tho there are patches on-list that should make it work).
With raid0/single, you can mount with a missing device if you use
degraded,ro, but obviously that'll only let you try to copy files off,
and you'll likely not have a lot of luck with raid0, with files missing,
tho a bit more luck with single.
In the likely raid0/single case, your best bet is probably to try
copying off what you can, and/or restoring from backups. See the
discussion below.
> Are all affected files listed in journal? there's messages about "x
> callbacks suppressed" so I'm not sure and if there aren't how to get
> full list of damaged files?
> Also I wonder if there are any tools to recover partial file fragments
> and reconstruct file? (where missing fragments filled with nulls)
> I assume that there's no point in running "btrfs check
> --check-data-csum" because scrub already does check that?
There's no such partial-file with null-fill tools shipped just yet.
Those files normally simply trigger errors trying to read them, because
btrfs won't let you at them if the checksum doesn't verify.
There /is/, however, a command that can be used to either regenerate or
zero-out the checksum tree. See btrfs check --init-csum-tree. Current
versions recalculate the csums, older versions (btrfsck as that was
before btrfs check) simply zeroed it out. Then you can read the file
despite bad checksums, tho you'll still get errors if the block
physically cannot be read.
There's also btrfs restore, which works on the unmounted filesystem
without actually writing to it, copying the files it can read to a new
location, which of course has to be a filesystem with enough room to
restore the files to, altho it's possible to tell restore to do only
specific subdirs, for instance.
What I'd recommend depends on how complete and how recent your backup
is. If it's complete and recent enough, probably the easiest thing is to
simply blow away the bad filesystem and start over, recovering from the
backup to a new filesystem.
If there's files you'd like to get back that weren't backed up or where
the backup is old, since the filesystem is mountable, I'd probably copy
everything off it I could. Then, I'd try restore, letting it restore to
the same location I had copied to, but NOT using the --overwrite option,
so it only writes files it could recover that the copy wasn't able to
get, as those might be slightly older versions.
Then, if you really need more of the files, you can try using btrfs check
--init-csum-tree as mentioned above, and then try mounting and see if you
can access more files. But as these are likely to be somewhat corrupt,
I'd probably /not/ copy them to the same location as the others. If you
have space for two copies, you might duplicate the set of files as you
were able to recover them with the initial copy and restore, and use the
same don't-overwrite technique on one of the sets, marking it the
possibly corrupted version. Then you can do a diff or rsync dry-run to
see the differences between the good version and the bad, and examine
anything spit out by the diff/rsync individually.
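That compare step can be sketched with throwaway directories standing in
for the two recovered sets (all paths and file names here are made up for
illustration):

```shell
# "good" stands in for the plain copy+restore result, "suspect" for the
# set recovered after --init-csum-tree.  diff -r recurses, -q prints only
# the names of files whose contents differ.
mkdir -p /tmp/cmp/good /tmp/cmp/suspect
printf 'intact data\n'   > /tmp/cmp/good/file_a
printf 'intact data\n'   > /tmp/cmp/suspect/file_a
printf 'known good\n'    > /tmp/cmp/good/file_b
printf 'maybe corrupt\n' > /tmp/cmp/suspect/file_b
# diff exits non-zero when it finds differences, hence the || true
diff -rq /tmp/cmp/good /tmp/cmp/suspect || true
```

Only file_b gets reported; rsync -rcn (recursive, checksum compare, dry
run) would give an equivalent listing.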
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Disk "failed" while doing scrub
2015-07-13 8:12 ` Duncan
@ 2015-07-14 1:54 ` Dāvis Mosāns
2015-07-14 6:26 ` Duncan
0 siblings, 1 reply; 5+ messages in thread
From: Dāvis Mosāns @ 2015-07-14 1:54 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
2015-07-13 11:12 GMT+03:00 Duncan <1i5t5.duncan@cox.net>:
> You say five disk, but nowhere in your post do you mention what raid mode
> you were using, neither do you post btrfs filesystem show and btrfs
> filesystem df, as suggested on the wiki and which list that information.
Sorry, I forgot. I'm running Arch Linux with kernel 4.0.7 and btrfs-progs v4.1,
using RAID1 for metadata and single for data, with features
big_metadata, extended_iref, mixed_backref, no_holes, skinny_metadata,
and mounted with noatime,compress=zlib,space_cache,autodefrag
Label: 'Data' uuid: 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
Total devices 5 FS bytes used 7.16TiB
devid 1 size 2.73TiB used 2.35TiB path /dev/sdc
devid 2 size 1.82TiB used 1.44TiB path /dev/sdd
devid 3 size 1.82TiB used 1.44TiB path /dev/sde
devid 4 size 1.82TiB used 1.44TiB path /dev/sdg
devid 5 size 931.51GiB used 539.01GiB path /dev/sdh
Data, single: total=7.15TiB, used=7.15TiB
System, RAID1: total=8.00MiB, used=784.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=16.00GiB, used=14.37GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B
>> Because filesystem still mounts, I assume I should do "btrfs device
>> delete /dev/sdd /mntpoint" and then restore damaged files from backup.
>
> You can try a replace, but with a failing drive still connected, people
> report mixed results. It's likely to fail as it can't read certain
> blocks to transfer them to the new device.
As I understand it, device delete will copy data from that disk and
distribute it across the rest of the disks, while btrfs replace will
copy to a new disk which must be at least the size of the disk I'm
replacing. Assuming the other existing disks are good, why would replace
be preferable over delete? Because delete could fail, but replace wouldn't?
> There's no such partial-file with null-fill tools shipped just yet.
> Those files normally simply trigger errors trying to read them, because
> btrfs won't let you at them if the checksum doesn't verify.
From the journal I have only 14 files mentioned where errors occurred.
Now 13 of those files don't throw any errors and their SHAs match my
backups, so they're fine. And actually btrfs does allow copying/reading
that one damaged file; I only get an I/O error when trying to read data
from those broken sectors:
kernel: drivers/scsi/mvsas/mv_sas.c 1863:Release slot [0] tag[0], task
[ffff88011c8c9900]:
kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000001, slot [0].
kernel: sas: sas_ata_task_done: SAS error 8a
kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
kernel: sas: ata9: end_device-7:2: cmd error handler
kernel: sas: ata7: end_device-7:0: dev error handler
kernel: sas: ata14: end_device-7:7: dev error handler
kernel: ata9.00: exception Emask 0x0 SAct 0x4000 SErr 0x0 action 0x0
kernel: ata9.00: failed command: READ FPDMA QUEUED
kernel: ata9.00: cmd 60/00:00:00:33:a1/0f:00:ab:00:00/40 tag 14 ncq 1966080 in
res 41/40:00:48:40:a1/00:0f:ab:00:00/00 Emask
0x409 (media error) <F>
kernel: ata9.00: status: { DRDY ERR }
kernel: ata9.00: error: { UNC }
kernel: ata9.00: configured for UDMA/133
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00
driverbyte=0x08
kernel: sd 7:0:2:0: [sdd] tag#0 Sense Key : 0x3 [current] [descriptor]
kernel: sd 7:0:2:0: [sdd] tag#0 ASC=0x11 ASCQ=0x4
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 33 00 00 0f 00 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879471688
kernel: ata9: EH complete
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
But all other sectors can be copied fine:
$ du -m ./damaged_file
6250 ./damaged_file
$ cp ./damaged_file /tmp/
cp: error reading ‘damaged_file’: Input/output error
$ du -m /tmp/damaged_file
4335 /tmp/damaged_file
cp copies the first part of the file correctly, and I verified that the
SHAs of both the start of the file (first 4336M) and the end (last 1890M)
match the backup:
$ head -c 4336M ./damaged_file | sha256sum
e81b20bfa7358c9f5a0ed165bffe43185abc59e35246e52a7be1d43e6b7e040d -
$ head -c 4337M ./damaged_file | sha256sum
head: error reading ‘./damaged_file’: Input/output error
$ tail -c 1890M ./damaged_file | sha256sum
941568f4b614077858cb8c8dd262bb431bf4c45eca936af728ecffc95619cb60 -
$ tail -c 1891M ./damaged_file | sha256sum
tail: error reading ‘./damaged_file’: Input/output error
With dd I can also copy almost all of the file, but with the noerror
option alone it excludes those regions from the target file rather than
filling them with nulls, so this isn't good for recovery:
$ dd conv=noerror if=damaged_file of=/tmp/damaged_file
dd: error reading ‘damaged_file’: Input/output error
8880328+0 records in
8880328+0 records out
4546727936 bytes (4,5 GB) copied, 69,7282 s, 65,2 MB/s
dd: error reading ‘damaged_file’: Input/output error
8930824+0 records in
8930824+0 records out
4572581888 bytes (4,6 GB) copied, 113,648 s, 40,2 MB/s
12801720+0 records in
12801720+0 records out
6554480640 bytes (6,6 GB) copied, 223,212 s, 29,4 MB/s
$ du -m /tmp/damaged_file
6251 /tmp/damaged_file
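For what it's worth, dd's sync conversion flag avoids the dropped-region
problem: conv=noerror,sync pads every short read out to bs with NULs, so
later data stays at its correct offset (with bs= matching the sector size,
each skipped span is exactly one sector). A read error can't be faked on a
regular file, but the padding itself is easy to demonstrate with a partial
final block:

```shell
# 5000-byte input file: one full 4096-byte block plus a 904-byte tail.
printf '%5000s' ' ' > /tmp/partial
# sync pads the short second read with NULs up to a full 4096-byte block.
dd if=/tmp/partial of=/tmp/padded bs=4096 conv=noerror,sync 2>/dev/null
stat -c %s /tmp/padded    # 8192
```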
The best and correct way to recover the file is using ddrescue:
$ ddrescue ./damaged_file /tmp/damaged_file info.log
rescued: 6554 MB, errsize: 8192 B, current rate: 0 B/s
ipos: 4572 MB, errors: 2, average rate: 43407 kB/s
opos: 4572 MB, run time: 2.51 m, successful read: 34 s ago
Finished
pos size status
0x00000000 0x10F019000 +
0x10F019000 0x00001000 -
0x10F01A000 0x018A8000 +
0x1108C2000 0x00001000 -
0x1108C3000 0x76216000 +
$ du -m /tmp/damaged_file
6251 /tmp/damaged_file
So basically only about 8 KiB are unrecoverable from this file. Probably
a tool could be created that recovers even more data by knowing about btrfs.
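As a tiny step in that direction, the map file ddrescue writes is easy to
post-process. This sketch recreates the span list from the run above and
totals the unreadable bytes (a real map file also carries '#' comment
lines and a current-position line, which would need skipping):

```shell
# Span list copied from the ddrescue run above:
# '+' marks rescued spans, '-' marks unreadable ones; sizes are hex.
cat > /tmp/info.log <<'EOF'
0x00000000 0x10F019000 +
0x10F019000 0x00001000 -
0x10F01A000 0x018A8000 +
0x1108C2000 0x00001000 -
0x1108C3000 0x76216000 +
EOF
bad=0
while read -r pos size status; do
    # shell arithmetic understands the 0x prefix
    [ "$status" = "-" ] && bad=$((bad + size))
done < /tmp/info.log
echo "$bad bytes unreadable"    # 8192 bytes unreadable
```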
> There /is/, however, a command that can be used to either regenerate or
> zero-out the checksum tree. See btrfs check --init-csum-tree. Current
> versions recalculate the csums, older versions (btrfsck as that was
> before btrfs check) simply zeroed it out. Then you can read the file
> despite bad checksums, tho you'll still get errors if the block
> physically cannot be read.
>
Seems you can't specify a path/file for it, and it's quite a destructive
action if you only want to recover data for one specific file.
I did a second scrub and this time there aren't that many uncorrectable
errors, and there are also no csum_errors, so --init-csum-tree is
useless here I think. Most likely the first scrub got that many errors
because it still continued for a bit even though the disk didn't respond.
scrub status for 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
scrub resumed at Mon Jul 13 22:24:43 2015 and finished after 02:47:28
data_extents_scrubbed: 26357534
tree_extents_scrubbed: 316780
data_bytes_scrubbed: 1574584311808
tree_bytes_scrubbed: 5190123520
read_errors: 2
csum_errors: 0
verify_errors: 0
no_csum: 89600
csum_discards: 656214
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 2
unverified_errors: 0
corrected_errors: 0
last_physical: 2590041112576
Also, there are now I/O errors in device stats, which were 0 previously:
[/dev/sdd].write_io_errs 0
[/dev/sdd].read_io_errs 123
[/dev/sdd].flush_io_errs 0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
These are all the errors from the 2nd scrub, just the 2 dead sectors:
kernel: BTRFS: i/o error at logical 7358423011328 on dev /dev/sdd,
sector 2879471688, root 3034, inode 5619902, offset 4546727936, length
4096, links 1 (path: dir2/damaged_file)
kernel: BTRFS: bdev /dev/sdd errs: wr 0, rd 50, flush 0, corrupt 0, gen 0
kernel: BTRFS: unable to fixup (regular) error at logical
7358423011328 on dev /dev/sdd
kernel: BTRFS: i/o error at logical 7358448869376 on dev /dev/sdd,
sector 2879522192, root 3034, inode 5619902, offset 4572585984, length
4096, links 1 (path: dir2/damaged_file)
kernel: BTRFS: bdev /dev/sdd errs: wr 0, rd 51, flush 0, corrupt 0, gen 0
kernel: BTRFS: unable to fixup (regular) error at logical
7358448869376 on dev /dev/sdd
> There's also btrfs restore, which works on the unmounted filesystem
> without actually writing to it, copying the files it can read to a new
> location, which of course has to be a filesystem with enough room to
> restore the files to, altho it's possible to tell restore to do only
> specific subdirs, for instance.
>
I tried restore for that file, but it's not as good as ddrescue because
it stopped on the error even with the --ignore-errors flag, and there
seems to be no option to continue and try more.
$ btrfs restore -i -x -m -v --path-regex
"^/dir1(|/dir2(|/damaged_file))$" /dev/sdd ./
Restoring ./dir1
Restoring ./dir1/dir2
Restoring ./dir1/dir2/damaged_file
offset is 258048
offset is 212992
offset is 233472
offset is 217088
offset is 237568
Exhausted mirrors trying to read
Error copying data for ./dir1/dir2/damaged_file
Done searching /dir1/dir2/damaged_file
Done searching /dir1/dir2
Done searching /dir1
Done searching
$ du -m ./dir1/dir2/damaged_file
4296 ./dir1/dir2/damaged_file
Can see that it got only the first half, similar to what plain cp does.
> What I'd recommend depends on how complete and how recent your backup
> is. If it's complete and recent enough, probably the easiest thing is to
> simply blow away the bad filesystem and start over, recovering from the
> backup to a new filesystem.
Actually this time I have 100% complete and up-to-date backups of all
files, so I can freely experiment and practice real-world recovery,
which could be very useful. So far it seems that if I didn't have a
backup I would have lost only 8 KiB.
Why recreate a new filesystem rather than just delete/replace the dying
disk? I will still check that all files are OK, but I don't really see a
need to recreate the filesystem if the files are fine.
By the way, I managed to crash btrfs-progs: I had scrub running with -B,
and then Xorg crashed (not related to btrfs) and took down the scrub
process. Then I just resumed the scrub. The symbols are stripped, so the
stack trace is totally useless:
#0 0x0000000000418103 in ?? ()
#1 0x000000000040ee82 in main ()
Also, when I try to restore from a different root tree it crashes (this
is on a 2 disk RAID0):
# btrfs restore -v -t 74579968 /dev/sdk ./
parent transid verify failed on 74579968 wanted 135 found 132
parent transid verify failed on 74579968 wanted 135 found 132
parent transid verify failed on 74579968 wanted 135 found 132
parent transid verify failed on 74579968 wanted 135 found 132
Ignoring transid failure
volumes.c:1554: btrfs_chunk_readonly: Assertion `!ce` failed.
btrfs[0x44c6ce]
btrfs[0x44f426]
btrfs(btrfs_read_block_groups+0x23e)[0x4442de]
btrfs(btrfs_setup_all_roots+0x387)[0x43edd7]
btrfs[0x43f124]
btrfs(open_ctree_fs_info+0x43)[0x43f2b3]
btrfs(cmd_restore+0xb5b)[0x42e77b]
btrfs(main+0x82)[0x40ee82]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f1090778790]
btrfs(_start+0x29)[0x40ef79]
Thanks for your reply :)
* Re: Disk "failed" while doing scrub
2015-07-14 1:54 ` Dāvis Mosāns
@ 2015-07-14 6:26 ` Duncan
0 siblings, 0 replies; 5+ messages in thread
From: Duncan @ 2015-07-14 6:26 UTC (permalink / raw)
To: linux-btrfs
Dāvis Mosāns posted on Tue, 14 Jul 2015 04:54:27 +0300 as excerpted:
> 2015-07-13 11:12 GMT+03:00 Duncan <1i5t5.duncan@cox.net>:
>> You say five disk, but nowhere in your post do you mention what raid
>> mode you were using, neither do you post btrfs filesystem show and
>> btrfs filesystem df, as suggested on the wiki and which list that
>> information.
>
> Sorry, I forgot. I'm running Arch Linux 4.0.7, with btrfs-progs v4.1
> Using RAID1 for metadata and single for data, with features
> big_metadata, extended_iref, mixed_backref, no_holes, skinny_metadata
> and mounted with noatime,compress=zlib,space_cache,autodefrag
Thanks. FWIW, pretty similar here, but running gentoo, now with
btrfs-progs v4.1.1 and the mainline 4.2-rc1+ kernel.
BTW, note that space_cache has been the default for quite some time,
now. I've never actually manually mounted with space_cache on any of my
filesystems over several years, now, yet they all report it when I check
/proc/mounts, etc. So if you're adding that manually, you can kill that
option and save the commandline/fstab space. =:^)
> Label: 'Data' uuid: 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
> Total devices 5 FS bytes used 7.16TiB
> devid 1 size 2.73TiB used 2.35TiB path /dev/sdc
> devid 2 size 1.82TiB used 1.44TiB path /dev/sdd
> devid 3 size 1.82TiB used 1.44TiB path /dev/sde
> devid 4 size 1.82TiB used 1.44TiB path /dev/sdg
> devid 5 size 931.51GiB used 539.01GiB path /dev/sdh
>
> Data, single: total=7.15TiB, used=7.15TiB
> System, RAID1: total=8.00MiB, used=784.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=16.00GiB, used=14.37GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
And note that you can easily and quickly remove those empty single-mode
system and metadata chunks, which are an artifact of the way mkfs.btrfs
works, using balance filters.
btrfs balance start -mprofiles=single /mntpoint
... should do it. They're actually working on mkfs.btrfs patches to fix
it not to do that, right now. There's active patch and testing threads
discussing it. Hopefully for btrfs-progs v4.2. (4.1.1 has the patches
for single-device and prep work for multi-device, according to the
changelog.)
>>> Because filesystem still mounts, I assume I should do "btrfs device
>>> delete /dev/sdd /mntpoint" and then restore damaged files from backup.
>>
>> You can try a replace, but with a failing drive still connected, people
>> report mixed results. It's likely to fail as it can't read certain
>> blocks to transfer them to the new device.
>
> As I understand, device delete will copy data from that disk and
> distribute across rest of disks, while btrfs replace will copy to new
> disk which must be atleast size of disk I'm replacing.
Sorry. You wrote delete, I read replace. How'd I do that? =:^(
You are absolutely correct. Delete would be better here.
I guess I had just been reading a thread discussing the problems I
mentioned with replace, and saw what I expected to see, not what you
actually wrote.
>> There's no such partial-file with null-fill tools shipped just yet.
> From journal I have only 14 files mentioned where errors occurred. Now
> 13 files from them don't throw any errors and their SHA's match to my
> backups so they're fine.
Good. I was going on the assumption that the questionable device was in
much worse shape than that.
> And actually btrfs does allow to copy/read that one damaged file, only I
> get I/O error when trying to read data from those broken sectors
Good, and good to know. Thanks. =:^)
> best and correct way to recover a file is using ddrescue
I was just going to mention ddrescue. =:^)
> $ du -m /tmp/damaged_file 6251 /tmp/damaged_file
>
> so basically only like 8K bytes are unrecoverable from this file.
> Probably there could be created some tool which could get even more data
> knowing about btrfs.
>
>> There /is/, however, a command that can be used to either regenerate or
>> zero-out the checksum tree. See btrfs check --init-csum-tree.
>>
> Seems, you can't specify a path/file for it and it's quite destructive
> action if you want to get data only about some one specific file.
Yes. It's whole-filesystem-all-or-nothing, unfortunately. =:^(
> I did scrub second time and this time there aren't that many
> uncorrectable errors and also there's no csum_errors so --init-csum-tree
> is useless here I think.
Agreed.
> Most likely previously scrub got that many errors because it still
> continued for a bit even if disk didn't respond.
Yes.
> scrub status [...]
> read_errors: 2
> csum_errors: 0
> verify_errors: 0
> no_csum: 89600
> csum_discards: 656214
> super_errors: 0
> malloc_errors: 0
> uncorrectable_errors: 2
> unverified_errors: 0
> corrected_errors: 0
> last_physical: 2590041112576
OK, that matches up with 8 KiB bad, since blocks are 4 KiB and there's
two uncorrectable errors. With the scrub now reporting no further errors
and the two it does report accounted for, nothing else should be
affected. =:^)
> also now, there's i/o errors from device stats which were 0 previously
Good. It's recording them now.
>> There's also btrfs restore, which works on the unmounted filesystem
>> without actually writing to it, copying the files it can read to a new
>> location, which of course has to be a filesystem with enough room to
>> restore the files to, altho it's possible to tell restore to do only
>> specific subdirs, for instance.
>>
>>
> I tried restore for that file, but it's not as good as ddrescue because
> it stopped on error even with --ignore-errors flag and seems there
> aren't option to continue and try more.
Yes. Its primary use is when the filesystem can't be mounted and
backups aren't available or at least aren't current. The fact that it
works without writing to the filesystem in question is also nice, as that
lets people grab the files they can while they know they can, before
trying potential fixes that might end up making things worse instead of
better.
Since you could mount, and the questionable device turned out not as bad
as it first seemed, actually mounting and working with the mounted
filesystem is the better choice. I was just throwing restore out as an
available tool, because again, I thought the iffy device could fail at
any time, leaving you grasping at straws.
>> What I'd recommend depends on how complete and how recent your backup
>> is. If it's complete and recent enough, probably the easiest thing is
>> to simply blow away the bad filesystem and start over, recovering from
>> the backup to a new filesystem.
>
> Actually this time I've 100% complete and up-to-date backups of all
> files so I can freely experiment and try practicing real world recovery
> which could be very useful.
=:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Disk "failed" while doing scrub
2015-07-13 6:26 Disk "failed" while doing scrub Dāvis Mosāns
2015-07-13 8:12 ` Duncan
@ 2015-08-21 4:16 ` Dāvis Mosāns
1 sibling, 0 replies; 5+ messages in thread
From: Dāvis Mosāns @ 2015-08-21 4:16 UTC (permalink / raw)
To: linux-btrfs
2015-07-13 9:26 GMT+03:00 Dāvis Mosāns <davispuh@gmail.com>:
> also are there some easy way to locate those unreadable sectors and
> rewrite them so hdd relocates them?
>
Only now did I notice that scrub reports this :)
> kernel: BTRFS: i/o error at logical 7358423011328 on dev /dev/sdd,
sector 2879471688, root 3034, inode 5619902, offset 4546727936, length
4096, links 1 (path: dir2/damaged_file)
So for each broken sector I did:
$ dd if=/dev/zero of=/dev/sdd seek=359933961 count=1 bs=4096
Note that dd's seek= takes the block number in units of bs, which is
4096 bytes in my case, while scrub reports 512-byte sectors, so
2879471688 / 8 = 359933961.
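The unit conversion is worth double-checking before writing to a raw
device; a quick sanity check (values taken from the scrub message above):

```shell
# Kernel/scrub LBAs are 512-byte sectors; seek= with bs=4096 counts
# 4096-byte blocks, so divide by 8.  This is only exact when the sector
# number is a multiple of 8 -- otherwise use bs=512 with the raw number.
sector=2879471688
echo $((sector % 8))    # 0: 4 KiB aligned, so bs=4096 is safe
echo $((sector / 8))    # 359933961
```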
Now the disk was able to mark those sectors as dead, the self-test
passes, and it doesn't show any uncorrectable sectors anymore:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
Num Test_Description    Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline    Completed without error 00%       3173            -
# 2 Short offline       Completed without error 00%       3169            -
# 3 Short offline       Completed: read failure 90%       3139            2879471688
Then I tried to copy that same file
$ cp damaged_file /tmp/damaged_file
cp: error reading damaged_file: Input/output error
$ ddrescue damaged_file /tmp/damaged_file
GNU ddrescue 1.19
Press Ctrl-C to interrupt
rescued: 6554 MB, errsize: 8192 B, current rate: 56082 kB/s
ipos: 4572 MB, errors: 2, average rate: 99310 kB/s
opos: 4572 MB, run time: 1.10 m, successful read: 0 s ago
Finished
And the result is the same: cp stops on the first error, but ddrescue is
able to get everything except those 8 KiB. The only difference is that I
now get a csum error instead of an I/O error :)
kernel: BTRFS warning (device sdh): csum failed ino 5619902 off
4546727936 csum 2566472073 expected csum
when running scrub
scrub device /dev/sdd (id 2) done
scrub started at Thu Jul 17 13:58:06 2015 and finished after 02:48:05
data_extents_scrubbed: 26349742
tree_extents_scrubbed: 316806
data_bytes_scrubbed: 1574102949888
tree_bytes_scrubbed: 5190549504
read_errors: 0
csum_errors: 2
verify_errors: 0
no_csum: 89600
csum_discards: 656179
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 2
unverified_errors: 0
corrected_errors: 0
last_physical: 1579475271680
ERROR: There are uncorrectable errors.
Now, to fix the csum errors I could have used btrfs check
--init-csum-tree, but I think that's a bad idea: it basically declares
all files valid even if they are corrupted. So instead I just copied
the file from backup, overwriting the damaged one.
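The restore itself was just an ordinary copy; something like this (the backup path is hypothetical):

```shell
# Overwrite the damaged file with the known-good copy from backup,
# then flush so the new (correctly checksummed) data reaches the disks.
cp --preserve=all /backup/dir2/damaged_file /mnt/Data/dir2/damaged_file
sync
```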
After running scrub again, I can see there are no errors anymore:
scrub status for 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
scrub started at Fri Jul 17 19:22:45 2015 and finished after 02:47:58
data_extents_scrubbed: 26347511
tree_extents_scrubbed: 317192
data_bytes_scrubbed: 1573973471232
tree_bytes_scrubbed: 5196873728
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 89472
csum_discards: 656152
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 1580549013504
Next I did
$ btrfs device delete /dev/sdd /mnt/Data
which completed successfully; there only seems to be a bug where
incorrect unallocated space is shown for the device while the delete is
in progress:
$ btrfs filesystem usage /mnt/Data
Unallocated:
/dev/sdc 11.49GiB
/dev/sdd 16.00EiB // disk isn't that big...
/dev/sde 12.02GiB
/dev/sdg 12.02GiB
/dev/sdh 11.48GiB
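The bogus 16.00EiB is likely a u64 underflow: a counter that goes slightly negative during the delete, stored unsigned, prints as just under 2^64 bytes, and 2^64 bytes is exactly 16 EiB:

```shell
# 2^64 bytes expressed in EiB (1 EiB = 2^60 bytes); awk is used here
# because 64-bit signed shell arithmetic overflows at 2^63.
awk 'BEGIN { print 2^64 / 2^60 }'    # prints 16
```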
Then I tested that disk with badblocks, which found nothing, so I just
added it back:
$ btrfs device add /dev/sdd /mnt/Data
and ran a balance:
$ btrfs balance start /mnt/Data
And just to be completely sure everything is OK:
$ btrfs check --check-data-csum /dev/sdc
Checking filesystem on /dev/sdc
UUID: 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 7931796849809 bytes used err is 0
total csum bytes: 7731179932
total tree bytes: 15068594176
total fs tree bytes: 5814714368
total extent tree bytes: 860798976
btree space waste bytes: 1691112689
file data blocks allocated: 7918108438528
referenced 8212185219072
That's all. There was no need to recreate the filesystem from scratch,
just to restore one file from backup. I even verified all files against
the backup with rsync --checksum --dry-run to confirm that everything
is indeed correct.
PS. Sorry for so delayed follow-up.
Thread overview: 5+ messages
2015-07-13 6:26 Disk "failed" while doing scrub Dāvis Mosāns
2015-07-13 8:12 ` Duncan
2015-07-14 1:54 ` Dāvis Mosāns
2015-07-14 6:26 ` Duncan
2015-08-21 4:16 ` Dāvis Mosāns