* How to repair a BTRFS block?
From: Martin Monperrus @ 2015-04-18  7:45 UTC
  To: linux-btrfs

Dear Btrfs developers,

For some unknown reason, my Btrfs filesystem is corrupted. dmesg prints:

BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=43231330304,root=1, slot=47

(repeated more than 1000 times in the dmesg trace).

btrfs check --repair fails with:

read block failed check_tree_block
incorrect offset 12725 2298746482
items overlap, can't fix
cmds_check.c:2918: fix_item_offset: Assertion 'ret' failed

How can I list the files affected by the corruption in block #43231330304?
How can I repair block #43231330304?

Best regards,

--Martin



* Re: How to repair a BTRFS block?
From: Martin Monperrus @ 2015-04-23 18:05 UTC
  To: linux-btrfs

Hi,

More on my issue: scrub reports "uncorrectable errors":

# btrfs scrub status /
scrub status for e11013b3-b244-4d1a-a9c7-3956db1a699c
    scrub started at Thu Apr 23 19:07:45 2015 and finished after 372 seconds
    total bytes scrubbed: 167.13GiB with 13 errors
    error details: read=13
    corrected errors: 0, uncorrectable errors: 13, unverified errors: 0

Before going to my backups, how can I know which files are affected by those
uncorrectable errors?

Best regards,

--Martin






* Re: How to repair a BTRFS block?
From: Duncan @ 2015-04-24  3:30 UTC
  To: linux-btrfs

Martin Monperrus posted on Thu, 23 Apr 2015 20:05:16 +0200 as excerpted:

> # btrfs scrub status /
> scrub status for e11013b3-b244-4d1a-a9c7-3956db1a699c
>     scrub started at Thu Apr 23 19:07:45 2015 and finished after 372 seconds
>     total bytes scrubbed: 167.13GiB with 13 errors
>     error details: read=13
>     corrected errors: 0, uncorrectable errors: 13, unverified errors: 0
> 
> Before going to my backups, how can I know which files are affected by those
> uncorrectable errors?

The kernel log (dmesg, also logged to syslog/journald on most systems) 
from during the scrub should capture more information on those errors.  
You didn't check that?  (Checked or not, you obviously didn't post it.)
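
As a rough illustration of pulling the affected paths out of such a log,
here is a minimal Python sketch.  It assumes scrub warnings of the form
"BTRFS: i/o error at logical ... (path: ...)", which is the format that
shows up later in this thread; other kernel versions may word the message
differently.  The name affected_files is made up for the example, and the
input is expected to have one log entry per line (not re-wrapped by a
mail client).

```python
import re

# Assumed message format, taken from the log posted later in this thread.
SCRUB_ERROR = re.compile(
    r"BTRFS: i/o error at logical (?P<logical>\d+) on dev (?P<dev>\S+), "
    r"sector (?P<sector>\d+), root (?P<root>\d+), inode (?P<inode>\d+), "
    r"offset (?P<offset>\d+), length (?P<length>\d+), links \d+ "
    r"\(path: (?P<path>.*)\)"
)


def affected_files(log_text):
    """Return the sorted, de-duplicated paths named in scrub i/o error lines."""
    paths = {m.group("path") for m in SCRUB_ERROR.finditer(log_text)}
    return sorted(paths)
```

The text can come from saved dmesg output or from the journal; lines that
do not match the pattern are simply ignored.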

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: How to repair a BTRFS block?
From: Martin Monperrus @ 2015-04-24 17:44 UTC
  To: linux-btrfs

Hi Duncan,

> The kernel log (dmesg, also logged to syslog/journald on most systems)
> from during the scrub should capture more information on those errors. 
Thanks. The dmesg log indeed contains the file path (see below).

The error is in /home/martin/XXXXX. It is related to a low-level error
("failed command: READ DMA").

Beyond this corrupted file, is my disk dead?
Can I repair the file system or re-create a new one on the same disk?

Best,

--Martin

[ 7695.806090] BTRFS: i/o error at logical 167135232000 on dev /dev/sda2, sector 213189792, root 5, inode 2963892, offset 7700480, length 4096, links 1 (path: /home/martin/XXXXX)
[ 7695.806097] BTRFS: bdev /dev/sda2 errs: wr 0, rd 401, flush 0, corrupt 0, gen 0
[ 7695.812770] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 7695.812774] ata1.00: irq_stat 0x40000001
[ 7695.812778] ata1.00: failed command: READ DMA
[ 7695.812783] ata1.00: cmd c8/00:08:a0:dc:91/00:00:00:00:00/ee tag 23 dma 4096 in
         res 51/40:00:00:00:00/00:00:00:00:00/ee Emask 0x9 (media error)
[ 7695.812785] ata1.00: status: { DRDY ERR }
[ 7695.812786] ata1.00: error: { UNC }
[ 7695.813013] ata1.00: supports DRM functions and may not be fully accessible
[ 7695.813210] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
[ 7695.813770] ata1.00: supports DRM functions and may not be fully accessible
[ 7695.813859] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
[ 7695.814164] ata1.00: configured for UDMA/133
[ 7695.814179] sd 0:0:0:0: [sda] Unhandled sense code
[ 7695.814181] sd 0:0:0:0: [sda] 
[ 7695.814182] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 7695.814183] sd 0:0:0:0: [sda] 
[ 7695.814185] Sense Key : Medium Error [current] [descriptor]
[ 7695.814187] Descriptor sense data with sense descriptors (in hex):
[ 7695.814188]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 7695.814195]         0e 00 00 00
[ 7695.814198] sd 0:0:0:0: [sda] 
[ 7695.814199] Add. Sense: Unrecovered read error - auto reallocate failed
[ 7695.814201] sd 0:0:0:0: [sda] CDB:
[ 7695.814202] Read(10): 28 00 0e 91 dc a0 00 00 08 00
[ 7695.814208] end_request: I/O error, dev sda, sector 244440224
[ 7695.814222] ata1: EH complete
[ 7695.814227] BTRFS: unable to fixup (regular) error at logical 167135232000 on dev /dev/sda2
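
The numbers in this log can be cross-checked with a little arithmetic.
The Python sketch below decodes the Read(10) CDB into a starting LBA and
a sector count, then compares the whole-disk sector from the end_request
line with the sector btrfs reports on /dev/sda2.  It assumes 512-byte
logical sectors and that both messages refer to the same read; under
those assumptions the difference would be the start of sda2 on the disk.
The helper name decode_read10 is hypothetical.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length in blocks
    return lba, count


SECTOR_SIZE = 512  # assumed logical sector size

lba, count = decode_read10("28 00 0e 91 dc a0 00 00 08 00")
print(lba, count, count * SECTOR_SIZE)   # 244440224 8 4096

# Whole-disk sector (end_request line) minus the sector btrfs reports on
# /dev/sda2.  Reading this as the partition start is an inference, not
# something the log states.
partition_start = 244440224 - 213189792
print(partition_start, partition_start % 2048)   # 31250432 0
```

The decoded LBA matches the sector in the end_request line, and 8 sectors
of 512 bytes match the 4096-byte length in the btrfs message.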






* Re: How to repair a BTRFS block?
From: Duncan @ 2015-04-25  0:42 UTC
  To: linux-btrfs

Martin Monperrus posted on Fri, 24 Apr 2015 19:44:47 +0200 as excerpted:

> Hi Duncan,
> 
>> The kernel log (dmesg, also logged to syslog/journald on most systems)
>> from during the scrub should capture more information on those errors.
> Thanks. The dmesg log indeed contains the file path (see below).
> 
> The error is in /home/martin/XXXXX. It is related to a low-level error
> ("failed command: READ DMA").
> 
> Beyond this corrupted file, is my disk dead?
> Can I repair the file system or re-create a new one on the same disk?

A direct answer is beyond my knowledge level, certainly without SMART 
status information, etc.  What I do know is that assuming the rest of the 
device is responding fine, most drives keep a number of reserved sectors 
available and will automatically substitute them in on a *write* to an 
affected dead sector.

So if the device in general appears to be working fine, and assuming the
SMART status still passes, I'd back up everything else on that partition,
unmount it, then run something like a badblocks destructive write test
(-w) on the partition.  If it comes back clean, I'd consider the device
usable again.

Also note that if you run smartctl -A (attributes) on the device before
attempting anything else and check the raw value for ID 5 (reallocated
sector count), then check again after doing something like that badblocks
-w, you can see whether it actually reallocated any sectors.  Finally,
note that while a one-off is possible, a drive that starts reallocating
sectors often fails relatively quickly: reallocation can indicate a
failing media layer, and once that starts to go it often doesn't stop.
So once you see that value move from zero, keep an eye on it, and if you
notice it starting to climb, get the data off that thing as soon as
possible.
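
A minimal sketch of that before/after comparison, in Python.  It assumes
the attribute table layout that smartctl prints (a header row starting
with "ID#" and containing a "RAW_VALUE" column), and that the two inputs
are saved smartctl outputs from before and after the test.  The names
raw_values and reallocated_delta are invented for the example.

```python
def raw_values(smartctl_output):
    """Map attribute ID -> raw value text from a smartctl attribute table."""
    raw_col = None
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if raw_col is None:
            # The header row tells us which column holds RAW_VALUE.
            if fields and fields[0] == "ID#" and "RAW_VALUE" in fields:
                raw_col = fields.index("RAW_VALUE")
            continue
        if len(fields) <= raw_col or not fields[0].isdigit():
            break  # first non-attribute line ends the table
        values[int(fields[0])] = fields[raw_col]
    return values


def reallocated_delta(before, after, attr_id=5):
    """Change in the raw value of one attribute (default: ID 5) between runs."""
    return int(raw_values(after)[attr_id]) - int(raw_values(before)[attr_id])
```

A positive result from reallocated_delta would mean the drive reallocated
sectors between the two runs.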

And of course it should go without saying, but I'll repeat the sysadmin's 
data value rule of thumb anyway, for the benefit of others reading as 
well.  If you care about the data, by definition, you have a (tested) 
backup (a corollary rule states that an untested backup isn't a backup at 
all).  If you don't have a backup, by definition you do NOT care about 
that data, /regardless/ of any claims to the contrary.  Unfortunately, 
many (most?) people end up learning this the hard way, finding out too 
late how much more value the data had than they thought, and thus that 
they /should/ have cared about it more (more backups, more testing of 
them) than they did.

(For those who end up in that situation...)  On the flip side there's the 
big picture.  During Hurricane Katrina a data hosting firm in New Orleans
made (tech) headlines by blogging live their struggle to stay powered and 
online.  I was one of thousands watching that, along with the mainstream 
news about the flooding, looting and dying going on.  Obviously losing a 
bit of data ends up pretty far down the list when you're wet and cold and 
just lost your house and possibly members of your family!  A bit of data 
loss might hurt a bit, but in the big picture, if you're still healthy, 
and have a job and a home and family, it's /not/ the end of the world.  A 
bit of perspective helps! =:^)




* Re: How to repair a BTRFS block?
From: Duncan @ 2015-04-25  8:11 UTC
  To: linux-btrfs

Duncan posted on Sat, 25 Apr 2015 00:42:12 +0000 as excerpted:

> Also note that if you run smartctl -A (attributes) on the device before
> attempting anything else and check the raw value for ID 5 (reallocated
> sector count), then check again after doing something like that
> badblocks -w, you can see if it actually reallocated any sectors.
> Finally, note that while it's possible to have a one-off, once a drive
> starts reallocating sectors it often fails relatively quickly as that
> can indicate a failing media layer and once it starts to go, often it
> doesn't stop.  So once you see that value move from zero, do keep an eye
> on it and if you notice the value starting to climb, get the data off
> that thing as soon as possible.

FWIW, I'm running btrfs raid1 (both data/metadata) here.  I run multiple 
btrfs filesystems (with the raid1 on parallel partitions on two ssds) 
instead of subvolumes.  Of course SSDs have a far different wear life 
than spinning rust, and the most-used sectors are expected to drop out as 
the device ages.

When I bought my SSDs, I found that one had already been used somewhat
and then returned before being sold to me.  However, SMART reported no
reallocated sectors at the time and I decided to call it a good thing,
since it meant that one should wear out first, instead of having them
both wear out together.

I normally keep / mounted read-only, unless I'm updating, and that has 
proven to be a good decision as I rarely have problems with it.  /home, 
OTOH, is of course mounted writable, and occasionally doesn't get cleanly 
unmounted, so it tends to see problems once in a while.  However, scrub
normally fixes them right up (as it can because I'm running raid1 and 
there's a second, generally valid, copy to copy over the bad one).

After writing the above, I decided it was time to do a scrub, and sure 
enough, it found some problems on /home.  I actually had to run it twice 
to fix them all.  Each time it said (with no-background, raw, per-device 
reporting options set) that the one device had a read-error and several 
unverified errors.  After the second scrub, a third scrub found no 
further errors.

The btrfs errors showed up alongside lower-level ATA errors logged in
dmesg, very similar to what you posted above.

As it happens, I ran smartctl -A on the device both before and after the
scrubs.  The first run was only because I had looked up -A in the manpage
and run it while composing the above reply, in order to check that -A was
actually what I wanted.

Before the scrubs, the previously-used device had 19 sectors 
reallocated.  Afterward it was 20.  So the first scrub probably triggered 
the reallocation but didn't fix the problem, while the second scrub fixed 
the problem as it could now write to the newly reallocated sector.

The kicker, of course, is that because I'm running btrfs raid1, there was 
a second copy (on the newer device, which doesn't report any reallocated 
sectors yet) btrfs could use to fix the bad one, and doing so forced a 
write to that sector, thus triggering the reallocation by the device 
firmware.  (Of course due to btrfs cow, it writes the new copy elsewhere 
too, but apparently in doing so it triggered a write to the old sector as 
well.)

If I hadn't been running raid such that btrfs could find or create from 
parity a second copy, fixing that would have been a lot harder, though with
the data from the ata error I could have unmounted and tried to use dd to 
write to exactly that sector, trying to trigger the device's sector 
reallocation that way.  But that's a lot lower level, with a much larger 
chance for user error, particularly as I've never attempted it before.  
With btrfs scrub, I just had to do the scrub and the details were handled 
for me. =:^)
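
For illustration only, the Python sketch below builds (and does not run)
the kind of dd command described above.  The device and sector numbers
are taken from the ATA error Martin posted (whole-disk sector 244440224,
8 sectors of an assumed 512 bytes each); dd_rewrite_command is a made-up
helper.  Running such a command destroys whatever is stored in those
sectors, and whether it actually triggers a reallocation depends on the
drive firmware.

```python
import shlex


def dd_rewrite_command(device, lba, sectors, sector_size=512):
    """Build a dd command line that would overwrite the given sectors with zeros."""
    args = [
        "dd",
        "if=/dev/zero",
        f"of={device}",
        f"bs={sector_size}",   # one block per logical sector
        f"seek={lba}",         # skip this many blocks of the output device
        f"count={sectors}",    # number of sectors to write
        "oflag=direct",        # bypass the page cache so the write hits the disk
    ]
    return shlex.join(args)


# Hypothetical use with the values from the posted log; nothing is executed.
print(dd_rewrite_command("/dev/sda", 244440224, 8))
```

Printing the command instead of executing it leaves room to double-check
the device name and sector numbers first, which is where the user error
mentioned above is most likely to creep in.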

Meanwhile, the device with a raw value of zero reallocated sectors has a 
cooked value of 253 for that attribute.  The device with a raw value of 
20 reallocated sectors has a cooked value of 100, with a threshold value 
of 36.  So I'm watching it.
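
To make the cooked-versus-threshold comparison concrete, a small Python
sketch.  It relies on the rule from the smartctl manual that an attribute
counts as failed when its normalized (cooked) value is less than or equal
to the threshold; the function names are invented for the example.

```python
def has_failed(value, threshold):
    """True when the normalized (cooked) value has reached the threshold."""
    return value <= threshold


def margin(value, threshold):
    """How far the normalized value still is above the threshold."""
    return value - threshold


# Values from the paragraph above: cooked 100 against a threshold of 36.
print(has_failed(100, 36), margin(100, 36))   # False 64
```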

FWIW, I bought three SSDs at the time, thinking I'd use one for something 
else, which I never did.  So I already have a spare SSD to connect and do 
a btrfs replace, when the time comes.  It's apparently new (not returned 
like the one was), so should last quite some time, based on the fact that 
the one that was new seems to be just fine, so far.  At a guess, the 
current new-at-installation one will be about where the used one was, by 
the time I have to switch out the used one.  So they should stay nicely 
staggered. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: How to repair a BTRFS block?
From: Martin Monperrus @ 2015-04-25 17:56 UTC
  To: linux-btrfs

Hi Duncan,

>> Beyond this corrupted file, is my disk dead?
>> Can I repair the file system or re-create a new one on the same disk?
> A direct answer is beyond my knowledge level, certainly without SMART
> status information, etc.
I attach the result of `smartctl -x` below.

Best regards,

--Martin

# smartctl -x /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG MZ7PD256HCGM-000H7
Serial Number:    S1N8NSAGC23049
LU WWN Device Id: 5 012548 500000000
Firmware Version: DXM06H6Q
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Apr 25 19:45:38 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
          was completed without error.
          Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
          without error or no self-test has ever
          been run.
Total time to complete Offline
data collection:    (    0) seconds.
Offline data collection
capabilities:        (0x53) SMART execute Offline immediate.
          Auto Offline data collection on/off support.
          Suspend Offline collection upon new
          command.
          No Offline surface scan supported.
          Self-test supported.
          No Conveyance Self-test supported.
          Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
          power-saving mode.
          Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
          General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  17) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
          SCT Error Recovery Control supported.
          SCT Feature Control supported.
          SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   199   199   002    -    790
  5 Reallocated_Sector_Ct   PO--CK   099   099   010    -    48
  9 Power_On_Hours          -O--CK   099   099   000    -    203
 12 Power_Cycle_Count       -O--CK   099   099   000    -    460
170 Unknown_Attribute       PO--C-   099   099   010    -    4550
171 Unknown_Attribute       -O--CK   100   100   010    -    0
172 Unknown_Attribute       -O--CK   100   100   010    -    0
173 Unknown_Attribute       PO--C-   098   098   005    -    54
174 Unknown_Attribute       -O--CK   099   099   000    -    59
183 Runtime_Bad_Block       -O--CK   099   099   001    -    82
184 End-to-End_Error        PO--CK   100   100   097    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    790
188 Command_Timeout         -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   079   053   000    -    21
196 Reallocated_Event_Count -O----   099   099   000    -    48
198 Offline_Uncorrectable   ----CK   099   099   000    -    3
199 UDMA_CRC_Error_Count    -OSRCK   099   099   000    -    3
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01       GPL,SL  R/O      1  Summary SMART error log
0x02       GPL,SL  R/O      1  Comprehensive SMART error log
0x03       GPL,SL  R/O      1  Ext. Comprehensive SMART error log
0x06       GPL,SL  R/O      1  SMART self-test log
0x07       GPL,SL  R/O      1  Extended self-test log
0x09       GPL,SL  R/W      1  Selective self-test log
0x10       GPL,SL  R/O      1  SATA NCQ Queued Error log
0x11       GPL,SL  R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was completed without error
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   1
Device State:                        SCT command executing in background (5)
Current Temperature:                    40 Celsius
Power Cycle Min/Max Temperature:     40/40 Celsius
Lifetime    Min/Max Temperature:      0/70 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     3 (Unknown, should be 2)
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/70 Celsius
Min/Max Temperature Limit:            0/70 Celsius
Temperature History Size (Index):    128 (0)

Index    Estimated Time   Temperature Celsius
  1    2015-04-25 17:38     ?  -
...    ..(125 skipped).    ..  -
127    2015-04-25 19:44     ?  -
  0    2015-04-25 19:45    40  *********************

SCT Error Recovery Control:
          Read: Disabled
          Write: Disabled

Device Statistics (GP/SMART Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC




