* Corrupted files support needed
       [not found] <CAD7Y51iQMJQiTBBW9AqQ_-aJ6A4fMVEswyNwPMYnj5iAaLOXjw@mail.gmail.com>
@ 2019-04-15 11:44 ` Daniel Brunner
  2019-04-15 11:58   ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Brunner @ 2019-04-15 11:44 UTC (permalink / raw)
  To: linux-btrfs

Hi,

after a normal reboot I noticed that many files fail to open / read
(Input/Output error). I don't really have a backup of it (only
snapshots, which seem to be corrupted too).
The machine is remote; I can only ssh into it (at the moment). The
raid consists of 8x 10TB drives.

System & Metadata is RAID1
Data is RAID6

kernel is the most recent from arch linux repositories:
5.0.7-arch1-1-ARCH #1 SMP PREEMPT Mon Apr 8 10:37:08 UTC 2019 x86_64 GNU/Linux

Disk layout & usage:
https://0x0.st/zNBl.log

Output of btrfs check:
https://0x0.st/zNJs.log

Kernel messages:
http://cwillu.com:8080/84.115.44.105/2

btrfs scrub is running but it seems like it will take weeks to finish...
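
For what it's worth, I am watching the scrub and the per-device error
counters roughly like this (the mount point is just a placeholder):

# btrfs scrub status /mnt/pool
# btrfs device stats /mnt/pool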

Do you think my data is gone or is there anything I can try? An rsync
is already running, but every 5th file fails to read.

Best regards,
Daniel


* Re: Corrupted files support needed
  2019-04-15 11:44 ` Corrupted files support needed Daniel Brunner
@ 2019-04-15 11:58   ` Qu Wenruo
  2019-04-16  7:46     ` Daniel Brunner
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2019-04-15 11:58 UTC (permalink / raw)
  To: Daniel Brunner, linux-btrfs


On 2019/4/15 7:44 PM, Daniel Brunner wrote:
> Hi,
> 
> after a normal reboot I noticed that many files fail to open / read
> (Input/Output error). I don't really have a backup of it (only
> snapshots, which seem to be corrupted too).
> The machine is remote; I can only ssh into it (at the moment). The
> raid consists of 8x 10TB drives.
> 
> System & Metadata is RAID1
> Data is RAID6
> 
> kernel is the most recent from arch linux repositories:
> 5.0.7-arch1-1-ARCH #1 SMP PREEMPT Mon Apr 8 10:37:08 UTC 2019 x86_64 GNU/Linux
> 
> Disk layout & usage:
> https://0x0.st/zNBl.log
> 
> Output of btrfs check:
> https://0x0.st/zNJs.log

Strange, btrfs check reports no error at all.

Would you please try btrfs check unmounted, just in case?

> 
> Kernel messages:
> http://cwillu.com:8080/84.115.44.105/2

According to the dmesg, it looks like one device is bad, affecting both
data and metadata.

And thanks to the recent RAID6 enhancement that tries all possible mirror
combinations, it doesn't cause too many problems for the metadata, at least.

The data csums are not in as good shape, though.
Quite a lot of combinations don't match the checksum, and worse, some
csums don't exist in the csum tree at all.

An unmounted btrfs check is recommended.
# btrfs check --check-data-csum --readonly

This would be better than scrub.
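
For example, pointed at any one member device, with the output captured to a
file (the device path is just a placeholder):

# btrfs check --check-data-csum --readonly /dev/sdX 2>&1 | tee check.log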

And please check the SMART info; some devices look suspicious.
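
Something along these lines would do, for example (the device names are just
placeholders for the eight members):

# for d in /dev/sd[a-h]; do smartctl -a "$d" | grep -iE 'realloc|pending|uncorrect|crc|temperature'; done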

Also, please at least remount the fs read-only to prevent further
corruption.
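
E.g. something like this (the mount point is just a placeholder):

# mount -o remount,ro /mnt/pool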

Thanks,
Qu


* Re: Corrupted files support needed
  2019-04-15 11:58   ` Qu Wenruo
@ 2019-04-16  7:46     ` Daniel Brunner
  2019-04-16  7:56       ` Qu Wenruo
  2019-04-16  8:00       ` Roman Mamedov
  0 siblings, 2 replies; 7+ messages in thread
From: Daniel Brunner @ 2019-04-16  7:46 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi,

thanks for the quick response.

The filesystem went read-only on its own right at the first read error.
I unmounted all mounts and rebooted (just to be sure).

I ran the command you suggested with --progress.
All the useful output is flushed away by thousands of lines like those
at the end of the log paste.

Does it make sense to let it run to the end, or can I assume that 2
drives are bad?
I also checked the SMART values, but they seem to be OK.

https://0x0.st/zNg6.log

-- Daniel


* Re: Corrupted files support needed
  2019-04-16  7:46     ` Daniel Brunner
@ 2019-04-16  7:56       ` Qu Wenruo
  2019-04-16  8:00       ` Roman Mamedov
  1 sibling, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2019-04-16  7:56 UTC (permalink / raw)
  To: Daniel Brunner, linux-btrfs


On 2019/4/16 3:46 PM, Daniel Brunner wrote:
> Hi,
> 
> thanks for the quick response.
> 
> The filesystem went read-only on its own right at the first read error.

Then the fs metadata has probably been corrupted.

> I unmounted all mounts and rebooted (just to be sure).
> 
> I ran the command you suggested with --progress.
> All the useful output is flushed away by thousands of lines like those
> at the end of the log paste.

Then please stop that running command, as it won't help much.

A regular "btrfs check --readonly" should be able to report the metadata
corruption.

> 
> Does it make sense to let it run to the end, or can I assume that 2
> drives are bad?

No need to continue; there should be some metadata corruption, and the
above command should report it.

For the device corruption part, I'm not sure, but I think something is
definitely wrong.

Please start salvaging your data, as I'm afraid the fs is already damaged,
although I'm not sure how serious it is.
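
If the fs stops mounting cleanly at some point, one option is an offline copy
with btrfs restore, e.g. (just a sketch; the device path and target directory
are placeholders):

# btrfs restore -v /dev/sdX /mnt/recovery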

Thanks,
Qu


* Re: Corrupted files support needed
  2019-04-16  7:46     ` Daniel Brunner
  2019-04-16  7:56       ` Qu Wenruo
@ 2019-04-16  8:00       ` Roman Mamedov
  2019-04-17 11:16         ` Daniel Brunner
  1 sibling, 1 reply; 7+ messages in thread
From: Roman Mamedov @ 2019-04-16  8:00 UTC (permalink / raw)
  To: Daniel Brunner; +Cc: Qu Wenruo, linux-btrfs

On Tue, 16 Apr 2019 09:46:39 +0200
Daniel Brunner <daniel@brunner.ninja> wrote:

> Hi,
> 
> thanks for the quick response.
> 
> The filesystem went read-only on its own right at the first read error.
> I unmounted all mounts and rebooted (just to be sure).
> 
> I ran the command you suggested with --progress.
> All the useful output is flushed away by thousands of lines like those
> at the end of the log paste.
> 
> Does it make sense to let it run to the end, or can I assume that 2
> drives are bad?
> I also checked the SMART values, but they seem to be OK.
> 
> https://0x0.st/zNg6.log

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   047   031   040    Old_age   Always   In_the_past 53 (Min/Max 47/53 #5115)

This seems to be normalized as VALUE=100-RAW_VALUE by the SMART firmware,
and looking at the reading in WORST, it indicates that some of your drives
have earlier seen temperatures as high as 69 C.
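
To spell out the arithmetic: the current VALUE of 047 matches the RAW reading
of 53 C (100 - 47 = 53), so WORST = 031 corresponds to a past reading of about
100 - 31 = 69 C.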

This is insanely hot to run your drives at, I'd say to the point of "shut off
everything ASAP via the mains breaker to avoid immediate permanent damage".

Not sure if it's related to the csum errors at hand, but it very well might be.

Even the current temps of 55-60 C are about 15-20 degrees higher than ideal.

-- 
With respect,
Roman


* Re: Corrupted files support needed
  2019-04-16  8:00       ` Roman Mamedov
@ 2019-04-17 11:16         ` Daniel Brunner
  2019-04-17 12:42           ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Brunner @ 2019-04-17 11:16 UTC (permalink / raw)
  To: linux-btrfs

Hi again,

The check went through,
and no errors were reported...

I tried mounting the fs again,
and suddenly all read errors disappeared!

A full copy is already running,
and after that I will nuke it from orbit.
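
For reference, the copy is essentially just an rsync along these lines (the
paths are placeholders, not the real ones):

# rsync -aHAX --partial --info=progress2 /mnt/pool/ /mnt/backup/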

Thank you all for the quick answers,
I do not completely understand what exactly was going wrong,
but I am confident that the files are not corrupted anymore (I checked
a few random ones).

-- Daniel

(output of the check)
# btrfs check --readonly /dev/mapper/sde-open
Opening filesystem to check...
Checking filesystem on /dev/mapper/sde-open
UUID: 26a3ef79-15ba-4041-951b-284fbbe08074
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 50094867738635 bytes used, no error found
total csum bytes: 48133927932
total tree bytes: 60644540416
total fs tree bytes: 4307402752
total extent tree bytes: 1874182144
btree space waste bytes: 6642842911
file data blocks allocated: 50464303362048
 referenced 49273323020288


* Re: Corrupted files support needed
  2019-04-17 11:16         ` Daniel Brunner
@ 2019-04-17 12:42           ` Qu Wenruo
  0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2019-04-17 12:42 UTC (permalink / raw)
  To: Daniel Brunner, linux-btrfs


On 2019/4/17 7:16 PM, Daniel Brunner wrote:
> Hi again,
> 
> The check went through,
> and no errors were reported...
> 
> I tried mounting the fs again,
> and suddenly all read errors disappeared!

Then there should be something wrong with the disk connection (but why
doesn't the block layer complain?).

> 
> A full copy is already running,
> and after that I will nuke it from orbit.
> 
> Thank you all for the quick answers,
> I do not completely understand what exactly was going wrong,
> but I am confident that the files are not corrupted anymore (I checked
> a few random ones).

As long as btrfs doesn't report a checksum error and your memory is not
faulty, btrfs should ensure all your data is correct.

Glad your problem just disappeared.

Thanks,
Qu

