Bad hard drive - checksum verify failure forces readonly mount

* Bad hard drive - checksum verify failure forces readonly mount
@ 2016-06-23 20:30 Vasco Almeida
  2016-06-24  0:54 ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Vasco Almeida @ 2016-06-23 20:30 UTC (permalink / raw)
  To: linux-btrfs

I was running openSUSE Leap 42.1 with btrfs and
LVM (Logical Volume Management).
The last time I checked the smartd log, I noticed there were
30 sectors pending reallocation and 1 unrecoverable bad
sector on the hard drive.
I think some sectors on my hard drive got corrupted, and now btrfs fails
a checksum and forces the mount readonly.
The device mounts successfully readonly.
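
A minimal sketch of how the two SMART counters mentioned above could be pulled out of `smartctl -A` output. It assumes the standard ATA attribute names and that the raw value is the 10th column; `smart_bad_sectors` and `/dev/sdX` are illustrative names, not something from this report.

```shell
# Print the raw values of the pending and uncorrectable sector counters.
# Reads `smartctl -A` style output on stdin. Assumption: the raw value is
# the 10th whitespace-separated column of each attribute row.
smart_bad_sectors() {
    awk '$2 == "Current_Pending_Sector" || $2 == "Offline_Uncorrectable" {
        print $2 "=" $10
    }'
}

# Usage (needs root and smartmontools; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | smart_bad_sectors
```

Non-zero values for either counter point at the drive itself rather than at the file system.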

OpenSuse dmesg reported:

BTRFS: dm-1 checksum verify failed on 437944320 wanted 39F45669 found
8BF8C752 leval 0
(more 2 times)
BTRFS: error (device dm-1) in btrfs_drop_snapshot:???: error=-5 IO failure
BTRFS: info (device dm-1): forced readonly

Now I'm on System Rescue CD and that is not reported.
I wrote those log lines down on paper, so there may be some typos.
There seems to be no journalctl installed on this system to check
the openSUSE logs again.

All the following logs are on System Rescue CD.
mount -o ro,recovery /dev/mapper/vg_pupu-lv_opensuse_root /mnt/opensuse
https://bpaste.net/show/263e5f7ae9d4

After mounting and umounting several times with and without "-o ro,recovery"
https://bpaste.net/show/43eb64decb63

btrfs check --readonly /dev/mapper/vg_pupu-lv_opensuse_root
https://bpaste.net/show/7ecf422c73a2

Would it be appropriate to run any of "btrfs check --repair /device" or
"btrfs check --init-csum-tree /device" to be able to mount readwrite again?

smartctl --all /dev/disk/by-id/ata-SAMSUNG_HD154UI_S1Y6JDWSC01351
https://bpaste.net/show/a6c132618974

btrfs check manpage: https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-check
btrfsck page: https://btrfs.wiki.kernel.org/index.php/Btrfsck


* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-23 20:30 Bad hard drive - checksum verify failure forces readonly mount Vasco Almeida
@ 2016-06-24  0:54 ` Chris Murphy
  2016-06-24  4:56   ` Duncan
       [not found]   ` <5356822.A3RRKHDHNy@linux-omuo>
  0 siblings, 2 replies; 14+ messages in thread
From: Chris Murphy @ 2016-06-24  0:54 UTC (permalink / raw)
  To: Vasco Almeida; +Cc: Btrfs BTRFS

On Thu, Jun 23, 2016 at 2:30 PM, Vasco Almeida <vascomalmeida@sapo.pt> wrote:
> I was running OpenSuse Leap 42.1 with btrfs and
> LVM (Logical Volume Management).
> Last time I've checked smartd log, I noticed there were
> 30 sector pending reallocation and 1 unrecoverable bad
> sector on hard drive.
> I think my hard drive got some sector corrupted and now btrfs fails
> some checksum and forces mount readonly.
> The device is successfully mounted readonly.
>
> OpenSuse dmesg reported:
>
> BTRFS: dm-1 checksum verify failed on 437944320 wanted 39F45669 found
> 8BF8C752 leval 0
> (more 2 times)
> BTRFS: error (device dm-1) in btrfs_drop_snapshot:???: error=-5 IO failure
> BTRFS: info (device dm-1): forced readonly
>
> Now I'm on System Rescue CD and that is not reported.
> I've written down those log line on paper, so there may be some typo.
> Seemingly there is no journalctl installed on this system to check
> OpenSuse logs again.
>
> All the following logs are on System Rescue CD.
> mount -o ro,recovery /dev/mapper/vg_pupu-lv_opensuse_root /mnt/opensuse
> https://bpaste.net/show/263e5f7ae9d4
>
> After mounting and umounting several times with and without "-o ro,recovery"
> https://bpaste.net/show/43eb64decb63
>
> btrfs check --readonly /dev/mapper/vg_pupu-lv_opensuse_root
> https://bpaste.net/show/7ecf422c73a2
>
>
> Would it be apropriate to run any of "btrfs check --repair /device" or
> "btrfs check --init-csum-tree /device" to be able to mount readwrite again?
>
> smartctl --all /dev/disk/by-id/ata-SAMSUNG_HD154UI_S1Y6JDWSC01351
> https://bpaste.net/show/a6c132618974
>
> btrfs check manpage: https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-check
> btrfsck page: https://btrfs.wiki.kernel.org/index.php/Btrfsck

Normally if this is just data blocks corrupted it will still mount rw
and just flag the affected file in kernel messages so you can delete
it and replace.

Since that's not happening, it's probably metadata, but then there
should be two copies unless this is on SSD or otherwise the file
system was created with -m single. If there are two copies of the
metadata and both are wrong that's unusual.
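
Whether there really are two copies of the metadata can be read from `btrfs filesystem df`. A small sketch, assuming output lines of the form `Metadata, DUP: total=...`; the helper name `metadata_profile` is made up for illustration.

```shell
# Print the profile (single, DUP, RAID1, ...) of the Metadata block groups,
# given `btrfs filesystem df` output on stdin.
metadata_profile() {
    sed -n 's/^Metadata, \([^:]*\):.*/\1/p'
}

# Usage (the mount point is a placeholder):
#   btrfs filesystem df /mnt | metadata_profile
```

A result of `DUP` means two copies on the same device; `single` means there is no second copy to fall back on.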


From the pasted kernel messages:

> Linux version 3.18.34-std473-amd64 (root@rl-sysrcd-p11) (gcc version 4.8.5 (Gentoo 4.8.5 p1.3, pie-0.6.2) ) #2 SMP Tue May 24 20:34:19 UTC 2016


3.18.34 is ancient. Find something newer and try to remount normally.
And then also with recovery if necessary (don't use ro, see if it'll
mount rw and fix itself). And if not, then try btrfs check with a
newer version of btrfs-progs, I can't tell from the pasted output what
version you're using but since the kernel is so old, decent chance the
btrfsck is old also.
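
A rough sketch of how the kernel version could be compared against a chosen minimum before retrying. It relies on GNU `sort -V`; the helper names and the 4.4 threshold in the usage comment are illustrative assumptions, not a rule stated here.

```shell
# Succeed when dotted version $1 is at least $2 (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip a distribution suffix such as "-std473-amd64" from `uname -r` output.
kernel_base() {
    printf '%s\n' "${1%%-*}"
}

# Usage (the 4.4 minimum is only an example):
#   version_ge "$(kernel_base "$(uname -r)")" 4.4 || echo "kernel is older than 4.4"
#   btrfs --version    # prints the btrfs-progs version for comparison
```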



-- 
Chris Murphy


* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-24  0:54 ` Chris Murphy
@ 2016-06-24  4:56   ` Duncan
  2016-06-24  5:34     ` Chris Murphy
       [not found]   ` <5356822.A3RRKHDHNy@linux-omuo>
  1 sibling, 1 reply; 14+ messages in thread
From: Duncan @ 2016-06-24  4:56 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Thu, 23 Jun 2016 18:54:28 -0600 as excerpted:

> From the pasted kernel messages:
> 
>> Linux version 3.18.34-std473-amd64 (root@rl-sysrcd-p11) (gcc version
>> 4.8.5 (Gentoo 4.8.5 p1.3, pie-0.6.2) ) #2 SMP Tue May 24 20:34:19 UTC
>> 2016
> 
> 
> 3.18.34 is ancient. Find something newer and try to remount normally.
> And then also with recovery if necessary (don't use ro, see if it'll
> mount rw and fix itself). And if not, then try btrfs check with a newer
> version of btrfs-progs, I can't tell from the pasted output what version
> you're using but since the kernel is so old, decent chance the btrfsck
> is old also.

...  So I guess that means we're back to supporting only the latest two 
LTS kernel series, those being 4.1 and 4.4 at this time.  I had hoped 
that btrfs was stabilizing enough, and 3.18 was trouble-free enough btrfs-
wise, that we could expand that to three LTS series now, as the 
indications were we might when 4.4 was still new.  But it seems that 
while we did support it a bit longer, say 2.5 LTS series, that couldn't 
continue until the /next/ LTS came out.

Oh, well, it /was/ a bit of a stretch...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-24  4:56   ` Duncan
@ 2016-06-24  5:34     ` Chris Murphy
  0 siblings, 0 replies; 14+ messages in thread
From: Chris Murphy @ 2016-06-24  5:34 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

On Thu, Jun 23, 2016 at 10:56 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Chris Murphy posted on Thu, 23 Jun 2016 18:54:28 -0600 as excerpted:
>
>> From the pasted kernel messages:
>>
>>> Linux version 3.18.34-std473-amd64 (root@rl-sysrcd-p11) (gcc version
>>> 4.8.5 (Gentoo 4.8.5 p1.3, pie-0.6.2) ) #2 SMP Tue May 24 20:34:19 UTC
>>> 2016
>>
>>
>> 3.18.34 is ancient. Find something newer and try to remount normally.
>> And then also with recovery if necessary (don't use ro, see if it'll
>> mount rw and fix itself). And if not, then try btrfs check with a newer
>> version of btrfs-progs, I can't tell from the pasted output what version
>> you're using but since the kernel is so old, decent chance the btrfsck
>> is old also.
>
> ...  So I guess that means we're back to supporting only the latest two
> LTS kernel series, those being 4.1 and 4.4 at this time.  I had hoped
> that btrfs was stabilizing enough, and 3.18 was trouble-free enough btrfs-
> wise, that we could expand that to three LTS series now, as the
> indications were we might when 4.4 was still new.  But it seems that
> while we did support it a bit longer, say 2.5 LTS series, that couldn't
> continue until the /next/ LTS came out.
>
> Oh, well, it /was/ a bit of a stretch...

Yeah, it looks like 3.18.35 even has some backports, and it's not that old,
but I have no idea if the problem in this case is fixed by something
newer.

I'd say it's a 50/50 shot that a new kernel does better, but for sure
btrfs-progs has a better chance because btrfsck has had lots of
improvements since 3.18. It's just too easy to dd a Fedora 24 live
image to a USB stick, which has kernel 4.5.5 and btrfs-progs 4.5.2 and
give it a shot. And if that doesn't work, then btrfs-image time so
hopefully devs can see if it's possible to improve btrfsck. But at
that point it also means blowing away this fs :-\ but at least it's ro
mountable so anything important can be copied off normally.

-- 
Chris Murphy


* Re: Bad hard drive - checksum verify failure forces readonly mount
       [not found]   ` <5356822.A3RRKHDHNy@linux-omuo>
@ 2016-06-24 16:47     ` Chris Murphy
  2016-06-25  0:06       ` Vasco Almeida
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2016-06-24 16:47 UTC (permalink / raw)
  To: Vasco Almeida; +Cc: Chris Murphy, Btrfs BTRFS

On Fri, Jun 24, 2016 at 9:52 AM, Vasco Almeida <vascomalmeida@sapo.pt> wrote:

>>
>> From the pasted kernel messages:
>> > Linux version 3.18.34-std473-amd64 (root@rl-sysrcd-p11) (gcc version 4.8.5
>> > (Gentoo 4.8.5 p1.3, pie-0.6.2) ) #2 SMP Tue May 24 20:34:19 UTC 2016
>> 3.18.34 is ancient. Find something newer and try to remount normally.
> Present information concerns openSUSE Leap 42.1 (x86_64) mount of root file
> system at boot time. That should mount it normally. Hope that fits what you
> mean.

OK but it's not mounting it normally, it's still being forced readonly
at btrfs_drop_snapshot and the only thing I'm coming up with search
wise is that it's related to qgroups. Have you enabled quotas on this
file system ever?

> btrfs-progs v4.1.2+20151002

A lot of changes have happened since 4.1.2; I would still use something
newer and try to repair it.

>
> $ /usr/sbin/btrfs fi df /
> Data, single: total=10.01GiB, used=9.06GiB
> System, DUP: total=64.00MiB, used=16.00KiB
> Metadata, DUP: total=1.12GiB, used=596.69MiB
> GlobalReserve, single: total=208.00MiB, used=0.00B
>
> I forgot to mention in last e-mail that I ran Marc MERLIN's scrubbing script
> [1] after mounting the device with "-o ro,recovery" on System Rescue CD.
> Even after that device is forced readonly.

OK but System Rescue CD uses an old kernel by btrfs standards, even
accounting for all the backports in that particular version:
4.7.3) 2016-06-04:
Standard kernels: Long-Term-Supported linux-3.18.34 (rescue32 + rescue64)

So that's why I'm suggesting you use something newer, like 4.5.x, same
for btrfs-progs. The old versions aren't working. There's no assurance
it'll work with new versions, but that it doesn't get fixed up with
old versions means you either try new versions or you rebuild the file
system. *shrug*

> I would like to find a solution to be able to mount normally readwrite again
> and hopefully understand what caused the issue.

My best guess is qgroup related, there were a lot of problems with
multiple quota implementations and snapshots and openSUSE does take
many many snapshots. So that could be it. But without a reproducer
it's hard to say what caused it.

-- 
Chris Murphy


* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-24 16:47     ` Chris Murphy
@ 2016-06-25  0:06       ` Vasco Almeida
  2016-06-25 13:20         ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Vasco Almeida @ 2016-06-25  0:06 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Quoting Chris Murphy <lists@colorremedies.com>:

> On Fri, Jun 24, 2016 at 9:52 AM, Vasco Almeida <vascomalmeida@sapo.pt> wrote:
>
>>>
>>> From the pasted kernel messages:
>>> > Linux version 3.18.34-std473-amd64 (root@rl-sysrcd-p11) (gcc  
>>> version 4.8.5
>>> > (Gentoo 4.8.5 p1.3, pie-0.6.2) ) #2 SMP Tue May 24 20:34:19 UTC 2016
>>> 3.18.34 is ancient. Find something newer and try to remount normally.
>> Present information concerns openSUSE Leap 42.1 (x86_64) mount of root file
>> system at boot time. That should mount it normally. Hope that fits what you
>> mean.
>
> OK but it's not mounting it normally, it's still being forced readonly
> at btrfs_drop_snapshot and the only thing I'm coming up with search
> wise is that it's related to qgroups. Have you enabled quotas on this
> file system ever?

Unless openSUSE does that by default, I did not enable quotas. It is  
not something I am aware of doing.
>
>
>> btrfs-progs v4.1.2+20151002
>
> A lot of changes have happened since 4.1.2 I would still use something
> newer and try to repair it.

By repair do you mean issue "btrfs check --repair /device" ?

>> $ /usr/sbin/btrfs fi df /
>> Data, single: total=10.01GiB, used=9.06GiB
>> System, DUP: total=64.00MiB, used=16.00KiB
>> Metadata, DUP: total=1.12GiB, used=596.69MiB
>> GlobalReserve, single: total=208.00MiB, used=0.00B
>>
>> I forgot to mention in last e-mail that I ran Marc MERLIN's scrubbing script
>> [1] after mounting the device with "-o ro,recovery" on System Rescue CD.
>> Even after that device is forced readonly.
>
> OK but System Rescue CD uses an old kernel by btrfs standards, even
> account for all the backports in that particular version:
> 4.7.3) 2016-06-04:
> Standard kernels: Long-Term-Supported linux-3.18.34 (rescue32 + rescue64)
>
> So that's why I'm suggesting you use something newer, like 4.5.x, same
> for btrfs-progs. The old versions aren't working. There's no assurance
> it'll work with new versions, but that it doesn't get fixed up with
> old versions means you either try new versions or you rebuild the file
> system. *shrug*

I am now using Fedora 24 and have issued "mount
/dev/mapper/vg_pupu-lv_opensuse_root /mnt". I got a call trace and
other scary output that I did not get before on the other systems. Please
check the dmesg output linked below.

Linux catarina 4.5.7-300.fc24.x86_64 #1 SMP Wed Jun 8 18:12:45 UTC  
2016 x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs v4.5.2

# btrfs fi show
Label: none  uuid: ad167e92-fbb1-4148-b54d-6345b6fb26da
	Total devices 1 FS bytes used 9.63GiB
	devid    1 size 50.00GiB used 12.32GiB path  
/dev/mapper/vg_pupu-lv_opensuse_root
# btrfs fi df /mnt/
Data, single: total=10.01GiB, used=9.05GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.12GiB, used=597.62MiB
GlobalReserve, single: total=208.00MiB, used=224.00KiB

dmesg http://paste.fedoraproject.org/384352/80842814/
dmesg after umount http://paste.fedoraproject.org/384359/14668108/
diff between two http://paste.fedoraproject.org/384364/11704146/

btrfs check --readonly /dev/mapper/vg_pupu-lv_opensuse_root
http://paste.fedoraproject.org/384361/68112421/

After umount and mounting again, the device was normally mounted  
readwrite again:
/dev/mapper/vg_pupu-lv_opensuse_root on /mnt type btrfs  
(rw,relatime,seclabel,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot)
But trying to umount it afterwards makes the umount command hang. The
device no longer shows in the mount output, though.
CTRL-C or SIGTERM can't kill umount.

dmesg http://paste.fedoraproject.org/384371/14668130/



>> I would like to find a solution to be able to mount normally readwrite again
>> and hopefully understand what caused the issue.
>
> My best guess is qgroup related, there were a lot of problems with
> multiple quota implementations and snapshots and openSUSE does take
> many many snapshots. So that could be it. But without a reproducer
> it's hard to say what caused it.

Thank you again for your time and reply.



* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-25  0:06       ` Vasco Almeida
@ 2016-06-25 13:20         ` Chris Murphy
  2016-06-25 20:10           ` Vasco Almeida
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2016-06-25 13:20 UTC (permalink / raw)
  To: Vasco Almeida; +Cc: Chris Murphy, Btrfs BTRFS

On Fri, Jun 24, 2016 at 6:06 PM, Vasco Almeida <vascomalmeida@sapo.pt> wrote:
> Quoting Chris Murphy <lists@colorremedies.com>:

>> A lot of changes have happened since 4.1.2 I would still use something
>> newer and try to repair it.
>
>
> By repair do you mean issue "btrfs check --repair /device" ?

Once you have copied off the important stuff, yes. It's less likely to
make things worse now. However, there are some things to do first:

> dmesg http://paste.fedoraproject.org/384352/80842814/

[ 1837.386732] BTRFS info (device dm-9): continuing balance
[ 1838.006038] BTRFS info (device dm-9): relocating block group
15799943168 flags 34
[ 1838.684892] BTRFS info (device dm-9): relocating block group
10934550528 flags 36
[ 1839.301453] ------------[ cut here ]------------
[ 1839.301495] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:1625
lookup_inline_extent_backref+0x45c/0x5a0 [btrfs]()

followed by

[ 1839.301797] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:2946
btrfs_run_delayed_refs+0x29d/0x2d0 [btrfs]()
[ 1839.301798] BTRFS: Transaction aborted (error -5)
[...]
[ 1839.301972] BTRFS: error (device dm-9) in
btrfs_run_delayed_refs:2946: errno=-5 IO failure
[ 1839.301975] BTRFS info (device dm-9): forced readonly

So it looks like it was resuming a balance automatically, and while
processing delayed references it's running into something it doesn't
expect and doesn't have a way to fix, so it goes read only to avoid
causing more problems.
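
To pick lines like these out of a long dmesg, a simple filter is enough. This is only a sketch: the three patterns are taken from the messages quoted in this thread, and the function name `btrfs_abort_lines` is invented for the example.

```shell
# Keep only the BTRFS kernel messages that mark a checksum failure, a
# transaction abort, or the switch to readonly. Reads log text on stdin.
btrfs_abort_lines() {
    grep -E 'BTRFS.*(forced readonly|Transaction aborted|checksum verify failed)'
}

# Usage:
#   dmesg | btrfs_abort_lines
```

The timestamps on the matching lines show how long after mounting the abort happened.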

I would do a couple things in order:
1. Mount ro and copy off what you want in case the whole thing gets
worse and can't ever be mounted again.
2. Mount with only these options: -o skip_balance,subvolid=5,nospace_cache

If it mounts rw, don't do anything with it, just see if it cleans up
after itself. It also looks from the previous trace like it was trying to
remove a snapshot, and there are complaints of problems in that
snapshot. So hopefully it'll clean up after itself if you just wait
5 minutes doing nothing. You can check with top to see whether any
btrfs related processes are running, including the btrfs-cleaner
process; wait until they're done.

Then umount. If you want you could have two other consoles ready
first, one for 'journalctl -f' and another for sysrq+t to issue in
case you get a hang. This doesn't fix anything but it collects more
information for a bug report for the devs.

Once you get it umounted normally or by force, the next thing to do is

3. btrfs-image so that devs can see what's causing the problem that
the current code isn't handling well enough.
4. btrfs check --repair

Let's see the results of that repair. You can use 'script
btrfsrepair.txt' first and then 'btrfs check --repair' and it will log
everything. After btrfs check completes, use 'exit' to stop script
from recording and you should have a btrfsrepair.txt file you can post
somewhere. With a plain > redirect not everything gets logged for some
reason, but script will capture everything.
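
The order above can be sketched as a dry-run script. By default it only prints each command; nothing is executed unless DRY_RUN=0 is set. The backup path is a hypothetical placeholder, the 300 second wait stands in for "waiting 5 minutes", and `script -c` is used here as a non-interactive variant of the interactive `script` session described above.

```shell
#!/bin/sh
# Dry-run sketch of the suggested order. Commands are only printed unless
# DRY_RUN=0 is set in the environment.
DEV=${DEV:-/dev/mapper/vg_pupu-lv_opensuse_root}   # device from this thread
MNT=${MNT:-/mnt}
BACKUP=${BACKUP:-/path/to/backup}                  # hypothetical destination

run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recovery_steps() {
    # 1. Mount readonly and copy off anything important.
    run mount -o ro "$DEV" "$MNT"
    run cp -a "$MNT/." "$BACKUP/"
    run umount "$MNT"
    # 2. Mount without resuming the balance, wait, then unmount.
    run mount -o skip_balance,subvolid=5,nospace_cache "$DEV" "$MNT"
    run sleep 300
    run umount "$MNT"
    # 3. Capture a metadata image for the developers.
    run btrfs-image "$DEV" btrfs-lv_opensuse_root.image
    # 4. Repair, logging the whole session with script(1).
    run script -c "btrfs check --repair $DEV" btrfsrepair.txt
}

recovery_steps
```

Printing first makes it easy to review the exact commands before running any of them against a damaged file system.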

Depending on how the repair goes, there might be a couple more options left.

-- 
Chris Murphy


* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-25 13:20         ` Chris Murphy
@ 2016-06-25 20:10           ` Vasco Almeida
  2016-06-25 20:54             ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Vasco Almeida @ 2016-06-25 20:10 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Quoting Chris Murphy <lists@colorremedies.com>:

> On Fri, Jun 24, 2016 at 6:06 PM, Vasco Almeida <vascomalmeida@sapo.pt> wrote:
>> Quoting Chris Murphy <lists@colorremedies.com>:
>> dmesg http://paste.fedoraproject.org/384352/80842814/
>
> [ 1837.386732] BTRFS info (device dm-9): continuing balance
> [ 1838.006038] BTRFS info (device dm-9): relocating block group
> 15799943168 flags 34
> [ 1838.684892] BTRFS info (device dm-9): relocating block group
> 10934550528 flags 36
> [ 1839.301453] ------------[ cut here ]------------
> [ 1839.301495] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:1625
> lookup_inline_extent_backref+0x45c/0x5a0 [btrfs]()
>
> followed by
>
> [ 1839.301797] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:2946
> btrfs_run_delayed_refs+0x29d/0x2d0 [btrfs]()
> [ 1839.301798] BTRFS: Transaction aborted (error -5)
> [...]
> [ 1839.301972] BTRFS: error (device dm-9) in
> btrfs_run_delayed_refs:2946: errno=-5 IO failure
> [ 1839.301975] BTRFS info (device dm-9): forced readonly
>
> So it looks like it was resuming a balance automatically, and while
> processing delayed references it's running into something it doesn't
> expect and doesn't have a way to fix, so it goes read only to avoid
> causing more problems.
>
> I would do a couple things in order:
> 1. Mount ro and copy off what you want in case the whole thing gets
> worse and can't ever be mounted again.
> 2. Mount with only these options: -o skip_balance,subvolid=5,nospace_cache

I have mounted with those options. It was readwrite at first and then it
was forced readonly. You can see a delay between the first BTRFS messages
and the "BTRFS info: forced readonly" message in dmesg.

/dev/mapper/vg_pupu-lv_opensuse_root on /mnt type btrfs  
(ro,relatime,seclabel,nospace_cache,skip_balance,subvolid=5,subvol=/)
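
A small sketch for checking the current mode of a mount from a line like the one above. It assumes the whole `mount` entry is on one line and that the ro/rw flag is the first option inside the parentheses; `mount_mode` is an invented helper name.

```shell
# Print "ro" or "rw" for each line of `mount` output read from stdin, e.g.
#   /dev/... on /mnt type btrfs (ro,relatime,...)
# Assumption: ro/rw is the first option inside the parentheses.
mount_mode() {
    sed -n 's/.*(\(r[ow]\)[,)].*/\1/p'
}

# Usage:
#   mount | grep ' on /mnt ' | mount_mode
```

Running it right after mounting and again a minute later would show the switch from rw to ro.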

> If it mounts rw, don't do anything with it, just see if it cleans up
> after itself. It also looks from the previous trace it was trying to
> remove a snapshot and there are complaints of problems in that
> snapshot. So hopefully just waiting 5 minutes doing nothing and it'll
> clean up after itself (you can check with top to see if there are any
> btrfs related transactions that run including the btrfs-cleaner
> process) wait until they're done.

I can see btrfs processes, including btrfs-cleaner, but they may
not be doing much since the device was forced readonly after mounting it.

> Then umount. If you want you could have two other consoles ready
> first, one for 'journalctl -f' and another for sysrq+t to issue in
> case you get a hang. This doesn't fix anything but it collects more
> information for a bug report for the devs.
>
> Once you get it umounted normally or by force, the next thing to do is

I unmounted it normally (umount /mnt) more than 20 minutes
after mounting it.

> 3. btrfs-image so that devs can see what's causing the problem that
> the current code isn't handling well enough.

btrfs-image does not create a dump image:

# btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root  
btrfs-lv_opensuse_root.image
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
Csum didn't match
Error reading metadata block
Error adding block -5
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
Csum didn't match
Error reading metadata block
Error flushing pending -5
create failed (Success)
# echo $?
1

> 4. btrfs check --repair

Did not issue this command yet.

dmesg http://paste.fedoraproject.org/384799/14668851/

Thank you for helping.


* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-25 20:10           ` Vasco Almeida
@ 2016-06-25 20:54             ` Chris Murphy
  2016-06-26 13:05               ` Vasco Almeida
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2016-06-25 20:54 UTC (permalink / raw)
  To: Vasco Almeida; +Cc: Chris Murphy, Btrfs BTRFS

On Sat, Jun 25, 2016 at 2:10 PM, Vasco Almeida <vascomalmeida@sapo.pt> wrote:
> Quoting Chris Murphy <lists@colorremedies.com>:

>>
>> I would do a couple things in order:
>> 1. Mount ro and copy off what you want in case the whole thing gets
>> worse and can't ever be mounted again.
>> 2. Mount with only these options: -o skip_balance,subvolid=5,nospace_cache
>
>
> I have mounted with that options and was readwrite first and then it forces
> readonly. You can see a delay between first BTRFS messages and the "BTRFS
> info: forced readonly" message in dmesg.
>
> /dev/mapper/vg_pupu-lv_opensuse_root on /mnt type btrfs
> (ro,relatime,seclabel,nospace_cache,skip_balance,subvolid=5,subvol=/)
>
>
>> If it mounts rw, don't do anything with it, just see if it cleans up
>> after itself. It also looks from the previous trace it was trying to
>> remove a snapshot and there are complaints of problems in that
>> snapshot. So hopefully just waiting 5 minutes doing nothing and it'll
>> clean up after itself (you can check with top to see if there are any
>> btrfs related transactions that run including the btrfs-cleaner
>> process) wait until they're done.
>
>
> I can see that btrfs processes including btrfs-cleaner but they may be not
> doing much since device was forced readonly after mounting it.

Readonly just refers to user space up to and including VFS, is my
understanding. The file system itself can still write to the block
device.


> I have umount it normally (umount /mnt) after more than 20 minutes since
> mounting it.
>
>> 3. btrfs-image so that devs can see what's causing the problem that
>> the current code isn't handling well enough.
>
>
> btrfs-image does not create dump image:
>
> # btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root
> btrfs-lv_opensuse_root.image
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> Csum didn't match
> Error reading metadata block
> Error adding block -5
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> Csum didn't match
> Error reading metadata block
> Error flushing pending -5
> create failed (Success)
> # echo $?
> 1

Well, it's pretty strange to have DUP metadata and for the checksum
verify to fail on both copies. I don't have much optimism that btrfsck
repair can fix it either. But it's still worth a shot since there's
not much else to go on.
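
To see at a glance how many distinct blocks are failing, the repeated btrfs-image messages can be collapsed into one line per block. A sketch only, assuming the exact message wording quoted above; `csum_failures` is an illustrative name.

```shell
# Collapse repeated "checksum verify failed on N found X wanted Y" lines
# (read from stdin) into one summary line per distinct message.
csum_failures() {
    grep '^checksum verify failed on ' | sort | uniq -c |
        awk '{ print "block=" $6 " found=" $8 " wanted=" $10 " count=" $1 }'
}

# Usage (output file name as used earlier in this thread):
#   btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root btrfs-lv_opensuse_root.image 2>&1 | csum_failures
```

A single summary line means every attempt failed on the same block with the same checksum values.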


-- 
Chris Murphy


* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-25 20:54             ` Chris Murphy
@ 2016-06-26 13:05               ` Vasco Almeida
  2016-06-26 19:54                 ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Vasco Almeida @ 2016-06-26 13:05 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Sat, 25-06-2016 at 14:54 -0600, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 2:10 PM, Vasco Almeida <vascomalmeida@sapo.pt
> > wrote:
> > Quoting Chris Murphy <lists@colorremedies.com>:
> > > 3. btrfs-image so that devs can see what's causing the problem
> > > that
> > > the current code isn't handling well enough.
> > 
> > 
> > btrfs-image does not create dump image:
> > 
> > # btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root
> > btrfs-lv_opensuse_root.image
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > Csum didn't match
> > Error reading metadata block
> > Error adding block -5
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > Csum didn't match
> > Error reading metadata block
> > Error flushing pending -5
> > create failed (Success)
> > # echo $?
> > 1
> 
> Well it's pretty strange to have DUP metadata and for the checksum
> verify to fail on both copies. I don't have much optimism that brfsck
> repair can fix it either. But still it's worth a shot since there's
> not much else to go on.

I have tried "btrfs check --repair /device" but that does not seem to do
any good.
http://paste.fedoraproject.org/384960/66945936/

I then issued "mount /device /mnt" and, like before, it was mounted
readwrite and then forced readonly. I got some kernel oopses and traces.

I noticed that btrfs-balance was using ~100% CPU while the btrfs device
was mounted readonly. I let it run for about 20 minutes.
Then I had to reboot because the system was not responding well: I was
unable to open or close applications or use the internet. I did SysRq+reisu
(operations were enabled) and then pressed the reset button on the computer.

Unfortunately the dmesg dumps were lost after resetting the computer.

What else can I do, or must I rebuild the file system?


* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-26 13:05               ` Vasco Almeida
@ 2016-06-26 19:54                 ` Chris Murphy
  2016-06-27  6:30                   ` Vasco Almeida
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2016-06-26 19:54 UTC (permalink / raw)
  To: Vasco Almeida; +Cc: Chris Murphy, Btrfs BTRFS

On Sun, Jun 26, 2016 at 7:05 AM, Vasco Almeida <vascomalmeida@sapo.pt> wrote:
> On Sat, 25-06-2016 at 14:54 -0600, Chris Murphy wrote:
>> On Sat, Jun 25, 2016 at 2:10 PM, Vasco Almeida <vascomalmeida@sapo.pt
>> > wrote:
>> > Quoting Chris Murphy <lists@colorremedies.com>:
>> > > 3. btrfs-image so that devs can see what's causing the problem
>> > > that
>> > > the current code isn't handling well enough.
>> >
>> >
>> > btrfs-image does not create dump image:
>> >
>> > # btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root
>> > btrfs-lv_opensuse_root.image
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > Csum didn't match
>> > Error reading metadata block
>> > Error adding block -5
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > Csum didn't match
>> > Error reading metadata block
>> > Error flushing pending -5
>> > create failed (Success)
>> > # echo $?
>> > 1
>>
>> Well it's pretty strange to have DUP metadata and for the checksum
>> verify to fail on both copies. I don't have much optimism that brfsck
>> repair can fix it either. But still it's worth a shot since there's
>> not much else to go on.
>
> I have tried "btrfs check --repair /device" but that seems do not do
> any good.
> http://paste.fedoraproject.org/384960/66945936/

It did fix things, in particular with the snapshot that was having
problems being dropped. But it seems that's not enough to prevent the
file system from going read-only.

There's more than one bug here. You might check whether the repair was
good enough that btrfs-image now works. If not, run
btrfs-debug-tree <dev> > file.txt and post that file somewhere. Note
that this exposes file names. Maybe that'll shed some light on the
problem. It's also worth filing a bug at bugzilla.kernel.org that
references the debug tree (it's probably too big to attach); maybe a
dev will be able to look at it and improve things so they don't fail.
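The two diagnostic steps above can be sketched as a small helper. This is only an illustration: the function name diag_commands and the output file names are made up, and it prints the commands for review instead of running them. The -c9 option asks btrfs-image for maximum compression; btrfs-image also documents a -s option to sanitize file names, if exposing them is a concern.

```shell
# Hypothetical helper: print (do not run) the commands that capture a
# metadata image and a debug-tree dump for a given device.
# $1 = block device, $2 = existing output directory (no spaces assumed).
diag_commands() {
    dev=$1
    out=$2
    # Metadata-only image for developers; -c9 = highest compression level.
    printf 'btrfs-image -c9 %s %s/btrfs.image\n' "$dev" "$out"
    # Human-readable dump of the trees; this exposes file names.
    printf 'btrfs-debug-tree %s > %s/debug-tree.txt\n' "$dev" "$out"
}

# Usage (as root, ideally with the file system unmounted):
#   diag_commands /dev/mapper/vg_pupu-lv_opensuse_root /tmp/diag        # review
#   diag_commands /dev/mapper/vg_pupu-lv_opensuse_root /tmp/diag | sh   # execute
```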

> What else can I do, or must I rebuild the file system?

Well, it's a long shot, but you could try --repair --init-csum-tree,
which will create a new csum tree. But that applies to data; if the
file system is going read-only because of metadata corruption, this
won't help. As a last resort you could try --init-extent-tree. What I
can't answer is which order to do them in.

In any case there will be files that you shouldn't trust after the csum
tree has been recreated: anything corrupt will now have a new, valid
csum, so you can get silent data corruption. It's better to just blow
away this file system, make a new one, and reinstall the OS. But if
you're feeling brave, you can try one or both of those additional
options and see if they help.
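For reference, the two long-shot invocations can be spelled out as below. This is a sketch only: the function name repair_commands is made up, it prints the commands rather than running them, and the order shown is arbitrary since the right order is not known. Both options rewrite metadata and are meant for an unmounted device.

```shell
# Hypothetical helper: print the two last-resort btrfs check invocations
# for a device. Nothing is executed; the order is NOT a recommendation.
repair_commands() {
    dev=$1
    # Rebuild the checksum tree (covers data checksums).
    printf 'btrfs check --repair --init-csum-tree %s\n' "$dev"
    # Rebuild the extent tree.
    printf 'btrfs check --repair --init-extent-tree %s\n' "$dev"
}

# Usage: repair_commands /dev/mapper/vg_pupu-lv_opensuse_root
```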

-- 
Chris Murphy


* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-26 19:54                 ` Chris Murphy
@ 2016-06-27  6:30                   ` Vasco Almeida
  2016-06-27 16:49                     ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Vasco Almeida @ 2016-06-27  6:30 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Sun, 26-06-2016 at 13:54 -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 7:05 AM, Vasco Almeida <vascomalmeida@sapo.pt
> > wrote:
> > I have tried "btrfs check --repair /device" but that does not seem
> > to do any good.
> > http://paste.fedoraproject.org/384960/66945936/
> 
> It did fix things, in particular with the snapshot that was having
> problems being dropped. But it's not enough it seems to prevent it
> from going read only.
> 
> There's more than one bug here, you might see if the repair was good
> enough that it's possible to use btrfs-image now.

File system image available at (choose one link)
https://mega.nz/#!AkAEgKyB!RUa7G5xHIygWm0ALx5ZxQjjXNdFYa7lDRHJ_sW0bWLs
https://www.sendspace.com/file/i70cft

>  If not, use
> btrfs-debug-tree <dev> > file.txt and post that file somewhere. This
> does expose file names. Maybe that'll shed some light on the problem.
> But also worth filing a bug at bugzilla.kernel.org with this debug
> tree referenced (probably too big to attach), maybe a dev will be
> able
> to look at it and improve things so they don't fail.

Should I file a bug report with the image dump linked above, the
btrfs-debug-tree output, or both?
I think I will use the subject of this thread as summary to file the
bug. Can you think of something more suitable or is that fine?

> > What else can I do, or must I rebuild the file system?
> 
> Well, it's a long shot but you could try using --repair --init-csum-tree
> which will create a new csum tree. But that applies to data, if the
> problem with it going read only is due to metadata corruption this
> won't help. And then last you could try --init-extent-tree. Thing I
> can't answer is which order to do it in.
> 
> In any case there will be files that you shouldn't trust after csum
> has been recreated, anything corrupt will now have a new csum, so you
> can get silent data corruption. It's better to just blow away this
> file system and make a new one and reinstall the OS. But if you're
> feeling brave, you can try one or both of those additional options
> and
> see if they can help.

I think I will reinstall the OS since, even if I manage to recover the
file system from this issue, that OS will be something I cannot fully
trust.


* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-27  6:30                   ` Vasco Almeida
@ 2016-06-27 16:49                     ` Chris Murphy
  2016-07-05 17:43                       ` Vasco Almeida
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2016-06-27 16:49 UTC (permalink / raw)
  To: Vasco Almeida; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Jun 27, 2016 at 12:30 AM, Vasco Almeida <vascomalmeida@sapo.pt> wrote:

> File system image available at (choose one link)
> https://mega.nz/#!AkAEgKyB!RUa7G5xHIygWm0ALx5ZxQjjXNdFYa7lDRHJ_sW0bWLs
> https://www.sendspace.com/file/i70cft

> Should I file a bug report with the image dump linked above, the
> btrfs-debug-tree output, or both?

If it were me, I'd include both. Maybe the image is incomplete or vice
versa. The debug tree output is also human readable. I'd also put them
up in a cloud location where you can kinda forget about them for a
while; I've had images go 6+ months before a dev looked at them.

> I think I will use the subject of this thread as summary to file the
> bug. Can you think of something more suitable or is that fine?

I would try to summarize something like:
file system created with btrfs-progs version -----, and mostly used
with kernel version -----; inexplicably, the file system became
unusable, always mounting read-only at boot time. Newer kernel
versions still could not mount it read-write, nor was btrfs check from
btrfs-progs version ----- able to repair it. See thread URL for more
details.

btrfs-image URL
btrfs-debug-tree URL
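
A small helper along these lines could fill in that template. It's just a sketch: the function name bug_summary and the argument order are made up, and every value has to be supplied by the reporter.

```shell
# Hypothetical helper: print a bug summary from five arguments:
#   $1 btrfs-progs version used at mkfs time   $2 kernel version mostly used
#   $3 btrfs-progs version used for the check  $4 image URL  $5 debug-tree URL
bug_summary() {
    printf 'File system created with %s, mostly used with kernel %s.\n' "$1" "$2"
    printf 'btrfs check from %s was unable to repair it.\n' "$3"
    printf 'btrfs-image: %s\n' "$4"
    printf 'btrfs-debug-tree: %s\n' "$5"
}

# Usage (on the rescue system; fill in the placeholders yourself):
#   bug_summary "<mkfs btrfs-progs version>" "$(uname -r)" \
#       "$(btrfs --version)" "<image URL>" "<debug tree URL>"
```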

> I think I will reinstall the OS since, even if I manage to recover the
> file system from this issue, that OS will be something I cannot fully
> trust.

Yeah pretty much that's right. There is an rpm command where you can
have it check the signatures of all installed binaries, but I forget
what it is offhand. That'd be an alternative to reinstalling if the
init options were to work.
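
The command may have been rpm's verify mode (that's a guess on my part, not confirmed above): rpm -Va compares every installed file against the RPM database, checking digests, sizes and modes rather than package signatures. A sketch that keeps only files whose content digest differs; the function name digest_failures is made up, and paths containing spaces are not handled:

```shell
# Hypothetical filter for `rpm -V` output. Each line starts with nine flag
# characters (SM5DLUGTP); the third is "5" when the file digest differs.
# "missing" lines are skipped because their first field is not 9 chars long.
digest_failures() {
    awk 'length($1) >= 9 && substr($1, 3, 1) == "5" { print $NF }'
}

# Usage on the installed system (can take a long time):
#   rpm -Va 2>/dev/null | digest_failures
```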

-- 
Chris Murphy


* Re: Bad hard drive - checksum verify failure forces readonly mount
  2016-06-27 16:49                     ` Chris Murphy
@ 2016-07-05 17:43                       ` Vasco Almeida
  0 siblings, 0 replies; 14+ messages in thread
From: Vasco Almeida @ 2016-07-05 17:43 UTC (permalink / raw)
  To: BTRFS

Bug reported
https://bugzilla.kernel.org/show_bug.cgi?id=121491

Thank you for helping.


end of thread, other threads:[~2016-07-05 17:43 UTC | newest]

Thread overview: 14+ messages
2016-06-23 20:30 Bad hard drive - checksum verify failure forces readonly mount Vasco Almeida
2016-06-24  0:54 ` Chris Murphy
2016-06-24  4:56   ` Duncan
2016-06-24  5:34     ` Chris Murphy
     [not found]   ` <5356822.A3RRKHDHNy@linux-omuo>
2016-06-24 16:47     ` Chris Murphy
2016-06-25  0:06       ` Vasco Almeida
2016-06-25 13:20         ` Chris Murphy
2016-06-25 20:10           ` Vasco Almeida
2016-06-25 20:54             ` Chris Murphy
2016-06-26 13:05               ` Vasco Almeida
2016-06-26 19:54                 ` Chris Murphy
2016-06-27  6:30                   ` Vasco Almeida
2016-06-27 16:49                     ` Chris Murphy
2016-07-05 17:43                       ` Vasco Almeida
