Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing

* Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
@ 2016-01-24 17:52 Tom Hunt
  2016-01-24 20:34 ` Chris Murphy
  0 siblings, 1 reply; 12+ messages in thread
From: Tom Hunt @ 2016-01-24 17:52 UTC (permalink / raw)
  To: linux-btrfs

I've been running for a week or two using a single-drive 6TB btrfs
volume. For some of this time, the machine running had bad memory,
which led to various checksum errors. For most of these, I just
deleted the relevant file and reacquired it (the errors fortuitously
never occurring in files which were not easily replaceable). However,
there currently remains a single error which does not appear to be in
any file:

# btrfs scrub status /
scrub status for 85f5b744-f68c-4194-aa90-d6fe238115a3
      scrub started at Fri Jan 22 09:49:02 2016 and finished after 11:55:08
      total bytes scrubbed: 4.27TiB with 1 errors
      error details: csum=1
      corrected errors: 0, uncorrectable errors: 1, unverified errors: 0

# dmesg
(...)
[52841.310422] BTRFS warning (device dm-0): csum failed ino 515 off 15118336 csum 2629660496 expected csum 54021641
[52841.335656] BTRFS warning (device dm-0): csum failed ino 515 off 15118336 csum 2629660496 expected csum 54021641
[95071.256448] BTRFS: bdev /dev/mapper/rootvol_1 errs: wr 0, rd 0, flush 0, corrupt 11, gen 0
[95071.256532] BTRFS: unable to fixup (regular) error at logical 4450167468032 on dev /dev/mapper/rootvol_1

I've searched for ino 515, and the file there does not have any
apparent error (can read the whole thing without problem; deleting and
recreating it does not make the error go away). The error is, of
course, uncorrectable, because it's a single-drive volume. However,
having put in a second drive, the balance filter to convert to raid1
fails because of the I/O error.

How do I deal with this?

-- 
Tom Hunt

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-24 17:52 Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing Tom Hunt
@ 2016-01-24 20:34 ` Chris Murphy
  2016-01-24 20:58   ` Tom Hunt
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2016-01-24 20:34 UTC (permalink / raw)
  To: Tom Hunt; +Cc: Btrfs BTRFS

On Sun, Jan 24, 2016 at 10:52 AM, Tom Hunt <tomdicksonhunt@gmail.com> wrote:
> I've been running for a week or two using a single-drive 6TB btrfs
> volume. For some of this time, the machine running had bad memory,
> which led to various checksum errors. For most of these, I just
> deleted the relevant file and reacquired it (the errors fortuitously
> never occurring in files which were not easily replaceable). However,
> there currently remains a single error which does not appear to be in
> any file:
>
> # btrfs scrub status /
> scrub status for 85f5b744-f68c-4194-aa90-d6fe238115a3
>       scrub started at Fri Jan 22 09:49:02 2016 and finished after 11:55:08
>       total bytes scrubbed: 4.27TiB with 1 errors
>       error details: csum=1
>       corrected errors: 0, uncorrectable errors: 1, unverified errors: 0
>
> # dmesg
> (...)
> [52841.310422] BTRFS warning (device dm-0): csum failed ino 515 off 15118336 csum 2629660496 expected csum 54021641
> [52841.335656] BTRFS warning (device dm-0): csum failed ino 515 off 15118336 csum 2629660496 expected csum 54021641
> [95071.256448] BTRFS: bdev /dev/mapper/rootvol_1 errs: wr 0, rd 0, flush 0, corrupt 11, gen 0
> [95071.256532] BTRFS: unable to fixup (regular) error at logical 4450167468032 on dev /dev/mapper/rootvol_1
>
> I've searched for ino 515, and the file there does not have any
> apparent error (can read the whole thing without problem; deleting and
> recreating it does not make the error go away). The error is, of
> course, uncorrectable, because it's a single-drive volume. However,
> having put in a second drive, the balance filter to convert to raid1
> fails because of the I/O error.

You delete the file and yet the scrub still says inode 515 exists and
has an error? Or there are no errors, but then after copying the same
file back to the volume, the problem reoccurs? Are there any snapshots
or subvolumes? Because if there are any subvolumes/snapshots, each is
its own fs tree with its own set of inodes. So an inode can be used
more than once for different files so I wonder off hand if you haven't
found the actual problematic file.

Or possibly it's a directory, and not a file.

# find /brick2 -inum 60724

On my system, it returns six results, four are files, two of which are
in common to the others due to snapshotting, and two are directories
one of which is also a snapshot of the other. So a single inode can
not only appear multiple times on a Btrfs volume, but can be pointing
to a file or a directory. The scrub not saying what the file path is
suggests it could be a directory.
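
A minimal sketch of that check, assuming the kernel messages have the same
"csum failed ino N off M" shape as the ones quoted above. The function names
and the example mountpoints are made up for illustration; the only btrfs
command used is 'btrfs inspect-internal inode-resolve', which looks an inode
number up inside the subvolume containing the given path.

```sh
#!/bin/sh
# Print the unique "inode offset" pairs found in BTRFS "csum failed"
# kernel messages read from stdin.
csum_failed_inodes() {
    sed -n 's/.*csum failed ino \([0-9][0-9]*\) off \([0-9][0-9]*\) .*/\1 \2/p' | sort -u
}

# Try to resolve one inode number in each subvolume path given.
# inode-resolve reports an error for a subvolume that has no such inode,
# so failures are expected and ignored here.
resolve_in_subvolumes() {
    ino=$1
    shift
    for subvol in "$@"; do
        printf '== %s ==\n' "$subvol"
        btrfs inspect-internal inode-resolve "$ino" "$subvol" || true
    done
}

# Example (as root); the paths are assumptions about the layout:
#   dmesg | csum_failed_inodes
#   resolve_in_subvolumes 515 / /home
```

Because inode numbers are per subvolume, every subvolume has to be asked
separately; a single hit does not rule out another object with the same
number elsewhere.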


-- 
Chris Murphy

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-24 20:34 ` Chris Murphy
@ 2016-01-24 20:58   ` Tom Hunt
  2016-01-24 21:07     ` Chris Murphy
  0 siblings, 1 reply; 12+ messages in thread
From: Tom Hunt @ 2016-01-24 20:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

> You delete the file and yet the scrub still says inode 515 exists and
> has an error? Or there are no errors, but then after copying the same
> file back to the volume, the problem reoccurs? Are there any snapshots
> or subvolumes? Because if there are any subvolumes/snapshots, each is
> its own fs tree with its own set of inodes. So an inode can be used
> more than once for different files so I wonder off hand if you haven't
> found the actual problematic file.
> 
> Or possibly it's a directory, and not a file.

There are no snapshots, but there are subvolumes. I did the same procedure on
the file at inode 515 in each subvolume, which was:

# cp $file ~
# rm $file
# mv ~/$file {old_file_path}

This concluded without any errors. After doing this, the inode number is
different, and 'find / -inum 515' no longer finds anything on either subvolume.
However, initiating a scrub after this still shows the error at ino 515.

-- 
Tom Hunt

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-24 20:58   ` Tom Hunt
@ 2016-01-24 21:07     ` Chris Murphy
  2016-01-24 21:17       ` Tom Hunt
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2016-01-24 21:07 UTC (permalink / raw)
  To: Tom Hunt; +Cc: Chris Murphy, Btrfs BTRFS

On Sun, Jan 24, 2016 at 1:58 PM, Tom Hunt <tomdicksonhunt@gmail.com> wrote:
>> You delete the file and yet the scrub still says inode 515 exists and
>> has an error? Or there are no errors, but then after copying the same
>> file back to the volume, the problem reoccurs? Are there any snapshots
>> or subvolumes? Because if there are any subvolumes/snapshots, each is
>> its own fs tree with its own set of inodes. So an inode can be used
>> more than once for different files so I wonder off hand if you haven't
>> found the actual problematic file.
>>
>> Or possibly it's a directory, and not a file.
>
> There are no snapshots, but there are subvolumes. I did the same procedure on
> the file at inode 515 in each subvolume, which was:
>
> # cp $file ~
> # rm $file
> # mv ~/$file {old_file_path}
>
> This concluded without any errors. After doing this, the inode number is
> different, and 'find / -inum 515' no longer finds anything on either subvolume.
> However, initiating a scrub after this still shows the error at ino 515.

Try pointing the find command at the mountpoint, rather than at a subvolume.


-- 
Chris Murphy

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-24 21:07     ` Chris Murphy
@ 2016-01-24 21:17       ` Tom Hunt
  2016-01-24 22:04         ` Chris Murphy
  0 siblings, 1 reply; 12+ messages in thread
From: Tom Hunt @ 2016-01-24 21:17 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

My arrangement is two subvolumes, 'root' and 'home', directly under the root
subvolume. The 'root' subvolume is mounted on /, the 'home' subvolume on /home.
I've tried both pointing the find command at /, and mounting the root subvolume
on a mount directory and pointing it there. Neither shows any file or directory
with inum 515.

-- 
Tom Hunt

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-24 21:17       ` Tom Hunt
@ 2016-01-24 22:04         ` Chris Murphy
  2016-01-24 22:16           ` Tom Hunt
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2016-01-24 22:04 UTC (permalink / raw)
  To: Tom Hunt; +Cc: Chris Murphy, Btrfs BTRFS

On Sun, Jan 24, 2016 at 2:17 PM, Tom Hunt <tomdicksonhunt@gmail.com> wrote:
> My arrangement is two subvolumes, 'root' and 'home', directly under the root
> subvolume. The 'root' subvolume is mounted on /, the 'home' subvolume on /home.
> I've tried both pointing the find command at /, and mounting the root subvolume
> on a mount directory and pointing it there. Neither show any file or directory
> with inum 515.

Inode 515 isn't special as far as I can tell. That you get zero
results for 'find <mp> -inum 515' and yet it appears with an error
during a scrub is weird.

What do you get for btrfs check without --repair?



-- 
Chris Murphy

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-24 22:04         ` Chris Murphy
@ 2016-01-24 22:16           ` Tom Hunt
  2016-01-24 22:45             ` Chris Murphy
  0 siblings, 1 reply; 12+ messages in thread
From: Tom Hunt @ 2016-01-24 22:16 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

I get "currently mounted, aborting".

If I must bring down the machine over this, I can, but I'd prefer not to.

-- 
Tom Hunt

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-24 22:16           ` Tom Hunt
@ 2016-01-24 22:45             ` Chris Murphy
  2016-01-25  1:06               ` Chris Murphy
                                 ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Chris Murphy @ 2016-01-24 22:45 UTC (permalink / raw)
  To: Tom Hunt; +Cc: Chris Murphy, Btrfs BTRFS

On Sun, Jan 24, 2016 at 3:16 PM, Tom Hunt <tomdicksonhunt@gmail.com> wrote:
> I get "currently mounted, aborting".
>
> If I must bring down the machine over this, I can, but I'd prefer not to.

Now is a good time to refresh its backup, while it's online. It's also
a good idea to maintain readonly snapshots of each subvolume you want
to keep in case you need to depend on using btrfs send/receive to move
the data to a new file system, since only ro snapshots can be used for
send/receive.

Maybe it's an orphaned item. I don't know much about those, whether
btrfs check finds or removes them. But you can safely use
btrfs-debug-tree while the fs is mounted.

btrfs-debug-tree <dev> | grep -A3 -B3 ORPHAN

That'd find object type ORPHAN and item type ORPHAN_ITEM. Again, I have
no idea what to do about it if either is found and matches inode 515.
You could also do a search using

btrfs-debug-tree <dev> | grep -A3 -B3 "(515 "

and see if that reveals anything.

While the fs has to be unmounted for this, the --init-csum-tree option
for btrfs check will obliterate the current csum tree (and all file
csums) and then compute new csums for everything. That might fix it,
at least if it's a stuck orphan item then this ought to give it a
valid csum so you can proceed with conversion.
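
A rough sketch of the snapshot and send/receive preparation mentioned above,
written as a dry run that only prints the commands. The function name and all
paths are hypothetical. It assumes the snapshot directory is on the same
filesystem as the subvolumes (a snapshot cannot cross filesystems) and that
the destination is a mounted btrfs filesystem.

```sh
#!/bin/sh
# Print, without running them, the commands to take a read-only snapshot
# of each subvolume and to send that snapshot to another btrfs filesystem.
#   $1      directory for the snapshots (same filesystem as the sources)
#   $2      destination directory on the receiving filesystem
#   $3...   subvolume paths
plan_send_receive() {
    snapdir=$1
    dest=$2
    shift 2
    for subvol in "$@"; do
        name=$(basename "$subvol")
        echo "btrfs subvolume snapshot -r $subvol $snapdir/$name.ro"
        echo "btrfs send $snapdir/$name.ro | btrfs receive $dest"
    done
}

# Example with made-up paths:
#   plan_send_receive /mnt/top/snapshots /mnt/backup /mnt/top/root /mnt/top/home
```

Printing first makes it easy to review the paths before anything is run as
root.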

-- 
Chris Murphy

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-24 22:45             ` Chris Murphy
@ 2016-01-25  1:06               ` Chris Murphy
  2016-01-25  1:28               ` Duncan
  2016-01-25  3:21               ` Tom Hunt
  2 siblings, 0 replies; 12+ messages in thread
From: Chris Murphy @ 2016-01-25  1:06 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun, Jan 24, 2016 at 3:45 PM, Chris Murphy <lists@colorremedies.com> wrote:

> Maybe it's an orphaned item. I don't know much about those, whether
> btrfs check finds or removes them. But you can safely use
> btrfs-debug-tree while the fs is mounted.
>
> btrfs-debug-tree <dev> | grep -A3 -B3 ORPHAN
>
> That'd find object type ORPHAN and item type ORPHAN_ITEM.

OK I guess that's not implemented yet, nevermind.


-- 
Chris Murphy

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-24 22:45             ` Chris Murphy
  2016-01-25  1:06               ` Chris Murphy
@ 2016-01-25  1:28               ` Duncan
  2016-01-25  3:21               ` Tom Hunt
  2 siblings, 0 replies; 12+ messages in thread
From: Duncan @ 2016-01-25  1:28 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Sun, 24 Jan 2016 15:45:49 -0700 as excerpted:

> Maybe it's an orphaned item. I don't know much about those, whether
> btrfs check finds or removes them. But you can safely use
> btrfs-debug-tree while the fs is mounted.
> 
> btrfs-debug-tree <dev> | grep -A3 -B3 ORPHAN
> 
> That'd find object type ORPHAN and item type ORPHAN_ITEM. Again, no idea
> what to do about it if either are found, and matches inode 515.
> You could also do a search using
> 
> btrfs-debug-tree <dev> | grep -A3 -B3 "(515 "

Orphans, in POSIX filesystem context, are files that are open, but 
deleted.  Think shared object libraries (*.so*) that are loaded by a 
running program at the time they're updated, in which context it's a 
security issue as well since some of those updates might have been 
security related and existing processes continue to use the old, 
vulnerable versions until they are restarted.  These files can be 
logically deleted so nothing else can access them, but the filesystem 
can't remove all references to them until all running programs holding 
them open have closed them.  That's what makes them orphan -- nothing new 
can access them, but existing references mean they can't yet be fully 
removed from the filesystem, either.

Typically such orphans block unmounting or remounting the filesystem 
readonly (unless it's forced read-only by filesystem error or the kernel's 
emergency SysRq sequence, etc, which overrides things and locks orphans in 
their existing state on the filesystem, just as a crash would, except it 
doesn't necessarily immediately crash, tho a livelock often follows 
pretty quickly).

There are various tools available that can find these files and tell you 
which processes are holding them open and/or kill or restart those 
processes in order to release the locks and close the files, allowing 
them to be fully deleted.  I routinely use one after updates here called 
lib_users, which simply tells me what processes are holding open these 
deleted files, so I can decide whether I want to restart them, kill them, 
or simply ignore the problem until later.  

In a system reboot (or simple descent to single user or systemd emergency 
mode to stop most services, systemd itself can then be reexecuted using 
systemctl daemon-reexec, if it's holding open deleted files as well, 
before returning to normal mode), normal processes and system services 
are stopped, releasing these files, so the filesystem can clean them up 
before unmount or remount-readonly.

(Here, my root, including /usr, remains mounted read-only by default, 
only being mounted writable for updates.  Thus my routine use of 
lib_users to find services I need to restart, or if there's too many, 
decide I want to temporarily go to systemd emergency mode, before 
returning to normal mode, releasing all these deleted files in the 
process, so I can cleanly remount / read-only once again.  That of course 
is the reason I know a bit more about this than many, since it's part of 
my update routine now.)

Of course in an unclean shutdown situation, the filesystem will not have 
had the chance to clean up the orphans before the umount or remount-
readonly, so these orphans remain around at reboot and must be cleaned up 
then.  Btrfs at least, does this automatically, so btrfs check need not 
be run to do it, as the check has to be done on an unmounted filesystem, 
and at mount, btrfs will handle it, itself.  However, btrfs check may 
still clean them up as well (I'm not sure), simply to avoid other issues 
the orphan files might have that are easiest eliminated by simply 
eliminating the files, since that's what would happen on mount anyway.

If it's orphan files, just quitting X (or logging out and back in if you 
use a *DM graphical login, I don't, preferring to login at the CLI and 
run startx) tends to release a lot of them, no reboot needed.  But 
sometimes quitting X and restarting other services holding open deleted 
files may be necessary as well, and if there's enough such services, 
descending to emergency/single-user mode, and possibly restarting init/
systemd itself, if it is holding open any such files, may be necessary.  
But a full reboot isn't normally needed.

Tho of course restarting X, and descending to emergency/single-user-mode 
even more so, can in practice be as bad as having to reboot.  Going to 
emergency/single-user especially, pretty much all you save in practice is 
your uptime statistics and caches that would be lost on a full reboot.  
But if you're running over a year uptime (yeah, right, on a still not 
fully stable btrfs where the recommendation is to keep to reasonably 
current kernels, hardly the stuff of year-plus uptimes!), or have 
spinning rust and lots of RAM and thus cache that would have to be 
reloaded from the slow spinning rust, going single-user to avoid dropping 
them can be well worth the hassle over simply rebooting.

Meanwhile, if it /is/ an orphan file, it's an interesting situation, 
since that means the checksum was fine and btrfs was able to load the 
file when whatever is holding it open/orphan was started, but that isn't 
the case now.  Which means we have evidence of near-real-time (well, 
depending on how long the file has been open) block decay and checksum 
invalidation, which has serious implications in terms of storage device 
health.  Be sure to run smartctl -AH, or similar, to see what's going on, 
and track it closely for awhile, as you may well have a failing device!
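
For anyone without lib_users installed, a minimal sketch of the same idea
using only /proc, assuming a Linux system where the link target of an open
but deleted file ends in " (deleted)". The function names are made up for
illustration, and 'lsof +L1' gives similar information where lsof is
available.

```sh
#!/bin/sh
# Succeed if a /proc/<pid>/fd link target names a deleted file.
is_deleted_target() {
    case $1 in
        *' (deleted)') return 0 ;;
        *) return 1 ;;
    esac
}

# Print "pid target" for every open-but-deleted file the caller may inspect.
# Run as root to see the descriptors of other users' processes.
list_deleted_open_files() {
    for fd in /proc/[0-9]*/fd/*; do
        # Descriptors can vanish while we scan, so skip unreadable links.
        target=$(readlink "$fd" 2>/dev/null) || continue
        if is_deleted_target "$target"; then
            pid=${fd#/proc/}
            pid=${pid%%/*}
            printf '%s %s\n' "$pid" "$target"
        fi
    done
}

# Example: list_deleted_open_files | sort -n
```

This only reports which processes hold such files; deciding whether to
restart or kill them is still left to the admin.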

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-24 22:45             ` Chris Murphy
  2016-01-25  1:06               ` Chris Murphy
  2016-01-25  1:28               ` Duncan
@ 2016-01-25  3:21               ` Tom Hunt
  2016-01-25  5:58                 ` Chris Murphy
  2 siblings, 1 reply; 12+ messages in thread
From: Tom Hunt @ 2016-01-25  3:21 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

The only output from btrfs-debug-tree containing ORPHAN:

        item 151 key (ORPHAN ORPHAN_ITEM 150415) itemoff 4904 itemsize 0
                  orphan item
        item 152 key (ORPHAN ORPHAN_ITEM 150416) itemoff 4904 itemsize 0
                  orphan item
        item 153 key (ORPHAN ORPHAN_ITEM 175228) itemoff 4904 itemsize 0
                  orphan item
        item 154 key (ORPHAN ORPHAN_ITEM 175229) itemoff 4904 itemsize 0
                  orphan item

Given the later talk about orphans, I don't think this is an orphan; the disks
are new, and I know the checksum error issues came from running with bad RAM,
which has since been replaced.

The output referring to 515:

        item 49 key (515 INODE_ITEM 0) itemoff 10851 itemsize 160
                inode generation 132 transid 132 size 262144 nbytes 4194304
                block group 0 mode 100600 links 1 uid 0 gid 0
                rdev 0 flags 0x1b
        item 50 key (515 EXTENT_DATA 0) itemoff 10798 itemsize 53
                extent data disk byte 603428720640 nr 262144
                extent data offset 0 nr 262144 ram 262144
                extent compression 0
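
As an aside, the "flags 0x1b" field in the INODE_ITEM above can be decoded
bit by bit. This is only a sketch: the bit names are the BTRFS_INODE_* flags
as I understand them from the kernel's btrfs_tree.h, so check them against
your own headers, and the function name is made up.

```sh
#!/bin/sh
# Decode a btrfs inode "flags" value into BTRFS_INODE_* names.
# The list is ordered by bit position, starting at bit 0 (assumed to match
# the kernel header; verify before relying on it).
decode_inode_flags() {
    flags=$(($1))
    names=
    bit=0
    for name in NODATASUM NODATACOW READONLY NOCOMPRESS PREALLOC SYNC \
                IMMUTABLE APPEND NODUMP NOATIME DIRSYNC COMPRESS; do
        if [ $(( (flags >> bit) & 1 )) -eq 1 ]; then
            names="${names:+$names }$name"
        fi
        bit=$((bit + 1))
    done
    echo "$names"
}

decode_inode_flags 0x1b   # prints: NODATASUM NODATACOW NOCOMPRESS PREALLOC
```

NODATASUM means the inode's data is written without data checksums.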

Incidentally, btrfs-debug-tree produced about 1.5G of text output; I'm not sure
whether that's normal for a 6TB volume with ~4TB used, or what.

-- 
Tom Hunt

* Re: Chicken-egg: uncorrectable checksum error prevents RAID1 rebalancing
  2016-01-25  3:21               ` Tom Hunt
@ 2016-01-25  5:58                 ` Chris Murphy
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Murphy @ 2016-01-25  5:58 UTC (permalink / raw)
  To: Tom Hunt; +Cc: Chris Murphy, Btrfs BTRFS

On Sun, Jan 24, 2016 at 8:21 PM, Tom Hunt <tomdicksonhunt@gmail.com> wrote:
> The only output from btrfs-debug-tree containing ORPHAN:
>
>         item 151 key (ORPHAN ORPHAN_ITEM 150415) itemoff 4904 itemsize 0
>                   orphan item
>         item 152 key (ORPHAN ORPHAN_ITEM 150416) itemoff 4904 itemsize 0
>                   orphan item
>         item 153 key (ORPHAN ORPHAN_ITEM 175228) itemoff 4904 itemsize 0
>                   orphan item
>         item 154 key (ORPHAN ORPHAN_ITEM 175229) itemoff 4904 itemsize 0
>                   orphan item
>
> Given the later talk about orphans, I don't think this is an orphan; the disks
> are new, and I know the checksum error issues came from running with bad RAM,
> which has since been replaced.
>
> The output referring to 515:
>
>         item 49 key (515 INODE_ITEM 0) itemoff 10851 itemsize 160
>                 inode generation 132 transid 132 size 262144 nbytes 4194304
>                 block group 0 mode 100600 links 1 uid 0 gid 0
>                 rdev 0 flags 0x1b
>         item 50 key (515 EXTENT_DATA 0) itemoff 10798 itemsize 53
>                 extent data disk byte 603428720640 nr 262144
>                 extent data offset 0 nr 262144 ram 262144
>                 extent compression 0

I'm kinda out of ideas. A full balance is at least an online process;
maybe it gets rid of the phantom missing inode reference. It'd be
interesting to see what 'btrfs check' shows. If you decide to try
--init-csum-tree, I'd first refresh a backup, and make current ro
snapshots of things you'd send/receive, should it become necessary to
create a new file system.

>
> Incidentally, btrfs-debug-tree produced about 1.5G of text output; I'm not sure
> whether that's normal for a 6TB volume with ~4TB used, or what.

Yeah, a very loose estimate is that btrfs-debug-tree output written to
a file is about 1/4 of the metadata size reported by 'fi usage/df',
although it could be much smaller if you have a lot of small files with
inline data extents.



-- 
Chris Murphy
