* Potential rebalance bug plus some questions
From: jon @ 2014-03-29 23:25 UTC
  To: linux-btrfs

Hi all,

First off, I've got a couple of questions that I posed over on the 
fedoraforum:
http://www.forums.fedoraforum.org/showthread.php?t=298142

"I'm in the process of building a btrfs storage server (mostly for 
evaluation) and I'm trying to understand the COW system. As I understand 
it no data is over written when file X is changed ot file Y is created, 
but what happens when you get to the end of your disk?
Say you write files X1, X2, ... Xn which fills up your disk. You then 
delete X1 through Xn-1, does the disk space actually free up? How does 
this affect the 30 second snapshot mechanism and all the roll back stuff?

Second, the raid functionality works at the filesystem block level 
rather than the device block level. Ok cool, so "raid 1" is creating two 
copies of every block and sticking each copy on a different device 
instead of block-mirroring over multiple devices. So you can have a 
"raid 1" on 3, 5, or n disks. If I understand that correctly then you 
should be able to lose a single disk out of a raid 1 and still have all 
your data, whereas losing two disks may kill off data. Is that right? Is 
there a good rundown on "raid" levels in btrfs somewhere?"

If anyone could field those I would be very thankful. Second, I've got a 
CentOS 6 box with the current EPEL kernel and btrfs-progs (3.12) on 
which I'm playing with the raid1 setup. Using four disks, I created an 
array:
mkfs.btrfs -d raid1 -m raid1 /dev/sd[b-e]
mounted it via UUID, and rebooted. At this point all was well.
Next I simulated a disk failure by pulling the power on disk sdb, and I 
was still able to get at my data. Great.
Plugged sdb back in and it came up as /dev/sdg, ok whatever. Next I did 
a rebalance of the array, which is what I *think* killed it. The 
rebalance went on and I saw many I/O errors, but I dismissed them as 
they were all about sdb.
After the rebalance I removed /dev/sdb from the pool, added /dev/sdg, 
and rebooted.
On the reboot the pool failed to mount at all. dmesg showed something 
like "btrfs open_ctree failure" (sorry, don't have access to the box atm).

So tl;dr I think there may be an issue with the balance command when a 
disk is offline.

Jon


* Re: Potential rebalance bug plus some questions
From: Duncan @ 2014-03-30  9:04 UTC
  To: linux-btrfs

jon posted on Sat, 29 Mar 2014 13:25:29 -1000 as excerpted:

> Hi all,
> 
> First off, I've got a couple of questions that I posed over on the
> fedoraforum: http://www.forums.fedoraforum.org/showthread.php?t=298142
> 
> "I'm in the process of building a btrfs storage server (mostly for
> evaluation) and I'm trying to understand the COW system. As I understand
> it no data is over written when file X is changed ot file Y is created,
> but what happens when you get to the end of your disk?
> Say you write files X1, X2, ... Xn which fills up your disk. You then
> delete X1 through Xn-1, does the disk space actually free up?

Well, yes and no.  A barebones answer is that btrfs actually allocates 
space in two stages, but presently frees only one of them automatically 
-- freeing the other requires a rebalance.

Putting a bit more flesh on those bones: a new and unused filesystem is 
mostly unallocated free space.  As files are added, btrfs allocates room 
for them a chunk at a time, on demand.  As long as there is room, data 
chunks are 1 GiB in size while metadata chunks are 256 MiB (1/4 GiB).  
However, metadata defaults to dup mode (two copies of all metadata are 
written), so metadata chunks are allocated in pairs, two quarter-GiB 
chunks or half a GiB at a time, while data defaults to single mode, a 
single 1 GiB chunk at a time.  Btrfs then writes files into those chunks 
until they are full, at which point it allocates additional chunks of 
whichever type it has run out of.
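
You can watch this two-stage allocation on any mounted btrfs.  For 
example (the output here is illustrative only, from memory of the 
3.12-era tools; your numbers and modes will differ):

  # btrfs filesystem df /mnt
  Data, single: total=3.00GiB, used=2.41GiB
  System, DUP: total=8.00MiB, used=4.00KiB
  Metadata, DUP: total=512.00MiB, used=191.64MiB

total= is the space allocated to chunks of that type, used= is what's 
actually used within those chunks, and the gap between the two is what 
a balance could hand back to the unallocated pool.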

The filesystem is said to be "full" when all previously unallocated space 
is allocated to data or metadata chunks, *AND* one *OR* the other type 
has used up all its allocated space and needs to allocate more, but 
can't, as it's all allocated already.  (FWIW, there's also a very limited 
bit of space, normally a few MiB, allocated as system chunks, but this 
allocation typically doesn't grow much; it's almost all data and metadata 
chunks.  I'm not sure what size system chunks are, but typically they 
total rather less than a single metadata chunk, that is, less than 256 
MiB.)  It's worth noting that normal df (that is, the df command, not 
btrfs filesystem df) will most often still report quite a bit of space 
left, but it's all of the /other/ type.

Absent snapshots, when files are deleted, the space their data and 
metadata took is freed back to their respective data and metadata 
chunks.  That space can then be reused AS THE SAME TYPE, DATA OR 
METADATA, but because the chunks remain allocated, currently the freed 
space cannot be AUTOMATICALLY switched to the other type.  As it happens, 
most of the space used by most files, and thus returned to the chunk for 
reuse on deletion, is data space -- individual files don't normally take 
a lot of metadata space, tho a bunch of files together do take some.  
Thus, deletions tend to free more data space than metadata, and over 
time, normal usage patterns tend to accumulate a lot of mostly empty data 
chunk space, with relatively little accumulation of empty metadata chunk 
space.  As a result, once all filesystem space is allocated to either 
data or metadata chunks and there's none left unallocated, most of the 
time people run out of metadata space first: lots of data space is still 
free, but it's all tied up in data chunk allocations, with no unallocated 
space left from which to allocate further metadata chunks when they are 
needed.

At this point it's worth noting that due to copy-on-write, even DELETING 
files requires SOME free metadata space, and btrfs does reserve some 
metadata space for that sort of thing.  Since metadata chunks are 
allocated and written in pairs, once you get under 512 MiB of free 
metadata space you're writing into the last pair, and thus actually very 
close to running out entirely, if there's no additional unallocated space 
from which to allocate more metadata chunks.

IOW, if you have less than about 512 MiB of free metadata reported and no 
unallocated space left, you're effectively out of space!

To solve that problem, you (re)balance using the btrfs balance command.  
This rewrites allocated chunks, freeing their unused space back to the 
unallocated pool in the process, after which it can once again be 
allocated on demand to either data or metadata chunks as needed.
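
A sketch of the commands involved (the -dusage filter has existed since 
roughly kernel 3.3, if I recall correctly; the threshold is just an 
example):

  # full rebalance: rewrites every chunk, slow on a big filesystem
  btrfs balance start /mnt

  # cheaper: only rewrite data chunks that are less than 5% used
  btrfs balance start -dusage=5 /mnt

The filtered form is usually all you need to reclaim mostly-empty data 
chunks, and it completes far faster than a full balance.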

Thus the (current) situation outlined in the barebones answer above: 
deleting files returns the space they took to the data or metadata chunks 
they were using, but reclaiming the space from those chunks to the 
unallocated pool, so it can be used as the OTHER type if needed, requires 
a rebalance.

Now to wrap up a couple loose ends.

1) Btrfs has a shared/mixed data/metadata chunk mode.  For relatively 
small filesystems, under 1 GiB in size, mkfs.btrfs enables it 
automatically, and it has an option (--mixed) to force it on larger 
filesystems as well.  This must be set at mkfs.btrfs time -- it cannot be 
changed later.  Like standard metadata chunks, but in this case with data 
sharing them as well, these chunks are normally 256 MiB in size (smaller 
if there's not enough space left for a full-sized allocation, thus 
allowing full usage) and are by default duplicated -- two chunks 
allocated at a time, with (meta)data duplicated to both.  Shared mode 
does sacrifice some performance, however, which is why it's only the 
default on filesystems under 1 GiB.  Nevertheless, many users find that 
shared mode actually works better for them on filesystems of several GiB, 
and it's often recommended on filesystems up to 16 or 32 GiB.  General 
consensus is, however, that as filesystem size nears and passes 64 GiB, 
the better performance of separate data and metadata makes separation the 
better choice.
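
If you want to try it, it's a single mkfs-time switch (a sketch; the 
device name is a placeholder):

  mkfs.btrfs --mixed /dev/sdX

-M is the short form, and on filesystems under 1 GiB it's implied.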

Due to the default duplication, this shared mode is the only way to 
actually store duplicated data on a single-device btrfs.  Ordinarily data 
chunks can only be allocated single or in one of the raid modes, so 
duplicating data as well requires two devices and raid; only metadata can 
ordinarily be dup mode on a single-device btrfs.  But shared mode allows 
treating data as metadata, thus allowing dup mode for data as well.

Duplication does mean you can only fit about half of what you might 
otherwise fit on that filesystem, but it also means there's a second copy 
of the data (not just metadata) for use with btrfs' data-integrity 
checksumming and scrubbing features, in case one copy gets corrupted 
somehow.  That's actually one of the big reasons I'm using btrfs here; 
altho most of my btrfs filesystems are multi-device in raid1 mode for 
both data and metadata, I am taking advantage of shared mode on a couple 
of smaller single-device filesystems.

2) On a single-device btrfs, data defaults to single mode, while metadata 
(and mixed) defaults to dup (except on SSDs, which default to single for 
metadata/mixed as well).  You can of course specify single mode for 
metadata/mixed if you like, or dup mode on an SSD where the default would 
be single.  That's normally set at mkfs.btrfs time, but it's also 
possible to convert later using balance with some of its available 
options.
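
For example (a sketch using the balance convert filter; device and 
mountpoint are placeholders):

  # choose dup metadata explicitly at mkfs time
  mkfs.btrfs -m dup /dev/sdX

  # or convert the metadata of an existing filesystem later
  btrfs balance start -mconvert=dup /mnt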

On a multi-device btrfs, data still defaults to single, while metadata 
defaults to raid1 mode: two copies of metadata as with dup, but ensuring 
they're on separate devices, so the loss of one device (and the copy on 
it) still leaves the other copy available.

> How does this affect the 30-second snapshot mechanism and all the
> roll-back stuff?

First, it's not *THE* 30-second snapshot mechanism.  Snapshots can be 
taken whenever you wish.  Btrfs builds in the snapshotting mechanism, but 
not the timing policy.  There are scripts available that automate the 
snapshotting process, taking one a minute or one an hour or one a day or 
whatever -- and apparently, on whatever you're looking at, one every 30 
seconds -- but that's not btrfs, that's whatever snapshotting script you 
or your distro has chosen to use and configured for 30-second snapshots.
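
Underneath, it's just a read-only subvolume snapshot that anything, cron 
included, can take whenever you like.  A minimal sketch (paths are 
examples; the target directory must be on the same btrfs):

  mkdir -p /mnt/snapshots
  btrfs subvolume snapshot -r /mnt /mnt/snapshots/$(date +%Y%m%d-%H%M%S)

The timing policy lives entirely in whatever wraps that command.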

Meanwhile, snapshots would have been another loose end to wrap up above, 
but you asked the questions specifically, so I'll deal with them here.

Background: As you've read, btrfs is in general a copy-on-write (COW) 
based filesystem.  That means as files are changed (well, file blocks, 
4096 bytes aka 4 KiB in size on x86 and amd64, and I /think/ on ARM as 
well, but not always on other archs), the new version isn't written over 
top of the old one, but to a different location (filesystem block), with 
the file's metadata updated accordingly to point to the new location for 
that file block instead of the old one.  The update is atomic, so either 
the new copy or the old exists, never bits of old and new mixed -- that 
atomicity actually being one of the main benefits of COW.

Snapshots: Once you have a working COW-based filesystem, snapshots are 
reasonably simple to implement, since the COW mechanisms are already 
doing most of the work for you.  The concept is simple: changes are 
already written to a different location, with the metadata normally 
updated to point to the new location and mark the old one free for 
reuse.  A snapshot simply stores a copy of all the metadata as it exists 
at that point in time.  When a new version of a file block is then 
written, the old one is only actually freed if no snapshot's metadata 
still points at the old location as part of the file as it was when that 
snapshot was taken.

Which answers your snapshot-specific question: if a snapshot still 
points at a file block as part of the file as it was when that snapshot 
was taken, that block cannot be freed when the file is changed and an 
updated block is written elsewhere.  Only once all snapshots pointing at 
that file block are deleted can the file block itself be marked free 
once again.

So if you're taking 30-second snapshots (and assuming the files aren't 
being changed at a faster rate than that), basically, no file blocks will 
ever be freed on file change or delete unless/until you delete all the 
snapshots referring to the old file block(s).
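
You can watch this happen.  A sketch (paths are examples; note that the 
actual freeing happens in the background after the subvolume delete, so 
it can take a little while to show up in the df numbers):

  btrfs subvolume snapshot -r /mnt /mnt/snap1
  rm /mnt/bigfile                      # space is NOT freed yet...
  btrfs subvolume delete /mnt/snap1    # ...now the blocks can be freed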

Typically, the same automated snapshotting scripts that take per-minute 
or per-hour or whatever snapshots also provide a configurable mechanism 
for thinning them down, for example:

- from 30 seconds to 1 minute (deleting every other snapshot) after an 
  hour;
- from 1 minute to 5 minutes (deleting four of five) after six hours;
- from 5 minutes to half an hour (deleting five of six) after a day;
- from half an hour to an hour (deleting every other once again) after a 
  second day;
- from an hour to 6 hours (deleting five of six) after a week;
- from 4 a day to daily (deleting three of four) after 4 weeks (28 days);
- from daily to weekly (deleting six of seven) after a quarter (13 
  weeks);
- with the snapshots transferred to permanent and perhaps off-site 
  backup, and thus entirely deletable, after perhaps 5 quarters (a year 
  and a quarter, giving an extra quarter's overlap beyond a year).

Using a thinning system such as this, intermediate changes would be 
finally deleted, and the blocks tracking them freed, once all the 
snapshots containing them were deleted, while the gradually thinned-out 
longer-term snapshot copies would remain around for, in the example 
above, 15 months.  Only after that final 15-month deletion could the 
filesystem reclaim the blocks still referenced only by those oldest 
snapshots.

> Second, the raid functionality works at the filesystem block level
> rather than the device block level. Ok cool, so "raid 1" is creating two
> copies of every block and sticking each copy on a different device
> instead of block-mirroring over multiple devices. So you can have a
> "raid 1" on 3, 5, or n disks. If I understand that correctly then you
> should be able to lose a single disk out of a raid 1 and still have all
> your data, whereas losing two disks may kill off data. Is that right? Is
> there a good rundown on "raid" levels in btrfs somewhere?"

You understand correctly.  FWIW, there's an N-way-mirroring (where N>2) 
feature on the roadmap, for people like me who really appreciate btrfs' 
data-integrity features but really REALLY want that third or fourth or 
whatever copy, just in case.  It has been a while in coming, tho, as 
it's penciled in to depend on some of the raid5/6 implementation code, 
and while there's a sort-of-working raid5/6 implementation since 3.10 
(?) or so, as of 3.14 the raid5/6 device-loss recovery and scrubbing 
code isn't yet fully complete, so it could be some time before 
N-way-mirroring is ready.

Raid-level-rundown?

Maturity: Single-device single and dup modes were the first implemented 
and are now basically stable, apart from the general btrfs bug-fixing 
still going on (mostly in features such as send/receive, snapshot-aware 
defrag, and quota groups, which are still not entirely bug-free; 
snapshot-aware defrag is actually disabled ATM, pending a rewrite, as 
the previous implementation didn't scale well at all).  Multi-device 
single and raid0/1/10 modes were implemented soon after and are also 
close to stable.  Raid5/6 modes have a working runtime implementation, 
but lack critical recovery code as well as working raid5/6 scrub 
(attempting a scrub does no damage, but returns a lot of errors, since 
scrub doesn't understand that mode yet and is interpreting what it sees 
incorrectly).  N-way-mirroring aka true raid1 is next up, but could be a 
while.  There's also talk of a more generic stripe/mirror/parity 
configuration, but I've not seen enough discussion on that to reasonably 
relay anything.

Device requirements: Raid0 and raid1 modes require two devices minimum 
to function properly.  Raid1 is paired writes, while raid0 allocates and 
stripes across all available devices.  To prevent complications from 
dropping below the minimum number of devices, however, raid1 really 
needs three devices, all with unallocated space available, in order to 
stay raid1 when a device drops out.  Raid10 is four devices minimum; 
again, bump that by one, to five minimum, for device-drop-out tolerance.  
Raid5/6 are three and four devices minimum respectively, as one might 
expect; I'm not sure whether their implementation needs a device bump to 
4/5 devices to maintain functionality or not, but since the recovery and 
scrub code isn't complete, consider them effectively a really slow raid0 
at this point in terms of reliability -- but already configured so that 
the upgrade to raid5/6, whenever that code is fully implemented and 
tested, will be automatic and "free", since btrfs is effectively 
calculating and writing the parity already; it simply can't yet properly 
recover or scrub it.
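
In practice, then, for a device-loss-tolerant raid1 you'd start with 
three devices (a sketch; device names and mountpoint are placeholders):

  mkfs.btrfs -d raid1 -m raid1 /dev/sdX /dev/sdY /dev/sdZ

  # or grow an existing filesystem and convert/spread it to raid1
  btrfs device add /dev/sdZ /mnt
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt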

> Second, I've got a CentOS 6 box with the current EPEL kernel and
> btrfs-progs (3.12) on which I'm playing with the raid1 setup.

I'm not sure what the EPEL kernel version is or its btrfs support 
status, but on this list anyway, btrfs is still considered under heavy 
development, and at least until 3.13, if you were not running the latest 
mainstream stable kernel series or newer (the development kernel or 
btrfs-next), you were considered to be running an old kernel with 
known-fixed bugs, and upgrading to something current was highly 
recommended.

With 3.13, kconfig's btrfs option wording was toned down from a dire 
warning to something a bit less dire, and effectively, single-device and 
multi-device raid0/1/10 are considered semi-stable from there, with 
bugfixes backported to stable kernels from 3.13 forward.  There's effort 
to backport fixes to earlier stable series too, but for them the kconfig 
btrfs option warning was still very strongly worded, so there are no 
guarantees... you take what you get.

Meanwhile, at least as of btrfs-progs 3.12 (the current latest, tho the 
version number is synced to kernel releases and there's a 3.14 planned), 
mkfs.btrfs still carries a strong recommendation to use a current kernel 
as well.

So I'd strongly recommend at least 3.13 or newer stable series going 
forward, and preferably latest stable or even development kernel, tho 
from 3.13 forward, that's at least somewhat more up to you than it has 
been.

> Using four disks, I created an array
> mkfs.btrfs -d raid1 -m raid1 /dev/sd[b-e]

[...]

> Next I did a rebalance of the array [with the missing device] which is
> what I *think* killed it.

> After the rebalance I removed /dev/sdb from the pool, added /dev/sdg and
> rebooted.

> On the reboot the pool failed to mount at all. dmesg showed something
> like "btrfs open_ctree failure" (sorry, don't have access to the box
> atm).

> So tl;dr I think there may be an issue with the balance command when a
> disk is offline.

Standing alone, the btrfs "open_ctree failed" mount error is 
unfortunately rather generic.  Btrfs uses trees for everything, 
including the space cache, and the severity of that error depends 
greatly on which of those trees it was, as reflected by the surrounding 
dmesg context -- a bad space cache is easily enough corrected with the 
clear_cache mount option, but the same generic error can also mean btrfs 
didn't find the main root tree with everything under it, so context is 
everything!
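
If it does turn out to be just the space cache, the fix is trivial (the 
option only needs to be given once; the cache is then rebuilt):

  mount -o clear_cache /dev/sdX /mnt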

Meanwhile, there are various possibilities for recovery, including 
btrfs-find-root and btrfs restore, to roll back to an earlier tree root 
node (btrfs keeps a list of several) if necessary.  But worth noting: 
while btrfsck aka btrfs check is by default read-only and thus won't do 
any harm, do NOT use it with the --repair option except as a last 
resort, either as instructed by a dev or when you've given up and the 
next step is a new mkfs.  While it can be used to repair certain types 
of damage, there are others it doesn't understand, where an attempted 
repair will instead damage the filesystem further, killing any chance of 
using other tools to at least retrieve some of the files, even if the 
filesystem is otherwise too far gone to restore to usable.
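
As a sketch of a reasonable escalation order (device and paths are 
placeholders):

  # first try the built-in fallback to an older tree root, read-only
  mount -o recovery,ro /dev/sdX /mnt

  # copy files out to another disk without mounting at all
  btrfs restore /dev/sdX /elsewhere/rescue/

  # a read-only check is safe; --repair is the LAST resort
  btrfs check /dev/sdX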

That said... yes, balance with a device missing isn't a good thing to 
do.  Ideally you btrfs device add, if necessary, to bring the number of 
devices up to the mode minimum (two devices for raid1), then btrfs 
device delete missing, THEN btrfs balance if necessary.
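
So the cleanly handled version of your re-lettered-disk scenario would 
look something like this (a sketch, with your device names substituted):

  mount -o degraded /dev/sdc /mnt     # mount via any surviving member
  btrfs device add /dev/sdg /mnt      # bring the device count back up
  btrfs device delete missing /mnt    # drop the dead member, re-replicate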

Oh, and when mounting with a (possibly) missing device, use the degraded 
mount option.  In fact, it's quite possible that alone would have worked 
fine for you, tho you'd still have needed the btrfs device delete 
missing afterward to finish restoring full redundancy.


And one final thing:  If you haven't yet, take some time to read over the 
btrfs wiki at https://btrfs.wiki.kernel.org .  Among other things, that 
would have covered the degraded and clear_cache mount options, various 
recovery options, some stuff about raid modes, snapshots, btrfs space 
issues, etc.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


