* Re: general thoughts and questions + general and RAID5/6 stability?
@ 2014-09-19 20:50 William Hanson
  2014-09-20  9:32 ` Duncan
  0 siblings, 1 reply; 13+ messages in thread
From: William Hanson @ 2014-09-19 20:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: calestyo

Hey guys...

I was just crawling through the wiki and this list's archive to find
answers to some questions.
Many of them match those which Christoph asked here some time ago,
though it seems no answers came up at all.

Isn't it possible to answer them, at least one by one? I'd believe that
most of these questions and their answers would be of common interest,
and having them properly answered should be a benefit for all potential
btrfs users.

Regards,
William.


On Sun, 2014-08-31 at 06:02 +0200, Christoph Anton Mitterer wrote:
> Hey.
>
>
> For some time now I have been considering using btrfs at a larger
> scale, basically in two scenarios:
>
> a) As the backend for data pools handled by dcache (dcache.org), where
> we run a Tier-2 in the higher PiB range for the LHC Computing Grid...
> For now that would be rather "boring" use of btrfs (i.e. not really
> using any of its advanced features) and also RAID functionality would
> still be provided by hardware (at least with the current hardware
> generations we have in use).
>
> b) Personally, for my NAS. Here the main goal is less performance but
> rather data safety (i.e. I want something like RAID6 or better) and
> security (i.e. it will be on top of dm-crypt/LUKS) and integrity.
> Hardware-wise I'll use a UPS as well as enterprise SATA disks, from
> different vendors respectively different production lots.
> (Of course I'm aware that btrfs is experimental, and I would have
> regular backups)
>
>
>
>
> 1) Now I've followed linux-btrfs for a while and blogs like Marc's...
> and I still read about a lot of stability problems, some which sound
> quite serious.
> Sure we have a fsck now, but even in the wiki one can read statements
> like "the developers use it on their systems without major problems"...
> but also "if you do this, it could help you... or break even more".
>
> I mean I understand that there won't be a single point in time where
> Chris Mason says "now it's stable" and it would be rock solid from that
> point on... but especially since new features (e.g. things like
> subvolume quota groups, online/offline dedup, online/offline fsck) move
> (or will move) in with every new version... one has (as an end-user)
> basically no chance to determine what can be used safely and what
> tickles the devil.
>
> So one issue I have is to determine the general stability of the
> different parts.
>
>
>
>
> 2) Documentation status...
> I feel that some general and extensive documentation is missing. One
> that basically handles (and teaches) all the things which are specific
> to modern (especially CoW) filesystems.
> - General design, features and problems of CoW and btrfs
> - Special situations that arise from the CoW, e.g. that one may not be
> able to remove files once the fs is full,... or that just reading files
> could make the used space grow (via the atime)
> - General guidelines when and how to use nodatacow... i.e. telling
> people for which kinds of files this SHOULD usually be done (VM
> images)... and what this means for those files (not checksumming) and
> what the drawbacks are if it's not used (e.g. if people insist on having
> the checksumming - what happens to the performance of VM images? what
> about the wear with SSDs?)
> - the implications of things like compression and hash algos... whether
> and when this will have performance impacts (positive or negative) and
> when not.
> - the typical lifecycles and procedures when using stuff like multiple
> devices (how to replace a faulty disk) or important hints like (don't
> span a btrfs RAID over multiple partitions on the same disk)
> - especially with the different (mount)options, I mean things that
> change the way the fs works like no-hole or mixed data/meta block
> groups... people need to have some general information when to choose
> which and some real world examples of disadvantages / advantages. E.g.
> what are the disadvantages of having mixed data/meta block groups? If
> there'd be only advantages, why wouldn't it be the default?
>
> Parts of this are already scattered over LWN articles, the wiki (however
> the quality greatly "varies" there), blog posts or mailing list posts...
> much of the information there is however outdated... and suggested
> procedures (e.g. how to replace a faulty disk) differ from example to
> example.
> An admin who wants to use btrfs shouldn't be required to piece all this
> together (which is basically impossible)... there should be a manpage
> (which is kept up to date!) that describes all this.
>
> Other important things to document (which I couldn't find so far in most
> cases): What is actually guaranteed by btrfs respectively its design?
> For example:
> - If there'd be no bugs in the code... would the fs be guaranteed to be
> always consistent by its CoW design? Or are there circumstances where
> it can still run into being inconsistent?
> - Does this basically mean that even without an fs journal... my
> database is always consistent even if I have a power cut or system
> crash?
> - At which places does checksumming take place? Just data or also
> metadata? And is the checksumming chained as with ZFS, so that every
> change in blocks triggers changes in the "upper" metadata blocks up to
> the superblock(s)?
> - When are these checksums verified? Only on fsck/scrub? Or really on
> every read? All this is information needed by an admin to determine what
> the system actually guarantees or how it behaves.
> - How much data/metadata (in terms of bytes) is covered by one checksum
> value? And if that varies, what's the maximum size? I mean if there
> would be one CRC32 per file (which can be GiB large) which would be read
> every time a single byte of that file is read... this would probably be
> bad ;) ... so we should tell the user "no, we do this block- or
> extent-wise"... And since e.g. CRC32 is maybe not well suited for very
> big chunks of data, the user may want to know how much data is
> "protected" by one hash value... so that he can decide whether to switch
> to another algorithm (if one should become available).
> - Does stacking with block layers work in all cases (and in which does
> it not)? E.g. btrfs on top of loopback devices, dm-crypt, MD, lvm2? And
> also the other way round: What of these can be put on top of btrfs?
> There's the prominent case that swap files don't work on btrfs. But
> documentation in that area should also contain performance instructions,
> i.e. that while it's possible to have swap on top of btrfs via loopback,
> it's perhaps stupid with CoW... or e.g. with dmcrypt+MD there were quite
> some heavy performance impacts depending on whether dmcrypt was below or
> above MD. Now of course normally, dmcrypt will be below btrfs,... but
> there are still performance questions, e.g. how does this work with
> multiple devices? Is there one IO thread per device or one for all?
> Or questions like: Are there any stability issues when btrfs is stacked
> below/above other block layers, e.g. in case of power losses...
> especially since btrfs relies so heavily on barriers.
> Or questions like: Is btrfs stable if lower block layers modify data?
> e.g. if dmcrypt should ever support online re-encryption
> - Many things about RAID (but more on that later).
>
>
>
>
> 3) What about some nice features which many people probably want to
> see...
> Especially other compression algos (xz/lzma or lz4[hc]) and hash algos
> (xxHash... some people may even be interested in things like SHA2 or
> Keccak).
> I know some of them are planned... but is there any real estimation on
> when they come?
>
>
>
>
> 4) Are (or how are) existing btrfs filesystems kept up to date when
> btrfs evolves over time?
> What I mean here is... over time, more and more features are added to
> btrfs... this is of course not always a change in the on-disk format...
> but I always wonder a bit: If I write the same data of my existing fs
> into a freshly created one (with the same settings)... would it
> basically look the same (of course not exactly)?
> In many of the mails here on the list respectively commit logs one can
> read things which sound as if this happens quite often... that things
> (that affect how data is written on the disk) are now handled better.
> Or what if defaults change? E.g. if something new like no-holes would
> become the default for new filesystems?
> An admin cannot track all these things and understand which of them
> actually means that he should recreate the filesystem.
>
> Of course there's the balance operation... but does this really affect
> everything?
>
> So the question is basically: As btrfs evolves... how do I keep my
> existing filesystems up to date so that they are as if they were created
> as new.
>
>
>
>
> 5) btrfs management [G]UIs are needed
> Not sure whether this should go into existing file managers (like
> nemo or konqueror) or something separate... but I definitely think that
> the btrfs community will need to provide some kind of powerful
> management [G]UI.
> Such a manager is IMHO crucial for anything that behaves like a storage
> management system.
> What should it be able to do?
> a) Searching for btrfs specific properties, e.g.
> - files compressed with a given algo
> - files for which the compression ratio is <,>,= n%
> - files which are nodatacow
> - files for which integrity data is stored with a given hash algo
> - files with a given redundancy level (e.g. DUP or RAID1 or RAID6 or
> DUPn if that should ever come)
> - files which should have a given redundancy level, but whose actual
> level is different (e.g. due to a degraded state, or for which more
> block copies than desired are still available)
> - files which are defragmented at n%
>
> Of course all these conditions should be combinable, and one should have
> further conditions like m/c/a-times or like the subvolumes/snapshots
> that should be searched.
>
> b) File lists in such a manager should display many details like
> compression ratio, algos (compression, hash), number of fragments,
> whether blocks of that file are referenced by other files, etc. pp.
>
> c) Of course it should be easy to change all the properties from above
> for a file (well at least if that's possible in btrfs).
> Like when I want to have some files, or dirs/subdirs, recompressed with
> another algo, or uncompressed.
> Or triggering online defragmentation for all files of a given
> fragmentation level.
> Or maybe I want to set a higher redundancy level for files which I
> consider extremely precious to myself (not sure if it's planned to have
> different redundancy levels per file)
>
> d) Such a manager should perhaps also go through the logs and tell
> things like:
> - when was the last complete balance
> - when was the last complete scrub
> - for which files integrity check problems happened during read/scrub...
> how many of these could be corrected via other block copies?
>
> e) Maybe it could give even more low-level information, like showing how
> a file is distributed over the devices, e.g. how the blocks are located,
> or showing the location of block copies or involved block devices for
> the redundancy levels.
>
>
>
>
> 6) RAID / Redundancy Levels
> a) Just some remark, I think it's a bad idea to call these RAID in the
> btrfs terminology... since what we do is not necessarily exactly the
> same as classic RAID... this becomes most obvious with RAID1, which
> behaves not as RAID1 should (i.e. one copy per disk)... at least the
> used names should comply with MD.
>
> b) In other words... I think there should be RAID1, which equals to 1
> copy per underlying device.
> And it would be great to have a redundancy level DUPx, which is x copies
> for each block spread over the underlying devices. So if x is 6 and one
> has 3 underlying devices, each of them should have 2 copies of each
> block.
> I think the DUPx level is quite interesting to protect against single
> block failures, especially also on computers where one usually simply
> doesn't have more than one disk drive (e.g. notebooks).
>
> c) As I've noted before, I think it would be quite nice if it would be
> supported to have different redundancy levels for different files...
> e.g. less precious stuff like OS data could have DUP... more valuable
> data could have RAID6... and my most precious data could have DUP5 (i.e.
> 5 copies of each block).
> If that would ever come, one would probably need to make that property
> inheritable by directories to be really useful.
>
> d) What's the status of the multi-parity RAID (i.e. more than two parity
> blocks)? Weren't some patches for that posted a while ago?
>
> e) Most important:
> What's the status on RAID5/6? Is it still completely experimental or
> already well tested?
> Does rebuilding work? Does scrubbing work?
> I mean as far as I know, there are still important parts missing so
> that it doesn't fully work yet, right?
> When can one expect work on that to be completed?
>
> f) Again, detailed documentation should be added on how the different
> redundancy levels actually work, e.g.
> - Is there a chunk size, can it be configured and how does it affect
> reads/writes (as with MD)
> - How do parallel reads happen if multiple blocks are available? What
> e.g. if there are multiple block copies per device? Is the first one
> simply always tried for reading? Or the one with the best seek times? Or
> is this optimised with other reads?
>
> g) When a block is read (and the checksum is always verified), does that
> already work, that if verification fails, the other blocks are tried,
> respectively the block is tried to be recalculated using the parity?
> What if all that fails, will it give a read error, or will it simply
> deliver a corrupted block, as with traditional RAID?
>
> h) We also need some RAID and integrity monitoring tool.
> Doesn't matter whether this is a completely new tool or whether it can
> be integrated in something existing.
> But we need tools which inform the admin via different ways when a disk
> failed and a rebuild is necessary.
> And the same should happen when checksum verification errors happen that
> could be corrected (perhaps with a configurable threshold)... so that
> admins have the chance to notice signs of a disk that is about to fail.
>
> Of course such information is already printed to the kernel logs (well,
> I guess so)... but I don't think it's enough to let 3rd parties and
> admins write scripts/daemons which do these checks and alerting... there
> should be something which is "official" and guaranteed to catch all
> cases and simply works(TM).
>
>
>
> Cheers,
> Chris.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: general thoughts and questions + general and RAID5/6 stability?
  2014-09-19 20:50 general thoughts and questions + general and RAID5/6 stability? William Hanson
@ 2014-09-20  9:32 ` Duncan
  2014-09-22 20:51   ` Stefan G. Weichinger
  0 siblings, 1 reply; 13+ messages in thread
From: Duncan @ 2014-09-20  9:32 UTC (permalink / raw)
  To: linux-btrfs

William Hanson posted on Fri, 19 Sep 2014 16:50:05 -0400 as excerpted:

> Hey guys...
> 
> I was just crawling through the wiki and this list's archive to find
> answers about some questions. Actually many of them matching those
> which Christoph has asked here some time ago, though it seems no
> answers came up at all.

Seems his post slipped thru the cracks, perhaps because it was too much 
at once for people to try to chew on.  Let's see if second time around 
works better...

> 
> On Sun, 2014-08-31 at 06:02 +0200, Christoph Anton Mitterer wrote:
> 
>>
>> For some time now I have been considering using btrfs at a larger
>> scale, basically in two scenarios:
> 
>>
>> a) As the backend for data pools handled by dcache (dcache.org), where
>> we run a Tier-2 in the higher PiB range for the LHC Computing Grid...
> 
>> For now that would be rather "boring" use of btrfs (i.e. not really
>> using any of its advanced features) and also RAID functionality would
>> still be provided by hardware (at least with the current hardware
>> generations we have in use).

While that scale is simply out of my league, here's what I'd say if I 
were asked my own opinion.

I'd say btrfs isn't ready for that, basically for one reason.

Btrfs has stabilized quite a bit in the last year, and the scary warnings 
have now come off, but it's still not fully stable, and keeping backups 
of any data you value is still very strongly recommended.

The scenario above is talking high PiB scale.  Simply put, that's a 
**LOT** of data to keep backups of, or to lose all at once if you don't 
and something happens!  At that scale I'd look at something more mature, 
with a reputation for working well at that scale.  Xfs is what I'd be 
looking at.  That or possibly zfs.

People who value their data highly tend, for good reason, to be rather 
conservative when it comes to filesystems.  At that level and at the 
conservatism I'd guess it calls for, I'd say another two years, perhaps 
longer, given btrfs history and how much longer than expected every step 
has seemed to take.

>> b) Personally, for my NAS. Here the main goal is less performance but
>> rather data safety (i.e. I want something like RAID6 or better) and
>> security (i.e. it will be on top of dm-crypt/LUKS) and integrity.
>> Hardware-wise I'll use a UPS as well as enterprise SATA disks, from
>> different vendors respectively different production lots.
> 
>> (Of course I'm aware that btrfs is experimental, and I would have
>> regular backups)

[...]

>> [1] So one issue I have is to determine the general stability of the
>> different parts.

Raid5/6 are still out of the question at this point.  The operating code 
is there, but the recovery code is incomplete.  In effect, btrfs raid5/6 
must be treated as if it's slow raid0 in terms of dependability, but with 
a "free" upgrade to raid5/6 when the code is complete (assuming the array 
survives that long in its raid0 stage), as the operational code has been 
there all along and it has been creating and writing the parity, it just 
can't yet reliably restore from it if called to do so.

So if you wouldn't be comfortable with the data on raid0, that is, with 
the idea of losing it all if you lose any of it, don't put it on btrfs 
raid5/6 at this point.  The situation is actually /somewhat/ better than 
that, but that's the reliability bottom line you should be planning for, 
and if raid0 reliability isn't appropriate for your data, neither is 
btrfs raid5/6 at this point.

Btrfs raid1 and raid10 modes, OTOH, are reasonably mature and ready for 
use, basically at the same level as single-device btrfs.  Which is to say 
there's still active development and keep your backups ready as it's not 
/entirely/ stable yet, but a lot of people are using it without undue 
issues -- just keep those backups current and tested, and be prepared to 
use them if you need to.

For btrfs raid1 mode, it's worth pointing out that raid1 here means two 
copies on different devices, no matter how many devices are in the 
array.  It's always two copies; more devices simply add more total 
capacity.

Similarly with btrfs raid10, the "1/mirror" side of that 10 is always 
paired.  Stripes can be two or three or whatever width, but there's 
always only the two mirrors.
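
As a concrete illustration (my own sketch, untested here; the device
names are placeholders and the exact syntax should be checked against
your btrfs-progs manpages), creating such an array and later converting
its profiles looks roughly like:

  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc   # two copies of data and metadata
  mount /dev/sdb /mnt
  btrfs device add /dev/sdd /dev/sde /mnt          # grow the array to four devices
  btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt   # rewrite into raid10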

N-way-mirroring is on the roadmap, scheduled for introduction after 
raid5/6 is complete.  So it's coming, but given the time it has taken for 
raid5/6 and the fact that it's still not complete, reasonably reliable n-
way-mirroring could easily still be a year away or more.


Features: Most of the core btrfs features are reasonably stable but some 
don't work so well together; see my just-previous post on a different 
thread about nocow and snapshots, for instance.  (Basically, setting nocow 
ends up being nearly useless in the face of frequent snapshots of an 
actively rewritten file.)
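
(For completeness, and as a sketch only since the path below is made up:
nocow is normally set via the C file attribute, which only takes effect
on files that are still empty, so the usual approach is to set it on the
directory so that new files inherit it.)

  mkdir -p /srv/vm-images          # illustrative directory for VM images
  chattr +C /srv/vm-images         # new files created inside get nodatacow
  lsattr -d /srv/vm-images         # verify the 'C' attribute is set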

Qgroups/quotas are an exception.  They've recently rewritten it as the 
old approach simply wasn't working, and while it /should/ be more stable 
now, it's still very new (like 3.17 new), and I'd give it at least two 
more kernel cycles before I'd consider it usable... if no further major 
problems show up during that time.

And snapshot-aware-defrag has been disabled for now due to scalability 
issues, so defrag only considers the current snapshot it's actually 
pointed into to defrag, triggering data duplication and using up space 
faster than would otherwise be expected.

You'd need to check on the status of non-core btrfs features like the 
various dedup applications, snapper style scheduled snapshotting, etc, 
individually, as they're developed separately and more or less 
independently.

>> 2) Documentation status...
> 
>> I feel that some general and extensive documentation is missing.

This is gradually getting better.  The manpages are generally kept 
current, and their practical usability without reference to other sources 
such as the wiki has improved DRAMATICALLY in the last six months or so.

It still helps to have some good background in general principles such as 
COW, as they're not always explained, either on the wiki or in the 
manpages, but it's coming, and really, if there's one area I'd point out 
as having made MARKED strides toward a stable btrfs over the last six 
months, it WOULD be the documentation, as six months ago it simply wasn't 
stable ready, full-stop, but now I'd characterize much of the 
documentation as reasonably close to stable-ready, altho there are still 
some holes.

IOW, while before documentation had fallen behind the progress of the 
rest of btrfs toward stable, in the last several months it has caught up 
and in general can be characterized as at about the same stability/
maturity status as btrfs itself, that is, not yet fully stable, but 
getting to where that goal is at least visible, now.

But there's still no replacement for some good time investment in 
actually reading a few weeks of the list and most of the user-pages in 
the wiki, before you actually dive into btrfs on your own systems.  Your 
choices and usage of btrfs will be the better for it, and it could well 
save you needless data loss or at least needless grief and stress.  But 
of course that's the way it is with most reasonably advanced systems.


>> Other important things to document (which I couldn't find so far in
>> most cases): What is actually guaranteed by btrfs respectively its
>> design?
> 
>> For example:
> 
>> - If there'd be no bugs in the code,.. would the fs be guaranteed to
>> be always consistent by its CoW design? Or are there circumstances
>> where it can still run into being inconsistent?

In theory, yes, absent (software) bugs, btrfs would always be 
consistent.  In reality, hardware has bugs too, and then there's simply 
cheap hardware that even absent bugs doesn't make the guarantees of more 
expensive hardware.

Consumer-level storage hardware doesn't tend to have battery-backed write-
caches, for instance, and some of it is known to lie and say the write-
cache has been flushed to permanent storage when it hasn't been.

But absent (both hardware and software) bugs, in theory...


>> - Does this basically mean that even without an fs journal... my
>> database is always consistent even if I have a power cut or system
>> crash?

That's the idea of tree-based copy-on-write, yes.

> 
>> - At which places does checksumming take place? Just data or also meta
>> data? And is the checksumming chained as with ZFS, so that every
>> change in blocks, triggers changes in the "upper" metadata blocks up
>> to the superblock(s)?

FWIW, at this level of question, people should really be reading the 
various whitepapers and articles discussing and explaining the 
technology, as linked on the wiki.

But both data and metadata are checksummed, and yes, it's chained, all 
the way up the tree.

>> - When are these checksums verified? Only on fsck/scrub? Or really on
>> every read? All this is information needed by an admin to determine
>> what the system actually guarantees or how it behaves.

Checksums are verified per-read.  If verification fails and there's a 
second copy available (btrfs multi-device raid1 or raid10 modes and dup-
mode metadata or mixed-bg on single-device), it is verified and 
substituted (both in RAM and rewritten in place of the bad copy) if it 
checks out.  If no valid copy is available, IO error.

Scrub is simply the method used to do this systematically across the 
entire filesystem, instead of waiting until a particular block is read 
and its checksum verified.
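
In practice (a sketch; check your btrfs-progs version for the exact
output), that systematic pass and the resulting error accounting look
like:

  btrfs scrub start /mnt      # read everything and verify every checksum
  btrfs scrub status /mnt     # progress plus corrected/uncorrectable error counts
  btrfs device stats /mnt     # per-device counters: read/write/flush errors, corruption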


>> - How much data/metadata (in terms of bytes) is covered by one
>> checksum value? And if that varies, what's the maximum size?

Checksums are normally per block or node.  For data, that's a standard 
page-size block (4 KiB on x86 and amd64, and also on arm, I believe, but 
for example, I believe it's 64 KiB on sparc).  Metadata node/leaf sizes 
can be set at mkfs.btrfs time, but now default to 16 KiB, altho that too 
was 4 KiB in the past.  
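
For illustration (the device name is a placeholder), the metadata node
size is a mkfs-time choice:

  mkfs.btrfs -n 16384 /dev/sdb    # 16 KiB nodes, the newer default; 4 KiB was the old one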

>> - Does stacking with block layers work in all cases (and in which does
>> it not)? E.g. btrfs on top of loopback devices, dm-crypt, MD, lvm2?

Stacking btrfs on top of any block device variant should "just work", 
altho it should be noted that some of them might not pass flushes down 
and thus not be as resilient as others.  And of course performance can be 
more or less affected as well.

>> And also the other way round: What of these can be put on top of btrfs?

Btrfs is a filesystem.  So it'll take files.  Via a loopback mounted 
file, you can make it a block device, which will of course take 
filesystems or other block devices stacked.  That's not saying 
performance will be good thru all those layers, and reliability can be 
affected too, but it's possible.

>> There's the prominent case, that swap files don't work on btrfs. But
>> documentation in that area should also contain performance
>> instructions

Wait a minute.  Where's my consulting fee?  Come on, this is getting 
ridiculous.  That's where individual case research and deployment testing 
comes in.

>> Is there one IO thread per device or one for all?

It should be noted that btrfs has /not/ yet been optimized for 
parallelization.  The code still generally serializes writing each copy 
of a raid1 pair, for instance, and raid1 reads are assigned using a 
fairly dumb but reasonable initial-implementation odd/even-PID-based 
round-robin.  (So if your use-case happens to involve a bunch of 
otherwise parallelized reads from all-even PIDs, for instance, they'll 
all hit the same copy of the raid1, leaving the other one idle...)

This stuff will eventually be optimized, but getting raid5/6 and N-way-
mirroring done first, so they know the implementation there that they're 
optimizing for, makes sense.


>> 3) What about some nice features which many people probably want to
>> see...
> 
>> Especially other compression algos (xz/lzma or lz4[hc]) and hash algos
>> (xxHash... some people may even be interested in things like SHA2 or
>> Keccak).
> 
>> I know some of them are planned... but is there any real estimation on
>> when they come?

If there were estimations they'd be way off.  The history of btrfs is 
that features repeatedly take far longer to implement than originally 
thought.

What roadmap there is, is on the wiki.

We know that raid5/6 mode is still in current development and n-way-
mirroring is scheduled after that.  But raid5/6 has been a kernel cycle 
or two out for over a year now.  Then when they got it in, it was only 
the operational stuff, the recovery stuff, scrub, etc, still isn't 
complete.

And there's the quota rework that is just done or still ongoing (I'm not 
sure which as I'm not particularly interested in that feature), and the 
snapshot-aware-defrag that was introduced in 3.9 but didn't scale so was 
disabled again, that is still to be reenabled after the quota rework and 
snapshot scaling stuff is done, and one dev has been putting a *LOT* of 
work into improving the manpages, and that intersects with the work on 
mount option consistency they're doing, and..., and...

Various devs are the leads on various features and so several are 
developing in parallel, but of course there's the bug hunting, and review 
and testing of each other's work they do, and... so they're not able to 
simply work on their assigned feature.

>> 4) Are (or how are) existing btrfs filesystems kept up to date when btrfs
>> evolves over time?
> 
>> What I mean here is... over time, more and more features are added to
>> btrfs... this is of course not always a change in the on disk format...

The disk format has been slowly changing, but keeping compatibility for 
the existing format and filesystems since I believe 2.6.32.

What I do as part of my regular backup regime, is every few kernel cycles 
I wipe the (first level) backup and do a fresh mkfs.btrfs, activating new 
optional features as I believe appropriate.  Then I boot to the new 
backup and run a bit to test it, then wipe the normal working copy and do 
a fresh mkfs.btrfs on it, again with the new optional features enabled 
that I want.
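
Roughly sketched (not my literal commands -- device names, mount points
and the rsync-based restore are illustrative only), one iteration of
that cycle looks like:

  umount /backup
  mkfs.btrfs -f -O extref,skinny-metadata /dev/sdb1   # fresh fs with the new optional features
  mount /dev/sdb1 /backup
  rsync -aHAX --delete /mnt/working/ /backup/         # repopulate, then boot/test from it
  # ...and once the new backup has proven itself, repeat the
  # mkfs/restore on the normal working filesystem.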

All that keeping in mind that I have a second level backup (and for some 
things a third level), that's on reiserfs (which I used before and which 
since the switch to data=ordered by default has been extremely dependable 
for me, even thru hardware issues like bad memory, failing mobo that 
would reset the sata connection, etc) not btrfs, in case there's a 
problem with btrfs that hits both the working copy and primary backup.

New kernels can mount old filesystems without problems (barring the 
occasional bug, and it's treated as a bug and fixed), but it isn't always 
possible to mount new filesystems on older kernels.

However, given the rate of change and the number of fixed bugs, the 
recommendation is to stay current with the kernel in any case.  Recently 
there was a bug that affected 3.15 and 3.16 (fixed in 3.16.2 and in 3.17-
rc2), that didn't affect 3.14 series.  During the trace and fix of that 
bug, the recommendation was to use 3.14 but nothing earlier, as earlier 
kernels had known bugs that have since been fixed.  Now that this bug has 
been fixed, the recommendation is again the latest stable series, thus 
3.16.x currently, if not the latest development series, 3.17-rcX 
currently, or even btrfs 
integration, which currently are the patches that will be submitted for 
3.18.

Given that, if you're using earlier kernels you're using known-buggy 
kernels anyway.  So keep current with the kernel (and to a lesser extent 
userspace, btrfs-progs-3.16 is current, and the previous 3.14.2 is 
acceptable, 3.12 if you /must/ drag your feet), and you won't have to 
worry about it.

Of course that's a mark of btrfs stability as well.  The recommendation 
to keep to current should relax as btrfs stabilizes.  But 3.14 is a long-
term-support stable kernel series and the recommendation to be running at 
least that is a good one.  Perhaps it'll remain the earliest recommended 
stable kernel series for some time now that btrfs is stabilizing.

>> Of course there's the balance operation... but does this really affect
>> everything?

Not everything.  Some things are mkfs.btrfs-time only.

>> So the question is basically: As btrfs evolves... how do I keep my
>> existing filesystems up to date so that they are as if they were
>> created as new.

Balance is reasonable on an existing filesystem.  However, as I said, I 
myself do, and would also recommend, taking advantage of those backups 
you should be making/testing, to boot from them and do a mkfs on the 
working filesystem every few kernel cycles, to take advantage of the new 
features and keep everything working as well as possible considering the 
filesystem is after all, while no longer officially experimental, 
certainly not yet entirely stable, either.


>> 5) btrfs management [G]UIs are needed

Separate project.  It'll happen as that's the way FLOSS works, but it's 
not a worry of the core btrfs project at this point.

As such, I'm not going to worry about it either, which means I can delete 
a nice big chunk without replying to any of it further than I just have...

>> 6) RAID / Redundancy Levels
> 
>> a) Just some remark, I think it's a bad idea to call these RAID in the
>> btrfs terminology... since what we do is not necessarily exactly the
>> same as classic RAID... this becomes most obvious with RAID1, which
>> behaves not as RAID1 should (i.e. one copy per disk)... at least the
>> used names should comply with MD.

While I personally would have called it something else, say pair-
mirroring, by the original raid definitions going back to the original 
paper outlining them back in the day (which someone posted a link to at 
one point and I actually read, at least that part), two-way-mirroring 
regardless of the number of devices actually DOES qualify as RAID-1.

mdraid's implementation is different and does N-way-mirroring across all 
devices for RAID-1, but that's simply its implementation, not a 
requirement for RAID-1 either in the original paper or as generally 
accepted today.

That said, you will note that in btrfs, the various levels are called 
raid0, raid1, raid10, raid56, in *non-caps*, as opposed to the 
traditional ALL-CAPS RAID-1 notation.  One of the reasons given for that 
is that these btrfs raidN "modes" don't necessarily exactly correspond to 
the traditional RAID-N levels at the technical level, and the non-caps 
raidN notation was seen as an acceptable method of noting "RAID-like", 
behavior, that wasn't technically precisely RAID.

N-way-mirroring is coming.  It's just not implemented yet.


>> c) As I've noted before, I think it would be quite nice if it would be
>> supported to have different redundancy levels for different files...

That's actually on the roadmap too, tho rather farther down the line.  
The btrfs subvolume framework is already setup to allow per-subvolume 
raid-levels, etc, at some point, altho it's not yet implemented, and 
there's already per-subvolume and per-file properties and extended 
attributes, including a per-file compression attribute.  After they 
extend btrfs to handle per-subvolume redundancy levels, it should be a 
much smaller step to simply make that the default, and have per-file 
properties/attributes available for it as well, just as the per-file 
compression attribute is already there.
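
For instance (a sketch only, assuming btrfs-progs new enough to have the
property subcommand, and an invented path):

  btrfs property set /mnt/data/logs compression lzo   # per-file/directory compression hint
  btrfs property get /mnt/data/logs compression
  chattr +c /mnt/data/logs                            # older generic 'compress' file attribute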

But I'd put this probably 3-5 years out... and given btrfs history with 
implementations repeatedly taking longer than expected, it could easily 
be 5-10 years out...

>> d) What's the status of the multi-parity RAID (i.e. more than [two]
>> parity blocks)? Weren't some patches for that posted a while ago?

Some proof-of-concept patches were indeed posted.  And it's on the 
roadmap, but again, 3-5 years out.  Tho it's likely there will be a 
general kernel solution before then, usable by mdraid, btrfs, etc, and if/
when that happens, it should make adapting it for btrfs much simpler.  
OTOH, that also means there will be much broader debate about getting a 
suitable general purpose solution, but it also means not just btrfs folks 
will be involved.  At this point then, it's not a btrfs problem, but 
waiting on that general purpose kernel solution, which btrfs can then 
adapt at its leisure.


>> e) Most important:
> 
>> What's the status on RAID5/6? Is it still completely experimental or
>> already well tested?

Covered above.  Consider it raid0 reliability at this point and you won't 
be caught out.  Additionally, Marc MERLIN has put quite a bit of testing 
into it and has writeups on the wiki and linking to his blog.  That's 
more detail than I have, for sure.

>> f) Again, detailed documentation should be added on how the different
>> redundancy levels actually work, e.g.
> 
>> - Is there a chunk size, can it be configured

There's a semi-major rework potentially planned to either coincide with 
the N-way-mirroring introduction, or possibly for after that, but with 
the N-way-mirroring written with it in mind.

Existing raid0/1/10/5/6 would remain implemented as they are, possibly 
with a few more options, and likely with the existing names being aliases 
for new ones fitting the new naming framework.  The new naming framework, 
meanwhile, would include redundancy/striping/parity/hotspares (possibly) 
all in the same overall framework.  Hugo Mills is the guy with the 
details on that, tho I think it's mentioned in the ideas section on the 
wiki as well.

With that in mind, too much documentation detail on the existing 
implementation would be premature as much of it would need rewritten for 
the new framework.

Never-the-less, there's reasonable detail out there if you look.  The 
wiki covers more than I'll write here, for sure.

>> g) When a block is read (and the checksum is always verified), does
>> that already work, that if verification fails, the other blocks are
>> tried, respectively the block is tried to be recalculated using the
>> parity?

Other copies of the block (raid1,10,dup) are checked, as mentioned above.

I'm not sure how raid56 handles it with parity, but since that code 
remains incomplete, it hasn't been a big factor.  Presumably either Marc 
MERLIN or one of the devs will fill in the details once it's considered 
complete and usable.

>> What if all that fails, will it give a read error, or will it simply
>> deliver a corrupted block, as with traditional RAID?

Read error, as mentioned above.

>> h) We also need some RAID and integrity monitoring tool.

"Patience, grasshopper." All in time...

And that too could be a third-party tool, at least at first, altho while 
separate enough to be developed third-party, it's core enough presumably 
one would eventually be selected and shipped as part of btrfs-progs.

I'd actually guess it /will/ be a third party tool at first.  That's pure 
userspace after all, with little beyond what's already available in the 
logs and in sysfs needed, and the core btrfs devs already have their 
hands full with other projects, so a third-party implementation will 
almost certainly appear before they get to it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: general thoughts and questions + general and RAID5/6 stability?
  2014-09-20  9:32 ` Duncan
@ 2014-09-22 20:51   ` Stefan G. Weichinger
  2014-09-23 12:08     ` Austin S Hemmelgarn
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan G. Weichinger @ 2014-09-22 20:51 UTC (permalink / raw)
  To: linux-btrfs

Am 20.09.2014 um 11:32 schrieb Duncan:

> What I do as part of my regular backup regime, is every few kernel cycles 
> I wipe the (first level) backup and do a fresh mkfs.btrfs, activating new 
> optional features as I believe appropriate.  Then I boot to the new 
> backup and run a bit to test it, then wipe the normal working copy and do 
> a fresh mkfs.btrfs on it, again with the new optional features enabled 
> that I want.

Is re-creating btrfs-filesystems *recommended* in any way?

Does that actually make a difference in the fs-structure?

So far I assumed it was enough to keep the kernel up2date, use current
(stable) btrfs-progs and run some scrub every week or so (not to mention
backups .. if it ain't backed up, it was/isn't important).

Stefan





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: general thoughts and questions + general and RAID5/6 stability?
  2014-09-22 20:51   ` Stefan G. Weichinger
@ 2014-09-23 12:08     ` Austin S Hemmelgarn
  2014-09-23 13:06       ` Stefan G. Weichinger
  0 siblings, 1 reply; 13+ messages in thread
From: Austin S Hemmelgarn @ 2014-09-23 12:08 UTC (permalink / raw)
  To: lists, linux-btrfs

On 2014-09-22 16:51, Stefan G. Weichinger wrote:
> Am 20.09.2014 um 11:32 schrieb Duncan:
>
>> What I do as part of my regular backup regime, is every few kernel cycles
>> I wipe the (first level) backup and do a fresh mkfs.btrfs, activating new
>> optional features as I believe appropriate.  Then I boot to the new
>> backup and run a bit to test it, then wipe the normal working copy and do
>> a fresh mkfs.btrfs on it, again with the new optional features enabled
>> that I want.
>
> Is re-creating btrfs-filesystems *recommended* in any way?
>
> Does that actually make a difference in the fs-structure?
>
I would recommend it; there are some newer features that you can only 
set at mkfs time.  Quite often, when a new feature is implemented, it is 
some time before it can be enabled online, and even then that doesn't 
convert anything until the data is rewritten.
> So far I assumed it was enough to keep the kernel up2date, use current
> (stable) btrfs-progs and run some scrub every week or so (not to mention
> backups .. if it ain't backed up, it was/isn't important).
>
> Stefan
>
>




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: general thoughts and questions + general and RAID5/6 stability?
  2014-09-23 12:08     ` Austin S Hemmelgarn
@ 2014-09-23 13:06       ` Stefan G. Weichinger
  2014-09-23 13:38         ` Austin S Hemmelgarn
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan G. Weichinger @ 2014-09-23 13:06 UTC (permalink / raw)
  To: linux-btrfs

Am 23.09.2014 um 14:08 schrieb Austin S Hemmelgarn:
> On 2014-09-22 16:51, Stefan G. Weichinger wrote:
>> Is re-creating btrfs-filesystems *recommended* in any way?
>>
>> Does that actually make a difference in the fs-structure?
>>
> I would recommend it, there are some newer features that you can only
> set at mkfs time.  Quite often, when a new feature is implemented, it is
> some time before things are such that it can be enabled online, and even
> then that doesn't convert anything until it is rewritten.

What features for example?

I created my main btrfs a few months ago and would like to avoid
recreating it as this would mean restoring my root-fs on my main
workstation.

Although I would do it if it is "worth it" ;-)

I assume I could read some kind of version number out of the superblock
or so?

btrfs-show-super ?

S




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: general thoughts and questions + general and RAID5/6 stability?
  2014-09-23 13:06       ` Stefan G. Weichinger
@ 2014-09-23 13:38         ` Austin S Hemmelgarn
  2014-09-23 13:51           ` Stefan G. Weichinger
                             ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Austin S Hemmelgarn @ 2014-09-23 13:38 UTC (permalink / raw)
  To: lists, linux-btrfs

On 2014-09-23 09:06, Stefan G. Weichinger wrote:
> Am 23.09.2014 um 14:08 schrieb Austin S Hemmelgarn:
>> On 2014-09-22 16:51, Stefan G. Weichinger wrote:
>>> Is re-creating btrfs-filesystems *recommended* in any way?
>>>
>>> Does that actually make a difference in the fs-structure?
>>>
>> I would recommend it, there are some newer features that you can only
>> set at mkfs time.  Quite often, when a new feature is implemented, it is
>> some time before things are such that it can be enabled online, and even
>> then that doesn't convert anything until it is rewritten.
>
> What features for example?
Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives the 
following list of features:
mixed-bg		- mixed data and metadata block groups
extref			- increased hard-link limit per file to 65536
raid56			- raid56 extended format
skinny-metadata		- reduced size metadata extent refs
no-holes		- no explicit hole extents for files

mixed-bg is something that you generally wouldn't want to change after mkfs.
extref can be enabled online, and the filesystem metadata gets updated 
as-needed, and doesn't provide any real performance improvement (but is 
needed for some mail servers that have HUGE mail-queues)
I don't know anything about the raid56 option, but there isn't any way 
to change it after mkfs.
skinny-metadata can be changed online, and the format gets updated on 
rewrite of each metadata block.  This one does provide a performance 
improvement (stat() in particular runs noticeably faster).  You should 
probably enable this if it isn't already enabled, even if you don't 
recreate your filesystem.
no-holes cannot currently be changed online, and is a very recent 
addition (post v3.14 btrfs-progs I believe) that provides improved 
performance for sparse files (which is particularly useful if you are 
doing things with fixed size virtual machine disk images).

It's this last one that prompted me personally to recreate my 
filesystems most recently, as I use sparse files to save space as much 
as possible.
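
Roughly, from memory (double-check the mkfs.btrfs and btrfstune manpages
for your versions; the device name is a placeholder), setting these up
looks like:

  mkfs.btrfs -O list-all                                   # features this btrfs-progs build knows
  mkfs.btrfs -O extref,skinny-metadata,no-holes /dev/sdb   # enable them at creation time
  btrfstune -x /dev/sdb                                    # enable skinny-metadata on an existing fs
  btrfstune -r /dev/sdb                                    # enable extref on an existing fs
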
>
> I created my main btrfs a few months ago and would like to avoid
> recreating it as this would mean restoring my root-fs on my main
> workstation.
>
> Although I would do it if it is "worth it" ;-)
>
> I assume I could read some kind of version number out of the superblock
> or so?
>
> btrfs-show-super ?
>
AFAIK there isn't really any 'version number' that has any meaning in 
the superblock (except for telling the kernel that it uses the stable 
disk layout), however, there are flag bits that you can look for 
(compat_flags, compat_ro_flags, and incompat_flags).  I'm not 100% 
certain what each bit means, but on my system with an only one-month-old 
BTRFS filesystem, with extref, skinny-metadata, and no-holes turned on, 
I have compat_flags: 0x0, compat_ro_flags: 0x0, and incompat_flags: 0x16b.

The other potentially significant thing is that the default 
nodesize/leafsize has changed recently from 4096 to 16384, as that gives 
somewhat better performance for most use cases.




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: general thoughts and questions + general and RAID5/6 stability?
  2014-09-23 13:38         ` Austin S Hemmelgarn
@ 2014-09-23 13:51           ` Stefan G. Weichinger
  2014-09-23 14:24           ` Tobias Holst
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Stefan G. Weichinger @ 2014-09-23 13:51 UTC (permalink / raw)
  To: linux-btrfs

Am 23.09.2014 um 15:38 schrieb Austin S Hemmelgarn:
> On 2014-09-23 09:06, Stefan G. Weichinger wrote:
>> What features for example?
> Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives the
> following list of features:
> mixed-bg        - mixed data and metadata block groups
> extref            - increased hard-link limit per file to 65536
> raid56            - raid56 extended format
> skinny-metadata        - reduced size metadata extent refs
> no-holes        - no explicit hole extents for files
> 
> mixed-bg is something that you generally wouldn't want to change after
> mkfs.
> extref can be enabled online, and the filesystem metadata gets updated
> as-needed, and doesn't provide any real performance improvement (but is
> needed for some mail servers that have HUGE mail-queues)

ok, not needed here

> I don't know anything about the raid56 option, but there isn't any way
> to change it after mkfs.

not needed in my systems.

> skinny-metadata can be changed online, and the format gets updated on
> rewrite of each metadata block.  This one does provide a performance
> improvement (stat() in particular runs noticeably faster).  You should
> probably enable this if it isn't already enabled, even if you don't
> recreate your filesystem.

So this is done via btrfstune, right?

I will give that a try; for my rootfs it doesn't work right now as
it is obviously mounted (so I need a live CD, right?).

> no-holes cannot currently be changed online, and is a very recent
> addition (post v3.14 btrfs-progs I believe) that provides improved
> performance for sparse files (which is particularly useful if you are
> doing things with fixed size virtual machine disk images).

Yes, I have some of those!

> AFAIK there isn't really any 'version number' that has any meaning in
> the superblock (except for telling the kernel that it uses the stable
> disk layout), however, there are flag bits that you can look for
> (compat_flags, compat_ro_flags, and incompat_flags).  I'm not 100%
> certain what each bit means, but on my system with a only 1 month old
> BTRFS filesystem, with extref, skinny-metadata, and no-holes turned on,
> i have compat_flags: 0x0, compat_ro_flags: 0x0, and incompat_flags: 0x16b.
> 
> The other potentially significant thing is that the default
> nodesize/leafsize has changed recently from 4096 to 16384, as that gives
> somewhat better performance for most use cases.

I have the 16k for both already.

Thanks for your explanations, I will dig into it as soon as I find the
time. Seems I have to backup/restore quite some stuff ;-)

Stefan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: general thoughts and questions + general and RAID5/6 stability?
  2014-09-23 13:38         ` Austin S Hemmelgarn
  2014-09-23 13:51           ` Stefan G. Weichinger
@ 2014-09-23 14:24           ` Tobias Holst
  2014-09-24  1:08             ` Qu Wenruo
       [not found]           ` <CAGwxe4i2gQXSPiBGXbUKWid3o1tmD_+YtbOj=GQ11vzGx8CuTw@mail.gmail.com>
  2014-09-25  7:15           ` Stefan G. Weichinger
  3 siblings, 1 reply; 13+ messages in thread
From: Tobias Holst @ 2014-09-23 14:24 UTC (permalink / raw)
  To: linux-btrfs

If it is unknown which of these options were used at btrfs creation
time, is it possible to check the state of these options afterwards on
a mounted or unmounted filesystem?


2014-09-23 15:38 GMT+02:00 Austin S Hemmelgarn <ahferroin7@gmail.com>:
>
> Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives the following list of features:
> mixed-bg                - mixed data and metadata block groups
> extref                  - increased hard-link limit per file to 65536
> raid56                  - raid56 extended format
> skinny-metadata         - reduced size metadata extent refs
> no-holes                - no explicit hole extents for files

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: general thoughts and questions + general and RAID5/6 stability?
       [not found]           ` <CAGwxe4i2gQXSPiBGXbUKWid3o1tmD_+YtbOj=GQ11vzGx8CuTw@mail.gmail.com>
@ 2014-09-23 14:47             ` Austin S Hemmelgarn
  2014-09-23 15:25               ` Kyle Gates
  0 siblings, 1 reply; 13+ messages in thread
From: Austin S Hemmelgarn @ 2014-09-23 14:47 UTC (permalink / raw)
  To: Tobias Holst; +Cc: lists, linux-btrfs

On 2014-09-23 10:23, Tobias Holst wrote:
> If it is unknown, which of these options have been used at btrfs
> creation time - is it possible to check the state of these options
> afterwards on a mounted or unmounted filesystem?
>
>
> 2014-09-23 15:38 GMT+02:00 Austin S Hemmelgarn <ahferroin7@gmail.com
> <mailto:ahferroin7@gmail.com>>:
>
>     Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives
>     the following list of features:
>     mixed-bg                - mixed data and metadata block groups
>     extref                  - increased hard-link limit per file to 65536
>     raid56                  - raid56 extended format
>     skinny-metadata         - reduced size metadata extent refs
>     no-holes                - no explicit hole extents for files
>
I don't think there is a specific tool for doing this, but some of them 
do show up in dmesg, for example skinny-metadata shows up as a mention 
of the FS having skinny extents.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: general thoughts and questions + general and RAID5/6 stability?
  2014-09-23 14:47             ` Austin S Hemmelgarn
@ 2014-09-23 15:25               ` Kyle Gates
  0 siblings, 0 replies; 13+ messages in thread
From: Kyle Gates @ 2014-09-23 15:25 UTC (permalink / raw)
  To: linux-btrfs


>> If it is unknown, which of these options have been used at btrfs
>> creation time - is it possible to check the state of these options
>> afterwards on a mounted or unmounted filesystem?
>>
> I don't think there is a specific tool for doing this, but some of them
> do show up in dmesg, for example skinny-metadata shows up as a mention
> of the FS having skinny extents.
>
Devs,

It may be helpful to include the device in the kernel log for skinny extents.
Currently it shows up like the following which is a little ambiguous:

[    6.050134] BTRFS info (device sde3): disk space caching is enabled
[    6.056606] BTRFS: has skinny extents
<snipped>
[    7.740986] BTRFS info (device sde3): enabling auto defrag
[    7.747151] BTRFS info (device sde3): disk space caching is enabled
<snipped>
[    7.908906] BTRFS info (device sde2): enabling auto defrag
[    7.915031] BTRFS info (device sde2): disk space caching is enabled
[    8.071033] BTRFS info (device sde4): enabling auto defrag
[    8.076715] BTRFS info (device sde4): disk space caching is enabled
[    8.082187] BTRFS: has skinny extents
[    8.513502] BTRFS info (device sde5): enabling auto defrag
[    8.518887] BTRFS info (device sde5): disk space caching is enabled
[    8.524064] BTRFS: has skinny extents
[    9.634285] BTRFS info (device sdd6): enabling auto defrag
[    9.639308] BTRFS info (device sdd6): disk space caching is enabled
[    9.644338] BTRFS: has skinny extents

Thanks,
Kyle
 		 	   		  

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: general thoughts and questions + general and RAID5/6 stability?
  2014-09-23 14:24           ` Tobias Holst
@ 2014-09-24  1:08             ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2014-09-24  1:08 UTC (permalink / raw)
  To: Tobias Holst, linux-btrfs


-------- Original Message --------
Subject: Re: general thoughts and questions + general and RAID5/6 stability?
From: Tobias Holst <tobby@tobby.eu>
To: <linux-btrfs@vger.kernel.org>
Date: 2014-09-23 22:24
> If it is unknown, which of these options have been used at btrfs
> creation time - is it possible to check the state of these options
> afterwards on a mounted or unmounted filesystem?
For a mounted fs, sysfs can be used to see the features enabled:
/sys/fs/btrfs/<UUID>/features/

For an unmounted fs (maybe not the best way), btrfs-show-super can show the 
incompat_flags in hex,
and we can check <kernel tree>/fs/btrfs/ctree.h for 
BTRFS_FEATURE_INCOMPAT_## for the bits
and calculate by hand...
(Would it be better to add human readable output for btrfs-show-super?)
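
For example, a quick sh sketch of that calculation (the bit values are
copied from the BTRFS_FEATURE_INCOMPAT_* defines in fs/btrfs/ctree.h
around 3.16 and should be re-checked against your own tree; 0x16b is
just the value quoted earlier in this thread):

  flags=0x16b   # incompat_flags value as printed by btrfs-show-super
  for entry in "0x001 MIXED_BACKREF"   "0x002 DEFAULT_SUBVOL" \
               "0x004 MIXED_GROUPS"    "0x008 COMPRESS_LZO"   \
               "0x010 COMPRESS_LZOv2"  "0x020 BIG_METADATA"   \
               "0x040 EXTENDED_IREF"   "0x080 RAID56"         \
               "0x100 SKINNY_METADATA" "0x200 NO_HOLES"
  do
      bit=${entry%% *}; name=${entry#* }
      # print every feature whose bit is set in the flags value
      [ $(( flags & bit )) -ne 0 ] && echo "$name"
  done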

Thanks,
Qu
>
>
> 2014-09-23 15:38 GMT+02:00 Austin S Hemmelgarn <ahferroin7@gmail.com>:
>> Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives the following list of features:
>> mixed-bg                - mixed data and metadata block groups
>> extref                  - increased hard-link limit per file to 65536
>> raid56                  - raid56 extended format
>> skinny-metadata         - reduced size metadata extent refs
>> no-holes                - no explicit hole extents for files
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: general thoughts and questions + general and RAID5/6 stability?
  2014-09-23 13:38         ` Austin S Hemmelgarn
                             ` (2 preceding siblings ...)
       [not found]           ` <CAGwxe4i2gQXSPiBGXbUKWid3o1tmD_+YtbOj=GQ11vzGx8CuTw@mail.gmail.com>
@ 2014-09-25  7:15           ` Stefan G. Weichinger
  3 siblings, 0 replies; 13+ messages in thread
From: Stefan G. Weichinger @ 2014-09-25  7:15 UTC (permalink / raw)
  To: linux-btrfs

Am 23.09.2014 um 15:38 schrieb Austin S Hemmelgarn:

>> What features for example?
> Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives the
> following list of features:
> mixed-bg         - mixed data and metadata block groups
> extref           - increased hard-link limit per file to 65536
> raid56           - raid56 extended format
> skinny-metadata  - reduced size metadata extent refs
> no-holes         - no explicit hole extents for files
> 
> mixed-bg is something that you generally wouldn't want to change after
> mkfs.
> extref can be enabled online, and the filesystem metadata gets updated
> as-needed; it doesn't provide any real performance improvement (but is
> needed for some mail servers that have HUGE mail-queues)
> I don't know anything about the raid56 option, but there isn't any way
> to change it after mkfs.
> skinny-metadata can be changed online, and the format gets updated on
> rewrite of each metadata block.  This one does provide a performance
> improvement (stat() in particular runs noticeably faster).  You should
> probably enable this if it isn't already enabled, even if you don't
> recreate your filesystem.
> no-holes cannot currently be changed online, and is a very recent
> addition (post v3.14 btrfs-progs I believe) that provides improved
> performance for sparse files (which is particularly useful if you are
> doing things with fixed size virtual machine disk images).

Recreating, or at least running "btrfstune -rx" on my rootfs, would mean
that I have to boot from a live medium that ships a recent btrfs-progs,
right?

sysresccd ships btrfs-progs-3.14.2 ... that should be enough, ok?

aside from that, the rootfs on my thinkpad shows these features:

# ls /sys/fs/btrfs/bec7dff9-8749-4db4-9a1b-fa844cfcc36a/features/
big_metadata  compress_lzo  extended_iref  mixed_backref

So I'm only missing skinny extents ... and "no-holes".
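
If I understood Austin correctly, enabling those two on the (unmounted)
rootfs from the live medium would roughly be the following... just a
sketch, with /dev/sdXN standing in for the real root device, and please
correct me if -r/-x are not the right switches:

  # extended inode refs (extref)
  btrfstune -r /dev/sdXN
  # skinny metadata extent refs
  btrfstune -x /dev/sdXN

(Not sure whether no-holes can be switched on that way at all with
btrfs-progs 3.14.2.)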

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* general thoughts and questions + general and RAID5/6 stability?
@ 2014-08-31  4:02 Christoph Anton Mitterer
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Anton Mitterer @ 2014-08-31  4:02 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 13845 bytes --]

Hey.


For some time now I consider to use btrfs at a larger scale, basically
in two scenarios:

a) As the backend for data pools handled by dcache (dcache.org), where
we run a Tier-2 in the higher PiB range for the LHC Computing Grid...
For now that would be rather "boring" use of btrfs (i.e. not really
using any of its advanced features) and also RAID functionality would
still be provided by hardware (at least with the current hardware
generations we have in use). 

b) Personally, for my NAS. Here the main goal is not so much performance
but rather data safety (i.e. I want something like RAID6 or better),
security (i.e. it will be on top of dm-crypt/LUKS) and integrity.
Hardware-wise I'll use a UPS as well as enterprise SATA disks, from
different vendors or at least from different production lots.
(Of course I'm aware that btrfs is experimental, and I would have
regular backups)




1) Now I've followed linux-btrfs for a while and blogs like Marc's...
and I still read about a lot of stability problems, some of which sound
quite serious.
Sure we have a fsck now, but even in the wiki one can read statements
like "the developers use it on their systems without major problems"...
but also "if you do this, it could help you... or break even more".

I mean I understand that there won't be a single point in time where
Chris Mason says "now it's stable" and it would be rock solid from that
point on... but especially since new features (e.g. things like
subvolume quota groups, online/offline dedup, online/offline fsck) move
in (or will move in) with every new version... one has (as an end-user)
basically no chance to determine what can be used safely and what
tickles the devil.

So one issue I have is to determine the general stability of the
different parts.




2) Documentation status...
I feel that some general and extensive documentation is missing. One
that basically handles (and teaches) all the things which are specific
to modern (especially CoW) filesystems.
- General design, features and problems of CoW and btrfs
- Special situations that arise from the CoW, e.g. that one may not be
able to remove files once the fs is full,... or that just reading files
could make the used space grow (via the atime)
- General guidelines when and how to use nodatacow... i.e. telling
people for which kinds of files this SHOULD usually be done (VM
images)... and what this means for those files (no checksumming) and
what the drawbacks are if it's not used (e.g. if people insist on having
the checksumming - what happens to the performance of VM images? what
about the wear with SSDs?) (see the sketch after this list)
- the implications of things like compression and hash algos... whether
and when this will have performance impacts (positive or negative) and
when not.
- the typical lifecycles and procedures when using stuff like multiple
devices (how to replace a faulty disk) or important hints (e.g. don't
span a btrfs RAID over multiple partitions on the same disk)
- especially with the different (mount) options, I mean things that
change the way the fs works, like no-holes or mixed data/meta block
groups... people need to have some general information when to choose
which and some real world examples of disadvantages / advantages. E.g.
what are the disadvantages of having mixed data/meta block groups? If
there'd be only advantages, why wouldn't it be the default?
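
(To illustrate the nodatacow point from above, a sketch of what such a
guideline could show... the directory is only an example, and the
trade-offs are exactly what needs documenting:

  mkdir -p /srv/vm-images
  chattr +C /srv/vm-images   # new files created in here start out nodatacow

i.e. set the attribute on an empty directory so that newly created VM
images inherit it.)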

Parts of this are already scattered over LWN articles, the wiki (however
the quality greatly "varies" there), blog posts or mailing list posts...
much of the information there is however outdated... and suggested
procedures (e.g. how to replace a faulty disk) differ from example to
example.
An admin who wants to use btrfs shouldn't be required to piece all this
together (which is basically impossible)... there should be a manpage
(which is kept up to date!) that describes all this.
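
(Just to illustrate how much the procedures differ: for the "replace a
faulty disk" case alone one finds at least two recipes, sketched here
from memory, so details may well be wrong:

  # newer way, using the dedicated replace ioctl
  btrfs replace start /dev/old /dev/new /mountpoint

  # older way: add a new device, then drop the failed/missing one
  btrfs device add /dev/new /mountpoint
  btrfs device delete missing /mountpoint

Exactly this kind of thing is what a maintained manpage should settle.)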

Other important things to document (which I couldn't find so far in most
cases): What is actually guaranteed by btrfs respectively its design?
For example:
- If there were no bugs in the code... would the fs be guaranteed to be
always consistent by its CoW design? Or are there circumstances where
it can still become inconsistent?
- Does this basically mean that, even without an fs journal... my
database is always consistent even if I have a power cut or system
crash?
- At which places does checksumming take place? Just data or also meta
data? And is the checksumming chained as with ZFS, so that every change
in blocks triggers changes in the "upper" metadata blocks up to the
superblock(s)?
- When are these checksums verified? Only on fsck/scrub? Or really on
every read? All this is information needed by an admin to determine what
the system actually guarantees or how it behaves.
- How much data/metadata (in terms of bytes) is covered by one checksum
value? And if that varies, what's the maximum size? I mean if there
were one CRC32 per file (which can be GiB large) which would be read
every time a single byte of that file is read... this would probably be
bad ;) ... so we should tell the user "no we do this block or extent
wise"... And since e.g. CRC32 is maybe not well suited for very big
chunks of data, the user may want to know how much data is "protected"
by one hash value... so that he can decide whether to switch to another
algorithm (if one should become available).
- Does stacking with block layers work in all cases (and in which does
it not)? E.g. btrfs on top of loopback devices, dm-crypt, MD, lvm2? And
also the other way round: What of these can be put on top of btrfs?
There's the prominent case that swap files don't work on btrfs. But
documentation in that area should also contain performance instructions,
i.e. that while it's possible to have swap on top of btrfs via loopback,
it's perhaps stupid with CoW... or e.g. with dmcrypt+MD there were quite
some heavy performance impacts depending on whether dmcrypt was below or
above MD. Now of course normally, dmcrypt will be below btrfs,... but
there are still performance questions e.g. how does this work with
multiple devices? Is there one IO thread per device or one for all?
Or questions like: Are there any stability issues when btrfs is stacked
below/above other block layers, e.g. in case of power losses...
especially since btrfs relies so heavily on barriers.
Or questions like: Is btrfs stable if lower block layers modify data?
e.g. if dmcrypt should ever support online re-encryption 
- Many things about RAID (but more on that later).




3) What about some nice features which many people probably want to
see...
Especially other compression algos (xz/lzma or lz4[hc]) and hash algos
(xxHash... some people may even be interested in things like SHA2 or
Keccak).
I know some of them are planned... but is there any realistic estimate of
when they will come?




4) Are (and how are) existing btrfs filesystems kept up to date when btrfs
evolves over time?
What I mean here is... over time, more and more features are added to
btrfs... this is of course not always a change in the on-disk format...
but I always wonder a bit: If I write the same data of my existing fs
into a freshly created one (with the same settings)... would it
basically look the same (of course not exactly)?
In many of the mails here on the list, respectively commit logs, one can
read things which sound as if this happens quite often... that things (that
affect how data is written on the disk) are now handled better.
Or what if defaults change? E.g. if something new like no-holes would
become the default for new filesystems?
An admin cannot track all these things and understand which of them
actually mean that he should recreate the filesystem.

Of course there's the balance operation... but does this really affect
everything?

So the question is basically: As btrfs evolves... how do I keep my
existing filesystems up to date so that they are as if they had been
created anew.
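
(For instance, what I would naively try today is a full rebalance,
roughly like this... just a sketch, and whether it really rewrites
everything is exactly the open question:

  btrfs balance start /mountpoint
  # or, to also convert chunk profiles while at it:
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mountpoint

As said, I don't know whether that touches every on-disk structure or
only re-allocates chunks.)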




5) btrfs management [G]UIs are needed
Not sure whether this should go into existing file managers (like
nemo or konqueror) or something separate... but I definitely think that
the btrfs community will need to provide some kind of powerful
management [G]UI.
Such a manager is IMHO crucial for anything that behaves like a storage
management system.
What should it be able to do?
a) Searching for btrfs specific properties, e.g.
- files compressed with a given algo
- files for which the compression ratio is <,>,= n%
- files which are nodatacow
- files for which integrity data is stored with a given hash algo
- files with a given redundancy level (e.g. DUP or RAID1 or RAID6 or
DUPn if that should ever come)
- files which should have a given redundancy level, but whose actual
level is different (e.g. due to a degraded state, or for which more
block copies than desired are still available)
- files which are defragmented at n%

Of course all these conditions should be combinable, and one should have
further conditions like m/c/a-times or like the subvolumes/snapshots
that should be searched.

b) File lists in such a manager should display many details like
compression ratio, algos (compression, hash), number of fragments,
whether blocks of that file are referenced by other files, etc.

c) Of course it should be easy to change all the properties from above
for files (well, at least if that's possible in btrfs).
Like when I want to have some files, or dirs/subdirs, recompressed with
another algo, or uncompressed.
Or triggering online defragmentation for all files of a given
fragmentation level.
Or maybe I want to set a higher redundancy level for files which I
consider extremely precious to myself (not sure if it's planned to have
different redundancy levels per file)

d) Such a manager should perhaps also go through the logs and tell things
like:
- when was the last complete balance
- when was the last complete scrub
- for which files integrity-check problems happened during read/scrub...
and how many of these could be corrected via other block copies?

e) Maybe it could give even more low-level information, like showing how
a file is distributed over the devices, e.g. where the blocks are located,
or showing the location of block copies or the involved block devices for
the redundancy levels.
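
(About the closest thing available today seems to be something like

  filefrag -v /path/to/some/file

which at least shows the extent layout of a file, but says nothing about
block copies or which devices they live on.)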




6) RAID / Redundancy Levels
a) Just a remark: I think it's a bad idea to call these RAID in the
btrfs terminology... since what we do is not necessarily exactly the
same as classic RAID... this becomes most obvious with RAID1, which
does not behave as RAID1 should (i.e. one copy per disk)... at least the
names used should comply with MD.

b) In other words... I think there should be RAID1, which equals one
copy per underlying device.
And it would be great to have a redundancy level DUPx, which is x copies
for each block spread over the underlying devices. So if x is 6 and one
has 3 underlying devices, each of them should have 2 copies of each
block.
I think the DUPx level is quite interesting to protect against single
block failures, especially also on computers where one usually simply
doesn't have more than one disk drive (e.g. notebooks). 

c) As I've noted before, I think it would be quite nice if it would be
supported to have different redundancy levels for different files...
e.g. less precious stuff like OS data could have DUP... more valuable
data could have RAID6... and my most precious data could have DUP5 (i.e.
5 copies of each block).
If that would ever come, one would probably need to make that property
inheritable by directories to be really useful.

d) What's the status of the multi-parity RAID (i.e. more than two parity
blocks)? Weren't some patches for that posted a while ago?

e) Most important:
What's the status on RAID5/6? Is it still completely experimental or
already well tested?
Does rebuilding work? Does scrubbing work?
I mean as far as I know, there are still important parts missing for it
to work at all, right?
When can one expect work on that to be completed?

f) Again, detailed documentation should be added on how the different
redundancy levels actually work, e.g.
- Is there a chunk size, can it be configured, and how does it affect
reads/writes (as with MD)?
- How do parallel reads happen if multiple block copies are available?
What e.g. if there are multiple block copies per device? Is the first
one simply always read? Or the one with the best seek times? Or is this
optimised together with other reads?

g) When a block is read (and the checksum is always verified), does it
already work that, if verification fails, the other copies are tried,
respectively the block is recalculated using the parity?
And if all that fails, will it give a read error, or will it simply
deliver a corrupted block, as traditional RAID would?

h) We also need some RAID and integrity monitoring tool.
Doesn't matter whether this is a completely new tool or whether it can
be integrated in something existing.
But we need tools which inform the admin, via different channels, when a
disk has failed and a rebuild is necessary.
And the same should happen when checksum verification errors occur that
could be corrected (perhaps with a configurable threshold)... so that
admins have the chance to notice signs of a disk that is about to fail.

Of course such information is already printed to the kernel logs (well,
I guess so)... but I don't think it's enough to let 3rd parties and
admins write scripts/daemons which do these checks and alerting... there
should be something which is "official" and guaranteed to catch all
cases and simply works(TM).
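
(The raw building blocks seem to exist already... if I'm not mistaken, a
daemon could periodically poll something like:

  # per-device error counters (read/write/flush/corruption/generation)
  btrfs device stats /mountpoint
  # summary of the last scrub, including corrected/uncorrectable errors
  btrfs scrub status /mountpoint

But that is precisely the scripting every admin would have to reinvent,
which is why an "official" tool would be so valuable.)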



Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-09-25  7:15 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-19 20:50 general thoughts and questions + general and RAID5/6 stability? William Hanson
2014-09-20  9:32 ` Duncan
2014-09-22 20:51   ` Stefan G. Weichinger
2014-09-23 12:08     ` Austin S Hemmelgarn
2014-09-23 13:06       ` Stefan G. Weichinger
2014-09-23 13:38         ` Austin S Hemmelgarn
2014-09-23 13:51           ` Stefan G. Weichinger
2014-09-23 14:24           ` Tobias Holst
2014-09-24  1:08             ` Qu Wenruo
     [not found]           ` <CAGwxe4i2gQXSPiBGXbUKWid3o1tmD_+YtbOj=GQ11vzGx8CuTw@mail.gmail.com>
2014-09-23 14:47             ` Austin S Hemmelgarn
2014-09-23 15:25               ` Kyle Gates
2014-09-25  7:15           ` Stefan G. Weichinger
  -- strict thread matches above, loose matches on Subject: below --
2014-08-31  4:02 Christoph Anton Mitterer
