On Fri, 2016-06-03 at 15:50 -0400, Austin S Hemmelgarn wrote:
> There's no point in trying to do higher parity levels if we can't get
> regular parity working correctly.  Given the current state of things,
> it might be better to break even and just rewrite the whole parity
> raid thing from scratch, but I doubt that anybody is willing to do
> that.
Well... as I've said, things are pretty worrying. Obviously I cannot
really judge, since I'm not into btrfs' development... maybe there's a
lack of manpower?
Since btrfs seems to be a very important part (i.e. next-gen fs),
wouldn't it be possible to either get some additional funding from the
Linux Foundation, or for some of the core developers to make an open
call for funding from companies?
Having some additional people, perhaps working full-time on it, may be
a big help.

As for the RAID... given how much time/effort has been spent on 5/6 by
now, it really seems that one should have considered multi-parity from
the beginning.
It kinda feels like either this whole instability phase will start all
over again with multi-parity, or it will simply never happen.

> > - Serious show-stoppers and security deficiencies like the UUID
> >   collision corruptions/attacks that have been extensively
> >   discussed earlier, are still open
> The UUID issue is not a BTRFS specific one, it just happens to be
> easier to cause issues with it on BTRFS
Uhm, this has been discussed extensively before, as I've said...
AFAICS btrfs is the only system we have that can possibly cause data
corruption or even a security breach via UUID collisions.
I'm not aware that other filesystems, or LVM, are affected; these just
continue to use those devices already "online"... and I think LVM
refuses to activate VGs if conflicting UUIDs are found.

> There is no way to solve it sanely given the requirement that
> userspace not be broken.
No, this is not true.
Back when this was discussed, I and others described how it
could/should be done, respectively how userspace/kernel should behave;
in short:
- continue using those devices that are already active
- refuse to (auto)assemble by UUID if there are conflicts,
  or require the devices to be specified explicitly (with some
  --override-yes-i-know-what-i-do option or so)
- in case of assembling/rebuilding/similar... never do this
  automatically
(A rough sketch of this policy follows further below.)
I think there were some more corner cases; I basically had them all
discussed in the thread back then (search for "attacking btrfs
filesystems via UUID collisions?" and IIRC some differently titled
parent or child threads).

>   Properly fixing this would likely make us more dependent
> on hardware configuration than even mounting by device name.
Sure, if there are colliding UUIDs and one still wants to mount (by
using some --override-yes-i-know-what-i-do option), it would need to
be done by specifying the device name...
But where's the problem? This would anyway only happen if someone
either attacks or someone made a clone, and it's far better to refuse
automatic assembly in cases where accidental corruption can happen or
where attacks may be possible, requiring the user/admin to manually
take action, than to have corruption or a security breach.

Imagine the simple case: degraded RAID1 on a PC. If btrfs did some
auto-rebuild based on UUID, then an attacker who knows that would just
need to plug in a USB disk with a fitting UUID... and would easily get
a copy of everything on disk: gpg keys, ssh keys, etc.
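Coming back to the assembly policy from above: to make it concrete,
here's a rough sketch of the decision logic I have in mind, in Python
pseudocode. To be clear, this is NOT actual btrfs code; every name and
parameter here is made up for illustration (expected_num_devices would
come from the superblock's device count):

# Hypothetical sketch of the proposed assembly policy -- not actual
# btrfs code; all names are made up for illustration.

def assemble(fs_uuid, expected_num_devices, scanned_devices,
             active_devices, override_devices=None):
    """Decide which devices may form the fs with fs_uuid.

    expected_num_devices: device count as recorded in the superblock
    scanned_devices:      device paths found carrying this fs UUID
    active_devices:       device paths already in use by a mounted fs
    override_devices:     explicit list given by the admin, e.g. via
                          a --override-yes-i-know-what-i-do option
    """
    # 1. Devices already active simply stay in use; a newly appearing
    #    duplicate must never displace or join them automatically.
    if active_devices:
        return active_devices

    # 2. An explicit device list bypasses UUID-based discovery
    #    entirely; the admin takes responsibility.
    if override_devices:
        return override_devices

    # 3. More devices claim this UUID than the fs is supposed to
    #    have -> conflict: refuse to (auto)assemble.
    if len(scanned_devices) > expected_num_devices:
        raise RuntimeError(
            "UUID %s: %d devices found, %d expected -- refusing to "
            "(auto)assemble; specify devices explicitly to override"
            % (fs_uuid, len(scanned_devices), expected_num_devices))

    # 4. No conflict: normal assembly.  Rebuild/replace of a missing
    #    device would still never be started automatically.
    return scanned_devices

The key design point is simply that a UUID conflict degrades into "do
nothing and ask the admin" rather than into "pick one device and hope
for the best".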
> > - a number of important core features not fully working in many
> >   situations (e.g. the issues with defrag not being ref-link
> >   aware,... and I vaguely remember similar things with
> >   compression).
> OK, how then should defrag handle reflinks?  Preserving them
> prevents it from being able to completely defragment data.
Didn't that even work in the past, with just some performance issues?

> > - OTOH, defrag seems to be vital for important use cases (VM
> >   images, DBs,... everything where large files are internally
> >   re-written randomly).
> >   Sure there is nodatacow, but with that one effectively
> >   completely loses one of the core features/promises of btrfs
> >   (integrity by checksumming)... and as I've shown in an earlier
> >   large discussion, none of the typical use cases for nodatacow
> >   has any high-level checksumming, and even if, it's not used per
> >   default, or doesn't give the same benefits as it would on the
> >   fs level, like using it for RAID recovery).
> The argument of nodatacow being viable for anything is a pretty
> significant secondary discussion that is itself entirely orthogonal
> to the point you appear to be trying to make here.
Well, the point here was:
- many people (including myself) like btrfs and its
  (promised/future/current) features
- it's intended as a general purpose fs
- this includes the case of having such file/IO patterns as e.g. for
  VM images or DBs
- this is currently not really doable without losing one of the
  promises (integrity)
So the point I'm trying to make: people probably don't care so much
whether their VM image/etc. is CoWed or not; snapshots/etc. still
work with that,... but they may very likely care if the integrity
feature is lost.
So IMHO, nodatacow + checksumming deserves to be amongst the top
priorities.

> > - still no real RAID 1
> No, you mean still no higher order replication.  I know I'm being
> stubborn about this, but RAID-1 is officially defined in the
> standards as 2-way replication.
I think I remember that you've claimed that last time already, and as
I've said back then:
- what counts is probably the common understanding of the term, which
  is: N disks RAID1 = N disks mirrored
- if there is something like an "official definition", it's probably
  the original paper that introduced RAID:
  http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf
  PDF page 11, respectively content page 9, describes RAID1 as:
  "This is the most expensive option since *all* disks are
  duplicated..."

> The only extant systems that support higher levels of replication
> and call it RAID-1 are entirely based on MD RAID and its poor
> choice of naming.
Not true either; show me any single hardware RAID controller that
does RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
faculty, and all controllers we have handle RAID1 in the sense of
"all disks mirrored".

> > - no end-user/admin grade management/analysis tools that tell non-
> >   experts about the state/health of their fs, and whether things
> >   like balance etc. pp. are necessary
> I don't see anyone forthcoming with such tools either.  As far as
> basic monitoring, it's trivial to do with simple scripts from tools
> like monit or nagios.
AFAIU, even that isn't really possible right now, is it?
Take RAID again,... there is no place where you can see whether the
RAID state is "optimal", or does that exist in the meantime?
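To be fair, the per-device error counters at least are scriptable
today via "btrfs device stats". A minimal nagios-style check could
look like the sketch below (assuming a mounted fs at /mnt; exit codes
follow the usual nagios convention), but note that this only covers
error counters and says nothing about whether a RAID is actually
complete:

#!/usr/bin/env python3
# Minimal nagios-style check of btrfs per-device error counters.
# Sketch only: assumes the fs is mounted and `btrfs device stats`
# is available; it does NOT detect a degraded/incomplete RAID.

import subprocess
import sys

MOUNTPOINT = "/mnt"  # adjust to the fs to be checked

def main():
    try:
        out = subprocess.run(
            ["btrfs", "device", "stats", MOUNTPOINT],
            capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as e:
        print("UNKNOWN: btrfs device stats failed: %s" % e)
        return 3  # nagios: UNKNOWN

    # Output lines look like: "[/dev/sda].corruption_errs  0"
    nonzero = []
    for line in out.splitlines():
        counter, _, value = line.rpartition(" ")
        if value.strip().isdigit() and int(value) > 0:
            nonzero.append("%s=%s" % (counter.strip(), value.strip()))

    if nonzero:
        print("CRITICAL: " + ", ".join(nonzero))
        return 2  # nagios: CRITICAL
    print("OK: all btrfs device error counters are zero")
    return 0

if __name__ == "__main__":
    sys.exit(main())

But that's exactly my point: this only tells you about I/O and
checksum errors the kernel happened to count, not about the overall
health/state of the fs.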
Last time, people were advised to look at the kernel logs, but this
is no proper way to check for the state... logging may simply be
deactivated, or you may have an offline fs for which the logs have
been lost because they were on another disk.
Not to talk about the inability to properly determine how often btrfs
encountered errors and "silently" corrected them, e.g. some
statistics about a device that can be used to decide whether it's
dying.
I think these things should be stored in the fs (and additionally
also on the respective device), where they can also be extracted when
no /var/log is present or when forensics are done.

>   As far as complex things like determining whether a fs needs
> balanced, that's really non-trivial to figure out.  Even with a
> person looking at it, it's still not easy to know whether or not a
> balance will actually help.
Well, I wouldn't call myself a btrfs expert, but from time to time
I've been a bit "more active" on the list. Even I know about these
strange cases (sometimes tricks), like many empty data/meta block
groups that may or may not get cleaned up, and may result in trouble.
How should the normal user/admin be able to cope with such things if
there are no good tools?
It starts with simple things like:
- adding a further disk to a RAID
  => there should be a tool which tells you: dude, some files are not
     yet "rebuilt" (duplicated),... do a balance or whatever.

> > - the still problematic documentation situation
> Not trying to rationalize this, but go take a look at a majority of
> other projects; most of them that aren't backed by some huge
> corporation throwing insane amounts of money at them have at best
> mediocre end-user documentation.  The fact that more effort is
> being put into development than documentation is generally a good
> thing, especially for something that is not yet feature complete
> like BTRFS.
Uhm... yes and no...
The lack of documentation (i.e. admin/end-user-grade documentation)
also means that people have less understanding of the system, less
trust, less knowledge of what they can expect/do with it (will Ctrl-C
on btrfs check work? what if I shut down during a balance? does it
break then? etc. pp.), and less will to play with it.
Further,... if btrfs were ever to reach the state of being "feature
complete" (and if that never happens, I don't mean because of slow
development, but rather because most other filesystems show that
development goes on "forever"),... there would be *so much* to do in
documentation that it's unlikely to ever happen.

Cheers,
Chris.