On Fri, 2016-06-03 at 15:50 -0400, Austin S Hemmelgarn wrote:
> There's no point in trying to do higher parity levels if we can't get
> regular parity working correctly.  Given the current state of things,
> it might be better to break even and just rewrite the whole parity
> raid thing from scratch, but I doubt that anybody is willing to do
> that.
Well... as I've said, things are pretty worrying. Obviously I cannot
really judge, since I'm not into btrfs' development... maybe there's a
lack of manpower?
Since btrfs seems to be a very important part (i.e. next-gen fs),
wouldn't it be possible to either get some additional funding from the
Linux Foundation, or for some of the core developers to make an open
call for funding from companies?
Having some additional people, perhaps working full-time on it, may be
a big help.

As for the RAID... given how much time/effort has been spent on 5/6 by
now, it really seems that one should have considered multi-parity from
the beginning.
It kinda feels like either this whole instability phase will start all
over again with multi-parity, or it will simply never happen.

> > - Serious show-stoppers and security deficiencies like the UUID
> >   collision corruptions/attacks that have been extensively
> >   discussed earlier, are still open
> The UUID issue is not a BTRFS specific one, it just happens to be
> easier to cause issues with it on BTRFS
Uhm, this has been discussed extensively before, as I've said...
AFAICS btrfs is the only system we have that can possibly cause data
corruption or even a security breach via UUID collisions.
I'm not aware that other filesystems, or LVM, are affected; these just
continue to use those devices already "online"... and I think LVM
refuses to activate VGs if conflicting UUIDs are found.

> There is no way to solve it sanely given the requirement that
> userspace not be broken.
No, this is not true.
Back when this was discussed, I and others described how it
could/should be done, respectively how userspace/kernel should behave;
in short:
- continue using those devices that are already active
- refuse to (auto)assemble by UUID if there are conflicts,
  or require the devices to be specified explicitly (with some
  --override-yes-i-know-what-i-do option or so)
- in case of assembling/rebuilding/similar... never do this
  automatically
(A rough sketch of this policy follows further below.)
I think there were some more corner cases; I basically had them all
discussed in the thread back then (search for "attacking btrfs
filesystems via UUID collisions?" and IIRC some differently titled
parent or child threads).

>   Properly fixing this would likely make us more dependent
> on hardware configuration than even mounting by device name.
Sure, if there are colliding UUIDs and one still wants to mount (by
using some --override-yes-i-know-what-i-do option), it would need to
be done by specifying the device name...
But where's the problem? This would anyway only happen if someone
either attacks or someone made a clone, and it's far better to refuse
automatic assembly in cases where accidental corruption can happen or
where attacks may be possible, requiring the user/admin to manually
take action, than to have corruption or a security breach.

Imagine the simple case: degraded RAID1 on a PC. If btrfs did some
auto-rebuild based on UUID, then an attacker who knows that would just
need to plug in a USB disk with a fitting UUID... and would easily get
a copy of everything on disk: gpg keys, ssh keys, etc.
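Coming back to the assembly policy from above: to make it concrete,
here's a rough sketch of the decision logic I have in mind, in Python
pseudocode. To be clear, this is NOT actual btrfs code; every name and
parameter here is made up for illustration (expected_num_devices would
come from the superblock's device count):

# Hypothetical sketch of the proposed assembly policy -- not actual
# btrfs code; all names are made up for illustration.

def assemble(fs_uuid, expected_num_devices, scanned_devices,
             active_devices, override_devices=None):
    """Decide which devices may form the fs with fs_uuid.

    expected_num_devices: device count as recorded in the superblock
    scanned_devices:      device paths found carrying this fs UUID
    active_devices:       device paths already in use by a mounted fs
    override_devices:     explicit list given by the admin, e.g. via
                          a --override-yes-i-know-what-i-do option
    """
    # 1. Devices already active simply stay in use; a newly appearing
    #    duplicate must never displace or join them automatically.
    if active_devices:
        return active_devices

    # 2. An explicit device list bypasses UUID-based discovery
    #    entirely; the admin takes responsibility.
    if override_devices:
        return override_devices

    # 3. More devices claim this UUID than the fs is supposed to
    #    have -> conflict: refuse to (auto)assemble.
    if len(scanned_devices) > expected_num_devices:
        raise RuntimeError(
            "UUID %s: %d devices found, %d expected -- refusing to "
            "(auto)assemble; specify devices explicitly to override"
            % (fs_uuid, len(scanned_devices), expected_num_devices))

    # 4. No conflict: normal assembly.  Rebuild/replace of a missing
    #    device would still never be started automatically.
    return scanned_devices

The key design point is simply that a UUID conflict degrades into "do
nothing and ask the admin" rather than into "pick one device and hope
for the best".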
> > - a number of important core features not fully working in many
> >   situations (e.g. the issues with defrag not being ref-link
> >   aware,... and I vaguely remember similar things with
> >   compression).
> OK, how then should defrag handle reflinks?  Preserving them
> prevents it from being able to completely defragment data.
Didn't that even work in the past, with just some performance issues?

> > - OTOH, defrag seems to be vital for important use cases (VM
> >   images, DBs,... everything where large files are internally
> >   re-written randomly).
> >   Sure there is nodatacow, but with that one effectively
> >   completely loses one of the core features/promises of btrfs
> >   (integrity by checksumming)... and as I've shown in an earlier
> >   large discussion, none of the typical use cases for nodatacow
> >   has any high-level checksumming, and even if, it's not used per
> >   default, or doesn't give the same benefits as it would on the
> >   fs level, like using it for RAID recovery).
> The argument of nodatacow being viable for anything is a pretty
> significant secondary discussion that is itself entirely orthogonal
> to the point you appear to be trying to make here.
Well, the point here was:
- many people (including myself) like btrfs and its
  (promised/future/current) features
- it's intended as a general purpose fs
- this includes the case of having such file/IO patterns as e.g. for
  VM images or DBs
- this is currently not really doable without losing one of the
  promises (integrity)
So the point I'm trying to make: people probably don't care so much
whether their VM image/etc. is CoWed or not; snapshots/etc. still
work with that,... but they may very likely care if the integrity
feature is lost.
So IMHO, nodatacow + checksumming deserves to be amongst the top
priorities.

> > - still no real RAID 1
> No, you mean still no higher order replication.  I know I'm being
> stubborn about this, but RAID-1 is officially defined in the
> standards as 2-way replication.
I think I remember that you've claimed that last time already, and as
I've said back then:
- what counts is probably the common understanding of the term, which
  is: N disks RAID1 = N disks mirrored
- if there is something like an "official definition", it's probably
  the original paper that introduced RAID:
  http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf
  PDF page 11, respectively content page 9, describes RAID1 as:
  "This is the most expensive option since *all* disks are
  duplicated..."

> The only extant systems that support higher levels of replication
> and call it RAID-1 are entirely based on MD RAID and its poor
> choice of naming.
Not true either; show me any single hardware RAID controller that
does RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
faculty, and all controllers we have handle RAID1 in the sense of
"all disks mirrored".

> > - no end-user/admin grade management/analysis tools that tell non-
> >   experts about the state/health of their fs, and whether things
> >   like balance etc. pp. are necessary
> I don't see anyone forthcoming with such tools either.  As far as
> basic monitoring, it's trivial to do with simple scripts from tools
> like monit or nagios.
AFAIU, even that isn't really possible right now, is it?
Take RAID again,... there is no place where you can see whether the
RAID state is "optimal", or does that exist in the meantime?
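To be fair, the per-device error counters at least are scriptable
today via "btrfs device stats". A minimal nagios-style check could
look like the sketch below (assuming a mounted fs at /mnt; exit codes
follow the usual nagios convention), but note that this only covers
error counters and says nothing about whether a RAID is actually
complete:

#!/usr/bin/env python3
# Minimal nagios-style check of btrfs per-device error counters.
# Sketch only: assumes the fs is mounted and `btrfs device stats`
# is available; it does NOT detect a degraded/incomplete RAID.

import subprocess
import sys

MOUNTPOINT = "/mnt"  # adjust to the fs to be checked

def main():
    try:
        out = subprocess.run(
            ["btrfs", "device", "stats", MOUNTPOINT],
            capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as e:
        print("UNKNOWN: btrfs device stats failed: %s" % e)
        return 3  # nagios: UNKNOWN

    # Output lines look like: "[/dev/sda].corruption_errs  0"
    nonzero = []
    for line in out.splitlines():
        counter, _, value = line.rpartition(" ")
        if value.strip().isdigit() and int(value) > 0:
            nonzero.append("%s=%s" % (counter.strip(), value.strip()))

    if nonzero:
        print("CRITICAL: " + ", ".join(nonzero))
        return 2  # nagios: CRITICAL
    print("OK: all btrfs device error counters are zero")
    return 0

if __name__ == "__main__":
    sys.exit(main())

But that's exactly my point: this only tells you about I/O and
checksum errors the kernel happened to count, not about the overall
health/state of the fs.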
Last time, people were advised to look at the kernel logs, but this
is no proper way to check for the state... logging may simply be
deactivated, or you may have an offline fs for which the logs have
been lost because they were on another disk.
Not to talk about the inability to properly determine how often btrfs
encountered errors and "silently" corrected them, e.g. some
statistics about a device that can be used to decide whether it's
dying.
I think these things should be stored in the fs (and additionally
also on the respective device), where they can also be extracted when
no /var/log is present or when forensics are done.

>   As far as complex things like determining whether a fs needs
> balanced, that's really non-trivial to figure out.  Even with a
> person looking at it, it's still not easy to know whether or not a
> balance will actually help.
Well, I wouldn't call myself a btrfs expert, but from time to time
I've been a bit "more active" on the list. Even I know about these
strange cases (sometimes tricks), like many empty data/meta block
groups that may or may not get cleaned up, and may result in trouble.
How should the normal user/admin be able to cope with such things if
there are no good tools?
It starts with simple things like:
- adding a further disk to a RAID
  => there should be a tool which tells you: dude, some files are not
     yet "rebuilt" (duplicated),... do a balance or whatever.

> > - the still problematic documentation situation
> Not trying to rationalize this, but go take a look at a majority of
> other projects; most of them that aren't backed by some huge
> corporation throwing insane amounts of money at them have at best
> mediocre end-user documentation.  The fact that more effort is
> being put into development than documentation is generally a good
> thing, especially for something that is not yet feature complete
> like BTRFS.
Uhm... yes and no...
The lack of documentation (i.e. admin/end-user-grade documentation)
also means that people have less understanding of the system, less
trust, less knowledge of what they can expect/do with it (will Ctrl-C
on btrfs check work? what if I shut down during a balance? does it
break then? etc. pp.), and less will to play with it.
Further,... if btrfs were ever to reach the state of being "feature
complete" (and if that never happens, I don't mean because of slow
development, but rather because most other filesystems show that
development goes on "forever"),... there would be *so much* to do in
documentation that it's unlikely to ever happen.

Cheers,
Chris.