* raid5/6 production use status?
@ 2016-06-01 22:25 Christoph Anton Mitterer
  2016-06-02  9:24 ` Gerald Hopf
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-01 22:25 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 980 bytes --]

Hey.

I've lost track a bit recently, and the wiki changelog doesn't seem to
say much about progress on the RAID5/6 front... so how are
things going?

Is it already more or less "productively" usable? What's still missing?

I guess there still aren't any administrative tools that e.g. monitor
for failed disks or block errors?

Does the RAID5/6 code itself already work? Is it possible to replace broken
devices (or ones with block errors)? Are things like a completely
failing disk (while the fs is online) handled gracefully?
How about scrubbing/repairing... I assume a read would identify
silent block errors via the checksum[0] and rebuild them if possible;
but what happens when that fails? Just read errors? Is the btrfs
RAID marked failed and the fs remounted read-only?
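
(For reference, the commands I'd expect to be involved here; purely
illustrative, the device id and mount point are made up:)

  # replace a failing/failed device in place (device id 3 in this example)
  btrfs replace start 3 /dev/sdX /mnt/data
  btrfs replace status /mnt/data

  # scrub the whole fs, verifying checksums and repairing from redundancy
  btrfs scrub start /mnt/data
  btrfs scrub status /mnt/data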


Cheers & thx,
Chris.

[0] Except, of course, for the nodatacow case, which, albeit a major use
case, unfortunately still seems to lack the important checksumming
support :-(

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: raid5/6 production use status?
  2016-06-01 22:25 raid5/6 production use status? Christoph Anton Mitterer
@ 2016-06-02  9:24 ` Gerald Hopf
  2016-06-02  9:35   ` Hugo Mills
  2016-06-03 17:38   ` btrfs (was: raid5/6) production use status (and future)? Christoph Anton Mitterer
  0 siblings, 2 replies; 25+ messages in thread
From: Gerald Hopf @ 2016-06-02  9:24 UTC (permalink / raw)
  To: linux-btrfs


> Hey.
>
> I've lost track a bit recently, and the wiki changelog doesn't seem to
> say much about progress on the RAID5/6 front... so how are
> things going?
>
> Is it already more or less "productively" usable? What's still missing?
Well, you still can't even check for free space.

~ # btrfs fi usage /mnt/data-raid
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
     Device size:                  18.19TiB
     Device allocated:                0.00B
     Device unallocated:           18.19TiB
     Device missing:                  0.00B
     Used:                            0.00B
     Free (estimated):                0.00B      (min: 8.00EiB)

btrfs --version ==> btrfs-progs v4.5.3-70-gc1c27b9
kernel ==> 4.6.0



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: raid5/6 production use status?
  2016-06-02  9:24 ` Gerald Hopf
@ 2016-06-02  9:35   ` Hugo Mills
  2016-06-02 10:03     ` Gerald Hopf
  2016-06-03 17:38   ` btrfs (was: raid5/6) production use status (and future)? Christoph Anton Mitterer
  1 sibling, 1 reply; 25+ messages in thread
From: Hugo Mills @ 2016-06-02  9:35 UTC (permalink / raw)
  To: Gerald Hopf; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1221 bytes --]

On Thu, Jun 02, 2016 at 11:24:45AM +0200, Gerald Hopf wrote:
> 
> >Hey.
> >
> >I've lost track a bit recently, and the wiki changelog doesn't seem to
> >say much about progress on the RAID5/6 front... so how are
> >things going?
> >
> >Is it already more or less "productively" usable? What's still missing?
> Well, you still can't even check for free space.

   You can, but not with that tool.

https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
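
   (i.e., roughly along these lines; mount point taken from your example:)

   btrfs fi show /mnt/data-raid
   btrfs fi df /mnt/data-raid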

   Hugo.

> ~ # btrfs fi usage /mnt/data-raid
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:                  18.19TiB
>     Device allocated:                0.00B
>     Device unallocated:           18.19TiB
>     Device missing:                  0.00B
>     Used:                            0.00B
>     Free (estimated):                0.00B      (min: 8.00EiB)
> 
> btrfs --version ==> btrfs-progs v4.5.3-70-gc1c27b9
> kernel ==> 4.6.0
> 
> 

-- 
Hugo Mills             | UNIX: Spanish manufacturer of fire extinguishers
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: raid5/6 production use status?
  2016-06-02  9:35   ` Hugo Mills
@ 2016-06-02 10:03     ` Gerald Hopf
  0 siblings, 0 replies; 25+ messages in thread
From: Gerald Hopf @ 2016-06-02 10:03 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs


>>> Hey.
>>>
>>> I've lost track a bit recently, and the wiki changelog doesn't seem to
>>> say much about progress on the RAID5/6 front... so how are
>>> things going?
>>>
>>> Is it already more or less "productively" usable? What's still missing?
>> Well, you still can't even check for free space.
>     You can, but not with that tool.
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>
>     Hugo.
That tool, however, is according to the wiki the "new tool" which you are 
supposed to use! The other options are not that good...

btrfs fi usage
==> 3x WARNING: RAID56 detected, not implemented
btrfs fi df
==> only shows what part of the "allocated" space is in use; not useful 
information if you want to know whether you have free space
btrfs fi show
==> does not show total free space. I guess you can take the information 
from btrfs fi show, subtract used from total, and then multiply 
that? But by what? n disks? Or by n-1 disks because of parity?
==> multiplying it by all disks (including parity) seems to arrive at a 
free space similar to what df -h shows me. But is that correct? Or should 
it be 4/5 of this, because I have one parity disk and four data disks?
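
(Back-of-the-envelope sketch of what I mean, using the 18.19TiB total
from above and assuming 5 equal disks in RAID5, i.e. one disk's worth
of parity:)

  # raw capacity * (n-1)/n, with n = number of devices
  echo "scale=2; 18.19 * (5-1)/5" | bc
  14.55

So roughly 14.55TiB usable before metadata overhead; if that's even the
right formula.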

I do however stand corrected: You actually can (barely) check for free 
space. And you can get a number that might or might not be the free space.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs (was: raid5/6) production use status (and future)?
  2016-06-02  9:24 ` Gerald Hopf
  2016-06-02  9:35   ` Hugo Mills
@ 2016-06-03 17:38   ` Christoph Anton Mitterer
  2016-06-03 19:50     ` btrfs Austin S Hemmelgarn
       [not found]     ` <f4a9ef2f-99a8-bcc4-5a8f-b022914980f0@swiftspirit.co.za>
  1 sibling, 2 replies; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-03 17:38 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1654 bytes --]

Hey..

Hm... so the overall btrfs state still seems to be pretty worrying,
doesn't it?

- RAID5/6 seems far from being stable or even usable,... not to mention
  higher parity levels, whose earlier posted patches (e.g.
  http://thread.gmane.org/gmane.linux.kernel/1654735) seem to have
  been given up on.

- Serious show-stoppers and security deficiencies, like the UUID
  collision corruptions/attacks that were extensively discussed
  earlier, are still open

- a number of important core features don't fully work in many
  situations (e.g. the issues with defrag not being ref-link aware,...
  and I vaguely remember similar things with compression).

- OTOH, defrag seems to be viable for important use cases (VM images,
  DBs,... everything where large files are internally re-written
  randomly).
  Sure there is nodatacow, but with that one effectively completely
  loses one of the core features/promises of btrfs (integrity by
  checksumming)... and as I've shown in an earlier large discussion,
  none of the typical use cases for nodatacow has any high-level
  checksumming, and even if it does, it's not used by default, or doesn't
  give the same benefits as it would on the fs level (like using it for
  RAID recovery).

- other earlier anticipated features, like newer/better compression or
  checksum algos, seem to be dead as well

- still no real RAID 1

- no end-user/admin grade management/analysis tools that tell non-
  experts about the state/health of their fs, and whether things like
  balance etc. are necessary

- the still problematic documentation situation



[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-03 17:38   ` btrfs (was: raid5/6) production use status (and future)? Christoph Anton Mitterer
@ 2016-06-03 19:50     ` Austin S Hemmelgarn
  2016-06-04  1:51       ` btrfs Christoph Anton Mitterer
       [not found]     ` <f4a9ef2f-99a8-bcc4-5a8f-b022914980f0@swiftspirit.co.za>
  1 sibling, 1 reply; 25+ messages in thread
From: Austin S Hemmelgarn @ 2016-06-03 19:50 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs

On 2016-06-03 13:38, Christoph Anton Mitterer wrote:
> Hey..
> 
> Hm... so the overall btrfs state seems to be still pretty worrying,
> doesn't it?
> 
> - RAID5/6 seems far from being stable or even usable,... not to mention
>   higher parity levels, whose earlier posted patches (e.g.
>   http://thread.gmane.org/gmane.linux.kernel/1654735) seem to have
>   been given up on.
There's no point in trying to do higher parity levels if we can't get
regular parity working correctly.  Given the current state of things, it
might be better to break even and just rewrite the whole parity raid
thing from scratch, but I doubt that anybody is willing to do that.
> 
> - Serious show-stoppers and security deficiencies like the UUID
>   collision corruptions/attacks that have been extensively discussed
>   earlier, are still open
The UUID issue is not a BTRFS-specific one; it just happens to be easier
to cause issues with it on BTRFS. It causes problems with all Linux
native filesystems, as well as LVM, and is also an issue on Windows.
There is no way to solve it sanely given the requirement that userspace
not be broken.  Properly fixing this would likely make us more dependent
on hardware configuration than even mounting by device name.
> 
> - a number of important core features don't fully work in many
>   situations (e.g. the issues with defrag not being ref-link aware,...
>   and I vaguely remember similar things with compression).
OK, how then should defrag handle reflinks?  Preserving them prevents it
from being able to completely defragment data.  It's worth pointing out
that it is generally pointless to defragment snapshots, as they are
typically infrequently accessed in most use cases.
> 
> - OTOH, defrag seems to be viable for important use cases (VM images,
>   DBs,... everything where large files are internally re-written
>   randomly).
>   Sure there is nodatacow, but with that one effectively completely
>   loses one of the core features/promises of btrfs (integrity by
>   checksumming)... and as I've shown in an earlier large discussion,
>   none of the typical use cases for nodatacow has any high-level
>   checksumming, and even if it does, it's not used by default, or doesn't
>   give the same benefits as it would on the fs level (like using it for
>   RAID recovery).
The argument of nodatacow being viable for anything is a pretty
significant secondary discussion that is itself entirely orthogonal to
the point you appear to be trying to make here.
> 
> - other earlier anticipated features like newer/better compression or
>   checksum algos seem to be dead either
This one I entirely agree about.  The arguments against adding other
compression algorithms and new checksums are entirely bogus.  Ideally
we'd switch to just encoding algorithm info from the CryptoAPI and let
people use whatever they want from there.
> 
> - still no real RAID 1
No, you mean still no higher order replication.  I know I'm being
stubborn about this, but RAID-1 is officially defined in the standards
as 2-way replication.  The only extant systems that support higher
levels of replication and call it RAID-1 are entirely based on MD RAID
and its poor choice of naming.

Overall, between this and the insanity that is raid5/6, somebody with
significantly more skill than me, and significantly more time than most
of the developers, needs to just take a step back and rewrite the whole
multi-device profile support from scratch.
> 
> - no end-user/admin grade management/analysis tools that tell non-
>   experts about the state/health of their fs, and whether things like
>   balance etc. are necessary
I don't see anyone forthcoming with such tools either.  As far as basic
monitoring goes, it's trivial to do with simple scripts from tools like
monit or nagios.  As far as complex things like determining whether a fs
needs a balance, that's really non-trivial to figure out.  Even with a
person looking at it, it's still not easy to know whether or not a balance
will actually help.
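
(As a trivial example of the kind of check I mean; a sketch, not a
polished tool, and the mount point is made up:)

  #!/bin/sh
  # warn if any btrfs per-device error counter (write/read/flush/
  # corruption/generation errors) is non-zero
  if btrfs device stats /mnt/data | grep -vq ' 0$'; then
      echo "WARNING: btrfs error counters non-zero on /mnt/data"
  fi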
> 
> - the still problematic documentation situation
Not trying to rationalize this, but go take a look at the majority of
other projects: most of those that aren't backed by some huge corporation
throwing insane amounts of money at them have at best mediocre end-user
documentation.  The fact that more effort is being put into development
than documentation is generally a good thing, especially for something
that is not yet feature complete like BTRFS.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-03 19:50     ` btrfs Austin S Hemmelgarn
@ 2016-06-04  1:51       ` Christoph Anton Mitterer
  2016-06-04  7:24         ` btrfs Andrei Borzenkov
                           ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-04  1:51 UTC (permalink / raw)
  To: Austin S Hemmelgarn, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 9787 bytes --]

On Fri, 2016-06-03 at 15:50 -0400, Austin S Hemmelgarn wrote:
> There's no point in trying to do higher parity levels if we can't get
> regular parity working correctly.  Given the current state of things,
> it might be better to break even and just rewrite the whole parity
> raid thing from scratch, but I doubt that anybody is willing to do
> that.

Well... as I've said, things are pretty worrying. Obviously I cannot
really judge, since I'm not involved in btrfs' development... maybe
there's a lack of manpower? Since btrfs seems to be a very important
project (i.e. the next-gen fs), wouldn't it be possible to either get
some additional funding from the Linux Foundation, or for some of the
core developers to make an open call for funding by companies?
Having some additional people, perhaps working full-time on it, could be
a big help.

As for the RAID... given how much time/effort is being spent on 5/6 now,...
it really seems that one should have considered multi-parity from the
beginning.
It kinda feels like either this whole instability phase would start again
with multi-parity, or it will simply never happen.


> > - Serious show-stoppers and security deficiencies like the UUID
> >   collision corruptions/attacks that have been extensively
> > discussed
> >   earlier, are still open
> The UUID issue is not a BTRFS specific one, it just happens to be
> easier
> to cause issues with it on BTRFS

Uhm, this had been discussed extensively before, as I've said... AFAICS
btrfs is the only system we have that can possibly cause data
corruption or even a security breach through UUID collisions.
I'm not aware that other filesystems, or LVM, are affected; those just
continue to use the devices that are already "online"... and I think LVM
refuses to activate VGs if conflicting UUIDs are found.


> There is no way to solve it sanely given the requirement that
> userspace
> not be broken.
No, this is not true. Back when this was discussed, I and others
described how it could/should be done, i.e. how
userspace/kernel should behave; in short:
- continue using those devices that are already active
- refuse to (auto)assemble by UUID if there are conflicts,
  or require specifying the devices (with some --override-yes-i-know-
  what-i-do option or so)
- in case of assembling/rebuilding/similar... never do this
  automatically

I think there were some more corner cases; I basically discussed them all
in the thread back then (search for "attacking btrfs
filesystems via UUID collisions?" and IIRC some differently titled parent
or child threads).
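
(Just to illustrate the kind of conflict check I have in mind; a rough
sketch, not the actual implementation:)

  # show which block devices claim which btrfs filesystem UUID
  blkid -t TYPE=btrfs -s UUID
  # if a UUID shows up on more devices than the fs is supposed to have
  # (e.g. a freshly plugged-in clone), don't auto-assemble; require an
  # explicit device list instead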


>   Properly fixing this would likely make us more dependent
> on hardware configuration than even mounting by device name.
Sure, if there are colliding UUIDs and one still wants to mount (by
using some --override-yes-i-know-what-i-do option),... it would need to
be done by specifying the device name...
But where's the problem?
This would anyway only happen if someone either attacks, or someone made
a clone, and it's far better to refuse automatic assembly in cases
where accidental corruption can happen or where attacks may be
possible, requiring the user/admin to take manual action, than to have
corruption or a security breach.

Imagine the simple case: a degraded RAID1 on a PC. If btrfs did some
auto-rebuild based on UUID, then an attacker who knows that would just
need to plug in a USB disk with a fitting UUID... and easily get a copy
of everything on the disk: gpg keys, ssh keys, etc.



> > - a number of important core features not fully working in many
> >   situations (e.g. the issues with defrag, not being ref-link
> > aware,...
> >   and I vaguely remember similar things with compression).
> OK, how then should defrag handle reflinks?  Preserving them prevents
> it
> from being able to completely defragment data.
Didn't that even work in the past, with just some performance issues?


> > - OTOH, defrag seems to be viable for important use cases (VM
> > images,
> >   DBs,... everything where large files are internally re-written
> >   randomly).
> >   Sure there is nodatacow, but with that one effectively completely
> >   loses one of the core features/promises of btrfs (integrity by
> >   checksumming)... and as I've shown in an earlier large discussion,
> >   none of the typical use cases for nodatacow has any high-level
> >   checksumming, and even if it does, it's not used by default, or
> >   doesn't give the same benefits as it would on the fs level (like
> >   using it for RAID recovery).
> The argument of nodatacow being viable for anything is a pretty
> significant secondary discussion that is itself entirely orthogonal
> to
> the point you appear to be trying to make here.

Well, the point here was:
- many people (including myself) like btrfs and its
  (promised/future/current) features
- it's intended as a general-purpose fs
- this includes the case of having such file/IO patterns as e.g. for VM
  images or DBs
- this is currently not really doable without losing one of the
  promises (integrity)

So the point I'm trying to make:
people probably do not care so much whether their VM image/etc. is
COWed or not (snapshots/etc. still work with that), but they likely
do care if the integrity feature is lost.
So IMHO, nodatacow + checksumming deserves to be amongst the top
priorities.


> > - still no real RAID 1
> No, you mean still no higher order replication.  I know I'm being
> stubborn about this, but RAID-1 is officially defined in the
> standards
> as 2-way replication.
I think I remember you claimed that last time already, and as
I said back then:
- what counts is probably the common understanding of the term, which
  is N-disk RAID1 = N disks mirrored
- if there is something like an "official definition", it's probably
  the original paper that introduced RAID:
  http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf
  PDF page 11 (content page 9) describes RAID1 as:
  "This is the most expensive option since *all* disks are
  duplicated..."


> The only extant systems that support higher
> levels of replication and call it RAID-1 are entirely based on MD
> RAID
> and it's poor choice of naming.

Not true either; show me a single hardware RAID controller that does
RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
faculty, and all the controllers we have handle RAID1 in the sense of "all
disks mirrored".
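
(For comparison, the MD behaviour I mean; illustrative commands only,
device names made up:)

  # MD: a 3-device RAID1 keeps three mirrored copies
  mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

  # btrfs: "raid1" on 3 devices still keeps only two copies of each chunk
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd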


> > - no end-user/admin grade management/analysis tools that tell non-
> >   experts about the state/health of their fs, and whether things
> > like
> >   balance etc.pp. are necessary
> I don't see anyone forthcoming with such tools either.  As far as
> basic
> monitoring, it's trivial to do with simple scripts from tools like
> monit
> or nagios.

AFAIU, even that isn't really possible right now, is it?
Take RAID again,... there is no place where you can see whether the
RAID state is "optimal", or does that exist by now? Last time,
people were advised to look at the kernel logs, but that is no proper
way to check the state... logging may simply be deactivated, or you
may have an offline fs for which the logs have been lost because they
were on another disk.

Not to mention the inability to properly determine how often btrfs
encountered errors and "silently" corrected them,
e.g. some statistics about a device that can be used to decide whether
it's dying.
I think these things should be stored in the fs (and additionally also
on the respective device), where they can also be extracted when no
/var/log is present or when forensics are done.


>   As far as complex things like determining whether a fs needs
> balanced, that's really non-trivial to figure out.  Even with a
> person
> looking at it, it's still not easy to know whether or not a balance
> will
> actually help.
Well, I wouldn't call myself a btrfs expert, but from time to time I've
been a bit "more active" on the list.
Even I know about these strange cases (sometimes tricks), like many
empty data/meta block groups that may or may not get cleaned up and
may result in trouble.
How should the normal user/admin be able to cope with such things if
there are no good tools?

It starts with simple things like:
- adding a further disk to a RAID
  => there should be a tool which tells you: dude, some files are not
     yet "rebuilt" (duplicated),... do a balance or whatever.


> >- the still problematic documentation situation
> Not trying to rationalize this, but go take a look at a majority of
> other projects, most of them that aren't backed by some huge
> corporation
> throwing insane amounts of money at them have at best mediocre end-
> user
> documentation.  The fact that more effort is being put into
> development
> than documentation is generally a good thing, especially for
> something
> that is not yet feature complete like BTRFS.

Uhm.. yes and no...
The lack of documentation (i.e. admin/end-user-grade documentation)
also means that people have less understanding of the system, less
trust, less knowledge of what they can expect/do with it (will Ctrl-C
on btrfs check work? what if I shut down during a balance? does it
break then? etc.), and less will to play with it.
Further,... if btrfs were to reach the state of being "feature complete"
(if that ever happens, and I don't mean because of slow development,
but rather because most other filesystems show that development goes
on "forever"),... there would be *so much* documentation to catch up on
that it's unlikely it will happen.


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
       [not found]     ` <f4a9ef2f-99a8-bcc4-5a8f-b022914980f0@swiftspirit.co.za>
@ 2016-06-04  2:13       ` Christoph Anton Mitterer
  2016-06-04  2:36         ` btrfs Chris Murphy
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-04  2:13 UTC (permalink / raw)
  To: Brendan Hide, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4591 bytes --]

On Sat, 2016-06-04 at 00:22 +0200, Brendan Hide wrote:
> > - RAID5/6 seems far from being stable or even usable,... not to
> >   mention higher parity levels, whose earlier posted patches (e.g.
> >   http://thread.gmane.org/gmane.linux.kernel/1654735) seem to have
> >   been given up on.
>  I'm not certain why that patch didn't get any replies, though it
> should also be noted that it was sent to three mailing lists - and
> that btrfs was simply an implementation example. See previous thread
> here: http://thread.gmane.org/gmane.linux.kernel/1622485
Ah... I remembered that one, but just couldn't find it anymore... so
even two efforts already, both seem dead :-(

> I recall reading it and thinking 6 parities is madness - but I
> certainly see how it would be good for future-proofing.
Well I can imagine that scenarios exist in which more than two parities
may be highly desirable...


> > - a number of important core features not fully working in many
> >   situations (e.g. the issues with defrag, not being ref-link
> > aware,...
> >   and I vaguely remember similar things with compression).
>  True also. There are various features and situations where btrfs
> does not work as intelligently as expected.

And even worse: some of these are totally impossible for the
average user to know about. => the documentation issue (though the defrag
issue at least is documented now in btrfs-filesystem(8)).



>  I class these under the "you're doing it wrong" theme. The vast
> majority of popular database engines have been designed without CoW
> in mind and, unfortunately, one *cannot* simply dump it onto a CoW
> system and expect it to perform well. There is no easy answer here.
Well, the easy answer is: nodatacow.
At least in the sense that it's technically possible; I'm not talking about
"is it easy for the end-user" (the average admin may possibly at some point
read that nodatacow should be used for VMs and DBs, but what about all
the smallish DBs like Firefox's sqlite files and so on, or simply any other
scenario where such IO patterns happen).

But the problem with nodatacow is the implied loss of checksumming.



> > - other earlier anticipated features like newer/better compression
> > or
> >   checksum algos seem to be dead either
>  Re alternative compression: https://btrfs.wiki.kernel.org/index.php/
> FAQ#Will_btrfs_support_LZ4.3F
> My short version: This is a premature optimisation.
> 
> IMO, alternative checksums is also a premature optimisation. An RFC
> for alternative checksums was last looked at by Liu Bo in November
> 2014. A different strategy was proposed as the code didn't make use
> of a pre-existing crypto code in the kernel.



> > - still no real RAID 1
>  This depends on what you mean by "real" - and I'm guessing you're
> misled by mdraid's feature to have multiple copies in RAID1 rather
> than just the two. RAID1 by definition is exactly two mirrored
> copies. No more. No less.
See my answer to Austin about the same claim.
Actually I have no idea where that comes from,... even the more down-to-
earth sources like Wikipedia all speak of "mirroring of all disks", as
does the original paper about RAID.


> > - no end-user/admin grade management/analysis tools that tell non-
> >   experts about the state/health of their fs, and whether things
> > like
> >   balance etc.pp. are necessary
> > 
> > - the still problematic documentation situation
>  Simple answer: RAID5/6 is not yet recommended for storing data you
> don't mind losing. Btrfs is *also* not yet ready for install-and-
> forget-style system administration.

Well, the problem with writing good documentation in the "we'll do it once
it's finished" style is often that it never happens... or that the
devs themselves no longer recall all the details.

Also, in the meantime there is so much (often outdated) 3rd-party
documentation, and so many myths come alive, that it takes ages to clean
all that up.


> I personally recommend against using btrfs for people who aren't
> familiar with it.
I think it *is* pretty important that many people try/test/play with
it, because that helps stabilisation... but even during that phase,
documentation would be quite important.

If there were e.g. a kept-up-to-date wiki page about the status
and current perils of e.g. RAID5/6, people (like me) wouldn't ask every
few weeks, saving the devs' time.
Plus, people wouldn't end up simply trying it, believing it already works,
and then facing data loss.


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04  2:13       ` btrfs Christoph Anton Mitterer
@ 2016-06-04  2:36         ` Chris Murphy
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Murphy @ 2016-06-04  2:36 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Brendan Hide, Btrfs BTRFS

On Fri, Jun 3, 2016 at 8:13 PM, Christoph Anton Mitterer
<calestyo@scientia.net> wrote:

> If there were e.g. a kept-up-to-date wiki page about the status
> and current perils of e.g. RAID5/6, people (like me) wouldn't ask every
> few weeks, saving the devs' time.

Well, up until 4.6 there was a rather clear "Btrfs is under heavy
development, and is not suitable for any uses other than benchmarking
and review." statement in the kernel documentation.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/Documentation/filesystems/btrfs.txt?id=v4.6&id2=v4.5

There's no longer such a strongly worded caution in that document, nor
in the wiki.

The wiki has stale information still, but it's a volunteer effort like
everything else Btrfs related.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04  1:51       ` btrfs Christoph Anton Mitterer
@ 2016-06-04  7:24         ` Andrei Borzenkov
  2016-06-04 17:00           ` btrfs Chris Murphy
  2016-06-05 20:39         ` btrfs Henk Slager
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Andrei Borzenkov @ 2016-06-04  7:24 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Austin S Hemmelgarn, linux-btrfs

On 04.06.2016 04:51, Christoph Anton Mitterer wrote:
...
> 
>> The only extant systems that support higher
>> levels of replication and call it RAID-1 are entirely based on MD
>> RAID
>> and it's poor choice of naming.
> 
> Not true either, show me any single hardware RAID controller that does
> RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
> faculty, all controller we have, handle RAID1 in the sense of "all
> disks mirrored".
> 

Out of curiosity - which model of hardware controllers? Those I am aware
of simply won't let you create RAID1 if more than 2 disks are selected.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04  7:24         ` btrfs Andrei Borzenkov
@ 2016-06-04 17:00           ` Chris Murphy
  2016-06-04 17:37             ` btrfs Christoph Anton Mitterer
  2016-06-04 21:18             ` btrfs Andrei Borzenkov
  0 siblings, 2 replies; 25+ messages in thread
From: Chris Murphy @ 2016-06-04 17:00 UTC (permalink / raw)
  To: Andrei Borzenkov
  Cc: Christoph Anton Mitterer, Austin S Hemmelgarn, Btrfs BTRFS

On Sat, Jun 4, 2016 at 1:24 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> On 04.06.2016 04:51, Christoph Anton Mitterer wrote:
> ...
>>
>>> The only extant systems that support higher
>>> levels of replication and call it RAID-1 are entirely based on MD
>>> RAID
>>> and it's poor choice of naming.
>>
>> Not true either, show me any single hardware RAID controller that does
>> RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
>> faculty, all controller we have, handle RAID1 in the sense of "all
>> disks mirrored".
>>
>
> Out of curiosity - which model of hardware controllers? Those I am aware
> of simply won't let you create RAID1 if more than 2 disks are selected.

SNIA's DDF 2.0 spec Rev 19,
page 18/19, shows "RAID-1 Simple Mirroring" vs "RAID-1 Multi-Mirroring".



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04 17:00           ` btrfs Chris Murphy
@ 2016-06-04 17:37             ` Christoph Anton Mitterer
  2016-06-04 19:13               ` btrfs Chris Murphy
  2016-06-04 21:18             ` btrfs Andrei Borzenkov
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-04 17:37 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Austin S Hemmelgarn, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1351 bytes --]

On Sat, 2016-06-04 at 11:00 -0600, Chris Murphy wrote:
> SNIA's DDF 2.0 spec Rev 19
> page 18/19 shows 'RAID-1 Simple Mirroring" vs "RAID-1 Multi-
> Mirroring"

And DDF came how many years after the original RAID paper, by which time
everyone understood RAID1 as it was defined there? 1987 vs. ~2003 or so?

Also, SNIA's "standard definition" seems pretty strange, doesn't it?
They have two RAID1 variants, as you say:
- "simple mirroring" with n=2
- "multi mirroring" with n=3

I don't see why the n=2 case is "simpler" than the n=3 case, nor
why the n=3 case is "multi" and the n=2 case is not (it also already
involves multiple disks).
Also, why did they allow n=3 but not n>=3? If n=4 wouldn't make sense,
why would n=3, compared to n=2?

Anyway,...
- the original paper defines it as n mirrored disks
- Wikipedia handles it like that
- the already existing major RAID implementation (MD) in the Linux
  kernel handles it like that
- LVM's native mirroring allows setting the number of mirrors, i.e. it
  allows anything >=2, which is IMHO closer to the common meaning
  of RAID1 than btrfs' two copies

So even if there were some reasonable competing definition (and I
don't think the rather proprietary DDF is very reasonable here), why
use one that is incompatible with everything we have in Linux?


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04 17:37             ` btrfs Christoph Anton Mitterer
@ 2016-06-04 19:13               ` Chris Murphy
  2016-06-04 22:43                 ` btrfs Christoph Anton Mitterer
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Murphy @ 2016-06-04 19:13 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Chris Murphy, Austin S Hemmelgarn, Btrfs BTRFS

On Sat, Jun 4, 2016 at 11:37 AM, Christoph Anton Mitterer
<calestyo@scientia.net> wrote:
> On Sat, 2016-06-04 at 11:00 -0600, Chris Murphy wrote:
>> SNIA's DDF 2.0 spec Rev 19
>> page 18/19 shows 'RAID-1 Simple Mirroring" vs "RAID-1 Multi-
>> Mirroring"
>
> And DDF came how many years after the original RAID paper and everyone
> understood RAID1 as it was defined there? 1987 vs. ~2003 or so?
>
> Also, SNIA's "standard definition" seems pretty strange, doesn't it?
> They have two RAID1, as you say:
> - "simple mirroring" with n=2
> - "multi mirrioring" with n=3
>
> I wouldn't see why the n=2 case is "simpler" than the n=3 case, neither
> why the n=3 case is multi and the n=2 is not (it's also already
> multiple disks).
> Also why did they allow n=3 but not n>=3? If n=4 wouldn't make sense,
> why would n=3, compared to n=2?
>
> Anyway,...
> - the original paper defines it as n mirrored disks
> - Wikipedia handles it like that
> - the already existing major RAID implementation (MD) in the Linux
>   kernel handles it like that
> - LVM's native mirroring, allows to set the number of mirrors, i.e. it
>   allows for everything >=2 which is IMHO closer to the common meaning
>   of RAID1 than to btrfs' two-duplicates
>
> So even if there were some reasonable competing definition (and I
> don't think the rather proprietary DDF is very reasonable here), why
> use one that is incompatible with everything we have in Linux?

mdadm supports DDF.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04 17:00           ` btrfs Chris Murphy
  2016-06-04 17:37             ` btrfs Christoph Anton Mitterer
@ 2016-06-04 21:18             ` Andrei Borzenkov
  1 sibling, 0 replies; 25+ messages in thread
From: Andrei Borzenkov @ 2016-06-04 21:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Christoph Anton Mitterer, Austin S Hemmelgarn, Btrfs BTRFS

On 04.06.2016 20:00, Chris Murphy wrote:
> On Sat, Jun 4, 2016 at 1:24 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>> On 04.06.2016 04:51, Christoph Anton Mitterer wrote:
>> ...
>>>
>>>> The only extant systems that support higher
>>>> levels of replication and call it RAID-1 are entirely based on MD
>>>> RAID
>>>> and it's poor choice of naming.
>>>
>>> Not true either, show me any single hardware RAID controller that does
>>> RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
>>> faculty, all controller we have, handle RAID1 in the sense of "all
>>> disks mirrored".
>>>
>>
>> Out of curiosity - which model of hardware controllers? Those I am aware
>> of simply won't let you create RAID1 if more than 2 disks are selected.
> 
> SNIA's DDF 2.0 spec Rev 19
> page 18/19 shows 'RAID-1 Simple Mirroring" vs "RAID-1 Multi-Mirroring"
> 

The question was about hardware that implements it.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04 19:13               ` btrfs Chris Murphy
@ 2016-06-04 22:43                 ` Christoph Anton Mitterer
  2016-06-05 15:51                   ` btrfs Chris Murphy
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-04 22:43 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Austin S Hemmelgarn, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 231 bytes --]

On Sat, 2016-06-04 at 13:13 -0600, Chris Murphy wrote:
> mdadm supports DDF.

Sure... it also supports IMSM,... so what? Neither of them is the
default for mdadm, nor does that change the terminology used :)


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04 22:43                 ` btrfs Christoph Anton Mitterer
@ 2016-06-05 15:51                   ` Chris Murphy
  2016-06-05 20:39                     ` btrfs Christoph Anton Mitterer
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Murphy @ 2016-06-05 15:51 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Chris Murphy, Austin S Hemmelgarn, Btrfs BTRFS

On Sat, Jun 4, 2016 at 4:43 PM, Christoph Anton Mitterer
<calestyo@scientia.net> wrote:
> On Sat, 2016-06-04 at 13:13 -0600, Chris Murphy wrote:
>> mdadm supports DDF.
>
> Sure... it also supports IMSM,... so what? Neither of them are the
> default for mdadm, nor does it change the used terminology :)

Why is mdadm the reference point for terminology?

There's actually better consistency in terminology usage outside Linux
because of SNIA and DDF than within Linux where the most basic terms
aren't agreed upon by various upstream maintainers. mdadm and lvm use
different terms even though they're both now using the same md backend
in the kernel.

mdadm chunk = lvm segment = btrfs stripe = ddf strip = ddf stripe
element. Some things have no equivalent, like the Btrfs chunk. But
someone hears "chunk" and wonders if it's the same thing as the
mdadm chunk, but it isn't; and Btrfs actually also uses the term block
group for chunk, because...

So if you want to create a decoder ring for terminology, that's great
and would be useful; but just asking everyone in Btrfs land to come up
with Btrfs terminology 2.0 merely adds to the list of inconsistent
term usage, it doesn't actually fix any problems.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-05 15:51                   ` btrfs Chris Murphy
@ 2016-06-05 20:39                     ` Christoph Anton Mitterer
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-05 20:39 UTC (permalink / raw)
  To: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1682 bytes --]

On Sun, 2016-06-05 at 09:51 -0600, Chris Murphy wrote:
> Why is mdadm the reference point for terminology?
I haven't said it is,... I just said that mdadm, the original paper, and WP
use it in the common/historic way.
And since all of these were there before btrfs, and in the case of
mdadm/MD are "in" the kernel,... one should probably try to follow that,
if possible.



> There's actually better consistency in terminology usage outside
> Linux
> because of SNIA and DDF than within Linux where the most basic terms
> aren't agreed upon by various upstream maintainers.

Does anyone in the Linux world really care much about DDF? Even
outside? ;-)
Seriously,... as I tried to show in one of my previous posts, I think
the terminology of DDF, at least WRT RAID1 is a bit awkward.


>  mdadm and lvm use
> different terms even though they're both now using the same md
> backend
> in the kernel.
Depending on whether one chooses to use the "raid1" or the "mirror"
segment type....


Anyway,... I think this discussion is getting a bit pointless,... I think
it's clear that the current terminology can easily cause confusion, and
I think for a term like "RAID1", which is an artificial name, it's
something completely different than for terms like "stripe", "chunk", etc.,
which are rather common words where one must expect that they are
used for different things in different areas.

And as I've said just before... the other points on my bucket list,
like the UUID collision (security) issues, the lack of checksumming with
nodatacow, etc., deserve IMHO much more attention than the terminology
:)

So I'm kinda out of this specific part of the discussion.

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04  1:51       ` btrfs Christoph Anton Mitterer
  2016-06-04  7:24         ` btrfs Andrei Borzenkov
@ 2016-06-05 20:39         ` Henk Slager
  2016-06-05 20:56           ` btrfs Christoph Anton Mitterer
  2016-06-06  0:56         ` btrfs Chris Murphy
  2016-06-06 13:04         ` btrfs Austin S. Hemmelgarn
  3 siblings, 1 reply; 25+ messages in thread
From: Henk Slager @ 2016-06-05 20:39 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

>> > - OTOH, defrag seems to be viable for important use cases (VM
>> > images,
>> >   DBs,... everything where large files are internally re-written
>> >   randomly).
>> >   Sure there is nodatacow, but with that one effectively completely
>> >   loses one of the core features/promises of btrfs (integrity by
>> >   checksumming)... and as I've shown in an earlier large discussion,
>> >   none of the typical use cases for nodatacow has any high-level
>> >   checksumming, and even if it does, it's not used by default, or
>> >   doesn't give the same benefits as it would on the fs level (like
>> >   using it for RAID recovery).
>> The argument of nodatacow being viable for anything is a pretty
>> significant secondary discussion that is itself entirely orthogonal
>> to
>> the point you appear to be trying to make here.
>
> Well the point here was:
> - many people (including myself) like btrfs and its
>   (promised/future/current) features
> - it's intended as a general-purpose fs
> - this includes the case of having such file/IO patterns as e.g. for VM
>   images or DBs
> - this is currently not really doable without losing one of the
>   promises (integrity)
>
> So the point I'm trying to make:
> People do probably not care so much whether their VM image/etc. is
> COWed or not, snapshots/etc. still work with that,... but they may
> likely care if the integrity feature is lost.
> So IMHO, nodatacow + checksumming deserves to be amongst the top
> priorities.

Have you tried blockdevice/HDD caching like bcache or dm-cache in
combination with VMs on BTRFS?  Or a ZVOL for VMs on ZFS with L2ARC?
I assume the primary reason for wanting nodatacow + checksumming is to
avoid long seek times on HDDs due to growing fragmentation of the VM
images over time. But even if you had nodatacow + checksumming
implemented, it would still be HDD access, and a VM image file itself is
not guaranteed to be contiguous.
It is clear that for VM images the number of extents will grow large
over time (like 50k or so, with autodefrag on), but with a modern SSD used
as a cache it doesn't matter. It is still way faster than just HDD(s),
even compared to a freshly copied image with <100 extents.
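
(For reference, a minimal bcache sketch of what I mean; device names are
made up and the exact steps may differ between bcache-tools versions:)

  # SSD as cache, HDD as backing device; then put btrfs on the bcache device
  make-bcache -C /dev/nvme0n1 -B /dev/sdb
  mkfs.btrfs /dev/bcache0
  mount /dev/bcache0 /mnt/vmstore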

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-05 20:39         ` btrfs Henk Slager
@ 2016-06-05 20:56           ` Christoph Anton Mitterer
  2016-06-05 21:07             ` btrfs Hugo Mills
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-05 20:56 UTC (permalink / raw)
  To: Henk Slager; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2396 bytes --]

On Sun, 2016-06-05 at 22:39 +0200, Henk Slager wrote:
> > So the point I'm trying to make:
> > People do probably not care so much whether their VM image/etc. is
> > COWed or not, snapshots/etc. still work with that,... but they may
> > likely care if the integrity feature is lost.
> > So IMHO, nodatacow + checksumming deserves to be amongst the top
> > priorities.
> Have you tried blockdevice/HDD caching like bcache or dmcache in
> combination with VMs on BTRFS?
Not yet,... my personal use case is just some VMs on the notebook, and
for those, the above seems a bit overkill.
For the larger VM cluster at the institute,... puh, to be honest I don't
know off the top of my head what we do there.


>   Or ZVOL for VMs in ZFS with L2ARC?
Well but all this is an alternative solution,...


> I assume the primary reason for wanting nodatacow + checksumming is
> to
> avoid long seektimes on HDDs due to growing fragmentation of the VM
> images over time.
Well the primary reason is wanting to have overall checksumming in the
fs, regardless of which features one uses.

I think we already have some situations where tools use/set btrfs
features by themselves (i.e. automatically)... wasn't systemd creating
subvolumes by default in some locations when there's btrfs?
So it's no big step to postgresql/etc. setting nodatacow, making people
lose integrity without them even knowing.
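
(That kind of silent opt-out is as easy as the tool doing something like
this on its data directory; the path is illustrative, and the attribute
only takes effect for newly created files:)

  # new files under this directory are created nodatacow, hence unchecksummed
  chattr +C /var/lib/postgresql/9.5/main
  lsattr -d /var/lib/postgresql/9.5/main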


Of course, avoiding the fragmentation is the reason for the desire to
have nodatacow.


>  But even if you have nodatacow + checksumming
> implemented, it is then still HDD access and a VM imagefile itself is
> not guaranteed to be continuous.
Uhm... sure, but that's no different from other filesystems?!


> It is clear that for VM images the amount of extents will be large
> over time (like 50k or so, autodefrag on),
Wasn't it said that autodefrag performs badly for anything larger than
~1G?


>  but with a modern SSD used
> as cache, it doesn't matter. It is still way faster than just HDD(s),
> even with freshly copied image with <100 extents.
Well, the fragmentation also has many other consequences, not just
seeks (and that's assuming everyone would use SSDs, which isn't the case
and probably won't be for quite a while).
Most obviously you get many more IOPS, and btrfs itself will, AFAIU,
also suffer from some issues due to the fragmentation.


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-05 20:56           ` btrfs Christoph Anton Mitterer
@ 2016-06-05 21:07             ` Hugo Mills
  2016-06-05 21:31               ` btrfs Christoph Anton Mitterer
  0 siblings, 1 reply; 25+ messages in thread
From: Hugo Mills @ 2016-06-05 21:07 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Henk Slager, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3726 bytes --]

On Sun, Jun 05, 2016 at 10:56:45PM +0200, Christoph Anton Mitterer wrote:
> On Sun, 2016-06-05 at 22:39 +0200, Henk Slager wrote:
> > > So the point I'm trying to make:
> > > People do probably not care so much whether their VM image/etc. is
> > > COWed or not, snapshots/etc. still work with that,... but they may
> > > likely care if the integrity feature is lost.
> > > So IMHO, nodatacow + checksumming deserves to be amongst the top
> > > priorities.
> > Have you tried blockdevice/HDD caching like bcache or dmcache in
> > combination with VMs on BTRFS?
> No yet,... my personal use case is just some VMs on the notebook, and
> for this, the above would seem a bit overkill.
> For the larger VM cluster at the institute,... puh to be honest I don't
> know by hard what we do there.
> 
> 
> >   Or ZVOL for VMs in ZFS with L2ARC?
> Well but all this is an alternative solution,...
> 
> 
> > I assume the primary reason for wanting nodatacow + checksumming is
> > to
> > avoid long seektimes on HDDs due to growing fragmentation of the VM
> > images over time.
> Well the primary reason is wanting to have overall checksumming in the
> fs, regardless of which features one uses.

   The problem is that you can't guarantee consistency with
nodatacow+checksums. If you have nodatacow, then data is overwritten,
in place. If you do that, then you can't have a fully consistent
checksum -- there are always race conditions between the checksum and
the data being written (or the data and the checksum, depending on
which way round you do it).

> I think we already have some situations where tools use/set btrfs
> features by themselves (i.e. automatically)... wasn't systemd creating
> subvols per default in some locations, when there's btrfs?
> So it's no big step to postgresql/etc. setting nodatacow, making people
> lose integrity without them even knowing.
> 
> Of course, avoiding the fragmentation is the reason for the desire to
> have nodatacow.
> 
> 
> >  But even if you have nodatacow + checksumming
> > implemented, it is then still HDD access and a VM imagefile itself is
> > not guaranteed to be continuous.
> Uhm... sure, but that's no difference to other filesystems?!
> 
> 
> > It is clear that for VM images the amount of extents will be large
> > over time (like 50k or so, autodefrag on),
> Wasn't it said, that autodefrag performs bad for anything larger than
> ~1G?

   I don't recall ever seeing someone saying that. Of course, I may
have forgotten seeing it...

> >  but with a modern SSD used
> > as cache, it doesn't matter. It is still way faster than just HDD(s),
> > even with freshly copied image with <100 extents.
> Well the fragmentation has also many other consequences and not just
> seeks (assuming everyone would use SSDs, which is and probably won't be
> the case for quite a while).
> Most obviously you get much more IOPS and btrfs itself will, AFAIU,
> also suffer from some issues due to the fragmentation.

   This is a fundamental problem with all CoW filesystems. There are
some mitigations that can be put in place (true CoW rather than
btrfs's redirect-on-write, like some databases do, where the original
data is copied elsewhere before overwriting; cache aggressively and
with knowledge of the CoW nature of the FS, like ZFS does), but they
all have their drawbacks and pathological cases.

   Hugo.

-- 
Hugo Mills             | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4          |                                        Harry Harrison

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-05 21:07             ` btrfs Hugo Mills
@ 2016-06-05 21:31               ` Christoph Anton Mitterer
  2016-06-05 23:39                 ` btrfs Chris Murphy
  2016-06-08  6:13                 ` btrfs Duncan
  0 siblings, 2 replies; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-05 21:31 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Henk Slager, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3516 bytes --]

On Sun, 2016-06-05 at 21:07 +0000, Hugo Mills wrote:
>    The problem is that you can't guarantee consistency with
> nodatacow+checksums. If you have nodatacow, then data is overwritten,
> in place. If you do that, then you can't have a fully consistent
> checksum -- there are always race conditions between the checksum and
> the data being written (or the data and the checksum, depending on
> which way round you do it).

I'm not an expert in the btrfs internals... but I had a pretty long
discussion back when I first brought this up, and everything that
came out of that, to my understanding, indicated that it should
simply be possible.

a) nodatacow just means "no data cow", but not "no metadata cow".
   And isn't the checksumming data metadata? So AFAIU, this is itself
   COWed anyway.
b) What you refer to above is, AFAIU, that data may be written (not
   COWed) and there is of course no guarantee that the written data
   matches the checksum (which may e.g. still be the old sum).
   => So what?
      This anyway only happens in case of a crash/etc., and in that case
      we have no idea anyway whether the written, non-COWed block is
      consistent or not, whether we do checksumming or not.
      We rather get the benefit that we now know: it may be garbage.
      The only "bad" thing that could happen would be:
      the block is fully written and actually consistent, but the
      checksum hasn't been written yet - IMHO much less likely than
      the other case(s). And I'd rather get one false positive in a
      more unlikely case than corrupted blocks in all other possible
      situations (silent block errors, etc.).
      And in principle, nothing would prevent a future btrfs from getting
      a journal for the nodatacow-ed writes.

Look for the past thread "dear developers, can we have notdatacow +
checksumming, plz?",... I think I wrote about many more cases there,
and why, even if it may not be as perfect as datacow+checksumming, it
would still always be better to have checksumming with nodatacow.

> > Wasn't it said, that autodefrag performs bad for anything larger
> > than
> > ~1G?
> 
>    I don't recall ever seeing someone saying that. Of course, I may
> have forgotten seeing it...
I think it was mentioned below this thread:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/50444/focus=50586
and also implied here:
http://article.gmane.org/gmane.comp.file-systems.btrfs/51399/match=autodefrag+large+files


> > Well the fragmentation has also many other consequences and not
> > just
> > seeks (assuming everyone would use SSDs, which is and probably
> > won't be
> > the case for quite a while).
> > Most obviously you get much more IOPS and btrfs itself will, AFAIU,
> > also suffer from some issues due to the fragmentation.
>    This is a fundamental problem with all CoW filesystems. There are
> some mititgations that can be put in place (true CoW rather than
> btrfs's redirect-on-write, like some databases do, where the original
> data is copied elsewhere before overwriting; cache aggressively and
> with knowledge of the CoW nature of the FS, like ZFS does), but they
> all have their drawbacks and pathological cases.
Sure... but defrag (if it worked in general) or nodatacow (if it
didn't make you lose the ability to determine whether you're
consistent or not) would already be quite helpful here.


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-05 21:31               ` btrfs Christoph Anton Mitterer
@ 2016-06-05 23:39                 ` Chris Murphy
  2016-06-08  6:13                 ` btrfs Duncan
  1 sibling, 0 replies; 25+ messages in thread
From: Chris Murphy @ 2016-06-05 23:39 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Hugo Mills, Henk Slager, linux-btrfs

On Sun, Jun 5, 2016 at 3:31 PM, Christoph Anton Mitterer
<calestyo@scientia.net> wrote:
> On Sun, 2016-06-05 at 21:07 +0000, Hugo Mills wrote:
>>    The problem is that you can't guarantee consistency with
>> nodatacow+checksums. If you have nodatacow, then data is overwritten,
>> in place. If you do that, then you can't have a fully consistent
>> checksum -- there are always race conditions between the checksum and
>> the data being written (or the data and the checksum, depending on
>> which way round you do it).
>
> I'm not an expert in the btrfs internals... but I had a pretty long
> discussion back then when I brought this up first, and everything that
> came out of that - to my understanding - indicated, that it should be
> simply possible.
>
> a) nodatacow just means "no data cow", but not "no meta data cow".
>    And isn't the checksumming data meta data? So AFAIU, this is itself
>    anyway COWed.
> b) What you refer to above is, AFAIU, that data may be written (not
>    COWed) and there is of course no guarantee that the written data
>    matches the checksum (which may e.g. still be the old sum).
>    => So what?

For a file like a VM image constantly being modified, essentially at
no time will the csums on disk ever reflect the state of the file.

>       This anyway only happens in case of crash/etc. and in that case
>       we anyway have no idea whether the written, not-COWed block is
>       consistent or not, whether we do checksumming or not.

If the file is cow'd and checksummed, and there's a crash, there is
supposed to be consistency: either the old state or new state for the
data is on-disk and the current valid metadata correctly describes
which state that data is in.

If the file is not cow'd and not checksummed, its consistency is
unknown but also ignored, when doing normal reads, balance or scrubs.

If the file is not cow'd but were checksummed, there would always be
some inconsistency if the file is actively being modified. Only when
it's not being modified, and metadata writes for that file are
committed to disk and the superblock updated, is there consistency. At
any other time, there's inconsistency. So if there's a crash, a
balance or scrub or normal read will say the file is corrupt. And the
normal way Btrfs deals with corruption on reads from a mounted fs is
to complain, and it does not pass the corrupt data to user space;
instead there's an i/o error. You have to use restore to scrape it off
the volume; or alternatively use btrfsck to recompute checksums.
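As a rough sketch of those two options (the device and recovery
directory names here are placeholders, and both commands are run
against an unmounted filesystem):

  # Copy whatever is still readable off the volume without mounting it.
  btrfs restore /dev/sdX /mnt/recovery

  # Or recreate the checksum tree so stale csums stop producing read
  # errors - this rewrites all csums, so treat it as a last resort.
  btrfs check --init-csum-tree /dev/sdX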

Presumably you'd ask for an exception for this kind of file, where it
can still be read even though there's a checksum mismatch, can be
scrubbed and balanced which will report there's corruption even if
there isn't any, and you've gained, insofar as I can tell, a lot of
confusion and ambiguity.

It's fine you want a change in behavior for Btrfs. But when a
developer responds, more than once, about how this is somewhere
between difficult and not possible, and you say it should simply be
possible, I think that's annoying, bordering on irritating.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04  1:51       ` btrfs Christoph Anton Mitterer
  2016-06-04  7:24         ` btrfs Andrei Borzenkov
  2016-06-05 20:39         ` btrfs Henk Slager
@ 2016-06-06  0:56         ` Chris Murphy
  2016-06-06 13:04         ` btrfs Austin S. Hemmelgarn
  3 siblings, 0 replies; 25+ messages in thread
From: Chris Murphy @ 2016-06-06  0:56 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Austin S Hemmelgarn, Btrfs BTRFS

On Fri, Jun 3, 2016 at 7:51 PM, Christoph Anton Mitterer
<calestyo@scientia.net> wrote:

> I think I remember that you've claimed that last time already, and as
> I've said back then:
> - what counts is probably the common understanding of the term, which
>   is N disks RAID1 = N disks mirrored
> - if there is something like an "official definition", it's probably
>   the original paper that introduced RAID:
>   http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf
>   PDF page 11, respectively content page 9 describes RAID1 as:
>   "This is the most expensive option since *all* disks are
>   duplicated..."


You've misread the paper.

It defines what it means by "all disks are duplicated" as G=1 and C=1.
That is, every data disk has one check disk. That is, two copies.
There is no mention of n-copies.

Further in table 2 "Characteristics of Level 1 RAID" the overhead is
described as 100%, and the usable storage capacity is 50%. Again, that
is consistent with duplication.

The definition of duplicate is "one of two or more identical things."

The etymology of duplicate is "1400-50; late Middle English < Latin
duplicātus (past participle of duplicāre to make double), equivalent
to duplic- (stem of duplex) duplex + -ātus -ate1"
http://www.dictionary.com/browse/duplicate

There is no possible reading of this that suggests n-way RAID is intended.




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-04  1:51       ` btrfs Christoph Anton Mitterer
                           ` (2 preceding siblings ...)
  2016-06-06  0:56         ` btrfs Chris Murphy
@ 2016-06-06 13:04         ` Austin S. Hemmelgarn
  3 siblings, 0 replies; 25+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-06 13:04 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs

On 2016-06-03 21:51, Christoph Anton Mitterer wrote:
> On Fri, 2016-06-03 at 15:50 -0400, Austin S Hemmelgarn wrote:
>> There's no point in trying to do higher parity levels if we can't get
>> regular parity working correctly.  Given the current state of things,
>> it might be better to break even and just rewrite the whole parity
>> raid thing from scratch, but I doubt that anybody is willing to do
>> that.
>
> Well... as I've said, things are pretty worrying. Obviously I cannot
> really judge, since I'm not into btrfs' development... maybe there's a
> lack of manpower? Since btrfs seems to be a very important part (i.e.
> next-gen fs), wouldn't it be possible to either get some additional
> funding by the Linux Foundation, or possible that some of the core
> developers make an open call for funding by companies?
> Having some additional people, perhaps working fulltime on it, may be a
> big help.
>
> As for the RAID... given how many time/effort is spent now into 5/6,..
> it really seems that one should have considered multi-parity from the
> beginning on.
> Kinda feels like either, with multi-parity this whole instability phase
> would start again, or it will simply never happen.
New features will always cause some instability, period; there is no way 
to avoid that.
>
>
>>> - Serious show-stoppers and security deficiencies like the UUID
>>>   collision corruptions/attacks that have been extensively
>>> discussed
>>>   earlier, are still open
>> The UUID issue is not a BTRFS specific one, it just happens to be
>> easier
>> to cause issues with it on BTRFS
>
> uhm this had been discussed extensively before, as I've said... AFAICS
> btrfs is the only system we have that can possibly cause data
> corruption or even a security breach by UUID collisions.
> I wouldn't know of other fs or LVM being affected; these just continue
> to use those devices already "online"... and I think LVM refuses to
> activate VGs if conflicting UUIDs are found.
If you are mounting by UUID, it is entirely non-deterministic which 
filesystem with that UUID will be mounted (because device enumeration is 
non-deterministic).  As for LVM, it refuses to activate VGs, but it 
can still have issues if you have LVs with the same UUID (which can be 
done pretty trivially), and the fact that it refuses to activate them 
technically constitutes a DoS attack (because you can't use the resources).
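Incidentally, spotting colliding btrfs UUIDs before trusting
mount-by-UUID is straightforward - a minimal sketch, with device names
and mount point as placeholders:

  # Print any btrfs filesystem UUID that blkid sees more than once.
  blkid -t TYPE=btrfs -o value -s UUID | sort | uniq -d

  # If a collision is suspected, mount by explicit device paths instead
  # of by UUID.
  mount -o device=/dev/sdb1,device=/dev/sdc1 /dev/sdb1 /mnt/data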
>
>
>> There is no way to solve it sanely given the requirement that
>> userspace
>> not be broken.
> No this is not true. Back when this was discussed, I and others
> described how it could/should be done,... respectively how
> userspace/kernel should behave, in short:
> - continue using those devices that are already active
This is easy, but only works for mounted filesystems.
> - refusing to (auto)assemble by UUID, if there are conflicts
>   or requiring to specify the devices (with some --override-yes-i-know-
>   what-i-do option option or so)
> - in case of assembling/rebuilding/similar... never doing this
>   automatically
These two allow anyone with the ability to plug in a USB device to DoS 
the system.
>
> I think there were some more corner cases, I basically had them all
> discussed in the thread back then (search for "attacking btrfs
> filesystems via UUID collisions?" and IIRC some different titled parent
> or child threads).
>
>
>>   Properly fixing this would likely make us more dependent
>> on hardware configuration than even mounting by device name.
> Sure, if there are colliding UUIDs, and one still wants to mount (by
> using some --override-yes-i-know-what-i-do option),.. it would need to
> be by specifying the device name...
> But where's the problem?
> This would anyway only happen if someone either attacks or someone made
> a clone, and it's far better to refuse automatic assembly in cases
> where accidental corruption can happen or where attacks may be
> possible, requiring the user/admin to manually take action, than having
> corruption or security breach.
Refusing automatic assembly does not prevent the attack; it simply 
converts it from a data corruption attack to a DoS attack.
>
> Imagine the simple case: degraded RAID1 on a PC; if btrfs would do some
> auto-rebuild based on UUID, then if an attacker knows that he'd just
> need to plug in a USB disk with a fitting UUID...and easily gets a copy
> of everything on disk, gpg keys, ssh keys, etc.
If the attacker has physical access to the machine, it's irrelevant even 
with such protection, as there are all kinds of other things that could 
be done to get data off the disk (especially if the system has 
Thunderbolt or USB-C ports).  If the user has any unsecured 
encryption or authentication tokens on the system, they're screwed 
anyway though.
>
>>> - a number of important core features not fully working in many
>>>   situations (e.g. the issues with defrag, not being ref-link
>>> aware,...
>>>   an I vaguely remember similar things with compression).
>> OK, how then should defrag handle reflinks?  Preserving them prevents
>> it
>> from being able to completely defragment data.
> Didn't that even work in the past and had just some performance issues?
Most of it was scaling issues, but unless you have some solution to 
handle it correctly, there's no point in complaining about it.  And my 
point about defragmentation with reflinks still stands.
>
>
>>> - OTOH, defrag seems to be viable for important use cases (VM
>>> images,
>>>   DBs,... everything where large files are internally re-written
>>>   randomly).
>>>   Sure there is nodatacow, but with that one effectively completely
>>>   looses one of the core features/promises of btrfs (integrity by
>>>   checksumming)... and as I've showed in an earlier large
>>> discussion,
>>>   none of the typical use cases for nodatacow has any high-level
>>>   checksumming, and even if, it's not used per default, or doesn't
>>> give
>>>   the same benefits at it would on the fs level, like using it for
>>> RAID
>>>   recovery).
>> The argument of nodatacow being viable for anything is a pretty
>> significant secondary discussion that is itself entirely orthogonal
>> to
>> the point you appear to be trying to make here.
>
> Well the point here was:
> - many people (including myself) like btrfs, it's
>   (promised/future/current) features
> - it's intended as a general purpose fs
> - this includes the case of having such file/IO patterns as e.g. for VM
>   images or DBs
> - this is currently not really doable without losing one of the
>   promises (integrity)
>
> So the point I'm trying to make:
> People probably do not care so much whether their VM image/etc. is
> COWed or not - snapshots/etc. still work with that - but they likely
> do care if the integrity feature is lost.
> So IMHO, nodatacow + checksumming deserves to be amongst the top
> priorities.
You're not thinking from a programming perspective.  There is no way to 
force atomic updates of data in chunks bigger than the sector size on a 
block storage device.  Without that ability, there is no way to ensure 
that the checksum for a data block and the data block itself are either 
both written or neither written unless you either use COW or some form 
of journaling.
>
>
>>> - still no real RAID 1
>> No, you mean still no higher order replication.  I know I'm being
>> stubborn about this, but RAID-1 is officially defined in the
>> standards
>> as 2-way replication.
> I think I remember that you've claimed that last time already, and as
> I've said back then:
> - what counts is probably the common understanding of the term, which
>   is N disks RAID1 = N disks mirrored
> - if there is something like an "official definition", it's probably
>   the original paper that introduced RAID:
>   http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf
>   PDF page 11, respectively content page 9 describes RAID1 as:
>   "This is the most expensive option since *all* disks are
>   duplicated..."
>
>
>> The only extant systems that support higher
>> levels of replication and call it RAID-1 are entirely based on MD
>> RAID
>> and it's poor choice of naming.
>
> Not true either, show me any single hardware RAID controller that does
> RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
> faculty, all controllers we have handle RAID1 in the sense of "all
> disks mirrored".
Exact specs, please.  While I don't manage data on anywhere near that 
scale, I have seen hundreds of different models of RAID controllers over 
the years, and have yet to see one that is an actual hardware 
implementation that supports creating a RAID1 configuration with more 
than two disks.

As far as controllers that I've seen that do RAID-1 solely as 2 way 
replication:
* Every single Dell branded controller I've dealt with, including recent 
SAS3 based ones (pretty sure most of these are LSI Logic devices)
* Every single Marvell based controller I've dealt with.
* All of the Adaptec and LSI Logic controllers I've dealt with (although 
most of these I've dealt with are older devices).
* All of the HighPoint controllers I've dealt with.
* The few non-Marvell based Areca controllers I've dealt with.
>
>
>>> - no end-user/admin grade management/analysis tools that tell non-
>>>   experts about the state/health of their fs, and whether things
>>>   like balance etc. are necessary
>> I don't see anyone forthcoming with such tools either.  As far as
>> basic
>> monitoring, it's trivial to do with simple scripts from tools like
>> monit
>> or nagios.
>
> AFAIU, even that isn't really possible right now, is it?
There's a limit to what you can do with this, but you can definitely 
check things like error counts from normal operation and scrubs, notify 
when the filesystem goes degraded, and other basic things that most 
people expect out of system monitoring.

In my particular case, what I'm doing is:
1. Run scrub from a cronjob daily (none of my filesystems are big enough 
for this to take more than an hour)
2. From monit, check the return code of 'btrfs scrub status' at some 
point early in the morning after the scrub finishes; if it returns 
non-zero, there were errors during the scrub.
3. Have monit poll filesystem flags every cycle (in my case, every 
minute).  If it sees these change, the filesystem had some issue.
4. Parse the output of 'btrfs device stats' to check for recorded errors 
and send an alert under various cases (checking whole system aggregates 
of each type, and per-filesystem aggregates of all types, and flagging 
when it's above a certain threshold).
5. Run an hourly filtered balance with -dusage=50 -dlimit=2 -musage=50 
-mlimit=3 to clean up partially used chunks.
6. If any of these have issues, I get an e-mail from the system (and 
because of how I set that up, that works even if none of the persistent 
storage on the system is working correctly).
Note that this is just the BTRFS-specific things (a rough shell sketch of 
steps 2, 4 and 5 follows below), and doesn't include SMART checks, 
low-level LVM verification, and other similar things.
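As a minimal sketch of steps 2, 4 and 5 (the mount point, threshold,
mail command and cron schedule are placeholder assumptions, not the
exact setup described above):

  #!/bin/sh
  MOUNTPOINT=/mnt/data
  THRESHOLD=0

  # Step 2: a non-zero exit status from 'btrfs scrub status' after the
  # nightly scrub means the scrub found errors.
  if ! btrfs scrub status "$MOUNTPOINT" >/dev/null 2>&1; then
      echo "scrub reported errors on $MOUNTPOINT" | mail -s "btrfs scrub" root
  fi

  # Step 4: sum the per-device error counters and alert when the total
  # exceeds the threshold.
  errors=$(btrfs device stats "$MOUNTPOINT" | awk '{sum += $2} END {print sum+0}')
  if [ "$errors" -gt "$THRESHOLD" ]; then
      echo "btrfs device stats reports $errors errors on $MOUNTPOINT" \
          | mail -s "btrfs device errors" root
  fi

  # Step 5 as a cron entry (hourly filtered balance):
  # 0 * * * * root btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=3 /mnt/data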
> Take RAID again,... there is no place where you can see whether the
> RAID state is "optimal", or does that exist in the meantime? Last time,
> people were advised to look at the kernel logs, but this is no proper
> way to check for the state... logging may simply be deactivated, or you
> may have an offline fs, for which the logs have been lost because they
> were on another disk.
Unless you have a modified kernel or are using raid5/6, the filesystem 
will go read-only when degraded.  You can poll the filesystem flags to 
verify this (although it's better to poll and check if they've changed, as 
that can detect other issues too).  Additionally, you can check device 
stats, which will show any errors.
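A minimal version of that flag check, with the mount point as a
placeholder:

  # Alert if the filesystem is now mounted read-only, which usually
  # means it went degraded or hit an error and was flipped to ro.
  if findmnt -n -o OPTIONS /mnt/data | grep -qw ro; then
      echo "/mnt/data is mounted read-only - check device stats and dmesg" >&2
  fi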
>
> Not to talk about the inability to properly determine how often btrfs
> encountered errors, and "silently" corrected it.
> E.g. some statistics about a device that can be used to decide whether
> it's dying.
> I think these things should be stored in the fs (and additionally also
> on the respective device), where it can also be extracted when no
> /var/log is present or when forensics are done.
'btrfs device stats' will show you running error counts since the last 
time they were manually reset (by passing the -z flag to said command). 
It's also notably one of the few tools that has output which is easy to 
parse programmatically (which is an entirely separate issue).
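For instance, something along these lines (mount point is a placeholder)
flags any counter that has gone non-zero, and the counters can be
cleared once the cause is understood:

  # Print any non-zero counter (write_io_errs, read_io_errs,
  # corruption_errs, ...).
  btrfs device stats /mnt/data | awk '$2 > 0 {print "non-zero:", $1, $2}'

  # After investigating, reset the counters so new errors stand out.
  btrfs device stats -z /mnt/data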
>
>
>>   As far as complex things like determining whether a fs needs
>> balanced, that's really non-trivial to figure out.  Even with a
>> person
>> looking at it, it's still not easy to know whether or not a balance
>> will
>> actually help.
> Well I wouldn't call myself a btrfs expert, but from time to time I've
> been a bit "more active" on the list.
> Even I know about these strange cases (sometimes tricks), like many
> empty data/meta block groups that may or may not get cleaned up, and
> may result in troubles.
> How should the normal user/admin be able to cope with such things if
> there are no good tools?
Empty block groups get deleted automatically these days (I distinctly 
remember this going in, because it temporarily broke discard and fstrim 
support), so that one is not an issue if they're on a new enough kernel.

As far as what I specifically said, it's still hard to know if a balance 
will _help_ or not.  For example, one of the people I was helping on the 
mailing list recently had a filesystem with a bunch of chunks which 
were partially allocated, and thus a lot of 'free space' listed in 
various tools, but none which were empty.  The only reason this was 
apparent was that a balance filtered on usage was failing above a 
certain threshold and not balancing anything below that threshold. 
Having to test for such things, and in doing so potentially use a lot of 
disk bandwidth (especially because the threshold can be pretty high, in 
this case it was 67%), is not user friendly any more than not reporting 
an issue at all is.
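One way to probe this without guessing (the mount point and the
particular usage steps here are arbitrary) is to walk the usage filter
upwards and watch what each pass actually relocates:

  # Each pass only touches chunks whose usage is at or below the given
  # percentage; if the low passes relocate nothing, the 'free space' is
  # spread thinly across mostly-full chunks.
  for u in 10 25 50 75; do
      echo "== balance -dusage=$u =="
      btrfs balance start -dusage=$u /mnt/data
  done
  btrfs fi df /mnt/data   # compare total vs. used afterwards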

Part of the issue here is that people aren't used to using 
filesystem-specific tools to check their filesystems.  df is a classic 
example of this: it was designed in the '70s and never envisioned some of 
the cases we have to deal with in BTRFS.
>
> It starts with simple things like:
> - adding a further disk to a RAID
>   => there should be a tool which tells you: dude, some files are not
>      yet "rebuild"(duplicated),... do a balance or whatever.
Adding a disk should implicitly balance the FS unless you tell it not 
to; it was just a poor design choice in the first place not to do it 
that way.
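Until that changes, the explicit two-step version is at least simple
(device and mount point are placeholders):

  # Add the new device, then rebalance so existing chunks get spread
  # (or re-mirrored) onto it; without the balance, old data stays put.
  btrfs device add /dev/sdd /mnt/data
  btrfs balance start /mnt/data   # add -dusage/-musage filters to limit I/O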
>
>
>>> - the still problematic documentation situation
>> Not trying to rationalize this, but go take a look at a majority of
>> other projects, most of them that aren't backed by some huge
>> corporation
>> throwing insane amounts of money at them have at best mediocre end-
>> user
>> documentation.  The fact that more effort is being put into
>> development
>> than documentation is generally a good thing, especially for
>> something
>> that is not yet feature complete like BTRFS.
>
> Uhm.. yes and no...
> The lack of documentation (i.e. admin/end-user-grade documentation)
> also means that people have less understanding of the system, less
> trust, less knowledge of what they can expect/do with it (will Ctrl-C
> on btrfs check work? what if I shut down during a balance? does it
> break then? etc.), and less willingness to play with it.
Given the state of BTRFS, that's not a bad thing.  A good administrator 
looking into it will do proper testing before using it.  If you aren't 
going to properly test something this comparatively new, you probably 
shouldn't be just arbitrarily using it without question.
> Further,... if btrfs would reach the state of being "feature complete"
> (if that ever happens - and I don't mean because of slow development,
> but rather because most other fs show that development goes on
> "forever"),... there would be *so much* to do in documentation that
> it's unlikely it will happen.
In this particular case, I use the term 'feature complete' to mean on 
par feature-wise with most other equivalent software (in this case, near 
feature parity with ZFS, as that's really the only significant 
competitor in the intended market).  As of right now, the only extant 
items other than bugs that would need to be in BTRFS to be feature 
complete by this definition are:
1. Quota support
2. Higher-order replication (at a minimum, 3 copies)
3. Higher order parity (at a minimum, 3-level, which is the highest ZFS 
supports right now).
4. Online filesystem checking.
5. In-band deduplication.
6. In-line encryption.
7. Storage tiering (like ZFS's L2ARC, or bcache).

Of these, items 1 and 5 are under active development, and 6 would likely 
not require much effort for a basic implementation because there's a 
VFS-level API for it now.  Items 2 and 3 are stalled pending functional 
raid5/6 (which is the correct choice; adding them now would make it more 
complicated to fix raid5/6).  That means the only ones that don't 
appear to be actively on the radar are 4 (which most non-enterprise 
users probably don't strictly need) and 7 (which would be nice but would 
require significant work for limited benefit given the alternative 
options in the block layer itself).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs
  2016-06-05 21:31               ` btrfs Christoph Anton Mitterer
  2016-06-05 23:39                 ` btrfs Chris Murphy
@ 2016-06-08  6:13                 ` Duncan
  1 sibling, 0 replies; 25+ messages in thread
From: Duncan @ 2016-06-08  6:13 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Sun, 05 Jun 2016 23:31:57 +0200 as
excerpted:

>> > Wasn't it said, that autodefrag performs bad for anything larger than
>> > ~1G?
>> 
>>    I don't recall ever seeing someone saying that. Of course, I may
>> have forgotten seeing it...
> I think it was mentioned below this thread:
> http://thread.gmane.org/gmane.comp.file-systems.btrfs/50444/focus=50586
> and also implied here:
> http://article.gmane.org/gmane.comp.file-systems.btrfs/51399/match=autodefrag+large+files

Yes.

I was rather surprised to see Hugo say he doesn't recall seeing anyone
state that autodefrag performs poorly on large (from half gig) files, and
that its primary recommended use is for smaller database files such as the
typical quarter-gig or smaller sqlite files created by firefox and various
mail clients (thunderbird, evolution) - because I've both seen and
repeated that many times myself, and indeed, the wiki's mount options
page used to say effectively that.

And actually, looking at the history of the page, it was Hugo that deleted
the wording to the effect that autodefrag didn't work well on large
database or VM files:

https://btrfs.wiki.kernel.org/index.php?title=Mount_options&diff=29268&oldid=28191

So if he doesn't remember it...

But perhaps Hugo read it as manual defrag, not autodefrag, as I don't
remember manual defrag ever being associated with that problem (tho it did
and does still have the reflinks/snapshots problem, but that's a totally
different issue).


Meanwhile, it's news to me that autodefrag doesn't have that problem any
longer...





-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2016-06-08  6:14 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-01 22:25 raid5/6 production use status? Christoph Anton Mitterer
2016-06-02  9:24 ` Gerald Hopf
2016-06-02  9:35   ` Hugo Mills
2016-06-02 10:03     ` Gerald Hopf
2016-06-03 17:38   ` btrfs (was: raid5/6) production use status (and future)? Christoph Anton Mitterer
2016-06-03 19:50     ` btrfs Austin S Hemmelgarn
2016-06-04  1:51       ` btrfs Christoph Anton Mitterer
2016-06-04  7:24         ` btrfs Andrei Borzenkov
2016-06-04 17:00           ` btrfs Chris Murphy
2016-06-04 17:37             ` btrfs Christoph Anton Mitterer
2016-06-04 19:13               ` btrfs Chris Murphy
2016-06-04 22:43                 ` btrfs Christoph Anton Mitterer
2016-06-05 15:51                   ` btrfs Chris Murphy
2016-06-05 20:39                     ` btrfs Christoph Anton Mitterer
2016-06-04 21:18             ` btrfs Andrei Borzenkov
2016-06-05 20:39         ` btrfs Henk Slager
2016-06-05 20:56           ` btrfs Christoph Anton Mitterer
2016-06-05 21:07             ` btrfs Hugo Mills
2016-06-05 21:31               ` btrfs Christoph Anton Mitterer
2016-06-05 23:39                 ` btrfs Chris Murphy
2016-06-08  6:13                 ` btrfs Duncan
2016-06-06  0:56         ` btrfs Chris Murphy
2016-06-06 13:04         ` btrfs Austin S. Hemmelgarn
     [not found]     ` <f4a9ef2f-99a8-bcc4-5a8f-b022914980f0@swiftspirit.co.za>
2016-06-04  2:13       ` btrfs Christoph Anton Mitterer
2016-06-04  2:36         ` btrfs Chris Murphy
