linux-btrfs.vger.kernel.org archive mirror
* ditto blocks on ZFS
@ 2014-05-16  3:07 Russell Coker
  2014-05-17 12:50 ` Martin
  0 siblings, 1 reply; 18+ messages in thread
From: Russell Coker @ 2014-05-16  3:07 UTC (permalink / raw)
  To: linux-btrfs

https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape

Probably most of you already know about this, but for those who don't, the 
above post describes ZFS "ditto blocks", a good feature that we need on 
BTRFS.  The briefest summary is that, on top of the RAID redundancy, there is 
one more copy of metadata than there is of data, so copies=2 implies 3 copies 
of metadata, and the default of 1 copy of data means that metadata is 
effectively "dup" in addition to whatever RAID redundancy is in place.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



* Re: ditto blocks on ZFS
  2014-05-16  3:07 ditto blocks on ZFS Russell Coker
@ 2014-05-17 12:50 ` Martin
  2014-05-17 14:24   ` Hugo Mills
  2014-05-18 16:09   ` Russell Coker
  0 siblings, 2 replies; 18+ messages in thread
From: Martin @ 2014-05-17 12:50 UTC (permalink / raw)
  To: linux-btrfs

On 16/05/14 04:07, Russell Coker wrote:
> https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
> 
> Probably most of you already know about this, but for those of you who haven't 
> the above describes ZFS "ditto blocks" which is a good feature we need on 
> BTRFS.  The briefest summary is that on top of the RAID redundancy there...
[... are additional copies of metadata ...]


Isn't that idea already implemented, in effect, in btrfs by the way the
superblocks are replicated multiple times, with more copies on ever
larger storage devices?
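
For reference, those superblock mirrors sit at fixed offsets, so you can see
them directly on a raw device.  A hedged sketch, with /dev/sdX as a
placeholder and the offsets/magic taken from the published btrfs on-disk
format:

  # primary superblock at 64KiB, mirrors at 64MiB and, on big enough
  # devices, 256GiB; the magic string sits 64 bytes into each copy
  for off in 65536 67108864 274877906944; do
      dd if=/dev/sdX bs=1 skip=$((off + 64)) count=8 2>/dev/null | hexdump -C
  done
  # each copy that exists should print the ASCII magic "_BHRfS_M"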

The one exception is SSDs, where the argument is that you cannot know
whether your data is usefully replicated across different erase blocks
on a single device, and SSDs are not 'that big' anyhow.


So... Your idea of replicating metadata multiple times in proportion to
assumed 'importance' or 'extent of impact if lost' is an interesting
approach. However, is that appropriate and useful considering the real
world failure mechanisms that are to be guarded against?

Do you see or measure any real advantage?


Regards,
Martin



* Re: ditto blocks on ZFS
  2014-05-17 12:50 ` Martin
@ 2014-05-17 14:24   ` Hugo Mills
  2014-05-18 16:09   ` Russell Coker
  1 sibling, 0 replies; 18+ messages in thread
From: Hugo Mills @ 2014-05-17 14:24 UTC (permalink / raw)
  To: Martin; +Cc: linux-btrfs


On Sat, May 17, 2014 at 01:50:52PM +0100, Martin wrote:
> On 16/05/14 04:07, Russell Coker wrote:
> > https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
> > 
> > Probably most of you already know about this, but for those of you who haven't 
> > the above describes ZFS "ditto blocks" which is a good feature we need on 
> > BTRFS.  The briefest summary is that on top of the RAID redundancy there...
> [... are additional copies of metadata ...]
> 
> 
> Is that idea not already implemented in effect in btrfs with the way
> that the superblocks are replicated multiple times, ever more times, for
> ever more huge storage devices?

   Superblocks are the smallest part of the metadata. There's a whole
load of metadata outside the superblocks that isn't replicated in this
way.

> The one exception is for SSDs whereby there is the excuse that you
> cannot know whether your data is usefully replicated across different
> erase blocks on a single device, and SSDs are not 'that big' anyhow.
> 
> 
> So... Your idea of replicating metadata multiple times in proportion to
> assumed 'importance' or 'extent of impact if lost' is an interesting
> approach. However, is that appropriate and useful considering the real
> world failure mechanisms that are to be guarded against?
> 
> Do you see or measure any real advantage?

   This. How many copies do you actually need? Are there concrete
statistics to show the marginal utility of each additional copy?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
     --- IMPROVE YOUR ORGANISMS!!  -- Subject line of spam email ---     



* Re: ditto blocks on ZFS
  2014-05-17 12:50 ` Martin
  2014-05-17 14:24   ` Hugo Mills
@ 2014-05-18 16:09   ` Russell Coker
  2014-05-19 20:36     ` Martin
  1 sibling, 1 reply; 18+ messages in thread
From: Russell Coker @ 2014-05-18 16:09 UTC (permalink / raw)
  To: Martin; +Cc: linux-btrfs

On Sat, 17 May 2014 13:50:52 Martin wrote:
> On 16/05/14 04:07, Russell Coker wrote:
> > https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
> > 
> > Probably most of you already know about this, but for those of you who
> > haven't the above describes ZFS "ditto blocks" which is a good feature we
> > need on BTRFS.  The briefest summary is that on top of the RAID
> > redundancy there...
> [... are additional copies of metadata ...]
> 
> 
> Is that idea not already implemented in effect in btrfs with the way
> that the superblocks are replicated multiple times, ever more times, for
> ever more huge storage devices?

No.  If the metadata for the root directory is corrupted then everything is 
lost even if the superblock is OK.  At every level in the directory tree, a 
corruption loses everything below that level; a corruption of /home would be 
very significant, as would a corruption of /home/importantuser/major-project.

> The one exception is for SSDs whereby there is the excuse that you
> cannot know whether your data is usefully replicated across different
> erase blocks on a single device, and SSDs are not 'that big' anyhow.

I am not convinced by that argument.  While you can't know that it's usefully 
replicated, you also can't say for sure that replication will never save you.  
There will surely be some random factors involved.  If dup on SSD saves 
you from 50% of corruption problems, is it worth doing?  What if it's 80% or 
20%?

I have BTRFS running as the root filesystem on Intel SSDs on four machines 
(one of which is a file server with a pair of large disks in a BTRFS RAID-1).  
On all of those systems I have dup for metadata; it doesn't take up space I 
need for anything else, and it might save me.
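
For anyone wanting the same setup, a sketch of one way to get dup metadata on
an SSD (device and mount point are placeholders; mkfs.btrfs otherwise drops
to single metadata when it detects a non-rotational device):

  # ask for dup metadata explicitly at mkfs time
  mkfs.btrfs -m dup -d single /dev/sdX

  # or convert the metadata profile of an existing filesystem
  btrfs balance start -mconvert=dup /mountpoint

  # confirm what you ended up with
  btrfs filesystem df /mountpoint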

> So... Your idea of replicating metadata multiple times in proportion to
> assumed 'importance' or 'extent of impact if lost' is an interesting
> approach. However, is that appropriate and useful considering the real
> world failure mechanisms that are to be guarded against?

Firstly, it's not my idea; it's the idea of the ZFS developers.  Secondly, I 
started reading about this after doing some experiments with a failing SATA 
disk.  In spite of having ~14,000 read errors (which sounds like a lot but is 
a small fraction of a 2TB disk), the vast majority of the data was readable, 
largely due to ~2,000 errors corrected by dup metadata.

> Do you see or measure any real advantage?

Imagine that you have a RAID-1 array where both disks get ~14,000 read errors.  
This could happen due to a design defect common to drives of a particular 
model or some shared environmental problem.  Most errors would be corrected by 
RAID-1, but there would be a risk of some data being lost due to both copies 
being corrupt.  Another possibility is that one disk could die entirely 
(although total disk death seems rare nowadays) and the other could have 
corruption.  If metadata were duplicated in addition to being on both disks 
then the probability of data loss would be reduced.

Another issue is the case where all drive slots are filled with active drives 
(a very common configuration).  To replace a disk you have to physically 
remove the old disk before adding the new one.  If the array is a RAID-1 or 
RAID-5 then ANY error during reconstruction loses data.  Using dup for 
metadata on top of the RAID protections (i.e. the ZFS ditto idea) means that a 
metadata error in that situation doesn't lose you data.
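
As a back-of-envelope sketch of why the extra copies multiply up (assuming
~14,000 bad 4KiB sectors per 2TB disk and, less realistically, that the bad
spots land independently on each disk):

  awk 'BEGIN {
      sectors = 2e12 / 4096;   # ~4.9e8 4KiB sectors per 2TB disk
      p = 14000 / sectors;     # chance a given sector is bad on one disk
      printf "per-block loss odds, RAID-1 alone:    %.1e\n", p^2;
      printf "per-block loss odds, RAID-1 plus dup: %.1e\n", p^4;
  }'

Even rough numbers like these show each extra copy buying several orders of
magnitude, which is the whole argument for ditto blocks on top of RAID.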

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



* Re: ditto blocks on ZFS
  2014-05-18 16:09   ` Russell Coker
@ 2014-05-19 20:36     ` Martin
  2014-05-19 21:47       ` Brendan Hide
  0 siblings, 1 reply; 18+ messages in thread
From: Martin @ 2014-05-19 20:36 UTC (permalink / raw)
  To: linux-btrfs

On 18/05/14 17:09, Russell Coker wrote:
> On Sat, 17 May 2014 13:50:52 Martin wrote:
[...]
>> Do you see or measure any real advantage?
> 
> Imagine that you have a RAID-1 array where both disks get ~14,000 read errors.  
> This could happen due to a design defect common to drives of a particular 
> model or some shared environmental problem.  Most errors would be corrected by 
> RAID-1 but there would be a risk of some data being lost due to both copies 
> being corrupt.  Another possibility is that one disk could entirely die 
> (although total disk death seems rare nowadays) and the other could have 
> corruption.  If metadata was duplicated in addition to being on both disks 
> then the probability of data loss would be reduced.
> 
> Another issue is the case where all drive slots are filled with active drives 
> (a very common configuration).  To replace a disk you have to physically 
> remove the old disk before adding the new one.  If the array is a RAID-1 or 
> RAID-5 then ANY error during reconstruction loses data.  Using dup for 
> metadata on top of the RAID protections (IE the ZFS ditto idea) means that 
> case doesn't lose you data.

Your example there is, in effect, the case where there is no RAID.  How
is that any better than what btrfs already does by duplicating
metadata?



So...


What real-world failure modes do the ditto blocks usefully protect against?

And how do the failure rates compare with what is already done?


For example, we have RAID1 and RAID5 to protect against any one RAID
chunk being corrupted or the total loss of any one device.

The catch is that no further failure can be tolerated until the RAID is
rebuilt.


Hence we have RAID6, which protects against any two failures of a chunk
or device: with just one failure, you can still tolerate a second
failure whilst rebuilding the RAID.


And then we supposedly have safety-by-design where the filesystem itself
is using a journal and barriers/sync to ensure that the filesystem is
always kept in a consistent state, even after an interruption to any writes.


*What other failure modes* should we guard against?


There has been mention of fixing metadata keys after single bit flips...

Should Hamming codes be used instead of a CRC, so that we get
multiple-bit error detection and single-bit error correction for
all data, both in RAM and on disk, on systems that do not use ECC RAM?

Would that be useful?...


Regards,
Martin



* Re: ditto blocks on ZFS
  2014-05-19 20:36     ` Martin
@ 2014-05-19 21:47       ` Brendan Hide
  2014-05-20  2:07         ` Russell Coker
  0 siblings, 1 reply; 18+ messages in thread
From: Brendan Hide @ 2014-05-19 21:47 UTC (permalink / raw)
  To: Martin, linux-btrfs

On 2014/05/19 10:36 PM, Martin wrote:
> On 18/05/14 17:09, Russell Coker wrote:
>> On Sat, 17 May 2014 13:50:52 Martin wrote:
> [...]
>>> Do you see or measure any real advantage?
>> [snip]
This is extremely difficult to measure objectively. Subjectively ... see 
below.
> [snip]
>
> *What other failure modes* should we guard against?

I know I'd sleep a /little/ better at night knowing that a double disk 
failure on a "raid5/1/10" configuration might ruin a ton of data along 
with an obscure set of metadata in some "long" tree paths - but not the 
entire filesystem.

The other use-case/failure mode - where you are somehow unlucky enough 
to have sets of bad sectors/bitrot on multiple disks that simultaneously 
affect the only copies of the tree roots - is an extremely unlikely 
scenario. As unlikely as it may be, it carries a very painful 
consequence despite involving VERY little corruption. That is where the 
peace-of-mind/bragging rights come in.

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



* Re: ditto blocks on ZFS
  2014-05-19 21:47       ` Brendan Hide
@ 2014-05-20  2:07         ` Russell Coker
  2014-05-20 14:07           ` Austin S Hemmelgarn
                             ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Russell Coker @ 2014-05-20  2:07 UTC (permalink / raw)
  To: Brendan Hide, linux-btrfs

On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
> This is extremely difficult to measure objectively. Subjectively ... see
> below.
> 
> > [snip]
> > 
> > *What other failure modes* should we guard against?
> 
> I know I'd sleep a /little/ better at night knowing that a double disk
> failure on a "raid5/1/10" configuration might ruin a ton of data along
> with an obscure set of metadata in some "long" tree paths - but not the
> entire filesystem.

My experience is that most disk failures that don't involve extreme physical 
damage (e.g. dropping a drive on concrete) don't involve totally losing the 
disk.  Much of the discussion about RAID failures concerns entirely failed 
disks, but I believe that is because RAID implementations such as Linux 
software RAID will entirely remove a disk once it gives errors.

I have a disk which had ~14,000 errors, of which ~2,000 were corrected by 
duplicate metadata.  If two disks with that problem were in a RAID-1 array 
then duplicate metadata would be a significant benefit.

> The other use-case/failure mode - where you are somehow unlucky enough
> to have sets of bad sectors/bitrot on multiple disks that simultaneously
> affect the only copies of the tree roots - is an extremely unlikely
> scenario. As unlikely as it may be, the scenario is a very painful
> consequence in spite of VERY little corruption. That is where the
> peace-of-mind/bragging rights come in.

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The NetApp research on latent errors on drives is worth reading.  On page 12 
they report latent sector errors on 9.5% of SATA disks per year.  So if you 
lose one disk entirely the risk of having errors on a second disk is higher 
than you would want for RAID-5.  While losing the root of the tree is 
unlikely, losing a directory in the middle that has lots of subdirectories is 
a risk.
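
A quick sketch of why that 9.5%/year figure matters during a rebuild, taking
the number at face value and (optimistically) treating the surviving disks as
independent:

  awk 'BEGIN {
      p = 0.095;   # latent sector error rate per SATA disk per year, from the paper
      for (n = 2; n <= 8; n++)
          printf "%d disks, 1 lost: %.0f%% chance a survivor has a latent error\n", n, (1 - (1 - p)^(n - 1)) * 100;
  }'

With only a handful of disks the chance of hitting a latent error somewhere
during a RAID-5 style rebuild is already uncomfortably high, which is the
scenario that extra metadata copies (or RAID-6) are meant to soften.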

I can understand why people wouldn't want ditto blocks to be mandatory.  But 
why are people arguing against them as an option?


As an aside, I'd really like to be able to set RAID levels by subtree.  I'd 
like to use RAID-1 with ditto blocks for my important data and RAID-0 for 
unimportant data.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



* Re: ditto blocks on ZFS
  2014-05-20  2:07         ` Russell Coker
@ 2014-05-20 14:07           ` Austin S Hemmelgarn
  2014-05-20 20:11             ` Brendan Hide
  2014-05-20 14:56           ` ashford
  2014-05-21 23:29           ` Konstantinos Skarlatos
  2 siblings, 1 reply; 18+ messages in thread
From: Austin S Hemmelgarn @ 2014-05-20 14:07 UTC (permalink / raw)
  To: russell, Brendan Hide, linux-btrfs


On 2014-05-19 22:07, Russell Coker wrote:
> On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
>> This is extremely difficult to measure objectively. Subjectively ... see
>> below.
>>
>>> [snip]
>>>
>>> *What other failure modes* should we guard against?
>>
>> I know I'd sleep a /little/ better at night knowing that a double disk
>> failure on a "raid5/1/10" configuration might ruin a ton of data along
>> with an obscure set of metadata in some "long" tree paths - but not the
>> entire filesystem.
> 
> My experience is that most disk failures that don't involve extreme physical 
> damage (EG dropping a drive on concrete) don't involve totally losing the 
> disk.  Much of the discussion about RAID failures concerns entirely failed 
> disks, but I believe that is due to RAID implementations such as Linux 
> software RAID that will entirely remove a disk when it gives errors.
> 
> I have a disk which had ~14,000 errors of which ~2000 errors were corrected by 
> duplicate metadata.  If two disks with that problem were in a RAID-1 array 
> then duplicate metadata would be a significant benefit.
> 
>> The other use-case/failure mode - where you are somehow unlucky enough
>> to have sets of bad sectors/bitrot on multiple disks that simultaneously
>> affect the only copies of the tree roots - is an extremely unlikely
>> scenario. As unlikely as it may be, the scenario is a very painful
>> consequence in spite of VERY little corruption. That is where the
>> peace-of-mind/bragging rights come in.
> 
> http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
> 
> The NetApp research on latent errors on drives is worth reading.  On page 12 
> they report latent sector errors on 9.5% of SATA disks per year.  So if you 
> lose one disk entirely the risk of having errors on a second disk is higher 
> than you would want for RAID-5.  While losing the root of the tree is 
> unlikely, losing a directory in the middle that has lots of subdirectories is 
> a risk.
> 
> I can understand why people wouldn't want ditto blocks to be mandatory.  But 
> why are people arguing against them as an option?
> 
> 
> As an aside, I'd really like to be able to set RAID levels by subtree.  I'd 
> like to use RAID-1 with ditto blocks for my important data and RAID-0 for 
> unimportant data.
> 
But the proposed changes for n-way replication would already handle
this.  They would just need the option of having more than one copy per
device (which theoretically shouldn't be too hard once you have n-way
replication).  BTRFS already has the option of replicating the
root tree across multiple devices (it is included in the System Data
subset), and in fact does so by default when using multiple devices.
There are also plans for per-subvolume or per-file RAID level
selection, but IIRC that is planned for after n-way replication (and of
course after RAID 5/6, as n-way replication isn't going to be implemented
until RAID 5/6 is done).




* Re: ditto blocks on ZFS
  2014-05-20  2:07         ` Russell Coker
  2014-05-20 14:07           ` Austin S Hemmelgarn
@ 2014-05-20 14:56           ` ashford
  2014-05-21  2:51             ` Russell Coker
  2014-05-21 23:29           ` Konstantinos Skarlatos
  2 siblings, 1 reply; 18+ messages in thread
From: ashford @ 2014-05-20 14:56 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ahferroin7, russell, brendan

I’ve been reading this list for a few years, and giving almost no
feedback, but I feel that this subject demands that I provide some input.

I can think of five possible effects of implementing ditto blocks for the
metadata.  We've only been discussing one (#3 in my list) in this thread. 
While most of these effects are fairly obvious, I have seen no discussion
of the others.

In discussing the issues of implementing ditto blocks, I think it would be
good to address all of the potential effects, and determine from that
discussion whether or not the enhancement should be made, and, if so, when
the appropriate development resources should be made available.  As Austin
pointed out, there are some enhancements currently planned which would
make the implementation of ditto blocks simpler.  I believe that defines
the earliest good time for implementation of ditto blocks.

1.  There will be more disk space used by the metadata.  I've been aware
of space allocation issues in BTRFS for more than three years.  If the use
of ditto blocks will make this issue worse, then it's probably not a good
idea to implement it.  The actual increase in metadata space is probably
small in most circumstances.

2.  Use of ditto blocks will increase write bandwidth to the disk.  This
is a direct and unavoidable result of having more copies of the metadata. 
The actual impact of this would depend on the file-system usage pattern,
but would probably be unnoticeable in most circumstances.  Does anyone
have a “worst-case” scenario for testing?

3.  Certain kinds of disk errors would be easier to recover from.  Some
people here claim that those specific errors are rare.  I have no opinion
on how often they happen, but I believe that if the overall disk space
cost is low, it will have a reasonable return.  There would be virtually
no reliability gains on an SSD-based file-system, as the ditto blocks
would be written at the same time, and the SSD would be likely to map the
logical blocks into the same page of flash memory.

4.  If the BIO layer of BTRFS and the device driver are smart enough,
ditto blocks could reduce I/O wait time.  This is a direct result of
having more instances of the data on the disk, so it's likely that there
will be a ditto block closer to where the disk head is currently.  The
actual benefit for disk-based file-systems is likely to be under 1ms per
metadata seek.  It's possible that a short-term backlog on one disk could
cause BTRFS to use a ditto block on another disk, which could save
>20ms on that access.  There would be no performance benefit for SSD-based
file-systems.

5.  There will be a (hopefully short) period where the code may be
slightly less stable, due to the modifications being performed at a
low-level within the file-system.  This is likely to happen with any
modification of the file-system code, with more complex modifications
being more likely to introduce instability.  I believe that the overall
complexity of this particular modification is great enough that there may
be some added instability for a bit, but perhaps use of the n-way
replication feature will substantially reduce the complexity.  Hopefully,
the integration testing that’s being performed on the BTRFS code will find
most of the new bugs, and point the core developers in the right direction
to fix them.



I have one final note about RAID levels.  I build and sell file servers as
a side job, having assembled and delivered over 100 file servers storing
several hundreds of TB.  TTBOMK, no system that I’ve built to my own
specifications (not overridden by customer requests) has lost any data
during the first 3 years of operation.  One customer requested a disk
manufacturer change, and has lost data.  A few systems have had data loss
in the 4-year timeframe, due to multiple drive failure, combined with
inadequate disk status monitoring.

My experience is that once your disks are larger than about 500-750GB,
RAID-6 becomes a much better choice, due to the increased chances of
having an uncorrectable read error during a reconstruct.  My opinion is
that anyone storing critical information in RAID-5, or even 2-disk RAID-1,
with disks of this capacity, should either reconsider their storage
topology, or verify that they have a good backup/restore mechanism in
place for that data.

Thank you.

Peter Ashford



* Re: ditto blocks on ZFS
  2014-05-20 14:07           ` Austin S Hemmelgarn
@ 2014-05-20 20:11             ` Brendan Hide
  0 siblings, 0 replies; 18+ messages in thread
From: Brendan Hide @ 2014-05-20 20:11 UTC (permalink / raw)
  To: Austin S Hemmelgarn, russell, linux-btrfs

On 2014/05/20 04:07 PM, Austin S Hemmelgarn wrote:
> On 2014-05-19 22:07, Russell Coker wrote:
>> [snip]
>> As an aside, I'd really like to be able to set RAID levels by subtree.  I'd
>> like to use RAID-1 with ditto blocks for my important data and RAID-0 for
>> unimportant data.
>>
> But the proposed changes for n-way replication would already handle
> this.
> [snip]
>
Russell's specific request above is probably best handled by being able 
to change replication levels per subvolume - this won't be handled by 
N-way replication.

Extra replication on leaf nodes will make relatively little difference 
in the scenarios laid out in this thread - but on "trunk" nodes (folders 
or subvolumes closer to the filesystem root) it makes a significant 
difference. "Plain" N-way replication doesn't flexibly treat these two 
nodes differently.

As an example, Russell might have a server with two disks - yet he wants 
6 copies of all metadata for subvolumes and their immediate subfolders. 
At three folders deep he "only" wants to have 4 copies. At six folders 
deep, only 2. Ditto blocks add an attractive safety net without 
unnecessarily doubling or tripling the size of *all* metadata.

It is a good idea. The next question to me is whether or not it is 
something that can be implemented elegantly and whether or not a 
talented *dev* thinks it is a good idea.

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



* Re: ditto blocks on ZFS
  2014-05-20 14:56           ` ashford
@ 2014-05-21  2:51             ` Russell Coker
  2014-05-21 23:05               ` Martin
  2014-05-22 22:09               ` ashford
  0 siblings, 2 replies; 18+ messages in thread
From: Russell Coker @ 2014-05-21  2:51 UTC (permalink / raw)
  To: ashford; +Cc: linux-btrfs

On Tue, 20 May 2014 07:56:41 ashford@whisperpc.com wrote:
> 1.  There will be more disk space used by the metadata.  I've been aware
> of space allocation issues in BTRFS for more than three years.  If the use
> of ditto blocks will make this issue worse, then it's probably not a good
> idea to implement it.  The actual increase in metadata space is probably
> small in most circumstances.

Data, RAID1: total=2.51TB, used=2.50TB
System, RAID1: total=32.00MB, used=376.00KB
Metadata, RAID1: total=28.25GB, used=26.63GB

The above is my home RAID-1 array.  It includes multiple backup copies of a 
medium-size Maildir format mail spool which probably accounts for a 
significant portion of the used space; the Maildir spool has an average file 
size of about 70K and lots of hard links between different versions of the 
backup.  Even so the metadata is only 1% of the total used space.  Going from 
1% to 2% to improve reliability really isn't a problem.

Data, RAID1: total=140.00GB, used=139.60GB
System, RAID1: total=32.00MB, used=28.00KB
Metadata, RAID1: total=4.00GB, used=2.97GB

Above is a small Xen server which uses snapshots to back up the files for Xen 
block devices (the system is lightly loaded so I don't use nocow) and for data 
files that include a small Maildir spool.  It's still only 2% of disk space 
used for metadata; again, going from 2% to 4% isn't going to be a great 
problem.

> 2.  Use of ditto blocks will increase write bandwidth to the disk.  This
> is a direct and unavoidable result of having more copies of the metadata.
> The actual impact of this would depend on the file-system usage pattern,
> but would probably be unnoticeable in most circumstances.  Does anyone
> have a “worst-case” scenario for testing?

The ZFS design involves ditto blocks being spaced apart, due to the fact that 
corruption tends to have some spatial locality.  So you are adding an extra 
seek.

The worst case would be lots of small synchronous writes; the default 
configuration of Maildir delivery would probably be a good test case.
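
Something like this (untested, paths are placeholders) is a crude stand-in
for Maildir-style delivery -- lots of small files, each flushed individually,
though it skips the tmp/->new/ rename and directory fsync a real MTA does:

  mkdir -p /mnt/test/new
  for i in $(seq 1 10000); do
      # conv=fsync makes dd flush each file's data before exiting
      dd if=/dev/urandom of=/mnt/test/new/msg.$i bs=4k count=2 conv=fsync 2>/dev/null
  done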

As an aside, I've been thinking of patching a mail server to do a sleep() 
before fsync() on mail delivery to see if that improves aggregate performance.  
My theory is that with dozens of concurrent delivery attempts, if they all 
sleep() before fsync() then the filesystem could write out metadata for 
multiple files in one pass in the most efficient manner.

> 3.  Certain kinds of disk errors would be easier to recover from.  Some
> people here claim that those specific errors are rare.

All errors are rare.  :-#

Seriously, you can run Ext4 on a single disk for years and probably not lose 
data.  It's just a matter of how many disks and how much reliability you want.

> I have no opinion
> on how often they happen, but I believe that if the overall disk space
> cost is low, it will have a reasonable return.  There would be virtually
> no reliability gains on an SSD-based file-system, as the ditto blocks
> would be written at the same time, and the SSD would be likely to map the
> logical blocks into the same page of flash memory.

That claim is unproven AFAIK.  On SSD the performance cost of such things is 
negligible (no seek cost) and losing 1% of disk space isn't a problem for most 
systems (admittedly the early SSDs were small).

> 4.  If the BIO layer of BTRFS and the device driver are smart enough,
> ditto blocks could reduce I/O wait time.  This is a direct result of
> having more instances of the data on the disk, so it's likely that there
> will be a ditto block closer to where the disk head is currently.  The
> actual benefit for disk-based file-systems is likely to be under 1ms per
> metadata seek.  It's possible that a short-term backlog on one disk could
> cause BTRFS to use a ditto block on another disk, which could deliver
> >20ms of performance.  There would be no performance benefit for SSD-based
> file-systems.

That is likely with RAID-5 and RAID-10.

> My experience is that once your disks are larger than about 500-750GB,
> RAID-6 becomes a much better choice, due to the increased chances of
> having an uncorrectable read error during a reconstruct.  My opinion is
> that anyone storing critical information in RAID-5, or even 2-disk RAID-1,
> with disks of this capacity, should either reconsider their storage
> topology, or verify that they have a good backup/restore mechanism in
> place for that data.

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The NetApp research shows that the incidence of silent corruption is a lot 
greater than you would expect.  RAID-6 doesn't save you from this.  You need 
BTRFS or ZFS RAID-6.


On Tue, 20 May 2014 22:11:16 Brendan Hide wrote:
> Extra replication on leaf nodes will make relatively little difference 
> in the scenarios laid out in this thread - but on "trunk" nodes (folders 
> or subvolumes closer to the filesystem root) it makes a significant 
> difference. "Plain" N-way replication doesn't flexibly treat these two 
> nodes differently.
> 
> As an example, Russell might have a server with two disks - yet he wants 
> 6 copies of all metadata for subvolumes and their immediate subfolders. 
> At three folders deep he "only" wants to have 4 copies. At six folders 
> deep, only 2. Ditto blocks add an attractive safety net without 
> unnecessarily doubling or tripling the size of *all* metadata.

Firstly, I don't think that doubling all metadata is a real problem.

Next, the amount of duplicate metadata can't be determined by depth.  For 
example, I have a mail server where an outage of the entire server is 
preferable to losing email.  I would set more ditto blocks for /mail than for 
the root subvol.  In that case I'd want the metadata for the root directory to 
have the same replication as /mail, but nothing special for /home etc.

Hypothetically, if metadata duplication consumed any significant disk space 
then I'd probably want to enable it only on /lib*, /sbin, /etc, and whatever 
data the server is designed to hold.  But really it's small enough to just 
duplicate everything.

Currently I only run two systems for which I can't more than double the disk 
space at a moderate cost.  One is my EeePC 701 and the other is a ZFS archive 
server (which already has the ditto blocks).  For all the other systems there 
is no shortage of space at all.  Disks just keep getting bigger and cheaper; 
for most of my uses disk size increases faster than the data stored.

Currently the smallest SATA disk I can buy new is 500G.  The smallest SSD is 
60G for $63 but I can get 120G for $82, 240G for $149, or 480G for $295.  All 
the workstations I run use a lot less than 120G of storage.  Storage capacity 
isn't an issue for most users.

It seems to me that the only time when an extra 1% disk space usage would 
really matter is when you have an array of 100 disks that's almost full.  But 
that's the time when you REALLY want extra duplication of metadata.

> It is a good idea. The next question to me is whether or not it is 
> something that can be implemented elegantly and whether or not a 
> talented *dev* thinks it is a good idea.

Absolutely.  Hopefully this discussion will inspire the developers to consider 
this an interesting technical challenge and a feature that is needed to beat 
ZFS.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



* Re: ditto blocks on ZFS
  2014-05-21  2:51             ` Russell Coker
@ 2014-05-21 23:05               ` Martin
  2014-05-22 11:10                 ` Austin S Hemmelgarn
  2014-05-22 22:09               ` ashford
  1 sibling, 1 reply; 18+ messages in thread
From: Martin @ 2014-05-21 23:05 UTC (permalink / raw)
  To: linux-btrfs

Very good comment from Ashford.


Sorry, but I see no advantage in Russell's replies other than a
"feel-good" factor or a dangerous false sense of security. At best,
there is a weak justification that "for metadata, again going from 2% to
4% isn't going to be a great problem" (storage is cheap and fast).

I thought an important idea behind btrfs was that, by design, we avoid in
the first place the very long and vulnerable RAID rebuild scenarios
suffered by block-level RAID...


On 21/05/14 03:51, Russell Coker wrote:
> Absolutely. Hopefully this discussion will inspire the developers to
> consider this an interesting technical challenge and a feature that
> is needed to beat ZFS.

Sorry, but I think that is completely the wrong reasoning. ...Unless,
that is, you are some proprietary sales droid hyping features and big
numbers! :-P


Personally, I'm not convinced we gain anything beyond what btrfs will
eventually offer in any case with n-way RAID or the RAID-n Cauchy stuff.

Also note that usually you want data to be 100% reliable and
retrievable, and if that fails, you go to your backups instead. Gambling
on "proportions" and "importance" rather than *ensuring* fault/error
tolerance is a very human thing... ;-)


Sorry:

Interesting idea, but I'm not convinced there's any advantage for disk/SSD
storage.


Regards,
Martin






* Re: ditto blocks on ZFS
  2014-05-20  2:07         ` Russell Coker
  2014-05-20 14:07           ` Austin S Hemmelgarn
  2014-05-20 14:56           ` ashford
@ 2014-05-21 23:29           ` Konstantinos Skarlatos
  2 siblings, 0 replies; 18+ messages in thread
From: Konstantinos Skarlatos @ 2014-05-21 23:29 UTC (permalink / raw)
  To: russell, Brendan Hide, linux-btrfs

On 20/5/2014 5:07 am, Russell Coker wrote:
> On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
>> This is extremely difficult to measure objectively. Subjectively ... see
>> below.
>>
>>> [snip]
>>>
>>> *What other failure modes* should we guard against?
>> I know I'd sleep a /little/ better at night knowing that a double disk
>> failure on a "raid5/1/10" configuration might ruin a ton of data along
>> with an obscure set of metadata in some "long" tree paths - but not the
>> entire filesystem.
> My experience is that most disk failures that don't involve extreme physical
> damage (EG dropping a drive on concrete) don't involve totally losing the
> disk.  Much of the discussion about RAID failures concerns entirely failed
> disks, but I believe that is due to RAID implementations such as Linux
> software RAID that will entirely remove a disk when it gives errors.
>
> I have a disk which had ~14,000 errors of which ~2000 errors were corrected by
> duplicate metadata.  If two disks with that problem were in a RAID-1 array
> then duplicate metadata would be a significant benefit.
>
>> The other use-case/failure mode - where you are somehow unlucky enough
>> to have sets of bad sectors/bitrot on multiple disks that simultaneously
>> affect the only copies of the tree roots - is an extremely unlikely
>> scenario. As unlikely as it may be, the scenario is a very painful
>> consequence in spite of VERY little corruption. That is where the
>> peace-of-mind/bragging rights come in.
> http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
>
> The NetApp research on latent errors on drives is worth reading.  On page 12
> they report latent sector errors on 9.5% of SATA disks per year.  So if you
> lose one disk entirely the risk of having errors on a second disk is higher
> than you would want for RAID-5.  While losing the root of the tree is
> unlikely, losing a directory in the middle that has lots of subdirectories is
> a risk.
Seeing the results of that paper, I think erasure coding is a better 
solution. Instead of having many copies of metadata or data, we could do 
erasure coding using something like zfec[1], which is used by 
Tahoe-LAFS, increasing their size by, let's say, 5-10%, and be quite safe 
even from runs of contiguous bad sectors.

[1] https://pypi.python.org/pypi/zfec
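
For anyone wanting to experiment, zfec ships small command-line tools along
with the library; the flags below are as per its documentation, so worth
double-checking against the version you install:

  # split bigfile into m=10 shares, any k=8 of which can rebuild it
  # (roughly 25% space overhead at these settings)
  zfec -m 10 -k 8 bigfile
  # rebuilding: feed zunfec any 8 of the generated .fec share files, e.g.
  #   zunfec -o bigfile.recovered <eight of the .fec files>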
>
> I can understand why people wouldn't want ditto blocks to be mandatory.  But
> why are people arguing against them as an option?
>
>
> As an aside, I'd really like to be able to set RAID levels by subtree.  I'd
> like to use RAID-1 with ditto blocks for my important data and RAID-0 for
> unimportant data.
>



* Re: ditto blocks on ZFS
  2014-05-21 23:05               ` Martin
@ 2014-05-22 11:10                 ` Austin S Hemmelgarn
  0 siblings, 0 replies; 18+ messages in thread
From: Austin S Hemmelgarn @ 2014-05-22 11:10 UTC (permalink / raw)
  To: Martin, linux-btrfs


On 2014-05-21 19:05, Martin wrote:
> Very good comment from Ashford.
> 
> 
> Sorry, but I see no advantages from Russell's replies other than for a
> "feel-good" factor or a dangerous false sense of security. At best,
> there is a weak justification that "for metadata, again going from 2% to
> 4% isn't going to be a great problem" (storage is cheap and fast).
> 
> I thought an important idea behind btrfs was that we avoid by design in
> the first place the very long and vulnerable RAID rebuild scenarios
> suffered for block-level RAID...
> 
> 
> On 21/05/14 03:51, Russell Coker wrote:
>> Absolutely. Hopefully this discussion will inspire the developers to
>> consider this an interesting technical challenge and a feature that
>> is needed to beat ZFS.
> 
> Sorry, but I think that is completely the wrong reasoning. ...Unless
> that is you are some proprietary sales droid hyping features and big
> numbers! :-P
> 
> 
> Personally I'm not convinced we gain anything beyond what btrfs will
> eventually offer in any case for the n-way raid or the raid-n Cauchy stuff.
> 
> Also note that usually, data is wanted to be 100% reliable and
> retrievable. Or if that fails, you go to your backups instead. Gambling
> "proportions" and "importance" rather than *ensuring* fault/error
> tolerance is a very human thing... ;-)
> 
> 
> Sorry:
> 
> Interesting idea but not convinced there's any advantage for disk/SSD
> storage.
> 
> 
> Regards,
> Martin
> 
> 
> 
> 
> 
Another nice option in this case might be adding logic to make sure
that there is some (considerable) offset between copies of metadata
using the dup profile.  All of the filesystems whose low-level on-disk
structures I have actually looked at have had both copies of the
System chunks right next to each other, right at the beginning of the
disk, which of course undermines the usefulness of storing two copies of
them on disk.  Adding an offset in those allocations would provide some
better protection against some of the more common 'idiot' failure modes
(e.g. trying to use dd to write a disk image to a USB flash drive, and
accidentally overwriting the first n GB of your first HDD instead).
Ideally, once we have n-way replication, System chunks should default to
one copy per device for multi-device filesystems.




* Re: ditto blocks on ZFS
  2014-05-21  2:51             ` Russell Coker
  2014-05-21 23:05               ` Martin
@ 2014-05-22 22:09               ` ashford
  2014-05-23  3:54                 ` Russell Coker
  1 sibling, 1 reply; 18+ messages in thread
From: ashford @ 2014-05-22 22:09 UTC (permalink / raw)
  To: russell; +Cc: ashford, linux-btrfs

Russell,

Overall, there are still a lot of unknowns WRT the stability and ROI
(Return On Investment) of implementing ditto blocks for BTRFS.  The good
news is that there's a lot of time before the underlying structure is in
place to support them, so there's time to figure this out a bit better.

> On Tue, 20 May 2014 07:56:41 ashford@whisperpc.com wrote:
>> 1.  There will be more disk space used by the metadata.  I've been aware
>> of space allocation issues in BTRFS for more than three years.  If the
>> use of ditto blocks will make this issue worse, then it's probably not a
>> good idea to implement it.  The actual increase in metadata space is
>> probably small in most circumstances.
>
> Data, RAID1: total=2.51TB, used=2.50TB
> System, RAID1: total=32.00MB, used=376.00KB
> Metadata, RAID1: total=28.25GB, used=26.63GB
>
> The above is my home RAID-1 array.  It includes multiple backup copies of
> a medium size Maildir format mail spool which probably accounts for a
> significant portion of the used space, the Maildir spool has an average
> file size of about 70K and lots of hard links between different versions
> of the backup.  Even so the metadata is only 1% of the total used space.
> Going from 1% to 2% to improve reliability really isn't a problem.
>
> Data, RAID1: total=140.00GB, used=139.60GB
> System, RAID1: total=32.00MB, used=28.00KB
> Metadata, RAID1: total=4.00GB, used=2.97GB
>
> Above is a small Xen server which uses snapshots to backup the files for
> Xen block devices (the system is lightly loaded so I don't use nocow)
> and for data> files that include a small Maildir spool.  It's still only
> 2% of disk space used for metadata, again going from 2% to 4% isn't
> going to be a great problem.

You've addressed half of the issue.  It appears that the metadata is
normally a bit over 1% using the current methods, but two samples do not
make a statistical universe.  The good news is that these two samples are
from opposite extremes of usage, so I expect they're close to where the
overall average would end up.  I'd like to see a few more samples, from
other usage scenarios, just to be sure.

If the above numbers are normal, adding ditto blocks could increase the
size of the metadata from 1% to 2% or even 3%.  This isn't a problem.

What we still don't know, and probably won't until after it's implemented,
is whether or not the addition of ditto blocks will make the space
allocation worse.

>> 2.  Use of ditto blocks will increase write bandwidth to the disk.  This
>> is a direct and unavoidable result of having more copies of the
>> metadata.
>> The actual impact of this would depend on the file-system usage pattern,
>> but would probably be unnoticeable in most circumstances.  Does anyone
>> have a “worst-case” scenario for testing?
>
> The ZFS design involves ditto blocks being spaced apart due to the fact
> that corruption tends to have some spacial locality.  So you are adding
> an extra seek.
>
> The worst case would be when you have lots of small synchronous writes,
> probably the default configuration of Maildir delivery would be a good
> case.

Is there a performance test for this?  That would be helpful in
determining the worst-case performance impact of implementing ditto
blocks, and probably some other enhancements as well.

>> 3.  Certain kinds of disk errors would be easier to recover from.  Some
>> people here claim that those specific errors are rare.  I have no
>> opinion on how often they happen, but I believe that if the overall
>> disk space cost is low, it will have a reasonable return.  There would
>> be virtually no reliability gains on an SSD-based file-system, as the
>> ditto blocks would be written at the same time, and the SSD would be
>> likely to map the logical blocks into the same page of flash memory.
>
> That claim is unproven AFAIK.

That claim is a direct result of how SSDs function.

>> 4.  If the BIO layer of BTRFS and the device driver are smart enough,
>> ditto blocks could reduce I/O wait time.  This is a direct result of
>> having more instances of the data on the disk, so it's likely that there
>> will be a ditto block closer to where the disk head is currently.  The
>> actual benefit for disk-based file-systems is likely to be under 1ms per
>> metadata seek.  It's possible that a short-term backlog on one disk
>> could cause BTRFS to use a ditto block on another disk, which could
>> deliver >20ms of performance.  There would be no performance benefit for
>> SSD-based file-systems.
>
> That is likely with RAID-5 and RAID-10.

It's likely with all disk layouts.  The reason just looks different on
different RAID structures.

>> My experience is that once your disks are larger than about 500-750GB,
>> RAID-6 becomes a much better choice, due to the increased chances of
>> having an uncorrectable read error during a reconstruct.  My opinion is
>> that anyone storing critical information in RAID-5, or even 2-disk
>> RAID-1,
>> with disks of this capacity, should either reconsider their storage
>> topology, or verify that they have a good backup/restore mechanism in
>> place for that data.
>
> http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
>
> The NetApp research shows that the incidence of silent corruption is a
> lot greater than you would expect.  RAID-6 doesn't save you from this.
> You need BTRFS or ZFS RAID-6.

I was referring to hard read errors, not silent data corruption.

Peter Ashford



* Re: ditto blocks on ZFS
  2014-05-22 22:09               ` ashford
@ 2014-05-23  3:54                 ` Russell Coker
  2014-05-23  8:03                   ` Duncan
  0 siblings, 1 reply; 18+ messages in thread
From: Russell Coker @ 2014-05-23  3:54 UTC (permalink / raw)
  To: ashford; +Cc: linux-btrfs

On Thu, 22 May 2014 15:09:40 ashford@whisperpc.com wrote:
> You've addressed half of the issue.  It appears that the metadata is
> normally a bit over 1% using the current methods, but two samples do not
> make a statistical universe.  The good news is that these two samples are
> from opposite extremes of usage, so I expect they're close to where the
> overall average would end up.  I'd like to see a few more samples, from
> other usage scenarios, just to be sure.
> 
> If the above numbers are normal, adding ditto blocks could increase the
> size of the metadata from 1% to 2% or even 3%.  This isn't a problem.
> 
> What we still don't know, and probably won't until after it's implemented,
> is whether or not the addition of ditto blocks will make the space
> allocation worse.

I've been involved in many discussions about filesystem choice.  None of them 
have included anyone raising an issue about ZFS metadata space usage; probably 
most ZFS users don't even know about ditto blocks.

The relevant issue regarding disk space is the fact that filesystems tend to 
perform better if there is a reasonable amount of free space.  The amount of 
free space needed for good performance will depend on the filesystem, the 
usage pattern, and whatever you might define as "good performance".

The first two Google hits when searching for recommended free space on ZFS 
suggest using no more than 80% and 85% of disk space.  Obviously if "good 
performance" requires 15% free disk space then your capacity problem isn't 
going to be solved by not duplicating metadata.  Note that I can't vouch for 
the accuracy of such claims about ZFS performance.

Is anyone doing research on how much free disk space is required on BTRFS for 
"good performance"?  If a rumor (whether correct or incorrect) goes around 
that you need 20% free space on a BTRFS filesystem for performance then that 
will vastly outweigh the space used for metadata.

> > The ZFS design involves ditto blocks being spaced apart due to the fact
> > that corruption tends to have some spacial locality.  So you are adding
> > an extra seek.
> > 
> > The worst case would be when you have lots of small synchronous writes,
> > probably the default configuration of Maildir delivery would be a good
> > case.
> 
> Is there a performance test for this?  That would be helpful in
> determining the worst-case performance impact of implementing ditto
> blocks, and probably some other enhancements as well.

http://doc.coker.com.au/projects/postal/

My Postal mail server benchmark is one option.  There are more than a few 
benchmarks of synchronous writes of small files, but Postal uses real-world 
programs that need such performance.  Delivering a single message via a 
typical Unix MTA requires synchronous writes of two queue files and then the 
destination file in the mail store.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



* Re: ditto blocks on ZFS
  2014-05-23  3:54                 ` Russell Coker
@ 2014-05-23  8:03                   ` Duncan
  0 siblings, 0 replies; 18+ messages in thread
From: Duncan @ 2014-05-23  8:03 UTC (permalink / raw)
  To: linux-btrfs

Russell Coker posted on Fri, 23 May 2014 13:54:46 +1000 as excerpted:

> Is anyone doing research on how much free disk space is required on
> BTRFS for "good performance"?  If a rumor (whether correct or incorrect)
> goes around that you need 20% free space on a BTRFS filesystem for
> performance then that will vastly outweigh the space used for metadata.

Well, on btrfs there's free-space, and then there's free-space.  Chunk 
allocation and fragmentation of both data and metadata make a difference.

That said, *IF* you're looking at the right numbers, btrfs doesn't 
actually require that much free space, and should run as efficiently 
right up to just a few GiB free, on pretty much any btrfs over a few GiB 
in size, so at least in the significant fractions of a TiB on up range, 
it doesn't require that much free space /as/ /a/ /percentage/ at all.

**BUT BE SURE YOU'RE LOOKING AT THE RIGHT NUMBERS** as explained below.

Chunks:

On btrfs, both data and metadata are allocated in chunks, 1 GiB chunks 
for data, 256 MiB chunks for metadata.  The catch is that while both 
chunks and space within chunks can be allocated on-demand, deleting files 
only frees space within chunks -- the chunks themselves remain allocated 
to data or metadata, whichever they were, and cannot be reallocated to the 
other.  To deallocate unused chunks, and to rewrite partially used chunks 
to consolidate usage onto fewer chunks and free the others, btrfs admins 
must currently do a btrfs balance manually (or via script).

btrfs filesystem show:

For the btrfs filesystem show output, the individual devid lines show 
total filesystem space on the device vs. used (that is, allocated to chunks) 
space.[1]  Ideally (assuming equal sized devices) you should keep at 
least 2.5-3.0 GiB unallocated per device, since that will allow allocation of 
two chunks each for data (1 GiB each) and metadata (a quarter GiB each, but 
on single-device filesystems they are allocated in pairs by default, so 
half a GiB, see below).  Since the balance process itself will want to 
allocate a new chunk to write into in order to rewrite and consolidate 
existing chunks, you don't want to use the last one available, and since 
the filesystem could decide it needs to allocate another chunk for normal 
usage as well, you always want to keep at least two chunks' worth of each, 
thus 2.5 GiB (3.0 GiB for single-device filesystems, see below) 
unallocated: one chunk each of data/metadata for the filesystem if it needs 
it, and another to ensure balance can allocate at least the one chunk to 
do its rewrite.

As I said, data chunks are 1 GiB, while metadata chunks are 256 MiB, a 
quarter GiB.  However, on a single-device btrfs, metadata will normally 
default to dup (duplicate, two copies for safety) mode, and will thus 
allocate two chunks, half a GiB at a time.  This is why you want 3 GiB 
minimum free on a single-device btrfs, space for two single-mode data 
chunk allocations (1 GiB * 2 = 2 GiB), plus two dup-mode metadata chunk 
allocations (256 MiB * 2 * 2 = 1 GiB).  But on multi-device btrfs, only a 
single copy is stored per device, so the metadata minimum reserve is only 
half a GiB per device (256 MiB * 2 = 512 MiB = half a GiB).

That's the minimum unallocated space you need free.  More than that is 
nice and lets you go longer between having to worry about rebalances, but 
it really won't help btrfs efficiency that much, since btrfs uses already 
allocated chunk space where it can.

btrfs filesystem df:

Then there's the already chunk-allocated space.  btrfs filesystem df 
reports on this.  In the df output, total means allocated, while used 
means how much of that allocation is actually used, so the spread between 
them is the allocated-but-unused space.

Since btrfs allocates new chunks on-demand from the unallocated space 
pool, but cannot reallocate chunks between data and metadata on its own, 
and because the used blocks within existing chunks will get fragmented 
over time, it's best to keep the btrfs filesystem df reported spread 
between total and used to a minimum.

Of course, as I said above data chunks are 1 GiB each, so a data 
allocation spread of under a GiB won't be recoverable in any case, and a 
spread of 1-5 GiB isn't a big deal.  But if for instance btrfs filesystem 
df reports data 1.25 TiB total (that is, allocated) but only 250 GiB 
used, that's a spread of roughly a TiB, and running a btrfs balance in 
order to recover most of that spread to unallocated is a good idea.

Similarly with metadata, except it'll be allocated in 256 MiB chunks, two 
at a time by default on a single-device filesystem, so 512 MiB at a time 
in that case.  But again, if btrfs filesystem df is reporting, say, 10.5 
GiB total metadata but only perhaps 1.75 GiB used, the spread is several 
chunks' worth, and particularly if your unallocated reserve (as reported by 
btrfs filesystem show in the individual device lines) is getting low, 
it's time to consider rebalancing to recover the unused metadata space 
to unallocated.

It's also worth noting that btrfs requires some metadata space free to 
work with, figure about one chunk's worth, so if there's no unallocated 
space left and free metadata space gets under 300 MiB or so, you're getting 
real close to ENOSPC errors!  For the same reason, even a full balance 
will likely still leave a metadata chunk or two (so say half a gig) of 
reported spread between metadata total and used; that's not recoverable 
by balance because btrfs actually reserves that space for its own use.

Finally, it can be noted that under normal usage and particularly in 
cases where people delete a whole bunch of medium to large files (and 
assuming those same files aren't being saved in a btrfs snapshot, which 
would prevent their deletion actually freeing the space they take until 
all the snapshots that contain them are deleted as well), a lot of 
previously allocated data chunks will become mostly or fully empty, but 
metadata usage won't go down all that much, so relatively less metadata 
space will return to unused.  That means that where people haven't rebalanced 
in a while, they're likely to have a lot of allocated but unused data 
space that can be reused, but rather less unused metadata space to 
reuse.  As a result, when all space is allocated and there's no more to 
allocate to new chunks, it's most commonly metadata space that runs out 
first, *SOMETIMES WITH LOTS OF SPACE STILL REPORTED AS FREE BY ORDINARY 
DF* and lots of data space free as reported by btrfs filesystem df as 
well, simply because all available metadata chunks are full, and all 
remaining space is allocated to data chunks, a significant number of 
which may be mostly free.

But OTOH, if you work with mostly small files, a KiB or smaller, and have 
deleted a bunch of them, it's likely you'll free a lot of metadata space 
because such small files are often stored entirely as metadata.  In that 
case you may run out of data space first, once all space is allocated to 
chunks of some kind.  This is somewhat rarer, but it does happen, and the 
symptoms can look a bit strange as sometimes it'll result in a bunch of 
zero-sized files, because the metadata space was available for them but 
when it came time to write the actual data, there was no space to do so.

But once all space is allocated to chunks so no more chunks can be 
allocated, it's only a matter of time until either data or metadata runs 
out, even if there's plenty of "space" free, because all that "space" is 
tied up in the other one!  As I said above, keep an eye on the btrfs 
filesystem show output, and try to do a rebalance when the spread between 
total and used (allocated) on the device lines drops toward 3 GiB, 
because once all space is actually allocated, you're in a bit of a bind 
and balance may find it hard to free space as well.  There are tricks 
that can help, as described below, but it's better not to find yourself 
in that spot in the first place.

Balance and balance filters:

Now let's look at balance and balance filters.  There's a page on the wiki
[2] that explains balance filters in some detail, but for our purposes 
here, it's sufficient to know -m tells balance to only handle metadata 
chunks, while -d tells it to only handle data chunks, and usage=N can be 
used to tell it to only rebalance chunks with that usage or LESS, thus 
allowing you to avoid unnecessarily rebalancing full and almost full 
chunks, while still allowing recovery of nearly empty chunks to the 
unallocated pool.

So if btrfs filesystem df shows a big spread between total and used for 
data, try something like this:

btrfs balance start -dusage=20 /mnt  (note no space between -d and 
usage; here /mnt stands for wherever the filesystem is mounted)

That says balance (rewrite and consolidate) only data chunks with usage 
of 20% or less.  That will be MUCH faster than a full rebalance, and 
should be quite a bit faster than simply -d (data chunks only, without 
the usage filter) as well, while still consolidating data chunks with 
usage at or below 20%, which will likely be quite a few if the spread is 
pretty big.
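
Similarly, to consolidate mostly-empty metadata chunks only, something 
like this should work (the /mnt mount point is of course illustrative):

btrfs balance start -musage=20 /mnt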

Of course you can adjust the N in that usage=N as needed between 0 and 
100.  As the filesystem really does fill up and there's less room to 
spare to allocated but unused chunks, you'll need to increase that usage= 
toward 100 in order to consolidate and recover as many partially used 
chunks as possible.  But while the filesystem is mostly empty and/or if 
the btrfs filesystem df spread between used and total is large (tens or 
hundreds of gigs), a smaller usage=, say usage=5, will likely get you 
very good results, but MUCH faster, since you're only dealing with chunks 
at or under 5% full, meaning far less actual rewriting, while most of the 
time getting a full gig back for every 1/20 gig (5%) you rebalance!
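
So on a mostly empty filesystem, a quick low-usage pass might look like 
this (mount point illustrative, adjust the 5 upward as the filesystem 
fills):

btrfs balance start -dusage=5 /mnt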

***ANSWER!***

While btrfs shouldn't lose that much operational efficiency as the 
filesystem fills, as long as there are unallocated chunks available to 
allocate as it needs them, the closer it is to full, the more frequently 
one will need to rebalance and the closer to 100 the usage= balance 
filter will need to be in order to recover all possible space to 
unallocated and keep it free for allocation as necessary.

Tying up loose ends: Tricks:

Above, I mentioned tricks that can let you balance even if there's no 
space left to allocate the new chunk to rewrite data/metadata from the 
old chunk into, so a normal balance won't work.

The first such trick is the usage=0 balance filter.  Even if you're 
totally out of unallocated space as reported by btrfs filesystem show, if 
btrfs filesystem df shows a large spread between used and total (or even 
if not, if you're lucky, as long as the spread is at least one chunk's 
worth), there's a fair chance that at least one chunk is totally empty.  
In that case, there's nothing in it to rewrite, and balancing that chunk 
will simply free it, without requiring a chunk allocation to do the 
rewrite. Using usage=0 tells balance to only consider such chunks, 
freeing any that it finds without requiring space to rewrite the data, 
since there's nothing there to rewrite.  =:^)
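
So the first thing to try in that situation is something like this 
(mount point illustrative):

btrfs balance start -dusage=0 -musage=0 /mnt

That considers both data and metadata chunks; drop -dusage=0 or 
-musage=0 to restrict it to one or the other.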

Still, there's no guarantee balance will find any totally empty chunks to 
free, so it's better not to get into that situation to begin with.  As I 
said above, try to keep at least 3 GiB free as reported by the individual 
device lines of btrfs filesystem show (or 2.5 GiB each device of a multi-
device filesystem).

If -dusage=0/-musage=0 doesn't work, the next trick is to try temporarily 
adding another device to the btrfs, using btrfs device add.  This device 
should be at least several GiB (again, I'd say 3 GiB, minimum, but 10 GiB 
or so would be better, no need to make it /huge/) in size, and could be a 
USB thumb drive or the like.  If you have 8 GiB or more of memory and 
aren't using it all, even a several-GiB loopback file created on top of 
tmpfs can work, but of course if the system crashes while that temporary 
device is in use, say goodbye to whatever was on it at the time!

The idea is to add the device temporarily, do a btrfs balance with a 
usage filter set as low as possible to free up at least one extra chunk 
worth of space on the permanent device(s), then when balance has 
recovered enough chunks worth of space to do so, do a btrfs device delete 
on the temporary device to return the chunks on it to the newly 
unallocated space on the permanent devices.
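
As a sketch of that sequence, assuming /mnt is the full filesystem and 
/dev/sdX is the temporary device (both names illustrative; pick the 
usage= values to suit, as discussed above):

btrfs device add /dev/sdX /mnt
btrfs balance start -dusage=5 -musage=5 /mnt
btrfs device delete /dev/sdX /mnt

The device delete itself migrates any chunks that ended up on the 
temporary device back onto the permanent device(s) before removing it.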

The temporary device trick should work where the usage=0 trick fails and 
should allow getting out of the bind, but again, better never to find 
yourself in that bind in the first place, so keep an eye on those btrfs 
filesystem show results!

More loose ends:

Above I assumed all devices of a multi-device btrfs are the same size, so 
they should fill up roughly in parallel and the per-device lines in the 
btrfs filesystem show output should be similar.  If you're using 
different sized devices, depending on your configured raid mode and the 
size of the devices, one will likely fill up first, but there will still 
be room left on the others.  The details are too complex to deal with 
here, but one thing that's worth noting is that for some device sizes and 
raid mode configurations, btrfs will not be able to use the full size of 
the largest device.  Hugo's btrfs device and filesystem layout 
configurator page is a good tool to use when planning a mixed-device-size 
btrfs.

Finally, there's the usage value in the total devices line of btrfs 
filesystem show, which in footnote [1] below I recommend ignoring if you 
don't understand it.  That number is actually the (rounded appropriately) 
sum of all the used values as reported by btrfs filesystem df.  
Basically, add the used values from the data and metadata lines (because 
the other usage lines end up being rounding errors) of btrfs filesystem 
df, and that should (within rounding error) be the number reported by 
btrfs filesystem show as usage in the total devices line.  That's where 
the number comes from and it is in some ways the actual filesystem 
usage.  But in btrfs terms it's relatively unimportant compared to the 
chunk-allocated/unallocated/total values as reported on the individual 
device lines, and the data/metadata values as reported by btrfs 
filesystem df, so for btrfs administration purposes it's generally better 
to simply pretend the total devices line usage in btrfs filesystem show 
doesn't appear at all; in real life, far more people seem to get confused 
by it than find it actually useful.  But that's where the number derives 
from, if you find you can't simply ignore it as I recommend.  (I know I'd 
have a hard time ignoring it myself, until I knew where it actually came 
from.)
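
Using the illustrative numbers from the btrfs filesystem df example 
above: 250 GiB data used + 1.75 GiB metadata used = 251.75 GiB, which 
(within rounding) is what btrfs filesystem show would report as usage on 
the total devices line.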

---
[1] The total devices line used is reporting something entirely 
different, best ignored if you don't understand it as it has deceived a 
lot of people into thinking they have lots of room available when it's 
actually all allocated.

[2] Btrfs wiki, general link: https://btrfs.wiki.kernel.org

Balance filters:
https://btrfs.wiki.kernel.org/index.php/Balance_Filters

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: ditto blocks on ZFS
@ 2014-05-22 15:28 Tomasz Chmielewski
  0 siblings, 0 replies; 18+ messages in thread
From: Tomasz Chmielewski @ 2014-05-22 15:28 UTC (permalink / raw)
  To: linux-btrfs

> I thought an important idea behind btrfs was that we avoid by design
> in the first place the very long and vulnerable RAID rebuild scenarios
> suffered for block-level RAID...

This may be true for SSDs - for ordinary spinning disks it's not entirely
the case.

For most RAID rebuilds, it still seems way faster with software RAID-1
where one drive is being read at its (almost) full speed, and the other
is being written to at its (almost) full speed (assuming no other IO
load).

With btrfs RAID-1, the way balance works after a disk replace, it takes
lots of disk head movement, resulting in an overall low rebuild speed,
especially with lots of snapshots and related fragmentation.

And balance is still not smart: it causes reads from one device and
writes to *both* devices (an extra, unnecessary write to the healthy
device), while it should read from the healthy device and write only to
the replacement device.


Of course, other factors such as the amount of data or disk IO usage
during rebuild apply.


-- 
Tomasz Chmielewski
http://wpkg.org

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-05-23  8:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-16  3:07 ditto blocks on ZFS Russell Coker
2014-05-17 12:50 ` Martin
2014-05-17 14:24   ` Hugo Mills
2014-05-18 16:09   ` Russell Coker
2014-05-19 20:36     ` Martin
2014-05-19 21:47       ` Brendan Hide
2014-05-20  2:07         ` Russell Coker
2014-05-20 14:07           ` Austin S Hemmelgarn
2014-05-20 20:11             ` Brendan Hide
2014-05-20 14:56           ` ashford
2014-05-21  2:51             ` Russell Coker
2014-05-21 23:05               ` Martin
2014-05-22 11:10                 ` Austin S Hemmelgarn
2014-05-22 22:09               ` ashford
2014-05-23  3:54                 ` Russell Coker
2014-05-23  8:03                   ` Duncan
2014-05-21 23:29           ` Konstantinos Skarlatos
2014-05-22 15:28 Tomasz Chmielewski
