* ditto blocks on ZFS
@ 2014-05-16 3:07 Russell Coker
2014-05-17 12:50 ` Martin
0 siblings, 1 reply; 18+ messages in thread
From: Russell Coker @ 2014-05-16 3:07 UTC (permalink / raw)
To: linux-btrfs
https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
Probably most of you already know about this, but for those of you who haven't,
the above describes ZFS "ditto blocks", a good feature we need on
BTRFS. The briefest summary is that on top of the RAID redundancy there is
one more copy of metadata than there is of data, so copies=2 implies 3 copies
of metadata, and the default option of 1 copy of data means that metadata is
"dup" in addition to whatever RAID redundancy is in place.
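The copy-count rule above can be sketched as follows (my own reading of the
ditto-block rule; the cap at 3 copies is assumed from the blog post, not
verified against ZFS source):

```python
def zfs_copy_counts(copies: int) -> dict:
    """Sketch of the ditto-block rule: metadata gets one more copy than
    data, and pool-wide metadata one more again, each capped at 3."""
    assert 1 <= copies <= 3
    return {
        "data": copies,
        "metadata": min(copies + 1, 3),
        "pool_metadata": min(copies + 2, 3),
    }

# copies=1 (the default): data 1x, metadata 2x ("dup"), pool metadata 3x
# copies=2: data 2x, metadata 3x, pool metadata 3x
```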
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS
2014-05-16 3:07 ditto blocks on ZFS Russell Coker
@ 2014-05-17 12:50 ` Martin
2014-05-17 14:24 ` Hugo Mills
2014-05-18 16:09 ` Russell Coker
0 siblings, 2 replies; 18+ messages in thread
From: Martin @ 2014-05-17 12:50 UTC (permalink / raw)
To: linux-btrfs
On 16/05/14 04:07, Russell Coker wrote:
> https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
>
> Probably most of you already know about this, but for those of you who haven't
> the above describes ZFS "ditto blocks" which is a good feature we need on
> BTRFS. The briefest summary is that on top of the RAID redundancy there...
[... are additional copies of metadata ...]
Is that idea not already implemented in effect in btrfs with the way
that the superblocks are replicated multiple times, ever more times, for
ever more huge storage devices?
The one exception is for SSDs whereby there is the excuse that you
cannot know whether your data is usefully replicated across different
erase blocks on a single device, and SSDs are not 'that big' anyhow.
So... Your idea of replicating metadata multiple times in proportion to
assumed 'importance' or 'extent of impact if lost' is an interesting
approach. However, is that appropriate and useful considering the real
world failure mechanisms that are to be guarded against?
Do you see or measure any real advantage?
Regards,
Martin
* Re: ditto blocks on ZFS
2014-05-17 12:50 ` Martin
@ 2014-05-17 14:24 ` Hugo Mills
2014-05-18 16:09 ` Russell Coker
1 sibling, 0 replies; 18+ messages in thread
From: Hugo Mills @ 2014-05-17 14:24 UTC (permalink / raw)
To: Martin; +Cc: linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 1711 bytes --]
On Sat, May 17, 2014 at 01:50:52PM +0100, Martin wrote:
> On 16/05/14 04:07, Russell Coker wrote:
> > https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
> >
> > Probably most of you already know about this, but for those of you who haven't
> > the above describes ZFS "ditto blocks" which is a good feature we need on
> > BTRFS. The briefest summary is that on top of the RAID redundancy there...
> [... are additional copies of metadata ...]
>
>
> Is that idea not already implemented in effect in btrfs with the way
> that the superblocks are replicated multiple times, ever more times, for
> ever more huge storage devices?
Superblocks are the smallest part of the metadata. There's a whole
load of metadata outside the superblocks that isn't replicated
in this way.
> The one exception is for SSDs whereby there is the excuse that you
> cannot know whether your data is usefully replicated across different
> erase blocks on a single device, and SSDs are not 'that big' anyhow.
>
>
> So... Your idea of replicating metadata multiple times in proportion to
> assumed 'importance' or 'extent of impact if lost' is an interesting
> approach. However, is that appropriate and useful considering the real
> world failure mechanisms that are to be guarded against?
>
> Do you see or measure any real advantage?
This. How many copies do you actually need? Are there concrete
statistics to show the marginal utility of each additional copy?
Hugo.
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- IMPROVE YOUR ORGANISMS!! -- Subject line of spam email ---
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]
* Re: ditto blocks on ZFS
2014-05-17 12:50 ` Martin
2014-05-17 14:24 ` Hugo Mills
@ 2014-05-18 16:09 ` Russell Coker
2014-05-19 20:36 ` Martin
1 sibling, 1 reply; 18+ messages in thread
From: Russell Coker @ 2014-05-18 16:09 UTC (permalink / raw)
To: Martin; +Cc: linux-btrfs
On Sat, 17 May 2014 13:50:52 Martin wrote:
> On 16/05/14 04:07, Russell Coker wrote:
> > https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
> >
> > Probably most of you already know about this, but for those of you who
> > haven't the above describes ZFS "ditto blocks" which is a good feature we
> > need on BTRFS. The briefest summary is that on top of the RAID
> > redundancy there...
> [... are additional copies of metadata ...]
>
>
> Is that idea not already implemented in effect in btrfs with the way
> that the superblocks are replicated multiple times, ever more times, for
> ever more huge storage devices?
No. If the metadata for the root directory is corrupted then everything is
lost even if the superblock is OK. At every level in the directory tree, a
corruption loses all levels below it; a corruption of /home would be
very significant, as would a corruption of /home/importantuser/major-project.
> The one exception is for SSDs whereby there is the excuse that you
> cannot know whether your data is usefully replicated across different
> erase blocks on a single device, and SSDs are not 'that big' anyhow.
I am not convinced by that argument. While you can't know that it's usefully
replicated, you also can't say for sure that replication will never save you.
There will surely be some random factors involved. If dup on SSD will save
you from 50% of corruption problems, is it worth doing? What if it's 80%, or
20%?
I have BTRFS running as the root filesystem on Intel SSDs on four machines
(one of which is a file server with a pair of large disks in a BTRFS RAID-1).
On all of those systems I have dup for metadata; it doesn't take up any space
I need for something else, and it might save me.
> So... Your idea of replicating metadata multiple times in proportion to
> assumed 'importance' or 'extent of impact if lost' is an interesting
> approach. However, is that appropriate and useful considering the real
> world failure mechanisms that are to be guarded against?
Firstly it's not my idea, it's the idea of the ZFS developers. Secondly I
started reading about this after doing some experiments with a failing SATA
disk. In spite of having ~14,000 read errors (which sounds like a lot but is
a small fraction of a 2TB disk) the vast majority of the data was readable,
largely due to ~2000 errors corrected by dup metadata.
> Do you see or measure any real advantage?
Imagine that you have a RAID-1 array where both disks get ~14,000 read errors.
This could happen due to a design defect common to drives of a particular
model or some shared environmental problem. Most errors would be corrected by
RAID-1 but there would be a risk of some data being lost due to both copies
being corrupt. Another possibility is that one disk could entirely die
(although total disk death seems rare nowadays) and the other could have
corruption. If metadata was duplicated in addition to being on both disks
then the probability of data loss would be reduced.
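A back-of-the-envelope sketch of why the extra copies help (my own
illustrative model, assuming independent per-copy corruption, which real
failures only approximate):

```python
def p_block_lost(p_bad: float, n_copies: int) -> float:
    """Probability that every one of n independent copies of a block is
    corrupt, given per-copy corruption probability p_bad."""
    return p_bad ** n_copies

# ~14,000 bad sectors on a 2 TB disk of 4 KiB blocks ≈ 3e-5 per block
p = 14_000 / (2e12 / 4096)
raid1 = p_block_lost(p, 2)      # RAID-1 alone: both mirrors bad
raid1_dup = p_block_lost(p, 4)  # dup metadata on RAID-1: all four copies bad
```

With two copies the per-block loss probability is already tiny; with four it
is smaller still by many orders of magnitude, which is the whole point of
layering dup metadata on top of RAID.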
Another issue is the case where all drive slots are filled with active drives
(a very common configuration). To replace a disk you have to physically
remove the old disk before adding the new one. If the array is a RAID-1 or
RAID-5 then ANY error during reconstruction loses data. Using dup for
metadata on top of the RAID protections (IE the ZFS ditto idea) means that
case doesn't lose you data.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
* Re: ditto blocks on ZFS
2014-05-18 16:09 ` Russell Coker
@ 2014-05-19 20:36 ` Martin
2014-05-19 21:47 ` Brendan Hide
0 siblings, 1 reply; 18+ messages in thread
From: Martin @ 2014-05-19 20:36 UTC (permalink / raw)
To: linux-btrfs
On 18/05/14 17:09, Russell Coker wrote:
> On Sat, 17 May 2014 13:50:52 Martin wrote:
[...]
>> Do you see or measure any real advantage?
>
> Imagine that you have a RAID-1 array where both disks get ~14,000 read errors.
> This could happen due to a design defect common to drives of a particular
> model or some shared environmental problem. Most errors would be corrected by
> RAID-1 but there would be a risk of some data being lost due to both copies
> being corrupt. Another possibility is that one disk could entirely die
> (although total disk death seems rare nowadays) and the other could have
> corruption. If metadata was duplicated in addition to being on both disks
> then the probability of data loss would be reduced.
>
> Another issue is the case where all drive slots are filled with active drives
> (a very common configuration). To replace a disk you have to physically
> remove the old disk before adding the new one. If the array is a RAID-1 or
> RAID-5 then ANY error during reconstruction loses data. Using dup for
> metadata on top of the RAID protections (IE the ZFS ditto idea) means that
> case doesn't lose you data.
Your example there is for the case where in effect there is no RAID. How
is that case any better than what is already done for btrfs duplicating
metadata?
So...
What real-world failure modes do the ditto blocks usefully protect against?
And how does that compare for failure rates and against what is already
done?
For example, we have RAID1 and RAID5 to protect against any one RAID
chunk being corrupted or for the total loss of any one device.
There is a second part to that in that another failure cannot be
tolerated until the RAID is remade.
Hence, we have RAID6 that protects against any two failures for a chunk
or device. Hence with just one failure, you can tolerate a second
failure whilst rebuilding the RAID.
And then we supposedly have safety-by-design where the filesystem itself
is using a journal and barriers/sync to ensure that the filesystem is
always kept in a consistent state, even after an interruption to any writes.
*What other failure modes* should we guard against?
There has been mention of fixing metadata keys from single bit flips...
Should Hamming codes be used instead of a CRC, so that we can have
multiple-bit error detection and single-bit error correction for
all data, both in RAM and on disk, for those systems that do not use ECC RAM?
Would that be useful?...
Regards,
Martin
* Re: ditto blocks on ZFS
2014-05-19 20:36 ` Martin
@ 2014-05-19 21:47 ` Brendan Hide
2014-05-20 2:07 ` Russell Coker
0 siblings, 1 reply; 18+ messages in thread
From: Brendan Hide @ 2014-05-19 21:47 UTC (permalink / raw)
To: Martin, linux-btrfs
On 2014/05/19 10:36 PM, Martin wrote:
> On 18/05/14 17:09, Russell Coker wrote:
>> On Sat, 17 May 2014 13:50:52 Martin wrote:
> [...]
>>> Do you see or measure any real advantage?
>> [snip]
This is extremely difficult to measure objectively. Subjectively ... see
below.
> [snip]
>
> *What other failure modes* should we guard against?
I know I'd sleep a /little/ better at night knowing that a double disk
failure on a "raid5/1/10" configuration might ruin a ton of data along
with an obscure set of metadata in some "long" tree paths - but not the
entire filesystem.
The other use-case/failure mode - where you are somehow unlucky enough
to have sets of bad sectors/bitrot on multiple disks that simultaneously
affect the only copies of the tree roots - is an extremely unlikely
scenario. As unlikely as it may be, the scenario is a very painful
consequence in spite of VERY little corruption. That is where the
peace-of-mind/bragging rights come in.
--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
* Re: ditto blocks on ZFS
2014-05-19 21:47 ` Brendan Hide
@ 2014-05-20 2:07 ` Russell Coker
2014-05-20 14:07 ` Austin S Hemmelgarn
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Russell Coker @ 2014-05-20 2:07 UTC (permalink / raw)
To: Brendan Hide, linux-btrfs
On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
> This is extremely difficult to measure objectively. Subjectively ... see
> below.
>
> > [snip]
> >
> > *What other failure modes* should we guard against?
>
> I know I'd sleep a /little/ better at night knowing that a double disk
> failure on a "raid5/1/10" configuration might ruin a ton of data along
> with an obscure set of metadata in some "long" tree paths - but not the
> entire filesystem.
My experience is that most disk failures that don't involve extreme physical
damage (EG dropping a drive on concrete) don't involve totally losing the
disk. Much of the discussion about RAID failures concerns entirely failed
disks, but I believe that is due to RAID implementations such as Linux
software RAID that will entirely remove a disk when it gives errors.
I have a disk which had ~14,000 errors of which ~2000 errors were corrected by
duplicate metadata. If two disks with that problem were in a RAID-1 array
then duplicate metadata would be a significant benefit.
> The other use-case/failure mode - where you are somehow unlucky enough
> to have sets of bad sectors/bitrot on multiple disks that simultaneously
> affect the only copies of the tree roots - is an extremely unlikely
> scenario. As unlikely as it may be, the scenario is a very painful
> consequence in spite of VERY little corruption. That is where the
> peace-of-mind/bragging rights come in.
http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
The NetApp research on latent errors on drives is worth reading. On page 12
they report latent sector errors on 9.5% of SATA disks per year. So if you
lose one disk entirely the risk of having errors on a second disk is higher
than you would want for RAID-5. While losing the root of the tree is
unlikely, losing a directory in the middle that has lots of subdirectories is
a risk.
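The second-disk risk can be sketched numerically (my own arithmetic from the
paper's ~9.5%/year figure, assuming independence across disks, which is an
illustrative simplification):

```python
def p_second_failure(n_disks: int, p_lse: float = 0.095) -> float:
    """Chance that at least one surviving disk carries a latent sector
    error, using the ~9.5%/year SATA figure from the NetApp paper."""
    return 1 - (1 - p_lse) ** n_disks

# A 6-disk RAID-5 that loses one disk: 5 survivors must read clean.
# p_second_failure(5) ≈ 0.39 — far too high for comfort on RAID-5.
```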
I can understand why people wouldn't want ditto blocks to be mandatory. But
why are people arguing against them as an option?
As an aside, I'd really like to be able to set RAID levels by subtree. I'd
like to use RAID-1 with ditto blocks for my important data and RAID-0 for
unimportant data.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
* Re: ditto blocks on ZFS
2014-05-20 2:07 ` Russell Coker
@ 2014-05-20 14:07 ` Austin S Hemmelgarn
2014-05-20 20:11 ` Brendan Hide
2014-05-20 14:56 ` ashford
2014-05-21 23:29 ` Konstantinos Skarlatos
2 siblings, 1 reply; 18+ messages in thread
From: Austin S Hemmelgarn @ 2014-05-20 14:07 UTC (permalink / raw)
To: russell, Brendan Hide, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 3016 bytes --]
On 2014-05-19 22:07, Russell Coker wrote:
> On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
>> This is extremely difficult to measure objectively. Subjectively ... see
>> below.
>>
>>> [snip]
>>>
>>> *What other failure modes* should we guard against?
>>
>> I know I'd sleep a /little/ better at night knowing that a double disk
>> failure on a "raid5/1/10" configuration might ruin a ton of data along
>> with an obscure set of metadata in some "long" tree paths - but not the
>> entire filesystem.
>
> My experience is that most disk failures that don't involve extreme physical
> damage (EG dropping a drive on concrete) don't involve totally losing the
> disk. Much of the discussion about RAID failures concerns entirely failed
> disks, but I believe that is due to RAID implementations such as Linux
> software RAID that will entirely remove a disk when it gives errors.
>
> I have a disk which had ~14,000 errors of which ~2000 errors were corrected by
> duplicate metadata. If two disks with that problem were in a RAID-1 array
> then duplicate metadata would be a significant benefit.
>
>> The other use-case/failure mode - where you are somehow unlucky enough
>> to have sets of bad sectors/bitrot on multiple disks that simultaneously
>> affect the only copies of the tree roots - is an extremely unlikely
>> scenario. As unlikely as it may be, the scenario is a very painful
>> consequence in spite of VERY little corruption. That is where the
>> peace-of-mind/bragging rights come in.
>
> http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
>
> The NetApp research on latent errors on drives is worth reading. On page 12
> they report latent sector errors on 9.5% of SATA disks per year. So if you
> lose one disk entirely the risk of having errors on a second disk is higher
> than you would want for RAID-5. While losing the root of the tree is
> unlikely, losing a directory in the middle that has lots of subdirectories is
> a risk.
>
> I can understand why people wouldn't want ditto blocks to be mandatory. But
> why are people arguing against them as an option?
>
>
> As an aside, I'd really like to be able to set RAID levels by subtree. I'd
> like to use RAID-1 with ditto blocks for my important data and RAID-0 for
> unimportant data.
>
But the proposed changes for n-way replication would already handle
this. They would just need the option of having more than one copy per
device (which theoretically shouldn't be too hard once you have n-way
replication). Also, BTRFS already has the option of replicating the
root tree across multiple devices (it is included in the System Data
subset), and in fact does so by default when using multiple devices.
Also, there are plans to have per-subvolume or per-file RAID level
selection, but IIRC that is planned for after n-way replication (and,
of course, after RAID 5/6, as n-way replication isn't going to be
implemented until after RAID 5/6).
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2967 bytes --]
* Re: ditto blocks on ZFS
2014-05-20 2:07 ` Russell Coker
2014-05-20 14:07 ` Austin S Hemmelgarn
@ 2014-05-20 14:56 ` ashford
2014-05-21 2:51 ` Russell Coker
2014-05-21 23:29 ` Konstantinos Skarlatos
2 siblings, 1 reply; 18+ messages in thread
From: ashford @ 2014-05-20 14:56 UTC (permalink / raw)
To: linux-btrfs; +Cc: ahferroin7, russell, brendan
I've been reading this list for a few years, giving almost no
feedback, but I feel that this subject demands that I provide some input.
I can think of five possible effects of implementing ditto blocks for the
metadata. We've only been discussing one (#3 in my list) in this thread.
While most of these effects are fairly obvious, I have seen no discussion
on them.
In discussing the issues of implementing ditto blocks, I think it would be
good to address all of the potential effects, and determine from that
discussion whether or not the enhancement should be made, and, if so, when
the appropriate development resources should be made available. As Austin
pointed out, there are some enhancements currently planned which would
make the implementation of ditto blocks simpler. I believe that defines
the earliest good time for implementation of ditto blocks.
1. There will be more disk space used by the metadata. I've been aware
of space allocation issues in BTRFS for more than three years. If the use
of ditto blocks will make this issue worse, then it's probably not a good
idea to implement it. The actual increase in metadata space is probably
small in most circumstances.
2. Use of ditto blocks will increase write bandwidth to the disk. This
is a direct and unavoidable result of having more copies of the metadata.
The actual impact of this would depend on the file-system usage pattern,
but would probably be unnoticeable in most circumstances. Does anyone
have a worst-case scenario for testing?
3. Certain kinds of disk errors would be easier to recover from. Some
people here claim that those specific errors are rare. I have no opinion
on how often they happen, but I believe that if the overall disk space
cost is low, it will have a reasonable return. There would be virtually
no reliability gains on an SSD-based file-system, as the ditto blocks
would be written at the same time, and the SSD would be likely to map the
logical blocks into the same page of flash memory.
4. If the BIO layer of BTRFS and the device driver are smart enough,
ditto blocks could reduce I/O wait time. This is a direct result of
having more instances of the data on the disk, so it's likely that there
will be a ditto block closer to where the disk head is currently. The
actual benefit for disk-based file-systems is likely to be under 1ms per
metadata seek. It's possible that a short-term backlog on one disk could
cause BTRFS to use a ditto block on another disk, which could save more
than 20ms. There would be no performance benefit for SSD-based
file-systems.
5. There will be a (hopefully short) period where the code may be
slightly less stable, due to the modifications being performed at a
low-level within the file-system. This is likely to happen with any
modification of the file-system code, with more complex modifications
being more likely to introduce instability. I believe that the overall
complexity of this particular modification is great enough that there may
be some added instability for a bit, but perhaps use of the n-way
replication feature will substantially reduce the complexity. Hopefully,
the integration testing that's being performed on the BTRFS code will find
most of the new bugs, and point the core developers in the right direction
to fix them.
I have one final note about RAID levels. I build and sell file servers as
a side job, having assembled and delivered over 100 file servers storing
several hundreds of TB. To the best of my knowledge, no system that I've
built to my own specifications (not overridden by customer requests) has
lost any data during the first 3 years of operation. One customer requested
a disk manufacturer change, and has lost data. A few systems have had data
loss in the 4-year timeframe, due to multiple drive failure combined with
inadequate disk status monitoring.
My experience is that once your disks are larger than about 500-750GB,
RAID-6 becomes a much better choice, due to the increased chances of
having an uncorrectable read error during a reconstruct. My opinion is
that anyone storing critical information in RAID-5, or even 2-disk RAID-1,
with disks of this capacity, should either reconsider their storage
topology, or verify that they have a good backup/restore mechanism in
place for that data.
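The reasoning behind that 500-750GB threshold can be sketched with the
commonly quoted 1e-14 errors/bit figure for consumer SATA drives (my own
rough model, assuming independent bit errors):

```python
def p_rebuild_hits_ure(bytes_to_read: float, ber: float = 1e-14) -> float:
    """Probability a rebuild hits at least one unrecoverable read error
    while reading the given number of bytes, for a per-bit error rate."""
    bits = bytes_to_read * 8
    return 1 - (1 - ber) ** bits

# Rebuilding a RAID-5 from five surviving 750 GB disks reads ~3.75 TB:
# p_rebuild_hits_ure(5 * 750e9) ≈ 0.26, i.e. roughly a 1-in-4 chance
# of an error somewhere during the reconstruct.
```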
Thank you.
Peter Ashford
* Re: ditto blocks on ZFS
2014-05-20 14:07 ` Austin S Hemmelgarn
@ 2014-05-20 20:11 ` Brendan Hide
0 siblings, 0 replies; 18+ messages in thread
From: Brendan Hide @ 2014-05-20 20:11 UTC (permalink / raw)
To: Austin S Hemmelgarn, russell, linux-btrfs
On 2014/05/20 04:07 PM, Austin S Hemmelgarn wrote:
> On 2014-05-19 22:07, Russell Coker wrote:
>> [snip]
>> As an aside, I'd really like to be able to set RAID levels by subtree. I'd
>> like to use RAID-1 with ditto blocks for my important data and RAID-0 for
>> unimportant data.
>>
> But the proposed changes for n-way replication would already handle
> this.
> [snip]
>
Russell's specific request above is probably best handled by being able
to change replication levels per subvolume - this won't be handled by
N-way replication.
Extra replication on leaf nodes will make relatively little difference
in the scenarios laid out in this thread - but on "trunk" nodes (folders
or subvolumes closer to the filesystem root) it makes a significant
difference. "Plain" N-way replication doesn't flexibly treat these two
nodes differently.
As an example, Russell might have a server with two disks - yet he wants
6 copies of all metadata for subvolumes and their immediate subfolders.
At three folders deep he "only" wants to have 4 copies. At six folders
deep, only 2. Ditto blocks add an attractive safety net without
unnecessarily doubling or tripling the size of *all* metadata.
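One hypothetical way to express such a depth-based policy (nothing btrfs
implements; the formula is just one illustration of Brendan's example):

```python
def ditto_copies(depth: int, base: int = 6, step: int = 3) -> int:
    """Map tree depth to a metadata copy count: 6 copies near the
    subvolume root, dropping by 2 every `step` levels, never below 2."""
    return max(2, base - 2 * (depth // step))

# depth 0-2 → 6 copies, depth 3-5 → 4 copies, depth 6+ → 2 copies
```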
It is a good idea. The next question to me is whether or not it is
something that can be implemented elegantly and whether or not a
talented *dev* thinks it is a good idea.
--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
* Re: ditto blocks on ZFS
2014-05-20 14:56 ` ashford
@ 2014-05-21 2:51 ` Russell Coker
2014-05-21 23:05 ` Martin
2014-05-22 22:09 ` ashford
0 siblings, 2 replies; 18+ messages in thread
From: Russell Coker @ 2014-05-21 2:51 UTC (permalink / raw)
To: ashford; +Cc: linux-btrfs
On Tue, 20 May 2014 07:56:41 ashford@whisperpc.com wrote:
> 1. There will be more disk space used by the metadata. I've been aware
> of space allocation issues in BTRFS for more than three years. If the use
> of ditto blocks will make this issue worse, then it's probably not a good
> idea to implement it. The actual increase in metadata space is probably
> small in most circumstances.
Data, RAID1: total=2.51TB, used=2.50TB
System, RAID1: total=32.00MB, used=376.00KB
Metadata, RAID1: total=28.25GB, used=26.63GB
The above is my home RAID-1 array. It includes multiple backup copies of a
medium-size Maildir-format mail spool, which probably accounts for a
significant portion of the used space; the Maildir spool has an average file
size of about 70K and lots of hard links between different versions of the
backup. Even so, the metadata is only 1% of the total used space. Going from
1% to 2% to improve reliability really isn't a problem.
Data, RAID1: total=140.00GB, used=139.60GB
System, RAID1: total=32.00MB, used=28.00KB
Metadata, RAID1: total=4.00GB, used=2.97GB
Above is a small Xen server which uses snapshots to back up the files for Xen
block devices (the system is lightly loaded so I don't use nocow) and data
files that include a small Maildir spool. It's still only 2% of disk space
used for metadata; again, going from 2% to 4% isn't going to be a great
problem.
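Checking those percentages from the figures quoted above (approximate,
decimal units, my own arithmetic):

```python
# "used" figures from the two btrfs fi df listings above, in bytes
filesystems = {
    "home RAID-1": {"data": 2.50e12, "metadata": 26.63e9},
    "Xen server":  {"data": 139.60e9, "metadata": 2.97e9},
}

# metadata as a fraction of total used space
fractions = {
    name: fs["metadata"] / (fs["data"] + fs["metadata"])
    for name, fs in filesystems.items()
}
for name, frac in fractions.items():
    print(f"{name}: metadata is {frac:.1%} of used space")
```

This comes out at roughly 1% and 2% respectively, matching the claims in
the text.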
> 2. Use of ditto blocks will increase write bandwidth to the disk. This
> is a direct and unavoidable result of having more copies of the metadata.
> The actual impact of this would depend on the file-system usage pattern,
> but would probably be unnoticeable in most circumstances. Does anyone
> have a worst-case scenario for testing?
The ZFS design involves ditto blocks being spaced apart, due to the fact that
corruption tends to have some spatial locality. So you are adding an extra
seek.
The worst case would be when you have lots of small synchronous writes,
probably the default configuration of Maildir delivery would be a good case.
As an aside, I've been thinking of patching a mail server to sleep() before
fsync() on mail delivery, to see if that improves aggregate performance.
My theory is that if dozens of concurrent delivery attempts all sleep()
before fsync(), then the filesystem could write out metadata for multiple
files in one pass in the most efficient manner.
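A minimal sketch of that delivery tweak (hypothetical code, not from any
real MTA; `settle` is a made-up tuning knob):

```python
import os
import time

def deliver(path: str, message: bytes, settle: float = 0.05) -> None:
    """Write a mail file, then sleep briefly before fsync so that
    concurrent deliveries reach the filesystem together and their
    metadata can be flushed in one pass."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, message)
        time.sleep(settle)  # let other concurrent writers queue up first
        os.fsync(fd)        # then request durability for the batch
    finally:
        os.close(fd)
```

Whether the batching actually wins would depend on the filesystem's commit
behaviour; the idea is simply to trade a little per-message latency for
fewer metadata write passes.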
> 3. Certain kinds of disk errors would be easier to recover from. Some
> people here claim that those specific errors are rare.
All errors are rare. :-#
Seriously you can run Ext4 on a single disk for years and probably not lose
data. It's just a matter of how many disks and how much reliability you want.
> I have no opinion
> on how often they happen, but I believe that if the overall disk space
> cost is low, it will have a reasonable return. There would be virtually
> no reliability gains on an SSD-based file-system, as the ditto blocks
> would be written at the same time, and the SSD would be likely to map the
> logical blocks into the same page of flash memory.
That claim is unproven AFAIK. On SSD the performance cost of such things is
negligible (no seek cost) and losing 1% of disk space isn't a problem for most
systems (admittedly the early SSDs were small).
> 4. If the BIO layer of BTRFS and the device driver are smart enough,
> ditto blocks could reduce I/O wait time. This is a direct result of
> having more instances of the data on the disk, so it's likely that there
> will be a ditto block closer to where the disk head is currently. The
> actual benefit for disk-based file-systems is likely to be under 1ms per
> metadata seek. It's possible that a short-term backlog on one disk could
> cause BTRFS to use a ditto block on another disk, which could deliver
> >20ms of performance. There would be no performance benefit for SSD-based
> file-systems.
That is likely with RAID-5 and RAID-10.
> My experience is that once your disks are larger than about 500-750GB,
> RAID-6 becomes a much better choice, due to the increased chances of
> having an uncorrectable read error during a reconstruct. My opinion is
> that anyone storing critical information in RAID-5, or even 2-disk RAID-1,
> with disks of this capacity, should either reconsider their storage
> topology, or verify that they have a good backup/restore mechanism in
> place for that data.
http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
The NetApp research shows that the incidence of silent corruption is a lot
greater than you would expect. RAID-6 doesn't save you from this. You need
BTRFS or ZFS RAID-6.
On Tue, 20 May 2014 22:11:16 Brendan Hide wrote:
> Extra replication on leaf nodes will make relatively little difference
> in the scenarios laid out in this thread - but on "trunk" nodes (folders
> or subvolumes closer to the filesystem root) it makes a significant
> difference. "Plain" N-way replication doesn't flexibly treat these two
> nodes differently.
>
> As an example, Russell might have a server with two disks - yet he wants
> 6 copies of all metadata for subvolumes and their immediate subfolders.
> At three folders deep he "only" wants to have 4 copies. At six folders
> deep, only 2. Ditto blocks add an attractive safety net without
> unnecessarily doubling or tripling the size of *all* metadata.
Firstly, I don't think that doubling all metadata is a real problem.
Next, the amount of duplicate metadata can't be determined by depth. For
example, I have a mail server where an outage of the entire server is
preferable to losing email. I would set more ditto blocks for /mail than for
the root subvol. In that case I'd want the metadata for the root directory to
have the same replication as /mail, but nothing special for /home etc.
Hypothetically if metadata duplication consumed any significant disk space
then I'd probably want to only enable it on /lib* /sbin, /etc, and whatever
data the server is designed to hold. But really it's small enough to just
duplicate everything.
Currently I only run two systems on which I can't more than double the disk
space at a moderate cost. One is my EeePC 701 and the other is a ZFS archive
server (which already has the ditto blocks). For all the other systems there
is no shortage of space at all. Disks just keep getting bigger and cheaper;
for most of my uses disk size increases faster than data storage.
Currently the smallest SATA disk I can buy new is 500G. The smallest SSD is
60G for $63 but I can get 120G for $82, 240G for $149, or 480G for $295. All
the workstations I run use a lot less than 120G of storage. Storage capacity
isn't an issue for most users.
It seems to me that the only time when an extra 1% disk space usage would
really matter is when you have an array of 100 disks that's almost full. But
that's the time when you REALLY want extra duplication of metadata.
> It is a good idea. The next question to me is whether or not it is
> something that can be implemented elegantly and whether or not a
> talented *dev* thinks it is a good idea.
Absolutely. Hopefully this discussion will inspire the developers to consider
this an interesting technical challenge and a feature that is needed to beat
ZFS.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS
2014-05-21 2:51 ` Russell Coker
@ 2014-05-21 23:05 ` Martin
2014-05-22 11:10 ` Austin S Hemmelgarn
2014-05-22 22:09 ` ashford
1 sibling, 1 reply; 18+ messages in thread
From: Martin @ 2014-05-21 23:05 UTC (permalink / raw)
To: linux-btrfs
Very good comment from Ashford.
Sorry, but I see no advantages in Russell's replies other than a
"feel-good" factor or a dangerous false sense of security. At best,
there is a weak justification that "for metadata, again going from 2% to
4% isn't going to be a great problem" (storage is cheap and fast).
I thought an important idea behind btrfs was that we avoid, by design,
the very long and vulnerable RAID rebuild scenarios suffered by
block-level RAID...
On 21/05/14 03:51, Russell Coker wrote:
> Absolutely. Hopefully this discussion will inspire the developers to
> consider this an interesting technical challenge and a feature that
> is needed to beat ZFS.
Sorry, but I think that is completely the wrong reasoning. ...Unless,
that is, you are some proprietary sales droid hyping features and big
numbers! :-P
Personally I'm not convinced we gain anything beyond what btrfs will
eventually offer in any case for the n-way raid or the raid-n Cauchy stuff.
Also note that usually data needs to be 100% reliable and retrievable.
Or, if that fails, you go to your backups instead. Gambling on
"proportions" and "importance" rather than *ensuring* fault/error
tolerance is a very human thing... ;-)
Sorry:
Interesting idea but not convinced there's any advantage for disk/SSD
storage.
Regards,
Martin
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS
2014-05-20 2:07 ` Russell Coker
2014-05-20 14:07 ` Austin S Hemmelgarn
2014-05-20 14:56 ` ashford
@ 2014-05-21 23:29 ` Konstantinos Skarlatos
2 siblings, 0 replies; 18+ messages in thread
From: Konstantinos Skarlatos @ 2014-05-21 23:29 UTC (permalink / raw)
To: russell, Brendan Hide, linux-btrfs
On 20/5/2014 5:07 πμ, Russell Coker wrote:
> On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
>> This is extremely difficult to measure objectively. Subjectively ... see
>> below.
>>
>>> [snip]
>>>
>>> *What other failure modes* should we guard against?
>> I know I'd sleep a /little/ better at night knowing that a double disk
>> failure on a "raid5/1/10" configuration might ruin a ton of data along
>> with an obscure set of metadata in some "long" tree paths - but not the
>> entire filesystem.
> My experience is that most disk failures that don't involve extreme physical
> damage (EG dropping a drive on concrete) don't involve totally losing the
> disk. Much of the discussion about RAID failures concerns entirely failed
> disks, but I believe that is due to RAID implementations such as Linux
> software RAID that will entirely remove a disk when it gives errors.
>
> I have a disk which had ~14,000 errors of which ~2000 errors were corrected by
> duplicate metadata. If two disks with that problem were in a RAID-1 array
> then duplicate metadata would be a significant benefit.
>
>> The other use-case/failure mode - where you are somehow unlucky enough
>> to have sets of bad sectors/bitrot on multiple disks that simultaneously
>> affect the only copies of the tree roots - is an extremely unlikely
>> scenario. As unlikely as it may be, the scenario is a very painful
>> consequence in spite of VERY little corruption. That is where the
>> peace-of-mind/bragging rights come in.
> http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
>
> The NetApp research on latent errors on drives is worth reading. On page 12
> they report latent sector errors on 9.5% of SATA disks per year. So if you
> lose one disk entirely the risk of having errors on a second disk is higher
> than you would want for RAID-5. While losing the root of the tree is
> unlikely, losing a directory in the middle that has lots of subdirectories is
> a risk.
Seeing the results of that paper, I think erasure coding is a better
solution. Instead of keeping many copies of metadata or data, we could
apply erasure coding using something like zfec[1], which is used by
Tahoe-LAFS, increasing their size by, let's say, 5-10%, and be quite safe
even against multiple contiguous bad sectors.
[1] https://pypi.python.org/pypi/zfec
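As a toy illustration of the erasure-coding idea (single XOR parity only; real codes such as the Reed-Solomon scheme in zfec tolerate multiple erasures at configurable overhead, and this sketch is not the zfec API):

```python
# Toy erasure-coding demo: one XOR parity block over k data blocks
# lets any ONE lost block be rebuilt from the survivors. Real schemes
# (e.g. zfec's Reed-Solomon) tolerate multiple erasures.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

data = [b"meta-blk-1", b"meta-blk-2", b"meta-blk-3"]
parity = xor_blocks(data)

# Simulate losing one block to a bad sector, then recover it:
# XOR of the surviving blocks plus parity equals the lost block.
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
recovered = xor_blocks(survivors + [parity])
print(recovered == data[lost_index])  # True
```

The overhead here is 1/k (one parity block per k data blocks), which is where the "5-10%" figure above would come from with k around 10-20.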
>
> I can understand why people wouldn't want ditto blocks to be mandatory. But
> why are people arguing against them as an option?
>
>
> As an aside, I'd really like to be able to set RAID levels by subtree. I'd
> like to use RAID-1 with ditto blocks for my important data and RAID-0 for
> unimportant data.
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS
2014-05-21 23:05 ` Martin
@ 2014-05-22 11:10 ` Austin S Hemmelgarn
0 siblings, 0 replies; 18+ messages in thread
From: Austin S Hemmelgarn @ 2014-05-22 11:10 UTC (permalink / raw)
To: Martin, linux-btrfs
On 2014-05-21 19:05, Martin wrote:
> Very good comment from Ashford.
>
>
> Sorry, but I see no advantages from Russell's replies other than for a
> "feel-good" factor or a dangerous false sense of security. At best,
> there is a weak justification that "for metadata, again going from 2% to
> 4% isn't going to be a great problem" (storage is cheap and fast).
>
> I thought an important idea behind btrfs was that we avoid, by design,
> the very long and vulnerable RAID rebuild scenarios suffered by
> block-level RAID...
>
>
> On 21/05/14 03:51, Russell Coker wrote:
>> Absolutely. Hopefully this discussion will inspire the developers to
>> consider this an interesting technical challenge and a feature that
>> is needed to beat ZFS.
>
> Sorry, but I think that is completely the wrong reasoning. ...Unless,
> that is, you are some proprietary sales droid hyping features and big
> numbers! :-P
>
>
> Personally I'm not convinced we gain anything beyond what btrfs will
> eventually offer in any case for the n-way raid or the raid-n Cauchy stuff.
>
> Also note that usually data needs to be 100% reliable and retrievable.
> Or, if that fails, you go to your backups instead. Gambling on
> "proportions" and "importance" rather than *ensuring* fault/error
> tolerance is a very human thing... ;-)
>
>
> Sorry:
>
> Interesting idea but not convinced there's any advantage for disk/SSD
> storage.
>
>
> Regards,
> Martin
>
>
>
>
Another nice option in this case might be adding logic to make sure
that there is some (considerable) offset between copies of metadata
using the dup profile (all of the filesystems whose low-level on-disk
structures I have actually examined have had both copies of the
System chunks right next to each other, right at the beginning of the
disk, which of course undermines the usefulness of storing two copies of
them on disk). Adding an offset to those allocations would provide
better protection against some of the more common 'idiot' failure modes
(e.g. trying to use dd to write a disk image to a USB flash drive, and
accidentally overwriting the first n GB of your first HDD instead).
Ideally, once we have n-way replication, System chunks should default to
one copy per device for multi-device filesystems.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS
2014-05-21 2:51 ` Russell Coker
2014-05-21 23:05 ` Martin
@ 2014-05-22 22:09 ` ashford
2014-05-23 3:54 ` Russell Coker
1 sibling, 1 reply; 18+ messages in thread
From: ashford @ 2014-05-22 22:09 UTC (permalink / raw)
To: russell; +Cc: ashford, linux-btrfs
Russell,
Overall, there are still a lot of unknowns WRT the stability and ROI
(Return On Investment) of implementing ditto blocks for BTRFS. The good
news is that there's a lot of time before the underlying structure to
support them is in place, so there's time to figure this out a bit better.
> On Tue, 20 May 2014 07:56:41 ashford@whisperpc.com wrote:
>> 1. There will be more disk space used by the metadata. I've been aware
>> of space allocation issues in BTRFS for more than three years. If the
>> use of ditto blocks will make this issue worse, then it's probably not a
>> good idea to implement it. The actual increase in metadata space is
>> probably small in most circumstances.
>
> Data, RAID1: total=2.51TB, used=2.50TB
> System, RAID1: total=32.00MB, used=376.00KB
> Metadata, RAID1: total=28.25GB, used=26.63GB
>
> The above is my home RAID-1 array. It includes multiple backup copies of
> a medium size Maildir format mail spool which probably accounts for a
> significant portion of the used space, the Maildir spool has an average
> file size of about 70K and lots of hard links between different versions
> of the backup. Even so the metadata is only 1% of the total used space.
> Going from 1% to 2% to improve reliability really isn't a problem.
>
> Data, RAID1: total=140.00GB, used=139.60GB
> System, RAID1: total=32.00MB, used=28.00KB
> Metadata, RAID1: total=4.00GB, used=2.97GB
>
> Above is a small Xen server which uses snapshots to backup the files for
> Xen block devices (the system is lightly loaded so I don't use nocow)
> and for data files that include a small Maildir spool. It's still only
> 2% of disk space used for metadata, again going from 2% to 4% isn't
> going to be a great problem.
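As a quick back-of-envelope check of the quoted figures (my own arithmetic, using only the "used" numbers from the two btrfs filesystem df outputs above):

```python
# Rough check of the quoted metadata-overhead figures, using the
# "used" values from the two df outputs above.
def metadata_fraction(data_used_gb, metadata_used_gb):
    """Fraction of used space that is metadata."""
    return metadata_used_gb / (data_used_gb + metadata_used_gb)

# Home RAID-1 array: 2.50 TB data used, 26.63 GB metadata used.
home = metadata_fraction(2.50 * 1024, 26.63)
# Small Xen server: 139.60 GB data used, 2.97 GB metadata used.
xen = metadata_fraction(139.60, 2.97)

print(f"home array: {home:.1%}, xen server: {xen:.1%}")
# Doubling metadata would roughly double these fractions.
```

This confirms the ~1% and ~2% figures quoted, so ditto blocks would move metadata overhead to roughly 2% and 4% respectively.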
You've addressed half of the issue. It appears that the metadata is
normally a bit over 1% using the current methods, but two samples do not
make a statistical universe. The good news is that these two samples are
from opposite extremes of usage, so I expect they're close to where the
overall average would end up. I'd like to see a few more samples, from
other usage scenarios, just to be sure.
If the above numbers are normal, adding ditto blocks could increase the
size of the metadata from 1% to 2% or even 3%. This isn't a problem.
What we still don't know, and probably won't until after it's implemented,
is whether or not the addition of ditto blocks will make the space
allocation worse.
>> 2. Use of ditto blocks will increase write bandwidth to the disk. This
>> is a direct and unavoidable result of having more copies of the
>> metadata.
>> The actual impact of this would depend on the file-system usage pattern,
>> but would probably be unnoticeable in most circumstances. Does anyone
>> have a worst-case scenario for testing?
>
> The ZFS design involves ditto blocks being spaced apart due to the fact
> that corruption tends to have some spatial locality. So you are adding
> an extra seek.
>
> The worst case would be when you have lots of small synchronous writes,
> probably the default configuration of Maildir delivery would be a good
> case.
Is there a performance test for this? That would be helpful in
determining the worst-case performance impact of implementing ditto
blocks, and probably some other enhancements as well.
>> 3. Certain kinds of disk errors would be easier to recover from. Some
>> people here claim that those specific errors are rare. I have no
>> opinion on how often they happen, but I believe that if the overall
>> disk space cost is low, it will have a reasonable return. There would
>> be virtually no reliability gains on an SSD-based file-system, as the
>> ditto blocks would be written at the same time, and the SSD would be
>> likely to map the logical blocks into the same page of flash memory.
>
> That claim is unproven AFAIK.
That claim is a direct result of how SSDs function.
>> 4. If the BIO layer of BTRFS and the device driver are smart enough,
>> ditto blocks could reduce I/O wait time. This is a direct result of
>> having more instances of the data on the disk, so it's likely that there
>> will be a ditto block closer to where the disk head is currently. The
>> actual benefit for disk-based file-systems is likely to be under 1ms per
>> metadata seek. It's possible that a short-term backlog on one disk
>> could cause BTRFS to use a ditto block on another disk, which could
>> deliver >20ms of performance. There would be no performance benefit for
>> SSD-based file-systems.
>
> That is likely with RAID-5 and RAID-10.
It's likely with all disk layouts. The reason just looks different on
different RAID structures.
>> My experience is that once your disks are larger than about 500-750GB,
>> RAID-6 becomes a much better choice, due to the increased chances of
>> having an uncorrectable read error during a reconstruct. My opinion is
>> that anyone storing critical information in RAID-5, or even 2-disk
>> RAID-1,
>> with disks of this capacity, should either reconsider their storage
>> topology, or verify that they have a good backup/restore mechanism in
>> place for that data.
>
> http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
>
> The NetApp research shows that the incidence of silent corruption is a
> lot greater than you would expect. RAID-6 doesn't save you from this.
> You need BTRFS or ZFS RAID-6.
I was referring to hard read errors, not silent data corruption.
Peter Ashford
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS
2014-05-22 22:09 ` ashford
@ 2014-05-23 3:54 ` Russell Coker
2014-05-23 8:03 ` Duncan
0 siblings, 1 reply; 18+ messages in thread
From: Russell Coker @ 2014-05-23 3:54 UTC (permalink / raw)
To: ashford; +Cc: linux-btrfs
On Thu, 22 May 2014 15:09:40 ashford@whisperpc.com wrote:
> You've addressed half of the issue. It appears that the metadata is
> normally a bit over 1% using the current methods, but two samples do not
> make a statistical universe. The good news is that these two samples are
> from opposite extremes of usage, so I expect they're close to where the
> overall average would end up. I'd like to see a few more samples, from
> other usage scenarios, just to be sure.
>
> If the above numbers are normal, adding ditto blocks could increase the
> size of the metadata from 1% to 2% or even 3%. This isn't a problem.
>
> What we still don't know, and probably won't until after it's implemented,
> is whether or not the addition of ditto blocks will make the space
> allocation worse.
I've been involved in many discussions about filesystem choice. None of them
has included anyone raising an issue about ZFS metadata space usage; probably
most ZFS users don't even know about ditto blocks.
The relevant issue regarding disk space is the fact that filesystems tend to
perform better if there is a reasonable amount of free space. The amount of
free space for good performance will depend on filesystem, usage pattern, and
whatever you might define as "good performance".
The first two Google hits when searching for recommended free space on ZFS
recommend using no more than 80% and 85% of disk space. Obviously, if "good
performance" requires 15% free disk space, then your capacity problem isn't
going to be solved by not duplicating metadata. Note that I cannot vouch for
the accuracy of such claims about ZFS performance.
Is anyone doing research on how much free disk space is required on BTRFS for
"good performance"? If a rumor (whether correct or incorrect) goes around
that you need 20% free space on a BTRFS filesystem for performance, then the
space set aside for that will vastly outweigh the space used for metadata.
> > The ZFS design involves ditto blocks being spaced apart due to the fact
> > that corruption tends to have some spatial locality. So you are adding
> > an extra seek.
> >
> > The worst case would be when you have lots of small synchronous writes,
> > probably the default configuration of Maildir delivery would be a good
> > case.
>
> Is there a performance test for this? That would be helpful in
> determining the worst-case performance impact of implementing ditto
> blocks, and probably some other enhancements as well.
http://doc.coker.com.au/projects/postal/
My Postal mail server benchmark is one option. There are more than a few
benchmarks of synchronous writes of small files, but Postal uses real world
programs that need such performance. Delivering a single message via a
typical Unix MTA requires synchronous writes of two queue files and then the
destination file in the mail store.
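The delivery pattern described above (synchronous writes of queue files, then the destination file in the mail store) can be sketched roughly as follows; this is a hypothetical minimal loop, not Postal or any real MTA:

```python
# Sketch of Maildir-style delivery: the worst case for extra metadata
# copies is many small files, each fsynced before being made visible.
# (Hypothetical minimal delivery function, not Postal or a real MTA.)
import os
import tempfile
import time

def deliver(maildir, body):
    """Write a message durably to tmp/, then rename atomically into new/."""
    os.makedirs(os.path.join(maildir, "tmp"), exist_ok=True)
    os.makedirs(os.path.join(maildir, "new"), exist_ok=True)
    name = f"{time.time_ns()}.{os.getpid()}"
    tmp = os.path.join(maildir, "tmp", name)
    with open(tmp, "wb") as f:
        f.write(body)
        f.flush()
        os.fsync(f.fileno())   # synchronous write: data must hit stable storage
    dst = os.path.join(maildir, "new", name)
    os.rename(tmp, dst)        # atomic visibility in new/
    return dst

with tempfile.TemporaryDirectory() as d:
    path = deliver(d, b"Subject: test\n\nhello\n")
    print(os.path.exists(path))  # True
```

Every delivered message forces at least one fsync plus directory updates, so any extra metadata copies multiply the synchronous-write cost here.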
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS
2014-05-23 3:54 ` Russell Coker
@ 2014-05-23 8:03 ` Duncan
0 siblings, 0 replies; 18+ messages in thread
From: Duncan @ 2014-05-23 8:03 UTC (permalink / raw)
To: linux-btrfs
Russell Coker posted on Fri, 23 May 2014 13:54:46 +1000 as excerpted:
> Is anyone doing research on how much free disk space is required on
> BTRFS for "good performance"? If a rumor (whether correct or incorrect)
> goes around that you need 20% free space on a BTRFS filesystem for
> performance then that will vastly outweigh the space used for metadata.
Well, on btrfs there's free-space, and then there's free-space. The
chunk allocation and both data/metadata fragmentation make a difference.
That said, *IF* you're looking at the right numbers, btrfs doesn't
actually require that much free space, and should run as efficiently
right up to just a few GiB free, on pretty much any btrfs over a few GiB
in size, so at least in the significant fractions of a TiB on up range,
it doesn't require that much free space /as/ /a/ /percentage/ at all.
**BUT BE SURE YOU'RE LOOKING AT THE RIGHT NUMBERS** as explained below.
Chunks:
On btrfs, both data and metadata are allocated in chunks, 1 GiB chunks
for data, 256 MiB chunks for metadata. The catch is that while both
chunks and space within chunks can be allocated on-demand, deleting files
only frees space within chunks -- the chunks themselves remain allocated
to data/metadata whichever they were, and cannot be reallocated to the
other. To deallocate unused chunks and to rewrite partially used chunks
to consolidate usage on to fewer chunks and free the others, btrfs admins
must currently manually (or via script) do a btrfs balance.
btrfs filesystem show:
For the btrfs filesystem show output, the individual devid lines show
total filesystem space on the device vs. used, as in allocated to chunks,
space.[1] Ideally (assuming equal sized devices) you should keep at
least 2.5-3.0 GiB free per device, since that will allow allocation of
two chunks each for data (1 GiB each) and metadata (quarter GiB each, but
on single-device filesystems they are allocated in pairs by default, so
half a MiB, see below). Since the balance process itself will want to
allocate a new chunk to write into in ordered to rewrite and consolidate
existing chunks, you don't want to use the last one available, and since
the filesystem could decide it needs to allocate another chunk for normal
usage as well, you always want to keep at least two chunks worth of each,
thus 2.5 GiB (3.0 GiB for single-device-filesystems, see below),
unallocated, one chunk each data/metadata for the filesystem if it needs
it, and another to ensure balance can allocate at least the one chunk to
do its rewrite.
As I said, data chunks are 1 GiB, while metadata chunks are 256 MiB, a
quarter GiB. However, on a single-device btrfs, metadata will normally
default to dup (duplicate, two copies for safety) mode, and will thus
allocate two chunks, half a GiB at a time. This is why you want 3 GiB
minimum free on a single-device btrfs, space for two single-mode data
chunk allocations (1 GiB * 2 = 2 GiB), plus two dup-mode metadata chunk
allocations (256 MiB * 2 * 2 = 1 GiB). But on multi-device btrfs, only a
single copy is stored per device, so the metadata minimum reserve is only
half a GiB per device (256 MiB * 2 = 512 MiB = half a GiB).
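The reserve arithmetic above can be written out as a quick sketch (using the chunk sizes just described):

```python
# Minimum unallocated-space reserve, computed from the chunk sizes
# described above: 1 GiB data chunks, 256 MiB metadata chunks, two
# chunks' worth of each kept in reserve; dup metadata on single-device
# filesystems allocates metadata chunks in pairs.
GIB = 1024**3
DATA_CHUNK = 1 * GIB
META_CHUNK = 256 * 1024**2

def min_reserve(single_device):
    """Bytes of unallocated space to keep free (per device)."""
    meta_alloc = META_CHUNK * (2 if single_device else 1)  # dup = pairs
    return 2 * DATA_CHUNK + 2 * meta_alloc

print(min_reserve(True) / GIB)   # 3.0  (single-device filesystem)
print(min_reserve(False) / GIB)  # 2.5  (per device, multi-device)
```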
That's the minimum unallocated space you need free. More than that is
nice and lets you go longer between having to worry about rebalances, but
it really won't help btrfs efficiency that much, since btrfs uses already
allocated chunk space where it can.
btrfs filesystem df:
Then there's the already chunk-allocated space. btrfs filesystem df
reports on this. In the df output, total means allocated, while used
means used out of that allocation, so the spread between them is the
allocated-but-unused space.
Since btrfs allocates new chunks on-demand from the unallocated space
pool, but cannot reallocate chunks between data and metadata on its own,
and because the used blocks within existing chunks will get fragmented
over time, it's best to keep the btrfs filesystem df reported spread
between total and used to a minimum.
Of course, as I said above data chunks are 1 GiB each, so a data
allocation spread of under a GiB won't be recoverable in any case, and a
spread of 1-5 GiB isn't a big deal. But if for instance btrfs filesystem
df reports data 1.25 TiB total (that is, allocated) but only 250 GiB
used, that's a spread of roughly a TiB, and running a btrfs balance in
order to recover most of that spread to unallocated is a good idea.
Similarly with metadata, except it'll be allocated in 256 MiB chunks, two
at a time by default on a single-device filesystem, so 512 MiB at a time
in that case. But again, if btrfs filesystem df is reporting say 10.5
GiB total metadata but only perhaps 1.75 GiB used, the spread is several
chunks worth and particularly if your unallocated reserve (as reported by
btrfs filesystem show in the individual device lines) is getting low,
it's time to consider rebalancing it to recover the unused metadata space
to unallocated.
It's also worth noting that btrfs requires some metadata space free to
work with, figure about one chunk's worth, so if there's no unallocated
space left and metadata space gets under 300 MiB or so, you're getting
real close to ENOSPC errors! For the same reason, even a full balance
will likely still leave a metadata chunk or two (so, say, half a gig) of
reported spread between metadata total and used; that's not recoverable
by balance because btrfs actually reserves that for its own use.
Finally, it can be noted that under normal usage and particularly in
cases where people delete a whole bunch of medium to large files (and
assuming those same files aren't being saved in a btrfs snapshot, which
would prevent their deletion actually freeing the space they take until
all the snapshots that contain them are deleted as well), a lot of
previously allocated data chunks will become mostly or fully empty, but
metadata usage won't go down all that much, so relatively less metadata
space will return to unused. That means where people haven't rebalanced
in awhile, they're likely to have a lot of allocated but unused data
space that can be reused, but rather less unused metadata space to
reuse. As a result, when all space is allocated and there's no more to
allocate to new chunks, it's most commonly metadata space that runs out
first, *SOMETIMES WITH LOTS OF SPACE STILL REPORTED AS FREE BY ORDINARY
DF* and lots of data space free as reported by btrfs filesystem df as
well, simply because all available metadata chunks are full, and all
remaining space is allocated to data chunks, a significant number of
which may be mostly free.
But OTOH, if you work with mostly small files, a KiB or smaller, and have
deleted a bunch of them, it's likely you'll free a lot of metadata space
because such small files are often stored entirely as metadata. In that
case you may run out of data space first, once all space is allocated to
chunks of some kind. This is somewhat rarer, but it does happen, and the
symptoms can look a bit strange as sometimes it'll result in a bunch of
zero-sized files, because the metadata space was available for them but
when it came time to write the actual data, there was no space to do so.
But once all space is allocated to chunks so no more chunks can be
allocated, it's only a matter of time until either data or metadata runs
out, even if there's plenty of "space" free, because all that "space" is
tied up in the other one! As I said above, keep an eye on btrfs
filesystem show output, and try to do a rebalance when the spread between
total and used (allocated) gets close to 3 GiB, because once all space is
actually allocated, you're in a bit of a bind and balance may find it
hard to free space as well. There's tricks that can help as described
below, but it's better not to find yourself in that spot in the first
place.
Balance and balance filters:
Now let's look at balance and balance filters. There's a page on the wiki
[2] that explains balance filters in some detail, but for our purposes
here, it's sufficient to know -m tells balance to only handle metadata
chunks, while -d tells it to only handle data chunks, and usage=N can be
used to tell it to only rebalance chunks with that usage or LESS, thus
allowing you to avoid unnecessarily rebalancing full and almost full
chunks, while still allowing recovery of nearly empty chunks to the
unallocated pool.
So if btrfs filesystem df shows a big spread between total and used for
data, try something like this:
btrfs balance start -dusage=20 (note no space between -d and usage)
That says balance (rewrite and consolidate) only data chunks with usage
of 20% or less. That will be MUCH faster than a full rebalance, and
should be quite a bit faster than simply -d (data chunks only, without
the usage filter) as well, while still consolidating data chunks with
usage at or below 20%, which will likely be quite a few if the spread is
pretty big.
Of course you can adjust the N in that usage=N as needed between 0 and
100. As the filesystem really does fill up and there's less room to
spare to allocated but unused chunks, you'll need to increase that usage=
toward 100 in order to consolidate and recover as many partially used
chunks as possible. But while the filesystem is mostly empty and/or if
the btrfs filesystem df spread between used and total is large (tens or
hundreds of gigs), a smaller usage=, say usage=5, will likely get you
very good results, but MUCH faster, since you're only dealing with chunks
at or under 5% full, meaning far less actual rewriting, while most of the
time getting a full gig back for every 1/20 gig (5%) you actually rewrite!
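The usage=N filter semantics can be sketched with made-up chunk usage figures (illustrative only, not real balance internals):

```python
# Sketch of what the usage=N balance filter selects: only chunks whose
# used fraction is at or below N% are rewritten, so a low N recovers
# nearly-empty chunks cheaply. The chunk percentages are made-up.
def balance_candidates(chunk_usage_pct, threshold):
    """Chunks (by % used) that a usage=threshold balance would rewrite."""
    return [u for u in chunk_usage_pct if u <= threshold]

chunks = [98, 95, 80, 40, 18, 7, 3, 0]  # % used, one entry per 1 GiB data chunk

for n in (0, 5, 20, 100):
    picked = balance_candidates(chunks, n)
    rewritten = sum(picked) / 100   # GiB of live data actually rewritten
    freed = len(picked) - rewritten  # GiB returned to unallocated
    print(f"usage={n:3}: {len(picked)} chunks, "
          f"rewrite {rewritten:.2f} GiB, free ~{freed:.2f} GiB")
```

With these made-up numbers, usage=20 frees nearly 4 GiB while rewriting well under half a GiB of live data, which is the "much faster" effect described above.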
***ANSWER!***
While btrfs shouldn't lose that much operational efficiency as the
filesystem fills as long as there's unallocated chunks available to
allocate as it needs them, the closer it is to full, the more frequently
one will need to rebalance and the closer to 100 the usage= balance
filter will need to be in order to recover all possible space to
unallocated and keep it free for allocation as necessary.
Tying up loose ends: Tricks:
Above, I mentioned tricks that can let you balance even if there's no
space left to allocate the new chunk to rewrite data/metadata from the
old chunk into, so a normal balance won't work.
The first such trick is the usage=0 balance filter. Even if you're
totally out of unallocated space as reported by btrfs filesystem show, if
btrfs filesystem df shows a large spread between used and total (or even
if not, if you're lucky, as long as the spread is at least one chunk's
worth), there's a fair chance that at least one chunk is totally empty.
In that case, there's nothing in it to rewrite, and balancing that chunk
will simply free it, without requiring a chunk allocation to do the
rewrite. Using usage=0 tells balance to only consider such chunks,
freeing any that it finds without requiring space to rewrite the data,
since there's nothing there to rewrite. =:^)
Still, there's no guarantee balance will find any totally empty chunks to
free, so it's better not to get into that situation to begin with. As I
said above, try to keep at least 3 GiB free as reported by the individual
device lines of btrfs filesystem show (or 2.5 GiB each device of a multi-
device filesystem).
If -dusage=0/-musage=0 doesn't work, the next trick is to try temporarily
adding another device to the btrfs, using btrfs device add. This device
should be at least several GiB (again, I'd say 3 GiB, minimum, but 10 GiB
or so would be better, no need to make it /huge/) in size, and could be a
USB thumb drive or the like. If you have 8 GiB or better memory and
aren't using it all, even a several GiB loopback file created on top of
tmpfs can work, but of course if the system crashes while that temporary
device is in use, say goodbye to whatever was on it at the time!
The idea is to add the device temporarily, do a btrfs balance with a
usage filter set as low as possible to free up at least one extra chunk
worth of space on the permanent device(s), then when balance has
recovered enough chunks worth of space to do so, do a btrfs device delete
on the temporary device to return the chunks on it to the newly
unallocated space on the permanent devices.
The temporary device trick should work where the usage=0 trick fails and
should allow getting out of the bind, but again, better never to find
yourself in that bind in the first place, so keep an eye on those btrfs
filesystem show results!
More loose ends:
Above I assumed all devices of a multi-device btrfs are the same size, so
they should fill up roughly in parallel and the per-device lines in the
btrfs filesystem show output should be similar. If you're using
different sized devices, depending on your configured raid mode and the
size of the devices, one will likely fill up first, but there will still
be room left on the others. The details are too complex to deal with
here, but one thing that's worth noting is that for some device sizes and
raid mode configurations, btrfs will not be able to use the full size of
the largest device. Hugo's btrfs device and filesystem layout
configurator page is a good tool to use when planning a mixed-device-size
btrfs.
Finally, there's the usage value in the total devices line of btrfs
filesystem show, which in footnote [1] below I recommend ignoring if you
don't understand it. That number is actually the (rounded appropriately)
sum of all the used values as reported by btrfs filesystem df.
Basically, add the used values from the data and metadata lines (because
the other usage lines end up being rounding errors) of btrfs filesystem
df, and that should (within rounding error) be the number reported by
btrfs filesystem show as usage in the total devices line. That's where
the number comes from and it is in some ways the actual filesystem
usage. But in btrfs terms it's relatively unimportant compared to the
chunk-allocated/unallocated/total values as reported on the individual
device lines, and the data/metadata values as reported by btrfs
filesystem df, so for btrfs administration purposes it's generally better
to simply pretend that btrfs filesystem show total devices line usage
doesn't even appear at all, as in real life, far more people seem to get
confused by it than find it actually useful. But that's where that
number comes from, if you find you can't simply ignore it as I recommend.
(I know I'd have a hard time ignoring it myself, until I knew where it
actually came from.)
---
[1] The total devices line used is reporting something entirely
different, best ignored if you don't understand it as it has deceived a
lot of people into thinking they have lots of room available when it's
actually all allocated.
[2] Btrfs wiki, general link: https://btrfs.wiki.kernel.org
Balance filters:
https://btrfs.wiki.kernel.org/index.php/Balance_Filters
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: ditto blocks on ZFS
@ 2014-05-22 15:28 Tomasz Chmielewski
0 siblings, 0 replies; 18+ messages in thread
From: Tomasz Chmielewski @ 2014-05-22 15:28 UTC (permalink / raw)
To: linux-btrfs
> I thought an important idea behind btrfs was that we avoid by design
> in the first place the very long and vulnerable RAID rebuild scenarios
> suffered for block-level RAID...
This may be true for SSDs; for ordinary spinning disks it's not
entirely the case.
For most RAID rebuilds, software RAID-1 still seems much faster: one
drive is read at (almost) full speed while the other is written at
(almost) full speed (assuming no other IO load).
With btrfs RAID-1, the way balance works after a disk replace, the
rebuild takes lots of disk head movements, resulting in low overall
speed, especially with lots of snapshots and the related fragmentation.
And balance is still not smart: it reads from one device and writes to
*both* devices (an extra, unnecessary write to the healthy device, when
it should read from the healthy device and write only to the replaced
device).
Of course, other factors such as the amount of data or disk IO usage
during rebuild apply.
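A rough back-of-the-envelope comparison shows why seeks dominate. All
the numbers below are assumptions picked for illustration (throughput,
extent size, and seek latency of a typical spinning disk), not
measurements of btrfs:

```python
# Made-up but plausible numbers: a streaming mirror rebuild vs. a
# seek-bound rebuild of a heavily fragmented filesystem.
used_gib = 1000
seq_mib_s = 150        # sequential disk throughput, assumed
avg_extent_kib = 256   # heavily snapshotted/fragmented fs, assumed
seek_ms = 10           # average seek + rotational latency, assumed

# Sequential mirror rebuild: pure streaming transfer.
seq_hours = used_gib * 1024 / seq_mib_s / 3600

# Seek-bound rebuild: one seek per extent on top of the transfer time.
extents = used_gib * 1024 * 1024 / avg_extent_kib
transfer_s = used_gib * 1024 / seq_mib_s
seek_s = extents * seek_ms / 1000
frag_hours = (transfer_s + seek_s) / 3600

print(f"sequential mirror rebuild: ~{seq_hours:.1f} h")
print(f"seek-bound rebuild:        ~{frag_hours:.1f} h")
```

Under these assumptions the seek time dwarfs the transfer time, which
matches the observation above that fragmentation, not raw bandwidth,
sets the rebuild speed.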
--
Tomasz Chmielewski
http://wpkg.org
Thread overview: 18+ messages
2014-05-16 3:07 ditto blocks on ZFS Russell Coker
2014-05-17 12:50 ` Martin
2014-05-17 14:24 ` Hugo Mills
2014-05-18 16:09 ` Russell Coker
2014-05-19 20:36 ` Martin
2014-05-19 21:47 ` Brendan Hide
2014-05-20 2:07 ` Russell Coker
2014-05-20 14:07 ` Austin S Hemmelgarn
2014-05-20 20:11 ` Brendan Hide
2014-05-20 14:56 ` ashford
2014-05-21 2:51 ` Russell Coker
2014-05-21 23:05 ` Martin
2014-05-22 11:10 ` Austin S Hemmelgarn
2014-05-22 22:09 ` ashford
2014-05-23 3:54 ` Russell Coker
2014-05-23 8:03 ` Duncan
2014-05-21 23:29 ` Konstantinos Skarlatos
2014-05-22 15:28 Tomasz Chmielewski