* ditto blocks on ZFS @ 2014-05-16 3:07 Russell Coker 2014-05-17 12:50 ` Martin 0 siblings, 1 reply; 18+ messages in thread From: Russell Coker @ 2014-05-16 3:07 UTC (permalink / raw) To: linux-btrfs https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape Probably most of you already know about this, but for those of you who haven't, the above describes ZFS "ditto blocks", which is a good feature we need on BTRFS. The briefest summary is that on top of the RAID redundancy there is one more copy of metadata than there is of data, so copies=2 implies 3 copies of metadata, and the default option of 1 copy of data means that metadata is "dup" in addition to whatever RAID redundancy is in place. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 18+ messages in thread
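[Editor's note: a minimal sketch of the copy-count rule described above. The cap of three matches ZFS's documented maximum for the copies property; the function name is invented for illustration and does not correspond to any ZFS or btrfs API.]

```python
# The ZFS ditto-block rule as summarised above: ordinary metadata always gets
# one more copy than data, so copies=1 (the default) gives "dup" metadata and
# copies=2 gives triple metadata -- all on top of any RAID-level redundancy.
def metadata_copies(data_copies):
    """Copies of ordinary metadata for a given 'copies=' setting (ZFS caps at 3)."""
    return min(data_copies + 1, 3)

for c in (1, 2, 3):
    print(f"copies={c}: data x{c}, metadata x{metadata_copies(c)}")
```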
* Re: ditto blocks on ZFS 2014-05-16 3:07 ditto blocks on ZFS Russell Coker @ 2014-05-17 12:50 ` Martin 2014-05-17 14:24 ` Hugo Mills 2014-05-18 16:09 ` Russell Coker 0 siblings, 2 replies; 18+ messages in thread From: Martin @ 2014-05-17 12:50 UTC (permalink / raw) To: linux-btrfs On 16/05/14 04:07, Russell Coker wrote: > https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape > > Probably most of you already know about this, but for those of you who haven't > the above describes ZFS "ditto blocks" which is a good feature we need on > BTRFS. The briefest summary is that on top of the RAID redundancy there... [... are additional copies of metadata ...] Is that idea not already implemented in effect in btrfs with the way that the superblocks are replicated multiple times, ever more times, for ever more huge storage devices? The one exception is for SSDs whereby there is the excuse that you cannot know whether your data is usefully replicated across different erase blocks on a single device, and SSDs are not 'that big' anyhow. So... Your idea of replicating metadata multiple times in proportion to assumed 'importance' or 'extent of impact if lost' is an interesting approach. However, is that appropriate and useful considering the real world failure mechanisms that are to be guarded against? Do you see or measure any real advantage? Regards, Martin ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS 2014-05-17 12:50 ` Martin @ 2014-05-17 14:24 ` Hugo Mills 2014-05-18 16:09 ` Russell Coker 1 sibling, 0 replies; 18+ messages in thread From: Hugo Mills @ 2014-05-17 14:24 UTC (permalink / raw) To: Martin; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1711 bytes --] On Sat, May 17, 2014 at 01:50:52PM +0100, Martin wrote: > On 16/05/14 04:07, Russell Coker wrote: > > https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape > > > > Probably most of you already know about this, but for those of you who haven't > > the above describes ZFS "ditto blocks" which is a good feature we need on > > BTRFS. The briefest summary is that on top of the RAID redundancy there... > [... are additional copies of metadata ...] > > > Is that idea not already implemented in effect in btrfs with the way > that the superblocks are replicated multiple times, ever more times, for > ever more huge storage devices? Superblocks are the smallest part of the metadata. There's a whole load of metadata that's not in the superblocks that isn't replicated in this way. > The one exception is for SSDs whereby there is the excuse that you > cannot know whether your data is usefully replicated across different > erase blocks on a single device, and SSDs are not 'that big' anyhow. > > > So... Your idea of replicating metadata multiple times in proportion to > assumed 'importance' or 'extent of impact if lost' is an interesting > approach. However, is that appropriate and useful considering the real > world failure mechanisms that are to be guarded against? > > Do you see or measure any real advantage? This. How many copies do you actually need? Are there concrete statistics to show the marginal utility of each additional copy? Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- IMPROVE YOUR ORGANISMS!! 
-- Subject line of spam email --- [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 811 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS 2014-05-17 12:50 ` Martin 2014-05-17 14:24 ` Hugo Mills @ 2014-05-18 16:09 ` Russell Coker 2014-05-19 20:36 ` Martin 1 sibling, 1 reply; 18+ messages in thread From: Russell Coker @ 2014-05-18 16:09 UTC (permalink / raw) To: Martin; +Cc: linux-btrfs On Sat, 17 May 2014 13:50:52 Martin wrote: > On 16/05/14 04:07, Russell Coker wrote: > > https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape > > > > Probably most of you already know about this, but for those of you who > > haven't the above describes ZFS "ditto blocks" which is a good feature we > > need on BTRFS. The briefest summary is that on top of the RAID > > redundancy there... > [... are additional copies of metadata ...] > > > Is that idea not already implemented in effect in btrfs with the way > that the superblocks are replicated multiple times, ever more times, for > ever more huge storage devices? No. If the metadata for the root directory is corrupted then everything is lost even if the superblock is OK. At every level in the directory tree a corruption will lose all levels below that, a corruption for /home would be very significant as would a corruption of /home/importantuser/major-project. > The one exception is for SSDs whereby there is the excuse that you > cannot know whether your data is usefully replicated across different > erase blocks on a single device, and SSDs are not 'that big' anyhow. I am not convinced by that argument. While you can't know that it's usefully replicated you also can't say for sure that replication will never save you. There will surely be some random factors involved. If dup on ssd will save you from 50% of corruption problems is it worth doing? What if it's 80% or 20%? I have BTRFS running as the root filesystem on Intel SSDs on four machines (one of which is a file server with a pair of large disks in a BTRFS RAID-1). 
On all of those systems I have dup for metadata, it doesn't take up any amount of space I need for something else and it might save me. > So... Your idea of replicating metadata multiple times in proportion to > assumed 'importance' or 'extent of impact if lost' is an interesting > approach. However, is that appropriate and useful considering the real > world failure mechanisms that are to be guarded against? Firstly it's not my idea, it's the idea of the ZFS developers. Secondly I started reading about this after doing some experiments with a failing SATA disk. In spite of having ~14,000 read errors (which sounds like a lot but is a small fraction of a 2TB disk) the vast majority of the data was readable, largely due to ~2000 errors corrected by dup metadata. > Do you see or measure any real advantage? Imagine that you have a RAID-1 array where both disks get ~14,000 read errors. This could happen due to a design defect common to drives of a particular model or some shared environmental problem. Most errors would be corrected by RAID-1 but there would be a risk of some data being lost due to both copies being corrupt. Another possibility is that one disk could entirely die (although total disk death seems rare nowadays) and the other could have corruption. If metadata was duplicated in addition to being on both disks then the probability of data loss would be reduced. Another issue is the case where all drive slots are filled with active drives (a very common configuration). To replace a disk you have to physically remove the old disk before adding the new one. If the array is a RAID-1 or RAID-5 then ANY error during reconstruction loses data. Using dup for metadata on top of the RAID protections (IE the ZFS ditto idea) means that case doesn't lose you data. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 18+ messages in thread
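[Editor's note: a back-of-envelope for the RAID-1 argument above. It assumes each on-disk copy of a block is corrupt independently with the same probability, which is exactly the assumption correlated failures (same drive model, shared environment) violate; the numbers are illustrative, not measured.]

```python
# RAID-1 alone keeps 2 copies of a block; dup metadata on top of RAID-1 keeps 4.
# Under an (optimistic) independence assumption, a block is lost only when
# every copy is corrupt, so the extra copies shrink the loss probability fast.
def p_block_lost(p_copy_corrupt, copies):
    """Probability that all copies of a block are corrupt (independence assumed)."""
    return p_copy_corrupt ** copies

p = 1e-5  # illustrative per-copy corruption probability, not a measured value
print("RAID-1 data:          ", p_block_lost(p, 2))
print("RAID-1 + dup metadata:", p_block_lost(p, 4))
```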
* Re: ditto blocks on ZFS 2014-05-18 16:09 ` Russell Coker @ 2014-05-19 20:36 ` Martin 2014-05-19 21:47 ` Brendan Hide 0 siblings, 1 reply; 18+ messages in thread From: Martin @ 2014-05-19 20:36 UTC (permalink / raw) To: linux-btrfs On 18/05/14 17:09, Russell Coker wrote: > On Sat, 17 May 2014 13:50:52 Martin wrote: [...] >> Do you see or measure any real advantage? > > Imagine that you have a RAID-1 array where both disks get ~14,000 read errors. > This could happen due to a design defect common to drives of a particular > model or some shared environmental problem. Most errors would be corrected by > RAID-1 but there would be a risk of some data being lost due to both copies > being corrupt. Another possibility is that one disk could entirely die > (although total disk death seems rare nowadays) and the other could have > corruption. If metadata was duplicated in addition to being on both disks > then the probability of data loss would be reduced. > > Another issue is the case where all drive slots are filled with active drives > (a very common configuration). To replace a disk you have to physically > remove the old disk before adding the new one. If the array is a RAID-1 or > RAID-5 then ANY error during reconstruction loses data. Using dup for > metadata on top of the RAID protections (IE the ZFS ditto idea) means that > case doesn't lose you data. Your example there is for the case where in effect there is no RAID. How is that case any better than what is already done for btrfs duplicating metadata? So... What real-world failure modes do the ditto blocks usefully protect against? And how does that compare for failure rates and against what is already done? For example, we have RAID1 and RAID5 to protect against any one RAID chunk being corrupted or for the total loss of any one device. There is a second part to that in that another failure cannot be tolerated until the RAID is remade. 
Hence, we have RAID6 that protects against any two failures for a chunk or device. Hence with just one failure, you can tolerate a second failure whilst rebuilding the RAID. And then we supposedly have safety-by-design where the filesystem itself is using a journal and barriers/sync to ensure that the filesystem is always kept in a consistent state, even after an interruption to any writes. *What other failure modes* should we guard against? There has been mention of fixing metadata keys from single-bit flips... Should Hamming codes be used instead of a CRC so that we can have multiple-bit-error-detect, single-bit-error-correct functionality for all data both in RAM and on disk for those systems that do not use ECC RAM? Would that be useful?... Regards, Martin ^ permalink raw reply [flat|nested] 18+ messages in thread
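[Editor's note: a purely didactic aside, not a proposal for btrfs's on-disk format. The kind of code Martin mentions can be sketched with the textbook Hamming(7,4) code: unlike a CRC, which can only detect a flipped bit, it locates and repairs it. The SECDED behaviour he describes (detect double, correct single) adds one overall parity bit on top of this.]

```python
# Hamming(7,4): positions 1..7 hold p1 p2 d1 p3 d2 d3 d4, where each parity
# bit covers the positions whose 1-based index has the corresponding bit set.
def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1          # repair the single-bit error in place
    return [c[2], c[4], c[5], c[6]]   # recovered data bits

codeword = hamming74_encode([1, 0, 1, 1])
corrupted = codeword[:]
corrupted[4] ^= 1                     # flip one bit "on disk"
print(hamming74_correct(corrupted))   # recovers [1, 0, 1, 1]
```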
* Re: ditto blocks on ZFS 2014-05-19 20:36 ` Martin @ 2014-05-19 21:47 ` Brendan Hide 2014-05-20 2:07 ` Russell Coker 0 siblings, 1 reply; 18+ messages in thread From: Brendan Hide @ 2014-05-19 21:47 UTC (permalink / raw) To: Martin, linux-btrfs On 2014/05/19 10:36 PM, Martin wrote: > On 18/05/14 17:09, Russell Coker wrote: >> On Sat, 17 May 2014 13:50:52 Martin wrote: > [...] >>> Do you see or measure any real advantage? >> [snip] This is extremely difficult to measure objectively. Subjectively ... see below. > [snip] > > *What other failure modes* should we guard against? I know I'd sleep a /little/ better at night knowing that a double disk failure on a "raid5/1/10" configuration might ruin a ton of data along with an obscure set of metadata in some "long" tree paths - but not the entire filesystem. The other use-case/failure mode - where you are somehow unlucky enough to have sets of bad sectors/bitrot on multiple disks that simultaneously affect the only copies of the tree roots - is an extremely unlikely scenario. As unlikely as it may be, the scenario is a very painful consequence in spite of VERY little corruption. That is where the peace-of-mind/bragging rights come in. -- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS 2014-05-19 21:47 ` Brendan Hide @ 2014-05-20 2:07 ` Russell Coker 2014-05-20 14:07 ` Austin S Hemmelgarn ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Russell Coker @ 2014-05-20 2:07 UTC (permalink / raw) To: Brendan Hide, linux-btrfs On Mon, 19 May 2014 23:47:37 Brendan Hide wrote: > This is extremely difficult to measure objectively. Subjectively ... see > below. > > > [snip] > > > > *What other failure modes* should we guard against? > > I know I'd sleep a /little/ better at night knowing that a double disk > failure on a "raid5/1/10" configuration might ruin a ton of data along > with an obscure set of metadata in some "long" tree paths - but not the > entire filesystem. My experience is that most disk failures that don't involve extreme physical damage (EG dropping a drive on concrete) don't involve totally losing the disk. Much of the discussion about RAID failures concerns entirely failed disks, but I believe that is due to RAID implementations such as Linux software RAID that will entirely remove a disk when it gives errors. I have a disk which had ~14,000 errors of which ~2000 errors were corrected by duplicate metadata. If two disks with that problem were in a RAID-1 array then duplicate metadata would be a significant benefit. > The other use-case/failure mode - where you are somehow unlucky enough > to have sets of bad sectors/bitrot on multiple disks that simultaneously > affect the only copies of the tree roots - is an extremely unlikely > scenario. As unlikely as it may be, the scenario is a very painful > consequence in spite of VERY little corruption. That is where the > peace-of-mind/bragging rights come in. http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html The NetApp research on latent errors on drives is worth reading. On page 12 they report latent sector errors on 9.5% of SATA disks per year. 
So if you lose one disk entirely the risk of having errors on a second disk is higher than you would want for RAID-5. While losing the root of the tree is unlikely, losing a directory in the middle that has lots of subdirectories is a risk. I can understand why people wouldn't want ditto blocks to be mandatory. But why are people arguing against them as an option? As an aside, I'd really like to be able to set RAID levels by subtree. I'd like to use RAID-1 with ditto blocks for my important data and RAID-0 for unimportant data. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 18+ messages in thread
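[Editor's note: a rough reading of the NetApp figure quoted above, taking the 9.5%-per-year rate at face value and assuming disks develop latent errors independently; shared batches and environments make real arrays worse than this model.]

```python
# If 9.5% of SATA disks carry at least one latent sector error per year, then
# after one disk dies the chance that some surviving disk already has a latent
# error -- the scenario that makes a RAID-5 rebuild lossy -- grows quickly
# with array size.
def p_surviving_disk_has_error(surviving_disks, p_per_disk=0.095):
    return 1 - (1 - p_per_disk) ** surviving_disks

for n in (1, 3, 7):
    print(f"{n} surviving disks: {p_surviving_disk_has_error(n):.1%}")
```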
* Re: ditto blocks on ZFS 2014-05-20 2:07 ` Russell Coker @ 2014-05-20 14:07 ` Austin S Hemmelgarn 2014-05-20 20:11 ` Brendan Hide 2014-05-20 14:56 ` ashford 2014-05-21 23:29 ` Konstantinos Skarlatos 2 siblings, 1 reply; 18+ messages in thread From: Austin S Hemmelgarn @ 2014-05-20 14:07 UTC (permalink / raw) To: russell, Brendan Hide, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3016 bytes --] On 2014-05-19 22:07, Russell Coker wrote: > On Mon, 19 May 2014 23:47:37 Brendan Hide wrote: >> This is extremely difficult to measure objectively. Subjectively ... see >> below. >> >>> [snip] >>> >>> *What other failure modes* should we guard against? >> >> I know I'd sleep a /little/ better at night knowing that a double disk >> failure on a "raid5/1/10" configuration might ruin a ton of data along >> with an obscure set of metadata in some "long" tree paths - but not the >> entire filesystem. > > My experience is that most disk failures that don't involve extreme physical > damage (EG dropping a drive on concrete) don't involve totally losing the > disk. Much of the discussion about RAID failures concerns entirely failed > disks, but I believe that is due to RAID implementations such as Linux > software RAID that will entirely remove a disk when it gives errors. > > I have a disk which had ~14,000 errors of which ~2000 errors were corrected by > duplicate metadata. If two disks with that problem were in a RAID-1 array > then duplicate metadata would be a significant benefit. > >> The other use-case/failure mode - where you are somehow unlucky enough >> to have sets of bad sectors/bitrot on multiple disks that simultaneously >> affect the only copies of the tree roots - is an extremely unlikely >> scenario. As unlikely as it may be, the scenario is a very painful >> consequence in spite of VERY little corruption. That is where the >> peace-of-mind/bragging rights come in. 
> > http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html > > The NetApp research on latent errors on drives is worth reading. On page 12 > they report latent sector errors on 9.5% of SATA disks per year. So if you > lose one disk entirely the risk of having errors on a second disk is higher > than you would want for RAID-5. While losing the root of the tree is > unlikely, losing a directory in the middle that has lots of subdirectories is > a risk. > > I can understand why people wouldn't want ditto blocks to be mandatory. But > why are people arguing against them as an option? > > > As an aside, I'd really like to be able to set RAID levels by subtree. I'd > like to use RAID-1 with ditto blocks for my important data and RAID-0 for > unimportant data. > But the proposed changes for n-way replication would already handle this. They would just need the option of having more than one copy per device (which theoretically shouldn't be too hard once you have n-way replication). Also, BTRFS already has the option of replicating the root tree across multiple devices (it is included in the System Data subset), and in fact does so by default when using multiple devices. Also, there are plans to have per-subvolume or per file RAID level selection, but IIRC that is planned for after n-way replication (and of course, RAID 5/6, as n-way replication isn't going to be implemented until after RAID 5/6) [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 2967 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS 2014-05-20 14:07 ` Austin S Hemmelgarn @ 2014-05-20 20:11 ` Brendan Hide 0 siblings, 0 replies; 18+ messages in thread From: Brendan Hide @ 2014-05-20 20:11 UTC (permalink / raw) To: Austin S Hemmelgarn, russell, linux-btrfs On 2014/05/20 04:07 PM, Austin S Hemmelgarn wrote: > On 2014-05-19 22:07, Russell Coker wrote: >> [snip] >> As an aside, I'd really like to be able to set RAID levels by subtree. I'd >> like to use RAID-1 with ditto blocks for my important data and RAID-0 for >> unimportant data. >> > But the proposed changes for n-way replication would already handle > this. > [snip] > Russell's specific request above is probably best handled by being able to change replication levels per subvolume - this won't be handled by N-way replication. Extra replication on leaf nodes will make relatively little difference in the scenarios laid out in this thread - but on "trunk" nodes (folders or subvolumes closer to the filesystem root) it makes a significant difference. "Plain" N-way replication doesn't flexibly treat these two nodes differently. As an example, Russell might have a server with two disks - yet he wants 6 copies of all metadata for subvolumes and their immediate subfolders. At three folders deep he "only" wants to have 4 copies. At six folders deep, only 2. Ditto blocks add an attractive safety net without unnecessarily doubling or tripling the size of *all* metadata. It is a good idea. The next question to me is whether or not it is something that can be implemented elegantly and whether or not a talented *dev* thinks it is a good idea. -- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS 2014-05-20 2:07 ` Russell Coker 2014-05-20 14:07 ` Austin S Hemmelgarn @ 2014-05-20 14:56 ` ashford 2014-05-21 2:51 ` Russell Coker 2014-05-21 23:29 ` Konstantinos Skarlatos 2 siblings, 1 reply; 18+ messages in thread From: ashford @ 2014-05-20 14:56 UTC (permalink / raw) To: linux-btrfs; +Cc: ahferroin7, russell, brendan I've been reading this list for a few years, and giving almost no feedback, but I feel that this subject demands that I provide some input. I can think of five possible effects of implementing ditto blocks for the metadata. We've only been discussing one (#3 in my list) in this thread. While most of these effects are fairly obvious, I have seen no discussion on them. In discussing the issues of implementing ditto blocks, I think it would be good to address all of the potential effects, and determine from that discussion whether or not the enhancement should be made, and, if so, when the appropriate development resources should be made available. As Austin pointed out, there are some enhancements currently planned which would make the implementation of ditto blocks simpler. I believe that defines the earliest good time for implementation of ditto blocks. 1. There will be more disk space used by the metadata. I've been aware of space allocation issues in BTRFS for more than three years. If the use of ditto blocks will make this issue worse, then it's probably not a good idea to implement it. The actual increase in metadata space is probably small in most circumstances. 2. Use of ditto blocks will increase write bandwidth to the disk. This is a direct and unavoidable result of having more copies of the metadata. The actual impact of this would depend on the file-system usage pattern, but would probably be unnoticeable in most circumstances. Does anyone have a worst-case scenario for testing? 3. Certain kinds of disk errors would be easier to recover from. Some people here claim that those specific errors are rare. 
I have no opinion on how often they happen, but I believe that if the overall disk space cost is low, it will have a reasonable return. There would be virtually no reliability gains on an SSD-based file-system, as the ditto blocks would be written at the same time, and the SSD would be likely to map the logical blocks into the same page of flash memory. 4. If the BIO layer of BTRFS and the device driver are smart enough, ditto blocks could reduce I/O wait time. This is a direct result of having more instances of the data on the disk, so it's likely that there will be a ditto block closer to where the disk head is currently. The actual benefit for disk-based file-systems is likely to be under 1ms per metadata seek. It's possible that a short-term backlog on one disk could cause BTRFS to use a ditto block on another disk, which could deliver >20ms of performance. There would be no performance benefit for SSD-based file-systems. 5. There will be a (hopefully short) period where the code may be slightly less stable, due to the modifications being performed at a low-level within the file-system. This is likely to happen with any modification of the file-system code, with more complex modifications being more likely to introduce instability. I believe that the overall complexity of this particular modification is great enough that there may be some added instability for a bit, but perhaps use of the n-way replication feature will substantially reduce the complexity. Hopefully, the integration testing that's being performed on the BTRFS code will find most of the new bugs, and point the core developers in the right direction to fix them. I have one final note about RAID levels. I build and sell file servers as a side job, having assembled and delivered over 100 file servers storing several hundreds of TB. TTBOMK, no system that I've built to my own specifications (not overridden by customer requests) has lost any data during the first 3 years of operation. 
One customer requested a disk manufacturer change, and has lost data. A few systems have had data loss in the 4-year timeframe, due to multiple drive failure, combined with inadequate disk status monitoring. My experience is that once your disks are larger than about 500-750GB, RAID-6 becomes a much better choice, due to the increased chances of having an uncorrectable read error during a reconstruct. My opinion is that anyone storing critical information in RAID-5, or even 2-disk RAID-1, with disks of this capacity, should either reconsider their storage topology, or verify that they have a good backup/restore mechanism in place for that data. Thank you. Peter Ashford ^ permalink raw reply [flat|nested] 18+ messages in thread
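[Editor's note: the reconstruct argument above can be quantified under common, and debatable, assumptions: a consumer-drive spec of one unrecoverable read error (URE) per 1e14 bits, and independent errors, which flatters RAID-5. Check a real drive's datasheet before relying on the constant.]

```python
# A RAID-5 rebuild must read every surviving disk end to end; with a URE rate
# of ~1e-14 per bit, the chance of hitting at least one unrecoverable error
# during the rebuild grows with the amount of data read -- which is why larger
# disks push the argument toward RAID-6.
def p_ure_during_rebuild(bytes_read, ure_per_bit=1e-14):
    return 1 - (1 - ure_per_bit) ** (bytes_read * 8)

TB = 10**12
for disks, size_tb in ((3, 0.5), (3, 2.0)):
    surviving = disks - 1
    p = p_ure_during_rebuild(surviving * size_tb * TB)
    print(f"{disks}x {size_tb}TB RAID-5 rebuild: {p:.0%} chance of a URE")
```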
* Re: ditto blocks on ZFS 2014-05-20 14:56 ` ashford @ 2014-05-21 2:51 ` Russell Coker 2014-05-21 23:05 ` Martin 2014-05-22 22:09 ` ashford 0 siblings, 2 replies; 18+ messages in thread From: Russell Coker @ 2014-05-21 2:51 UTC (permalink / raw) To: ashford; +Cc: linux-btrfs On Tue, 20 May 2014 07:56:41 ashford@whisperpc.com wrote: > 1. There will be more disk space used by the metadata. I've been aware > of space allocation issues in BTRFS for more than three years. If the use > of ditto blocks will make this issue worse, then it's probably not a good > idea to implement it. The actual increase in metadata space is probably > small in most circumstances. Data, RAID1: total=2.51TB, used=2.50TB System, RAID1: total=32.00MB, used=376.00KB Metadata, RAID1: total=28.25GB, used=26.63GB The above is my home RAID-1 array. It includes multiple backup copies of a medium size Maildir format mail spool which probably accounts for a significant portion of the used space, the Maildir spool has an average file size of about 70K and lots of hard links between different versions of the backup. Even so the metadata is only 1% of the total used space. Going from 1% to 2% to improve reliability really isn't a problem. Data, RAID1: total=140.00GB, used=139.60GB System, RAID1: total=32.00MB, used=28.00KB Metadata, RAID1: total=4.00GB, used=2.97GB Above is a small Xen server which uses snapshots to backup the files for Xen block devices (the system is lightly loaded so I don't use nocow) and for data files that include a small Maildir spool. It's still only 2% of disk space used for metadata, again going from 2% to 4% isn't going to be a great problem. > 2. Use of ditto blocks will increase write bandwidth to the disk. This > is a direct and unavoidable result of having more copies of the metadata. > The actual impact of this would depend on the file-system usage pattern, > but would probably be unnoticeable in most circumstances. Does anyone > have a worst-case scenario for testing? 
The ZFS design involves ditto blocks being spaced apart due to the fact that corruption tends to have some spacial locality. So you are adding an extra seek. The worst case would be when you have lots of small synchronous writes, probably the default configuration of Maildir delivery would be a good case. As an aside I've been thinking of patching a mail server to do a sleep() before fsync() on mail delivery to see if that improves aggregate performance. My theory is that if you have dozens of concurrent delivery attempts then if they all sleep() before fsync() then the filesystem could write out metadata for multiple files in one pass in the most efficient manner. > 3. Certain kinds of disk errors would be easier to recover from. Some > people here claim that those specific errors are rare. All errors are rare. :-# Seriously you can run Ext4 on a single disk for years and probably not lose data. It's just a matter of how many disks and how much reliability you want. > I have no opinion > on how often they happen, but I believe that if the overall disk space > cost is low, it will have a reasonable return. There would be virtually > no reliability gains on an SSD-based file-system, as the ditto blocks > would be written at the same time, and the SSD would be likely to map the > logical blocks into the same page of flash memory. That claim is unproven AFAIK. On SSD the performance cost of such things is negligible (no seek cost) and losing 1% of disk space isn't a problem for most systems (admittedly the early SSDs were small). > 4. If the BIO layer of BTRFS and the device driver are smart enough, > ditto blocks could reduce I/O wait time. This is a direct result of > having more instances of the data on the disk, so it's likely that there > will be a ditto block closer to where the disk head is currently. The > actual benefit for disk-based file-systems is likely to be under 1ms per > metadata seek. 
It's possible that a short-term backlog on one disk could > cause BTRFS to use a ditto block on another disk, which could deliver > >20ms of performance. There would be no performance benefit for SSD-based > file-systems. That is likely with RAID-5 and RAID-10. > My experience is that once your disks are larger than about 500-750GB, > RAID-6 becomes a much better choice, due to the increased chances of > having an uncorrectable read error during a reconstruct. My opinion is > that anyone storing critical information in RAID-5, or even 2-disk RAID-1, > with disks of this capacity, should either reconsider their storage > topology, or verify that they have a good backup/restore mechanism in > place for that data. http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html The NetApp research shows that the incidence of silent corruption is a lot greater than you would expect. RAID-6 doesn't save you from this. You need BTRFS or ZFS RAID-6. On Tue, 20 May 2014 22:11:16 Brendan Hide wrote: > Extra replication on leaf nodes will make relatively little difference > in the scenarios laid out in this thread - but on "trunk" nodes (folders > or subvolumes closer to the filesystem root) it makes a significant > difference. "Plain" N-way replication doesn't flexibly treat these two > nodes differently. > > As an example, Russell might have a server with two disks - yet he wants > 6 copies of all metadata for subvolumes and their immediate subfolders. > At three folders deep he "only" wants to have 4 copies. At six folders > deep, only 2. Ditto blocks add an attractive safety net without > unnecessarily doubling or tripling the size of *all* metadata. Firstly I don't think that doubling all metadata is a real problem. Next the amount of duplicate metadata can't be determined by depth. For example I have a mail server where an outage of the entire server is preferable to losing email. I would set more ditto blocks for /mail than for the root subvol. 
In that case I'd want the metadata for the root directory to have the same replication as /mail but for /home etc nothing special. Hypothetically if metadata duplication consumed any significant disk space then I'd probably want to only enable it on /lib* /sbin, /etc, and whatever data the server is designed to hold. But really it's small enough to just duplicate everything. Currently I only run two systems for which I can't more than double the disk space at a moderate cost. One is my EeePC 701 and the other is a ZFS archive server (which already has the ditto blocks). For all the other systems there is no shortage of space at all. Disk just keeps getting bigger and cheaper, for most of my uses disk size increases faster than data storage. Currently the smallest SATA disk I can buy new is 500G. The smallest SSD is 60G for $63 but I can get 120G for $82, 240G for $149, or 480G for $295. All the workstations I run use a lot less than 120G of storage. Storage capacity isn't an issue for most users. It seems to me that the only time when an extra 1% disk space usage would really matter is when you have an array of 100 disks that's almost full. But that's the time when you REALLY want extra duplication of metadata. > It is a good idea. The next question to me is whether or not it is > something that can be implemented elegantly and whether or not a > talented *dev* thinks it is a good idea. Absolutely. Hopefully this discussion will inspire the developers to consider this an interesting technical challenge and a feature that is needed to beat ZFS. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 18+ messages in thread
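[Editor's note: the sleep-before-fsync experiment Russell floats earlier in this message can be sketched as below. The 50 ms delay is an arbitrary illustration, not a tuned value; the Maildir handling is deliberately minimal (a real MDA delivers via tmp/ and new/ subdirectories); and whether batching fsyncs this way actually improves aggregate throughput would need measurement.]

```python
# Each delivery briefly yields before its fsync() so that concurrent
# deliveries tend to reach the sync point together, letting the filesystem
# flush metadata for many new files in one pass.
import os
import time

def deliver(maildir_new, name, message, delay=0.05):
    tmp_path = os.path.join(maildir_new, name + ".tmp")
    final_path = os.path.join(maildir_new, name)
    with open(tmp_path, "wb") as f:
        f.write(message)
        f.flush()
        time.sleep(delay)            # let other in-flight deliveries catch up
        os.fsync(f.fileno())         # the durability point delivery requires
    os.rename(tmp_path, final_path)  # atomically publish into the mailbox
```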
* Re: ditto blocks on ZFS
  2014-05-21  2:51 ` Russell Coker
@ 2014-05-21 23:05 ` Martin
  2014-05-22 11:10 ` Austin S Hemmelgarn
  2014-05-22 22:09 ` ashford
  1 sibling, 1 reply; 18+ messages in thread
From: Martin @ 2014-05-21 23:05 UTC (permalink / raw)
To: linux-btrfs

Very good comment from Ashford.

Sorry, but I see no advantages in Russell's replies other than a "feel-good" factor or a dangerous false sense of security. At best, there is a weak justification that "for metadata, again going from 2% to 4% isn't going to be a great problem" (storage is cheap and fast).

I thought an important idea behind btrfs was that we avoid by design, in the first place, the very long and vulnerable RAID rebuild scenarios suffered by block-level RAID...

On 21/05/14 03:51, Russell Coker wrote:
> Absolutely. Hopefully this discussion will inspire the developers to
> consider this an interesting technical challenge and a feature that
> is needed to beat ZFS.

Sorry, but I think that is completely the wrong reasoning. ...Unless, that is, you are some proprietary sales droid hyping features and big numbers! :-P

Personally I'm not convinced we gain anything beyond what btrfs will eventually offer in any case with the n-way raid or the raid-n Cauchy stuff.

Also note that usually data is wanted to be 100% reliable and retrievable, or, if that fails, you go to your backups instead. Gambling on "proportions" and "importance" rather than *ensuring* fault/error tolerance is a very human thing... ;-)

Sorry: interesting idea, but I'm not convinced there's any advantage for disk/SSD storage.

Regards,
Martin

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS
  2014-05-21 23:05 ` Martin
@ 2014-05-22 11:10 ` Austin S Hemmelgarn
  0 siblings, 0 replies; 18+ messages in thread
From: Austin S Hemmelgarn @ 2014-05-22 11:10 UTC (permalink / raw)
To: Martin, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2560 bytes --]

On 2014-05-21 19:05, Martin wrote:
> Very good comment from Ashford.
>
> Sorry, but I see no advantages from Russell's replies other than for a
> "feel-good" factor or a dangerous false sense of security. At best,
> there is a weak justification that "for metadata, again going from 2% to
> 4% isn't going to be a great problem" (storage is cheap and fast).
>
> I thought an important idea behind btrfs was that we avoid by design in
> the first place the very long and vulnerable RAID rebuild scenarios
> suffered for block-level RAID...
>
> On 21/05/14 03:51, Russell Coker wrote:
>> Absolutely. Hopefully this discussion will inspire the developers to
>> consider this an interesting technical challenge and a feature that
>> is needed to beat ZFS.
>
> Sorry, but I think that is completely the wrong reasoning. ...Unless
> that is you are some proprietary sales droid hyping features and big
> numbers! :-P
>
> Personally I'm not convinced we gain anything beyond what btrfs will
> eventually offer in any case for the n-way raid or the raid-n Cauchy stuff.
>
> Also note that usually, data is wanted to be 100% reliable and
> retrievable. Or if that fails, you go to your backups instead. Gambling
> "proportions" and "importance" rather than *ensuring* fault/error
> tolerance is a very human thing... ;-)
>
> Sorry:
>
> Interesting idea but not convinced there's any advantage for disk/SSD
> storage.
>
> Regards,
> Martin
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Another nice option in this case might be adding logic to ensure that there is some (considerable) offset between copies of metadata using the dup profile. All of the filesystems whose low-level on-disk structures I have actually looked at have had both copies of the System chunks right next to each other, right at the beginning of the disk, which of course reduces the usefulness of storing two copies of them on disk. Adding an offset in those allocations would provide better protection against some of the more common 'idiot' failure modes (e.g. trying to use dd to write a disk image to a USB flash drive, and accidentally overwriting the first n GB of your first HDD instead). Ideally, once we have n-way replication, System chunks should default to one copy per device for multi-device filesystems.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2967 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS
  2014-05-21  2:51 ` Russell Coker
  2014-05-21 23:05 ` Martin
@ 2014-05-22 22:09 ` ashford
  2014-05-23  3:54 ` Russell Coker
  1 sibling, 1 reply; 18+ messages in thread
From: ashford @ 2014-05-22 22:09 UTC (permalink / raw)
To: russell; +Cc: ashford, linux-btrfs

Russell,

Overall, there are still a lot of unknowns WRT the stability and ROI (Return On Investment) of implementing ditto blocks for BTRFS. The good news is that there's a lot of time before the underlying structure is in place to support them, so there's time to figure this out a bit better.

> On Tue, 20 May 2014 07:56:41 ashford@whisperpc.com wrote:
>> 1. There will be more disk space used by the metadata. I've been aware
>> of space allocation issues in BTRFS for more than three years. If the
>> use of ditto blocks will make this issue worse, then it's probably not a
>> good idea to implement it. The actual increase in metadata space is
>> probably small in most circumstances.
>
> Data, RAID1: total=2.51TB, used=2.50TB
> System, RAID1: total=32.00MB, used=376.00KB
> Metadata, RAID1: total=28.25GB, used=26.63GB
>
> The above is my home RAID-1 array. It includes multiple backup copies of
> a medium size Maildir format mail spool which probably accounts for a
> significant portion of the used space; the Maildir spool has an average
> file size of about 70K and lots of hard links between different versions
> of the backup. Even so the metadata is only 1% of the total used space.
> Going from 1% to 2% to improve reliability really isn't a problem.
>
> Data, RAID1: total=140.00GB, used=139.60GB
> System, RAID1: total=32.00MB, used=28.00KB
> Metadata, RAID1: total=4.00GB, used=2.97GB
>
> Above is a small Xen server which uses snapshots to back up the files for
> Xen block devices (the system is lightly loaded so I don't use nocow)
> and for data files that include a small Maildir spool. It's still only
> 2% of disk space used for metadata; again, going from 2% to 4% isn't
> going to be a great problem.

You've addressed half of the issue. It appears that the metadata is normally a bit over 1% using the current methods, but two samples do not make a statistical universe. The good news is that these two samples are from opposite extremes of usage, so I expect they're close to where the overall average would end up. I'd like to see a few more samples, from other usage scenarios, just to be sure.

If the above numbers are normal, adding ditto blocks could increase the size of the metadata from 1% to 2% or even 3%. This isn't a problem. What we still don't know, and probably won't until after it's implemented, is whether or not the addition of ditto blocks will make the space allocation worse.

>> 2. Use of ditto blocks will increase write bandwidth to the disk. This
>> is a direct and unavoidable result of having more copies of the metadata.
>> The actual impact of this would depend on the file-system usage pattern,
>> but would probably be unnoticeable in most circumstances. Does anyone
>> have a worst-case scenario for testing?
>
> The ZFS design involves ditto blocks being spaced apart due to the fact
> that corruption tends to have some spatial locality. So you are adding
> an extra seek.
>
> The worst case would be when you have lots of small synchronous writes;
> probably the default configuration of Maildir delivery would be a good
> case.

Is there a performance test for this? That would be helpful in determining the worst-case performance impact of implementing ditto blocks, and probably some other enhancements as well.

>> 3. Certain kinds of disk errors would be easier to recover from. Some
>> people here claim that those specific errors are rare. I have no
>> opinion on how often they happen, but I believe that if the overall
>> disk space cost is low, it will have a reasonable return. There would
>> be virtually no reliability gains on an SSD-based file-system, as the
>> ditto blocks would be written at the same time, and the SSD would be
>> likely to map the logical blocks into the same page of flash memory.
>
> That claim is unproven AFAIK.

That claim is a direct result of how SSDs function.

>> 4. If the BIO layer of BTRFS and the device driver are smart enough,
>> ditto blocks could reduce I/O wait time. This is a direct result of
>> having more instances of the data on the disk, so it's likely that there
>> will be a ditto block closer to where the disk head is currently. The
>> actual benefit for disk-based file-systems is likely to be under 1ms per
>> metadata seek. It's possible that a short-term backlog on one disk
>> could cause BTRFS to use a ditto block on another disk, which could
>> deliver >20ms of performance. There would be no performance benefit for
>> SSD-based file-systems.
>
> That is likely with RAID-5 and RAID-10.

It's likely with all disk layouts; the reason just looks different on different RAID structures.

>> My experience is that once your disks are larger than about 500-750GB,
>> RAID-6 becomes a much better choice, due to the increased chances of
>> having an uncorrectable read error during a reconstruct. My opinion is
>> that anyone storing critical information in RAID-5, or even 2-disk
>> RAID-1, with disks of this capacity, should either reconsider their
>> storage topology, or verify that they have a good backup/restore
>> mechanism in place for that data.
>
> http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
>
> The NetApp research shows that the incidence of silent corruption is a
> lot greater than you would expect. RAID-6 doesn't save you from this.
> You need BTRFS or ZFS RAID-6.

I was referring to hard read errors, not silent data corruption.

Peter Ashford

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: ditto blocks on ZFS
  2014-05-22 22:09 ` ashford
@ 2014-05-23  3:54 ` Russell Coker
  2014-05-23  8:03 ` Duncan
  0 siblings, 1 reply; 18+ messages in thread
From: Russell Coker @ 2014-05-23 3:54 UTC (permalink / raw)
To: ashford; +Cc: linux-btrfs

On Thu, 22 May 2014 15:09:40 ashford@whisperpc.com wrote:
> You've addressed half of the issue. It appears that the metadata is
> normally a bit over 1% using the current methods, but two samples do not
> make a statistical universe. The good news is that these two samples are
> from opposite extremes of usage, so I expect they're close to where the
> overall average would end up. I'd like to see a few more samples, from
> other usage scenarios, just to be sure.
>
> If the above numbers are normal, adding ditto blocks could increase the
> size of the metadata from 1% to 2% or even 3%. This isn't a problem.
>
> What we still don't know, and probably won't until after it's implemented,
> is whether or not the addition of ditto blocks will make the space
> allocation worse.

I've been involved in many discussions about filesystem choice. None of them have included anyone raising an issue about ZFS metadata space usage; probably most ZFS users don't even know about ditto blocks.

The relevant issue regarding disk space is the fact that filesystems tend to perform better if there is a reasonable amount of free space. The amount of free space needed for good performance will depend on the filesystem, the usage pattern, and whatever you might define as "good performance". The first two Google hits on searching for recommended free space on ZFS recommended using no more than 80% and 85% of disk space. Obviously if "good performance" requires 15% of free disk space then your capacity problem isn't going to be solved by not duplicating metadata. Note that I am not aware of the accuracy of such claims about ZFS performance.

Is anyone doing research on how much free disk space is required on BTRFS for "good performance"? If a rumor (whether correct or incorrect) goes around that you need 20% free space on a BTRFS filesystem for performance then that will vastly outweigh the space used for metadata.

> > The ZFS design involves ditto blocks being spaced apart due to the fact
> > that corruption tends to have some spatial locality. So you are adding
> > an extra seek.
> >
> > The worst case would be when you have lots of small synchronous writes;
> > probably the default configuration of Maildir delivery would be a good
> > case.
>
> Is there a performance test for this? That would be helpful in
> determining the worst-case performance impact of implementing ditto
> blocks, and probably some other enhancements as well.

http://doc.coker.com.au/projects/postal/

My Postal mail server benchmark is one option. There are more than a few benchmarks of synchronous writes of small files, but Postal uses real-world programs that need such performance. Delivering a single message via a typical Unix MTA requires synchronous writes of two queue files and then the destination file in the mail store.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

^ permalink raw reply	[flat|nested] 18+ messages in thread
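The Maildir-delivery worst case discussed above — many small files, each made durable with fsync() before the next — is easy to approximate without a full MTA. The sketch below is not Postal, just a minimal stand-in to time the synchronous small-file write path on whatever filesystem holds the temporary directory; file names and sizes are arbitrary:

```python
# Minimal Maildir-style synchronous-delivery microbenchmark (not Postal).
# Each "delivery" writes one small file, fsyncs it, then fsyncs the
# directory so the new entry is durable too - the pattern an MTA produces.
import os
import tempfile
import time

def maildir_delivery_bench(root, messages=100, size=4096):
    new = os.path.join(root, "new")
    os.makedirs(new, exist_ok=True)
    payload = b"x" * size
    start = time.monotonic()
    for i in range(messages):
        path = os.path.join(new, "msg.%d" % i)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.write(fd, payload)
            os.fsync(fd)          # data + inode must hit stable storage
        finally:
            os.close(fd)
        dfd = os.open(new, os.O_RDONLY)
        try:
            os.fsync(dfd)         # the directory entry must be durable too
        finally:
            os.close(dfd)
    return time.monotonic() - start

with tempfile.TemporaryDirectory() as d:
    elapsed = maildir_delivery_bench(d, messages=50)
    print("50 deliveries in %.3fs (%.1f msg/s)" % (elapsed, 50 / elapsed))
```

Running this on the same directory with and without extra metadata copies (once such a feature existed) would expose the extra-seek cost Russell describes.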
* Re: ditto blocks on ZFS
  2014-05-23  3:54 ` Russell Coker
@ 2014-05-23  8:03 ` Duncan
  0 siblings, 0 replies; 18+ messages in thread
From: Duncan @ 2014-05-23 8:03 UTC (permalink / raw)
To: linux-btrfs

Russell Coker posted on Fri, 23 May 2014 13:54:46 +1000 as excerpted:

> Is anyone doing research on how much free disk space is required on
> BTRFS for "good performance"? If a rumor (whether correct or incorrect)
> goes around that you need 20% free space on a BTRFS filesystem for
> performance then that will vastly outweigh the space used for metadata.

Well, on btrfs there's free-space, and then there's free-space. The chunk allocation and both data/metadata fragmentation make a difference.

That said, *IF* you're looking at the right numbers, btrfs doesn't actually require that much free space, and should run as efficiently right up to just a few GiB free, on pretty much any btrfs over a few GiB in size. So, at least in the significant-fractions-of-a-TiB-on-up range, it doesn't require that much free space /as/ /a/ /percentage/ at all. **BUT BE SURE YOU'RE LOOKING AT THE RIGHT NUMBERS** as explained below.

Chunks:

On btrfs, both data and metadata are allocated in chunks: 1 GiB chunks for data, 256 MiB chunks for metadata. The catch is that while both chunks and space within chunks are allocated on-demand, deleting files only frees space within chunks -- the chunks themselves remain allocated to data/metadata, whichever they were, and cannot be reallocated to the other. To deallocate unused chunks and to rewrite partially used chunks to consolidate usage onto fewer chunks and free the others, btrfs admins must currently do a btrfs balance, manually or via script.

btrfs filesystem show:

In the btrfs filesystem show output, the individual devid lines show total filesystem space on the device vs. used (as in allocated-to-chunks) space.[1] Ideally (assuming equal sized devices) you should keep at least 2.5-3.0 GiB free per device, since that will allow allocation of two chunks each for data (1 GiB each) and metadata (a quarter GiB each, but on single-device filesystems they are allocated in pairs by default, so half a GiB; see below). Since the balance process itself will want to allocate a new chunk to write into in order to rewrite and consolidate existing chunks, you don't want to use the last one available, and since the filesystem could decide it needs to allocate another chunk for normal usage as well, you always want to keep at least two chunks worth of each unallocated: one chunk each of data/metadata for the filesystem if it needs it, and another to ensure balance can allocate at least the one chunk to do its rewrite.

As I said, data chunks are 1 GiB, while metadata chunks are 256 MiB, a quarter GiB. However, on a single-device btrfs, metadata will normally default to dup (duplicate, two copies for safety) mode, and will thus allocate two chunks, half a GiB, at a time. This is why you want 3 GiB minimum free on a single-device btrfs: space for two single-mode data chunk allocations (1 GiB * 2 = 2 GiB), plus two dup-mode metadata chunk allocations (256 MiB * 2 * 2 = 1 GiB). But on a multi-device btrfs, only a single copy is stored per device, so the metadata minimum reserve is only half a GiB per device (256 MiB * 2 = 512 MiB), for a total of 2.5 GiB per device.

That's the minimum unallocated space you need free. More than that is nice and lets you go longer between having to worry about rebalances, but it really won't help btrfs efficiency that much, since btrfs uses already-allocated chunk space where it can.

btrfs filesystem df:

Then there's the already chunk-allocated space. btrfs filesystem df reports on this.
In the df output, total means allocated while used means used, of that allocated, so the spread between them is the allocated-but-unused space. Since btrfs allocates new chunks on-demand from the unallocated space pool, but cannot reallocate chunks between data and metadata on its own, and because the used blocks within existing chunks will get fragmented over time, it's best to keep the btrfs filesystem df reported spread between total and used to a minimum.

Of course, as I said above, data chunks are 1 GiB each, so a data allocation spread of under a GiB won't be recoverable in any case, and a spread of 1-5 GiB isn't a big deal. But if, for instance, btrfs filesystem df reports data 1.25 TiB total (that is, allocated) but only 250 GiB used, that's a spread of roughly a TiB, and running a btrfs balance to recover most of that spread to unallocated is a good idea.

Similarly with metadata, except it'll be allocated in 256 MiB chunks, two at a time by default on a single-device filesystem, so 512 MiB at a time in that case. But again, if btrfs filesystem df is reporting say 10.5 GiB total metadata but only perhaps 1.75 GiB used, the spread is several chunks worth, and particularly if your unallocated reserve (as reported by btrfs filesystem show in the individual device lines) is getting low, it's time to consider rebalancing to recover the unused metadata space to unallocated.

It's also worth noting that btrfs requires some free metadata space to work with, figure about one chunk worth, so if there's no unallocated space left and metadata space gets under 300 MiB or so, you're getting real close to ENOSPC errors! For the same reason, even a full balance will likely still leave a metadata chunk or two (so say half a gig) of reported spread between metadata total and used; that's not recoverable by balance because btrfs actually reserves it for its own use.

Finally, it can be noted that under normal usage, and particularly in cases where people delete a whole bunch of medium to large files (assuming those same files aren't being held in a btrfs snapshot, which would prevent their deletion actually freeing the space they take until all the snapshots that contain them are deleted as well), a lot of previously allocated data chunks will become mostly or fully empty, but metadata usage won't go down all that much, so relatively less metadata space will return to unused. That means that where people haven't rebalanced in a while, they're likely to have a lot of allocated-but-unused data space that can be reused, but rather less unused metadata space. As a result, when all space is allocated and there's no more to allocate to new chunks, it's most commonly metadata space that runs out first, *SOMETIMES WITH LOTS OF SPACE STILL REPORTED AS FREE BY ORDINARY DF* and lots of data space free as reported by btrfs filesystem df as well, simply because all available metadata chunks are full, and all remaining space is allocated to data chunks, a significant number of which may be mostly free.

But OTOH, if you work with mostly small files, a KiB or smaller, and have deleted a bunch of them, it's likely you'll free a lot of metadata space, because such small files are often stored entirely as metadata. In that case you may run out of data space first, once all space is allocated to chunks of some kind. This is somewhat rarer, but it does happen, and the symptoms can look a bit strange, as sometimes it'll result in a bunch of zero-sized files: the metadata space was available for them, but when it came time to write the actual data, there was no space to do so.

But once all space is allocated to chunks, so no more chunks can be allocated, it's only a matter of time until either data or metadata runs out, even if there's plenty of "space" free, because all that "space" is tied up in the other one!
As I said above, keep an eye on the btrfs filesystem show output, and try to do a rebalance when the spread between total and used (allocated) gets close to 3 GiB, because once all space is actually allocated you're in a bit of a bind, and balance may find it hard to free space as well. There are tricks that can help, as described below, but it's better not to find yourself in that spot in the first place.

Balance and balance filters:

Now let's look at balance and balance filters. There's a page on the wiki [2] that explains balance filters in some detail, but for our purposes here it's sufficient to know that -m tells balance to only handle metadata chunks, -d tells it to only handle data chunks, and usage=N tells it to only rebalance chunks with that usage or LESS, thus allowing you to avoid unnecessarily rebalancing full and almost-full chunks, while still allowing recovery of nearly empty chunks to the unallocated pool.

So if btrfs filesystem df shows a big spread between total and used for data, try something like this:

btrfs balance start -dusage=20

(note no space between -d and usage). That says balance (rewrite and consolidate) only data chunks with usage of 20% or less. That will be MUCH faster than a full rebalance, and should be quite a bit faster than simply -d (data chunks only, without the usage filter) as well, while still consolidating data chunks with usage at or below 20%, which will likely be quite a few if the spread is pretty big.

Of course you can adjust the N in that usage=N as needed, between 0 and 100. As the filesystem really does fill up and there's less room to spare for allocated-but-unused chunks, you'll need to increase that usage= toward 100 in order to consolidate and recover as many partially used chunks as possible.

But while the filesystem is mostly empty, and/or if the btrfs filesystem df spread between used and total is large (tens or hundreds of gigs), a smaller usage=, say usage=5, will likely get you very good results, but MUCH faster, since you're only dealing with chunks at or under 5% full, meaning far less actual rewriting, while most of the time getting a full gig back for every 1/20 gig (5%) you rebalance!

***ANSWER!***

While btrfs shouldn't lose that much operational efficiency as the filesystem fills, as long as there are unallocated chunks available to allocate as it needs them, the closer it is to full, the more frequently one will need to rebalance, and the closer to 100 the usage= balance filter will need to be in order to recover all possible space to unallocated, keeping it free for allocation as necessary.

Tying up loose ends:

Tricks:

Above, I mentioned tricks that can let you balance even if there's no space left to allocate the new chunk to rewrite data/metadata from the old chunk into, so a normal balance won't work.

The first such trick is the usage=0 balance filter. Even if you're totally out of unallocated space as reported by btrfs filesystem show, if btrfs filesystem df shows a large spread between used and total (or even if not, if you're lucky, as long as the spread is at least one chunk's worth), there's a fair chance that at least one chunk is totally empty. In that case, there's nothing in it to rewrite, and balancing that chunk will simply free it, without requiring a chunk allocation to do the rewrite. Using usage=0 tells balance to only consider such chunks, freeing any that it finds without requiring space to rewrite the data, since there's nothing there to rewrite. =:^)

Still, there's no guarantee balance will find any totally empty chunks to free, so it's better not to get into that situation to begin with.
As I said above, try to keep at least 3 GiB free as reported by the individual device lines of btrfs filesystem show (or 2.5 GiB on each device of a multi-device filesystem).

If -dusage=0/-musage=0 doesn't work, the next trick is to try temporarily adding another device to the btrfs, using btrfs device add. This device should be at least several GiB in size (again, I'd say 3 GiB minimum, but 10 GiB or so would be better; no need to make it /huge/), and could be a USB thumb drive or the like. If you have 8 GiB or better memory and aren't using it all, even a several-GiB loopback file created on top of tmpfs can work, but of course if the system crashes while that temporary device is in use, say goodbye to whatever was on it at the time!

The idea is to add the device temporarily, do a btrfs balance with a usage filter set as low as possible to free up at least one extra chunk worth of space on the permanent device(s), then, when balance has recovered enough chunks worth of space to do so, do a btrfs device delete on the temporary device to return the chunks on it to the newly unallocated space on the permanent devices.

The temporary device trick should work where the usage=0 trick fails and should allow getting out of the bind, but again, better never to find yourself in that bind in the first place, so keep an eye on those btrfs filesystem show results!

More loose ends:

Above I assumed all devices of a multi-device btrfs are the same size, so they should fill up roughly in parallel and the per-device lines in the btrfs filesystem show output should be similar. If you're using different sized devices, depending on your configured raid mode and the size of the devices, one will likely fill up first, but there will still be room left on the others. The details are too complex to deal with here, but one thing worth noting is that for some device sizes and raid mode configurations, btrfs will not be able to use the full size of the largest device.

Hugo's btrfs device and filesystem layout configurator page is a good tool to use when planning a mixed-device-size btrfs.

Finally, there's the usage value in the total devices line of btrfs filesystem show, which in footnote [1] below I recommend ignoring if you don't understand it. That number is actually the (appropriately rounded) sum of all the used values as reported by btrfs filesystem df. Basically, add the used values from the data and metadata lines of btrfs filesystem df (the other lines end up being rounding errors), and that should, within rounding error, be the number reported by btrfs filesystem show as usage in the total devices line. That's where the number comes from, and it is in some ways the actual filesystem usage. But in btrfs terms it's relatively unimportant compared to the chunk-allocated/unallocated/total values reported on the individual device lines, and the data/metadata values reported by btrfs filesystem df, so for btrfs administration purposes it's generally better to simply pretend that the btrfs filesystem show total devices line usage doesn't appear at all; in real life, far more people seem to get confused by it than find it actually useful. But that's where that number derives from, if you find you can't simply ignore it as I recommend. (I know I'd have a hard time ignoring it myself, until I knew where it actually came from.)

---
[1] The total devices line used value is reporting something entirely different, best ignored if you don't understand it, as it has deceived a lot of people into thinking they have lots of room available when it's actually all allocated.

[2] Btrfs wiki, general link: https://btrfs.wiki.kernel.org
Balance filters: https://btrfs.wiki.kernel.org/index.php/Balance_Filters

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 18+ messages in thread
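Duncan's bookkeeping — the spread between total (allocated) and used from btrfs filesystem df, the ~3 GiB unallocated floor from btrfs filesystem show, and picking a usage= filter — can be condensed into a small calculator. The thresholds below are the rules of thumb from his post, not anything btrfs itself enforces, and the suggested usage= cutoffs are illustrative guesses:

```python
# Rule-of-thumb rebalance advisor based on Duncan's post above.
# Thresholds are heuristics from the mail, not btrfs policy.
GiB = 1 << 30

def balance_advice(dev_size, dev_allocated, data_total, data_used,
                   single_device=True):
    """Suggest a balance command from 'show' and 'df' numbers (all bytes).

    dev_size/dev_allocated: per-device totals from 'btrfs filesystem show';
    data_total/data_used:   the data line from 'btrfs filesystem df'.
    """
    unallocated = dev_size - dev_allocated
    # 2 data chunks + 2 (dup'd on single device) metadata allocations:
    floor = 3 * GiB if single_device else 2.5 * GiB
    spread = data_total - data_used        # allocated but unused data
    if unallocated >= floor and spread < 5 * GiB:
        return "ok"                        # little trapped space, no rush
    if unallocated <= 0:
        # No room to allocate the rewrite chunk: free wholly empty chunks.
        return "btrfs balance start -dusage=0"
    # The fuller the fs, the higher usage= must go to recover trapped space.
    pct_full = dev_allocated / dev_size
    usage = 5 if pct_full < 0.5 else (20 if pct_full < 0.8 else 60)
    return "btrfs balance start -dusage=%d" % usage

# Duncan's example: ~1.25 TiB of data chunks allocated, only 250 GiB used:
print(balance_advice(2000 * GiB, 1900 * GiB, 1280 * GiB, 250 * GiB))
```

The point of the sketch is the decision structure: check unallocated space first (it gates everything, including balance itself), then attack the allocated-but-unused spread with the cheapest usage= filter that recovers it.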
* Re: ditto blocks on ZFS
  2014-05-20  2:07 ` Russell Coker
  2014-05-20 14:07 ` Austin S Hemmelgarn
  2014-05-20 14:56 ` ashford
@ 2014-05-21 23:29 ` Konstantinos Skarlatos
  2 siblings, 0 replies; 18+ messages in thread
From: Konstantinos Skarlatos @ 2014-05-21 23:29 UTC (permalink / raw)
To: russell, Brendan Hide, linux-btrfs

On 20/5/2014 5:07 πμ, Russell Coker wrote:
> On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
>> This is extremely difficult to measure objectively. Subjectively ... see
>> below.
>>
>>> [snip]
>>>
>>> *What other failure modes* should we guard against?
>> I know I'd sleep a /little/ better at night knowing that a double disk
>> failure on a "raid5/1/10" configuration might ruin a ton of data along
>> with an obscure set of metadata in some "long" tree paths - but not the
>> entire filesystem.
> My experience is that most disk failures that don't involve extreme
> physical damage (EG dropping a drive on concrete) don't involve totally
> losing the disk. Much of the discussion about RAID failures concerns
> entirely failed disks, but I believe that is due to RAID implementations
> such as Linux software RAID that will entirely remove a disk when it
> gives errors.
>
> I have a disk which had ~14,000 errors, of which ~2000 errors were
> corrected by duplicate metadata. If two disks with that problem were in
> a RAID-1 array then duplicate metadata would be a significant benefit.
>
>> The other use-case/failure mode - where you are somehow unlucky enough
>> to have sets of bad sectors/bitrot on multiple disks that simultaneously
>> affect the only copies of the tree roots - is an extremely unlikely
>> scenario. As unlikely as it may be, the scenario is a very painful
>> consequence in spite of VERY little corruption. That is where the
>> peace-of-mind/bragging rights come in.
> http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
>
> The NetApp research on latent errors on drives is worth reading. On page
> 12 they report latent sector errors on 9.5% of SATA disks per year. So
> if you lose one disk entirely, the risk of having errors on a second
> disk is higher than you would want for RAID-5. While losing the root of
> the tree is unlikely, losing a directory in the middle that has lots of
> subdirectories is a risk.

Seeing the results of that paper, I think erasure coding is a better solution. Instead of having many copies of metadata or data, we could do erasure coding using something like zfec[1], which is used by Tahoe-LAFS, increasing their size by let's say 5-10%, and be quite safe even from multiple contiguous bad sectors.

[1] https://pypi.python.org/pypi/zfec

> I can understand why people wouldn't want ditto blocks to be mandatory.
> But why are people arguing against them as an option?
>
> As an aside, I'd really like to be able to set RAID levels by subtree.
> I'd like to use RAID-1 with ditto blocks for my important data and
> RAID-0 for unimportant data.

^ permalink raw reply	[flat|nested] 18+ messages in thread
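The erasure-coding point can be illustrated without zfec. Even the simplest code — one XOR parity block over k data blocks, RAID-5 style — recovers any single lost block for 1/k overhead; zfec's Reed–Solomon coding generalizes this to surviving the loss of any m blocks out of k+m. The toy below is that simplest case, not zfec's actual API:

```python
# Toy single-parity erasure code (RAID-5 style), to show the principle
# behind zfec: redundancy as parity, not whole extra copies.
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def encode(data_blocks):
    """k data blocks -> k data blocks + 1 parity block (overhead 1/k)."""
    return data_blocks + [xor_blocks(data_blocks)]

def recover(blocks, lost_index):
    """Rebuild one missing block: XOR of all survivors, parity included."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    return xor_blocks(survivors)

data = [b"meta", b"data", b"tree", b"node"]   # 4 equal-sized blocks
stored = encode(data)        # 5 blocks on disk: 25% overhead vs 100% for dup
rebuilt = recover(stored, 2) # pretend block 2 ("tree") hit a bad sector
print(rebuilt)               # b'tree'
```

Compare the costs: duplicating metadata doubles its size to survive one loss, while parity over 4 blocks survives one loss for 25% extra — Konstantinos' 5-10% figure corresponds to wider groups (k of 10-20).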
* Re: ditto blocks on ZFS
@ 2014-05-22 15:28 Tomasz Chmielewski
0 siblings, 0 replies; 18+ messages in thread
From: Tomasz Chmielewski @ 2014-05-22 15:28 UTC (permalink / raw)
To: linux-btrfs
> I thought an important idea behind btrfs was that we avoid by design
> in the first place the very long and vulnerable RAID rebuild scenarios
> suffered for block-level RAID...
This may be true for SSDs - for spinning disks it's not entirely
the case.
For most RAID rebuilds, it still seems way faster with software RAID-1
where one drive is being read at its (almost) full speed, and the other
is being written to at its (almost) full speed (assuming no other IO
load).
With btrfs RAID-1, the way the balance works after a disk replace
involves lots of disk head movement, resulting in a low overall rebuild
speed, especially with lots of snapshots and the related fragmentation.
The balance is also still not smart: it reads from one device and
writes to *both* devices (an unnecessary extra write to the healthy
device, when it should read from the healthy device and write only to
the replacement).
Of course, other factors such as the amount of data or disk IO usage
during rebuild apply.
--
Tomasz Chmielewski
http://wpkg.org
^ permalink raw reply [flat|nested] 18+ messages in thread
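The gap Tomasz describes can be put into rough numbers: a sequential md-style RAID-1 rebuild is throughput-bound, while a fragmented, seek-heavy rebuild pays a head movement per extent. Illustrative arithmetic only — the throughput, extent-size, and seek-time figures below are assumptions, not measurements:

```python
# Back-of-envelope rebuild-time model: sequential copy vs seek-per-extent.
# All performance figures are assumed round numbers for illustration.
def rebuild_hours(used_bytes, seq_mb_s=120.0, avg_extent_kb=256.0,
                  seek_ms=12.0, seeky=False):
    """Hours to rebuild: pure streaming, or streaming plus one seek/extent."""
    seconds = used_bytes / (seq_mb_s * 1e6)        # transfer time
    if seeky:
        extents = used_bytes / (avg_extent_kb * 1024)
        seconds += extents * (seek_ms / 1000.0)    # head movement dominates
    return seconds / 3600.0

TB = 1e12
print("sequential RAID-1 rebuild: %.1f h" % rebuild_hours(2 * TB))
print("seek-heavy rebuild:        %.1f h" % rebuild_hours(2 * TB, seeky=True))
```

With these assumed figures the seek term is several times the transfer term, which matches the qualitative complaint: fragmentation, not bandwidth, sets the rebuild time on spinning disks (and explains why the problem largely disappears on SSDs).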
end of thread, other threads:[~2014-05-23  8:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-16  3:07 ditto blocks on ZFS Russell Coker
2014-05-17 12:50 ` Martin
2014-05-17 14:24 ` Hugo Mills
2014-05-18 16:09 ` Russell Coker
2014-05-19 20:36 ` Martin
2014-05-19 21:47 ` Brendan Hide
2014-05-20  2:07 ` Russell Coker
2014-05-20 14:07 ` Austin S Hemmelgarn
2014-05-20 20:11 ` Brendan Hide
2014-05-20 14:56 ` ashford
2014-05-21  2:51 ` Russell Coker
2014-05-21 23:05 ` Martin
2014-05-22 11:10 ` Austin S Hemmelgarn
2014-05-22 22:09 ` ashford
2014-05-23  3:54 ` Russell Coker
2014-05-23  8:03 ` Duncan
2014-05-21 23:29 ` Konstantinos Skarlatos
2014-05-22 15:28 Tomasz Chmielewski