From: Russell Coker <russell@coker.com.au>
To: ashford@whisperpc.com
Cc: linux-btrfs@vger.kernel.org
Subject: Re: ditto blocks on ZFS
Date: Wed, 21 May 2014 12:51:23 +1000
Message-ID: <1795587.Ol58oREtZ7@xev>
In-Reply-To: <57f050e2a37907d810b40c5e115b28ff.squirrel@webmail.wanet.net>
On Tue, 20 May 2014 07:56:41 ashford@whisperpc.com wrote:
> 1. There will be more disk space used by the metadata. I've been aware
> of space allocation issues in BTRFS for more than three years. If the use
> of ditto blocks will make this issue worse, then it's probably not a good
> idea to implement it. The actual increase in metadata space is probably
> small in most circumstances.
Data, RAID1: total=2.51TB, used=2.50TB
System, RAID1: total=32.00MB, used=376.00KB
Metadata, RAID1: total=28.25GB, used=26.63GB
The above is my home RAID-1 array. It includes multiple backup copies of a
medium-sized Maildir format mail spool, which probably accounts for a
significant portion of the used space; the Maildir spool has an average file
size of about 70K and lots of hard links between different versions of the
backup. Even so, the metadata is only 1% of the total used space. Going from
1% to 2% to improve reliability really isn't a problem.
Data, RAID1: total=140.00GB, used=139.60GB
System, RAID1: total=32.00MB, used=28.00KB
Metadata, RAID1: total=4.00GB, used=2.97GB
Above is a small Xen server which uses snapshots to back up the files for Xen
block devices (the system is lightly loaded, so I don't use nocow) and for data
files that include a small Maildir spool. It's still only 2% of disk space
used for metadata; again, going from 2% to 4% isn't going to be a great
problem.
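As a sanity check on the 1% and 2% figures above, here is the arithmetic,
using the "used" values from the two outputs (a quick sketch, nothing more):

```python
# Metadata as a share of total used space, from the two arrays above.
# Home RAID-1 array: 2.50TB data used, 26.63GB metadata used.
home_pct = 26.63 / (2.50 * 1024 + 26.63) * 100
# Xen server: 139.60GB data used, 2.97GB metadata used.
xen_pct = 2.97 / (139.60 + 2.97) * 100
print(f"home: {home_pct:.2f}%  xen: {xen_pct:.2f}%")
# -> home: 1.03%  xen: 2.08%
```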
> 2. Use of ditto blocks will increase write bandwidth to the disk. This
> is a direct and unavoidable result of having more copies of the metadata.
> The actual impact of this would depend on the file-system usage pattern,
> but would probably be unnoticeable in most circumstances. Does anyone
> have a worst-case scenario for testing?
The ZFS design spaces ditto blocks apart because corruption tends to have
some spatial locality. So you are adding an extra seek.
The worst case would be lots of small synchronous writes; the default
configuration of Maildir delivery would probably be a good test case.
As an aside, I've been thinking of patching a mail server to sleep()
before fsync() on mail delivery to see if that improves aggregate performance.
My theory is that if you have dozens of concurrent delivery attempts and
they all sleep() before fsync(), the filesystem could write out metadata
for multiple files in one pass in the most efficient manner.
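The idea can be sketched like this (a minimal illustration, not real
mail-server code; the function name and delay value are made up for the
example):

```python
import os
import time

def deliver(path: str, data: bytes, delay: float = 0.05) -> None:
    """Write a message, then sleep briefly before fsync() so that
    concurrent deliveries pile up and the filesystem can flush the
    metadata for many new files in one efficient pass."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        time.sleep(delay)      # the proposed sleep() before fsync()
        os.fsync(f.fileno())   # durability point for the delivery
```

Whether this helps depends on how well the filesystem batches the resulting
metadata updates; with dozens of concurrent deliveries the sleeps overlap,
so the per-message latency cost is bounded by the delay.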
> 3. Certain kinds of disk errors would be easier to recover from. Some
> people here claim that those specific errors are rare.
All errors are rare. :-#
Seriously, you can run Ext4 on a single disk for years and probably not lose
data. It's just a matter of how many disks and how much reliability you want.
> I have no opinion
> on how often they happen, but I believe that if the overall disk space
> cost is low, it will have a reasonable return. There would be virtually
> no reliability gains on an SSD-based file-system, as the ditto blocks
> would be written at the same time, and the SSD would be likely to map the
> logical blocks into the same page of flash memory.
That claim is unproven AFAIK. On SSD the performance cost of such things is
negligible (no seek cost), and losing 1% of disk space isn't a problem for
most systems (admittedly the early SSDs were small).
> 4. If the BIO layer of BTRFS and the device driver are smart enough,
> ditto blocks could reduce I/O wait time. This is a direct result of
> having more instances of the data on the disk, so it's likely that there
> will be a ditto block closer to where the disk head is currently. The
> actual benefit for disk-based file-systems is likely to be under 1ms per
> metadata seek. It's possible that a short-term backlog on one disk could
> cause BTRFS to use a ditto block on another disk, which could deliver
> >20ms of performance. There would be no performance benefit for SSD-based
> file-systems.
That is likely with RAID-5 and RAID-10.
> My experience is that once your disks are larger than about 500-750GB,
> RAID-6 becomes a much better choice, due to the increased chances of
> having an uncorrectable read error during a reconstruct. My opinion is
> that anyone storing critical information in RAID-5, or even 2-disk RAID-1,
> with disks of this capacity, should either reconsider their storage
> topology, or verify that they have a good backup/restore mechanism in
> place for that data.
http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
The NetApp research shows that the incidence of silent corruption is a lot
greater than you would expect. RAID-6 doesn't save you from this; you need
the checksums of BTRFS or ZFS RAID-6.
On Tue, 20 May 2014 22:11:16 Brendan Hide wrote:
> Extra replication on leaf nodes will make relatively little difference
> in the scenarios laid out in this thread - but on "trunk" nodes (folders
> or subvolumes closer to the filesystem root) it makes a significant
> difference. "Plain" N-way replication doesn't flexibly treat these two
> nodes differently.
>
> As an example, Russell might have a server with two disks - yet he wants
> 6 copies of all metadata for subvolumes and their immediate subfolders.
> At three folders deep he "only" wants to have 4 copies. At six folders
> deep, only 2. Ditto blocks add an attractive safety net without
> unnecessarily doubling or tripling the size of *all* metadata.
Firstly, I don't think that doubling all metadata is a real problem.
Next, the amount of duplicate metadata can't be determined by depth. For
example, I have a mail server where an outage of the entire server is
preferable to losing email. I would set more ditto blocks for /mail than for
the root subvolume. In that case I'd want the metadata for the root directory
to have the same replication as /mail, but nothing special for /home etc.
Hypothetically, if metadata duplication consumed any significant disk space
then I'd probably want to enable it only on /lib*, /sbin, /etc, and whatever
data the server is designed to hold. But really it's small enough to just
duplicate everything.
Currently I run only two systems for which I can't more than double the disk
space at a moderate cost. One is my EeePC 701 and the other is a ZFS archive
server (which already has ditto blocks). For all the other systems there is
no shortage of space at all. Disks just keep getting bigger and cheaper; for
most of my uses disk capacity increases faster than the data stored on it.
Currently the smallest SATA disk I can buy new is 500G. The smallest SSD is
60G for $63 but I can get 120G for $82, 240G for $149, or 480G for $295. All
the workstations I run use a lot less than 120G of storage. Storage capacity
isn't an issue for most users.
It seems to me that the only time an extra 1% of disk space usage would
really matter is when you have an array of 100 disks that's almost full. But
that's exactly when you REALLY want extra duplication of metadata.
> It is a good idea. The next question to me is whether or not it is
> something that can be implemented elegantly and whether or not a
> talented *dev* thinks it is a good idea.
Absolutely. Hopefully this discussion will inspire the developers to treat
this as an interesting technical challenge and a feature that is needed to
beat ZFS.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/